Scout APM incident

Database Connection Issues

Major · Resolved

Scout APM experienced a major incident on June 3, 2019 affecting Application Monitoring, lasting 48m. The incident has been resolved; the full update timeline is below.

Started
Jun 03, 2019, 03:47 PM UTC
Resolved
Jun 03, 2019, 04:36 PM UTC
Duration
48m
Detected by Pingoru
Jun 03, 2019, 03:47 PM UTC

Affected components

Application Monitoring

Update timeline

  1. investigating Jun 03, 2019, 03:47 PM UTC

    We appear to be using more database connections than expected, which is causing failures in our Web UI. Ingestion is backed up, but incoming data is safe and is still being collected.

  2. investigating Jun 03, 2019, 04:03 PM UTC

    We've identified and fixed the database connection issue. We are currently loading the backlog of data that was held during the incident. Data will be appearing in the UI shortly.

  3. resolved Jun 03, 2019, 04:36 PM UTC

    All chart metrics are now completely caught up. The root cause of the incident was an attempted table partitioning during a database vacuum, which locked a critical table and cascaded to the rest of the application. We'll be adjusting our vacuum and partitioning schedules to avoid this lock in the future.