Scout APM incident

Ingestion Issue

Major Resolved

Scout APM experienced a major incident on May 31, 2020 affecting Application Monitoring, lasting 1h 58m. The incident has been resolved; the full update timeline is below.

Started
May 31, 2020, 03:46 PM UTC
Resolved
May 31, 2020, 05:44 PM UTC
Duration
1h 58m
Detected by Pingoru
May 31, 2020, 03:46 PM UTC

Affected components

Application Monitoring

Update timeline

  1. investigating May 31, 2020, 03:46 PM UTC

    We are investigating an issue in ingestion of agent data.

  2. monitoring May 31, 2020, 04:22 PM UTC

    We have restarted several of our Kafka servers in ingestion, and ingestion appears to be recovering. Data should begin appearing on your dashboard again.

  3. resolved May 31, 2020, 05:44 PM UTC

    Ingestion for all customers has been operating normally since 10:20 AM MT (16:20 UTC). Some customers will have partial or no data from 8:50 AM to 10:20 AM MT (14:50 to 16:20 UTC). We will follow up with more information about the cause of the outage.

  4. postmortem Jun 01, 2020, 09:40 PM UTC

    On 2020/05/31 we experienced a short network outage that prevented our ZooKeeper and Kafka nodes from reaching each other. When connectivity was restored, stale ZooKeeper data prevented the Kafka brokers from initiating a proper leader election for topic partitions. This also left Kafka producers unable to produce to a majority of partitions. A manual leader election was attempted but failed to correct the issue. We then began a rolling restart of our entire Kafka cluster, which ultimately resolved the issue. Later versions of Kafka handle this particular failure better, and we anticipate moving to a recent version to avoid entering this failure mode again.
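
The failure mode described in the postmortem (partitions left without an elected leader, so producers cannot write to them) can be illustrated with a minimal Python sketch. It is not Scout's tooling: the `PartitionState` type, the helper names, the topic name `agent-data`, and the sample snapshot are all hypothetical. The sketch assumes partition metadata has already been fetched, and that a partition without a leader is reported with leader id -1, as in Kafka's metadata responses.

```python
from dataclasses import dataclass
from typing import List

# Assumption: a partition with no elected leader is reported with leader id -1.
NO_LEADER = -1


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical snapshot of one topic partition's leadership."""
    topic: str
    partition: int
    leader: int  # broker id of the current leader, or NO_LEADER


def leaderless(partitions: List[PartitionState]) -> List[PartitionState]:
    """Return the partitions that currently have no leader."""
    return [p for p in partitions if p.leader == NO_LEADER]


def majority_unavailable(partitions: List[PartitionState]) -> bool:
    """True when more than half of the partitions cannot accept writes."""
    if not partitions:
        return False
    return len(leaderless(partitions)) * 2 > len(partitions)


# Made-up snapshot for illustration only; not data from the incident.
snapshot = [
    PartitionState("agent-data", 0, NO_LEADER),
    PartitionState("agent-data", 1, 2),
    PartitionState("agent-data", 2, NO_LEADER),
    PartitionState("agent-data", 3, NO_LEADER),
]

stuck = leaderless(snapshot)
print(f"{len(stuck)} of {len(snapshot)} partitions have no leader")
print("majority unavailable:", majority_unavailable(snapshot))
```

A check along these lines, run against real cluster metadata, would separate a partial degradation from the situation described above, where most partitions could not accept writes until the brokers were restarted.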