Scout APM incident

Ingestion Issue

Scout APM experienced a major incident on May 31, 2020, affecting Application Monitoring and lasting 1h 58m. The incident has been resolved; the full update timeline is below.

Started: May 31, 2020, 03:46 PM UTC
Resolved: May 31, 2020, 05:44 PM UTC
Duration: 1h 58m
Detected by Pingoru: May 31, 2020, 03:46 PM UTC

Affected components

Application Monitoring

Update timeline

investigating May 31, 2020, 03:46 PM UTC

We are investigating an issue in ingestion of agent data.

monitoring May 31, 2020, 04:22 PM UTC

We have restarted several of the Kafka servers in our ingestion pipeline, and ingestion appears to be recovering. Data should begin appearing on your dashboard again.

resolved May 31, 2020, 05:44 PM UTC

Ingestion for all customers has been operating normally since 10:20 AM MT. Some customers will have partial or no data from 8:50 AM to 10:20 AM MT. We will follow up with more information about the cause of the outage.

postmortem Jun 01, 2020, 09:40 PM UTC

On May 31, 2020, we experienced a short network outage that prevented our ZooKeeper and Kafka nodes from reaching each other. When connectivity was restored, stale ZooKeeper data prevented the Kafka brokers from initiating a proper leader election for topic partitions. This also prevented Kafka producers from producing to a majority of partitions. We attempted a manual leader election, but it failed to correct the issue. We then began a rolling restart of our entire Kafka cluster, which ultimately resolved the issue. Later versions of Kafka handle this particular failure better, and we anticipate moving to a recent version to avoid entering this failure mode again.
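
To illustrate the failure mode described above, here is a minimal sketch of how leaderless partitions could be counted from a snapshot of partition metadata. It is not Scout's tooling: the function names, the `agent-data` topic, and the snapshot values are hypothetical. The only Kafka-specific detail it relies on is that cluster metadata reports a leader ID of -1 for a partition that has no leader.

```python
from typing import Dict, List, Tuple

# Kafka metadata reports -1 as the leader ID when a partition has no leader.
NO_LEADER = -1

PartitionKey = Tuple[str, int]  # (topic, partition number)


def leaderless_partitions(leaders: Dict[PartitionKey, int]) -> List[PartitionKey]:
    """Return the (topic, partition) pairs that currently have no leader."""
    return sorted(tp for tp, leader in leaders.items() if leader == NO_LEADER)


def majority_unavailable(leaders: Dict[PartitionKey, int]) -> bool:
    """Return True when more than half of the partitions have no leader."""
    if not leaders:
        return False
    return len(leaderless_partitions(leaders)) * 2 > len(leaders)


# Hypothetical snapshot: the topic name and broker IDs are made up.
snapshot = {
    ("agent-data", 0): 1,
    ("agent-data", 1): NO_LEADER,
    ("agent-data", 2): NO_LEADER,
    ("agent-data", 3): NO_LEADER,
}

if __name__ == "__main__":
    print("Leaderless partitions:", leaderless_partitions(snapshot))
    print("Majority unavailable:", majority_unavailable(snapshot))
```

In practice the snapshot would come from a Kafka admin client or the cluster's own tooling. A check like this could help distinguish a partial leadership loss from the cluster-wide condition described in the postmortem.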