Scout APM experienced a major incident on May 31, 2020 affecting Application Monitoring, lasting 1h 58m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating May 31, 2020, 03:46 PM UTC
We are investigating an issue with the ingestion of agent data.
- monitoring May 31, 2020, 04:22 PM UTC
We have restarted several of the Kafka servers in our ingestion pipeline, and ingestion appears to be recovering. Data should begin appearing on your dashboard again.
- resolved May 31, 2020, 05:44 PM UTC
Ingestion for all customers has been operating normally since 10:20 AM MT (4:20 PM UTC). Some customers will have partial or no data from 8:50 AM to 10:20 AM MT (2:50 PM to 4:20 PM UTC). We will follow up with more information about the cause of the outage.
- postmortem Jun 01, 2020, 09:40 PM UTC
On May 31, 2020, we experienced a brief network outage that prevented our ZooKeeper and Kafka nodes from reaching each other. When connectivity was restored, stale ZooKeeper data prevented the Kafka brokers from initiating a proper leader election for topic partitions. As a result, Kafka producers were unable to produce to a majority of partitions. We attempted a manual leader election, but it failed to correct the issue. We then began a rolling restart of our entire Kafka cluster, which ultimately resolved the issue. Later versions of Kafka handle this particular failure better, and we anticipate upgrading to a recent version to avoid entering this failure mode again.
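For readers who want to watch for the same symptom, the following is a minimal sketch, not the tooling we used during the incident. It assumes partition metadata has already been fetched from the cluster and flattened into plain dictionaries; the field names, the helper functions, and the `agent-data` topic are all hypothetical. It relies on the fact that Kafka reports a leader id of -1 for a partition that currently has no leader.

```python
from collections import defaultdict

# Kafka metadata reports a leader id of -1 when a partition has no leader.
NO_LEADER = -1


def leaderless_partitions(metadata):
    """Group partitions that have no leader by topic.

    `metadata` is a list of dicts with hypothetical keys
    "topic", "partition", and "leader".
    """
    stuck = defaultdict(list)
    for entry in metadata:
        if entry["leader"] == NO_LEADER:
            stuck[entry["topic"]].append(entry["partition"])
    return {topic: sorted(parts) for topic, parts in stuck.items()}


def producible_fraction(metadata):
    """Return the share of partitions that producers can still write to."""
    if not metadata:
        return 1.0
    healthy = sum(1 for entry in metadata if entry["leader"] != NO_LEADER)
    return healthy / len(metadata)


# Illustrative snapshot only: topic name and broker ids are made up.
sample = [
    {"topic": "agent-data", "partition": 0, "leader": 1},
    {"topic": "agent-data", "partition": 1, "leader": NO_LEADER},
    {"topic": "agent-data", "partition": 2, "leader": NO_LEADER},
    {"topic": "agent-data", "partition": 3, "leader": NO_LEADER},
]

stuck = leaderless_partitions(sample)
fraction = producible_fraction(sample)
print(stuck)
print(fraction)
```

An alert on the producible fraction dropping below a chosen threshold would flag this kind of stalled leader election shortly after connectivity returns, rather than when ingestion gaps become visible on dashboards.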