SolarWinds incident

AWS Outage Affecting App & Ingest

Critical · Resolved

SolarWinds experienced a critical incident on August 31, 2019, affecting Web Application, APIs, and 1 more component, lasting 2h 8m. The incident has been resolved; the full update timeline is below.

Started
Aug 31, 2019, 01:27 PM UTC
Resolved
Aug 31, 2019, 03:36 PM UTC
Duration
2h 8m
Detected by Pingoru
Aug 31, 2019, 01:27 PM UTC

Affected components

Web Application, APIs, Metrics Ingest Pipeline

Update timeline

  1. investigating Aug 31, 2019, 01:27 PM UTC

    We are currently investigating an AWS outage affecting all VividCortex services. https://downdetector.com/status/aws-amazon-web-services

  2. investigating Aug 31, 2019, 01:27 PM UTC

    We are continuing to investigate this issue.

  3. resolved Aug 31, 2019, 03:36 PM UTC

    Our ingest and application issues have cleared. We are working to restore data to the product from the outage window.

  4. postmortem Sep 04, 2019, 07:32 PM UTC

    At 8:43am EDT on 8/31/19, AWS suffered an outage in US-East-1a affecting all of their customers in that region, including VividCortex. Until 11:21am EDT, our product was accessible but not accepting new data. At 11:21am we restored our ingest pipeline so that new data was being processed in the product, but the system remained degraded for the next several hours as our services re-synced data away from our permanently failed instances.

    On 9/2 the re-sync completed and the restore process was started, which took some time to complete. After the restore process completed, it was apparent that most customers had a 2.5 hour gap in their data. Our team dove deeper and determined during our investigation that an agent misconfiguration (push) inhibited AWS communication with our failover system. Thus, all data during the 2.5 hour window is unrecoverable for customers. The issue that caused this problem has been fixed as of 9/3/19 and agents have been updated.

    Customer Impact: Likely 2.5 hours of missing metrics during the 8/31/19 AWS outage window.

    Corrective actions: VividCortex is conducting a full internal post-mortem, looking to extend high availability into multiple AWS regions, and re-examining its unit testing and QA process for configurations.