Lightstep experienced a critical incident on January 20, 2022, affecting US - Service Directory, US - Dashboards, and 1 more component, and lasting 3h 23m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Jan 20, 2022, 05:58 PM UTC
We are currently investigating this issue.
- investigating Jan 20, 2022, 06:41 PM UTC
All pages are operational except for the streams page. We are continuing to investigate.
- monitoring Jan 20, 2022, 08:08 PM UTC
A fix for the remaining unavailability on the streams page is rolling out, and is expected to complete by 1pm PST.
- monitoring Jan 20, 2022, 09:21 PM UTC
We are continuing to monitor for any further issues.
- resolved Jan 20, 2022, 09:21 PM UTC
This incident has been resolved.
- postmortem Feb 01, 2022, 11:14 PM UTC
## **Summary**
The web UI was inaccessible for 48 minutes (9:49 AM to 10:37 AM PT), and the streams page remained inaccessible for an additional 48 minutes (until 11:25 AM PT).

## **Timeline (12-hour Pacific Time)**
- 09:49 AM PT: 100% of API requests for the webapp begin failing
- 09:52 AM PT: Database hits 100% resource utilization
- 10:31 AM PT: All traffic is diverted away from the webapp to bring the database back up
- 10:37 AM PT: All traffic except the operation/get endpoint is allowed again, so the webapp is back up
- 11:25 AM PT: operation/get traffic is allowed again and the incident is resolved

## **Root Cause**
Database errors led to transient unavailability. Aggressive retries then saturated the database, creating a self-reinforcing feedback loop.

## **Action items**
We have updated our database configuration to limit the blast radius of failing or slow database calls. We have also implemented additional rate limiting to avoid the feedback loop in which many concurrent requests cause failures, which in turn trigger more requests.
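
The retry-driven feedback loop described in the root cause can be illustrated with a small sketch. This is not Lightstep's actual fix, whose details the postmortem does not give. It is a minimal Python example of one common mitigation, assuming a client that retries failed database calls: exponential backoff with jitter spreads retries out over time, and a retry budget stops retrying altogether once too many calls are failing. All names here (`RetryBudget`, `backoff_delay`, `call_with_retries`) and all default values are hypothetical.

```python
import random
import time


class RetryBudgetExceeded(Exception):
    """Raised when a retry is suppressed to protect the backend."""


class RetryBudget:
    """Hypothetical token bucket that caps retries.

    Each retry spends one token; each successful call earns back a
    fraction of a token, so retries stay a small share of total traffic.
    """

    def __init__(self, max_tokens=10.0, refill_per_success=0.1):
        self.max_tokens = max_tokens
        self.refill_per_success = refill_per_success
        self.tokens = max_tokens

    def record_success(self):
        self.tokens = min(self.max_tokens, self.tokens + self.refill_per_success)

    def try_spend(self):
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def backoff_delay(attempt, base=0.1, cap=5.0):
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(operation, budget, max_attempts=4, sleep=time.sleep):
    """Run operation(), retrying ConnectionError only while the budget allows."""
    for attempt in range(max_attempts):
        try:
            result = operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            if not budget.try_spend():
                # Too many recent failures: fail fast instead of adding load.
                raise RetryBudgetExceeded("retry budget exhausted")
            sleep(backoff_delay(attempt))
        else:
            budget.record_success()
            return result
```

With a shared `RetryBudget` per backend, a burst of failures drains the bucket quickly, after which callers fail fast rather than piling more requests onto a database that is already saturated.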