Lightstep incident

Lightstep Webapp Down

Critical · Resolved

Lightstep experienced a critical incident on January 20, 2022 affecting four components (US - Service Directory, US - Dashboards, US - Change Intelligence, and US - Explorer) and lasting 3h 23m. The incident has been resolved; the full update timeline is below.

Started
Jan 20, 2022, 05:58 PM UTC
Resolved
Jan 20, 2022, 09:21 PM UTC
Duration
3h 23m
Detected by Pingoru
Jan 20, 2022, 05:58 PM UTC

Affected components

US - Service Directory
US - Dashboards
US - Change Intelligence
US - Explorer

Update timeline

  1. investigating Jan 20, 2022, 05:58 PM UTC

    We are currently investigating this issue.

  2. investigating Jan 20, 2022, 06:41 PM UTC

    All pages are operational except for the streams page. We are continuing to investigate.

  3. monitoring Jan 20, 2022, 08:08 PM UTC

    A fix for the remaining unavailability on the streams page is rolling out and is expected to complete by 1 PM PST (09:00 PM UTC).

  4. monitoring Jan 20, 2022, 09:21 PM UTC

    We are continuing to monitor for any further issues.

  5. resolved Jan 20, 2022, 09:21 PM UTC

    This incident has been resolved.

  6. postmortem Feb 01, 2022, 11:14 PM UTC

    ## **Summary**

    The web UI was inaccessible for 48 minutes (9:49 AM to 10:37 AM PT), and the streams page remained inaccessible for a further 48 minutes (until 11:25 AM PT).

    ## **Timeline (12-hour Pacific Time)**

    - 09:49 AM PT: 100% of API requests for the webapp begin failing
    - 09:52 AM PT: Database hits 100% resource utilization
    - 10:31 AM PT: Divert all traffic away from the webapp so the database can be brought back up
    - 10:37 AM PT: Allow all traffic except requests to the operation/get endpoint, so the webapp is back up
    - 11:25 AM PT: Allow operation/get traffic; incident is resolved

    ## **Root Cause**

    Database errors led to transient unavailability. Aggressive retries then saturated the database, creating a self-reinforcing feedback loop: failed requests were retried, and the retries caused more failures.

    ## **Action items**

    We have updated our database configuration to limit the blast radius of failing or slow database calls. We have also implemented additional rate limiting to break the feedback loop in which many concurrent requests lead to failures, which in turn lead to more requests.
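
The postmortem does not describe how the retry and rate-limiting changes were implemented. As a rough illustration of the general technique only, the sketch below shows two common ways to keep retries from saturating a struggling database: capped exponential backoff with full jitter, and shedding load once too many calls are in flight. Every name here (`call_with_backoff`, `LoadShedder`, `RetryBudgetExceeded`, `Overloaded`) and every default value is hypothetical and not taken from Lightstep's systems.

```python
import random
import threading
import time


class RetryBudgetExceeded(Exception):
    """Raised when a call still fails after the allowed number of attempts."""


class Overloaded(Exception):
    """Raised when a call is rejected because too many are already in flight."""


def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield the sleep before each retry: full jitter over a capped exponential."""
    for attempt in range(max_attempts - 1):
        yield rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(fn, max_attempts=4, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn, retrying failures a bounded number of times with growing delays.

    Assumption: any exception from fn is treated as retryable. A real client
    would retry only errors known to be transient.
    """
    last_error = None
    delays = backoff_delays(max_attempts, base, cap, rng)
    for _ in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
            delay = next(delays, None)
            if delay is None:  # no retries left
                break
            sleep(delay)
    raise RetryBudgetExceeded(f"gave up after {max_attempts} attempts") from last_error


class LoadShedder:
    """Reject work immediately once max_in_flight calls are already running."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, fn):
        # Fail fast instead of queueing, so a slow database is not piled on.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("too many concurrent database calls")
        try:
            return fn()
        finally:
            self._slots.release()
```

A caller might wrap a database query as `call_with_backoff(lambda: shedder.run(query))`, where `query` is any zero-argument function. The jitter spreads retries from many clients over time, and the bounded attempt count and in-flight cap put a ceiling on how much extra load a failure can generate.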