Lightstep incident

Ingestion and web UI outage

Major · Resolved

Lightstep experienced a major incident on July 21, 2022, lasting 56m and affecting US - Trace Assembly, PagerDuty Events API, and eight other components listed below. The incident has been resolved; the full update timeline is below.

Started: Jul 21, 2022, 08:05 PM UTC
Resolved: Jul 21, 2022, 09:02 PM UTC
Duration: 56m
Detected by Pingoru: Jul 21, 2022, 08:05 PM UTC

Affected components

* US - Trace Assembly
* PagerDuty Events API
* US - Service Directory
* US - Unified Alerting
* US - Dashboards
* US - Streams Alerting
* US - Trace Statistics
* Slack Apps/Integrations
* US - Notebooks
* US - Metrics

Update timeline

  1. investigating Jul 21, 2022, 07:35 PM UTC

    We are currently investigating this issue.

  2. investigating Jul 21, 2022, 07:35 PM UTC

    We are continuing to investigate this issue.

  3. investigating Jul 21, 2022, 07:48 PM UTC

    We are still actively investigating this issue, and will provide another update shortly.

  4. investigating Jul 21, 2022, 08:05 PM UTC

    We have now identified the issue and are actively implementing mitigations.

  5. identified Jul 21, 2022, 08:20 PM UTC

    We have mitigated issues with metric ingestion. Trace ingestion, assembly and alerting are still experiencing degraded performance for some customers.

  6. monitoring Jul 21, 2022, 08:30 PM UTC

    We have resolved issues with trace ingestion, assembly and alerting and believe the issue has been resolved, but are actively monitoring.

  7. resolved Jul 21, 2022, 09:02 PM UTC

    All components are operational.

  8. postmortem Jul 29, 2022, 12:21 AM UTC

    ### Summary

    Lightstep UI and Data Ingest experienced an outage on July 21st, triggered by a planned database change. The change caused a service failure, which led to cascading failures in dependent services. The service was restored in a degraded mode, which recovered the Lightstep UI. A subsequent rollback of the database change restored all systems and services.

    ### Timeline

    Times are local, 7 hours behind the UTC timestamps in the status updates above (12:35 PM here corresponds to 07:35 PM UTC).

    * 12:22 PM: A database change was run, causing a service to become unavailable.
    * 12:25 PM: Cascading failures impact Lightstep UI and ingestion. Ingested data loss begins.
    * 12:35 PM: Incident declared on the status page.
    * 12:49 PM: Root cause identified, service brought back in a degraded mode, recovery begins.
    * 1:01 PM: Several systems stabilize, Lightstep UI recovers.
    * 1:17 PM: Database change rolled back. Remaining affected services begin recovering.
    * 1:30 PM: Ingestion recovers. Incident resolved, status page updated to “mitigated and monitoring”.
    * 2:02 PM: Status page updated to fully operational.

    ### Action Items

    * Adding automated checks on code changes that include database changes, to help prevent a recurrence of the original root cause.
    * Enabling the directly affected service to continue to serve traffic in a degraded state, rather than failing entirely.
    * Making architectural changes to Lightstep’s Data Ingest path to prevent data loss in these situations by continuing to accept and buffer incoming traffic.
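
The postmortem does not say how the "accept and buffer" action item would be built. As an illustration only, the general idea can be sketched as a bounded in-memory buffer sitting in front of a downstream store. Everything named here (`BufferedIngest`, `send`, `max_buffered`, the use of `ConnectionError` to signal an unavailable store) is a hypothetical assumption for this sketch, not Lightstep's design or API.

```python
from collections import deque
from typing import Callable, Deque


class BufferedIngest:
    """Hypothetical ingest front end that keeps accepting records
    while the downstream store is unavailable."""

    def __init__(self, send: Callable[[dict], None], max_buffered: int = 10_000) -> None:
        # `send` forwards one record downstream and is assumed to raise
        # ConnectionError when the store cannot be reached.
        self._send = send
        # Bounded deque: once full, appending evicts the oldest record.
        self._buffer: Deque[dict] = deque(maxlen=max_buffered)
        self.dropped = 0

    def accept(self, record: dict) -> None:
        """Always accept the record, then try to drain the backlog in order."""
        if len(self._buffer) == self._buffer.maxlen:
            self.dropped += 1  # the append below evicts the oldest record
        self._buffer.append(record)
        self.flush()

    def flush(self) -> int:
        """Forward buffered records until one fails; return how many were sent."""
        sent = 0
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                break  # downstream still unavailable; keep the rest buffered
            self._buffer.popleft()
            sent += 1
        return sent

    @property
    def pending(self) -> int:
        """Number of records waiting to be forwarded."""
        return len(self._buffer)
```

In this sketch, data is lost only when the outage outlasts the buffer capacity, and `dropped` makes that loss visible. A production design would more likely buffer to durable storage or a message queue than to process memory.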