Lightstep incident

Ingestion and web UI outage

Lightstep experienced a major incident on July 21, 2022, affecting US - Trace Assembly, PagerDuty Events API, and other components, lasting 56m. The incident has been resolved; the full update timeline is below.

Started: Jul 21, 2022, 08:05 PM UTC
Resolved: Jul 21, 2022, 09:02 PM UTC
Duration: 56m
Detected by Pingoru: Jul 21, 2022, 08:05 PM UTC

Affected components

US - Trace Assembly
PagerDuty Events API
US - Service Directory
US - Unified Alerting
US - Dashboards
US - Streams Alerting
US - Trace Statistics
Slack Apps/Integrations
US - Notebooks
US - Metrics

Update timeline

investigating Jul 21, 2022, 07:35 PM UTC

We are currently investigating this issue.

investigating Jul 21, 2022, 07:35 PM UTC

We are continuing to investigate this issue.

investigating Jul 21, 2022, 07:48 PM UTC

We are still actively investigating this issue, and will provide another update shortly.

investigating Jul 21, 2022, 08:05 PM UTC

We have now identified the issue and are actively implementing mitigations.

identified Jul 21, 2022, 08:20 PM UTC

We have mitigated issues with metric ingestion. Trace ingestion, assembly and alerting are still experiencing degraded performance for some customers.

monitoring Jul 21, 2022, 08:30 PM UTC

We have resolved issues with trace ingestion, assembly and alerting and believe the issue has been resolved, but are actively monitoring.

resolved Jul 21, 2022, 09:02 PM UTC

All components are operational.

postmortem Jul 29, 2022, 12:21 AM UTC

### Summary

Lightstep UI and Data Ingest experienced an outage triggered by a planned database change on July 21st. This database change caused a service failure, which led to cascading failures in dependent services. Action was taken to restore the service in a degraded mode, which recovered the Lightstep UI. A subsequent rollback of the database change restored all systems and services.

### Timeline

* 12:22 PM: A database change was run, causing a service to become unavailable.
* 12:25 PM: Cascading failures impact Lightstep UI and ingestion. Ingested data loss begins.
* 12:35 PM: Incident declared on the status page.
* 12:49 PM: Root cause identified, service brought back in a degraded mode, recovery begins.
* 1:01 PM: Several systems stabilize, Lightstep UI recovers.
* 1:17 PM: Database change rolled back. Remaining affected services begin recovering.
* 1:30 PM: Ingestion recovers. Incident resolved, status page updated to “mitigated and monitoring”.
* 2:02 PM: Status page updated to fully operational.

### Action Items

* Adding automated checks on code changes that include database changes, to help prevent the original root cause.
* Enabling the directly affected service to continue serving traffic in a degraded state, rather than failing entirely.
* Making architectural changes to Lightstep’s Data Ingest path to prevent data loss in these situations by continuing to accept and buffer incoming traffic.
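The last action item, continuing to accept and buffer incoming traffic while a downstream dependency is unavailable, can be illustrated with a minimal Python sketch. The postmortem does not describe Lightstep's actual design, so everything here is an assumption made for illustration: the `BufferingIngest` class, the `send` callback, the `max_buffered` limit, the drop-oldest policy, and the use of `ConnectionError` to signal an unavailable downstream are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque


class BufferingIngest:
    """Hypothetical ingest front end that keeps accepting records
    while the downstream service is unavailable."""

    def __init__(self, send: Callable[[bytes], None], max_buffered: int = 10_000) -> None:
        self._send = send                    # assumed downstream write; raises ConnectionError when down
        self._buffer: Deque[bytes] = deque() # records waiting to be forwarded, oldest first
        self._max_buffered = max_buffered    # bound on memory use during a long outage
        self.dropped = 0                     # records discarded because the buffer was full

    def accept(self, record: bytes) -> None:
        """Always accept the record, then try to forward whatever is pending."""
        self._buffer.append(record)
        if len(self._buffer) > self._max_buffered:
            # Buffer is full: discard the oldest record and count the loss.
            self._buffer.popleft()
            self.dropped += 1
        self.drain()

    def drain(self) -> int:
        """Forward buffered records in order; stop at the first failure.

        Returns the number of records forwarded by this call."""
        sent = 0
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                break                        # downstream still unavailable; keep the rest buffered
            self._buffer.popleft()           # remove only after a successful send
            sent += 1
        return sent

    @property
    def pending(self) -> int:
        """Number of records currently buffered."""
        return len(self._buffer)
```

A bounded in-memory buffer limits data loss only for outages short enough to fit within `max_buffered`. A production ingest path would more likely buffer to durable storage, which this sketch does not attempt.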