Lightstep incident

Failing to ingest some span data


Lightstep experienced a major incident on July 21, 2022 affecting US - Trace Assembly, lasting 1h 25m. The incident has been resolved; the full update timeline is below.

Started
Jul 21, 2022, 10:15 AM UTC
Resolved
Jul 21, 2022, 11:41 AM UTC
Duration
1h 25m
Detected by Pingoru
Jul 21, 2022, 10:15 AM UTC

Affected components

US - Trace Assembly

Update timeline

  1. investigating Jul 21, 2022, 10:44 AM UTC

    We have identified the issue and are monitoring a fix.

  2. monitoring Jul 21, 2022, 10:49 AM UTC

    A fix has been implemented and we are monitoring the results.

  3. resolved Jul 21, 2022, 11:41 AM UTC

    This incident has been resolved.

  4. postmortem Jul 29, 2022, 06:36 PM UTC

    ## Summary

    Increased span query load led to backpressure on span data writes. This backpressure led to resource exhaustion in one fleet of span ingestion pods, which caused a partial failure to ingest span data. The issue was resolved by scaling up the ingest pool.

    ## Timeline (Pacific Time)

    - 2:45AM: Expensive queries lead to a spike in CPU utilization for a database pool. This spike leads to load shedding for writes, causing span ingestion errors.
    - 3:14AM: Scaled up the CPU for the affected pool.
    - 3:33AM: Scaled up the span ingestion pods.
    - 4:15AM: All systems are healthy again.

    ## Action items

    We have compiled a list of action items, including prioritizing ingest over queries and improving retry handling, to more gracefully handle increased load.
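
The postmortem names improved retry handling as an action item but does not describe the design. As an illustration only, and not Lightstep's implementation, the sketch below shows one common approach: retrying a failed write with capped exponential backoff and full jitter, so that clients do not all retry at once against a pool that is already shedding load. The names `TransientIngestError` and `send_with_backoff`, and all default values, are hypothetical.

```python
import random
import time


class TransientIngestError(Exception):
    """Hypothetical error: the ingest endpoint shed load or timed out."""


def send_with_backoff(send, payload, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep, rng=random.random):
    """Call send(payload), retrying transient failures.

    Uses capped exponential backoff with full jitter. `sleep` and `rng`
    are injectable so the behavior can be tested without real waiting.
    """
    for attempt in range(max_attempts):
        try:
            return send(payload)
        except TransientIngestError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # Cap the exponential window, then pick a random point inside
            # it so that many clients do not retry in lockstep.
            window = min(max_delay, base_delay * (2 ** attempt))
            sleep(window * rng())
```

A caller would pass its own send function, for example `send_with_backoff(export_batch, batch)`, where `export_batch` is whatever function performs the write and raises `TransientIngestError` on a retryable failure. Capping the delay and bounding the attempt count keeps retries from adding sustained load of their own during an incident like this one.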