GraphCDN incident

Issues with Purging API

Major · Resolved

GraphCDN experienced a major incident on July 22, 2024 affecting Purging API, lasting 2h 27m. The incident has been resolved; the full update timeline is below.

Started
Jul 22, 2024, 05:40 AM UTC
Resolved
Jul 22, 2024, 08:07 AM UTC
Duration
2h 27m
Detected by Pingoru
Jul 22, 2024, 05:40 AM UTC

Affected components

Purging API

Update timeline

  1. investigating Jul 22, 2024, 05:40 AM UTC

    We are currently looking into an issue with the Purging API.

  2. identified Jul 22, 2024, 07:32 AM UTC

    The team has identified the issue and is currently implementing a fix.

  3. monitoring Jul 22, 2024, 07:38 AM UTC

    A fix has been implemented and the Purging API is working as expected again. We are monitoring all systems to make sure they are working as expected.

  4. monitoring Jul 22, 2024, 07:45 AM UTC

    We are continuing to monitor for any further issues.

  5. resolved Jul 22, 2024, 08:07 AM UTC

    This incident has been resolved.

  6. postmortem Jul 23, 2024, 06:36 PM UTC

    During a routine employee offboarding, we revoked that employee’s access to Fastly. Revoking their access also revoked every access token that engineer had created. Unfortunately, this included the central API token that all our systems use to communicate with the Fastly API.

    This had two immediate impacts:

    1. Purging started failing silently: Stellate’s purging API kept returning successful responses even though data was not being evicted from the cache.
    2. Service configuration updates started failing silently: updates appeared to persist even though they were not applied in the CDN.

    As part of the incident response, we switched the central Fastly API token to a new token owned by a shared engineering account. Going forward, we will work on better visibility and alerting for failure conditions in the purging API, and we will audit all tokens in use by our services to ensure that none are owned by individual engineers.
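
The silent purge failure described in the postmortem can be illustrated with a minimal sketch of a purge wrapper that refuses to report success unless the upstream CDN confirms it. This is not Stellate's implementation. `PurgeError`, `UpstreamResponse`, `purge`, and the injected `send_purge` callable are all hypothetical names, and the sketch assumes the upstream signals a revoked token with an HTTP 401 or 403 status.

```python
from dataclasses import dataclass
from typing import Callable


class PurgeError(Exception):
    """Raised when the upstream CDN does not confirm a purge."""


@dataclass
class UpstreamResponse:
    # Hypothetical stand-in for whatever the real HTTP client returns.
    status_code: int
    body: str = ""


def purge(key: str, send_purge: Callable[[str], UpstreamResponse]) -> None:
    """Purge `key` and raise instead of reporting success on any upstream failure.

    `send_purge` is injected so the sketch stays independent of any real
    CDN client; it is assumed to return the upstream HTTP status.
    """
    response = send_purge(key)

    # Assumption: a revoked or invalid token surfaces as 401/403.
    if response.status_code in (401, 403):
        raise PurgeError(
            f"upstream rejected credentials while purging {key!r} "
            f"(HTTP {response.status_code})"
        )

    # Any other non-2xx status is also a failure that must be surfaced.
    if not 200 <= response.status_code < 300:
        raise PurgeError(
            f"upstream purge of {key!r} failed (HTTP {response.status_code})"
        )
```

A caller would translate `PurgeError` into an error response and an alert, so a revoked credential shows up immediately rather than as stale cache entries discovered later.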