GraphCDN incident

Issues with configuration updates propagating

GraphCDN experienced a major incident on August 18, 2023 affecting GraphQL Edge Caching and GraphQL Rate Limiting, lasting 5h 1m. The incident has been resolved; the full update timeline is below.

Started: Aug 18, 2023, 07:15 AM UTC
Resolved: Aug 18, 2023, 12:17 PM UTC
Duration: 5h 1m
Detected by Pingoru: Aug 18, 2023, 07:15 AM UTC

Affected components

GraphQL Edge Caching
GraphQL Rate Limiting

Update timeline

investigating Aug 17, 2023, 11:55 AM UTC

We are investigating an issue with configuration updates propagating to the respective services. If you didn't make configuration changes recently, your services are not impacted by this incident.
investigating Aug 17, 2023, 01:56 PM UTC

We are continuing to investigate this issue together with our infrastructure providers. If you haven't made configuration changes to your service recently, you are not affected by this issue.
identified Aug 17, 2023, 04:03 PM UTC

Our infrastructure partner has identified the issue and is working on fixing it.
identified Aug 18, 2023, 07:15 AM UTC

The incident with KV stores, which are used for service configuration, is now spreading to additional edge locations and affecting overall service availability for services on the new infrastructure. We have disabled the new infrastructure to provide our partner more time to identify and resolve the issue on their end.
identified Aug 18, 2023, 07:29 AM UTC

We are continuing to work on a fix for this issue.
identified Aug 18, 2023, 09:53 AM UTC

We continue working with Fastly to resolve this issue. Please see https://www.fastlystatus.com/incident/376022 for updates from their team as well.
monitoring Aug 18, 2023, 10:33 AM UTC

Fastly has implemented a fix for the issue, and all services are working as expected again. We have temporarily disabled switching over to the new infrastructure and are working with Fastly to better understand what happened on their end, why it took so long to identify and rectify, and how we can better monitor for and prevent this in the future. We will enable the new infrastructure again once we are confident in the services we rely on.
resolved Aug 18, 2023, 12:17 PM UTC

This issue has been resolved. We have temporarily switched all services back to our "old infrastructure" and are running additional tests as well as working with Fastly before we reopen the "new infrastructure". We will also publish additional details once we conclude our internal post mortem process.
postmortem Sep 11, 2023, 10:52 AM UTC

* Stellate relies on Fastly infrastructure for our offerings.
* Fastly experienced a partial outage of their KV Store offering on August 17th and August 18th, which affected Stellate. They provide a summary of this incident on their status page at [https://www.fastlystatus.com/incident/376022](https://www.fastlystatus.com/incident/376022).

## Timeline

* August 17th 10:46 UTC - A customer reported their Stellate endpoint failing in the FRA (Frankfurt) point of presence (POP), as well as in several other edge locations. This was due to them pushing an update to their configuration, specifically the `originUrl`.
* 10:50 - We identified the issue as being a stale KV value in the FRA POP, as well as several others.
* 10:55 - We created an incident on our status page for degraded KV in the FRA POP and several others.
* 13:08 - We realized that Rate Limiting and Developer Portals were affected by this outage as well.
* 13:30 - We reported this incident to Fastly.
* August 18th 4:00 UTC - Fastly was not yet able to provide us with a satisfactory response on what was causing this and didn’t acknowledge the ongoing outage.
* 6:23 - A large e-commerce customer reported their website was unavailable. This was due to a KV key disappearing in the FRA POP, as well as several others.
* 7:09 - Additional reports started to come in via Intercom about services not responding properly.
* 7:15 - We escalated the incident with Fastly because, from our view, more regions appeared to be affected and were becoming unavailable.
* 7:16 - We deployed a partial fix that disabled our new infrastructure. This fixed edge caching for users who didn’t recently push configuration changes (the majority of services). Rate Limiting, JWT-based scopes, and the Developer Portal were still affected by the KV outage.
* 8:01 - Fastly was able to reproduce the bug based on a reproduction that we provided earlier and started working on a fix.
* 9:02 - Fastly opened an [official incident](https://www.fastlystatus.com/incident/376022) on their status page.
* 10:04 - Fastly marked the incident as resolved.
* 10:19 - Fastly communicated to us that the cause was an issue with surrogate keys in their C@E caching layer.
* August 22nd - Fastly shared their confidential Fastly Service Advisory with us, providing additional information about this incident and how they want to prevent this from happening again.

## Next Steps

* We have had several calls with Fastly over the last couple of days, working with them to analyze what went wrong, why it took them so long to escalate this internally, and how we can improve communication and collaboration going forward.
* As a direct outcome of this, we have re-connected with our European contacts at Fastly and designated a direct contact to involve in conversations and escalations going forward.
* We are going to investigate a fallback option for Fastly KV.
* Additionally, we will review all possible failure points that could make Stellate core services inaccessible (in the event of a third-party outage) and investigate options for additional redundancies for those services.
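
The Next Steps above mention investigating a fallback option for Fastly KV. As an illustration only, the general idea of a configuration read that falls back to a second store might look like the sketch below. It is not Stellate's implementation and does not use any Fastly API: `read_config`, `ConfigResult`, `StoreUnavailable`, and the in-memory dictionaries are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


class StoreUnavailable(Exception):
    """Hypothetical error a store raises when it cannot answer at all."""


@dataclass
class ConfigResult:
    value: Optional[str]
    source: str  # "primary", "fallback", or "missing"


def read_config(
    key: str,
    primary: Callable[[str], Optional[str]],
    fallback: Callable[[str], Optional[str]],
) -> ConfigResult:
    """Read a config value, preferring the primary store.

    Both an unavailable primary and a missing key (None) fall through to
    the fallback, since the incident involved keys disappearing as well as
    lookups failing.
    """
    try:
        value = primary(key)
    except StoreUnavailable:
        value = None
    if value is not None:
        return ConfigResult(value, "primary")

    try:
        value = fallback(key)
    except StoreUnavailable:
        value = None
    if value is not None:
        return ConfigResult(value, "fallback")
    return ConfigResult(None, "missing")


# In-memory dicts stand in for real stores (assumption for the example).
primary_store: Dict[str, str] = {}  # the key has "disappeared" here
fallback_store = {"svc-1": "https://origin.example.com"}

result = read_config("svc-1", primary_store.get, fallback_store.get)
print(result.source, result.value)
```

Two caveats on this sketch: treating a missing key as a reason to fall back means a deliberately deleted key would also be served from the fallback, and the sketch does nothing to detect a stale value in the primary store, which would need something like a version or timestamp stored with each entry.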