GraphCDN incident

Cloudflare Outage

GraphCDN experienced a major incident on June 21, 2022. The incident has been resolved; the full update timeline is below.

Started: Jun 21, 2022, 07:20 AM UTC
Resolved: Jun 21, 2022, 07:20 AM UTC
Duration: —
Detected by Pingoru: Jun 21, 2022, 07:20 AM UTC

Update timeline

resolved Jun 21, 2022, 03:44 PM UTC

- Around 6:40 am we started getting customer reports about our CDN service being unavailable
- Around 6:52 am we linked this to the Cloudflare incident
- At 7:03 am an incident was opened at Stellate for a failing part of our internal system
- Around 7:20 am Cloudflare implemented a fix; in the minutes after that, we saw our services return to normal

postmortem Jun 21, 2022, 03:45 PM UTC

## Leadup/fault

Cloudflare deployed a change to its global network, taking the busiest 19 locations offline (accounting for about 50% of total traffic passing through Cloudflare). This outage propagated to the Stellate GraphQL Edge Cache, which uses Cloudflare Workers under the hood. Cloudflare posted a [detailed explanation](https://blog.cloudflare.com/cloudflare-outage-on-june-21-2022/) of this incident on their blog.

## Impact

* Traffic passing through Stellate POPs (provided by Fastly) that was routed to affected Cloudflare locations saw increased error rates and outages. This affected all Stellate services, regardless of whether GraphQL Edge Caching was enabled.
* Since we use our GraphQL Analytics service for internal APIs, our dashboard was affected by the outage as well.
* The Stellate Purging API also runs on Cloudflare Workers and was unavailable in affected locations.
* Lastly, we observed failed login attempts for users trying to log in to the dashboard via email. Our endpoint errored because the [WorkOS](https://workos.com/) API (used internally to power magic login links) returned an error. WorkOS also reported a “degraded service” incident on their [status page](https://status.workos.com/incidents/s5kl869ldj94) that aligns with the timing of the Cloudflare outage.

## Timeline (all times in UTC)

* On 2022-06-21, around 6:40 am we started getting customer reports about our CDN service being unavailable
* Around 6:52 am we linked this to the Cloudflare incident
* At 7:03 am an incident was opened at Stellate for a failing part of our internal system
* Around 7:20 am Cloudflare implemented a fix; in the minutes after that, we saw our services return to normal

## Short-term solution

* We improved our internal monitoring to check more locations. This will help us spot partial outages of our CDN services more quickly in the future.
* We made the email login endpoint more resilient to outages of WorkOS.

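As a rough illustration of the monitoring improvement described above, the sketch below shows one way to tell a partial outage from a full one using health checks run from several locations. It is not Stellate's actual implementation: the `ProbeResult` type, the `classify_outage` function, and the location labels are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ProbeResult:
    location: str  # hypothetical probe location label, e.g. "fra"
    ok: bool       # True if the health check from that location succeeded


def classify_outage(results: List[ProbeResult]) -> Tuple[str, List[str]]:
    """Return ("healthy" | "partial" | "full", sorted failing locations)."""
    if not results:
        raise ValueError("at least one probe result is required")
    failed = sorted(r.location for r in results if not r.ok)
    if not failed:
        return "healthy", failed
    if len(failed) == len(results):
        return "full", failed
    # Some, but not all, locations are failing: a partial outage.
    return "partial", failed
```

Checking from a single location would report either "healthy" or "full" and could miss an outage confined to other regions; probing more locations is what makes the "partial" case visible.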
## Future plans

* Even before today's incident, we were planning to consolidate our CDN service and reduce our dependencies on third-party providers like Cloudflare.