imgix incident

Elevated rendering errors

Major · Resolved

imgix experienced a major incident on June 12, 2025 affecting Rendering Infrastructure, Web Administration Tools, and API Service, lasting 2h 43m. The incident has been resolved; the full update timeline is below.

Started
Jun 12, 2025, 06:09 PM UTC
Resolved
Jun 12, 2025, 08:52 PM UTC
Duration
2h 43m
Detected by Pingoru
Jun 12, 2025, 06:09 PM UTC

Affected components

Rendering Infrastructure · Web Administration Tools · API Service

Update timeline

  1. investigating Jun 12, 2025, 06:09 PM UTC

    We are investigating elevated render error rates for the service. Previously cached derivatives are not impacted.

  2. identified Jun 12, 2025, 06:15 PM UTC

The issue has been identified and we are working on a solution.

  3. identified Jun 12, 2025, 06:53 PM UTC

    The service is experiencing elevated error rates due to a major Google Cloud outage affecting services downstream. Previously cached derivatives are not impacted. We are investigating ways to mitigate this issue.

  4. monitoring Jun 12, 2025, 08:26 PM UTC

    The service is restored. We are monitoring the situation.

  5. identified Jun 12, 2025, 08:38 PM UTC

The Rendering API has fully recovered. We are continuing to investigate Web Administration issues (login and the Management API) stemming from the incident.

  6. monitoring Jun 12, 2025, 08:47 PM UTC

The Rendering API is fully recovered. Web Administration tooling (logins and the Management API) is recovering. We are monitoring the results.

  7. resolved Jun 12, 2025, 08:52 PM UTC

    The service is completely restored.

  8. postmortem Jun 20, 2025, 07:03 PM UTC

# Incident Summary

Between **17:55 and 20:22 UTC** on **June 12, 2025**, imgix services experienced major disruptions across several key interfaces:

* **Dashboard and Asset Manager**: These interfaces were inaccessible, preventing users from managing their assets or viewing account information.
* **Management API**: Requests to the Management API consistently returned errors, affecting workflows reliant on programmatic updates or asset administration.
* **Rendering API**: Approximately 8% of all imgix service requests failed due to a high error rate (~80%) for **uncached** assets via the Rendering API. Requests in the EU saw a lower failure rate (~50%) and a faster recovery time (45 minutes) for these uncached requests.

# What caused it

The incident was triggered by a global outage within Google Cloud, which serves as a core infrastructure provider for imgix. The outage affected most services in all regions simultaneously. You can read more about the [Google Cloud outage here](https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW#RP1d9aZLNFZEJmTBk8e1).

# What happened

* **17:55 UTC:** Internal alerts triggered due to a spike in rendering errors and service timeouts.
* **18:01 UTC:** A short investigation uncovered several timeouts and increased error rates from Google Cloud.
* **18:09 UTC:** Our status page was updated.
* **18:53 UTC:** A major Google Cloud outage was confirmed, after which we updated our status page.
* **18:00–20:16 UTC:** Mitigation efforts were hampered by the far-reaching effects of the outage, preventing us from redirecting traffic or applying configuration changes.
* **Throughout:** We confirmed that cached images were not affected, though the downtime of several data sources prevented us from evaluating the full scope and effect of the outage.
* **20:16 UTC:** Google reported recovery in all regions except `us-central1`. This allowed us to verify **significantly lower error rates for EU traffic**.
* **20:47 UTC:** imgix systems achieved full recovery.
* **20:52 UTC:** The incident was officially resolved on our status page.
* **21:23 UTC:** Google confirmed a full-service recovery.

# What went wrong

* Google Cloud experienced an outage that simultaneously affected nearly every service in every region worldwide, negating our multi-region redundancy for the image rendering service.
* Third-party services (such as our CDN) were also affected by the outage, which removed some of our options for redirecting traffic across regions based on performance.
* The outage included the control planes that Google provides to its customers, which removed additional options for redirecting traffic and implementing mitigations.

# What we will do to prevent this in the future

* Continue our ongoing internal discussions and evaluations of a multi-cloud render stack to enable failover in the event of a provider-wide outage.
* Continue evaluating and improving tools to automatically and manually shift traffic as necessary at each layer of the stack (see the sketch below).
* Review and enhance incident communication protocols, focusing on faster root cause disclosure and update frequency.
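To make the traffic-shifting idea above concrete, here is a minimal sketch in Python of a health-check-driven origin pool: unhealthy regions are dropped, and uncached render traffic fails over to whatever remains. All origin names, health URLs, and function names are invented for illustration; this is an assumption-laden sketch, not imgix's actual tooling.

```python
import random
import urllib.request

# Hypothetical render origins across providers/regions. The names and
# health-check URLs are illustrative assumptions, not real endpoints.
ORIGINS = {
    "gcp-us-central1": "https://render-gcp-usc1.example.com/healthz",
    "gcp-europe-west1": "https://render-gcp-euw1.example.com/healthz",
    "alt-cloud-us-east": "https://render-alt-use.example.com/healthz",
}


def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the origin's health endpoint answers 200 in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers DNS failures, timeouts, and HTTP errors (URLError/HTTPError).
        return False


def healthy_origins() -> list[str]:
    """Probe every origin and keep only those currently passing checks."""
    return [name for name, url in ORIGINS.items() if probe(url)]


def pick_origin() -> str:
    """Choose a healthy origin at random, so traffic shifts away from a
    degraded provider automatically as its health checks start failing."""
    pool = healthy_origins()
    if not pool:
        # Mirrors the incident's one bright spot: with no healthy render
        # origin, previously cached derivatives can still be served.
        raise RuntimeError("no healthy render origins; serve cached derivatives only")
    return random.choice(pool)


if __name__ == "__main__":
    print("routing uncached renders to:", pick_origin())
```

A sketch like this only helps if the health checks and routing layer live outside the failing provider's control plane; as the postmortem notes, an outage that takes the control plane down removes most in-provider mitigation options.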