imgix incident

Elevated rendering errors

imgix experienced a critical incident on June 17, 2024 affecting Rendering Infrastructure, lasting 58m. The incident has been resolved; the full update timeline is below.

Started: Jun 17, 2024, 12:21 AM UTC
Resolved: Jun 17, 2024, 01:19 AM UTC
Duration: 58m
Detected by Pingoru: Jun 17, 2024, 12:21 AM UTC

Affected components

Rendering Infrastructure

Update timeline

investigating Jun 17, 2024, 12:21 AM UTC

We are currently investigating elevated render error rates for uncached derivative images. We will post an update when we have more information. Previously cached derivatives are not impacted.
identified Jun 17, 2024, 12:50 AM UTC

The issue has been identified and our engineering team is developing a fix.
identified Jun 17, 2024, 12:51 AM UTC

We are continuing to work on a fix for this issue.
monitoring Jun 17, 2024, 01:06 AM UTC

A fix has been implemented and error rates are returning to normal. We are continuing to monitor the service.
resolved Jun 17, 2024, 01:19 AM UTC

This incident has been resolved.
postmortem Jun 21, 2024, 11:54 PM UTC

## What happened?

On June 17, 2024, at 00:00 UTC, imgix experienced an extreme spike in requests to our render stack. This unexpected surge caused a failure in our auto-scaling infrastructure, leading to an inability to manage all incoming traffic effectively. A fix was implemented at 00:38 UTC, and the issue was resolved by 01:06 UTC.

## How were customers impacted?

Between 00:00 and 01:06 UTC, customers may have experienced failures when requesting new renders. However, previously cached assets served successfully during this time.

## What went wrong during the incident?

The incident was triggered by a significant increase in requests, which our automated systems did not properly handle. Although the system started to auto-scale as expected, the unexpected surge caused issues with the health checks used for auto-scaling. The combination of extra traffic and health check failure led to an inability to render new images that required manual intervention to resolve.

## What will imgix do to prevent this in the future?

To avoid similar incidents in the future, imgix is taking the following actions:

1. **Health Check Enhancement:** We have investigated and implemented updated health checks to support increased traffic volumes.
2. **Rate Limiting:** Further rate limits will be applied to manage traffic spikes and minimize their impact.
3. **Traffic Routing:** Traffic will be rerouted as necessary to distribute the load and reduce the risk of system overloads.
4. **Automated Alerts Improvement:** We will enhance our automated alert systems to respond more effectively to traffic surges and potential issues, including health check failures.

By addressing these areas, we aim to further improve our system's resilience and ensure a smoother customer experience during periods of high demand.
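
For readers unfamiliar with the rate limiting mentioned in the action items, the sketch below shows one common technique, a token bucket. It is an illustration only: the postmortem does not say which algorithm imgix uses, and the `TokenBucket` class, the `admitted` helper, and the capacity and refill values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Admit bursts of up to `capacity` requests, then `refill_rate` per second."""

    capacity: float        # maximum burst size in requests (hypothetical value)
    refill_rate: float     # tokens added per second (hypothetical value)
    tokens: float = 0.0
    last_refill: float = 0.0

    def __post_init__(self) -> None:
        # Start with a full bucket so an initial burst is admitted.
        self.tokens = self.capacity

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (in seconds) is admitted."""
        elapsed = max(0.0, now - self.last_refill)
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def admitted(bucket: TokenBucket, arrival_times: list[float]) -> int:
    """Count how many of the given arrivals the bucket admits."""
    return sum(bucket.allow(t) for t in arrival_times)


if __name__ == "__main__":
    bucket = TokenBucket(capacity=5, refill_rate=1)
    # A spike of 20 simultaneous requests: only the burst allowance gets through.
    print(admitted(bucket, [0.0] * 20))
```

In this sketch a sudden spike is capped at the bucket's capacity, and rejected requests could be answered with an error or retried later rather than reaching the render stack. The clock is passed in explicitly so the behavior is easy to reason about and test.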