imgix incident

Intermittent Dashboard and Management API Issues

imgix experienced a notice-level incident on February 1, 2023, affecting Web Administration Tools and lasting 4h 40m. The incident has been resolved; the full update timeline is below.

Started: Feb 01, 2023, 03:22 AM UTC
Resolved: Feb 01, 2023, 08:02 AM UTC
Duration: 4h 40m
Detected by Pingoru: Feb 01, 2023, 03:22 AM UTC

Affected components

Web Administration Tools

Update timeline

monitoring Feb 01, 2023, 03:22 AM UTC

We are currently monitoring Dashboard and Management API performance.

resolved Feb 01, 2023, 08:02 AM UTC

This incident has been resolved.

postmortem Feb 03, 2023, 12:08 AM UTC

# What happened?

On February 1st, 2023 14:07 UTC, the imgix service experienced intermittent spikes in latency for web administration services, such as the imgix Dashboard and Management API. The incident was resolved later in the day at 20:03 UTC.

# How were customers impacted?

Customers may have experienced issues with using the Dashboard and the Management API. Actions such as logging in, loading pages, and making requests to the Management API resulted in intermittent timeouts. The Rendering API was not affected by this incident.

# What went wrong during the incident?

After our engineers identified the initial latency spike, we deployed a workaround that initially resolved the issue. After monitoring the results, we closed the incident, but latency shortly spiked again. The spike was sustained, and requests to the Web Administration parts of our service started to show long response times.

The identified issues were similar to a recent incident that had occurred due to upstream providers. Our engineers applied similar mitigation steps, though they were less effective for this incident.

Upon further discussion, our engineering team identified a path to resolution by fast-tracking a future planned infrastructure change. This involved reducing connections between our internal services. This change immediately fixed the latency in our Web Administration services.

# What will imgix do to prevent this in the future?

Internal documentation and tooling allowed our team to easily apply configuration changes and quickly push the needed architecture updates. We have updated this documentation and tooling involving the communication between our internal services to further facilitate these deployments in the future. The diagnostic steps and active monitoring/alerting have been updated as well. Additionally, we have completed an infrastructure upgrade which is designed to prevent this issue from recurring.

As we gather more data on the new and improved performance metrics, we will proactively continue tuning our configurations to ensure future stability.