imgix incident

Intermittent rendering errors

Minor Resolved

imgix experienced a minor incident on October 23, 2023 affecting Rendering Infrastructure, lasting 58m. The incident has been resolved; the full update timeline is below.

Started
Oct 23, 2023, 10:34 PM UTC
Resolved
Oct 23, 2023, 11:32 PM UTC
Duration
58m
Detected by Pingoru
Oct 23, 2023, 10:34 PM UTC

Affected components

Rendering Infrastructure

Update timeline

  1. investigating Oct 23, 2023, 10:34 PM UTC

    We are investigating an issue affecting a small percentage of renders.

  2. monitoring Oct 23, 2023, 11:15 PM UTC

    A fix has been implemented and we are monitoring the results.

  3. resolved Oct 23, 2023, 11:32 PM UTC

    This incident has been resolved.

  4. postmortem Nov 02, 2023, 06:52 PM UTC

    # What happened?

    On October 23, 2023, between 21:43 UTC and 23:14 UTC, imgix experienced a partial outage affecting images served from the Rendering API. During this time, a small percentage (<0.45% on average) of non-cached requests returned a server error. A fix was implemented at 23:02 UTC, which allowed the service to fully recover by 23:14 UTC.

    # How were customers impacted?

    Between 21:46 UTC and 23:14 UTC, some requests to the Rendering API returned a server error, with 0.65% of all requests to our CDN returning an error at the height of the incident.

    Additionally, Sources returned an unknown status between 21:06 UTC and 21:09 UTC. During this period, customers reported being unable to create Sources.

    # What went wrong during the incident?

    Our Rendering API experienced an unexpected interaction that caused a dramatic increase in server load. This caused error rates to increase as the network slowly became overloaded. The error rate fluctuated between 0.07% and 0.65% until we resolved the issue.

    To restore the service, our engineers re-configured our network traffic to handle the unexpected Rendering behavior.

    During the incident, a separate issue (unrelated to rendering) impacted our Source data. This led to a delay in investigating the cause of the rendering errors.

    # What will imgix do to prevent this in the future?

    We have taken the following steps to prevent this issue from recurring:

    * Fixed the misconfigured server interaction.
    * We will put an alert system in place to notify us when traffic congestion happens from a misconfigured source interaction.

    We are in the process of implementing the following:

    * Conducting a review of our current tooling to increase our traffic and network configuration capabilities.
    * Reviewing our current configuration to limit the affected services should a similar incident happen.
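
The postmortem mentions an alert that would fire when error rates climb, but gives no detail on how it works. As a rough illustration only, a sliding-window error-rate check might look like the sketch below. The class name `ErrorRateAlert`, the 0.2% threshold, and the three-interval window are all hypothetical choices made for this example; they are not taken from imgix's systems.

```python
from collections import deque


class ErrorRateAlert:
    """Sliding-window error-rate check (hypothetical sketch, not imgix's alerting)."""

    def __init__(self, threshold: float, window: int) -> None:
        self.threshold = threshold  # e.g. 0.002 means 0.2% (assumed value)
        # Keeps only the most recent `window` intervals of (requests, errors).
        self.samples = deque(maxlen=window)

    def record(self, requests: int, errors: int) -> bool:
        """Add one interval's counts; return True if the windowed rate exceeds the threshold."""
        self.samples.append((requests, errors))
        total_requests = sum(r for r, _ in self.samples)
        total_errors = sum(e for _, e in self.samples)
        if total_requests == 0:
            return False  # no traffic in the window, nothing to alert on
        return total_errors / total_requests > self.threshold


alert = ErrorRateAlert(threshold=0.002, window=3)
quiet = alert.record(requests=100_000, errors=70)    # 0.07% in the window
noisy = alert.record(requests=100_000, errors=650)   # 720 / 200,000 = 0.36% in the window
```

With these assumed numbers, the first interval stays under the threshold and the second pushes the windowed rate above it. Averaging over a window rather than a single interval is one way to avoid paging on a brief spike, at the cost of reacting more slowly to a gradual overload like the one described above.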