imgix incident

Intermittent 5xx errors

Minor · Resolved

imgix experienced a minor incident on May 1, 2023 affecting Rendering Infrastructure, lasting 1h 15m. The incident has been resolved; the full update timeline is below.

Started
May 01, 2023, 02:20 PM UTC
Resolved
May 01, 2023, 03:35 PM UTC
Duration
1h 15m
Detected by Pingoru
May 01, 2023, 02:20 PM UTC

Affected components

Rendering Infrastructure

Update timeline

  1. investigating May 01, 2023, 02:20 PM UTC

    We are currently investigating reports of intermittent 5xx errors that cause some images to return an error on the initial request.

  2. identified May 01, 2023, 02:47 PM UTC

    The issue has been identified, and a fix is being implemented.

  3. monitoring May 01, 2023, 03:08 PM UTC

    A fix has been implemented, and we are monitoring the results.

  4. resolved May 01, 2023, 03:35 PM UTC

    This incident has been resolved.

  5. postmortem May 11, 2023, 08:37 PM UTC

    # What happened

    On May 1st, 2023, between the hours of 08:23 UTC and 15:08 UTC, imgix experienced intermittent errors affecting a small percentage of non-cached renders.

    # How were customers impacted?

    During the affected period, a small percentage of requests to the Rendering API returned a `502` or `503` error for non-cached requests. Errors slowly and gradually increased, with <0.5% of requests returning an error at the height of the incident.

    # What went wrong during the incident?

    Our upstream provider experienced communication issues between CDN POPs, causing intermittent `502`/`503` responses in a small percentage of requests to our Rendering API.

    The increase in errors was so minor that it did not meet our monitoring thresholds for triggering alerts. One of our engineers observed a slow increase in errors and alerted other team members to a potential issue with our service.

    After tracing the issue to our upstream provider, we pushed a patch to mitigate intermittent connectivity issues, resolving the incident.

    # What will imgix do to prevent this in the future?

    We have refined our alerting to better catch the slowly increasing error rates.

    We have also ensured that the root cause of this incident has been fixed by our upstream provider.

    We are also updating our traffic routing in the case that the upstream issue occurs again.
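
The postmortem notes that the error rate rose slowly and never crossed the static alert threshold. As an illustration only, the sketch below shows one common way to catch that pattern: alert on the trend (a least-squares slope over recent samples) in addition to the absolute level. This is not imgix's actual alerting logic. The function names, the sample interval, and all threshold values are assumptions chosen for the example.

```python
from typing import Sequence


def error_rate_slope(rates: Sequence[float]) -> float:
    """Least-squares slope of the error rate, in rate units per sample."""
    n = len(rates)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(rates) / n
    num = sum((i - mean_x) * (r - mean_y) for i, r in enumerate(rates))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def should_alert(
    rates: Sequence[float],
    abs_threshold: float = 0.01,     # hypothetical static threshold: 1% of requests
    slope_threshold: float = 0.0002,  # hypothetical trend threshold, per sample
    window: int = 6,                  # hypothetical number of recent samples to fit
) -> bool:
    """Alert on a high error rate OR on a sustained upward trend."""
    if not rates:
        return False
    # Static check: fires only once the latest rate is already high.
    if rates[-1] >= abs_threshold:
        return True
    # Trend check: needs a full window of samples before it can fire.
    if len(rates) < window:
        return False
    return error_rate_slope(rates[-window:]) >= slope_threshold


# Error rate climbing from 0.05% to 0.45%: always below the 1% static threshold.
rising = [0.0005 * (i + 1) for i in range(9)]
# Steady 0.1% background error rate.
flat = [0.001] * 9

print(should_alert(rising))  # trend check fires
print(should_alert(flat))    # no alert
```

With these assumed values, the rising series peaks below 0.5% of requests, as in the incident, so a static 1% threshold alone would stay silent while the slope check fires. In practice the window and slope threshold would need tuning against normal traffic noise to avoid false alarms.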