Svix incident

Most routes 5xx in the US region [resolved]

Svix experienced a major incident on March 11, 2023, affecting the API and lasting 47 minutes. The incident has been resolved; the full update timeline is below.

Started: Mar 11, 2023, 08:35 PM UTC
Resolved: Mar 11, 2023, 09:23 PM UTC
Duration: 47m
Detected by Pingoru: Mar 11, 2023, 08:35 PM UTC

Affected components

API

Update timeline

  1. investigating Mar 11, 2023, 07:38 PM UTC

    We are currently investigating this issue.

  2. investigating Mar 11, 2023, 08:27 PM UTC

    Still investigating.

  3. investigating Mar 11, 2023, 08:35 PM UTC

    We are still investigating, but we put mitigating changes in place.

  4. monitoring Mar 11, 2023, 08:42 PM UTC

    The service is back up. We are still monitoring, but everything is operational.

  5. resolved Mar 11, 2023, 09:23 PM UTC

    The issue has been resolved, though we are still trying to locate the root cause. There were no deploys today (Saturday), so it's not due to any change on our end, and the activity doesn't look unusual. For whatever reason, our API workers went from ~10% utilization (normal) to 100% in a short span of time. We are investigating with AWS.

    Update: after investigating with AWS for a few hours, neither they nor we have been able to explain the jump in memory usage; neither their metrics nor ours indicate any change in load, underlying systems, or anything similar. They've indicated that an OOM can happen even when nothing in the AWS metrics points to one. We are still investigating.

  6. postmortem Mar 12, 2023, 09:36 PM UTC

    We are still investigating, but here is a more complete update on this incident: [https://www.svix.com/blog/we-had-a-partial-outage/](https://www.svix.com/blog/we-had-a-partial-outage/)