Svix incident

Most routes 5xx in the US region [resolved]

Svix experienced a major incident on March 11, 2023 that affected the API and lasted 47 minutes. The incident has been resolved; the full update timeline is below.

Started: Mar 11, 2023, 08:35 PM UTC
Resolved: Mar 11, 2023, 09:23 PM UTC
Duration: 47m
Detected by Pingoru: Mar 11, 2023, 08:35 PM UTC

Affected components

API

Update timeline

investigating Mar 11, 2023, 07:38 PM UTC

We are currently investigating this issue.
investigating Mar 11, 2023, 08:27 PM UTC

Still investigating.
investigating Mar 11, 2023, 08:35 PM UTC

We are still investigating, but we put mitigating changes in place.
monitoring Mar 11, 2023, 08:42 PM UTC

The service is back up. We are still monitoring, but everything is operational.
resolved Mar 11, 2023, 09:23 PM UTC

The issue has been resolved, though we are still trying to locate the root cause. There were no deploys today (Saturday), so it's not due to any change on our end, and the activity doesn't look unusual. For whatever reason, our API workers went from ~10% utilization (normal) to 100% in a short span of time. We are investigating with AWS.

Update: after investigating with AWS for a few hours, neither they nor we have been able to explain the jump in memory usage; neither their metrics nor ours indicate any change in load, underlying systems, or anything similar. They've indicated that an OOM can happen even when nothing in the AWS metrics points to it. We are still investigating.
postmortem Mar 12, 2023, 09:36 PM UTC

We are still investigating, but here is a more complete update on this incident: [https://www.svix.com/blog/we-had-a-partial-outage/](https://www.svix.com/blog/we-had-a-partial-outage/)