LiveKit incident

Intermittent connectivity disruptions in US regions

LiveKit experienced a minor incident on May 15, 2026, lasting 9h 53m. The incident has been resolved; the full update timeline is below.

Started: May 15, 2026, 04:57 PM UTC
Resolved: May 16, 2026, 02:50 AM UTC
Duration: 9h 53m
Detected by Pingoru: May 15, 2026, 04:57 PM UTC

Update timeline

investigating May 15, 2026, 04:57 PM UTC

We are investigating intermittent connectivity disruptions affecting a subset of sessions, caused by internal network degradation between several of our US clusters (most heavily impacting US East). Affected sessions may see brief (<5 min) interruptions to media relay. Further updates to follow.
monitoring May 15, 2026, 09:16 PM UTC

We have not observed further instances of connectivity disruptions since 13:19 UTC. We are continuing to investigate why our reroute mechanism did not activate in these isolated instances and are monitoring for further disruptions.
resolved May 16, 2026, 02:50 AM UTC

We are resolving this incident as we have not observed any further connectivity disruptions. We will follow up with a postmortem once available.
postmortem May 22, 2026, 07:36 AM UTC

## Summary

LiveKit Cloud runs as a distributed realtime network, with data centers around the world interconnected via dedicated networking. Even with dedicated fiber, momentary disruptions between any two data centers can and do occur. To handle these blips, we have built a comprehensive set of resilience mechanisms that automatically reroute and relay traffic over alternate healthy paths. For example, if data centers A and B cannot reach one another cleanly but both can reach data center C, we use C as a relay so that traffic flows A to C to B.

Under normal circumstances, network blips between our data centers are handled transparently and customers do not notice them. Between 2026-05-12 and 2026-05-19, a small number of these otherwise routine network blips did become customer-visible in our US regions. Affected sessions experienced a brief interruption to media (under 5 minutes) before recovering on a new path. Realtime connections, SIP calls, agent processes, and the wider fleet were not otherwise impacted.

## Impact

Six short windows of disruption were observed, each tied to a brief network blip between certain clusters:

* 2026-05-12 17:16 UTC
* 2026-05-12 05:00 UTC
* 2026-05-12 17:00 UTC
* 2026-05-13 19:00 UTC
* 2026-05-15 07:30 UTC
* 2026-05-19 15:01 UTC

The majority of sessions traversed alternate paths normally and were not affected. A subset of sessions whose traffic happened to be relayed through a region in the specific failure state described below experienced a media interruption of up to ~5 minutes before re-routing onto a healthy path.

## Root Cause

On 2026-05-11, a change was deployed that altered the relay process. The change introduced a subtle bug that manifested only when two conditions occurred simultaneously:

1. The relay region was under-utilized and had not cached a particular piece of state needed by the relay process.
2. A large volume of traffic was directed to that relay region in a very short window of time.
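As a rough illustration of why both conditions had to coincide, the toy model below compares recovery time across the four combinations of cache state and burst size. It is a sketch under stated assumptions, not LiveKit's relay code: the names `ticks_to_recover` and `recovery_table`, the constants `LOAD_COST`, `WARM_COST`, and `CAPACITY`, and the premise that each session arriving at a cold relay pays the full state-load cost are all hypothetical, chosen only to make the interaction visible.

```python
import math
from itertools import product

# Hypothetical numbers, chosen only to make the interaction visible.
LOAD_COST = 200   # work units to rebuild the missing state on a cache miss
WARM_COST = 1     # work units to relay a session when the state is cached
CAPACITY = 1000   # work units the relay can complete per tick


def ticks_to_recover(burst_sessions: int, cache_warm: bool) -> int:
    """Ticks until every session in the burst is relaying media again.

    Assumption (not from the postmortem): on a cold cache, every arriving
    session pays the full load cost, as if the work were serialized.
    """
    per_session = WARM_COST if cache_warm else LOAD_COST
    return math.ceil(burst_sessions * per_session / CAPACITY)


def recovery_table() -> dict:
    """Recovery time for each combination of cache state and burst size."""
    bursts = {"small": 10, "large": 5000}
    return {
        ("warm" if warm else "cold", label): ticks_to_recover(size, warm)
        for warm, (label, size) in product((True, False), bursts.items())
    }


if __name__ == "__main__":
    for (cache, burst), ticks in recovery_table().items():
        print(f"{cache:>4} cache, {burst} burst: {ticks} ticks")
```

With these made-up numbers, a warm relay absorbs the large burst in 5 ticks and a cold relay handles the small burst in 2, while the cold relay facing the large burst needs 1000. Only the combination produces a long stall, which is the shape of the failure described here.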
When both conditions were present, the relay process would stall and take minutes to fully catch up. During that time, neither endpoint of the relayed session could receive media from the other.

Because both conditions are narrow (a cold-cache relay region absorbing a sudden burst of traffic), the bug did not surface during pre-deploy testing, and it did not trigger on every network blip. It only manifested when a real network disruption happened to redirect a sufficiently large burst of traffic to a relay region that had not warmed its cache. Once that occurred on a given relay, sessions flowing through it stalled until traffic shifted off that path.

We tracked down the root cause on 2026-05-20, and a fix has been fully rolled out across the fleet.

## Corrective Actions & Prevention

* **Improved integration tests for relay locks under load.** We are adding tests that exercise the relay process with cold caches and sudden traffic bursts, so this class of contention is caught before deploy. We are also auditing related lock paths to ensure they behave correctly under similar load profiles.

We sincerely apologize for the disruption to customers whose sessions were affected during these windows. Thank you for your patience, and we welcome any additional feedback from customers who were impacted.