Postmortem: Apr 17, 2026, 04:37 PM UTC

## Summary

LiveKit's core realtime and agent services are designed to tolerate database failures. WebRTC media, SIP calls, and hosted agent sessions continue to operate even when our database backend is slow or unavailable. A subset of Room APIs, specifically `CreateRoom`, `DeleteRoom`, and `UpdateRoomMetadata`, does depend on a database for consistency and disaster recovery. That database is highly available and globally distributed, with no single point of failure, but when it is under significant contention these Room APIs can return errors or time out while realtime traffic continues to flow normally.

On 2026-04-15, database contention caused a percentage of Room API calls to fail in our US-West region. Remediation work later caused a 26-minute global outage of the Room APIs. Realtime sessions, SIP calls, and agent processes were unaffected throughout. We sincerely apologize to customers whose applications were disrupted during this incident.

## Impact

The incident had two distinct phases of customer impact.

### Phase 1: Elevated Room API timeouts in US-West (2026-04-15 22:10 UTC to 2026-04-16 03:14 UTC)

A percentage of `DeleteRoom`, `UpdateRoomMetadata`, and `ListRooms` calls timed out, primarily in our US-West region. Other regions saw limited impact during this phase. Customers with high Room API volume in US-West observed elevated error rates on their integrations; the majority of customers were not affected.
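Because realtime sessions stay up even while these control-plane calls fail, integrations can often treat Room API timeouts as retryable rather than fatal. The following is a minimal sketch, assuming the Node `livekit-server-sdk` and a hypothetical `withRetry` helper; it is an illustration of the pattern, not something we prescribe or shipped as part of remediation:

```typescript
import { RoomServiceClient } from 'livekit-server-sdk';

// Hypothetical retry helper: backs off and retries a control-plane
// call, since realtime media keeps flowing even while Room APIs
// are returning errors or timing out.
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Wait 500 ms, 1 s, 2 s, ... before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, 500 * 2 ** i));
    }
  }
  throw lastErr;
}

// Placeholder host and credentials.
const svc = new RoomServiceClient('https://example.livekit.cloud', 'api-key', 'api-secret');

// Note: if a timed-out DeleteRoom actually succeeded server-side,
// a retry may report a not-found error; callers can treat that as success.
await withRetry(() => svc.deleteRoom('my-room'));
```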
### Phase 2: Global Room API outage (2026-04-16 03:14 to 03:40 UTC, ~26 minutes)

While we were swapping in a rebuilt `rooms` table, the table was briefly missing from the database, and the majority of Room API calls globally returned HTTP 500 with `ERROR: relation "rooms" does not exist`. WebRTC sessions, SIP calls, and agent processes continued to function, and realtime connection counts remained stable. Applications that depend on Room APIs to start or manage sessions saw visible failures during this window.

## Root Cause

The sweeper is a background process that removes rows from the `rooms` table as sessions end. Earlier on 2026-04-15, its throughput dropped significantly, and over roughly 8 hours stale rows accumulated until the table was many times larger than its intended steady-state size.

At approximately 21:00 UTC, a routine schema migration was applied to a different, unrelated table. The migration itself did not touch `rooms`, but it raised overall database disk utilization and background load. Combined with the oversized `rooms` table, this produced enough contention to slow reads and writes against it. The effect appeared first in US-West, where the regional mix of Room API traffic was most sensitive to the contention.

Once we identified the oversized table as the underlying cause, we needed to restore it to a healthy size. Because the table was already contended, deleting rows directly would have taken additional locks and worsened the contention. We instead chose to rebuild the table: create a new table with the same schema, copy over the active rows, then atomically swap the new table into place via a pair of renames. The copy phase completed quickly, and the first rename (moving the old `rooms` table aside) completed in about 2.5 minutes. The second rename, moving the new table into the `rooms` name, stalled on our globally distributed database for significantly longer than we anticipated. During the stall, the `rooms` table did not exist from the perspective of any region, and all Room API calls globally returned errors.
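For concreteness, the rebuild roughly followed the shape below. This is an illustrative sketch using Postgres-style DDL issued through the Node `pg` client (the Phase 2 error text is PostgreSQL's); the `state` column and the exact statements are placeholders, not our production tooling:

```typescript
import { Client } from 'pg';

// Illustrative outline only; the "state" column and table names
// are placeholders, and our actual tooling differs.
const db = new Client({ connectionString: process.env.DATABASE_URL });
await db.connect();

// 1. Create an empty table with the same schema as rooms.
await db.query('CREATE TABLE rooms_new (LIKE rooms INCLUDING ALL)');

// 2. Copy only the active rows, leaving the stale backlog behind.
await db.query(`INSERT INTO rooms_new SELECT * FROM rooms WHERE state = 'active'`);

// 3. Swap via a pair of renames. The gap between these two DDL
//    statements is exactly the window in which no "rooms" relation
//    exists; on a globally distributed database, the second rename
//    can stall far longer than it would on a single node.
await db.query('ALTER TABLE rooms RENAME TO rooms_old');
await db.query('ALTER TABLE rooms_new RENAME TO rooms');

await db.end();
```

On a single-node PostgreSQL instance, the two renames could be wrapped in one transaction, making the swap atomic to readers; as this incident showed, the same pattern on a globally distributed backend can leave a visible gap while the rename completes.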
After roughly 10 minutes, we aborted the stalled rename, created a fresh `rooms` table from scratch, and inserted the active rows into it. Room API traffic recovered globally shortly after.

## Corrective Actions & Prevention

The following improvements have been implemented or initiated to reduce the likelihood and impact of similar incidents:

* **Enhance monitoring for sweeper throughput and active room count.** We are adding and hardening alerts on sweeper throughput and active room count, so that any future divergence pages on-call well before it threatens production.
* **Improve sweeper resilience and throughput.** We are investigating the cause of the sweeper's throughput drop and adding capacity headroom so that a transient slowdown cannot turn into multi-hour backlog growth.
* **Remove the database as a dependency of the Room APIs.** This incident reaffirmed our long-held design principle that realtime services should not depend on databases. We believe this is the only way to build a system that approaches 100% uptime, and we will continue the work to ensure the Room APIs do not depend on a database either.

The Phase 2 outage was caused by our own remediation, and we recognize how disruptive it was for applications that depend on the Room APIs. We are committed to the work above to reduce both the likelihood and the blast radius of a similar failure in the future. Thank you for your patience, and we welcome any additional feedback from customers who were affected.