LiveKit incident

Increased Latency in RoomService APIs, brief period of higher error rate

Major · Resolved
Started
Apr 15, 2026, 11:23 PM UTC
Resolved
Apr 16, 2026, 04:09 AM UTC
Duration
4h 45m
Detected by Pingoru
Apr 15, 2026, 11:23 PM UTC

Affected components

Global Real Time Communication

Update timeline

  1. investigating Apr 15, 2026, 11:23 PM UTC

    We are investigating reports of increased latencies in RoomService APIs in the US West region, specifically on CreateRoom, DeleteRoom, and UpdateRoomMetadata APIs.

  2. investigating Apr 16, 2026, 12:13 AM UTC

    We are continuing to investigate this issue.

  3. investigating Apr 16, 2026, 01:07 AM UTC

    We believe these elevated latencies began around 22:00 UTC. We have confirmed that only API requests in US-West should be impacted. The current list of impacted APIs appears to be CreateRoom, DeleteRoom, and UpdateRoomMetadata. We are working on mitigating the issue to return latencies to normal.

  4. identified Apr 16, 2026, 02:10 AM UTC

    We continue to see elevated Room API latencies, which are now also impacting other regions. The latency increases appear to originate from a specific table in our distributed database. The issue has been escalated to the database vendor, and we are working on a workaround to decrease the API latencies. Other services are not impacted.

  5. investigating Apr 16, 2026, 03:45 AM UTC

    While applying a fix for the API latencies, we are temporarily seeing increased failure rates in RoomService APIs, including CreateRoom, UpdateRoomMetadata, and DeleteRoom. We are actively working on mitigating this. Impact has been upgraded to major.

  6. monitoring Apr 16, 2026, 03:51 AM UTC

    Our fix is fully implemented, and we are no longer seeing failures or elevated latencies in the RoomService APIs. We are continuing to monitor the issue. We did observe a 15-minute period of high API failure rates while mitigation steps were being applied.

  7. resolved Apr 16, 2026, 04:09 AM UTC

    This issue is now fully resolved. We will be posting a detailed RCA.

  8. postmortem Apr 17, 2026, 04:37 PM UTC

    ## Summary

    LiveKit's core realtime and agent services are designed to tolerate database failures. WebRTC media, SIP calls, and hosted agent sessions continue to operate even when our database backend is slow or unavailable. A subset of Room APIs, specifically `CreateRoom`, `DeleteRoom`, and `UpdateRoomMetadata`, do depend on a database for consistency and disaster recovery. That database is highly available and globally distributed, with no single point of failure. When it is under significant contention, these Room APIs can return errors or time out, while realtime traffic continues to flow normally.

    On 2026-04-15, database contention caused a percentage of Room API calls to fail in our US-West region. Remediation work later produced a 26-minute global outage of the Room APIs. Realtime sessions, SIP calls, and agent processes were unaffected throughout. We sincerely apologize to customers whose applications were disrupted during this incident.

    ## Impact

    The incident had two distinct phases of customer impact.

    ### Phase 1: Elevated Room API timeouts in US-West (2026-04-15 22:10 UTC to 2026-04-16 03:14 UTC)

    A percentage of `DeleteRoom`, `UpdateRoomMetadata`, and `ListRooms` calls timed out, primarily in our US-West region. Other regions saw limited impact during this phase. Customers with high Room API volume in US-West observed elevated error rates on their integrations; the majority of customers were not affected.

    ### Phase 2: Global Room API outage (2026-04-16 03:14 to 03:40 UTC, ~26 minutes)

    While we were swapping in a rebuilt `rooms` table, the table was briefly missing from the database, and the majority of Room API calls globally returned HTTP 500 with `ERROR: relation "rooms" does not exist`. WebRTC sessions, SIP calls, and agent processes continued to function, and realtime connection counts remained stable. Applications that depend on Room APIs to start or manage sessions saw visible failures during this window.
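Applications that drive session lifecycle through the Room APIs can ride out short failure windows like Phase 2 with retries and backoff. A minimal sketch, assuming a hypothetical `flaky_create_room` callable and a `TransientAPIError` exception standing in for a retryable HTTP 500 (neither is part of LiveKit's SDK; both are illustrative):

```python
import random
import time

class TransientAPIError(Exception):
    """Illustrative stand-in for a retryable Room API failure (e.g. HTTP 500)."""

def call_with_backoff(fn, *, attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Retry fn() with exponential backoff and full jitter on transient errors."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientAPIError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # full jitter: random delay in [0, min(max_delay, base * 2^attempt)]
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))

# Example: a call that fails twice with the Phase 2 error, then succeeds.
calls = {"n": 0}
def flaky_create_room():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientAPIError('relation "rooms" does not exist')
    return {"room": "demo"}

result = call_with_backoff(flaky_create_room, sleep=lambda _: None)
```

Capping the delay and jittering it keeps a fleet of clients from retrying in lockstep against a backend that is already struggling.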
    ## Root Cause

    The sweeper is a background process that removes rows from the `rooms` table as sessions end. Earlier on 2026-04-15, its throughput dropped significantly, and over roughly 8 hours stale rows accumulated to the point where the table was many times larger than its intended steady-state size.

    At approximately 21:00 UTC, a routine schema migration was applied to a different, unrelated table. The migration itself did not touch `rooms`, but it raised overall database disk utilization and background load. Combined with the oversized `rooms` table, this produced enough contention to slow down reads and writes against it. The effect first appeared in US-West, where the regional mix of Room API traffic was most sensitive to the contention.

    Once we identified the oversized table as the underlying cause, we needed to restore it to a healthy size. Because the table was already contended, deleting rows directly would have taken additional locks and worsened the contention. We instead chose to rebuild the table: create a new table with the same schema, copy over the active rows, then atomically swap the new table into place via a pair of renames.

    The copy phase completed quickly. The first rename (moving the old `rooms` table aside) completed in about 2.5 minutes. The second rename, moving the new table into the `rooms` name, stalled on our globally distributed database for significantly longer than we anticipated. During the stall, the `rooms` table did not exist from the perspective of any region, and all Room API calls globally returned errors. After roughly 10 minutes, we aborted the stalled rename, created a fresh `rooms` table from scratch, and inserted the active rows into it. Room API traffic recovered globally shortly after.
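The rebuild-and-swap sequence described in the root cause can be sketched as plain SQL (run here against an in-memory SQLite database purely for illustration; the `rooms` schema is hypothetical). The key hazard is visible in step 2: between the two renames, no table named `rooms` exists, and any query against that name fails. On a single node this window is tiny, but on a globally distributed database the second rename can stall, stretching the window out, which is where the Phase 2 outage occurred:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE rooms (id TEXT PRIMARY KEY, active INTEGER)")
db.executemany("INSERT INTO rooms VALUES (?, ?)",
               [("r1", 1), ("r2", 0), ("r3", 1)])  # r2 is a stale row

# 1. Build a replacement table and copy over only the active rows.
db.execute("CREATE TABLE rooms_new (id TEXT PRIMARY KEY, active INTEGER)")
db.execute("INSERT INTO rooms_new SELECT id, active FROM rooms WHERE active = 1")

# 2. Swap via a pair of renames. Between these two statements the name
#    `rooms` does not resolve, so every query against it errors out.
db.execute("ALTER TABLE rooms RENAME TO rooms_old")
db.execute("ALTER TABLE rooms_new RENAME TO rooms")

rows = db.execute("SELECT id FROM rooms ORDER BY id").fetchall()
```

An alternative that avoids the gap entirely on engines that support it is a single transactional swap (or a view that is repointed atomically), at the cost of more coordination in a distributed setting.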
    ## Corrective Actions & Prevention

    The following improvements have been implemented or initiated to reduce the likelihood and impact of similar incidents:

    * **Enhance monitoring for sweeper throughput and active room count.** We are adding and hardening alerts on sweeper throughput and active room count, so that any future divergence pages on-call well before it threatens production.
    * **Improve sweeper resilience and throughput.** We are investigating the cause of the sweeper's throughput drop and adding capacity headroom so a transient slowdown cannot translate into multi-hour backlog growth.
    * **Remove the database as a dependency for Room APIs.** This incident reaffirmed our long-held design principle that realtime services should not depend on databases. We believe this is the only way to build a system that approaches 100% uptime, and we will continue the work to ensure the Room APIs do not depend on a database either.

    The Phase 2 outage was caused by our own remediation, and we recognize how disruptive it was for applications that depend on the Room APIs. We are committed to the work above to reduce both the likelihood and the blast radius of a similar failure in the future. Thank you for your patience, and we welcome any additional feedback from customers who were affected.
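The first corrective action, alerting when the sweeper diverges from the rate at which rooms end, amounts to a pair of ratio checks. A minimal sketch; the metric names and thresholds here are hypothetical, not LiveKit's actual alerting rules:

```python
def sweeper_backlog_alerts(rooms_ended_per_min, rows_swept_per_min,
                           room_row_count, expected_steady_state,
                           min_sweep_ratio=0.8, max_size_multiple=2.0):
    """Return alert messages when the sweeper falls behind.

    Fires when sweep throughput drops below a fraction of the rate at
    which sessions end, or when the rooms table grows past a multiple
    of its intended steady-state size.
    """
    alerts = []
    if rooms_ended_per_min > 0:
        ratio = rows_swept_per_min / rooms_ended_per_min
        if ratio < min_sweep_ratio:
            alerts.append(f"sweeper throughput low: sweeping {ratio:.0%} of ended rooms")
    if room_row_count > expected_steady_state * max_size_multiple:
        alerts.append("rooms table exceeds steady-state size bound")
    return alerts

# Healthy: sweeper keeps pace, table near steady state -> no alerts.
healthy = sweeper_backlog_alerts(100, 100, 1000, 1000)

# Degraded: sweeping 20% of ended rooms, table 5x steady state -> two alerts.
degraded = sweeper_backlog_alerts(100, 20, 5000, 1000)
```

Alerting on the ratio rather than on absolute throughput means a quiet period (few rooms ending, few rows swept) does not page anyone, while a sustained divergence does.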
