LiveKit experienced a minor incident on November 10, 2025 affecting US West - Egress, US East - Egress, and one more component, lasting 2h 1m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Nov 10, 2025, 04:29 PM UTC
We are currently investigating reports of track egresses failing to start in our US West and US Central regions. Other egress types are not impacted.
- resolved Nov 10, 2025, 08:47 PM UTC
This incident was resolved as of 10:30 am PST and we have not observed any additional errors since then. We will update this incident with more details on the impact that we observed.
- postmortem Nov 14, 2025, 12:55 AM UTC
# Summary

Track Egress experienced intermittent failures and delayed status updates caused by RPC instability between the egress and controller services in US regions (primarily Phoenix and Chicago). Room Composite egresses saw a much smaller impact. Other egress types were not broadly affected.

Customer impact (initial estimates during the window 2025-11-10 14:00–18:30 UTC):

* ~2.25% of Track Egress start requests failed
* ~0.0175% of Room Composite starts failed
* ~0.75% of egresses had missing or delayed status updates

The incident was caused by a bug in the egress RPC client that caused RPCs to fail under some conditions, affecting status updates and egress service availability.

# Timeline

Timestamps are in UTC.

* 14:00: Some Track Egress start requests begin failing. Some successful egresses never reach the COMPLETE status in the cloud dashboard.
* 14:14: First alert for increased egress start latency. Investigation starts.
* 18:00: Issue is identified as RPC failures preventing some egress instances from updating egress state or servicing new egress requests.
* 18:15: New egress instances are brought up to replace the failed ones, mitigating the outage. Customer impact ends.
* Nov 12: The underlying bug in the RPC client is identified and a fix is deployed to the egress cluster.

# Root Cause Analysis

The LiveKit infrastructure relies on controller nodes to dispatch requests to egress nodes and to update the stored egress status. An RPC mechanism transports messages between these two services.

A bug in the egress RPC client could, in rare cases, put it into a bad state that prevented it from sending new RPC messages. An affected egress instance was then unable to update the status of an egress request or to start servicing new requests, which in turn caused the egress cluster to run out of capacity.

# Mitigations

The failed egress instances were drained and replaced with new ones. These new instances were monitored to ensure they did not get into the failed state.
The underlying issue in the RPC implementation was identified and corrected. We also added a watchdog to egress instances to automatically replace them with new ones should such an issue occur again.