LiveKit incident

SIP outbound call issues

LiveKit experienced a major incident on January 9, 2026 affecting US West - SIP, US East - SIP, and US Central - SIP, lasting 24 minutes. The incident has been resolved; the full update timeline is below.

Started: Jan 09, 2026, 10:00 AM UTC
Resolved: Jan 09, 2026, 10:24 AM UTC
Duration: 24m
Detected by Pingoru: Jan 09, 2026, 10:00 AM UTC

Affected components

US West - SIP
US East - SIP
US Central - SIP

Update timeline

investigating Jan 09, 2026, 10:00 AM UTC

We are investigating customer reports regarding SIP outbound calls. Inbound calls are not affected. We'll provide more details soon. (SIP Global)

investigating Jan 09, 2026, 10:23 AM UTC

We have mitigated the issue by rolling back a media service release that went out around the same time. The time of impact was about 07:41 - 10:01 UTC. (SIP Global)

investigating Jan 09, 2026, 10:24 AM UTC

We are continuing to investigate this issue. (SIP Global)

resolved Jan 09, 2026, 10:24 AM UTC

This incident has been resolved. Understanding the root cause is our highest priority, and we will share a full postmortem shortly.

postmortem Jan 10, 2026, 07:14 PM UTC

**Root Cause**

A code change unintentionally overrode the user-supplied `ringing_timeout` for synchronous `CreateSIPParticipant` API calls. As a result, calls with `wait_until_answered=true` timed out significantly earlier than intended and failed prematurely. Widespread impact lasted for over 2 hours because the change evaded our automated tests and monitoring systems. Below, we share some technical details behind the failure to provide transparency.

**Technical Details**

The `CreateSIPParticipant` API, used for making outbound calls, supports two modes of operation:

* Async mode: This is the default behavior, where the API dials the call and returns `200` immediately.
* Sync mode: When `wait_until_answered` is set to `true`, the API holds the connection open until the user answers or declines the call.

In sync mode, the API returns a `200` status only if the user answers the call, or a `408` if they do not respond within the user-defined `ringing_timeout` (which defaults to 30 seconds).

To improve observability in our RPC stack, we recently introduced a change that enables tracing for internal RPC calls. This automatically exports traces to our observability platform whenever an internal RPC is invoked. However, enabling this tracing also unintentionally introduced a default timeout of 3 seconds on internal RPCs.

Consequently, two competing timeouts came into play during `CreateSIPParticipant` calls:

* `ringing_timeout`: User-defined, defaults to 30 seconds.
* Internal RPC timeout: Fixed at 3 seconds.

Because the shorter timeout always fired first, internal APIs returned a `408` error before the `ringing_timeout` was reached. As a result, SIP outbound calls with `wait_until_answered=true` would ring for only 3 seconds before aborting. Calls answered within 3 seconds, and calls in async mode, proceeded without issues.

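The interaction between the two timeouts can be illustrated with a minimal sketch. This is not LiveKit's implementation: `sync_call_status` and its parameters are hypothetical names, and the model assumes the effective deadline is simply the shorter of the two timeouts. It also ignores declined calls and treats an answer exactly at the deadline as a timeout.

```python
from typing import Optional

DEFAULT_RINGING_TIMEOUT = 30.0  # seconds, the documented ringing_timeout default
INTERNAL_RPC_TIMEOUT = 3.0      # seconds, the unintended internal RPC default


def sync_call_status(
    answer_after: Optional[float],
    ringing_timeout: float = DEFAULT_RINGING_TIMEOUT,
    rpc_timeout: Optional[float] = None,
) -> int:
    """Hypothetical model of a sync-mode call (wait_until_answered=true).

    answer_after is the number of seconds until the callee answers, or None
    if they never answer. rpc_timeout models the internal RPC bound; None
    means no such bound exists (the intended behavior).
    """
    # Assumption: whichever timeout is shorter ends the wait first.
    deadline = ringing_timeout
    if rpc_timeout is not None:
        deadline = min(deadline, rpc_timeout)

    answered_in_time = answer_after is not None and answer_after < deadline
    return 200 if answered_in_time else 408


if __name__ == "__main__":
    for answer_after in (2.0, 10.0, None):
        intended = sync_call_status(answer_after)
        incident = sync_call_status(answer_after, rpc_timeout=INTERNAL_RPC_TIMEOUT)
        print(f"answer_after={answer_after}: intended={intended}, incident={incident}")
```

In this model, a callee who answers after 2 seconds gets `200` either way, while one who answers after 10 seconds gets `200` under the intended behavior and `408` once the 3-second bound applies.
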
**Detection and Response Challenges**

Service reliability is our top priority; we maintain rigorous testing and alerting systems, including:

* Continuous end-to-end tests running against both staging and production environments.
* Phased deployment across our global infrastructure, starting with low-traffic regions.
* Alarms triggered by high error rates (5xx) on customer-facing API calls.
* Alarms for elevated internal RPC error rates.
* Manual review of key health indicators during deployments.

Despite these measures, the issue went undetected during deployment for the following reasons:

* The 3-second timeout caused `CreateSIPParticipant` to return a `408` before `ringing_timeout` was reached.
* Initial rollout regions had few users relying on sync mode, so calls there largely completed without disruption.
* Our end-to-end tests simulate actual calls but use a bot on the receiving end, which answers within 3 seconds.
* SIP health indicators showed calls being made and completing overall. Average call duration dropped during the incident, but it was not monitored as a key health metric.

**Timeline**

* 2026-01-08 04:47 UTC – Change first deployed to a limited set of low-traffic regions.
* 2026-01-08 15:40 UTC – Second rollout phase to regions including Asia.
* 2026-01-08 20:45 UTC – Third rollout phase to additional regions, including EU.
* 2026-01-09 07:38 UTC – Change deployed to the majority of regions, resulting in widespread impact.
* 2026-01-09 09:58 UTC – Change fully rolled back, resolving the issue.

**Scope of Impact**

During the incident window, outbound calls meeting both of the following conditions failed:

* `wait_until_answered=true`
* User did not answer within 3 seconds

**Mitigations and Follow-ups**

To prevent similar issues in the future, we are implementing the following:

* A more robust design for managing internal and system-level timeouts, scheduled for rollout within the next week.
* Updates to end-to-end testing to include scenarios with longer delays before call pickup.
* Addition of call duration as a key health indicator in our monitoring dashboards.

We appreciate your understanding and are committed to continuously improving our platform's reliability. If you have any questions or feedback, please reach out to our support team.
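
As an illustration of the call duration follow-up listed above, a health check could compare the average call duration in a recent window against a baseline. This is a rough sketch only: the function name `call_duration_alarm`, the `min_ratio` threshold of 0.5, and the sample values are all hypothetical and do not describe LiveKit's monitoring.

```python
from statistics import mean
from typing import Sequence


def call_duration_alarm(
    baseline_durations: Sequence[float],
    current_durations: Sequence[float],
    min_ratio: float = 0.5,  # hypothetical threshold, not a LiveKit setting
) -> bool:
    """Return True if the current average call duration (in seconds) has
    dropped below min_ratio times the baseline average."""
    # Without data on either side there is nothing to compare.
    if not baseline_durations or not current_durations:
        return False
    return mean(current_durations) < min_ratio * mean(baseline_durations)


if __name__ == "__main__":
    baseline = [60.0, 90.0, 120.0]   # illustrative pre-deployment durations
    healthy = [80.0, 100.0]          # similar to baseline
    degraded = [3.0, 3.0, 45.0]      # many calls cut off after ~3 seconds
    print(call_duration_alarm(baseline, healthy))
    print(call_duration_alarm(baseline, degraded))
```

A check of this kind reacts to calls ending early even when status codes and call counts look normal, which matches the gap described in the detection section.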