Livepeer Studio incident

API failing in Asia region

Livepeer Studio experienced a major incident on July 23, 2024 affecting Livepeer Streaming API, lasting 1h 34m. The incident has been resolved; the full update timeline is below.

Started: Jul 23, 2024, 09:53 AM UTC
Resolved: Jul 23, 2024, 11:28 AM UTC
Duration: 1h 34m
Detected by Pingoru: Jul 23, 2024, 09:53 AM UTC

Affected components

Livepeer Streaming API

Update timeline

investigating Jul 23, 2024, 09:53 AM UTC

We are currently investigating the issue.

monitoring Jul 23, 2024, 11:09 AM UTC

A fix has been implemented and API error rates have dropped. We'll continue to monitor.

resolved Jul 23, 2024, 11:28 AM UTC

This incident has been resolved.

postmortem Jul 25, 2024, 01:32 PM UTC

**Incident Report**

**Date**: July 23rd, 2024
**Time**: 9:53 UTC
**Resolved**: 11:28 UTC

**Incident Summary**: At 9:53 UTC on July 23rd, 2024, a stream trigger alert fired, indicating that stream triggers had occurred but were not receiving responses and were timing out. The issue was traced to excessive database load caused by a specific customer's misconfigured implementation. The incident was resolved at 11:28 UTC by rate-limiting the customer and then working with them to correct the issue.

**Incident Details**:

1. **Initial Alert:**
   * **Time**: 9:53 UTC
   * **Event**: Stream trigger alert activated, indicating no response from triggered streams and leading to a timeout.
2. **Investigation Findings:** The CPU of the regional database replica in Singapore was pegged at 100%. As a result, nodes failed to connect to the database, which caused issues with processing new playback requests.
3. **Mitigation Steps:** Our initial step of increasing database capacity failed to absorb the increased load. We then traced the problematic queries to a specific customer, applied rate limiting to that customer, and restarted the database to cancel in-flight queries, which immediately resolved the issue.

**Actions Taken:**

1. Restarted database replications in Singapore.
2. Suspended streams causing rapid spikes.
3. Analyzed stream and viewer behavior to identify patterns and prevent future occurrences.

**Next Steps**:

1. Rate limiting by default: Make sure all API endpoints have per-customer rate limits.
2. Increase internal caching, including of errors: Avoid an exponential effect when an issue occurs.
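
The two next steps above can be illustrated with a minimal Python sketch. It is not Livepeer's implementation: the class names (`TokenBucket`, `PerCustomerLimiter`, `ErrorAwareCache`), the token-bucket algorithm, and the TTL values are all assumptions chosen for illustration. The limiter keeps one bucket per customer ID so a single misbehaving customer is throttled without affecting others, and the cache stores failed lookups for a short time so that repeated requests during an outage do not each hit the database again.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, never exceeding capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerCustomerLimiter:
    """Hypothetical limiter: one independent bucket per customer ID."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, customer_id: str) -> bool:
        now = self.clock()
        bucket = self.buckets.get(customer_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[customer_id] = bucket
        return bucket.allow(now)


class ErrorAwareCache:
    """Hypothetical TTL cache that remembers failures as well as successes."""

    def __init__(self, ok_ttl: float, error_ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ok_ttl = ok_ttl
        self.error_ttl = error_ttl
        self.clock = clock
        # key -> (expiry time, succeeded?, value or exception)
        self.entries: Dict[str, Tuple[float, bool, Any]] = {}

    def get(self, key: str, loader: Callable[[str], Any]) -> Any:
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and entry[0] > now:
            _, ok, value = entry
        else:
            # Cache miss or expired entry: call the backend exactly once.
            try:
                ok, value = True, loader(key)
            except Exception as exc:
                ok, value = False, exc
            ttl = self.ok_ttl if ok else self.error_ttl
            self.entries[key] = (now + ttl, ok, value)
        if ok:
            return value
        # Re-raise the cached failure without touching the backend again.
        raise value
```

A request handler would call `limiter.allow(customer_id)` before doing any database work and reject the request (for example with HTTP 429) when it returns `False`. Keeping `error_ttl` much shorter than `ok_ttl` is a deliberate choice in this sketch: it caps the retry rate against a struggling database while still letting callers see recovery quickly.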