100ms experienced a notice incident on August 28, 2024. The incident has been resolved; the full update timeline is below.
Update timeline
- resolved Aug 28, 2024, 07:24 AM UTC
HLS live streaming was down from 11:41 AM IST to 11:52 AM IST in the India region.
- postmortem Aug 28, 2024, 10:06 AM UTC
**Incident Summary** HLS live streaming was fully degraded from 11:40 AM IST to 11:52 AM IST on 28th August 2024. Users saw a black screen while streaming live sessions.

**Root Cause** We use an event-driven scaling approach to scale web servers when there is a surge in requests. For web servers, we rely on Prometheus metrics to scale up and down. During this incident, Prometheus crashed due to high memory utilization. When the metrics data source fails, scaling falls back to a fixed number of web server instances. Unfortunately, that fallback number of instances was too low to handle the volume of requests, so the servers were overwhelmed and ultimately crashed.
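
The failure mode described in the root cause can be illustrated with a minimal sketch of a metric-driven scaling decision that has a static fallback. This is not 100ms's actual autoscaler: the function name `desired_replicas`, its parameters, and all numbers are hypothetical and chosen only to show how a fallback count that is lower than real demand leaves the fleet under-provisioned when the metrics source is unavailable.

```python
import math
from typing import Optional


def desired_replicas(
    metric_value: Optional[float],
    target_per_replica: float,
    fallback_replicas: int,
    min_replicas: int = 1,
    max_replicas: int = 100,
) -> int:
    """Return the replica count for one scaling decision.

    metric_value is None when the metrics source (for example Prometheus)
    cannot be queried. All names and defaults here are hypothetical.
    """
    if metric_value is None:
        # Metrics source is down: use the static fallback count,
        # clamped to the allowed range.
        return max(min_replicas, min(fallback_replicas, max_replicas))

    # Normal path: enough replicas so each handles at most target_per_replica.
    needed = math.ceil(metric_value / target_per_replica)
    return max(min_replicas, min(needed, max_replicas))


# Hypothetical load: 5000 requests/s, each replica sized for 100 requests/s.
print(desired_replicas(5000.0, 100.0, fallback_replicas=10))  # 50
# Same load, but the metrics source is unavailable: only the fallback count is used.
print(desired_replicas(None, 100.0, fallback_replicas=10))    # 10
```

In this sketch the fallback path ignores current demand entirely, so the fallback count has to be sized for peak traffic rather than average traffic if it is meant to carry the service through a metrics outage.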