Pusher incident

Elevated API Errors in AP4 Cluster

Severity: Major · Status: Resolved

Pusher experienced a major incident on October 21, 2025 affecting the Channels REST API, the Channels WebSocket client API, and one additional component, lasting 2h 36m. The incident has been resolved; the full update timeline is below.

Started
Oct 21, 2025, 03:06 PM UTC
Resolved
Oct 21, 2025, 05:43 PM UTC
Duration
2h 36m
Detected by Pingoru
Oct 21, 2025, 03:06 PM UTC

Affected components

Channels REST API
Channels WebSocket client API
Channels Stats Integrations

Update timeline

  1. investigating Oct 21, 2025, 03:06 PM UTC

    We're experiencing an elevated level of API errors and latency in AP4 and are currently looking into the issue.

  2. identified Oct 21, 2025, 03:09 PM UTC

The team has identified an issue with one of our backend caching servers and is working to restore connections to the server.

  3. identified Oct 21, 2025, 03:27 PM UTC

    The team continues to make progress restoring the cache services. We anticipate full resolution in the next 10-15 minutes.

  4. monitoring Oct 21, 2025, 03:29 PM UTC

    A fix has been implemented and we are monitoring. The cluster will take a few more minutes to fully stabilize across all nodes.

  5. identified Oct 21, 2025, 05:32 PM UTC

    We are currently investigating an issue affecting the Channels API. A large number of customers may be unable to publish new messages through the API in the AP4 cluster.

  6. monitoring Oct 21, 2025, 05:39 PM UTC

    A fix has been implemented and we are monitoring the results.

  7. resolved Oct 21, 2025, 05:43 PM UTC

    This incident has been resolved.

  8. postmortem Oct 23, 2025, 02:12 PM UTC

## **Root Cause Analysis: Elevated API Errors and Outage in AP4 Cluster**

**Incident Date:** October 20, 2025
**Status:** Resolved

### **Summary**

Between **October 20 and October 21, 2025**, customers using the **AP4 cluster** experienced elevated API errors, latency, and message publishing failures. The issue primarily affected the **Channels API**, preventing customers from publishing new messages and leading to degraded real-time functionality for end-users.

During system recovery, and while implementing mitigations from a previous incident on October 18th, a **misconfigured Redis container** in the AP4 cluster failed to start correctly, preventing the caching operations needed for API requests. This misconfiguration went undetected by proactive monitoring, delaying full recovery until October 21 at 17:43 UTC.

### **Impact**

Throughout the incident period, customers in the **AP4 cluster** experienced:

* **High API error rates** when attempting to publish messages through the Channels API
* **Failed or delayed message delivery** for connected clients
* **Temporary downtime** for end-customer applications relying on real-time messages

Other clusters remained operational, though some minor latency was observed in isolated regions due to dependencies on shared services.

### **Root Cause**

This incident resulted from **a chain of events** involving both external and internal factors:

1. **Major AWS Outage (October 20):** A large-scale **AWS outage in the US-East region** disrupted multiple dependent systems, impacting several Pusher clusters.
2. **Misconfigured Redis Container (October 21):** As systems in the AP4 cluster attempted to scale during recovery, one of the backend **Redis cache containers** failed to start due to a **misconfigured environment variable**. This prevented Redis from initializing properly, resulting in API operations failing or timing out.
3. **Monitoring Gap:** Existing monitoring did not capture the **Redis startup failure** because the specific failure mode occurred after initialization checks had passed. This delayed internal detection until API error rates increased and customer impact was observed.
4. **Delayed Customer Communication:** Initial updates to customers were delayed while the team triaged the issue and verified the failure pattern, prolonging the time before external notification.

### **Detection and Response**

The issue was detected through a combination of **monitoring alerts** showing elevated error rates and **customer reports** of publishing failures.

**Timeline of Events:**

* **October 20** – AWS outage began, affecting multiple Pusher clusters and leading to increased delays and errors.
* **October 20, evening UTC** – Pusher clusters began recovery as AWS services were restored.
* **October 21, 15:06 UTC** – Internal monitoring detected elevated API errors in AP4; engineers began investigation. The incident was unrelated to the prior AWS outage.
* **October 21, 15:09 UTC** – Root cause identified as a failed Redis caching container.
* **October 21, 15:27 UTC** – Restoration of Redis connections underway.
* **October 21, 15:29 UTC** – Fix implemented; cluster began gradual recovery.
* **October 21, 17:39 UTC** – Full stabilization confirmed across AP4 nodes.
* **October 21, 17:43 UTC** – Incident marked resolved after sustained recovery.
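The monitoring gap described above — a Redis instance that passes its initialization checks but then cannot serve traffic — can be caught by a readiness probe that exercises a real read/write round-trip instead of only confirming the process started. The sketch below is illustrative Python, not Pusher's actual tooling; `client` stands in for any redis-py-style client object:

```python
import uuid


def redis_ready(client, key_prefix="readiness-probe:"):
    """Return True only if Redis answers PING *and* completes a real
    SET/GET/DELETE round-trip, catching post-initialization failures
    that a process-level startup check would miss."""
    key = key_prefix + uuid.uuid4().hex  # unique key so probes never collide
    try:
        if not client.ping():
            return False
        client.set(key, "ok", ex=10)     # short TTL so probe keys self-clean
        value = client.get(key)
        client.delete(key)
        return value in ("ok", b"ok")    # redis-py may return bytes
    except Exception:                    # any transport/command error => not ready
        return False
```

Run on a schedule and alert when the probe fails for several consecutive intervals; because it only needs `ping`/`set`/`get`/`delete`, it works against any client exposing that interface.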
### **Resolution**

To restore full functionality, the engineering team:

* Corrected the **Redis container configuration** that prevented startup
* Restarted and validated cache services across all AP4 nodes
* Confirmed API endpoints were fully operational and message publishing had resumed
* Monitored latency and error metrics to confirm sustained stability

### **Preventative Actions**

To reduce recurrence risk and improve detection and response, Pusher is implementing the following:

* **Enhanced Redis Monitoring:** Extending monitoring coverage to detect Redis startup and post-initialization failures.
* **Customer Communication Enhancements:** Improving internal escalation and communication processes to ensure faster external updates.

### **Next Steps and Commitment**

We recognize the importance of reliable API performance for our customers. Our teams are conducting a full review of caching dependencies and configuration management across all clusters to prevent similar incidents.

We sincerely apologize for the disruption caused by this event and appreciate your patience as we worked through a complex multi-day recovery scenario. Pusher remains committed to transparency, reliability, and continuous improvement in service resilience.
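During windows of elevated API errors like this incident, client applications can soften the impact of transient publish failures with retry plus exponential backoff and jitter. A minimal sketch, assuming a generic `publish` callable (standing in for any Channels REST publish call, e.g. an HTTP POST; the function name and parameters are illustrative, not part of Pusher's SDK):

```python
import random
import time


def publish_with_retry(publish, *, attempts=5, base_delay=0.2,
                       max_delay=5.0, sleep=time.sleep):
    """Call publish() and retry transient failures with exponential
    backoff plus full jitter; re-raise the last error once all
    attempts are exhausted."""
    last_err = None
    for attempt in range(attempts):
        try:
            return publish()
        except Exception as err:  # narrow to transient errors (e.g. 5xx) in real code
            last_err = err
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))  # full jitter spreads out retry storms
    raise last_err
```

Full jitter is a deliberate choice here: during a cluster-wide incident, many clients retrying on identical schedules can themselves resemble a thundering herd, while randomized delays spread the load as the service recovers.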