Pusher incident

US3 cluster - major outage

Pusher experienced a major incident on October 20, 2025 affecting the Channels REST API, the Channels WebSocket client API, and Channels presence channels, lasting 2h 5m. The incident has been resolved; the full update timeline is below.

Started
Oct 20, 2025, 06:36 PM UTC
Resolved
Oct 20, 2025, 08:41 PM UTC
Duration
2h 5m
Detected by Pingoru
Oct 20, 2025, 06:36 PM UTC

Affected components

Channels REST API
Channels WebSocket client API
Channels presence channels

Update timeline

  1. investigating Oct 20, 2025, 06:36 PM UTC

We're seeing increased latency and a number of 503 errors on the us3 cluster. The team is investigating.

  2. investigating Oct 20, 2025, 07:01 PM UTC

    We are continuing to investigate this issue.

  3. identified Oct 20, 2025, 07:04 PM UTC

    We are currently experiencing a major outage in the US3 cluster. Our engineering team is actively working to restore full functionality as quickly as possible. In the meantime, we recommend customers temporarily switch to an alternative cluster to minimize service disruption (see the client-side sketch after the postmortem below).

  4. identified Oct 20, 2025, 07:59 PM UTC

    Our team has identified the root cause of the issue and implemented a fix. The affected cluster is now in the process of recovering, and performance is gradually returning to normal. We will continue to closely monitor the situation until full recovery is confirmed.

  5. monitoring Oct 20, 2025, 08:20 PM UTC

    All services are now fully operational. Our team continues to monitor the system closely to ensure ongoing stability and performance.

  6. resolved Oct 20, 2025, 08:41 PM UTC

    This incident has been resolved.

  7. postmortem Nov 03, 2025, 08:04 PM UTC

    ## Root Cause Analysis: Redis Cluster Startup Failures – US3 Cluster

    **Incident Date:** October 20, 2025
    **Duration:** 18:36 UTC – 20:41 UTC
    **Status:** Resolved

    ## Summary

    On October 20, 2025, customers experienced a significant outage affecting the US3 cluster. The disruption began with increased latency and 503 errors before escalating into full service downtime as Redis clusters in the US3 region failed to start successfully during an infrastructure update. Service was fully restored at 20:41 UTC after engineers identified and resolved the underlying startup issue and confirmed stability across all Redis clusters.

    ## Impact

    Between **18:36 UTC and 20:41 UTC**, customers experienced:

    * A major outage in the **US3** cluster, impacting message delivery and connection reliability
    * Elevated error rates and timeouts across dependent APIs
    * A temporary need to switch to alternate clusters to maintain service continuity

    No customer data was lost. However, applications relying solely on the affected cluster experienced full downtime for a large portion of the incident.

    ## Root Cause

    The outage was caused by Redis clusters failing to start correctly following an infrastructure update, due to a missing configuration flag at startup. An unexpected upgrade to the Docker runtime hosting our Redis clusters introduced a breaking change that prevented container startup for certain Redis deployments. When replacement Redis instances in the US3 redis-main cluster were launched, they failed initialization checks and repeatedly restarted, rendering the cluster unavailable. The incompatibility remained undetected until a routine node replacement introduced new Redis instances to the US3 cluster.

    ## Detection and Response

    Monitoring systems first detected increased error rates and latency at **18:36 UTC**, followed by a rise in 503 responses from the affected APIs.

    **Timeline of events:**

    * **18:36 UTC** – Increased latency and 503 errors observed in the US3 cluster
    * **19:01 UTC** – Engineering began investigating the Redis startup failures
    * **19:04 UTC** – Incident declared a major outage; mitigation efforts initiated
    * **19:59 UTC** – Root cause identified and configuration fix applied
    * **20:20 UTC** – Services operational; monitoring for recovery stability
    * **20:41 UTC** – All Redis nodes confirmed healthy; incident resolved

    ## Resolution

    The engineering team:

    * Implemented a temporary fix to the Redis environments to ensure Redis instances could initialize successfully
    * Blocked further automated replacements in other clusters until the fix was validated
    * Verified recovery and stability across all Redis clusters

    After the fix was deployed, Redis instances in the US3 redis-main cluster started correctly and full service was restored.

    ## Preventative Actions

    To prevent recurrence, the team has:

    * Rolled out a permanent fix across all Redis clusters in all regions
    * Planned long-term remediation to modernize Redis image packaging for compatibility with current and future Docker releases

    ## Next Steps and Commitment

    We are conducting a broader review of our infrastructure upgrade processes to better detect runtime incompatibilities before they reach production. We apologize for the disruption this incident caused. Reliability and transparency remain our highest priorities, and we continue to strengthen our processes to deliver consistent, predictable service for all Pusher customers.
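The postmortem above attributes the outage to replacement Redis instances that failed their initialization checks after an unexpected Docker runtime upgrade. As a rough illustration of the kind of pre-replacement gate the preventative actions point toward (catching runtime incompatibilities before they reach production), here is a minimal sketch of a readiness check that refuses to let automation proceed until a freshly started Redis node responds. This is not Pusher's actual tooling: the node-redis (v4) client, the URL, and the retry budget are all assumptions.

```typescript
// Hypothetical readiness gate for a replacement Redis node (illustrative only,
// not Pusher's tooling). Assumes the node-redis v4 client; the URL and retry
// budget are placeholders.
import { createClient } from "redis";

const CANDIDATE_URL = process.env.REDIS_URL ?? "redis://new-node.internal:6379";
const MAX_ATTEMPTS = 10;

async function redisIsReady(url: string): Promise<boolean> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const client = createClient({ url });
    try {
      await client.connect();
      const pong = await client.ping();          // basic liveness check
      const info = await client.info("server");  // confirms the node serves INFO
      await client.quit();
      if (pong === "PONG" && info.includes("redis_version")) return true;
    } catch {
      // A container crash-looping after a runtime upgrade lands here:
      // back off and retry within the budget instead of joining the cluster.
      await client.disconnect().catch(() => {});
      await new Promise((resolve) => setTimeout(resolve, 3000));
    }
  }
  return false;
}

redisIsReady(CANDIDATE_URL).then((ok) => {
  if (!ok) {
    console.error("Replacement node failed readiness checks; halting rollout.");
    process.exit(1); // block the automated replacement rather than degrade the cluster
  }
  console.log("Replacement node healthy; safe to proceed.");
});
```

A gate like this only catches the failure mode once a candidate node exists; pinning the Docker runtime and image versions so upgrades are deliberate rather than unexpected addresses the same risk one step earlier.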
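Update 3 recommends temporarily switching to an alternative cluster. For readers unfamiliar with what that involves client-side, below is a minimal sketch using pusher-js's documented `cluster` option. The keys and cluster names are placeholders, and it assumes a second Pusher app already provisioned on the fallback cluster, since each app is bound to a single cluster.

```typescript
// Hypothetical client-side failover between Pusher clusters (keys and cluster
// names are placeholders). Assumes a second app exists on the fallback cluster.
import Pusher from "pusher-js";

const PRIMARY = { key: "PRIMARY_APP_KEY", cluster: "us3" };   // affected cluster
const FALLBACK = { key: "FALLBACK_APP_KEY", cluster: "us2" }; // placeholder fallback

function connect(app: { key: string; cluster: string }): Pusher {
  const pusher = new Pusher(app.key, { cluster: app.cluster });
  pusher.subscribe("my-channel").bind("my-event", (data: unknown) => {
    console.log(`event via ${app.cluster}:`, data);
  });
  return pusher;
}

let client = connect(PRIMARY);

// pusher-js emits connection state events by name; "unavailable" fires when
// the connection cannot be established, as during this incident.
client.connection.bind("unavailable", () => {
  client.disconnect();
  client = connect(FALLBACK);
});
```

Server-side publishers would need the equivalent change: triggering events through the app on the fallback cluster for as long as the primary cluster is impaired.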