Pusher experienced a minor incident on October 19, 2025 affecting the Channels WebSocket client API and Channels presence channels, lasting 7h 32m. The incident has been resolved; the full update timeline is below.
Affected components
- Channels WebSocket client API
- Channels presence channels
Update timeline
- investigating Oct 19, 2025, 12:44 AM UTC
Increased latency and degraded performance on the main, us2, and us3 clusters.
- identified Oct 19, 2025, 03:11 AM UTC
The issue has been identified and we are applying mitigations. We are seeing improved latency in the main and us3 clusters; us2 still has elevated latency.
- identified Oct 19, 2025, 04:03 AM UTC
We have scaled all three clusters out to handle the increased traffic. We anticipate that latency will improve gradually.
- resolved Oct 19, 2025, 08:17 AM UTC
Latency has remained stable over the past three hours. Our investigation identified multiple contributing factors to the earlier latency increase, including network saturation and capacity limits at our cloud providers affecting the US2 cluster. The US3 cluster issue was resolved quickly through autoscaling, while the US2 cluster experienced scaling delays that required manual intervention. We are implementing corrective measures to prevent recurrence and improve system resilience.
- postmortem Oct 23, 2025, 02:16 PM UTC
## Root Cause Analysis: Increased Latency and Message Delivery Failures – US2 Cluster

**Incident Date:** October 19, 2025
**Duration:** 00:44 UTC – 08:17 UTC
**Status:** Resolved

### Summary

On October 19, 2025, customers using Pusher experienced increased latency and message delivery failures. These issues primarily affected the US2 cluster, with intermittent impact also observed in the MT1 and US3 clusters. The incident resulted in delayed or undelivered messages for many applications. Latency stabilized at 08:17 UTC after mitigation actions were completed.

### Impact

Between 00:44 UTC and 08:17 UTC, multiple customers experienced:

* **Delayed or failed message delivery** across affected clusters
* **Degraded performance** in connection establishment and publishing

The most significant and prolonged impact occurred in the **US2 cluster**, while the **MT1** and **US3** clusters saw elevated latency for a shorter period before stabilizing.

### Root Cause

The primary cause of the incident was IP address saturation within the subnet assigned to the public Pusher clusters. When traffic levels increased, the **US2 cluster** was unable to scale out further because the available IP addresses in its subnet were fully utilized. This limitation prevented the creation of additional instances needed to handle the load.

Secondary factors included temporary **network saturation** and **capacity limits** at our cloud provider, which amplified the latency in the early stages of the incident.

### Detection and Response

The issue was first detected through a combination of **customer reports** and **internal monitoring alerts** showing elevated response times and connection errors. The timeline of actions was as follows:

* **00:44 UTC** – Monitoring alerted the team to increased latency across the MT1, US2, and US3 clusters.
* **03:11 UTC** – Engineers identified subnet capacity as a contributing factor; mitigations began.
* **04:03 UTC** – All clusters were scaled out to distribute traffic; latency began to improve in the MT1 and US3 clusters.
* **08:17 UTC** – Manual intervention allowed US2 to scale successfully, restoring normal latency.

### Resolution

To restore service, the engineering team:

* Scaled out the **MT1** and **US3** clusters to handle increased traffic loads
* Monitored all clusters to confirm sustained stability

Once additional capacity was provisioned and load normalized, latency returned to normal levels and remained stable.

### Preventative Actions

To prevent recurrence, Pusher has initiated the following actions:

* **Rate Limits:** Re-evaluating how rate limits are implemented to better mitigate contention from neighboring customers on the shared clusters.
* **Subnet Expansion:** Re-evaluating and increasing the size of subnets assigned to shared clusters to ensure sufficient IP availability for future scaling events.
* **Load Balancer Enhancements:** Implementing load balancer sharding to better distribute connections.
* **Capacity Planning Improvements:** Enhancing internal monitoring and alerting for subnet and IP utilization thresholds.

### Next Steps and Commitment

We recognize that message latency and delivery reliability are critical to our customers' applications. Our team is continuing a full review of cluster capacity management and provider configuration to improve resilience under high-traffic conditions.

We apologize for the disruption this incident caused and appreciate your patience while we worked to resolve it. Ensuring reliability and transparency remains our highest priority.
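To make the root cause concrete: a cluster cannot scale out once its subnet has no free IP addresses to assign to new instances. The sketch below illustrates the kind of subnet IP-utilization alerting described in the preventative actions. The cluster names, CIDR ranges, reserved-address count, and 80% threshold are illustrative assumptions, not Pusher's actual configuration.

```python
import ipaddress

# Cloud providers typically reserve a few addresses per subnet
# (network, broadcast, router, DNS, etc.); 5 is the common AWS figure.
RESERVED_PER_SUBNET = 5

def usable_ips(cidr: str) -> int:
    """Instance-assignable IPs in a subnet, net of provider reservations."""
    net = ipaddress.ip_network(cidr)
    return max(net.num_addresses - RESERVED_PER_SUBNET, 0)

def utilization_alerts(subnets: dict, threshold: float = 0.8) -> list:
    """Return cluster names whose subnet IP utilization exceeds the threshold.

    `subnets` maps a cluster name to (cidr, addresses_in_use).
    """
    alerts = []
    for cluster, (cidr, in_use) in subnets.items():
        capacity = usable_ips(cidr)
        if capacity and in_use / capacity >= threshold:
            alerts.append(cluster)
    return alerts

# Example: a /24 subnet has 251 usable addresses after reservations.
# A cluster at 240/251 (~96%) has almost no headroom for a scale-out event.
print(utilization_alerts({
    "us2": ("10.0.2.0/24", 240),   # ~96% utilized -> alert
    "mt1": ("10.0.0.0/23", 120),   # ~24% utilized -> ok
}))
```

Alerting on utilization well before exhaustion (here, 80%) gives operators time to expand the subnet before autoscaling silently stalls, which is the failure mode US2 hit.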
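The "load balancer sharding" item above can be sketched with rendezvous (highest-random-weight) hashing, one common way to pin each connection deterministically to a backend shard. The shard names and hashing scheme here are assumptions for illustration only, not Pusher's implementation.

```python
import hashlib

# Hypothetical shard pool; real deployments would discover these dynamically.
SHARDS = ["ws-shard-1", "ws-shard-2", "ws-shard-3"]

def pick_shard(connection_key: str, shards=SHARDS) -> str:
    """Route a connection to the shard with the highest hash score.

    Rendezvous hashing: each (shard, key) pair gets a pseudo-random score,
    and the key is routed to the highest-scoring shard. Removing a shard
    only remaps the keys that were on it, keeping disruption minimal.
    """
    def score(shard: str) -> int:
        digest = hashlib.sha256(f"{shard}:{connection_key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(shards, key=score)

# The same client key always lands on the same shard, so per-connection
# state stays local, while distinct clients spread across the pool.
assert pick_shard("client-abc") == pick_shard("client-abc")
```

Spreading connections this way bounds how much traffic any single load balancer instance handles, which reduces the blast radius of the kind of saturation seen during this incident.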