PubNub incident

Server Errors and Elevated Latency

PubNub experienced a minor incident on April 25, 2025, lasting 1h 41m and affecting the Publish/Subscribe Service, the North America Points of Presence, and several other components. The incident has been resolved; the full update timeline is below.

Started: Apr 25, 2025, 12:51 AM UTC
Resolved: Apr 25, 2025, 02:32 AM UTC
Duration: 1h 41m
Detected by Pingoru: Apr 25, 2025, 12:51 AM UTC

Affected components

Publish/Subscribe Service
North America Points of Presence
Asia Pacific Points of Presence
Presence Service
Access Manager Service
Mobile Push Gateway

Update timeline

investigating Apr 25, 2025, 12:51 AM UTC

At about 12:17 AM UTC, we started to experience elevated latencies and server errors in all PoPs. PubNub Technical Staff is currently investigating and more updates will follow once available. If you are experiencing issues and believe them to be related to this incident, please report them to PubNub Support at [email protected].
investigating Apr 25, 2025, 01:22 AM UTC

The PubNub Technical Staff is investigating. More updates will follow once available.
investigating Apr 25, 2025, 01:57 AM UTC

The PubNub Technical Staff continues to investigate. The errors and elevated latency are affecting some Presence customers. The real-time network services are operational.
identified Apr 25, 2025, 02:05 AM UTC

The issue has been identified and a fix is being implemented.
monitoring Apr 25, 2025, 02:11 AM UTC

A fix has been implemented and we are monitoring the results.
resolved Apr 25, 2025, 02:32 AM UTC

With no further issues observed, the incident has been resolved. We will follow up soon with a root cause analysis. If you believe you experienced impact related to this incident, please report it to PubNub Support at [email protected].
postmortem Apr 25, 2025, 07:36 PM UTC

### Problem Description, Impact, and Resolution

At approximately 00:17 UTC on April 25, 2025, we observed elevated server errors and increased latency impacting multiple API endpoints, most notably the Presence service. Our engineering team immediately began investigating the issue.

We identified that the root cause involved resource contention within a specific component of our backend infrastructure responsible for managing presence state. We implemented targeted configuration changes to better distribute this traffic and alleviate the resource contention. The issue was fully resolved by 02:32 UTC on April 25, 2025.

### Mitigation Steps and Recommended Future Preventative Measures

To prevent a similar issue from occurring in the future, we have already implemented specific configuration changes to ensure the responsible backend component can more effectively handle the type of traffic pattern encountered. Furthermore, we are actively working to enhance our monitoring and alerting systems.