PubNub experienced a minor incident on July 31, 2025 affecting the North America Points of Presence, the Asia Pacific Points of Presence, and one additional component, lasting 59 minutes. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Jul 31, 2025, 10:14 PM UTC
We have detected elevated error levels with the Presence service in multiple regions. Our engineers are actively working to mitigate the issue and return service to normal levels. We will provide updates here. If you believe you have been impacted by the issue, please report it to [email protected].
- identified Jul 31, 2025, 10:36 PM UTC
The issue has been identified and a fix is being implemented. We will provide updates here as progress is made.
- monitoring Jul 31, 2025, 10:44 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Jul 31, 2025, 11:14 PM UTC
With no further issues observed, the incident has been resolved. We will follow up soon with a root cause analysis. If you believe you experienced an impact related to this incident, please report it to PubNub Support at [email protected].
- postmortem Aug 04, 2025, 03:26 PM UTC
### **Problem Description, Impact, and Resolution**

On July 31, 2025 at 22:15 UTC, we observed elevated 5xx errors across the Presence service in multiple regions. Customers may have experienced intermittent failures when attempting to receive or update presence messages. We identified a subset of channels exhibiting highly concentrated activity patterns and applied a targeted configuration change to rebalance traffic across the cluster. The issue was resolved by 23:15 UTC on July 31, 2025.

This issue occurred because our infrastructure lacked proactive safeguards to evenly distribute presence traffic across nodes when a small number of channels receive a disproportionately high number of presence updates. This resulted in resource saturation on some nodes without triggering early mitigation.

### **Mitigation Steps and Recommended Future Preventative Measures**

To prevent a similar issue from occurring in the future, we have applied a sharding configuration to the affected channel patterns, which redistributes load more evenly across infrastructure components and reduces the risk of overload caused by concentrated traffic. In the coming days we will be:

* Reviewing the long-term suitability of the sharding configuration applied during this incident.
* Investigating automation options for dynamically applying sharding logic based on real-time usage patterns.
* Enhancing internal tooling and monitoring to better detect and respond to load-imbalance scenarios before they cause service degradation.
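For illustration only, the sketch below shows the general idea behind hash-based channel sharding: presence updates for a single "hot" channel are split across sub-shards keyed by user, so the traffic no longer concentrates on one node. The function names, hash choice, shard counts, and sub-channel naming are assumptions made for this example and do not reflect PubNub's internal implementation.

```python
import hashlib
from collections import Counter

def shard_for_key(key: str, num_shards: int) -> int:
    """Map a key (e.g. a channel name) to a shard via a stable hash.

    Hypothetical sketch: the hash function and shard count are
    assumptions for this example, not PubNub's configuration.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def sub_shard_hot_channel(channel: str, user_id: str, fanout: int) -> str:
    """Split one hot channel into `fanout` sub-shards keyed by user,
    so its presence updates no longer land on a single node."""
    bucket = shard_for_key(user_id, fanout)
    return f"{channel}.{bucket}"

if __name__ == "__main__":
    # Without sub-sharding, every update for "chat-global" hashes to the
    # same shard; with sub-sharding, the same traffic spreads across many.
    counts = Counter(
        shard_for_key(sub_shard_hot_channel("chat-global", f"user-{i}", 8), 16)
        for i in range(10_000)
    )
    print(dict(counts))
```

In a scheme like this, the fanout would presumably be tuned per channel pattern based on observed load, which is the kind of dynamic adjustment the automation work above would address.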