PubNub incident

Potential for some missed messages for subscribers in IAD

Severity: Major · Status: Resolved

PubNub experienced a major incident on October 17, 2025 affecting the Publish/Subscribe Service and North America Points of Presence, lasting 1h 23m. The incident has been resolved; the full update timeline appears below.

Started
Oct 17, 2025, 06:42 AM UTC
Resolved
Oct 17, 2025, 08:06 AM UTC
Duration
1h 23m
Detected by Pingoru
Oct 17, 2025, 06:42 AM UTC

Affected components

Publish/Subscribe Service
North America Points of Presence

Update timeline

  1. investigating Oct 17, 2025, 06:42 AM UTC

    We are currently investigating an incident that could lead to some missed messages for subscribers in the IAD region. All messages are being received and persisted, and can be retrieved from the Storage service (see the retrieval sketch after the timeline). This incident started around 22:07 UTC (03:07 PDT) on October 16, 2025. We believe the impact is moderate. Please report any impact related to this incident to [email protected], including any details you can provide.

  2. identified Oct 17, 2025, 06:48 AM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Oct 17, 2025, 07:08 AM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Oct 17, 2025, 08:06 AM UTC

    There have been no further issues for the past 45 minutes. We are resolving this issue, and we will follow up with a post-mortem soon.

  5. postmortem Oct 22, 2025, 10:20 PM UTC

    ### Problem Description, Impact, and Resolution

    On October 17, 2025 at 04:51 UTC, some customers may have experienced elevated latency and error rates with the Pub/Sub service in the IAD region (US-East). Our engineering teams began an immediate investigation and identified a spike in errors related to a recent update to the Pub/Sub service. We began formal incident response and initiated a rollback of the service deployment shortly thereafter. The issue was fully resolved by 06:50 UTC, and the rollback across all regions was completed by 08:00 UTC.

    The issue occurred because a misconfiguration in the release caused incorrect behavior in the channel cleanup logic. Additionally, our alerting configuration did not include coverage for the synthetic test failures that would have surfaced this issue sooner, delaying detection.

    ### Mitigation Steps and Recommended Future Preventative Measures

    To prevent a similar issue from occurring in the future, our engineering teams have written a simpler and more reliable replacement for the faulty logic. That code is currently undergoing rigorous testing before being reintroduced in a future release. We are also addressing the lack of proper alerting that contributed to a delayed response. Synthetic tests have been reviewed, and appropriate alerting will be implemented to ensure similar regressions are detected earlier. In parallel, we are updating our development and testing processes to catch such issues before code reaches production. Lastly, we are conducting refresher training on our incident response process to ensure faster execution and coordination in the future.
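
Retrieving persisted messages (illustrative sketch)

As noted in the first update, messages published during the incident window were persisted and can be retrieved from the Storage service. The snippet below is a minimal sketch of pulling recent history with the PubNub Python SDK; the keys, channel name, and client identifier are placeholders, and configuration field names can vary between SDK versions.

```python
from pubnub.pnconfiguration import PNConfiguration
from pubnub.pubnub import PubNub

# Placeholder credentials and client id; newer SDK versions use user_id
# instead of uuid for the client identifier.
pnconfig = PNConfiguration()
pnconfig.subscribe_key = "sub-c-your-subscribe-key"
pnconfig.publish_key = "pub-c-your-publish-key"
pnconfig.uuid = "recovery-client"

pubnub = PubNub(pnconfig)

# Pull up to 100 recent messages for a channel ("my_channel" is illustrative);
# payloads published during the incident remained persisted in Storage.
envelope = pubnub.history().channel("my_channel").count(100).sync()

for item in envelope.result.messages:
    print(item.timetoken, item.entry)  # publish timetoken and original payload
```

For larger gaps, the same approach can be repeated with timetoken bounds covering the incident window to page through the persisted history.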