PubNub incident

Errors and Latency Across All Services in our US-East and US-West Points of Presence

Severity: Minor. Status: Resolved.

PubNub experienced a minor incident on December 11, 2024, affecting the Publish/Subscribe Service, Website, and other components (listed below), lasting 18 minutes. The incident has been resolved; the full update timeline is below.

Started: Dec 11, 2024, 10:04 PM UTC
Resolved: Dec 11, 2024, 10:22 PM UTC
Duration: 18m
Detected by Pingoru: Dec 11, 2024, 10:04 PM UTC

Affected components

Publish/Subscribe Service, Website, Functions Service, North America Points of Presence, Storage and Playback Service, Administration Portal, Vault, Stream Controller Service, SDK Documentation, Key Value store

Update timeline

  1. investigating Dec 11, 2024, 10:04 PM UTC

    We are currently investigating an issue that is causing requests in our US-East points-of-presence to fail or respond slowly.

  2. monitoring Dec 11, 2024, 10:10 PM UTC

    A fix has been implemented and we are monitoring the results.

  3. resolved Dec 11, 2024, 10:22 PM UTC

    This incident has been resolved. We apologize for any impact this may have had on your service. Don't hesitate to contact us by reaching out to PubNub Support ([email protected]) if you wish to discuss the impact on your service. An RCA will be provided soon.

  4. postmortem Dec 16, 2024, 05:26 AM UTC

    ### **Problem Description, Impact, and Resolution**

    At 21:50 UTC on 2024-12-11 we observed increased latencies and error rates across all services in our US-East point-of-presence and, a few minutes later, in US-West as well. We observed that the PubNub Access Manager (PAM) was at the center of the degradation, and an investigation noted that nodes in that service were highly memory constrained. We increased capacity and the issue was mitigated in both points-of-presence at 22:10 UTC, and declared resolved at 22:22 UTC.

    This issue occurred because a previously unseen pattern of customer behavior overwhelmed a cache in the PAM system, causing memory to become constrained and performance to degrade.

    ### **Mitigation Steps and Recommended Future Preventative Measures**

    To prevent a similar issue from occurring in the future, we changed the cache capacity and updated our monitoring to alert on this and similar patterns of behavior.
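
The postmortem attributes the degradation to a cache whose memory use grew under an unexpected access pattern, and names a capacity change plus new alerting as the fix. As a rough illustration only, and not PubNub's implementation, a capacity-bounded cache that exposes an eviction count for monitoring might look like the sketch below. The class name `BoundedLRUCache`, the least-recently-used eviction policy, and the `evictions` counter are all assumptions introduced here; the report does not describe how the PAM cache works internally.

```python
from collections import OrderedDict


class BoundedLRUCache:
    """Hypothetical fixed-capacity LRU cache (illustrative, not the PAM cache)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()
        self.evictions = 0  # assumed metric that monitoring could alert on

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry
            self.evictions += 1

    def __len__(self):
        return len(self._items)


# A burst of distinct keys cannot grow the cache past its capacity.
cache = BoundedLRUCache(capacity=2)
for token in ["a", "b", "c", "d"]:
    cache.put(token, "grant")
print(len(cache), cache.evictions)  # prints "2 2"
```

With a hard capacity, a burst of previously unseen keys causes evictions rather than unbounded memory growth, and a rising eviction count is one signal that alerting could watch.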