PubNub incident

Elevated latency and errors for history service in us-east-1 region

Minor Resolved

PubNub experienced a minor incident on October 5, 2024 affecting North America Points of Presence and Storage and Playback Service, lasting 2h 50m. The incident has been resolved; the full update timeline is below.

Started
Oct 05, 2024, 11:16 AM UTC
Resolved
Oct 05, 2024, 02:07 PM UTC
Duration
2h 50m
Detected by Pingoru
Oct 05, 2024, 11:16 AM UTC

Affected components

North America Points of Presence
Storage and Playback Service

Update timeline

  1. investigating Oct 05, 2024, 11:16 AM UTC

    At about 8:00 AM UTC, the History service started to experience elevated latencies and errors in the North America PoP. PubNub Technical Staff is currently investigating, and more updates will follow once available. If you are experiencing issues and believe them to be related to this incident, please report them to PubNub Support at [email protected].

  2. identified Oct 05, 2024, 11:55 AM UTC

    The issue has been identified, and our engineers are engaged and continue to work on it. Latency and error rates are improving.

  3. identified Oct 05, 2024, 12:34 PM UTC

    PubNub Technical Staff is still working on fixing the issue.

  4. identified Oct 05, 2024, 01:09 PM UTC

    We are continuing to work on a fix for this issue.

  5. monitoring Oct 05, 2024, 01:33 PM UTC

    Remediation actions have been taken. Our engineers are currently monitoring the incident to ensure that stability has been restored.

  6. resolved Oct 05, 2024, 02:07 PM UTC

    Beginning at around 8:00 UTC we observed increased latency and errors for our History service in one of our North America regions. The issue has been resolved as of 14:05 UTC. We will continue to monitor the incident to ensure service stability has been fully restored.

  7. postmortem Oct 16, 2024, 09:49 PM UTC

    ### **Problem Description, Impact, and Resolution**

    At 7:35 UTC on October 5, 2024, we received a report of intermittent failures (5xx errors) for History API requests. The issue was triggered by an unexpectedly high volume of data requests processed through our shared infrastructure, overwhelming the shared history reader containers responsible for fetching this data from our storage nodes. As data was retrieved and processed by the history reader containers, we observed memory exhaustion (OOM-kills), even though the memory capacity had been significantly increased. This impacted the performance of our system, causing History API requests to fail when the memory overload occurred.

    We took action by isolating the requests responsible for the high data volume and deploying dedicated infrastructure for them. This ensured that the issue was resolved at 00:43 UTC on October 6, and no further impact was observed across the broader customer base.

    ### **Mitigation Steps and Recommended Future Preventative Measures**

    To prevent this issue from recurring, we deployed dedicated infrastructure for high-volume data requests, and we implemented dynamic data bucket creation to distribute large data volumes more efficiently, reducing strain on our nodes. These steps ensure that our system can handle sudden spikes in resource usage while maintaining stability for all customers.
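The postmortem describes isolating high-volume requests from the shared history readers. As a rough illustration only, that kind of isolation could be sketched as a router that tracks recent read volume per tenant and sends heavy tenants to a separate pool. This is not PubNub's implementation: the class name `VolumeRouter`, the pool labels `"shared"` and `"dedicated"`, the per-tenant keying, and the threshold and window values are all assumptions made for this example.

```python
import time
from collections import defaultdict, deque


class VolumeRouter:
    """Hypothetical sketch: pick a reader pool from recent per-tenant read volume."""

    def __init__(self, threshold_bytes, window_seconds=60.0, clock=time.monotonic):
        self.threshold_bytes = threshold_bytes  # assumed cutoff for "high volume"
        self.window_seconds = window_seconds    # assumed sliding-window length
        self.clock = clock                      # injectable so the logic can be tested
        self._events = defaultdict(deque)       # tenant -> deque of (timestamp, bytes)
        self._totals = defaultdict(int)         # tenant -> bytes inside the window

    def record(self, tenant, nbytes):
        """Account for nbytes of history data read on behalf of tenant."""
        now = self.clock()
        self._events[tenant].append((now, nbytes))
        self._totals[tenant] += nbytes
        self._expire(tenant, now)

    def _expire(self, tenant, now):
        """Drop reads that have aged out of the sliding window."""
        events = self._events[tenant]
        while events and now - events[0][0] > self.window_seconds:
            _, old_bytes = events.popleft()
            self._totals[tenant] -= old_bytes

    def pool_for(self, tenant):
        """Return 'dedicated' for tenants above the threshold, else 'shared'."""
        self._expire(tenant, self.clock())
        if self._totals[tenant] > self.threshold_bytes:
            return "dedicated"
        return "shared"
```

Under these assumptions, a caller would invoke `record(tenant, nbytes)` after each read and consult `pool_for(tenant)` before dispatching the next one, so that a single heavy tenant cannot exhaust the memory of readers serving everyone else.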