PubNub experienced a notice incident on October 24, 2024 affecting Presence Service, lasting 2h 57m. The incident has been resolved; the full update timeline is below.
Affected components
- Presence Service
Update timeline
- investigating Oct 24, 2024, 12:01 PM UTC
At about 11:00 AM UTC, Presence service started to experience elevated latencies and server errors in all PoPs. PubNub Technical Staff is currently investigating and more updates will follow once available. If you are experiencing issues and believe them to be related to this incident, please report them to PubNub Support at [email protected].
- investigating Oct 24, 2024, 12:11 PM UTC
We are continuing to investigate this issue.
- investigating Oct 24, 2024, 12:37 PM UTC
We are continuing to investigate this issue.
- investigating Oct 24, 2024, 01:22 PM UTC
We are continuing to investigate this issue.
- identified Oct 24, 2024, 01:46 PM UTC
We have successfully identified the issue, and our dedicated engineers are actively working to resolve it. We are seeing positive trends, with both latency and error rates improving significantly.
- monitoring Oct 24, 2024, 02:11 PM UTC
We have taken effective remediation actions, and our engineers are diligently monitoring the situation to guarantee that stability is fully restored.
- resolved Oct 24, 2024, 02:58 PM UTC
Beginning at around 11:00 UTC, we observed elevated latency and server errors for our Presence service across all of our server endpoints. The issue has been resolved as of 14:11 UTC. We will continue to monitor the incident to ensure service stability has been fully restored. Your trust is our top priority, and we are committed to ensuring smooth operations.
- postmortem Oct 25, 2024, 07:21 PM UTC
### **Problem Description, Impact, and Resolution**

On October 24, 2024 at 11:15 AM UTC, we observed elevated latency and errors in the Presence service across our global points of presence. Affected customers may have experienced a slowdown in Presence request responses and/or failures with 5XX server errors returned.

After investigating, we identified the cause of the issue, blocked the source of traffic causing it, and the issue was resolved on October 24, 2024 at 2:00 PM UTC.

This issue occurred because our services were not auto scaled appropriately in response to a spike in unexpected traffic from non-standard usage of the Presence service.

### **Mitigation Steps and Recommended Future Preventative Measures**

To prevent a similar issue from occurring in the future, we have addressed the source of the unexpected traffic spike directly, ensuring changes were made to align usage with our prescribed methods for Presence. Additionally, we are working in the coming days to deploy sharding in Presence infrastructure to enhance scalability and better manage traffic surges like this, should they recur.