Nebula incident

Degradation in platform zone

Nebula experienced a minor incident on January 22, 2024 affecting Core Network, lasting 6d 21h. The incident has been resolved; the full update timeline is below.

Started: Jan 22, 2024, 03:08 PM UTC
Resolved: Jan 29, 2024, 12:37 PM UTC
Duration: 6d 21h
Detected by Pingoru: Jan 22, 2024, 03:08 PM UTC

Affected components

Core Network

Update timeline

investigating Jan 22, 2024, 03:08 PM UTC

We are investigating a degradation in performance in one zone, affecting broadly one third of traffic. The incident commenced at 14:50. We will provide a further update at 15:20 if no other information is available prior.
identified Jan 22, 2024, 03:20 PM UTC

We have identified what we suspect to be the cause and our team are working hard to restore impacted traffic with urgency. We continue to investigate. A further update will be provided at 15:30.
identified Jan 22, 2024, 03:38 PM UTC

All impacted traffic was restored at circa 15:23 and all traffic is traversing as normal. We continue to investigate the incident and have not yet conclusively identified the root cause. Our full NOC team remain active on the incident. Further updates will be published as more information becomes available.
identified Jan 23, 2024, 10:00 AM UTC

We continue to investigate this incident in order to conclusively define the root cause. Our full NOC team remain active on the incident. Further updates will be published as more information becomes available. All traffic continues to traverse normally and we don't anticipate further degradation in performance whilst our work continues.
identified Jan 26, 2024, 02:07 PM UTC

This incident remains ongoing. A small subset of our platform saw degradation in performance between circa 13:15 and 13:45 today. It was isolated and resolved promptly. We do not yet have conclusive evidence of the root cause for this incident and our teams continue to investigate - that investigation is now very focused on a number of working theories. We cannot speculate on a root cause at this stage. We continue to work hard on a full and complete resolution.
monitoring Jan 26, 2024, 03:51 PM UTC

We have identified the root cause and implemented a permanent solution. We will move the incident to resolved shortly and publish a Post Mortem.
resolved Jan 29, 2024, 12:37 PM UTC

This incident has been resolved and a Post Mortem will be published before end of business Tuesday 30th January 2024.
postmortem Jan 29, 2024, 12:39 PM UTC

We are now linking the incidents of 17th & 18th Jan to this incident.

As explained in our Post Mortem for the incident dated 18th Jan, our core platform is split into three zones. Customers are distributed across those three zones automatically and are not fixed to any one zone. Any one zone can handle the traffic of all three zones with substantial headroom in addition. Within each zone, we operate an entirely isolated, load-balanced core platform which is broadly split into three functions.

* The Proxy Cluster - Handling incoming and outgoing SIP (Session Initiation Protocol) conversations, which are the initial handshake of a phone call and of a device or app connection.
* The Switchboard Cluster - Providing the ring tone and all PBX functionality to our platform (IVRs, Call Queues etc.)
* The Media Cluster - Handling the audio of a call once established

During each of these incidents we have seen degraded performance to one of those zones and its functions, impacting broadly 33% of our traffic. During each incident we have seen the volume of commands that our memory cache database cluster (“Cache”) processes grow substantially and ultimately overwhelm the cluster.

We monitor many metrics across our platform and alert proactively on key metrics. Commands processed per second isn’t a metric we would typically monitor proactively. Too much proactive data can both impede platform performance and reduce visibility of the core metrics that are more useful hour to hour. Typically, we process between 6,000 and 10,000 commands per second in our Cache for any one zone. During the course of this incident, we found our Cache had been processing nearly 10x more commands per second. Once identified, we noted the same pattern had occurred on the two prior NOC incidents but couldn't identify what was driving the upward trend. On each occasion, the trend was only prevalent in one zone of three.
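To make the commands-per-second comparison above concrete, here is a minimal sketch of a per-zone check against the stated baseline. The 6,000 to 10,000 range comes from this report; the zone names, the sample readings, and the alert multiplier are assumptions for illustration only and do not describe our actual monitoring.

```python
# Upper end of the typical range stated in the report (commands per second, per zone).
BASELINE_MAX_CPS = 10_000
# Hypothetical alert multiplier; the report does not state any threshold.
ALERT_FACTOR = 3.0


def zones_over_threshold(cps_by_zone, baseline_max=BASELINE_MAX_CPS, factor=ALERT_FACTOR):
    """Return the sorted zone names whose commands/sec exceed baseline_max * factor."""
    limit = baseline_max * factor
    return sorted(zone for zone, cps in cps_by_zone.items() if cps > limit)


# Illustrative readings: one zone near 10x the typical rate, two zones normal.
sample = {"zone-a": 8_200, "zone-b": 95_000, "zone-c": 7_400}
print(zones_over_threshold(sample))  # ['zone-b']
```

A check of this kind only flags the symptom in a single zone; it does not identify what is generating the commands, which is why tracing was still needed.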
Within minutes of the initial resolution of this incident, we started to trace these commands in near real time to allow us to identify the ultimate source driving the trend. Whilst we did so, we pursued a number of directed theories we thought could be driving the upward command trend, including recent changes to WebRTC libraries in Google Chrome’s browser and much else. We didn’t see any change in the normal command pattern until Friday 26th January 2024, shortly before 13:00, when commands per second on the Cache cluster of one zone grew rapidly, overwhelming the Cache at around 13:15. We restored service to this zone as quickly as possible (~13:45) and set about reviewing the traces.

This immediately highlighted an issue with a high volume of handsets de-registering from our platform ungracefully. Handsets routinely register, de-register and re-register by design. That can be every hour, once a day or even once a week; the handset chooses how often it wishes to “handshake” with us. When a handset de-registers through, say, a power failure, it does so ungracefully and we must “clean up” that registration.

We noted multiple key-value pairs in the Cache cluster that held inordinately high numbers of values assigned to the key. That’s best explained as Bob’s Phone being registered thousands of times instead of once. Key-value database pairs are more typically one-to-one relationships, e.g. Bob’s Phone > Lives Here. So that's unusual data for any memory cache database cluster to handle and process. Although unusual, our Cache was handling these data-heavy key-value pairs up to a point.

We now believe the replication challenges set out in our Post Mortem report of 18th Jan remain a bug in the database software, but one driven by the database’s ability to handle these data-heavy key-value pairs day to day yet not replicate them, which was previously mistaken for a special character causing that lack of replication. That issue remains a bug with the database’s replication performance.
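The “Bob’s Phone registered thousands of times” pattern described above can be illustrated with a small sketch. It uses a plain in-memory dictionary as a stand-in for the Cache; the handset names, addresses, and function names are hypothetical and are not taken from our platform.

```python
from collections import defaultdict

# Hypothetical stand-in for the Cache: handset key -> list of registration values.
registrations = defaultdict(list)


def register_without_cleanup(store, handset, contact):
    """Append a registration without removing stale ones (the problematic pattern)."""
    store[handset].append(contact)


def heavy_keys(store, max_values=1):
    """Return {key: value_count} for keys holding more than max_values values."""
    return {key: len(values) for key, values in store.items() if len(values) > max_values}


# Alice's phone registers once. Bob's phone drops off ungracefully and re-registers
# 1000 times, and none of the earlier registrations is ever cleaned up.
register_without_cleanup(registrations, "alice-phone", "10.0.0.5:5060")
for attempt in range(1000):
    register_without_cleanup(registrations, "bob-phone", f"10.0.0.9:{20000 + attempt}")

print(heavy_keys(registrations))  # {'bob-phone': 1000}
```

Every read or update of such a key has to deal with all of its accumulated values, which is one plausible way a single key can inflate the work done per command.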
That said, we believe the root cause remains the inordinate volume of these data-heavy key-value pairs generated by ungraceful de-registrations of handsets. Now identified, we have already implemented a resolution to gracefully handle ungraceful de-registrations at scale. This will ensure the key-value pairs in our Cache cluster remain in balance, allowing such an important part of any large-scale rapid-access platform to perform at the highest availability.
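This report does not describe how the resolution is implemented. As an illustration only, one common way to keep a registration at one value per handset is to overwrite on re-registration and expire entries that are never refreshed. The class, method names, and one-hour expiry below are all assumptions, not a description of the deployed fix.

```python
import time


class RegistrationStore:
    """Hypothetical store keeping one live registration per handset, with an expiry."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # handset -> (contact, expires_at)

    def register(self, handset, contact):
        """Overwrite any earlier registration instead of accumulating values."""
        self._entries[handset] = (contact, self._clock() + self._ttl)

    def deregister(self, handset):
        """Graceful de-registration; harmless if the handset is unknown."""
        self._entries.pop(handset, None)

    def purge_expired(self):
        """Remove registrations whose handset vanished without de-registering."""
        now = self._clock()
        stale = [h for h, (_, expires_at) in self._entries.items() if expires_at <= now]
        for handset in stale:
            del self._entries[handset]
        return len(stale)

    def lookup(self, handset):
        """Return the current contact, or None if absent or expired."""
        entry = self._entries.get(handset)
        if entry is None or entry[1] <= self._clock():
            return None
        return entry[0]

    def __len__(self):
        return len(self._entries)
```

With this shape, a handset that re-registers a thousand times still occupies a single entry, and a handset that loses power simply ages out on the next purge.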