Nebula incident

Post Dial Delay on Inbound and Outbound calling

Minor Resolved

Nebula experienced a minor incident on January 18, 2024 affecting Core Network, lasting 52m. The incident has been resolved; the full update timeline is below.

Started
Jan 18, 2024, 01:19 PM UTC
Resolved
Jan 18, 2024, 02:12 PM UTC
Duration
52m
Detected by Pingoru
Jan 18, 2024, 01:19 PM UTC

Affected components

Core Network

Update timeline

  1. investigating Jan 18, 2024, 01:19 PM UTC

    We are aware of post dial delay of up to 60 seconds on inbound and outbound calling for a subset of our customers. We are in the process of moving the impacted customers to a different zone whilst we identify the root cause in the affected platform zone. Customers in other platform zones are not impacted, and call audio, once connected, is transiting as normal with no delay or impact. We will provide a further update at 13:45, or before if new information becomes available.

  2. investigating Jan 18, 2024, 01:34 PM UTC

    Impacted customers saw service restored at circa 13:23 following reallocation between platform zones. We are continuing to investigate and are monitoring operational zones carefully whilst we establish the root cause in the impacted zone. Although service has been restored to all users, the root cause has not yet been identified and we continue to work on the event.

  3. monitoring Jan 18, 2024, 02:11 PM UTC

    We have identified the root cause in the affected zone and implemented a solution to both the impacted zone and wider platform. A Post Mortem will be published shortly. We continue to monitor.

  4. resolved Jan 18, 2024, 02:12 PM UTC

    This incident will be closed to allow the swift publication of our post mortem but will continue to be monitored.

  5. postmortem Jan 18, 2024, 02:12 PM UTC

    Our core platform, which handles all inbound and outbound calling for both our CallSwitch One and Legacy products, is split into three platform zones. Customers are distributed across those three zones automatically and are not fixed to any one zone. Any one zone can handle the traffic of all three zones with substantial headroom.

    Within each zone, we operate an entirely isolated, load-balanced core platform which is broadly split into three functions:

    * The Proxy Cluster - Handles the incoming and outgoing SIP (Session Initiation Protocol) conversations that form the initial handshake of a phone call and of a device or app connection.
    * The Switchboard Cluster (“Switchboard”) - Provides the ring tone and all PBX functionality on our platform (IVRs, call queues etc.).
    * The Media Cluster - Handles the audio of a call once established.

    During this incident, a number of servers in the Switchboard in one of our platform zones used an unusually high amount of compute power. This led to Post Dial Delay of up to 60 seconds and, in some cases, a lack of calling tones or PBX functionality for roughly one third of our customers (those in the impacted zone at that time). Once calls were established, audio traversed normally, but the lengthy silence while calls were being established will have meant some customers abandoned their call before it connected.

    The root cause was the inability of a memory cache database cluster (“Cache”), which the Switchboard relies on, to replicate. The issue wasn’t in the Cache itself, but in its ability to replicate. **A bug was identified in the Cache replication software.** This drove a number of Cache servers to access out-of-bounds memory and fail. The failures then sent erroneous traffic to the Switchboard, driving up its compute usage and further hampering its ability to operate on an already degraded Cache.

    We use an industry-standard memory cache database solution that is backed by the team at SnapChat. We will report the issue for resolution. Our platform processes millions of packets per second and we have not seen this issue present previously. Our Cache replication pre-dates the multi-zone design of our platform and is no longer a necessary resilience layer, so it has now been removed from all platform zones to provide a permanent resolution.

    Our sincerest thanks to all staff for such a quick resolution and for the subsequent accurate and permanent solution to the issue.
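
    For readers who want a concrete picture of the zone reallocation described in this post mortem, the following is a minimal sketch only. It assumes customers are mapped to zones by a stable hash and that only customers in the impacted zone are moved. The zone names, the hashing scheme and the function names are all hypothetical; the post mortem does not describe how allocation is actually implemented.

```python
import hashlib

# Hypothetical zone names; the post mortem only states that there are three zones.
ZONES = ("zone-a", "zone-b", "zone-c")


def stable_pick(customer_id, candidates):
    """Pick one candidate zone deterministically from the customer ID."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(candidates)
    return candidates[index]


def initial_assignment(customer_ids, zones=ZONES):
    """Distribute customers across all zones (assumed scheme, for illustration)."""
    return {cid: stable_pick(cid, zones) for cid in customer_ids}


def drain_zone(assignment, impacted_zone, zones=ZONES):
    """Move customers out of the impacted zone; leave everyone else untouched."""
    healthy = tuple(z for z in zones if z != impacted_zone)
    if not healthy:
        raise RuntimeError("no healthy zone left to receive traffic")
    return {
        cid: stable_pick(cid, healthy) if zone == impacted_zone else zone
        for cid, zone in assignment.items()
    }


if __name__ == "__main__":
    customers = [f"customer-{i}" for i in range(300)]
    before = initial_assignment(customers)
    after = drain_zone(before, "zone-b")
    moved = sum(1 for c in customers if before[c] != after[c])
    print(f"moved {moved} of {len(customers)} customers out of zone-b")
```

    Moving only the customers in the impacted zone keeps the change small, which fits the statement that customers in other zones were not affected. It relies on the stated property that any one zone can carry the traffic of all three.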