Nebula incident

Handset Performance Reliability

Nebula experienced a minor incident on October 16, 2024 affecting Core Network, lasting 1d 6h. The incident has been resolved; the full update timeline is below.

Started: Oct 16, 2024, 10:51 AM UTC
Resolved: Oct 17, 2024, 05:18 PM UTC
Duration: 1d 6h
Detected by Pingoru: Oct 16, 2024, 10:51 AM UTC

Affected components

Core Network

Update timeline

identified Oct 16, 2024, 10:51 AM UTC

Our engineering team is currently investigating reports of intermittent issues affecting handsets that use TCP registrations, for a small number of customers. We are working hard to identify the cause and will provide updates as they become available.
monitoring Oct 17, 2024, 10:52 AM UTC

Following a fix implemented overnight, we're pleased to report that all symptoms appear to be resolved for handsets using TCP connections. This incident will be closed by the end of the day after a further period of successful monitoring.
resolved Oct 17, 2024, 05:18 PM UTC

After an extended period of monitoring, our engineers have confirmed that this incident is resolved.
postmortem Oct 22, 2024, 01:30 PM UTC

On 2nd October 2024 our engineers became aware of an intermittent issue that was causing some desk phones to show symptoms such as flashing BLF lights and continuing to ring after the call had been answered elsewhere. Although we initially received only a small number of customer tickets about this, we immediately launched an investigation.

We found that the root cause was how our system was handling TCP traffic. TCP is a connection-oriented transport protocol, unlike UDP, and although it has been available for some time, most of our customers still use UDP, so the problem only became apparent as the uptake of TCP increased. A key difference between the two protocols is that every packet in a TCP connection must reach the specific server that holds that connection. When TCP packets went through our load balancers, they were occasionally routed to a server that was not expecting them, and those packets were lost. This affected only a small number of cases, and most traffic reached its destination successfully.

After we implemented an update to adjust the configuration of our load balancer, most symptoms were alleviated; however, further reports identified other, more niche cases where the issue was still present. Having investigated further with the help of more detailed monitoring, we identified the full extent of the problem and rolled out a full and final fix on 17th October. This was successful and the problem has not been experienced since. As a result of these patches, our system will be much more resilient in future.
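For readers who want a concrete picture of the failure mode described above, here is a minimal Python sketch. It is an illustration only: the postmortem does not describe how the load balancers were configured or what the fix changed, so the server names, the round-robin picker, and the hash-based picker are all assumptions chosen to show why packets belonging to one TCP connection need to reach the same server every time.

```python
import hashlib
from itertools import cycle

# Hypothetical backend names; the real server pool is not described in the report.
BACKENDS = ["sip-a", "sip-b", "sip-c"]


def pick_by_flow(src_ip, src_port, backends=BACKENDS):
    """Choose a backend from a hash of the client address, so a flow always maps to one server."""
    key = f"{src_ip}:{src_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


def round_robin(backends=BACKENDS):
    """Return a picker that ignores the flow and simply rotates through the backends."""
    rotation = cycle(backends)
    return lambda src_ip, src_port: next(rotation)


def count_misrouted(picker, packets):
    """Count packets that arrive at a server other than the one that saw the flow's first packet."""
    owner = {}
    lost = 0
    for src_ip, src_port in packets:
        flow = (src_ip, src_port)
        server = picker(src_ip, src_port)
        # The first server to see a flow is treated as the holder of its connection state.
        if owner.setdefault(flow, server) != server:
            lost += 1
    return lost


# Two hypothetical handsets, each sending three packets on its own connection.
packets = [("198.51.100.7", 5060), ("198.51.100.8", 5062)] * 3

misrouted_with_affinity = count_misrouted(pick_by_flow, packets)
misrouted_without_affinity = count_misrouted(round_robin(), packets)
```

With the flow-hashing picker no packet is misrouted, while the rotating picker sends later packets of each connection to servers that hold no state for them. UDP traffic can tolerate the second behaviour when each message is handled independently, which is consistent with the problem only surfacing as TCP uptake grew.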