Nebula incident

Service Degradation on Outbound Calling and Portal Access

Nebula experienced a minor incident on August 7, 2024 affecting Core Network, lasting 20h 13m. The incident has been resolved; the full update timeline is below.

Started: Aug 07, 2024, 12:01 PM UTC
Resolved: Aug 08, 2024, 08:14 AM UTC
Duration: 20h 13m
Detected by Pingoru: Aug 07, 2024, 12:01 PM UTC

Affected components

Core Network

Update timeline

investigating Aug 07, 2024, 12:01 PM UTC

We have identified an issue which is causing service degradation for a small number of customers. Our engineers are working to resolve the issue urgently. Next update will be within 30 minutes.
monitoring Aug 07, 2024, 12:15 PM UTC

Our engineers have identified the problem and are implementing a solution. We estimate full resumption of service within the next 5 minutes.
monitoring Aug 07, 2024, 12:22 PM UTC

Our engineers are satisfied that all affected services should be restored shortly, and will continue to monitor over the next 24 hours.
monitoring Aug 07, 2024, 12:32 PM UTC

Our engineers are still working on the remaining affected areas of the network and continue to work towards a more permanent resolution. Next update within 30 minutes.
monitoring Aug 07, 2024, 01:17 PM UTC

Our team continue to see improvements across our network as a solution is being implemented. Call traffic has returned to normal and we're currently investigating reports of delays on softphone applications and portals.
monitoring Aug 07, 2024, 03:05 PM UTC

All services have been restored to normal and our team are closely monitoring traffic on our network to ensure this continues. We will close this incident as resolved within the next 24 hours.
resolved Aug 08, 2024, 08:14 AM UTC

Our engineers are satisfied this incident has been resolved and all services are fully operational.
postmortem Aug 08, 2024, 12:47 PM UTC

A small subset of customers were unable to carry out activities on our network that required authentication (for example, accessing portals and initiating outbound calls). Existing calls in progress, along with inbound calls, were largely unaffected.

The incident was caused by a legacy system running a backup process on the database cluster that handles authentication. The backup took longer than it should have and, because of the very high volume of authentication requests at that time of day, a backlog quickly built up, resulting in the symptoms outlined above.

Since the database is replicated and load balanced, all other databases in the cluster continued to operate normally, and customers using these were unaffected. However, these databases had to increase throughput to take on the load from the affected database. As load balancing is based on trends rather than instantaneous values, it took a short time for the remaining databases to react accordingly. As the other databases took on the load, they started to clear a backlog of approximately 400,000 requests, which naturally took some time.

Although all systems operated as expected in response, the underlying backup process that caused the incident has been permanently disabled, and a review is underway to identify why it ran during peak hours.
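To illustrate the two delays described in the postmortem above, the following is a rough sketch, not a model of the actual systems. It assumes the trend-based load balancing can be approximated by an exponentially weighted moving average (the postmortem does not say which method is used), and every rate and parameter below is hypothetical; only the backlog size of roughly 400,000 requests comes from the postmortem.

```python
def samples_until_trend_reacts(alpha: float, threshold: float) -> int:
    """Count samples before a smoothed load signal reflects a sudden load shift.

    Assumption: the balancer tracks load with an exponentially weighted
    moving average (EWMA). The true load jumps from 0 to 1 at sample 0;
    we count samples until the smoothed value reaches `threshold`.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must be in [0, 1)")
    smoothed = 0.0
    samples = 0
    while smoothed < threshold:
        # Standard EWMA update with the new observation fixed at 1.0.
        smoothed = alpha * 1.0 + (1.0 - alpha) * smoothed
        samples += 1
    return samples


def drain_time_seconds(backlog: float, arrival_rate: float, service_rate: float) -> float:
    """Seconds needed to clear `backlog` while new requests keep arriving.

    The backlog only shrinks by the spare capacity, that is
    service_rate - arrival_rate requests per second.
    """
    if service_rate <= arrival_rate:
        raise ValueError("backlog never clears unless service_rate exceeds arrival_rate")
    return backlog / (service_rate - arrival_rate)


if __name__ == "__main__":
    # Hypothetical smoothing factor and reaction threshold.
    print(samples_until_trend_reacts(alpha=0.2, threshold=0.9))
    # 400,000 is from the postmortem; both rates are invented for illustration.
    print(drain_time_seconds(backlog=400_000, arrival_rate=2_000, service_rate=2_500))
```

Under these invented figures, the smoothed signal needs several samples to register the shift, and the backlog then clears only as fast as the spare capacity allows, which is consistent with the gradual recovery seen in the timeline.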