Nebula incident

Service Degradation on Outbound Calling and Portal Access

Minor Resolved

Nebula experienced a minor incident on August 7, 2024 affecting Core Network, lasting 20h 13m. The incident has been resolved; the full update timeline is below.

Started
Aug 07, 2024, 12:01 PM UTC
Resolved
Aug 08, 2024, 08:14 AM UTC
Duration
20h 13m
Detected by Pingoru
Aug 07, 2024, 12:01 PM UTC

Affected components

Core Network

Update timeline

  1. investigating Aug 07, 2024, 12:01 PM UTC

    We have identified an issue which is causing service degradation for a small number of customers. Our engineers are working to resolve the issue urgently. Next update will be within 30 minutes.

  2. monitoring Aug 07, 2024, 12:15 PM UTC

    Our engineers have identified the problem and are implementing a solution. We estimate full resumption of service within the next 5 minutes.

  3. monitoring Aug 07, 2024, 12:22 PM UTC

    Our engineers are satisfied that all affected services should be restored shortly, and will continue to monitor over the next 24 hours.

  4. monitoring Aug 07, 2024, 12:32 PM UTC

    Our engineers are still working on the remaining affected areas of the network and continue to work towards a more permanent resolution. Next update within 30 minutes.

  5. monitoring Aug 07, 2024, 01:17 PM UTC

    Our team continue to see improvements across our network as a solution is being implemented. Call traffic has returned to normal and we're currently investigating reports of delays on softphone applications and portals.

  6. monitoring Aug 07, 2024, 03:05 PM UTC

    All services have been restored to normal and our team are closely monitoring traffic on our network to ensure this continues. We will close this incident as resolved within the next 24 hours.

  7. resolved Aug 08, 2024, 08:14 AM UTC

    Our engineers are satisfied this incident has been resolved and all services are fully operational.

  8. postmortem Aug 08, 2024, 12:47 PM UTC

    This incident left a small subset of customers unable to carry out activities on our network that required authentication (for example, accessing portals and initiating outbound calls). Existing calls in progress, along with inbound calls, were largely unaffected. The incident was caused by a legacy system running a backup process on the database cluster which handles authentication. This backup process took longer than it should have and, because of the very high level of authentication requests at that time of day, a backlog quickly built up, resulting in the symptoms outlined above. Since the database is replicated and load balanced, all other databases in the cluster continued to operate normally and customers using these were unaffected. However, these had to increase throughput to take on the load from the affected database. As load balancing is done based on trends rather than instantaneous values, it took a short time for the remaining databases to react accordingly. As the other databases increasingly took on the load, they started to clear a backlog of circa 400k requests, which naturally took some time. Although all systems operated as expected in response, the underlying backup process that caused the incident has been permanently disabled, and a review is underway to identify why it ran during peak hours.
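
The two mechanisms described in the postmortem, a backlog that grows while one database is slowed and a trend-based balancer that reacts with a lag, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the vendor did not publish request rates, capacities, or how "trends" are computed. The rates below are hypothetical values chosen so that the peak backlog matches the reported figure of circa 400k, and an exponential moving average stands in for whatever trend signal the real load balancer uses.

```python
def simulate_backlog(arrival_rate, normal_capacity, degraded_capacity,
                     degraded_seconds, total_seconds):
    """Return the number of pending requests at the end of each second.

    All parameters are hypothetical; none come from the incident report.
    """
    backlog = 0
    history = []
    for t in range(total_seconds):
        # Capacity is reduced while the (hypothetical) backup is running.
        capacity = degraded_capacity if t < degraded_seconds else normal_capacity
        backlog = max(0, backlog + arrival_rate - capacity)
        history.append(backlog)
    return history


def ema_series(samples, alpha):
    """Exponential moving average, used here as a stand-in 'trend' signal."""
    smoothed = []
    current = samples[0]
    for sample in samples:
        current = alpha * sample + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


# Hypothetical scenario: 1000 req/s arriving, 200 req/s served for 500 s,
# then 1500 req/s served once the other databases have picked up the load.
history = simulate_backlog(arrival_rate=1000, normal_capacity=1500,
                           degraded_capacity=200, degraded_seconds=500,
                           total_seconds=1400)
peak = max(history)                      # 400000 pending requests
drain_seconds = history.index(0) - 499   # 800 s from peak to an empty queue

# A step change in load (0 -> 1) as seen through a trend with alpha = 0.2:
# the smoothed value approaches 1 gradually instead of jumping to it.
trend = ema_series([0.0] * 5 + [1.0] * 10, alpha=0.2)
```

In this sketch the queue grows by 800 requests per second while capacity is reduced and drains by only 500 per second afterwards, so clearing the backlog takes longer than building it. The smoothed series shows why a balancer that follows trends does not shift load the moment one database slows down.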