Nebula experienced a minor incident on September 25, 2024, affecting Core Network, Mobile Applications, and 1 more component, lasting 6h 12m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Sep 25, 2024, 10:41 AM UTC
Our engineering team is currently investigating reports of audio delay and app connection issues for a small number of customers. We are working hard to identify the issue and will provide updates as they become available.
- investigating Sep 25, 2024, 10:51 AM UTC
We are continuing to investigate this issue.
- monitoring Sep 25, 2024, 10:59 AM UTC
This issue is now resolved. We are currently monitoring the situation and will post a further update as final confirmation.
- identified Sep 25, 2024, 11:49 AM UTC
Our engineers have received further reports that the above issues have recurred and are investigating with urgency. More updates will be provided shortly.
- monitoring Sep 25, 2024, 11:55 AM UTC
After further adjustments to traffic, our engineers are now seeing services return to usual levels. We will continue monitoring closely and provide further updates once we are confident the incident is fully resolved.
- identified Sep 25, 2024, 12:04 PM UTC
Our engineers have received further reports that the above issues have recurred and are investigating with urgency. More updates will be provided shortly.
- identified Sep 25, 2024, 12:19 PM UTC
We are continuing to work on a fix for this issue.
- monitoring Sep 25, 2024, 12:20 PM UTC
After further adjustments to traffic, our engineers are now seeing services return to usual levels. We will continue monitoring closely and provide further updates once we are confident the incident is fully resolved.
- resolved Sep 25, 2024, 04:54 PM UTC
Following a successful monitoring period, we can confirm resolution and closure of this issue.
- postmortem Sep 26, 2024, 12:39 PM UTC
At 11:42 on 25th September, some of our partners reported a long delay when initiating calls. At that time, we were already investigating an issue that had caused one of our core telephony clusters to stop processing traffic. This increased the traffic to the other clusters, which slowed their response times. Although the load balancers responded appropriately, they did not do so fast enough to prevent a backlog in calls. As a result of this incident, our engineers have initiated two workstreams: the first is to identify and permanently fix the root cause of the server issue in the affected cluster; the second is to investigate ways to make the load balancers respond faster, to limit the effect of such issues.