Rainbow incident
[WW] Rainbow / Hub Softphony — Intermittent Connectivity Issues
Rainbow experienced a major incident on March 31, 2026, lasting 2 days, 4 hours and 32 minutes. The incident has been resolved; the full update timeline is below.
Update timeline
- Resolved Apr 2, 2026, 11:00 AM UTC
Type: Incident
Duration: 2 days, 4 hours and 32 minutes
Affected Components: [WW] Rainbow Hybrid PBX Telephony; [WW] Rainbow Administration & Subscriptions; and Rainbow Core Services, Rainbow Media Relays, Rainbow Conferencing, and Rainbow Hub Voice Services in the EMEA, North America (NA), Caribbean & Latin America (CALA), Australia-New Zealand (ANZ), Asia-Pacific (APAC), and DE regions.

- Mar 31, 06:28:19 GMT+0 - Investigating - We are currently investigating this incident.
- Mar 31, 06:55:34 GMT+0 - Identified - We’re facing a major incident caused by unexpected database latency. Our engineers are investigating and working to restore normal performance. We sincerely apologize for the inconvenience and will keep you updated.
- Mar 31, 07:25:24 GMT+0 - Identified - Our teams are actively working on restoring the service. Services are partially recovered, although we continue to observe latency. We will share further updates as progress is made.
- Mar 31, 07:59:03 GMT+0 - Identified - We are continuing to make progress in restoring the service, although elevated load is currently slowing the process. Our teams remain fully mobilized, and we will provide further updates as the situation evolves.
- Mar 31, 08:32:59 GMT+0 - Identified - We are still experiencing degraded performance, but we are seeing ongoing progress as our teams actively manage elevated traffic levels as part of the mitigation efforts. We currently estimate full recovery within the next several hours and are working to shorten this timeframe as improvements continue.
- Mar 31, 08:37:39 GMT+0 - Identified - As part of the ongoing incident resolution, a full system restart will be performed. This will cause active sessions to be disconnected. This action is necessary to restore the service and stabilize the platform. Our teams will be closely monitoring the situation throughout this operation, and we expect this step to accelerate the return to normal service.
- Mar 31, 08:58:38 GMT+0 - Identified - The system restart is still ongoing and is expected to take approximately one hour. We will provide another update as soon as the process is completed.
- Mar 31, 09:32:03 GMT+0 - Identified - The system restart is still in progress. We will provide another update as soon as the operation is completed.
- Mar 31, 10:01:38 GMT+0 - Identified - We are still completing the restart process, and some services may take a bit more time to fully stabilize. Our teams remain actively engaged, and we will provide a further update as soon as the next milestone is reached.
- Mar 31, 10:26:20 GMT+0 - Identified - We are performing additional actions on our infrastructure, but the service has not yet returned to normal. Our teams are continuing to work through the situation, and we will share further details as soon as new progress is made.
- Mar 31, 11:00:59 GMT+0 - Identified - Despite no significant change for users at this time, our teams continue to work steadily on the resolution. Based on current progress, we expect full recovery before the end of the day. We will continue to keep you informed and share more details as soon as we make further progress.
- Mar 31, 11:34:07 GMT+0 - Identified - Our teams continue to work actively on restoring the service. While the situation remains complex and no definitive recovery path has been confirmed yet, all necessary actions are being taken to move forward.
- Mar 31, 12:07:10 GMT+0 - Identified - To stabilize the service, a portion of user traffic has been temporarily reduced while our teams work on redistributing the load across our systems. This may prevent access to the platform for some users. We are actively working to restore normal access as quickly as possible.
- Mar 31, 12:28:21 GMT+0 - Identified - Services are starting to be partially restored, although not all users may be able to access them yet. Our teams continue to work on completing the recovery.
- Mar 31, 12:54:31 GMT+0 - Identified - Most of our services are coming back online, and availability continues to improve, even though some users may still experience limited access. Our teams remain fully committed and are making steady progress toward full recovery.
- Mar 31, 13:27:03 GMT+0 - Identified - Telephony services are starting to recover, and PBX systems are reconnecting. Some users may still experience slow response times, but overall availability continues to improve. Our teams remain fully engaged and are closely monitoring the situation to support the ongoing recovery.
- Mar 31, 13:56:36 GMT+0 - Identified - We continue to observe some instability on our infrastructure, but our teams have identified the contributing network issues and are actively working with our provider to address them. As part of our stabilization efforts, several servers are being redirected to an alternate site to ensure a more reliable environment. These actions are already helping to improve the situation, and we remain fully mobilized to restore full service as quickly as possible. Further updates will follow as progress continues.
- Mar 31, 14:31:20 GMT+0 - Monitoring - Services have been restored, and the vast majority of data is now available to users. Our teams are monitoring the platform closely to ensure full stability.
- Mar 31, 14:55:50 GMT+0 - Monitoring - Services have been restored, and we continue to monitor the platform closely to ensure everything remains stable. If you experience any issue with Rainbow, logging out and back in should help.
- Mar 31, 15:31:12 GMT+0 - Monitoring - Monitoring activities are still ongoing, and current indicators remain positive. Our IaaS provider has also made progress on resolving the underlying network issues. We will continue to observe the platform closely and provide further updates if necessary.
- Mar 31, 15:58:57 GMT+0 - Resolved - The Rainbow service remains operational. OVH is still experiencing network issues in the Roubaix datacenter, which temporarily reduces our high-availability capacity. Despite this, the platform is functioning normally, and we are maintaining reinforced monitoring to ensure service continuity. Our teams are closely following the situation with our provider and will take any additional measures needed. Thank you for your understanding while we worked through this incident.
- Apr 2, 11:00:01 GMT+0 - Resolved - This incident has been resolved.
Apr 2, 11:00:00 GMT+0 - Postmortem - **Post-mortem Update** – Apr 02, 13:00 CEST

Following the application of OVH’s fix yesterday at approximately 22:00 CEST, all data centers are now fully operational. All services have returned to normal and continue to be closely monitored.

===

Apr 01, 2026 10:07 AM

**Incident Report: Root Cause Analysis – Rainbow Data Center**

**1. Executive Summary**

Rainbow services in our different regions experienced significant degradation due to a network routing anomaly at our primary hosting provider. The incident was caused by unstable IP routing between the data centers’ virtual routers, which resulted in an active/passive switch-flapping scenario that ultimately isolated database nodes and interrupted client connections. All services have now been restored by routing traffic to an alternate data center, and we are currently in strict monitoring mode.

**2. Timeline of Events**

* **31/03, 08:00:** The flapping issue escalated on the OVH data center, leading to a split-brain condition on our database infrastructure. Service disruption became widespread.
* **31/03, 08:00 – 10:00:** The team assessed all available options to recover service without resorting to a full platform restart, which was considered a last resort given the risk of a broader outage. Despite these efforts, the underlying network instability could not be resolved through targeted intervention alone.
* **31/03, 11:00 – 11:30:** The engineering team attempted a first restoration. This was unsuccessful, as the network continued to flap between data centers.
* **31/03, 12:00:** A second restart was initiated, attempting to distribute the load across different data centers.
* **31/03, 13:00:** Our teams executed a major restoration by isolating the faulty data center network elements.
* **31/03, 13:30 – 16:30:** Rainbow services were recovered, and users progressively regained access to their services in the different regions.
* **Current Status:** All IP addresses from Roubaix were successfully switched to the Gravelines data center, and services have been restored. We are now in a dedicated monitoring phase and are working with the OVH team to ensure the stability of the Roubaix data center.

**3. Root Cause Description**

The service disruption was triggered by a network malfunction at our data center, which is operated by our infrastructure provider, OVH. We observed repeated VRRP flapping, which prevented incoming traffic from reliably reaching the active destination server. This persistent routing instability ultimately disrupted database synchronization, resulting in a split-brain scenario in which database nodes operated independently without maintaining consensus. Because the VRRP protocol is managed externally by the hosting provider, direct intervention at the switch level was not possible on our end, requiring an architectural bypass to restore service.

**4. Corrective Actions and Next Steps**

We are accelerating the expansion of the Rainbow service with a new IaaS provider to avoid full dependency on the OVH Cloud network solution. A detailed timeline will be shared before the end of next week.

**Invitation: Partner Webinar on Rainbow Solution Stability**

To ensure full transparency and ongoing collaboration, we will invite our partner community to a dedicated technical webinar next week. The invitation will follow before the end of this week.
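For readers less familiar with the split-brain condition described in section 3: when routing flaps and database nodes lose sight of one another, each partition can keep accepting writes and the replicas diverge, unless every node enforces a majority quorum. The sketch below is a minimal, hypothetical illustration of such a quorum guard; it is not Rainbow’s actual tooling, and the node names, addresses, and plain TCP probe are assumptions made for illustration only.

```python
# Minimal sketch of a majority-quorum check of the kind that guards a
# replicated database against split-brain. Hypothetical example only:
# node names, addresses, and the TCP reachability probe are placeholders.
import socket

CLUSTER_PEERS = {            # hypothetical peer nodes of this database member
    "db-rbx-1": ("10.0.1.10", 5432),
    "db-rbx-2": ("10.0.1.11", 5432),
    "db-gra-1": ("10.0.2.10", 5432),
}

def peer_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Probe a peer with a plain TCP connect; routing-layer flapping (such
    as the VRRP instability in this incident) shows up here as
    intermittent connection failures."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def has_quorum() -> bool:
    """Count this node plus every reachable peer; the node may only keep
    accepting writes while it sees a strict majority of the full cluster."""
    cluster_size = len(CLUSTER_PEERS) + 1          # peers + self
    votes = 1 + sum(peer_reachable(h, p) for h, p in CLUSTER_PEERS.values())
    return votes > cluster_size // 2

if __name__ == "__main__":
    if has_quorum():
        print("quorum held: node may keep serving writes")
    else:
        # Without this guard, both sides of a network partition can elect a
        # primary and accept conflicting writes -- the split-brain scenario
        # from the root-cause analysis above.
        print("quorum lost: demote to read-only until the partition heals")
```

A production cluster manager performs this arbitration continuously and fences the losing partition. The point the incident illustrates is that when the routing layer itself flaps, even healthy nodes can lose quorum, and the safe behavior is to stop serving writes rather than diverge.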