MindTouch incident
CXOne MPower Expert - Service Degradation: Sites unavailable
MindTouch experienced a minor incident on March 13, 2025, affecting Application (General Service), Search, and 1 more component, lasting 42m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Mar 13, 2025, 07:21 AM UTC
CXOne MPower Expert Service Degradation: Sites unavailable. The CXOne MPower Expert Engineering team is investigating reports of site unavailability.
- investigating Mar 13, 2025, 07:30 AM UTC
We are continuing to investigate this issue.
- monitoring Mar 13, 2025, 07:50 AM UTC
A fix will be implemented while we monitor the results.
- resolved Mar 13, 2025, 08:03 AM UTC
This incident has been resolved.
- postmortem Mar 19, 2025, 02:31 PM UTC
**Impact Start Time (UTC):** 03/13/2025 07:15 AM

**Impact End Time (UTC):** 03/13/2025 08:03 AM

**Incident Summary**

On 03/13/2025, some CXone Mpower customers reported being unable to access the CXone Mpower Expert knowledge portal. The issue was caused by a network disruption following our cloud service provider’s (CSP’s) routine security patch updates, which prevented the internal load balancer from refreshing the nodes. The impact was resolved by refreshing the routing table until the rolling restart was completed, which restored all nodes to normal operation and brought the service back online.

**Root Cause**

The issue was caused by a network disruption following our cloud service provider’s (CSP’s) routine security patch updates, which prevented the internal load balancer from refreshing the nodes.

Rolling updates on all the CXone Mpower Expert nodes were expected based on the current server instance configuration, and a forced rolling restart across all nodes was anticipated after the update. However, the system unexpectedly initiated forced restarts on all the nodes simultaneously instead of performing a rolling restart. This prevented network traffic from being routed, leading to the impact experienced by customers.

While the restart of the nodes was expected, the disruption was not. It was due to a backend routing failure that prevented automatic updates during the restart of the nodes. Engineers identified the need to enhance the internal load balancer updater service, as it was not signaling that the backend components were healthy, which led to the backend routing failure.

**Corrective Actions**

**Detection:**

* Internal support teams detected a potentially customer-impacting issue through proactive alarms and monitoring mechanisms. This was later confirmed by customer reports that they were unable to access the CXone Mpower Expert knowledge portal.
**Remediation:**

* The impact was resolved by refreshing the routing table until the rolling restart was completed, which restored all nodes to normal operation and brought the service back online. Completed on 03/13/2025.

**Prevention:**

* Following the service restoration, the Engineering team updated the server instance configuration to prevent automatic security patch updates from occurring during release operations. The security update automation will be conducted during scheduled maintenance windows with close supervision to prevent unexpected disruptions in the future. Completed on 03/13/2025.
* The Engineering team will implement additional health check mechanisms in the application load balancer to verify that the internal load balancers are functioning properly. This enhancement will improve the built-in automatic failover mechanism, allowing the system to switch to a healthy load balancer and prevent customer impact. An update will be provided by End of Day (EOD) MT on 03/28/2025.
* The Engineering team will enhance the internal load balancer updater service to optimize its functionality and prevent unwanted failures in the future. This will ensure a seamless update during the rolling restart of the nodes. An update will be provided by EOD MT on 05/02/2025.

**Risk of Reoccurrence of Impact:** Low

**Incident Timeline (UTC)**

* 03/13/2025 07:15 AM (UTC) - Engineers received alarms indicating potential service disruption and immediately began their validations and troubleshooting.
* 03/13/2025 07:18 AM (UTC) - The first customer case was opened, and Tech Support (TS) engineers began their initial validations and troubleshooting investigation. This case was later confirmed to be related to the proactive major incident.
* 03/13/2025 07:21 AM (UTC) - A service degradation notice was sent to Network Operations Center (NOC) engineers, while engineers identified a suspected cause and began their remediation steps.
* 03/13/2025 07:30 AM (UTC) - After completing the necessary validations, a NOC engineer raised and confirmed a proactive major incident (02521994).
* 03/13/2025 07:38 AM (UTC) - Engineers continued monitoring and refreshing the backend components while waiting for the rolling node restart to complete.
* 03/13/2025 08:03 AM (UTC) - The impact was resolved after the rolling restart was completed while refreshing the routing table to restore the affected nodes. Following successful internal validations, the major incident was marked as resolved.
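
**Illustrative Sketch**

The mechanism described under Root Cause can be approximated with a small toy model. The Python sketch below is only an illustration under stated assumptions, not the production tooling: the function names, node names, and six-node cluster size are all hypothetical. It compares the lowest number of in-service nodes during a rolling restart with a simultaneous restart, and shows how a routing table that is not refreshed stays empty after the nodes recover.

```python
"""Toy model of the restart behaviour described in the Root Cause section.

Everything here (node names, node count, function names) is hypothetical
and only illustrates the mechanism; it is not the production tooling.
"""


def min_in_service(node_count, batch_size):
    """Lowest number of in-service nodes seen while restarting every node
    in batches of `batch_size`."""
    in_service = [True] * node_count
    lowest = node_count
    for start in range(0, node_count, batch_size):
        batch = range(start, min(start + batch_size, node_count))
        for i in batch:
            in_service[i] = False  # the whole batch goes down together
        lowest = min(lowest, sum(in_service))
        for i in batch:
            in_service[i] = True  # batch rejoins before the next one starts
    return lowest


def refresh_routing_table(node_health):
    """Rebuild the list of routable backends from current node health."""
    return sorted(name for name, healthy in node_health.items() if healthy)


if __name__ == "__main__":
    nodes = 6  # assumed cluster size, for illustration only
    print("rolling restart, lowest in-service:", min_in_service(nodes, 1))
    print("simultaneous restart, lowest in-service:", min_in_service(nodes, nodes))

    # Table captured while every node was restarting: nothing is routable.
    health = {f"node-{i}": False for i in range(nodes)}
    stale_table = refresh_routing_table(health)

    # Nodes recover, but a table that is never refreshed stays empty.
    health = {name: True for name in health}
    print("stale table:", stale_table)
    print("after refresh:", refresh_routing_table(health))
```

In this model, restarting one node at a time always leaves five of the six nodes able to serve traffic, while restarting all six at once leaves none. The stale table remains empty until it is rebuilt from current node health, which mirrors the manual routing table refresh used during remediation.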