MindTouch incident

CXOne MPower Expert - Service Degradation: Sites unavailable

MindTouch experienced a minor incident on March 13, 2025, affecting Application (General Service), Search, and 1 more component, lasting 1h 13m. The incident has been resolved; the full update timeline is below.

Started: Mar 13, 2025, 08:20 AM UTC
Resolved: Mar 13, 2025, 09:34 AM UTC
Duration: 1h 13m
Detected by Pingoru: Mar 13, 2025, 08:20 AM UTC

Affected components

Application (General Service)
Search
In-Product Contextual Help
Email Services
MindTouch Success Center
Analytics
Geoblocking for Russia

Update timeline

investigating Mar 13, 2025, 08:20 AM UTC

CXOne MPower Expert Service Degradation: Sites unavailable. The CXOne MPower Expert Engineering team is investigating reports of site unavailability.

monitoring Mar 13, 2025, 08:32 AM UTC

A fix will be implemented while we monitor results.

monitoring Mar 13, 2025, 08:53 AM UTC

A fix will be implemented while we monitor results.

monitoring Mar 13, 2025, 09:15 AM UTC

We are continuing to monitor for any further issues.

resolved Mar 13, 2025, 09:34 AM UTC

This incident has been resolved.

postmortem Mar 19, 2025, 02:36 PM UTC

**Impact Start Time (UTC)** - 03/13/2025 08:20 AM (UTC)
**Impact End Time (UTC)** - 03/13/2025 09:34 AM (UTC)

**Incident Summary**

On 03/13/2025, some CXone Mpower customers reported encountering a '503 Service Unavailable' error message when attempting to access the CXone Mpower Expert knowledge portal. The issue stemmed from a routing pre-configuration change made in preparation for a scheduled system update deployment on the routing service, during which a certain configuration was inadvertently reverted. The issue was resolved when engineers rectified the misconfiguration and restarted the affected routing service, fully restoring all functionality.

**Root Cause**

The issue stemmed from a network disruption following our cloud service provider’s (CSP’s) routine security patch updates, which prevented the internal load balancer from refreshing the nodes. Rolling updates on all the CXone Mpower Expert nodes were expected based on the current server instance configuration, and it was also anticipated that a forced rolling restart would occur across all nodes after an update. However, the system unexpectedly initiated forced restarts on all the nodes simultaneously instead of performing a rolling restart. This prevented network traffic from being routed, leading to the impact experienced by customers.

Following the service restoration of the first major incident (02521994), engineers initiated the migration of the server instance configuration as part of a follow-up remediation effort. This triggered another expected rolling restart on the nodes, but it again caused an unexpected disruption. This was due to a backend routing failure that prevented automatic updates during the rolling restart of the nodes. Engineers identified the need to enhance the internal load balancer updater service, as it was not signaling that the backend components were healthy, which led to the backend routing failure.
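To illustrate why the difference between a rolling restart and a simultaneous restart matters, the following Python sketch models a cluster and reports the lowest number of nodes serving traffic at any point during a restart. It is an illustration only, not the vendor's tooling: the function name `min_serving_during_restart`, the `batch_size` parameter, and the six-node cluster size are all assumptions, and the model assumes each batch passes its health check before the next batch starts.

```python
def min_serving_during_restart(node_count: int, batch_size: int) -> int:
    """Return the lowest number of nodes serving traffic at any moment
    while every node is restarted in batches of `batch_size`."""
    if node_count <= 0 or batch_size <= 0:
        raise ValueError("node_count and batch_size must be positive")

    healthy = [True] * node_count
    lowest = node_count
    for start in range(0, node_count, batch_size):
        batch = range(start, min(start + batch_size, node_count))
        for i in batch:
            healthy[i] = False  # nodes in this batch go down together
        lowest = min(lowest, sum(healthy))
        for i in batch:
            healthy[i] = True  # assumed: batch passes its health check and rejoins
    return lowest


nodes = 6  # hypothetical cluster size
print(min_serving_during_restart(nodes, batch_size=1))      # rolling restart: 5
print(min_serving_during_restart(nodes, batch_size=nodes))  # simultaneous restart: 0
```

With a batch size of one, five of the six hypothetical nodes keep serving throughout. When the batch size equals the node count, nothing is left to serve, which corresponds to the condition described above in which traffic could not be routed.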
**Corrective Actions**

**Detection:**

* Internal support teams detected a potentially customer-impacting issue through proactive alarms and monitoring mechanisms, which was later confirmed by customer reports indicating that they were unable to access the CXone Mpower Expert knowledge portal.

**Remediation:**

* The impact was resolved by refreshing the routing table until the rolling restart was completed, which restored all nodes to normal operation and brought the service back online. Completed on 03/13/2025.

**Prevention:**

* Following the service restoration, the Engineering team updated the server instance configuration to prevent automatic security patch updates from occurring during release operations. The security update automation will be conducted during scheduled maintenance windows with close supervision to prevent unexpected disruptions in the future. Completed on 03/13/2025.
* The Engineering team will implement additional health check mechanisms in the application load balancer to verify that the internal load balancers are functioning properly. This enhancement will improve the built-in automatic failover mechanism, allowing the system to switch to a healthy load balancer and prevent customer impact. An update will be provided by End of Day (EOD) MT on 03/28/2025.
* The Engineering team will enhance the internal load balancer updater service to optimize its functionality and prevent unwanted failures in the future. This will ensure a seamless update during the rolling restart of the nodes. An update will be provided by EOD MT on 05/02/2025.

**Risk of Reoccurrence of Impact: Low**

**Incident Timeline (UTC)**

03/13/2025 08:17 AM (UTC) - Engineers initiated the follow-through remediation effort as a preventive measure, following the first major incident.
03/13/2025 08:20 AM (UTC) - A new set of alarms was triggered, and engineers quickly began their remediation actions.
03/13/2025 08:23 AM (UTC) - Engineers were able to restore the service and continued to monitor the system’s health.
03/13/2025 08:36 AM (UTC) - Engineers notified the Network Operations Center (NOC) engineers of this recurrence, and a second proactive major incident (02522001) was raised and confirmed.
03/13/2025 08:49 AM (UTC) - While monitoring the system, engineers received another set of alarms. They continued refreshing the routing table to prevent impact and eventually restored the service. Engineers conducted further health checks and system monitoring.
03/13/2025 08:53 AM (UTC) - While the issue was already resolved, the first customer case was opened, which was later validated as related to this ongoing major incident.
03/13/2025 09:34 AM (UTC) - The impact was resolved after the rolling restart was completed, and the routing table was refreshed to restore the affected nodes. Following successful internal validations, the major incident was marked as resolved.
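
The planned health-check and failover improvement listed under Prevention can be pictured with a short Python sketch. This is a hedged illustration under stated assumptions, not the Engineering team's implementation: the function `pick_healthy_balancer`, the balancer names, and the dictionary standing in for a health probe are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def pick_healthy_balancer(
    balancers: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first balancer whose health check passes, or None."""
    for name in balancers:
        try:
            if is_healthy(name):
                return name
        except Exception:
            # A probe that raises is treated the same as an unhealthy result.
            continue
    return None


# Hypothetical probe results standing in for real health checks.
status = {"internal-lb-a": False, "internal-lb-b": True}
chosen = pick_healthy_balancer(["internal-lb-a", "internal-lb-b"], lambda n: status[n])
print(chosen)  # internal-lb-b
```

Returning `None` when no balancer passes makes the "nothing healthy" case explicit, so a caller can raise an alarm instead of routing traffic to a backend that is known to be failing.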