MindTouch incident

CXOne MPower Expert - Service Degradation: Slow Load Times

Minor Resolved

MindTouch experienced a minor incident on March 13, 2025, affecting Application (General Service), Search, and other components, lasting 58m. The incident has been resolved; the full update timeline is below.

Started: Mar 13, 2025, 07:42 PM UTC
Resolved: Mar 13, 2025, 08:41 PM UTC
Duration: 58m
Detected by Pingoru: Mar 13, 2025, 07:42 PM UTC

Affected components

Application (General Service), Search, In-Product Contextual Help, Email Services, MindTouch Success Center, Analytics, Geoblocking for Russia

Update timeline

  1. investigating Mar 13, 2025, 07:42 PM UTC

    CXOne MPower Expert Service Degradation: Sites unavailable. The CXOne MPower Expert Engineering team is looking into issues related to slow load times on sites.

  2. monitoring Mar 13, 2025, 07:47 PM UTC

    A fix will be implemented while we monitor results.

  3. monitoring Mar 13, 2025, 08:06 PM UTC

    A fix will be implemented while we monitor results.

  4. monitoring Mar 13, 2025, 08:22 PM UTC

    A fix will be implemented while we monitor results.

  5. monitoring Mar 13, 2025, 08:37 PM UTC

    A fix will be implemented while we monitor results.

  6. resolved Mar 13, 2025, 08:41 PM UTC

    This incident has been resolved.

  7. postmortem Mar 19, 2025, 02:41 PM UTC

    **Impact Start Time (UTC) - 03/13/2025 07:42 PM (UTC)**
    **Impact End Time (UTC) - 03/13/2025 08:41 PM (UTC)**

    **Incident Summary**

    On 3/13/2025, some CXone Mpower customers reported being unable to access the CXone Mpower Expert knowledge portal. The issue stemmed from another routine security patch update implemented by our cloud service provider (CSP), following their first update that triggered the first major incident (02521994). The impact was resolved by refreshing the routing table until the rolling restart was completed, which restored all nodes to normal operation and brought the service back online.

    **Root Cause**

    The issue stemmed from another routine security patch update implemented by our cloud service provider (CSP), following their first update that triggered the first major incident (02521994). This update triggered the expected rolling restart on all nodes during our internal planned change deployment. While a normal rolling restart typically doesn’t cause impact, some proxy nodes unexpectedly crashed while engineers were refreshing the routing table to prevent disruptions. This resulted in sporadic service disruptions, and engineers had to repeatedly refresh the routing tables until the system fully recovered. Following the incident, the Engineering team identified key areas for improvement in the health check mechanisms to prevent similar incidents in the future.

    **Corrective Actions**

    **Detection:**

    * Internal support teams detected a potentially customer-impacting issue through proactive alarms and monitoring mechanisms. This was later confirmed by customer reports indicating that they were unable to access the CXone Mpower Expert knowledge portal.

    **Remediation:**

    * The impact was resolved by refreshing the routing table until the rolling restart was completed, which restored all nodes to normal operation and brought the service back online. Completed on 03/13/2025.

    **Prevention:**

    * Following the service restoration, the Engineering team updated the server instance configuration to prevent automatic security patch updates from occurring during release operations. The security update automation will be conducted during scheduled maintenance windows with close supervision to prevent unexpected disruptions in the future. Completed on 03/13/2025.
    * The Engineering team will implement additional health check mechanisms in the application load balancer to verify whether the internal load balancers are functioning properly. This enhancement will improve the built-in automatic failover mechanism, allowing the system to switch to a healthy load balancer and prevent customer impact. An update will be provided by End of Day (EOD) MT on 03/28/2025.
    * The Engineering team will enhance the internal load balancer updater service to optimize its functionality and prevent unwanted failures in the future. This will ensure a seamless update during the rolling restart of the nodes. An update will be provided by EOD MT on 05/02/2025.

    **Risk of Reoccurrence of Impact - Low**

    **Incident Timeline (UTC)**

    * 03/13/2025 07:42 PM (UTC) - Engineers proactively identified a customer-impacting incident via their internal monitoring and began remediation steps.
    * 03/13/2025 07:43 PM (UTC) - The first customer case was opened, which was later confirmed to be related to this major incident.
    * 03/13/2025 07:47 PM (UTC) - While engineers had already identified a suspected cause and were actively working on remediating the impact, they notified the Network Operations Center (NOC) engineers about the potentially customer-impacting incident. A proactive major incident was raised and confirmed.
    * 03/13/2025 08:41 PM (UTC) - The impact was resolved after the rolling restart was completed and the routing table was refreshed to restore the affected nodes. Following successful internal validations, the major incident was marked as resolved.
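
The postmortem's prevention items describe health checks that let traffic fail over to a healthy internal load balancer. The vendor does not publish how this is implemented, so the following is only a minimal sketch of the general idea. The `Balancer` type, the `pick_healthy` function, the probe callback, and the addresses are all hypothetical and are not taken from the vendor's system.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Balancer:
    """Hypothetical description of one internal load balancer."""
    name: str
    address: str


def pick_healthy(
    balancers: Sequence[Balancer],
    probe: Callable[[Balancer], bool],
    required_successes: int = 2,
) -> Optional[Balancer]:
    """Return the first balancer that passes `required_successes`
    consecutive probes, or None if no balancer is healthy.

    `probe` is an assumed callback (for example, an HTTP or TCP check)
    that returns True when the balancer responds correctly.
    """
    for lb in balancers:
        # all() stops at the first failed probe, so an unhealthy
        # balancer is skipped without further checks.
        if all(probe(lb) for _ in range(required_successes)):
            return lb
    return None


if __name__ == "__main__":
    # Illustrative data only: the first balancer is simulated as down.
    pool = [Balancer("ilb-a", "10.0.0.1"), Balancer("ilb-b", "10.0.0.2")]
    down = {"ilb-a"}
    chosen = pick_healthy(pool, lambda lb: lb.name not in down)
    print(chosen.name if chosen else "no healthy balancer")
```

Requiring more than one consecutive successful probe is a common way to avoid routing traffic to a node that is flapping during a rolling restart. A production health check would typically also apply timeouts and probe intervals.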