MindTouch incident

MindTouch Service Degradation: US Sites unavailable

MindTouch experienced a critical incident on February 25, 2025 affecting Application (General Service), Search, and four other components, lasting 8h 23m. The incident has been resolved; the full update timeline is below.

Started: Feb 25, 2025, 11:26 PM UTC
Resolved: Feb 26, 2025, 07:49 AM UTC
Duration: 8h 23m
Detected by Pingoru: Feb 25, 2025, 11:26 PM UTC

Affected components

Application (General Service), Search, In-Product Contextual Help, Email Services, MindTouch Success Center, Analytics

Update timeline

investigating Feb 25, 2025, 11:26 PM UTC

MindTouch Service Degradation: Sites unavailable. The MindTouch Engineering team is investigating reports of site unavailability.

identified Feb 26, 2025, 12:26 AM UTC

The issue has been identified and a fix is being implemented.

resolved Feb 26, 2025, 07:49 AM UTC

This incident has been resolved.

postmortem Mar 04, 2025, 12:16 AM UTC

### Impact Start Time (UTC)

02/25/2025 11:23 PM UTC

### Impact End Time (UTC)

02/26/2025 07:48 AM UTC

### Incident Summary

On 02/25/2025, some CXone Mpower customers reported being unable to access the CXone Mpower Expert knowledge portal, encountering '503 Service Unavailable' and '504 Gateway Timeout' error messages. The impact was caused by the application service cluster, which is responsible for managing cluster infrastructure, entering a failed state after being inadvertently terminated during a routine remediation procedure. The issue was resolved by rebuilding the application service cluster using the latest infrastructure configurations and re-provisioning all necessary resources, including worker nodes and networking settings.

### Root Cause

The impact was caused by the application service cluster, which is responsible for managing cluster infrastructure, entering a failed state after being inadvertently terminated during a routine remediation procedure. This led to a service disruption in the US regional platform, causing the applications running on the affected cluster to be unavailable. Although the issue was promptly detected and identified, the Engineering team recognized the need for system and procedural enhancements, particularly for critical processes, and implemented additional safeguards to prevent similar incidents in the future.

## Corrective Actions

### Detection

* Internal support teams detected a potentially customer-impacting issue through proactive alarms and monitoring mechanisms. Customer reports later confirmed the issue, indicating a '503 Service Unavailable' error message when attempting to access the CXone Mpower Expert knowledge portal.

### Remediation

* The issue was resolved by rebuilding the application service cluster using the latest infrastructure configurations and re-provisioning all necessary resources, including worker nodes and networking settings. Completed on 02/25/2025.

### Prevention

* Based on the lessons learned from this incident, the Engineering team revised the standard operating procedures for executing critical system processes. An additional role was deployed in each production environment to manage and restrict access to manual modifications and updates. Furthermore, termination protection was implemented as an extra safeguard to prevent unintended actions in case of a tool failure, such as a malfunctioning confirmation prompt. Completed on 02/26/2025.

### Risk of Reoccurrence of Impact

Low

### Incident Timeline (UTC)

* 02/25/2025 11:23 PM (UTC) - Internal support teams detected and notified the Network Operations Center (NOC) engineers of a potential customer-impacting issue. At the same time, the first customer case was reported, which was later confirmed to be related to the identified issue.
* 02/25/2025 11:29 PM (UTC) - A major incident was proposed and confirmed.
* 02/25/2025 11:41 PM (UTC) - Engineers engaged with our cloud service provider (CSP) support for assistance in identifying the best approach to implement a solution.
* 02/26/2025 12:26 AM (UTC) - Engineers identified the cause of the issue and began their remediation efforts. Remediation took longer than expected because it involved updating a large number of service instances.
* 02/26/2025 03:32 AM (UTC) - Engineers re-ran the configuration management process to ensure all resources were properly retained in the system.
* 02/26/2025 05:30 AM (UTC) - The new application service cluster was fully rebuilt, with all resources successfully retained.
* 02/26/2025 06:17 AM (UTC) - Engineers continued the remediation and proceeded to run the installation configuration.
* 02/26/2025 07:48 AM (UTC) - The impact was fully resolved after engineers rebuilt the application service cluster with the necessary provisioned resources and required configurations. Following successful test validations, the major incident was marked as resolved.
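
The termination protection described under Prevention can be pictured as a guard that refuses a destructive call unless protection has been explicitly lifted, regardless of what a confirmation prompt reports. The following is a minimal Python sketch of that idea only. The `Cluster` class, the `terminate` function, the `TerminationBlocked` exception, and the cluster name are all hypothetical and do not represent MindTouch's tooling or any cloud provider's API.

```python
from dataclasses import dataclass


class TerminationBlocked(Exception):
    """Raised when a termination request is refused."""


@dataclass
class Cluster:
    # Hypothetical stand-in for a managed cluster resource.
    name: str
    termination_protection: bool = True  # protected by default
    terminated: bool = False


def terminate(cluster: Cluster, confirmed: bool) -> None:
    """Terminate a cluster only if protection is off AND the operator confirmed.

    The protection flag is checked first, so a faulty confirmation prompt
    (one that always reports confirmed=True) cannot terminate the cluster.
    """
    if cluster.termination_protection:
        raise TerminationBlocked(f"{cluster.name}: termination protection is enabled")
    if not confirmed:
        raise TerminationBlocked(f"{cluster.name}: operator did not confirm")
    cluster.terminated = True


if __name__ == "__main__":
    prod = Cluster(name="example-app-cluster")  # hypothetical name
    try:
        # Simulates a malfunctioning prompt that reports confirmation.
        terminate(prod, confirmed=True)
    except TerminationBlocked as err:
        print(f"blocked: {err}")
```

In this sketch, removing protection is a separate, deliberate step (`prod.termination_protection = False`) that must happen before `terminate` can succeed, so two independent actions are needed to destroy the resource.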