MindTouch incident

MindTouch Service Degradation: US Sites unavailable

MindTouch experienced a major incident on February 25, 2025, affecting Application (General Service), Search, and other components, lasting 6h 4m. The incident has been resolved; the full update timeline is below.

Started: Feb 25, 2025, 01:51 PM UTC
Resolved: Feb 25, 2025, 07:55 PM UTC
Duration: 6h 4m
Detected by Pingoru: Feb 25, 2025, 01:51 PM UTC

Affected components

Application (General Service), Search, In-Product Contextual Help, Email Services, MindTouch Success Center, Analytics

Update timeline

investigating Feb 25, 2025, 01:51 PM UTC

MindTouch Service Degradation: Sites unavailable. The MindTouch Engineering team is investigating reports of site unavailability.

investigating Feb 25, 2025, 02:27 PM UTC

We are continuing to investigate this issue.

investigating Feb 25, 2025, 02:31 PM UTC

We are continuing to investigate this issue.

investigating Feb 25, 2025, 02:50 PM UTC

We are continuing to investigate this issue.

investigating Feb 25, 2025, 03:25 PM UTC

We are continuing to investigate this issue.

monitoring Feb 25, 2025, 06:32 PM UTC

Our internal teams have found a fix for the issue that has been deployed. We will be monitoring further.

resolved Feb 25, 2025, 07:55 PM UTC

This incident has been resolved.

postmortem Mar 03, 2025, 11:37 PM UTC

**Impact Start Time** (UTC) 02/25/2025 01:44 PM UTC

**Impact End Time** (UTC) 02/25/2025 06:27 PM UTC

### Incident Summary

Updated on 03/03/2025 - On 2/25/2025, some CXone Mpower customers reported that the CXone Mpower Expert application sites were unavailable. Internal teams observed a potentially customer-impacting issue, which was later confirmed via customer reports. The impact stemmed from an intermittent malfunction in the application's auto-scaling services. It was resolved when engineers disabled the auto-scaling configuration and manually set the number of available nodes high enough to handle requests during peak load times.

### Root Cause

The root cause lay in the auto-scaling service of the affected application, which kept scaling node resources down to minimum levels. Internal teams identified this behavior when they observed that most clusters in the US region were scaled well below demand.

## Corrective Actions

### Detection:

* Internal teams observed a potentially customer-impacting issue, which was later confirmed via customer reports of unavailability of CXone Mpower Expert sites.

### Remediation:

* Engineers disabled the cluster auto-scaler and set the minimum and maximum desired node count to the cluster peak usage. Completed on 02/25/2025.
* Engineers worked with Cloud Service Provider (CSP) support to configure the auto-scaling policy to allow larger instance types, because a type of bursting configuration was preventing the use of other allowed instance types. Completed on 02/25/2025.

### Prevention:

* The Engineering team removed the auto-scaling policy and statically set the node count far above peak demand to ensure service continuity. Completed on 02/25/2025.

### Risk of Reoccurrence of Impact: Low

### Incident Timeline (UTC)

* 02/25/2025 01:44 PM (UTC) - Internal teams received an alert regarding potentially customer-impacting issues with the CXone Mpower Expert application.
* 02/25/2025 01:50 PM (UTC) - First customer case opened, and Tech Support (TS) engineers began the troubleshooting investigation.
* 02/25/2025 01:56 PM (UTC) - Engineers found the desired scaled state was below the desired count.
* 02/25/2025 02:02 PM (UTC) - Engineers opened a support ticket with the Cloud Service Provider (CSP).
* 02/25/2025 02:10 PM (UTC) - The Engineering team manually set the number of available nodes in the affected region to be sufficient to handle peak traffic.
* 02/25/2025 02:13 PM (UTC) - TS engineers notified the Network Operations Center (NOC) engineers about the reported customer impact; a major incident was proposed and confirmed.
* 02/25/2025 02:13 PM (UTC) - Worker nodes were scaled out but scaled back shortly after.
* 02/25/2025 02:42 PM (UTC) - Engineers re-deployed the cluster configuration.
* 02/25/2025 03:15 PM (UTC) - Engineers lowered the available nodes after the deployment was completed.
* 02/25/2025 03:35 PM (UTC) - Teams observed the auto-scaler configuration working as expected to increase the number of nodes and accommodate incoming traffic.
* 02/25/2025 03:40 PM (UTC) - The CSP worked with the Engineering team to investigate the issue.
* 02/25/2025 04:20 PM (UTC) - Disabled burst mode for small instance types to allow large instance types to be used.
* 02/25/2025 04:36 PM (UTC) - Engineers identified a suspected cause and began remediation steps.
* 02/25/2025 06:20 PM (UTC) - Engineers disabled the cluster auto-scaler and predictive scaling systems.
* 02/25/2025 06:22 PM (UTC) - Engineers manually scaled up the system to much higher than peak demand.
* 02/25/2025 06:25 PM (UTC) - Confirmed no scale-in requests were being called within the system. Internal tests showed recovery.
* 02/25/2025 06:27 PM (UTC) - Internal tests were successful. Impact and major incident resolved.
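
The remediation described in the postmortem (disabling the auto-scaler and pinning the minimum, maximum and desired node counts at or above peak usage) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `ScalingConfig`, `is_underscaled`, `pin_to_static`, the `headroom` parameter and the node counts are hypothetical and do not correspond to the actual tooling, provider API or cluster sizes involved in the incident.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingConfig:
    """Hypothetical node-group scaling settings (not a real provider API)."""
    min_nodes: int
    max_nodes: int
    desired_nodes: int
    autoscaler_enabled: bool


def is_underscaled(config: ScalingConfig, nodes_needed: int) -> bool:
    """True when the desired node count is below what current demand needs."""
    return config.desired_nodes < nodes_needed


def pin_to_static(peak_nodes: int, headroom: float = 0.0) -> ScalingConfig:
    """Return a static configuration sized for peak demand.

    min, max and desired are all set to the same value and the
    auto-scaler is turned off, so nothing can scale the group in.
    `headroom` is an illustrative fractional margin above peak.
    """
    if peak_nodes <= 0:
        raise ValueError("peak_nodes must be positive")
    if headroom < 0:
        raise ValueError("headroom must be non-negative")
    target = math.ceil(peak_nodes * (1 + headroom))
    return ScalingConfig(
        min_nodes=target,
        max_nodes=target,
        desired_nodes=target,
        autoscaler_enabled=False,
    )


if __name__ == "__main__":
    # Illustrative numbers only; the report does not state node counts.
    before = ScalingConfig(min_nodes=2, max_nodes=50, desired_nodes=3,
                           autoscaler_enabled=True)
    print(is_underscaled(before, nodes_needed=40))
    print(pin_to_static(peak_nodes=40, headroom=0.5))
```

Setting the minimum equal to the maximum trades cost efficiency for predictability: capacity no longer follows demand, but a misbehaving scale-in policy cannot shrink the group, which matches the "statically set" prevention step in the report.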