MindTouch incident

CXOne Expert Site Outage

MindTouch reported a notice incident on March 4, 2025, for a service impact that occurred on February 24, 2025. The incident has been resolved; the full update timeline is below.

Started: Mar 04, 2025, 12:11 AM UTC
Resolved: Feb 24, 2025, 03:52 PM UTC
Duration: —
Detected by Pingoru: Mar 04, 2025, 12:11 AM UTC

Update timeline

resolved Mar 04, 2025, 12:11 AM UTC

- On 2/24/2025, some CXone Mpower customers reported that the CXone Mpower Expert application sites were unavailable. Internal teams observed a potentially customer-impacting issue, which was later confirmed via customer reports. The impact stemmed from some network nodes that were in an unhealthy state. The impact was resolved when engineers restarted the affected network nodes.
postmortem Mar 04, 2025, 12:11 AM UTC

### Impact Start Time (UTC)

02/24/2025 03:52 PM UTC

### Impact End Time (UTC)

02/24/2025 03:57 PM UTC

### Incident Summary

On 2/24/2025, some CXone Mpower customers reported that the CXone Mpower Expert application sites were unavailable. Internal teams observed a potentially customer-impacting issue, which was later confirmed via customer reports. The impact stemmed from some network nodes that were in an unhealthy state. The impact was resolved when engineers restarted the affected network nodes.

### Root Cause

The root cause stemmed from an issue with the auto-scaling service of the affected application, which caused the service to intermittently fail during high-load times. This resulted in the application falling behind in processing capabilities and led to the drop in traffic to backend services.

## Corrective Actions

### Detection:

* Internal teams observed a potentially customer-impacting issue, which was later confirmed via customer reports of the unavailability of application sites.

### Remediation:

* The auto-scaling service increased the number of application instances to handle the incoming traffic. Completed on 02/24/2025.

### Prevention:

* The Engineering team removed the auto-scaling policy and statically set the node count far above peak demand to ensure service continuity. Completed on 02/25/2025.
* Engineers increased the minimum number of network nodes to address the increase in incoming traffic. Completed on 02/25/2025.

### Incident Timeline (UTC)

* 02/24/2025 03:51 PM (UTC) - Internal teams received a page regarding a potentially customer-impacting issue.
* 02/24/2025 03:52 PM (UTC) - Impact start time determined by engineers.
* 02/24/2025 03:53 PM (UTC) - Engineers restarted the affected network nodes in an attempt to resolve the issue.
* 02/24/2025 03:57 PM (UTC) - Impact end time determined by engineers as the system recovered on its own.
* 02/24/2025 04:00 PM (UTC) - Network Operations Center (NOC) engineers were informed about the reported customer impact. A major incident was created and confirmed.
* 02/24/2025 04:01 PM (UTC) - As a preventive measure, engineers recycled the router pods.
* 02/24/2025 04:04 PM (UTC) - The first customer case was reported to the Tech Support (TS) engineers.
* 02/24/2025 04:09 PM (UTC) - Engineers continued monitoring and determined that the system was stable.
* 02/24/2025 04:30 PM (UTC) - Customer feedback confirmed that sites were restored.
* 02/24/2025 05:00 PM (UTC) - Internal teams disbanded the Major Incident (MI) bridge.
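
The header above leaves the duration blank, but the timeline timestamps are enough to derive the main intervals. The following is a minimal sketch, not part of the original report: the dictionary keys and helper names are hypothetical labels chosen for illustration, and the timestamps are copied from the Incident Timeline.

```python
from datetime import datetime, timezone

# Timestamp layout used in the Incident Timeline, e.g. "02/24/2025 03:52 PM".
FMT = "%m/%d/%Y %I:%M %p"

# Selected timeline entries (all UTC). The keys are hypothetical labels,
# not names used in the postmortem itself.
timeline = {
    "page_received": "02/24/2025 03:51 PM",
    "impact_start": "02/24/2025 03:52 PM",
    "nodes_restarted": "02/24/2025 03:53 PM",
    "impact_end": "02/24/2025 03:57 PM",
    "bridge_disbanded": "02/24/2025 05:00 PM",
}


def parse_utc(stamp: str) -> datetime:
    """Parse a timeline timestamp and mark it as UTC."""
    return datetime.strptime(stamp, FMT).replace(tzinfo=timezone.utc)


def minutes_between(start_key: str, end_key: str) -> int:
    """Whole minutes elapsed between two labeled timeline entries."""
    delta = parse_utc(timeline[end_key]) - parse_utc(timeline[start_key])
    return int(delta.total_seconds() // 60)


impact_minutes = minutes_between("impact_start", "impact_end")
restart_to_recovery = minutes_between("nodes_restarted", "impact_end")
page_to_bridge_close = minutes_between("page_received", "bridge_disbanded")

print(f"Customer impact window: {impact_minutes} min")
print(f"Node restart to recovery: {restart_to_recovery} min")
print(f"First page to MI bridge close: {page_to_bridge_close} min")
```

Under these timestamps the customer impact window is 5 minutes, and the Major Incident bridge stayed open for 69 minutes after the first page.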