MindTouch incident

MindTouch Service Degradation: US Sites unavailable

Minor · Resolved

MindTouch experienced a minor incident on February 24, 2025, affecting Application (General Service), Search, and four other components, lasting 1h 34m. The incident has been resolved; the full update timeline is below.

Started
Feb 24, 2025, 03:30 PM UTC
Resolved
Feb 24, 2025, 05:05 PM UTC
Duration
1h 34m
Detected by Pingoru
Feb 24, 2025, 03:30 PM UTC

Affected components

Application (General Service), Search, In-Product Contextual Help, Email Services, MindTouch Success Center, Analytics

Update timeline

  1. investigating Feb 24, 2025, 03:30 PM UTC

    MindTouch Service Degradation: Sites unavailable. The MindTouch Engineering team is investigating reports of site unavailability.

  2. investigating Feb 24, 2025, 03:43 PM UTC

    We are continuing to investigate this issue.

  3. resolved Feb 24, 2025, 05:05 PM UTC

    This incident has been resolved.

  4. postmortem Mar 04, 2025, 12:01 AM UTC

    ### Impact Start Time (UTC)

    02/24/2025 03:27 PM UTC

    ### Impact End Time (UTC)

    02/24/2025 03:31 PM UTC

    ### Incident Summary

    Updated on 02/28/2025 - On 2/24/2025, a CXone Mpower customer reported that the CXone Mpower Expert application sites were unavailable. Internal teams observed a potentially customer-impacting issue, which was later confirmed via customer reports. Engineers confirmed that this was a reoccurrence of an earlier Major Incident (02509572). The behavior was a momentary loss of routing to the service backends. The impact stemmed from an intermittent malfunction in the auto-scaling services of the application. The impact was resolved when the auto-scaling service increased the number of application instances to handle the incoming traffic.

    ### Root Cause

    The root cause was an issue with the auto-scaling service of the affected application, which caused the service to fail intermittently during high-load periods. As a result, the application fell behind in processing capacity, which led to the drop in traffic to backend services.

    ## Corrective Actions

    ### Detection:

    * Internal teams observed a potentially customer-impacting issue, which was later confirmed via customer reports of the unavailability of application sites.

    ### Remediation:

    * The auto-scaling service increased the number of application instances to handle the incoming traffic. Completed on 02/24/2025.

    ### Prevention:

    * The Engineering team removed the auto-scaling policy and statically set the node count far above peak demand to ensure service continuity. Completed on 02/25/2025.
    * Engineers increased the minimum number of application nodes to address the increase in incoming traffic. Completed on 02/25/2025.

    ### Risk of Reoccurrence of Impact

    Low

    ### Incident Timeline (UTC)

    02/24/2025 03:27 PM (UTC) - Internal teams observed a potentially customer-impacting issue with the CXone Mpower application returning "503 Server" error messages.
    02/24/2025 03:29 PM (UTC) - Application support team was engaged regarding the issue.
    02/24/2025 03:30 PM (UTC) - The Network Operations Center (NOC) was notified regarding the remediation and investigation efforts.
    02/24/2025 03:31 PM (UTC) - First customer case opened, and Tech Support (TS) engineers began the troubleshooting investigation.
    02/24/2025 03:37 PM (UTC) - Engineers reported that the issue was resolved at 3:31 PM (UTC). Impact was resolved after the system recovered on its own.
    02/24/2025 03:43 PM (UTC) - TS engineers notified the NOC engineers about the reported customer impact.
    02/24/2025 03:47 PM (UTC) - A major incident was created for documentation
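
The report gives two different windows: the status-page window (Started 03:30 PM, Resolved 05:05 PM UTC) and the much shorter impact window in the postmortem (03:27 PM to 03:31 PM UTC). The minimal Python sketch below works out both, plus each timeline entry's offset from the start of impact. The timestamps are copied from this report; the event labels, the `minutes_between` helper, and the variable names are illustrative assumptions, not part of the vendor's tooling.

```python
from datetime import datetime

# Timestamp layout used in the postmortem timeline, e.g. "02/24/2025 03:27 PM".
FMT = "%m/%d/%Y %I:%M %p"

# (timestamp in UTC, shortened label) pairs; the labels are paraphrased.
timeline = [
    ("02/24/2025 03:27 PM", "503 errors observed internally"),
    ("02/24/2025 03:29 PM", "application support engaged"),
    ("02/24/2025 03:30 PM", "NOC notified"),
    ("02/24/2025 03:31 PM", "first customer case opened"),
    ("02/24/2025 03:37 PM", "engineers report recovery at 03:31 PM"),
    ("02/24/2025 03:43 PM", "TS notifies NOC of customer impact"),
    ("02/24/2025 03:47 PM", "major incident created"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two timestamps in FMT layout."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


impact_start = "02/24/2025 03:27 PM"

# Impact window stated in the postmortem (03:27 PM to 03:31 PM UTC).
impact_minutes = minutes_between(impact_start, "02/24/2025 03:31 PM")

# Status-page window (Started 03:30 PM, Resolved 05:05 PM UTC).
status_page_minutes = minutes_between("02/24/2025 03:30 PM", "02/24/2025 05:05 PM")

# Offset of every timeline entry from the start of impact, in minutes.
offsets = [(minutes_between(impact_start, ts), label) for ts, label in timeline]

if __name__ == "__main__":
    print(f"impact window: {impact_minutes} min")
    print(f"status-page window: {status_page_minutes} min")
    for offset, label in offsets:
        print(f"+{offset:>2} min  {label}")
```

At minute resolution the impact window is 4 minutes, while the status-page window is 95 minutes, close to the 1h 34m shown in the header. The status-page duration therefore mostly reflects when the incident record was closed, not how long sites were unavailable.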