MindTouch incident

CXOne MPower Expert - Service Degradation: Sites unavailable

Minor Resolved

MindTouch experienced a minor incident on March 20, 2025, affecting Application (General Service), Search, and other components, lasting 2h 40m. The incident has been resolved; the full update timeline is below.

Started
Mar 20, 2025, 01:46 PM UTC
Resolved
Mar 20, 2025, 04:27 PM UTC
Duration
2h 40m
Detected by Pingoru
Mar 20, 2025, 01:46 PM UTC

Affected components

Application (General Service)
Search
In-Product Contextual Help
Email Services
MindTouch Success Center
Analytics
Geoblocking for Russia

Update timeline

  1. investigating Mar 20, 2025, 01:46 PM UTC

    CXOne MPower Expert Service Degradation: Sites unavailable. The CXOne MPower Expert Engineering team is investigating reports of site unavailability.

  2. investigating Mar 20, 2025, 01:54 PM UTC

    We are continuing to investigate this issue.

  3. monitoring Mar 20, 2025, 02:01 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. monitoring Mar 20, 2025, 02:17 PM UTC

    We are continuing to monitor for any further issues.

  5. monitoring Mar 20, 2025, 02:33 PM UTC

    We are continuing to monitor for any further issues.

  6. monitoring Mar 20, 2025, 02:48 PM UTC

    We are continuing to monitor for any further issues.

  7. monitoring Mar 20, 2025, 03:03 PM UTC

    We are continuing to monitor for any further issues.

  8. monitoring Mar 20, 2025, 03:19 PM UTC

    We are continuing to monitor for any further issues.

  9. monitoring Mar 20, 2025, 03:34 PM UTC

    We are continuing to monitor for any further issues.

  10. monitoring Mar 20, 2025, 04:11 PM UTC

    We are continuing to monitor for any further issues.

  11. monitoring Mar 20, 2025, 04:26 PM UTC

    We are continuing to monitor for any further issues.

  12. resolved Mar 20, 2025, 04:27 PM UTC

    Expert Monitoring Complete - services running normally

  13. postmortem Mar 26, 2025, 06:54 PM UTC

    **Impact Start Time (UTC) 03/20/2025 01:46 PM UTC**

    **Impact End Time (UTC) 03/20/2025 02:01 PM UTC**

    **Incident Summary**

    Updated on 03/26/2025 - On 03/20/2025, some CXone Mpower customers reported being unable to access the CXone Mpower Expert knowledge portal, encountering "503" and "504" error messages. The issue occurred during the regular Quality Assurance (QA) site creation process, where rapidly deploying multiple sites through a system script generated an unexpectedly high load, placing excessive strain on infrastructure components. The impact was resolved by restarting the affected services and performing a rolling restart on the affected nodes, restoring services to normal operation.

    **Root Cause**

    The issue occurred during the regular QA site creation process, where rapidly deploying multiple sites through a system script generated an unexpectedly high load, placing excessive strain on infrastructure components. As part of a regular deployment process, several QA sites were created to run integration tests before directing customer traffic to the new deployment. However, this unexpectedly caused frequent reloads of the load balancer, leading to timeouts and unresponsive pages. Ultimately, this triggered alerts and resulted in customer impact.

    Although this procedure had never previously caused such an issue, engineers recognized the need to enhance the system to accommodate the growing load in the production environment, driven by an increasing number of customers and their utilization. They promptly developed and implemented preventive measures to mitigate the risk of similar incidents in the future.

    **Corrective Actions**

    **Detection:** Internal support teams detected a potentially customer-impacting issue through proactive alarms and monitoring mechanisms. The issue was later confirmed by customer reports of being unable to access the CXone Mpower Expert knowledge portal and encountering "503" and "504" error messages.

    **Remediation:** The impact was resolved by restarting the affected services and performing a rolling restart on the affected nodes, restoring services to normal operation. Completed on 03/20/2025.

    **Prevention:** The engineering team implemented rate-limiting measures to control the number of QA sites created simultaneously and increased the pause duration between each site's creation, preventing excessive load spikes during such procedures. Completed on 03/20/2025.

    **Risk of Reoccurrence of Impact:** Low

    **Incident Timeline (UTC)**

    03/20/2025 01:46 PM (UTC) - Internal support teams received potentially customer-impacting alerts and posted a service disruption notification on the Status Health Portal. Simultaneously, the first customer case was opened, prompting Tech Support (TS) engineers to begin their initial validation and troubleshooting investigation, which later confirmed the issue was related to the major incident.

    03/20/2025 01:48 PM (UTC) - Engineers proactively raised a major incident while continuing to work on restoring the service.

    03/20/2025 01:51 PM (UTC) - Engineers restarted the affected service components, stabilizing the system. They continued to monitor the system’s health.

    03/20/2025 02:01 PM (UTC) - After further monitoring and health checks, the impact was confirmed to be fully resolved. Following successful test validations, the major incident was officially marked as resolved.
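The Prevention item in the postmortem describes two controls: a cap on how many QA sites are created at once, and a longer pause between creations. The vendor's actual script is not published, so the following is only a minimal Python sketch of that general idea. The function name `create_sites_throttled`, the `create_site` callback, and the default `batch_size` and `pause_seconds` values are all hypothetical and chosen for illustration; the sketch also creates sites sequentially within a batch, which is a simplification.

```python
import time
from typing import Callable, Iterable, List


def create_sites_throttled(
    site_names: Iterable[str],
    create_site: Callable[[str], None],  # hypothetical: performs one site creation
    batch_size: int = 1,                 # assumed cap on sites created per batch
    pause_seconds: float = 30.0,         # assumed pause between batches
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Create sites in small batches, pausing between batches to limit load."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    names = list(site_names)
    created: List[str] = []
    for start in range(0, len(names), batch_size):
        # Pause before every batch except the first, so downstream
        # components (for example a load balancer) are not reloaded in bursts.
        if start > 0:
            sleep(pause_seconds)
        for name in names[start:start + batch_size]:
            create_site(name)
            created.append(name)
    return created
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; a deployment script would normally leave it at the default `time.sleep`.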