MindTouch incident

CXone Knowledge Management - Monitoring complete. Status = All Services Running Normally

MindTouch experienced a critical incident on April 29, 2026 affecting Application (General Service), Search, and other components, lasting 53m. The incident has been resolved; the full update timeline is below.

Started: Apr 29, 2026, 11:19 AM UTC
Resolved: Apr 29, 2026, 12:13 PM UTC
Duration: 53m
Detected by Pingoru: Apr 29, 2026, 11:19 AM UTC

Affected components

Application (General Service), Search, Generative Search, In-Product Contextual Help, MindTouch Success Center, Analytics

Update timeline

investigating Apr 29, 2026, 11:19 AM UTC

CXone Knowledge Management Service Degradation: Sites unavailable. The CXone Knowledge Management Engineering team is investigating reports of site unavailability.
investigating Apr 29, 2026, 11:38 AM UTC

CXone Knowledge Management Service Degradation: Sites unavailable. The CXone Mpower Expert Engineering team is investigating reports of site unavailability.
investigating Apr 29, 2026, 11:54 AM UTC

CXone Knowledge Management - Fix Deployed - All Services Running Normally. The CXone Mpower Expert Engineering team has deployed a fix and all services are running normally. We are currently monitoring sites for deployment stability. Event duration: 51 mins.
monitoring Apr 29, 2026, 11:55 AM UTC

A fix has been implemented and we are monitoring the results.
resolved Apr 29, 2026, 12:13 PM UTC

Service Disruption Resolved - All Services Running Normally. The CXone Knowledge Management Engineering team has deployed a fix and monitored the deployment to make sure sites are stable. The issue is now resolved. Event duration: 51 mins.
postmortem May 05, 2026, 06:32 PM UTC

**Impact Start Time (UTC) 04/29/2026 10:31 AM UTC**

**Impact End Time (UTC) 04/29/2026 11:56 AM UTC**

**Incident Summary**

On 04/29/2026, some NiCE CXone Mpower customers initially experienced slowness when accessing sites within the CXone Mpower Expert knowledge portal. The issue subsequently worsened, leading to complete inaccessibility, with users encountering “503 Service Unavailable” and “504 Gateway Timeout” errors. The service disruption was caused by short-term resource saturation within one of the platform’s backend Application Programming Interface (API)\* components. The issue was resolved by scaling up multiple services, increasing the Central Processing Unit (CPU) allocation per pod to better handle traffic spikes, and restarting the affected API services to apply the updated resource configurations, which restored normal service operation.

**Root Cause**

The service disruption was caused by short-term resource saturation within one of the platform’s backend API components. Under certain conditions, a burst of long-running requests placed higher-than-expected demand on processing resources, leading to slower response times and errors for some customers.

A subset of API requests required significantly more CPU than others. When many of these requests were processed concurrently, CPU utilization on the affected pods became saturated, causing requests to slow down or time out. This saturation cascaded to upstream services, impacting site availability more broadly.

Although a built-in autoscaling mechanism was in place and functioned as designed, it reached its maximum limit and proved insufficient because the underlying issue was related to per-pod CPU constraints rather than the number of pods. The observed workload pattern required additional per-service processing capacity to compensate for these limitations and restore stable operation.
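The root cause above can be illustrated with a rough back-of-the-envelope model. This is only a sketch under stated assumptions, not the platform's actual capacity model: it assumes each heavy request is single-threaded, that requests are spread evenly across pods, and that a pod shares its CPU fairly among the requests it holds. The function name and all numbers (pod ceiling, CPU values, request counts) are hypothetical.

```python
def heavy_request_latency_s(work_cpu_s, concurrent_requests, pods, cpu_per_pod):
    """Estimate the latency of one CPU-heavy request under fair CPU sharing.

    Assumption: a request uses at most one core, so it only slows down
    once a pod holds more concurrent requests than it has cores.
    """
    requests_per_pod = concurrent_requests / pods
    slowdown = max(1.0, requests_per_pod / cpu_per_pod)
    return work_cpu_s * slowdown


# Hypothetical figures: the autoscaler is already at its ceiling, so the
# pod count cannot grow any further during the burst.
MAX_PODS = 10

# 80 concurrent heavy requests, each needing 2 CPU-seconds of work.
before = heavy_request_latency_s(2.0, 80, MAX_PODS, cpu_per_pod=2.0)
after = heavy_request_latency_s(2.0, 80, MAX_PODS, cpu_per_pod=8.0)

print(before, after)  # 8.0 2.0
```

In this toy model, once the pod count is pinned at its maximum, the only remaining lever is the CPU available to each pod, which is consistent with the remediation described in the postmortem.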

### Corrective Actions:

**Detection:** Although a built-in alerting mechanism was in place to detect the failure condition, the alarm did not reach the responsible team as expected. Subsequently, internal support teams began receiving customer reports of slowness when accessing sites within the CXone Mpower Expert knowledge portal. The condition later escalated to complete service inaccessibility, with users encountering “503 Service Unavailable” and “504 Gateway Timeout” errors.

**Remediation:** The issue was resolved by scaling up multiple services, increasing the CPU allocation per pod to better handle traffic spikes, and restarting the affected API services to apply the updated resource configurations, which restored normal service operation. Completed on 04/29/2026.

**Prevention:** The Engineering team will enhance alert notification delivery to ensure alarms are reliably triggered and routed to the appropriate response teams, enabling faster detection and more timely corrective action when issues arise. An update will be provided by End of Day (EOD) MT on 05/15/2026.

While the risk of recurrence is currently low following the increase in resource capacity, the Engineering team will implement targeted performance improvements to the affected API component that contributed to this incident. In addition, code-level safeguards will be introduced to ensure that, under similar conditions, inefficient requests are terminated sooner and system resources are released more quickly. These measures are intended to enable faster failure handling and prevent prolonged CPU saturation. An update will be provided by EOD MT on 06/05/2026.

**Incident Timeline (UTC)**

04/29/2026 10:31 AM (UTC) - Telemetry detected initial slow responses and timeout errors, marking the start of customer impact.

04/29/2026 10:51 AM (UTC) - The first customer case was opened, and Tech Support (TS) engineers began the troubleshooting investigation.

04/29/2026 11:20 AM (UTC) - TS engineers notified the Network Operations Center (NOC) engineers about the reported customer impact; a major incident was proposed and confirmed.

04/29/2026 11:41 AM (UTC) - Engineers identified a suspected cause and began remediation steps.

04/29/2026 11:56 AM (UTC) - The issue was resolved after scaling up resources and restarting the affected API services. Following successful validation, the major incident was marked as resolved.
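The code-level safeguard described under Prevention (terminating inefficient requests sooner so resources are released) might look something like the following. This is a minimal sketch, not the Engineering team's implementation: `Deadline`, `DeadlineExceeded`, and `process_request` are hypothetical names, and the approach shown is a cooperative per-request time budget checked between units of CPU-heavy work.

```python
import time


class DeadlineExceeded(RuntimeError):
    """Raised when a request has used up its processing budget."""


class Deadline:
    """Tracks a per-request time budget (hypothetical helper)."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def check(self):
        """Abort the request if its budget is spent."""
        if self._clock() >= self._expires_at:
            raise DeadlineExceeded("request exceeded its processing budget")


def process_request(items, deadline, step):
    """Run a CPU-heavy step per item, checking the deadline between items."""
    results = []
    for item in items:
        deadline.check()  # fail fast instead of holding the CPU
        results.append(step(item))
    return results


# Usage: a request that finishes within its (hypothetical) 60-second budget.
print(process_request([1, 2], Deadline(60.0), lambda x: x + 1))  # [2, 3]
```

A cooperative check like this only takes effect between work items, so a single very long step would still run to completion; the granularity of the checks is a design choice that depends on the workload.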