MindTouch incident
CXone Mpower Expert - Monitoring complete. Status = All Services Running Normally
MindTouch experienced a major incident on April 10, 2025, affecting Application (General Service), Search, and 1 more component, lasting 15m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Apr 10, 2025, 01:59 PM UTC
CXone Mpower Expert Service Degradation: Sites unavailable. The CXone Mpower Expert Engineering team is investigating reports of site unavailability.
- identified Apr 10, 2025, 02:18 PM UTC
CXone Mpower Expert Service Degradation: Sites unavailable. The issue has been identified and a fix is being worked on for deployment.
- identified Apr 10, 2025, 02:36 PM UTC
CXone Mpower Expert Service Degradation: Sites unavailable. The issue has been identified and a fix is being worked on for deployment.
- monitoring Apr 10, 2025, 02:48 PM UTC
CXone Mpower Expert - Fix Deployed. Event duration (32 mins). Status = initiating monitoring
- resolved Apr 10, 2025, 03:03 PM UTC
CXone Mpower Expert - Service Disruption Resolved - All Services Running Normally. The CXone Mpower Expert Engineering team has deployed a fix and monitored the deployment to make sure sites are stable. The issue is now resolved. Event duration: 32 minutes.
- postmortem Apr 16, 2025, 05:55 PM UTC
**Impact Start Time (UTC)** - 04/10/2025 01:50 PM (UTC)

**Impact End Time (UTC)** - 04/10/2025 02:22 PM (UTC)

**Incident Summary** Updated on 04/16/2025 - On 4/10/2025, some CXone Mpower customers reported that they could not access CXone Mpower Expert application sites. They encountered "504 Gateway Timeout" error messages and slow response times when attempting to access the affected sites. The system was overwhelmed because numerous middleware service pods were making requests to the Application Programming Interface (API) pods, which could not keep up because each API pod had a limited number of database connections. The impact was resolved by increasing the resource capacity of the API pods, which allowed the system to process the requests that had queued up as well as the normal traffic.

**Root Cause** To address previous slow response times, internal teams had increased the number of middleware service pods for better throughput. However, each API pod had a limited number of database connections, which did not scale with the increased traffic and created a bottleneck. The requests from the larger number of middleware service pods exceeded the connection limit per API pod, leading to delays. Because the API pods were unable to scale automatically, the system entered an unhealthy state.

**Corrective Actions**

**Detection:** Internal support teams detected a potentially customer-impacting issue through proactive alarms and monitoring mechanisms. Customer reports later confirmed the issue: customers could not access CXone Mpower Expert application sites and encountered "504 Gateway Timeout" error messages and slow response times.
**Remediation:** The impact was resolved by increasing the resource capacity of the API pods, which allowed the system to process the requests that had queued up as well as the normal traffic. Completed on 04/10/2025.

**Prevention:** The Engineering team is actively working to have the API pods scale off of other metrics to help with the API scaling issues.

**Incident Timeline (UTC)**

- 04/10/2025 01:50 PM (UTC) - The Engineering team detected a potential customer issue.
- 04/10/2025 02:06 PM (UTC) - The first customer case was opened, and Tech Support (TS) engineers began the troubleshooting investigation.
- 04/10/2025 02:08 PM (UTC) - Tech Support (TS) engineers notified the Network Operations Center (NOC) engineers about the reported customer impact; a major incident was proposed and confirmed.
- 04/10/2025 02:18 PM (UTC) - Engineers identified a suspected cause and began the remediation steps.
- 04/10/2025 02:22 PM (UTC) - Impact was resolved after engineers identified the issue and implemented a fix, and internal tests were successful. Impact and major incident resolved.
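
The bottleneck described under Root Cause can be illustrated with a rough capacity model. The sketch below is not taken from the postmortem: the pod counts, connection limits, request rates, and the `ApiTier` and `utilization` names are all hypothetical, and it assumes each request holds exactly one database connection for its full duration. It also models the remediation as simply doubling the number of API pods, which is only one possible reading of "increasing the resource capacity".

```python
from dataclasses import dataclass


@dataclass
class ApiTier:
    """Hypothetical model of the API tier; none of these figures come from the report."""
    pods: int                    # number of API pods
    db_connections_per_pod: int  # fixed database connection limit per pod
    avg_db_time_ms: int          # average time one request holds a connection

    def max_throughput_rps(self) -> float:
        # Each connection serves 1000 / avg_db_time_ms requests per second,
        # so the tier's ceiling is the total connection count times that rate.
        return self.pods * self.db_connections_per_pod * 1000 / self.avg_db_time_ms


def utilization(api: ApiTier, middleware_pods: int, rps_per_middleware_pod: int) -> float:
    """Offered load from the middleware tier divided by the API tier's ceiling.

    A value above 1.0 means requests arrive faster than they can be served,
    so they queue until upstream gateways time out.
    """
    offered_rps = middleware_pods * rps_per_middleware_pod
    return offered_rps / api.max_throughput_rps()


# Assumed starting point: 4 API pods, 10 connections each, 50 ms per request.
api = ApiTier(pods=4, db_connections_per_pod=10, avg_db_time_ms=50)

before = utilization(api, middleware_pods=10, rps_per_middleware_pod=60)
after_scaling_middleware = utilization(api, middleware_pods=20, rps_per_middleware_pod=60)

# Assumed remediation: double the API pods so total connections double too.
api_fixed = ApiTier(pods=8, db_connections_per_pod=10, avg_db_time_ms=50)
after_fix = utilization(api_fixed, middleware_pods=20, rps_per_middleware_pod=60)

print(before, after_scaling_middleware, after_fix)  # 0.75 1.5 0.75
```

Under these assumed numbers, doubling the middleware pods pushes utilization from 0.75 to 1.5, which is the kind of overload the postmortem describes. Scaling the API tier on a metric tied to connection use or queue depth, rather than one that stays flat while requests wait on the database, would be one way to approach the Prevention item.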