MindTouch incident

CXone Expert Degradation: Sites unavailable

Minor · Resolved

MindTouch experienced a minor incident on October 14, 2024, affecting Application (General Service), Search, and other components, and lasting 1h 36m. The incident has been resolved; the full update timeline is below.

Started
Oct 14, 2024, 11:55 AM UTC
Resolved
Oct 14, 2024, 01:32 PM UTC
Duration
1h 36m
Detected by Pingoru
Oct 14, 2024, 11:55 AM UTC

Affected components

Application (General Service), Search, In-Product Contextual Help, Email Services, MindTouch Success Center, Analytics

Update timeline

  1. investigating Oct 14, 2024, 11:55 AM UTC

    CXone Expert Service Degradation: Sites unavailable. The Expert Engineering team is investigating reports of site unavailability.

  2. identified Oct 14, 2024, 12:28 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Oct 14, 2024, 12:49 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Oct 14, 2024, 01:32 PM UTC

    This incident has been resolved.

  5. postmortem Oct 18, 2024, 08:16 PM UTC

    ## **Summary**

    Updated on 10/17/2024

    On 10/14/2024, a NICE CXone customer reported that their knowledge portal sites were intermittently failing to load within the CXone Expert knowledge platform. The impact was caused by a sudden increase in the volume of requests, which overwhelmed the platform and led to service degradation. The issue was resolved when engineers blocked the Internet Protocol (IP) addresses of the offending sources, which restored the services.

    ## **Root Cause**

    The impact was caused by a sudden increase in the volume of requests, which overwhelmed the platform and led to service degradation. Engineers identified a third-party tool that generates PDF content for customer sites and was routing excessive requests back into our network through a router. The router's built-in auto-scaler did not activate as expected because the pod's autoscaling metrics did not reach the threshold, leading to intermittent spikes in workloads. These spikes triggered multiple alerts, along with increased errors, site timeouts, a sudden rise in database connections, and transient traffic.

    Additionally, the system has built-in mechanisms to detect and automatically quarantine malicious activity; however, the headers of this traffic resembled those of normal requests.

    ## **Corrective Actions**

    **Detection**

    * Internal support teams detected a potentially customer-impacting issue through proactive alarm and monitoring mechanisms. It was later confirmed by a customer report that their knowledge portal sites were intermittently failing to load within the CXone Expert knowledge platform.

    **Remediation**

    * The issue was resolved when engineers blocked the IP addresses of the offending sources, which restored the services. Completed on 10/14/2024.

    **Prevention**

    * Engineers opted to permanently block the IP addresses of the offending sources. Completed on 10/14/2024.
    * Engineers implemented additional subnets in the infrastructure to enhance the system's ability to manage increased traffic volume. Completed on 10/17/2024.
    * Engineers are working on adjusting the auto-scaler threshold to include additional load indicators beyond just Central Processing Unit (CPU) usage. An update will be provided by End of Day (EOD) MT on 11/22/2024.

    ## **External Timeline**

    * 10/14/2024 11:55 AM (UTC) - Internal support teams received multiple alerts indicating potential issues and performed initial validations and troubleshooting.
    * 10/14/2024 12:24 PM (UTC) - The customer case was opened; it was later confirmed to be related to this incident.
    * 10/14/2024 12:28 PM (UTC) - Engineers identified a suspected cause and began remediation efforts.
    * 10/14/2024 12:49 PM (UTC) - The impact was resolved when engineers blocked the IP addresses of the offending sources, and internal tests were successful. Impact and major incident resolved.
    * 10/14/2024 01:21 PM (UTC) - A historical major incident was created to document this event.
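
The root cause and the planned auto-scaler change (scaling on load indicators other than CPU usage) can be illustrated with a small sketch. It assumes a ratio-based rule of the kind described in the Kubernetes Horizontal Pod Autoscaler documentation: each metric proposes `ceil(current_replicas * current / target)`, and the largest proposal wins. The postmortem does not state which auto-scaler is in use, and the metric names, targets, and numbers below are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """One load indicator (hypothetical names and values)."""
    name: str
    current: float  # observed average per pod
    target: float   # desired average per pod


def desired_replicas(current_replicas: int, metrics: list[Metric],
                     tolerance: float = 0.1) -> int:
    """Return the largest replica count proposed by any single metric."""
    if not metrics:
        raise ValueError("at least one metric is required")
    proposals = []
    for m in metrics:
        ratio = m.current / m.target
        if abs(ratio - 1.0) <= tolerance:
            # Close enough to the target: this metric proposes no change.
            proposals.append(current_replicas)
        else:
            proposals.append(math.ceil(current_replicas * ratio))
    return max(proposals)


if __name__ == "__main__":
    # Assumed scenario: CPU stays under its target during a request flood.
    cpu = Metric("cpu_percent", current=60, target=70)
    rps = Metric("requests_per_second", current=900, target=300)

    print(desired_replicas(4, [cpu]))       # CPU only: stays at 4
    print(desired_replicas(4, [cpu, rps]))  # CPU + request rate: 12
```

With CPU as the only signal, the replica count stays flat even though request volume has tripled, which mirrors the behavior described in the root cause. Adding a second indicator lets the flood drive the scaling decision.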