MindTouch incident

MindTouch Service Degradation: Sites unavailable

Minor Resolved

MindTouch experienced a minor incident on October 3, 2024, affecting Application (General Service), Search, and four other components, lasting 1h 34m. The incident has been resolved; the full update timeline is below.

Started
Oct 03, 2024, 04:34 PM UTC
Resolved
Oct 03, 2024, 06:09 PM UTC
Duration
1h 34m
Detected by Pingoru
Oct 03, 2024, 04:34 PM UTC

Affected components

Application (General Service), Search, In-Product Contextual Help, Email Services, MindTouch Success Center, Analytics

Update timeline

  1. investigating Oct 03, 2024, 04:34 PM UTC

    MindTouch Service Degradation: Sites unavailable. The MindTouch Engineering team is investigating reports of site unavailability.

  2. monitoring Oct 03, 2024, 04:51 PM UTC

    A fix has been implemented and we are monitoring the results.

  3. resolved Oct 03, 2024, 06:09 PM UTC

    This incident has been resolved.

  4. postmortem Jan 27, 2025, 09:09 PM UTC

    ## Incident Summary

    On 10/03/2024, a NICE CXone customer reported encountering a “503 error” when accessing the CXone Expert knowledge platform. The impact was caused by a failure of the node hosting the Domain Name System (DNS) service. The issue self-resolved after several retries of the built-in self-recovery mechanism, which spun up a new node instance and restored the service.

    ## Root Cause

    The impact was caused by a failure of the node hosting the DNS service. The platform is dynamic and scalable by design, allowing it to self-recover in such situations. However, when it attempted to schedule a new node instance, the process failed due to insufficient available Internet Protocol (IP) addresses within the subnet, preventing the service from starting.

    The impacted node became inaccessible when it was removed from our cloud service provider’s platform after a new node was spun up, making it impossible to determine what triggered the issue. Additionally, engineers reviewed the event logs but found no indicators to trace the failure.

    ## Corrective Actions

    **Detection:** Internal support teams detected a potentially customer-impacting issue through proactive alarm and monitoring mechanisms, which was later confirmed through a customer report of a “503 error” when accessing the CXone Expert knowledge platform.

    **Remediation:** The issue self-resolved after several retries of the built-in self-healing mechanism, which spun up a new node instance and restored the service. Completed on 10/03/2024.

    **Prevention:** Engineers scaled down some unused services to make more network address space available and to prevent failures in the dynamic recovery systems. Completed on 10/03/2024. Engineers implemented additional subnets within the infrastructure to allow the addition of new IP addresses as part of the network expansion effort. Completed on 10/17/2024.

    ## Incident Timeline (UTC)

    10/03/2024 04:34 PM (UTC) - Engineers notified the Network Operations Center (NOC) about an alerting condition that could lead to customer impact; a major incident was proposed and confirmed.

    10/03/2024 04:51 PM (UTC) - Engineers identified a suspected cause and began remediation steps; the first customer case was opened.

    10/03/2024 06:09 PM (UTC) - The issue self-resolved after several retries of the built-in self-recovery mechanism, and internal tests were successful. Impact and major incident resolved.
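
The root cause in the postmortem, a replacement node that could not be scheduled because the subnet had too few free IP addresses, comes down to simple capacity arithmetic. The following is a minimal sketch of that check using Python's standard `ipaddress` module. The CIDR block, the reserved-address count, and the usage figures are hypothetical values chosen for illustration; the function names `free_addresses` and `can_schedule_node` are likewise assumptions and do not come from the vendor's tooling.

```python
import ipaddress


def free_addresses(cidr: str, in_use: int, reserved: int = 0) -> int:
    """Return how many addresses in the subnet are still unassigned."""
    network = ipaddress.ip_network(cidr)
    # Total size of the block minus provider-reserved and already-assigned addresses.
    return max(network.num_addresses - reserved - in_use, 0)


def can_schedule_node(cidr: str, in_use: int, needed: int, reserved: int = 0) -> bool:
    """True if the subnet has enough free addresses for a new node instance."""
    return free_addresses(cidr, in_use, reserved) >= needed


# Hypothetical figures, not taken from the incident report.
SUBNET = "10.0.0.0/24"  # a /24 block holds 256 addresses
RESERVED = 5            # assumed number of addresses held back by the provider
NEEDED = 4              # assumed addresses required by one new node

# Nearly full subnet: scheduling the replacement node fails.
print(can_schedule_node(SUBNET, in_use=250, needed=NEEDED, reserved=RESERVED))

# After scaling down unused services, enough addresses are free again.
print(can_schedule_node(SUBNET, in_use=230, needed=NEEDED, reserved=RESERVED))
```

Under these assumed numbers the first check fails and the second succeeds, which mirrors the two prevention steps described above: freeing addresses by scaling down unused services, and adding subnets so that the pool of assignable addresses grows.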