MindTouch incident

Sites Down

Critical, Resolved

MindTouch experienced a critical incident on December 4, 2024, affecting Application (General Service), Search, and other components, lasting 35 minutes. The incident has been resolved; the full update timeline is below.

Started
Dec 04, 2024, 08:50 AM UTC
Resolved
Dec 04, 2024, 09:25 AM UTC
Duration
35m
Detected by Pingoru
Dec 04, 2024, 08:50 AM UTC
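
The 35m duration above runs from the first status-page update (08:50 AM UTC) to resolution, while the postmortem further down reports an earlier impact start (08:39 AM UTC). The sketch below only reproduces that arithmetic from the timestamps quoted on this page; the helper name `minutes_between` and the format string are illustrative assumptions, not part of any vendor tooling.

```python
from datetime import datetime, timezone

# Timestamp layout used in the postmortem, e.g. "12/04/2024 08:39 AM" (UTC).
FMT = "%m/%d/%Y %I:%M %p"


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two UTC timestamps written in FMT."""
    t0 = datetime.strptime(start, FMT).replace(tzinfo=timezone.utc)
    t1 = datetime.strptime(end, FMT).replace(tzinfo=timezone.utc)
    return int((t1 - t0).total_seconds() // 60)


# Window reported by the status page (first update to resolution).
status_page_minutes = minutes_between("12/04/2024 08:50 AM", "12/04/2024 09:25 AM")

# Window reported by the postmortem (impact start to impact end).
postmortem_minutes = minutes_between("12/04/2024 08:39 AM", "12/04/2024 09:25 AM")

print(status_page_minutes, postmortem_minutes)
```

The two windows differ because internal detection at 08:39 AM UTC preceded the first public status update.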

Affected components

Application (General Service), Search, In-Product Contextual Help, Email Services, MindTouch Success Center, Analytics

Update timeline

  1. investigating Dec 04, 2024, 08:50 AM UTC

    We are currently investigating this issue.

  2. investigating Dec 04, 2024, 08:51 AM UTC

    We are continuing to investigate this issue.

  3. monitoring Dec 04, 2024, 09:03 AM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Dec 04, 2024, 09:25 AM UTC

    This incident has been resolved.

  5. postmortem Dec 12, 2024, 11:41 PM UTC

    Impact Start Time (UTC): 12/04/2024 08:39 AM UTC
    Impact End Time (UTC): 12/04/2024 09:25 AM UTC

    **Summary**

    Updated on 12/12/2024 - On 12/04/2024, some NICE CXone customers reported receiving a "504 Gateway Timeout" error when accessing the CXone Expert knowledge portal. The impact stemmed from a procedural error where the sequence of actions performed during the implementation of a planned network change caused misconfigurations on the recently updated subnets following a network node upgrade. The issue was resolved when engineers disabled the misconfigured subnets and then restarted the network service, fully restoring operations.

    ## **Root Cause**

    The impact stemmed from a procedural error where the sequence of actions performed during the implementation of a planned network change caused misconfigurations on the recently updated subnets following a network node upgrade. This resulted in a misconfigured subnet routing table, preventing the new subnets from routing to their intended private network route.

    During the regular rotation of service node assignments as part of an automated system process, some nodes were assigned to the misconfigured subnets. Engineers were not aware that the service node rolling restart automation had to be disabled while updating the subnets, and this step was not included in the procedural guide. As a result, this caused an unwanted network traffic disruption, resulting in the observed impact.

    Although the issue was quickly detected and resolved, engineers recognized the lessons learned and identified key opportunities for improvement. Preventive measures were put in place to avoid similar issues in the future.

    ## **Corrective Actions**

    **Detection**

    * Internal support teams identified a potentially customer-impacting issue through proactive alarms and monitoring mechanisms, which was later confirmed by customer reports of a "504 Gateway Timeout" error when accessing the CXone Expert knowledge portal.

    **Remediation**

    * The issue was resolved when engineers disabled the misconfigured subnets and then restarted the network service, fully restoring operations. Completed on 12/04/2024.

    **Prevention**

    * The engineering team updated their procedural guides based on the lessons learned and ensured that all engineers were informed of the new updates to prevent similar issues during future network changes. Completed on 12/11/2024.

    ## **Incident Timeline (UTC)**

    12/04/2024 08:39 AM (UTC) - Internal support engineers detected the issue and began validating and isolating the cause. Engineers notified the technical groups about the identified incident.

    12/04/2024 08:42 AM (UTC) - While engineers were actively restoring the service, the first customer case was opened, and Tech Support (TS) engineers began troubleshooting the issue. This was later validated and confirmed to be related to the major incident.

    12/04/2024 09:03 AM (UTC) - Engineers identified a suspected cause and began remediation steps.

    12/04/2024 09:25 AM (UTC) - Impact was resolved after engineers developed and deployed a fix and internal tests were successful. Impact and major incident resolved.

    12/04/2024 09:54 PM (UTC) - A proactive major incident was raised to document this event due to the potential for customer impact, which was later confirmed through customer-reported incident cases.
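
The root cause describes a missing procedural step: the service node rolling restart automation was left running while subnets were being updated, so nodes rotated onto subnets whose routing was not yet correct. One way such a step could be enforced in tooling rather than only in a guide is a precondition check. The sketch below is purely illustrative; `ClusterState`, `update_subnets`, `ChangePreconditionError`, and the `validate_route` callback are hypothetical names and do not describe the vendor's actual systems or fix.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable


class ChangePreconditionError(RuntimeError):
    """Raised when a network change is attempted in an unsafe state."""


@dataclass
class ClusterState:
    # Hypothetical model: whether automated node rotation is currently active.
    rolling_restart_enabled: bool = True
    # Subnet name -> True once its route has been validated and it is in use.
    subnets: Dict[str, bool] = field(default_factory=dict)


def update_subnets(
    state: ClusterState,
    new_subnets: Iterable[str],
    validate_route: Callable[[str], bool],
) -> None:
    """Register new subnets only if rotation is paused and every route validates."""
    if state.rolling_restart_enabled:
        raise ChangePreconditionError(
            "pause rolling-restart automation before updating subnets"
        )
    names = list(new_subnets)
    # Validate every subnet before registering any, so a failure leaves state unchanged.
    for name in names:
        if not validate_route(name):
            raise ChangePreconditionError(
                f"subnet {name!r} does not route to its private network"
            )
    for name in names:
        state.subnets[name] = True
```

With this shape, a change script that forgets to pause rotation fails immediately instead of letting nodes land on unvalidated subnets; the caller sets `rolling_restart_enabled = False` first and re-enables it after the update succeeds.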