ServiceChannel incident

Work Orders issue

Major, Resolved

ServiceChannel experienced a major incident on September 24, 2025, affecting Work Order Manager and lasting 28 minutes. The incident has been resolved; the full update timeline is below.

Started
Sep 24, 2025, 04:50 PM UTC
Resolved
Sep 24, 2025, 05:18 PM UTC
Duration
28m
Detected by Pingoru
Sep 24, 2025, 04:50 PM UTC

Affected components

Work Order Manager

Update timeline

  1. investigating Sep 24, 2025, 04:50 PM UTC

    We are currently investigating this issue.

  2. identified Sep 24, 2025, 05:08 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Sep 24, 2025, 05:08 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Sep 24, 2025, 05:18 PM UTC

    This incident has been resolved.

  5. postmortem Oct 10, 2025, 07:33 PM UTC

    **Incident Report: Unstable Network Component Causes Work Order Notes Issues**

    * **Dates of Incident:** 09/24/2025, 09/26/2025
    * **Time/Date Incident Started:** 09/24/2025, 1:17 PM EDT
    * **Time/Date Stability Restored:** 09/26/2025, 5:43 PM EDT
    * **Time/Date Incident Resolved:** 09/26/2025, 5:43 PM EDT
    * **Users Impacted:** Many
    * **Frequency:** Intermittent
    * **Impact:** Major

    **Incident description:** On September 24 and September 26, 2025, ServiceChannel experienced two brief service disruptions that impacted the Work Order Notes feature for certain customers. Each disruption lasted approximately 15 to 30 minutes, during which some users had difficulty loading notes or received (NTE) errors when attempting to create a note. Our internal monitoring systems also recorded a small rise in HTTP 4xx errors related to these events.

    **Root Cause Analysis:** With the assistance of our cloud provider, the root cause was traced to an internal network component that had entered an unstable state. While in this state, it periodically directed some traffic for the Work Order Notes service to an incorrect internal endpoint. This misdirected traffic was then blocked in line with security protocols, causing connection failures. This behavior corresponds directly to the intermittent periods of disruption and resulted in the loading issues and NTE errors observed by users.

    **Actions Taken:**

    * The SRE team began investigating alerts on 09/24 at 1:17 PM EDT, but troubleshooting was challenging because the issue was intermittent.
    * The incident was escalated to Microsoft support for further analysis.
    * Joint diagnostics identified a faulty internal network component.
    * The SRE team removed the component from production to restore stability.
    * After removal, service was stable again by 5:43 PM EDT on 09/26.

    **Mitigation Measures:**

    * Internal service configurations were audited and updated to eliminate dependencies on the decommissioned network component, removing this failure point from other services.
    * A monitor that detects and reports this error is now part of standard work, so the condition can be identified faster in the future.
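
The report mentions a rise in HTTP 4xx errors and a new monitor, but does not describe how that monitor works. As an illustration only, the sketch below shows one common way to flag an intermittent rise in client errors: track the share of 4xx responses over a sliding window of recent requests. The class name `ClientErrorMonitor`, the window size, and the threshold are all hypothetical and are not taken from ServiceChannel's tooling.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ClientErrorMonitor:
    """Tracks the share of HTTP 4xx responses over the last `window` requests.

    Hypothetical example; not ServiceChannel's actual monitor.
    """
    window: int = 200        # assumed: number of recent requests to consider
    threshold: float = 0.05  # assumed: alert when more than 5% are 4xx
    _recent: deque = field(default_factory=deque)

    def record(self, status_code: int) -> bool:
        """Record one response; return True if the 4xx rate exceeds the threshold."""
        self._recent.append(400 <= status_code < 500)
        if len(self._recent) > self.window:
            self._recent.popleft()
        # Wait for a full window so one early error does not trigger an alert.
        if len(self._recent) < self.window:
            return False
        return sum(self._recent) / len(self._recent) > self.threshold


# Usage: a healthy baseline followed by a burst of blocked requests.
monitor = ClientErrorMonitor(window=10, threshold=0.2)
baseline = [monitor.record(200) for _ in range(10)]
burst = [monitor.record(403) for _ in range(3)]
```

A windowed rate rather than a raw count is used here because the disruption was intermittent: short bursts stand out against recent traffic, while isolated client errors do not raise an alert.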