ServiceChannel incident

ServiceChannel System Performance Degradation

Major · Resolved

ServiceChannel experienced a major incident on June 12, 2025, affecting SC Mobile and Work Order Manager, lasting 6h 8m. The incident has been resolved; the full update timeline is below.

Started
Jun 12, 2025, 02:14 PM UTC
Resolved
Jun 12, 2025, 08:22 PM UTC
Duration
6h 8m
Detected by Pingoru
Jun 12, 2025, 02:14 PM UTC

Affected components

SC Mobile, Work Order Manager

Update timeline

  1. investigating Jun 12, 2025, 02:14 PM UTC

    We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.

  2. investigating Jun 12, 2025, 04:24 PM UTC

    We continue to see issues with work order search; the team is actively troubleshooting.

  3. monitoring Jun 12, 2025, 06:44 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Jun 12, 2025, 08:22 PM UTC

    This incident has been resolved.

  5. postmortem Jun 23, 2025, 07:40 PM UTC

    **Date of Incident:** 06/12/2025
    **Time/Date Incident Started:** 06/12/2025, 9:35 am EDT
    **Time/Date Stability Restored:** 06/12/2025, 2:12 pm EDT
    **Time/Date Incident Resolved:** 06/12/2025, 2:12 pm EDT
    **Users Impacted:** All
    **Frequency:** Intermittent
    **Impact:** Major

    **Incident description:** US clients experienced intermittent issues loading their work order lists due to degraded performance in the underlying services responsible for filtering and NTE calculations.

    **Root Cause Analysis:** Processing certain very large provider accounts with a high number of serviceable locations triggered significant memory overconsumption, causing prolonged response times (>60 seconds) from an internal Elasticsearch service that affected all users. The system had been functioning normally until these large provider accounts triggered the condition, causing a cascading effect on work order lists for all users. The intermittent nature of these issues extended the overall duration of user-facing problems.

    **Actions Taken:** As soon as the issue was identified, our team initiated a series of mitigation steps to restore service as quickly as possible:

    * Restarted and scaled out the Elasticsearch service to address potential performance or resource bottlenecks. This had a slight positive effect but was not enough to stabilize the system.
    * Rolled back recent changes affecting client interaction with Elasticsearch, resulting in minor improvements.
    * Removed multiple clients from the Elasticsearch service to reduce load, which slightly decreased strain, but overall service stability was still insufficient.
    * Rolled back Elasticsearch itself to a previously stable version, reducing some system pressure but not fully resolving the problem.
    * Finally, rolled back the entire release, leading to system recovery and a return to normal service performance within approximately 30 minutes.

    **Mitigation Measures:**

    * Implemented code updates to address an edge case affecting providers with extensive serviceable location networks.
    * Enhanced code for better system resilience with effective Elasticsearch fallback mechanisms (an illustrative sketch of this pattern follows the timeline).
    * Expanded monitoring capabilities with incident-specific metrics to increase visibility and decrease diagnostic time.
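
The postmortem's mitigation bullets mention Elasticsearch fallback mechanisms and incident-specific metrics without showing how they fit together. The sketch below is a minimal illustration of that pattern, not ServiceChannel's implementation: `search_elasticsearch`, `search_fallback_store`, the 5-second budget, and the metric name are all hypothetical placeholders.

```python
# Minimal sketch: timeout-guarded primary search with a degraded fallback
# and a per-request latency metric. All names and thresholds are assumptions
# for illustration, not ServiceChannel code.
import concurrent.futures
import logging
import time

log = logging.getLogger("work_order_search")

SEARCH_TIMEOUT_SECONDS = 5  # assumed budget, well under the >60s responses seen in the incident

# Reused executor so a slow primary call can be abandoned without blocking the caller.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def search_elasticsearch(filters: dict) -> list[dict]:
    """Hypothetical primary, Elasticsearch-backed work order search."""
    raise NotImplementedError  # placeholder


def search_fallback_store(filters: dict) -> list[dict]:
    """Hypothetical simpler search path that stays available when Elasticsearch degrades."""
    return []  # placeholder, e.g. a database-backed or unfiltered list


def search_work_orders(filters: dict) -> list[dict]:
    """Try the primary search within a time budget; otherwise serve degraded results."""
    started = time.monotonic()
    future = _pool.submit(search_elasticsearch, filters)
    try:
        results = future.result(timeout=SEARCH_TIMEOUT_SECONDS)
    except Exception:  # includes TimeoutError raised by future.result()
        # The slow call keeps running in its worker thread, but this request
        # returns degraded results instead of stalling behind it.
        log.warning("elasticsearch search degraded, using fallback")
        results = search_fallback_store(filters)
    # Incident-specific metric: make slow searches visible before they cascade.
    log.info("work_order_search_latency_seconds=%.3f", time.monotonic() - started)
    return results
```

The point of the bounded wait is that one overloaded dependency cannot hold every user's work order list hostage, and the emitted latency figure gives on-call engineers a direct signal of the degradation described in the root cause analysis.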