ServiceChannel incident
ServiceChannel Work Order Report Downloads
ServiceChannel experienced a major incident on April 1, 2024 affecting Work Order Manager, lasting 2h 18m. The incident has been resolved; the full update timeline is below.
Affected components
- Work Order Manager
Update timeline
- investigating Apr 01, 2024, 03:58 PM UTC
We are actively investigating an issue with Work Order Report Downloads. An update will be provided shortly. Thank you for your patience.
- monitoring Apr 01, 2024, 05:03 PM UTC
System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.
- resolved Apr 01, 2024, 06:16 PM UTC
This incident has been resolved. All services are working as expected.
- postmortem Apr 19, 2024, 08:05 PM UTC
**Increased Platform Latency and Unresponsive Work Order Reports: Incident Report**

**Date of Incident:** 04/01/2024

**Time/Date Incident Started:** 04/01/2024, 10:34 am EST

**Time/Date Stability Restored:** 04/01/2024, 12:45 pm EST

**Time/Date Incident Resolved:** 04/01/2024, 1:05 pm EST

**Users Impacted:** All Users

**Frequency:** Intermittent

**Impact:** Major

**Incident description:** Users experienced sporadic latency and timeouts while using the ServiceChannel Platform, particularly the work order report services.

**Root Cause Analysis:** The automated monitoring systems of the ServiceChannel SRE and DBA teams detected elevated CPU utilization on database read replicas. A subsequent investigation of the logs showed that the incident coincided with a spike in user traffic. This surge caused extended wait times for certain ServiceChannel services, notably the Excel report services, leading to slower page loads and timeouts. The SRE team responded by scaling up infrastructure resources to accommodate the increased traffic. Once capacity was expanded, normal system operations resumed.

**Actions Taken:**

1. Manually tested our services to replicate the issue.
2. Isolated the performance degradation to report queues and related database services.
3. Enhanced the capacity of affected services to manage the load and restore full functionality.

**Mitigation Measures:**

1. Expansion of database resources to more effectively manage reporting queues.
2. Implementation of refined monitoring systems for better oversight of reporting queues.
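
The report does not describe how the monitoring that detected elevated CPU on the read replicas works, or what the refined monitoring will look like. Purely as an illustration, one common pattern is to alert only when a metric stays above a threshold for several consecutive samples, so that a single brief spike does not page anyone. The sketch below assumes this pattern; the class name, the 85% threshold, the three-sample window, and the readings are all hypothetical and are not taken from ServiceChannel's systems.

```python
from collections import deque


class SustainedThresholdAlert:
    """Fire when a metric stays at or above a threshold for N consecutive samples.

    Hypothetical helper for illustration; not ServiceChannel's implementation.
    """

    def __init__(self, threshold: float, window: int) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self.threshold = threshold
        self.window = window
        # Keep only the most recent `window` samples.
        self._samples = deque(maxlen=window)

    def observe(self, value: float) -> bool:
        """Record one sample and return True if the alert condition holds."""
        self._samples.append(value)
        return (
            len(self._samples) == self.window
            and all(v >= self.threshold for v in self._samples)
        )


# Hypothetical CPU utilization readings (percent) from one read replica.
alert = SustainedThresholdAlert(threshold=85.0, window=3)
readings = [60.0, 88.0, 91.0, 79.0, 90.0, 92.0, 95.0]
fired = [alert.observe(r) for r in readings]
```

With these made-up readings, the alert stays quiet through the dip to 79% and fires only on the last sample, once three readings in a row are at or above the threshold. The same check could be pointed at a report queue depth instead of CPU, which is closer to the "oversight of reporting queues" named in the mitigation measures.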