ServiceChannel incident

ServiceChannel System Performance Degradation

ServiceChannel experienced a major incident on August 31, 2023, affecting Work Order Manager and Maps and lasting 38 minutes. The incident has been resolved; the full update timeline is below.

Started: Aug 31, 2023, 06:36 PM UTC
Resolved: Aug 31, 2023, 07:14 PM UTC
Duration: 38m
Detected by Pingoru: Aug 31, 2023, 06:36 PM UTC

Affected components

Work Order Manager
Maps

Update timeline

investigating Aug 31, 2023, 06:36 PM UTC

We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.

resolved Aug 31, 2023, 07:14 PM UTC

This incident has been resolved. All services are working as expected.

postmortem Sep 14, 2023, 06:13 PM UTC

**Infrastructure/hardware instability**

**Incident Report**

**Date of Incident:** 08/31/2023
**Time/Date Incident Started:** 08/31/2023, 02:15 pm EDT
**Time/Date Stability Restored:** 08/31/2023, 02:47 pm EDT
**Time/Date Incident Resolved:** 08/31/2023, 02:50 pm EDT
**Users Impacted:** All
**Frequency:** Intermittent
**Impact:** Major

**Incident description**

On August 31st at 02:15 pm EDT, the ServiceChannel Site Reliability Engineering (SRE) team received a large number of SQL timeout errors, followed by reports of dashboard slowness.

**Root Cause Analysis**

The Database Administration (DBA) team discovered a growing queue of active database queries and increasing resource waits, resulting from functionality that was causing database blocks and high CPU load on the database cluster.

**Actions Taken**

1. Investigated system-generated alerts and identified affected platform functionality.
2. Recompiled the affected stored procedures and dropped all blocking connections to return the database cluster to the steady state.
3. Compiled incident findings for future remediation by the Application Engineering and SRE teams.

**Mitigation Measures**

1. Coordinate with the Application Engineering team to identify and remediate the root causes of the high database CPU and blocks.
2. Identify and implement general performance improvements for database queries to increase overall platform stability.
3. Implement infrastructural modifications to distribute database I/O across additional read replicas.
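
The report does not say which database engine or tooling was used, so the following is only a minimal sketch of how the blocking connections named under Actions Taken could be located. It assumes a snapshot mapping each session ID to the session that blocks it (or `None` if it is not blocked). The function names and the sample session IDs are hypothetical and are not taken from the incident.

```python
from typing import Dict, List, Optional


def find_head_blockers(blocked_by: Dict[int, Optional[int]]) -> List[int]:
    """Return sessions that block at least one other session
    while not being blocked themselves (the roots of blocking chains)."""
    blockers = {b for b in blocked_by.values() if b is not None}
    return sorted(s for s in blockers if blocked_by.get(s) is None)


def count_blocked(blocked_by: Dict[int, Optional[int]], head: int) -> int:
    """Count sessions waiting directly or indirectly on `head`."""
    # Invert the mapping: blocker -> list of sessions it blocks.
    children: Dict[int, List[int]] = {}
    for session, blocker in blocked_by.items():
        if blocker is not None:
            children.setdefault(blocker, []).append(session)

    # Walk the chain depth-first, guarding against cycles.
    seen = set()
    stack = [head]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    seen.discard(head)
    return len(seen)


# Hypothetical snapshot: session ID -> blocking session ID (None = not blocked).
snapshot = {51: None, 52: 51, 53: 52, 54: 51, 60: None, 61: 60, 70: None}

heads = find_head_blockers(snapshot)
impact = {head: count_blocked(snapshot, head) for head in heads}
```

Ranking head blockers by the number of sessions queued behind them gives an operator a short list of connections to inspect first, rather than dropping every blocked session in the queue.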