ServiceChannel incident

ServiceChannel System Performance Degradation

ServiceChannel experienced a critical incident on September 8, 2023 that affected Asset Manager and lasted 37 minutes. The incident has been resolved; the full update timeline is below.

Started: Sep 08, 2023, 09:03 PM UTC
Resolved: Sep 08, 2023, 09:41 PM UTC
Duration: 37m
Detected by Pingoru: Sep 08, 2023, 09:03 PM UTC

Affected components

Asset Manager

Update timeline

investigating Sep 08, 2023, 09:03 PM UTC

We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.

monitoring Sep 08, 2023, 09:16 PM UTC

System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.

resolved Sep 08, 2023, 09:41 PM UTC

This incident has been resolved. All services are working as expected.

postmortem Oct 05, 2023, 01:16 PM UTC

**Incident Report: Infrastructure/Hardware Instability**

**Date of Incident:** 09/08/2023

**Time/Date Incident Started:** 09/08/2023, 04:18 pm EDT

**Time/Date Stability Restored:** 09/08/2023, 05:08 pm EDT

**Time/Date Incident Resolved:** 09/08/2023, 05:15 pm EDT

**Users Impacted:** All

**Frequency:** Intermittent

**Impact:** Major

**Incident description:**

On September 8th at 04:18 pm EDT, the Site Reliability Engineering (SRE) team received an alert regarding "SQL timeout errors" and subsequent reports of dashboard slowness. This slowness had a significant impact on a large number of users, resulting in a suboptimal experience.

**Root Cause Analysis:**

Upon conducting a thorough investigation, the Database Administration (DBA) team identified a series of database requests that were causing blocks and imposing a high CPU load on the database replica servers. This, in turn, led to an increased number of "resource waits." As a preemptive measure, the DBA team initiated a restart of the SQL service on both database replica servers. Following the successful restart of the SQL service, the system's stability was closely monitored and subsequently restored.

**Actions Taken:**

1. Investigated system-generated alerts and identified affected platform functionality.
2. DBA team proactively initiated SQL service restart on database replica servers.

**Mitigation Measures:**

In response to this incident, the following mitigation measures have been implemented:

1. Ongoing Investigation: The team is continuing to investigate the root causes of the high CPU usage and blockages on the database servers.
2. Database Query Performance Improvements: Efforts are being made to enhance the performance of database queries to ensure the overall stability of the platform.
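
As an illustration of the kind of analysis described under Root Cause Analysis, the sketch below shows one way blocking requests might be traced back to the sessions at the head of each blocking chain. It is not ServiceChannel's tooling: the `Request` record, its field names, and the sample session IDs are all hypothetical, and a real investigation would read this information from the database server's own session and request views.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional


class Request(NamedTuple):
    """Hypothetical snapshot row for one active database request."""
    session_id: int
    blocked_by: Optional[int]  # session this request is waiting on, if any


def head_blockers(requests: Iterable[Request]) -> Dict[int, List[int]]:
    """Map each head-blocker session to the sessions stuck behind it.

    A head blocker is a session that blocks others but is not itself
    blocked (or, in a cycle, the point where the chain repeats).
    """
    blocked_by = {
        r.session_id: r.blocked_by for r in requests if r.blocked_by is not None
    }
    chains: Dict[int, List[int]] = defaultdict(list)
    for session_id in blocked_by:
        seen = {session_id}
        current = blocked_by[session_id]
        # Walk up the chain until reaching a session that is not blocked,
        # stopping early if the chain loops back on itself.
        while current in blocked_by and current not in seen:
            seen.add(current)
            current = blocked_by[current]
        chains[current].append(session_id)
    return {head: sorted(waiters) for head, waiters in chains.items()}


# Hypothetical sample: 53 waits on 52, which waits on 51; 61 waits on 60.
sample = [
    Request(51, None),
    Request(52, 51),
    Request(53, 52),
    Request(60, None),
    Request(61, 60),
]
result = head_blockers(sample)
```

With this sample, sessions 51 and 60 come out as the head blockers. Grouping waiters under their head blocker points at the few requests worth examining first, which is a narrower intervention than restarting the service.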