ServiceChannel incident

ServiceChannel System Performance Degradation

Major, Resolved

ServiceChannel experienced a major incident on July 10, 2023, affecting Proposal Manager, Invoice Manager, and Universal Connector and lasting about 4 hours. The incident has been resolved; the full update timeline is below.

Started
Jul 10, 2023, 05:46 PM UTC
Resolved
Jul 10, 2023, 09:47 PM UTC
Duration
4h
Detected by Pingoru
Jul 10, 2023, 05:46 PM UTC

Affected components

Proposal Manager, Invoice Manager, Universal Connector

Update timeline

  1. investigating Jul 10, 2023, 05:46 PM UTC

    We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.

  2. monitoring Jul 10, 2023, 07:24 PM UTC

    System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.

  3. resolved Jul 10, 2023, 09:47 PM UTC

    This incident has been resolved. All services are working as expected.

  4. postmortem Jul 28, 2023, 02:44 PM UTC

    **Date of Incident:** 07/10/2023
    **Time/Date Incident Started:** 07/10/2023, 1:36 PM EDT
    **Time/Date Stability Restored:** 07/10/2023, 2:27 PM EDT
    **Time/Date Incident Resolved:** 07/10/2023, 2:53 AM EDT
    **Users Impacted:** All users
    **Frequency:** Sustained
    **Impact:** Major

    **Incident description:** On July 10th at approximately 1:36 PM EDT, customers encountered significant slowness after logging into the platform. The slowness impacted a large number of users, leading to a suboptimal experience.

    **Root Cause Analysis:** The ServiceChannel SRE (Site Reliability Engineering) and DBA (Database Admin) teams responded to an automated alert triggered by high CPU usage on database read replicas. Upon investigation, the DBA team identified a new module and functionality that was executing excessively long queries against the read replicas. This new module was recently enabled for internal vendor logins.

    **Actions Taken:**

    1. The SRE and DBA teams promptly reviewed and acknowledged monitoring alerts related to the affected part of the application.
    2. The DBA and engineering teams collaborated to identify the root cause of the high loads, which was traced back to the newly enabled functionality for internal vendor logins.
    3. To mitigate the issue, the DBA and engineering teams disabled the problematic functionality through a functionality feature switch.

    **Mitigation Measures:**

    1. Improved monitoring of data-enabled functionality (functionality feature switches) to quickly detect anomalies and potential performance issues.
    2. Implementation of a more aggressive graceful degradation approach, selectively disabling problematic functionality when high loads are detected to prevent widespread impact.
    3. Continuous improvement of stress tests in lower environments to enhance the discovery of similar performance-related issues.
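
The graceful degradation measure described in the postmortem (disabling a feature through its switch when high load is detected) could look roughly like the sketch below. This is an illustration only, not ServiceChannel's implementation: the class `LoadGuardedSwitch`, the helper `run_guarded`, the 85% CPU threshold, and the three-sample window are all hypothetical names and values chosen for the example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, TypeVar

T = TypeVar("T")


@dataclass
class LoadGuardedSwitch:
    """Hypothetical feature switch that disables itself under sustained load."""

    name: str
    cpu_threshold: float = 0.85  # assumed limit: fraction of read-replica CPU
    window: int = 3              # assumed: consecutive high samples needed to trip
    enabled: bool = True
    _samples: Deque[float] = field(default_factory=deque)

    def record_cpu(self, cpu: float) -> None:
        """Store a CPU sample and trip the switch if the whole window is high."""
        self._samples.append(cpu)
        if len(self._samples) > self.window:
            self._samples.popleft()
        window_full = len(self._samples) == self.window
        if self.enabled and window_full and all(
            s >= self.cpu_threshold for s in self._samples
        ):
            # Stay off until an operator re-enables the feature explicitly.
            self.enabled = False

    def is_enabled(self) -> bool:
        return self.enabled


def run_guarded(
    switch: LoadGuardedSwitch,
    feature: Callable[[], T],
    fallback: Callable[[], T],
) -> T:
    """Run the feature only while its switch is on; otherwise degrade."""
    return feature() if switch.is_enabled() else fallback()
```

Requiring several consecutive high samples before tripping is a design choice meant to avoid disabling a feature on a single CPU spike. The switch deliberately does not re-enable itself when load drops, so the feature stays off until someone has reviewed the cause.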