ServiceChannel incident

ServiceChannel Performance Degradation

ServiceChannel experienced a critical incident on August 4, 2023 affecting two Work Order Manager components and Invoice Manager, lasting 2h 26m. The incident has been resolved; the full update timeline is below.

Started: Aug 04, 2023, 01:19 PM UTC
Resolved: Aug 04, 2023, 03:46 PM UTC
Duration: 2h 26m
Detected by Pingoru: Aug 04, 2023, 01:19 PM UTC

Affected components

Work Order Manager
Work Order Manager
Invoice Manager

Update timeline

investigating Aug 04, 2023, 01:19 PM UTC

We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.

monitoring Aug 04, 2023, 02:19 PM UTC

System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.

resolved Aug 04, 2023, 03:46 PM UTC

This incident has been resolved. All services are working as expected.
postmortem Aug 23, 2023, 08:54 PM UTC

**Incident Report: Secondary Read Replica Unavailability and Application Degradation**

**Date of Incident:** 08/04/2023
**Time/Date Incident Started:** 08/04/2023, 6:51 AM EDT
**Time/Date Stability Restored:** 08/04/2023, 10:00 AM EDT
**Time/Date Incident Resolved:** 08/04/2023, 10:45 AM EDT

**Users Impacted:** All users
**Frequency:** Sustained
**Impact:** Major

**Incident description:** On August 4th at 6:51 AM EDT, the secondary read replica became unavailable. This increased the load on the DB system and caused intermittent slowness that affected a large number of users. The degraded application experience triggered an immediate investigation and response.

**Root Cause Analysis:** The ServiceChannel SRE (Site Reliability Engineering) and DBA (Database Admin) teams responded to an automated alert triggered by an unhealthy state in the AG replication.

The DEVOPS team reviewed all logs from August 4th within the AG replication timeframe and found a configuration modification of the system firewalls that coincided with a triggered restart of the database system. The SRE team traced this change to our configuration management systems, which had inadvertently pushed through a firewall policy modification. The modified database firewall settings blocked traffic to the replica servers, which started the incident.

**Actions Taken:**

1. Immediate Alert Response: The DBA team reviewed and acknowledged the monitoring alerts for the impacted segment of the application, so the issue was recognized promptly.
2. Redeployment and Restart: To restore system stability, the DBA team redeployed and restarted both the primary and secondary database replicas.
3. Persistent Challenges: The performance and availability problems persisted after these initial actions, which required a deeper investigation into the underlying cause.
4. Configuration Management Insights: An analysis of our configuration management system logs revealed that the system firewalls had been enabled unexpectedly, a change that had previously gone unnoticed.
5. Rapid Firewall Disablement: The system firewalls that were blocking traffic were disabled. The system then gradually returned to its intended state, resolving the incident.

**Mitigation Measures:** Following this incident, these steps have been taken or are planned to reduce the risk of similar occurrences:

1. Enhanced Monitoring: Monitoring will be implemented to track data-enabled functionality changes (functionality feature switches), so that anomalies and potential performance issues are detected promptly.
2. Playbook Updates: The DBA and DEVOPS teams' troubleshooting playbook will be updated with the lessons learned from this incident to make response procedures quicker and more effective.
3. Code Review Process: The code review process has been revised to include an assessment of dependencies in any configuration change, to reduce unforeseen interactions and disruptions.
4. Conditional Logic Refinement: The SRE team has improved the conditional logic governing firewall settings so that firewalls are enabled only when explicitly defined.
5. Continuous Enhancement: Ongoing development of tests and alerting systems will remain a top priority, to improve our ability to detect and respond to data and configuration changes.
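
The conditional-logic refinement described in mitigation item 4 (firewalls enabled only when explicitly defined) can be illustrated with a minimal sketch. The report does not name the configuration management tool or its settings, so the function `firewall_enabled`, the `firewall` and `enabled` keys, and the dictionary layout below are all hypothetical and shown only to make the "explicit opt-in" idea concrete.

```python
from collections.abc import Mapping
from typing import Any


def firewall_enabled(host_config: Mapping[str, Any]) -> bool:
    """Return True only when the config explicitly sets firewall.enabled to True.

    Hypothetical layout: {"firewall": {"enabled": True}}. A missing section,
    a missing key, None, or a truthy non-boolean value all leave the
    firewall disabled, so a default can never switch it on by accident.
    """
    firewall = host_config.get("firewall")
    if not isinstance(firewall, Mapping):
        return False
    # Identity check: only the boolean True counts as an explicit opt-in.
    return firewall.get("enabled") is True


if __name__ == "__main__":
    examples = [
        {"firewall": {"enabled": True}},    # explicit opt-in
        {"firewall": {"enabled": "true"}},  # string, not an explicit boolean
        {"firewall": {}},                   # key missing
        {},                                 # section missing
    ]
    for config in examples:
        print(config, firewall_enabled(config))
```

The design choice in this sketch is to fail closed on ambiguity: anything other than an explicit boolean opt-in is treated as "not defined", which is the behavior the mitigation item describes.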