ServiceChannel incident
Transient Platform Downtime Due To Database Cluster Failover
ServiceChannel experienced a critical incident on July 11, 2023, lasting —. The incident has been resolved; the full update timeline is below.
Update timeline
- resolved Jul 11, 2023, 05:25 PM UTC
A hardware fault affecting the server in the primary database cluster caused a brief loss of availability of the Primary Database Replica, and subsequent platform downtime, while the cluster healed itself.
- postmortem Jul 11, 2023, 05:25 PM UTC
**Date of Incident:** 07/04/2023

**Time/Date Incident Started:** 07/04/2023, 10:42 am EDT

**Time/Date Stability Restored:** 07/04/2023, 10:51 am EDT

**Time/Date Incident Resolved:** 07/04/2023, 12:48 pm EDT

**Users Impacted:** All

**Frequency:** Continuous

**Impact:** Critical

**Incident description:** A hardware fault affecting the server in the primary database cluster caused a brief loss of availability of the Primary Database Replica, and subsequent platform downtime, while the cluster healed itself.

**Root Cause Analysis:** According to our cloud hosting partner, the server acting as the listener and primary node in the production database cluster suffered a critical hardware fault and went offline. A transient network issue introduced a brief delay in the failover mechanism, but all affected services recovered within a few minutes.

**Actions Taken:**

1. Restarted the affected service to bring the failed node back online.
2. Monitored the impacted platform components to ensure application recovery.

**Mitigation Measures:**

1. Redeployment of the impacted virtual machine took place during the 7/8/2023 planned maintenance window.
2. Continue the investigation with our cloud service provider to improve cluster recovery even during transient network events.
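
The timestamps in the postmortem imply two intervals: time from incident start to restored stability, and time from incident start to resolution. The following is a minimal sketch for deriving them, not part of the original report. The helper name `parse_edt` and the fixed UTC-4 offset for EDT are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EDT is modeled as a fixed UTC-4 offset, which is sufficient
# because all three timestamps fall on the same day.
EDT = timezone(timedelta(hours=-4), name="EDT")


def parse_edt(stamp: str) -> datetime:
    """Parse a timestamp written as 'MM/DD/YYYY, HH:MM am/pm' (hypothetical helper)."""
    return datetime.strptime(stamp, "%m/%d/%Y, %I:%M %p").replace(tzinfo=EDT)


# Timestamps copied from the postmortem fields.
started = parse_edt("07/04/2023, 10:42 am")
stable = parse_edt("07/04/2023, 10:51 am")
resolved = parse_edt("07/04/2023, 12:48 pm")

time_to_stability = stable - started
time_to_resolution = resolved - started

print(time_to_stability)   # 0:09:00
print(time_to_resolution)  # 2:06:00
```

Under these assumptions, stability returned 9 minutes after the incident began, and the incident was marked resolved 2 hours and 6 minutes after it began.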