Welkin Health incident

L1 Welkin Health: Extreme slowness on Welkin Care Portal & API issues on V8

Welkin Health experienced a major incident on September 23, 2021 affecting Care and Designer and 1 more component, lasting 1h 6m. The incident has been resolved; the full update timeline is below.

Started: Sep 23, 2021, 04:20 PM UTC
Resolved: Sep 23, 2021, 05:26 PM UTC
Duration: 1h 6m
Detected by Pingoru: Sep 23, 2021, 04:20 PM UTC

Affected components

Care, Designer, Admin

Update timeline

investigating Sep 23, 2021, 04:20 PM UTC

On September 23, 2021, beginning at around 7:00 AM PDT, Welkin’s customers began experiencing extreme slowness on the Care portal and issues with the API. We are currently working on identifying the root cause. We sincerely apologize for this disruption, and thank you for your patience.
investigating Sep 23, 2021, 04:21 PM UTC

We are continuing to investigate this issue.
resolved Sep 23, 2021, 05:26 PM UTC

The service incident was fully resolved by the Welkin Engineering Team on September 23 at 9:40 AM PDT. We will post the postmortem in the next few days. We sincerely apologize for this disruption, and thank you for your patience.
postmortem Sep 27, 2021, 03:32 AM UTC

# Production Issue 09/23/2021

## Executive Summary

* On 09/23, the Welkin v8 platform had an outage caused by Kafka running out of disk space.
* Between **09/23 02:50 PST and 09/23 09:40 PST**, the Kafka cluster was unresponsive and did not accept connections from Welkin EC2 servers.
* During that period, the observed effects were:
  * Welkin Care, Designer and Admin were slow to respond
  * Welkin API had service degradation
  * Welkin Events (data audit, automation) experienced delayed response or no response at all

## Timeline of Events

| Time | Event |
| --- | --- |
| 02:50 09/23 PST | Kafka throws `KAFKA_STORAGE_ERROR` |
| 02:50 09/23 PST | Kafka refuses connections: `Connection to node 1 could not be established` |
| 02:51 09/23 PST | Kafka becomes unresponsive |
| 06:15 09/23 PST | Team notices slowness and starts investigation |
| 07:14 09/23 PST | Multiple team members and customers notice slowness and report issues |
| 07:31 09/23 PST | Issue is escalated |
| 08:00 09/23 PST | Rolling restart of all systems is triggered. Does not yield the expected result |
| 09:20 09/23 PST | Status Page Announcement is posted: [https://welkinhealth.statuspage.io/incidents/l7h9dfg3xhd4](https://welkinhealth.statuspage.io/incidents/l7h9dfg3xhd4) |
| 09:32 09/23 PST | Kafka cluster is upgraded to a larger cluster for increased capacity |
| 09:42 09/23 PST | Issue is resolved |
| 10:20 09/23 PST | Status page is updated with a resolution |

## Learnings and Remediations

1. Review of all critical alerts: The Welkin team has initiated a review of all critical alerts and found a gap in the Kafka Storage alert. The alert had two problems:
   1. It was incorrectly tagged and routed as lower severity instead of critical severity
   2. It is raised only once and stays in the alarm state until resolved, without repeating itself
2. Faster implementation of the rolling restart procedure: A rolling restart is a common procedure to execute when capacity is overloaded. However, due to the nature of the data that Welkin stores, access to the Live account is restricted to key personnel within Welkin. Welkin will be increasing partial access to Live for faster implementation of rolling restart procedures.
3. Status updates: We aim to report the status as soon as we have identified and confirmed the problem. In this case, however, the team missed that opportunity. We are retraining our support team members on the correct process for reporting system degradation.
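
The two alerting gaps described under the first remediation item (wrong severity, and an alert that fires only once) can be illustrated with a small sketch. This is not Welkin's alerting setup, which the postmortem does not describe; the class name `DiskAlert`, the 85% threshold, and the 15-minute repeat interval are all assumptions chosen for illustration.

```python
import shutil
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiskAlert:
    """Hypothetical disk-usage alert that re-notifies while the condition holds."""

    threshold: float = 0.85        # assumed: fire when at least 85% of the disk is used
    repeat_seconds: float = 900.0  # assumed: repeat every 15 minutes while firing
    severity: str = "critical"     # routed as critical rather than a lower severity
    last_notified: Optional[float] = None

    def evaluate(self, used: float, now: float) -> Optional[str]:
        """Return a message if a notification is due at time `now` (seconds)."""
        if used < self.threshold:
            # Condition cleared: reset so the next breach notifies immediately.
            self.last_notified = None
            return None
        if (
            self.last_notified is not None
            and now - self.last_notified < self.repeat_seconds
        ):
            # Still firing, but the repeat interval has not elapsed yet.
            return None
        self.last_notified = now
        return f"[{self.severity}] disk usage at {used:.0%}"


def used_fraction(path: str) -> float:
    """Fraction of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total
```

A monitoring loop would call `alert.evaluate(used_fraction("/some/data/dir"), time.time())` on a schedule (the path is a placeholder) and forward any returned message to the paging system. Because the alert repeats while the disk stays above the threshold, a missed first notification does not leave the condition silent until someone notices the symptoms.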