Welkin Health incident

L1 - Sub system degradation: Welkin Care (Multiple issues)

Welkin Health experienced a major incident on December 2, 2021 affecting Care and Designer and 1 more component, lasting 22h 3m. The incident has been resolved; the full update timeline is below.

Started: Dec 02, 2021, 08:24 PM UTC
Resolved: Dec 03, 2021, 06:27 PM UTC
Duration: 22h 3m
Detected by Pingoru: Dec 02, 2021, 08:24 PM UTC

Affected components

Care, Designer, Admin

Update timeline

identified Dec 02, 2021, 08:24 PM UTC

We started experiencing an issue with the Welkin notification system. The issue affects some users: the red dot icon is not displayed in the notification tray. Refreshing the screen still brings all the notifications back. We are working with our vendors to resolve the issue. No other side effects are known at this point.
identified Dec 03, 2021, 04:25 AM UTC

We are continuing to work on a fix for this issue.
identified Dec 03, 2021, 04:26 AM UTC

We have escalated the issue to a major outage and are working all hands on deck to resolve it.
identified Dec 03, 2021, 05:44 AM UTC

Our team has found the root cause and is working on fixing it. No ETA is currently available.
identified Dec 03, 2021, 06:28 AM UTC

Our team has identified several problems related to our service and continues working on them. We expect the issue to be resolved soon and apologize for the outage.
monitoring Dec 03, 2021, 07:11 AM UTC

A fix has been implemented and we are monitoring the results.
resolved Dec 03, 2021, 06:27 PM UTC

This incident has been resolved.
postmortem Dec 29, 2021, 07:35 PM UTC

## Timeline of events

| Time | Event |
| --- | --- |
| Dec 1, 2021, 20:20 PST | The team noticed excessive API calls to our notification subsystem (Pusher) and began troubleshooting the issue with the vendor. |
| Dec 2, 2021, 06:30 PST | We decided to increase our plan limits with the subsystem service provider to avoid disruptions in delivering notifications while we continued investigating the issue with the service. |
| Dec 2, 2021, 12:24 PST | The plan limit was exhausted, and the team paused the notification system while continuing to investigate what was causing the excessive API calls. |
| Dec 2, 2021, 17:30 PST | After reviewing the machine setups, we decided to execute a rolling restart (after working hours) before attempting to fix the issue. |
| Dec 2, 2021, 17:50 PST | The rolling restart failed and took down the machines hosting the Care portal (API and Designer were affected as well). |
| Dec 2, 2021, 20:25 PST | We identified the root cause of the issue and began working to resolve it. |
| Dec 2, 2021, 21:40 PST | We found a secondary root cause and began working to mitigate it. |
| Dec 2, 2021, 22:50 PST | We started an emergency patch release to address both the primary and secondary issues and rolled it out across the different environments. |
| Dec 2, 2021, 23:11 PST | The issue was resolved and remained under monitoring for the next 8 hours to ensure everything was working properly. |

## Root Cause Analysis

Several issues combined to break the system:

1. AWS Parameter Store and Secrets Manager integration: prior to the incident, on Nov 30th, we prepared a feature that added an extra security layer by connecting AWS Parameter Store and Secrets Manager. To extend our use of that integration, we migrated to a different API. However, the rolling restart caused the infrastructure change to be deployed before the code change was released. This created an incompatible environment and prevented new machines from starting.
2. Kafka client: once the first issue was mitigated and we were making a change in the production environment, the repository from which we download the Kafka client jar was moved to an archive by the hosting party, which changed its URL. This prevented our code from being assembled in the new environment and delayed deployment of the fix by approximately 2 hours while we mitigated the issue.

## Learnings and Remediations

Our team has since taken and implemented several action items:

1. Separate infrastructure deployment from code deployment: part of the reason the rolling restart didn't work is that our infrastructure and code were written to deploy at the same time. Since then, the team has separated the infrastructure release and the code release into different pipelines that will be written, maintained, and deployed independently going forward. This will allow us to find issues like this one in our lower environments before they are promoted further.
2. Dependency libraries: the majority of the libraries we use are hosted on Maven Central and are not moved from one location to another. For the few libraries that are not, we decided to host the final artifacts in our own build space, so that third parties moving them to an archive cannot break our builds. We will continue doing so going forward, including updating the dependency versions to newer ones once they are released.
3. Rolling restart deep dive: while a rolling restart is the most common strategy, in the case of AWS a machine that is coming online reports status OK before the health check has passed. In this case, the machine started, but once the health check failed twice, the machine was terminated. Our learning is to validate the health check manually during a rolling restart before proceeding to the next step of the restart.
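
The rolling restart learning above amounts to gating each step on a health check that has actually passed, rather than on the machine reporting that it has started. A minimal sketch of that idea follows. It is an illustration under stated assumptions, not Welkin's tooling: `restart` and `is_healthy` are hypothetical callables supplied by the operator, the machine names are placeholders, and no AWS API is called.

```python
import time
from typing import Callable, Iterable, List


class RestartAborted(Exception):
    """Raised when a restarted machine never passes its health check."""


def wait_until_healthy(
    check: Callable[[], bool],
    required_successes: int = 2,
    max_checks: int = 10,
    delay_seconds: float = 0.0,
) -> bool:
    """Poll `check` until it passes `required_successes` times in a row."""
    streak = 0
    for _ in range(max_checks):
        if check():
            streak += 1
            if streak >= required_successes:
                return True
        else:
            streak = 0  # a single failure resets the streak
        time.sleep(delay_seconds)
    return False


def rolling_restart(
    machines: Iterable[str],
    restart: Callable[[str], None],      # hypothetical: restarts one machine
    is_healthy: Callable[[str], bool],   # hypothetical: runs one health check
    **wait_kwargs,
) -> List[str]:
    """Restart machines one at a time, stopping at the first unhealthy one."""
    done: List[str] = []
    for machine in machines:
        restart(machine)
        # "Started" is not enough: require passing health checks before
        # touching the next machine, so a bad change cannot take down the fleet.
        if not wait_until_healthy(lambda: is_healthy(machine), **wait_kwargs):
            raise RestartAborted(
                f"{machine} failed its health check after restart; "
                f"{len(done)} machine(s) restarted successfully before abort"
            )
        done.append(machine)
    return done
```

With this gate, a failure on one machine stops the restart while the remaining machines are still running their previous, working configuration. The thresholds (`required_successes`, `max_checks`, `delay_seconds`) are illustrative defaults and would need tuning to the real health check interval.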