Welkin Health incident

L1 Welkin Health: Welkin Care Portal is not operational on V8.

Critical Resolved

Welkin Health experienced a critical incident on September 30, 2021, affecting the Care, Designer, and Admin components and lasting 4h 8m. The incident has been resolved; the full update timeline is below.

Started
Sep 30, 2021, 08:01 PM UTC
Resolved
Oct 01, 2021, 12:09 AM UTC
Duration
4h 8m
Detected by Pingoru
Sep 30, 2021, 08:01 PM UTC

Affected components

Care, Designer, Admin

Update timeline

  1. investigating Sep 30, 2021, 08:01 PM UTC

    On September 30, 2021, beginning at around 12:22 PM PDT, Welkin’s customers reported that they were unable to log in to the Care portal and were experiencing network issues. We are currently working on identifying the root cause. We sincerely apologize for this disruption and thank you for your patience.

  2. investigating Sep 30, 2021, 10:23 PM UTC

    We are still investigating the issue and have not yet identified the root cause. We sincerely apologize for this disruption and thank you for your patience.

  3. identified Sep 30, 2021, 10:42 PM UTC

    We are working on a potential solution and are preparing a release.

  4. resolved Oct 01, 2021, 12:09 AM UTC

    The service incident was fully resolved by the Welkin Engineering Team, and the system was fully restored on September 30th at 4:45 PM PDT. We will post the postmortem in the next few days.

  5. postmortem Oct 19, 2021, 10:54 PM UTC

    ## **Production Issue - 09/30/2021**

    ## **Executive Summary**

    * On September 30th, 2021, between 12:22 PM and 4:45 PM PDT, Welkin Health’s v8 platform experienced an outage. The outage was caused by a temporary loss of coverage from DataDog, our production environment’s critical health monitoring service.
    * During the outage window, the AWS Live account was affected. The effects included:
        * No machines (EC2) could be brought online.
        * Welkin Care, Designer, and Admin were not accepting any requests.
        * The Welkin API would not accept any requests and blocked all traffic at the Elastic Load Balancing (ELB) layer.

    ## **Timeline of Events**

    | **Time** | **Event** |
    | --- | --- |
    | 12:22 09/30 PDT | Customers and internal team noticed service degradation |
    | 12:25 09/30 PDT | Welkin team began investigation |
    | 13:00 09/30 PDT | Rolling restart of EC2 machines |
    | 13:10 09/30 PDT | Observed that the database had no open connections |
    | 13:30 09/30 PDT | Restart of database server |
    | 13:44 09/30 PDT | Manual connection test for database |
    | 13:44 - 15:00 09/30 PDT | Specialist Ops Engineer and Architect identified the root cause |
    | 16:45 09/30 PDT | Issue resolved |
    | 17:09 09/30 PDT | Status page updated with a resolution |

    ## **Learnings and Remediations**

    Welkin Health takes outages very seriously because we know they affect our users in many ways. As such, we have come up with the following plan of action for the future:

    * Daily monitoring of your live systems by the Support and Ops teams. We have built new dashboards with the sole purpose of monitoring your live environment. They include the following elements:
        * Live Overview
        * Live Infrastructure & Traffic
        * CPU utilization and load on the database servers
        * Average response time ticker for requests
    * Strive to be as prompt as possible when communicating system outages. Customers will be notified via the status page immediately upon outage detection. Multiple teams have been empowered to perform this task.
    * Cycle critical production components when the initial investigation does not lead to the root cause.

    ## **Root Cause Analysis**

    Our monitoring solution is packaged into each machine instance and begins monitoring when the instance starts. If the monitoring fails, the instance is terminated. In this case, AWS had not updated the blueprints of the monitoring solution with the updated certificates, so instances could not start successfully. As a temporary workaround, we bypassed the monitoring process to bring systems back online.

    We thank you for your continued partnership. Please feel free to ping our team with feedback.

    ## **Team Welkin**
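    The failure mode described in the Root Cause Analysis can be illustrated with a minimal Python sketch. It is not Welkin's actual startup logic: `boot_instance`, `BootResult`, the `fail_open` flag, and `agent_with_stale_certificate` are hypothetical names introduced only for illustration. It assumes a startup gate that terminates an instance when its monitoring agent cannot start, and models the temporary workaround as a bypass of that gate.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BootResult:
    started: bool    # did the instance come online?
    monitored: bool  # is the monitoring agent running?
    reason: str


def boot_instance(start_monitoring: Callable[[], None],
                  fail_open: bool = False) -> BootResult:
    """Hypothetical startup gate: start the monitoring agent before serving traffic.

    With fail_open=False (the assumed original behavior), a monitoring failure
    terminates the instance. With fail_open=True (the assumed workaround), the
    instance starts anyway, but without monitoring coverage.
    """
    try:
        start_monitoring()
    except Exception as exc:
        if fail_open:
            return BootResult(True, False, f"monitoring bypassed: {exc}")
        return BootResult(False, False, f"instance terminated: {exc}")
    return BootResult(True, True, "monitoring healthy")


def agent_with_stale_certificate() -> None:
    """Stand-in for a monitoring agent whose packaged certificate is outdated."""
    raise ConnectionError("certificate verification failed")


if __name__ == "__main__":
    # Strict gate: every new instance is terminated, so no capacity comes online.
    print(boot_instance(agent_with_stale_certificate))
    # Workaround: instances start, at the cost of losing monitoring coverage.
    print(boot_instance(agent_with_stale_certificate, fail_open=True))
```

    The sketch shows the trade-off in this design: a strict gate guarantees that every running instance is monitored, but it turns a monitoring dependency failure into a full loss of capacity.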