Hint Health incident

DB Upgrade Outage

Notice Resolved

Hint Health experienced a notice-level incident on August 28, 2020. The incident has been resolved; the full update timeline is below.

Started
Aug 28, 2020, 11:56 PM UTC
Resolved
Aug 23, 2020, 05:00 AM UTC
Duration
Detected by Pingoru
Aug 28, 2020, 11:56 PM UTC

Update timeline

  1. resolved Aug 28, 2020, 11:56 PM UTC

    We conducted database maintenance on August 22nd from 5:00pm to 5:45pm. After a period of normal operation, our monitoring detected slower web response times starting at 7:00pm PST on August 22nd, followed by a large number of request timeouts. We characterize the period between 10:07pm PST on August 22nd and 8:37am PST on August 23rd as a significant production outage. The high database load at 7:00pm was the result of thousands of our regularly scheduled jobs running a query 1000x slower than usual. We believe the query slowness was a result of the database upgrade, possibly from incorrect query planning. The increased load caused database failovers and eventually database connection issues, which resulted in web request timeouts. Our engineering team responded to the alerts at 8:25am, and systems returned to normal at 8:37am.

  2. postmortem Aug 28, 2020, 11:56 PM UTC

    I’m extremely sorry for the outage last weekend. We are a small team that aims to provide a world-class product, and the infrastructure team takes great pride in the uptime, stability, and performance of our production systems. We’ve conducted a thorough post-mortem evaluation of how our systems and processes responded to this outage and have already made several changes to improve our responsiveness. We’ve revised our database upgrade playbook to allow for a much longer verification period without lengthening our effective maintenance period. In addition, we’ve adopted a new tool (OpsGenie) to notify our on-call engineers of production alerts and escalate them more effectively, which should shorten our response time. Although we have no plans to do additional database maintenance for at least a year, we believe these changes will make future maintenance less risky and allow us to respond much more effectively to future outages.

    Thanks again,
    Graham Melcher
    CTO, Hint Health