Hint Health experienced a major incident on April 1, 2022 affecting HintOS App, lasting 2h 21m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Apr 01, 2022, 04:22 PM UTC
We're experiencing higher-than-usual database load for the 1st of the month, which is causing an application-wide slowdown. We're actively working to address the problem.
- monitoring Apr 01, 2022, 04:55 PM UTC
Our primary database rebooted and failed over to our replica at 2:00am PST, leaving the application running at roughly half its database capacity. We have reconfigured the databases to share the load again, which has stabilized performance. We're nearly through the backlog of billing jobs, so load should decrease further in the next 10-20 minutes, at which point performance will return to normal. We're continuing to monitor the situation and have launched an investigation into the cause of the 2:00am failover.
- resolved Apr 01, 2022, 06:43 PM UTC
Further investigation showed that the failover was caused by a minor database patch applied automatically during our maintenance window. The maintenance window for each database instance should be around 5pm PST on weekends, but was incorrectly set to 2:00am on Fridays. As a result of the failover, our secondary databases handled all traffic between 2am and 9:50am instead of sharing the load with the primary. The timing was especially bad because early morning is a key time for our nightly billing jobs, and this Friday happened to be the 1st of the month, when our systems experience peak billing load. We have updated our maintenance windows and are looking into additional monitoring and alerting to catch similar issues in the future.
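
As an illustration of the kind of check that could flag a misplaced maintenance window, here is a minimal Python sketch. It is not our actual tooling: the `MaintenanceWindow` type, the instance name `primary-db`, and the assumed billing hours (midnight to 10am) are hypothetical and chosen only to mirror the times described above.

```python
from dataclasses import dataclass

# Policy from the write-up: maintenance should run on weekends.
WEEKEND = {"sat", "sun"}


@dataclass(frozen=True)
class MaintenanceWindow:
    instance: str    # hypothetical instance name
    day: str         # three-letter weekday, e.g. "fri"
    start_hour: int  # local hour of day, 0-23


def window_problems(window, billing_hours=range(0, 10)):
    """Return the reasons a window violates the intended policy.

    billing_hours is an assumption (00:00 to 09:59 local time); adjust it
    to match when the nightly billing jobs actually run.
    """
    problems = []
    if window.day.lower() not in WEEKEND:
        problems.append(
            f"{window.instance}: window falls on {window.day}, not a weekend"
        )
    if window.start_hour in billing_hours:
        problems.append(
            f"{window.instance}: window starts at {window.start_hour:02d}:00, "
            "during billing hours"
        )
    return problems


# The misconfigured window from this incident versus the intended one.
misconfigured = MaintenanceWindow("primary-db", "fri", 2)
intended = MaintenanceWindow("primary-db", "sat", 17)

for problem in window_problems(misconfigured):
    print(problem)
```

Run on a schedule against each instance's configured window, a check like this would report the Friday 2:00am setting before a patch is applied, while the intended weekend 5pm window passes with no problems.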