Hint Health incident
HintOS temporarily unavailable - RESOLVED
Hint Health experienced a critical incident on November 19, 2021 affecting HintOS App, lasting 44m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Nov 19, 2021, 06:55 PM UTC
We are currently investigating this issue.
- identified Nov 19, 2021, 07:07 PM UTC
The outage was caused by a routine deploy to our production web servers. This deploy encountered an unknown error that resulted in our web servers becoming unavailable. We're working with the Aptible team, who are investigating the issue. Restarting our servers resolved the problem. We're pausing all deploys until we better understand and resolve the root cause of the problem.
- monitoring Nov 19, 2021, 07:31 PM UTC
All systems have returned to normal and the Aptible team is working to get to the root of the problem.
- resolved Nov 19, 2021, 07:40 PM UTC
Marking this incident as resolved as our systems have remained stable and we do not anticipate further issues. I'll follow up with information on the root cause and fix as those become available. Apologies for the inconvenience, and we appreciate everyone's patience while we worked through the problem!
- postmortem Nov 22, 2021, 09:45 PM UTC
### **Summary**

Hint’s API web servers stopped responding, causing a 24-minute complete application outage for Hint’s customers and partners. The root cause of the outage has been found and resolved, and we don’t anticipate issues like this in the future. Big thanks to all of our customers and partners for their patience while we dealt with this issue!

### **Timeline for Friday, Nov 19th**

- 10:44am PST - Hint’s automated monitoring system triggered an outage event and began alerting/escalation.
- 10:45am PST - Hint’s CS team confirmed the outage.
- 10:49am PST - Hint’s SRE team acknowledged the outage event and began our outage response playbook.
- 10:55am PST - The outage incident was created on [status.hint.com](http://status.hint.com).
- 10:58am PST - Server restart was initiated.
- 11:04am PST - Web servers came back online as part of the restart process.
- 11:07am PST - The outage incident was updated to reflect operational status.
- 11:09am PST - Monitoring systems confirmed the fix and closed the outage event.

### **Root cause analysis and resolution**

Hint’s web servers became unresponsive after a routine deploy, and restarting the servers resolved the issue. Hint escalated the issue to its hosting provider, Aptible. They determined that the root cause was a rare race condition that we encountered in their recently upgraded deployment process. The underlying issue was quickly resolved by the Aptible team. Hint’s team has scheduled a postmortem where they will evaluate areas for future improvement.