Hint Health incident

Service unresponsive due to high partner load.

Hint Health experienced a major incident on June 14, 2023. The incident has been resolved; the full update timeline is below.

Started: Jun 14, 2023, 09:30 PM UTC
Resolved: Jun 14, 2023, 09:30 PM UTC
Duration: —
Detected by Pingoru: Jun 14, 2023, 09:30 PM UTC

Update timeline

resolved Jun 14, 2023, 05:43 PM UTC

Our API experienced a high volume of unusually slow requests from one of our partners. This caused performance issues and database slowdowns that negatively impacted customers, with application sluggishness between 1:30pm and 2:00pm and unavailability beginning around 2pm.

The Hint team identified the issue at 1:56pm PST and began to troubleshoot the problem as well as take mitigating steps. The mitigating steps allowed the application to recover temporarily before it became overloaded again, resulting in intermittent outages between 2pm and 3pm, when the root cause was identified and the partner activity was stopped.

The actions taken by the team are as follows:

1. Create a mechanism to rapidly rate-limit specific partners, either in general or for specific (slower) endpoints. Done. The Hint team can now implement rate limits immediately in cases like these to mitigate the issue.
2. Analyze the slow endpoint and work to increase performance. Done. The team re-indexed the tables for significantly improved performance and optimized the code to remove hundreds of database queries from the request.
3. Contact the partner to better understand their needs and ensure that we have a long-term, scalable solution in place. In progress. We are actively working with the partner on possible changes to the integration.
4. Proactive monitoring of partners to identify problems in the future. In progress. Our production monitoring tooling is currently being upgraded to allow for more advanced monitoring and alerting capabilities. This work will likely be complete by the end of June.
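
The first action item, rate-limiting a specific partner either in general or on a specific slow endpoint, can be illustrated with a small token-bucket limiter. This is only a sketch under stated assumptions, not Hint's actual mechanism: the class name `PartnerRateLimiter`, the partner IDs, and the endpoint paths are all hypothetical. State is kept in process memory, so checking a limit needs no database query.

```python
import time


class PartnerRateLimiter:
    """Hypothetical in-memory token-bucket limiter keyed by partner and endpoint."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._limits = {}   # (partner_id, endpoint or None) -> (rate, capacity)
        self._buckets = {}  # same key -> [tokens, last_refill_time]

    def set_limit(self, partner_id, rate, capacity, endpoint=None):
        """Allow `rate` requests per second with bursts up to `capacity`.

        endpoint=None installs a partner-wide rule.
        """
        key = (partner_id, endpoint)
        self._limits[key] = (float(rate), float(capacity))
        self._buckets[key] = [float(capacity), self._clock()]

    def allow(self, partner_id, endpoint):
        """Return True if the request may proceed, False if it should be rejected."""
        # An endpoint-specific rule takes precedence over a partner-wide rule.
        for key in ((partner_id, endpoint), (partner_id, None)):
            if key in self._limits:
                return self._take(key)
        return True  # no rule configured for this partner

    def _take(self, key):
        rate, capacity = self._limits[key]
        bucket = self._buckets[key]
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the burst capacity.
        bucket[0] = min(capacity, bucket[0] + (now - bucket[1]) * rate)
        bucket[1] = now
        if bucket[0] >= 1.0:
            bucket[0] -= 1.0
            return True
        return False


limiter = PartnerRateLimiter()
# Hypothetical rule: partner "partner-a" gets 1 request/second on "/slow", burst of 2.
limiter.set_limit("partner-a", rate=1.0, capacity=2, endpoint="/slow")
print(limiter.allow("partner-a", "/slow"))
```

A rejected request would typically be answered with HTTP 429. In a deployment with several webservers, the buckets would need to live in a shared store rather than in each process.
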
postmortem Jun 14, 2023, 05:49 PM UTC

Positives

* The team identified the issue quickly and responded immediately to the outage.
* Once the root cause was identified, the team quickly found and began to implement possible solutions.
* Some performance improvements were made within hours, and within days the problem was completely resolved.

Gaps

* Hint’s production monitoring tools misreported the source of the problem, making the issue harder to identify than it should have been (partner requests are not being reported correctly in the performance summary view).
* Our rate-limiting monitoring provides limited information on the user/requester.
* Our rate limiting was making unnecessary database queries, further contributing to the problem under high load.

Commentary

Similar issues have happened in the past, and Hint should explore long-term architectural changes to further lessen the impact of this kind of issue, including:

* Having a separate DB for partner reads and app reads. Our read DB was overwhelmed, but user impact could be mitigated by having separate DBs for users and partners, so that issues like this only impact other partners.
* Having separate webservers for partners and web users. Our webservers were overloaded; by routing requests based on user, we could have limited the impact to other partners. This would require an investment in a more sophisticated load balancer.
* Auto-scaling of the webservers. This would have helped to mitigate the issue on the webservers (although it may have exacerbated the issue for the DB). This will require a significant change in our infrastructure that is already planned over the next year.
* Auto-detection and blocking of IP addresses/partners that are causing significant load issues on our servers. We use AWS WAF, which may help with this kind of detection and blocking.
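
The auto-detection idea in the Commentary can be sketched as a sliding-window monitor that adds up the server time each partner consumes and flags any partner over a threshold. This is an illustrative assumption about how such detection might work, not a description of AWS WAF or of Hint's tooling; `PartnerLoadMonitor`, its parameters, and the partner IDs are hypothetical. Summing request duration rather than request count is a deliberate choice here, since the incident involved unusually slow requests and not only a high number of them.

```python
from collections import defaultdict, deque


class PartnerLoadMonitor:
    """Hypothetical monitor: flags partners whose total request time in a
    sliding window exceeds a threshold."""

    def __init__(self, window_seconds, max_busy_seconds):
        self.window_seconds = window_seconds
        self.max_busy_seconds = max_busy_seconds
        # partner_id -> deque of (timestamp, duration), oldest first
        self._samples = defaultdict(deque)

    def record(self, partner_id, timestamp, duration):
        """Record one finished request. Timestamps are assumed to be
        non-decreasing for each partner."""
        self._samples[partner_id].append((timestamp, duration))
        self._expire(partner_id, timestamp)

    def _expire(self, partner_id, now):
        # Drop samples that have fallen out of the window.
        samples = self._samples[partner_id]
        cutoff = now - self.window_seconds
        while samples and samples[0][0] <= cutoff:
            samples.popleft()

    def heavy_partners(self, now):
        """Return the sorted IDs of partners over the busy-time threshold."""
        flagged = []
        for partner_id in list(self._samples):
            self._expire(partner_id, now)
            busy = sum(duration for _, duration in self._samples[partner_id])
            if busy > self.max_busy_seconds:
                flagged.append(partner_id)
        return sorted(flagged)


monitor = PartnerLoadMonitor(window_seconds=60, max_busy_seconds=30)
for second in range(10):
    monitor.record("partner-a", second, 4.0)  # ten slow requests, 4 s each
monitor.record("partner-b", 5, 0.2)           # one fast request
print(monitor.heavy_partners(now=10))
```

The flagged IDs could then feed an alert or an automatic rate limit for those partners.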