ResDiary incident

Diary not loading

ResDiary experienced a notice-level incident on February 7, 2025, affecting UK/Europe and lasting 3h 8m. The incident has been resolved; the full update timeline is below.

Started: Feb 07, 2025, 05:15 PM UTC
Resolved: Feb 07, 2025, 08:24 PM UTC
Duration: 3h 8m
Detected by Pingoru: Feb 07, 2025, 05:15 PM UTC

Affected components

UK/Europe

Update timeline

investigating Feb 07, 2025, 05:15 PM UTC

We are currently investigating this issue.
monitoring Feb 07, 2025, 06:44 PM UTC

A fix has been implemented and we are now monitoring this.
resolved Feb 07, 2025, 08:24 PM UTC

All services have been restored, and our engineers are continuing to monitor performance. We are working on a report of the incident which will be published here as soon as we collate all the details. If you continue to have trouble, please contact support.
postmortem Feb 13, 2025, 10:10 AM UTC

Between approximately 17:00 and 18:30 on Friday 7th Feb 2025, the ResDiary web application was unable to process requests because a key piece of infrastructure reached maximum capacity. This impacted the ResDiary web application only. API traffic was unaffected, so mobile apps, widgets and integrations were not disrupted.
On-call engineers were alerted to this at 17:08, when an automated P1 alert was triggered. A status page was published at 17:15 to notify customers.
While investigating, we discovered that one of our Redis caches was at maximum load. Our internal logs also showed that this was causing requests to the cache to time out. As this cache provides some key functionality for our web application, requests failed and users were presented with an error screen.
At 17:33 on-call engineers restarted the Redis cache in an effort to relieve pressure, but this had no appreciable impact and we continued to see 100% load. At 18:36 the decision was made to flush the cache of all data. This action reduced the load on the cache, and service immediately returned to normal.
The root cause of this incident was a poorly performing query on our cache which, during peak daily traffic, caused the cache to reach capacity, after which it was unable to recover.
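The saturation described above (a cache pinned at 100% load) is the kind of condition a sustained-threshold check can flag before requests start timing out. The following is a minimal sketch only, not ResDiary's actual alerting: the `LoadWatcher` class, the 80% threshold and the three-sample window are all hypothetical, and the load samples are assumed to arrive as fractions between 0.0 and 1.0 from whatever monitoring is already in place.

```python
from collections import deque

# Hypothetical values, chosen for illustration only.
WARN_THRESHOLD = 0.80    # warn when load stays at or above 80%...
SUSTAINED_SAMPLES = 3    # ...for this many consecutive samples


class LoadWatcher:
    """Tracks recent cache load samples and flags sustained high load."""

    def __init__(self, threshold=WARN_THRESHOLD, window=SUSTAINED_SAMPLES):
        self.threshold = threshold
        # Only the most recent `window` samples are kept.
        self.samples = deque(maxlen=window)

    def observe(self, load):
        """Record a load sample (0.0 to 1.0); return True if a warning should fire."""
        self.samples.append(load)
        window_full = len(self.samples) == self.samples.maxlen
        # Fire only when every sample in a full window is at or above the threshold,
        # so a single brief spike does not page anyone.
        return window_full and all(s >= self.threshold for s in self.samples)
```

Requiring several consecutive high samples trades a little detection delay for fewer false alarms; the right threshold and window would depend on how quickly load climbs in practice.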
We have already implemented a number of corrective actions, both to prevent this issue occurring in the future and to improve our response to any future issues related to this infrastructure:
* Updated our automated alerts, so on-call engineers are alerted to high load before it starts impacting end-users
* Updated our on-call runbooks, so in future on-call engineers can take quicker, more decisive action
* On Wed 12th Feb an application change was deployed to replace the poorly performing query with a much more efficient one, significantly reducing load on our cache
The engineering team are continuing to work on further improvements to enhance the resilience of the ResDiary web application.
We sincerely apologise for any inconvenience this may have caused.
Thanks,
ResDiary