Cronofy incident

Increased error rate in US

Cronofy experienced a minor incident on January 29, 2025, affecting the API and lasting 1h 19m. The incident has been resolved; the full update timeline is below.

Started: Jan 29, 2025, 10:32 AM UTC
Resolved: Jan 29, 2025, 11:51 AM UTC
Duration: 1h 19m
Detected by Pingoru: Jan 29, 2025, 10:32 AM UTC

Affected components

API

Update timeline

investigating Jan 29, 2025, 10:32 AM UTC

We're investigating an increase in API calls resulting in HTTP 500 errors.
monitoring Jan 29, 2025, 10:38 AM UTC

Error rates have returned to normal; we are continuing to monitor.
identified Jan 29, 2025, 11:03 AM UTC

We are still seeing occasional errors via the API. These are much less common and in smaller numbers, but not back to zero yet. The underlying cause is above-average load on our database, which we are working to resolve.
monitoring Jan 29, 2025, 11:17 AM UTC

Database load is back to normal and errors have ceased. We are continuing to monitor to ensure normal operation has resumed.
resolved Jan 29, 2025, 11:51 AM UTC

Normal operation has resumed and been consistent for the last 30 minutes, so this incident is resolved. About 1.75% of API calls between 10:08 and 10:38 received an HTTP 500 response due to database connections being refused. A combination of high-load tasks, both from API calls and internal DB processes, all happened at once. This caused the database to degrade in performance and refuse some connections. Attempts to manually kill some of the processes did not succeed, for reasons we will investigate. Once these jobs completed, performance recovered and normal operation resumed. Jobs which had failed and retried have since completed.
postmortem Feb 03, 2025, 06:26 PM UTC

On Wednesday January 29th between 10:00 and 11:09 UTC our US data center experienced degraded performance. This was caused by multiple concurrent delete operations on a heavily used database table, which in turn resulted in slower than usual responses and some failures.

The majority of API traffic was unaffected, but a small percentage of calls would have resulted in an HTTP 500 response.

Further details, lessons learned, and further actions we will be taking can be found below.

## Timeline

_All times are rounded for clarity and in UTC_

On Wednesday January 29th at 10:00 we began processing a large number of data deletion jobs in line with GDPR compliance.

Because there were far more processes of this type than usual, our database system soon began to struggle with the number of deletions it was being asked to perform. In particular, the CPU usage was very high.

At 10:08 the first alarm for high CPU usage alerted our engineering team, and they began to investigate the cause. Additional alarms followed, notifying the team of failed jobs, 5xx responses and slower API response times.

Starting at 10:25, the database was under so much load that some new connections were refused. During this time our monitoring systems also show our slowest API responses. This was cleared by 10:38.

From 10:47 to 10:57 we saw response times increase again, but not as severely as during the earlier window. This time we didn’t see any refused database connections.

From 11:00 activity on our database had reduced significantly, and by 11:09 our database activity had returned to the usual levels.

## Retrospective

The questions we ask ourselves in an incident retrospective are:

* Could it have been identified sooner?
* Could it have been resolved sooner?
* Could it have been prevented?

We look for holistic improvements alongside targeted ones.

**Could it have been identified sooner?**

No. We feel that during this incident we were quick to respond to the alarms we received and to identify the cause.

**Could it have been resolved sooner?**

Possibly. We were a little hesitant to completely stop the large number of processes from being performed as they related to compliance. However, we have identified improvements we can make to our incident playbook that could have helped us determine how many remaining tasks there were, as in other scenarios, direct intervention would have been required.

**Could it have been prevented?**

We use rate limits throughout Cronofy to provide a robust service. However, they were not applied in this area, which was also a resource-intensive task. We have already applied rate limiting to spread the load out and prevent a repeat.

## Actions

As mentioned, we have already applied rate limiting to the GDPR processes to spread out the load within sensible bounds.

We are going to update our incident playbook to highlight where statistics around the remaining number of GDPR tasks can be found. Had this already been in place we would have been able to determine how much longer the incident was going to last.

We’re also going to perform an audit of our other background jobs to determine whether there are other areas of our system that lack rate limits on the number of concurrent jobs.

## Further questions?

If you have any further questions, please contact us at [[email protected]](mailto:[email protected])
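
As an illustration of the kind of limit on concurrent jobs described under Actions, here is a minimal Python sketch. It is not Cronofy's implementation, which the postmortem does not describe; the names `ConcurrencyLimiter`, `run_all` and `fake_delete`, the limit of 3, and the use of threads are all assumptions made for the example. The idea is only that a worker must acquire a semaphore slot before starting a heavy job, so no more than a fixed number of such jobs touch the database at once, however many are queued.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyLimiter:
    """Hypothetical helper: caps how many jobs run at the same time."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._running = 0
        self.peak = 0  # highest number of jobs observed running at once

    def run(self, job, *args):
        # Block until a slot is free, then run the job while holding it.
        with self._slots:
            with self._lock:
                self._running += 1
                self.peak = max(self.peak, self._running)
            try:
                return job(*args)
            finally:
                with self._lock:
                    self._running -= 1


def fake_delete(account_id):
    """Stand-in for an expensive deletion job (assumption for the demo)."""
    time.sleep(0.01)
    return account_id


def run_all(limiter, job, items, workers=8):
    """Submit every item to a worker pool; the limiter bounds concurrency."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda item: limiter.run(job, item), items))


if __name__ == "__main__":
    limiter = ConcurrencyLimiter(max_concurrent=3)
    done = run_all(limiter, fake_delete, range(20))
    print(f"completed {len(done)} jobs, peak concurrency {limiter.peak}")
```

Capping concurrency like this makes a large batch take longer to finish, but keeps the load it places on a shared resource bounded, which is the trade-off the Actions section describes as spreading the load out.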