Cronofy experienced a major incident on August 16, 2025 affecting the API, lasting 51 minutes. The incident has been resolved; the full update timeline is below.
Affected components
- API
Update timeline
- investigating Aug 16, 2025, 03:36 PM UTC
We are seeing a small number of 5XX errors being returned from our US data center. We are investigating and will update shortly.
- identified Aug 16, 2025, 03:53 PM UTC
Investigation has shown that between 15:19 and 15:27 UTC a database issue resulted in a partial loss of service. During this period requests may have returned a 5xx status and failed to process. The database issue has now been resolved and we're moving to monitor the environment.
- monitoring Aug 16, 2025, 03:55 PM UTC
We're monitoring the environment to ensure things continue to run as expected.
- resolved Aug 16, 2025, 04:28 PM UTC
Normal operations have continued and we are marking this incident as resolved. A postmortem of the incident will take place and be attached to this incident in the next 48 hours. If you have any queries in the interim, please contact us at [email protected]
- postmortem Aug 18, 2025, 04:33 PM UTC
# Summary

Between 15:19:38 UTC and 15:26:55 UTC on Saturday, August 16th 2025, requests made to the API in our US data centre may have received an HTTP status of 500 Internal Server Error. This was caused by a transaction on one of our database tables not releasing a lock in a timely fashion which, in turn, caused a bottleneck that led to service degradation as subsequent requests queued behind it. The root cause was identified during the incident and will be addressed by reviewing the way these transactions are handled, with a view to preventing future locks.

## Timeline

All times are UTC unless otherwise stated.

- 15:19 - Table lock begins
- 15:21 - On-call engineers alerted to issues processing requests as 5XX responses increase
- 15:22 - Initial investigation begins
- 15:27 - Database lock clears, backlog begins processing
- 15:28 - Wider environment confirmed to be functional. No undue load observed
- 15:30 - Lock on database table observed to be the likely cause
- 15:32 - Lock on database table confirmed to be the cause
- 15:50 - We begin extended monitoring
- 16:28 - No further long-running locks observed. We mark the incident as resolved

## Retrospective

We always ask the questions:

* Could the issue have been resolved sooner?
* Could the issue have been identified sooner?
* Could the issue have been prevented?

In this case, we feel that yes, we could have identified the root cause sooner, as the incident highlighted a gap in our database monitoring that we'll be working to address. Similarly, it could have been resolved sooner had we opted to clear the lock ourselves. However, our preference is to avoid the need for intervention by addressing the way we approach these queries in the first place.

## Actions

Our Site Reliability Engineers will be looking at addressing the gaps in our monitoring of database transactions to improve visibility in the future, while we improve the approach we take to table locks and long-running transactions.
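As an illustration of the kind of transaction monitoring the Actions section describes, the sketch below polls for transactions that have stayed open past a threshold, which is how a stuck table lock like this one would typically surface. It assumes a PostgreSQL-style database exposing `pg_stat_activity`; the DSN, threshold, and function name are hypothetical and do not describe Cronofy's actual stack or tooling.

```python
# Minimal sketch: flag transactions open longer than a threshold so they can
# be alerted on before they block other requests. Assumes PostgreSQL and the
# psycopg2 driver; connection string and threshold are illustrative only.
import psycopg2

LOCK_ALERT_THRESHOLD = "30 seconds"  # hypothetical alerting threshold


def find_long_running_transactions(dsn: str):
    """Return transactions open longer than the threshold, oldest first."""
    query = """
        SELECT pid,
               now() - xact_start AS open_for,
               state,
               wait_event_type,
               left(query, 200) AS query
        FROM pg_stat_activity
        WHERE xact_start IS NOT NULL
          AND now() - xact_start > %s::interval
        ORDER BY open_for DESC;
    """
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(query, (LOCK_ALERT_THRESHOLD,))
        return cur.fetchall()


if __name__ == "__main__":
    # In practice the results would feed an alerting pipeline rather than stdout.
    for row in find_long_running_transactions("dbname=app"):
        print(row)
```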