ControlUp incident
DEX Database Issue - Intermittent connectivity
ControlUp experienced a major incident on April 27, 2026, affecting the US Region, the EMEA Region, and 1 more component, and lasting 7h 26m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Apr 27, 2026, 05:10 PM UTC
We are currently investigating an issue causing some DEX environments to be inaccessible.
- investigating Apr 27, 2026, 05:38 PM UTC
We are still investigating the issue. Some environments may still experience intermittent issues accessing the DEX platform.
- identified Apr 27, 2026, 06:07 PM UTC
The issue has been identified. We are currently working on a fix to deploy to all environments. Some DEX environments may still be impacted at this time. We will provide another update once the fix is ready for deployment.
- monitoring Apr 27, 2026, 08:03 PM UTC
A fix has been implemented, and we are monitoring the results as we see access being restored across all environments. A final update will be posted once everything is confirmed.
- resolved Apr 28, 2026, 12:37 AM UTC
This incident has been resolved.
- postmortem May 08, 2026, 01:34 PM UTC
On the evening of April 27, 2026, some customers experienced intermittent issues when logging in or accessing resources. The disruption lasted approximately 5.5 hours, from 22:00 to 03:12 (GMT+2). The issue was caused by an internal service becoming unstable, which triggered a surge of automated retry attempts that overwhelmed a supporting system.

Our teams responded immediately and put the following protective measures in place:

* **Added automatic safeguards** to stop failing requests from piling up and spreading to other services
* **Reduced wait times** so requests no longer blocked the system for extended periods
* **Limited the number of automatic retries** to prevent traffic from multiplying during the disruption
* **Stopped retries on requests that could never succeed**, reducing unnecessary load
* **Added safety checks** to prevent one failure from triggering thousands of others
* **Doubled the capacity** of the affected system to handle higher demand
* **Improved logging** to help diagnose issues faster in the future
* **Added temporary data storage** so the system can continue serving customers even during disruptions

Full service was restored at 03:12 (GMT+2) on April 28, with no further issues reported. Most customer data continued to be served during the disruption, limiting the overall impact.
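
For readers who want a concrete picture of the retry-related measures in the postmortem, the following is a minimal Python sketch of three of them working together: a cap on retry attempts, no retries for requests that can never succeed, and a simple circuit breaker that fails fast once a dependency looks unhealthy. It is an illustration under stated assumptions, not ControlUp's implementation. Every name here (`CircuitBreaker`, `call_with_retries`, `NonRetryableError`, `CircuitOpenError`) and every threshold is hypothetical, and the wrapped `operation` is assumed to enforce its own short timeout.

```python
import random
import time


class NonRetryableError(Exception):
    """Hypothetical marker for a failure that a retry can never fix."""


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    """Opens after consecutive failures, then allows a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value, for illustration
        self.reset_after = reset_after              # cool-down in seconds (assumed)
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Once the cool-down has elapsed, let a trial call through (half-open).
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def call_with_retries(operation, breaker, max_attempts=3, base_delay=0.1,
                      sleep=time.sleep):
    """Call operation() with a bounded number of attempts and jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        if not breaker.allow():
            raise CircuitOpenError("dependency marked unhealthy; failing fast")
        try:
            result = operation()
        except NonRetryableError:
            raise  # a retry cannot help, so do not add load
        except Exception:
            breaker.record_failure()
            if attempt == max_attempts:
                raise  # retry budget exhausted
            # Exponential backoff with full jitter spreads retries out in time.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
        else:
            breaker.record_success()
            return result
```

In this sketch a persistently failing dependency receives at most `max_attempts` calls per request, and once the breaker opens, later requests are rejected without touching the dependency until the cool-down elapses. That is the general mechanism by which a retry surge is kept from multiplying traffic.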