ControlUp incident

DEX Database Issue - Intermittent connectivity

Major · Resolved

ControlUp experienced a major incident on April 27, 2026 affecting the US, EMEA, and APAC regions, lasting 7h 26m. The incident has been resolved; the full update timeline is below.

Started
Apr 27, 2026, 05:10 PM UTC
Resolved
Apr 28, 2026, 12:37 AM UTC
Duration
7h 26m
Detected by Pingoru
Apr 27, 2026, 05:10 PM UTC

Affected components

US Region, EMEA Region, APAC Region

Update timeline

  1. investigating Apr 27, 2026, 05:10 PM UTC

    We are currently investigating an issue causing some DEX environments to be inaccessible.

  2. investigating Apr 27, 2026, 05:38 PM UTC

    We are still investigating the issue. Some environments may still experience intermittent access issues to the DEX platform.

  3. identified Apr 27, 2026, 06:07 PM UTC

    The issue has been identified. We are currently working on a fix to deploy to all environments. Some DEX environments may still be impacted at this time. We will provide another update once the fix is ready for deployment.

  4. monitoring Apr 27, 2026, 08:03 PM UTC

    A fix has been implemented, and we are monitoring the results as we see access being restored across all environments. A final update will be posted once everything is confirmed.

  5. resolved Apr 28, 2026, 12:37 AM UTC

    This incident has been resolved.

  6. postmortem May 08, 2026, 01:34 PM UTC

    On the evening of April 27, 2026, some customers experienced intermittent issues when logging in or accessing resources. The disruption lasted approximately 5.5 hours, from 22:00 to 03:12 (GMT+2). The issue was caused by an internal service becoming unstable, which triggered a surge of automated retry attempts that overwhelmed a supporting system.

    Our teams responded immediately and put the following protective measures in place:

    * **Added automatic safeguards** to stop failing requests from piling up and spreading to other services
    * **Reduced wait times** so requests no longer blocked the system for extended periods
    * **Limited the number of automatic retries** to prevent traffic from multiplying during the disruption
    * **Stopped retries on requests that could never succeed**, reducing unnecessary load
    * **Added safety checks** to prevent one failure from triggering thousands of others
    * **Doubled the capacity** of the affected system to handle higher demand
    * **Improved logging** to help diagnose issues faster in the future
    * **Added temporary data storage** so the system can continue serving customers even during disruptions

    Full service was restored at 03:12 (GMT+2) on April 28, with no further issues reported. Most customer data continued to be served during the disruption, limiting the overall impact.
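
The postmortem does not describe how the retry safeguards were implemented. As an illustration only, the following minimal Python sketch shows one common way to combine three of the listed measures: a cap on retries, no retries for requests that can never succeed, and a circuit breaker that stops failing requests from piling up. All names here (`CircuitBreaker`, `call_with_retries`, `NonRetryableError`, and the default thresholds) are hypothetical and are not taken from ControlUp's systems.

```python
import random
import time


class NonRetryableError(Exception):
    """A failure that retrying cannot fix (for example, a malformed request)."""


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, then cools down."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before allowing a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial call through (half-open state).
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def call_with_retries(func, breaker, max_attempts=3, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep):
    """Call func with a bounded number of attempts and jittered exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        if not breaker.allow():
            raise CircuitOpenError("circuit open; call rejected")
        try:
            result = func()
        except NonRetryableError:
            raise  # retrying cannot help, so do not add load
        except Exception:
            breaker.record_failure()
            if attempt == max_attempts:
                raise  # retry budget exhausted
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))  # jitter spreads out retry bursts
        else:
            breaker.record_success()
            return result
```

In this sketch, the retry cap bounds how much one failing request can multiply traffic, while the breaker rejects calls outright once the downstream dependency looks unhealthy, which is one way to keep a single failure from triggering a retry surge like the one described above.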