Balena incident

Elevated API Errors


Balena experienced a notice-level incident on September 10, 2025 affecting the API, BalenaOS Download, and several other components, lasting 8h 20m. The incident has been resolved; the full update timeline is below.

Started
Sep 10, 2025, 01:47 PM UTC
Resolved
Sep 10, 2025, 10:08 PM UTC
Duration
8h 20m
Detected by Pingoru
Sep 10, 2025, 01:47 PM UTC

Affected components

API, BalenaOS Download, Dashboard, Device URLs, Cloudlink (VPN), balenahub

Update timeline

  1. investigating Sep 10, 2025, 01:47 PM UTC

    We're experiencing an elevated level of API errors and are currently looking into the issue.

  2. monitoring Sep 10, 2025, 02:25 PM UTC

    A fix has been implemented and we are monitoring the results.

  3. monitoring Sep 10, 2025, 02:34 PM UTC

    We are continuing to monitor for any further issues.

  4. monitoring Sep 10, 2025, 03:37 PM UTC

    During the incident a lot of devices disconnected from cloudlink. They're slowly but steadily reconnecting. We are monitoring the recovery.

  5. monitoring Sep 10, 2025, 03:46 PM UTC

    It appears that the cloudlink status of many devices didn't update when they disconnected. If your device has appeared connected to cloudlink for more than 1h, its status is stale and it has not yet reconnected (it will soon). We're still monitoring the recovery.

  6. monitoring Sep 10, 2025, 06:55 PM UTC

    Most devices have reconnected, but we're still experiencing an elevated level of device SSH errors and are currently looking into the issue. Connections that do go through are slow. This is also affecting the public URL feature.

  7. identified Sep 10, 2025, 07:29 PM UTC

    Most devices have reconnected, but we're still experiencing an elevated level of device SSH errors and are currently looking into the issue. Connections that do go through are slow. This is also affecting the public URL feature.

  8. monitoring Sep 10, 2025, 09:19 PM UTC

    A fix has been implemented and we are monitoring the results.

  9. monitoring Sep 10, 2025, 09:25 PM UTC

    We are continuing to monitor for any further issues.

  10. monitoring Sep 10, 2025, 09:26 PM UTC

    All tests are passing and devices are stable. We'll keep monitoring.

  11. resolved Sep 10, 2025, 10:08 PM UTC

    This incident has been resolved.

  12. postmortem Sep 19, 2025, 09:37 AM UTC

    On September 10th, around 13:30 UTC, our alerting system reported intermittent elevated API errors. We quickly determined the cause of the incident to be an overly aggressive liveness probe rotating our API pods. A fix was deployed immediately.

    While the API was recovering, an automatic update of the Cloudlink pods occurred. Because the API was slower to respond, device reconnections and SSH authentication were slower than usual, but steadily improving. This apparent recovery was masking a different issue in the Cloudlink update itself, one that was severely impacting container performance under high concurrency. The problem was only apparent at the scale of the production environment and was completely invisible at the lower scale of our development and testing environments.

    Once the concurrency issue had been properly identified, we quickly reverted the update and Cloudlink returned to its expected performance level.

    In the aftermath of this incident, we're making a few important changes to our Cloudlink testing, validation, and deployment protocols to better detect and automatically revert issues that only manifest in production.
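    To illustrate the failure mode described above: a Kubernetes liveness probe with tight thresholds can restart an otherwise healthy pod whose responses are merely slow, which under load cascades into pod churn. The values below are illustrative assumptions, not Balena's actual configuration:

    ```yaml
    # Hypothetical liveness probe for an API pod (values are illustrative only).
    # With these settings, kubelet restarts the container after just
    # failureThreshold * periodSeconds = 2 * 5 = 10 seconds of failed checks.
    # A 1-second timeout means any transient latency spike counts as a failure,
    # so a busy-but-healthy pod can be killed, shifting load onto its peers.
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5        # probe every 5s
      timeoutSeconds: 1       # a slow (>1s) response counts as a failure
      failureThreshold: 2     # two consecutive failures trigger a restart
    ```

    A more tolerant configuration (e.g. `timeoutSeconds: 5`, `failureThreshold: 6`) gives a pod roughly 30 seconds of grace, leaving room for transient latency spikes instead of rotating pods the moment the API slows down.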