Balena experienced a major incident on September 30, 2025 affecting Cloudlink (VPN), lasting 2h 8m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Sep 30, 2025, 07:47 PM UTC
We're experiencing an elevated level of errors in our Device VPN Tunnel infrastructure and are currently looking into the issue.
- identified Sep 30, 2025, 08:35 PM UTC
The issue has been identified and a fix is being implemented.
- monitoring Sep 30, 2025, 09:34 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Sep 30, 2025, 09:56 PM UTC
This incident has been resolved.
- postmortem Oct 01, 2025, 12:51 PM UTC
On **September 30th**, following a production deployment of a core component, a critical authorization failure affected the `balena device tunnel` command, preventing users from establishing port tunnel connections. ### Summary of the Incident A subtle bug was introduced several weeks ago in a **small database query change** within a component. This change passed our standard review process and all pre-deployment testing, including unit tests utilizing mocked API endpoints. Because of other pending changes, this component update was not immediately deployed. When it was finally released to production on September 30th, the query exhibited an unexpected incompatibility with the **live production API environment**. The failure was not immediately apparent through our primary monitoring, but once the authorization issue was identified, our team quickly found the flawed query, deployed a patched component, and restored full functionality. ### Root Cause and Timeline * **Change Introduction:** A small change to a database query was merged several weeks ago. * **Failed Validation:** The change passed unit tests and code review but failed to correctly interact with the real-world production API due to a subtle environmental or data-specific condition. * **Deployment & Failure:** On **September 30th**, the component was deployed. The authorization failure for `balena device tunnel` was subsequently observed. * **Resolution:** The bug was quickly diagnosed, the query was patched, and a fixed component was deployed to production, resolving the incident. ### Corrective Actions We are taking immediate steps to prevent this type of failure from recurring: 1. **Unit Test Realism:** We have **updated our unit test mocks** to more accurately reflect invalid or non-standard production responses, ensuring future query changes are validated against real-world failure modes. 2. **End-to-End Test Scheduling:** We are prioritizing the development and scheduling of a new **end-to-end test** specifically dedicated to validating the full functionality of the `device tunnel` command in a production-like environment. This will catch integration errors sooner. We apologize for the interruption this caused to your workflow. We are committed to using this incident to improve the robustness and reliability of our continuous deployment process.