Balena incident

Elevated Device URLs/VPN Errors

Balena experienced a minor incident on November 19, 2024 affecting Cloudlink (VPN), lasting 1d 4h. The incident has been resolved; the full update timeline is below.

Started: Nov 19, 2024, 10:00 AM UTC
Resolved: Nov 20, 2024, 02:01 PM UTC
Duration: 1d 4h
Detected by Pingoru: Nov 19, 2024, 10:00 AM UTC

Affected components

Cloudlink (VPN)

Update timeline

investigating Nov 19, 2024, 07:29 PM UTC

We're experiencing an elevated level of errors in our Device URLs and VPN infrastructure and are currently looking into the issue.

identified Nov 19, 2024, 08:32 PM UTC

The issue has been identified and a fix is being implemented.

monitoring Nov 19, 2024, 09:24 PM UTC

A fix has been implemented and we are monitoring the results.

resolved Nov 20, 2024, 02:01 PM UTC

This incident has been resolved.

postmortem Nov 22, 2024, 02:06 PM UTC

We observed degraded Cloudlink (VPN) connections following several successive API release deployments. The deployments were spread out over the course of a day, and connections took some time to settle without any manual intervention. This pattern is commonly called a "thundering herd": thousands of devices attempt to connect to a new node at the same time and get rate limited.

Upon investigation, we found that at peak usage the load balancing policies in place for our TCP Cloudlink connections were not optimized to avoid proxying through nodes that were scaling up or down during deploys. Because TCP connections are stateful, the proxied connections were interrupted by the shuffle of other backend services, even though our Cloudlink instances themselves were largely unmoved.

We have since changed our load balancers to route TCP Cloudlink traffic only via nodes that have online and ready Cloudlink pods running. We are also in the early stages of enabling UDP connections for this endpoint and will announce more details in the future.
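
The routing change described in the postmortem can be illustrated with a minimal Python sketch. It is not balena's load balancer configuration: the `Node` type, the `cloudlink_pods_ready` field, and both function names are hypothetical, chosen only to show the idea of excluding nodes that have no ready Cloudlink pod before choosing where to send a TCP connection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical view of a cluster node as seen by a load balancer."""
    name: str
    cloudlink_pods_ready: int  # assumed count of online and ready Cloudlink pods


def eligible_nodes(nodes):
    """Keep only nodes that host at least one ready Cloudlink pod."""
    return [n for n in nodes if n.cloudlink_pods_ready > 0]


def pick_node(nodes, connection_id):
    """Choose a target deterministically among the eligible nodes.

    Nodes without a ready Cloudlink pod are never selected, so churn in
    other backend services on those nodes cannot interrupt the connection.
    """
    candidates = sorted(eligible_nodes(nodes), key=lambda n: n.name)
    if not candidates:
        raise RuntimeError("no node with a ready Cloudlink pod")
    return candidates[connection_id % len(candidates)]


if __name__ == "__main__":
    cluster = [Node("node-a", 1), Node("node-b", 0), Node("node-c", 2)]
    for conn in range(4):
        print(conn, pick_node(cluster, conn).name)
```

In this sketch `node-b` is skipped because it reports no ready Cloudlink pods. A real load balancer would typically derive readiness from health checks rather than a static field.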