Balena incident

Elevated Device URLs/VPN Errors

Severity: Minor
Status: Resolved

Balena experienced a minor incident on November 19, 2024 affecting Cloudlink (VPN), lasting 1d 4h. The incident has been resolved; the full update timeline is below.

Started: Nov 19, 2024, 10:00 AM UTC
Resolved: Nov 20, 2024, 02:01 PM UTC
Duration: 1d 4h
Detected by Pingoru: Nov 19, 2024, 10:00 AM UTC

Affected components

Cloudlink (VPN)

Update timeline

  1. investigating Nov 19, 2024, 07:29 PM UTC

    We're experiencing an elevated level of errors in our Device URLs and VPN infrastructure and are currently looking into the issue.

  2. identified Nov 19, 2024, 08:32 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Nov 19, 2024, 09:24 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Nov 20, 2024, 02:01 PM UTC

    This incident has been resolved.

  5. postmortem Nov 22, 2024, 02:06 PM UTC

    We observed degraded Cloudlink (VPN) connections following several successive API release deployments. These were spread out over the course of a day, and connections took some time to settle without any manual intervention. This pattern is generally referred to as a "thundering herd": thousands of devices attempt to connect to a new node at the same time and get rate limited.

    Upon investigation we found that, at peak usage, the load balancing policies in place for our TCP Cloudlink connections were not optimized to avoid proxying through nodes that were scaling up or down during deploys. Because a proxied TCP connection is tied to the node it passes through, connections were being interrupted by the shuffle of other backend services even though our Cloudlink instances themselves were largely unmoved.

    We have since changed our load balancers to route TCP Cloudlink traffic only via nodes that have online and ready Cloudlink pods running. We are also in the early stages of enabling UDP connections for this endpoint and will announce more details in the future.
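
The routing change described in the postmortem can be illustrated with a minimal sketch. This is not Balena's implementation: the `Node` type, its fields, and the `eligible_nodes` helper are hypothetical names invented for illustration, and the `draining` flag is an assumption about how a node that is scaling down might be marked. The sketch shows only the selection rule, which keeps a node as a proxy target if it hosts at least one ready Cloudlink pod.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A hypothetical load balancer target; all fields are illustrative."""
    name: str
    ready_cloudlink_pods: int  # Cloudlink pods on this node passing readiness checks
    draining: bool = False     # assumed marker for a node that is scaling down


def eligible_nodes(nodes: list[Node]) -> list[Node]:
    """Keep only nodes that host a ready Cloudlink pod and are not draining."""
    return [n for n in nodes if n.ready_cloudlink_pods > 0 and not n.draining]


nodes = [
    Node("node-a", ready_cloudlink_pods=2),
    Node("node-b", ready_cloudlink_pods=0),                 # runs other backend services only
    Node("node-c", ready_cloudlink_pods=1, draining=True),  # scaling down during a deploy
]

print([n.name for n in eligible_nodes(nodes)])  # ['node-a']
```

With a rule like this, a deploy that reshuffles nodes running only other backend services leaves the set of Cloudlink proxy targets unchanged, so established TCP connections have no intermediate hop that can disappear underneath them.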