Hosted Mender incident

Server-side generated deltas (beta) are failing randomly


Hosted Mender experienced a notice-level incident on April 24, 2024, affecting Hosted Mender EU and lasting 4h 57m. The incident has been resolved; the full update timeline is below.

Started: Apr 24, 2024, 03:16 PM UTC
Resolved: Apr 24, 2024, 08:14 PM UTC
Duration: 4h 57m
Detected by Pingoru: Apr 24, 2024, 03:16 PM UTC

Affected components

Hosted Mender EU

Update timeline

  1. investigating Apr 24, 2024, 03:16 PM UTC

    Server-side Generated Delta, which is a beta feature, is failing in some cases for some customers in the European cluster. We're investigating the issue.

  2. identified Apr 24, 2024, 04:24 PM UTC

    The issue has been identified: the Kubernetes worker pods that run server-side Generated Delta jobs in some cases run out of ephemeral storage, and the job then fails. A fix is being implemented.

  3. monitoring Apr 24, 2024, 04:43 PM UTC

    As a workaround, we applied a Pod anti-affinity policy, which prevents server-side Generated Delta pods from being scheduled on nodes that run other worker pods. This gives them more storage.

  4. resolved Apr 24, 2024, 08:14 PM UTC

    The metrics indicate no further issues, and the situation appears stable with the workaround in place. We will, however, implement additional updates in the coming days.
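
For readers unfamiliar with the workaround in update 3, the following is a minimal sketch of what a Pod anti-affinity rule of that kind could look like, written as a Python dictionary that mirrors the Kubernetes pod spec fields. It is not the configuration Hosted Mender applied. The label key and value (`app`, `worker`), the function name, and the choice of a hard (`required...`) rule rather than a soft (`preferred...`) one are all assumptions made for illustration.

```python
import json


def delta_pod_anti_affinity(other_worker_label_key: str = "app",
                            other_worker_label_value: str = "worker") -> dict:
    """Build the `affinity` fragment of a pod spec that keeps a pod off any
    node already running pods matching the given label.

    The label key/value are hypothetical placeholders; the real labels used
    in the affected cluster are not stated in the incident report.
    """
    return {
        "affinity": {
            "podAntiAffinity": {
                # Hard rule (an assumption): the scheduler refuses to place the
                # pod on a node that already hosts a matching pod.
                "requiredDuringSchedulingIgnoredDuringExecution": [
                    {
                        "labelSelector": {
                            "matchExpressions": [
                                {
                                    "key": other_worker_label_key,
                                    "operator": "In",
                                    "values": [other_worker_label_value],
                                }
                            ]
                        },
                        # "Same node" is expressed through the hostname topology key.
                        "topologyKey": "kubernetes.io/hostname",
                    }
                ]
            }
        }
    }


if __name__ == "__main__":
    # Print the fragment as JSON, which Kubernetes accepts alongside YAML.
    print(json.dumps(delta_pod_anti_affinity(), indent=2))
```

With a rule like this, a delta-generation pod no longer shares a node's local disk with the other worker pods, which matches the effect described in the update: more ephemeral storage is left available to the delta job.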