Hosted Mender incident

Scalability issue

Minor Resolved

Hosted Mender experienced a minor incident on January 13, 2025 affecting Hosted Mender EU, lasting 1h 35m. The incident has been resolved; the full update timeline is below.

Started
Jan 13, 2025, 03:01 PM UTC
Resolved
Jan 13, 2025, 04:37 PM UTC
Duration
1h 35m
Detected by Pingoru
Jan 13, 2025, 03:01 PM UTC

Affected components

Hosted Mender EU

Update timeline

  1. investigating Jan 13, 2025, 03:01 PM UTC

    We are experiencing a scalability issue: new Kubernetes worker nodes are being rolled out very slowly. We're checking with the cloud provider.

  2. monitoring Jan 13, 2025, 03:13 PM UTC

    The number of Kubernetes worker nodes now matches the required load. We're still in contact with the cloud provider's support to determine the root cause. The incident is still open.

  3. resolved Jan 13, 2025, 04:37 PM UTC

    The cloud provider's support is still investigating the issue. In the meantime, we have increased the minimum number of Kubernetes worker nodes to prevent further autoscaling issues.

  4. postmortem Jan 29, 2025, 09:15 AM UTC

    We discussed the incident with Azure support and decided to replace a problematic component (an AKS Nodepool). The new component is working fine and has no scalability issues, so we promoted it to production. No further actions are needed.