Hosted Mender incident

Service degradation on hosted Mender EU

Minor Resolved

Hosted Mender reported a minor incident on October 2, 2024. The incident has been resolved; the full update timeline is below.

Started
Oct 02, 2024, 10:21 AM UTC
Resolved
Oct 01, 2024, 10:00 PM UTC
Duration
Detected by Pingoru
Oct 02, 2024, 10:21 AM UTC

Update timeline

  1. resolved Oct 02, 2024, 10:21 AM UTC

    Hosted Mender EU experienced service degradation at approximately 22:05 UTC on October 1st, lasting about ten minutes. A failed synthetic test alerted the on-call team, but shortly after the alert was acknowledged the issue resolved and service functionality was restored. After a brief investigation later today, we identified the root cause as the concurrent upgrade of the Azure Kubernetes Service (AKS) cluster from version 1.29.7 to 1.29.8. The upgrade was expected to be straightforward, but that was not the case this time, and we will need to investigate further to determine why.

  2. postmortem Oct 15, 2024, 06:45 PM UTC

    **What happened**

    An automated Azure Kubernetes Service (AKS) upgrade caused a partial service disruption in the EU cluster. Synthetic tests failed, the on-call team was alerted, and logging in was not possible for several minutes around 00:15 CEST on October 2nd (22:15 UTC on October 1st). The root cause was that nodes were restarted and the Mender services could not handle the traffic. It is likely that both deviceauth pods were unavailable because one or more nodes had been cordoned.

    **What went wrong**

    The baseline resources on hosted Mender EU were limited, even though the platform can scale up to tens of instances when load increases. The baseline was set to two pods per service, which proved insufficient for the AKS upgrade, which replaces nodes one at a time. This led to about 5 minutes of platform degradation.

    **Action taken**

    We resolved the issue by increasing the minimum number of available pods from 2 to 3.
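
The effect of moving the baseline from 2 to 3 pods can be illustrated with a toy model of a one-node-at-a-time upgrade. This is only a sketch under stated assumptions, not the vendor's analysis. It assumes each pod starts on its own node, that an evicted pod is rescheduled onto an already upgraded node, and that the replacement needs `startup_time` time units before it can serve traffic. The function name and all parameter values are hypothetical.

```python
def min_ready_during_upgrade(replicas: int, nodes: int,
                             drain_interval: int, startup_time: int) -> int:
    """Return the lowest number of ready pods seen while nodes are drained in turn.

    Toy model (all assumptions, not measured behaviour):
    - pod i starts on node i, so replicas must not exceed nodes;
    - node k is drained at time k * drain_interval;
    - an evicted pod restarts on an upgraded node and is ready
      startup_time units later, and is never evicted again.
    """
    if replicas > nodes:
        raise ValueError("this sketch assumes at most one pod per node")

    ready_at = [0] * replicas          # time at which each pod can serve traffic
    node_of = list(range(replicas))    # node hosting each pod; None once moved
    lowest = replicas

    for k in range(nodes):
        now = k * drain_interval
        for i in range(replicas):
            if node_of[i] == k:
                # The pod is evicted and restarted elsewhere.
                node_of[i] = None
                ready_at[i] = now + startup_time
        # Readiness only improves between drains, so the minimum
        # always occurs immediately after an eviction.
        ready = sum(1 for t in ready_at if t <= now)
        lowest = min(lowest, ready)

    return lowest


if __name__ == "__main__":
    # Hypothetical numbers: 3 nodes, a drain every 1 unit, pods need 2 units to start.
    print(min_ready_during_upgrade(replicas=2, nodes=3, drain_interval=1, startup_time=2))
    print(min_ready_during_upgrade(replicas=3, nodes=3, drain_interval=1, startup_time=2))
```

With these illustrative numbers, two replicas briefly drop to zero ready pods because the first replacement is still starting when the second node is drained, while three replicas always keep at least one pod ready. The margin depends on how slowly pods start relative to the drain cadence, so a third replica adds headroom but is not a guarantee in every configuration.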