Hosted Mender incident

Mender Device Auth Module unavailable

Hosted Mender experienced a major incident on August 26, 2025 affecting Hosted Mender EU, lasting 16h 8m. The incident has been resolved; the full update timeline is below.

Started: Aug 26, 2025, 05:37 PM UTC
Resolved: Aug 27, 2025, 09:46 AM UTC
Duration: 16h 8m
Detected by Pingoru: Aug 26, 2025, 05:37 PM UTC

Affected components

Hosted Mender EU

Update timeline

investigating Aug 26, 2025, 05:37 PM UTC

We are investigating the root cause of the device authorization backend issue.

identified Aug 26, 2025, 05:59 PM UTC

We have identified the issue and implemented a temporary fix, which we are now monitoring.

resolved Aug 27, 2025, 09:46 AM UTC

This incident has been resolved.

postmortem Sep 01, 2025, 12:02 PM UTC

**Abstract**

On the 26th of August, at 05:01 PM UTC, we received multiple alerts regarding the hosted Mender EU cluster, in particular many 5xx errors and repeated restarts of the pods for the device-auth service. As a result, the hosted Mender cluster was not fully operational, since devices could not authenticate. The restarts were caused by multiple “out of memory” kills. The on-call team member increased the memory available to the pod to stop the restarts and restore hosted Mender functionality.

The following day, the backend team analyzed the issue and found that the cluster was receiving an unusually high number of requests to the endpoint `GET /api/management/v2/devauth/devices`: a Python agent was requesting the same resource approximately 200 times per minute (up from 6 requests per minute before). We alerted the customer from whom the requests originated; they confirmed an unintended issue on their end and fixed it.

**Incident timeline**:

26th of August

* 5:01 PM - started seeing 5xx errors
* 5:02 PM - mender-device-auth restarting
* 5:49 PM - memory increased - incident finished

27th of August

* about 9 AM - the customer was alerted
* about 1 PM - the customer fixed the issue
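
The request pattern described in the abstract (one agent fetching the same resource about 200 times per minute) is the kind of load a client-side guard can limit. The following is a minimal sketch only, not the customer's actual fix and not part of any Mender client library: `ThrottledFetcher`, `min_interval_s`, and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever HTTP call the agent makes to `GET /api/management/v2/devauth/devices`.

```python
import time
from typing import Any, Callable, Optional


class ThrottledFetcher:
    """Cache the result of an expensive call and refresh it at most once per interval.

    Hypothetical helper for illustration; `fetch` is any zero-argument callable,
    for example a function that performs the HTTP request to the device list endpoint.
    """

    def __init__(
        self,
        fetch: Callable[[], Any],
        min_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._min_interval_s = min_interval_s
        self._clock = clock  # injectable so the behavior can be tested without sleeping
        self._last_fetch_at: Optional[float] = None
        self._cached: Any = None
        self.upstream_calls = 0  # number of times `fetch` was actually invoked

    def get(self) -> Any:
        """Return the cached value, calling `fetch` only if the cache is stale."""
        now = self._clock()
        stale = (
            self._last_fetch_at is None
            or now - self._last_fetch_at >= self._min_interval_s
        )
        if stale:
            self._cached = self._fetch()
            self._last_fetch_at = now
            self.upstream_calls += 1
        return self._cached
```

With `min_interval_s=10.0`, callers can invoke `get()` as often as they like while the wrapped request reaches the backend at most once every ten seconds, which is in line with the 6 requests per minute reported as the earlier baseline. A guard like this bounds the damage from an accidental tight loop; it does not replace server-side rate limiting.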