Hosted Mender incident

Issues with Redis cache and DeviceAuth service

Major Resolved

Hosted Mender experienced a major incident on June 23, 2025, affecting Hosted Mender US and lasting 31 minutes. The incident has been resolved; the full update timeline is below.

Started
Jun 23, 2025, 05:08 AM UTC
Resolved
Jun 23, 2025, 05:39 AM UTC
Duration
31m
Detected by Pingoru
Jun 23, 2025, 05:08 AM UTC

Affected components

Hosted Mender US

Update timeline

  1. investigating Jun 23, 2025, 05:08 AM UTC

    We are investigating an issue with the Redis cluster and the Device Auth service, which is in a degraded state.

  2. monitoring Jun 23, 2025, 05:19 AM UTC

    The issue has been identified: a new Redis pod was restarting because it was being OOMKilled. More memory has been allocated to the Redis pool, and the services are now back up. We are monitoring the result.

  3. resolved Jun 23, 2025, 05:39 AM UTC

    This incident has been resolved.

  4. postmortem Jun 23, 2025, 08:29 AM UTC

    This morning, the operations team performed a planned Redis Cluster upgrade, starting at 04:40 UTC. Around 04:56 UTC, one of the Redis pods was killed after running out of memory (OOMKilled), causing connection failures in the Device Auth service. To resolve this, the operations team increased the memory allocated to the Redis Cluster, starting at 05:05 UTC. The change was fully implemented by 05:14 UTC, after which the Device Auth service logged no further errors and returned to normal operation.
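
The times given in the postmortem can be turned into intervals with a little arithmetic. The following is a minimal sketch in Python and is not part of the original report: the dictionary keys and helper names are hypothetical labels chosen for illustration, the timestamps are copied from the postmortem and the status-page header above, and "around 04:56 UTC" is treated as exact for the sake of the calculation.

```python
from datetime import datetime

# All times are UTC on the incident date, copied from the report.
DAY = "2025-06-23"

# Hypothetical labels for the events described in the postmortem.
postmortem = {
    "upgrade_started": "04:40",
    "redis_pod_oom_killed": "04:56",      # reported as "around 04:56"
    "memory_increase_started": "05:05",
    "memory_increase_complete": "05:14",
}

# Window shown in the status-page header.
status_page = {"started": "05:08", "resolved": "05:39"}


def at(hhmm: str) -> datetime:
    """Parse an HH:MM time on the incident date."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed from start to end."""
    return int((at(end) - at(start)).total_seconds() // 60)


upgrade_to_oom = minutes_between(
    postmortem["upgrade_started"], postmortem["redis_pod_oom_killed"]
)
oom_to_fix_start = minutes_between(
    postmortem["redis_pod_oom_killed"], postmortem["memory_increase_started"]
)
oom_to_recovery = minutes_between(
    postmortem["redis_pod_oom_killed"], postmortem["memory_increase_complete"]
)
oom_to_status_page = minutes_between(
    postmortem["redis_pod_oom_killed"], status_page["started"]
)
status_page_duration = minutes_between(
    status_page["started"], status_page["resolved"]
)

print(f"Upgrade start to OOM kill:      {upgrade_to_oom} min")
print(f"OOM kill to fix start:          {oom_to_fix_start} min")
print(f"OOM kill to fix complete:       {oom_to_recovery} min")
print(f"OOM kill to status-page start:  {oom_to_status_page} min")
print(f"Status-page incident duration:  {status_page_duration} min")
```

By this reckoning, the pod was killed about 16 minutes into the upgrade, and the memory increase was fully rolled out roughly 18 minutes after that. The status-page window (05:08 to 05:39 UTC, 31 minutes) opened about 12 minutes after the kill reported in the postmortem.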