Hosted Mender incident

API issues with Mender Server

Critical · Resolved

Hosted Mender experienced a critical incident on October 22, 2025 affecting Hosted Mender US, lasting 2h 35m. The incident has been resolved; the full update timeline is below.

Started
Oct 22, 2025, 08:56 AM UTC
Resolved
Oct 22, 2025, 11:32 AM UTC
Duration
2h 35m
Detected by Pingoru
Oct 22, 2025, 08:56 AM UTC

Affected components

Hosted Mender US

Update timeline

  1. investigating Oct 22, 2025, 08:56 AM UTC

    We noted a spike in API error metrics; we are investigating the issue.

  2. investigating Oct 22, 2025, 09:07 AM UTC

    We are continuing to investigate this issue.

  3. identified Oct 22, 2025, 09:20 AM UTC

    The issue has been identified. A migration triggered by an upgrade caused an index to be removed prematurely. This in turn caused data corruption. We have initiated a database restore and rolled back the upgrade. We apologize for the inconvenience.

  4. monitoring Oct 22, 2025, 09:44 AM UTC

    The restore to the 08:10:00 UTC point in time completed at 09:28:49 UTC, and the server has been scaled back up. We will continue to monitor the situation.

  5. resolved Oct 22, 2025, 11:32 AM UTC

    This incident has been resolved.

  6. postmortem Dec 11, 2025, 08:26 PM UTC

    **Date:** October 22, 2025
    **Duration:** 78 minutes (08:10 - 09:28 UTC)
    **Severity:** Major service disruption

    **Executive summary**

    A database migration in release v4.1.0-saas.16 caused a complete failure of the Device Authentication service across the US and EU hosted Mender clusters. The migration incorrectly deleted a critical uniqueness constraint during online operation, leading to database corruption that prevented service recovery. We restored service by performing a point-in-time database rollback, resulting in 78 minutes of data loss.

    **Customer impact**

    Device authentication was unavailable for 78 minutes. New device enrollments were blocked, and existing device operations may have been disrupted during this period.

    **Root cause**

    The new version contained a migration to version 2.0.1 of the Device Auth database. It was designed to replace a uniqueness constraint on device authentication records, but it executed the deletion and recreation as separate operations. During the online migration, the window between index deletion and recreation allowed duplicate device entries to be created, corrupting the database state and preventing both forward migration completion and rollback. The only viable solution was therefore to roll back both the Mender Server version and the database.

    **Resolution and recovery**

    With duplicate records preventing normal rollback procedures, we performed a point-in-time database restore to 08:10 UTC, a safe timestamp before the migration executed. This restored database integrity but resulted in permanent loss of all data created between 08:10 and 09:28 UTC.

    **Incident timeline (UTC)**

    * 08:35 - The new v4.1.0-saas.16 version was published, and both hosted Mender US and EU started the automated upgrade.
    * 08:40 - The upgrade failed and rolled back automatically to v4.1.0-saas.15 because the deviceauth service was unable to complete the migration job.
    * 08:42 - The on-call team acknowledged a possible issue with the upgrade; meanwhile, the deviceauth service and MongoDB were at 100% load because of the missing index.
    * 09:16 - We decided to restore the MongoDB database to the 08:10:00 point in time, and the restoration process started.
    * 09:28 - The MongoDB restoration process finished.

    **What went wrong**

    * **Migration strategy**: The migration required an offline window or an atomic operation strategy, but this requirement was not identified during development or code review.
    * **Testing gaps**: Pre-release testing did not simulate high-concurrency writes during the migration, so it failed to trigger the race condition seen in production.
    * **Data loss**: We failed to export a snapshot of the corrupted state before the point-in-time retention window expired.

    **Action items**

    * **Enhance load testing**: Pre-release tests do not sufficiently simulate the production environment to catch issues like this at an early stage. We plan to run load testing and chaos testing more frequently and more extensively to mitigate this risk.
    * **Update the rollback playbook**: Mandate that a snapshot of the "corrupted" database state be taken immediately after a destructive point-in-time recovery, to preserve data and allow recovery if necessary.

    We sincerely apologize for the disruption to your operations and, specifically, for the data loss that occurred during the recovery window.
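The race condition at the heart of this incident (a unique index dropped and recreated as two separate online operations) can be sketched with a toy model. This is an illustrative simulation only, not Mender's actual migration code or schema; the device identifiers and the set-based "index" are invented for the example:

```python
# Toy model of the failure mode: uniqueness enforcement is a flag, and
# the unsafe online migration drops it before recreating it. Writes that
# arrive in the gap can introduce duplicates, after which the unique
# index can no longer be rebuilt -- blocking both forward migration and
# rollback, as in the incident.

def migrate_online(store, concurrent_writes):
    """Drop-then-recreate a unique index while writes continue.

    Returns True if the unique index could be rebuilt afterwards,
    False if duplicates slipped in during the unprotected window.
    """
    store["unique_index"] = False          # step 1: old index dropped
    for identity in concurrent_writes:     # writes land in the gap
        store["records"].append(identity)  # no uniqueness check active
    # step 2: rebuilding the unique index fails if duplicates now exist
    duplicates = len(store["records"]) != len(set(store["records"]))
    store["unique_index"] = not duplicates
    return store["unique_index"]

store = {"records": ["dev-a", "dev-b"], "unique_index": True}
# A write for an already-known device arrives mid-migration:
ok = migrate_online(store, ["dev-a"])
print(ok)  # False: the index cannot be rebuilt, mirroring the incident
```

A safer strategy is to build the replacement index before dropping the old one, or to perform the swap in an offline window, so that no write is ever accepted without uniqueness enforcement in place.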