Hosted Mender experienced a critical incident on September 16, 2025, affecting Hosted Mender US and lasting 2h 9m. The incident has been resolved; the full update timeline is below.
Affected components
Hosted Mender US
Update timeline
- investigating Sep 16, 2025, 01:58 PM UTC
We are aware of an issue with the Deployments Service and an unusual number of server errors. We are investigating the issue.
- identified Sep 16, 2025, 03:18 PM UTC
The issue has been identified. We scaled up both the MongoDB cluster and the backend services to cope with unexpected load.
- monitoring Sep 16, 2025, 03:30 PM UTC
After scaling up the cluster, the APIs are back to normal operation. We are continuing to monitor the cluster's health.
- resolved Sep 16, 2025, 04:07 PM UTC
This incident has been resolved.
- postmortem Sep 26, 2025, 04:21 PM UTC
**Abstract**

On Tuesday, September 16th, at around 01:50 PM UTC, a Mender deployment was created by a single tenant targeting about 16,000 devices configured with a very short polling interval. This caused a sudden load and memory spike in the Deployments Service, the Device Auth Service, and MongoDB. Unfortunately, the MongoDB cluster was unable to scale automatically in time to absorb the spike. As a result, we had to first deactivate the Deployments Service, abort the massive deployment, and allow MongoDB to recover properly. Only then were we able to scale it up and restore full operations. We apologize for the disruption this incident caused. We know that hosted Mender scales well when load increases gradually, but in this case the spike was too sudden for scaling to keep up, and we have to address this issue.

**Incident Timeline** (all times UTC)

* 2025-09-16 01:52 PM - The alert NodeHighNumberConntrackEntriesUsed was sent to the on-call operator. The immediate symptoms were MongoDB rolling back and the Deployments Service and Device Auth Service restarting due to out-of-memory kills.
* 2025-09-16 02:01 PM - The root cause was identified: a massive number of requests to v2/deployments/next were being served (around 2,000 per minute, compared to an average of 20 per minute).
* 2025-09-16 02:10 PM - We scaled the Deployments Service and the Device Auth Service, increasing memory by 200%.
* 2025-09-16 02:15 PM - We attempted to apply per-API limits on the tenant causing the massive deployment, but this had no effect.
* 2025-09-16 02:20 PM - The MongoDB cluster remained unavailable due to repeated crashes across all three nodes. We decided to scale the Deployments Service and the Device Auth Service down to zero to allow MongoDB to recover.
* 2025-09-16 02:40 PM - MongoDB was fully restored, and we began scaling the cluster up from M40 to M60.
* 2025-09-16 02:52 PM - MongoDB was successfully running at M60.
* 2025-09-16 02:56 PM - We marked the massive deployment as "finished" to prevent further disruption.
* 2025-09-16 03:00 PM - Since the massive deployment had been aborted, we scaled the Deployments Service and the Device Auth Service back to full operation.
* 2025-09-16 03:02 PM - Hosted Mender US was fully operational again.
* 2025-09-16 03:10 PM - We notified the customer that we had aborted the massive deployment. We also prevented the MongoDB cluster from scaling back down to M40; the new minimum size is now M50.

**What went wrong**

The Mender Server currently manages more than 500,000 active devices, which usually poll the API server at long intervals (every few hours or once per day). Under these conditions, the Deployments Service typically serves about 30 deployments per minute on average, with peaks of around 1,000 per minute. In this case, however, a single deployment targeted about 16,000 devices with a polling interval of 5 minutes or less (a back-of-the-envelope estimate of that load is sketched below). As a result, all deployments started nearly simultaneously, leaving neither MongoDB nor the Kubernetes cluster enough time to scale up properly, and both MongoDB and the Deployments Service ran out of memory. In short, hosted Mender cannot currently handle this type of sudden load, and we must address this.

**What we did in the short term**

* Disabled the massive deployment and coordinated with the customer to perform deployments in a way that is better suited to the current cluster capacity.
* Scaled up MongoDB, the Deployments Service, and the Device Auth Service to help prevent a recurrence.
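To put the numbers from "What went wrong" in perspective, here is a small back-of-the-envelope sketch in Go using the figures reported above (16,000 devices, a 5-minute polling interval, and a typical rate of about 30 requests per minute). The calculation is purely illustrative; it is not taken from our monitoring.

```go
// Back-of-the-envelope estimate of the update-check load generated by the
// incident deployment, using the figures reported above (illustrative only).
package main

import "fmt"

func main() {
	const (
		devices         = 16000.0 // devices targeted by the single deployment
		pollIntervalMin = 5.0     // per-device polling interval, in minutes
		baselinePerMin  = 30.0    // typical requests served per minute
	)

	// Each device issues roughly one update check per polling interval.
	requestsPerMin := devices / pollIntervalMin

	fmt.Printf("expected load: ~%.0f requests/min (~%.0f/s)\n",
		requestsPerMin, requestsPerMin/60)
	fmt.Printf("roughly %.0fx the usual ~%.0f requests/min\n",
		requestsPerMin/baselinePerMin, baselinePerMin)
}
```

This yields on the order of 3,000 update checks per minute from a single tenant, consistent with the roughly 2,000 requests per minute we observed during the incident and far beyond the normal baseline.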
**What should we do in the long term**

We already have very basic rate limiting in the API server, but it must be improved to protect hosted Mender from usage patterns that could negatively impact other customers, and to give the Kubernetes and MongoDB clusters enough time to scale up properly. We are committed to implementing a new set of rate limits.
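As an illustration of the direction this could take, below is a minimal sketch of per-tenant rate limiting as Go HTTP middleware using a token bucket (golang.org/x/time/rate). The tenant header, budget values, and route are assumptions made for the example, not the actual hosted Mender implementation.

```go
// Minimal sketch of per-tenant rate limiting as Go HTTP middleware.
// Header name, limits, and route are illustrative assumptions, not the
// actual hosted Mender implementation.
package main

import (
	"log"
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// tenantLimiter keeps one token bucket per tenant ID.
type tenantLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	perSec   rate.Limit
	burst    int
}

func newTenantLimiter(perSec rate.Limit, burst int) *tenantLimiter {
	return &tenantLimiter{
		limiters: make(map[string]*rate.Limiter),
		perSec:   perSec,
		burst:    burst,
	}
}

// get returns the bucket for a tenant, creating it on first use.
func (t *tenantLimiter) get(tenant string) *rate.Limiter {
	t.mu.Lock()
	defer t.mu.Unlock()
	l, ok := t.limiters[tenant]
	if !ok {
		l = rate.NewLimiter(t.perSec, t.burst)
		t.limiters[tenant] = l
	}
	return l
}

// middleware rejects requests above the tenant's budget with 429, so one
// tenant's burst cannot starve the backend services and MongoDB.
func (t *tenantLimiter) middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tenant := r.Header.Get("X-Tenant-ID") // hypothetical tenant header
		if !t.get(tenant).Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Example budget: ~60 update checks per minute per tenant, burst of 30.
	limiter := newTenantLimiter(rate.Limit(1), 30)

	mux := http.NewServeMux()
	// Illustrative stand-in for the v2/deployments/next update-check route.
	mux.HandleFunc("/api/devices/v2/deployments/next", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent)
	})

	log.Fatal(http.ListenAndServe(":8080", limiter.middleware(mux)))
}
```

A per-tenant budget along these lines would answer excess update checks with 429 instead of letting a single tenant's burst exhaust memory in the Deployments Service and MongoDB, giving the clusters time to scale up.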