Hosted Mender incident

Hosted Mender (EU) Site is down

Hosted Mender experienced a critical incident on June 13, 2024 affecting Hosted Mender EU, lasting 1h 5m. The incident has been resolved; the full update timeline is below.

Started: Jun 13, 2024, 10:26 AM UTC
Resolved: Jun 13, 2024, 11:31 AM UTC
Duration: 1h 5m
Detected by Pingoru: Jun 13, 2024, 10:26 AM UTC

Affected components

Hosted Mender EU

Update timeline

investigating Jun 13, 2024, 10:44 AM UTC

Hosted Mender EU is experiencing an outage. We are investigating the issue.
monitoring Jun 13, 2024, 10:46 AM UTC

We have identified the issue and implemented a fix. We are continuing to monitor the situation.
monitoring Jun 13, 2024, 11:27 AM UTC

We are continuing to monitor for any further issues.
resolved Jun 13, 2024, 11:31 AM UTC

This incident has been resolved.
postmortem Jul 20, 2024, 03:45 PM UTC

#### Summary

On June 13th, at about 10:30 UTC, the Nginx Ingress Controller deployment ran out of resources: its CPU allocation was insufficient for the traffic at the time. This led to CPU throttling and, in turn, failing readiness checks, causing several services to become unavailable or perform poorly. The on-call team was notified immediately by one of the external probes.

The Nginx Ingress Controller deployment already ships with a HorizontalPodAutoscaler resource, but a sudden traffic spike caused CPU throttling and failing readiness checks just before the deployment could scale out.

#### Corrective actions

In the short term, we increased the CPU allocation for the Nginx Ingress Controller deployment and monitored the deployment to ensure stability and performance.

In the long term, we assessed the Nginx Ingress Controller deployment and set a high CPU limit to leave enough headroom for spikes in compute demand.

**Lesson learned**

The Nginx Ingress Controller deployment plays a crucial role in the hosted Mender EU cluster, and for this reason it is monitored in multiple ways. This incident also teaches us to allocate resources based not only on historical metrics, but also with enough headroom for possible future request spikes.
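
To illustrate why an autoscaler alone may not absorb a sudden spike, here is a minimal sketch of the replica calculation documented for the Kubernetes HorizontalPodAutoscaler: `ceil(currentReplicas * currentMetric / targetMetric)`, skipped when the ratio is within a tolerance (0.1 by default). The replica counts and utilization figures are hypothetical and are not taken from the hosted Mender EU cluster. The function name `desired_replicas` is also only illustrative.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_utilization: float,
    target_cpu_utilization: float,
    tolerance: float = 0.1,
) -> int:
    """Replica count the HPA would request for a given CPU utilization.

    Utilization values are percentages of the pods' CPU *request*.
    """
    ratio = current_cpu_utilization / target_cpu_utilization
    # Within the tolerance band the HPA leaves the deployment unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical numbers: 3 ingress pods, 70% target utilization.
steady = desired_replicas(3, 75, 70)   # small drift, inside tolerance
spike = desired_replicas(3, 210, 70)   # sudden threefold load
print(steady, spike)
```

The calculation only yields a target. The controller evaluates it periodically (every 15 seconds by default), and new pods still have to start and pass their readiness checks. During that window the existing pods carry the whole spike and are capped by their CPU limit, which is consistent with the throttling described in the summary and with the decision to raise the limit.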