Hosted Mender Outage History

Hosted Mender had 28 outages in the last 2 years (since July 11, 2024), totaling 9h 23m of downtime, an average of about 1.2 incidents per month. Each is summarised below with incident details, duration, and resolution information.

Source: https://mender.statuspage.io

Minor November 15, 2024

API Errors and Resource Usage Spike

Detected by Pingoru: Nov 15, 2024, 11:39 AM UTC
Resolved: Nov 15, 2024, 02:04 PM UTC
Duration: 2h 24m
Affected: Hosted Mender US
Timeline · 4 updates
  1. investigating Nov 15, 2024, 11:39 AM UTC

    Today at 11:15 UTC, we observed a sudden spike in workload resource usage on hosted Mender, accompanied by an increase in API errors. By 11:24 UTC, metrics had returned to normal levels, and there have been no further indications of ongoing issues. The incident appears to have been transient and is no longer affecting the system's functionality. We are actively investigating the root cause to ensure stability and prevent recurrence.

  2. identified Nov 15, 2024, 11:50 AM UTC

    The issue has been identified; the database scaled up automatically to address the increased load, so no further actions are needed.

  3. monitoring Nov 15, 2024, 11:52 AM UTC

    We are monitoring the hosted Mender statistics and watching for any further spikes.

  4. resolved Nov 15, 2024, 02:04 PM UTC

    This incident has been resolved.

Minor October 2, 2024

Service degradation on hosted Mender EU

Detected by Pingoru: Oct 02, 2024, 10:21 AM UTC
Resolved: Oct 01, 2024, 10:00 PM UTC
Duration: not listed
Timeline · 2 updates
  1. resolved Oct 02, 2024, 10:21 AM UTC

    Hosted Mender EU experienced service degradation at approximately 22:05 UTC on October 1st, lasting about ten minutes. The on-call team was alerted by a failing synthetic test, but shortly after the alert was acknowledged the issue cleared and service functionality was restored. Later today, after a brief investigation, we identified the root cause as the contextual upgrade of the Azure Kubernetes Service (AKS) cluster from version 1.29.7 to 1.29.8. The upgrade was expected to be straightforward and smooth, but that was not the case tonight, and we will need to investigate further to determine why.

  2. postmortem Oct 15, 2024, 06:45 PM UTC

    **What happened** An automated Azure Kubernetes Service (AKS) upgrade caused a partial service disruption in the EU cluster. Synthetic tests failed, the on-call team was alerted, and logging in was not possible for several minutes around 00:15 CEST on October 2nd. The root cause was that nodes were restarted and the Mender services could not handle the traffic. It is likely that both deviceauth pods were unavailable because one or more nodes had been cordoned.

    **What went wrong** The minimum resources on hosted Mender EU were limited, even though it can scale up to tens of instances if load increases. The baseline was set to two pods per service, which turned out to be insufficient for the AKS upgrade, which rolls out nodes one at a time. This led to about 5 minutes of platform degradation.

    **Action taken** We resolved the issue by increasing the minimum available pods from 2 to 3.
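
    The postmortem's reasoning about replica counts can be illustrated with a small model. This is only a sketch under stated assumptions: the node names, pod names, and the snapshot itself are hypothetical and do not come from Mender's actual cluster. It assumes a pod evicted from an already-upgraded node is still starting elsewhere when the next node begins draining, which is one plausible way both deviceauth pods could be unavailable at once.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    node: str
    ready: bool = True


def count_available(pods, draining_nodes):
    """Count pods that are ready and not on a node currently being drained."""
    return sum(1 for p in pods if p.ready and p.node not in draining_nodes)


# Hypothetical snapshot midway through a one-node-at-a-time upgrade:
# node-1 has already been replaced, and its pod is still starting on node-3.
two_replicas = [
    Pod("deviceauth-a", "node-3", ready=False),  # rescheduled, not ready yet
    Pod("deviceauth-b", "node-2"),               # on the node being drained
]
# Same snapshot with one extra ready replica on a node that is not draining.
three_replicas = two_replicas + [Pod("deviceauth-c", "node-3")]

draining = {"node-2"}
print(count_available(two_replicas, draining))    # 0
print(count_available(three_replicas, draining))  # 1
```

    In this simplified snapshot, a baseline of two leaves no pod serving traffic, while a third replica keeps one available. Real scheduling depends on pod placement, readiness timing, and any disruption budgets in place, none of which are described in the postmortem.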

Notice July 11, 2024

Signup malfunction

Detected by Pingoru: Jul 11, 2024, 05:42 AM UTC
Resolved: Jul 11, 2024, 12:41 PM UTC
Duration: 6h 59m
Affected: Hosted Mender US, Hosted Mender EU
Timeline · 2 updates
  1. identified Jul 11, 2024, 05:42 AM UTC

    There is an issue with signing up to Hosted Mender. We have identified the issue and are working on a fix.

  2. resolved Jul 11, 2024, 12:41 PM UTC

    This incident has been resolved.
