Hosted Mender incident

Artifacts are downloaded from the wrong location

Major Resolved

Hosted Mender experienced a major incident on December 19, 2024 affecting Hosted Mender US, lasting 3h 26m. The incident has been resolved; the full update timeline is below.

Started
Dec 19, 2024, 10:15 AM UTC
Resolved
Dec 19, 2024, 01:41 PM UTC
Duration
3h 26m
Detected by Pingoru
Dec 19, 2024, 10:15 AM UTC

Affected components

Hosted Mender US

Update timeline

  1. investigating Dec 19, 2024, 10:15 AM UTC

    After this morning's update, hosted Mender artifacts are no longer served from the default storage, Cloudflare, but from the internal storage proxy, for which Cloudflare is not configured. This appears to be a misconfiguration, and we are investigating the issue.

  2. monitoring Dec 19, 2024, 10:30 AM UTC

    A fix has been implemented and we are monitoring the results.

  3. resolved Dec 19, 2024, 01:41 PM UTC

    This incident has been resolved; functionality is fully restored.

  4. postmortem Dec 28, 2024, 02:35 PM UTC

    **Abstract**

    We upgraded hosted Mender using Helm Chart v6.0.0 on the morning of December 19, as planned. The new Helm chart ships with the DEPLOYMENTS_STORAGE_PROXY_URI variable already set, to make onboarding easier for open source users. In hosted Mender US, this variable had been overridden to serve a single customer, so it was added with a default value. This had the unexpected effect of enabling the storage proxy feature for every customer whose storage setting is not overridden, which includes most non-enterprise customers. Fortunately, the storage proxy feature was no longer needed, because the single customer using it had been migrated two weeks earlier, so we resolved the situation by disabling the storage proxy feature entirely.

    **Incident timeline**

    * 2024-12-19 05:30 UTC: Mender Server v4.0.0-rc.4 and Helm Chart v6.0.0-rc.4 were applied to the hosted Mender US cluster with the storage proxy feature enabled.
    * 2024-12-19 10:03 UTC: an operator monitoring the status after the upgrade found two support tickets from customers reporting that artifacts were not being downloaded.
    * 2024-12-19 10:15 UTC: this Statuspage incident was opened.
    * 2024-12-19 10:30 UTC: we disabled the storage proxy feature in hosted Mender US and verified that downloads were successful again.
    * 2024-12-19 11:00 UTC: customers confirmed that service had been restored.

    **Actions we have decided to take to prevent this incident from happening again**

    The only customer that was using the storage proxy feature has already been migrated to a custom, supported solution, precisely to avoid this kind of issue. Additionally, we need to monitor artifact downloads better, and we will introduce more statistics, metrics, and alerts.
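The failure mode described in the postmortem abstract can be illustrated with a small model: once a storage proxy URI is set globally, every tenant without its own storage setting is routed through the proxy. This is only a sketch under stated assumptions, not Mender's actual code. The `Tenant` class, the `artifact_base_url` and `affected_tenants` functions, and all hostnames are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical hosts, for illustration only; not real Mender endpoints.
DEFAULT_STORAGE = "https://default-storage.example.invalid"
STORAGE_PROXY = "https://storage-proxy.example.invalid"


@dataclass
class Tenant:
    name: str
    # Per-tenant storage setting; None means "use the platform default".
    storage_override: Optional[str] = None


def artifact_base_url(tenant: Tenant, storage_proxy_uri: Optional[str]) -> str:
    """Simplified model: pick where a tenant's artifacts are served from."""
    if tenant.storage_override:
        # Tenants with their own storage setting are unaffected by the proxy.
        return tenant.storage_override
    if storage_proxy_uri:
        # A non-empty proxy URI turns the proxy on for everyone else.
        return storage_proxy_uri
    return DEFAULT_STORAGE


def affected_tenants(tenants, storage_proxy_uri):
    """Names of tenants whose downloads would be routed through the proxy."""
    if not storage_proxy_uri:
        return []
    return [
        t.name
        for t in tenants
        if artifact_base_url(t, storage_proxy_uri) == storage_proxy_uri
    ]


tenants = [
    Tenant("enterprise-a", storage_override="https://tenant-a-storage.example.invalid"),
    Tenant("basic-b"),
    Tenant("basic-c"),
]

# With the variable unset, nobody is routed through the proxy.
print(affected_tenants(tenants, None))           # []
# With a global default applied, every tenant without an override is.
print(affected_tenants(tenants, STORAGE_PROXY))  # ['basic-b', 'basic-c']
```

A check of this kind, run against the effective configuration before a rollout, is one way to see how many tenants a new default would touch. Whether it fits the real deployment depends on details the postmortem does not describe.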