Hosted Mender incident

Mender documentation website is unreachable

Major Resolved

Hosted Mender experienced a major incident on December 13, 2024 affecting docs.mender.io, lasting 4h 34m. The incident has been resolved; the full update timeline is below.

Started
Dec 13, 2024, 08:09 AM UTC
Resolved
Dec 13, 2024, 12:44 PM UTC
Duration
4h 34m
Detected by Pingoru
Dec 13, 2024, 08:09 AM UTC

Affected components

docs.mender.io

Update timeline

  1. investigating Dec 13, 2024, 08:09 AM UTC

    We are currently investigating the issue.

  2. monitoring Dec 13, 2024, 08:23 AM UTC

    A fix has been implemented and we're monitoring the result.

  3. resolved Dec 13, 2024, 12:44 PM UTC

    This incident has been resolved.

  4. postmortem Dec 19, 2024, 03:31 PM UTC

    **Abstract**

    The documentation website is hosted in a GKE cluster, which needs a Docker Hub user to pull the private images used to update the documentation pods. A new, dedicated Docker Hub user had been prepared for this purpose, but until December 13 the cluster was still using an old Docker Hub user. The issue began on December 11, when the old user's password was rotated as a security improvement and the operator who performed the rotation did not check whether that user was still in use in the GKE cluster. Two days later the GKE cluster performed a node upgrade and the pods were rescheduled, but they could not pull the image because the old user's password had been rotated. The on-call team was alerted that the documentation website was down, the operator created a new token, this time for the right user, and the incident was resolved.

    **Incident timeline**

    * 2024-12-11 - An operator, following a security ticket, updated the token for a Docker Hub user and changed the secret in an internal Staging cluster. The operator believed that the secret was no longer used in the GKE Website cluster and that another user was used there.
    * 2024-12-13 07:50 UTC - GKE performed a node rotation, causing the pods to be rescheduled.
    * 2024-12-13 07:55 UTC - The docs website pods were still using the old user's secret, which had already been rotated, so they no longer had access to the private repository. The pods were marked as ImagePullBackOff and did not start.
    * 2024-12-13 07:57 UTC - The monitoring system alerted the on-call operator.
    * 2024-12-13 08:20 UTC - The operator created a new token for the right user and manually replaced the previous dockerconfigjson secret with the new one. The pods were running again.
    * 2024-12-13 08:23 UTC - The monitoring system declared the website up again.

    **Actions we have decided to take to prevent the same incident from happening again**

    We will refine the GKE cluster documentation, complete the migration to the new dedicated Docker Hub user, and move the token to our internal secret renewal automation tool.
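The root cause in the postmortem was a credential rotated without first checking which workloads still depended on it. As an illustration only, and not the vendor's tooling, the sketch below shows how such a pre-rotation check could look in Python. It assumes the input is JSON shaped like the output of `kubectl get pods --all-namespaces -o json`, and it only inspects each pod's own `spec.imagePullSecrets` and `status.containerStatuses` fields. The secret names, pod names, and namespace in the sample are hypothetical.

```python
import json
from typing import Any

# Waiting reasons reported while a container image cannot be pulled.
PULL_FAILURE_REASONS = {"ImagePullBackOff", "ErrImagePull"}


def _pod_id(pod: dict[str, Any]) -> str:
    """Format a pod as 'namespace/name'."""
    meta = pod.get("metadata", {})
    return f"{meta.get('namespace', 'default')}/{meta.get('name', '')}"


def pods_using_pull_secret(pod_list: dict[str, Any], secret_name: str) -> list[str]:
    """List pods whose spec.imagePullSecrets references secret_name."""
    return [
        _pod_id(pod)
        for pod in pod_list.get("items", [])
        if any(
            ref.get("name") == secret_name
            for ref in pod.get("spec", {}).get("imagePullSecrets") or []
        )
    ]


def pods_failing_image_pull(pod_list: dict[str, Any]) -> list[str]:
    """List pods with at least one container waiting on an image pull failure."""
    failing = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for status in statuses:
            waiting = (status.get("state") or {}).get("waiting") or {}
            if waiting.get("reason") in PULL_FAILURE_REASONS:
                failing.append(_pod_id(pod))
                break  # one failing container is enough to flag the pod
    return failing


# Hypothetical sample data; all names below are invented for illustration.
SAMPLE = json.loads("""
{
  "items": [
    {
      "metadata": {"namespace": "website", "name": "docs-0"},
      "spec": {"imagePullSecrets": [{"name": "dockerhub-old-user"}]},
      "status": {"containerStatuses": [
        {"state": {"waiting": {"reason": "ImagePullBackOff"}}}
      ]}
    },
    {
      "metadata": {"namespace": "website", "name": "blog-0"},
      "spec": {"imagePullSecrets": [{"name": "dockerhub-new-user"}]},
      "status": {"containerStatuses": [{"state": {"running": {}}}]}
    }
  ]
}
""")

if __name__ == "__main__":
    print(pods_using_pull_secret(SAMPLE, "dockerhub-old-user"))
    print(pods_failing_image_pull(SAMPLE))
```

Running the first function before rotating a credential would show whether any pod still references the secret. The second gives a quick view of pods already stuck on an image pull after a reschedule.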