Hosted Mender incident

Issues with webhooks and AWS IoT Integration

Minor Resolved

Hosted Mender experienced a minor incident on October 31, 2025 affecting Hosted Mender US and Hosted Mender EU, lasting 2d 16h. The incident has been resolved; the full update timeline is below.

Started
Oct 31, 2025, 08:58 PM UTC
Resolved
Nov 03, 2025, 01:19 PM UTC
Duration
2d 16h
Detected by Pingoru
Oct 31, 2025, 08:58 PM UTC

Affected components

Hosted Mender US, Hosted Mender EU

Update timeline

  1. investigating Oct 31, 2025, 08:58 PM UTC

    We are receiving some reports of problems with webhooks not triggering for device provisioning. We are also receiving reports of problems with AWS IoT integrations.

  2. investigating Oct 31, 2025, 08:58 PM UTC

    Regarding the webhook issue, a possible workaround is to re-create the failing webhook. We're still investigating the root cause.

  3. identified Nov 03, 2025, 11:31 AM UTC

    The issue has been identified and a fix is being implemented.

  4. resolved Nov 03, 2025, 01:19 PM UTC

    We reverted a recent change that caused unexpected behavior in the encryption algorithm, and we re-encrypted the secrets for the IoT Manager integration. Normal operation is now restored.

  5. postmortem Nov 06, 2025, 08:42 PM UTC

    **Abstract**

    On Monday, 27th of October, we released Mender Server v4.1.0-saas.16 to hosted Mender US and EU. Among many other changes, this release indirectly changed the cipher mode used for client-side encryption of secrets in the IoT Manager database: it replaced the deprecated Cipher Feedback (CFB) mode with Counter (CTR) mode, as suggested by the [Golang documentation](https://pkg.go.dev/crypto/cipher#NewCFBDecrypter).

    On October 31st, at about 8PM, multiple tickets opened by customers alerted us to webhooks not working for the AWS IoT integration, and the on-call team opened an [incident](https://mender.statuspage.io/incidents/g019zy922897). On Monday, 3rd of November, the engineering team acknowledged the issue and found the root cause. After a brief discussion, we decided to roll back the IoT Manager service and re-encrypt the secrets of the affected customers with the old algorithm.

    Two customers, however, had already updated their configuration: they had recreated their webhook configuration after the Northern Tech team suggested this as a valid workaround, so the rollback disrupted their operation a second time.

    We are really sorry for the inconvenience, and we are working to fix the process around the IoT Manager integration.

    **Incident Timeline (UTC)**

    * 2025-10-27 12AM - Mender Server v4.1.0-saas.16 released on hosted Mender EU and US
    * 2025-10-31 8PM - The Customer Engineer team was alerted by multiple tickets regarding the failing IoT integration
    * 2025-10-31 8:58PM - This incident was opened
    * 2025-11-03 12AM - We reverted the IoT Manager version, decrypted the secrets with the new cipher and re-encrypted them with the old cipher, restoring operation
    * 2025-11-03 11AM - Mender Server v4.1.0-saas.17 was released to hosted Mender US and EU, including the revert commit that restores the old cipher

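
The cipher mode change described in the abstract can be illustrated with a minimal Go sketch. It is not the IoT Manager code: the key, IV, secret, and helper names (`encryptCFB`, `decryptCFB`, `cryptCTR`) are hypothetical, and it only assumes that a secret stored under AES-CFB is later read back with AES-CTR using the same key and IV.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"fmt"
)

// encryptCFB encrypts with AES in CFB mode (the deprecated mode).
func encryptCFB(key, iv, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	out := make([]byte, len(plaintext))
	cipher.NewCFBEncrypter(block, iv).XORKeyStream(out, plaintext)
	return out, nil
}

// decryptCFB reverses encryptCFB.
func decryptCFB(key, iv, ciphertext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	out := make([]byte, len(ciphertext))
	cipher.NewCFBDecrypter(block, iv).XORKeyStream(out, ciphertext)
	return out, nil
}

// cryptCTR applies AES-CTR; encryption and decryption are the same operation.
func cryptCTR(key, iv, data []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	out := make([]byte, len(data))
	cipher.NewCTR(block, iv).XORKeyStream(out, data)
	return out, nil
}

func main() {
	// Fixed demo key and IV for reproducibility only; never do this in production.
	key := bytes.Repeat([]byte{0x42}, 32)
	iv := bytes.Repeat([]byte{0x01}, aes.BlockSize)
	secret := []byte("hypothetical webhook secret, longer than one AES block")

	stored, err := encryptCFB(key, iv, secret) // written by the "old" release
	if err != nil {
		panic(err)
	}
	viaCFB, err := decryptCFB(key, iv, stored)
	if err != nil {
		panic(err)
	}
	viaCTR, err := cryptCTR(key, iv, stored) // read by the "new" release
	if err != nil {
		panic(err)
	}
	fmt.Println("decrypted with CFB matches:", bytes.Equal(viaCFB, secret)) // true
	fmt.Println("decrypted with CTR matches:", bytes.Equal(viaCTR, secret)) // false
}
```

Neither mode is authenticated, so the mismatched read returns no error, only wrong bytes. In this sketch the first 16 bytes even decrypt correctly, because both modes derive their first keystream block from the IV, which is one way such a mismatch might go unnoticed with short test values.
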
**What went wrong**

Multiple failures at multiple levels:

* we lack IoT Manager upgrade tests; for this specific issue, unit and integration tests didn’t catch the problem because they run on fresh data, encrypted with the new cipher;
* we lack Synthetic Tests on IoT Manager;
* we suggested a workaround as a first step to restore the situation, but the subsequent rollback to the previous version caused another disruption for some customers.

**Actions we decided to take to prevent this issue in the future**

* Improve the logging and monitoring around the IoT Manager service
  * Introduce a new error log when webhooks are misbehaving, and build metrics and alerts based on the new log to catch issues faster
* Introduce Synthetic tests to periodically assess the IoT Manager webhook functionality
* Improve error handling by recording unsuccessful attempts to send webhooks
* Record a timestamp on secret creation and update, to make the history of a secret easy to understand
* We still need to replace the outdated cipher; we will plan a non-disruptive update
* Introduce upgrade tests, to check that the IoT Manager service works with both the old and the new version.
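
One possible shape for the non-disruptive cipher update, the secret timestamps, and an upgrade check is sketched below in Go. Everything here is an assumption for illustration: the `StoredSecret` record, the `Mode` tag, and the `Encrypt`, `Decrypt`, and `Migrate` helpers are hypothetical and do not describe the actual IoT Manager schema or code. The idea is to record which cipher mode produced each secret, so that records written by the old version stay readable and can be re-encrypted gradually.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
	"time"
)

// Mode tags the cipher mode that produced a record (hypothetical field).
type Mode string

const (
	ModeCFB Mode = "aes-cfb" // legacy mode
	ModeCTR Mode = "aes-ctr" // replacement mode
)

// StoredSecret is a hypothetical database record with Unix timestamps.
type StoredSecret struct {
	Mode       Mode
	IV         []byte
	Ciphertext []byte
	CreatedAt  int64
	UpdatedAt  int64
}

// newStream builds the stream cipher matching the recorded mode.
func newStream(key, iv []byte, mode Mode, decrypt bool) (cipher.Stream, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	if len(iv) != aes.BlockSize {
		return nil, errors.New("IV must be one AES block long")
	}
	switch {
	case mode == ModeCFB && decrypt:
		return cipher.NewCFBDecrypter(block, iv), nil
	case mode == ModeCFB:
		return cipher.NewCFBEncrypter(block, iv), nil
	case mode == ModeCTR:
		return cipher.NewCTR(block, iv), nil
	}
	return nil, errors.New("unknown cipher mode: " + string(mode))
}

// Encrypt seals plaintext under mode with a fresh random IV.
func Encrypt(key, plaintext []byte, mode Mode, now int64) (StoredSecret, error) {
	iv := make([]byte, aes.BlockSize)
	if _, err := rand.Read(iv); err != nil {
		return StoredSecret{}, err
	}
	s, err := newStream(key, iv, mode, false)
	if err != nil {
		return StoredSecret{}, err
	}
	ct := make([]byte, len(plaintext))
	s.XORKeyStream(ct, plaintext)
	return StoredSecret{Mode: mode, IV: iv, Ciphertext: ct, CreatedAt: now, UpdatedAt: now}, nil
}

// Decrypt uses the mode stored with the record, not the current default.
func Decrypt(key []byte, rec StoredSecret) ([]byte, error) {
	s, err := newStream(key, rec.IV, rec.Mode, true)
	if err != nil {
		return nil, err
	}
	pt := make([]byte, len(rec.Ciphertext))
	s.XORKeyStream(pt, rec.Ciphertext)
	return pt, nil
}

// Migrate re-encrypts a record under ModeCTR, keeping CreatedAt.
func Migrate(key []byte, rec StoredSecret, now int64) (StoredSecret, error) {
	pt, err := Decrypt(key, rec)
	if err != nil {
		return StoredSecret{}, err
	}
	out, err := Encrypt(key, pt, ModeCTR, now)
	if err != nil {
		return StoredSecret{}, err
	}
	out.CreatedAt = rec.CreatedAt
	return out, nil
}

func main() {
	key := make([]byte, 32) // all-zero demo key; never use in production
	old, err := Encrypt(key, []byte("hypothetical integration secret"), ModeCFB, time.Now().Unix())
	if err != nil {
		panic(err)
	}
	migrated, err := Migrate(key, old, time.Now().Unix())
	if err != nil {
		panic(err)
	}
	pt, err := Decrypt(key, migrated)
	if err != nil {
		panic(err)
	}
	fmt.Println(migrated.Mode, string(pt))
}
```

An upgrade test in this spirit would load records written by the previous release (here, anything tagged `ModeCFB`) and assert that the new release can still decrypt them, rather than exercising only freshly encrypted data.
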