Hosted Mender incident

Emails not sent

Minor, Resolved

Hosted Mender experienced a minor incident on February 3, 2025 affecting Hosted Mender US and Hosted Mender EU, lasting 22h 29m. The incident has been resolved; the full update timeline is below.

Started
Feb 03, 2025, 06:57 PM UTC
Resolved
Feb 04, 2025, 05:27 PM UTC
Duration
22h 29m
Detected by Pingoru
Feb 03, 2025, 06:57 PM UTC

Affected components

Hosted Mender US, Hosted Mender EU

Update timeline

  1. investigating Feb 03, 2025, 06:57 PM UTC

    We are aware that emails are not being received by customers. We're currently investigating the issue.

  2. identified Feb 03, 2025, 07:16 PM UTC

    The issue has been identified. We're working on a possible solution.

  3. identified Feb 03, 2025, 09:29 PM UTC

    We are continuing to work on a fix. We just introduced higher rate limits for devicemonitor and requested a quota increase from our email provider, to cope with a sudden burst of email requests.

  4. identified Feb 04, 2025, 10:43 AM UTC

    The fix has been applied: there is no new email sending burst at this time. Unfortunately, the email provider quota is still exceeded and we're waiting for an increase, so emails are still not working.

  5. monitoring Feb 04, 2025, 04:26 PM UTC

    Our email quota increase request was accepted and we're sending emails again. We're still monitoring the issue.

  6. resolved Feb 04, 2025, 05:27 PM UTC

    This incident has been resolved; new emails are being sent correctly.

  7. postmortem Apr 04, 2025, 07:33 PM UTC

    **Abstract**

    Some hosted Mender customers were using the monitoring add-on, and a bug in the client caused state flapping, which generated a massive number of emails sent through the email provider. This hit the provider's rate limit: hosted Mender could send 176,800 emails in about 3 hours before being rate limited by the email provider for 24h. This went on for 4 days (2 working days), starting from the night of the 29th of January (Wednesday), before we acknowledged the issue on Monday evening, the 3rd of February. As a result, neither the monitoring add-on emails nor the password reset emails were sent, so customers did not receive them.

    After acknowledging the issue, we promptly asked the email provider to raise the quota up to three times, and at the same time we contacted the hosted Mender customers who were using the monitoring add-on and asked them to temporarily mute the emails from the UI. As soon as the email quota was raised, emails started flowing again.

    **Incident timeline**

    * 2025-01-29 16:00 UTC: The email provider reported a high email volume.
    * 2025-01-29 19:00 UTC: The email provider quota was exhausted and no more emails from hosted Mender were sent.
    * 2025-01-30 18:00 UTC: The email quota was reset and hosted Mender started sending emails again.
    * 2025-01-30 21:00 UTC: The email provider quota was exhausted again and no more emails from hosted Mender were sent.
    * The same pattern continued through the weekend: emails were sent from 2025-01-31 18:00 UTC to 21:30 UTC, from 2025-02-01 19:00 UTC to 22:30 UTC, and from 2025-02-02 19:00 UTC to 23:00 UTC. Outside these windows, emails were not sent.
    * 2025-02-03 18:41 UTC: Customers reported emails not working and the SRE team was engaged.
    * 2025-02-03 18:57 UTC: The incident was opened.
    * 2025-02-03 20:57 UTC: We increased the internal rate limit for deviceauth API requests, to slow down the email flood.
    * 2025-02-03 21:50 UTC: We requested a service quota increase from Amazon SES.
    * 2025-02-03 22:00 UTC: We asked the affected customers to mute the email monitor from the UI.
    * 2025-02-04 15:19 UTC: The email provider increased the service quota; emails started flowing again.

    **Actions we have decided to take to prevent the same incident from happening again**

    * After a similar incident occurred in the past, we started developing an automated email synthetic test; we have now finished this development and activated the new alert, to catch email problems in a timely manner.
    * We'll split the SMTP service into two: one dedicated to the monitoring add-on, and another for all the other important emails from hosted Mender.
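
The reasoning behind splitting the SMTP service can be illustrated with a minimal sketch. It is not the hosted Mender implementation: the `QuotaBudget` and `EmailRouter` classes, the category names, the limits and the fixed 24h window are all assumptions chosen for illustration. It only shows that when each email category draws on its own quota, a flood of monitoring alerts cannot block password reset emails.

```python
from dataclasses import dataclass, field


@dataclass
class QuotaBudget:
    """Fixed-window send quota, e.g. N emails per 24 hours (hypothetical model)."""
    limit: int
    window_seconds: float
    window_start: float = 0.0
    used: int = 0

    def try_consume(self, now: float) -> bool:
        # Start a fresh window once the current one has fully elapsed.
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


@dataclass
class EmailRouter:
    """Routes each email category to its own, independent quota budget."""
    budgets: dict[str, QuotaBudget]
    sent: list[tuple[str, str]] = field(default_factory=list)

    def send(self, category: str, recipient: str, now: float) -> bool:
        budget = self.budgets[category]
        if not budget.try_consume(now):
            return False  # not sent: this category's quota is exhausted
        self.sent.append((category, recipient))
        return True


DAY = 24 * 60 * 60
router = EmailRouter(budgets={
    # Illustrative limits only; real provider quotas are far larger.
    "monitoring": QuotaBudget(limit=5, window_seconds=DAY),
    "transactional": QuotaBudget(limit=5, window_seconds=DAY),
})

# A flapping device floods the monitoring channel and exhausts its quota.
flood_results = [
    router.send("monitoring", "ops@example.com", now=float(i)) for i in range(8)
]

# A password reset still goes out because it draws on a separate budget.
reset_ok = router.send("transactional", "user@example.com", now=10.0)
```

With a single shared budget, the password reset in this sketch would have been rejected along with the excess monitoring alerts, which mirrors what customers experienced during the incident.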