Omnilert incident

Messaging issues

Major · Resolved

Omnilert experienced a major incident on February 16, 2021 affecting the Email, SMS, and Voice components, lasting 32 minutes. The incident has been resolved; the full update timeline is below.

Started
Feb 16, 2021, 07:57 PM UTC
Resolved
Feb 16, 2021, 08:29 PM UTC
Duration
32m
Detected by Pingoru
Feb 16, 2021, 07:57 PM UTC

Affected components

Email, SMS, Voice

Update timeline

  1. investigating Feb 16, 2021, 07:57 PM UTC

    Omnilert engineers are investigating intermittent issues with the transmission of messages. Impact on service: Omnilert will display "0 messages sent" in the timeline for various endpoints. We will update this status page as soon as additional details are available.

  2. monitoring Feb 16, 2021, 08:07 PM UTC

    A fix has been implemented. Any queued messages have now been processed. Our team is monitoring for any further issues but no further delay in message processing is expected.

  3. resolved Feb 16, 2021, 08:29 PM UTC

    This incident has been resolved.

  4. postmortem Feb 16, 2021, 08:30 PM UTC

    First off, we are sorry for the temporary outage in sending SMS / Email today. Our purpose is to reliably deliver your messages in a timely manner, and this issue impacted that goal. We will use this incident to grow and build an even stronger network for all. We've provided details below about the nature of this incident and how we've resolved the underlying problem.

    * **Incident Start Date/Time:** 02/16/2021 14:10 EST (approximate)
    * **Incident End Date/Time:** 02/16/2021 14:59 EST

    ### **Impact on Service:**

    Some messages sent during this incident would show "0 messages sent" in the Omnilert Timeline. These messages were then sent upon resolution of the problem.

    ### **What happened:**

    Omnilert's engineering team was upgrading part of the network as part of routine maintenance. No service impact was anticipated: in normal cases, the Omnilert network's built-in redundancy would pick up message processing during these routine events to prevent any service disruption. In this instance, an unforeseen race condition prevented the redundant servers from processing messages as intended. The result was messages not being processed out of the messaging queue, causing the delay experienced today.

    ### **What has been done to prevent recurrence:**

    We have identified the root cause and are putting additional alarms in place on both our primary and secondary networks that will alert us if this happens again. This should prevent messages from queuing for any significant amount of time should such a situation recur.
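
Omnilert does not describe how these alarms are implemented, so the following is a minimal illustrative sketch only, not Omnilert's actual tooling: a Python watchdog that pages an operator when a message queue stops draining, which is the failure mode described above. The `get_queue_stats` and `send_alert` hooks, along with the threshold values, are hypothetical placeholders.

```python
# Illustrative thresholds -- the vendor has not published its alarm
# configuration, so these numbers are assumptions, not Omnilert's values.
MAX_QUEUE_DEPTH = 100        # alert if more than 100 messages are waiting
MAX_OLDEST_AGE_SECONDS = 60  # alert if the oldest message has waited > 60 s


def check_message_queue(get_queue_stats, send_alert):
    """Alert when the queue stops draining (depth grows or messages age).

    get_queue_stats: hypothetical hook returning (depth, oldest_age_seconds).
    send_alert: hypothetical hook for the operator's pager or chat channel.
    """
    depth, oldest_age = get_queue_stats()
    if depth > MAX_QUEUE_DEPTH or oldest_age > MAX_OLDEST_AGE_SECONDS:
        send_alert(
            f"Message queue not draining: depth={depth}, "
            f"oldest message waiting {oldest_age:.0f}s"
        )


if __name__ == "__main__":
    # Stand-in stats source simulating a stalled queue, for demonstration.
    def stalled_queue_stats():
        return 250, 180.0

    check_message_queue(stalled_queue_stats, send_alert=print)
```

Running such a check on both the primary and secondary networks, as the postmortem describes, would catch the case where failover silently fails and neither side is consuming the queue.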