Omnilert experienced a major incident on March 25, 2023 affecting CAP and Desktop Alert Services and 1 more component, lasting 30m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Mar 25, 2023, 12:36 PM UTC
Omnilert engineers are investigating intermittent issues with message transmission (all endpoints). Our team is investigating this issue with high priority. Impact on Services: Messages sent via Omnilert may be delayed or not delivered.
- investigating Mar 25, 2023, 12:37 PM UTC
We are continuing to investigate this issue.
- investigating Mar 25, 2023, 12:53 PM UTC
We are no longer seeing issues with message transmission. Omnilert engineers have taken steps to mitigate delivery issues while they continue to investigate further.
- monitoring Mar 25, 2023, 01:05 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Mar 25, 2023, 01:07 PM UTC
This incident has been resolved.
- postmortem Mar 27, 2023, 07:12 PM UTC
### DESCRIPTION:

In the overnight and early-morning hours of March 25, 2023, Omnilert’s systems experienced issues with the transmission of alerts on all channels (SMS, Email). Customers sending alerts experienced delayed delivery across all endpoints.

### ROOT CAUSE:

Omnilert’s engineers investigated the cause with the highest priority. They determined that a logging issue prevented files from being written, which in turn prevented the delivery services from running. Omnilert’s system status warning and recovery automation did not detect this specific kind of escalating issue, which led to the service outage and the delivery delays experienced by recipients.

### STEPS TAKEN:

Once the root cause was discovered, Omnilert engineers corrected the logging issue and restarted the systems. This alleviated the immediate problem and returned all Omnilert services to normal functionality.

This incident is being studied to prevent a recurrence and to further harden Omnilert’s systems against problems of this nature. Omnilert’s team is taking the following steps:

* Creating additional system alarms that warn when any logging system is reaching capacity, for any reason. This should prevent a recurrence of this issue.
* Examining better ways of assigning technical resources so that off-hours issues of this nature are handled more quickly.
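
The first mitigation step, an alarm on logging capacity, can be illustrated with a minimal Python sketch. It is not Omnilert’s implementation: the function names, the thresholds (80% and 95%), and the idea of checking a single log volume by path are all assumptions made for illustration. It only uses the standard library call `shutil.disk_usage`.

```python
import shutil
from typing import NamedTuple, Optional


class CapacityAlarm(NamedTuple):
    """Hypothetical alarm record; not part of any Omnilert system."""
    path: str
    used_fraction: float
    level: str  # "warning" or "critical"


def classify_usage(used_fraction: float,
                   warn_at: float = 0.80,
                   critical_at: float = 0.95) -> Optional[str]:
    """Map a used-space fraction to an alarm level, or None if healthy.

    The default thresholds are illustrative assumptions.
    """
    if used_fraction >= critical_at:
        return "critical"
    if used_fraction >= warn_at:
        return "warning"
    return None


def check_log_volume(path: str,
                     warn_at: float = 0.80,
                     critical_at: float = 0.95) -> Optional[CapacityAlarm]:
    """Check the filesystem that holds `path` and return an alarm if it is filling up."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return None  # nothing meaningful to report for an empty filesystem
    used_fraction = usage.used / usage.total
    level = classify_usage(used_fraction, warn_at, critical_at)
    if level is None:
        return None
    return CapacityAlarm(path, used_fraction, level)


if __name__ == "__main__":
    # "." stands in for a log directory; a real deployment would pass its own path.
    alarm = check_log_volume(".")
    if alarm is None:
        print("log volume within capacity thresholds")
    else:
        print(f"{alarm.level}: {alarm.path} is {alarm.used_fraction:.0%} full")
```

A check like this would typically run on a schedule and page the on-call engineer at the warning level, leaving time to act before writes start failing.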