Splunk OnCall incident
Splunk On-Call Service Disruption: Outbound Notification Delays
Splunk On-Call experienced a major incident on February 26, 2021, affecting Notifications - SMS, Notifications - Google Push, and 1 more component, and lasting 2h 4m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Feb 26, 2021, 02:33 PM UTC
We are actively investigating a service-affecting issue with the message delivery and alerting associated with our platform. We are approaching this issue with the highest level of urgency, and the appropriate parties are actively engaged in troubleshooting it. Updates to follow as soon as they are available. If you have any immediate questions, please contact the Splunk On-Call Support team: [email protected]
- investigating Feb 26, 2021, 02:33 PM UTC
We are continuing to investigate this issue.
- investigating Feb 26, 2021, 02:46 PM UTC
Update: The service-affecting issues that we are actively investigating appear to be associated with a service disruption being experienced by one of our communications partners, Twilio: https://status.twilio.com/. We are continuing to pursue the investigation and resolution of this issue with the highest level of urgency. Updates to follow.
- investigating Feb 26, 2021, 02:55 PM UTC
Please be advised that message delivery and alerting associated with the Splunk On-Call platform are significantly delayed as a result of this issue. As we continue to investigate and work toward resolving this issue, we recommend that you directly check any monitoring solutions you have integrated with Splunk On-Call to better determine the health and status of your associated systems, applications, and platforms. Updates to follow.
- investigating Feb 26, 2021, 03:30 PM UTC
We are continuing to investigate this issue.
- investigating Feb 26, 2021, 03:38 PM UTC
As we continue to investigate this issue, we recommend using Push notifications in your Splunk On-Call personal paging policy settings. For guidance, see: https://help.victorops.com/knowledge-base/paging-policy-setup/. The service disruption we are continuing to troubleshoot is primarily affecting Phone and SMS messaging. Updates to follow.
- investigating Feb 26, 2021, 03:56 PM UTC
We are continuing to investigate this issue.
- investigating Feb 26, 2021, 03:58 PM UTC
SMS message delivery has improved significantly and is at or near full functionality. Phone messaging is still delayed. We are continuing to investigate the issue. Updates to follow.
- investigating Feb 26, 2021, 04:34 PM UTC
We are continuing to investigate this issue.
- resolved Feb 26, 2021, 04:38 PM UTC
The issue has been resolved. All message delivery types (Phone, SMS, Push, and Email) and associated notifications are fully functional. We will be providing additional updates on our status page upon the completion of our internal review. If you have any immediate questions, please reach out to the Splunk On-Call Support team: [email protected] We sincerely apologize for any unintended inconvenience this issue may have caused.
- postmortem Feb 26, 2021, 10:53 PM UTC
**Basic Timeline & Incident Overview**: Starting at approximately 6:18 AM (Mountain) on the morning of 02.26.21, the _Splunk On-Call_ (SpOC) platform began experiencing delays in the outbound delivery of notifications. These delays were directly related to a critical service disruption at one of our primary telecommunications providers. Although the service disruption primarily affected outbound Phone and SMS notifications, it also temporarily delayed our delivery of Push notifications.

At approximately 7:40 AM (Mountain), actions taken by the SpOC Engineering team alleviated the Push notification delays and returned Push notification delivery to standard operational efficiency. The proper delivery of SMS notifications resumed at approximately 8:30 AM (Mountain). Phone notifications were queued and delivered with a delay of roughly 40 minutes until approximately 9:30 AM (Mountain).

The overall incident timeline ran from approximately 6:18 AM (Mountain) to 9:30 AM (Mountain) on the morning of 02.26.21. After internal testing and direct communication with the telecommunications provider in question, we deemed the incident resolved at 9:30 AM (Mountain) on the morning of 02.26.21.

. . .

As with all such incidents, the appropriate _Splunk On-Call_ (SpOC) teams will conduct a collaborative and intensive Post Incident Review (PIR) process aimed at both preventative measures and improved responsiveness in addressing any future issues. If you have any immediate questions or concerns, please contact the _Splunk On-Call_ (SpOC) Support Team at: **[email protected]**

Once again, we sincerely apologize for any unintended inconvenience this incident may have caused.