OpsGenie incident

Delays in notification service

Critical · Resolved

OpsGenie experienced a critical incident on September 14, 2022 affecting Email, SMS, Voice, and Mobile Notification Delivery, lasting 52m. The incident has been resolved; the full update timeline is below.

Started
Sep 14, 2022, 04:00 PM UTC
Resolved
Sep 14, 2022, 04:53 PM UTC
Duration
52m
Detected by Pingoru
Sep 14, 2022, 04:00 PM UTC

Affected components

Email Notification Delivery
SMS Notification Delivery
Voice Notification Delivery
Mobile Notification Delivery

Update timeline

  1. investigating Sep 14, 2022, 04:59 PM UTC

    We are seeing delays with outbound notifications. We have identified the cause and are currently working on mitigation of this issue.

  2. identified Sep 14, 2022, 05:01 PM UTC

    We have identified the problem and are working on it. We expect the notification service to return to its normal state shortly.

  3. monitoring Sep 14, 2022, 05:07 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Sep 14, 2022, 05:13 PM UTC

    This incident has been resolved.

  5. postmortem Sep 21, 2022, 08:37 AM UTC

    ### **SUMMARY**

    On Sep 14, 2022, between 03:36 PM and 04:26 PM UTC, Atlassian customers using the Opsgenie product received notifications delayed by up to 50 minutes. The event was triggered by a code change that upgraded a common framework. The changes included in this framework update impacted customers in both the US and EU regions. The incident was detected by the on-call developer and mitigated by reverting the latest changes, which put Opsgenie systems into a known good state. The total time to resolution was around 50 minutes.

    ### **IMPACT**

    The overall impact on Opsgenie products was between Sep 14, 2022, 03:36 PM UTC and Sep 14, 2022, 04:26 PM UTC. The service disruption was limited to customers in the US and EU regions, who did not receive their notifications immediately but instead experienced delays of up to 50 minutes. In total, ~132K notifications in the US region and ~23.6K notifications in the EU region were sent with delays. Fewer than 0.6% of active customers were affected.

    ### **ROOT CAUSE**

    The issue was caused by an Atlassian-initiated change to upgrade a common framework. While the majority of the intended changes had been tested successfully, some changes that accompanied the framework upgrade caused the notification service to stop processing new notification requests. These notifications remained in the queues until the deployment was reverted, resulting in notification delays of up to 50 minutes for customers.

    ### **REMEDIAL ACTIONS PLAN & NEXT STEPS**

    We know that outages impact your productivity. We are prioritizing the following improvement actions to avoid repeating this type of incident:

    * We are improving the testing and deployment processes we follow after framework updates.
    * We are implementing new monitoring to reduce the detection and response time even further.
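The failure mode described above (a consumer that stops processing while requests keep accumulating in the queues) is the kind of condition that monitoring can flag from queue metrics alone. The following is a minimal illustrative sketch, not Atlassian's implementation: the `QueueSample` shape, the `stalled_since` helper, and the idea of sampling a cumulative processed counter are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class QueueSample:
    """One observation of a notification queue (hypothetical shape)."""
    taken_at: datetime
    depth: int            # messages currently waiting in the queue
    processed_total: int  # cumulative count of messages consumed so far


def stalled_since(samples: Sequence[QueueSample],
                  threshold: timedelta) -> Optional[datetime]:
    """Return when the current stall began, or None if no stall is detected.

    A stall here means: the queue is non-empty and processed_total has not
    advanced for at least `threshold`. Samples must be in chronological order.
    """
    if not samples:
        return None
    latest = samples[-1]
    if latest.depth == 0:
        return None  # nothing is waiting, so nothing is being delayed

    stall_start = latest.taken_at
    # Walk backwards while the consumer made no progress on a non-empty queue.
    for sample in reversed(samples[:-1]):
        if sample.processed_total != latest.processed_total or sample.depth == 0:
            break
        stall_start = sample.taken_at

    if latest.taken_at - stall_start >= threshold:
        return stall_start
    return None
```

A check like this could run on each new sample and page the on-call engineer when it returns a timestamp. The threshold trades detection speed against false alarms during normal quiet periods, and the right value depends on traffic patterns the postmortem does not describe.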
    We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

    Thanks,
    Atlassian Customer Support