xMatters incident

Issue Discovered - Service disruption in All Regions – Multiple Services

Minor · Resolved

xMatters experienced a minor incident on November 16, 2021 affecting Web Interface, Email Notifications, SMS Notifications, and Voice Notifications, lasting 46m. The incident has been resolved; the full update timeline is below.

Started
Nov 16, 2021, 05:55 PM UTC
Resolved
Nov 16, 2021, 06:42 PM UTC
Duration
46m
Detected by Pingoru
Nov 16, 2021, 05:55 PM UTC

Affected components

Web Interface, Email Notifications, SMS Notifications, Voice Notifications

Update timeline

  1. investigating Nov 16, 2021, 05:55 PM UTC

    xMatters monitoring tools have identified a potential issue with xMatters On-Demand for clients in All Regions. We are currently investigating the issue and will update as information becomes available. Please see incident details for specific services impacted. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support team is waiting to help.

  2. identified Nov 16, 2021, 06:08 PM UTC

    We are currently tracking a problem with our cloud provider and are working directly with them to resolve the issue. We will provide updates as soon as we know more. This outage is impacting multiple services across the internet.

  3. monitoring Nov 16, 2021, 06:12 PM UTC

    The xMatters Incident Response team is seeing some instances recovering. xMatters engineering is monitoring the situation to ensure the system is stable and that all services are restored.

  4. monitoring Nov 16, 2021, 06:20 PM UTC

    We are continuing to monitor for any further issues.

  5. monitoring Nov 16, 2021, 06:27 PM UTC

    We are seeing traffic to all xMatters instances and continue to monitor. Some instances may experience increased latency.

  6. resolved Nov 16, 2021, 06:42 PM UTC

    The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.

  7. postmortem Nov 23, 2021, 04:34 PM UTC

    ### What happened?

    On November 16, 2021, at approximately 09:40 AM PT, xMatters monitoring tools alerted technical teams of Google 404 errors from xMatters instances across all regions. For the duration of the incident, users were unable to access the web user interface, incoming signals were not processing, and notifications were not being generated.

    ### Why did it happen?

    xMatters uses Google Cloud Load Balancing (GCLB) services, which were not operational during the outage and resulted in the errors seen by customers. Based on the RCA provided by Google:

    "Google Cloud Networking experienced issues with Google Cloud Load Balancing (GCLB) service resulting in impact to several downstream Google Cloud services. Impacted customers observed Google 404 errors on their websites. From preliminary analysis, the root cause of the issue was a latent bug in a network configuration service which was triggered during routine system operation."

    See [https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh](https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh) for the complete report from Google.

    ### How did we respond?

    After receiving alert notifications from the xMatters monitoring tools, xMatters Customer Support and the operations team initiated a Severity-1 incident. The incident team quickly identified the issue as related to an incident within the Google Cloud Platform, which impacted a wide range of SaaS operators worldwide hosted by Google.

    xMatters Customer Support began communicating with customers by updating [https://status.xmatters.com/incidents/rtl4qyz4nj3m](https://status.xmatters.com/incidents/rtl4qyz4nj3m) with detailed, real-time information. xMatters initiated a dialog with Google to gather updates on resolution progress. The incident team remained engaged until Google resolved the incident to ensure that xMatters recovered smoothly once services were restored.

    There was no intervention required after Google resolved the issue, but some customers may have experienced slow loading times until all Google networking components fully recovered.

    ### What are we doing to prevent it from happening again?

    xMatters is committed to providing redundancy and high availability to all customers. Our architecture allows for multiple regional and international failover scenarios, including regionally redundant databases and international traffic rerouting. A worldwide service provider failure is difficult to account for and generally unprecedented. Based on this incident, we are reviewing feasibility options for cloud vendor redundancy; however, there is no imminent action plan for this type of incident.

    ### Timeline: November 16, 2021

    - 09:43 PT – xMatters monitoring tools alert teams to Google 404 failures; teams initiate Severity-1 incident
    - 09:50 PT – Verification of incident external to xMatters
    - 09:55 PT – xMatters status page posted
    - 10:07 PT – xMatters instances begin to recover
    - 10:09 PT – Google declares incident mitigated
    - 10:42 PT – xMatters declares incident closed

    If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
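The postmortem timeline is given in Pacific Time, while the status updates above are stamped in UTC. The following is a minimal sketch for lining the two up. It assumes that "PT" on November 16, 2021 means Pacific Standard Time (UTC-8); the function names and the shortened event labels are illustrative and not part of the xMatters report.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "PT" on November 16, 2021 is Pacific Standard Time (UTC-8).
PST = timezone(timedelta(hours=-8))

# Times copied from the postmortem timeline; labels are shortened paraphrases.
POSTMORTEM_TIMELINE_PT = [
    ("09:43", "Monitoring alerts on Google 404 failures; Severity-1 initiated"),
    ("09:50", "Incident verified as external to xMatters"),
    ("09:55", "xMatters status page posted"),
    ("10:07", "xMatters instances begin to recover"),
    ("10:09", "Google declares incident mitigated"),
    ("10:42", "xMatters declares incident closed"),
]


def pt_to_utc(hhmm: str) -> datetime:
    """Convert an HH:MM Pacific time on November 16, 2021 to a UTC datetime."""
    hour, minute = map(int, hhmm.split(":"))
    local = datetime(2021, 11, 16, hour, minute, tzinfo=PST)
    return local.astimezone(timezone.utc)


def timeline_in_utc() -> list[tuple[str, str]]:
    """Return the postmortem timeline with each time rendered as HH:MM UTC."""
    return [
        (pt_to_utc(hhmm).strftime("%H:%M UTC"), label)
        for hhmm, label in POSTMORTEM_TIMELINE_PT
    ]


if __name__ == "__main__":
    for stamp, label in timeline_in_utc():
        print(f"{stamp}  {label}")
```

Under that assumption, 09:55 PT corresponds to 17:55 UTC (the first "investigating" update) and 10:42 PT corresponds to 18:42 UTC (the "resolved" update).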