xMatters incident

Service disruption in North American Region

Major Resolved

xMatters experienced a major incident on November 7, 2019, lasting 56 minutes and affecting the Web Interface, Email Notifications, and the other components listed under "Affected components". The incident has been resolved; the full update timeline is below.

Started
Nov 07, 2019, 02:19 PM UTC
Resolved
Nov 07, 2019, 03:16 PM UTC
Duration
56m
Detected by Pingoru
Nov 07, 2019, 02:19 PM UTC

Affected components

Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App

Update timeline

  1. investigating Nov 07, 2019, 02:19 PM UTC

    At approximately 5:56am PST, our internal monitoring detected an issue affecting some customers in North America. Some customers may be unable to log into their xMatters instances. We are currently investigating this issue.

  2. investigating Nov 07, 2019, 02:50 PM UTC

    We are continuing to investigate this issue.

  3. investigating Nov 07, 2019, 03:12 PM UTC

    We are continuing to investigate this issue.

  4. resolved Nov 07, 2019, 03:16 PM UTC

    At approximately 5:56am PST, we experienced an issue with xMatters that prevented users in North America from accessing the web user interface. Services were restored at approximately 7:12am PST.

  5. postmortem Nov 08, 2019, 09:35 PM UTC

    ## Details

    ### What happened?

    On Thursday, November 7, 2019, at approximately 5:20 AM PST, the xMatters network monitoring systems alerted the Customer Support teams to an issue with the On-Demand services within North America. Some users may have experienced intermittent access to the xMatters On-Demand web user interface, and a delay or rejection when injecting events into xMatters.

    ### Why did it happen?

    This incident was caused by a single database within one of the database clusters consuming a disproportionate amount of resources. This limited the ability of other databases in the cluster to accept new requests, resulting in intermittent access to the web user interface.

    ### How did we respond?

    As soon as the internal monitoring systems alerted to an issue with customer instances, Customer Support confirmed the issue and launched the internal major incident management process. The incident response teams immediately began their investigation and identified a database cluster that was consuming processing resources at an exceptionally high rate. The teams determined that the issue was confined to a specific database in the cluster that was causing latency and preventing other resources from serving their requests.

    The teams concluded that the best way to remedy the issue quickly was to promote a standby database cluster to become the new primary. The recovery process and redundant service architecture restored services, and system performance resumed normal operations.

    ### What are we doing to prevent it from happening again?

    To prevent this issue from reoccurring, the Engineering teams will be taking the following steps:

    1. Resize the database cluster to accommodate potential usage spikes and to increase tolerance for similar issues. (Completed)
    2. Rebalance the database cluster to increase bandwidth for all impacted customers. (Scheduled for completion on or before November 14, 2019)
    3. Increase monitoring thresholds to identify spikes in usage during peak periods. (Completed)

    xMatters strives to provide high availability to our clients and we recognize that reliability of services is of utmost importance to our customers and their businesses. xMatters is committed to improving our resiliency and investing in the tools and processes required to prevent and minimize service disruptions.

    ### Timeline:

    | **Date/Time (PST)** | **Description** |
    | --- | --- |
    | 2019-11-07 05:20 AM | xMatters monitoring tools alert Customer Support to intermittent access to some client instances in North America. |
    | 05:45 AM | Severity-1 issue raised; internal major incident management process initiated. |
    | 06:19 AM | Bulletin posted to xMatters status page: [https://status.xmatters.com/incidents/xrq45x6g0zpp](https://status.xmatters.com/incidents/xrq45x6g0zpp) |
    | 06:43 AM | Incident team identifies issue as related to a database within the cluster. |
    | 07:00 AM | Promotion of secondary database cluster begins. |
    | 07:09 AM | All services are restored. |

    If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
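
The root cause described in the postmortem, a single database taking a disproportionate share of a cluster's resources, can be illustrated with a small check. This is only a sketch under stated assumptions: the function name `find_noisy_databases`, the 50% share threshold, and the sample figures are hypothetical and do not come from xMatters or its monitoring tooling.

```python
from typing import Dict, List


def find_noisy_databases(cpu_by_db: Dict[str, float], max_share: float = 0.5) -> List[str]:
    """Return the databases whose share of total cluster CPU exceeds max_share.

    cpu_by_db maps a database name to the CPU time it used in one
    monitoring window. The 0.5 default is an illustrative assumption.
    """
    total = sum(cpu_by_db.values())
    if total <= 0:
        # No measured usage, so no database can be flagged.
        return []
    return sorted(name for name, cpu in cpu_by_db.items() if cpu / total > max_share)


# Hypothetical sample: CPU-seconds per database over one monitoring window.
sample = {"db_a": 12.0, "db_b": 9.5, "db_c": 140.0, "db_d": 11.0}
print(find_noisy_databases(sample))
```

A share-based check like this flags a database relative to its neighbors rather than against an absolute limit, which is one way to notice the kind of imbalance the postmortem describes even when the cluster as a whole is not yet saturated.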