xMatters incident

Issue Discovered - Service disruption in North America

Major Resolved

xMatters experienced a major incident on April 6, 2019, affecting Web Interface, Email Notifications, and six other components (listed below), lasting 1h 5m. The incident has been resolved; the full update timeline is below.

Started
Apr 06, 2019, 01:03 PM UTC
Resolved
Apr 06, 2019, 02:09 PM UTC
Duration
1h 5m
Detected by Pingoru
Apr 06, 2019, 01:03 PM UTC

Affected components

Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App

Update timeline

  1. investigating Apr 06, 2019, 01:03 PM UTC

    The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America. We are currently investigating the issue, and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.

  2. identified Apr 06, 2019, 01:17 PM UTC

    The xMatters Incident Response team has identified the source of the issue and is working on a fix. Customers may receive an error when trying to access the system. The error is intermittent. We will update once a solution has been identified and implemented.

  3. monitoring Apr 06, 2019, 01:46 PM UTC

    The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.

  4. resolved Apr 06, 2019, 02:09 PM UTC

    The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.

  5. postmortem Apr 12, 2019, 10:57 PM UTC

    ## What happened?

    On April 6, 2019, at approximately 4:37 AM PDT, the xMatters monitoring systems alerted the Engineering teams to a service disruption with On-Demand services within the North American region. Users may have experienced intermittent access to the user interface, and a delay or rejection when injecting an event into xMatters.

    ## Why did it happen?

    This issue was caused by excessive memory consumption by a monitoring service. The monitoring service was buffering metrics for reporting and consumed an excessive amount of memory, causing some database queries to fail.

    ## How did we respond?

    As soon as the xMatters network monitoring tools detected unreliable connectivity in the xMatters system, the Client Assistance team launched the internal severity-1 investigation process, which was later upgraded to a major incident, and posted a notice to the xMatters status page. The incident response teams began simultaneously investigating the underlying cause and working to restore services for clients.

    The teams determined that the fastest way to restore service and cause the least impact to clients would be to perform a manual database failover to a system not experiencing resource exhaustion. Once the promotion process was complete, clients confirmed that all services were restored and functioning as expected.

    ## What are we doing to prevent it from happening again?

    To help prevent similar incidents in the future, the xMatters Engineering teams are investigating a potential way to improve their current method of resource monitoring. Any knowledge or information they identify will be added to the relevant playbooks to ensure that it becomes a consistent part of our standard processes. In addition, Engineering teams are working with the service vendor to review the issue and determine what additional actions can be taken to ensure the issue does not reoccur.

    ## Timeline: April 6, 2019

    * 4:37 AM - First notification of potential issue with On-Demand services. No client impact at this time
    * 4:47 AM - Investigation begins
    * 5:37 AM - Severity-1 process launched. Issue becomes client impacting
    * 6:20 AM - Cause is identified. Manual database failover performed
    * 6:30 AM - Monitoring service responsible is disabled
    * 6:33 AM - Client impact is mitigated. Teams continue to monitor
    * 6:37 AM - Confirmation of system recovery
    * 6:47 AM - All services restored.

    If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
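
The postmortem attributes the outage to a monitoring service that buffered metrics until it exhausted memory. One common safeguard against that failure mode is to cap the buffer and count what gets discarded. The sketch below illustrates the idea only; it is not xMatters' or the vendor's implementation, and the class name `BoundedMetricBuffer`, its methods, and the `max_items` parameter are all hypothetical.

```python
from collections import deque


class BoundedMetricBuffer:
    """Hypothetical metrics buffer that never holds more than max_items entries."""

    def __init__(self, max_items):
        if max_items <= 0:
            raise ValueError("max_items must be positive")
        # A deque with maxlen silently evicts the oldest entry when full,
        # so memory use is bounded by max_items.
        self._items = deque(maxlen=max_items)
        self.dropped = 0  # how many metrics were evicted before being flushed

    def add(self, metric):
        # Count the eviction that the append below is about to cause.
        if len(self._items) == self._items.maxlen:
            self.dropped += 1
        self._items.append(metric)

    def flush(self):
        # Hand the buffered metrics to the caller (e.g. a reporter) and reset.
        batch = list(self._items)
        self._items.clear()
        return batch

    def __len__(self):
        return len(self._items)
```

With a cap in place, a stalled reporting path costs the oldest metrics, not the host's memory, and the `dropped` counter gives operators a signal that the cap is being hit.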