xMatters incident

Issue Discovered - Service disruption in North America Region – Web User Interface

Notice Resolved

xMatters experienced a notice-level incident on October 22, 2020 affecting the Web Interface component and lasting 1h 30m. The incident has been resolved; the full update timeline is below.

Started
Oct 22, 2020, 08:11 AM UTC
Resolved
Oct 22, 2020, 09:42 AM UTC
Duration
1h 30m
Detected by Pingoru
Oct 22, 2020, 08:11 AM UTC

Affected components

Web Interface

Update timeline

  1. investigating Oct 22, 2020, 08:11 AM UTC

    xMatters monitoring tools have identified a potential issue with the xMatters Web User Interface for some clients in All Regions. We are currently investigating the issue and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.

  2. investigating Oct 22, 2020, 08:16 AM UTC

    We are continuing to investigate this issue.

  3. identified Oct 22, 2020, 08:25 AM UTC

    The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.

  4. monitoring Oct 22, 2020, 08:50 AM UTC

    Customers may continue to experience some slowness as the incident team continues to implement the fixes for this issue. Events will continue to process; however, some customers may experience delays. We are currently monitoring the situation to ensure the implementation is stable and services are restored.

  5. resolved Oct 22, 2020, 09:42 AM UTC

    The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.

  6. postmortem Nov 04, 2020, 05:07 PM UTC

    ### What happened?

    On October 22, 2020, at approximately 9:45 AM Pacific, internal monitoring tools alerted xMatters Customer Support to an issue impacting xMatters database storage services. During the incident, some customers reported not being able to access the xMatters user interface. This impacted some customers in North America for approximately 20 minutes; events processed normally and notifications were not affected.

    ### Why did it happen?

    The investigation revealed a loss of network connectivity between two xMatters components, specifically the xMatters API service and the analytics database, which led to the inability to service login requests. These connectivity issues led to a failure of the xMatters API to reconnect with the database. This loss of connectivity to the analytics database had a cascading effect that impacted the querying of a small subset of customer databases and access to the xMatters web user interface.

    The incident investigation determined that the xMatters API was able to create connections to the database but was unable to complete some queries. This condition resulted in a backlog of connection requests which eventually impacted the xMatters web user interface.

    ### How did we respond?

    xMatters engineering restarted the API service as part of the investigation into the cause of the errors. After the restart, xMatters Customer Support confirmed there was still an issue accessing the xMatters web user interface and initiated a Severity-1 incident. The incident response team gathered and promoted impacted instances to redundant architecture. Once that was complete, customers were able to log in to xMatters without issue. The connectivity errors cleared without xMatters intervention after the load was removed from the impacted services.

    ### What are we doing to prevent it from happening again?

    Once mitigated, the connection issue was resolved. It is expected that the issue is a one-time occurrence with a very low likelihood to reoccur; however, we are taking additional steps to improve the resiliency of the retry logic if a future connection failure occurs. Additional monitoring has been added to alert the team of similar conditions, which will allow for proactive measures to be taken before impacting customers.

    ### Timeline

    **Date & Time PDT**

    - **October 21, 2020 - 09:45** - Some customer instances begin reporting errors
    - **October 22, 2020 - 00:45** - Rolling restart of API Service
    - **October 22, 2020 - 00:50** - Login errors identified, Severity 1 Incident called
    - **October 22, 2020 - 00:53** - Impacted instances routed to redundant architecture
    - **October 22, 2020 - 01:06** - Impact mitigated
    - **October 22, 2020 - 01:27** - Incident verified as resolved

    If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
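
The postmortem mentions improving the resiliency of the retry logic but does not describe how. As a rough illustration only, the sketch below shows one common pattern for that kind of hardening: a bounded number of retries with capped exponential backoff and jitter, so that callers give up quickly instead of piling up behind a dependency that is not recovering. Every name here (`TransientDBError`, `call_with_retry`, and the parameter defaults) is hypothetical and is not taken from xMatters.

```python
import random
import time


class TransientDBError(Exception):
    """Hypothetical error standing in for a dropped or stalled database call."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep, rng=random.random):
    """Run operation(), retrying transient failures with capped, jittered backoff.

    Re-raises the last TransientDBError once max_attempts is exhausted, so the
    caller fails fast rather than queueing more work behind a broken dependency.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientDBError:
            if attempt == max_attempts:
                raise
            # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads retries out so clients do not reconnect in lockstep.
            sleep(ceiling * rng())
```

A caller would wrap each database call, for example `call_with_retry(lambda: run_query(sql))`, where `run_query` is again a placeholder. The `sleep` and `rng` parameters exist only so the delay behaviour can be exercised in tests without real waiting.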