xMatters incident
Issue Discovered – Degraded performance in North American Region – Integration Platform
xMatters experienced a minor incident on October 10, 2025, affecting the Integration Platform, lasting 3d 17h. The incident has been resolved; the full update timeline is below.
Affected components
- Integration Platform (North America)
Update timeline
- Monitoring – Oct 10, 2025, 10:58 PM UTC
The xMatters monitoring tools have alerted Customer Support to a potential issue with the integration platform in the North America region. Our technical teams have identified intermittent request failures and occasional slowdowns, though the service appears to be responsive and recovering correctly. We are continuing to monitor the situation and mitigate where possible, though some users may encounter a 5xx response code when submitting a request (see the client-side retry sketch after this timeline). If you are experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
- Resolved – Oct 14, 2025, 03:58 PM UTC
The issue has been resolved, and all services are running as expected. Thank you for your patience while we addressed this matter.
- Postmortem – Oct 31, 2025, 06:48 PM UTC
**What happened?**
On October 10th, 2025, at approximately 4:14 PM Pacific, the xMatters internal monitoring tools alerted Customer Support to service degradation related to an issue already under internal monitoring with the Integration Platform in the North American region. While this issue was being investigated and mitigated, customers may have experienced intermittent request failures and occasional slowdowns.

**Why did it happen?**
The issue was caused by a security update that conflicted with the underlying runtime environment, specifically its memory management routine. Slow performance of the routine was causing a request processing service to fail frequent health checks, which triggered automatic restarts.

**How did we respond?**
As soon as the internal monitoring tools alerted Engineering to a potential issue, they began performing manual rolling restarts and deployed more forgiving liveness checks (sketched below) to avoid increasing error rates and to minimize any potential impact to customers. They also increased resources for the impacted service to improve the responsiveness of the underlying routine and deployed a configuration fix to the HTTP client cache to help stabilize the system. While mitigating the potential impact, they increased monitoring levels as they continued to investigate the root cause, and on October 13 they deployed an update to the service and environment configuration that resolved the issue. They continued monitoring and confirmed that the system was stable and all services were operational.

**What are we doing to prevent it from happening again?**
Engineering has updated the backend service and deployed additional updates and monitoring to the system configuration that will improve the overall stability of the environment and prevent this issue from recurring.
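To illustrate what "more forgiving liveness checks" can look like in practice, here is a minimal sketch of a health endpoint that reports unhealthy only after several consecutive failed checks, so a single slow response does not trigger a restart. The threshold, port, and `backend_is_responsive` helper are hypothetical assumptions for illustration, not details of the actual xMatters implementation.

```python
# Hypothetical sketch of a "forgiving" liveness endpoint: it reports
# unhealthy only after several consecutive failed checks, so one slow
# response does not trigger an automatic restart. Names and thresholds
# are illustrative assumptions, not the actual xMatters configuration.
from http.server import BaseHTTPRequestHandler, HTTPServer

FAILURE_THRESHOLD = 3  # consecutive failures tolerated before reporting 503
consecutive_failures = 0

def backend_is_responsive() -> bool:
    """Placeholder dependency check (assumed for this sketch)."""
    return True

class LivenessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global consecutive_failures
        if backend_is_responsive():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
        # Stay "healthy" through transient slowness; fail only on a streak.
        if consecutive_failures < FAILURE_THRESHOLD:
            self.send_response(200)
        else:
            self.send_response(503)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), LivenessHandler).serve_forever()
```

This mirrors the knobs that platforms such as Kubernetes expose as `failureThreshold` and `timeoutSeconds` on liveness probes, though the postmortem does not say which orchestration platform xMatters uses.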
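For callers who saw the intermittent 5xx responses mentioned in the monitoring update, a common client-side mitigation is to retry with exponential backoff. The sketch below is a generic example under assumed values; the URL, timeout, and retry budget are illustrative, not xMatters guidance.

```python
# Generic retry-with-backoff sketch for intermittent 5xx responses.
# The endpoint, timeout, and retry budget are assumptions for illustration.
import time

import requests

def post_with_retries(url: str, payload: dict, attempts: int = 4,
                      base_delay: float = 1.0) -> requests.Response:
    """POST `payload` to `url`, retrying on 5xx or network errors."""
    for attempt in range(attempts):
        try:
            response = requests.post(url, json=payload, timeout=10)
            if response.status_code < 500:
                return response  # success, or a 4xx that retrying won't fix
        except requests.RequestException:
            pass  # transient network failure; treat as retryable
        if attempt < attempts - 1:
            # Exponential backoff: wait 1s, 2s, 4s, ... between attempts.
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"{url} still returning errors after {attempts} attempts")
```

Capping the retry budget matters here: unbounded retries against a degraded service can amplify the very load that produced the 5xx responses in the first place.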