MASV incident

Increased error rates and latency

MASV experienced a notice-level incident on July 30, 2024 affecting the MASV API and Services and US East components, lasting 7h 4m. The incident has been resolved; the full update timeline is below.

Started: Jul 30, 2024, 10:56 PM UTC
Resolved: Jul 31, 2024, 06:00 AM UTC
Duration: 7h 4m
Detected by Pingoru: Jul 30, 2024, 10:56 PM UTC

Affected components

MASV API and Services, US East

Update timeline

investigating Jul 30, 2024, 10:45 PM UTC

We are currently investigating an increased rate of errors and high latency when accessing the service.

identified Jul 30, 2024, 10:56 PM UTC

The issue has been identified and we are in contact with our service provider to resolve it.

identified Jul 31, 2024, 01:34 AM UTC

The error rate has been reduced. We expect the service to fully recover within the next 3 hours.

monitoring Jul 31, 2024, 03:15 AM UTC

The majority of issues have been resolved and error rates have subsided. We will continue monitoring the service for the next few hours to ensure stability.

resolved Jul 31, 2024, 01:11 PM UTC

We have confirmed system stability following the previous update. This incident has been resolved.

postmortem Aug 02, 2024, 02:54 PM UTC

**Problem Description, Impact and Resolution**

At 22:30 UTC on July 30, 2024, we observed an increased error rate and request latency for interactions with the MASV API. Our client applications for the web, desktop and service are designed to retry failures of this nature automatically and persistently, so uploads and general account management functionality may have been slower or taken longer to load, but would ultimately have succeeded.

Our team immediately began investigating and identified one of our service providers as the source of the instability. We contacted the provider, who acknowledged the problem and began investigating on their side. At 1:00 UTC, we observed a significant reduction in error rates and overall latency, and the issue was fully resolved by 3:00 UTC.

**Mitigation Steps and Future Preventative Measures**

We take service reliability very seriously, as we understand the critical role our product plays in our customers' workflows. We are taking steps to extend horizontal scaling of our core systems to additional regions and to add redundancies, which will help us redirect traffic across regions if a service impact is detected. This will allow us to respond more effectively to an outage of this nature in the future.
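For integrators wondering what the automatic client retries mentioned in the postmortem might look like, the following is a minimal sketch of retrying with capped exponential backoff and jitter. It is not MASV's actual client code: `TransientError`, `retry_with_backoff`, and all parameter names and defaults are hypothetical, chosen only for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for a retryable failure (e.g. an HTTP 5xx or a timeout)."""


def retry_with_backoff(operation, max_attempts=8, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds or max_attempts calls have failed.

    Assumed policy (not taken from the incident report): the wait before
    retry k is a random fraction of min(max_delay, base_delay * 2**(k - 1)).
    `sleep` and `rng` are injectable so the policy can be tested without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # Out of attempts: surface the last failure to the caller.
                raise
            # Exponential growth, capped, then scaled by a random factor in [0, 1).
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(ceiling * rng())


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_request():
        # Simulated request that fails twice before succeeding.
        state["calls"] += 1
        if state["calls"] < 3:
            raise TransientError("simulated 503")
        return "ok"

    print(retry_with_backoff(flaky_request, base_delay=0.01))
```

The jitter spreads retries from many clients over time, so that a recovering service is not hit by synchronized retry bursts. A retry loop like this is only safe for operations that can be repeated without side effects.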