MASV incident

Increased error rates and latency

Notice Resolved

MASV reported a notice-level incident on July 30, 2024, affecting the MASV API and Services and US East components and lasting 7h 4m. The incident has been resolved; the full update timeline is below.

Started
Jul 30, 2024, 10:56 PM UTC
Resolved
Jul 31, 2024, 06:00 AM UTC
Duration
7h 4m
Detected by Pingoru
Jul 30, 2024, 10:56 PM UTC

Affected components

MASV API and Services, US East

Update timeline

  1. investigating Jul 30, 2024, 10:45 PM UTC

    We are currently investigating an increased rate of errors and high latency when accessing the service.

  2. identified Jul 30, 2024, 10:56 PM UTC

    The issue has been identified, and we are in contact with our service provider to resolve it.

  3. identified Jul 31, 2024, 01:34 AM UTC

    The error rate has been reduced. We expect the service to fully recover within the next 3 hours.

  4. monitoring Jul 31, 2024, 03:15 AM UTC

    The majority of issues have been resolved and error rates have subsided. We will continue monitoring the service for the next few hours to ensure stability.

  5. resolved Jul 31, 2024, 01:11 PM UTC

    We have confirmed system stability following the previous update. This incident has been resolved.

  6. postmortem Aug 02, 2024, 02:54 PM UTC

    **Problem Description, Impact and Resolution**

    At 22:30 UTC on July 30, 2024, we observed an increased error rate and request latency for interactions with the MASV API. Our client applications for the web, desktop and service are designed to keep retrying failures of this nature automatically, so uploads and general account management functionality may have been slower or taken longer to load, but would ultimately have succeeded.

    Our team immediately began investigating the cause and identified one of our service providers as the source of the instability. We contacted the provider, who acknowledged the problem and began investigating on their side. At 01:00 UTC on July 31, we observed a significant reduction in error rates and overall latency, and the issue was fully resolved by 03:00 UTC.

    **Mitigation Steps and Future Preventative Measures**

    We take service reliability very seriously, as we understand the critical role our product plays in our customers' workflows. We are taking steps to scale our core systems horizontally into additional regions and to add further redundancy, which will let us redirect traffic across regions if a service impact is detected. This should allow us to respond more effectively to an outage of this nature in the future.
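The postmortem notes that MASV's clients automatically retry failed requests until they succeed. The sketch below illustrates one common way to build that behavior: an unbounded retry loop with capped exponential backoff and jitter. It is not MASV's actual client code; `TransientError`, `retry_until_success`, and every parameter and default value here are hypothetical names and numbers chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for a retryable failure (e.g. HTTP 5xx or a timeout)."""


def retry_until_success(
    operation: Callable[[], T],
    base_delay: float = 0.5,   # assumed starting delay in seconds
    max_delay: float = 30.0,   # assumed upper bound on any single wait
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` repeatedly until it returns without a TransientError."""
    attempt = 0
    while True:
        try:
            return operation()
        except TransientError:
            # Exponential backoff capped at max_delay; the exponent is also
            # capped so very long outages cannot overflow the computation.
            ceiling = min(max_delay, base_delay * 2 ** min(attempt, 16))
            # Full jitter spreads retries out so that many clients do not
            # hit a recovering service at the same moment.
            sleep(random.uniform(0, ceiling))
            attempt += 1
```

With a loop of this shape, a request that fails a few times during an outage is only delayed, which matches the slowdown-but-eventual-success behavior described above. The `sleep` parameter is injectable so the loop can be exercised in tests without real waiting.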