Sumsub incident

Technical Issue with dashboard availability

Critical · Resolved

Sumsub experienced a critical incident on February 2, 2023, affecting the API, MobileSDK, and WebSDK components and lasting 5h 50m. The incident has been resolved; the full update timeline is below.

Started
Feb 02, 2023, 04:30 PM UTC
Resolved
Feb 02, 2023, 10:21 PM UTC
Duration
5h 50m
Detected by Pingoru
Feb 02, 2023, 04:30 PM UTC

Affected components

API, MobileSDK, WebSDK

Update timeline

  1. investigating Feb 02, 2023, 04:30 PM UTC

    The service has encountered a technical problem with access for the majority of clients. No images can be uploaded for applicants. Our team is investigating the incident.

  2. identified Feb 02, 2023, 04:30 PM UTC

    The issue has been identified and a fix is being implemented.

  3. investigating Feb 02, 2023, 06:16 PM UTC

    We are currently investigating this issue.

  4. investigating Feb 02, 2023, 06:18 PM UTC

    The service has encountered a technical problem with access for the majority of clients. No images can be uploaded for applicants. Our team is investigating the incident.

  5. investigating Feb 02, 2023, 06:40 PM UTC

    We are continuing to investigate this issue.

  6. identified Feb 02, 2023, 07:40 PM UTC

    The issue has been identified and a fix is being implemented.

  7. monitoring Feb 02, 2023, 07:53 PM UTC

    A fix has been implemented and we are monitoring the results.

  8. resolved Feb 02, 2023, 10:21 PM UTC

    This incident has been resolved.

  9. postmortem Feb 03, 2023, 05:55 PM UTC

    Around 15:50 UTC we received the first alerts about increased system load on our platform. Our automatic scaling system attempted to mitigate the problem by increasing the number of backend instances. During this time a significant number of requests were still going through, but the system showed extreme delays in performing any of these actions. This prompted our Engineering team to open an incident report and begin a full-scale investigation. We found that IO was the root cause. It is important to clarify that our backend relies heavily on a distributed file system provided by AWS. We opened a case with AWS while working around the clock on a plan to make our system responsive again, not yet knowing that the root cause had started on Amazon's side. Here are some of the actions taken:

    1. We replaced the file system with another one using more aggressive settings. That action showed improvement but unfortunately did not give us the expected results. This forced us to make changes on the backend to prevent performance degradation while working without the distributed file system at all.
    2. Around 21:25 UTC - a fix was rolled out, and we confirmed that the changes made in the system were performing as expected.
    3. Around 00:15 UTC - AWS acknowledged elevated latencies for the file system and started an investigation on their side.
    4. Around 02:00 UTC - AWS identified the root cause of the issue and confirmed they are working on a fix. There have been no further updates from AWS on the case yet, and the incident has not been reflected on the global AWS status page.
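
For readers curious how the mitigation described in step 1 of the postmortem might look in practice, below is a minimal sketch of routing applicant-image writes away from a slow shared file system. It is an illustration only: the paths, the latency budget, and the function names (shared_fs_is_healthy, store_applicant_image) are assumptions, not Sumsub's actual backend code.

```python
# Hypothetical sketch: write uploads to a shared (EFS-style) mount when it is healthy,
# and spool them locally when the shared file system is slow or failing.
# All paths, thresholds, and names below are illustrative assumptions.
import os
import time
import tempfile

SHARED_FS_ROOT = "/mnt/shared-efs/applicant-images"   # assumed distributed FS mount
LOCAL_FALLBACK_ROOT = "/var/spool/applicant-images"   # assumed local spool directory
PROBE_LATENCY_BUDGET_S = 2.0                          # assumed acceptable write latency


def shared_fs_is_healthy() -> bool:
    """Probe the shared mount with a small timed write; treat slow or failed IO as unhealthy."""
    probe_dir = os.path.join(SHARED_FS_ROOT, ".health")
    try:
        os.makedirs(probe_dir, exist_ok=True)
        start = time.monotonic()
        with tempfile.NamedTemporaryFile(dir=probe_dir) as probe:
            probe.write(b"ping")
            probe.flush()
            os.fsync(probe.fileno())
        return (time.monotonic() - start) <= PROBE_LATENCY_BUDGET_S
    except OSError:
        return False


def store_applicant_image(applicant_id: str, filename: str, data: bytes) -> str:
    """Write to the shared FS when healthy; otherwise spool locally for later reconciliation."""
    root = SHARED_FS_ROOT if shared_fs_is_healthy() else LOCAL_FALLBACK_ROOT
    target_dir = os.path.join(root, applicant_id)
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, filename)
    with open(target, "wb") as f:
        f.write(data)
    return target
```

A real backend would also need a reconciliation job that moves spooled files back to the shared mount once the provider confirms the latency issue is resolved; that part is omitted here.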