Onfido incident
Increased turnaround time for document check processing in the EU region
Onfido experienced a minor incident on October 20, 2025, affecting Document Verification and lasting 57 minutes. The incident has been resolved; the full update timeline is below.
Affected components
- Document Verification
Update timeline
- investigating Oct 20, 2025, 12:10 PM UTC
We are currently investigating an increase in turnaround time for document check processing in the EU region. We will provide a status update in the next 15 minutes.
- identified Oct 20, 2025, 12:20 PM UTC
The issue has been identified and a fix is being implemented. We will provide a further update in 15 minutes.
- monitoring Oct 20, 2025, 12:38 PM UTC
We have implemented a fix for this issue. We are monitoring closely to make sure the issue has been resolved and everything is working as expected. Please bear with us while we get back on our feet; we appreciate your patience during this incident. We will provide an update in the next 30 minutes.
- resolved Oct 20, 2025, 01:07 PM UTC
This issue is now resolved: Increased turnaround time for document check processing in the EU region. We take a lot of pride in running a robust, reliable service, and we're working hard to make sure this does not happen again. A detailed postmortem will follow once we've concluded our investigation.
- postmortem Oct 22, 2025, 02:16 PM UTC
### Summary

Between 11:47 and 12:10 UTC, 55% of document reports could not be processed due to a partial failure of a fraud detection service. From 12:10 onwards, traffic was processed as usual and we started rerunning the failed reports. Reports that required manual processing saw additional delays of up to 2 hours while the backlog was cleared.

### Root Causes

A sudden increase in CPU usage on the impacted fraud detection service lasted for a few minutes, causing retry policies to kick in. The service did not scale quickly enough to handle both the ongoing traffic and the retries, so a portion of report processing was halted and the affected reports were stored in dead-letter queues for processing once the systems had stabilized.

### Timeline

11:47 UTC: High CPU usage on a fraud detection service leads to errors.

11:50 UTC: CPU usage returns to normal; reports that errored out are being retried.

11:51 UTC: The fraud detection service fails to scale to handle both normal traffic and retries.

11:51 UTC: The on-call team is alerted to a high error rate on the fraud detection service and starts investigating.

12:09 UTC: The on-call team identifies the root cause and scales up the service manually.

12:10 UTC: Errors have stopped and reports are processed normally.

12:12 UTC: We start rerunning reports that failed during the incident.

13:20 UTC: All reports that did not require manual review are completed.

13:50 UTC: All reports that required manual review have been processed.

### Remedies

Review the autoscaling configuration of the impacted service, as well as of other services that share a similar architecture.
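
For illustration only, the sketch below shows a generic bounded-retry pattern with a dead-letter queue fallback, the kind of mechanism described in the root cause above. The function names, retry budget, and queue interface are assumptions made for the example and are not taken from Onfido's implementation.

```python
import random
import time
from typing import Callable

MAX_ATTEMPTS = 3          # assumed retry budget; not Onfido's actual setting
BASE_DELAY_SECONDS = 0.5  # assumed backoff base


def process_with_retries(report_id: str,
                         process: Callable[[str], None],
                         dead_letter: Callable[[str], None]) -> bool:
    """Try to process a report a bounded number of times.

    On repeated failure, hand the report to a dead-letter queue so it can
    be rerun once the downstream service has stabilized, instead of
    retrying indefinitely against an already saturated service.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            process(report_id)
            return True
        except Exception:
            if attempt == MAX_ATTEMPTS:
                break
            # Exponential backoff with jitter spreads retries out so they
            # do not arrive as a synchronized burst on the failing service.
            delay = BASE_DELAY_SECONDS * (2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, delay))
    dead_letter(report_id)  # park the report for later reprocessing
    return False
```

In this pattern, the dead-letter queue is drained once capacity has recovered, which corresponds to the "rerunning failed reports" step in the timeline above.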