Onfido incident

Document report processing disrupted in EU

Notice · Resolved

Onfido reported a notice-level incident on January 21, 2026. The incident has been resolved; the full update timeline is below.

Started
Jan 21, 2026, 10:30 AM UTC
Resolved
Jan 21, 2026, 10:30 AM UTC
Duration
Detected by Pingoru
Jan 21, 2026, 10:30 AM UTC

Update timeline

  1. resolved Jan 21, 2026, 02:28 PM UTC

    Between 10:30 UTC and 10:47 UTC there was disruption to document report processing in the EU, as a side effect of re-processing reports delayed by the earlier incident. Further details will follow in a postmortem.

  2. postmortem Jan 30, 2026, 11:37 AM UTC

    ### Summary

    For the EU region, one critical service struggled to reprocess the traffic affected by a previous faulty release, which led to a higher Turnaround Time (TaT) for all Document reports created between 10:32 and 10:50 UTC. All impacted Document reports were completed successfully, with an average TaT of ~6 minutes.

    ### Root Causes

    The re-processing batch caused a spike in traffic, and auto-scaling did not work as expected for one critical service. The service entered a crash-loop state and had to be manually scaled up to recover. An unbounded number of in-flight requests were accepted, leading to memory exhaustion and an unresponsive I/O event loop while the service waited for a downstream ML inference service to scale up.

    ### Timeline

    * 10:32 UTC: The critical service went up to a 100% error rate
    * 10:33 UTC: Engineers who initiated the report backlog reprocessing became aware of the issue through our monitoring and started investigating
    * 10:48 UTC: We manually scaled up the critical service
    * 10:50 UTC: The service went back to normal and errors stopped

    ### Remedies

    * Investigate how to reduce the memory footprint of this service, allowing bigger request queues while it waits for downstream ML model serving to scale up
    * Change auto-scaling parameters to be more aggressive (i.e. scale at lower CPU targets)
    * Add monitoring of concurrent requests
    * Reduce ML model serving image sizes for faster scaling of inference services
    * Improve back-pressure mechanisms so the service can sustain minimum traffic levels independently of spikes while auto-scaling kicks in
    * Change our weekly load testing scripts to specifically test for accelerated traffic spikes
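
    The root cause and the back-pressure remedy describe a common pattern: capping the number of in-flight requests so a service sheds load instead of exhausting memory while a downstream dependency scales up. The postmortem does not include code; the sketch below is a minimal, hypothetical illustration in Python (asyncio), assuming a service that forwards requests to a slow downstream ML inference endpoint. Names such as `MAX_IN_FLIGHT` and `call_inference` are illustrative and are not Onfido's implementation.

    ```python
    import asyncio

    # Hypothetical cap on concurrent in-flight requests; the incident's
    # root cause was that an unbounded number were accepted, leading to
    # memory exhaustion and an unresponsive I/O event loop.
    MAX_IN_FLIGHT = 200
    _in_flight = asyncio.Semaphore(MAX_IN_FLIGHT)


    class Overloaded(Exception):
        """Raised when load is shed instead of queueing requests unboundedly."""


    async def call_inference(payload: bytes) -> bytes:
        # Placeholder for the downstream ML inference call; assumed slow
        # while the inference service is still scaling up.
        await asyncio.sleep(0.5)
        return payload


    async def handle_request(payload: bytes) -> bytes:
        # Back-pressure: if no slot frees up quickly, fail fast so callers
        # can retry later, rather than piling requests into memory.
        try:
            await asyncio.wait_for(_in_flight.acquire(), timeout=1.0)
        except asyncio.TimeoutError:
            raise Overloaded("too many in-flight requests")
        try:
            return await call_inference(payload)
        finally:
            _in_flight.release()
    ```

    In practice a bound like this would be combined with the other remedies listed above, for example more aggressive auto-scaling targets and monitoring of the in-flight request count.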