Onfido incident
Delays in Document report processing (all Regions)
Onfido experienced a service incident on January 21, 2026. The incident has been resolved; the full update timeline is below.
Update timeline
- Resolved: Jan 21, 2026, 10:25 AM UTC
Between 10:05 UTC and 10:18 UTC we experienced delays on a significant portion of Document reports, as well as errors on our Autofill product. This was caused by the release of a faulty component. The release was rolled back to resolve the situation, and the backlog of delayed reports is being processed.
- Postmortem: Feb 03, 2026, 07:14 PM UTC
### Summary

The combined deployment of a new Document Extraction ML model and its supporting service (which back Document Verification and Autofill) replaced the previous model version instantly, while the application layer was rolled out via canary. When we were alerted to errors during the canary release, we triggered a manual rollback of both the application layer and the ML model, which failed due to a bug in the model deployment framework. The canary release for the application layer then failed to progress due to the high error rate and automatically reverted both the application and model layers, returning the service to a healthy state.

The service unavailability resulted in increased Turnaround Time (TaT) for all Document reports and Studio Autofill tasks created between 10:05 and 10:17 UTC. Autofill Classic had an average 70% error rate during that time frame. All impacted Document reports and Studio Autofill tasks were completed successfully with an average TaT of ~6 minutes.

### Root Causes

- Our model deployment system does not support replacing a model version without downtime. Application and model pods are deployed without synchronization or any ordering guarantee, which requires a 3-step release (add the new version, update the application layer, remove the old version).
- The lack of automated guardrails preventing a release from replacing a model version in a single step allowed this human error to cause the service unavailability.
- Our manual rollback mechanism for models did not correctly restore the model config map in the Kubernetes cluster, which caused the rollback to fail for both the application and the models when triggered manually. It succeeded during the automated canary reversal.
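The config-map root cause above can be illustrated with a minimal sketch. This is not Onfido's actual rollback code; the function and field names (`snapshot_release`, `active_model`) are hypothetical, and Kubernetes objects are modeled as plain dictionaries. The point is that a rollback must restore the model config map together with the deployment specs, otherwise the reverted pods reference a model version that is no longer deployed.

```python
def snapshot_release(config_map: dict, deployments: dict) -> dict:
    """Capture both the model config map and deployment specs before a release."""
    return {"config_map": dict(config_map), "deployments": dict(deployments)}


def rollback(snapshot: dict, config_map: dict, deployments: dict) -> None:
    """Restore deployments AND the model config map together.

    The failure mode described in the postmortem corresponds to skipping the
    config-map restore below: the application pods revert, but the config map
    still points at the new model version, which the rollback just removed.
    """
    deployments.clear()
    deployments.update(snapshot["deployments"])
    # Restoring the config map is the step the buggy manual rollback missed.
    config_map.clear()
    config_map.update(snapshot["config_map"])
```

A release pipeline would take the snapshot before mutating anything, so that a single rollback call returns the whole system to its pre-release state.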
### Timeline

- 10:05 UTC: Production release of the extraction service updating an ML model
- 10:08 UTC: We unsuccessfully tried to manually roll back after noticing errors on a metrics dashboard
- 10:10 UTC: Automatic monitors alerted the on-call team due to a high error rate
- 10:16 UTC: The canary release was aborted automatically due to the high error rate
- 10:18 UTC: All services went back to normal

### Remedies

1. Add guardrails to the release pipeline in order to enforce the 3-step safe release process (new ML model deployment, application layer update, old ML model removal)
   1. We will separate the CI/CD pipelines for ML model deployment and the application layer
   2. Model switching will be done exclusively at the application layer
2. Fix the manual rollback mechanism for models, in particular for releases that remove a model version
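The guardrail in remedy 1 could take the shape of a pre-release check that rejects any release replacing a model version in a single step. This is a hypothetical sketch, not the actual pipeline code: model versions are represented as simple sets of version identifiers before and after the proposed release.

```python
def validate_model_release(before: set, after: set) -> None:
    """Reject releases that replace a model version in a single step.

    A safe step is one of: add a version, update the application layer
    (no model change), or remove a version. Adding and removing versions
    in the same step replaces a model instantly, which is what caused
    the incident.
    """
    added = sorted(after - before)
    removed = sorted(before - after)
    if added and removed:
        raise ValueError(
            f"Unsafe release: adds {added} and removes {removed} in one step; "
            "split into add -> application update -> remove"
        )
```

Run against the incident's release, where the new model instantly replaced the old one, this check would have blocked the deployment and forced the 3-step process.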