Onfido experienced a critical incident on December 2, 2025, lasting approximately 25 minutes. The incident has been resolved; the full update timeline is below.
Update timeline
- resolved Dec 02, 2025, 01:30 PM UTC
We experienced delays and errors in check creation in the EU region, as well as an inability to log into the Dashboard, from 13:05 to 13:25 UTC. The incident is now resolved and we are processing traffic normally. We will publish a post-mortem with additional information.
- postmortem Dec 04, 2025, 08:25 PM UTC
### Summary

We were in the process of a routine upgrade of our network ingress controller from the "prior" version to a "next" version. This component is packaged and applied to API-related services as and when those services are deployed. A problem occurred when we packaged the "next" version while one of those services was in the middle of a deployment. This rendered our external API, as well as our Dashboard, inaccessible in the EU region for 25 minutes.

The issue was as follows:

1. At the start of the service deployment pipeline, the system set the network component dependency to the "prior" version (current at the time) and dynamically created configuration settings based on that version.
2. Meanwhile, the "next" version was released while the service deployment pipeline was still in progress.
3. Later in the pipeline, another step overrode the dependency version to "latest" (now pointing to "next"). However, the configuration settings created earlier in the pipeline were incompatible with this new version, which resulted in a corrupted final network configuration.

This situation had not previously occurred because the network ingress controller had never changed while an API service was mid-deployment.

### Root Causes

The root causes were:

* The final network configuration state depended on the output of multiple pipeline steps.
* The dependency version was not pinned for the entire deployment process, allowing a mid-pipeline version change.
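The race described above can be sketched in a few lines. This is an illustrative model only, not our actual pipeline code: the registry, `generate_config`, and `install` are hypothetical stand-ins for the package registry, the config-generation step, and the ingress installation step.

```python
# Minimal sketch of the race: a pipeline that re-resolves a "latest"
# tag mid-run can pair configuration generated for one version with a
# different version installed later. All names here are illustrative.

registry = {"latest": "prior"}  # hypothetical ingress controller package registry

def generate_config(version: str) -> dict:
    # Configuration settings are derived from the version seen at this step.
    return {"generated_for": version}

def install(version: str, config: dict) -> bool:
    # Installation yields a valid ingress state only when the config
    # was generated for the version actually being installed.
    return config["generated_for"] == version

# Early pipeline step: resolve the dependency and build config for it.
config = generate_config(registry["latest"])  # sees "prior"

# Meanwhile: the "next" ingress controller version is released.
registry["latest"] = "next"

# Later pipeline step: re-resolves "latest" and installs it.
ok = install(registry["latest"], config)  # installs "next" with "prior" config
print(ok)  # False -> corrupted ingress configuration
```

Because "latest" is resolved twice, any release landing between the two resolutions silently desynchronizes the config from the installed version.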
### Timeline

- 12:39 UTC: An API service deployment pipeline starts
- 12:56 UTC: The new network ingress controller is released
- 13:04 UTC: The API service pipeline triggers an EU Production deployment
- 13:05 UTC: A corrupted ingress configuration is installed, preventing traffic from being routed to the external API
- 13:10 UTC: We receive monitoring alerts for the loss of API traffic
- 13:15 UTC: Incident response is mobilized and begins investigating the problem
- 13:22 UTC: An API service rollback is initiated
- 13:25 UTC: The rollback completes, restoring a valid ingress configuration and allowing traffic again

### Remedies

The following improvements will be made to the standard deployment pipeline to prevent similar network configuration inconsistencies:

1. Use explicit version numbers for dependencies instead of "latest" during deployments.
2. Consolidate the steps required to build the network configuration.
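Remedy 1 can be sketched the same way: resolve "latest" exactly once at pipeline start and carry that explicit version through every subsequent step. As before, the registry and step functions are hypothetical stand-ins, not our real pipeline.

```python
# Sketch of remedy 1: pin the dependency version at pipeline start so a
# release landing mid-pipeline cannot change what gets installed.
# All names here are illustrative.

registry = {"latest": "prior"}  # hypothetical package registry

def generate_config(version: str) -> dict:
    return {"generated_for": version}

def install(version: str, config: dict) -> bool:
    return config["generated_for"] == version

# Resolve "latest" once, before any step runs, and pin the result.
pinned = registry["latest"]          # "prior"
config = generate_config(pinned)

# A new version released mid-pipeline no longer affects this run.
registry["latest"] = "next"

ok = install(pinned, config)         # still installs "prior"
print(ok)  # True -> consistent ingress configuration
```

The single resolution point means every step sees the same version, which is what makes the final configuration state deterministic regardless of concurrent releases.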