Onfido incident

Unable to create reports in EU

Major · Resolved

Onfido experienced a major incident on December 9, 2025 affecting the API and lasting 57 minutes. The incident has been resolved; the full update timeline is below.

Started
Dec 09, 2025, 11:15 AM UTC
Resolved
Dec 09, 2025, 12:12 PM UTC
Duration
57m
Detected by Pingoru
Dec 09, 2025, 11:15 AM UTC

Affected components

API

Update timeline

  1. Identified Dec 09, 2025, 11:39 AM UTC

    We have identified an issue in the infrastructure that handles report creation. We are working on restoring it.

  2. Monitoring Dec 09, 2025, 11:46 AM UTC

    The API is now fully available. We are monitoring the situation.

  3. Resolved Dec 09, 2025, 12:12 PM UTC

    The incident has been resolved. We'll share a public post-mortem later.

  4. Postmortem Dec 19, 2025, 01:19 PM UTC

    ### Summary

    On **December 9, 2025**, between **09:50 AM UTC and 11:36 AM UTC**, our services in the **EU region** experienced a **degradation** that affected customers’ ability to **list checks in the dashboard**. During this period, the **webhook logs displayed in the dashboard** were also impacted, leading to **incomplete or delayed visibility**, and the **webhook resend feature** was similarly degraded.

    Furthermore, as a consequence of the ongoing incident, between **11:15 AM UTC and 11:38 AM UTC (a total of 23 minutes)**, our **Classic** clients were affected by an issue in the check creation service that resulted in **no new checks being created** during that period in the **EU region**. In addition, both **Classic** and **Studio** clients experienced **increased turnaround times (TaT)** for ongoing checks and tasks.

    ### Root Causes

    The incident occurred because a **shared indexing component** became unavailable after reaching its **capacity limits** due to a workload spike from another internal system. Prior spikes with more than double the load, lasting longer, had caused no issues; we suspect a timing interaction with background maintenance (e.g., garbage collection) triggered this one, but we lack sufficient historical telemetry to confirm. The overload caused the shared infrastructure to fail, which **cascaded into disruptions across multiple dependent services** and **led to errors in client-facing operations**.

    ### Timeline

    * 09:50 UTC: Error rate and round-trip time increased for an indexing component
    * 09:51 UTC: Our on-call team was notified about the increase in error rate
    * 10:10 UTC: Investigation began, along with actions to reduce non-essential traffic and speed recovery
    * 11:08 UTC: Actions to reduce non-essential traffic completed
    * 11:15 UTC: Shared caching infrastructure dependency became unavailable
    * 11:15 UTC: Incident began impacting check creation; no new checks succeeded
    * 11:18 UTC: Upscaled the indexer, as the actions to reduce load were insufficient
    * 11:32 UTC: Upscaled the shared caching infrastructure to resolve the check creation incident
    * 11:36 UTC: Checks and webhook logs displayed correctly again on the dashboard; the resend webhook function still showed outdated information
    * 11:39 UTC: Our on-call team confirmed all traffic had returned to normal
    * 12:45 UTC: Upscaling completed
    * 13:05 UTC: We continued monitoring to ensure backlog items were processed correctly
    * 13:20 UTC: Remaining backlog processed

    ### Remedies

    * Introduce rate limiting, review timeouts, and change the query access patterns sent to the indexing service to reduce the risk of excessive-load events (see the sketch below)
    * Extend our monitoring of the shared caching infrastructure to anticipate resource capacity exhaustion
    * Review and reduce the check creation service's dependency on the shared caching infrastructure
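To make the first remedy concrete, here is a minimal sketch of client-side rate limiting placed in front of a shared service, in Python. Everything in it is an assumption for illustration: the `TokenBucket` class, the `query_index` wrapper, the `_search_stub` stand-in, and the chosen limits are hypothetical, not Onfido's implementation.

```python
import threading
import time


class TokenBucket:
    """Token-bucket limiter: at most `rate` requests/second on average,
    with bursts of up to `capacity` requests."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill tokens in proportion to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False


def _search_stub(query: str) -> list:
    # Hypothetical stand-in for the real call to the shared indexing service.
    return [f"result for {query!r}"]


def query_index(query: str, limiter: TokenBucket) -> list:
    """Guarded call: shed the request rather than pile more load onto a
    saturated shared component; callers can retry with backoff."""
    if not limiter.try_acquire():
        raise RuntimeError("index query rejected: rate limit exceeded")
    return _search_stub(query)


if __name__ == "__main__":
    limiter = TokenBucket(rate=50.0, capacity=100.0)  # illustrative limits
    print(query_index("checks:recent", limiter))
```

The same guard-at-the-boundary idea extends to the third remedy: wrapping calls to the shared caching infrastructure in a limiter or circuit breaker would let check creation degrade gracefully instead of failing outright when the dependency is saturated.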