Onfido incident

Facial Similarity and Known Faces service degradation

Minor Resolved

Onfido experienced a minor incident on March 13, 2025, affecting Facial Similarity and Known Faces, lasting 1h 15m. The incident has been resolved; the full update timeline is below.

Started
Mar 13, 2025, 06:06 PM UTC
Resolved
Mar 13, 2025, 07:21 PM UTC
Duration
1h 15m
Detected by Pingoru
Mar 13, 2025, 06:06 PM UTC

Affected components

Facial Similarity, Known Faces

Update timeline

  1. investigating Mar 13, 2025, 06:06 PM UTC

    We are currently investigating higher processing times for Facial Similarity and Known Faces reports in the EU region.

  2. investigating Mar 13, 2025, 06:24 PM UTC

    Processing times are back to normal for ongoing reports. There are some pending reports being automatically re-run by our graceful handling of errors as we update this incident page. We are continuing to investigate the issue.

  3. identified Mar 13, 2025, 06:37 PM UTC

    A bad query has been identified as the main culprit. We continue to investigate the issue.

  4. monitoring Mar 13, 2025, 07:07 PM UTC

    We're monitoring the run of pending reports. Almost done now. As previously stated, ongoing processing is back to normal. We'll update again once all pending reports affected during the incident have been recovered.

  5. resolved Mar 13, 2025, 07:21 PM UTC

    All reports have been recovered. We're now back to normal processing and the incident is over.

  6. postmortem Mar 21, 2025, 11:30 AM UTC

    At around 6pm UTC on 13th March 2025, we were alerted to higher turnaround times (and consequent delays) in processing Facial Similarity and Known Faces reports in the EU region. This will have affected all clients running reports during this 15-minute period. These reports didn’t fail; they were only delayed.

    ### Summary

    Higher turnaround times (and consequent delays) in processing Facial Similarity and Known Faces reports.

    ### Root Causes

    * Known Faces and Facial Similarity reports took longer than expected to be processed
    * because a database was struggling (heavy CPU usage)
    * because an ongoing query was monopolising the database
    * because the query was not optimised (and not configured to time out)
    * because the depending service is an internal operational tool for report drill down and investigation

    ### Timeline

    * 17:56 UTC: We get alerted to a high number of pending reports, due to higher turnaround times in processing
    * 18:03 UTC: A suspected feature is turned off as a potential culprit, but nothing changes, so it is not the root cause
    * 18:07 UTC: Problem stops, and ongoing reports are now being processed normally (although further investigation shows this is unrelated to the feature that was turned off)
    * 18:10 UTC: Investigation shows high CPU usage in the database
    * 18:34 UTC: A query originating in the internal operational tool is identified as the culprit
    * 18:35 UTC: Pending reports are seen dropping, which should indicate that the graceful recovery process is running. But a quirk in the metric tricks us, and we realise pending reports are stuck
    * 18:36 UTC: Pending reports seem stuck and are not being recovered automatically, so we resort to manual action to re-run them
    * 18:37 UTC: The search feature in the internal operational tool that causes the bad query is disabled (functionality removed)
    * 19:00 UTC: We retrieve all of the affected reports from our logging platform
    * 19:12 UTC: We have re-run all affected reports and the incident is over

    ### Remedies

    In order to make sure this doesn’t happen again:

    * We will remove the search feature from the internal operational tool for report drill down whilst we optimise the query powering it
    * We will only reinstate the search feature after the query is optimised and set to use a read replica instead of a write replica for our PostgreSQL database
    * We will only reinstate the search feature after the query is optimised and an adequate query timeout is set
    * We will fix the Cron job for automatic and graceful recovery of pending reports
    * We have fixed the operational dashboards to use the right metric for pending reports monitoring
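The remedies mention an adequate query timeout and a read replica for the internal tool's search query. As a rough illustration only, and not Onfido's actual configuration, the sketch below builds a libpq-style PostgreSQL connection string that applies a per-session `statement_timeout` and marks transactions read-only by default. The function name, host, database, user, and the 5000 ms default are all hypothetical; only the `options` connection parameter and the two PostgreSQL settings are real.

```python
def build_readonly_dsn(replica_host: str, dbname: str, user: str,
                       timeout_ms: int = 5000) -> str:
    """Build a libpq keyword/value connection string for a read-only tool.

    Hypothetical helper: the names and the default timeout are assumptions
    made for illustration, not values taken from the incident report.
    """
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")

    # Each "-c name=value" sets a PostgreSQL run-time parameter for the session.
    # statement_timeout (milliseconds) aborts any statement that runs too long;
    # default_transaction_read_only makes new transactions read-only by default.
    session_options = (
        f"-c statement_timeout={timeout_ms} "
        "-c default_transaction_read_only=on"
    )

    # Values containing spaces must be single-quoted in a libpq connection string.
    return (
        f"host={replica_host} dbname={dbname} user={user} "
        f"options='{session_options}'"
    )


# Example with placeholder values; the host is assumed to be a read replica.
dsn = build_readonly_dsn("replica.internal.example", "reports", "ops_tool", 3000)
```

A string like this could be passed to a PostgreSQL client library that accepts libpq connection strings, so that a slow search query is cancelled by the server instead of holding database CPU indefinitely.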