Sardine incident
Risk evaluations may be returning Very High risk with blocklisted items when this is not actually the case.
Sardine experienced a major incident on July 24, 2025 affecting Customer APIs, lasting 11 minutes on the status page (the postmortem below puts total impact duration at 48 minutes). The incident has been resolved; the full update timeline is below.
Affected components
- Customer APIs
Update timeline
- identified Jul 24, 2025, 08:41 PM UTC
The issue has been identified and a fix is being implemented.
- identified Jul 24, 2025, 08:43 PM UTC
The team has found the root cause and is working on a fix.
- monitoring Jul 24, 2025, 08:50 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Jul 24, 2025, 08:53 PM UTC
This incident has been resolved.
- postmortem Jul 25, 2025, 01:41 AM UTC
# Incorrect Very High Risk Evaluations

**Date:** 2025-07-24
**Duration:** 48 minutes
**Severity:** High
**Impact:** Multiple B2B clients affected

## Executive Summary

A code deployment introduced a race condition in our risk evaluation query logic, causing legitimate sessions to be incorrectly flagged as "very_high" risk. The incident affected multiple clients for 48 minutes before detection and resolution.

## Incident Details

### What Happened

Sessions across multiple Sardine clients received elevated risk scores ("very_high") that did not align with their actual risk profile as determined by our checkpoint-specific rules and ML models. Root cause: a race condition where risk queries executed before the risk engine completed session evaluation, inappropriately triggering fail-safe "default to block" logic.
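To make the failure mode concrete, here is a minimal, hypothetical sketch; the class, statuses, and function names are illustrative assumptions for this document, not our production code:

```python
# Hypothetical illustration only: names and statuses are assumptions,
# not Sardine's actual implementation.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionEvaluation:
    session_id: str
    status: str                  # "pending" | "complete" | "failed"
    risk_level: Optional[str]    # populated only when status == "complete"


def query_risk_level(evaluation: Optional[SessionEvaluation]) -> str:
    """Pre-fix behavior: any session without a final risk level trips the
    fail-safe "default to block" branch -- including sessions the engine
    simply had not finished evaluating, which is the race described above."""
    if evaluation is None or evaluation.risk_level is None:
        return "very_high"  # fail-safe fired for merely-pending sessions
    return evaluation.risk_level


def query_risk_level_refactored(evaluation: Optional[SessionEvaluation]) -> str:
    """One possible shape of the fail-safe refactor listed under Action
    Items: distinguish "still evaluating" from "evaluation failed" so
    only genuine failures default to block."""
    if evaluation is not None and evaluation.status == "pending":
        return "pending"  # caller awaits or retries the engine result
    if evaluation is None or evaluation.risk_level is None:
        return "very_high"  # engine genuinely failed or never ran
    return evaluation.risk_level


# A legitimate session mid-evaluation is wrongly blocked pre-fix:
pending = SessionEvaluation("sess-123", status="pending", risk_level=None)
assert query_risk_level(pending) == "very_high"           # the incident
assert query_risk_level_refactored(pending) == "pending"  # post-refactor
```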
### Timeline (UTC)

| Time | Event | Actor |
| --- | --- | --- |
| 20:02 | Canary (phased) deployment initiated with the affecting change | Engineering |
| 20:17 | Automated promotion to full deployment | CI/CD Pipeline |
| 20:19 | **First client escalation** - elevated risk levels reported | Client |
| 20:29 | Issue escalated to engineering | Support Team |
| 20:38 | Root cause confirmed, rollback initiated | Engineering |
| 20:50 | **Service fully restored** | Engineering |

**Key Metrics:**

* Time to Detection: 17 minutes (customer-reported)
* Time to Resolution: 31 minutes from escalation
* Total Impact Duration: 48 minutes

## Five Whys Analysis

1. Why were sessions incorrectly marked as very_high risk?
   * The risk query executed before the risk engine evaluation completed, triggering fail-safe "default to block" logic.
2. Why did the risk query execute before risk engine completion?
   * A race condition introduced in the deployment broke the timing dependency between risk evaluation and query execution.
3. Why wasn't this race condition detected in testing?
   * The QA environment lacked sufficient session volume to reproduce the timing conditions that trigger the race condition.
4. Why didn't our monitoring detect this anomaly?
   * Alerting thresholds for sudden increases in per-client very_high risk session percentages were missing.
5. Why didn't the phased rollout catch this issue?
   * Only a subset of clients was affected due to configuration differences; clients with custom risk logic bypassed the affected code path.

## Impact Analysis

### What Went Wrong

1. **Detection Failure**: No proactive monitoring detected the anomaly; we relied on customer escalation.
2. **Testing Gap**: QA missed race condition scenarios due to insufficient test data volume.
3. **Monitoring Gap**: No alerting on risk level distribution changes.
4. **Phased Launch Limitation**: Configuration variance masked the issue during gradual rollout.

### What Went Well

1. **Rapid Response**: The support team escalated within 10 minutes of the customer report.
2. **Quick Resolution**: Engineering confirmed the root cause and initiated rollback within 19 minutes.
3. **Effective Rollback**: Clean restoration with no data corruption.

## Action Items

### Immediate (Week 1)

* [ ] **Owner: Engineering Lead** - Implement alerting for anomalous (z-score) increases in very_high risk sessions per client (Target: 3 days); see the sketch in the appendix at the end of this document.
* [ ] **Owner: QA Lead** - Add race condition test scenarios with production-like session volumes (Target: 5 days)
* [ ] Produce per-client incident reports and queries for assessing impacted sessions

### Short-Term (Month 1)

* [ ] **Owner: Platform Team** - Review and refactor fail-safe logic to handle timing dependencies (Target: 2 weeks)
* [ ] **Owner: Engineering** - Enhance unit and E2E test coverage for risk evaluation edge cases (Target: 3 weeks)
* [ ] **Owner: QA Team** - Implement exploratory testing protocols for risk evaluation scenarios (Target: 4 weeks)

### Medium-Term

In progress; owned by Engineering.

## Prevention Measures

1. **Enhanced Monitoring**: Real-time risk distribution tracking with client-level granularity.
2. **Improved Testing**: Production-volume simulation in the staging environment.
3. **Robust Fail-safes**: Timeout handling and graceful degradation for race conditions.
4. **Phased Deployment**: Configuration-aware rollout strategy to ensure comprehensive coverage.

## Lessons Learned

* Customer-reported incidents indicate monitoring blind spots requiring immediate attention.
* Race conditions in distributed systems require specific testing strategies beyond happy/sad path scenarios.
* Fail-safe mechanisms must account for timing dependencies in microservices architectures.

**Document Owner:** Engineering Team
**Review Date:** 2025-08-24
**Distribution:** Executive Team, Engineering, Support, QA
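### Appendix: Per-Client Alerting Sketch

A minimal sketch of the z-score alert proposed in the immediate action items; function names, window shape, and the 3-sigma threshold are illustrative assumptions, not the final monitoring implementation:

```python
# Hedged sketch: names, windowing, and threshold are assumptions made
# for this appendix, not the shipped alerting configuration.
from statistics import mean, stdev
from typing import Sequence


def very_high_zscore(history: Sequence[float], current: float) -> float:
    """z-score of a client's current very_high rate against its own
    trailing windows (e.g. fraction of sessions scored very_high per
    5-minute bucket)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return float("inf") if current > mu else 0.0
    return (current - mu) / sigma


def should_alert(history: Sequence[float], current: float,
                 threshold: float = 3.0) -> bool:
    # Alert only on increases: a drop in very_high rate is not this
    # incident's failure mode.
    return very_high_zscore(history, current) > threshold


# A client normally scoring ~2% of sessions very_high suddenly jumps
# to 40%, roughly the signature this incident would have produced:
baseline = [0.020, 0.025, 0.018, 0.022, 0.020, 0.019]
assert should_alert(baseline, 0.40)
assert not should_alert(baseline, 0.021)
```

Comparing each client against its own baseline, rather than a global threshold, matters here because clients have different risk mixes and configurations; that same configuration variance is what masked this incident during the phased rollout (Five Whys, item 5).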