Sardine incident

Elevated API latency

Sardine experienced a major incident on March 4, 2026, affecting Customer APIs and lasting 31 minutes. The incident has been resolved; the full update timeline is below.

Started: Mar 04, 2026, 10:09 PM UTC
Resolved: Mar 04, 2026, 10:40 PM UTC
Duration: 31m
Detected by Pingoru: Mar 04, 2026, 10:09 PM UTC

Affected components

Customer APIs

Update timeline

investigating Mar 04, 2026, 10:09 PM UTC

The team is already working on mitigation and resolution.

monitoring Mar 04, 2026, 10:15 PM UTC

A fix has been implemented and we are monitoring the results.

identified Mar 04, 2026, 10:15 PM UTC

The issue has been identified and a fix is being implemented.

resolved Mar 04, 2026, 10:40 PM UTC

This incident has been resolved.

postmortem May 07, 2026, 12:51 PM UTC

## Summary

* **Date:** March 4, 2026
* **Primary impact window:** March 4, 21:45–22:40 UTC (≈55 minutes)
* **Secondary impact window:** March 5, 08:00–08:10 UTC (≈10 minutes)
* **Services affected:** Advanced aggregations, then broader API traffic on Production (US) and Production (EU).
* **User-visible symptoms:** Elevated latency and elevated 5xx error responses on session-related endpoints.
* **Status:** Resolved. Root cause identified and follow-up work in progress.

## What happened

During a routine production deployment on March 4, 2026 at 21:45 UTC, a database schema change was applied to one of our largest tables. The change acquired locks that prevented reads and writes from completing in time, which caused requests dependent on that table (initially advanced aggregations and shortly after a broader set of production endpoints) to time out and return errors.

The deployment was aborted and rolled back, and the long-running schema change was terminated. Service returned to normal at 22:40 UTC.

A scheduled background job retained the new release configuration after the rollback. At 08:00 UTC on March 5, that job re-applied the schema change. Because traffic was low and the system was otherwise stable at that time, the change completed quickly and cleanly, with a brief (~10 minute) period of degraded performance before normal operations resumed at 08:10 UTC.

## Why it happened

* A schema change targeting a very large table was bundled with a standard application deployment rather than executed as a separate, controlled operation.
* Aborting the deployment did not immediately stop the in-flight schema change; new application instances continued to attempt the migration, extending the lock window.
* Our rollback path covered the application release but did not roll back an associated scheduled job, which later re-executed the same change outside the deployment window.
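
The rollback gap described in the last cause can be illustrated with a toy model. This is a minimal sketch, not a description of the actual deployment tooling: the `Environment` class, the release names `"previous"` and `"new"`, the change name `alter_large_table`, and every function below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical mapping: which schema changes each release ships with.
SCHEMA_CHANGES_BY_RELEASE = {
    "previous": [],
    "new": ["alter_large_table"],
}


@dataclass
class Environment:
    """Toy deployment state; all names here are illustrative."""
    app_release: str
    job_release: str
    applied_changes: list = field(default_factory=list)


def rollback_app_only(env: Environment, target: str) -> None:
    """Rollback that covers the application release but not the scheduled job."""
    env.app_release = target


def rollback_with_jobs(env: Environment, target: str) -> None:
    """Rollback that also reverts the scheduled job's release configuration."""
    env.app_release = target
    env.job_release = target


def run_scheduled_job(env: Environment) -> list:
    """Apply any schema change shipped with the release the job is configured for."""
    for change in SCHEMA_CHANGES_BY_RELEASE[env.job_release]:
        if change not in env.applied_changes:
            env.applied_changes.append(change)
    return env.applied_changes


def simulate(rollback) -> list:
    """Deploy 'new', roll back with the given strategy, then let the job fire."""
    env = Environment(app_release="new", job_release="new")
    rollback(env, "previous")
    return run_scheduled_job(env)
```

In this model, `simulate(rollback_app_only)` ends with `alter_large_table` applied even though the application was rolled back, while `simulate(rollback_with_jobs)` leaves the schema untouched.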
## What we are doing about it

* **Decoupling risky schema changes from deployments.** Large or lock-prone schema changes will be reviewed, scheduled, and executed separately from application releases, with an explicit risk review step.
* **Stronger migration timeouts and abort behavior.** Default timeouts will be enforced on schema changes, and a timeout will automatically halt the rollout rather than allow it to continue.
* **Cleaner rollback semantics.** Background and scheduled jobs will be included in the rollback path so that an aborted release cannot be re-applied later by an out-of-band component.
* **Removing schema-change execution from application code paths.** Schema changes will be executed only by a single, controlled pre-deploy step.
* **Improved runbooks.** Our deployment runbook is being expanded with explicit guidance for identifying and safely terminating problematic schema changes.
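
The "timeout halts the rollout" behavior can be sketched as a single pre-deploy step that gives up if it cannot obtain its lock within a fixed budget. The postmortem does not name the database involved, so this sketch uses SQLite from the Python standard library purely as a stand-in. The `sessions` table, the `region` column, the 0.2 second budget, and the `apply_schema_change`, `rollout`, and `demo` helpers are all assumptions made for illustration.

```python
import os
import sqlite3
import tempfile


class MigrationTimedOut(Exception):
    """The schema change could not acquire its lock within the time budget."""


def apply_schema_change(db_path: str, ddl: str, lock_timeout_s: float) -> None:
    # isolation_level=None: autocommit mode, so transactions are explicit below.
    conn = sqlite3.connect(db_path, timeout=lock_timeout_s, isolation_level=None)
    try:
        conn.execute("BEGIN IMMEDIATE")  # request the write lock up front
        conn.execute(ddl)
        conn.execute("COMMIT")
    except sqlite3.OperationalError as exc:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        if "locked" in str(exc):
            raise MigrationTimedOut(str(exc)) from exc
        raise
    finally:
        conn.close()


def rollout(db_path: str, ddl: str, lock_timeout_s: float, deploy_steps) -> str:
    """Run the schema change first; deploy only if it finished in time."""
    try:
        apply_schema_change(db_path, ddl, lock_timeout_s)
    except MigrationTimedOut:
        return "halted"  # no deploy step runs after a timed-out migration
    for step in deploy_steps:
        step()
    return "deployed"


def demo():
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    setup = sqlite3.connect(path)
    setup.execute("CREATE TABLE sessions (id INTEGER PRIMARY KEY)")
    setup.commit()
    setup.close()

    ddl = "ALTER TABLE sessions ADD COLUMN region TEXT"
    deployed = []

    # Simulate busy traffic: another connection holds the write lock.
    blocker = sqlite3.connect(path, isolation_level=None)
    blocker.execute("BEGIN IMMEDIATE")
    first = rollout(path, ddl, 0.2, [lambda: deployed.append("app")])
    blocker.execute("ROLLBACK")
    blocker.close()

    # With the lock released, the same rollout goes through.
    second = rollout(path, ddl, 0.2, [lambda: deployed.append("app")])
    os.remove(path)
    return first, second, deployed
```

Calling `demo()` should report the first rollout as `"halted"` with no deploy step executed, and the second as `"deployed"`. The design choice worth noting is that the migration fails fast instead of waiting on the lock, and the rollout treats that failure as a stop signal rather than retrying from application instances.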