Sardine incident
Dashboard instability while loading certain entities might occur
Sardine experienced a minor incident on May 12, 2026 affecting Dashboard, lasting 12h 59m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- identified May 12, 2026, 11:16 PM UTC
The issue has been identified and a fix is being implemented.
- resolved May 13, 2026, 12:16 PM UTC
This incident has been resolved.
- postmortem May 15, 2026, 12:51 PM UTC
**Impact:** During the incident window:

* **Customer Intelligence Search** latency was degraded for queries spanning **>30 days** of data.
* **Session Details** and **Customer Details** page loads were slow.
* **Connections Graph** and **Timeline** features were also impacted.

## Executive Summary

As part of infrastructure optimization, our development team performed multiple operations on our search databases to optimize index structure and data storage. This left our warm data cluster inefficiently provisioned, which degraded performance. The team ultimately resolved the incident by updating the data cluster configuration. Due to the volume of data, a simple rollback was not possible, which is why the incident lasted so long.

## Incident Details

### What Happened

Our development team performed multiple operations on our search databases to optimize index structure and data storage. Due to a bug in the migration script, we migrated more data than initially anticipated. The destination cluster didn’t have sufficient storage and computing resources assigned.

Latency started rising slowly as more data was migrated. This was initially dismissed as expected, because we were moving older data to separate clusters that are slower but should remain within acceptable bounds.

Two days later, on May 12, as the warm indices filled up and the migration completed, users began reporting that dashboard search was very slow. We then attempted to upsize the cluster, but the upsize could not complete due to high traffic and the large amount of data. The incident was resolved when our team manually reverted some of the operations.

## Timeline

| Time (PT; May 12 unless noted) | Event |
| --- | --- |
| **May 10, 11:38 PM** | Automated data migration operation was initiated; the team was monitoring and didn’t report any issues |
| **May 11, 12:00 AM** | Latency starts climbing. Alerts were triggered but assumed to be expected. |
| **May 12, 6:02 AM** | Support reports dashboard slowness; on-call begins investigation |
| **9:04 AM** | Incident formally created |
| **10:56 AM** | First code fix deployed for Customer Details + Session Details |
| **11:18 AM** | Deploy complete; pages still slow |
| **12:25 PM** | Removed search dependency on Customer Profile + Session Details. Page loads improved; Network Graph + Customer search still slow. |
| **1:09 PM** | Root cause identified: indices incorrectly in warm tier; direct hot-tier migration initiated (~10h estimated) |
| **3:05 PM** | Warm tier upsized aggressively; migration still not converging |
| **7:00–7:08 PM** | Search cluster repeatedly auto-cancels in-flight shard recovery; direct migration abandoned |
| **7:19 PM** | Switched to another approach: spinning up a new cluster |
| **7:41 PM** | April indices restored from snapshot; last-30d queries drop to ~15ms |
| **9:03 PM** | February + March index restores complete |
| **10:14 PM** | Replicas added to hot copies; search queue drops to 0. Incident resolved. |

## Action Items

Immediate:

* Manually roll back the problematic resource allocation
* Ensure all node pools have enough resources

Medium Term Process Improvements:

* Runbook and migration process for search database upgrade operations
* Better review process for infra changes
* Runbook for monitoring upgrades and immediate rollback
* Observability to tell whether a latency increase is expected
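
The two contributing factors described above (more data migrated than planned, and a destination cluster without enough storage) could in principle be caught by a pre-flight check before a migration starts. The following is a minimal sketch of such a check, not the tooling Sardine uses. The names `ClusterCapacity`, `can_accept_migration`, and `check_migration_plan`, the 75% fill limit, and the 10% tolerance are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClusterCapacity:
    """Hypothetical snapshot of a destination cluster's storage, in bytes."""
    total_bytes: int
    used_bytes: int


def can_accept_migration(capacity: ClusterCapacity,
                         migration_bytes: int,
                         max_fill_ratio: float = 0.75) -> bool:
    """Return True if the cluster stays at or below max_fill_ratio after the migration.

    The 0.75 default is an assumed headroom target, not a documented limit.
    """
    if capacity.total_bytes <= 0:
        return False
    projected = capacity.used_bytes + migration_bytes
    return projected / capacity.total_bytes <= max_fill_ratio


def check_migration_plan(capacity: ClusterCapacity,
                         planned_bytes: int,
                         selected_bytes: int,
                         tolerance: float = 0.10) -> list[str]:
    """Return the reasons to block a migration; an empty list means proceed.

    planned_bytes is what the team expected to move; selected_bytes is what
    the migration script actually selected (assumed to be measurable up front).
    """
    problems = []
    if selected_bytes > planned_bytes * (1 + tolerance):
        problems.append("selected data exceeds the planned volume")
    if not can_accept_migration(capacity, selected_bytes):
        problems.append("destination cluster lacks storage headroom")
    return problems


if __name__ == "__main__":
    warm = ClusterCapacity(total_bytes=1000, used_bytes=500)
    for reason in check_migration_plan(warm, planned_bytes=100, selected_bytes=300):
        print("BLOCK:", reason)
```

A check like this only covers storage. Compute headroom and shard counts would need similar guards, and the thresholds would have to be tuned against real cluster behaviour.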