Sardine experienced a minor incident on October 29, 2025 affecting Customer APIs, lasting 20m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Oct 29, 2025, 07:28 PM UTC
The team is investigating, and we'll update as soon as possible. Expect errors from our API.
- resolved Oct 29, 2025, 07:49 PM UTC
We had elevated 5xx errors on the customers API from 19:04 UTC to 19:43 UTC. The incident is now resolved, and we'll follow up with a postmortem as soon as possible.
- postmortem Oct 31, 2025, 06:47 PM UTC
# Rule engine outage on deploy

## Overview

On Oct 29, Sardine experienced an outage in our rules-engine service that impacted our `/v1/customers` endpoint, causing some requests to fail. The error rate spiked sharply when the failure began and then decreased quickly.

## What happened

Sardine deploys its backend services every Wednesday. During one of these regular deployments, we noticed an error while the rules-engine service was being deployed in canary mode (25% of traffic is routed to instances running the new version). Our team immediately started a rollback and investigated the root cause, which we identified a few minutes later. We resumed the deployment the next day, and it completed successfully.

The root cause was a database migration that caused older versions of the application to lose their reference to the schema they used. This produced errors in the rules-engine service and triggered 5xx responses in the `customers` API.

## Impact

The `/v1/customers` API endpoint returned HTTP 500 responses for around 30 minutes, with 90% of the errors concentrated in the first minutes.

## Timeline (UTC)

* 18:44 Deployment ticket is approved
* 18:55 Release engineering starts the deployment
* 19:05 Release engineer notices that the rules-engine service has an elevated number of errors after deployment
* 19:07 Release engineer starts the rollback
* 19:10 Incident is initiated internally and externally
* 19:44 All services have finished rolling back to the previous version
* 19:54 Incident is considered resolved

## Action items

* Fix the root cause: migrations run during a canary deployment caused old versions to hit an error in the prepared statement cache
* Improve internal deployment tooling to make deployments and rollbacks faster
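
To illustrate the failure mode named in the first action item, here is a minimal sketch of one common mitigation: when a cached prepared statement turns out to be stale after a schema change, drop it, prepare it again, and retry once. This is not Sardine's actual fix, which the postmortem does not describe. `FakeDatabase`, `StatementCache`, and `StalePreparedStatementError` are hypothetical names, and the database is simulated in memory so the example runs on its own.

```python
class StalePreparedStatementError(Exception):
    """Hypothetical error: a cached statement no longer matches the schema."""


class FakeDatabase:
    """In-memory stand-in for a database that tracks a schema version."""

    def __init__(self):
        self.schema_version = 1

    def prepare(self, sql):
        # A prepared statement remembers the schema version it was planned against.
        return {"sql": sql, "schema_version": self.schema_version}

    def execute(self, statement):
        # Statements planned against an older schema are rejected.
        if statement["schema_version"] != self.schema_version:
            raise StalePreparedStatementError(statement["sql"])
        return f"ok (schema v{self.schema_version})"

    def migrate(self):
        # Simulates a migration applied while old instances are still serving.
        self.schema_version += 1


class StatementCache:
    """Caches prepared statements and re-prepares them once if they go stale."""

    def __init__(self, db):
        self.db = db
        self._cache = {}

    def execute(self, sql):
        statement = self._cache.get(sql)
        if statement is None:
            statement = self._cache[sql] = self.db.prepare(sql)
        try:
            return self.db.execute(statement)
        except StalePreparedStatementError:
            # Replace the stale entry and retry exactly once; a second
            # failure propagates to the caller.
            statement = self._cache[sql] = self.db.prepare(sql)
            return self.db.execute(statement)


db = FakeDatabase()
cache = StatementCache(db)
print(cache.execute("SELECT id FROM customers"))
db.migrate()  # schema changes underneath the running instance
print(cache.execute("SELECT id FROM customers"))
```

A retry like this only hides the symptom on the old instances. Keeping each migration backward compatible with the version still serving traffic during the canary phase would address the cause more directly.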