Zonos experienced a major incident on December 14, 2023, affecting Landed Cost API, International Checkout, and 1 more component, lasting 34m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Dec 13, 2023, 10:22 PM UTC
We are currently investigating this issue.
- identified Dec 13, 2023, 11:17 PM UTC
The issue has been identified and a fix is being implemented.
- monitoring Dec 13, 2023, 11:21 PM UTC
A fix has been implemented and we are monitoring the results.
- monitoring Dec 14, 2023, 12:59 AM UTC
We are continuing to monitor for any further issues.
- resolved Dec 14, 2023, 01:00 AM UTC
This incident has been resolved.
- postmortem Dec 14, 2023, 05:45 PM UTC
**What products were affected and what was the impact?**

* Checkout
* Landed Cost (Legacy)
* Landed Cost API

Impact:

* Checkout: MAJOR OUTAGE
* Landed Cost (Legacy): MAJOR OUTAGE
* Landed Cost API: SERVICE DEGRADATION

**What timeframe did this issue occur?**

| | **Date** | **Time** |
| --- | --- | --- |
| From: | December 13, 2023 | 3:02 PM MST |
| To: | December 13, 2023 | 5:35 PM MST |

**How was the issue detected?**

The monitoring system detected increased database load and request timeouts, and the team was notified.

**What functionality was affected?**

Shipping quotes for the checkout process and the Landed Cost (Legacy) API were directly affected. Landed Cost API was indirectly affected by the increased load on the database server.

**What problems did this cause?**

In the process of removing invalid data from the database, a database table index became corrupted, causing increased latency and load on the database server. The affected table provides flat rate shipping rates to the Landed Cost (Legacy) service, which the checkout process uses to calculate shipping rates for international orders. As a result, the Landed Cost (Legacy) and checkout processes failed when attempting to calculate shipping. Landed Cost API also failed intermittently with timeouts because of the increased load on the database.

**What was the resolution of the problem and steps that are being taken for continued follow-up?**

A patch was deployed to disable the flat rate charts, which allowed partial recovery and made it possible to correct the affected database table. The affected database table was then restored and the flat rate charts were re-enabled.

**What mitigation solutions will we put in place to prevent this issue from occurring in the future?**

This issue occurred in our legacy database system. The data removal was done to improve query performance, and the operation should have been safe. Though index corruption is a very rare and unexpected outcome, we are doing two things to mitigate this risk and prevent future failure cases:

1. Migrating legacy services to a more robust database technology.
2. Creating an additional review policy for potentially destructive database operations.
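As an illustration only, a review policy for destructive database operations could be supported by tooling along the following lines. This is a minimal sketch, not the process or code used by Zonos: it assumes SQLite (via Python's standard `sqlite3` module) as a stand-in for the unnamed legacy database, and `guarded_delete`, its parameters, and the `flat_rates` table in the usage note are hypothetical names chosen for the example.

```python
import sqlite3


def guarded_delete(conn, table, where, params=(), max_rows=1000, dry_run=True):
    """Delete rows matching `where`, with safeguards (hypothetical helper).

    - Counts matching rows first and refuses if the count exceeds `max_rows`.
    - With dry_run=True (the default) nothing is deleted; the count is
      returned so a reviewer can sanity-check it before the real run.
    - On a real run, the delete happens in a transaction that is rolled
      back if SQLite's integrity check reports a problem.
    """
    # `table` and `where` are interpolated into SQL, so they must come from
    # reviewed, trusted code and never from user input.
    count = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {where}", params
    ).fetchone()[0]

    if count > max_rows:
        raise ValueError(f"refusing to delete {count} rows (limit {max_rows})")
    if dry_run:
        return count

    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(f"DELETE FROM {table} WHERE {where}", params)
        # SQLite-specific check of table and index consistency.
        status = conn.execute("PRAGMA integrity_check").fetchone()[0]
        if status != "ok":
            raise RuntimeError(f"integrity check failed: {status}")
    return count
```

A reviewer would first call `guarded_delete(conn, "flat_rates", "country IS NULL")` to see how many rows match, then repeat the call with `dry_run=False` once the count looks right. Other database engines expose different consistency checks, so the `PRAGMA integrity_check` step would need an engine-specific replacement.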