Zonos experienced a critical incident on August 5, 2024, affecting Landed Cost API, International Checkout, and 1 more component, and lasting 2h 14m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Aug 05, 2024, 08:45 PM UTC
We are currently investigating database issues.
- identified Aug 05, 2024, 08:47 PM UTC
The issue has been identified and a fix is being implemented.
- monitoring Aug 05, 2024, 08:53 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Aug 05, 2024, 10:46 PM UTC
This incident has been resolved.
- postmortem Aug 05, 2024, 11:10 PM UTC
**What products were affected and what was the impact?**

* All non-legacy products

Impact: MAJOR OUTAGE

**What timeframe did this issue occur?**

| | **Date** | **Time** |
| --- | --- | --- |
| From: | Aug 5, 2024 | 14:31 MST |
| To: | Aug 5, 2024 | 14:48 MST |

**How was the issue detected?**

Synthetic tests began failing which notified our DevOps team.

**What functionality was affected?**

Queries to the database were degraded and eventually became unsuccessful.

**What problems did this cause?**

All non-legacy services experienced degraded performance and a brief major outage where all database queries failed.

**What was the resolution of the problem and steps that are being taken for continued follow-up?**

The incident was caused by storage exhaustion on one of our production database clusters. We normally have autoscaling and alerting configured on our database clusters; however, in this case the cluster was created as part of a migration, and the proper alerting was not configured. The incident was resolved by allocating additional storage capacity to the affected database cluster.

**What mitigation solutions will we put in place to prevent this issue from occurring in the future?**

To prevent similar incidents in the future, we are conducting a thorough audit of our infrastructure inventory. This includes reviewing system health and monitoring configurations to ensure proper alerting and maximum visibility into critical infrastructure metrics.
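As an illustration only, the kind of inventory audit described above could be sketched as follows. This is not Zonos's actual tooling: the `Cluster` record, its field names, the `audit_clusters` function, and the 80% usage threshold are all hypothetical, and a real audit would read these values from the cloud provider's or monitoring system's API rather than from hard-coded data.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical inventory record for one database cluster."""
    name: str
    storage_used_gb: float
    storage_allocated_gb: float
    has_storage_alert: bool
    has_storage_autoscaling: bool


def audit_clusters(clusters, usage_threshold=0.8):
    """Return (cluster name, list of problems) for every cluster with a gap.

    The threshold is an assumed value, not one taken from the incident report.
    """
    findings = []
    for cluster in clusters:
        problems = []
        if not cluster.has_storage_alert:
            problems.append("no storage alert")
        if not cluster.has_storage_autoscaling:
            problems.append("no storage autoscaling")
        # Skip the ratio check when no allocation figure is recorded.
        if cluster.storage_allocated_gb > 0:
            usage = cluster.storage_used_gb / cluster.storage_allocated_gb
            if usage >= usage_threshold:
                problems.append("storage usage at or above threshold")
        if problems:
            findings.append((cluster.name, problems))
    return findings


if __name__ == "__main__":
    # Made-up inventory: one fully configured cluster and one created
    # during a migration without alerting or autoscaling.
    inventory = [
        Cluster("primary", 50.0, 100.0, True, True),
        Cluster("migrated", 95.0, 100.0, False, False),
    ]
    for name, problems in audit_clusters(inventory):
        print(f"{name}: {', '.join(problems)}")
```

Checking configuration gaps separately from current usage means a cluster like the migrated one is flagged as soon as it enters the inventory, before its storage is anywhere near full.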