Zonos incident

2024-08-05 Investigating database issues

Critical · Resolved

Zonos experienced a critical incident on August 5, 2024, affecting Landed Cost API, International Checkout, and eight other components, lasting 2h 14m. The incident has been resolved; the full update timeline is below.

Started
Aug 05, 2024, 08:31 PM UTC
Resolved
Aug 05, 2024, 10:46 PM UTC
Duration
2h 14m
Detected by Pingoru
Aug 05, 2024, 08:31 PM UTC

Affected components

Landed Cost API, International Checkout, Shopify Duty Tax, Landed Cost API (Legacy), BigCommerce Duty Tax, Quoter, Magento Duty Tax, Landed Cost API (GraphQL), Classify, Salesforce Duty Tax

Update timeline

  1. investigating Aug 05, 2024, 08:45 PM UTC

    We are currently investigating database issues.

  2. identified Aug 05, 2024, 08:47 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Aug 05, 2024, 08:53 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Aug 05, 2024, 10:46 PM UTC

    This incident has been resolved.

  5. postmortem Aug 05, 2024, 11:10 PM UTC

    **What products were affected and what was the impact?**

    * All non-legacy products

    Impact: MAJOR OUTAGE

    **What timeframe did this issue occur?**

    | | **Date** | **Time** |
    | --- | --- | --- |
    | From: | Aug 5, 2024 | 14:31 MST |
    | To: | Aug 5, 2024 | 14:48 MST |

    **How was the issue detected?**

    Synthetic tests began failing which notified our DevOps team.

    **What functionality was affected?**

    Queries to the database were degraded and eventually became unsuccessful.

    **What problems did this cause?**

    All non-legacy services experienced degraded performance and a brief major outage where all database queries failed.

    **What was the resolution of the problem and steps that are being taken for continued follow-up?**

    The incident was caused by storage exhaustion on one of our production database clusters. We normally have autoscaling and alerting configured on our database clusters; however, in this case the cluster was created as part of a migration, and the proper alerting was not configured. The incident was resolved by allocating additional storage capacity to the affected database cluster.

    **What mitigation solutions will we put in place to prevent this issue from occurring in the future?**

    To prevent similar incidents in the future, we are conducting a thorough audit of our infrastructure inventory. This includes reviewing system health and monitoring configurations to ensure proper alerting and maximum visibility into critical infrastructure metrics.
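The kind of inventory audit described in the postmortem could look roughly like the sketch below. It is illustrative only and does not reflect Zonos's actual tooling: the `Cluster` record, its field names, the sample cluster names, and the 80% usage threshold are all assumptions made for this example. The idea is to walk every database cluster in an inventory and flag any that lack storage alerting or autoscaling, or that are already close to storage exhaustion.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical inventory record for one database cluster."""
    name: str
    storage_used_gb: float
    storage_allocated_gb: float
    autoscaling_enabled: bool
    storage_alert_configured: bool


def audit_cluster(cluster: Cluster, usage_threshold: float = 0.8) -> list[str]:
    """Return a list of findings for one cluster (empty if healthy)."""
    findings = []
    if not cluster.storage_alert_configured:
        findings.append("no storage alert configured")
    if not cluster.autoscaling_enabled:
        findings.append("storage autoscaling disabled")
    if cluster.storage_allocated_gb <= 0:
        findings.append("allocated storage is not a positive value")
    else:
        usage = cluster.storage_used_gb / cluster.storage_allocated_gb
        if usage >= usage_threshold:
            findings.append(f"storage usage at {usage:.0%}")
    return findings


def audit_inventory(clusters: list[Cluster],
                    usage_threshold: float = 0.8) -> dict[str, list[str]]:
    """Map each cluster name to its findings, omitting clean clusters."""
    report = {}
    for cluster in clusters:
        findings = audit_cluster(cluster, usage_threshold)
        if findings:
            report[cluster.name] = findings
    return report


# Assumed sample inventory: one well-configured cluster and one created
# during a migration without alerting or autoscaling.
inventory = [
    Cluster("primary", 200.0, 500.0, True, True),
    Cluster("migrated", 95.0, 100.0, False, False),
]

for name, findings in audit_inventory(inventory).items():
    print(f"{name}: {'; '.join(findings)}")
```

In this sample, only the `migrated` cluster is reported, because it has neither alerting nor autoscaling and is above the assumed usage threshold. A real audit would pull the inventory and metrics from the infrastructure provider rather than from hard-coded records.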