Zonos incident

2024-09-30 Elevated error rates on landed cost

Major · Resolved

Zonos experienced a major incident on September 30, 2024, affecting the Landed Cost API, International Checkout, and other components (listed below), lasting 1h 36m. The incident has been resolved; the full update timeline is below.

Started
Sep 30, 2024, 10:44 PM UTC
Resolved
Oct 01, 2024, 12:21 AM UTC
Duration
1h 36m
Detected by Pingoru
Sep 30, 2024, 10:44 PM UTC

Affected components

Landed Cost API
International Checkout
Shopify Duty Tax
Landed Cost API (Legacy)
BigCommerce Duty Tax
Quoter
Magento Duty Tax
Landed Cost API (GraphQL)
Salesforce Duty Tax
Shopify Checkout

Update timeline

  1. investigating Sep 30, 2024, 11:07 PM UTC

    We are currently investigating this issue.

  2. identified Sep 30, 2024, 11:24 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Sep 30, 2024, 11:47 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. monitoring Oct 01, 2024, 12:20 AM UTC

    We are continuing to monitor for any further issues.

  5. resolved Oct 01, 2024, 12:21 AM UTC

    This incident has been resolved.

  6. postmortem Oct 02, 2024, 05:42 PM UTC

    ### **What products were affected and what was the impact?**

    * Landed Cost API

    ### Impact:

    * DEGRADED SERVICE

    ### **What timeframe did this issue occur?**

    |  | **Date** | **Time** |
    | --- | --- | --- |
    | From: | Sep 30, 2024 | 16:44 MST |
    | To: | Sep 30, 2024 | 18:12 MST |

    ### **How was the issue detected?**

    Synthetic test failures alerted our team.

    ### **What functionality was affected?**

    Increased latency on landed cost quotes, plus a period where landed cost quotes failed.

    ### **What problems did this cause?**

    Landed cost quotes were slow to return and/or failed to return.

    ### **What was the resolution of the problem and steps that are being taken for continued follow-up?**

    The problem was caused by a very large message added to our message queue that could not be consumed due to insufficient resources. The immediate resolution was to increase maximum allocated memory to allow the message queue to clear.

    ### **What mitigation solutions will we put in place to prevent this issue from occurring in the future?**

    To prevent this from happening again, we have decreased the total allowed message size by both the producers and consumers. We are also evaluating techniques to make the message queue more robust to this type of failure.
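
The mitigation described in the postmortem, a message size cap enforced by both producers and consumers, could look roughly like the sketch below. This is an illustration only, not Zonos's implementation: the postmortem does not name the queue technology, the message format, or the actual limit, so the JSON encoding, the `MAX_MESSAGE_BYTES` value, and the names `encode_message`, `decode_message`, and `MessageTooLargeError` are all hypothetical.

```python
import json


class MessageTooLargeError(ValueError):
    """Raised when a serialized message exceeds the configured size limit."""


# Hypothetical limit; the postmortem does not state the real value.
MAX_MESSAGE_BYTES = 256 * 1024


def encode_message(payload: dict, max_bytes: int = MAX_MESSAGE_BYTES) -> bytes:
    """Producer side: serialize the payload and refuse to enqueue it if too large."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    if len(body) > max_bytes:
        raise MessageTooLargeError(
            f"message is {len(body)} bytes; limit is {max_bytes} bytes"
        )
    return body


def decode_message(body: bytes, max_bytes: int = MAX_MESSAGE_BYTES) -> dict:
    """Consumer side: check the size before parsing, so an oversized message
    is rejected cheaply instead of being loaded into memory as an object."""
    if len(body) > max_bytes:
        raise MessageTooLargeError(
            f"message is {len(body)} bytes; limit is {max_bytes} bytes"
        )
    return json.loads(body.decode("utf-8"))
```

Checking on both sides is deliberate in this sketch: the producer check keeps oversized messages out of the queue, while the consumer check lets a worker reject (or route to a dead-letter destination, if one exists) anything that slipped through from a producer with a different limit.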