Zonos incident

2024-11-08 - Landed Cost Quote Failures

Major · Resolved

Zonos experienced a major incident on November 8, 2024 affecting Landed Cost API and International Checkout, lasting 1h 25m. The incident has been resolved; the full update timeline is below.

Started
Nov 08, 2024, 04:00 PM UTC
Resolved
Nov 08, 2024, 05:26 PM UTC
Duration
1h 25m
Detected by Pingoru
Nov 08, 2024, 04:00 PM UTC

Affected components

Landed Cost API, International Checkout

Update timeline

  1. investigating Nov 08, 2024, 07:49 PM UTC

    We are currently investigating this issue.

  2. identified Nov 08, 2024, 07:49 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Nov 08, 2024, 07:50 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Nov 08, 2024, 07:52 PM UTC

    This incident has been resolved.

  5. postmortem Nov 08, 2024, 07:53 PM UTC

    ### What products were affected and what was the impact?

    Landed Cost API, Checkout

    Impact: CRITICAL

    ### What timeframe did this issue occur?

    | **Date** | **Time** |
    | --- | --- |
    | November 8, 2024 | 8:50am - 10:26am MST |

    ### How was the issue detected?

    A spike in error logs triggered an alert to our Engineering team, who responded immediately to the issue.

    ### What functionality was affected?

    Landed Cost quotes that use our automated item classification service failed.

    ### What problems did this cause?

    If an HS Code was not provided in the API request to Landed Cost, and the automatic classification service was enabled, then the landed cost quote would fail. When the landed cost quote fails, shoppers may not be able to place their order.

    ### What was the resolution of the problem and steps that are being taken for continued follow-up?

    The root cause of the issue was a deployment issue with the item service used for automatic classification. While the issue was detected immediately, resolution required rebuilding and redeploying services, which took longer than expected. After services were rebuilt and redeployed, the system health was validated and normal operations resumed.

    ### What mitigation solutions will we put in place to prevent this issue from occurring in the future?

    We discovered that this issue was due, in part, to a deficiency in our deployment procedures. We are working to update the procedure to prevent any future issues. We are also creating a synthetic test in our lower environments that will catch similar issues before they can be deployed into production.
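
The postmortem mentions a synthetic test that exercises the failing path: a quote request with no HS Code while automatic classification is enabled. As a rough illustration only, such a probe might look like the sketch below. It is not Zonos's actual test or API: `QuoteResult`, `PROBE_ITEM`, `run_synthetic_check`, and the stub quote functions are all hypothetical names, and the real request shape is not described in the source.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class QuoteResult:
    """Hypothetical result of a landed cost quote call."""
    ok: bool                     # True if the quote succeeded
    error: Optional[str] = None  # error message when ok is False


# Hypothetical probe payload: no HS code is supplied, so the quote
# must go through the (assumed) automatic classification path.
PROBE_ITEM = {"description": "cotton t-shirt", "hs_code": None}


def run_synthetic_check(request_quote: Callable[[dict], QuoteResult]) -> bool:
    """Return True if a quote without an HS code succeeds."""
    try:
        result = request_quote(PROBE_ITEM)
    except Exception:
        # Any raised error counts as a failed probe.
        return False
    return result.ok


def healthy_stub(item: dict) -> QuoteResult:
    """Stand-in for an environment where classification works."""
    return QuoteResult(ok=True)


def broken_stub(item: dict) -> QuoteResult:
    """Stand-in for an environment where classification is down."""
    if item.get("hs_code") is None:
        return QuoteResult(ok=False, error="classification unavailable")
    return QuoteResult(ok=True)


if __name__ == "__main__":
    print("healthy:", run_synthetic_check(healthy_stub))
    print("broken:", run_synthetic_check(broken_stub))
```

In a real pipeline, the callable passed to `run_synthetic_check` would wrap an actual request against a lower environment, and a `False` result would block promotion to production.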