Zonos experienced a critical incident on April 1, 2023 affecting Shopify Duty Tax, lasting 27m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Apr 03, 2023, 04:22 PM UTC
Investigating issues with quoting on Shopify.
- resolved Apr 03, 2023, 04:23 PM UTC
This incident has been resolved.
- postmortem Apr 03, 2023, 10:32 PM UTC
**What products were affected and what was the impact?**

All Zonos GraphQL services. Impact: CRITICAL

**What timeframe did this issue occur?**

| **Date** | **Time** |
| --- | --- |
| Mar 31, 2023 | Starting at 18:00 MDT |
| Apr 1, 2023 | Ending at 12:45 MDT |

**How was the issue detected?**

On the morning of April 1, Shopify GraphQL customers began noticing issues with landed cost quotes and notified CS, who then escalated the issue to the Engineering team.

**What functionality was affected?**

All GraphQL services in the Zonos Cloud were impacted.

**What problems did this cause?**

Merchants on GraphQL were unable to receive shipment ratings and landed cost quotes.

**What was the resolution of the problem and steps that are being taken for continued follow-up?**

After being notified of the issue, we worked quickly to switch GraphQL merchants over to our REST endpoints, which were not experiencing any issues. We then identified the root cause of the issue with GraphQL: a code deployment that broke event serialization and caused synchronous events to fail. A weakness in synchronous event handling then caused the event failure to cascade to the cluster level. We immediately released a fix to prevent future occurrences.

**What mitigation solutions will we put in place to prevent this issue from occurring in the future?**

Our monitoring and notification channels for production server clusters were focused on unhealthy target groups and container failures. Due to the nature of the failure, we didn't receive notifications for either. This is a clear gap in monitoring coverage at a cluster-wide level.
To make sure this never happens again, we are configuring task-based monitoring outside of the clusters where we will:

* query each service in the cluster directly for the minimum number of tasks that should be running and the actual number of tasks that are running,
* make mock requests to each service to make sure they are returning correct responses, and
* direct these notifications to our alerting platform with "on-call" rotations to make sure there are no lapses in coverage.

We have also improved the resiliency of our event system, such that even if there were a future issue with event serialization, it would have no effect on our public GraphQL services.
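
As an illustration only, the task-count and mock-request checks described above could be sketched roughly as follows. This is not Zonos's actual monitoring code: `ServiceStatus`, `find_alerts`, and the `probe` callback are hypothetical names, and the sketch assumes the task counts and probe results are supplied by whatever tooling runs outside the cluster.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class ServiceStatus:
    """Hypothetical snapshot of one service in the cluster."""
    name: str
    minimum_tasks: int  # minimum number of tasks that should be running
    running_tasks: int  # number of tasks actually running


def find_alerts(
    statuses: Iterable[ServiceStatus],
    probe: Callable[[str], bool],
) -> List[str]:
    """Return one alert message per failed check.

    `probe` (an assumed callback) sends a mock request to the named
    service and returns True only if the response is correct.
    """
    alerts: List[str] = []
    for status in statuses:
        # Check 1: fewer tasks running than the configured minimum.
        if status.running_tasks < status.minimum_tasks:
            alerts.append(
                f"{status.name}: {status.running_tasks} of "
                f"{status.minimum_tasks} required tasks running"
            )
        # Check 2: probe even when task counts look fine, since a service
        # can have its tasks running and still return bad responses.
        if not probe(status.name):
            alerts.append(f"{status.name}: mock request returned a bad response")
    return alerts
```

In such a setup, each returned message would be forwarded to the alerting platform so the on-call engineer is paged. Running both checks for every service, rather than stopping at the first failure, keeps a task-count shortfall from hiding a bad response elsewhere.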