ShipHawk incident

Service disruption

ShipHawk experienced a critical incident on June 10, 2022 affecting Shipping APIs and TMS, lasting 22m. The incident has been resolved; the full update timeline is below.

Started: Jun 10, 2022, 04:27 AM UTC
Resolved: Jun 10, 2022, 04:49 AM UTC
Duration: 22m
Detected by Pingoru: Jun 10, 2022, 04:27 AM UTC

Affected components

Shipping APIs, TMS

Update timeline

investigating Jun 10, 2022, 04:27 AM UTC

We are currently experiencing a service disruption. Our DevOps team is working to identify the root cause and implement a solution. Further details will be provided shortly. Customer Impact: Some customers are reporting that they are unable to ship. We will send an additional update on or before 10:00pm Pacific Time.
monitoring Jun 10, 2022, 04:43 AM UTC

Our DevOps team has implemented a fix. Users should now be able to book shipments as expected. We are monitoring to ensure no further customer impact. Customer Impact: Some customers were unable to ship. Start Time: 8:04 PM Pacific Time. End Time: 9:29 PM Pacific Time.
resolved Jun 10, 2022, 04:49 AM UTC

This issue is now resolved. Users can book shipments as expected. A postmortem will be shared within the next 2-3 business days to summarize this incident, how it was resolved, and how we intend to mitigate such an event in the future. Customer Impact: Some customers were unable to ship. Start Time: 8:04 PM Pacific Time. End Time: 9:29 PM Pacific Time.
postmortem Jun 10, 2022, 06:34 PM UTC

## **Incident summary**

We determined the actual start of the incident to be 6:24 PM Pacific Time. The issue was reported by an affected customer at 8:02 PM Pacific Time and was resolved at 9:29 PM Pacific Time. During this incident, some customers were unable to ship.

## **Leadup**

As part of a routine database maintenance process, we planned a standard procedure for reclaiming unused disk space. The process started as planned but took longer than the estimate we had obtained by running it in our test environment. This eventually caused issues with the document generation processes. That, in turn, affected the ability to book new shipments, which relies heavily on new document generation.

## **Fault**

The process of reclaiming unused disk space for the document generation table took longer than expected, which eventually caused the table to be locked. Attempts to save new documents to the database failed as a result. Because document generation is part of the shipment booking process, attempts to book new shipments failed as well.

## **Impact**

Some ShipHawk users were not able to book new shipments from 6:24 PM to 9:29 PM Pacific Time. Some API requests related to document generation timed out.

## **Detection**

The incident was first detected when a customer reported it at 8:02 PM Pacific Time.

## **Response & Recovery**

We responded to the incident with all possible urgency and ultimately made the necessary changes to unlock the tables and recover the service. The DevOps team analyzed the issue and, after considering multiple options, decided to terminate the database optimization process and manually release the table lock.

## **Timeline**

All times are in Pacific Time.

**Thursday, 9 June 2022**

* 5:30 PM - the standard database maintenance process started
* 6:24 PM - the tool designed for reclaiming unused disk space acquired a lock on the table
* 8:02 PM - a customer reported issues with BOL generation and shipment booking
* 8:06 PM - the support team began investigating the reported issue
* 8:15 PM - the ticket was passed to the engineering team, and the DevOps engineering team started investigating
* 8:30 PM - the root cause was identified
* 9:10 PM - the DevOps team identified a way to recover the service without data loss
* 9:29 PM - the service was restored

## **Root cause identification: The Five Whys**

1. Document generation and shipment booking timed out.
2. Because the system was not able to save newly generated documents into the database.
3. Because the documents table was locked.
4. Because the process of reclaiming unused disk space took longer than expected.
5. Because one of the database tables was too big.

## **Root cause**

The existing procedure for reclaiming unused disk space does not work well enough for large database tables (>2 TB).

## **Lessons learned**

* The procedure for reclaiming unused disk space should be optimized for large tables.
* We need to improve monitoring for anomalies in shipping API usage, especially during routine database maintenance.

## **Corrective actions**

1. Optimize the procedure for reclaiming unused disk space for large database tables.
2. Begin monitoring anomalies in shipping API usage.
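
The second corrective action, monitoring anomalies in shipping API usage, could take many forms. The sketch below is one minimal illustration and not ShipHawk's actual monitoring: it assumes per-minute counts of booking requests and failures are available, and it raises an alert when the failure rate over a sliding window crosses a fixed threshold. The names `MinuteStats`, `FailureRateMonitor`, and `first_alert_minute`, as well as the window size and thresholds, are hypothetical choices made for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class MinuteStats:
    """Booking-request counts for one minute (hypothetical metric shape)."""
    requests: int
    failures: int


class FailureRateMonitor:
    """Flags a sliding window whose failure rate exceeds a fixed threshold."""

    def __init__(self, window_minutes: int = 5,
                 max_failure_rate: float = 0.25,
                 min_requests: int = 20) -> None:
        # deque with maxlen drops the oldest minute automatically.
        self.window: deque = deque(maxlen=window_minutes)
        self.max_failure_rate = max_failure_rate
        self.min_requests = min_requests

    def observe(self, stats: MinuteStats) -> bool:
        """Record one minute of data; return True if an alert should fire."""
        self.window.append(stats)
        total = sum(s.requests for s in self.window)
        failed = sum(s.failures for s in self.window)
        if total < self.min_requests:
            return False  # too little traffic in the window to judge
        return failed / total > self.max_failure_rate


def first_alert_minute(samples: Iterable[MinuteStats],
                       monitor: FailureRateMonitor) -> Optional[int]:
    """Return the index of the first minute that triggers an alert, if any."""
    for minute, stats in enumerate(samples):
        if monitor.observe(stats):
            return minute
    return None


if __name__ == "__main__":
    # Synthetic data: ten healthy minutes, then every booking request fails.
    healthy = [MinuteStats(requests=10, failures=0)] * 10
    failing = [MinuteStats(requests=10, failures=10)] * 5
    print(first_alert_minute(healthy + failing, FailureRateMonitor()))
```

With the synthetic data in the usage example, failures begin at minute index 10 and the alert fires at index 11, once two failing minutes are inside the five-minute window. In this incident, roughly 98 minutes passed between the table lock at 6:24 PM and the first customer report at 8:02 PM, which is the gap an alert of this kind is meant to shorten. A real deployment would likely compare against a learned baseline rather than a fixed threshold.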