ShipHawk incident

Trouble logging in

Critical · Resolved

ShipHawk experienced a critical incident on October 15, 2021 affecting Shipping APIs and TMS, lasting 2h 27m. The incident has been resolved; the full update timeline is below.

Started
Oct 15, 2021, 06:45 PM UTC
Resolved
Oct 15, 2021, 09:12 PM UTC
Duration
2h 27m
Detected by Pingoru
Oct 15, 2021, 06:45 PM UTC

Affected components

Shipping APIs, TMS

Update timeline

  1. investigating Oct 15, 2021, 06:20 PM UTC

    Some users may be experiencing trouble when logging in to ShipHawk. Our Engineering team is currently investigating issues related to login. We will send an additional update at 11:45am Pacific Time.

  2. investigating Oct 15, 2021, 06:24 PM UTC

    We are continuing to investigate this issue.

  3. monitoring Oct 15, 2021, 06:45 PM UTC

    A fix has been implemented and we are monitoring the results. Customers can now log in. Monitoring will continue throughout the day. A final update closing this incident will be provided within the next few hours.

  4. resolved Oct 15, 2021, 09:12 PM UTC

    This incident is resolved. We’re sorry this prevented your team from fulfilling shipments during the outage. Understanding the urgency, we made every effort to resolve it as quickly as possible. The incident started at 10:42am and was resolved before 11:45am Pacific Time. A post-mortem will be provided and accessible on this status page within the next 3-5 business days. Please contact [email protected] if you have additional questions or concerns.

  5. postmortem Oct 15, 2021, 09:13 PM UTC

    **Incident summary**

    During an internal process that archives data, we noticed disk usage beginning to increase and decided to upgrade the volume proactively. Due to internal AWS optimization processes, the upgrade created slowness in the system, which later led to the incident. We promoted a replica database to restore the service, and service was restored at 11:45am PST.

    **Leadup**

    * 9:30am PST - we started an internal process that archives data
    * 10:30am PST - internal monitoring systems alerted on rapidly increasing disk usage
    * 10:35am PST - the volume attached to the database servers was upgraded

    This change resulted in degraded database performance.

    **Fault**

    Due to internal AWS optimization processes, the volume upgrade created slowness in the system (sketched after this postmortem), which led to the incident starting at 10:42am PST.

    **Impact**

    Customers hosted on shared instances were not able to use the system from 10:42am PST to 11:45am PST. Affected services:

    * Web Portal
    * Workstations
    * ShipHawk API

    **Detection**

    The incident was detected by the automated monitoring system (a disk-growth check of this kind is sketched below) and was also reported by multiple customers.

    **Response**

    After receiving the alerts from the monitoring system, the engineering team connected with ShipHawk Customer Success and described the level of impact. The incident notification was posted to [https://status.shiphawk.com/](https://status.shiphawk.com/).

    **Recovery**

    Three steps were performed to recover the service (see the sketch after this postmortem):

    * the primary database node was disabled
    * the database replica was promoted to primary
    * the old primary node hostname was pointed to the new primary node by updating DNS records

    **Timeline**

    All times are in PST.

    **10/15/2021:**

    * 10:00am - an internal process that archives data started
    * 10:30am - internal monitoring systems alerted on rapidly increasing disk usage
    * 10:35am - the volume attached to the primary database node was upgraded
    * 10:42am - database performance degraded
    * 10:43am - the monitoring system alerted on multiple errors and API unresponsiveness
    * 10:50am - the engineering team began investigating the incident
    * 11:20am - the root cause was understood and the team created an action plan
    * 11:30am - the primary node was disabled and the replica was promoted to primary
    * 11:40am - the old primary node hostname was pointed to the new primary node by updating DNS records
    * **11:45am - the service was fully restored**
    * 1:30pm - a new database replica was created and the sync process started

    **10/16/2021:**

    * 2:30pm - the new database replica sync process finished

    **Root cause identification: The Five Whys**

    1. The application had an outage because database performance degraded.
    2. Database performance degraded because the volume attached to the primary database node was upgraded.
    3. The volume was upgraded because disk usage was increasing rapidly.
    4. Disk usage increased rapidly because we ran a data archiving process that used more disk than expected.
    5. The inefficiency was missed because the data archiving process was tested in an environment with a different primary/replica database configuration, so the problem was not identified during tests.

    **Root cause**

    The difference in configuration between the test and production systems meant the inefficiency in the data archiving process was missed.

    **Lessons learned**

    * The test environment requires configuration changes to more closely resemble production
    * The data archiving process should start more slowly (a throttled ramp-up is sketched below)
    * The internal process to promote replica databases to primary needs to be faster
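The leadup and detection steps both hinge on an alert for rapidly increasing disk usage. A minimal sketch of such a check follows; the path, thresholds, and polling interval are hypothetical, and the report does not describe ShipHawk's actual monitoring stack.

```python
import shutil
import time

DATA_PATH = "/var/lib/postgresql"   # hypothetical database data directory
USED_PCT_ALERT = 80.0               # hypothetical absolute usage threshold
GROWTH_GB_PER_MIN_ALERT = 1.0       # hypothetical growth-rate threshold


def used_gb(path: str) -> float:
    return shutil.disk_usage(path).used / 1024**3


prev = used_gb(DATA_PATH)
while True:
    time.sleep(60)
    cur = used_gb(DATA_PATH)
    total = shutil.disk_usage(DATA_PATH).total / 1024**3
    growth = cur - prev  # GB written in the last minute
    if 100 * cur / total > USED_PCT_ALERT or growth > GROWTH_GB_PER_MIN_ALERT:
        print(f"ALERT: {100 * cur / total:.0f}% used, +{growth:.2f} GB/min")
    prev = cur
```

Alerting on the growth rate as well as the absolute level is what catches a runaway archiving job before the disk is actually full.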
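The fault traces to the "internal AWS optimization processes" that follow an online volume upgrade. Assuming the volume was an EBS volume resized through the AWS API (the volume ID, region, and target size below are hypothetical), the flow looks like this: after `modify_volume`, EBS reports an `optimizing` modification state, and the volume can perform below its new provisioned level until the state reaches `completed`.

```python
import time

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # hypothetical region
VOLUME_ID = "vol-0123456789abcdef0"                 # hypothetical volume

# Request the larger size; the change is applied online.
ec2.modify_volume(VolumeId=VOLUME_ID, Size=2000)    # hypothetical GiB target

# The modification passes through "modifying" -> "optimizing" -> "completed".
# Performance can be degraded until "completed", which may take hours.
while True:
    resp = ec2.describe_volumes_modifications(VolumeIds=[VOLUME_ID])
    state = resp["VolumesModifications"][0]["ModificationState"]
    print("modification state:", state)
    if state == "completed":
        break
    time.sleep(60)
```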
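The three recovery steps map to a replica promotion plus a DNS cutover. A sketch assuming a self-managed PostgreSQL replica and records hosted in AWS Route 53; the hostnames, hosted zone ID, and data directory are hypothetical, and the report does not name the database engine or DNS provider.

```python
import subprocess

import boto3

# Step 1 (disabling the old primary to avoid split-brain) is assumed to
# have been done out of band before promotion.

# Step 2: promote the streaming replica to primary.
subprocess.run(
    ["pg_ctl", "promote", "-D", "/var/lib/postgresql/13/main"],  # hypothetical data dir
    check=True,
)

# Step 3: point the old primary hostname at the promoted node.
route53 = boto3.client("route53")
route53.change_resource_record_sets(
    HostedZoneId="Z0123456789ABCDEFGHIJ",  # hypothetical hosted zone
    ChangeBatch={
        "Comment": "Repoint old primary at promoted replica",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "db-primary.internal.example.com.",  # old primary name
                "Type": "CNAME",
                "TTL": 60,  # low TTL so clients pick up the cutover quickly
                "ResourceRecords": [
                    {"Value": "db-replica-1.internal.example.com."}
                ],
            },
        }],
    },
)
```

Cutting over at the DNS layer, as the report describes, lets application servers keep their configured hostname while the record behind it changes.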
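The lesson that "the data archiving process should start more slowly" suggests ramping batch sizes up gradually with pauses between batches, rather than archiving at full speed from the start. A hypothetical shape for that follows; the backlog, batch sizes, pause, and `archive_batch` helper are illustrative stand-ins, not ShipHawk's code.

```python
import time

# Hypothetical stand-in for the real work: archive up to `batch_size`
# rows and report how many were actually archived.
BACKLOG = 50_000  # pretend rows awaiting archival


def archive_batch(batch_size: int) -> int:
    global BACKLOG
    archived = min(batch_size, BACKLOG)
    BACKLOG -= archived
    return archived


def run_archiver(max_batch: int = 10_000, pause_s: float = 0.1) -> None:
    batch = 100  # start small so I/O impact shows up before the rate grows
    while True:
        archived = archive_batch(batch)
        print(f"archived {archived} rows (batch size {batch})")
        if archived == 0:
            break                          # backlog drained
        time.sleep(pause_s)                # give the database room to breathe
        batch = min(batch * 2, max_batch)  # ramp up gradually


run_archiver()
```

A slow ramp-up gives monitoring (like the disk-growth check above) time to fire while the job is still small enough to stop safely.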