ShipHawk incident

AWS US-East 2a outage

ShipHawk experienced a critical incident on July 28, 2022 affecting TMS, lasting 3h 16m. The incident has been resolved; the full update timeline is below.

Started: Jul 28, 2022, 05:11 PM UTC
Resolved: Jul 28, 2022, 08:28 PM UTC
Duration: 3h 16m
Detected by Pingoru: Jul 28, 2022, 05:11 PM UTC

Affected components

TMS

Update timeline

investigating Jul 28, 2022, 05:11 PM UTC

We are currently investigating this issue.
investigating Jul 28, 2022, 06:06 PM UTC

It appears that Amazon hosting (AWS) in US-East 2a is experiencing an outage. Our DevOps team is actively working to restore ShipHawk by switching to an AWS facility that is not impacted by this outage. We expect to restore services soon. To follow updates from Amazon, please see: https://health.aws.amazon.com/health/status
Customer impact: Customers are not able to use ShipHawk services.
Start Time: 9:57am Pacific Time
monitoring Jul 28, 2022, 06:27 PM UTC

ShipHawk services are now back online. We will continue to monitor as services are restored. To follow updates from Amazon, please see: https://health.aws.amazon.com/health/status
Customer impact: Customers are not able to use ShipHawk services.
Start Time: 9:57am Pacific Time
End Time: 11:25am Pacific Time
resolved Jul 28, 2022, 08:28 PM UTC

This incident is fully resolved.
Customer impact: Customers were not able to use ShipHawk services.
Start Time: 9:57am Pacific Time
End Time: 11:25am Pacific Time
postmortem Jul 28, 2022, 11:22 PM UTC

## Incident summary

ShipHawk API and Web Portal were not available between 9:57 AM and 11:15 AM Pacific Time on 7/28/2022. The incident was caused by an AWS outage in US-EAST-2.

## Detection

The incident was detected at 10:02 AM Pacific Time, when the internal alerting system reported an outage. Some of the application servers, the primary database node, and the search engine nodes were not accessible. Further investigation showed that the disk volumes attached to the primary database were completely inaccessible. We eventually traced this to a major outage in the AWS US-EAST-2a availability zone.

## Recovery

At 10:30 AM, after confirming that the issues were caused by the US-EAST-2a outage, the DevOps team began switching to the database replica, which is located in a different AWS availability zone. The switch finished at 11:09 AM, and it took an additional 10 minutes for all services to fully recover.

## Timeline

All times are Pacific Time.

- 09:57 AM - system response time started growing
- 10:02 AM - internal notification systems signaled the primary database node outage
- 10:07 AM - the engineering team started the investigation
- 10:30 AM - the root cause was identified and the team started working on a recovery plan
- 11:09 AM - the database replica was promoted to primary node
- 11:19 AM - the system fully recovered

## Corrective actions

1. Increase the number of availability zones to minimize the effect of a potential AWS outage.
2. Reduce the time it takes to switch to redundant availability zones.
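
As a quick check on the postmortem timeline, the sketch below computes how long each phase took and converts a Pacific timestamp to UTC for comparison with the status-page updates. It is only an illustration: the function names are hypothetical, the event labels are abbreviated from the timeline, and the fixed UTC-7 offset assumes Pacific Daylight Time was in effect on July 28, 2022.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Pacific Daylight Time (UTC-7) applied on the incident date.
PDT = timezone(timedelta(hours=-7), "PDT")

# (time, event) pairs taken from the postmortem timeline, all on 2022-07-28.
TIMELINE = [
    ("09:57", "response time started growing"),
    ("10:02", "alert: primary database node outage"),
    ("10:07", "investigation started"),
    ("10:30", "root cause identified"),
    ("11:09", "replica promoted to primary"),
    ("11:19", "system fully recovered"),
]


def parse(hhmm: str) -> datetime:
    """Turn an 'HH:MM' Pacific time into a timezone-aware datetime."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2022, 7, 28, hour, minute, tzinfo=PDT)


def phase_minutes(timeline) -> list[int]:
    """Minutes elapsed between each consecutive pair of timeline events."""
    stamps = [parse(t) for t, _ in timeline]
    return [int((b - a).total_seconds() // 60) for a, b in zip(stamps, stamps[1:])]


def total_minutes(timeline) -> int:
    """Minutes from the first timeline event to the last."""
    start, end = parse(timeline[0][0]), parse(timeline[-1][0])
    return int((end - start).total_seconds() // 60)


def to_utc(hhmm: str) -> str:
    """Express a Pacific 'HH:MM' time as 'HH:MM' in UTC."""
    return parse(hhmm).astimezone(timezone.utc).strftime("%H:%M")


if __name__ == "__main__":
    for (_, event), minutes in zip(TIMELINE[1:], phase_minutes(TIMELINE)):
        print(f"{minutes:>3} min until: {event}")
    print(f"{total_minutes(TIMELINE)} min from first symptom to full recovery")
    print(f"09:57 Pacific is {to_utc('09:57')} UTC")
```

On these timestamps, the alert fired 5 minutes after response times began to grow, and promoting the replica (39 minutes) was the longest single phase, which is the step the second corrective action targets.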