ShipHawk incident

Pages may be slow to load

Minor · Resolved

ShipHawk experienced a minor incident on November 29, 2021 affecting Shipping APIs and TMS, lasting 9h 23m. The incident has been resolved; the full update timeline is below.

Started
Nov 29, 2021, 02:50 PM UTC
Resolved
Nov 30, 2021, 12:13 AM UTC
Duration
9h 23m
Detected by Pingoru
Nov 29, 2021, 02:50 PM UTC

Affected components

Shipping APIs, TMS

Update timeline

  1. identified Nov 29, 2021, 02:50 PM UTC

    The site is currently experiencing higher than normal load, which may cause pages to be slow or unresponsive. We're investigating the cause and will provide an update as soon as possible. Our engineering team is working on a solution. The next update will be within 30 minutes.

  2. identified Nov 29, 2021, 03:51 PM UTC

    There are no new updates at this time. Engineering is continuing to resolve this issue. We will update you as soon as we have more information.

  3. monitoring Nov 29, 2021, 04:39 PM UTC

    Our engineering team was able to improve the responsiveness of ShipHawk's WebPortal and API, and error messages have subsided. We will continue to monitor throughout the day to confirm that the issue is fully resolved.

  4. investigating Nov 29, 2021, 05:16 PM UTC

    Some clients have reported that they are still seeing slow response times. Our engineering team is investigating further for a complete resolution. We will update you as soon as we have more information.

  5. identified Nov 29, 2021, 06:23 PM UTC

    ShipHawk Engineering is deploying changes to address system performance. We expect those changes to have a positive impact on site and API responsiveness over the next 15-25 minutes, and we will continue to monitor system performance.

  6. identified Nov 29, 2021, 07:09 PM UTC

    The deployed changes are now in effect across the system. Overall site and API performance continues to improve. ShipHawk Engineering will continue to tune and monitor performance.

  7. identified Nov 29, 2021, 08:31 PM UTC

    We continue to experience significantly larger volumes than anticipated, despite significant over-provisioning of system resources in preparation for Black Friday/Cyber Monday. As a result, some customers are experiencing slower than normal performance. ShipHawk engineering will continue to make incremental improvements throughout the day and will inform you as changes are made.

  8. monitoring Nov 29, 2021, 11:12 PM UTC

    Our engineering team is deploying additional changes to address page slowness. We are seeing significant improvement with site and API responsiveness with these changes, and we will continue to closely monitor system performance.

  9. resolved Nov 30, 2021, 12:13 AM UTC

    This incident has been resolved. In an effort to help during this heightened holiday processing, we will provide extended support hours from 3:00 AM to 9:00 PM Pacific Time via normal support channels through Friday 12/3/21 for all customers.

  10. postmortem Nov 30, 2021, 01:08 PM UTC

    ## **Incident summary**

    Between 6:30 am and 3:30 pm PST, several customers experienced slowness of the application.

    ## **Leadup**

    In preparation for the peak season, we provisioned additional servers for the anticipated volume. Our customers collectively generated larger order, shipment, and rate request volumes than we expected. Additionally, FedEx, UPS, and other carrier APIs responded more slowly than usual to requests made by our system. The combination of these issues slowed down ShipHawk API response times for some customers.

    ## **Fault**

    With load higher than expected, API response times slowed. The automated load balancer marked some of the slower servers as unhealthy, which shifted more load onto the healthy servers and slowed the system down even further. The engineering team decided to add more servers to help handle the extra load, but the added resources did not help: adding new resources for rating consumed many more database connections, which resulted in errors and did not improve the performance degradation.

    ## **Impact**

    ShipHawk users experienced slowness of the service from 6:30 am PST to 3:30 pm PST. Some API requests failed by timeout, and syncing with external systems was delayed. A total of 9 urgent support cases were submitted to ShipHawk during the impact window.

    ## **Detection**

    The issue was first detected by monitoring systems at 6:30 am PST and was then reported by customers at 6:42 am PST.

    ## **Response**

    Customers were notified about the slowness via our status page at 6:44 am PST. We responded to the incident with all possible urgency and ultimately made the changes necessary to solve the problem while continuing to process volumes similar to Black Friday and Cyber Monday through the end of the week.

    ## **Recovery**

    We needed to add more servers to process the extra API requests, but doing so created too many connections to the database. The solution was to implement a database connection pooling system that allowed us to optimize database connection usage. Around 3:00 pm PST, the new connection pool system was activated, and we were able to add more resources to process API requests and background jobs. That resolved the slowness at 3:30 pm PST. To further reduce the chances of another incident, we set up redundant connection poolers and provisioned more resources to production throughout the night. That proved effective the next day (Tuesday 11/30), when ShipHawk experienced similar API load and response times remained stable throughout.

    ## **Timeline**

    All times in PST.

    **Monday, 29 November**

    * 6:30 am - Monitoring systems alerted on an increase in average API response time and an increased number of "499 Client Closed Request" errors.
    * 6:32 am - Engineering team started investigating the slowness.
    * 6:42 am - Customers reported slowness of Item Fulfillments sync and overall application slowness.
    * 6:44 am - Status page was updated with details about the incident.
    * 7:30 am - API load balancer reconfigured to prevent a cascade effect in which the load balancer removed slow instances from the pool, adding more load to healthy instances and making them slow/unhealthy too.
    * 8:00 am - Application servers reconfigured; more resources moved from backend services to API services to better match the type of load.
    * 9:00 am - Existing servers upgraded to more powerful EC2 instances; extra servers provisioned to handle the extra load.
    * 10:00 am - Monitoring systems detected errors related to extremely high use of database connections, which prevented us from provisioning more servers.
    * 11:00 am - Decision made to configure a new database connection pooling system to mitigate the database connections issue and allow provisioning more resources.
    * 3:00 pm - New database connection pooling system installed and configured.
    * 3:30 pm - Confirmed that the incident was resolved.

    **Tuesday, 30 November**

    * 12:00 am - 4:30 am - Additional application and background processing servers added for redundancy.

    ## **Root cause identification: The Five Whys**

    1. The application had degraded performance because of added load on the API and slow carrier response times.
    2. The system did not automatically absorb the added load because database connections were exhausted.
    3. Connections were exhausted because we pushed extra resources and did not expect this to cause an issue with database connections.
    4. We did not expect it because we did not have load tests that would have identified this.
    5. We did not have those tests because we had not previously felt this kind of testing was necessary until we reached this level of scale.

    ## **Root cause**

    Suboptimal use of database connections led to issues with application scaling. The team did not have an immediate solution because the issue had not been replicated in testing.

    ## **Lessons learned**

    * We need more application load testing in place.
    * Carrier API response slowness can cause slowness for the application.
    * Customers with highly volatile API usage should be isolated from other multi-tenant users.

    ## **Corrective actions**

    1. Introduce new load testing processes.
    2. Implement a better automated scaling system for peak load periods.
    3. Prioritize solutions to mitigate response time delays caused by carrier response time delays.
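
The postmortem attributes recovery to a database connection pooling system but does not name the specific pooler deployed (for PostgreSQL-backed stacks this is often an external tool such as PgBouncer). As a rough, hypothetical sketch of the underlying idea — `ConnectionPool` and `handle_request` are illustrative names, and SQLite stands in for the production database — a fixed-size pool caps total database connections regardless of how many application servers or workers are added:

```python
import queue
import sqlite3

class ConnectionPool:
    """Minimal fixed-size connection pool (illustrative, not production code)."""

    def __init__(self, size, factory):
        # Pre-open `size` long-lived connections; this is the hard cap.
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self, timeout=5.0):
        # Block until a connection is free instead of opening a new one,
        # so total database connections never exceed the pool size.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

# SQLite in memory stands in for the production database here.
pool = ConnectionPool(4, lambda: sqlite3.connect(":memory:", check_same_thread=False))

def handle_request(n):
    # Borrow a pooled connection for the duration of one request.
    conn = pool.acquire()
    try:
        return conn.execute("SELECT ?", (n,)).fetchone()[0]
    finally:
        pool.release(conn)

results = [handle_request(i) for i in range(10)]
```

The key property is back-pressure: when all pooled connections are busy, `acquire` waits rather than opening another connection, which avoids the kind of connection exhaustion described in the Fault section when more servers are added.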
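
The first corrective action is to introduce load testing. As a minimal, hypothetical illustration of what such a test could look like — `call_api` is a stand-in for a real rate or shipment request, and the concurrency and request counts are arbitrary — the sketch below drives a fixed number of concurrent requests and reports the error rate:

```python
import concurrent.futures
import time

def call_api(i):
    # Stand-in for a real rate/shipment request; a real load test would
    # make an HTTP call against a staging environment instead.
    time.sleep(0.01)
    return 200

def run_load_test(concurrency, requests):
    # Drive `requests` calls with up to `concurrency` in flight at once.
    start = time.monotonic()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as ex:
        statuses = list(ex.map(call_api, range(requests)))
    elapsed = time.monotonic() - start
    return statuses, elapsed

statuses, elapsed = run_load_test(concurrency=20, requests=100)
error_rate = sum(s != 200 for s in statuses) / len(statuses)
```

A real test would target production-like volumes and would also assert on latency percentiles, not just error rate, since this incident manifested as slowness rather than hard failures.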