Parade incident

McLeod DFM Load Processing Delays

Parade reported a notice-level incident on August 25, 2022. The incident has been resolved; the full update timeline is below.

Started: Aug 25, 2022, 11:00 AM UTC
Resolved: Aug 25, 2022, 11:00 AM UTC
Duration: —
Detected by Pingoru: Aug 25, 2022, 11:00 AM UTC

Update timeline

resolved Sep 01, 2022, 05:21 PM UTC

Issue Summary

We experienced a major slowdown in how we process loads from McLeod DFM TMS integrations. No data loss occurred, but updates to loads in the Parade system were delayed.

Timeline

The issue was first identified on Aug 25th at 4:54 AM PST, when one of our standard integration health alerts notified the team that thousands of load updates were not being processed across all of our DFM customers. StatusPage was updated at this time with a Degraded Performance tag on the DFM Load Processing category. The cause was identified and a fix was put into place at 10:05 AM PST.

Root Cause

The root cause was an infrastructure problem in our DFM middleware deployment. Workers were hitting their CPU limits, and some jobs were failing with Out of Memory errors. This was difficult to catch because it was not triggered by any deployment or code change. Our systems had been gradually using more CPU and memory as we scaled, and this was the day we reached our limits.

Resolution and recovery

We resolved the issue by restarting the deployed infrastructure for our DFM middleware layer at 10:05 AM PST, which allowed us to resume processing updates in a timely manner. At that point we still had thousands of unprocessed updates. Our system caught up with all pending updates at 10:54 AM PST. No data loss occurred because all of the updates were stored in our database and could be replayed.

Corrective and Preventative Measures

We have implemented better logging and monitoring of CPU and memory usage for our DFM infrastructure so that we catch these issues earlier in the future. We have also increased the resources available to our DFM middleware: CPU usage limits have been doubled, and memory usage limits have been increased by 50%. This should benefit all DFM integrations going forward, and we will review on a monthly basis whether these limits need further adjustment.

We also learned some lessons about internal downtime alerting. We have revisited our downtime alerting process to establish better communication between our routine health checks and the support/engineering teams.
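
As a rough illustration of the limit changes and usage monitoring described above, the following Python sketch applies the stated increases (CPU doubled, memory raised by 50%) and classifies current usage against a limit. The starting limits, the 80% warning threshold, and the function names are all hypothetical; the report does not state actual values or how Parade's monitoring is implemented.

```python
def raise_limits(cpu_limit: float, memory_limit: float) -> tuple[float, float]:
    """Apply the increases named in the report: CPU x2, memory +50%."""
    return cpu_limit * 2, memory_limit * 1.5


def headroom_status(usage: float, limit: float, warn_ratio: float = 0.8) -> str:
    """Classify usage relative to a limit.

    warn_ratio is an assumed threshold (80% of the limit), chosen only
    for illustration so that slow growth is flagged before the limit is hit.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = usage / limit
    if ratio >= 1.0:
        return "at_limit"
    if ratio >= warn_ratio:
        return "warn"
    return "ok"


if __name__ == "__main__":
    # Hypothetical starting limits: 2 CPU cores and 4096 MiB of memory.
    new_cpu, new_mem = raise_limits(2.0, 4096.0)
    print(new_cpu, new_mem)
    # The same usage level reads differently before and after the increase.
    print(headroom_status(1.9, 2.0))
    print(headroom_status(1.9, new_cpu))
```

A check of this kind would flag gradual growth toward a limit even when no deployment or code change has occurred, which is the pattern the root cause section describes.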