Parade incident

CSV Integration Processing Delay

Major Resolved

Parade experienced a major incident on January 31, 2023. The incident has been resolved; the full update timeline is below.

Started: Jan 31, 2023, 08:40 PM UTC
Resolved: Jan 14, 2023, 06:30 AM UTC
Detected by Pingoru: Jan 31, 2023, 08:40 PM UTC

Update timeline

  1. resolved Jan 31, 2023, 08:40 PM UTC

    Issue Summary

    We had a major delay in processing CSV files for Available Load integrations for some customers. Only customers on a CSV load integration were affected, and not all of them. Customers sending larger files were more likely to be affected.

    Timeline

    We first detected a slowdown in CSV file processing with one of our customers on 1/13/2023. Over the weekend the issue got worse, and the majority of Available Load CSVs were not processing on Monday, 1/16/2023. We resolved the issue on the night of 1/19/2023 with a hotfix deployment.

    Root Cause

    The root cause was a bug fix deployed on the night of 1/12/2023. The bug fix helped improve the consistency and timing of loadboard postings after a load was made re-available over our CSV load integration. However, we failed to recognize that the code change also increased memory usage. The increase caused our application to exceed the memory threshold allocated to our provisioned computing resources. Out-of-memory errors were more common for customers with larger files. As a result, files were partially processed before being interrupted by memory constraints, and customers saw a delay in load updates coming into Parade.

    Resolution and Recovery

    On 1/13/2023, only one customer was affected, and a support ticket was raised to our team. When more customers were affected on the morning of 1/16/2023, the ticket was immediately re-prioritized to P0. Some optimizations were deployed on the night of 1/16/2023, but they did not consistently solve the problem. From 1/17/2023 to 1/18/2023, our team continued to monitor processing times and noticed that larger CSV files were still seeing major delays. Some small optimizations were implemented that benefited a few customers, but not all. The root cause was identified on 1/19/2023, and a fix was tested and deployed that night. This resulted in all customer data being updated successfully. Since CSV files are snapshots of customer load data, no load data was lost.

    Corrective and Preventative Measures

    We are working on better preventative measures and monitoring for resource-constraint issues. This includes re-evaluating the CPU and memory thresholds for our integration pipeline. We have also implemented preventative measures that increase the overall memory allocation for crucial parts of our platform.
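The report does not describe the code behind the hotfix, so the following is only an illustrative sketch of the failure mode and of the kind of monitoring mentioned above, not Parade's implementation. It assumes a pipeline that reads a CSV lazily in fixed-size batches, which keeps memory use roughly independent of file size, and it measures peak Python heap usage with the standard-library `tracemalloc` module so the result can be compared against a budget. The names `iter_batches`, `process_csv`, `peak_memory_bytes`, the column names, and the 50 MiB budget are all hypothetical.

```python
import csv
import io
import tracemalloc
from typing import Iterable, Iterator, List


def iter_batches(rows: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield rows in lists of at most batch_size, so only one batch is held at a time."""
    batch: List[dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def process_csv(stream: io.TextIOBase, batch_size: int = 500) -> int:
    """Process a CSV stream batch by batch and return the number of rows handled."""
    reader = csv.DictReader(stream)  # reads lazily, one row at a time
    processed = 0
    for batch in iter_batches(reader, batch_size):
        # Hypothetical placeholder: a real pipeline would upsert the loads here.
        processed += len(batch)
    return processed


def peak_memory_bytes(func, *args, **kwargs):
    """Run func and return (result, peak traced Python allocation in bytes)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


if __name__ == "__main__":
    # Synthetic file standing in for a large Available Load CSV (made-up columns).
    lines = ["load_id,origin,destination"] + [f"{i},CHI,DAL" for i in range(10_000)]
    data = "\n".join(lines)
    count, peak = peak_memory_bytes(process_csv, io.StringIO(data), 500)
    memory_budget_bytes = 50 * 1024 * 1024  # assumed budget, not Parade's real limit
    print(f"rows={count} peak_bytes={peak} within_budget={peak < memory_budget_bytes}")
```

A check like this could run in a test suite against representative large files, so that a change which raises peak memory is noticed before deployment rather than through partially processed files in production.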