Parade experienced a major incident on November 5, 2024, lasting —. The incident has been resolved; the full update timeline is below.
Update timeline
- resolved Nov 08, 2024, 03:35 PM UTC
Service Disruption Report: Delays in processing load integration payloads sent to Parade
- postmortem Nov 08, 2024, 04:50 PM UTC
## Overview

Our load integration payload processing queue experienced an unusually high influx of messages, leading to a significant backlog and processing delays. This was caused by a large volume of payloads from a newly onboarded customer, combined with a legacy reprocessing job that unintentionally re-published unprocessed messages, creating duplicate entries in the queue and quickly overwhelming the system.

## Incident Timeline

* 11/05 1:26 AM UTC: The load integration payload processing queue began receiving messages at an exceptionally high rate.
* 11/05 1:46 AM UTC: The queue accumulated over 5,000 messages, triggering an alert.
* 11/05 1:13 PM UTC: An investigation was opened to identify the root cause.
* 11/05 1:39 PM UTC: Additional processing capacity was added, but acknowledgment rates remained insufficient compared to publish rates.
* 11/05 1:49 PM UTC: The legacy reprocessing job was stopped, reducing the message duplication rate.
* 11/05 7:48 PM UTC: The backlog was fully processed.
* 11/05 11:02 PM UTC: A permanent fix was implemented, preventing the legacy reprocessing job from re-publishing messages.

## Root Cause

The issue resulted from two main factors:

1. A large volume of integration payloads was sent by a new customer, creating an unexpected load.
2. A legacy reprocessing job, configured to re-publish messages periodically, inadvertently duplicated messages faster than they could be acknowledged, overwhelming the queue.

## Resolution and Recovery Steps

To resolve the issue, the team scaled up the number of queue consumers and disabled the reprocessing job that had been flooding the queue with duplicates of unprocessed messages. The reprocessing job was deemed unnecessary because a dead-letter queue already provides resilience against message processing failures.
With this change, the system is now better equipped to handle large spikes in message volume without overloading consumers, improving overall stability during high-demand periods.
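The dead-letter pattern described above can be sketched in a few lines. The report does not name the message broker or handler, so everything here is illustrative: a hypothetical `process` handler, a bounded retry count, and in-memory queues standing in for the real broker. The key point is that a message that repeatedly fails is parked in a dead-letter queue for inspection rather than re-published to the main queue, so failures can never multiply the backlog the way the legacy reprocessing job did.

```python
from collections import deque


def process(msg):
    # Hypothetical handler: rejects malformed payloads.
    if "load_id" not in msg:
        raise ValueError("missing load_id")
    return msg["load_id"]


def consume(queue, dead_letter, max_attempts=3):
    """Drain the queue, routing messages that keep failing to a
    dead-letter queue instead of re-publishing them indefinitely."""
    processed = []
    while queue:
        msg, attempts = queue.popleft()
        try:
            processed.append(process(msg))
        except ValueError:
            if attempts + 1 >= max_attempts:
                dead_letter.append(msg)            # park for inspection
            else:
                queue.append((msg, attempts + 1))  # bounded retry only
    return processed


queue = deque([({"load_id": 1}, 0), ({"bad": True}, 0), ({"load_id": 2}, 0)])
dlq = []
print(consume(queue, dlq))  # [1, 2]
print(dlq)                  # [{'bad': True}]
```

Because retries are bounded and failures leave the main queue entirely, a burst of bad payloads can at most add `max_attempts` deliveries per message, never an unbounded duplication loop.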