Redox incident

Slow traffic and increased latency

Minor · Resolved

Redox experienced a minor incident on January 24, 2025 affecting Traffic Processing and Platform API, lasting 51m. The incident has been resolved; the full update timeline is below.

Started
Jan 24, 2025, 09:14 AM UTC
Resolved
Jan 24, 2025, 10:06 AM UTC
Duration
51m
Detected by Pingoru
Jan 24, 2025, 09:14 AM UTC

Affected components

Traffic Processing, Platform API

Update timeline

  1. investigating Jan 24, 2025, 09:14 AM UTC

    We are currently seeing slow traffic and increased latency for message processing. We are investigating this issue.

  2. monitoring Jan 24, 2025, 09:40 AM UTC

    Traffic and latency seem to be returning to normal, but we are continuing to monitor for further developments.

  3. resolved Jan 24, 2025, 10:06 AM UTC

    The latency and slowness have returned to normal levels.

  4. postmortem Feb 14, 2025, 07:28 PM UTC

    ## Summary

    Starting January 24, 2025 at 2:40 CT, we became aware of message processing latency for some of our customers. This latency occurred intermittently through February 3, when some message processing stopped, resulting in rejected messages for a subset of customers. A subset of customers with subscriptions in the affected database remained impacted until the root causes were determined to be:

    * an inefficient query that monitors message processing
    * a lack of monitoring visibility into a set of waiting messages that were in an errored state

    On February 4, 2025 at 1:15 CT, changes fixing both root causes were deployed, with most customers mitigated by February 4, 2025 at 17:14 CT. All impacted customers were fully operational by February 5, 2025 at 12:33 CT.

    ## What Happened

    * On January 23, atypical messages became stuck in a processing waiting state. Combined with a lack of visibility into errors for that waiting state and an inefficient query for monitoring message processing, one database ran out of available space.
    * Customers with subscriptions on that database experienced increasing latency intermittently from January 24 through February 4.
    * To mitigate this incident, we removed the problematic messages to unblock customer subscriptions on that one database. Additionally, we optimized the database query that monitors message processing and added metrics to capture and alert on errors from messages waiting to be processed.

    ## What we are doing about this

    * We have created an alert that captures when messages are erroring in this waiting state (a sketch of this kind of check follows below).
    * We have corrected the edge case that allowed the large message payload.
    * We have improved the performance of the query that monitors message processing.
    * We are improving the process of moving waiting messages into processing so that it handles atypical messages better.
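For illustration, the "alert on errored waiting messages" remediation can be sketched as a periodic check that counts stuck messages and fires when a threshold is crossed. This is a minimal, hypothetical sketch only: Redox has not published its schema or alerting stack, so the `messages` table, its `state` and `error` columns, and the thresholds here are all assumptions.

```python
# Hypothetical sketch of the kind of alert described in the postmortem:
# periodically count messages stuck in an errored "waiting" state and fire
# when a count or age threshold is crossed. Table, column, and threshold
# names are illustrative, not Redox's actual internals.

import sqlite3  # stand-in for the production database driver

STUCK_THRESHOLD = 100   # alert if at least this many messages are errored-waiting
MAX_WAIT_MINUTES = 15   # ...or if the oldest errored message has waited this long

def check_errored_waiting(conn: sqlite3.Connection) -> list[str]:
    """Return alert strings for errored messages stuck in the 'waiting' state."""
    stuck, max_wait_min = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(MAX((julianday('now') - julianday(enqueued_at)) * 1440), 0)
        FROM messages
        WHERE state = 'waiting' AND error IS NOT NULL
        """
    ).fetchone()
    alerts = []
    if stuck >= STUCK_THRESHOLD:
        alerts.append(f"{stuck} messages errored in the waiting state")
    if max_wait_min >= MAX_WAIT_MINUTES:
        alerts.append(f"oldest errored waiting message is {max_wait_min:.0f} minutes old")
    return alerts

if __name__ == "__main__":
    # In production this would run on a schedule and page through an alerting
    # system; here we simply print any findings.
    conn = sqlite3.connect("messages.db")
    for alert in check_errored_waiting(conn):
        print("ALERT:", alert)
```

A check like this addresses the visibility gap named in the postmortem: before the fix, messages could error in the waiting state without any metric or alert surfacing them.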