Redox incident

Slow traffic and increased latency

Minor · Resolved

Redox experienced a minor incident on January 24, 2025 affecting Traffic Processing and Platform API, lasting 51m. The incident has been resolved; the full update timeline is below.

Started
Jan 24, 2025, 09:14 AM UTC
Resolved
Jan 24, 2025, 10:06 AM UTC
Duration
51m
Detected by Pingoru
Jan 24, 2025, 09:14 AM UTC

Affected components

Traffic Processing, Platform API

Update timeline

  1. investigating Jan 24, 2025, 09:14 AM UTC

    We are currently seeing slow traffic and increased latency for message processing. We are investigating this issue.

  2. monitoring Jan 24, 2025, 09:40 AM UTC

    Traffic and latency seem to be returning to normal, but we are continuing to monitor for further developments.

  3. resolved Jan 24, 2025, 10:06 AM UTC

    The latency and slowness have returned to normal levels.

  4. postmortem Feb 14, 2025, 07:28 PM UTC

    ## Summary

    Starting January 24, 2025 at 2:40 CT, we became aware of message processing latency for some of our customers. This latency occurred intermittently through February 3, when some message processing stopped, resulting in rejected messages for a subset of customers. A subset of customers with subscriptions in the affected database remained impacted until the root causes were determined to be:

    * an inefficient query that monitors message processing
    * a lack of monitoring visibility into a set of waiting messages that were in an errored state

    On February 4, 2025 at 1:15 CT, changes fixing both root causes were deployed, with most customers mitigated by February 4, 2025 at 17:14 CT. All impacted customers were fully operational by February 5, 2025 at 12:33 CT.

    ## What Happened

    * On January 23, atypical messages became stuck in a processing waiting state. Combined with a lack of visibility into errors for that waiting state and an inefficient query for monitoring message processing, one database ran out of available space.
    * Customers with subscriptions on that database experienced increasing latency intermittently from January 24 through February 4.
    * To mitigate this incident, we removed the problematic messages to unblock customer subscriptions on that one database. Additionally, we optimized the database query that monitors message processing and added metrics to capture and alert on errors from messages waiting to be processed.

    ## What we are doing about this

    * We have created an alert that captures when messages are erroring in this waiting state (a sketch of this kind of check follows below).
    * We have corrected the edge case that allowed the large message payload.
    * We have improved the performance of the query that monitors message processing.
    * We are improving the process of moving waiting messages into processing so that it handles atypical messages better.
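For illustration, the "alert on errored waiting messages" remediation can be sketched as a periodic check that counts stuck messages and fires when a threshold is crossed. This is a minimal, hypothetical sketch only: Redox has not published its schema or alerting stack, so the `messages` table, its `state` and `error` columns, and the thresholds here are all assumptions.

```python
# Hypothetical sketch of the kind of alert described in the postmortem:
# periodically count messages stuck in an errored "waiting" state and fire
# when a count or age threshold is crossed. Table, column, and threshold
# names are illustrative, not Redox's actual internals.

import sqlite3  # stand-in for the production database driver

STUCK_THRESHOLD = 100   # alert if at least this many messages are errored-waiting
MAX_WAIT_MINUTES = 15   # ...or if the oldest errored message has waited this long

def check_errored_waiting(conn: sqlite3.Connection) -> list[str]:
    """Return alert strings for errored messages stuck in the 'waiting' state."""
    stuck, max_wait_min = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(MAX((julianday('now') - julianday(enqueued_at)) * 1440), 0)
        FROM messages
        WHERE state = 'waiting' AND error IS NOT NULL
        """
    ).fetchone()
    alerts = []
    if stuck >= STUCK_THRESHOLD:
        alerts.append(f"{stuck} messages errored in the waiting state")
    if max_wait_min >= MAX_WAIT_MINUTES:
        alerts.append(f"oldest errored waiting message is {max_wait_min:.0f} minutes old")
    return alerts

if __name__ == "__main__":
    # In production this would run on a schedule and page through an alerting
    # system; here we simply print any findings.
    conn = sqlite3.connect("messages.db")
    for alert in check_errored_waiting(conn):
        print("ALERT:", alert)
```

A check like this addresses the visibility gap named in the postmortem: before the fix, messages could error in the waiting state without any metric or alert surfacing them.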