Redox experienced a minor incident on May 29, 2025, affecting Traffic Processing, lasting 1h 58m. The incident has been resolved; the full update timeline is below.
Affected components
- Traffic Processing
Update timeline
- investigating May 29, 2025, 03:35 PM UTC
We are seeing elevated error rates on our API and message delays of up to 30 minutes. We are currently investigating the issue.
- identified May 29, 2025, 03:37 PM UTC
We believe we have identified the root cause of the issue and are deploying a fix.
- monitoring May 29, 2025, 04:14 PM UTC
We have implemented a fix, and error rates have returned to nominal levels. Message latency is starting to decline. We are continuing to monitor until latency fully resolves.
- monitoring May 29, 2025, 04:52 PM UTC
Latency has resolved for most feeds. We are continuing to monitor until latency is fully resolved across all feeds.
- resolved May 29, 2025, 05:34 PM UTC
Latency has now resolved and traffic is flowing as expected.
- postmortem Jun 11, 2025, 11:19 PM UTC
## Summary

On May 29, 2025, between 10:00 CT and 11:15 CT, some customers experienced elevated error rates and/or delayed processing. The issue impacted some customer traffic during that window, and the total duration was under an hour and a half.

## What Happened

* The issue was caused by a data consistency error in Redox base configs related to a change, which led to elevated errors and/or delayed processing for customers using a specific version and subset of Redox base configs.
* During the incident, customers may have experienced an increase in 5XX errors or increased latency, depending on whether their traffic was synchronous or asynchronous.
* Our team was alerted by monitoring at 10:07 CT and immediately started investigating. Mitigation efforts included preparing to roll back the changes, scaling up to increase capacity, and actively monitoring system health to ensure message processing continued. By 11:02 CT, errors were decreasing and latency was starting to return to normal levels. Full service was restored by 11:15 CT.

## What We Are Doing About This

* Adding automated detection to all environments to prevent the data consistency errors that caused this incident, along with a broader category of data consistency errors.
* Improving our rollback capabilities for faster mitigation.
* Auditing and improving our standard operating procedure for working with this type of data.