Roadie incident

Core Platform Outage

Roadie experienced a critical incident on December 16, 2025 affecting Core Platform, lasting 1h 13m. The incident has been resolved; the full update timeline is below.

Started: Dec 16, 2025, 07:34 PM UTC
Resolved: Dec 16, 2025, 08:48 PM UTC
Duration: 1h 13m
Detected by Pingoru: Dec 16, 2025, 07:34 PM UTC

Affected components

Core Platform

Update timeline

investigating Dec 16, 2025, 07:34 PM UTC

We are currently experiencing a core platform outage. Our engineers are investigating.
investigating Dec 16, 2025, 07:39 PM UTC

Our engineers have identified the cause of the outage and are applying a fix now.
identified Dec 16, 2025, 08:02 PM UTC

Our engineers are still applying the fix.
monitoring Dec 16, 2025, 08:11 PM UTC

A fix has been implemented and our engineers are continuing to monitor the situation.
monitoring Dec 16, 2025, 08:14 PM UTC

We are continuing to monitor for any further issues.
resolved Dec 16, 2025, 08:48 PM UTC

This incident has been resolved.
postmortem Dec 18, 2025, 08:42 PM UTC

**Core Platform Outage Summary – December 16th, 2025**

On Dec 16th 2025 at approximately 2:00 pm EST, the Roadie Platform experienced a core platform outage. Although we were limiting deployments to critical changes only during the peak period, we needed to make a configuration change that was unrelated to production traffic and was intended to improve system resiliency and further support the scaling of our hybrid cloud architecture. The change was not directly related to production request handling, but it affected critical ingress and platform components in an unforeseen way, resulting in a complete service disruption.

Our engineers immediately began investigating, and by 2:39 pm EST they had identified the cause of the issue and begun applying a fix, which reverted the earlier configuration change. By 3:11 pm EST the fix was fully applied and we began to see the Platform function normally again. By 3:48 pm EST, all key metrics had returned to normal and we declared the outage resolved.

To reduce the likelihood of similar outages in the future, we have implemented the following measures:

* **Enhanced monitoring** for ingress and other critical platform components to improve early detection of anomalous behavior.
* **Strengthened configuration safeguards**, including additional validation and checks to prevent high-risk or incompatible configuration changes from being applied.
* **Improved change review practices** for infrastructure and resiliency-related updates to better identify potential system-wide impacts before deployment.

We sincerely apologize for any disruption this may have caused. If you have any questions, please feel free to reach out to us at [**[email protected]**](mailto:[email protected]).