DataRobot incident
Delay updating the Deployment Monitoring Information.
DataRobot experienced a minor incident on January 15, 2025 affecting MLOps, lasting 1d 4h. The incident has been resolved; the full update timeline is below.
Affected components
- MLOps
Update timeline
- identified Jan 15, 2025, 04:56 PM UTC
Our team has identified an issue with Deployment Monitoring Information. This is a processing delay, and no data loss is expected. Our team is investigating the root cause and working on a fix. The following services are currently impacted: Service Health, Data Drift, and Accuracy monitoring.
- monitoring Jan 15, 2025, 07:36 PM UTC
Our team has identified the root cause and implemented a fix. Service Health and Accuracy are no longer delayed and are operating normally. The delay in Data Drift monitoring is improving; however, the Engineering team expects full recovery to take several hours as the system works through the accumulated data. The team can confirm that no data has been lost during this time and is continuing to monitor the situation.
- monitoring Jan 16, 2025, 09:30 AM UTC
The system continues to work through the backlog of unprocessed messages, and the Engineering team is closely monitoring the process. We will provide an update once processing of the delayed messages has caught up.
- resolved Jan 16, 2025, 09:54 PM UTC
This incident has been contained.