DataRobot incident

Delay in updating the Deployment Monitoring Information.

Minor Resolved

DataRobot experienced a minor incident on January 15, 2025, affecting MLOps and lasting 1d 4h. The incident has been resolved; the full update timeline is below.

Started
Jan 15, 2025, 04:56 PM UTC
Resolved
Jan 16, 2025, 09:54 PM UTC
Duration
1d 4h
Detected by Pingoru
Jan 15, 2025, 04:56 PM UTC

Affected components

MLOps

Update timeline

  1. identified Jan 15, 2025, 04:56 PM UTC

    Our team has identified an issue with our Deployment Monitoring Information. This is a processing delay and no data loss is expected. Our team is currently investigating the root cause and is working on a fix. The following services are currently impacted: Service Health, Data Drift, and Accuracy monitoring.

  2. monitoring Jan 15, 2025, 07:36 PM UTC

    Our team has identified the root cause and implemented the fix. Service Health and Accuracy no longer have a delay and are operating normally. The delay in Data Drift monitoring is improving; however, the Engineering team expects it will take several hours to fully recover as the system works through the accumulated data. The team can confirm there has been no data loss during this time. The team is currently monitoring the situation.

  3. monitoring Jan 16, 2025, 09:30 AM UTC

    Processing of the unprocessed message backlog continues to catch up. The Engineering team is closely monitoring the process. We will provide an update once the delayed messages have been fully processed.

  4. resolved Jan 16, 2025, 09:54 PM UTC

    This incident has been contained.