Datadog AP1 incident

Elevated Error Rates for Log Queries and Monitors

Major · Resolved

Datadog AP1 experienced a major incident on October 3, 2023, that affected Log Management and Monitors and lasted 20h 59m. The incident has been resolved; the full update timeline is below.

Started
Oct 03, 2023, 05:33 PM UTC
Resolved
Oct 04, 2023, 02:33 PM UTC
Duration
20h 59m
Detected by Pingoru
Oct 03, 2023, 05:33 PM UTC

Affected components

Log Management, Monitors

Update timeline

  1. investigating Oct 03, 2023, 05:33 PM UTC

    We are actively investigating Log Queries returning unexpected results. As a result, some users may have trouble querying logs in the web application or through the API, and Log-based Monitors and Log-Based Metrics may also be affected.

  2. investigating Oct 03, 2023, 06:50 PM UTC

    We are continuing to investigate these issues, and will provide an update as soon as possible.

  3. identified Oct 03, 2023, 07:33 PM UTC

    We have identified the underlying issue and are working on a fix.

  4. monitoring Oct 03, 2023, 08:49 PM UTC

    We have deployed a fix and are monitoring the results. We will provide another update once the issue is fully resolved. At this time, newly ingested data is properly queryable, and monitors targeting logs sent from 2023-10-03 20:40 UTC onwards are valid. Queries targeting logs sent between 2023-10-02 11:40 UTC and 2023-10-03 20:40 UTC may return erroneous data. We are evaluating a fix that will restore query correctness for this time window.

  5. monitoring Oct 04, 2023, 09:20 AM UTC

    We're still working on a fix for historical data impacted by this incident.

  6. monitoring Oct 04, 2023, 10:26 AM UTC

    We're still working on a fix for historical data impacted by this incident.

  7. monitoring Oct 04, 2023, 11:06 AM UTC

    We're still working on a fix for historical data impacted by this incident.

  8. monitoring Oct 04, 2023, 11:41 AM UTC

    We're still working on a fix for historical data impacted by this incident.

  9. monitoring Oct 04, 2023, 12:19 PM UTC

    We're still working on a fix for historical data impacted by this incident.

  10. monitoring Oct 04, 2023, 01:05 PM UTC

    We have successfully tested a fix for this issue and are currently deploying it to resolve this incident.

  11. monitoring Oct 04, 2023, 01:09 PM UTC

    The fix has been rolled out, and we are monitoring to confirm full resolution.

  12. resolved Oct 04, 2023, 02:33 PM UTC

    This incident has been resolved.
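
Update 4 above gives a concrete window (2023-10-02 11:40 UTC to 2023-10-03 20:40 UTC) in which log queries may have returned erroneous data until the historical fix was rolled out on Oct 04. The sketch below shows one way to check whether a saved query or report covered any part of that window. The function name and the half-open treatment of the window boundaries are assumptions made for illustration; they are not part of Datadog's API or of the incident report.

```python
from datetime import datetime, timezone

# Window quoted in update 4 of the timeline (both times are UTC).
AFFECTED_START = datetime(2023, 10, 2, 11, 40, tzinfo=timezone.utc)
AFFECTED_END = datetime(2023, 10, 3, 20, 40, tzinfo=timezone.utc)


def overlaps_affected_window(query_start: datetime, query_end: datetime) -> bool:
    """Return True if the range [query_start, query_end) intersects the window.

    Hypothetical helper: treats both ranges as half-open, which is an
    assumption, since the update does not say whether the endpoints are
    inclusive.
    """
    if query_start.tzinfo is None or query_end.tzinfo is None:
        raise ValueError("pass timezone-aware datetimes")
    if query_end <= query_start:
        raise ValueError("query_end must be after query_start")
    # Two half-open ranges intersect when each starts before the other ends.
    return query_start < AFFECTED_END and query_end > AFFECTED_START


if __name__ == "__main__":
    start = datetime(2023, 10, 3, 0, 0, tzinfo=timezone.utc)
    end = datetime(2023, 10, 3, 1, 0, tzinfo=timezone.utc)
    print(overlaps_affected_window(start, end))
```

Results from any range for which this returns True, if they were produced before the incident was resolved, may be worth re-running.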