Postmortem · May 12, 2026 · 3 min read

Datadog lost 6+ hours of telemetry processing on May 8, and blamed AWS

Datadog reported an incident at 23:39 UTC on May 7. The fix took 14 hours. Datadog linked AWS's health dashboard in ten separate updates over the following 6 hours.

By the Pingoru team

At 23:39 UTC on May 7, monitor notifications on Datadog US1 started piling up. By 00:47 UTC, Datadog had decided it wasn't their problem:

Due to upstream provider issues, we are also continuing to see unavailability of telemetry data coming from AWS into Datadog.

https://health.aws.amazon.com/health/status

That AWS Health URL appears in nine more updates over the next 6 hours. Datadog didn't say which AWS service, didn't say which region, didn't say what "unavailability of telemetry data" meant at the wire level. They just kept pointing at the AWS health dashboard and saying "we're working on it."

The incident ran 14 hours end to end. Processing was effectively back to normal by 06:21 UTC, when Datadog moved to monitoring state. The resolved update didn't land until 14:07 UTC.

The timeline, compressed

The first update at 00:06 UTC framed this as a Monitors problem only. Eighteen minutes later, the scope widened:

We are investigating increased latency across multiple products which began at 23:39 UTC. As a result of this issue, some users may see delays in data across the platform.

Seven product surfaces ended up on the affected list: APM, CI Visibility, Error Tracking, Log Management, Metrics and Infra Monitoring, Monitors, and RUM. That's most of Datadog's catalog.

Recovery happened in stages, not as one switch. At 02:59 UTC Datadog noted that count, rate and gauge metrics were flowing for most customers but distribution metrics were still delayed. By 04:13 UTC distribution metrics had caught up. At 05:44 UTC:

Metrics, Logs, APM and RUM data are being processed normally for all customers. Alerting is also functional. There are still processing delays for CI visibility.

CI Visibility caught up by 06:21 UTC. Then eight hours of silence before the resolved update.
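
Laid out as minutes from the 23:39 UTC impact start, the staged recovery looks like this. A minimal sketch in Python, using only the timestamps from the status updates above; the stage labels are our paraphrases, not Datadog's wording:

    from datetime import datetime

    # All times UTC, taken from Datadog's status updates quoted above.
    impact_start = datetime(2026, 5, 7, 23, 39)
    stages = [
        ("count, rate and gauge metrics flowing for most customers", datetime(2026, 5, 8, 2, 59)),
        ("distribution metrics caught up",                           datetime(2026, 5, 8, 4, 13)),
        ("Metrics, Logs, APM, RUM normal; alerting functional",      datetime(2026, 5, 8, 5, 44)),
        ("CI Visibility caught up; incident moved to monitoring",    datetime(2026, 5, 8, 6, 21)),
    ]

    for label, when in stages:
        elapsed_min = int((when - impact_start).total_seconds() // 60)
        print(f"+{elapsed_min:>4} min  {label}")

Every product had caught up by the +402 minute mark; everything after that is the resolved-status delay discussed below.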

Amazon Web Services

Datadog never named the AWS service. The AWS Health Dashboard showed two concurrent incidents in us-east-1 starting at 00:25 UTC: an EC2 N. Virginia "increased error rate and latency" event we classified as major, and a paired EC2 us-east-1 event at minor severity. Datadog's subprocessor disclosure lists AWS for "infrastructure services" without naming a region; given the naming and the timing, US1 is most plausibly hosted in us-east-1, but neither side spells it out. Those AWS incidents opened 46 minutes after Datadog's impact began at 23:39. They are the most likely culprit, even though Datadog never said so explicitly and AWS only flagged EC2 in its public updates.

If the actual blast radius was wider inside AWS, neither vendor said so on a public status page.

Who else broke, and who didn't blame anyone

This is the part where Pingoru usually has the receipts. This time the receipts are thin.

In the same window, Twilio degraded across roughly 45 components starting 00:13 UTC, twelve minutes before AWS posted. Twilio's incident update never mentioned AWS. Twilio's own subprocessor disclosure lists AWS for "all services" under cloud infrastructure, without naming a region. The timing is suggestive. The attribution isn't there.
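
For the timing, the three public start times line up like this. A minimal sketch (Python; dates assume the May 7-8, 2026 window per the timelines above):

    from datetime import datetime

    # Public start times (UTC) from the three status pages discussed in this post.
    starts = {
        "Datadog US1 impact":      datetime(2026, 5, 7, 23, 39),
        "Twilio degradation":      datetime(2026, 5, 8, 0, 13),
        "AWS us-east-1 incidents": datetime(2026, 5, 8, 0, 25),
    }

    base = starts["Datadog US1 impact"]
    for name, when in starts.items():
        offset = int((when - base).total_seconds() // 60)
        print(f"+{offset:>3} min  {name}")
    # Twilio opened 34 minutes after Datadog's impact and 12 minutes before AWS posted.

Ordering is all this shows: both downstream impact windows were open before anything appeared on the AWS Health Dashboard, which is why the timing is suggestive rather than conclusive.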

Nobody else in the catalog blamed Datadog. That's the expected shape: Datadog is an observability dependency, not a serving-path dependency, so a Datadog telemetry stall makes your dashboards lie but doesn't break your product. Of the providers we monitor whose incidents overlapped this window, Datadog was the only one to name AWS as the cause. Twilio's concurrent incident updates didn't mention an upstream at all.

What the numbers say about Datadog US1

In our database, Datadog US1 has logged 74 incidents in the last 12 months, 25 of which we classified as major or higher. Median duration is 52 minutes. This one ran 840 minutes. The previous major was a week earlier, on May 1.
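
To put those catalog numbers side by side, a quick sketch using only the figures quoted above:

    # Pingoru catalog figures for Datadog US1 over the last 12 months.
    incidents = 74
    major_or_higher = 25
    median_minutes = 52
    this_incident_minutes = 840

    print(f"major-or-higher share: {major_or_higher / incidents:.0%}")                 # ~34%
    print(f"this incident vs median: {this_incident_minutes / median_minutes:.1f}x")   # ~16x

Roughly one Datadog US1 incident in three clears our major bar, and this one ran about sixteen times the median.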

The 14-hour wall-clock duration overstates customer pain here. Most surfaces were processing normally by hour 6. The remaining 8 hours were pure resolved-status delay; Datadog kept the incident in monitoring state long after every product had recovered. But if you woke up at 02:00 UTC and your PagerDuty was quiet because monitor notifications were stalled, the wall-clock duration is the one you care about.

#postmortem #datadog #aws-cascade

Want every cloud + SaaS outage in one place, with email or Slack alerts? Start free →