Update timeline
- investigating Apr 23, 2026, 04:33 AM UTC
There appears to be an issue with sending metrics to Datadog. We are currently investigating.
- monitoring Apr 23, 2026, 06:53 AM UTC
A fix has been implemented and we are monitoring the results.
- resolved Apr 23, 2026, 11:32 AM UTC
This incident has been resolved.
- postmortem Apr 24, 2026, 08:38 AM UTC
# Broker metrics and metrics integrations degradation

Window: 2026-04-23 03:25 – 05:35 UTC (2h 10m)

## Impact

Broker metrics (connections, channels, queues, consumers, message rates, node and netsplit state) were degraded, progressively dropping until recovery at 05:35 UTC. Downstream effects:

* Queue, Consumer, Connection, Channel, Connection flow, and Netsplit alarms: threshold breaches during the window may not have triggered, or triggered late. Alarms already tripped before 03:25 UTC kept their state.
* Metrics Integrations (Datadog, CloudWatch, New Relic, Splunk, Dynatrace, etc.): delivery of broker metrics was reduced for the duration of the window. Alarms configured on the receiving side may likewise have missed breaches or triggered late.

Not affected: server metrics (CPU, memory, disk) and their corresponding alarms, Notice alarms, broker availability, and the metric graphs shown in the CloudAMQP console (these use a separate data path and continued to update normally).

## Timeline (UTC)

* 03:25 — broker-metrics collection begins degrading
* 03:35 — ~30% of broker-metric samples missing
* 03:50 — ~50% plateau
* 05:25 — collector workers cycled and re-initialised
* 05:35 — full throughput restored

## Root cause

Broker metrics are polled from each cluster's management HTTP API by a pool of collector workers and published to an internal message bus that both the Alarms service and Metrics Integrations consume from. At 03:25 UTC this service became unresponsive without raising an error. Affected workers silently stopped polling while remaining alive to the platform, so automatic restart did not trigger. On-call staff began investigating around 05:00 UTC; force-restarting the service restored metric flow.

The exact trigger has not been identified. Our focus is on ensuring the condition is detected and handled promptly if it recurs.

## What we are changing

* Added metrics and alarms specific to this service, with high-urgency paging for on-call.
* Reliability improvements in the service itself to detect and restart stalled work.

## What you can do

CloudAMQP offers a new generation of metrics integrations based on Prometheus, with a re-engineered pipeline that does not rely on these centralised services — each server forwards data directly to your endpoint. These have been running in production for some time and have proved very reliable.

* [https://www.cloudamqp.com/blog/prometheus-metrics-integrations.html](https://www.cloudamqp.com/blog/prometheus-metrics-integrations.html)
* [https://www.cloudamqp.com/docs/monitoring_metrics_datadog_v3.html](https://www.cloudamqp.com/docs/monitoring_metrics_datadog_v3.html)
* Background: [https://www.cloudamqp.com/blog/decentralized-observability-with-open-telemetry-part-1.html](https://www.cloudamqp.com/blog/decentralized-observability-with-open-telemetry-part-1.html)
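The failure mode described in the root cause is a worker that stays alive to the platform while silently doing no work, so a plain liveness check never fires. As a rough illustration of the "detect and restart stalled work" change, here is a minimal Python sketch of a progress-based watchdog. It is an assumption-laden sketch, not CloudAMQP's implementation: the names `poll_management_api`, `publish_to_bus`, `STALL_TIMEOUT`, and `POLL_INTERVAL` are hypothetical placeholders.

```python
import threading
import time

# Hypothetical sketch: a poll loop plus a watchdog that restarts it when no
# sample has been produced for too long. All names are illustrative.

STALL_TIMEOUT = 60   # seconds without a successful poll before we assume a stall
POLL_INTERVAL = 10   # how often the worker polls the management HTTP API

last_sample_at = time.monotonic()
lock = threading.Lock()


def poll_management_api(cluster):
    """Placeholder for an HTTP GET against the cluster's management API."""
    return {"cluster": cluster, "queues": [], "connections": []}


def publish_to_bus(sample):
    """Placeholder for publishing the sample to the internal message bus."""


def collector_loop(cluster, stop_event):
    global last_sample_at
    while not stop_event.is_set():
        sample = poll_management_api(cluster)
        publish_to_bus(sample)
        with lock:
            # Heartbeat: proves the loop is making progress, not just alive.
            last_sample_at = time.monotonic()
        stop_event.wait(POLL_INTERVAL)


def run_with_watchdog(cluster):
    """Run the collector and restart it whenever it stops producing samples."""
    global last_sample_at
    while True:
        stop_event = threading.Event()
        worker = threading.Thread(
            target=collector_loop, args=(cluster, stop_event), daemon=True
        )
        with lock:
            last_sample_at = time.monotonic()
        worker.start()

        # The process can look healthy to the platform even when the poller
        # has stalled, so check progress (time since last sample) instead of
        # thread liveness alone.
        while True:
            time.sleep(STALL_TIMEOUT / 2)
            with lock:
                stalled = time.monotonic() - last_sample_at > STALL_TIMEOUT
            if stalled or not worker.is_alive():
                stop_event.set()        # ask the stalled worker to exit ...
                worker.join(timeout=5)  # ... give it a moment to comply
                break                   # ... then start a fresh worker
```

The point of the sketch is the signal being watched: elapsed time since the last successful sample, which is the evidence that was missing when the workers stalled while still appearing alive.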