CloudAMQP incident

System metrics not forwarded to legacy integration

CloudAMQP experienced a minor incident on May 22, 2026 affecting Metrics, lasting 1h 33m. The incident has been resolved; the full update timeline is below.

Started: May 22, 2026, 03:16 AM UTC
Resolved: May 22, 2026, 04:50 AM UTC
Duration: 1h 33m
Detected by Pingoru: May 22, 2026, 03:16 AM UTC

Affected components

Metrics

Update timeline

investigating May 22, 2026, 03:16 AM UTC

Since around 8pm UTC last night, some metrics are not being forwarded to legacy integration. Prometheus integrations not affected. We are investingating.
monitoring May 22, 2026, 04:08 AM UTC

A fix has been implemented and we are monitoring the results.
resolved May 22, 2026, 04:50 AM UTC

This incident has been resolved.
postmortem May 22, 2026, 01:52 PM UTC

# Legacy host metrics integrations degraded Incident window: 2026-05-21 18:37 UTC – 2026-05-22 04:05 UTC \(9h 28m\) Affected service: Legacy metric integrations \(CloudWatch, Datadog, Librato, New Relic, Splunk, Stackdriver\). Host metrics affected, not broker metrics, nor was internal monitoring or console graphs. ## Summary For ~9.5 hours, the service that collects host-level metrics \(CPU, memory, disk, network\) for legacy third-party integrations entered a crash loop and stopped shipping data. ## Root Cause An internal credential-signing service was migrated to a new container runtime the afternoon before. Due to a bug with environment variables the collector could not renew it's credentials. The collector's existing credential remained valid for several hours, so the failure only surfaced when it tried to renew. ## Resolution On-call staff detected the failure early morning, developers helped restore the service and collector recovered at 04:05 UTC. ## Prevention * All services, both signing and collector report failed authentication attempts earlier and with higher severity * Metrics pipeline alarm thresholds was tightened so a similar drop in data triggers alarms faster.