IONOS Cloud incident

IONOS Container Registry

Severity
Minor
Status
Resolved
Started
Mar 04, 2026, 09:23 AM UTC
Resolved
Mar 04, 2026, 04:39 PM UTC
Duration
7h 16m
Detected by Pingoru
Mar 04, 2026, 09:23 AM UTC

Update timeline

  1. investigating Mar 04, 2026, 09:23 AM UTC

    We are currently investigating an increased error count on the IONOS Container Registry. Customers might be unable to pull images at this time.

  2. identified Mar 04, 2026, 10:23 AM UTC

    Our Container Registry team has identified an issue in the underlying Kubernetes cluster that serves a subset of images. The team is working on a fix.

  3. monitoring Mar 04, 2026, 12:22 PM UTC

    The Kubernetes team has deployed a mitigation that involved rolling back a component of the K8s control plane to a previous version. We are monitoring service recovery.

  4. resolved Mar 04, 2026, 04:39 PM UTC

    We are marking this incident as resolved, as no further issues have been found in the setup. A Root Cause Analysis (RCA) will be published once the team has completed its investigation.

  5. postmortem Mar 11, 2026, 06:02 PM UTC

    We want to share the Root Cause Analysis (RCA) for this incident.

    **What happened**

    On 4 March 2026, customers of the IONOS Container Registry experienced 504 Gateway Timeout errors when pushing or pulling container images. Deployments that relied on the registry were blocked.

    **How was that possible (Root Cause)**

    The registry runs on IONOS Managed Kubernetes (MK8s) infrastructure. A temporary capacity constraint caused two critical control-plane components to be placed on the same proxy instance instead of being distributed across separate proxies, despite existing anti-affinity rules. The shared proxy reached its maximum concurrent-connection limit and stopped accepting new connections. Because all registry traffic to the Kubernetes API traverses this proxy, push and pull operations failed with 504 errors. The migration created the co-location condition; the connection-limit exhaustion was the direct trigger. A sketch of this failure mode follows after the timeline.

    **What we are doing to prevent recurrence**

    Immediate (completed)

    * Provisioned additional proxy capacity.
    * Relocated the affected control-plane components onto separate proxy instances, restoring balanced load and ending the 504 errors.

    Short-term

    * Architectural redesign: redesign registry-to-API connectivity so that each node uses a dedicated local proxy, eliminating shared-proxy bottlenecks. The design has been validated in test environments; the production rollout is scheduled for Q2 2026.
    * Alert-threshold review: adjust alerting thresholds so that warnings fire before proxy connection utilization approaches capacity. Rollout is in progress; completion is expected in Q2 2026.

    Mid-term

    * Load redistribution: deploy additional infrastructure clusters and redistribute existing registries so that no single cluster exceeds safe operating capacity. Automation will continuously balance load as usage grows.

    **Closing remarks**

    The outage directly impacted container-image delivery and delayed customer deployments. We have restored full service and implemented concrete architectural and operational changes to eliminate the identified bottleneck.
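
To make the failure mode above concrete, here is a minimal sketch, not IONOS's actual implementation: a proxy with a fixed concurrent-connection limit carries the Kubernetes API traffic of whatever components are placed on it, so co-locating two heavy components exhausts the limit and new requests are rejected, which registry clients see as 504 errors. The component names, demand figures, and the 1000-connection limit are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Proxy:
    """A proxy in front of the Kubernetes API with a concurrency cap."""
    name: str
    max_connections: int
    open_connections: int = 0

    def try_connect(self) -> bool:
        """Accept a new connection unless the concurrent-connection limit is reached."""
        if self.open_connections >= self.max_connections:
            return False  # proxy stops accepting; the caller surfaces a 504
        self.open_connections += 1
        return True


def rejected_connections(placement: dict[str, Proxy], demand: dict[str, int]) -> dict[str, int]:
    """Count rejected connection attempts per component for a given placement."""
    return {
        component: sum(0 if proxy.try_connect() else 1 for _ in range(demand[component]))
        for component, proxy in placement.items()
    }


# Illustrative demand: each component opens 600 concurrent API connections.
demand = {"registry-api-client": 600, "controller": 600}

# Co-located: both components land on one proxy capped at 1000 connections,
# so 200 connection attempts are rejected.
shared = Proxy("proxy-a", max_connections=1000)
print("co-located:", rejected_connections(
    {"registry-api-client": shared, "controller": shared}, demand))

# Separated: anti-affinity keeps the components on different proxies,
# and every connection attempt succeeds.
spread = {
    "registry-api-client": Proxy("proxy-a", max_connections=1000),
    "controller": Proxy("proxy-b", max_connections=1000),
}
print("separated: ", rejected_connections(spread, demand))
```

One common way placement rules can yield under capacity pressure is when anti-affinity is expressed as a "preferred" (soft) rule rather than a "required" (hard) one; the RCA does not state which form was in use here.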

Looking to track IONOS Cloud downtime and outages?

Pingoru polls IONOS Cloud's status page every 5 minutes and alerts you the moment it reports an issue — before your customers do.

  • Real-time alerts when IONOS Cloud reports an incident
  • Email, Slack, Discord, Microsoft Teams, and webhook notifications
  • Track IONOS Cloud alongside 5,000+ providers in one dashboard
  • Component-level filtering
  • Notification groups + maintenance calendar
Start monitoring IONOS Cloud for free

5 free monitors · No credit card required