IONOS Cloud incident

K8s Control Planes not available

Major Resolved View vendor source →

IONOS Cloud experienced a major incident on May 8, 2026 affecting Managed Kubernetes and Managed Kubernetes and 1 more component, lasting 2d 17h. The incident has been resolved; the full update timeline is below.

Started
May 08, 2026, 03:15 PM UTC
Resolved
May 11, 2026, 08:30 AM UTC
Duration
2d 17h
Detected by Pingoru
May 08, 2026, 03:15 PM UTC

Affected components

Managed KubernetesManaged KubernetesManaged Kubernetes

Update timeline

  1. investigating May 08, 2026, 03:15 PM UTC

    Some customers may experience connection problems to the control plane and degraded functionality of kubernetes. Our engineers are working on a fix.

  2. identified May 08, 2026, 03:52 PM UTC

    Our Kubernetes engineering team has identified the root cause of the issue. The incident stemmed from a defect in a configuration filtering mechanism. We are currently implementing mitigation steps and will provide further updates on our progress here.

  3. identified May 08, 2026, 04:29 PM UTC

    The team continues to audit affected resources to verify that all remaining inadvertent configuration changes are reverted.

  4. identified May 08, 2026, 05:13 PM UTC

    The team is now preparing to safely rollback identified changes on affected resources.

  5. identified May 08, 2026, 06:42 PM UTC

    We are currently rolling out the correction. We expect this to take several minutes and will update this page once the rollout is completed.

  6. monitoring May 08, 2026, 08:42 PM UTC

    The rollback was completed successfully. The team is currently checking the affected resources for potential residual inconsistencies. We are setting this incident into monitoring status.

  7. resolved May 11, 2026, 08:30 AM UTC

    We are marking this incident as resolved. A RCA will be published here once it is compiled.

  8. postmortem Jun 15, 2026, 09:12 AM UTC

    **What happened?** On May 8, 2026 between 15:22 and 16:12 \(UTC\+2\) a routine internal operation to onboard a new Managed Kubernetes development cluster caused the deployment automation system to unintentionally target and modify additional clusters and install applications meant for the development cluster. The applications deployed to affected clusters could not be executed, as the required image pull secrets were not present in those environments, preventing any images from being pulled or run. Once the full scope of the incident was established, a recovery script was developed, validated against our staging environment, and executed in production. All unintended changes have been identified and removed by May 15, 2026. **How was this possible? \(Root Cause\)** The incident was caused by a software bug in our platform framework onboarding operator, compounded by the absence of a configuration completeness gate in the deployment pipeline. The onboarding operator is responsible for registering clusters with our deployment system. It relies on a scoping configuration file to restrict its operations to a defined set of target clusters. As part of the planned onboarding of the new development cluster, this configuration file was to be automatically propagated to the deployment branch. The propagation failed silently due to a merge conflict - the file never landed on the target branch. Simultaneously, a second change removed the onboarding operator from the exclusion list, triggering an immediate deployment. As the deployment system was not configured properly to react to missing configurations - rather than halting the sync - the operator was deployed without any scoping restriction. In this unconstrained state, it treated further cluster-api based clusters as valid onboarding targets. The root cause is a structural gap in the CI/CD pipeline: the mechanisms to verify that all required configuration files had been successfully delivered to the target branch before a deployment was permitted proved inadequate. The cherry-pick failure was logged but did not block the pipeline. This made it possible to deploy a fully functional but unscoped operator under conditions that appeared normal to the team monitoring the rollout. **What are we doing to prevent recurrence?** Upon detection, the two immediate technical failures were resolved. The onboarding operator has been corrected and the deployment system configuration has been hardened. In parallel, the recovery effort - which included a full staging test cycle before any production changes - identified and cleaned up all affected clusters within the week following the incident. **Immediate Technical Actions** * Operator Scope Restriction: The onboarding operator logic has been corrected to require explicit namespace configuration before it can act on any clusters. It can no longer run unscoped against cluster-api based clusters by default. \(DONE\) * Deployment Configuration Safety: The deployment system has been reconfigured to fail deployments when required configuration files are absent, replacing the previous behavior that allowed deployments to proceed silently with an incomplete configuration. \(DONE\) * Cluster Cleanup: A recovery script was developed, validated in staging, and executed across all affected production clusters. All unintended installations and updates have been removed from all affected clusters. \(DONE, completed May 15, 2026\) **Structural Improvements:** * Pipeline Configuration Gate: We are implementing a mandatory verification step in the CI/CD pipeline to confirm that all required configuration files have been successfully propagated to the target branch before any deployment is triggered. \(ETA: Q3 2026\) * Change Management Controls: We are enforcing additional mandatory peer review and change management approvals for all deployments that touch production systems independent of change size, closing the process gap that allowed this change to proceed without the required _additional_ oversight. This is done in addition to the already established Change Approval Board review process. \(ETA: Q3 2026\) **Procedural Improvements** Affected customers were not properly kept informed about the cleanup efforts which compounded the uncertainty. Improvements to the incident communication process are currently being implemented into the wider incident management process ensuring that scalable options exist for tech teams to inform individually affected customers in a timely manner about the status of post-incident remediation efforts. \(ETA: Q3 2026\) **Closing remarks** We recognize that unintended modification of cluster configurations represents a failure in the controls that should prevent our internal operations from influencing customer environments. The fixes now in place and planned will close the immediate technical and procedural shortcomings. The structural measures will further harden the Change Process reducing the likelihood and potential fallout of maintenance related incidents further. Thank you for your patience during the incident and the remediation efforts.