IONOS Cloud incident

Performance Degradation Compute FRA

Minor Resolved
Started
Apr 08, 2026, 06:48 AM UTC
Resolved
Apr 08, 2026, 07:50 PM UTC
Duration
13h 2m
Detected by Pingoru
Apr 08, 2026, 06:48 AM UTC

Affected components

Compute
Managed Kubernetes

Update timeline

  1. investigating Apr 08, 2026, 06:48 AM UTC

    We are currently investigating performance degradation affecting compute components in our FRA DC. This issue is impacting a subset of Virtual Machines (VMs) and Kubernetes Clusters. We will provide further updates as our investigation progresses.

  2. identified Apr 08, 2026, 08:27 AM UTC

    We have identified an increase in CPU steal time on affected hosts. Our Compute team has pinpointed a likely culprit and is testing a potential mitigation to verify its effectiveness before rollout.

  3. identified Apr 08, 2026, 09:54 AM UTC

    Our compute team has found another factor negatively impacting CPU performance for affected VMs. We are currently testing a potential transparent resolution for the problematic CPU affinity setting.

  4. identified Apr 08, 2026, 10:12 AM UTC

    Our compute team has successfully tested the proposed fix for the CPU core affinity and is preparing a rollout. We will monitor the results.

  5. monitoring Apr 08, 2026, 11:28 AM UTC

    The adjustment was rolled out, and our Compute team is observing CPU steal time dropping on affected hosts. We are monitoring the situation. Our tech teams are preparing another rollout that should improve performance further.

  6. monitoring Apr 08, 2026, 03:29 PM UTC

    The second configuration update rollout is currently in progress, and we have confirmed initial improvements related to CPU performance. Due to the size of the fleet, we expect the rollout to take some time to complete. Throughout the process, customers will see performance gains as soon as the specific hosts supporting their workloads have been updated. We will provide a final update once the rollout is finished.

  7. monitoring Apr 08, 2026, 05:58 PM UTC

    Our Compute team has confirmed that the fix has been rolled out to the majority of affected hosts. We are currently finishing the rollout and will provide an update once the remaining hosts in the affected clusters are covered.

  8. resolved Apr 08, 2026, 07:50 PM UTC

    We have successfully completed the rollout to all remaining hosts and are closing this incident. A Root Cause Analysis is currently being conducted by the Compute Team and will be shared here upon completion.

  9. postmortem Apr 21, 2026, 06:04 PM UTC

    **What happened**

    Virtual machines with dedicated CPU allocations in the Frankfurt data center began exhibiting abnormally high CPU steal time, indicating that the hypervisor was unable to provide the requested CPU resources. The degradation occurred on multiple customer instances and persisted even after guest operating system reboots and configuration changes, leading to performance degradation in customer workloads.

    **How was that possible? (Root cause)**

    The issue was caused by a regression introduced during recent improvements to the virtual machine checkpointing mechanism. A code change intended to optimize the checkpoint/restore process inadvertently affected the live migration code path, which is used when VMs are moved between physical servers.

    The hypervisor uses CPU pinning to ensure that guest VM threads run on the CPU cores dedicated to them. When this pinning is not properly configured, the VM's processes fall back to inheriting the CPU assignment of their parent process. In this case, the parent process's assignment included core 0, a core normally reserved exclusively for the host operating system that should not be allocated to guest workloads.

    The result was a failure of the resource allocation guarantees:

    * VMs with dedicated vCPU allocations could not be pinned to their assigned cores
    * The VMs' virtual CPU threads instead competed for time on an oversubscribed, host-reserved core
    * The hypervisor scheduler could not guarantee the VMs' promised CPU time

    High CPU steal time resulted despite adequate physical CPU resources being available. The issue affected VMs in the Frankfurt infrastructure that underwent live migration during the deployment window in which the problematic code was active.

    **How we prevent recurrence**

    _Enhanced CPU Pinning Validation_: The virtualization infrastructure codebase has been updated to restore proper CPU pinning for all live migration operations. (DONE)

    _Strengthened Pre-deployment Testing_: Enhance the validation procedures for virtualization infrastructure changes to catch CPU allocation anomalies before code is deployed to production. (DONE)

    _Automated Abnormal Steal Time Alerting_: Implement automated monitoring and alerting to detect abnormal CPU steal time on VMs with dedicated vCPU allocations, enabling faster detection of similar configuration regressions. (Within Q2 2026)

    _Enhanced Post-rollout Monitoring_: Extend the post-deployment monitoring and assessment window to increase the likelihood that anomalies are spotted and correctly correlated with a change. (DONE)

    **Closing remark**

    The incident resulted in measurable performance degradation for customers over an extended period. That the increase in steal time initially went unnoticed highlighted a gap in our alerting and monitoring setup. The ambiguity of the symptoms, characterized by general performance issues in some guests and an apparent lack of common denominators among affected systems, meant that initial incident reports and existing indicators were not understood, correlated, and attributed in a timely fashion. The corrective actions outlined above address both the immediate defect and the systemic factors that allowed it to surface. These measures are designed to prevent recurrence and to significantly reduce the time to detection and resolution in the future. We thank our affected customers and partners for their patience and constructive collaboration throughout this incident.
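The pinning fallback described in the root cause can be illustrated with a small sketch, assuming a Linux host and Python's standard `os.sched_setaffinity` API. This is not IONOS's hypervisor code; it only demonstrates the underlying mechanism: a thread that is never explicitly pinned keeps the CPU mask inherited from its parent, which is how unpinned vCPU threads ended up on the host-reserved core 0.

```python
import os

# Linux-only illustration of the failure mode: an unpinned process simply
# inherits its parent's CPU mask. In the incident, that inherited mask
# included the host-reserved core 0, so vCPU threads competed for an
# oversubscribed core instead of running on their dedicated ones.

inherited = os.sched_getaffinity(0)        # mask inherited from the parent
print("inherited mask:", sorted(inherited))

if len(inherited) > 1:
    dedicated = max(inherited)             # stand-in for a VM's dedicated core
    os.sched_setaffinity(0, {dedicated})   # explicit pinning, as vCPU pinning should do
    print("pinned mask:", sorted(os.sched_getaffinity(0)))
    os.sched_setaffinity(0, inherited)     # restore the original mask
```

Without the explicit `sched_setaffinity` call, the scheduler is free to place the thread anywhere in the inherited mask, which is exactly the guarantee the regression broke.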
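The steal-time alerting item in the action plan could be built on counters the Linux kernel already exposes. Below is a minimal sketch assuming the standard `/proc/stat` field layout documented in proc(5); the function name `steal_fraction` and the 20% figure in the example are illustrative, not part of IONOS's actual monitoring.

```python
def steal_fraction(stat_line: str) -> float:
    """Steal-time share from an aggregate 'cpu' line of /proc/stat.

    Fields after the label (per proc(5)): user, nice, system, idle,
    iowait, irq, softirq, steal, guest, guest_nice. A real monitor would
    sample twice and compare deltas rather than use lifetime totals.
    """
    fields = [int(v) for v in stat_line.split()[1:]]
    steal = fields[7]                 # 8th field: ticks stolen by the hypervisor
    total = sum(fields[:8])           # guest ticks are already counted in user
    return steal / total

# Example: 20 of 100 ticks stolen. A VM with dedicated vCPUs should sit
# near zero, so a sustained value like this would trip an alert.
line = "cpu  10 0 10 60 0 0 0 20 0 0"
print(f"steal share: {steal_fraction(line):.0%}")  # prints "steal share: 20%"
```

On a healthy host with properly pinned dedicated vCPUs, this fraction stays near zero; the incident's signature was a sustained elevation on otherwise uncontended hardware.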
