AlphaVPS experienced a notice incident on December 20, 2024 affecting KVM Services in Nuremberg, lasting 12d 4h. The incident has been resolved; the full update timeline is below.
Affected components
- KVM Services (Nuremberg)
Update timeline
- investigating Dec 20, 2024, 07:49 AM UTC
We've been alerted to a catastrophic RAID failure on our nuestorage07 node. Earlier this morning, the controller started flapping, which pushed the I/O load above 90 and made every VM on the host hypervisor unresponsive. We've replaced the RAID adapter and are now attempting a recovery. At this point, we're unsure whether recovery will be possible.
- identified Dec 20, 2024, 09:31 AM UTC
We're currently working on bringing the node back online in a stable state. We estimate that it will take 3-4 more hours before we can start booting the virtual machines and assessing the damage. Please stand by for further updates.
- identified Dec 20, 2024, 01:54 PM UTC
We are starting to restore individual KVM machines. Further updates to follow.
- identified Dec 20, 2024, 09:04 PM UTC
We're continuing to work on restoring VMs. If your VM is affected, please wait until it's booted up.
- monitoring Dec 21, 2024, 02:01 PM UTC
We'll be booting up the remaining virtual servers within the next hour and will continue monitoring the hypervisor.
- monitoring Dec 21, 2024, 03:44 PM UTC
Unfortunately, once we had booted up the majority of the affected VMs, the node became unstable again and flush operations started to queue up. We're looking into alternative solutions at the moment. Further updates to come.
- monitoring Dec 21, 2024, 11:35 PM UTC
We've replaced additional hardware. After running a manual fsck on each virtual machine's virtual drive, we booted up all affected VPSes. As of now, all services are restored. We'll be monitoring the situation over the next few days.
- resolved Jan 01, 2025, 12:08 PM UTC
We've monitored the situation for the past 11 days and, as of now, everything is stable. We're closing the issue.