AlphaVPS incident

nuestorage07 - Catastrophic RAID failure


AlphaVPS reported a notice-level incident on December 20, 2024 affecting KVM Services in Nuremberg. It lasted 12d 4h. The incident has been resolved; the full update timeline is below.

Started
Dec 20, 2024, 07:49 AM UTC
Resolved
Jan 01, 2025, 12:08 PM UTC
Duration
12d 4h
Detected by Pingoru
Dec 20, 2024, 07:49 AM UTC
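
The duration above follows from the Started and Resolved timestamps. A minimal sketch of that arithmetic, assuming the timestamp format shown on this page and an English locale for month names; the helper names `parse_utc` and `format_duration` are illustrative and not part of any vendor tooling:

```python
from datetime import datetime, timezone

# Format of the timestamps as printed on this page, without the " UTC" suffix.
FMT = "%b %d, %Y, %I:%M %p"


def parse_utc(stamp: str) -> datetime:
    """Parse a stamp such as 'Dec 20, 2024, 07:49 AM UTC' into an aware datetime."""
    return datetime.strptime(stamp.removesuffix(" UTC"), FMT).replace(tzinfo=timezone.utc)


def format_duration(start: datetime, end: datetime) -> str:
    """Render the elapsed time as whole days and whole hours, dropping minutes."""
    delta = end - start
    hours = delta.seconds // 3600
    return f"{delta.days}d {hours}h"


started = parse_utc("Dec 20, 2024, 07:49 AM UTC")
resolved = parse_utc("Jan 01, 2025, 12:08 PM UTC")
duration = format_duration(started, resolved)
print(duration)
```

Minutes are truncated rather than rounded, so the remaining 19 minutes do not appear in the result. `str.removesuffix` needs Python 3.9 or newer.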

Affected components

KVM Services in Nuremberg

Update timeline

  1. investigating Dec 20, 2024, 07:49 AM UTC

    We've been alerted to a catastrophic RAID failure on our nuestorage07 node. Earlier this morning, the controller started flapping, which pushed the I/O load above 90. This made every VM on the host hypervisor unresponsive. We've replaced the RAID adapter and are moving on to a recovery attempt. At this point, we're unsure whether recovery will be possible.

  2. identified Dec 20, 2024, 09:31 AM UTC

    We're currently working on bringing the node back online in a stable state. We estimate that it will take 3-4 more hours before we can start booting the virtual machines and assessing the damage. Please stand by for further updates.

  3. identified Dec 20, 2024, 01:54 PM UTC

    We are starting to restore individual KVM machines. Further updates to follow.

  4. identified Dec 20, 2024, 09:04 PM UTC

    We're continuing to work on restoring VMs. If your VM is affected, please wait until it has booted up.

  5. monitoring Dec 21, 2024, 02:01 PM UTC

    We'll be booting up the remaining virtual servers within the next hour and will continue monitoring the hypervisor.

  6. monitoring Dec 21, 2024, 03:44 PM UTC

    Unfortunately, as we booted up the majority of the affected VMs, the node became unstable again, with flush operations starting to queue up. We're looking into alternative solutions at the moment. Further updates to come.

  7. monitoring Dec 21, 2024, 11:35 PM UTC

    We've replaced additional hardware. After performing a manual fsck on each virtual machine's virtual drive, we booted up all affected VPSes. As of now, all services are restored. We'll be monitoring the situation over the next few days.

  8. resolved Jan 01, 2025, 12:08 PM UTC

    We've monitored the situation for the past 11 days. As of now, everything is stable, and we're closing the issue.