Crusoe incident

VM creation and networking failure for A100 InfiniBand type VMs in us-east region

Major · Resolved

Crusoe experienced a major incident on August 1, 2025 affecting us-east1, lasting 3h 2m. The incident has been resolved; the full update timeline is below.

Started: Aug 01, 2025, 03:18 AM UTC
Resolved: Aug 01, 2025, 06:21 AM UTC
Duration: 3h 2m
Detected by Pingoru: Aug 01, 2025, 03:18 AM UTC

Affected components

us-east1

Update timeline

  1. investigating Aug 01, 2025, 03:18 AM UTC

    We have identified an issue that prevents new or restarted Virtual Machines from booting on our A100 InfiniBand hardware fleet. New VM provisioning requests for this hardware type will fail, and any existing VM on an A100 InfiniBand node that is stopped and started (or rebooted) will fail to come back online. VMs that are currently running are not affected and will continue to operate normally. We advise customers to avoid rebooting critical workloads on this hardware until a resolution is in place. Our engineering teams are investigating the root cause and working to restore normal provisioning as quickly as possible.

  2. identified Aug 01, 2025, 03:44 AM UTC

    The issue has been identified, and we have tested a fix internally. We are now rolling out the fix to our A100 InfiniBand servers.

  3. monitoring Aug 01, 2025, 05:06 AM UTC

    A fix has been implemented, and we are monitoring the environment.

  4. resolved Aug 01, 2025, 06:21 AM UTC

    This incident is now resolved.
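
To see how the incident time was split across the four updates, the timeline timestamps can be turned into per-phase durations. The following is a minimal sketch, not anything published by Crusoe: the names `TIMELINE`, `parse_utc`, and `phase_minutes` are made up for illustration, and the only inputs are the minute-resolution timestamps listed above. At that resolution the phases sum to 3h 3m; the reported duration of 3h 2m presumably comes from finer-grained timestamps that are not shown here.

```python
from datetime import datetime, timezone

# Timestamps copied from the update timeline (minute resolution, all UTC).
TIMELINE = [
    ("investigating", "Aug 01, 2025, 03:18 AM"),
    ("identified", "Aug 01, 2025, 03:44 AM"),
    ("monitoring", "Aug 01, 2025, 05:06 AM"),
    ("resolved", "Aug 01, 2025, 06:21 AM"),
]


def parse_utc(stamp):
    """Parse a status-page timestamp such as 'Aug 01, 2025, 03:18 AM' as UTC."""
    parsed = datetime.strptime(stamp, "%b %d, %Y, %I:%M %p")
    return parsed.replace(tzinfo=timezone.utc)


def phase_minutes(timeline):
    """Return (status, minutes until the next update) for each non-final status."""
    times = [(status, parse_utc(stamp)) for status, stamp in timeline]
    return [
        (status, int((nxt - start).total_seconds() // 60))
        for (status, start), (_, nxt) in zip(times, times[1:])
    ]


phases = phase_minutes(TIMELINE)
total = sum(minutes for _, minutes in phases)

for status, minutes in phases:
    print(f"{status}: {minutes} min")
print(f"total: {total // 60}h {total % 60}m")
```

Read this way, the longest stretch was the 82 minutes between identifying the issue and finishing the fix rollout, followed by 75 minutes of monitoring before the incident was closed.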