Crusoe incident

[US-East] Transient Disconnection of Network Disks Impacting Compute and GPU Instances

Crusoe experienced a major incident on March 1, 2025 affecting us-east1, lasting 3d 3h. The incident has been resolved; the full update timeline is below.

Started: Mar 01, 2025, 12:26 PM UTC
Resolved: Mar 04, 2025, 03:33 PM UTC
Duration: 3d 3h
Detected by Pingoru: Mar 01, 2025, 12:26 PM UTC

Affected components

us-east1

Update timeline

investigating Mar 01, 2025, 12:26 PM UTC

We are currently investigating an issue in the us-east region that is causing a few instances to be unreachable.
investigating Mar 01, 2025, 12:27 PM UTC

We are continuing to investigate this issue.
identified Mar 02, 2025, 05:15 PM UTC

We've identified a transient disconnection of network disks impacting compute and GPU instances in the US-East region, which is causing some VMs to become unresponsive (including over SSH). Our team is actively working to mitigate this issue. You may notice some interruptions to compute instances during this time.
identified Mar 03, 2025, 06:36 PM UTC

We have identified a potential trigger for a kernel bug with the assistance of our storage vendor and are actively investigating the conditions that trigger the issue, as well as potential remediation steps.
identified Mar 03, 2025, 06:41 PM UTC

With the assistance of our storage vendor, we have identified a potential trigger for a kernel bug that causes certain disk connection failures, which in turn cause VMs to go into an unresponsive state. We are actively investigating the conditions that trigger the issue, as well as potential remediation steps.
identified Mar 03, 2025, 09:15 PM UTC

With the assistance of our storage vendor, we have identified a network trigger for a kernel bug that causes certain disk connection failures, which in turn cause VMs to go into an unresponsive state. We are actively investigating the conditions that trigger the issue, as well as potential remediation steps.
identified Mar 03, 2025, 11:31 PM UTC

We resolved a network issue that caused temporary inaccessibility and availability problems for some VMs. We are currently identifying impacted servers and are proactively contacting affected customers to migrate them to remediated servers.
monitoring Mar 04, 2025, 12:53 AM UTC

We have implemented preventive measures to mitigate recurrence of this issue and are actively monitoring for any further transient disconnects.
resolved Mar 04, 2025, 03:33 PM UTC

This incident is now resolved. If you have any questions or experience any further issues, please reach out to our support team.