Crusoe incident
[US-East] Transient Disconnection of Network Disks Impacting Compute and GPU Instances
Crusoe experienced a major incident on March 1, 2025, affecting us-east1 and lasting 3d 3h. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Mar 01, 2025, 12:26 PM UTC
We are currently investigating an issue in the us-east region that is causing a few instances to be unreachable.
- investigating Mar 01, 2025, 12:27 PM UTC
We are continuing to investigate this issue.
- identified Mar 02, 2025, 05:15 PM UTC
We've identified a transient disconnection of network disks impacting compute and GPU instances in the US-East region, which is causing some VMs to become unresponsive (including over SSH). Our team is actively working to mitigate this issue. You may notice some interruptions to compute instances during this time.
- identified Mar 03, 2025, 06:36 PM UTC
We have identified a potential trigger for a kernel bug with the assistance of our storage vendor and are actively investigating the conditions that trigger the issue, as well as potential remediation steps.
- identified Mar 03, 2025, 06:41 PM UTC
With the assistance of our storage vendor, we have identified a potential trigger for a kernel bug that causes certain disk connection failures, which in turn cause VMs to go into an unresponsive state. We are actively investigating the conditions that trigger the issue, as well as potential remediation steps.
- identified Mar 03, 2025, 09:15 PM UTC
With the assistance of our storage vendor, we have identified a network trigger for a kernel bug that causes certain disk connection failures, which in turn cause a VM to go into an unresponsive state. We are actively investigating the conditions that trigger the issue, as well as potential remediation steps.
- identified Mar 03, 2025, 11:31 PM UTC
We resolved a network issue that caused temporary inaccessibility and availability problems for some VMs. We are currently identifying impacted servers and are proactively contacting affected customers to migrate them to remediated servers.
- monitoring Mar 04, 2025, 12:53 AM UTC
We have implemented preventive measures to mitigate recurrence of this issue and are actively monitoring for any further transient disconnects.
- resolved Mar 04, 2025, 03:33 PM UTC
This incident is now resolved. If you have any questions or experience any further issues, please reach out to our support team at [email protected].