Crusoe incident

Service Degradation in us-east1-a Region due to Power Disruption

Crusoe experienced a major incident on August 20, 2025 affecting us-east1 and Infiniband Networks, lasting 1d 6h. The incident has been resolved; the full update timeline is below.

Started: Aug 20, 2025, 05:09 PM UTC
Resolved: Aug 21, 2025, 11:35 PM UTC
Duration: 1d 6h
Detected by Pingoru: Aug 20, 2025, 05:09 PM UTC

Affected components

us-east1, Infiniband Networks

Update timeline

investigating Aug 20, 2025, 05:09 PM UTC

We're investigating a service degradation in our us-east1-a region, triggered by a facility power disruption at our data center. The primary impact is to the Infiniband networking fabric, which may cause intermittent errors or failures for multi-node, distributed workloads. Some customers may also experience individual virtual machines becoming unavailable. Our engineering teams are working to identify all affected resources, stabilize the affected systems, and mitigate the risk of further disruption. We are coordinating with our data center provider to support their remediation efforts and restore full service resiliency as quickly as possible. We apologize for any impact this is causing.

investigating Aug 21, 2025, 02:32 AM UTC

We are continuing to investigate and mitigate the service degradation affecting our us-east1-a region, following a facility power disruption at our data center. Our teams remain in close coordination with the data center provider as they work to fully restore services. Recovery of critical systems remains our top priority. We sincerely apologize for the ongoing impact and appreciate your continued patience as we work to resolve the issue.

monitoring Aug 21, 2025, 07:59 PM UTC

We have successfully mitigated the issue affecting the us-east1-a region. The facility power disruption has been addressed, and the impacted Infiniband networking fabric and associated systems have returned to normal operation. All affected services are now stable, and full functionality has been restored. Our teams will continue to monitor the region closely to ensure continued stability. We appreciate your patience during this incident and apologize again for any disruption it may have caused.

resolved Aug 21, 2025, 11:35 PM UTC

This incident has been resolved.