Storj experienced a major incident on August 7, 2025 affecting US1 - Linksharing, US1 - Gateway, and one other component, lasting 1h 39m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Aug 07, 2025, 09:25 PM UTC
We are investigating upload/download failures on US1.
- identified Aug 07, 2025, 09:30 PM UTC
The issue has been identified and a fix is being implemented.
- monitoring Aug 07, 2025, 09:55 PM UTC
A fix has been implemented and we are monitoring the results.
- monitoring Aug 07, 2025, 10:31 PM UTC
We are investigating another increase in error rates.
- monitoring Aug 07, 2025, 10:45 PM UTC
Error rates are normal and we are continuing to monitor for further issues.
- resolved Aug 07, 2025, 11:05 PM UTC
This incident has been resolved.
- postmortem Aug 22, 2025, 07:24 PM UTC
### Summary of the Storj US1 Service Disruption on August 7, 2025

#### Overview

On August 7, 2025, at approximately 21:04 UTC, the Storj US1 satellite experienced a performance degradation due to an unusually high volume of concurrent uploads to the same objects. The incident affected only the US1 satellite; the AP1 and EU1 satellites remained fully operational throughout the event.

#### Root Cause

The primary root cause was an exceptionally high volume of simultaneous uploads targeting the same objects, which created a bottleneck in the database through transaction contention. This led to a cascade of issues, including database locking, request timeouts, and connection pool exhaustion. Database transactions hold locks on the affected data, preventing other operations from accessing or modifying that data until the transaction is committed or rolled back. The growing number of long-running transactions eventually led to request timeouts.

#### Impact

The incident affected customers and applications relying on the Storj US1 satellite for storage and retrieval operations. The AP1 and EU1 satellites were not affected by this incident and continued to operate normally.

#### Timeline

- 21:04 UTC: Database transaction locks started to increase.
- 21:14 UTC: The on-call team received a page and started investigating the issue.
- 21:55 UTC: A fix was implemented and the on-call team started monitoring the results.
- 22:31 UTC: The on-call team started investigating another increase in error rates.
- 22:45 UTC: Error rates trended back down to normal and the on-call team continued to monitor for further issues.
- 23:05 UTC: Operations were fully restored.

#### Remediation and Prevention

To address the issue, we periodically reset the state of connections, thus avoiding a cascading growth of contention-related errors. To prevent similar incidents from occurring in the future, we implemented the following measures:

1. Identified and resolved sources of contention between database transactions by implementing appropriate code updates.
2. Implemented additional monitoring and alerting mechanisms to detect and notify the team of increased levels of database locks.
3. Conducted a thorough post-mortem analysis to identify any other potential improvements to our processes and systems, including building tools that ease or eliminate the need for manual maintenance.

We apologize for any impact this service disruption may have caused our customers and users. We are committed to learning from this incident and continuing to improve the reliability and resilience of our platform.