Wasabi incident

System Errors Affecting All Regions

Wasabi experienced a notice-level incident on August 6, 2024 affecting US-Central-1 (Texas), US-East-1 (N. Virginia), and 8 more components, lasting 57m. The incident has been resolved; the full update timeline is below.

Started: Aug 06, 2024, 10:56 AM UTC
Resolved: Aug 06, 2024, 11:54 AM UTC
Duration: 57m
Detected by Pingoru: Aug 06, 2024, 10:56 AM UTC

Affected components

US-Central-1 (Texas)
US-East-1 (N. Virginia)
US-East-2 (N. Virginia)
US-West-1 (Oregon)
CA-Central-1 (Toronto)
EU-Central-1 (Amsterdam)
EU-Central-2 (Frankfurt)
EU-West-1 (London)
EU-West-2 (Paris)
AP-Northeast-1 (Tokyo)

Update timeline

investigating Aug 06, 2024, 10:56 AM UTC

We are currently investigating issues with logging into our Web Console and errors in Wasabi regions.

resolved Aug 06, 2024, 11:54 AM UTC

The operations team has resolved this issue and restored service to normal levels. We will post a postmortem shortly.

postmortem Aug 16, 2024, 04:09 PM UTC

Between 10:32 UTC on 2024-08-06 and 20:40 UTC on 2024-08-07, we experienced three events affecting S3 and user services in all regions.

Starting at 10:32 UTC on 2024-08-06, our queueing service reached full capacity, which impacted our database cache and caused it to become unresponsive. The Wasabi Operations team restarted the primary database in an attempt to clear all stale database connections while simultaneously clearing the queueing service queue. When this action failed to bring the database into a fully operational state, the secondary database instance was promoted to primary. At 11:20 UTC the S3 service was fully operational again. Between 13:17 UTC and 13:23 UTC, Operations restarted the database once more in order to fully incorporate our queueing service library.

Between 02:55 UTC and 03:35 UTC on 2024-08-07, a second event occurred when our Operations team identified a configuration issue within the queueing service and the previously promoted secondary database instance. This configuration issue was causing timeouts on user services such as our Web Console, WAC API, and WACM interface. Our Operations team then promoted the original primary database back to production, alleviating these issues. There was no impact to S3 services during this event.

Between 20:30 UTC and 20:44 UTC on 2024-08-07, a third event occurred when an automation cluster was not being seen by our automation service, causing a small decrease in accepted traffic to our S3 vaults. Our Operations team then recreated and redeployed this cluster, fully restoring the S3 service.