Firebolt incident

Engines may fail to start and stop

Firebolt experienced a major incident on January 26, 2022 affecting Engines, lasting 3h 45m. The incident has been resolved; the full update timeline is below.

Started: Jan 26, 2022, 01:02 PM UTC
Resolved: Jan 26, 2022, 04:48 PM UTC
Duration: 3h 45m
Detected by Pingoru: Jan 26, 2022, 01:02 PM UTC

Affected components

Engines

Update timeline

investigating Jan 26, 2022, 01:02 PM UTC

Engines may fail to start and stop.
investigating Jan 26, 2022, 01:17 PM UTC

We are continuing to investigate this issue.
identified Jan 26, 2022, 01:19 PM UTC

The issue has been identified, and a fix is being implemented.
identified Jan 26, 2022, 01:44 PM UTC

We are continuing to work on a fix for the issue.
identified Jan 26, 2022, 01:54 PM UTC

The issue has been identified and a fix is being implemented. It affects only engines in the US-EAST-1 region.
identified Jan 26, 2022, 02:16 PM UTC

We are still experiencing the issue and are continuing to work on a fix.
identified Jan 26, 2022, 03:28 PM UTC

We have fixed a portion of the issue with our engines and are now working on a complete fix. This issue still affects only engines in US-EAST-1. Starting or stopping engines in that region will not work until we resolve the issue. We will update you in 30 minutes, or earlier if the issue is fixed.
identified Jan 26, 2022, 03:41 PM UTC

We have identified a fix and are deploying it. We believe the process should take 30 minutes to 1 hour, and we will update you at the 30-minute mark on whether it is still running or complete.
identified Jan 26, 2022, 04:04 PM UTC

Our fix is still underway and looking good. We expect the issue to be resolved in another 15-30 minutes. Please do not run any ingestion during this time. We will update you shortly once complete.
monitoring Jan 26, 2022, 04:20 PM UTC

The fix has been deployed successfully and we are now testing it to confirm that it is effective. This should take ~10 minutes. Please note that we found that any ingestion on the impacted engines during this timeframe may have been affected as well. We are also testing those findings and will update you on what actions, if any, need to be taken.
monitoring Jan 26, 2022, 04:43 PM UTC

Testing is complete and the issue is resolved. A full update on the issue follows. Please note the restoration time, and let us know if you notice any residual impact on your systems. At approximately 12:30 UTC we experienced an outage in our master cluster. Though we quickly rectified this particular issue, it disrupted several internal services; most critically, one of our most important storage services entered an unresponsive state. We made several attempts to restore the health of this system, but in the end we had to revert to a full restoration, which was successful. That restoration is timestamped at 12:07 UTC, meaning any ingestion done after that time will have been lost.
resolved Jan 26, 2022, 04:48 PM UTC

This incident has been resolved.