Firebolt experienced a major incident on January 26, 2022 affecting Engines, lasting 3h 45m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Jan 26, 2022, 01:02 PM UTC
Engines may fail to start and stop.
- investigating Jan 26, 2022, 01:17 PM UTC
We are continuing to investigate this issue.
- identified Jan 26, 2022, 01:19 PM UTC
The issue has been identified, and a fix is being implemented.
- identified Jan 26, 2022, 01:44 PM UTC
We are continuing to work on a fix for the issue.
- identified Jan 26, 2022, 01:54 PM UTC
The issue has been identified and a fix is being implemented. The issue affects only engines in the US-EAST-1 region.
- identified Jan 26, 2022, 02:16 PM UTC
We are still experiencing the issue, and continuing to work on a fix.
- identified Jan 26, 2022, 03:28 PM UTC
We have fixed a portion of the issue with our engines and are now working on a complete fix. This issue still affects only engines in US-EAST-1. Starting and stopping engines in that region will not work until we resolve the issue. We will update you in 30 minutes, or earlier if the issue is fixed.
- identified Jan 26, 2022, 03:41 PM UTC
We have identified a fix and are deploying it. We believe the process should take 30 minutes to 1 hour, and we will update you at the 30-minute mark whether it is still running or complete.
- identified Jan 26, 2022, 04:04 PM UTC
Our fix is still underway and looking good. We expect the issue to be resolved in 15-30 more minutes. Please do not run any ingestion during this time. We will update you shortly once complete.
- monitoring Jan 26, 2022, 04:20 PM UTC
The fix has been deployed successfully and we are now testing it to confirm its efficacy. This should take ~10 minutes. Please note we found that any ingestion on the impacted engines during the incident timeframe may have been affected as well. We are similarly testing those findings, and will update on what actions need to be taken, if any.
- monitoring Jan 26, 2022, 04:43 PM UTC
Testing is complete and the issue is resolved. A full update on the issue is below. Please note the restoration time, and let us know if you notice any residual impacts to your systems. At approximately 12:30 UTC we experienced an outage in our master cluster. Though we quickly rectified this particular issue, it disrupted several internal services - most critically, one of our most important storage services entered an unresponsive state. We made several attempts to regain the health of this system but in the end had to revert to a full restoration, which was successful. That restoration is timestamped at 12:07 UTC, meaning any ingestion done after that time has been lost.
- resolved Jan 26, 2022, 04:48 PM UTC
This incident has been resolved.