MageMojo incident
One node in USEast under emergency maintenance
MageMojo experienced a major incident on June 4, 2021 affecting Webscale STRATUS - Northern Virginia, lasting 2h 10m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Jun 04, 2021, 04:28 PM UTC
We are currently investigating this issue.
- identified Jun 04, 2021, 04:51 PM UTC
The issue has been identified, and we are attempting to bring the node back online.
- monitoring Jun 04, 2021, 06:34 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Jun 04, 2021, 06:38 PM UTC
This incident has been resolved.
- postmortem Jun 07, 2021, 06:19 PM UTC
Our investigation concluded that a kernel bug affecting the ZFS filesystem caused the issue on one of the nodes in our fleet. The problem appears similar to a previously reported bug, [https://github.com/openzfs/zfs/issues/10642](https://github.com/openzfs/zfs/issues/10642). We captured kernel stack traces during the event, and a preventive fix is under investigation.