MageMojo incident

One node in USEast under emergency maintenance


MageMojo experienced a major incident on June 4, 2021 affecting Webscale STRATUS - Northern Virginia, lasting 2h 10m. The incident has been resolved; the full update timeline is below.

Started
Jun 04, 2021, 04:28 PM UTC
Resolved
Jun 04, 2021, 06:38 PM UTC
Duration
2h 10m
Detected by Pingoru
Jun 04, 2021, 04:28 PM UTC

Affected components

Webscale STRATUS - Northern Virginia

Update timeline

  1. investigating Jun 04, 2021, 04:28 PM UTC

    We are currently investigating this issue.

  2. identified Jun 04, 2021, 04:51 PM UTC

    The issue has been identified and we are attempting to bring the node back online.

  3. monitoring Jun 04, 2021, 06:34 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Jun 04, 2021, 06:38 PM UTC

    This incident has been resolved.

  5. postmortem Jun 07, 2021, 06:19 PM UTC

    An investigation concluded that a kernel bug affecting the ZFS filesystem caused the failure of one of the nodes in our fleet. The problem appears similar to the previously reported bug [https://github.com/openzfs/zfs/issues/10642](https://github.com/openzfs/zfs/issues/10642). We captured kernel stack traces during the event, and a preventive solution is under investigation.