Endless Group incident

VM Host Disk Failure (Was: Continued Maintenance)

Critical, Resolved

Endless Group experienced a critical incident on May 27, 2023, affecting DirectAdmin, Homepage/Signups, and 8 other components, lasting 4d 8h. The incident has been resolved; the full update timeline is below.

Started
May 27, 2023, 06:06 AM UTC
Resolved
May 31, 2023, 03:04 PM UTC
Duration
4d 8h
Detected by Pingoru
May 27, 2023, 06:06 AM UTC
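
The duration figure can be checked against the Started and Resolved timestamps. The following is a minimal sketch, not the status page's own calculation: the helper names `parse_utc` and `format_duration` are hypothetical, and dropping leftover minutes (rather than rounding) is an assumption about how the page arrives at "4d 8h".

```python
from datetime import datetime, timezone

# Timestamp layout used on this page, e.g. "May 27, 2023, 06:06 AM UTC".
# Note: %b and %p assume an English (C) locale.
STAMP_FORMAT = "%b %d, %Y, %I:%M %p"


def parse_utc(stamp: str) -> datetime:
    """Parse a status-page timestamp into a timezone-aware UTC datetime."""
    naive = datetime.strptime(stamp.replace(" UTC", ""), STAMP_FORMAT)
    return naive.replace(tzinfo=timezone.utc)


def format_duration(start: datetime, end: datetime) -> str:
    """Render the elapsed time as 'Nd Nh', dropping leftover minutes (assumed)."""
    total_minutes = int((end - start).total_seconds() // 60)
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, _minutes = divmod(remainder, 60)
    return f"{days}d {hours}h"


started = parse_utc("May 27, 2023, 06:06 AM UTC")
resolved = parse_utc("May 31, 2023, 03:04 PM UTC")
print(format_duration(started, resolved))
```

The exact elapsed time is 4 days, 8 hours, and 58 minutes, so the reported "4d 8h" is consistent with truncating to whole hours.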

Affected components

DirectAdmin
Homepage/Signups
Apache
Helpdesk
FTP
Email Server
MySQL
Exim
Billing
Customer VPS'

Update timeline

  1. monitoring May 25, 2023, 03:02 AM UTC

    The previous maintenance tasks have not yet been completed.

  2. investigating May 25, 2023, 03:20 AM UTC

    We are experiencing a problem where one of our host machines is unable to successfully reboot following the upgrade. We are working on this issue as fast as possible.

  3. identified May 25, 2023, 11:25 PM UTC

    We have identified the problem as a failing disk in one of our host machines. We are recovering the machine. Because the recovery may use a large portion of our in-network bandwidth, please expect degraded performance on your sites during this time.

  4. identified May 27, 2023, 05:38 AM UTC

    We will be rebooting the remaining host system in order to finalize the update.

  5. identified May 27, 2023, 06:06 AM UTC

    We are continuing to work on restoring the failed host. Most customer systems should be back online at this time.

  6. monitoring May 28, 2023, 08:17 AM UTC

    All host systems have been successfully restored and confirmed to be operational. New drives were installed in the failing machine. Additionally, our new host system has been joined to the cluster. We are now monitoring to ensure that all components are operating normally. All customer systems should be online at this time. If you are experiencing an issue with your system, please contact us using our support channels.

  7. resolved May 31, 2023, 03:04 PM UTC

    We have been monitoring our host systems and have not observed any further issues. We consider this incident to be resolved. If you are still experiencing problems, please contact us using our support channels.