Labrador CMS experienced a critical incident on January 12, 2023 affecting Labrador Editor, lasting 3h 37m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Jan 12, 2023, 08:50 AM UTC
The CMS-rig is currently unresponsive for many users; we are investigating.
- identified Jan 12, 2023, 09:58 AM UTC
The issue has been identified and we are working on fixing it.
- monitoring Jan 12, 2023, 10:18 AM UTC
Systems are becoming available again; we will continue monitoring.
- identified Jan 12, 2023, 10:39 AM UTC
The issues have worsened again; we are continuing to work on a fix.
- monitoring Jan 12, 2023, 11:21 AM UTC
Systems should be working again now, but we will continue monitoring the situation. The downtime was caused by our file system, CephFS, crashing.
- resolved Jan 12, 2023, 12:28 PM UTC
The incident has been resolved for now; a more thorough report will be made available later.
- postmortem Jan 16, 2023, 02:17 PM UTC
## Summary

On Thursday 12.01.2023 between 09:40 and 12:16 CET we experienced major service disruptions due to issues following an upgrade of our file system layer, Ceph. This resulted in a partial or complete outage of the Labrador CMS Editor for all of our clients. At 12:16 CET all Labrador services returned to a healthy state. Labrador Front was completely unaffected by this outage.

## Preface

At Labrador CMS we use [Ceph](https://ceph.io/) as a file system for most of our storage, including articles, front pages, images, settings, etc. A critical component of Ceph is the Metadata Server (MDS), which is responsible for handling file and directory metadata. Traditionally we have used 3 active Ceph MDS for performance and redundancy.

On Tuesday 10.01.2023 we completed maintenance of our Labrador Editor environments, upgrading Ceph from version 14 to 16. During the upgrade we experienced a similar incident, although on a much smaller scale, which at the time was deemed to be caused by the upgrade process itself. We now believe the cause to be the same CephFS bug responsible for the 12.01 incident.

## Details

Our internal monitoring systems reported the first unavailable services and sites at 09:40 CET. Initial investigation revealed the same CephFS symptoms we had encountered during the software upgrade on 10.01. File system clients were stuck in a bad state due to a fatal Ceph MDS crash, which caused the Labrador Editor to hang indefinitely. The solution was to forcefully restart the bad clients so the file system lock could be released.

Following the first restart of bad clients, a new set of clients reported bad states, and these were also restarted. When clients entered a bad state for the third time, our attention shifted to the Ceph MDS. Finally, halting all file system clients, reducing the number of active Ceph MDS to 1, restarting it, and resuming all clients resolved the issue. Labrador Editor services returned to an operational state at 12:16 CET.
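As a quick sanity check on the outage window reported above (first alert at 09:40 CET, recovery at 12:16 CET), the duration can be computed as below. This is only an illustrative sketch; the function name and timestamp format are assumptions made for the example, not part of any Labrador tooling.

```python
from datetime import datetime

# Times as reported in this postmortem (CET, same day).
FIRST_ALERT = "2023-01-12 09:40"
RECOVERED = "2023-01-12 12:16"


def outage_minutes(start: str, end: str) -> int:
    """Return the whole minutes between two 'YYYY-MM-DD HH:MM' timestamps."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


print(outage_minutes(FIRST_ALERT, RECOVERED))  # 156
```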
## Impacted services

Services affected by this incident are specified in the table below.

| **Service name** | **Minutes** | **Time from - to (CET)** |
| --- | --- | --- |
| Labrador CMS | 156 | 09:40 - 12:16 |

## Incident timeline

The following timeline describes the entire incident handling process.

* `2023.01.12 09:40` Service outage alerts registered
* `2023.01.12 09:50` Outage confirmed to be caused by CephFS crash
* `2023.01.12 10:10` Begin partial restart of affected clients and services
* `2023.01.12 10:30` Partial restart complete
* `2023.01.12 10:45` Additional services affected, continue partial restart
* `2023.01.12 11:15` Partial restart complete
* `2023.01.12 11:30` Begin complete client and service restart
* `2023.01.12 12:16` All services restarted and operational

## Root cause

The root cause of the service outage is believed to be a [software bug in version 16 of Ceph](https://tracker.ceph.com/issues/58041), which can cause the Ceph MDS to crash under certain circumstances when multiple MDS are active at the same time. The bug has been fixed, but the fix has not yet been included in a release of Ceph v16.

Until the fix has been released we will limit the number of active MDS to 1 for Labrador Editor environments. Labrador Front environments are not affected by this bug, as they are still running Ceph version 14.
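For readers running their own CephFS clusters, limiting the number of active MDS is normally done with Ceph's `ceph fs set <fs_name> max_mds <n>` command. The sketch below only builds that command line and does not execute it. The file system name `cephfs` and the helper function are hypothetical, and this is not necessarily the exact procedure used in our environments.

```python
import shlex
from typing import List


def limit_active_mds_command(fs_name: str, max_mds: int = 1) -> List[str]:
    """Build the argv for `ceph fs set <fs_name> max_mds <n>`.

    fs_name is the CephFS file system name (hypothetical in this example).
    """
    if max_mds < 1:
        raise ValueError("max_mds must be at least 1")
    return ["ceph", "fs", "set", fs_name, "max_mds", str(max_mds)]


# "cephfs" is an assumed file system name used only for illustration.
cmd = limit_active_mds_command("cephfs")
print(shlex.join(cmd))  # ceph fs set cephfs max_mds 1
```

The command is returned as an argument list so that it can be reviewed, or passed to `subprocess.run`, without going through a shell.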