Labrador CMS incident

CMS outage

Labrador CMS experienced a major incident on March 2, 2023 affecting Labrador Editor, lasting 35m. The incident has been resolved; the full update timeline is below.

Started: Mar 02, 2023, 08:54 AM UTC
Resolved: Mar 02, 2023, 09:30 AM UTC
Duration: 35m
Detected by Pingoru: Mar 02, 2023, 08:54 AM UTC

Affected components

Labrador Editor

Update timeline

investigating Mar 02, 2023, 08:54 AM UTC

We are currently investigating an issue affecting the entire CMS-rig.

resolved Mar 02, 2023, 09:30 AM UTC

Systems should be coming back online now. A post-mortem will be posted later.

postmortem Mar 02, 2023, 03:28 PM UTC

## Summary

On Thursday 02.03.2023, between approximately 09:30 and 10:53 CET, Labrador CMS experienced major service disruptions because our service provider erroneously disabled several of our physical servers.

## Preface

All Labrador CMS servers are hosted by the French company OVH. Our servers are spread across three datacentres in three different countries and have a high degree of redundancy, which allows for high uptime and lessens the impact of a single server failure on the infrastructure as a whole.

## Details

Our internal monitoring systems reported the first unavailable services at 09:31 CET. At the same time, customers were reporting that they could not access their Labrador instances.

We initially suspected another issue with CephFS, and the initial investigation seemed to corroborate this suspicion. Early efforts were therefore focused on mitigating these errors.

After a short while, as more servers reported being unavailable, it became clear that the fault lay elsewhere. We then discovered that the OVH dashboard reported several servers as disabled due to an error in the payment system.

After discovering the cause, we quickly reversed the disabling and servers began coming online again. When all servers had come online, our infrastructure software was restarted. All services were back to normal operation at 10:53 CET.

## Impacted services

Services affected by this incident are specified in the table below.

| **Service name** | **Minutes** | **Time from - to** |
| --- | --- | --- |
| Labrador CMS | 83 | 09:30 - 10:53 |

## Incident timeline

The following timeline describes the entire incident handling process.

* `2023.03.02 09:30` Service outage alerts registered
* `2023.03.02 09:50` Problem confirmed to be a critical mass of servers going offline unexpectedly
* `2023.03.02 10:02` Server disabling confirmed to be caused by OVH systems
* `2023.03.02 10:20` Servers began coming back online
* `2023.03.02 10:30` Labrador CMS became available again, with intermittent failures
* `2023.03.02 10:39` All servers online, restart of Labrador infrastructure software started
* `2023.03.02 10:53` All services restarted and operational

## Root cause and future work

The root cause of the incident was an issue with OVH's payment handling and automated server disabling.

Following this incident, we have reached out to OVH to establish more rigid procedures around server disabling and payment handling. In addition, we are continuing our work to be able to move more infrastructure to other cloud providers such as Amazon AWS.
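
The 83-minute figure in the impacted-services table follows from the first and last entries of the incident timeline. As a minimal sketch of that arithmetic (the function name `downtime_minutes` and the timestamp format are illustrative assumptions, not part of any Labrador tooling, and the date is taken from the summary):

```python
from datetime import datetime

# Timestamps copied from the incident timeline (local time, CET).
# The calendar date is taken from the summary (Thursday 2 March 2023).
FMT = "%Y-%m-%d %H:%M"
outage_start = datetime.strptime("2023-03-02 09:30", FMT)  # outage alerts registered
outage_end = datetime.strptime("2023-03-02 10:53", FMT)    # all services operational


def downtime_minutes(start: datetime, end: datetime) -> int:
    """Return whole minutes between two naive timestamps in the same time zone."""
    return int((end - start).total_seconds() // 60)


print(downtime_minutes(outage_start, outage_end))
```

Both timestamps are naive and assumed to be in the same zone, so no time zone conversion is involved in the subtraction.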