Labrador CMS incident

Partial downtime following network outage

Labrador CMS experienced a major incident on October 16, 2023, affecting Labrador Editor and Labrador Frontend and lasting 11h 6m. The incident has been resolved; the full update timeline is below.

Started: Oct 16, 2023, 08:50 AM UTC
Resolved: Oct 16, 2023, 07:56 PM UTC
Duration: 11h 6m
Detected by Pingoru: Oct 16, 2023, 08:50 AM UTC

Affected components

Labrador Editor, Labrador Frontend

Update timeline

investigating Oct 16, 2023, 08:50 AM UTC

We are currently investigating this issue.
monitoring Oct 16, 2023, 09:26 AM UTC

All services returned to normal operating capacity at 10:42 CEST. The root cause was a network incident at our infrastructure provider. We will continue monitoring the situation and will follow up with a postmortem of the incident shortly.
monitoring Oct 16, 2023, 09:42 AM UTC

We are continuing to monitor for any further issues.
resolved Oct 16, 2023, 07:56 PM UTC

This incident has been resolved.
postmortem Oct 16, 2023, 07:59 PM UTC

## Summary

On Monday 16.10.2023, between 09:58 and 10:13 CEST, one of our primary infrastructure providers, OVH, experienced network disruptions across multiple data centers. This resulted in a partial or complete service outage for a large part of the traffic destined for both Labrador CMS and Labrador Front. Following this incident, a subset of our customers experienced further service degradation, as some Labrador components were stuck in an unhealthy state and needed manual intervention. At 10:42 CEST all Labrador services returned to a fully operational state for all customers.

## Details

Our internal monitoring systems reported the first unavailable services and sites at 10:03 CEST. Initial investigation revealed that a network outage was ongoing at OVH, one of our infrastructure providers, causing connectivity disruptions between our services.

The network came back online at 10:13 CEST. However, internal monitoring and customer reports indicated that a subset of our clients still had service degradation, either in the form of slow responses or complete time-outs. Further investigation revealed that one of our servers was stuck in an unhealthy state following the network disruptions. The affected server was pulled from our cluster and restarted, and all Labrador services returned to normal at 10:42 CEST.

## Impacted services

Services affected by this incident are specified in the table below.

| **Service name** | **Minutes** | **Time from - to (CEST)** |
| --- | --- | --- |
| Labrador CMS | 44 | 09:58 - 10:42 |
| Labrador Front | 44 | 09:58 - 10:42 |

## Incident timeline

The following timeline describes the entire incident handling process (all times CEST).

* `2023.10.16 09:58` Initial service outage alerts registered
* `2023.10.16 10:08` Large scale network outage confirmed
* `2023.10.16 10:13` Network operational again, most customers back online
* `2023.10.16 10:20` Reports that some customers still experience service degradation
* `2023.10.16 10:35` Service degradation root cause discovered
* `2023.10.16 10:42` Affected services restarted, fully operational

## Root cause

The root cause of the incident was determined to be network disruptions at OVH, which left one of our servers in an unhealthy state.

## Planned actions

We are continuously working on improving and decentralizing our infrastructure so that we are less vulnerable to large scale data center network outages. One of our largest current efforts in this regard is moving parts of the Labrador CMS and Front infrastructure to the cloud, reducing our exposure to OVH outages and increasing our geographical presence and flexibility. This has a high priority for us, and we expect progress to be made throughout this year and the next.