Labrador CMS incident

Service outage due to network issues

Labrador CMS experienced a major incident on October 30, 2024, affecting Labrador Editor, Labdevs Development, and Labrador Frontend, and lasting 44 minutes. The incident has been resolved; the full update timeline is below.

Started: Oct 30, 2024, 01:32 PM UTC
Resolved: Oct 30, 2024, 02:17 PM UTC
Duration: 44m
Detected by Pingoru: Oct 30, 2024, 01:32 PM UTC

Affected components

Labrador Editor
Labdevs Development
Labrador Frontend

Update timeline

investigating Oct 30, 2024, 01:32 PM UTC

We are experiencing issues across all services and are actively investigating.

monitoring Oct 30, 2024, 01:46 PM UTC

The problem was caused by issues at one of our infrastructure providers. Services are back online, and we are actively monitoring them. A more thorough description of the issues will follow when we have more information from our server provider.

resolved Oct 30, 2024, 02:17 PM UTC

Our infrastructure provider reports that the issue has been identified and fixed. The incident is resolved and we will update with more details when we have them.

postmortem Oct 31, 2024, 06:18 AM UTC

## Summary

On Wednesday 30.10.2024, between 13:23 and 13:40 UTC, one of our primary infrastructure providers, OVHcloud, experienced network disruptions across multiple data centers. This resulted in a partial or complete service outage for a large part of the traffic destined for both Labrador CMS and Labrador Front.

## Details

Our internal monitoring systems reported the first unavailable services and sites at 13:25 UTC. Initial investigation revealed that a network outage was ongoing at one of our infrastructure providers, causing connectivity disruptions for all their services. Network services started to return at 13:40 UTC, and our systems came back online. At 13:45 UTC, all systems were confirmed to be operational.

## Impacted services

Services affected by this incident are specified in the table below. All Labrador CMS customers were affected to varying degrees. CMS access was down, but the sites of most customers without their own Varnish cache layers remained available to readers of cached pages.

| **Service name** | **Minutes** | **Time from - to (UTC)** |
| --- | --- | --- |
| Labrador CMS | 20 | 13:25 - 13:45 |
| Labrador Front | 20 | 13:25 - 13:45 |

## Incident timeline

The following timeline describes the entire incident handling process. All times are UTC.

* `2024.10.30 13:25` Initial service outage alerts registered.
* `2024.10.30 13:27` Large-scale network outage confirmed.
* `2024.10.30 13:32` Statuspage updated and customers notified.
* `2024.10.30 13:40` Network back up again.
* `2024.10.30 13:45` All services confirmed operational and customers notified.

## Root cause

The root cause of the incident was determined to be network disruptions at our infrastructure provider, caused by one of their peering partners pushing a faulty network update.

## Planned actions

We are continuously working on improving and decentralizing our infrastructure so that we are less vulnerable to large-scale network outages like this one.

One of our largest current efforts in this regard is moving more parts of the Labrador CMS and Front infrastructure to the cloud. Storage, image rendering, and Varnish caching have already been moved to AWS, with the rest of Labrador Front following in the coming months.

## Additional reading

For more information on the incident, the OVHcloud incident report can be found here: [https://network.status-ovhcloud.com/incidents/qgb1ynp8x0c4](https://network.status-ovhcloud.com/incidents/qgb1ynp8x0c4)

In addition, Cloudflare has an interesting blog post with some more details here: [https://blog.cloudflare.com/cloudflare-perspective-of-the-october-30-2024-ovhcloud-outage/](https://blog.cloudflare.com/cloudflare-perspective-of-the-october-30-2024-ovhcloud-outage/)