Labrador CMS incident

Degraded front performance

Labrador CMS experienced a minor incident on December 3, 2024 affecting Labrador Frontend, lasting 42m. The incident has been resolved; the full update timeline is below.

Started: Dec 03, 2024, 06:59 AM UTC
Resolved: Dec 03, 2024, 07:42 AM UTC
Duration: 42m
Detected by Pingoru: Dec 03, 2024, 06:59 AM UTC

Affected components

Labrador Frontend

Update timeline

investigating Dec 03, 2024, 06:59 AM UTC

We are currently investigating an issue with parts of our server infrastructure. Some customers may experience degraded performance on the front servers.

identified Dec 03, 2024, 07:23 AM UTC

Problems were caused by two servers not responding in one data center. We are working to rectify the situation and investigating why the automatic failover did not trigger.

monitoring Dec 03, 2024, 07:34 AM UTC

The affected servers are back online and we are actively monitoring. Services should be back to normal.

resolved Dec 03, 2024, 07:42 AM UTC

The incident has been resolved. The issue seems to have been caused by faulty network changes by one of our infrastructure providers. A detailed post mortem will be added later.

postmortem Dec 12, 2024, 06:44 AM UTC

## Summary

On Tuesday 03.12.2024, between 06:01 and 06:15 UTC, one of our primary infrastructure providers, OVHcloud, experienced network disruptions in one of their data centers. This resulted in a partial service outage for some of our customers with primary clusters on servers in the affected data center.

## Details

Our internal monitoring systems reported the first unavailable services and sites at 06:07 UTC. Initial investigation revealed that some of our servers in one data center were unavailable. The network issues caused servers to “flap” in and out of connection with the rest of our servers. This in turn caused our cluster orchestration software, Laika, to get caught in a bad state where it was unable to automatically fail over clusters hosted on the “flapping” servers.

OVH reports that network services started to return at 06:15 UTC, but not all servers came back online at that time. Two of our servers were unavailable for longer, and at 06:48 UTC we had to physically reboot them. In addition, we needed to restart the Laika control panel software to reset the cluster configurations and ensure that all servers were correctly detected in Laika.

## Impacted services

Services affected by this incident are specified in the table below. Not all customers were affected, only those with a primary container of one or more servers running on the affected servers. Service was back online for all customers at 07:27 UTC, but there was still some slowdown until 08:00 when a full Laika restart was performed.

| **Service name** | **Minutes** | **Time from - to (UTC)** |
| --- | --- | --- |
| Labrador Front | 128 | 06:01 - 08:09 |

## Incident timeline

The following timeline describes the entire incident handling process. All times are UTC.

* `2024.12.03 06:07` Initial service outage alerts registered.
* `2024.12.03 06:29` Unreachable servers confirmed.
* `2024.12.03 06:34` OVHcloud updated status, confirming network issues.
* `2024.12.03 06:48` Servers rebooted.
* `2024.12.03 07:29` Several reboots completed, all systems back online.
* `2024.12.03 08:09` Final Laika reboot complete and all systems confirmed fully operational.

## Root cause

The root cause of the incident was determined to be network disruptions at our infrastructure provider, caused by “An operation on multiple network equipment.”

## Planned actions

Laika is built to handle outages of one or more of our data centers without large interruptions in delivery. When one data center is lost, Laika should automatically move the primary cluster to one of the other data centers for all customers. In this case, the “flapping” in and out of connection prevented Laika from performing this automatic fail-over correctly.

We have identified some improvements to the automation and have set up routines for handling this manually when the automation fails. This will allow us to resolve these kinds of incidents with less service disruption for customers in the future.

One of our largest current internal efforts is still the move of more parts of the Labrador CMS and Front infrastructure to the cloud. Currently storage, image rendering and Varnish caching have been moved to AWS, with the rest of Labrador Front following in the coming months.

## Additional reading

For more information on the incident, see the OVHcloud incident report: [https://network.status-ovhcloud.com/incidents/f0xs29sv5qbd](https://network.status-ovhcloud.com/incidents/f0xs29sv5qbd)
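As an illustration of the "flapping" problem described in the postmortem, the sketch below shows one common way an orchestrator can treat a server that keeps switching between reachable and unreachable as unhealthy, so that fail-over still triggers even when the most recent health check happened to pass. This is not Laika's implementation, which the report does not describe. The class name `FlapDetector`, its methods and its thresholds are all hypothetical, chosen only for this example.

```python
from collections import deque


class FlapDetector:
    """Hypothetical per-server health tracker; not part of Laika."""

    def __init__(self, window=10, max_transitions=3, down_checks=3):
        # All thresholds are illustrative assumptions, not values from the report.
        self.max_transitions = max_transitions
        self.down_checks = down_checks
        # Keep only the most recent `window` health-check results.
        self.history = deque(maxlen=window)

    def record(self, is_up):
        """Store the result of one health check (True means reachable)."""
        self.history.append(is_up)

    def transitions(self):
        """Count up/down changes between consecutive checks in the window."""
        checks = list(self.history)
        return sum(1 for prev, cur in zip(checks, checks[1:]) if prev != cur)

    def is_flapping(self):
        return self.transitions() >= self.max_transitions

    def is_down(self):
        """True when the last `down_checks` results were all failures."""
        recent = list(self.history)[-self.down_checks:]
        return len(recent) == self.down_checks and not any(recent)

    def should_fail_over(self):
        # A flapping server is treated like a down server, so a single
        # successful check cannot cancel a pending fail-over.
        return self.is_down() or self.is_flapping()


detector = FlapDetector()
for result in [True, False, True, False, True]:
    detector.record(result)

# Four up/down changes in the window, although the last check passed.
print(detector.should_fail_over())
```

Counting state changes over a sliding window means the decision depends on recent stability instead of only the latest check, which is the situation the report describes: servers that were intermittently reachable and therefore never looked cleanly "lost".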