Spillvert incident

Service outage - Oslo (RESOLVED)

Minor · Resolved

Spillvert experienced a minor incident on April 30, 2022 affecting Minecraft, Game @ Oslo, and five other components, lasting 3d 14h. The incident has been resolved; the full update timeline is below.

Started
Apr 30, 2022, 08:42 PM UTC
Resolved
May 04, 2022, 10:58 AM UTC
Duration
3d 14h
Detected by Pingoru
Apr 30, 2022, 08:42 PM UTC

Affected components

Minecraft
Game @ Oslo
Game & Voice Panel (TCAdmin)
Game @ Tyskland
Nettside og kundeportal (website and customer portal)
TeamSpeak @ Oslo
TeamSpeak @ Tyskland

Update timeline

  1. investigating Apr 30, 2022, 08:42 PM UTC

    We are currently experiencing a significant outage affecting a number of services hosted in our Oslo data center. Troubleshooting is in progress.

  2. investigating May 01, 2022, 07:39 AM UTC

    Some services have now been restored. Work continues on restoring the remaining ones.

  3. investigating May 01, 2022, 11:45 AM UTC

    We are still working to restore services.

  4. identified May 01, 2022, 03:37 PM UTC

    We are still working to restore access to our services.

  5. identified May 01, 2022, 07:10 PM UTC

    Most services are now running normally. We are still working to restore access to services on nodes OSLO15, 16, 17, and 23.

  6. identified May 01, 2022, 07:39 PM UTC

    Services on node OSLO17 are now running normally.

  7. identified May 01, 2022, 09:55 PM UTC

    Services on nodes OSLO15 and 16 are now running normally.

  8. identified May 02, 2022, 11:20 AM UTC

    We are still working to restore access to services on node OSLO23.

  9. monitoring May 02, 2022, 03:46 PM UTC

    Access to services on node OSLO23 has now been restored. All systems are running normally. We will continue to monitor our platform in the coming days before the incident is closed as resolved. We sincerely apologize for the inconvenience caused by this issue. We will publish a post-mortem describing what happened in more detail at a later date.

  10. resolved May 04, 2022, 10:58 AM UTC

    The incident is closed as resolved.

  11. postmortem May 06, 2022, 03:15 PM UTC

# Reason for Outage Report, Oslo Location Digiplex (Post-mortem)

**Primary outage**

On Saturday, April 30 at 21:43 CEST, our ISP's utility provider suffered downtime on one of their transformers. Alarms were quickly raised on our end as well as our ISP's. Our ISP's technicians immediately checked that backup power generation was running, and confirmed that it was. Staff were dispatched to ensure a successful failover to generator power. On arriving at the data center at 22:45 CEST, they found that power to the site had been lost a few minutes earlier. Investigation showed that the device responsible for switching the site from utility power to generator power had failed, even though backup generation was running. Power was restored by switching the system from utility to backup power manually.

**Subsequent effects**

After power was restored, we found that network connectivity was still down for several customers. Some network devices had trouble coming back up: some had lost their configuration completely, and others simply did not boot. Work started immediately to replace failed switches or restore their configuration. The majority of our ISP's restoration time was spent troubleshooting networking and manually intervening on servers that did not come back up.

**Improving networking**

Our ISP will look at ways of improving their routing infrastructure so that the loss of a single site does not cause downtime for other sites. In our data centers we use a common spine/leaf architecture: for many services we deploy a leaf switch that communicates with multiple redundant spines. Most of the switches that had issues were leaf switches, and customers connected to them were effectively down.
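The spine/leaf failure mode above can be illustrated with a toy model (hypothetical names only, not our actual tooling): each host hangs off a single leaf while every leaf uplinks to multiple spines, so losing one spine is survivable, but losing a host's leaf takes that host offline regardless of spine redundancy.

```python
# Toy model of a spine/leaf fabric: spines are redundant, leaves are not.
class Fabric:
    def __init__(self, spines, leaves):
        self.spines = set(spines)                 # redundant spine layer
        self.leaves = {leaf: set() for leaf in leaves}  # leaf -> attached hosts
        self.failed = set()

    def attach(self, host, leaf):
        self.leaves[leaf].add(host)

    def fail(self, device):
        self.failed.add(device)

    def host_reachable(self, host):
        # A host is reachable if its (single) leaf is up
        # and at least one spine is still up.
        leaf = next(l for l, hosts in self.leaves.items() if host in hosts)
        return leaf not in self.failed and bool(self.spines - self.failed)

fabric = Fabric(spines=["SPINE1", "SPINE2"], leaves=["LEAF-A"])
fabric.attach("host1", "LEAF-A")

fabric.fail("SPINE1")                    # losing one spine is survivable
print(fabric.host_reachable("host1"))    # True

fabric.fail("LEAF-A")                    # losing the host's leaf is not
print(fabric.host_reachable("host1"))    # False
```

This is why zero-touch replacement of failed leaves (discussed below) matters: the leaf is the single point of failure for its attached customers.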
Our ISP found that even though they had backups of all configuration data, resolving these issues consumed a lot of engineering time that could have been spent resolving individual customer issues. To improve this, our ISP will be looking at switch automation and zero-touch provisioning of switches: if a switch fails, we could quickly pull it out and have the replacement boot into the correct configuration automatically, with very little manual intervention beyond updating the device ID in our provisioning system.

**Improving communications**

During an event like this, communication is always difficult. Our customers are understandably worried about their services and when they will be back up and running. On our end, we came to realise that our website and our single point of contact, the online support forms, were effectively offline because they were hosted in the same data center, and the offline cached version did not fully work. We have now deployed a redundant solution so that this will not occur again. Furthermore, we will work on improving our communication routines to make sure customers get as much information as we can provide via our third-party status portal: [https://spillvert.statuspage.io/](https://spillvert.statuspage.io/)

**Improving power resiliency**

To understand why power was lost, it helps to understand how backup power generation works. When utility power is lost, microcontrollers that communicate with the backup generators and with the breakers in the distribution panels ask backup power to start. Once the backup system has started, the controllers turn off the utility power breakers and turn on the backup power breakers, transitioning the power feed from utility to backup. In our case, this transition did not happen.
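The transfer sequence described above can be sketched as a minimal model (hypothetical names only, not the actual controller firmware): the controller starts the generator, and only once it is running does it open the utility breakers and close the generator breakers. In this incident the generator ran, but the breaker transition failed, leaving the site dark until a manual switchover.

```python
# Minimal sketch of an automatic-transfer-switch (ATS) sequence.
class Generator:
    def __init__(self):
        self.running = False

    def start(self):
        self.running = True

class Breaker:
    """A breaker that may refuse to actuate, as happened on April 30."""
    def __init__(self, healthy=True):
        self.healthy = healthy
        self.closed = True

    def open(self):
        if self.healthy:
            self.closed = False
        return self.healthy

    def close(self):
        if self.healthy:
            self.closed = True
        return self.healthy

def transfer_to_backup(utility_up, generator, utility_breaker, generator_breaker):
    """Return the active power source after checking utility power."""
    if utility_up:
        return "utility"
    generator.start()
    if not generator.running:
        return "none"  # generator failed to start
    # The step that failed in this incident: swapping the breakers over.
    if not utility_breaker.open() or not generator_breaker.close():
        return "none"  # transition failed -> site dark, manual switch required
    return "generator"

# Normal failover: generator starts and the breakers transition.
print(transfer_to_backup(False, Generator(), Breaker(), Breaker()))                 # generator
# The incident: generator running, but the switching device failed.
print(transfer_to_backup(False, Generator(), Breaker(healthy=False), Breaker()))    # none
```

Making the breakers remotely controllable, as mentioned below, would at least let the manual fallback happen without waiting for staff to reach the site.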
Our ISP is working with their electrical contractor to work out why this did not happen and to prevent a similar occurrence in the future. Furthermore, they will look at making the breakers remotely controllable, which they currently are not.

**Summary**

Last, but certainly not least, we want to apologize. We know that our services are critical to our customers. Over the coming weeks, our ISP and their engineers will spend a lot of time investigating the chain of events in detail, in order to improve both the understanding of what happened and the reliability of the infrastructure.