MacStadium incident

[DUBLIN] Network Equipment Failure

Notice Resolved View vendor source →

MacStadium experienced a notice incident on September 13, 2020, lasting 18h 38m. The incident has been resolved; the full update timeline is below.

Started
Sep 13, 2020, 07:54 PM UTC
Resolved
Sep 14, 2020, 02:32 PM UTC
Duration
18h 38m
Detected by Pingoru
Sep 13, 2020, 07:54 PM UTC

Update timeline

  1. investigating Sep 13, 2020, 07:54 PM UTC

    We have received alerts of a down network device that could affect several customers in the Dublin Datacenter, Engineers are onsite investigating the issue currently.

  2. monitoring Sep 13, 2020, 09:26 PM UTC

    Failed equipment has been replaced and the network issue should be stabilized. On site Engineers are currently monitoring the replacement equipment.

  3. resolved Sep 14, 2020, 02:32 PM UTC

    Monitoring will continue, no further issues anticipated. Incident is resolved.

  4. postmortem Sep 21, 2020, 05:48 PM UTC

    ## Postmortem summary | **Postmortem owners** | Mark Ryan, Network Engineer ⎮ Michael Weir, Systems Engineer ⎮ Michael Rhoades, Manager, Service Center | | --- | --- | | **Incident** | Infrastructure router crash/module failure | | **Priority** | **P1** | | **Affected services** | Partial network connectivity at MacStadium’s Dublin data center | ‌ ## Executive summary At 20:25 UTC on Sunday, September 13, 2020 MacStadium’s Dublin data center experienced a partial network outage affecting some customers. This outage lasted for 2 hours and was due to a Supervisor module crash on a key infrastructure router \(and the hardware failure of one of its modular components\). The supervisor module was manually rebooted and the majority of traffic was restored. A sub-module had experienced a failure that additionally required a hardware replacement. Service to all customers was restored at 22:20 UTC. ‌ ## Postmortem report | **Instructions** | **Report** | | --- | --- | | **Timeline** | 2020-08-13, 9:12 UTC – a MacStadium Technician escalates an unusual pattern of alerts to the Incident Manager. The Engineering Team begins an immediate investigation. ⎮ 19:25 UTC – MacStadium Engineers confirm there is a partial outage affecting some customers in the Dublin data center and continue with troubleshooting. ⎮ 19:35 UTC – It was determined a key infrastructure router had crashed. Engineers reboot the router, which restored traffic to the majority of affected servers. ⎮ 19:50 UTC – One of the router sub-modules fails to power back on, and Engineers are dispatched to the Dublin data center to continue the investigation. ⎮ 19:54 UTC – The Incident Manager updates MacStadium’s Status Page with investigation status. ⎮ 20:20 UTC – Engineers arrive on site. They test and confirm the failure of the module. They locate, install, and test a replacement module. Once the cabling is fully reconnected to the new module, all traffic is restored. ⎮ 21:26 UTC – Status Page is updated that the incident cleared. | | **Root cause** | The primary supervisor module on a key infrastructure router crashed, necessitating a reboot of the router. After reboot, it was found that a sub-module had failed and required hardware replacement. | | **Follow-up Actions** | Approximately one year ago MacStadium set out to upgrade its Dublin data center to a new network architecture in order to provide greater east-west bandwidth, resiliency, and scale. The final piece of that plan will be executed by December 2020. Unfortunately, this incident occurred due to a single point of failure in the Dublin network that our network upgrade will address. |