MacStadium experienced a critical incident on March 19, 2020, affecting the Atlanta Data Center and lasting 16h 26m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Mar 19, 2020, 07:30 PM UTC
We are currently experiencing a network outage affecting our Atlanta data center. Engineers are working to identify the cause and remediate it as soon as possible. The outage is not affecting any other services (e.g., servers). We apologize for the inconvenience and will post an ETA as soon as possible.
- identified Mar 19, 2020, 07:50 PM UTC
Our engineers have identified the problem and are now working to resolve the incident. Initial reports point to a failure of one of our border leaf switches. More details will be provided shortly.
- monitoring Mar 19, 2020, 08:52 PM UTC
As of approximately 19:50 UTC, a fix was implemented and we are monitoring the results. Again, we apologize for the inconvenience and are working to identify the root cause of the Atlanta border leaf failure.
- resolved Mar 20, 2020, 11:57 AM UTC
We continued monitoring overnight and did not observe any issues. This incident has been resolved. Thank you again for your patience and understanding.
- postmortem Mar 23, 2020, 01:14 PM UTC
# Atlanta Network Outage

_**Date: Thursday, March 19, 2020**_

## Prepared by

* Paul Benati, SVP Operations
* David Hunter, VP Engineering
* Khoa Tran, Network Architect & Team Lead - Network Engineering
* Michael Rhoades, Manager, Service Center

## Location

Atlanta (ATL1) data center

## Event Description

At 15:05 EST on Thursday, March 19, 2020, our Atlanta data center experienced a network outage. The outage lasted 41 minutes and was caused by an inadvertent modification of a border leaf switch's configuration file. The switch's configuration was restored from backup, and network service was restored at 15:46 EST.

## Chronology of Events / Timeline

* 2020/03/19 at 15:05 EST - Engineering remotely executed a configuration script against one of Atlanta's border leaf switches instead of the intended top-of-rack switch, resulting in a network outage
* 2020/03/19 at 15:05 EST - VPN and VDI connections went down for Atlanta
* 2020/03/19 at 15:08 EST - Onsite Engineering at our Atlanta DC (ATL1) was engaged to provide access for remote engineers to troubleshoot the outage
* 2020/03/19 at 15:35 EST - Access was provided to remote engineers
* 2020/03/19 at 15:35 EST - Remote engineers began to investigate the outage
* 2020/03/19 at 15:40 EST - Engineering determined that the configuration of one of the border leaf switches had been altered
* 2020/03/19 at 15:46 EST - Engineering restored the original configuration from backup to the affected border leaf switch, thereby restoring service

## Summary Findings on Technical Causes

The configuration of one of the Atlanta border leaf switches was inadvertently modified. This modification necessarily put the two HA-paired border leaf switches out of sync with each other. Due to the nature of the Cisco HA capability, this partially altered configuration disabled external connectivity.

## Root Causes

1. The configuration of one of the Atlanta border leaf switches was inadvertently modified

### Root cause 1

A script was remotely executed that inadvertently affected one of the two high-availability paired border leaf switches. The script partially altered one switch's configuration, disabling external connectivity at 15:05 EST. Network service was restored when the original configuration file was reintroduced from backup.

## Reaction, Resolution, and Recommendations

MacStadium has disabled remote execution of certain shell scripts on primary switches. Significant changes can now be made only when directly connected to the switch, precluding the possibility of a similar incident in the future.

MacStadium takes issues resulting in a loss or degradation of service very seriously, and we strive to continually understand the root cause so that we may improve our level of service for our customers.
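
The two failure points described above, a script run against the wrong switch and an HA pair left out of sync, can be illustrated with a minimal Python sketch. This is not MacStadium's tooling or the remediation described in this report: the `Device` record, the role names, `check_target`, and `config_drift` are all hypothetical names introduced for illustration, and configurations are treated as plain text rather than read from real switches.

```python
import difflib
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical inventory record; hostnames and roles are illustrative."""
    hostname: str
    role: str  # e.g. "top-of-rack" or "border-leaf"


class WrongTargetError(RuntimeError):
    """Raised when a script is aimed at a device with an unexpected role."""


def check_target(device: Device, expected_role: str) -> None:
    """Refuse to proceed unless the device has the role the operator intended."""
    if device.role != expected_role:
        raise WrongTargetError(
            f"{device.hostname} has role {device.role!r}, "
            f"expected {expected_role!r}"
        )


def config_drift(config_a: str, config_b: str) -> list[str]:
    """Return the lines present in only one of two configuration texts.

    Lines prefixed "- " appear only in config_a; lines prefixed "+ "
    appear only in config_b. An empty list means the texts match.
    """
    diff = difflib.ndiff(config_a.splitlines(), config_b.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    border = Device(hostname="border-example", role="border-leaf")
    try:
        # The operator intended a top-of-rack switch, so this is rejected.
        check_target(border, expected_role="top-of-rack")
    except WrongTargetError as err:
        print(f"refused: {err}")

    # A partially altered peer shows up as a non-empty drift list.
    for line in config_drift("vlan 10\nvlan 20", "vlan 10"):
        print(line)
```

A check like `check_target` would run before any change is pushed, while `config_drift` could be used afterwards to confirm that both peers still match. A real comparison would likely need to ignore lines that legitimately differ between peers, such as hostnames.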