MacStadium incident

Partial Outage Affecting Some Cloud Customers

Major Resolved View vendor source →

MacStadium experienced a major incident on April 1, 2020 affecting Atlanta Data Center, lasting 1h 39m. The incident has been resolved; the full update timeline is below.

Started
Apr 01, 2020, 04:10 AM UTC
Resolved
Apr 01, 2020, 05:50 AM UTC
Duration
1h 39m
Detected by Pingoru
Apr 01, 2020, 04:10 AM UTC

Affected components

Atlanta Data Center

Update timeline

  1. investigating Apr 01, 2020, 05:10 AM UTC

    We are currently investigating an issue where a subset of VMware cloud customers are experiencing failures when attempting to access their vCenter for management purposes. All VMs appear to be running and general production is unaffected.

  2. identified Apr 01, 2020, 05:17 AM UTC

    Our onsite Engineers have identified the issue and service is now restored. We will be closely monitoring the environment to ensure all services are fully functioning.

  3. identified Apr 01, 2020, 05:19 AM UTC

    We are continuing to work on a fix for this issue.

  4. monitoring Apr 01, 2020, 05:21 AM UTC

    A fix has been implemented and we are monitoring the results.

  5. resolved Apr 01, 2020, 05:50 AM UTC

    This incident has been resolved.

  6. postmortem Apr 02, 2020, 12:02 PM UTC

    # Partial Outage of vCenter Access Root Cause Analysis Date: 04/02/2020 Location: Atlanta, Georgia, USA This Document Provides a Root Cause Analysis of a Partial Outage Affecting Some VMware Cloud Customers _**Date: Wednesday, April 1, 2020**_ ## Prepared by * Paul Benati, SVP Operations * David Hunter, VP Engineering * Preston Lasebikan, Systems Architect & Team Lead - Systems Engineering * Michael Rhoades, Manager, Service Center ## Location Atlanta \(ATL1\) data center ## Event Description At 00:10 EST on Wednesday, April 1, 2020 our Atlanta data center experienced a partial network outage specifically affecting the ability of some customers to access management functions of their vCenter appliances. This outage lasted for approximately 55 minutes and was due to a faulty high availability \(HA\) cable preventing failover during planned steps in a server platform upgrade. Engineers manually invoked the failover and full service was restored by 01:05 EST. ## Chronology of Events / Timeline * 2020/04/01 at 00:10 EST - Engineering remotely pushed planned configuration updates to the fabric interconnects \(FI\) in preparation for the upgrade to an advanced, future-proof server platform. * 2020/04/01 at 00:14 EST - Monitoring alerts indicated disjointed intermittent traffic to multiple Atlanta vCenter server appliance \(VCSA\) hosts causing delays or inability to reach VCSAs externally for some customers. * 2020/04/01 at 00:15 EST - Engineering immediately ascertained their VPN authentication was interrupted by the changes and dispatched personnel to intervene onsite. * 2020/04/01 at 01:02 EST - Manual intervention onsite successfully initiated FI failover and network traffic was immediately restored. Further inspection revealed the HA heartbeat cables had failed and were subsequently replaced. * 2020/04/01 at 01:15 EST - Engineering validated the configuration would prevent any future events by performing failover tests. ## Summary Findings on Technical Causes In preparation for an upcoming upgrade to advanced new fabric interconnects, configuration updates were pushed to one of the current FIs. Normally, failover would occur without interruption to network traffic; however, a failure in one of the physical HA cables between FIs prevented the failover, which simultaneously prevented authentication of the VPN connection used to remotely remediate the situation, thus requiring onsite intervention. The non-failover caused intermittent network traffic loss to multiple Atlanta vCenter appliances affecting the ability of some customers to perform management functions. As a note, this did not affect running hosts but only access to VCSA. ## Root Causes 1. A failed HA cable prevented automatic failover of the fabric interconnect. 2. Subsequent loss of VPN authentication prevented remote manual initiation of FI failover. ## Reaction, Resolution, and Recommendations MacStadium immediately dispatched Engineering to perform the manual failover procedure onsite and network traffic flow immediately restarted. HA cables were then replaced, configuration checked and confirmed. Internal and external alerting quickly confirmed access, and spot checks were run system-wide. Although the issue was isolated to the FI, full cluster checks were conducted. Going forward, VPN authentication will be performed prior to the use of virtual desktop infrastructure, precluding the possibility of a similar incident occurring in the future. Additionally, the integrity of physical HA links will be inspected at regular intervals. MacStadium is also investigating the implementation of advanced self-healing measures. MacStadium takes issues resulting in a loss or degradation of service very seriously and we strive to continually understand the root cause so we may improve our level of service for our customers. ## Sponsor Acceptance ### Approved by the Project Executive Date: 2020-04-02 Paul Benati SVP Operations ## Revision History