M.D.G. IT incident

Network issues identified in Equinix SY1—switch degradation


M.D.G. IT experienced a major incident on August 8, 2022 affecting VPS Hosting, Sydney, lasting 6h 51m. The incident has been resolved; the full update timeline is below.

Started
Aug 08, 2022, 12:32 AM UTC
Resolved
Aug 08, 2022, 07:24 AM UTC
Duration
6h 51m
Detected by Pingoru
Aug 08, 2022, 12:32 AM UTC
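
The timestamps on this page are in UTC, while the postmortem in the timeline gives its times in AEST (UTC+10). As a rough aid for lining the two up, the sketch below converts the two AEST times quoted in the postmortem to UTC. The helper name `aest_to_utc` is illustrative, and the "approximately" times are treated as exact for the arithmetic.

```python
from datetime import datetime, timedelta, timezone

# AEST is a fixed UTC+10 offset (no daylight saving applies in August).
AEST = timezone(timedelta(hours=10), name="AEST")


def aest_to_utc(year, month, day, hour, minute=0):
    """Convert an AEST wall-clock time to a timezone-aware UTC datetime."""
    local = datetime(year, month, day, hour, minute, tzinfo=AEST)
    return local.astimezone(timezone.utc)


# Times quoted in the postmortem, taken at face value.
switch_failure = aest_to_utc(2022, 8, 8, 10, 15)   # "approximately 10:15 am AEST"
router_failover = aest_to_utc(2022, 8, 11, 12, 0)  # "approximately noon AEST"

# First entry in the update timeline (already UTC).
detected = datetime(2022, 8, 8, 0, 32, tzinfo=timezone.utc)

print(switch_failure.isoformat())   # 2022-08-08T00:15:00+00:00
print(router_failover.isoformat())  # 2022-08-11T02:00:00+00:00
print(detected - switch_failure)    # 0:17:00
```

Under these assumptions, the switch failure at about 00:15 UTC precedes the first timeline entry by roughly 17 minutes.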

Affected components

VPS Hosting, Sydney

Update timeline

  1. investigating Aug 08, 2022, 12:32 AM UTC

    We are currently investigating this issue.

  2. identified Aug 08, 2022, 03:42 AM UTC

    On-site network engineers have localised the issue, and an expected time to fix will be posted as soon as it is available.

  3. identified Aug 08, 2022, 03:42 AM UTC

    We are continuing to work on a fix for this issue.

  4. monitoring Aug 08, 2022, 05:32 AM UTC

    A fix has been implemented; services should be restored within 30 minutes.

  5. resolved Aug 08, 2022, 07:24 AM UTC

    All services are now operational. Please contact support if you're still seeing ongoing issues on any virtual machines. An incident report will be made available at https://status.mdg-it.com.au/incidents/2tcw034x498l once a full post mortem has been completed.

  6. postmortem Aug 11, 2022, 09:07 PM UTC

    # INCIDENT REPORT 20220808: SWITCH FAILURE, SWITCHING FABRIC RECOVERY FAULTS, BORDER ROUTER SESSION FAULTS

    **AFFECTED SYSTEMS:** Virtual machines in Equinix SY1; M.D.G. IT ticket support system

    **DESCRIPTION:** At approximately 10:15 am AEST on Monday 8 August, a core network switch carrying IP and iSCSI storage traffic failed, and traffic automatically failed over to the redundant switch in the pair. The storage network interruption caused some servers to pause.

    While internal switching failed over instantly, unexpected behaviour was observed on the internal switching fabric and external router sessions. This affected a subset of services connected to the devices, and appeared to be localised to specific ISPs.

    A secondary effect of the internal switching change was that several bare metal hypervisor servers started flapping in availability to the virtualisation pool. Although the servers were up and able to run virtual machines, the virtualisation manager intermittently marked them as up / down every few minutes. The flapping slowed stop / start operations on other machines to the point of timeout. Although these servers were connected to the redundant switch and passing traffic to virtual machines without packet loss, the effects of the flapping required them to be removed from the pool.

    This removed 96 CPU threads and 400 GB of memory from Apollo-KVM, leading to memory contention until servers were migrated or rebalanced across the remaining machines in the pool. It also led to further issues on a number of servers overnight from Monday to Tuesday, when backups and swap usage caused high iowait.

    At approximately noon AEST on Thursday 11 August, the routing cluster automatically reconfigured its master node in response to an interface alerting threshold being reached on the master node. This is an automated live fail-over process which typically takes several seconds. In this case, however, the BGP sessions were lost and network sessions were dropped until BGP re-converged.

    **CAUSES:** Switch failure, contention on KVM pool after server removal, router autoconfiguration dropped BGP.

    **RECTIFICATION:** NOC staff immediately started recovering paused servers by powering down and rebooting the affected machines. Network engineers localised the primary networking issue to border router session initiation. After being unable to resolve the session faults on the running cluster, the BGP routing cluster was rebooted. KVM pools were rebalanced after the removal of affected servers to restore resource availability.

    **FUTURE MITIGATION:** Affected portions of switching fabric are being replaced with new hardware and tested for failover. Further, network architecture is being updated and tested to remove possible interdependencies between network layers. Critical works are expected to be completed 19 August.
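
The postmortem describes hypervisors being marked up / down every few minutes by the virtualisation manager. As a general illustration of how that kind of flapping can be detected, here is a minimal sketch that counts state changes inside a sliding time window. It is not M.D.G. IT's tooling or the virtualisation manager's actual logic; the class name `FlapDetector`, the 30-minute window, and the threshold of four transitions are all assumptions chosen for the example.

```python
from collections import deque


class FlapDetector:
    """Flags a host whose up/down state changes too often within a time window."""

    def __init__(self, window_seconds=1800, max_transitions=4):
        # Both thresholds are hypothetical values for illustration only.
        self.window_seconds = window_seconds
        self.max_transitions = max_transitions
        self.last_state = None
        self.transitions = deque()  # timestamps of recent state changes

    def observe(self, timestamp, is_up):
        """Record one health-check result; return True if the host is flapping."""
        if self.last_state is not None and is_up != self.last_state:
            self.transitions.append(timestamp)
        self.last_state = is_up
        # Drop state changes that have aged out of the sliding window.
        while self.transitions and timestamp - self.transitions[0] > self.window_seconds:
            self.transitions.popleft()
        return len(self.transitions) >= self.max_transitions


# A host that alternates up/down on every 5-minute check.
flapper = FlapDetector()
flapping = [flapper.observe(i * 300, up)
            for i, up in enumerate([True, False, True, False, True, False])]

# A host that stays up for the same checks.
stable = FlapDetector()
steady = [stable.observe(i * 300, True) for i in range(6)]

print(flapping)  # [False, False, False, False, True, True]
print(steady)    # [False, False, False, False, False, False]
```

A windowed count like this separates a host that fails once from one that oscillates, which matters because, as the report notes, the oscillating hosts were still passing traffic but were disruptive to pool operations.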