Umbrellar experienced an incident on August 27, 2019 affecting Auckland and one other component, lasting 13h 30m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Aug 27, 2019, 09:48 AM UTC
We are experiencing network issues in Auckland as a result of the change implemented earlier this evening. We are in the process of rolling back the change to recover services. We will provide a further update shortly.
- identified Aug 27, 2019, 10:36 AM UTC
We have rolled back the change which triggered the issue, but services have not recovered automatically. We are in the process of recovering services and will provide a further update by 23:15.
- identified Aug 27, 2019, 11:24 AM UTC
We are still in the process of recovering services. Some services have been recovered; however, we are experiencing instability across the core network which is impacting our ability to fully recover from this issue. A further update will be provided by 23:55.
- identified Aug 27, 2019, 12:12 PM UTC
We are still working on this issue which is affecting internal routing within our core network and therefore access to hosted sites and servers. A further update will be provided by 01:00.
- identified Aug 27, 2019, 01:03 PM UTC
We have resolved the instability issues which we were experiencing across the core and now have a majority of customer networks back online. We are still working through recovering the remaining individual networks which will take some time. We will provide a further update by 01:45.
- identified Aug 27, 2019, 01:57 PM UTC
We have now restored access for the Umbrellar Cloud platform. Restoration of individual customer networks is ongoing. A further update will be provided by 03:00.
- monitoring Aug 27, 2019, 03:08 PM UTC
We have completed recovery for all impacted customer networks. We have a small number of infrastructure network services which are still being worked on. Our team will continue to monitor the situation and resolve any remaining issues which are identified overnight. We will be working on a full incident post-mortem and root cause analysis over the coming days.
- resolved Aug 27, 2019, 11:19 PM UTC
All external-facing networks and services are now operational. Recovery of services has meant implementing a number of static routes, which will need to be backed out once we are confident the underlying dynamic routing issues have been resolved. This work will be performed out of hours and following a publicised change notification. We are continuing to monitor the situation and are currently working on the incident post-mortem, root cause analysis and a plan for full recovery of service resiliency. This will be the last update as part of this incident.
- postmortem Sep 03, 2019, 02:26 AM UTC
### **Report – Auckland Network Outage**

#### **Overview**

On the evening of August 27th, we suffered a Priority 1 (P1) incident that affected the network infrastructure in our Auckland datacentres. The issue was identified at 21:15 as part of a post-change test plan. Most services hosted in the Auckland datacentre were affected, with ~80% of the hosted services restored by August 28th at 01:00. The remaining external services were restored by 04:30.

#### **What Happened**

The issue occurred during a scheduled network change required to improve internal network performance between two of our datacentres in the Auckland region. The outage was caused by a routing issue that arose from the implemented change. Engineers identified the issue and initiated a rollback of the change in order to restore connectivity to hosted services.

#### **Contributing Factors**

Umbrellar strives to continuously improve our service offerings, and last year we embarked on a long-term service improvement and risk mitigation plan. We are currently mid-way through this piece of work, which prolonged the time to restore services during this incident. We have taken learnings from this and are incorporating them into our investigation of the root cause.

#### **Resolution**

Due to the nature of the impact caused by the planned change, a simple rollback did not restore services to the extent required. It became evident to the Incident Response team that additional manual work was required. Several engineering team members were then called upon to execute these changes and action the recovery of network services. We have launched an internal investigation to further identify the root cause and mitigation steps going forward.

#### **Impact**

The unexpected failure of the network infrastructure caused significant impact for approximately 4 hours while a rollback and remediation plan were formulated and implemented. A subset of services remained affected, and access was restored within 8 hours of the issue being identified.

#### **Timeline**

| Time (NZST) | Event | Additional Information |
| --- | --- | --- |
| 21:00 – 21:10 27/08 | Scheduled network change implemented | ~70% of upstream traffic impacted |
| 21:15 – 00:00 27/08 | Major network impact | ~95% of upstream traffic impacted |
| 00:00 – 01:00 28/08 | Critical network impact | ~95% of upstream traffic impacted |
| 01:00 – 03:00 28/08 | Limited network impact | ~20% of upstream traffic impacted |
| 03:00 – 04:30 28/08 | Minor network impact | ~5% of upstream traffic impacted |
| 04:30 – 11:30 28/08 | Umbrellar internal network impact | ~1% of upstream traffic impacted |

#### **How did we do?**

#### **What went well?**

The team identified the issue during testing; our major incident response process was quickly initiated and all relevant team members engaged. Additional resources were brought in, and all hands required were on deck to assist service restoration. The team quickly concluded that a rollback alone was not enough to restore services, and a service restoration plan was put together to aid in the timely action of the remediation activities. The Umbrellar status page was kept up to date, and communication, where required, was sent out in a timely and concise manner.

#### **What didn’t go so well?**

The team have reviewed the incident with the aim of learning and improving going forward.
We feel that internal communication can be improved, and we are working to ensure that the systems already in place are fully utilized and that communication to key stakeholders is improved. We continually strive to improve internal communications, especially when a major incident response plan is initiated for a group-wide incident.