Umbrellar experienced a major incident on January 31, 2019 affecting Christchurch, lasting 3h 13m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Jan 31, 2019, 07:25 PM UTC
We experienced a network outage on our Christchurch network this morning at around 7:26am. Our engineers are currently investigating the root cause. We apologise for the inconvenience and will provide an incident report as soon as it is available.
- investigating Jan 31, 2019, 07:27 PM UTC
We are continuing to investigate this issue.
- resolved Jan 31, 2019, 10:39 PM UTC
We have identified the cause of the outage and will be providing an incident report.
- postmortem Jan 31, 2019, 10:40 PM UTC
Overview
This morning (Feb 1), our Christchurch-based core network experienced a 5-minute outage. It was caused by a planned change in progress at the time. The team members involved quickly realized that a network loop had been created, which impacted all network and data traffic.
What Happened
A planned change caused a routing loop, which was quickly recognized and resolved by rolling back the change. While network operations resumed, some systems needed manual intervention to restore functionality. This was resolved within the next 1-2 hours.
Root Cause
A misconfiguration on the new assets being deployed resulted in a network loop. This was not anticipated in the original change. Further, we have now been made aware that such a network loop has the potential to disrupt other services within the network. Assets under the original change have now been segmented into their own network, shielding the production network from any future changes until these assets need to be in production.
Resolution
The network asset causing the loop was quickly identified during the change window and physically removed from the network. The remaining impact on the rest of the network was also resolved, and we have a managed change scheduled to restart services on the virtual infrastructure. This is a precautionary change and will happen after hours to ensure all services are operating optimally.
Impact
While the core network was impacted by the network loop for 5 minutes, the flow-on impact was limited to virtual workloads belonging to 2 of our customers. We have contacted these customers and identified the impacted workloads to them. An incident report (IR) may be shared with relevant customers in addition to this postmortem.