I.T Communications Limited incident

Telehouse North Data Centre Outage

Major Resolved

I.T Communications Limited experienced a major incident on July 12, 2019 affecting Customer Leased Lines, lasting 6h 28m. The incident has been resolved; the full update timeline is below.

Started
Jul 12, 2019, 05:49 AM UTC
Resolved
Jul 12, 2019, 12:17 PM UTC
Duration
6h 28m
Detected by Pingoru
Jul 12, 2019, 05:49 AM UTC

Affected components

Customer Leased Lines

Update timeline

  1. identified Jul 12, 2019, 05:49 AM UTC

    We have confirmed an issue with a router in Telehouse North. This will affect any site-to-site services terminating at this site, as well as any that use it to interlink with other providers. In addition, a number of leased line circuits are affected. On-site engineers have investigated the issue, and it has been escalated to two additional engineers who are en route to the site. This page will be updated as more information becomes available.

  2. identified Jul 12, 2019, 06:44 AM UTC

    Most services are now running as usual. We’re working to bring back the remaining affected leased lines and hope to have these back online before 9am.

  3. identified Jul 12, 2019, 07:57 AM UTC

    We are continuing to work on a fix for this issue.

  4. identified Jul 12, 2019, 08:07 AM UTC

    Work is still continuing to resolve the issue with the failed device causing the problems. We apologise for the trouble this is causing.

  5. identified Jul 12, 2019, 09:03 AM UTC

    In an attempt to address the issue we are taking two paths of action. Firstly, on-site staff are working with Brocade (the router vendor) to fix the problem; they previously sent a replacement line card for the router, but adding this proved fruitless. Secondly, on-site staff are installing a completely new device in case the work with Brocade doesn’t fix the issue. This is a time-consuming process, as they don’t want to disrupt the wider network. The second item will take at least another hour to complete.

  6. identified Jul 12, 2019, 10:14 AM UTC

    Engineers are still working with Brocade in an attempt to resolve the issue. They are also working in parallel to bring a new device online to help resolve the remaining issues. We apologise that we do not have anything more concrete at this time, but can assure you that the team are treating this as their number one priority.

  7. resolved Jul 12, 2019, 12:17 PM UTC

    This incident has been resolved.

  8. postmortem Jul 18, 2019, 10:01 AM UTC

    **OVERVIEW**

    A failed line card on an MLXe device in Telehouse North (THN) resulted in a loss of connectivity. For the vast majority of customers, the network recovered by routing around the affected site as expected within minutes. Leased line customers with locations terminating in THN, and traffic for a limited number of customers routing to certain IP ranges, saw a larger impact to their connectivity.

    **CAUSE OF INCIDENT**

    Background: The device in question was a chassis-based Brocade MLXe. These were originally chosen because their chassis-based design offered a modular approach enabling easy expansion and fault tolerance. However, we had been disappointed with the real-world performance of the Brocade hardware and had just undertaken a long programme of works to replace it at Layer 2 with dedicated, best-of-breed Juniper devices. This upgrade work was the result of intensive work with Brocade in an attempt to remediate the long convergence times their products were delivering. After much investigation by the most senior Brocade engineers and a programme of rolling updates, convergence times were reduced from approx. 15 minutes to approx. 6 minutes. This was still clearly sub-optimal, and Brocade themselves concluded that their network hardware was incapable of yielding any further improvements. This necessitated our investment in building a brand new core network, not based on Brocade. The first phase of the new core network was recently completed, with Juniper devices now handling Layer 2 traffic on behalf of our Brocade core. Phase two is to physically move all customer interconnects from Brocade to Juniper, which we are part way through. Phase three is to replace the MLXe and other Brocade devices with dedicated Juniper hardware on our core network.

    Incident: Our 24/7 monitoring team detected the issue and we immediately arranged for the router to be rebooted by remote staff, as senior engineers began travelling to site to investigate further. When the reboot failed, a Priority 1 case was raised with Brocade and methodical fault finding identified a failed line card. A replacement part was immediately ordered from Brocade and, as a backup, additional engineers brought to site the 3 spare line cards we keep in stock. Installing the replacement line card did not immediately resolve the issue, necessitating detailed, meticulous troubleshooting with senior Brocade engineers. In parallel with these vendor-led investigations, our engineers acquired and installed a Brocade CER router to operate in place of the MLXe, should Brocade be unable to identify the fault. Switching to the CER router would have required 6+ hours of careful manual configuration updates to restore services for all customers, so fixing the existing MLXe hardware was the focus, as the quickest route to restoring services. Our senior network engineers worked through the night, identifying the issue with the replacement line card and restoring the MLXe to normal operation.

    **Timeline:**

    - 21:49 Issue began and was detected by our 24/7 monitoring team
    - 21:55 Initial network convergence complete, restoring the majority of services. Intermittent connectivity remained for a limited number of customers accessing certain IP ranges, and for leased line customers terminating at THN
    - 22:00 Problem device and location identified
    - 23:00 Physical reboot of the router; engineers travelling to THN and SC1; P1 case raised with Brocade
    - 23:00–00:00 Engineers arrive on site; diagnostics begin; failed line card identified; replacement ordered from vendor
    - 01:00 Engineer brings the 3 stocked line cards from SC1 to THN
    - 02:00 Line card replaced, but would not integrate owing to a firmware issue, which we escalated to Brocade
    - 04:00 Identified and implemented a fix for certain traffic that was being blackholed by the line card failure. Also continued working with senior Brocade engineers on resolving the fault with integrating the replacement line cards with the chassis
    - 05:00 Made use of our MLXe in Telehouse East for enhanced diagnostics with Brocade. Also instigated Plan B of using Brocade CER routers in place of the MLXe in THN
    - 06:00 Engineer delivers the CER from SC1 to THN. Continued working with senior Brocade engineers on integrating the replacement line cards with the chassis
    - 10:00 CER racked up and preliminary configuration begun, whilst continuing to work with senior Brocade engineers on integrating the replacement line cards
    - 12:00 Cause of the line card not integrating identified by our engineers and resolved
    - 12:10 Card and ports come online; affected services begin to return to normal
    - 12:12 Affected services restored

    **RESOLUTION DETAILS**

    Once the issues with the line card were resolved, it was quickly installed and power cycled, which restored normal service.

    **FOLLOW UP ACTIONS**

    We will be reviewing our stocked-spares policy, including regular config and firmware reviews, to ensure that replacing components (such as line cards) can happen as quickly as possible in future. In addition, we are continuing our ongoing phase two plans and are expediting phase three, which will remove all Brocade hardware from our core network. The MLXes will be replaced by best-of-breed Juniper hardware, deployed in a resilient configuration. With separate devices handling Layer 3 and Layer 2, and with dedicated cold spares in strategic locations for rapid deployment, we will be extremely resilient to repetitions of issues of this nature.