Vorco NZ experienced a critical incident on June 16, 2021 affecting Core Network and Internet & WAN Access, lasting 9h 59m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Jun 16, 2021, 11:22 PM UTC
One of our core routers has unexpectedly reloaded. Customers connected to this router will experience a loss of service while it reloads. We are investigating this with our hardware vendor urgently.
- identified Jun 17, 2021, 12:01 AM UTC
The router has come back online. Still investigating the cause.
- identified Jun 17, 2021, 01:12 AM UTC
The router has crashed again. We believe we have found the cause.
- identified Jun 17, 2021, 01:14 AM UTC
The router has come back online. We are preparing mitigation steps to restore stability.
- identified Jun 17, 2021, 01:15 AM UTC
The router has crashed again. We are continuing to prepare mitigation steps as quickly as possible.
- monitoring Jun 17, 2021, 01:17 AM UTC
The router has come back online. We have applied mitigation steps and are now monitoring.
- monitoring Jun 17, 2021, 01:24 AM UTC
The affected router has now been stable for just under 2 hours. At this stage, the cause appears to be a software bug, and our equipment vendor is examining crash dumps. We will likely require a further outage overnight to apply software updates or further investigate the cause if our vendor requires us to do so.
- resolved Jun 17, 2021, 08:36 AM UTC
The router has remained stable since 11:45 am, when temporary mitigation steps were applied. An emergency maintenance window starting at midnight will be posted shortly so that we can attempt a permanent fix.
- postmortem Jun 22, 2021, 06:05 AM UTC
# Summary

Post-incident analysis has confirmed that this outage was caused by a software bug in one of our core routers at our AKL4 (Mount Eden) core site. Our equipment vendor (Cisco) has provided updated software which includes a bugfix to prevent reoccurrence. We installed the software update during the overnight maintenance window on the 18th of June.

# Full Details

The software bug was triggered when a customer moved offices and their connection migrated from our AKL3 (Mayoral Drive) site to AKL4 (Mount Eden). Customer relocations are a normal activity that happens daily without incident. However, in this instance a specific combination of factors led to an entire chassis crash followed by a reboot, which takes just over 10 minutes.

The router crashed three times in total. After the first crash (10:37 am) we identified an approximate cause, but not the specific customer triggering the bug, and began reviewing differentials against backups. Whilst we were examining backups, the router crashed a second time (11:17 am). Just as the third crash occurred (11:34 am), we identified the specific customer we suspected was triggering the bug and prepared commands to disconnect their service as soon as the router came back online. After we disconnected that customer, the router remained stable.

We then opened a critical support case with our equipment vendor (Cisco) to have them examine our crash logs, determine the cause, and provide a bugfix. Throughout the afternoon we worked with Cisco to confirm exactly which bug we had hit, and Cisco provided us with an updated software version containing a fix a few hours later. The updated software version was installed just after midnight, and at 1 am we brought the customer back online without further incident.

We are proud of our network stability and know that our customers value that stability. When service outages do occur, we take them very personally. We hope that by being transparent when things go wrong, our customers will continue to see us as a trusted partner.