Deft incident

Customer Portal

Deft experienced a major incident on February 23, 2017, lasting 17m. The incident has been resolved; the full update timeline is below.

Started: Feb 23, 2017, 04:10 AM UTC
Resolved: Feb 23, 2017, 04:28 AM UTC
Duration: 17m
Detected by Pingoru: Feb 23, 2017, 04:10 AM UTC

Update timeline

investigating Feb 23, 2017, 04:10 AM UTC

Network connectivity to the customer portal is currently unavailable. Systems administrators are investigating the cause and working swiftly toward a resolution.
resolved Feb 23, 2017, 04:28 AM UTC

Connectivity has been restored by system administrators and all systems are now fully operational.
postmortem Aug 03, 2018, 05:33 PM UTC

# Reason for Outage Report

**Date of Event:** February 22, 2017
**Event Start:** 21:56 CDT
**Event End:** 22:23 CDT
**Duration:** 27 minutes
**Affected Devices:** portal.servercentral.com
**Master Ticket:** N/A
**Service Impact:** Outage for portal.servercentral.com and customer API access
**Root Cause:** Incorrect configuration deployment to switch

## Summary:

ServerCentral maintains a web-based portal for interacting with our customers and prospects in order to ensure the highest level of customer service and responsiveness from our staff. Within this system we provide access to our support ticketing backend, billing resources, and customer inventory information via our customer portal (https://portal.servercentral.com) to all known and registered users of our existing customers. Additionally, we offer customers access to this data, on an as-needed basis, via a RESTful API hosted at the same URL.

On February 22, 2017, at approximately 21:56 CDT, the ServerCentral IT department was staging a change to the core switch supporting the primary customer portal system, in preparation for maintenance that had yet to be scheduled and announced. The configuration change was in anticipation of a scheduled test of the automatic failover capabilities of a security component protecting the customer portal. The maintenance was to be announced via a login message presented to all users and via an email notification to relevant contacts as soon as the details were arranged.

During the staging process, the proposed changes were inadvertently pushed into production immediately instead of being staged for later review by the relevant staff. This was the result of human error and is not a normal part of the ServerCentral change management procedures. Approximately 1 minute after the configuration change was made, the error was detected by monitoring systems and reported to the ServerCentral IT department.

System administrators accessed the switch and rolled back the changes once they were detected, restoring connectivity to the primary application servers. Changes were committed at 22:22 CDT, and the process completed and access was restored at 22:23 CDT.

Based on the assessment of the Compliance Officer, and because the cause of the disruption was identified quickly and was easily resolved, ServerCentral chose not to implement the business continuity plan for the customer portal or API access. The recovery process would have taken longer than rolling back the switch changes, so we opted to correct the issue on the primary systems instead of implementing the failover.

## Resolution:

It is now and has always been the policy of ServerCentral to observe the full change management plan maintained by our Security, Risk, and Compliance committee and adopted by all departments. Changes are staged, vetted, approved, and scheduled by internal teams prior to being placed into production, using an industry-leading approach to securing processes on our production systems.

Moving forward, the team managing the systems that run our internal and customer-facing applications will be retrained on the change management procedure, and their work will be observed by more senior staff on an ongoing basis.
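
The staged, vetted, approved, scheduled sequence described above can be illustrated with a small sketch. This is not ServerCentral's tooling; the `Stage`, `ChangeRequest`, and `ChangeOrderError` names are hypothetical and exist only for this example. It shows one way a workflow could refuse a change that tries to jump from staging straight to production, which is the kind of slip described in this report.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    # Hypothetical stages mirroring "staged, vetted, approved, scheduled",
    # followed by the final production deployment.
    STAGED = 1
    VETTED = 2
    APPROVED = 3
    SCHEDULED = 4
    DEPLOYED = 5


class ChangeOrderError(Exception):
    """Raised when a change tries to skip a step in the workflow."""


@dataclass
class ChangeRequest:
    description: str
    stage: Stage = Stage.STAGED
    history: list = field(default_factory=list)

    def advance(self, target: Stage, actor: str) -> None:
        # Only allow a move to the immediately following stage.
        if target.value != self.stage.value + 1:
            raise ChangeOrderError(
                f"cannot move from {self.stage.name} to {target.name}"
            )
        # Record who moved the change, and between which stages.
        self.history.append((self.stage, target, actor))
        self.stage = target


change = ChangeRequest("core switch config for failover test")
try:
    # Attempt to push a staged change directly into production.
    change.advance(Stage.DEPLOYED, actor="admin")
except ChangeOrderError as exc:
    print(f"blocked: {exc}")
```

In this sketch the gate is purely an ordering check. A real process would presumably also require a different person to perform each step, but that detail is not described in the report and is left out here.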