Studentnet incident

20200429 - SSO Outage

Critical Resolved

Studentnet experienced a critical incident on April 27-28, 2020, lasting 1 hr 53 min in total. The incident has been resolved; the full update timeline is below.

Started
Apr 27, 2020, 10:00 PM UTC
Resolved
Apr 27, 2020, 10:00 PM UTC
Duration
1 hr 53 min (per PIR)
Detected by Pingoru
Apr 27, 2020, 10:00 PM UTC

Update timeline

  1. resolved May 01, 2020, 05:30 AM UTC

    Descriptive Name: SSO outage
    Incident Reference Number: 20200429
    Date of Incident: Monday 27/04/2020 & Tuesday 28/04/2020
    Time & Duration of Incident: 08:30 to 09:25 AEST 27/04/2020; 08:07 to 09:05 AEST 28/04/2020; 1 hr 53 min total
    Severity: ☒ Service Affecting ☐ Access Affecting ☒ Performance Affecting ☐ Network Affecting
    Location Affected: ☐ Isolated school ☐ Host + all VMs ☐ Sub-Net ☒ Multiple schools
    Services Affected:
    - SSO login from any location inoperative.

  2. postmortem May 01, 2020, 05:31 AM UTC

    # **Post Incident Report - 2020-04-29**

    **Descriptive Name:** SSO outage
    **Incident Reference Number:** 20200429
    **Date of Incident:** Monday 27/04/2020 & Tuesday 28/04/2020
    **Time & Duration of Incident:** 08:30 to 09:25 AEST 27/04/2020; 08:07 to 09:05 AEST 28/04/2020; 1 hr 53 min total
    **Severity:** ☒ Service Affecting ☐ Access Affecting ☒ Performance Affecting ☐ Network Affecting
    **Location Affected:** ☐ Isolated school ☐ Host + all VMs ☐ Sub-Net ☒ Multiple schools
    **Services Affected:**
    - SSO login from any location inoperative.

    **Incident Cause:**
    - Cloudwork SSO service experienced a DOS-like surge of sign-in activity on Monday and Tuesday mornings.
    - Regular reports were running overtime, consuming DB resources.
    - Analysis on Monday (27/04/2020) determined that the authentication volume was not legitimate activity. The volume was generated by a misbehaving app continuously completing successful sign-on requests hundreds of times a minute per successfully logged-in user. This was being performed by School1, as in the incident of 31/3/2020. On Tuesday (28/04/2020) School2 was identified as exhibiting exactly the same misbehaviour from exactly the same app. On Wednesday (29/04/2020) School3 was also exhibiting exactly the same misbehaviour from exactly the same app.
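The misbehaving app stood out because it completed successful sign-on requests hundreds of times a minute per logged-in user, far beyond any plausible human rate. A minimal sketch of that kind of per-user, per-minute rate check follows; the function name, event format, and threshold are illustrative assumptions, not Studentnet's actual tooling.

```python
from collections import Counter
from datetime import datetime

# Illustrative threshold: the PIR describes "hundreds of times a minute
# per successfully logged in user"; legitimate users sit far below this.
MAX_SIGNONS_PER_MINUTE = 30

def flag_misbehaving_users(events):
    """events: iterable of (user_id, iso_timestamp) for successful sign-ons.
    Returns the set of user_ids exceeding the per-minute threshold."""
    per_minute = Counter()
    for user_id, ts in events:
        minute = datetime.fromisoformat(ts).strftime("%Y-%m-%d %H:%M")
        per_minute[(user_id, minute)] += 1
    return {user for (user, _), n in per_minute.items()
            if n > MAX_SIGNONS_PER_MINUTE}

# Example: a normal user plus one account signing on 200 times in one minute
events = [("alice@school1", f"2020-04-27T08:1{i % 10}:00") for i in range(5)]
events += [("bot@school1", "2020-04-27T08:15:00")] * 200
print(flag_misbehaving_users(events))  # → {'bot@school1'}
```

Flagged accounts could then be traced back to a tenant, which is consistent with the response described in this report: isolating the offending school in a resource-constrained container rather than letting it degrade the shared service.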
    **Incident Resolved:** ☒ Yes ☐ No ☐ Open
    **Time of Resolution:** 09:15, Tuesday 28/04/2020; 18:30, Tuesday 28/04/2020
    **Restoration Timeframe:** 1 hr 53 min 00 sec
    **Issued By:** Technical Support, [[email protected]](mailto:[email protected]) & [[email protected]](mailto:[email protected])
    **PIR Issue Date:** 30/04/2020
    **Contact Information:** Please report any continued service disruption _**immediately**_ to:
    Studentnet NOC Support: +61 2 9281 3905
    Support Email: [[email protected]](mailto:[email protected]) [[email protected]](mailto:[email protected])

    ## **Incident Description**

    **27/04/2020**
    * 08:15 Extraordinarily high sign-in authentication activity observed.
    * There was some expectation that this reflected legitimate extra traffic caused by schools moving en masse to remote-learning models as a result of the COVID-19 pandemic, but also a possibility that the misbehaving app had recurred.
    * Schools started reporting poor sign-in performance.
    * 08:30 All sign-in services stopped.
    * Investigation commenced to audit resources allocated to critical processes and heavily utilised schools.
    * Investigation commenced to determine the status of the misbehaving app.
    * Investigation determined:
      1. The misbehaving app was present again at School1, but only for some of their account holders.
      2. Weekly report generation was running overtime and creating unnecessary database load.
    * 08:45 Reporting jobs were rescheduled.
    * 09:15 School1 service terminated until the school could confirm that it would no longer generate misbehaved traffic.
    * Services to other schools recommenced.
    * 10:15 School1 services brought back online in a resource-constrained container so as to have no impact on other services.
    * 15:00 Evidence started to appear that another school (School2) was exhibiting the same misbehaving app traffic.

    **28/04/2020**
    * 08:10 Positive confirmation that School2 was generating the same misbehaved traffic from the same app.
    * 08:15 Attempts made to call IT admins at School2.
    * Contact from School2 admins confirmed that they are using the same app that caused issues at School1.
    * Inbound load was increasing uncontrollably.
    * Examination of the load indicated that it was not being efficiently distributed to the available servers.
    * Planning for a new round-robin based resource allocation scheme was completed. This would require a DNS change to implement.
    * School2 resources constrained to a resource-limited container.
    * 08:59 First status notification texted out to all reporting schools:
      * "_Studentnet Cloudwork SSO service currently experiencing capacity issues. Further update coming shortly._"
    * 09:00 DNS changes were implemented to establish a more efficient round-robin allocation of tasks to available servers.
    * 09:15 DNS changes were deployed, starting the TTL propagation period.
    * Services progressively came back online as the DNS TTL period expired.
    * 09:31 Second status notification texted out to all reporting schools:
      * "_Studentnet Cloudwork SSO outage, update. A fix has been applied that required a DNS change. TTL for DNS propagation means that the fix will take a 30-40 min delay._"
    * Capacity planning was completed to dramatically increase available capacity. The plan implemented included:
      1. Configure and physically deploy newly purchased servers into the DC. Hardware servers were purchased in March 2020, with deployment planned for Q2 2020. This plan was brought forward to be completed in 8 hours.
      2. Complete implementation of the round-robin load allocation policy to more efficiently utilise the 3 available physical servers.
      3. Configure and commission 2 new database servers, increasing available DB capacity.
    * 16:00 New servers delivered and racked into the DC.
    * 18:30 New servers network-connected and incorporated into the swarm.

    **29/04/2020**
    * 06:19 Third status notification texted out to all reporting schools:
      * "_Studentnet Cloudwork status update: All systems operational. New hardware deployed, extra DBMSs and DNSs commissioned, app behaviour being monitored. PIR to follow. Please report any problems to 02 9281 1626. Thank You_"
    * 07:30 A third school (School3) was detected exhibiting the same misbehaving app traffic.
    * 08:00 Attempts made to contact School3 IT admins.
    * 08:15 Contact made with School3 IT admins, advising of the misbehaving app traffic being generated.
    * 08:30 School3 service placed into a resource-constrained container to isolate any impact on other schools.
    * 08:30-08:55 School3 experiences slow performance arising from its self-generated misbehaving traffic. All other services continue unaffected.
    * 08:55 School3 disables the misbehaving app.
    * 09:10 School3 re-enables the misbehaving app in a constrained fashion.

    **Root Cause**
    * Inappropriate authentication behaviour by an app.
    * Poorly timed weekly report generation jobs.

    **Recommendations/Preventative Measures**
    * Audit app behaviours and request rectification to conform to standard protocols where needed.
    * Monitor resource usage, re-allocating and optimising where needed.
    * Implement a Cloudwork status notification page.
    * Prepare for further orders-of-magnitude growth in authentication volumes as remote learning is established as the new normal mode of operation.

    _oOo_

    **Strictly Confidential**
    Copyright © 2020, Studentnet/Coherent Cloud (CoClo) ABN 90 001 966 892