SCORM Cloud incident

Service Interruption

SCORM Cloud experienced a critical incident on March 13, 2020, affecting the SCORM Cloud Website and the SCORM Cloud API and lasting 28 minutes. The incident has been resolved; the full update timeline is below.

Started: Mar 13, 2020, 11:21 AM UTC
Resolved: Mar 13, 2020, 11:50 AM UTC
Duration: 28m
Detected by Pingoru: Mar 13, 2020, 11:21 AM UTC

Affected components

SCORM Cloud Website
SCORM Cloud API

Update timeline

investigating Mar 13, 2020, 11:21 AM UTC

We are currently investigating an unexpected service interruption.

monitoring Mar 13, 2020, 11:34 AM UTC

A fix has been implemented and we are monitoring the results.

resolved Mar 13, 2020, 11:50 AM UTC

This incident has been resolved. We will write a postmortem with more information soon and post it on our status page.

postmortem Mar 16, 2020, 09:04 PM UTC

At 1:15 AM UTC on March 13th, we updated a value in our production Consul server to remove a MIME-type black list entry. This change was expected and approved, but an error occurred in the manual update process. The error did not surface as a problem until our database credentials rotated on their regular schedule. Once our monitoring systems detected the problem, our SREs responded. A timeline of the response is detailed below (all times are in **UTC**):

* 10:55 AM, automated systems paged the on-call SRE
* 10:56 AM, the SRE acknowledged the page
* 10:58 AM, SCORM Cloud went offline (returning a 404 for all requests)
* 11:08 AM, after an initial investigation failed, the on-call SRE paged the backup SRE for assistance
* 11:16 AM, the backup SRE acknowledged the page
* 11:20 AM, the backup SRE began investigation
* 11:21 AM, SREs updated the status page
* 11:23 AM, SREs identified the root cause
* 11:26 AM, SREs fixed the invalid Consul entry and performed a rolling restart of existing unhealthy instances
* 11:30 AM, monitoring reported that the service was back online

We have implemented new procedures for all future updates to our Consul server. We have also identified two improvements to our dynamic configuration system. These changes will make our dynamic configuration more resilient to errors and notify us of errors immediately.
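
The intervals implied by the postmortem timeline can be worked out directly from its timestamps. The following is a minimal sketch in Python: the event names (`consul_updated`, `paged`, and so on) and the helper functions are labels invented here for illustration, and the times are copied from the postmortem as written. These figures describe the postmortem timeline only; the status page's own Started and Resolved times cover a different window.

```python
from datetime import datetime

# Timestamps copied from the postmortem (all UTC, March 13, 2020).
# The keys are illustrative labels, not names used by the vendor.
EVENTS = {
    "consul_updated": "01:15",
    "paged": "10:55",
    "acknowledged": "10:56",
    "offline": "10:58",
    "backup_paged": "11:08",
    "backup_acknowledged": "11:16",
    "status_page_updated": "11:21",
    "root_cause_identified": "11:23",
    "fix_applied": "11:26",
    "back_online": "11:30",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both 'HH:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def interval(start_event: str, end_event: str) -> int:
    """Return minutes elapsed between two named events in EVENTS."""
    return minutes_between(EVENTS[start_event], EVENTS[end_event])


if __name__ == "__main__":
    print("Faulty change to first page:", interval("consul_updated", "paged"), "min")
    print("Page to acknowledgement:", interval("paged", "acknowledged"), "min")
    print("Offline to status page update:", interval("offline", "status_page_updated"), "min")
    print("Offline to root cause:", interval("offline", "root_cause_identified"), "min")
    print("Offline to back online:", interval("offline", "back_online"), "min")
```

By this arithmetic, the service was offline for 32 minutes (10:58 AM to 11:30 AM), and the faulty Consul entry sat in place for 9 hours 40 minutes before the first page.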