Learnosity incident
Issue affecting scoring delays in US East 1 (VA)
Learnosity experienced a minor incident on October 27, 2021 affecting Updating session response scores, lasting 2h 42m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Oct 27, 2021, 02:11 PM UTC
As of 14:00 UTC, we are experiencing minor delays in the scoring of submitted sessions in the US East 1 region (VA). Learnosity Support and Systems Engineering teams are actively investigating the issue, and will follow on with an update and resolution as soon as possible.
- investigating Oct 27, 2021, 02:30 PM UTC
As of 14:30 UTC, we are continuing to investigate an uptick in scoring delays that are affecting the US East 1 (VA) region. Learnosity Support and Systems Engineering teams are continuing to actively investigate the issue, and will follow on with an update and resolution as soon as possible.
- investigating Oct 27, 2021, 03:07 PM UTC
As of 15:00 UTC, the Learnosity Systems Engineering team is working on mitigating the delays affecting the availability of newly scored session data in US East 1. Learnosity Support and Systems Engineering teams are continuing to actively investigate the issue, and will follow on with an update and resolution as soon as possible.
- monitoring Oct 27, 2021, 03:49 PM UTC
As of 15:46 UTC, scoring throughput has returned to normal. We are still seeing a backlog in some scoring queues but it is rapidly reducing. Learnosity Support and Systems Engineering teams are continuing to monitor the issue, and will follow on with an update and resolution as soon as possible.
- monitoring Oct 27, 2021, 04:03 PM UTC
As of 16:00 UTC, scoring throughput remains normal. The backlog in the queues is continuing to reduce. Learnosity Support and Systems Engineering teams are continuing to monitor the issue, and will follow on with an update and resolution as soon as possible.
- monitoring Oct 27, 2021, 04:23 PM UTC
As of 16:22 UTC, the scoring queue is empty, and sessions are being scored quickly enough that no backlog is forming. Learnosity Support and Systems Engineering teams will continue to monitor for 30 more minutes and then, in the absence of any further delays, mark this issue resolved.
- resolved Oct 27, 2021, 04:54 PM UTC
As of 16:53 UTC, we have resolved the issue affecting scoring delays in the US East 1 (VA) region. Learnosity Support and Systems Engineering teams will follow up with a post mortem once we have completed root cause analysis and finalized any next steps or preventative measures required. Please reach out if you have any questions or concerns.
- postmortem Jan 21, 2022, 08:58 PM UTC
An underlying storage system corruption, linked to a single row in the database table where _scoring errors_ are saved, caused delays in the asynchronous scoring queue. Prior to correction, queries run against this table could take anywhere from a fraction of a second to several seconds to complete, and this impact was compounded by an unusually high number of session scoring errors submitted at one time. Because this issue only surfaced when rare scoring errors were logged, it was not immediately detected. Initially, work began to increase the number and size of RDS instances to work through the scoring backlog, but the queue cleared before this process could be completed. This showed us that delays only occurred when errors were persisted, which led to the discovery and deletion of the problem table row. Additional measures have been put in place to monitor the write speed of errors, as well as session data, and the issue has not resurfaced since then.
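As an illustration only, the kind of write-speed monitoring described above could be sketched as follows. This is not Learnosity's implementation: the class name `WriteLatencyMonitor`, the 0.5 second threshold, and the 100-sample window are all hypothetical choices made for the example. The sketch times each write, keeps a rolling window of durations, and reports degradation when the 95th percentile exceeds the threshold.

```python
import time
from collections import deque
from statistics import quantiles


class WriteLatencyMonitor:
    """Hypothetical helper: tracks recent write durations and flags slowness."""

    def __init__(self, threshold_seconds=0.5, window=100):
        # Both defaults are assumed values for illustration.
        self.threshold_seconds = threshold_seconds
        self.samples = deque(maxlen=window)  # oldest samples drop off automatically

    def record(self, seconds):
        """Store the duration of one completed write."""
        self.samples.append(seconds)

    def timed(self, write_fn, *args, **kwargs):
        """Call write_fn, record how long it took, and return its result."""
        start = time.perf_counter()
        try:
            return write_fn(*args, **kwargs)
        finally:
            self.record(time.perf_counter() - start)

    def p95(self):
        """95th percentile of the durations currently in the window."""
        if len(self.samples) < 2:
            return self.samples[0] if self.samples else 0.0
        # n=20 yields 19 cut points; the last one is the 95th percentile.
        return quantiles(self.samples, n=20, method="inclusive")[-1]

    def is_degraded(self):
        """True when the rolling p95 write time exceeds the threshold."""
        return self.p95() > self.threshold_seconds
```

In use, each error or session write would be wrapped in `monitor.timed(...)`, and `monitor.is_degraded()` would be polled to raise an alert. Tracking a percentile rather than an average is a deliberate choice here, since a few multi-second writes among many fast ones would otherwise be masked.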