Learnosity incident

Issue affecting availability of recently submitted session analytics in US-East-1

Minor · Resolved

Learnosity experienced a minor incident on August 19, 2025 affecting Availability of session information, lasting 5h 20m. The incident has been resolved; the full update timeline is below.

Started
Aug 19, 2025, 04:11 PM UTC
Resolved
Aug 19, 2025, 09:31 PM UTC
Duration
5h 20m
Detected by Pingoru
Aug 19, 2025, 04:11 PM UTC

Affected components

Availability of session information

Update timeline

  1. investigating Aug 19, 2025, 04:11 PM UTC

    As of 3:40 UTC, we are experiencing degraded performance in our analytics stack affecting the availability of session data in US-East-1 for a subset of customers. This is affecting the Reports API and Data API. Neither authoring nor assessment stacks are affected, and there is no data loss. Submitted sessions are queueing for processing. Learnosity Support and Systems Engineering teams are actively investigating the issue, and will follow on with an update and resolution as soon as possible.

  2. investigating Aug 19, 2025, 04:35 PM UTC

    As of 4:30 UTC, we are continuing to investigate degraded performance in our Data and Reports APIs. Only availability of recently submitted session data is affected. Historical session data, as well as all other API stacks, remains unaffected. New submissions are persisting correctly with no data loss, and these submissions are being queued for scoring. Learnosity Support and Systems Engineering teams are continuing to actively investigate the issue, and will follow on with an update and resolution as soon as possible.

  3. identified Aug 19, 2025, 05:34 PM UTC

    (Note: We're correcting cited UTC times to the 24-hour format and will include both forms in this update only.) As of 5:30pm/17:30 UTC, we've identified a possible contributing cause for the degraded performance in our analytics stack. Learnosity Support and Systems Engineering teams are continuing to actively investigate the issue, and will follow on with an update and resolution as soon as possible.

  4. identified Aug 19, 2025, 06:42 PM UTC

    As of 18:30 UTC, initial remediation efforts have tripled the rate of queued session processing and we are continuing to work toward a full resolution. Access to recently submitted session results via the Data API and Reports API remains the only affected part of the Learnosity ecosystem. New submissions continue to be safely queued for scoring while the degraded performance remains. Learnosity Support and Systems Engineering teams are continuing to actively investigate the issue, and will follow on with an update and resolution as soon as possible.

  5. identified Aug 19, 2025, 07:56 PM UTC

    As of 19:30 UTC, we are now processing queued sessions rapidly, and more than half of the backlog has already cleared. Learnosity Support and Systems Engineering teams are continuing to actively investigate the issue, and will follow on with an update and resolution as soon as possible.

  6. identified Aug 19, 2025, 08:35 PM UTC

    As of 20:30 UTC, the scoring queue backlog is almost empty and new sessions will soon be processed without delay. Learnosity Support and Systems Engineering teams are continuing to actively investigate the issue, and will follow on with an update and resolution as soon as possible.

  7. monitoring Aug 19, 2025, 09:05 PM UTC

    As of 21:00 UTC, all sessions have been cleared from the scoring backlog queue and all submissions are being processed normally. Learnosity Support and Systems Engineering teams will monitor this situation for a further 30 minutes before calling it resolved.

  8. resolved Aug 19, 2025, 09:31 PM UTC

    As of 21:30 UTC, we have resolved the issue affecting availability of session data in US-East-1. Learnosity Support and Systems Engineering teams will follow up with a post-mortem once we have completed root cause analysis and finalized any next steps or preventative measures required. Please reach out if you have any questions or concerns.

  9. postmortem Sep 19, 2025, 07:14 PM UTC

### Affected Systems and Regions

On 19 August 2025, Learnosity experienced degraded performance in our analytics stack, affecting session data availability in US-East-1 for a subset of customers. This affected the Reports API and Data API, with no other stacks affected and no data loss.

### Investigation

Monitoring detected a rapid increase in unprocessed and retried messages, along with elevated lock contention in the sessions database. The root cause was traced to a customer implementation issue generating an extraordinarily high number of submissions. This drove excessive retries, magnifying actual traffic volume. The use of time-ordered v7 UUIDs for session IDs, normally handled without issue, became problematic under this contention. Uniqueness checks on each session ID required more resources and triggered a succession of temporary deadlocks. These deadlocks would usually self-resolve, but the amplified traffic prevented recovery, turning a minor issue into a sustained queue blockage.

### Resolution

Once the issue was identified, Learnosity moved the customer to a dedicated, isolated sync queue, preventing cross-tenant impact while we investigated. We applied targeted rate limits for the isolated service to protect the database, and drained the backlog. Where safe, long-running queries were terminated to free locks and allow forward progress. To support faster diagnosis, Learnosity enabled detailed deadlock logging and expanded metrics around message retries, abandonment, and per-session activity. Learnosity also worked with the customer to adjust implementation settings, reducing combined saves and submits by two orders of magnitude. Session IDs were also switched to v4 UUIDs, which simplified uniqueness checks and further prevented deadlocks. Immediately after these changes were put into use, queues began to recover rapidly, and normal processing resumed.
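The v7-versus-v4 distinction above can be illustrated in a short sketch. A v7 UUID leads with a 48-bit millisecond Unix timestamp, so IDs generated close together share a common prefix and cluster on the same index pages, concentrating inserts (and any locking around uniqueness checks) in one hot spot; v4 UUIDs are fully random and spread out. The `uuid7` helper below is a hypothetical minimal implementation for illustration, since widely deployed Python versions only ship `uuid4` in the standard library:

```python
import os
import time
import uuid


def uuid7() -> uuid.UUID:
    """Minimal RFC 9562-style v7 UUID: 48-bit Unix-ms timestamp,
    then version/variant bits, with the remainder random."""
    ts_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")        # 80 random bits
    value = (ts_ms & ((1 << 48) - 1)) << 80             # timestamp in top 48 bits
    value |= 0x7 << 76                                  # version nibble = 7
    value |= (rand >> 4) & ((1 << 76) - 1)              # fill the low 76 bits
    value = (value & ~(0x3 << 62)) | (0x2 << 62)        # RFC variant bits (10)
    return uuid.UUID(int=value)


# Two v7 IDs generated back to back share their leading timestamp bytes,
# so a unique index on them sees highly clustered inserts:
a, b = uuid7(), uuid7()
print(str(a)[:13], str(b)[:13])

# v4 prefixes are uncorrelated, spreading load across the index:
c, d = uuid.uuid4(), uuid.uuid4()
print(str(c)[:13], str(d)[:13])
```

Note this only illustrates the ID layout; whether clustering hurts or helps depends on the database's index and locking behavior under load.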
Most sessions for the subset of affected customers saw short delays, while the most significantly delayed session took ~6 hours before final persistence. Throughout, we identified no data loss.

### Prevention

To prevent recurrence, we are:

* Implementing targeted load tests and contention simulations to replicate high-parallelism patterns.
* Reviewing customer identifier schemes for session IDs and auditing usage across all customers (initial checks confirm none of our Top 50 customers currently use v7 UUIDs).
* Analyzing adoption of per-tenant fair-use queues (or equivalent fair-share policies) to cap burst throughput from a single tenant and protect shared infrastructure.
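The per-tenant fair-use queue mentioned in the last bullet could take many forms. One minimal sketch, assuming a round-robin dispatcher over per-tenant sub-queues (a hypothetical illustration, not Learnosity's actual design), shows how a burst from one tenant is prevented from monopolizing shared workers:

```python
from collections import deque


class FairShareQueue:
    """Round-robin dispatch over per-tenant queues: each tenant with
    pending work gets one message dispatched per cycle, so a single
    tenant's burst cannot starve the others."""

    def __init__(self):
        self.queues = {}      # tenant_id -> deque of pending messages
        self.order = deque()  # round-robin rotation of tenants with work

    def submit(self, tenant_id, message):
        q = self.queues.setdefault(tenant_id, deque())
        if not q:
            self.order.append(tenant_id)  # tenant re-enters the rotation
        q.append(message)

    def next_message(self):
        if not self.order:
            return None
        tenant_id = self.order.popleft()
        q = self.queues[tenant_id]
        message = q.popleft()
        if q:
            self.order.append(tenant_id)  # still has work: back of the line
        return message


# A burst of 100 messages from tenant "A" does not delay tenant "B":
fq = FairShareQueue()
for i in range(100):
    fq.submit("A", f"a{i}")
fq.submit("B", "b0")
fq.submit("B", "b1")
print([fq.next_message() for _ in range(4)])  # B's messages interleave early
```

A production version would add persistence, concurrency control, and per-tenant rate caps, but the interleaving behavior is the core of the fair-share idea.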