Mindtickle incident
Intermittent failures observed in bulk operations and invitation workflows
Mindtickle experienced a minor incident on November 27, 2024 affecting Course / Quick-Update / Assessment and Mission and 1 more component, lasting 5h 40m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Nov 27, 2024, 06:06 PM UTC
Since 07:06 PT, Nov 27, 2024, we have observed intermittent failures for some bulk operations and invitation workflows. The impacted flows are:
1. Module-related bulk operations from the series page [copy, publish, move, mirror]
2. Invitation: manual and invite during publish / mirror
3. Certification: delay in delivery of system-based certificates
4. Bulk Uploads

The team is investigating the issue and we will share an update shortly.
- identified Nov 27, 2024, 11:45 PM UTC
The issue has been identified and a fix is being implemented.
- resolved Nov 27, 2024, 11:46 PM UTC
The incident has been resolved and the system is now back to normal.
- postmortem Dec 05, 2024, 09:14 AM UTC
### **Incident Summary**

On **Nov 27, 2024**, we experienced a large influx of **requests** through the Open API. The requests were bulk API calls, amounting to a **few million new requests**. Each request was sent to a queue for efficient and reliable processing. Due to an erroneous routing configuration, these requests were also sent to a dormant queue that had no consumer. That queue filled up and overloaded memory in the message queue system.

At **8:43 AM PT**, the team began receiving alerts. These were traced back to a **memory overload** in the message queue system caused by a large volume of unprocessed events.

This incident led to intermittent failures in several workflows, including **user sync, invitations, certifications, and bulk uploads**.

#### **Impacted Workflows:**

* Bulk Publish module
* Bulk Archive module
* Bulk Mirror module
* Update Availability module
* Module Move
* Certification Award
* Schedule Invitation
* Bulk update through Open APIs
* User Sync

### **Incident Timeline**

* **Nov 27, 2024, 7:06 AM PT**: Workflow service began encountering errors, marking the start of the incident.
* **Nov 27, 2024, 8:43 AM PT**: War room activated to address the issue.
* **Nov 27, 2024, 8:54 AM PT**: Identified the messaging queue system's memory overload as the root cause of the errors.
* **Nov 27, 2024, 10:00 AM PT**: Determined that system upgrades / upscaling could not occur during high memory usage.
* **Nov 27, 2024, 11:15 AM PT**: Determined that the memory overload was due to erroneous, unnecessary routing of events to a queue that had no consumer, which caused the queue to fill up. Decision made to purge the faulty queue to reduce memory usage.
* **Nov 27, 2024, 11:30 AM PT**: Purge completed; began monitoring the system for memory reduction.
* **Nov 27, 2024, 12:30 PM PT**: Memory usage remained high despite the purge.
* **Nov 27, 2024, 1:04 PM PT**: Performed a force reboot of the system, and memory usage normalized.
* **Nov 27, 2024, 3:36 PM PT**: Systems returned to normal.

### **Root Cause Analysis**

The incident was caused by a **memory overload** in the message queue system, triggered by an overflow of unprocessed events in a queue that had no consumer. This overload prevented workflows from completing, impacting several services.

### **Lessons Learned**

* **Message Queue Configuration Automation**: The absence of automated synchronization between code changes and message queue configurations led to missing routing keys, causing issues with the DLQ. A system to automatically update message queue configurations will help mitigate this risk.
* **Unmonitored DLQs**: The dead-letter queue (DLQ) was not actively monitored, which led to an overflow of unprocessed events. Future processes should include dedicated monitoring for DLQs, along with a consumer to process stalled events.
* **Delayed Detection and Resolution**: The incident took too long to detect and resolve. By implementing **improved monitoring**, **better alerting systems**, and **real-time anomaly detection**, we can reduce **MTTR** (Mean Time to Resolution) and **MTTD** (Mean Time to Detection) for future incidents.
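
The first two lessons above can be illustrated with a minimal Python sketch. It is not the actual tooling: the postmortem does not name the message queue system, so the `QueueStats` shape, the queue names, and the `max_backlog` threshold are all hypothetical. It assumes that per-queue message and consumer counts, and the routing keys bound to each queue, can be read from the broker by some means. One check flags queues that hold messages no consumer is reading, and the other reports differences between the routing keys declared in code and those configured on the broker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueStats:
    """Point-in-time stats for one queue (hypothetical shape, not a broker API)."""
    name: str
    messages: int   # messages currently waiting in the queue
    consumers: int  # consumers currently attached to the queue


def find_stalled_queues(stats, max_backlog=10_000):
    """Return the names of queues that are accumulating unread messages.

    A queue is flagged when it holds messages but has no consumer, or when
    its backlog exceeds max_backlog even though consumers are attached.
    """
    flagged = []
    for q in stats:
        no_consumer = q.consumers == 0 and q.messages > 0
        too_deep = q.messages > max_backlog
        if no_consumer or too_deep:
            flagged.append(q.name)
    return flagged


def find_routing_drift(expected, actual):
    """Compare routing keys declared in code with those on the broker.

    Both arguments map a queue name to a set of routing keys. The result maps
    each differing queue to a (missing_keys, unexpected_keys) tuple.
    """
    drift = {}
    for queue in sorted(set(expected) | set(actual)):
        want = expected.get(queue, set())
        have = actual.get(queue, set())
        if want != have:
            drift[queue] = (want - have, have - want)
    return drift


if __name__ == "__main__":
    # Hypothetical snapshot: a busy queue with consumers and a dormant DLQ.
    snapshot = [
        QueueStats("bulk-ops", messages=120, consumers=4),
        QueueStats("bulk-ops-dlq", messages=2_500_000, consumers=0),
    ]
    print(find_stalled_queues(snapshot))

    # Hypothetical routing tables: the broker has a binding the code does not.
    declared = {"bulk-ops": {"bulk.publish"}}
    on_broker = {"bulk-ops": {"bulk.publish"}, "bulk-ops-dlq": {"bulk.publish"}}
    print(find_routing_drift(declared, on_broker))
```

Run on a schedule, the first check would raise an alert as soon as a consumerless queue starts to fill, and the second could run in a deployment pipeline to catch routing changes that were never synchronized with the code.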