Mindtickle incident
Intermittent failures in recording of calls through Call AI
Mindtickle experienced a major incident on February 8, 2024, lasting —. The incident has been resolved; the full update timeline is below.
Update timeline
- resolved Feb 08, 2024, 02:34 PM UTC
From 18:00 PT 24-01-2024 to 19:45 PT 26-01-2024, call recordings were failing intermittently on Call AI.
- postmortem Feb 09, 2024, 09:09 AM UTC
**Impact:**

* Call recordings were intermittently not recorded across Zoom, Teams, Pexip, and Google.

**Timestamp:**

* Start Time: 18:00 PT, 24-01-2024
* End Time: 19:45 PT, 26-01-2024

**Root Cause Analysis:**

* The issue arose when the disk space for one of our servers reached maximum capacity, preventing calls directed to this server from being recorded.
* Alerts for memory, CPU, and disk usage were configured, but no alerts were triggered as data was still being ingested.
* Large media files (>20GB) were identified as the main culprits, originating from bots running for extended periods (beyond the expected 5-hour meeting timeout).

**Corrective Actions Taken:**

* Increased disk space to prevent new recordings from being impacted.
* Removed large files associated with stuck bots from the server.
* Terminated stuck bots to halt the recording process.
* Reconciled calls where possible.

**Learning & Next Steps:**

* Implement measures to reduce time in detecting such cases in the future:
  * Monitor and set alerts for long-running meetings and bots not reaching the terminal state after a specified time period (denoted as 'X').
  * Monitor and set alerts for incomplete meeting workflows.
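
As an illustration of the first next step, the check for bots that have not reached a terminal state after period 'X' might look roughly like the sketch below. This is not Mindtickle's implementation. The `BotSession` record, the state names in `TERMINAL_STATES`, and the `find_stuck_bots` function are all hypothetical. The only values taken from the report are the 5-hour meeting timeout (used here as an assumed default for 'X') and the 20GB file size.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical terminal states; the report does not describe the bot lifecycle.
TERMINAL_STATES = {"completed", "failed", "terminated"}

# Assumed defaults: 'X' is set to the 5-hour meeting timeout from the report,
# and the size limit to the 20GB figure cited for the oversized media files.
DEFAULT_MAX_AGE = timedelta(hours=5)
DEFAULT_MAX_BYTES = 20 * 1024**3


@dataclass
class BotSession:
    """Hypothetical record for one recording bot."""
    bot_id: str
    state: str
    started_at: datetime
    recording_bytes: int = 0


def find_stuck_bots(sessions, now, max_age=DEFAULT_MAX_AGE, max_bytes=DEFAULT_MAX_BYTES):
    """Return sessions that are still active but exceed the age or size limit."""
    stuck = []
    for session in sessions:
        if session.state in TERMINAL_STATES:
            continue  # finished bots cannot be stuck
        too_old = now - session.started_at > max_age
        too_large = session.recording_bytes > max_bytes
        if too_old or too_large:
            stuck.append(session)
    return stuck
```

A scheduler could run `find_stuck_bots` periodically and raise an alert for each returned session. Checking bot age and file size directly, rather than relying on disk usage alone, targets the condition the root cause analysis identified.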