Mindtickle incident

Mindtickle Call AI unable to record calls on MS Teams

Minor · Resolved

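Mindtickle reported a minor incident on May 24, 2024. According to the postmortem, the incident itself occurred on May 15, 2024, and lasted about 8.5 hours (01:15 AM PT to 09:45 AM PT). The incident has been resolved; the full update timeline is below.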

Started
May 24, 2024, 02:22 PM UTC
Resolved
May 15, 2024, 08:15 AM UTC
Duration
About 8 hours 30 minutes (May 15, 2024, 01:15 AM PT to 09:45 AM PT, per the postmortem)
Detected by Pingoru
May 24, 2024, 02:22 PM UTC

Update timeline

  1. resolved May 24, 2024, 02:22 PM UTC

    On May 15, 2024, Mindtickle Call AI could not record a few calls on MS Teams.

  2. postmortem May 24, 2024, 02:23 PM UTC

    On May 15, 2024, Mindtickle Call AI could not record a few calls on MS Teams.

    **Incident Start:** May 15, 2024, 01:15 AM PT

    **Incident Resolved:** May 15, 2024, 09:45 AM PT

    We sincerely apologize for any inconvenience caused. We are committed to learning from this incident and improving our processes and systems. Below is the incident's timeline, the root cause, and action items.

    **Incident timeline:**

    * May 15, 2024, 01:30 AM PT: Periodic reconciliation job triggered
    * May 15, 2024, 01:45 AM PT: Lag started building up
    * May 15, 2024, 09:00 AM PT: The Engineering team reviewed the end-of-day report and identified the issue.
    * May 15, 2024, 09:35 AM PT: Impacted call identified and removed from the pipeline
    * May 15, 2024, 09:45 AM PT: System restored to nominal state

    **Root Cause:**

    * We have systems in place to ensure all calls are accurately recorded and captured in the Mindtickle system. As an additional precaution, we regularly perform reconciliations of the calls.
    * We use a common pipeline to send calls to Microsoft Teams, including those to initiate a recording, run a reconciliation, and retrieve recordings after the call is recorded.
    * The periodic reconciliation job was scheduled for 01:30 AM PT on May 15, 2024.
    * During the reconciliation, one specific call took longer than expected to process. Upon investigation, we discovered that the resources allocated to reconcile this call were fully consumed by it, causing the process to fail. Despite retry attempts, the call continued to fail, delaying all other events in the pipeline.
    * We generate system reports at the beginning and end of each day to assess the performance of the Call AI platform. Upon reviewing the start-of-day report on May 15, 2024, we noticed that the pipeline was congested and unable to process calls.
    * We promptly addressed the issue by removing the affected call from the pipeline and completing the processing of all pending actions.
    * This delay affected a few new call recordings scheduled between 01:15 AM PT and 09:45 AM PT, as we were unable to initiate the 'start recording' call to MS Teams on time.

    **Learning and Next Steps:**

    * Separate pipelines for different workloads: Previously, we used one pipeline for both sending call recording requests and fetching call recordings from MS Teams. Now, we have split these tasks into separate pipelines, each dedicated to its specific workload.
    * Enhanced alerting: We've added real-time alerts to both pipelines to ensure immediate action can be taken if needed.
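
The failure mode described in the root cause (one event that keeps failing at the head of a shared pipeline and delays everything queued behind it) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Mindtickle's implementation: `Event`, `drain`, `lag_alert`, the `always_fails` flag, the call IDs, and the alert threshold are all hypothetical names and values chosen for illustration. The postmortem also does not say which of the new pipelines runs reconciliation; here it is assumed to stay with the non-recording work.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str                   # "start_recording", "reconcile", or "fetch_recording"
    call_id: str
    always_fails: bool = False  # hypothetical stand-in for the call that exhausted its resources


def drain(events, attempts):
    """Process events strictly in FIFO order, making at most `attempts` tries.

    A failing event is retried at the head of the queue, so nothing behind
    it can run. Returns (processed, still_pending).
    """
    pending = deque(events)
    processed = []
    for _ in range(attempts):
        if not pending:
            break
        if pending[0].always_fails:
            continue  # the attempt is spent retrying the same event
        processed.append(pending.popleft())
    return processed, list(pending)


def lag_alert(pending, threshold):
    """Return True when the backlog reaches the (assumed) alert threshold."""
    return len(pending) >= threshold


events = [
    Event("reconcile", "call-A", always_fails=True),
    Event("start_recording", "call-B"),
    Event("fetch_recording", "call-C"),
    Event("start_recording", "call-D"),
]

# One shared pipeline: the failing reconciliation blocks every later event.
shared_done, shared_pending = drain(events, attempts=20)

# Split pipelines (assumed partition): recording requests get their own queue.
recording_done, recording_pending = drain(
    [e for e in events if e.kind == "start_recording"], attempts=20
)
other_done, other_pending = drain(
    [e for e in events if e.kind != "start_recording"], attempts=20
)

print(len(shared_done), len(shared_pending))   # 0 4
print([e.call_id for e in recording_done])     # ['call-B', 'call-D']
print(lag_alert(other_pending, threshold=2))   # True
```

In the shared case nothing is processed, which mirrors the delayed 'start recording' requests. After the split, the recording queue drains even though the failing reconciliation is still stuck, and the backlog check on the other queue is the kind of signal a real-time alert could act on before the next daily report.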