AI-Media incident

Errors connecting to Lexi - May 6 2024

Major Resolved

AI-Media experienced a major incident on May 6, 2024 affecting Lexi, lasting 1h 56m. The incident has been resolved; the full update timeline is below.

Started: May 06, 2024, 06:12 PM UTC
Resolved: May 06, 2024, 08:09 PM UTC
Duration: 1h 56m
Detected by Pingoru: May 06, 2024, 06:12 PM UTC

Affected components

Lexi

Update timeline

  1. investigating May 06, 2024, 06:12 PM UTC

    We have received scattered reports of encoders being unable to connect to the Lexi service. We are investigating and will update here once more information is available.

  2. investigating May 06, 2024, 06:49 PM UTC

    Update: our engineering team is actively engaged and is still investigating this issue. There is nothing new to report yet, but we will continue to provide regular updates here as we work towards a resolution.

  3. investigating May 06, 2024, 06:57 PM UTC

    We are continuing to investigate this issue.

  4. monitoring May 06, 2024, 07:25 PM UTC

    We traced this incident to a memory issue with the Lexi message broker and have instituted a fix; Lexi sessions are starting normally again. We will continue to monitor to ensure this is fully resolved.

  5. resolved May 06, 2024, 08:09 PM UTC

    Lexi operations have returned to normal and all metrics are looking healthy. We believe this to be resolved. Any users experiencing further issues with Lexi should contact our Support team at [email protected].

  6. postmortem May 07, 2024, 02:52 PM UTC

    ## What happened?

    On May 6, beginning around 17:01 UTC/13:01 EDT and lasting approximately 2 hours, the Lexi service was unable to start any new sessions. This affected all users of the service, whether managing Lexi from an encoder or in EEGCloud via the Lexi UI or scheduler. Any Lexi sessions that were already running when this incident began were unaffected and continued to run normally.

    Our engineering team was able to trace the cause of this incident to the Lexi production message broker, which apparently exceeded a memory threshold that caused it to go into an alarm state and block all clients attempting to send messages, indefinitely. EEGCloud was thus unable to communicate with the Lexi (and Translate) backends; all new jobs were stuck in a state of either “CREATED” or “TERMINATING”.

    Once engineering was able to identify the cause, they restarted the message broker service, which resolved the memory issue. This message broker had been running for almost a full year prior to this restart with no previous issues reported.

    ## What are we doing about it?

    We are exploring several steps in the immediate and longer term to prevent an incident like this from disrupting Lexi in the future, including:

    * Better alarming on the message broker service
    * Generating an alarm if the number of Lexi jobs started successfully over a given period drops below a particular threshold
    * Updating the message broker software version and underlying OS
    * Periodic memory usage checks on the message broker
    * Implementing a backup to prevent a single point of failure

    AI-Media understands the importance of a reliable cloud service for closed captioning and we apologize for the disruption this incident caused to EEGCloud users. If there are any follow-up questions on this incident, please submit a ticket to [[email protected]](mailto:[email protected]) with subject line “May 6 Lexi Outage”.
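
Two of the follow-up steps named in the postmortem, an alarm when the number of successfully started Lexi jobs drops below a threshold over a given period and periodic memory usage checks on the message broker, could be sketched roughly as below. This is only an illustration under stated assumptions, not AI-Media's implementation: the names `JobStartAlarm` and `memory_alarm`, the window length, the minimum start count, and the 80% warning fraction are all hypothetical, and how job starts or broker memory figures would actually be collected is not described in the source.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class JobStartAlarm:
    """Fires when too few jobs started successfully within a sliding window.

    Hypothetical sketch: window_seconds and min_starts are illustrative
    parameters, not values taken from the incident report.
    """

    window_seconds: float
    min_starts: int
    _starts: deque = field(default_factory=deque, init=False, repr=False)

    def record_start(self, timestamp: float) -> None:
        # Assumes timestamps (in seconds) arrive in non-decreasing order.
        self._starts.append(timestamp)

    def is_firing(self, now: float) -> bool:
        # Discard starts that have aged out of the window, then compare
        # the remaining count against the minimum.
        cutoff = now - self.window_seconds
        while self._starts and self._starts[0] <= cutoff:
            self._starts.popleft()
        return len(self._starts) < self.min_starts


def memory_alarm(used_bytes: int, limit_bytes: int,
                 warn_fraction: float = 0.8) -> bool:
    """Return True once usage reaches warn_fraction of the memory limit.

    The 0.8 default is an assumed early-warning level, chosen so the check
    trips before a broker would reach its own blocking threshold.
    """
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return used_bytes >= warn_fraction * limit_bytes
```

A monitoring loop might call `record_start` each time a job leaves the “CREATED” state and poll `is_firing` once a minute. A fixed minimum would also fire during genuinely quiet periods, so in practice the threshold would likely need to vary with expected traffic.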