AI-Media experienced a minor incident on September 18, 2025. The incident has been resolved; the full update timeline is below.
Update timeline
- resolved Sep 19, 2025, 02:39 AM UTC
We investigated a brief service disruption that affected 10 or fewer LEXI jobs on September 18th (~15:30-16:00 UTC). Users experienced dropped or delayed output lasting between ~1 and ~5 minutes. The drops in output resolved without user intervention. We identified a potential root cause that may also explain the similar incident on September 8th. No further instances have been detected since then.
- postmortem Oct 06, 2025, 08:58 PM UTC
**Date/Time:** September 18th, 2025, 15:30-16:00 UTC \(11:30-12:00 EDT\)

**Duration:** ~30 minutes

**Impact:** Fewer than 10 LEXI jobs experienced dropped or delayed output \(1-5 minutes\)

**Root Cause:** Missing CPU resource limits on infrastructure components caused node instability

**Timeline**

* **15:30 UTC:** Service disruption begins affecting LEXI jobs
* **15:30-16:00 UTC:** Users experience intermittent dropped/delayed output
* **15:45 UTC:** Issue detected via monitoring alerts showing CPU spikes on compute nodes
* **16:00 UTC:** Service automatically recovered
* **16:00-22:00 UTC:** Investigation identified resource limit configuration issue on \[specific components\]
* **22:00 UTC \(6:00 PM EDT\):** CPU resource limits applied to affected infrastructure components

**Root Cause Analysis**

Infrastructure components running without proper CPU resource limits consumed excessive resources, causing:

1. CPU spikes on compute nodes
2. Node instability
3. Disruption to LEXI job processing

This same root cause likely explains the similar incident on September 8th.
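
As an illustration only, a check for the class of misconfiguration described in the root cause analysis might look like the sketch below. It is not AI-Media's tooling: the `ComponentSpec` shape, the `cpu_limit_millicores` field, and the component names are all hypothetical, and the report does not say which orchestration platform or components were involved. The sketch simply flags any component whose CPU limit is unset or non-positive.

```python
from typing import Optional, TypedDict


class ComponentSpec(TypedDict):
    """Hypothetical description of one infrastructure component."""
    name: str
    cpu_limit_millicores: Optional[int]  # None means no CPU limit is configured


def components_missing_cpu_limit(specs: list[ComponentSpec]) -> list[str]:
    """Return the names of components that have no positive CPU limit."""
    return [
        spec["name"]
        for spec in specs
        if spec["cpu_limit_millicores"] is None
        or spec["cpu_limit_millicores"] <= 0
    ]


# Made-up inventory for illustration; these are not real component names.
inventory: list[ComponentSpec] = [
    {"name": "component-a", "cpu_limit_millicores": 500},
    {"name": "component-b", "cpu_limit_millicores": None},
    {"name": "component-c", "cpu_limit_millicores": 0},
]

unbounded = components_missing_cpu_limit(inventory)
print(unbounded)
```

Running a check like this against a deployment inventory before rollout is one way to surface unbounded components early, so that a single component cannot consume enough CPU to destabilize the node it shares with other jobs.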