AssemblyAI incident

Async + LLM gateway APIs returning 5xx

Critical · Resolved

AssemblyAI experienced a critical incident on January 29, 2026 affecting the Asynchronous API, Website, and several other components, lasting 1h 38m. The incident has been resolved; the full update timeline is below.

Started
Jan 29, 2026, 05:54 PM UTC
Resolved
Jan 29, 2026, 07:32 PM UTC
Duration
1h 38m
Detected by Pingoru
Jan 29, 2026, 05:54 PM UTC

Affected components

Asynchronous API · Website · Streaming API · Playground · Dashboard · LLM Gateway

Update timeline

  1. investigating Jan 29, 2026, 05:54 PM UTC

    We are currently investigating an issue affecting async + LLM gateway services. More updates to come as we learn more.

  2. investigating Jan 29, 2026, 06:02 PM UTC

    We are continuing to investigate this issue.

  3. investigating Jan 29, 2026, 06:09 PM UTC

    We are currently investigating an issue affecting our Async and LLM Gateway services, which are returning 500 errors. Our team is actively working to identify and resolve the problem. We will provide updates as more information becomes available. All streaming API and EU endpoint services are running normally. This incident is isolated to NA API endpoints for async + LLM gateway.

  4. investigating Jan 29, 2026, 06:29 PM UTC

    We are seeing requests beginning to successfully complete and have moved to the monitoring phase. Some requests are still returning 500 errors. Our team continues to actively investigate and monitor the situation. We will provide updates as more information becomes available.

  5. monitoring Jan 29, 2026, 06:30 PM UTC

    We are seeing requests beginning to successfully complete and have moved to the monitoring phase. Some requests are still returning 500 errors. Our website is down, and logins may be affected. Our team continues to actively investigate and monitor the situation. We will provide updates as more information becomes available.

  6. monitoring Jan 29, 2026, 06:42 PM UTC

    We are continuing to see requests being completed. We are working through a backlog of requests, resulting in elevated turnaround times while we recover. Our team continues to actively monitor the situation. We will provide updates as more information becomes available.

  7. monitoring Jan 29, 2026, 07:18 PM UTC

    We are continuing to work through the backlog, and turnaround times are improving. Our team continues to actively monitor the situation. We will provide updates as more information becomes available.

  8. resolved Jan 29, 2026, 07:32 PM UTC

    We have cleared the backlog, and turnaround times are back to normal. Our team continues to actively monitor the situation. We will provide a post-mortem when available.

  9. postmortem Feb 05, 2026, 06:35 PM UTC

    **Summary: January 29 Outage**

    On January 29, our Async Transcription and LLM Gateway went down for 25 minutes, followed by two hours of degraded performance. Here's what happened.

    Our traffic has been steadily increasing, and we received an influx of batch transcription jobs that was higher than usual. Autoscaling did what it should: it spun up workers. We ran out of spot instances, fell back to on-demand, then hit AWS's Elastic Network Interface (ENI) limit. Every ECS task needs an ENI. With no ENIs available, our authentication service couldn't rotate its tasks. It scaled to zero. Everything stopped.

    When we tried to recover, a Lambda function meant to help with failures was hammering the AWS API with describe calls, rate-limiting the very operations we needed to bring things back. A tool built for recovery was blocking recovery.

    **What actually went wrong:** We didn't monitor ENI utilization, and our authentication service shared resources with batch workers. A traffic spike in one starved the other.

    **What we're doing:** Short-term: ENI alerts, isolating auth on dedicated capacity, and fixing the Lambda. Medium-term: full multi-region for everything.

    This was preventable. We had the warning signs and didn't connect them. We're sorry, and we're fixing it.
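
The "fix the Lambda" item above presumably means stopping the recovery function from hammering the AWS API with describe calls; the standard client-side mitigation for that kind of throttling is exponential backoff with full jitter. The sketch below is a minimal illustration of that pattern, not AssemblyAI's actual code, and every name in it (`ThrottlingError`, `call_with_backoff`, `flaky_describe`) is hypothetical:

```python
import random
import time

class ThrottlingError(Exception):
    """Stand-in for the throttling error an AWS SDK surfaces (e.g. 'Rate exceeded')."""

def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Retry fn() on throttling, waiting an exponentially growing, jittered delay."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller see the throttle
            # Full jitter: sleep a random amount up to the capped exponential delay,
            # so many callers retrying at once don't all hit the API in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))

# Example: a fake describe call that is throttled twice, then succeeds.
calls = {"n": 0}
def flaky_describe():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ThrottlingError("Rate exceeded")
    return {"NetworkInterfaces": []}

result = call_with_backoff(flaky_describe, sleep=lambda s: None)
```

The jitter matters for the scenario described here: without it, a fleet of retrying callers synchronizes and keeps re-saturating the same API rate limit it is waiting out.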