AssemblyAI experienced a major incident on January 3, 2025 affecting the Asynchronous API, lasting 1h 15m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Jan 03, 2025, 04:38 PM UTC
We are currently investigating an issue that is resulting in slowdowns in transcription completions. We will share more details as soon as we learn more.
- investigating Jan 03, 2025, 04:42 PM UTC
We are still investigating this issue. Its impact is slower-than-usual processing times for requests to our Async endpoint, as well as some failed requests with "the operation timed out" errors.
- investigating Jan 03, 2025, 05:03 PM UTC
We are still seeing issues leading to longer-than-usual processing times. This slowdown is causing throttling for some accounts as well as some failed requests due to timeout errors.
- identified Jan 03, 2025, 05:35 PM UTC
We are working to reset unhealthy instances in our pipeline, and we are now seeing an increase in successful requests and a reduction in the number of failed requests. We are still experiencing longer-than-usual processing times, but we are seeing improvements now.
- identified Jan 03, 2025, 05:43 PM UTC
We are continuing to see improvements and will be moving into monitoring status shortly.
- monitoring Jan 03, 2025, 05:46 PM UTC
We have made changes to address the issues we were seeing earlier. Processing times and error rates are now returning to the normal range. Some requests are still throttled as a result of the earlier slowdown, but that number is steadily declining as we continue to monitor performance.
- resolved Jan 03, 2025, 05:57 PM UTC
We have continued to see good processing times and error rates, and all previously throttled jobs have now completed, so we are marking this issue as resolved.
- postmortem Jan 12, 2025, 05:20 PM UTC
On January 3rd, 2025 and January 8th, 2025, we experienced two incidents that resulted in service disruptions and performance degradation for our asynchronous API.

**Root Cause Analysis**

Increased demand for the AssemblyAI asynchronous API has required changes to our datastore infrastructure to maintain performance and reliability.

* **January 3rd:** The Transcription Record Service did not scale out aggressively enough to service the traffic. We identified a misconfiguration in the scaling policies that constrained the service’s ability to scale out. This resulted in degraded performance and high latency for all requests during the 1-hour window.
* **January 8th:** Our team deployed a configuration update to our infrastructure that unexpectedly caused a redeployment of the Transcription Record Service. This resulted in failures for all customer requests during the 20-minute window while the service was redeployed and scaled back out to handle current traffic.

While both incidents impacted the same service, the underlying cause of each was different.

**Resolution and Next Steps**

To prevent similar incidents in the future, we have:

* Adjusted the auto-scaling rules of the Transcription Record Service to scale more aggressively with additional performance metrics
* Implemented additional pre-deployment validation checks on infrastructure code changes

We are also:

* Strengthening our infrastructure configuration management processes
* Improving the communication cadence to the [status page](https://status.assemblyai.com/) throughout the incident lifecycle

**Customer Commitment**

We apologize for any inconvenience or disruption these incidents may have caused. We are committed to providing a reliable and high-performing service and are taking steps to prevent similar incidents from occurring in the future.
As always, you can rely on the [status page](https://status.assemblyai.com/) for up-to-date information on the current status and health of our API.