Platform voice Grace and Nico not working
Timeline · 2 updates
- investigating Jun 12, 2026, 04:31 PM UTC
The platform voices Grace and Nico are not working.
- resolved Jun 12, 2026, 04:44 PM UTC
The issue is resolved now
Retell AI had 44 outages in the last 2 years totaling 24h 48m of downtime — averaging 1.8 incidents per month.
There were 44 Retell AI outages since March 14, 2025 totaling 24h 48m of downtime. Each is summarised below — incident details, duration, and resolution information.
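The headline figures above can be checked with a few lines of arithmetic. This is only a sketch: it assumes "the last 2 years" means 24 months, and the mean downtime per incident is derived here rather than stated in the source.

```python
# Figures quoted in the summary above.
incidents = 44
downtime_minutes = 24 * 60 + 48  # 24h 48m expressed in minutes

# Assumption: "the last 2 years" is treated as exactly 24 months.
months = 24

incidents_per_month = incidents / months
mean_minutes_per_incident = downtime_minutes / incidents  # derived, not in the source

print(f"{incidents_per_month:.1f} incidents per month")
print(f"{mean_minutes_per_incident:.1f} minutes of downtime per incident on average")
```

Under that assumption the rate rounds to the quoted 1.8 incidents per month. A shorter observation window would give a higher rate for the same 44 incidents.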
The platform voices Grace and Nico are not working.
The issue is resolved now
Many calls are experiencing high latency, no pickup, or being dropped due to WebSocket connection issues. We are investigating now.
This incident has been resolved. It was due to a connection issue in one of the data centers for the telephony provider. We are adding more fallback routes there.
Due to an error in the code, org concurrency was calculated incorrectly, and many calls returned concurrency_limit_reached.
Due to a database outage, call/chat history, analytics, and the QA dashboard are currently completely down. We are working with the database team to fix the issue.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are rolling out a fix to direct the traffic to our database replica while the provider is investigating the issue. Note that no call or chat data were lost.
The issue has been fixed and all services are back to working.
Agent publishing was broken for ~1 hour and returned a server error. Calling was not affected.
The KB retrieval and QA experienced an issue due to an unintended code change. The issue has been resolved, and we are actively monitoring the systems to ensure continued stability.
This incident has been resolved.
Starting ~13:32 UTC (8:32 AM PT), batch calls were not being sent due to an internal issue that was triggered in our system. We've identified the root cause and are actively working on a fix.
A fix is being deployed and we are monitoring to ensure recovery. As a side effect, batch calls that were missed are being marked as "sent".
This incident has been resolved.
19:38 PST: custom LLM calls started failing to start due to a bad release. Other types of calls are not impacted. 21:35 PST: a fix is being rolled out, and we are continuing to monitor the issue.
This incident has been resolved.
Between 1:25–2:08pm PT, a provider-side rate-limit configuration issue caused some TTS requests to be silently dropped. This issue has been resolved, and we are adding detection to make sure a fallback will be triggered if this happens again.
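The detection described above could look roughly like the following. This is a hedged sketch, not Retell AI's implementation: `synthesize_with_fallback`, `primary`, `fallback`, and `timeout_s` are hypothetical names, and it assumes a silently dropped request shows up either as no response within a timeout or as an empty result.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def synthesize_with_fallback(text, primary, fallback, timeout_s=2.0):
    """Call the primary TTS provider; switch to the fallback on a silent drop.

    `primary` and `fallback` are hypothetical callables that take text and
    return audio bytes. Exceptions raised by `primary` are not handled here
    and propagate to the caller.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary, text)
    try:
        # No response within the timeout is treated as a dropped request.
        audio = future.result(timeout=timeout_s)
    except FutureTimeout:
        audio = None
    finally:
        # Do not block the caller on a request that may never return.
        pool.shutdown(wait=False)

    # An empty or missing result is also treated as a dropped request.
    if not audio:
        return fallback(text)
    return audio
```

The point of the timeout is that a dropped request produces no error to react to, so the absence of a response has to be the trigger.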
An accidental change in the account cleanup logic led to the deletion of a small portion of community voice resources, which caused issues for agents using those voices. We have fixed the logic and have been running a backfill to detect and restore those resources.
We are currently experiencing service degradation affecting SIP trunking through our upstream provider, Telnyx. This may result in call connection issues. https://status.telnyx.com/
This incident has been resolved.
There were call initialization issues between 1:35pm and 1:43pm PST. Calls are back online now, but concurrency for some users may be stuck. We are working on a fix.
Calls and concurrency should be back to normal.
Scheduled batch calls did not run from 2:35 PM to 5:15 PM PST. The service has now recovered. Any batch calls whose scheduling window still included 5:15 PM were executed at that time. Batch calls whose scheduling window was fully missed were not executed.
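The recovery rule described above can be sketched as a small function. This is an illustration only: `outcome`, `scheduled_at`, and `window_end` are hypothetical names, windows are assumed not to cross midnight, and a window ending exactly at 5:15 PM is assumed to count as still including it.

```python
from datetime import time

OUTAGE_START = time(14, 35)  # 2:35 PM PST, from the incident note
RECOVERY = time(17, 15)      # 5:15 PM PST, from the incident note


def outcome(scheduled_at: time, window_end: time) -> str:
    """Classify a batch call by its scheduled time and the end of its window."""
    if not (OUTAGE_START <= scheduled_at < RECOVERY):
        return "unaffected"            # was not due during the outage
    if window_end >= RECOVERY:
        return "executed at recovery"  # window still open at 5:15 PM
    return "not executed"              # window fully missed


print(outcome(time(15, 0), time(18, 0)))  # executed at recovery
print(outcome(time(15, 0), time(16, 0)))  # not executed
```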
We're aware of issues affecting multilingual transcription and are actively investigating.
The issue appears to be resolved and multilingual transcription is recovering. We'll keep monitoring to ensure stability.
Outbound calls are currently experiencing issues. Some users report receiving multiple calls from a single dial attempt, or agents being unable to hear audio. We are investigating the root cause.
The root cause has been traced to our telephony stack provider. The issue has been escalated to their support team, who are actively triaging the incident.
The issue should now be resolved. We are closely monitoring the status.
The issue has been fixed.
A recent code change introduced an issue in the ASR module that may cause some agents to go offline. We are actively working on a fix.
We are continuing to investigate this issue.
The fix is being pushed to production, and we are monitoring the system status.
This incident has been resolved.
For around 30 minutes, calls and API requests failed due to an issue with Stripe. The outage occurred between 7:10 and 7:40 PST. This has been identified as an outage from our payment provider (Stripe). We are continuing to monitor the situation.
This incident has been resolved.
For around 20 minutes, there was a spike in transcription errors for certain non-English traffic (notably multilingual, Spanish, etc.). English traffic was not impacted. This has been identified as outages from underlying providers. We are working on adding more fallback routes to improve the stability of the platform.
We have observed that calls through Retell Telnyx numbers are having issues.
This has now been resolved. It was related to an elastic IP issue, which has been mitigated.
From 11:10 AM – 11:34 AM PST, we observed that inbound calls were experiencing connection issues and timeouts. We have identified the issue as being caused by a brief outage with our telephony provider. The issue has since been resolved and all services are operating normally.
Starting at 1am PST on Oct 20, the AWS outage has caused some login issues as well as call history and analytics issues. Regular calls are NOT impacted. Once the AWS outage is over, we will backfill the analytics.
The AWS outage has been resolved. We are going to backfill analytics, and keep a close eye on it.
This incident has been resolved.
From 1:14pm to 1:47pm PST, we observed that some calls ran into connection issues and experienced timeouts on operations. We identified the cause as AWS blocking the auto scaling of our instances, which left our call servers overloaded for a while. We have been working with the AWS team to identify the root cause and ensure this gets fixed. This issue has been resolved now.
Incident range: 6:30am - 11am PST
Impact: around 7% of calls had abnormally high latency.
Post mortem: Some AWS automatic patches to our transcription clusters caused the containers to lose GPU access and use CPU for transcription, which caused extra-long latency. Most calls were routed to the backup endpoint, which was working fine, but around 7% did not trigger the fallback. We are updating the containers to ensure they are not affected by the automatic patches.
We experienced a temporary disruption where some outbound calls did not connect as expected. This was due to an issue at our telephony SIP server provider related to elevated system load. The issue was transient and lasted from 8:00 AM to 9:30 AM PDT. The SIP server provider implemented a fix, and inbound calls have been operating normally since. We are going to roll out a locally hosted SIP stack to boost reliability going forward.
From 10:25am to 11:23am, concurrency was not resetting correctly for some customers, causing some calls to fail to connect. We are mitigating the issue by manually resetting concurrency, so the dashboard may not reflect the actual current value. We identified the root cause as a corner case in which audio file manipulation filled the disk, and a fix has been deployed.