Cartesia incident

TTS API Elevated TTFA

Major · Resolved
Started: Feb 23, 2026, 09:00 AM UTC
Resolved: Feb 23, 2026, 03:42 PM UTC
Duration: 6h 42m
Detected by Pingoru: Feb 23, 2026, 09:00 AM UTC

Affected components

Text to Speech (US), Text to Speech (APAC)

Update timeline

  1. investigating Feb 23, 2026, 02:00 PM UTC

    Investigating: We are currently investigating elevated Time to First Audio (TTFA) for api.cartesia.ai in the APAC and US regions. Customers may be experiencing longer-than-expected delays before audio begins streaming, and in some cases elevated inter-chunk latency. We are actively working to identify the root cause and will provide updates as we learn more.

  2. identified Feb 23, 2026, 02:54 PM UTC

    We have identified the component causing the issue and are in the process of rolling out a mitigation for it.

  3. identified Feb 23, 2026, 03:31 PM UTC

    We've rolled out a mitigation globally and are now monitoring TTFAs to ensure that the incident is fully resolved.

  4. resolved Feb 23, 2026, 03:42 PM UTC

    After monitoring, this incident appears to be fully resolved and TTFAs should be at normal levels.

  5. postmortem Feb 26, 2026, 08:13 PM UTC

    ## Incident Summary

    On February 23, 2026, our text-to-speech (TTS) service experienced significant performance degradation affecting customers in the APAC and US regions for approximately 8 hours. Users experienced elevated latency in both time to first audio (TTFA) and real-time factor (RTF), resulting in slow response times and choppy or distorted audio output. The incident was triggered by an architectural change in our streaming orchestration infrastructure, which introduced a performance regression in request data flow at very high cluster-level concurrencies. We mitigated the incident by rolling the clusters back to the previous stable architecture.

    ## Impact

    **Duration:** Approximately 8 hours (February 23, 07:30 UTC - 15:30 UTC)

    **Symptoms:**

    * Significantly elevated p99 Time to First Audio (TTFA), particularly noticeable during peak traffic
    * Degraded Real-Time Factor (RTF) causing choppy, broken, or distorted audio

    **Scope:** Customers in the APAC and US regions experienced degraded service quality. While traffic levels remained steady, performance regressions were most pronounced during regional peak hours.

    ## Timeline (UTC)

    All times are in UTC (Coordinated Universal Time).

    1. **07:10 UTC** - First customer reports of distorted audio in the APAC region
    2. **08:50 UTC** - Infrastructure scaled up in APAC, providing temporary improvement
    3. **11:42 UTC** - Continued reports of high inter-chunk latencies from APAC users
    4. **12:15 UTC** - Additional scaling performed, with temporary metric improvements
    5. **13:00 UTC** - Root cause triaged to the infrastructure change; mitigation developed
    6. **14:20 UTC** - Proactive rollback initiated in the US region ahead of peak traffic
    7. **15:30 UTC** - Migration to the stable configuration completed; service fully restored

    ## Root Cause Analysis

    The incident was caused by a recent infrastructure change to our message streaming layer. We use a PubSub routing layer based on NATS to manage request data flows. As part of improving the durability and reliability of task assignment, we migrated from a lightweight message publishing protocol to a more robust JetStream-based protocol. While the new protocol provides important guarantees for task assignment, it introduced significant performance overhead on the request processing hot path at high concurrency. Specifically, the new protocol adds an acknowledgment flow for each message, which substantially increased P99 message publish latency and lowered the total message throughput of the PubSub layer. This performance regression directly impacted our ability to stream audio chunks efficiently, causing the elevated TTFA and degraded RTF that customers experienced. The rollback involved migrating workloads back to our stable bare-metal worker configuration, which restored normal performance levels.
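    The cost difference between the two publishing modes can be sketched roughly as follows using the nats.go client; this is an illustration only, not our production code, and the stream and subject names (`TTS`, `tts.chunks`) are assumptions. A core NATS publish is fire-and-forget, while a JetStream publish blocks until the broker acknowledges (and persists) the message, adding a round trip to every message on the hot path.

    ```go
    package main

    import (
    	"fmt"
    	"log"
    	"time"

    	"github.com/nats-io/nats.go"
    )

    func main() {
    	// Connect to a local NATS server (assumed setup, not our production topology).
    	nc, err := nats.Connect(nats.DefaultURL)
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer nc.Drain()

    	chunk := []byte("audio-chunk-bytes")

    	// Core NATS: fire-and-forget publish. No broker acknowledgment, so the
    	// hot path only pays the cost of writing the message to the socket.
    	start := time.Now()
    	if err := nc.Publish("tts.chunks", chunk); err != nil {
    		log.Fatal(err)
    	}
    	fmt.Println("core publish:", time.Since(start))

    	// JetStream: create a stream covering the subject, then publish.
    	// Each publish waits for an acknowledgment from the broker, adding a
    	// round trip (plus persistence work) per message.
    	js, err := nc.JetStream()
    	if err != nil {
    		log.Fatal(err)
    	}
    	if _, err := js.AddStream(&nats.StreamConfig{
    		Name:     "TTS",
    		Subjects: []string{"tts.chunks"},
    	}); err != nil {
    		log.Fatal(err)
    	}
    	start = time.Now()
    	if _, err := js.Publish("tts.chunks", chunk); err != nil {
    		log.Fatal(err)
    	}
    	fmt.Println("jetstream publish (acked):", time.Since(start))
    }
    ```

    At low concurrency the extra acknowledgment is barely visible, but at high cluster-level concurrency the per-message round trips compound into the publish-latency and throughput regression described above.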
    ## Learnings and Next Steps

    This incident highlighted several areas for improvement in our infrastructure and operational processes:

    **Enhanced Monitoring:** We actively measure server-side TTFA, but not server-side RTF. RTF measurement is implemented at the modeling layer, which was not affected by this change. Since the TTFA perturbation was much smaller than the RTF perturbation, we were slower to detect this degradation.

    **Improvements to Our Benchmarking for Rollouts:** We have robust benchmarking infrastructure for rollouts, but its scale is intentionally smaller than our production traffic. For architecture shifts, we will be establishing a way to benchmark our infrastructure against simulated models so that we can test at full production scale.

    **Streamlined Rollback Procedures:** Rollbacks that preserve production traffic are still a semi-manual process, which meant our mitigation took roughly an hour to complete once we started it. We are working on improving the automation that makes global rollbacks faster.

    We sincerely apologize for the impact this incident had on your applications and services. We are committed to learning from this experience and continuously improving the reliability and performance of our TTS platform. If you have any questions or concerns about this incident, please don't hesitate to reach out to our support team.
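    For readers less familiar with these streaming metrics, the sketch below shows roughly what TTFA and RTF capture on a chunked TTS stream. It assumes RTF is defined as wall-clock synthesis time divided by the duration of audio produced (so values above 1.0 mean generation cannot keep up with playback); the chunk channel and simulated timings are hypothetical placeholders, not our actual instrumentation.

    ```go
    package main

    import (
    	"fmt"
    	"time"
    )

    // audioChunk is a hypothetical streamed TTS chunk: the playback duration
    // of the audio it carries is all we need for these metrics.
    type audioChunk struct {
    	duration time.Duration
    }

    // streamMetrics reports time-to-first-audio (TTFA) and a real-time factor
    // (RTF), taken here as total wall-clock time divided by total audio
    // duration produced. RTF > 1.0 means audio is generated slower than it
    // plays back, which surfaces to users as choppy or stalling output.
    func streamMetrics(chunks <-chan audioChunk) (ttfa time.Duration, rtf float64) {
    	start := time.Now()
    	first := true
    	var produced time.Duration

    	for c := range chunks {
    		if first {
    			ttfa = time.Since(start) // latency until the first audible chunk
    			first = false
    		}
    		produced += c.duration
    	}
    	if produced > 0 {
    		rtf = float64(time.Since(start)) / float64(produced)
    	}
    	return ttfa, rtf
    }

    func main() {
    	// Simulated degraded stream: 40 ms of audio arriving every 60 ms,
    	// i.e. generation that cannot keep up with real time (RTF = 1.5).
    	chunks := make(chan audioChunk)
    	go func() {
    		defer close(chunks)
    		for i := 0; i < 5; i++ {
    			time.Sleep(60 * time.Millisecond)
    			chunks <- audioChunk{duration: 40 * time.Millisecond}
    		}
    	}()

    	ttfa, rtf := streamMetrics(chunks)
    	fmt.Printf("TTFA=%v RTF=%.2f\n", ttfa, rtf)
    }
    ```

    In this incident, TTFA moved only modestly while RTF degraded sharply, which is why adding server-side RTF measurement is one of the monitoring follow-ups above.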

Looking to track Cartesia downtime and outages?

Pingoru polls Cartesia's status page every 5 minutes and alerts you the moment it reports an issue — before your customers do.

  • Real-time alerts when Cartesia reports an incident
  • Email, Slack, Discord, Microsoft Teams, and webhook notifications
  • Track Cartesia alongside 5,000+ providers in one dashboard
  • Component-level filtering
  • Notification groups + maintenance calendar
Start monitoring Cartesia for free

5 free monitors · No credit card required