Cartesia experienced an incident on July 21, 2025 affecting Text to Speech (US), lasting 12h 25m. The incident has been resolved; the full update timeline is below.
Affected components
- Text to Speech (US)
Update timeline
- investigating Jul 21, 2025, 09:18 AM UTC
Around 1:51 am PT, instability in the EU cluster returned despite a full rollback. We are rolling EU traffic over to NA.
- monitoring Jul 21, 2025, 09:26 AM UTC
EU traffic has successfully rolled over to NA and health checks have recovered. We'll continue to serve EU requests from the NA cluster until we can identify the root cause of the degradation.
- investigating Jul 21, 2025, 03:11 PM UTC
Our NA cluster is experiencing instability. From 5:38-5:43 am PT and 7:50-8:05 am PT we experienced heavy latency spikes. We're rolling out a backup cluster and investigating the root cause.
- monitoring Jul 21, 2025, 04:45 PM UTC
EU traffic has been routed back to the EU cluster.
- monitoring Jul 21, 2025, 05:07 PM UTC
We are currently attempting the same type of rollback on the NA cluster.
- monitoring Jul 21, 2025, 06:53 PM UTC
NA traffic is rolling out to the backup cluster.
- monitoring Jul 21, 2025, 06:59 PM UTC
NA traffic is 100% on the new NA cluster. We will monitor for the next hour before resolving.
- resolved Jul 21, 2025, 09:44 PM UTC
The primary NA cluster has been rolled back and traffic has been rerouted to it. We have confirmed stability over a monitoring period. An RCA will follow shortly.
- postmortem Jul 23, 2025, 10:50 PM UTC
**What Happened?**

In our ongoing efforts to enhance the reliability and uptime of our services, we began a phased deployment of a new, high-availability clustered messaging configuration in our AP, US, and EU regions over the weekend. While initial health checks and sanity tests were successful, we encountered unforeseen issues under high-volume traffic. Starting around 8:00 PM UTC on Sunday, July 20th, our EU region began to experience service degradation. This was followed by similar issues in our US region in the early hours of Monday, July 21st. These issues resulted in intermittent periods of request queuing and, in some cases, failed requests for a portion of our users.

**Timeline**

1. Sunday, July 20th, 8:00 PM UTC: Service degradation began in our EU region.
2. Monday, July 21st, 5:40 AM UTC: Our US region began to experience similar issues, characterized by brief periods of unresponsiveness followed by a surge in traffic.
3. Monday, July 21st, 12:30 PM UTC: Services in all affected regions were stabilized and fully recovered.

**Our Response**

Our engineering team identified the issue and began rerouting traffic to backup clusters to minimize customer impact. The root cause was traced to the new configuration, and we initiated a full rollback. By 12:30 PM UTC on Monday, July 21st, all services had been restored to their normal operational state.

**What We Are Doing to Prevent This in the Future**

We are conducting a thorough analysis to fully understand the sequence of events and the root causes of this incident. Our preliminary findings indicate that while the new clustered messaging service is designed for high availability, its interaction with our specific traffic patterns led to bottlenecks.
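For readers curious what the traffic rerouting described above looks like mechanically, the sketch below shows a generic health-check-driven regional failover loop: probe each active cluster, and shift its traffic weight to a backup when latency exceeds a threshold. This is not Cartesia's routing code; the cluster names, the `probe_latency` stub, and the 500 ms threshold are assumptions made purely for illustration.

```python
# Minimal sketch of health-check-driven failover between regional clusters.
# All names, weights, and thresholds are hypothetical.

CLUSTERS = {
    "eu-primary": {"weight": 100, "healthy": True},
    "na-primary": {"weight": 100, "healthy": False},  # simulate a degraded region
    "na-backup":  {"weight": 0,   "healthy": True},
}

LATENCY_THRESHOLD_MS = 500.0  # assumed SLO; a real value would be tuned per service


def probe_latency(cluster: str) -> float:
    """Stub health probe returning a fake p95 latency so the sketch runs end to end.
    In practice this would call a /health endpoint and read real metrics."""
    return 120.0 if CLUSTERS[cluster]["healthy"] else 900.0


def check_health(cluster: str) -> bool:
    """A cluster is considered healthy if its observed latency is under the SLO."""
    return probe_latency(cluster) < LATENCY_THRESHOLD_MS


def fail_over(from_cluster: str, to_cluster: str) -> None:
    """Shift all traffic weight from a degraded cluster to a backup cluster."""
    CLUSTERS[to_cluster]["weight"] += CLUSTERS[from_cluster]["weight"]
    CLUSTERS[from_cluster]["weight"] = 0
    print(f"rerouted traffic: {from_cluster} -> {to_cluster}")


if __name__ == "__main__":
    # One pass of the monitor loop; a real system would run this continuously.
    for name in list(CLUSTERS):
        if CLUSTERS[name]["weight"] > 0 and not check_health(name):
            fail_over(name, "na-backup")
    print(CLUSTERS)
```

Rolling back, as described in the postmortem, is the reverse operation: once the primary cluster passes health checks again, its weight is restored and the backup is drained.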