Phrase incident

Performance Disruption of All Phrase TMS (EU) Components between March 19, 2025 12:21 PM CET and March 19, 2025 3:40 PM CET

Phrase experienced a critical incident on March 19, 2025 affecting Analytics, API, and 7 more components, lasting 3h 27m. The incident has been resolved; the full update timeline is below.

Started: Mar 19, 2025, 11:49 AM UTC
Resolved: Mar 19, 2025, 03:16 PM UTC
Duration: 3h 27m
Detected by Pingoru: Mar 19, 2025, 11:49 AM UTC

Affected components

Analytics, API, CAT web editor, Connectors, File processing, Machine translation, Project management, Term base, Translation memory

Update timeline

investigating Mar 19, 2025, 11:49 AM UTC

We are currently experiencing a performance disruption of all Phrase TMS (EU) components. Our engineering team is investigating the issue.
investigating Mar 19, 2025, 12:10 PM UTC

We are continuing to investigate this issue.
investigating Mar 19, 2025, 12:36 PM UTC

We continue to investigate and work on the issue.
investigating Mar 19, 2025, 01:14 PM UTC

Our engineering team continues to investigate the issue and work on a resolution.
investigating Mar 19, 2025, 01:55 PM UTC

Our team is still investigating the issue and working on a resolution.
identified Mar 19, 2025, 02:07 PM UTC

A possible root cause has been found and our team is currently in the process of mitigating it.
monitoring Mar 19, 2025, 02:34 PM UTC

The problems found have been isolated and the issues appear to be mitigated. The situation should now be stable, and TMS is accessible again. The search service will remain disabled for now.
resolved Mar 19, 2025, 03:16 PM UTC

The incident has now been resolved. All TMS components are functional.
postmortem Apr 14, 2025, 06:35 PM UTC

# **Root Cause Analysis**

March 19th, 2025

### **Introduction**

We would like to share more details about the events that occurred with Phrase between 12:21 PM CET and 3:40 PM CET on March 19th, 2025, which led to an outage of all Phrase TMS (EU) components, and about what Phrase engineers are doing to prevent these issues from recurring.

### **Timeline**

**10:00 AM CET:** A new search feature is enabled for Beta testing.

**12:48 PM CET:** A user triggers a search using a specific filter combination that leads to unexpected system behavior.

**1:11 PM CET:** The issue is identified and escalated internally. A broader test unintentionally increases system load.

**1:12 PM CET:** Message traffic increases sharply, causing high network and CPU usage.

**1:20 PM – 3:40 PM CET:** Engineering efforts are focused on identifying and mitigating performance issues across services.

**3:00 PM CET:** Network congestion is identified between Phrase TMS and a key database cache.

**3:15 PM CET:** Cache systems are restarted, which alleviates the performance bottleneck.

**4:00 PM CET:** The new search feature is disabled. System traffic and performance return to normal.

### **Customer Impact**

Between **12:21 PM and 3:40 PM CET**, all **Phrase TMS (EU)** users experienced degraded service:

* **Service Availability**: Project and job loading, as well as general performance, were significantly slowed or failed intermittently.
* **Search Functionality**: Users engaging with the Beta search feature experienced delays or failures.
* **Processing Delays**: Background tasks and automation workflows relying on real-time data were delayed.

#### Scope of Impact

* The disruption affected **all active users** of Phrase TMS (EU).
* **Successful request rates** dropped to approximately **15%** during the peak of the incident.
* Performance began recovering around **3:45 PM CET**, and the Beta search feature was kept offline until a validated fix could be applied.
### **Root Cause**

The incident was primarily caused by **network congestion** between the Phrase TMS application and its underlying **database cache layer**, which slowed data retrieval and affected overall system performance. Several factors triggered or prolonged the congestion:

* A **bug** in the new search feature that created a high volume of retry traffic
* **Unexpected message growth** which led to increased system load
* **Spillover traffic** that strained shared infrastructure resources
* **Delayed detection** of cache congestion, which postponed full recovery

### **Actions to Prevent Recurrence**

#### Safe Failure Handling

We’ve updated our systems to better handle edge cases in message processing, ensuring that a similar loop cannot cause systemic degradation.

#### Cache Layer Improvements

* Upgraded infrastructure for better network handling and tuning
* Scaled out the cache to handle higher loads more efficiently
* Improved automated recovery mechanisms

#### Cache Traffic Monitoring

New observability tools have been introduced to detect anomalies in application-to-cache communication, helping to surface latency spikes or throughput drops more proactively.

#### Alerting Improvements

We’ve enhanced our alerting strategy to detect queuing behavior and system delays sooner.

#### Cross-System Correlation Awareness

We’ve updated our incident response playbooks to ensure a more holistic, system-wide approach to troubleshooting and escalation.

### **Conclusion**

Finally, we want to apologize. We know how critical our services are to your business. Phrase as a whole will do everything it can to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Phrase engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine which changes will improve our services and processes.
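
### **Appendix: Illustrative Sketch of Bounded Retries**

The root cause names a bug that generated a high volume of retry traffic, and the "Safe Failure Handling" action says a similar loop can no longer cause systemic degradation. The postmortem does not describe how that was implemented, so the following is only a minimal sketch of one common pattern, not Phrase's actual code: cap the number of attempts per message, back off with jitter between attempts, and park messages that keep failing. All names here (`Message`, `BoundedRetryProcessor`, `dead_letters`, the default limits) are hypothetical and chosen for illustration.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Message:
    """A queued unit of work (hypothetical shape, for illustration only)."""
    payload: str
    attempts: int = 0


@dataclass
class BoundedRetryProcessor:
    """Runs a handler with a hard cap on attempts per message (assumed design)."""
    handler: Callable[[str], None]
    max_attempts: int = 3          # assumed limit, not taken from the postmortem
    base_delay_s: float = 0.1      # assumed first backoff ceiling
    max_delay_s: float = 5.0       # assumed upper bound on any single delay
    dead_letters: List[Message] = field(default_factory=list)

    def backoff_delay(self, attempt: int) -> float:
        """Exponential backoff with full jitter, capped at max_delay_s."""
        ceiling = min(self.max_delay_s, self.base_delay_s * 2 ** (attempt - 1))
        return random.uniform(0, ceiling)

    def process(self, message: Message,
                sleep: Callable[[float], None] = time.sleep) -> bool:
        """Return True on success; park the message once attempts run out."""
        while message.attempts < self.max_attempts:
            message.attempts += 1
            try:
                self.handler(message.payload)
                return True
            except Exception:
                # Wait before the next attempt, but not after the final one.
                if message.attempts < self.max_attempts:
                    sleep(self.backoff_delay(message.attempts))
        # Give up instead of re-queueing forever, so one bad message
        # cannot turn into unbounded retry traffic.
        self.dead_letters.append(message)
        return False


def always_fails(payload: str) -> None:
    """Stand-in for a handler that hits an unhandled edge case."""
    raise ValueError(f"cannot handle {payload!r}")


processor = BoundedRetryProcessor(handler=always_fails, max_attempts=3)
delays: List[float] = []           # record delays instead of actually sleeping
ok = processor.process(Message("filter-combo"), sleep=delays.append)
```

In this sketch the failing message is attempted three times, waits twice, and then lands in `dead_letters` for later inspection. The design choice is that a persistent failure costs a fixed, small amount of traffic rather than growing with time.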