Phrase incident

Performance Disruption of Phrase Strings (EU) components between May 20, 2025 09:52 AM CEST and May 20, 2025 10:14 AM CEST

Critical · Resolved

Phrase experienced a critical incident on May 20, 2025 affecting Translation center, Repo sync, and five more components, lasting 44m. The incident has been resolved; the full update timeline is below.

Started
May 20, 2025, 08:06 AM UTC
Resolved
May 20, 2025, 08:51 AM UTC
Duration
44m
Detected by Pingoru
May 20, 2025, 08:06 AM UTC

Affected components

Translation center, Repo sync, OTA, Email delivery, Ordering, In-context editor, API

Update timeline

  1. investigating May 20, 2025, 08:06 AM UTC

    Our engineers are currently investigating the root cause. We apologize for any inconvenience caused.

  2. monitoring May 20, 2025, 08:18 AM UTC

    The fix has been implemented and we are monitoring the results.

  3. resolved May 20, 2025, 08:51 AM UTC

    The issue has been resolved.

  4. postmortem May 26, 2025, 08:40 AM UTC

### Introduction

    We would like to share more details about the events that occurred with Phrase between 09:52 AM CEST and 10:14 AM CEST on May 20, 2025, which led to a performance disruption of Strings (EU), and what Phrase engineers are doing to prevent these issues from recurring.

    ### Timeline

    - 09:52 AM CEST: The Strings platform became unavailable due to a change in network configuration. Investigation began immediately.
    - 10:14 AM CEST: A temporary fix was applied, restoring full platform availability.
    - 03:43 PM CEST: A full fix addressing the underlying issue was deployed.

    ### Root Cause

    The issue was caused by a network configuration change introduced as part of a routine Kubernetes infrastructure upgrade. After the upgrade, some application components failed health checks, which led to automatic restarts and unavailability, even though the services themselves were actually running. This happened only where enhanced network security rules were applied to individual parts of the system: the way Kubernetes and the underlying network handled traffic in this setup prevented health checks from working as expected. The problem was resolved by adjusting the settings of the underlying infrastructure network stack so that health checks are properly supported in this configuration.

    ### Actions to Prevent Recurrence

    We have updated our infrastructure configuration to always include this network setting going forward. This ensures that future upgrades will not trigger the same issue. We will also plan and roll out future Kubernetes upgrades in a more granular manner, targeting smaller segments of infrastructure first and performing changes outside of peak traffic hours. This will help us reduce the risk of platform-wide impact and detect potential issues earlier in the rollout process.
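    To illustrate the failure mode described above, here is a minimal sketch of how a kubelet health check can collide with network security rules. This is a hypothetical example, not Phrase's actual configuration: it assumes the health checks are standard kubelet HTTP liveness probes and the "enhanced network security rules" are Kubernetes NetworkPolicies; all names, ports, and CIDRs are made up. If the ingress rule below is missing (or the node network handles probe traffic differently after an upgrade), the probe fails, the kubelet restarts the pod, and the service appears unavailable even though the application itself is healthy.

    ```yaml
    # Hypothetical pod with a kubelet HTTP liveness probe.
    apiVersion: v1
    kind: Pod
    metadata:
      name: strings-api            # made-up component name
      labels:
        app: strings-api
    spec:
      containers:
        - name: app
          image: example/strings-api:latest   # placeholder image
          ports:
            - containerPort: 8080
          livenessProbe:           # the kubelet restarts the container if this fails
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 10
            failureThreshold: 3
    ---
    # Restrictive NetworkPolicy that still admits probe traffic.
    # Without an ingress rule covering the node network, some CNI
    # setups will drop kubelet probes along with everything else.
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-node-health-checks
    spec:
      podSelector:
        matchLabels:
          app: strings-api
      policyTypes: [Ingress]
      ingress:
        - from:
            - ipBlock:
                cidr: 10.0.0.0/16  # assumed node network; kubelet probes originate here
          ports:
            - protocol: TCP
              port: 8080
    ```

    Whether node-originated probe traffic is subject to NetworkPolicy at all varies by CNI plugin and configuration, which is consistent with the report's note that the fix was applied at the underlying network stack rather than in the applications.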