Phrase incident

Performance Disruption of Phrase Strings (EU) components between September 25, 2024 11:55 AM CEST and September 25, 2024 01:30 PM CEST

Phrase experienced a critical incident on September 25, 2024 affecting Translation center, Repo sync, and other components, lasting 4h 28m. The incident has been resolved; the full update timeline is below.

Started: Sep 25, 2024, 09:59 AM UTC
Resolved: Sep 25, 2024, 02:27 PM UTC
Duration: 4h 28m
Detected by Pingoru: Sep 25, 2024, 09:59 AM UTC

Affected components

Translation center, Repo sync, Email delivery, Ordering, In-context editor, API

Update timeline

investigating Sep 25, 2024, 10:10 AM UTC

Clients are receiving 503 errors when trying to open a page or run any action. The OTA feature is not affected. Our engineers are currently investigating the root cause. We apologize for any inconvenience caused.
identified Sep 25, 2024, 10:35 AM UTC

The issue has been identified as related to database connectivity, and the teams are actively working to resolve it.
identified Sep 25, 2024, 11:08 AM UTC

The Phrase Strings application is operational; however, not all background actions are currently processing. We are actively working on a resolution.
identified Sep 25, 2024, 11:40 AM UTC

We are continuing to investigate this issue.
identified Sep 25, 2024, 12:36 PM UTC

All Phrase Strings systems are functioning correctly. We are currently investigating the exact cause of the issue.
monitoring Sep 25, 2024, 02:08 PM UTC

A fix has been implemented and we are monitoring the results.
resolved Sep 25, 2024, 02:27 PM UTC

The incident has been fully resolved. We apologize for any inconvenience caused.
postmortem Oct 25, 2024, 06:39 AM UTC

### **Introduction**

We would like to share more details about the events that occurred with Phrase between September 25, 2024 11:55 AM CEST and September 25, 2024 01:30 PM CEST, which led to a performance disruption of the Phrase Strings (EU) component, and what Phrase engineers are doing to prevent these issues from recurring.

### **Timeline**

- 25/9/2024, 10:00 AM - 11:55 AM CEST: Observed an increase in MySQL active sessions.
- 25/9/2024, 11:55 AM CEST: An error in the database storage engine of a third-party hosting and data provider triggered an unsuccessful failover of the cluster.
- 25/9/2024, 11:55 AM CEST: Application pods began losing connections and became stuck in restarting mode, which made the application inaccessible.
- 25/9/2024, 11:55 AM - 12:47 PM CEST: The database cluster was unavailable, with instances stuck in restarting mode.
- 25/9/2024, 12:47 PM CEST: After numerous restart attempts, the database instance restarted successfully, presumably once all previous operations had timed out.
- 25/9/2024, 12:50 PM CEST: The application became fully available, although background jobs remained paused in order to reduce load.
- 25/9/2024, 01:30 PM CEST: All background jobs were reactivated.

### **Root Cause**

The database restart was triggered by an issue in the database storage engine of the third-party hosting and data storage provider.

### **Actions to Prevent Recurrence**

- We disabled the hosting and data storage provider option that was causing the instance to reboot.
- We optimized the query that was leading to high loads on the database instance.
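The postmortem reports times in CEST, while the status updates above use UTC. The following is a minimal sketch for lining the two up and checking the reported duration. It assumes CEST is a fixed UTC+2 offset, which holds for September 2024. The helper `cest_to_utc` and the variable names are illustrative only and are not part of any Phrase tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: CEST is modelled as a fixed UTC+2 offset (valid for September 2024).
CEST = timezone(timedelta(hours=2), "CEST")


def cest_to_utc(year, month, day, hour, minute):
    """Convert a CEST wall-clock time to a timezone-aware UTC datetime."""
    local = datetime(year, month, day, hour, minute, tzinfo=CEST)
    return local.astimezone(timezone.utc)


# Status-page timestamps (UTC), copied from the incident header.
started = datetime(2024, 9, 25, 9, 59, tzinfo=timezone.utc)
resolved = datetime(2024, 9, 25, 14, 27, tzinfo=timezone.utc)
duration = resolved - started

# Postmortem milestones (CEST) expressed in UTC.
failover_utc = cest_to_utc(2024, 9, 25, 11, 55)
app_available_utc = cest_to_utc(2024, 9, 25, 12, 50)
jobs_reactivated_utc = cest_to_utc(2024, 9, 25, 13, 30)

if __name__ == "__main__":
    print("Reported duration:", duration)
    print("Failover (UTC):", failover_utc.strftime("%H:%M"))
    print("Application available (UTC):", app_available_utc.strftime("%H:%M"))
    print("Background jobs reactivated (UTC):", jobs_reactivated_utc.strftime("%H:%M"))
```

Converted this way, the failover at 11:55 AM CEST falls at 09:55 UTC, a few minutes before the 09:59 AM UTC start recorded on the status page.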