Phrase incident

Degraded Performance of all Phrase Strings (EU) Components except OTA between November 12, 2024 9:20 AM CET and November 12, 2024 11:00 AM CET

Phrase experienced a minor incident on November 12, 2024 affecting Translation center, Repo sync, and 4 more components, lasting 3h 26m. The incident has been resolved; the full update timeline is below.

Started: Nov 12, 2024, 08:57 AM UTC
Resolved: Nov 12, 2024, 12:23 PM UTC
Duration: 3h 26m
Detected by Pingoru: Nov 12, 2024, 08:57 AM UTC

Affected components

Translation center, Repo sync, Email delivery, Ordering, In-context editor, API

Update timeline

identified Nov 12, 2024, 08:57 AM UTC

Our engineers have identified the root cause of the degraded performance of all Phrase Strings (EU) components except OTA and are working on a fix.
monitoring Nov 12, 2024, 09:14 AM UTC

Our engineers implemented a fix and are monitoring the results.
monitoring Nov 12, 2024, 10:44 AM UTC

Our engineers are continuing to monitor performance. All components except the Translation center are now operational.
resolved Nov 12, 2024, 12:23 PM UTC

The incident has been resolved.
postmortem Nov 14, 2024, 09:42 AM UTC

# **Root Cause Analysis**

November 12, 2024

### **Introduction**

We would like to share more details about the events that occurred with Phrase between 9:20 AM CET and 11:00 AM CET on November 12, 2024, which led to a gradual outage of all Phrase Strings (EU) components except OTA, and what Phrase engineers are doing to prevent these issues from recurring.

### **Timeline**

* 9:22 AM CET: We received a latency warning from our monitoring tool regarding our background job queues.
* 9:30 AM CET: We identified that a large number of enqueued webhook delivery jobs were causing high memory usage in our Redis instance. Redis eventually became unresponsive, which affected the processing of other background jobs.
* 9:40 AM CET: The root cause was identified as a large number of misconfigured and duplicated webhooks that were triggered by high activity.
* 9:55 AM CET: We began cleaning up the queue by identifying the duplicated webhook delivery jobs.
* 10:10 AM CET: The cleanup was completed and webhook delivery returned to normal.
* 10:15 AM CET: We re-triggered the background jobs for translation statistics and search indexing, which were affected by the outage.
* 11:00 AM CET: Systems stabilized.
* 1:20 PM CET: Processing of the re-triggered background jobs completed and the incident was declared resolved.

### **Root Cause**

The root cause of this incident was identified as a large number of misconfigured and duplicated webhooks triggered by high user activity. This resulted in a high load on backend services, which affected the processing of several background jobs.

### **Actions to Prevent Recurrence**

* Introduce hard limits and uniqueness checks at the project level to prevent duplicate webhook configurations.
* Increase resource allocation for Redis to ensure stability in background job processing.
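
To make the first action more concrete, the following is a minimal sketch of how a per-project hard limit and a uniqueness check on webhook configurations could work. It is not Phrase's implementation: the class `WebhookRegistry`, the error types, the default limit of 20, the URL normalization, and the event names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


class DuplicateWebhookError(Exception):
    """An equivalent webhook already exists in the project."""


class WebhookLimitError(Exception):
    """The project already has the maximum number of webhooks."""


@dataclass
class WebhookRegistry:
    # Hypothetical cap; the postmortem does not state an actual limit.
    max_per_project: int = 20
    # Maps project id -> set of normalized (url, events) keys.
    _by_project: dict = field(default_factory=dict)

    @staticmethod
    def _key(callback_url, events):
        # Simplified normalization (an assumption): ignore surrounding
        # whitespace, a trailing slash, and the order of events.
        return (callback_url.strip().rstrip("/"), frozenset(events))

    def register(self, project_id, callback_url, events):
        existing = self._by_project.setdefault(project_id, set())
        key = self._key(callback_url, events)
        # Uniqueness check: reject a configuration the project already has.
        if key in existing:
            raise DuplicateWebhookError(callback_url)
        # Hard limit: reject once the project reaches the cap.
        if len(existing) >= self.max_per_project:
            raise WebhookLimitError(project_id)
        existing.add(key)

    def count(self, project_id):
        return len(self._by_project.get(project_id, ()))


if __name__ == "__main__":
    registry = WebhookRegistry(max_per_project=2)
    # "event_a" is a placeholder event name, not a real Phrase event.
    registry.register("project-1", "https://example.com/hook", ["event_a"])
    try:
        registry.register("project-1", "https://example.com/hook/", ["event_a"])
    except DuplicateWebhookError:
        print("duplicate rejected")
    print(registry.count("project-1"))
```

In this sketch the check is scoped to a single project, so the same callback URL can still be registered in a different project. A production version would enforce the same rule in the data store (for example with a unique index) so that concurrent requests cannot bypass it.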