Phrase incident

Degraded Performance of Phrase Orchestrator (EU) Workflow Builder component between September 9, 2024 6:10 AM CEST and September 9, 2024 7:09 AM CEST

Phrase experienced a minor incident on September 9, 2024, lasting about 59 minutes (6:10 AM to 7:09 AM CEST). The incident has been resolved; the full update timeline is below.

Started: Sep 09, 2024, 04:10 AM UTC
Resolved: Sep 09, 2024, 05:09 AM UTC
Duration: 59 minutes
Detected by Pingoru: Sep 09, 2024, 04:10 AM UTC

Update timeline

resolved Sep 17, 2024, 01:51 PM UTC

The Phrase Orchestrator (EU) Workflow Builder component experienced degraded performance between September 9, 2024 6:10 AM CEST and September 9, 2024 7:09 AM CEST.

postmortem Sep 17, 2024, 01:52 PM UTC

### **Introduction**

We would like to share more details about the events that occurred with Phrase between 06:10 AM CEST and 07:09 AM CEST on September 9, 2024, which led to a gradual outage of the Orchestrator Workflow Builder component, and what Phrase engineers are doing to prevent these issues from recurring.

### **Timeline**

September 9, 6:10 AM CEST: The Orchestrator team was alerted by our monitoring system that database connections had started timing out, causing page loads to fail. While verifying the issue, we found that the requests that did not fail were slow to respond.

September 9, 7:09 AM CEST: The issue was successfully mitigated.

### **Root Cause**

Orchestrator relies heavily on asynchronous background processing. While mitigating a bug a few months ago, we introduced another bug that could cause canceled jobs to reappear in the queue. This caused an endless loop of jobs that starved both compute power and available database connections. Resource usage had been spiking over the weekend, starting Saturday morning, until genuine usage was starved on Monday morning.

### **Actions to Prevent Recurrence**

The issue was mitigated by manually removing the three looping jobs from the queue. We are currently fixing the underlying issue to prevent the looping from recurring. We are also looking at ways to improve our monitoring so that we can spot issues like these earlier, in cases where services do not become completely unavailable but slowly degrade in response times.
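
As an illustration of the kind of early-warning check described above, the following is a minimal sketch of how looping jobs might be flagged before they starve other work. It is not Phrase's implementation: the `EnqueueEvent` record, the `find_looping_jobs` function, and the `max_enqueues` threshold are all hypothetical names chosen for this example, and it assumes the queue can report each time a job ID is enqueued.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class EnqueueEvent:
    """One observation of a job entering the queue (hypothetical shape)."""
    job_id: str
    timestamp: float  # seconds since epoch


def find_looping_jobs(
    events: Iterable[EnqueueEvent],
    window_start: float,
    window_end: float,
    max_enqueues: int = 5,
) -> dict[str, int]:
    """Return job IDs enqueued more than `max_enqueues` times in the window.

    The window is half-open: window_start <= timestamp < window_end.
    The threshold is an assumption; a real system would tune it to the
    legitimate retry behaviour of its jobs.
    """
    counts = Counter(
        event.job_id
        for event in events
        if window_start <= event.timestamp < window_end
    )
    return {job_id: n for job_id, n in counts.items() if n > max_enqueues}
```

A check like this could run periodically over recent queue activity and raise an alert when it returns a non-empty result, which would surface a small number of endlessly re-queued jobs even while overall service availability still looks healthy.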