Memsource incident
Degraded Performance of Phrase Orchestrator (EU) Next-Gen Workflow Engine between February 05, 05:38 PM CET and February 05, 08:13 PM CET
Update timeline
- investigating Feb 05, 2026, 04:56 PM UTC
Engineering has identified an issue with Orchestrator where the new Workflow Engine is currently not executing workflows. The problem is under investigation.
- investigating Feb 05, 2026, 05:23 PM UTC
We are continuing to investigate this issue.
- identified Feb 05, 2026, 06:59 PM UTC
The issue has been identified and a fix is being implemented.
- monitoring Feb 05, 2026, 07:20 PM UTC
A fix has been implemented and the system is currently stable. We are continuing to monitor the situation.
- monitoring Feb 05, 2026, 07:20 PM UTC
We are continuing to monitor for any further issues.
- resolved Feb 05, 2026, 07:47 PM UTC
This incident has been resolved.
- postmortem Feb 19, 2026, 09:31 AM UTC
## Introduction

We would like to share more details about the workflow disruption that occurred on February 5, 2026, between 15:01 CET and 19:58 CET, which led to workflows being held in an “Executing” state and delays in workflow processing within Phrase Orchestrator. This affected workflows running on the new engine in the EU data center. During this time, newly triggered workflows were not progressing as expected. Below we describe what happened, the root cause, and the steps we are taking to prevent similar incidents in the future.

## Timeline

**Feb 5, 2026**

* **16:40** – A Phrase employee reported that workflows for an organization were stuck in “Executing” status.
* **17:05** – Phrase engineers began investigating the issue and observed that the component that acts as the entry point for all workflows was unstable.
* **17:10** – To reduce the load, engineers decreased the number of parallel workflow tasks.
* **17:41** – Analysis identified degraded communication with the database.
* **19:28** – After applying multiple load-relief and scalability measures, engineers determined that the issue was caused solely by a small number of extraordinarily large workflows.
* **19:58** – After ruling out negative side effects, the workflows responsible for the excessive load were cancelled.
* **20:01** – Workflows resumed normal execution.
* **20:12** – Previously stuck workflow executions for the affected organizations were marked as “Failed” to ensure system consistency.

## Root Cause

The incident was triggered by the execution of a very large workflow with a complex dependency structure. A workflow of this design expands rapidly into tasks when processing large payloads. On February 5, such a workflow was triggered multiple times. Due to the complex dependency structure, this resulted in the creation of more than 140,000 tasks over a short period of time.
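To illustrate the kind of expansion described above, here is a minimal, hypothetical sketch of how a workflow that fans out one task per payload item, with each task spawning further dependent sub-steps, multiplies into a very large task count. All names and numbers are illustrative assumptions, not Phrase internals:

```python
# Hypothetical illustration: nested fan-out in a workflow dependency graph.
# Each level creates sub_steps_per_task new tasks per task at the previous
# level, so the total task count grows geometrically with depth.

def task_count(payload_items: int, sub_steps_per_task: int, levels: int) -> int:
    """Count tasks created when each level fans out per item/sub-step."""
    total = 0
    tasks_at_level = payload_items
    for _ in range(levels):
        total += tasks_at_level
        tasks_at_level *= sub_steps_per_task
    return total

# A single trigger of a large workflow with three levels of fan-out:
per_trigger = task_count(payload_items=2000, sub_steps_per_task=4, levels=3)
print(per_trigger)        # 2000 + 8000 + 32000 = 42000 tasks per run
print(4 * per_trigger)    # 168000 tasks after four triggers
```

With numbers in this range, a handful of triggers is enough to reach the six-figure task counts described in the report.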
Most of these tasks executed requests against a single API, whose rate limit was reached. This caused a large number of tasks to retry repeatedly. At the same time, the workflow engine had to evaluate the state of many dependent tasks within the workflow graph. The system runs complex database queries to determine the state of scheduled workflow jobs and their dependencies; with tens of thousands of jobs and large dependency trees, these queries became increasingly slow.

The combination of:

* a very high number of generated jobs,
* frequent retries caused by API rate limiting, and
* complex dependency evaluation for large workflows

led to long-running database queries. These slow queries exhausted the available database connections and caused repeated crashes and restarts of multiple workflow engine components. One of the affected components was the entry point for all workflows; its instability temporarily prevented other workflows from progressing. Once the tasks belonging to the complex workflows were cancelled, the database load immediately decreased and normal processing resumed.

## Actions to Prevent Recurrence

We are taking the following steps to prevent similar incidents in the future:

* **Improved Alerting:** We are improving our alerting mechanisms so that we are notified more quickly when workflows stop progressing. This also includes improved visibility around system exceptions.
* **Enhanced Monitoring:** We are expanding our monitoring around workflow executions to more quickly identify large workloads that may overload the system.
* **Workflow Architecture Redesign:** We are redesigning how workflows with complex dependency structures are handled. Complex workflow segments will be encapsulated into separate entities, reducing dependency-tree complexity and improving overall processing efficiency.
* **Dedicated Database Connection:** We are separating the workflow engine from the main application at the database connection level.
The engine will use a dedicated connection with appropriate capacity, improving flexibility and ensuring better isolation between components.
* **Improved API Rate Limit Handling:** We are improving how we call APIs and how we react when rate limits are reached.
* **Faster Mitigation:** We are building tooling to more quickly resolve workflow executions that are no longer progressing as expected.

We sincerely apologize for the disruption caused by this incident. We are committed to improving the resilience and predictability of our workflow processing system and appreciate the feedback and patience of our customers.
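As background on the rate-limit handling mentioned above: the report describes tasks retrying repeatedly the moment an API limit was hit. A common mitigation for that pattern is exponential backoff with jitter, sketched below. This is a generic illustration, not Phrase's implementation; `RateLimitedError` and `call_with_backoff` are hypothetical names:

```python
import random
import time


class RateLimitedError(Exception):
    """Raised when the API responds with 429 Too Many Requests."""


def call_with_backoff(request, max_retries=5, base_delay=0.5, max_delay=30.0):
    """Retry a rate-limited call with exponential backoff and full jitter.

    `request` is any callable that raises RateLimitedError when throttled.
    """
    for attempt in range(max_retries + 1):
        try:
            return request()
        except RateLimitedError:
            if attempt == max_retries:
                raise
            # Exponential backoff with full jitter spreads the retries out
            # instead of having every task hammer the API the instant the
            # rate-limit window resets.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```

The jitter matters as much as the backoff here: with 140,000 near-identical tasks, synchronized retries would simply recreate the original load spike at each retry interval.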