Memsource incident

Degraded Performance of Phrase Orchestrator (EU) Workflow Engine component between August 12, 2025 05:45 PM CEST and August 13, 2025 10:40 AM CEST

Major Resolved

Memsource experienced a major incident affecting the Legacy Workflow Engine, detected on August 13, 2025 and lasting 22h 49m. The incident has been resolved; the full update timeline is below.

Started
Aug 13, 2025, 10:01 AM UTC
Resolved
Aug 14, 2025, 08:50 AM UTC
Duration
22h 49m
Detected by Pingoru
Aug 13, 2025, 10:01 AM UTC

Affected components

Legacy Workflow Engine

Update timeline

  1. investigating Aug 13, 2025, 08:16 AM UTC

    On August 12, 2025 5:45 PM CEST we began experiencing delays in the execution of Orchestrator workflows hosted in the EU Data Center. Our engineers are currently investigating the root cause. We apologize for any inconvenience this may have caused.

  2. monitoring Aug 13, 2025, 10:01 AM UTC

    The engineers have resolved the issue causing Orchestrator workflows to remain “stuck”. Previously untriggered workflows are now beginning to reprocess and should execute as expected. Please note that while workflows are no longer stuck, the processing queue will take some time to work through the backlog. As a result, your workflow may not run immediately and could take several hours to complete. Thank you very much for your patience.

  3. resolved Aug 14, 2025, 08:50 AM UTC

    This incident has been resolved. Previously untriggered Orchestrator workflows in the queue have been processed and executed as expected.

  4. postmortem Aug 18, 2025, 11:02 AM UTC

    # Root Cause Analysis

    August 13, 2025

    ### Introduction

    We would like to share more details about the events that occurred with Phrase between August 12, 2025 05:45 PM CEST and August 13, 2025 10:40 AM CEST, which led to degraded performance of the workflow engine component of Phrase Orchestrator (EU DC), and what Phrase engineers are doing to prevent these issues from recurring.

    ### Timeline

    **August 12, 2025 05:45 PM CEST**: Executed workflow throughput started to decrease, so workflows started more slowly than usual. Workflow scheduling for execution was unaffected.

    **August 12, 2025 07:53 PM CEST**: First external report of the issue received.

    **August 12, 2025 08:31 PM CEST**: Workflow processing significantly impacted.

    **August 13, 2025 10:40 AM CEST**: Issue identified and fix deployed. Pending workflows resumed execution, working off the queue.

    **August 13, 2025 07:38 PM CEST**: All delayed workflows completed. Normal operations restored.

    ### Root Cause

    A change was implemented in the Orchestrator's message handling logic. It introduced a mechanism requiring the workflow engine to explicitly acknowledge receipt of each message before new messages were submitted for execution. This mechanism was introduced to prevent the workflow engine from being overloaded during bursts of rapid workflow trigger executions.

    Due to a bug in this new acknowledgment logic, some acknowledgments were not properly registered under production traffic. As a result, the system incorrectly assumed that the engine was at full capacity and gradually reduced the number of messages sent, even though the engine had spare capacity. The system did not automatically notify the team about the issue, since the engine was not overloaded at the infrastructure level - the condition that would normally trigger an alert and page the on-call engineer.

    Importantly:

    * **No data or messages were lost.**
    * Once the rollback was completed, all queued workflows were processed successfully, though with significant delays.

    ### Actions to prevent recurrence and improve time to resolution

    1. Improve detection and alerting for slow message processing, even when the engine is still responsive: currently, the system only alerts if the workflow engine infrastructure is actually overloaded.
    2. Investigate a more resilient acknowledgment system. The acknowledgment mechanism was implemented to prevent workflow engine overload in race-condition scenarios; a new system must ensure that acknowledgments are properly registered in all cases.
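    The failure mode described in the root cause can be sketched in a few lines. This is a hypothetical illustration, not Phrase's actual code: the `Dispatcher` class and its `MAX_IN_FLIGHT` window are assumptions standing in for the real message-handling logic. It shows how each unregistered acknowledgment permanently consumes a slot in the in-flight window, so the dispatcher throttles sends even while the engine sits idle:

    ```python
    # Hypothetical sketch of ack-based flow control. An in-flight window caps
    # how many unacknowledged messages the dispatcher will submit at once.

    class Dispatcher:
        MAX_IN_FLIGHT = 4  # assumed engine capacity, illustrative only

        def __init__(self):
            self.in_flight = 0  # messages sent but not yet acknowledged

        def try_send(self, message):
            # Submit only while the engine appears to have spare capacity.
            if self.in_flight >= self.MAX_IN_FLIGHT:
                return False  # dispatcher believes the engine is saturated
            self.in_flight += 1
            return True

        def on_ack(self, registered=True):
            # Bug class from the incident: if an ack is not registered,
            # in_flight never decreases, and throughput silently collapses
            # even though the engine itself is idle. No infrastructure-level
            # overload occurs, so no alert fires.
            if registered:
                self.in_flight -= 1


    d = Dispatcher()
    for i in range(4):
        assert d.try_send(f"wf-{i}")          # window fills normally

    d.on_ack(registered=True)                  # healthy ack frees a slot
    assert d.try_send("wf-4")

    for _ in range(3):
        d.on_ack(registered=False)             # acks arrive but are dropped
    assert d.try_send("wf-5") is False         # sends stall despite spare capacity
    ```

    Note that no message is lost in this sketch either; the window only governs submission pace, which matches the incident's outcome of delays without data loss. It also illustrates action item 1: a useful alert here would watch dispatch throughput or queue age, not just infrastructure load.
    
    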