Memsource incident

Degraded Performance of the Workflow Engine Component in Phrase Orchestrator (EU) between June 27, 2025 02:00 AM CET and June 28, 2025 06:30 AM CET

Minor Resolved

Memsource experienced a minor incident on June 27, 2025. The incident has been resolved; the full update timeline is below.

Started
Jun 27, 2025, 12:00 AM UTC
Resolved
Jun 27, 2025, 12:00 AM UTC
Duration
Detected by Pingoru
Jun 27, 2025, 12:00 AM UTC

Update timeline

  1. resolved Jul 03, 2025, 03:54 PM UTC

    The issue has been resolved.

  2. postmortem Jul 14, 2025, 05:03 PM UTC

    # **Root Cause Analysis**

    July 2, 2025

    ### **Introduction**

    We would like to share more details about the events that occurred between 2:00 AM CET on June 27 and 6:40 AM CET on June 28, 2025, which led to missed execution triggers and to workflows starting later than usual.

    Missed execution triggers could be observed during two windows:

    * June 27, 03:37 PM CET to 03:55 PM CET
    * June 27, 11:30 PM CET to June 28, 12:45 AM CET

    Below is a timeline of events and the steps our engineering team is taking to prevent similar issues in the future.

    ### **Timeline**

    * **June 27, 2:00 AM CET:** Monitoring showed high traffic spikes, which the system still handled gracefully.
    * **June 27, 03:37 PM CET:** The traffic spikes increased and the engineering team received alerts that a system component had become unstable. Phrase engineers identified high memory consumption as the cause of the instability. As a first mitigation, the Orchestrator Workflow Engine component was restarted.
    * **June 27, 03:52 PM CET:** The component was up again and continued to process scheduled workflows.
    * **June 27, 11:30 PM CET:** The Orchestrator Workflow Engine component became overloaded again. New workflows were no longer being triggered.
    * **June 28, 12:04 AM CET:** Phrase engineers took steps to halt a workload that was suspected of consuming large amounts of memory.
    * **June 28, 12:45 AM CET:** Phrase engineers allocated additional memory resources, and the system resumed triggering and processing scheduled workflows.
    * **June 28, 6:40 AM CET:** The backlog of workflows had been completely processed.

    ### **Root Cause**

    The incident was caused by sustained memory pressure on the Orchestrator Workflow Engine component, triggered by a spike in incoming events from a single workflow. This eventually prevented the system from processing new workflows.

    ### **Actions to Prevent Recurrence**

    A thorough review of the affected system's resource configuration is being conducted. As part of this effort, the following improvements are being implemented:

    * **Limit radius of impact of single workflows:** We will implement mechanisms to isolate resources to reduce the likelihood of cross-workflow impact.
    * **Proactive detection:** We will introduce additional alerts to improve early detection, allowing us to take action before customers are impacted.
    * **Improved scalability:** We are adding scaling mechanisms to key components of the Workflow Engine to handle sudden increases in workload. We are also adjusting memory limits and requests for the impacted component to ensure greater stability under load.
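As an illustration of the first prevention action above (limiting the impact radius of a single workflow), the following is a minimal sketch of a per-workflow admission cap. It is not Phrase's implementation: the postmortem does not describe the actual mechanism, and the class name `PerWorkflowLimiter`, the `max_in_flight` parameter, and the workflow IDs are all hypothetical names chosen for this example.

```python
from collections import defaultdict


class PerWorkflowLimiter:
    """Hypothetical sketch: caps how many events one workflow may have in flight."""

    def __init__(self, max_in_flight: int) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self.max_in_flight = max_in_flight
        # workflow_id -> number of events currently being processed
        self._in_flight = defaultdict(int)

    def try_admit(self, workflow_id: str) -> bool:
        """Admit one event for workflow_id, or return False if it is at its cap."""
        if self._in_flight[workflow_id] >= self.max_in_flight:
            return False
        self._in_flight[workflow_id] += 1
        return True

    def release(self, workflow_id: str) -> None:
        """Mark one event of workflow_id as finished, freeing a slot."""
        if self._in_flight[workflow_id] > 0:
            self._in_flight[workflow_id] -= 1


limiter = PerWorkflowLimiter(max_in_flight=2)

# A noisy workflow sends five events at once; only the first two are admitted.
noisy = [limiter.try_admit("noisy-workflow") for _ in range(5)]

# A different workflow still gets a slot despite the noisy one being at its cap.
quiet = limiter.try_admit("quiet-workflow")

print(noisy, quiet)
```

Because each workflow has its own counter, a burst from one workflow cannot use up the capacity available to the others. This sketch is single-threaded; a real engine would need locking or an atomic store, plus a policy for rejected events (queueing, delaying, or dropping).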