Memsource incident
Performance Disruption of All Phrase TMS (EU) Components between August 06, 2025 07:51 PM CEST and August 06, 2025 08:28 PM CEST
Memsource experienced a critical incident on August 6, 2025, affecting Analytics, API, and one other component, and lasting 22 minutes. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Aug 06, 2025, 06:20 PM UTC
We are investigating the issue.
- monitoring Aug 06, 2025, 06:33 PM UTC
The incident has been resolved; we are monitoring the system.
- resolved Aug 06, 2025, 06:42 PM UTC
The incident has been resolved.
- postmortem Aug 31, 2025, 01:43 PM UTC
# **Root Cause Analysis**

August 6, 2025

### **Introduction**

We would like to share more details about the events that occurred with Phrase between 19:45 and 20:40 CEST on August 6, 2025, which led to a performance disruption of all our TMS services in the EU environment, and what Phrase engineers are doing to prevent these issues from recurring.

### **Timeline**

- Aug 6, 2025 @ 19:45 – The Platform and Application teams were alerted to database issues.
- Aug 6, 2025 @ 20:00 – Engineers identified that a majority of database replicas were unresponsive; a performance disruption of TMS services was confirmed.
- Aug 6, 2025 @ 20:00 – The first replica node was restarted.
- Aug 6, 2025 @ 20:05 – The restarted node appeared online, but a significant imbalance in resource usage was observed on the primary node because too few healthy nodes remained.
- Aug 6, 2025 @ 20:10 – The second replica node was restarted.
- Aug 6, 2025 @ 20:15 – Both replica nodes had restarted but were unable to connect to the primary node.
- Aug 6, 2025 @ 20:15 – Customer complaints were reported, prompting escalation to the next level of engineering support.
- Aug 6, 2025 @ 20:25 – Investigation confirmed a higher than expected number of connections to the primary node.
- Aug 6, 2025 @ 20:35 – The maximum number of allowed connections on the primary node was increased, allowing both replicas to reconnect.
- Aug 6, 2025 @ 20:40 – The replica nodes completed resynchronization, restoring normal service.

### **Root Cause**

A recent routine upgrade of the database replicas introduced an unexpected change in how system memory was allocated under high connection loads. While this behavior did not appear in test environments, in production it caused the replica nodes to run out of memory. As a result, they became unresponsive and failed to synchronize with the primary database, which led to their removal from load balancing and increased the load on the primary node.

### **Actions to Prevent Recurrence**

- Upscaled the database cluster – As an immediate action, we significantly increased the cluster's capacity so it can handle higher demand without running into memory limits.
- Optimized memory usage – We reduced how much memory the database requires for normal operations, making it more efficient and less prone to memory starvation under high load.
- Reviewed how memory is allocated – We fine-tuned memory allocation for our database cluster, minimizing fragmentation of the address space and improving memory allocation performance.
- Enhanced monitoring – We added extra monitoring metrics that alert us earlier if any database memory or connection issues start to appear (a sketch of this kind of check follows below).
- Improved resilience of the database layer – We are updating the configuration and limits of our database layer to be more resilient in similar situations and lower any possible impact.
- Improved testing processes – We are working with our application teams to include the database in more realistic load testing in lower (non-production) environments (see the load-test sketch below). This will help us catch potential issues before they ever reach customers.
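The investigation found more connections on the primary node than it was configured to allow, which is why the restarted replicas could not reconnect until the limit was raised. The RCA does not name the database engine or monitoring stack, so the following is only a minimal sketch of the kind of connection-headroom check described under "Enhanced monitoring", assuming a PostgreSQL-compatible cluster and the psycopg2 driver; the hostname and threshold are illustrative placeholders.

```python
# Hypothetical connection-headroom check. PostgreSQL/psycopg2 are assumptions;
# the actual Phrase database engine and tooling are not stated in the RCA.
import psycopg2

ALERT_THRESHOLD = 0.80  # warn when 80% of allowed connections are in use


def connection_usage(dsn: str) -> float:
    """Return the fraction of max_connections currently in use on a node."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SHOW max_connections;")
            max_conns = int(cur.fetchone()[0])
            cur.execute("SELECT count(*) FROM pg_stat_activity;")
            in_use = int(cur.fetchone()[0])
    return in_use / max_conns


if __name__ == "__main__":
    # "primary-db.internal" is a placeholder hostname, not a real Phrase endpoint.
    usage = connection_usage("host=primary-db.internal dbname=postgres")
    if usage >= ALERT_THRESHOLD:
        print(f"WARNING: {usage:.0%} of allowed connections in use on the primary")
```

Running a check like this on each node gives an early signal before the connection limit is exhausted, rather than discovering the problem only after replicas fail to rejoin the cluster.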
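The "Improved testing processes" item describes reproducing high connection loads in lower environments. Below is a minimal sketch of such a connection-heavy load test, again assuming PostgreSQL and psycopg2; the DSN, client count, and query are placeholders, and a test like this should only ever target a non-production system.

```python
# Illustrative connection-heavy load test for a non-production database.
# The DSN, client count, and workload are assumptions, not Phrase's actual setup.
import concurrent.futures

import psycopg2

DSN = "host=staging-db.internal dbname=postgres"  # placeholder, non-production only
CLIENTS = 200            # number of concurrent client connections to simulate
QUERIES_PER_CLIENT = 20  # simple queries issued per connection


def simulate_client(_: int) -> None:
    """Open one connection and issue a series of trivial queries."""
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            for _ in range(QUERIES_PER_CLIENT):
                cur.execute("SELECT 1;")  # stand-in for a realistic workload
                cur.fetchone()
    finally:
        conn.close()


if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=CLIENTS) as pool:
        list(pool.map(simulate_client, range(CLIENTS)))
    print("Load test finished; review replica memory and connection metrics.")
```

Exercising the database under this kind of sustained connection pressure in a lower environment is the sort of test that could have surfaced the memory-allocation change before it reached production.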