Memsource incident
Performance Disruption of Identity management - IDM (EU) and Phrase TMS (EU) between February 13, 2025 10:45 AM CET and February 13, 2025 02:38 PM CET
Memsource experienced a critical incident on February 13, 2025 affecting Analytics and API and 1 more component, lasting 3h 30m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Feb 13, 2025, 10:13 AM UTC
Our engineers are currently investigating the root cause. We apologize for any inconvenience caused.
- identified Feb 13, 2025, 10:53 AM UTC
The issue has been identified and a fix is being implemented.
- identified Feb 13, 2025, 11:51 AM UTC
Our engineers are still working on a fix for this issue.
- investigating Feb 13, 2025, 12:46 PM UTC
Our engineers are still investigating the cause of the issue.
- resolved Feb 13, 2025, 01:44 PM UTC
A fix was implemented and the incident has been resolved.
- postmortem Feb 25, 2025, 12:56 PM UTC
### **Introduction**

We would like to share more details about the events that occurred with Phrase between 10:45 AM CET and 07:36 PM CET on February 13, 2025, which led to a performance disruption of Identity management - IDM (EU) and Phrase TMS (EU), and what Phrase engineers are doing to prevent these issues from recurring.

### **Timeline**

Feb 13, 2025 @ 10:45 am - Internal monitoring reports an increased number of connection failures to IDM in the EU cluster. Errors are also reported by other platform applications and by users. Some requests succeed, others fail.

Feb 13, 2025 @ 10:49 am - The responsible team starts analyzing the issue and attempting countermeasures.

Feb 13, 2025 @ 11:15 am - The issue is traced to the database connection limit: all connections are consumed, which blocks some incoming traffic.

Feb 13, 2025 @ 11:53 am - The cause is identified as heavy load on the primary database from pricing metrics updates and the metrics cleaner job. The team deploys a hotfix to completely eliminate the cleaner job.

Feb 13, 2025 @ 12:35 pm - The hotfix helps partially, but the database becomes blocked again after some time. Kubernetes pods keep restarting because of the heavy load and because liveness checks cannot respond in time. Database connections are consumed again.

Feb 13, 2025 @ 12:42 pm - The platform team joins a call to help trace the issue. The primary database is under heavy load because the IDM application is trying to catch up with pending events and updates that were stuck due to the previous issues.

Feb 13, 2025 @ 01:47 pm - The database instance is restarted and upgraded to a more powerful instance. Kubernetes pods are scaled up to handle the load.

Feb 13, 2025 @ 02:38 pm - The production EU environment is stable again.

Feb 13, 2025 @ 03:55 pm - The team performs the final clean-up tasks by re-enabling all disabled jobs (mainly inter-app events processing). This change exhausts the database connections again.

Feb 13, 2025 @ 04:02 pm - Errors are again reported by both monitoring and users (although some users can still work).

Feb 13, 2025 @ 04:15 pm - The team joins the call again and stops the jobs. The system is oscillating because of the number of incoming requests. The number of pods is increased again, but the situation is not stable.

Feb 13, 2025 @ 05:35 pm - The team disables all inbound traffic to give the system time to recover. In the meantime, the refresh and caching of the product catalog subsystem are adjusted to reduce database load. A hotfix is being prepared.

Feb 13, 2025 @ 06:17 pm - A fix is deployed and all inbound traffic is re-enabled. The system is stable.

Feb 13, 2025 @ 07:36 pm - The incident is closed.

### **Root Cause**

The issue was triggered by multiple simultaneous events: high user request load, a pricing metrics update batch, and product catalog refreshes, all of which contributed to the increased load. After the first fix, the system was stable but running at full capacity, particularly the database instance. When the team enabled all paused jobs, the system became overwhelmed by the queued tasks. A more powerful database instance helped mitigate the original issue and made the system more responsive; however, not all parts of the system were ready for a gradual start-up. Therefore, when the jobs restarted, the number of events and records was not throttled and all database connections were consumed. Disabling inbound traffic gave the system a pause in which to process the pending events before users were let back in.

### **Actions to Prevent Recurrence**

There are several actions that will be taken, both immediate and long-term:

* Immediate:
    * Event processing will be throttled for selected jobs.
    * Immediate kill switches will be implemented for selected jobs.
    * In-app event handling will be optimized by deleting successful events immediately instead of archiving them.
* Long-term:
    * Pricing metrics will be separated into a dedicated service.
    * Distributed locking will be done using a different persistence service.
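
To make the first two immediate actions more concrete, the following is a minimal sketch of a throttled job with a kill switch. It is an illustration only and does not describe how Phrase implements these measures. The names `TokenBucket`, `ThrottledJob`, and `kill_switch`, as well as the rate and capacity values, are hypothetical, and a token bucket is just one common throttling technique chosen for the example.

```python
import threading
import time
from collections import deque


class TokenBucket:
    """Allows roughly `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second (assumed value)
        self.capacity = capacity      # maximum burst size (assumed value)
        self.clock = clock            # injectable clock, useful for testing
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self):
        """Refill based on elapsed time, then take one token if available."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class ThrottledJob:
    """Hypothetical job that drains a queue of events under a rate limit."""

    def __init__(self, name, handler, bucket):
        self.name = name
        self.handler = handler
        self.bucket = bucket
        self.kill_switch = threading.Event()  # set() stops processing
        self.pending = deque()

    def submit(self, event):
        self.pending.append(event)

    def run_once(self):
        """Process events until the queue is empty, the token budget is
        spent, or the kill switch is set. Returns the number processed."""
        processed = 0
        while self.pending and not self.kill_switch.is_set():
            if not self.bucket.try_acquire():
                break  # over budget: leave remaining events queued
            self.handler(self.pending.popleft())
            processed += 1
        return processed
```

In this sketch a scheduler would call `run_once()` periodically. A backlog that built up while the job was paused is then drained at a bounded rate instead of all at once, and an operator can call `job.kill_switch.set()` to stop the job without a deployment; events that were not processed stay in the queue.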