Phrase incident
Performance Disruption of Identity management - IDM (EU) and Phrase TMS (EU) between February 4, 2025 11:10 AM CET and February 4, 2025 11:50 AM CET
Phrase experienced a critical incident on February 4, 2025 affecting Analytics and API and 1 more component, lasting 48m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Feb 04, 2025, 10:27 AM UTC
Our engineers are currently investigating the root cause. We apologize for any inconvenience caused.
- investigating Feb 04, 2025, 10:51 AM UTC
We are actively working on resolving the issue.
- monitoring Feb 04, 2025, 11:07 AM UTC
Users should be able to log in. We are continuing to monitor the situation.
- resolved Feb 04, 2025, 11:16 AM UTC
The performance disruption should be resolved now and users should be able to access the platform. All systems are now operating normally. We apologize for the inconvenience caused by this.
- postmortem Feb 20, 2025, 08:26 AM UTC
### **Introduction**

We would like to share more details about the events that occurred with Phrase between 11:10 AM CET and 11:50 AM CET on February 4, 2025, which led to a performance disruption of Identity management - IDM (EU) and Phrase TMS (EU), and what Phrase engineers are doing to prevent these issues from recurring.

### **Timeline**

Feb 4, 2025 @ 10:38:41 - deployment of IDM services v25.3.1 to us-east-1 started

Feb 4, 2025 @ 10:50:30 - IDM services successfully deployed and smoke tested at us-east-1

Feb 4, 2025 @ 11:01:30.189 - deployment of IDM services v25.3.1 to eu-west-1 started

Feb 4, 2025 @ 11:08:41.625 - IDM BE service started to fail API requests due to an exhausted DB connection pool. This caused pod restarts and general unavailability of the IDM service. As a consequence, the TMS service started to fail.

Feb 4, 2025 @ 11:32:43.972 - rollback to IDM services v25.2.4 (US services were rolled back as well, although no issue was detected there)

Feb 4, 2025 @ 11:51:13.680 - the last appearance of the error in the log (after redeployment recovery)

Feb 4, 2025 @ 11:51:14 - functionality of the IDM and TMS services is fully restored

### **Root Cause**

A new code change introduced an inconsistency in cached data. During the rolling deployment, the new version of the application was reading and storing a new object from/to a cache, while the still-running old version was reading and storing a different object from/to the same cache. In short, the old and new versions kept loading and replacing the same cached value again and again. This repeated reloading, combined with high load during the deployment, exhausted the DB connection pool and made IDM unavailable.

### **Actions to Prevent Recurrence**

The growth team will implement a better caching mechanism that ensures each application version uses its own cache, to prevent such issues in the future.
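
As an illustration only, the idea of giving each application version its own cache can be sketched by prefixing every cache key with the version. This is a minimal sketch under stated assumptions, not the IDM implementation: the `VersionedCache` class, the key format, the `user:42` key, and the in-memory dict standing in for a shared cache are all hypothetical, as the postmortem does not describe the actual cache technology or data model.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class VersionedCache:
    """Hypothetical wrapper that namespaces keys by application version."""

    store: Dict[str, Any]  # stand-in for a shared cache (assumption)
    app_version: str       # version of the running application
    loads: int = 0         # how many times the loader (e.g. the DB) was called

    def _key(self, key: str) -> str:
        # Old and new versions never touch the same cache entry.
        return f"v{self.app_version}:{key}"

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        namespaced = self._key(key)
        if namespaced not in self.store:
            self.loads += 1
            self.store[namespaced] = loader()
        return self.store[namespaced]


# Simulate a rolling deployment: both versions share one cache backend.
shared: Dict[str, Any] = {}
old = VersionedCache(shared, "25.2.4")
new = VersionedCache(shared, "25.3.1")

for _ in range(3):
    old.get_or_load("user:42", lambda: {"name": "Ada"})
    new.get_or_load("user:42", lambda: {"name": "Ada", "roles": ["admin"]})

print(old.loads, new.loads)  # 1 1
```

With namespaced keys, each version loads its own object shape once and then serves it from the cache, instead of the two versions overwriting each other's entry and going back to the database on every request. The trade-off is that cache entries are duplicated for the duration of the rollout and the old version's entries need to expire or be cleaned up afterwards.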