Phrase incident
Performance Disruption of Phrase Identity Management (EU) Component on August 26, 2024 between 12:56 PM and 02:00 PM CEST
Phrase experienced a major incident on August 26, 2024 affecting Identity management - IDM (EU), lasting 1h 3m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Aug 26, 2024, 11:24 AM UTC
Clients may experience issues with logging in to Phrase Platform or receive a 429 error when trying to log in. Our engineers are currently investigating the root cause. We apologize for any inconvenience caused.
- monitoring Aug 26, 2024, 12:06 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Aug 26, 2024, 12:27 PM UTC
This incident has been resolved.
- postmortem Aug 28, 2024, 03:53 PM UTC
### **Introduction**

We would like to share more details about the events that occurred with Phrase between August 26, 2024 12:56 PM CEST and August 26, 2024 02:00 PM CEST, which led to a performance disruption of the IDM (EU) component, and what Phrase engineers are doing to prevent these issues from recurring.

### **Timeline**

- 26/8/2024 12:56 PM CEST: Multiple clients report being logged out of the IDM Prod EU UI.
- 26/8/2024 12:58 PM CEST: Health checks indicate an increased number of HTTP 429 errors (Too Many Requests).
- 26/8/2024 01:00 PM CEST: The team responsible for the application starts analyzing the traffic.
- 26/8/2024 01:14 PM CEST: Phrase status page updated.
- 26/8/2024 01:20 PM CEST: The issue is traced to a rate limiter malfunction in a NodeJS component related to the IDM UI.
- 26/8/2024 01:26 PM CEST: The component fix is created.
- 26/8/2024 01:53 PM CEST: The fix is deployed to production.
- 26/8/2024 02:00 PM CEST: Production EU is fully functional.

### **Root Cause**

The issue was caused by an internal NodeJS rate limiter serving the IDM UI. After a redeployment, the rate limiter erroneously throttled all users, regardless of their IP addresses and the endpoints they called. This effectively blocked users from using our application. Because of the configured thresholds, the rate limiter's LRU cache was cleared regularly, which allowed some users to work without disruption while others were unable to use the app at all.

### **Actions to Prevent Recurrence**

The NodeJS rate limiter will be replaced by another solution providing at least the same functionality.
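To illustrate the failure mode described under Root Cause, the following is a minimal, hypothetical sketch and not Phrase's actual code. It assumes a fixed-window limiter whose counters are held in a small LRU cache. The names `RateLimiter`, `perClientKey`, `sharedKey`, and `simulate`, as well as the limits used, are invented for illustration. The postmortem does not say how the limiter lost track of IPs and endpoints, so a single shared key is only one plausible way to reproduce the effect.

```typescript
// Hypothetical sketch, not Phrase's code: a fixed-window rate limiter whose
// counters are stored in an LRU cache (a Map iterates in insertion order).
type KeyFn = (ip: string, endpoint: string) => string;

interface Counter {
  count: number;
  windowStart: number;
}

class RateLimiter {
  private counters = new Map<string, Counter>();
  private limit: number;      // max requests per key per window
  private windowMs: number;   // window length in milliseconds
  private maxEntries: number; // LRU capacity
  private keyFn: KeyFn;       // how requests are grouped into buckets

  constructor(limit: number, windowMs: number, maxEntries: number, keyFn: KeyFn) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.maxEntries = maxEntries;
    this.keyFn = keyFn;
  }

  // Returns true if the request is allowed, false if it should get HTTP 429.
  allow(ip: string, endpoint: string, now: number): boolean {
    const key = this.keyFn(ip, endpoint);
    let entry = this.counters.get(key);
    // Start a fresh window for unknown keys or once the old window expired.
    if (!entry || now - entry.windowStart >= this.windowMs) {
      entry = { count: 0, windowStart: now };
    }
    entry.count += 1;
    // Re-insert so this key becomes the most recently used one.
    this.counters.delete(key);
    this.counters.set(key, entry);
    // Evict the least recently used key when over capacity.
    if (this.counters.size > this.maxEntries) {
      const oldest = this.counters.keys().next().value as string;
      this.counters.delete(oldest);
    }
    return entry.count <= this.limit;
  }
}

// Intended behavior: one bucket per client IP and endpoint.
const perClientKey: KeyFn = (ip, endpoint) => `${ip}|${endpoint}`;
// Assumed misconfiguration: every request lands in the same bucket.
const sharedKey: KeyFn = () => "global";

// Five users each send three login requests within one window (limit: 3).
function simulate(keyFn: KeyFn): number {
  const limiter = new RateLimiter(3, 60_000, 1000, keyFn);
  let allowed = 0;
  for (let user = 1; user <= 5; user++) {
    for (let i = 0; i < 3; i++) {
      if (limiter.allow(`10.0.0.${user}`, "/login", 0)) allowed++;
    }
  }
  return allowed;
}

console.log(simulate(perClientKey)); // 15: every user stays within the limit
console.log(simulate(sharedKey));    // 3: the shared bucket fills up, the rest get 429
```

In this sketch, the shared bucket resets whenever its window expires or its entry is evicted, so whichever requests arrive first afterwards succeed until the limit is reached again. That matches the observation that some users could work while others were blocked.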