Treasure Data incident

[EU Region] Elevated errors/performance degradation related to Personalization API

Treasure Data experienced a major incident on January 22, 2025 affecting CDP Personalization - Lookup API and CDP Personalization - Ingest API, lasting 16h 44m. The incident has been resolved; the full update timeline is below.

Started: Jan 22, 2025, 09:56 AM UTC
Resolved: Jan 23, 2025, 02:41 AM UTC
Duration: 16h 44m
Detected by Pingoru: Jan 22, 2025, 09:56 AM UTC

Affected components

CDP Personalization - Lookup API
CDP Personalization - Ingest API

Update timeline

investigating Jan 22, 2025, 09:56 AM UTC

We are currently observing errors or performance degradation for the personalization API. We are investigating the cause of the issue now.
investigating Jan 22, 2025, 02:52 PM UTC

We have applied various mitigations on our infrastructure side; however, they have not decreased the error rate. We are continuing to investigate the possible causes on our end.
investigating Jan 22, 2025, 06:34 PM UTC

From 09:00 to 17:00 UTC, we observed elevated 500s and high latency on the CDP KVS server. Customers may have observed elevated errors and timeouts during this period when sending requests to the Personalization API. Our team has been investigating this issue and has deployed a workaround to our systems while we work to identify the root cause of the problem. There should be no system impact at this time. Customers who continue to observe delays or elevated error rates should contact our support team, and we'll be happy to assist them further. We will continue to investigate and will provide another update by 11 PM UTC.
investigating Jan 22, 2025, 08:24 PM UTC

Our response team has identified a potential cause for this issue, and we will be deploying a fix shortly. At this time we have not observed any elevated error rates or delays since 16:40 UTC. We will provide an additional update once this fix has been deployed. If you are observing abnormal errors or long delays from our Personalization API, please reach out to our support team. We will continue to monitor for any issues, and will update once our fix is deployed.
investigating Jan 23, 2025, 12:44 AM UTC

We have observed some intermittent errors as we roll out a fix to all of our systems, and users may see delays or errors as the change is applied to our systems. Our response team is working to minimize the impact to customers while we deploy this change, but we expect some slower performance while we gradually deploy the fix over the next 3-4 hours.
monitoring Jan 23, 2025, 02:15 AM UTC

We have fully deployed our fixes to the Personalization API and our monitors show systems operating normally. Our teams will continue to monitor the issue, and we will update this incident if we observe any unusual behavior. If you experience any delays or abnormal errors, please reach out to our support team. Thank you for your patience and understanding during this incident. We will share an incident retrospective once it is available.
resolved Jan 23, 2025, 02:41 AM UTC

Between 09:15 UTC and 16:40 UTC on Wednesday, 22 Jan 2025, some customers experienced elevated error rates and increased latency related to the Profiles API. A fix has been implemented and the issue has been resolved. If you experience any delays or abnormal errors, please reach out to our support team. Thank you for your patience and understanding during this incident. We will share an incident retrospective soon.
postmortem Jan 23, 2025, 02:41 AM UTC

The Profiles API enables browsers to retrieve personalized content based on detailed customer information. Between January 20 and January 22, we experienced elevated error rates and increased latency for a subset of requests to the Profiles API.

We sincerely apologize for the inconvenience caused by this incident. We understand the critical role our API plays in delivering seamless user experiences, and we are committed to preventing such disruptions in the future.

# Timeline

* On January 20, from 7:45 to 11:15 UTC - 3% error rate during the time
* On January 21, from 7:35 to 10:25 UTC - 33% error rate during the time
* On January 22, from 9:15 to 16:40 UTC - 40% error rate during the time

During these periods, API calls to `https://cdp-eu01.in.treasuredata.com/` exhibited elevated error rates and latency. This issue did not impact RT 2.0, the newer version of our real-time system.

# Incident Analysis

This is the current analysis snapshot; updates will be provided as more information becomes available.

We noticed a gradual increase in processing workloads on the Profiles API starting on January 6, driven by the complexity of real-time segmentation. By January 20, this workload exceeded the internal concurrency limit configured in our caching cluster.

Key observations are:

* Symptoms consistently began to appear around 07:30 UTC each day.
* Internal system indicators flagged potential issues approximately two hours prior to the incidents.

The bottleneck was traced to the caching cluster's concurrency capacity, which was insufficient to handle the growing workload.

# Action Taken

Based on the observation, we implemented the mitigation to increase the concurrency capacity in the caching cluster. We will monitor the symptoms closely today and provide additional capacity when necessary.

# Further Actions

Our development team will have a capacity review of the Profiles API infrastructure to prepare for future workload growth. The remediation plan will include the following steps:

* Enhanced monitoring and alerting of the caching cluster’s concurrency capacity
* Ensuring safe yet rapid capacity provisioning when required

We will provide a follow-up update by the end of Friday, summarizing any additional findings and actions taken.

Hiroshi (Nahi) Nakamura
CTO & VP Engineering
Treasure Data