Treasure Data incident

[EU Region] Elevated error rate and performance degradation for personalization API

Treasure Data experienced a major incident on January 30, 2025, lasting 4h 49m and affecting three components: CDP API, CDP Personalization - Lookup API, and CDP Personalization - Ingest API. The incident has been resolved; the full update timeline is below.

Started: Jan 30, 2025, 10:54 AM UTC
Resolved: Jan 30, 2025, 03:43 PM UTC
Duration: 4h 49m
Detected by Pingoru: Jan 30, 2025, 10:54 AM UTC

Affected components

CDP API
CDP Personalization - Lookup API
CDP Personalization - Ingest API

Update timeline

investigating Jan 30, 2025, 10:54 AM UTC

We detected degraded performance and an increased error rate for the personalization API. We are currently investigating this issue.
monitoring Jan 30, 2025, 11:38 AM UTC

We are observing that performance and the error rate have improved. We continue to monitor the metrics closely.
monitoring Jan 30, 2025, 12:31 PM UTC

We are continuing to monitor for any further issues.
monitoring Jan 30, 2025, 02:18 PM UTC

We are still monitoring the service. Between 10:00 UTC and 11:05 UTC on Thursday, 30 Jan 2025, customers experienced elevated error rates and higher latency for Profiles API lookups. The cluster workload has since calmed down and the cluster is operating normally. Our response team is ready to provision additional processing capacity if needed, and we are closely monitoring the service status to avoid further downtime during peak times. In addition, we are working on isolating the problematic accesses from the service. We will keep this status page open and update you on our progress.
resolved Jan 30, 2025, 03:43 PM UTC

We implemented fundamental isolation of a problematic configuration at 14:42 UTC. The remediation caused the cluster workload to drop from 60% to 1%. On Friday, we implemented write access isolation for the problematic configuration, which stopped the cluster workload from growing. Today, we implemented read access isolation, which restored the cluster workload to its previous level. The system is now operating normally, and we are closing the incident. We acknowledge that further action is needed to prevent a similar configuration from causing the same incident again. We will post a postmortem when it is ready.