Treasure Data incident

[EU Region] Elevated errors/performance degradation related to the personalization API

Minor, Resolved

Treasure Data experienced a minor incident on January 23, 2025, affecting the CDP Personalization - Lookup API and the CDP Personalization - Ingest API, lasting 4h 44m. The incident has been resolved; the full update timeline is below.

Started: Jan 23, 2025, 07:34 AM UTC
Resolved: Jan 23, 2025, 12:18 PM UTC
Duration: 4h 44m
Detected by Pingoru: Jan 23, 2025, 07:34 AM UTC

Affected components

CDP Personalization - Lookup API
CDP Personalization - Ingest API

Update timeline

  1. investigating Jan 23, 2025, 07:34 AM UTC

    We are currently observing errors or performance degradation for the personalization API. We are investigating the cause of the issue now.

  2. identified Jan 23, 2025, 08:05 AM UTC

    The response team confirmed that the symptoms have the same cause as the previous incidents. We are provisioning additional concurrency capacity for the environment. We will update you when it is completed.

  3. identified Jan 23, 2025, 10:37 AM UTC

    We provisioned additional capacity at 10:00 AM UTC to support the increasing workload. It improved latency, but we still observed errors and long latency for a small number of requests. The response team has started provisioning further concurrency capacity. Unlike the previous method, the new process should not take as long to provision. We will share the result in 30 minutes.

  4. identified Jan 23, 2025, 11:08 AM UTC

    We successfully provisioned 2x capacity in 30 minutes. The new resources improved latency, but the error rate is still high. The response team is planning to implement a different remediation instead of adding more resources. We will update you in 30 minutes.

  5. monitoring Jan 23, 2025, 12:16 PM UTC

    The response team found problematic real-time segment configurations in one customer's Parent Segment that may have contributed to consuming the concurrency capacity. The team updated the real-time event routing configuration to mitigate the high latency. Combined with the capacity additions, this stabilized the Profiles API cache cluster. If you experience any delays or abnormal errors, please reach out to our support team. Thank you for your patience and understanding during this incident. We will update the postmortem with a further remediation plan as promised.

  6. resolved Jan 23, 2025, 12:18 PM UTC

    Between 07:20 UTC and 11:40 UTC on Thursday, 23 Jan 2025, customers experienced elevated error rates and increased latency related to the Profiles API. A fix has been implemented, and the issue has been resolved. If you experience any delays or abnormal errors, please reach out to our support team. Thank you for your patience and understanding during this incident. We will share an incident retrospective soon.