Kustomer incident

[Chat and Workflow] Latency in Sending and Receiving Messages and Processing Workflows [Prod 1]

Kustomer experienced a minor incident on April 3, 2025 affecting Channel - Chat and Workflow, lasting 49m. The incident has been resolved; the full update timeline is below.

Started: Apr 03, 2025, 03:11 AM UTC
Resolved: Apr 03, 2025, 04:01 AM UTC
Duration: 49m
Detected by Pingoru: Apr 03, 2025, 03:11 AM UTC

Affected components

Channel - Chat
Workflow

Update timeline

identified Apr 03, 2025, 03:11 AM UTC

Kustomer has identified an event that may cause latency in sending and receiving chat messages, as well as in processing workflows. Our team is actively working on implementing a resolution. Please expect further updates within the next 30 minutes. If you have any questions or require additional information, please reach out to Kustomer Support at [email protected].

monitoring Apr 03, 2025, 03:24 AM UTC

Kustomer has implemented an update to address an event affecting chat message delivery and workflow processing that caused latency. Our team is currently monitoring this update to ensure the issue is fully resolved. Please expect further updates within the next 30 minutes, and reach out to Kustomer support at [email protected] if you have additional questions or concerns.

resolved Apr 03, 2025, 04:01 AM UTC

Kustomer has resolved an event affecting chat message delivery and workflow processing that caused latency. After careful monitoring, our team has determined that all affected areas are now fully restored. Please reach out to Kustomer support at [email protected] if you have additional questions or concerns.

postmortem May 23, 2025, 06:04 PM UTC

# **Summary**

On April 2, our background worker services were unable to scale up to meet increased demand because their auto-scaling alarms were not in place. As a result, queues backed up and processing latency spiked. After manually scaling the services and restoring the missing alarms, throughput and responsiveness returned to normal within an hour.

# **Root Cause**

During a recent migration of services to a new cluster, the CloudWatch alarms that trigger scale-out and scale-in actions were inadvertently removed and never recreated. Without those alarms, the auto-scaling system never responded to rising load.

# **Timeline**

**04/02 9:49 pm EST** - PagerDuty alert triggered: workflow service has a high number of messages in the queue.

**04/02 9:50 pm EST** - On-call engineers began investigating the issue.

**04/02 10:43 pm EST** - Workflow containers were manually scaled out.

**04/02 10:57 pm EST** - Multiple customers reported Conversation Assistants were not responding.

**04/02 10:57 pm EST** - Message volume in the workflow queue began decreasing.

**04/02 11:19 pm EST** - Root cause identified: missing CloudWatch alarms due to Terraform configuration. A temporary fix was applied by manually creating an alarm and verifying that scaling worked as expected.

**04/02 11:20 pm EST** - Workflow service processed all pending messages and caught up with the queue.

**04/03 1:06 am EST** - A permanent fix (redeploying the affected services) was identified and tested successfully with one service.

**04/03 1:07 am EST** - Rollout of permanent fix initiated for remaining impacted services.

# **Lessons / Improvements**

**Recreate Critical Alarms:** Ensure auto-scaling alarms are reattached whenever services are migrated or redeployed.

**Distinct Naming Conventions:** Incorporate cluster or environment identifiers in alarm names to avoid accidental deletion.

**Pre-Migration Verification:** Add a checklist step to confirm all scaling policies exist before cutting over to a new environment.
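
The naming and verification lessons above could be combined into a small pre-cutover check. The following is a minimal sketch under stated assumptions, not Kustomer's actual tooling: the `<cluster>-<service>-<action>` naming scheme, the `prod1` identifier, the service names, and both function names are hypothetical. Fetching the existing alarm names from CloudWatch (for example with boto3's `describe_alarms`) is left out so that the check itself stays self-contained.

```python
from __future__ import annotations

from typing import Iterable

# Hypothetical naming scheme (an assumption, not from the postmortem):
# "<cluster>-<service>-<action>", e.g. "prod1-workflow-scale-out".
SCALING_ACTIONS = ("scale-out", "scale-in")


def expected_alarm_names(cluster: str, services: Iterable[str]) -> set[str]:
    """Build the alarm names every service in the cluster should have."""
    return {
        f"{cluster}-{service}-{action}"
        for service in services
        for action in SCALING_ACTIONS
    }


def missing_alarms(
    cluster: str,
    services: Iterable[str],
    existing_alarm_names: Iterable[str],
) -> list[str]:
    """Return the expected alarm names that are absent, in sorted order."""
    expected = expected_alarm_names(cluster, services)
    return sorted(expected - set(existing_alarm_names))


# Example run with made-up data: the workflow scale-in alarm is absent.
existing = [
    "prod1-workflow-scale-out",
    "prod1-chat-scale-out",
    "prod1-chat-scale-in",
]
for name in missing_alarms("prod1", ["workflow", "chat"], existing):
    print(f"MISSING: {name}")
```

Run as a checklist step before cutover, a non-empty result would block the migration until the listed alarms are recreated. Because the cluster identifier is part of each name, alarms belonging to the old and new clusters cannot be confused with one another.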