Kustomer incident

[ALL CHANNELS] - Latency and delays [Prod 1]

Kustomer experienced a minor incident on January 10, 2025 affecting Kustomer Voice and Channel - Chat and 1 more component, lasting 1h 52m. The incident has been resolved; the full update timeline is below.

Started: Jan 10, 2025, 04:37 PM UTC
Resolved: Jan 10, 2025, 06:30 PM UTC
Duration: 1h 52m
Detected by Pingoru: Jan 10, 2025, 04:37 PM UTC

Affected components

Kustomer Voice
Channel - Chat
Channel - Email
Workflow

Update timeline

investigating Jan 10, 2025, 04:37 PM UTC

Kustomer is aware of an event that may cause delays in routing, issues with Conversation Assistants, and delays with business rules. Our team is currently working to identify the cause of this issue in an effort to implement a resolution. Please expect additional updates within the next 30 minutes. Please reach out to Kustomer Support for any further questions or updates.
identified Jan 10, 2025, 05:04 PM UTC

Kustomer is aware of an event that may cause delays in routing, issues with Conversation Assistants, and delays with business rules and workflows. Our team is currently working to identify the cause of this issue in an effort to implement a resolution. Please expect additional updates within the next 30 minutes. Please reach out to Kustomer Support for any further questions or updates.
identified Jan 10, 2025, 05:32 PM UTC

Kustomer is aware of an event that may cause delays in routing, issues with Conversation Assistants, and delays with business rules and workflows. Our team is still working to identify the cause of this issue in an effort to implement a resolution. Please expect additional updates within the next 30 minutes. Please reach out to Kustomer Support for any further questions or updates.
investigating Jan 10, 2025, 05:37 PM UTC

Kustomer is aware of an event that may cause delays in routing, issues with Conversation Assistants, and delays with business rules and workflows. Our team is still working to identify the cause of this issue in an effort to implement a resolution. Please expect additional updates within the next 30 minutes. Please reach out to Kustomer Support for any further questions or updates.
monitoring Jan 10, 2025, 06:13 PM UTC

Kustomer has implemented an update to address an event causing delays in routing, issues with Conversation Assistants, and delays with business rules. Our team is currently monitoring this update to ensure the issue is fully resolved. Please expect further updates within the next 30 minutes, and reach out to Kustomer Support if you have additional questions or concerns.
resolved Jan 10, 2025, 06:30 PM UTC

Kustomer has resolved an event causing delays in routing, issues with Conversation Assistants, and delays with business rules. After careful monitoring, our team has determined that all affected areas are now fully restored. Please reach out to Kustomer Support if you have additional questions or concerns.
postmortem Feb 05, 2025, 04:10 PM UTC

# Post Mortem: Traffic Surge and Scaling Issues on 1-10-2025

# **Summary**

A large spike in traffic to one of our services was not met with appropriate scaling, leading to raised error rates and system latency across multiple services over the span of approximately 3 hours. During this time, customers on prod1 experienced latency in routing and business rule execution, as well as errors and delayed responses from Conversation Assistants.

# **Root Cause**

A large spike in traffic during a short window led to a mismatch in our service scaling: as one service scaled up rapidly to accommodate the traffic spike, another service with a slower scaling policy was unable to meet the demand and became a bottleneck, leading to cascading failures across services.

# **Timeline**

**Jan 10, 2025**

* 9:48 AM EST - A significant increase in inbound message creation and automation activity, driven by organic traffic to 2 large organizations, led to a quick doubling of traffic on our primary platform data service. It scaled up to accommodate this traffic but began to experience failures from a dependency service.
* 10:16 AM EST - Multiple downstream services start experiencing heightened error rates; engineering is alerted
* 10:17 AM EST - Engineering begins to investigate the errors, looking for a root cause and solution
* 10:52 AM - 12:55 PM EST - Engineering manually scales the impacted services
* 1:01 PM EST - Engineering begins to see systemwide improvement with decreased error rates, restored performance, and healthy metrics
* 1:28 PM EST - Incident declared resolved

# **Lessons/Improvements**

* **Caching Optimization** - To protect against similar spikes in the future, we’ve reviewed our service caching strategy and implemented an additional layer of caching to protect dependent services in the case of a quick scale-up during a traffic burst. This protection will prevent a recurrence of this failure.
* **Better Logging/Observability** - We encountered some problems with our observability tool during incident investigation that made it more difficult to determine the root cause of this issue. We are working with our vendor to resolve these issues.
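
The caching improvement above can be illustrated with a minimal sketch. It is not Kustomer's implementation: the post mortem does not say what is cached, where, or for how long. The names `TTLCache`, `CountingDependency`, and `simulate_burst`, the 30-second TTL, and the organization IDs are all hypothetical. The sketch only shows the general idea of a read-through cache with a time-to-live sitting in front of a slower-scaling dependency, so that a burst of repeated lookups reaches the dependency once per key instead of once per request.

```python
import time
from typing import Any, Callable, Dict, Hashable, Iterable, Tuple


class TTLCache:
    """Read-through cache: serve fresh entries locally, call the dependency on a miss."""

    def __init__(
        self,
        fetch: Callable[[Hashable], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # call into the dependency service
        self._ttl = ttl_seconds      # how long a cached value stays fresh
        self._clock = clock          # injectable so expiry can be tested
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # fresh hit: the dependency is not called
        value = self._fetch(key)     # miss or expired: one dependency call
        self._entries[key] = (now + self._ttl, value)
        return value


class CountingDependency:
    """Hypothetical stand-in for the slower-scaling dependency; counts its calls."""

    def __init__(self) -> None:
        self.calls = 0

    def fetch(self, org_id: Hashable) -> Dict[str, Any]:
        self.calls += 1
        return {"org_id": org_id, "routing_enabled": True}


def simulate_burst(cache: TTLCache, org_ids: Iterable[Hashable], requests_per_org: int) -> None:
    """Issue repeated lookups for each organization, as in a traffic burst."""
    org_ids = list(org_ids)
    for _ in range(requests_per_org):
        for org_id in org_ids:
            cache.get(org_id)


if __name__ == "__main__":
    dependency = CountingDependency()
    cache = TTLCache(dependency.fetch, ttl_seconds=30.0)
    simulate_burst(cache, ["org-a", "org-b"], requests_per_org=500)
    print(f"requests: 1000, dependency calls: {dependency.calls}")
```

A single-process cache like this does not coordinate between service instances that start at the same moment during a scale-up; a shared cache or request coalescing would be needed for that, and the post mortem does not say which approach was taken.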