Kustomer incident
[ROUTING] Chat and Voice conversations not routing [PROD 1 && PROD 2]
Kustomer experienced a minor incident on January 22, 2026 affecting Channel - Chat, lasting 46m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Jan 22, 2026, 06:44 PM UTC
Kustomer is aware of an event affecting Chat and Voice conversations that may cause conversations not to be routed to an available agent. Our team is currently working to identify the cause of this issue and implement a resolution. Please expect additional updates within the next 30 minutes. For any further questions or updates, please reach out to Kustomer Support via Email or Chat.
- monitoring Jan 22, 2026, 07:02 PM UTC
Kustomer has implemented an update to address an event affecting Chats and Voice calls in PROD 1 & 2 that caused conversations to not be routed to available agents. Our team is currently monitoring this update to ensure the issue is fully resolved. Please expect further updates within the next 30 minutes, and reach out to Kustomer Support via Chat or Email if you have additional questions or concerns.
- resolved Jan 22, 2026, 07:30 PM UTC
Kustomer has resolved an event affecting Conversational Assistants in PROD 1 and 2 that caused conversations to not route to available agents. To resolve this issue, our team has completed a rollback of our codebase to address the failures in the assistants. After careful monitoring, our team has determined that all affected areas are now fully restored. Please reach out to Kustomer Support via Chat or Email if you have additional questions or concerns.
- postmortem Feb 04, 2026, 09:52 PM UTC
# Post Mortem: Chat, Voice Routing, CSAT, OAuth, Scheduled Send Issues

# **Summary**

On January 22, 2026, customers experienced chat and voice conversations failing to route to agents due to a recent change in the assistant service. This triggered cascading failures across multiple backend services, degrading platform performance for several orgs. On January 27, a subsequent incident occurred because our scheduled jobs queue was flooded by assistant service jobs. This caused some CSAT surveys to not send, OAuth connections to fail to refresh, and scheduled messages to not send.

**Root Cause**

A recent change to the assistant backend service increased the maximum number of workflow loops allowed before a conversation is transferred to an available agent. This allowed a single rate-limited WhatsApp conversation to become stuck in an infinite retry loop while attempting to transfer. The transfer requests themselves were also rate-limited, so the retries overwhelmed shared infrastructure (e.g. the “job” engine, which is shared by CSAT) and caused service degradation across multiple orgs.
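
The failure mode described in the root cause, a transfer that keeps retrying while it is rate-limited, can be bounded with a fixed attempt budget and exponential backoff. The Python sketch below is illustrative only and does not describe the actual remediation, which was a rollback. `RateLimited`, `transfer_with_bounded_retries`, and every default value are hypothetical names and numbers chosen for the example.

```python
import time
from typing import Callable


class RateLimited(Exception):
    """Hypothetical error raised when a transfer request is rate-limited."""


def transfer_with_bounded_retries(
    transfer: Callable[[], None],
    max_attempts: int = 5,        # assumed budget, not a value from the incident
    base_delay: float = 0.5,      # seconds before the first retry (assumption)
    max_delay: float = 30.0,      # cap on any single backoff delay (assumption)
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Attempt a transfer a bounded number of times with exponential backoff.

    Returns True on success and False once the attempt budget is exhausted,
    so the caller can park the conversation instead of retrying forever.
    """
    for attempt in range(max_attempts):
        try:
            transfer()
            return True
        except RateLimited:
            if attempt == max_attempts - 1:
                break  # budget exhausted: stop rather than loop again
            # Double the wait after each rate-limited attempt, up to max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    return False
```

A caller that receives `False` would typically route the conversation to a fallback path (for example, a manual queue) rather than re-enqueueing the same transfer, which is what keeps one stuck conversation from generating unbounded work.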
# **Timeline**

**Jan 22, 2026**

* 1:33 PM - Users began experiencing errors with chat and voice conversations failing to route to agents
* 1:44 PM - Engineers identified the problematic deployment and initiated rollback across all environments
* 1:54 PM - Rollback completed across all environments; on-call engineers continued monitoring system status
* 2:02 PM - Full assistant functionality restored for all customers
* 6:07 PM - Customers began to report OAuth connections that failed to refresh

**Jan 24, 2026**

* 6:24 PM - Customers began to report that scheduled messages were not sending

**Jan 27, 2026**

* 1:00 PM - Customers began to report that CSAT surveys were not being sent
* 4:00 PM - Identified cause of CSAT scheduled job processing delays
* 6:00 PM - Started a script to manually increase processing throughput of scheduled jobs

**Jan 28, 2026**

* 12:08 PM - Deployed code change to programmatically increase processing throughput of scheduled jobs
* 3:49 PM - Backlog of all delayed jobs processed and system restored

**Lessons/Improvements**

* Implementing new alerts to detect when scheduled job processing falls behind, enabling faster identification of similar issues
* Improving alert prioritization to reduce noise and ensure critical alerts are acted upon immediately
* Enhancing monitoring for downstream service dependencies
* Evaluating queue architecture changes to prevent a single conversation from impacting other customers ("noisy neighbor" isolation)
* Investigating improvements to make our job scheduling service more resilient to backlogs
* Creating documentation of all services that depend on scheduled jobs to better understand incident ripple effects
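
Two of the improvements above, "noisy neighbor" isolation and alerting when scheduled job processing falls behind, can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not a description of Kustomer's queue architecture: `FairJobQueue`, its methods, and the idea of keying queues by org ID are all hypothetical, and the sketch treats every queued job as ready to run.

```python
from collections import OrderedDict, deque
from typing import Deque, Optional, Tuple


class FairJobQueue:
    """Hypothetical round-robin queue: one org's backlog cannot starve the rest."""

    def __init__(self) -> None:
        # org_id -> pending (due_at, job) pairs; dict order is the rotation order.
        self._queues: "OrderedDict[str, Deque[Tuple[float, str]]]" = OrderedDict()

    def enqueue(self, org_id: str, job: str, due_at: float) -> None:
        self._queues.setdefault(org_id, deque()).append((due_at, job))

    def dequeue(self) -> Optional[Tuple[str, str]]:
        """Pop one job from the org at the front, then rotate that org to the back."""
        if not self._queues:
            return None
        org_id, pending = next(iter(self._queues.items()))
        _, job = pending.popleft()
        if pending:
            self._queues.move_to_end(org_id)  # this org waits for its next turn
        else:
            del self._queues[org_id]          # no work left for this org
        return org_id, job

    def max_lag(self, now: float) -> float:
        """Seconds by which the most overdue pending job is late (0.0 if none)."""
        oldest = min(
            (due for q in self._queues.values() for due, _ in q),
            default=now,
        )
        return max(0.0, now - oldest)
```

With this shape, an org that enqueues thousands of jobs still gets only one dequeue per rotation, and a monitor could compare `max_lag(now)` against a threshold (the threshold itself would be an operational choice) to raise an alert when processing falls behind.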