Kustomer incident
[PLATFORM] Buffering and Routing Issues PROD 1
Kustomer experienced a notice incident on July 14, 2025 affecting Web Client, lasting 51m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Jul 14, 2025, 02:59 PM UTC
Kustomer is aware of an event affecting routing, potential buffering issues, and the ability to update team status. Our team is working to identify the cause of this issue and implement a resolution. Please expect additional updates within the next 30 minutes, and reach out to Kustomer Support at [email protected] with any further questions.
- monitoring Jul 14, 2025, 03:27 PM UTC
Kustomer has implemented an update to address an event affecting PROD 1 that caused issues with loading behavior. Our team is currently monitoring this update to ensure the issue is fully resolved. Please expect further updates within the next 30 minutes, and reach out to Kustomer support at [email protected] if you have additional questions or concerns.
- monitoring Jul 14, 2025, 03:36 PM UTC
Kustomer has implemented an update to address an event affecting PROD 1 that caused issues with loading behavior. Our team is continuing to monitor this update to ensure the issue is fully resolved. Please expect further updates within the next 30 minutes, and reach out to Kustomer support at [email protected] if you have additional questions or concerns.
- resolved Jul 14, 2025, 03:51 PM UTC
Kustomer has resolved an event that caused platform loading issues. To resolve this issue, our team rolled back changes to the platform. After careful monitoring, our team has determined that all affected areas are now fully restored. Please reach out to Kustomer support at [email protected] if you have additional questions or concerns.
- postmortem Sep 07, 2025, 04:49 PM UTC
## Summary

Between July 13th and July 15th, 2025, customer support conversations were not being automatically routed to agents for approximately 12 organizations. This resulted in conversations remaining in queue indefinitely, preventing customer service teams from receiving new tickets. The issue was resolved through manual intervention by Kustomer engineering on July 14th at 6:30 PM EST.

## Root Cause

A database cleanup script inadvertently deleted critical routing records (work-item revisions) that were still actively referenced by queued work-items. This created an unrecoverable state where the routing system was unable to process these work-items, effectively blocking affected queues and halting the distribution of subsequent conversations to agents.

## Timeline

**Jul 13, 2025**

* **Evening** - Database cleanup script runs and deletes active routing records, conversations begin backing up in queues

**Jul 14, 2025**

* **10:35 AM EST** - Kustomer engineering alerted to routing issues affecting multiple customer organizations
* **10:41 AM EST** - Initial investigation begins, recent code deployments rolled back as precautionary measure
* **11:33 AM EST** - Issue isolated to specific conversations stuck at the front of routing queues
* **2:04 PM EST** - Root cause identified: missing database records preventing queue processing
* **5:10 PM EST** - Manual database repair process validated and initiated
* **6:30 PM EST** - All affected queues restored, normal routing resumed
* **6:45 PM EST** - Monitoring confirms full resolution

**Jul 15, 2025**

* **9:44 AM EST** - Final confirmation that no additional conversations were impacted

## Lessons/Improvements

* **Enhanced Database Script Safety** - We have implemented additional validation checks in our database cleanup processes to prevent deletion of records that are actively referenced by live systems. **Status**: Complete
* **Graceful Error Handling** - The routing system will now gracefully handle missing database references instead of failing silently, ensuring conversations continue to route even if supporting records are corrupted. **Status**: Not Started
* **Queue Health Monitoring** - We are implementing proactive monitoring to detect queue blockages and alert engineering teams before customer impact occurs. **Status**: Not Started
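
The three improvements listed in the postmortem can be illustrated with a minimal Python sketch. This is not Kustomer's implementation: the in-memory data model and every name here (`WorkItem`, `safe_cleanup`, `route_next`, `is_stalled`, the 300-second threshold) are hypothetical, chosen only to show the general shape of each safeguard. The sketch assumes a store of revisions keyed by ID and queues of work-items that each reference one revision.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkItem:
    id: str
    revision_id: str  # hypothetical reference to a work-item revision record


def safe_cleanup(revisions: dict[str, dict], candidates: set[str],
                 queued: list[WorkItem]) -> set[str]:
    """Delete candidate revisions, skipping any still referenced by a queued work-item."""
    in_use = {item.revision_id for item in queued}
    deletable = candidates - in_use
    for rev_id in deletable:
        revisions.pop(rev_id, None)
    return deletable


def route_next(queue: list[WorkItem], revisions: dict[str, dict],
               quarantine: list[WorkItem]) -> Optional[WorkItem]:
    """Return the next routable work-item.

    Items whose revision record is missing are moved to `quarantine` for
    repair, so one broken item at the front cannot block everything behind it.
    """
    while queue:
        item = queue.pop(0)
        if item.revision_id in revisions:
            return item
        quarantine.append(item)
    return None


def is_stalled(queue_depth: int, seconds_since_last_route: float,
               threshold_seconds: float = 300.0) -> bool:
    """Flag a queue that holds work but has routed nothing within the threshold."""
    return queue_depth > 0 and seconds_since_last_route > threshold_seconds
```

In this sketch, `safe_cleanup` corresponds to the validation check before deletion, `route_next` to handling a missing reference without halting the queue, and `is_stalled` to a simple blockage signal that a monitor could alert on. A production system would more likely enforce the first safeguard at the database level and tune the stall threshold per queue.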