Kustomer incident

[WORKFLOWS + ROUTING] Delays in processing Prod 2

Minor · Resolved

Kustomer experienced a minor incident on October 8, 2025, affecting the Workflow component and lasting 1h 6m. The incident has been resolved; the full update timeline is below.

Started
Oct 08, 2025, 02:09 PM UTC
Resolved
Oct 08, 2025, 03:16 PM UTC
Duration
1h 6m
Detected by Pingoru
Oct 08, 2025, 02:09 PM UTC

Affected components

Workflow

Update timeline

  1. Investigating Oct 08, 2025, 02:09 PM UTC

    Kustomer is aware of an event affecting workflows and routing that may cause platform delays. Our team is working to identify the cause and implement a resolution. Please expect additional updates within the next 30 minutes, and reach out to Kustomer Support at [email protected] with any further questions.

  2. Monitoring Oct 08, 2025, 02:31 PM UTC

    Kustomer has implemented an update to address an event affecting workflows, business rules, and routing in Prod 2 that caused platform delays. Our team is currently monitoring this update to ensure the issue is fully resolved. Please expect further updates within the next 30 minutes, and reach out to Kustomer support at [email protected] if you have additional questions or concerns.

  3. Resolved Oct 08, 2025, 03:16 PM UTC

    Kustomer has resolved an event affecting Prod 2 that may have caused issues with routing and workflows. Our team released an update to address the issue and, after careful monitoring, confirmed that our systems are fully restored and that a redrive has been completed. Please reach out to Kustomer Support at [email protected] if you have additional questions or concerns.

  4. Postmortem Oct 16, 2025, 04:40 PM UTC

    **Summary**

    On October 8, 2025, Kustomer experienced a period of degraded performance in our prod2 EU environment that affected business rules, workflow execution, and routing responsiveness. The elevated load stemmed from an internal feedback loop in workflow processing, which caused increased memory usage and instability in a downstream automation component. Systems serving automation traffic became rate-limited as they approached memory thresholds, resulting in delays across related services. The issue was mitigated by reducing processing throughput and rebalancing service traffic, allowing impacted systems to stabilize. Full recovery was confirmed later that day.

    **Impact**

    * **Duration:** ~2.5 hours of degraded performance
    * **Scope:** Production region prod-2 (EU)
    * **Customer Impact:** Minor latency observed across multiple customer organizations

    **Next Steps**

    While existing safeguards are already in place, we are implementing additional safety measures to better detect anomalous workflow patterns and further enhance the stability of our automation systems under high-load scenarios. We’re also investing in expanded observability to improve early detection of similar issues in the future.
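As an illustration of the general mitigation pattern the postmortem describes (throttling automation throughput as memory usage approaches a threshold), here is a minimal sketch in Python. The class, method, and parameter names are hypothetical, and this is not Kustomer's implementation; it only demonstrates the load-shedding idea under stated assumptions.

```python
class MemoryAwareThrottle:
    """Scale processing throughput down as memory usage nears a hard limit.

    Illustrative only: a generic sketch of the mitigation described in the
    postmortem (rate-limiting automation traffic under memory pressure),
    not Kustomer's actual mechanism.
    """

    def __init__(self, soft_limit: float, hard_limit: float):
        # Limits are fractions of total memory, e.g. 0.7 (start throttling)
        # and 0.9 (shed all automation load).
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit

    def allowed_batch_size(self, memory_fraction: float, max_batch: int) -> int:
        """Return how many items to process given current memory pressure."""
        if memory_fraction >= self.hard_limit:
            return 0  # stop automation work entirely; let the service recover
        if memory_fraction <= self.soft_limit:
            return max_batch  # normal throughput
        # Between the limits, back off linearly with remaining headroom.
        headroom = (self.hard_limit - memory_fraction) / (
            self.hard_limit - self.soft_limit
        )
        return max(1, round(max_batch * headroom))
```

With limits of 0.7 and 0.9, a service at 80% memory would process half of its normal batch size, giving the downstream component room to stabilize instead of hitting the hard threshold.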