Affected components
- Prod1
Update timeline
- monitoring Apr 24, 2026, 08:03 PM UTC
Kustomer has identified an event affecting Prod1 that may cause internal API errors and latency. Our team is currently working to implement a resolution. Please expect further updates within the next 30 minutes, and reach out to Kustomer support at [email protected] if you have additional questions or concerns.
- monitoring Apr 24, 2026, 08:32 PM UTC
Kustomer has identified an event affecting Prod1 that may cause internal API errors and latency. Our team is currently working to implement a resolution. Please expect further updates within the next 30 minutes, and reach out to Kustomer support at [email protected] if you have additional questions or concerns.
- monitoring Apr 24, 2026, 09:02 PM UTC
Kustomer has identified an event affecting Prod1 that may cause internal API errors and latency. Our team is currently working to implement a resolution. Please expect further updates within the next 30 minutes, and reach out to Kustomer support at [email protected] if you have additional questions or concerns.
- monitoring Apr 24, 2026, 09:31 PM UTC
Kustomer has implemented an update to address an event affecting Prod1 that caused internal API errors and latency. Our team is currently monitoring this update to ensure the issue is fully resolved. Please expect further updates within the next 30 minutes, and reach out to Kustomer support at [email protected] if you have additional questions or concerns.
- resolved Apr 24, 2026, 09:47 PM UTC
Kustomer has resolved an event affecting Prod1 that caused API errors and latency. To resolve the issue, our team released an update and, after careful monitoring, confirmed that all affected areas are now fully restored. Please reach out to Kustomer support at [email protected] if you have additional questions or concerns.
- postmortem Apr 30, 2026, 06:21 PM UTC
## Summary

On April 24, 2026, customers in our prod1 environment experienced a service event that caused elevated errors and latency in messaging-related workflows. The broad cross-customer impact was limited to approximately 41 minutes, from 3:26 PM ET to 4:07 PM ET. During that window, some customers saw failed or delayed messaging operations. The immediate platform impact was resolved the same day, and overall system health returned to normal. We then completed follow-up mitigation to stop the underlying event source and reduce the risk of recurrence.

## Impact

* Customers in prod1 experienced elevated API errors and latency in messaging-related workflows.
* The broad cross-customer impact lasted about 41 minutes.
* A subset of messaging workflows failed or were delayed during that period.

## Timeline

* **~3:25 PM ET** — A newly enabled automation began processing a large backlog of historical conversations for one tenant.
* **3:26 PM ET** — Elevated errors and latency began affecting shared messaging workflows in prod1.
* **4:01 PM ET** — We published a status update for the production issue.
* **4:07 PM ET** — Broad cross-customer impact ended as the affected services stabilized.
* **~5:17 PM ET** — We disabled the triggering automation configuration for the affected tenant.
* **~5:23 PM ET** — The remaining retry activity stopped.

## Root cause

The event was triggered when a newly enabled automation for one tenant processed a much larger set of eligible conversations than intended. That sudden volume overloaded a shared downstream service and caused elevated errors and timeouts in dependent workflows.

The incident was amplified by missing safeguards in how this automation handled backlog volume and retries. In particular, the system did not sufficiently limit the number of conversations processed at once or prevent the same failed work from being retried too aggressively.

## Resolution

We restored platform stability during the incident by allowing the affected services to recover under increased capacity, then disabled the triggering automation configuration and cleared the remaining retry backlog. System health is currently normal.

## Preventative actions

We are treating the following preventative actions as a priority bug effort, expected to be completed by the end of May in accordance with our SLOs (see the sketch after this list):

* Prevent newly enabled automation settings from processing large historical backlogs unintentionally.
* Add stronger batch limits and tenant-level throttling for this workflow.
* Reduce retry amplification by improving how failed work is tracked and re-queued.
* Improve error handling so rate-limit conditions are classified correctly and handled with the right retry behavior.
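To make the batch-limit and retry items above concrete, here is a minimal sketch of that class of safeguard. It is illustrative only, not Kustomer's implementation; every identifier (`processBacklog`, `RateLimitError`, the batch and retry constants) is hypothetical.

```typescript
// Illustrative sketch only: bounded backlog processing with rate-limit-aware
// retries. All identifiers and limits here are hypothetical.

const MAX_BATCH_SIZE = 100;    // cap on conversations processed per batch
const MAX_RETRIES = 3;         // cap on re-queues of the same item
const BASE_BACKOFF_MS = 1_000; // starting backoff for rate-limit failures

// A rate-limit condition is classified separately from other failures so it
// can be retried with backoff instead of being re-queued immediately.
class RateLimitError extends Error {}

interface WorkItem {
  conversationId: string;
  attempts: number;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function processBacklog(
  tenantId: string,
  backlog: WorkItem[],
  handle: (item: WorkItem) => Promise<void>,
): Promise<void> {
  // Batch limit: never pull the whole historical backlog at once.
  while (backlog.length > 0) {
    const batch = backlog.splice(0, MAX_BATCH_SIZE);

    for (const item of batch) {
      try {
        await handle(item);
      } catch (err) {
        if (item.attempts + 1 >= MAX_RETRIES) {
          // Bounded retries: route to a dead-letter path instead of
          // re-queueing forever, which is what limits retry amplification.
          console.error(`tenant ${tenantId}: giving up on ${item.conversationId}`);
          continue;
        }
        if (err instanceof RateLimitError) {
          // Classified rate limit: back off exponentially before re-queueing.
          await sleep(BASE_BACKOFF_MS * 2 ** item.attempts);
        }
        backlog.push({ ...item, attempts: item.attempts + 1 });
      }
    }
  }
}
```

A per-tenant cap on concurrent batches (not shown) would supply the tenant-level throttling the list mentions; the point of the sketch is simply that backlog size, retry count, and rate-limit backoff are each bounded explicitly rather than left open-ended.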
Looking to track Kustomer downtime and outages?
Pingoru polls Kustomer's status page every 5 minutes and alerts you the moment it reports an issue — before your customers do.
- Real-time alerts when Kustomer reports an incident
- Email, Slack, Discord, Microsoft Teams, and webhook notifications
- Track Kustomer alongside 5,000+ providers in one dashboard
- Component-level filtering
- Notification groups + maintenance calendar
5 free monitors · No credit card required
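For a sense of what "polling a status page" means in practice, the loop below sketches the idea. It assumes Kustomer's status page exposes a Statuspage-style `/api/v2/status.json` endpoint; both the URL and the response schema are assumptions to verify, and this is not Pingoru's implementation.

```typescript
// Minimal status-polling sketch. The endpoint URL and response schema are
// assumed (Statuspage-style); verify them before relying on this.

const STATUS_URL = "https://status.kustomer.com/api/v2/status.json"; // assumed
const POLL_INTERVAL_MS = 5 * 60 * 1_000; // 5 minutes, matching the cadence above

let lastIndicator = "none";

async function poll(): Promise<void> {
  const res = await fetch(STATUS_URL);
  if (!res.ok) throw new Error(`status page returned ${res.status}`);
  const body = (await res.json()) as {
    status: { indicator: string; description: string };
  };

  const { indicator, description } = body.status;
  if (indicator !== lastIndicator) {
    // Swap this for an email, Slack, or webhook notification of your choice.
    console.log(`Kustomer status changed: ${indicator} (${description})`);
    lastIndicator = indicator;
  }
}

setInterval(() => poll().catch(console.error), POLL_INTERVAL_MS);
poll().catch(console.error);
```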