Kustomer incident

[CHAT] [Chat messages may fail to deliver] [Prod 1]

Minor Resolved

Kustomer experienced a minor incident on August 6, 2025 affecting Channel - Chat, lasting 39m. The incident has been resolved; the full update timeline is below.

Started
Aug 06, 2025, 02:14 PM UTC
Resolved
Aug 06, 2025, 02:54 PM UTC
Duration
39m
Detected by Pingoru
Aug 06, 2025, 02:14 PM UTC

Affected components

Channel - Chat

Update timeline

  1. investigating Aug 06, 2025, 02:14 PM UTC

    Kustomer is aware of an event affecting Chat that may cause outbound chat messages to not deliver. Our team is currently working to identify the cause of this issue in an effort to implement a resolution. Please expect additional updates within the next 30 minutes. Please reach out to Kustomer Support at [email protected] for any further questions or updates.

  2. identified Aug 06, 2025, 02:29 PM UTC

    Kustomer has identified an event in Chat that may cause Chat messages to fail to deliver. Our team is continuing to work on implementing a resolution. Please expect additional updates within the next 30 minutes. Please reach out to Kustomer Support at [email protected] for any further questions or updates.

  3. monitoring Aug 06, 2025, 02:30 PM UTC

    Kustomer has implemented an update to address an event affecting Chat in Prod 1 that caused Chat messages to fail to deliver. Our team is currently monitoring this update to ensure the issue is fully resolved. Please expect further updates within the next 30 minutes, and reach out to Kustomer Support at [email protected] if you have additional questions or concerns.

  4. resolved Aug 06, 2025, 02:54 PM UTC

    Kustomer has resolved an event affecting Chat in Prod 1 that caused Chat messages to not deliver. To resolve this issue, our team has released an update. All chats that did not send have been processed. After careful monitoring, our team has determined that all affected areas are now fully restored. Please reach out to Kustomer Support at [email protected] if you have additional questions or concerns.

  5. postmortem Sep 07, 2025, 04:57 PM UTC

    ## **Summary**

    On August 6, 2025, customers experienced an incident where chat messages failed to deliver in one of our production environments. The issue was resolved by identifying and using a previously stable image. We have initiated follow-up actions to enhance our image retention policies and improve logging to prevent recurrence.

    ## **Root Cause**

    A production issue impacted our chat messaging service, causing messages to not deliver. This was due to an unavailable container image, which prevented the service from recovering after an internal issue. This led to a backlog of messages in the queue and widespread chat delivery failures. The problem was resolved by reverting to a previous, stable image.

    ## **Timeline**

    **August 6, 2025**

    * **8/6 9:51 am EST**: Alerts were triggered indicating a high message count in our chat queue.
    * **8/6 10:01 am EST**: Attempts to redeploy the service failed due to a missing container image.
    * **8/6 10:11 am EST**: A status page was created to inform customers.
    * **8/6 10:34 am EST**: The service began recovering after being reconfigured to use a previous, stable image.
    * **8/6 10:51 am EST**: The status page was updated to resolved.
    * **8/6 10:58 am EST**: The incident was marked as resolved.

    ## **Lessons/Improvements**

    * **Image Visibility**: We are working to improve visibility and alerting around container image availability to prevent similar issues, as the container image deletion due to lifecycle policy was unexpected and not surfaced clearly.
    * **Rapid Detection & Triage**: The issue was quickly detected through our monitoring systems, allowing for immediate investigation. We can further enhance this by ensuring alerts are correctly prioritized.
    * **Clear Cross-Team Collaboration**: Our teams coordinated swiftly to escalate, communicate, and resolve the incident efficiently.
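
The postmortem describes a redeploy that failed because the expected container image was gone, and a recovery that came from falling back to an earlier stable image. As a minimal illustration of that idea (not Kustomer's actual tooling), the Python sketch below checks an ordered list of candidate image tags against the set of tags a registry still holds, picks the first one that exists, and records the ones that were missing so they can be surfaced in an alert. The function name, the `ImageCheck` type, and the `chat-service:*` tags are all hypothetical; how the list of available tags is obtained from a real registry is left out on purpose.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class ImageCheck:
    """Result of checking deployment candidates against the registry contents."""
    selected: Optional[str]    # first candidate tag that exists, or None if none do
    missing: Tuple[str, ...]   # candidates skipped because they were not found


def select_deployable_image(
    candidates: Sequence[str], available: Iterable[str]
) -> ImageCheck:
    """Return the first candidate present in `available`, in preference order.

    `candidates` is ordered from most to least preferred (for example the
    current release first, then the previous stable release).
    `available` is whatever tag listing the registry reports (hypothetical input).
    """
    available_set = set(available)
    missing = []
    for tag in candidates:
        if tag in available_set:
            return ImageCheck(selected=tag, missing=tuple(missing))
        missing.append(tag)
    return ImageCheck(selected=None, missing=tuple(missing))


if __name__ == "__main__":
    # Hypothetical tags: the current image was removed, the previous one survives.
    result = select_deployable_image(
        candidates=["chat-service:current", "chat-service:previous"],
        available={"chat-service:previous", "other-service:current"},
    )
    if result.missing:
        print("ALERT: images missing from registry:", list(result.missing))
    print("Image selected for deploy:", result.selected)
```

Running a check like this before a deploy, or on a schedule, would make a missing image visible before a restart depends on it. It only detects the gap; the retention policy still has to keep the images the service is expected to run.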