Kustomer incident

[SEARCHES] Search Retrieval Issues (PROD2)

Kustomer experienced a minor incident on January 12, 2025 affecting Search, lasting 3h 11m. The incident has been resolved; the full update timeline is below.

Started: Jan 12, 2025, 05:29 PM UTC
Resolved: Jan 12, 2025, 08:41 PM UTC
Duration: 3h 11m
Detected by Pingoru: Jan 12, 2025, 05:29 PM UTC

Update timeline

investigating Jan 12, 2025, 05:29 PM UTC

Kustomer is aware of an event affecting Searches that may cause issues retrieving search results. Our team is working to identify the cause of this issue and implement a resolution. Please expect additional updates within the next 30 minutes. For any further questions, please reach out to Kustomer Support at [email protected].
identified Jan 12, 2025, 05:39 PM UTC

Kustomer is aware of an event affecting Searches that may cause issues retrieving search results. Our team has identified the issue and is working to implement a resolution as soon as possible. Please expect additional updates within the next 30 minutes. For any further questions, please reach out to Kustomer Support at [email protected].
identified Jan 12, 2025, 06:00 PM UTC

Kustomer has identified the root cause of the ongoing issue retrieving search results in Prod2 instances. We are focused on resolving this as quickly as possible and will provide updates in the next 30 minutes. Please reach out to Kustomer Support at [email protected] for any further questions or updates.
identified Jan 12, 2025, 06:30 PM UTC

We have identified the root cause of the issue retrieving search results in Prod2 instances and continue to work on implementing a solution. Additional updates will be provided in the next 30 minutes. In the meantime, please reach out to Kustomer Support at [email protected] for any further queries.
monitoring Jan 12, 2025, 06:40 PM UTC

Kustomer has implemented an update to address the issue impacting search retrieval results in Prod2 instances. Our team is currently monitoring this update to ensure the issue is fully resolved. Please expect further updates within the next 30 minutes, and reach out to Kustomer Support at [email protected] if you have additional questions or concerns.
monitoring Jan 12, 2025, 06:52 PM UTC

Kustomer has released an update and is seeing indications of recovery on search retrieval in Prod2 instances. We are monitoring this fix to ensure the issue is fully resolved. Please expect additional details within the next 30 minutes, and reach out to Kustomer Support at [email protected] if you have further questions or concerns.
monitoring Jan 12, 2025, 07:30 PM UTC

After releasing a fix to restore search retrieval in Prod2 instances, this issue should now be resolved for all Prod2 clients except one isolated instance, where we're working with the relevant stakeholders to ensure optimal functionality on a specific search query. Please feel free to reach out to Kustomer Support at [email protected] if you have additional queries or concerns.
resolved Jan 12, 2025, 08:41 PM UTC

Kustomer has resolved an event affecting Prod2 instances that caused issues when attempting to retrieve search results. After careful monitoring, our team has determined that all affected areas are now fully restored. Please reach out to Kustomer Support at [email protected] if you have additional questions or concerns.
postmortem Jan 16, 2025, 06:57 PM UTC

# **Summary**

An organization’s automated process running a high volume of complex search queries strained our search system, causing reduced search functionality for multiple Prod2 organizations over the span of roughly 3 hours.

# **Root Cause**

A high volume of complex queries in a short amount of time put excessive strain on our infrastructure. This degraded the health of 2 nodes, affecting other organizations that rely on the same infrastructure.

# **Timeline**

**Jan 12, 2025**

* 6:00 AM EST - A single organization’s automated user ramps up activity to significantly higher levels than normal
* 11:38 AM EST - Engineers start receiving alerts that various charts/reporting endpoints are generating errors
* 12:16 PM EST - Multiple clients report broken search functionality; we declare an incident
* 12:22 PM EST - Engineers identify that 2 of our nodes are reaching maximum CPU utilization
* 12:33 PM EST - Engineers identify the organization, user, and query that are causing the issue
* 1:15 PM EST - Engineers temporarily block the organization generating the problematic queries, restoring functionality to all organizations except the blocked organization
* 3:15 PM EST - Engineers disable the block; full functionality is restored for all organizations

# **Lessons/Improvements**

* **Better communications** - This incident highlighted the need for us to establish better protocol and communications with organizations whose activity may be impacting infrastructure or other organizations. With such a protocol in place we can hopefully reduce the need to block a single organization.
* **Tune our blocking** - In response to this incident, we implemented the ability to block a single user (rather than block an entire organization) that is disrupting our systems.
* **Better alerts** - By more finely tuning our alerting system, we can get earlier notification of incidents such as this without having to wait for customers to report impaired functionality.
* **Rate limiting machine users** - We are exploring a reasonable way to rate-limit activity by machine users so that no single machine user can overwhelm our systems, as seen in this incident.
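
As an illustration of the rate-limiting idea in the last item, here is a minimal token-bucket sketch in Python. It is not Kustomer's implementation, which the postmortem does not describe. The class names, the `capacity` and `refill_per_sec` parameters, and the per-query `cost` weighting are all hypothetical choices made for this example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenBucket:
    """Permits bursts of up to `capacity` tokens, refilled at `refill_per_sec`."""
    capacity: float
    refill_per_sec: float
    tokens: float
    last_refill: float

    def try_consume(self, now: float, cost: float = 1.0) -> bool:
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class MachineUserLimiter:
    """Keeps one bucket per machine user (hypothetical design, for illustration)."""

    def __init__(
        self,
        capacity: float = 5.0,
        refill_per_sec: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock  # injectable so the limiter can be tested deterministically
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, user_id: str, cost: float = 1.0) -> bool:
        """Return True if this user's query may run now; False if it should be rejected."""
        now = self.clock()
        bucket = self.buckets.get(user_id)
        if bucket is None:
            # New users start with a full bucket.
            bucket = TokenBucket(self.capacity, self.refill_per_sec, self.capacity, now)
            self.buckets[user_id] = bucket
        return bucket.try_consume(now, cost)
```

Because each machine user has its own bucket, one user exhausting its budget does not affect others in the same organization or elsewhere. Passing a larger `cost` for complex queries is one possible way to charge them more than simple ones. A rejected call would typically map to an HTTP 429 response.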