Kustomer incident

[PLATFORM] Entire platform is experiencing some latency causing disruptions in load times (PROD1)

Minor · Resolved

Kustomer experienced a minor incident on December 14, 2024 affecting API and Web Client, lasting 54m. The incident has been resolved; the full update timeline is below.

Started
Dec 14, 2024, 05:33 PM UTC
Resolved
Dec 14, 2024, 06:27 PM UTC
Duration
54m
Detected by Pingoru
Dec 14, 2024, 05:33 PM UTC

Affected components

API, Web Client

Update timeline

  1. investigating Dec 14, 2024, 05:33 PM UTC

    Kustomer is aware of an event affecting load times in the platform that may cause latency while using or accessing the platform. Our team is currently working to identify the cause of this issue so that we can implement a resolution. Please expect additional updates within the next 30 minutes, and reach out to Kustomer Support through EMAIL or CHAT with any further questions.

  2. investigating Dec 14, 2024, 05:38 PM UTC

    Kustomer is aware of an event affecting load times in the platform that may cause latency while using or accessing the platform. Our team is currently working to identify the cause of this issue so that we can implement a resolution. Please expect additional updates within the next 30 minutes, and reach out to Kustomer Support through EMAIL or CHAT with any further questions.

  3. monitoring Dec 14, 2024, 06:04 PM UTC

    Kustomer has implemented an update to address an event affecting PROD1 that caused latency issues in the platform. Our team is currently monitoring this update to ensure the issue is fully resolved. Please expect further updates within the next 30 minutes, and reach out to Kustomer support through CHAT or EMAIL if you have additional questions or concerns.

  4. resolved Dec 14, 2024, 06:27 PM UTC

    Kustomer has resolved an event affecting PROD1 that caused latency issues in the platform. While investigating this issue, our team discovered a bottleneck in our database, which has since recovered. Normal speeds have returned to the platform. After careful monitoring, our team has determined that all affected areas are now fully restored. Please reach out to Kustomer Support at EMAIL or CHAT if you have additional questions or concerns.

  5. postmortem Feb 10, 2025, 04:11 PM UTC

    # Postmortem: Platform Latency on December 14, 2024

    # **Summary**

    On December 14, 2024, customers in PROD1 encountered latency and disruptions while using the platform. The issue is being investigated, with some steps already taken and additional measures planned to prevent recurrence.

    # **Root Cause**

    On December 14, 2024, at around 12 PM EST, database read operations temporarily stalled in the PROD1 environment. This led to roughly an hour of latency and reduced performance for some customers. Following the incident, we added alerting and optimized queries to improve system responsiveness and reduce the chance of similar issues occurring in the future. We are actively working with MongoDB to further understand the database issues that occurred during the incident.

    # **Timeline**

    **Dec 14, 2024**

    * 12:00 PM ET: Customers began reporting platform latency and degraded performance.
    * 12:09 PM ET: Incident was declared, and investigation started.
    * 12:27 PM ET: Created a support issue with AWS to rule out AWS issues.
    * 12:30 PM ET: Concluded the incident was PROD1-specific and not due to recent code changes.
    * 12:51 PM ET: Identified this as a database issue: database read operations had temporarily stalled.
    * 1:11 PM ET: PROD1 db environment recovered.

    **Dec 16, 2024–Dec 17, 2024**

    * Engineering began in-depth investigation.
    * Set up alerting to monitor db reads.

    **Dec 18, 2024–Dec 19, 2024**

    * Created scripts to analyze and process db data from the incident.
    * Started collaboration with our database vendor to determine additional optimizations.

    **Dec 19, 2024**

    * 5:23 PM ET: Optimized queries were implemented.

    # **Lessons/Improvements**

    * **\[Done\]** **Apply Identified Optimizations**: Optimized queries that were targeting a large data set.
    * **\[In Progress\]** **Refine Alerting**: Improve database monitoring.
    * **Future Mitigation:**
      * **Reduce Data**: Manage and clean up old data to optimize system performance.
      * **Evaluate System Capacity**: Assess whether scaling resources will improve performance.
      * **Improve System Design**: Migrate impacted collections to a dedicated database for scalability.