Kustomer incident

[Knowledge Base] Not Accessible [PROD 2]

Minor Resolved

Kustomer experienced a minor incident on November 8, 2025 affecting Knowledge base, lasting 1h 38m. The incident has been resolved; the full update timeline is below.

Started
Nov 08, 2025, 07:06 AM UTC
Resolved
Nov 08, 2025, 08:44 AM UTC
Duration
1h 38m
Detected by Pingoru
Nov 08, 2025, 07:06 AM UTC

Affected components

Knowledge base

Update timeline

  1. investigating Nov 08, 2025, 07:06 AM UTC

    Kustomer is aware of an event affecting our Knowledge Base, rendering it inaccessible [PROD 2]. Our team is currently working to identify the cause of this issue in an effort to implement a resolution. Please expect additional updates within the next 30 minutes, and reach out to Kustomer support at [email protected] if you have additional questions or concerns.

  2. investigating Nov 08, 2025, 07:39 AM UTC

    Our Engineering team is still actively investigating the issue alongside our cloud provider to identify the underlying cause and implement a resolution. We’ll provide an update within the next 30 minutes.

  3. investigating Nov 08, 2025, 08:11 AM UTC

    We are continuing to investigate the issue and are working with our cloud provider to identify the cause in an effort to implement a resolution. You can expect additional updates within the next 30 minutes. Please reach out to Kustomer Support at [email protected] for any further questions or updates in the meantime.

  4. monitoring Nov 08, 2025, 08:19 AM UTC

    A fix has been deployed and services have been restored. We’re closely monitoring system performance to confirm stability.

  5. resolved Nov 08, 2025, 08:44 AM UTC

    Kustomer has resolved the event affecting PROD 2 that caused the Knowledge Base to be inaccessible. To resolve this issue, our team has deployed a fix. After careful monitoring, our team has determined that all affected areas are now fully restored. Please reach out to Kustomer support at [email protected] if you have additional questions or concerns.

  6. postmortem Nov 14, 2025, 06:38 PM UTC

    ## **Summary**

    During a routine deployment, a configuration issue caused an SSL certificate mismatch that temporarily affected access to some knowledge bases in one production region.

    ## **Root Cause**

    The incident was caused by a configuration change that introduced a new SSL certificate missing the required Subject Alternative Names (SANs) for two production regions. These SANs are necessary for secure connections between CloudFront and the load balancers serving knowledge base traffic. Without the correct entries, HTTPS requests to those regions failed TLS validation, resulting in users being unable to access their knowledge bases.

    The underlying issue stemmed from how the certificate was selected and applied during the deployment process. The new certificate shared the same primary domain as others in production but lacked the additional SANs specific to these regions. This mismatch caused CloudFront to reject requests at the TLS handshake stage, effectively disrupting service availability for knowledge bases in both affected regions.

    ## **Timeline**

    **Nov 7, 2025**

    * **8:48 PM EST** Release went live to all regions
    * **9:37 PM EST** Alarm triggered for low invocations in our Prod2 region for KB indexing

    **Nov 8, 2025**

    * **12:33 AM EST** Incident triggered by on-call engineer
    * **2:45 AM EST** Certificates identified as the likely culprit of this incident
    * **3:15 AM EST** Fix applied and system fully recovered

    ## **Lessons/Improvements**

    * **Reduce noisy alarms** - Multiple noisy alarms of the same type at the same time for normally functioning systems caused us to miss this alarm.
      * **Status**: In progress. We are actively tuning our alarms to reduce noise and ensure actual alarms are actionable.
    * **Standardize certificates across all regions** - Standardize our certificates for all regions to ensure that we do not have custom certificate configurations which could cause incidents in select regions.
      * **Status**: In progress. A fix has been applied, but further work to standardize is ongoing.
    * **More KB logging** - Our knowledge base front-end has limited logging in place to help us identify issues faster.
      * **Status**: Planned. Implement new logging like we have in our other front-end platforms to ensure faster alerting.
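
The root cause in the postmortem, a certificate missing region-specific SANs, is the kind of gap a pre-deployment check might catch. The following is a minimal sketch of such a check, not Kustomer's actual tooling: the function names `san_matches` and `missing_sans` and all hostnames are hypothetical, and the wildcard handling assumes the common rule that `*` covers exactly one leftmost label.

```python
def san_matches(pattern: str, hostname: str) -> bool:
    """Return True if a certificate SAN entry covers the hostname.

    Handles exact matches and a single leftmost-label wildcard
    such as *.example.com (assumed matching rule for this sketch).
    """
    pattern = pattern.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if pattern == hostname:
        return True
    if pattern.startswith("*."):
        # The wildcard stands in for one label only, so everything after
        # the hostname's first label must equal the rest of the pattern.
        _, _, host_rest = hostname.partition(".")
        return host_rest != "" and host_rest == pattern[2:]
    return False


def missing_sans(cert_sans, required_hosts):
    """List the required hostnames that no SAN entry on the certificate covers."""
    return [
        host
        for host in required_hosts
        if not any(san_matches(san, host) for san in cert_sans)
    ]


# Hypothetical data: a certificate that covers one region's origin but not another's.
cert_sans = ["kb.example.com", "*.prod1.kb.example.com"]
required = [
    "kb.example.com",
    "origin.prod1.kb.example.com",
    "origin.prod2.kb.example.com",
]

print(missing_sans(cert_sans, required))  # ['origin.prod2.kb.example.com']
```

In practice the SAN list would be read from the candidate certificate itself, for example from the `subjectAltName` field of the dictionary returned by Python's `ssl.SSLSocket.getpeercert()`, and a non-empty result would block the deployment.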