Tenfold incident

Tenfold is experiencing a service disruption

Tenfold experienced a critical incident on January 16, 2024, affecting the API, Dashboard, and Chrome Extension components and lasting 16m. The incident has been resolved; the full update timeline is below.

Started: Jan 16, 2024, 10:52 PM UTC
Resolved: Jan 16, 2024, 11:09 PM UTC
Duration: 16m
Detected by Pingoru: Jan 16, 2024, 10:52 PM UTC

Affected components

API, Dashboard, Chrome Extension

Update timeline

investigating Jan 16, 2024, 10:52 PM UTC

We are experiencing access issues with Tenfold Agent and Tenfold Dashboard. Our engineering teams are involved and their investigation is ongoing.

monitoring Jan 16, 2024, 10:58 PM UTC

Our engineering team has identified the cause of the outage and is working to mitigate it. We are now experiencing latency. We are continuing to monitor and will send another update once the issue is resolved.

resolved Jan 16, 2024, 11:09 PM UTC

Access to Tenfold Agent and Tenfold Dashboard has now been restored.

postmortem Jan 18, 2024, 11:07 PM UTC

**LivePerson Incident #SEV-103, SEV-105, SEV-108, SEV-109 - Preliminary Root Cause Analysis**

_This preliminary assessment is pending further in-depth analysis of the incident to confirm the root cause and corrective actions._

### **Summary**

On Monday, January 8th, 2024, at 3:54 PM EST, LivePerson’s Tenfold Cloud Operations team observed increasing operation latency followed by a database connection timeout on the Tenfold Platform. In parallel, Tenfold customers reported that login to the Tenfold Application and Tenfold Dashboard was failing. Upon notification of the issue, the NOC engaged our on-call team, created a war room, and started an investigation.

During our investigation, it was determined that the primary platform database was unresponsive. The evidence pointed to corrupt records in the primary database node. We modified the service to fail over incoming requests to a secondary database server. Attempts to resume the impacted services were unsuccessful, and outside resources were engaged. After collaboration between Engineering and third-party infrastructure providers, it was determined that the affected server needed to be updated due to an unforeseen misconfiguration of the Domain Name System.

At 9:05 PM EST, it was confirmed that the failover had succeeded and new incoming requests were being processed correctly. At 9:15 PM EST, after monitoring incoming server requests and confirming that services had returned to normal working conditions, the issue was marked as resolved.

Subsequent outages have been identified as related to the original January 8th incident, including disruptions in service on January 9th, January 16th, and January 17th, 2024. While the affected database node has remained out of service, the remaining nodes have experienced similar stability issues. Ongoing efforts are being undertaken to restore operational stability. The efforts include in-depth consultation with the database vendor and infrastructure provider along with internal LivePerson architecture groups.

### **Customer Impact**

Tenfold Customers were unable to perform the following actions:

* Log in to the Tenfold Application or Tenfold Dashboard
* Perform any call actions
* Perform any CRM actions
* Access Analytics data

### **Corrective Actions**

To resolve the January 8th incident, LivePerson performed a failover to the secondary database node utilized by the underlying login process for the LivePerson Agent Connector for Salesforce and subsequently updated DNS configurations, returning services to normal conditions. The subsequent incidents have required similar failovers and restarts to restore service with the remaining operational nodes.

To resolve the January 9th incident, LivePerson performed a failover from the new primary node to the new secondary database node. Additionally, attempts were made to return to a 3-node configuration, without success. Service was restored with the new configuration of both a primary and secondary database node in a pooled architecture.

To resolve the January 16th, 2024 incident, LivePerson performed a restart of the primary database node and removed the secondary node from service (not replication). Service restarts of all platform microservices were required to clear the issue and restore normal service.

To resolve the January 17th, 2024 incident, LivePerson performed a restart of the primary database node and platform microservices. Normal operation was restored after these actions.

In the January 17th service window, LivePerson performed a configuration change to bring the database architecture back to a known good operating mode. This operation was successful, with stable service observed and improved latency performance.

The implementation of an additional backup database node is in progress; the node will be added to the production environment after thorough testing and according to the change management policy. Planning has begun for architectural simplification and platform upgrades to add stability to the service.