Kindly incident

Problems with handover to live chat

Kindly experienced a notice incident on June 28, 2024 affecting Handover to human chat, lasting 3h 50m. The incident has been resolved; the full update timeline is below.

Started: Jun 28, 2024, 08:28 AM UTC
Resolved: Jun 28, 2024, 12:19 PM UTC
Duration: 3h 50m
Detected by Pingoru: Jun 28, 2024, 08:28 AM UTC

Affected components

Handover to human chat

Update timeline

monitoring Jun 28, 2024, 08:28 AM UTC

This morning there was an incident with a sub-system that handles handover of chats from bot to human agents. The issue has been handled and everything appears to be working as intended now, but we are still monitoring to make sure. The incident only affected handover for certain clients using this particular service, not all clients. Conversations with chatbots were unaffected; only users attempting to contact a human would have noticed the problem.

resolved Jun 28, 2024, 12:19 PM UTC

This incident has been resolved.

postmortem Jun 28, 2024, 02:27 PM UTC

In the morning hours of June 28th, as traffic started picking up, one of our services began having issues serving web requests. This service is used for handover between the main chatbot service and external customer service systems, and lets users talk to human agents. The main chatbot was unaffected and continued to respond as normal; only communication between end users and agents was affected.

We received an automatic alert about errors soon after the service started to fail and began to investigate. One of the error messages that the webserver displayed suggested that it might be out of memory, so we started by investigating this and trying to increase memory for the service. Unfortunately, this was a misleading error message that led us to spend our time on the wrong solution. The actual problem was latency: the webserver _was_ responding to requests, but too _slowly_, and the connections were timing out. Once we figured this out, we were able to scale up the service to have more capacity, and it recovered quickly and went back to handling requests in milliseconds instead of minutes.

To avoid this happening again, we have increased the capacity that the service can scale up to automatically, so that it can respond dynamically to increased traffic. We’re also looking into whether we can add more alerts that tell us explicitly if server response time starts to increase significantly, so that we get an accurate signal that indicates the actual problem. We are also looking into whether it’s possible to change the misleading error message from the webserver, to avoid being misled again if something similar happens in the future.
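
As an illustration of the kind of response-time alert mentioned above, here is a minimal Python sketch of a sliding-window check on the 95th-percentile request duration. It is not a description of Kindly's actual monitoring setup: the class name `LatencyMonitor`, the window size, the minimum sample count, and the 2-second threshold are all hypothetical values chosen for the example.

```python
from collections import deque
from statistics import quantiles


class LatencyMonitor:
    """Hypothetical sliding-window monitor that flags a high p95 response time."""

    def __init__(self, window_size=100, p95_threshold_s=2.0, min_samples=20):
        # Keep only the most recent `window_size` request durations (in seconds).
        self.samples = deque(maxlen=window_size)
        self.p95_threshold_s = p95_threshold_s  # assumed threshold, tune per service
        self.min_samples = min_samples

    def record(self, duration_s):
        """Store the duration of one completed request."""
        self.samples.append(duration_s)

    def p95(self):
        """Return the 95th-percentile duration, or None if there is too little data."""
        if len(self.samples) < self.min_samples:
            return None
        # quantiles(..., n=20) returns 19 cut points; the last one is the 95th percentile.
        return quantiles(self.samples, n=20)[-1]

    def should_alert(self):
        """True when the p95 duration in the current window exceeds the threshold."""
        value = self.p95()
        return value is not None and value > self.p95_threshold_s


if __name__ == "__main__":
    monitor = LatencyMonitor()
    for _ in range(50):
        monitor.record(0.05)  # healthy: requests served in milliseconds
    print("alert while healthy:", monitor.should_alert())
    for _ in range(50):
        monitor.record(90.0)  # degraded: requests taking over a minute
    print("alert while degraded:", monitor.should_alert())
```

A check like this measures slowness directly, so it would fire on rising response times even when the webserver's own error messages point elsewhere, for example at memory.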