SAP Conversational AI incident

Outage on /dialog endpoint

Major | Resolved

SAP Conversational AI experienced a major incident on September 5, 2020, lasting 12h 23m. It affected five components: Dialog on US10, plus Channels Connector and NLU Analysis on both EU10 and US10. The incident has been resolved; the full update timeline is below.

Started: Sep 05, 2020, 02:50 AM UTC
Resolved: Sep 05, 2020, 03:13 PM UTC
Duration: 12h 23m
Detected by Pingoru: Sep 05, 2020, 02:50 AM UTC
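
The duration above follows directly from the start and resolution timestamps. As a minimal sketch, the arithmetic can be checked in Python; the helper names `parse_utc` and `format_duration` are illustrative assumptions, not part of any vendor or Pingoru tooling, and the format string simply mirrors how timestamps are written on this page.

```python
from datetime import datetime, timezone

# Timestamp layout as printed on this page, e.g. "Sep 05, 2020, 02:50 AM UTC".
# Note: %b and %p depend on an English/C locale.
STAMP_FORMAT = "%b %d, %Y, %I:%M %p UTC"


def parse_utc(stamp: str) -> datetime:
    """Parse a status-page timestamp into a timezone-aware UTC datetime."""
    return datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)


def format_duration(start: datetime, end: datetime) -> str:
    """Render the gap between two datetimes as '<hours>h <minutes>m'."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}m"


started = parse_utc("Sep 05, 2020, 02:50 AM UTC")
resolved = parse_utc("Sep 05, 2020, 03:13 PM UTC")
print(format_duration(started, resolved))  # 12h 23m
```

Applied to the 07:40 AM UTC update, the same helper gives 4h 50m between the start of the outage and the temporary fix that restored the /dialog endpoint.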

Affected components

[US10] Dialog, [EU10] Channels Connector, [EU10] NLU Analysis, [US10] Channels Connector, [US10] NLU Analysis

Update timeline

  1. investigating Sep 05, 2020, 02:50 AM UTC

    We are currently experiencing outages on our /dialog endpoint. Our team is working on identifying and fixing the issue. We apologize for the inconvenience.

  2. identified Sep 05, 2020, 03:10 AM UTC

    We have identified the root cause of the outage and are working on a fix, which will be implemented as soon as possible.

  3. identified Sep 05, 2020, 04:03 AM UTC

    Our teams are continuing to work on a solution. Our storage for conversation logs ran out of disk space unexpectedly. We will now increase the volume size and restart the corresponding database to bring the /dialog endpoint back up. We will keep you updated on the progress.

  4. identified Sep 05, 2020, 05:47 AM UTC

    The volume increase is ongoing. At this point we are not able to give an estimate of when it will complete. We are investigating options to recover the runtime (/dialog endpoint) before the volume increase concludes. We apologize for this outage and will keep you updated on the progress.

  5. monitoring Sep 05, 2020, 07:40 AM UTC

    We have deployed a temporary fix that makes the runtime (/dialog endpoint) fully operational again. Users will be able to have fully functional conversations with the bots. The temporary fix comes with the following limitations:

    * New conversations will not appear in the conversation logs (Monitor tab)
    * New conversations will not be reflected in the usage metrics (Monitor tab)
    * Conversation logs and usage metrics cannot currently be retrieved. Old conversation logs and usage metrics will become available again once the issue is fully resolved.

    We continue to work on the DB volume increase, which should be finished in about 20 minutes and would remove the aforementioned limitations. We will monitor the system and keep you updated on the progress.

  6. monitoring Sep 05, 2020, 07:44 AM UTC

    We are continuing to monitor for any further issues.

  7. monitoring Sep 05, 2020, 11:18 AM UTC

    We have deployed a fix in production and the system is back to normal. We are monitoring the stability of the platform.

  8. resolved Sep 05, 2020, 03:13 PM UTC

    This incident has been resolved and the platform is fully operational.
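
The 07:40 AM UTC update describes a runtime that keeps serving conversations while conversation logging is unavailable. The vendor does not say how the temporary fix was implemented, so the following is only a sketch of one common way to get that behavior, a "fail-open" log write. Every name here (`DialogRuntime`, `FullDiskLogStore`, `InMemoryLogStore`, `LogStoreUnavailable`) is hypothetical and does not refer to any SAP Conversational AI component or API.

```python
from dataclasses import dataclass, field


class LogStoreUnavailable(Exception):
    """Raised by a (hypothetical) log store when a write cannot be persisted."""


class FullDiskLogStore:
    """Stand-in for a conversation-log database whose volume is full."""

    def append(self, conversation_id: str, message: str) -> None:
        raise LogStoreUnavailable("no space left on log volume")


@dataclass
class InMemoryLogStore:
    """Stand-in for a healthy conversation-log store."""

    entries: list = field(default_factory=list)

    def append(self, conversation_id: str, message: str) -> None:
        self.entries.append((conversation_id, message))


@dataclass
class DialogRuntime:
    """Toy dialog handler that treats logging as best-effort."""

    log_store: object
    dropped_log_writes: int = 0

    def handle(self, conversation_id: str, message: str) -> str:
        reply = f"echo: {message}"  # placeholder for real bot logic
        try:
            self.log_store.append(conversation_id, message)
        except LogStoreUnavailable:
            # Fail open: the user still gets a reply, but this conversation
            # is missing from logs and usage metrics until the store recovers.
            self.dropped_log_writes += 1
        return reply


runtime = DialogRuntime(log_store=FullDiskLogStore())
print(runtime.handle("conv-1", "hello"))  # echo: hello
print(runtime.dropped_log_writes)         # 1
```

The trade-off matches the limitations listed in that update: conversations continue to work, at the cost of gaps in the conversation logs and usage metrics for the period in which the log store could not accept writes.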