WalkMe incident

US accounts - Editor/Insights/Console is down - Partial Workstation and ActionBot outage

Critical · Resolved

WalkMe experienced a critical incident on March 6, 2024, affecting Designer (Editor), Admin Space, and other components, lasting 3h 54m. The incident has been resolved; the full update timeline is below.

Started
Mar 06, 2024, 04:47 PM UTC
Resolved
Mar 06, 2024, 08:42 PM UTC
Duration
3h 54m
Detected by Pingoru
Mar 06, 2024, 04:47 PM UTC

Affected components

Designer (Editor), Admin Space, DAP Console, ActionBot, Menu Web, Insights dashboards

Update timeline

  1. investigating Mar 06, 2024, 02:09 PM UTC

    The Editor, Insights, and Console are down for US accounts. We are investigating and will update shortly.

  2. investigating Mar 06, 2024, 02:41 PM UTC

    We are continuing to investigate this issue.

  3. investigating Mar 06, 2024, 03:23 PM UTC

    We are continuing to investigate this issue.

  4. investigating Mar 06, 2024, 03:54 PM UTC

    We are continuing to investigate this issue.

  5. investigating Mar 06, 2024, 04:47 PM UTC

    Our Development team is investigating and working on resolving the issue as soon as possible. We will report back with an update shortly.

  6. investigating Mar 06, 2024, 05:31 PM UTC

    We are continuing to investigate this issue.

  7. investigating Mar 06, 2024, 06:22 PM UTC

    We are continuing to investigate this issue.

  8. investigating Mar 06, 2024, 07:32 PM UTC

    We are continuing to investigate this issue.

  9. investigating Mar 06, 2024, 07:48 PM UTC

    Our Development team has been able to restore most services; however, the Editor's publish functionality is still impacted. We are continuing to work hard to resolve this and will have an update soon.

  10. monitoring Mar 06, 2024, 08:24 PM UTC

    WalkMe services have been restored, and our team is monitoring to ensure all items are resolved.

  11. resolved Mar 06, 2024, 08:42 PM UTC

    All WalkMe components impacted by today's outage have been restored. Thank you for your patience as our team worked to ensure WalkMe functionality was fully restored. Our initial findings indicate that the cause was an issue with the WalkMe Authorization and Authentication endpoints in our US services. Please expect this incident to be updated with a root cause analysis, including all relevant details, in the coming days.

  12. postmortem Mar 19, 2024, 07:55 PM UTC

    ### Description of Incident

    * On Mar 6, 2024, 14:09 UTC, WalkMe experienced an elevated level of service errors in our Design Time API Gateways.
    * This issue primarily affected our WalkMe builders, who may have experienced difficulties when connecting to the WalkMe Editor, Console, or Insights.
    * After urgent investigation by the WalkMe engineering team, and in order to avoid further disruption, a set of recovery steps was performed on the databases and load balancers.
    * Once the recovery steps were completed, the Gateway services returned to normal, and Design Time APIs were fully functional by Mar 6, 2024, 20:42 UTC.

    ### Scope of Incident

    * This issue primarily affected our WalkMe builders, who may have experienced difficulties when connecting to the WalkMe Editor, Console, or Insights.

    ### Root Cause Analysis

    * A new product configuration caused some of our underlying services to enter an unstable state. This triggered a ripple effect on additional internal services.

    ### WalkMe Corrective Action

    * WalkMe performed a rollback from the latest backup (17:40 UTC) to the database snapshot of 13:20 UTC.

    ### Ongoing Commitments

    * To uphold WalkMe's commitment to providing reliable and uninterrupted service, we are actively monitoring our systems to ensure this issue does not recur. In addition:
      * WalkMe will add additional designated tests for these specific components and configurations.
      * WalkMe will increase the observability, monitoring, and alerting on these specific components and configurations.
      * WalkMe will apply additional load protection layers to core services.