Roam incident

Service Outage

Critical Resolved

Roam experienced a critical incident on May 19, 2023, lasting 1h 41m. The incident has been resolved; the full update timeline is below.

Started
May 19, 2023, 03:10 PM UTC
Resolved
May 19, 2023, 04:52 PM UTC
Duration
1h 41m
Detected by Pingoru
May 19, 2023, 03:10 PM UTC

Update timeline

  1. investigating May 19, 2023, 03:10 PM UTC

    We're currently investigating a service outage.

  2. investigating May 19, 2023, 03:10 PM UTC

    We are continuing to investigate this issue.

  3. identified May 19, 2023, 03:33 PM UTC

    Meetings are working again; chat and calendar functionality should be restored shortly.

  4. monitoring May 19, 2023, 03:44 PM UTC

    A fix has been implemented and we are monitoring the results.

  5. resolved May 19, 2023, 04:52 PM UTC

    This incident is resolved. We will post a public postmortem by end of day on Monday, May 22nd, 2023.

  6. postmortem May 24, 2023, 09:19 PM UTC

    ## Summary of Impact

    From 11:02 ET on May 19, 2023 until 11:24 ET, Roam was totally unavailable, and Chat and Calendar functionality wasn't restored until 11:44 ET.

    ## Cause

    A change meant to improve our ability to debug system issues caused performance problems in our backend systems under some usage patterns. Those problems then cascaded to other parts of our backend, leading to a complete outage. This was part of the cause of the outage on May 16th, and a code fix had been made the night of May 18th but failed to be deployed.

    ## Remediation Plan

    1. The root cause fix was deployed during the incident.
    2. We have instituted a more formal SRE process and dedicated senior staff to consistent production monitoring and early issue identification, which we believe would have caught the signs of this before it became an incident at all.
    3. We are improving our deployment process to make it clearer which changes are deployed and to ensure important fixes are deployed in a timely manner.