Spruce Health incident

[RESOLVED] Delayed inbound SMS, voicemails and call events

Minor, Resolved

Spruce Health experienced a minor incident on October 18, 2022 affecting Phone Call Routing and SMS Routing, lasting 1h 47m. The incident has been resolved; the full update timeline is below.

Started: Oct 18, 2022, 05:05 PM UTC
Resolved: Oct 18, 2022, 06:53 PM UTC
Duration: 1h 47m
Detected by Pingoru: Oct 18, 2022, 05:05 PM UTC

Affected components

Phone Call Routing, SMS Routing

Update timeline

  1. identified Oct 19, 2022, 12:25 AM UTC

    From 10:05am PT to 11:53am PT on October 18, 2022, voicemails, inbound SMS and inbound call events reached providers' Spruce inboxes with a delay. Each delayed event included an indication in the message itself of how long it had been delayed. There was no impact to inbound calls, outbound calls, secure message exchanges, video calls, email or fax. Spruce identified this issue in response to customer complaints rather than through the proactive monitoring in place for the system in general.

  2. resolved Oct 19, 2022, 12:26 AM UTC

    The issue was resolved and the system was fully functional again at around 11:53am PT.

  3. postmortem Oct 19, 2022, 12:26 AM UTC

    # Summary

    Messages were delayed because requests to our transcription provider to transcribe voicemails were timing out. The timeout on uploading a recording to the provider was not correctly tuned, so messages waiting to be processed by a set of application workers built up into a backlog. Messages were still being processed during this time, but with a delay because of the communication issues. Spruce was made aware of the issue through multiple customer complaints, and the engineering team started investigating as soon as the issue was escalated.

    # **Action items to mitigate future impact**

    * Add an alarm on the application worker responsible for processing transcriptions and SMS. Note that we already had alarms in place for all but one of the workers. This will help ensure that, should an issue like this arise again, we are notified as soon as possible.
    * Fine-tune the timeout used when communicating with the transcription provider, to prevent a build-up in the event of future communication errors.
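The link between an untuned timeout and a growing backlog can be illustrated with some rough arithmetic. The sketch below is not Spruce's implementation: the function name and every number in it (arrival rate, worker count, processing time, timeout values, share of uploads that hang) are hypothetical and chosen only for illustration. It assumes a worker is occupied until the upload either finishes or times out.

```python
def backlog_growth_per_minute(arrivals_per_min, workers, normal_s,
                              timeout_s, timeout_fraction):
    """Messages added to the backlog each minute (0.0 if workers keep up).

    Hypothetical model: each worker handles one message at a time, and a
    given fraction of messages blocks the worker until the timeout fires.
    """
    # Average time a worker spends on one message.
    mean_service_s = ((1 - timeout_fraction) * normal_s
                      + timeout_fraction * timeout_s)
    # Messages the whole worker pool can finish per minute.
    capacity_per_min = workers * 60 / mean_service_s
    return max(0.0, arrivals_per_min - capacity_per_min)


# Illustrative values only: 120 messages/min, 10 workers, 2 s per normal
# message, and half of the uploads hanging until the timeout.
long_timeout = backlog_growth_per_minute(120, 10, 2, 60, 0.5)
short_timeout = backlog_growth_per_minute(120, 10, 2, 5, 0.5)

print(f"60 s timeout: backlog grows by {long_timeout:.1f} messages/min")
print(f"5 s timeout: backlog grows by {short_timeout:.1f} messages/min")
```

Under these assumed numbers, the long timeout leaves the pool unable to keep up with arrivals, while the shorter one keeps capacity above the arrival rate. An alarm on the age or size of each worker's queue would surface the first case without waiting for customer reports.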