Spruce Health incident
Spruce experienced issues with contact and conversation search-related activities.
Spruce Health experienced a minor incident on March 15, 2023 affecting Web App and Mobile Apps, lasting 1d 5h. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- identified Mar 15, 2023, 07:26 PM UTC
We are investigating an issue with contact and conversation search-related activities.
- identified Mar 15, 2023, 07:27 PM UTC
We are continuing to work on a fix for this issue.
- identified Mar 15, 2023, 07:31 PM UTC
We are continuing to work on a fix for this issue.
- identified Mar 15, 2023, 08:29 PM UTC
We continue to work to reduce the intermittent errors while searching for contacts or loading contact lists. Note that there is no impact to phone calls, SMS routing, loading of the inbox, exchanging secure messages, or video calling. Bulk messages will continue to send during this time, albeit with delays, since bulk messages operate on contact lists. We will post updates here as we make progress on this performance issue.
- identified Mar 15, 2023, 10:31 PM UTC
We continue to investigate this issue. Note that some bulk message operations may take a long time to complete or may get stuck in a particular state, since bulk message operations face similar errors when querying contact lists.
- identified Mar 16, 2023, 01:16 AM UTC
We have identified a potential cause of the intermittent failures in the search cluster. We are working to distribute the data more evenly across the cluster to increase overall performance and reduce the error rate. To recap, due to the errors throughout the day:
- Searching for conversations, messages, or contacts may have failed
- Bulk messages may have taken longer than usual to complete
- Newly created contacts, conversations, and messages may not have shown up when searching
- Updates to contacts may not have been searchable

We will continue working through the evening to reindex the data so that it is distributed more evenly across the cluster, and will keep this incident up to date as we make progress. We're sorry for the inconvenience this is causing to your workflows.
- identified Mar 16, 2023, 02:11 AM UTC
Indexing of data has now caught up, so successful searches for contacts, conversations, and messages will return up-to-date results. We are continuing to work on distributing the data more evenly across the cluster. We are not currently experiencing poor performance or intermittent errors, though this is likely due to decreased overall traffic in the system given the time of day. That said, we continue to work on reducing the likelihood of this problem persisting into the business day tomorrow.
- identified Mar 16, 2023, 02:58 PM UTC
The redistribution of data in the cluster is still in progress (note that this happens in the background with minimal impact on searching and on indexing of new data). We have been closely monitoring the situation throughout the night. We also increased the capacity of the cluster to accommodate the redistribution of data and to ensure that we are in better shape for today. About 20% of the redistribution remains; we believe it will bring a long-standing improvement to overall performance. The metrics so far look healthy, with no signs of poor performance or an increased error rate. We will report back here once the redistribution completes, or if we see any signs of degraded performance.
- resolved Mar 16, 2023, 10:23 PM UTC
The system is fully operational as confirmed by our active monitoring over the last 10 hours. Summary: from 10:15am PT to 6:50pm PT on March 15, the following actions on Spruce experienced degraded performance:
- Searching for conversations and contacts either took a long time or failed
- Contact filters frequently failed to load when clicked into
- When starting a new conversation, contact suggestions either took a long time or failed to load, making it challenging to start new conversations
- Bulk actions (messaging, tagging, deleting) took longer than expected to complete, but eventually completed
- Newly created contacts, conversations, and messages during this time period were not searchable; the new items became searchable from 6:50pm PT onwards
- Successful searches for contacts and conversations may have returned stale results, where an update to a contact or conversation was not reflected in search results; the affected items were eventually updated in search to reflect their latest versions

There was no impact to calls, SMS, Fax, Secure Messaging, or Email during this time. We will post a postmortem to this incident soon.
- postmortem Apr 14, 2023, 08:21 PM UTC
## **Summary + customer impact**

The Spruce system experienced degraded performance from 10:15am PT to 6:50pm PT on March 15, 2023. During this time:

* Searching for contacts or conversations resulted in errors, slowness, or stale results
* Searching contacts to create conversations resulted in errors or slowness
* Bulk messages were delayed, either stuck in the Processing state for a while or simply taking longer to complete
* Opening contact lists likely returned errors or experienced slowness returning results
* Contact exports were delayed, either stuck in the Processing state for a while or simply taking longer to complete

The degraded performance was caused by inefficient data distribution across the search cluster, where one of our data nodes experienced heavy load and was unable to process any new indexing and search operations.

## **Analysis**

The Spruce engineering team immediately reacted to the monitoring alarms that were triggered and began investigating. The issue was caused by one of the data nodes storing a significantly larger amount of data than the other data nodes in the cluster, which resulted in that node taking longer to process requests. Requests piled up over time, putting the node under heavier load, until the request queue was exhausted and new requests were rejected. The engineering team reviewed and analyzed the cluster configuration and performance metrics in detail, and took the following steps to resolve the issue:

* Adjusted the data distribution strategy so that data is allocated equally across the data nodes in the search cluster. The old strategy was inefficient because the size of the stored data had grown significantly over time and it could no longer support the demands of indexing and search operations.
* Increased the number of data nodes and reallocated the data equally across the nodes.
This was done in the background with minimal impact on the searching and indexing of new data. With the new configuration, stored data is uniformly distributed across all nodes. During the period of degraded performance, the other data nodes in the cluster were fully operational, and any requests routed to them were processed successfully. All indexing requests were successfully queued and processed after the issue was resolved.

## **Action items**

* Create additional monitoring alerts for search and indexing operation latency, so that potential issues can be detected early.
* Review the cluster configuration and performance metrics in detail every 3-6 months.
* Improve the clean-up strategy for unused data to reduce space usage.
* Create an internal strategy with clearly defined steps that can be taken to quickly troubleshoot and resolve issues like this one, thus minimizing the impact on clients.
* Increase general knowledge of the search cluster and its configuration within the engineering team.
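The postmortem does not name the search technology or its internals, so as a rough illustration only, the failure mode described above (one node holding a disproportionate share of the data, its request queue filling up, and new requests being rejected) can be sketched with a small simulation. All names and numbers below are hypothetical, not Spruce's actual configuration.

```python
# Hypothetical sketch: requests for a shard go to the node that holds it.
# Each node drains a fixed number of requests per tick and queues the rest,
# rejecting new requests once its queue is full.
from collections import Counter

QUEUE_CAPACITY = 100      # max queued requests per node (assumed)
THROUGHPUT_PER_TICK = 10  # requests a node can process per tick (assumed)

def simulate(shard_to_node, requests_per_shard_per_tick, ticks):
    """Return per-node counts of rejected requests after `ticks` rounds."""
    queues = Counter()
    rejected = Counter()
    for _ in range(ticks):
        # Incoming requests are routed to whichever node holds each shard.
        for node in shard_to_node.values():
            for _ in range(requests_per_shard_per_tick):
                if queues[node] < QUEUE_CAPACITY:
                    queues[node] += 1
                else:
                    rejected[node] += 1
        # Each node drains up to its per-tick throughput.
        for node in set(shard_to_node.values()):
            queues[node] = max(0, queues[node] - THROUGHPUT_PER_TICK)
    return rejected

# Skewed layout: node "a" holds 4 of 6 shards, takes in more than it can
# drain, and eventually rejects requests; nodes "b" and "c" stay healthy.
skewed = {0: "a", 1: "a", 2: "a", 3: "a", 4: "b", 5: "c"}
# Balanced layout: two shards per node keeps every queue drained.
balanced = {0: "a", 1: "a", 2: "b", 3: "b", 4: "c", 5: "c"}

print(simulate(skewed, requests_per_shard_per_tick=4, ticks=50))
print(simulate(balanced, requests_per_shard_per_tick=4, ticks=50))
```

This mirrors both observations in the analysis: the overloaded node alone rejects requests while its peers process theirs successfully, and redistributing the shards evenly removes the rejections entirely.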