Simon Data incident

Temporary impact on some data refreshes, scheduled one-time flows, and event-triggered flows

Simon Data experienced a major incident on July 27, 2022 affecting Simon Data Pipes and Simon Data Syncs, lasting 7h 13m. The incident has been resolved; the full update timeline is below.

Started: Jul 27, 2022, 11:39 AM UTC
Resolved: Jul 27, 2022, 06:52 PM UTC
Duration: 7h 13m
Detected by Pingoru: Jul 27, 2022, 11:39 AM UTC

Affected components

Simon Data Pipes, Simon Data Syncs

Update timeline

identified Jul 27, 2022, 11:39 AM UTC

We have identified that this issue is due to unplanned maintenance by AWS. We have escalated to our technical support team at the provider for immediate assistance and will update this status soon.
monitoring Jul 27, 2022, 12:23 PM UTC

A fix has been identified and is currently being implemented. We will provide further updates as the fix is rolled out.
monitoring Jul 27, 2022, 02:17 PM UTC

The fix has been implemented and we are starting to recover pipes of affected client orgs.
monitoring Jul 27, 2022, 02:21 PM UTC

Approximately 20% of affected client organizations' pipes have been fully restored. We will continue to update here as we restore the remainder of affected pipes.
resolved Jul 27, 2022, 06:52 PM UTC

This incident has been resolved.
postmortem Aug 01, 2022, 07:51 PM UTC

# **Overview**

Last week on July 27th, Amazon Web Services (AWS) was performing routine maintenance on a piece of infrastructure that Simon uses to power segmentation. During the maintenance window, AWS' routine failed, and this left parts of Simon's segmentation functionality temporarily unavailable. This had ramifications for use of several Simon Data platform interfaces as well as on-time delivery of downstream flows & journeys.

The attached file documents the steps we took to react to and remediate the issue while in parallel escalating the emergency to AWS, who ultimately was able to identify the source of the issue and fix it. AWS acknowledged that this bug was introduced by AWS and that the remediation they implemented on our infrastructure is a permanent solution.

# **Executive Summary**

In the early morning EST of July 27th 2022, a maintenance routine on an Amazon-managed service that Simon uses to power segmentation left that service unavailable. Simon was immediately paged and our on-call support reacted. Unfortunately, the most direct and fastest-acting remediations were not possible due to persistent hardware failure from Amazon, requiring Simon to fall back on reconstructing data on new services. In the end, Amazon rectified the bug in their maintenance program, and that remediation took effect before Simon finished reconstructing data on new services, rendering that effort unnecessary.

No data or messages were lost, but data refreshes and active or planned campaigns were delayed while Simon performed an emergency migration to a functional database. The core segmentation product, and the products that depend upon it (like unified contact view and selecting sample contacts in content), were also unavailable until the migration completed.

# **Root Cause**

Simon Data splits the data our platform uses by kind into different databases, and keeps business logic in other databases. Each database comes equipped with multiple, redundant nodes. We routinely exercise node failover logic during upgrades, maintenance, and failures. Maintenance on the active Amazon Web Services (AWS) databases that Simon Data uses is frequent, and unplanned outages are rare because maintenance processes with AWS are typically predictable.

On July 27th, 2022, AWS’ maintenance routine introduced a subtle bug that prevented its maintenance from finishing. As a result, the segmentation database was being reported as still under maintenance. The Simon Data on-call team was alerted and immediately reacted; however, contacting our technical support at AWS took longer than expected, and the Simon on-call team only received one downstream page.

While the Simon team performed an emergency data reconstruction / migration to healthy infrastructure, the Simon team escalated through multiple teams at AWS. Despite the parallel escalations, it took longer than normal for AWS to treat this incident as an emergency situation instead of an unhealthy situation. Once it was recognized as an emergency situation, AWS resolved it quickly, and this happened before Simon’s internal reconstruction / migration completed.

AWS has provided their post mortem to Simon Data: “_temporary tables were inadvertently created [by AWS] during a maintenance period that caused a corruption of metadata and prevented the cluster from powering on_”

# **Impact Analysis**

All customers using one of our specific segmentation databases saw interruptions in service in Unified Contact View (UCV), Segmentation, and content loading tools in Simon.

# **Remediation Plan**

## **Quicker Time to Detection & Alerting**

Simon Data has reviewed the process we use for publishing incidents to our status page. We have cut out the manual steps that delayed the alert to our customers longer than expected.

## **Quicker Time to Resolution & Recovery**

Simon Data has revisited the design and usage of our segmentation databases and added a new process to validate that maintenance has been conducted correctly. If maintenance doesn’t finish as intended, immediate migration to another segmentation database will occur to stem interruption of service for Simon Data customers.
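
As an illustration only, the validate-then-migrate idea described in this remediation could be sketched in Python as follows. This is not Simon Data's or AWS's actual tooling: the `ClusterStatus` type, the state labels `"maintenance"` and `"available"`, the one-hour default window, and the database names are all hypothetical assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class ClusterStatus:
    """Hypothetical snapshot of one segmentation database's state."""
    name: str
    state: str  # assumed labels: "available" or "maintenance"
    maintenance_started: Optional[datetime] = None


def maintenance_overran(status: ClusterStatus, now: datetime,
                        max_window: timedelta) -> bool:
    """True if the database is still in maintenance past the allowed window."""
    if status.state != "maintenance" or status.maintenance_started is None:
        return False
    return now - status.maintenance_started > max_window


def choose_database(primary: ClusterStatus, standbys: List[ClusterStatus],
                    now: datetime,
                    max_window: timedelta = timedelta(hours=1)) -> str:
    """Keep the primary unless its maintenance overran; otherwise pick
    the first standby that reports itself as available."""
    if not maintenance_overran(primary, now, max_window):
        return primary.name
    for standby in standbys:
        if standby.state == "available":
            return standby.name
    raise RuntimeError("no healthy segmentation database available")


# Example: the primary has been "in maintenance" for three hours.
now = datetime(2022, 7, 27, 11, 39, tzinfo=timezone.utc)
stuck = ClusterStatus("seg-a", "maintenance", now - timedelta(hours=3))
standby = ClusterStatus("seg-b", "available")
target = choose_database(stuck, [standby], now)
```

In this sketch, a database that merely reports "still under maintenance" for longer than the allowed window is treated as a failed maintenance run, which is the signal that would trigger migration to another segmentation database.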