Orbee incident

Network Outage for all Orbee services

Orbee experienced a minor incident on February 3, 2022, affecting Platform, Data Collection, and 1 more component, lasting 32m. The incident has been resolved; the full update timeline is below.

Started: Feb 03, 2022, 06:51 PM UTC
Resolved: Feb 03, 2022, 07:23 PM UTC
Duration: 32m
Detected by Pingoru: Feb 03, 2022, 06:51 PM UTC

Affected components

Platform, Data Collection, Data Pipeline, Advertising, Personalization, Email Marketing

Update timeline

monitoring Feb 03, 2022, 06:51 PM UTC

At 10:36 AM on February 3rd, 2022, we detected that network connectivity had gone down for all Orbee services. This was due to an incorrect configuration of some networking resources. We remedied the issue at 10:47 AM the same day. We are monitoring all services as they come back online.
monitoring Feb 03, 2022, 06:54 PM UTC

All services are healthy and available now. We are continuing to monitor to make sure every aspect of our products is working as intended.
resolved Feb 03, 2022, 07:23 PM UTC

All services are stable and available.
postmortem Feb 03, 2022, 09:06 PM UTC

# Executive Summary

Between 6:31 PM and 6:49 PM (UTC) on February 3, 2022, engineers and customers experienced connectivity issues with our services and databases. The event was triggered by the removal of the subnets used in the original route table. We resolved the issue by adding the subnets back to the route table of the VPC. The incident lasted less than 20 minutes, was detected and mitigated immediately, and caused a minor outage.

## Leadup

At 6:30 PM (UTC) on February 3, 2022, a change was made to our VPC routing table in AWS, causing a network outage for all services and databases. The change made every service running within the VPC inaccessible.

## Root Cause Identification

1. A network outage for Orbee services occurred.
2. The default VPC route table was overridden by a new route table when implementing VPC peering.
3. Due to a sense of urgency for project implementation, changes were made to the VPC configuration.
4. Adjustments were made without proper review of changes.
5. This happened because we do not have an established process for implementing all configuration changes in AWS.

## Fault

The misconfiguration of the VPC route table cut off communication between the services and the network. Configuration changes were made to the VPC shortly before the outage, overriding the default route table setup. Due to the urgency of project implementation, the changes were made without proper review.

## Detection

The incident was detected when engineers experienced extended loading times and customers reported issues connecting to the platform. Our Datadog monitoring system showed heightened errors for all services during the incident. The probable cause was pinpointed immediately, and engineers investigated further to find the root cause.

## Mitigation and Resolution

The subnets were added back to the original VPC route table, and connectivity was observed to be restored.

## Lessons Learned

* Practice an iterative approach to root cause analysis and to making critical changes to software and infrastructure.
* Confer with other engineers about potential changes and perform a proper review of changes.
* The VPC route table must contain the subnets used by AWS resources in order to maintain communication throughout the network.
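
As an illustration of the last lesson, the following is a minimal sketch of one possible pre-change check. It is not the process Orbee uses. It lists the subnets in a VPC that have no explicit route table association; such subnets fall back to the VPC's main route table, so replacing or editing that table changes their routing as well. The function names are hypothetical. The sketch assumes the boto3 EC2 client calls `describe_subnets` and `describe_route_tables` with credentials already configured, and it ignores pagination for brevity.

```python
from typing import Iterable


def subnets_without_explicit_route_table(
    subnet_ids: Iterable[str], route_tables: Iterable[dict]
) -> list[str]:
    """Return the subnet IDs that no route table is explicitly associated with.

    `route_tables` is expected in the shape of the "RouteTables" list
    returned by EC2 describe_route_tables.
    """
    associated = {
        assoc["SubnetId"]
        for table in route_tables
        for assoc in table.get("Associations", [])
        if "SubnetId" in assoc  # the main-table association has no SubnetId
    }
    return sorted(set(subnet_ids) - associated)


def audit_vpc(vpc_id: str) -> list[str]:
    """Hypothetical helper: run the check against a live VPC."""
    import boto3  # imported lazily so the pure function works without AWS

    ec2 = boto3.client("ec2")
    vpc_filter = [{"Name": "vpc-id", "Values": [vpc_id]}]
    subnets = ec2.describe_subnets(Filters=vpc_filter)["Subnets"]
    tables = ec2.describe_route_tables(Filters=vpc_filter)["RouteTables"]
    return subnets_without_explicit_route_table(
        (s["SubnetId"] for s in subnets), tables
    )
```

Running `audit_vpc` before and after a routing change, and reviewing any difference with a second engineer, is one way to put the review step described above into practice.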