Omnivore incident

API and Webhooks Intermittent Unavailability

Omnivore experienced a major incident on January 28, 2023 affecting API and Webhooks, lasting 1h 22m. The incident has been resolved; the full update timeline is below.

Started: Jan 28, 2023, 03:08 AM UTC
Resolved: Jan 28, 2023, 04:31 AM UTC
Duration: 1h 22m
Detected by Pingoru: Jan 28, 2023, 03:08 AM UTC

Affected components

API, Webhooks

Update timeline

investigating Jan 28, 2023, 03:08 AM UTC

We are currently investigating instability affecting our API and Webhooks.
investigating Jan 28, 2023, 03:09 AM UTC

We are continuing to investigate this issue.
monitoring Jan 28, 2023, 04:00 AM UTC

We have identified the problem and have implemented a fix. The API and webhooks are returning to normal. We will continue to monitor.
resolved Jan 28, 2023, 04:31 AM UTC

This incident has been resolved.
postmortem Feb 14, 2023, 10:11 PM UTC

## Executive Summary

On January 27, 2023, between 02:37 and 03:40 UTC, ECS instances could not be deployed in our environment because the GPG key had changed for a package used on these instances. This caused a cascading outage of Omnivore’s API, with a period of total downtime between 02:50 and 03:40.

## Background and Root Cause

Omnivore uses Amazon Web Services Elastic Container Service (ECS) for some of our services. ECS instances are deployed as needed and built using the Chef configuration management tool. When Chef runs on an instance, it installs the software packages that the instance needs.

Typically, these packages come from repositories maintained by the operating system. A few packages, however, come from repositories maintained by the software companies that develop the applications. These repositories are secured using GnuPG (GPG) keys.

Software companies change their GPG keys from time to time for security reasons. When this happens and the instance still has the old key, the package is not installed and an error message is displayed. When this type of error happens during a Chef run, the build of the ECS instance does not complete, and the extra capacity that was needed is not deployed. This is what caused the outage.

## Timeline

All times are in UTC.

- 02:37: Omnivore infrastructure team receives an alert that ECS instances could not be deployed.
- 02:45: Omnivore infrastructure team attempts to manually raise the number of ECS instances.
- 02:50: Omnivore infrastructure team receives an alert that the Omnivore API is failing.
- 02:56: Omnivore infrastructure team pages the service team to alert them to the issue.
- 03:19: Omnivore infrastructure team discovers that Chef is not able to deploy ECS instances.
- 03:22: Omnivore infrastructure team attempts to run Chef manually to force deployment.
- 03:40: Omnivore infrastructure team notes that Chef is failing due to a bad GPG key.
- 03:40: Omnivore infrastructure team downloads and installs the new GPG key, allowing Chef to run to completion.

## Action Items

1. Change the process for Chef deployment to include a fresh download of the GPG key on every run.
2. Consider using a “Golden Image” over deploying with Chef.
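
The first action item could look roughly like the Python sketch below. It is illustrative only and is not Omnivore's actual Chef change: the function name `refresh_key`, the injected `fetch` callable, and the destination path are all hypothetical. The idea is to download the key before each run, replace the stored copy only when its contents differ, and report whether it changed so that a key rotation shows up as a logged event instead of a failed build.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Callable


def refresh_key(fetch: Callable[[], bytes], dest: Path) -> bool:
    """Fetch a repository signing key and store it at dest.

    `fetch` is a hypothetical callable returning the raw key bytes
    (for example, a wrapper around an HTTPS download).
    Returns True when the stored key was created or replaced.
    """
    new_key = fetch()
    if not new_key:
        raise ValueError("fetched key is empty")

    # Compare digests so an unchanged key is left untouched.
    new_digest = hashlib.sha256(new_key).hexdigest()
    old_digest = (
        hashlib.sha256(dest.read_bytes()).hexdigest() if dest.exists() else None
    )
    if new_digest == old_digest:
        return False

    # Write to a temporary file in the same directory, then rename it,
    # so an interrupted write never leaves a half-written key behind.
    dest.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(new_key)
        os.replace(tmp_name, dest)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return True
```

A provisioning step could call `refresh_key` before installing packages and log a warning whenever it returns `True`. Because `fetch` is passed in, the download source can be swapped or stubbed in tests without touching the comparison and write logic.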