Omnivore incident

API Outage

Major Resolved

Omnivore experienced a major incident on November 24, 2023, affecting the API, Webhooks, and Control Panel components and lasting 4h 36m. The incident has been resolved; the full update timeline is below.

Started
Nov 24, 2023, 09:55 PM UTC
Resolved
Nov 25, 2023, 02:31 AM UTC
Duration
4h 36m
Detected by Pingoru
Nov 24, 2023, 09:55 PM UTC

Affected components

API, Webhooks, Control Panel

Update timeline

  1. investigating Nov 24, 2023, 09:55 PM UTC

    We are currently investigating an issue that is affecting the Omnivore API.

  2. monitoring Nov 24, 2023, 10:25 PM UTC

    We have identified the issue and implemented a fix. We are monitoring systems to ensure stability. API and webhook traffic is flowing normally.

  3. resolved Nov 25, 2023, 02:31 AM UTC

    All systems have been functioning normally with API and Webhooks flowing normally for several hours. We will follow up with a postmortem by 12/1/2023.

  4. postmortem Nov 29, 2023, 10:23 PM UTC

    # Overview

    On November 24, 2023, Olo's Omnivore API experienced a disruption between 21:17 UTC and 22:12 UTC. During this time, all API operations except Add Payment, Open Ticket, and Submit Order were failing, and 25% of Omnivore-related webhooks experienced delayed delivery.

    # What Happened

    On November 24, 2023, Olo experienced a disruption to the Omnivore API and related webhook delivery, caused by a failure in the automated process for creating new Omnivore API instances. As traffic to the Omnivore API increased, its auto-scaling system was unable to add capacity to meet it. As a result, at 21:17 UTC all API operations except Add Payment, Open Ticket, and Submit Order began to fail, and 25% of Omnivore-related webhooks began to experience delayed delivery.

    We discovered that some of our package dependencies had been updated by their maintainers to require a newer runtime version than the one available in our deployment pipeline. This caused the bootstrapping process to fail for the new instances needed to handle current traffic levels. With this identified, we implemented and deployed a fix to remove the failing dependencies from the API's critical path, allowing the system to resume scaling out additional API instances and restoring service at 22:12 UTC.

    # Next Steps

    * We have already made improvements to our alerting to automatically detect and mitigate similar issues before they become critical.
    * We will complete our in-progress migration of all Omnivore services into our newer hosting environment, which removes these dependencies as a failure point.
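The failure mode described in the postmortem (a dependency starting to require a newer runtime than the deployment pipeline provides) can be caught before new instances try to bootstrap. The following is a minimal sketch of such a pre-deployment check, not Olo's actual tooling: the postmortem does not name the runtime or the packages involved, so the version numbers, the package names (`pkg-a`, `pkg-b`, `pkg-c`), and the helper functions here are all hypothetical.

```python
from typing import Dict, List, Tuple

Version = Tuple[int, ...]


def parse_version(text: str) -> Version:
    """Turn a dotted version string such as '3.8' into a padded tuple (3, 8, 0)."""
    parts = [int(part) for part in text.strip().split(".")]
    # Pad to three components so that '3.8' and '3.8.0' compare as equal.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def incompatible_dependencies(runtime: str, minimum_runtime: Dict[str, str]) -> List[str]:
    """Return the dependencies whose minimum required runtime is newer than `runtime`."""
    available = parse_version(runtime)
    return sorted(
        name
        for name, needed in minimum_runtime.items()
        if parse_version(needed) > available
    )


# Hypothetical inputs: the runtime version baked into the deployment pipeline,
# and the minimum runtime each resolved dependency declares.
pipeline_runtime = "3.8"
declared_minimums = {"pkg-a": "3.7", "pkg-b": "3.9", "pkg-c": "3.10"}

blocked = incompatible_dependencies(pipeline_runtime, declared_minimums)
if blocked:
    # In a real pipeline this would fail the build or raise an alert
    # instead of letting instance bootstrapping fail during scale-out.
    print("Dependencies requiring a newer runtime:", ", ".join(blocked))
```

Running a check like this whenever dependencies are resolved, rather than only at instance launch, would surface the incompatibility at build time instead of during a traffic spike. Pinning dependency versions is a complementary safeguard.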