Omnivore incident

Agents Offline

Critical · Resolved

Omnivore experienced a critical incident on May 29, 2024, lasting 28 minutes, that affected the API, Webhooks, and other components. The incident has been resolved; the full update timeline is below.

Started: May 29, 2024, 09:47 PM UTC
Resolved: May 29, 2024, 10:15 PM UTC
Duration: 28m
Detected by Pingoru: May 29, 2024, 09:47 PM UTC

Affected components

API, Webhooks, Control Panel, Marketplace

Update timeline

  1. identified May 29, 2024, 09:47 PM UTC

    Beginning at approximately 20:30 UTC, an issue caused many locations to enter a degraded or offline state. We have identified the issue and are working to resolve it.

  2. monitoring May 29, 2024, 10:01 PM UTC

    A fix has been implemented and location statuses have returned to normal. We will continue to monitor.

  3. resolved May 29, 2024, 10:15 PM UTC

    Affected locations have returned to online status and are operating normally.

  4. postmortem Jun 11, 2024, 02:53 PM UTC

    ## Overview

    On May 29, 2024, Olo’s Omnivore Platform experienced agent degradation between 20:30 UTC and 21:55 UTC. Some API calls failed during this time, and some agents went offline at 21:30 UTC.

    ## What Happened

    On May 29, 2024, during a routine instance resizing operation for our Connect service cluster, our configuration management system misidentified the IP addresses of the newly deployed instances, causing them to be bootstrapped incorrectly. This resulted in elevated error rates for Omnivore API calls beginning at 20:30 UTC, with 32% of connected Omnivore agents becoming degraded. At 21:30 UTC, additional API calls began to fail, causing 6% of connected agents to go fully offline. We initiated an accelerated rollback of the change, which fully restored service by 21:55 UTC.

    ## Next Steps

    * Improve the provisioning process to detect and alert on this kind of misconfiguration earlier, before the new instances are put into rotation.
    * Create additional alerting around agent errors to improve investigation speed.
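
The first next step, catching a misidentified IP address before new instances enter rotation, could be approached with a pre-rotation check along the following lines. This is only a minimal sketch and not Olo's actual tooling: `InstanceRecord`, `find_ip_mismatches`, and `safe_to_rotate` are hypothetical names, and it assumes the provisioning pipeline can obtain both the IP recorded by configuration management and the IP each instance reports for itself.

```python
from dataclasses import dataclass
from ipaddress import ip_address


@dataclass(frozen=True)
class InstanceRecord:
    """Hypothetical view of one newly provisioned instance."""
    instance_id: str
    configured_ip: str  # IP recorded by the configuration management system
    reported_ip: str    # IP the instance reports for itself (assumed available)


def find_ip_mismatches(records):
    """Return the IDs of instances whose configured and reported IPs differ."""
    mismatched = []
    for record in records:
        # Parse both values so a malformed address raises ValueError
        # instead of silently passing the check.
        if ip_address(record.configured_ip) != ip_address(record.reported_ip):
            mismatched.append(record.instance_id)
    return mismatched


def safe_to_rotate(records):
    """Allow rotation only if there is at least one instance and no mismatch."""
    return bool(records) and not find_ip_mismatches(records)
```

A provisioning step could call `safe_to_rotate` on the batch of new instances and, if it returns `False`, raise an alert listing the output of `find_ip_mismatches` instead of adding the instances to the load balancer.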