Buildkite incident

Delays in job dispatch, webhook processing, and outbound webhooks

Buildkite experienced a major incident on May 7, 2026 affecting Job Queue and Webhook Notifications, lasting 1h 28m. The incident has been resolved; the full update timeline is below.

Started: May 07, 2026, 10:45 PM UTC
Resolved: May 08, 2026, 12:13 AM UTC
Duration: 1h 28m
Detected by Pingoru: May 07, 2026, 10:45 PM UTC

Affected components

Job Queue, Webhook Notifications

Update timeline

investigating May 07, 2026, 10:45 PM UTC

We've spotted that something has gone wrong. We're currently investigating the issue, and will provide an update soon.
identified May 07, 2026, 11:13 PM UTC

We've identified the issue and are working on applying mitigations. At this time we can confirm that inbound webhooks, outbound webhooks, and notifications are delayed.
monitoring May 07, 2026, 11:50 PM UTC

We are now monitoring the incident. Most customers have recovered, and the remainder are showing signs of recovery.
resolved May 08, 2026, 12:13 AM UTC

All customer workloads have now recovered.
postmortem May 21, 2026, 05:53 AM UTC

## Service Impact

On 2026-05-07 between 22:22 UTC and 00:14 UTC the next day (~1 hour 52 minutes), Buildkite customers experienced delays across several Pipelines features:

* Inbound webhook processing was delayed.
* Outbound webhook delivery and build notifications were delayed.
* Job dispatch was delayed for jobs queued during the window.

No data was lost. Webhooks and notifications were queued for retry and delivered after the underlying database recovered.

Customers with high webhook volumes or time-sensitive build dispatches were most affected. Some customers continued to see lingering latency for a short period after the underlying database recovered, while queued work drained.

## Incident Summary

The Pipelines product depends on a shared database that brokers inbound API and webhook requests across all Pipelines shards. At 22:22 UTC, this database's writer instance began saturating under contention on internal locks and could no longer keep up with the volume of work directed at it. Because every Pipelines shard depends on this single shared database, the slowdown affected all customers.

Contributing factors:

* The shared database's writer was provisioned with very little spare capacity, leaving no headroom to absorb a load spike.
* The background workers that process incoming webhooks share a queue with the workers that dispatch builds, so back-pressure on webhook processing cascaded into delayed builds.
* A small number of slow external webhook endpoints held worker capacity open inside long-running transactions, amplifying load on the shared database. This is a pattern we have seen before.
* Our detection at the database tier was not specific enough to catch the saturation directly; we detected the problem ~12 minutes after impact began, via a downstream signal on Sidekiq queue latency.

Mitigation: we provisioned a substantially larger database replica, failed over to it, and re-enabled processing shard-by-shard. We then resized a second replica to match, restoring our ability to fail over again if needed.

We initially suspected the incident might be related to a separate availability-zone event experienced by our cloud provider. We have since ruled this out: the cloud provider's availability-zone event was declared after this incident was resolved, and the provider has confirmed the affected database instance was not impacted by that event.

## Changes we're making

* **Decoupling inbound webhook ingestion from the shared database.** We are currently working on shard-isolating inbound webhooks. This risk had already been identified and mitigation work had begun, but it had not been fully rolled out before the load spike hit. Once this work is complete, the underlying databases will be one step closer to being broken up.
* **Operator controls to pause queues during incidents.** We have shipped admin controls to pause individual Sidekiq sets. We’ve also improved our remediation tooling to give our on-callers more control over the ingestion pipeline for managing back-pressure on this specific shared database.
* **Database-tier alerting.** We are adding alerts on database write latency and lock-wait activity, so we detect saturation at the source rather than via downstream queue latency.
* **Capacity normalisation.** We are normalising autoscaling capacity across shards, so a spike to a single shard's queue can be absorbed by autoscaling rather than cascading.
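
As a rough illustration of the database-tier alerting described above, the following Python sketch fires when write latency or lock-wait activity stays above a threshold for several consecutive samples. It is not Buildkite's implementation: the `DbSample` fields, the `saturation_alert` function, and every threshold value are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class DbSample:
    """One scrape of the writer instance (hypothetical fields)."""
    write_latency_ms: float  # e.g. commit latency over the scrape interval
    lock_waiters: int        # sessions waiting on a lock at scrape time


def saturation_alert(
    samples: Iterable[DbSample],
    max_write_latency_ms: float = 50.0,  # assumed threshold
    max_lock_waiters: int = 20,          # assumed threshold
    sustained: int = 3,                  # consecutive breaches required
) -> Optional[int]:
    """Return the index of the sample at which the alert fires, or None."""
    streak = 0
    for i, sample in enumerate(samples):
        breached = (
            sample.write_latency_ms > max_write_latency_ms
            or sample.lock_waiters > max_lock_waiters
        )
        # Reset the streak on any healthy sample so brief blips do not page.
        streak = streak + 1 if breached else 0
        if streak >= sustained:
            return i
    return None


# Made-up history: one healthy scrape, three breaching scrapes, then recovery.
history = [
    DbSample(8.0, 1),
    DbSample(60.0, 2),
    DbSample(12.0, 30),
    DbSample(75.0, 41),
    DbSample(9.0, 0),
]
fired_at = saturation_alert(history)
```

Requiring several consecutive breaches trades a little detection delay for fewer false pages. A check of this kind reads signals from the database itself, so it does not depend on queue latency rising first.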