Buildkite incident

Increased dispatch latency

Major · Resolved
Started
Feb 26, 2026, 07:10 PM UTC
Resolved
Feb 27, 2026, 02:30 AM UTC
Duration
7h 20m
Detected by Pingoru
Feb 26, 2026, 07:10 PM UTC

Affected components

Job Queue

Update timeline

  1. identified Feb 26, 2026, 07:10 PM UTC

    Some customers are experiencing increased latency for jobs being assigned to agents. We have identified the cause and are working on mitigations.

  2. monitoring Feb 26, 2026, 07:36 PM UTC

    We're seeing signs of recovery and will continue to monitor.

  3. investigating Feb 26, 2026, 11:41 PM UTC

    We're seeing ongoing latency impact for a subset of customers. Some customers are seeing signs of improvement, but we are continuing to investigate the issue.

  4. monitoring Feb 27, 2026, 12:57 AM UTC

    We've seen recovery for the remaining subset of customers. We will continue to monitor.

  5. resolved Feb 27, 2026, 02:30 AM UTC

    We have seen a full recovery of service, and have a good understanding of the underlying cause. We will publish a post-incident review next week.

  6. postmortem Mar 04, 2026, 02:33 AM UTC

    # Service Impact

    Between approximately 18:00 UTC and 22:50 UTC on February 26, 2026, a subset of customers experienced increased latency when dispatching jobs to agents. Affected customers observed agents sitting idle for several minutes despite having matching jobs waiting in the queue. Job dispatch eventually succeeded, but with significantly elevated latency. The impact was concentrated on specific database shards, but affected customers across multiple shards over the course of the incident.

    # Incident Summary

    A database maintenance task designed to improve job ordering performance was running across all production database shards. This task was itself contributing significant database load, which impacted normal job dispatch and pipeline upload operations. The increased load caused dispatch operations to queue up, resulting in the observed delays in matching jobs to agents. The issue was compounded by a connection pooling service with several containers running on underperforming infrastructure, which reduced the available database throughput.

    Contributing factors:

    * The maintenance task consumed limited database resources, which conflicted with concurrent dispatch operations
    * The task ran simultaneously across all database shards, amplifying the impact
    * A connection pooling service had degraded capacity due to infrastructure imbalance

    # Changes we're making

    * The maintenance task has been paused; in future it will run during low-traffic periods and on individual shards rather than on all shards simultaneously
    * The connection pooling service has been rebalanced to ensure consistent performance
    * We are improving our monitoring and dashboards to enable faster identification of lock contention issues during incidents
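The staggered-maintenance change lends itself to a short illustration. Below is a minimal sketch, in Python, of running a maintenance task one shard at a time inside a quiet window rather than across all shards at once. The shard DSNs, the window boundaries, and the `run_maintenance` body are all hypothetical; the report does not describe Buildkite's actual scheduler.

```python
import datetime
import time

# Hypothetical per-shard connection strings; the real shard layout is not public.
SHARD_DSNS = [
    "postgresql://maint@shard-1/jobs",
    "postgresql://maint@shard-2/jobs",
    "postgresql://maint@shard-3/jobs",
]

# Assumed quiet window, 03:00-06:00 UTC.
LOW_TRAFFIC_START = datetime.time(hour=3)
LOW_TRAFFIC_END = datetime.time(hour=6)


def in_low_traffic_window(now=None):
    """Return True only inside the assumed low-traffic window."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return LOW_TRAFFIC_START <= now.time() < LOW_TRAFFIC_END


def run_maintenance(dsn):
    """Placeholder for the actual maintenance task on one shard."""
    print(f"running maintenance against {dsn}")
    time.sleep(1)  # stand-in for the real work


def main():
    # One shard at a time, and only inside the window, so a slow or
    # contended run degrades a single shard instead of all of them at once.
    for dsn in SHARD_DSNS:
        while not in_low_traffic_window():
            time.sleep(300)  # wait for the window to open
        run_maintenance(dsn)


if __name__ == "__main__":
    main()
```

Serialising the shards trades a longer total maintenance window for a bounded blast radius: a run that turns out to be expensive contends with dispatch on one shard, not every shard at once.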

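Similarly, for the monitoring change: assuming a Postgres-style database (the report does not name the engine), a dashboard or alert for lock contention can be driven by a query like the one below, which uses the built-in `pg_blocking_pids()` function. The DSN and the polling wrapper are illustrative only.

```python
import psycopg2  # assumes a Postgres-style database; the report does not name the engine

# pg_blocking_pids() (Postgres 9.6+) lists the backends holding the locks
# that a given backend is currently waiting on.
BLOCKED_QUERIES_SQL = """
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       now() - query_start   AS waiting_for,
       left(query, 120)      AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0
ORDER BY query_start;
"""


def report_blocked_queries(dsn):
    """Print any backends currently blocked waiting on a lock."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(BLOCKED_QUERIES_SQL)
        for pid, blocked_by, waiting_for, query in cur.fetchall():
            print(f"pid={pid} blocked_by={blocked_by} "
                  f"waiting={waiting_for} query={query!r}")


if __name__ == "__main__":
    # Placeholder DSN; in practice you'd point this at each shard in turn.
    report_blocked_queries("postgresql://readonly@shard-1/jobs")
```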