Buildkite incident

Degraded performance and increased error rates

Major · Resolved
Started
Apr 08, 2026, 10:26 PM UTC
Resolved
Apr 08, 2026, 11:12 PM UTC
Duration
45m
Detected by Pingoru
Apr 08, 2026, 10:26 PM UTC

Affected components

  • GitHub Commit Status Notifications
  • Email Notifications
  • Agent API
  • Slack Notifications
  • Webhook Notifications

Update timeline

  1. investigating Apr 08, 2026, 10:26 PM UTC

    We've spotted that something has gone wrong. We're currently investigating the issue, and will provide an update soon.

  2. monitoring Apr 08, 2026, 10:40 PM UTC

    We have identified and fixed the issue. We are monitoring and seeing signs of improvement.

  3. resolved Apr 08, 2026, 11:12 PM UTC

    We experienced an issue that caused a brief increase in errors for the Agent API and increased latency for notifications. All notifications were queued and have since been processed. Latency is now back to normal.

  4. postmortem Apr 21, 2026, 06:04 AM UTC

    ## Service Impact

    On April 8th, from 22:10 to 22:15 UTC and from 22:33 to 22:38 UTC, all customers would have experienced increased latency browsing [buildkite.com](http://buildkite.com) and using the Buildkite REST and GraphQL APIs, as well as latency of up to 12 minutes creating triggered and scheduled builds. A portion of customers also experienced increased latency and error rates for the Agent API; the impact was not evenly distributed, so some customers were affected more severely than others. Affected customers saw p99 latency of more than 1 second and error rates of 0.5%. Between 22:10 and 23:03 UTC, a majority of customers experienced notification latency of between 5 and 33 minutes, including GitHub commit statuses and webhook delivery, as well as delays of up to 40 seconds in processing incoming webhooks.

    ## Incident Summary

    Our engineers noticed an increase in exceptions and shortly afterwards, at 22:16 UTC, received an alert for high CPU utilization on a single node in a Redis cluster. Upon investigation we found a rate limiter was hot-spotting on a single node within the cluster. Removing this limit at 22:33 UTC mitigated the high Redis CPU utilization, with utilization on the affected node falling to 5% from the 50% seen during the incident. Further investigation revealed that high load on a replica database was also contributing to high latency; this recovered at 22:37 UTC after a third replica was added. The rate limit responsible for the high Redis load had caused hot-spotting on a single key because it was applied across all organizations. The limit was introduced in response to a previous incident, but it had not been required since our work to horizontally shard our Pipelines database distributed the load and enabled higher scalability in early 2024.
    ## Changes we've made

    These are the changes we've made in response to this incident:

    * Removed the global rate limit that was contributing to a significant proportion of load.
    * Increased the number of read replicas for our customer information database to 3 and increased the instance size.
    * Reviewed our remaining rate limits and confirmed that no other rate limits apply globally to all shards.

Looking to track Buildkite downtime and outages?

Pingoru polls Buildkite's status page every 5 minutes and alerts you the moment it reports an issue — before your customers do.

  • Real-time alerts when Buildkite reports an incident
  • Email, Slack, Discord, Microsoft Teams, and webhook notifications
  • Track Buildkite alongside 5,000+ providers in one dashboard
  • Component-level filtering
  • Notification groups + maintenance calendar
Start monitoring Buildkite for free

5 free monitors · No credit card required