Buildkite incident
Degraded performance and increased error rates
Buildkite experienced a major incident on April 8, 2026, affecting GitHub Commit Status Notifications, Email Notifications, and 1 more component, lasting 45 minutes. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Apr 08, 2026, 10:26 PM UTC
We've spotted that something has gone wrong. We're currently investigating the issue, and will provide an update soon.
- monitoring Apr 08, 2026, 10:40 PM UTC
We have identified and fixed the issue. We are monitoring and seeing signs of improvement.
- resolved Apr 08, 2026, 11:12 PM UTC
We experienced an issue that caused a brief increase in errors for the Agent API. This also impacted latency for notifications. All notifications were stored in a queue and processed. Latency is now back to normal.
- postmortem Apr 21, 2026, 06:04 AM UTC
## Service Impact

On April 8th from 22:10 to 22:15 UTC and from 22:33 to 22:38 UTC, all customers would have experienced increased latency when browsing [buildkite.com](http://buildkite.com) and using the Buildkite REST and GraphQL APIs, as well as latency of up to 12 minutes when creating triggered and scheduled builds.

A portion of customers also experienced increased latency and error rates for the Agent API. The impact was not evenly distributed, so some customers were affected more than others. Impacted customers experienced p99 latency of more than 1 second and error rates of 0.5%.

Between 22:10 and 23:03 UTC, a majority of customers experienced notification latency of between 5 and 33 minutes, which includes GitHub commit statuses and webhook delivery, as well as delays of up to 40 seconds in processing incoming webhooks.

## Incident Summary

Our engineers noticed an increase in exceptions and shortly afterwards, at 22:16 UTC, received an alert for high CPU utilization on a single node in a Redis cluster. Upon investigation we found that a rate limiter was hot spotting on a single node within the cluster. Removing this limit at 22:33 UTC mitigated the impact of high Redis CPU utilization: CPU utilization on the affected node fell to 5%, where it had been at 50% prior to the incident.

Further investigation revealed that high load on a replica database was also contributing to high latency. This recovered at 22:37 UTC after a third replica was added.

The rate limit responsible for the high load on Redis had caused hot spotting on a single key because it was applied across all organizations. This limit was introduced in response to a previous incident but hadn’t been required since our work to horizontally shard our Pipelines database had distributed the load and enabled higher scalability in early 2024.
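To illustrate why a limit applied across all organizations can hot spot a single node, here is a minimal sketch of how Redis Cluster maps keys to hash slots (CRC16 of the key, modulo 16384, with hash-tag handling). The key names `ratelimit:global` and `ratelimit:org:<id>`, the three-node layout, and the even split of slots between nodes are assumptions made for illustration; the postmortem does not describe the actual key scheme or cluster topology.

```python
import binascii
from collections import Counter

SLOTS = 16384  # Redis Cluster uses a fixed number of hash slots


def hash_slot(key: str) -> int:
    """Return the Redis Cluster hash slot for a key (CRC16/XMODEM mod 16384)."""
    data = key.encode()
    # Hash tags: if the key contains a non-empty {...} section,
    # only the text between the first '{' and the next '}' is hashed.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % SLOTS


def node_for_slot(slot: int, nodes: int = 3) -> int:
    """Hypothetical layout: slots split evenly into contiguous ranges."""
    return slot * nodes // SLOTS


# One global key: every request touches the same slot, hence the same node.
global_hits = Counter(
    node_for_slot(hash_slot("ratelimit:global")) for _ in range(1000)
)

# One key per organization: requests spread across slots and nodes.
per_org_hits = Counter(
    node_for_slot(hash_slot(f"ratelimit:org:{org_id}")) for org_id in range(1000)
)

print("global key  ->", dict(global_hits))
print("per-org keys ->", dict(per_org_hits))
```

With a single global key, all 1000 simulated requests land on one node, whereas per-organization keys are spread over several nodes. This is only a model of key placement; it says nothing about the CPU cost of each operation.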
## Changes we've made

These are the changes we’ve made in response to this incident:

* Removed the global rate limit that was contributing to a significant proportion of load
* Increased the number of read replicas for our customer information database to 3 and increased the instance size
* We’ve reviewed our remaining rate limits and have confirmed that no other rate limits apply globally to all shards.