Buildkite experienced a minor incident on September 23, 2025 affecting the REST and GraphQL APIs, lasting 5h 44m. The incident has been resolved; the full update timeline is below.
### Affected components

REST API, GraphQL API

### Update timeline
- investigating Sep 23, 2025, 07:57 PM UTC
We are investigating reports of increased latency on our REST and GraphQL APIs.
- investigating Sep 23, 2025, 08:23 PM UTC
We have applied mitigations and are still investigating the cause of the increased latency.
- investigating Sep 23, 2025, 09:04 PM UTC
We are continuing to investigate this issue.
- monitoring Sep 23, 2025, 09:48 PM UTC
We have seen recovery on the impacted services and are investigating possible root causes.
- investigating Sep 23, 2025, 11:29 PM UTC
We've seen a recurrence of the latency issue and are continuing to investigate.
- monitoring Sep 24, 2025, 12:42 AM UTC
The p95 latency on our REST and GraphQL APIs has spiked several times, reaching as high as 26 seconds, and we saw an increase in error rates during this period. We're continuing to implement mitigations, and latency and error rates have returned to baseline over the last 30 minutes. We are continuing to monitor the issue.
- resolved Sep 24, 2025, 01:41 AM UTC
We have implemented further mitigations and seen error rates and latency return to acceptable levels. We will continue to investigate the issue to better understand the causes.
- postmortem Nov 10, 2025, 03:07 AM UTC
### Service Impact

Between August 16th and September 24th, 2025, some customers experienced elevated API latency and intermittent errors when using Buildkite’s REST and GraphQL APIs. On September 23rd at 18:04 UTC the impact became severe and universal; alarms were raised and several customers reached out to us directly. The impact varied by customer and endpoint, with common pain points including creating builds and fetching build lists.

### Incident Summary

A combination of increased API usage and changing workload patterns drove higher-than-usual network throughput. We discovered unexpected and undocumented throughput limits on our database proxy service, which during the incident led to intermittent request latency and timeouts. By increasing the CPU and memory resources available to our proxy service, we also raised its network throughput limits, and error rates and latencies began to decrease. We then discovered that this throughput limit had been the major contributing factor to the inconsistent latencies some customers had experienced since August 16th.

### Changes we’re making

The nature of this issue meant that customers experienced intermittent latency spikes, usually driven by the amount of data flowing between our applications and our databases. This contributed significantly to the time between the initial report, confirmation, and resolution of this incident, so we’re implementing better customer-specific observability and monitoring.

Limitations in our compute platform mean that these network throughput limits are not something we can reliably detect or alert on. We’re evaluating moving our proxy service to an alternative compute solution that provides better observability, and we’ve pre-emptively scaled up other network-sensitive services as a preventative measure. We are also optimizing requests that return large result sets to reduce load and response sizes where practical.
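The kind of customer-specific latency monitoring described above could be sketched roughly as follows. This is a minimal, hypothetical example: the function names, the 26-second threshold, and the nearest-rank percentile method are assumptions for illustration, not Buildkite's actual tooling.

```python
def p95(samples):
    """95th percentile of latency samples (seconds), nearest-rank method."""
    ordered = sorted(samples)
    # nearest-rank index of the 95th percentile
    idx = max(0, round(0.95 * len(ordered)) - 1)
    return ordered[idx]

def spiking_customers(latencies_by_customer, threshold_s=26.0):
    """Return {customer: p95} for customers whose p95 exceeds the threshold."""
    return {
        customer: p95(samples)
        for customer, samples in latencies_by_customer.items()
        if samples and p95(samples) > threshold_s
    }

# Hypothetical data: "acme" is mostly fast but has severe intermittent spikes
latencies = {
    "acme": [0.2] * 90 + [30.0] * 10,
    "globex": [0.3] * 100,
}
print(spiking_customers(latencies))  # flags only "acme"
```

Grouping per customer matters here because the impact was intermittent and customer-specific: fleet-wide averages can stay near baseline while one customer's p95 is spiking.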