Buildkite incident
Jobs not starting on hosted agents and agent-stack-k8s
Buildkite experienced a major incident on May 7, 2026 affecting the Agent API, lasting 49 minutes. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating May 07, 2026, 08:48 AM UTC
We've spotted that something has gone wrong. We're currently investigating the issue with new builds not starting.
- investigating May 07, 2026, 09:09 AM UTC
We've identified an issue with the job acquiring endpoint. We're rolling back now. We'll provide the next update in ~20 minutes.
- identified May 07, 2026, 09:24 AM UTC
We're currently seeing recovery at a 50% rate. We'll provide the next update soon.
- resolved May 07, 2026, 09:37 AM UTC
We've reverted a change that caused stale environment variables to be provided to acquire job, which is used by Hosted Agents, agent-stack-k8s and other agent implementations that rely on acquire job.
- postmortem May 15, 2026, 01:15 AM UTC
## Service Impact

On May 7th between 08:14 and 09:25 UTC, customers using [hosted agents](https://buildkite.com/docs/agent/buildkite-hosted), the [k8s stack](https://buildkite.com/docs/agent/self-hosted/agent-stack-k8s) and the Buildkite agent [acquire job feature](https://buildkite.com/docs/agent/cli/reference/start#run-a-single-job) experienced failures when starting a job, resulting in the error message `Missing agent. See: buildkite-agent bootstrap --help`.

## Incident Summary

As part of our efforts to improve the performance of our platform, we shipped a change to how database commits were grouped together. This change inadvertently caused certain job environment variables to be omitted when a job was assigned via the acquire method. These variables are key to our integration with hosted agents, as well as for customers using our Kubernetes stack, and their omission caused any jobs launched via this method to fail.

While we had test coverage that ensured these variables were populated, these tests did not exercise the code path used by acquire job effectively enough to indicate this problem before it was deployed. Furthermore, while parts of the change were placed behind a feature flag, the refactor of the code that caused this bug was not.

Attempts to restore service by rolling back this change were hindered by the current revision being selected for deploy instead of the previous one. The initial rollback was triggered at around 08:40 UTC, but it wasn’t until 09:05 that we realised the mistake and began the rollback to the correct revision. The rollback started deploying at 09:16, and at 09:25 service was fully restored.

## Changes we're making

We have added additional rollback gates, making it easier to identify when an incorrect revision has been selected. Our test suite will be expanded to include contract tests for APIs used by Hosted Agent and the Kubernetes stack.
Additionally, we are configuring Hosted Agent synthetic tests to automatically page on-call engineers when failures occur, improving our response times.
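
For illustration only, a rollback gate of the kind described above could look something like the sketch below. This is not Buildkite's actual tooling: the function `validate_rollback_target`, the `RollbackError` exception and the deploy-history list are hypothetical names chosen for the example. The idea is simply to refuse a rollback whose target is the revision already running, which is the mistake described in the incident summary.

```python
from typing import Sequence


class RollbackError(Exception):
    """Raised when the selected rollback target is not acceptable (hypothetical)."""


def validate_rollback_target(history: Sequence[str], current: str, target: str) -> str:
    """Return `target` if it looks like a valid rollback choice, otherwise raise.

    `history` is an assumed list of previously deployed revision identifiers;
    `current` is the revision that is live right now.
    """
    # Gate 1: rolling "back" to the live revision changes nothing.
    if target == current:
        raise RollbackError(
            f"target {target!r} is the currently deployed revision"
        )
    # Gate 2: only allow revisions that have actually been deployed before.
    if target not in history:
        raise RollbackError(f"target {target!r} was never deployed")
    return target


if __name__ == "__main__":
    deployed = ["rev-101", "rev-102", "rev-103"]  # made-up revision ids
    print(validate_rollback_target(deployed, current="rev-103", target="rev-102"))
    try:
        validate_rollback_target(deployed, current="rev-103", target="rev-103")
    except RollbackError as exc:
        print(f"blocked: {exc}")
```

A check like this would typically run before the deploy is queued, so that selecting the wrong revision fails immediately rather than being noticed after the deploy completes.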