Buildkite Outage History
Buildkite is up right now
Buildkite had 45 outages in the last 2 years totaling 84h 12m of downtime, averaging 1.8 incidents per month.
There were 45 Buildkite outages since June 30, 2025 totaling 84h 12m of downtime. Each is summarised below — incident details, duration, and resolution information.
Delayed Test Engine ingestion processing
Timeline · 2 updates
- monitoring May 15, 2026, 06:51 AM UTC
Ingestion of Test Engine execution data from an internal queue to a data store stalled. It has been resumed and is working through the backlog. Visibility of test executions from the past hour will be delayed for approximately one further hour. This has been a recurring issue; an architectural change is coming soon to eliminate this failure mode.
- resolved May 15, 2026, 07:35 AM UTC
Processing of the backlog is complete.
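As a rough illustration of how a per-incident duration can be derived from the timeline timestamps shown on this page, here is a minimal Python sketch. It assumes duration is measured from the first posted update to the "resolved" update; how this page actually computes its downtime totals is not stated, so treat that as an assumption. The function names `parse_stamp` and `incident_duration` are hypothetical.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the timeline entries, e.g. "May 15, 2026, 06:51 AM UTC".
STAMP_FORMAT = "%b %d, %Y, %I:%M %p UTC"


def parse_stamp(text: str) -> datetime:
    """Parse one timeline timestamp into a naive datetime (all values are UTC)."""
    return datetime.strptime(text, STAMP_FORMAT)


def incident_duration(updates: list[tuple[str, str]]) -> timedelta:
    """Return the time from the first update to the 'resolved' update.

    Assumption: the first update marks the start of the incident. The real
    start may be earlier than the first status post.
    """
    first = parse_stamp(updates[0][1])
    resolved = next(parse_stamp(ts) for status, ts in updates if status == "resolved")
    return resolved - first


# The two updates of the incident above, as (status, timestamp) pairs.
updates = [
    ("monitoring", "May 15, 2026, 06:51 AM UTC"),
    ("resolved", "May 15, 2026, 07:35 AM UTC"),
]
print(incident_duration(updates))  # prints 0:44:00
```

Summing these per-incident values would give a total comparable to the headline downtime figure, though the two may differ if the page uses a different start time for each incident.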
Error rates increasing
Timeline · 2 updates
- investigating May 13, 2026, 03:14 PM UTC
We've spotted that something has gone wrong. We're currently investigating the issue, and will provide an update soon.
- resolved May 13, 2026, 03:34 PM UTC
Additional capacity was added to our Redis caches. This triggered a failover between 15:10 and 15:14 UTC, and there was a spike of errors on the REST and GraphQL APIs. Customers would have seen some errors in the Buildkite UI during this period as well. We have been monitoring the situation since then and things have returned to baseline.
AWS us-east-1 single availability zone outage
Timeline · 11 updates
Delays in job dispatch, webhook processing, and outbound webhooks
Timeline · 5 updates
Jobs not starting on hosted agents and agent-stack-k8s
Timeline · 5 updates
Test Engine: Delayed processing of test result ingestion
Timeline · 3 updates
- investigating May 06, 2026, 03:57 AM UTC
A process writing test results to our Test Engine data store stalled. We've restarted the process and are seeing it catch up. We expect to be fully caught up on the backlog within the next couple of hours.
- monitoring May 06, 2026, 04:21 AM UTC
We've identified the issue, and the system is currently processing the backlog of test executions.
- resolved May 06, 2026, 05:26 AM UTC
Processing of test execution ingestion data has successfully caught up.
Increased latency and error rates
Timeline · 2 updates
- investigating May 04, 2026, 06:02 AM UTC
We're observing increased latency and error rates in the Agent API for a subset of our customers. We're currently investigating and will provide status updates as they become available.
- resolved May 04, 2026, 06:30 AM UTC
An increase in requests led to the API service being temporarily saturated. We have updated rate limits to ensure this doesn't recur and will add further resources if necessary.
Buildkite service disruption
Timeline · 4 updates
Increased dispatch latency and error rates
Timeline · 4 updates
- investigating Apr 28, 2026, 06:00 PM UTC
We're observing increased error rates and dispatch latency for a subset of our customers. We're currently investigating and will provide status updates as they become available.
- identified Apr 28, 2026, 06:26 PM UTC
The issue has been identified and a fix is being implemented.
- monitoring Apr 28, 2026, 06:45 PM UTC
We have mitigated the issue causing increased Hosted Agents dispatch latency and intermittent timeout errors for a subset of customers. We identified abnormal workload activity that was placing elevated load on a supporting service, and have now blocked that activity and applied additional protections. Service metrics have returned to normal, and we are continuing to monitor closely.
- resolved Apr 28, 2026, 07:16 PM UTC
Hosted Agents dispatch has fully recovered from the previously elevated load.
Auth failures with remote MCP server
Timeline · 4 updates
- investigating Apr 22, 2026, 09:19 PM UTC
We are currently investigating reports of authentication failures with the remote MCP server.
- investigating Apr 22, 2026, 10:07 PM UTC
We are continuing to investigate errors when authenticating to the remote MCP server.
- monitoring Apr 22, 2026, 10:44 PM UTC
We have rolled back a change on the remote MCP server that was contributing to authentication failures.
- resolved Apr 22, 2026, 10:59 PM UTC
The issue is resolved.
Delayed processing of test execution
Timeline · 2 updates
- monitoring Apr 22, 2026, 02:32 AM UTC
We noticed a lag in data processing, but our systems are operational and currently working through the backlog. We expect to be fully caught up within the next couple of hours.
- resolved Apr 22, 2026, 05:07 AM UTC
The backlog has been cleared and all systems are fully operational. Thank you for your patience.
Degraded performance and increased error rates
Timeline · 4 updates
Hosted Agents jobs immediately cancelled
Timeline · 3 updates
- investigating Mar 31, 2026, 07:51 AM UTC
We have received reports from customers that they are unable to start builds on Hosted Agents. Their builds are immediately cancelled. We are investigating.
- identified Mar 31, 2026, 08:15 AM UTC
We have identified the issue and are rolling out a fix.
- resolved Mar 31, 2026, 08:34 AM UTC
We have deployed the fix and we have confirmed customer builds are working. If you encounter any further issues please contact support.
504 errors viewing builds
Timeline · 4 updates
- investigating Mar 27, 2026, 07:02 AM UTC
We're seeing an increase in 504 errors when viewing pipeline builds. We're investigating this now.
- identified Mar 27, 2026, 07:18 AM UTC
We've identified a change which we think is the cause of this issue, and we're in the process of reverting it.
- monitoring Mar 27, 2026, 08:08 AM UTC
The deploy to revert this change is complete and builds are loading normally. We will continue to monitor for any other issues.
- resolved Mar 27, 2026, 08:53 AM UTC
The incident is now resolved. We are no longer seeing errors when viewing pipelines.
Increased Delays with Hosted Agents
Timeline · 4 updates
- investigating Mar 25, 2026, 02:26 PM UTC
We are currently investigating this issue.
- identified Mar 25, 2026, 02:30 PM UTC
The issue has been identified as networking-related and is affecting Git Mirror cloning.
- monitoring Mar 25, 2026, 02:59 PM UTC
The networking issue has been resolved, dispatch of Hosted Agents has returned to normal levels, and there are no further issues with Git cloning. We are monitoring the situation.
- resolved Mar 25, 2026, 03:50 PM UTC
This incident is now resolved. We are no longer seeing networking issues with Hosted Agents. These issues caused delays in creating agents for jobs, in resolving external traffic, and in interactions with the cache, which affected Git Mirror cloning.
Increased queue times on hosted agents
Timeline · 3 updates
- investigating Mar 11, 2026, 07:50 PM UTC
We are investigating reports of elevated queue times with hosted agents.
- monitoring Mar 11, 2026, 08:44 PM UTC
We identified increased demand affecting hosted agent queue times. We have added additional capacity and are seeing recovery of hosted agent queue times.
- resolved Mar 11, 2026, 09:14 PM UTC
This incident has been resolved.
Increased error rates from Test Plan API
Timeline · 3 updates
- investigating Mar 10, 2026, 01:21 AM UTC
We've observed test splitting plans periodically timing out and falling back to non-intelligent splitting. Performance appears to be back to normal as of an hour ago. We are continuing to investigate the root cause and solve the underlying issue.
- monitoring Mar 10, 2026, 02:25 AM UTC
We have implemented several mitigations and continue working on fixing the underlying cause. Our team is actively monitoring the situation to ensure stability. We will provide further updates as we make progress on resolving this issue.
- resolved Mar 10, 2026, 09:34 AM UTC
Our mitigations have resolved the elevated latency and likelihood of suboptimal fallback test plans. We have also identified and fixed a blind-spot in our automated alerting, which was previously unable to detect this scenario as an issue. Work continues this week to resolve the underlying performance issue by restructuring how the relevant data is ingested and accessed.
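The failure mode described in this incident, a timing-aware test plan request that times out and degrades to a simpler split, can be sketched as follows. This is a generic illustration in Python, not Buildkite's implementation: `timing_based_plan`, `round_robin_plan`, `plan_with_fallback`, the timeout value, and the sample durations are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable

Plan = list[list[str]]  # one list of test files per parallel node


def round_robin_plan(tests: list[str], nodes: int) -> Plan:
    """Non-intelligent fallback: deal tests out in name order, ignoring timings."""
    plan: Plan = [[] for _ in range(nodes)]
    for i, test in enumerate(sorted(tests)):
        plan[i % nodes].append(test)
    return plan


def timing_based_plan(tests: list[str], nodes: int, durations: dict[str, float]) -> Plan:
    """Greedy longest-first packing: give each test to the least-loaded node."""
    plan: Plan = [[] for _ in range(nodes)]
    load = [0.0] * nodes
    for test in sorted(tests, key=lambda t: durations.get(t, 0.0), reverse=True):
        target = load.index(min(load))
        plan[target].append(test)
        load[target] += durations.get(test, 0.0)
    return plan


def plan_with_fallback(
    fetch_plan: Callable[[], Plan], tests: list[str], nodes: int, timeout_s: float
) -> tuple[Plan, bool]:
    """Return (plan, fell_back). Falls back to round-robin if fetch_plan is too slow."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch_plan)
    try:
        return future.result(timeout=timeout_s), False
    except FutureTimeout:
        return round_robin_plan(tests, nodes), True
    finally:
        pool.shutdown(wait=False)  # do not block the caller on the slow request


tests = ["a_spec", "b_spec", "c_spec", "d_spec"]
durations = {"a_spec": 8.0, "b_spec": 4.0, "c_spec": 3.0, "d_spec": 1.0}


def slow_plan_service() -> Plan:
    time.sleep(0.5)  # stand-in for a plan request that exceeds the timeout
    return timing_based_plan(tests, 2, durations)


plan, fell_back = plan_with_fallback(slow_plan_service, tests, 2, timeout_s=0.05)
print(fell_back, plan)
```

With these sample durations, the timing-based plan balances both nodes at 8 seconds each, while the round-robin fallback yields 11 and 5 seconds. That imbalance is the "suboptimal fallback test plan" cost the resolution note refers to: builds still run, but take longer.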
Elevated ingestion latency for Test Engine
Timeline · 3 updates
- investigating Mar 07, 2026, 12:21 AM UTC
We are investigating the elevated latency issue for Test Engine. Processing the backlog of test executions is taking longer than expected, so elevated ingestion latency remains.
- monitoring Mar 07, 2026, 12:56 AM UTC
We've identified the issue, and the system is currently processing the backlog of test executions.
- resolved Mar 07, 2026, 01:05 AM UTC
Processing of test execution ingestion data has successfully caught up.
Hosted Agents: Job start latency for a small subset of customers
Timeline · 1 update
- resolved Mar 06, 2026, 08:54 AM UTC
Buildkite Hosted Agents experienced degraded start-time performance due to a network partition issue in the Hosted Agents control plane. A small subset of customers may have seen delayed job starts during 04:40-04:50 UTC and 05:06-05:16 UTC. The issue has been resolved and we are monitoring to confirm stability.
Slow artifact uploads
Timeline · 3 updates
- investigating Mar 05, 2026, 10:14 PM UTC
We're investigating slow artifact uploads. This is isolated to artifacts; dispatch remains unaffected.
- monitoring Mar 06, 2026, 08:02 AM UTC
Latency for artifact uploads has remained at normal levels for some time now, and we now have a mitigation in place for a common source of load going forward. We are continuing to monitor.
- resolved Mar 06, 2026, 10:23 AM UTC
With artifact upload latency continuing to be stable, we are resolving this incident.
Latency issues
Timeline · 7 updates
- investigating Mar 03, 2026, 09:51 PM UTC
We're seeing elevated job dispatch latency and Agent API latency across multiple shards. We're investigating.
- investigating Mar 03, 2026, 10:41 PM UTC
We're still experiencing latency issues for the Agent API and job dispatch. We continue to investigate and identify the root cause.
- investigating Mar 03, 2026, 11:21 PM UTC
We continue to experience high latency on some services. We're continuing to identify root causes.
- monitoring Mar 04, 2026, 12:11 AM UTC
We've made some changes to address the issue and are seeing signs of recovery. We continue to monitor the situation.
- monitoring Mar 04, 2026, 01:06 AM UTC
We've seen a small number of unrelated issues, each affecting a subset of customers. Most impact is resolved, but we are continuing to monitor impact for a small number of remaining customers. We are in touch with those customers directly.
- monitoring Mar 04, 2026, 03:29 AM UTC
We continue to observe high latency on isolated infrastructure serving Agent API endpoints for a subset of customers. We are provisioning additional capacity to address this latency, and have informed impacted customers.
- resolved Mar 04, 2026, 05:24 AM UTC
We have completed the provisioning of additional capacity mentioned in our last update, and error rates and response times have returned to normal. This incident is now resolved.
Increased dispatch latency
Timeline · 6 updates
Increased latency for secrets endpoints for some customers
Timeline · 3 updates
- investigating Feb 26, 2026, 12:43 AM UTC
We're observing increased latency on secrets endpoints for a subset of our customers. We're currently investigating and will provide status updates as they become available.
- monitoring Feb 26, 2026, 12:53 AM UTC
We've increased the compute available to the secrets service, and have seen response times return to normal levels.
- resolved Feb 26, 2026, 02:44 AM UTC
Response times have returned to normal. This incident is now resolved.