Buildkite incident

Increased latency and error rates

Major, Resolved

Buildkite experienced a major incident on October 20, 2025, affecting Web, GitHub Commit Status Notifications, Agent API, REST API, Job Queue, and SCM Integrations, and lasting 7h 44m. The incident has been resolved; the full update timeline is below.

Started
Oct 20, 2025, 02:18 PM UTC
Resolved
Oct 20, 2025, 10:02 PM UTC
Duration
7h 44m
Detected by Pingoru
Oct 20, 2025, 02:18 PM UTC

Affected components

Web, GitHub Commit Status Notifications, Agent API, REST API, Job Queue, SCM Integrations

Update timeline

  1. investigating Oct 20, 2025, 02:18 PM UTC

    We're observing increased latency and error rates due to an inability to scale up. We're currently investigating and will provide status updates as they become available.

  2. investigating Oct 20, 2025, 03:17 PM UTC

    We're currently working on mitigations for scaling up, but at this stage service is degraded, with increased latency across the API, notifications, and build starts.

  3. identified Oct 20, 2025, 05:04 PM UTC

    We're continuing to see increased latency across many of our subsystems due to an ongoing AWS outage. We are unable to launch new tasks in us-east-1 and are investigating potential mitigations to restore service.

  4. identified Oct 20, 2025, 05:36 PM UTC

    We have implemented mitigations and see an improvement in latency for the Agent API. Latency and error rates continue to be elevated across the REST, GraphQL, and Web services, and notifications are delayed. We are continuing to work through mitigations and will provide an update in 1 hour.

  5. identified Oct 20, 2025, 06:54 PM UTC

    Our mitigations improved latency for the Agent API, although elevated latency and error rates are still visible across other services. AWS is reporting some recovery in us-east-1, and we are seeing further improvements in our services. We are actively monitoring the situation and implementing mitigations where possible.

  6. monitoring Oct 20, 2025, 07:29 PM UTC

    We're seeing slow recovery of all our services. Latency and error rates are decreasing across the board. We are continuing to monitor the situation.

  7. monitoring Oct 20, 2025, 08:33 PM UTC

    We're seeing signs of recovery across the board. Error rates have reduced to baseline levels. Latency is trending towards baseline. We continue to actively monitor our services and the AWS reports on us-east-1 impact.

  8. monitoring Oct 20, 2025, 09:04 PM UTC

    Latency and error rates have all returned to baseline levels. We have seen full recovery of our services. We continue to actively monitor our services and the AWS reports on us-east-1 impact to ensure stability is maintained.

  9. resolved Oct 20, 2025, 10:02 PM UTC

    Our services have been fully recovered for the last hour, so we are marking this as resolved. Our engineers will continue to monitor AWS and will keep services scaled up to prevent impact from any additional failures.

  10. postmortem Oct 24, 2025, 08:22 PM UTC

    We have published a post-incident review here: [https://buildkite.com/resources/blog/post-incident-review-for-20th-october-2025/](https://buildkite.com/resources/blog/post-incident-review-for-20th-october-2025/)