Mergify incident

Merge queue unexpectedly dequeues pull requests or gets temporarily stuck

Minor Resolved

Mergify experienced a minor incident on July 31, 2023 affecting Engine and GitHub Pull Requests (githubstatus.com), lasting 3h 20m. The incident has been resolved; the full update timeline is below.

Started
Jul 31, 2023, 03:12 PM UTC
Resolved
Jul 31, 2023, 06:32 PM UTC
Duration
3h 20m
Detected by Pingoru
Jul 31, 2023, 03:12 PM UTC

Affected components

Engine, GitHub Pull Requests (githubstatus.com)

Update timeline

  1. identified Jul 31, 2023, 03:12 PM UTC

    The merge queue unexpectedly dequeues some pull requests or gets stuck due to recent changes in GitHub API behavior. We are working on a mitigation.

  2. monitoring Jul 31, 2023, 04:31 PM UTC

    We added a mitigation for those GitHub API changes and are monitoring that everything works as intended.

  3. monitoring Jul 31, 2023, 04:34 PM UTC

    We added a mitigation for those GitHub API changes and are monitoring that everything works as intended.

  4. monitoring Jul 31, 2023, 05:50 PM UTC

    We added a mitigation for those GitHub API changes and are monitoring that everything works as intended.

  5. resolved Jul 31, 2023, 06:32 PM UTC

    This incident has been resolved, and the workaround works as expected. We are still waiting for GitHub support to give us more information about the API behavior changes we observed.

  6. postmortem Aug 02, 2023, 10:33 AM UTC

    All timestamps are in UTC.

    **2023-07-31 11:53,** We received a first support case about the merge queue unexpectedly dequeuing a pull request with the message `Base does not exist`. We started the investigation.

    **2023-07-31 14:53,** We opened an internal incident as our monitoring alerted us about an increasing number of unexpected GitHub API status codes while Mergify created or deleted draft pull requests.

    **2023-07-31 15:12,** We understood that the Git branches we create, and the changes we make on them with the GitHub Git Database API, are not instantly visible through the GitHub Repository and Pulls APIs. The API call that manipulates Git succeeds, but when you fetch the Git resources you just created, GitHub returns that they do not exist.

    The issue was causing unexpected failures in many different code paths. For customers, that could result in two visible issues:

    * pull requests wrongly dequeued with one of these error messages:
        * `No commits between XXXX and YYYY`
        * `Base does not exist`
    * merge queue stuck at step: `This queue is waiting for a batch to fill up.`

    We decided to implement a retry mechanism in the different code paths where this issue occurred.

    **2023-07-31 14:50,** Our first change to mitigate the issue landed in production, and we continued to monitor closely.

    **2023-07-31 14:53,** We enabled full HTTP request/response logging to gather material for GitHub support.

    **2023-07-31 15:12,** We decided to make the incident public.

    **2023-07-31 15:34,** We deployed a second code change to improve the mitigation.

    **2023-07-31 15:49,** We escalated the issue to GitHub support, as we had enough material to show the API breakage.

    **2023-07-31 16:25,** We extracted stats about the number of customers and pull requests impacted. We found that the GitHub API started to report existing Git resources as non-existent on 2023-07-27 at 14:14:10 UTC for some accounts. We discovered later that this was the date of the previous GitHub Pull Request API incident [https://www.githubstatus.com/incidents/l59z35rhzdky](https://www.githubstatus.com/incidents/l59z35rhzdky).

    **2023-07-31 17:34,** A third change was deployed to readjust the retrying strategy. Mergify was always able to detect the issue and retry successfully when it occurred.

    **2023-08-01 07:15,** A new change was deployed to cover a new code path where the issue occurred.

    **2023-08-01 09:51,** GitHub support answered our support ticket, acknowledged that the GitHub API behavior had changed, and escalated to the engineering team.

    **2023-08-01 16:53,** GitHub fixed the issue. We asked for more details and for the reason the GitHub status page had not been updated.

    **2023-08-02 09:36,** GitHub communicated more details about the API behavior change:

    > A feature flag related to spoke caching was turn on earlier that causes replication lag. Following reports of 404 errors occurring for newly created refs, the change was reversed.

    **2023-08-02 10:53,** GitHub confirmed this incident will be part of their next availability report:

    > Thanks for the feedback --I'd pass those on to the relevant team. Hopefully it gets published in the monthly published [availability report](https://github.blog/?s=availability).
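
The postmortem describes a retry mechanism for reads that fail right after a successful write. The sketch below illustrates that general idea only; it is not Mergify's code. The names `ResourceNotFound`, `retry_until_visible`, and `LaggyReadApi`, as well as the attempt count and delays, are assumptions made for the example, and the toy API stands in for a read endpoint that lags behind a write.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ResourceNotFound(Exception):
    """Hypothetical error standing in for a 'does not exist' API reply."""


def retry_until_visible(
    fetch: Callable[[], T],
    attempts: int = 5,
    initial_delay: float = 0.5,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it stops raising ResourceNotFound or attempts run out."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except ResourceNotFound:
            if attempt == attempts:
                raise  # still missing after the last attempt: surface the error
            sleep(delay)  # wait for the read side to catch up
            delay *= backoff  # exponential backoff between attempts
    raise ValueError("attempts must be at least 1")


class LaggyReadApi:
    """Toy stand-in for a read endpoint that lags behind a successful write."""

    def __init__(self, lag_reads: int) -> None:
        self.lag_reads = lag_reads  # number of reads that fail before success
        self.reads = 0

    def get_branch(self, name: str) -> dict:
        self.reads += 1
        if self.reads <= self.lag_reads:
            raise ResourceNotFound(name)
        return {"name": name}


if __name__ == "__main__":
    api = LaggyReadApi(lag_reads=2)
    branch = retry_until_visible(
        lambda: api.get_branch("example-draft-branch"),
        sleep=lambda _: None,  # skip real waiting in this demo
    )
    print(branch, "after", api.reads, "reads")
```

Bounding the number of attempts keeps a genuinely missing resource from blocking the queue forever, while the growing delay gives a lagging replica time to catch up without hammering the API.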