Mergify experienced a critical incident on November 19, 2021 affecting Engine, lasting 5h 55m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- monitoring Nov 19, 2021, 06:57 AM UTC
The Mergify engine is unable to process most events received.
- monitoring Nov 19, 2021, 06:59 AM UTC
We have fixed the underlying issue and restored the service. We are now monitoring the platform and planning long-term actions to prevent this incident from happening again.
- identified Nov 19, 2021, 07:00 AM UTC
We're implementing long-term fixes.
- identified Nov 19, 2021, 07:11 AM UTC
We are continuing to work on a fix for this issue.
- resolved Nov 19, 2021, 08:24 AM UTC
Everything is back to normal.
- postmortem Nov 19, 2021, 12:16 PM UTC
# 19th November @ 1:00 UTC

* We started receiving more than 5000 events/minute, while our max rate is usually around 1000 events/minute.

# 19th November @ 3:00 UTC

* The high load of incoming events continued, and our Redis database filled up because it had been sized for only 3000 events/minute.
* Event processing got stuck, and some processes started to crash.

# 19th November @ 6:00 UTC

* The engineering team was notified and began investigating the issue and possible remediations.
* The Redis database was replicated for further investigation.
* We increased the Redis database size so it can absorb up to 6000 events/minute.
* The engine started reprocessing events.

# 19th November @ 6:10 UTC

* The abusive user was identified and flagged, and their Mergify installation was suspended. Their account was generating 100 commits per second on a repository, triggering the associated CIs. The abusive repository has also been suspended/deleted on the GitHub side.
* The engine automatically dropped all of that user's events and no longer receives events from it.
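
The postmortem describes a single source producing far more events than the rest of the platform combined. As an illustration only, a per-source sliding-window counter is one way such a sender might be flagged early. This is a minimal sketch and not Mergify's actual implementation: the class name `SourceRateMonitor`, the `record` method, and the threshold of 1000 events per 60 seconds are all assumptions chosen for the example.

```python
from collections import defaultdict, deque


class SourceRateMonitor:
    """Hypothetical helper: counts recent events per source in a sliding window."""

    def __init__(self, max_events: int, window_seconds: float):
        # Assumed limits; real values would depend on the platform's capacity.
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)

    def record(self, source: str, timestamp: float) -> bool:
        """Record one event and return True if the source is over the limit."""
        window = self._events[source]
        window.append(timestamp)
        # Discard timestamps that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while window and window[0] <= cutoff:
            window.popleft()
        return len(window) > self.max_events


# Assumed threshold: more than 1000 events from one source within 60 seconds.
monitor = SourceRateMonitor(max_events=1000, window_seconds=60.0)

# A source emitting 100 events per second, as in the incident, is flagged
# once it has sent more than 1000 events inside the window.
flagged = False
for i in range(1100):
    if monitor.record("abusive-installation", i / 100):
        flagged = True
print(flagged)
```

A sliding window is used here rather than a fixed per-minute bucket so that a burst straddling a minute boundary is still counted as one burst. The memory cost is one timestamp per recent event per source, which a production system would likely need to bound.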