postmortem Mar 10, 2026, 05:52 PM UTC
On Friday night, we deployed a small infrastructure change that required a reboot of our database. During this reboot, we encountered a bug in our event processor code that caused events to go unprocessed until the service was restarted. To compound this, our monitoring platform had invalidated the alarm we had in place to monitor errors with our event processor. As a result, we remained unaware of the issue until Monday morning, when we promptly fixed it by rebooting the service containing the event processors. At that point, the platform needed to catch up by processing all the events that had built up over the weekend, which completed on Monday night.

We will be taking a multi-pronged approach to ensuring this doesn’t happen again:

1. We have fixed our event monitors, and will be testing them every release going forward to ensure they remain operational. We are also following up with our monitoring platform to ensure future monitoring will not be disrupted.
2. We are augmenting our event processing code in the following ways:
   1. We have switched to a more robust event processing pattern that is not vulnerable to the bug that caused consumers to stop polling after a database reboot.
   2. We have increased the processing throughput such that if this _were_ to happen again, our consumers would catch up much more quickly than they did during this incident.

We apologize for the service disruptions caused by this incident.
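
For illustration only: the failure mode described above, where consumers stop polling after a database reboot, is commonly avoided by catching transient errors inside the polling loop and retrying with backoff, so that a brief outage never ends the loop. The Python sketch below shows that general idea under stated assumptions. It is not the actual fix, and `TransientError`, `run_consumer`, `poll`, and `handle` are hypothetical names introduced for the example.

```python
import time


class TransientError(Exception):
    """Hypothetical error raised while the database is briefly unreachable."""


def run_consumer(poll, handle, max_polls, base_delay=0.5, max_delay=30.0,
                 sleep=time.sleep):
    """Poll for events and keep polling through transient failures.

    poll() returns a list of events or raises TransientError.
    handle(event) processes a single event.
    max_polls bounds the loop so the sketch terminates; a real consumer
    would typically loop indefinitely and pause between empty polls.
    Returns the number of events processed.
    """
    failures = 0
    processed = 0
    for _ in range(max_polls):
        try:
            events = poll()
        except TransientError:
            # Back off exponentially, but never exit the loop on a
            # transient error such as a database reboot.
            failures += 1
            sleep(min(max_delay, base_delay * 2 ** (failures - 1)))
            continue
        failures = 0  # a successful poll resets the backoff
        for event in events:
            handle(event)
            processed += 1
    return processed
```

The `sleep` parameter is injectable so that the backoff behavior can be exercised in a test without real delays, which is one way to check on every release that the loop still recovers.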