Strigo experienced a critical incident on August 9, 2021, lasting 19 minutes. The incident has been resolved; the full update timeline is below.
Update timeline
- resolved Aug 11, 2021, 09:20 PM UTC
On August 9th, starting at 3:23 PM UTC, our main app was down for 19 minutes. We had deployed a version of the app that provides infrastructure for a new feature, and it contained a bug that caused all of the main app's processes to spike to 100% CPU. We reverted the change, verified that everything worked, and deployed a version that fixed the issue. By 3:42 PM UTC, the system was back to normal.

In retrospect:
* We could have had less downtime (about 3 minutes instead of 19), but a technicality in how we revert changes meant that the first attempt to deploy a fix didn't actually deploy anything. We will optimize that.
* A more robust testing and rollout framework could have helped us find this problem before it reached the production environment. This is already a work in progress.
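
The first retrospective point describes a deploy attempt that changed nothing. One way to catch that early is to compare the running version before and after each deploy against the version that was requested. The sketch below is only an illustration: `DeployResult`, `deploy_took_effect`, `is_noop`, and the version strings are hypothetical names, not part of Strigo's actual tooling, and how the running version is read is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployResult:
    """Versions observed around one deploy attempt (hypothetical structure)."""
    version_before: str   # version running before the deploy started
    version_after: str    # version running once the deploy reported success
    target_version: str   # version the deploy was asked to ship


def deploy_took_effect(result: DeployResult) -> bool:
    """Return True only if the running version is now the requested one."""
    return result.version_after == result.target_version


def is_noop(result: DeployResult) -> bool:
    """Return True if a different version was requested but nothing changed."""
    return (
        result.version_before == result.version_after
        and result.version_after != result.target_version
    )


if __name__ == "__main__":
    # Assumed example values: a revert that silently shipped nothing.
    attempt = DeployResult(
        version_before="release-bad",
        version_after="release-bad",
        target_version="release-good",
    )
    if is_noop(attempt):
        print("Deploy reported success but the running version did not change")
```

Running a check like this right after a revert would flag a no-op deploy within seconds, instead of leaving it to be noticed while the outage continues.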