Strigo incident

Main app is unavailable

Critical · Resolved

Strigo experienced a critical incident on August 9, 2021, in which its main app was down for 19 minutes. The incident has been resolved; the full update timeline is below.

Started
Aug 09, 2021, 12:23 PM UTC
Resolved
Aug 09, 2021, 12:23 PM UTC
Duration
19 minutes (as reported in the vendor update below)
Detected by Pingoru
Aug 09, 2021, 12:23 PM UTC

Update timeline

  1. resolved Aug 11, 2021, 09:20 PM UTC

    Starting on August 9th at 3:23 PM UTC, our main app was down for 19 minutes. We had deployed a version of the app that provides infrastructure for a new feature, and it contained a bug that drove all of the main app's processes to 100% CPU. We reverted the change, verified that everything worked, and deployed a version that fixed the issue. By 3:42 PM UTC, the system was back to normal.

    In retrospect:

    * We could have had less downtime (around 3 minutes instead of 19), but a technicality in how we revert changes meant that the first attempt to deploy a fix didn't actually deploy anything. We will optimize that.
    * A more robust testing and rollout framework could have helped us find this problem before it reached the production environment. This is already a work in progress.
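
The first retrospective point describes a deploy attempt that ran without actually changing what was running. One possible guard against that, sketched here under assumptions, is to compare the running version with the intended one after every deploy or revert. This is not Strigo's tooling: `DeployResult`, `deploy_and_verify`, the version strings, and the simulated deploy functions are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployResult:
    """Outcome of one deploy attempt (hypothetical structure)."""
    expected_version: str
    running_version: str

    @property
    def took_effect(self) -> bool:
        # The deploy only counts if the running version is the one requested.
        return self.running_version == self.expected_version


def deploy_and_verify(target_version, deploy, get_running_version):
    """Run a deploy, then check which version is actually running.

    `deploy` and `get_running_version` are caller-supplied callables that
    stand in for whatever deployment tooling a team really uses.
    """
    deploy(target_version)
    return DeployResult(target_version, get_running_version())


# Simulated environment (assumption): the buggy version is currently live.
state = {"version": "v2-buggy"}


def noop_deploy(version):
    pass  # simulates a revert that completes but changes nothing


def real_deploy(version):
    state["version"] = version


first = deploy_and_verify("v1-stable", noop_deploy, lambda: state["version"])
second = deploy_and_verify("v1-stable", real_deploy, lambda: state["version"])
print(first.took_effect, second.took_effect)  # False True
```

In this sketch the first attempt is flagged immediately because the running version is still `v2-buggy`, so a silent no-op would surface right after the deploy step rather than minutes later.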