Dead Man's Snitch incident

False alerts and dashboard 503 errors

Dead Man's Snitch experienced a major incident on September 27, 2021 affecting Snitch Check-in Processing, Management Portal, and API, lasting 4h 50m. The incident has been resolved; the full update timeline is below.

Started: Sep 27, 2021, 08:04 AM UTC
Resolved: Sep 27, 2021, 12:55 PM UTC
Duration: 4h 50m
Detected by Pingoru: Sep 27, 2021, 08:04 AM UTC

Affected components

Snitch Check-in Processing, Management Portal, API

Update timeline

investigating Sep 27, 2021, 08:04 AM UTC

We're currently investigating an issue affecting check-in processing and dashboard availability. We believe it is related to two major incidents affecting our hosting provider (Heroku): https://status.heroku.com/incidents/2361 https://status.heroku.com/incidents/2362

identified Sep 27, 2021, 08:48 AM UTC

We've temporarily disabled alerting as we investigate a way to work around the upstream issues.

identified Sep 27, 2021, 09:44 AM UTC

We've worked around the issues with check-in processing and are currently working through the backlog of pending check-ins in the queue. It doesn't appear our check-in receiver was impacted by the outage, just the workers that process the check-ins.

monitoring Sep 27, 2021, 09:57 AM UTC

Our check-in workers have caught up on all pending check-ins and alerts should be accurate going forward. Our main goal has been to get alerting and check-in processing back online. Heroku continues to experience issues with dynos and routing requests. We've worked around the dyno issues by temporarily moving check-in processing to hosts on EC2. We are monitoring check-in processing and Heroku's status and will update once we consider the issue fully resolved.

monitoring Sep 27, 2021, 12:10 PM UTC

Routing issues with the API and Dashboard appear to be mostly resolved. We are migrating some check-in processing back to Heroku and will continue to monitor the situation.

resolved Sep 27, 2021, 12:55 PM UTC

All systems are green. We've migrated all processing back to Heroku as they have resolved their upstream issue. Our processing system should have recovered more quickly than it did and we're investigating a possible fix for that.