Dead Man's Snitch incident

False alerts and check-in backlog

Major · Resolved

Dead Man's Snitch experienced a major incident on January 4, 2021, affecting Snitch Check-in Processing and lasting 3h 40m. The incident has been resolved; the full update timeline is below.

Started
Jan 04, 2021, 12:14 PM UTC
Resolved
Jan 04, 2021, 03:54 PM UTC
Duration
3h 40m
Detected by Pingoru
Jan 04, 2021, 12:14 PM UTC

Affected components

Snitch Check-in Processing

Update timeline

  1. investigating Jan 04, 2021, 12:14 PM UTC

    We're investigating an issue with check-in processing starting around 11:45 UTC.

  2. monitoring Jan 04, 2021, 12:28 PM UTC

    We've restarted the affected service and confirmed that it's processing correctly. It has now caught up on the backlog of pending check-ins. We're continuing to investigate the root cause.

  3. monitoring Jan 04, 2021, 12:28 PM UTC

    We are continuing to monitor for any further issues.

  4. resolved Jan 04, 2021, 03:54 PM UTC

    The root cause has been tracked down to a timeout error during check-in processing that wasn't handled correctly and put the process into a bad state. We're working on a fix for the issue and should have it deployed shortly. Check-in processing stopped at 11:47 UTC, but we weren't made aware of the issue until 11:57 UTC. In reviewing our metrics and alerting, we've identified a better metric to alert on and will be working that into an update to our internal monitoring and alerting systems.
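The failure mode described in the resolution note (an unhandled timeout stalling the processor, followed by a ten-minute gap before anyone was alerted) can be illustrated with a minimal Python sketch. This is not Dead Man's Snitch's code or its actual fix. `CheckInTimeout`, `process_backlog`, and `is_stalled` are hypothetical names. The retry limit and the two-minute threshold are arbitrary. The "time since last processed check-in" signal is an assumption, since the update does not say which metric the team chose.

```python
import queue
from datetime import datetime, timedelta


class CheckInTimeout(Exception):
    """Hypothetical error raised when handling one check-in times out."""


def process_backlog(pending, handle, max_attempts=3):
    """Drain `pending`, retrying timed-out check-ins instead of stalling.

    Returns (processed, failed) lists. A timeout on one item never stops
    the loop, so later check-ins still get processed.
    """
    processed, failed = [], []
    attempts = {}
    while True:
        try:
            item = pending.get_nowait()
        except queue.Empty:
            break
        try:
            handle(item)
            processed.append(item)
        except CheckInTimeout:
            attempts[item] = attempts.get(item, 0) + 1
            if attempts[item] < max_attempts:
                pending.put(item)  # requeue for another attempt
            else:
                failed.append(item)  # give up on this item, keep going
    return processed, failed


def is_stalled(last_processed_at, now, threshold=timedelta(minutes=2)):
    """Assumed staleness check: has nothing been processed for too long?"""
    return now - last_processed_at > threshold
```

Under these assumptions, a processor that stopped at 11:47 UTC would be flagged as stalled as soon as the check ran after 11:49 UTC, for example `is_stalled(datetime(2021, 1, 4, 11, 47), datetime(2021, 1, 4, 11, 50))` returns `True`.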