Dead Man's Snitch incident

Elevated check-in error rate and timeouts

Minor Resolved

Dead Man's Snitch experienced a minor incident on June 22, 2022 affecting Snitch Check-in Processing, lasting 51m. The incident has been resolved; the full update timeline is below.

Started
Jun 22, 2022, 01:34 PM UTC
Resolved
Jun 22, 2022, 02:25 PM UTC
Duration
51m
Detected by Pingoru
Jun 22, 2022, 01:34 PM UTC

Affected components

Snitch Check-in Processing

Update timeline

  1. investigating Jun 22, 2022, 01:34 PM UTC

    We're seeing an increase in timeouts for our check-in service. We've removed a pair of misbehaving nodes from our load balancer and we're investigating the root cause. Any check-in that received a 408 Request Timeout error was not processed.

  2. monitoring Jun 22, 2022, 02:07 PM UTC

    We've replaced the affected nodes and error rates have gone back to normal levels.

  3. resolved Jun 22, 2022, 02:25 PM UTC

    Error rates have returned to normal and systems are all green. We've identified the possible root cause as a bug in the version of the message bus client we were using that could cause a deadlock in some cases. We've updated to the latest version of the client and rolled out the update to production.
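
Client-side retry sketch

The first update notes that any check-in answered with a 408 Request Timeout was not processed, so a client that wants the check-in recorded has to send it again. The following is a minimal sketch of one way to do that, not the vendor's recommended client. The function names (`http_get_status`, `check_in`), the retry count, and the backoff delays are illustrative assumptions, and the check-in URL is whatever your snitch provides.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_get_status(url: str, timeout: float = 10.0) -> int:
    """Send a GET request and return the HTTP status code.

    Network-level failures (DNS errors, socket timeouts) are not handled
    here and propagate to the caller.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is still available.
        return err.code


def check_in(
    url: str,
    send: Callable[[str], int] = http_get_status,
    max_attempts: int = 3,       # assumed value, tune for your job
    base_delay: float = 1.0,     # assumed value, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once a check-in gets a 2xx response.

    A 408 means the check-in was not processed, so it is retried with
    exponential backoff. Any other non-2xx status is treated as a
    failure and is not retried.
    """
    for attempt in range(max_attempts):
        status = send(url)
        if 200 <= status < 300:
            return True
        if status != 408:
            return False
        if attempt < max_attempts - 1:
            # Wait 1x, 2x, 4x, ... the base delay between attempts.
            sleep(base_delay * 2 ** attempt)
    return False
```

Calling `check_in(url)` at the end of a job returns `False` if every attempt timed out, which the job can log or alert on. The `send` and `sleep` parameters exist only so the retry logic can be exercised without network access.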