Flying Sphinx incident

Sphinx daemons not responding on Robigus


Flying Sphinx experienced a notice incident on October 3, 2018 affecting Robigus. The incident has been resolved; the full update timeline is below.

Started
Oct 03, 2018, 08:49 AM UTC
Resolved
Oct 03, 2018, 08:49 AM UTC
Duration
Detected by Pingoru
Oct 03, 2018, 08:49 AM UTC

Affected components

Robigus

Update timeline

  1. resolved Oct 03, 2018, 08:49 AM UTC

    Earlier today, Robigus stopped responding to search requests (connections to Sphinx daemons). The part of the Flying Sphinx infrastructure that failed was the Sphinx proxy on your original server: it stopped responding to any TCP requests, though the logs gave no suggestion as to why. This is a critical part of everything: if the proxy’s down, you can’t connect to your Sphinx daemon at all, and that’s essential for both searching and regenerating.

    I usually get downtime alerts (with a dedicated phone + SMS messages which should wake me up if it’s the middle of the night), but they weren’t triggered by the proxy failing. So, I’ve made the following changes:

    * If the proxy does not respond to health checks, it’s considered a major server failure, and thus I will get SMS alerts.
    * Also, if it’s not responding to health checks, Monit will restart the proxy process, so the problem should be resolved within a minute.

    This is now all in place, but I’ll continue to think through better ways of handling such situations. I’m very sorry for the downtime!
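
For readers who want a concrete picture of the kind of check described above, here is a minimal sketch in Python. It is not the Flying Sphinx or Monit implementation. The function names, the action labels, and the two-second timeout are all hypothetical, chosen only for illustration. It assumes the proxy counts as healthy if a plain TCP connection can be opened to it.

```python
import socket


def tcp_health_check(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds.

    Hypothetical helper: the real health check used by Flying Sphinx is not
    described in the post.
    """
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def decide_actions(healthy):
    """Map a check result to the two responses the post describes.

    The action labels are illustrative placeholders, not real commands.
    """
    if healthy:
        return []
    # A failed check is treated as a major failure: alert and restart.
    return ["send_sms_alert", "restart_proxy"]
```

A connect-only check like this catches refused connections and timeouts. A process that has hung while its listening socket is still open may still accept connections, so a stricter check would also send a protocol-level request and wait for a reply.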