Flying Sphinx incident
Sphinx daemons not responding on Robigus
Flying Sphinx experienced a notice incident on October 3, 2018 affecting Robigus, lasting —. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- resolved Oct 03, 2018, 08:49 AM UTC
Earlier today, Robigus stopped responding to search requests (connections to Sphinx daemons). The part of the Flying Sphinx infrastructure that failed was the Sphinx proxy on your original server: it stopped responding to any TCP requests, and the logs gave no indication as to why.

Clearly, the proxy is a critical part of everything. If it’s down, you can’t connect to your Sphinx daemon at all, and that connection is essential for both searching and regenerating.

I usually get downtime alerts (a dedicated phone plus SMS messages, which should wake me up if it’s the middle of the night), but the proxy failing did not trigger them. So, I’ve made the following changes:

* If the proxy does not respond to health checks, it’s considered a major server failure, and thus I will get SMS alerts.
* Also, if it’s not responding to health checks, Monit will restart the proxy process, so the problem should be resolved within a minute.

This is now all in place, but I’ll continue to think through better ways of handling such situations. I’m very sorry for the downtime!
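
For readers curious what a TCP health check of this kind can look like, here is a minimal Python sketch. It is an illustration only and not the actual Flying Sphinx check: the function name `proxy_is_healthy`, the host, the port and the timeout are all hypothetical. It only tests whether a TCP connection can be opened in time, so a proxy that accepts connections but never answers would need a deeper, protocol-level probe.

```python
import socket


def proxy_is_healthy(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper: a refused connection or a timeout both count as a
    failed health check.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address; the real proxy host and port are not given in this report.
    if proxy_is_healthy("127.0.0.1", 9306):
        print("proxy ok")
    else:
        print("proxy failed health check: restart and alert")
```

A supervisor such as Monit would run an equivalent probe on a schedule and restart the process when it fails, which is what keeps recovery time short.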