Flying Sphinx incident

Widespread issues

Flying Sphinx experienced a notice-level incident on February 29, 2016, lasting 3h 12m. The incident has been resolved; the full update timeline is below.

Started: Feb 29, 2016, 09:10 PM UTC
Resolved: Mar 01, 2016, 12:23 AM UTC
Duration: 3h 12m
Detected by Pingoru: Feb 29, 2016, 09:10 PM UTC

Update timeline

investigating Feb 29, 2016, 09:10 PM UTC

Major problems that (so very annoyingly) did not trigger Pingdom, and thus did not lead to earlier escalation. Reason as yet unknown. Investigating.

identified Feb 29, 2016, 09:26 PM UTC

Major API outage related to an underlying database problem has been fixed. API requests will now work reliably again. Will be following up directly with customers who had reported problems - if you're still seeing problems, do get in touch.

identified Feb 29, 2016, 09:58 PM UTC

Still hunting down the finer details, but API and daemon behaviour seems to be returning to normal. I'm very sorry for this outage. I'm in Australia, and this problem happened overnight. I do have a dedicated phone for Pingdom alerts, but this particular problem didn't flow through to Pingdom - something I'll be remedying as soon as the initial problem is confirmed and fully resolved, so future such issues wake me up and are dealt with far more promptly.

monitoring Feb 29, 2016, 11:22 PM UTC

Everything's been functioning fine for a little while now, but I will continue to keep an eye on things and put measures in place to stop this issue from having such a far-reaching impact again.

resolved Mar 01, 2016, 12:23 AM UTC

No further issues at this point.

The underlying problem was a series of out-of-memory errors from the database, which essentially killed the API. In turn, the Sphinx proxies that authenticate daemon search requests couldn't get updated credential lists, and thus started blocking search requests for some customers. Added to this was the fact that it happened while I was sleeping, and because Pingdom didn't consider the API to be offline, I didn't receive any alerts to wake me up.

The database had just recently been switched from a legacy plan to a current one, and I believe this is related.

So, to address all of this:

* I've lodged a ticket with Heroku to clarify the database change and any associated memory changes.
* I will be updating the proxy to continue with out-of-date credentials if it can't retrieve new ones, instead of blocking *all* access on a given server in the case of an API failure.
* I will be connecting error spikes via Bugsnag to alerts on my dedicated phone, to ensure I'm woken up should similar issues crop up again, rather than after a delay of several hours.

I am very sorry for this issue occurring, and greatly appreciate your patience and understanding.
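
The planned proxy change (keep serving with out-of-date credentials when fresh ones can't be retrieved) can be illustrated with a minimal sketch. This is not the actual Flying Sphinx proxy code: the `CredentialCache` class, its method names, and the `fetch` callable standing in for the API request are all hypothetical, and Python is used purely for illustration.

```python
import time
from typing import Callable, Optional, Set


class CredentialCache:
    """Hypothetical cache that keeps the last good credential list.

    If a refresh fails (for example because the API is down), the
    previously fetched list stays in use instead of being discarded.
    """

    def __init__(
        self,
        fetch: Callable[[], Set[str]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch  # assumed stand-in for the API call
        self._clock = clock
        self._credentials: Optional[Set[str]] = None
        self._fetched_at: Optional[float] = None

    def refresh(self) -> bool:
        """Try to fetch a fresh list. Returns True only on success."""
        try:
            fresh = set(self._fetch())
        except Exception:
            # API failure: keep whatever list we already have.
            return False
        self._credentials = fresh
        self._fetched_at = self._clock()
        return True

    def is_allowed(self, token: str) -> bool:
        """Check a token against the most recent successful list."""
        if self._credentials is None:
            # Nothing has ever been fetched, so there is nothing to fall back on.
            return False
        return token in self._credentials

    def staleness(self) -> Optional[float]:
        """Seconds since the last successful refresh, or None if never refreshed."""
        if self._fetched_at is None:
            return None
        return self._clock() - self._fetched_at
```

The trade-off in a design like this is that a revoked credential keeps working until the API recovers, so the value from `staleness()` could be fed into monitoring to flag a proxy that has been running on an old list for too long.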