Ziron incident

[all] Database issues affecting various services

Ziron experienced a major incident on February 27, 2018, lasting 7h 41m. The incident has been resolved; the full update timeline is below.

Started: Feb 27, 2018, 02:48 PM UTC
Resolved: Feb 27, 2018, 10:30 PM UTC
Duration: 7h 41m
Detected by Pingoru: Feb 27, 2018, 02:48 PM UTC

Update timeline

investigating Feb 27, 2018, 02:48 PM UTC

We are investigating database issues that are potentially affecting various services. Details to follow.

monitoring Feb 27, 2018, 03:16 PM UTC

We have restored service to the affected database cluster, albeit at reduced capacity. Engineers are working on restoring the remaining capacity.

monitoring Feb 27, 2018, 04:04 PM UTC

We have scheduled emergency maintenance on the database tonight from 22:00 UTC. For more details, please see http://www.zironstatus.com/incidents/mns6hk3f8918

resolved Feb 27, 2018, 10:30 PM UTC

Maintenance has now been completed, although we will be scheduling further maintenance for later this week.

postmortem Aug 01, 2018, 07:36 PM UTC

27 February 2018 (all times in UTC)

Between approximately 14:47 and 15:13 on 27th February 2018, we experienced a major failure of our London high availability database cluster, hosted at Rackspace. This, in turn, caused issues with a number of key services, including the Ziron API and dashboard, outbound calls, and SMS.

By design, inbound voice call routing was unaffected, and lookup services should also have been unaffected. However, the issue with the database cluster caused the API frontend load balancers (hosted at Rackspace) to remove all API backend servers due to a high volume of errors. At present, the lookup APIs are hosted on these same servers and are accessed via the same hostname, so they became unavailable as well.

Service was restored as quickly as possible; a full timeline is below.

A number of key learnings have resulted from this incident, and good progress has already been made to reduce the reliance on this database cluster by migrating key databases to their own individual database clusters. Changes are also underway to separate key API services onto their own backend servers and to reduce the reliance on both the API frontend load balancers and Rackspace’s London data centres as a whole.

Further details will be posted on our blog in due course. We will endeavour to keep any disruption to a minimum, and any scheduled maintenance required will be posted to our status page.

Timeline

14:47 Multiple monitoring systems started showing issues with various API endpoints and internal systems. All available technical resources were immediately deployed.

14:48 An incident was opened on the Ziron status page advising that an issue was affecting various services.

14:48 Initial investigation suggested an issue with the primary London database cluster, hosted at Rackspace, affecting the Ziron API and dashboard, outbound calls, and SMS. Inbound voice call routing was unaffected, and at this time it was believed that lookup services were (by design) also unaffected.

14:49 Further investigation confirmed that the primary London database cluster had suffered a failure of the majority of its nodes and, by design, had shut down awaiting a manual restart.

14:54 External monitoring confirmed that the Ziron API was unavailable from numerous locations worldwide. Investigation showed that the API load balancers (hosted at Rackspace) had removed all API backend servers due to a high volume of errors, which was now also affecting lookup services. Inbound voice call routing continued to be unaffected.

14:56 Repeated attempts to restart the database cluster nodes had failed, so the investigation turned to re-creating (bootstrapping) the cluster. As the cluster is run in a multiple master setup, it is critical that the correct procedures to identify the most authoritative node are followed.

15:05 The cluster was bootstrapped from the most authoritative node, which began running data integrity checks.

15:13 Service to the database cluster was restored, albeit with reduced resilience.

15:16 The incident on the Ziron status page was updated to advise that all services were returning to normal.

16:04 After further internal discussions, a decision was made to postpone any further attempts to increase resilience until a maintenance window scheduled for 22:00 that evening. A notice to this effect was posted to the Ziron status page.

22:00 The scheduled maintenance window began. For a brief period (under 15 minutes) during the window, external traffic to the database cluster was blocked as a precaution.

22:30 The scheduled maintenance window completed, with plans to schedule further maintenance windows. In the following days, it transpired that these were no longer required.
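
To illustrate the two cluster behaviours described in the timeline (shutting down once a majority of nodes is lost, and bootstrapping from the most authoritative node), here is a minimal Python sketch. It is not the procedure or tooling used during the incident. It assumes, as is common in multi-master database clusters but not stated above, that each node records the sequence number of its last committed transaction, and that the most authoritative node is the one with the highest such number. The node names, the NodeState type, and both functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class NodeState:
    name: str            # hypothetical node label
    reachable: bool      # whether the node is currently up and responding
    last_committed: int  # assumed: sequence number of the last committed transaction


def has_quorum(nodes: Sequence[NodeState]) -> bool:
    """Return True when strictly more than half of the nodes are reachable."""
    reachable = sum(1 for node in nodes if node.reachable)
    return reachable * 2 > len(nodes)


def pick_bootstrap_node(nodes: Sequence[NodeState]) -> NodeState:
    """Return the node with the highest last-committed sequence number.

    Under the assumption above, this node holds the most recent data and is
    the safest one to re-create the cluster from. If several nodes share the
    highest number, the first one listed is returned.
    """
    if not nodes:
        raise ValueError("no nodes supplied")
    return max(nodes, key=lambda node: node.last_committed)


# Hypothetical state after a failure: two of three nodes are down.
nodes = [
    NodeState("db1", reachable=False, last_committed=1041),
    NodeState("db2", reachable=False, last_committed=1057),
    NodeState("db3", reachable=True, last_committed=1049),
]

if not has_quorum(nodes):
    # Majority lost: the cluster stays down until an operator bootstraps it.
    chosen = pick_bootstrap_node(nodes)
    print(f"bootstrap from {chosen.name}")
```

Note that in this example the only node still reachable (db3) is not the one chosen. Bootstrapping from a node that is behind the others risks discarding committed transactions, which is why the identification step matters before any restart.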