Simwood experienced a service incident on October 8, 2019 affecting the API, Portal, and Operations Desk, lasting 7h 32m. The incident has been resolved; the full update timeline is below.
Affected components
- API
- Portal
- Operations Desk
Update timeline
- identified Oct 08, 2019, 07:17 AM UTC
Whilst not affecting call traffic, we are presently unable to write to our primary database cluster. This is due to an overnight job triggering a bug. The query will eventually work through, but we presently have no way of determining how long that will take. We are meanwhile investigating more invasive options. In the interim, this means portal, API, and administrative functions which would normally update the database (e.g. billing, number allocation and pre-pay top-ups) are delayed or non-functional. We're sorry for any impact this will have but, to repeat, call traffic is not affected.
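The situation above — reads unaffected, database writes delayed until the cluster is writable again — could be handled along these lines. This is a minimal hypothetical sketch, not Simwood's implementation; the function and parameter names are invented for illustration:

```python
from enum import Enum

class OpKind(Enum):
    READ = "read"
    WRITE = "write"

def handle_request(kind, payload, cluster_writable, write_queue):
    """Hypothetical request handler for a write-locked cluster."""
    # Reads are unaffected: the cluster remains serviceable for queries.
    if kind is OpKind.READ or cluster_writable:
        return "executed"
    # Writes (e.g. billing, number allocation, top-ups) are deferred
    # until the cluster accepts writes again.
    write_queue.append(payload)
    return "deferred"
```

Under this sketch, deferred writes would be replayed once the write lock clears, which matches the "delayed or non-functional" behaviour described above.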
- identified Oct 08, 2019, 12:05 PM UTC
This remains ongoing but we are making progress. The offending query remains on one node and continues to be in the process of rolling back. Unfortunately, rolling back is less efficient than the problem it caused in the first place. Note this is not an issue with the query per se (a single row delete) but an internal Galera issue triggered by it. Until this rollback completes the cluster remains effectively write-locked but serviceable for reads. We know why this happened and how to prevent it going forward, and have backup nodes with current data ready to take over should we decide to fail over from the existing cluster. As we have no idea whatsoever how long the trigger query will take to roll back on the final node, we have held off failing over in anger in the hope it may be soon, but cannot delay indefinitely. Call traffic remains unaffected and our ops team have been handling the most urgent customer issues such as locked balances. We will therefore continue monitoring and update here should anything change.
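The trade-off described above — wait for an unbounded rollback versus fail over to the standby cluster — amounts to a decision under a waiting budget. A hypothetical sketch of that decision logic (the helper name and threshold are assumptions, not the actual procedure used):

```python
def should_fail_over(hours_waited, rollback_progress, max_wait_hours=8.0):
    """Hypothetical helper: decide whether to fail over to the standby
    cluster when a rollback's duration cannot be predicted.

    rollback_progress is a fraction in [0, 1], or None when — as in this
    incident — progress cannot be measured at all.
    """
    # If the rollback has finished, failover is unnecessary.
    if rollback_progress is not None and rollback_progress >= 1.0:
        return False
    # Otherwise, hold off in the hope it completes soon, but only up to
    # a fixed waiting budget; past that, fail over.
    return hours_waited >= max_wait_hours
```

In this incident the rollback offered no measurable progress, so the choice reduced to the elapsed-time budget alone.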
- identified Oct 08, 2019, 03:55 PM UTC
We are about to commence failover to the standby cluster, as the query rollback is showing no signs of concluding. Once failover is complete we'll mark this incident as 'monitoring'. There are several million CDRs to catch up on, so we will leave it unresolved until they are processed.
- monitoring Oct 08, 2019, 04:46 PM UTC
Failover is largely complete and CDRs are now being processed.
- resolved Oct 09, 2019, 12:18 AM UTC
Billing has fully caught up. Thanks for your patience.
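The catch-up phase above — several million queued CDRs processed before the incident was marked resolved — can be sketched as draining a backlog in fixed-size batches. This is an illustrative assumption about the shape of such a job, not Simwood's billing pipeline:

```python
def process_backlog(cdrs, batch_size=1000):
    """Hypothetical sketch: drain a queued CDR backlog in fixed-size
    batches so billing catches up without overloading the new primary.
    Returns the number of records processed."""
    processed = 0
    while processed < len(cdrs):
        batch = cdrs[processed:processed + batch_size]
        # bill_batch(batch)  # a real billing call would go here
        processed += len(batch)
    return processed
```

Once `process_backlog` returns with the queue empty, billing is fully caught up and the incident can be resolved.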