ChargeOver incident

Service Outage

Critical · Resolved

ChargeOver experienced a critical incident on September 9, 2020 affecting Main Application and Payment Processing, lasting 4h 23m. The incident has been resolved; the full update timeline is below.

Started: Sep 09, 2020, 04:15 PM UTC
Resolved: Sep 09, 2020, 08:39 PM UTC
Duration: 4h 23m
Detected by Pingoru: Sep 09, 2020, 04:15 PM UTC

Affected components

Main Application, Payment Processing

Update timeline

  1. investigating Sep 09, 2020, 04:15 PM UTC

    We are aware of an issue, and are investigating. Further information to follow.

  2. identified Sep 09, 2020, 04:38 PM UTC

    We have identified the issue, and are working to resolve the outage as quickly as possible.

  3. identified Sep 09, 2020, 06:43 PM UTC

    We are continuing to work towards restoration of service. More updates to follow.

  4. identified Sep 09, 2020, 08:16 PM UTC

    Services are being restored. We are continuing to monitor the situation and will provide further updates.

  5. monitoring Sep 09, 2020, 08:17 PM UTC

    We continue to monitor the situation as services are being restored.

  6. monitoring Sep 09, 2020, 08:25 PM UTC

    All ChargeOver services are operational now. We continue to monitor the situation. A postmortem will follow.

  7. resolved Sep 09, 2020, 08:39 PM UTC

    This issue has been resolved and all services are operational. Root cause and postmortem will follow.

  8. postmortem Sep 10, 2020, 01:31 PM UTC

    ChargeOver primarily uses the MariaDB database for data storage. At 11:09 CST, the MariaDB process on our primary database server crashed with the following error message, which is still being investigated:

    `InnoDB: Failing assertion: templ->clust_rec_field_no != ULINT_UNDEFINED`

    ChargeOver staff were notified immediately and began investigating. The database was automatically restarted and began running automated data integrity checks to ensure that no data had been lost or corrupted before it resumed servicing requests. Although we had the ability to fail over to a secondary database server, we decided to let this process complete, since the estimated downtime was very short.

    The automated integrity checks took much longer than expected. Our estimated time to recovery was under an hour; instead, the database server took approximately 3 hours to check its data and restart safely, with the checks running from 11:09 to 14:49 CST. Once the checks were complete, our team ran through our recovery checklist and started the database server. Service was restored in a degraded state at 15:01 CST and was fully operational at 15:15 CST.

    We recognize that there are lessons to be learned from this incident. We are working to put pieces in place to avoid such long check times, and to revise our fail-over processes to better account for the possibility of long check times in the future. Please subscribe to updates at [https://status.ChargeOver.com](https://status.ChargeOver.com) to be notified of future service disruptions.
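The process gap described in the postmortem is that the fail-over decision rested on an up-front recovery estimate that turned out to be off by roughly a factor of three. A minimal sketch of a deadline-based alternative, assuming a hypothetical 30-minute cap and placeholder health-check and fail-over helpers (illustrative only, not ChargeOver's actual tooling):

```python
"""Sketch of a fail-over policy bounded by a hard deadline.

Assumptions (not from the incident report): a 30-minute cap and
placeholder probe/promotion helpers.
"""
import time

RECOVERY_DEADLINE_SECS = 30 * 60  # hypothetical cap on waiting for the primary


def primary_is_healthy() -> bool:
    # Placeholder probe; a real check might run `SELECT 1` against the primary.
    return False


def fail_over_to_secondary() -> None:
    # Placeholder for promoting the secondary database server to primary.
    print("promoting secondary database server")


def handle_primary_crash() -> None:
    started = time.monotonic()
    while not primary_is_healthy():
        if time.monotonic() - started > RECOVERY_DEADLINE_SECS:
            # The deadline, not the initial estimate, drives the decision:
            # in this incident a ~1h estimate became ~3h of actual downtime.
            fail_over_to_secondary()
            return
        time.sleep(30)  # poll the primary every 30 seconds
    print("primary recovered within the deadline")
```

The design point is that a hard deadline triggers fail-over regardless of how optimistic the initial recovery estimate was, so a mis-estimate like the one in this incident cannot extend the outage indefinitely.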