ChargeOver incident

ChargeOver main application is unavailable

Critical · Resolved

ChargeOver experienced a critical incident on July 19, 2023 affecting the Main Application, Search, and several other components, lasting 48 minutes. The incident has been resolved; the full update timeline is below.

Started
Jul 19, 2023, 03:29 PM UTC
Resolved
Jul 19, 2023, 04:17 PM UTC
Duration
48m
Detected by Pingoru
Jul 19, 2023, 03:29 PM UTC

Affected components

Main Application, Search, Email Sending, Payment Processing, Integrations

Update timeline

  1. investigating Jul 19, 2023, 03:29 PM UTC

    We are aware of the problem, and are investigating. We will post further updates as we have them.

  2. identified Jul 19, 2023, 03:36 PM UTC

    We have identified the problem. ETA to resolution is less than 30 minutes.

  3. identified Jul 19, 2023, 03:52 PM UTC

    We are continuing to work on a fix for this issue.

  4. monitoring Jul 19, 2023, 04:12 PM UTC

    All systems have been restored. We are monitoring the fix. A postmortem will be posted.

  5. resolved Jul 19, 2023, 04:17 PM UTC

    The incident has been resolved. A postmortem will be provided.

  6. postmortem Jul 19, 2023, 05:10 PM UTC

## Incident details

The ChargeOver team follows an agile software development lifecycle for rolling out new features and updates, and we routinely roll out changes to the platform multiple times per week. Our typical continuous integration/continuous deployment roll-out looks something like this:

* Developers make changes
* Code review by other developers
* Automated security, linting, and acceptance testing is performed
* Senior-level developer code review before deploying features to production
* Automated deploy of new updates to the production environment
* If an error occurs, roll back the changes to a previously known-good configuration

ChargeOver uses `Docker` containers and a redundant set of `Docker Swarm` nodes for production deployments.

On `July 19th at 10:24am CT` we reviewed and deployed an update which, although it passed all automated acceptance tests, caused the deployment of `Docker` container images to silently fail. This made the application unavailable, and a `503 Service Unavailable` error message was shown to all users. The deployment appeared successful to our automated systems, but due to a syntax error it had actually only removed the existing application servers rather than replacing them with the new software version. No automated roll-back occurred, because the deployment looked successful when it had in fact failed silently.

## Root cause

A single extra space (a single errant spacebar press!) was accidentally added to the very beginning of a `docker-compose` `YAML` file, making the file invalid `YAML` syntax. The single-space change was subtle enough to be missed during code review. All automated tests passed, because the automated tests do not use the production `docker-compose` deployment configuration file.
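This failure class can be caught with even a trivial pre-flight check. As a minimal sketch (a hypothetical helper, not ChargeOver's actual tooling, and stdlib-only), a lint that rejects a config file whose first non-blank line starts with whitespace would have flagged the errant leading space; a real pipeline would go further and run the file through a full YAML parser or `docker compose config`.

```python
# Minimal pre-deploy lint for the failure described above: a compose/YAML
# file must not begin with leading whitespace, or the whole document is
# indented and no longer parses as the intended top-level mapping.
# Hypothetical helper for illustration -- not ChargeOver's actual tooling.

def leading_space_lint(text: str) -> bool:
    """Return True if the config text looks sane, False if the first
    non-blank line starts with whitespace (the root cause here)."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        return not line[0].isspace()
    return False  # an empty file is also not a deployable config

good = 'version: "3"\nservices:\n  app:\n    image: example/app\n'
bad = " " + good  # the single errant space at the very beginning

assert leading_space_lint(good) is True
assert leading_space_lint(bad) is False
```

Because the check is pure string inspection, it can run as a cheap CI step before any parser or deploy tooling is involved.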
When deploying the service to `Docker Swarm`, `Docker Swarm` interpreted the invalid `YAML` as an empty set of services to deploy, rather than as the intended set of application services. This made the deployment look successful (it deployed cleanly, removing all existing application servers and replacing them with nothing), so the automated roll-back to a known-good set of services never triggered.

## Incident timeline

* 10:21am CT - The change was reviewed and merged from a staging branch to our production branch.
* 10:24am CT - The change was deployed to production, immediately causing an outage.
* 10:29am CT - Our team posted a status update here, notifying affected customers.
* 10:36am CT - Our team identified the related errant change and started reverting to a known-good set of services.
* 11:06am CT - All services became operational again after redeploying the last known-good configuration.
* 11:09am CT - Our team pinpointed exactly what was wrong: an accidentally added single space character at the beginning of a configuration file, making the file invalid `YAML` syntax.
* 11:56am CT - Our team added a check to validate the syntax of the `YAML` configuration files, to ensure this failure scenario cannot happen again.

## Remediation plan

Our team has identified several remediation items:

* We have already deployed multiple checks to ensure that invalid `YAML` syntax and/or configuration errors cannot pass automated tests, and thus cannot reach testing/UAT or production environments.
* Our team will improve the very generic `503 Service Unavailable` message that customers received, directing affected customers to [https://status.chargeover.com](https://status.chargeover.com) where they can see real-time updates on any system outages.
* Customers logging in via [https://app.chargeover.com](https://app.chargeover.com) received a generic `The credentials that you provided are not correct.` message instead of a notification of the outage. This will be improved.
* Our team will review our deployment pipelines to identify any other similar potential failure points.
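The silent-failure mode described in the postmortem, a deploy that "succeeds" by replacing every running service with nothing, can also be guarded against at deploy time. Below is a minimal sketch (hypothetical names, stdlib-only; a real pipeline would validate the parsed compose file itself) of a guard that refuses to proceed when the new configuration defines zero services.

```python
# Guard against the silent failure described in the postmortem: a deploy
# that would replace all running services with an empty set should abort
# instead of "succeeding". All names here are hypothetical illustrations.

def services_to_deploy(parsed_config):
    """Return the service names from a parsed compose-style config.
    An invalid or empty document parses to None or {} -- exactly the
    case that silently removed all application servers."""
    if not parsed_config:
        return []
    return list((parsed_config.get("services") or {}).keys())

def safe_to_deploy(parsed_config):
    """Refuse any deployment that defines zero services."""
    return len(services_to_deploy(parsed_config)) > 0

valid = {"version": "3", "services": {"app": {"image": "example/app"}}}
broken = None  # what an invalid YAML document can effectively amount to

assert safe_to_deploy(valid) is True
assert safe_to_deploy(broken) is False
```

The design choice is to fail closed: a configuration that resolves to no services is treated as an error to be investigated, never as an instruction to tear everything down.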