Cheddar experienced a minor incident on May 2, 2018, lasting 1d 23h. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating May 02, 2018, 03:52 PM UTC
We are currently experiencing database sync issues that have made data about some transactions temporarily unavailable, and some recent transactions may not have run as expected. We have temporarily suspended recurring transactions while we work with our database provider to remedy the issue, and will post updates when available.
- identified May 03, 2018, 01:26 AM UTC
Additional info on today's data discrepancy issues: Cheddar experienced a sync issue during maintenance of our production database. This resulted in a gap of approximately 6 minutes in our data. We didn’t lose this data; we have redundancy in place to protect against this kind of data loss. However, because additional data has been created since then, we have to merge those 6 minutes of data back into the production database. Our team of engineers and database administrators has been working tirelessly to do just that.

In the meantime, we’ve turned off our recurring engine while we fix this issue. Invoices do not transact when the recurring engine is off. The recurring engine is fault tolerant and, when re-enabled, will transact any invoices that didn’t process while the engine was suspended. We are very close to correcting any missing transactional data and should be able to re-enable the recurring engine soon.

Until then, you may see two problems:
1. A limited amount of missing data from that window Monday night. This affects a small subset of Cheddar customers.
2. Invoices appearing queued and being delayed. This affects all customers.

Once resolved:
1. Any missing data should be restored.
2. The recurring engine will be turned on and any queued transactions processed.

We will continue to monitor the situation and update status. Primary systems including the API and Dashboard continue to operate normally.
- identified May 03, 2018, 03:26 PM UTC
All data records that were affected during the 6-minute incident window have now been incorporated into the production database. The records of the small subset of customers that were missing transactions or events yesterday have now been repaired. You may still notice that some customer accounts have multiple queued invoices. We’ve identified which customer records were affected and will fix them as soon as possible. While we continue to repair invoice data, the recurring engine remains disabled. We will post updates about the status of the recurring engine soon. In the meantime, it is possible to run queued transactions manually. If there are queued invoices you'd like to transact now, click on the queued invoice and hit 'run invoice' in the bottom left corner of the page.
- identified May 03, 2018, 06:02 PM UTC
We are currently testing the recurring engine on some customer records and monitoring for any errors.
- monitoring May 03, 2018, 08:16 PM UTC
Our tests have completed successfully, and the recurring engine is now up and running normally. We will continue to monitor the situation for any further issues.
- resolved May 04, 2018, 03:06 PM UTC
This incident has been resolved. We will issue a postmortem in the next couple of days.
- postmortem Jul 30, 2018, 08:42 PM UTC
# Data Inconsistencies Postmortem

Last week, we experienced some issues that temporarily created data inconsistencies and delayed the processing of recurring transactions for some merchants. We know you rely on us to provide a consistent, dependable service, and we regret the disruption that these issues caused. We wanted to take this opportunity to say we’re sorry, and to explain the chain of events that took place from April 29th - May 3rd.

**What Happened**

Starting April 29, we noticed that the system clocks on our servers were no longer synchronized. This issue was caused by a configuration error in our hosting provider’s time sync settings, which affected all of their data centers. Once our hosting provider told us the issue was resolved, we rebooted our servers on April 30th in an effort to immediately synchronize the clocks.

After the reboot, we noticed some inconsistencies between the nodes in our high availability database layer. These issues caused our recurring engine, which automatically runs queued invoices several times a day, to stall.

While we worked with our third-party providers to remedy the underlying issues, out of an abundance of caution we disabled the recurring engine on 5/2/18, at around 15:25 UTC. We fully re-enabled it on 5/3/18, at around 20:00 UTC, at which time all of Cheddar’s normal functionality was restored.

**Effects**

- Due to the time sync issue, multiple queued invoices were created on several merchants’ customers’ accounts.
- As a result of the reboot, data sent to Cheddar on April 30th, between approximately 19:04-19:10 UTC, briefly appeared to be “missing” for some of our merchants. For example, some transaction activity or new customer records created during this 6-minute window were unavailable.
- The temporary data inconsistencies resulted in duplicate transactions for some of our merchants.
- While the recurring engine was disabled, recurring transactions were not automatically being run, and invoices sat in a queued state for longer than usual.

**What we did**

Throughout this incident, we worked closely with our third-party providers to remedy the underlying issues. In the meantime, we also worked to minimize the impact on our merchants.

- Fixing customer records: Some of the customers who had multiple queued invoices were in danger of being auto-canceled by the recurring engine. Our engineering team manually corrected those customer records to prevent cancellation. We’ve also been monitoring for duplicate transactions on customer records so we can let our merchants know they might need to issue refunds.
- Restoring data: Thanks to Cheddar’s redundancy, the data that appeared to be “missing” was never truly gone. We worked with our database administrators to restore the data.
- Recurring engine: The recurring engine was brought back in service slowly as we monitored for anomalies.

**What we’ve learned**

The issues we experienced last week had one cause in common: miscommunication with our third-party providers. While technical issues are sometimes inevitable, we recognize our role in minimizing the impact of our upstream services on our customers. Going forward, we’re focusing on some technical and organizational measures to help us fulfill that role:

- We’re working to incorporate additional automated monitoring to catch system clock drift.
- We’ve updated our shared documentation and conferred with all relevant third-party providers regarding mechanisms for keeping that documentation accurate and up to date.
- We’re implementing additional monitoring of the recurring engine processes, so we can be better aware of problems with the recurring engine.
- We’re putting procedures in place for communicating via our [status page](https://status.getcheddar.com) and [support forum](http://support.getcheddar.com), so that we can better keep our merchants informed of any adverse conditions.

Thanks for your patience while we sorted this out. We appreciate the trust you put in Cheddar by allowing us to take care of one of the most important aspects of your business. Rest assured that we're already working hard on implementing these solutions. As always, if you have any questions or concerns, please reach out to us at the [support forum](http://support.getcheddar.com) and we’ll be happy to help!