Castle experienced a critical incident on March 30, 2021 affecting Dashboard and Legacy APIs, lasting 4h 19m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Mar 30, 2021, 01:18 PM UTC
We are experiencing issues with our main database cluster which affects all API endpoints. We're currently investigating this issue.
- monitoring Mar 30, 2021, 02:09 PM UTC
Database cluster operating normally and API endpoints are responding. We're continuing to monitor the situation
- identified Mar 30, 2021, 02:53 PM UTC
We are seeing degraded performance on API endpoints once more, and are working on restoring functionality as quickly as possible.
- monitoring Mar 30, 2021, 03:47 PM UTC
API endpoints are responsive again and the system is stabilizing. We're monitoring the situation
- resolved Mar 30, 2021, 04:28 PM UTC
Systems are operating normally and we have put mitigation measures in place to ensure the issue does not reoccur. We'll have a full retrospective and root cause teardown of the incident published within the next few days.
- postmortem Mar 30, 2021, 05:06 PM UTC
On March 30, 2021, Castle’s API became degraded during three distinct windows of time: * 12:02 UTC - 12:45 UTC * 12:59 UTC - 13:41 UTC * 14:48 UTC - 15:25 UTC During this time, some Castle API calls failed, including calls to our synchronous `authenticate` endpoint. The Castle dashboard was up, but due to the API being unavailable was not rendering data. Service was fully restored as of 15:25 UTC, and some data generated from requests to our asynchronous `track` and `batch` endpoints during the incident was recovered from queues and subsequently processed. As we communicated to all active customers yesterday, we take these sort of incidents very seriously, and want to share some of the factors that led to this incident. The root cause of the incident was a failure of one of our primary data clusters. This is a multi-node, fault-tolerant commercial solution and a complete cluster failure is extremely rare. Castle’s infrastructure team responded immediately to the incident and found an unbounded memory leak that caused each node to simultaneously shut down. Over the course of the incident, we learned this memory leak was exacerbated by a specific class of background job that actually began running a day prior but did not begin leaking memory for some time. When the incident began, we detected the issue and immediately restarted the cluster. A full 'cold start' of the entire cluster takes around 40 minutes, and this accounts for the first downtime window. After the cluster restarted, our fault-tolerant job scheduling system attempted to run the jobs again, which caused the cluster to require full cold restarts twice more as we worked to clear out the job queue and replicas. At this time, we believe the reason for the memory leak is a bug in our data cluster provider’s software - we have been able to successfully reproduce the issue in a test environment and have a high-priority case open with their support team. In the meantime, we have audited all active background job systems to ensure performance-affecting jobs are temporarily disabled or worked around. Once again, we apologize for the impact of this interruption. Please feel free to contact us at [[email protected]](mailto:[email protected]) if you have any further questions.