Cloudera experienced a critical incident on February 13, 2023, affecting Cloudera Management Console and 1 more component, lasting 2h 2m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Feb 13, 2023, 04:18 AM UTC
We're experiencing an elevated level of errors on CDP Control plane and are currently looking into the issue.
- investigating Feb 13, 2023, 05:08 AM UTC
We are continuing to investigate the issue. Please note this impacts the management operations of the workload clusters.
- identified Feb 13, 2023, 05:42 AM UTC
The issue has been identified and a fix is being implemented.
- identified Feb 13, 2023, 05:43 AM UTC
We are continuing to work on a fix for this issue.
- monitoring Feb 13, 2023, 05:50 AM UTC
A fix has been implemented and we are monitoring the results.
- resolved Feb 13, 2023, 06:20 AM UTC
This incident has been resolved and the CDP Control plane is working as expected. We will publish the RCA as soon as possible.
- postmortem Feb 16, 2023, 09:33 AM UTC
On Monday, February 13th at 04:18 UTC, Cloudera SRE detected a spike in errors related to the CDP Control plane management console. After investigation, it was determined that a recent production change had caused the outage. Although Cloudera follows strict software development lifecycle standards, an unforeseen bug due to complex dependencies was still encountered. The issue was resolved by rolling back to the previous version. To reduce the risk of similar issues in the future, we are improving our test suites, monitoring, and dependency tracking to detect such scenarios as early in the development process as possible. We are also reviewing our rollback process to reduce mean time to recovery.