Cloudera incident
Customer workloads in Azure cloud experiencing degraded performance
Cloudera experienced a major incident on April 18, 2023 affecting Cloudera Data Hub, lasting 3h 57m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Apr 18, 2023, 09:32 AM UTC
Datalake and DataHub workload management is in a degraded state for a few customers in Azure cloud.
- identified Apr 18, 2023, 09:47 AM UTC
The issue has been identified and a rollback is in progress.
- identified Apr 18, 2023, 10:33 AM UTC
We are continuing to work on a fix for this issue.
- identified Apr 18, 2023, 11:45 AM UTC
Rollback complete. Services are currently being monitored.
- monitoring Apr 18, 2023, 12:12 PM UTC
Rollback complete. Services are currently being monitored.
- resolved Apr 18, 2023, 12:29 PM UTC
Datahub and Datalake services are now fully operational. The incident is resolved.
- postmortem Apr 19, 2023, 05:40 AM UTC
On April 18, 2023, between 08:30 UTC and 11:45 UTC, customers using DataHub and DataLake on Azure and connecting to the us-west CDP Control Plane experienced timeouts that caused environment creation failures. The team was notified of the incident immediately. The investigation found that the latest release triggered an edge-case bug that caused metadata update failures with Azure. The incident was resolved by rolling back the release. Existing environments on AWS and GCP were not impacted. As a mitigation, the team is adding test workloads across cloud environments to simulate these edge cases and is enhancing our test suites.