Cloudera incident

Customer workloads in Azure cloud experiencing degraded performance

Major Resolved

Cloudera experienced a major incident on April 18, 2023 affecting Cloudera Data Hub, lasting 3h 57m. The incident has been resolved; the full update timeline is below.

Started
Apr 18, 2023, 08:32 AM UTC
Resolved
Apr 18, 2023, 12:29 PM UTC
Duration
3h 57m
Detected by Pingoru
Apr 18, 2023, 08:32 AM UTC

Affected components

Cloudera Data Hub

Update timeline

  1. investigating Apr 18, 2023, 09:32 AM UTC

    Datalake and DataHub workload management is in a degraded state for a few customers in Azure cloud.

  2. identified Apr 18, 2023, 09:47 AM UTC

    The issue has been identified and a rollback is in progress.

  3. identified Apr 18, 2023, 10:33 AM UTC

    We are continuing to work on a fix for this issue.

  4. identified Apr 18, 2023, 11:45 AM UTC

    Rollback complete. Services are currently being monitored.

  5. monitoring Apr 18, 2023, 12:12 PM UTC

    Rollback complete. Services are currently being monitored.

  6. resolved Apr 18, 2023, 12:29 PM UTC

    Datahub and Datalake services are fully operational now. The incident is now resolved.

  7. postmortem Apr 19, 2023, 05:40 AM UTC

    On April 18, 2023, between 08:30 UTC and 11:45 UTC, customers using DataHub and DataLake on Azure and connecting to the us-west CDP Control Plane experienced timeouts that caused environment creation failures. The team was notified of the incident immediately. Investigation found that the latest release triggered an edge-case bug that caused metadata update failures with Azure. The incident was resolved by performing a rollback. Existing environments on AWS or GCP were not impacted. As a mitigation item, the team is adding test workloads across cloud environments to simulate these edge cases and is also enhancing its test suites.