Cloudera incident

Control Plane Issues

Severity: Major | Status: Resolved

Cloudera experienced a major incident on October 5, 2023, affecting Cloudera Management Console, Cloudera IAM, and eight other components, and lasting 3h 39m. The incident has been resolved; the full update timeline is below.

Started
Oct 05, 2023, 07:27 PM UTC
Resolved
Oct 05, 2023, 11:07 PM UTC
Duration
3h 39m
Detected by Pingoru
Oct 05, 2023, 07:27 PM UTC

Affected components

Cloudera Management Console, Cloudera IAM, Cloudera Data Flow, Cloudera Data Engineering, Cloudera Data Warehouse, Cloudera Operational Database, Cloudera AI, Cloudera Data Hub, Cloudera Data Catalog, Cloudera Replication Manager

Update timeline

  1. investigating Oct 05, 2023, 07:27 PM UTC

    We are investigating problems affecting the US control plane.

  2. identified Oct 05, 2023, 08:49 PM UTC

    Issues have been identified in an internal control plane service. Multiple services are currently impacted.

  3. monitoring Oct 05, 2023, 09:26 PM UTC

    We've implemented a fix to our control plane service. Services are recovering, albeit with higher latency.

  4. monitoring Oct 05, 2023, 10:34 PM UTC

    The overall control plane has stabilized. We're investigating connectivity issues that are impacting some customers, and a potential fix is being implemented to resolve them.

  5. resolved Oct 05, 2023, 11:07 PM UTC

    The fix implemented has seen positive results, resolving connectivity issues.

  6. postmortem Oct 16, 2023, 12:30 PM UTC

    On Oct 5, 2023, customers reported that Cloudera Manager instances in their environments were showing UNREACHABLE status for both Data Lakes and Data Hubs. Further investigation identified two independent production changes that contributed to the issue:

    1. A new software release for connectivity between the Control Plane and Workload hit an edge case during certain operations. This was addressed by rolling back to the previous software version.
    2. The resources allocated to one of our internal systems, which is responsible for storing key secrets, had been decreased. This was addressed by reverting to the previous configuration.

    Neither issue manifested in our lower environments, where the changes were tested before being rolled out to Production. As a mitigation, our teams are working on adding monitoring and alerting around these corner cases and will introduce additional checks and balances before any resource changes are implemented in our production system.