Cloudera incident

Intermittent issues with the US Control plane.

Major, Resolved

Cloudera experienced a major incident on January 18, 2023 affecting Cloudera Management Console and Cloudera IAM, lasting 2h 40m. The incident has been resolved; the full update timeline is below.

Started
Jan 18, 2023, 02:00 PM UTC
Resolved
Jan 18, 2023, 04:40 PM UTC
Duration
2h 40m
Detected by Pingoru
Jan 18, 2023, 02:00 PM UTC

Affected components

Cloudera Management Console, Cloudera IAM

Update timeline

  1. investigating Jan 18, 2023, 03:00 PM UTC

    The SRE team is currently investigating an issue with the us-west control plane. This impacts all management operations, including the ability to launch clusters and modify users. Please note that customer workloads are not impacted.

  2. investigating Jan 18, 2023, 03:10 PM UTC

    We are continuing to investigate this issue.

  3. investigating Jan 18, 2023, 03:12 PM UTC

    The issue appears to be partially resolved.

  4. monitoring Jan 18, 2023, 03:18 PM UTC

    A fix has been implemented and we are monitoring the results.

  5. resolved Jan 18, 2023, 04:40 PM UTC

    All services are back up and operating as expected.

  6. postmortem Jan 30, 2023, 04:11 AM UTC

    On January 18, 2023, between 13:00 UTC and 15:00 UTC, the us-west CDP Control Plane was in a degraded state, causing environment creation failures. The teams were notified about this incident immediately. On investigation, the team found that one of the services responsible for FreeIPA management was suffering from resource starvation. Although this service was configured to be highly available, the cascading impact caused additional service instances to fail. This eventually led to failures of new environment creation, although existing environments continued to work. As a mitigation, the team is working on adding alerts to monitor for this situation and on automatically increasing the resources available to this service.
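
The mitigation described in the postmortem (alerting on resource starvation and increasing resources automatically) could look roughly like the sketch below. This is an illustration only, not Cloudera's implementation: the `InstanceSample` shape, the `evaluate_service` function, the instance names, and the threshold values are all hypothetical and do not come from the incident report.

```python
from dataclasses import dataclass


@dataclass
class InstanceSample:
    """One resource reading for a single service instance (hypothetical shape)."""
    instance_id: str
    memory_used_fraction: float  # 0.0 to 1.0
    healthy: bool


def evaluate_service(samples, starvation_threshold=0.9, min_healthy=2):
    """Return (alerts, suggested_replicas) for one highly available service.

    The thresholds are illustrative assumptions, not values from the postmortem.
    """
    alerts = []
    healthy = [s for s in samples if s.healthy]
    starved = [s for s in healthy if s.memory_used_fraction >= starvation_threshold]

    if starved:
        ids = ", ".join(s.instance_id for s in starved)
        alerts.append(f"resource starvation on: {ids}")
    if len(healthy) < min_healthy:
        alerts.append(f"only {len(healthy)} healthy instance(s), need {min_healthy}")

    # Suggest one extra replica per starved or failed instance, so that load
    # shifted away from a failing instance does not starve the survivors.
    failed = len(samples) - len(healthy)
    suggested_replicas = len(samples) + len(starved) + failed
    return alerts, suggested_replicas


if __name__ == "__main__":
    readings = [
        InstanceSample("ipa-mgmt-0", 0.95, True),
        InstanceSample("ipa-mgmt-1", 0.60, True),
        InstanceSample("ipa-mgmt-2", 0.99, False),
    ]
    print(evaluate_service(readings))
```

The point of checking healthy-instance count alongside per-instance usage is that a highly available service can still degrade as a whole: when one instance fails, its load moves to the others, which is the cascading pattern the postmortem describes.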