Cloudera experienced a major incident on February 14, 2024 affecting Cloudera Data Flow and Cloudera Data Engineering and 1 more component, lasting 1h 44m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Feb 14, 2024, 04:05 PM UTC
Current Status: We are currently investigating a potential issue with the FreeIPA service. Customer Experience: Customer may observe FreeIPA service in an unreachable status.
- investigating Feb 14, 2024, 04:19 PM UTC
We are continuing to investigate this issue.
- identified Feb 14, 2024, 04:26 PM UTC
Current Status: Our teams have identified the source of the issue. We are working on developing and implementing a solution to restore the service(s). We will have another update within 60 mins. Customer Experience: Customer may observe FreeIPA service in an unreachable status. Incident Start time: ~10:53 UTC Feburary 14th, 2024
- monitoring Feb 14, 2024, 04:54 PM UTC
Current Status: Our teams have identified the source of the issue and have implemented a solution which is under monitoring. We will have another update within 60 mins. Customer Experience: Customer may observe FreeIPA service in an unreachable status. Incident Start time: ~10:53 UTC Feburary 14th, 2024
- resolved Feb 14, 2024, 05:50 PM UTC
Current Status: Our teams have confirmation that the solution implemented has addressed the issue. If you continue to experience issues please raise a support case with us. A root cause analysis (RCA) will be published within seven business days. Customer Experience: Customers may observe FreeIPA service in an unreachable status. Incident Start time: ~10:53 UTC Feburary 14th, 2024
- postmortem Feb 24, 2024, 02:03 AM UTC
On February 14, 2024, a scheduled monthly Kubernetes maintenance activity unintentionally caused a 30-minute outage for the FreeIPA service within the Control Plane. While the service itself remained functional on workloads, the issue stemmed from a bug introduced during the maintenance. This bug impacted the Inverting Proxy component, responsible for facilitating communication between the Control Plane and workloads, leading to the temporary disruption. The team promptly identified and rectified the bug, restoring full service approximately within 30 minutes. Additionally, we have implemented corrective measures within the maintenance automation to prevent similar occurrences in the future. We sincerely apologize for the inconvenience this incident may have caused.