Camunda incident

Corrupted disk in multiple regions

Camunda experienced a notice-level incident on February 27, 2024, affecting Operate, Optimize, Tasklist, and Zeebe, lasting 2d 3h. The incident has been resolved; the full update timeline is below.

Started: Feb 27, 2024, 05:13 PM UTC
Resolved: Feb 29, 2024, 08:54 PM UTC
Duration: 2d 3h
Detected by Pingoru: Feb 27, 2024, 05:13 PM UTC

Affected components

Operate, Optimize, Tasklist, Zeebe

Update timeline

investigating Feb 27, 2024, 05:13 PM UTC

We have detected problems with disks being mounted in several regions and are currently investigating.
investigating Feb 27, 2024, 06:56 PM UTC

We are still experiencing the issue and will proactively back up the data of all Enterprise and Professional clusters that are at risk of being affected. For versions >= 8.2.4, no downtime is expected. For versions < 8.2.4, a short downtime for all Camunda applications will occur.
investigating Feb 27, 2024, 07:16 PM UTC

We are in touch with our cloud provider and working together on a mitigation.
monitoring Feb 27, 2024, 10:25 PM UTC

The backups for all clusters >= 8.2.4 have been completed. Our cloud provider recommended that we mitigate the observed issues by migrating the workload to an earlier GKE version. We do not see any disruptions to our services at the moment and will continue to work on the recommended mitigation.
monitoring Feb 27, 2024, 10:44 PM UTC

The backups for all clusters >= 8.2.4 have been completed. Our cloud provider recommended that we mitigate the observed issues by migrating the workload to an earlier GKE version. We do not see any disruptions to our services at the moment and have worked on the recommended mitigation. We see fewer errors and will continue to monitor the situation.
monitoring Feb 28, 2024, 09:39 AM UTC

We followed the mitigation strategy of our cloud provider and have a workaround in place to resolve errors related to the disk issues. We don't expect interruptions to our services, and will continue to monitor the situation.
monitoring Feb 28, 2024, 02:33 PM UTC

The problems are still occurring and we are working with our cloud provider on a new strategy to remedy the situation.
monitoring Feb 29, 2024, 03:57 AM UTC

We continued working with our cloud provider and moved all affected services to stable nodes. We are now scaling down the affected workloads so that our cloud provider can apply a fix. This will have a minimal impact on the running services.
monitoring Feb 29, 2024, 01:38 PM UTC

We are continuing to work on a fix with our cloud provider. As of this statement, no production systems have been affected. We are in touch with customers who have been impacted.
monitoring Feb 29, 2024, 05:02 PM UTC

Our service provider has successfully applied a fix to all our workloads in all regions. Our operations are back to normal and all services have been restored. We will continue to monitor the situation.
resolved Feb 29, 2024, 08:54 PM UTC

Since the fix was applied, all services have been successfully restored and we no longer see any service interruptions.