Camunda incident

Corrupted disk in multiple regions

Notice · Resolved

Camunda experienced a notice-level incident on February 27, 2024 affecting Operate, Optimize, Tasklist, and Zeebe, lasting 2d 3h. The incident has been resolved; the full update timeline is below.

Started
Feb 27, 2024, 05:13 PM UTC
Resolved
Feb 29, 2024, 08:54 PM UTC
Duration
2d 3h
Detected by Pingoru
Feb 27, 2024, 05:13 PM UTC

Affected components

Operate, Optimize, Tasklist, Zeebe

Update timeline

  1. investigating Feb 27, 2024, 05:13 PM UTC

    We've spotted issues with disks being mounted in several regions and are currently investigating.

  2. investigating Feb 27, 2024, 06:56 PM UTC

    We are still experiencing the issue and will proactively back up the data of all enterprise and professional clusters that are at risk of being affected. For versions >= 8.2.4, no downtime is expected. For versions < 8.2.4, a short downtime for all Camunda applications will occur.

  3. investigating Feb 27, 2024, 07:16 PM UTC

    We are in touch with our cloud provider and working together on a mitigation.

  4. monitoring Feb 27, 2024, 10:25 PM UTC

    The backups for all clusters >= 8.2.4 have been completed. Our cloud provider recommended that we mitigate the observed issues by migrating the workload to a prior GKE version. We do not see any disruptions of our services at the moment and will continue to work on the recommended mitigation.

  5. monitoring Feb 27, 2024, 10:44 PM UTC

    The backups for all clusters >= 8.2.4 have been completed. Our cloud provider recommended that we mitigate the observed issues by migrating the workload to a prior GKE version. We do not see any disruptions of our services at the moment and have worked on the recommended mitigation. We see fewer errors and will continue to monitor the situation.

  6. monitoring Feb 28, 2024, 09:39 AM UTC

    We followed the mitigation strategy of our cloud provider and have a workaround in place to resolve errors related to the disk issues. We don't expect interruptions to our services, and will continue to monitor the situation.

  7. monitoring Feb 28, 2024, 02:33 PM UTC

    The problems are still occurring and we are working with our cloud provider on a new strategy to remedy the situation.

  8. monitoring Feb 29, 2024, 03:57 AM UTC

    We continued working with our cloud provider and moved all affected services to stable nodes. We are now scaling down the affected workloads so that our cloud provider can apply a fix. This will have a minimal impact on the running services.

  9. monitoring Feb 29, 2024, 01:38 PM UTC

    We are continuing to work on a fix with our cloud provider. As of this statement, no production systems have been affected. We are in touch with customers who have been impacted.

  10. monitoring Feb 29, 2024, 05:02 PM UTC

    Our service provider has successfully applied a fix for all our workloads in all regions. Our operations are back to normal and all services have been restored. We will continue to monitor the situation.

  11. resolved Feb 29, 2024, 08:54 PM UTC

    Since the fix was applied, all services have been successfully restored and we no longer see any service interruptions.