Cloud.gov incident

Outage for Cloud.gov logging system

Critical Resolved

Cloud.gov experienced a critical incident on May 12, 2025, affecting the Logs front end and lasting 1h 9m. The incident has been resolved; the full update timeline is below.

Started
May 12, 2025, 01:59 PM UTC
Resolved
May 12, 2025, 03:08 PM UTC
Duration
1h 9m
Detected by Pingoru
May 12, 2025, 01:59 PM UTC

Affected components

Logs front end

Update timeline

  1. identified May 12, 2025, 01:59 PM UTC

    In the process of upgrading our Cloud.gov logging system from version 2.19 to 3.0, the deployment experienced issues resulting in downtime for https://logs.fr.cloud.gov. We are working to fix the problem and will post an update as soon as possible.

  2. monitoring May 12, 2025, 02:04 PM UTC

    The logging system, https://logs.fr.cloud.gov/, is now up and appears to be responding normally to requests. We will continue to monitor the system closely for any issues.

  3. resolved May 12, 2025, 03:08 PM UTC

    The Cloud.gov logging system is fully stable and healthy. As with all incidents, we will be conducting a post-mortem of this outage in the coming days. Once our analysis is complete, we will share our findings and our plans to prevent a future recurrence of similar outages. Thank you for your patience and for being a Cloud.gov customer.

  4. postmortem May 14, 2025, 08:36 PM UTC

    The [Cloud.gov](http://Cloud.gov) team has conducted a post-mortem analysis of this incident. A timeline of the incident, our findings on what caused it, and the actions we have taken to prevent it from recurring are summarized below.

    **Timeline**

    * On Monday, May 12, at 8:15 AM, a deployment of our production customer logging system was initiated to upgrade the system from OpenSearch version 2.19 to version 3.0.
    * At 9:12 AM, [logs.fr.cloud.gov](http://logs.fr.cloud.gov) became inoperable due to a failure in the deployment of the nodes for OpenSearch Dashboards, which provide the visual user interface.
    * Once the outage was detected, the [Cloud.gov](http://Cloud.gov) team began investigating. We quickly discovered the issue and implemented a temporary fix.
    * At 9:38 AM, a new deployment was started in production to bring the OpenSearch Dashboards nodes back online.
    * By 10:10 AM, the new deployment had finished and the OpenSearch Dashboards nodes were restored to full health and running on version 3.0. At this time, the outage was resolved.

    **Findings**

    * The deployment plan for the OpenSearch system was configured to update the nodes for different components (data nodes, Dashboards) in serial, one at a time.
    * In the deployment plan, the data nodes were upgraded before the Dashboards nodes.
    * In the initial deployment, where the OpenSearch Dashboards nodes failed to upgrade, the upgrade of the data nodes completed successfully without any issues.
    * When the Dashboards nodes attempted to upgrade individually, they recognized that the data nodes were on version 3.0, but that at least one of the Dashboards nodes was still on version 2.19, which prevented the deployment from succeeding. This error was observed in the logs for OpenSearch Dashboards:

        ```
        This version of OpenSearch Dashboards (v2.19.0) is incompatible with the following OpenSearch nodes in your cluster: v3.0.0
        ```

    * To fix the deployment issues, the team updated the deployment plan to upgrade all OpenSearch Dashboards nodes **at the same time** rather than **serially**. As a result, all Dashboards nodes moved to version 3.0 together, so no incompatibility error with the data nodes occurred.

    **Actions taken**

    To prevent this incident from recurring, we have taken the following actions:

    * [Updated the deployment plan for the logging system so that Dashboards nodes are upgraded at the same time, not in serial](https://github.com/cloud-gov/deploy-logs-opensearch/pull/174)
    * [Documented the necessary order for nodes in the deployment configuration for the logging system](https://github.com/cloud-gov/deploy-logs-opensearch/pull/175)

    As always, we appreciate your patience as a customer of [Cloud.gov](http://Cloud.gov). If you have any questions about this incident, don’t hesitate to contact us at s[[email protected]](mailto:[email protected]).
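
The serial-versus-simultaneous finding in the post-mortem can be illustrated with a small Python model. This is a sketch under stated assumptions, not Cloud.gov's deployment tooling or OpenSearch's actual compatibility check: the `major` and `rollout` functions, their parameters, and the rule that a batch fails whenever any Dashboards node's major version differs from the data nodes' major version are all hypothetical, chosen only to mirror the behavior described in the findings.

```python
def major(version: str) -> int:
    """Return the major component of a dotted version string, e.g. '3.0.0' -> 3."""
    return int(version.split(".")[0])


def rollout(dashboards: list[str], data_version: str, target: str, batch_size: int) -> bool:
    """Simulate upgrading Dashboards nodes to `target` in batches of `batch_size`.

    Hypothetical rule (an assumption for illustration): after each batch is
    upgraded, the batch fails its health check if any Dashboards node still has
    a major version different from the data nodes' major version.
    Returns True if every batch passes, False as soon as one fails.
    """
    nodes = list(dashboards)  # copy so the caller's list is not modified
    for start in range(0, len(nodes), batch_size):
        # Upgrade every node in the current batch.
        for i in range(start, min(start + batch_size, len(nodes))):
            nodes[i] = target
        # Health check: all Dashboards nodes must match the data nodes' major version.
        if any(major(v) != major(data_version) for v in nodes):
            return False
    return True


if __name__ == "__main__":
    dashboards = ["2.19.0", "2.19.0"]  # hypothetical two-node Dashboards group
    # Data nodes were already upgraded to 3.0.0 before the Dashboards nodes.
    print("serial:  ", rollout(dashboards, "3.0.0", "3.0.0", batch_size=1))
    print("parallel:", rollout(dashboards, "3.0.0", "3.0.0", batch_size=len(dashboards)))
```

Under this model, the serial rollout (`batch_size=1`) returns `False` because one Dashboards node is still on 2.19 after the first batch, while upgrading both nodes in a single batch returns `True`. That matches the direction of the fix described above, although the real deployment plan may enforce compatibility differently.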