Cloud.gov incident

Log outage for OpenSearch

Minor Resolved

Cloud.gov experienced a minor incident on April 17, 2025, affecting the Logs front end and lasting 3h 55m. The incident has been resolved; the full update timeline is below.

Started
Apr 17, 2025, 03:46 PM UTC
Resolved
Apr 17, 2025, 07:42 PM UTC
Duration
3h 55m
Detected by Pingoru
Apr 17, 2025, 03:46 PM UTC

Affected components

Logs front end

Update timeline

  1. investigating Apr 17, 2025, 03:46 PM UTC

    We have noticed that no customer logs have appeared in OpenSearch (https://logs.fr.cloud.gov) since approximately 10:11 AM ET. We are investigating and will provide an update as soon as we know more.

  2. monitoring Apr 17, 2025, 04:56 PM UTC

    Update – 12:56 PM ET

    Status:
    - Logs are flowing into the OpenSearch cluster again, but indices are still catching up to real time.
    - Full real-time ingestion is expected to resume within the next few hours.
    - In the meantime, stream live application logs with: cf logs APP_NAME

    ----

    Technical Details

    Durable storage & caching: Application logs are first written to S3 for durability, then passed through a cache before landing in OpenSearch. This two-step process ensures no data loss even if the cluster becomes temporarily unavailable.

    Root cause: Several OpenSearch data nodes exceeded their disk-usage high watermark. When this threshold is crossed, OpenSearch marks the affected indices as read-only and rejects new writes.

    Mitigation: We increased storage capacity on the affected nodes and rebalanced shard allocation across the cluster. The cluster is now healthy and processing the backlog of cached logs.

    ----

    Next Update: We will continue to monitor cluster health and ingestion rates. Our next status update will be posted by 3:30 PM ET, or sooner if anything changes.
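    The root cause above comes down to comparing each data node's disk usage against a set of thresholds. The following is a minimal sketch of that comparison, not Cloud.gov's tooling. It assumes the stock OpenSearch default watermarks (85% low, 90% high, 95% flood stage); the cluster in this incident may be configured differently. In stock OpenSearch the read-only index block is tied to the flood-stage threshold, so the sketch reports all three levels. The `NodeDisk` type and `classify` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

# Assumption: stock OpenSearch default disk watermarks, in percent of disk used.
# The cluster described in this incident may use different values.
LOW_WATERMARK = 85.0          # no new shards are allocated to the node
HIGH_WATERMARK = 90.0         # shards start being relocated off the node
FLOOD_STAGE_WATERMARK = 95.0  # indices with a shard on the node get a write block


@dataclass
class NodeDisk:
    """Hypothetical record of one data node's disk usage."""
    name: str
    used_bytes: int
    total_bytes: int

    @property
    def used_percent(self) -> float:
        return 100.0 * self.used_bytes / self.total_bytes


def classify(node: NodeDisk) -> str:
    """Return the highest watermark the node has reached, or "ok"."""
    pct = node.used_percent
    if pct >= FLOOD_STAGE_WATERMARK:
        return "flood_stage"
    if pct >= HIGH_WATERMARK:
        return "high"
    if pct >= LOW_WATERMARK:
        return "low"
    return "ok"
```

    Under these assumed thresholds, a node at 92% usage classifies as "high" and a node at 96% as "flood_stage". Adding disk capacity, as in the mitigation above, lowers the used percentage without deleting any data.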

  3. resolved Apr 17, 2025, 07:42 PM UTC

    Status:
    • The OpenSearch cluster has processed the entire backlog and is now ingesting logs in real time without delay.
    • All indices are writable and healthy, and write throughput remains stable.

    Resolution Details:
    • We increased disk capacity on the affected data nodes and rebalanced shard allocation to clear the high-watermark condition.
    • The cluster’s health is green and all new log events are successfully indexed.
    • Live log streaming via cf logs APP_NAME continues to work as expected.

    Next Steps:
    • We will keep a heightened watch on disk usage and shard distribution over the next 24 hours to ensure sustained health.
    • If you notice any further issues with log visibility or performance, please open a support ticket.

    Thank you for your patience and apologies for any inconvenience.
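    The backlog behavior described in this incident (logs written to durable storage first, held in a cache, then indexed once the cluster accepts writes again) can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not Cloud.gov's pipeline: `FakeIndex`, `LogPipeline`, and `IndexReadOnlyError` are hypothetical stand-ins, a Python list plays the role of S3, and a deque plays the role of the cache.

```python
from collections import deque


class IndexReadOnlyError(Exception):
    """Raised by the stand-in index when it rejects a write."""


class FakeIndex:
    """Stand-in for the search cluster; not a real OpenSearch client."""

    def __init__(self):
        self.read_only = False
        self.docs = []

    def write(self, doc):
        if self.read_only:
            raise IndexReadOnlyError("index is read-only")
        self.docs.append(doc)


class LogPipeline:
    """Hypothetical two-step pipeline: durable store first, then index."""

    def __init__(self, index):
        self.durable_store = []  # stand-in for S3
        self.pending = deque()   # stand-in for the cache
        self.index = index

    def ingest(self, line):
        self.durable_store.append(line)  # the durable write happens first
        self.pending.append(line)
        self.drain()

    def drain(self):
        """Forward pending lines in order; stop at the first rejected write."""
        while self.pending:
            try:
                self.index.write(self.pending[0])
            except IndexReadOnlyError:
                return  # leave the line queued for a later retry
            self.pending.popleft()
```

    In this model, lines ingested while the index is read-only stay in `pending` and in `durable_store`; calling `drain()` after the index becomes writable again indexes them in their original order, which mirrors the backlog catch-up reported above.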