Cloud.gov experienced a minor incident on August 14, 2025, affecting Logs intake and storage, lasting 2h 4m. The incident has been resolved; the full update timeline is below.
Affected components
- Logs intake and storage
Update timeline
- identified Aug 14, 2025, 07:51 PM UTC
We are again having issues with our log ingestion rate for https://logs.fr.cloud.gov. Consequently, there may be a delay before customer application logs appear on the system. We are actively investigating the cause of the slow log ingestion and working towards a solution.
- resolved Aug 14, 2025, 09:56 PM UTC
After adjusting some data node configuration, log ingestion has recovered and has stayed up to date. As always, the Cloud.gov team takes these incidents very seriously. We will conduct a post-mortem analysis of this incident in the coming days and publish our findings.
- postmortem Aug 15, 2025, 02:50 PM UTC
**Summary**

On August 14, 2025, we had intermittent issues with delayed log ingestion. During these periods, customers may not have seen logs from the previous 1 to 2 hours on [logs.fr.cloud.gov](http://logs.fr.cloud.gov). We believe we have identified and fixed the cause of the delayed ingestion, so we do not expect the problem to recur.

**Timeline**

* August 14, 9:13 AM - A smoke test failed for the logs system, [logs.fr.cloud.gov](http://logs.fr.cloud.gov). A Cloud.gov engineer began investigating the failure and noticed that log ingestion was delayed.
* 10:30 AM - An engineer scaled up the log ingestion infrastructure.
* 1:52 PM - Log ingestion rates returned to normal, and the system was again ingesting logs in near real time.
* 3:20 PM - A Cloud.gov engineer noticed that log ingestion was delayed again.
* 4:12 PM - An engineer made an OpenSearch configuration change to control how data ingestion is distributed across nodes.
* 5:47 PM - Log ingestion rates returned to normal, and the system was again ingesting logs in near real time.

**Impact**

Intermittently, customers may not have seen their logs appear on [logs.fr.cloud.gov](http://logs.fr.cloud.gov) in real time; logs may have been delayed by 1 to 2 hours. Although logs were delayed, **there was no loss of logs**.

**Root Cause**

While investigating the second episode of delayed log ingestion, we discovered that ingestion was being over-allocated to a single data node, which overwhelmed that node's CPU resources and caused the delays. To fix this, we updated an OpenSearch setting that controls how many shards of each index can be allocated to a single node, which distributed ingestion more evenly across all of the data nodes.
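The postmortem does not name the setting that was changed. It most likely corresponds to OpenSearch's per-index `index.routing.allocation.total_shards_per_node`, but that is our assumption. The sketch below is a toy model, not OpenSearch's real shard balancer (it ignores the other balancing factors OpenSearch applies): it places each new shard on the node holding the fewest shards overall, which is enough to show how one under-filled node can end up with every shard of a new, write-heavy index, and how a per-index, per-node cap spreads those shards out. The helpers `allocate`, `min_shards_per_node` and `settings_body`, the node names and the shard counts are all hypothetical.

```python
import json
import math
from collections import Counter


def allocate(new_shards, existing_totals, cap=None):
    """Toy allocator: place each shard of a new index on the node that holds
    the fewest shards overall, skipping nodes that already hold `cap` shards
    of this index. Ties are broken by node name. Not OpenSearch's balancer."""
    totals = dict(existing_totals)
    placed = Counter()
    for _ in range(new_shards):
        candidates = [n for n in totals if cap is None or placed[n] < cap]
        if not candidates:
            raise ValueError("cap too low: a shard cannot be assigned")
        node = min(candidates, key=lambda n: (totals[n], n))
        totals[node] += 1
        placed[node] += 1
    return placed


def min_shards_per_node(primaries, replicas, data_nodes):
    """Smallest per-node cap that still lets every copy of an index be assigned."""
    total_copies = primaries * (1 + replicas)
    return math.ceil(total_copies / data_nodes)


def settings_body(cap):
    """Body for PUT /<index>/_settings (setting name assumed, see lead-in)."""
    return json.dumps({"index.routing.allocation.total_shards_per_node": cap})


if __name__ == "__main__":
    # Hypothetical cluster: node-c holds fewer shards than its peers.
    nodes = {"node-a": 10, "node-b": 10, "node-c": 4}
    print("no cap:", dict(allocate(6, nodes)))
    cap = min_shards_per_node(primaries=6, replicas=0, data_nodes=len(nodes))
    print("cap =", cap, dict(allocate(6, nodes, cap=cap)))
    print(settings_body(cap))
```

Without a cap, all six new shards land on `node-c`, because it keeps having the lowest total until it catches up with the others. With the cap at 2, each node takes two. The helper rounds up because a cap below `total_copies / data_nodes` would leave some shard copies unassigned; by the same arithmetic, a cap that is too tight can also leave shards unassigned if a data node is lost.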
Since we changed the shard allocation setting, log ingestion rates have remained stable and logs have been ingested in near real time.

**Next Steps**

* We will improve our internal documentation on how to diagnose and troubleshoot delayed log ingestion.
* We will keep the log ingestion infrastructure at its scaled-up level.

Thank you for your patience. If you have any questions, please contact us at [[email protected]](mailto:[email protected]).