Cloud.gov experienced a major incident on March 8, 2024, affecting the Logs front end and lasting 2h 33m. The incident has been resolved; the full update timeline is below.
Affected components
- Logs front end
Update timeline
- identified Mar 08, 2024, 10:01 PM UTC
We have received reports from customers that using the "cf logs" CLI command to retrieve logs from their applications either does not work or does not show recent logs. Customers have confirmed that real-time logs are still being received in the customer logs Elasticsearch/Kibana instance at https://logs.fr.cloud.gov and are still being sent correctly through log drains. Our team has identified the likely cause as an expired certificate for the Log Cache component, which the "cf logs" CLI command uses to retrieve logs. The certificate expired at approximately 1:18 PM ET (6:18 PM UTC). We are working to remediate the issue.
- identified Mar 08, 2024, 10:23 PM UTC
We have renewed the certificate for the Log Cache component and started a full redeployment of our production system to apply the renewed certificate to it. It may take several hours for the renewed certificate to roll out through the system; we will post an update once we can confirm that it has been applied.
- resolved Mar 09, 2024, 12:35 AM UTC
The Log Cache system has been updated with the renewed certificate. Our testing indicates that real-time logs can now be successfully retrieved using the "cf logs" CLI command. As with all incidents, the cloud.gov team will conduct a post-mortem analysis of this incident in the coming days and post our findings here as an update. Thank you for being a cloud.gov customer!
- postmortem Mar 13, 2024, 01:55 PM UTC
As part of our normal incident response process, we conducted a post-mortem analysis to determine why this incident occurred and how to improve our operations going forward.

Our main findings as to why this incident occurred were:
* Monitoring for pending certificate expiration is currently a manual process
* The week of this incident was particularly busy due to other incidents
* The user interface for monitoring expiring certificates shows some “false positives”, which create confusion

To address these findings and prevent a similar incident in the future, we have planned the following work:
* Remove the “false positive” expired certificates in our certificate monitoring tool
* Add Slack alerts for expiring certificates to make the review process less manual and ensure that expiring certificates don’t get missed
* Schedule formal handoffs between engineers on maintenance rotations who are responsible for certificate renewal to ensure continuity of operations

As always, we appreciate your patience and thank you for being a [cloud.gov](http://cloud.gov) customer. If you have any questions, don’t hesitate to contact us at [email protected].
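
For readers who want to automate a similar check, the planned alerting on expiring certificates might look roughly like the Python sketch below. It is only an illustration under stated assumptions, not the cloud.gov team's actual tooling: `CertRecord`, `expiring_certs`, `format_alert`, the 30-day warning window, and the sample inventory are all hypothetical. Posting the resulting message to Slack is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CertRecord:
    """Hypothetical inventory entry: a certificate name and its expiry time."""
    name: str
    not_after: datetime  # timezone-aware expiry (notAfter) timestamp


def expiring_certs(records, now, warn_within=timedelta(days=30)):
    """Return records already expired or expiring within `warn_within` of `now`,
    ordered by soonest expiry first."""
    cutoff = now + warn_within
    due = [r for r in records if r.not_after <= cutoff]
    return sorted(due, key=lambda r: r.not_after)


def format_alert(record, now):
    """Build a one-line alert message that could be posted to a chat channel."""
    remaining = record.not_after - now
    if remaining <= timedelta(0):
        return f"EXPIRED: {record.name} expired at {record.not_after.isoformat()}"
    return f"WARNING: {record.name} expires in {remaining.days} day(s)"


if __name__ == "__main__":
    # Illustrative data only; the names and dates are assumptions.
    now = datetime(2024, 3, 1, tzinfo=timezone.utc)
    inventory = [
        CertRecord("log-cache", datetime(2024, 3, 8, 18, 18, tzinfo=timezone.utc)),
        CertRecord("example-other", datetime(2025, 1, 1, tzinfo=timezone.utc)),
    ]
    for record in expiring_certs(inventory, now):
        print(format_alert(record, now))
```

Keeping the expiry check as a pure function of the inventory and the current time makes it easy to run on a schedule and to test without contacting any live system.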