Cloud.gov incident

Cloud.gov and api.fr.cloud.gov Outage

Major, Resolved

Cloud.gov experienced a major incident on October 27, 2023. The incident has been resolved; the full update timeline is below.

Started
Oct 27, 2023, 03:30 PM UTC
Resolved
Oct 27, 2023, 03:30 PM UTC
Detected by Pingoru
Oct 27, 2023, 03:30 PM UTC

Update timeline

  1. resolved Oct 27, 2023, 06:39 PM UTC

    From approximately 11:34 AM ET to 1:38 PM ET, while the team was attempting to mitigate previous DDoS attacks, new WAF rules were added to the platform load balancers. As a result, some traffic targeting api.fr.cloud.gov was blocked. An additional change at 1:34 PM ET blocked access to a majority of the platform until 1:38 PM ET. The outage was resolved when the WAF rule changes were reverted and the revert was deployed to production at 1:38 PM ET.

    Timeline:

    * 11:31 AM ET: An internal cloud.gov tool began failing, alerting the platform team to investigate the failure.
    * 12:15 PM ET: A small subset of cloud.gov customers connecting to api.fr.cloud.gov from within the platform began to notice connection failures.
    * 1:30 PM ET: The platform team began to investigate the latest changes to the WAF rules as a possible cause.
    * 1:35 PM ET: Customers notified us that they could no longer access cloud.gov or api.fr.cloud.gov.
    * 1:38 PM ET: The WAF rule changes were reverted and functionality was restored to the platform.

    Update to this incident: after this notice was posted, some additional customers notified us that a large portion of the platform lost access to their applications but that access was restored. This happened during the 1:34 PM to 1:38 PM ET window.

  2. postmortem Nov 01, 2023, 06:35 PM UTC

    On October 21, 2023, [the platform experienced a partial outage due to a sustained increase in traffic](https://cloudgov.statuspage.io/incidents/n212qfbrqg83). In response to this incident, the [cloud.gov](http://cloud.gov) team immediately prioritized work to mitigate the effects of traffic surges on the platform. While the team did add valuable protections to the platform as part of that work, it was also a complex process due to the multi-tenant nature of [cloud.gov](http://cloud.gov) and the associated difficulty of ensuring that legitimate traffic is not blocked by any protections against malicious traffic.

    On October 27, 2023, the team received reports that some legitimate traffic to the platform was being blocked and began investigating. Once the causes of the traffic interruptions were identified, the team immediately applied fixes so that the legitimate traffic could be restored. Unfortunately, in the process of adjusting the web application firewall (WAF) rules that protect the platform from malicious traffic, around 1:35 PM ET an engineer made a change that blocked traffic from any IP that was not in the internal IP CIDR ranges or public egress IP CIDR ranges for [cloud.gov](http://cloud.gov). Since customer traffic cannot come from these IP ranges, the effect of this change was to block almost all traffic into the platform. In response to customers reporting outages for their sites and the team’s independent confirmation of a platform-wide outage, the problematic WAF rule was disabled around 1:38 PM ET and customer traffic was immediately restored.
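The effect of a rule scoped this way can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the CIDR ranges are placeholder and documentation ranges, the function names are invented for illustration, and the logic is a simplified model of the behavior described above, not cloud.gov's actual WAF configuration or rule syntax.

```python
import ipaddress

# Hypothetical stand-ins for the platform's own ranges (assumptions, not real values).
PLATFORM_RANGES = [
    ipaddress.ip_network("10.0.0.0/8"),      # placeholder for internal CIDR ranges
    ipaddress.ip_network("203.0.113.0/24"),  # placeholder for public egress CIDR ranges
]


def from_platform(ip: str) -> bool:
    """Return True if the source IP falls inside any platform range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PLATFORM_RANGES)


def rule_blocks(ip: str) -> bool:
    """Model of the rule as described: block every source IP outside the platform ranges."""
    return not from_platform(ip)


def blocked_fraction(sample_ips: list[str]) -> float:
    """Fraction of a sample of source IPs that the rule would block."""
    blocked = [ip for ip in sample_ips if rule_blocks(ip)]
    return len(blocked) / len(sample_ips)


# Hypothetical external customer addresses (documentation ranges).
SAMPLE_CUSTOMER_IPS = ["198.51.100.7", "192.0.2.44", "198.51.100.200"]

if __name__ == "__main__":
    print(f"Blocked share of sample customer traffic: {blocked_fraction(SAMPLE_CUSTOMER_IPS):.0%}")
```

Under these assumptions, every address in the external sample is blocked while addresses inside the platform ranges pass, which matches the platform-wide impact described above. A check of this kind, run against a sample of known legitimate client addresses in a lower environment, is one way such a rule could be flagged before it reaches production.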
As part of our normal post-incident process, the [cloud.gov](http://cloud.gov) team has conducted a post-mortem for this incident and determined that its primary causes were:

* Making changes to WAF rules directly in the production environment without promoting and testing them in lower environments first
* Complexity of managing multiple conditions on firewall rules
* Engineer fatigue and exhaustion from responding to multiple recent incidents
* Difficulty of testing WAF rules in lower environments prior to production

To address these issues, the team will pursue the following changes to our systems and processes:

* Always promote WAF changes through lower environments using normal CI deployment processes
* Rotate team members doing incident response at least every 48 hours

As always, thank you for being a [cloud.gov](http://cloud.gov) customer. If you have any questions, don’t hesitate to contact us at [email protected].