Cloud.gov incident

Intermittent issues connecting to S3 buckets

Cloud.gov experienced a major incident on August 30, 2024 affecting AWS s3-us-gov-west-1, lasting 3d 16h. The incident has been resolved; the full update timeline is below.

Started: Aug 30, 2024, 10:04 PM UTC
Resolved: Sep 03, 2024, 02:27 PM UTC
Duration: 3d 16h
Detected by Pingoru: Aug 30, 2024, 10:04 PM UTC

Affected components

AWS s3-us-gov-west-1

Update timeline

investigating Aug 30, 2024, 10:04 PM UTC

Some customers are experiencing intermittent "connection refused" errors when their apps are connecting to S3 buckets. We are investigating the issue and coordinating with AWS support. We will update this incident as we discover further information.
monitoring Aug 30, 2024, 10:52 PM UTC

We believe we have identified the source of the problem. Our "trusted_local_networks_egress" security group, which allows applications to connect to S3, was not allowing egress to all of the possible IP ranges for S3. We have updated the "trusted_local_networks_egress" security group to allow egress to all of the IP ranges for S3 published by AWS. In our testing, the updated egress IP ranges appear to have resolved the issue, but we will continue to monitor and update this page as necessary.
resolved Sep 03, 2024, 02:27 PM UTC

After monitoring this incident over the weekend, we feel confident that the issues are now resolved. If you continue to experience issues, please contact us at [email protected]. As with all incidents, the cloud.gov team will be conducting a post-mortem analysis of this incident and publishing our findings in the coming days. Thank you for being a cloud.gov customer!
postmortem Sep 03, 2024, 06:44 PM UTC

**Incident response** At 1:35 PM on Friday, August 30, some customers began reporting 502 errors for requests to S3 from their applications. Customers also reported that the failures were intermittent: requests to S3 would sometimes succeed and sometimes fail. Initial investigation by the [Cloud.gov](http://cloud.gov) team found no general S3 outage on the AWS side. The [Cloud.gov](http://cloud.gov) team then engaged the AWS support team, who confirmed that 502 errors do not come from S3 itself and so must originate from something running within the [Cloud.gov](http://cloud.gov) infrastructure.

Around 6 PM ET, the [Cloud.gov](http://cloud.gov) team began making requests to public objects in S3 buckets from **within** the affected applications using `curl`, which reproduced the intermittent failures reported by customers. This diagnostic technique also revealed that requests to S3 failed whenever the hostname for the S3 bucket resolved to an IP in the CIDR range `108.175.56.0/22`. Because the failures mapped to a specific IP range, the [Cloud.gov](http://cloud.gov) team suspected that the allowed egress IP ranges for S3 in [the application security groups on the platform](https://cloud.gov/docs/management/space-egress/) were misconfigured. Further investigation confirmed that the `trusted_local_networks_egress` security group, which should allow requests to reach S3, was not allowing requests out to the `108.175.56.0/22` CIDR range.

Around 6:30 PM, the [Cloud.gov](http://cloud.gov) team added the `108.175.56.0/22` CIDR block to the allowed IP egress ranges for the `trusted_local_networks_egress` security group. The [Cloud.gov](http://Cloud.gov) team then re-ran the `curl` tests from within the affected applications and observed that requests to S3 that resolved to the `108.175.56.0/22` CIDR range were now succeeding.
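The check described above, mapping each resolved S3 address to the suspect CIDR range, can be sketched in a few lines of Python. This is a minimal illustration and not the tooling the Cloud.gov team used (they report using `curl`); the function names are hypothetical, and only the `108.175.56.0/22` range comes from the incident report.

```python
import ipaddress
import socket

# CIDR range named in the postmortem as the one that was being blocked.
SUSPECT_CIDR = ipaddress.ip_network("108.175.56.0/22")


def in_suspect_range(ip, network=SUSPECT_CIDR):
    """Return True if the IP address falls inside the given network."""
    return ipaddress.ip_address(ip) in network


def resolve_ipv4(hostname, port=443):
    """Resolve a hostname to its current IPv4 addresses (needs DNS access)."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def classify(ips):
    """Map each IP to whether it lies inside the suspect range."""
    return {ip: in_suspect_range(ip) for ip in ips}
```

Running `classify(resolve_ipv4(bucket_hostname))` repeatedly from inside an affected application, where `bucket_hostname` is a placeholder for a real bucket endpoint, would show whether failed requests line up with addresses in the suspect range.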
**Post-incident analysis** After resolving the incident, the [Cloud.gov](http://Cloud.gov) team began investigating **why** requests to the `108.175.56.0/22` CIDR range began failing in the first place. The team quickly discovered that while the platform has infrastructure code to automatically retrieve the IP ranges for S3 from AWS, this code **was hard-coded to only expect two IP ranges**, while in reality there were now **more than two** IP ranges that needed to be supported for egress to S3. In addition to the hard-coded expectation of only two IP ranges, the team found that there is no automatic trigger to update the IP ranges supported by [Cloud.gov](http://Cloud.gov) when they change in AWS.

The question still remained as to why customers began experiencing these issues precisely on August 30, 2024. The [Cloud.gov](http://Cloud.gov) team found that the deployment job which updates the IP ranges for the application security groups was run on August 29, 2024 and had inadvertently removed the `108.175.56.0/22` CIDR range because of the hard-coded expectation of only two IP ranges. This explains why customers began experiencing failures on S3 requests to that IP range the following day, August 30, 2024.

**Action items** [The Cloud.gov team has already addressed the infrastructure code issue with expecting only two IP ranges for S3](https://github.com/cloud-gov/cg-provision/pull/1749). To ensure that [Cloud.gov](http://Cloud.gov) application security group egress rules stay in sync with the IP ranges published by AWS, the [Cloud.gov](http://Cloud.gov) team has planned work to trigger our infrastructure update jobs whenever [AWS publishes a new version of the supported IP ranges for all of their services in a public JSON file](https://ip-ranges.amazonaws.com/ip-ranges.json).
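To illustrate the failure mode, the sketch below collects every S3 prefix for a region from the AWS IP ranges file without assuming how many there are, then reports any published range that the allowed egress list does not cover. It is not the code in the linked pull request. The function names are hypothetical, and the sketch assumes the published file layout (a top-level `prefixes` list whose entries carry `ip_prefix`, `region`, and `service`) and that the relevant region string is `us-gov-west-1`.

```python
import ipaddress
import json
import urllib.request

IP_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"


def fetch_ip_ranges(url=IP_RANGES_URL):
    """Download and parse the AWS IP ranges file (needs network access)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def s3_prefixes(ip_ranges, region):
    """Return every IPv4 S3 prefix for a region, however many are published."""
    return sorted(
        {
            entry["ip_prefix"]
            for entry in ip_ranges.get("prefixes", [])
            if entry.get("service") == "S3" and entry.get("region") == region
        }
    )


def missing_egress(published, allowed):
    """Return published CIDRs that no allowed egress CIDR fully covers."""
    allowed_nets = [ipaddress.ip_network(cidr) for cidr in allowed]
    missing = []
    for cidr in published:
        net = ipaddress.ip_network(cidr)
        covered = any(
            net.version == a.version and net.subnet_of(a) for a in allowed_nets
        )
        if not covered:
            missing.append(cidr)
    return missing
```

A scheduled job could call `missing_egress(s3_prefixes(fetch_ip_ranges(), "us-gov-west-1"), current_rules)`, with `current_rules` standing in for the egress CIDRs currently configured, and alert or redeploy when the result is non-empty.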
**Conclusion** The [Cloud.gov](http://Cloud.gov) team will take this incident as an opportunity to improve our technology and our processes to prevent a recurrence. We appreciate your patience and thank you for being a [Cloud.gov](http://Cloud.gov) customer!