Cloud.gov incident

Applications and platform login were non-responsive

Notice Resolved

Cloud.gov experienced a notice-level incident on April 25, 2023. The incident has been resolved; the full update timeline is below.

Started
Apr 25, 2023, 03:51 PM UTC
Resolved
Apr 25, 2023, 04:00 PM UTC
Duration
Detected by Pingoru
Apr 25, 2023, 03:51 PM UTC

Update timeline

  1. resolved Apr 25, 2023, 03:51 PM UTC

    Around 11:45 AM ET, several cloud.gov customers reported that their applications were not responding to requests and that they could not log in to the platform. Around 11:47 AM ET, the platform became responsive again and customers reported that their applications were accessible. The cloud.gov team has identified the cause of the outage as a brief DDoS attack that resulted in resource exhaustion. During this time, customer applications continued to run on the platform, but requests coming in from the internet were very slow or timed out. The team is continuing to investigate, will hold a retrospective, and will publish an update in the days ahead.

  2. postmortem Jun 09, 2023, 05:44 PM UTC

    # What Happened

    The [cloud.gov](http://cloud.gov/) team has reviewed the incident internally and is publishing this final update.

    The platform load balancers received a massive amount of circular traffic out of and back into the platform, which consumed a large amount of resources on the platform gorouters. These gorouters route incoming web traffic to specific customer application instances. While the gorouters never went down, they became very busy and took longer to respond to all HTTP requests, including health-check requests from the load balancers. This slowdown caused the gorouters to fail the load balancers' health checks, and those failures led to HTTP 5XX errors for our customers.

    # What We’re Doing

    To prevent this from happening again, the [cloud.gov](http://cloud.gov) team is taking these steps:

    * Reviewing and adjusting the ELB health checks to allow slightly more time before failure.
    * Looking into new logging and monitoring tools for the platform as a whole, and taking this data flow into consideration for the new tooling.
    * Adjusting AWS WAF to identify and control more of the traffic coming into the platform.
    * Hosting an internal ELB, which would decrease the possibility of circular traffic out of and back into the platform, mainly for customers using CUPS services hosted on the platform.

    Any change that affects the system architecture will be made through [cloud.gov](http://cloud.gov)’s standard change management process.

    For any questions regarding the incident, please email [[email protected]](mailto:[email protected]) or file a ticket with the [cloud.gov](http://cloud.gov/) service desk.

    Thank you for your understanding and for being a [cloud.gov](http://cloud.gov/) customer.
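The health-check failure mode described in the postmortem can be illustrated with a small simulation. This is a minimal sketch, not the cloud.gov or AWS implementation. The `HealthCheckConfig` class, the `simulate` function, the latency values, and both timeout settings are hypothetical and chosen only for illustration. It models a target that is slow but never down, and shows how a tighter timeout marks it unhealthy while a more lenient one does not.

```python
from dataclasses import dataclass


@dataclass
class HealthCheckConfig:
    """Hypothetical health-check settings; not cloud.gov's actual values."""
    timeout_s: float          # a check fails if the response takes longer than this
    unhealthy_threshold: int  # consecutive failures before the target is marked unhealthy
    healthy_threshold: int    # consecutive successes before it is marked healthy again


def simulate(latencies_s, cfg):
    """Return the target's health state (True = healthy) after each check."""
    healthy = True
    fails = 0
    passes = 0
    states = []
    for latency in latencies_s:
        if latency > cfg.timeout_s:
            # Slow response: counts as a failed check even though the target is up.
            fails += 1
            passes = 0
        else:
            passes += 1
            fails = 0
        if healthy and fails >= cfg.unhealthy_threshold:
            healthy = False
        elif not healthy and passes >= cfg.healthy_threshold:
            healthy = True
        states.append(healthy)
    return states


# Assumed response times in seconds for a router that is busy but never down.
latencies = [0.2, 0.3, 3.1, 3.4, 3.0, 0.4, 0.2, 0.3]

strict = HealthCheckConfig(timeout_s=2.0, unhealthy_threshold=2, healthy_threshold=2)
lenient = HealthCheckConfig(timeout_s=5.0, unhealthy_threshold=2, healthy_threshold=2)

strict_states = simulate(latencies, strict)
lenient_states = simulate(latencies, lenient)

print("strict: ", strict_states)
print("lenient:", lenient_states)
```

With the strict settings, the target is taken out of rotation after the second slow response and only returns after two fast ones. With the lenient settings it stays in rotation throughout. A longer timeout also delays detection of a target that is truly down, so the right value is a trade-off.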