Fluid Attacks incident
Gateway timeout error when accessing the platform
Fluid Attacks experienced a critical incident on July 3, 2025, affecting the Platform and lasting 4h 12m. The incident has been resolved; the full update timeline is below.
Affected components
- Platform
Update timeline
- identified Jul 03, 2025, 02:36 PM UTC
The platform fails to load due to a Gateway timeout error (504), blocking user access.
- resolved Jul 03, 2025, 06:49 PM UTC
The incident has been resolved, and users can now access the platform normally.
- postmortem Jul 09, 2025, 01:28 PM UTC
**Impact**

Approximately 740 sessions were impacted, around 32.04% of all sessions during the incident window, affecting internal and external users attempting to access the platform. The issue started on 2025-07-02 at 15:26 (UTC-5) and was discovered reactively 17.5 hours later (TTD) by a client who reported to a staff member [[1]](https://help.fluidattacks.com/agent/fluid4ttacks/fluid-attacks/tickets/details/944043000040412003) that, when attempting to access the platform, some users encountered a "504 Gateway Timeout" error from Cloudflare, preventing regular interaction with the platform. No modules other than platform access were affected during the incident. The problem was resolved in 2.6 hours (TTF), for a total window of exposure of 20.1 hours (WOE) [[2]](https://gitlab.com/fluidattacks/universe/-/issues/16686).

**Cause**

The problem started when a user tried to generate a report. This triggered a series of automatic actions that, due to an issue with one of our external providers (Twilio), caused many repeated requests to be sent. Two additional factors made things worse:

1. The system was slow to add new servers to handle the increased load.
2. The failing requests took too long to stop, keeping the servers busy for too long.

As a result, the servers became overloaded and started failing to respond, showing the 504 error to some users. This was a rare and complex situation resulting from the combination of an external service failure, slow automatic scaling, and inefficient error handling [[3]](https://gitlab.com/fluidattacks/universe/-/merge_requests/80263).

**Solution**

Two key actions were taken:

1. The login process was simplified by removing unnecessary steps that added extra work for the servers [[4]](https://gitlab.com/fluidattacks/universe/-/merge_requests/80312).
2. We adjusted how the system adds new servers when traffic increases, so that it reacts sooner as usage grows [[5]](https://gitlab.com/fluidattacks/universe/-/merge_requests/80319).

**Conclusion**

These changes improved the system's ability to handle sudden spikes in traffic and external service failures. The incident highlighted the need to stop requests that take too long to process instead of letting them overload the system, and we are planning further improvements to how the system handles errors and timeouts. To further improve reliability and security, we plan to implement a Time-based One-Time Password (TOTP) system for user verification. This approach will reduce our dependence on external providers like Twilio and make authentication faster and safer [[6]](https://gitlab.com/fluidattacks/universe/-/issues/16714).

**THIRD_PARTY_ERROR < INFRASTRUCTURE_ERROR < INCOMPLETE_PERSPECTIVE**
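To illustrate the second solution item (reacting sooner when usage grows), the sketch below shows a target-tracking scaling policy with a lowered CPU target so new capacity is requested earlier. It is only an example: the platform's actual orchestration, service names, and thresholds are not described in the postmortem, and the ECS resource identifiers here are hypothetical.

```python
import boto3

# Hypothetical service identifiers; the real infrastructure is not specified here.
RESOURCE_ID = "service/example-cluster/example-platform-service"

autoscaling = boto3.client("application-autoscaling")

# Allow the service to grow well beyond its baseline during traffic spikes.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=RESOURCE_ID,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=12,
)

# A lower CPU target makes scale-out start sooner, before servers saturate.
autoscaling.put_scaling_policy(
    PolicyName="scale-out-earlier-on-cpu",
    ServiceNamespace="ecs",
    ResourceId=RESOURCE_ID,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 50.0,  # example value; previously-higher targets delay scale-out
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,   # add capacity quickly
        "ScaleInCooldown": 300,   # remove it cautiously
    },
)
```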
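The conclusion stresses stopping requests that take too long instead of letting them keep servers busy. As an illustration only (not the actual change in the linked merge requests), the following minimal Python sketch calls an external provider with explicit connect/read timeouts and bounded retries; the endpoint, limits, and function name are hypothetical.

```python
import requests

# Hypothetical endpoint and limits; not taken from the actual fix.
PROVIDER_URL = "https://api.example-provider.com/v1/verify"
CONNECT_TIMEOUT_S = 3   # give up quickly if the provider is unreachable
READ_TIMEOUT_S = 5      # do not let a slow response hold a worker
MAX_ATTEMPTS = 2        # bounded retries avoid request storms


def send_verification(payload: dict) -> dict | None:
    """Call the external provider, failing fast instead of blocking a worker."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            response = requests.post(
                PROVIDER_URL,
                json=payload,
                timeout=(CONNECT_TIMEOUT_S, READ_TIMEOUT_S),
            )
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == MAX_ATTEMPTS:
                return None  # degrade gracefully instead of overloading servers
    return None


if __name__ == "__main__":
    print(send_verification({"phone": "+15550000000"}))
```

The `(connect, read)` timeout tuple bounds both the connection attempt and the wait for a response, so a degraded provider cannot hold worker capacity indefinitely.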
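The planned TOTP verification mentioned in the conclusion could look roughly like the sketch below, assuming the widely used pyotp library. This is a minimal example under those assumptions, not the implementation tracked in issue 16714; the issuer name and helper functions are illustrative.

```python
import pyotp


def enroll_user(username: str) -> tuple[str, str]:
    """Generate a per-user shared secret and a provisioning URI for authenticator apps."""
    secret = pyotp.random_base32()
    uri = pyotp.TOTP(secret).provisioning_uri(
        name=username, issuer_name="Example Platform"  # illustrative issuer
    )
    return secret, uri  # persist the secret server-side; show the URI as a QR code


def verify_code(secret: str, code: str) -> bool:
    """Check a 6-digit code, tolerating one 30-second step of clock drift."""
    return pyotp.TOTP(secret).verify(code, valid_window=1)


if __name__ == "__main__":
    secret, uri = enroll_user("user@example.com")
    print(uri)
    print(verify_code(secret, pyotp.TOTP(secret).now()))  # True
```

Because codes are derived locally from the shared secret and the current time, verification does not depend on an external provider being reachable.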