Fasterize incident

Platform unavailability

Critical · Resolved

Fasterize experienced a critical incident on October 9, 2025 affecting the API, Dashboard, and Acceleration components, lasting 9h 22m. The incident has been resolved; the full update timeline is below.

Started
Oct 09, 2025, 09:01 AM UTC
Resolved
Oct 09, 2025, 06:23 PM UTC
Duration
9h 22m
Detected by Pingoru
Oct 09, 2025, 09:01 AM UTC

Affected components

API, Dashboard, Acceleration

Update timeline

  1. investigating Oct 09, 2025, 09:01 AM UTC

    Our platform is experiencing an outage. Where possible, traffic is automatically routed to origin to mitigate the incident. We have identified the issue and are working on a fix to restore availability.

  2. identified Oct 09, 2025, 09:54 AM UTC

    We are still investigating and trying to restore the failing component.

  3. identified Oct 09, 2025, 10:27 AM UTC

    The failing component is gradually recovering, and we are monitoring it until full restoration.

  4. monitoring Oct 09, 2025, 10:34 AM UTC

    A fix has been implemented and we are monitoring the results.

  5. monitoring Oct 09, 2025, 01:20 PM UTC

    Configuration updates are not possible from the Fasterize console. We are currently investigating.

  6. investigating Oct 09, 2025, 03:35 PM UTC

    Acceleration is now fully operational. Configuration updates are still unavailable, and we are investigating the issue. Cache purges are also currently unavailable.

  7. identified Oct 09, 2025, 05:04 PM UTC

    The issue has been identified and a workaround is being implemented. In the meantime, configuration updates and cache purges are unfortunately still unavailable. The workaround will temporarily degrade API performance, but only until we roll out a definitive solution. As soon as the fix is delivered, we will monitor it to confirm that everything is working correctly. A postmortem will follow, as we also need to gather further details from our provider. We deeply apologize for this incident and are actively working to resolve it as soon as possible.

  8. resolved Oct 09, 2025, 06:23 PM UTC

    This incident has been resolved. The API and the dashboard are now handling configuration updates and cache purges correctly. As mentioned earlier, a postmortem will be written and made available here.

  9. postmortem Oct 13, 2025, 12:17 PM UTC

    An in-memory database cluster failure led to service unavailability across multiple Fasterize components, primarily the **Optimisation Engine** and the **API**.

    At **09:48**, an in-memory database cluster failure occurred after multiple nodes were restarted to release a new engine version. The database cluster began an automatic failover sequence, but each time a new node was promoted as primary, it **crashed under excessive connection load**. This cluster serves as a **cache layer** providing access to configurations. During the outage, Engine instances attempted to reconnect at a very high frequency and fell back to retrieving data directly from the main database. This fallback mechanism worked as intended until **10:32**, allowing our optimisation engine to continue operating in **degraded mode**.

    At **10:32**, however, the **proxy layer** of our optimisation engine became **saturated at the network level**, rendering it unreachable from the front layer. When the proxy layer is unreachable, the platform automatically **unplugs the CDN and sites are served directly from their origin servers**, without Fasterize optimizations. Working with our hosting provider, we **reduced the Optimisation Engine cluster size at 11:45** to limit reconnection attempts to Redis. By **12:16**, the Redis cluster had stabilised and full service was restored.

    Later, at **14:15**, we detected that the **API was unable to write to the Redis cluster**. The root cause was a **security patch applied by the hosting provider**, which **restricted the use of some commands** in the cluster. The API was patched to remove usage of these commands, and full functionality was restored by **20:15**.

    ## **Impact**

    * **Duration:** 10:32 – 12:16 (outage); API issue until 20:15
    * **Affected components:** Optimisation Engine, API
    * **User impact:** Most websites temporarily served unoptimised content directly from origin; some websites experienced unavailability due to a dysfunctional DNS fallback mechanism; API write operations failed, blocking configuration updates.

    ## **Resolution Timeline**

    ## Action plan

    Short term:

    * Improve **alerting and visibility** on in-memory database cluster health and failover events.
    * Review the engine's in-memory database connection logic to avoid excessive reconnection attempts after a disconnection, and to prevent the process from being unable to start during an in-memory database outage (a sketch of this pattern follows the timeline below).
    * Adjust the failover DNS logic to avoid redirecting traffic to the origin when the CDN/fronts are still able to accept it.
    * Upscale the in-memory database cluster so it can accept more connections.

    Medium term:

    * Review and test **disaster recovery procedures** for in-memory database cache clusters, including the ability to quickly activate a passive cluster.

    Long term:

    * Re-architect the engine to reduce the number of connections to the in-memory database cluster.
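The postmortem does not publish the engine's actual code. The following Python sketch only illustrates the pattern described above: a configuration read that goes to the in-memory cache first, backs off between reconnection attempts instead of retrying at high frequency, and falls back to the main database when the cache layer is unreachable. All names and parameters here (the `config-cache.internal` host, `get_site_config`, `load_config_from_main_db`, the retry counts and delays) are hypothetical and not taken from Fasterize's systems.

```python
import json
import random
import time

import redis  # assumed client; the postmortem only refers to an "in-memory database cluster"

# Hypothetical connection to the configuration cache layer.
cache = redis.Redis(host="config-cache.internal", port=6379, socket_timeout=0.2)


def load_config_from_main_db(site_id: str) -> dict:
    """Placeholder for the slower read against the main database (assumption)."""
    raise NotImplementedError


def get_site_config(site_id: str, max_retries: int = 3) -> dict:
    """Read a site configuration from the cache layer, with capped exponential
    backoff between retries so a cache outage does not turn into a reconnection
    storm, and a fallback to the main database if the cache stays unreachable."""
    delay = 0.1
    for _ in range(max_retries):
        try:
            raw = cache.get(f"config:{site_id}")
            if raw is not None:
                return json.loads(raw)
            break  # key absent: not an outage, read from the main database
        except redis.exceptions.ConnectionError:
            # Wait with jitter instead of hammering the cluster while it fails over.
            time.sleep(delay + random.uniform(0, delay))
            delay = min(delay * 2, 2.0)
    return load_config_from_main_db(site_id)
```

A wrapper of this shape would address two of the short-term action items at once: reconnection attempts are rate-limited by the backoff, and a cache outage degrades the read path to the main database instead of preventing the process from starting.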