Fasterize incident
Acceleration was disabled between 10:45 AM GMT+1 and 11 AM GMT+1
Fasterize experienced a major incident on February 25, 2025 affecting Acceleration, lasting 6h 55m. The incident has been resolved; the full update timeline is below.
Affected components: Acceleration
Update timeline
- identified Feb 25, 2025, 10:10 AM UTC
Acceleration was disabled between 10:45 AM GMT+1 and 11 AM GMT+1 due to a misconfiguration of the global platform health checks. These health checks are currently being reviewed; the platform itself is healthy.
- identified Feb 25, 2025, 03:37 PM UTC
The root cause has been identified and a post-mortem is being written. We will set up more robust global platform health checks.
- monitoring Feb 25, 2025, 04:37 PM UTC
The fix has been deployed and everything is back to normal (the platform is properly monitored again). We are still monitoring.
- resolved Feb 25, 2025, 05:06 PM UTC
The incident is now closed. We apologize for the inconvenience. A post-mortem will follow.
- postmortem Feb 25, 2025, 05:06 PM UTC
# Healthcheck in error

Date of post-mortem: 25/02/2025
Participants of the post-mortem: Anthony Barré

# Description of the incident

Following a change in the Fasterize API, **the health checks of the platform turned red**, indicating an unavailability of the platform while it was actually healthy. These health checks were configured to query a URL associated with **a domain name that was no longer present** in the configuration database.

### Immediate corrective actions

* **Temporary deactivation** of the health checks.
* **Rollback** of the API configuration change.

# Facts & Timeline

All times are UTC+1.

# Analysis

## Technical context

* The health checks validate the availability of the platform through several layers (proxy, load balancer, workers, etc.).
* The failure of the health checks was due to errors at the proxy level, a symptom of the missing configuration for the domain being probed.
* At the proxy level, the health check response is adapted per region, so that each region's health checks receive a region-specific answer.
* An outdated process, still running although no longer required, caused issues when it updated the configuration used by the proxies with incorrect data.

## Root cause analysis

The API configuration change left the platform in an unstable state: the API update did not propagate immediately, so some API pods were reading the health check configuration from one environment while others read it from another. This corrupted the database used by the proxies of our main environment, and the proxies began answering health check requests with an error code instead of 200.

## Impact of incident

🏠 **Affected customers**: **all customers** were impacted by the lack of acceleration.

🔴 **Specific issue**: **two customers** experienced major disruptions because their origins could not receive traffic directly from the Internet.
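The failure mode above suggests one hardening idea: a health check probe that distinguishes "platform down" from "probe misconfigured". The sketch below is a minimal, hypothetical illustration (the `CONFIG_DB` set and `check_layer` function are assumptions for this example, not Fasterize internals): before alerting, the probe verifies that the domain it queries still exists in the configuration store.

```python
# Hypothetical sketch: a health check that refuses to report the platform
# as down when its own probe domain is missing from configuration.
import urllib.request
import urllib.error

# Stand-in for the configuration database of known domains (illustrative).
CONFIG_DB = {"health.example.com", "customer-a.example.com"}

def check_layer(probe_domain: str, probe_url: str) -> str:
    """Return 'healthy', 'unhealthy', or 'probe-misconfigured'."""
    # Guard: if the probed domain is absent from the configuration store,
    # the probe itself is broken -- do not page anyone for a platform outage.
    if probe_domain not in CONFIG_DB:
        return "probe-misconfigured"
    try:
        with urllib.request.urlopen(probe_url, timeout=5) as resp:
            return "healthy" if resp.status == 200 else "unhealthy"
    except (urllib.error.URLError, TimeoutError):
        return "unhealthy"
```

With such a guard, the February 25 scenario (probe domain removed from the configuration database) would have surfaced as a probe configuration error rather than a platform-wide red status.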
📊 **Incident metrics**

* **Severity**: **1** (service shutdown impacting a large number of users).
* ⏱ **Detection time**: **4 minutes**.
* 🛠 **Resolution time**: **20 minutes**.

# Action Plan

## Short term (immediate - 1 week)

* Review the platform's health checks so that they are more robust.
* Disable and delete the obsolete monitoring process. ✅
* Fix the deployment of configmaps in the Helm charts.

## Medium term

* Review the API configuration to clarify the use of region-related fields and avoid side effects.
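The root cause involved API pods reading configuration from two different environments at once. One way to avoid that split-brain state, sketched below under assumed names (`safe_to_apply`, the pod/version mapping), is a rollout guard that only applies a proxy configuration update once every API pod reports the same configuration version:

```python
# Hypothetical rollout guard: hold a configuration update until all
# API pods agree on a single configuration version/environment.
def safe_to_apply(pod_versions: dict[str, str]) -> bool:
    """True only if every pod reports the same configuration version."""
    return len(set(pod_versions.values())) == 1

# Example: one pod still reads the old environment, so the update is held
# instead of mixing data from two environments into the proxies' database.
rollout_state = {"api-pod-1": "v2", "api-pod-2": "v1", "api-pod-3": "v2"}
print(safe_to_apply(rollout_state))  # a mixed fleet blocks the rollout
```

This is a simplification of what a real consistency check would do, but it captures the invariant the incident violated: proxies should never consume configuration produced by pods reading from different environments.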