Fasterize incident

Unavailability of the origin on several CDN Edge Servers leading to 502/504 errors for end users

Critical Resolved

Fasterize experienced a critical incident on June 18, 2025 affecting Acceleration and CDN, lasting 1h 13m. The incident has been resolved; the full update timeline is below.

Started
Jun 18, 2025, 07:41 AM UTC
Resolved
Jun 18, 2025, 08:55 AM UTC
Duration
1h 13m
Detected by Pingoru
Jun 18, 2025, 07:41 AM UTC

Affected components

Acceleration
CDN

Update timeline

  1. investigating Jun 18, 2025, 07:41 AM UTC

We are currently experiencing issues in one of our European DCs. A fix is in progress. Traffic is interrupted for a large portion of customers. We are really sorry for the inconvenience.

  2. investigating Jun 18, 2025, 07:49 AM UTC

The failover mechanism didn't trigger automatically, so we triggered it manually.

  3. resolved Jun 18, 2025, 08:55 AM UTC

    This incident is now resolved. A post-mortem will follow, but to summarize: the root cause was a change to a DNS record. During that change, the record pointing to our DC briefly took a wrong value, which some edge servers captured and cached for one hour. This affected only a subset of edge servers and only a subset of the health checkers responsible for triggering the failover mechanism, which explains why the failover mechanism wasn't fully triggered.
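The partial failover trigger described above can be sketched as follows. This is a minimal illustrative model, not Fasterize's actual implementation: the class names, the three-consecutive-failure rule, and the majority quorum are all assumptions.

```python
class HealthChecker:
    """One probe; flags the origin as degraded after N consecutive failures."""

    def __init__(self, required=3):  # threshold is an assumed value
        self.required = required
        self.streak = 0

    def record(self, ok):
        # Any successful probe resets the failure streak.
        self.streak = 0 if ok else self.streak + 1
        return self.streak >= self.required  # True -> this checker sees "degraded"


def failover_should_trigger(degraded_flags, quorum=0.5):
    """Trigger failover only if enough checkers agree the origin is down."""
    return sum(degraded_flags) / len(degraded_flags) >= quorum


# 2 of 6 checkers cached the wrong IPs and fail consistently;
# the other 4 still reach the healthy origin.
checkers = [HealthChecker() for _ in range(6)]
poisoned = {0, 1}
for _ in range(5):  # five probe rounds
    degraded = [c.record(ok=(i not in poisoned)) for i, c in enumerate(checkers)]

print(failover_should_trigger(degraded))  # only 2/6 degraded -> False
```

Because only a minority of checkers resolved to the dead IPs, the degraded quorum was never reached, matching the behavior observed during the incident.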

  4. postmortem Jun 19, 2025, 08:38 AM UTC

    ### **🧾 Incident Summary**

    As part of preparations for the summer sales, and in response to a load handling issue observed on June 12, the tech team planned an update of a DNS record pointing to the platform's frontend layer. The goal was to review the way CDN Edge servers send traffic to the platform. Since the DNS record could not be edited in place, it was deleted and immediately recreated with a new routing policy.

    During the brief deletion (~10 seconds), some DNS resolvers fell back to a default wildcard record with a 3600s TTL, pointing to decommissioned IPs instead of returning NXDOMAIN. As a result, several CDN Edge servers cached these incorrect IPs and continued to use them for up to an hour, causing a global service outage on the affected POPs. Some health check probes also resolved to the wrong IPs, but not enough of them failed at once to trip the health check thresholds (which require consecutive failures to switch to a degraded state).

    ### **🧩 Root Cause Analysis**

    This outage was not due to an infrastructure failure but to a human error during maintenance performed without sufficient safeguards. We consider this a major governance failure in infrastructure change management.

    The affected DNS zone was created over 12 years ago and currently holds legacy records. Over time, several have become outdated or unused. However, due to the risk of accidental deletion and unexpected impact, no regular cleanup had been performed. This lack of maintenance allowed a misconfigured wildcard record to persist, and it was unintentionally triggered during the temporary deletion of a critical DNS record. We have now initiated a full audit of this DNS zone to identify, document, and progressively remove obsolete records. A cross-validation policy will be enforced before any future changes.

    Although a staging environment exists and is used to validate infrastructure changes, this specific scenario (active CDN traffic combined with DNS behavior during the few seconds of deletion) was not anticipated. Due to the distributed and asynchronous nature of CDN DNS propagation and caching (per POP), the issue could hardly have been replicated in a staging environment.

    ### **✅ Immediate Fixes**

    * Re-creation of the DNS record with the correct configuration
    * Manual checks to ensure resolvers and CDN POPs are now resolving to the correct origin

    ### **🔒 Preventive Countermeasures**

    **Short-term:**

    * Freeze on infrastructure changes for two weeks

    **Medium-term:**

    * Improve the staging environment to better simulate CDN-specific issues
    * Clean up outdated records in the platform's DNS zone
    * Formalize an emergency fallback procedure
    * Introduce more logical zones to avoid widespread impact across clients

    ### **Conclusion**

    This incident highlights how even short-lived DNS misconfigurations can cause major disruptions in distributed systems like CDNs. Rigorous TTL management and better anticipation of critical DNS usage are essential to avoid similar outages in the future.

    Despite this incident, the technical and customer success teams remain fully committed to delivering a smooth and successful sales period for our clients. A temporary change freeze is in place, and additional planning and capacity measures are being taken to ensure high availability and reliability throughout this peak traffic period.