Fasterize experienced a critical incident on March 16, 2021, affecting the API, the Dashboard, and one other component, lasting 38 minutes. The incident has been resolved; the full update timeline is below.
Affected components
API, Dashboard, and one other component
Update timeline
- investigating Mar 16, 2021, 06:01 PM UTC
We currently have some issues on our European infrastructure. A fix is in progress. Speed-up is disabled but traffic is OK.
- identified Mar 16, 2021, 06:20 PM UTC
The issue has been identified. Mitigation is being deployed.
- monitoring Mar 16, 2021, 06:33 PM UTC
The fix has been deployed, acceleration has been re-enabled, and everything is back to normal. We're still monitoring.
- resolved Mar 16, 2021, 06:39 PM UTC
The incident is now closed. Sorry for the inconvenience. A post-mortem will follow.
- postmortem Mar 18, 2021, 10:15 PM UTC
# Description

On 16/03/2021, between 18:45 and 19:20, the entire Fasterize platform experienced slowdowns, with potentially high response times regardless of the site. Between 18:49 and 18:59, the platform automatically switched traffic to the customer origins to ensure continuity of service. From 18:58, new machines were added and began taking traffic to mitigate the impact while the root cause was being investigated. At 19:00, traffic was routed back to the Fasterize platform, with only a few requests still slowed down. At 19:18, the cause was identified, leading 10 minutes later to the blocking of an IP address whose requests were overloading the platform.

# Facts and timeline

* From 8:23, increase in the number of requests for large files (> 1 GB) on a customer site
* From 14:00, second increase in the number of requests for these large files
* 18:49, alert on a high number of requests cancelled by users (HTTP 499)
* 18:50, Route 53 alert on the availability of our front-ends; traffic automatically rerouted to the customer origins
* 18:55, machines added to the pool
* 18:58, the new machines started serving traffic
* 18:59, the platform seen as up again; traffic rerouted back to Fasterize, with some residual slowdowns
* 19:18, the offending customer site disconnected after the traffic overload was identified
* 19:20, back to normal (no more slowdowns)
* 19:27, the offending IP address blocked

# Analysis

From 8:23, a server hosted on GCP issued several hundred requests for large files transiting through our proxies (XML files > 1 GB). Until then, this server had made only a few dozen requests per day. Bandwidth on the front-ends and proxies increased progressively throughout the day, up to a factor of 2.5 compared to the previous day and the previous week. From 18:45, overall response times started to degrade without any further increase in bandwidth being used. This can be explained by a sudden increase in the load of the front-ends, which had been stable until then. The cause of that load increase remains unexplained at this time.

# Metrics

* Incident severity level:
  * Severity 2: site degradation, performance problem and/or broken feature that is hard to work around, impacting a significant number of users
* Detection time: 5 minutes (18:45 ⇢ 18:49)
* Resolution time: 35 minutes (18:45 ⇢ 19:20)
* Incident duration: 35 minutes

# Impacts

* Automatic disconnection of all customers for 10 minutes
* No support tickets received
* Manual disconnection of a few sites by one customer

# Countermeasures

## Actions during the incident

* Added front-ends to the pool
* Disconnected the offending website
* Blocked the offending IP address (see the sketch below)
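As an illustration of that last countermeasure, here is a minimal sketch of an edge-level IP block. It assumes an nginx-based front-end (HTTP 499, seen in the alerts, is nginx's status code for client-closed requests); the upstream name is hypothetical and the blocked address is an RFC 5737 documentation placeholder, not the actual offender.

```nginx
# Hypothetical front-end vhost; all names and addresses are placeholders.
upstream fasterize_engine {
    server 127.0.0.1:8080;    # assumed local optimization engine
}

server {
    listen 80;

    # Reject the offending client before its requests reach the engine.
    # 203.0.113.42 is a documentation address standing in for the real IP.
    deny 203.0.113.42;
    allow all;

    location / {
        proxy_pass http://fasterize_engine;
    }
}
```

A `deny` at the server level applies to every location, so the block takes effect for the whole vhost after a configuration reload.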
# Action plan

[ ] planned, [-] doing, [x] done

## Short term

* [x] Adjust the bandwidth alerts
* [-] Adjust the alerts on ping-fstrz-engine
* [-] Detect the most voluminous objects in order to bypass them (see the sketch after this list)

## Medium term

* [ ] Rate limiting on large objects
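The short-term bypass and the medium-term rate limiting could combine into something like the following sketch, again assuming an nginx-based proxy layer. The file pattern, upstream names, and thresholds are illustrative assumptions, not Fasterize's actual configuration.

```nginx
# Hypothetical sketch; patterns, names, and limits are illustrative only.
upstream fasterize_engine {
    server 127.0.0.1:8080;     # assumed optimization engine
}
upstream customer_origin {
    server 192.0.2.10:80;      # placeholder customer origin
}

# Send requests for the heavy XML objects straight to the origin,
# bypassing the optimization engine; everything else is accelerated.
map $uri $backend {
    ~\.xml$  customer_origin;
    default  fasterize_engine;
}

# Rate-limit only those heavy objects, keyed per client IP. An empty
# key excludes a request from rate accounting entirely.
map $uri $large_object_key {
    ~\.xml$  $binary_remote_addr;
    default  "";
}

limit_req_zone $large_object_key zone=large_objects:10m rate=10r/m;

server {
    listen 80;

    location / {
        limit_req zone=large_objects burst=5 nodelay;  # allow a short burst
        limit_req_status 429;                          # then reject with 429
        proxy_pass http://$backend;
    }
}
```

Keying the limit on `$binary_remote_addr` means a single greedy client, like the GCP server in this incident, is throttled without affecting other visitors to the same site.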