Fasterize incident

Temporary Platform Unavailability

Minor · Resolved

Fasterize experienced a minor incident on May 15, 2024, lasting 26 minutes (11:29 AM to 11:55 AM, per the postmortem below). The incident has been resolved; the full update timeline is below.

Started: May 15, 2024, 03:53 PM UTC
Resolved: May 15, 2024, 09:30 AM UTC
Duration: 26 minutes (11:29 AM to 11:55 AM)
Detected by Pingoru: May 15, 2024, 03:53 PM UTC

Update timeline

  1. resolved May 15, 2024, 03:53 PM UTC

    The platform experienced an outage from 11:29 AM to 11:55 AM. Traffic was automatically routed to origins. Customers therefore lost the benefit of the solution, but the sites remained available during the incident.

  2. postmortem May 15, 2024, 03:53 PM UTC

    **Post Mortem**: Temporary Platform Unavailability

    **Event Date**: May 15, 2024

    **Incident Duration**: 11:29 AM to 11:55 AM

    **Incident Description**: The platform experienced an outage from 11:29 AM to 11:55 AM. Traffic was automatically routed to origins. Customers therefore lost the benefit of the solution, but the sites remained available during the incident.

    The addition of a large number of configurations on the platform increased memory consumption and the startup time of the front-layer services. Some services stopped and did not restart correctly.

    **Event Timeline**:

    * 11:17 AM: New configurations added.
    * 11:21 AM: Memory shortage detected on a service, leading to the shutdown of a critical process.
    * 11:34 AM: Additional services became unavailable.
    * 11:38 AM: Widespread detection of the incident; traffic automatically redirected to origins.
    * 11:45 AM: Service restarts attempted, with partial success.
    * 12:00 PM to 12:15 PM: Assessment and decision on corrective actions.
    * 12:33 PM: Startup configurations modified to tolerate longer startup times.

    **Analysis**: Two main factors led to this incident:

    * Our HTTP server requires a reload to take new configuration into account. During this reload, the number of processes for the service is doubled, creating a risk of memory exhaustion.
    * The start timeout for the HTTP service was left at its default value, and we had no monitor alerting us that the HTTP service's start time was close to the limit.

    **Impact**: All users of the platform were affected by this incident.

    **Corrective and Preventive Measures**:

    * Short term: Review alerting and adjust service startup configurations.
    * Medium term: Improve configuration management to reduce the number of configurations and optimize service startup monitoring.
    * Long term: Evaluate alternative HTTP servers that allow configuration updates without impacting performance or memory consumption.

    **Conclusion**: This incident highlights the importance of constant monitoring and proactive resource management to prevent outages. The measures taken should enhance the stability and reliability of the platform.
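
    The first factor in the analysis (a reload briefly doubles the HTTP server's worker processes) suggests a pre-reload memory guard. Below is a minimal sketch of such a check, assuming a psutil-based script; the worker process name and the safety margin are illustrative assumptions, not Fasterize's actual tooling.

    ```python
    import psutil

    # Illustrative assumptions: the real worker process name and margin
    # would depend on the actual HTTP server and its sizing.
    WORKER_NAME = "http-worker"
    SAFETY_MARGIN = 1.2  # require 20% headroom beyond the doubled footprint

    def reload_is_safe() -> bool:
        """A graceful reload briefly doubles the worker processes, so the
        doubled resident memory must fit in currently available memory."""
        worker_rss = sum(
            p.info["memory_info"].rss
            for p in psutil.process_iter(["name", "memory_info"])
            if p.info["name"] == WORKER_NAME and p.info["memory_info"]
        )
        return worker_rss * SAFETY_MARGIN <= psutil.virtual_memory().available

    if __name__ == "__main__":
        if reload_is_safe():
            print("OK: enough headroom to reload")
        else:
            print("Refusing reload: doubled workers could exhaust memory")
    ```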
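
    The second factor (no alert when the service's start time approached its timeout) can be covered by a simple startup timer. A hedged sketch, assuming a local health endpoint and a 90-second start budget; both values are assumptions chosen to mirror a common systemd-style default, not the platform's real settings.

    ```python
    import time
    import urllib.request

    HEALTH_URL = "http://127.0.0.1:8080/health"  # assumed endpoint
    START_TIMEOUT = 90.0  # seconds; assumed budget, e.g. a systemd default
    WARN_RATIO = 0.8      # alert once startup consumes 80% of the budget

    def wait_until_healthy() -> float:
        """Poll the health endpoint and return how long startup took."""
        t0 = time.monotonic()
        while time.monotonic() - t0 < START_TIMEOUT:
            try:
                with urllib.request.urlopen(HEALTH_URL, timeout=2):
                    return time.monotonic() - t0
            except OSError:
                time.sleep(1)
        raise TimeoutError("service did not become healthy within the budget")

    elapsed = wait_until_healthy()
    if elapsed > START_TIMEOUT * WARN_RATIO:
        print(f"WARNING: startup took {elapsed:.1f}s, "
              f"close to the {START_TIMEOUT:.0f}s limit")
    ```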