Flatfile experienced a major incident on April 2, 2025, affecting Spaces and lasting 1h 14m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Apr 02, 2025, 05:53 PM UTC
We are investigating an issue where some users are seeing intermittent 503 errors.
- monitoring Apr 02, 2025, 06:07 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Apr 02, 2025, 07:08 PM UTC
This incident has been resolved.
- postmortem Apr 24, 2025, 07:11 PM UTC
# **Introduction**

On April 2, 2025, a service degradation caused requests for static assets to fail intermittently. The affected requests included HTML, JS, CSS, and other assets, so delivery of the frontend applications failed in several short bursts.

# **Incident Details**

* **Date Reported**: April 2, 2025
* **Issue Summary**: Delivery of frontend application assets degraded

# **Impact Assessment**

The incident degraded delivery of the static assets used by the frontend applications, which manifested as:

1. Intermittent errors loading Spaces
2. Missing assets in applications
3. NGINX error pages being shown instead of Spaces

The incident did not affect API usage or browser clients that had already cached the static asset files.

# **Root Cause**

Our cloud hosting provider terminated several EC2 instances in our Kubernetes fleet over several hours on the morning of April 2. Each time, the NGINX proxy that delivers static assets had to be recreated on another node, resulting in several seconds of failed asset requests. This occurred several times in succession.

# **Resolution & Fix**

1. **Immediate Remediation**
    * Flatfile infrastructure engineers scaled NGINX resources across the fleet to avoid downtime during disruptions.
2. **Recovery Strategy**
    * We implemented a new routing and retry strategy, combined with affinity rules, to prevent scheduling on ephemeral resources.

# **Follow-Up Actions**

* **Monitoring Enhancement**: Monitoring for this type of issue exists and the alerts triggered correctly, but alert escalation could be improved to prompt a faster response.
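
The postmortem mentions a retry strategy but does not describe how it works. As a rough illustration only, and not Flatfile's actual implementation, the sketch below shows one common way to ride out a few seconds of 503 responses: retrying with capped exponential backoff and jitter. The names `TransientError`, `retry_with_backoff`, and `make_flaky_fetch`, along with every default value, are assumptions made for this example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for a retryable failure, such as an HTTP 503."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,       # assumed default, not from the postmortem
    base_delay: float = 0.2,     # seconds; assumed default
    max_delay: float = 5.0,      # seconds; assumed default
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on TransientError with jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # The delay ceiling doubles on each attempt, capped at max_delay.
            # Picking a random delay below the ceiling keeps many clients
            # from retrying in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")


def make_flaky_fetch(failures: int) -> Callable[[], str]:
    """Simulate an asset request that fails `failures` times, then succeeds."""
    remaining = {"count": failures}

    def fetch() -> str:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise TransientError("503 Service Unavailable")
        return "asset body"

    return fetch


if __name__ == "__main__":
    # Two simulated 503s are absorbed before the third attempt succeeds.
    print(retry_with_backoff(make_flaky_fetch(2)))
```

A retry loop like this only masks brief gaps such as the several-second proxy restarts described under Root Cause. It complements, rather than replaces, keeping enough proxy replicas on stable nodes.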