Flatfile experienced a minor incident on July 4, 2025 affecting the AU Regional API, AU Spaces, and one other component, lasting 34 minutes. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Jul 04, 2025, 01:18 AM UTC
We are currently investigating an issue where our AU regional server is not loading for some customers. We are working to get a fix out as quickly as possible.
- identified Jul 04, 2025, 01:39 AM UTC
The issue has been identified and a fix is being implemented.
- monitoring Jul 04, 2025, 01:44 AM UTC
A fix has been implemented and we are monitoring the results.
- resolved Jul 04, 2025, 01:52 AM UTC
This incident has been resolved.
- postmortem Jul 07, 2025, 03:23 PM UTC
# Incident Overview

**Date and Time of Incident:** July 3rd, approximately 7:15pm MT to 7:45pm MT

**Nature of Incident:** Ephemeral database CPU pinning in the Australia regional environment

**Services Affected:** All Workbooks and Sheets endpoints

## Details of the Incident

At approximately 7:15pm MT, we received a report from a customer in the Australia region that the API was returning a database timeout error. Upon investigation, we determined that the ephemeral database was experiencing degraded performance due to CPU load. We also investigated the API service in the Australia regional ECS cluster and rolled over the API deployment as a precautionary measure.

## Impact Assessment

All Australia regional platform users were unable to load workbooks or sheets during the incident. This degraded most customer workflows, and completely blocked some, due to API response failures. The incident was fully resolved about 30 minutes after the initial report.

## Root Cause

The root cause was CPU load on the ephemeral database instance. Upon investigation, we discovered that the instance was too small to handle the load placed on it. We also suspect that background load from Postgres auto-vacuum may have been occurring at the time, exacerbating the problem.

## Resolution

The ephemeral datastore instance was scaled up to a much larger instance size. This relieved the load and provides headroom for significantly more application traffic, and we have confirmed that it resolved the problem. Going forward, we will audit our regional deployments to determine whether each ephemeral datastore is scaled appropriately for the region's traffic volume.

## Security and Data Integrity

There was no loss of customer data and no security breach. We have reviewed the application and database logs related to the incident and concluded that the database capacity problem was the root cause and that no other part of the system experienced degradation.