Flatfile incident

"Something went wrong" errors

Minor | Resolved

Flatfile experienced a minor incident on March 14, 2025 affecting Spaces, lasting 1h 30m. The incident has been resolved; the full update timeline is below.

Started
Mar 14, 2025, 02:33 PM UTC
Resolved
Mar 14, 2025, 04:03 PM UTC
Duration
1h 30m
Detected by Pingoru
Mar 14, 2025, 02:33 PM UTC

Affected components

Spaces

Update timeline

  1. investigating Mar 14, 2025, 02:33 PM UTC

    We are seeing some intermittent "something went wrong" errors when trying to load sheets for some users. We are currently investigating this.

  2. identified Mar 14, 2025, 03:16 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Mar 14, 2025, 03:57 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Mar 14, 2025, 04:03 PM UTC

    This incident has been resolved.

  5. postmortem Mar 17, 2025, 02:59 PM UTC

    # **Introduction**

    On March 14, 2025, our team identified an issue where certain workbooks were failing to open and/or update. These failures were caused by a database incident involving one of our ephemeral database servers. This document outlines the incident details, the identified root cause, the steps taken to resolve the issue, and the long-term remediation plan.

    # **Incident Details**

    * **Date Reported**: March 14, 2025
    * **Issue Summary**: One of Flatfile’s ephemeral database instances entered an abnormal state. Workbooks mounted to this database instance failed to open and/or be updated.

    # **Impact Assessment**

    The incident resulted in degraded service performance for users with workbooks on the Quickstore 3 database. Specifically, users experienced:

    1. Intermittent unavailability of existing workbooks stored on the affected database
    2. Issues loading sheets in newly created spaces that attempted to access data from the affected database

    The incident did not affect the creation of new workbooks, as these would be directed to functioning database instances. Only workbooks that were already stored on the Quickstore 3 instance were impacted, leading to a compromised user experience for a subset of users.

    # **Root Cause**

    Initial investigations determined that the Quickstore 3 database had entered an abnormal state. The database writer node became unresponsive, preventing both read and write operations from completing successfully. While the exact trigger for this state is still under investigation, monitoring data suggests that the database instance may have experienced resource exhaustion or an internal failure that was not automatically resolved by the database management system.

    # **Resolution & Fix**

    1. **Immediate Remediation**
        * A backup of the affected database instance was completed to secure all data.
        * A new database instance was brought online to attempt to maintain service availability.
        * A new reader node was spun up while planning to remove the problematic node from service.
    2. **Recovery Strategy**
        * After evaluating options, Flatfile launched a new database cluster using the backup at the same time that the reader node was coming online in case the additional reader node was unable to make the database healthy again.

    # **Follow-Up Actions**

    * **Monitoring Enhancement**: While monitoring for this type of issue exists and alerts triggered correctly, enhancements could be made to escalate alerts and prompt faster response times.
    * **Root Cause Investigation**: Continue the investigation into database monitoring data to determine what initially caused the Quickstore 3 database to enter the problematic state.
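
The monitoring enhancement above mentions escalating alerts to prompt faster responses. As a rough illustration only, the sketch below shows one way a time-based escalation rule could be expressed. It is not Flatfile's implementation: the tier names, the thresholds, and the `current_tier` function are all hypothetical and chosen purely for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class EscalationTier:
    name: str          # hypothetical responder group
    after: timedelta   # page this tier once the alert is at least this old


# Hypothetical tiers and thresholds, ordered by increasing delay.
TIERS = [
    EscalationTier("on-call engineer", timedelta(minutes=0)),
    EscalationTier("secondary on-call", timedelta(minutes=10)),
    EscalationTier("engineering manager", timedelta(minutes=20)),
]


def current_tier(
    fired_at: datetime,
    now: datetime,
    acknowledged_at: Optional[datetime] = None,
) -> Optional[EscalationTier]:
    """Return the tier that should be paged now, or None if no page is due."""
    # An acknowledged alert stops escalating.
    if acknowledged_at is not None and acknowledged_at <= now:
        return None
    age = now - fired_at
    # Pick the highest tier whose delay has already elapsed.
    eligible = [tier for tier in TIERS if tier.after <= age]
    return eligible[-1] if eligible else None


if __name__ == "__main__":
    # Illustrative timestamp only, reusing the incident start time.
    fired = datetime(2025, 3, 14, 14, 33, tzinfo=timezone.utc)
    tier = current_tier(fired, fired + timedelta(minutes=12))
    print(tier.name if tier else "no page due")
```

Keeping the tiers in a plain ordered list means the thresholds can be tuned without touching the selection logic, and an acknowledgement short-circuits escalation regardless of the alert's age.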