Air incident

App is getting 504 Gateway Timeout errors

Major · Resolved

Air experienced a major incident on September 30, 2025 affecting Web App, lasting 1h. The incident has been resolved; the full update timeline is below.

Started
Sep 30, 2025, 04:45 PM UTC
Resolved
Sep 30, 2025, 05:45 PM UTC
Duration
1h
Detected by Pingoru
Sep 30, 2025, 04:45 PM UTC

Affected components

Web App

Update timeline

  1. Investigating Sep 30, 2025, 04:58 PM UTC

    We are currently investigating this issue.

  2. Identified Sep 30, 2025, 05:46 PM UTC

    The issue has been identified and a fix is being implemented.

  3. Resolved Sep 30, 2025, 05:57 PM UTC

    This incident has been resolved.

  4. Postmortem Oct 06, 2025, 07:35 PM UTC

    ### **Overview**

    * Incident name: Internal Cleanup Led to API Request Handling Issues
    * Date and time: 2025-09-30, roughly 12:45 PM–1:45 PM ET
    * Affected areas: Consumers of Air’s API
    * Status: Resolved

    ### **Customer impact**

    * What customers experienced: Foreground apps saw intermittent success in reaching the API during this period, with more impact on operations that modified data.
    * Scope: Actions that modified data, plus intermittent impact on actions reading data. Some background tasks such as media processing, AI enrichments, indexing, and downloads were delayed during the incident but resumed after resolution without degradation.
    * Duration: 2025-09-30, roughly 12:45 PM–1:45 PM ET
    * Data and security: No data loss or security exposure occurred.

    ### **What happened**

    Internal cleanup on a dataset increased load on Air’s primary database, causing subsequent queries to be impacted (latency and/or timeouts).

    ### **Root cause**

    * Primary cause: The cleanup logic was not throttled correctly to limit the load it placed on the primary database.

    ### **Timeline (high level)**

    * 12:45 PM: Degradation from the cleanup observed and the cleanup terminated
    * 1:00 PM: Load on the primary decreases and intermittent API access is observed
    * 1:08 PM: Longer-running queries created by the cleanup terminated
    * 1:25 PM: Additional read capacity added to the database to support the volume of requests during recovery
    * 1:45 PM: All app functionality returned to nominal behavior and async processing resumed

    ### **Preventative actions**

    * Immediate fixes completed
      * Increased environment capacity to handle load
      * Cleanup process placed back in review for further refinement
    * Near-term improvements
      * Refine approval and application of out-of-band processing (e.g., cleanup)
      * Alert on early indicators (e.g., database locks) for faster detection
    * Long-term investments
      * Dedicated system for out-of-band processing to provide additional guardrails

    ### **Frequently asked questions**

    * Was any customer data lost?
      * No. We confirmed no data loss or security exposure.
    * Do customers need to take any action?
      * No. All queued work has been safely processed. If anything still looks off, please let us know and we will investigate immediately.
    * How will we keep you updated?
      * Your account team will share any follow-ups on improvements. We will also post future status updates through our standard channels if needed.
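The root cause named in the postmortem, an unthrottled bulk cleanup saturating the primary database, is commonly mitigated by doing the work in small batches with a pause between them. A minimal sketch in Python, using SQLite only to keep the example self-contained; the `stale_assets` table, batch size, and pause are illustrative assumptions, not Air's actual schema or settings:

```python
import sqlite3
import time


def throttled_cleanup(conn, batch_size=500, pause_s=0.2):
    """Delete stale rows in small batches, sleeping between batches
    so the cleanup never monopolizes the database.

    `stale_assets` is a hypothetical table used for illustration.
    """
    total_deleted = 0
    while True:
        cur = conn.execute(
            "DELETE FROM stale_assets WHERE id IN "
            "(SELECT id FROM stale_assets LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # release locks after every batch
        if cur.rowcount == 0:
            break  # nothing left to clean up
        total_deleted += cur.rowcount
        time.sleep(pause_s)  # yield time to foreground queries
    return total_deleted
```

Capping each batch and sleeping between them bounds how long any single iteration holds locks and consumes I/O, leaving headroom for foreground API traffic; the database-locks alerting mentioned under near-term improvements would then flag any job that overruns that budget.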