Astronomer incident

We are experiencing an issue with new task execution on AWS clusters

Major Resolved

Astronomer experienced a major incident on March 26, 2025 affecting Scheduling and Running DAGs and Tasks, Cloud UI, and Cloud API, lasting 2h 42m. The incident has been resolved; the full update timeline is below.

Started
Mar 26, 2025, 04:59 AM UTC
Resolved
Mar 26, 2025, 07:42 AM UTC
Duration
2h 42m
Detected by Pingoru
Mar 26, 2025, 04:59 AM UTC

Affected components

Scheduling and Running DAGs and Tasks, Cloud UI, Cloud API

Update timeline

  1. investigating Mar 26, 2025, 04:59 AM UTC

    We are experiencing an issue with new task execution on AWS clusters and an issue with accessing the UI. The team is actively investigating.

  2. identified Mar 26, 2025, 06:22 AM UTC

    The team has identified the issue and is actively working on mitigating it. The issue concerns new task execution on AWS clusters and accessing the UI.

  3. resolved Mar 26, 2025, 07:42 AM UTC

    This incident has been resolved.

  4. postmortem Mar 27, 2025, 09:36 PM UTC

    # Overview

    Between March 18 and March 26, 2025, Astro experienced a series of related incidents. This write-up serves as the analysis of the full set of incidents, as the events overlap.

    On Tuesday, March 18, an Azure outage prevented our control plane components that handle authentication requests from reaching our authentication vendor, Auth0. This lasted for 45 minutes and was resolved without intervention from Astronomer. Because the control plane for Astro is hosted in Azure, all customers were affected, including those that host their data planes in AWS or GCP. During this period, customers were unable to reach the Astro or Airflow UI and were unable to get a response from the Airflow API.

    On Friday, March 21, our monitoring systems alerted us to a critical issue affecting the control plane authentication components, again preventing customers from accessing the Astro UI and API. Our investigation revealed the root cause was unusually high volumes of automated authentication requests that overwhelmed the scaling capabilities of our authentication components. While we cannot share specific details about this traffic for security reasons, we can confirm that our security team verified no customer data was compromised or exposed. Our engineering team implemented protective measures against similar authentication scaling issues, including reconfiguring several network settings to improve resilience.

    Early Sunday, March 23, we discovered that our network reconfiguration had unintentionally triggered a previously unknown bug in our system.

    * Astro uses a component called “Harmony client” to synchronize changes between the control plane (where customers make configuration changes) and the data plane (where customer workloads run).
    * During normal operations, when the Harmony client cannot connect to the control plane, it waits a few seconds and retries. However, we discovered a specific bug in how Harmony client handles connection errors for AWS clusters. On receiving a 404 error, it interpreted the response as empty, which signaled that no resources were needed, and it inadvertently deleted all worker node pools, the computing resources that run tasks.
    * This affected around a dozen customer clusters, disrupting workflow execution and UI access. Most clusters were protected by safety mechanisms that prevented the deletion of node pools, but this also blocked automatic recovery.

    The team manually restored node pools to all affected clusters by Sunday afternoon. We developed a fix for the Harmony client bug and implemented enhanced monitoring to detect any similar issues. Given the complexity of these interconnected incidents, we made the decision to implement a temporary change freeze while thoroughly reviewing the fix and our deployment procedures to avoid introducing new issues.

    On Wednesday, March 26, our monitoring systems detected a recurrence of the Harmony client issue. Our incident response team immediately engaged and identified the cause: during our troubleshooting process, a change intended for our staging environment had been applied to production through a gap in our change management process. This change caused the Harmony client to receive 404 errors once again, re-triggering the issue.

    Given the large number of clusters, our efforts to apply manual fixes were taking too long to address the issue. As we no longer had confidence that it would be safer to wait, we initiated a hotfix release of the Harmony client patch, which fixes the root cause, across the fleet.

    # Long-Term Preventive Measures

    We have corrected the underlying bug in the Harmony client and put in initial steps to deal with bursts of unusual traffic. In the coming weeks, we plan to put further mitigations in place around automated traffic to further protect the system. We are looking at additional process measures to further separate staging and production access. We have also improved our monitoring to detect similar issues and are discussing further monitoring improvements. Additionally, we will deploy a script for automated recovery of a set of clusters to improve remediation time.

    If you have any additional questions or concerns, please contact us at [[email protected]](mailto:[email protected]).
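
The failure mode described in the postmortem (a 404 read as "no resources needed") can be illustrated with a minimal sketch. This is not Astronomer's code: `Response`, `ControlPlaneUnavailable`, `desired_node_pools`, `plan_deletions`, and the `allow_delete_all` flag are hypothetical names chosen for illustration. The sketch assumes a reconciler that compares the node pools currently on a cluster with the set returned by the control plane. It treats any non-200 response as "state unknown" instead of "state empty", and, as an extra assumption not taken from the postmortem, it refuses to delete every pool in a single pass unless explicitly allowed.

```python
from dataclasses import dataclass
from typing import Optional


class ControlPlaneUnavailable(Exception):
    """Desired state could not be fetched, so nothing should be changed."""


@dataclass(frozen=True)
class Response:
    # Hypothetical stand-in for an HTTP response from the control plane.
    status: int
    node_pools: Optional[list[str]] = None  # parsed body; None when there is no body


def desired_node_pools(resp: Response) -> set[str]:
    # Only a successful response with a parsed body counts as desired state.
    # A 404 (or any other error) means "unknown", never "empty".
    if resp.status != 200 or resp.node_pools is None:
        raise ControlPlaneUnavailable(f"unexpected status {resp.status}")
    return set(resp.node_pools)


def plan_deletions(
    current: set[str], resp: Response, allow_delete_all: bool = False
) -> set[str]:
    """Return the node pools that are safe to delete in this pass."""
    try:
        desired = desired_node_pools(resp)
    except ControlPlaneUnavailable:
        return set()  # keep everything; the caller waits and retries
    to_delete = current - desired
    # Assumed extra guard: never remove every pool unless explicitly allowed.
    if current and to_delete == current and not allow_delete_all:
        return set()
    return to_delete
```

With `current = {"workers-a", "workers-b"}`, a `Response(404)` yields no deletions, while `Response(200, ["workers-a"])` yields `{"workers-b"}`. The trade-off of the extra guard is that a legitimate request to remove all pools needs an explicit opt-in.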