Qubole incident
Degraded performance issue Airflow clusters on api.q
Qubole experienced a minor incident on October 15, 2022 affecting Cluster Operations, lasting 16d 15h. The incident has been resolved; the full update timeline is below.
Affected components
- Cluster Operations
Update timeline
- investigating Oct 14, 2022, 06:33 PM UTC
For customers using Airflow, there is an issue with clusters running v1.10.2. All other aspects of Qubole operations are functioning normally. Airflow clusters do not terminate automatically because of the nature of their function, and any Airflow clusters that are currently running are functioning normally. However, if an Airflow cluster is terminated or restarted for any reason, it will not come back up, because Airflow will fail to load. We are working to resolve the issue and will post an update within the next 2 hours, or earlier if new information is available or service is restored.
- investigating Oct 14, 2022, 09:23 PM UTC
The library used by Airflow relies on a common Python package that was recently upgraded in the open source community, and the upgrade introduced breaking changes. The community is triaging the problem and working on updates. In the meantime, we are exploring ways to bring a workaround for this upstream issue into QDS Airflow.
- identified Oct 15, 2022, 12:42 AM UTC
The issue has been identified and a fix is being implemented.
- identified Oct 15, 2022, 02:59 AM UTC
We are investigating options to resolve the issue. We will continue to post the status.
- identified Oct 15, 2022, 07:31 AM UTC
We are continuing to work on a resolution.
- identified Oct 15, 2022, 10:58 AM UTC
We are continuing to work on a resolution.
- identified Oct 15, 2022, 02:12 PM UTC
We are continuing to work on a resolution.
- identified Oct 15, 2022, 03:12 PM UTC
DevOps continues to work on patching an Airflow AMI that can be used to mitigate the issue. Until the issue is resolved, do not shut down any Airflow clusters that are currently running and in use. Keeping them running avoids the startup failure that we have reported.
- monitoring Oct 15, 2022, 11:43 PM UTC
We are investigating options to resolve the issue. We will continue to post the status.
- monitoring Oct 16, 2022, 02:45 AM UTC
We are continuing to monitor for any further issues.
- monitoring Oct 16, 2022, 05:44 AM UTC
We are continuing to monitor for any further issues.
- monitoring Oct 16, 2022, 09:03 AM UTC
We are continuing to monitor for any further issues.
- monitoring Oct 16, 2022, 12:49 PM UTC
We are continuing to monitor for any further issues.
- monitoring Oct 16, 2022, 03:04 PM UTC
We have a new AMI set up and working. Our next step is to roll it out to all affected customers using Airflow. In the meantime, the new AMI is available, and you can override your current Airflow cluster settings to use it.
- resolved Oct 17, 2022, 02:40 AM UTC
The Airflow issue is resolved. Customers who need the hotfix AMI should file a ticket with Qubole Support and include the ID(s) of the cluster(s) that need to be updated.