Qubole incident

Degraded performance issue on api.qubole.com

Qubole experienced a minor incident on May 10, 2022 affecting Command Processing, Cluster Operations, and Notebooks, lasting 1d 12h. The incident has been resolved; the full update timeline is below.

Started: May 10, 2022, 07:39 PM UTC
Resolved: May 12, 2022, 07:50 AM UTC
Duration: 1d 12h
Detected by Pingoru: May 10, 2022, 07:39 PM UTC

Affected components

Command Processing, Cluster Operations, Notebooks

Update timeline

investigating May 10, 2022, 07:39 PM UTC

Several customers are experiencing issues when scheduling jobs. We are looking into the matter and will update shortly.
identified May 10, 2022, 09:37 PM UTC

We have identified a full table in the rStore database that appears to be causing the issue. We are in the process of clearing that condition.
identified May 11, 2022, 01:15 AM UTC

We continue to work on clearing resources and expanding the limits in the rStore database. We should have an ETA shortly.
identified May 11, 2022, 03:50 AM UTC

Latest updates:
- Cleared the storage issues and the low-memory condition on the longer-running tunnels.
- Increased the RDS memory from 5000 GB to 5500 GB on the production rStore RDS instance and on its replica. According to Amazon's documentation, this takes about 6 hours. We started at about 5 PM CST, so the updated instance with the added memory should be up and running at around 11 PM CST.

After taking steps to free up storage, the issue persists and the storage is not being released. We are continuing to investigate and will update accordingly.
identified May 11, 2022, 10:13 AM UTC

- The issue is still under investigation.
- The current RDS DB (MySQL) instance runs a deprecated major version (5.6.39), and the tablespace still appears full even after applying innodb_file_per_table=1.
- The team is working to migrate the environment, along with the DB, to a supported version of MySQL.

We are continuing to investigate and will update accordingly.
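
As general MySQL background (not stated in the update itself): innodb_file_per_table only applies to tables created or rebuilt after it is enabled, so existing tables stay in the shared system tablespace. The sketch below builds a query that lists tables still stored there. The helper name is hypothetical, and the choice of INFORMATION_SCHEMA view per server version is an assumption to verify against the documentation for the version in use.

```python
def shared_tablespace_query(server_version: str) -> str:
    """Build a query listing InnoDB tables stored in the system tablespace.

    Assumption: MySQL 5.6/5.7 expose INFORMATION_SCHEMA.INNODB_SYS_TABLES,
    MySQL 8.0 renamed it to INNODB_TABLES, and space id 0 denotes the
    shared system tablespace in both.
    """
    major, minor = (int(part) for part in server_version.split(".")[:2])
    view = "INNODB_TABLES" if (major, minor) >= (8, 0) else "INNODB_SYS_TABLES"
    return f"SELECT NAME FROM INFORMATION_SCHEMA.{view} WHERE SPACE = 0;"


# Version string taken from the update above.
print(shared_tablespace_query("5.6.39"))
```

The function only produces the statement text; running it against a live instance is left to whichever MySQL client is already in use.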
identified May 11, 2022, 03:08 PM UTC

Latest update:

What caused the outage
* A table in the rStore database filled up and also caused the disk space to fill up, which left the database unresponsive. Customers are not able to run jobs because of the unresponsive rStore database.

What has been done to resolve so far
* Increased memory and storage on the instance.
* Cleared the table, but the disk space was not reclaimed and is still full.
* Engaged AWS and determined that we cannot set the parameter for the table to autoscale, because it has to be set upon creation.
* Created a new instance from the old database with increased storage and memory.

What's next
* The new MySQL database is in place, and setup is complete.
* Export data from the prior DB to S3: in progress.
* Import data from the prior instance into the new instance.

The estimated time to complete the data load is 24 hours due to the size of the MySQL database (1 TB+). We are working with AWS to identify any methods to decrease the data load time. We will provide updates here if there is any change to the timeline.
identified May 11, 2022, 08:04 PM UTC

As per the last update, we are still in the process of moving the data.
identified May 11, 2022, 10:10 PM UTC

Upon further investigation and working with AWS support, we have a new update and plan:

1. In working with AWS this afternoon, DevOps determined that a table reached the MySQL 2 TB limit. This table is a system table, so we cannot delete data from it.
2. The cause is that multiple tables are written to the same data file. Good practice would have been a separate data file for each table, which was not the case.
3. To fix this, DevOps will:
   - Back up the handful of tables whose data will be moved into their own files.
   - Drop those tables and recreate them with their own data files.
   - Restore the data to those tables, which should move the data into their own data files and out of the data file with the 2 TB limit, thus freeing space.
4. This should defragment the database and free up space while decreasing the size of the data file that is running into the limit.

This will be a temporary measure to get back up and running. Testing and implementation should take roughly the next 8 hours, depending on the data load. We estimate that the work will be complete and service back up by 12:00 CST.

The long-term solution is to rebuild the entire database. That can be done offline, followed by a cutover once it is ready, so no downtime would be involved. We have done similar updates in the other regions with no impact or downtime for customers.
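
The back up, drop, recreate, and restore sequence in step 3 can be sketched as an ordered command list. This is an illustration under stated assumptions, not the procedure DevOps actually ran: the database name `rstore`, the table names, and the helper `rebuild_plan` are hypothetical, connection options are omitted, and it assumes a default mysqldump dump (which includes the CREATE TABLE statement) with innodb_file_per_table enabled at restore time.

```python
from typing import List


def rebuild_plan(database: str, tables: List[str]) -> List[str]:
    """Return ordered commands that move tables into their own data files.

    Mixed list of SQL statements and shell commands, in the order given
    in the update: verify the setting, back up, drop, then restore.
    """
    # Confirm new tables will be created with their own data file.
    steps = ["SHOW VARIABLES LIKE 'innodb_file_per_table';"]
    # Back up every table before anything is dropped.
    steps += [f"mysqldump {database} {t} > {t}.sql" for t in tables]
    # Drop the copies that live in the shared data file.
    steps += [f"DROP TABLE `{database}`.`{t}`;" for t in tables]
    # Restoring the dump recreates each table and reloads its rows.
    steps += [f"mysql {database} < {t}.sql" for t in tables]
    return steps


# Hypothetical names, for illustration only.
for step in rebuild_plan("rstore", ["example_table_a", "example_table_b"]):
    print(step)
```

Note that dropping tables from a shared InnoDB data file generally frees pages inside that file for reuse rather than shrinking the file on disk, which is consistent with the update describing this as a temporary measure ahead of a full rebuild.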
identified May 12, 2022, 01:06 AM UTC

We are still proceeding with the plan as outlined and on track to complete by 12:00 CST. We will update here if there are any changes.
monitoring May 12, 2022, 03:02 AM UTC

The issue with the rStore database has been resolved. Customers should be able to execute their jobs and workloads now.
resolved May 12, 2022, 07:50 AM UTC

The issue with the rStore database has been resolved. Customers should be able to execute their jobs and workloads now.