Qubole incident

Accessibility and performance issue on api.qubole.com

Major Resolved

Qubole experienced a major incident on June 21, 2021, affecting Site Availability, QDS API, and 4 more components, lasting 23h 46m. The incident has been resolved; the full update timeline is below.

Started
Jun 21, 2021, 07:10 PM UTC
Resolved
Jun 22, 2021, 06:57 PM UTC
Duration
23h 46m
Detected by Pingoru
Jun 21, 2021, 07:10 PM UTC

Affected components

Site Availability, QDS API, Command Processing, Qubole Scheduler, Cluster Operations, Notebooks

Update timeline

  1. investigating Jun 21, 2021, 07:10 PM UTC

    api.qubole.com is currently seeing degraded performance and is returning errors on access. At this time failures appear to be partial and intermittent, but DevOps is investigating.

  2. identified Jun 21, 2021, 11:05 PM UTC

    DevOps has identified an issue with the EKS cluster, and the logs show the QDS API failing. The team is still troubleshooting and, as a next step, is working on restarting the pods associated with the EKS cluster.

  3. identified Jun 22, 2021, 03:02 AM UTC

    DevOps has identified an issue with production-rstore caused by a table space problem and has reached out to AWS Support to resolve it. DevOps is currently moving out the old table to free up space, as increasing the file system size is a complex process at this stage.

  4. identified Jun 22, 2021, 07:04 AM UTC

    The DevOps team is continuing to move out the DB table dump from one of the impacted tables. Once the data dump is complete, DevOps will be able to free up some space, which will help resolve the issue.

  5. identified Jun 22, 2021, 11:00 AM UTC

    DevOps is continuing to troubleshoot issues with the databases. The production rstore read replica DB is currently erroring out, which is causing various issues. DevOps is actively working on these issues and aims to resolve them as early as possible.

  6. identified Jun 22, 2021, 03:00 PM UTC

    The RDS DB appears to have hit the maximum DB size limit for its instances. On AWS's recommendation, the DevOps team is upgrading the instances and is currently generating a backup of the older instances so that the backup can be restored on a newer instance with a higher configuration.

  7. resolved Jun 22, 2021, 06:57 PM UTC

    DevOps has confirmed that they are no longer seeing errors on the DB. The application is able to connect to the DB, and api.qubole.com is up and working. Query history, cluster start and stop, and the read-only replica of the primary DB are up and running.