Qubole incident
GCP API errors causing issues for Qubole cluster operations
Qubole experienced a major incident on October 7, 2020 affecting Cluster Operations, lasting 1d 4h. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Oct 07, 2020, 08:23 PM UTC
Qubole DevOps was alerted to issues with cluster operations in the GCP environment. While we do not see a public statuspage incident from GCP at this time, we have detected a large number of timeouts on standard GCP API calls. We are opening a critical investigation with GCP support and will update as we gain further insights. At this time, cluster start and stop operations are failing.
- investigating Oct 07, 2020, 09:23 PM UTC
Qubole DevOps continues to debug the current issue with GCP Support assistance. We will keep you updated when we have more information.
- monitoring Oct 07, 2020, 11:34 PM UTC
Qubole DevOps has been working with GCP support and has implemented a short-term workaround that has improved the ability to start/stop clusters. Note that during our investigation, GCP support posted a public statuspage update here: https://status.cloud.google.com. We will monitor the situation and determine if additional steps are required to stabilize performance of the service.
- identified Oct 08, 2020, 12:09 AM UTC
Qubole DevOps has determined that the temporary workaround is not working consistently and is working on additional potential solutions. We are working closely with GCP Support and following their statuspage updates. Their latest update is as follows: "Google's API Discovery Service GetRest (https://www.googleapis.com/discovery/v1/apis/pubsub/v1/rest) requests are hanging in the following regions: asia-northeast1, asia-northeast2, asia-northeast3, europe-west3, europe-west6, northamerica-northeast1, southamerica-east1, us-west2, and us-west4. We are currently working to mitigate by rolling back a configuration change. We expect the rollout to complete within the next 7 hours. Next update time is Wednesday, 2020-10-07 23:15 US/Pacific."
- monitoring Oct 08, 2020, 01:00 AM UTC
Qubole DevOps has applied additional workarounds and verified that cluster operations are functional again. We will continue to track the incident on the GCP side and determine the best course of action to re-configure the workarounds thereafter. We appreciate your patience through this process. If you see further issues, please don't hesitate to reach out to Qubole Support.
- monitoring Oct 08, 2020, 08:53 AM UTC
As noted in our earlier update, Qubole DevOps has applied additional workarounds and verified that cluster operations remain functional. However, we are continuing to work with the Google support team to resolve this issue permanently. The latest update from Google is as follows: "Google's API Discovery Service GetRest (https://www.googleapis.com/discovery/v1/apis/pubsub/v1/rest) requests are hanging in the following regions: asia-northeast1, asia-northeast2, asia-northeast3, asia-southeast1, europe-west1, europe-west3, europe-west6, europe-west4, northamerica-northeast1, southamerica-east1, us-central1, us-east1, us-west1, us-west2, and us-west4. We are currently working to mitigate by rolling back a configuration change. Next update time is Thursday, 2020-10-08 07:00 US/Pacific." The same update is available at https://status.cloud.google.com/.
- resolved Oct 09, 2020, 12:30 AM UTC
Qubole DevOps has restored the environment to the working state prior to the GCP API outage. A series of thorough application functionality verifications was performed, and all passed. We appreciate your patience through this process.