Qubole experienced a major incident on April 16, 2021 affecting Cluster Operations, lasting 12d 17h. The incident has been resolved; the full update timeline is below.
Affected components: Cluster Operations
Update timeline
- investigating Apr 16, 2021, 07:59 PM UTC
We are aware of a tunnel server availability issue on gcp.qubole.com that may prevent clusters from starting. Devops is restarting the tunnel servers; this incident will be updated once the restart is complete.
- identified Apr 22, 2021, 01:09 PM UTC
Replacing the tunnel servers uncovered an issue with the discovery server. The discovery server is being replaced and must be back online before clusters can be started.
- monitoring Apr 23, 2021, 02:50 PM UTC
A cluster engine restart has resolved this issue. Devops is resolving a few leftover cluster redirection issues manually.
- monitoring Apr 26, 2021, 05:53 AM UTC
We are continuing to monitor for any further issues.
- monitoring Apr 28, 2021, 02:42 PM UTC
Devops believes they have identified the issue preventing some individual clusters from coming online. They are monitoring to confirm that the change applied is a complete fix.
- resolved Apr 29, 2021, 01:05 PM UTC
Outstanding cluster issues appear to be specific to the clusters' configuration. At this time the service interruption is resolved.