Qubole incident

Cluster startup failure

Major Resolved View vendor source →

Qubole experienced a major incident on April 16, 2021 affecting Cluster Operations, lasting 12d 17h. The incident has been resolved; the full update timeline is below.

Started
Apr 16, 2021, 07:59 PM UTC
Resolved
Apr 29, 2021, 01:05 PM UTC
Duration
12d 17h
Detected by Pingoru
Apr 16, 2021, 07:59 PM UTC

Affected components

Cluster Operations

Update timeline

  1. investigating Apr 16, 2021, 07:59 PM UTC

    We are aware of a tunnel server availability issue on gcp.qubole.com that may prevent clusters from starting. Devops is in the process of restarting tunnel servers -- this incident will be updated as that is finalized.

  2. identified Apr 22, 2021, 01:09 PM UTC

    Tunnel server replacement uncovered an issue with the discovery server. The server is in the process of being replaced, and will have to be online before clusters can be started.

  3. monitoring Apr 23, 2021, 02:50 PM UTC

    A cluster engine restart has resolved this issue. Devops is resolving a few leftover cluster redirection issues manually.

  4. monitoring Apr 26, 2021, 05:53 AM UTC

    We are continuing to monitor for any further issues.

  5. monitoring Apr 28, 2021, 02:42 PM UTC

    Devops believes they have identified the issue preventing some individual clusters from coming online. They're monitoring to ensure that the change provided is the complete fix.

  6. resolved Apr 29, 2021, 01:05 PM UTC

    Outstanding cluster issues appear to be specific to the clusters' configuration. At this time the service interruption is resolved.