Qubole incident

Spark and Presto query failures

Critical, Resolved

Qubole experienced a critical incident on April 22, 2021, affecting QDS API and Command Processing and 1 more component, and lasting 5d 7h. The incident has been resolved; the full update timeline is below.

Started: Apr 22, 2021, 12:52 PM UTC
Resolved: Apr 27, 2021, 08:45 PM UTC
Duration: 5d 7h
Detected by Pingoru: Apr 22, 2021, 12:52 PM UTC
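
The reported duration can be checked against the start and resolution timestamps above. The following is a minimal Python sketch, assuming the duration is truncated to whole hours (the leftover minutes are dropped). The function name `format_duration` and the variable names are illustrative and do not come from the status page.

```python
from datetime import datetime, timezone


def format_duration(start: datetime, end: datetime) -> str:
    """Return the elapsed time as '<days>d <hours>h', dropping leftover minutes."""
    delta = end - start
    # timedelta stores whole days separately from the remaining seconds
    hours = delta.seconds // 3600
    return f"{delta.days}d {hours}h"


# Timestamps taken from the incident summary (both in UTC)
started = datetime(2021, 4, 22, 12, 52, tzinfo=timezone.utc)
resolved = datetime(2021, 4, 27, 20, 45, tzinfo=timezone.utc)

print(format_duration(started, resolved))  # prints "5d 7h"
```

The exact elapsed time is 5 days, 7 hours, and 53 minutes, which matches the reported 5d 7h under that truncation assumption.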

Affected components

QDS API, Command Processing, Qubole Scheduler, Cluster Operations

Update timeline

  1. investigating Apr 22, 2021, 12:52 PM UTC

    Spark and Presto queries run on in.qubole.com may stall, returning Pending or Queued status. Devops is investigating.

  2. investigating Apr 22, 2021, 02:56 PM UTC

    We are continuing to investigate this issue.

  3. monitoring Apr 23, 2021, 06:07 PM UTC

    Devops is monitoring its latest fix; the issue should now be resolved. Additional information about the resolution will be added after monitoring.

  4. monitoring Apr 25, 2021, 11:44 AM UTC

    We are continuing to monitor for any further issues.

  5. monitoring Apr 25, 2021, 11:47 AM UTC

    An additional instance of stalled operations was reported yesterday evening (4/24); the stalled operations have since cleared. Devops is looking into the root cause of the stall so that a more permanent fix can be applied.

  6. monitoring Apr 26, 2021, 05:51 AM UTC

    We are continuing to monitor for any further issues.

  7. resolved Apr 27, 2021, 08:45 PM UTC

    Devops expects the operational issues to be resolved. After restarting discovery, the team needed to augment client nodes to handle the volume of traffic.