Qubole incident

Accessibility and performance issue on api.qubole.com

Minor, Resolved

Qubole experienced a minor incident on June 11, 2021 affecting Site Availability, QDS API, and 3 more components, lasting about 1 day. The incident has been resolved; the full update timeline is below.

Started
Jun 11, 2021, 07:16 PM UTC
Resolved
Jun 12, 2021, 08:00 PM UTC
Duration
1d
Detected by Pingoru
Jun 11, 2021, 07:16 PM UTC

Affected components

Site Availability, QDS API, Command Processing, Qubole Scheduler, Cluster Operations

Update timeline

  1. investigating Jun 11, 2021, 07:16 PM UTC

    api.qubole.com is currently seeing some degraded performance and is occasionally returning 404 errors on access. At this time failures appear to be partial and intermittent, but DevOps is investigating.

  2. investigating Jun 12, 2021, 04:05 AM UTC

    api.qubole.com is currently running slowly due to extremely high throughput, likely complicated by an initial issue with burst throttling (now resolved). Though the number of connections and backlogged operations is consistently coming down, we're seeing delays in the webapp's ability to intake requests and process them. New nodes being added are seeing stability issues in the webapp tier, not accepting traffic and failing.

  3. resolved Jun 12, 2021, 08:00 PM UTC

    A large ad-hoc workload that ran into unexpected errors drove an ongoing backlog of operations. The lagging operations have either been killed or have finished, restoring regular performance. DevOps is doing post-mortem monitoring of the environment.