Qubole incident

Accessibility and performance issue on us.qubole.com

Minor · Resolved

Qubole experienced a minor incident on June 15, 2021, affecting Site Availability, QDS API, Command Processing, Qubole Scheduler, Cluster Operations, and Notebooks, and lasting 14h 7m. The incident has been resolved; the full update timeline is below.

Started
Jun 15, 2021, 09:24 AM UTC
Resolved
Jun 15, 2021, 11:31 PM UTC
Duration
14h 7m
Detected by Pingoru
Jun 15, 2021, 09:24 AM UTC
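
The stated duration can be checked against the start and resolution timestamps above. The following is a minimal sketch, assuming the timestamps are parsed in the format shown on this page; the helper names `parse_utc` and `format_duration` are illustrative and not part of any Qubole or Pingoru tooling.

```python
from datetime import datetime, timezone

# Timestamp layout as it appears on this page, e.g. "Jun 15, 2021, 09:24 AM UTC".
# Note: %b and %p depend on an English/C locale.
STAMP_FORMAT = "%b %d, %Y, %I:%M %p"


def parse_utc(stamp: str) -> datetime:
    """Parse a status-page timestamp into a timezone-aware UTC datetime."""
    naive = datetime.strptime(stamp.replace(" UTC", ""), STAMP_FORMAT)
    return naive.replace(tzinfo=timezone.utc)


def format_duration(start: datetime, end: datetime) -> str:
    """Render the elapsed time between two datetimes as 'Xh Ym'."""
    total_minutes = int((end - start).total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}m"


started = parse_utc("Jun 15, 2021, 09:24 AM UTC")
resolved = parse_utc("Jun 15, 2021, 11:31 PM UTC")
print(format_duration(started, resolved))  # 14h 7m
```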

Affected components

Site Availability, QDS API, Command Processing, Qubole Scheduler, Cluster Operations, Notebooks

Update timeline

  1. investigating Jun 15, 2021, 09:24 AM UTC

    us.qubole.com is currently seeing some degraded performance, and occasionally returning 404 errors during access. At this time failures appear to be partial and intermittent, but Devops is investigating.

  2. identified Jun 15, 2021, 11:49 AM UTC

    Devops has identified errors in several tiers of the us.qubole.com environment and is fixing webapp nodes that are not connected to the ELB. The team is currently working toward resolution of the issue.

  3. identified Jun 15, 2021, 01:27 PM UTC

    The Devops team has restarted nginx on the webapp nodes after gathering the logs required for future analysis. After the restart, the webapp nodes are rejoining the load balancer. The investigation continues, with the aim of resolving this issue as soon as possible.

  4. monitoring Jun 15, 2021, 01:37 PM UTC

    After restarting the nginx services, the Qubole UI is now accessible on us.qubole.com. The team continues to monitor and investigate the other identified issues (including Airflow and Notebooks). We will post further updates as soon as we receive them.

  5. identified Jun 15, 2021, 06:12 PM UTC

    Devops is continuing to work on this issue. While they troubleshoot a problem on an internal cluster, they are also confirming that access to the interface has degraded, frequently returning 404 or 502 errors.

  6. identified Jun 15, 2021, 09:16 PM UTC

    All nodes of the problem cluster have been restarted, but Devops has run into an issue bringing some database resources back online. Devops is focused on resolving this error, which should restore functionality.

  7. resolved Jun 15, 2021, 11:31 PM UTC

    The database issue has been resolved, nodes have come back online, and the environment is accessible.
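
The timeline describes failures that were "partial and intermittent", with 404 and 502 responses mixed in with successful ones. As a rough illustration of how such a state might be quantified from outside, the sketch below summarizes a batch of sampled HTTP status codes. The function name `summarize_probes`, the state labels, and the sample data are all hypothetical; this is not how Qubole or Pingoru detected the incident.

```python
from collections import Counter


def summarize_probes(status_codes):
    """Classify a batch of HTTP status codes sampled from one endpoint.

    Returns (state, failure_rate, counts), where state is 'healthy' if no
    sample failed, 'down' if every sample failed, and 'partial' otherwise.
    Any 4xx or 5xx code is counted as a failure (an assumption made here).
    """
    if not status_codes:
        raise ValueError("at least one sample is required")
    counts = Counter(status_codes)
    failures = sum(n for code, n in counts.items() if code >= 400)
    failure_rate = failures / len(status_codes)
    if failures == 0:
        state = "healthy"
    elif failures == len(status_codes):
        state = "down"
    else:
        state = "partial"
    return state, failure_rate, dict(counts)


# Made-up samples for illustration only; not data from this incident.
samples = [200, 200, 404, 200, 502, 200, 200, 404]
state, failure_rate, counts = summarize_probes(samples)
print(state, f"{failure_rate:.1%}", counts)
```

A batch containing both successes and failures is reported as "partial", which matches the intermittent behavior described in the first and fifth updates.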