Qubole incident

Notebooks unavailable

Critical Resolved

Qubole experienced a critical incident on March 23, 2021 affecting three Notebooks components, lasting 29d 3h. The incident has been resolved; the full update timeline is below.

Started
Mar 23, 2021, 02:13 PM UTC
Resolved
Apr 21, 2021, 06:10 PM UTC
Duration
29d 3h
Detected by Pingoru
Mar 23, 2021, 02:13 PM UTC

Affected components

Notebooks, Notebooks, Notebooks

Update timeline

  1. investigating Mar 23, 2021, 02:13 PM UTC

    Customers attempting to access notebooks will encounter a "500 Internal Server Error".

  2. investigating Mar 30, 2021, 06:35 PM UTC

    Devops is continuing to investigate this issue, with the intent to find an immediate workaround.

  3. investigating Mar 31, 2021, 08:18 PM UTC

    Devops has identified a potential cause for this issue and is testing a fix in an internal environment. If the fix is successful, this ticket will be updated to reflect the plan to implement the change across Qubole environments.

  4. identified Apr 02, 2021, 10:32 PM UTC

    Devops is implementing several fixes in the Qubole environment today (4.2.2021) and tomorrow (4.3.2021) in order to resolve this ongoing outage. This page will continue to be updated as the implementation is completed.

  5. identified Apr 06, 2021, 06:00 PM UTC

    Devops has implemented several fixes in the Qubole environment over the last 4 days in order to resolve this ongoing outage. At this point they believe one component deployment is still pending; they have sorted out the issues related to that deployment and will be performing rigorous testing. Devops anticipates rolling this fix out by Thursday (4/8). This page will continue to be updated as the implementation is completed.

  6. identified Apr 06, 2021, 07:42 PM UTC

    Devops has confirmed the issue has been resolved for all environments except us.qubole, for which their ETA is still Thursday.

  7. identified Apr 09, 2021, 02:01 AM UTC

    Devops is continuing component deployment to resolve this issue. We will update the status again when we have a new ETA or confirmation of resolution.

  8. investigating Apr 12, 2021, 01:12 PM UTC

    Devops resolved a blocking reference error over the weekend. They continue to investigate specific notebook failures on us.qubole.com. At this time a few notebooks appear to be opening normally, though the vast majority still have an issue. Devops is investigating specific failure instances to determine a root cause that will resolve all ongoing failures.

  9. identified Apr 14, 2021, 02:20 AM UTC

    Devops encountered a dependency issue in their attempted fix. They are performing a package rebuild, which will need to be tested tomorrow, Wednesday, before deployment. Devops is attempting to finalize both testing and deployment tomorrow, though deployment may occur on Thursday, 4/15/2021.

  10. identified Apr 15, 2021, 08:32 PM UTC

    Redeployment is still in progress and is expected to finish between today (4/15) and tomorrow. This entry will be updated again when redeployment is finished or when Devops provides further information.

  11. identified Apr 19, 2021, 05:43 PM UTC

    Devops has passed some additional hurdles in final testing of the redeployment on QA. The new deployment is undergoing retesting before being rolled to production.

  12. identified Apr 21, 2021, 06:08 PM UTC

    Significant progress has been made in resolving the Notebook outage. Each environment that serves Notebooks has been refreshed with new nodes, and the Notebooks functionality is successfully responding in AWS. AWS customers will now see Notebooks as available and functional while, in parallel, the Qubole team continues its production testing to confirm stability and true resolution of the outage. We continue to work to restore Notebooks on GCP and will be updating the status page regularly.

  13. resolved Apr 21, 2021, 06:10 PM UTC

    This incident has been resolved.