Qubole incident

Degraded performance issue on api.qubole.com

Minor · Resolved

Qubole experienced a minor incident on March 9, 2022 affecting Site Availability, Command Processing, Qubole Scheduler, and Cluster Operations, lasting 4d 6h. The incident has been resolved; the full update timeline is below.

Started
Mar 09, 2022, 10:05 PM UTC
Resolved
Mar 14, 2022, 04:56 AM UTC
Duration
4d 6h
Detected by Pingoru
Mar 09, 2022, 10:05 PM UTC

Affected components

Site Availability, Command Processing, Qubole Scheduler, Cluster Operations

Update timeline

  1. investigating Mar 09, 2022, 10:05 PM UTC

    api.qubole.com is currently seeing degraded performance in command processing and the UI. At this time the issue appears to be intermittent.

  2. identified Mar 10, 2022, 01:19 AM UTC

    DevOps has identified an issue with memcache and Redis in api.qubole.com. The DevOps team is investigating further.

  3. identified Mar 10, 2022, 11:26 AM UTC

    Our internal team has resolved the issue with the worker nodes. The team is now working on auto scaling of the nodes under the scheduler ELB.

  4. identified Mar 10, 2022, 04:18 PM UTC

    The scheduler tier on api.qubole.com is not allowing new instances to be created. The DevOps team has created a new non-VPC ASG and is trying to add and scale up scheduler nodes. Once that is done, we can switch from the existing ASG to the new ASG.

  5. identified Mar 10, 2022, 08:31 PM UTC

    A new ASG has been created under classic, and a couple of nodes have been added and are serving traffic. The scheduler nodes and the connectivity issues between the workers and memcache are fixed. DevOps is now running sample jobs and observing stability.

  6. identified Mar 11, 2022, 03:56 AM UTC

    A new ASG has been created under classic, and a couple of nodes have been added and are serving traffic. DevOps is now working on the Redis connection issue and is still investigating the root cause in order to resolve it.

  7. monitoring Mar 11, 2022, 01:47 PM UTC

    Overall, the api.qubole.com environment seems to be stabilizing. The DevOps team is continuously monitoring the environment.

  8. monitoring Mar 11, 2022, 05:23 PM UTC

    Overall, the api.qubole.com environment seems to be stabilizing. DevOps team is continuing to resolve issues for specific individual customers.

  9. monitoring Mar 11, 2022, 08:42 PM UTC

    Overall, the environment 'api.qubole.com' seems to be stabilizing. DevOps team is continuing to resolve issues for specific individual customers.

  10. monitoring Mar 11, 2022, 11:32 PM UTC

    Overall, the 'api.qubole.com' environment seems to be stabilizing. DevOps team is continuing to resolve issues for specific individual customers.

  11. monitoring Mar 12, 2022, 04:59 AM UTC

    The DevOps team is working to resolve this issue as soon as possible and is continuing to resolve issues for specific individual customers.

  12. monitoring Mar 12, 2022, 08:45 AM UTC

    The DevOps team is continuing to work on this issue and is working with individual customers to resolve it as soon as possible.

  13. monitoring Mar 12, 2022, 11:58 AM UTC

    The DevOps team is continuing to work on this issue and is working with individual customers to resolve it as soon as possible.

  14. identified Mar 12, 2022, 03:35 PM UTC

    DevOps has identified a secondary issue with scheduler autoscaling that is contributing to the remaining intermittent issues. They are currently working to resolve the autoscaling issue.

  15. identified Mar 12, 2022, 06:32 PM UTC

    The DevOps team is actively working on the incident and has identified a secondary issue with scheduler autoscaling that is contributing to the remaining intermittent issues. They are currently working to resolve the autoscaling issue.

  16. identified Mar 12, 2022, 09:34 PM UTC

    The DevOps team is actively working on the incident and has identified a secondary issue with scheduler autoscaling that is contributing to the remaining intermittent issues. They are currently working to resolve the autoscaling issue.

  17. identified Mar 13, 2022, 12:43 AM UTC

    The DevOps team has identified a scheduler autoscaling problem as the cause of the remaining intermittent issues. They are currently working to resolve it.

  18. identified Mar 13, 2022, 04:53 AM UTC

    The DevOps team is still working with individual customers to resolve the scheduler issue.

  19. identified Mar 13, 2022, 08:08 AM UTC

    The DevOps team is actively working with individual customers on the cluster and scheduler issues and is trying to resolve them soon.

  20. identified Mar 13, 2022, 12:11 PM UTC

    The DevOps team is actively working with individual customers on the issue and is trying to resolve it as early as possible.

  21. identified Mar 13, 2022, 05:11 PM UTC

    The DevOps team is actively working on the issue, checking with customers, and trying to resolve it as early as possible.

  22. identified Mar 13, 2022, 08:34 PM UTC

    The DevOps team is actively working on the issue, checking with customers, and trying to resolve it as early as possible.

  23. identified Mar 13, 2022, 11:30 PM UTC

    The DevOps team is actively working on the issue, checking with customers, and trying to resolve it as early as possible.

  24. identified Mar 14, 2022, 02:04 AM UTC

    The DevOps team is actively working on the issue, checking with customers, and trying to resolve it as early as possible.

  25. resolved Mar 14, 2022, 04:56 AM UTC

    The issue has been resolved.