Qubole incident

Issue with increased error rates for AWS API

Qubole experienced a minor incident on July 29, 2020, affecting Command Processing, Cluster Operations, and 1 more component, lasting 4h 11m. The incident has been resolved; the full update timeline is below.

Started: Jul 29, 2020, 02:13 PM UTC
Resolved: Jul 29, 2020, 06:24 PM UTC
Duration: 4h 11m
Detected by Pingoru: Jul 29, 2020, 02:13 PM UTC

Affected components

Command Processing, Cluster Operations

Update timeline

investigating Jul 29, 2020, 02:13 PM UTC

Qubole received the following notification from AWS on their status page (https://status.aws.amazon.com): "6:21 AM PDT We have identified the cause of the increased API error rates in a single Availability Zone in the US-EAST-1 Region and continue working towards resolution. Customers experiencing errors launching new EC2 instances may attempt to launch their EC2 instances in another Availability Zone." Qubole customers may see an impact on their cluster operations as a result.

identified Jul 29, 2020, 04:28 PM UTC

Qubole continues to monitor the AWS API error rate issue in the us-east-1 region. At this time, performance in the affected Availability Zone (AZ) is sporadic and inconsistent. AWS reports that existing instances are not affected, so existing clusters are generally operational.

For cluster start operations, we recommend the following: if you cannot start your cluster, check the cluster startup log for the AZ it references. If several subnets are configured, you may remove the private subnet for that AZ from your cluster config; otherwise, replace it with a private subnet in a different AZ. Attempts to upscale or downscale your cluster, including acquiring spot nodes, may run into similar errors. If you are trying to downscale or terminate your cluster, you may need to attempt this multiple times in the Qubole UI or via the API.

Our goal is to work with AWS to ensure this issue is resolved as quickly as possible, but as of their 8:23 AM PDT update there is no definitive ETA. You can follow their updates here: https://status.aws.amazon.com/
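
The workaround in this update (dropping the subnet that sits in the affected AZ, and retrying terminate or downscale requests) could be sketched roughly as follows. This is only an illustration, not Qubole's API: the subnet IDs, the AZ name `us-east-1a`, and the helpers `subnets_outside_az` and `retry_until_success` are all hypothetical, and the operation you retry would be whatever call your own tooling uses.

```python
import time
from typing import Callable, Dict, List


def subnets_outside_az(subnet_to_az: Dict[str, str], impacted_az: str) -> List[str]:
    """Return the subnet IDs that are NOT in the impacted AZ, keeping input order."""
    return [subnet for subnet, az in subnet_to_az.items() if az != impacted_az]


def retry_until_success(
    operation: Callable[[], bool],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `operation` until it returns True or `attempts` is exhausted.

    Waits base_delay * 2**n seconds between tries (exponential backoff).
    `operation` is a placeholder for e.g. a terminate-cluster request.
    """
    for attempt in range(attempts):
        if operation():
            return True
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False


# Hypothetical cluster subnets; the AZ read from the startup log is assumed
# to be "us-east-1a" for the sake of the example.
configured = {
    "subnet-aaa": "us-east-1a",
    "subnet-bbb": "us-east-1b",
    "subnet-ccc": "us-east-1c",
}
print(subnets_outside_az(configured, "us-east-1a"))
```

If filtering leaves an empty list, the cluster has no subnet outside the affected AZ, which corresponds to the case above where a subnet in a different AZ has to be substituted.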

identified Jul 29, 2020, 05:34 PM UTC

Qubole DevOps received a recent update from AWS: "10:19 AM PDT We have deployed a fix to the impacted EC2 sub-system causing increased API error rates and new instance launch failures in a Single Availability zone in the US-EAST-1 Region and are beginning to see recovery. We continue to work towards full resolution. Existing instances remain unaffected by this issue." We have been testing our own internal cluster operations, have seen improvements, and will continue to verify as the issue clears on the AWS side.

resolved Jul 29, 2020, 06:24 PM UTC

Qubole DevOps has verified the resolution from AWS, verified our internal cluster operations, and cleared all related operational alerts. The resolution from AWS was posted at 10:52 AM PDT. We are resolving this incident at this time. Update from AWS: "Between 5:18 AM and 10:25 AM PDT we experienced increased error rates for some EC2 APIs and new instance launches in a Single Availability Zone in the US-EAST-1 region. Existing instances were unaffected. We are working to address API errors affecting a small number of EBS volumes as a result of this issue. The issue has been resolved and the service is operating normally."