Sticky incident

Partial outage due to Amazon AWS single Availability Zone connectivity issues

Sticky experienced a major incident on August 31, 2019 affecting Transaction API and Analytics and 1 more component, lasting 4h 38m. The incident has been resolved; the full update timeline is below.

Started: Aug 31, 2019, 04:25 PM UTC
Resolved: Aug 31, 2019, 09:03 PM UTC
Duration: 4h 38m
Detected by Pingoru: Aug 31, 2019, 04:25 PM UTC

Affected components

Transaction API, Analytics, Membership API, Admin Portal, portal.sticky.io, Third Party Integrations

Update timeline

identified Aug 31, 2019, 04:25 PM UTC

The LimeLight platform had partial connectivity outages from about 8:50am ET to 11:05am ET on Saturday, August 31, 2019. Our investigation traced the cause to Amazon AWS connectivity issues in a SINGLE Availability Zone. LimeLight services are redundant across 3 to 4 Availability Zones (separate data centers within the AWS US-EAST-1 Virginia Region). During this period we saw connection timeouts to various providers (payment gateways, chargeback services, fulfillment services, etc.).

Because of the redundancy across Availability Zones, with one impaired zone out of 3 to 4, we expect that you, as a client, may have seen issues/errors affecting about 25%-33% of your transactions during this 2 hour 15 minute timeframe. LimeLight will conduct a post mortem of the issue over the coming days to see how we can better react in an automated way to this type of single zone outage. As of now, all services appear to have been restored as of 11:05am ET.

Below are the status updates directly from AWS around the connectivity issue. They started reporting updates on the issue at 9:22am ET and confirmed the start of recovery at 11:06am ET.

6:22 AM PDT We are investigating connectivity issues affecting some instances in a single Availability Zone in the US-EAST-1 Region.

6:54 AM PDT We can confirm that some instances are impaired and some EBS volumes are experiencing degraded performance within a single Availability Zone in the US-EAST-1 Region. Some EC2 APIs are also experiencing increased error rates and latencies. We are working to resolve the issue.

7:37 AM PDT We can confirm that some instances are impaired and some EBS volumes are experiencing degraded performance within a single Availability Zone in the US-EAST-1 Region. We are investigating increased error rates for new launches within the same Availability Zone. We are working to resolve the issue.

8:06 AM PDT We are starting to see recovery for instance impairments and degraded EBS volume performance within a single Availability Zone in the US-EAST-1 Region. We are also starting to see recovery of EC2 APIs. We continue to work towards recovery for all affected EC2 instances and EBS volumes.

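
The 25%-33% estimate in this update follows from simple arithmetic: if traffic is spread evenly across N Availability Zones and one zone is impaired, roughly 1/N of requests are exposed. The sketch below is illustrative only. The even-distribution assumption and the function name `affected_fraction` are hypothetical and do not describe how LimeLight actually routes traffic.

```python
from fractions import Fraction


def affected_fraction(total_zones: int, impaired_zones: int = 1) -> Fraction:
    """Share of traffic exposed when `impaired_zones` of `total_zones` fail.

    Assumption (hypothetical): requests are spread evenly across zones.
    """
    if total_zones <= 0 or not 0 <= impaired_zones <= total_zones:
        raise ValueError("need total > 0 and 0 <= impaired <= total")
    return Fraction(impaired_zones, total_zones)


# One impaired zone out of 3 or 4, as described in the update.
for zones in (3, 4):
    share = affected_fraction(zones)
    print(f"{zones} zones, 1 impaired: about {float(share):.0%} of transactions")
```

With 3 zones this gives about 33%, and with 4 zones 25%, which matches the range quoted in the update.
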
monitoring Aug 31, 2019, 05:02 PM UTC

We are continuing to monitor the effects of this morning's partial AWS connectivity outage. We still see that all LimeLight services recovered by 11:05am ET on August 31, 2019, and we are experiencing NO system degradation at this time. The AWS status page still reports some ongoing instance recovery. Their latest update, posted at 12:04pm ET:

9:04 AM PDT Recovery is in progress for instance impairments and degraded EBS volume performance within a single Availability Zone in the US-EAST-1 Region. We continue to work towards recovery for all remaining affected instances and EBS volumes.

resolved Aug 31, 2019, 09:03 PM UTC

LimeLight is considering this morning's incident resolved. No further interruptions to LimeLight systems have occurred since our instances in the affected Availability Zone became available again at 11:05am ET on Saturday, August 31, 2019. In summary, the LimeLight platform had partial connectivity outages from about 8:50am ET to 11:05am ET on Saturday, August 31, 2019, affecting about 25%-33% of transactions during this 2 hour 15 minute window of degraded performance.

Below are the additional status updates directly from AWS around the connectivity issue (1:47pm ET and 4:30pm ET updates), in case you are interested in the root cause failures:

10:47 AM PDT We want to give you more information on progress at this point, and what we know about the event. At 4:33 AM PDT one of 10 datacenters in one of the 6 Availability Zones in the US-EAST-1 Region saw a failure of utility power. Backup generators came online immediately, but for reasons we are still investigating, began quickly failing at around 6:00 AM PDT. This resulted in 7.5% of all instances in that Availability Zone failing by 6:10 AM PDT. Over the last few hours we have recovered most instances but still have 1.5% of the instances in that Availability Zone remaining to be recovered. Similar impact existed to EBS and we continue to recover volumes within EBS. New instance launches in this zone continue to work without issue.

1:30 PM PDT At 4:33 AM PDT one of ten data centers in one of the six Availability Zones in the US-EAST-1 Region saw a failure of utility power. Our backup generators came online immediately but began failing at around 6:00 AM PDT. This impacted 7.5% of EC2 instances and EBS volumes in the Availability Zone. Power was fully restored to the impacted data center at 7:45 AM PDT. By 10:45 AM PDT, all but 1% of instances had been recovered, and by 12:30 PM PDT only 0.5% of instances remained impaired. Since the beginning of the impact, we have been working to recover the remaining instances and volumes. A small number of remaining instances and volumes are hosted on hardware which was adversely affected by the loss of power. We continue to work to recover all affected instances and volumes and will be communicating to the remaining impacted customers via the Personal Health Dashboard. For immediate recovery, we recommend replacing any remaining affected instances or volumes if possible.
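
The timeline mixes three clocks: AWS updates are stamped in PDT, LimeLight's notes use ET, and the status page headers use UTC. As a reading aid, here is a minimal sketch that lines them up. It assumes fixed offsets for August 31, 2019 (daylight saving time in effect, so PDT is UTC-7 and EDT is UTC-4); the helper name `pdt_to_et_and_utc` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Fixed offsets, valid for Aug 31, 2019 because US daylight saving time was in effect.
PDT = timezone(timedelta(hours=-7), "PDT")
EDT = timezone(timedelta(hours=-4), "EDT")


def pdt_to_et_and_utc(hour: int, minute: int) -> Tuple[str, str]:
    """Convert a PDT wall-clock time on Aug 31, 2019 to ET and UTC strings."""
    stamp = datetime(2019, 8, 31, hour, minute, tzinfo=PDT)
    return (
        stamp.astimezone(EDT).strftime("%H:%M ET"),
        stamp.astimezone(timezone.utc).strftime("%H:%M UTC"),
    )


# AWS update times quoted in the timeline (24-hour clock, PDT).
aws_updates = {"6:22 AM": (6, 22), "8:06 AM": (8, 6), "10:47 AM": (10, 47), "1:30 PM": (13, 30)}
for label, (hour, minute) in aws_updates.items():
    et, utc = pdt_to_et_and_utc(hour, minute)
    print(f"{label} PDT -> {et} / {utc}")
```

For example, the 6:22 AM PDT update corresponds to 9:22am ET, and the 1:30 PM PDT update to 4:30pm ET, matching the times LimeLight cites.
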