Skeddly incident

Issues with front-end and action executions

Skeddly experienced a major incident on February 28, 2017, lasting 5h. The incident has been resolved; the full update timeline is below.

Started: Feb 28, 2017, 05:50 PM UTC
Resolved: Feb 28, 2017, 10:50 PM UTC
Duration: 5h
Detected by Pingoru: Feb 28, 2017, 05:50 PM UTC

Update timeline

investigating Feb 28, 2017, 05:50 PM UTC

We are investigating issues accessing the Skeddly front-end and action executions.
investigating Feb 28, 2017, 05:57 PM UTC

AWS is currently experiencing issues in us-east-1 with S3, EC2, and ELB.
identified Feb 28, 2017, 06:02 PM UTC

Amazon Web Services has updated its status page with the current status: http://status.aws.amazon.com/
identified Feb 28, 2017, 06:52 PM UTC

The Skeddly front-end is functioning again. Action execution logs are stored in S3, so they cannot be retrieved at this time. Actions are executing, but they are behind schedule.
identified Feb 28, 2017, 07:53 PM UTC

Amazon has updated the status page with the services affected by the outage: http://status.aws.amazon.com/ As of this writing, 24 AWS services are affected, including S3, EC2, RDS, Elastic Beanstalk, CloudWatch, ELB, and SES.
identified Feb 28, 2017, 09:03 PM UTC

AWS reports that the issues with S3 are partially resolved. We are starting to see reductions in the delays of action executions as a result.
monitoring Feb 28, 2017, 10:06 PM UTC

We are noticing that writing logs to S3 has resumed and action execution delays have been eliminated. AWS still lists S3 and other services as having issues on its status page: http://status.aws.amazon.com/ We will continue to monitor the situation.
resolved Feb 28, 2017, 10:50 PM UTC

AWS is reporting the S3 issues have been resolved. http://status.aws.amazon.com/
postmortem Aug 02, 2018, 04:38 PM UTC

Between 12:45 PM EST and 5:08 PM EST, Amazon S3 experienced a significant outage in us-east-1. API calls to S3 for reads and writes failed. As a result, many AWS services were affected. Some other AWS services that were affected:

* Elastic Load Balancing
* Elastic Beanstalk
* RDS
* EBS snapshots
* AMI images
* Simple Email Service
* Many others, totalling over 45 distinct AWS services.

We first encountered issues reading from and writing to S3. In addition, our internal API load balancers, which use Amazon Elastic Load Balancing, dropped HTTPS requests. Together, these problems affected our front-end UI. The issues affecting our front-end UI were short-lived, and it recovered quickly.

However, the S3 issues continued to affect action executions. Logs from action executions are stored in Amazon S3. Because API calls to S3 failed or timed out, action executions were delayed. At its peak, the delay reached 91 minutes.

The net result of this AWS outage is as follows:

* Actions executed, but action executions during the affected time window were delayed. Our [SLA](https://www.skeddly.com/sla/) was automatically applied as appropriate.
* Actions that made use of affected AWS services in us-east-1 failed. For example, an action that created EBS snapshots or AMI images would have failed.
* Actions that executed during the affected time window will not have full logs available, since the logs could not be saved to S3.

For many, this outage, like all outages, is a learning opportunity. We will apply what we learned so that Skeddly will weather future issues and outages even better.
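
The postmortem gives the S3 outage window in EST, while the update timeline uses UTC. The following is a minimal sketch of the conversion, assuming that "EST" means a fixed UTC-5 offset and that both times fall on February 28, 2017; the helper name `est_to_utc` is hypothetical and used only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "EST" in the postmortem is a fixed UTC-5 offset.
EST = timezone(timedelta(hours=-5), "EST")


def est_to_utc(hour, minute):
    """Convert a wall-clock time on Feb 28, 2017 (EST) to UTC."""
    local = datetime(2017, 2, 28, hour, minute, tzinfo=EST)
    return local.astimezone(timezone.utc)


outage_start = est_to_utc(12, 45)  # 12:45 PM EST
outage_end = est_to_utc(17, 8)     # 5:08 PM EST

print(outage_start.strftime("%H:%M UTC"))  # 17:45 UTC
print(outage_end.strftime("%H:%M UTC"))    # 22:08 UTC
print(outage_end - outage_start)           # 4:23:00
```

Under this assumption, the S3 outage window began about five minutes before the first status update (05:50 PM UTC) and ended about 42 minutes before the incident was marked resolved (10:50 PM UTC).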