Skeddly incident

Action execution failures

Skeddly experienced a major incident on June 13, 2018 affecting Action Infrastructure, lasting 1h 37m. The incident has been resolved; the full update timeline is below.

Started: Jun 13, 2018, 04:53 AM UTC
Resolved: Jun 13, 2018, 06:31 AM UTC
Duration: 1h 37m
Detected by Pingoru: Jun 13, 2018, 04:53 AM UTC

Affected components

Action Infrastructure

Update timeline

investigating Jun 13, 2018, 05:53 AM UTC

Action executions are ending prematurely. We are investigating.
identified Jun 13, 2018, 05:55 AM UTC

We have identified a problem with the clock on one of Skeddly's action execution workers. The worker was terminated to stop the problem.
monitoring Jun 13, 2018, 06:07 AM UTC

After terminating the offending EC2 instance, the time problem has been resolved. We are monitoring to ensure the problem is fully resolved.
resolved Jun 13, 2018, 06:31 AM UTC

The issue has been resolved. Action executions are executing normally again. We are applying our SLA to affected action executions.
postmortem Aug 02, 2018, 04:38 PM UTC

# The Problem At 04:42 UTC on 2018-06-13, action executions started failing with an error: ``` The retrieved credentials have already expired: Now = 06/13/2018 04:45:05, Credentials expiration = 06/13/2018 04:42:30 ``` This error message was coming from the AWS SDK that we use. We noticed that the message was coming from a single EC2 instance only. Once that EC2 instance attempted to execute an action, the above error message would be seen and the action would fail. However, the clock on the offending EC2 instance was correct. # Steps to Stop the Problem In order to prevent the errors from continuing, the offending EC2 instance was terminated at 05:42. Indeed, this did stop the problem. # Digging Deeper It's very hard to diagnose a terminated EC2 instance. However... The program stack traces at the time of failure indicated the problem was happening when Skeddly attempted to retrieve AWS credentials from the EC2 instance metadata. The credentials were already expired when they were retrieved from the metadata. Our worker EC2 instances use IAM Roles for EC2 Instances for AWS credentials. These credentials are used for a few purposes: 1. To "assume" to customer AWS IAM roles, 2. To upload logs to our S3 bucket, and 3. To upload and download intermediate reports during some of our report-generating actions. Looking at the actions that failed, most were using IAM roles to allow Skeddly to access customer AWS accounts. So that would fit #1. Our action execution logs are stored in S3. So parts of logs were unable to be uploaded from this EC2 instance. Bingo for #2. The remaining actions that failed that were using IAM user access keys failed while attempting to upload or download an intermediate report from S3. That's #3. # AWS Support After talking with AWS support, most likely the EC2 instance metadata was unreachable for some reason and the AWS SDK was caching the Skeddly credentials it received previously. Eventually those cached credentials expired. Since the EC2 instance was terminated, we are unable to diagnose further, but they mentioned to running the following via PowerShell next time: ``` get-date;net time ipconfig /all route print ``` That may shed some light on the issue. Hopefully we'll never need it. # Our SLA We are applying our SLA to this incident. All action failures that occurred between 04:42 UTC and 05:40 UTC with an "unknown failure" will have the SLA applied.