Iron.io incident

AWS S3 is down, causing job processing issues. Please stand by while we try to reroute around it.

Iron.io experienced a major incident on February 28, 2017 affecting IronWorker Dedicated and IronWorker Public, lasting 5h 19m. The incident has been resolved; the full update timeline is below.

Started: Feb 28, 2017, 05:57 PM UTC
Resolved: Feb 28, 2017, 11:16 PM UTC
Duration: 5h 19m
Detected by Pingoru: Feb 28, 2017, 05:57 PM UTC

Affected components

IronWorker Dedicated
IronWorker Public

Update timeline

identified Feb 28, 2017, 05:57 PM UTC

The issue has been identified and a fix is being implemented.
identified Feb 28, 2017, 05:57 PM UTC

https://news.ycombinator.com/item?id=13755673
identified Feb 28, 2017, 06:08 PM UTC

S3 issues have been reported in US-East: https://status.aws.amazon.com/ We are trying to bypass the S3 service completely. We will build around it in the future.
identified Feb 28, 2017, 08:02 PM UTC

We are considering bypassing S3, but even then, Docker Hub is down, which would block any upstream updates of code packages, as they are all built with Docker.
identified Feb 28, 2017, 08:48 PM UTC

Unfortunately, the issue has now cascaded to over 45 AWS services, causing unrecoverable issues upstream. At this point, we have to wait on AWS and then begin a fully multi-cloud initiative.
identified Feb 28, 2017, 08:58 PM UTC

Update from AWS (we are working to restore our own services as quickly as possible as well): "Update at 12:52 PM PST: We are seeing recovery for S3 object retrievals, listing and deletions. We continue to work on recovery for adding new objects to S3 and expect to start seeing improved error rates within the hour."
identified Feb 28, 2017, 08:59 PM UTC

We see jobs going through again. None should be lost, but jobs will have been queued up since the issues started this morning.
identified Feb 28, 2017, 09:32 PM UTC

We are now seeing recovery of IronWorker and are working through the backlog of jobs.
monitoring Feb 28, 2017, 09:54 PM UTC

Job processing is almost fully up to speed again. It may take a while to get through the backlog of jobs.
resolved Feb 28, 2017, 11:16 PM UTC

The system has now almost fully caught up. We're continuing to scan for any residual jobs that may not have run, but all should have run or be queued up to run shortly. Thank you for your patience as AWS recovered their core services. We will be evaluating options for running core IaaS outside of AWS.