Iron.io incident

AWS S3 is down, causing job processing issues. Please stand by while we try to reroute around it.

Major, Resolved

Iron.io experienced a major incident on February 28, 2017 affecting IronWorker Dedicated and IronWorker Public, lasting 5h 19m. The incident has been resolved; the full update timeline is below.

Started
Feb 28, 2017, 05:57 PM UTC
Resolved
Feb 28, 2017, 11:16 PM UTC
Duration
5h 19m
Detected by Pingoru
Feb 28, 2017, 05:57 PM UTC

Affected components

IronWorker Dedicated, IronWorker Public

Update timeline

  1. identified Feb 28, 2017, 05:57 PM UTC

    The issue has been identified and a fix is being implemented.

  2. identified Feb 28, 2017, 05:57 PM UTC

    https://news.ycombinator.com/item?id=13755673

  3. identified Feb 28, 2017, 06:08 PM UTC

    AWS has reported S3 issues in US-East: https://status.aws.amazon.com/ We are trying to bypass their S3 service completely, and we will build around it in the future.

  4. identified Feb 28, 2017, 08:02 PM UTC

    We are considering bypassing S3, but even then, Docker Hub is down, which would block any upstream updates of code packages, as they are all built with Docker.

  5. identified Feb 28, 2017, 08:48 PM UTC

    Unfortunately, the issue has now cascaded to over 45 AWS services, causing unrecoverable issues upstream. At this point, we have to wait on AWS and then begin a fully multi-cloud initiative.

  6. identified Feb 28, 2017, 08:58 PM UTC

    We are trying to restore our services as quickly as possible as well. Update from AWS at 12:52 PM PST: "We are seeing recovery for S3 object retrievals, listing and deletions. We continue to work on recovery for adding new objects to S3 and expect to start seeing improved error rates within the hour."

  7. identified Feb 28, 2017, 08:59 PM UTC

    We see jobs going through again. None should be lost, but jobs have been queuing up since the issues started this morning.

  8. identified Feb 28, 2017, 09:32 PM UTC

    We are now seeing recovery of IronWorker and are working through the backlog of jobs.

  9. monitoring Feb 28, 2017, 09:54 PM UTC

    Job processing is almost fully up to speed again. It may take a while to get through the backlog of jobs.

  10. resolved Feb 28, 2017, 11:16 PM UTC

    The system has now almost fully caught up. We're continuing to scan for any residual jobs that may not have run, but all should have run or be queued up to run shortly. Thank you for your patience as AWS recovered their core services. We will be evaluating options for running core IaaS outside of AWS.