Iron.io experienced a minor incident on August 6, 2018, affecting IronMQ v3 (AWS US-East), IronWorker Dedicated, and 1 more component, lasting 3h 9m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Aug 06, 2018, 02:14 PM UTC
We are currently investigating this issue.
- investigating Aug 06, 2018, 04:15 PM UTC
We are continuing to investigate this issue.
- resolved Aug 06, 2018, 05:23 PM UTC
This incident has been resolved.
- postmortem Aug 08, 2018, 12:23 AM UTC
**Overview**

On August 6th, at 15:07 UTC, we noticed connectivity issues across our network. These issues caused IronMQ to degrade into an unhealthy state, which rendered the service unusable.

**What went wrong**

At 12:49 AM PDT, the vendor we rely on for DNS (AWS Route 53) experienced issues. In-network connectivity broke, and many components of our network were unable to communicate with each other. When the vendor issue was resolved at 1:04 AM PDT, the problem persisted within our network because of caching and TTL issues.

**What we're doing to prevent this from happening again**

* We identified the places within our network that could have caused this issue and reviewed their caching strategies and TTL values. Multiple cache times were too aggressive, and we've increased timeouts in the necessary places. We're testing various failure scenarios within our staging network to confirm the validity of these timeout values.
* We're currently discussing backup DNS strategies as a team and will post updates on our blog about our strategy moving forward and our continued progress.

**Resolution time**

The incident was resolved at 16:04 UTC.
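
**Illustrative sketch: cache TTLs during a resolver outage**

The postmortem attributes the lingering failure to caching and TTL settings but does not describe the actual configuration. As a rough illustration only, the sketch below shows one common way a client-side DNS cache can ride out a short resolver outage: entries are treated as fresh for `ttl` seconds, and a stale entry may still be served for up to `max_stale` seconds when the upstream lookup fails. This is an assumed technique chosen for the example, not Iron.io's implementation; the class name `ResolverCache` and its parameters are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class ResolverCache:
    """Hypothetical DNS cache: fresh for `ttl` seconds, stale fallback for `max_stale` more."""

    def __init__(
        self,
        resolve: Callable[[str], str],
        ttl: float,
        max_stale: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve      # upstream lookup; expected to raise OSError on failure
        self._ttl = ttl              # seconds an entry is served without re-resolving
        self._max_stale = max_stale  # extra seconds a stale entry may be served on failure
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # name -> (address, stored_at)

    def lookup(self, name: str) -> str:
        now = self._clock()
        entry = self._entries.get(name)

        # Fresh entry: answer from cache without touching the resolver.
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]

        try:
            address = self._resolve(name)
        except OSError:
            # Resolver unavailable: fall back to a stale entry if it is not too old.
            if entry is not None and now - entry[1] < self._ttl + self._max_stale:
                return entry[0]
            raise

        self._entries[name] = (address, now)
        return address
```

In practice the `resolve` argument could be something like `socket.gethostbyname`, whose lookup failures are raised as `socket.gaierror`, a subclass of `OSError`. The trade-off is the one the postmortem hints at: a short `ttl` picks up address changes quickly but leaves less margin when the DNS provider is unreachable, while a longer `max_stale` window keeps services talking to each other at the cost of possibly using outdated addresses.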