Digital Pigeon incident

Ongoing issues resolving DNS queries to Amazon S3 services affecting a small number of users

Major Resolved

Digital Pigeon experienced a major incident on October 23, 2019 affecting the Oceania, USA West, South East Asia, and Europe File Servers, lasting 16h 46m. The incident has been resolved; the full update timeline is below.

Started
Oct 23, 2019, 04:11 AM UTC
Resolved
Oct 23, 2019, 08:57 PM UTC
Duration
16h 46m
Detected by Pingoru
Oct 23, 2019, 04:11 AM UTC

Affected components

Oceania File Servers
USA West File Servers
South East Asia File Servers
Europe File Servers

Update timeline

  1. monitoring Oct 23, 2019, 04:11 AM UTC

    Continuing from the previous incident (https://digitalpigeon.statuspage.io/incidents/r886w5d2t611), which was prematurely closed based on feedback from Amazon AWS. We are continuing to work with Amazon AWS to resolve the issue.

    A minority of users report that uploads, downloads, and previews are failing. Investigation suggests that the affected users are experiencing issues due to problems resolving Domain Name System* (DNS) queries to Amazon S3 services (our hosting provider). DNS is a core component of the internet and is usually provided and configured automatically by your ISP when you connect to the internet. Unfortunately, being such a low-level part of the internet puts it completely outside of our control.

    As a temporary workaround, we recommend switching to either the Cloudflare or Google DNS servers, which are confirmed working (and are generally faster than those provided by your ISP).

    > Cloudflare MacOS: https://developers.cloudflare.com/1.1.1.1/setting-up-1.1.1.1/mac/
    > Cloudflare Windows: https://developers.cloudflare.com/1.1.1.1/setting-up-1.1.1.1/windows/
    > Google MacOS: https://developers.google.com/speed/public-dns/docs/using#mac_os
    > Google Windows: https://developers.google.com/speed/public-dns/docs/using#windows

    * DNS converts human-readable host names (e.g. digitalpigeon-dp-us-west.s3.amazonaws.com) into a series of numbers that computers understand. Whenever your computer attempts to make a connection on the internet, the first thing it does is contact your configured DNS server and attempt to resolve the host name into an actual network address. If this step fails, which is what's happening in this case, the connection will fail.

  2. monitoring Oct 23, 2019, 05:23 AM UTC

    Amazon AWS reports that their service teams are actively investigating the issue. Unfortunately, due to the complexity of the issue, they are unable to provide an ETA. To get a sense of that complexity, check out: https://www.whatsmydns.net/#A/digitalpigeon-dp-us-west.s3.amazonaws.com

  3. monitoring Oct 23, 2019, 07:27 AM UTC

    Amazon AWS is continuing to implement fixes to mitigate the issue. Amazon has also suggested a code change to help mitigate the problem. We are currently implementing and testing that change and hope to have it deployed within the hour.

  4. monitoring Oct 23, 2019, 08:04 AM UTC

    We are deploying the code change suggested by Amazon AWS to help mitigate the issue.

  5. monitoring Oct 23, 2019, 08:34 AM UTC

    We have deployed the code change as suggested by Amazon AWS. The change adds extra region information to all our Amazon S3 requests, which according to Amazon will help mitigate the issue.

    We still recommend switching to either the Cloudflare or Google DNS servers, which are confirmed working, are generally faster, and appear to be a lot more reliable than those provided by your ISP. For more info on Cloudflare's DNS service check out https://www.cloudflare.com/learning/dns/what-is-1.1.1.1/. For setup instructions see:

    > Cloudflare MacOS: https://developers.cloudflare.com/1.1.1.1/setting-up-1.1.1.1/mac/
    > Cloudflare Windows: https://developers.cloudflare.com/1.1.1.1/setting-up-1.1.1.1/windows/
    > Google MacOS: https://developers.google.com/speed/public-dns/docs/using#mac_os
    > Google Windows: https://developers.google.com/speed/public-dns/docs/using#windows

    We will continue to provide updates as we get info from Amazon AWS.

  6. monitoring Oct 23, 2019, 10:32 AM UTC

    The code-level change we made, as suggested by Amazon AWS, appears to have mitigated the issue. However, Amazon AWS is still reporting that the issue is under investigation. We are considering all systems operational at this point, but we will continue to monitor the situation and provide updates. If you are still experiencing issues, please let us know.

  7. resolved Oct 23, 2019, 08:57 PM UTC

    We are considering this issue resolved as we have had no new reports of problems in the last 12 hours.
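
For readers who want to check whether their own DNS setup is affected, the resolution step described in the footnote of update 1 can be tried directly. The following is a rough sketch, not Digital Pigeon's code. It assumes that the "extra region information" mentioned in update 5 means a region-qualified S3 hostname, which the updates do not confirm, and the region `us-west-2` is a guess based only on the bucket name. The function names are hypothetical.

```python
import socket
from typing import Callable, List, Tuple


def regional_s3_host(bucket: str, region: str) -> str:
    """Build a region-qualified, virtual-hosted-style S3 hostname."""
    return f"{bucket}.s3.{region}.amazonaws.com"


def resolve(hostname: str,
            resolver: Callable = socket.getaddrinfo) -> Tuple[bool, List[str]]:
    """Return (ok, addresses) using the DNS server the system is configured with."""
    try:
        results = resolver(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Name resolution failed: the failure mode described in update 1.
        return False, []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return True, sorted({info[4][0] for info in results})


def compare_hosts() -> None:
    """Print resolution results for the global and region-qualified hostnames."""
    bucket = "digitalpigeon-dp-us-west"  # taken from the hostname quoted in update 1
    region = "us-west-2"                 # ASSUMPTION: the region is not stated in the updates
    for host in (f"{bucket}.s3.amazonaws.com", regional_s3_host(bucket, region)):
        ok, addresses = resolve(host)
        print(host, "->", addresses if ok else "resolution failed")
```

Calling `compare_hosts()` before and after switching to the Cloudflare or Google DNS servers gives a quick indication of whether the configured resolver is the problem. A "resolution failed" line for either hostname points at DNS rather than at Digital Pigeon itself.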