Doppler incident

DNS resolution failure

Critical · Resolved

Doppler experienced a critical incident on July 17, 2020 affecting the API (api.doppler.com), the Dashboard (dashboard.doppler.com), and the marketing site (doppler.com), lasting 45 minutes. The incident has been resolved; the full update timeline is below.

Started
Jul 17, 2020, 09:20 PM UTC
Resolved
Jul 17, 2020, 10:05 PM UTC
Duration
45m
Detected by Pingoru
Jul 17, 2020, 09:20 PM UTC

Affected components

API (api.doppler.com), Dashboard (dashboard.doppler.com), Marketing site (doppler.com)

Update timeline

  1. investigating Jul 17, 2020, 09:20 PM UTC

    DNS resolution is currently failing for doppler.com and all subdomains. We believe this is a Cloudflare outage but are investigating.

  2. investigating Jul 17, 2020, 09:29 PM UTC

    We are continuing to investigate this issue.

  3. monitoring Jul 17, 2020, 09:37 PM UTC

    DNS resolution appears to be operating normally again. We are still monitoring.

  4. monitoring Jul 17, 2020, 09:41 PM UTC

    We are continuing to monitor for any further issues.

  5. resolved Jul 17, 2020, 10:05 PM UTC

    DNS resolution has resumed normal operation. A proper postmortem will follow.

  6. postmortem Jul 17, 2020, 10:17 PM UTC

    Today at 2:15pm Pacific we experienced our second outage in the last two weeks. Sadly, the timing is not great, but it did give us the opportunity to re-evaluate our failure points as we continue to harden our infrastructure. This outage affected all of our endpoints, from our production and failover infrastructure to the documentation hub and status page, but did not result in any data loss.

    ## What Happened?

    Doppler uses [Cloudflare](https://cloudflare.com) as our DNS provider, which provides a suite of powerful features including DDoS protection, a CDN for assets, firewall rules, edge workers, and plenty of others. They are one of the most popular and trusted DNS providers, supporting nearly 20% of all internet traffic. Today they went down, which brought down a portion of the global internet with them.

    Cloudflare recommends using their DNS proxy so you can benefit from their suite of features. As we were reminded today, using their proxy changes the landscape of the default protections DNS provides, which results in a nonobvious cost. DNS by its very nature is decentralized, which creates a layer of resilience against being a single point of failure. But this assumption of protection breaks down when you use a proxy at the DNS layer, as now you have a new single point of failure. Today we all paid that nonobvious cost.

    ## Moving Forward

    **Hardening Our DNS Reliability**

    Internally we are tracking the best path forward towards hardening our DNS’s reliability. This can come in a couple of different forms. One is disabling proxy mode for our DNS records. This would remove our DNS layer as a single point of failure, but it comes at the cost of losing DDoS protection, our records no longer being masked, and the loss of some other behind-the-scenes magic. Another possible option would be to add an additional DNS provider (that supports DDoS protection) to our stack. Then, if one provider goes down, our traffic would automatically fail over to the other. This would add a fair amount of complexity to our stack.

    Sadly, all solutions thought of so far have tradeoffs and could have nonobvious consequences. We care deeply about finding the right answer, not the fastest one to implement. As we continue to explore and implement solutions, we expect to write about our findings and decisions on our [engineering blog](https://doppler.com/blog).

    **Customer Observability**

    Being transparent is core to the DNA of the company, and we strive to provide our customers with observability during outages in real time. We do this through our [@DopplerHelp](https://twitter.com/DopplerHelp) Twitter account and our [status page](https://status.doppler.com). Because our status page’s DNS is hosted by Cloudflare, it was also affected by the outage. To prevent this in the future, we are moving our status page’s DNS to another provider and will use a new dedicated domain. This domain is still being configured and will be announced soon.

    **Doppler CLI**

    The Doppler CLI has a nifty command called `doppler run`, which downloads your secrets from our API and then injects them into your application. After each successful run, we automatically create and store an encrypted snapshot of the secrets for you. On the off chance the CLI is unable to connect to our API, we smartly fall back to this encrypted snapshot after 5 retries. During the outage, our `doppler run` users were unaffected, as they had an existing snapshot to fall back to.

    One area we found that could use a little love is how we surface a retry event. If a request hangs, it can create a visible delay for the user. In the next release, the Doppler CLI will print a message stating that a retry is happening so you always stay informed.

    ## Wrapping Up

    Providing a seamless experience with near-perfect uptime is an incredibly difficult task that requires deep thought about every layer in the stack. Today we were reminded that our DNS is a single point of failure and that even the most trusted of services, like Cloudflare, can bring us down if we don’t have multiple layers of redundancy. As we continue to harden our infrastructure, we plan to share our learnings with you through our [engineering blog](https://doppler.com/blog).
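
As an illustration of the `doppler run` behavior described in the postmortem (retry the API, announce each retry, then fall back to the stored snapshot), here is a minimal Python sketch. It is not the Doppler CLI's actual implementation. The names `fetch_with_fallback`, `fetch`, `load_snapshot`, and `save_snapshot` are hypothetical, the callbacks are assumed to handle snapshot encryption and decryption themselves, and reading "after 5 retries" as one initial attempt plus five retries is an assumption.

```python
import time
from typing import Callable, Dict

Secrets = Dict[str, str]


def fetch_with_fallback(
    fetch: Callable[[], Secrets],               # hypothetical: calls the secrets API
    load_snapshot: Callable[[], Secrets],       # hypothetical: reads and decrypts the snapshot
    save_snapshot: Callable[[Secrets], None],   # hypothetical: encrypts and stores the snapshot
    max_retries: int = 5,
    delay_seconds: float = 0.0,
    log: Callable[[str], None] = print,
) -> Secrets:
    """Return fresh secrets if the API is reachable, else the last snapshot."""
    attempts = max_retries + 1  # assumption: one initial attempt plus the retries
    for attempt in range(1, attempts + 1):
        try:
            secrets = fetch()
        except OSError as err:  # DNS and connection failures are OSError subclasses
            if attempt < attempts:
                # Tell the user a retry is happening instead of hanging silently.
                log(f"request failed ({err}); retry {attempt} of {max_retries}")
                time.sleep(delay_seconds)
            continue
        # Success: refresh the snapshot so a later outage has something to use.
        save_snapshot(secrets)
        return secrets
    log("API unreachable; using the last stored snapshot")
    return load_snapshot()
```

Passing the network call and snapshot storage in as callbacks keeps the retry policy separate from transport and encryption details, which also makes the fallback path easy to exercise without a real outage.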