Ruvna experienced a critical incident on September 21, 2018, affecting Ruvna Web App, Ruvna Backend Infrastructure, and 1 more component, and lasting 59 minutes. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Sep 21, 2018, 02:19 AM UTC
Ruvna's primary DNS servers are experiencing significant delays in response time, causing many systems to be unreachable.
- monitoring Sep 21, 2018, 02:30 AM UTC
Service has been restored. We are monitoring the situation to ensure the issue is fully resolved.
- resolved Sep 21, 2018, 03:18 AM UTC
This incident has been resolved and service is fully restored.
- postmortem Sep 21, 2018, 03:20 AM UTC
#### **Thursday’s DNS Service Outage - Post Mortem**

On Thursday (9/20) evening at 10:04PM ET, Ruvna’s DNS host went down for approximately 19 minutes. During the outage, anyone accessing services located at ruvna.com (such as the Accountability Web Client or SFTP Sync Service) would have received a DNS error or a hung page. Anyone accessing services that rely on the ruvna.com network (such as the Ruvna iOS app) would have received a “No Connection” error.

We apologize for this incident and any challenges it may have caused. We recognize that you rely on Ruvna to be available at all times, since you never know when a crisis may happen.

**Root Cause**

The cause of the incident was an edge router failure within the network of our DNS hosting provider. DNS, or Domain Name System, is the foundational technology that translates the domain name in a URL typed into a browser into the IP address of a server that can handle and respond to the request. When the infrastructure that hosts DNS records for a given domain (a.k.a. “nameserver”) is unavailable, requests to that domain can’t be mapped to servers and therefore fail.

Normally, an edge router failure would be identified automatically so that a failover procedure could step in and assign another router to the task. On Thursday, the failover process didn’t kick off automatically, so the router failure became an outage for clients of the DNS provider, including Ruvna.

DNS responses are also cached for a period of time (the record’s TTL), which adds some protection against outages like this one, because cached lookups can be resolved without reaching a domain’s nameservers. Thursday’s outage lasted longer than the cache period of most of our DNS records, so once the cached copies of those records expired, Ruvna’s services became unreachable.

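
To illustrate the caching behavior described above, here is a minimal Python sketch of a resolver that answers from its cache until a record's TTL expires. It is a toy simulation, not Ruvna's or the provider's actual setup: the hostname `app.example.com`, the address `203.0.113.10`, the 300-second TTL, and the names `CachingResolver` and `simulate` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class CachedRecord:
    address: str
    expires_at: float  # seconds on the simulated clock


class CachingResolver:
    """Toy resolver: serves cached answers until the TTL expires."""

    def __init__(self, query_nameserver: Callable[[str], Tuple[str, int]]):
        self._query_nameserver = query_nameserver
        self._cache: Dict[str, CachedRecord] = {}

    def resolve(self, name: str, now: float) -> Optional[str]:
        record = self._cache.get(name)
        if record is not None and now < record.expires_at:
            # Cache hit: the nameserver is not contacted at all.
            return record.address
        try:
            address, ttl = self._query_nameserver(name)
        except ConnectionError:
            # Cache expired and the nameserver is unreachable: lookup fails.
            return None
        self._cache[name] = CachedRecord(address, now + ttl)
        return address


def simulate() -> List[Optional[str]]:
    state = {"up": True}

    def nameserver(name: str) -> Tuple[str, int]:
        if not state["up"]:
            raise ConnectionError("nameserver unreachable")
        # Hypothetical answer: documentation-range IP, 5-minute TTL.
        return ("203.0.113.10", 300)

    resolver = CachingResolver(nameserver)
    results = []
    results.append(resolver.resolve("app.example.com", now=0))     # fills the cache
    state["up"] = False                                            # outage begins
    results.append(resolver.resolve("app.example.com", now=120))   # still cached
    results.append(resolver.resolve("app.example.com", now=600))   # TTL expired
    state["up"] = True                                             # failover completes
    results.append(resolver.resolve("app.example.com", now=1400))  # resolves again
    return results


if __name__ == "__main__":
    print(simulate())
```

In this sketch, the lookup at 120 seconds still succeeds during the simulated outage because the record is cached, while the lookup at 600 seconds returns `None` because the TTL has passed and the nameserver cannot be reached. A longer TTL would ride out a longer outage, at the cost of slower propagation when records change.
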
Ruvna uses a separate provider to host our DNS records, independent of the rest of Ruvna’s infrastructure and servers, so that a major outage of either provider leaves our team with options for resuming service as quickly as possible. As a result, this incident had no impact on Ruvna’s internal networking (whose DNS is hosted with yet another provider), servers, or data, even though they were unreachable via the ruvna.com domain during the outage.

**Remediation and Prevention**

Ruvna’s engineers were notified as soon as our network became unreachable at 10:04PM. They quickly identified that the outage was not in Ruvna’s own infrastructure or network, but in our DNS provider’s physical hardware, which was preventing access to Ruvna’s infrastructure and network. We were in touch with the provider by 10:15PM to address the problem as quickly as possible. They confirmed the failover procedure had been completed by 10:26PM, at which point service was restored.

No amount of downtime is acceptable for us, just as it is unacceptable for you. Here’s what we’re doing to prevent incidents like this from occurring in the future:

* We feel this incident could have been prevented by the provider in question with better failover and high-availability strategies. As such, on September 28, we completed the process of migrating our primary DNS hosting to a provider with significantly stronger infrastructure. Our nameservers are now replicated across the globe to ensure Ruvna’s constant availability.