Ruvna incident

Web App Outage

Critical · Resolved

Ruvna experienced a critical incident on September 8, 2020, affecting the Ruvna Web App and lasting 1h 54m. The incident has been resolved; the full update timeline is below.

Started
Sep 08, 2020, 11:07 AM UTC
Resolved
Sep 08, 2020, 01:01 PM UTC
Duration
1h 54m
Detected by Pingoru
Sep 08, 2020, 11:07 AM UTC

Affected components

Ruvna Web App

Update timeline

  1. investigating Sep 08, 2020, 11:07 AM UTC

    We are currently investigating this issue.

  2. investigating Sep 08, 2020, 12:04 PM UTC

    We are continuing to investigate the major web client outage and work on potential solutions. We will provide updates here as they become available.

  3. monitoring Sep 08, 2020, 12:40 PM UTC

    A solution is being deployed to restore access to the web app. Access may still be spotty and errors may continue over the next 1-2 hours as the situation develops, but most users should now be able to connect.

  4. monitoring Sep 08, 2020, 12:47 PM UTC

    We are continuing to monitor for any further issues.

  5. resolved Sep 08, 2020, 01:01 PM UTC

    The incident should now be resolved and traffic to the web app is no longer failing. We are continuing to monitor the situation closely.

  6. postmortem Sep 08, 2020, 05:26 PM UTC

    This morning, Ruvna experienced a significant outage. We largely restored platform functionality by 8:24 AM ET and service has been operational since. The cause of the disruption was an issue in our high availability strategy, which was magnified by an extreme spike in web traffic. This resulted in a backlog of queued requests that eventually brought the platform down for roughly 2 hours.

    As an organization, there is nothing more important than our responsibility to assist your communities in safely returning to school, and we know that you are relying on us to do just that. This morning, we let you down. We deeply regret this incident and sincerely apologize. I'm personally disappointed that this happened, and I am incredibly sorry for the confusion and frustration we likely caused.

    Ruvna's infrastructure is built for high availability. However, like many education technology platforms right now, we are experiencing an increase in traffic with back-to-school unlike any time in our history. I'd like to share with you some details of what happened this morning, as well as the steps we are taking to ensure this never happens again.

    **What Happened - Technical Details**

    Today, the load balancer responsible for relaying requests for static assets to available servers detected a spike in traffic and (correctly) stopped sending new requests to servers for processing. At that point, the load balancer began placing new requests into a queue of unhandled requests. This queue then became saturated as well, causing additional requests to fail or hang. We believe the load balancer should have resumed sending traffic to the Nginx servers automatically shortly after the spike in traffic subsided, and we are still investigating what prevented this from happening.

    Because this issue was at the load balancer level, in the part of our infrastructure normally believed to be the least sensitive to spikes in traffic, our automated fault detection systems did not identify or resolve it automatically. This also prevented our team from resolving the issue as quickly as we wanted to, since we needed to take the load balancer completely offline in order to clear the backlog of queued requests. As the saying goes, a chain is only as strong as its weakest link. Today, the weakest link in our chain became abundantly clear.

    **Resolution and Moving Forward**

    Around 7:15 AM ET, our team began preparing to shift web traffic to a fully managed Content Delivery Network (CDN). CDNs offer distributed, load-tolerant request handling when the requested resources are static files such as HTML, CSS, JavaScript, and images. Configuration completed around 8:24 AM ET, at which point we began routing traffic to the CDN rather than the problematic load balancer. This shift also allowed the problematic load balancer to clear its backlog of queued requests, so service was restored almost immediately. Even though the load balancer began handling requests normally again once the backlog was cleared, these requests will continue to be served by the CDN going forward.

    We previously felt the increased control we gained by not running web traffic through a CDN was worth the potential risks, especially since we had never encountered a scenario where traffic caused the kind of outage we experienced today. Obviously, today's events highlight the limitations of our old strategy. A CDN will let us handle a virtually unlimited number of concurrent web requests by caching resources across hundreds of servers throughout the country, and will prevent the issue that took us down today from happening in the future.
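For illustration only, the failure mode described in the postmortem (a bounded request queue at the load balancer filling faster than fixed-capacity backends can drain it) can be sketched as a toy model. The queue size, drain rate, and arrival rate below are hypothetical and do not reflect Ruvna's actual load balancer configuration.

```python
# Toy model of the failure mode: a load balancer with a bounded queue in
# front of backends that drain requests more slowly than they arrive.
# All values here are hypothetical, not Ruvna's real configuration.
from collections import deque

QUEUE_CAPACITY = 1_000   # hypothetical cap on queued, unhandled requests
DRAIN_PER_TICK = 200     # requests the backends can process per tick
ARRIVALS_PER_TICK = 500  # incoming requests per tick during the spike


def simulate(ticks: int) -> None:
    queue = deque()
    rejected = 0
    for tick in range(ticks):
        # New requests arrive; once the queue is full, they fail or hang.
        for _ in range(ARRIVALS_PER_TICK):
            if len(queue) < QUEUE_CAPACITY:
                queue.append(tick)
            else:
                rejected += 1
        # Backends drain at a fixed rate, slower than the spike.
        for _ in range(min(DRAIN_PER_TICK, len(queue))):
            queue.popleft()
        print(f"tick {tick:2d}: queued={len(queue):4d}  rejected={rejected}")


if __name__ == "__main__":
    simulate(ticks=10)
```

Once the queue hits its cap, every additional request during the spike is dropped or left hanging, which mirrors why clearing the backlog, and offloading static assets to a CDN going forward, restored service.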