Elevio incident

API outage

Elevio experienced a major incident on April 9, 2018, lasting 19m. The incident has been resolved; the full update timeline is below.

Update timeline

identified Apr 09, 2018, 01:49 PM UTC

We're currently experiencing a server outage with our main API service, we are aware of the issue and working to resolve it. We will have this restored ASAP.
resolved Apr 09, 2018, 02:09 PM UTC

This incident has been resolved.
postmortem Jul 30, 2018, 08:40 PM UTC

An update was made to the API early on the weekend (US time) to gather more usage statistics. Things went well initially, but the infrastructure started to buckle under heavy load around 9 AM US EST. As soon as we were notified of the increased error rates by our monitoring systems, we immediately rolled back the change. Unfortunately, during this rollback process, the system subsequently suffered from [thundering herd](https://en.wikipedia.org/wiki/Thundering_herd_problem) due to a huge amount of failed requests being retried concurrently, even with autoscaling on. We decided to double the allowable capacity temporarily to handle the increased load. The system subsequently stabilized a few minutes after that.