Elevio incident

API outage

Major Resolved View vendor source →

Elevio experienced a major incident on April 9, 2018, lasting 19m. The incident has been resolved; the full update timeline is below.

Started
Apr 09, 2018, 01:49 PM UTC
Resolved
Apr 09, 2018, 02:09 PM UTC
Duration
19m
Detected by Pingoru
Apr 09, 2018, 01:49 PM UTC

Update timeline

  1. identified Apr 09, 2018, 01:49 PM UTC

    We're currently experiencing a server outage with our main API service, we are aware of the issue and working to resolve it. We will have this restored ASAP.

  2. resolved Apr 09, 2018, 02:09 PM UTC

    This incident has been resolved.

  3. postmortem Jul 30, 2018, 08:40 PM UTC

    An update was made to the API early on the weekend (US time) to gather more usage statistics. Things went well initially, but the infrastructure started to buckle under heavy load around 9 AM US EST. As soon as we were notified of the increased error rates by our monitoring systems, we immediately rolled back the change. Unfortunately, during this rollback process, the system subsequently suffered from [thundering herd](https://en.wikipedia.org/wiki/Thundering_herd_problem) due to a huge amount of failed requests being retried concurrently, even with autoscaling on. We decided to double the allowable capacity temporarily to handle the increased load. The system subsequently stabilized a few minutes after that.