SchemeServe experienced a major incident on September 8, 2022 affecting 🎩 SchemeServe, 🔗 SchemeServe API, and 1 more component. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Sep 07, 2022, 04:29 PM UTC
We are currently investigating an issue where we are unable to connect to SchemeServe.
- investigating Sep 07, 2022, 04:36 PM UTC
The Microsoft Azure hosting platform used for hosting SchemeServe is currently offline.
- monitoring Sep 07, 2022, 04:56 PM UTC
Service to Microsoft Azure has been partially restored and SchemeServe is back online. We are continuing to monitor the situation. Please be aware that connectivity to SchemeServe may still be intermittent while all services recover.
- resolved Sep 08, 2022, 07:33 AM UTC
Full service of SchemeServe has been resumed. A full postmortem will be posted shortly.
- postmortem Sep 08, 2022, 12:49 PM UTC
**What went wrong, and why?**

SchemeServe uses the Azure Front Door (AFD) service to route all traffic to the relevant backend services. Between 16:10 and 16:45 UTC, Azure observed an unusual spike in traffic, during which the AFD service attempted to load balance traffic for optimal use and minimal latency for customers. In this instance, the load balancing that occurred during the traffic spike caused multiple environments managing this traffic to go offline. Azure has auto-mitigations that recover these environments in such an event: by design, the environments recover and, once they are in a healthy state, resume managing traffic. During this incident, however, retries from users and Azure systems exacerbated the situation: requests built up, and the build-up did not allow the environments time to recover fully.

**How did Azure respond?**

Azure manually intervened in the AFD load balancing process by expediting the auto-recovery system and performing more efficient load distribution in regions where there was a large build-up of traffic. Once the environments recovered, they were gradually brought back online to resume normal traffic management.
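
**Illustration: spacing out client retries**

The postmortem notes that retried requests added to the build-up and delayed recovery. As a general illustration only, and not a description of how SchemeServe or Azure handle retries, the sketch below shows one common way a client of an HTTP API can space out its retries: exponential backoff with random jitter. The function name `call_with_backoff`, its parameters, and the default values are all hypothetical choices made for this example.

```python
import random
import time


def call_with_backoff(request, max_attempts=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Call request() until it succeeds or max_attempts is reached.

    `request` is a hypothetical zero-argument callable that raises on failure.
    Before retry n (counting from 0) the client waits a random time between
    0 and min(cap, base * 2**n) seconds, so that many clients retrying at
    once do not all hit the service at the same moment.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:  # real code should catch only retryable errors
            if attempt == max_attempts - 1:
                raise  # retry budget spent: surface the last error
            ceiling = min(cap, base * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

Capping the number of attempts and randomising the delay are the two design choices that matter here: the cap bounds the extra load a single client can generate, and the jitter spreads retries from many clients over time instead of synchronising them.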