FMX incident

Application outage

FMX experienced a critical incident on August 26, 2020 affecting the Web App, API, Email, and Reporting Dashboards components, lasting 23h 11m. The incident has been resolved; the full update timeline is below.

Started: Aug 26, 2020, 04:00 PM UTC
Resolved: Aug 27, 2020, 03:11 PM UTC
Duration: 23h 11m
Detected by Pingoru: Aug 26, 2020, 04:00 PM UTC

Affected components

Web App, API, Email, Reporting Dashboards

Update timeline

investigating Aug 27, 2020, 11:19 AM UTC

We are currently investigating this issue. We greatly apologize for the inconvenience!

identified Aug 27, 2020, 12:26 PM UTC

The issue has been identified and a fix is being implemented. We expect a return of service in 1 hour. We greatly apologize for the inconvenience.

monitoring Aug 27, 2020, 12:38 PM UTC

A fix has been implemented and we're monitoring the results. Service should now be returning.

resolved Aug 27, 2020, 03:11 PM UTC

This incident has been resolved. We greatly apologize for the inconvenience!

postmortem Aug 31, 2020, 03:01 PM UTC

We want to thank you for using FMX to manage your operations. We understand that FMX is often a critical component of running your organization, and we therefore take any service disruption seriously. This postmortem report will help you better understand what caused the interruption in service and how we plan to avoid issues like this in the future.

**Root cause of outage:** On 8-26 we discovered a high-priority bug in our application and determined that it would be disruptive enough to our users that we should deploy a mid-week fix. On the evening of 8-26, we attempted to deploy this fix twice and were unsuccessful both times. We later determined that a problematic change had been included in this deployment. During the second deployment attempt, a rare set of circumstances followed: Windows Update restarted the build server mid-deployment, and Azure, our hosting service, continued the deployment in the background without our knowledge. As designed, Azure initially refused to deploy the broken app to live servers due to health probe failures (a protective measure to prevent deploying bad code). After an hour of health probe failures, having received no additional cancellation attempts from us, Azure deployed the broken application, again according to its design.

**Contributing factors:** We rely on alerts to inform us of application and feature outages outside of our core hours of 8 am to 6 pm EDT. While we received alerts, they were delivered by email and instant message. This, coupled with the overnight timing of the outage, delayed our response.

**Future mitigations:** As a result of this issue, we are taking the following actions:

* We are setting up a tiered, automated calling service that is triggered when alerts are issued, so that team members are made aware of them.
* When we cancel a deployment, we will now verify that the corresponding cloud service deployment is actually cancelled, to protect against this rare set of circumstances in the future.

Once more, we deeply apologize for this outage, and we will be taking steps to ensure it does not happen again.

Regards,
FMX Team
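
The second mitigation in the postmortem, confirming that a cancelled deployment has actually stopped, could be sketched roughly as below. This is an illustration only and not FMX's tooling: `confirm_cancelled`, the `get_status` callback, and the status strings are all hypothetical, and a real check would query whatever deployment-status call the hosting provider exposes. The idea is to treat a cancel as unconfirmed until the remote side reports it, and to escalate on anything else.

```python
import time
from typing import Callable

# Hypothetical status names; a real deployment API will use its own values.
CANCELLED = {"cancelled"}
STILL_ACTIVE = {"pending", "running", "cancelling"}


def confirm_cancelled(
    get_status: Callable[[], str],
    timeout_s: float = 600.0,
    poll_interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll a deployment until the remote side confirms it was cancelled.

    Returns True only on a confirmed cancellation. Returns False on timeout
    or on any unexpected status, so the caller can escalate to a person
    instead of assuming the cancel request took effect.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status().lower()
        if status in CANCELLED:
            return True
        if status not in STILL_ACTIVE:
            # The deployment ended some other way (for example, it completed).
            return False
        if clock() >= deadline:
            # Still active after the timeout: do not assume it stopped.
            return False
        sleep(poll_interval_s)
```

A caller would pass a small function that fetches the current status of the cloud deployment and would page the on-call engineer whenever the result is `False`.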