FMX incident

Cloud Service Provider outage

FMX experienced a critical incident on July 18, 2024 affecting Web App and API and 1 more component, lasting 6h 9m. The incident has been resolved; the full update timeline is below.

Started: Jul 18, 2024, 10:52 PM UTC
Resolved: Jul 19, 2024, 05:02 AM UTC
Duration: 6h 9m
Detected by Pingoru: Jul 18, 2024, 10:52 PM UTC

Affected components

Web App, API, Email, Reporting Dashboards

Update timeline

investigating Jul 18, 2024, 10:52 PM UTC

Our Cloud Service Provider is experiencing an outage. We will update this message as we have more details. We apologize for the inconvenience.

investigating Jul 19, 2024, 12:56 AM UTC

We are continuing to monitor and follow our cloud provider's status pages and will provide more updates as they become available.

investigating Jul 19, 2024, 01:38 AM UTC

According to our cloud provider, they have `determined the underlying cause and are currently applying mitigation through multiple workstreams.` We're continuing to monitor the situation and will provide updates as they become available.

investigating Jul 19, 2024, 01:43 AM UTC

We are seeing sites come back online. We're continuing to monitor the incident to ensure our cloud service provider's mitigation is successful.

monitoring Jul 19, 2024, 02:47 AM UTC

Our cloud provider has deployed a fix that appears to have resolved the current incident. We'll continue to monitor the situation and provide updates as they happen.

resolved Jul 19, 2024, 05:02 AM UTC

This incident has been resolved by our cloud service provider. We will continue to keep an eye out for any further issues. Again, our sincere apologies for any service disruption.

postmortem Jul 31, 2024, 08:29 PM UTC

We want to thank you for using FMX to manage your operations. We understand that FMX is often a critical component of running your organization, and we therefore take any service disruption seriously. This postmortem report will help you better understand what caused the interruption in service and how we plan to avoid issues like this in the future.

**The root cause of the outage**

On the evening of July 18th, we discovered that the FMX platform was unavailable. After investigating, we determined that the cause was a region-wide outage of the Microsoft Azure datacenters across the US-Central region. You can read more about it [here](https://azure.status.microsoft/en-us/status/history/) from the Azure team. Tracking ID: `1K80-N_8`

**Solution**

To resolve the problem, we monitored the situation and waited for the Azure remediations to be implemented.

**Future mitigations**

On July 30th, we conducted our own internal postmortem to discuss mitigation tactics that we can employ to reduce our reliance on an individual region or cloud provider, improve communication, and improve our monitoring capabilities. This discussion aligned very closely with the [Azure Well-Architected Framework](https://learn.microsoft.com/en-us/azure/well-architected/).

Regards,
FMX Team