Frontegg incident
Frontegg Services are showing Degraded Performance in EU & US
Frontegg experienced a major incident on May 31, 2023, lasting 3h 4m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating May 31, 2023, 01:08 PM UTC
We are currently investigating this issue.
- monitoring May 31, 2023, 01:18 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved May 31, 2023, 04:13 PM UTC
This incident has been resolved.
- postmortem Jun 01, 2023, 02:27 PM UTC
### **Executive summary:**

On Wednesday, May 31st, 2023, at 12:55 GMT, we deployed a minor version of one of our services. Shortly after, at 12:56 GMT, Frontegg’s US monitoring system started sending alerts for an authentication service that was not performing as expected, and the team immediately began investigating. At 13:01 GMT we started receiving alerts from Frontegg’s EU monitoring for the same service, and shortly after that we began receiving complaints from customers.

At 13:04 GMT, 8 minutes after the first alerts, the team concluded that the issue was caused by the recently deployed change. The change included a database migration for one of our primary services. However, the migration job did not run because of an edge-case race condition in our CD infrastructure, leaving the service in a schema mismatch state. We immediately started a rollback for both the EU and US regions, which completed by 13:16 GMT. Once the rollback completed, our services were working as expected again, and customers reported that they were no longer experiencing issues.

**Effect:** Most requests to customers’ custom Frontegg domains resulted in 401/404 responses or an inability to authenticate.

* EU region: between 12:59 and 13:16 GMT
* US region: between 12:56 and 13:14 GMT

### **Mitigation and resolution:**

Following the monitoring alerts, the incident response team immediately identified the potentially corrupted service and started the rollback procedure to the previous successful deployment.
### **Preventive steps:**

* We defined a gated process for deploying DB migration changes
* We added schema validation on service init to prevent schema mismatch cases
* We will add deployment validation that fails the deployment if a migration didn’t run
* We will remove the heavy dependency on that specific service, which is a single point of failure for the main system flows
* Reduce service rollback time by running only the relevant part of the CD pipeline
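
To illustrate the schema validation step, here is a minimal sketch of a startup check that refuses to run when the database schema does not match what the service expects. It is not Frontegg's implementation. The `schema_migrations` table, the `EXPECTED_SCHEMA_VERSION` constant, and the function names are all hypothetical, and SQLite stands in for whatever database the service actually uses.

```python
import sqlite3

# Hypothetical: the schema version this build of the service was written against.
EXPECTED_SCHEMA_VERSION = 42


class SchemaMismatchError(RuntimeError):
    """Raised when the database schema does not match what the service expects."""


def current_schema_version(conn: sqlite3.Connection) -> int:
    """Return the highest recorded migration version, or 0 if none were recorded.

    Assumes a hypothetical `schema_migrations` table with an integer `version` column.
    """
    row = conn.execute("SELECT MAX(version) FROM schema_migrations").fetchone()
    return row[0] if row[0] is not None else 0


def validate_schema(conn: sqlite3.Connection,
                    expected: int = EXPECTED_SCHEMA_VERSION) -> None:
    """Fail fast if the applied migrations do not match the expected version."""
    actual = current_schema_version(conn)
    if actual != expected:
        raise SchemaMismatchError(
            f"schema version {actual} found, but this build expects {expected}"
        )


if __name__ == "__main__":
    # Simulate a database where the latest migration (42) never ran.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE schema_migrations (version INTEGER PRIMARY KEY)")
    conn.execute("INSERT INTO schema_migrations (version) VALUES (41)")
    try:
        validate_schema(conn)
    except SchemaMismatchError as exc:
        print(f"refusing to start: {exc}")
```

The same check could plausibly be reused as a post-deployment gate: run `validate_schema` against the target database after the migration job is supposed to have finished, and fail the deployment if it raises.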