Frontegg incident

EU Degraded State - Partial Outage

Severity: Major · Resolved

Frontegg experienced a major incident on July 24, 2024, lasting 2h 6m. The incident has been resolved; the full update timeline is below.

Started
Jul 24, 2024, 08:51 AM UTC
Resolved
Jul 24, 2024, 10:58 AM UTC
Duration
2h 6m
Detected by Pingoru
Jul 24, 2024, 08:51 AM UTC

Update timeline

  1. investigating Jul 24, 2024, 08:51 AM UTC

    We are currently investigating this issue.

  2. identified Jul 24, 2024, 09:06 AM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Jul 24, 2024, 09:13 AM UTC

    A fix has been implemented and we are monitoring the results.

  4. monitoring Jul 24, 2024, 09:47 AM UTC

    We are continuing to monitor for any further issues.

  5. monitoring Jul 24, 2024, 10:37 AM UTC

    We are continuing to monitor for any further issues.

  6. resolved Jul 24, 2024, 10:58 AM UTC

    This incident has been resolved.

  7. postmortem Jul 26, 2024, 02:00 PM UTC

    # Root Cause Analysis (RCA) Report

    **Date and Time:** July 24, 2024
    **Duration:** 22 minutes
    **Affected Services:** Authentication and core services
    **Impact:** Customer requests in the EU region hung and returned HTTP 504 timeouts
    **Reported By:** Internal monitoring systems and customers

    ---

    **Executive Summary:**

    On Wednesday, July 24th, at 08:43 GMT, Frontegg's internal monitoring systems indicated that the API Gateway had encountered an issue following the deployment of a new OpenTelemetry propagator (OTEL instrumentation), causing service disruptions in the EU region. As a result, some of our customers received timeout errors (HTTP status 504) from Frontegg.

    During the upgrade of our API Gateway, Frontegg also updated the OpenTelemetry library. Due to a misconfiguration in the data-handling settings, this update inadvertently caused the system to send trace data one piece at a time instead of in efficient batches: OTEL transmitted millions of traces individually rather than in aggregated batches. Although our system was rigorously tested under various conditions, the high load in the EU environment caused our auto-scaling mechanism to lag behind the incoming traffic, and the API Gateway was overwhelmed by the volume of client requests.

    **Cause Analysis:**

    The primary cause of the incident was the deployment of new OTEL instrumentation in the API Gateway, which led to a significant increase in trace data volume. Contributing factors included:

    * The API Gateway's OTEL was configured with the BasicPropagator instead of a BatchPropagator, so each trace was sent individually as part of the request flow.
    * The rapid rise in HTTP requests to the OTEL Collector overloaded the API Gateway's ability to handle incoming requests. Although the gateway was autoscaled, scaling could not keep pace with the number of requests.
    * As the volume of traces grew, the OTEL Collector failed to handle millions of traces at that rate, increasing request handling time, which in turn drove a further increase in API Gateway HTTP requests.

    **Customer Impact:**

    During the incident, customers in the European region experienced significant service degradation. Specific issues included failures in hosted login monitors and general service instability.

    **Mitigation and Resolution:**

    Upon receiving the initial alerts, the Frontegg team began investigating the issue promptly. After identifying the problem with the OTEL propagator and collector, we increased the allocated resources and reverted to the last working version. Following this change, the systems returned to normal operation.

    **Mitigation:**

    * Increased the CPU allocation for the OTEL Gateway to handle the increased workload.
    * Reverted to the last working API Gateway version.

    **Resolution:**

    * Restarted the API Gateway to clear hanging requests and stabilize the OTEL Gateway.
    * Deployed a new version of the API Gateway with the correct configuration.

    **Prevention and Future Steps:**

    * **Enhance OTEL Propagator:** Implement batch processing, asynchronous handling, and strict timeouts.
    * **Upgrade OTEL Gateway:** Allocate additional resources to the OTEL Gateway and implement autoscaling to handle increased workloads effectively.
    * **Implement Aggressive Timeouts:** Enforce stringent timeout policies for all HTTP requests that are not customer-related. This measure will proactively prevent delays and mitigate the risk of unresponsive requests.
    * **Stress Tests:** Change the deployment pipeline to include stress testing instead of the nightly testing suite.

    **Communication:**

    * **Enhance Status Page Communication:** Ensure the status page provides clear and timely updates during incidents.
    * Develop and maintain standardized templates for incident communication to facilitate prompt and consistent information, even when the root cause is not immediately identified.
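The failure mode in the RCA (per-trace export instead of batching) can be illustrated with a small, self-contained Python sketch. Note that in the OpenTelemetry SDKs the batching behavior lives in the span *processor* (e.g. `SimpleSpanProcessor` vs `BatchSpanProcessor`) rather than the propagator named in the RCA; the `FakeExporter` below is a hypothetical stand-in for the collector endpoint, not Frontegg's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeExporter:
    """Hypothetical collector endpoint: counts outbound export requests."""
    requests: int = 0
    spans_sent: int = 0

    def export(self, spans: List[str]) -> None:
        self.requests += 1            # one HTTP request per export call
        self.spans_sent += len(spans)


def simple_export(exporter: FakeExporter, spans: List[str]) -> None:
    """Per-span mode (the incident behavior): one request per trace."""
    for span in spans:
        exporter.export([span])


def batch_export(exporter: FakeExporter, spans: List[str],
                 batch_size: int = 512) -> None:
    """Batched mode: spans are queued and flushed in chunks."""
    for i in range(0, len(spans), batch_size):
        exporter.export(spans[i:i + batch_size])


spans = [f"span-{i}" for i in range(2048)]

per_span = FakeExporter()
simple_export(per_span, spans)     # 2048 requests to the collector

batched = FakeExporter()
batch_export(batched, spans)       # 4 requests for the same data
```

With millions of traces, the per-span mode multiplies request volume by roughly the batch size, which matches the RCA's description of the collector and gateway being overwhelmed.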