Flexera incident

Flexera One - Cloud Cost Optimization (CCO) - NAM - Degraded Policy Page Functionality

Major · Resolved

Flexera experienced a major incident on October 15, 2025, affecting Cloud Cost Optimization - US and lasting 2h 46m. The incident has been resolved; the full update timeline is below.

Started
Oct 15, 2025, 12:11 PM UTC
Resolved
Oct 15, 2025, 02:58 PM UTC
Duration
2h 46m
Detected by Pingoru
Oct 15, 2025, 12:11 PM UTC

Affected components

Cloud Cost Optimization - US

Update timeline

  1. investigating Oct 15, 2025, 12:11 PM UTC

    Incident Description: We are currently investigating an issue within the Cloud Cost Optimization (CCO) platform impacting customers in the NAM region. The platform remains accessible; however, affected customers may experience degraded functionality on the Applied Policy page, with certain tables not loading.

    Priority: P2

    Restoration Activity: Our technical team is actively investigating the root cause and working to restore services as quickly as possible. We will continue to share updates as we make progress toward a resolution.

  2. resolved Oct 15, 2025, 02:58 PM UTC

    The incident began at 3:37 AM PDT when multiple production nodes became unhealthy. These nodes were replaced, and the infrastructure was scaled up as part of the remediation. Service was fully restored by 7:16 AM PDT. The environment is now operational, and we will continue to monitor and investigate further to ensure ongoing stability.

  3. postmortem Oct 28, 2025, 07:15 AM UTC

    **Description:** Flexera One - Cloud Cost Optimization (CCO) - NAM - Degraded Policy Page Functionality

    **Timeframe:** October 14, 2025, 3:37 AM PDT to October 15, 2025, 7:16 AM PDT

    **Incident Summary**

    On Tuesday, October 14, 2025, at 3:37 AM PDT, the Flexera One platform experienced a service degradation affecting customers in the North America (NAM) region, with Cloud Cost Optimization (CCO) among the primarily impacted services. While the platform remained accessible throughout the incident, some users encountered degraded functionality on the Applied Policy page, where certain tables failed to load. The investigation determined that the degradation was caused by a failing service component within the Flexera One environment, which led to resource contention and reduced service performance. Increased disk utilization on multiple instances also contributed to unresponsiveness in some services. To mitigate the issue, the failing component was replaced and the infrastructure was scaled up to relieve disk pressure and restore platform stability. After extended monitoring and validation, all services were confirmed fully operational, and the incident was declared resolved on October 15, 2025, at 7:16 AM PDT.

    **Root Cause**

    The investigation identified the primary cause of the degradation as a capacity issue within the Flexera One environment: certain system components became unresponsive, producing service degradation and resource contention across multiple components. In addition, elevated disk usage on several instances caused I/O bottlenecks and intermittent unresponsiveness, further impacting service stability. The combined effect of resource exhaustion and disk pressure degraded performance, particularly within the Cloud Cost Optimization (CCO) service.

    **Remediation Actions**

    - **Infrastructure Scaling:** Increased the number of instances to handle the elevated workload and improve system resilience.
    - **Storage Expansion:** Increased disk capacity on each instance to accommodate higher data volume and prevent resource saturation.
    - **Service Component Replacement and Scaling:** Replaced the service components that were in a bad state to restore overall cluster stability.
    - **Validation and Monitoring:** Performed extended validation and monitoring to confirm full service recovery and stability. The issue was confirmed resolved on October 15, 2025, at 7:16 AM PDT.

    **Future Preventative Measures**

    - **Infrastructure Resilience Enhancements:** Implement proactive health checks and automated failover mechanisms to prevent service degradation due to component failures (see the sketch after this list).
    - **Performance Tuning:** Continue refining in-memory batch processing configurations to ensure optimal performance during peak processing loads.
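To make the "proactive health checks and automated failover" measure concrete, here is a minimal sketch of a check-and-replace loop. It is not Flexera's actual tooling: the node hostnames, the `/healthz` endpoint, the `replace_node()` hook, and the 85% disk threshold are all illustrative assumptions.

```python
#!/usr/bin/env python3
"""Minimal health-check sketch: probe each node's health endpoint,
replace unhealthy nodes, and flag disk pressure before saturation.
Hostnames, endpoint, thresholds, and the replace hook are hypothetical."""

import shutil
import time
import urllib.request

NODES = ["node-1.internal:8080", "node-2.internal:8080"]  # hypothetical hosts
DISK_USAGE_LIMIT = 0.85    # assumed threshold: flag above 85% utilization
CHECK_INTERVAL_SEC = 30

def node_is_healthy(host: str) -> bool:
    """Probe the node's health endpoint; any error or non-200 is unhealthy."""
    try:
        with urllib.request.urlopen(f"http://{host}/healthz", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection failures, timeouts, and HTTP error statuses.
        return False

def local_disk_pressure(path: str = "/") -> float:
    """Return the fraction of disk space currently in use on this instance."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def replace_node(host: str) -> None:
    """Placeholder for the automated failover hook; in practice this would
    call the orchestration layer (e.g., terminate-and-replace in an
    autoscaling group) rather than just logging."""
    print(f"replacing unhealthy node: {host}")

if __name__ == "__main__":
    while True:
        for host in NODES:
            if not node_is_healthy(host):
                replace_node(host)
        if local_disk_pressure() > DISK_USAGE_LIMIT:
            print("disk pressure above limit; consider expanding storage")
        time.sleep(CHECK_INTERVAL_SEC)
```

The point of the design is that unhealthy components are detected and cycled automatically, and disk pressure is surfaced before it becomes the I/O bottleneck described in the root cause, rather than waiting for a paged responder.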