Flexera incident

Flexera One - Cloud Cost Optimization (CCO) - NAM - Policies failing to run

Flexera experienced a critical incident on September 29, 2025 affecting Cloud Cost Optimization - US, lasting 5h 44m. The incident has been resolved; the full update timeline is below.

Started: Sep 29, 2025, 11:00 AM UTC
Resolved: Sep 29, 2025, 04:45 PM UTC
Duration: 5h 44m
Detected by Pingoru: Sep 29, 2025, 11:00 AM UTC

Affected components

Cloud Cost Optimization - US

Update timeline

investigating Sep 29, 2025, 11:00 AM UTC

Incident Description: We are currently investigating an issue within the Cloud Cost Optimization (CCO) platform impacting customers in the NAM region. The platform remains accessible; however, affected customers may experience degraded functionality, including policies failing to run and certain automation features not loading as expected.
Priority: P1
Restoration Activity: Our technical team is actively investigating the root cause and working to restore services as quickly as possible. We will continue to share updates as we make progress toward a resolution.
investigating Sep 29, 2025, 11:41 AM UTC

Following further investigation, we’ve determined that the scope of impact is broader than initially assessed. Accordingly, we’ve escalated the priority to P1.
identified Sep 29, 2025, 12:27 PM UTC

Our teams have identified a potential issue affecting one of the infrastructure nodes within the Cloud Cost Optimization platform. Our technical teams are continuing remediation efforts, including provisioning a replacement node and cleaning up degraded components. We’re working to restore full functionality and will provide updates as progress continues.
identified Sep 29, 2025, 02:36 PM UTC

Remediation efforts have progressed, including replacement of the impacted infrastructure node and the addition of new capacity to stabilize the environment. As a result, affected pages are now loading again and performance has improved. Our teams are continuing to validate functionality, with particular focus on policy execution workflows, to ensure full restoration. Further updates will be provided as progress continues.
resolved Sep 29, 2025, 04:45 PM UTC

Remediation efforts have been completed, including removal of impacted resources and the addition of new capacity. Following these actions, affected pages became accessible again and policy execution has been confirmed across multiple customer environments. The incident is now resolved and services are fully operational.
postmortem Oct 14, 2025, 03:54 PM UTC

**Description:** Flexera One - Cloud Cost Optimization (CCO) - NAM - Policies failing to run

**Timeframe:** September 29, 2025, 4:00 AM PDT to September 29, 2025, 9:44 AM PDT

**Incident Summary**

On Monday, September 29, 2025, at 4:00 AM PDT, the Cloud Cost Optimization (CCO) platform experienced a service degradation that impacted customers in the North America (NAM) region. Although the platform remained accessible during this incident, some users encountered degraded functionality, particularly with policies failing to execute and certain automation features not loading as intended.

The investigation revealed that the primary cause of the degradation was a failing node within the CCO environment, which resulted in service component degradation and resource contention. Additionally, an increase in the usage of scalable file system storage contributed to unresponsive service components and service delays.

To address the issue, the failing node was replaced, and the infrastructure was scaled up to alleviate disk pressure and restore stability. A code update was also implemented to enable in-memory batch processing thresholds, which reduced the volume of metadata I/O operations on scalable file storage, thereby improving responsiveness. Following extended monitoring and validation, the issue was declared resolved at 9:44 AM PDT.

**Root Cause**

The investigation determined that the primary cause of the degradation was a failing node within the CCO environment, which led to service component degradation and resource contention. Additionally, increased utilization of scalable file system storage contributed to unresponsive components and service delays. Subsequent analysis also revealed errors related to a legacy load balancer, which further impacted specific UI functionalities such as the Incidents page.
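The Incident Summary above mentions a code update that enabled in-memory batch processing thresholds to reduce metadata I/O on file storage. The postmortem does not show that code, so the following is only a minimal Python sketch of the general technique, not Flexera's implementation. The names `BatchBuffer`, `flush_fn`, and `demo`, and the threshold of 100, are hypothetical and chosen purely for illustration.

```python
from typing import Callable, List


class BatchBuffer:
    """Hold records in memory and hand them to storage in batches.

    Hypothetical illustration: one storage call per `threshold` records
    instead of one call per record.
    """

    def __init__(self, threshold: int, flush_fn: Callable[[List[dict]], None]) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.flush_fn = flush_fn  # assumed to perform the actual storage write
        self._pending: List[dict] = []

    def add(self, record: dict) -> None:
        """Buffer one record; flush automatically once the threshold is reached."""
        self._pending.append(record)
        if len(self._pending) >= self.threshold:
            self.flush()

    def flush(self) -> None:
        """Write everything buffered so far as a single batch, if any."""
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self.flush_fn(batch)


def demo() -> int:
    """Buffer 250 records with a threshold of 100 and count storage calls."""
    written: List[List[dict]] = []
    buf = BatchBuffer(threshold=100, flush_fn=written.append)
    for i in range(250):
        buf.add({"id": i})
    buf.flush()  # write the remaining partial batch of 50
    return len(written)
```

In this sketch, 250 records result in three storage calls rather than 250. The trade-off is that unflushed records live only in memory, so a real system would also need a policy for flushing on shutdown or after a time limit.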

**Remediation Actions**

· **Node Replacement and Scaling:** The failing node was replaced, and additional nodes were provisioned to alleviate disk pressure and restore cluster stability.
· **Service Components Cleanup:** Degraded and failed components were identified and cleaned up, ensuring smooth operation across the environment.
· **Load Balancer Investigation and Fix:** The legacy load balancer errors were analyzed and remediated to restore full functionality of affected pages.
· **Code Optimization:** A code update was released to enable in-memory batch processing thresholds, reducing the volume of metadata I/O operations on scalable file storage and improving responsiveness.
· **Validation and Monitoring:** Extended monitoring and validation were conducted to confirm recovery and ensure all CCO services were functioning as expected. The issue was declared fully resolved at 9:44 AM PDT on September 29, 2025.

**Future Preventative Measures**

· **Infrastructure Resilience Enhancements:** Implement proactive node health checks and automated failover mechanisms to prevent service degradation due to node failures.
· **EFS Usage Optimization:** Review existing monitoring and alerting for file storage utilization and I/O latency to detect abnormal usage patterns early.
· **Performance Tuning:** Continue refining in-memory batch processing configurations to ensure optimal performance during peak processing loads.
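
The preventative measures above include alerting on file storage utilization. As a rough illustration of what such a threshold check could look like, here is a minimal Python sketch. It is not taken from the postmortem: `classify_usage`, `check_mount`, the byte budget, and the 80% and 90% thresholds are all assumptions made for this example. Only `shutil.disk_usage` is a real standard-library call.

```python
import shutil


def classify_usage(used_bytes: int, budget_bytes: int,
                   warn_ratio: float = 0.8, crit_ratio: float = 0.9) -> str:
    """Map storage usage against a byte budget to an alert level.

    The ratios are illustrative defaults, not values from the incident report.
    """
    if budget_bytes <= 0:
        raise ValueError("budget_bytes must be positive")
    ratio = used_bytes / budget_bytes
    if ratio >= crit_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"


def check_mount(path: str, budget_bytes: int) -> str:
    """Read current usage for the file system containing `path` and classify it.

    A fixed budget is used instead of the reported total size because an
    elastic network file system may not report a meaningful capacity.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return classify_usage(usage.used, budget_bytes)
```

A scheduler could call `check_mount` periodically and raise an alert on any result other than `"ok"`. I/O latency, which the report also mentions, would need a separate metric source and is not covered by this sketch.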