Harness incident
The Pipeline Executions list page experienced intermittent 500 errors
Harness experienced a minor incident on January 26, 2026 affecting Continuous Delivery - Next Generation (CDNG) and Continuous Integration Enterprise (CIE) - Self Hosted Runners and 1 more component, lasting 52 minutes. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Jan 26, 2026, 09:54 PM UTC
We are investigating intermittent 500 errors on the Pipeline Executions list page. The pipeline executions themselves were not impacted; this was a UI issue.
- monitoring Jan 26, 2026, 09:55 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Jan 26, 2026, 10:13 PM UTC
This incident has been resolved.
- postmortem Feb 11, 2026, 06:59 AM UTC
## **Summary**

On January 26, 2026, the Pipeline service in the production environment (Prod2) experienced intermittent failures affecting certain pipeline-related views. The issue was triggered by elevated memory usage in a subset of service instances, which caused specific API requests to fail. The issue was identified quickly through automated monitoring and resolved the same day.

## **Impact**

During the incident window, some customers may have experienced:

* Intermittent failures when loading the **Execution List** or **Execution Details** pages.
* Occasional issues accessing **Retry History**.
* Rare intermittent failures loading the **Pipeline List** page.

The issue did **not** impact pipeline execution itself. Pipelines continued to run successfully, and there was **no data loss**. The impact was limited to UI/API visibility of execution metadata for a subset of requests.

## **Root Cause**

The issue was caused by memory pressure from certain heavy backend operations. Because the affected instances were not fully unhealthy and continued responding to basic health checks, automated readiness checks did not trigger a restart. As a result, the impacted instances remained in a partially degraded state until manual mitigation was performed.

## **Mitigation**

Immediate mitigation steps included:

* Performing a rolling restart of the affected service instances, which restored normal operation.
* Enabling configuration to automatically terminate service instances on OutOfMemoryError, ensuring degraded instances do not remain in a bad state.

Service functionality was fully restored shortly after mitigation was applied.

## **Action Items**

To reduce the risk of recurrence and improve early detection, the following actions are being implemented:

* Enforce automatic JVM termination on OutOfMemoryError to ensure faster self-recovery.
* Implement enhanced, granular monitoring and alerting for application-level heap memory utilization.
* Improve health check logic to better detect partially degraded service states (see the sketch below).
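To illustrate the last two action items, the sketch below shows what a heap-aware health check might look like for a plain JVM service. This is not Harness's actual implementation: the `HeapAwareHealthCheck` class, the `HealthStatus` enum, and the 85% threshold are illustrative assumptions; only the `java.lang.management` APIs used are standard. Automatic termination on OutOfMemoryError, as referenced above, is typically enabled on HotSpot JVMs with the `-XX:+ExitOnOutOfMemoryError` flag.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

/**
 * Minimal sketch of a heap-aware readiness check. The threshold and type
 * names are assumptions for illustration, not the service's actual code.
 */
public final class HeapAwareHealthCheck {

    enum HealthStatus { HEALTHY, DEGRADED }

    // Illustrative threshold: report degraded above 85% heap utilization.
    private static final double HEAP_DEGRADED_THRESHOLD = 0.85;

    public HealthStatus check() {
        MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memoryBean.getHeapMemoryUsage();

        // getMax() can return -1 when the maximum heap size is undefined;
        // in that case fall back to reporting healthy.
        long max = heap.getMax();
        if (max <= 0) {
            return HealthStatus.HEALTHY;
        }

        double utilization = (double) heap.getUsed() / max;

        // Reporting DEGRADED lets the orchestrator stop routing traffic to
        // this instance before individual API requests start failing.
        return utilization >= HEAP_DEGRADED_THRESHOLD
                ? HealthStatus.DEGRADED
                : HealthStatus.HEALTHY;
    }
}
```

A check along these lines, wired into the readiness probe, is one way an instance under sustained memory pressure could be taken out of rotation automatically instead of continuing to serve a subset of failing requests.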