Harness incident
Intermittent errors while loading the pipeline execution history page
Harness experienced a notice-level incident on January 9, 2026 affecting Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise (CIE) - Self Hosted Runners, and 1 more component, lasting 15m. The incident has been resolved; the full update timeline is below.
Update timeline
- monitoring Jan 09, 2026, 06:00 PM UTC
A fix has been implemented and we are monitoring the results.
- monitoring Jan 10, 2026, 12:20 AM UTC
We are continuing to monitor for any further issues.
- resolved Jan 10, 2026, 12:25 AM UTC
This incident has been resolved.
- postmortem Jan 28, 2026, 07:23 PM UTC
## **Summary**

On January 9, 2026, some customers experienced intermittent errors and slow responses while accessing pipeline execution details and execution lists. The issue was identified promptly and mitigated by the engineering team, and service functionality was restored within a short period.

## **Impact**

During the incident window, a subset of users may have encountered:

* Intermittent failures or delays when loading pipeline execution details.
* Occasional issues viewing execution history or execution lists.

Pipeline executions themselves continued to run as expected. There was **no data loss** and no long-term impact to customer environments.

## **Root Cause**

The issue was caused by elevated memory usage in a subset of service instances under load. When available memory dropped below required thresholds, certain requests for execution data could not be processed successfully. Because the affected instances remained partially healthy, they were not immediately recycled, resulting in intermittent request failures until mitigation was applied.

## **Mitigation**

We performed a rolling restart of all affected pods, which immediately resolved the issue. As a preventive measure, we also increased the pod heap size.

## **Action Items**

To prevent recurrence and improve resiliency, the following actions are being implemented:

* Increased memory allocation for the affected services to better handle peak load conditions.
* Improved automatic recovery behavior for services that encounter unrecoverable memory conditions.
* Enhanced monitoring and alerting for application-level memory usage to enable earlier detection.
* Additional safeguards to ensure degraded instances are identified and remediated more quickly.
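
For services running on Kubernetes, a mitigation of this shape typically combines a rolling restart with raised memory and heap settings. The fragment below is a minimal sketch of that pattern, assuming the service runs as a Deployment with a JVM workload; the deployment name, resource values, and `JAVA_OPTS` variable are illustrative assumptions, not Harness's actual configuration.

```yaml
# Hypothetical Deployment fragment: raising the pod's memory request/limit
# and the JVM heap so peak load no longer exhausts available memory.
# All names and values below are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pipeline-service        # illustrative service name
spec:
  template:
    spec:
      containers:
        - name: pipeline-service
          resources:
            requests:
              memory: "4Gi"     # raised from a lower previous value
            limits:
              memory: "6Gi"
          env:
            - name: JAVA_OPTS   # assumed heap-sizing mechanism
              value: "-Xmx4g"   # increased pod heap, per the mitigation
# An immediate rolling restart of the running pods can be triggered with:
#   kubectl rollout restart deployment/pipeline-service
```

Raising the request alongside the limit matters here: the request is what the scheduler reserves, so it determines whether a node can actually supply the memory the pod needs under peak load.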