Harness incident

Intermittent slowness during pipeline executions (Prod1, Prod2)

Harness experienced a minor incident on May 1, 2026 affecting Continuous Delivery - Next Generation (CDNG) and Continuous Integration Enterprise(CIE) - Self Hosted Runners and 1 more component, lasting 5h 8m. The incident has been resolved; the full update timeline is below.

Started: May 01, 2026, 03:02 PM UTC
Resolved: May 01, 2026, 08:10 PM UTC
Duration: 5h 8m
Detected by Pingoru: May 01, 2026, 03:02 PM UTC

Affected components

Continuous Delivery - Next Generation (CDNG)
Continuous Integration Enterprise(CIE) - Self Hosted Runners
Continuous Integration Enterprise(CIE) - Mac Cloud Builds
Continuous Integration Enterprise(CIE) - Windows Cloud Builds
Continuous Integration Enterprise(CIE) - Linux Cloud Builds
Security Testing Orchestration (STO)
Internal Developer Portal (IDP)

Update timeline

investigating May 01, 2026, 03:02 PM UTC

We are currently investigating this issue.

monitoring May 01, 2026, 03:37 PM UTC

A fix has been implemented and we are monitoring the results.

monitoring May 01, 2026, 07:58 PM UTC

The issue is largely mitigated and most pipelines are running normally. We are monitoring all parameters to confirm there are no remaining issues before closing the incident.

resolved May 01, 2026, 08:10 PM UTC

This incident has been resolved.

postmortem May 11, 2026, 08:58 PM UTC

### **Summary**

A rollout involving OpenTelemetry instrumentation changes introduced a memory leak in the OTEL eBPF collector running in production clusters. Under sustained production traffic, the leak caused increasing JVM heap utilization, elevated garbage collection pressure, and eventual out-of-memory (OOM) conditions across several core platform services.

### **Impact**

* Elevated latency and intermittent instability in Prod1, Prod2, and Prod3
* Some customers experienced slow pipeline execution and degraded responsiveness

No customer data loss occurred.

### **Root Cause**

The root cause was an upstream defect in the OpenTelemetry eBPF instrumentation library that introduced a memory leak under production-scale workloads. The leak continuously increased telemetry-related memory consumption, leading to sustained JVM garbage collection pressure and eventual heap exhaustion.

### **Mitigation and Recovery**

**Immediate Actions**

* Scaled up the impacted clusters to stabilize them
* Disabled OTEL instrumentation components and restarted affected services

### **Next Steps**

To prevent similar issues from recurring, we are:

* Enhancing our load-testing process to cover higher workloads, so that such issues are identified before changes reach production.
* Adding more granular instrumentation to catch such issues sooner.
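
As an illustration of the kind of check that could surface steady heap growth before it ends in OOM conditions, the following is a minimal sketch and not a description of Harness's actual tooling. It assumes heap utilization samples (seconds since start, fraction of maximum heap in use) are already being collected; the function names, the sample format, and the 0.02-per-hour threshold are all hypothetical choices made for the example.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical sample format: (seconds since start, heap used as a fraction of max heap).
Sample = Tuple[float, float]


def heap_growth_per_hour(samples: Sequence[Sample]) -> float:
    """Least-squares slope of heap utilization, in fraction of heap per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    return cov / var_t * 3600.0  # convert per-second slope to per-hour


def hours_until_exhaustion(
    samples: Sequence[Sample],
    min_growth_per_hour: float = 0.02,  # assumed alert threshold, not a recommended value
) -> Optional[float]:
    """Return projected hours until the heap is full, or None if growth is below the threshold."""
    slope = heap_growth_per_hour(samples)
    if slope < min_growth_per_hour:
        return None
    latest_utilization = samples[-1][1]
    return max(0.0, (1.0 - latest_utilization) / slope)


if __name__ == "__main__":
    # Synthetic data: utilization climbing steadily, sampled every 10 minutes.
    leaking = [(i * 600.0, 0.40 + 0.10 * (i * 600.0) / 3600.0) for i in range(7)]
    print(hours_until_exhaustion(leaking))
```

A trend-based check like this is only one option. Sampling utilization after garbage collection rather than at arbitrary moments would reduce noise from normal allocation churn, and a real alert would likely also consider GC pause time.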