Harness incident
Intermittent slowness during pipeline executions (Prod1, Prod2)
Harness experienced a minor incident on May 1, 2026 affecting Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise (CIE) - Self Hosted Runners, and 1 more component, lasting 5h 8m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating May 01, 2026, 03:02 PM UTC
We are currently investigating this issue.
- monitoring May 01, 2026, 03:37 PM UTC
A fix has been implemented and we are monitoring the results.
- monitoring May 01, 2026, 07:58 PM UTC
The issue is largely mitigated and most pipelines are running normally. We are monitoring all parameters to confirm there are no remaining issues before closing the incident.
- resolved May 01, 2026, 08:10 PM UTC
This incident has been resolved.
- postmortem May 11, 2026, 08:58 PM UTC
### **Summary**

A rollout involving OpenTelemetry instrumentation changes introduced a memory leak in the OTEL eBPF collector running in production clusters. Under sustained production traffic, the leak caused increasing JVM heap utilization, elevated garbage collection pressure, and eventual out-of-memory (OOM) conditions across several core platform services.

### **Impact**

* Elevated latency and intermittent instability in Prod1, Prod2, and Prod3
* Some customers experienced slow pipeline execution and degraded responsiveness

No customer data loss occurred.

### **Root Cause**

The root cause was an upstream defect in the OpenTelemetry eBPF instrumentation library that introduced a memory leak under production-scale workloads. The leak continuously increased telemetry-related memory consumption, leading to sustained JVM garbage collection pressure and eventual heap exhaustion.

### **Mitigation and Recovery**

**Immediate Actions**

* Scaled up the impacted clusters to stabilize them
* Disabled OTEL instrumentation components and restarted affected services

### **Next Steps**

To prevent such issues from happening again, we are:

* Enhancing our load testing process to test under higher workloads, so that such issues are identified before going to production.
* Adding more granular instrumentation to catch such issues sooner.
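
As an illustration of the kind of granular check that could flag a slow memory leak before heap exhaustion, the following is a minimal sketch and not a description of Harness's actual tooling. It assumes heap usage is sampled at a fixed interval (ideally measured just after garbage collection, so normal allocation churn does not dominate) and fits a least-squares line to the samples. The names `heap_growth_slope` and `check_for_leak`, and the 5 MB/min threshold, are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LeakVerdict:
    slope_mb_per_min: float  # fitted growth rate of heap usage
    suspected_leak: bool     # True when growth exceeds the threshold


def heap_growth_slope(samples_mb: Sequence[float], interval_min: float = 1.0) -> float:
    """Return the least-squares slope (MB per minute) of evenly spaced samples."""
    n = len(samples_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = [i * interval_min for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(samples_mb) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples_mb))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def check_for_leak(
    samples_mb: Sequence[float],
    interval_min: float = 1.0,
    threshold_mb_per_min: float = 5.0,  # hypothetical threshold, tune per service
) -> LeakVerdict:
    """Flag a suspected leak when heap usage trends upward faster than the threshold."""
    slope = heap_growth_slope(samples_mb, interval_min)
    return LeakVerdict(slope_mb_per_min=slope, suspected_leak=slope > threshold_mb_per_min)


if __name__ == "__main__":
    leaking = [1000 + 10 * i for i in range(30)]            # steady upward drift
    healthy = [1000 + (10 if i % 2 else 0) for i in range(30)]  # flat sawtooth
    print(check_for_leak(leaking))
    print(check_for_leak(healthy))
```

A trend-based check like this reacts to sustained growth rather than to a single high reading, so it can fire well before an absolute heap-usage alarm would. The window length and threshold would need tuning against each service's normal behavior.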