Harness incident

Prod1/Prod2 pipelines and logins are degraded. Some delegates are disconnected

Harness experienced a major incident on February 26, 2026 affecting Continuous Delivery - Next Generation (CDNG), lasting 33m. The incident has been resolved; the full update timeline is below.

Started: Feb 26, 2026, 05:56 PM UTC
Resolved: Feb 26, 2026, 06:29 PM UTC
Duration: 33m
Detected by Pingoru: Feb 26, 2026, 05:56 PM UTC

Affected components

Continuous Delivery - Next Generation (CDNG)

Update timeline

investigating Feb 26, 2026, 05:56 PM UTC

We are currently investigating this issue.

identified Feb 26, 2026, 06:05 PM UTC

The issue has been identified and a fix is being implemented.

monitoring Feb 26, 2026, 06:14 PM UTC

A fix has been implemented and we are monitoring the results.

resolved Feb 26, 2026, 06:29 PM UTC

This incident has been resolved.

postmortem Mar 02, 2026, 04:26 PM UTC

## Summary

On **February 26, 2026**, multiple customers experienced disruptions accessing Harness on Prod1 and Prod2. A transient network connectivity issue disrupted our backend systems, leaving the platform unresponsive. Service was restored within approximately one hour.

## Impact

* Customers on Prod2 were unable to log in or access the Harness platform.
* Prod1 experienced login disruptions due to a cross-environment dependency on Prod2.
* Delegates disconnected; Kubernetes-based delegates reconnected automatically, while non-Kubernetes delegates required a manual restart.

## Root Cause

A transient network connectivity disruption caused connection timeouts across the platform. The exact infrastructure-side trigger of the initial connectivity disruption is still under investigation.

## Remediation

* **Immediate:** Affected services were manually restarted, clearing stuck connections and restoring platform availability.
* **Short-term:** Autoscaling limits were adjusted to better handle sudden reconnection load.
* **Ongoing:** Investigation into timeout configuration and application resilience improvements is in progress.

## Action Items

To prevent such issues from happening again:

1. Review and update the timeout settings to fail fast and limit thread blocking during connectivity issues.
2. Improve application resilience: enhance circuit breakers to contain connectivity issues and excessive retries.
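
The fail-fast and circuit-breaker ideas in the action items can be illustrated with a minimal sketch. This is a generic example of the pattern, not Harness's implementation: the `CircuitBreaker` class, the `CircuitOpenError` exception, and the `failure_threshold` and `reset_timeout` parameters are all hypothetical names chosen for illustration. After a run of consecutive failures the breaker rejects calls immediately for a cool-down period instead of letting each caller block on a timeout, then allows a trial call to check whether the dependency has recovered.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Illustrative single-threaded circuit breaker (hypothetical, not thread-safe)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before a trial call
        self.clock = clock                          # injectable clock, useful for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: do not tie up the caller waiting on a timeout.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or reaching the threshold, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In use, the wrapped function would be a network call that carries its own short timeout, for example `breaker.call(fetch_status)` where `fetch_status` is a hypothetical function that raises on timeout. The breaker only limits how often a failing dependency is retried; it does not replace the per-call timeout.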