Harness incident

Platform access issues in Prod1/Prod2/Prod3

Harness experienced a major incident on May 12, 2026 affecting three Platform components, lasting 21m. The incident has been resolved; the full update timeline is below.

Started: May 12, 2026, 05:33 PM UTC
Resolved: May 12, 2026, 05:55 PM UTC
Duration: 21m
Detected by Pingoru: May 12, 2026, 05:33 PM UTC

Affected components

Platform, Platform, Platform

Update timeline

investigating May 12, 2026, 05:33 PM UTC

We are currently investigating this issue.
investigating May 12, 2026, 05:37 PM UTC

We are continuing to investigate this issue.
identified May 12, 2026, 05:46 PM UTC

The issue has been identified and a fix is being implemented.
monitoring May 12, 2026, 05:52 PM UTC

A fix has been implemented and we are monitoring the results.
resolved May 12, 2026, 05:55 PM UTC

This incident has been resolved.
postmortem May 13, 2026, 12:03 AM UTC

## **Summary**

Between 10:07 AM–10:39 AM PST on Tuesday, May 12, 2026, customers using the prod1, prod2, and prod3 production clusters experienced elevated latency and intermittent service degradation. During this timeframe, customers observed delegate timeouts, login failures, and pipeline execution failures.

## **Root Cause**

A recently introduced configuration change to a common infrastructure component caused unexpected resource pressure across nodes in the prod1, prod2, and prod3 production clusters. Peak traffic exacerbated the resource utilization and introduced elevated latency across several critical platform services.

## **Impact**

1. Customers in prod1, prod2, and prod3 experienced login and access failures for approximately 20 minutes.
2. Delegate connectivity was intermittently impacted during the incident window.
3. Pipeline executions and API requests experienced elevated failure rates and latency during the incident window.

## **Remediation**

* Immediately rolled back to the previous stable release, restoring customer pipeline functionality and alleviating node pressure.

## **Action Items**

1. Enhance performance testing to include such workloads so that issues are caught before they reach production.
2. Increase capacity across clusters to ensure enough headroom to absorb traffic surges.
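
The capacity action item can be made concrete with a small headroom check. The following is only an illustrative sketch, not Harness's tooling: the `ClusterLoad` type, the core counts, and the 1.5x surge factor are all hypothetical values chosen for the example, and CPU is used as a stand-in for whichever resource was actually under pressure.

```python
from dataclasses import dataclass


@dataclass
class ClusterLoad:
    """Hypothetical per-cluster capacity snapshot (all figures are made up)."""
    name: str
    capacity_cores: float   # total allocatable CPU across the cluster's nodes
    peak_used_cores: float  # CPU in use at normal peak traffic


def headroom_ratio(load: ClusterLoad) -> float:
    """Fraction of total capacity that is still free at normal peak."""
    return 1.0 - load.peak_used_cores / load.capacity_cores


def absorbs_surge(load: ClusterLoad, surge_factor: float) -> bool:
    """True if peak usage scaled by surge_factor still fits within capacity."""
    return load.peak_used_cores * surge_factor <= load.capacity_cores


# Illustrative numbers only; not taken from the incident.
clusters = [
    ClusterLoad("prod1", capacity_cores=1000.0, peak_used_cores=600.0),
    ClusterLoad("prod2", capacity_cores=1000.0, peak_used_cores=800.0),
]

for cluster in clusters:
    print(
        cluster.name,
        f"headroom={headroom_ratio(cluster):.0%}",
        f"absorbs_1.5x_surge={absorbs_surge(cluster, 1.5)}",
    )
```

In this sketch a cluster running at 80% of capacity at normal peak cannot absorb a 1.5x surge, while one at 60% can. A check of this kind could run against real utilization metrics to flag clusters that need more capacity before a change is rolled out.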