Harness incident

Deployment Degradation – Failures in Prod1,2,3

Harness experienced a minor incident on May 7, 2026 affecting Continuous Delivery (CD) - FirstGen - EOS and several other components, lasting 3h 29m. The incident has been resolved; the full update timeline is below.

Started: May 07, 2026, 07:54 AM UTC
Resolved: May 07, 2026, 11:23 AM UTC
Duration: 3h 29m
Detected by Pingoru: May 07, 2026, 07:54 AM UTC

Affected components

Continuous Delivery (CD) - FirstGen - EOS
Continuous Delivery - Next Generation (CDNG)
Cloud Cost Management (CCM)
Continuous Error Tracking (CET)

Update timeline

investigating May 07, 2026, 07:54 AM UTC

CDS deployments failing in looping strategy
investigating May 07, 2026, 08:17 AM UTC

We are continuing to investigate this issue.
investigating May 07, 2026, 08:43 AM UTC

We are continuing to investigate this issue.
monitoring May 07, 2026, 09:05 AM UTC

A fix has been implemented and we are monitoring the results.
resolved May 07, 2026, 11:23 AM UTC

This incident has been resolved.
postmortem May 20, 2026, 12:43 AM UTC

### Incident Summary

On May 6 at 11:50 PM PST, we deployed a configuration change to one of our core pipeline services. This change introduced an unintended interaction with our database layer, causing a significant increase in write load. The resulting pressure degraded query and command throughput across the platform.

### Root Cause

The configuration change introduced a blocking condition on expression evaluation in the pipeline service. When executions encountered blocked expressions, they failed and retried repeatedly, generating a write storm against the database, and throughput went up 4x.

### Remediation

- Rolled back the configuration change.
- Applied database-level tuning to reduce write pressure and accelerate backlog drainage.
- Performed a controlled failover to a healthy database node to restore throughput.
- Scaled up database nodes to provide sufficient capacity for full recovery.

### Preventive Actions

To prevent such issues from happening again, we are focusing on:

1. Increased database resilience: We are implementing automated load-shedding thresholds that trigger on leading indicators (replication lag, session depth, op latency) before the database reaches saturation, preventing retry storms from compounding into full degradation events.
2. We are optimizing our databases to increase write throughput by an order of magnitude and enable independent scaling of customer data workloads. This would have allowed us to drain the message backlog nearly instantaneously during this incident.
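
As an illustration of the load-shedding idea in the preventive actions, the following is a minimal sketch of a threshold check on the three leading indicators named there (replication lag, session depth, op latency). It does not show Harness's implementation. The class names, function names, threshold values, and the policy of shedding retried writes before first attempts are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical limits; real values would come from capacity testing.
    max_replication_lag_s: float = 5.0
    max_session_depth: int = 500
    max_op_latency_ms: float = 200.0


@dataclass(frozen=True)
class DbHealth:
    # A point-in-time sample of the leading indicators.
    replication_lag_s: float
    session_depth: int
    op_latency_ms: float


def breached_indicators(health: DbHealth, limits: Thresholds) -> List[str]:
    """Return the names of the indicators that exceed their limits."""
    breached = []
    if health.replication_lag_s > limits.max_replication_lag_s:
        breached.append("replication_lag")
    if health.session_depth > limits.max_session_depth:
        breached.append("session_depth")
    if health.op_latency_ms > limits.max_op_latency_ms:
        breached.append("op_latency")
    return breached


def should_shed(health: DbHealth, limits: Thresholds, is_retry: bool) -> bool:
    """Assumed policy: shed retried writes as soon as any indicator is
    breached, and shed first attempts only when two or more are breached."""
    breached = breached_indicators(health, limits)
    if is_retry:
        return len(breached) >= 1
    return len(breached) >= 2
```

Under this assumed policy, a single breached indicator is enough to reject retried writes while first attempts still pass, which targets the retry traffic that compounds a write storm before the database saturates.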