Harness incident

Platform is experiencing degraded performance for some organizations.

Harness experienced a minor incident on April 30, 2026, affecting Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), and 1 more component, lasting 1h 25m. The incident has been resolved; the full update timeline is below.

Started: Apr 30, 2026, 04:25 PM UTC
Resolved: Apr 30, 2026, 05:50 PM UTC
Duration: 1h 25m
Detected by Pingoru: Apr 30, 2026, 04:25 PM UTC

Affected components

Continuous Delivery (CD) - FirstGen - EOS
Continuous Delivery - Next Generation (CDNG)
Cloud Cost Management (CCM)
Continuous Error Tracking (CET)
Chaos Engineering
Continuous Integration Enterprise(CIE) - Self Hosted Runners
Continuous Integration Enterprise(CIE) - Mac Cloud Builds

Update timeline

investigating Apr 30, 2026, 04:25 PM UTC

We are currently investigating this issue.

identified Apr 30, 2026, 05:27 PM UTC

Issue has been identified and mitigated.

resolved Apr 30, 2026, 07:08 PM UTC

This incident has been resolved.

postmortem May 11, 2026, 08:47 PM UTC

# **Summary**

On April 30, 2026, between approximately 15:29 UTC and 17:00 UTC, customers in Prod3 experienced degradation impacting delegate connectivity, instance synchronization, pipeline executions, and connector operations due to a spike in load on one of our services. Service stability was restored through service scaling, infrastructure capacity increases, and database resource expansion.

# **Impact**

Customer Impact:

* Delegates disconnected intermittently during the incident window
* Instance synchronization operations were delayed
* Some pipeline executions and connector operations experienced failures or delays

Duration:

* Delegate connectivity impact: ~15 minutes
* Elevated service degradation: ~90 minutes

# **Root Cause**

The incident was caused by a load spike that led to thread exhaustion and elevated request contention between internal services during a period of increased synchronization and delegate activity.

# **Mitigation and Recovery**

The following actions were taken to restore service stability:

* Scaled management service replicas horizontally
* Increased autoscaling thresholds and maximum replica counts
* Expanded database compute capacity
* Upgraded MongoDB infrastructure components
* Stabilized delegate reassignment and reconnection processing

Services recovered progressively beginning at approximately 15:47 UTC, with full stability restored by ~17:00 UTC.

# **Preventive Actions**

To prevent such issues from happening again, we are implementing the following improvements:

* Improving our circuit breakers and fail-fast protections between dependent services
* Enhancing monitoring and alerting for thread pool saturation and queue buildup
* Increasing baseline service headroom and resiliency protections
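
For readers unfamiliar with the circuit-breaker and fail-fast pattern named in the preventive actions, the following is a minimal generic sketch. It is not Harness's implementation; the class name `CircuitBreaker`, the exception `CircuitOpenError`, and the `failure_threshold` and `reset_timeout` parameters are all hypothetical and chosen only for illustration. The idea is that after repeated failures, callers stop waiting on a struggling dependency and are rejected immediately, which keeps their own threads from being tied up.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without invoking the dependency."""


class CircuitBreaker:
    """Illustrative breaker: opens after N consecutive failures (hypothetical design)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable clock, useful in tests
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: do not occupy a thread waiting on a struggling dependency.
                raise CircuitOpenError("dependency unavailable, failing fast")
            # Timeout elapsed: half-open state, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to the dependent service in `breaker.call(...)` and treat `CircuitOpenError` as a signal to return a degraded response or retry later, rather than queueing more work behind an already saturated thread pool.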