Harness incident
Platform is experiencing degraded performance for some organizations.
Harness experienced a minor incident on April 30, 2026, affecting Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), and 1 more component, lasting 1h 25m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Apr 30, 2026, 04:25 PM UTC
We are currently investigating this issue.
- identified Apr 30, 2026, 05:27 PM UTC
The issue has been identified and mitigated.
- resolved Apr 30, 2026, 07:08 PM UTC
This incident has been resolved.
- postmortem May 11, 2026, 08:47 PM UTC
# **Summary**

On April 30, 2026, between approximately 15:29 UTC and 17:00 UTC, customers in Prod3 experienced degradation affecting delegate connectivity, instance synchronization, pipeline executions, and connector operations, due to a spike in load on one of our services. Service stability was restored through service scaling, infrastructure capacity increases, and database resource expansion.

# **Impact**

Customer Impact:

* Delegates disconnected intermittently during the incident window
* Instance synchronization operations were delayed
* Some pipeline executions and connector operations experienced failures or delays

Duration:

* Delegate connectivity impact: ~15 minutes
* Elevated service degradation: ~90 minutes

# **Root Cause**

The incident was caused by a load spike that led to thread exhaustion and elevated request contention between internal services during a period of increased synchronization and delegate activity.

# **Mitigation and Recovery**

The following actions were taken to restore service stability:

* Scaled management service replicas horizontally
* Increased autoscaling thresholds and maximum replica counts
* Expanded database compute capacity
* Upgraded MongoDB infrastructure components
* Stabilized delegate reassignment and reconnection processing

Services recovered progressively beginning at approximately 15:47 UTC, with full stability restored by ~17:00 UTC.

# **Preventive Actions**

To prevent similar issues from recurring, we are implementing the following improvements:

* Improving our circuit breakers and fail-fast protections between dependent services
* Enhancing monitoring and alerting for thread pool saturation and queue buildup
* Increasing baseline service headroom and resiliency protections
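
For readers unfamiliar with the circuit-breaker and fail-fast pattern named under Preventive Actions, the following is a minimal illustrative sketch in Python. It is not the Harness implementation: the class name, the `failure_threshold` and `reset_timeout` parameters, and their default values are all hypothetical, chosen only to show the general idea. After a run of consecutive failures the breaker rejects calls immediately, so callers stop tying up threads waiting on an unhealthy dependency.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    """Illustrative single-threaded circuit breaker (hypothetical, not thread-safe)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock                         # injectable clock, useful for testing
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of waiting on the dependency.
                raise CircuitOpenError("dependency unavailable, failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each request to a dependent service, for example `breaker.call(fetch_status)` where `fetch_status` is any function that calls the dependency, and treat `CircuitOpenError` as a signal to return a degraded response or retry later. A production version would also need locking and per-dependency metrics, which this sketch omits.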