Harness incident
Prod 2 - Customers may see some executions from March 11 in a "running" but hung state
Update timeline
- investigating Mar 12, 2026, 04:20 PM UTC
Customers may continue to see some pipeline executions displayed as "running" even though they have completed, aborted, or failed as a result of yesterday's incident (https://status.harness.io/incidents/4y4dl47v2qhc). This is a UI-only artifact of that incident and does not affect customers' ability to start new executions. We are working on clearing these stale statuses.
- identified Mar 12, 2026, 04:21 PM UTC
The issue has been identified and a fix is being implemented.
- resolved Mar 12, 2026, 07:16 PM UTC
This incident has been resolved.
- postmortem Mar 17, 2026, 11:43 PM UTC
### **Summary**

On March 11, 2026, customers experienced pipeline failures and degraded UI performance (incorrect execution statuses) in the Prod2 environment. The issue was caused by a degradation in an internal shared infrastructure component used for coordination across services. The incident began around **7:10 AM PST** and was fully mitigated by approximately **10:12 AM PST**. During this period, pipeline execution throughput was significantly impacted for affected customers.

### **Root Cause**

The issue was caused by resource saturation in a shared infrastructure component used for distributed coordination, which led to increased latency and failures in service-to-service communication. As a result, pipeline execution services were unable to process workloads efficiently, leading to a buildup of queued tasks and reduced system throughput.

### **Impact**

Customers experienced the following:

* Pipeline executions failing or not progressing
* Increased pipeline execution times
* UI delays due to processing backlogs

The impact was limited to specific production environments, and no data loss occurred.

### **Mitigation**

**Immediate**

* Redirected services to a higher-capacity infrastructure instance to restore normal processing
* Cleared accumulated processing backlogs to recover system throughput
* Scaled supporting services to stabilize performance

**Permanent**

* Improved monitoring and alerting for early detection of resource saturation
* Implemented capacity and scaling improvements to handle higher-load scenarios
* Initiated architectural improvements to reduce reliance on shared coordination components

### **Action Items**

To prevent such issues from recurring, we are taking several steps:

* Enhance alerting to detect early signs of infrastructure saturation
* Review and optimize system behavior under high-concurrency scenarios
* Continue investigating the triggering conditions and incorporate findings into long-term improvements
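One of the action items above is alerting on sustained resource saturation rather than momentary spikes. The postmortem does not describe Harness's monitoring stack, so the sketch below is a generic, hypothetical illustration of that idea: it only fires an alert when usage stays at or above a threshold for several consecutive samples, filtering out transient peaks.

```python
# Illustrative saturation check. The threshold, sample window, and metric
# source are hypothetical; they are not taken from the postmortem.

def saturation_alerts(samples, threshold=0.85, consecutive=3):
    """Return indices where resource usage has stayed at or above
    `threshold` for at least `consecutive` samples in a row."""
    alerts = []
    streak = 0
    for i, usage in enumerate(samples):
        streak = streak + 1 if usage >= threshold else 0
        if streak >= consecutive:
            alerts.append(i)
    return alerts

# A one-sample spike does not alert; sustained saturation does.
usage = [0.40, 0.90, 0.50, 0.88, 0.91, 0.95, 0.97]
print(saturation_alerts(usage))  # -> [5, 6]
```

In a real deployment this logic would typically live in an alerting system (e.g. a rule with a "for" duration) rather than application code; the point is the same: require the condition to persist before paging anyone.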