Harness incident

Prod 2 - Customers may see some executions from March 11 in a "running" but hung state

Started
Mar 12, 2026, 04:20 PM UTC
Resolved
Mar 12, 2026, 07:16 PM UTC
Duration
2h 55m
Detected by Pingoru
Mar 12, 2026, 04:20 PM UTC

Affected components

Continuous Delivery - Next Generation (CDNG)

Update timeline

  1. investigating Mar 12, 2026, 04:20 PM UTC

    Customers may continue to see some pipeline executions showing as "running" even though they have completed, aborted, or failed as a result of yesterday's incident (https://status.harness.io/incidents/4y4dl47v2qhc). This behavior is a UI-only artifact of that incident and should not affect customers' ability to start new executions. We are working on clearing these artifacts.

  2. identified Mar 12, 2026, 04:21 PM UTC

    The issue has been identified and a fix is being implemented.

  3. resolved Mar 12, 2026, 07:16 PM UTC

    This incident has been resolved.

  4. postmortem Mar 17, 2026, 11:43 PM UTC

    ### **Summary**

    On March 11, 2026, customers experienced pipeline failures and degraded UI performance (incorrect execution status display) in the Prod2 environment. The issue was caused by a degradation in an internal shared infrastructure component used for coordination across services. The incident began around **7:10 AM PST** and was fully mitigated by approximately **10:12 AM PST**. During this period, pipeline execution throughput was significantly reduced for affected customers.

    ### **Root Cause**

    The issue was caused by resource saturation in a shared infrastructure component used for distributed coordination, which led to increased latency and failures in service-to-service communication. As a result, pipeline execution services were unable to process workloads efficiently, leading to a buildup of queued tasks and reduced system throughput.

    ### **Impact**

    Customers experienced the following:

    * Pipeline executions failing or not progressing
    * Increased pipeline execution times
    * UI delays due to processing backlogs

    The impact was limited to specific production environments, and no data loss occurred.

    ### **Mitigation**

    **Immediate**

    * Redirected services to a higher-capacity infrastructure instance to restore normal processing
    * Cleared accumulated processing backlogs to recover system throughput
    * Scaled supporting services to stabilize performance

    **Permanent**

    * Improved monitoring and alerting for early detection of resource saturation
    * Implemented capacity and scaling improvements to handle higher-load scenarios
    * Initiated architectural improvements to reduce reliance on shared coordination components

    ### **Action Items**

    To prevent such issues from happening again, we are taking several steps:

    * Enhance alerting to detect early signs of infrastructure saturation
    * Review and optimize system behavior under high-concurrency scenarios
    * Continue investigating the triggering conditions and incorporate findings into long-term improvements
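The postmortem's mitigation items center on detecting coordination-layer saturation before it degrades pipeline throughput. Harness has not published its monitoring stack, so the sketch below is purely illustrative of that pattern: a rolling-median latency check with hysteresis. The window size, latency budget, and alert hook are all hypothetical, not Harness's actual configuration.

```python
"""Illustrative early-saturation alert; not Harness's actual monitoring.

Pattern: track a coordination-service latency metric over a rolling
window and alert when the median exceeds a budget, clearing the alert
only once latency drops well below the budget (hysteresis).
"""
import statistics
from collections import deque

WINDOW = 12             # hypothetical: 12 samples, e.g. 12 x 5s = 1 minute
LATENCY_BUDGET_MS = 50  # hypothetical p50 budget for coordination round-trips


class SaturationDetector:
    def __init__(self) -> None:
        self.samples: deque = deque(maxlen=WINDOW)
        self.alerting = False

    def observe(self, latency_ms: float) -> None:
        """Record one latency sample and fire/clear the alert as needed."""
        self.samples.append(latency_ms)
        if len(self.samples) < WINDOW:
            return  # not enough data for a stable median yet
        median = statistics.median(self.samples)
        if median > LATENCY_BUDGET_MS and not self.alerting:
            self.alerting = True
            # Replace with a real alert hook (pager, Slack, webhook, ...).
            print(f"ALERT: coordination latency median {median:.1f} ms over budget")
        elif median <= LATENCY_BUDGET_MS * 0.8 and self.alerting:
            self.alerting = False  # hysteresis: clear only well below budget
            print("RESOLVED: coordination latency back within budget")
```

The hysteresis band (clearing only below 80% of the budget) is one common way to avoid alert flapping when a metric hovers near its threshold, which matters during exactly the kind of gradual saturation described in the root cause.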

Looking to track Harness downtime and outages?

Pingoru polls Harness's status page every 5 minutes and alerts you the moment it reports an issue, before your customers notice. A sketch of this polling approach follows the feature list below.

  • Real-time alerts when Harness reports an incident
  • Email, Slack, Discord, Microsoft Teams, and webhook notifications
  • Track Harness alongside 5,000+ providers in one dashboard
  • Component-level filtering
  • Notification groups + maintenance calendar
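
As a rough illustration of the polling described above (not Pingoru's actual implementation), the sketch below assumes status.harness.io is a standard Atlassian Statuspage deployment, which exposes public JSON endpoints such as `/api/v2/status.json`; the alert hook is left as a `print` placeholder.

```python
"""Minimal status-page poller sketch; not Pingoru's implementation.

Assumes https://status.harness.io is a standard Atlassian Statuspage
site exposing the public /api/v2/status.json endpoint.
"""
import json
import time
import urllib.request

STATUS_URL = "https://status.harness.io/api/v2/status.json"  # assumed endpoint
POLL_INTERVAL_SECONDS = 300  # 5 minutes, matching the cadence described above


def fetch_status() -> dict:
    """Fetch the overall status indicator reported by the status page."""
    with urllib.request.urlopen(STATUS_URL, timeout=10) as resp:
        return json.load(resp)


def poll_forever() -> None:
    last_indicator = None
    while True:
        try:
            status = fetch_status()
            # Statuspage indicators: "none", "minor", "major", "critical".
            indicator = status["status"]["indicator"]
            if indicator != last_indicator:
                # Replace with your own notifier (email, Slack, webhook, ...).
                print(f"Harness status changed to {indicator!r}: "
                      f"{status['status']['description']}")
                last_indicator = indicator
        except Exception as exc:  # transient network errors shouldn't kill the loop
            print(f"poll failed: {exc}")
        time.sleep(POLL_INTERVAL_SECONDS)


if __name__ == "__main__":
    poll_forever()
```

Alerting only on indicator *changes*, rather than on every poll, keeps notifications quiet during a long incident while still catching the transitions into and out of it.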
Start monitoring Harness for free

5 free monitors · No credit card required