Harness incident

Slowness in Pipeline Execution graph UI

Minor · Resolved
Started
Feb 27, 2026, 02:44 PM UTC
Resolved
Feb 27, 2026, 04:07 PM UTC
Duration
1h 23m
Detected by Pingoru
Feb 27, 2026, 02:44 PM UTC

Affected components

Continuous Delivery - Next Generation (CDNG)

Update timeline

  1. investigating Feb 27, 2026, 02:44 PM UTC

    We are currently investigating the issue. The impact is currently limited to the UI; pipeline executions continue to run as expected.

  2. identified Feb 27, 2026, 03:50 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Feb 27, 2026, 04:05 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Feb 27, 2026, 04:07 PM UTC

    This incident has been resolved.

  5. postmortem Mar 04, 2026, 07:50 PM UTC

    ## Summary

    On **2/27/2026**, customers experienced slowness when viewing **running pipeline execution pages** in the Harness UI. The issue was caused by delays in the processing of graph generation events used to generate the pipeline execution graph. The degradation began around **7:33 AM** PT and resulted in delayed updates and slow loading of pipeline execution views. The engineering team identified the underlying performance bottleneck, applied mitigation measures, and restored normal system behavior after stabilizing the event processing pipeline.

    ## Root Cause

    The incident was caused by a **temporary backlog in the Kafka consumers responsible for processing orchestration log events**, which are used to generate the execution graph for running pipelines. The backlog was triggered by increased system load combined with performance degradation in a **shared Elasticsearch cluster** used by the pipeline processing services.

    During the incident window, Elasticsearch experienced a sudden spike in indexing activity, which caused resource contention and high CPU utilization on one of the cluster nodes. The resulting slowdown in Elasticsearch queries reduced the processing throughput of the Kafka consumers responsible for graph generation, producing accumulated consumer lag and delayed updates in the pipeline execution UI (a lag-measurement sketch follows this timeline).

    ## Impact

    During the incident window:

    * Users experienced **slow loading or delayed updates when viewing running pipeline execution pages**.
    * The **pipeline graph visualization** and related execution details were slower to render.
    * Pipeline executions themselves continued to run normally, but the UI display of their progress was delayed.

    Other Harness services and pipeline execution functionality were not impacted.

    ## Mitigation

    Engineering teams implemented several mitigation steps to restore system performance:

    * **Scaled the Elasticsearch cluster** to relieve resource pressure and improve query performance.
    * **Scaled Kafka consumer capacity** to accelerate backlog processing.

    These actions improved consumer processing throughput and allowed the Kafka backlog to drain. Consumer lag fell, after which the pipeline execution UI returned to normal responsiveness.

    ## Prevention and Improvements

    To reduce the likelihood of similar incidents in the future, the following improvements are being implemented:

    * Capacity planning improvements for shared Elasticsearch clusters supporting orchestration workloads.
    * Additional safeguards against sudden spikes that can amplify indexing activity.

    These measures will help ensure better isolation of workloads and faster detection of resource contention scenarios (a contention-check sketch also follows below).
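The backlog described in the root cause shows up operationally as consumer lag: for each partition, the gap between the log-end offset and the group's last committed offset. The sketch below measures that lag with the `confluent_kafka` Python client; the broker address, group id, and topic name are illustrative placeholders, not Harness's actual configuration.

```python
from confluent_kafka import Consumer, TopicPartition

# Placeholder connection details -- not Harness's real configuration.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "graph-generation-consumers",  # hypothetical group id
    "enable.auto.commit": False,
})

topic = "orchestration-log-events"  # hypothetical topic name
meta = consumer.list_topics(topic, timeout=10)
partitions = [TopicPartition(topic, p) for p in meta.topics[topic].partitions]

# committed() returns each partition with the group's last committed offset.
total_lag = 0
for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)  # (oldest, log-end)
    committed = tp.offset if tp.offset >= 0 else low  # no commit yet -> log start
    lag = high - committed
    total_lag += lag
    print(f"partition {tp.partition}: lag={lag}")

print(f"total lag for group: {total_lag}")
consumer.close()
```

A lag figure that grows steadily while the graph UI falls behind is exactly the symptom described above; draining it is what the consumer-scaling mitigation accelerated.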
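Likewise, the "faster detection of resource contention" improvement could be as simple as a periodic per-node CPU check on the shared Elasticsearch cluster. A minimal sketch with the official `elasticsearch` Python client, assuming a reachable cluster endpoint and an arbitrary 85% alert threshold:

```python
from elasticsearch import Elasticsearch

# Placeholder endpoint -- in practice this would point at the shared cluster.
es = Elasticsearch("http://localhost:9200")

# The _nodes/stats API with the "os" metric reports per-node CPU utilization.
stats = es.nodes.stats(metric="os")
for node_id, node in stats["nodes"].items():
    cpu_pct = node["os"]["cpu"]["percent"]
    if cpu_pct > 85:  # arbitrary alert threshold for this sketch
        print(f"possible contention on node {node['name']}: cpu={cpu_pct}%")
```

Catching a single hot node early, as in this check, addresses the "high CPU utilization on one of the cluster nodes" failure mode before consumer lag accumulates.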
