Harness incident

IaCM Pipeline Stuck In Prod4

Minor Resolved

Harness experienced a minor incident on May 14, 2026 affecting Infrastructure as Code Management (IaCM), lasting 2h 2m. The incident has been resolved; the full update timeline is below.

Started
May 14, 2026, 01:00 PM UTC
Resolved
May 14, 2026, 03:02 PM UTC
Duration
2h 2m
Detected by Pingoru
May 14, 2026, 01:00 PM UTC

Affected components

Infrastructure as Code Management (IaCM)

Update timeline

  1. investigating May 14, 2026, 02:54 PM UTC

    We are currently investigating this issue.

  2. investigating May 14, 2026, 02:56 PM UTC

    We are continuing to investigate this issue.

  3. investigating May 14, 2026, 02:57 PM UTC

    We are continuing to investigate this issue.

  4. identified May 14, 2026, 02:58 PM UTC

    The issue has been identified and a fix is being implemented.

  5. monitoring May 14, 2026, 02:58 PM UTC

    A fix has been implemented and we are monitoring the results.

  6. resolved May 14, 2026, 03:02 PM UTC

    This incident has been resolved.

  7. postmortem May 22, 2026, 04:32 PM UTC

    ## **Summary**

    On May 14, 2026, some customers running pipelines in the Prod4 production environment observed pipeline create and update requests that were slow or failed, and pipeline stages that did not start, produced no logs, and were eventually auto-aborted as “stuck”.

    The issue was caused by an underlying compute node in our Prod4 cluster being recycled abruptly. As a downstream side effect, internal worker threads in the pipeline service became blocked waiting on responses that would never arrive, which caused some pipeline create/update requests and stage starts to slow down or fail.

    ## **Impact**

    During the incident window (approximately 5:38 PM PDT on May 14 to 9:47 PM PDT):

    * Some pipeline create and update requests on Prod4 were slow or failed.
    * Behavior was intermittent: only pipelines whose requests were routed to an affected service pod were impacted; other pipelines continued to execute normally.

    There was **no data loss**. The majority of pipelines on Prod4 continued to execute successfully throughout the incident. The primary impact was that affected create/update requests slowed down or failed, and a subset of pipelines could not progress and had to be aborted and re-run after mitigation. Overall service availability was degraded during this window.

    ## **Root Cause**

    The pipeline service coordinates pipeline plan creation by sending requests to several internal supporting services. During the incident, an underlying compute node in our Prod4 cluster was recycled by the cloud provider without completing its normal graceful-drain process, so the supporting-service pods running on that node were terminated abruptly. As a result, in-flight requests from the pipeline service to those pods were left without a response.

    ## **Mitigation**

    Harness completed the following immediate mitigation steps:

    * Restarted the affected supporting-service pods to restore healthy targets.
    * Restarted the pipeline service in Prod4 to clear the blocked worker threads.
    * Confirmed pipeline executions returned to normal and updated the status page to mitigated.

    These actions restored pipeline execution behavior and resolved the customer-facing impact.

    ## **Action Items**

    To reduce the risk of recurrence and improve detection, the following actions are in various stages of being implemented:

    * Enhance the timeout configuration on the pipeline service’s plan-creation requests so that when a supporting service goes away unexpectedly, the worker threads recover automatically instead of remaining blocked.
    * Add per-target instrumentation on the pipeline service’s plan-creation request fanout (request count, latency, in-flight requests) so the affected supporting service can be identified as soon as possible during an incident.
    * Investigate the abrupt node-recycle behavior in Prod4 with our cloud provider to ensure pods running on a recycled node receive a graceful shutdown signal in the future.
    * Add proactive paging alerts on pipeline-service worker-thread saturation, so this failure mode is detected before it becomes customer-visible.
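
The first two action items (bounded plan-creation requests and per-target counters) can be illustrated with a minimal sketch. It is not Harness's implementation: the postmortem does not describe the pipeline service's language or client libraries, so this uses Python's `asyncio` purely for illustration. The names `call_supporting_service`, `bounded_call`, `create_plan`, `request_outcomes`, and the targets `service-a` and `service-b` are all hypothetical, and a terminated pod is modelled simply as a call that takes far longer than the timeout.

```python
import asyncio
from collections import Counter

# Hypothetical per-target outcome counters, keyed by (target, outcome).
request_outcomes = Counter()


async def call_supporting_service(target: str, delay_s: float) -> str:
    """Stand-in for one plan-creation request; delay_s simulates response time."""
    await asyncio.sleep(delay_s)
    return "ok"


async def bounded_call(target: str, delay_s: float, timeout_s: float) -> str:
    """Wait at most timeout_s for a target, then give up and record the outcome."""
    try:
        outcome = await asyncio.wait_for(
            call_supporting_service(target, delay_s), timeout=timeout_s
        )
    except asyncio.TimeoutError:
        # wait_for cancels the pending call, so nothing stays blocked on it.
        outcome = "timed out"
    request_outcomes[(target, outcome)] += 1
    return outcome


async def create_plan(targets: dict, timeout_s: float) -> dict:
    """Fan out to all targets concurrently and collect one outcome per target."""
    names = list(targets)
    outcomes = await asyncio.gather(
        *(bounded_call(name, targets[name], timeout_s) for name in names)
    )
    return dict(zip(names, outcomes))


def run_demo() -> dict:
    # "service-b" models a pod that was terminated and will never answer in time.
    targets = {"service-a": 0.01, "service-b": 3600.0}
    return asyncio.run(create_plan(targets, timeout_s=0.2))
```

Calling `run_demo()` returns after roughly the timeout rather than waiting on the unresponsive target, and `request_outcomes` then shows which target timed out. In a real service the timeout would more likely be set on the RPC or HTTP client itself, since a timeout that only abandons the wait without cancelling the underlying call can leave the worker thread blocked.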