Harness incident

Pipeline updates in Prod4 are taking time.

Harness experienced a minor incident on May 14, 2026 affecting Continuous Delivery - Next Generation (CDNG) and Platform, lasting 34m. The incident has been resolved; the full update timeline is below.

Started: May 14, 2026, 01:36 PM UTC
Resolved: May 14, 2026, 02:10 PM UTC
Duration: 34m
Detected by Pingoru: May 14, 2026, 01:36 PM UTC

Affected components

Continuous Delivery - Next Generation (CDNG)
Platform

Update timeline

investigating May 14, 2026, 01:36 PM UTC

We are currently investigating this issue.

investigating May 14, 2026, 01:39 PM UTC

We are continuing to investigate this issue.

investigating May 14, 2026, 01:40 PM UTC

We are continuing to investigate this issue.

identified May 14, 2026, 02:03 PM UTC

The issue has been identified and a fix is being implemented.

monitoring May 14, 2026, 02:08 PM UTC

A fix has been implemented and we are monitoring the results.

resolved May 14, 2026, 02:10 PM UTC

This incident has been resolved.

postmortem May 27, 2026, 11:47 PM UTC

On May 14, 2026, some customers running pipelines in the Prod4 production environment observed pipeline create and update requests that were slow or failed, and pipeline stages that did not start, produced no logs, and were eventually auto-aborted as “stuck”. The issue was caused by an underlying compute node in our Prod4 cluster being recycled abruptly.

## **Impact**

During the incident window (approximately 5:38 PM PDT on May 14 to 9:47 PM PDT on May 14, 2026):

* Some pipeline create and update requests on Prod4 were slow or failed.
* Some Prod4 pipeline executions hung at the stage-start step, producing no logs, and were eventually auto-aborted as “stuck” after a timeout.
* Behavior was intermittent: only pipelines whose requests were routed to an affected service pod were impacted; other pipelines continued to execute normally.

There was **no data loss**. The majority of pipelines on Prod4 continued to execute successfully throughout the incident. The primary impact was that affected create/update requests slowed down or failed, and a subset of pipelines could not progress and had to be aborted and re-run after mitigation. Overall service availability was degraded during this window.

## **Root Cause**

During the incident, an underlying compute node in our Prod4 cluster was recycled by the cloud provider without completing its normal graceful-drain process, so the supporting-service pods running on that node were terminated abruptly. As a result, in-flight requests from the backend service to those pods were left without a response, and the worker threads waiting on those responses remained blocked.

## **Mitigation**

Harness completed the following immediate mitigation steps:

* Restarted the affected supporting-service pods to restore healthy targets.
* Restarted the pipeline service in Prod4 to clear the blocked worker threads. This is what fully restored normal pipeline create/update and stage-start behavior; restarting only the supporting service was not enough on its own.
* Confirmed pipeline executions returned to normal and updated the status page to mitigated.

These actions restored pipeline execution behavior and resolved the customer-facing impact.

## **Action Items**

To reduce the risk of recurrence and improve detection, the following actions are in various stages of implementation:

* Tune timeouts on the pipeline service’s plan-creation requests so that when a supporting service goes away unexpectedly, the worker threads recover automatically instead of remaining blocked.
* Investigate the abrupt node-recycle behavior in Prod4 with our cloud provider to ensure pods running on a recycled node receive a graceful shutdown signal in the future.
* Add proactive paging alerts on service worker-thread saturation, so this failure mode is detected before it becomes a customer-impacting issue.
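
The first action item (bounded timeouts on plan-creation requests) can be illustrated with a minimal Python sketch. This is not Harness's implementation: the names `start_silent_server` and `request_plan`, the `create-plan` payload, and the raw-socket transport are all assumptions made for illustration. The silent server stands in for a pod that disappeared without closing its connections, and the client shows how a read timeout lets the calling thread return instead of waiting indefinitely.

```python
import socket


def start_silent_server():
    """Hypothetical stand-in for a pod that vanished mid-request.

    The socket listens, so the OS completes the TCP handshake, but nothing
    ever accepts the connection or sends a reply.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen()
    return server, server.getsockname()[1]


def request_plan(host, port, timeout_s):
    """Hypothetical plan-creation call with a bounded wait for the response.

    The timeout passed to create_connection also applies to later send and
    recv calls on the same socket, so a silent peer cannot block the caller
    for longer than timeout_s per operation.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(b"create-plan\n")  # illustrative payload only
            reply = conn.recv(1024)  # raises socket.timeout if no data arrives
            return ("ok", reply)
    except socket.timeout:
        # The worker is released here and can retry or fail the request.
        return ("timed_out", b"")
```

Calling `request_plan("127.0.0.1", port, timeout_s=0.2)` against the silent server returns a `timed_out` status after roughly the timeout instead of hanging. Without the `timeout` argument, the same `recv` call would block indefinitely, which matches the blocked-worker behavior described in the root cause.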