Harness incident

Pipelines are running slow in Prod3

Minor, Resolved

Harness experienced a minor incident on February 17, 2026, affecting Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise(CIE) - Mac Cloud Builds, and 8 more components, lasting 5h 32m. The incident has been resolved; the full update timeline is below.

Started
Feb 17, 2026, 05:27 PM UTC
Resolved
Feb 17, 2026, 10:59 PM UTC
Duration
5h 32m
Detected by Pingoru
Feb 17, 2026, 05:27 PM UTC

Affected components

* Continuous Delivery - Next Generation (CDNG)
* Continuous Integration Enterprise(CIE) - Mac Cloud Builds
* Continuous Integration Enterprise(CIE) - Windows Cloud Builds
* Continuous Integration Enterprise(CIE) - Linux Cloud Builds
* Security Testing Orchestration (STO)
* Service Reliability Management (SRM)
* Chaos Engineering
* Internal Developer Portal (IDP)
* Infrastructure as Code Management (IaCM)
* Software Supply Chain Assurance (SSCA)

Update timeline

  1. investigating Feb 17, 2026, 05:27 PM UTC

    We are currently investigating this issue.

  2. identified Feb 17, 2026, 05:28 PM UTC

    We are actively working to mitigate this issue.

  3. monitoring Feb 17, 2026, 06:11 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Feb 17, 2026, 10:59 PM UTC

    This incident has been resolved.

  5. postmortem Mar 02, 2026, 08:42 PM UTC

    **Summary**

    On February 17, 2026, a traffic spike in one of the services in Prod3 reduced the Pipeline Service’s capacity. We remediated this by addressing the source of the workload spike and tuning our backend systems.

    **Root Cause**

    Starting around 7:25 AM PST, our databases were overwhelmed by an increased rate of writes, which caused resource pressure. Write latency spiked, causing our upstream systems to experience timeouts and errors.

    **Customer Impact**

    During the impact window:

    * Pipeline executions ran significantly slower or stalled, with initialization steps delayed.
    * CRUD operations on pipelines were slow.

    **Resolution**

    We identified and disabled a high-frequency batch write workload that was contributing significantly to the write pressure. After we switched that component to a lower-write alternative flow, full system recovery was confirmed at ~10:05 AM PST.

    **Prevention and Improvements**

    To prevent recurrence and enable faster identification of such issues, we are taking several measures:

    * Automate the audit of resource-intensive queries and proactively optimize them, using better indexes or query scope limits to prevent working set overflow.
    * Fine-tune workloads to increase headroom for handling spikes.
    * Add proactive alerts for sustained traffic rates and resource utilization approaching the high watermark.
    * Add capacity to our backend systems.
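
The postmortem does not describe how the planned alerts will be built. As a rough illustration of the idea of alerting on a sustained rate approaching a high watermark, the minimal Python sketch below fires only when every sample in a full window is at or above a fraction of the watermark, so a single brief spike does not trigger it. The class name, the 80% fraction, the window size, and the sample readings are all hypothetical and are not taken from Harness's systems.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SustainedWatermarkAlert:
    """Fire only when a full window of samples stays near the high watermark."""

    high_watermark: float        # hypothetical capacity limit, e.g. writes/sec
    warn_fraction: float = 0.8   # assumed: warn at 80% of the watermark
    window: int = 5              # assumed: consecutive samples required
    samples: deque = field(default_factory=deque)

    def observe(self, value: float) -> bool:
        """Record one sample and report whether the alert should fire."""
        self.samples.append(value)
        # Keep only the most recent `window` samples.
        if len(self.samples) > self.window:
            self.samples.popleft()
        threshold = self.high_watermark * self.warn_fraction
        # Require a full window so the first few samples cannot fire alone.
        return len(self.samples) == self.window and all(
            s >= threshold for s in self.samples
        )


# Illustrative readings (made up): one dip at the start, then a sustained run.
alert = SustainedWatermarkAlert(high_watermark=1000.0, warn_fraction=0.8, window=3)
readings = [500, 900, 850, 820, 400, 950]
fired = [alert.observe(r) for r in readings]
print(fired)
```

With these made-up readings the alert fires only on the fourth sample, after three consecutive readings at or above 800. A production alert would more likely live in the monitoring system's own rule language; the sketch only shows the "sustained, not instantaneous" condition.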