Harness incident

Pipelines are running slow in Prod3

Minor · Resolved
Started
Feb 17, 2026, 05:27 PM UTC
Resolved
Feb 17, 2026, 10:59 PM UTC
Duration
5h 32m
Detected by Pingoru
Feb 17, 2026, 05:27 PM UTC

Affected components

  • Continuous Delivery - Next Generation (CDNG)
  • Continuous Integration Enterprise (CIE) - Mac Cloud Builds
  • Continuous Integration Enterprise (CIE) - Windows Cloud Builds
  • Continuous Integration Enterprise (CIE) - Linux Cloud Builds
  • Security Testing Orchestration (STO)
  • Service Reliability Management (SRM)
  • Chaos Engineering
  • Internal Developer Portal (IDP)
  • Infrastructure as Code Management (IaCM)
  • Software Supply Chain Assurance (SSCA)

Update timeline

  1. Investigating Feb 17, 2026, 05:27 PM UTC

    We are currently investigating this issue.

  2. Identified Feb 17, 2026, 05:28 PM UTC

    We are actively working to mitigate this issue.

  3. Monitoring Feb 17, 2026, 06:11 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. Resolved Feb 17, 2026, 10:59 PM UTC

    This incident has been resolved.

  5. Postmortem Mar 02, 2026, 08:42 PM UTC

    **Summary**

    On February 17, 2026, a traffic spike in one of the services in Prod3 impacted the Pipeline Service's capacity. We remediated this by addressing the source of the spike in workload and tuning our backend systems.

    **Root Cause**

    Starting around 7:25 AM PST, our databases became overwhelmed by an increased rate of writes, causing resource pressure. Write latency spiked, causing our upstream systems to experience timeouts and errors.

    **Customer Impact**

    During the impact window:

    * Pipeline executions ran significantly slower or stalled, with initialization steps delayed.
    * CRUD operations on pipelines were slow.

    **Resolution**

    We identified and disabled a high-frequency batch write workload that was contributing significantly to the write pressure. After switching that component to a lower-write alternative flow, full system recovery was confirmed at ~10:05 AM PST.

    **Prevention and Improvements**

    To prevent recurrence and enable faster identification of such issues, we are taking several measures:

    * Automate the audit of resource-intensive queries and proactively optimize them with better indexes or query scope limits to prevent working-set overflow.
    * Fine-tune workloads to increase headroom for handling spikes.
    * Add proactive alerts for sustained traffic rates and resource utilization approaching the high watermark (sketched below).
    * Add capacity to our backend systems.
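
The last prevention item is easy to make concrete. Below is a minimal sketch of a sustained-rate alert, not Harness's implementation: the watermark value, window length, and the `sample_write_rate` / `notify` hooks are all assumptions for illustration. The key design point is that every sample in the window must breach the threshold, so brief spikes do not page anyone.

```python
# Minimal sketch of a "sustained rate" alert, assuming we can sample a
# database's write throughput and know its high watermark. All names and
# thresholds here are hypothetical, not Harness internals.
import time
from collections import deque

HIGH_WATERMARK = 50_000   # writes/sec the backend can sustain (assumed)
ALERT_FRACTION = 0.8      # alert when sustained rate nears the watermark
WINDOW_SECONDS = 300      # rate must stay high for 5 minutes straight
SAMPLE_INTERVAL = 15      # seconds between samples

def sustained_breach(samples: deque) -> bool:
    """True only if the window is full and every sample exceeds the
    threshold, so short spikes do not trigger the alert."""
    threshold = HIGH_WATERMARK * ALERT_FRACTION
    return len(samples) == samples.maxlen and all(s > threshold for s in samples)

def monitor(sample_write_rate, notify):
    """Poll the write rate and alert on a sustained breach."""
    window = deque(maxlen=WINDOW_SECONDS // SAMPLE_INTERVAL)
    while True:
        window.append(sample_write_rate())  # e.g. read from a metrics backend
        if sustained_breach(window):
            notify(f"write rate above {ALERT_FRACTION:.0%} of watermark "
                   f"for {WINDOW_SECONDS}s")
            window.clear()                  # avoid re-paging on every sample
        time.sleep(SAMPLE_INTERVAL)
```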

Looking to track Harness downtime and outages?

Pingoru polls Harness's status page every 5 minutes and alerts you the moment it reports an issue — before your customers do.

  • Real-time alerts when Harness reports an incident
  • Email, Slack, Discord, Microsoft Teams, and webhook notifications
  • Track Harness alongside 5,000+ providers in one dashboard
  • Component-level filtering
  • Notification groups + maintenance calendar
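
The polling model described above is straightforward. The sketch below illustrates it under the assumption that the vendor exposes a Statuspage-style JSON endpoint; the URL, field names, and `notify` hook are assumptions for illustration, not Pingoru's actual code.

```python
# Minimal sketch of polling a vendor status page every 5 minutes and
# alerting on a status change. Endpoint shape is assumed to follow the
# common Statuspage "api/v2/status.json" convention.
import json
import time
import urllib.request

STATUS_URL = "https://status.harness.io/api/v2/status.json"  # assumed endpoint
POLL_INTERVAL = 300  # seconds, i.e. every 5 minutes

def notify(message: str) -> None:
    print(message)  # stand-in for email/Slack/Discord/Teams/webhook delivery

def poll_once(last_indicator: str) -> str:
    with urllib.request.urlopen(STATUS_URL, timeout=10) as resp:
        indicator = json.load(resp)["status"]["indicator"]  # "none", "minor", ...
    if indicator != last_indicator and indicator != "none":
        notify(f"Harness status changed: {indicator}")
    return indicator

if __name__ == "__main__":
    last = "none"
    while True:
        last = poll_once(last)
        time.sleep(POLL_INTERVAL)
```
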
Start monitoring Harness for free

5 free monitors · No credit card required