Harness incident
Degraded performance in CI Steps in Prod 2 and Prod 3
Affected components
Update timeline
- investigating Mar 13, 2026, 04:12 PM UTC
We are noticing degraded performance in CI Steps in the Prod 2 and Prod 3 environments. The issue is intermittent. We are investigating the cause.
- identified Mar 13, 2026, 04:16 PM UTC
The issue has been identified and a fix is being implemented.
- monitoring Mar 13, 2026, 04:27 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Mar 13, 2026, 05:34 PM UTC
This incident has been resolved.
- postmortem Mar 17, 2026, 11:54 PM UTC
### **Summary**

On March 13, 2026, customers running CI pipelines in the Prod2 and Prod3 environments experienced **slower-than-normal CI step execution times**. Investigation showed that CI steps were delayed due to a backlog in internal response processing within the CI Manager. While individual plugin steps executed normally, their completion notifications were delayed, causing pipeline stages to appear significantly slower. The issue began around **2:00 AM PT** and affected some customers until mitigation actions were applied.

### **Root Cause**

The delay was caused by a **backlog in the CI response processing pipeline**. A combination of factors contributed to the backlog:

* A brief latency spike affecting internal services, including a DB query executed by the pipeline service.
* Increased response processing load that caused iterator workers to stall while waiting on shared resources.

These conditions caused the CI Manager to accumulate pending responses, which delayed the reporting of step completion even though the underlying plugin execution completed quickly.

### **Impact**

Customers experienced **significantly increased CI pipeline step durations**, even though the actual execution time of the steps was minimal. Impact included:

* CI pipeline stages appearing to take **longer than expected**
* Slower overall pipeline execution times for some customers in **Prod2 and Prod3**
* No data loss or failed builds

Pipeline performance returned to normal after mitigation actions were applied.

### **Mitigation**

**Immediate**

* Restarted CI Manager components in affected environments to clear the response processing backlog.
* Verified CI pipeline execution times returned to baseline levels.

**Permanent**

* Implemented new monitoring and alerting on CI iterator response processing latency.
* Introduced proactive detection thresholds to identify abnormal processing delays earlier.
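The permanent mitigation above centers on detection thresholds for response-processing latency. As a minimal sketch of that idea (all names and the threshold value are hypothetical, not Harness's actual implementation), a rolling-window p95 check could flag backlog formation like this:

```python
from collections import deque


class BacklogDetector:
    """Flag a response-processing backlog when the rolling p95 latency
    exceeds a threshold. Hypothetical sketch, not the CI Manager's code."""

    def __init__(self, threshold_ms: float, window: int = 100):
        self.threshold_ms = threshold_ms
        self.samples = deque(maxlen=window)  # most recent latency samples

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def backlog_suspected(self) -> bool:
        if len(self.samples) < 10:  # not enough data to judge yet
            return False
        ordered = sorted(self.samples)
        p95 = ordered[int(0.95 * (len(ordered) - 1))]
        return p95 > self.threshold_ms


detector = BacklogDetector(threshold_ms=500)
for latency in [20, 25, 30, 22, 18, 900, 950, 980, 1000, 990, 940]:
    detector.record(latency)
print(detector.backlog_suspected())  # prints True
```

A sliding window keeps the alert sensitive to sustained delays while ignoring a single slow sample, which matches the goal of catching backlog formation earlier without paging on every transient blip.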
### **Action Items**

To prevent such issues from happening again, we are taking the following steps:

* Improve monitoring for CI response processing latency to detect backlog formation sooner.
* Investigate and optimize query behavior associated with pipeline-service reads.
* Review CI Manager response processing design to reduce sensitivity to latency spikes.
* Add safeguards to prevent iterator workers from entering non-recovering states during transient spikes.
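The last action item, keeping iterator workers out of non-recovering states, can be illustrated with a bounded wait on a shared resource. This is a hypothetical sketch (the function, lock, and timeout are assumptions, not the actual CI Manager design): instead of blocking indefinitely, the worker gives up after a timeout so a transient spike cannot stall it permanently.

```python
import threading

# Stand-in for a resource that iterator workers contend on.
shared_resource = threading.Lock()


def process_response(response: str, timeout_s: float = 2.0) -> bool:
    """Process one step-completion response with a bounded wait.

    Returns False (so the caller can requeue the response) rather than
    blocking forever when the shared resource stays unavailable.
    """
    acquired = shared_resource.acquire(timeout=timeout_s)
    if not acquired:
        # Timed out: back off and let the caller retry later.
        return False
    try:
        # ... handle the step-completion response here ...
        return True
    finally:
        shared_resource.release()


print(process_response("step-42-complete"))  # lock is free, prints True
```

Returning a retryable failure on timeout trades a little extra latency for liveness: workers keep draining the queue during a spike instead of piling up behind one stalled acquisition.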