Harness incident

Intermittent slowness while running pipelines

Harness experienced a notice incident on April 27, 2026. The incident has been resolved; the full update timeline is below.

Started: Apr 27, 2026, 10:27 PM UTC
Resolved: Apr 27, 2026, 08:00 PM UTC
Duration: —
Detected by Pingoru: Apr 27, 2026, 10:27 PM UTC

Update timeline

resolved Apr 27, 2026, 10:27 PM UTC

We were seeing slowness while executing pipelines.

postmortem Apr 29, 2026, 07:53 PM UTC

## **Summary**

On April 27, 2026, customers running pipelines in the Prod3 environment experienced intermittent slowness in pipeline execution and delays in execution status updates in the UI. The cause was an unexpected spike in load that created contention on a backend database supporting pipeline orchestration. The issue was mitigated and fully resolved.

## **Impact**

**Incident window:** April 27, 2026, 1:00 PM – 3:12 PM PDT

* Pipeline executions ran slower than normal; some executions took longer than expected to complete. Pipelines with stricter timeouts may have failed as a result.
* No widespread pipeline failures were observed.
* The execution view in the UI lagged behind real-time pipeline progress.

There was no data loss. The majority of pipelines continued to execute successfully, with the primary impact being increased latency and delayed UI updates.

## **Root Cause**

Pipeline orchestration relies on a backend database to track execution state and power the execution view in the UI. During the incident, a spike in load on this database increased query latency across the orchestration layer. The slower queries created a backlog of execution-view updates, so the UI lagged behind actual pipeline execution until the database was scaled up.

## **Remediation**

**Immediate Mitigation**

* Scaled up the affected database instance to increase CPU capacity
* Reduced query latency and eliminated lock contention
* Cleared the execution-view update backlog within ~30 minutes

These actions restored normal pipeline performance and UI responsiveness.

## **Action Items**

To prevent similar issues from recurring:

* **Capacity Improvements:** Updated the Prod3 capacity baseline to prevent similar resource constraints
* **Proactive Detection:** Enhancing monitoring and alerting for backend resource utilization, lock contention, and critical query latency
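
As an illustration of the query-latency alerting mentioned under Proactive Detection, the following is a minimal sketch of a sliding-window p95 check. It is not Harness's monitoring implementation: the class name `LatencyMonitor`, the window size, and the 250 ms threshold are all hypothetical values chosen for the example.

```python
from collections import deque
from statistics import quantiles


class LatencyMonitor:
    """Tracks recent query latencies and flags a high 95th percentile.

    Hypothetical sketch: the window size and threshold are assumed
    placeholders, not values taken from the postmortem.
    """

    def __init__(self, window_size=100, p95_threshold_ms=250.0):
        # Keep only the most recent `window_size` samples.
        self.samples = deque(maxlen=window_size)
        self.p95_threshold_ms = p95_threshold_ms

    def record(self, latency_ms):
        """Add one observed query latency, in milliseconds."""
        self.samples.append(latency_ms)

    def p95(self):
        """Return the 95th percentile of the window, or None if too few samples."""
        # statistics.quantiles() requires at least two data points.
        if len(self.samples) < 2:
            return None
        # n=20 yields 19 cut points; the last one is the 95th percentile.
        return quantiles(self.samples, n=20)[18]

    def should_alert(self):
        """True when the windowed p95 exceeds the configured threshold."""
        value = self.p95()
        return value is not None and value > self.p95_threshold_ms
```

Alerting on a windowed percentile rather than on single slow queries is one way to catch sustained contention of the kind described in the root cause while ignoring isolated outliers. In practice such a check would usually live in the monitoring system rather than in application code.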