Harness incident

Intermittent STO step failure with 500 error

Harness experienced a minor incident on March 25, 2026, affecting Security Testing Orchestration (STO) and lasting 2h 30m. The incident has been resolved; the full update timeline is below.

Started: Mar 25, 2026, 10:00 AM UTC
Resolved: Mar 25, 2026, 12:30 PM UTC
Duration: 2h 30m
Detected by Pingoru: Mar 25, 2026, 10:00 AM UTC

Affected components

Security Testing Orchestration (STO)

Update timeline

investigating Mar 25, 2026, 03:53 PM UTC

We are currently investigating this issue.

resolved Mar 25, 2026, 03:53 PM UTC

This incident has been resolved.

postmortem Mar 26, 2026, 07:53 PM UTC

## **Summary**

On March 25, 2026, between approximately **3:30 PM and 6:00 PM IST**, the STO service in the **Prod1 environment** experienced **intermittent failures** while processing scan uploads. This resulted in **step failures for some pipeline executions** during the incident window.

## **Root Cause**

During a scheduled internal data backfill activity, the STO service experienced **increased database load**. Concurrently, a recent change in the scan upload processing path introduced additional latency under these conditions.

The combination of elevated load and increased query execution time caused some scan upload requests to exceed processing thresholds and fail. Retry attempts further amplified system load, leading to intermittent failures.

## **Impact**

* Intermittent **scan upload failures (500 errors)** during pipeline execution
* Some pipelines experienced **step failures or delays due to retries**
* No impact to previously uploaded scan results or other STO functionality

## **Mitigation/Remediation**

### **Immediate**

* Stopped the internal backfill activity to reduce database load
* Optimized the scan upload processing query

### **Permanent**

* Introduced safeguards for background jobs to prevent impact on production workloads
* Improved performance of critical database paths
* Enhanced monitoring to detect abnormal load and retry amplification earlier

## **Action Items**

To prevent such issues from happening again:

* Implement throttling and isolation for background/backfill jobs
* Add protections for critical request paths under load
* Improve alerting on database latency and retry patterns
* Strengthen validation for production-like load conditions
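
The retry amplification described in the root cause is commonly limited by bounding the number of attempts and spacing them with capped exponential backoff plus random jitter. The sketch below illustrates that general technique only; it is not taken from the STO codebase. `TransientUploadError`, `backoff_delay`, `call_with_retries`, and every default value are hypothetical names and numbers chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientUploadError(Exception):
    """Hypothetical error standing in for a retryable failure such as an HTTP 500."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Return a randomized delay in seconds for a zero-based retry attempt."""
    # Exponential growth, capped so late attempts do not wait unboundedly.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick anywhere in [0, ceiling] so clients do not retry in lockstep.
    return random.uniform(0.0, ceiling)


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying transient failures a bounded number of times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientUploadError:
            # Re-raise on the final attempt instead of retrying forever.
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
    raise RuntimeError("unreachable")
```

A caller would wrap the upload request in a zero-argument function and pass it as `operation`. The `sleep` parameter is injectable so the behavior can be exercised in tests without real waiting.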