Harness incident

Intermittent STO step failure with 500 error

Minor Resolved
Started
Mar 25, 2026, 10:00 AM UTC
Resolved
Mar 25, 2026, 12:30 PM UTC
Duration
2h 30m
Detected by Pingoru
Mar 25, 2026, 10:00 AM UTC

Affected components

Security Testing Orchestration (STO)

Update timeline

  1. investigating Mar 25, 2026, 03:53 PM UTC

    We are currently investigating this issue.

  2. resolved Mar 25, 2026, 03:53 PM UTC

    This incident has been resolved.

  3. postmortem Mar 26, 2026, 07:53 PM UTC

    ## **Summary**

    On March 25, 2026, between approximately **3:30 PM and 6:00 PM IST**, the STO service in the **Prod1 environment** experienced **intermittent failures** while processing scan uploads. This resulted in **step failures for some pipeline executions** during the incident window.

    ## **Root Cause**

    During a scheduled internal data backfill activity, the STO service experienced **increased database load**. Concurrently, a recent change in the scan upload processing path introduced additional latency under these conditions.

    The combination of elevated load and increased query execution time caused some scan upload requests to exceed processing thresholds and fail. Retry attempts further amplified system load, leading to intermittent failures.

    ## **Impact**

    * Intermittent **scan upload failures (500 errors)** during pipeline execution
    * Some pipelines experienced **step failures or delays due to retries**
    * No impact to previously uploaded scan results or other STO functionality

    ## **Mitigation/Remediation**

    ### **Immediate**

    * Stopped the internal backfill activity to reduce database load
    * Optimized the scan upload processing query

    ### **Permanent**

    * Introduced safeguards for background jobs to prevent impact on production workloads
    * Improved performance of critical database paths
    * Enhanced monitoring to detect abnormal load and retry amplification earlier

    ## **Action Items**

    To prevent such issues from happening again:

    * Implement throttling and isolation for background/backfill jobs
    * Add protections for critical request paths under load
    * Improve alerting on database latency and retry patterns
    * Strengthen validation for production-like load conditions
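
The root cause notes that retry attempts amplified load on an already slow database. The postmortem does not describe the actual retry policy used by STO or its clients, so the following is only a generic sketch of one common way to damp retry amplification: capped exponential backoff with full jitter and a bounded number of attempts. The names `TransientUploadError`, `backoff_delays`, and `upload_with_retry`, and all default values, are hypothetical and chosen for illustration.

```python
import random
import time


class TransientUploadError(Exception):
    """Hypothetical error type standing in for a retryable failure (e.g. HTTP 500)."""


def backoff_delays(max_attempts=5, base=0.5, cap=30.0, rng=None):
    """Yield one randomized delay (seconds) for each retry after the first attempt.

    Each delay is drawn uniformly from [0, min(cap, base * 2**k)], so
    simultaneous failures do not retry in lockstep.
    """
    rng = rng or random.Random()
    for k in range(max_attempts - 1):
        ceiling = min(cap, base * (2 ** k))
        yield rng.uniform(0.0, ceiling)


def upload_with_retry(upload, max_attempts=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=None):
    """Call upload() until it succeeds or max_attempts is reached.

    Only TransientUploadError is retried; the last failure is re-raised
    so the caller sees a bounded number of attempts, not an endless loop.
    """
    delays = backoff_delays(max_attempts, base, cap, rng)
    for attempt in range(1, max_attempts + 1):
        try:
            return upload()
        except TransientUploadError:
            if attempt == max_attempts:
                raise
            sleep(next(delays))
```

Bounding the attempt count limits how much extra traffic a single failing request can generate, and the jitter spreads retries over time instead of concentrating them in bursts. Whether this matches the safeguards Harness introduced is not stated in the postmortem.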
