Harness incident
Hosted CI build VM environment is seeing higher network latency
Update timeline
- Investigating: Feb 19, 2026, 05:36 PM UTC
We are currently investigating this issue.
- Identified: Feb 19, 2026, 05:39 PM UTC
The issue has been identified and a fix is being implemented.
- Monitoring: Feb 19, 2026, 06:11 PM UTC
A fix has been implemented and we are monitoring the results.
- Resolved: Feb 19, 2026, 06:20 PM UTC
This incident has been resolved.
- Postmortem: Mar 02, 2026, 05:32 PM UTC
## **Summary**

On February 19, 2026, a partial degradation occurred in the CI infrastructure in the **us-west1** region due to issues affecting the NAT control plane. During a brief window (~30 minutes), a limited number of CI build jobs failed during VM initialization. The issue was detected through internal monitoring and mitigated via controlled failover, followed by restoration of the affected NAT instances.

## **Root Cause**

The incident was caused by saturation of connection-tracking (iptables/conntrack) state on NAT virtual machines in the us-west1 region. A short-lived spike in build VM activity produced a burst of metadata-related connections, and stale connection entries accumulated without automated cleanup, eventually preventing the NAT VMs from reaching the cloud metadata service. This loss of metadata connectivity impaired control-plane functionality (including VM provisioning), which caused a limited number of build initialization failures.

## **Impact**

* **Region impacted:** us-west1
* **Customer impact:**
  * Limited CI job failures during VM provisioning
  * Two customers experienced isolated build failures
  * No impact to running workloads
* **Data loss:** None
* **Duration:** Approximately 30 minutes

Customers were advised to retry failed builds after mitigation.

## **Mitigation**

* Traffic was automatically failed over to another NAT to maintain egress functionality.
* Affected NAT VMs were restarted to clear saturated connection state.
* Metadata connectivity, SSH access, monitoring, and health checks were verified.
* Traffic was gradually restored to the affected NAT after stability was confirmed.
* Cloud NAT IP utilization was monitored during failover to prevent capacity exhaustion.

### **Action Items and Permanent Preventive Measures**

To prevent recurrence, we will:

* Implement automated iptables/conntrack cleanup of metadata-related connection-tracking state (a minimal sketch follows this postmortem).
* Add proactive health checks and alerts for metadata reachability and connectivity health.
* Strengthen monitoring for NAT VM control-plane health degradation.
* Enhance fallback guardrails and capacity validation for Cloud NAT.
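To make the first two action items concrete, here is a minimal sketch of the kind of cleanup-and-probe job a NAT VM could run. It is illustrative only: the metadata IP 169.254.169.254, the `conntrack` CLI from conntrack-tools, the `/proc/sys/net/netfilter` counters, and the 80% flush threshold are assumptions for the sketch, not details from the incident report.

```python
#!/usr/bin/env python3
"""Illustrative conntrack hygiene job for a NAT VM (assumptions noted inline)."""
import subprocess
import urllib.request

METADATA_IP = "169.254.169.254"   # assumed GCP metadata endpoint
FLUSH_THRESHOLD = 0.80            # arbitrary: flush when the table is 80% full


def conntrack_utilization() -> float:
    """Return conntrack table usage as a fraction of its configured maximum."""
    with open("/proc/sys/net/netfilter/nf_conntrack_count") as f:
        count = int(f.read())
    with open("/proc/sys/net/netfilter/nf_conntrack_max") as f:
        maximum = int(f.read())
    return count / maximum


def metadata_reachable(timeout: float = 2.0) -> bool:
    """Probe the metadata service the way a GCP client would."""
    req = urllib.request.Request(
        f"http://{METADATA_IP}/computeMetadata/v1/instance/id",
        headers={"Metadata-Flavor": "Google"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def flush_metadata_conntrack() -> None:
    """Delete conntrack entries destined for the metadata service."""
    # conntrack exits non-zero when no entries matched, so don't raise on that.
    subprocess.run(["conntrack", "-D", "-d", METADATA_IP], check=False)


if __name__ == "__main__":
    if conntrack_utilization() > FLUSH_THRESHOLD or not metadata_reachable():
        flush_metadata_conntrack()
```

In practice a job like this would run on a timer (cron or a systemd timer) on each NAT VM, with the utilization figure also exported as a metric so that the new reachability alerts can fire before the table saturates rather than after.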