Harness Outage History

Harness is up right now

There have been 53 Harness outages since February 3, 2026, totaling 107h 48m of downtime. Each incident is summarized below with its details, duration, and resolution information.

Source: https://status.harness.io

Minor March 24, 2026

Feature Flag SDK authentication operations are running slow in Prod2

Detected by Pingoru
Mar 24, 2026, 06:14 PM UTC
Resolved
Mar 24, 2026, 08:37 PM UTC
Duration
2h 23m
Affected: Feature Flags (FF)
Timeline · 6 updates
  1. investigating Mar 24, 2026, 06:14 PM UTC

    We are currently investigating this issue.

  2. identified Mar 24, 2026, 06:25 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Mar 24, 2026, 07:03 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. monitoring Mar 24, 2026, 07:03 PM UTC

    We are continuing to monitor for any further issues.

  5. resolved Mar 24, 2026, 08:37 PM UTC

    This incident has been resolved.

  6. postmortem Mar 31, 2026, 03:31 AM UTC

    ## Summary

    Between **05:48 PM UTC on March 18** and **04:58 PM UTC on March 19**, two customers using [app.split.io](http://app.split.io/) with **SAML authentication using SCIM** experienced login failures. The issue was limited to customers with specific identity provisioning configurations.

    ## Root Cause

    A service update was deployed to a login-related component that changed internal request handling behavior. This caused authentication workflows for SAML-configured customers with identity provisioning enabled to fail during login processing.

    ## Impact

    * 2 customers using SAML authentication **with SCIM provisioning enabled** were unable to log in to [app.split.io](http://app.split.io/).
    * Customers using SAML authentication without SCIM were not affected and could log in normally.
    * Username and password login was unaffected.
    * SDKs, feature flags, and experiments continued functioning normally.
    * No data loss occurred.

    ## Remediation

    * Rolled back the service update that introduced the issue.
    * Restored SAML login for affected customers.

    ## Action Items

    To prevent such issues from happening again, we have completed or are working on the following:

    * **Improve detection capability**: Introduced a high-priority alert to detect authentication failures sooner, enabling faster response if similar issues occur in the future.
    * **Reduce legacy surface area**: Continue guiding remaining customers toward the Harness Platform, where authentication flows are unified and improvements can be delivered more consistently over time.

Read the full incident report →

Minor March 19, 2026

Degraded Performance for SCIM users during login.

Detected by Pingoru
Mar 19, 2026, 04:14 PM UTC
Resolved
Mar 19, 2026, 04:59 PM UTC
Duration
45m
Affected: Management Console
Timeline · 6 updates
  1. investigating Mar 19, 2026, 04:14 PM UTC

    The issue has been identified and a fix is in place.

  2. monitoring Mar 19, 2026, 04:42 PM UTC

    The issue has been identified and a fix is in place. (Note: It is still in a degraded state)

  3. monitoring Mar 19, 2026, 04:42 PM UTC

    The issue has been identified and a fix is in place.

  4. monitoring Mar 19, 2026, 04:59 PM UTC

    We are continuing to monitor for any further issues.

  5. resolved Mar 19, 2026, 04:59 PM UTC

    This incident has been resolved.

  6. postmortem Apr 02, 2026, 12:45 AM UTC

    ## Summary

    Between **05:48 PM UTC on March 18** and **04:58 PM UTC on March 19**, two customers using [app.split.io](http://app.split.io/) with **SAML authentication using SCIM** experienced login failures. The issue was limited to customers with specific identity provisioning configurations.

    ## Root Cause

    A service update was deployed to a login-related component that changed internal request handling behavior. This caused authentication workflows for SAML-configured customers with identity provisioning enabled to fail during login processing.

    ## Impact

    * 2 customers using SAML authentication **with SCIM provisioning enabled** were unable to log in to [app.split.io](http://app.split.io/).
    * Customers using SAML authentication without SCIM were not affected and could log in normally.
    * Username and password login was unaffected.
    * SDKs, feature flags, and experiments continued functioning normally.
    * No data loss occurred.

    ## Remediation

    * Rolled back the service update that introduced the issue.
    * Restored SAML login for affected customers.

    ## Action Items

    To prevent such issues from happening again, here are the action items:

    * **Improve detection capability**: Introduced a high-priority alert to detect authentication failures sooner, enabling faster response if similar issues occur in the future.
    * **Reduce legacy surface area**: Continue guiding remaining customers toward the Harness Platform, where authentication flows are unified and improvements can be delivered more consistently over time.
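The first action item above is a high-priority alert on authentication failures. As a rough illustration of that kind of detection (not Harness's or Split's actual implementation), a sliding-window failure-rate check might look like the sketch below; the window size, threshold, minimum sample count, and the `notify_oncall` hook are all assumptions.

```python
import time
from collections import deque

# Illustrative sliding-window alert for SAML/SCIM login failures.
# WINDOW_SECONDS, FAILURE_RATE_THRESHOLD, MIN_SAMPLES and notify_oncall()
# are assumptions, not values taken from the incident report.
WINDOW_SECONDS = 300
FAILURE_RATE_THRESHOLD = 0.2
MIN_SAMPLES = 20

_events = deque()  # (timestamp, succeeded) tuples

def record_login(succeeded, now=None):
    now = time.time() if now is None else now
    _events.append((now, succeeded))
    # Drop samples that have fallen out of the window.
    while _events and _events[0][0] < now - WINDOW_SECONDS:
        _events.popleft()

def failure_rate():
    if not _events:
        return 0.0
    failures = sum(1 for _, ok in _events if not ok)
    return failures / len(_events)

def check_and_alert(notify_oncall):
    # Require a minimum number of samples so a single failed login
    # does not page anyone.
    if len(_events) >= MIN_SAMPLES and failure_rate() >= FAILURE_RATE_THRESHOLD:
        notify_oncall(
            f"SAML login failure rate {failure_rate():.0%} over last {WINDOW_SECONDS}s"
        )
        return True
    return False
```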

Read the full incident report →

Minor March 17, 2026

Test Intelligence service is impacted in Prod1 / Prod3

Detected by Pingoru
Mar 17, 2026, 06:04 PM UTC
Resolved
Mar 17, 2026, 06:39 PM UTC
Duration
34m
Affected: Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds
Timeline · 5 updates
  1. investigating Mar 17, 2026, 06:04 PM UTC

    We are currently investigating this issue.

  2. identified Mar 17, 2026, 06:07 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Mar 17, 2026, 06:10 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Mar 17, 2026, 06:39 PM UTC

    This incident has been resolved.

  5. postmortem Mar 24, 2026, 07:17 PM UTC

    ## **Summary**

    On March 17, between approximately **3:04 PM and 4:05 PM PST**, customers experienced **degraded performance and intermittent unavailability** for Test Intelligence (TI) related APIs. This impacted test result uploads and report access.

    ## **Root Cause**

    A recent change introduced additional background processing during test result ingestion, which significantly increased database load. This led to resource saturation and caused elevated latency and temporary service disruption.

    ## **Impact**

    * **Degraded performance** for test uploads and report retrieval
    * **Intermittent API unavailability (~7 minutes)**
    * Affected customers experienced slower response times across TI-related workflows

    ## **Remediation**

    ### **Immediate**

    * Increased database capacity to stabilize performance
    * Applied configuration changes to reduce system load

    ### **Permanent**

    * Optimized processing logic to reduce database overhead
    * Introduced safeguards to prevent similar load amplification scenarios

    ## **Action Items**

    * Improve performance validation using production-scale datasets
    * Enhance safeguards for new feature rollouts
    * Strengthen monitoring to detect abnormal load patterns earlier
    * Add controls to limit impact scope of similar changes

Read the full incident report →

Minor March 17, 2026

FME SDK is experiencing elevated error rates for Impressions and events

Detected by Pingoru
Mar 17, 2026, 05:05 PM UTC
Resolved
Mar 17, 2026, 07:33 PM UTC
Duration
2h 28m
Affected: FME
Timeline · 4 updates
  1. investigating Mar 17, 2026, 04:36 PM UTC

    The issue started around ~6:45AM PT and the team is currently investigating

  2. monitoring Mar 17, 2026, 05:05 PM UTC

    We are now monitoring the results.

  3. resolved Mar 17, 2026, 11:38 PM UTC

    This incident has been resolved.

  4. postmortem Mar 18, 2026, 10:57 PM UTC

    ## **Summary**

    On March 17, 2026, FME events and impressions ingestion experienced significant degradation, resulting in elevated latency and error rates. The impact was traced to degraded performance in the underlying shared infrastructure used for event processing.

    ## **Root Cause**

    We had an unexpected surge in traffic which caused stress on our systems.

    ## **Impact**

    SDKs sending impressions and events would experience elevated error logging and continue to retry, with differing policies depending on the particular SDK and its retry policy, which are designed and tailored to each runtime environment to avoid any application impact. In some scenarios, events and impressions may be lost if they are not successfully delivered according to the SDK's specific retry policy. There was no impact to our control plane services, and feature flag delivery and evaluations continued to work without any disruption.

    ## **Mitigation**

    To mitigate, we immediately increased capacity to handle the bursty traffic.

    ## **Action Items**

    To prevent such issues from happening again, we are working on the following items:

    1. Evaluate and enforce per-customer rate limits.
    2. Improve the auto-scaling and on-demand network infrastructure scale-up.
    3. Improve resiliency of the ingestion layer.
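The impact section notes that SDKs retry impression and event delivery according to per-SDK policies and may eventually drop data once the retry budget is exhausted. A minimal sketch of that pattern, assuming a hypothetical `post_events` transport callable and illustrative retry limits (none of this is the actual FME SDK code), is:

```python
import random
import time

def flush_events(post_events, payload, max_attempts=4, base_delay=1.0):
    """Try to deliver a batch of events; give up after max_attempts.

    post_events is a hypothetical transport callable returning True on
    success. The attempt count and the jittered exponential backoff are
    illustrative; real SDKs tune these per runtime environment.
    """
    for attempt in range(max_attempts):
        if post_events(payload):
            return True
        # Exponential backoff with jitter so retries from many SDK instances
        # do not synchronize into a second traffic spike against ingestion.
        delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
        time.sleep(delay)
    # As the postmortem notes, data can be lost once retries are exhausted.
    return False
```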

Read the full incident report →

Minor March 13, 2026

Degraded performance in CI Steps in Prod 2 and Prod 3

Detected by Pingoru
Mar 13, 2026, 04:12 PM UTC
Resolved
Mar 13, 2026, 05:34 PM UTC
Duration
1h 22m
Affected: Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds
Timeline · 5 updates
  1. investigating Mar 13, 2026, 04:12 PM UTC

    We are noticing degraded performance in CI Steps in the Prod 2 and Prod 3 environments. The issue is intermittent. We are investigating the cause.

  2. identified Mar 13, 2026, 04:16 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Mar 13, 2026, 04:27 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Mar 13, 2026, 05:34 PM UTC

    This incident has been resolved.

  5. postmortem Mar 17, 2026, 11:54 PM UTC

    ### **Summary**

    On March 13, 2026, customers running CI pipelines in the Prod2 and Prod3 environments experienced **slower-than-normal CI step execution times**. Investigation showed that CI steps were delayed due to a backlog in internal response processing within the CI Manager. While individual plugin steps executed normally, their completion notifications were delayed, causing pipeline stages to appear significantly slower. The issue began around **2:00 AM PT** and affected some customers until mitigation actions were applied.

    ### **Root Cause**

    The delay was caused by a **backlog in the CI response processing pipeline**. A combination of factors contributed to the backlog:

    * A brief latency spike affecting internal services, including a DB query executed by the pipeline service.
    * Increased response processing load that caused iterator workers to stall while waiting on shared resources.

    These conditions caused CI Manager to accumulate pending responses, which delayed the reporting of step completion even though the underlying plugin execution completed quickly.

    ### **Impact**

    Customers experienced **significantly increased CI pipeline step durations**, even though the actual execution time of the steps was minimal. Impact included:

    * CI pipeline stages appearing to take **longer than expected**
    * Slower overall pipeline execution times for some customers in **Prod2 and Prod3**
    * No data loss or failed builds occurred

    Pipeline performance returned to normal after mitigation actions were applied.

    ### **Mitigation**

    **Immediate**

    * Restarted CI Manager components in affected environments to clear the response processing backlog.
    * Verified CI pipeline execution times returned to baseline levels.

    **Permanent**

    * Implemented new monitoring and alerting on CI iterator response processing latency.
    * Introduced proactive detection thresholds to identify abnormal processing delays earlier.

    ### **Action Items**

    To prevent such issues from happening again, we are taking the following steps:

    * Improve monitoring for CI response processing latency to detect backlog formation sooner.
    * Investigate and optimize query behavior associated with pipeline-service reads.
    * Review CI Manager response processing design to reduce sensitivity to latency spikes.
    * Add safeguards to prevent iterator workers from entering non-recovering states during transient spikes.

Read the full incident report →

Notice March 12, 2026

Prod 2 - Customers may see some executions from March 11 in a "running" but hung state

Detected by Pingoru
Mar 12, 2026, 04:20 PM UTC
Resolved
Mar 12, 2026, 07:16 PM UTC
Duration
2h 55m
Affected: Continuous Delivery - Next Generation (CDNG)
Timeline · 4 updates
  1. investigating Mar 12, 2026, 04:20 PM UTC

    Customers may continue to see that some pipeline executions show that they are "running" even though they have completed, aborted, or failed as a result of yesterday's incident. (https://status.harness.io/incidents/4y4dl47v2qhc) This behavior is a UI-only artifact from the incident and should not affect customers' ability to start new executions. We are working on clearing these artifacts.

  2. identified Mar 12, 2026, 04:21 PM UTC

    The issue has been identified and a fix is being implemented.

  3. resolved Mar 12, 2026, 07:16 PM UTC

    This incident has been resolved.

  4. postmortem Mar 17, 2026, 11:43 PM UTC

    ### **Summary**

    On March 11, 2026, customers experienced pipeline failures and degraded UI performance (incorrect status of states) in the Prod2 environment. The issue was caused by a degradation in an internal shared infrastructure component used for coordination across services. The incident began around **7:10 AM PST** and was fully mitigated by approximately **10:12 AM PST**. During this period, pipeline execution throughput was significantly impacted for affected customers.

    ### **Root Cause**

    The issue was caused by resource saturation in a shared infrastructure component used for distributed coordination, which led to increased latency and failures in service-to-service communication. As a result, pipeline execution services were unable to process workloads efficiently, leading to a buildup of queued tasks and reduced system throughput.

    ### **Impact**

    Customers experienced the following:

    * Pipeline executions failing or not progressing
    * Increased pipeline execution times
    * UI delays due to processing backlogs

    The impact was limited to specific production environments and no data loss occurred.

    ### **Mitigation**

    **Immediate**

    * Redirected services to a higher-capacity infrastructure instance to restore normal processing
    * Cleared accumulated processing backlogs to recover system throughput
    * Scaled supporting services to stabilize performance

    **Permanent**

    * Improved monitoring and alerting for early detection of resource saturation
    * Implemented capacity and scaling improvements to handle higher load scenarios
    * Initiated architectural improvements to reduce reliance on shared coordination components

    ### **Action Items**

    To prevent such issues from happening again, we are taking several steps:

    * Enhance alerting to detect early signs of infrastructure saturation
    * Review and optimize system behavior under high concurrency scenarios
    * Continue investigation into the triggering conditions and incorporate findings into long-term improvements

Read the full incident report →

Major March 11, 2026

Pipelines and dashboards are impacted in Prod2

Detected by Pingoru
Mar 11, 2026, 04:20 PM UTC
Resolved
Mar 11, 2026, 06:45 PM UTC
Duration
2h 24m
Affected: Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards
Timeline · 10 updates
  1. investigating Mar 11, 2026, 03:38 PM UTC

    We are currently investigating this issue.

  2. identified Mar 11, 2026, 03:42 PM UTC

    The issue has been identified and a fix is being implemented.

  3. identified Mar 11, 2026, 03:55 PM UTC

    We are continuing to work on a fix for this issue.

  4. identified Mar 11, 2026, 04:20 PM UTC

    Dashboards are in recovering phase. We are continuing to work on a fix for pipelines issue.

  5. identified Mar 11, 2026, 05:18 PM UTC

    Pipeline executions are going fine,there is a delay to view it on the UI.

  6. identified Mar 11, 2026, 06:06 PM UTC

    Currently all the executions are going on track and will complete. The UI is showing a delayed status. We are currently expediting the UI recovery.

  7. monitoring Mar 11, 2026, 06:15 PM UTC

    A fix has been implemented and we are monitoring the results.

  8. monitoring Mar 11, 2026, 06:33 PM UTC

    A fix has been implemented and we are monitoring the results.

  9. resolved Mar 11, 2026, 06:45 PM UTC

    This incident has been resolved.

  10. postmortem Mar 17, 2026, 11:40 PM UTC

    ### **Summary**

    On March 11, 2026, customers experienced pipeline failures and degraded UI performance (incorrect status of states), and CCM Dashboards were not accessible to the affected customers in the Prod2 environment. The issue was caused by a degradation in an internal shared infrastructure component used for coordination across services. The incident began around **7:10 AM PST** and was fully mitigated by approximately **10:12 AM PST**. During this period, pipeline execution throughput was significantly impacted for affected customers.

    ### **Root Cause**

    The issue was caused by resource saturation in a shared infrastructure component used for distributed coordination, which led to increased latency and failures in service-to-service communication. As a result, pipeline execution services were unable to process workloads efficiently, leading to a buildup of queued tasks and reduced system throughput.

    ### **Impact**

    Customers experienced the following:

    * Pipeline executions failing or not progressing
    * Increased pipeline execution times
    * UI delays due to processing backlogs

    The impact was limited to specific production environments and no data loss occurred.

    ### **Mitigation**

    **Immediate**

    * Redirected services to a higher-capacity infrastructure instance to restore normal processing
    * Cleared accumulated processing backlogs to recover system throughput
    * Scaled supporting services to stabilize performance

    **Permanent**

    * Improved monitoring and alerting for early detection of resource saturation
    * Implemented capacity and scaling improvements to handle higher load scenarios
    * Initiated architectural improvements to reduce reliance on shared coordination components

    ### **Action Items**

    To prevent such issues from happening again, we are taking several steps:

    * Enhance alerting to detect early signs of infrastructure saturation
    * Review and optimize system behavior under high concurrency scenarios
    * Continue investigation into the triggering conditions and incorporate findings into long-term improvements

Read the full incident report →

Minor March 9, 2026

GCP Billing Delay - Google Cloud billing export delays since March 8 may cause incomplete cost data for GCP resources. Resolution by GCP is expected soon

Detected by Pingoru
Mar 09, 2026, 10:47 AM UTC
Resolved
Mar 12, 2026, 07:13 AM UTC
Duration
2d 20h
Affected: Cloud Cost Management (CCM)
Timeline · 5 updates
  1. identified Mar 09, 2026, 10:47 AM UTC

    We are experiencing delays in GCP billing data due to an issue with the GCP Billing Export.

  2. identified Mar 10, 2026, 07:01 AM UTC

    Google Cloud billing export delays since ~March 8 may cause incomplete cost data for GCP resources. Resolution by GCP is expected soon

  3. identified Mar 10, 2026, 07:33 AM UTC

    We are continuing to work on a fix for this issue.

  4. resolved Mar 12, 2026, 07:13 AM UTC

    This incident has been resolved.

  5. postmortem Apr 14, 2026, 03:31 AM UTC

    This was a GCP issue (more details once GCP publishes a postmortem report).

Read the full incident report →

Critical March 6, 2026

Prod4 is experiencing login issues

Detected by Pingoru
Mar 06, 2026, 12:31 AM UTC
Resolved
Mar 06, 2026, 02:10 AM UTC
Duration
1h 38m
Affected: Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Chaos Engineering, Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Code Repository, Artifact Registry, Platform
Timeline · 4 updates
  1. investigating Mar 06, 2026, 12:31 AM UTC

    We are currently investigating this issue.

  2. monitoring Mar 06, 2026, 12:45 AM UTC

    A fix has been implemented and we are monitoring the results.

  3. resolved Mar 06, 2026, 02:10 AM UTC

    This incident has been resolved.

  4. postmortem Mar 10, 2026, 01:12 AM UTC

    ## Summary

    On **March 6, 2026**, customers on **Prod4** experienced a service disruption affecting delegate connectivity, pipeline execution, and CI workflows. The disruption was caused by a configuration change introduced during a scheduled platform upgrade that reduced the connection capacity of our internal routing layer. This reduction, combined with a concurrent update to our delegate management service, caused a burst of delegate reconnections that exceeded the new capacity limit. Service was fully restored via rollback within **42 minutes**.

    ## Impact

    * Delegates on Prod4 were unable to communicate with the Harness platform during the incident window.
    * Pipeline executions requiring delegate tasks were blocked and could not progress.
    * CI pipelines requiring secret resolution failed for the duration of the incident.
    * Platform API and UI operations on Prod4 returned errors.
    * Delegates reconnected automatically once service was restored; no manual restart was required.

    ## Root Cause

    A scheduled platform upgrade introduced a fixed connection limit in the internal routing layer that handles all delegate-to-platform traffic on Prod4. At the same time, the delegate management service underwent a rolling update that caused active delegates to reconnect simultaneously. The volume of concurrent reconnections exceeded the new fixed limit, blocking delegates from reaching the platform for the duration of the incident.

    The previous routing configuration used an unbounded connection queue, which could absorb reconnection bursts of this nature without impact. The new fixed limit, sized for steady-state traffic, had no headroom for the reconnection surge produced by a concurrent rolling update.

    ## Mitigation

    **Immediate:** Rolled back the routing component to the previous version, restoring the unbounded connection configuration and allowing all delegates to reconnect.

    **Short-term:**

    * Connection capacity was increased and tuned to handle full delegate reconnection bursts with sufficient headroom above steady-state load.
    * The connection acquire timeout was extended so that temporary overload conditions resolve naturally rather than cascading into a self-sustaining failure.
    * The routing component is being moved to its own independent release pipeline, decoupled from the main service upgrade, with dedicated post-deploy validation before traffic is promoted.

    **Ongoing:**

    * The delegate management service rolling update policy is being updated to stagger pod replacement one at a time, limiting the maximum reconnection burst to a fraction of the delegate fleet rather than the entire fleet simultaneously.
    * Routing layer autoscaling limits are being raised so the cluster can expand connection capacity in response to load spikes during deployments.

    ## Action Items

    To prevent such issues from happening again, we are:

    1. Increasing the connection pool capacity to handle the full delegate reconnection burst with headroom.
    2. Extending the connection acquire timeout to prevent transient overload from becoming a self-sustaining failure loop.
    3. Updating the delegate management service rolling update configuration to replace one pod at a time.
    4. Updating the deployment mechanism so that the routing component can be deployed independently from the main service release pipeline, with dedicated validation.
    5. Raising routing layer autoscaling limits to allow capacity expansion during connection load spikes.
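The root cause contrasts an unbounded connection queue, which absorbed reconnection bursts, with a fixed limit sized only for steady state. The toy model below (capacity, timeout, and the `ConnectionGate` abstraction are all illustrative, not the actual routing layer) shows why both headroom above steady-state load and a generous acquire timeout matter during a fleet-wide reconnection surge.

```python
import threading

class ConnectionGate:
    """Toy model of a routing-layer connection limit.

    capacity and acquire_timeout are illustrative. The point: if a burst of
    simultaneous delegate reconnections exceeds capacity, callers either wait
    out the burst (long enough timeout) or fail fast and retry, which is
    roughly what delegates experienced during the incident.
    """
    def __init__(self, capacity, acquire_timeout):
        self._slots = threading.Semaphore(capacity)
        self._timeout = acquire_timeout

    def connect(self):
        # Returns False when no slot frees up within the timeout.
        return self._slots.acquire(timeout=self._timeout)

    def disconnect(self):
        self._slots.release()

# Sizing sketch: capacity should cover steady-state connections plus the
# largest expected reconnection burst, e.g. a rolling update that replaces
# one pod at a time rather than the whole delegate fleet at once.
gate = ConnectionGate(capacity=10_000, acquire_timeout=30.0)
```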

Read the full incident report →

Minor March 5, 2026

Code Module is not accessible on prod1/2/3

Detected by Pingoru
Mar 05, 2026, 09:21 PM UTC
Resolved
Mar 05, 2026, 10:05 PM UTC
Duration
43m
Affected: Code Repository
Timeline · 5 updates
  1. investigating Mar 05, 2026, 10:16 PM UTC

    We are currently investigating this issue.

  2. identified Mar 05, 2026, 10:17 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Mar 05, 2026, 10:17 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Mar 05, 2026, 11:42 PM UTC

    This incident has been resolved.

  5. postmortem Mar 09, 2026, 08:33 PM UTC

    ## Summary

    Between **4:20 PM and 5:16 PM EST on Thursday, March 5, 2026**, customers using the **Harness Code modules** experienced a production outage in Harness production clusters **Prod1, Prod2, and Prod3**. Git repositories were unreachable during this outage.

    ## Root Cause

    We experienced a surge in metrics that overwhelmed the metric collectors on the Kubernetes pods. As a result, the Git pods were impacted. The StatefulSet became unschedulable, and resizing of the metric collectors was required to remedy the situation.

    ## Impact

    All code repositories were offline during the event across all three production clusters.

    ## Remediation

    Engineering increased the memory allocated to the metric collectors and redeployed the configuration. After redeployment, the Git pods were rescheduled and service was restored.

    ## Action Items

    To prevent such issues from happening, we are implementing the following:

    * **Enhance monitoring and alerting** – Add health monitors for metric-gathering collectors and rebalance metric growth across the cluster.
    * **Review capacity planning** – Proactively monitor metric collector usage and scale them appropriately with sufficient headroom to handle spikes.

Read the full incident report →

Minor March 3, 2026

[Prod-8] Degraded access to the login page

Detected by Pingoru
Mar 03, 2026, 10:32 AM UTC
Resolved
Mar 03, 2026, 04:40 PM UTC
Duration
6h 8m
Affected: Platform
Timeline · 4 updates
  1. investigating Mar 03, 2026, 10:32 AM UTC

    We are currently investigating this issue.

  2. monitoring Mar 03, 2026, 10:43 AM UTC

    A fix has been implemented and we are monitoring the results.

  3. resolved Mar 03, 2026, 04:40 PM UTC

    This incident has been resolved.

  4. postmortem Mar 04, 2026, 07:20 PM UTC

    ## **Summary**

    On **March 2**, the **prod8 environment** became temporarily inaccessible due to a configuration issue during a platform deployment. The issue affected ingress routing for the platform UI, resulting in HTTP **404 responses** when users attempted to access the environment. The issue was quickly identified as an ingress configuration problem. A temporary mitigation was applied by updating the ingress configuration, which immediately restored access. A permanent fix is being implemented to prevent recurrence.

    ## **Root Cause**

    The issue was caused by a **service config** that incorrectly generated ingress configuration during deployment. This caused the ingress controller to misroute incoming requests that did not match the expected path. As a result, these requests were directed to the default backend and returned **404 responses**. The problem was isolated to the ingress routing layer. Network connectivity and the Google Cloud Network Load Balancer were functioning normally.

    ## **Impact**

    * **Affected Environment:** prod8
    * **Customer Impact:** Users were unable to access the platform UI and received HTTP 404 responses.
    * **Scope:** Limited to the specific environment impacted by the ingress configuration change.

    ## **Resolution**

    Engineering teams applied a temporary mitigation by **patching the platform-ui ingress configuration in production to remove the incorrect host entries**. This restored correct routing behavior and resolved the accessibility issue. Access to the prod8 environment was fully restored after the ingress configuration update.

    ## **Prevention and Improvements**

    To prevent recurrence of this issue, the following steps are underway:

    * Adding additional validation checks to ensure ingress configuration is rendered correctly during deployment.
    * Improving deployment testing for ingress routing scenarios to detect configuration regressions earlier.

    These improvements will ensure that similar misconfigurations are caught before reaching production environments.

Read the full incident report →

Minor February 27, 2026

Login issues and propagation delays on FME

Detected by Pingoru
Feb 27, 2026, 10:46 PM UTC
Resolved
Feb 27, 2026, 10:46 PM UTC
Duration
Timeline · 2 updates
  1. resolved Feb 27, 2026, 11:12 PM UTC

    Customers using FME would have experienced login issues in addition to flag propagation delays from 22:46 UTC to 23:00 UTC. Services are back to normal now.

  2. postmortem Mar 04, 2026, 05:22 PM UTC

    ## Summary

    On February 27, 2026, at 22:46 UTC, customers using Feature Management experienced brief UI and Admin API unavailability. The issue lasted approximately 14 minutes and was fully resolved at 23:00 UTC. Feature flag evaluation and SDK operations were not impacted during this time.

    ## Impact

    * Customers could not access or make changes in the Feature Management console or API.
    * All other operations, including flag evaluation via SDKs, continued to function normally.

    ## Root Cause

    A configuration change deployed during a release introduced database connection errors.

    ## Remediation

    Upon detection, we immediately reverted the configuration change. We then initiated data synchronization and cleared relevant caching layers to restore normal authentication and operation.

    ## Action Items

    * We have implemented additional automated validation and safeguards to ensure configuration changes are validated prior to deployment.
    * We have also enhanced service health monitoring and alerting to proactively detect authentication latency and flag propagation delays, enabling faster mitigation in the future.

Read the full incident report →

Minor February 27, 2026

Slowness in Pipeline Execution graph UI

Detected by Pingoru
Feb 27, 2026, 02:44 PM UTC
Resolved
Feb 27, 2026, 04:07 PM UTC
Duration
1h 23m
Affected: Continuous Delivery - Next Generation (CDNG)
Timeline · 5 updates
  1. investigating Feb 27, 2026, 02:44 PM UTC

    We are currently investigating the issue. The impact is identified to be currently only in UI interface. Executions continue to work as expected.

  2. identified Feb 27, 2026, 03:50 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Feb 27, 2026, 04:05 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Feb 27, 2026, 04:07 PM UTC

    This incident has been resolved.

  5. postmortem Mar 04, 2026, 07:50 PM UTC

    ## **Summary**

    On **2/27/2026**, customers experienced slowness when viewing **running pipeline execution pages** in the Harness UI. The issue was caused by delays in the processing of graph generation events used to generate the pipeline execution graph. The degradation began around **7:33 AM PT** and resulted in delayed updates and slow loading of pipeline execution views. The engineering team identified the underlying performance bottleneck, applied mitigation measures, and restored normal system behavior after stabilizing the event processing pipeline.

    ## **Root Cause**

    The incident was caused by a **temporary backlog in Kafka consumers responsible for processing orchestration log events**, which are used to generate the execution graph for running pipelines. The backlog was triggered by increased system load combined with performance degradation in a **shared Elasticsearch cluster** used by the pipeline processing services.

    During the incident window, Elasticsearch experienced a sudden spike in indexing activity which caused resource contention and high CPU utilization on one of the cluster nodes. This slowdown in Elasticsearch queries reduced the processing throughput of the Kafka consumers responsible for graph generation, resulting in accumulated consumer lag and delayed updates in the pipeline execution UI.

    ## **Impact**

    During the incident window:

    * Users experienced **slow loading or delayed updates when viewing running pipeline execution pages**.
    * The **pipeline graph visualization** and related execution details were slower to render.
    * Pipeline executions themselves continued to run normally, but the UI display of their progress was delayed.

    Other Harness services and pipeline execution functionality were not impacted.

    ## **Mitigation**

    Engineering teams implemented several mitigation steps to restore system performance:

    * **Scaled the Elasticsearch cluster** to relieve resource pressure and improve query performance.
    * **Scaled Kafka consumer capacity** to accelerate backlog processing.

    These actions improved consumer processing throughput and allowed the Kafka backlog to drain. Consumer lag reduced, after which the pipeline execution UI returned to normal responsiveness.

    ## **Prevention and Improvements**

    To reduce the likelihood of similar incidents in the future, the following improvements are being implemented:

    * Capacity planning improvements for shared Elasticsearch clusters supporting orchestration workloads.
    * Additional safeguards to prevent workloads that can amplify indexing activity.

    These measures will help ensure better isolation of workloads and faster detection of resource contention scenarios.
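Because the root cause was accumulated Kafka consumer lag on the graph-generation consumers, a simple lag check illustrates the kind of monitoring this postmortem implies. The `get_end_offsets`/`get_committed_offsets` helpers and the threshold are assumptions standing in for whatever the monitoring stack actually exposes; this is not a specific Harness or Kafka client API.

```python
# Hypothetical consumer-lag check for the graph-generation consumer group.
# get_end_offsets() and get_committed_offsets() are assumed callables that
# return {partition: offset} maps; LAG_ALERT_THRESHOLD is illustrative.
LAG_ALERT_THRESHOLD = 50_000

def total_lag(get_end_offsets, get_committed_offsets):
    end = get_end_offsets()              # latest offset per partition
    committed = get_committed_offsets()  # last committed offset per partition
    return sum(max(0, end[p] - committed.get(p, 0)) for p in end)

def check_lag(get_end_offsets, get_committed_offsets, alert):
    lag = total_lag(get_end_offsets, get_committed_offsets)
    if lag > LAG_ALERT_THRESHOLD:
        # A sustained breach here would have surfaced the UI delay before
        # customers noticed stale execution graphs.
        alert(f"Graph-generation consumer lag at {lag} messages; "
              "pipeline execution UI updates may be delayed")
```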

Read the full incident report →

Minor February 27, 2026

Feature flags that were updated between 09:14:39 PM UTC and 11:39:44 PM UTC experienced delays in propagation.

Detected by Pingoru
Feb 27, 2026, 12:17 AM UTC
Resolved
Feb 26, 2026, 09:14 PM UTC
Duration
Timeline · 2 updates
  1. resolved Feb 27, 2026, 12:17 AM UTC

    Feature flags that were updated on Feb 26, 2026 between 09:14:39 PM UTC and 11:39:44 PM UTC experienced delays in propagation. The issue has been resolved, and all propagations from that window are recovering. No customer action is required.

  2. postmortem Mar 04, 2026, 07:55 PM UTC

    ## **Summary**

    On **February 26, 2026**, feature flags updated between **09:14:39 PM UTC and 11:39:44 PM UTC** experienced **delays in propagation**. Feature flag evaluation through SDKs continued to function normally during this time. This occurred during a planned maintenance activity.

    ## **Impact**

    * Feature flag updates made between **09:14:39 PM UTC and 11:39:44 PM UTC** experienced delayed propagation.
    * The delay affected the time it took for flag configuration changes to appear across the system.
    * **Feature flag evaluation through SDKs continued to operate normally**, and applications relying on flag evaluation were not impacted.
    * Once the issue was resolved, all pending propagations began processing and the system returned to normal behavior.

    ## **Root Cause**

    The issue was caused by a temporary delay in the internal propagation pipeline responsible for distributing feature flag configuration updates across the platform while we were in the middle of the planned maintenance activity. During the affected window, propagation tasks accumulated in the processing pipeline, which delayed the distribution of updated flag configurations. This did not affect flag evaluation or SDK operations but delayed the visibility of configuration updates.

    ## **Remediation**

    Engineering teams resolved the issue by rolling back, restoring normal processing of the propagation pipeline, and allowing the queued updates to be processed. All delayed updates from the affected time window have since propagated successfully.

    ## **Action Items**

    To prevent such issues from happening again:

    * **Enhanced monitoring** for propagation latency to detect delays earlier.
    * **Improved alerting** for propagation backlog thresholds.
    * **Additional safeguards** to ensure faster recovery if propagation queues accumulate.

Read the full incident report →

Major February 26, 2026

Prod1/Prod2 pipelines and logins are degraded. Some delegates are disconnected

Detected by Pingoru
Feb 26, 2026, 05:56 PM UTC
Resolved
Feb 26, 2026, 06:29 PM UTC
Duration
33m
Affected: Continuous Delivery - Next Generation (CDNG)
Timeline · 5 updates
  1. investigating Feb 26, 2026, 05:56 PM UTC

    We are currently investigating this issue.

  2. identified Feb 26, 2026, 06:05 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Feb 26, 2026, 06:14 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Feb 26, 2026, 06:29 PM UTC

    This incident has been resolved.

  5. postmortem Mar 02, 2026, 04:26 PM UTC

    ## Summary

    On **February 26, 2026**, multiple customers experienced disruptions accessing Harness on Prod1 and Prod2. A transient network connectivity issue caused disruption to our backend systems, leading to platform unresponsiveness. Service was restored within approximately one hour.

    ## Impact

    * Customers on Prod2 were unable to log in or access the Harness platform.
    * Prod1 experienced login disruptions due to a cross-environment dependency on Prod2.
    * Delegates disconnected; Kubernetes-based delegates reconnected automatically, while non-Kubernetes delegates required a manual restart.

    ## Root Cause

    A transient network connectivity disruption caused connection timeouts across the platform. The exact infrastructure-side trigger of the initial connectivity disruption is still under investigation.

    ## Remediation

    * **Immediate:** Affected services were manually restarted, clearing stuck connections and restoring platform availability.
    * **Short-term:** Autoscaling limits were adjusted to better handle sudden reconnection load.
    * **Ongoing:** Investigation into timeout configuration and application resilience improvements is in progress.

    ## Action Items

    To prevent such issues from happening again:

    1. Review and update timeout settings to fail fast and limit thread blocking during connectivity issues.
    2. Improve application resilience by enhancing circuit breakers so that connectivity issues and retries are contained.
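The action items call for tighter timeouts and circuit breakers so a transient connectivity problem fails fast instead of blocking threads. A minimal circuit-breaker sketch follows; the failure threshold and reset interval are assumptions, and this is a generic pattern rather than any specific Harness component.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: fail fast after repeated errors.

    failure_threshold and reset_after are illustrative values. The intent
    matches the action item of limiting thread blocking during transient
    connectivity problems.
    """
    def __init__(self, failure_threshold=5, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                # Circuit is open: reject immediately instead of holding a
                # thread on a connection that will likely time out.
                raise RuntimeError("circuit open; failing fast")
            self.opened_at = None  # half-open: allow a single probe call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result
```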

Read the full incident report →

Major February 24, 2026

Hosted CI mac builds are impacted

Detected by Pingoru
Feb 24, 2026, 04:37 PM UTC
Resolved
Feb 24, 2026, 05:11 PM UTC
Duration
33m
Affected: Continuous Integration Enterprise(CIE) - Mac Cloud Builds
Timeline · 5 updates
  1. investigating Feb 24, 2026, 04:37 PM UTC

    We are currently investigating this issue.

  2. identified Feb 24, 2026, 04:46 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Feb 24, 2026, 05:01 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Feb 24, 2026, 05:11 PM UTC

    This incident has been resolved.

  5. postmortem Mar 04, 2026, 01:00 AM UTC

    ## **Summary**

    On February 24, 2026, Harness experienced an incident that temporarily affected **Mac Cloud build scheduling across production environments**. During the incident window, new Mac build jobs could not be scheduled due to an orchestration control-plane disruption. The issue was detected immediately through monitoring alerts, investigated by the on-call engineering team, and resolved after restoring the affected control-plane node. Full service was restored shortly thereafter.

    ## **Root Cause**

    The incident occurred due to a loss of quorum in the orchestration control plane responsible for scheduling Mac builds. This was due to two separate issues occurring at the same time: memory pressure and a traffic spike. This resulted in a temporary interruption to Mac build scheduling until the degraded node was restored and quorum was re-established.

    ## **Impact**

    * **Affected Service:** Mac Cloud builds
    * **Affected Regions:** Production clusters (Prod1, Prod2, Prod3, Prod4)
    * **Customer Impact:**
      * New Mac build jobs could not be scheduled during the incident window
      * Existing running builds were not impacted
    * **Services Not Impacted:**
      * Linux Cloud builds
      * Windows Cloud builds
      * Self-hosted build infrastructure
      * Other Harness CI/CD services (pipelines, artifacts, deployments)

    ## **Mitigation**

    Engineering teams restored the degraded orchestration node, allowing the cluster to re-establish quorum and elect a leader. Once leader election completed, Mac build scheduling resumed and services returned to normal operation.

    ## **Prevention and Improvements**

    To reduce the likelihood of similar incidents in the future, the following actions are being implemented:

    * Reducing scheduling load on the orchestration layer by optimizing the infrastructure for reliability.
    * Improving monitoring and health checks for control-plane nodes.

Read the full incident report →

Minor February 23, 2026

Degraded performance in SEI insights

Detected by Pingoru
Feb 23, 2026, 11:03 AM UTC
Resolved
Feb 23, 2026, 02:11 PM UTC
Duration
3h 8m
Affected: Software Engineering Insights (SEI)
Timeline · 5 updates
  1. investigating Feb 23, 2026, 11:03 AM UTC

    We are currently investigating the issue.

  2. identified Feb 23, 2026, 12:56 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Feb 23, 2026, 01:51 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Feb 23, 2026, 02:11 PM UTC

    This incident has been resolved.

  5. postmortem Mar 02, 2026, 06:02 PM UTC

    ## **Summary**

    A few Efficiency and Productivity widgets were intermittently showing no data for a few customers in the Prod2 environment.

    ## **Root Cause**

    An SEI query engine service node had a metadata sync issue with the rest of the cluster and was returning empty datasets. A defect in the service metadata sync process prevented the synchronization, resulting in queries returning no data. As queries are distributed across all nodes, the issue occurred intermittently, only when traffic was routed to the affected node. This made the problem appear random rather than systemic.

    ## **Impact**

    For a few SEI 2.0 customers, some Insights widgets were impacted.

    **Productivity Insights**: The work completed and coding days per dev widgets were intermittently showing empty data.

    **Efficiency Insights**: The lead time to change widget was intermittently showing empty data.

    **Duration:** February 23, 01:18 AM PST – February 23, 4:21 AM PST

    ### What was not impacted?

    The following systems and capabilities remained fully operational:

    * Data ingestion and processing
    * Integrations
    * Org tree and other metadata operational flows
    * All other efficiency, productivity, and AI insights widgets
    * Customers using SEI 1.0

    No customer data was lost.

    ### **Remediation**

    * **Immediate Mitigation:** We identified a metadata synchronization issue as the root cause. We have implemented a fix that successfully stabilizes the system and prevents this bug from being triggered.

    ## **Action Items**

    To prevent such issues from happening again:

    * We permanently fixed the bug in downstream systems.
    * We are also enhancing our tests and alerting to make sure we detect such issues sooner.

Read the full incident report →

Notice February 23, 2026

Hosted CI Builds Failing

Detected by Pingoru
Feb 23, 2026, 08:37 AM UTC
Resolved
Feb 23, 2026, 08:52 AM UTC
Duration
15m
Affected: Continuous Integration Enterprise(CIE) - Self Hosted Runners
Timeline · 3 updates
  1. monitoring Feb 23, 2026, 08:37 AM UTC

    A fix has been implemented and we are monitoring the results.

  2. resolved Feb 23, 2026, 08:52 AM UTC

    This incident has been resolved.

  3. postmortem Mar 02, 2026, 05:47 PM UTC

    ### **Summary**

    On **February 23, 2026**, between **00:00 AM and 00:30 AM PST** (**08:00–08:30 UTC / 13:30–14:00 IST**), a configuration change to **VPC Service Controls (VPC-SC)** was enforced at the **organization level** with the intent of restricting **BigQuery** usage to approved projects only. Shortly after enforcement, multiple **unrelated GCP services** began failing due to **unexpected API denials**, impacting **CI workflows** and **internal service operations** (e.g., access to storage, container images, and some console/project visibility behaviors). Service was fully restored after the change was **reverted**, and normal operations resumed within approximately **30 minutes**. **No data loss occurred, and there was no security breach.**

    ### **Root Cause**

    A **regression in configuration scope** occurred when an **org-level VPC-SC perimeter** was enforced to restrict BigQuery access. While the intent was to limit **BigQuery** specifically, applying the perimeter at the organization level **changed service communication boundaries** more broadly than expected. This resulted in **wider GCP API denials** affecting dependent services and cross-project interactions.

    Contributing factors:

    * **Org-level enforcement increased blast radius** beyond the intended BigQuery restriction.
    * Some **internal service-to-service calls** and **cross-project dependencies** were impacted in ways that were **not clearly surfaced** by dry-run visibility.
    * Validation in a non-production environment did not expose this behavior due to **lower workload volume** and **fewer real-world cross-service / cross-project integrations**.

    ### **Impact**

    During the incident window, the following symptoms were observed:

    **CI / External-facing impact**

    * CI Hosted builds and some **CI jobs** were unable to complete operations that rely on GCP access (for example, launching compute resources using service accounts and/or accessing required dependencies).

    **Customer Impact (CD)**

    During the incident window, customers were unable to execute CD pipelines and encountered the error: _"Cannot generate token for the accountId."_

    Additionally:

    * The Harness File Store was unable to fetch existing files or store new files from the GCS backend.
    * Logs were not visible in the Harness UI.

    ### **Remediation**

    **Immediate**

    * **Reverted** the organization-level VPC-SC enforcement, returning the environment to the prior access state.
    * Confirmed recovery across impacted services and workflows after rollback.

    **Permanent**

    * A safer enforcement approach is being developed to meet the original governance goal (restricting BigQuery usage to approved projects) **without applying an organization-wide perimeter in a single step**.
    * Working with **cloud provider support** to validate an alternative design and ensure the expected service dependency behavior is understood before re-enforcement.

    ### **Action Items**

    To prevent such incidents from happening again, we have implemented (or are implementing) the following:

    1. **Staged rollout for org-level security boundary changes**
    2. **Dependency mapping before perimeter enforcement**
    3. **Improve observability for VPC-SC denials**
    4. **Strengthen pre-production validation**
    5. **Evaluate alternative BigQuery governance controls**

Read the full incident report →

Minor February 19, 2026

Hosted CI build vm environment is seeing higher network latency

Detected by Pingoru
Feb 19, 2026, 05:36 PM UTC
Resolved
Feb 19, 2026, 06:20 PM UTC
Duration
44m
Affected: Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds
Timeline · 5 updates
  1. investigating Feb 19, 2026, 05:36 PM UTC

    We are currently investigating this issue.

  2. identified Feb 19, 2026, 05:39 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Feb 19, 2026, 06:11 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Feb 19, 2026, 06:20 PM UTC

    This incident has been resolved.

  5. postmortem Mar 02, 2026, 05:32 PM UTC

    ## **Summary**

    On February 19, 2026, a partial degradation occurred in the CI infrastructure in the **us-west1** region due to issues affecting the NAT control plane. During a brief window (~30 minutes), a limited number of CI build jobs failed during VM initialization. The issue was detected through internal monitoring and mitigated via controlled failover, followed by restoration of the affected NAT instances.

    ## **Root Cause**

    The incident was caused by saturation of connection tracking (iptables/conntrack state) on NAT virtual machines in the us-west1 region. A short-lived spike in build VM activity led to a burst of metadata-related connections. Over time, stale connection entries accumulated without automated cleanup, eventually preventing the NAT VMs from accessing the cloud metadata service. This metadata connectivity disruption impacted control-plane functionality (including VM provisioning), which resulted in a limited number of build initialization failures.

    ## **Impact**

    * **Region Impacted:** us-west1
    * **Customer Impact:**
      * Limited CI job failures during VM provisioning
      * Two customers experienced isolated build failures
      * No impact to running workloads
    * **Data Loss:** None
    * **Duration:** Approximately 30 minutes

    Customers were advised to retry failed builds after mitigation.

    ## **Mitigation**

    * Traffic was automatically failed over to another NAT to maintain egress functionality.
    * Affected NAT VMs were restarted to clear saturated connection state.
    * Metadata connectivity, SSH access, monitoring, and health checks were verified.
    * Traffic was gradually restored to the affected NAT after stability was confirmed.
    * Cloud NAT IP utilization was monitored during failover to prevent capacity exhaustion.

    ### **Action Items and Permanent Preventive Measures**

    To prevent such issues from happening again, we will:

    * Implement automated cleanup of metadata-related connection tracking state.
    * Add proactive health checks and alerts for metadata reachability.
    * Strengthen monitoring for NAT VM control-plane health degradation.
    * Enhance fallback guardrails and capacity validation for Cloud NAT.
    * Deploy automated iptables/conntrack cleanup for metadata traffic.
    * Improve monitoring for metadata connectivity health.
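One action item is automated cleanup of stale connection-tracking entries for metadata traffic. The sketch below shows what a periodic sweep on a NAT VM could look like; it assumes the conntrack-tools CLI is installed, that 169.254.169.254 is the metadata endpoint, and that this cadence and invocation are acceptable — all of which are assumptions, not the actual remediation Harness deployed.

```python
import subprocess
import time

METADATA_IP = "169.254.169.254"   # cloud metadata service endpoint
SWEEP_INTERVAL_SECONDS = 300       # illustrative cleanup cadence

def sweep_metadata_conntrack_entries():
    # Deletes tracked connections whose original destination is the metadata
    # service, freeing conntrack table space before it saturates. Requires
    # conntrack-tools and root privileges on the NAT VM; the exact invocation
    # is an assumption about how such a cleanup could be wired up.
    subprocess.run(
        ["conntrack", "-D", "-d", METADATA_IP],
        check=False,  # conntrack exits non-zero when nothing matched
    )

if __name__ == "__main__":
    while True:
        sweep_metadata_conntrack_entries()
        time.sleep(SWEEP_INTERVAL_SECONDS)
```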

Read the full incident report →

Major February 18, 2026

Git API rate limit errors

Detected by Pingoru
Feb 18, 2026, 03:30 PM UTC
Resolved
Feb 18, 2026, 03:30 PM UTC
Duration
Timeline · 2 updates
  1. resolved Feb 20, 2026, 07:17 PM UTC

    Two enterprise customers experienced Git API rate limit errors when using remote entities backed by Git repositories in the prod1 environment. This prevented loading of services, environments, and other entities stored in Git, and blocked pipeline executions that depended on the remote YAML.

  2. postmortem Feb 20, 2026, 07:17 PM UTC

    ## **Summary**

    On February 18, 2026, between 7:30 AM and 12:15 PM PST, some enterprise customers experienced Git API rate limit errors when using remote entities backed by Git repositories in the prod1 environment. This prevented loading of services, environments, and other entities stored in Git, and blocked pipeline executions that depended on the remote YAML. Service was fully restored after rolling back the release.

    ## **Root Cause**

    A regression introduced in a recent release caused incorrect cache invalidation in the GitX bidirectional sync mechanism.

    * Webhook event processing cleared the GitX cache on every event instead of only when relevant changes occurred
    * The two enterprise customers generated high webhook volume and contained a large number of remote entities
    * The cache was invalidated continuously, forcing all entity fetches to query GitHub directly
    * This behavior rapidly exhausted GitHub API rate limits

    The impact was amplified by repeated calls to the API, which retrieves all service definitions from Git. Due to the high number of remote services, these calls significantly increased API consumption.

    ## **Impact**

    * Remote entities (pipelines, templates, services, environments) failed to load
    * Pipeline executions depending on remote YAML were blocked

    ## **Remediation**

    * **Immediate:** Rolled back the Prod1 system release, which restored normal caching behavior and resolved the rate limit issues.
    * **Permanent:** A fix has been implemented to correct the cache invalidation logic, ensuring the cache is cleared only when necessary rather than on every webhook push/PR event.

    ## **Action Items**

    1. **Add proactive alerting for cache health** - Implement monitoring and alerts that trigger when cache hit rates drop below expected thresholds, enabling faster detection of cache-related issues
    2. **Move to GitHub App–based authentication** - Adopt GitHub App authentication to significantly increase API rate limits and reduce the risk of throttling
    3. **Improve cache observability** - Add comprehensive metrics for all cache operations to enable better monitoring, troubleshooting, and debugging of cache-related issues
    4. **Enhance automated testing** - Expand test coverage to include cache behavior validation as part of automated sanity and regression testing
    5. **Holistic review of GitX flows** - Review all GitX flows, identify P0 and P1 paths, and ensure full automation coverage and observability across these critical workflows
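The regression described above cleared the GitX cache on every webhook event. A minimal sketch of the corrected behavior is to invalidate only when a pushed change actually touches tracked entity files; the event shape, the `RELEVANT_SUFFIXES` filter, and the `cache` interface below are assumptions for illustration, not the actual GitX payload or code.

```python
# Hypothetical guard around cache invalidation for Git webhook events.
# The push_event structure and the suffix filter are illustrative
# assumptions, not the actual GitX webhook format.
RELEVANT_SUFFIXES = (".yaml", ".yml")

def changed_files(push_event):
    files = set()
    for commit in push_event.get("commits", []):
        for key in ("added", "modified", "removed"):
            files.update(commit.get(key, []))
    return files

def handle_push_event(push_event, cache):
    touched = {
        path for path in changed_files(push_event)
        if path.endswith(RELEVANT_SUFFIXES)
    }
    if touched:
        # Invalidate only entries backed by files that actually changed,
        # instead of flushing the whole cache on every webhook delivery,
        # which is what exhausted the GitHub API rate limits.
        cache.invalidate(touched)
```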

Read the full incident report →

Minor February 17, 2026

Pipeline are running slow in Prod3

Detected by Pingoru
Feb 17, 2026, 05:27 PM UTC
Resolved
Feb 17, 2026, 10:59 PM UTC
Duration
5h 32m
Affected: Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Security Testing Orchestration (STO), Service Reliability Management (SRM), Chaos Engineering, Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA)
Timeline · 5 updates
  1. investigating Feb 17, 2026, 05:27 PM UTC

    We are currently investigating this issue.

  2. identified Feb 17, 2026, 05:28 PM UTC

    We are actively working to mitigate this

  3. monitoring Feb 17, 2026, 06:11 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Feb 17, 2026, 10:59 PM UTC

    This incident has been resolved.

  5. postmortem Mar 02, 2026, 08:42 PM UTC

    **Summary**

    On February 17, 2026, we had a traffic spike in one of the services in Prod3, which impacted the Pipeline Service's capacity. We remediated this by addressing the source of the spike in workload and performing tuning of our backend systems.

    **Root Cause**

    Starting around 7:25 A.M. PST, our databases became overwhelmed with an increased rate of writes, causing resource pressure. The write latency spiked, causing our upstream systems to experience timeouts and errors.

    **Customer Impact**

    During the window of the impact:

    * Pipeline executions ran significantly slower or stalled, with initialization steps delayed.
    * Slowness while performing CRUD operations on pipelines.

    **Resolution**

    We identified and disabled a high-frequency batch write workload that was contributing significantly to the write pressure. By switching that component to a lower-write alternative flow, full system recovery was confirmed at ~10:05 AM PST.

    **Prevention and Improvements**

    To prevent recurrence and enable faster identification of such issues, we are taking several measures:

    * Automate the audit and proactively optimize resource-intensive queries. Optimize with better indexes or query scope limits to prevent working set overflow.
    * Fine-tune workloads to increase headroom to handle spikes.
    * Add proactive alerts for sustained traffic rates and resource utilization approaching the high watermark.
    * Add capacity to our backend systems.

Read the full incident report →
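
To illustrate the kind of proactive alerting called out in the prevention items, here is a minimal sketch with assumed thresholds and metric names (none of these values come from Harness): it alerts only when the write rate stays above a high watermark for a sustained window, rather than on a single spike.

```python
from collections import deque

# Assumed thresholds for illustration only.
HIGH_WATERMARK_WRITES_PER_SEC = 5000   # hypothetical high watermark
SUSTAINED_SAMPLES = 12                 # e.g. 12 samples at 30s intervals = 6 minutes

window = deque(maxlen=SUSTAINED_SAMPLES)

def record_sample(writes_per_sec):
    """Record one metric sample and alert on sustained high write rates."""
    window.append(writes_per_sec)
    if len(window) == window.maxlen and min(window) >= HIGH_WATERMARK_WRITES_PER_SEC:
        raise_alert(
            f"Write rate has stayed at or above {HIGH_WATERMARK_WRITES_PER_SEC}/s "
            f"for {SUSTAINED_SAMPLES} consecutive samples"
        )

def raise_alert(message):
    # Placeholder notification hook (Slack, PagerDuty, etc.).
    print(f"ALERT: {message}")
```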

Minor February 16, 2026

Some customers in Prod1 are experiencing issues with secret evaluations.

Detected by Pingoru
Feb 16, 2026, 07:37 PM UTC
Resolved
Feb 16, 2026, 08:09 PM UTC
Duration
31m
Affected: Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise (CIE) - Mac Cloud Builds, Continuous Integration Enterprise (CIE) - Windows Cloud Builds, Continuous Integration Enterprise (CIE) - Linux Cloud Builds, Security Testing Orchestration (STO), Service Reliability Management (SRM), Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI)
Timeline · 5 updates
  1. investigating Feb 16, 2026, 07:37 PM UTC

    We are currently investigating this issue.

  2. identified Feb 16, 2026, 07:41 PM UTC

    The issue has been identified and we are rolling out mitigations.

  3. monitoring Feb 16, 2026, 07:58 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Feb 16, 2026, 08:09 PM UTC

    This incident has been resolved.

  5. postmortem Mar 02, 2026, 05:20 PM UTC

    ## **Summary** On **February 16, 2026, at 7:04 PM UTC**, an incident occurred that caused pipeline execution failures for workflows using cross-scope secrets. Affected customers were unable to execute pipelines that referenced secrets stored in a different scope than the pipeline’s execution scope. The issue was introduced as part of a recent production change. The system was rolled back to the previous stable release to restore normal functionality. ## **Root Cause** A regression introduced in a recent release impacted the secret resolution flow for cross-scope secret managers. Under specific conditions, the system incorrectly resolved the scope of the secret manager during decryption, causing the lookup to fail. As a result, pipelines referencing cross-scope secrets were unable to complete secret evaluation successfully. The issue was not detected prior to production deployment due to a gap in automated test coverage for cross-scope decryption scenarios. ## **Impact** Customers using cross-scope secrets experienced pipeline execution failures. Pipelines that relied on secrets stored in a different scope than the pipeline execution scope were unable to run successfully during the impact window. Other pipeline executions that did not use cross-scope secret managers were unaffected. ## **Mitigation** **Immediate** * Rolled back the production deployment to the previous stable release. * Functionality was fully restored following the rollback. **Permanent** * Enhanced test coverage to include cross-scope secret evaluation scenarios. * Updated automation to validate pipeline execution using non-default Secret Managers. * Reviewed and improved alerting for error rate monitoring related to bulk encryption and secret resolution flows. * Reviewed deployment rollback processes to prevent similar environment inconsistencies. * Strengthened code review validation for scope-sensitive changes. ## **Action Items** To prevent such issues in the future: * Add comprehensive automation coverage for cross-scope secret decryption workflows. * Enhance and further solidify CI/CD validation for non-default Secret Manager scenarios. * Improve monitoring and alerting for secret resolution error rates. (A simplified illustration of cross-scope secret resolution appears below.)

Read the full incident report →
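
For readers unfamiliar with the term, "cross-scope" here means a pipeline at one scope (say, a project) referencing a secret stored at a higher scope (org or account). The sketch below is a simplified, hypothetical illustration of resolving the scope prefix on a secret reference before decryption; it is not Harness's actual resolution code.

```python
# Hypothetical sketch of cross-scope secret reference resolution.
# Harness-style references such as "account.mySecret" or "org.mySecret"
# point at secrets stored above the project; a bare name stays local.

def resolve_secret_scope(reference, current_scope):
    """Return (scope_identifiers, secret_name) for a secret reference.

    current_scope example: {"account": "acct1", "org": "org1", "project": "proj1"}
    """
    if reference.startswith("account."):
        return {"account": current_scope["account"]}, reference[len("account."):]
    if reference.startswith("org."):
        return (
            {"account": current_scope["account"], "org": current_scope["org"]},
            reference[len("org."):],
        )
    # No prefix: look the secret up in the pipeline's own scope.
    return dict(current_scope), reference

# A project-scoped pipeline referencing an org-level secret:
scope, name = resolve_secret_scope(
    "org.dbPassword",
    {"account": "acct1", "org": "org1", "project": "proj1"},
)
# The regression described above resolved the wrong scope at this step,
# so the secret manager lookup and decryption failed for such references.
```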

Major February 13, 2026

SEI - Team Settings and Efficiency Insights Inaccessible

Detected by Pingoru
Feb 13, 2026, 06:51 AM UTC
Resolved
Feb 13, 2026, 10:21 AM UTC
Duration
3h 29m
Affected: Software Engineering Insights (SEI)
Timeline · 5 updates
  1. investigating Feb 13, 2026, 04:05 PM UTC

    On February 12-13, 2026, the Software Engineering Insights (SEI) module experienced a service disruption lasting approximately 3.5 hours. During this period, users were unable to access team settings, efficiency profiles, and efficiency insights features.

  2. identified Feb 13, 2026, 04:05 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Feb 13, 2026, 04:06 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Feb 13, 2026, 04:06 PM UTC

    This incident has been resolved.

  5. postmortem Feb 20, 2026, 06:41 PM UTC

    ## **Summary** Customers in the Prod2, Prod3, and EU production clusters experienced page load errors when accessing Team Settings, Efficiency Profiles, and Efficiency Insights during a limited window from February 12, 11:30 PM PST to February 13, 2:21 AM PST. No customer data was lost. Data ingestion, processing, and other core product capabilities remained fully operational throughout the incident. SEI 1.0 customers were not impacted. ## **Root Cause** During a production deployment, the Harness team identified a process bug in the migration step. As a result, the application and database schema became temporarily out of sync. This schema mismatch caused specific pages (Team Settings, Efficiency Profiles, and Efficiency Insights) to fail to load properly. ## **Impact** The following functionalities were impacted during the incident window: * Team Settings * Efficiency Profiles * Efficiency Insights Users attempting to access these areas experienced page load errors. **Duration:** February 12, 11:30 PM PST – February 13, 2:21 AM PST ### What was not impacted? The following systems and capabilities remained fully operational: * Data ingestion and processing * Productivity Insights * BA Insights * AI Insights * Integrations * Org tree and other metadata operational flows * Customers using SEI 1.0 No customer data was lost. ## **Remediation** The issue was resolved by executing the required database migration procedure in the production environment, restoring synchronisation between the application and database. ## **Action Items** To prevent recurrence and strengthen reliability, we are implementing the following improvements: #### 1. Full Automation of Database Migrations All production database migration steps will be integrated into our CI/CD deployment pipeline, eliminating manual execution and reducing the risk of human error. #### 2. Functional Automation and Monitoring We are converting schema validation into automated functional test coverage. This ensures that any application–database mismatch is detected immediately during deployment. #### 3. Strengthened Deployment Checkpoints We are enhancing our post-deployment automation to detect and prevent such issues (see the illustrative sketch below).

Read the full incident report →
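
As a rough illustration of action items 1 and 3 (a hypothetical checkpoint, not the actual SEI deployment pipeline), a post-deployment check can compare the schema version the application expects against the highest migration actually applied, and block the rollout if they disagree.

```python
# Hypothetical post-deployment schema checkpoint; the table name and version
# values are assumptions, not the actual SEI schema.

EXPECTED_SCHEMA_VERSION = 42  # assumed: the version this application build expects

def check_schema_version(conn):
    """Block the rollout if the database schema lags behind the application.

    `conn` is any DB-API connection that supports .execute() (e.g. sqlite3).
    """
    row = conn.execute(
        "SELECT MAX(version) FROM schema_migrations"  # hypothetical migrations table
    ).fetchone()
    applied = row[0] if row and row[0] is not None else 0
    if applied < EXPECTED_SCHEMA_VERSION:
        raise RuntimeError(
            f"Applied schema version {applied} is behind expected "
            f"{EXPECTED_SCHEMA_VERSION}; blocking rollout until migrations run"
        )
```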

Minor February 12, 2026

Dashboard Data Not Displaying as Expected in Prod 3

Detected by Pingoru
Feb 12, 2026, 12:56 PM UTC
Resolved
Feb 12, 2026, 02:29 PM UTC
Duration
1h 33m
Affected: Custom Dashboards
Timeline · 4 updates
  1. investigating Feb 12, 2026, 12:56 PM UTC

    We are currently investigating this issue.

  2. monitoring Feb 12, 2026, 01:56 PM UTC

    A fix has been implemented and we are monitoring the results.

  3. resolved Feb 12, 2026, 02:29 PM UTC

    This incident has been resolved.

  4. postmortem Feb 18, 2026, 06:08 PM UTC

    **Summary** On **February 11th, 2026**, Harness Dashboards experienced an issue with data staleness within the Prod3 environment. During this period, newly added data was not reflected in the dashboards. **Root Cause** An unexpected spike in load caused resource exhaustion. This interruption prevented new data changes from being reflected in the replicas that power the dashboards. **Impact** Customers saw stale data, as no new changes were visible in the dashboards; the approximate duration of impact was from 12:30 PM PST to 05:30 AM PST. The issue was limited to the dashboards, and all other functionality worked as expected. No customer data was lost or damaged. The issue only affected data visibility on the dashboards during the specified timeframe. **Mitigation and Permanent Fix** To mitigate, we increased capacity to handle the spike in workload. **Action Items** To prevent such issues from recurring, we are enhancing monitoring and early detection by adding replication lag alerts with severity thresholds (see the illustrative sketch below). We will also proactively make the system more resilient and robust by running performance tests and fine-tuning it.

Read the full incident report →
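
A minimal sketch of the replication-lag alerting mentioned above, with assumed severity thresholds (the real alert definitions and values are not public):

```python
# Hypothetical replication-lag check with tiered severity thresholds.
WARN_LAG_SECONDS = 60        # assumed: dashboards may start looking slightly stale
CRITICAL_LAG_SECONDS = 300   # assumed: page on-call, data visibly out of date

def classify_replication_lag(lag_seconds):
    """Map how far a replica lags the primary onto an alert severity."""
    if lag_seconds >= CRITICAL_LAG_SECONDS:
        return "critical"
    if lag_seconds >= WARN_LAG_SECONDS:
        return "warning"
    return "ok"

# Example: a 7-minute lag would page immediately rather than waiting for
# customers to report stale dashboards.
assert classify_replication_lag(420) == "critical"
```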

Notice February 9, 2026

GitHub is experiencing degraded performance. This could affect your pipelines.

Detected by Pingoru
Feb 09, 2026, 05:34 PM UTC
Resolved
Feb 09, 2026, 05:49 PM UTC
Duration
14m
Affected: Security Testing Orchestration (STO), Continuous Integration Enterprise (CIE) - Self Hosted Runners, Continuous Integration Enterprise (CIE) - Mac Cloud Builds, Continuous Integration Enterprise (CIE) - Windows Cloud Builds, Continuous Integration Enterprise (CIE) - Linux Cloud Builds, Infrastructure as Code Management (IaCM)
Timeline · 4 updates
  1. investigating Feb 09, 2026, 05:32 PM UTC

    We are currently investigating this issue.

  2. monitoring Feb 09, 2026, 05:34 PM UTC

    Systems are being monitored for any downstream issues. Github Status Page: https://www.githubstatus.com/

  3. resolved Feb 09, 2026, 05:49 PM UTC

    Github has reported that all services are up and running without issue. https://www.githubstatus.com/incidents/ffz2k716tlhx

  4. postmortem Feb 18, 2026, 07:28 PM UTC

    [https://www.githubstatus.com/incidents/ffz2k716tlhx](https://www.githubstatus.com/incidents/ffz2k716tlhx)

Read the full incident report →

Looking to track Harness downtime and outages?

Pingoru polls Harness's status page every 5 minutes and alerts you the moment it reports an issue — before your customers do.

  • Real-time alerts when Harness reports an incident
  • Email, Slack, Discord, Microsoft Teams, and webhook notifications
  • Track Harness alongside 5,000+ providers in one dashboard
  • Component-level filtering
  • Notification groups + maintenance calendar
Start monitoring Harness for free

5 free monitors · No credit card required