- Detected by Pingoru: May 02, 2026, 12:01 AM UTC
- Resolved: Apr 28, 2026, 05:21 PM UTC
- Duration: —
Timeline · 1 update
- resolved · May 02, 2026, 12:01 AM UTC
Between April 28 and May 1, 2026, a subset of scheduled change requests failed to execute at their scheduled times. The issue was introduced by a production deployment on April 28 and fixed with a new deployment on May 1. All affected scheduled jobs that had not been manually withdrawn by customers have since been executed successfully.
- Detected by Pingoru: May 01, 2026, 04:28 PM UTC
- Resolved: May 01, 2026, 08:10 PM UTC
- Duration: 3h 41m
Affected: Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise (CIE) - Self Hosted Runners, Continuous Integration Enterprise (CIE) - Mac Cloud Builds, Continuous Integration Enterprise (CIE) - Windows Cloud Builds, Continuous Integration Enterprise (CIE) - Linux Cloud Builds, FME
Timeline · 6 updates
- investigating · May 01, 2026, 04:28 PM UTC
We are currently investigating this issue.
- monitoring · May 01, 2026, 04:56 PM UTC
A fix has been implemented and we are monitoring the results.
- monitoring · May 01, 2026, 05:14 PM UTC
We are continuing to monitor for any further issues.
- monitoring · May 01, 2026, 05:15 PM UTC
We are continuing to monitor for any further issues.
- monitoring · May 01, 2026, 07:58 PM UTC
We are largely mitigated and most pipelines are running normally. We are monitoring all parameters to make sure there are no issues before closing the incident.
- resolved · May 01, 2026, 08:10 PM UTC
This incident has been resolved.
- Detected by Pingoru: May 01, 2026, 03:02 PM UTC
- Resolved: May 01, 2026, 08:10 PM UTC
- Duration: 5h 8m
Affected: Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise (CIE) - Self Hosted Runners, Continuous Integration Enterprise (CIE) - Mac Cloud Builds, Continuous Integration Enterprise (CIE) - Windows Cloud Builds, Continuous Integration Enterprise (CIE) - Linux Cloud Builds, Security Testing Orchestration (STO), Internal Developer Portal (IDP)
Timeline · 4 updates
- investigating · May 01, 2026, 03:02 PM UTC
We are currently investigating this issue.
- monitoring · May 01, 2026, 03:37 PM UTC
A fix has been implemented and we are monitoring the results.
- monitoring · May 01, 2026, 07:58 PM UTC
We are largely mitigated and most pipelines are running normally. We are monitoring all parameters to make sure there are no issues before closing the incident.
- resolved · May 01, 2026, 08:10 PM UTC
This incident has been resolved.
- Detected by Pingoru: Apr 30, 2026, 11:09 PM UTC
- Resolved: Apr 30, 2026, 11:43 PM UTC
- Duration: 33m
Affected: Feature Flags (FF)
Timeline · 4 updates
- investigating · Apr 30, 2026, 11:09 PM UTC
We are currently investigating this issue.
- identified · Apr 30, 2026, 11:19 PM UTC
The issue has been identified and a fix is underway.
- monitoring · Apr 30, 2026, 11:29 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved · Apr 30, 2026, 11:43 PM UTC
This incident has been resolved.
- Detected by Pingoru: Apr 30, 2026, 04:25 PM UTC
- Resolved: Apr 30, 2026, 05:50 PM UTC
- Duration: 1h 25m
Affected: Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Chaos Engineering, Continuous Integration Enterprise (CIE) - Self Hosted Runners, Continuous Integration Enterprise (CIE) - Mac Cloud Builds, Continuous Integration Enterprise (CIE) - Windows Cloud Builds, Continuous Integration Enterprise (CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI), Code Repository, Artifact Registry, Platform, FME
Timeline · 3 updates
- investigating · Apr 30, 2026, 04:25 PM UTC
We are currently investigating this issue.
- identified · Apr 30, 2026, 05:27 PM UTC
The issue has been identified and mitigated.
- resolved · Apr 30, 2026, 07:08 PM UTC
This incident has been resolved.
- Detected by Pingoru: Apr 30, 2026, 03:57 PM UTC
- Resolved: Apr 30, 2026, 04:14 PM UTC
- Duration: 17m
Affected: FME
Timeline · 2 updates
- investigating · Apr 30, 2026, 03:57 PM UTC
We are currently investigating this issue.
- resolved · Apr 30, 2026, 04:14 PM UTC
This incident has been resolved.
- Detected by Pingoru: Apr 28, 2026, 09:48 PM UTC
- Resolved: Apr 28, 2026, 06:27 PM UTC
- Duration: —
Timeline · 1 update
- resolved · Apr 28, 2026, 09:48 PM UTC
Requests made by SDKs with Rule Based Segments (RBS) support could have received a null instead of an empty response for payloads with no Rule Based Segments, leading to a null pointer exception. Impact was observed from 18:27 until 19:05 UTC. For affected SDKs, the impact would have been a failure to initialize or to process an update. Any change made to feature flags or RBS in an affected environment regenerates any remaining stale caches.
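As an illustrative aside, a minimal sketch of the defensive client-side handling that avoids this failure mode, assuming a JSON payload and a `ruleBasedSegments` field name invented for this example (the report does not show the real SDK wire format):

```python
# Defensive parsing sketch: treat a null rule-based-segments field as an
# empty collection instead of crashing. The payload shape and field name
# are assumptions, not the real SDK wire format.
import json

def parse_rule_based_segments(payload: str) -> list:
    body = json.loads(payload)
    # `or []` normalizes an explicit null to an empty list, so downstream
    # iteration cannot raise the equivalent of a null pointer exception.
    return body.get("ruleBasedSegments") or []

print(parse_rule_based_segments('{"ruleBasedSegments": null}'))  # -> []
```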
- Detected by Pingoru: Apr 28, 2026, 03:28 PM UTC
- Resolved: Apr 28, 2026, 07:03 PM UTC
- Duration: 3h 34m
Affected: Cloud Cost Management (CCM), Infrastructure as Code Management (IaCM)
Timeline · 1 update
- investigating · Apr 28, 2026, 03:28 PM UTC
We are currently investigating this issue.
- Detected by Pingoru: Apr 28, 2026, 11:51 AM UTC
- Resolved: Apr 28, 2026, 02:16 PM UTC
- Duration: 2h 24m
Affected: Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Infrastructure as Code Management (IaCM), Service Reliability Management (SRM), Feature Flags (FF)
Timeline · 2 updates
- investigating · Apr 28, 2026, 11:51 AM UTC
We are currently investigating this issue.
- monitoring · Apr 28, 2026, 12:14 PM UTC
A fix has been implemented and we are monitoring the results.
- Detected by Pingoru: Apr 27, 2026, 10:27 PM UTC
- Resolved: Apr 27, 2026, 08:00 PM UTC
- Duration: —
Timeline · 2 updates
- resolved · Apr 27, 2026, 10:27 PM UTC
We were seeing slowness while executing pipelines.
- postmortem · Apr 29, 2026, 07:53 PM UTC

## Summary
On April 27, 2026, customers running pipelines in the Prod3 environment experienced intermittent slowness in pipeline execution and delays in execution status updates in the UI. The cause was an unexpected load spike that created contention on a backend database supporting pipeline orchestration. The issue was mitigated and fully resolved.

## Impact
Incident window: April 27, 2026, 1:00 PM – 3:12 PM PDT

- Pipeline executions ran slower than normal; some took longer than expected to complete, and pipelines with stricter timeouts could fail.
- No widespread pipeline failures were observed.
- The execution view in the UI lagged behind real-time pipeline progress.

There was no data loss. The majority of pipelines continued to execute successfully; the primary impact was increased latency and delayed UI updates.

## Root Cause
Pipeline orchestration relies on a backend database to track execution state and power the execution view in the UI. During the incident, a spike in load increased query latency across the orchestration layer. This created a backlog, causing UI updates to lag behind actual pipeline execution until the system was scaled.

## Remediation
Immediate mitigation:

- Scaled up the affected database instance to increase CPU capacity
- Reduced query latency and eliminated lock contention
- Cleared the execution-view update backlog within ~30 minutes

These actions restored normal pipeline performance and UI responsiveness.

## Action Items
To prevent such issues from happening again:

- Capacity improvements: updated the Prod3 capacity baseline to prevent similar resource constraints
- Proactive detection: enhancing monitoring and alerting for backend resource utilization, lock contention, and critical query latency
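As an editorial illustration of the "proactive detection" action item, a minimal sketch of lock-contention alerting, assuming a PostgreSQL backend and the psycopg2 driver; the report does not name the actual database or tooling:

```python
# Hypothetical monitor for the kind of lock contention described above.
# Assumes a PostgreSQL backend and the psycopg2 driver; the real
# orchestration store and alerting stack are not named in the report.
import psycopg2

CONTENTION_THRESHOLD = 10  # alert when this many sessions wait on locks

def check_lock_contention(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # pg_stat_activity exposes one row per session; wait_event_type
        # is 'Lock' for sessions blocked on a heavyweight lock.
        cur.execute(
            "SELECT count(*) FROM pg_stat_activity "
            "WHERE wait_event_type = 'Lock'"
        )
        waiting = cur.fetchone()[0]
        if waiting >= CONTENTION_THRESHOLD:
            print(f"ALERT: {waiting} sessions waiting on locks")

if __name__ == "__main__":
    check_lock_contention("dbname=orchestration")  # hypothetical DSN
```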
- Detected by Pingoru: Apr 27, 2026, 04:05 PM UTC
- Resolved: Apr 27, 2026, 04:24 PM UTC
- Duration: 19m
Affected: Continuous Integration Enterprise (CIE) - Mac Cloud Builds, Continuous Integration Enterprise (CIE) - Windows Cloud Builds, Continuous Integration Enterprise (CIE) - Linux Cloud Builds
Timeline · 3 updates
- investigating · Apr 27, 2026, 04:05 PM UTC
We are currently investigating this issue.
- identified · Apr 27, 2026, 04:08 PM UTC
The issue has been identified and a fix is being implemented.
- resolved · Apr 27, 2026, 04:24 PM UTC
This incident has been resolved.
- Detected by Pingoru: Apr 24, 2026, 04:06 PM UTC
- Resolved: Apr 24, 2026, 06:59 PM UTC
- Duration: 2h 52m
Affected: Feature Flags (FF)
Timeline · 4 updates
- investigating · Apr 24, 2026, 07:16 PM UTC
We are currently investigating this issue.
- monitoring · Apr 24, 2026, 07:29 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved · Apr 24, 2026, 08:01 PM UTC
This incident has been resolved.
- postmortem · Apr 30, 2026, 07:16 PM UTC

### Summary
On April 24, 2026, a large non-batched bulk DELETE operation on the prod-2 primary database triggered lock contention, causing Feature Flag API latency and hung queries across multiple customer SDKs.

### Impact
1. Slow SDK auth/init: SDKs took longer than expected to complete evaluations
2. Elevated latency across many FF APIs
3. Limited to the Feature Flags module in prod-2
4. No data loss

### Root Cause
A background cleanup job executed a non-batched, single-transaction delete, causing lock contention and API latency spikes.

### Mitigation
Immediately terminated the offending queries.

### Next Steps / Action Items
To prevent such issues from happening again, we are working on:
1. Enhanced alerting and observability for long-running queries
2. Permanently replacing the large single-transaction delete pattern with smaller batched deletes
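For illustration, a minimal sketch of the batched-delete pattern named in the action items, using an in-memory SQLite database as a stand-in for the real store; table and column names are invented. Each small transaction commits and releases its locks quickly, so readers are never blocked for the duration of the full cleanup.

```python
# Sketch of batched deletes replacing one giant single-transaction DELETE.
# SQLite stands in for the production database; schema is invented.
import sqlite3

BATCH_SIZE = 1000  # small transactions keep lock hold times short

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stale_rows (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO stale_rows (payload) VALUES (?)",
    [("x",)] * 5000,
)
conn.commit()

# Delete in small batches; each iteration commits and releases its locks
# before the next batch starts.
while True:
    cur = conn.execute(
        "DELETE FROM stale_rows WHERE id IN "
        "(SELECT id FROM stale_rows LIMIT ?)",
        (BATCH_SIZE,),
    )
    conn.commit()
    if cur.rowcount == 0:
        break
```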
- Detected by Pingoru: Apr 24, 2026, 05:23 AM UTC
- Resolved: Apr 24, 2026, 05:43 AM UTC
- Duration: 19m
Affected: Feature Flags (FF)
Timeline · 2 updates
- investigating · Apr 24, 2026, 05:23 AM UTC
We are currently investigating this issue.
- resolved · Apr 24, 2026, 05:43 AM UTC
This incident has been resolved.
- Detected by Pingoru: Apr 19, 2026, 02:09 PM UTC
- Resolved: Apr 19, 2026, 03:24 PM UTC
- Duration: 1h 15m
Affected: Infrastructure as Code Management (IaCM)
Timeline · 5 updates
- investigating · Apr 19, 2026, 02:09 PM UTC
We are currently investigating this issue.
- identified · Apr 19, 2026, 02:20 PM UTC
The issue has been identified and a fix is being implemented.
- monitoring · Apr 19, 2026, 03:17 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved · Apr 19, 2026, 03:24 PM UTC
This incident has been resolved. Customers using a pinned version older than plugins/harness_terraform:0.214.0 should update to the latest version by following the guidance at https://developer.harness.io/docs/continuous-integration/use-ci/set-up-build-infrastructure/harness-ci/#specify-the-harness-ci-images-used-in-your-pipelines. If you are not pinning a specific version, no action is required; your pipelines are already using the updated image.
- postmortem · Apr 30, 2026, 07:04 PM UTC

## Summary
On April 19, 2026, Terraform-based IaCM pipelines failed across production environments due to an issue with Terraform binary verification during runtime. The issue was caused by an expired OpenPGP signing key in a third-party library used to validate Terraform downloads. This resulted in failures when pipelines attempted to install Terraform dynamically.

## Impact
- Terraform-based IaCM pipelines failed during execution
- Failures occurred at runtime when attempting to download and verify Terraform binaries

Unaffected:
- OpenTofu-based pipelines
- Pipelines using pre-installed or cached Terraform binaries

Customers pinning plugin versions older than the fixed release continued to experience failures until they upgraded.

## Root Cause
The IaCM Terraform plugin relies on a third-party library (HashiCorp's `hc-install`) to download and verify Terraform binaries.
- The library contained a hardcoded OpenPGP signing key
- This key expired, causing verification failures during Terraform installation
- HashiCorp had not yet released an updated version with a renewed key

## Remediation
Immediate mitigation:
- Released IaCM Terraform plugin v0.214.0, which bypasses the expired signature verification step and continues secure downloads over HTTPS

Resolution:
- Rolled out the fix across prod0–prod4
- Pipeline execution functionality was restored

## Customer Actions Required
- Customers using pinned plugin versions older than v0.214.0 must upgrade to v0.214.0 or later
- No action required for customers using default/latest plugin versions

## Prevention & Next Steps
We are implementing the following improvements:
- Dependency monitoring: proactive monitoring for third-party certificate and key expirations
- Upstream coordination: track the HashiCorp release with an updated signing key, and re-enable signature verification once available
- Customer communication: notify customers using older pinned versions
- Operational improvements: enhance validation of external dependencies in runtime workflows
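For illustration only, a sketch of the interim approach the remediation describes: fetching a binary over HTTPS and checking a pinned SHA256 digest rather than an OpenPGP signature. The URL and digest are placeholders, not real release artifacts, and this is not the plugin's actual code:

```python
# Illustrative only: download over HTTPS and verify a pinned SHA256 digest
# instead of an OpenPGP signature. URL and digest below are placeholders.
import hashlib
import urllib.request

ARTIFACT_URL = "https://example.com/terraform.zip"   # placeholder
EXPECTED_SHA256 = "0" * 64                           # placeholder digest

def fetch_and_verify(url: str, expected: str) -> bytes:
    # HTTPS provides transport integrity; the digest check pins the content.
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected:
        raise ValueError(f"checksum mismatch: {digest}")
    return data
```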
- Detected by Pingoru: Apr 16, 2026, 07:15 AM UTC
- Resolved: Apr 16, 2026, 12:02 PM UTC
- Duration: 4h 46m
Affected: Continuous Integration Enterprise (CIE) - Linux Cloud Builds
Timeline · 6 updates
- investigating · Apr 16, 2026, 10:06 AM UTC
We are currently investigating this issue.
- identified · Apr 16, 2026, 10:14 AM UTC
The issue has been identified and a fix is being implemented.
- identified · Apr 16, 2026, 11:08 AM UTC
We are continuing to work on a fix for this issue.
- monitoring · Apr 16, 2026, 11:41 AM UTC
A fix has been implemented and we are monitoring the results.
- resolved · Apr 16, 2026, 12:02 PM UTC
This incident has been resolved.
- postmortem · Apr 23, 2026, 03:34 PM UTC

On April 16, 2026, Hosted CI Linux pipelines experienced intermittent initialization failures due to an upstream outage affecting package repositories used during environment setup.

### Impact
A subset of customers running Hosted CI pipelines encountered failures during the initialization phase, preventing jobs from starting successfully.
- Affected: 24 accounts
- Failed executions: 129 pipelines
- Impact duration: ~5 hours 49 minutes

### Root Cause
The issue was caused by a service disruption in an external package repository provider. During CI environment provisioning, dependency installation requests to this upstream service timed out, causing initialization failures.

### Remediation
Immediate mitigation: We updated runner configurations to bypass dependency installation during initialization and rolled out updated environments across affected clusters.

Permanent fix: We have improved resilience in our CI infrastructure by:
- Using pre-configured environments with required dependencies pre-installed
- Eliminating the runtime dependency on external package repositories during initialization
- Enhancing failure handling for external dependency timeouts

### Action Items / Next Steps
- Continue improving isolation from external dependencies during environment startup
- Strengthen monitoring and alerting for upstream service degradation
- Optimize rollout speed for infrastructure changes to reduce mitigation time
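As a sketch of the "pre-installed dependencies" remediation theme, the following hedged example prefers a binary already baked into the image and only falls back to a timeout-bounded network install; paths, commands, and the helper name are invented:

```python
# Illustrative only: prefer a pre-baked binary; fall back to a network
# install bounded by a timeout so an upstream repo outage fails fast
# instead of hanging environment initialization. Names are invented.
import shutil
import subprocess

def ensure_tool(name: str, install_cmd: list[str], timeout_s: int = 60) -> str:
    path = shutil.which(name)
    if path:
        return path  # pre-configured environment already has it; no network
    # Bounded fallback: raises subprocess.TimeoutExpired rather than hanging.
    subprocess.run(install_cmd, check=True, timeout=timeout_s)
    return shutil.which(name) or name

# e.g. ensure_tool("git", ["apt-get", "install", "-y", "git"])
```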
- Detected by Pingoru: Apr 15, 2026, 11:55 PM UTC
- Resolved: Apr 16, 2026, 04:15 AM UTC
- Duration: 4h 20m
Affected: Feature Flags (FF)
Timeline · 5 updates
- investigating · Apr 15, 2026, 11:55 PM UTC
We are currently investigating this issue.
- investigating · Apr 16, 2026, 12:01 AM UTC
We are continuing to investigate this issue.
- monitoring · Apr 16, 2026, 04:04 AM UTC
A fix has been implemented and we are monitoring the results.
- resolved · Apr 16, 2026, 06:05 PM UTC
This incident has been resolved.
- postmortem · Apr 30, 2026, 05:34 PM UTC

## Summary
On April 15, 2026, between approximately 23:21 UTC and 01:58 UTC, customers using Feature Flags in the prod2 environment experienced delays in feature flag updates. Feature flag changes made via the UI or API were successfully processed but were not immediately reflected, causing stale flag values to be served.

## Impact
- Scope: customers in the prod2 environment only
- Customer impact: feature flag updates were delayed or appeared ineffective, and applications continued serving stale configurations
- Other environments: no impact to prod0, prod1, or other regions

## Root Cause
The issue was caused by replication lag in the read replica database used for serving feature flag reads. A long-running read query on the replica blocked replication updates from the primary database, delaying propagation of recent feature flag changes to read queries.

What triggered the issue: a high-volume API usage pattern involving large paginated queries on target data. These queries became resource-intensive, impacting the database.

## Mitigation
Immediate actions:
- Identified and terminated long-running queries on the replica
- Replication resumed and flag updates began reflecting correctly

## Prevention & Next Steps
We are continuing to strengthen reliability by:
- Configuring the replica to automatically cancel queries that block replication beyond a threshold, and tuning query timeouts for heavy read operations
- Improving query efficiency and pagination strategies
- Enhancing monitoring and alerting for replication lag
- Evaluating database upgrades and scaling improvements
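As an illustration of the prevention items, a hypothetical replica guard, assuming PostgreSQL streaming replication and the psycopg2 driver (the report does not name the actual database):

```python
# Hypothetical replica guard matching the mitigation described above:
# report replication delay and terminate long-running replica reads.
# Assumes PostgreSQL and psycopg2; thresholds are invented.
import psycopg2

MAX_QUERY_SECONDS = 300  # cancel replica reads running longer than this

def cancel_long_replica_queries(replica_dsn: str) -> None:
    with psycopg2.connect(replica_dsn) as conn, conn.cursor() as cur:
        # Replication delay as seen from the replica.
        cur.execute("SELECT now() - pg_last_xact_replay_timestamp()")
        print("replication delay:", cur.fetchone()[0])
        # Terminate long-running read queries that can block WAL replay.
        cur.execute(
            "SELECT pg_terminate_backend(pid) FROM pg_stat_activity "
            "WHERE state = 'active' "
            "AND now() - query_start > make_interval(secs => %s) "
            "AND pid <> pg_backend_pid()",
            (MAX_QUERY_SECONDS,),
        )
```

If the replica is PostgreSQL, the built-in `max_standby_streaming_delay` setting also cancels standby queries that block replay beyond a threshold, without external tooling.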
- Detected by Pingoru: Apr 13, 2026, 04:47 PM UTC
- Resolved: Apr 13, 2026, 05:31 PM UTC
- Duration: 43m
Affected: Data processing
Timeline · 3 updates
- monitoring · Apr 13, 2026, 04:47 PM UTC
Feature flag metrics impact calculations were not updating. This issue does not impact experiment calculations. A fix is being rolled out. There is no data loss.
- resolved · Apr 13, 2026, 05:31 PM UTC
After monitoring following the fix, all systems are back to normal and processing metrics impact calculations regularly. We will provide an RCA soon.
- postmortem · Apr 20, 2026, 03:56 PM UTC

## Summary
On April 13, 2026, FME metrics impact calculations stopped updating. The root cause was a bug introduced in the software upgrade/release process.

## Root Cause
An internal library upgrade included in the release caused a runtime issue in a legacy execution pathway.

## Impact
Feature flag metrics impact calculations were not updating. This issue did not impact experiment calculations, and there was no data loss.

## Mitigation
To mitigate, we immediately rolled back the update.

## Action Items
To prevent such issues from happening again, we are working on fixing gaps in monitoring and alerting for the metrics impact calculations flow.
- Detected by Pingoru: Apr 09, 2026, 05:30 AM UTC
- Resolved: Apr 09, 2026, 10:48 AM UTC
- Duration: 5h 18m
Affected: Continuous Integration Enterprise (CIE) - Self Hosted Runners, Continuous Integration Enterprise (CIE) - Mac Cloud Builds, Continuous Integration Enterprise (CIE) - Windows Cloud Builds, Continuous Integration Enterprise (CIE) - Linux Cloud Builds
Timeline · 5 updates
- investigating · Apr 09, 2026, 08:34 AM UTC
Connectivity from some legacy Run Test steps to the test intelligence service is failing intermittently. We are currently investigating the issue.
- identified · Apr 09, 2026, 09:38 AM UTC
The issue has been identified and a fix is being implemented.
- monitoring · Apr 09, 2026, 10:38 AM UTC
A fix has been implemented and we are monitoring the results.
- resolved · Apr 09, 2026, 10:48 AM UTC
This incident has been resolved.
- postmortem · Apr 21, 2026, 01:18 PM UTC

## Summary
On April 8, 2026, customers in certain production environments experienced degraded performance and intermittent failures while accessing the platform. This impacted login functionality and execution of new and existing tasks.

## Root Cause
A spike in internal task processing caused excessive load on the service, leading to resource exhaustion and degraded performance across multiple service instances.

## Impact
Customers in affected environments experienced:
- Slowness and failures during login
- Inability to start new tasks in some cases
- Failures in ongoing executions

## Remediation
Immediate: Stabilized the system by resetting affected components and restoring service capacity, which allowed the platform to recover.

Permanent: Introduced safeguards to limit resource-intensive operations and prevent unbounded processing under high-load conditions.

## Action Items
To prevent such issues from happening again, Harness will:
- Add limits to high-volume internal processing paths
- Audit and enforce safeguards across similar workflows
- Improve system resilience under burst-load scenarios
- Enhance monitoring to detect abnormal load patterns earlier
- Detected by Pingoru: Apr 09, 2026, 01:05 AM UTC
- Resolved: Apr 09, 2026, 02:42 AM UTC
- Duration: 1h 36m
Affected: Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise (CIE) - Mac Cloud Builds, Continuous Integration Enterprise (CIE) - Windows Cloud Builds, Continuous Integration Enterprise (CIE) - Linux Cloud Builds, Feature Flags (FF), Platform, FME
Timeline · 5 updates
- investigating · Apr 09, 2026, 01:43 AM UTC
We are currently investigating this issue.
- identified · Apr 09, 2026, 02:02 AM UTC
The issue has been identified and a fix is being implemented.
- monitoring · Apr 09, 2026, 02:33 AM UTC
A fix has been implemented and we are monitoring the results.
- resolved · Apr 09, 2026, 02:42 AM UTC
This incident has been resolved.
- postmortem · Apr 16, 2026, 09:22 PM UTC

## Summary
On April 8, 2026, customers in Prod1 and Prod2 experienced degraded performance when logging into the Harness platform. Additionally, in Prod2, customers were unable to start new pipeline executions and some running pipelines failed. The issue lasted approximately 1 hour and 35 minutes.

## Root Cause
The issue was caused by a sudden surge of task reassignment requests triggered after customer delegate restarts. This resulted in a high volume of backend processing requests that exceeded expected limits, leading to elevated resource utilization and degraded performance of the Harness Manager service.

## Impact
- Customers in Prod1 and Prod2 experienced login failures and degraded user operations.
- Customers in Prod2 were unable to start new pipeline executions, and some ongoing executions failed.
- All customers in the affected clusters experienced service degradation during the incident window.

## Remediation
Immediate:
- Restarted affected services and stabilized system performance, restoring login and pipeline functionality.

Permanent:
- Introduced safeguards to limit backend processing for large task reassignment scenarios.
- Identifying and applying limits to similar high-volume operations to prevent resource exhaustion.

## Action Items
To prevent such issues from happening again:
- Implement query limits for high-volume task processing scenarios.
- Audit and enforce limits across similar backend operations to improve resilience.
- Enhance monitoring and alerting for abnormal spikes in task reassignment and resource utilization.
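As an editorial sketch of the "limits on high-volume task processing" action item, a minimal token-bucket limiter; the class, rates, and handler are invented for illustration and are not Harness's actual implementation:

```python
# Illustrative token-bucket limiter for capping surges such as the task
# reassignment spike described above. All names and numbers are invented.
import threading
import time

class TokenBucket:
    """Allow roughly `rate` operations/second, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

reassignment_limiter = TokenBucket(rate=100, capacity=500)  # invented numbers

def handle_reassignment(task_id: str) -> None:
    if not reassignment_limiter.try_acquire():
        # Shed or defer work instead of letting a surge exhaust the service.
        print(f"deferring reassignment of {task_id}")
        return
    print(f"processing reassignment of {task_id}")
```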
- Detected by Pingoru: Apr 08, 2026, 01:09 PM UTC
- Resolved: Apr 08, 2026, 03:00 PM UTC
- Duration: 1h 51m
Affected: Continuous Delivery - Next Generation (CDNG)
Timeline · 3 updates
- investigating · Apr 08, 2026, 01:09 PM UTC
We are currently investigating this issue.
- monitoring · Apr 08, 2026, 01:54 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved · Apr 08, 2026, 07:44 PM UTC
This incident has been resolved.
- Detected by Pingoru: Apr 07, 2026, 10:54 AM UTC
- Resolved: Apr 07, 2026, 11:08 AM UTC
- Duration: 13m
Affected: Internal Developer Portal (IDP)
Timeline · 4 updates
- monitoring · Apr 07, 2026, 10:54 AM UTC
A fix has been implemented and we are monitoring the results.
- monitoring · Apr 07, 2026, 10:59 AM UTC
We are continuing to monitor for any further issues.
- resolved · Apr 07, 2026, 11:08 AM UTC
This incident has been resolved.
- postmortem · Apr 21, 2026, 12:58 PM UTC

## Summary
On April 7, 2026, customers using the Internal Developer Portal (IDP) in certain production environments experienced a service disruption in which the IDP UI became inaccessible. Users encountered errors when attempting to access the module.

## Root Cause
A configuration change introduced during a routine deployment prevented the system from correctly routing incoming requests to the IDP service, resulting in loss of access to the UI.

## Impact
Customers using IDP in affected environments were unable to access the portal UI during the incident window. Other modules and environments remained unaffected.

## Remediation
Immediate: Rolled back the recent configuration change and restored service routing, which recovered access to the IDP module.

Permanent: Implemented additional safeguards in the deployment process to validate configuration changes and ensure compatibility before rollout.

## Action Items
To prevent such issues from happening again, we are taking the following steps:
- Enhance our release process and UI validation
- Improve monitoring and alerting for early detection of routing issues
- Detected by Pingoru: Apr 03, 2026, 04:53 PM UTC
- Resolved: Apr 04, 2026, 01:59 PM UTC
- Duration: 21h 5m
Affected: Custom Dashboards
Timeline · 5 updates
- investigating · Apr 03, 2026, 04:53 PM UTC
We are currently investigating this issue.
- identified · Apr 03, 2026, 04:59 PM UTC
The issue has been identified and a fix is being implemented.
- monitoring · Apr 03, 2026, 09:29 PM UTC
A fix has been implemented and we are monitoring the results.
- monitoring · Apr 03, 2026, 09:30 PM UTC
We are continuing to monitor for any further issues.
- resolved · Apr 04, 2026, 01:59 PM UTC
This incident has been resolved.
- Detected by Pingoru: Apr 02, 2026, 03:13 PM UTC
- Resolved: Apr 02, 2026, 05:53 PM UTC
- Duration: 2h 40m
Affected: Continuous Delivery - Next Generation (CDNG)
Timeline · 3 updates
- investigating · Apr 02, 2026, 03:13 PM UTC
We are investigating a degradation in CI steps when using AWS connectors and inherited authentication.
- resolved · Apr 02, 2026, 05:53 PM UTC
This incident has been resolved.
- postmortem · Apr 17, 2026, 03:29 PM UTC

## Summary
On April 2, 2026, customers experienced failures in CI pipelines during S3 upload steps following a routine delegate upgrade. The issue primarily impacted customers using cross-account AWS role assumption with inherit-from-delegate connectors.

## Impact
A small number of customers across Prod1 and Prod2 running CI pipelines with S3 upload steps and cross-account role assumption experienced artifact upload failures, blocking downstream deployments.

## Root Cause
A change introduced during the delegate upgrade altered how AWS credentials were passed to CI steps. This resulted in partial credentials being provided to the S3 upload plugin, which triggered a latent issue in the plugin's credential selection logic. Instead of executing the intended cross-account role assumption flow, the plugin attempted authentication using incomplete credentials, leading to failures.

## Mitigation
- Rolled back the delegate to the previous stable version
- Restored the original credential handling behavior
- Service functionality recovered immediately after the rollback

## Next Steps
To prevent such issues from happening again, we will:
- Improve validation of credential handling in CI steps
- Expand automated test coverage for cross-account scenarios
- Reintroduce the changes behind feature flags with full end-to-end testing
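To make the credential-selection pitfall concrete, a hypothetical sketch of validating credential completeness before choosing an auth flow; the field names and flow labels are invented, not the plugin's actual code:

```python
# Illustration of validating credential completeness before choosing an
# auth flow, per the root cause above. All names are invented.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AwsCredentials:
    access_key_id: Optional[str] = None
    secret_access_key: Optional[str] = None
    session_token: Optional[str] = None
    role_arn: Optional[str] = None

def choose_auth_flow(creds: AwsCredentials) -> str:
    static_complete = bool(creds.access_key_id and creds.secret_access_key)
    if creds.role_arn:
        # Cross-account role assumption: do not fall back to static keys
        # just because *some* key material happens to be present.
        return "assume-role"
    if static_complete:
        return "static-keys"
    # Partial credentials are an error, not a signal to guess a flow.
    raise ValueError("incomplete AWS credentials: refusing to pick an auth flow")
```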
- Detected by Pingoru: Apr 01, 2026, 06:44 AM UTC
- Resolved: Apr 02, 2026, 08:40 AM UTC
- Duration: 1d 1h
Affected: Cloud Cost Management (CCM)
Timeline · 5 updates
- identified · Apr 01, 2026, 06:16 AM UTC
The issue has been identified and a fix is being implemented.
- identified · Apr 01, 2026, 06:44 AM UTC
Status update: CCM AutoStopping functionality for the AWS cloud provider is currently impacted due to increased latency from AWS in the me-south-1 region. This is affecting multiple operations, including warm-up, cool-down, schedule execution, and traffic detection. In addition, CCM Asset Governance functionality is also impacted for resources in the me-south-1 region. We are actively working on isolating/excluding the affected region to restore functionality for the remaining customers. Resources within the me-south-1 region may continue to experience issues until the region fully recovers.
- monitoring · Apr 01, 2026, 08:20 AM UTC
Update: AutoStopping functionality for AWS has been restored for all regions except me-south-1. The issue was caused by elevated latency from AWS in the affected region, impacting operations such as warm-up, cool-down, schedule execution, and traffic detection. We have now isolated this region to prevent impact on other customers. Resources in me-south-1 will continue to experience the issue until the region fully recovers. We are actively monitoring the situation and will provide further updates as available.
- resolved · Apr 02, 2026, 08:40 AM UTC
This incident has been resolved.
- postmortem · Apr 14, 2026, 03:30 AM UTC
The issue was caused by elevated latency from AWS in the me-south-1 region, impacting operations such as warm-up, cool-down, schedule execution, and traffic detection. We isolated this region to prevent impact on other customers; resources in me-south-1 continued to experience the issue until the region fully recovered.
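As an illustration of the region-isolation mitigation described above, a minimal sketch; the region set and function names are invented and this is not CCM's actual code:

```python
# Sketch of region isolation: exclude a degraded region so its latency
# cannot stall operations in healthy regions. Names are invented.
DEGRADED_REGIONS = {"me-south-1"}

def regions_to_process(all_regions: list[str]) -> list[str]:
    healthy = [r for r in all_regions if r not in DEGRADED_REGIONS]
    skipped = set(all_regions) - set(healthy)
    if skipped:
        print(f"skipping degraded regions: {sorted(skipped)}")
    return healthy

# Example: warm-up/cool-down scheduling iterates only healthy regions.
for region in regions_to_process(["us-east-1", "me-south-1", "eu-west-1"]):
    print("processing", region)
```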
- Detected by Pingoru: Mar 25, 2026, 10:00 AM UTC
- Resolved: Mar 25, 2026, 12:30 PM UTC
- Duration: 2h 30m
Affected: Security Testing Orchestration (STO)
Timeline · 3 updates
- investigating · Mar 25, 2026, 03:53 PM UTC
We are currently investigating this issue.
- resolved · Mar 25, 2026, 03:53 PM UTC
This incident has been resolved.
- postmortem · Mar 26, 2026, 07:53 PM UTC

## Summary
On March 25, 2026, between approximately 3:30 PM and 6:00 PM IST, the STO service in the Prod1 environment experienced intermittent failures while processing scan uploads. This resulted in step failures for some pipeline executions during the incident window.

## Root Cause
During a scheduled internal data backfill activity, the STO service experienced increased database load. Concurrently, a recent change in the scan upload processing path introduced additional latency under these conditions. The combination of elevated load and increased query execution time caused some scan upload requests to exceed processing thresholds and fail. Retry attempts further amplified system load, leading to intermittent failures.

## Impact
- Intermittent scan upload failures (500 errors) during pipeline execution
- Some pipelines experienced step failures or delays due to retries
- No impact to previously uploaded scan results or other STO functionality

## Mitigation/Remediation
Immediate:
- Stopped the internal backfill activity to reduce database load
- Optimized the scan upload processing query

Permanent:
- Introduced safeguards for background jobs to prevent impact on production workloads
- Improved performance of critical database paths
- Enhanced monitoring to detect abnormal load and retry amplification earlier

## Action Items
To prevent such issues from happening again:
- Implement throttling and isolation for background/backfill jobs
- Add protections for critical request paths under load
- Improve alerting on database latency and retry patterns
- Strengthen validation for production-like load conditions
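As an editorial sketch of the retry-amplification theme above, a minimal bounded retry with exponential backoff and full jitter, which spreads retries out instead of hammering an already-overloaded service; the upload callable is a placeholder:

```python
# Illustrative bounded retry with exponential backoff and full jitter,
# the standard antidote to retry amplification. `upload` is a placeholder.
import random
import time

def upload_with_backoff(upload, max_attempts: int = 4) -> None:
    for attempt in range(max_attempts):
        try:
            upload()
            return
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up; do not keep hammering an overloaded service
            # Full jitter: sleep a random amount between 0 and 2**attempt s.
            time.sleep(random.uniform(0, 2 ** attempt))
```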