Harness incident

K8s Customer Billing Data will be outdated

Minor Resolved

Harness experienced a minor incident on May 5, 2026 affecting Cloud Cost Management (CCM), lasting 19m. The incident has been resolved; the full update timeline is below.

Started
May 05, 2026, 07:57 AM UTC
Resolved
May 05, 2026, 08:17 AM UTC
Duration
19m
Detected by Pingoru
May 05, 2026, 07:57 AM UTC

Affected components

Cloud Cost Management (CCM)

Update timeline

  1. investigating May 05, 2026, 07:57 AM UTC

    We are currently experiencing an issue: K8s customer billing data will be stale.

  2. resolved May 05, 2026, 08:17 AM UTC

    This incident has been resolved.

  3. postmortem May 20, 2026, 12:02 AM UTC

    ## **Summary**

    On May 4, 2026, after a service deployment in Prod2, background schema migrations did not complete successfully. As a result, some cluster and perspective-related data appeared missing or stale for Elevance and a small number of other Prod2 customers.

    The service itself remained available, but the database schema was not updated to match the new application code. This caused downstream billing and cluster data processing jobs to fail or skip expected data updates.

    ## **Impact**

    * Affected customers saw missing or stale cluster and perspective data in the UI.
    * Data appeared to stop updating around April 30 for impacted accounts.
    * Elevance was the primary impacted customer, along with a few other Prod2 customers.
    * There was no full service downtime.
    * Once remediated, the missing data was backfilled.

    ## **Root Cause**

    During the Prod2 deployment, the background schema migration process attempted to acquire a Redis distributed lock before running Timescale database migrations. The lock acquisition failed immediately due to a `PersistentLockException`, likely because another replica or overlapping deployment process was holding the lock. Since the migration used a zero-wait lock acquisition path, it did not retry and the migration did not run.

    The failure was logged only as a generic warning rather than a clear production error. Because of this, the migration failure was not immediately surfaced through alerts, and the service continued running with the database schema behind the application code expectations.

    ## **Remediation**

    The service was redeployed in Prod2. On restart, the background migrations completed successfully. The team then reran the required job to backfill the missing data and restore expected cluster and perspective data visibility.

    ## **Preventive Actions**

    * Update background migration locking behavior to wait/retry when acquiring the migration lock.
    * Improve logging from generic warnings to explicit error logs when schema migrations are skipped.
    * Add alerting for failed or skipped production schema migrations.
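
The first two preventive actions (wait/retry on the migration lock, and an explicit error when a migration is skipped) can be illustrated with a minimal Python sketch. This is not Harness's implementation: `run_migration_with_lock`, `MigrationSkippedError`, and all parameters are hypothetical names, and a local `threading.Lock` stands in for the Redis distributed lock described in the root cause. The idea is only to contrast a zero-wait attempt with a bounded retry loop that fails loudly.

```python
import logging
import threading
import time

logger = logging.getLogger("schema_migration")


class MigrationSkippedError(RuntimeError):
    """Hypothetical error raised when the migration lock is never acquired."""


def run_migration_with_lock(lock, migrate, wait_seconds=30.0, poll_seconds=0.5,
                            sleep=time.sleep, clock=time.monotonic):
    """Retry a non-blocking lock acquisition until a deadline, then migrate.

    `lock` is assumed to expose acquire(blocking=False) -> bool and release();
    threading.Lock is used here only as a stand-in for a distributed lock.
    Returns the number of attempts it took to acquire the lock.
    """
    deadline = clock() + wait_seconds
    attempt = 0
    while True:
        attempt += 1
        if lock.acquire(blocking=False):
            try:
                migrate()
                logger.info("schema migration completed on attempt %d", attempt)
                return attempt
            finally:
                # Always release so other replicas are not blocked.
                lock.release()
        if clock() >= deadline:
            # Log at ERROR (not WARNING) and raise, so a skipped migration
            # is visible to alerting instead of passing silently.
            logger.error("schema migration SKIPPED: lock not acquired "
                         "after %d attempts", attempt)
            raise MigrationSkippedError(
                f"migration lock not acquired after {attempt} attempts")
        sleep(poll_seconds)
```

With `wait_seconds=0` the function behaves like the zero-wait path: one failed attempt ends the run, but as a raised error and an ERROR-level log line rather than a generic warning. An alert on that log line or exception would correspond to the third preventive action.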