Codefresh incident

We are experiencing issues with viewing GitOps-related pages in the UI

Major, Resolved

Codefresh experienced a major incident on November 13, 2023, affecting the Codefresh GitOps UI and lasting 6h 49m. The incident has been resolved; the full update timeline is below.

Started
Nov 13, 2023, 01:32 PM UTC
Resolved
Nov 13, 2023, 08:21 PM UTC
Duration
6h 49m
Detected by Pingoru
Nov 13, 2023, 01:32 PM UTC

Affected components

Codefresh GitOps UI

Update timeline

  1. investigating Nov 13, 2023, 01:32 PM UTC

    We are currently investigating this issue.

  2. investigating Nov 13, 2023, 02:23 PM UTC

    We are continuing to investigate this issue.

  3. identified Nov 13, 2023, 03:48 PM UTC

    The issue has been identified and a fix is being implemented.

  4. monitoring Nov 13, 2023, 05:47 PM UTC

    A fix has been implemented and we are monitoring the results.

  5. identified Nov 13, 2023, 06:04 PM UTC

    We have identified an additional issue and are working on a fix.

  6. monitoring Nov 13, 2023, 06:51 PM UTC

    We have implemented additional fixes to restore UI functionality, and we are monitoring the results.

  7. resolved Nov 13, 2023, 08:21 PM UTC

    This incident has been resolved.

  8. postmortem Nov 20, 2023, 08:01 PM UTC

    We have completed our RCA for this incident. The summary is below.

    **Impact:** Any UI page that relied on displaying runtime-related information was significantly disrupted, leaving users with incomplete or unavailable data.

    **Detection:** This issue was reported to us by customers.

    **Root Cause:** An API change had an unexpected side effect: the event handler no longer recognized runtime events as runtimes and instead treated them as generic-entities. When the change was reverted, the entries in the generic-entities collection were no longer updated, and an automatic cleaning function then caused some UI data queries to return incorrect data.

    **Resolution:** After resolving the root cause, we rebuilt the required data and reinitialized the runtime information. As a result of this incident, we have identified improvements to our E2E testing process and monitoring systems, which we will be implementing.
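
The failure mode described in the root cause, a handler that silently files unrecognized events under a catch-all collection, can be illustrated with a minimal sketch. Everything here is hypothetical: the `kind` field, the renamed value `"Runtime"`, the collection names other than generic-entities, and the function names are assumptions made for illustration and do not reflect Codefresh's actual code or schema. The second function shows one possible guard, counting fallback hits so that a monitor or E2E test could notice the misrouting.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from event kind to storage collection.
KNOWN_KINDS = {"runtime": "runtimes"}
FALLBACK = "generic-entities"


def route_event(event: dict, store: dict) -> str:
    """Store the event under the collection for its kind.

    Unknown kinds fall back to the generic collection without any signal,
    which is the kind of silent misrouting the postmortem describes.
    """
    collection = KNOWN_KINDS.get(event.get("kind"), FALLBACK)
    store[collection][event["id"]] = event
    return collection


def route_event_monitored(event: dict, store: dict, fallbacks: Counter) -> str:
    """Same routing, but count every fallback so it can be alerted on."""
    collection = route_event(event, store)
    if collection == FALLBACK:
        fallbacks[event.get("kind")] += 1
    return collection


if __name__ == "__main__":
    store = defaultdict(dict)
    fallbacks = Counter()
    # Before the (assumed) API change: the event is recognized as a runtime.
    print(route_event_monitored({"id": "rt-1", "kind": "runtime"}, store, fallbacks))
    # After the (assumed) change: same entity, but the kind no longer matches.
    print(route_event_monitored({"id": "rt-1", "kind": "Runtime"}, store, fallbacks))
    print(dict(fallbacks))
```

A check that asserts the fallback counter stays at zero for known entity types is one way such a regression could surface before customers report it.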