Codefresh incident

Some Classic builds are stuck in Pending state

Major · Resolved

Codefresh experienced a major incident on September 23, 2024 that affected the Codefresh Classic Pipeline Engine and lasted 6h 40m. The incident has been resolved; the full update timeline is below.

Started: Sep 23, 2024, 01:34 PM UTC
Resolved: Sep 23, 2024, 08:14 PM UTC
Duration: 6h 40m
Detected by Pingoru: Sep 23, 2024, 01:34 PM UTC

Affected components

Codefresh Classic Pipeline Engine

Update timeline

  1. investigating Sep 23, 2024, 01:34 PM UTC

    We are currently investigating this issue.

  2. identified Sep 23, 2024, 01:39 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Sep 23, 2024, 02:21 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Sep 23, 2024, 08:14 PM UTC

    This incident has been resolved.

  5. postmortem Oct 10, 2024, 02:43 PM UTC

    **Impact**: Some accounts sporadically experienced longer-than-usual pending times on a portion of their builds for a day.

    **Detection**: The issue was reported by a customer and confirmed shortly afterward by Codefresh’s platform monitoring alerts.

    **Root Cause**: The issue was caused by a bug in the MongoDB driver. The driver had been upgraded in Codefresh services as part of an effort to improve performance, but the new version contained a bug that caused Mongoose queries to hang under heavy load without returning or throwing errors. As a result, the Codefresh build manager randomly got stuck when enough queries were hanging under certain conditions.

    **Resolution**: A temporary fix that improved the behavior of the build query queue was implemented first to alleviate the issue for affected customers. The actual root cause was identified the following week, and the issue was resolved by downgrading the MongoDB driver to a version that did not contain the bug.
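
The failure mode described in the root cause (a query that hangs without returning or throwing) is hard to detect because nothing ever fails. One common guard is to put a deadline on each call so that a hang becomes an explicit error. The sketch below illustrates that idea only; it is not Codefresh's code or their fix (which was a driver downgrade), and it is written in Python for brevity even though Mongoose is a Node.js library. The names `with_deadline`, `QueryTimeout`, `hanging_query`, and `healthy_query` are hypothetical.

```python
import asyncio


class QueryTimeout(Exception):
    """Raised when a query does not settle within its deadline."""


async def with_deadline(query, seconds, label="query"):
    """Await `query`, but raise QueryTimeout if it has not settled after `seconds`."""
    try:
        # wait_for cancels the awaited query when the timeout expires.
        return await asyncio.wait_for(query, timeout=seconds)
    except asyncio.TimeoutError:
        raise QueryTimeout(f"{label} did not settle within {seconds}s") from None


async def hanging_query():
    # Hypothetical stand-in for a driver call that never returns and never raises.
    await asyncio.Event().wait()


async def healthy_query():
    # Hypothetical stand-in for a query that completes normally.
    await asyncio.sleep(0)
    return {"status": "pending"}
```

With a wrapper like this, a caller such as a build scheduler would see a `QueryTimeout` it can log, retry, or alert on, instead of waiting indefinitely on a call that never settles.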