Featureflow incident

A network issue to our backend database is causing intermittent disruption

Major Resolved View vendor source →

Featureflow experienced a major incident on September 13, 2023 affecting Admin API and Event Server and 1 more component, lasting 2h 6m. The incident has been resolved; the full update timeline is below.

Started
Sep 13, 2023, 06:19 PM UTC
Resolved
Sep 13, 2023, 08:26 PM UTC
Duration
2h 6m
Detected by Pingoru
Sep 13, 2023, 06:19 PM UTC

Affected components

Admin APIEvent ServerSDK Server

Update timeline

  1. investigating Sep 13, 2023, 06:19 PM UTC

    We are currently investigating this issue.

  2. investigating Sep 13, 2023, 08:07 PM UTC

    Issue has been identified in a rogue process causing database CPU issues. We are recovering services currently.

  3. investigating Sep 13, 2023, 08:18 PM UTC

    Feaatureflow SDK and Event Server is recovered. Admin Api is being recovered.

  4. investigating Sep 13, 2023, 08:23 PM UTC

    All Services are operational.

  5. resolved Sep 13, 2023, 08:26 PM UTC

    This incident has been resolved.

  6. postmortem Sep 13, 2023, 09:11 PM UTC

    ### Outage Summary The issue was caused by excessive ‘CPU Steal’ on our backend database causing connection issues and timeouts. We temporarily and intentionally dropped connections by scaling down the servers to scale the database cluster in minimal time. Services were restored after the cluster was upgraded, starting with the event and SDK servers. ### Outage impact: * 99.74% of feature flag requests were returned successfully via our global CDN * ~10,000 Successful requests per minute for feature flags. * 18 Failed responses per minute for feature flags. * All feature analytic event postings were affected * Sep 13, 18:19 UTC to Sep 13, 20:18 UTC - the SDK and Event APIs were degraded. * Sep 13, 18:19 UTC to Sep 13, 20:23 UTC - the Admin APIs were degraded.