Featureflow incident
A network issue to our backend database is causing intermittent disruption
Featureflow experienced a major incident on September 13, 2023 affecting Admin API and Event Server and 1 more component, lasting 2h 6m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Sep 13, 2023, 06:19 PM UTC
We are currently investigating this issue.
- investigating Sep 13, 2023, 08:07 PM UTC
Issue has been identified in a rogue process causing database CPU issues. We are recovering services currently.
- investigating Sep 13, 2023, 08:18 PM UTC
Feaatureflow SDK and Event Server is recovered. Admin Api is being recovered.
- investigating Sep 13, 2023, 08:23 PM UTC
All Services are operational.
- resolved Sep 13, 2023, 08:26 PM UTC
This incident has been resolved.
- postmortem Sep 13, 2023, 09:11 PM UTC
### Outage Summary The issue was caused by excessive ‘CPU Steal’ on our backend database causing connection issues and timeouts. We temporarily and intentionally dropped connections by scaling down the servers to scale the database cluster in minimal time. Services were restored after the cluster was upgraded, starting with the event and SDK servers. ### Outage impact: * 99.74% of feature flag requests were returned successfully via our global CDN * ~10,000 Successful requests per minute for feature flags. * 18 Failed responses per minute for feature flags. * All feature analytic event postings were affected * Sep 13, 18:19 UTC to Sep 13, 20:18 UTC - the SDK and Event APIs were degraded. * Sep 13, 18:19 UTC to Sep 13, 20:23 UTC - the Admin APIs were degraded.