Harness incident

FME SDK is experiencing elevated error rates and delayed responses.

Minor Resolved

Harness experienced a minor incident on May 8, 2026 affecting three FME components, lasting 1h 47m. The incident has been resolved; the full update timeline is below.

Started
May 08, 2026, 05:05 PM UTC
Resolved
May 08, 2026, 06:53 PM UTC
Duration
1h 47m
Detected by Pingoru
May 08, 2026, 05:05 PM UTC

Affected components

FME (3 components)

Update timeline

  1. investigating May 08, 2026, 05:05 PM UTC

    We are currently investigating this issue.

  2. monitoring May 08, 2026, 06:01 PM UTC

    A fix has been implemented and we are monitoring the results.

  3. resolved May 08, 2026, 06:53 PM UTC

    This incident has been resolved.

  4. postmortem May 18, 2026, 05:26 PM UTC

    ### Summary

    On May 8, 2026, between approximately 17:00 and 18:37 UTC, a small fraction of SDK requests to [sdk.split.io](http://sdk.split.io/) returned errors or experienced elevated latency. The cache that fronts [sdk.split.io](http://sdk.split.io/) continued to serve the vast majority of requests normally throughout the event. The issue was mitigated by increasing backend capacity, and all systems returned to baseline by 18:37 UTC.

    ### Root Cause

    [sdk.split.io](http://sdk.split.io/) is served by a CDN that caches flag data and forwards a small percentage of requests to a backend service when cached data is not available. During this event, an elevated rate of cache updates caused the CDN to forward roughly 40x its normal volume of requests to the backend. Sustained over several hours, this exceeded the backend's available capacity, leading to elevated latency, increased error rates on requests forwarded to the backend, and intermittent service restarts during recovery.

    ### Impact

    * The CDN continued to serve 97.5% of requests directly from cache as normal across the affected window.
    * The remaining requests, which were forwarded to the backend, saw elevated latency or errors (errors affected approximately 0.35% of total SDK traffic during the window) until backend capacity was restored.
    * SDKs continued to evaluate flags normally using their locally cached flag data; no flag evaluation correctness issues occurred. New SDK instances that received an error while initializing retried the request, signaled a timeout, or both, depending on their configuration.
    * No data loss occurred.

    ### Remediation

    * Increased backend capacity across multiple dimensions, including service replicas, compute and memory per replica, and database connection capacity.
    * Service stabilized at the increased capacity profile after the changes were applied.

    ### Action Items

    * Allow additional backend capacity parameters to be tuned at runtime for quicker scaling response.
    * Add monitoring and alerting on identified leading indicators for earlier detection of saturation trends.
    * Adopt more responsive autoscaling for the backend so it can react to traffic surges more aggressively than the current autoscaler.
    * Improve load-shedding so the backend drops excess requests cleanly under saturation rather than queueing to failure.
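
The load-shedding action item can be illustrated with a minimal Python sketch. It is not Harness's implementation: the `LoadShedder` class, the `max_in_flight` parameter, and the 503/200 status tuples are hypothetical names chosen for illustration. The idea shown is only that a saturated backend rejects excess requests immediately rather than letting them queue until they time out.

```python
import threading


class LoadShedder:
    """Hypothetical helper: admit a bounded number of concurrent requests."""

    def __init__(self, max_in_flight: int) -> None:
        # One semaphore slot per request allowed to run at the same time.
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, work):
        # Non-blocking acquire: when every slot is taken, fail fast
        # instead of queueing the request behind the others.
        if not self._slots.acquire(blocking=False):
            return 503, None  # shed; the caller may retry with backoff
        try:
            return 200, work()
        finally:
            # Always free the slot, even if work() raises.
            self._slots.release()
```

In this sketch a shed request costs the backend almost nothing, so capacity stays available for the requests that were admitted. A client that receives the rejection can retry later or signal a timeout, as the postmortem describes for initializing SDKs.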