Flagsmith incident

Increased error rates on the Edge API

Minor · Resolved

Flagsmith experienced a minor incident on January 18, 2024 affecting the Edge API, lasting 1h 9m. The incident has been resolved; the full update timeline is below.

Started
Jan 18, 2024, 02:40 PM UTC
Resolved
Jan 18, 2024, 03:50 PM UTC
Duration
1h 9m
Detected by Pingoru
Jan 18, 2024, 02:40 PM UTC

Affected components

Edge API

Update timeline

  1. investigating Jan 18, 2024, 02:40 PM UTC

    We are currently investigating this issue.

  2. identified Jan 18, 2024, 02:44 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Jan 18, 2024, 02:52 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. investigating Jan 18, 2024, 03:06 PM UTC

    Issues are still persisting for integrations using the Go client. We are investigating further.

  5. investigating Jan 18, 2024, 03:06 PM UTC

    We are continuing to investigate this issue.

  6. identified Jan 18, 2024, 03:15 PM UTC

    We have identified the remaining issue and are implementing a fix. ETA for full resolution: 15 minutes.

  7. identified Jan 18, 2024, 03:34 PM UTC

    We have rolled back in certain affected regions and have completed the work for the permanent fix. This is in the final stages of testing now and will be rolled out imminently.

  8. identified Jan 18, 2024, 03:36 PM UTC

    We are continuing to work on a fix for this issue.

  9. identified Jan 18, 2024, 03:43 PM UTC

    We are deploying the permanent fix now.

  10. resolved Jan 18, 2024, 05:33 PM UTC

    This incident has been resolved.

  11. postmortem Jan 18, 2024, 05:34 PM UTC

    ### Timeline

    At around 13:45 today, we deployed a change to resolve a validation issue introduced in a release earlier in the day. That validation issue affected only requests that provided a numeric value for the identity identifier. However, the new validation also required the traits key to be present (not omitted), which some of our clients do not guarantee: the Go client, for example, omits the traits key when the list is empty. As a result, valid requests from these clients for identities with no traits were incorrectly rejected as invalid.

    Once we received alerts from our monitoring and reports from some affected customers, we began investigating. At 14:54 we deployed a change that resolved the validation issue in some cases, but not all. At 15:06 we therefore decided to roll back the affected regions, and at 15:48 we deployed a permanent fix, including additional test cases to cover this behaviour.

    ### Impact

    Because the affected requests were those with no traits, the impact was fairly limited and no trait data has been lost. Some identities will not have been created during this period; however, due to the nature of the Flagsmith integration, subsequent calls to identify those users will create them.

    ### Next Steps

    We have already been working on improving our release process for the Edge API. The first step, due to be released next week, is to have our automated releases roll back based on a number of additional alerting factors, including a more granular view of our error rates. This will ensure that, in future, a small subset of errors like this triggers an immediate automated rollback.
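To illustrate why a more granular view of error rates matters for automated rollback, here is a minimal sketch in Python. It is not Flagsmith's release tooling: the segment names, the 5% threshold, the minimum request count, and the functions `segments_over_threshold` and `should_roll_back` are all hypothetical. The idea is only that a failure confined to one client can stay below an aggregate threshold while clearly exceeding it within its own segment.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SegmentStats:
    """Request counts for one traffic segment (e.g. one SDK) in a time window."""

    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid division by zero for segments with no traffic.
        return self.errors / self.requests if self.requests else 0.0


def segments_over_threshold(
    stats: dict[str, SegmentStats],
    max_error_rate: float = 0.05,  # assumed threshold, for illustration only
    min_requests: int = 100,  # assumed floor so tiny segments do not trigger on noise
) -> list[str]:
    """Return the names of segments whose error rate exceeds the threshold."""
    return sorted(
        name
        for name, s in stats.items()
        if s.requests >= min_requests and s.error_rate > max_error_rate
    )


def should_roll_back(stats: dict[str, SegmentStats]) -> bool:
    """Roll back if any single segment breaches the threshold."""
    return bool(segments_over_threshold(stats))
```

With illustrative numbers such as 10 errors in 10,000 requests for one SDK and 200 errors in 500 requests for another, the combined error rate is 2%, under the assumed 5% threshold, yet the second segment alone is at 40% and would trigger the rollback.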
    The next step after this is to create a more comprehensive end-to-end testing suite that exercises each of our SDKs, verifying that all integrations remain compatible with any new changes.
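The failure mode described in the timeline, and the kind of compatibility check an SDK test suite might include, can be sketched as follows. This is an illustration in Python under stated assumptions, not the Edge API's actual validation code: the payload shape (`identifier`, `traits`), both validator functions, and the choice to coerce numeric identifiers to strings are hypothetical. The strict validator reproduces the rejection of payloads that omit `traits`; the lenient one treats an omitted key as an empty list.

```python
from __future__ import annotations

from typing import Any


class ValidationError(ValueError):
    """Raised when an identity payload is rejected."""


def _check_identifier(payload: dict[str, Any]) -> str:
    # Assumption: string and numeric identifiers are both accepted and
    # normalised to a string. bool is excluded because it subclasses int.
    identifier = payload.get("identifier")
    if isinstance(identifier, bool) or not isinstance(identifier, (str, int)):
        raise ValidationError("identifier must be a string or a number")
    return str(identifier)


def validate_strict(payload: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical validator that requires the 'traits' key to be present."""
    identifier = _check_identifier(payload)
    if "traits" not in payload:
        raise ValidationError("traits is required")
    return {"identifier": identifier, "traits": list(payload["traits"])}


def validate_lenient(payload: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical validator that treats an omitted 'traits' key as empty."""
    identifier = _check_identifier(payload)
    traits = payload.get("traits")
    if traits is None:
        traits = []
    if not isinstance(traits, list):
        raise ValidationError("traits must be a list")
    return {"identifier": identifier, "traits": traits}


# Payload shapes a compatibility test could replay against a new release:
# one client sends an explicit empty list, another omits the key entirely.
PAYLOAD_WITH_EMPTY_TRAITS = {"identifier": "user-1", "traits": []}
PAYLOAD_WITHOUT_TRAITS = {"identifier": "user-1"}
```

A compatibility test would assert that both payload shapes produce the same validated result. Under the strict validator the second payload raises `ValidationError`, which is the behaviour the postmortem describes for clients that omit the key.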