fix: consistent SetProviderAndWait init flow#474
fix: consistent SetProviderAndWait init flow#474toddbaert merged 5 commits intoopen-feature:mainfrom
Conversation
Summary of ChangesHello @dd-oleksii, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly refactors the OpenFeature SDK's provider management, addressing critical inconsistencies and race conditions in how providers are initialized, updated, and shut down. The changes ensure a more predictable and robust lifecycle for feature flag providers, particularly by standardizing asynchronous initialization, improving error propagation, and preventing interference between old and new provider states during transitions. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request provides a significant and well-executed refactoring to address inconsistencies and bugs in the provider-setting logic, particularly between SetProvider and SetProviderAndWait. The changes make the initialization flow more consistent and robust by always initializing asynchronously and having AndWait methods wait outside of critical sections. The separation of initialization and shutdown logic into initNew and shutdownOld improves clarity. The fix for the race condition where an old provider could affect the state of a new one is also a crucial improvement. The code is of high quality. I have one suggestion to improve error handling during the shutdown of old providers.
|
The tests seem to be failing with the |
0aefdc9 to
281ed00
Compare
|
@sahidvelji thanks for the hint! I have fixed two tests that relied on events emitted before initialization to be received after initialization is complete. I fixed two tests to emit error/stale events after provider initialization is complete. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #474 +/- ##
==========================================
+ Coverage 81.66% 83.92% +2.25%
==========================================
Files 27 27
Lines 2111 2071 -40
==========================================
+ Hits 1724 1738 +14
+ Misses 302 293 -9
+ Partials 85 40 -45
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
281ed00 to
4576e34
Compare
|
It seems one of the tests timed out: https://github.com/open-feature/go-sdk/actions/runs/22098383602/job/63927536776?pr=474 |
|
For reference, this test has intermittent failures on main, though they are very rare. The failure rate increases when new changes are introduced. |
|
@dd-oleksii These improvements sound great. I will fully review later today. In the meantime, could you see if you can address the existing flakiness your fixes seem to have revealed?
|
|
oh yeah, it's the same flawed pattern in this test as well:
In the old "and wait" implementation, there was a bug that the SDK would only start listening for provider events after initialization has succeeded. So it would receive ProviderStale event after ProviderReady, despite the fact that it was issued before the initialization was attempted. In the non-waiting implementation, the subscription and initialization happen concurrently, so it's undefined what event arrives first in this situation and what the end state would be. That's why the test is flaky. This PR moves subscription start earlier in the chain (before we attempt initialization), so it's more likely that the pre-existing events arrive first — that's why it started flaking more often. There are two main issues here: First, the tests are just making a wrong assumption about the ordering of events. This one is easy to fix — we should emit events only after initialization (or make initialization return error). The harder issue is ensuring consistent ordering of pre-existing events and the initialization. I guess the simplest way to achieve that is by draining the provider's event channel before starting initialization. Another option is to send the provider ready event through the same channel, so the ordering is preserved. (In either case, it's probably best to handle this in a separate PR — this one is trying to fix too much things simultaneously. |
|
I think those tests are checking that no events are lost when the provider is replaced. |
1da59f0 to
e8eda71
Compare
OpenFeature specification defines SetProviderAndWait as a waiting version of SetProvider, or as a shortcut for waiting on provider ready event. However, currently SetProvider and SetProviderAndWait exhibit non-trivial behavior differences besides waiting. SetProvider runs initialization asynchronously and potentially concurrently with shutdown of the old provider. The API is not blocked and the application author may initialize other providers concurrently, run evaluations, etc. 🐛: in this mode, the error from initializer is ignored when updating the provider state, so fatal/error states may be not set properly. SetProviderAndWait runs initialization synchronously while holding exclusive api.mu lock. This almost completely locks OpenFeature SDK: the application author cannot initialize other providers (for different domains), configure context or hooks, evaluate feature flags, or shutdown SDK. If a provider initialization blocks forever'ish, the SDK remains unusable and is unrecoverable. Another difference is that old provider is shutdown only after new provider has successfully initialized. 🐛: if the new provider fails to initialize, the old provider is already unset in API but will never be shutdown, and the new provider is not registered with api.eventExecutor (so if it comes back online after some time, nobody listens to its events, and the state will go out of sync if old provider continues emitting events). 🐛: in both modes, given that shutdown is run concurrently with updating subscriptions in eventExecutor, it is possible for the old provider to override the state of the new provider: 1. init finishes, emits provider ready event (directly from goroutine), updates state 2. old provider emits some event during shutdown (e.g., PROVIDER_ERROR or PROVIDER_STALE), eventExecutor receives the event and updates the state to error/stale 3. new provider is registered with eventExecutor but the state is already wrong. This PR introduces a couple of changes: Make initialization flow consistent across both modes: always initialize async but make "AndWait" methods wait for initialization outside of critical section. Make init respect returned error. Always call shutdown on old provider (if it is no longer used). Always register new provider with event executor. Do this before we start init/shutdown, so the old provider cannot influence state of the new provider. Make event executor registration non-erroring by making shutdown channel buffered (there were no good way to recover from registration error). Signed-off-by: Oleksii Shmalko <oleksii.shmalko@datadoghq.com>
Signed-off-by: Oleksii Shmalko <oleksii.shmalko@datadoghq.com>
Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
d7d1776 to
b6e4bfa
Compare
|
Pushed one additional change to remove some now-dead code. Will merge Monday unless I hear objections cc @erka @sahidvelji |
|
@toddbaert This PR introduces a breaking change relative to the OpenFeature spec. According to the spec, if a provider emits an event and there is a subscriber for that event, the subscriber event handlers must be run. The test changes clearly demonstrate this breakage - they were modified to hide the underlying problem rather than address it. The |
|
@erka can you provide an actionable recommendation for @dd-oleksii ? I think I see your point, but from a usage standpoint I think this is an improvement (even if it can get better). Or do we need to wait to resolve this first? |
Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
|
Thinking about it more - I'm not 100% sure that this is a deviation from the spec. The spec uses the language of "associated provider" in most places - if a provider has been replaced I'm don't think it matters. I think this is the pre-existing behavior though - events from replaced providers we already silently discarded. I did make a change here though, which moves |
|
@toddbaert Honestly, the simplest way to address all of this complexity might be to prohibit provider replacement at the OpenFeature spec level. Do we know if providers are actually replaced in production scenarios? Let me bring up another elephant: tracking. From what I can see in the go sdk, tracking relies on the current (default/domain) provider. A client may perform an evaluation using one provider, then that provider has been replaced, and only afterward client receives a Track call. In that situation, which provider is expected to handle tracking? |
This might be outside the scope of this issue... really, but its not so much switching we are worried about, but late binding. Different modules may load at different times, and may register their own providers; we want to make sure that works as expected. It's less important that you change from one provider to another, but the requirements for these 2 things are largely the same. |
I don't think this is the case. The two tests that were failing were attaching event handlers after an event was emitted, so they should not have received it in the first place. This is not new—they are flaky on the main as well. The rest of the tests weren't failing but were modified to avoid attach-after-emit test bug. |
|
There is also a change in how This is an example func main() {
client := openfeature.NewDefaultClient()
go func() {
for {
<-time.After(time.Microsecond)
result, err := client.BooleanValueDetails(context.TODO(), "flag", false, openfeature.NewTargetlessEvaluationContext(nil))
fmt.Println(result, err)
}
}()
time.Sleep(100 * time.Microsecond)
go func() {
provider, _ := flagd.NewProvider()
_ = openfeature.SetProviderAndWait(provider)
fmt.Println("flagd set and wait done")
}()
time.Sleep(time.Second)
}Before output: //...
{false {flag bool {default-variant DEFAULT map[]}}} <nil>
flagd set and wait done
{false {flag bool { ERROR PROVIDER_NOT_READY connection not made map[]}}} error code: PROVIDER_NOT_READY: connection not made
//...After output //...
{false {flag bool {default-variant DEFAULT map[]}}} <nil>
{false {flag bool { map[]}}} PROVIDER_NOT_READY: provider not yet initialized
//...
{false {flag bool { map[]}}} PROVIDER_NOT_READY: provider not yet initialized
flagd set and wait done
{false {flag bool { ERROR PROVIDER_NOT_READY connection not made map[]}}} error code: PROVIDER_NOT_READY: connection not made
//... |
Do you mean from a different goroutine/thread? The |
|
OpenFeature Requirement 2.4.1 says The thing about This PR aims to address certain problems, but it also introduces new ones. I see it more as a breaking change not as a bug fix. |
|
@erka all the behavior you describe exists today for The only thing this PR changes is making |
|
@erka @toddbaert there has been a lot of back-and-forth on this PR over the last three weeks. Would you like to jump on a call to discuss spec/expected behavior and resolve any misunderstandings about what this PR is doing? |
|
@dd-oleksii Since the Go SDK is v1, some users might be surprised by the new behavior of package main
import (
"context"
"log/slog"
"os"
"time"
"github.com/lmittmann/tint"
flagd "github.com/open-feature/go-sdk-contrib/providers/flagd/pkg"
"github.com/open-feature/go-sdk/openfeature"
"github.com/open-feature/go-sdk/openfeature/memprovider"
)
func init() {
slog.SetDefault(slog.New(tint.NewHandler(os.Stdout, &tint.Options{
TimeFormat: "-",
})))
}
var _ openfeature.ContextAwareStateHandler = (*SlowProvider)(nil)
type SlowProvider struct {
*flagd.Provider
}
func (s *SlowProvider) InitWithContext(ctx context.Context, evaluationContext openfeature.EvaluationContext) error {
select {
case <-time.After(250 * time.Microsecond):
break
case <-ctx.Done():
return ctx.Err()
}
return s.Init(evaluationContext)
}
func (s *SlowProvider) ShutdownWithContext(ctx context.Context) error {
s.Shutdown()
return nil
}
func main() {
client := openfeature.NewDefaultClient()
gcb := func(e openfeature.EventDetails) {
slog.Error("global-event", "state", client.State(), "details", e)
}
ccb := func(e openfeature.EventDetails) {
slog.Error("client-event", "state", client.State(), "details", e)
}
openfeature.AddHandler(openfeature.ProviderError, &gcb)
client.AddHandler(openfeature.ProviderError, &ccb)
go func() {
for {
evaluate(client)
<-time.After(5 * time.Microsecond)
}
}()
openfeature.SetProviderAndWait(memprovider.NewInMemoryProvider(map[string]memprovider.InMemoryFlag{
"my-flag": {
Key: "my-flag",
DefaultVariant: "on",
Variants: map[string]any{
"on": true,
"off": false,
},
},
}))
slog.Warn("in-memory is set")
ctx, cancel := context.WithCancel(context.Background())
go func() {
provider, _ := flagd.NewProvider()
slog.Warn("flagd init")
err := openfeature.SetProviderAndWait(&SlowProvider{Provider: provider})
slog.Warn("flagd is set", "error", err)
cancel()
}()
<-ctx.Done()
evaluate(client)
}
func evaluate(client *openfeature.Client) {
v, e := client.BooleanValue(context.TODO(), "my-flag", false, openfeature.NewTargetlessEvaluationContext(nil))
slog.Info("client evaluation", "state", client.State(), "value", v, "details", e)
}Before logs - WRN in-memory is set
- WRN flagd init
- INF client evaluation state=READY value=true details=<nil>
- WRN flagd is set error="provider initialization failed: stream error: unavailable: dial tcp 127.0.0.1:8013: connect: connection refused"
- ERR global-event state=ERROR details="{ProviderName:flagd ProviderEventDetails:{Message:Provider initialization failed: provider initialization failed: stream error: unavailable: dial tcp 127.0.0.1:8013: connect: connection refused FlagChanges:[] EventMetadata:map[] ErrorCode:}}"
- INF client evaluation state=ERROR value=false details="error code: PROVIDER_NOT_READY: connection not made"After logs - WRN in-memory is set
- WRN flagd init
- INF client evaluation state=READY value=true details=<nil>
- INF client evaluation state=READY value=false details="error code: PROVIDER_NOT_READY: client did not yet finish the initialization"
- INF client evaluation state=READY value=false details="error code: PROVIDER_NOT_READY: connection not made"
- WRN flagd is set error="failed to initialize default provider \"flagd\": provider initialization failed: stream error: unavailable: dial tcp 127.0.0.1:8013: connect: connection refused"
- ERR global-event state=ERROR details="{ProviderName:flagd ProviderEventDetails:{Message:Provider initialization failed: provider initialization failed: stream error: unavailable: dial tcp 127.0.0.1:8013: connect: connection refused FlagChanges:[] EventMetadata:map[] ErrorCode:}}"
- ERR client-event state=ERROR details="{ProviderName:flagd ProviderEventDetails:{Message:Provider initialization failed: provider initialization failed: stream error: unavailable: dial tcp 127.0.0.1:8013: connect: connection refused FlagChanges:[] EventMetadata:map[] ErrorCode:}}"
- INF client evaluation state=ERROR value=false details="error code: PROVIDER_NOT_READY: connection not made"There’s now a client-event callback that didn’t happen before. The client reports After looking into this, it might make sense for the SDK to only expose a blocking |
oh, this one looks like another bug in main that we can fix: setting new provider does not reset state to not ready as it should. You can observe this behavior with Otherwise, your example demonstrates two spec violations that got fixed:
Now the questions is whether fixing these two is a breaking change or not. My argument for "not" is that these were undocumented deviations from the spec. So if anyone relied on these, they were clearly relying on undocumented behavior that was not part of the public API. But I can see the point that it might be considered a breaking change if we want to play safe (obligatory xkcd reference). @erka so how do you want to proceed? do you want me to bump the major version? do you want to bundle these changes with your branch? do you want to fix the bugs in a different way? |
|
Adding a comment here to level-set from the community meeting. The goal of the OpenFeature spec is to describe the shape of the interfaces and APIs for the SDKs. The finest details of their behaviors, for instance around things like locking, ordering, etc, in many cases is left unspecified. That doesn't mean they can't be specified within the context of the SDK. In fact, understanding such nuances should be one of the goals of the maintainers of a particular implementation. Where ordering is critical, the spec mentions it (or should be made to). In other cases, the specification lends some flexibility. The goal is to specify a set of APIs and behaviors that provides a degree of familiarity to users, not specify every possible internal state of the artifacts of an implementation. It's a separate issue where any particular behavioral, timing, or locking change constitutes a breaking change, requiring a major version release. In general, I would not recommend considering behavior-changes breaking, unless we believe they are likely to cause significant issues for users. This is a subjective evaluation; even fixing something like an unnecessary lock can break somebody's workflow; our job as maintainers is to exercise judgement on these issues, case by case. So the question here is whether the things @erka has discovered constitute breaking changes for us or not. I'll re-familiarize myself and give my evaluation soon. EDIT: When I posted this I didn't even notice @dd-oleksii referenced the same XKCD. |
|
Thanks @erka for your attention to detail. I appreciate it and I think it's good to have made these discoveries. Regarding the concern that
In all cases, the event mechanism is expected to be ready at construction time, not created lazily in init. A Go provider that creates its This PR aligns |
EventChannel()inInit()rather than at construction time may be impacted by this change; these should be verified to return a validEventChannel()beforeInit()is called.Context
OpenFeature specification defines SetProviderAndWait as a waiting version of SetProvider, or as a shortcut for waiting on provider ready event. However, currently SetProvider and SetProviderAndWait exhibit non-trivial behavior differences besides waiting.
SetProvider runs initialization asynchronously and potentially concurrently with shutdown of the old provider. The API is not blocked and the application author may initialize other providers concurrently, run evaluations, etc.
🐛: in this mode, the error from initializer is ignored when updating the provider state, so fatal/error states may be not set properly.
SetProviderAndWait runs initialization synchronously while holding exclusive api.mu lock. This almost completely locks OpenFeature SDK: the application author cannot initialize other providers (for different domains), configure context or hooks, evaluate feature flags, or shutdown SDK. If a provider initialization blocks forever'ish, the SDK remains unusable and is unrecoverable.
Another difference is that old provider is shutdown only after new provider has successfully initialized.
🐛: if the new provider fails to initialize, the old provider is already unset in API but will never be shutdown, and the new provider is not registered with api.eventExecutor (so if it comes back online after some time, nobody listens to its events, and the state will go out of sync if old provider continues emitting events).
🐛: in both modes, given that shutdown is run concurrently with updating subscriptions in eventExecutor, it is possible for the old provider to override the state of the new provider:
This PR
This PR introduces a couple of changes:
Make initialization flow consistent across both modes: always initialize async but make "AndWait" methods wait for initialization outside of critical section. Make init respect returned error.
Always call shutdown on old provider (if it is no longer used).
Always register new provider with event executor. Do this before we start init/shutdown, so the old provider cannot influence state of the new provider. Make event executor registration non-erroring by making shutdown channel buffered (there were no good way to recover from registration error).
Refactored
SetProviderXxxmethods to remove unnecessary duplication.Related Issues
Can't find any.
Notes
Follow-up Tasks
How to test