fix: consistent SetProviderAndWait init flow #474

Merged
toddbaert merged 5 commits into open-feature:main from dd-oleksii:oleksii/jj-xxuwqutpmvxy
Mar 17, 2026
Conversation

@dd-oleksii
Contributor

@dd-oleksii dd-oleksii commented Feb 16, 2026

⚠️ Providers that lazily initialize their EventChannel() in Init() rather than at construction time may be impacted by this change; these should be verified to return a valid EventChannel() before Init() is called.

Context

The OpenFeature specification defines SetProviderAndWait as a waiting version of SetProvider, or as a shortcut for waiting on the provider ready event. However, SetProvider and SetProviderAndWait currently exhibit non-trivial behavior differences besides waiting.

SetProvider runs initialization asynchronously and potentially concurrently with shutdown of the old provider. The API is not blocked and the application author may initialize other providers concurrently, run evaluations, etc.

🐛: in this mode, the error from the initializer is ignored when updating the provider state, so fatal/error states may not be set properly.

SetProviderAndWait runs initialization synchronously while holding the exclusive api.mu lock. This almost completely locks the OpenFeature SDK: the application author cannot initialize other providers (for different domains), configure context or hooks, evaluate feature flags, or shut down the SDK. If a provider's initialization blocks forever'ish, the SDK remains unusable and is unrecoverable.

Another difference is that in this mode the old provider is shut down only after the new provider has successfully initialized.

🐛: if the new provider fails to initialize, the old provider is already unset in the API but will never be shut down, and the new provider is not registered with api.eventExecutor (so if it comes back online after some time, nobody listens to its events, and the state will go out of sync if the old provider continues emitting events).

🐛: in both modes, given that shutdown is run concurrently with updating subscriptions in eventExecutor, it is possible for the old provider to override the state of the new provider:

  1. init finishes, emits provider ready event (directly from goroutine), updates state
  2. old provider emits some event during shutdown (e.g., PROVIDER_ERROR or PROVIDER_STALE), eventExecutor receives the event and updates the state to error/stale
  3. new provider is registered with eventExecutor but the state is already wrong.

This PR

This PR introduces a couple of changes:

Make initialization flow consistent across both modes: always initialize async but make "AndWait" methods wait for initialization outside of critical section. Make init respect returned error.

Always call shutdown on old provider (if it is no longer used).

Always register the new provider with the event executor. Do this before we start init/shutdown, so the old provider cannot influence the state of the new provider. Make event executor registration non-erroring by making the shutdown channel buffered (there was no good way to recover from a registration error).

Refactored SetProviderXxx methods to remove unnecessary duplication.
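
The flow listed above can be sketched roughly as follows. This is a simplified illustration and not the SDK's actual code: `api`, `provider`, `setProvider`, `setProviderAndWait`, and `errBoom` are hypothetical names, event executor registration is left out, and resetting the state to NOT_READY on swap is a choice made by the sketch. It only shows the shape of "initialize in a goroutine, honor the returned error, and let the waiting variant block outside the mutex".

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errBoom is a hypothetical initialization failure used by the demo.
var errBoom = errors.New("boom")

// provider is a hypothetical stand-in for an OpenFeature provider.
type provider struct {
	name    string
	initErr error
}

func (p *provider) Init() error { return p.initErr }
func (p *provider) Shutdown()   {}

// api is a hypothetical, heavily simplified SDK state holder.
type api struct {
	mu      sync.Mutex
	current *provider
	state   string // "NOT_READY", "READY", or "ERROR"
}

// setProvider swaps the provider under the lock, then initializes the new
// provider and shuts down the old one in a goroutine. The returned channel
// yields the init error (or nil) exactly once.
func (a *api) setProvider(p *provider) <-chan error {
	a.mu.Lock()
	old := a.current
	a.current = p
	a.state = "NOT_READY"
	a.mu.Unlock() // the lock is released before Init runs

	done := make(chan error, 1)
	go func() {
		err := p.Init()
		a.mu.Lock()
		if a.current == p { // skip if p was replaced in the meantime
			if err != nil {
				a.state = "ERROR"
			} else {
				a.state = "READY"
			}
		}
		a.mu.Unlock()
		if old != nil {
			old.Shutdown() // the old provider is shut down on success and on failure
		}
		done <- err
	}()
	return done
}

// setProviderAndWait is the same flow plus a wait that holds no lock.
func (a *api) setProviderAndWait(p *provider) error {
	return <-a.setProvider(p)
}

// State returns the current state under the lock.
func (a *api) State() string {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.state
}

func main() {
	a := &api{state: "NOT_READY"}
	fmt.Println(a.setProviderAndWait(&provider{name: "ok"}), a.State())
	fmt.Println(a.setProviderAndWait(&provider{name: "bad", initErr: errBoom}), a.State())
}
```

Because the mutex is released before `Init` runs, other goroutines can keep calling `State()` (or, in a real SDK, evaluating flags) while a slow provider initializes.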

Related Issues

Can't find any.

Notes

Follow-up Tasks

How to test

@dd-oleksii dd-oleksii requested review from a team as code owners February 16, 2026 19:20
@gemini-code-assist
Contributor

Summary of Changes

Hello @dd-oleksii, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the OpenFeature SDK's provider management, addressing critical inconsistencies and race conditions in how providers are initialized, updated, and shut down. The changes ensure a more predictable and robust lifecycle for feature flag providers, particularly by standardizing asynchronous initialization, improving error propagation, and preventing interference between old and new provider states during transitions.

Highlights

  • Standardized Provider Initialization: Unified the initialization logic for SetProvider and SetProviderAndWait to always be asynchronous, with AndWait methods now correctly waiting for completion outside of critical sections.
  • Robust Error Handling: Ensured that initialization errors are properly captured and propagated, preventing silent failures in provider state updates.
  • Improved Provider Lifecycle Management: Guaranteed that old providers are always shut down when no longer in use and registered new providers with the event executor before initialization to prevent state corruption from concurrent old provider events.
  • Enhanced Concurrency Safety: Modified the shutdown semaphore to be buffered, addressing potential deadlocks and making event executor registration non-erroring.
Changelog
  • openfeature/event_executor.go
    • Removed unused imports for fmt and time.
    • Updated registerDefaultProvider, registerNamedEventingProvider, and startListeningAndShutdownOld functions to no longer return errors.
    • Modified old provider shutdown logic to log warnings instead of returning errors on semaphore timeouts.
  • openfeature/event_executor_test.go
    • Updated test cases to reflect the change in return types for provider registration functions, removing error checks.
  • openfeature/openfeature_api.go
    • Refactored SetProvider and SetNamedProvider methods to use new context-aware, asynchronous initialization helpers.
    • Separated initialization and shutdown logic into distinct initNew and shutdownOld functions.
    • Removed the initNewAndShutdownOld helper function.
    • Moved GetProviderMetadata and GetNamedProviderMetadata functions to a different section of the file.
  • openfeature/reference.go
    • Modified the providerReference struct to use a buffered channel with a capacity of 1 for the shutdownSemaphore, improving concurrency safety.
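
A minimal sketch of why a capacity of 1 matters for a completion signal, assuming the semaphore is signaled with a single send. `signalDone` is a hypothetical helper and is not taken from the SDK; it only demonstrates the general channel behavior: a send on an unbuffered channel needs a receiver that is already waiting, while a send on a channel with one free buffer slot completes immediately.

```go
package main

import "fmt"

// signalDone is a hypothetical helper: it tries to report completion on sem
// without ever blocking the caller, and returns whether the signal was stored.
func signalDone(sem chan struct{}) bool {
	select {
	case sem <- struct{}{}:
		return true
	default:
		return false // no waiting receiver and no free buffer slot
	}
}

func main() {
	unbuffered := make(chan struct{})
	buffered := make(chan struct{}, 1)

	// Nobody is receiving yet, so only the buffered channel accepts the signal.
	fmt.Println(signalDone(unbuffered), signalDone(buffered))
}
```

With the buffered variant, the shutting-down side can record completion even if the waiter has not started receiving yet, and the waiter picks the signal up later.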
Activity
  • No specific activity has been recorded for this pull request since its creation.
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request provides a significant and well-executed refactoring to address inconsistencies and bugs in the provider-setting logic, particularly between SetProvider and SetProviderAndWait. The changes make the initialization flow more consistent and robust by always initializing asynchronously and having AndWait methods wait outside of critical sections. The separation of initialization and shutdown logic into initNew and shutdownOld improves clarity. The fix for the race condition where an old provider could affect the state of a new one is also a crucial improvement. The code is of high quality. I have one suggestion to improve error handling during the shutdown of old providers.

@sahidvelji
Contributor

sahidvelji commented Feb 16, 2026

The tests seem to be failing with the -race flag. Though I see that the Makefile test target does not include the -race flag, so we'll need to update that. For now, please run the tests locally with

go test -count=1 -race --short -tags testtools -cover -timeout 1m ./...

@dd-oleksii dd-oleksii force-pushed the oleksii/jj-xxuwqutpmvxy branch from 0aefdc9 to 281ed00 Compare February 16, 2026 21:09
@dd-oleksii
Contributor Author

@sahidvelji thanks for the hint! I have fixed two tests that relied on events emitted before initialization being received after initialization is complete. They now emit error/stale events only after provider initialization is complete.

@codecov

codecov bot commented Feb 16, 2026

Codecov Report

❌ Patch coverage is 89.06250% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.92%. Comparing base (20c5cb9) to head (96ab6ee).
⚠️ Report is 1 commits behind head on main.

Files with missing lines | Patch % | Lines
openfeature/openfeature_api.go | 92.45% | 2 Missing and 2 partials ⚠️
openfeature/event_executor.go | 70.00% | 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #474      +/-   ##
==========================================
+ Coverage   81.66%   83.92%   +2.25%     
==========================================
  Files          27       27              
  Lines        2111     2071      -40     
==========================================
+ Hits         1724     1738      +14     
+ Misses        302      293       -9     
+ Partials       85       40      -45     
Flag | Coverage Δ
e2e | 83.92% <89.06%> (+2.25%) ⬆️
unit | 83.92% <89.06%> (+2.25%) ⬆️

@erka
Member

erka commented Feb 18, 2026

For reference, this test has intermittent failures on main, though they are very rare. The failure rate increases when new changes are introduced.

@toddbaert
Member

@dd-oleksii These improvements sound great. I will fully review later today. In the meantime, could you see if you can address the existing flakiness your fixes seem to have revealed?

It seems one of the tests timed out: https://github.com/open-feature/go-sdk/actions/runs/22098383602/job/63927536776?pr=474

For reference, this test has intermittent failures on main, though they are very rare. The failure rate increases when new changes are introduced.

@toddbaert toddbaert self-requested a review February 18, 2026 13:10
@dd-oleksii
Contributor Author

oh yeah, it's the same flawed pattern in this test as well:

  1. Provider emits some event (ProviderStale in this case)
  2. Provider is set in the API
  3. API runs initialization, emits provider ready event, and sets the state to ready
  4. Test expects the state to be "stale" (while the spec expects it to be "ready")

In the old "and wait" implementation, there was a bug that the SDK would only start listening for provider events after initialization has succeeded. So it would receive ProviderStale event after ProviderReady, despite the fact that it was issued before the initialization was attempted.

In the non-waiting implementation, the subscription and initialization happen concurrently, so it's undefined what event arrives first in this situation and what the end state would be. That's why the test is flaky. This PR moves subscription start earlier in the chain (before we attempt initialization), so it's more likely that the pre-existing events arrive first — that's why it started flaking more often.

There are two main issues here:

First, the tests are just making a wrong assumption about the ordering of events. This one is easy to fix — we should emit events only after initialization (or make initialization return error).

The harder issue is ensuring consistent ordering of pre-existing events and the initialization. I guess the simplest way to achieve that is by draining the provider's event channel before starting initialization. Another option is to send the provider ready event through the same channel, so the ordering is preserved. (In either case, it's probably best to handle this in a separate PR; this one is trying to fix too many things simultaneously.)
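
The "drain before init" option mentioned above could look roughly like this. It is a sketch under assumptions, not code from the PR: `event` and `drain` are hypothetical names, and the approach only captures events that are already queued at the moment of the call. Anything a provider sends concurrently could still arrive afterwards.

```go
package main

import "fmt"

// event is a hypothetical stand-in for a provider event.
type event string

// drain removes every event that is already queued in ch, without blocking,
// and returns the events in the order they were queued.
func drain(ch <-chan event) []event {
	var pending []event
	for {
		select {
		case e, ok := <-ch:
			if !ok {
				return pending // channel was closed
			}
			pending = append(pending, e)
		default:
			return pending // nothing queued right now
		}
	}
}

func main() {
	ch := make(chan event, 4)
	ch <- "PROVIDER_STALE" // emitted before the provider was set in the API

	early := drain(ch) // handle (or discard) these before calling init
	fmt.Println(early, len(ch))
}
```

Whether the drained events should be dispatched to handlers or dropped is exactly the ordering question raised in the comment; the sketch leaves that decision to the caller.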

@erka
Member

erka commented Feb 18, 2026

I think those tests are checking that no events are lost when the provider is replaced. AddHandler allows subscribing to a specific event type, so neither the order of events nor the current state of any provider matters here.

Member

@toddbaert toddbaert left a comment

This seems like a good improvement, but I have 2 suggestions... one might fix a small race condition. Overall this seems like a huge improvement in tidiness and stability, but please check this and this.

@dd-oleksii dd-oleksii force-pushed the oleksii/jj-xxuwqutpmvxy branch from 1da59f0 to e8eda71 Compare February 27, 2026 12:11
dd-oleksii and others added 3 commits February 28, 2026 15:12
Signed-off-by: Oleksii Shmalko <oleksii.shmalko@datadoghq.com>
Signed-off-by: Oleksii Shmalko <oleksii.shmalko@datadoghq.com>
Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
@toddbaert toddbaert force-pushed the oleksii/jj-xxuwqutpmvxy branch from d7d1776 to b6e4bfa Compare February 28, 2026 20:12
@toddbaert
Member

Pushed one additional change to remove some now-dead code. Will merge Monday unless I hear objections cc @erka @sahidvelji

@toddbaert toddbaert requested review from erka and sahidvelji February 28, 2026 20:17
@erka
Member

erka commented Mar 1, 2026

@toddbaert This PR introduces a breaking change relative to the OpenFeature spec. According to the spec, if a provider emits an event and there is a subscriber for that event, the subscriber event handlers must be run.

The test changes clearly demonstrate this breakage - they were modified to hide the underlying problem rather than address it. The eventExecutor is buggy in how it tracks states with events from a non-current provider, but dropping those non-current provider events is not an appropriate fix, in my opinion.

@toddbaert
Member

toddbaert commented Mar 2, 2026

@erka can you provide an actionable recommendation for @dd-oleksii ? I think I see your point, but from a usage standpoint I think this is an improvement (even if it can get better).

Or do we need to wait to resolve this first?

Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
@toddbaert
Member

toddbaert commented Mar 2, 2026

Thinking about it more - I'm not 100% sure that this is a deviation from the spec. The spec uses the language of "associated provider" in most places - if a provider has been replaced, I don't think it matters.

I think this is the pre-existing behavior though - events from replaced providers were already silently discarded.

I did make a change here though, which moves AddHandler to after the state transition, so the callback fires through emitOnRegistration (the "already in that state" path) instead of through normal event dispatch. This actually tests spec 5.3.3 instead of accidentally testing 5.1.2 which was happening before, I think.
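
To illustrate the "already in that state" path described above, here is a minimal single-goroutine sketch. `registry`, `addHandler`, `emit`, and `stateFor` are hypothetical names and do not mirror the SDK's internals (which also need locking); the only point is that a handler added after the state transition still runs, via the registration path instead of via event dispatch.

```go
package main

import "fmt"

// stateFor maps an event type to the provider state it implies ("" if none).
func stateFor(eventType string) string {
	switch eventType {
	case "PROVIDER_READY":
		return "READY"
	case "PROVIDER_ERROR":
		return "ERROR"
	case "PROVIDER_STALE":
		return "STALE"
	}
	return ""
}

// registry is a hypothetical, non-thread-safe handler registry.
type registry struct {
	state    string
	handlers map[string][]func(eventType string)
}

func newRegistry() *registry {
	return &registry{state: "NOT_READY", handlers: map[string][]func(string){}}
}

// addHandler stores h and runs it immediately if the provider is already in
// the state associated with eventType.
func (r *registry) addHandler(eventType string, h func(string)) {
	r.handlers[eventType] = append(r.handlers[eventType], h)
	if s := stateFor(eventType); s != "" && s == r.state {
		h(eventType)
	}
}

// emit updates the state and runs the handlers registered for eventType.
func (r *registry) emit(eventType string) {
	if s := stateFor(eventType); s != "" {
		r.state = s
	}
	for _, h := range r.handlers[eventType] {
		h(eventType)
	}
}

func main() {
	r := newRegistry()
	r.emit("PROVIDER_STALE") // no handlers yet, only the state changes

	report := func(e string) { fmt.Println("handler ran for", e) }
	r.addHandler("PROVIDER_STALE", report) // runs at once: state is already STALE
	r.addHandler("PROVIDER_ERROR", report) // does not run: state is not ERROR
}
```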

@erka
Member

erka commented Mar 2, 2026

@toddbaert Honestly, the simplest way to address all of this complexity might be to prohibit provider replacement at the OpenFeature spec level. Do we know if providers are actually replaced in production scenarios?

Let me bring up another elephant: tracking. From what I can see in the go sdk, tracking relies on the current (default/domain) provider. A client may perform an evaluation using one provider, then that provider has been replaced, and only afterward client receives a Track call. In that situation, which provider is expected to handle tracking?

@toddbaert
Member

> @toddbaert Honestly, the simplest way to address all of this complexity might be to prohibit provider replacement at the OpenFeature spec level. Do we know if providers are actually replaced in production scenarios?
>
> Let me bring up another elephant: tracking. From what I can see in the go sdk, tracking relies on the current (default/domain) provider. A client may perform an evaluation using one provider, then that provider has been replaced, and only afterward client receives a Track call. In that situation, which provider is expected to handle tracking?

This might be outside the scope of this issue... really, but it's not so much switching we are worried about, but late binding. Different modules may load at different times, and may register their own providers; we want to make sure that works as expected. It's less important that you change from one provider to another, but the requirements for these 2 things are largely the same.

@dd-oleksii
Contributor Author

dd-oleksii commented Mar 2, 2026

> @toddbaert This PR introduces a breaking change relative to the OpenFeature spec. According to the spec, if a provider emits an event and there is a subscriber for that event, the subscriber event handlers must be run.
>
> The test changes clearly demonstrate this breakage - they were modified to hide the underlying problem rather than address it.

I don't think this is the case. The two tests that were failing were attaching event handlers after an event was emitted, so they should not have received it in the first place. This is not new: they are flaky on main as well. The rest of the tests weren't failing but were modified to avoid the attach-after-emit test bug.

@erka
Member

erka commented Mar 2, 2026

There is also a change in how SetProviderAndWait works: the StateHandler provider may receive evaluation requests before it is initialized. Is this behavior expected?

This is an example

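package main

import (
	"context"
	"fmt"
	"time"

	flagd "github.com/open-feature/go-sdk-contrib/providers/flagd/pkg"
	"github.com/open-feature/go-sdk/openfeature"
)

func main() {
	client := openfeature.NewDefaultClient()
	// Evaluate the flag continuously in the background.
	go func() {
		for {
			<-time.After(time.Microsecond)
			result, err := client.BooleanValueDetails(context.TODO(), "flag", false, openfeature.NewTargetlessEvaluationContext(nil))
			fmt.Println(result, err)
		}
	}()
	time.Sleep(100 * time.Microsecond)
	// Set the flagd provider while evaluations are already running.
	go func() {
		provider, _ := flagd.NewProvider()
		_ = openfeature.SetProviderAndWait(provider)
		fmt.Println("flagd set and wait done")
	}()
	time.Sleep(time.Second)
}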

Before output:

//...
{false {flag bool {default-variant DEFAULT   map[]}}} <nil>
flagd set and wait done
{false {flag bool { ERROR PROVIDER_NOT_READY connection not made map[]}}} error code: PROVIDER_NOT_READY: connection not made
//...

After output

//...
{false {flag bool {default-variant DEFAULT   map[]}}} <nil>
{false {flag bool {    map[]}}} PROVIDER_NOT_READY: provider not yet initialized
//...
{false {flag bool {    map[]}}} PROVIDER_NOT_READY: provider not yet initialized
flagd set and wait done
{false {flag bool { ERROR PROVIDER_NOT_READY connection not made map[]}}} error code: PROVIDER_NOT_READY: connection not made
//...

@toddbaert
Member

toddbaert commented Mar 3, 2026

> There is also a change in how SetProviderAndWait works: the StateHandler provider may receive evaluation requests before it is initialized. Is this behavior expected?

Do you mean from a different goroutine/thread? The setProviderAndWait should block, but if another goroutine evaluates a flag, we'd expect to see PROVIDER_NOT_READY, but that also would be true before this change, I would think. There's no requirement to "block" evaluations while initialization is taking place.

@erka
Member

erka commented Mar 3, 2026

OpenFeature Requirement 2.4.1 says "Many feature flag frameworks or SDKs require some initialization before they can be used." I read it as saying that a non-initialized provider should not receive evaluation calls.

Then there is the EventChannel() call. There is no guidance on when this channel should be initialized. If a provider creates that channel in Init, this PR breaks its implementation.

This PR aims to address certain problems, but it also introduces new ones. I see it more as a breaking change than as a bug fix.

@dd-oleksii
Contributor Author

dd-oleksii commented Mar 6, 2026

@erka all the behavior you describe exists today for SetProvider — it's not new. SetProvider calls initialize completely async in a new goroutine and it may subscribe to the event channel before init is actually called (as has been demonstrated by flaky tests). It also allows evaluation calls on a provider before it completes initialization.

The only thing this PR changes is making SetProviderAndWait consistent with SetProvider.

@dd-oleksii
Contributor Author

@erka @toddbaert there has been a lot of back-and-forth on this PR over the last three weeks. Would you like to jump on a call to discuss spec/expected behavior and resolve any misunderstandings about what this PR is doing?

@erka
Member

erka commented Mar 10, 2026

@dd-oleksii Since the Go SDK is v1, some users might be surprised by the new behavior of SetProviderAndWait. Here’s a small app to observe some of the changes.

package main

import (
	"context"
	"log/slog"
	"os"
	"time"

	"github.com/lmittmann/tint"
	flagd "github.com/open-feature/go-sdk-contrib/providers/flagd/pkg"
	"github.com/open-feature/go-sdk/openfeature"
	"github.com/open-feature/go-sdk/openfeature/memprovider"
)

func init() {
	slog.SetDefault(slog.New(tint.NewHandler(os.Stdout, &tint.Options{
		TimeFormat: "-",
	})))
}

var _ openfeature.ContextAwareStateHandler = (*SlowProvider)(nil)

type SlowProvider struct {
	*flagd.Provider
}

func (s *SlowProvider) InitWithContext(ctx context.Context, evaluationContext openfeature.EvaluationContext) error {
	select {
	case <-time.After(250 * time.Microsecond):
		break
	case <-ctx.Done():
		return ctx.Err()
	}
	return s.Init(evaluationContext)
}

func (s *SlowProvider) ShutdownWithContext(ctx context.Context) error {
	s.Shutdown()
	return nil
}

func main() {
	client := openfeature.NewDefaultClient()
	gcb := func(e openfeature.EventDetails) {
		slog.Error("global-event", "state", client.State(), "details", e)
	}
	ccb := func(e openfeature.EventDetails) {
		slog.Error("client-event", "state", client.State(), "details", e)
	}

	openfeature.AddHandler(openfeature.ProviderError, &gcb)
	client.AddHandler(openfeature.ProviderError, &ccb)
	go func() {
		for {
			evaluate(client)
			<-time.After(5 * time.Microsecond)
		}
	}()

	openfeature.SetProviderAndWait(memprovider.NewInMemoryProvider(map[string]memprovider.InMemoryFlag{
		"my-flag": {
			Key:            "my-flag",
			DefaultVariant: "on",
			Variants: map[string]any{
				"on":  true,
				"off": false,
			},
		},
	}))
	slog.Warn("in-memory is set")
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		provider, _ := flagd.NewProvider()
		slog.Warn("flagd init")
		err := openfeature.SetProviderAndWait(&SlowProvider{Provider: provider})
		slog.Warn("flagd is set", "error", err)
		cancel()
	}()
	<-ctx.Done()
	evaluate(client)
}

func evaluate(client *openfeature.Client) {
	v, e := client.BooleanValue(context.TODO(), "my-flag", false, openfeature.NewTargetlessEvaluationContext(nil))
	slog.Info("client evaluation", "state", client.State(), "value", v, "details", e)
}

Before logs

- WRN in-memory is set
- WRN flagd init
- INF client evaluation state=READY value=true details=<nil>
- WRN flagd is set error="provider initialization failed: stream error: unavailable: dial tcp 127.0.0.1:8013: connect: connection refused"
- ERR global-event state=ERROR details="{ProviderName:flagd ProviderEventDetails:{Message:Provider initialization failed: provider initialization failed: stream error: unavailable: dial tcp 127.0.0.1:8013: connect: connection refused FlagChanges:[] EventMetadata:map[] ErrorCode:}}"
- INF client evaluation state=ERROR value=false details="error code: PROVIDER_NOT_READY: connection not made"

After logs

- WRN in-memory is set
- WRN flagd init
- INF client evaluation state=READY value=true details=<nil>
- INF client evaluation state=READY value=false details="error code: PROVIDER_NOT_READY: client did not yet finish the initialization"
- INF client evaluation state=READY value=false details="error code: PROVIDER_NOT_READY: connection not made"
- WRN flagd is set error="failed to initialize default provider \"flagd\": provider initialization failed: stream error: unavailable: dial tcp 127.0.0.1:8013: connect: connection refused"
- ERR global-event state=ERROR details="{ProviderName:flagd ProviderEventDetails:{Message:Provider initialization failed: provider initialization failed: stream error: unavailable: dial tcp 127.0.0.1:8013: connect: connection refused FlagChanges:[] EventMetadata:map[] ErrorCode:}}"
- ERR client-event state=ERROR details="{ProviderName:flagd ProviderEventDetails:{Message:Provider initialization failed: provider initialization failed: stream error: unavailable: dial tcp 127.0.0.1:8013: connect: connection refused FlagChanges:[] EventMetadata:map[] ErrorCode:}}"
- INF client evaluation state=ERROR value=false details="error code: PROVIDER_NOT_READY: connection not made"

There’s now a client-event callback that didn’t happen before. The client reports Ready, but evaluations say the provider is not ready / not initialized, and later the client switches to the Error state.

After looking into this, it might make sense for the SDK to only expose a blocking SetProvider method. If callers don’t want to wait, they can always invoke it in a goroutine. But that’s a bit off topic.

@dd-oleksii
Contributor Author

dd-oleksii commented Mar 10, 2026

> Since the Go SDK is v1, some users might be surprised by the new behavior of SetProviderAndWait. Here’s a small app to observe some of the changes.

> The client reports Ready, but evaluations say the provider is not ready / not initialized

oh, this one looks like another bug in main that we can fix: setting a new provider does not reset the state to not ready as it should. You can observe this behavior with SetProvider today.

Otherwise, your example demonstrates two spec violations that got fixed:

  1. Evaluation calls were blocked, which goes against 1.4.12, which suggests that evaluation methods should be non-blocking. The current implementation may block evaluation calls indefinitely, which is a serious issue and is the reason why I started the PR.
  2. The SDK didn't call the client event handler after initialization, which clearly violates 5.3.2.

Now the question is whether fixing these two is a breaking change or not. My argument for "not" is that these were undocumented deviations from the spec. So if anyone relied on these, they were clearly relying on undocumented behavior that was not part of the public API. But I can see the point that it might be considered a breaking change if we want to play it safe (obligatory xkcd reference).

@erka so how do you want to proceed? do you want me to bump the major version? do you want to bundle these changes with your branch? do you want to fix the bugs in a different way?

@toddbaert
Member

toddbaert commented Mar 12, 2026

Adding a comment here to level-set from the community meeting. The goal of the OpenFeature spec is to describe the shape of the interfaces and APIs for the SDKs. The finest details of their behaviors, for instance around things like locking, ordering, etc, in many cases is left unspecified. That doesn't mean they can't be specified within the context of the SDK. In fact, understanding such nuances should be one of the goals of the maintainers of a particular implementation. Where ordering is critical, the spec mentions it (or should be made to). In other cases, the specification lends some flexibility. The goal is to specify a set of APIs and behaviors that provides a degree of familiarity to users, not specify every possible internal state of the artifacts of an implementation.

It's a separate issue whether any particular behavioral, timing, or locking change constitutes a breaking change, requiring a major version release. In general, I would not recommend considering behavior changes breaking, unless we believe they are likely to cause significant issues for users. This is a subjective evaluation; even fixing something like an unnecessary lock can break somebody's workflow; our job as maintainers is to exercise judgement on these issues, case by case.

So the question here is whether the things @erka has discovered constitute breaking changes for us or not. I'll re-familiarize myself and give my evaluation soon.

EDIT:

When I posted this I didn't even notice @dd-oleksii referenced the same XKCD.

@toddbaert
Member

Thanks @erka for your attention to detail. I appreciate it and I think it's good to have made these discoveries. Regarding the concern that EventChannel() might be called before Init(): I looked at how other SDKs handle this, and they all subscribe to provider events before calling init:

  • Java: setEventProviderListener is called in the FeatureProviderStateManager constructor, before initialize()
  • JavaScript: event handlers are attached to provider.events in setAwaitableProvider, before initialize()
  • Python: provider.attach(callback) is called before initialize() in _initialize_provider
  • .NET: ProcessFeatureProviderEventsAsync reads from provider.GetEventChannel() before InitializeAsync(); this is the closest analog to Go since it also uses something more like Go's channels

In all cases, the event mechanism is expected to be ready at construction time, not created lazily in init. A Go provider that creates its EventChannel() inside Init() would already be broken with SetProvider (the non-waiting path).
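
A small sketch of the difference, using hypothetical `eagerProvider` and `lazyProvider` types and a simplified `event` type instead of the SDK's real interfaces. It assumes the SDK calls `EventChannel()` once, before `Init()`, and keeps the returned channel.

```go
package main

import "fmt"

// event is a hypothetical stand-in for a provider event.
type event string

// eagerProvider creates its event channel in the constructor, so
// EventChannel() is valid before Init() is called.
type eagerProvider struct{ events chan event }

func newEagerProvider() *eagerProvider {
	return &eagerProvider{events: make(chan event, 8)}
}

func (p *eagerProvider) EventChannel() <-chan event { return p.events }

func (p *eagerProvider) Init() error {
	p.events <- "PROVIDER_CONFIGURATION_CHANGED"
	return nil
}

// lazyProvider only creates the channel inside Init().
type lazyProvider struct{ events chan event }

func (p *lazyProvider) EventChannel() <-chan event { return p.events }

func (p *lazyProvider) Init() error {
	p.events = make(chan event, 8)
	p.events <- "PROVIDER_CONFIGURATION_CHANGED"
	return nil
}

func main() {
	eager := newEagerProvider()
	sub := eager.EventChannel() // subscribe first, as the SDK would
	_ = eager.Init()
	fmt.Println(<-sub) // the event emitted during Init is delivered

	lazy := &lazyProvider{}
	lazySub := lazy.EventChannel() // nil: receiving from it would block forever
	_ = lazy.Init()
	fmt.Println(lazySub == nil) // the subscriber never sees the channel made in Init
}
```

A provider written like `lazyProvider` can avoid the problem by moving the `make` call into its constructor.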

This PR aligns SetProviderAndWait with both the Go SDK's own SetProvider and the rest of the OpenFeature ecosystem. I don't think we should consider this a breaking change, and I think overall it's an improvement. I will note this change as clearly as possible in the commit/description.

@toddbaert toddbaert changed the title from "fix: make SetProviderAndWait behavior consistent with SetProvider" to "fix: consistent SetProviderAndWait init flow" Mar 17, 2026
@toddbaert toddbaert merged commit e6452ea into open-feature:main Mar 17, 2026
7 checks passed
4 participants