fix(propeller): bypass informer cache when clearing finalizers #10
Open
pfernandes21 wants to merge 2 commits into master
PluginManager.Finalize reads the resource through the informer cache, then calls clearFinalizer, which short-circuits silently when the local copy's Finalizers list does not contain ours (controllerutil.RemoveFinalizer returns false). When the cache is stale relative to the API server — for example because the watch verb is missing in RBAC and the reflector is falling back to repeated LISTs — the cached copy can lack a finalizer that is still attached on the server, and the patch that should remove it is never sent. The resource then sits forever with deletionTimestamp set, blocking owned objects (RayCluster head/worker pods, etc.) and the underlying nodes from being garbage collected. Add GetAPIReader() to executors.Client and pluginsCore.KubeClient (with matching impls in flyteK8sClient, kubeClient, KubeClientObj, and the mocks) and use it in Finalize so the read that drives clearFinalizer always sees the API server's current Finalizers list, not a possibly stale cache snapshot.
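The short-circuit described above can be illustrated with a minimal, dependency-free sketch. It is not the propeller code: `Object`, `Reader`, `mapReader`, `removeFinalizer`, and `finalize` are hypothetical stand-ins, and `removeFinalizer` only imitates the return-value contract this PR relies on for controllerutil.RemoveFinalizer. One map plays the stale informer cache, another plays the API server.

```go
package main

import "fmt"

// finalizerName is the finalizer named in this PR.
const finalizerName = "flyte.org/finalizer-k8s"

// Object is a hypothetical stand-in for a Kubernetes object; only Finalizers matters here.
type Object struct {
	Name       string
	Finalizers []string
}

// Reader is a simplified stand-in for a read-only client (assumption, not the real interface).
type Reader interface {
	Get(name string) (Object, bool)
}

// mapReader serves copies of objects from memory. One instance plays the
// informer cache, another plays the API server.
type mapReader map[string]Object

func (m mapReader) Get(name string) (Object, bool) {
	o, ok := m[name]
	if ok {
		// Copy the slice so callers cannot mutate the stored object.
		o.Finalizers = append([]string(nil), o.Finalizers...)
	}
	return o, ok
}

// removeFinalizer reports whether the finalizer was present and removed,
// imitating the behaviour described for controllerutil.RemoveFinalizer.
func removeFinalizer(o *Object, name string) bool {
	kept := o.Finalizers[:0]
	removed := false
	for _, f := range o.Finalizers {
		if f == name {
			removed = true
			continue
		}
		kept = append(kept, f)
	}
	o.Finalizers = kept
	return removed
}

// finalize reads through r and writes to the server only when the local copy
// carried the finalizer. It returns true if a "patch" was sent.
func finalize(r Reader, server mapReader, name string) bool {
	o, ok := r.Get(name)
	if !ok {
		return false
	}
	if !removeFinalizer(&o, finalizerName) {
		return false // silent no-op: the local copy had nothing to remove
	}
	server[name] = o // stands in for the merge-patch
	return true
}

func main() {
	server := mapReader{"rayjob-a": {Name: "rayjob-a", Finalizers: []string{finalizerName}}}
	staleCache := mapReader{"rayjob-a": {Name: "rayjob-a"}} // snapshot lacks the finalizer

	// Reading through the stale cache: no patch, finalizer stays on the server.
	fmt.Println("cache read, patch sent:", finalize(staleCache, server, "rayjob-a"))
	fmt.Println("finalizers left on server:", len(server["rayjob-a"].Finalizers))

	// Reading from the server directly: the patch goes out.
	fmt.Println("API read, patch sent:", finalize(server, server, "rayjob-a"))
	fmt.Println("finalizers left on server:", len(server["rayjob-a"].Finalizers))
}
```

In this sketch the only difference between the two calls is which reader backs the `Get`, which is the same lever the PR pulls by routing the read in Finalize through GetAPIReader().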
Tracking issue
Internal — see Slack thread.
Why are the changes needed?
PluginManager.Finalize reads the resource through the informer cache and then calls clearFinalizer, which uses controllerutil.RemoveFinalizer on that copy and only sends the merge-patch when RemoveFinalizer returns true. When the cache is stale relative to the API server, the cached copy can lack a finalizer that the API server still has: RemoveFinalizer returns false, the patch is never sent, and the function returns nil. The caller treats finalize as successful, the node phase advances, and the resource is left forever with deletionTimestamp set and our finalizer attached.

The flyteK8sClient.Get wrapper is cache-first (it calls cacheReader.Get and only falls through to the API server on cache error), so any informer staleness propagates straight into this codepath. Staleness windows widen significantly when the watch verb is missing in RBAC and the reflector falls back to repeated LISTs (visible in propeller logs as reflector.go:147] Failed to watch *v1.RayJob: unknown (get rayjobs.ray.io)).

This is the propeller-side half of a fix for hephaestus, where 258 of 263 RayJobs in flytesnacks-* ended up stuck Terminating with flyte.org/finalizer-k8s attached, pinning RayCluster head/worker pods and blocking node consolidation. The companion change adds the missing watch/deletecollection RBAC verbs.

What changes were proposed in this pull request?
Add a GetAPIReader() client.Reader method to both kube-client interfaces (executors.Client in flytepropeller and core.KubeClient in flyteplugins) and use it for the Get inside Finalize. The new reader bypasses the informer cache and reads directly from the API server, so the object passed to clearFinalizer always reflects the API server's current Finalizers list. If the finalizer is still present, RemoveFinalizer returns true and the merge-patch goes out; if it really was already removed, the no-op log is correct.

Implementations updated:
- executors.flyteK8sClient: relies on the embedded cacheless client.Client it was already constructed with.
- pluginmachinery/k8s.kubeClient: adds an optional apiReader; falls back to the existing client (which already issues direct API calls) when no separate cacheless reader was wired in.
- pluginmachinery/array/k8s.KubeClientObj: returns the underlying client (already cacheless).
- executors/mocks/ and pluginmachinery/core/mocks/: generated-style additions for GetAPIReader. NewFakeKubeClient() returns the same fake.Client for both GetClient and GetAPIReader.

Writes are unchanged: Patch already goes to the API server via the embedded cacheless client.

How was this patch tested?
- go build ./... and go vet ./... clean for both flytepropeller and flyteplugins.
- go test ./... passes for both modules.
- TestFinalize in flytepropeller/pkg/controller/nodes/task/k8s/plugin_manager_test.go continues to pass after NewFakeKubeClient() was updated to also stub GetAPIReader.

End-to-end verification will happen on hephaestus after the companion RBAC PR (exa-labs/monorepo#34236) lands and this image is rolled out: new RayJob completions should no longer accumulate flyte.org/finalizer-k8s on terminating objects.

Labels
fixed
Setup process
n/a — server-side fix, no migrations / config changes required.
Screenshots
n/a
Check all the applicable boxes
Related PRs
watch + deletecollection for rayjobs / pytorchjobs.
Docs link
n/a
Link to Devin session: https://app.devin.ai/sessions/4884d86b33694c858a1abdf476f20ed8
Requested by: @pfernandes21