fix(propeller): bypass informer cache when clearing finalizers #10

Open

pfernandes21 wants to merge 2 commits into master from devin/1777572238-fix-finalizer-stale-cache

Conversation

@pfernandes21

Tracking issue

Internal — see Slack thread.

Why are the changes needed?

PluginManager.Finalize reads the resource through the informer cache and then calls clearFinalizer, which uses controllerutil.RemoveFinalizer on that copy and only sends the merge-patch when RemoveFinalizer returns true. When the cache is stale relative to the API server, the cached copy can lack a finalizer that the API server still has — RemoveFinalizer returns false, the patch is never sent, and the function returns nil. The caller treats finalize as successful, the node phase advances, and the resource is left forever with deletionTimestamp set and our finalizer attached.

// flytepropeller/pkg/controller/nodes/task/k8s/plugin_manager.go (before)
if err := e.kubeClient.GetClient().Get(ctx, nsName, o); err != nil { ... }
if err := e.clearFinalizer(ctx, o); err != nil { ... }
// clearFinalizer (unchanged)
finalizerRemoved := controllerutil.RemoveFinalizer(o, finalizer)
oldFinalizerRemoved := controllerutil.RemoveFinalizer(o, oldFinalizer)
if finalizerRemoved || oldFinalizerRemoved {
    e.kubeClient.GetClient().Patch(ctx, o, client.MergeFrom(orig))
} else {
    logger.Debugf(ctx, "Finalizer is already cleared from Resource ...")
    // returns nil — silent no-op
}

The flyteK8sClient.Get wrapper is cache-first (it calls cacheReader.Get and only falls through to the API server on cache error), so any informer staleness propagates straight into this codepath. Staleness windows widen significantly when the watch verb is missing in RBAC and the reflector falls back to repeated LISTs (visible in propeller logs as reflector.go:147] Failed to watch *v1.RayJob: unknown (get rayjobs.ray.io)).
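
A minimal sketch of that cache-first read pattern, with assumed type and field names rather than the exact flyteK8sClient source:

// Illustrative only: the shape of a cache-first Get wrapper.
package sketch

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

type cacheFirstClient struct {
	client.Client             // API-server-backed client (also used for writes)
	cacheReader client.Reader // informer-cache-backed reader
}

func (c cacheFirstClient) Get(ctx context.Context, key client.ObjectKey, obj client.Object, opts ...client.GetOption) error {
	if err := c.cacheReader.Get(ctx, key, obj, opts...); err == nil {
		return nil // served from the informer cache, however stale it may be
	}
	// Only a cache *error* falls through to the API server; a stale-but-successful read never does.
	return c.Client.Get(ctx, key, obj, opts...)
}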

This is the propeller-side half of a fix for hephaestus, where 258 of 263 RayJobs in flytesnacks-* ended up stuck Terminating with flyte.org/finalizer-k8s attached, pinning RayCluster head/worker pods and blocking node consolidation. The companion change adds the missing watch / deletecollection RBAC verbs.

What changes were proposed in this pull request?

Add a GetAPIReader() client.Reader method to both kube-client interfaces (executors.Client in flytepropeller and core.KubeClient in flyteplugins) and use it for the Get inside Finalize. The new reader bypasses the informer cache and reads directly from the API server, so the object passed to clearFinalizer always reflects the API server's current Finalizers list. If the finalizer is still present, RemoveFinalizer returns true and the merge-patch goes out; if it really was already removed, the no-op log is correct.
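
A simplified sketch of the resulting read-then-patch path, assuming the GetAPIReader shape described above (stand-in function and interface names, not the literal diff):

// Hypothetical, simplified sketch of the fixed Finalize read path.
package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const finalizer = "flyte.org/finalizer-k8s"

// kubeClient stands in for the extended interface this PR describes.
type kubeClient interface {
	GetClient() client.Client    // cache-backed reads, API-server writes (unchanged)
	GetAPIReader() client.Reader // cache-bypassing reads (new)
}

func finalizeResource(ctx context.Context, kc kubeClient, nsName client.ObjectKey, o client.Object) error {
	// Read the live object so a stale informer cache cannot hide a still-attached finalizer.
	if err := kc.GetAPIReader().Get(ctx, nsName, o); err != nil {
		if apierrors.IsNotFound(err) {
			return nil // already deleted; nothing to clear
		}
		return err
	}
	orig := o.DeepCopyObject().(client.Object)
	if controllerutil.RemoveFinalizer(o, finalizer) {
		// Writes were already cacheless; the patch path is unchanged.
		return kc.GetClient().Patch(ctx, o, client.MergeFrom(orig))
	}
	return nil // the finalizer is genuinely absent on the server, so the no-op log is now correct
}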

Implementations updated:

  • executors.flyteK8sClient — relies on the embedded cacheless client.Client it was already constructed with.
  • pluginmachinery/k8s.kubeClient — adds an optional apiReader; falls back to the existing client (which already issues direct API calls) when no separate cacheless reader was wired in (see the sketch after this list).
  • pluginmachinery/array/k8s.KubeClientObj — returns the underlying client (already cacheless).
  • Mocks under executors/mocks/ and pluginmachinery/core/mocks/ — generated-style additions for GetAPIReader. NewFakeKubeClient() returns the same fake.Client for both GetClient and GetAPIReader.
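
A minimal sketch of the optional-reader fallback in the pluginmachinery kubeClient item above (type and field names are assumptions):

// Hypothetical shape of the optional apiReader fallback.
package sketch

import "sigs.k8s.io/controller-runtime/pkg/client"

type kubeClientWithReader struct {
	base      client.Client // in this implementation, already issues direct API calls
	apiReader client.Reader // optional cache-bypassing reader
}

func (k kubeClientWithReader) GetClient() client.Client { return k.base }

func (k kubeClientWithReader) GetAPIReader() client.Reader {
	if k.apiReader != nil {
		return k.apiReader
	}
	// No separate cacheless reader wired in: fall back to the existing client.
	return k.base
}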

Writes are unchanged — Patch already goes to the API server via the embedded cacheless client.

How was this patch tested?

  • go build ./... and go vet ./... clean for both flytepropeller and flyteplugins.
  • go test ./... passes for both modules. TestFinalize in flytepropeller/pkg/controller/nodes/task/k8s/plugin_manager_test.go continues to pass after NewFakeKubeClient() was updated to also stub GetAPIReader; a sketch of that fake's shape follows.
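
A sketch of the shape such a test fake can take, using controller-runtime's fake client builder (illustrative, not the repository's generated mocks):

// Hypothetical test fake: the same in-memory client backs both accessors,
// which is fine in tests because there is no informer cache to go stale.
package sketch

import (
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/fake"
)

type fakeKubeClient struct {
	c client.Client
}

func (f fakeKubeClient) GetClient() client.Client    { return f.c }
func (f fakeKubeClient) GetAPIReader() client.Reader { return f.c }

func NewFakeKubeClient() fakeKubeClient {
	return fakeKubeClient{c: fake.NewClientBuilder().Build()}
}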

End-to-end verification will happen on hephaestus after the companion RBAC PR (exa-labs/monorepo#34236) lands and this image is rolled out: new RayJob completions should no longer accumulate flyte.org/finalizer-k8s on terminating objects.

Labels

fixed

Setup process

n/a — server-side fix, no migrations / config changes required.

Screenshots

n/a

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

  • exa-labs/monorepo#34236 — RBAC: add watch + deletecollection for rayjobs / pytorchjobs.

Docs link

n/a

Link to Devin session: https://app.devin.ai/sessions/4884d86b33694c858a1abdf476f20ed8
Requested by: @pfernandes21

PluginManager.Finalize reads the resource through the informer cache,
then calls clearFinalizer, which short-circuits silently when the local
copy's Finalizers list does not contain ours (controllerutil.RemoveFinalizer
returns false). When the cache is stale relative to the API server — for
example because the watch verb is missing in RBAC and the reflector is
falling back to repeated LISTs — the cached copy can lack a finalizer
that is still attached on the server, and the patch that should remove
it is never sent. The resource then sits forever with deletionTimestamp
set, blocking owned objects (RayCluster head/worker pods, etc.) and the
underlying nodes from being garbage collected.

Add GetAPIReader() to executors.Client and pluginsCore.KubeClient (with
matching impls in flyteK8sClient, kubeClient, KubeClientObj, and the
mocks) and use it in Finalize so the read that drives clearFinalizer
always sees the API server's current Finalizers list, not a possibly
stale cache snapshot.
@devin-ai-integration

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring
