fix(controller): preserve shared Skyhook cordons #275

Open
fallintoplace wants to merge 2 commits into NVIDIA:main from fallintoplace:fix/cordon-ownership

Conversation

@fallintoplace

@fallintoplace fallintoplace commented Jun 12, 2026

Contributor

Summary

  • Remove only the current Skyhook cordon annotation before deciding whether to clear spec.unschedulable.
  • Keep a node unschedulable while any skyhook.nvidia.com/cordon_* annotation remains.
  • Initialize node annotations before recording cordon state.
  • Add wrapper tests for cordon initialization plus sole-owner, shared-owner, and non-owner uncordon cases.

Why

Previously, when one Skyhook completed, it could clear spec.unschedulable even though another Skyhook still held a cordon annotation on the same node. While touching that path, Cordon() was also changed to tolerate nodes that have no annotations, so recording cordon ownership cannot panic on a nil map.
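
The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `node` struct is a hypothetical stand-in for the wrapped Kubernetes node, and what `uncordon` does for a caller that holds no cordon is an assumption (here it leaves the node untouched). Only the `skyhook.nvidia.com/cordon_` prefix and the helper name `hasSkyhookCordon` come from this conversation.

```go
package main

import (
	"fmt"
	"strings"
)

// cordonPrefix follows the skyhook.nvidia.com/cordon_* pattern named in the summary.
const cordonPrefix = "skyhook.nvidia.com/cordon_"

// node is a hypothetical stand-in for the wrapped Kubernetes node.
type node struct {
	Annotations   map[string]string
	Unschedulable bool
}

// hasSkyhookCordon reports whether any Skyhook still holds a cordon annotation.
func hasSkyhookCordon(annotations map[string]string) bool {
	for key := range annotations {
		if strings.HasPrefix(key, cordonPrefix) {
			return true
		}
	}
	return false
}

// uncordon releases only the caller's cordon and makes the node schedulable
// once no cordon annotation remains. Leaving the node untouched when the
// caller holds no cordon is an assumption of this sketch.
func uncordon(n *node, skyhook string) {
	key := cordonPrefix + skyhook
	if _, owned := n.Annotations[key]; !owned {
		return
	}
	delete(n.Annotations, key)
	if !hasSkyhookCordon(n.Annotations) {
		n.Unschedulable = false
	}
}

func main() {
	n := &node{
		Annotations: map[string]string{
			cordonPrefix + "fast": "true",
			cordonPrefix + "slow": "true",
		},
		Unschedulable: true,
	}
	uncordon(n, "fast")
	fmt.Println("after fast:", n.Unschedulable) // slow still holds its cordon
	uncordon(n, "slow")
	fmt.Println("after slow:", n.Unschedulable) // last owner released
}
```

With two owners, the first `uncordon` call removes only its own annotation and the node stays unschedulable; the second call removes the last cordon and clears the flag.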

Validation

  • go test ./internal/wrapper
  • go test ./...

@github-actions

Welcome to NodeWright, @fallintoplace! Thanks for your first pull request.

Before review, please ensure:

  • All commits are signed off per the DCO (git commit -s)
  • Commits follow Conventional Commits
  • CI checks pass (tests, lint, security scan)
  • The PR description explains the why behind your changes

A maintainer will review this soon.

@github-actions github-actions Bot added the component/operator and component/ci labels Jun 12, 2026
@fallintoplace fallintoplace force-pushed the fix/cordon-ownership branch from b7eb775 to b3189d3 Compare June 12, 2026 18:41
@fallintoplace fallintoplace changed the title from "Preserve shared Skyhook cordons on uncordon" to "fix(controller): preserve shared Skyhook cordons" Jun 12, 2026
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

This PR introduces multi-skyhook cordon coordination by adding centralized annotation helpers and refactoring node cordon/uncordon logic. The Cordon method now initializes annotations when absent. Uncordon deletes only the caller's cordon annotation and clears Spec.Unschedulable only when no other skyhooks hold a cordon via the new hasSkyhookCordon helper. Reset uses the centralized cordonAnnotationKey helper. Comprehensive unit tests validate Cordon initialization, sole-owner uncordon, co-owner uncordon (node remains unschedulable), unrelated-annotation preservation, and no-ownership scenarios. Documentation explains shared cordon ownership coordination and recovery procedures. An end-to-end Chainsaw test validates that fast and slow Skyhooks coordinate properly on the same node, with only the completing Skyhook removing its annotation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • rice-riley
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title clearly and concisely summarizes the main change: preserving shared Skyhook cordons by preventing one Skyhook from clearing the unschedulable flag while another still holds a cordon.
Description check | ✅ Passed | The description is well-related to the changeset, explaining the core fix (selective cordon removal), the rationale (preventing premature unschedulable clearing), and validation steps performed.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.


@coderabbitai coderabbitai Bot left a comment

Contributor


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
operator/internal/wrapper/node.go (1)

478-490: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Consider adding a brief comment explaining the multi-skyhook coordination pattern.

The conditional clearing of node.Spec.Unschedulable based on hasSkyhookCordon (lines 485-487) implements a subtle multi-owner pattern that might not be immediately obvious to future maintainers. As per coding guidelines, code that is "unusual, surprising, or breaks a pattern" should include a comment explaining why.

Suggested comment above line 485:

// Multiple skyhooks can cordon the same node; only mark it schedulable when all cordon owners have released.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@operator/internal/wrapper/node.go` around lines 478 - 490, Add a short
clarifying comment above the conditional that clears node.Spec.Unschedulable in
skyhookNode.Uncordon to explain the multi-owner cordon coordination: note that
multiple skyhooks may set cordon annotations (use
cordonAnnotationKey/hasSkyhookCordon) and we should only set Spec.Unschedulable
= false when hasSkyhookCordon(...) returns false; place the comment immediately
before the hasSkyhookCordon check to make the intent clear to future
maintainers.

Source: Coding guidelines


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: d7afc486-bf99-4c82-af6f-9df06c441317

📥 Commits

Reviewing files that changed from the base of the PR and between 98fe42d and b7eb775.

📒 Files selected for processing (2)
  • operator/internal/wrapper/node.go
  • operator/internal/wrapper/node_test.go

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@operator/internal/wrapper/node_test.go`:
- Around line 181-244: Tests in the Uncordon context duplicate the raw
annotation key pattern "skyhook.nvidia.com/cordon_*"; update the three It blocks
to use the shared helper/constant instead of hard-coded strings by calling
cordonAnnotationKey("my-skyhook") and cordonAnnotationKey("other-skyhook") (or
define local constants) when constructing node.ObjectMeta.Annotations and when
asserting Expect(...).To(HaveKeyWithValue(...)) so all references match the
canonical key format used by NewSkyhookNodeOnly and Uncordon.

In `@operator/internal/wrapper/node.go`:
- Around line 469-474: The Cordon() method writes to node.Annotations[...]
without ensuring the map exists, causing a panic if Annotations is nil; modify
Cordon() to check if node.Annotations == nil and if so initialize it
(make(map[string]string)) before assigning cordonAnnotationKey(node.skyhookName)
and setting node.Spec.Unschedulable and node.updated so that the map write is
safe; update the logic inside skyhookNode.Cordon to perform this
nil-check/initialization before any writes to node.Annotations.
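
A minimal sketch of the nil-map guard described in the second finding. The `skyhookNode` type here is a simplified, hypothetical stand-in (the real one wraps a Kubernetes node), and the body of `cordonAnnotationKey` is assumed from the `skyhook.nvidia.com/cordon_*` pattern rather than copied from the PR.

```go
package main

import "fmt"

// skyhookNode is a simplified, hypothetical stand-in for the wrapper type.
type skyhookNode struct {
	skyhookName   string
	Annotations   map[string]string
	Unschedulable bool
	updated       bool
}

// cordonAnnotationKey is assumed to build skyhook.nvidia.com/cordon_<name>.
func cordonAnnotationKey(name string) string {
	return "skyhook.nvidia.com/cordon_" + name
}

// Cordon initializes the annotations map before writing to it; assigning to
// a nil map would panic at runtime.
func (n *skyhookNode) Cordon() {
	if n.Annotations == nil {
		n.Annotations = make(map[string]string)
	}
	n.Annotations[cordonAnnotationKey(n.skyhookName)] = "true"
	n.Unschedulable = true
	n.updated = true
}

func main() {
	n := &skyhookNode{skyhookName: "my-skyhook"} // Annotations is nil here
	n.Cordon()
	fmt.Println(n.Annotations, n.Unschedulable)
}
```

Reading from or deleting from a nil map is safe in Go; only the write needs the guard, which is why the check sits in `Cordon` and not in the uncordon path.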

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 458295f6-a03e-459f-8045-cc56af8cba42

📥 Commits

Reviewing files that changed from the base of the PR and between b7eb775 and b3189d3.

📒 Files selected for processing (2)
  • operator/internal/wrapper/node.go
  • operator/internal/wrapper/node_test.go

Comment thread operator/internal/wrapper/node_test.go
Comment thread operator/internal/wrapper/node.go
@fallintoplace fallintoplace force-pushed the fix/cordon-ownership branch 2 times, most recently from 8b59e04 to 4d8d5dc Compare June 12, 2026 18:56
mchmarny previously approved these changes Jun 12, 2026

@ayuskauskas ayuskauskas left a comment

Collaborator


Review

Solid, correctly-scoped bug fix. A node selected by two Skyhooks that both cordon it would be prematurely uncordoned when the first one completed — this fixes that by only clearing spec.unschedulable once no skyhook.nvidia.com/cordon_* annotation remains. Notes below.

Suggestions (minor)

  1. Unit test gap worth closing — the whole point of the change is that hasSkyhookCordon matches the cordon_ prefix specifically, not the generic skyhook.nvidia.com/ prefix. Please add a case: node owned only by my-skyhook's cordon plus an unrelated annotation (e.g. status_other / nodeState_other) → Uncordon() should still set unschedulable=false. This pins the behavior that a non-cordon Skyhook annotation doesn't keep the node cordoned.
  2. DRY nit (low priority): the cordon_ literal now lives in three places — the CLI const cordonAnnotationPrefix (cmd/cli/app/reset.go), the new cordonAnnotationKey(), and the inline prefix inside hasSkyhookCordon. Consider defining cordonAnnotationKey and hasSkyhookCordon against a single shared prefix constant in the wrapper package so the literal appears once.
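
Both suggestions could look roughly like the sketch below. The constant name `cordonAnnotationPrefix` is borrowed from the CLI const mentioned above, the helper bodies are assumptions, and `status_other` is an illustrative non-cordon annotation. The check in `main` stands in for the requested unit test and is not the Ginkgo case itself.

```go
package main

import (
	"fmt"
	"strings"
)

// cordonAnnotationPrefix is the single place the cordon_ literal appears.
const cordonAnnotationPrefix = "skyhook.nvidia.com/cordon_"

// cordonAnnotationKey builds the per-Skyhook cordon key from the shared prefix.
func cordonAnnotationKey(skyhookName string) string {
	return cordonAnnotationPrefix + skyhookName
}

// hasSkyhookCordon matches the cordon_ prefix specifically, not every
// skyhook.nvidia.com/ annotation.
func hasSkyhookCordon(annotations map[string]string) bool {
	for key := range annotations {
		if strings.HasPrefix(key, cordonAnnotationPrefix) {
			return true
		}
	}
	return false
}

func main() {
	annotations := map[string]string{
		cordonAnnotationKey("my-skyhook"): "true",
		"skyhook.nvidia.com/status_other": "complete",
	}
	// Release the only cordon; the unrelated status annotation stays behind.
	delete(annotations, cordonAnnotationKey("my-skyhook"))
	fmt.Println(hasSkyhookCordon(annotations)) // false
}
```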

Please add an e2e test verifying the fix

The unit tests exercise the wrapper in isolation, but the bug is fundamentally about two Skyhooks racing over one node's schedulability. Please add a chainsaw e2e (under k8s-tests/chainsaw/skyhook/) that:

  • Deploys two Skyhooks, both requiring interrupts, selecting the same node.
  • Asserts the node stays cordoned (spec.unschedulable: true) after the first Skyhook reaches complete.
  • Asserts the node only becomes schedulable (spec.unschedulable: false, no cordon_* annotations) after both Skyhooks complete.

This is the regression guard that proves the fix end-to-end; a passing unit suite alone wouldn't have caught the original bug at the rollout level.

Documentation

Orphaned cordon annotation = stuck unschedulable. With this change, a stale cordon_<gone-skyhook> annotation (left by a force-delete that bypasses the finalizer, or by a failed cleanup) will keep the node unschedulable indefinitely from the operator's view, because hasSkyhookCordon keeps seeing the key. Previously, any completing Skyhook would free it (incorrectly, as a side effect). Recovery exists (kubectl skyhook reset / node reset strip cordon annotations), so this is an acceptable trade for correctness. Please update docs/interrupt_flow.md to document this, along with the high-level logic introduced in this PR.
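
To illustrate what "orphaned" means here, the hypothetical helper below lists cordon annotations whose Skyhook no longer exists. Nothing like `orphanedCordons` is stated to exist in the operator or the CLI; the function name, the `live` set, and the sample Skyhook names are assumptions made only for this sketch.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

const cordonPrefix = "skyhook.nvidia.com/cordon_"

// orphanedCordons returns cordon annotation keys whose Skyhook name is not in
// live. Hypothetical helper, not part of the operator.
func orphanedCordons(annotations map[string]string, live map[string]bool) []string {
	var orphans []string
	for key := range annotations {
		if !strings.HasPrefix(key, cordonPrefix) {
			continue // not a cordon annotation
		}
		name := strings.TrimPrefix(key, cordonPrefix)
		if !live[name] {
			orphans = append(orphans, key)
		}
	}
	sort.Strings(orphans) // map iteration order is random; sort for stable output
	return orphans
}

func main() {
	annotations := map[string]string{
		cordonPrefix + "gone-skyhook": "true",
		cordonPrefix + "slow":         "true",
	}
	live := map[string]bool{"slow": true}
	fmt.Println(orphanedCordons(annotations, live))
}
```

Any key this reports would keep the node unschedulable until it is stripped, for example through the reset commands mentioned above.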

@mchmarny mchmarny self-requested a review June 22, 2026 21:29
@lockwobr

Collaborator

@fallintoplace there is an issue with linting on your changes causing all tests to fail.

Error: internal/wrapper/node.go:477:61: string `true` has 5 occurrences, make it a constant (goconst)
		node.Annotations[cordonAnnotationKey(node.skyhookName)] = "true"
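
One common way to satisfy goconst is to hoist the repeated literal into a named constant, sketched below. The constant name `annotationTrue` is hypothetical, the key format is assumed, and how the PR actually resolved the lint error is not shown in this conversation.

```go
package main

import "fmt"

// annotationTrue is a hypothetical name for the repeated "true" literal.
const annotationTrue = "true"

// cordonAnnotationKey uses an assumed key format for illustration.
func cordonAnnotationKey(skyhookName string) string {
	return "skyhook.nvidia.com/cordon_" + skyhookName
}

func main() {
	annotations := map[string]string{}
	// The flagged assignment, with the literal replaced by the constant.
	annotations[cordonAnnotationKey("my-skyhook")] = annotationTrue
	fmt.Println(annotations)
}
```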

@fallintoplace

Contributor Author

@ayuskauskas @lockwobr I will resolve the issues and push this as soon as possible. Thank you for your attention.

@fallintoplace fallintoplace force-pushed the fix/cordon-ownership branch from 4d8d5dc to 2553176 Compare June 22, 2026 21:54
@fallintoplace fallintoplace requested a review from a team June 22, 2026 21:54
@github-actions github-actions Bot added the doc and component/tests labels Jun 22, 2026

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@k8s-tests/chainsaw/skyhook/shared-cordon-ownership/blocker-pod.yaml`:
- Around line 33-34: Replace the `busybox:latest` image tag in the blocker pod
configuration with a specific pinned version such as `busybox:1.38.0`. Update
the image field to use a concrete version tag instead of the latest tag to
ensure consistency and prevent test flakiness when the upstream image changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 2217aa09-613e-4d32-8fe7-22455c2607d3

📥 Commits

Reviewing files that changed from the base of the PR and between 4d8d5dc and 2553176.

📒 Files selected for processing (8)
  • docs/interrupt_flow.md
  • k8s-tests/chainsaw/skyhook/shared-cordon-ownership/README.md
  • k8s-tests/chainsaw/skyhook/shared-cordon-ownership/blocker-pod.yaml
  • k8s-tests/chainsaw/skyhook/shared-cordon-ownership/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/shared-cordon-ownership/skyhook-fast.yaml
  • k8s-tests/chainsaw/skyhook/shared-cordon-ownership/skyhook-slow.yaml
  • operator/internal/wrapper/node.go
  • operator/internal/wrapper/node_test.go

Comment thread k8s-tests/chainsaw/skyhook/shared-cordon-ownership/blocker-pod.yaml Outdated
@fallintoplace fallintoplace force-pushed the fix/cordon-ownership branch from 2553176 to fef7466 Compare June 22, 2026 22:08

@lockwobr lockwobr left a comment

Collaborator


A couple of accuracy notes on the new Orphaned Cordon Recovery section. The code change itself looks correct; these are doc-only suggestions so the recovery steps don't leave a node stuck cordoned.

Comment thread docs/interrupt_flow.md Outdated
Comment thread docs/interrupt_flow.md
@ayuskauskas

Collaborator

Changes look good once @lockwobr's documentation asks are addressed.

Signed-off-by: Minh Vu <vuhoangminh97@gmail.com>
Signed-off-by: Minh Vu <vuhoangminh97@gmail.com>
@fallintoplace fallintoplace force-pushed the fix/cordon-ownership branch from 25058e7 to 023b329 Compare June 30, 2026 18:23
@fallintoplace

fallintoplace commented Jun 30, 2026

Contributor Author

@ayuskauskas I somehow missed this PR in my inbox, but I have now updated it in the direction you asked for.

@ayuskauskas

Collaborator

Looks good to me. Once tests pass I can approve.

@ayuskauskas ayuskauskas requested a review from lockwobr June 30, 2026 19:33
@ayuskauskas ayuskauskas dismissed lockwobr’s stale review June 30, 2026 19:34

Asks were addressed.

@ayuskauskas ayuskauskas enabled auto-merge June 30, 2026 19:35
@ayuskauskas

Collaborator

@fallintoplace the last blocking piece is that your commits must be signed. Can you force push with signed commits?


Labels

  • component/ci: CI workflows, GitHub Actions, and repo tooling
  • component/operator: Skyhook operator (controller-manager)
  • component/tests: End-to-end / chainsaw test suites (k8s-tests)
  • doc: Documentation change (PR path label; doc issues use the Documentation type)

4 participants