Skip to content

fix: scope consolidation validation to candidate-relevant state#3

Open
jamesmt-aws wants to merge 1 commit intomainfrom
consolidation-invalidation-fix
Open

fix: scope consolidation validation to candidate-relevant state#3
jamesmt-aws wants to merge 1 commit intomainfrom
consolidation-invalidation-fix

Conversation

@jamesmt-aws
Copy link
Copy Markdown
Owner

@jamesmt-aws jamesmt-aws commented Mar 10, 2026

Summary

Three scoping changes to consolidation validation, one per validation stage:

  • revalidateTargetCandidates (stages 1 & 3): re-check only the command's candidate nodes instead of rebuilding the global candidate list
  • candidatePodSchedulingErrors (stage 2): check only the candidate's pods for scheduling errors instead of all non-pending pods
  • filterNewNodeClaimsByCandidatePods (stage 2): count only NewNodeClaims containing candidate-displaced pods

The simulation itself is unchanged. Only the comparison logic is scoped. All three safety invariants are preserved.

Related

Test plan

🤖 Generated with Claude Code

@jamesmt-aws jamesmt-aws force-pushed the consolidation-invalidation-fix branch from 9d8b3de to 3f878f4 Compare March 10, 2026 01:22
@jamesmt-aws
Copy link
Copy Markdown
Owner Author

Note: candidatePodSchedulingErrors now sorts its error messages for deterministic output. The existing NonPendingPodSchedulingErrors() in scheduling/scheduler.go has the same non-deterministic map iteration pattern. Worth updating that too as a follow-up.

@jamesmt-aws jamesmt-aws force-pushed the consolidation-invalidation-fix branch 2 times, most recently from 15323b2 to 9a882c5 Compare March 10, 2026 01:39
Consolidation validation currently operates on global cluster state,
so any change to any NodePool can invalidate a plan that only affects
one NodePool. This permanently blocks consolidation on clusters with
mixed workloads where one NodePool has high pod turnover.

Three scoping changes, one per validation stage:

1. revalidateTargetCandidates (stages 1 & 3): replace the global
   GetCandidates() + mapCandidates() rebuild with a targeted re-check
   of only the command's candidate nodes.

2. candidatePodSchedulingErrors (stage 2): replace
   AllNonPendingPodsScheduled() with a check that only examines the
   candidate's own pods for scheduling errors.

3. filterNewNodeClaimsByCandidatePods (stage 2): filter the
   NewNodeClaim comparison to only count claims containing pods
   displaced by the consolidation candidates.

Includes design doc, integration tests (T1-T7), and unit tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jamesmt-aws jamesmt-aws force-pushed the consolidation-invalidation-fix branch from 9a882c5 to cab9291 Compare March 10, 2026 14:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant