
WIP: wait for old kueue-controller-manager pods to be gone on a kueue cr update #1141

Open
rphillips wants to merge 1 commit into openshift:main from
rphillips:wait_for_kueue_cr_update_and_pods

Conversation

rphillips (Contributor) commented Nov 26, 2025

The goal is to fix a race condition in which the Kueue CR is updated but the test, via the webhook, still observes the old controller-manager pods instead of the new ones.

STEP: Feching Kueue Instance @ 11/26/25 23:21:38.38
  STEP: Updating Kueue config @ 11/26/25 23:21:38.395
  STEP: Waiting for kueue-controller-manager resource version to change from 53889 @ 11/26/25 23:21:38.407
  STEP: Checking if deployment replicas: 3 matches amount of ready replicas: 1 @ 11/26/25 23:21:48.432
  STEP: Checking if deployment replicas: 2 matches amount of ready replicas: 2 @ 11/26/25 23:21:58.456
  STEP: Waiting for webhook configurations to exist @ 11/26/25 23:21:58.456
  STEP: Waiting for webhook to handle requests successfully @ 11/26/25 23:21:58.465

openshift-ci bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) Nov 26, 2025
rphillips changed the title from "WIP: wait for old kueue-controller-manager pods to be gone on a kueue cr u…" to "WIP: wait for old kueue-controller-manager pods to be gone on a kueue cr update" Nov 26, 2025
openshift-ci bot requested review from cpmeadors and kannon92 November 26, 2025 22:45

openshift-ci bot commented Nov 26, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rphillips


openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Nov 26, 2025
kannon92 requested review from Copilot and removed request for cpmeadors and kannon92 November 26, 2025 23:43
Copilot AI left a comment

Pull request overview

This PR adds logic to wait for old kueue-controller-manager pods to be fully terminated after a Kueue configuration update, ensuring clean rollover during deployment updates by verifying all pods match the current ReplicaSet's template hash before proceeding.

Key Changes:

  • Adds pod termination verification that compares pod-template-hash labels to identify and wait for old pods to be removed
  • Implements ReplicaSet discovery logic to determine the current active deployment version
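
The hash comparison described in the overview can be sketched roughly as follows. This is a minimal illustration and not the PR's code: the `pod` struct and the `stalePods` function are hypothetical stand-ins for `corev1.Pod` and the test's polling body, chosen so the block runs with only the standard library. The `pod-template-hash` label is the one the Deployment controller sets on each ReplicaSet and its pods.

```go
package main

import "fmt"

// pod is a hypothetical, simplified stand-in for corev1.Pod.
type pod struct {
	Name   string
	Labels map[string]string
}

// stalePods returns the names of pods whose pod-template-hash label
// differs from currentHash, i.e. pods left over from an older ReplicaSet.
func stalePods(pods []pod, currentHash string) []string {
	var stale []string
	for _, p := range pods {
		if p.Labels["pod-template-hash"] != currentHash {
			stale = append(stale, p.Name)
		}
	}
	return stale
}

func main() {
	// Invented snapshot: one pod from the old ReplicaSet, one from the new.
	pods := []pod{
		{Name: "kueue-controller-manager-aaa-1", Labels: map[string]string{"pod-template-hash": "aaa"}},
		{Name: "kueue-controller-manager-bbb-1", Labels: map[string]string{"pod-template-hash": "bbb"}},
	}
	if stale := stalePods(pods, "bbb"); len(stale) > 0 {
		fmt.Printf("still waiting on old pods: %v\n", stale)
	}
}
```

In a polling test, a non-empty result would translate into returning an error so the check is retried until no stale pods remain.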

return fmt.Errorf("deployment is not ready")
}, testutils.OperatorReadyTime, testutils.OperatorPoll).Should(Succeed(), "kueue-controller-manager deployment failed to be ready")

// Wait for all old controller pods to be terminated to
Copilot AI Nov 26, 2025

Incomplete comment. The sentence ends with 'to' but doesn't complete the thought. Should be 'Wait for all old controller pods to be terminated' or 'Wait for all old controller pods to be terminated to ensure clean rollover'.

Suggested change
-// Wait for all old controller pods to be terminated to
+// Wait for all old controller pods to be terminated to ensure clean rollover

Comment on lines 1254 to 1264
// Find the current (active) ReplicaSet - it should have replicas matching deployment spec
var currentTemplateHash string
for _, rs := range replicaSets.Items {
	if rs.Status.Replicas == managerDeployment.Status.Replicas && rs.Status.Replicas > 0 {
		currentTemplateHash = rs.Labels["pod-template-hash"]
		break
	}
}

if currentTemplateHash == "" {
	return fmt.Errorf("could not find current active replicaset")
Copilot AI Nov 26, 2025

The logic to identify the current ReplicaSet is flawed. Comparing rs.Status.Replicas to managerDeployment.Status.Replicas doesn't reliably identify the active ReplicaSet. During rollout, multiple ReplicaSets may have replicas. Instead, check for rs.Status.ReadyReplicas == *managerDeployment.Spec.Replicas or use the deployment's selector to find the newest ReplicaSet by checking the deployment.kubernetes.io/revision annotation.

Suggested change
-// Find the current (active) ReplicaSet - it should have replicas matching deployment spec
-var currentTemplateHash string
-for _, rs := range replicaSets.Items {
-	if rs.Status.Replicas == managerDeployment.Status.Replicas && rs.Status.Replicas > 0 {
-		currentTemplateHash = rs.Labels["pod-template-hash"]
-		break
-	}
-}
-if currentTemplateHash == "" {
-	return fmt.Errorf("could not find current active replicaset")
+// Find the current (active) ReplicaSet using deployment.kubernetes.io/revision annotation
+var currentTemplateHash string
+deployRevision := managerDeployment.Annotations["deployment.kubernetes.io/revision"]
+for _, rs := range replicaSets.Items {
+	if rs.Annotations["deployment.kubernetes.io/revision"] == deployRevision {
+		currentTemplateHash = rs.Labels["pod-template-hash"]
+		break
+	}
+}
+if currentTemplateHash == "" {
+	return fmt.Errorf("could not find current active replicaset with revision %s", deployRevision)
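
To make the concern about the replica-count check concrete, here is a small sketch that runs both selection strategies against a snapshot taken just after an update, before the new ReplicaSet has been scaled up. The `replicaSet` and `deployment` structs are hypothetical simplifications of the `appsv1` types, the function names are made up for this example, and the snapshot values are invented. The guard against an empty revision is an addition of this sketch, not part of the suggestion.

```go
package main

import (
	"errors"
	"fmt"
)

const revisionAnnotation = "deployment.kubernetes.io/revision"

// replicaSet and deployment are hypothetical, simplified stand-ins
// for appsv1.ReplicaSet and appsv1.Deployment.
type replicaSet struct {
	Annotations    map[string]string
	Labels         map[string]string
	StatusReplicas int32
}

type deployment struct {
	Annotations    map[string]string
	StatusReplicas int32
}

// hashByReplicaCount mirrors the original heuristic: pick the first
// ReplicaSet whose replica count equals the Deployment's.
func hashByReplicaCount(d deployment, rss []replicaSet) (string, error) {
	for _, rs := range rss {
		if rs.StatusReplicas == d.StatusReplicas && rs.StatusReplicas > 0 {
			return rs.Labels["pod-template-hash"], nil
		}
	}
	return "", errors.New("could not find current active replicaset")
}

// hashByRevision mirrors the suggested approach: match on the revision annotation.
func hashByRevision(d deployment, rss []replicaSet) (string, error) {
	rev := d.Annotations[revisionAnnotation]
	if rev == "" {
		// Avoid matching ReplicaSets that also lack the annotation.
		return "", errors.New("deployment has no revision annotation")
	}
	for _, rs := range rss {
		if rs.Annotations[revisionAnnotation] == rev {
			return rs.Labels["pod-template-hash"], nil
		}
	}
	return "", fmt.Errorf("could not find replicaset with revision %s", rev)
}

func main() {
	// Invented snapshot: the new ReplicaSet (revision 2) exists but has no pods yet.
	d := deployment{Annotations: map[string]string{revisionAnnotation: "2"}, StatusReplicas: 2}
	rss := []replicaSet{
		{Annotations: map[string]string{revisionAnnotation: "1"}, Labels: map[string]string{"pod-template-hash": "aaa"}, StatusReplicas: 2},
		{Annotations: map[string]string{revisionAnnotation: "2"}, Labels: map[string]string{"pod-template-hash": "bbb"}, StatusReplicas: 0},
	}
	byCount, _ := hashByReplicaCount(d, rss)
	byRev, _ := hashByRevision(d, rss)
	fmt.Printf("by replica count: %s, by revision: %s\n", byCount, byRev)
}
```

On this snapshot the replica-count heuristic selects the old ReplicaSet's hash (`aaa`), while the revision match selects the new one (`bbb`). Whether such a snapshot is actually observable depends on controller timing in a real cluster.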

Comment on lines +1283 to +1286
By(fmt.Sprintf("All old controller pods terminated, %d new pods running with hash %s", len(pods.Items), currentTemplateHash))
return nil
}, testutils.OperatorReadyTime, testutils.OperatorPoll).Should(Succeed(), "old controller pods failed to terminate")

Copilot AI Nov 26, 2025

This By statement is placed inside the Eventually block's function, which means it will be printed on every successful iteration. The By statement should be placed after the Eventually block completes successfully, not inside the polling function.

Suggested change
-	By(fmt.Sprintf("All old controller pods terminated, %d new pods running with hash %s", len(pods.Items), currentTemplateHash))
-	return nil
-}, testutils.OperatorReadyTime, testutils.OperatorPoll).Should(Succeed(), "old controller pods failed to terminate")
+	return nil
+}, testutils.OperatorReadyTime, testutils.OperatorPoll).Should(Succeed(), "old controller pods failed to terminate")
+// Print the message after Eventually succeeds
+{
+	// Get the current deployment to find the active ReplicaSet
+	managerDeployment, err := kClient.AppsV1().Deployments(testutils.OperatorNamespace).Get(ctx, "kueue-controller-manager", metav1.GetOptions{})
+	if err == nil {
+		// List all ReplicaSets for this deployment
+		replicaSets, err := kClient.AppsV1().ReplicaSets(testutils.OperatorNamespace).List(ctx, metav1.ListOptions{
+			LabelSelector: "control-plane=controller-manager",
+		})
+		if err == nil {
+			var currentTemplateHash string
+			for _, rs := range replicaSets.Items {
+				if rs.Status.Replicas == managerDeployment.Status.Replicas && rs.Status.Replicas > 0 {
+					currentTemplateHash = rs.Labels["pod-template-hash"]
+					break
+				}
+			}
+			if currentTemplateHash != "" {
+				pods, err := kClient.CoreV1().Pods(testutils.OperatorNamespace).List(ctx, metav1.ListOptions{
+					LabelSelector: "control-plane=controller-manager",
+				})
+				if err == nil {
+					By(fmt.Sprintf("All old controller pods terminated, %d new pods running with hash %s", len(pods.Items), currentTemplateHash))
+				}
+			}
+		}
+	}
+}
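
One possible alternative to re-fetching everything after the poll is to capture the observed values in variables declared outside the polled function and report them once afterwards. The sketch below illustrates that pattern with the standard library only. `pollUntil` is a hypothetical stand-in for Gomega's `Eventually`, `errNotReady` is a made-up sentinel, and the observed values are simulated rather than read from a cluster.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is a made-up sentinel for "old pods still present".
var errNotReady = errors.New("old controller pods still present")

// pollUntil is a hypothetical stand-in for Gomega's Eventually: it calls
// check every interval until check returns nil or timeout has elapsed.
func pollUntil(timeout, interval time.Duration, check func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := check()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out: %w", err)
		}
		time.Sleep(interval)
	}
}

func main() {
	// Values seen by the final poll, declared outside the closure so
	// they can be reported after polling succeeds.
	var podCount int
	var hash string

	attempts := 0
	err := pollUntil(time.Second, 10*time.Millisecond, func() error {
		attempts++
		if attempts < 3 {
			return errNotReady // simulated: rollout not finished yet
		}
		podCount, hash = 2, "bbb" // simulated observation
		return nil
	})
	if err != nil {
		fmt.Println(err)
		return
	}
	// Reported once, after polling has succeeded.
	fmt.Printf("All old controller pods terminated, %d new pods running with hash %s\n", podCount, hash)
}
```

Capturing the values avoids the extra API calls in the suggested change, at the cost of state shared between the closure and the surrounding test.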

rphillips force-pushed the wait_for_kueue_cr_update_and_pods branch from 72975fd to 5c56774 December 1, 2025 19:21

openshift-ci bot commented Jan 28, 2026

@rphillips: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/test-e2e-upstream-4-20 | 5c56774 | link | true | /test test-e2e-upstream-4-20 |
| ci/prow/test-e2e-downstream-4-20 | 5c56774 | link | true | /test test-e2e-downstream-4-20 |
| ci/prow/test-e2e-downstream-4-21 | 5c56774 | link | true | /test test-e2e-downstream-4-21 |
| ci/prow/test-e2e-upstream-4-21 | 5c56774 | link | true | /test test-e2e-upstream-4-21 |


Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
