WIP: wait for old kueue-controller-manager pods to be gone on a kueue cr update by rphillips · Pull Request #1141 · openshift/kueue-operator

rphillips · 2025-11-26T22:45:11Z

This PR aims to fix a race condition: after the Kueue CR is updated, the test (via the webhook) can still reach the old controller-manager pods instead of the new ones.

STEP: Feching Kueue Instance @ 11/26/25 23:21:38.38
  STEP: Updating Kueue config @ 11/26/25 23:21:38.395
  STEP: Waiting for kueue-controller-manager resource version to change from 53889 @ 11/26/25 23:21:38.407
  STEP: Checking if deployment replicas: 3 matches amount of ready replicas: 1 @ 11/26/25 23:21:48.432
  STEP: Checking if deployment replicas: 2 matches amount of ready replicas: 2 @ 11/26/25 23:21:58.456
  STEP: Waiting for webhook configurations to exist @ 11/26/25 23:21:58.456
  STEP: Waiting for webhook to handle requests successfully @ 11/26/25 23:21:58.465

openshift-ci · 2025-11-26T22:45:33Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rphillips

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [rphillips]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

This PR adds logic to wait for old kueue-controller-manager pods to be fully terminated after a Kueue configuration update, ensuring clean rollover during deployment updates by verifying all pods match the current ReplicaSet's template hash before proceeding.

Key Changes:

Adds pod termination verification that compares pod-template-hash labels to identify and wait for old pods to be removed
Implements ReplicaSet discovery logic to determine the current active deployment version
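
The hash comparison described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `podInfo` and `stalePods` are hypothetical names, and `podInfo` is a trimmed-down stand-in for the real `corev1.Pod` type so the example needs only the standard library.

```go
package main

import "sort"

// podInfo is a hypothetical, trimmed-down stand-in for corev1.Pod.
type podInfo struct {
	Name   string
	Labels map[string]string
}

// stalePods returns the sorted names of pods whose pod-template-hash label
// differs from currentHash, i.e. pods left over from an older ReplicaSet.
// Pods without the label are also reported, because their label lookup
// yields an empty string.
func stalePods(pods []podInfo, currentHash string) []string {
	var stale []string
	for _, p := range pods {
		if p.Labels["pod-template-hash"] != currentHash {
			stale = append(stale, p.Name)
		}
	}
	sort.Strings(stale)
	return stale
}
```

A polling check would presumably keep retrying while `stalePods` returns a non-empty slice and succeed once it is empty.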


Copilot · 2025-11-26T23:44:05Z

test/e2e/e2e_operator_test.go

 		return fmt.Errorf("deployment is not ready")
 	}, testutils.OperatorReadyTime, testutils.OperatorPoll).Should(Succeed(), "kueue-controller-manager deployment failed to be ready")

+	// Wait for all old controller pods to be terminated to


Incomplete comment. The sentence ends with 'to' but doesn't complete the thought. Should be 'Wait for all old controller pods to be terminated' or 'Wait for all old controller pods to be terminated to ensure clean rollover'.

Suggested change

-	// Wait for all old controller pods to be terminated to
+	// Wait for all old controller pods to be terminated to ensure clean rollover

Copilot · 2025-11-26T23:44:06Z

test/e2e/e2e_operator_test.go

+		// Find the current (active) ReplicaSet - it should have replicas matching deployment spec
+		var currentTemplateHash string
+		for _, rs := range replicaSets.Items {
+			if rs.Status.Replicas == managerDeployment.Status.Replicas && rs.Status.Replicas > 0 {
+				currentTemplateHash = rs.Labels["pod-template-hash"]
+				break
+			}
+		}
+
+		if currentTemplateHash == "" {
+			return fmt.Errorf("could not find current active replicaset")


The logic to identify the current ReplicaSet is flawed. Comparing rs.Status.Replicas to managerDeployment.Status.Replicas doesn't reliably identify the active ReplicaSet. During rollout, multiple ReplicaSets may have replicas. Instead, check for rs.Status.ReadyReplicas == *managerDeployment.Spec.Replicas or use the deployment's selector to find the newest ReplicaSet by checking the deployment.kubernetes.io/revision annotation.

Suggested change

-		// Find the current (active) ReplicaSet - it should have replicas matching deployment spec
-		var currentTemplateHash string
-		for _, rs := range replicaSets.Items {
-			if rs.Status.Replicas == managerDeployment.Status.Replicas && rs.Status.Replicas > 0 {
-				currentTemplateHash = rs.Labels["pod-template-hash"]
-				break
-			}
-		}
-
-		if currentTemplateHash == "" {
-			return fmt.Errorf("could not find current active replicaset")
+		// Find the current (active) ReplicaSet using deployment.kubernetes.io/revision annotation
+		var currentTemplateHash string
+		deployRevision := managerDeployment.Annotations["deployment.kubernetes.io/revision"]
+		for _, rs := range replicaSets.Items {
+			if rs.Annotations["deployment.kubernetes.io/revision"] == deployRevision {
+				currentTemplateHash = rs.Labels["pod-template-hash"]
+				break
+			}
+		}
+
+		if currentTemplateHash == "" {
+			return fmt.Errorf("could not find current active replicaset with revision %s", deployRevision)
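
To make the difference between the two selection strategies concrete, here is a small standard-library sketch. The types `replicaSetInfo` and `deploymentInfo` and both function names are hypothetical stand-ins for the `appsv1` objects used in the test, and the scenario in the usage note is an assumed early-rollout snapshot rather than something observed in this PR's CI runs.

```go
package main

import "fmt"

// replicaSetInfo is a hypothetical stand-in for appsv1.ReplicaSet.
type replicaSetInfo struct {
	Hash     string // value of the pod-template-hash label
	Revision string // value of the deployment.kubernetes.io/revision annotation
	Replicas int32  // status.replicas
}

// deploymentInfo is a hypothetical stand-in for appsv1.Deployment.
type deploymentInfo struct {
	Revision string // value of the deployment.kubernetes.io/revision annotation
	Replicas int32  // status.replicas
}

// hashByReplicaCount mirrors the original heuristic: pick the first
// ReplicaSet whose replica count equals the deployment's replica count.
func hashByReplicaCount(d deploymentInfo, sets []replicaSetInfo) (string, error) {
	for _, rs := range sets {
		if rs.Replicas == d.Replicas && rs.Replicas > 0 {
			return rs.Hash, nil
		}
	}
	return "", fmt.Errorf("could not find current active replicaset")
}

// hashByRevision mirrors the suggested approach: pick the ReplicaSet whose
// revision annotation matches the deployment's revision annotation.
func hashByRevision(d deploymentInfo, sets []replicaSetInfo) (string, error) {
	for _, rs := range sets {
		if rs.Revision == d.Revision {
			return rs.Hash, nil
		}
	}
	return "", fmt.Errorf("could not find replicaset with revision %s", d.Revision)
}
```

Suppose the deployment is at revision "2" with two replicas, the old ReplicaSet (revision "1") still holds both pods, and the new ReplicaSet (revision "2") has not scaled up yet. In that snapshot `hashByReplicaCount` returns the old hash, while `hashByRevision` returns the new one.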

Copilot · 2025-11-26T23:44:06Z

test/e2e/e2e_operator_test.go

+		By(fmt.Sprintf("All old controller pods terminated, %d new pods running with hash %s", len(pods.Items), currentTemplateHash))
+		return nil
+	}, testutils.OperatorReadyTime, testutils.OperatorPoll).Should(Succeed(), "old controller pods failed to terminate")
+


This By statement is placed inside the Eventually block's function, which means it will be printed on every successful iteration. The By statement should be placed after the Eventually block completes successfully, not inside the polling function.

Suggested change

-		By(fmt.Sprintf("All old controller pods terminated, %d new pods running with hash %s", len(pods.Items), currentTemplateHash))
-		return nil
-	}, testutils.OperatorReadyTime, testutils.OperatorPoll).Should(Succeed(), "old controller pods failed to terminate")
+		return nil
+	}, testutils.OperatorReadyTime, testutils.OperatorPoll).Should(Succeed(), "old controller pods failed to terminate")
+	// Print the message after Eventually succeeds
+	{
+		// Get the current deployment to find the active ReplicaSet
+		managerDeployment, err := kClient.AppsV1().Deployments(testutils.OperatorNamespace).Get(ctx, "kueue-controller-manager", metav1.GetOptions{})
+		if err == nil {
+			// List all ReplicaSets for this deployment
+			replicaSets, err := kClient.AppsV1().ReplicaSets(testutils.OperatorNamespace).List(ctx, metav1.ListOptions{
+				LabelSelector: "control-plane=controller-manager",
+			})
+			if err == nil {
+				var currentTemplateHash string
+				for _, rs := range replicaSets.Items {
+					if rs.Status.Replicas == managerDeployment.Status.Replicas && rs.Status.Replicas > 0 {
+						currentTemplateHash = rs.Labels["pod-template-hash"]
+						break
+					}
+				}
+				if currentTemplateHash != "" {
+					pods, err := kClient.CoreV1().Pods(testutils.OperatorNamespace).List(ctx, metav1.ListOptions{
+						LabelSelector: "control-plane=controller-manager",
+					})
+					if err == nil {
+						By(fmt.Sprintf("All old controller pods terminated, %d new pods running with hash %s", len(pods.Items), currentTemplateHash))
+					}
+				}
+			}
+		}
+	}

openshift-ci · 2026-01-28T16:53:30Z

@rphillips: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/test-e2e-upstream-4-20 | `5c56774` | link | true | `/test test-e2e-upstream-4-20` |
| ci/prow/test-e2e-downstream-4-20 | `5c56774` | link | true | `/test test-e2e-downstream-4-20` |
| ci/prow/test-e2e-downstream-4-21 | `5c56774` | link | true | `/test test-e2e-downstream-4-21` |
| ci/prow/test-e2e-upstream-4-21 | `5c56774` | link | true | `/test test-e2e-upstream-4-21` |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 26, 2025

rphillips changed the title ~~WIP: wait for old kueue-controller-manager pods to be gone on a kueue cr u…~~ WIP: wait for old kueue-controller-manager pods to be gone on a kueue cr update Nov 26, 2025

openshift-ci bot requested review from cpmeadors and kannon92 November 26, 2025 22:45

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 26, 2025

kannon92 requested review from Copilot and removed request for cpmeadors and kannon92 November 26, 2025 23:43

Copilot AI reviewed Nov 26, 2025

rphillips mentioned this pull request Nov 28, 2025

OCPKUEUE-365: Add e2e for labelPolicy=None with JobSet #1066

Merged

wait for old kueue-controller-manager pods to be gone on a kueue cr update

5c56774

rphillips force-pushed the wait_for_kueue_cr_update_and_pods branch from 72975fd to 5c56774 Compare December 1, 2025 19:21

rphillips wants to merge 1 commit into openshift:main from rphillips:wait_for_kueue_cr_update_and_pods

2 participants
