WIP: wait for old kueue-controller-manager pods to be gone on a Kueue CR update #1141
rphillips wants to merge 1 commit into openshift:main
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rphillips

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Pull request overview
This PR adds logic to wait for old kueue-controller-manager pods to be fully terminated after a Kueue configuration update, ensuring clean rollover during deployment updates by verifying all pods match the current ReplicaSet's template hash before proceeding.
Key Changes:
- Adds pod termination verification that compares pod-template-hash labels to identify and wait for old pods to be removed
- Implements ReplicaSet discovery logic to determine the current active deployment version
	return fmt.Errorf("deployment is not ready")
}, testutils.OperatorReadyTime, testutils.OperatorPoll).Should(Succeed(), "kueue-controller-manager deployment failed to be ready")

// Wait for all old controller pods to be terminated to
Incomplete comment. The sentence ends with 'to' but doesn't complete the thought. Should be 'Wait for all old controller pods to be terminated' or 'Wait for all old controller pods to be terminated to ensure clean rollover'.
Suggested change:
-// Wait for all old controller pods to be terminated to
+// Wait for all old controller pods to be terminated to ensure clean rollover
test/e2e/e2e_operator_test.go (Outdated)
// Find the current (active) ReplicaSet - it should have replicas matching deployment spec
var currentTemplateHash string
for _, rs := range replicaSets.Items {
	if rs.Status.Replicas == managerDeployment.Status.Replicas && rs.Status.Replicas > 0 {
		currentTemplateHash = rs.Labels["pod-template-hash"]
		break
	}
}

if currentTemplateHash == "" {
	return fmt.Errorf("could not find current active replicaset")
The logic to identify the current ReplicaSet is flawed. Comparing rs.Status.Replicas to managerDeployment.Status.Replicas doesn't reliably identify the active ReplicaSet. During rollout, multiple ReplicaSets may have replicas. Instead, check for rs.Status.ReadyReplicas == *managerDeployment.Spec.Replicas or use the deployment's selector to find the newest ReplicaSet by checking the deployment.kubernetes.io/revision annotation.
Suggested change:
-// Find the current (active) ReplicaSet - it should have replicas matching deployment spec
-var currentTemplateHash string
-for _, rs := range replicaSets.Items {
-	if rs.Status.Replicas == managerDeployment.Status.Replicas && rs.Status.Replicas > 0 {
-		currentTemplateHash = rs.Labels["pod-template-hash"]
-		break
-	}
-}
-if currentTemplateHash == "" {
-	return fmt.Errorf("could not find current active replicaset")
+// Find the current (active) ReplicaSet using the deployment.kubernetes.io/revision annotation
+var currentTemplateHash string
+deployRevision := managerDeployment.Annotations["deployment.kubernetes.io/revision"]
+for _, rs := range replicaSets.Items {
+	if rs.Annotations["deployment.kubernetes.io/revision"] == deployRevision {
+		currentTemplateHash = rs.Labels["pod-template-hash"]
+		break
+	}
+}
+if currentTemplateHash == "" {
+	return fmt.Errorf("could not find current active replicaset with revision %s", deployRevision)
By(fmt.Sprintf("All old controller pods terminated, %d new pods running with hash %s", len(pods.Items), currentTemplateHash))
return nil
}, testutils.OperatorReadyTime, testutils.OperatorPoll).Should(Succeed(), "old controller pods failed to terminate")
This By statement is placed inside the Eventually block's function, which means it will be printed on every successful iteration. The By statement should be placed after the Eventually block completes successfully, not inside the polling function.
Suggested change:
-By(fmt.Sprintf("All old controller pods terminated, %d new pods running with hash %s", len(pods.Items), currentTemplateHash))
-return nil
-}, testutils.OperatorReadyTime, testutils.OperatorPoll).Should(Succeed(), "old controller pods failed to terminate")
+return nil
+}, testutils.OperatorReadyTime, testutils.OperatorPoll).Should(Succeed(), "old controller pods failed to terminate")
+// Print the message after Eventually succeeds
+{
+	// Get the current deployment to find the active ReplicaSet
+	managerDeployment, err := kClient.AppsV1().Deployments(testutils.OperatorNamespace).Get(ctx, "kueue-controller-manager", metav1.GetOptions{})
+	if err == nil {
+		// List all ReplicaSets for this deployment
+		replicaSets, err := kClient.AppsV1().ReplicaSets(testutils.OperatorNamespace).List(ctx, metav1.ListOptions{
+			LabelSelector: "control-plane=controller-manager",
+		})
+		if err == nil {
+			var currentTemplateHash string
+			for _, rs := range replicaSets.Items {
+				if rs.Status.Replicas == managerDeployment.Status.Replicas && rs.Status.Replicas > 0 {
+					currentTemplateHash = rs.Labels["pod-template-hash"]
+					break
+				}
+			}
+			if currentTemplateHash != "" {
+				pods, err := kClient.CoreV1().Pods(testutils.OperatorNamespace).List(ctx, metav1.ListOptions{
+					LabelSelector: "control-plane=controller-manager",
+				})
+				if err == nil {
+					By(fmt.Sprintf("All old controller pods terminated, %d new pods running with hash %s", len(pods.Items), currentTemplateHash))
+				}
+			}
+		}
+	}
+}
Force-pushed 72975fd to 5c56774 (Compare)
@rphillips: The following tests failed, say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests.

Full PR test history. Your PR dashboard.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
The goal is to fix the race condition in which the Kueue CR is updated, but the test—via the webhook—still observes the old controller-manager pods instead of the new ones.