Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates the SmartUpdate flow for PerconaServerMySQL StatefulSets to perform an explicit primary switchover before updating the current primary pod, aiming to reduce disruption during upgrades.
Changes:
- Refactors SmartUpdate to update secondaries first, then perform switchover, then update the former primary.
- Adds switchover helpers for async (Orchestrator) and GR (mysqlsh) clusters, including a wait-for-primary-change step.
- Adds unit tests covering primary-candidate selection, revision-change detection, and switchover helpers.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pkg/controller/ps/upgrade.go | Implements switchover-driven SmartUpdate flow and replaces pod deletion waiting logic. |
| pkg/controller/ps/upgrade_test.go | Adds focused unit tests for candidate selection, stsChanged, and switchover helpers (GR/async). |
Comments suppressed due to low confidence (1)
pkg/controller/ps/upgrade.go:281
deletePodAndWaitreturns an error ifcli.Deletereturns NotFound. During reconciliation (or after a previous attempt), the pod may already be deleted, and treating NotFound as fatal will unnecessarily fail SmartUpdate. Consider ignoring NotFound errors on delete (e.g.,client.IgnoreNotFound) and continuing to the wait loop.
func deletePodAndWait(ctx context.Context, cli client.Client, pod *corev1.Pod, sts *appsv1.StatefulSet) error {
err := cli.Delete(ctx, pod)
if err != nil {
return err
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
pkg/controller/ps/upgrade.go
Outdated
| return err | ||
| } | ||
|
|
||
| if !strings.HasPrefix(primHost, target.Name) { |
There was a problem hiding this comment.
switchOverAndWait checks the new primary via strings.HasPrefix(primHost, target.Name). Pod names are of the form <cluster>-mysql-<idx> (e.g. ...-1, ...-10), so HasPrefix can match the wrong pod (...-1 is a prefix of ...-10). Parse the hostname (e.g. split on .) and compare the pod name segment for equality with target.Name instead.
| if !strings.HasPrefix(primHost, target.Name) { | |
| hostParts := strings.SplitN(primHost, ".", 2) | |
| if len(hostParts) == 0 || hostParts[0] != target.Name { |
| orcPod, err := getReadyOrcPod(ctx, r.Client, cr) | ||
| if err != nil { | ||
| return errors.Wrap(err, "get ready orchestrator pod") | ||
| } |
There was a problem hiding this comment.
switchOverAsync always requires a ready Orchestrator pod. Async clusters can have orchestrator disabled when spec.unsafe.orchestrator is set (see PerconaServerMySQL.OrchestratorEnabled()), and in that configuration SmartUpdate will now fail with "get ready orchestrator pod" instead of progressing. Consider guarding this path (e.g. skip switchover / fall back to deleting the primary) when orchestrator isn’t enabled/available.
There was a problem hiding this comment.
if you disable orchestrator, our CRD validations won't let you use SmartUpdate
| func (r *PerconaServerMySQLReconciler) switchOverAndWait( | ||
| ctx context.Context, | ||
| cr *apiv1.PerconaServerMySQL, | ||
| primary *corev1.Pod, target *corev1.Pod, | ||
| ) error { | ||
| log := logf.FromContext(ctx) | ||
|
|
||
| log.Info("switchover", "current", primary.Name, "target", target.Name) | ||
|
|
There was a problem hiding this comment.
New switchover orchestration in switchOverAndWait (including the retry loop that polls getPrimaryHost and the GR primary-label reconciliation) is not covered by tests; current tests only cover switchOverGR/switchOverAsync in isolation. Adding a focused unit test for switchOverAndWait would help prevent regressions in the wait/verification logic.
commit: a578dcc |
CHANGE DESCRIPTION
Problem:
Short explanation of the problem.
Cause:
Short explanation of the root cause of the issue if applicable.
Solution:
Short explanation of the solution we are providing with this PR.
CHECKLIST
Jira
Needs Doc) and QA (Needs QA)?Tests
Config/Logging/Testability