K8SPS-683: Perform switchover during smart update #1282

Draft
egegunes wants to merge 6 commits into main from K8SPS-683

@egegunes egegunes commented Apr 7, 2026

CHANGE DESCRIPTION

Problem:
Short explanation of the problem.

Cause:
Short explanation of the root cause of the issue if applicable.

Solution:
Short explanation of the solution we are providing with this PR.

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PS version?
  • Does the change support oldest and newest supported Kubernetes version?

@egegunes egegunes added this to the v1.1.0 milestone Apr 7, 2026
Copilot AI review requested due to automatic review settings April 7, 2026 06:19
@pull-request-size pull-request-size bot added the size/XL 500-999 lines label Apr 7, 2026

Copilot AI left a comment

Pull request overview

This PR updates the SmartUpdate flow for PerconaServerMySQL StatefulSets to perform an explicit primary switchover before updating the current primary pod, aiming to reduce disruption during upgrades.

Changes:

  • Refactors SmartUpdate to update secondaries first, then perform switchover, then update the former primary.
  • Adds switchover helpers for async (Orchestrator) and GR (mysqlsh) clusters, including a wait-for-primary-change step.
  • Adds unit tests covering primary-candidate selection, revision-change detection, and switchover helpers.
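
The ordering described in the list above can be illustrated with a minimal sketch. It is not the code from this PR: the `pod` type and the `updateOrder` function are hypothetical, and choosing the first secondary as the switchover target is an assumption made only for illustration (the PR has its own primary-candidate selection).

```go
package main

import "fmt"

// pod is a hypothetical stand-in for a Kubernetes pod; only the fields
// needed to illustrate the ordering are modeled.
type pod struct {
	Name    string
	Primary bool
}

// updateOrder returns the steps in the order described above: every
// secondary is updated first, then the primary role moves to an
// already-updated secondary, and the former primary is updated last.
// Assumption: the first secondary in the slice is used as the target.
func updateOrder(pods []pod) []string {
	var steps []string
	var primary, target *pod
	for i := range pods {
		p := &pods[i]
		if p.Primary {
			primary = p
			continue
		}
		if target == nil {
			target = p
		}
		steps = append(steps, "update "+p.Name)
	}
	if primary == nil {
		// No primary known: nothing to switch over.
		return steps
	}
	if target != nil {
		steps = append(steps, "switchover "+primary.Name+" -> "+target.Name)
	}
	steps = append(steps, "update "+primary.Name)
	return steps
}

func main() {
	pods := []pod{
		{Name: "cluster1-mysql-0", Primary: true},
		{Name: "cluster1-mysql-1"},
		{Name: "cluster1-mysql-2"},
	}
	for _, s := range updateOrder(pods) {
		fmt.Println(s)
	}
}
```

With a single-pod cluster there is no secondary to switch to, so the sketch falls back to updating the primary directly.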

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| pkg/controller/ps/upgrade.go | Implements switchover-driven SmartUpdate flow and replaces pod deletion waiting logic. |
| pkg/controller/ps/upgrade_test.go | Adds focused unit tests for candidate selection, stsChanged, and switchover helpers (GR/async). |
Comments suppressed due to low confidence (1)

pkg/controller/ps/upgrade.go:281

  • deletePodAndWait returns an error if cli.Delete returns NotFound. During reconciliation (or after a previous attempt), the pod may already be deleted, and treating NotFound as fatal will unnecessarily fail SmartUpdate. Consider ignoring NotFound errors on delete (e.g., client.IgnoreNotFound) and continuing to the wait loop.
func deletePodAndWait(ctx context.Context, cli client.Client, pod *corev1.Pod, sts *appsv1.StatefulSet) error {
	err := cli.Delete(ctx, pod)
	if err != nil {
		return err
	}
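
The NotFound handling suggested in this comment might look roughly like the sketch below. It uses only the standard library so it runs on its own: `errNotFound`, `errForbidden`, `ignoreNotFound`, and `deleteAndWait` are hypothetical stand-ins, with `ignoreNotFound` imitating the behavior of controller-runtime's `client.IgnoreNotFound` (return nil for NotFound, pass every other error through).

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinels standing in for Kubernetes API errors.
var (
	errNotFound  = errors.New("pod not found")
	errForbidden = errors.New("forbidden")
)

// ignoreNotFound imitates client.IgnoreNotFound: a NotFound error becomes
// nil, any other error is returned unchanged.
func ignoreNotFound(err error) error {
	if errors.Is(err, errNotFound) {
		return nil
	}
	return err
}

// deleteAndWait deletes through deleteFn and then waits through waitFn.
// A pod that is already gone is not treated as a failure, so a repeated
// reconcile can proceed straight to the wait step.
func deleteAndWait(deleteFn, waitFn func() error) error {
	if err := ignoreNotFound(deleteFn()); err != nil {
		return fmt.Errorf("delete pod: %w", err)
	}
	return waitFn()
}

func main() {
	wait := func() error { return nil }

	// Pod already deleted by a previous attempt: still succeeds.
	fmt.Println(deleteAndWait(func() error { return errNotFound }, wait))

	// Any other delete error is still reported.
	fmt.Println(deleteAndWait(func() error { return errForbidden }, wait))
}
```

In the operator itself the same effect would come from wrapping the `cli.Delete` call with `client.IgnoreNotFound`, as the comment proposes.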

Copilot AI review requested due to automatic review settings April 7, 2026 12:02

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.


return err
}

if !strings.HasPrefix(primHost, target.Name) {

Copilot AI Apr 7, 2026

switchOverAndWait checks the new primary via strings.HasPrefix(primHost, target.Name). Pod names are of the form <cluster>-mysql-<idx> (e.g. ...-1, ...-10), so HasPrefix can match the wrong pod (...-1 is a prefix of ...-10). Parse the hostname (e.g. split on .) and compare the pod name segment for equality with target.Name instead.

Suggested change
- if !strings.HasPrefix(primHost, target.Name) {
+ hostParts := strings.SplitN(primHost, ".", 2)
+ if len(hostParts) == 0 || hostParts[0] != target.Name {
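
The prefix collision is easy to reproduce in isolation. The sketch below is illustrative only: the hostname and the helpers `podNameFromHost` and `hostMatchesPod` are made up for the example, and it assumes the primary host is a dotted DNS name whose first label is the pod name.

```go
package main

import (
	"fmt"
	"strings"
)

// podNameFromHost returns the first DNS label of host. A host without
// any dot is returned unchanged.
func podNameFromHost(host string) string {
	name, _, _ := strings.Cut(host, ".")
	return name
}

// hostMatchesPod reports whether host belongs to the pod named podName,
// comparing the pod-name segment for equality instead of by prefix.
func hostMatchesPod(host, podName string) bool {
	return podNameFromHost(host) == podName
}

func main() {
	// Hypothetical hostname of pod index 10.
	host := "cluster1-mysql-10.cluster1-mysql.default"

	// Prefix check: pod 1 wrongly matches the host of pod 10.
	fmt.Println(strings.HasPrefix(host, "cluster1-mysql-1")) // true

	// Equality on the first label: only pod 10 matches.
	fmt.Println(hostMatchesPod(host, "cluster1-mysql-1"))  // false
	fmt.Println(hostMatchesPod(host, "cluster1-mysql-10")) // true
}
```

The collision only appears once a StatefulSet reaches ten or more replicas, which is why it can go unnoticed in small test clusters.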

Comment on lines +245 to +248
orcPod, err := getReadyOrcPod(ctx, r.Client, cr)
if err != nil {
	return errors.Wrap(err, "get ready orchestrator pod")
}

Copilot AI Apr 7, 2026

switchOverAsync always requires a ready Orchestrator pod. Async clusters can have orchestrator disabled when spec.unsafe.orchestrator is set (see PerconaServerMySQL.OrchestratorEnabled()), and in that configuration SmartUpdate will now fail with "get ready orchestrator pod" instead of progressing. Consider guarding this path (e.g. skip switchover / fall back to deleting the primary) when orchestrator isn’t enabled/available.

Contributor Author

If you disable Orchestrator, our CRD validations won't let you use SmartUpdate.

Comment on lines +178 to +186
func (r *PerconaServerMySQLReconciler) switchOverAndWait(
	ctx context.Context,
	cr *apiv1.PerconaServerMySQL,
	primary *corev1.Pod, target *corev1.Pod,
) error {
	log := logf.FromContext(ctx)

	log.Info("switchover", "current", primary.Name, "target", target.Name)

Copilot AI Apr 7, 2026

New switchover orchestration in switchOverAndWait (including the retry loop that polls getPrimaryHost and the GR primary-label reconciliation) is not covered by tests; current tests only cover switchOverGR/switchOverAsync in isolation. Adding a focused unit test for switchOverAndWait would help prevent regressions in the wait/verification logic.
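
One way to make the wait/verification logic testable is to inject the primary lookup as a function, so a test can script the sequence of reported primaries. The sketch below is a generic illustration, not the PR's implementation: `waitForPrimary`, `fakePrimary`, and `errTimeout` are hypothetical names, and returning immediately on a lookup error (rather than retrying) is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout is returned when the target never becomes primary.
var errTimeout = errors.New("primary did not change in time")

// waitForPrimary polls getPrimary until it reports target or the attempts
// are used up. Assumption: a lookup error aborts the wait immediately.
func waitForPrimary(getPrimary func() (string, error), target string, attempts int, interval time.Duration) error {
	for i := 0; i < attempts; i++ {
		host, err := getPrimary()
		if err != nil {
			return fmt.Errorf("get primary host: %w", err)
		}
		if host == target {
			return nil
		}
		time.Sleep(interval)
	}
	return errTimeout
}

// fakePrimary returns a stub that yields the given hosts in order and then
// keeps repeating the last one. It needs at least one host.
func fakePrimary(hosts ...string) func() (string, error) {
	i := 0
	return func() (string, error) {
		h := hosts[i]
		if i < len(hosts)-1 {
			i++
		}
		return h, nil
	}
}

func main() {
	// The primary changes on the third poll: the wait succeeds.
	err := waitForPrimary(fakePrimary("mysql-0", "mysql-0", "mysql-1"), "mysql-1", 5, time.Millisecond)
	fmt.Println(err)

	// The primary never changes: the wait gives up.
	err = waitForPrimary(fakePrimary("mysql-0"), "mysql-1", 3, time.Millisecond)
	fmt.Println(err)
}
```

A unit test for `switchOverAndWait` could follow the same pattern, with the stub replacing `getPrimaryHost` and a short interval keeping the test fast.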

@JNKPercona

Test Name Result Time
async-ignore-annotations-8-4 passed 00:06:45
async-global-metadata-8-4 passed 00:14:45
async-upgrade-8-0 passed 00:12:58
async-upgrade-8-4 passed 00:13:07
auto-config-8-4 passed 00:26:27
config-8-4 passed 00:22:11
config-router-8-0 passed 00:07:30
config-router-8-4 passed 00:07:44
demand-backup-minio-8-0 passed 00:19:21
demand-backup-minio-8-4 passed 00:20:30
demand-backup-cloud-8-4 passed 00:22:56
demand-backup-retry-8-4 passed 00:17:13
demand-backup-incremental-8-0 passed 00:32:34
demand-backup-incremental-8-4 passed 00:34:57
async-data-at-rest-encryption-8-0 passed 00:13:27
async-data-at-rest-encryption-8-4 passed 00:13:43
gr-global-metadata-8-4 passed 00:14:27
gr-data-at-rest-encryption-8-0 passed 00:14:26
gr-data-at-rest-encryption-8-4 passed 00:14:26
gr-demand-backup-minio-8-4 passed 00:12:44
gr-demand-backup-cloud-8-4 passed 00:22:25
gr-demand-backup-haproxy-8-4 passed 00:10:10
gr-demand-backup-incremental-8-0 passed 00:35:32
gr-demand-backup-incremental-8-4 passed 00:35:46
gr-finalizer-8-4 passed 00:06:21
gr-haproxy-8-0 passed 00:05:06
gr-haproxy-8-4 passed 00:04:20
gr-ignore-annotations-8-4 passed 00:05:00
gr-init-deploy-8-0 passed 00:10:16
gr-init-deploy-8-4 passed 00:10:13
gr-one-pod-8-4 passed 00:05:52
gr-recreate-8-4 passed 00:16:51
gr-scaling-8-4 passed 00:08:20
gr-scheduled-backup-8-4 passed 00:16:29
gr-scheduled-backup-incremental-8-4 passed 00:23:17
gr-security-context-8-4 passed 00:10:07
gr-self-healing-8-4 passed 00:21:42
gr-tls-cert-manager-8-4 failure 00:07:20
gr-users-8-4 passed 00:05:27
gr-upgrade-8-0 failure 00:13:08
gr-upgrade-8-4 passed 00:09:47
haproxy-8-0 passed 00:08:20
haproxy-8-4 passed 00:08:17
init-deploy-8-0 passed 00:06:51
init-deploy-8-4 passed 00:05:42
limits-8-4 passed 00:07:26
monitoring-8-4 passed 00:14:35
one-pod-8-0 passed 00:05:50
one-pod-8-4 passed 00:05:34
operator-self-healing-8-4 passed 00:11:51
pvc-resize-8-4 passed 00:05:52
recreate-8-4 failure 00:00:43
scaling-8-4 passed 00:11:20
scheduled-backup-8-0 failure 00:02:14
scheduled-backup-8-4 failure 00:00:49
scheduled-backup-incremental-8-0 failure 00:00:50
scheduled-backup-incremental-8-4 failure 00:00:51
service-per-pod-8-4 passed 00:06:44
sidecars-8-4 passed 00:04:52
smart-update-8-4 failure 00:00:49
storage-8-4 failure 00:00:49
telemetry-8-4 failure 00:00:49
tls-cert-manager-8-4 failure 00:00:51
users-8-0 failure 00:00:50
users-8-4 passed 00:11:37
version-service-8-4 passed 00:19:56
Summary Value
Tests Run 66/66
Job Duration 06:19:58
Total Test Time 12:50:37

commit: a578dcc
image: perconalab/percona-server-mysql-operator:PR-1282-a578dcca

@hors hors removed this from the v1.1.0 milestone Apr 15, 2026