K8SPS-69 | async cluster readiness check should account for running backup#1292
Conversation
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Pull request overview
Adjusts the async-cluster readiness probe to avoid flagging the backup source pod as unhealthy when replication is intentionally stopped during a running backup by consulting the xtrabackup sidecar HTTP API.
Changes:
- Add xtrabackup sidecar client usage to detect whether a backup is currently running.
- Update async readiness logic to ignore `ReplicationStatusStopped` when a backup is running.
- Introduce `isBackupRunning(ctx)` helper to query `/backup/` on the sidecar.
```go
backupRunning, err := isBackupRunning(ctx)
if err != nil {
	return errors.Wrap(err, "check backup running")
}
```
isBackupRunning is called unconditionally before evaluating replStatus, which makes the async readiness check depend on the xtrabackup sidecar HTTP endpoint even when replication is healthy. If the sidecar is temporarily unavailable (startup, crash, port blocked), readiness will fail even though MySQL replication/read-only checks would otherwise pass. Consider only querying the sidecar when replStatus == ReplicationStatusStopped (and/or only treating sidecar errors as fatal in that branch) so the new behavior is limited to the stopped-replication case you’re trying to special-case for backups.
```go
	return true, nil
}

func isBackupRunning(ctx context.Context) (bool, error) {
```
@mayankshah1607 @eleo007, in this case, we need to have a separate task to improve it in one of the next releases. The idea is to mark this pod as unready to exclude it from read traffic while keeping the cluster in a ready state.
haproxy already excludes it if replication is stopped
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
```go
backupRunning, err := isBackupRunning(ctx)
if err != nil {
	return errors.Wrap(err, "check backup running")
}
```
isBackupRunning() errors currently make the whole async readiness check fail (return errors.Wrap(err, "check backup running")). That means transient sidecar issues (e.g., sidecar not yet listening, connection refused, localhost networking hiccup) can mark the MySQL container unready even when replication/read-only checks pass. Consider treating sidecar connectivity/errors as "no running backup" (return false, nil for dial/timeouts) and only performing the sidecar call when replStatus == ReplicationStatusStopped to avoid unnecessary probe failures and extra HTTP calls.
```go
func isBackupRunning(ctx context.Context) (bool, error) {
	sc := xtrabackup.NewSidecarClient("localhost")
	bcp, err := sc.GetRunningBackupConfig(ctx)
	if err != nil {
		return false, errors.Wrap(err, "get running backup config")
	}
```
isBackupRunning() currently wraps and returns any error from GetRunningBackupConfig(). Since this function is used by the readiness probe, consider explicitly handling common "sidecar not reachable" cases (connection refused / context deadline exceeded) by returning (false, nil) so readiness is not gated on the sidecar HTTP server being available.
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
```go
if cr.CompareVersion("1.1.0") >= 0 && cr.Spec.Backup != nil {
	env = append(env, corev1.EnvVar{
		Name:  "BACKUPS_ENABLED",
		Value: strconv.FormatBool(cr.Spec.Backup.Enabled),
	})
}
```
BACKUPS_ENABLED is introduced as a raw string here and is also read as a raw string in cmd/healthcheck/main.go. To avoid drift/typos, define a shared constant (e.g. in pkg/naming/env.go) and reference it from both places.
```go
if cr.CompareVersion("1.1.0") >= 0 && cr.Spec.Backup != nil {
	env = append(env, corev1.EnvVar{
		Name:  "BACKUPS_ENABLED",
		Value: strconv.FormatBool(cr.Spec.Backup.Enabled),
```
This new behavior is not covered by the existing pkg/mysql unit tests: please add a test asserting mysqldContainer (or the rendered StatefulSet container env) includes BACKUPS_ENABLED=true/false when backups are enabled/disabled (and when spec.backup is nil), so regressions in readiness behavior gating are caught.
Suggested change:
```diff
-if cr.CompareVersion("1.1.0") >= 0 && cr.Spec.Backup != nil {
-	env = append(env, corev1.EnvVar{
-		Name:  "BACKUPS_ENABLED",
-		Value: strconv.FormatBool(cr.Spec.Backup.Enabled),
+if cr.CompareVersion("1.1.0") >= 0 {
+	backupsEnabled := false
+	if cr.Spec.Backup != nil {
+		backupsEnabled = cr.Spec.Backup.Enabled
+	}
+	env = append(env, corev1.EnvVar{
+		Name:  "BACKUPS_ENABLED",
+		Value: strconv.FormatBool(backupsEnabled),
```
```go
backupsEnabled := os.Getenv("BACKUPS_ENABLED")
backupRunning := false

if backupsEnabled == "true" {
	var err error
	backupRunning, err = isBackupRunning(ctx)
```
BACKUPS_ENABLED is checked via a raw string here and is also set as a raw string in pkg/mysql/mysql.go. Please use a shared constant for the env var name (e.g. naming.Env...) to avoid subtle typos and keep producer/consumer in sync.
```go
backupsEnabled := os.Getenv("BACKUPS_ENABLED")
backupRunning := false

if backupsEnabled == "true" {
	var err error
	backupRunning, err = isBackupRunning(ctx)
	if err != nil {
		return errors.Wrap(err, "check backup running")
	}
}
```
isBackupRunning() is called whenever backups are enabled, even when replication is active (i.e., the result won’t affect the outcome). This unnecessarily couples MySQL readiness to the sidecar HTTP API and can make pods go NotReady if the sidecar is temporarily unavailable even though replication/read-only checks pass. Consider only calling isBackupRunning() in the ReplicationStatusStopped path (where it’s actually needed) and leaving other statuses unaffected.
```go
switch {
case replStatus == mysqldb.ReplicationStatusActive && !readOnly:
	return errors.New("replica is not read only")
```
With this change, a replica can become Ready while replication is stopped (during a backup) even if readOnly is false. Previously, the pod would be NotReady whenever replication was stopped, which implicitly prevented routing traffic to a potentially writable, non-replicating instance. Consider still enforcing readOnly==true when replStatus==ReplicationStatusStopped (even if backupRunning is true), or explicitly document/handle the writable case to avoid serving writes from an out-of-sync replica.
Suggested change:
```diff
 	return errors.New("replica is not read only")
+case replStatus == mysqldb.ReplicationStatusStopped && !readOnly:
+	return errors.New("replica is not read only while replication is stopped")
```
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
```go
if cr.CompareVersion("1.1.0") >= 0 {
	backupsEnabled := false
	if cr.Spec.Backup != nil {
		backupsEnabled = cr.Spec.Backup.Enabled
	}
	env = append(env, corev1.EnvVar{
		Name:  naming.EnvBackupsEnabled,
		Value: strconv.FormatBool(backupsEnabled),
	})
}
```
BACKUPS_ENABLED is only injected into the mysqld container when cr.CompareVersion("1.1.0") >= 0. However, cmd/healthcheck relies on this env var to decide whether to query the xtrabackup sidecar; if an upgraded cluster is still pinned to an older .spec.crVersion, the env var won’t be present and readiness will continue to fail during backups (the original issue). Consider setting BACKUPS_ENABLED unconditionally (or at least whenever .spec.backup is configured), since it’s consumed only by the healthcheck and is safe to expose across CR versions.
Suggested change:
```diff
-if cr.CompareVersion("1.1.0") >= 0 {
-	backupsEnabled := false
-	if cr.Spec.Backup != nil {
-		backupsEnabled = cr.Spec.Backup.Enabled
-	}
-	env = append(env, corev1.EnvVar{
-		Name:  naming.EnvBackupsEnabled,
-		Value: strconv.FormatBool(backupsEnabled),
-	})
-}
+backupsEnabled := false
+if cr.Spec.Backup != nil {
+	backupsEnabled = cr.Spec.Backup.Enabled
+}
+env = append(env, corev1.EnvVar{
+	Name:  naming.EnvBackupsEnabled,
+	Value: strconv.FormatBool(backupsEnabled),
+})
```
"Consider setting BACKUPS_ENABLED unconditionally (or at least whenever .spec.backup is configured)"

That would trigger a rolling restart.
commit: 7efcb1c
CHANGE DESCRIPTION
Problem:
The current readiness check for async clusters fails when replication is stopped. However, replication is intentionally stopped during backups, so the backup source pod appears unhealthy.
Cause:
We do not take the state of running backups into account.
Solution:
Update the readiness check to ignore stopped replication if a backup is running. The xtrabackup sidecar already exposes an HTTP API to get the status of a running backup, and this is available to the readiness check command on localhost.
CHECKLIST
Jira
- Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
Tests
Config/Logging/Testability