K8SPS-69 | async cluster readiness check should account for running backup #1292

Merged: hors merged 9 commits into main from K8SPS-69-fix-backup-readiness on Apr 15, 2026
Conversation

@mayankshah1607
Member

CHANGE DESCRIPTION

Problem:
The current readiness check for async clusters fails when replication is stopped. However, replication is intentionally stopped while a backup runs, so the backup source pod appears unhealthy for the duration of the backup.

Cause:
The readiness check does not take running backups into account.

Solution:
Update the readiness check to ignore stopped replication while a backup is running. The xtrabackup sidecar already exposes an HTTP API that reports the status of a running backup, and the readiness check command can reach it on localhost.
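
The rule described in the solution can be illustrated with a minimal sketch. The type and function names below (`ReplicationStatus`, `checkReplica`) are hypothetical stand-ins, not the operator's actual identifiers, and the sketch only models the decision, not the MySQL or sidecar queries.

```go
package main

import (
	"errors"
	"fmt"
)

// ReplicationStatus is a hypothetical stand-in for the operator's status type.
type ReplicationStatus int

const (
	ReplicationActive ReplicationStatus = iota
	ReplicationStopped
)

// checkReplica returns nil when the replica should be reported as ready.
// Stopped replication is tolerated only while a backup is running.
func checkReplica(status ReplicationStatus, backupRunning bool) error {
	if status == ReplicationStopped && !backupRunning {
		return errors.New("replication is stopped")
	}
	return nil
}

func main() {
	fmt.Println(checkReplica(ReplicationStopped, false)) // replication is stopped
	fmt.Println(checkReplica(ReplicationStopped, true))  // <nil>
}
```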

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PS version?
  • Does the change support oldest and newest supported Kubernetes version?

Signed-off-by: Mayank Shah <mayank.shah@percona.com>
@pull-request-size pull-request-size bot added the size/S 10-29 lines label Apr 10, 2026
Copilot AI review requested due to automatic review settings April 10, 2026 16:02
Contributor

Copilot AI left a comment

Pull request overview

Adjusts the async-cluster readiness probe to avoid flagging the backup source pod as unhealthy when replication is intentionally stopped during a running backup by consulting the xtrabackup sidecar HTTP API.

Changes:

  • Add xtrabackup sidecar client usage to detect whether a backup is currently running.
  • Update async readiness logic to ignore ReplicationStatusStopped when a backup is running.
  • Introduce isBackupRunning(ctx) helper to query /backup/ on the sidecar.

Comment thread cmd/healthcheck/main.go Outdated
Comment on lines +130 to +133
backupRunning, err := isBackupRunning(ctx)
if err != nil {
    return errors.Wrap(err, "check backup running")
}
Copilot AI Apr 10, 2026

isBackupRunning is called unconditionally before evaluating replStatus, which makes the async readiness check depend on the xtrabackup sidecar HTTP endpoint even when replication is healthy. If the sidecar is temporarily unavailable (startup, crash, port blocked), readiness will fail even though MySQL replication/read-only checks would otherwise pass. Consider only querying the sidecar when replStatus == ReplicationStatusStopped (and/or only treating sidecar errors as fatal in that branch) so the new behavior is limited to the stopped-replication case you’re trying to special-case for backups.

Comment thread cmd/healthcheck/main.go
    return true, nil
}

func isBackupRunning(ctx context.Context) (bool, error) {
Collaborator

@mayankshah1607 @eleo007, we need a separate task to improve this in one of the next releases. The idea is to mark this pod as unready, which excludes it from read traffic, while keeping the cluster in a ready state.

Member Author

HAProxy already excludes it if replication is stopped.

Copilot AI review requested due to automatic review settings April 13, 2026 15:07
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.


Comment thread cmd/healthcheck/main.go Outdated
Comment on lines +130 to +134
backupRunning, err := isBackupRunning(ctx)
if err != nil {
    return errors.Wrap(err, "check backup running")
}

Copilot AI Apr 13, 2026

isBackupRunning() errors currently make the whole async readiness check fail (return errors.Wrap(err, "check backup running")). That means transient sidecar issues (e.g., sidecar not yet listening, connection refused, localhost networking hiccup) can mark the MySQL container unready even when replication/read-only checks pass. Consider treating sidecar connectivity/errors as "no running backup" (return false, nil for dial/timeouts) and only performing the sidecar call when replStatus == ReplicationStatusStopped to avoid unnecessary probe failures and extra HTTP calls.

Comment thread cmd/healthcheck/main.go
Comment on lines +337 to +342
func isBackupRunning(ctx context.Context) (bool, error) {
    sc := xtrabackup.NewSidecarClient("localhost")
    bcp, err := sc.GetRunningBackupConfig(ctx)
    if err != nil {
        return false, errors.Wrap(err, "get running backup config")
    }
Copilot AI Apr 13, 2026

isBackupRunning() currently wraps and returns any error from GetRunningBackupConfig(). Since this function is used by the readiness probe, consider explicitly handling common "sidecar not reachable" cases (connection refused / context deadline exceeded) by returning (false, nil) so readiness is not gated on the sidecar HTTP server being available.

gkech previously approved these changes Apr 13, 2026
pooknull previously approved these changes Apr 14, 2026
@hors hors added this to the v1.1.0 milestone Apr 14, 2026
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
@mayankshah1607 mayankshah1607 dismissed stale reviews from pooknull and gkech via c85e652 April 14, 2026 13:13
@pull-request-size pull-request-size bot added size/M 30-99 lines and removed size/S 10-29 lines labels Apr 14, 2026
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Copilot AI review requested due to automatic review settings April 14, 2026 15:14
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.


Comment thread pkg/mysql/mysql.go Outdated
Comment on lines +682 to +687
if cr.CompareVersion("1.1.0") >= 0 && cr.Spec.Backup != nil {
    env = append(env, corev1.EnvVar{
        Name:  "BACKUPS_ENABLED",
        Value: strconv.FormatBool(cr.Spec.Backup.Enabled),
    })
}
Copilot AI Apr 14, 2026

BACKUPS_ENABLED is introduced as a raw string here and is also read as a raw string in cmd/healthcheck/main.go. To avoid drift/typos, define a shared constant (e.g. in pkg/naming/env.go) and reference it from both places.
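
A minimal sketch of the shared-constant approach is shown below. The constant name `EnvBackupsEnabled` follows the suggestion; the `envVar` struct is a simplified stand-in for `corev1.EnvVar`, and the helper functions are hypothetical. Producer and consumer are placed in one file only to keep the example runnable.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// EnvBackupsEnabled is the single definition of the variable name, shared by
// the code that sets it and the code that reads it.
const EnvBackupsEnabled = "BACKUPS_ENABLED"

// envVar is a simplified stand-in for corev1.EnvVar.
type envVar struct{ Name, Value string }

// backupsEnabledEnv is the producer side: it builds the container env entry.
func backupsEnabledEnv(enabled bool) envVar {
	return envVar{Name: EnvBackupsEnabled, Value: strconv.FormatBool(enabled)}
}

// backupsEnabled is the consumer side. The lookup function is injectable so
// the logic can be exercised without touching the process environment.
func backupsEnabled(lookup func(string) (string, bool)) bool {
	v, ok := lookup(EnvBackupsEnabled)
	if !ok {
		return false
	}
	enabled, err := strconv.ParseBool(v)
	return err == nil && enabled
}

func main() {
	ev := backupsEnabledEnv(true)
	if err := os.Setenv(ev.Name, ev.Value); err != nil {
		fmt.Println("setenv:", err)
		return
	}
	fmt.Println(backupsEnabled(os.LookupEnv)) // true
}
```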

Comment thread pkg/mysql/mysql.go Outdated
Comment on lines +682 to +685
if cr.CompareVersion("1.1.0") >= 0 && cr.Spec.Backup != nil {
    env = append(env, corev1.EnvVar{
        Name:  "BACKUPS_ENABLED",
        Value: strconv.FormatBool(cr.Spec.Backup.Enabled),
Copilot AI Apr 14, 2026

This new behavior is not covered by the existing pkg/mysql unit tests: please add a test asserting mysqldContainer (or the rendered StatefulSet container env) includes BACKUPS_ENABLED=true/false when backups are enabled/disabled (and when spec.backup is nil), so regressions in readiness behavior gating are caught.

Suggested change
- if cr.CompareVersion("1.1.0") >= 0 && cr.Spec.Backup != nil {
-     env = append(env, corev1.EnvVar{
-         Name:  "BACKUPS_ENABLED",
-         Value: strconv.FormatBool(cr.Spec.Backup.Enabled),
+ if cr.CompareVersion("1.1.0") >= 0 {
+     backupsEnabled := false
+     if cr.Spec.Backup != nil {
+         backupsEnabled = cr.Spec.Backup.Enabled
+     }
+     env = append(env, corev1.EnvVar{
+         Name:  "BACKUPS_ENABLED",
+         Value: strconv.FormatBool(backupsEnabled),
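
The three cases the requested test should cover (backup section missing, backups disabled, backups enabled) can be sketched as a table-driven check. `BackupSpec` and `backupsEnabledValue` are hypothetical, trimmed-down stand-ins; a real test would use the `testing` package and assert against the rendered container env instead.

```go
package main

import (
	"fmt"
	"strconv"
)

// BackupSpec is a hypothetical, trimmed-down stand-in for the CR's backup section.
type BackupSpec struct {
	Enabled bool
}

// backupsEnabledValue mirrors the nil-safe logic of the suggested change.
func backupsEnabledValue(spec *BackupSpec) string {
	enabled := false
	if spec != nil {
		enabled = spec.Enabled
	}
	return strconv.FormatBool(enabled)
}

func main() {
	cases := []struct {
		name string
		spec *BackupSpec
		want string
	}{
		{"backup section missing", nil, "false"},
		{"backups disabled", &BackupSpec{Enabled: false}, "false"},
		{"backups enabled", &BackupSpec{Enabled: true}, "true"},
	}
	for _, tc := range cases {
		got := backupsEnabledValue(tc.spec)
		fmt.Printf("%s: got=%s want=%s ok=%t\n", tc.name, got, tc.want, got == tc.want)
	}
}
```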

Comment thread cmd/healthcheck/main.go Outdated
Comment on lines +130 to +135
backupsEnabled := os.Getenv("BACKUPS_ENABLED")
backupRunning := false

if backupsEnabled == "true" {
    var err error
    backupRunning, err = isBackupRunning(ctx)
Copilot AI Apr 14, 2026

BACKUPS_ENABLED is checked via a raw string here and is also set as a raw string in pkg/mysql/mysql.go. Please use a shared constant for the env var name (e.g. naming.Env...) to avoid subtle typos and keep producer/consumer in sync.

Comment thread cmd/healthcheck/main.go Outdated
Comment on lines +130 to +139
backupsEnabled := os.Getenv("BACKUPS_ENABLED")
backupRunning := false

if backupsEnabled == "true" {
    var err error
    backupRunning, err = isBackupRunning(ctx)
    if err != nil {
        return errors.Wrap(err, "check backup running")
    }
}
Copilot AI Apr 14, 2026

isBackupRunning() is called whenever backups are enabled, even when replication is active (i.e., the result won’t affect the outcome). This unnecessarily couples MySQL readiness to the sidecar HTTP API and can make pods go NotReady if the sidecar is temporarily unavailable even though replication/read-only checks pass. Consider only calling isBackupRunning() in the ReplicationStatusStopped path (where it’s actually needed) and leaving other statuses unaffected.

Comment thread cmd/healthcheck/main.go

switch {
case replStatus == mysqldb.ReplicationStatusActive && !readOnly:
    return errors.New("replica is not read only")
Copilot AI Apr 14, 2026

With this change, a replica can become Ready while replication is stopped (during a backup) even if readOnly is false. Previously, the pod would be NotReady whenever replication was stopped, which implicitly prevented routing traffic to a potentially writable, non-replicating instance. Consider still enforcing readOnly==true when replStatus==ReplicationStatusStopped (even if backupRunning is true), or explicitly document/handle the writable case to avoid serving writes from an out-of-sync replica.

Suggested change
-     return errors.New("replica is not read only")
+     return errors.New("replica is not read only")
+ case replStatus == mysqldb.ReplicationStatusStopped && !readOnly:
+     return errors.New("replica is not read only while replication is stopped")
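
The resulting decision table can be sketched as a small self-contained function. The status type and the final "replication is stopped" case are assumptions made for illustration; only the two read-only cases come from the snippet above.

```go
package main

import (
	"errors"
	"fmt"
)

type replStatus string

const (
	statusActive  replStatus = "active"
	statusStopped replStatus = "stopped"
)

// checkReplica enforces read-only mode in both replication states and
// tolerates stopped replication only while a backup is running.
func checkReplica(status replStatus, readOnly, backupRunning bool) error {
	switch {
	case status == statusActive && !readOnly:
		return errors.New("replica is not read only")
	case status == statusStopped && !readOnly:
		return errors.New("replica is not read only while replication is stopped")
	case status == statusStopped && !backupRunning:
		return errors.New("replication is stopped")
	}
	return nil
}

func main() {
	fmt.Println(checkReplica(statusStopped, true, true))  // <nil>
	fmt.Println(checkReplica(statusStopped, false, true)) // replica is not read only while replication is stopped
	fmt.Println(checkReplica(statusStopped, true, false)) // replication is stopped
}
```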

@hors hors requested review from gkech, hors and pooknull April 14, 2026 15:21
gkech previously approved these changes Apr 14, 2026
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Copilot AI review requested due to automatic review settings April 15, 2026 08:43
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.


Comment thread pkg/mysql/mysql.go
Comment on lines +682 to +691
if cr.CompareVersion("1.1.0") >= 0 {
    backupsEnabled := false
    if cr.Spec.Backup != nil {
        backupsEnabled = cr.Spec.Backup.Enabled
    }
    env = append(env, corev1.EnvVar{
        Name:  naming.EnvBackupsEnabled,
        Value: strconv.FormatBool(backupsEnabled),
    })
}
Copilot AI Apr 15, 2026

BACKUPS_ENABLED is only injected into the mysqld container when cr.CompareVersion("1.1.0") >= 0. However, cmd/healthcheck relies on this env var to decide whether to query the xtrabackup sidecar; if an upgraded cluster is still pinned to an older .spec.crVersion, the env var won’t be present and readiness will continue to fail during backups (the original issue). Consider setting BACKUPS_ENABLED unconditionally (or at least whenever .spec.backup is configured), since it’s consumed only by the healthcheck and is safe to expose across CR versions.

Suggested change
- if cr.CompareVersion("1.1.0") >= 0 {
-     backupsEnabled := false
-     if cr.Spec.Backup != nil {
-         backupsEnabled = cr.Spec.Backup.Enabled
-     }
-     env = append(env, corev1.EnvVar{
-         Name:  naming.EnvBackupsEnabled,
-         Value: strconv.FormatBool(backupsEnabled),
-     })
- }
+ backupsEnabled := false
+ if cr.Spec.Backup != nil {
+     backupsEnabled = cr.Spec.Backup.Enabled
+ }
+ env = append(env, corev1.EnvVar{
+     Name:  naming.EnvBackupsEnabled,
+     Value: strconv.FormatBool(backupsEnabled),
+ })

Member Author

"Consider setting BACKUPS_ENABLED unconditionally (or at least whenever .spec.backup is configured)"

It will lead to a rolling restart.

@JNKPercona
Collaborator

| Test Name | Result | Time |
|---|---|---|
| async-ignore-annotations-8-4 | passed | 00:06:23 |
| async-global-metadata-8-4 | passed | 00:15:48 |
| async-upgrade-8-0 | passed | 00:12:52 |
| async-upgrade-8-4 | passed | 00:13:01 |
| auto-config-8-4 | passed | 00:24:55 |
| config-8-4 | passed | 00:20:38 |
| config-router-8-0 | passed | 00:07:22 |
| config-router-8-4 | passed | 00:07:23 |
| demand-backup-8-0 | passed | 00:18:37 |
| demand-backup-8-4 | passed | 00:19:48 |
| demand-backup-cloud-8-4 | passed | 00:22:26 |
| demand-backup-retry-8-4 | passed | 00:14:48 |
| demand-backup-incremental-8-0 | passed | 00:40:37 |
| demand-backup-incremental-8-4 | passed | 00:40:32 |
| async-data-at-rest-encryption-8-0 | passed | 00:13:40 |
| async-data-at-rest-encryption-8-4 | passed | 00:13:51 |
| gr-global-metadata-8-4 | passed | 00:13:30 |
| gr-data-at-rest-encryption-8-0 | passed | 00:15:19 |
| gr-data-at-rest-encryption-8-4 | passed | 00:14:18 |
| gr-demand-backup-8-4 | passed | 00:12:28 |
| gr-demand-backup-cloud-8-4 | passed | 00:21:23 |
| gr-demand-backup-haproxy-8-4 | passed | 00:09:49 |
| gr-demand-backup-incremental-8-0 | passed | 00:40:28 |
| gr-demand-backup-incremental-8-4 | passed | 00:40:26 |
| gr-finalizer-8-4 | passed | 00:05:20 |
| gr-haproxy-8-0 | passed | 00:04:44 |
| gr-haproxy-8-4 | passed | 00:04:03 |
| gr-ignore-annotations-8-4 | passed | 00:04:43 |
| gr-init-deploy-8-0 | passed | 00:10:02 |
| gr-init-deploy-8-4 | passed | 00:09:34 |
| gr-one-pod-8-4 | passed | 00:07:02 |
| gr-recreate-8-4 | passed | 00:16:31 |
| gr-scaling-8-4 | passed | 00:07:28 |
| gr-scheduled-backup-8-4 | passed | 00:15:21 |
| gr-scheduled-backup-incremental-8-4 | passed | 00:22:10 |
| gr-security-context-8-4 | passed | 00:09:50 |
| gr-self-healing-8-4 | passed | 00:22:44 |
| gr-tls-cert-manager-8-4 | passed | 00:08:51 |
| gr-users-8-4 | passed | 00:05:18 |
| gr-upgrade-8-0 | passed | 00:09:12 |
| gr-upgrade-8-4 | passed | 00:09:29 |
| haproxy-8-0 | passed | 00:08:08 |
| haproxy-8-4 | passed | 00:08:07 |
| init-deploy-8-0 | passed | 00:06:46 |
| init-deploy-8-4 | passed | 00:05:38 |
| limits-8-4 | passed | 00:06:34 |
| monitoring-8-4 | passed | 00:14:00 |
| one-pod-8-0 | passed | 00:05:58 |
| one-pod-8-4 | passed | 00:05:31 |
| operator-self-healing-8-4 | passed | 00:11:24 |
| pvc-resize-8-4 | passed | 00:05:45 |
| recreate-8-4 | passed | 00:12:34 |
| scaling-8-4 | passed | 00:11:17 |
| scheduled-backup-8-0 | passed | 00:16:12 |
| scheduled-backup-8-4 | passed | 00:16:01 |
| scheduled-backup-incremental-8-0 | passed | 00:22:10 |
| scheduled-backup-incremental-8-4 | passed | 00:22:04 |
| service-per-pod-8-4 | passed | 00:06:43 |
| sidecars-8-4 | passed | 00:04:36 |
| smart-update-8-4 | passed | 00:09:26 |
| storage-8-4 | passed | 00:04:08 |
| telemetry-8-4 | passed | 00:06:05 |
| tls-cert-manager-8-4 | passed | 00:10:26 |
| users-8-0 | passed | 00:08:20 |
| users-8-4 | passed | 00:07:40 |
| version-service-8-4 | passed | 00:20:29 |

| Summary | Value |
|---|---|
| Tests Run | 66/66 |
| Job Duration | 01:59:55 |
| Total Test Time | 14:49:15 |

commit: 7efcb1c
image: perconalab/percona-server-mysql-operator:PR-1292-7efcb1c5

@hors hors merged commit e0d47d8 into main Apr 15, 2026
19 checks passed
@hors hors deleted the K8SPS-69-fix-backup-readiness branch April 15, 2026 10:57
