K8SPS-69 | async cluster readiness check should account for running backup#1292
Conversation
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Pull request overview
Adjusts the async-cluster readiness probe to avoid flagging the backup source pod as unhealthy when replication is intentionally stopped during a running backup by consulting the xtrabackup sidecar HTTP API.
Changes:
- Add xtrabackup sidecar client usage to detect whether a backup is currently running.
- Update async readiness logic to ignore `ReplicationStatusStopped` when a backup is running.
- Introduce `isBackupRunning(ctx)` helper to query `/backup/` on the sidecar.
```go
backupRunning, err := isBackupRunning(ctx)
if err != nil {
	return errors.Wrap(err, "check backup running")
}
```
isBackupRunning is called unconditionally before evaluating replStatus, which makes the async readiness check depend on the xtrabackup sidecar HTTP endpoint even when replication is healthy. If the sidecar is temporarily unavailable (startup, crash, port blocked), readiness will fail even though MySQL replication/read-only checks would otherwise pass. Consider only querying the sidecar when replStatus == ReplicationStatusStopped (and/or only treating sidecar errors as fatal in that branch) so the new behavior is limited to the stopped-replication case you’re trying to special-case for backups.
```go
	return true, nil
}

func isBackupRunning(ctx context.Context) (bool, error) {
```
@mayankshah1607 @eleo007, in this case, we need to have a separate task to improve it in one of the next releases. The idea is to mark this pod as unready to exclude it from read traffic while keeping the cluster in a ready state.
haproxy already excludes it if replication is stopped
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
```go
backupRunning, err := isBackupRunning(ctx)
if err != nil {
	return errors.Wrap(err, "check backup running")
}
```
isBackupRunning() errors currently make the whole async readiness check fail (return errors.Wrap(err, "check backup running")). That means transient sidecar issues (e.g., sidecar not yet listening, connection refused, localhost networking hiccup) can mark the MySQL container unready even when replication/read-only checks pass. Consider treating sidecar connectivity/errors as "no running backup" (return false, nil for dial/timeouts) and only performing the sidecar call when replStatus == ReplicationStatusStopped to avoid unnecessary probe failures and extra HTTP calls.
```go
func isBackupRunning(ctx context.Context) (bool, error) {
	sc := xtrabackup.NewSidecarClient("localhost")
	bcp, err := sc.GetRunningBackupConfig(ctx)
	if err != nil {
		return false, errors.Wrap(err, "get running backup config")
	}
```
isBackupRunning() currently wraps and returns any error from GetRunningBackupConfig(). Since this function is used by the readiness probe, consider explicitly handling common "sidecar not reachable" cases (connection refused / context deadline exceeded) by returning (false, nil) so readiness is not gated on the sidecar HTTP server being available.
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
```go
if cr.CompareVersion("1.1.0") >= 0 && cr.Spec.Backup != nil {
	env = append(env, corev1.EnvVar{
		Name:  "BACKUPS_ENABLED",
		Value: strconv.FormatBool(cr.Spec.Backup.Enabled),
	})
}
```
BACKUPS_ENABLED is introduced as a raw string here and is also read as a raw string in cmd/healthcheck/main.go. To avoid drift/typos, define a shared constant (e.g. in pkg/naming/env.go) and reference it from both places.
```go
if cr.CompareVersion("1.1.0") >= 0 && cr.Spec.Backup != nil {
	env = append(env, corev1.EnvVar{
		Name:  "BACKUPS_ENABLED",
		Value: strconv.FormatBool(cr.Spec.Backup.Enabled),
```
This new behavior is not covered by the existing pkg/mysql unit tests: please add a test asserting mysqldContainer (or the rendered StatefulSet container env) includes BACKUPS_ENABLED=true/false when backups are enabled/disabled (and when spec.backup is nil), so regressions in readiness behavior gating are caught.
Suggested change:
```diff
-if cr.CompareVersion("1.1.0") >= 0 && cr.Spec.Backup != nil {
-	env = append(env, corev1.EnvVar{
-		Name:  "BACKUPS_ENABLED",
-		Value: strconv.FormatBool(cr.Spec.Backup.Enabled),
+if cr.CompareVersion("1.1.0") >= 0 {
+	backupsEnabled := false
+	if cr.Spec.Backup != nil {
+		backupsEnabled = cr.Spec.Backup.Enabled
+	}
+	env = append(env, corev1.EnvVar{
+		Name:  "BACKUPS_ENABLED",
+		Value: strconv.FormatBool(backupsEnabled),
```
```go
backupsEnabled := os.Getenv("BACKUPS_ENABLED")
backupRunning := false

if backupsEnabled == "true" {
	var err error
	backupRunning, err = isBackupRunning(ctx)
```
BACKUPS_ENABLED is checked via a raw string here and is also set as a raw string in pkg/mysql/mysql.go. Please use a shared constant for the env var name (e.g. naming.Env...) to avoid subtle typos and keep producer/consumer in sync.
```go
backupsEnabled := os.Getenv("BACKUPS_ENABLED")
backupRunning := false

if backupsEnabled == "true" {
	var err error
	backupRunning, err = isBackupRunning(ctx)
	if err != nil {
		return errors.Wrap(err, "check backup running")
	}
}
```
isBackupRunning() is called whenever backups are enabled, even when replication is active (i.e., the result won’t affect the outcome). This unnecessarily couples MySQL readiness to the sidecar HTTP API and can make pods go NotReady if the sidecar is temporarily unavailable even though replication/read-only checks pass. Consider only calling isBackupRunning() in the ReplicationStatusStopped path (where it’s actually needed) and leaving other statuses unaffected.
```go
switch {
case replStatus == mysqldb.ReplicationStatusActive && !readOnly:
	return errors.New("replica is not read only")
```
With this change, a replica can become Ready while replication is stopped (during a backup) even if readOnly is false. Previously, the pod would be NotReady whenever replication was stopped, which implicitly prevented routing traffic to a potentially writable, non-replicating instance. Consider still enforcing readOnly==true when replStatus==ReplicationStatusStopped (even if backupRunning is true), or explicitly document/handle the writable case to avoid serving writes from an out-of-sync replica.
Suggested change:
```diff
 	return errors.New("replica is not read only")
+case replStatus == mysqldb.ReplicationStatusStopped && !readOnly:
+	return errors.New("replica is not read only while replication is stopped")
```
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
```go
if cr.CompareVersion("1.1.0") >= 0 {
	backupsEnabled := false
	if cr.Spec.Backup != nil {
		backupsEnabled = cr.Spec.Backup.Enabled
	}
	env = append(env, corev1.EnvVar{
		Name:  naming.EnvBackupsEnabled,
		Value: strconv.FormatBool(backupsEnabled),
	})
}
```
BACKUPS_ENABLED is only injected into the mysqld container when cr.CompareVersion("1.1.0") >= 0. However, cmd/healthcheck relies on this env var to decide whether to query the xtrabackup sidecar; if an upgraded cluster is still pinned to an older .spec.crVersion, the env var won’t be present and readiness will continue to fail during backups (the original issue). Consider setting BACKUPS_ENABLED unconditionally (or at least whenever .spec.backup is configured), since it’s consumed only by the healthcheck and is safe to expose across CR versions.
Suggested change:
```diff
-if cr.CompareVersion("1.1.0") >= 0 {
-	backupsEnabled := false
-	if cr.Spec.Backup != nil {
-		backupsEnabled = cr.Spec.Backup.Enabled
-	}
-	env = append(env, corev1.EnvVar{
-		Name:  naming.EnvBackupsEnabled,
-		Value: strconv.FormatBool(backupsEnabled),
-	})
-}
+backupsEnabled := false
+if cr.Spec.Backup != nil {
+	backupsEnabled = cr.Spec.Backup.Enabled
+}
+env = append(env, corev1.EnvVar{
+	Name:  naming.EnvBackupsEnabled,
+	Value: strconv.FormatBool(backupsEnabled),
+})
```
"Consider setting BACKUPS_ENABLED unconditionally (or at least whenever .spec.backup is configured)"

That would trigger a rolling restart.
commit: 7efcb1c
CHANGE DESCRIPTION
Problem:
The current readiness check for async clusters fails when replication is stopped. However, replication is intentionally stopped during backups, so the backup source pod appears unhealthy.
Cause:
We do not take the state of running backups into account.
Solution:
Update the readiness check to ignore stopped replication if a backup is running. The xtrabackup sidecar already exposes an HTTP API to get the status of a running backup, and this is available to the readiness check command on localhost.
CHECKLIST
Jira
- Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
Tests
Config/Logging/Testability