K8SPS-686: fail backup if restore is running #1284

Open
pooknull wants to merge 10 commits into main from K8SPS-435-fix

Conversation

@pooknull
Contributor

@pooknull pooknull commented Apr 7, 2026

https://perconadev.atlassian.net/browse/K8SPS-686

DESCRIPTION

This PR prevents a backup and a restore from running at the same time for the same cluster.

The operator creates a Lease for an active restore and blocks backups while that lease is held. It also makes restores wait if a backup is still running.

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PS version?
  • Does the change support oldest and newest supported Kubernetes version?

Copilot AI review requested due to automatic review settings April 7, 2026 12:16
@pull-request-size pull-request-size bot added the size/L 100-499 lines label Apr 7, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR introduces a Kubernetes Lease-based locking mechanism to enforce mutual exclusion between restores and backups for the same PerconaServerMySQL cluster, ensuring backups fail when a restore is in progress and restores wait while backups are running.

Changes:

  • Add pkg/k8s helpers to acquire/release/query a “restore lock” using coordination.k8s.io/v1 Lease.
  • Update restore and backup controllers to enforce mutual exclusion (restore blocks on running backups; backup fails when restore lease is active).
  • Extend unit test coverage and update RBAC manifests to allow Lease access.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pkg/k8s/lease.go New Lease helper utilities for restore locking
pkg/k8s/lease_test.go Unit tests for Lease helper behavior
pkg/controller/psrestore/controller.go Acquire/Release restore lease; block restore on running backups
pkg/controller/psrestore/controller_test.go Add test case for “running backup blocks restore”
pkg/controller/psbackup/controller.go Fail backup creation when an active restore lease exists
pkg/controller/psbackup/controller_test.go Add test case for “active restore lease blocks backup”
deploy/rbac.yaml Grant operator Role permissions for Leases
deploy/cw-rbac.yaml Grant CW Role permissions for Leases
deploy/cw-bundle.yaml Bundle update for Lease RBAC
deploy/bundle.yaml Bundle update for Lease RBAC
config/rbac/role.yaml Controller role updated for Lease permissions
config/rbac/cluster/role.yaml Cluster role updated for Lease permissions


pkg/k8s/lease.go Outdated
Comment on lines +52 to +56
```go
	return errors.Wrap(err, "create lease")
}

result = lease.DeepCopy()
return nil
```
Copilot AI Apr 7, 2026
AcquireRestoreLease can race when multiple restores attempt to create the Lease at the same time: if cl.Create returns an AlreadyExists error, RetryOnConflict will not retry (it only retries conflicts), so the reconciler will fail even though it could recover by re-getting/updating the existing Lease. Consider handling apierrors.IsAlreadyExists(err) by re-fetching and continuing, or switching to controllerutil.CreateOrUpdate / a Patch-based create-or-update flow.

Suggested change

```diff
-		return errors.Wrap(err, "create lease")
-	}
-	result = lease.DeepCopy()
-	return nil
+		if !k8serrors.IsAlreadyExists(err) {
+			return errors.Wrap(err, "create lease")
+		}
+		lease = &coordv1.Lease{}
+		if err := cl.Get(ctx, nn, lease); err != nil {
+			return errors.Wrap(err, "get existing lease after create")
+		}
+	} else {
+		result = lease.DeepCopy()
+		return nil
+	}
```

Comment on lines +148 to +157
```go
lease, err := k8s.AcquireRestoreLease(ctx, r.Client, cr)
if err != nil {
	return ctrl.Result{}, errors.Wrap(err, "acquire restore lease")
}
if lease.Spec.HolderIdentity != nil && *lease.Spec.HolderIdentity != cr.Name {
	status.State = apiv1.RestoreNew
	status.StateDesc = fmt.Sprintf("PerconaServerMySQLRestore %s is already running", *lease.Spec.HolderIdentity)
	log.Info("PerconaServerMySQLRestore is already running", "restore", *lease.Spec.HolderIdentity)
	return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
}
```
Copilot AI Apr 7, 2026

The restore lock relies on LeaseDurationSeconds (30s) and only updates RenewTime when the restore reconciles. During long-running restores, there may be long stretches without reconcile events (Job status often changes only at start/finish), so the lease can expire while the restore is still Running and backups would no longer be blocked. To make the lock effective, requeue periodically while the restore is Starting/Running to renew the lease (e.g., RequeueAfter shorter than the lease duration), use a substantially longer duration, or both.

Comment on lines +286 to +300
```go
func (r *PerconaServerMySQLRestoreReconciler) getRunningBackup(ctx context.Context, cr *apiv1.PerconaServerMySQLRestore) (*apiv1.PerconaServerMySQLBackup, error) {
	backups := new(apiv1.PerconaServerMySQLBackupList)
	if err := r.List(ctx, backups, client.InNamespace(cr.Namespace)); err != nil {
		return nil, errors.Wrap(err, "list backups")
	}

	for _, b := range backups.Items {
		if b.Spec.ClusterName != cr.Spec.ClusterName {
			continue
		}

		switch b.Status.State {
		case apiv1.BackupStarting, apiv1.BackupRunning:
			return &b, nil
		}
```
Copilot AI Apr 7, 2026

getRunningBackup currently lists all backups in the namespace and then filters in-memory. This is O(N) per reconcile and can become expensive as backup history grows. A field indexer for PerconaServerMySQLBackup spec.clusterName is already set up (see cmd/manager/main.go), so you can use client.MatchingFields{"spec.clusterName": cr.Spec.ClusterName} along with client.InNamespace(...) to query only backups for the target cluster.

pkg/k8s/lease.go Outdated
Comment on lines +44 to +49
```go
Spec: coordv1.LeaseSpec{
	HolderIdentity:       &cr.Name,
	LeaseDurationSeconds: ptr.To(restoreLeaseDurationSeconds),
	AcquireTime:          &metav1.MicroTime{Time: now},
	RenewTime:            &metav1.MicroTime{Time: now},
},
```
Copilot AI Apr 7, 2026

LeaseSpec.HolderIdentity is being set using &cr.Name. Using ptr.To(cr.Name) (or an explicit local variable) avoids taking the address of a field on another object and makes the intent clearer (and prevents accidental aliasing if the source struct is ever reused/mutated).

Comment on lines +90 to +92
```go
if err := k8s.ReleaseRestoreLease(ctx, r.Client, cr); err != nil {
	log.Error(err, "failed to release restore lease")
}
```
Contributor

Doesn't this mean we'll release the lease even when the restore doesn't finish in a single reconciliation?

Contributor Author

ReleaseRestoreLease has a status check, so it releases the lease only for finished restores:

```go
switch cr.Status.State {
case apiv1.RestoreNew, apiv1.RestoreStarting, apiv1.RestoreRunning:
	return nil
}
```

```go
}

defer func() {
	if err := k8s.ReleaseRestoreLease(ctx, r.Client, cr); err != nil {
```
Contributor

We also need a finalizer to release the lease if the restore is deleted in the running state.

Contributor Author

@pooknull pooknull Apr 8, 2026


I don’t think we need a finalizer. It could cause problems. The lease remains active for 30 seconds. If the restore is deleted while in the running state, another restore will be started after that period.

@pooknull pooknull changed the title K8SPS-435: fail backup if restore is running K8SPS-686: fail backup if restore is running Apr 8, 2026
@pooknull pooknull requested a review from egegunes April 8, 2026 08:24
Copilot AI review requested due to automatic review settings April 8, 2026 08:28
@pooknull pooknull marked this pull request as ready for review April 8, 2026 08:28
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.



@pull-request-size pull-request-size bot added size/XL 500-999 lines and removed size/L 100-499 lines labels Apr 9, 2026
```go
	return ctrl.Result{}, nil
}

lease, err := k8s.GetLease(ctx, r.Client, cr.Namespace, cr.Spec.ClusterName)
```
Member


Are the parameters correct? I think name and namespace are interchanged. Also, shouldn't we pass naming.RestoreLeaseName?

Contributor Author


```go
}

lease, err := k8s.GetLease(ctx, r.Client, cr.Namespace, cr.Spec.ClusterName)
if err != nil {
```
Member


Should we ignore the NotFound error? Under normal circumstances the restore lease won't be present, right?

Contributor Author


Comment on lines +274 to +277
```go
for _, b := range backups.Items {
	if b.Spec.ClusterName != cr.Spec.ClusterName {
		continue
	}
```
Member


We have a field indexer on clusterName.

Contributor Author


```go
	log.Info("PerconaServerMySQLRestore is already running", "restore", restore.Name)
	return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
}
backup, err := r.getRunningBackup(ctx, cr)
```
Member


I think we might still have a TOCTOU race: when you start a backup and restore at the same time, what happens if the restore lease is acquired between the BackupNew -> BackupStarting transition window?

Consider the following sequence:
T1: backup starts -> no restore lease -> proceeds
T2: restore starts, acquires lease, checks backup is still BackupNew -> proceeds
T3: we have backup and restore running at same time

Member


This might become more evident if the informer caches are not updated fast enough

Contributor Author



```go
defer func() {
	switch cr.Status.State {
	case apiv1.RestoreNew, apiv1.RestoreStarting, apiv1.RestoreRunning:
```
Member


Should we be checking status.State instead? That's the copy we modify during the course of the reconciliation; cr.Status is what was observed before reconciling.

Contributor Author


Copilot AI review requested due to automatic review settings April 10, 2026 06:17
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.



Comment on lines +178 to +186
```go
lease, err := k8s.GetLease(ctx, r.Client, cr.Namespace, cr.Spec.ClusterName)
if err != nil {
	return rr, errors.Wrap(err, "get restore lease")
}
if k8s.IsLeaseActive(lease) {
	status.State = apiv1.BackupError
	status.StateDesc = fmt.Sprintf("backup cannot run while restore %s is in progress", *lease.Spec.HolderIdentity)
	return ctrl.Result{}, nil
}
```
```go
obj: []client.Object{
	&coordv1.Lease{
		ObjectMeta: metav1.ObjectMeta{
			Name: "restore-lock-" + cr.Spec.ClusterName,
```
pkg/k8s/lease.go Outdated
"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const leaseDurationSeconds int32 = 30
Member


Do we actually need renew logic for our use case? IIUC, we just want to use the lease as a global lock: when a restore starts, acquire it; when it is done, release it. Why does it need to periodically send a heartbeat?
The stale-checking logic already covers the case where a restore completed but failed to release the lock.

Contributor Author


I agree. With these changes we don't need the renew logic: 22ab993

Copilot AI review requested due to automatic review settings April 10, 2026 12:34
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.



Comment on lines +80 to +83
```go
for _, backup := range backupList.Items {
	switch backup.Status.State {
	case apiv1.BackupStarting, apiv1.BackupRunning:
		return &backup, nil
```
Comment on lines +472 to +476
```go
if err := k8s.AcquireLease(ctx, r.Client, naming.RestoreLeaseName(cr.Spec.ClusterName), cr.Name, cr.Namespace, checkStale); err != nil {
	if errors.Is(err, k8s.ErrLeaseAlreadyHeld) || k8serrors.IsAlreadyExists(err) || k8serrors.IsConflict(err) {
		status.State = apiv1.RestoreNew
		status.StateDesc = fmt.Sprintf("PerconaServerMySQLRestore %s is already running", leaseHolder)
		log.Info("PerconaServerMySQLRestore is already running", "restore", leaseHolder)
```
@JNKPercona
Collaborator

Test Name Result Time
async-ignore-annotations-8-4 passed 00:06:13
async-global-metadata-8-4 passed 00:15:22
async-upgrade-8-0 passed 00:12:48
async-upgrade-8-4 passed 00:13:01
auto-config-8-4 passed 00:24:41
config-8-4 passed 00:21:02
config-router-8-0 passed 00:07:34
config-router-8-4 passed 00:07:29
demand-backup-minio-8-0 passed 00:20:25
demand-backup-minio-8-4 passed 00:19:59
demand-backup-cloud-8-4 passed 00:21:38
demand-backup-retry-8-4 passed 00:16:35
demand-backup-incremental-8-0 passed 00:35:28
demand-backup-incremental-8-4 passed 00:37:20
async-data-at-rest-encryption-8-0 passed 00:13:19
async-data-at-rest-encryption-8-4 passed 00:13:06
gr-global-metadata-8-4 passed 00:13:44
gr-data-at-rest-encryption-8-0 passed 00:14:43
gr-data-at-rest-encryption-8-4 passed 00:14:43
gr-demand-backup-minio-8-4 passed 00:12:36
gr-demand-backup-cloud-8-4 passed 00:21:38
gr-demand-backup-haproxy-8-4 passed 00:09:38
gr-demand-backup-incremental-8-0 passed 00:34:53
gr-demand-backup-incremental-8-4 passed 00:35:48
gr-finalizer-8-4 passed 00:05:18
gr-haproxy-8-0 passed 00:04:44
gr-haproxy-8-4 passed 00:04:05
gr-ignore-annotations-8-4 passed 00:04:48
gr-init-deploy-8-0 passed 00:09:14
gr-init-deploy-8-4 passed 00:09:10
gr-one-pod-8-4 passed 00:05:45
gr-recreate-8-4 passed 00:17:00
gr-scaling-8-4 passed 00:08:05
gr-scheduled-backup-8-4 passed 00:16:11
gr-scheduled-backup-incremental-8-4 passed 00:22:00
gr-security-context-8-4 passed 00:09:56
gr-self-healing-8-4 passed 00:23:57
gr-tls-cert-manager-8-4 passed 00:09:58
gr-users-8-4 passed 00:05:15
gr-upgrade-8-0 passed 00:09:33
gr-upgrade-8-4 passed 00:09:52
haproxy-8-0 passed 00:09:14
haproxy-8-4 passed 00:08:26
init-deploy-8-0 passed 00:05:43
init-deploy-8-4 passed 00:05:42
limits-8-4 passed 00:08:36
monitoring-8-4 passed 00:14:08
one-pod-8-0 passed 00:05:38
one-pod-8-4 passed 00:06:23
operator-self-healing-8-4 passed 00:11:30
pvc-resize-8-4 passed 00:05:30
recreate-8-4 passed 00:12:47
scaling-8-4 passed 00:10:59
scheduled-backup-8-0 passed 00:18:10
scheduled-backup-8-4 passed 00:16:21
scheduled-backup-incremental-8-0 passed 00:21:03
scheduled-backup-incremental-8-4 passed 00:21:59
service-per-pod-8-4 passed 00:07:55
sidecars-8-4 passed 00:04:41
smart-update-8-4 passed 00:09:30
storage-8-4 passed 00:04:01
telemetry-8-4 passed 00:06:15
tls-cert-manager-8-4 passed 00:11:54
users-8-0 passed 00:08:08
users-8-4 passed 00:07:34
version-service-8-4 passed 00:20:44
Summary Value
Tests Run 66/66
Job Duration 02:01:15
Total Test Time 14:41:56

commit: 355d80f
image: perconalab/percona-server-mysql-operator:PR-1284-355d80fb

