K8SPS-686: fail backup if restore is running #1284
Conversation
Pull request overview
This PR introduces a Kubernetes Lease-based locking mechanism to prevent concurrency between restores and backups for the same PerconaServerMySQL cluster, ensuring backups fail when a restore is in progress and restores wait while backups are running.
Changes:
- Add `pkg/k8s` Lease helpers to acquire/release/query a "restore lock" using a `coordination.k8s.io/v1` Lease.
- Update restore and backup controllers to enforce mutual exclusion (restore blocks on running backups; backup fails when the restore lease is active).
- Extend unit test coverage and update RBAC manifests to allow Lease access.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/k8s/lease.go | New Lease helper utilities for restore locking |
| pkg/k8s/lease_test.go | Unit tests for Lease helper behavior |
| pkg/controller/psrestore/controller.go | Acquire/Release restore lease; block restore on running backups |
| pkg/controller/psrestore/controller_test.go | Add test case for “running backup blocks restore” |
| pkg/controller/psbackup/controller.go | Fail backup creation when an active restore lease exists |
| pkg/controller/psbackup/controller_test.go | Add test case for “active restore lease blocks backup” |
| deploy/rbac.yaml | Grant operator Role permissions for Leases |
| deploy/cw-rbac.yaml | Grant CW Role permissions for Leases |
| deploy/cw-bundle.yaml | Bundle update for Lease RBAC |
| deploy/bundle.yaml | Bundle update for Lease RBAC |
| config/rbac/role.yaml | Controller role updated for Lease permissions |
| config/rbac/cluster/role.yaml | Cluster role updated for Lease permissions |
pkg/k8s/lease.go (outdated)
```go
	return errors.Wrap(err, "create lease")
}

result = lease.DeepCopy()
return nil
```
AcquireRestoreLease can race when multiple restores attempt to create the Lease at the same time: if cl.Create returns an AlreadyExists error, RetryOnConflict will not retry (it only retries conflicts), so the reconciler will fail even though it could recover by re-getting/updating the existing Lease. Consider handling apierrors.IsAlreadyExists(err) by re-fetching and continuing, or switching to controllerutil.CreateOrUpdate / a Patch-based create-or-update flow.
Suggested change:
```diff
-	return errors.Wrap(err, "create lease")
-}
-result = lease.DeepCopy()
-return nil
+	if !k8serrors.IsAlreadyExists(err) {
+		return errors.Wrap(err, "create lease")
+	}
+	lease = &coordv1.Lease{}
+	if err := cl.Get(ctx, nn, lease); err != nil {
+		return errors.Wrap(err, "get existing lease after create")
+	}
+} else {
+	result = lease.DeepCopy()
+	return nil
+}
```
```go
lease, err := k8s.AcquireRestoreLease(ctx, r.Client, cr)
if err != nil {
	return ctrl.Result{}, errors.Wrap(err, "acquire restore lease")
}
if lease.Spec.HolderIdentity != nil && *lease.Spec.HolderIdentity != cr.Name {
	status.State = apiv1.RestoreNew
	status.StateDesc = fmt.Sprintf("PerconaServerMySQLRestore %s is already running", *lease.Spec.HolderIdentity)
	log.Info("PerconaServerMySQLRestore is already running", "restore", *lease.Spec.HolderIdentity)
	return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
}
```
The restore lock relies on LeaseDurationSeconds (30s) and only updates RenewTime when the restore reconciles. During long-running restores there may be long stretches without reconcile events (Job status often changes only at start and finish), so the lease can expire while the restore is still Running, and backups would no longer be blocked. To make the lock effective, either requeue periodically while the restore is Starting/Running to renew the lease (with RequeueAfter shorter than the lease duration), or use a substantially longer duration.
```go
func (r *PerconaServerMySQLRestoreReconciler) getRunningBackup(ctx context.Context, cr *apiv1.PerconaServerMySQLRestore) (*apiv1.PerconaServerMySQLBackup, error) {
	backups := new(apiv1.PerconaServerMySQLBackupList)
	if err := r.List(ctx, backups, client.InNamespace(cr.Namespace)); err != nil {
		return nil, errors.Wrap(err, "list backups")
	}

	for _, b := range backups.Items {
		if b.Spec.ClusterName != cr.Spec.ClusterName {
			continue
		}

		switch b.Status.State {
		case apiv1.BackupStarting, apiv1.BackupRunning:
			return &b, nil
		}
```
getRunningBackup currently lists all backups in the namespace and then filters in-memory. This is O(N) per reconcile and can become expensive as backup history grows. A field indexer for PerconaServerMySQLBackup spec.clusterName is already set up (see cmd/manager/main.go), so you can use client.MatchingFields{"spec.clusterName": cr.Spec.ClusterName} along with client.InNamespace(...) to query only backups for the target cluster.
pkg/k8s/lease.go (outdated)
```go
Spec: coordv1.LeaseSpec{
	HolderIdentity:       &cr.Name,
	LeaseDurationSeconds: ptr.To(restoreLeaseDurationSeconds),
	AcquireTime:          &metav1.MicroTime{Time: now},
	RenewTime:            &metav1.MicroTime{Time: now},
},
```
LeaseSpec.HolderIdentity is being set using &cr.Name. Using ptr.To(cr.Name) (or an explicit local variable) avoids taking the address of a field on another object and makes the intent clearer (and prevents accidental aliasing if the source struct is ever reused/mutated).
```go
if err := k8s.ReleaseRestoreLease(ctx, r.Client, cr); err != nil {
	log.Error(err, "failed to release restore lease")
}
```
Doesn't this mean we'll release the lease even when the restore doesn't finish in a single reconciliation?
ReleaseRestoreLease has a status check and releases the lease only for finished restores:
percona-server-mysql-operator/pkg/k8s/lease.go
Lines 86 to 89 in 84f2a5d
```go
}

defer func() {
	if err := k8s.ReleaseRestoreLease(ctx, r.Client, cr); err != nil {
```
We also need a finalizer to release the lease if the restore is deleted while in the Running state.
I don’t think we need a finalizer. It could cause problems. The lease remains active for 30 seconds. If the restore is deleted while in the running state, another restore will be started after that period.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
```go
	return ctrl.Result{}, nil
}

lease, err := k8s.GetLease(ctx, r.Client, cr.Namespace, cr.Spec.ClusterName)
```
Are the parameters correct? I think name and namespace are swapped. Also, shouldn't we pass naming.RestoreLeaseName?
```go
}

lease, err := k8s.GetLease(ctx, r.Client, cr.Namespace, cr.Spec.ClusterName)
if err != nil {
```
Should we ignore a NotFound error? In normal cases the restore lease won't be present, right?
```go
for _, b := range backups.Items {
	if b.Spec.ClusterName != cr.Spec.ClusterName {
		continue
	}
```
We have a field indexer on clusterName.
```go
	log.Info("PerconaServerMySQLRestore is already running", "restore", restore.Name)
	return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
}
backup, err := r.getRunningBackup(ctx, cr)
```
I think we might still have a TOCTOU race: when you start a backup and restore at the same time, what happens if the restore lease is acquired between the BackupNew -> BackupStarting transition window?
Consider the following sequence:
- T1: backup starts -> no restore lease -> proceeds
- T2: restore starts, acquires lease, checks backup is still BackupNew -> proceeds
- T3: backup and restore are running at the same time
This might become more evident if the informer caches are not updated fast enough
```go
defer func() {
	switch cr.Status.State {
	case apiv1.RestoreNew, apiv1.RestoreStarting, apiv1.RestoreRunning:
```
Should we be checking status.State instead? That's the copy we modify during the course of the reconciliation; cr.Status is what was observed before reconciling.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.
```go
lease, err := k8s.GetLease(ctx, r.Client, cr.Namespace, cr.Spec.ClusterName)
if err != nil {
	return rr, errors.Wrap(err, "get restore lease")
}
if k8s.IsLeaseActive(lease) {
	status.State = apiv1.BackupError
	status.StateDesc = fmt.Sprintf("backup cannot run while restore %s is in progress", *lease.Spec.HolderIdentity)
	return ctrl.Result{}, nil
}
```
```go
obj: []client.Object{
	&coordv1.Lease{
		ObjectMeta: metav1.ObjectMeta{
			Name: "restore-lock-" + cr.Spec.ClusterName,
```
pkg/k8s/lease.go (outdated)
```go
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const leaseDurationSeconds int32 = 30
```
Do we actually need renew logic for our use case? IIUC, we just want to use the Lease as a global lock: when a restore starts, acquire it; when it is done, release it. Why does it need to periodically send a heartbeat?
The stale-checking logic already covers the case where a restore completed but failed to release the lock.
I agree. With these changes we don't need the renew logic: 22ab993
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.
```go
for _, backup := range backupList.Items {
	switch backup.Status.State {
	case apiv1.BackupStarting, apiv1.BackupRunning:
		return &backup, nil
```
```go
if err := k8s.AcquireLease(ctx, r.Client, naming.RestoreLeaseName(cr.Spec.ClusterName), cr.Name, cr.Namespace, checkStale); err != nil {
	if errors.Is(err, k8s.ErrLeaseAlreadyHeld) || k8serrors.IsAlreadyExists(err) || k8serrors.IsConflict(err) {
		status.State = apiv1.RestoreNew
		status.StateDesc = fmt.Sprintf("PerconaServerMySQLRestore %s is already running", leaseHolder)
		log.Info("PerconaServerMySQLRestore is already running", "restore", leaseHolder)
```
commit: 355d80f
https://perconadev.atlassian.net/browse/K8SPS-686
DESCRIPTION
This PR prevents backup and restore from running at the same time for the same cluster.
The operator creates a Lease for an active restore and blocks backups while that lease is held. It also makes restores wait while a backup is still running.
CHECKLIST
Jira
- Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
Tests
Config/Logging/Testability