
K8SPS-642: Point-in-time recovery #1252

Open
egegunes wants to merge 98 commits into main from K8SPS-642

Conversation

Contributor

@egegunes egegunes commented Mar 22, 2026

CHANGE DESCRIPTION

Problem:

Jira: https://perconadev.atlassian.net/browse/K8SPS-642

Based on this proposal: #1251

Cause:
Short explanation of the root cause of the issue if applicable.

Solution:
Short explanation of the solution we are providing with this PR.

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PS version?
  • Does the change support oldest and newest supported Kubernetes version?

@egegunes egegunes added this to the v1.1.0 milestone Mar 22, 2026
Copilot AI review requested due to automatic review settings March 22, 2026 15:37
@pull-request-size pull-request-size bot added the size/XXL 1000+ lines label Mar 22, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR adds Point-in-Time Recovery (PITR) support to the Percona Server for MySQL Operator by introducing a PITR restore workflow that (1) searches binlogs from the Binlog Server for a requested target (GTID or timestamp) and (2) runs a dedicated PITR restore Job after the base xtrabackup restore completes.

Changes:

  • Introduces PITR restore orchestration in the restore controller (binlog search → ConfigMap → PITR Job).
  • Adds a PITR restore Job builder (pkg/pitr) plus a new pitr CLI used inside the PITR Job.
  • Extends Binlog Server configuration generation (TLS/replication/storage options) and adds binlog search exec helpers.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 9 comments.

File Description
pkg/pitr/pitr.go Builds PITR restore Job + env/volume wiring for PITR execution.
pkg/controller/psrestore/controller.go Adds PITR config creation, binlog search, and PITR Job lifecycle after base restore.
pkg/controller/ps/controller.go Updates Binlog Server reconcile: endpoint parsing, S3 URI formatting, TLS/replication/storage config, cleanup.
pkg/binlogserver/search.go Adds exec-based binlog search helpers returning structured entries.
pkg/binlogserver/config.go Extends Binlog Server config schema (SSL/TLS, replication mode/rewrite, storage options).
pkg/binlogserver/binlog_server.go Adds buffer volume/mount, uses shared binary const, changes replicas behavior.
api/v1/perconaservermysqlrestore_types.go Adds .spec.pitr to Restore CRD API types.
api/v1/zz_generated.deepcopy.go Deepcopy support for new PITR restore spec.
deploy/crd.yaml / deploy/bundle.yaml / deploy/cw-bundle.yaml / config/crd/bases/... CRD schema updates for .spec.pitr.
deploy/cr.yaml Example CR updates to include PiTR binlogServer configuration.
cmd/pitr/main.go New PITR helper CLI (setup downloads relay logs; apply runs replication until target).
cmd/internal/db/db.go Adds replication/PITR helper SQL methods for the PITR CLI.
build/run-pitr-restore.sh New script executed by PITR Job container to run setup/apply around a local mysqld.
build/ps-init-entrypoint.sh Installs pitr binary and PITR restore script into the shared /opt/percona volume.
build/Dockerfile Builds and packages the new pitr binary into the operator image.
api/v1/perconaservermysql_types.go Disables PITR when cluster is paused (defaulting behavior).
Comments suppressed due to low confidence (1)

pkg/binlogserver/binlog_server.go:84

  • StatefulSet replicas are hard-coded to 1, ignoring the user-provided spec.Size from BinlogServerSpec (PodSpec inline). This is a behavior change from using spec.Size and makes the size field effectively meaningless for this component. Either honor spec.Size (including allowing 0 to disable) or remove/ignore the field consistently at the API level.
func StatefulSet(cr *apiv1.PerconaServerMySQL, initImage, configHash string) *appsv1.StatefulSet {
	spec := cr.Spec.Backup.PiTR.BinlogServer

	labels := MatchLabels(cr)

	annotations := make(map[string]string)
	if configHash != "" {
		annotations[string(naming.AnnotationConfigHash)] = configHash
	}

	return &appsv1.StatefulSet{
		TypeMeta: metav1.TypeMeta{
			APIVersion: "apps/v1",
			Kind:       "StatefulSet",
		},
		ObjectMeta: metav1.ObjectMeta{
			Name:        Name(cr),
			Namespace:   cr.Namespace,
			Labels:      labels,
			Annotations: cr.GlobalAnnotations(),
		},
		Spec: appsv1.StatefulSetSpec{
			Replicas: ptr.To(int32(1)),
			Selector: &metav1.LabelSelector{
				MatchLabels: labels,
			},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{
					Labels:      labels,
					Annotations: util.SSMapMerge(cr.GlobalAnnotations(), annotations),
				},
				Spec: spec.Core(
					labels,
					volumes(cr),


Comment on lines 1328 to 1330
s3Uri := fmt.Sprintf("%s://%s:%s@%s/%s", protocol, accessKey, secretKey, host, s3.Bucket)
if len(s3.Prefix) > 0 {
	s3Uri += fmt.Sprintf("/%s", s3.Prefix)

Copilot AI Mar 22, 2026


reconcileBinlogServer builds s3Uri using s3.Bucket and s3.Prefix directly. BackupStorageS3Spec.Bucket can already include a prefix (BucketWithPrefix), so this can produce an invalid URI like "...///". Use BucketAndPrefix() and join the resulting prefix once to avoid double-prefixing and to keep bucket vs key separation correct.

Suggested change

-s3Uri := fmt.Sprintf("%s://%s:%s@%s/%s", protocol, accessKey, secretKey, host, s3.Bucket)
-if len(s3.Prefix) > 0 {
-	s3Uri += fmt.Sprintf("/%s", s3.Prefix)
+bucket, prefix := s3.BucketAndPrefix()
+s3Uri := fmt.Sprintf("%s://%s:%s@%s/%s", protocol, accessKey, secretKey, host, bucket)
+if len(prefix) > 0 {
+	s3Uri += fmt.Sprintf("/%s", prefix)

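The double-prefix hazard is easiest to see in isolation. In this sketch, `bucketAndPrefix` is a hypothetical stand-in for the operator's `BucketAndPrefix()` helper, assuming it splits a `bucket/with/prefix` value on the first slash; the URI builder then joins each non-empty segment exactly once, avoiding the "...///" shape the comment warns about.

```go
package main

import (
	"fmt"
	"strings"
)

// bucketAndPrefix is a hypothetical stand-in for BucketAndPrefix(),
// assuming it splits "bucket/with/prefix" on the first slash.
func bucketAndPrefix(bucket string) (string, string) {
	b, p, _ := strings.Cut(bucket, "/")
	return b, p
}

// buildS3URI joins host, bucket, and optional prefixes exactly once,
// so a Bucket field that already embeds a prefix is not double-joined.
func buildS3URI(host, rawBucket, extraPrefix string) string {
	bucket, prefix := bucketAndPrefix(rawBucket)
	uri := fmt.Sprintf("s3://%s/%s", host, bucket)
	for _, p := range []string{prefix, extraPrefix} {
		if p != "" {
			uri += "/" + p
		}
	}
	return uri
}

func main() {
	// Bucket field that already includes a prefix, no separate Prefix set:
	fmt.Println(buildS3URI("s3.amazonaws.com", "my-bucket/backups", ""))
}
```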
Comment on lines +212 to +245
if binlogServer.Storage.S3 != nil {
	s3 := binlogServer.Storage.S3
	bucket, _ := s3.BucketAndPrefix()
	envs = append(envs,
		corev1.EnvVar{
			Name:  "STORAGE_TYPE",
			Value: "s3",
		},
		corev1.EnvVar{
			Name: "AWS_ACCESS_KEY_ID",
			ValueFrom: &corev1.EnvVarSource{
				SecretKeyRef: k8s.SecretKeySelector(s3.CredentialsSecret, secret.CredentialsAWSAccessKey),
			},
		},
		corev1.EnvVar{
			Name: "AWS_SECRET_ACCESS_KEY",
			ValueFrom: &corev1.EnvVarSource{
				SecretKeyRef: k8s.SecretKeySelector(s3.CredentialsSecret, secret.CredentialsAWSSecretKey),
			},
		},
		corev1.EnvVar{
			Name:  "AWS_DEFAULT_REGION",
			Value: s3.Region,
		},
		corev1.EnvVar{
			Name:  "AWS_ENDPOINT",
			Value: s3.EndpointURL,
		},
		corev1.EnvVar{
			Name:  "S3_BUCKET",
			Value: bucket,
		},
	)
}

Copilot AI Mar 22, 2026


restoreContainer() doesn’t propagate storage.VerifyTLS into the PITR job environment, but cmd/pitr defaults to VERIFY_TLS=true. This prevents disabling TLS verification for S3-compatible endpoints with custom/self-signed certs (a capability that existing backup/restore jobs support via VERIFY_TLS). Add VERIFY_TLS env based on storage.VerifyTLS (similar to xtrabackup restoreContainer).

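The suggested fix amounts to translating an optional `VerifyTLS *bool` (defaulting to true when unset) into a `VERIFY_TLS` env var for the PITR job. A minimal sketch, using a simplified struct in place of `corev1.EnvVar` so it stays self-contained:

```go
package main

import (
	"fmt"
	"strconv"
)

// envVar is a simplified stand-in for corev1.EnvVar.
type envVar struct {
	Name  string
	Value string
}

// verifyTLSEnv maps an optional storage.VerifyTLS (*bool) to a VERIFY_TLS
// env var, defaulting to "true" when the field is nil — matching the
// default cmd/pitr already assumes.
func verifyTLSEnv(verifyTLS *bool) envVar {
	v := true
	if verifyTLS != nil {
		v = *verifyTLS
	}
	return envVar{Name: "VERIFY_TLS", Value: strconv.FormatBool(v)}
}

func main() {
	f := false
	fmt.Println(verifyTLSEnv(nil), verifyTLSEnv(&f))
}
```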
cmd/pitr/main.go Outdated
Comment on lines +238 to +245
// e.g. "https://minio-service:9000/binlogs/binlog.000001" -> "binlogs/binlog.000001"
func objectKeyFromURI(uri string) (string, error) {
	u, err := url.Parse(uri)
	if err != nil {
		return "", fmt.Errorf("parse URL: %w", err)
	}
	// Path starts with "/", trim the leading slash to get the object key
	return strings.TrimPrefix(u.Path, "/"), nil

Copilot AI Mar 22, 2026


BinlogEntry.URI is parsed and the entire URL path (minus leading "/") is used as the S3 object key. If the URI is path-style (common for s3://host/bucket/key or https://endpoint/bucket/key), this returns "bucket/key" and will not match when the S3 client is already scoped to that bucket. Consider making objectKeyFromURI aware of the bucket (strip a leading "//" segment when present) and support both virtual-host and path-style URIs robustly.

Suggested change

-// e.g. "https://minio-service:9000/binlogs/binlog.000001" -> "binlogs/binlog.000001"
-func objectKeyFromURI(uri string) (string, error) {
-	u, err := url.Parse(uri)
-	if err != nil {
-		return "", fmt.Errorf("parse URL: %w", err)
-	}
-	// Path starts with "/", trim the leading slash to get the object key
-	return strings.TrimPrefix(u.Path, "/"), nil
+// It supports both virtual-hosted-style and path-style S3/MinIO URLs.
+// Examples:
+//	"https://minio-service:9000/binlogs/binlog.000001" -> "binlogs/binlog.000001"
+//	"https://bucket.s3.amazonaws.com/binlogs/binlog.000001" -> "binlogs/binlog.000001"
+func objectKeyFromURI(uri string) (string, error) {
+	u, err := url.Parse(uri)
+	if err != nil {
+		return "", fmt.Errorf("parse URL: %w", err)
+	}
+	// Normalize path by trimming the leading slash.
+	path := strings.TrimPrefix(u.Path, "/")
+	if path == "" {
+		return "", nil
+	}
+	// Try to infer the bucket from the hostname for virtual-hosted-style URLs,
+	// e.g. "bucket.s3.amazonaws.com" -> bucket "bucket".
+	host := u.Hostname()
+	if host != "" {
+		parts := strings.Split(host, ".")
+		// Heuristic: if the first label is not "s3" and there is more than one label,
+		// treat the first label as a bucket name.
+		if len(parts) > 1 && parts[0] != "s3" {
+			bucketFromHost := parts[0]
+			prefix := bucketFromHost + "/"
+			if strings.HasPrefix(path, prefix) {
+				// Strip a redundant leading "<bucket>/" segment when present.
+				path = strings.TrimPrefix(path, prefix)
+			}
+		}
+	}
+	return path, nil

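An alternative to the hostname heuristic discussed in this thread: since the caller's S3 client is already scoped to a bucket, pass that bucket explicitly and strip a leading `bucket/` segment when the URI is path-style. This is an illustrative sketch (`objectKey` is a hypothetical name), not the PR's code.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// objectKey extracts the S3 object key from a binlog URI. Instead of
// inferring the bucket from the hostname, it takes the bucket the client
// is scoped to and drops a redundant leading "<bucket>/" segment, which
// appears in path-style URLs ("https://endpoint/bucket/key") but not in
// virtual-hosted-style ones ("https://bucket.endpoint/key").
func objectKey(uri, bucket string) (string, error) {
	u, err := url.Parse(uri)
	if err != nil {
		return "", fmt.Errorf("parse URL: %w", err)
	}
	key := strings.TrimPrefix(u.Path, "/")
	// Path-style URL: strip the bucket segment; a no-op otherwise.
	key = strings.TrimPrefix(key, bucket+"/")
	return key, nil
}

func main() {
	k, _ := objectKey("https://minio-service:9000/binlogs/binlog.000001", "binlogs")
	fmt.Println(k)
}
```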
Copilot AI review requested due to automatic review settings March 23, 2026 16:56
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 37 out of 37 changed files in this pull request and generated 7 comments.



Comment on lines 72 to 74
 Spec: appsv1.StatefulSetSpec{
-	Replicas: &spec.Size,
+	Replicas: ptr.To(int32(1)),
 	Selector: &metav1.LabelSelector{

Copilot AI Mar 23, 2026


StatefulSet replicas are hard-coded to 1, which ignores the configured .spec.backup.pitr.binlogServer.size (PodSpec.Size). This prevents users from scaling the binlog server and is inconsistent with other components that honor PodSpec.Size; use the spec.Size value here (and ensure it’s non-zero/defaulted) rather than forcing 1.

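The fix the reviewer asks for reduces to a small defaulting rule: honor the user's `spec.Size` and fall back to 1 only when it is unset. A hedged sketch (`binlogServerReplicas` is a hypothetical helper; whether 0 should instead mean "disabled" is the open design question in the comment):

```go
package main

import "fmt"

// binlogServerReplicas honors .spec.backup.pitr.binlogServer.size,
// defaulting to 1 when the field is unset (zero). Note: if 0 is meant
// to mean "disable the component", this fallback would need to change.
func binlogServerReplicas(size int32) int32 {
	if size <= 0 {
		return 1
	}
	return size
}

func main() {
	fmt.Println(binlogServerReplicas(0), binlogServerReplicas(3))
}
```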
Copilot AI review requested due to automatic review settings April 13, 2026 15:27
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 77 out of 77 changed files in this pull request and generated 7 comments.



Contributor

gkech commented Apr 14, 2026

we have conflicts with the e2e tests


 if cr.Spec.PITR != nil {
-	if !cluster.Spec.Backup.PiTR.Enabled || cluster.Spec.Backup.PiTR.BinlogServer == nil {
+	if cluster.Spec.Backup.PiTR.BinlogServer == nil {
Contributor


Shouldn't we also check that PITR is enabled? We may have the server fully configured, but have enabled set to false 🤔
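The reviewer's point can be captured as a single predicate: a PITR restore needs the binlog server configured *and* PITR enabled. A self-contained sketch with simplified stand-ins for the CR types (field names mirror the real spec; the helper name is hypothetical):

```go
package main

import "fmt"

// pitrSpec is a simplified stand-in for the cluster's Backup.PiTR spec.
type pitrSpec struct {
	Enabled      bool
	BinlogServer *struct{}
}

// pitrRestoreAllowed encodes the check suggested in the comment: a fully
// configured binlog server with Enabled=false should still reject the
// restore.
func pitrRestoreAllowed(p pitrSpec) bool {
	return p.Enabled && p.BinlogServer != nil
}

func main() {
	configuredButDisabled := pitrSpec{Enabled: false, BinlogServer: &struct{}{}}
	fmt.Println(pitrRestoreAllowed(configuredButDisabled))
}
```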

Copilot AI review requested due to automatic review settings April 14, 2026 13:55
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 106 out of 106 changed files in this pull request and generated 2 comments.



Copilot AI review requested due to automatic review settings April 15, 2026 14:30
Comment on lines +6 to +7
local ts=$(date +%Y-%m-%dT%H:%M:%S.%N%z --utc | sed 's/+0000/Z/g')
echo "${ts} 0 [Info] [K8SPS-642] [Job] $*" >&2
Contributor


[shfmt] reported by reviewdog 🐶

Suggested change (shfmt: whitespace/indentation only; the difference is not visible in this plain-text view)

local ts=$(date +%Y-%m-%dT%H:%M:%S.%N%z --utc | sed 's/+0000/Z/g')
echo "${ts} 0 [Info] [K8SPS-642] [Job] $*" >&2

Comment on lines +13 to +16
--admin-address=127.0.0.1 \
--user=mysql \
--gtid-mode=ON \
--enforce-gtid-consistency=ON &
Contributor


[shfmt] reported by reviewdog 🐶

Suggested change (shfmt: whitespace/indentation only; the difference is not visible in this plain-text view)

--admin-address=127.0.0.1 \
--user=mysql \
--gtid-mode=ON \
--enforce-gtid-consistency=ON &


log "waiting for mysqld to be ready"
until mysqladmin -u operator -p"$(</etc/mysql/mysql-users-secret/operator)" ping --silent 2>/dev/null; do
sleep 1;
Contributor


[shfmt] reported by reviewdog 🐶

Suggested change

-sleep 1;
+sleep 1

Comment on lines +25 to +31
SLEEP_FOREVER_FILE=/var/lib/mysql/sleep-forever
log "sleeping forever... remove ${SLEEP_FOREVER_FILE} to terminate."
touch ${SLEEP_FOREVER_FILE}
while [[ -f ${SLEEP_FOREVER_FILE} ]]; do
sleep 10
done
exit 0
Contributor


[shfmt] reported by reviewdog 🐶

Suggested change (shfmt: whitespace/indentation only; the difference is not visible in this plain-text view)

SLEEP_FOREVER_FILE=/var/lib/mysql/sleep-forever
log "sleeping forever... remove ${SLEEP_FOREVER_FILE} to terminate."
touch ${SLEEP_FOREVER_FILE}
while [[ -f ${SLEEP_FOREVER_FILE} ]]; do
	sleep 10
done
exit 0

Comment on lines +153 to +157
entries: defaultEntries,
pitrType: "gtid",
pitrGTID: "uuid:1",
db: &fakeDB{getGTIDExecutedResult: "uuid:1-5"},
applyErr: errors.New("fetch binlog binlogs/binlog.000001: download failed"),
Contributor


[gofmt] reported by reviewdog 🐶

Suggested change (gofmt: align struct field values)

-entries: defaultEntries,
-pitrType: "gtid",
-pitrGTID: "uuid:1",
-db: &fakeDB{getGTIDExecutedResult: "uuid:1-5"},
-applyErr: errors.New("fetch binlog binlogs/binlog.000001: download failed"),
+entries:  defaultEntries,
+pitrType: "gtid",
+pitrGTID: "uuid:1",
+db:       &fakeDB{getGTIDExecutedResult: "uuid:1-5"},
+applyErr: errors.New("fetch binlog binlogs/binlog.000001: download failed"),

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 106 out of 106 changed files in this pull request and generated 7 comments.



Comment on lines +88 to +91
operatorPass, err := getSecret(apiv1.UserOperator)
if err != nil {
	return fmt.Errorf("get operator password: %w", err)
}
Member


Why not just mount it as an env variable (from secret) in the restore job?
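What the reviewer suggests would look roughly like the sketch below: source the password into the restore Job container via an env var with a secret key reference, instead of reading it at runtime. The struct types are simplified stand-ins for the corev1 API, and the env/secret/key names are assumptions, not taken from this PR.

```go
package main

import "fmt"

// secretKeySelector and envVar are simplified stand-ins for
// corev1.SecretKeySelector / corev1.EnvVar with EnvVarSource.
type secretKeySelector struct {
	SecretName string
	Key        string
}

type envVar struct {
	Name      string
	ValueFrom *secretKeySelector
}

// operatorPasswordEnv injects the operator user's password into the
// restore Job from the cluster users secret. Names here are assumed
// for illustration ("operator" key, hypothetical env var name).
func operatorPasswordEnv(usersSecret string) envVar {
	return envVar{
		Name:      "OPERATOR_ADMIN_PASSWORD",
		ValueFrom: &secretKeySelector{SecretName: usersSecret, Key: "operator"},
	}
}

func main() {
	fmt.Println(operatorPasswordEnv("cluster1-secrets"))
}
```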

Orchestrator  StatefulAppStatus `json:"orchestrator,omitempty"`
HAProxy       StatefulAppStatus `json:"haproxy,omitempty"`
Router        StatefulAppStatus `json:"router,omitempty"`
BinlogServer  StatefulAppStatus `json:"binlogServer,omitempty"`
Member


This is fine, but for the future, let's try to add these as conditions.

Comment on lines +1044 to +1071
if bls.SSLMode == "" {
bls.SSLMode = "verify_identity"
}
if bls.VerifyChecksum == nil {
t := true
bls.VerifyChecksum = &t
}
if bls.RewriteFileSize == "" {
bls.RewriteFileSize = "128M"
}
if bls.CheckpointSize == "" {
bls.CheckpointSize = "16M"
}
if bls.CheckpointInterval == "" {
bls.CheckpointInterval = "30s"
}
if bls.ConnectTimeout == 0 {
bls.ConnectTimeout = 30
}
if bls.ReadTimeout == 0 {
bls.ReadTimeout = 30
}
if bls.WriteTimeout == 0 {
bls.WriteTimeout = 30
}
if bls.IdleTime == 0 {
bls.IdleTime = 30
}
Member


Let's set defaults on the CRD also.
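Setting CRD-level defaults is typically done with kubebuilder default markers on the API types, so the API server applies them at admission and the in-code checks become a fallback. The sketch below is an assumption of how that could look for a few of the fields defaulted in the reconciler above (field set abbreviated, values copied from that code); it is not the PR's actual type definition.

```go
package main

import "fmt"

// BinlogServerSpec sketches CRD-level defaulting via kubebuilder markers.
// Note the markers are processed by controller-gen when generating the
// CRD schema; at plain Go runtime they are inert comments, which is why
// the zero value below still has empty fields.
type BinlogServerSpec struct {
	// +kubebuilder:default="verify_identity"
	SSLMode string `json:"sslMode,omitempty"`
	// +kubebuilder:default=true
	VerifyChecksum *bool `json:"verifyChecksum,omitempty"`
	// +kubebuilder:default="128M"
	RewriteFileSize string `json:"rewriteFileSize,omitempty"`
	// +kubebuilder:default=30
	ConnectTimeout int32 `json:"connectTimeout,omitempty"`
}

func main() {
	// Zero value: shows why the in-code fallback remains useful for
	// objects that bypass the API server's defaulting.
	fmt.Printf("%+v\n", BinlogServerSpec{})
}
```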

Labels

size/XXL 1000+ lines

6 participants