Add MysqlStandbyCluster CRD + Phase 0 cross-cluster DR runbook#84

Open
colinmollenhour wants to merge 3 commits into main from megamind/dr-7-phase-0-1

@colinmollenhour
Collaborator

AI Megamind - By: Claude Code (Claude Sonnet orchestrating; Opus + GPT-5.5 + Gemini-3.1-Pro for review/critique)

Summary

First slice of WISHLIST #7 — cross-region/cross-cluster DR. Implements Phases 0 + 1 of the multi-phase plan: an end-to-end cross-cluster DR runbook over the existing MysqlFailoverGroup + initFromBackup + PITR surface, plus a new MysqlStandbyCluster CRD with a passive verifier reconciler.

  • Phase 0 (docs runbook). New docs/docs/multi-cluster-dr.mdx is a complete recovery walkthrough using only today's CRDs. An on-call engineer can recover into another cluster using this doc alone — no new operator features required. Includes topology, IAM policy, encryption-passphrase distribution, source-fencing checklist, recovery commands, DNS cutover, and prose failback narrative.
  • Phase 1 (passive verifier CR). New MysqlStandbyCluster CRD (api/v1alpha1/mysqlstandbycluster_types.go, 523 lines) plus reconciler (internal/controller/standbycluster_reconciler.go, 747 lines). The CR declares a DR relationship from the DR cluster's side: scan the shared S3 bucket on freshness.discoveryInterval (default 5m), pick the newest dump by @.json end timestamp (not lex order — a real bug; see "Notes" below), read per-site PITR manifests, publish BucketReadable and SourceConfigKnown conditions. No activation, no promotion, no writes into MySQL — those land in Phase 2 (continuous restore verification) and Phase 3 (dr-activate), which are explicit follow-up PRs.

The CRD ships the full v1alpha1 schema (template, freshness, activate blocks) so the API surface is locked once and Phase 2/3 can read those fields without bumping the CRD version (no conversion webhook is available — see docs/docs/known-limitations.mdx).

What's NOT in this PR (explicit follow-ups)

| Phase | Feature | Why deferred |
| --- | --- | --- |
| 2 | Restorable condition powered by MysqlBackupVerification reuse; dr-cursors/<name>.json retention-floor sentinel | Hard prereq on WISHLIST #43 (real-cluster PITR/backup E2E scenarios), which is also still open |
| 3 | dr-activate kubectl plugin verb; activation state machine (Validating → Restoring → Replaying → Provisioning → Activated); materialization of the active MysqlFailoverGroup | Builds on Phase 2's Restorable gate |
| 4 | Symmetric failback CR + runbook | Builds on Phase 3 |

The full plan and per-phase scope are in .tmp/megamind-dr-7/plans/second-draft.md (not committed; available locally during the run).

Test plan

  • make generate && make manifests — clean, no further drift on re-run
  • diff config/crd/bases/shipstream.io_mysqlstandbyclusters.yaml charts/bloodraven/crds/shipstream.io_mysqlstandbyclusters.yaml — byte-identical
  • go build ./... — success
  • make vet
  • make lint (golangci-lint run ./...)
  • go test -race -count=1 ./internal/... ./test/component/ — full suite, includes the new standby-cluster reconciler tests (13 original + 9 fix-pass extras)
  • kubectl apply --dry-run=client -f examples/standby-cluster.yaml against the new CRD — accepted
  • CI: Generate Check — must pass (CRD parity)
  • CI: Lint — must pass
  • CI: Unit + Component — must pass
  • CI: Envtest — new envtest scenarios in test/envtest/standbycluster_test.go
  • CI: E2E (smoke profile) — Phase 1 does not add a real-cluster scenario; smoke profile should not regress

Notes for reviewers

This PR went through Megamind's full review/fix loop:

  1. Three-model MBOT critique of the wishlist line surfaced 30 deduplicated findings; the critic dissent on the transport choice (Gemini wanted network-mediated; Opus + GPT-5.5 picked object-store-mediated) was resolved with the transport=ObjectStore|Network enum where Network is reserved for v2.
  2. One planning agent produced a 2353-line implementation plan with explicit Phase 0..4 sequencing.
  3. Three coding agents (CRD scaffolding, reconciler, docs) implemented Phases 0 + 1 in disjoint work packages.
  4. Three-reviewer ultra-review (bugs / runtime / craft) returned 31 findings. 18 routed to fixes (most notable: dump selection was using sort.Strings() on GenerateName-suffixed directory names — picks the wrong dump; replaced with @.json end-timestamp comparison). 5 were explicitly deferred to Phase 2/3 (e.g. standby metrics, GenerationChangedPredicate hardening) and tracked in .tmp/megamind-dr-7/reviews/validated-findings.md.
  5. Three fix agents addressed every validated finding. One fixed-review pass caught a trivial Helm-chart CRD copy drift; one-line re-copy.

Megamind Educational Brief

Educational brief — Cross-cluster DR (WISHLIST #7)

Status: future-state design. No PR has been opened. The authoritative
spec is plans/second-draft.md (2353 lines). This brief compresses it
for reviewers and future agents.

Run: .tmp/megamind-dr-7/ · Mode: planning-only · Readiness:
READY_TO_START with zero unresolved decisions.


Journey

How the wishlist line traveled through Megamind's planning loop:

  • Resolution. The user invocation "Help me plan out WISHLIST.md item #7"
    resolved against WISHLIST.md:21 (cross-region/cross-cluster DR as a
    first-class feature). The line bundles four distinct ideas — new CR,
    continuous shipping, one-command promote, runbook — that the critics
    treated as separable products.
  • Context capture. briefs/context.md grounded the planning in the
    existing surface: per-cluster operator, sidecar archiver gated on
    !@@read_only (intra-pod, primary-only), dr-only site role
    (intra-cluster, never auto-promoted), full-backup + PITR archive in
    S3, initFromBackup + pointInTime already deployed.
  • MBOT critique. Three critics — Claude Opus (max thinking),
    OpenAI GPT-5.5 (xhigh), Google Gemini 3.1 Pro (high) — ran in
    parallel against the wishlist line and produced ~30 deduplicated
    findings: 7 contradictions/hidden-assumptions (C-*), 10 failure-mode
    gaps (F-*), 8 architectural decisions needing an explicit choice
    (D-*), 5 naming concerns (N-*), 8 scope-discipline items (S-*).
  • Critic dissent (one item). D-1, the DR transport choice: Gemini
    argued for network-mediated first (cross-cluster MySQL replication)
    and deferring continuous S3 replay. Opus + GPT leaned
    object-store-mediated because today's surface already does the
    work. The
    collector sided with the 2-of-3 majority; the rejected option survives
    as a reserved transport=Network enum so v2 can revisit without a
    CRD bump.
  • Recommended-defaults table. The critique closed with a 9-row
    table (D-1..D-8, S-3) that became the seed for the planning pass. The
    table is the source of every "Source" citation in the Design Decisions
    section below.
  • Single planning pass. One Claude Opus agent applied the defaults
    end-to-end, producing the 16-section second draft (goals, phasing,
    CRD shape, state machine, conditions, metrics, IAM/RBAC, DNS, test
    plan, docs, risks, readiness). No MBOD/bundled-decisions phase ran
    because the critique left zero open multi-option questions. Status
    landed at READY_TO_START.

Design Decisions

Each row resolves a critique finding. "Source" cites the plan section
or current-code path that grounds it.

| ID | Decision | Choice | Alternative rejected | Source |
| --- | --- | --- | --- | --- |
| D-1 | DR transport | Object-store-mediated (S3 + PITR archive already in place). transport=ObjectStore. | Network-mediated (CHANGE REPLICATION SOURCE across clusters). Reserved as transport=Network enum for v2. | plans/second-draft.md §1.1, §4.4; critique §3 (D-1, Gemini dissent) |
| N-1 | Kind name | MysqlStandbyCluster (short name msc). | MysqlDRTarget — collides with SiteRoleDROnly (api/v1alpha1/types.go:280-283), which is the intra-cluster passive role and cannot be auto-promoted. Also rejected: MysqlClusterReplica, MysqlRemoteFollower, MysqlDRPair. | plans/second-draft.md §4.2; critique §4 (N-1) |
| N-3 | Activation verb | kubectl bloodraven dr-activate | promote — already means zero-RPO intra-cluster switchover with transactionsLost=0 (cmd/kubectl-bloodraven/promote.go:23-46). Reusing it across a non-zero-RPO cross-cluster path misleads operators. | plans/second-draft.md §6.5; critique §4 (N-3) |
| D-2 | CRD residency | Target-side only. The CR lives on the DR cluster and declares "consume from bucket X, promote on confirm." The source operator never knows the relationship exists. | Source-side CR, or symmetric pair. Both require an out-of-band linking step. Failback is achieved by symmetry — drop a new standby CR on the original source. | plans/second-draft.md §4.3; critique §3 (D-2) |
| D-5 | Promotion contract | Spec confirm-token (RFC 3339). spec.activate.confirm must parse and be strictly greater than status.activation.confirmTokenUsed. Mirrors restoreInPlace.confirm. | Annotation (source operator never sees it) or one-shot MysqlPromote CR (extra Kind for negligible gain). | plans/second-draft.md §6.1; api/v1alpha1/backup_types.go:723-732 |
| S-3 | Split-brain stance | Accept-loss with after-the-fact audit. Controller does not check whether the source is still writable. Operator owns the risk; runbook says "fence source first." | Bucket-fence sentinel object (TTL'd, source-written) as a hard interlock. Deferred to follow-up; the transport enum makes adding spec.activate.requireSourceFenceTTL non-breaking. | plans/second-draft.md §1.2, §6.7, §15.4; critique §5 (S-3) |
| D-6 | DNS cutover | Operator writes DNSEndpoint in both clusters; user runs external-dns symmetrically. Bloodraven owns per-MFG records; the application-facing record (weighted-CNAME / GSLB / manual flip) is user-owned. | Operator-driven cross-provider DNS cutover. Blast radius too large; explicit non-goal. | plans/second-draft.md §12; internal/platform/dns.go:23-31; api/v1alpha1/types.go:371-384 |
| D-3 | Bucket IAM | User-provisioned. Read-only at DR (s3:ListBucket, s3:GetObject) plus a tightly-scoped write on dr-cursors/* only. Runbook publishes the minimum policy. | Operator-managed IAM (STS / bucket policy automation). Out of v1 scope. | plans/second-draft.md §11.1; critique §3 (D-3) |
| D-4 | Encryption passphrase | User-managed. Source-side passphrase Secret is manually mirrored to DR namespace; preflight (Validating phase) validates non-empty. | Operator-driven passphrase distribution. Unanimous critic rejection. | plans/second-draft.md §11.2; docs/docs/backup-encryption.mdx:217-271 |
| F-3 | Failback shape | Symmetric: a second MysqlStandbyCluster on the original source cluster pointing at the new primary's bucket prefix. No dedicated MysqlFailback Kind. | Dedicated failback CR. Near-duplicate of the standby Kind; semantic difference is zero from operator's perspective. | plans/second-draft.md §7; critique §2 (F-3) |
| D-7 | Artifact ownership across clusters | DR-side bucket scan + synthetic shadow MysqlBackup CRs. DR controller writes phase-Succeeded MysqlBackup CRs annotated dr.bloodraven.shipstream.io/synthetic=true; the existing verification reconciler is taught a single predicate to accept them. | Source-side mirroring of MysqlBackup CRs (GitOps or otherwise). Violates "only the bucket is the cross-cluster bus." | plans/second-draft.md §5.2.1 |
| D-8 | Single CR vs multiple | Single Kind with transport discriminator (matches BackupStorage.Type precedent in api/v1alpha1/backup_types.go:388-444). | Separate Kinds for each transport mode. Causes API surface bloat. | plans/second-draft.md §4.4 |
| F-2 | PITR pruning vs DR consumer | dr-cursors/<ns>-<name>.json retention-floor sentinel. DR controller refreshes every 5m (TTL 60m); source operator's /pitr-cutoff returns min(MysqlBackup_retention, oldest_required_across_cursors). | Source-side CR coordination, or accepting the race. | plans/second-draft.md §5.3; cmd/bloodraven/main.go:388-410; internal/sidecar/binlog_archiver.go:350-458 |
| S-1, S-2 | Phasing | Phase 0 (docs runbook over existing surface) ships first. Phases 1..4 layer the CR, verification, activation, failback on top. Each phase ships independent value; Phase 0 is the floor if Phases 1+ slip. | Big-bang ship of CR + activation. Critics unanimous that writing the runbook surfaces the gaps the CR must close. | plans/second-draft.md §2, §3 |
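
As an illustration of row D-3, the following sketch builds a DR-side policy document with the two scopes described there. It is not the plan's §11.1 policy verbatim: the bucket name, the prefix, and the `drPolicy` helper are hypothetical, and only the statement names DRReadOnly and DRCursorWrite and the listed S3 actions come from this brief.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statement and policy model the parts of an IAM policy document that this
// sketch needs.
type statement struct {
	Sid      string   `json:"Sid"`
	Effect   string   `json:"Effect"`
	Action   []string `json:"Action"`
	Resource []string `json:"Resource"`
}

type policy struct {
	Version   string      `json:"Version"`
	Statement []statement `json:"Statement"`
}

// drPolicy returns read access on the whole prefix plus write access that is
// limited to the dr-cursors/ subprefix. bucket and prefix are placeholders.
func drPolicy(bucket, prefix string) policy {
	bucketARN := "arn:aws:s3:::" + bucket
	return policy{
		Version: "2012-10-17",
		Statement: []statement{
			{
				Sid:    "DRReadOnly",
				Effect: "Allow",
				Action: []string{"s3:ListBucket", "s3:GetObject"},
				// ListBucket applies to the bucket ARN, GetObject to object ARNs.
				Resource: []string{bucketARN, bucketARN + "/" + prefix + "/*"},
			},
			{
				Sid:      "DRCursorWrite",
				Effect:   "Allow",
				Action:   []string{"s3:PutObject", "s3:DeleteObject"},
				Resource: []string{bucketARN + "/" + prefix + "/dr-cursors/*"},
			},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(drPolicy("example-backups", "orders"), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

This sketch does not restrict listing to the prefix; a real policy would likely add an `s3:prefix` condition on the ListBucket grant.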

Architecture

Bloodraven on main (commit 5b5f0b0) is a single-cluster Kubernetes
operator: each MysqlFailoverGroup is one logical database with 2-16
sites that all live in the same cluster. The sidecar binlog archiver
runs only on the active primary (gated on !@@read_only), uploads
sealed binlogs to a shared S3 prefix, and the operator drives PITR
pruning from /pitr-cutoff. Today's "DR into another cluster" is a
manual checklist: stand up a fresh MFG in the target cluster with
spec.initFromBackup pointing at the source bucket, mirror passphrase
Secrets, flip DNS. There is no CR tracking the relationship, no
freshness signal, no consumer-side retention guard, no audit-grade
promote.

WISHLIST #7 introduces one new Kind — MysqlStandbyCluster — that
lives on the DR cluster, declares the relationship, continuously
verifies the latest dump + PITR window is restorable, and on a
confirm-token-gated dr-activate materializes a writable
MysqlFailoverGroup loaded from the source archive. The only
cross-cluster bus is the shared object store. Each operator stays
single-cluster: no federation, no operator-to-operator RPC.

Diagram 1 — End-state two-cluster topology

flowchart LR
    subgraph SourceCluster["Source cluster (e.g. us-west-prod)"]
        direction TB
        SOp["Bloodraven operator"]
        MFG["MysqlFailoverGroup (orders)"]
        SidePri["Sidecar (active primary)<br/>!@@read_only ⇒ writes binlogs"]
        SideRep["Sidecar (replicas)<br/>@@read_only ⇒ idle"]
        SOp -->|"reconciles"| MFG
        MFG --> SidePri
        MFG --> SideRep
    end

    subgraph Bucket["Shared S3 bucket (cross-cluster bus)"]
        direction TB
        Dumps["&lt;prefix&gt;/&lt;mysqlbackup-name&gt;/<br/>(full dumps + @.json)"]
        Binlogs["&lt;prefix&gt;/binlogs/<br/>(sealed binlogs + per-site manifest)"]
        Cursors["&lt;prefix&gt;/dr-cursors/&lt;name&gt;.json<br/>(retention floor sentinel)"]
    end

    subgraph DRCluster["DR cluster (e.g. us-east-prod)"]
        direction TB
        DOp["Bloodraven operator<br/>+ MysqlStandbyClusterReconciler"]
        MSC["MysqlStandbyCluster CR<br/>(verifier mode)"]
        MBVer["MysqlBackupVerification (periodic)<br/>+ synthetic MysqlBackup CRs"]
        FutureMFG["Materialized MysqlFailoverGroup<br/>(not yet created — Phase 3 only)"]
        DOp -->|"reconciles"| MSC
        MSC -->|"Owns"| MBVer
        MSC -.->|"materializes on dr-activate"| FutureMFG
    end

    SidePri -->|"PUT sealed binlogs"| Binlogs
    SOp -->|"PUT full dumps (Job)"| Dumps
    SOp -->|"GET dr-cursors/*.json<br/>during /pitr-cutoff"| Cursors

    MSC -->|"GET (list + read) dumps, binlogs"| Dumps
    MSC -->|"GET binlog manifests"| Binlogs
    MSC -->|"PUT dr-cursors/&lt;name&gt;.json<br/>(only object DR writes)"| Cursors

    classDef src fill:#fee,stroke:#900
    classDef dr fill:#eef,stroke:#009
    classDef bus fill:#ffd,stroke:#960
    class SourceCluster,SOp,MFG,SidePri,SideRep src
    class DRCluster,DOp,MSC,MBVer,FutureMFG dr
    class Bucket,Dumps,Binlogs,Cursors bus

The asymmetry is the design's defining feature:

  • The source cluster writes the bucket: full dumps via backup Jobs
    (operator-driven), sealed binlogs via the sidecar archiver (gated on
    !@@read_only, so the upload happens only on the active primary
    and switches over within one scan cycle on failover).
  • The DR cluster only reads dumps + binlogs. The one exception
    is the dr-cursors/<name>.json sentinel — a tiny per-standby file
    the DR controller refreshes every 5 minutes (TTL 60m) to bound the
    source operator's /pitr-cutoff and prevent it from pruning binlogs
    a DR consumer still needs (critique F-2).
  • IAM follows the asymmetry: s3:ListBucket + s3:GetObject on the
    whole prefix; s3:PutObject + s3:DeleteObject scoped to
    dr-cursors/* only.

Diagram 2 — MysqlStandbyCluster activation state machine

Mirrors plans/second-draft.md §9. One transition per reconcile so
operator restarts land on a well-defined observable state.

stateDiagram-v2
    [*] --> None
    None: "" (no activation requested)

    None --> Validating: "confirm set & valid<br/>Restorable=True (or AcceptUnverified=true)<br/>not already Activated"

    Validating --> Restoring: "spec snapshot taken<br/>template MFG name free or owned by this CR<br/>preflight passed"
    Validating --> Failed: "RFC3339 parse fail<br/>confirm ≤ confirmTokenUsed<br/>Restorable stale + !acceptUnverified<br/>TemplateInvalid"

    Restoring --> Replaying: "materialized MFG<br/>status.restore.phase == Succeeded"
    Restoring --> Failed: "MFG status.restore.phase == Failed<br/>(RestoreFailed) or MaterializedGroupCollision"

    Replaying --> Provisioning: "initFromBackup.pointInTime applied (or N/A)<br/>target GTID covers source dump GTID"
    Replaying --> Failed: "PitrReplayFailed<br/>(GTID mismatch)"

    Provisioning --> Activated: "MFG status.activeSite != ''<br/>Ready=True condition stamped"
    Provisioning --> Failed: "wall-clock &gt; spec.activate.restoreTimeout<br/>(ProvisioningTimeout)"

    Activated --> [*]: "terminal success<br/>Active=True, ActivationInProgress=False"
    Failed --> [*]: "terminal failure<br/>confirmTokenUsed NOT bumped — edit confirm to retry"
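
The "one transition per reconcile" rule in the diagram can be sketched as a pure function. The phase names come from the diagram; the `observation` type and the `next` function are hypothetical simplifications that collapse each phase's exit condition into a single boolean.

```go
package main

import "fmt"

type phase string

const (
	phaseNone         phase = ""
	phaseValidating   phase = "Validating"
	phaseRestoring    phase = "Restoring"
	phaseReplaying    phase = "Replaying"
	phaseProvisioning phase = "Provisioning"
	phaseActivated    phase = "Activated"
	phaseFailed       phase = "Failed"
)

// successor maps each non-terminal phase to the phase that follows on success.
var successor = map[phase]phase{
	phaseNone:         phaseValidating,
	phaseValidating:   phaseRestoring,
	phaseRestoring:    phaseReplaying,
	phaseReplaying:    phaseProvisioning,
	phaseProvisioning: phaseActivated,
}

// observation is a hypothetical summary of what one reconcile sees.
type observation struct {
	ExitMet bool   // the current phase's exit condition holds
	Failure string // non-empty reason (e.g. "RestoreFailed") forces Failed
}

// next advances by at most one phase per call, so a restarted operator always
// resumes from a phase that was written to status.
func next(cur phase, obs observation) phase {
	nxt, ok := successor[cur]
	if !ok {
		return cur // Activated and Failed are terminal
	}
	if obs.Failure != "" && cur != phaseNone {
		return phaseFailed
	}
	if obs.ExitMet {
		return nxt
	}
	return cur
}

func main() {
	p := phaseNone
	for i := 0; i < 5; i++ {
		p = next(p, observation{ExitMet: true})
		fmt.Println(p)
	}
}
```

A real reconciler would persist the returned phase to status before doing the next phase's work; the sketch only models the transition table.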

Key invariants:

  • confirmTokenUsed is monotonically non-decreasing. A retry after
    Failed requires the user to bump spec.activate.confirm to a
    strictly-greater RFC 3339 timestamp (or use --auto-confirm /
    kubectl bloodraven dr-activate).
  • Every transition writes status before the next phase's work
    starts. Crash semantics: the next reconcile reads the current phase,
    re-runs idempotent work (e.g. CreateOrUpdate on the materialized
    MFG), and re-checks the exit condition. Pattern matches
    PlannedFailoverReconciler.handle* in
    internal/controller/planned_failover_reconciler.go:138-152.
  • Post-Activated the controller stops processing new confirm
    edits and emits an ActivationLocked event. A second activation is
    always a fresh CR.
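
The confirm-token invariant can be sketched as a small validation function. `validateConfirm` is a hypothetical name, and treating an empty `confirmTokenUsed` as "no token consumed yet" is an assumption; only the RFC 3339 and strictly-greater rules come from this brief.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// validateConfirm reports whether confirm may start a new activation.
// used stands for status.activation.confirmTokenUsed; empty is assumed to
// mean that no token has been consumed yet.
func validateConfirm(confirm, used string) error {
	c, err := time.Parse(time.RFC3339, confirm)
	if err != nil {
		return fmt.Errorf("confirm is not RFC 3339: %w", err)
	}
	if used == "" {
		return nil
	}
	u, err := time.Parse(time.RFC3339, used)
	if err != nil {
		return fmt.Errorf("confirmTokenUsed is not RFC 3339: %w", err)
	}
	if !c.After(u) {
		return errors.New("confirm must be strictly greater than confirmTokenUsed")
	}
	return nil
}

func main() {
	fmt.Println(validateConfirm("2025-01-02T00:00:00Z", "2025-01-01T00:00:00Z")) // accepted
	fmt.Println(validateConfirm("2025-01-01T00:00:00Z", "2025-01-01T00:00:00Z")) // rejected: not strictly greater
	fmt.Println(validateConfirm("yesterday", ""))                                // rejected: parse failure
}
```

Comparing parsed timestamps rather than raw strings keeps the check correct when the two tokens use different UTC offsets.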

Diagram 3 — DR-event lifecycle (with failback)

sequenceDiagram
    autonumber
    participant Apps as "Applications"
    participant SrcOp as "Source operator"
    participant Bucket as "Shared S3 bucket"
    participant DrOp as "DR operator"
    participant MSC as "MysqlStandbyCluster CR"
    participant DrMFG as "Materialized MysqlFailoverGroup"
    participant Admin as "Admin"

    Note over SrcOp,Bucket: "Steady state (Phase 1 + 2)"
    SrcOp->>Bucket: "PUT full dumps + sealed binlogs"
    DrOp->>Bucket: "LIST + GET (discovery loop, 5m)"
    DrOp->>MSC: "stamp status.discovered, BucketReadable=True"
    DrOp->>Bucket: "PUT dr-cursors/&lt;name&gt;.json (5m refresh)"
    DrOp->>DrOp: "scheduled MysqlBackupVerification (cron, default 0 4 * * *)"
    DrOp->>MSC: "Restorable=True; bloodraven_dr_restorable_timestamp_seconds gauge"

    Note over SrcOp,DrOp: "Source cluster loss"
    SrcOp--xApps: "primary unreachable / cluster API down"
    Admin->>Admin: "confirm source down (3 signals: /active-site 503,<br/>API server unreachable, MySQL TCP unreachable)"

    Note over Admin,MSC: "Activation (Phase 3)"
    Admin->>MSC: "kubectl bloodraven dr-activate &lt;msc&gt; --confirm $(date -u +%FT%TZ) --wait"
    DrOp->>MSC: "Validating: parse confirm, snapshot discovered.dumpName/Loc/GTID"
    DrOp->>DrMFG: "Restoring: create MFG with spec=template.spec + synthesized initFromBackup"
    DrMFG->>Bucket: "GET dump + binlogs (existing initFromBackup path)"
    DrMFG->>DrOp: "status.restore.phase=Succeeded"
    DrOp->>MSC: "Replaying: validate target GTID ⊇ source dump GTID"
    DrOp->>MSC: "Provisioning: wait Ready=True, activeSite set"
    DrOp->>MSC: "Activated: stamp materializedFailoverGroup,<br/>Active=True, emit StandbyActivated event"
    DrMFG-->>Apps: "writable (after DNS cutover by admin)"

    Note over Admin,Bucket: "DNS cutover (D-6) — user-driven"
    Admin->>Apps: "flip weighted-CNAME / external-dns ownership"

    Note over SrcOp,Bucket: "Source returns (Phase 4 failback)"
    SrcOp->>SrcOp: "original cluster comes back"
    Admin->>SrcOp: "delete old MFG + PVCs (destructive, manual)"
    Admin->>DrMFG: "ensure spec.backup.profiles[].storage.s3.prefix uses<br/>new directional layout (e.g. orders/east/)"
    Admin->>SrcOp: "apply *new* MysqlStandbyCluster pointing at DR cluster's prefix"
    SrcOp->>Bucket: "discovery + verification against DR's new bucket prefix"
    SrcOp->>SrcOp: "Restorable=True"
    Admin->>SrcOp: "kubectl bloodraven dr-activate (failback) — original cluster becomes standby of new primary"

The symmetry of MysqlStandbyCluster is the failback story: the same
Kind/controller/state-machine runs in both directions. No new "failback"
Kind, no swap-direction operation; just a second standby CR pointing
the other way. The plan calls this "current-state-driven, not
identity-driven" — exactly the same discipline as in-cluster
fail-back, where a returning original primary wins promotion only if
it wins the normal GTID-freshest candidate path.

CR shape (top-level fields from plan §8)

MysqlStandbyClusterSpec (shipstream.io/v1alpha1, namespace-scoped,
shortname msc, categories bloodraven;mysql;dr):

  • transport: ObjectStore (only honored in v1) or reserved Network.
  • source: failoverGroupName, optional namespace/cluster (informational), storage (mirrors BackupStorage), profileName, optional decryption (mirrors BackupDecryptionSpec).
  • template: embedded MysqlFailoverGroupSpec declared at standby-CR-creation time so activation is not a YAML scramble during an incident; plus name of the MFG to materialize.
  • freshness: discoveryInterval (5m default), verifySchedule cron (default 0 4 * * * UTC), verifyTimeZone, maxStaleness (48h default), suspend, retentionFloorRefresh (5m default).
  • activate: confirm (required RFC 3339), optional pointInTime (mirrors PointInTimeSpec), acceptUnverified (bypass Restorable gate), restoreTimeout (2h default).

Status carries discovered, lastVerified, activation (the full
StandbyActivationStatus audit block with source/target GTID, PITR
stop datetime, replayed binlog count, materialized active site,
reason, message), materializedFailoverGroup, and conditions.
Conditions: BucketReadable, SourceConfigKnown, Restorable,
ActivationInProgress, Active.

Phasing (plan §2)

  • Phase 0: docs/docs/multi-cluster-dr.mdx runbook over existing CRDs only. Ships first; surfaces every gap Phases 1+ must close. Required for v1 floor.
  • Phase 1: MysqlStandbyCluster CR + controller in passive verifier mode. Discovery loop populates status.discovered; stamps BucketReadable and SourceConfigKnown. No load, no materialization.
  • Phase 2: Continuous DR readiness. Synthetic MysqlBackup CRs (annotated dr.bloodraven.shipstream.io/synthetic=true); CronJob-scheduled MysqlBackupVerification runs; Restorable condition; bloodraven_dr_restorable_timestamp_seconds gauge. Source operator gains dr-cursors/*.json honor in /pitr-cutoff. Hard prereq: WISHLIST #43 PITR E2E scenarios.
  • Phase 3: dr-activate (kubectl plugin) + spec confirm-token; full activation state machine; materialized MFG. New verb name picked deliberately to not collide with intra-cluster promote (zero-RPO).
  • Phase 4: Failback runbook + symmetric-CR rehearsal. No CRD changes: the Kind is already symmetric.

Two crucial reuse points

The new controller is essentially a scheduler around primitives that
already exist.

  1. Existing MysqlBackupVerification powers Phase 2 readiness. The
    verification reconciler already restores a backup into an ephemeral
    mysqld and (optionally) replays binlogs to validate the dump. The
    only new code on that path is a single predicate flip in
    internal/controller/backup_verification_reconciler.go to accept
    MysqlBackup CRs carrying the synthetic annotation and resolve
    their location from MysqlBackup.status.location.
  2. Existing initFromBackup + pointInTime powers Phase 3
    activation. The Restoring phase synthesizes an initFromBackup
    block pointing at the discovered dump location (+ optional
    pointInTime) and creates the materialized MFG. From there, the
    normal greenfield bootstrap path runs unchanged — restore Job,
    sentinel write, replica clone, DNSEndpoint write, isFreshDeploy
    gating. The standby controller's job at that point is purely to
    wait for the existing status.restore.phase=Succeeded and
    Ready=True signals.
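
The "single predicate flip" in the first reuse point might look roughly like the sketch below. The `backup` struct is a hypothetical, trimmed-down stand-in for a MysqlBackup object, and `verifiable` is an illustrative name, not the reconciler's actual function; only the annotation key, the Succeeded phase, and the use of status.location come from this brief.

```go
package main

import "fmt"

// syntheticAnnotation is the annotation key named in the brief.
const syntheticAnnotation = "dr.bloodraven.shipstream.io/synthetic"

// backup is a hypothetical, trimmed-down view of a MysqlBackup object.
type backup struct {
	Annotations map[string]string
	Phase       string // stands in for status.phase
	Location    string // stands in for status.location
}

// isSynthetic reports whether the DR controller wrote this backup as a shadow CR.
func isSynthetic(b backup) bool {
	return b.Annotations[syntheticAnnotation] == "true"
}

// verifiable accepts Succeeded backups; synthetic ones must also carry a
// location, because there is no local backup Job to resolve it from.
func verifiable(b backup) bool {
	if b.Phase != "Succeeded" {
		return false
	}
	if isSynthetic(b) {
		return b.Location != ""
	}
	return true
}

func main() {
	shadow := backup{
		Annotations: map[string]string{syntheticAnnotation: "true"},
		Phase:       "Succeeded",
		Location:    "s3://example-backups/orders/orders-full-a7c3x",
	}
	fmt.Println(verifiable(shadow))
	fmt.Println(verifiable(backup{Annotations: shadow.Annotations, Phase: "Succeeded"}))
}
```

The second call returns false because the synthetic backup has no location to restore from.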

This is the design's lever: almost every primitive Phase 2/3 needs
already exists. The new CR is a scheduler + audit layer that names
the relationship; nearly all the heavy machinery (S3 client, BRV1
header parsing, dump load via mysqlsh util.loadDump, binlog replay,
DNSEndpoint, condition surface, metrics shape) is reused verbatim.


Lessons

  • A naming collision is a critique-phase finding, not an
    implementation-review finding. All three critics independently
    surfaced N-1 — MysqlDRTarget vs SiteRoleDROnly. If the planning
    pass had gone first, the name would have shipped, gone to
    implementation review, and been renamed at the worst possible time
    (after generated DeepCopy code + Helm chart edits + docs are in
    flight). The MBOT critique catches naming hazards before anyone
    writes a Go file.

  • Make the only cross-cluster bus explicit in the first diagram.
    Diagram 1 puts the shared S3 bucket dead center with its three
    subprefixes, and labels the directionality of every arrow. The
    trust boundary becomes obvious immediately — and the asymmetry
    ("source writes, DR reads, except for one tiny sentinel object")
    catches the C-1/F-2 critiques in one image. A reviewer who only
    reads the diagram still knows the answer to "what runs the shipper
    on the DR side?" (nothing).

  • MBOT critique value comes from picking models with different
    failure modes. Opus + GPT + Gemini disagreed on exactly one
    thing — the D-1 transport choice — and that disagreement was the
    most valuable finding in the entire critique. The decision became
    visible (object-store-first, with Network reserved as a
    forward-compatible enum) rather than buried in a single agent's
    default. When models agree on everything, the critique is probably
    rubber-stamping; when one dissents on one item, the planner has a
    real tradeoff to write down.

  • Existing primitives drive the CRD shape, not the other way
    around. The template field, the synthetic-MysqlBackup trick,
    the confirm-token pattern, the phase enum vocabulary — every one of
    these mirrors something already in the codebase
    (MysqlFailoverGroupSpec, MysqlBackup shape, RestoreInPlaceSpec,
    PlannedFailoverPhase). The CR is dense with "mirrors X" /
    "analog of Y" citations on purpose: it makes the v1 surface
    forward-compatible with the existing operator's discipline, and it
    keeps the implementation small because most of the heavy code is
    already there.

  • Phasing lets the docs ship before the code. Phase 0 (runbook
    over existing CRDs only) is independently useful: an on-call
    engineer at 03:00 can recover into another cluster using only the
    existing surface. That floor de-risks every later phase — if Phase 1+
    slips a release, users still have a documented recovery path. The
    rest of the wishlist line's gaps (no freshness signal, no audit-
    grade promote) become enhancements over a working baseline, not
    blockers for shipping anything.

  • Cross-cluster split-brain is a policy decision, not a
    technology decision. S-3 was the single hardest call. The plan
    resolves it as "accept-loss with audit" + runbook + post-hoc
    divergent-GTID detection on rejoin — explicitly, in §1.2 non-goals
    and §6.7. Critically, the transport discriminator preserves the
    option to add a spec.activate.requireSourceFenceTTL bucket
    sentinel in v2 without a CRD bump. The lesson generalizes: declare
    the v1 stance up front (it's documented in non-goals) so design
    review doesn't re-litigate it; and leave a forward-compatible knob
    for future interlock-mode without committing to it now.

  • The kubectl-plugin verb name encodes a contract. promote
    ships a specific guarantee (drain → GTID catch-up →
    transactionsLost=0). dr-activate cannot offer that contract.
    Different verb. Operators reading docs or running history-search
    immediately know which contract they're invoking — N-3 is a tiny
    decision with disproportionate operator-experience leverage.

  • Megamind's planning loop wins when readiness gates are tight.
    This run landed at READY_TO_START with zero unresolved
    decisions because the recommended-defaults table closed every
    open [D-*] / [N-*] / [S-*] item with a concrete pick. A
    planning agent applying defaults that don't close every open
    finding produces a draft with TODOs; that's where implementation
    cycles start spinning. The discipline is: if the critique can't
    produce a default, the critique is not done.


Evidence

Claim-to-source table. Diagram and design-decision rows ground in
specific plan sections; runtime/contract claims ground in
current-code paths verified at planning time.

| Claim | Source |
| --- | --- |
| Source operator is the sole bucket writer for dumps + binlogs | internal/sidecar/binlog_archiver.go:239,537 (IsReadOnly gate); internal/controller/backup_reconciler.go:60 (backup-Job RBAC); plans/second-draft.md §1.1, §3.3 step 1 |
| Sidecar archiver runs only on the active primary, per-pod | internal/sidecar/binlog_archiver.go:531-537 (read-only check); critique §1 (C-1, F-9) |
| dr-only site role is intra-cluster, never auto-promoted | api/v1alpha1/types.go:280-283; docs/docs/multi-site.mdx:14-23; docs/docs/known-limitations.mdx:60-63 |
| Existing DR is "ad-hoc initFromBackup in another cluster" | WISHLIST.md:21; briefs/context.md §"What DR today actually looks like"; api/v1alpha1/backup_types.go:552-616 |
| Diagram 1 (topology) grounded | plans/second-draft.md §1.1, §3.3 (bullet 1 topology overview), §5.3 (cursor file), §11.1 (IAM asymmetry) |
| Diagram 2 (activation state machine) grounded | plans/second-draft.md §9 (entire section); enum at §8.3 (StandbyActivationPhase); idempotency rules at §9.3-§9.4 |
| Diagram 3 (DR-event lifecycle + failback) grounded | plans/second-draft.md §6 (activation flow), §7 (failback runbook), §12.4 (DNS event copy) |
| MysqlStandbyCluster Kind name (D-1/N-1) | plans/second-draft.md §4.2; critique §4 (N-1) "MysqlStandbyCluster is the clearest" |
| Object-store transport choice with Network reserved (D-1) | plans/second-draft.md §1.2, §4.4, §8.2 (StandbyTransport enum); critique §3 (D-1) "object-store-mediated DR is the natural extension" |
| dr-activate verb chosen over promote (N-3) | plans/second-draft.md §6.5; cmd/kubectl-bloodraven/promote.go:23-46 (transactionsLost=0 contract); critique §4 (N-3) |
| Target-side-only CRD residency (D-2) | plans/second-draft.md §4.3; critique §3 (D-2) "the cluster running the command and the cluster being promoted are the same" |
| Spec confirm-token gate pattern (D-5) | plans/second-draft.md §6.1; api/v1alpha1/backup_types.go:723-732 (RestoreInPlaceSpec.Confirm) |
| Phase enum mirrors RestoreInPlacePhase | plans/second-draft.md §8.3 (StandbyActivationPhase); api/v1alpha1/backup_types.go:762-810 |
| Split-brain stance "accept-loss with audit" (S-3) | plans/second-draft.md §1.2, §6.7, §15.4, §15.7; critique §5 (S-3); docs/docs/durability-and-rpo.mdx:94-118 (divergent-GTID detection) |
| DR cluster reads only, plus dr-cursors writes | plans/second-draft.md §5.3, §11.1; IAM policy in §11.1 (DRReadOnly + DRCursorWrite statements scoped to dr-cursors/*) |
| dr-cursors/<name>.json retention-floor sentinel (F-2 mitigation) | plans/second-draft.md §5.3, §15.5; cmd/bloodraven/main.go:388-410 (/pitr-cutoff handler); internal/sidecar/binlog_archiver.go:350-458 (archive pruning) |
| Restorable condition powered by MysqlBackupVerification (S-6) | plans/second-draft.md §5.2; api/v1alpha1/mysqlbackupverification_types.go (existing CRD); reuse via synthetic-MysqlBackup annotation predicate flip at internal/controller/backup_verification_reconciler.go |
| Activation reuses initFromBackup + pointInTime unchanged | plans/second-draft.md §6.2-§6.3, §9 (Restoring phase); api/v1alpha1/backup_types.go:552-616 (InitFromBackupSpec shape); api/v1alpha1/backup_types.go:191-210 (PointInTimeSpec) |
| template field declared at CR-create-time, not activation-time | plans/second-draft.md §4.4 (last bullet) "user declares site list, DNS hostname, storage class, credentials secret at standby-CR-creation time, not at activation time (otherwise activation becomes a YAML scramble in an incident)" |
| Materialized MFG owner ref with BlockOwnerDeletion=false | plans/second-draft.md §4.3, §9.2 (Restoring phase work) — deleting standby after activation does NOT cascade-delete the writable MFG |
| Second activation is locked post-Activated; users delete-and-recreate to re-fire | plans/second-draft.md §6.6; ActivationLocked event in §10.5 |
| Failback is symmetric — second MysqlStandbyCluster on returning original source (F-3) | plans/second-draft.md §7.1-§7.5; critique §2 (F-3) |
| Directional bucket prefix recommendation (e.g. orders/east/) | plans/second-draft.md §7.2 step 3, §7.4 (future automation note) |
| DNS handled by external-dns symmetrically; operator only writes DNSEndpoint (D-6) | plans/second-draft.md §12; internal/platform/dns.go:23-31; api/v1alpha1/types.go:371-384 |
| Encryption passphrase mirrored manually; preflight-validated (D-4) | plans/second-draft.md §11.2; docs/docs/backup-encryption.mdx:217-271; api/v1alpha1/backup_types.go:343-349 (BackupDecryptionSpec reuse) |
| IAM policy minimum (D-3) | plans/second-draft.md §11.1 (JSON policy verbatim) |
| Per-cluster operator with leader election; no federation (S-8) | briefs/context.md §"Operator/sidecar facts"; plans/second-draft.md §1.2 (non-goal) |
| Metrics: bloodraven_dr_* series mirror bloodraven_backup_* shape | plans/second-draft.md §10.3; internal/metrics/metrics.go:114-200,163-170 (existing pattern) |
Hard prereq on WISHLIST #43 (PITR E2E scenarios) for Phase 2 plans/second-draft.md §2.2, §15.1; WISHLIST.md:17 (#43); critique §5 (S-5)
Critic dissent on D-1 transport choice (Opus + GPT vs Gemini) critiques/mbot-critique.md §"Where the critics disagreed"
Ledger: planning-only run, zero unresolved decisions at end final/ledger.md; plans/second-draft.md §16
CRD evolution constraint (no conversion webhook on v1alpha1) docs/docs/known-limitations.mdx:18-19; plans/second-draft.md §15.12
Per-MFG Helm RBAC is hand-maintained; new Kind requires mirror plans/second-draft.md §11.4.2; CLAUDE.md "Pre-PR gate" §5; charts/bloodraven/templates/clusterrole.yaml:48-77
Synthetic MysqlBackup annotation contract for verification reuse plans/second-draft.md §5.2.1; annotations dr.bloodraven.shipstream.io/synthetic=true, dr.bloodraven.shipstream.io/source-bucket=…
Cross-cluster RPO floor max_binlog_size ÷ throughput + upload latency plans/second-draft.md §1.2, §10.4; docs/docs/durability-and-rpo.mdx:142-167; critique §1 (C-3, F-6)

End of brief. Length target ~400-700 lines; this brief is within that
budget while staying grounded in the artifacts listed in the
"Required reading" section of the prompt.

This is the first slice of WISHLIST #7 ("Cross-region/cross-cluster DR
as a first-class feature") and implements Phases 0 + 1 of the plan at
.tmp/megamind-dr-7/plans/second-draft.md.

Phase 0 (docs runbook) — docs/docs/multi-cluster-dr.mdx is a complete
end-to-end recovery runbook that works against the existing surface
(MysqlFailoverGroup + initFromBackup + PITR archive). An on-call
engineer can recover into another cluster using only this doc and
today's CRDs. Phases 1+ layer first-class tooling on top.

Phase 1 (passive verifier CR) — new MysqlStandbyCluster CRD declares a
DR relationship from the DR cluster's side. The reconciler scans the
shared S3 bucket on a configurable cadence (default 5m), discovers the
newest full dump (by @.json `end` timestamp, not lex order) and reads
per-site PITR manifests, then publishes BucketReadable and
SourceConfigKnown conditions. No activation, no promotion, no writes
into MySQL — that machinery lands in Phases 2 and 3 (deferred to
follow-up PRs).
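
The "newest by `end` timestamp, not lex order" rule can be sketched as follows. This is a minimal illustration, not the reconciler's code: the `dumpMeta` type, the `newestDump` function, the example prefixes, and the RFC 3339 layout of `end` are all assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// dumpMeta is a hypothetical, trimmed view of a dump's @.json metadata.
// Only the `end` field name comes from the PR text; the RFC 3339 layout
// is an assumption for this sketch.
type dumpMeta struct {
	End string `json:"end"`
}

// newestDump returns the dump prefix whose @.json `end` timestamp is the
// latest. metas maps a dump prefix to the raw bytes of its @.json object.
// Entries with unreadable metadata or an unparseable timestamp are skipped.
func newestDump(metas map[string][]byte) (string, time.Time, error) {
	var bestPrefix string
	var bestEnd time.Time
	for prefix, raw := range metas {
		var m dumpMeta
		if err := json.Unmarshal(raw, &m); err != nil {
			continue
		}
		end, err := time.Parse(time.RFC3339, m.End)
		if err != nil {
			continue
		}
		if bestPrefix == "" || end.After(bestEnd) {
			bestPrefix, bestEnd = prefix, end
		}
	}
	if bestPrefix == "" {
		return "", time.Time{}, fmt.Errorf("no dump with a parseable end timestamp")
	}
	return bestPrefix, bestEnd, nil
}

func main() {
	// "dumps/weekly-full/" sorts last lexically but is the older dump.
	metas := map[string][]byte{
		"dumps/weekly-full/":   []byte(`{"end":"2024-05-01T09:00:00Z"}`),
		"dumps/adhoc-rebuild/": []byte(`{"end":"2024-05-03T02:10:00Z"}`),
	}
	prefix, _, err := newestDump(metas)
	fmt.Println(prefix, err)
}
```

A lexicographic pick would choose `dumps/weekly-full/` here, which is why the comparison is made on the parsed timestamp instead of the key.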

The CRD ships the full v1alpha1 schema (template/freshness/activate
blocks) so the API surface is locked once and Phase 2/3 can read
those fields without bumping the CRD version (no conversion webhook
available, per docs/docs/known-limitations.mdx).

Implementation went through Megamind's review/fix loop:
- Three-model MBOT critique of the wishlist line → consolidated
  defaults
- One planning agent produced the 2353-line plan
- Three coding agents (CRD scaffolding, reconciler, docs) implemented
  Phase 0 + 1 in parallel work packages
- Three-reviewer ultra-review (bugs/runtime/craft) returned 31
  findings; 18 routed to fixes (S/R/D/T bundles), 5 explicitly
  deferred to Phase 2/3
- Three fix agents addressed every validated finding; one
  fixed-review pass plus a one-line trailing chart-CRD refresh
- All local gates green: go build, go vet, golangci-lint, race-test
  suite, generate/manifests clean

See .tmp/megamind-dr-7/ for run artifacts (plan, critique, reviews,
fixes, educational brief).

The envtest tests created MysqlStandbyCluster CRs with a near-empty
template.spec, which passed the local fake-client tests but failed
admission in CI:

  spec.source.storage.s3.credentialsSecret: Invalid value: ""
  spec.template.spec.dns.hostname: Invalid value: ""
  spec.template.spec.sites: Required value

The CRD embeds the full MysqlFailoverGroupSpec under spec.template.spec,
so admission validates the template at standby-cluster create time, not
only at Phase 3 activation. The fixture now mirrors
examples/minimal-failovergroup.yaml (two primary-candidate sites with
zone/taintNodeSelector/lbIP/storage and a DNS hostname) and sets a
non-empty credentialsSecret on the S3 source.

A new ensureEnvtestS3CredsSecret helper provisions the dummy Secret
referenced by spec.source.storage.s3.credentialsSecret so the
reconciler's resolveS3CredsToDir path (which always runs before the
SetNewStoreFunc injection point) finds a valid Secret. Each of the
three envtest tests now calls it before creating the CR.

The miss: `make test-envtest` was not run locally before the original
push; only the unit + component suites were. CLAUDE.md's Pre-PR gate
requires test-envtest when CRD validation is touched.
Comment thread internal/controller/standbycluster_reconciler.go
prefix string,
) (*standbyScanResult, error) {
// List everything under the prefix. For large archives this may be
// many thousands of keys; ArchiveStore handles pagination internally.
Collaborator Author


AI Ultra Review · Commit: bafb7c1 · Role: runtime · Flagged by: Pi

The discovery loop lists the entire backup prefix every 5 minutes by default. In production that prefix contains dump shard files plus the full archived-binlog history, so work and S3 List cost scale with total archive age/size for every MysqlStandbyCluster, not just with metadata objects. That can throttle or slow controller workers in long-lived installations. Please bound discovery to metadata/index prefixes (or maintain a small sentinel/index object) rather than scanning the full archive namespace on each reconcile.

Collaborator Author


AI Review Response · Commit: fe09da2 · By: OpenCode with GPT 5.5

I addressed the worker-hang part of this by wrapping discovery scans in a bounded context timeout, and the sibling-prefix safety issue in a separate resolved thread. I did not add a new metadata index/sentinel object in this commit because the current ArchiveStore contract only exposes List/Get and there is no existing writer-side index format for dumps.

Do you want Phase 1 to block on introducing a durable dump index/sentinel format now, or is bounded-time prefix scanning acceptable for this phase with the index format tracked as a follow-up?

@colinmollenhour
Collaborator Author

AI Ultra Review · Commit: bafb7c1 · Roles: bugs, runtime, craft · Models: Pi

Posted 4 inline findings after validating and deduplicating the role outputs.

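| Agent | Found | Validated | False Positives | Unique Finds | Shared Finds | Accuracy | Composite Score |
|-------|-------|-----------|-----------------|--------------|--------------|----------|-----------------|
| Pi    | 9     | 8         | 1               | 4            | 0            | 89%      | 6               |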

Best/worst agent: no differentiation; only the Pi-backed participant was available in this harness run.

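| Role    | Found | Validated | Unique-to-role | Accuracy |
|---------|-------|-----------|----------------|----------|
| bugs    | 3     | 3         | 1              | 100%     |
| runtime | 3     | 3         | 1              | 100%     |
| craft   | 3     | 2         | 0              | 67%      |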

Most validated signal came from bugs and runtime; craft contributed duplicate coverage for the dump-selection and lastScanAt issues but also produced one false positive about generated artifacts that were present in the PR and excluded only from review bucketing.

@colinmollenhour
Collaborator Author

AI Ultra Review · Commit: bafb7c1 · Roles: bugs, runtime, craft · Models: GPT, Gemini, Qwen

Full-branch re-review posted 5 additional inline findings. Four validated findings from this pass were already covered by the prior ultra-review, so I did not duplicate those comments: encrypted source archive discovery, lexicographic dump-candidate truncation, stale LastScanAt, and unbounded full-prefix listing.

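| Agent  | Found | Validated | False Positives | Unique Finds | Shared Finds | Accuracy | Composite Score |
|--------|-------|-----------|-----------------|--------------|--------------|----------|-----------------|
| GPT    | 8     | 8         | 0               | 3            | 5            | 100%     | 11              |
| Gemini | 13    | 5         | 8               | 1            | 4            | 38%      | -10             |
| Qwen   | 9     | 2         | 7               | 0            | 2            | 22%      | -12             |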

Best agent by composite score: GPT. Worst agent by composite score: Qwen. Gemini produced the most new signal for prefix-boundary and envtest issues, but also returned several lower-confidence craft items that did not clear validation.

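| Role    | Found | Validated | Unique-to-role | Accuracy |
|---------|-------|-----------|----------------|----------|
| bugs    | 6     | 4         | 1              | 67%      |
| runtime | 8     | 4         | 1              | 50%      |
| craft   | 16    | 6         | 3              | 38%      |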

Most validated signal came from craft for docs/test issues and from runtime for production-safety hazards. No role produced zero validated issues.

@colinmollenhour
Collaborator Author

AI Review Response · Commit: fe09da2 · By: OpenCode with GPT 5.5 · Summary

Pushed commit fe09da2 (Address standby cluster review feedback) and resolved 8 review threads.

Fixes included: wired spec.source.decryption.passphraseSecret into PITRConfig.PassphraseFile; added a bounded discovery scan context; normalized S3 prefixes and enforced slash-bounded candidate matching; removed the lexicographic dump-candidate cap and added a >10 non-time-sorted dump regression test; updated LastScanAt on every successful scan while keeping BucketScanned event suppression separate; corrected the /active-site runbook endpoint; added required spec.template to the docs sample; and changed the envtest to fail on unexpected reconcile errors.
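
As an illustration of the prefix-normalization and slash-bounded matching fix, the sketch below shows one way such a check can be written. `normalizePrefix` and `underPrefix` are hypothetical helper names chosen for this example. They are not taken from the commit, and the actual implementation may handle more cases.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizePrefix trims surrounding slashes and appends exactly one
// trailing slash. An empty prefix stays empty and means the bucket root.
func normalizePrefix(p string) string {
	p = strings.Trim(p, "/")
	if p == "" {
		return ""
	}
	return p + "/"
}

// underPrefix reports whether key lies under prefix on a slash boundary,
// so "orders/east" does not match keys under the sibling "orders/east-old".
func underPrefix(key, prefix string) bool {
	return strings.HasPrefix(key, normalizePrefix(prefix))
}

func main() {
	fmt.Println(underPrefix("orders/east/dump-1/@.json", "orders/east"))     // true
	fmt.Println(underPrefix("orders/east-old/dump-1/@.json", "orders/east")) // false
}
```

Without the trailing slash, a plain `strings.HasPrefix(key, "orders/east")` would also accept keys under `orders/east-old/`, which is the sibling-prefix hazard the review raised.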

Posted one follow-up question on the remaining unresolved thread about whether Phase 1 must introduce a durable dump index/sentinel format now, or whether bounded-time prefix scanning is acceptable with indexing tracked as follow-up. I skipped resolving that thread pending the reviewer decision.

Validation: go test ./internal/controller -run TestMysqlStandbyCluster passes. The targeted envtest command could not run locally because this checkout excludes all test/envtest files without the required build constraints; CI Test / Envtest is running. Current PR state: branch pushed, working tree clean, Generate Check passed, other CI checks are in progress.
