ShipStream · colinmollenhour · May 21, 2026
diff --git a/.github/kind/e2e-calico.yaml b/.github/kind/e2e-calico.yaml
@@ -0,0 +1,22 @@
+# kind cluster configuration for Bloodraven E2E tests.
+# Uses Calico CNI so NetworkPolicy resources are enforced (the default
+# kindnet CNI does not implement NetworkPolicy, which means partition /
+# self-fencing scenarios would silently pass without actually testing
+# policy behaviour).
+#
+# Usage:
+#   kind create cluster --config=.github/kind/e2e-calico.yaml
+#   # then install Calico:
+#   kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml
+kind: Cluster
+apiVersion: kind.x-k8s.io/v1alpha4
+name: bloodraven-e2e
+nodes:
+  - role: control-plane
+  - role: worker
+  - role: worker
+networking:
+  # Disable kindnet so Calico can manage CNI instead.
+  disableDefaultCNI: true
+  # Match the stock Calico manifest's default IPv4 pool.
+  podSubnet: "192.168.0.0/16"
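Reviewer note: the usage comment in this config, combined with the wait commands from the "Install Calico CNI" step in `_e2e.yml`, can be replayed locally. The sketch below is a minimal illustration and not part of the PR; `run`, `bring_up_cluster`, and the `DRY_RUN` switch are hypothetical names introduced here, while the `kind` and `kubectl` invocations are copied from the diff.

```shell
# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Create the kind cluster without a CNI, install Calico, then wait for readiness.
# Nodes stay NotReady until calico-node is running, so the waits come last.
bring_up_cluster() {
  local calico_url="https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml"
  run kind create cluster --config=.github/kind/e2e-calico.yaml || return 1
  run kubectl apply -f "$calico_url" || return 1
  run kubectl -n kube-system rollout status daemonset/calico-node --timeout=180s || return 1
  run kubectl wait nodes --all --for=condition=Ready --timeout=180s
}
```

Calling `DRY_RUN=1 bring_up_cluster` prints the four commands in order without touching any cluster; dropping `DRY_RUN` executes them against the local Docker daemon.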
diff --git a/.github/workflows/README.md b/.github/workflows/README.md
@@ -105,6 +105,28 @@ Then enable GitHub Pages for the repository pointing at the `gh-pages` branch. T
 
 ---
 
+### `e2e.yml` / `_e2e.yml` — Real-Cluster E2E
+
+**Triggers:**
+- Nightly schedule (release profile)
+- Manual dispatch with profile selection (smoke / release / full)
+- Pull requests with the `e2e` label (smoke profile)
+
+The reusable workflow (`_e2e.yml`) creates a kind cluster with Calico CNI, deploys the playground, and runs `playground-chaos run-all` with the selected profile. It always uploads JUnit results as an artifact; chaos forensics, setup logs, and kind logs are uploaded only when the job fails.
+
+Profiles:
+| Profile | Scenarios | Use case |
+|---|---|---|
+| `smoke` | 3 (~3-5 min) | PR label gate, fast feedback |
+| `release` | 10 (~20-30 min) | Release and nightly gate |
+| `full` | All registered | Full regression (manual only) |
+
+The release workflow (`.github/workflows/release.yml`) blocks Docker image builds and Helm chart publishing on the E2E release-profile gate. This ensures every tagged release is validated against real MySQL failover scenarios (WISHLIST #32).
+
+**Permissions:** `contents: read` (default)
+
+---
+
 ### `scan.yml` — Trivy Security Scan
 
 **Triggers:** Pull requests targeting `main`

diff --git a/.github/workflows/_e2e.yml b/.github/workflows/_e2e.yml
@@ -0,0 +1,125 @@
+# Reusable E2E workflow — creates a kind cluster, deploys the playground,
+# and runs playground-chaos with the selected profile.
+#
+# Called by:
+#   .github/workflows/e2e.yml (nightly, manual, PR label)
+#   .github/workflows/release.yml (release gate)
+name: E2E (reusable)
+
+on:
+  workflow_call:
+    inputs:
+      profile:
+        description: "Chaos profile (smoke|release|full)"
+        required: false
+        default: "release"
+        type: string
+      timeout-minutes:
+        description: "Job timeout in minutes"
+        required: false
+        default: 90
+        type: number
+
+permissions:
+  contents: read
+
+env:
+  BLOODRAVEN_SETUP_HELM_INSTALL_CRDS: "1"
+  SKIP_IMAGE_BUILD: "1"
+
+concurrency:
+  group: e2e-${{ github.workflow }}-${{ github.ref }}-${{ inputs.profile }}
+  cancel-in-progress: true
+
+jobs:
+  e2e:
+    name: Real-cluster E2E (${{ inputs.profile }})
+    runs-on: ubuntu-latest
+    timeout-minutes: ${{ inputs.timeout-minutes }}
+    steps:
+      - uses: actions/checkout@v6
+
+      - uses: actions/setup-go@v6
+        with:
+          go-version-file: go.mod
+          cache-dependency-path: go.sum
+
+      - name: Build playground-chaos
+        run: make build-playground-chaos
+
+      - name: Build Docker images
+        run: |
+          docker build --target bloodraven -t bloodraven:playground .
+          docker build --target sidecar -t bloodraven-sidecar:playground .
+          docker build -t bloodraven-counter:playground playground/counter-app
+          docker build -t bloodraven-dashboard:playground playground/dashboard
+          docker build -t bloodraven-dns-webhook:playground playground/dns-webhook
+
+      - name: Create kind cluster
+        uses: helm/kind-action@v1.12.0
+        with:
+          cluster_name: bloodraven-e2e
+          config: .github/kind/e2e-calico.yaml
+          # CNI is disabled in this kind config, so nodes cannot become
+          # Ready until Calico is installed in the next step.
+          wait: 0s
+
+      - name: Install Calico CNI
+        run: |
+          kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml
+          kubectl -n kube-system rollout status daemonset/calico-node --timeout=180s
+          kubectl wait nodes --all --for=condition=Ready --timeout=180s
+
+      - name: Load images into kind
+        run: |
+          kind load docker-image bloodraven:playground bloodraven-sidecar:playground bloodraven-counter:playground bloodraven-dashboard:playground bloodraven-dns-webhook:playground --name bloodraven-e2e
+
+      - name: Deploy playground
+        run: |
+          set -o pipefail
+          ./playground/setup.sh 2>&1 | tee playground/setup.log
+        timeout-minutes: 10
+
+      - name: Run E2E (${{ inputs.profile }} profile)
+        run: make test-e2e E2E_PROFILE=${{ inputs.profile }} E2E_JUNIT_OUT=playground/chaos-results/e2e-${{ inputs.profile }}-junit.xml
+        timeout-minutes: ${{ inputs.timeout-minutes }}
+
+      - name: Upload JUnit results
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: e2e-${{ inputs.profile }}-junit
+          path: playground/chaos-results/e2e-${{ inputs.profile }}-junit.xml
+          retention-days: 30
+
+      - name: Upload chaos forensics
+        if: failure()
+        uses: actions/upload-artifact@v4
+        with:
+          name: e2e-${{ inputs.profile }}-forensics
+          path: playground/chaos-results/
+          retention-days: 30
+
+      - name: Export kind logs
+        if: failure()
+        run: |
+          mkdir -p /tmp/kind-logs
+          kind export logs --name=bloodraven-e2e /tmp/kind-logs || true
+        continue-on-error: true
+
+      - name: Upload kind logs artifact
+        if: failure()
+        uses: actions/upload-artifact@v4
+        with:
+          name: e2e-${{ inputs.profile }}-kind-logs
+          path: /tmp/kind-logs/
+          retention-days: 14
+        continue-on-error: true
+
+      - name: Upload setup logs
+        if: failure()
+        uses: actions/upload-artifact@v4
+        with:
+          name: e2e-${{ inputs.profile }}-setup-logs
+          path: playground/setup.log
+          retention-days: 14
diff --git a/.github/workflows/e2e.yml b/.github/workflows/e2e.yml
@@ -0,0 +1,39 @@
+# E2E trigger workflow — nightly, manual, and PR-label-gated.
+# The reusable workflow is in .github/workflows/_e2e.yml.
+name: E2E
+
+on:
+  # Nightly release-profile run
+  schedule:
+    - cron: "0 5 * * *" # 05:00 UTC daily
+
+  # Manual dispatch with profile selection
+  workflow_dispatch:
+    inputs:
+      profile:
+        description: "Chaos profile (smoke|release|full)"
+        required: false
+        default: "release"
+        type: choice
+        options:
+          - smoke
+          - release
+          - full
+
+  # PR label gate: run smoke while the "e2e" label is present.
+  pull_request:
+    types: [opened, reopened, synchronize, labeled]
+
+permissions:
+  contents: read
+
+jobs:
+  # Skip PR-triggered runs unless the "e2e" label is present.
+  e2e:
+    if: >-
+      github.event_name == 'schedule' ||
+      github.event_name == 'workflow_dispatch' ||
+      (github.event_name == 'pull_request' && contains(github.event.pull_request.labels.*.name, 'e2e'))
+    uses: ./.github/workflows/_e2e.yml
+    with:
+      profile: ${{ github.event_name == 'pull_request' && 'smoke' || (github.event.inputs.profile || 'release') }}
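Reviewer note: the `profile:` expression above relies on the `&&` / `||` idiom as a ternary, which works here because `'smoke'` is a truthy string. The same decision can be written out as a plain shell function for illustration; `select_profile` is a hypothetical name and is not part of the PR.

```shell
# Mirrors the profile expression in e2e.yml:
#   pull_request              -> smoke (any dispatch input is ignored)
#   otherwise, input present  -> the requested profile
#   otherwise (e.g. schedule) -> release
select_profile() {
  local event_name="$1" input_profile="${2:-}"
  if [ "$event_name" = "pull_request" ]; then
    echo "smoke"
  elif [ -n "$input_profile" ]; then
    echo "$input_profile"
  else
    echo "release"
  fi
}
```

For example, `select_profile workflow_dispatch full` yields `full`, while `select_profile schedule` falls through to `release` because scheduled runs carry no inputs.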
diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
@@ -73,10 +73,21 @@ jobs:
         working-directory: docs
         run: npm run verify:llms
 
+  # E2E release gate — runs the release-profile real-cluster E2E before
+  # any publishing jobs. This ensures that every tagged release has been
+  # validated against real MySQL pods, PVCs, DNS, taints, failover, and
+  # network partition scenarios (WISHLIST #32).
+  e2e-gate:
+    name: E2E gate (release profile)
+    needs: ci-gate
+    uses: ./.github/workflows/_e2e.yml
+    with:
+      profile: release
+
   draft-release:
     name: Create Draft Release
     runs-on: ubuntu-latest
-    needs: ci-gate
+    needs: [ci-gate, e2e-gate]
     steps:
       - uses: actions/checkout@v6
         with:
@@ -116,7 +127,7 @@ jobs:
   docker:
     name: Build and Push Docker Images
     runs-on: ubuntu-latest
-    needs: [ci-gate, draft-release]
+    needs: [ci-gate, e2e-gate, draft-release]
     strategy:
       matrix:
         include:

diff --git a/AGENTS.md b/AGENTS.md
@@ -1,7 +1,7 @@
 # Repository Guidelines
 
 ## Project Structure & Module Organization
-Primary code lives in the root Go module. `cmd/bloodraven` is the Kubernetes operator entrypoint; `cmd/sidecar` is the per-MySQL sidecar; `cmd/kubectl-bloodraven` is the day-2 `kubectl` plugin (status / promote / reclone / backup / verify-backup, built via `make build-kubectl-plugin`). API types live in `api/v1alpha1`, controller logic in `internal/controller`, and supporting packages in `internal/mysql`, `internal/platform`, `internal/sidecar`, `internal/state`, and `internal/metrics`. End-to-end and scenario-style tests live in `test/e2e`. Treat `bitpoke/` and `orchestrator/` as bundled upstream references, not the default place for new feature work.
+Primary code lives in the root Go module. `cmd/bloodraven` is the Kubernetes operator entrypoint; `cmd/sidecar` is the per-MySQL sidecar; `cmd/kubectl-bloodraven` is the day-2 `kubectl` plugin (status / promote / reclone / backup / verify-backup, built via `make build-kubectl-plugin`). API types live in `api/v1alpha1`, controller logic in `internal/controller`, and supporting packages in `internal/mysql`, `internal/platform`, `internal/sidecar`, `internal/state`, and `internal/metrics`. Real-cluster scenario tests live under `internal/playground/scenarios` and run through `cmd/playground-chaos`; faster cross-component tests live under `test/component`, with API-server/envtest coverage under `test/envtest`. Treat `bitpoke/` and `orchestrator/` as bundled upstream references, not the default place for new feature work.
 
 ## Build, Test, and Development Commands
 Run commands from the repository root:
@@ -10,6 +10,8 @@ Run commands from the repository root:
 - `go build ./cmd/sidecar` builds the sidecar binary.
 - `make build-kubectl-plugin` builds `bin/kubectl-bloodraven` (the day-2 `kubectl` plugin). Override `KUBECTL_PLUGIN_VERSION=<tag>` to stamp a release; `make install-kubectl-plugin` drops the binary onto `$PATH`.
 - `make test` runs `go test ./...` across unit and e2e-style packages.
+- `make test-e2e` runs the release profile of real-cluster E2E tests against the current playground cluster (requires a kind/k3d/minikube context prepared with `./playground/setup.sh`; CI creates a kind cluster and runs setup first).
+- `make test-e2e-smoke` runs the smoke profile (3 scenarios, fast feedback).
 - `make vet` runs `go vet ./...`.
 - `make lint` runs `golangci-lint run ./...`. `golangci-lint` is not vendored; install it with `go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest` (it lands in `$(go env GOPATH)/bin`). CI installs the same tool with the same command in `.github/workflows/ci.yml`, so local and CI output match when you run this.
 - `make generate` refreshes API deep-copy code in `api/v1alpha1`.
@@ -26,7 +28,7 @@ Use standard Go formatting: run `gofmt` on changed files and keep imports organi
 Structured-log `msg` strings and field names listed in `docs/docs/log-schema.mdx` are a public stability contract — downstream log pipelines filter on them. When you touch a log call site whose `msg` appears in that doc's Event reference, either preserve the `msg` string and the documented field set exactly, or update `docs/docs/log-schema.mdx` in the same PR and call out the break in the PR description. The same applies to field naming: log keys are `camelCase` (per the contract), not `snake_case`.
 
 ## Testing Guidelines
-Add table-driven unit tests beside the code they cover, using the existing `*_test.go` layout under `internal/`. Put cross-component behavior tests in `test/e2e`. Some tests create local HTTP listeners with `httptest`, so restricted sandboxes may fail even when local developer runs pass.
+Add table-driven unit tests beside the code they cover, using the existing `*_test.go` layout under `internal/`. Put cross-component behavior tests in `test/component`, API-server/controller-runtime tests in `test/envtest`, and real-cluster playground scenarios in `internal/playground/scenarios` through `cmd/playground-chaos`. Some tests create local HTTP listeners with `httptest`, so restricted sandboxes may fail even when local developer runs pass.
 
 ### Pre-PR gate (required, do not skip)
 Before pushing a branch that opens or updates a PR, run all of the following from the repo root and fix anything they report. Do **not** push expecting CI to find problems you could have caught locally — CI failures on lint or generate drift are round-trip latency and reviewer noise.
@@ -89,7 +91,7 @@ Lessons from running chaos scenarios against a live k3d cluster:
 `./playground/rebuild.sh operator` builds, imports to k3d, and restarts the operator deployment. For sidecar changes, use `./playground/rebuild.sh sidecar` (restarts MySQL pods). Both can be combined: `./playground/rebuild.sh operator sidecar`.
 
 ### Automated chaos runner
-A subset of `playground/chaos-scenarios.md` is automated by `cmd/playground-chaos` and exposed as Make targets: `make chaos-list`, `make chaos-check`, `make chaos-run SCENARIO=<id>`, `make chaos-run-all`. The runner refuses to mutate any kubectl context outside the `_guard.sh` allowlist; on assertion failure it captures cluster YAML + pods + events + operator/sidecar logs + raw `/metrics` under `playground/chaos-results/<timestamp>/<scenario-id>/` for triage. Use `--no-cleanup` to keep injected state in place for forensics.
+A subset of `playground/chaos-scenarios.md` is automated by `cmd/playground-chaos` and exposed as Make targets: `make chaos-list`, `make chaos-check`, `make chaos-run SCENARIO=<id>`, `make chaos-run-all`, `make chaos-run-all-profile PROFILE=smoke|release|full`. The runner supports three E2E profiles (`--profile=smoke|release|full`) that filter which scenarios run. The runner refuses to mutate any kubectl context outside the `_guard.sh` allowlist; on assertion failure it captures cluster YAML + pods + events + operator/sidecar logs + raw `/metrics` under `playground/chaos-results/<timestamp>/<scenario-id>/` for triage. Use `--no-cleanup` to keep injected state in place for forensics.
 
 The runner stamps an in-progress marker on the MFG (`chaos.playground.bloodraven.io/in-progress`) after Precheck and clears it on cleanup. A subsequent run that finds a leftover marker refuses to start with a specific reason (live owner / abandoned / different host). Override with `--force` (delete the marker before preflight) or `--auto-reset` (on Precheck failure, shell out to `reset-mysql.sh + setup.sh` and retry once; 3s pause unless `CI=1`). `chaos-check` runs the same structural baseline scenarios use — stuck scale-to-0 deployments, bogus `lastFailoverTarget`, anti-flap cooldown still ticking, `NoPrimary` (both-sites-read-only), replication off on a non-active candidate — each with the exact remediation command in the error.
 

diff --git a/Makefile b/Makefile
@@ -1,6 +1,6 @@
 CONTROLLER_GEN ?= go run sigs.k8s.io/controller-tools/cmd/controller-gen
 
-.PHONY: help generate manifests build build-bloodraven build-sidecar build-playground-chaos build-kubectl-plugin install-kubectl-plugin test test-unit test-component test-envtest test-e2e test-integration fmt vet lint docker-build chaos-list chaos-check chaos-run chaos-run-all
+.PHONY: help generate manifests build build-bloodraven build-sidecar build-playground-chaos build-kubectl-plugin install-kubectl-plugin test test-unit test-component test-envtest test-e2e test-e2e-smoke test-integration fmt vet lint docker-build chaos-list chaos-check chaos-run chaos-run-all chaos-run-all-profile
 
 ##@ General
 
@@ -90,10 +90,15 @@ test-component: ## Run component tests (cross-package with fakes, no real cluste
 test-envtest: ## Run envtest controller tests (real API server, no cluster)
 	go test -race -tags envtest ./test/envtest/
 
-test-e2e: ## Run real cluster end-to-end tests (requires kind/k3d — Phase 4, not yet implemented)
-	@echo "Real cluster e2e tests are not yet implemented (Testing 2.0 Phase 4)."
-	@echo "See TESTING_2.0.md for the planned scenarios."
-	@exit 1
+E2E_PROFILE ?= release
+E2E_JUNIT_OUT ?= playground/chaos-results/e2e-$(E2E_PROFILE)-junit.xml
+E2E_ARGS ?=
+
+test-e2e: build-playground-chaos ## Run real-cluster E2E tests (E2E_PROFILE=release|smoke|full; requires kind/k3d)
+	./bin/playground-chaos run-all --profile=$(E2E_PROFILE) --auto-reset --continue-on-failure --junit-out=$(E2E_JUNIT_OUT) $(E2E_ARGS)
+
+test-e2e-smoke: build-playground-chaos ## Run real-cluster E2E smoke (smoke profile — requires kind/k3d)
+	$(MAKE) test-e2e E2E_PROFILE=smoke E2E_JUNIT_OUT=playground/chaos-results/e2e-smoke-junit.xml
 
 test-integration: ## Run integration tests (network listener tests)
 	go test -tags integration -race ./internal/platform/ ./test/component/
@@ -123,3 +128,7 @@ chaos-run: build-playground-chaos ## Run a single scenario (SCENARIO=<id>)
 
 chaos-run-all: build-playground-chaos ## Run every registered chaos scenario in order
 	./bin/playground-chaos run-all
+
+chaos-run-all-profile: build-playground-chaos ## Run chaos scenarios filtered by profile (PROFILE=smoke|release|full)
+	@if [ -z "$(PROFILE)" ]; then echo "usage: make chaos-run-all-profile PROFILE=smoke"; exit 2; fi
+	./bin/playground-chaos run-all --profile=$(PROFILE)
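Reviewer note: the `chaos-run-all-profile` guard only rejects an empty `PROFILE`; any other value is passed through to `playground-chaos`. If stricter checking is wanted on the caller side, a wrapper along these lines could validate the value first. This is a sketch under that assumption: `validate_profile` and `run_profile` are hypothetical names, and how `playground-chaos` itself handles an unknown profile is not shown in this diff.

```shell
# Accept only the three documented profiles; return 2 otherwise,
# matching the exit code of the Makefile's empty-PROFILE guard.
validate_profile() {
  case "${1:-}" in
    smoke|release|full)
      return 0
      ;;
    "")
      echo "usage: make chaos-run-all-profile PROFILE=smoke" >&2
      return 2
      ;;
    *)
      echo "unknown profile: $1 (expected smoke|release|full)" >&2
      return 2
      ;;
  esac
}

# Validate, then delegate to the Make target added in this PR.
run_profile() {
  validate_profile "${1:-}" || return $?
  make chaos-run-all-profile PROFILE="$1"
}
```

`run_profile smoke` would invoke the Make target, while `run_profile bogus` stops before `make` is called.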
diff --git a/WISHLIST.md b/WISHLIST.md
@@ -5,13 +5,16 @@
 - [ ] 7. Cross-region/cross-cluster DR as a first-class feature
 - [ ] 27. Backup/restore performance guide
 - [ ] 30. Public repo, license, release cadence
-- [ ] 32. Real-cluster E2E CI gate
+- [x] 32. Real-cluster E2E CI gate
 - [ ] 41. Safe Secret watch narrowing design
 - [ ] 42. Namespace-scoped watch/cache mode evaluation
+- [ ] 43. Dedicated backup/PITR real-cluster E2E scenarios
 
 ## P0 — Production adoption blockers
 
-**32. Real-cluster E2E CI gate.** Unit/component/envtest coverage is not enough for a MySQL failover operator. Add an optional-but-required-before-release k3d/kind CI job that installs the chart and exercises real MySQL pods, PVCs, Services, DNS/DNSEndpoint behavior, taints, planned failover, emergency failover, operator restart, PVC loss, NetworkPolicy partition, backup restore, and PITR verification. This should run at least on release tags and nightly; if cost is acceptable, run a reduced smoke subset on PRs.
+**32. Real-cluster E2E CI gate.** Done: `make test-e2e` runs the release profile of `playground-chaos run-all` against a real cluster instead of the former placeholder. `make test-e2e-smoke` runs a fast smoke subset (3 scenarios). Three profiles (`smoke`/`release`/`full`) filter scenarios via `--profile` on `playground-chaos run-all` and `make chaos-run-all-profile PROFILE=`. CI uses a reusable workflow (`_e2e.yml`) that creates a kind cluster with Calico CNI, deploys the playground, and runs the selected profile. Nightly runs use the release profile; manual dispatch defaults to the release profile and can select any of the three; PRs with the `e2e` label trigger a smoke run. Release publishing blocks on the E2E release-profile gate. JUnit results are always uploaded as an artifact; forensics, setup logs, and kind logs are uploaded when a run fails. Dedicated MySQL backup restore and PITR verification scenarios are split out as follow-up #43 so the gate can start enforcing the existing real-cluster chaos suite now without misrepresenting that coverage.
+
+**43. Dedicated backup/PITR real-cluster E2E scenarios.** Follow-up to #32: add release-profile playground-chaos scenarios that configure the playground backup profile against RustFS, trigger a real `MysqlBackup`, verify restore via `MysqlBackupVerification`, then enable PITR/binlog archival and verify a point-in-time replay with deterministic marker rows. The #32 gate now exists and is release-blocking, but this backup/PITR coverage should be added before claiming the E2E release profile exercises every backup/restore path.
 
 ## P1 — DR and operational completeness