Add real-cluster E2E release gate by colinmollenhour · Pull Request #83 · ShipStream/bloodraven

colinmollenhour · 2026-05-21T04:07:36Z

AI Megamind - By: Pi / GPT-5.5 orchestration with Claude Opus and GLM 5.1 review agents

Summary

Replaces the placeholder make test-e2e with profile-driven real-cluster playground chaos runs (smoke, release, full).
Adds a reusable kind+Calico GitHub Actions E2E workflow, nightly/manual/PR-label trigger workflow, and release-gate dependency before publishing.
Updates playground setup for CI-friendly prebuilt images and Helm fresh-install CRD ownership, plus JUnit directory creation and docs/examples cleanup around CRD ownership.
Marks WISHLIST #32 done for the gate itself and splits dedicated backup/PITR real-cluster scenarios into follow-up #43.

Test Plan

go test ./internal/playground/runner — PASS
make build-playground-chaos — PASS
make vet — PASS
npm run build --prefix docs — PASS
npm run verify:llms --prefix docs — PASS
make test-unit — PASS
make test-component — PASS
git diff --check — PASS
make lint — SKIPPED locally (golangci-lint not installed)

Real kind/k3d E2E was not run locally in this harness; the added workflow creates a fresh kind+Calico cluster and runs the selected profile.

Megamind Artifacts

Plan: .tmp/megamind-wishlist-32/plans/final.md
Validated review findings: .tmp/megamind-wishlist-32/reviews/validated-findings.md
Fixed review: .tmp/megamind-wishlist-32/reviews/fixed-review.md
Local gates: .tmp/megamind-wishlist-32/final/local-gates.md

Megamind Educational Appendix

Journey

The request started as WISHLIST #32: add a real-cluster E2E CI gate for a MySQL failover operator, required before releases, nightly, and optionally smoke-tested on PRs; the original wording included real pods/PVCs/Services, DNS/DNSEndpoint, taints, planned/emergency failover, operator restart, PVC loss, NetworkPolicy partition, backup restore, and PITR verification (.tmp/megamind-wishlist-32/briefs/request.md, .tmp/megamind-wishlist-32/briefs/context.md).
Multi-model critique identified the key forks before coding: reuse the existing playground-chaos runner instead of inventing a new E2E framework; avoid false NetworkPolicy coverage on CNIs that do not enforce policies; make release blocking explicit; define PR smoke scope; upload forensics; and do not pretend backup/PITR coverage exists without scenarios (.tmp/megamind-wishlist-32/critiques/mbot-critique.md).
The plan was narrowed from the broader second draft to an incremental production gate: wire the existing playground runner into Make/CI now, add smoke/release/full profiles, and split dedicated backup/PITR scenarios into a later follow-up rather than blocking the gate on a new framework or new backup suite (.tmp/megamind-wishlist-32/plans/second-draft.md, .tmp/megamind-wishlist-32/plans/final.md).
Implementation delivered the gate surface: make test-e2e, make test-e2e-smoke, playground-chaos run-all --profile, profile selection tests, kind+Calico reusable workflow, nightly/manual/PR-label triggers, release workflow dependency, CI-friendly playground setup, JUnit directory creation, and docs/examples cleanup for Helm CRD ownership (.tmp/megamind-wishlist-32/agents/e2e-gate-coder-final.md, .tmp/megamind-wishlist-32/final/diff-stat.txt).
Review fixes addressed workflow and CI sharp edges: kind creation now waits 0s before Calico, Calico is mandatory, node readiness uses kubectl wait nodes, JUnit parent directories are created, PR E2E reruns on synchronize while labeled, workflow concurrency was added, and live-registry tests guard profile drift (.tmp/megamind-wishlist-32/reviews/validated-findings.md, .tmp/megamind-wishlist-32/fixes/review-fixes-final.md).
Local verification passed runner tests, chaos binary build, vet, docs build, llms verification, unit tests, component tests, and diff check; make lint was skipped because golangci-lint was not installed, and real kind/k3d E2E was not run inside this harness (.tmp/megamind-wishlist-32/final/local-gates.md, .tmp/megamind-wishlist-32/final/delivery.md).

Design Decisions

Decision	Why	Alternative rejected	Evidence
Reuse `cmd/playground-chaos` as the E2E harness.	Existing runner already had scenario registry, guards, reset/forensics, JUnit support, and real playground semantics.	New `test/e2e` Go/Ginkgo/e2e-framework suite.	Critique C1/H8 in `.tmp/megamind-wishlist-32/critiques/mbot-critique.md`; final plan scope; `cmd/playground-chaos/main.go`; `Makefile`.
Add `smoke`, `release`, and `full` profiles.	Gives bounded PR/manual feedback, a curated release/nightly gate, and preserves old all-scenarios behavior.	One all-or-nothing `run-all`, or an undefined PR smoke subset.	`internal/playground/runner/profile.go`; `internal/playground/runner/profile_test.go`; `internal/playground/runner/profile_registry_test.go`; `Makefile`.
Keep CLI default `run-all` profile as `full`, but make `make test-e2e` default to `release`.	Avoids surprising existing `playground-chaos run-all` users while making the canonical E2E gate the curated release profile.	Change `run-all` default semantics to release.	`DefaultProfile = ProfileFull` in `internal/playground/runner/profile.go`; `E2E_PROFILE ?= release` in `Makefile`.
Use kind with Calico for CI.	NetworkPolicy/partition tests need a CNI that enforces NetworkPolicy; default kindnet was called out as unsafe for this claim.	Default kind networking or k3d/flannel without enforcement.	`.tmp/megamind-wishlist-32/critiques/mbot-critique.md`; `.github/kind/e2e-calico.yaml`; `.github/workflows/_e2e.yml`.
Put release blocking inside `release.yml`.	A separate tag workflow could run in parallel and not block publishing; release jobs need an explicit dependency.	Non-blocking nightly policy or independent workflow only.	`.github/workflows/release.yml`; critique release-enforcement findings.
Use a reusable `_e2e.yml` workflow plus trigger workflow.	Keeps nightly/manual/PR-label/release entrypoints on one implementation path.	Duplicating kind/setup/run/artifact steps in every workflow.	`.github/workflows/_e2e.yml`; `.github/workflows/e2e.yml`; `.github/workflows/release.yml`; `.github/workflows/README.md`.
Let Helm own first-install chart CRDs by using the chart `crds/` directory, not an `installCRDs` value.	Helm does not template `crds/` based on values; the previous value was misleading. CI setup skips manual Bloodraven CRD apply so fresh Helm install exercises chart CRDs.	Keep or document `installCRDs=true/false`.	`playground/setup.sh`; removed value in `charts/bloodraven/values.yaml`; docs/example edits; `.tmp/megamind-wishlist-32/reviews/validated-findings.md`.
Split dedicated backup/PITR E2E scenarios into WISHLIST #43.	Critique found no existing automated backup/PITR scenarios; the delivered gate should not misrepresent coverage.	Block #32 until new backup/PITR scenarios exist, or claim release profile covers them.	`.tmp/megamind-wishlist-32/reviews/validated-findings.md`; `WISHLIST.md`; `.tmp/megamind-wishlist-32/final/pr-body.md`.
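
The split between the CLI default (`full`) and the Make default (`release`) can be sketched as follows. This assumes the standard library `flag` package; the real `cmd/playground-chaos/main.go` may parse flags differently, and `parseProfile` is a hypothetical name.

```go
package main

import (
	"flag"
	"fmt"
)

// parseProfile is a hypothetical stand-in for run-all flag parsing.
// The flag default is "full", so a bare run-all keeps its old behavior.
func parseProfile(args []string) (string, error) {
	fs := flag.NewFlagSet("run-all", flag.ContinueOnError)
	profile := fs.String("profile", "full", "scenario profile: smoke, release, or full")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *profile, nil
}

func main() {
	// Bare invocation: existing users still get every scenario.
	bare, _ := parseProfile(nil)
	// What a Make target would pass when E2E_PROFILE is left at release.
	gated, _ := parseProfile([]string{"--profile=release"})
	fmt.Println(bare, gated)
}
```

Keeping the curated default in the Makefile rather than in the binary means only callers that opt into the gate see the narrower scenario set.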

Architecture

flowchart TD
  subgraph Local[Local / Make entrypoints]
    A[make test-e2e\nE2E_PROFILE defaults release] --> B[bin/playground-chaos run-all]
    A2[make test-e2e-smoke] --> B
    A3[make chaos-run-all-profile PROFILE=...] --> B
  end

  subgraph Runner[playground-chaos]
    B --> C[--profile validation\nsmoke | release | full]
    C --> D[runner.SelectForProfile]
    D --> E[Executor runs selected scenarios]
    E --> F[JUnit XML]
    E --> G[chaos-results forensics on failure]
  end

  subgraph CI[GitHub Actions]
    H[e2e.yml\nschedule/manual/PR label] --> I[_e2e.yml reusable]
    R[release.yml e2e-gate] --> I
    I --> J[kind bloodraven-e2e\nCalico CNI]
    J --> K[build + load playground images]
    K --> L[playground/setup.sh\nSKIP_IMAGE_BUILD=1\nBLOODRAVEN_SETUP_HELM_INSTALL_CRDS=1]
    L --> A
    F --> M[upload JUnit]
    G --> N[upload forensics/kind/setup logs]
  end

  R --> O[draft/docker publishing require e2e-gate]

Key module boundaries:

internal/playground/runner/profile.go is intentionally small and data-driven: profile constants, allowlists, validation, and SelectForProfile(all, profile) filtering. It does not know Kubernetes or scenario internals.
cmd/playground-chaos/main.go owns CLI parsing and wires --profile only into run-all; single-scenario run remains unchanged.
Makefile is the local contract: test-e2e builds the runner and runs run-all --profile=$(E2E_PROFILE) --auto-reset --continue-on-failure --junit-out=...; test-e2e-smoke is a convenience wrapper.
.github/workflows/_e2e.yml is the CI implementation: build runner/images, create kind with Calico, install Calico, load images, deploy playground, run the requested profile, then upload JUnit and failure artifacts.
playground/setup.sh gained CI-safe behavior by honoring prebuilt/preloaded images (SKIP_IMAGE_BUILD) and skipping manual Bloodraven CRD application when Helm should install chart CRDs on a fresh cluster (BLOODRAVEN_SETUP_HELM_INSTALL_CRDS).
Documentation and examples were updated to remove the misleading installCRDs value and explain that Helm installs chart CRDs on first install while upgrades need explicit CRD review/application.
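
A minimal sketch of the data-driven profile selection described for `profile.go`. The profile constants and `DefaultProfile = ProfileFull` follow the PR text; the `[]string` scenario type, the `allowlists` map, `ValidateProfile`, the exact `SelectForProfile` signature, and the scenario names are assumptions made for illustration.

```go
package main

import "fmt"

// Profile names a curated subset of scenarios.
type Profile string

const (
	ProfileSmoke   Profile = "smoke"
	ProfileRelease Profile = "release"
	ProfileFull    Profile = "full"
)

// DefaultProfile matches the CLI default stated in the PR.
const DefaultProfile = ProfileFull

// allowlists uses hypothetical scenario names; full has no allowlist
// because it selects everything.
var allowlists = map[Profile][]string{
	ProfileSmoke:   {"planned-failover"},
	ProfileRelease: {"planned-failover", "emergency-failover", "operator-restart"},
}

// ValidateProfile rejects anything other than the three known profiles.
func ValidateProfile(p Profile) error {
	switch p {
	case ProfileSmoke, ProfileRelease, ProfileFull:
		return nil
	}
	return fmt.Errorf("unknown profile %q (want smoke, release, or full)", p)
}

// SelectForProfile filters all down to the profile's allowlist,
// preserving the registry order of all.
func SelectForProfile(all []string, p Profile) ([]string, error) {
	if err := ValidateProfile(p); err != nil {
		return nil, err
	}
	if p == ProfileFull {
		return append([]string(nil), all...), nil
	}
	allowed := make(map[string]bool, len(allowlists[p]))
	for _, name := range allowlists[p] {
		allowed[name] = true
	}
	var out []string
	for _, name := range all {
		if allowed[name] {
			out = append(out, name)
		}
	}
	return out, nil
}

func main() {
	all := []string{"planned-failover", "emergency-failover", "operator-restart", "pvc-loss"}
	for _, p := range []Profile{ProfileSmoke, ProfileRelease, ProfileFull} {
		selected, _ := SelectForProfile(all, p)
		fmt.Printf("%s: %v\n", p, selected)
	}
}
```

Because the selector only compares names, it needs no knowledge of Kubernetes or of what each scenario does, which is the boundary the list above describes.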

Lessons

For real-cluster gates, prefer reusing a battle-tested chaos/playground runner over creating a parallel E2E framework unless the existing runner cannot express the assertion. This preserves forensics, cleanup, guardrails, and scenario inventory.
Make test subsets explicit. A PR “smoke” gate needs named profiles or allowlists, not prose like “reduced subset,” and those allowlists need tests against the live scenario registry to prevent silent drift.
Do not claim NetworkPolicy coverage without controlling the CNI. The kind+Calico choice came directly from critique of default CNI false positives.
Release gating must be a dependency in the release workflow, not just a separate scheduled/tag workflow.
Long-running CI needs artifacts by design: JUnit for check summaries, chaos forensics for scenario failures, setup logs for deploy failures, and kind logs for infrastructure failures.
Helm CRD semantics are easy to misdocument: files in crds/ install on first install and are not templated by values or automatically upgraded. Avoid fake installCRDs toggles unless the chart actually implements them.
When scope contains missing scenario classes, record the limitation as a follow-up rather than hiding it. Here, backup/PITR became WISHLIST #43 so #32 can be truthful and still useful.
Keep repository guidance aligned with the actual test topology. This change updated AGENTS.md so future agents route unit/component/envtest/real-cluster scenario work to the right directories and runner.
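
The drift check from the second lesson might look like the sketch below. It assumes scenarios are identified by name; `missingFromRegistry` and the scenario names are hypothetical and do not reproduce `profile_registry_test.go`.

```go
package main

import "fmt"

// missingFromRegistry returns every allowlisted name that is not
// present in the registered scenario list.
func missingFromRegistry(allowlist, registered []string) []string {
	known := make(map[string]struct{}, len(registered))
	for _, name := range registered {
		known[name] = struct{}{}
	}
	var missing []string
	for _, name := range allowlist {
		if _, ok := known[name]; !ok {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Hypothetical data: one allowlisted scenario was renamed in the registry.
	registered := []string{"planned-failover", "operator-restart"}
	smoke := []string{"planned-failover", "renamed-scenario"}

	if missing := missingFromRegistry(smoke, registered); len(missing) > 0 {
		fmt.Println("allowlist drift:", missing)
	}
}
```

A test that fails whenever this list is non-empty turns a renamed or deleted scenario into a build failure rather than a silently smaller gate.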

Evidence

Claim	Source
Original objective required a real-cluster E2E gate and listed backup/PITR among desired behaviors.	`.tmp/megamind-wishlist-32/briefs/request.md`; `.tmp/megamind-wishlist-32/briefs/context.md`.
Critique drove harness reuse, CNI/NetworkPolicy caution, release blocking, artifact upload, and backup/PITR scope correction.	`.tmp/megamind-wishlist-32/critiques/mbot-critique.md`.
Final plan chose existing playground runner, profile-driven gate, CI workflows, release blocking, docs, and WISHLIST update.	`.tmp/megamind-wishlist-32/plans/final.md`.
Profile implementation contains smoke/release/full constants, allowlists, validation, and selector.	`internal/playground/runner/profile.go`.
Profile behavior is unit-tested and checked against registered scenarios.	`internal/playground/runner/profile_test.go`; `internal/playground/runner/profile_registry_test.go`.
CLI exposes `--profile` for `run-all`, validates values, filters scenarios, and still writes JUnit.	`cmd/playground-chaos/main.go`.
Make targets replaced the old placeholder with runnable E2E entrypoints.	`Makefile`; diff against `main` shows removal of the `TESTING_2.0.md` placeholder.
JUnit output now creates parent directories before writing.	`internal/playground/runner/junit.go`; validated finding #5.
CI uses kind+Calico and builds/loads local playground images before setup.	`.github/kind/e2e-calico.yaml`; `.github/workflows/_e2e.yml`.
Nightly/manual/PR-label triggers use the reusable E2E workflow.	`.github/workflows/e2e.yml`.
Release publishing is blocked on `e2e-gate` before draft/docker, with Helm/publish transitively dependent.	`.github/workflows/release.yml`.
Setup can skip image builds and manual Bloodraven CRD application for CI fresh Helm installs.	`playground/setup.sh`; `.github/workflows/_e2e.yml` env.
`installCRDs` value and docs/examples references were removed because Helm CRDs are not value-controlled.	`charts/bloodraven/values.yaml`; `docs/docs/install-production.mdx`; `docs/docs/gitops.mdx`; `docs/docs/production-install-examples.mdx`; `examples/argocd-application.yaml`; `examples/production-values.yaml`; `.tmp/megamind-wishlist-32/reviews/fixed-review.md`.
Backup/PITR dedicated scenario work is explicitly a follow-up, not claimed as done.	`WISHLIST.md` item #43; `.tmp/megamind-wishlist-32/reviews/validated-findings.md`; `.tmp/megamind-wishlist-32/final/pr-body.md`.
Local gates passed except lint skipped and real cluster E2E not run locally.	`.tmp/megamind-wishlist-32/final/local-gates.md`; `.tmp/megamind-wishlist-32/agents/e2e-gate-coder-final.md`.
PR and branch delivery metadata.	`.tmp/megamind-wishlist-32/final/delivery.md`; `.tmp/megamind-wishlist-32/final/diff-stat.txt`.

Copilot

Pull request overview

This PR introduces a real-cluster E2E “release gate” by making make test-e2e run the existing playground chaos suite against an actual Kubernetes cluster, and wiring that into GitHub Actions so release publishing is blocked on the E2E run. It also updates playground setup and docs/examples to align with Helm CRD ownership behavior and CI execution.

Changes:

Replace the placeholder make test-e2e with profile-driven playground-chaos run-all execution (smoke/release/full) and add Make targets for smoke/profile runs.
Add reusable kind+Calico E2E workflow and a trigger workflow (nightly/manual/PR label), and require the E2E gate in the release workflow before publishing.
Adjust playground setup for CI (skip image build) and CRD installation behavior; update docs/examples to remove the now-obsolete installCRDs value guidance.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.

File	Description
`WISHLIST.md`	Marks #32 complete and adds follow-up #43 for backup/PITR-specific E2E scenarios.
`playground/setup.sh`	Adds CI-friendly image-build skip and a toggle for relying on Helm-installed CRDs vs manual CRD apply.
`Makefile`	Implements real `test-e2e`/`test-e2e-smoke` targets and adds a profile-filtered chaos runner target.
`internal/playground/runner/profile.go`	Introduces E2E profiles and scenario selection filtering.
`internal/playground/runner/profile_test.go`	Adds unit tests for profile validation and selection behavior.
`internal/playground/runner/profile_registry_test.go`	Verifies profiles select the intended subset from the registered scenarios set.
`internal/playground/runner/junit.go`	Ensures the JUnit output directory exists before writing the report.
`examples/production-values.yaml`	Removes `installCRDs` example since CRD ownership guidance changed.
`examples/argocd-application.yaml`	Removes `installCRDs` override from Argo CD Helm values example.
`docs/docs/production-install-examples.mdx`	Updates CRD ownership/install guidance to reflect Helm CRD behavior.
`docs/docs/install-production.mdx`	Removes `--set installCRDs=true/false` guidance; clarifies Helm CRD upgrade limitations.
`docs/docs/gitops.mdx`	Simplifies CRD ownership table and clarifies install/upgrade sequencing.
`cmd/playground-chaos/main.go`	Adds `--profile` flag and applies profile filtering in `run-all`.
`charts/bloodraven/values.yaml`	Removes the `installCRDs` value from chart values.
`AGENTS.md`	Documents new E2E Make targets and profile-based chaos runs.
`.github/workflows/release.yml`	Adds an E2E release-profile gate job and makes publishing jobs depend on it.
`.github/workflows/README.md`	Documents the new E2E workflows and the profile matrix.
`.github/workflows/e2e.yml`	Adds the trigger workflow for nightly/manual/PR-label E2E runs.
`.github/workflows/_e2e.yml`	Adds the reusable kind+Calico cluster workflow that deploys the playground and runs E2E.
`.github/kind/e2e-calico.yaml`	Adds a kind config that disables default CNI and supports Calico-enforced NetworkPolicy testing.

colinmollenhour · 2026-05-21T05:31:06Z

AI Review Response · Commit: be89c05 · By: AI agent · Summary

Addressed both unresolved Copilot review threads and resolved them:

playground/setup.sh: added an explicit guard for BLOODRAVEN_SETUP_HELM_INSTALL_CRDS=1 when an existing bloodraven Helm release is present. The script now fails clearly instead of implying Helm will upgrade/repair CRDs on helm upgrade.
.github/workflows/e2e.yml: renamed the trigger workflow job from smoke to e2e so scheduled/manual release/full profile runs are not mislabeled in the Actions UI.

Verification run after the fixes:

git diff --check — PASS
make build-playground-chaos — PASS
make vet — PASS

Skipped threads: none.
Follow-up questions: none.
PR state: fixes pushed in be89c05; review threads resolved; CI is running for the new commit.

colinmollenhour · 2026-05-21T05:34:12Z

AI Review Response · Commit: be89c05 · By: AI agent · Summary

Final update after CI:

Fixed and resolved both review threads.
Pushed commit be89c05 with the setup.sh Helm-CRD rerun guard and the E2E workflow job rename.
Verification before push: git diff --check, make build-playground-chaos, and make vet all passed.
GitHub CI is now green: CI / All Checks passed, Trivy passed. The E2E PR-label job is skipped as expected unless the PR has the e2e label.
Skipped threads: none.
Open follow-up questions: none.

Copilot AI review requested due to automatic review settings May 21, 2026 04:07

Copilot started reviewing on behalf of colinmollenhour May 21, 2026 04:07

Copilot AI reviewed May 21, 2026

Comment thread playground/setup.sh

Comment thread .github/workflows/e2e.yml

colinmollenhour force-pushed the one-large-step-megamind branch from 0a78f67 to 387720e on May 21, 2026 04:10

colinmollenhour added 2 commits May 21, 2026 04:17

Add real-cluster E2E release gate

06b0a34

Trigger CI for E2E gate PR

fb1c1f7

colinmollenhour force-pushed the one-large-step-megamind branch from 6875b36 to fb1c1f7 on May 21, 2026 04:18

Address E2E review comments

be89c05

colinmollenhour added the e2e label (Run real-cluster E2E smoke workflow) May 21, 2026

colinmollenhour added 14 commits May 21, 2026 06:06

Fix kind worker selection for E2E

9759fa7

Harden playground replication user setup

a05be76

Use non-default RustFS playground credentials

36a8aa3

Use TCP for playground MySQL setup

7c50f96

Let playground operator tolerate DB taints

4a25027

Tolerate readonly NoExecute taints in playground

61b9a56

Use MySQL LTS for playground E2E

156a17a

Install clone plugin during MySQL init

71326fe

Install clone plugin during playground setup

c762471

Tolerate absent clone donor allowlist

266fdaf

Skip removed clone DDL timeout

2f44943

Load generated MySQL config as cnf

b1a0519

Ensure clone plugin on donors

8a8da3b

Pass critical MySQL settings as args

6955eec

colinmollenhour wants to merge 17 commits into main from one-large-step-megamind
