Add real-cluster E2E release gate#83

Open
colinmollenhour wants to merge 17 commits into main from one-large-step-megamind

Conversation

@colinmollenhour
Collaborator

@colinmollenhour colinmollenhour commented May 21, 2026

AI Megamind - By: Pi / GPT-5.5 orchestration with Claude Opus and GLM 5.1 review agents

Summary

  • Replaces the placeholder make test-e2e with profile-driven real-cluster playground chaos runs (smoke, release, full).
  • Adds a reusable kind+Calico GitHub Actions E2E workflow, nightly/manual/PR-label trigger workflow, and release-gate dependency before publishing.
  • Updates playground setup for CI-friendly prebuilt images and Helm fresh-install CRD ownership, plus JUnit directory creation and docs/examples cleanup around CRD ownership.
  • Marks WISHLIST #32 done for the gate itself and splits dedicated backup/PITR real-cluster scenarios into follow-up #43.

Test Plan

  • go test ./internal/playground/runner — PASS
  • make build-playground-chaos — PASS
  • make vet — PASS
  • npm run build --prefix docs — PASS
  • npm run verify:llms --prefix docs — PASS
  • make test-unit — PASS
  • make test-component — PASS
  • git diff --check — PASS
  • make lint — SKIPPED locally (golangci-lint not installed)

Real kind/k3d E2E was not run locally in this harness; the added workflow creates a fresh kind+Calico cluster and runs the selected profile.

Megamind Artifacts

  • Plan: .tmp/megamind-wishlist-32/plans/final.md
  • Validated review findings: .tmp/megamind-wishlist-32/reviews/validated-findings.md
  • Fixed review: .tmp/megamind-wishlist-32/reviews/fixed-review.md
  • Local gates: .tmp/megamind-wishlist-32/final/local-gates.md

Megamind Educational Appendix

Journey

  • The request started as WISHLIST #32: add a real-cluster E2E CI gate for a MySQL failover operator, required before releases, nightly, and optionally smoke-tested on PRs; the original wording included real pods/PVCs/Services, DNS/DNSEndpoint, taints, planned/emergency failover, operator restart, PVC loss, NetworkPolicy partition, backup restore, and PITR verification (.tmp/megamind-wishlist-32/briefs/request.md, .tmp/megamind-wishlist-32/briefs/context.md).
  • Multi-model critique identified the key forks before coding: reuse the existing playground-chaos runner instead of inventing a new E2E framework; avoid false NetworkPolicy coverage on CNIs that do not enforce policies; make release blocking explicit; define PR smoke scope; upload forensics; and do not pretend backup/PITR coverage exists without scenarios (.tmp/megamind-wishlist-32/critiques/mbot-critique.md).
  • The plan was narrowed from the broader second draft to an incremental production gate: wire the existing playground runner into Make/CI now, add smoke/release/full profiles, and split dedicated backup/PITR scenarios into a later follow-up rather than blocking the gate on a new framework or new backup suite (.tmp/megamind-wishlist-32/plans/second-draft.md, .tmp/megamind-wishlist-32/plans/final.md).
  • Implementation delivered the gate surface: make test-e2e, make test-e2e-smoke, playground-chaos run-all --profile, profile selection tests, kind+Calico reusable workflow, nightly/manual/PR-label triggers, release workflow dependency, CI-friendly playground setup, JUnit directory creation, and docs/examples cleanup for Helm CRD ownership (.tmp/megamind-wishlist-32/agents/e2e-gate-coder-final.md, .tmp/megamind-wishlist-32/final/diff-stat.txt).
  • Review fixes addressed workflow and CI sharp edges: kind creation now waits 0s before Calico, Calico is mandatory, node readiness uses kubectl wait nodes, JUnit parent directories are created, PR E2E reruns on synchronize while labeled, workflow concurrency was added, and live-registry tests guard profile drift (.tmp/megamind-wishlist-32/reviews/validated-findings.md, .tmp/megamind-wishlist-32/fixes/review-fixes-final.md).
  • Local verification passed runner tests, chaos binary build, vet, docs build, llms verification, unit tests, component tests, and diff check; make lint was skipped because golangci-lint was not installed, and real kind/k3d E2E was not run inside this harness (.tmp/megamind-wishlist-32/final/local-gates.md, .tmp/megamind-wishlist-32/final/delivery.md).
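
One of the review fixes listed above is that JUnit parent directories are created before the report is written. A minimal sketch of that pattern follows, assuming a hypothetical helper named writeReport; the actual code in internal/playground/runner/junit.go is not shown in this PR description and may differ.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeReport is a hypothetical helper (not the repository's function).
// It creates the parent directory of path if it is missing, then writes data.
func writeReport(path string, data []byte) error {
	// MkdirAll is a no-op when the directory already exists.
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return fmt.Errorf("create report directory: %w", err)
	}
	return os.WriteFile(path, data, 0o644)
}

func main() {
	dir, err := os.MkdirTemp("", "junit-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// The nested directories do not exist yet; writeReport creates them.
	out := filepath.Join(dir, "nested", "results", "junit.xml")
	if err := writeReport(out, []byte("<testsuites></testsuites>\n")); err != nil {
		panic(err)
	}
	fmt.Println("wrote", filepath.Base(out))
}
```

Without the MkdirAll call, a --junit-out path pointing into a directory that does not exist yet would fail at write time, which is easy to hit on a fresh CI runner.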

Design Decisions

| Decision | Why | Alternative rejected | Evidence |
| --- | --- | --- | --- |
| Reuse cmd/playground-chaos as the E2E harness. | Existing runner already had scenario registry, guards, reset/forensics, JUnit support, and real playground semantics. | New test/e2e Go/Ginkgo/e2e-framework suite. | Critique C1/H8 in .tmp/megamind-wishlist-32/critiques/mbot-critique.md; final plan scope; cmd/playground-chaos/main.go; Makefile. |
| Add smoke, release, and full profiles. | Gives bounded PR/manual feedback, a curated release/nightly gate, and preserves old all-scenarios behavior. | One all-or-nothing run-all, or an undefined PR smoke subset. | internal/playground/runner/profile.go; internal/playground/runner/profile_test.go; internal/playground/runner/profile_registry_test.go; Makefile. |
| Keep CLI default run-all profile as full, but make make test-e2e default to release. | Avoids surprising existing playground-chaos run-all users while making the canonical E2E gate the curated release profile. | Change run-all default semantics to release. | DefaultProfile = ProfileFull in internal/playground/runner/profile.go; E2E_PROFILE ?= release in Makefile. |
| Use kind with Calico for CI. | NetworkPolicy/partition tests need a CNI that enforces NetworkPolicy; default kindnet was called out as unsafe for this claim. | Default kind networking or k3d/flannel without enforcement. | .tmp/megamind-wishlist-32/critiques/mbot-critique.md; .github/kind/e2e-calico.yaml; .github/workflows/_e2e.yml. |
| Put release blocking inside release.yml. | A separate tag workflow could run in parallel and not block publishing; release jobs need an explicit dependency. | Non-blocking nightly policy or independent workflow only. | .github/workflows/release.yml; critique release-enforcement findings. |
| Use a reusable _e2e.yml workflow plus trigger workflow. | Keeps nightly/manual/PR-label/release entrypoints on one implementation path. | Duplicating kind/setup/run/artifact steps in every workflow. | .github/workflows/_e2e.yml; .github/workflows/e2e.yml; .github/workflows/release.yml; .github/workflows/README.md. |
| Let Helm own first-install chart CRDs by using the chart crds/ directory, not an installCRDs value. | Helm does not template crds/ based on values; the previous value was misleading. CI setup skips manual Bloodraven CRD apply so fresh Helm install exercises chart CRDs. | Keep or document installCRDs=true/false. | playground/setup.sh; removed value in charts/bloodraven/values.yaml; docs/example edits; .tmp/megamind-wishlist-32/reviews/validated-findings.md. |
| Split dedicated backup/PITR E2E scenarios into WISHLIST #43. | Critique found no existing automated backup/PITR scenarios; the delivered gate should not misrepresent coverage. | Block #32 until new backup/PITR scenarios exist, or claim release profile covers them. | .tmp/megamind-wishlist-32/reviews/validated-findings.md; WISHLIST.md; .tmp/megamind-wishlist-32/final/pr-body.md. |
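
The decision to keep the CLI default at full while the Makefile passes release explicitly can be illustrated with the standard flag package. This is only a sketch under assumptions: parseProfile and the lowercase constants are hypothetical names, and the real flag wiring in cmd/playground-chaos/main.go is not shown here.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// Hypothetical constants mirroring the smoke/release/full profiles.
const (
	profileSmoke   = "smoke"
	profileRelease = "release"
	profileFull    = "full"
	defaultProfile = profileFull // CLI default stays "full"
)

// parseProfile parses run-all style arguments and validates --profile.
func parseProfile(args []string) (string, error) {
	fs := flag.NewFlagSet("run-all", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the sketch's output
	profile := fs.String("profile", defaultProfile, "scenario profile: smoke, release, or full")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	switch *profile {
	case profileSmoke, profileRelease, profileFull:
		return *profile, nil
	default:
		return "", fmt.Errorf("unknown profile %q (want smoke, release, or full)", *profile)
	}
}

func main() {
	// No flag given: existing run-all users keep the old all-scenarios behavior.
	p, _ := parseProfile(nil)
	fmt.Println("default:", p)

	// What a Makefile with E2E_PROFILE ?= release would effectively pass.
	p, _ = parseProfile([]string{"--profile=release"})
	fmt.Println("make test-e2e:", p)

	if _, err := parseProfile([]string{"--profile=bogus"}); err != nil {
		fmt.Println("error:", err)
	}
}
```

Putting the release default in the Makefile rather than the binary means the gate's behavior changes only where the gate is defined, and direct CLI users see no difference.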

Architecture

flowchart TD
  subgraph Local[Local / Make entrypoints]
    A[make test-e2e\nE2E_PROFILE defaults release] --> B[bin/playground-chaos run-all]
    A2[make test-e2e-smoke] --> B
    A3[make chaos-run-all-profile PROFILE=...] --> B
  end

  subgraph Runner[playground-chaos]
    B --> C[--profile validation\nsmoke | release | full]
    C --> D[runner.SelectForProfile]
    D --> E[Executor runs selected scenarios]
    E --> F[JUnit XML]
    E --> G[chaos-results forensics on failure]
  end

  subgraph CI[GitHub Actions]
    H[e2e.yml\nschedule/manual/PR label] --> I[_e2e.yml reusable]
    R[release.yml e2e-gate] --> I
    I --> J[kind bloodraven-e2e\nCalico CNI]
    J --> K[build + load playground images]
    K --> L[playground/setup.sh\nSKIP_IMAGE_BUILD=1\nBLOODRAVEN_SETUP_HELM_INSTALL_CRDS=1]
    L --> A
    F --> M[upload JUnit]
    G --> N[upload forensics/kind/setup logs]
  end

  R --> O[draft/docker publishing require e2e-gate]

Key module boundaries:

  • internal/playground/runner/profile.go is intentionally small and data-driven: profile constants, allowlists, validation, and SelectForProfile(all, profile) filtering. It does not know Kubernetes or scenario internals.
  • cmd/playground-chaos/main.go owns CLI parsing and wires --profile only into run-all; single-scenario run remains unchanged.
  • Makefile is the local contract: test-e2e builds the runner and runs run-all --profile=$(E2E_PROFILE) --auto-reset --continue-on-failure --junit-out=...; test-e2e-smoke is a convenience wrapper.
  • .github/workflows/_e2e.yml is the CI implementation: build runner/images, create kind with Calico, install Calico, load images, deploy playground, run the requested profile, then upload JUnit and failure artifacts.
  • playground/setup.sh gained CI-safe behavior by honoring prebuilt/preloaded images (SKIP_IMAGE_BUILD) and skipping manual Bloodraven CRD application when Helm should install chart CRDs on a fresh cluster (BLOODRAVEN_SETUP_HELM_INSTALL_CRDS).
  • Documentation and examples were updated to remove the misleading installCRDs value and explain that Helm installs chart CRDs on first install while upgrades need explicit CRD review/application.
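
The data-driven shape described for profile.go (constants, allowlists, validation, and SelectForProfile(all, profile) filtering) might look roughly like the sketch below. The Scenario type, the return signature, and every scenario name in the allowlists are assumptions made for illustration; the PR description does not show the real contents.

```go
package main

import "fmt"

// Scenario is a stand-in for the runner's scenario type (assumed shape).
type Scenario struct{ Name string }

// Profile names a curated subset of scenarios.
type Profile string

const (
	ProfileSmoke   Profile = "smoke"
	ProfileRelease Profile = "release"
	ProfileFull    Profile = "full"
)

// allowlists holds hypothetical scenario names; the real lists are not
// given in the PR description.
var allowlists = map[Profile][]string{
	ProfileSmoke:   {"planned-failover"},
	ProfileRelease: {"planned-failover", "emergency-failover", "operator-restart"},
}

// SelectForProfile filters all down to the scenarios allowed by profile.
// ProfileFull returns everything; unknown profiles are an error.
func SelectForProfile(all []Scenario, profile Profile) ([]Scenario, error) {
	if profile == ProfileFull {
		return all, nil
	}
	allowed, ok := allowlists[profile]
	if !ok {
		return nil, fmt.Errorf("unknown profile %q", profile)
	}
	set := make(map[string]bool, len(allowed))
	for _, name := range allowed {
		set[name] = true
	}
	var selected []Scenario
	for _, s := range all { // preserve registry order
		if set[s.Name] {
			selected = append(selected, s)
		}
	}
	return selected, nil
}

func main() {
	all := []Scenario{{"planned-failover"}, {"emergency-failover"}, {"pvc-loss"}}
	smoke, err := SelectForProfile(all, ProfileSmoke)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(smoke), "of", len(all), "scenarios selected for smoke")
}
```

Because the selector only compares names, it needs no knowledge of Kubernetes or scenario internals, which matches the boundary described for this module.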

Lessons

  • For real-cluster gates, prefer reusing a battle-tested chaos/playground runner over creating a parallel E2E framework unless the existing runner cannot express the assertion. This preserves forensics, cleanup, guardrails, and scenario inventory.
  • Make test subsets explicit. A PR “smoke” gate needs named profiles or allowlists, not prose like “reduced subset,” and those allowlists need tests against the live scenario registry to prevent silent drift.
  • Do not claim NetworkPolicy coverage without controlling the CNI. The kind+Calico choice came directly from critique of default CNI false positives.
  • Release gating must be a dependency in the release workflow, not just a separate scheduled/tag workflow.
  • Long-running CI needs artifacts by design: JUnit for check summaries, chaos forensics for scenario failures, setup logs for deploy failures, and kind logs for infrastructure failures.
  • Helm CRD semantics are easy to misdocument: files in crds/ install on first install and are not templated by values or automatically upgraded. Avoid fake installCRDs toggles unless the chart actually implements them.
  • When scope contains missing scenario classes, record the limitation as a follow-up rather than hiding it. Here, backup/PITR became WISHLIST #43 so #32 can be truthful and still useful.
  • Keep repository guidance aligned with the actual test topology. This change updated AGENTS.md so future agents route unit/component/envtest/real-cluster scenario work to the right directories and runner.
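
The lesson about testing allowlists against the live scenario registry can be reduced to one check: every allowlisted name must still be registered. The sketch below shows that check in isolation; the function name unregistered and the scenario names are hypothetical, and the real test in profile_registry_test.go may be structured differently.

```go
package main

import (
	"fmt"
	"sort"
)

// unregistered returns every allowlisted scenario name that is absent from
// the registry, sorted for stable output. A non-empty result means the
// profile has drifted, for example after a scenario was renamed or removed.
func unregistered(allowlist, registry []string) []string {
	known := make(map[string]struct{}, len(registry))
	for _, name := range registry {
		known[name] = struct{}{}
	}
	var missing []string
	for _, name := range allowlist {
		if _, ok := known[name]; !ok {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Hypothetical registry and allowlist contents.
	registry := []string{"planned-failover", "emergency-failover", "operator-restart"}
	smoke := []string{"planned-failover", "renamed-scenario"}

	if missing := unregistered(smoke, registry); len(missing) > 0 {
		fmt.Println("profile drift:", missing)
	}
}
```

In a real test this result would fail the build, so a rename cannot silently shrink the smoke or release gate.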

Evidence

| Claim | Source |
| --- | --- |
| Original objective required a real-cluster E2E gate and listed backup/PITR among desired behaviors. | .tmp/megamind-wishlist-32/briefs/request.md; .tmp/megamind-wishlist-32/briefs/context.md. |
| Critique drove harness reuse, CNI/NetworkPolicy caution, release blocking, artifact upload, and backup/PITR scope correction. | .tmp/megamind-wishlist-32/critiques/mbot-critique.md. |
| Final plan chose existing playground runner, profile-driven gate, CI workflows, release blocking, docs, and WISHLIST update. | .tmp/megamind-wishlist-32/plans/final.md. |
| Profile implementation contains smoke/release/full constants, allowlists, validation, and selector. | internal/playground/runner/profile.go. |
| Profile behavior is unit-tested and checked against registered scenarios. | internal/playground/runner/profile_test.go; internal/playground/runner/profile_registry_test.go. |
| CLI exposes --profile for run-all, validates values, filters scenarios, and still writes JUnit. | cmd/playground-chaos/main.go. |
| Make targets replaced the old placeholder with runnable E2E entrypoints. | Makefile; diff against main shows removal of the TESTING_2.0.md placeholder. |
| JUnit output now creates parent directories before writing. | internal/playground/runner/junit.go; validated finding #5. |
| CI uses kind+Calico and builds/loads local playground images before setup. | .github/kind/e2e-calico.yaml; .github/workflows/_e2e.yml. |
| Nightly/manual/PR-label triggers use the reusable E2E workflow. | .github/workflows/e2e.yml. |
| Release publishing is blocked on e2e-gate before draft/docker, with Helm/publish transitively dependent. | .github/workflows/release.yml. |
| Setup can skip image builds and manual Bloodraven CRD application for CI fresh Helm installs. | playground/setup.sh; .github/workflows/_e2e.yml env. |
| installCRDs value and docs/examples references were removed because Helm CRDs are not value-controlled. | charts/bloodraven/values.yaml; docs/docs/install-production.mdx; docs/docs/gitops.mdx; docs/docs/production-install-examples.mdx; examples/argocd-application.yaml; examples/production-values.yaml; .tmp/megamind-wishlist-32/reviews/fixed-review.md. |
| Backup/PITR dedicated scenario work is explicitly a follow-up, not claimed as done. | WISHLIST.md item #43; .tmp/megamind-wishlist-32/reviews/validated-findings.md; .tmp/megamind-wishlist-32/final/pr-body.md. |
| Local gates passed except lint skipped and real cluster E2E not run locally. | .tmp/megamind-wishlist-32/final/local-gates.md; .tmp/megamind-wishlist-32/agents/e2e-gate-coder-final.md. |
| PR and branch delivery metadata. | .tmp/megamind-wishlist-32/final/delivery.md; .tmp/megamind-wishlist-32/final/diff-stat.txt. |

Copilot AI review requested due to automatic review settings May 21, 2026 04:07
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a real-cluster E2E “release gate” by making make test-e2e run the existing playground chaos suite against an actual Kubernetes cluster, and wiring that into GitHub Actions so release publishing is blocked on the E2E run. It also updates playground setup and docs/examples to align with Helm CRD ownership behavior and CI execution.

Changes:

  • Replace the placeholder make test-e2e with profile-driven playground-chaos run-all execution (smoke/release/full) and add Make targets for smoke/profile runs.
  • Add reusable kind+Calico E2E workflow and a trigger workflow (nightly/manual/PR label), and require the E2E gate in the release workflow before publishing.
  • Adjust playground setup for CI (skip image build) and CRD installation behavior; update docs/examples to remove the now-obsolete installCRDs value guidance.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
| --- | --- |
| WISHLIST.md | Marks #32 complete and adds follow-up #43 for backup/PITR-specific E2E scenarios. |
| playground/setup.sh | Adds CI-friendly image-build skip and a toggle for relying on Helm-installed CRDs vs manual CRD apply. |
| Makefile | Implements real test-e2e/test-e2e-smoke targets and adds a profile-filtered chaos runner target. |
| internal/playground/runner/profile.go | Introduces E2E profiles and scenario selection filtering. |
| internal/playground/runner/profile_test.go | Adds unit tests for profile validation and selection behavior. |
| internal/playground/runner/profile_registry_test.go | Verifies profiles select the intended subset from the registered scenarios set. |
| internal/playground/runner/junit.go | Ensures the JUnit output directory exists before writing the report. |
| examples/production-values.yaml | Removes installCRDs example since CRD ownership guidance changed. |
| examples/argocd-application.yaml | Removes installCRDs override from Argo CD Helm values example. |
| docs/docs/production-install-examples.mdx | Updates CRD ownership/install guidance to reflect Helm CRD behavior. |
| docs/docs/install-production.mdx | Removes --set installCRDs=true/false guidance; clarifies Helm CRD upgrade limitations. |
| docs/docs/gitops.mdx | Simplifies CRD ownership table and clarifies install/upgrade sequencing. |
| cmd/playground-chaos/main.go | Adds --profile flag and applies profile filtering in run-all. |
| charts/bloodraven/values.yaml | Removes the installCRDs value from chart values. |
| AGENTS.md | Documents new E2E Make targets and profile-based chaos runs. |
| .github/workflows/release.yml | Adds an E2E release-profile gate job and makes publishing jobs depend on it. |
| .github/workflows/README.md | Documents the new E2E workflows and the profile matrix. |
| .github/workflows/e2e.yml | Adds the trigger workflow for nightly/manual/PR-label E2E runs. |
| .github/workflows/_e2e.yml | Adds the reusable kind+Calico cluster workflow that deploys the playground and runs E2E. |
| .github/kind/e2e-calico.yaml | Adds a kind config that disables default CNI and supports Calico-enforced NetworkPolicy testing. |

Comment thread playground/setup.sh
Comment thread .github/workflows/e2e.yml
colinmollenhour force-pushed the one-large-step-megamind branch from 0a78f67 to 387720e on May 21, 2026 04:10
colinmollenhour force-pushed the one-large-step-megamind branch from 6875b36 to fb1c1f7 on May 21, 2026 04:18
@colinmollenhour
Collaborator Author

AI Review Response · Commit: be89c05 · By: AI agent · Summary

Addressed both unresolved Copilot review threads and resolved them:

  • playground/setup.sh: added an explicit guard for BLOODRAVEN_SETUP_HELM_INSTALL_CRDS=1 when an existing bloodraven Helm release is present. The script now fails clearly instead of implying Helm will upgrade/repair CRDs on helm upgrade.
  • .github/workflows/e2e.yml: renamed the trigger workflow job from smoke to e2e so scheduled/manual release/full profile runs are not mislabeled in the Actions UI.

Verification run after the fixes:

  • git diff --check — PASS
  • make build-playground-chaos — PASS
  • make vet — PASS

Skipped threads: none.
Follow-up questions: none.
PR state: fixes pushed in be89c05; review threads resolved; CI is running for the new commit.

@colinmollenhour
Collaborator Author

AI Review Response · Commit: be89c05 · By: AI agent · Summary

Final update after CI:

  • Fixed and resolved both review threads.
  • Pushed commit be89c05 with the setup.sh Helm-CRD rerun guard and the E2E workflow job rename.
  • Verification before push: git diff --check, make build-playground-chaos, and make vet all passed.
  • GitHub CI is now green: CI / All Checks passed, Trivy passed. The E2E PR-label job is skipped as expected unless the PR has the e2e label.
  • Skipped threads: none.
  • Open follow-up questions: none.

colinmollenhour added the e2e label (Run real-cluster E2E smoke workflow) on May 21, 2026