22 changes: 22 additions & 0 deletions .github/kind/e2e-calico.yaml
@@ -0,0 +1,22 @@
# kind cluster configuration for Bloodraven E2E tests.
# Uses Calico CNI so NetworkPolicy resources are enforced (the default
# kindnet CNI does not implement NetworkPolicy, which means partition /
# self-fencing scenarios would silently pass without actually testing
# policy behaviour).
#
# Usage:
# kind create cluster --config=.github/kind/e2e-calico.yaml
# # then install Calico:
# kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: bloodraven-e2e
nodes:
  - role: control-plane
  - role: worker
  - role: worker
networking:
  # Disable kindnet so Calico can manage CNI instead.
  disableDefaultCNI: true
  # Match the stock Calico manifest's default IPv4 pool.
  podSubnet: "192.168.0.0/16"
22 changes: 22 additions & 0 deletions .github/workflows/README.md
@@ -105,6 +105,28 @@ Then enable GitHub Pages for the repository pointing at the `gh-pages` branch. T

---

### `e2e.yml` / `_e2e.yml` — Real-Cluster E2E

**Triggers:**
- Nightly schedule (release profile)
- Manual dispatch with profile selection (smoke / release / full)
- Pull requests with the `e2e` label (smoke profile)

The reusable workflow (`_e2e.yml`) creates a kind cluster with Calico CNI, deploys the playground, and runs `playground-chaos run-all` with the selected profile. It uploads JUnit results, chaos forensics, setup logs, and kind logs as artifacts.

Profiles:
| Profile | Scenarios | Use case |
|---|---|---|
| `smoke` | 3 (~3-5 min) | PR label gate, fast feedback |
| `release` | 10 (~20-30 min) | Release and nightly gate |
| `full` | All registered | Full regression (manual only) |

The release workflow (`.github/workflows/release.yml`) blocks Docker image builds and Helm chart publishing on the E2E release-profile gate. This ensures every tagged release is validated against real MySQL failover scenarios (WISHLIST #32).

**Permissions:** `contents: read` (default)

---

### `scan.yml` — Trivy Security Scan

**Triggers:** Pull requests targeting `main`
125 changes: 125 additions & 0 deletions .github/workflows/_e2e.yml
@@ -0,0 +1,125 @@
# Reusable E2E workflow — creates a kind cluster, deploys the playground,
# and runs playground-chaos with the selected profile.
#
# Called by:
# .github/workflows/e2e.yml (nightly, manual, PR label)
# .github/workflows/release.yml (release gate)
name: E2E (reusable)

on:
  workflow_call:
    inputs:
      profile:
        description: "Chaos profile (smoke|release|full)"
        required: false
        default: "release"
        type: string
      timeout-minutes:
        description: "Job timeout in minutes"
        required: false
        default: 90
        type: number

permissions:
  contents: read

env:
  BLOODRAVEN_SETUP_HELM_INSTALL_CRDS: "1"
  SKIP_IMAGE_BUILD: "1"

concurrency:
  group: e2e-${{ github.workflow }}-${{ github.ref }}-${{ inputs.profile }}
  cancel-in-progress: true

jobs:
  e2e:
    name: Real-cluster E2E (${{ inputs.profile }})
    runs-on: ubuntu-latest
    timeout-minutes: ${{ inputs.timeout-minutes }}
    steps:
      - uses: actions/checkout@v6

      - uses: actions/setup-go@v6
        with:
          go-version-file: go.mod
          cache-dependency-path: go.sum

      - name: Build playground-chaos
        run: make build-playground-chaos

      - name: Build Docker images
        run: |
          docker build --target bloodraven -t bloodraven:playground .
          docker build --target sidecar -t bloodraven-sidecar:playground .
          docker build -t bloodraven-counter:playground playground/counter-app
          docker build -t bloodraven-dashboard:playground playground/dashboard
          docker build -t bloodraven-dns-webhook:playground playground/dns-webhook

      - name: Create kind cluster
        uses: helm/kind-action@v1.12.0
        with:
          cluster_name: bloodraven-e2e
          config: .github/kind/e2e-calico.yaml
          # CNI is disabled in this kind config, so nodes cannot become
          # Ready until Calico is installed in the next step.
          wait: 0s

      - name: Install Calico CNI
        run: |
          kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml
          kubectl -n kube-system rollout status daemonset/calico-node --timeout=180s
          kubectl wait nodes --all --for=condition=Ready --timeout=180s

      - name: Load images into kind
        run: |
          kind load docker-image bloodraven:playground bloodraven-sidecar:playground bloodraven-counter:playground bloodraven-dashboard:playground bloodraven-dns-webhook:playground --name bloodraven-e2e

      - name: Deploy playground
        run: |
          set -o pipefail
          ./playground/setup.sh 2>&1 | tee playground/setup.log
        timeout-minutes: 10

      - name: Run E2E (${{ inputs.profile }} profile)
        run: make test-e2e E2E_PROFILE=${{ inputs.profile }} E2E_JUNIT_OUT=playground/chaos-results/e2e-${{ inputs.profile }}-junit.xml
        timeout-minutes: ${{ inputs.timeout-minutes }}

      - name: Upload JUnit results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: e2e-${{ inputs.profile }}-junit
          path: playground/chaos-results/e2e-${{ inputs.profile }}-junit.xml
          retention-days: 30

      - name: Upload chaos forensics
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: e2e-${{ inputs.profile }}-forensics
          path: playground/chaos-results/
          retention-days: 30

      - name: Upload kind logs
        if: failure()
        run: |
          mkdir -p /tmp/kind-logs
          kind export logs --name=bloodraven-e2e /tmp/kind-logs || true
        continue-on-error: true

      - name: Upload kind logs artifact
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: e2e-${{ inputs.profile }}-kind-logs
          path: /tmp/kind-logs/
          retention-days: 14
        continue-on-error: true

      - name: Upload setup logs
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: e2e-${{ inputs.profile }}-setup-logs
          path: playground/setup.log
          retention-days: 14
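
To reproduce this job outside CI, the same steps can be chained in a small script. The following is a minimal sketch and not part of this PR: the names `run`, `e2e_local`, and the `DRY_RUN` switch are hypothetical, while every command is copied from the workflow steps above. It assumes `kind`, `kubectl`, `docker`, and `make` are on `PATH` and that it is invoked from the repository root.

```shell
# Sketch only: mirrors the steps of .github/workflows/_e2e.yml for local use.
# `run`, `e2e_local`, and DRY_RUN are illustrative names, not part of the repo.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# e2e_local [PROFILE]  (PROFILE defaults to "release", as in the workflow)
# The body runs in a subshell so `set -e` and the variables stay local.
e2e_local() (
  set -e
  profile="${1:-release}"
  cluster="bloodraven-e2e"
  calico="https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml"
  images="bloodraven:playground bloodraven-sidecar:playground bloodraven-counter:playground bloodraven-dashboard:playground bloodraven-dns-webhook:playground"

  run make build-playground-chaos
  run docker build --target bloodraven -t bloodraven:playground .
  run docker build --target sidecar -t bloodraven-sidecar:playground .
  run docker build -t bloodraven-counter:playground playground/counter-app
  run docker build -t bloodraven-dashboard:playground playground/dashboard
  run docker build -t bloodraven-dns-webhook:playground playground/dns-webhook

  # The kind config disables the default CNI, so nodes stay NotReady
  # until Calico is applied and its DaemonSet has rolled out.
  run kind create cluster --config=.github/kind/e2e-calico.yaml
  run kubectl apply -f "$calico"
  run kubectl -n kube-system rollout status daemonset/calico-node --timeout=180s
  run kubectl wait nodes --all --for=condition=Ready --timeout=180s

  # $images is intentionally unquoted so it splits into one argument per image.
  run kind load docker-image $images --name "$cluster"
  run env BLOODRAVEN_SETUP_HELM_INSTALL_CRDS=1 SKIP_IMAGE_BUILD=1 ./playground/setup.sh
  run make test-e2e "E2E_PROFILE=$profile"
)
```

Calling `DRY_RUN=1 e2e_local smoke` prints the command sequence without touching Docker or the cluster, which is a cheap way to review the order before running it for real.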
39 changes: 39 additions & 0 deletions .github/workflows/e2e.yml
@@ -0,0 +1,39 @@
# E2E trigger workflow — nightly, manual, and PR-label-gated.
# The reusable workflow is in .github/workflows/_e2e.yml.
name: E2E

on:
  # Nightly release-profile run
  schedule:
    - cron: "0 5 * * *" # 05:00 UTC daily

  # Manual dispatch with profile selection
  workflow_dispatch:
    inputs:
      profile:
        description: "Chaos profile (smoke|release|full)"
        required: false
        default: "release"
        type: choice
        options:
          - smoke
          - release
          - full

  # PR label gate: run smoke while the "e2e" label is present.
  pull_request:
    types: [opened, reopened, synchronize, labeled]

permissions:
  contents: read

jobs:
  # Skip PR-triggered runs unless the "e2e" label is present.
  e2e:
    if: >-
      github.event_name == 'schedule' ||
      github.event_name == 'workflow_dispatch' ||
      (github.event_name == 'pull_request' && contains(github.event.pull_request.labels.*.name, 'e2e'))
    uses: ./.github/workflows/_e2e.yml
    with:
      profile: ${{ github.event_name == 'pull_request' && 'smoke' || (github.event.inputs.profile || 'release') }}
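
The two expressions in this workflow (the job-level `if:` and the `profile:` input) are compact, so it may help to see the same decisions spelled out. The following is a rough shell sketch of that logic, not code from the PR: `should_run` and `select_profile` are hypothetical names, and passing labels as a space-separated string is an assumption made for illustration (label names containing spaces are not handled).

```shell
# Sketch only: shell equivalents of the two expressions in e2e.yml.
# `should_run` and `select_profile` are illustrative names, not part of the repo.

# should_run EVENT_NAME [LABELS]
# LABELS is a space-separated list of PR label names (assumption).
# Returns 0 when the job-level `if:` would let the run proceed.
should_run() {
  case "$1" in
    schedule|workflow_dispatch)
      return 0 ;;
    pull_request)
      for label in ${2:-}; do
        # GitHub's contains() ignores case, so compare in lower case.
        lower="$(printf '%s' "$label" | tr '[:upper:]' '[:lower:]')"
        if [ "$lower" = "e2e" ]; then
          return 0
        fi
      done
      return 1 ;;
    *)
      return 1 ;;
  esac
}

# select_profile EVENT_NAME [DISPATCH_PROFILE]
# Pull requests always get "smoke"; otherwise the manual-dispatch input wins,
# falling back to "release" (the scheduled run has no input).
select_profile() {
  if [ "$1" = "pull_request" ]; then
    echo "smoke"
  elif [ -n "${2:-}" ]; then
    echo "$2"
  else
    echo "release"
  fi
}

select_profile pull_request            # prints: smoke
select_profile schedule                # prints: release
select_profile workflow_dispatch full  # prints: full
```

Note that a pull request never honours a dispatch input: the first branch short-circuits to `smoke`, which matches the `&& 'smoke' ||` form of the original expression.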
15 changes: 13 additions & 2 deletions .github/workflows/release.yml
@@ -73,10 +73,21 @@ jobs:
         working-directory: docs
         run: npm run verify:llms

+  # E2E release gate — runs the release-profile real-cluster E2E before
+  # any publishing jobs. This ensures that every tagged release has been
+  # validated against real MySQL pods, PVCs, DNS, taints, failover, and
+  # network partition scenarios (WISHLIST #32).
+  e2e-gate:
+    name: E2E gate (release profile)
+    needs: ci-gate
+    uses: ./.github/workflows/_e2e.yml
+    with:
+      profile: release
+
   draft-release:
     name: Create Draft Release
     runs-on: ubuntu-latest
-    needs: ci-gate
+    needs: [ci-gate, e2e-gate]
     steps:
       - uses: actions/checkout@v6
         with:
@@ -116,7 +127,7 @@ jobs:
   docker:
     name: Build and Push Docker Images
     runs-on: ubuntu-latest
-    needs: [ci-gate, draft-release]
+    needs: [ci-gate, e2e-gate, draft-release]
     strategy:
       matrix:
         include:
8 changes: 5 additions & 3 deletions AGENTS.md
@@ -1,7 +1,7 @@
# Repository Guidelines

## Project Structure & Module Organization
-Primary code lives in the root Go module. `cmd/bloodraven` is the Kubernetes operator entrypoint; `cmd/sidecar` is the per-MySQL sidecar; `cmd/kubectl-bloodraven` is the day-2 `kubectl` plugin (status / promote / reclone / backup / verify-backup, built via `make build-kubectl-plugin`). API types live in `api/v1alpha1`, controller logic in `internal/controller`, and supporting packages in `internal/mysql`, `internal/platform`, `internal/sidecar`, `internal/state`, and `internal/metrics`. End-to-end and scenario-style tests live in `test/e2e`. Treat `bitpoke/` and `orchestrator/` as bundled upstream references, not the default place for new feature work.
+Primary code lives in the root Go module. `cmd/bloodraven` is the Kubernetes operator entrypoint; `cmd/sidecar` is the per-MySQL sidecar; `cmd/kubectl-bloodraven` is the day-2 `kubectl` plugin (status / promote / reclone / backup / verify-backup, built via `make build-kubectl-plugin`). API types live in `api/v1alpha1`, controller logic in `internal/controller`, and supporting packages in `internal/mysql`, `internal/platform`, `internal/sidecar`, `internal/state`, and `internal/metrics`. Real-cluster scenario tests live under `internal/playground/scenarios` and run through `cmd/playground-chaos`; faster cross-component tests live under `test/component`, with API-server/envtest coverage under `test/envtest`. Treat `bitpoke/` and `orchestrator/` as bundled upstream references, not the default place for new feature work.

## Build, Test, and Development Commands
Run commands from the repository root:
@@ -10,6 +10,8 @@ Run commands from the repository root:
- `go build ./cmd/sidecar` builds the sidecar binary.
- `make build-kubectl-plugin` builds `bin/kubectl-bloodraven` (the day-2 `kubectl` plugin). Override `KUBECTL_PLUGIN_VERSION=<tag>` to stamp a release; `make install-kubectl-plugin` drops the binary onto `$PATH`.
- `make test` runs `go test ./...` across unit and e2e-style packages.
+- `make test-e2e` runs the release profile of real-cluster E2E tests against the current playground cluster (requires kind/k3d/minikube context prepared with `./playground/setup.sh`; CI creates kind and runs setup first).
+- `make test-e2e-smoke` runs the smoke profile (~3 scenarios, fast feedback).
- `make vet` runs `go vet ./...`.
- `make lint` runs `golangci-lint run ./...`. `golangci-lint` is not vendored; install it with `go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest` (it lands in `$(go env GOPATH)/bin`). CI installs the same tool with the same command in `.github/workflows/ci.yml`, so local and CI output match when you run this.
- `make generate` refreshes API deep-copy code in `api/v1alpha1`.
@@ -26,7 +28,7 @@ Use standard Go formatting: run `gofmt` on changed files and keep imports organi
Structured-log `msg` strings and field names listed in `docs/docs/log-schema.mdx` are a public stability contract — downstream log pipelines filter on them. When you touch a log call site whose `msg` appears in that doc's Event reference, either preserve the `msg` string and the documented field set exactly, or update `docs/docs/log-schema.mdx` in the same PR and call out the break in the PR description. The same applies to field naming: log keys are `camelCase` (per the contract), not `snake_case`.

## Testing Guidelines
-Add table-driven unit tests beside the code they cover, using the existing `*_test.go` layout under `internal/`. Put cross-component behavior tests in `test/e2e`. Some tests create local HTTP listeners with `httptest`, so restricted sandboxes may fail even when local developer runs pass.
+Add table-driven unit tests beside the code they cover, using the existing `*_test.go` layout under `internal/`. Put cross-component behavior tests in `test/component`, API-server/controller-runtime tests in `test/envtest`, and real-cluster playground scenarios in `internal/playground/scenarios` through `cmd/playground-chaos`. Some tests create local HTTP listeners with `httptest`, so restricted sandboxes may fail even when local developer runs pass.

### Pre-PR gate (required, do not skip)
Before pushing a branch that opens or updates a PR, run all of the following from the repo root and fix anything they report. Do **not** push expecting CI to find problems you could have caught locally — CI failures on lint or generate drift are round-trip latency and reviewer noise.
@@ -89,7 +91,7 @@ Lessons from running chaos scenarios against a live k3d cluster:
`./playground/rebuild.sh operator` builds, imports to k3d, and restarts the operator deployment. For sidecar changes, use `./playground/rebuild.sh sidecar` (restarts MySQL pods). Both can be combined: `./playground/rebuild.sh operator sidecar`.

### Automated chaos runner
-A subset of `playground/chaos-scenarios.md` is automated by `cmd/playground-chaos` and exposed as Make targets: `make chaos-list`, `make chaos-check`, `make chaos-run SCENARIO=<id>`, `make chaos-run-all`. The runner refuses to mutate any kubectl context outside the `_guard.sh` allowlist; on assertion failure it captures cluster YAML + pods + events + operator/sidecar logs + raw `/metrics` under `playground/chaos-results/<timestamp>/<scenario-id>/` for triage. Use `--no-cleanup` to keep injected state in place for forensics.
+A subset of `playground/chaos-scenarios.md` is automated by `cmd/playground-chaos` and exposed as Make targets: `make chaos-list`, `make chaos-check`, `make chaos-run SCENARIO=<id>`, `make chaos-run-all`, `make chaos-run-all-profile PROFILE=smoke|release|full`. The runner supports three E2E profiles (`--profile=smoke|release|full`) that filter which scenarios run. The runner refuses to mutate any kubectl context outside the `_guard.sh` allowlist; on assertion failure it captures cluster YAML + pods + events + operator/sidecar logs + raw `/metrics` under `playground/chaos-results/<timestamp>/<scenario-id>/` for triage. Use `--no-cleanup` to keep injected state in place for forensics.

The runner stamps an in-progress marker on the MFG (`chaos.playground.bloodraven.io/in-progress`) after Precheck and clears it on cleanup. A subsequent run that finds a leftover marker refuses to start with a specific reason (live owner / abandoned / different host). Override with `--force` (delete the marker before preflight) or `--auto-reset` (on Precheck failure, shell out to `reset-mysql.sh + setup.sh` and retry once; 3s pause unless `CI=1`). `chaos-check` runs the same structural baseline scenarios use — stuck scale-to-0 deployments, bogus `lastFailoverTarget`, anti-flap cooldown still ticking, `NoPrimary` (both-sites-read-only), replication off on a non-active candidate — each with the exact remediation command in the error.

19 changes: 14 additions & 5 deletions Makefile
@@ -1,6 +1,6 @@
 CONTROLLER_GEN ?= go run sigs.k8s.io/controller-tools/cmd/controller-gen

-.PHONY: help generate manifests build build-bloodraven build-sidecar build-playground-chaos build-kubectl-plugin install-kubectl-plugin test test-unit test-component test-envtest test-e2e test-integration fmt vet lint docker-build chaos-list chaos-check chaos-run chaos-run-all
+.PHONY: help generate manifests build build-bloodraven build-sidecar build-playground-chaos build-kubectl-plugin install-kubectl-plugin test test-unit test-component test-envtest test-e2e test-e2e-smoke test-integration fmt vet lint docker-build chaos-list chaos-check chaos-run chaos-run-all chaos-run-all-profile

 ##@ General

@@ -90,10 +90,15 @@ test-component: ## Run component tests (cross-package with fakes, no real cluste
 test-envtest: ## Run envtest controller tests (real API server, no cluster)
 	go test -race -tags envtest ./test/envtest/

-test-e2e: ## Run real cluster end-to-end tests (requires kind/k3d — Phase 4, not yet implemented)
-	@echo "Real cluster e2e tests are not yet implemented (Testing 2.0 Phase 4)."
-	@echo "See TESTING_2.0.md for the planned scenarios."
-	@exit 1
+E2E_PROFILE ?= release
+E2E_JUNIT_OUT ?= playground/chaos-results/e2e-$(E2E_PROFILE)-junit.xml
+E2E_ARGS ?=
+
+test-e2e: build-playground-chaos ## Run real-cluster E2E tests (E2E_PROFILE=release|smoke|full; requires kind/k3d)
+	./bin/playground-chaos run-all --profile=$(E2E_PROFILE) --auto-reset --continue-on-failure --junit-out=$(E2E_JUNIT_OUT) $(E2E_ARGS)
+
+test-e2e-smoke: build-playground-chaos ## Run real-cluster E2E smoke (smoke profile — requires kind/k3d)
+	$(MAKE) test-e2e E2E_PROFILE=smoke E2E_JUNIT_OUT=playground/chaos-results/e2e-smoke-junit.xml

 test-integration: ## Run integration tests (network listener tests)
 	go test -tags integration -race ./internal/platform/ ./test/component/
@@ -123,3 +128,7 @@ chaos-run: build-playground-chaos ## Run a single scenario (SCENARIO=<id>)

 chaos-run-all: build-playground-chaos ## Run every registered chaos scenario in order
 	./bin/playground-chaos run-all
+
+chaos-run-all-profile: build-playground-chaos ## Run chaos scenarios filtered by profile (PROFILE=smoke|release|full)
+	@if [ -z "$(PROFILE)" ]; then echo "usage: make chaos-run-all-profile PROFILE=smoke"; exit 2; fi
+	./bin/playground-chaos run-all --profile=$(PROFILE)
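
The `test-e2e` recipe above is assembled from three overridable variables, so its final command line depends on what the caller passes. As a rough illustration of how the defaults combine, here is a shell sketch that prints the equivalent command; `e2e_command` is a hypothetical helper, not something in the repository, and the flags are taken verbatim from the recipe.

```shell
# Sketch only: prints the command that `make test-e2e` is expected to expand to.
# `e2e_command` is an illustrative name, not part of the repo.

# e2e_command [PROFILE] [JUNIT_OUT] [EXTRA_ARGS]
# Defaults mirror E2E_PROFILE, E2E_JUNIT_OUT, and E2E_ARGS in the Makefile.
e2e_command() {
  profile="${1:-release}"
  junit_out="${2:-playground/chaos-results/e2e-${profile}-junit.xml}"
  extra_args="${3:-}"

  cmd="./bin/playground-chaos run-all --profile=${profile} --auto-reset --continue-on-failure --junit-out=${junit_out}"
  if [ -n "$extra_args" ]; then
    cmd="$cmd $extra_args"
  fi
  echo "$cmd"
}

e2e_command
e2e_command smoke "" --no-cleanup
```

The JUnit path follows the profile unless it is overridden, which is why the CI step passes both `E2E_PROFILE` and `E2E_JUNIT_OUT` explicitly. Running `make -n test-e2e E2E_PROFILE=smoke` prints the real expansion without executing it.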
7 changes: 5 additions & 2 deletions WISHLIST.md
@@ -5,13 +5,16 @@
- [ ] 7. Cross-region/cross-cluster DR as a first-class feature
- [ ] 27. Backup/restore performance guide
- [ ] 30. Public repo, license, release cadence
-- [ ] 32. Real-cluster E2E CI gate
+- [x] 32. Real-cluster E2E CI gate
- [ ] 41. Safe Secret watch narrowing design
- [ ] 42. Namespace-scoped watch/cache mode evaluation
+- [ ] 43. Dedicated backup/PITR real-cluster E2E scenarios

## P0 — Production adoption blockers

-**32. Real-cluster E2E CI gate.** Unit/component/envtest coverage is not enough for a MySQL failover operator. Add an optional-but-required-before-release k3d/kind CI job that installs the chart and exercises real MySQL pods, PVCs, Services, DNS/DNSEndpoint behavior, taints, planned failover, emergency failover, operator restart, PVC loss, NetworkPolicy partition, backup restore, and PITR verification. This should run at least on release tags and nightly; if cost is acceptable, run a reduced smoke subset on PRs.
+**32. Real-cluster E2E CI gate.** Done: `make test-e2e` runs the release profile of `playground-chaos run-all` against a real cluster instead of the former placeholder. `make test-e2e-smoke` runs a fast smoke subset (3 scenarios). Three profiles (`smoke`/`release`/`full`) filter scenarios via `--profile` on `playground-chaos run-all` and `make chaos-run-all-profile PROFILE=`. CI uses a reusable workflow (`_e2e.yml`) that creates a kind cluster with Calico CNI, deploys the playground, and runs the selected profile. Nightly and manual runs use the release profile; PRs with the `e2e` label trigger a smoke run. Release publishing blocks on the E2E release-profile gate. JUnit, forensics, setup logs, and kind logs are uploaded as artifacts. Dedicated MySQL backup restore and PITR verification scenarios are split out as follow-up #43 so the gate can start enforcing the existing real-cluster chaos suite now without misrepresenting that coverage.
+
+**43. Dedicated backup/PITR real-cluster E2E scenarios.** Follow-up to #32: add release-profile playground-chaos scenarios that configure the playground backup profile against RustFS, trigger a real `MysqlBackup`, verify restore via `MysqlBackupVerification`, then enable PITR/binlog archival and verify a point-in-time replay with deterministic marker rows. The #32 gate now exists and is release-blocking, but this backup/PITR coverage should be added before claiming the E2E release profile exercises every backup/restore path.

## P1 — DR and operational completeness
