Distributes Caddy configuration to the anycast edge fleet via rqlite. One Go binary, two roles:
- operator CLI (push, site): writes the intended config to rqlite.
- per-POP daemon (agent, also reachable as cfgctld): polls rqlite, renders a Caddyfile, and applies it to the local Caddy via the admin /load API.
It is the config half of the edge control plane; knit is the cert half (knit issues/distributes TLS certs over Valkey; cfgctl never touches cert bytes — it only references them by path). See DESIGN.md for the full design, the rationale, and the adversarial-review trail.
The fleet config is global platform settings + a dynamic list of per-domain
reverse-proxy sites (a multi-tenant CDN). Three rqlite tables (prefixed
caddy_config_): global, region (per-POP overrides), and site (one row per
proxied domain). A single meta version marker is bumped on any change; the
agent polls only that.
Because the fleet is anycast, certs are issued centrally by knit and
referenced by path — cfgctl renders no ACME. Anycast names (lg, apex,
tenants) use knit-distributed static/wildcard certs; the single-box per-POP name
(<airport>.<suffix>) serves a knit-distributed *.<suffix> wildcard
(sites.pop.cert_path) so no POP self-issues, falling back to Caddy automatic
HTTPS only if that path is unset.
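The cert-path fallback described above could be rendered along these lines. This is a minimal sketch, not cfgctl's actual renderer: the popSite type, the KeyPath field, the example hostname, and the file paths are all hypothetical, and the real output may use a different Caddyfile shape. The only Caddy behavior relied on is that the tls directive accepts a certificate file and a key file, and that omitting it leaves automatic HTTPS in effect.

```go
package main

import (
	"fmt"
	"strings"
)

// popSite is a hypothetical, reduced view of the per-POP site settings.
type popSite struct {
	Host     string // per-POP hostname (example value only)
	CertPath string // knit-distributed wildcard cert; empty means unset
	KeyPath  string // hypothetical: path to the matching private key
}

// renderPOP emits a Caddyfile site block. With a cert path it pins the
// certificate via the tls directive; without one it emits no tls line,
// which leaves Caddy's automatic HTTPS in effect.
func renderPOP(s popSite) string {
	var b strings.Builder
	fmt.Fprintf(&b, "%s {\n", s.Host)
	if s.CertPath != "" {
		fmt.Fprintf(&b, "\ttls %s %s\n", s.CertPath, s.KeyPath)
	}
	b.WriteString("\trespond \"ok\"\n}\n")
	return b.String()
}

func main() {
	// Hypothetical paths; the real locations are wherever knit writes them.
	fmt.Print(renderPOP(popSite{
		Host:     "ewr.example.net",
		CertPath: "/etc/certs/wildcard.pem",
		KeyPath:  "/etc/certs/wildcard.key",
	}))
	// No cert path: the block carries no tls line.
	fmt.Print(renderPOP(popSite{Host: "ewr.example.net"}))
}
```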
Static, no CGO:
CGO_ENABLED=0 go build -o cfgctl .
# cfgctld is the same binary; symlink it or use `cfgctl agent`.
ln -s cfgctl cfgctld
Releases are built by goreleaser on v* tags (linux/darwin,
amd64/arm64), shipping both cfgctl and cfgctld.
# Platform config (global and/or one region per invocation).
cfgctl push --global global.json
cfgctl push --region ewr --overrides ewr.json
# Tenant proxy domains (the CDN site list) — each is one rqlite transaction.
cfgctl site set api.a5t.dev --upstream 10.0.0.5:8443 --scheme https --no-verify-upstream
cfgctl site set taskd.sh --upstream 10.0.0.9:80 --scheme http --cert-ref taskd.sh \
--cache /static/*=300s --waf detection --waf-exclude /stream/*
cfgctl site set foo.a5t.dev -f foo.json # JSON site config instead of flags
cfgctl site list
cfgctl site disable bad.a5t.dev # keep the row, stop serving it
cfgctl site rm old.a5t.dev
Every write:
- renders the full resulting config and validates it, including a caddy adapt dry-run against the real custom binary when one is on PATH (use --caddy-bin to point at it; skipped with a note if absent);
- computes the fleet content hash and skips entirely if nothing changed (no-op, no version bump);
- writes the row(s) and the version bump in one transaction, guarded by a compare-and-set on the version (safe under concurrent pushes).
--dry-run validates and reports without writing.
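The hash-skip and compare-and-set steps of the write path can be sketched as follows. This is an illustration under stated assumptions, not the cfgctl implementation: the store type is a hypothetical in-memory stand-in for the rqlite tables, SHA-256 over the JSON encoding is an assumed choice of content hash, and the real tool performs the check and the bump inside one rqlite transaction rather than on a Go struct.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
)

// store is a hypothetical in-memory stand-in for the rqlite tables.
// It is not safe for concurrent use; rqlite provides that in the real tool.
type store struct {
	version int64  // the meta version marker
	hash    string // content hash of the last written config
}

var errConflict = errors.New("version changed concurrently")

// contentHash hashes the JSON encoding of cfg. encoding/json sorts map
// keys, so equal maps always encode to the same bytes.
func contentHash(cfg map[string]any) (string, error) {
	b, err := json.Marshal(cfg)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

// write stores cfg if it differs from what is already there. seen is the
// version the caller read before rendering; the write lands only if that
// version is still current.
func (s *store) write(cfg map[string]any, seen int64) (bool, error) {
	h, err := contentHash(cfg)
	if err != nil {
		return false, err
	}
	if h == s.hash {
		return false, nil // no-op: no write, no version bump
	}
	if s.version != seen {
		return false, errConflict // compare-and-set failed
	}
	s.hash = h
	s.version++
	return true, nil
}

func main() {
	s := &store{}
	cfg := map[string]any{"sites": []string{"api.a5t.dev"}}

	changed, err := s.write(cfg, 0) // first write lands and bumps the version
	fmt.Println(changed, err, s.version)

	changed, err = s.write(cfg, 1) // identical content is skipped
	fmt.Println(changed, err, s.version)
}
```

A caller that gets errConflict would re-read the current version, re-render, and try again, which is what makes concurrent pushes safe in this model.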
The rqlite endpoint defaults to http://127.0.0.1:4001 (the local node forwards
writes to the leader over the mTLS raft port); override with --rqlite-url /
CFGCTL_RQLITE_URL.
cfgctl agent --config /etc/cfgctl/agent.yaml
cfgctld # same thing (argv[0] alias)
cfgctl agent --once # single reconcile pass and exit
/etc/cfgctl/agent.yaml (every field is also a flag and an env var; precedence
flag > env > file > default):
| field | flag / env | default |
|---|---|---|
| region | --region / CFGCTL_REGION | (required) |
| rqlite_url | --rqlite-url / CFGCTL_RQLITE_URL | http://127.0.0.1:4001 |
| caddy_admin | --caddy-admin / CFGCTL_CADDY_ADMIN | http://localhost:2019 |
| poll_interval | --poll-interval / CFGCTL_POLL_INTERVAL | 15s |
| metrics_addr | --metrics-addr / CFGCTL_METRICS_ADDR | 127.0.0.1:9619 ※ |
| state_path | --state-path / CFGCTL_STATE_PATH | /var/lib/cfgctl/state.json |
| log_level | --log-level / CFGCTL_LOG_LEVEL | info |
※ Production must template metrics_addr to the host's mesh IP
(<wg_mesh_v4>:9619) so Prometheus can scrape it over WireGuard.
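The precedence rule (flag > env > file > default) amounts to taking the first source that sets a value. A minimal sketch, assuming an empty string means "unset"; the resolve helper is hypothetical, and the real agent may distinguish an explicitly empty flag from an absent one:

```go
package main

import "fmt"

// resolve returns the first non-empty value in precedence order:
// flag > env > file > default.
func resolve(flagVal, envVal, fileVal, def string) string {
	for _, v := range []string{flagVal, envVal, fileVal} {
		if v != "" {
			return v
		}
	}
	return def
}

func main() {
	// poll_interval set in the env and in the file: the env value wins.
	fmt.Println(resolve("", "30s", "1m", "15s"))
	// Nothing set anywhere: the documented default applies.
	fmt.Println(resolve("", "", "", "15s"))
}
```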
Each pass: poll the meta version (level=none, local node); on a bump fetch
global + this region + enabled sites, merge, render, and POST /load. On a
rejection the agent quarantines the changed site(s) and re-applies the good
subset, so one bad domain can't block the rest or take down live domains. Success
is decided by the /load response body (a trailing {"error":…} is a
failure even with HTTP 200 — caddy #7246), not the status alone. State (applied
version + per-site hashes) survives restarts; the agent force-reconciles once on
startup in case Caddy restarted to its baseline. Run Caddy with --resume so an
API-loaded config is durable across Caddy restarts.
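Two parts of the pass above lend themselves to a short sketch: deciding success from the /load response body, and retrying without the changed sites after a rejection. Everything here is illustrative. The loadErr and applyWithQuarantine helpers are hypothetical names, the exact body shape that Caddy returns is assumed (an object starting with `{"error"` at the very end of the body), and the sketch simply drops quarantined sites, whereas the real agent may handle them differently.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// loadErr decides whether a /load call succeeded. A 2xx status is not
// enough: a body that ends in a well-formed {"error": "..."} object is
// treated as a failure. Assumes no whitespace between '{' and "error".
func loadErr(status int, body []byte) error {
	if status < 200 || status > 299 {
		return fmt.Errorf("load: HTTP %d", status)
	}
	trimmed := bytes.TrimSpace(body)
	i := bytes.LastIndex(trimmed, []byte(`{"error"`))
	if i < 0 {
		return nil
	}
	var tail struct {
		Error string `json:"error"`
	}
	// Unmarshal rejects trailing data, so this only matches an error
	// object that really is the last thing in the body.
	if err := json.Unmarshal(trimmed[i:], &tail); err != nil || tail.Error == "" {
		return nil
	}
	return fmt.Errorf("load: %s", tail.Error)
}

// applyWithQuarantine tries every site first. If that is rejected, it sets
// the changed sites aside and retries with the rest.
func applyWithQuarantine(sites []string, changed map[string]bool,
	apply func([]string) error) ([]string, error) {
	if err := apply(sites); err == nil {
		return nil, nil
	}
	var good, quarantined []string
	for _, s := range sites {
		if changed[s] {
			quarantined = append(quarantined, s)
		} else {
			good = append(good, s)
		}
	}
	return quarantined, apply(good)
}

func main() {
	// HTTP 200 with a trailing error object still counts as a failure.
	fmt.Println(loadErr(200, []byte(`{"error":"adapting config failed"}`)))

	// Hypothetical apply that rejects any set containing bad.a5t.dev.
	apply := func(sites []string) error {
		for _, s := range sites {
			if s == "bad.a5t.dev" {
				return errors.New("rejected")
			}
		}
		return nil
	}
	q, err := applyWithQuarantine(
		[]string{"api.a5t.dev", "bad.a5t.dev"},
		map[string]bool{"bad.a5t.dev": true}, apply)
	fmt.Println(q, err)
}
```

Treating an unparseable tail as success keeps the sketch short; a production check would more likely fail closed on any body it does not recognize.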
Prometheus on metrics_addr:
- caddy_config_applied_version (gauge)
- caddy_config_apply_success (gauge, 1/0)
- caddy_config_last_apply_timestamp (gauge)
- caddy_config_load_failures_total (counter)
- caddy_config_quarantined_sites (gauge)
Unit tests need nothing external. The render tests adapt-validate against a real
custom Caddy when CADDY_BIN is set; the end-to-end test runs a real single-node
rqlite and a mock Caddy admin:
go test ./... # unit
CADDY_BIN=/path/to/custom/caddy go test ./internal/render/ -run TestRender -v
CFGCTL_IT_RQLITED=/path/to/rqlited CADDY_BIN=/path/to/custom/caddy \
go test -tags integration -run TestE2E -v .
Pins: rqlite v10.2.0, custom Caddy v2.11.3-a5t.1 (Souin cache-handler +
storages.cache.redis + deSEC + Coraza WAF). CI runs lint (golangci-lint v2),
CGO_ENABLED=0 build, race tests, govulncheck, and the integration suite
against both pinned binaries.