x6c-co/cfgctl

cfgctl

Distributes Caddy configuration to the anycast edge fleet via rqlite. One Go binary, two roles:

  • operator CLI (push, site) — writes the intended config to rqlite.
  • per-POP daemon (agent, also reachable as cfgctld) — polls rqlite, renders a Caddyfile, and applies it to the local Caddy via the admin /load API.

It is the config half of the edge control plane; knit is the cert half. knit issues and distributes TLS certs over Valkey, while cfgctl never touches cert bytes and only references them by path. See DESIGN.md for the full design, the rationale, and the adversarial-review trail.

Model

The fleet config is global platform settings + a dynamic list of per-domain reverse-proxy sites (a multi-tenant CDN). Three rqlite tables (prefixed caddy_config_): global, region (per-POP overrides), and site (one row per proxied domain). A single meta version marker is bumped on any change; the agent polls only that.
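The version-marker idea can be sketched as follows. This is a minimal illustration, not cfgctl's code: the poller type and the readVersion and fetchAndApply callbacks are hypothetical stand-ins for the cheap marker read and the expensive fetch/render/load path.

```go
package main

import "fmt"

// poller tracks the last version this (hypothetical) agent applied.
type poller struct {
	applied int64
}

// tick runs one pass. readVersion stands in for the cheap read of the meta
// version marker; fetchAndApply stands in for the expensive path (fetch the
// tables, render, load). Both callbacks are assumptions of this sketch.
func (p *poller) tick(readVersion func() (int64, error), fetchAndApply func(int64) error) (bool, error) {
	v, err := readVersion()
	if err != nil {
		return false, err
	}
	if v == p.applied {
		return false, nil // marker unchanged: nothing else is queried
	}
	if err := fetchAndApply(v); err != nil {
		return false, err // applied stays put, so the next tick retries
	}
	p.applied = v
	return true, nil
}

func main() {
	version := int64(1)
	fetches := 0
	p := &poller{}
	read := func() (int64, error) { return version, nil }
	apply := func(int64) error { fetches++; return nil }

	p.tick(read, apply) // first pass sees a new version and fetches
	p.tick(read, apply) // unchanged marker: no fetch
	version = 2         // an operator write bumps the marker
	p.tick(read, apply)
	fmt.Println("fetches:", fetches, "applied:", p.applied)
}
```

The point of the design is that an idle fleet costs one tiny read per poll interval per POP, regardless of how many sites exist.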

Because the fleet is anycast, certs are issued centrally by knit and referenced by path, so cfgctl renders no ACME configuration. Anycast names (lg, apex, tenants) use knit-distributed static or wildcard certs. The single-box per-POP name (<airport>.<suffix>) serves a knit-distributed *.<suffix> wildcard (sites.pop.cert_path), so no POP self-issues; it falls back to Caddy automatic HTTPS only if that path is unset.
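A rough sketch of the cert-path fallback for the per-POP name, under stated assumptions: popSiteBlock, the key path parameter, the example paths, and the respond body are all hypothetical. Only the rule itself (static cert when the path is set, automatic HTTPS otherwise) comes from the text above. Caddy's tls directive takes a cert file and a key file, which is why the sketch assumes a companion key path.

```go
package main

import (
	"fmt"
	"strings"
)

// popSiteBlock renders a Caddyfile site block for <airport>.<suffix>.
// certPath mirrors sites.pop.cert_path; keyPath is an assumed companion
// setting that the source does not name.
func popSiteBlock(airport, suffix, certPath, keyPath string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "%s.%s {\n", airport, suffix)
	if certPath != "" && keyPath != "" {
		// Static cert distributed out of band: this POP issues nothing.
		fmt.Fprintf(&b, "\ttls %s %s\n", certPath, keyPath)
	}
	// With no tls line, Caddy's automatic HTTPS applies to the site.
	b.WriteString("\trespond \"ok\"\n")
	b.WriteString("}\n")
	return b.String()
}

func main() {
	fmt.Print(popSiteBlock("ewr", "example.net", "/etc/certs/wildcard.pem", "/etc/certs/wildcard.key"))
	fmt.Print(popSiteBlock("ewr", "example.net", "", ""))
}
```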

Install / build

Static, no CGO:

CGO_ENABLED=0 go build -o cfgctl .
# cfgctld is the same binary; symlink it or use `cfgctl agent`.
ln -s cfgctl cfgctld

Releases are built by goreleaser on v* tags (linux/darwin, amd64/arm64), shipping both cfgctl and cfgctld.

Operator CLI

# Platform config (global and/or one region per invocation).
cfgctl push --global global.json
cfgctl push --region ewr --overrides ewr.json

# Tenant proxy domains (the CDN site list) — each is one rqlite transaction.
cfgctl site set api.a5t.dev --upstream 10.0.0.5:8443 --scheme https --no-verify-upstream
cfgctl site set taskd.sh    --upstream 10.0.0.9:80   --scheme http --cert-ref taskd.sh \
                            --cache /static/*=300s --waf detection --waf-exclude /stream/*
cfgctl site set foo.a5t.dev -f foo.json     # JSON site config instead of flags
cfgctl site list
cfgctl site disable bad.a5t.dev             # keep the row, stop serving it
cfgctl site rm  old.a5t.dev

Every write:

  1. renders the full resulting config and validates it — including a caddy adapt dry-run against the real custom binary when one is on PATH (--caddy-bin to point at it; skipped with a note if absent);
  2. computes the fleet content hash and skips entirely if nothing changed (no-op, no version bump);
  3. writes the row(s) and the version bump in one transaction, guarded by a compare-and-set on the version (safe under concurrent pushes).

--dry-run validates and reports without writing.
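Steps 2 and 3 can be illustrated with an in-memory stand-in. This is a sketch under assumptions: the store type, contentHash, and push are hypothetical names, SHA-256 over JSON is an assumed hash choice, and how cfgctl actually expresses the guard against rqlite is not shown here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
	"sync"
)

var errConflict = errors.New("version moved since it was read")

// store is an in-memory stand-in for the config rows plus the version marker.
type store struct {
	mu      sync.Mutex
	version int64
	hash    string
}

// contentHash hashes the JSON form of cfg. encoding/json sorts map keys,
// so equal maps always produce equal hashes.
func contentHash(cfg map[string]any) (string, error) {
	raw, err := json.Marshal(cfg)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// push writes cfg only if its content differs from what is stored and the
// version still equals readVersion, the value the caller saw before rendering.
func (s *store) push(readVersion int64, cfg map[string]any) (bool, error) {
	h, err := contentHash(cfg)
	if err != nil {
		return false, err
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	if h == s.hash {
		return false, nil // identical content: no write, no version bump
	}
	if s.version != readVersion {
		return false, errConflict // another push won the race
	}
	s.hash = h
	s.version++ // content and version change together
	return true, nil
}

func main() {
	s := &store{}
	cfg := map[string]any{"host": "api.a5t.dev", "upstream": "10.0.0.5:8443"}
	fmt.Println(s.push(0, cfg)) // first write bumps the version
	fmt.Println(s.push(1, cfg)) // same content: no-op
	cfg["upstream"] = "10.0.0.6:8443"
	fmt.Println(s.push(0, cfg)) // stale version: conflict
}
```

A caller that gets the conflict error would re-read the current state, re-render, and retry, which is what makes concurrent pushes safe rather than last-writer-wins.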

The rqlite endpoint defaults to http://127.0.0.1:4001 (the local node forwards writes to the leader over the mTLS raft port); override with --rqlite-url / CFGCTL_RQLITE_URL.

Agent (per-POP daemon)

cfgctl agent --config /etc/cfgctl/agent.yaml
cfgctld                                  # same thing (argv[0] alias)
cfgctl agent --once                      # single reconcile pass and exit
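The argv[0] alias is a common single-binary pattern. A minimal sketch of how such dispatch can work follows; the subcommand function is hypothetical and not taken from cfgctl's source.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// subcommand picks the role from argv. Invoked as cfgctld (for example via
// the symlink), the binary behaves as if "agent" had been passed.
func subcommand(argv []string) (name string, rest []string) {
	if len(argv) == 0 {
		return "", nil
	}
	if filepath.Base(argv[0]) == "cfgctld" {
		return "agent", argv[1:]
	}
	if len(argv) > 1 {
		return argv[1], argv[2:]
	}
	return "", nil
}

func main() {
	name, rest := subcommand(os.Args)
	fmt.Println("subcommand:", name, "args:", rest)
}
```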

/etc/cfgctl/agent.yaml (every field is also a flag and an env var; precedence flag > env > file > default):

field          flag / env                                default
region         --region / CFGCTL_REGION                  (required)
rqlite_url     --rqlite-url / CFGCTL_RQLITE_URL          http://127.0.0.1:4001
caddy_admin    --caddy-admin / CFGCTL_CADDY_ADMIN        http://localhost:2019
poll_interval  --poll-interval / CFGCTL_POLL_INTERVAL    15s
metrics_addr   --metrics-addr / CFGCTL_METRICS_ADDR      127.0.0.1:9619
state_path     --state-path / CFGCTL_STATE_PATH          /var/lib/cfgctl/state.json
log_level      --log-level / CFGCTL_LOG_LEVEL            info

Note: production must template metrics_addr to the host's mesh IP (<wg_mesh_v4>:9619) so Prometheus can scrape it over WireGuard.
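The precedence rule (flag > env > file > default) can be sketched for a single string field. The resolve function is hypothetical, and treating an empty string as "unset" is an assumption of this sketch; the real agent may track explicitly set flags differently.

```go
package main

import (
	"fmt"
	"os"
)

// resolve returns the first value that is set, checking the flag, then the
// environment variable, then the config file, then the built-in default.
// An empty string counts as unset (an assumption of this sketch).
func resolve(flagVal, envName, fileVal, def string, lookupEnv func(string) (string, bool)) string {
	if flagVal != "" {
		return flagVal
	}
	if v, ok := lookupEnv(envName); ok && v != "" {
		return v
	}
	if fileVal != "" {
		return fileVal
	}
	return def
}

func main() {
	// No flag given and the file says 30s; CFGCTL_POLL_INTERVAL, if set in
	// the environment, would win over the file value.
	got := resolve("", "CFGCTL_POLL_INTERVAL", "30s", "15s", os.LookupEnv)
	fmt.Println("poll_interval:", got)
}
```

Passing the lookup function in, rather than calling os.LookupEnv directly, keeps the rule testable without touching the process environment.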

Each pass polls the meta version with rqlite read consistency level=none against the local node. On a bump, the agent fetches global + this region + enabled sites, merges them, renders a Caddyfile, and POSTs it to /load. On a rejection the agent quarantines the changed site(s) and re-applies the good subset, so one bad domain can't block the rest or take down live domains. Success is decided by the /load response body, not the status alone: a trailing {"error":…} is a failure even with HTTP 200 (caddy #7246).

State (applied version + per-site hashes) survives restarts; the agent force-reconciles once on startup in case Caddy restarted to its baseline. Run Caddy with --resume so an API-loaded config is durable across Caddy restarts.
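Two pieces of that pass lend themselves to a short sketch: judging /load by its body, and the quarantine retry. Both functions below are hypothetical. loadOK scans the body as a stream of JSON values and fails closed on anything unparseable, which is one possible reading of "trailing {"error":…}". applyWithQuarantine drops every changed site at once; the real agent may instead keep a site's last good version or narrow down the offender.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// loadOK reports whether a /load response counts as success. A 200 whose
// body contains a JSON object with an "error" key is treated as a failure.
func loadOK(status int, body []byte) bool {
	if status != http.StatusOK {
		return false
	}
	dec := json.NewDecoder(bytes.NewReader(body))
	for {
		var v any
		err := dec.Decode(&v)
		if err == io.EOF {
			return true // reached the end without seeing an error object
		}
		if err != nil {
			return false // unparseable body: fail closed
		}
		if obj, ok := v.(map[string]any); ok {
			if _, bad := obj["error"]; bad {
				return false
			}
		}
	}
}

// applyWithQuarantine loads every site first. If that is rejected, it sets
// aside the sites that changed since the last good apply and retries with
// the untouched remainder.
func applyWithQuarantine(sites []string, changed map[string]bool, load func([]string) bool) (applied, quarantined []string) {
	if load(sites) {
		return sites, nil
	}
	var good []string
	for _, s := range sites {
		if changed[s] {
			quarantined = append(quarantined, s)
		} else {
			good = append(good, s)
		}
	}
	if load(good) {
		return good, quarantined
	}
	return nil, quarantined
}

func main() {
	// Fake admin endpoint: answers 200 either way, with an error body when
	// the bad site is part of the config.
	load := func(sites []string) bool {
		for _, s := range sites {
			if s == "bad.a5t.dev" {
				return loadOK(http.StatusOK, []byte(`{"error":"adapt failed"}`))
			}
		}
		return loadOK(http.StatusOK, nil)
	}
	applied, q := applyWithQuarantine(
		[]string{"api.a5t.dev", "bad.a5t.dev", "taskd.sh"},
		map[string]bool{"bad.a5t.dev": true},
		load,
	)
	fmt.Println("applied:", applied, "quarantined:", q)
}
```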

Metrics

Prometheus on metrics_addr:

  • caddy_config_applied_version (gauge)
  • caddy_config_apply_success (gauge, 1/0)
  • caddy_config_last_apply_timestamp (gauge)
  • caddy_config_load_failures_total (counter)
  • caddy_config_quarantined_sites (gauge)
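For orientation, a scrape of these metrics would look roughly like the output of the sketch below, which hand-rolls the Prometheus text exposition format with the standard library. The metric type, the exposition function, and all values are hypothetical; this is not necessarily how the agent serves its metrics.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// metric is one sample: a name, a Prometheus type (gauge or counter), and a value.
type metric struct {
	name  string
	kind  string
	value float64
}

// exposition renders samples in the Prometheus text format: a TYPE comment
// line followed by "name value" for each metric.
func exposition(ms []metric) string {
	var b strings.Builder
	for _, m := range ms {
		fmt.Fprintf(&b, "# TYPE %s %s\n", m.name, m.kind)
		fmt.Fprintf(&b, "%s %s\n", m.name, strconv.FormatFloat(m.value, 'f', -1, 64))
	}
	return b.String()
}

func main() {
	// All values below are made up for illustration.
	fmt.Print(exposition([]metric{
		{"caddy_config_applied_version", "gauge", 42},
		{"caddy_config_apply_success", "gauge", 1},
		{"caddy_config_last_apply_timestamp", "gauge", 1700000000},
		{"caddy_config_load_failures_total", "counter", 3},
		{"caddy_config_quarantined_sites", "gauge", 0},
	}))
}
```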

Development

Unit tests need nothing external. The render tests adapt-validate against a real custom Caddy when CADDY_BIN is set; the end-to-end test runs a real single-node rqlite and a mock Caddy admin:

go test ./...                                            # unit
CADDY_BIN=/path/to/custom/caddy go test ./internal/render/ -run TestRender -v
CFGCTL_IT_RQLITED=/path/to/rqlited CADDY_BIN=/path/to/custom/caddy \
  go test -tags integration -run TestE2E -v .

Pins: rqlite v10.2.0, custom Caddy v2.11.3-a5t.1 (Souin cache-handler + storages.cache.redis + deSEC + Coraza WAF). CI runs lint (golangci-lint v2), CGO_ENABLED=0 build, race tests, govulncheck, and the integration suite against both pinned binaries.
