Problem
firewall.Stack.ensureContainer (internal/controlplane/firewall/stack.go:791-821) decides whether to recreate the Envoy/CoreDNS sibling containers based on only two drift labels (driftLabels, lines 759-764):
- dev.clawker.firewall.infra_certs_ready
- dev.clawker.firewall.otel_infra_port
If both match the desired spec and the container is running, ensureContainer returns "already running" and no restart or recreate happens.
This is the same class of bug as the CP container drift gate (fixed in #300 via LabelCPBinarySHA), but the sibling gate is much narrower than the CP one.
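The gate described above can be sketched as follows. This is a minimal illustration, not the actual stack.go code: containerState, alreadyRunning, and the label values (including the port) are hypothetical and chosen for the example; only the two label keys come from this issue.

```go
package main

import "fmt"

// containerState is a hypothetical, simplified view of a sibling container.
type containerState struct {
	Running bool
	ImageID string            // present on the container, but never consulted by the gate
	Labels  map[string]string // labels stamped at create time
}

// alreadyRunning mirrors the decision described above: the container is
// left alone when it is running and every desired drift label matches.
func alreadyRunning(c containerState, desired map[string]string) bool {
	if !c.Running {
		return false
	}
	for key, want := range desired {
		if got, ok := c.Labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	desired := map[string]string{
		"dev.clawker.firewall.infra_certs_ready": "true",
		"dev.clawker.firewall.otel_infra_port":   "4317", // placeholder value
	}
	stale := containerState{
		Running: true,
		ImageID: "sha256:old", // placeholder: image has since been bumped
		Labels: map[string]string{
			"dev.clawker.firewall.infra_certs_ready": "true",
			"dev.clawker.firewall.otel_infra_port":   "4317",
		},
	}
	// Labels match, so the stale image goes unnoticed.
	fmt.Println(alreadyRunning(stale, desired)) // prints "true"
}
```

Nothing in the comparison looks at the image, command, or mounts, which is why the changes listed in the next section slip through.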
What drift labels DO NOT cover
A clawker CLI upgrade that changes any of the following will leave the running Envoy/CoreDNS containers serving stale state until something else forces a recreate:
- Envoy image digest: the envoyImage const at stack.go:36 (pinned envoyproxy/envoy:distroless-v1.37.1@sha256:...). Bumping the SHA makes ensureEnvoyImage pull the new image, but ensureContainer doesn't compare image digests against the running container, so the existing container keeps the old Envoy binary.
- Embedded CoreDNS binary: embed_coredns.go ships coredns-clawker as a //go:embed asset. ensureCorednsImage rebuilds coredns-clawker:latest locally and the image ID changes, but the existing container is not recreated.
- envoy_config.go / coredns_config.go template changes: ensureConfigs rewrites envoy.yaml / Corefile to disk on every EnsureRunning, but Envoy and CoreDNS read them only at startup. Without a recreate (or Reload), the new files sit on disk while the processes serve the old in-memory config until the next rule mutation triggers Stack.Reload.
- containerSpec.cmd / mounts / env shape changes: Docker preserves these from create time, and they are not part of the drift comparison.
Why this is security-relevant
Firewall sibling containers ARE the egress enforcement plane. The CP drift gate exists precisely so a security fix to the CP can land via an ordinary clawker run after a CLI upgrade. The same expectation must hold for Envoy/CoreDNS. Today, however, an Envoy CVE bump (image digest), a CoreDNS dnsbpf plugin fix (embedded binary), or a deny-chain hardening (config template) won't reach users until one of the following happens:
- a rule mutation occurs (which calls Stack.Reload → reloadContainer, which uses the same drift labels and has the same gap), or
- the user manually docker rms the sibling containers, or
- one of the two existing drift labels (infra_certs_ready, otel_infra_port) happens to flip.
The CP container itself is correctly replaced on every binary change. During boot it eventually calls Stack.EnsureRunning (via FirewallInit), which then short-circuits on the narrow label match. Net effect: the CP upgrades cleanly; the security-critical proxy and DNS resolver it manages do not.
Fix shape
Stamp cpboot.cpBinaryHash() (or a firewall-subset hash) as a third drift label on both siblings:
const labelStackBuildSHA = "dev.clawker.firewall.stack_build_sha"

func (s *Stack) driftLabels() map[string]string {
	return map[string]string{
		labelInfraCertsReady: strconv.FormatBool(s.infraCertsReady),
		labelOtelInfraPort:   strconv.Itoa(...),
		labelStackBuildSHA:   stackBuildSHA, // injected at NewStack
	}
}
Any change to the embedded CP binary → siblings recreate. Trade-off: every CP rebuild churns Envoy+CoreDNS even when only unrelated CP code changed. Acceptable — siblings are stateless, recreate is sub-second.
A tighter scope (hash only firewall/ package + embed_coredns.go output + envoyImage const) would minimize churn but needs build-time tooling to compute. Start with the broad hash; narrow later if churn matters.
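Either hash can be produced the same way once its inputs are chosen. The sketch below assumes the inputs are available as byte slices (for the broad hash, the CP binary's bytes; for a narrower one, entries such as the envoyImage string and the embedded CoreDNS bytes). hashInputs, the input names, and the placeholder values are illustrative and not part of the existing code; how the inputs are gathered is left open.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"sort"
)

// hashInputs returns a hex SHA-256 over a set of named inputs. Names are
// sorted and every field is length-prefixed, so the digest does not depend
// on map iteration order and field boundaries are unambiguous.
func hashInputs(inputs map[string][]byte) string {
	names := make([]string, 0, len(inputs))
	for name := range inputs {
		names = append(names, name)
	}
	sort.Strings(names)

	h := sha256.New()
	var lenBuf [8]byte
	writeField := func(b []byte) {
		binary.BigEndian.PutUint64(lenBuf[:], uint64(len(b)))
		h.Write(lenBuf[:])
		h.Write(b)
	}
	for _, name := range names {
		writeField([]byte(name))
		writeField(inputs[name])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Hypothetical inputs for a firewall-subset hash; values are placeholders.
	before := map[string][]byte{
		"envoyImage":    []byte("envoyproxy/envoy@sha256:old"),
		"corednsBinary": {0x01, 0x02},
	}
	after := map[string][]byte{
		"envoyImage":    []byte("envoyproxy/envoy@sha256:new"),
		"corednsBinary": {0x01, 0x02},
	}
	// A digest bump changes the label value, which would trip the drift gate.
	fmt.Println(hashInputs(before) != hashInputs(after)) // prints "true"
}
```

The resulting string would be what gets passed in as stackBuildSHA; starting with the broad hash simply means the input set has a single entry.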
Acceptance
- driftLabels() returns a third label keyed on a CP-binary-derived SHA (or narrower firewall-subset SHA)
- ensureContainer / reloadContainer emit event=firewall_container_spec_drift with the new label fields when the SHA mismatches
- [...] EnsureRunning calls, assert sibling containers are recreated
- [...] infra_certs_ready / otel_infra_port semantics
Context
Surfaced during follow-up review of #300 (CP drift gate). Same conceptual gap, separate code path. See the resilience contract in internal/controlplane/CLAUDE.md: security-critical infrastructure must propagate updates via the same clawker run path users already trigger naturally.
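For the drift event named in the acceptance criteria, the log line is most useful when it says which labels differ. A minimal sketch, assuming a hypothetical driftedLabels helper and a plain key=value output format; the real event plumbing is not shown in this issue, and the label values are placeholders.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// driftedLabels returns the sorted keys whose desired value differs from,
// or is missing in, the running container's labels.
func driftedLabels(desired, actual map[string]string) []string {
	var keys []string
	for key, want := range desired {
		if got, ok := actual[key]; !ok || got != want {
			keys = append(keys, key)
		}
	}
	sort.Strings(keys)
	return keys
}

func main() {
	desired := map[string]string{
		"dev.clawker.firewall.infra_certs_ready": "true",
		"dev.clawker.firewall.stack_build_sha":   "new", // placeholder value
	}
	actual := map[string]string{
		"dev.clawker.firewall.infra_certs_ready": "true",
		// stack_build_sha is absent: this container predates the new label.
	}
	if drift := driftedLabels(desired, actual); len(drift) > 0 {
		fmt.Printf("event=firewall_container_spec_drift labels=%s\n", strings.Join(drift, ","))
	}
}
```

Treating a missing label as drift means containers created before the new label existed would be recreated on the first run after the upgrade, which appears to be the intent of the fix.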