Firewall sibling drift gate misses image/binary/config changes #308

@schmitthub

Problem

firewall.Stack.ensureContainer (internal/controlplane/firewall/stack.go:791-821) decides whether to recreate the Envoy/CoreDNS sibling containers based on only two drift labels (driftLabels, lines 759-764):

  • dev.clawker.firewall.infra_certs_ready
  • dev.clawker.firewall.otel_infra_port

If both match the desired spec and the container is running, ensureContainer returns "already running" and no restart/recreate happens.
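
To make the gap concrete, here is a minimal sketch of a label-only gate of the kind described above. It is not the code in stack.go: the container type, the alreadyRunning and labelsDrifted helpers, and the port value 4318 are all hypothetical, chosen only for illustration.

```go
package main

import "fmt"

// container is a hypothetical stand-in for the inspected state of a
// sibling container; the real code inspects it through the Docker client.
type container struct {
	Running bool
	Labels  map[string]string
}

// labelsDrifted reports whether any desired drift label is missing from,
// or differs in, the labels on the existing container.
func labelsDrifted(desired, actual map[string]string) bool {
	for key, want := range desired {
		if got, ok := actual[key]; !ok || got != want {
			return true
		}
	}
	return false
}

// alreadyRunning models the short-circuit: a running container whose
// drift labels all match is left untouched.
func alreadyRunning(c container, desired map[string]string) bool {
	return c.Running && !labelsDrifted(desired, c.Labels)
}

func main() {
	// Label values are illustrative assumptions.
	desired := map[string]string{
		"dev.clawker.firewall.infra_certs_ready": "true",
		"dev.clawker.firewall.otel_infra_port":   "4318",
	}
	existing := container{Running: true, Labels: map[string]string{
		"dev.clawker.firewall.infra_certs_ready": "true",
		"dev.clawker.firewall.otel_infra_port":   "4318",
	}}
	// No label records the image, binary, or config version, so a change
	// to any of those cannot make this gate report drift.
	fmt.Println(alreadyRunning(existing, desired)) // prints "true"
}
```

Anything not encoded in a label is invisible to this comparison, which is why the list of uncovered inputs below matters.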

This is the same class of bug as the CP container drift gate (fixed in #300 via LabelCPBinarySHA), but the sibling gate is much narrower than the CP one.

What drift labels DO NOT cover

A clawker CLI upgrade that changes any of the following will leave the running Envoy/CoreDNS containers serving stale state until something else forces a recreate:

  1. Envoy image digest: the envoyImage const at stack.go:36 (pinned envoyproxy/envoy:distroless-v1.37.1@sha256:...). Bumping the SHA updates ensureEnvoyImage so the new image is pulled, but ensureContainer doesn't compare image digests against the running container → existing container keeps the old Envoy binary.
  2. Embedded CoreDNS binary: embed_coredns.go ships coredns-clawker as a //go:embed asset. ensureCorednsImage rebuilds coredns-clawker:latest locally; the image ID changes, but the existing container is not recreated.
  3. envoy_config.go / coredns_config.go template changes: ensureConfigs rewrites envoy.yaml / Corefile to disk on every EnsureRunning, but Envoy/CoreDNS read them only at startup. Without a recreate (or Reload), the new files sit on disk while the processes serve the old in-memory config until the next rule mutation triggers Stack.Reload.
  4. containerSpec.cmd / mounts / env shape changes: Docker preserves these from create-time. Not in the drift comparison.

Why this is security-relevant

Firewall sibling containers ARE the egress enforcement plane. The CP drift gate exists precisely so a security fix to the CP can land via an ordinary clawker run after a CLI upgrade. The same expectation must hold for Envoy/CoreDNS — but today, an Envoy CVE bump (image digest), a CoreDNS dnsbpf plugin fix (embedded binary), or a deny-chain hardening (config template) won't reach users until either:

  • a rule mutation happens (which calls Stack.Reload → reloadContainer → uses the same drift labels, same gap), or
  • the user manually removes the sibling containers with docker rm, or
  • one of the two existing drift labels (infra_certs_ready, otel_infra_port) happens to flip.

The CP container itself is correctly replaced on every binary change. On boot it eventually calls Stack.EnsureRunning (via FirewallInit), which then short-circuits on the narrow label match. Net effect: CP upgrades cleanly; the security-critical proxy + DNS resolver it manages do not.

Fix shape

Stamp cpboot.cpBinaryHash() (or a firewall-subset hash) as a third drift label on both siblings:

const labelStackBuildSHA = "dev.clawker.firewall.stack_build_sha"

func (s *Stack) driftLabels() map[string]string {
    return map[string]string{
        labelInfraCertsReady: strconv.FormatBool(s.infraCertsReady),
        labelOtelInfraPort:   strconv.Itoa(...),
        labelStackBuildSHA:   stackBuildSHA,  // injected at NewStack
    }
}

Any change to the embedded CP binary → siblings recreate. Trade-off: every CP rebuild churns Envoy+CoreDNS even when only unrelated CP code changed. Acceptable — siblings are stateless, recreate is sub-second.

A tighter scope (hash only firewall/ package + embed_coredns.go output + envoyImage const) would minimize churn but needs build-time tooling to compute. Start with the broad hash; narrow later if churn matters.
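
As a rough illustration of what a narrower hash could cover, the sketch below digests only firewall-relevant inputs. Note that it is a different technique from the build-time source hash described above: it hashes values the process already holds at runtime, so it would not catch changes to code outside those inputs (for example the containerSpec shape). The stackInputs type, its fields, and subsetSHA are hypothetical names, and the sample values are placeholders.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// stackInputs is a hypothetical bundle of the values that determine what
// the sibling containers actually run.
type stackInputs struct {
	EnvoyImage       string // pinned image reference, including digest
	CoreDNSBinary    []byte // bytes of the embedded CoreDNS binary
	EnvoyTemplate    string // Envoy config template text
	CorefileTemplate string // Corefile template text
}

// subsetSHA returns a hex SHA-256 over all inputs. Each field is
// length-prefixed so that bytes cannot shift between adjacent fields
// and still produce the same digest.
func subsetSHA(in stackInputs) string {
	h := sha256.New()
	parts := [][]byte{
		[]byte(in.EnvoyImage),
		in.CoreDNSBinary,
		[]byte(in.EnvoyTemplate),
		[]byte(in.CorefileTemplate),
	}
	for _, part := range parts {
		var size [8]byte
		binary.BigEndian.PutUint64(size[:], uint64(len(part)))
		h.Write(size[:])
		h.Write(part)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Placeholder values, not the real pinned digest or templates.
	before := stackInputs{
		EnvoyImage:       "example/envoy@sha256:aaaa",
		CoreDNSBinary:    []byte("coredns-build-1"),
		EnvoyTemplate:    "envoy template v1",
		CorefileTemplate: "corefile template v1",
	}
	after := before
	after.EnvoyImage = "example/envoy@sha256:bbbb"

	oldSHA, newSHA := subsetSHA(before), subsetSHA(after)
	fmt.Println(oldSHA != newSHA) // prints "true": an image bump changes the label value
}
```

Whether a runtime digest like this is an acceptable substitute for a build-time hash is a design question for the fix; the broad CP-binary hash remains the simpler starting point.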

Acceptance

  • driftLabels() returns a third label keyed on a CP-binary-derived SHA (or narrower firewall-subset SHA)
  • ensureContainer / reloadContainer emit event=firewall_container_spec_drift with the new label fields when the SHA mismatches
  • Unit test: simulate SHA change between EnsureRunning calls, assert sibling containers are recreated
  • No change to existing infra_certs_ready / otel_infra_port semantics
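
The unit-test criterion could look roughly like the simulation below. It uses an in-memory fake instead of the real Stack and Docker client, so fakeRuntime, ensure, ensureRunning, and the container names "envoy" and "coredns" are all hypothetical; only the label key comes from the fix shape above.

```go
package main

import "fmt"

const labelStackBuildSHA = "dev.clawker.firewall.stack_build_sha"

// fakeRuntime is a hypothetical in-memory stand-in for the Docker client.
// It remembers the labels each container was created with and counts
// how many times each container has been created.
type fakeRuntime struct {
	labels  map[string]map[string]string
	creates map[string]int
}

func newFakeRuntime() *fakeRuntime {
	return &fakeRuntime{
		labels:  map[string]map[string]string{},
		creates: map[string]int{},
	}
}

// ensure (re)creates the named container when it does not exist yet or
// when any desired label is missing or different.
func (r *fakeRuntime) ensure(name string, desired map[string]string) {
	current, exists := r.labels[name]
	if exists {
		drift := false
		for key, want := range desired {
			if got, ok := current[key]; !ok || got != want {
				drift = true
				break
			}
		}
		if !drift {
			return // already running with matching labels
		}
	}
	stamped := make(map[string]string, len(desired))
	for key, value := range desired {
		stamped[key] = value
	}
	r.labels[name] = stamped
	r.creates[name]++
}

// ensureRunning models one EnsureRunning pass with the given build SHA.
func ensureRunning(r *fakeRuntime, sha string) {
	desired := map[string]string{labelStackBuildSHA: sha}
	for _, name := range []string{"envoy", "coredns"} {
		r.ensure(name, desired)
	}
}

func main() {
	r := newFakeRuntime()
	ensureRunning(r, "sha-old") // initial create
	ensureRunning(r, "sha-old") // same SHA: no recreate
	ensureRunning(r, "sha-new") // simulated CLI upgrade: recreate both
	fmt.Println(r.creates["envoy"], r.creates["coredns"]) // prints "2 2"
}
```

A real test would assert the same create counts against the project's own Docker test double and would also check that the existing two labels still behave as before.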

Context

Surfaced during follow-up review of #300 (CP drift gate). Same conceptual gap, separate code path. See the resilience contract in internal/controlplane/CLAUDE.md — security-critical infrastructure must propagate updates via the same clawker run path users already trigger naturally.

Labels: bug, security
