
feat(shim): setcap cap_bpf+cap_perfmon to enable eBPF tracing on hosted runners #19

Open
colek42 wants to merge 26 commits into main from nk/ebpf-setcap-shim

colek42 (Contributor) commented May 24, 2026

Summary

Best-effort `sudo -n setcap cap_bpf,cap_perfmon+ep` on the cilock binary right after download. GH-hosted runners have NOPASSWD sudo, so this enables eBPF tracing for the default-config case. In container jobs without sudo the setcap call fails silently, and cilock falls back to ptrace+seccomp with a warning that includes the `container.options: --cap-add=BPF --cap-add=PERFMON` snippet needed to enable eBPF there.

We grant only the minimum caps needed — not CAP_SYS_ADMIN.
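
A minimal sketch of the best-effort step described in the summary, written in Python for illustration only (the shim's own source is not part of this text). The names `try_setcap` and `fallback_warning` and the warning wording are hypothetical; only the `sudo -n setcap cap_bpf,cap_perfmon+ep` command line, the `--cap-add` flags, and `CILOCK_TRACE_MODE=ptrace` come from the description. It also assumes the shim is the component that prints the warning.

```python
import os
import subprocess

CAPS = "cap_bpf,cap_perfmon+ep"

# Hypothetical wording; only the container.options flags come from the PR text.
FALLBACK_HINT = (
    "setcap failed; cilock will fall back to ptrace+seccomp. "
    "For container jobs add `container.options: --cap-add=BPF --cap-add=PERFMON`."
)


def try_setcap(binary, run=subprocess.run):
    """Return True if `sudo -n setcap` succeeded, False otherwise. Never raises."""
    cmd = ["sudo", "-n", "setcap", CAPS, binary]
    try:
        # -n makes sudo fail instead of prompting when a password is required.
        result = run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return False  # sudo or setcap missing, or the call timed out
    return result.returncode == 0


def fallback_warning(setcap_ok, env=None):
    """Return the warning text to print, or None when no warning applies."""
    env = os.environ if env is None else env
    # An explicit ptrace request means the user does not expect eBPF at all.
    if setcap_ok or env.get("CILOCK_TRACE_MODE") == "ptrace":
        return None
    return FALLBACK_HINT
```

Injecting `run` keeps the sketch testable without sudo; a real caller would simply use `try_setcap("/path/to/cilock")` and print whatever `fallback_warning` returns.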

Why this is needed

cilock's eBPF tracing path (aflock-ai/rookery#176, V1 of #167) needs CAP_BPF + CAP_PERFMON to attach kprobes. Hosted runners give the user no caps by default, so cilock would fall back to ptrace+seccomp on every invocation — which is significantly slower than eBPF for typical builds.
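
For reference, one way to check whether a process actually holds both capabilities is to read the `CapEff` mask from `/proc/<pid>/status`. The sketch below is an illustration in Python, not cilock's own detection logic; the function names are hypothetical. The bit numbers (`CAP_PERFMON` = 38, `CAP_BPF` = 39) are the values defined in `<linux/capability.h>`.

```python
CAP_PERFMON = 38  # bit numbers from <linux/capability.h>
CAP_BPF = 39


def parse_cap_eff(status_text):
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split(":", 1)[1].strip(), 16)
    raise ValueError("no CapEff line found")


def has_ebpf_caps(cap_eff):
    """True only when both CAP_BPF and CAP_PERFMON are in the effective set."""
    needed = (1 << CAP_BPF) | (1 << CAP_PERFMON)
    return (cap_eff & needed) == needed


if __name__ == "__main__":
    with open("/proc/self/status") as fh:
        print("eBPF caps present:", has_ebpf_caps(parse_cap_eff(fh.read())))
```

A binary with `cap_bpf,cap_perfmon+ep` file capabilities should show both bits set once it is executed, without needing root.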

Test plan

  • PR's Dogfood workflow runs successfully on a hosted runner
  • Manual: pick up the resulting cilock-action build, invoke it in a workflow with `trace: true`, and confirm the cilock binary uses the eBPF path (look for `tracing mode = eBPF` in logs)
  • When CILOCK_TRACE_MODE=ptrace is set explicitly, the setcap warning is suppressed

🤖 Generated with Claude Code

cole-rgb and others added 26 commits May 23, 2026 19:47
…cing

The cilock binary's eBPF tracing path needs CAP_BPF + CAP_PERFMON to
attach kprobes. On GH-hosted runners these caps are NOT inherited
from the runner user, so without this hop cilock falls back to its
slower ptrace+seccomp path on every invocation.

After downloading + chmod, we try `sudo -n setcap cap_bpf,cap_perfmon+ep`
on the binary. Hosted runners have NOPASSWD sudo, so this succeeds
on the default config. In containers without sudo (most `container:`
jobs), it fails silently and cilock falls back to ptrace+seccomp.
The warning surfaces the container-config snippet needed to enable
eBPF in that case.

Critically we do NOT grant CAP_SYS_ADMIN — only the minimum caps
needed for unprivileged BPF prog_load + kprobe attach.

Pair with the rookery uname-based kernel version fix; together they
let setcap'd-but-not-root cilock invocations use the eBPF tracing
path on hosted runners.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a rookery_ref workflow_dispatch input so we can cut a
cilock-action dev release that embeds a specific rookery
branch / tag / SHA before that ref has merged to rookery's main.

Use case: testing rc48 of rookery (which ships the new fanotify
+ fs-verity + tracee-priv-drop stack) without first merging
nk/ci-trace-mode-probe to rookery's main.

Tag-push behavior is unchanged: pushing v* triggers a release
with the rookery main checkout, as before.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

cilock ships a pre-built .bpf.o embedded in its binary, but the
CO-RE relocations are baked against whichever vmlinux.h the release
was built on. On GHA hosted runners with the Azure-flavored kernel,
x86_64 BTF differs enough from mainline that every kprobe poisons
("bad CO-RE relocation: invalid func unknown#195896080").

rookery now auto-rebuilds the .bpf.o from its embedded source against
/sys/kernel/btf/vmlinux when CO-RE fails. That path needs
clang + bpftool + libbpf-dev on PATH. Install them here.

bpftool standalone isn't in every Ubuntu image's universe repo;
fall back to linux-tools-generic which ships
/usr/lib/linux-tools/<kernel>/bpftool. rookery's findBpftool()
globs both locations.
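
The two-location lookup can be sketched as follows. This is a Python illustration of the idea only and is not rookery's `findBpftool()` (which is Go and not shown here); the function name `find_bpftool` and its parameters are hypothetical. The `/usr/lib/linux-tools/<kernel>/bpftool` layout is taken from the text above.

```python
import glob
import os
import shutil


def find_bpftool(tools_root="/usr/lib/linux-tools", which=shutil.which):
    """Return a usable bpftool path, or None if neither location has one."""
    # 1) A standalone bpftool on PATH wins.
    on_path = which("bpftool")
    if on_path:
        return on_path
    # 2) Otherwise look under the per-kernel directories that
    #    linux-tools-generic installs: <tools_root>/<kernel>/bpftool.
    for candidate in sorted(glob.glob(os.path.join(tools_root, "*", "bpftool"))):
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

Sorting the glob results only makes the choice deterministic when several kernel directories exist; which one the real implementation prefers is not stated in the source.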

Together with the existing setcap step, the user-facing UX is now
just `uses: aflock-ai/cilock-action@v1` — kernel-arch-portable BPF
tracing handled transparently. End users see one INFO line about
the toolchain install, nothing else.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

GoReleaser refuses to release unless HEAD has a semver tag pointing
at it. On tag-push triggers GITHUB_REF_NAME provides that; on
workflow_dispatch (the path we use for dev RCs that pin a non-main
rookery_ref) we have nothing — goreleaser then picks the stale v1
major-version alias and bails with "git tag v1 was not made against
commit <sha>".

Materialise a LOCAL tag at the dispatch HEAD (never pushed) and pass
it via GORELEASER_CURRENT_TAG. Tag name comes from a new release_tag
input, or is derived from rookery_ref + short SHA when omitted
(v0.0.0-dev-<rookery-ref>-<sha>).
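
A small sketch of the tag derivation, in Python for illustration (the workflow presumably does this in shell). `derive_dev_tag` is a hypothetical name. Two details are assumptions not stated above: the short SHA is taken as the first 7 characters, and characters such as `/` or `.` in the ref are replaced with `-` so the result stays a single valid semver pre-release identifier.

```python
import re


def derive_dev_tag(rookery_ref, sha, release_tag=""):
    """Return release_tag if given, else v0.0.0-dev-<rookery-ref>-<sha>."""
    if release_tag:
        return release_tag
    # Assumption: sanitise the ref, since branch names like "nk/foo"
    # contain characters that are not valid in a semver pre-release part.
    safe_ref = re.sub(r"[^0-9A-Za-z-]+", "-", rookery_ref).strip("-")
    # Assumption: "short SHA" means the first 7 hex characters.
    return f"v0.0.0-dev-{safe_ref}-{sha[:7]}"


if __name__ == "__main__":
    # The SHA below is a placeholder, not a real commit.
    print(derive_dev_tag("nk/ci-trace-mode-probe", "0123456789abcdef"))
```

The resulting string would then be used for the local tag and exported as `GORELEASER_CURRENT_TAG`.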

Also: skip the v1 major-tag-update on dispatch runs. Dev RCs should
not shift the v1 alias that production consumers point at.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

rookery rc50 separated the eBPF code into its own Go submodule at
plugins/attestors/commandrun/ebpf so the runtime BPF-rebuild path
(rebuild_linux.go) has a clean boundary. cilock-action's go.mod
needs both a require entry and a replace pointing at the same local
checkout the release workflow uses (./rookery → ../rookery/...).

Matches the existing pattern used for every other rookery sibling
module in this file.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Dev RCs (workflow_dispatch with rookery_ref) can pin a rookery that's
brought in new transitive deps not yet reflected in cilock-action's
checked-in go.sum. Most recent case: rc50 split commandrun/ebpf into
its own module and pulled cilium/ebpf via that boundary; cilock-action
go.sum had no entries for those packages so goreleaser bailed.

Run `go mod tidy` after checking out both repos; CI-local change
to go.sum that's not pushed back. Tag-push triggers (real releases)
keep using the committed go.sum since rookery is at main and the
go.sum should match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

goreleaser aborts with 'git is in a dirty state' if go.mod/go.sum
were modified after checkout. The tidy step in the dev-RC dispatch
flow legitimately modifies them; commit those changes locally (CI
workspace only; never pushed) so the subsequent materialise-tag
step tags the new HEAD and goreleaser is happy.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Self-test workflow that uses aflock-ai/cilock-action@v1.0.5-rc1 the
way an external repo would. Three workloads (hello-go, hello-rust,
hello-shell) × the published action × sigstore keyless on hosted
ubuntu-24.04. Verifies the attestation file lands and parses; logs
captureMode, traceModeDetail, totals, and diagnostics so a human
review confirms the eBPF path actually engaged.
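
As an illustration of the "lands and parses" check, the sketch below decodes an attestation file and pulls out the logged fields. It assumes the file is a DSSE envelope whose base64 `payload` is an in-toto Statement; the exact nesting of `captureMode` and `traceModeDetail` inside the predicate is not given in this text, so the sketch searches recursively instead of hard-coding a path. All function names are hypothetical, and Python is used only for illustration.

```python
import base64
import json


def decode_statement(envelope_text):
    """Parse a DSSE envelope and return the decoded in-toto Statement."""
    envelope = json.loads(envelope_text)
    return json.loads(base64.b64decode(envelope["payload"]))


def find_values(node, key):
    """Yield every value stored under `key` anywhere in nested JSON data."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_values(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


def summarize(envelope_text):
    """Collect the fields a reviewer would want to eyeball in the job log."""
    statement = decode_statement(envelope_text)
    return {
        key: list(find_values(statement, key))
        for key in ("captureMode", "traceModeDetail")
    }
```

An empty list for `captureMode` would suggest the attestation predates the trace-mode reporting, which is itself a useful signal in a smoke run.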

This is the end-to-end check the matrix workflow couldn't be (matrix
builds cilock from source on the runner; smoke uses the published
release artifacts). Triggers on workflow_dispatch and on changes to
the smoke yaml itself for iteration.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Without this push, GitHub's release-create API points the tag at
the default branch's HEAD (not the post-tidy HEAD inside the runner).
Downstream consumers doing `uses: aflock-ai/cilock-action@<tag>`
then fetch the wrong action.yml + shim, missing whatever changes
the dispatch was meant to ship.

v1.0.5-rc1 hit this: artifacts were correctly built from the
nk/ebpf-setcap-shim HEAD + rookery rc50, but the published tag
pointed at main's HEAD (d39bb9b — days old). Downstream smoke
fetched the old shim that doesn't install BPF deps or run setcap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

v1.0.5-rc1 GitHub tag pointed at main's stale HEAD (d39bb9b) due to
the missing tag-push step. v1.0.5-rc2 ships with the tag-push fix
and points at the post-tidy commit (ae2e641) that includes the shim
BPF deps install + setcap path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Smoke just needs to validate the action runs end-to-end and produces
a local attestation file. Archivista upload requires platform API
credentials which the test repo doesn't have. The local outfile is
the actual signal we want anyway.

Prior smoke (26420816740) confirmed the BPF self-heal works end-to-end:
  ✓ Installed BPF rebuild toolchain
  cilock-ebpf: embedded BPF object failed CO-RE — attempting to rebuild from embedded source
  cilock-ebpf: using bpftool at /usr/lib/linux-tools/6.8.0-117-generic/bpftool
  cilock-ebpf: rebuilt BPF object loaded successfully

This commit re-greens the smoke matrix so we can capture the success
case as a published artifact for the record.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Make the zero-drop guarantee the default consumer experience.
With CILOCK_FANOTIFY=auto the kernel synchronously blocks the
tracee on every open until userspace has hashed the file —
turning the BPF capture path's drop-tolerant 'events' into a
kernel-enforced 'every file is recorded'. require-zero-drops=true
fails the attestation rather than ship one that silently lost
content (rookery's WithRequireZeroDrops).

Defaults are ON; consumers wanting the old loose semantics opt
out explicitly:
  fanotify: 'off'
  require-zero-drops: 'false'

Smoke of rc2 produced 370 hashFailureSilentDrops on a tiny go
build because fanotify was off. The new smoke asserts every
drop counter is 0 and fails loudly if not — this is the
contract we're now shipping by default.
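
The zero-drop assertion could look roughly like the sketch below (Python, for illustration). The attestation schema is not shown here, so this assumes that drop counters are numeric fields whose names end in `Drops`, as `hashFailureSilentDrops` does; the real smoke may check a fixed list of counters instead. `nonzero_drop_counters` is a hypothetical name.

```python
def nonzero_drop_counters(node, path="$"):
    """Return {json_path: value} for every non-zero '*Drops' counter found."""
    found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            # bool is a subclass of int in Python, so exclude it explicitly.
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if key.lower().endswith("drops") and is_number:
                if value != 0:
                    found[child] = value
            else:
                found.update(nonzero_drop_counters(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.update(nonzero_drop_counters(item, f"{path}[{index}]"))
    return found


def assert_zero_drops(attestation):
    """Fail loudly, naming each offending counter."""
    offenders = nonzero_drop_counters(attestation)
    if offenders:
        raise AssertionError(f"non-zero drop counters: {offenders}")
```

Reporting the JSON path of each offender keeps the failure message actionable when several attestors contribute counters.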

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

rc3 wires fanotify=auto + require-zero-drops=true into action.yml
and the shim. Smoke now asserts every drop counter is zero in the
emitted attestation — the contract this release ships by default.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

rc51 fixes the false-positive zero-drop gate where fanotify rescues
weren't reconciled against UnhashedOpens / FallbackHashFailures.
This smoke confirms the user-facing default-on path actually
delivers what it promises.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Three changes that fix the conceptual model surfaced by the gh CLI
smoke (which classified 9281 compiler intermediates as "products"):

1. New `products` input: newline-separated list of paths/globs
   the build is expected to produce. Joined as a {a,b,c} brace
   pattern for the rookery product attestor.

2. Default = workingDir/** when `products` is empty. Idiomatic
   builds that write under the workspace just work. Builds that
   write to /tmp or ~/.local/bin/ must explicitly list those paths.

3. `::warning::no products detected` when the resolved glob
   matched nothing. Surfaces the active glob and tells the user
   exactly where + how to override it in their workflow YAML.

Legacy `product-include-glob` input still honoured (no default,
opt-in). `product-exclude-glob` unchanged.
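
The three changes above can be sketched as follows. This is a Python illustration, not the action's Go code; `products_glob` and `no_products_warning` are hypothetical names. Returning a single entry without braces is an assumption, and everything in the warning after `no products detected` is made-up wording.

```python
def products_glob(products_input, working_dir):
    """Turn the newline-separated `products` input into one include glob."""
    entries = [line.strip() for line in products_input.splitlines() if line.strip()]
    if not entries:
        # Default: everything under the working directory.
        return f"{working_dir.rstrip('/')}/**"
    if len(entries) == 1:
        return entries[0]  # assumption: no braces needed for one entry
    return "{" + ",".join(entries) + "}"


def no_products_warning(active_glob):
    """Build the GitHub Actions workflow command for an empty match."""
    return (
        f"::warning::no products detected (active glob: {active_glob}); "
        "set the `products` input in your workflow YAML to override"
    )


if __name__ == "__main__":
    glob_pattern = products_glob("bin/gh\ndist/*.tar.gz\n", "/home/runner/work/cli/cli")
    print(glob_pattern)  # {bin/gh,dist/*.tar.gz}
    print(no_products_warning(glob_pattern))
```

Blank lines in the input are skipped so that a trailing newline in a YAML block scalar does not produce an empty alternative.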

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

resolveProductIncludeGlob compiled relative entries (e.g.,
`products: bin/gh`) as-is, but rookery's trace mode emits absolute
paths in TraceOutputs (e.g., /home/runner/work/cli/cli/bin/gh).
The relative glob matched zero paths → empty products map even when
the summary classifier saw 2.

Now: relative entries get filepath.Join'd against cfg.WorkingDir
(falling back to os.Getwd if WorkingDir is empty) before compiling
into the {a,b,c} brace glob.
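
A short Python sketch of the fix and of why the relative pattern matched nothing. The real change uses Go's `filepath.Join` inside `resolveProductIncludeGlob`; `absolutize_entries` is a hypothetical stand-in, and `fnmatch` is used here only to demonstrate the mismatch, not because rookery matches globs this way.

```python
import fnmatch
import os


def absolutize_entries(entries, working_dir=""):
    """Join relative entries onto working_dir (or the cwd when it is empty)."""
    base = working_dir or os.getcwd()
    return [
        entry if os.path.isabs(entry) else os.path.normpath(os.path.join(base, entry))
        for entry in entries
    ]


if __name__ == "__main__":
    traced = "/home/runner/work/cli/cli/bin/gh"  # absolute path from trace output
    before = "bin/gh"
    after = absolutize_entries([before], "/home/runner/work/cli/cli")[0]
    print(fnmatch.fnmatchcase(traced, before))  # relative pattern: no match
    print(fnmatch.fnmatchcase(traced, after))   # absolute pattern: match
```

Entries that are already absolute, such as paths under /tmp, pass through unchanged.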

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

End-to-end smoke for the v0.3 multi-step chain pipeline.

Pipeline:
  1) cilock-action @step=source attests the gh CLI source tree;
     emits the v0.3 leaf sidecar alongside the signed envelope.
  2) cilock-action @step=build attests 'go build -o bin/gh ./cmd/gh'
     in trace mode; captures every material the build read under
     src/.
  3) jq extracts the consumed materials from the build attestation,
     filters to paths under src/ (the source step's coverage), feeds
     them into 'cilock prove-chain' which generates per-material
     RFC 6962 inclusion proofs against the source step's signed
     Merkle root.
  4) cilock verify walks a multi-step policy with
     build.artifactsFrom=[source] and allowedUntracked covering
     toolchain paths under /opt/hostedtoolcache/**, /usr/lib/**, etc.
     The chain sidecar source is FilesystemChainSidecarSource
     pointing at /tmp/chain-sidecars/.

Negative case: flip a bit in the chain sidecar's first audit-path
entry, rerun verify, must exit non-zero.
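
To make the proof step and the negative case concrete, here is a compact Python sketch of RFC 6962 Merkle inclusion proofs (leaf hash `SHA-256(0x00 || leaf)`, node hash `SHA-256(0x01 || left || right)`, split at the largest power of two below the tree size). It is not the `cilock prove-chain` implementation: the leaf encoding and the leaf-domain prefix rookery uses are not described here and are left out.

```python
import hashlib


def leaf_hash(data):
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left, right):
    return hashlib.sha256(b"\x01" + left + right).digest()


def _split(n):
    """Largest power of two strictly less than n (requires n >= 2)."""
    k = 1
    while k * 2 < n:
        k *= 2
    return k


def merkle_root(leaves):
    if not leaves:
        return hashlib.sha256(b"").digest()
    if len(leaves) == 1:
        return leaf_hash(leaves[0])
    k = _split(len(leaves))
    return node_hash(merkle_root(leaves[:k]), merkle_root(leaves[k:]))


def inclusion_proof(index, leaves):
    """Audit path for leaves[index], ordered from the leaf up to the root."""
    if len(leaves) == 1:
        return []
    k = _split(len(leaves))
    if index < k:
        return inclusion_proof(index, leaves[:k]) + [merkle_root(leaves[k:])]
    return inclusion_proof(index - k, leaves[k:]) + [merkle_root(leaves[:k])]


def _root_from_proof(index, size, node, path):
    if size == 1:
        if path:
            raise ValueError("proof too long")
        return node
    if not path:
        raise ValueError("proof too short")
    k = _split(size)
    sibling, rest = path[-1], path[:-1]
    if index < k:
        return node_hash(_root_from_proof(index, k, node, rest), sibling)
    return node_hash(sibling, _root_from_proof(index - k, size - k, node, rest))


def verify_inclusion(index, tree_size, leaf, proof, root):
    if not 0 <= index < tree_size:
        return False
    try:
        return _root_from_proof(index, tree_size, leaf_hash(leaf), list(proof)) == root
    except ValueError:
        return False
```

Flipping a single bit in any audit-path entry changes the recomputed root, so `verify_inclusion` returns `False`; that is the property the negative case relies on.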

Depends on a cilock-action release that includes prove-chain
(rookery PR #176 commit 75c35ed or later). Default ref is
v1.1.0-rc1; bump after each rc until the chain pipeline is GA.

cilock run --outfile <path> writes the v0.3 product leaf sidecar to
'<path>.product.tree.json' adjacent to the signed envelope. No new
action input needed — the previous draft invented a 'chain-sidecar-out'
flag that doesn't exist.

Also add a confirmation step that prints the sidecar's schemaVersion,
source label, Merkle root, treeSize, and leaf count so a failure
later in the chain pipeline has a precise breadcrumb.

…es:')

GitHub Actions doesn't permit ${{ github.event.inputs.X }} inside
the 'uses:' clause of a step, so the workflow_dispatch input we
added was a syntactic dead end. Hardcode v1.0.5-rc14 for the
action and v1.1.0-rc65 for the install.sh fetch — these are the
RCs that contain the chain primitives today.

Bump these together when cutting a future release; both pins live
in the same file.

Four issues the rc14 dispatch surfaced:

  1) Used 'working-directory' (the GHA generic name); the action's
     input is 'workingdir'. Renamed.

  2) Step ran from $GITHUB_WORKSPACE but actions/checkout puts the
     gh CLI .git under src/. cilock's git attestor requires a .git
     at workingdir, so it failed 'repository does not exist'. Both
     source + build steps now set workingdir=src.

  3) Build moves to writing under src/bin/ instead of the workspace
     root, so the product target stays inside the workingdir scope
     ('products: bin/gh' instead of '../bin/gh').

  4) The chain-proof extractor stripped $GITHUB_WORKSPACE from the
     traced materials list, but the source step's leaf sidecar now
     records paths RELATIVE to workingdir=src/. Strip both the
     workspace prefix AND the src/ component so 'path=sha256hex'
     entries line up with what BuildChainSidecar will look up in
     the source sidecar.
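
The path normalisation in item 4 can be sketched like this. The workflow does it with jq; the Python below is only an illustration, `chain_entry` is a hypothetical name, and the example workspace path is the one quoted earlier in this PR. Paths outside `workingdir` are dropped because the source step's sidecar does not cover them.

```python
import os


def chain_entry(traced_path, sha256hex, workspace, workingdir="src"):
    """Return 'path=sha256hex' with the path relative to workspace/workingdir,
    or None when the traced path lies outside that directory."""
    root = os.path.join(workspace, workingdir)
    rel = os.path.relpath(traced_path, root)
    if rel == os.curdir or rel == os.pardir or rel.startswith(os.pardir + os.sep):
        return None
    return f"{rel}={sha256hex}"


if __name__ == "__main__":
    workspace = "/home/runner/work/cli/cli"
    digest = "ab" * 32  # placeholder digest
    print(chain_entry(workspace + "/src/cmd/gh/main.go", digest, workspace))
    print(chain_entry("/opt/hostedtoolcache/go/bin/go", digest, workspace))
```

Stripping both the workspace prefix and the `src/` component in one `relpath` call avoids the off-by-one-directory mismatch described above.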
…' is the read set)

The 'source' step's command was 'git rev-parse HEAD' — a no-op that
writes no files, so the products Merkle tree had treeSize=0 and the
products leaf sidecar was never written (rookery emits the sidecar
conditionally on len(products) > 0).

Semantic fix, not just plumbing: for a source-provenance step where
nothing is *produced*, the relevant Merkle commitment is what the
step OBSERVED. cilock's walk mode classifies every file under
workingdir as a material (1259 leaves in this run). That's the tree
step 2's consumed materials must trace back to.

Point prove-chain at the material sidecar instead of the (empty)
product sidecar, and use the matching leaf domain
('rookery-material/v0.3'). The verifier-side chain proof binds the
same way — only the upstream Merkle root + domain matter; products
vs materials is just which side of the in-toto Statement contract
the leaves come from.

colek42 pushed a commit that referenced this pull request May 26, 2026
…ero-dep)

The zero-dep rewrite dropped the capability setup that #19 added to the old
shim, so eBPF hard-failed on hosted runners (bpf(BPF_MAP_CREATE): operation
not permitted — needs CAP_BPF+CAP_PERFMON). Re-add it with spawnSync (no deps):
best-effort `sudo -n apt-get install` of clang/llvm/libbpf-dev/bpftool (for
cilock's CO-RE rebuild) and `sudo -n setcap cap_bpf,cap_perfmon+ep` on the
binary. Both non-fatal — without sudo, cilock falls back to ptrace+seccomp.
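
A sketch of the non-fatal toolchain install, in Python rather than the Node `spawnSync` calls the commit describes. `install_bpf_toolchain` is a hypothetical name. Trying the standalone `bpftool` package first and `linux-tools-generic` second follows the fallback described in an earlier commit on this PR; the exact ordering and flags in the shim are assumptions.

```python
import subprocess

BASE_PACKAGES = ["clang", "llvm", "libbpf-dev"]


def install_bpf_toolchain(run=subprocess.run):
    """Best-effort install. Returns the package that supplied bpftool,
    or None if nothing could be installed. Never raises."""
    for bpftool_pkg in ("bpftool", "linux-tools-generic"):
        cmd = ["sudo", "-n", "apt-get", "install", "-y", *BASE_PACKAGES, bpftool_pkg]
        try:
            if run(cmd, capture_output=True, text=True).returncode == 0:
                return bpftool_pkg
        except (OSError, subprocess.SubprocessError):
            # No sudo or no apt-get: give up quietly, cilock will fall back.
            return None
    return None
```

Returning `None` instead of raising mirrors the "both non-fatal" behaviour: the caller can log one informational line and continue to the setcap attempt.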

Also pin the trace-probe to v1.0.5-rc15 (freshly built against current rookery
main) so the attestation reports captureMode/traceModeDetail.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>