▶ Live demo: neurotrace.dexdevs.com — run it in your browser, free offline backend. Browse all 10 portfolio demos via the all demos link.
A causal-tracing benchmark for neural-network interpretability. NeuroTrace hand-compiles small networks whose ground-truth circuit is known by construction, then scores attribution methods on whether they recover it. The result is a clean two-effect dissociation: causal activation patching recovers the circuit where the cheap proxies each get fooled — correlation by a confounder, gradient saliency by a saturated decoy — and a control circuit proves each failure comes from the planted decoy, not the method in general.
Why it matters: saliency maps and correlations are the attributions people actually read off a model, and both can confidently point at the wrong thing. NeuroTrace turns that folklore into a measured, reproducible result against a known answer — no trained model, no labels to trust, no API keys.
$ neurotrace attribute --method gradient --circuit suppressor
method : gradient (proxy)
circuit : suppressor labels=real samples=512 seed=0
auroc : 0.7500 (1.0 = circuit fully recovered, 0.5 = chance)
features : f7=1.437 f6=1.236 f1*=0.663 f3*=0.587 f0*=0.471 f2*=0.425 f4=0.000 ...
(* = ground-truth circuit feature)
# ^ the two non-causal decoys (f6, f7) outrank every real circuit feature
$ neurotrace attribute --method patching --circuit suppressor
auroc : 1.0000 # causal tracing is not fooled: the decoys move the output by ~0

Every circuit is a compiled one-hidden-layer network (y = v · φ(Wx + b)) wired so the
output depends on exactly the causal features. Each regime plants a non-causal
decoy engineered to look important to one cheap method (the suppressor run above
shows two decoy features, f6 and f7); the clean control removes the decoy so that
the fooled method recovers the circuit.
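
As a rough illustration of "compiled, with known ground truth", the sketch below hand-sets the weights of a network of the form y = v · φ(Wx + b). It is not NeuroTrace's code: the choice of tanh for φ, the helper names `compile_circuit` and `forward`, the feature count, and the weight values are all assumptions made for the example.

```python
import numpy as np

def compile_circuit(n_features=4, causal=(0, 1)):
    """Hand-set weights for y = v . tanh(W x + b) (hypothetical helper).

    One hidden unit per causal feature; every other input column of W is
    zero, so the output cannot depend on the non-causal features.
    """
    W = np.zeros((len(causal), n_features))
    for unit, feature in enumerate(causal):
        W[unit, feature] = 1.0
    b = np.zeros(len(causal))
    v = np.ones(len(causal))
    return W, b, v

def forward(params, X):
    """Evaluate the network on a batch X of shape (n_samples, n_features)."""
    W, b, v = params
    return np.tanh(X @ W.T + b) @ v

params = compile_circuit()
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))

# Resample only the non-causal features (columns 2 and 3).
X_noncausal_changed = X.copy()
X_noncausal_changed[:, 2:] = rng.normal(size=(8, 2))

# Resample only one causal feature (column 0).
X_causal_changed = X.copy()
X_causal_changed[:, 0] = rng.normal(size=8)

print(np.abs(forward(params, X_noncausal_changed) - forward(params, X)).max())
print(np.abs(forward(params, X_causal_changed) - forward(params, X)).max())
```

Because the zero columns of W are set by hand, the ground-truth circuit here is simply the index set `causal`; nothing has to be inferred from a trained model.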
flowchart LR
subgraph circuits["compiled circuits (known ground truth)"]
clean["clean<br/>(control)"]
conf["confounded<br/>(corr. decoy)"]
supp["suppressor<br/>(gradient decoy)"]
end
subgraph methods["attribution methods"]
corr["correlation<br/><i>proxy</i>"]
grad["gradient<br/><i>proxy</i>"]
ig["integrated gradients<br/><i>causal</i>"]
patch["activation patching<br/><i>causal</i>"]
end
circuits -->|score per-feature<br/>importance| methods
methods -->|AUROC vs<br/>ground truth| gate["eval gate<br/>asserts the<br/>dissociation"]
- correlation: |corr(feature, output)|. A confounder correlated with the output fools it (it can't tell correlation from causation).
- gradient: mean |dy/dx|. A feature with a steep local slope but ~zero net effect (two near-cancelling saturated gates) fools it.
- integrated_gradients: the gradient integrated along a path from a baseline; reads the finite effect, so the saturated decoy doesn't fool it.
- patching: corrupt the feature, measure the output change. The causal-tracing primitive: robust to both decoys.
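
The four scores above can be sketched on a single toy network. Everything specific here is an assumption for illustration rather than NeuroTrace's implementation: the tanh nonlinearity, the weights, a confounder built as a noisy sum of the two causal inputs, a decoy made of two steep gates that cancel outside a narrow window, an all-zeros baseline for integrated gradients, and corruption by shuffling a column across samples for patching.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096
f0, f1 = rng.normal(size=n), rng.normal(size=n)          # causal features
f2 = (f0 + f1) / np.sqrt(2) + 0.1 * rng.normal(size=n)   # confounder, zero weight
f3 = rng.normal(size=n)                                   # feeds the gate pair
X = np.column_stack([f0, f1, f2, f3])

# y = v . tanh(W x + b). Units 2 and 3 are steep gates on f3 with opposite
# signs; both saturate and cancel except for f3 in roughly [0.50, 0.55].
k, c = 40.0, 1.5
W = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, k], [0, 0, 0, k]], dtype=float)
b = np.array([0.0, 0.0, -0.50 * k, -0.55 * k])
v = np.array([1.0, 1.0, c, -c])

def f(X):
    return np.tanh(X @ W.T + b) @ v

def grad(X):
    """Analytic dy/dx for every sample, shape (n_samples, n_features)."""
    z = X @ W.T + b
    return ((1.0 - np.tanh(z) ** 2) * v) @ W

def correlation(X, y):
    return np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

def gradient(X):
    return np.abs(grad(X)).mean(axis=0)

def integrated_gradients(X, steps=256):
    # Midpoint Riemann sum of the gradient along the straight path 0 -> x.
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = sum(grad(a * X) for a in alphas) / steps
    return np.abs(X * avg_grad).mean(axis=0)

def patching(X, rng):
    base = f(X)
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = X[rng.permutation(len(X)), j]   # corrupt one feature only
        scores.append(np.abs(f(Xp) - base).mean())
    return np.array(scores)

scores = {
    "correlation": correlation(X, f(X)),
    "gradient": gradient(X),
    "integrated_gradients": integrated_gradients(X),
    "patching": patching(X, rng),
}
for name, s in scores.items():
    print(f"{name:22s}", np.round(s, 3))
```

In this toy setup the confounder f2 gets the top correlation score and the gate feature f3 gets the top gradient score, while integrated gradients and patching rank both causal features above both decoys. The magnitudes depend on the assumed constants `k` and `c` and should not be read as the benchmark's numbers.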
Mean AUROC over 16 seeds, 512 samples/circuit, against the known circuit
(1.0 = perfect, 0.5 = chance). Reproduce with python -m evals.harness.
| method | family | clean (control) | confounded | suppressor |
|---|---|---|---|---|
| random | baseline | 0.496 | 0.496 | 0.496 |
| correlation | proxy | 1.000 | 0.750 ⤵ | 1.000 |
| gradient | proxy | 1.000 | 1.000 | 0.750 ⤵ |
| integrated_gradients | causal | 1.000 | 1.000 | 1.000 |
| patching | causal | 1.000 | 1.000 | 1.000 |
- Effect 1: confounding breaks correlation, not causation. Correlation drops 1.000 → 0.750 on confounded; patching stays 1.000.
- Effect 2: saturation breaks gradients, not causation. Gradient saliency drops 1.000 → 0.750 on suppressor; patching (and integrated gradients) stay 1.000.
- Each proxy fails only its own decoy: correlation is fine on suppressor, gradient is fine on confounded, so the collapse is the decoy, not the method.
- Scrambled-label null: shuffle the ground truth and every method falls to ~0.50, confirming the AUROC is real (full table in evals/RESULTS.md).
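
To make the metric and the scrambled-label null concrete, here is a minimal sketch of an AUROC over per-feature scores. The pairwise (Mann-Whitney) formulation, the function name `auroc`, and the score vector are assumptions: the scores are hypothetical values with 4 circuit features and 8 non-circuit features, two of which are decoys that outrank the circuit. They are not output from the benchmark.

```python
import numpy as np

def auroc(scores, is_causal):
    """Probability that a random circuit feature outscores a random
    non-circuit feature, counting ties as half (hypothetical helper)."""
    pos = scores[is_causal]
    neg = scores[~is_causal]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

# Hypothetical importance scores: features 0-3 are the circuit,
# features 6 and 7 are decoys that outrank every circuit feature.
scores = np.array([0.47, 0.66, 0.43, 0.59,
                   0.00, 0.02, 1.24, 1.44, 0.01, 0.03, 0.00, 0.02])
is_causal = np.zeros(scores.size, dtype=bool)
is_causal[:4] = True

real = auroc(scores, is_causal)

# Scrambled-label null: shuffle which features count as "circuit".
rng = np.random.default_rng(0)
null = np.mean([auroc(scores, rng.permutation(is_causal)) for _ in range(2000)])

print(f"real labels      : {real:.3f}")
print(f"scrambled labels : {null:.3f}")
```

With two of eight non-circuit features ranked above all four circuit features, 24 of the 32 circuit/non-circuit pairs are ordered correctly, which gives 0.75. Averaged over shuffles, the same scores sit near 0.5.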
pip install -e ".[dev]" # numpy only; no API keys, no downloads
neurotrace compare --circuit confounded # all methods on one circuit
neurotrace compare --circuit suppressor
neurotrace attribute --method correlation --circuit confounded
neurotrace circuits # describe the regimes
python -m evals.harness # write evals/RESULTS.md
python -m evals.gate # assert the dissociation (CI gate)
pytest -q # 48 tests

Configure via env vars (see .env.example): NEUROTRACE_METHOD,
NEUROTRACE_CIRCUIT, NEUROTRACE_LABELS, NEUROTRACE_SAMPLES, NEUROTRACE_SEED.
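
One way such variables might be read is sketched below. Only the five variable names come from this README; the `Settings` class, the `load_settings` helper, and every default value are assumptions (the defaults for labels, samples, and seed simply mirror the values printed in the example run above), so check .env.example for the real ones.

```python
import os
from dataclasses import dataclass
from typing import Mapping

@dataclass(frozen=True)
class Settings:
    method: str
    circuit: str
    labels: str
    samples: int
    seed: int

def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read NEUROTRACE_* variables, falling back to assumed defaults."""
    return Settings(
        method=env.get("NEUROTRACE_METHOD", "patching"),   # assumed default
        circuit=env.get("NEUROTRACE_CIRCUIT", "clean"),    # assumed default
        labels=env.get("NEUROTRACE_LABELS", "real"),       # assumed default
        samples=int(env.get("NEUROTRACE_SAMPLES", "512")), # assumed default
        seed=int(env.get("NEUROTRACE_SEED", "0")),         # assumed default
    )

example = load_settings({"NEUROTRACE_CIRCUIT": "suppressor",
                         "NEUROTRACE_SAMPLES": "128"})
print(example)
```

Passing the environment in as a mapping keeps the loader easy to test without touching the real process environment.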
docker build -t neurotrace .
docker run --rm neurotrace # runs the full offline benchmark

pip install -e ".[torch]" # then the skipped test runs

The otherwise-skipped test recomputes the gradient saliency with torch autograd and
asserts that it matches the hand-derived numpy Jacobian, which is evidence that the
analytic gradients are correct.
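
The same kind of cross-check can be sketched without torch. The example below swaps autograd for central finite differences, which is a different technique from the one the test uses, and compares it with a hand-derived gradient of y = v · tanh(Wx + b). The tanh choice, the random weights, and the function names are assumptions for the example.

```python
import numpy as np

def forward(W, b, v, x):
    return float(v @ np.tanh(W @ x + b))

def analytic_grad(W, b, v, x):
    """Hand-derived dy/dx: sum over units of v_h * (1 - tanh(z_h)^2) * W[h, :]."""
    z = W @ x + b
    return (v * (1.0 - np.tanh(z) ** 2)) @ W

def finite_diff_grad(W, b, v, x, eps=1e-5):
    """Central differences, one input coordinate at a time."""
    g = np.zeros_like(x)
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        g[j] = (forward(W, b, v, x + step) - forward(W, b, v, x - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
b = rng.normal(size=3)
v = rng.normal(size=3)
x = rng.normal(size=5)

max_err = np.max(np.abs(analytic_grad(W, b, v, x) - finite_diff_grad(W, b, v, x)))
print(f"max abs difference: {max_err:.2e}")
```

Agreement between two independent derivative computations is evidence that the analytic formula is right, which is the role the torch comparison plays in the test suite.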
- Offline & deterministic. numpy is the only runtime dependency; every number is produced from np.random.default_rng with a fixed salt, so CI reproduces the table bit-for-bit across Python 3.10–3.12.
- Compiled, not trained. Weights are set by hand so the ground truth is exact: no training noise, no "is this really the circuit?" ambiguity.
- The experiment is tuned, never the method. Decoy strengths are chosen to make the failure visible; the attribution methods are textbook implementations.
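
A minimal sketch of salted, seeded sampling with np.random.default_rng follows. The salt value, the helper names, and the way salt and seed are combined (passed together as a sequence of integers) are assumptions; the README states only that a fixed salt is used.

```python
import numpy as np

SALT = 20240  # hypothetical fixed salt, not the project's value

def make_rng(seed: int) -> np.random.Generator:
    # default_rng accepts a sequence of ints, so salt and seed can be mixed
    # into a single reproducible stream.
    return np.random.default_rng([SALT, seed])

def sample_inputs(seed: int, n_samples: int = 512, n_features: int = 8) -> np.ndarray:
    """Draw an input batch; n_features is a made-up value for the example."""
    return make_rng(seed).normal(size=(n_samples, n_features))

a = sample_inputs(seed=0)
b = sample_inputs(seed=0)
c = sample_inputs(seed=1)

print(np.array_equal(a, b))  # same salt and seed
print(np.array_equal(a, c))  # different seed
```

Building a fresh generator from (salt, seed) for each draw, rather than sharing global random state, is what makes each run independent of call order.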
See docs/ARCHITECTURE.md and
docs/DECISIONS.md for the full design.
MIT — see LICENSE.
