NeuroTrace

▶ Live demo: neurotrace.dexdevs.com (runs in your browser on a free offline backend). Browse all 10 portfolio demos via the all demos link.

A causal-tracing benchmark for neural-network interpretability. NeuroTrace hand-compiles small networks whose ground-truth circuit is known by construction, then scores attribution methods on whether they recover it. The result is a clean two-effect dissociation. Causal activation patching recovers the circuit in every regime, while each cheap proxy is fooled by its own decoy: correlation by a confounder, gradient saliency by a saturated decoy. A control circuit with no decoy shows that each failure comes from the planted decoy, not from the method in general.

Why it matters: saliency maps and correlations are the attributions people actually read off a model, and both can confidently point at the wrong thing. NeuroTrace turns that folklore into a measured, reproducible result against a known answer — no trained model, no labels to trust, no API keys.

Demo

$ neurotrace attribute --method gradient --circuit suppressor
method   : gradient  (proxy)
circuit  : suppressor   labels=real  samples=512  seed=0
auroc    : 0.7500   (1.0 = circuit fully recovered, 0.5 = chance)
features : f7=1.437 f6=1.236 f1*=0.663 f3*=0.587 f0*=0.471 f2*=0.425 f4=0.000 ...
           (* = ground-truth circuit feature)
#         ^ the two non-causal decoys (f6, f7) outrank every real circuit feature

$ neurotrace attribute --method patching --circuit suppressor
auroc    : 1.0000   # causal tracing is not fooled — the decoys move the output by ~0

How it works

Every circuit is a compiled one-hidden-layer network (y = v · φ(Wx + b)) wired so that the output depends on exactly the causal features. Each decoy regime plants one non-causal feature engineered to look important to one cheap method; the clean control omits the decoy, and the method that was fooled then recovers the circuit.
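As a rough illustration of what "compiled" means here, the sketch below sets the weights of y = v · φ(Wx + b) by hand so that only chosen features reach the output. It is not NeuroTrace's actual code: the names `compile_circuit` and `forward`, the choice of ReLU for φ, and the feature counts are all assumptions made for the example.

```python
import numpy as np

def compile_circuit(n_features=6, causal=(0, 1, 2)):
    """Hand-set weights (hypothetical layout): one hidden unit per causal feature."""
    W = np.zeros((len(causal), n_features))
    for h, f in enumerate(causal):
        W[h, f] = 1.0                 # hidden unit h reads only feature f
    b = np.zeros(len(causal))
    v = np.ones(len(causal))          # output sums the hidden units
    return W, b, v

def forward(X, W, b, v):
    """y = v . phi(W x + b), with phi assumed to be ReLU for this sketch."""
    return np.maximum(X @ W.T + b, 0.0) @ v

W, b, v = compile_circuit()
rng = np.random.default_rng(0)
X = rng.uniform(0.5, 1.5, size=(4, 6))    # positive inputs keep every ReLU active
y = forward(X, W, b, v)

X_noncausal = X.copy()
X_noncausal[:, 5] += 10.0                 # feature 5 has no weight into the network
X_causal = X.copy()
X_causal[:, 0] += 1.0                     # feature 0 is wired in with weight 1

print(np.allclose(forward(X_noncausal, W, b, v), y))     # True
print(np.allclose(forward(X_causal, W, b, v), y + 1.0))  # True
```

Because the weights are written down rather than learned, the set of causal features is known exactly, which is what lets an attribution method be scored against it.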

flowchart LR
    subgraph circuits["compiled circuits (known ground truth)"]
        clean["clean<br/>(control)"]
        conf["confounded<br/>(corr. decoy)"]
        supp["suppressor<br/>(gradient decoy)"]
    end
    subgraph methods["attribution methods"]
        corr["correlation<br/><i>proxy</i>"]
        grad["gradient<br/><i>proxy</i>"]
        ig["integrated gradients<br/><i>causal</i>"]
        patch["activation patching<br/><i>causal</i>"]
    end
    circuits -->|score per-feature<br/>importance| methods
    methods -->|AUROC vs<br/>ground truth| gate["eval gate<br/>asserts the<br/>dissociation"]

correlation — |corr(feature, output)|. A confounder correlated with the output fools it (it can't tell correlation from causation).
gradient — mean |dy/dx|. A feature with a steep local slope but ~zero net effect (two near-cancelling saturated gates) fools it.
integrated_gradients — the gradient integrated along a path from a baseline to the input. It measures the finite change in output over that path rather than the local slope, so the saturated decoy doesn't fool it.
patching — corrupt the feature, measure the output change. The causal-tracing primitive: robust to both decoys.
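The four scores above can be sketched on a toy circuit as follows. This is an illustrative sketch, not the benchmark's implementation: it assumes a ReLU hidden layer, a zero baseline and a midpoint Riemann sum for integrated gradients, and "corrupting" a feature by permuting its column across samples. The toy confounder (feature 2 tracks feature 0 but has no weight into the network) is also made up for the example.

```python
import numpy as np

def forward(X, W, b, v):
    return np.maximum(X @ W.T + b, 0.0) @ v

def input_gradient(X, W, b, v):
    """Analytic dy/dx for the ReLU network: (active gates * v) @ W."""
    active = (X @ W.T + b) > 0
    return (active * v) @ W

def correlation_score(X, y):
    return np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

def gradient_score(X, W, b, v):
    return np.abs(input_gradient(X, W, b, v)).mean(axis=0)

def integrated_gradients_score(X, W, b, v, steps=32):
    """Average the gradient along the straight path from a zero baseline to x."""
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean([input_gradient(a * X, W, b, v) for a in alphas], axis=0)
    return np.abs(X * avg_grad).mean(axis=0)

def patching_score(X, W, b, v, rng):
    """Replace one feature with values from other samples; measure |change in y|."""
    base = forward(X, W, b, v)
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(X[:, j])
        scores[j] = np.abs(forward(Xp, W, b, v) - base).mean()
    return scores

rng = np.random.default_rng(0)
n = 512
X = rng.normal(size=(n, 4))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=n)   # toy confounder: tracks feature 0
W = np.array([[1.0, 0.0, 0.0, 0.0],            # only features 0 and 1 are wired in
              [0.0, 1.0, 0.0, 0.0]])
b, v = np.zeros(2), np.ones(2)
y = forward(X, W, b, v)

corr = correlation_score(X, y)
grad = gradient_score(X, W, b, v)
ig = integrated_gradients_score(X, W, b, v)
patch = patching_score(X, W, b, v, rng)
for name, s in [("correlation", corr), ("gradient", grad),
                ("integrated_gradients", ig), ("patching", patch)]:
    print(f"{name:22s}", np.round(s, 3))
```

In this toy, correlation gives the confounder (feature 2) a score close to that of the real feature 0, while the three methods that go through the network assign it zero, because changing it cannot move the output. The sketch only reproduces the confounding effect; it does not include a saturation decoy.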

Results

Mean AUROC over 16 seeds, 512 samples/circuit, against the known circuit (1.0 = perfect, 0.5 = chance). Reproduce with python -m evals.harness.

method	family	clean (control)	confounded	suppressor
random	baseline	0.496	0.496	0.496
correlation	proxy	1.000	0.750 ⤵	1.000
gradient	proxy	1.000	1.000	0.750 ⤵
integrated_gradients	causal	1.000	1.000	1.000
patching	causal	1.000	1.000	1.000

Effect 1 — confounding breaks correlation, not causation. Correlation drops 1.000 → 0.750 on confounded; patching stays 1.000.
Effect 2 — saturation breaks gradients, not causation. Gradient saliency drops 1.000 → 0.750 on suppressor; patching (and integrated gradients) stay 1.000.
Each proxy fails only on its own decoy (correlation is fine on suppressor, gradient is fine on confounded), so each drop is attributable to the decoy, not to the method.
Scrambled-label null: shuffle the ground truth and every method falls to ~0.50, confirming that the high AUROCs reflect recovery of the real circuit rather than an artifact of the metric (full table in evals/RESULTS.md).
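To make the 0.750 figures and the scrambled-label null concrete, here is a small pairwise AUROC sketch in plain Python. The 12-feature layout (4 causal, 8 non-causal, 2 of them decoys) and the score values are hypothetical, chosen only so the arithmetic is easy to follow; they are not taken from the benchmark's circuits, and `auroc` and `scrambled_null` are illustrative names.

```python
import random

def auroc(scores, is_causal):
    """Fraction of (causal, non-causal) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, c in zip(scores, is_causal) if c]
    neg = [s for s, c in zip(scores, is_causal) if not c]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical layout: 4 causal features, then 8 non-causal ones.
is_causal = [True] * 4 + [False] * 8
clean = [0.6, 0.5, 0.4, 0.3] + [0.0] * 8                 # no decoy scores high
fooled = [0.6, 0.5, 0.4, 0.3] + [1.4, 1.2] + [0.0] * 6   # two decoys outrank the circuit

print(auroc(clean, is_causal))    # 1.0
print(auroc(fooled, is_causal))   # 0.75  (24 of 32 pairs ranked correctly)

def scrambled_null(scores, is_causal, trials=2000, seed=0):
    """Mean AUROC when the ground-truth labels are shuffled."""
    rng = random.Random(seed)
    labels = list(is_causal)
    total = 0.0
    for _ in range(trials):
        rng.shuffle(labels)
        total += auroc(scores, labels)
    return total / trials

null_mean = scrambled_null(fooled, is_causal)
print(round(null_mean, 2))
```

With two decoys ranked above all four causal features, 8 of the 32 causal/non-causal pairs are misordered, which gives 0.75. Shuffling the labels removes any relationship between scores and ground truth, so the average sits near chance.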

Quickstart

pip install -e ".[dev]"          # numpy only; no API keys, no downloads

neurotrace compare --circuit confounded   # all methods on one circuit
neurotrace compare --circuit suppressor
neurotrace attribute --method correlation --circuit confounded
neurotrace circuits                        # describe the regimes

python -m evals.harness          # write evals/RESULTS.md
python -m evals.gate             # assert the dissociation (CI gate)
pytest -q                        # 48 tests

Configure via env vars (see .env.example): NEUROTRACE_METHOD, NEUROTRACE_CIRCUIT, NEUROTRACE_LABELS, NEUROTRACE_SAMPLES, NEUROTRACE_SEED.
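A minimal sketch of how such variables could be read in Python is shown below. Only the five variable names come from this README; the `Settings` class, the `load_settings` helper, and every default value are assumptions for illustration (the real defaults are in .env.example).

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    method: str
    circuit: str
    labels: str
    samples: int
    seed: int

def load_settings(env=None):
    """Read NEUROTRACE_* variables; the fallback values here are hypothetical."""
    env = os.environ if env is None else env
    return Settings(
        method=env.get("NEUROTRACE_METHOD", "patching"),
        circuit=env.get("NEUROTRACE_CIRCUIT", "clean"),
        labels=env.get("NEUROTRACE_LABELS", "real"),
        samples=int(env.get("NEUROTRACE_SAMPLES", "512")),
        seed=int(env.get("NEUROTRACE_SEED", "0")),
    )

# Passing a dict instead of os.environ keeps the example self-contained.
settings = load_settings({"NEUROTRACE_METHOD": "gradient",
                          "NEUROTRACE_CIRCUIT": "suppressor",
                          "NEUROTRACE_SEED": "3"})
print(settings)
```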

Docker

docker build -t neurotrace .
docker run --rm neurotrace        # runs the full offline benchmark

Optional: PyTorch cross-check

pip install -e ".[torch]"         # then the skipped test runs

Recomputes the gradient saliency with torch autograd and asserts it matches the hand-derived numpy Jacobian — evidence the analytic gradients are correct.

Design notes

Offline & deterministic. numpy is the only runtime dependency; every number is produced from np.random.default_rng with a fixed salt, so CI reproduces the table bit-for-bit across Python 3.10–3.12.
Compiled, not trained. Weights are set by hand so the ground truth is exact — no training noise, no "is this really the circuit?" ambiguity.
The experiment is tuned, never the method. Decoy strengths are chosen to make the failure visible; the attribution methods are textbook implementations.
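The determinism note above can be illustrated with a short sketch of salted seeding. The salt value, the helper names `make_rng` and `sample_inputs`, and the input shape are hypothetical; the only thing taken from the text is that numbers come from np.random.default_rng with a fixed salt. Passing a list of integers as the seed is standard NumPy usage.

```python
import numpy as np

SALT = 20240   # hypothetical constant; the project's real salt is not shown here

def make_rng(seed, salt=SALT):
    """Combine a fixed salt with the per-run seed into one generator."""
    return np.random.default_rng([salt, seed])

def sample_inputs(seed, n_samples=512, n_features=8):
    """Draw a reproducible input matrix for a given seed (shape is illustrative)."""
    return make_rng(seed).normal(size=(n_samples, n_features))

a = sample_inputs(seed=0)
b = sample_inputs(seed=0)
c = sample_inputs(seed=1)

print(np.array_equal(a, b))   # True: same salt and seed give identical draws
print(np.array_equal(a, c))   # False: a different seed gives a different sample
```

Averaging a metric over a fixed list of seeds, as the results table does over 16, then yields the same numbers on every run with the same NumPy version.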

See docs/ARCHITECTURE.md and docs/DECISIONS.md for the full design.

License

MIT — see LICENSE.
