ranafaraz/NeuroTrace

NeuroTrace

▶ Live demo: neurotrace.dexdevs.com — run it in your browser, free offline backend. Browse all 10 portfolio demos via the all demos link.

A causal-tracing benchmark for neural-network interpretability. NeuroTrace hand-compiles small networks whose ground-truth circuit is known by construction, then scores attribution methods on whether they recover it. The result is a clean two-effect dissociation: causal activation patching recovers the circuit where the cheap proxies each get fooled — correlation by a confounder, gradient saliency by a saturated decoy — and a control circuit proves each failure comes from the planted decoy, not the method in general.

Why it matters: saliency maps and correlations are the attributions people actually read off a model, and both can confidently point at the wrong thing. NeuroTrace turns that folklore into a measured, reproducible result against a known answer — no trained model, no labels to trust, no API keys.

Demo

$ neurotrace attribute --method gradient --circuit suppressor
method   : gradient  (proxy)
circuit  : suppressor   labels=real  samples=512  seed=0
auroc    : 0.7500   (1.0 = circuit fully recovered, 0.5 = chance)
features : f7=1.437 f6=1.236 f1*=0.663 f3*=0.587 f0*=0.471 f2*=0.425 f4=0.000 ...
           (* = ground-truth circuit feature)
#         ^ the two non-causal decoys (f6, f7) outrank every real circuit feature

$ neurotrace attribute --method patching --circuit suppressor
auroc    : 1.0000   # causal tracing is not fooled — the decoys move the output by ~0

How it works

Every circuit is a compiled one-hidden-layer network (y = v · φ(Wx + b)) wired so the output depends on exactly the causal features. Each regime plants one non-causal feature engineered to look important to one cheap method; the clean control removes the decoy so the fooled method recovers.
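As a rough illustration of what "compiled" means here, the sketch below hand-sets the weights of a tiny network of the same form, y = v · φ(Wx + b). The layout (four inputs, two causal), the ReLU choice for φ, and every name in the block are assumptions made for the example, not NeuroTrace's actual circuits.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical toy circuit: 4 inputs, of which only x0 and x1 are causal.
# The zero columns in W mean x2 and x3 cannot reach the output at all.
W = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
b = np.zeros(2)
v = np.array([1.0, -1.0])

def forward(X):
    """Compute y = v . phi(W x + b) for a batch X of shape (n, d)."""
    return relu(X @ W.T + b) @ v

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 4))
y = forward(X)

# Overwriting the non-causal inputs leaves the output unchanged ...
X_noncausal = X.copy()
X_noncausal[:, 2:] = rng.normal(size=(512, 2))

# ... while shifting a causal input changes it.
X_causal = X.copy()
X_causal[:, 0] += 1.0

print(np.allclose(forward(X_noncausal), y))  # non-causal edit: same output
print(np.allclose(forward(X_causal), y))     # causal edit: different output
```

Because the weights are written down rather than learned, the set of causal features can be read directly off the nonzero columns of W.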

flowchart LR
    subgraph circuits["compiled circuits (known ground truth)"]
        clean["clean<br/>(control)"]
        conf["confounded<br/>(corr. decoy)"]
        supp["suppressor<br/>(gradient decoy)"]
    end
    subgraph methods["attribution methods"]
        corr["correlation<br/><i>proxy</i>"]
        grad["gradient<br/><i>proxy</i>"]
        ig["integrated gradients<br/><i>causal</i>"]
        patch["activation patching<br/><i>causal</i>"]
    end
    circuits -->|score per-feature<br/>importance| methods
    methods -->|AUROC vs<br/>ground truth| gate["eval gate<br/>asserts the<br/>dissociation"]
  • correlation: |corr(feature, output)|. A confounder correlated with the output fools it (it can't tell correlation from causation).
  • gradient: mean |dy/dx|. A feature with a steep local slope but ~zero net effect (two near-cancelling saturated gates) fools it.
  • integrated_gradients: the gradient integrated along a path from a baseline; reads the finite effect, so the saturated decoy doesn't fool it.
  • patching: corrupt the feature, measure the output change. The causal-tracing primitive: robust to both decoys.
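The contrast between a proxy and a causal score can be sketched on a toy confounded circuit. Everything below is an assumption for illustration: the circuit, the function names, and the choice to "corrupt" a feature by shuffling it across the batch are not taken from the NeuroTrace source. Integrated gradients is left out to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512

# Hypothetical toy circuit (not one of NeuroTrace's): y = relu(x0) + relu(x1).
W = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
b = np.zeros(2)
v = np.array([1.0, 1.0])

def forward(X):
    return np.maximum(X @ W.T + b, 0.0) @ v

# x0, x1 are causal; x2 is pure noise; x3 is a confounder that tracks x0
# but has zero weight in the network.
X = rng.normal(size=(n, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=n)
y = forward(X)

def correlation_scores(X, y):
    """|corr(feature, output)| per input feature."""
    return np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                     for j in range(X.shape[1])])

def gradient_scores(X):
    """mean |dy/dx| per input feature, from the analytic ReLU Jacobian."""
    gates = (X @ W.T + b > 0).astype(float)  # phi'(z), shape (n, hidden)
    jac = (gates * v) @ W                    # dy/dx, shape (n, d)
    return np.abs(jac).mean(axis=0)

def patching_scores(X, rng):
    """Shuffle one feature across the batch and measure mean |delta y|."""
    base = forward(X)
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(X[:, j])
        scores[j] = np.abs(forward(Xp) - base).mean()
    return scores

corr = correlation_scores(X, y)
grad = gradient_scores(X)
patch = patching_scores(X, rng)
print("correlation:", corr.round(3))
print("gradient   :", grad.round(3))
print("patching   :", patch.round(3))
```

In this toy, the confounder x3 scores about as high as x0 under correlation, yet patching assigns it zero because changing it never moves the output. The gradient score is also zero for x3 here; this sketch does not reproduce the saturated decoy that fools gradients.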

Results

Mean AUROC over 16 seeds, 512 samples/circuit, against the known circuit (1.0 = perfect, 0.5 = chance). Reproduce with python -m evals.harness.

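| method               | family   | clean (control) | confounded | suppressor |
|----------------------|----------|-----------------|------------|------------|
| random               | baseline | 0.496           | 0.496      | 0.496      |
| correlation          | proxy    | 1.000           | 0.750      | 1.000      |
| gradient             | proxy    | 1.000           | 1.000      | 0.750      |
| integrated_gradients | causal   | 1.000           | 1.000      | 1.000      |
| patching             | causal   | 1.000           | 1.000      | 1.000      |
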
  • Effect 1 — confounding breaks correlation, not causation. Correlation drops 1.000 → 0.750 on confounded; patching stays 1.000.
  • Effect 2 — saturation breaks gradients, not causation. Gradient saliency drops 1.000 → 0.750 on suppressor; patching (and integrated gradients) stay 1.000.
  • Each proxy fails only its own decoy — correlation is fine on suppressor, gradient is fine on confounded — so the collapse is the decoy, not the method.
  • Scrambled-label null: shuffle the ground truth and every method falls to ~0.50, confirming the AUROC is real (full table in evals/RESULTS.md).
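
For readers unfamiliar with AUROC as a ranking score, here is a minimal sketch of how per-feature importances could be scored against a known circuit. The pairwise definition is standard; the feature layout and score values are made up for the example, and NeuroTrace's own implementation and feature counts may differ.

```python
def auroc(scores, is_causal):
    """Fraction of (causal, non-causal) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, c in zip(scores, is_causal) if c]
    neg = [s for s, c in zip(scores, is_causal) if not c]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical layout: 2 causal features followed by 4 non-causal ones.
is_causal = [True, True, False, False, False, False]

clean_scores = [0.6, 0.5, 0.0, 0.0, 0.0, 0.0]  # circuit ranked on top
decoy_scores = [0.6, 0.5, 0.9, 0.0, 0.0, 0.0]  # one decoy outranks both

print(auroc(clean_scores, is_causal))  # 1.0
print(auroc(decoy_scores, is_causal))  # 0.75
```

In this made-up layout a single decoy that outranks every causal feature loses 2 of the 8 causal/non-causal pairs, which is one way a score of 0.75 can arise.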

Quickstart

pip install -e ".[dev]"          # numpy only; no API keys, no downloads

neurotrace compare --circuit confounded   # all methods on one circuit
neurotrace compare --circuit suppressor
neurotrace attribute --method correlation --circuit confounded
neurotrace circuits                        # describe the regimes

python -m evals.harness          # write evals/RESULTS.md
python -m evals.gate             # assert the dissociation (CI gate)
pytest -q                        # 48 tests

Configure via env vars (see .env.example): NEUROTRACE_METHOD, NEUROTRACE_CIRCUIT, NEUROTRACE_LABELS, NEUROTRACE_SAMPLES, NEUROTRACE_SEED.

Docker

docker build -t neurotrace .
docker run --rm neurotrace        # runs the full offline benchmark

Optional: PyTorch cross-check

pip install -e ".[torch]"         # then the skipped test runs

Recomputes the gradient saliency with torch autograd and asserts it matches the hand-derived numpy Jacobian — evidence the analytic gradients are correct.
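
The same idea can be sketched without PyTorch by swapping autograd for central finite differences. This is a stand-in technique, not the repository's test: the network below uses random weights and φ = tanh (chosen so the function is smooth everywhere), and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
b = rng.normal(size=3)
v = rng.normal(size=3)

def forward(x):
    """y = v . tanh(W x + b) for a single input vector x."""
    return v @ np.tanh(W @ x + b)

def analytic_grad(x):
    """Hand-derived dy/dx: (v * tanh'(z)) @ W, with tanh'(z) = 1 - tanh(z)^2."""
    z = W @ x + b
    return (v * (1.0 - np.tanh(z) ** 2)) @ W

def numeric_grad(x, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.empty_like(x)
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        g[j] = (forward(x + step) - forward(x - step)) / (2 * eps)
    return g

x = rng.normal(size=5)
grads_match = np.allclose(analytic_grad(x), numeric_grad(x), atol=1e-6)
print(grads_match)
```

Agreement between the two gradients is evidence, not proof, that the hand-derived Jacobian is right; an autograd cross-check plays the same role with less numerical error.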

Design notes

  • Offline & deterministic. numpy is the only runtime dependency; every number is produced from np.random.default_rng with a fixed salt, so CI reproduces the table bit-for-bit across Python 3.10–3.12.
  • Compiled, not trained. Weights are set by hand so the ground truth is exact — no training noise, no "is this really the circuit?" ambiguity.
  • The experiment is tuned, never the method. Decoy strengths are chosen to make the failure visible; the attribution methods are textbook implementations.
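
A minimal sketch of the seeding pattern the first note describes, assuming the salt and the per-run seed are combined into one seed sequence. The salt value and the helper name are hypothetical; the README does not say how NeuroTrace combines them.

```python
import numpy as np

SALT = 12345  # hypothetical fixed salt, not NeuroTrace's actual value

def make_rng(seed):
    """Build a generator from the fixed salt plus a per-run seed."""
    return np.random.default_rng([SALT, seed])

a = make_rng(0).normal(size=4)
b = make_rng(0).normal(size=4)
c = make_rng(1).normal(size=4)

print(np.array_equal(a, b))  # same salt and seed: identical draws
print(np.array_equal(a, c))  # different seed: a different stream
```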

See docs/ARCHITECTURE.md and docs/DECISIONS.md for the full design.

License

MIT — see LICENSE.
