
# interpretability-toolkit

Practical tools for mechanistic interpretability of neural networks. Built for AI safety researchers who need to understand what's happening inside language models.

## Motivation

Mechanistic interpretability is one of the most promising approaches to AI safety. If we can understand the internal computations of neural networks — the actual algorithms they implement — we can identify potential failure modes before deployment and build more trustworthy systems.

This toolkit provides composable, well-tested building blocks for interpretability research, focused on transformer-based language models.

## Capabilities

### Activation Analysis

Extract, cache, and analyze intermediate activations from any layer of a transformer. Supports residual stream, attention patterns, and MLP activations.

```python
from interp_toolkit import ActivationCache
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
cache = ActivationCache(model)

activations = cache.run("The capital of France is")
residual = activations.residual_stream(layer=6)  # (seq_len, d_model)
attn_pattern = activations.attention_pattern(layer=6, head=3)  # (seq_len, seq_len)
```
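Under the hood, activation capture of this kind is usually built on PyTorch forward hooks (see `activations/cache.py` in the layout below). A minimal, self-contained sketch of the mechanism — using a toy layer stack rather than GPT-2, with illustrative names — looks like this:

```python
import torch
import torch.nn as nn

# Toy stand-in for a stack of transformer blocks; the capture
# mechanism is the same regardless of the model being hooked.
model = nn.Sequential(
    nn.Linear(16, 16),
    nn.ReLU(),
    nn.Linear(16, 16),
)

cache = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Detach so cached tensors don't keep the autograd graph alive.
        cache[name] = output.detach()
    return hook

handles = [m.register_forward_hook(make_hook(f"layer_{i}"))
           for i, m in enumerate(model)]

x = torch.randn(4, 16)          # (batch, d_model)
model(x)                        # forward pass populates the cache

for h in handles:               # always remove hooks when done
    h.remove()

print(sorted(cache))            # every hooked layer was captured
print(cache["layer_2"].shape)   # same shape as the layer's output
```

Registering on every submodule and detaching outputs keeps memory bounded and avoids accidentally holding gradients for the whole forward pass.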

### Linear Probes

Train linear probes on model internals to detect learned features — sentiment, factuality, named entities, syntactic structure, or custom concepts.

```python
from interp_toolkit.probes import LinearProbe

probe = LinearProbe(input_dim=768, num_classes=2)
probe.train(
    activations=residual_stream_data,
    labels=factuality_labels,
    epochs=50,
)
accuracy = probe.evaluate(test_activations, test_labels)
# accuracy: 0.94 — the model represents factuality at layer 6
```
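A linear probe is just a single linear layer trained with cross-entropy on cached activations. A standalone sketch on synthetic "activations" (the data here is fabricated purely to make the example runnable):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic activations: class 1 is shifted along every dimension,
# so the two classes are linearly separable by construction.
d_model, n = 768, 512
labels = torch.randint(0, 2, (n,))
acts = torch.randn(n, d_model) + 3.0 * labels[:, None].float()

probe = nn.Linear(d_model, 2)          # the entire probe model
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(probe(acts), labels)
    loss.backward()
    opt.step()

acc = (probe(acts).argmax(dim=-1) == labels).float().mean().item()
print(f"train accuracy: {acc:.2f}")    # near 1.0 on this separable data
```

Keeping the probe linear is the point: if a single linear map can read the concept out of the activations, the model plausibly represents it explicitly at that layer.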

### Activation Patching

Causally intervene on model internals by patching activations from one forward pass into another. Essential for understanding which components are necessary and sufficient for a behavior.

```python
from interp_toolkit.circuits import activation_patch

result = activation_patch(
    model=model,
    clean_input="The Eiffel Tower is in",
    corrupt_input="The Colosseum is in",
    target_token="Paris",
    patch_layer=8,
    patch_component="mlp",
)
print(f"Logit difference change: {result.logit_diff_change:.3f}")
```
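The patching recipe itself is simple: cache an activation during the clean run, then overwrite the same component's output during the corrupt run. A toy sketch with plain PyTorch hooks (model and names are illustrative, not the toolkit's internals):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 4))
clean_x, corrupt_x = torch.randn(1, 8), torch.randn(1, 8)
layer = model[0]  # the component being patched

# 1. Clean run: cache the activation at the target component.
stash = {}
h = layer.register_forward_hook(lambda m, i, o: stash.update(act=o.detach()))
clean_logits = model(clean_x)
h.remove()

# 2. Corrupt run, but a hook that returns a tensor REPLACES the
#    component's output with the cached clean activation.
h = layer.register_forward_hook(lambda m, i, o: stash["act"])
patched_logits = model(corrupt_x)
h.remove()

corrupt_logits = model(corrupt_x)  # unpatched baseline for comparison
# In this toy model the patched layer carries all downstream signal,
# so the patched run reproduces the clean output exactly.
print(torch.allclose(patched_logits, clean_logits))
```

In a real transformer the patched component is only one of many paths to the output, so the interesting quantity is how far the patch moves the logit difference, not exact recovery.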

### Circuit Discovery

Automatically identify minimal circuits responsible for specific behaviors using iterative ablation and path patching.

```python
from interp_toolkit.circuits import CircuitFinder

finder = CircuitFinder(model, threshold=0.01)
circuit = finder.find_circuit(
    clean_inputs=["The doctor said she", "The nurse said he"],
    corrupt_inputs=["The doctor said he", "The nurse said she"],
    target_metric="logit_diff",
)
print(circuit.summary())
# Circuit: 5 attention heads, 2 MLP layers
# Key heads: L5H1 (name mover), L7H3 (subject tracker)
```
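The iterative-ablation half of this loop can be sketched in a few lines: zero out one component at a time and keep those whose removal shifts the metric by more than the threshold. A toy version (the model, metric, and component names are hypothetical):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 2),
)
x = torch.randn(16, 8)

def metric(m):
    # Toy stand-in for a logit-difference metric over a batch.
    out = m(x)
    return (out[:, 0] - out[:, 1]).mean().item()

baseline = metric(model)
circuit = []
for i, module in enumerate(model):
    if not isinstance(module, nn.Linear):
        continue
    # Ablate: a hook returning a tensor replaces the module's output.
    h = module.register_forward_hook(lambda m, inp, out: torch.zeros_like(out))
    drop = abs(baseline - metric(model))
    h.remove()
    if drop > 0.01:  # mirrors the threshold passed to CircuitFinder above
        circuit.append(f"linear_{i}")

print(circuit)  # components whose ablation moves the metric past threshold
```

Zero-ablation is the simplest choice; mean-ablation or resampling from corrupt inputs are common refinements that avoid pushing activations off-distribution.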

### Visualization

Interactive visualizations for attention patterns, activation distributions, and circuit diagrams.

```python
from interp_toolkit.visualization import plot_attention, plot_circuit

plot_attention(attn_pattern, tokens=["The", "capital", "of", "France", "is"])
plot_circuit(circuit, output="circuit_diagram.html")
```

## Architecture

```text
interp_toolkit/
├── activations/       # Activation extraction and caching
│   ├── cache.py       # Hook-based activation capture
│   └── store.py       # Disk-backed activation storage
├── probes/            # Linear and nonlinear probing
│   ├── linear.py      # Linear probe implementation
│   └── trainer.py     # Probe training utilities
├── circuits/          # Circuit analysis tools
│   ├── patching.py    # Activation and path patching
│   ├── ablation.py    # Ablation studies
│   └── finder.py      # Automatic circuit discovery
└── visualization/     # Plotting and interactive vis
    ├── attention.py   # Attention pattern plots
    └── circuits.py    # Circuit diagram generation
```

## Installation

```bash
pip install interpretability-toolkit
```

For GPU support:

```bash
pip install "interpretability-toolkit[gpu]"
```

(The quotes prevent shells like zsh from treating `[gpu]` as a glob pattern.)

## Research Applications

This toolkit has been used for:

- **Factuality circuits**: Identifying which attention heads track factual associations
- **Sycophancy mechanisms**: Locating components that cause models to agree with users
- **Refusal circuits**: Understanding how safety training modifies model internals
- **Capability elicitation**: Finding latent capabilities through activation steering
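Activation steering, mentioned in the last item, adds a fixed direction to an intermediate activation at inference time. A minimal sketch with a forward pre-hook (toy model; in practice the steering vector is derived from probes or from activation differences between contrasting prompts):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
steer = torch.randn(8)  # steering direction (normally learned, not random)

# A pre-hook returning a tuple replaces the input to the hooked module,
# so this adds `steer` to the activation feeding the final layer.
h = model[2].register_forward_pre_hook(lambda m, inp: (inp[0] + steer,))
x = torch.randn(1, 8)
steered = model(x)
h.remove()

unsteered = model(x)  # same input without the intervention
print((steered - unsteered).abs().max().item())  # nonzero: output shifted
```

Because the intervention is purely additive and applied at inference, it leaves the weights untouched and can be toggled per forward pass.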

## Citation

```bibtex
@software{calkin2026interptoolkit,
  title={interpretability-toolkit: Practical Tools for Mechanistic Interpretability},
  author={Calkin, Maxwell},
  year={2026},
  url={https://github.com/MaxwellCalkin/interpretability-toolkit}
}
```

## License

MIT License. See LICENSE.