DT-Circuits is a research framework for mechanistic interpretability of Decision Transformers, focused on causal analysis, sparse feature decomposition, and circuit-level understanding of sequential decision-making agents.
Mechanistic interpretability has primarily focused on language models, while reinforcement learning agents remain comparatively underexplored.
Decision Transformers provide a uniquely analyzable architecture because returns-to-go, states, and actions are represented in a single unified autoregressive sequence.
DT-Circuits aims to make RL agents inspectable at the circuit level rather than only through behavioral evaluation.
- Hooked-DT: Access any internal activation or weight during the agent's run.
- Logit Attribution: See which attention heads or MLP layers drive specific actions.
- Induction Scan: Find heads that recognize temporal patterns and past states.
- Activation Patching: Swap internal states to see what actually changes the agent's move.
- Behavior Steering: Add vectors to activations to push the agent toward specific goals without retraining.
- TopK SAEs: Decompose complex activations into a few active "concepts" for easier reading.
- Auto-Labeling (NLA): Use an LLM to automatically describe what each discovered SAE feature represents.
- Cross-Model Probes: Check if different agents (like DQNs) learn the same internal concepts as the DT.
- ACDC: Automatically strip the model down to the minimal circuit needed for a task.
- Path Patching: Trace how a signal flows from a specific input token to the final action.
- Evolutionary Scan: Watch how decision-making circuits form during training.
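To illustrate the activation-patching idea above, here is a minimal, self-contained sketch using plain PyTorch forward hooks (a toy MLP stands in for the Decision Transformer; this is not the repository's API):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a model: we patch the output of the first layer
# from a "clean" run into a "corrupted" run.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
clean_x, corrupt_x = torch.randn(1, 4), torch.randn(1, 4)

cache = {}

def capture_hook(module, inputs, output):
    # Store the clean activation for later reuse.
    cache["layer0"] = output.detach().clone()

def patch_hook(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output.
    return cache["layer0"]

handle = model[0].register_forward_hook(capture_hook)
clean_out = model(clean_x)
handle.remove()

handle = model[0].register_forward_hook(patch_hook)
patched_out = model(corrupt_x)
handle.remove()

# Everything downstream depends only on layer 0's output, so patching it
# restores the clean behavior exactly.
assert torch.allclose(patched_out, clean_out)
```

The same capture/patch pattern generalizes to individual attention heads or MLP layers, which is where the causal signal for a specific action lives.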
- Data: Collects expert trajectories using a PPO-based harvester.
- Model: Custom Decision Transformer compatible with TransformerLens.
- Tools: Dedicated modules for attribution, patching, SAEs, and steering.
- Dashboard: Streamlit UI for real-time model analysis.
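The TopK SAE technique from the feature list can be sketched in a few lines. The class below is illustrative only (dimensions and names are made up, and it omits the training loop): it keeps the k largest latent activations per sample and zeroes the rest, so each input is explained by a handful of features.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Sparse autoencoder that keeps only the k largest latent activations."""
    def __init__(self, d_in: int, d_hidden: int, k: int):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)
        self.k = k

    def forward(self, x):
        z = torch.relu(self.enc(x))
        # Zero all but the top-k activations per sample.
        topk = torch.topk(z, self.k, dim=-1)
        mask = torch.zeros_like(z).scatter_(-1, topk.indices, 1.0)
        z_sparse = z * mask
        return self.dec(z_sparse), z_sparse

torch.manual_seed(0)
sae = TopKSAE(d_in=16, d_hidden=64, k=4)
recon, z = sae(torch.randn(2, 16))
```

Because sparsity is enforced structurally (via top-k) rather than through an L1 penalty, the number of active features per activation is bounded by construction.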
```
DT-Circuits/
├── scripts/
│   ├── train_dt.py           # Decision Transformer training pipeline
│   └── train_sae.py          # Sparse Autoencoder (SAE) training script
├── src/
│   ├── dashboard/
│   │   └── app.py            # Streamlit-based visualization UI
│   ├── data/
│   │   └── harvester.py      # PPO-based expert trajectory harvester
│   ├── interpretability/
│   │   ├── acdc.py           # Automated Circuit Discovery logic
│   │   ├── attribution.py    # Direct Logit Attribution (DLA)
│   │   ├── evolution.py      # Training dynamics analysis
│   │   ├── induction_scan.py # Induction head detection logic
│   │   ├── nla.py            # Natural Language Autoencoder Explainer
│   │   ├── patching.py       # Causal activation patching tools
│   │   ├── path_patching.py  # Path-based causal intervention engine
│   │   ├── sae_manager.py    # SAE deployment and anomaly detection
│   │   ├── steering.py       # Steering vector generation and injection
│   │   └── universality.py   # Cross-architecture feature mapping
│   ├── models/
│   │   └── hooked_dt.py      # TransformerLens-wrapped Decision Transformer
│   └── utils/
├── tests/                    # Unit tests for all modules
├── config.yaml
└── requirements.txt
```
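The Direct Logit Attribution implemented in `src/interpretability/attribution.py` rests on the linearity of the residual stream: because the final logits are a linear readout of a sum of component outputs, they decompose exactly into per-component contributions. A toy sketch (shapes and names are illustrative, not the repository's API):

```python
import torch

torch.manual_seed(0)

d_model, n_components, n_actions = 8, 3, 4
W_U = torch.randn(d_model, n_actions)  # toy unembedding / action readout
# Pretend these are the residual-stream writes of three model components
# (e.g. attention heads or MLP layers) at the final position.
writes = torch.randn(n_components, d_model)

final_resid = writes.sum(dim=0)
logits = final_resid @ W_U

# DLA: project each component's write through the readout individually.
contributions = writes @ W_U  # (n_components, n_actions)

# Linearity guarantees the per-component contributions sum to the logits.
assert torch.allclose(contributions.sum(dim=0), logits, atol=1e-5)
```

Each row of `contributions` tells you how much one component pushed each action's logit up or down, which is what "which heads drive specific actions" means concretely.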
- Python 3.9+
- PyTorch 2.x
- TransformerLens
- SAE-Lens
Follow these steps to initialize the environment and verify the installation.

1. Environment Setup

   ```bash
   python -m venv venv
   source venv/bin/activate
   pip install -r requirements.txt
   ```

2. Verification: run the component tests to ensure all dependencies and hooks are correctly configured.

   ```bash
   PYTHONPATH=. pytest tests/test_components.py
   ```

3. Dashboard Execution: launch the DT-Explorer dashboard. It initializes with a random model if no trained weights are detected.

   ```bash
   streamlit run src/dashboard/app.py
   ```

4. Data Harvesting & Model Training: harvest expert trajectories with the PPO harvester and train the Decision Transformer.

   ```bash
   python scripts/train_dt.py
   ```

   To train sparse autoencoders on the resulting model, run `python scripts/train_sae.py`.

5. Interpretability Analysis: relaunch the dashboard to explore the trained model with the attribution, patching, SAE, and steering tools.

   ```bash
   streamlit run src/dashboard/app.py
   ```
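Beyond the dashboard, interventions like behavior steering can also be scripted. The sketch below shows the underlying mechanism with a toy model and a random direction (in practice the steering vector would be derived from contrastive activations; this is not the repository's API):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 3))
steering_vector = torch.randn(8)  # hypothetical: a derived behavior direction

def steering_hook(module, inputs, output):
    # Shift the layer's activations along the steering direction.
    return output + steering_vector

x = torch.randn(1, 4)
baseline = model(x)

handle = model[0].register_forward_hook(steering_hook)
steered = model(x)
handle.remove()

# The intervention changes the output without any retraining.
assert not torch.allclose(baseline, steered)
```

Scaling the vector (e.g. `output + alpha * steering_vector`) gives a knob for how strongly the agent is pushed toward the target behavior.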