Knowledge Distillation for Hybrid Transformer-SSM Models

A PyTorch framework for knowledge distillation from large language models (LLMs) to hybrid architectures that combine Transformer attention mechanisms with State-Space Models (SSM), such as Mamba.

What This Repository Contains

Model building blocks for Llama, Qwen2, Falcon, and Phi-style hybrids.
A composable YAML config system (LOAD-based inheritance).
Distillation objectives: supervised, hstates, matrices, and dpo.
Distributed training wrappers (DDP/FSDP/centralized).
Evaluation utilities for perplexity and lm-eval-harness tasks.
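
The objective names above are not defined further in this README. As a rough, framework-free sketch, assuming matrices, hstates, and supervised correspond to the matrix-matching, hidden-state-matching, and end-to-end distillation stages described in the Mohawk paper listed under Publications, the quantities involved look roughly like this. The function names are illustrative only and are not this repository's API; dpo is omitted.

```python
import math


def frobenius_distance(a, b):
    """Frobenius norm of (a - b) for two equally shaped nested-list matrices."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    )


def l2_distance(u, v):
    """Euclidean distance between two equally sized hidden-state vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(teacher_logits, student_logits):
    """Cross-entropy of the student distribution against the teacher's soft labels."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


# Assumed mapping: "matrices" compares mixing matrices, "hstates" compares
# block outputs, "supervised" compares output distributions.
matrix_loss = frobenius_distance([[3.0, 0.0]], [[0.0, 4.0]])
hidden_loss = l2_distance([1.0, 2.0], [1.0, 2.0])
logit_loss = soft_cross_entropy([0.0, 0.0], [0.0, 0.0])
```

In an actual training run these would be tensor operations over batches; the scalar versions here only show what each term measures.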

This is not a packaged library. You run scripts directly from the repo.

Setup

Requirements

Python 3.8+
PyTorch 2.1+ with CUDA
One or more NVIDIA GPUs for training/eval

Install

git clone https://github.com/avivbick/mohawk.git
cd mohawk
pip install -r requirements.txt

Optional accelerators:

pip install flash-attn --no-build-isolation

Environment Variables

Use environment variables for credentials instead of hardcoding:

HF_TOKEN for private/gated Hugging Face models
WANDB_API_KEY for experiment tracking
CUDA_VISIBLE_DEVICES to pin GPUs
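
A small pre-flight check can catch missing credentials before a long job starts. This is a minimal sketch using only the variable names listed above; missing_credentials is a hypothetical helper and is not part of this repository.

```python
import os

# Variables named in this README and what each one is used for.
CREDENTIALS = {
    "HF_TOKEN": "private/gated Hugging Face models",
    "WANDB_API_KEY": "experiment tracking",
}


def missing_credentials(env=None):
    """Return the credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in CREDENTIALS if not env.get(name)]


if __name__ == "__main__":
    for name in missing_credentials():
        print(f"warning: {name} is not set (needed for {CREDENTIALS[name]})")
    # CUDA_VISIBLE_DEVICES is optional; when unset, all GPUs are visible.
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
```

Whether a given run actually needs both credentials depends on the teacher model and logging settings, so the check only warns rather than exiting.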

Global runtime defaults live in configs/management.yaml.

Quick Start

Single-GPU Training

python run.py --config configs/Qwen2/1.5B/hybrid/adapter.yaml

Multi-GPU Training

torchrun --standalone --nproc_per_node=8 run.py \
  --config configs/Qwen2/1.5B/hybrid/adapter.yaml

--config also accepts a comma-separated list; configs are loaded and run sequentially.
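
How run.py parses this flag is not shown in this README. A minimal sketch of the splitting step, assuming entries are plain paths separated by commas and processed in the order given, might look like this; split_configs is a hypothetical helper.

```python
def split_configs(arg):
    """Split a comma-separated --config value into an ordered list of paths."""
    return [part.strip() for part in arg.split(",") if part.strip()]


paths = split_configs(
    "configs/Llama/1B/hybrid/mohawk_8.yaml,configs/Qwen2/1.5B/hybrid/adapter.yaml"
)
for index, path in enumerate(paths, start=1):
    # Each config would be loaded and run to completion before the next starts.
    print(f"run {index}: {path}")
```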

How Configuration Works

Every run is driven by YAML. The important top-level sections are:

ComponentsConfig: architecture definition (block sequence and layer types)
TrainConfig: optimization schedule and training length
DistillConfig: objective selection and logging run name
TeacherConfig: teacher checkpoint/path and tokenizer context
TrainDataConfig: dataset source and loader strategy
LoadConfig: initialization and checkpoint loading rules
ManagementConfig: cache paths, W&B config, environment defaults
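
The exact semantics of LOAD-based inheritance are defined by the config loader in utils/. As an illustration only, the sketch below assumes that a config may carry a LOAD key naming one or more parent configs, that parents are resolved first, and that the child's values are deep-merged on top. The resolve and deep_merge helpers, the in-memory registry, and the field names lr, max_steps, and objective are all hypothetical.

```python
import copy


def deep_merge(base, override):
    """Return a new dict where values in override replace or extend base."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve(name, registry, _seen=()):
    """Resolve a config by name, applying its LOAD parents before its own keys."""
    if name in _seen:
        raise ValueError("cyclic LOAD chain: " + " -> ".join(_seen + (name,)))
    raw = dict(registry[name])
    parents = raw.pop("LOAD", [])
    if isinstance(parents, str):
        parents = [parents]
    merged = {}
    for parent in parents:
        merged = deep_merge(merged, resolve(parent, registry, _seen + (name,)))
    return deep_merge(merged, raw)


# Stand-in for YAML files on disk: name -> parsed content.
registry = {
    "base": {"TrainConfig": {"lr": 1e-4, "max_steps": 1000}},
    "child": {
        "LOAD": "base",
        "TrainConfig": {"lr": 5e-5},
        "DistillConfig": {"objective": "hstates"},
    },
}

config = resolve("child", registry)
print(config)
```

Under these assumptions the child keeps max_steps from the base, overrides lr, and adds its own DistillConfig section.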

Useful starting points:

configs/Qwen2/1.5B/hybrid/adapter.yaml
configs/Llama/1B/hybrid/mohawk_8.yaml
configs/Llama/8B/bases/_supervised.yaml

Evaluation and Analysis

Perplexity

Perplexity evaluation runs through the training/eval wrappers. The evaluator class those wrappers use is implemented in evals/eval_ppl.py.
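
For reference, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows only that standard definition; it is not the evaluator in evals/eval_ppl.py, and the perplexity function name is illustrative.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that assigns probability 1/4 to every target token has perplexity 4.
uniform_over_four = perplexity([math.log(4)] * 3)
```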

lm-eval-harness Benchmarks

python evals/benchmark.py --dir <checkpoint_or_hf_model_dir> --tasks mmlu

--tasks is a comma-separated list, for example: arc_challenge,arc_easy,piqa,winogrande,hellaswag,mmlu.

Utility Scripts

tools/hybrid_weights_transfer.py: copies selected attention heads from a teacher to a hybrid student. Takes --config and expects a supported TeacherConfig.dir.
tools/benchmark_throughput.py: CUDA-graph throughput microbenchmark. This script is research-oriented and currently contains model-specific assumptions and hardcoded defaults.
tools/visualize_attention.py: produces attention heatmaps for manually selected heads on a fixed example. Useful for qualitative inspection, not automated evaluation.
generation/generate.py: inference/sampling script with timing output.

Repository Layout

mohawk/
├── components/          # Blocks, mixers, LM heads
├── configs/             # Train/eval architecture recipes
├── dataloaders/         # Dataset generators and wrappers
├── distill/             # Run orchestration and objective steps
├── evals/               # Evaluation entrypoints and adapters
├── external_models/     # External model definitions integrated here
├── generation/          # Text generation utilities
├── training_wrapper/    # DDP/FSDP/centralized wrappers
├── utils/               # Config, logging, init, distributed helpers
└── run.py               # Main training entrypoint

Publications

This codebase was used in the following research publications:

Retrieval-Aware Distillation for Transformer-SSM Hybrids

@misc{bick2026retrieval,
      title={Retrieval-Aware Distillation for Transformer-SSM Hybrids}, 
      author={Aviv Bick and Eric P. Xing and Albert Gu},
      year={2026},
      eprint={2602.11374},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.11374}, 
}

Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing

@article{bick2025llamba,
  title={Llamba: Scaling distilled recurrent models for efficient language processing},
  author={Bick, Aviv and Katsch, Tobias and Sohoni, Nimit and Desai, Arjun and Gu, Albert},
  journal={arXiv preprint arXiv:2502.14458},
  year={2025}
}

Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners

@misc{paliotta2025thinking,
      title={Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners}, 
      author={Daniele Paliotta and Junxiong Wang and Matteo Pagliardini and Kevin Y. Li and Aviv Bick and J. Zico Kolter and Albert Gu and François Fleuret and Tri Dao},
      year={2025},
      eprint={2502.20339},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.20339}, 
}

Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models (Mohawk)

@misc{mohawk,
      title={Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models}, 
      author={Aviv Bick and Kevin Y. Li and Eric P. Xing and J. Zico Kolter and Albert Gu},
      year={2024},
      eprint={2408.10189},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.10189}, 
}

Citation

If this repository is useful in your work, cite:

@software{mohawk,
  title = {Knowledge Distillation for Hybrid Transformer-SSM Models},
  author = {Aviv Bick},
  year = {2024},
  url = {https://github.com/avivbick/mohawk}
}

License

MIT. See LICENSE.

Contributing

Contribution workflow and expectations are documented in CONTRIBUTING.md.
