Memory-efficient DoRA (Weight-Decomposed Low-Rank Adaptation) for PEFT, featuring factored column norms, fused Triton kernels with custom autograd, and automatic dispatch across eager PyTorch and Triton backends.
From the paper: Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels (arXiv preprint arXiv:2603.22276, 2026).
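The computation these kernels accelerate is the standard DoRA composition: the adapted weight is the base weight plus the scaled low-rank update, renormalized per output unit and rescaled by a learned magnitude vector. A minimal eager sketch in plain PyTorch (illustrative only; the actual layer classes live in the PEFT fork described below):

```python
import torch

# Toy sizes; s is the LoRA scaling factor. Shapes follow the nn.Linear convention.
out_f, in_f, r, s = 64, 32, 8, 2.0
W0 = torch.randn(out_f, in_f)          # frozen base weight
A = torch.randn(r, in_f) * 0.01        # LoRA down-projection
B = torch.zeros(out_f, r)              # LoRA up-projection (zero-initialized)
m = torch.linalg.norm(W0, dim=1)       # learned magnitude, initialized from W0

composed = W0 + s * (B @ A)            # direction before renormalization
col_norm = torch.linalg.norm(composed, dim=1)     # one norm per output unit
W_dora = (m / col_norm).unsqueeze(1) * composed   # rescale each row by m / norm

x = torch.randn(4, in_f)
y = x @ W_dora.t()                     # equivalent to F.linear(x, W_dora)
```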
Clone with submodules (recommended; one command, zero setup):

```bash
git clone --recurse-submodules https://github.com/sockeye44/dorafactors
cd dorafactors
```

If you already cloned without `--recurse-submodules`:

```bash
git submodule update --init
```

Install dependencies:

```bash
pip install -r code/requirements.txt
```

Run the benchmark (from the repo root):

```bash
python code/bench_dora_comprehensive.py --suite all --verbose
```

The script validates its vendor dependencies at startup and will guide you through any missing pieces. All paths are resolved relative to the script's location, so `python code/bench_dora_comprehensive.py` works from any working directory, but always invoke it via the repo-root-relative path shown above to keep commands unambiguous and copy-pasteable.
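A sketch of the path-resolution idiom referred to above (variable names are hypothetical; the actual script may differ):

```python
from pathlib import Path

# Resolve repo paths relative to the script file rather than the current
# working directory, so the benchmark behaves the same from anywhere.
SCRIPT_DIR = Path(__file__).resolve().parent
VENDOR_DIR = SCRIPT_DIR.parent / "vendor" / "dorafactors-peft"
```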
What the script expects at startup
| Artifact | Source | How it gets there |
|---|---|---|
| `vendor/dorafactors-peft/` | Git submodule → `sockeye44/dorafactors-peft`, branch `v1` | `--recurse-submodules` or `git submodule update --init` |
| Reference `dora.py` | Upstream HF PEFT @ `20a9829` | Searched at `code/scripts/dora.reference_hf_peft.py`, then `vendor/` (SHA-1 verified); fetch with `wget` if absent |
If either is missing or fails integrity checks, the script prints exact
remediation commands. In interactive sessions it offers a [y/N] prompt to
continue under non-standard conditions; in non-interactive sessions (CI, piped
stdin) it exits with an error.
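A sketch of the kind of startup check described above: verify the reference file's SHA-1 and only offer the continue prompt when stdin is interactive. The function name, messages, and the placeholder digest are hypothetical; the actual script's logic may differ.

```python
import hashlib
import sys
from pathlib import Path

EXPECTED_SHA1 = "<expected digest from the script>"  # placeholder, not the real value

def validate_reference(path: Path) -> None:
    """Check presence and integrity of the reference dora.py, guiding the user otherwise."""
    if not path.exists():
        sys.exit(f"Missing {path}. Fetch it with wget as described above.")
    digest = hashlib.sha1(path.read_bytes()).hexdigest()
    if digest != EXPECTED_SHA1:
        print(f"SHA-1 mismatch for {path}: got {digest}")
        if sys.stdin.isatty():
            # Interactive session: allow continuing under non-standard conditions.
            if input("Continue anyway? [y/N] ").strip().lower() != "y":
                sys.exit(1)
        else:
            # Non-interactive session (CI, piped stdin): fail hard.
            sys.exit(1)
```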
Instead of using the submodule, you can reconstruct the patched PEFT module from `hf.patch` against upstream PEFT commit `20a9829` (v0.18.0.rc0):

```bash
git clone https://github.com/huggingface/peft /tmp/peft-fork
cd /tmp/peft-fork && git checkout 20a9829
git apply /path/to/dorafactors/hf.patch
```

The figure-generation scripts read pre-collected benchmark JSON from `code/bench_it6/` and TensorBoard events from `code/convergence_runs/`. Both resolve data paths relative to their own location, so they work from any working directory.
```bash
# Microbenchmark + model-level figures (requires matplotlib, numpy)
python paper/generate_figures.py        # PDF only
python paper/generate_figures.py --png  # PDF + PNG

# Training convergence figure (requires matplotlib, numpy, tensorboard)
python paper/generate_training_figure.py
```

Outputs go to `paper/figures/`.
Unit tests for the factored norm, fused kernels, and DoRA math live in the PEFT fork submodule. One test (`test_reference_vs_optimized_forward_equivalence`) loads the upstream HF PEFT baseline from `docs/dora.reference_hf_peft.py` inside the fork tree; copy it from the parent repo before running:

```bash
cp code/scripts/dora.reference_hf_peft.py vendor/dorafactors-peft/docs/
cd vendor/dorafactors-peft
pytest tests/test_lora_variants.py \
       tests/tuners/lora/test_dora_fused.py \
       tests/tuners/lora/test_dora_math.py -v
```

Triton kernel tests require an SM 80+ GPU (Ampere or newer); validated on SM 80 (A100) through SM 120 (RTX 6000 PRO).
Full documentation, how-to guides, and API reference are available at:
- Home — overview and quick-start
- Getting Started — installation, setup, and first steps
- Configuration — all configuration options and environment variables
| Module | Description | Reference |
|---|---|---|
| `peft.tuners.lora.dora` | Layer classes (`DoraLinearLayer`, `DoraEmbeddingLayer`, conv variants), configuration functions, FSDP/ZeRO-3 integration, and eager composition helpers | Layer Classes, Configuration |
| `peft.tuners.lora.dora_fused` | Fused Triton kernels for DoRA compose, norm assembly, and forward + inner products; custom autograd function; PyTorch fallbacks; autotune configs | Fused Kernels |
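Since the fork is a patched HF PEFT, the usual public API applies: DoRA is enabled through `LoraConfig(use_dora=True)`, and the dispatch between eager and fused backends happens inside the layer classes listed above. A minimal usage sketch (model name, rank, and target modules are placeholders):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder model

config = LoraConfig(
    r=64,                       # higher rank; the paper targets high-rank adaptation
    lora_alpha=128,
    target_modules=["c_attn"],  # placeholder target modules for GPT-2
    use_dora=True,              # route through the DoRA layer classes in peft.tuners.lora.dora
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```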
| Variable | Default | Description |
|---|---|---|
| `PEFT_DORA_FUSED` | unset (auto) | Enable fused Triton kernels: `"1"`, `"0"`, or unset (auto: use if Triton available) |
| `PEFT_DORA_FUSED_BACKWARD` | unset (on) | Fused backward pass: `"1"` (force on, bypass shape heuristic), `"0"` (off), or unset (on, with shape-based filtering for linear layers) |
| `PEFT_DORA_NORM_CHUNK_MB` | `256` | Column-norm chunking threshold in MB; matrices exceeding this are chunked (min 16) |
| `PEFT_DORA_FWD_CHUNK_MB` | `256` | Forward-pass chunking threshold in MB (min 16) |
| `DORA_AUTOTUNE_COMPREHENSIVE` | `"0"` | Enable comprehensive Triton autotuning (`"1"` for full search) |
| `PEFT_DORA_ALLOW_PARTIAL_GATHER` | `"0"` | Allow partial parameter gathering under ZeRO-3 (`"1"` to enable) |
| `PEFT_FORCE_GATHER` | unset (auto) | Force full parameter gathering: `"1"`, `"0"`, or unset (auto-detect) |
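These switches are read from the process environment, so they can be set in the shell or programmatically before the relevant code paths run. A small illustration using the variables from the table (values are examples, not recommendations):

```python
import os

# Force the fused Triton kernels on and raise the column-norm chunking threshold.
# Set these before the DoRA code paths execute, e.g. at the top of a training script.
os.environ["PEFT_DORA_FUSED"] = "1"            # "0" disables, unset = auto-detect Triton
os.environ["PEFT_DORA_FUSED_BACKWARD"] = "1"   # force on, bypassing the shape heuristic
os.environ["PEFT_DORA_NORM_CHUNK_MB"] = "512"  # chunk column-norm computation above 512 MB
```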
`dora.py` is the primary module: it defines all layer classes and configuration functions. It lazy-imports `dora_fused.py` on first use (guarded by `_get_dora_fused()`) so that Triton is not required at import time.
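A sketch of that lazy-import pattern; only the name `_get_dora_fused` comes from the module, and the guard's internals here are an assumption:

```python
# Lazy-import guard: dora_fused (and thus Triton) is imported only when a fused
# code path is first requested, never at `import peft` time.
_dora_fused = None

def _get_dora_fused():
    """Return the fused-kernel module, or None if Triton is unavailable."""
    global _dora_fused
    if _dora_fused is None:
        try:
            from peft.tuners.lora import dora_fused  # deferred import
            _dora_fused = dora_fused
        except ImportError:
            _dora_fused = False  # remember the failure; callers fall back to eager PyTorch
    return _dora_fused or None
```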
```bibtex
@misc{zelenin2026scalingdorahighrankadaptation,
      title={Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels},
      author={Alexandra Zelenin and Alexandra Zhuravlyova},
      year={2026},
      eprint={2603.22276},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.22276}
}
```