DMax: Aggressive Parallel Decoding for dLLMs
llama.cpp fork with TurboQuant WHT-rotated KV cache & weight compression + Gemma 4 MTP speculative decoding for ~30-50% throughput gains
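The description packs several techniques into one line; the core TurboQuant-style move it names is rotating K/V activations with a Walsh-Hadamard transform (WHT) before 4-bit quantization, so outlier channels get spread across all dimensions. A minimal sketch of that idea, assuming a per-token symmetric int4 scheme (function names and scaling details are illustrative, not the fork's code):

```python
import torch

def hadamard(n: int, device=None) -> torch.Tensor:
    """Normalized Walsh-Hadamard matrix for power-of-two n (orthonormal)."""
    H = torch.ones(1, 1, device=device)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / n ** 0.5

def quantize_kv(x: torch.Tensor, H: torch.Tensor):
    """x: (tokens, head_dim). Rotate, then 4-bit symmetric per-token quant."""
    xr = x @ H                                        # orthogonal rotation
    scale = xr.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 7.0
    q = (xr / scale).round().clamp(-8, 7).to(torch.int8)
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor, H: torch.Tensor):
    return (q.float() * scale) @ H.T                  # orthogonal: H^-1 = H^T
```

Because the rotation is orthogonal it is exactly invertible, and a fast WHT runs in O(d log d); the dense matmul above is only for clarity.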
Curated collection of research on the limitations of next-token prediction and methods that go beyond it.
Fused TBQ4 Flash Attention + MTP + Shared Tensors for llama.cpp — 82+ tok/s with lossless 4.25 bpv KV cache at 200K context on RTX 4090
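The fractional "4.25 bpv" is consistent with a standard block-quantization layout, 4-bit payloads plus a shared scale per block; one plausible accounting (an assumption about the TBQ4 format, not a documented spec):

```python
# 4-bit values plus one fp16 scale shared by each 64-value block:
bits_per_value = 4 + 16 / 64
print(bits_per_value)  # 4.25
```

For comparison, llama.cpp's Q4_0 stores an fp16 scale per 32 values, which works out to 4.5 bpv.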
ChemMiniQ3-SAbRLo: a lightweight experimental generative model for chemistry, built on a mini Qwen2-like architecture with a multi-token horizon loss and biologically-aware RL fine-tuning on SELFIES molecular representations. It is HuggingFace AutoModel/AutoTokenizer compatible and designed for rapid prototyping and fast iteration of Multi-Token Prediction (MTP) objectives and RL fine-tuning algorithms and rewards.
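The "horizon loss" is the MTP ingredient here. A minimal sketch of one common formulation, k auxiliary heads with geometrically decayed per-horizon cross-entropy (the head layout and decay are assumptions for illustration, not this repo's implementation):

```python
import torch
import torch.nn.functional as F

def horizon_loss(hidden, heads, tokens, k=4, decay=0.5):
    """hidden: (B, T, D) backbone states; heads: k x nn.Linear(D, vocab);
    tokens: (B, T). Head i at position t predicts token t+1+i, so each
    position is trained against a short window of future tokens."""
    T = hidden.shape[1]
    loss = 0.0
    for i, head in enumerate(heads[:k]):
        logits = head(hidden[:, : T - 1 - i])          # (B, T-1-i, V)
        target = tokens[:, 1 + i :]                    # (B, T-1-i)
        loss = loss + decay**i * F.cross_entropy(
            logits.reshape(-1, logits.shape[-1]), target.reshape(-1))
    return loss
```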
Multi-Token Prediction benchmarks for Gemma 4 on Apple Silicon — LiteRT-LM, transformers, and llama.cpp at batch=1 on a MacBook M4 Pro. ~2× speedup reproducible in one specific runtime.
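For context on how such batch=1 numbers are typically measured in the transformers runtime, a rough throughput harness (the model id is a placeholder, not one of the benchmarked Gemma checkpoints):

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some/causal-lm"                     # placeholder, not the real id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tok("The quick brown fox", return_tensors="pt")
model.generate(**inputs, max_new_tokens=8)      # warm-up pass

t0 = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
dt = time.perf_counter() - t0
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / dt:.1f} tok/s")
```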
Research code for ProbeRoute, a probe-initialized sparse routing method for frozen-backbone multi-token prediction
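Going only by the name, one plausible reading of "probe-initialized sparse routing" is: fit a linear probe on the frozen backbone's features, then copy its weights into a top-k router whose selections feed the MTP heads. This is entirely a guess about ProbeRoute, not its published method:

```python
import torch
import torch.nn as nn

class SparseRouter(nn.Module):
    """Top-k router whose gate is initialized from a pretrained linear probe."""
    def __init__(self, probe: nn.Linear, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(probe.in_features, probe.out_features)
        self.gate.load_state_dict(probe.state_dict())   # probe initialization
        self.k = k

    def forward(self, h):                       # h: (B, T, D) frozen features
        logits = self.gate(h)                   # (B, T, n_experts)
        val, idx = logits.topk(self.k, dim=-1)  # keep k experts per token
        w = torch.softmax(val, dim=-1)          # renormalize over the chosen k
        return idx, w                           # consumed by downstream heads
```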
Reverse-engineering how DeepSeek achieved frontier LLM performance at a fraction of the cost — through hands-on PyTorch implementations of MLA, MoE, MTP, RoPE, and quantization.
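Of the listed components, RoPE is the most self-contained; a generic PyTorch version of the rotation such re-implementations walk through (a textbook sketch, not this repo's code):

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (..., seq, dim) with even dim. Rotates channel pairs by position-
    dependent angles so attention scores depend on relative offsets."""
    seq, dim = x.shape[-2], x.shape[-1]
    inv_freq = 1.0 / base ** (torch.arange(0, dim, 2, dtype=x.dtype) / dim)
    ang = torch.arange(seq, dtype=x.dtype)[:, None] * inv_freq  # (seq, dim/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```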