Fast-dLLM-mlx implements Dream architecture inference in MLX for Apple Silicon.
This repository is an MLX implementation of the core Fast-dLLM ideas from the NVLabs/Fast-dLLM project, adapted for Dream-style diffusion language models. The focus here is training-free inference speedups for Dream models running on MLX.
Tech note: https://research.macpaw.com/publications/fast-dllm-mlx.
The current implementation includes:
- Dream architecture inference in MLX
- A first MLX version of the Fast-dLLM approach from the original NVLabs repo
- Dual-cache support for more efficient decoding
- Parallel token generation with probability thresholding
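The parallel generation rule can be illustrated with a small sketch. The idea, following the Fast-dLLM approach: at each diffusion step, unmask every position whose top-token probability clears the threshold, and always unmask at least the single most confident position so decoding makes progress. This is an illustrative sketch using NumPy, not the repo's actual API (`parallel_accept` is a hypothetical name).

```python
import numpy as np

def parallel_accept(probs, threshold=0.9):
    """Decide which masked positions to unmask this step.

    probs: (n_masked, vocab) per-position token probabilities.
    Accepts every position whose top-token probability exceeds the
    threshold; if none qualifies, accepts only the single most
    confident position so decoding always advances.
    """
    top_p = probs.max(axis=-1)          # confidence per masked position
    tokens = probs.argmax(axis=-1)      # greedy token per position
    accept = top_p > threshold
    if not accept.any():
        accept[top_p.argmax()] = True   # guarantee at least one token
    return tokens, accept
```

A higher threshold accepts fewer tokens per step (closer to one-at-a-time decoding) but keeps each accepted token high-confidence; a lower threshold unmasks more tokens in parallel at some quality risk.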
The repository also includes small benchmark scripts used to compare MLX
variants across the prompts in prompts/.
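The dual-cache idea can be shown with a toy model of the compute pattern: key/value entries for the already-decoded prefix and the still-masked suffix change little between diffusion steps, so both are cached and only the active block is recomputed each step, with caches refreshed at block boundaries. The sketch below is a hypothetical illustration (`kv_fn` stands in for a transformer layer's per-position key/value computation), not the repo's actual code.

```python
def decode_with_dual_cache(seq_len, block_len, steps_per_block, kv_fn):
    """Count per-position KV computations under dual caching.

    Prefix and suffix entries are computed once per block ("dual
    cache"); only the current block is recomputed on every diffusion
    step. Returns the total number of kv_fn calls.
    """
    kv = [None] * seq_len
    recomputes = 0
    for start in range(0, seq_len, block_len):
        end = start + block_len
        # Refresh prefix + suffix caches once at the block boundary.
        for pos in list(range(0, start)) + list(range(end, seq_len)):
            kv[pos] = kv_fn(pos)
            recomputes += 1
        # Per diffusion step, recompute only the active block.
        for _ in range(steps_per_block):
            for pos in range(start, end):
                kv[pos] = kv_fn(pos)
                recomputes += 1
    return recomputes
```

For example, with `seq_len=8`, `block_len=4`, and 3 steps per block, this does 32 per-position computations versus 48 for recomputing every position at every step, and the gap widens as sequences grow relative to the block size.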
uv sync

or

pip install -e .

Run the Fast-dLLM MLX benchmark:
uv run python -m benchmarks.fast_dllm_mlx_benchmark \
  --model mlx-community/DiffuCoder-7B-cpGRPO-8bit \
  --trust-remote-code \
  --max-new-tokens 128 \
  --steps 20 \
  --block-length 32 \
  --threshold 0.9 \
  --warmup

To print the generated response for each prompt while benchmarking, add the --print-response flag.

Run the Dream MLX benchmark:
uv run python -m benchmarks.dream_mlx_benchmark \
  --model mlx-community/DiffuCoder-7B-cpGRPO-8bit \
  --trust-remote-code \
  --max-new-tokens 128 \
  --steps 20 \
  --use-compile

Run the Qwen mlx_lm benchmark:
uv run python -m benchmarks.qwen_mlx_lm_benchmark \
  --model mlx-community/Qwen2.5-Coder-7B-Instruct-8bit \
  --max-new-tokens 128 \
  --temp 0.0 \
  --top-p 1.0 \
  --warmup

All benchmark scripts write CSV and JSON summaries by default, and you can point
them at a custom prompt set with --prompt-file.
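The `--max-new-tokens` and `--block-length` flags above relate through simple arithmetic: generation proceeds semi-autoregressively over blocks, so the example settings of 128 new tokens with 32-token blocks give four blocks decoded left to right. A minimal sketch of that relationship (a hypothetical helper, assuming the token budget divides evenly into blocks):

```python
def num_blocks(max_new_tokens, block_length):
    """Number of blocks decoded left to right.

    E.g. --max-new-tokens 128 with --block-length 32 gives
    128 // 32 = 4 blocks.
    """
    if max_new_tokens % block_length:
        raise ValueError("max_new_tokens should be a multiple of block_length")
    return max_new_tokens // block_length
```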
The repo includes benchmark entrypoints for the main comparison paths:
- benchmarks/fast_dllm_mlx_benchmark.py
- benchmarks/dream_mlx_benchmark.py
- benchmarks/qwen_mlx_lm_benchmark.py
The benchmark prompts currently come from the limited sample set in
prompts/, so the reported
numbers should be treated as directional rather than exhaustive.
The comparison below was run on a limited set of samples from the
prompts/ folder.
For the MLX version of the Dream architecture, this repository uses DiffuCoder.
We include Qwen2.5-Coder in the comparison because the Dream architecture used here is based on Qwen2.5, and using the coder variant makes the comparison with DiffuCoder more reliable.
In this benchmark slice, the Fast-dLLM MLX variants are substantially faster
than the Dream MLX variants. Among the Qwen mlx_lm baselines, only the 4-bit
variant outperforms Dream with Fast-dLLM on MLX. These numbers are useful for a
quick relative comparison, but they are not a full evaluation across larger
prompt sets or different generation settings.
@online{yemets-2026-fast-dllm-mlx,
  author = {Kyrylo Yemets},
  title  = {Fast-dLLM on MLX: Training-Free Acceleration for Diffusion Language Models on Apple Silicon},
  note   = {\emph{Online.} \url{https://research.macpaw.com/publications/fast-dllm-mlx}},
  month  = {Apr},
  year   = {2026},
}

