Fast-dLLM-mlx

Fast-dLLM-mlx implements Dream architecture inference in MLX for Apple Silicon.

This repository is a first MLX implementation of the ideas from the NVLabs/Fast-dLLM project, adapted for Dream-style diffusion language models. The focus is training-free inference speedups for Dream models running on MLX.

Tech note: https://research.macpaw.com/publications/fast-dllm-mlx.

The current implementation includes:

  • Dream architecture inference in MLX
  • A first MLX version of the Fast-dLLM approach from the original NVLabs repo
  • Dual-cache support for more efficient decoding
  • Parallel token generation with probability thresholding (sketched after this list)
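
A minimal sketch of the thresholding idea, written as MLX-flavored Python (the function name, the `model` call signature, and the `mask_id` convention are illustrative assumptions, not this repo's API): at each denoising step, every masked position whose top-token probability clears the threshold is committed at once, instead of a single token per step.

import mlx.core as mx

def parallel_unmask_step(model, tokens, mask_id, threshold=0.9):
    # One denoising step over a 1-D array of token ids. Assumes `model`
    # maps a (1, seq_len) batch to (1, seq_len, vocab) logits and that
    # `mask_id` marks positions still to be filled.
    logits = model(tokens[None])[0]
    probs = mx.softmax(logits, axis=-1)
    top_prob = mx.max(probs, axis=-1)   # per-position confidence
    top_tok = mx.argmax(probs, axis=-1).astype(tokens.dtype)

    masked = tokens == mask_id
    confident = mx.logical_and(masked, top_prob >= threshold)
    if not mx.any(confident).item():
        # Guarantee progress: if nothing clears the threshold, commit
        # only the single most confident masked position.
        scores = mx.where(masked, top_prob, mx.full(top_prob.shape, -1.0))
        confident = mx.arange(tokens.size) == mx.argmax(scores)
    # Commit all confident positions in parallel; the rest stay masked.
    return mx.where(confident, top_tok, tokens)

Repeating this step until no mask_id remains yields the decoded sequence; the block-wise variant restricts updates to the active block. In the original Fast-dLLM approach, the dual cache complements this by keeping key/value states for both the prompt prefix and the not-yet-decoded suffix, refreshing them at block boundaries so block-wise decoding stays cheap.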

The repository also includes small benchmark scripts used to compare MLX variants across the prompts in prompts/.

Install

uv sync

or

pip install -e .

Basic Usage

Run the Fast-dLLM MLX benchmark:

uv run python -m benchmarks.fast_dllm_mlx_benchmark \
  --model mlx-community/DiffuCoder-7B-cpGRPO-8bit \
  --trust-remote-code \
  --max-new-tokens 128 \
  --steps 20 \
  --block-length 32 \
  --threshold 0.9 \
  --warmup

To print the generated response for each prompt while benchmarking, add the flag:

  --print-response

Run the Dream MLX benchmark:

uv run python -m benchmarks.dream_mlx_benchmark \
  --model mlx-community/DiffuCoder-7B-cpGRPO-8bit \
  --trust-remote-code \
  --max-new-tokens 128 \
  --steps 20 \
  --use-compile

Run the Qwen mlx_lm benchmark:

uv run python -m benchmarks.qwen_mlx_lm_benchmark \
  --model mlx-community/Qwen2.5-Coder-7B-Instruct-8bit \
  --max-new-tokens 128 \
  --temp 0.0 \
  --top-p 1.0 \
  --warmup

All benchmark scripts write CSV and JSON summaries by default, and you can point them at a custom prompt set with --prompt-file.
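
For example, to run the Fast-dLLM benchmark against a custom prompt set (the prompt file path below is a hypothetical placeholder):

uv run python -m benchmarks.fast_dllm_mlx_benchmark \
  --model mlx-community/DiffuCoder-7B-cpGRPO-8bit \
  --trust-remote-code \
  --prompt-file prompts/custom.txt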

Benchmarks

The repo includes benchmark entrypoints for the main comparison paths:

  • benchmarks.fast_dllm_mlx_benchmark (Dream with Fast-dLLM on MLX)
  • benchmarks.dream_mlx_benchmark (baseline Dream inference on MLX)
  • benchmarks.qwen_mlx_lm_benchmark (Qwen2.5-Coder via mlx_lm)

The benchmark prompts currently come from the limited sample set in prompts/, so the reported numbers should be treated as directional rather than exhaustive.

Results

The comparison below was run on a limited number of samples from the prompts/ folder.

For the MLX version of the Dream architecture, this repository uses DiffuCoder.

We include Qwen2.5-Coder in the comparison because the Dream architecture used here is based on Qwen2.5, and using the coder variant makes the comparison with DiffuCoder more reliable.

[Chart: MLX variants including Fast-dLLM, by benchmark category]

In this benchmark slice, the Fast-dLLM MLX variants are substantially faster than the Dream MLX variants. Among the Qwen mlx_lm baselines, only the 4-bit variant outperforms Dream with Fast-dLLM on MLX. These numbers are useful for a quick relative comparison, but they are not a full evaluation across larger prompt sets or different generation settings.

To cite

@online{yemets-2026-fast-dllm-mlx,
  author = {Kyrylo Yemets},
  title = {Fast-dLLM on MLX: Training-Free Acceleration for Diffusion Language Models on Apple Silicon},
  note = {\emph{Online.} \url{https://research.macpaw.com/publications/fast-dllm-mlx}},
  month = {Apr},
  year = {2026},
}
