Replies: 1 comment
And I just saw 28fab9f 😂 well done!
A rather neat approach that uses a diffusion model as a draft to dramatically speed up inference is doing the rounds.
Blog: https://z-lab.ai/projects/dflash/
Paper: https://arxiv.org/html/2602.06036v1
@bstnxbt made an MLX server for it: https://github.com/bstnxbt/dflash-mlx
Qwen 3.5 DFlash (draft) models:
Testing out Qwen 3.5 27B 4-bit with a DFlash draft on an M5 Max:
1904 tokens | 40.0 tok/s | 82.6% acceptance
Looks significantly more promising than #500
Wondering how viable it would be to implement diffusion draft models in oMLX?
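For anyone unfamiliar with the idea: the acceptance rate above comes from the usual draft-and-verify loop of speculative decoding, where the draft model proposes a block of tokens and the target model verifies them, keeping the longest matching prefix. Below is a minimal toy sketch of that loop (greedy-verification variant) with stand-in functions for both models — this is not DFlash's or dflash-mlx's actual API, and `draft_propose` / `target_next_token` are hypothetical names for illustration only:

```python
import random

random.seed(0)
VOCAB_SIZE = 100

def draft_propose(prefix, k):
    # Hypothetical stand-in for the drafter: propose k tokens at once.
    # (A real diffusion drafter would denoise a whole block in parallel,
    # which is what makes the draft step fast.)
    return [random.randrange(VOCAB_SIZE) for _ in range(k)]

def target_next_token(prefix):
    # Hypothetical stand-in for the target model's greedy next token.
    return (sum(prefix) * 31 + 7) % VOCAB_SIZE

def speculative_step(prefix, k=8):
    """Verify k drafted tokens; return (accepted_tokens, draft_tokens).

    Accept drafted tokens while they match the target's greedy choice,
    then append the target's own token at the first mismatch, so each
    step always yields at least one new token.
    """
    draft = draft_propose(prefix, k)
    accepted = []
    for tok in draft:
        want = target_next_token(prefix + accepted)
        if tok == want:
            accepted.append(tok)       # draft token verified
        else:
            accepted.append(want)      # target's correction; stop here
            break
    else:
        # All k drafted tokens accepted: take one bonus target token.
        accepted.append(target_next_token(prefix + accepted))
    return accepted, draft

accepted, draft = speculative_step([1, 2, 3], k=8)
matches = sum(a == d for a, d in zip(accepted, draft))
print(f"{len(accepted)} tokens this step, {matches}/{len(draft)} draft tokens accepted")
```

The speedup comes entirely from the acceptance rate: a step emits between 1 and k+1 tokens for roughly one target-model forward pass, so the 82.6% acceptance reported above means most drafted blocks survive verification largely intact.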