
fix(admin): clarify DFlash draft quantization option + FP16 draft model boost#880

Open
deepsweet wants to merge 1 commit into jundot:main from deepsweet:clarify-draft-quantization

Conversation

@deepsweet commented Apr 21, 2026

Hi.

Rationale

Currently the DFlash "draft quantization" option offers the following values:

<option value="">bf16 (default)</option>
<option value="4">4-bit</option>
<option value="8">8-bit</option>

However, under the hood the selected value is cast to a plain boolean:

omlx/omlx/engine/dflash.py, lines 94 to 101 in 9a095cd:

from dflash_mlx.runtime import load_target_bundle, load_draft_bundle
model, tokenizer, meta = load_target_bundle(self._model_name)
draft, draft_meta = load_draft_bundle(
self._draft_model_path,
quantize_draft=bool(self._draft_quant_bits),
)
return model, tokenizer, meta, draft

So the empty "default" value becomes False and any other value becomes True, and that boolean is passed straight to dflash-mlx.
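
To make the collapse concrete, here is a minimal sketch; the exact type of self._draft_quant_bits is an assumption, only the bool() call matters:

for raw in ("", "4", "8"):                        # the three current option values
    draft_quant_bits = int(raw) if raw else None  # hypothetical parsing of the form value
    print(repr(raw), "->", bool(draft_quant_bits))
# '' -> False, '4' -> True, '8' -> True: 4-bit and 8-bit are indistinguishable here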

dflash-mlx, in turn, checks only that boolean:

https://github.com/bstnxbt/dflash-mlx/blob/f825ffb268e50d531e8b6524413b0847334a14dd/dflash_mlx/runtime.py#L229-L233

def _should_quantize_draft(quantize_draft: bool = False) -> bool:
    if quantize_draft:
        return True
    raw = os.environ.get("DFLASH_QUANTIZE_DRAFT", "").strip().lower()
    return raw not in {"", "0", "false", "no"}
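
For completeness, the env var fallback means quantization can also be forced without the explicit flag; a quick runnable check of the function above:

import os

def _should_quantize_draft(quantize_draft: bool = False) -> bool:
    # copied verbatim from the snippet above so this demo runs standalone
    if quantize_draft:
        return True
    raw = os.environ.get("DFLASH_QUANTIZE_DRAFT", "").strip().lower()
    return raw not in {"", "0", "false", "no"}

os.environ.pop("DFLASH_QUANTIZE_DRAFT", None)
assert _should_quantize_draft(False) is False   # no flag, no env var
assert _should_quantize_draft(True) is True     # explicit flag wins
os.environ["DFLASH_QUANTIZE_DRAFT"] = "1"
assert _should_quantize_draft(False) is True    # env var alone also enables quantization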

Any truthy value is then hardcoded to 4-bit quantization, regardless of which bit width was selected:

https://github.com/bstnxbt/dflash-mlx/blob/f825ffb268e50d531e8b6524413b0847334a14dd/dflash_mlx/runtime.py#L769-L771

    quantized = _should_quantize_draft(quantize_draft)
    if quantized:
        nn.quantize(model, bits=4, group_size=64)
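
Putting the two pieces together, the effective mapping from the UI value to the bits actually used looks like this (a sketch with placeholder logic, no real model loading):

def effective_draft_bits(option_value: str):
    """What the draft model actually ends up with for a given UI option value (sketch)."""
    quantize_draft = bool(option_value)  # omlx: quantize_draft=bool(self._draft_quant_bits)
    if not quantize_draft:
        return None                      # draft stays in the checkpoint's own precision
    return 4                             # dflash-mlx: nn.quantize(model, bits=4, group_size=64)

assert effective_draft_bits("") is None  # "bf16 (default)" -> no quantization
assert effective_draft_bits("4") == 4    # "4-bit" -> 4-bit
assert effective_draft_bits("8") == 4    # "8-bit" -> still 4-bit, the value is ignored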

Proposal

<option value="">None (default)</option>
<option value="4">4-bit</option>

Besides, an FP16 draft model works as expected on M1/M2 Apple Silicon.

dflash_mlx seems to support it just fine.
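
For reference, one way to produce such an FP16 draft is to cast the checkpoint weights with mlx; this is only a sketch with placeholder paths, and a real DFlash draft bundle may carry extra files (config, tokenizer) that need to be copied alongside:

import mlx.core as mx

weights = mx.load("draft-bf16/model.safetensors")             # dict[str, mx.array]
fp16 = {k: v.astype(mx.float16) for k, v in weights.items()}  # cast every tensor to FP16
mx.save_safetensors("draft-fp16/model.safetensors", fp16)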

DFlash FP16

No DFlash:

Benchmark Model: Qwen3.6-35B-A3B-MLX-oQ5-FP16
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128          1208.2       13.32   847.6 tok/s    75.7 tok/s       2.899   397.3 tok/s    23.86 GB
pp4096/tg128          4249.8       14.62   963.8 tok/s    69.0 tok/s       6.106   691.8 tok/s    24.64 GB
pp8192/tg128          9851.2       15.50   831.6 tok/s    65.0 tok/s      11.820   703.9 tok/s    24.99 GB
pp16384/tg128        20957.0       17.87   781.8 tok/s    56.4 tok/s      23.227   710.9 tok/s    25.61 GB
pp32768/tg128        49014.8       23.50   668.5 tok/s    42.9 tok/s      51.999   632.6 tok/s    26.95 GB

DFlash + DFLASH_MAX_CTX=32768 + default BF16 draft model:

Benchmark Model: Qwen3.6-35B-A3B-MLX-oQ5-FP16
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128          1318.2        6.84   776.8 tok/s   147.3 tok/s       2.187   526.8 tok/s    25.87 GB
pp4096/tg128          5964.4        7.26   686.7 tok/s   138.9 tok/s       6.886   613.4 tok/s    27.01 GB
pp8192/tg128         14474.0        8.12   566.0 tok/s   124.2 tok/s      15.505   536.6 tok/s    27.52 GB
pp16384/tg128        37778.8        9.34   433.7 tok/s   107.9 tok/s      38.965   423.8 tok/s    28.15 GB
pp32768/tg128        50907.5       22.04   643.7 tok/s    45.7 tok/s      53.706   612.5 tok/s    26.95 GB

DFlash + DFLASH_MAX_CTX=32768 + my draft model converted to FP16:

Benchmark Model: Qwen3.6-35B-A3B-MLX-oQ5-FP16
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128          1206.3        5.61   848.9 tok/s   179.5 tok/s       1.919   600.2 tok/s    25.19 GB
pp4096/tg128          7203.3        5.89   568.6 tok/s   171.0 tok/s       7.952   531.2 tok/s    26.28 GB
pp8192/tg128         14451.1        7.14   566.9 tok/s   141.2 tok/s      15.358   541.8 tok/s    26.40 GB
pp16384/tg128        38684.3        8.83   423.5 tok/s   114.2 tok/s      39.805   414.8 tok/s    27.07 GB
pp32768/tg128        52087.6       22.36   629.1 tok/s    45.1 tok/s      54.927   598.9 tok/s    26.95 GB

DFLASH_MAX_CTX=32768 is included just as a reminder that DFlash noticeably degrades at large context sizes, so the default 4096, and maybe 8192 in some cases, is the sweet spot.

Closes #993.

@jundot force-pushed the main branch 2 times, most recently from 7844f15 to b078330 on April 28, 2026 02:11
@deepsweet (Author) commented:

Qwen3.6-27B is pretty tough for my M2 Max 64GB MacBook in general, but here are the results.

No DFlash:

Benchmark Model: Qwen3.6-27B-MLX-oQ4-FP16
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128          6889.5       65.84   148.6 tok/s    15.3 tok/s      15.251    75.5 tok/s    16.48 GB
pp4096/tg128         31597.4       72.54   129.6 tok/s    13.9 tok/s      40.809   103.5 tok/s    17.90 GB
pp8192/tg128         65813.7       78.58   124.5 tok/s    12.8 tok/s      75.793   109.8 tok/s    18.52 GB

DFlash + DFLASH_MAX_CTX=8192 + default BF16 draft model:

Benchmark Model: Qwen3.6-27B-MLX-oQ4-FP16
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128         11763.3       30.99    87.1 tok/s    32.5 tok/s      15.699    73.4 tok/s    21.96 GB
pp4096/tg128         45524.0       34.07    90.0 tok/s    29.6 tok/s      49.851    84.7 tok/s    24.00 GB
pp8192/tg128         77660.9       85.28   105.5 tok/s    11.8 tok/s      88.492    94.0 tok/s    18.52 GB

DFlash + DFLASH_MAX_CTX=8192 + my draft model converted to FP16:

Benchmark Model: Qwen3.6-27B-MLX-oQ4-FP16
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128          9160.7       32.80   111.8 tok/s    30.7 tok/s      13.326    86.4 tok/s    20.12 GB
pp4096/tg128         35638.3       26.82   114.9 tok/s    37.6 tok/s      39.045   108.2 tok/s    21.86 GB
pp8192/tg128         66174.5       73.32   123.8 tok/s    13.7 tok/s      75.486   110.2 tok/s    18.52 GB

I have also re-uploaded Qwen3.6-35B-A3B-DFlash-FP16 after the original model got updated.


Development

Successfully merging this pull request may close these issues.

Please add Dflash fp16 type
