
fix(admin): clarify DFlash draft quantization option + FP16 draft model boost#880

Open
deepsweet wants to merge 1 commit into jundot:main from deepsweet:clarify-draft-quantization

Conversation

@deepsweet commented Apr 21, 2026

Hi.

Rationale

Currently the DFlash "draft quantization" option offers the following values:

<option value="">bf16 (default)</option>
<option value="4">4-bit</option>
<option value="8">8-bit</option>

However, under the hood the selected value is cast to a plain boolean:

omlx/omlx/engine/dflash.py, lines 94 to 101 in 9a095cd:

from dflash_mlx.runtime import load_target_bundle, load_draft_bundle
model, tokenizer, meta = load_target_bundle(self._model_name)
draft, draft_meta = load_draft_bundle(
self._draft_model_path,
quantize_draft=bool(self._draft_quant_bits),
)
return model, tokenizer, meta, draft

So the empty "default" value becomes False and any other value becomes True, and that boolean is passed straight to dflash-mlx.
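
To make the collapse concrete, here is a minimal sketch; the exact type of self._draft_quant_bits is an assumption, only the bool() call matters:

for raw in ("", "4", "8"):                        # the three current option values
    draft_quant_bits = int(raw) if raw else None  # hypothetical parsing of the form value
    print(repr(raw), "->", bool(draft_quant_bits))
# '' -> False, '4' -> True, '8' -> True: 4-bit and 8-bit are indistinguishable here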

dflash-mlx, in turn, checks only that boolean:

https://github.com/bstnxbt/dflash-mlx/blob/f825ffb268e50d531e8b6524413b0847334a14dd/dflash_mlx/runtime.py#L229-L233

def _should_quantize_draft(quantize_draft: bool = False) -> bool:
    if quantize_draft:
        return True
    raw = os.environ.get("DFLASH_QUANTIZE_DRAFT", "").strip().lower()
    return raw not in {"", "0", "false", "no"}
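
For completeness, the env var fallback means quantization can also be forced without the explicit flag; a quick runnable check of the function above:

import os

def _should_quantize_draft(quantize_draft: bool = False) -> bool:
    # copied verbatim from the snippet above so this demo runs standalone
    if quantize_draft:
        return True
    raw = os.environ.get("DFLASH_QUANTIZE_DRAFT", "").strip().lower()
    return raw not in {"", "0", "false", "no"}

os.environ.pop("DFLASH_QUANTIZE_DRAFT", None)
assert _should_quantize_draft(False) is False   # no flag, no env var
assert _should_quantize_draft(True) is True     # explicit flag wins
os.environ["DFLASH_QUANTIZE_DRAFT"] = "1"
assert _should_quantize_draft(False) is True    # env var alone also enables quantization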

Any truthy value is then hardcoded to 4-bit quantization, regardless of which bit width was selected:

https://github.com/bstnxbt/dflash-mlx/blob/f825ffb268e50d531e8b6524413b0847334a14dd/dflash_mlx/runtime.py#L769-L771

    quantized = _should_quantize_draft(quantize_draft)
    if quantized:
        nn.quantize(model, bits=4, group_size=64)
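
Putting the two pieces together, the effective mapping from the UI value to the bits actually used looks like this (a sketch with placeholder logic, no real model loading):

def effective_draft_bits(option_value: str):
    """What the draft model actually ends up with for a given UI option value (sketch)."""
    quantize_draft = bool(option_value)  # omlx: quantize_draft=bool(self._draft_quant_bits)
    if not quantize_draft:
        return None                      # draft stays in the checkpoint's own precision
    return 4                             # dflash-mlx: nn.quantize(model, bits=4, group_size=64)

assert effective_draft_bits("") is None  # "bf16 (default)" -> no quantization
assert effective_draft_bits("4") == 4    # "4-bit" -> 4-bit
assert effective_draft_bits("8") == 4    # "8-bit" -> still 4-bit, the value is ignored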

Proposal

<option value="">None (default)</option>
<option value="4">4-bit</option>

Besides, an FP16 draft model works as expected on M1/M2 Apple Silicon.

dflash_mlx seems to support it just fine.
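
For reference, one way to produce such an FP16 draft is to cast the checkpoint weights with mlx; this is only a sketch with placeholder paths, and a real DFlash draft bundle may carry extra files (config, tokenizer) that need to be copied alongside:

import mlx.core as mx

weights = mx.load("draft-bf16/model.safetensors")             # dict[str, mx.array]
fp16 = {k: v.astype(mx.float16) for k, v in weights.items()}  # cast every tensor to FP16
mx.save_safetensors("draft-fp16/model.safetensors", fp16)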

DFlash FP16

No DFlash:

Benchmark Model: Qwen3.6-35B-A3B-MLX-oQ5-FP16
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128          1208.2       13.32   847.6 tok/s    75.7 tok/s       2.899   397.3 tok/s    23.86 GB
pp4096/tg128          4249.8       14.62   963.8 tok/s    69.0 tok/s       6.106   691.8 tok/s    24.64 GB
pp8192/tg128          9851.2       15.50   831.6 tok/s    65.0 tok/s      11.820   703.9 tok/s    24.99 GB
pp16384/tg128        20957.0       17.87   781.8 tok/s    56.4 tok/s      23.227   710.9 tok/s    25.61 GB
pp32768/tg128        49014.8       23.50   668.5 tok/s    42.9 tok/s      51.999   632.6 tok/s    26.95 GB

DFlash + DFLASH_MAX_CTX=32768 + default BF16 draft model:

Benchmark Model: Qwen3.6-35B-A3B-MLX-oQ5-FP16
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128          1318.2        6.84   776.8 tok/s   147.3 tok/s       2.187   526.8 tok/s    25.87 GB
pp4096/tg128          5964.4        7.26   686.7 tok/s   138.9 tok/s       6.886   613.4 tok/s    27.01 GB
pp8192/tg128         14474.0        8.12   566.0 tok/s   124.2 tok/s      15.505   536.6 tok/s    27.52 GB
pp16384/tg128        37778.8        9.34   433.7 tok/s   107.9 tok/s      38.965   423.8 tok/s    28.15 GB
pp32768/tg128        50907.5       22.04   643.7 tok/s    45.7 tok/s      53.706   612.5 tok/s    26.95 GB

DFlash + DFLASH_MAX_CTX=32768 + my draft model converted to FP16:

Benchmark Model: Qwen3.6-35B-A3B-MLX-oQ5-FP16
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128          1206.3        5.61   848.9 tok/s   179.5 tok/s       1.919   600.2 tok/s    25.19 GB
pp4096/tg128          7203.3        5.89   568.6 tok/s   171.0 tok/s       7.952   531.2 tok/s    26.28 GB
pp8192/tg128         14451.1        7.14   566.9 tok/s   141.2 tok/s      15.358   541.8 tok/s    26.40 GB
pp16384/tg128        38684.3        8.83   423.5 tok/s   114.2 tok/s      39.805   414.8 tok/s    27.07 GB
pp32768/tg128        52087.6       22.36   629.1 tok/s    45.1 tok/s      54.927   598.9 tok/s    26.95 GB

DFLASH_MAX_CTX=32768 is included just as a reminder that DFlash noticeably degrades at large context sizes, so the default 4096, and maybe 8192 in some cases, is the sweet spot.

Closes #993.

@jundot force-pushed the main branch 2 times, most recently from 7844f15 to b078330 on April 28, 2026 02:11
@deepsweet (Author) commented:

Qwen3.6-27B is pretty tough for my M2 Max 64GB MacBook in general, but here are the results.

No DFlash:

Benchmark Model: Qwen3.6-27B-MLX-oQ4-FP16
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128          6889.5       65.84   148.6 tok/s    15.3 tok/s      15.251    75.5 tok/s    16.48 GB
pp4096/tg128         31597.4       72.54   129.6 tok/s    13.9 tok/s      40.809   103.5 tok/s    17.90 GB
pp8192/tg128         65813.7       78.58   124.5 tok/s    12.8 tok/s      75.793   109.8 tok/s    18.52 GB

DFlash + DFLASH_MAX_CTX=8192 + default BF16 draft model:

Benchmark Model: Qwen3.6-27B-MLX-oQ4-FP16
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128         11763.3       30.99    87.1 tok/s    32.5 tok/s      15.699    73.4 tok/s    21.96 GB
pp4096/tg128         45524.0       34.07    90.0 tok/s    29.6 tok/s      49.851    84.7 tok/s    24.00 GB
pp8192/tg128         77660.9       85.28   105.5 tok/s    11.8 tok/s      88.492    94.0 tok/s    18.52 GB

DFlash + DFLASH_MAX_CTX=8192 + my draft model converted to FP16:

Benchmark Model: Qwen3.6-27B-MLX-oQ4-FP16
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128          9160.7       32.80   111.8 tok/s    30.7 tok/s      13.326    86.4 tok/s    20.12 GB
pp4096/tg128         35638.3       26.82   114.9 tok/s    37.6 tok/s      39.045   108.2 tok/s    21.86 GB
pp8192/tg128         66174.5       73.32   123.8 tok/s    13.7 tok/s      75.486   110.2 tok/s    18.52 GB

I have also re-uploaded Qwen3.6-35B-A3B-DFlash-FP16 after the original model got updated.


Development

Successfully merging this pull request may close these issues.

Please add Dflash fp16 type
