MTP Appears Non-Functional on Qwen3.6-27B-MTP (Performance Drops Instead of Improving)
Environment
- Model: Qwen3.6-27B-MTP-GGUF
- OS: Fedora 44
- Runtime: Vulkan (llama.cpp v2.20.1)
- GPU: AMD Radeon RX 7900 XTX
Issue
Enabling MTP for Qwen3.6-27B-MTP-GGUF does not appear to provide any multi-token prediction benefit.
Instead of increasing throughput, generation speed drops significantly, suggesting that MTP may not actually be functioning while still incurring additional overhead.
Results
| Configuration | Generation Speed |
|---|---|
| MTP Disabled | ~35 tokens/s |
| MTP Enabled | ~13 tokens/s |
Expected Behavior
When MTP is working correctly, generation speed should increase due to successful multi-token predictions.
For comparison, on the same system, Qwen3.6-35B-A3B-MTP-GGUF behaves as expected:
| Model | MTP Disabled | MTP Enabled |
|---|---|---|
| Qwen3.6-35B-A3B-MTP-GGUF | ~50 tokens/s | ~110 tokens/s |
Why This Looks Like an MTP Issue
The runtime and hardware are clearly capable of benefiting from MTP, as demonstrated by the 35B A3B model.
With Qwen3.6-27B-MTP-GGUF, enabling MTP appears to:
- Provide no observable multi-token prediction speedup.
- Reduce throughput by roughly 60%.
- Behave as though MTP overhead is present, but the MTP predictions themselves are not contributing to generation.
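To make the comparison concrete, here is a small sketch of the arithmetic behind these percentages. It uses only the approximate throughput figures reported in this post; the helper name `relative_change` is my own and is not part of any tool.

```python
def relative_change(baseline_tps: float, mtp_tps: float) -> float:
    """Return the throughput change as a fraction of the baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (mtp_tps - baseline_tps) / baseline_tps


# Approximate figures reported in this post (tokens/s), MTP off -> MTP on.
dense_27b = relative_change(35.0, 14.0)
moe_35b = relative_change(50.0, 110.0)

print(f"Qwen3.6-27B-MTP:     {dense_27b:+.0%}")  # -60%
print(f"Qwen3.6-35B-A3B-MTP: {moe_35b:+.0%}")    # +120%
```

So on the same hardware and runtime, MTP more than doubles throughput on the 35B A3B model while cutting it by more than half on the 27B model.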
Reproduction
- Load Qwen3.6-27B-MTP-GGUF.
- Enable MTP in Advanced Settings.
- Generate text and measure throughput.
- Disable MTP and repeat.
- Observe that throughput decreases from ~35 t/s to ~14 t/s when MTP is enabled.
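For anyone who wants to reproduce the measurement outside the UI, the following is a minimal timing sketch. It assumes you can wrap one generation in a callable that returns the number of tokens produced; `generate` and `fake_generate` are hypothetical placeholders, not an API of llama.cpp or any frontend. It measures end-to-end wall-clock time, so prompt processing is included and the result may differ from the figure a UI reports.

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[], int], runs: int = 3) -> float:
    """Average tokens/s over several runs.

    `generate` is a hypothetical callable that performs one generation
    and returns the number of tokens it produced.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        total_tokens += generate()
        total_seconds += time.perf_counter() - start
    return total_tokens / total_seconds


def fake_generate() -> int:
    """Stand-in for a real generation call: pretends to emit 5 tokens."""
    time.sleep(0.05)
    return 5


tps = tokens_per_second(fake_generate, runs=3)
print(f"{tps:.1f} tokens/s")
```

Running the same prompt and token budget once with MTP enabled and once with it disabled, then comparing the two averages, gives a like-for-like comparison.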
Question
Is MTP currently expected to work with Qwen3.6-27B-MTP-GGUF under Vulkan and llama.cpp v2.20.1, or are there known limitations/issues affecting this model?