spec : fix compatibility with n-gram and add TODOs #13

Merged
am17an merged 7 commits into am17an:mtp-clean from ggml-org:gg/mtp-fix-ngram
May 15, 2026
ggerganov commented May 15, 2026

Sample commands

These are some sample commands to get started with MTP:

# MTP with draft size N (values for N: 2,3,...)
llama-server -hf [model-with-mtp] --spec-type draft-mtp --spec-draft-n-max 2

# add `--no-mmproj` to disable vision support if not needed (uses less memory)
llama-server ... --no-mmproj

# [ADVANCED]
# combine MTP + ngram-* (experimental, suitable for non-CUDA systems)
# use these combinations only if you know what you are doing 
llama-server -hf [model-with-mtp] \
  --spec-type draft-mtp --spec-draft-n-max 3 \
  --spec-type ngram-mod --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 48 --spec-ngram-mod-n-max 64

# (same as above, but shorter)
llama-server -hf [model-with-mtp] --spec-default --spec-type draft-mtp --spec-draft-n-max 2
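
When launching the server from a script, the commands above can be assembled programmatically. The following is a minimal sketch that uses only the flags shown in this comment; the helper name `mtp_server_args` and its parameters are hypothetical and not part of llama.cpp.

```python
import shlex


def mtp_server_args(model, draft_n_max=2, with_spec_default=False):
    """Build an argv list for llama-server from the flags shown above.

    `mtp_server_args` is a hypothetical helper for illustration only.
    """
    args = ["llama-server", "-hf", model]
    if with_spec_default:
        # Short form shown above for combining MTP with the ngram defaults
        args.append("--spec-default")
    args += ["--spec-type", "draft-mtp", "--spec-draft-n-max", str(draft_n_max)]
    return args


# Print a shell-quoted command line; pass the list to subprocess.run to launch it
print(shlex.join(mtp_server_args("[model-with-mtp]", draft_n_max=2)))
```

Building the command as a list avoids shell quoting issues when the model name contains special characters.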

Tip

MTP is compatible with Vision input.

Note

Prompt processing (PP) speed typically drops when MTP is enabled, mainly due to device-to-host (D2H) embedding transfers. This is something to be optimized in the future.

Note

Parallel decoding with MTP is supported, but not fully optimized yet.

Models

Quality check

The results from 4 runs of the AIME2026 eval (4x30 questions in total) with MTP enabled, using llama-eval, are within expectation and match the value reported by the Qwen team.

[image: AIME2026 eval results]

Full data: aime2026-qwen3.6-27b-mtp-q4_k-x4.json.html

am17an merged commit a957b77 into am17an:mtp-clean May 15, 2026
30 of 50 checks passed
ggerganov deleted the gg/mtp-fix-ngram branch May 15, 2026 12:16
am17an pushed a commit that referenced this pull request May 16, 2026
* metal : cleanup

* llama : fix faulty bitwise check in recurrent memory

* server : disable RS-based MTP in combination with other spec types

* spec : add TODOs

* cont : fix comment

* cont : update comment

* common : fix logic for ngram + mtp compat
am17an added a commit that referenced this pull request May 19, 2026
* spec: support MTP

* fix batch size

* rename files

* cont : simplify (#7)

* MTP: clean-up (#9)

* MTP: clean-up

* review: use llama_context_type instead of llama_graph_type

* review: remove llama_model_has_mtp

* review: fix convert issues

* convert: fix pycheck

* review: formatting

* use `mtp-` for identifying mtp models

* convert: fix mtp conversion

* mtp -> draft-mtp

* remove unused llama_arch

* add need_embd in speculative

* llama: allow partial seq_rm for GDN models for speculative decoding

Currently, speculative decoding has to restart from a checkpoint when
some draft tokens are not accepted, which wastes work by running the
target again. This PR adds the ability to roll back up to `draft_max`
tokens by storing the GDN intermediates.
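
The idea described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the llama.cpp implementation: `step` is a hypothetical stand-in for a recurrent (GDN) state update, and all function names are invented for illustration.

```python
def step(state, token):
    """Hypothetical recurrent update standing in for a GDN layer."""
    return (state * 31 + token) % 1_000_003


def run_with_intermediates(state, tokens):
    """Process the tokens, keeping the state after each one."""
    states = [state]  # states[i] is the state after i tokens
    for t in tokens:
        states.append(step(states[-1], t))
    return states


def rollback(states, n_accepted):
    """Pick the state after the accepted prefix, with no recomputation."""
    return states[n_accepted]


def restart_from_checkpoint(checkpoint, tokens, n_accepted):
    """Baseline: re-run the accepted tokens from the saved checkpoint."""
    state = checkpoint
    for t in tokens[:n_accepted]:
        state = step(state, t)
    return state


checkpoint = 7
draft = [11, 22, 33]  # draft_max = 3 in this toy example
states = run_with_intermediates(checkpoint, draft)
n_accepted = 1  # the target rejected the 2nd and 3rd draft tokens
assert rollback(states, n_accepted) == restart_from_checkpoint(checkpoint, draft, n_accepted)
```

Both paths reach the same state, but the rollback path trades extra memory (one stored state per draft token) for skipping the second pass over the accepted tokens.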

* fix pending state

* vulkan: add GDN partial rollback

* meta: extend check to axis 1

* metal: add GDN partial rollback

Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.

- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior
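
The slot scheme in the list above can be illustrated with a toy Python loop. This is only a sketch: the recurrent update is a hypothetical placeholder, and the mapping "state after token i goes to slot i % K" is an assumption, since the commit message only says that intermediate states go to "different slots".

```python
def gdn_kernel_sketch(slots, tokens):
    """Toy stand-in for the kernel's token loop over K snapshot slots.

    ASSUMPTION: the state after token i is written to slot i % K.
    The real slot mapping in the Metal kernel may differ.
    """
    K = len(slots)
    state = slots[0]  # input state is read from slot 0
    out = list(slots)
    for i, tok in enumerate(tokens):
        state = (state * 31 + tok) % 1_000_003  # hypothetical recurrent update
        out[i % K] = state  # snapshot the intermediate state
    return out


# K = 3: one snapshot per token
print(gdn_kernel_sketch([7, 0, 0], [11, 22, 33]))

# K = 1: every write lands in the single slot, so only the final state survives
print(gdn_kernel_sketch([7], [11, 22]))
```

Under this assumed mapping, K=1 reduces to the single-slot behavior, with the input state overwritten by the final state.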

Ref: ggml-org@8c05923

Assisted-by: llama.cpp:local pi

* delta_net_base: use ggml_pad instead of new_tensor

* review: add need_rs_seq

* review: rename part_bounded to n_rs

* review: deslop comments

* review: rename, add asserts

* server : adjust checkpoint logic (#11)

* server : adjust checkpoint logic

* cont : rm asserts

* server-context: fix early exit

* spec : fix compatibility with n-gram and add TODOs (#13)

* metal : cleanup

* llama : fix faulty bitwise check in recurrent memory

* server : disable RS-based MTP in combination with other spec types

* spec : add TODOs

* cont : fix comment

* cont : update comment

* common : fix logic for ngram + mtp compat

* llama-memory: enable checkpointing with partial rollback

* cont: add test-case for loading into a dirty ctx

* llama-memory-recurrent: clear rs_idx in clear

* download: fix mtp path

* llama-arch: fix enorm op

* docs: update docs

* conversion: fix type annotations

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>