spec : fix compatibility with n-gram and add TODOs by ggerganov · Pull Request #13 · am17an/llama.cpp

ggerganov · 2026-05-15T10:08:20Z

Sample commands

These are some sample commands to get started with MTP:

# MTP with draft size N (values for N: 2,3,...)
llama-server -hf [model-with-mtp] --spec-type draft-mtp --spec-draft-n-max 2

# add `--no-mmproj` to disable vision support if not needed (uses less memory)
llama-server ... --no-mmproj

# [ADVANCED]
# combine MTP + ngram-* (experimental, suitable for non-CUDA systems)
# use these combinations only if you know what you are doing 
llama-server -hf [model-with-mtp] \
  --spec-type draft-mtp --spec-draft-n-max 3 \
  --spec-type ngram-mod --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 48 --spec-ngram-mod-n-max 64

# (same as above, but shorter)
llama-server -hf [model-with-mtp] --spec-default --spec-type draft-mtp --spec-draft-n-max 2
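For scripted launches, the basic MTP invocation above can be assembled programmatically. The following is a minimal sketch, not part of the PR: the helper name `build_mtp_command` and its parameters are hypothetical, and it only uses flags that appear in the sample commands above.

```python
import shlex


def build_mtp_command(model, draft_n_max=2, no_mmproj=False):
    """Assemble the argv for a basic MTP run (hypothetical helper).

    Only flags shown in the sample commands are used; `model` is whatever
    would be passed to `-hf`.
    """
    cmd = [
        "llama-server", "-hf", model,
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", str(draft_n_max),
    ]
    if no_mmproj:
        # disables vision support, which uses less memory
        cmd.append("--no-mmproj")
    return cmd


if __name__ == "__main__":
    # shlex.join quotes arguments so the line can be pasted into a shell
    print(shlex.join(build_mtp_command("[model-with-mtp]", 3, no_mmproj=True)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with the model name.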

Tip

MTP is compatible with Vision input.

Note

Prompt processing (PP) speed typically drops when MTP is enabled, mainly due to Device-To-Host (D2H) embedding transfers. This is something to be optimized in the future.

Note

Parallel decoding with MTP is supported, but not fully optimized yet.

Models

Quality check

The results from 4 runs of the AIME2026 eval (4x30 questions in total) with MTP enabled, using llama-eval, are within expectations and match the value reported by the Qwen team.

Full data: aime2026-qwen3.6-27b-mtp-q4_k-x4.json.html

* metal : cleanup
* llama : fix faulty bitwise check in recurrent memory
* server : disable RS-based MTP in combination with other spec types
* spec : add TODOs
* cont : fix comment
* cont : update comment
* common : fix logic for ngram + mtp compat

* spec: support MTP
* fix batch size
* rename files
* cont : simplify (#7)
* MTP: clean-up (#9)
* MTP: clean-up
* review: use llama_context_type instead of llama_graph_type
* review: remove llama_model_has_mtp
* review: fix convert issues
* convert: fix pycheck
* review: formatting
* use `mtp-` for identifying mtp models
* convert: fix mtp conversion
* mtp -> draft-mtp
* remove unused llama_arch
* add need_embd in speculative
* llama: allow partial seq_rm for GDN models for speculative decoding

Currently speculative checkpoint needs to restart from a checkpoint after some draft tokens are not accepted, this leads to some wastage in running the target again. This PR adds the ability to rollback upto `draft_max` by storing the GDN intermediates.

* fix pending state
* vulkan: add GDN partial rollback
* meta: extend check to axis 1
* metal: add GDN partial rollback

Extend the gated delta net kernel to store intermediate states for partial rollback support on the Metal backend.

- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior

Ref: ggml-org@8c05923

Assisted-by: llama.cpp:local pi

* delta_net_base: use ggml_pad instead of new_tensor
* review: add need_rs_seq
* review: rename part_bounded to n_rs
* review: deslop comments
* review: rename, add asserts
* server : adjust checkpoint logic (#11)
* server : adjust checkpoint logic
* cont : rm asserts
* server-context: fix early exit
* spec : fix compatibility with n-gram and add TODOs (#13)
* metal : cleanup
* llama : fix faulty bitwise check in recurrent memory
* server : disable RS-based MTP in combination with other spec types
* spec : add TODOs
* cont : fix comment
* cont : update comment
* common : fix logic for ngram + mtp compat
* llama-memory: enable checkpointing with partial rollback
* cont: add test-case for loading into a dirty ctx
* llama-memory-recurrent: clear rs_idx in clear
* download: fix mtp path
* llama-arch: fix enorm op
* docs: update docs
* conversion: fix type annotations

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
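The partial rollback idea described in the commit message (keep intermediate recurrent states in K slots so rejected draft tokens can be undone without re-running the target from a checkpoint) can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual llama.cpp implementation: the `step` function is a stand-in for the gated delta net update, the `RecurrentState` class is hypothetical, and the slot layout (slot 0 holds the newest state, slot r holds the state with the last r tokens undone) is an assumption chosen to be consistent with "read input state from slot 0".

```python
def step(state, token):
    """Toy recurrent update standing in for the real GDN kernel (assumption)."""
    return 0.5 * state + token


class RecurrentState:
    """Hypothetical K-slot state store; slot 0 is always the current state."""

    def __init__(self, k):
        self.k = k                      # snapshot slot count (K)
        self.slots = [0.0] + [None] * (k - 1)

    def decode(self, tokens):
        """Run the token loop from slot 0, keeping the newest k states."""
        state = self.slots[0]
        history = [state]
        for t in tokens:
            state = step(state, t)
            history.append(state)
        recent = history[::-1][: self.k]            # newest first
        self.slots = recent + [None] * (self.k - len(recent))

    def rollback(self, n_undo):
        """Undo the last n_undo tokens by promoting an older snapshot."""
        if not 0 <= n_undo < self.k or self.slots[n_undo] is None:
            raise ValueError("no snapshot stored that far back")
        self.slots = self.slots[n_undo:] + [None] * n_undo


rs = RecurrentState(k=4)
rs.decode([1])            # prompt token
rs.decode([2, 3, 4])      # one accepted token followed by two draft tokens
rs.rollback(2)            # the two draft tokens were rejected

# Reference: recompute from scratch with a single slot (the K=1 case).
ref = RecurrentState(k=1)
ref.decode([1, 2])
assert rs.slots[0] == ref.slots[0]
```

With `k=1` the class degenerates to a single-slot state with no rollback beyond zero tokens, which mirrors the backward-compatible behavior mentioned for K=1.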

ggerganov added 4 commits May 15, 2026 12:23

metal : cleanup

a009e04

llama : fix faulty bitwise check in recurrent memory

5122abb

server : disable RS-based MTP in combination with other spec types

1e2a77a

spec : add TODOs

9317548

github-actions Bot added ggml examples server Apple Metal labels May 15, 2026

ggerganov added 3 commits May 15, 2026 13:10

cont : fix comment

1271f82

cont : update comment

baabc98

common : fix logic for ngram + mtp compat

5c11cae

am17an merged commit a957b77 into am17an:mtp-clean May 15, 2026
30 of 50 checks passed

ggerganov deleted the gg/mtp-fix-ngram branch May 15, 2026 12:16

am17an merged 7 commits into am17an:mtp-clean from ggml-org:gg/mtp-fix-ngram
