Gemma4 MTP by am17an · Pull Request #17 · am17an/llama.cpp

am17an · 2026-05-19T15:56:42Z

Works with both gemma-31B and gemma-26B but the MoE model is slower. I see a good speed up on my DGX spark (~2-2.5x speedup) on the dense model. The main problem is sharing the memory ctx between the two llama_contexts, so currently it's pretty hacky plus also the ubatch splitting is not super clean.

Replicated the AIME-26 results for Gemma-31B with -np 4

This commit attempts to clarify a code comment in graph_mtp regarding where the MTP layer is stored. The motivation for this is that it was not obvious to me what the original comment meant and hopefully this makes it clearer.

am17an · 2026-05-19T16:58:01Z

+    // of streams (one per active draft seq); q->ne[2] is not divisible by the full
+    // n_stream and the view collapses tokens. Slice k/v down to exactly the streams
+    // referenced by this ubatch. Requires those streams to form a contiguous range.
+    if (k->ne[3] > 1 && (uint32_t) k->ne[3] != ubatch.n_seqs_unq) {


@ggerganov this part

* update mtp related help * remove outdated experimental text

* opencl: add q4_k moe support * opencl: add q5_k moe support * opencl: add q6_k moe support * opencl: adjust format --------- Co-authored-by: Li He <lih@qti.qualcomm.com>

* snapdragon: update compiler flags to enable all CPU features * snapdragon: update readme to point to toolchain v0.6 * snapdragon: bump toolchain docker to v0.6

* metal : optimize pad * metal : optinmize cpy * cont : better row packing in threadgroup

…rg#23330) * refactor: `isMobile` as reactive value in `viewport` store * refactor: Use Svelte media query for the viewport store

* mtmd: fit_params now take into account mmproj * rename alloc_compute_meta to reserve_compute_meta * rm unused functions * add ggml_backend_dev_t support * add debug log

… completions payload (ggml-org#23406)

* app : introduce the llama unified executable Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Use serve for server Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Hide completion and bench, add help command Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Remove STATIC Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Use -impl targets instead of -lib Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Revert "Remove STATIC" This reverts commit cc44cac. --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>

…DIA GPUs (Hopper+) (ggml-org#22522) * Adds initial PDL setup. * Adds PDL barriers based on simple heuristic: place "sync" before first input pointer access, and "launch" after last write, e.g. to tensors like dst. * Further optimization pass of the first half of kernels * Optimized PDL barriers for the second batch of kernels * Further refinements after rebase. * Moves pdl logic to separate function, removes some whitespace * Strips post-hoc PDL logic * Adds stream capture PDL setup. Enrolls quantize_q8_1 to leverage pdl to overlap execution with previous kernels * Enrolls mul_mat_vec_q, rms_norm_f32 and k_bin_bcast (partly) into PDL * Enrolls mmvf, rope, set-rows and topk kernels for gpt-oss into PDL * Introduce ggml_cuda_kernel_launch, to abstract away cudaLaunchKernelEx, to enable hip/musa compatibility * Enrolls cpy_scalar_contiguous, k_get_rows_float and rms_norm_f32 * Enrolls flash_attn_combine_results * Fix: Drops needless and broken check of CUDA arch for PDL. PDL either works or is without effect. * Enrolls flash-attention kernels to pdl * Fix: inlines ggml_cuda_kernel_launch, and uses perfect forwarding for kernels args. This fixes PDL. * Perf: Enrolls k_bin_bcast variadic template invocation into PDL, via and template alias and template expansion * Enrolls all remaining kernels for qwen3-coder-next into PDL * Remove all PDL LC calls to create a baseline * Added LC according to internal guidance and tested kernel performance. * Enrols missing qwen3-5 kernels passively into PDL. * Kernel optimizations (LC signals) for qwen3.5 * Enrolls ssm-scan kernels into PDL * Adds GGML_CUDA_PDL command line option to toggle PDL. * Fix: Ada and lower compilation by guarding PDL calls correctly * Cleanup: Removes commented out GGML_CUDA_PDL_LC * Cleanup: Removes experimental comments * Adds 90-virtual to build script so that Hopper GPUs can leverage PDL. * Adds stricter checks to enable PDL, adds env-check to disable it, and removes now superfluous compile option to enable PDL. * Fix: Correct PDL en/disablement based on device-side arch check. Host side check is UB. Required moving from macros to inlined functions * Fix: default-disable PDL. Enable by setting GGML_CUDA_ENABLE_PDL=1 * Enable PDL by default for Hopper+ devices * Enrolls softcap_f32 and two flash_attn kernels into PDL. * Improves flash attn PDL barrier placement * Fix: Perf regression on ada; excludes ada and below from PDL launches * Improves some sync barrier placements * Drops superfluous constructor * Adds #endif guard comments * Reverts experimental change to top-k-moe.cu, which moved expensive allocations in front of the PDL barrier. It did not have a meaningful impact. * Exchanges GGML_CUDA_DISABLE_PDL with GGML_CUDA_PDL. IFF GGML_CUDA_PDL=0 PDL is disabled * Revert "Drops superfluous constructor". Adds const to remaining arguments This reverts commit 12b1d25. * Cleanup: Removes and fixes some comments and whitespace * Clarifies comment of sync-barrier position * Relocates and refactors PDL launch functions and accessories * Adds error checking to the regular kernel launch path * Drops "auto" in favor of "ggml_cuda_kernel_params" * Adds "const" to ggml_cuda_kernel_launch_params * [Whitespace] Adds final newline to common.cuh to make editorconfig CI job happy

* hmx-mm: update debug logging in hmx-mm * hmx-mm: update dequant logic to use HVX_vector_x2/4 * hmx-mm: remove non-pipelined version of the quantize matmul It seems that we don't reall need non-pipelined version * hmx-mm: use activation depth mode and update naming Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com> * hex-mm: minor hmx matmul naming updates * hmx-mm: remove unused vars * snapdragon: scripts bump default ubatch-size to 1K * hexagon: combine HMX and power and clock settings into a single set_power call * hmx-mm: remove leftover of the scale repl helper * hexagon: fix editconf error --------- Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>

…gml-org#23396)

* vulkan: optimize operations in the IM2COL shader * Add comments and improve the code formatting

…refactor (ggml-org#23345) * mtmd : deepseek-ocr fixes, improvements and refactoring - image processing changes to achieve full parity with Pillow (reference impl) - SAM mask casting only when flash-attn is on - SAM refactor (build_sam() extracted so deepseek-ocr-2 can reuse it) - llama-chat changes to fix server/WebUI issue (new media_markers_first()) - adapted test-chat-template and added test cases for deepseek-ocr - changed regression test for deepseek-ocr to use CER+chrF scores for ground-truth comparison; removed embedding-model - ty.toml ignore unresolved-import for tools/mtmd/tests/** * image-text reordering fix removed * refactor bool add_padding + pad_rounding enum into a single pad_style enum

…3386) ggml_backend_dev_by_name always appends a nullptr sentinel to the devices vector. Skipping nullptr entries prevents assertion failure in ggml_backend_dev_name. Assisted-by: llama.cpp:local pi

…d binary (ggml-org#23442)

…#23461) The destroy() function in server_context_impl only cleaned up the main model and context (via llama_init.reset()) but did not free the speculative decoder (spec), draft context (ctx_dft), or draft model (model_dft). For MTP (Multi-Token Prediction) models, ctx_dft holds GPU-allocated resources (KV cache, compute buffers) that are not freed when entering the sleeping state. On each sleep/resume cycle, new resources are allocated without the old ones being freed, leading to a VRAM leak that eventually crashes the server with out-of-memory errors. Fix by explicitly resetting spec, ctx_dft, and model_dft in destroy() before resetting llama_init, ensuring proper cleanup order to avoid use-after-free. ref: ggml-org#23395 Assisted-by: llama.cpp:local pi

ggerganov · 2026-05-21T08:47:44Z

@am17an Are these AIME results with default thinking, or did you set a reasoning budget?

am17an · 2026-05-21T08:55:00Z

Just the default, no budget

…3411) * metal : fix GGML_OP_SET kernel threads * tests : extend test_cpy to support different src/dst shapes Extend test_cpy to support different source and destination tensor shapes for CPY operations (reshaping), where the total number of elements must match. - Renamed ne -> ne_src, added ne_dst parameter (default: use src shape) - Added 50 new reshaping test cases covering 1D<->2D<->3D<->4D conversions - Tests exercise 1024 boundary, small shapes, and large dimensionality changes - Fixed dangling reference bug (storing & to temporary std::array) - Updated all existing test calls with permute/transpose args for compatibility Assisted-by: llama.cpp:local pi * metal : optimize concat kernel with row batching for small widths When ne0 < 256, batch multiple rows into a single threadgroup to improve occupancy. This avoids underutilizing the GPU when processing narrow tensors. - Dispatch nth = min(256, ne0) threads per group - Calculate nrptg (rows per threadgroup) to fill up to 256 threads - Update kernel index calculation to handle the row batching - Add boundary check for i1 >= ne1 Assisted-by: llama.cpp:local pi * tests : clean-up * tests : refactor CPY shape tests to use dimension permutations Replace 75 hardcoded test cases with a loop over permutations of {3, 5, 7, 32} (total elements: 3360). Each src permutation is tested against canonical sorted and reverse dst, skipping identical shapes. Covers F32, F16, and Q4_0 (when both src and dst ne0 == 32). Assisted-by: llama.cpp:local pi

Add n_prompt_tokens, n_prompt_tokens_processed, and n_prompt_tokens_cache to the /slots JSON response. These fields are already tracked internally but were not exposed, making it impossible for clients to monitor prompt evaluation progress during processing.

* tests : move save-load-state from examples to tests - Move examples/save-load-state/ to tests/test-save-load-state.cpp - Remove subdirectory reference from examples/CMakeLists.txt - Add test to tests/CMakeLists.txt as a model test - Remove CODEOWNERS entry for removed example directory Assisted-by: llama.cpp:local pi * cont : update ci

* vulkan: fuse snake activation (mul, sin, sqr, mul, add) Add snake.comp shader with F32 / F16 / BF16 pipelines and ggml_vk_snake_dispatch_fused. The matcher recognizes the naive 5 op decomposition emitted by audio decoders (BigVGAN, Vocos) for snake activation y = x + sin(a*x)^2 * inv_b and rewrites it to a single elementwise kernel. test_snake_fuse from the CUDA PR now also compares CPU naive vs Vulkan fused across F32 / F16 / BF16. * vulkan: address jeffbolznv review for fused snake activation Rename T / C to ne0 / ne1 in the shader and push constants to match the standard naming convention used across the Vulkan backend. Tighten ggml_vk_can_fuse_snake: require x and dst to be contiguous (the shader uses idx = i0 + i1 * ne0) and require a / inv_b to be tightly packed on the broadcast dim (the shader reads data_a[i1]). * vulkan: tighten snake fusion type checks for all operands (address jeffbolznv review) * vulkan: reject snake fusion when ne[2] or ne[3] > 1 (address jeffbolznv review) * vulkan: address 0cc4m review for fused snake activation snake.comp is renamed to follow the ggml DATA_A_* / A_TYPE convention. A_TYPE now applies to the activation tensor data_a instead of the broadcast multiplier, and the bindings become data_a (A_TYPE), data_b (float), data_c (float) and data_d (D_TYPE). A header at the top of the shader maps each buffer to its role in y = x + sin(b * x)^2 * c. On the C++ side, ggml_vk_can_fuse_snake reuses the existing snake_pattern constant instead of duplicating the op list, sin_node is extracted as a named local alongside the other chain nodes, and the broadcast operands a and inv_b are now required to be GGML_TYPE_F32 to match the hardcoded float bindings on data_b and data_c (the previous a->type == x->type would silently reject any future BF16 or F16 chain once the supports_op gate for SIN / SQR is lifted). ggml_vk_snake_dispatch_fused gets an explicit GGML_TYPE_F32 case and GGML_ABORT on default in place of the silent f32 fallback, and a stale comment about data_a[i1] / data_inv_b[i1] is refreshed to match the new binding names.

…default (ggml-org#23462) * cmake : remove STATIC from impl libraries, allow BUILD_SHARED_LIBS control Remove explicit STATIC from all -impl libraries (server, cli, completion, bench, batched-bench, fit-params, quantize, perplexity) so BUILD_SHARED_LIBS controls shared vs static linkage. Add WINDOWS_EXPORT_ALL_SYMBOLS ON for proper DLL export on Windows. Assisted-by: llama.cpp:local pi * cmake : enable LLAMA_BUILD_APP by default Assisted-by: llama.cpp:local pi * ci : disable app in build-cmake-pkg.yml

github-actions Bot added examples python server model labels May 19, 2026

danbev and others added 2 commits May 19, 2026 18:41

hexagon: enable support for NORM op (ggml-org#23319)

ac76808

am17an commented May 19, 2026

View reviewed changes

CISC and others added 21 commits May 19, 2026 21:16

convert : update mtp related help (ggml-org#23334)

b7393a4

* update mtp related help * remove outdated experimental text

common: fix --fit verbosity with --verbosity 4 (ggml-org#23282)

7256fce

common: fix --help for --verbosity (ggml-org#23278)

57cb35c

github: mention --log-file in issue templates (ggml-org#23277)

a807867

refactor: Chat Screen UI rendering (ggml-org#23333)

67ace02

hexagon: add MROPE and IMROPE support in HTP rope op (ggml-org#23317)

17d22a3

opencl: add MoE support for q4_k, q5_k, q6_k on Adreno (ggml-org#23303)

b28a2f3

* opencl: add q4_k moe support * opencl: add q5_k moe support * opencl: add q6_k moe support * opencl: adjust format --------- Co-authored-by: Li He <lih@qti.qualcomm.com>

ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps (ggml-org#23349)

b39a7bf

snapdragon: update toolchain to v0.6 (ggml-org#23369)

871b0b7

* snapdragon: update compiler flags to enable all CPU features * snapdragon: update readme to point to toolchain v0.6 * snapdragon: bump toolchain docker to v0.6

metal : optimize pad + cpy (ggml-org#23354)

57ebaf4

* metal : optimize pad * metal : optinmize cpy * cont : better row packing in threadgroup

fix: Div wrapper no pointer events on hidden (ggml-org#23390)

585080d

ui: Refactor isMobile as reactive value in viewport store (ggml-o…

5028447

…rg#23330) * refactor: `isMobile` as reactive value in `viewport` store * refactor: Use Svelte media query for the viewport store

docker : copy conversion files (ggml-org#23370)

7e50ef7

mtmd: fit_params now take into account mmproj (ggml-org#21489)

e2b129e

* mtmd: fit_params now take into account mmproj * rename alloc_compute_meta to reserve_compute_meta * rm unused functions * add ggml_backend_dev_t support * add debug log

refactor: Move text attachments up before the message content in chat…

e6b4acf

… completions payload (ggml-org#23406)

feat: Add WAV MIME type variants and improve audio format detection (g…

6ce9671

…gml-org#23396)

vulkan: optimize operations in the IM2COL shader (ggml-org#22685)

acd604f

* vulkan: optimize operations in the IM2COL shader * Add comments and improve the code formatting

am17an force-pushed the gemma4-mtp branch from cd2e5b2 to a03120c Compare May 20, 2026 16:28

common/speculative : fix nullptr crash in get_devices_str (ggml-org#2…

510b5c2

…3386) ggml_backend_dev_by_name always appends a nullptr sentinel to the devices vector. Skipping nullptr entries prevents assertion failure in ggml_backend_dev_name. Assisted-by: llama.cpp:local pi

ServeurpersoCom and others added 2 commits May 21, 2026 10:09

server: re-inject subcommand when router spawns children under unifie…

c902171

…d binary (ggml-org#23442)

ggerganov and others added 12 commits May 21, 2026 13:34

fix(flash-attn): replace f32 with kv_type and q_type (ggml-org#23372)

5306f4b

Update WebGPU support and add link to blog/demo (ggml-org#23483)

ee7c305

CUDA: fix PDL CC check for JIT compilation (ggml-org#23471)

4f0e43d

llama: Gemma 4 MTP

91cea27

fix multi-seq

c93c750

add assert that draft + shared kv should be on same device

3c4e521

add Q rot when cache is quantized

4b1d1ae

am17an force-pushed the gemma4-mtp branch from a03120c to 4b1d1ae Compare May 23, 2026 07:01

github-actions Bot added documentation Improvements or additions to documentation Nvidia GPU testing ggml Vulkan script Apple Metal devops OpenCL Hexagon WebGPU build server/ui labels May 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gemma4 MTP#17

Gemma4 MTP#17
am17an wants to merge 51 commits into
masterfrom
gemma4-mtp

am17an commented May 19, 2026 •

edited

Loading

Uh oh!

am17an May 19, 2026

Uh oh!

ggerganov commented May 21, 2026

Uh oh!

am17an commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Conversation

am17an commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

am17an May 19, 2026

Choose a reason for hiding this comment

Uh oh!

ggerganov commented May 21, 2026

Uh oh!

am17an commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

am17an commented May 19, 2026 •

edited

Loading