Rebase on mainline #12
Open
doctorjei wants to merge 385 commits into
* nix: support unified apple-sdk * Impl roll op for Metal * Revert "nix: support unified apple-sdk" This reverts commit abfa473360471532c547de8b202c780507924d4b. * update ops.md * update op docs
* ggml: add graph_reused * use versioning instead of reuse flag * increment version with atomic * use top bits for split numbering * add assert * move counter to ggml.c * set uid in split_graph only * fix windows * address further review comments * get next_uid rather than doing bit manipulation * rename + add comment about uid
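The versioning idea in this commit (an atomically incremented uid instead of a "reused" flag) can be illustrated with a small sketch. This is not the ggml code: `next_graph_uid`, `graph_stub`, and `backend_cache` are hypothetical names, and the sketch assumes only what the bullets state, namely that each graph gets a fresh uid from an atomic counter and that a consumer detects reuse by comparing uids.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: hands out process-wide unique, increasing ids.
// The value 0 is never returned, so it can mean "no graph seen yet".
inline uint64_t next_graph_uid() {
    static std::atomic<uint64_t> counter{0};
    return counter.fetch_add(1, std::memory_order_relaxed) + 1;
}

// Hypothetical stand-in for a compute graph that carries its uid.
struct graph_stub {
    uint64_t uid;
};

// Hypothetical consumer that remembers the uid of the last graph it prepared.
struct backend_cache {
    uint64_t last_uid;

    backend_cache() : last_uid(0) {}

    // Returns true when the graph differs from the last one seen,
    // i.e. when cached state derived from the graph must be rebuilt.
    bool needs_rebuild(const graph_stub & g) {
        if (g.uid != 0 && g.uid == last_uid) {
            return false; // same uid: the graph was reused unchanged
        }
        last_uid = g.uid;
        return true;
    }
};
```

Comparing a single integer avoids having to keep a boolean flag in sync between the producer and every consumer of the graph.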
* fix NemotronH vocab loading by using trust_remote_code for unsupported config patterns * fix NemotronH tokenizer loading by overriding set_vocab with trust_remote_code
* support nvfp4 tensors for Gemma4 * add wo_s to build_attn * add wo_s to build_attn * fix glm4
…ers (#21245) * model : refactor QKV into common build_qkv and create_tensor_qkv helpers * model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
… (#21980) * server: tests: fetch random media marker via /apply-template (#21962 fix) * server: allow pinning media marker via LLAMA_MEDIA_MARKER env var get_media_marker() checks LLAMA_MEDIA_MARKER at first call and uses it as-is if set, falling back to the random marker otherwise. Tests no longer need to fetch the marker dynamically via /apply-template: the fixture sets LLAMA_MEDIA_MARKER=<__media__> so the hardcoded prompts work as before. Address review feedback from ngxson * server: make get_media_marker() thread-safe via magic statics Use a C++11 static local with a lambda initializer instead of a global static with an empty-check. The runtime guarantees initialization exactly once without explicit locking. Address review feedback from ggerganov * nits * nits
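A minimal sketch of the "magic statics" pattern described above: a function-local static initialized by a lambda, which C++11 guarantees to run exactly once even under concurrent first calls. Only the function name `get_media_marker` and the `LLAMA_MEDIA_MARKER` variable come from the commit message; the format of the random fallback marker here is an assumption made for illustration, not the server's actual format.

```cpp
#include <cassert>
#include <cstdlib>
#include <random>
#include <string>

// Returns the media marker, computing it on first use.
// The static local is initialized exactly once; later calls reuse it.
const std::string & get_media_marker() {
    static const std::string marker = []() -> std::string {
        // If the environment variable is set, use its value as-is.
        if (const char * env = std::getenv("LLAMA_MEDIA_MARKER")) {
            return env;
        }
        // Otherwise fall back to a random marker.
        // ASSUMPTION: this format is made up for the sketch.
        std::random_device rd;
        return "<__media_" + std::to_string(rd()) + "__>";
    }();
    return marker;
}
```

Because the environment is read only during the first call, a test fixture has to set `LLAMA_MEDIA_MARKER` before the server touches the marker for the first time.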
* model: using single llm_build per arch * fix merge * nits
* optimize hmx_mat_mul functions by calculating row and column tiles upfront * refactor core_dot_chunk_fp16 to use size_t for tile counts and improve readability * wip * set scale outside of loop * wip * refactor core_mma_chunk_fp16 and mat_mul_qk_0_d16a32 to use size_t for tile counts * wip * wip * refactor transfer_output_chunk_fp16_to_fp32 to use size_t for dimensions * refactor core_dot_chunk_fp16 to use size_t for tile row stride calculation * wip * refactor hmx_mat_mul functions to use hvx_vec_splat_f16 for column scales initialization * refactor hmx_mat_mul_permuted_w16a32_batched to streamline scale setting and locking * refactor core_dot_chunk_fp16 to improve tile stride calculations for output * refactor hmx_mat_mul functions to use Q6_V_vsplat_R for column scales initialization * fix compiling error * wip * optimize row and column tile indexing in core_mma_chunk_fp16 function * wip * Revert "wip" This reverts commit cde679eff79c4a28dd2d89d32f710015e09592b6. * Add size limit check for HAP_mmap in htp_iface_mmap and drop_mmap functions * wip
…dreno (#21938) * opencl: refactor q8_0 gemm/gemv Adreno dispatch * opencl: refactor q8_0 set_tensor * opencl: fix whitespace
* model : Gemma4 model type detection * model : Gemma4 model type detection
* cmake : allow libcommon to be shared * cmake : rename libcommon to libllama-common * cont : set -fPIC for httplib * cont : export all symbols * cont : fix build_info exports * libs : add libllama-common-base * log : add common_log_get_verbosity_thold()
* server: respect the ignore eos flag * ci: add android arm64 build and release * patch * pin android-setup actions to v4 * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * lf in the suggestion --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* CUDA: use a ring-buffer for cuda graphs * bump limit to 128 * use LRU eviction * better naming * do periodic clean-up
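The LRU eviction mentioned above can be sketched with a generic bounded cache. This is an illustration of the technique only, not the CUDA backend's data structure: `lru_cache` and its members are hypothetical, the key stands in for whatever identifies a captured graph, and the capacity would be 128 if it followed the limit named in the commit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>

// Hypothetical bounded cache with least-recently-used eviction.
template <typename V>
class lru_cache {
public:
    explicit lru_cache(size_t capacity) : capacity_(capacity) {
        assert(capacity_ >= 1);
    }

    // Looks up a key and marks the entry as most recently used.
    // Returns nullptr when the key is not cached.
    V * find(uint64_t key) {
        auto it = index_.find(key);
        if (it == index_.end()) {
            return nullptr;
        }
        // Move the entry to the front; list iterators stay valid after splice.
        items_.splice(items_.begin(), items_, it->second);
        return &it->second->second;
    }

    // Inserts or updates an entry, evicting the least recently used one if full.
    void insert(uint64_t key, V value) {
        if (V * existing = find(key)) {
            *existing = std::move(value);
            return;
        }
        if (items_.size() >= capacity_) {
            index_.erase(items_.back().first);
            items_.pop_back();
        }
        items_.emplace_front(key, std::move(value));
        index_[key] = items_.begin();
    }

    size_t size() const { return items_.size(); }

private:
    typedef std::list<std::pair<uint64_t, V> > item_list;

    size_t capacity_;
    item_list items_; // front = most recently used, back = eviction candidate
    std::unordered_map<uint64_t, typename item_list::iterator> index_;
};
```

The list keeps recency order while the map gives constant-time lookup, so both a cache hit and an eviction avoid scanning all entries.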
…ng (#21052) * Update workflows to remove dependence on llvmpipe * Try setting Dawn_DIR * remove c++20 initializers * Move to proper guid * Try avoiding segfaults on vulkan backend process exit * Remove compiler warnings on parameter casting * Fix soft_max and update reg_tile accumulation to f32 for better precision * Refactor flash_attn a bit * remove c++20 initializers and format * Increase div precision for NVIDIA * revert div precision and comment out ggml-ci node for now * Formatting * Try debugging on a failing CI node * Revert "Try debugging on a failing CI node" This reverts commit 1971e33cba919915e12bcfd5828abfbd54ca942e.
* refactor bias tensor variable names * use create_tensor_qkv for jina-bert-v2
* rpc : refactor the RPC transport Move all transport related code into a separate file and use the socket_t interface to hide all transport implementation details. * fix win32 * better socket_t construction
* server : speculative decoding using checkpoints * server : fix draft check with checkpoints * server : rename spec vars * server : log levels * server : refactored spec logic to speculative.cpp * server : renamed spec checkpoints option * server : fix spec checkpoints, logging * speculative : checkpoints with draft model, logging * server : n_tokens_cur and create_checkpoint in draft * server : fix server_speculative_callback (slot.id) * spec : fix ngram-map/begin idx_last_check * spec : init ckpt (begin() wasn't called) * chore: update webui build output * server : restore sampler in spec checkpoint and clear mem * cont : avoid --spec-use-checkpoints argument * cont : remove server_prompt_checkpoint_with_size * spec : rename (leave_draft_state) * cont : clean-up * cont : do not ignore partial drafts even if they are short * cont : spec callback owned by session * cont : simplify * cont : avoid empty speculative session * cont : simplify * cont : simplify * cont : enable mtmd speculative decoding * cont : keep the spec sampler alive * cont : simplify * cont : fix nullptr deref + draft checkpoints * cont : remove common_speculative_accept_response * cont : remove callback * cont : simplify * cont : minor * cont : simplify * cont : fix accepted number --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
#21630 added the CMP0194 NEW policy to silence a CMake warning, but on Windows runners it caused CMake to prefer the MinGW toolchain for ASM and broke MSVC builds. Reverting only that policy block restores the previous working behavior. The CMake 4.1+ warning comes back, but that is cosmetic and does not break any platform. Reported-by: oobabooga Refs: #21630 Co-authored-by: texasich <texasich@users.noreply.github.com>
* convert : support sentence-transformer 5.4 config files * fix: embeddinggemma * fix: mapping Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix: pooling_mode Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* opencl: refactor adreno q4_0 gemm/gemv dispatch * opencl: refactor q4_0 gemm/gemv loading, use consistent names * opencl: use consistent name for adreno q8_0 gemm/gemv * opencl: use consistent names for adreno q4_0 gemm/gemv * opencl: simplify adreno q4_0 set_tensor * opencl: refactor q4_0 get_tensor
* hex-mm: process m-tail rows on HMX instead of HVX * hmx-mm: unroll and optimize padded activation loop --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
…--fit (#22688) * ggml : report estimated OpenCL memory for --fit Signed-off-by: Florian Reinle <f.reinle@otec.de> * ggml : estimated OpenCL memory backend integrated Signed-off-by: Florian Reinle <f.reinle@otec.de> --------- Signed-off-by: Florian Reinle <f.reinle@otec.de>
* add filter_tensors classmethod * remove language_model * fix parts validation
…22719) * refactor: Remove Google favicon utility * fix: MCP Server favicon * refactor: Cleanup * refactor: MCP Server Information * fix: Fix MCP Settings UI * refactor: Cleanup
* convert : ignore non-language tensors for Gemma4Model
This commit adds a check to make sure only text language tensors are
handled in filter_tensors.
The motivation is that currently when trying to convert a Gemma4 model
the following error occurs:
```console
(venv) $ ./convert-gemma.sh
INFO:hf-to-gguf:Loading model: gemma-4-E2B-it
INFO:hf-to-gguf:Model architecture: Gemma4ForConditionalGeneration
INFO:hf-to-gguf:gguf: indexing model part 'model.safetensors'
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight, torch.float32 --> F32, shape = {256}
Traceback (most recent call last):
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13752, in <module>
    main()
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13746, in main
    model_instance.write()
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 945, in write
    self.prepare_tensors()
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 805, in prepare_tensors
    for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)):
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7925, in modify_tensors
    yield from super().modify_tensors(data_torch, name, bid)
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7290, in modify_tensors
    yield from super().modify_tensors(data_torch, name, bid)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 579, in modify_tensors
    new_name = self.map_tensor_name(name)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 572, in map_tensor_name
    raise ValueError(f"Can not map tensor {name!r}")
ValueError: Can not map tensor 'model.embed_vision.embedding_projection.weight'
```
* add forgotten embed_vision and embed_audio
* improve
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* feat: migrate to PEP 621 and add uv support * fix: remove upper bound on protobuf * remove poetry.lock and uv.lock * fix/add torch dependency version and markers * fix dev-dependency deprecation warning * gguf-py : update python version requirement to 3.10 --------- Co-authored-by: David Huggins-Daines <dhd@dhd.ecolingui.ca> Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
…(#22101) * mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech) Conformer encoder with Shaw relative position encoding, QFormer projector, log-mel spectrogram with frame stacking. Encoder uses GLU gating, folded batch norm, and SSM depthwise conv. QFormer compresses encoder output via windowed cross-attention (window=15, queries=3) into the LLM embedding space. Audio preprocessing: reflect-padded STFT, 80-bin mel filterbank, dynamic range compression, 2x frame stacking (80->160 mel). GGUF converter handles batch norm folding at export time, fused K/V split, and Conv1d weight reshaping. Tested against HF transformers reference: token-for-token match on 30s/60s audio clips with greedy decoding. * mtmd: rename gs_ prefixed tensors to generic/architecture names * mtmd: use tensor_mapping.py for all granite_speech tensors * convert: fold GraniteSpeechTextModel into GraniteModel * mtmd: replace n_layer hack with explicit has_standard_layers flag * mtmd: replace hardcoded magic numbers with GGUF hparams for granite speech * mtmd: align KEY_A_ define spacing * convert: register GraniteModel for GraniteSpeechForConditionalGeneration * convert: fix ty type-check for GraniteSpeechMmprojModel registration * mtmd: align TN_ define spacing * mtmd: use generic layer loop for granite speech tensor loading * mtmd: merge qformer_proj_layer into clip_layer * mtmd: granite_speech remove redundant ggml_build_forward_expand on inputs * mtmd: granite_speech add comment explaining why build_attn is not used * mtmd: granite_speech hard-code eps in cpp, remove from GGUF metadata * gguf: add spacing between granite_speech tensor mapping blocks * mtmd: make generic audio layer_norm_eps read optional * mtmd: granite_speech keep encoder eps in GGUF, only hard-code projector eps * mtmd: align defines and struct fields in clip-impl.h and clip-model.h * mtmd: fix alignment and ordering issues across granite speech files * convert: granite_speech use filter_tensors instead of modify_tensors for skipping
* gguf-py : bump version to 0.19.0 * bump poetry --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* common: do not fit to unknown device memory Signed-off-by: Florian Reinle <f.reinle@otec.de> * common: preserve host fallback for non-GPU fit devices Signed-off-by: Florian Reinle <f.reinle@otec.de> * common: keep unknown GPU fit memory at zero Signed-off-by: Florian Reinle <f.reinle@otec.de> --------- Signed-off-by: Florian Reinle <f.reinle@otec.de>
* model: don't crash on unsupported architecture * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Support MiniCPM-V 4.6 in new branch Signed-off-by: tc-mb <tianchi_cai@icloud.com> * fix code bug Signed-off-by: tc-mb <tianchi_cai@icloud.com> * fix pre-commit Signed-off-by: tc-mb <tianchi_cai@icloud.com> * fix convert Signed-off-by: tc-mb <tianchi_cai@icloud.com> * rename clip_graph_minicpmv4_6 Signed-off-by: tc-mb <tianchi_cai@icloud.com> * use new TYPE_MINICPMV4_6 Signed-off-by: tc-mb <tianchi_cai@icloud.com> * use build_attn to allow flash attention support Signed-off-by: tc-mb <tianchi_cai@icloud.com> * no use legacy code, restored here. Signed-off-by: tc-mb <tianchi_cai@icloud.com> * use the existing tensors name Signed-off-by: tc-mb <tianchi_cai@icloud.com> * unused ctx->model.hparams.minicpmv_version Signed-off-by: tc-mb <tianchi_cai@icloud.com> * use n_merge for slice alignment Signed-off-by: tc-mb <tianchi_cai@icloud.com> * borrow wa_layer_indexes for vit_merger insertion point Signed-off-by: tc-mb <tianchi_cai@icloud.com> * fix code style Signed-off-by: tc-mb <tianchi_cai@icloud.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * use filter_tensors and add model.vision_tower Signed-off-by: tc-mb <tianchi_cai@icloud.com> * fix chkhsh Signed-off-by: tc-mb <tianchi_cai@icloud.com> * fix type check Signed-off-by: tc-mb <tianchi_cai@icloud.com> --------- Signed-off-by: tc-mb <tianchi_cai@icloud.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
The error:
./examples/sycl/test.sh: line 122: level_zero:${$GGML_SYCL_DEVICE}: bad substitution
was thrown whenever the user ran this command:
./examples/sycl/test.sh -mg 0
The fix is to remove the extra dollar sign inside the braces.
…#22773) * add fill-mode-forwards * generated diffs
* codeowners : add ZenDNN backend codeowner * codeowners : fix zendnn owners to use individual github handles
* webui: fix ?model= URL param race in router mode * chore: update webui build output
* add mimo-v2.5 support * mimo-v2.5: fix modify_tensors row split * mimo-v2.5: forgot `add_attn_value_scale` plumbing * mimo-v2.5: fix tp dequant to detect tp rows * mimo-v2.5: fix TP iteration to be descending * mimo-v2.5: fix comment * mimo-v2.5: retain fused qkv * mimo-v2.5: missed the attn_value scale during merge * mimo-v2.5: fused QKV needs contiguous for scaling attention value * mimo-v2.5: move `speech_embeddings.` to TextModel filter_tensors * Update src/llama-hparams.h Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/models/mimo2.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/models/mimo2.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/models/mimo2.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * mimo-v2.5: include MTP weights in gguf --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Port of TurboQuant (Zandieh et al., ICLR 2026) KV cache compression to HIP/ROCm, cherry-picked from TheTom/llama-cpp-turboquant onto clean mainline b8668. Original fork hangs on HIP due to backend drift; this clean port has zero regressions.

New GGML types: TURBO3_0 (3-bit, 4.9x), TURBO4_0 (4-bit, 3.8x), TURBO2_0 (2-bit, 6.4x), TQ3_1S, TQ4_1S

Components:
- GGML type definitions + type traits (ggml.h, ggml.c, ggml-common.h)
- CPU quantize/dequantize (ggml-turbo-quant.c, ggml-quants.c/h)
- CPU TURBO_WHT op (Walsh-Hadamard Transform)
- KV cache integration (llama-kv-cache.cpp/h, llama-memory.h)
- Graph WHT rotation (llama-graph.cpp)
- HIP/CUDA kernels: set-rows, convert, dequantize, turbo-wht, turbo-innerq, mmvq-tq, flash attention instances
- HIP compatibility layer (vendors/hip.h)
- llama-bench turbo type support

Validated on RX 7900 XTX (gfx1100), ROCm 6.4:

Perplexity (Qwen3.5-9B Q4_K, wikitext-2, 20 chunks):
- f16: PPL 7.056
- q8_0: PPL 7.040 (-0.2%)
- turbo4: PPL 7.121 (+0.9%)
- turbo3: PPL 7.129 (+1.0%)

Throughput (Qwen3.5-27B Q5_K_M, 16K context):
- f16: pp=395 t/s, tg=29.8 t/s
- turbo3: pp=394 t/s, tg=29.6 t/s (99.1%)

VRAM (27B Q5_K_M @ 80K context, 24GB GPU):
- f16: OOM (needs ~26 GiB)
- turbo3: runs at 314 t/s pp, 29.4 t/s tg (needs ~20 GiB)

Based on: TheTom/llama-cpp-turboquant (discussion #20969)
Paper: https://arxiv.org/abs/2504.19874
Standalone HIP benchmark: https://github.com/domvox/turboquant-hip
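For readers unfamiliar with the Walsh-Hadamard Transform named in the component list, here is a minimal sketch of the textbook in-place fast WHT with orthonormal scaling. It is not the TURBO_WHT op or any kernel from this port: the function name `fwht_inplace` is hypothetical, and block size, sign conventions, and scaling in the actual implementation may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Textbook in-place fast Walsh-Hadamard transform, O(n log n).
// The length must be a power of two. With the 1/sqrt(n) scaling used here
// the transform is orthonormal, so applying it twice restores the input.
void fwht_inplace(std::vector<float> & x) {
    const size_t n = x.size();
    assert(n != 0 && (n & (n - 1)) == 0);

    // Butterfly passes over blocks of width 2*h.
    for (size_t h = 1; h < n; h *= 2) {
        for (size_t i = 0; i < n; i += 2 * h) {
            for (size_t j = i; j < i + h; ++j) {
                const float a = x[j];
                const float b = x[j + h];
                x[j]     = a + b;
                x[j + h] = a - b;
            }
        }
    }

    // Normalize so that the transform preserves vector length.
    const float scale = 1.0f / std::sqrt(static_cast<float>(n));
    for (size_t i = 0; i < n; ++i) {
        x[i] *= scale;
    }
}
```

A unit impulse of length 4 is spread evenly across all four outputs, which illustrates why such a rotation is commonly applied before low-bit quantization: it spreads outliers across coordinates.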
…g (Linux)

First commit to add --hugepages option (CLI). This commit adds the structure for the flag, data structures, and function call signatures, but does not change functionality. Full descriptions included here for convenience...

---

Back model weight mappings with anonymous 2 MiB HugeTLB pages on Linux, activated by a new --hugepages CLI flag (env: LLAMA_ARG_HUGEPAGES).

Primary benefit is kernel vmemmap reclamation via HugeTLB Vmemmap Optimization (HVO), not TLB speedup. On a 128 GiB system fully backed with 2 MiB hugepages this frees ~1.75 GiB of struct page memory, turning tight-ceiling workloads from OOM into working.

Mechanism: llama_mmap allocates MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB|MAP_HUGE_2MB|MAP_POPULATE (zero-filled), then load_all_data populates the region per-tensor via file->read_raw before check_tensors and view allocation consume it. mprotect downgrades to PROT_READ after load. MAP_POPULATE is a race-safety guarantee (pool-short → clean ENOMEM at mmap time, not SIGBUS mid-load).

Measured on qwen3 19.4 GB / Strix Halo: 1.31x cold / 4.24x warm slowdown vs baseline mmap; vmemmap reclamation confirmed via VmRSS delta (3872 kB hugepages vs 19.30 GB baseline).
…ght loading (Linux)

This is the implementation commit that enables the feature itself. Full description:

---

Back model weight mappings with anonymous 2 MiB HugeTLB pages on Linux, activated by a new --hugepages CLI flag (env: LLAMA_ARG_HUGEPAGES).

Primary benefit is kernel vmemmap reclamation via HugeTLB Vmemmap Optimization (HVO), not TLB speedup. On a 128 GiB system fully backed with 2 MiB hugepages this frees ~1.75 GiB of struct page memory, turning tight-ceiling workloads from OOM into working.

Mechanism: llama_mmap allocates MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB|MAP_HUGE_2MB|MAP_POPULATE (zero-filled), then load_all_data populates the region per-tensor via file->read_raw before check_tensors and view allocation consume it. mprotect downgrades to PROT_READ after load. MAP_POPULATE is a race-safety guarantee (pool-short → clean ENOMEM at mmap time, not SIGBUS mid-load).

Measured on qwen3 19.4 GB / Strix Halo: 1.31x cold / 4.24x warm slowdown vs baseline mmap; vmemmap reclamation confirmed via VmRSS delta (3872 kB hugepages vs 19.30 GB baseline).
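The allocate, fill, then seal sequence described under "Mechanism" can be sketched as follows. This is a standalone illustration, not the llama_mmap code: `huge_region`, `map_huge_region`, `fill_and_seal`, and `unmap_huge_region` are hypothetical names, a `memcpy` stands in for the per-tensor file reads, and the fallback definitions of `MAP_HUGE_SHIFT` / `MAP_HUGE_2MB` are assumptions for toolchains whose headers do not provide them. On systems without a reserved hugepage pool the mapping is expected to fail with an error code rather than crash.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstring>

#if defined(__linux__)
#include <sys/mman.h>
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_2MB
#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT) // log2(2 MiB) = 21
#endif
#endif

static const size_t HUGE_PAGE_SIZE = 2u * 1024u * 1024u;

// HugeTLB mappings must be sized in whole huge pages.
inline size_t round_up_to_huge_page(size_t n) {
    return (n + HUGE_PAGE_SIZE - 1) / HUGE_PAGE_SIZE * HUGE_PAGE_SIZE;
}

// Hypothetical result type: addr is nullptr on failure and err holds errno.
struct huge_region {
    void * addr;
    size_t size;
    int    err;
};

// Tries to map an anonymous, pre-faulted region backed by 2 MiB huge pages.
huge_region map_huge_region(size_t n_bytes) {
    huge_region r;
    r.addr = nullptr;
    r.size = round_up_to_huge_page(n_bytes);
    r.err  = 0;
#if defined(__linux__)
    void * p = mmap(nullptr, r.size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB | MAP_POPULATE,
                    -1, 0);
    if (p == MAP_FAILED) {
        r.err = errno; // e.g. ENOMEM when the hugepage pool is too small
        return r;
    }
    r.addr = p;
#else
    r.err = ENOTSUP; // HugeTLB mappings are Linux-specific
#endif
    return r;
}

// Copies data into the region, then downgrades it to read-only.
bool fill_and_seal(huge_region & r, const void * src, size_t n) {
#if defined(__linux__)
    if (r.addr == nullptr || n > r.size) {
        return false;
    }
    std::memcpy(r.addr, src, n);
    return mprotect(r.addr, r.size, PROT_READ) == 0;
#else
    (void) r; (void) src; (void) n;
    return false;
#endif
}

// Releases the mapping if one exists.
void unmap_huge_region(huge_region & r) {
#if defined(__linux__)
    if (r.addr != nullptr) {
        munmap(r.addr, r.size);
    }
#endif
    r.addr = nullptr;
}
```

Requesting the pages eagerly means a shortage shows up as a failed `mmap` call that the caller can handle, for example by falling back to a regular file mapping.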
Implement ggml_backend_cuda_device_buffer_from_host_ptr using cudaHostRegister + cudaHostGetDevicePointer, enabling zero-copy import of host-allocated memory (including hugepage-backed regions) on unified-memory GPUs.

The capability is gated on GGML_USE_HIP && prop.integrated > 0, read directly from cudaGetDeviceProperties. The existing force-disable of info.devices[id].integrated (added for #15034 on NVIDIA Jetson) is left untouched; it affects a separate cuda_host buffer path and does not overlap with buffer_from_host_ptr.

Scope is limited to HIP because only Strix Halo / ROCm 7.2.0 has been validated. NVIDIA Jetson reports prop.integrated == 1 and may benefit, but this needs testing before extending. A TODO comment in the code notes the extension path.

Adds an owned flag to ggml_backend_cuda_buffer_context so the destructor can cudaHostUnregister externally-owned buffers instead of cudaFree.

Notes:
- Defines ggml_backend_cuda_imported_buffer_interface with NULL write ops (memset_tensor, set_tensor, cpy_tensor, clear); wire buffer_from_host_ptr to it. Enforces read-only semantics at the type level rather than relying on runtime fallthrough.
- Zero quantized-tensor padding via host-side memset during init_tensor for externally-owned buffers, since cudaMemset through the device alias is unsupported on GFX1151 / ROCm 7.2.0 (Mapped regions are effectively read-only from the GPU side). The host-side approach works for any caller that keeps the host buffer writable through init_tensor and removes the fragile GGUF-zero-padding assumption.
- Keep cudaHostRegisterReadOnly off: empirically a ~14% TG regression on GFX1151 / ROCm 7.2.0 vs Portable|Mapped alone (measured on Qwen3-30B-A3B Q4_K_M). Read-only access is enforced at the buffer-interface level instead.
Removed the "extern" specifier from GGML_API, as the mainline removed it. (It was causing a compile error.)
It looks like the "openvino" tests have never been run before, so they may be new. I am not sure whether the test errors (basically, quantization error thresholds) were introduced by the merge or were already present.
Added error ranges for TQ variants
I'm not sure if you'll find this useful, but I went ahead and rebased my main on the mainline because they've integrated some new elements. (Rebase and commit cleanup only; no other changes)