
UPSTREAM PR #22102: fix: GLM-DSA crash in llama-tokenize when using vocab_only #1360

Open
loci-dev wants to merge 1 commit into main from loci/pr-22102-fix-glm-dsa-tokenize-crash

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#22102

Running llama-tokenize with a GLM-DSA model crashes with a fatal error in llama-hparams.cpp. In vocab_only mode the full hparams are never loaded, so n_layer and the MLA params are left uninitialized. print_info nevertheless calls n_embd_head_k_mla(), which internally falls back to n_embd_head_k(0) and hits the abort because n_layer is 0. The fix guards the DeepSeek2/GLM-DSA/Mistral4 print block with a vocab_only check, consistent with how other non-vocab hparams are already handled in print_info.

Fixes #22026
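
The failure mode and the guard described above can be sketched in isolation. This is a minimal illustration, not the actual llama.cpp code: `toy_hparams`, `print_info_unguarded`, and `print_info_guarded` are hypothetical names, the field layout is assumed, and the sketch throws `std::out_of_range` where the real code aborts.

```cpp
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the model hyperparameters (not the real llama.cpp struct).
struct toy_hparams {
    bool     vocab_only = false;
    uint32_t n_layer    = 0;                 // stays 0 when only the vocab is loaded
    uint32_t n_embd_head_k_mla_impl = 0;     // stays 0 when only the vocab is loaded
    std::vector<uint32_t> head_k_per_layer;  // one entry per layer

    // Per-layer accessor: rejects any index >= n_layer.
    // Assumption: the real code aborts here; this sketch throws instead.
    uint32_t n_embd_head_k(uint32_t il) const {
        if (il >= n_layer) {
            throw std::out_of_range("layer index out of range");
        }
        return head_k_per_layer[il];
    }

    // Falls back to layer 0 when the MLA value was never set.
    uint32_t n_embd_head_k_mla() const {
        return n_embd_head_k_mla_impl != 0 ? n_embd_head_k_mla_impl : n_embd_head_k(0);
    }
};

// Before the fix: the MLA value is queried unconditionally.
std::string print_info_unguarded(const toy_hparams & hp) {
    std::ostringstream out;
    out << "n_embd_head_k_mla = " << hp.n_embd_head_k_mla() << "\n";
    return out.str();
}

// After the fix: model-specific fields are printed only when the full hparams were loaded.
std::string print_info_guarded(const toy_hparams & hp) {
    std::ostringstream out;
    out << "vocab_only = " << (hp.vocab_only ? 1 : 0) << "\n";
    if (!hp.vocab_only) {
        out << "n_embd_head_k_mla = " << hp.n_embd_head_k_mla() << "\n";
    }
    return out.str();
}
```

With `vocab_only` set and `n_layer` left at 0, the unguarded variant fails on the layer-0 fallback, while the guarded variant skips the block and prints only the fields that are valid in that mode.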

@loci-review

loci-review Bot commented Apr 19, 2026

Overview

Performance Impact: Negligible (0.024% power consumption increase)

Single commit (b625dd9) fixes GLM-DSA crash in print_info when vocab_only is set. Modified 1 file (src/llama-model.cpp).

Function Counts: 46,871 total | 13 modified (0.03%) | 0 new | 0 removed | 46,858 unchanged

Power Consumption by Binary:

  • build.bin.libllama.so: +0.024% (264,506 → 264,570 nJ)
  • build.bin.libmtmd.so: -0.000%
  • All other binaries (llama-bench, llama-cvector-generator, llama-gemma3-cli, llama-gguf-split, llama-llava-cli, llama-minicpmv-cli, llama-quantize, llama-qwen2vl-cli, llama-tokenize, llama-tts, libggml.so, libggml-cpu.so, libggml-base.so): 0% change

Function Analysis

All 13 modified functions are C++ STL template instantiations or initialization utilities—zero inference hot-path functions affected. Changes stem from compiler optimizations and security hardening, not source code modifications.

Most Significant:

  • std::unique_ptr::operator= (qwen3moe): -99.14% response time (-88,039 ns) — compiler dead code elimination removed unreachable exception paths
  • std::unique_ptr::operator= (gemma4_iswa): +9,961% response time (+76,223 ns) — measurement context change capturing graph construction, not regression (throughput improved -1.64%)
  • llama_quant_compute_types: +0.33% (+108 ns) — stack canary protection added
  • unicode_cpts_from_utf8: -0.13% response time (-6 ns), +4.92% throughput (+7 ns) — security hardening with optimization in called function

Other analyzed functions showed compiler optimization variance with negligible absolute impact (<100 ns).

Additional Findings

Inference Performance: Unaffected. Matrix operations (70-90% of inference time), attention mechanisms, KV cache, and all GPU backends (CUDA, Metal, HIP, Vulkan) show 0% change. Security enhancements (stack canary protection) add ~230 ns cumulative overhead across initialization functions—acceptable trade-off for buffer overflow protection. Net compiler optimization effect is positive (+87.8 μs) due to unique_ptr qwen3moe improvement.

💬 Questions? Tag @loci-dev
