UPSTREAM PR #22102: fix: GLM-DSA crash in llama-tokenize when using vocab_only by loci-dev · Pull Request #1360 · auroralabs-loci/llama.cpp

loci-dev · 2026-04-19T03:11:01Z

Note

Source pull request: ggml-org/llama.cpp#22102

When running llama-tokenize with GLM-DSA models, the process crashes with a fatal error in llama-hparams.cpp. This happens because vocab_only mode skips the full hparams loading, leaving n_layer and the MLA params uninitialized, but print_info still calls n_embd_head_k_mla(), which internally falls back to n_embd_head_k(0) and hits the abort when n_layer is 0. Fixed by guarding the DeepSeek2/GLM-DSA/Mistral4 print block with a vocab_only check, consistent with how other non-vocab hparams are already handled in print_info. Fixes #22026
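The failure mode and the guard described above can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the actual llama.cpp code: hparams_sketch, print_info_sketch and the field names ending in _v are hypothetical, and the out-of-range access throws an exception here, where the description says the real code aborts.

```cpp
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the real hparams struct; only the fields needed
// to illustrate the failure mode are modelled.
struct hparams_sketch {
    bool     vocab_only          = false;
    uint32_t n_layer             = 0; // stays 0 when only the vocab is loaded
    uint32_t n_embd_head_k_v     = 0; // hypothetical per-head K size
    uint32_t n_embd_head_k_mla_v = 0; // 0 means "not loaded, use the fallback"

    // Per-layer accessor. The sketch throws on an out-of-range layer so the
    // failure is observable; the PR description says the real code aborts.
    uint32_t n_embd_head_k(uint32_t il) const {
        if (il >= n_layer) {
            throw std::out_of_range("layer index out of range");
        }
        return n_embd_head_k_v;
    }

    // MLA accessor: falls back to layer 0 when no MLA value was loaded.
    uint32_t n_embd_head_k_mla() const {
        return n_embd_head_k_mla_v != 0 ? n_embd_head_k_mla_v
                                        : n_embd_head_k(0);
    }
};

// Guarded print: the MLA line is emitted only when the full hparams were
// loaded, so vocab_only runs never reach the per-layer fallback.
std::string print_info_sketch(const hparams_sketch & hp) {
    std::ostringstream out;
    out << "vocab_only = " << (hp.vocab_only ? 1 : 0) << "\n";
    if (!hp.vocab_only) {
        out << "n_embd_head_k_mla = " << hp.n_embd_head_k_mla() << "\n";
    }
    return out.str();
}
```

In this model, calling n_embd_head_k_mla() directly on a vocab_only instance fails because n_layer is 0, while print_info_sketch skips the call and prints only the vocab_only line.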

loci-review · 2026-04-19T04:02:27Z

Overview

Performance Impact: Negligible (0.024% power consumption increase)

Single commit (b625dd9) fixes GLM-DSA crash in print_info when vocab_only is set. Modified 1 file (src/llama-model.cpp).

Function Counts: 46,871 total | 13 modified (0.03%) | 0 new | 0 removed | 46,858 unchanged

Power Consumption by Binary:

build.bin.libllama.so: +0.024% (264,506 → 264,570 nJ)
build.bin.libmtmd.so: -0.000%
All other binaries (llama-bench, llama-cvector-generator, llama-gemma3-cli, llama-gguf-split, llama-llava-cli, llama-minicpmv-cli, llama-quantize, llama-qwen2vl-cli, llama-tokenize, llama-tts, libggml.so, libggml-cpu.so, libggml-base.so): 0% change
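The percentages in this overview can be cross-checked from the raw figures. The helpers below are a minimal sketch; percent_change and percent_share are illustrative names and not part of the review tooling.

```cpp
// Relative change from `before` to `after`, expressed in percent.
double percent_change(double before, double after) {
    return (after - before) / before * 100.0;
}

// Share of `part` within `total`, expressed in percent.
double percent_share(double part, double total) {
    return part / total * 100.0;
}
```

With the reported values, percent_change(264506, 264570) comes to about 0.0242, matching the +0.024% listed for libllama.so, and percent_share(13, 46871) comes to about 0.028, which the report rounds to 0.03%.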

Function Analysis

All 13 modified functions are C++ STL template instantiations or initialization utilities—zero inference hot-path functions affected. Changes stem from compiler optimizations and security hardening, not source code modifications.

Most Significant:

std::unique_ptr::operator= (qwen3moe): -99.14% response time (-88,039 ns) — compiler dead code elimination removed unreachable exception paths
std::unique_ptr::operator= (gemma4_iswa): +9,961% response time (+76,223 ns) — measurement context change capturing graph construction, not regression (throughput improved -1.64%)
llama_quant_compute_types: +0.33% (+108 ns) — stack canary protection added
unicode_cpts_from_utf8: -0.13% response time (-6 ns), +4.92% throughput (+7 ns) — security hardening with optimization in called function

Other analyzed functions showed compiler optimization variance with negligible absolute impact (<100 ns).

Additional Findings

Inference Performance: Unaffected. Matrix operations (70-90% of inference time), attention mechanisms, KV cache, and all GPU backends (CUDA, Metal, HIP, Vulkan) show 0% change. Security enhancements (stack canary protection) add ~230 ns cumulative overhead across initialization functions, an acceptable trade-off for buffer overflow protection. The net compiler optimization effect is positive (roughly 87.8 μs saved), driven by the unique_ptr qwen3moe improvement.


llama: fix crash in print_info for GLM-DSA when vocab_only is set

b625dd9

loci-dev temporarily deployed to PROD__AL_DEMO April 19, 2026 03:11 — with GitHub Actions Inactive

loci-dev force-pushed the main branch from 7638ab4 to f1b46d5 Compare April 20, 2026 02:19

loci-dev wants to merge 1 commit into main from loci/pr-22102-fix-glm-dsa-tokenize-crash
