UPSTREAM PR #22102: fix: GLM-DSA crash in llama-tokenize when using vocab_only #1360
loci-dev wants to merge 1 commit
Overview
Performance Impact: Negligible (0.024% power consumption increase)
Single commit (b625dd9) fixes the GLM-DSA crash.
Function Counts: 46,871 total | 13 modified (0.03%) | 0 new | 0 removed | 46,858 unchanged
Power Consumption by Binary:
Function Analysis
All 13 modified functions are C++ STL template instantiations or initialization utilities; no inference hot-path functions are affected. The changes stem from compiler optimizations and security hardening, not from source code modifications.
Most Significant:
Other analyzed functions showed compiler optimization variance with negligible absolute impact (<100 ns).
Additional Findings
Inference Performance: Unaffected. Matrix operations (70-90% of inference time), attention mechanisms, KV cache, and all GPU backends (CUDA, Metal, HIP, Vulkan) show 0% change. Security enhancements (stack canary protection) add ~230 ns cumulative overhead across initialization functions, an acceptable trade-off for buffer overflow protection. Net compiler optimization effect is positive (+87.8 μs) due to unique_ptr qwen3moe improvement.
Note
Source pull request: ggml-org/llama.cpp#22102
When running llama-tokenize with GLM-DSA models, the process crashes with a fatal error in llama-hparams.cpp. This happens because vocab_only mode skips the full hparams loading, leaving n_layer and the MLA params uninitialized. print_info nevertheless calls n_embd_head_k_mla(), which internally falls back to n_embd_head_k(0) and hits the abort when n_layer is 0.

Fixed by guarding the DeepSeek2/GLM-DSA/Mistral4 print block, consistent with how other non-vocab hparams are already handled in print_info.

Fixes #22026
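The failure mode and the guard described above can be illustrated with a small standalone sketch. This is not the code from the PR: the struct `hparams_sketch`, its fields, the function `print_info_sketch`, and the exact guard condition (`!hp.vocab_only`) are all assumptions made for illustration, and the sketch throws an exception where the real code aborts so that the behaviour can be checked in a test.

```cpp
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the real hparams type; it carries only the
// fields needed to show the problem. Names and layout are assumptions.
struct hparams_sketch {
    bool     vocab_only             = false;
    uint32_t n_layer                = 0; // stays 0 when only the vocab is loaded
    uint32_t n_embd_head_k_full     = 0;
    uint32_t n_embd_head_k_mla_impl = 0; // 0 is treated as "not set" in this sketch

    // Per-layer accessor: rejects any layer index outside [0, n_layer).
    // The real code aborts here; the sketch throws so the case is testable.
    uint32_t n_embd_head_k(uint32_t il) const {
        if (il >= n_layer) {
            throw std::out_of_range("layer index out of range");
        }
        return n_embd_head_k_full;
    }

    // MLA accessor: falls back to layer 0 when no MLA value was loaded.
    uint32_t n_embd_head_k_mla() const {
        return n_embd_head_k_mla_impl != 0 ? n_embd_head_k_mla_impl
                                           : n_embd_head_k(0);
    }
};

// Sketch of a guarded info printer: the MLA line is skipped in vocab-only
// mode, so the layer-0 fallback is never reached with n_layer == 0.
std::string print_info_sketch(const hparams_sketch & hp) {
    std::ostringstream out;
    out << "n_layer = " << hp.n_layer << "\n";
    if (!hp.vocab_only) { // assumed guard condition, for illustration only
        out << "n_embd_head_k_mla = " << hp.n_embd_head_k_mla() << "\n";
    }
    return out.str();
}
```

With `vocab_only` set and `n_layer` left at 0, `print_info_sketch` prints only the layer count, whereas calling `n_embd_head_k_mla()` directly on the same object still fails. That is the situation the unguarded print block ran into.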