fix(tokenizer): prefer lm_head weight as vocab_size source #962
Open
contrapuntal wants to merge 1 commit into jundot:main from
Conversation
Several mlx-vlm ModelConfig dataclasses (glm4v, glm4v_moe, gemma3)
hard-code a top-level vocab_size default that mismatches the inner
language model's vocab when config.json omits the top-level key.
Example: GLM-4.6V has text_config.vocab_size=151552 but
ModelConfig.vocab_size=257152 as a dataclass default. Code that sizes
logits-aligned buffers (e.g. xgrammar bitmasks) from the top-level
value therefore produced a shape mismatch against the real 151552-wide logits.
Resolution order becomes:
1. lm_head.weight.shape[0] — authoritative; the exact vocabulary
the model emits logits over.
2. text_config.vocab_size — inner LM vocab on VLM composite configs.
3. config.vocab_size / args.vocab_size — top-level fallback.
What this fixes
Users running grammar-constrained output (the [grammar] extra) on GLM-4.6V — and on several other VLM families — currently hit a shape-mismatch crash on the first constrained-decode step, with no clear error about why. The xgrammar bitmask is allocated for the wrong vocabulary, then doesn't fit the model's actual logits tensor.

The root cause is omlx/utils/tokenizer.py::resolve_vocab_size() returning the wrong number, which then propagates to every consumer that sizes logits-aligned buffers from it. Two such consumers exist today: omlx/api/grammar.py:46 (xgrammar bitmask) and omlx/scheduler.py:1459 (scheduler bookkeeping).
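For intuition, a stripped-down illustration of the failure mode (not project code; the numbers are the GLM-4.6V values quoted above, and the buffer is simplified to a plain boolean mask rather than xgrammar's packed bitmask):

```python
import numpy as np

config_vocab_size = 257152   # glm4v ModelConfig dataclass default (top-level value)
true_vocab_size = 151552     # text_config.vocab_size, i.e. the width of the real logits

# Buffer sized from the top-level config value...
mask = np.zeros(config_vocab_size, dtype=bool)

# ...applied against logits the model actually emits.
logits = np.zeros(true_vocab_size, dtype=np.float32)
logits[mask]   # IndexError: mask length 257152 does not match logits length 151552
```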
Affected models

Three mlx-vlm ModelConfig dataclasses ship a hard-coded top-level vocab_size default that doesn't match the inner LM's actual vocab when config.json omits the top-level key (as it typically does for these models):

| Model | ModelConfig.vocab_size default | text_config.vocab_size |
| --- | --- | --- |
| glm4v (GLM-4.6V) | 257152 | 151552 |
| glm4v_moe | 257152 | per config.json; e.g. 151552 for GLM-4.6V-Flash |
| gemma3 | 257152 | 262208 |

resolve_vocab_size() previously read the top-level value first, so it returned the wrong vocab on any of the above.
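The pattern that produces the wrong number looks roughly like the following (a simplified, hypothetical stand-in for the mlx-vlm dataclasses, not their actual code):

```python
from dataclasses import dataclass, field


@dataclass
class TextConfig:
    vocab_size: int = 151552          # inner language model's real vocab


@dataclass
class ModelConfig:
    text_config: TextConfig = field(default_factory=TextConfig)
    vocab_size: int = 257152          # hard-coded top-level default

# config.json for these models typically omits the top-level key, so the
# dataclass default wins and silently disagrees with the inner LM's vocab:
cfg = ModelConfig()
assert cfg.vocab_size != cfg.text_config.vocab_size   # 257152 vs 151552
```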
Concrete public model that reproduces today: mlx-community/GLM-4.6V-Flash-bf16 with [grammar] enabled — the bitmask allocation reads 257152, the logits tensor is 151552 wide, and the constrained-decode step crashes on the shape mismatch.

Fix
Re-order the resolution to prefer authoritative sources:
1. lm_head.weight.shape[0] — authoritative; this is the exact vocabulary the model emits logits over, regardless of whatever the config dataclass says. Probed under _language_model.lm_head (VLM adapter), language_model.lm_head (raw mlx-vlm), and lm_head (raw mlx-lm), so the lookup works for every wrapping shape we ship.
2. text_config.vocab_size — inner LM vocab on VLM composite configs. Correct when lm_head is unavailable (rare).
3. model.config.vocab_size / model.args.vocab_size — original top-level fallback. Still consulted last for compatibility with pure-LLM configs that don't have a text_config.
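A condensed sketch of that order (attribute names follow the list above; the function signature and helper structure are assumptions, not the literal omlx code):

```python
from typing import Any, Optional


def _lm_head_vocab(model: Any) -> Optional[int]:
    """Probe lm_head.weight.shape[0] under the three wrapping shapes listed above."""
    for path in ("_language_model.lm_head", "language_model.lm_head", "lm_head"):
        obj = model
        for attr in path.split("."):
            obj = getattr(obj, attr, None)
            if obj is None:
                break
        weight = getattr(obj, "weight", None)
        if weight is None:
            continue
        try:
            return int(weight.shape[0])
        except (TypeError, IndexError, AttributeError):
            continue  # defensive: malformed or non-subscriptable shape
    return None


def resolve_vocab_size(model: Any, config: Any) -> Optional[int]:
    # 1. Authoritative: the weight the logits are projected with.
    vocab = _lm_head_vocab(model)
    if vocab:
        return vocab
    # 2. Inner LM vocab on VLM composite configs (object or dict form).
    text_config = getattr(config, "text_config", None)
    if isinstance(text_config, dict):
        vocab = text_config.get("vocab_size")
    else:
        vocab = getattr(text_config, "vocab_size", None)
    if vocab:
        return vocab
    # 3. Original top-level fallback.
    return getattr(config, "vocab_size", None) or getattr(
        getattr(model, "args", None), "vocab_size", None
    )
```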
lm_head.weight.shape[0]andconfig.vocab_sizeagree. The change only affects models where the two disagree, which is exactly the broken case.Why not fix this in mlx-vlm
The dataclass defaults in the glm4v, glm4v_moe, and gemma3 ModelConfig are arguably wrong (they should be None or read from text_config), and an upstream fix would obsolete most of this PR. But:

- Those defaults are observable to code that constructs a ModelConfig without a config.json, so changing them is a behavior question, not just a bug fix — out of scope here.
- lm_head.weight.shape[0] is still the more authoritative source — config can drift from weights, weights can't drift from themselves. The new resolution order also defends against runtime config-weight drift on any future model where the two could disagree, not just the specific dataclasses currently broken.

Verification
Added TestResolveVocabSize to tests/test_utils_tokenizer.py covering the new resolution order, the three lm_head wrapping shapes, fallback to text_config (object and dict forms), fallback to top-level config.vocab_size / args.vocab_size, the model = None case, and the malformed-shape defensive path. 13 cases total; all pass.
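A flavor of what those cases look like (illustrative pytest-style sketch; the import path is from the description but the call signature is an assumption, and the real TestResolveVocabSize cases are more thorough):

```python
from types import SimpleNamespace

from omlx.utils.tokenizer import resolve_vocab_size  # assumed call signature below


class _FakeWeight:
    def __init__(self, rows):
        self.shape = (rows, 4096)


def test_lm_head_wins_over_config():
    # lm_head says 151552; the top-level config value says 257152.
    model = SimpleNamespace(lm_head=SimpleNamespace(weight=_FakeWeight(151552)))
    config = SimpleNamespace(vocab_size=257152)
    assert resolve_vocab_size(model, config) == 151552


def test_text_config_fallback_when_no_lm_head():
    model = SimpleNamespace()
    config = SimpleNamespace(text_config=SimpleNamespace(vocab_size=151552),
                             vocab_size=257152)
    assert resolve_vocab_size(model, config) == 151552
```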
End-to-end:

- GLM-4.6V-Flash with [grammar]: structured output now produces 151552-wide bitmasks matching the logits; previously raised a shape mismatch on the first constrained step.
- A pure-LLM model: resolve_vocab_size returns the same int as before; lm_head.weight.shape[0] matches config.vocab_size.
- A model that hard-codes vocab_size: int = 32000 in its ModelConfig: same int as before.

Risk
The new lm_head probe is gated by getattr chains and a defensive shape extraction (try/except around shape[0]), so it returns None (falling back to the old config path) on any model that doesn't expose lm_head in any of the three wrapping shapes, or whose weight.shape is not subscriptable. No code that previously succeeded should now fail.
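In terms of the sketch above, that guarantee looks like this (illustrative only; the stand-in objects are hypothetical, not project code):

```python
from types import SimpleNamespace

# A model exposing no lm_head under any wrapping shape, plus a plain LLM config:
bare_model = SimpleNamespace(args=SimpleNamespace(vocab_size=32000))
plain_config = SimpleNamespace(vocab_size=32000)

# The probe returns None internally, so resolution falls through to the
# old config path and yields exactly what the previous implementation did.
assert resolve_vocab_size(bare_model, plain_config) == 32000
```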
- Tied embeddings: lm_head.weight is embed_tokens.weight and still maps accurately to the logits vocab dimension.
- Padded vocabularies: when a model pads its vocab to a multiple of 64 (or similar) for tensor-core efficiency, lm_head.weight.shape[0] returns the padded size — which is exactly what logits-aligned buffers (e.g. xgrammar bitmasks) need to allocate to avoid shape mismatches against the actual logits tensor. The new resolution order is strictly safer than the old config-first order in this case too.