Add JANG model loader integration#212
Conversation
Add JANG model loader integration
|
Validation update:
|
|
Validation update:
|
|
Final validation update:
|
|
Performance/streaming update:
|
2f48ce6 to
0ee615b
Compare
# Conflicts: # vllm_mlx/routes/chat.py
ea128df to
9b0bb10
Compare
|
Hi @samuelfaj — thanks for the work. Applying our new SOP §0 necessity gate (see docs/development/pr_merge_sop.md) I need a demand signal before merging. Holding for clarification, not closing yet. Reasoning:
To unlock merge, I need one or more of:
For now please rebase on top of latest Apologies for the friction — the necessity gate is new this week and I'm working through the backlog. Your #204 (Qwen tool-call fix) is being reviewed now since it has clear user value. |
|
Thanks for putting this together. Two requests before review: (1) Please split this into independent PRs. The diff is +4007 LOC across 27 files but the title scopes it to the JANG loader. The JANG-loader part is a coherent change on its own:
The TUI ( (2) Verify the JANG import path. The PR imports
Happy to review the loader-only PR once it's split out — that part looks reasonable on first read. |
raullenchai
left a comment
There was a problem hiding this comment.
Thanks for putting the time in on this — JANG/JANGTQ is a real quantization scheme and jang (Jinho Jang's adaptive mixed-precision package) is a legitimate active project on PyPI. But I can't merge this PR in its current shape, and I'd like to explain why so we can find a path that works.
Scope drift: only ~15% of this PR is actually about JANG
The PR title and description are "Add JANG model loader integration." Mapping the 27 changed files against that scope:
Actually JANG-related (~600 LOC):
pyproject.toml— adds thejangextra ✓tests/test_jangtq_loader.py— JANG loader tests ✓vllm_mlx/utils/tokenizer.py— DSV4 JANGTQ tokenizer workaround ✓vllm_mlx/engine/batched.py(partial, the JANG-detection branch) — partial ✓vllm_mlx/cli.py(partial, JANG flag plumbing) — partial ✓
Not JANG-related — 5+ separate features bundled in (~3,400 LOC):
vllm_mlx/tui.py(+736 LOC, new file) — a full-screen termios TUI for monitoringrapid-mlx serve. Real product question, not a loader change.vllm_mlx/request_metrics.py(+201, new) +vllm_mlx/middleware/metrics.py(+247, new) — an entirely new in-process metrics system. Overlaps with the telemetry worker we already shipped in v0.6.81 (telemetry.rapidmlx.com).vllm_mlx/api/tool_calling.py(+248/-31) +vllm_mlx/tool_parsers/deepseek_tool_parser.py(+80) +vllm_mlx/tool_parsers/qwen3coder_tool_parser.py(+62/-9) — major tool-calling rewrites that conflict with the openai-harmony StreamableParser path landed in PR #515 (v0.6.75).vllm_mlx/routes/chat.py(+374/-4) +vllm_mlx/service/postprocessor.py(+176/-1) +tests/test_postprocessor.py(+219) — new_looks_like_deferred_tool_useheuristic plus a postprocessor change. Maybe relates to the tool-calling rewrite above, but unrelated to JANG.
Each of these is independently worth discussing on its merits. Bundling them all into one PR titled "Add JANG model loader integration" means:
- Reviewing any one of them requires reviewing all of them
- Merging the JANG bits requires us to accept (or carefully detangle) all the rest
- The "do we want a TUI?" / "do we want a second metrics system?" product questions get hidden under the loader integration label
State of this branch
- Branch is
dirty(merge conflicts) — main has moved substantially since 2026-05-05 (we're now at v0.7.3, with rewrites to the postprocessor, tool parsers, and chat route from PRs #408, #515, #555, #558). GitHub reportsmergeable_state: dirtyon this PR. - No driving issue exists — I searched
JANG/JANGTQacross all open + closed issues; nothing. So the only signal that we should support this loader path is this PR itself. - Stale for 5 weeks — your last push was 2026-06-07 (helpful, thanks), but my best estimate is that a rebase on top of v0.7.3 main would require non-trivial reconciliation work given how much of the surface area has moved.
Step 0 — Necessity
Putting the scope question aside: does our product roadmap actually want JANG/JANGTQ support right now?
Honest answer: not enough signal. The supported quant tiers in vllm_mlx/aliases.json today are 4bit, 6bit, 8bit, mxfp4, ud, dwq, mxfp4-q8. Adding mxtq / jangtq is plausible — Jinho's work on adaptive precision is interesting — but I'd want to see:
- At least one user request (issue, Discord, anything) for JANGTQ inference support, not initiated by the contributor
- Bench evidence that JANGTQ on our path is meaningfully better than the existing
mxfp4/4bittiers on Apple Silicon (since we already have those landed) - A specific JANGTQ checkpoint that's popular enough on HF to justify the loader-detection branch
Without those, accepting this loader is committing to maintaining an external-dependency integration path forever for a quantization scheme that may stay niche.
What I'd ask instead
If you want to land JANG support, the path I can support is:
-
Open an issue ("Support JANG/JANGTQ model loading") with a specific HF checkpoint URL, a brief explainer of why JANGTQ matters vs. our existing quant tiers, and ideally a bench number. If 1-2 people +1 the issue, that's the signal we need.
-
If the issue gets traction, open a NEW focused PR containing only:
pyproject.tomlextratests/test_jangtq_loader.pyvllm_mlx/utils/tokenizer.py- The JANG-detection branch in
engine/batched.py(extracted from this PR, isolated) - The CLI flag in
vllm_mlx/cli.py(extracted, isolated)
Probably ~500-700 LOC across 5 files. Easy to review, easy to bench, easy to land.
-
Move the TUI, the metrics system, and the tool-calling work into separate PRs with their own product justification. The TUI in particular looks genuinely useful — but it's a separate decision from JANG.
Given the above, my recommendation is to close this PR (no maintainer-rejection stigma — just "wrong shape") and follow the path above. I realize that's frustrating after 5 weeks of work, and I'd rather pay you the courtesy of saying "this won't merge as-is" now than leave you waiting longer.
What's good
- The detection ordering ("check
jang_config.jsonbefore vendored arch fallback") is the right shape for plugin-style quantization integration. - The DSV4 tokenizer/AutoConfig patching context is well-documented in the PR description — that's a real edge case that we'd have hit too.
- Validation evidence in the PR description (replaced 129 routed TQ modules, etc.) suggests you actually got this working end-to-end, which is more than most external loader-integration PRs deliver.
- The tests in
tests/test_jangtq_loader.py(+651 LOC) are substantive. jangpackage is legitimate, actively maintained, real PyPI metadata — supply-chain shape is clean for the JANG-only subset.
Summary
- Step 0 (does this solve a real product problem): ⚠ Unclear — no driving issue, contributor-initiated only. Need at least one independent user request before committing to ongoing maintenance.
- Supply-chain audit (JANG-only subset): ✅ Clean —
jangpackage is legitimate. - Supply-chain audit (full PR): N/A — too much unrelated surface to audit meaningfully.
- Action: I'd like to close this PR and have you re-open a JANG-only PR (~5 files) after a driving issue gets +1s, with the TUI / metrics / tool-calling work split into separate PRs of their own. Happy to keep this branch alive in your fork as a reference — just don't think it should be the merge candidate.
Genuine thanks for the contribution — closing in this state doesn't reflect on the quality of any individual piece. The shape is the problem, not the work.
Summary
jang_config.jsonbefore the vendored architecture fallback.jang_tools.load_jangtq.load_jangtq_modeland standard JANG models throughjang_tools.loader.load_jang_model.rapid-mlx[jang]dependency extra and regression tests for JANGTQ, JANG v2, and normal DeepSeek V4 fallback behavior.jang-toolsdoes not fall through Transformers AutoConfig for the vendoreddeepseek_v4architecture.Root cause
DeepSeek V4 JANGTQ bundles declare
weight_format: mxtqand store routed experts astq_packed/tq_normstensors. The existing loader treated them like normal DeepSeek V4 MLX weights, somlx_lm.load_modelrejected thousands of unexpected JANGTQ parameters. During live validation,jang-toolsalso hit a DSV4 tokenizer/EOS expansion path that calls Transformers AutoConfig; the wrapper now patches that call for DSV4 JANGTQ to loadtokenizer.jsondirectly.Validation
uv run --extra dev --extra jang python -m pytest tests/test_jangtq_loader.py tests/test_deepseek_v4_vendored.py -quv run --extra dev ruff check pyproject.toml vllm_mlx/utils/tokenizer.py tests/test_jangtq_loader.pyuv run --extra jang python - <<'PY' ... import jang_tools ... PYDeepSeek-V4-Flash-JANGTQdetected asweight_format=mxtq,profile=JANGTQ2.