Add JANG model loader integration by samuelfaj · Pull Request #212 · raullenchai/Rapid-MLX

samuelfaj · 2026-05-05T01:09:34Z

Summary

Detect local or Hugging Face models with jang_config.json before the vendored architecture fallback.
Route JANGTQ/MXTQ models through jang_tools.load_jangtq.load_jangtq_model and standard JANG models through jang_tools.loader.load_jang_model.
Add optional rapid-mlx[jang] dependency extra and regression tests for JANGTQ, JANG v2, and normal DeepSeek V4 fallback behavior.
Patch DeepSeek V4 JANGTQ tokenizer loading so jang-tools does not fall through Transformers AutoConfig for the vendored deepseek_v4 architecture.

Root cause

DeepSeek V4 JANGTQ bundles declare weight_format: mxtq and store routed experts as tq_packed/tq_norms tensors. The existing loader treated them like normal DeepSeek V4 MLX weights, so mlx_lm.load_model rejected thousands of unexpected JANGTQ parameters. During live validation, jang-tools also hit a DSV4 tokenizer/EOS expansion path that calls Transformers AutoConfig; the wrapper now patches that call for DSV4 JANGTQ to load tokenizer.json directly.

Validation

uv run --extra dev --extra jang python -m pytest tests/test_jangtq_loader.py tests/test_deepseek_v4_vendored.py -q
uv run --extra dev ruff check pyproject.toml vllm_mlx/utils/tokenizer.py tests/test_jangtq_loader.py
uv run --extra jang python - <<'PY' ... import jang_tools ... PY
Local model detection: DeepSeek-V4-Flash-JANGTQ detected as weight_format=mxtq, profile=JANGTQ2.
Live serve validation reached DSV4 streaming hydrate, replaced 129 routed TQ modules, loaded 85 regular shards, patched 43 SwitchGLU instances, then exposed a tokenizer path bug that this branch patches.

… new-main

Add JANG model loader integration

samuelfaj · 2026-05-05T01:24:07Z

Validation update:

Full JANGTQ serve startup completed locally for .
Hydration replaced 129 DSV4 routed TQ modules, loaded 85 regular shards, patched 43 SwitchGLU instances, completed warmup, and served on port 8011.
OpenAI-compatible request returned HTTP 200 with , , , .
Additional compatibility fixes landed in the branch for DSV4 tokenizer metadata and MLX scalar RoPE offsets under rapid-mlx batching.

samuelfaj · 2026-05-05T01:24:16Z

Validation update:

Full JANGTQ serve startup completed locally for /Users/samuelfajreldines/dev/models/DeepSeek-V4-Flash-JANGTQ.
Hydration replaced 129 DSV4 routed TQ modules, loaded 85 regular shards, patched 43 SwitchGLU instances, completed warmup, and served on port 8011.
OpenAI-compatible /v1/chat/completions request returned HTTP 200 with model=local, prompt_tokens=9, completion_tokens=8, total_tokens=17.
Additional compatibility fixes landed in the branch for DSV4 tokenizer metadata and MLX scalar RoPE offsets under rapid-mlx batching.

samuelfaj · 2026-05-05T02:20:05Z

Final validation update:

Fixed quality issue by routing DSV4 JANGTQ through direct mlx_lm.generate on the model-owning MLX worker instead of the continuous batching generator path, which produced corrupted/repetitive tokens for this runtime.
Server validation command completed on port 8013 with /Users/samuelfajreldines/dev/models/DeepSeek-V4-Flash-JANGTQ.
/v1/chat/completions simple math request returned HTTP 200 with content exactly 4, prompt_tokens=17, completion_tokens=1, total_tokens=18.
/v1/chat/completions exact-ok request returned HTTP 200 with content exactly ok, prompt_tokens=9, completion_tokens=1, total_tokens=10.
Regression tests: uv run --extra dev --extra jang python -m pytest tests/test_jangtq_loader.py tests/test_deepseek_v4_vendored.py -q passed, 12 tests.
Ruff passed for changed files.

samuelfaj · 2026-05-05T03:05:28Z

Performance/streaming update:

The DeepSeek V4 JANGTQ direct fallback now uses mlx_lm.stream_generate for streaming requests, so tokens are delivered as they are produced instead of waiting for full completion.
Non-streaming requests keep the safe direct mlx_lm.generate path.
Added an explicit TODO in the direct fallback explaining the future real batching fix: compare BatchGenerator logits/output against mlx_lm.generate, then fix cache offset handling, prompt-cache merge/extract, and RoPE position state until batching is bit-consistent with the direct path.
Live streaming validation returned SSE chunks with content exactly ok and final usage prompt_tokens=9, completion_tokens=2, total_tokens=11.
Focused tests passed: 17 tests.
Ruff passed.

# Conflicts: # vllm_mlx/routes/chat.py

raullenchai · 2026-05-09T15:42:32Z

Hi @samuelfaj — thanks for the work. Applying our new SOP §0 necessity gate (see docs/development/pr_merge_sop.md) I need a demand signal before merging.

Holding for clarification, not closing yet.

Reasoning:

This adds 4007 lines for JANG/JANGTQ model support including a new [jang] extras dependency. That's significant scope.
I searched our issues for "JANG" — zero hits. No one has filed a model-support request for JANG/JANGTQ.
This PR is also stacked on Fix Qwen tool-call OpenAI translation #204, Add serve TUI monitor #205, Improve Hermes tool-call recovery #206 — tests/test_cli_tui_ready.py, tests/test_chat_tool_retry.py, etc. show up here. With Add serve TUI monitor #205 now closed, this will need a rebase to drop the TUI bits.

To unlock merge, I need one or more of:

User demand: a GitHub issue from a user (you or someone else) saying "I want to serve JANG model X with rapid-mlx and it doesn't work". Even one is enough.
JANG popularity signal: pointer to a HuggingFace model page using JANGTQ/MXTQ format with non-trivial download counts, or a community discussion (Reddit/Discord/X) showing people are trying to run JANG locally.
Scope split: separate the JANG-specific changes (vllm_mlx/jang_tools/*, tests/test_jangtq_loader.py, jang detection in loader, [jang] extras) from the unrelated infra changes (anthropic auth, completions, health, request_metrics, etc.). The current diff makes it impossible to review JANG support on its own merits.

For now please rebase on top of latest main (which now has #260, #262, #258 merged) and drop the parts that came from #205/#212-stack-overlap. After that I can give the JANG-specific surface the focused review it deserves.

Apologies for the friction — the necessity gate is new this week and I'm working through the backlog. Your #204 (Qwen tool-call fix) is being reviewed now since it has clear user value.

raullenchai · 2026-06-07T22:47:53Z

Thanks for putting this together. Two requests before review:

(1) Please split this into independent PRs. The diff is +4007 LOC across 27 files but the title scopes it to the JANG loader. The JANG-loader part is a coherent change on its own:

pyproject.toml (the [jang] extra)
vllm_mlx/utils/tokenizer.py (DSV4 JANGTQ tokenizer patch)
tests/test_jangtq_loader.py
whichever loader-routing code path detects jang_config.json before the vendored-arch fallback

The TUI (vllm_mlx/tui.py +736), metrics middleware (vllm_mlx/middleware/metrics.py +247, vllm_mlx/request_metrics.py +201), chat-route refactor (vllm_mlx/routes/chat.py +374), postprocessor changes (vllm_mlx/service/postprocessor.py +176), and batched-engine changes (vllm_mlx/engine/batched.py +224) are each their own scope and should be reviewed separately — they're unrelated to JANG and bundling them makes the diff impossible to review responsibly.

(2) Verify the JANG import path. The PR imports jang_tools.loader.load_jang_model and jang_tools.load_jangtq.load_jangtq_model, but the package published on PyPI is named jang, not jang-tools (https://pypi.org/project/jang/ — jang-tools returns 404). Either the published name has changed since you tested, or the imports here won't resolve on a clean install. Please:

Confirm the actual import path on a fresh venv (uv venv && uv pip install jang && python -c "import jang_tools" vs import jang).
Pin the exact version in the [jang] extra (jang-tools>=X.Y or jang>=X.Y) — this is a single-maintainer dependency with custom Metal kernels, so an unpinned floor is risky.
Add a one-line note in the PR description acknowledging the JANGQ-AI ecosystem is a small Apple-Silicon community (no academic backing, single primary maintainer at jangq.ai) so reviewers understand the supply-chain shape.

Happy to review the loader-only PR once it's split out — that part looks reasonable on first read.

raullenchai

Thanks for putting the time in on this — JANG/JANGTQ is a real quantization scheme and jang (Jinho Jang's adaptive mixed-precision package) is a legitimate active project on PyPI. But I can't merge this PR in its current shape, and I'd like to explain why so we can find a path that works.

Scope drift: only ~15% of this PR is actually about JANG

The PR title and description are "Add JANG model loader integration." Mapping the 27 changed files against that scope:

Actually JANG-related (~600 LOC):

pyproject.toml — adds the jang extra ✓
tests/test_jangtq_loader.py — JANG loader tests ✓
vllm_mlx/utils/tokenizer.py — DSV4 JANGTQ tokenizer workaround ✓
vllm_mlx/engine/batched.py (partial, the JANG-detection branch) — partial ✓
vllm_mlx/cli.py (partial, JANG flag plumbing) — partial ✓

Not JANG-related — 5+ separate features bundled in (~3,400 LOC):

vllm_mlx/tui.py (+736 LOC, new file) — a full-screen termios TUI for monitoring rapid-mlx serve. Real product question, not a loader change.
vllm_mlx/request_metrics.py (+201, new) + vllm_mlx/middleware/metrics.py (+247, new) — an entirely new in-process metrics system. Overlaps with the telemetry worker we already shipped in v0.6.81 (telemetry.rapidmlx.com).
vllm_mlx/api/tool_calling.py (+248/-31) + vllm_mlx/tool_parsers/deepseek_tool_parser.py (+80) + vllm_mlx/tool_parsers/qwen3coder_tool_parser.py (+62/-9) — major tool-calling rewrites that conflict with the openai-harmony StreamableParser path landed in PR #515 (v0.6.75).
vllm_mlx/routes/chat.py (+374/-4) + vllm_mlx/service/postprocessor.py (+176/-1) + tests/test_postprocessor.py (+219) — new _looks_like_deferred_tool_use heuristic plus a postprocessor change. Maybe relates to the tool-calling rewrite above, but unrelated to JANG.

Each of these is independently worth discussing on its merits. Bundling them all into one PR titled "Add JANG model loader integration" means:

Reviewing any one of them requires reviewing all of them
Merging the JANG bits requires us to accept (or carefully detangle) all the rest
The "do we want a TUI?" / "do we want a second metrics system?" product questions get hidden under the loader integration label

State of this branch

Branch is dirty (merge conflicts) — main has moved substantially since 2026-05-05 (we're now at v0.7.3, with rewrites to the postprocessor, tool parsers, and chat route from PRs #408, #515, #555, #558). GitHub reports mergeable_state: dirty on this PR.
No driving issue exists — I searched JANG / JANGTQ across all open + closed issues; nothing. So the only signal that we should support this loader path is this PR itself.
Stale for 5 weeks — your last push was 2026-06-07 (helpful, thanks), but my best estimate is that a rebase on top of v0.7.3 main would require non-trivial reconciliation work given how much of the surface area has moved.

Step 0 — Necessity

Putting the scope question aside: does our product roadmap actually want JANG/JANGTQ support right now?

Honest answer: not enough signal. The supported quant tiers in vllm_mlx/aliases.json today are 4bit, 6bit, 8bit, mxfp4, ud, dwq, mxfp4-q8. Adding mxtq / jangtq is plausible — Jinho's work on adaptive precision is interesting — but I'd want to see:

At least one user request (issue, Discord, anything) for JANGTQ inference support, not initiated by the contributor
Bench evidence that JANGTQ on our path is meaningfully better than the existing mxfp4 / 4bit tiers on Apple Silicon (since we already have those landed)
A specific JANGTQ checkpoint that's popular enough on HF to justify the loader-detection branch

Without those, accepting this loader is committing to maintaining an external-dependency integration path forever for a quantization scheme that may stay niche.

What I'd ask instead

If you want to land JANG support, the path I can support is:

Open an issue ("Support JANG/JANGTQ model loading") with a specific HF checkpoint URL, a brief explainer of why JANGTQ matters vs. our existing quant tiers, and ideally a bench number. If 1-2 people +1 the issue, that's the signal we need.
If the issue gets traction, open a NEW focused PR containing only:
- pyproject.toml extra
- tests/test_jangtq_loader.py
- vllm_mlx/utils/tokenizer.py
- The JANG-detection branch in engine/batched.py (extracted from this PR, isolated)
- The CLI flag in vllm_mlx/cli.py (extracted, isolated)
Probably ~500-700 LOC across 5 files. Easy to review, easy to bench, easy to land.
Move the TUI, the metrics system, and the tool-calling work into separate PRs with their own product justification. The TUI in particular looks genuinely useful — but it's a separate decision from JANG.

Given the above, my recommendation is to close this PR (no maintainer-rejection stigma — just "wrong shape") and follow the path above. I realize that's frustrating after 5 weeks of work, and I'd rather pay you the courtesy of saying "this won't merge as-is" now than leave you waiting longer.

What's good

The detection ordering ("check jang_config.json before vendored arch fallback") is the right shape for plugin-style quantization integration.
The DSV4 tokenizer/AutoConfig patching context is well-documented in the PR description — that's a real edge case that we'd have hit too.
Validation evidence in the PR description (replaced 129 routed TQ modules, etc.) suggests you actually got this working end-to-end, which is more than most external loader-integration PRs deliver.
The tests in tests/test_jangtq_loader.py (+651 LOC) are substantive.
jang package is legitimate, actively maintained, real PyPI metadata — supply-chain shape is clean for the JANG-only subset.

Summary

Step 0 (does this solve a real product problem): ⚠ Unclear — no driving issue, contributor-initiated only. Need at least one independent user request before committing to ongoing maintenance.
Supply-chain audit (JANG-only subset): ✅ Clean — jang package is legitimate.
Supply-chain audit (full PR): N/A — too much unrelated surface to audit meaningfully.
Action: I'd like to close this PR and have you re-open a JANG-only PR (~5 files) after a driving issue gets +1s, with the TUI / metrics / tool-calling work split into separate PRs of their own. Happy to keep this branch alive in your fork as a reference — just don't think it should be the merge candidate.

Genuine thanks for the contribution — closing in this state doesn't reflect on the quality of any individual piece. The shape is the problem, not the work.

samuelfaj and others added 18 commits May 4, 2026 15:53

Fix Qwen tool call OpenAI translation

7b13ea4

Preserve tool schemas after streamed content

5c9b4e8

Coerce generic tool arguments from schema

5594261

Handle additional OpenCode tool call formats

7fa174d

Preserve code brackets near partial tool markers

ff6f247

Fix PR check failures

0b64dcf

Add serve TUI monitor

8b42dc6

Fix TUI PR CI failures

4d5a3b7

Add TUI request throughput metrics

b2b98b2

Enhance serve TUI request metrics

7f3a1ee

Improve Hermes tool-call recovery

a1a188e

Merge remote-tracking branch 'origin/add-serve-tui' into new-main

3841801

Merge remote-tracking branch 'origin/hermes-pr204-tool-recovery' into…

bbc6136

… new-main

Add JANG model loader integration

bfeb2f2

Merge pull request #1 from samuelfaj/add-jangtq-loader

907d343

Add JANG model loader integration

Patch DeepSeek V4 JANGTQ tokenizer loading

4ce7046

Apply JANG tokenizer metadata

1746f84

Patch JANGTQ RoPE batching offset

7ac0c59

Use direct generation for DeepSeek V4 JANGTQ

197243b

samuelfaj added 2 commits May 4, 2026 23:51

Wait for server readiness before TUI

1ad7852

Stream direct JANGTQ generation

9fd2f5a

Track direct JANGTQ prefill progress

0ee615b

samuelfaj force-pushed the add-jangtq-loader-v2 branch from 2f48ce6 to 0ee615b Compare May 5, 2026 03:31

samuelfaj added 3 commits May 5, 2026 00:42

Cap default direct JANG generation

eebf7dd

Sanitize direct JANG tool prompts

63eabbb

Merge remote-tracking branch 'upstream/main'

05c1f30

# Conflicts: # vllm_mlx/routes/chat.py

samuelfaj marked this pull request as draft May 5, 2026 14:23

Restore direct JANG tool execution

ae6a2af

samuelfaj marked this pull request as ready for review May 5, 2026 15:48

Improve direct JANG tool artifact fallback

9b0bb10

samuelfaj force-pushed the add-jangtq-loader-v2 branch from ea128df to 9b0bb10 Compare May 5, 2026 15:58

raullenchai requested changes Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add JANG model loader integration#212

Add JANG model loader integration#212
samuelfaj wants to merge 27 commits into
raullenchai:mainfrom
samuelfaj:add-jangtq-loader-v2

samuelfaj commented May 5, 2026

Uh oh!

samuelfaj commented May 5, 2026

Uh oh!

samuelfaj commented May 5, 2026

Uh oh!

samuelfaj commented May 5, 2026

Uh oh!

samuelfaj commented May 5, 2026

Uh oh!

raullenchai commented May 9, 2026

Uh oh!

raullenchai commented Jun 7, 2026

Uh oh!

raullenchai left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

samuelfaj commented May 5, 2026

Summary

Root cause

Validation

Uh oh!

samuelfaj commented May 5, 2026

Uh oh!

samuelfaj commented May 5, 2026

Uh oh!

samuelfaj commented May 5, 2026

Uh oh!

samuelfaj commented May 5, 2026

Uh oh!

raullenchai commented May 9, 2026

Uh oh!

raullenchai commented Jun 7, 2026

Uh oh!

raullenchai left a comment

Choose a reason for hiding this comment

Scope drift: only ~15% of this PR is actually about JANG

State of this branch

Step 0 — Necessity

What I'd ask instead

What's good

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants