Skip to content

Add JANG model loader integration#212

Open
samuelfaj wants to merge 27 commits into
raullenchai:mainfrom
samuelfaj:add-jangtq-loader-v2
Open

Add JANG model loader integration#212
samuelfaj wants to merge 27 commits into
raullenchai:mainfrom
samuelfaj:add-jangtq-loader-v2

Conversation

@samuelfaj

Copy link
Copy Markdown
Contributor

Summary

  • Detect local or Hugging Face models with jang_config.json before the vendored architecture fallback.
  • Route JANGTQ/MXTQ models through jang_tools.load_jangtq.load_jangtq_model and standard JANG models through jang_tools.loader.load_jang_model.
  • Add optional rapid-mlx[jang] dependency extra and regression tests for JANGTQ, JANG v2, and normal DeepSeek V4 fallback behavior.
  • Patch DeepSeek V4 JANGTQ tokenizer loading so jang-tools does not fall through Transformers AutoConfig for the vendored deepseek_v4 architecture.

Root cause

DeepSeek V4 JANGTQ bundles declare weight_format: mxtq and store routed experts as tq_packed/tq_norms tensors. The existing loader treated them like normal DeepSeek V4 MLX weights, so mlx_lm.load_model rejected thousands of unexpected JANGTQ parameters. During live validation, jang-tools also hit a DSV4 tokenizer/EOS expansion path that calls Transformers AutoConfig; the wrapper now patches that call for DSV4 JANGTQ to load tokenizer.json directly.

Validation

  • uv run --extra dev --extra jang python -m pytest tests/test_jangtq_loader.py tests/test_deepseek_v4_vendored.py -q
  • uv run --extra dev ruff check pyproject.toml vllm_mlx/utils/tokenizer.py tests/test_jangtq_loader.py
  • uv run --extra jang python - <<'PY' ... import jang_tools ... PY
  • Local model detection: DeepSeek-V4-Flash-JANGTQ detected as weight_format=mxtq, profile=JANGTQ2.
  • Live serve validation reached DSV4 streaming hydrate, replaced 129 routed TQ modules, loaded 85 regular shards, patched 43 SwitchGLU instances, then exposed a tokenizer path bug that this branch patches.

@samuelfaj

Copy link
Copy Markdown
Contributor Author

Validation update:

  • Full JANGTQ serve startup completed locally for .
  • Hydration replaced 129 DSV4 routed TQ modules, loaded 85 regular shards, patched 43 SwitchGLU instances, completed warmup, and served on port 8011.
  • OpenAI-compatible request returned HTTP 200 with , , , .
  • Additional compatibility fixes landed in the branch for DSV4 tokenizer metadata and MLX scalar RoPE offsets under rapid-mlx batching.

@samuelfaj

Copy link
Copy Markdown
Contributor Author

Validation update:

  • Full JANGTQ serve startup completed locally for /Users/samuelfajreldines/dev/models/DeepSeek-V4-Flash-JANGTQ.
  • Hydration replaced 129 DSV4 routed TQ modules, loaded 85 regular shards, patched 43 SwitchGLU instances, completed warmup, and served on port 8011.
  • OpenAI-compatible /v1/chat/completions request returned HTTP 200 with model=local, prompt_tokens=9, completion_tokens=8, total_tokens=17.
  • Additional compatibility fixes landed in the branch for DSV4 tokenizer metadata and MLX scalar RoPE offsets under rapid-mlx batching.

@samuelfaj

Copy link
Copy Markdown
Contributor Author

Final validation update:

  • Fixed quality issue by routing DSV4 JANGTQ through direct mlx_lm.generate on the model-owning MLX worker instead of the continuous batching generator path, which produced corrupted/repetitive tokens for this runtime.
  • Server validation command completed on port 8013 with /Users/samuelfajreldines/dev/models/DeepSeek-V4-Flash-JANGTQ.
  • /v1/chat/completions simple math request returned HTTP 200 with content exactly 4, prompt_tokens=17, completion_tokens=1, total_tokens=18.
  • /v1/chat/completions exact-ok request returned HTTP 200 with content exactly ok, prompt_tokens=9, completion_tokens=1, total_tokens=10.
  • Regression tests: uv run --extra dev --extra jang python -m pytest tests/test_jangtq_loader.py tests/test_deepseek_v4_vendored.py -q passed, 12 tests.
  • Ruff passed for changed files.

@samuelfaj

Copy link
Copy Markdown
Contributor Author

Performance/streaming update:

  • The DeepSeek V4 JANGTQ direct fallback now uses mlx_lm.stream_generate for streaming requests, so tokens are delivered as they are produced instead of waiting for full completion.
  • Non-streaming requests keep the safe direct mlx_lm.generate path.
  • Added an explicit TODO in the direct fallback explaining the future real batching fix: compare BatchGenerator logits/output against mlx_lm.generate, then fix cache offset handling, prompt-cache merge/extract, and RoPE position state until batching is bit-consistent with the direct path.
  • Live streaming validation returned SSE chunks with content exactly ok and final usage prompt_tokens=9, completion_tokens=2, total_tokens=11.
  • Focused tests passed: 17 tests.
  • Ruff passed.

@samuelfaj samuelfaj force-pushed the add-jangtq-loader-v2 branch from 2f48ce6 to 0ee615b Compare May 5, 2026 03:31
@samuelfaj samuelfaj marked this pull request as draft May 5, 2026 14:23
@samuelfaj samuelfaj marked this pull request as ready for review May 5, 2026 15:48
@samuelfaj samuelfaj force-pushed the add-jangtq-loader-v2 branch from ea128df to 9b0bb10 Compare May 5, 2026 15:58
@raullenchai

Copy link
Copy Markdown
Owner

Hi @samuelfaj — thanks for the work. Applying our new SOP §0 necessity gate (see docs/development/pr_merge_sop.md) I need a demand signal before merging.

Holding for clarification, not closing yet.

Reasoning:

To unlock merge, I need one or more of:

  1. User demand: a GitHub issue from a user (you or someone else) saying "I want to serve JANG model X with rapid-mlx and it doesn't work". Even one is enough.
  2. JANG popularity signal: pointer to a HuggingFace model page using JANGTQ/MXTQ format with non-trivial download counts, or a community discussion (Reddit/Discord/X) showing people are trying to run JANG locally.
  3. Scope split: separate the JANG-specific changes (vllm_mlx/jang_tools/*, tests/test_jangtq_loader.py, jang detection in loader, [jang] extras) from the unrelated infra changes (anthropic auth, completions, health, request_metrics, etc.). The current diff makes it impossible to review JANG support on its own merits.

For now please rebase on top of latest main (which now has #260, #262, #258 merged) and drop the parts that came from #205/#212-stack-overlap. After that I can give the JANG-specific surface the focused review it deserves.

Apologies for the friction — the necessity gate is new this week and I'm working through the backlog. Your #204 (Qwen tool-call fix) is being reviewed now since it has clear user value.

@raullenchai

Copy link
Copy Markdown
Owner

Thanks for putting this together. Two requests before review:

(1) Please split this into independent PRs. The diff is +4007 LOC across 27 files but the title scopes it to the JANG loader. The JANG-loader part is a coherent change on its own:

  • pyproject.toml (the [jang] extra)
  • vllm_mlx/utils/tokenizer.py (DSV4 JANGTQ tokenizer patch)
  • tests/test_jangtq_loader.py
  • whichever loader-routing code path detects jang_config.json before the vendored-arch fallback

The TUI (vllm_mlx/tui.py +736), metrics middleware (vllm_mlx/middleware/metrics.py +247, vllm_mlx/request_metrics.py +201), chat-route refactor (vllm_mlx/routes/chat.py +374), postprocessor changes (vllm_mlx/service/postprocessor.py +176), and batched-engine changes (vllm_mlx/engine/batched.py +224) are each their own scope and should be reviewed separately — they're unrelated to JANG and bundling them makes the diff impossible to review responsibly.

(2) Verify the JANG import path. The PR imports jang_tools.loader.load_jang_model and jang_tools.load_jangtq.load_jangtq_model, but the package published on PyPI is named jang, not jang-tools (https://pypi.org/project/jang/jang-tools returns 404). Either the published name has changed since you tested, or the imports here won't resolve on a clean install. Please:

  • Confirm the actual import path on a fresh venv (uv venv && uv pip install jang && python -c "import jang_tools" vs import jang).
  • Pin the exact version in the [jang] extra (jang-tools>=X.Y or jang>=X.Y) — this is a single-maintainer dependency with custom Metal kernels, so an unpinned floor is risky.
  • Add a one-line note in the PR description acknowledging the JANGQ-AI ecosystem is a small Apple-Silicon community (no academic backing, single primary maintainer at jangq.ai) so reviewers understand the supply-chain shape.

Happy to review the loader-only PR once it's split out — that part looks reasonable on first read.

@raullenchai raullenchai left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting the time in on this — JANG/JANGTQ is a real quantization scheme and jang (Jinho Jang's adaptive mixed-precision package) is a legitimate active project on PyPI. But I can't merge this PR in its current shape, and I'd like to explain why so we can find a path that works.

Scope drift: only ~15% of this PR is actually about JANG

The PR title and description are "Add JANG model loader integration." Mapping the 27 changed files against that scope:

Actually JANG-related (~600 LOC):

  • pyproject.toml — adds the jang extra ✓
  • tests/test_jangtq_loader.py — JANG loader tests ✓
  • vllm_mlx/utils/tokenizer.py — DSV4 JANGTQ tokenizer workaround ✓
  • vllm_mlx/engine/batched.py (partial, the JANG-detection branch) — partial ✓
  • vllm_mlx/cli.py (partial, JANG flag plumbing) — partial ✓

Not JANG-related — 5+ separate features bundled in (~3,400 LOC):

  • vllm_mlx/tui.py (+736 LOC, new file) — a full-screen termios TUI for monitoring rapid-mlx serve. Real product question, not a loader change.
  • vllm_mlx/request_metrics.py (+201, new) + vllm_mlx/middleware/metrics.py (+247, new) — an entirely new in-process metrics system. Overlaps with the telemetry worker we already shipped in v0.6.81 (telemetry.rapidmlx.com).
  • vllm_mlx/api/tool_calling.py (+248/-31) + vllm_mlx/tool_parsers/deepseek_tool_parser.py (+80) + vllm_mlx/tool_parsers/qwen3coder_tool_parser.py (+62/-9) — major tool-calling rewrites that conflict with the openai-harmony StreamableParser path landed in PR #515 (v0.6.75).
  • vllm_mlx/routes/chat.py (+374/-4) + vllm_mlx/service/postprocessor.py (+176/-1) + tests/test_postprocessor.py (+219) — new _looks_like_deferred_tool_use heuristic plus a postprocessor change. Maybe relates to the tool-calling rewrite above, but unrelated to JANG.

Each of these is independently worth discussing on its merits. Bundling them all into one PR titled "Add JANG model loader integration" means:

  • Reviewing any one of them requires reviewing all of them
  • Merging the JANG bits requires us to accept (or carefully detangle) all the rest
  • The "do we want a TUI?" / "do we want a second metrics system?" product questions get hidden under the loader integration label

State of this branch

  • Branch is dirty (merge conflicts) — main has moved substantially since 2026-05-05 (we're now at v0.7.3, with rewrites to the postprocessor, tool parsers, and chat route from PRs #408, #515, #555, #558). GitHub reports mergeable_state: dirty on this PR.
  • No driving issue exists — I searched JANG / JANGTQ across all open + closed issues; nothing. So the only signal that we should support this loader path is this PR itself.
  • Stale for 5 weeks — your last push was 2026-06-07 (helpful, thanks), but my best estimate is that a rebase on top of v0.7.3 main would require non-trivial reconciliation work given how much of the surface area has moved.

Step 0 — Necessity

Putting the scope question aside: does our product roadmap actually want JANG/JANGTQ support right now?

Honest answer: not enough signal. The supported quant tiers in vllm_mlx/aliases.json today are 4bit, 6bit, 8bit, mxfp4, ud, dwq, mxfp4-q8. Adding mxtq / jangtq is plausible — Jinho's work on adaptive precision is interesting — but I'd want to see:

  1. At least one user request (issue, Discord, anything) for JANGTQ inference support, not initiated by the contributor
  2. Bench evidence that JANGTQ on our path is meaningfully better than the existing mxfp4 / 4bit tiers on Apple Silicon (since we already have those landed)
  3. A specific JANGTQ checkpoint that's popular enough on HF to justify the loader-detection branch

Without those, accepting this loader is committing to maintaining an external-dependency integration path forever for a quantization scheme that may stay niche.

What I'd ask instead

If you want to land JANG support, the path I can support is:

  1. Open an issue ("Support JANG/JANGTQ model loading") with a specific HF checkpoint URL, a brief explainer of why JANGTQ matters vs. our existing quant tiers, and ideally a bench number. If 1-2 people +1 the issue, that's the signal we need.

  2. If the issue gets traction, open a NEW focused PR containing only:

    • pyproject.toml extra
    • tests/test_jangtq_loader.py
    • vllm_mlx/utils/tokenizer.py
    • The JANG-detection branch in engine/batched.py (extracted from this PR, isolated)
    • The CLI flag in vllm_mlx/cli.py (extracted, isolated)

    Probably ~500-700 LOC across 5 files. Easy to review, easy to bench, easy to land.

  3. Move the TUI, the metrics system, and the tool-calling work into separate PRs with their own product justification. The TUI in particular looks genuinely useful — but it's a separate decision from JANG.

Given the above, my recommendation is to close this PR (no maintainer-rejection stigma — just "wrong shape") and follow the path above. I realize that's frustrating after 5 weeks of work, and I'd rather pay you the courtesy of saying "this won't merge as-is" now than leave you waiting longer.

What's good

  • The detection ordering ("check jang_config.json before vendored arch fallback") is the right shape for plugin-style quantization integration.
  • The DSV4 tokenizer/AutoConfig patching context is well-documented in the PR description — that's a real edge case that we'd have hit too.
  • Validation evidence in the PR description (replaced 129 routed TQ modules, etc.) suggests you actually got this working end-to-end, which is more than most external loader-integration PRs deliver.
  • The tests in tests/test_jangtq_loader.py (+651 LOC) are substantive.
  • jang package is legitimate, actively maintained, real PyPI metadata — supply-chain shape is clean for the JANG-only subset.

Summary

  • Step 0 (does this solve a real product problem): ⚠ Unclear — no driving issue, contributor-initiated only. Need at least one independent user request before committing to ongoing maintenance.
  • Supply-chain audit (JANG-only subset): ✅ Clean — jang package is legitimate.
  • Supply-chain audit (full PR): N/A — too much unrelated surface to audit meaningfully.
  • Action: I'd like to close this PR and have you re-open a JANG-only PR (~5 files) after a driving issue gets +1s, with the TUI / metrics / tool-calling work split into separate PRs of their own. Happy to keep this branch alive in your fork as a reference — just don't think it should be the merge candidate.

Genuine thanks for the contribution — closing in this state doesn't reflect on the quality of any individual piece. The shape is the problem, not the work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants