[qwen] Add Qwen3.5-MoE builder (port of onnxruntime-genai #2146) #349

Merged: xadupre merged 3 commits into main from copilot/import-changes-and-add-fast-unit-test on May 24, 2026

Contributor

Copilot AI commented May 24, 2026

Ports the Qwen3.5-MoE (Qwen3_5MoeForConditionalGeneration) builder from microsoft/onnxruntime-genai#2146 and adds fast unit tests.

Changes

  • modelbuilder/builders/qwen.py
    • New Qwen35MoeTextModel(Qwen35TextModel): bias-free top-k router, packed routed experts with HF [gate|up] → ORT interleaved repack for swiglu_fusion=1, SiLU shared expert with sigmoid gating, supports both MoE and QMoE (symmetric blockwise, no zero_points). Strips stale /mlp/ entries from the int4 customized_weight_config.
    • Moved self.model_type assignment from make_genai_config into Qwen35TextModel.__init__ so subclasses can override it. Qwen35MoeTextModel sets it to "Qwen3_5_MoeForConditionalGeneration" so the base-class snake-case strip yields qwen3_5_moe (matching the C++ key registered upstream).
  • modelbuilder/builder.py: dispatches Qwen3_5MoeForConditionalGeneration → Qwen35MoeTextModel.
  • tests/fast/test_random_qwen3_5_moe.py
    • test_qwen3_5_moe_fp32_cpu_full_attention_build — builds a tiny random-weight MoE (4 experts, 2 full_attention layers, hidden=128, moe_intermediate=64); asserts an MoE/QMoE node is emitted and genai_config.json reports model.type == "qwen3_5_moe".
    • test_qwen3_5_moe_fp32_cpu_discrepancy_full_attention — runs ONNX Runtime CPU prefill and the HF PyTorch forward on the same inputs_embeds + 3-D mRoPE position_ids, then asserts np.testing.assert_allclose(pt_logits, ort_logits, atol=1e-3, rtol=1e-3) and that the greedy first-token argmax of the last-row logits agrees between PyTorch and ONNX Runtime. Observed on the tiny random-weight model: avg |PT−ORT| ≈ 8×10⁻⁸, max ≈ 6×10⁻⁷, first-token argmax matches.
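
The [gate|up] → interleaved repack mentioned for qwen.py can be sketched with NumPy. This is only an illustration, not the builder's code: the helper name `interleave_gate_up`, the packed shape `[num_experts, 2 * intermediate, hidden]`, and the convention that gate rows land on even indices and up rows on odd indices are all assumptions made for the example.

```python
import numpy as np


def interleave_gate_up(packed: np.ndarray) -> np.ndarray:
    """Repack [gate | up] blocks into an interleaved g0, u0, g1, u1, ... layout.

    Assumed input layout (hypothetical): packed[e, :inter] holds the gate rows
    and packed[e, inter:] holds the up rows of expert e.
    """
    num_experts, two_inter, hidden = packed.shape
    inter = two_inter // 2
    gate = packed[:, :inter, :]
    up = packed[:, inter:, :]
    # Stack to [E, inter, 2, hidden], then flatten the (inter, 2) axes so that
    # each gate row is immediately followed by its matching up row.
    return np.stack([gate, up], axis=2).reshape(num_experts, two_inter, hidden)


# Tiny example: 2 experts, intermediate=3, hidden=4.
packed = np.arange(2 * 6 * 4, dtype=np.float32).reshape(2, 6, 4)
interleaved = interleave_gate_up(packed)
```

In this sketch, `interleaved[:, 0::2]` recovers the gate rows and `interleaved[:, 1::2]` recovers the up rows, which is a quick way to sanity-check a repack before exporting weights.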

Notes / deviations from upstream PR

  • The C++ model_type.h / model.cpp changes are not ported — this repo is Python-only.
  • Upstream also retitles the text-only Qwen3.5 model type to Qwen3_5_textForCausalLM; left unchanged here since mbext already has a dedicated Qwen35CausalLMModel subclass and the existing genai-config string is preserved.
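
The reason for overriding model_type (see the qwen.py entry under Changes) can be illustrated with a small sketch. The helper `genai_model_type` is hypothetical and only assumes that the base class keeps the text before the first "For" and lowercases it; the actual base-class logic may differ.

```python
def genai_model_type(architecture: str) -> str:
    """Hypothetical stand-in for the base-class strip: prefix before 'For', lowercased."""
    idx = architecture.find("For")
    prefix = architecture if idx == -1 else architecture[:idx]
    return prefix.lower()


# The Hugging Face architecture name has no underscore before "Moe".
hf_name = genai_model_type("Qwen3_5MoeForConditionalGeneration")
# The overridden model_type inserts one, so the derived key gains the underscore.
overridden = genai_model_type("Qwen3_5_MoeForConditionalGeneration")
```

Under that assumption the unmodified architecture name would yield `qwen3_5moe`, while the overridden string yields `qwen3_5_moe`.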

Copilot AI changed the title from "[WIP] Import changes from PR 2146 and add fast unit test" to "Add Qwen3.5-MoE builder (port of onnxruntime-genai #2146)" on May 24, 2026
Copilot AI requested a review from xadupre May 24, 2026 12:32

@requires_transformers("5")
@hide_stdout()
def test_qwen3_5_moe_fp32_cpu_full_attention_build(self):
Owner

@copilot you need to check for discrepancies with onnxruntime and first token difference with onnxruntime-genai

Contributor Author

Added in 377b8b6: new test_qwen3_5_moe_fp32_cpu_discrepancy_full_attention runs ORT CPU prefill and HF PyTorch forward on the same inputs_embeds + 3-D mRoPE position_ids, then asserts np.testing.assert_allclose(pt_logits, ort_logits, atol=1e-3, rtol=1e-3) and that the greedy first-token argmax of the last-row logits agrees between PT and ORT. Measured on the tiny random-weight model: avg |PT−ORT| ≈ 8e-8, max ≈ 6e-7, first-token argmax matches.
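
The shape of such a discrepancy check can be sketched as follows. This is not the test from the PR: the function name `check_discrepancy` and the synthetic arrays standing in for the PyTorch and ONNX Runtime logits are assumptions made for the example, and only the tolerances are taken from the comment above.

```python
import numpy as np


def check_discrepancy(pt_logits, ort_logits, atol=1e-3, rtol=1e-3):
    """Compare two logits tensors of shape [batch, seq, vocab] (hypothetical helper)."""
    # Element-wise tolerance check, same tolerances as quoted in the comment.
    np.testing.assert_allclose(pt_logits, ort_logits, atol=atol, rtol=rtol)
    diff = np.abs(pt_logits - ort_logits)
    # Greedy first token: argmax over the vocabulary at the last position.
    pt_token = int(np.argmax(pt_logits[0, -1]))
    ort_token = int(np.argmax(ort_logits[0, -1]))
    assert pt_token == ort_token, (pt_token, ort_token)
    return {"avg": float(diff.mean()), "max": float(diff.max()), "token": pt_token}


# Synthetic stand-ins for the real model outputs (not measured data).
rng = np.random.default_rng(0)
pt = rng.standard_normal((1, 5, 32)).astype(np.float32)
pt[0, -1, 7] = 10.0  # make the last-row argmax unambiguous
ort = pt + rng.uniform(-5e-7, 5e-7, pt.shape).astype(np.float32)
stats = check_discrepancy(pt, ort)
```

Checking the argmax in addition to `assert_allclose` matters because two logits that both sit within tolerance can still swap order when the top candidates are nearly tied.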

Copilot AI requested a review from xadupre May 24, 2026 12:43
@xadupre xadupre marked this pull request as ready for review May 24, 2026 12:51
@github-actions

📂 Impacted Files

3 file(s) changed in this pull request:

🔵 modelbuilder/builder.py (+4 / -0)
🔵 modelbuilder/builders/qwen.py (+209 / -4)
🟢 tests/fast/test_random_qwen3_5_moe.py (+271 / -0)

github-actions Bot changed the title from "Add Qwen3.5-MoE builder (port of onnxruntime-genai #2146)" to "[qwen] Add Qwen3.5-MoE builder (port of onnxruntime-genai #2146)" on May 24, 2026
@codecov

codecov Bot commented May 24, 2026

Codecov Report

❌ Patch coverage is 83.50515% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.96%. Comparing base (b586dd8) to head (377b8b6).

Files with missing lines          Patch %   Lines
modelbuilder/builders/qwen.py     82.97%    16 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #349      +/-   ##
==========================================
+ Coverage   79.92%   79.96%   +0.03%     
==========================================
  Files          25       25              
  Lines        7741     7835      +94     
==========================================
+ Hits         6187     6265      +78     
- Misses       1554     1570      +16     
Flag                      Coverage Δ
fast-tests                79.96% <83.50%> (+0.03%) ⬆️
fast-tests-ort-nightly    79.91% <83.50%> (+0.03%) ⬆️
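
The percentages in this report can be reproduced from the hit and line counts in the coverage diff. The sketch below assumes the figures are truncated to two decimals rather than rounded. Codecov does not state this in the report, but the numbers are consistent with it. The helper name `truncated_pct` is made up for the example.

```python
import math


def truncated_pct(hits: int, lines: int) -> float:
    """Coverage percentage truncated (not rounded) to two decimals (assumed rule)."""
    return math.floor(hits / lines * 10000) / 100


base = truncated_pct(6187, 7741)   # coverage on main before the PR
head = truncated_pct(6265, 7835)   # coverage at the PR head
added = truncated_pct(78, 94)      # +78 hits over +94 lines in the diff

# 16 missing lines at 83.50515% patch coverage imply the patch size
# (an inference from the two reported numbers, not a figure from the report).
patch_lines = round(16 / (1 - 0.8350515))
```

Under these assumptions `base` and `head` match the reported 79.92% and 79.96%, and `added` matches the 82.97% shown for modelbuilder/builders/qwen.py.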

@xadupre xadupre merged commit 97d302d into main May 24, 2026
9 checks passed
@xadupre xadupre deleted the copilot/import-changes-and-add-fast-unit-test branch May 24, 2026 12:59

Development

Successfully merging this pull request may close these issues.

import changes from https://github.com/microsoft/onnxruntime-genai/pull/2146, add fast unit test
