[qwen] Add Qwen3.5-MoE builder (port of onnxruntime-genai #2146) #349
…t test Agent-Logs-Url: https://github.com/xadupre/mbext/sessions/0d946b1a-b3d9-4d9a-bba8-57a66a71d021 Co-authored-by: xadupre <22452781+xadupre@users.noreply.github.com>
    @requires_transformers("5")
    @hide_stdout()
    def test_qwen3_5_moe_fp32_cpu_full_attention_build(self):
@copilot you need to check for discrepancies with onnxruntime and first token difference with onnxruntime-genai
Added in 377b8b6: new test_qwen3_5_moe_fp32_cpu_discrepancy_full_attention runs ORT CPU prefill and HF PyTorch forward on the same inputs_embeds + 3-D mRoPE position_ids, then asserts np.testing.assert_allclose(pt_logits, ort_logits, atol=1e-3, rtol=1e-3) and that the greedy first-token argmax of the last-row logits agrees between PT and ORT. Measured on the tiny random-weight model: avg |PT−ORT| ≈ 8e-8, max ≈ 6e-7, first-token argmax matches.
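For readers who want the shape of that check without the ORT and HF setup, here is a minimal NumPy-only sketch. The helper name `compare_logits` and the `(batch, seq, vocab)` logits layout are assumptions made for illustration; the actual test in `tests/fast/test_random_qwen3_5_moe.py` may be organized differently.

```python
import numpy as np


def compare_logits(pt_logits, ort_logits, atol=1e-3, rtol=1e-3):
    """Hypothetical helper: compare two logits tensors of shape (batch, seq, vocab)."""
    pt = np.asarray(pt_logits, dtype=np.float32)
    ort = np.asarray(ort_logits, dtype=np.float32)

    # Element-wise tolerance check, same tolerances as quoted in the comment above.
    np.testing.assert_allclose(pt, ort, atol=atol, rtol=rtol)

    # Greedy first token: argmax over the vocabulary at the last sequence position.
    pt_token = pt[:, -1, :].argmax(axis=-1)
    ort_token = ort[:, -1, :].argmax(axis=-1)
    assert np.array_equal(pt_token, ort_token), "first-token argmax differs"

    diff = np.abs(pt - ort)
    return {
        "avg": float(diff.mean()),
        "max": float(diff.max()),
        "first_token": pt_token.tolist(),
    }


# Synthetic stand-ins for the PyTorch and ONNX Runtime outputs (not real model logits).
pt_logits = np.linspace(-1.0, 1.0, 1 * 5 * 32, dtype=np.float32).reshape(1, 5, 32)
ort_logits = (pt_logits.astype(np.float64) + 1e-6).astype(np.float32)
stats = compare_logits(pt_logits, ort_logits)
```

Returning the average and maximum absolute difference alongside the assertion makes it easy to report numbers like the ones measured above.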
Agent-Logs-Url: https://github.com/xadupre/mbext/sessions/b964cd1f-c911-4ab1-b273-972fb0386e8f Co-authored-by: xadupre <22452781+xadupre@users.noreply.github.com>
📂 Impacted Files: 3 file(s) changed in this pull request.
Codecov Report
❌ Patch coverage is
Additional details and impacted files

    @@            Coverage Diff             @@
    ##             main     #349      +/-   ##
    ==========================================
    + Coverage   79.92%   79.96%   +0.03%
    ==========================================
      Files          25       25
      Lines        7741     7835      +94
    ==========================================
    + Hits         6187     6265      +78
    - Misses       1554     1570      +16
Ports the Qwen3.5-MoE (`Qwen3_5MoeForConditionalGeneration`) builder from microsoft/onnxruntime-genai#2146 and adds fast unit tests.

## Changes

- `modelbuilder/builders/qwen.py`
  - `Qwen35MoeTextModel(Qwen35TextModel)`: bias-free top-k router, packed routed experts with HF `[gate|up]` → ORT interleaved repack for `swiglu_fusion=1`, SiLU shared expert with sigmoid gating, supports both `MoE` and `QMoE` (symmetric blockwise, no `zero_points`). Strips stale `/mlp/` entries from the int4 `customized_weight_config`.
  - Moves the `self.model_type` assignment from `make_genai_config` into `Qwen35TextModel.__init__` so subclasses can override it. `Qwen35MoeTextModel` sets it to `"Qwen3_5_MoeForConditionalGeneration"` so the base-class snake-case strip yields `qwen3_5_moe` (matching the C++ key registered upstream).
- `modelbuilder/builder.py`: dispatch `Qwen3_5MoeForConditionalGeneration` → `Qwen35MoeTextModel`.
- `tests/fast/test_random_qwen3_5_moe.py`
  - `test_qwen3_5_moe_fp32_cpu_full_attention_build`: builds a tiny random-weight MoE (4 experts, 2 `full_attention` layers, `hidden=128`, `moe_intermediate=64`); asserts an `MoE`/`QMoE` node is emitted and `genai_config.json` reports `model.type == "qwen3_5_moe"`.
  - `test_qwen3_5_moe_fp32_cpu_discrepancy_full_attention`: runs ONNX Runtime CPU prefill and the HF PyTorch forward on the same `inputs_embeds` + 3-D mRoPE `position_ids`, then asserts `np.testing.assert_allclose(pt_logits, ort_logits, atol=1e-3, rtol=1e-3)` and that the greedy first-token `argmax` of the last-row logits agrees between PyTorch and ONNX Runtime. Observed on the tiny random-weight model: avg `|PT−ORT|` ≈ 8×10⁻⁸, max ≈ 6×10⁻⁷, first-token argmax matches.

## Notes / deviations from upstream PR

- `model_type.h` / `model.cpp` changes are not ported: this repo is Python-only.
- `Qwen3_5_textForCausalLM`; left unchanged here since mbext already has a dedicated `Qwen35CausalLMModel` subclass and the existing genai-config string is preserved.
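
The HF `[gate|up]` → interleaved repack listed under Changes can be illustrated with a short NumPy sketch. It assumes the packed expert weight has shape `(num_experts, 2 * inter, hidden)` with all gate rows first and all up rows second; the function name `interleave_gate_up` and that axis layout are illustrative assumptions, and the builder in `qwen.py` may transpose or reshape differently.

```python
import numpy as np


def interleave_gate_up(packed: np.ndarray) -> np.ndarray:
    """Hypothetical sketch: reorder rows [g0..gN | u0..uN] into g0, u0, g1, u1, ...

    Assumed input layout: (num_experts, 2 * inter, hidden), gate block first.
    """
    num_experts, two_inter, hidden = packed.shape
    assert two_inter % 2 == 0, "packed dimension must hold gate and up halves"
    inter = two_inter // 2

    gate = packed[:, :inter, :]  # (num_experts, inter, hidden)
    up = packed[:, inter:, :]    # (num_experts, inter, hidden)

    # Stack into (num_experts, inter, 2, hidden), then merge the pair axis into
    # the row axis so each gate row is immediately followed by its up row.
    return np.stack([gate, up], axis=2).reshape(num_experts, two_inter, hidden)


# One expert, inter=2, hidden=1: gate rows are [0, 1], up rows are [2, 3].
packed = np.arange(4, dtype=np.float32).reshape(1, 4, 1)
interleaved = interleave_gate_up(packed)
```

In this toy case the row order becomes gate 0, up 0, gate 1, up 1, which is the interleaved arrangement the description associates with `swiglu_fusion=1`.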