Add Qwen3.5-35B-A3B MoE VLM ONNX export recipe #405
Pull request overview
Adds a new Olive recipe to export and run Qwen/Qwen3.5-35B-A3B as a three-submodel ONNX Runtime GenAI pipeline (vision encoder + embedding fusion + INT4 text decoder), including a custom ONNX-export-friendly MoE model shell and an inference/benchmark script.
Changes:
- Introduces a custom `Qwen3_5MoeModel` implementation used for ONNX export of the vision and embedding submodels.
- Adds Olive JSON pipelines for exporting/optimizing `vision.onnx` and `embedding.onnx`, and building `text.onnx` via ModelBuilder (INT4).
- Adds end-to-end `optimize.py` config generation and an `inference.py` runner with interactive + benchmark + optional PyTorch comparison.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `Qwen-Qwen3.5-35B-A3B/LICENSE` | Adds upstream Apache-2.0 license text for the recipe content. |
| `Qwen-Qwen3.5-35B-A3B/builtin/user_script.py` | Provides Olive model loaders + dummy inputs for exporting embedding/vision via a custom model shell. |
| `Qwen-Qwen3.5-35B-A3B/builtin/optimize.py` | Orchestrates Olive runs and patches `genai_config.json` + writes `processor_config.json` + tokenizer fixups. |
| `Qwen-Qwen3.5-35B-A3B/builtin/inference.py` | Adds ORT GenAI inference script with interactive mode and benchmarking (optionally vs PyTorch). |
| `Qwen-Qwen3.5-35B-A3B/builtin/cpu_and_mobile/text.json` | Olive pipeline to build INT4 text decoder via ModelBuilder. |
| `Qwen-Qwen3.5-35B-A3B/builtin/cpu_and_mobile/embedding.json` | Olive pipeline to export embedding fusion model and apply graph surgeries/optimizations. |
| `Qwen-Qwen3.5-35B-A3B/builtin/cpu_and_mobile/vision.json` | Olive pipeline to export vision encoder, apply PackedAttention surgery, and optimization passes. |
| `Qwen-Qwen3.5-35B-A3B/builtin/codes/modeling_qwen3_5_moe.py` | Custom ONNX-export-friendly model implementation (vision + embedding shell + MoE text components). |
| `Qwen-Qwen3.5-35B-A3B/builtin/codes/__init__.py` | Initializes the codes module for imports. |
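
For readers unfamiliar with the `genai_config.json` patching step attributed to `optimize.py` above, here is a minimal sketch of what such a step could look like. It is not the PR's code: the function names are hypothetical, the input is assumed to be a flat dict of values already gathered from the HuggingFace configs, and the `model.bos_token_id` / `model.eos_token_id` / `model.pad_token_id` / `model.context_length` key layout is an assumption about the generated config. Only the `--context-length` flag name comes from this thread.

```python
import argparse
import json
from pathlib import Path


def patch_genai_config(genai_cfg: dict, hf_cfg: dict, context_length=None) -> dict:
    """Copy token IDs from an HF-style config dict into genai_cfg["model"].

    Assumption: hf_cfg is a flat dict; keys that are missing or None are skipped
    so existing values in genai_cfg are left untouched.
    """
    model = genai_cfg.setdefault("model", {})
    for key in ("bos_token_id", "eos_token_id", "pad_token_id"):
        if hf_cfg.get(key) is not None:
            model[key] = hf_cfg[key]
    if context_length is not None:
        model["context_length"] = context_length
    return genai_cfg


def patch_file(genai_path: str, hf_cfg: dict, context_length=None) -> None:
    """Read, patch, and rewrite a genai_config.json in place."""
    path = Path(genai_path)
    cfg = json.loads(path.read_text(encoding="utf-8"))
    patch_genai_config(cfg, hf_cfg, context_length)
    path.write_text(json.dumps(cfg, indent=4), encoding="utf-8")


def build_parser() -> argparse.ArgumentParser:
    """CLI with an optional context-length override (hypothetical wiring)."""
    parser = argparse.ArgumentParser(description="Patch genai_config.json (sketch)")
    parser.add_argument("--context-length", type=int, default=None)
    return parser
```

Skipping `None` values keeps the patch idempotent: running it twice with the same inputs leaves the file unchanged.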

@microsoft-github-policy-service agree company="AMD"

@xieofxie / @devang-ml pls review

please wait for microsoft/onnxruntime-genai#2146

883738b to 274a85d

OGA PR merged.

Please add README and info.yml

- Olive recipe for exporting Qwen/Qwen3.5-35B-A3B (256 experts, 8 routed + 1 shared)
- Three sub-model pipeline: text decoder (INT4 QMoE), embedding (FP32), vision (FP32)
- Custom ONNX-export-friendly MoE model class (codes/modeling_qwen3_5_moe.py)
- Inference script with text, image, interactive, and benchmark modes
- Requires ORT GenAI built with qwen3_5_moe support (see DEBUG_STATUS.md)
…ript.py
- optimize.py: Read token IDs, vision params, and preprocessor settings from HuggingFace model/generation configs instead of hardcoding them. Model name is read from the Olive text.json config. Added --context-length CLI arg.
- user_script.py: Use safetensors safe_open with prefix filtering to load only vision/embedding weights (~4 GB) instead of the full 35B checkpoint (~67 GB).
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
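
The `user_script.py` change described in the commit above (loading only vision/embedding weights via `safe_open` with prefix filtering) can be illustrated roughly as follows. This is a sketch under assumptions, not the PR's implementation: the function names are hypothetical, and the tensor-name prefixes a caller passes in depend on the checkpoint's actual key layout, which this thread does not show. The only library calls used are `safetensors.safe_open`, `keys()`, and `get_tensor()`.

```python
from typing import Dict, Iterable


def filter_by_prefix(handle, prefixes: Iterable[str]) -> Dict[str, object]:
    """Return only the tensors whose names start with one of `prefixes`.

    `handle` is any object exposing keys() and get_tensor(name), which is the
    interface of a file opened with safetensors.safe_open. Tensors that do not
    match are never materialized.
    """
    wanted = tuple(prefixes)  # str.startswith requires a tuple, not a list
    return {
        name: handle.get_tensor(name)
        for name in handle.keys()
        if name.startswith(wanted)
    }


def load_submodel_weights(shard_paths: Iterable[str], prefixes: Iterable[str]) -> Dict[str, object]:
    """Collect matching tensors across all checkpoint shards."""
    # Imported lazily so filter_by_prefix stays usable without safetensors installed.
    from safetensors import safe_open

    wanted = tuple(prefixes)
    weights: Dict[str, object] = {}
    for path in shard_paths:
        with safe_open(path, framework="pt", device="cpu") as handle:
            weights.update(filter_by_prefix(handle, wanted))
    return weights
```

Because `safe_open` reads tensors on demand, iterating over `keys()` only touches the file header, so memory use scales with the matched tensors rather than with the full checkpoint.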

274a85d to 03b5b4b

@devang-ml Added README.md and info.yml. Please review.