
Add Qwen3.5-35B-A3B MoE VLM ONNX export recipe #405

Open
tanzeel-amd wants to merge 10 commits into microsoft:main from tanzeel-amd:turrahma/qwen3.5-moe-35B-A3B

Conversation

@tanzeel-amd
Contributor

@tanzeel-amd tanzeel-amd commented May 8, 2026

  • Olive recipe for exporting Qwen/Qwen3.5-35B-A3B (256 experts, 8 routed + 1 shared)
  • Three sub-model pipeline: text decoder (INT4 QMoE), embedding (FP32), vision (FP32)
  • Custom ONNX-export-friendly MoE model class (codes/modeling_qwen3_5_moe.py)
  • Inference script with text, image, interactive, and benchmark modes
  • Requires ORT GenAI built with qwen3_5_moe support
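For readers unfamiliar with the "8 routed + 1 shared" layout, here is a minimal sketch of top-k expert routing with an always-on shared expert. It is an illustration only, not the code in codes/modeling_qwen3_5_moe.py: the function names, the scalar toy experts, the renormalization of the kept weights, and the ungated shared expert are all assumptions made for brevity.

```python
import math

def route_token(router_logits, top_k=8):
    """Pick the top_k experts for one token and return (expert_id, weight) pairs."""
    # Softmax over all expert logits (subtract the max for numerical stability).
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the top_k most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize the kept weights so they sum to 1 (assumption; models differ here).
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]

def moe_forward(x, router_logits, experts, shared_expert, top_k=8):
    """Add the weighted outputs of the routed experts to the shared expert's output."""
    out = shared_expert(x)  # the shared expert runs for every token (ungated here)
    for expert_id, weight in route_token(router_logits, top_k):
        out += weight * experts[expert_id](x)
    return out

# Toy setup: 256 hypothetical scalar "experts", where expert i multiplies its input by i.
experts = [lambda x, s=i: x * s for i in range(256)]
shared = lambda x: x
logits = [0.0] * 256
logits[3] = 5.0  # make expert 3 the clear favourite

picked = route_token(logits, top_k=8)
y = moe_forward(1.0, logits, experts, shared, top_k=1)
```

Only the selected experts are evaluated per token, which is why a 35B-parameter model can activate roughly 3B parameters per step (the "A3B" in the model name).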

Copilot AI review requested due to automatic review settings May 8, 2026 10:34
Contributor

Copilot AI left a comment


Pull request overview

Adds a new Olive recipe to export and run Qwen/Qwen3.5-35B-A3B as a three-submodel ONNX Runtime GenAI pipeline (vision encoder + embedding fusion + INT4 text decoder), including a custom ONNX-export-friendly MoE model shell and an inference/benchmark script.

Changes:

  • Introduces a custom Qwen3_5MoeModel implementation used for ONNX export of the vision and embedding submodels.
  • Adds Olive JSON pipelines for exporting/optimizing vision.onnx, embedding.onnx, and building text.onnx via ModelBuilder (INT4).
  • Adds end-to-end optimize.py config generation and inference.py runner with interactive + benchmark + optional PyTorch comparison.
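To make the "embedding fusion" stage concrete, the sketch below shows one common way such a stage can work: positions holding an image-placeholder token get their text embedding replaced by the next row of vision-encoder output. This is an assumption about the general technique, not a description of the exported embedding.onnx graph; `fuse_embeddings` and `IMAGE_TOKEN_ID` are hypothetical names.

```python
def fuse_embeddings(input_ids, text_embeds, image_features, image_token_id):
    """Replace each image-placeholder position with the next vision feature row."""
    n_slots = sum(1 for t in input_ids if t == image_token_id)
    if n_slots != len(image_features):
        raise ValueError(
            f"{n_slots} placeholders but {len(image_features)} image features"
        )
    fused = []
    next_feature = iter(image_features)
    for token_id, embed in zip(input_ids, text_embeds):
        # Text positions keep their embedding; image positions take a vision row.
        fused.append(next(next_feature) if token_id == image_token_id else embed)
    return fused

IMAGE_TOKEN_ID = 99  # hypothetical placeholder id, not the model's real token id
ids = [1, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 2]
text = [[0.1], [0.0], [0.0], [0.2]]   # one embedding row per token
vision = [[7.0], [8.0]]               # one row per image placeholder
fused = fuse_embeddings(ids, text, vision, IMAGE_TOKEN_ID)
```

The fused sequence is what the text decoder would then consume in place of plain token embeddings.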

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| Qwen-Qwen3.5-35B-A3B/LICENSE | Adds upstream Apache-2.0 license text for the recipe content. |
| Qwen-Qwen3.5-35B-A3B/builtin/user_script.py | Provides Olive model loaders + dummy inputs for exporting embedding/vision via a custom model shell. |
| Qwen-Qwen3.5-35B-A3B/builtin/optimize.py | Orchestrates Olive runs and patches genai_config.json + writes processor_config.json + tokenizer fixups. |
| Qwen-Qwen3.5-35B-A3B/builtin/inference.py | Adds ORT GenAI inference script with interactive mode and benchmarking (optionally vs PyTorch). |
| Qwen-Qwen3.5-35B-A3B/builtin/cpu_and_mobile/text.json | Olive pipeline to build INT4 text decoder via ModelBuilder. |
| Qwen-Qwen3.5-35B-A3B/builtin/cpu_and_mobile/embedding.json | Olive pipeline to export embedding fusion model and apply graph surgeries/optimizations. |
| Qwen-Qwen3.5-35B-A3B/builtin/cpu_and_mobile/vision.json | Olive pipeline to export vision encoder, apply PackedAttention surgery, and optimization passes. |
| Qwen-Qwen3.5-35B-A3B/builtin/codes/modeling_qwen3_5_moe.py | Custom ONNX-export-friendly model implementation (vision + embedding shell + MoE text components). |
| Qwen-Qwen3.5-35B-A3B/builtin/codes/__init__.py | Initializes the codes module for imports. |
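As an illustration of the "patches genai_config.json" step, here is a minimal sketch of a post-export patch driven by a `--context-length` argument. It is not the PR's optimize.py: the `model` / `context_length` / `eos_token_id` field names, the `--config` flag, the default of 4096, and the placeholder values are assumptions chosen for the demo, which runs against a throwaway file in a temporary directory.

```python
import argparse
import json
import tempfile
from pathlib import Path

def patch_genai_config(config_path, context_length, token_ids):
    """Load a JSON config, overwrite selected model fields, and write it back."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    model = config.setdefault("model", {})      # assumed top-level section name
    model["context_length"] = context_length    # assumed field name
    model.update(token_ids)                     # e.g. ids read from a HuggingFace config
    path.write_text(json.dumps(config, indent=2))
    return config

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Patch a generated config (sketch).")
    parser.add_argument("--config", default="genai_config.json")
    parser.add_argument("--context-length", type=int, default=4096)  # hypothetical default
    return parser.parse_args(argv)

# Demo on a temporary file so nothing real is modified.
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "genai_config.json"
    cfg_path.write_text(json.dumps({"model": {"type": "placeholder"}}))
    args = parse_args(["--config", str(cfg_path), "--context-length", "8192"])
    patched = patch_genai_config(args.config, args.context_length, {"eos_token_id": 2})
    on_disk = json.loads(cfg_path.read_text())
```

Reading values such as token ids from the source model's configs and passing them in, rather than hardcoding them in the patch function, keeps the script reusable across checkpoints.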


Comment thread Qwen-Qwen3.5-35B-A3B/builtin/user_script.py Outdated
Comment thread Qwen-Qwen3.5-35B-A3B/builtin/optimize.py
Comment thread Qwen-Qwen3.5-35B-A3B/builtin/codes/modeling_qwen3_5_moe.py Outdated
@tanzeel-amd
Contributor Author

@microsoft-github-policy-service agree company="AMD"

@VishalX
Contributor

VishalX commented May 11, 2026

@xieofxie / @devang-ml pls review

@xieofxie
Contributor

please wait for microsoft/onnxruntime-genai#2146

@tanzeel-amd tanzeel-amd force-pushed the turrahma/qwen3.5-moe-35B-A3B branch 3 times, most recently from 883738b to 274a85d Compare May 22, 2026 11:18
@VishalX
Contributor

VishalX commented May 24, 2026

> please wait for microsoft/onnxruntime-genai#2146

OGA PR merged.

xieofxie
xieofxie previously approved these changes May 25, 2026
@devang-ml
Contributor

Please add README and info.yml

Ur Rahman and others added 10 commits May 26, 2026 11:16
- Olive recipe for exporting Qwen/Qwen3.5-35B-A3B (256 experts, 8 routed + 1 shared)
- Three sub-model pipeline: text decoder (INT4 QMoE), embedding (FP32), vision (FP32)
- Custom ONNX-export-friendly MoE model class (codes/modeling_qwen3_5_moe.py)
- Inference script with text, image, interactive, and benchmark modes
- Requires ORT GenAI built with qwen3_5_moe support (see DEBUG_STATUS.md)
…ript.py

- optimize.py: Read token IDs, vision params, and preprocessor settings from HuggingFace model/generation configs instead of hardcoding them. Model name is read from the Olive text.json config. Added --context-length CLI arg.

- user_script.py: Use safetensors safe_open with prefix filtering to load only vision/embedding weights (~4 GB) instead of the full 35B checkpoint (~67 GB).
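The prefix-filtered loading described above can be sketched as follows. This is an approximation under stated assumptions, not the code in user_script.py: the helper names and the tensor names in the toy index are hypothetical, and `weight_map` stands for the `"weight_map"` entry of a sharded checkpoint's `model.safetensors.index.json`. The `safe_open` call is imported lazily so the selection logic can be exercised without the safetensors package installed.

```python
import os
from collections import defaultdict

def keys_by_shard(weight_map, prefixes):
    """Group tensor names that start with any wanted prefix by their shard file."""
    wanted = defaultdict(list)
    for name, shard in weight_map.items():
        if name.startswith(tuple(prefixes)):
            wanted[shard].append(name)
    return dict(wanted)

def load_filtered(model_dir, weight_map, prefixes):
    """Open only shards that hold wanted tensors and read just those tensors."""
    from safetensors import safe_open  # requires safetensors (and torch for "pt")
    tensors = {}
    for shard, names in keys_by_shard(weight_map, prefixes).items():
        with safe_open(os.path.join(model_dir, shard), framework="pt", device="cpu") as f:
            for name in names:
                tensors[name] = f.get_tensor(name)  # reads this tensor only
    return tensors

# Toy index with hypothetical tensor names and shard files.
weight_map = {
    "model.visual.blocks.0.attn.qkv.weight": "model-00001.safetensors",
    "model.language_model.embed_tokens.weight": "model-00001.safetensors",
    "model.language_model.layers.0.mlp.experts.0.up_proj.weight": "model-00002.safetensors",
}
selected = keys_by_shard(
    weight_map, ["model.visual.", "model.language_model.embed_tokens."]
)
```

Because the expert weights live in shards that are never opened, the memory cost tracks the size of the selected tensors rather than the full checkpoint.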

Co-authored-by: Cursor <cursoragent@cursor.com>
@tanzeel-amd
Contributor Author

@devang-ml Added README.md and info.yml. Please review.

@tanzeel-amd tanzeel-amd requested a review from xieofxie May 26, 2026 18:22

5 participants