Add OnnxMoEQuantization pass (com.microsoft::MoE → QMoE) #2491
Draft
justinchuby wants to merge 3 commits into
Pull request overview
This PR introduces a new ONNX graph-rewrite pass to quantize fused MoE blocks exported as com.microsoft::MoE by rewriting them into com.microsoft::QMoE, enabling int4/int8 weight-only quantization of per-expert FC weights using ONNX Runtime’s quantization + CUTLASS prepack helpers.
Changes:
- Added `OnnxMoEQuantization` pass to convert MoE → QMoE, quantize per-expert FC1/FC2 weights, and register packed uint8 weights + fp16 scales.
- Added unit tests covering successful conversion, block-wise scales, skip behavior when weights are dynamic, and config validation.
- Registered the new pass in `olive_config.json` with CUDA EP support and int4/int8 precision metadata.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `olive/passes/onnx/moe_quantization.py` | Implements the MoE→QMoE rewrite and per-expert quantization/packing pipeline. |
| `test/passes/onnx/test_moe_quantization.py` | Adds targeted unit tests for the new pass (conversion, blockwise scales, skips, config validation). |
| `olive/olive_config.json` | Registers the new pass and its supported providers/accelerators/precisions. |
Adds a new ONNX graph pass that rewrites every com.microsoft::MoE node into a com.microsoft::QMoE node with the per-expert FC1/FC2 weight initializers quantized to symmetric int4 (default) or int8, plus corresponding fp16 scale initializers.

Motivation: mobius (and similar exporters) emit the fused com.microsoft::MoE op with the per-expert weights as 3-D fp16/bf16/fp32 initializers. The existing weight-quantization passes (OnnxKQuantQuantization, OnnxBlockWiseRtnQuantization, OnnxBnb4Quantization) only target MatMul nodes, so for MoE models the per-expert weights (~80% of total parameters) stay at the model's compute dtype, leaving just ~6% size reduction after quantization. The QMoE op is the correct target for MoE weights and is supported by the CUDA + experimental CPU kernels in ORT main (PR microsoft/onnxruntime#28467).

Implementation:
- Walks the graph and finds every com.microsoft::MoE node whose fc1_experts_weights and fc2_experts_weights are 3-D static initializers.
- For each expert, calls ORT's pybind quantize_matmul_{4,8}bits to produce per-expert int4/int8 weights + symmetric fp16 scales, then CUTLASS-prepacks them via pack_weights_for_cuda_mixed_gemm so the QMoE kernels can consume the bytes directly.
- Stacks per-expert tensors along axis 0 and registers them as new initializers (uint8 weight + fp16 scale per expert).
- Replaces the MoE node with a QMoE node carrying the original activation/routing attributes plus expert_weight_bits, optional block_size, and quant_type='int'.
- Orphaned fp16 weight initializers are dropped.

Supports per-row scales (block_size=0, default) and block-wise scales (block_size ≥ 16, must be power of two). Nodes can be selectively excluded via nodes_to_exclude.

The pass requires a CUDA-enabled ONNX Runtime build because pack_weights_for_cuda_mixed_gemm is only exposed when ORT is compiled with USE_CUDA. A descriptive RuntimeError is raised at run time when the binding is unavailable.

Limitations / out-of-scope:
- fc3 inputs (3-fold MoE variants) are not supported and trigger a warning-skip per node.
- Only symmetric int quantization (matching the kernel's preferred layout). FP4 / FP8 / WFP4AFP8 quant_types are left for a follow-up.
- Calibration-aware quantization (GPTQ / AWQ) is out of scope; this pass is pure RTN.

Tests: 5 unit tests covering (a) end-to-end MoE → QMoE conversion with int4 + per-row scales, (b) block-wise int4, (c) graceful skip when weights are not static initializers, (d) bits validation, and (e) block_size validation. The CUTLASS prepack helper is patched during tests so CI without onnxruntime-gpu can still exercise the graph transform.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <11205048+justinchuby@users.noreply.github.com>
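To make the "pure RTN" and per-row versus block-wise wording above concrete, here is a minimal pure-Python sketch of symmetric round-to-nearest quantization of a single weight row. It is not ORT's `quantize_matmul_{4,8}bits` implementation: the function names are hypothetical, and the scale convention (`max_abs / qmax`) is an assumption that may differ from what the kernels expect.

```python
def quantize_row_symmetric(row, bits=4, block_size=0):
    """Round-to-nearest symmetric quantization of one weight row (illustrative only).

    Returns (q, scales): q holds signed integers in
    [-2**(bits-1), 2**(bits-1) - 1]; scales holds one float per block.
    block_size=0 means a single scale for the whole row (per-row scales).
    """
    if block_size == 0:
        block_size = len(row)
    qmax = 2 ** (bits - 1) - 1   # 7 for int4, 127 for int8
    qmin = -(2 ** (bits - 1))    # -8 for int4, -128 for int8
    q, scales = [], []
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        # Assumed convention: map the largest magnitude in the block to qmax.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        q.extend(max(qmin, min(qmax, round(v / scale))) for v in block)
    return q, scales


def dequantize_row(q, scales, block_size=0):
    """Inverse of the sketch above: multiply each value by its block's scale."""
    if block_size == 0:
        block_size = len(q)
    return [v * scales[i // block_size] for i, v in enumerate(q)]
```

With block-wise scales each block of `block_size` consecutive values gets its own scale, so an outlier in one block no longer forces a coarse step size on the rest of the row; the cost is one extra scale per block.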
- _maybe_input now treats an ir.Value with empty name as missing, not present. ONNX represents unset optional inputs as empty-string slots; while onnx_ir typically normalises those to None, defensively handle the ir.Value(name='') case too so the fc3 reject path doesn't fire on MoE nodes that include empty placeholder slots for fc1_bias / fc2_bias.
- Validate N % pack_factor == 0 and block_size % pack_factor == 0 in _quantize_one_expert. These were latent failure modes where the CUTLASS prepack helper would either crash or produce a wrong layout; now we emit a clear _UnsupportedMoEError and the MoE node is skipped with a warning instead of being silently corrupted.
- Add a comment in _quantize_one_expert explaining that the 2-D scale / zero_point shapes match the upstream ORT test harness (test_qmoe_cuda.py::quant_dequant_blockwise) — pybind11's buffer protocol accepts any contiguous shape as long as the element count matches, so this isn't a regression vs the 1-D layout used in rtn_quantization.py.

Two new unit tests cover the changes:
- test_moe_to_qmoe_handles_explicit_empty_optional_inputs: appends empty-string fc2_bias / fc3_W / fc3_bias slots to the MoE node and asserts the pass still converts it (fc3 reject path doesn't trigger).
- test_n_not_divisible_by_pack_factor_skipped: builds an MoE node with N=3 (odd) and asserts the conversion is skipped with a clean warning rather than crashing.

All 7 tests pass, lintrunner clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <11205048+justinchuby@users.noreply.github.com>
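The divisibility checks above can be sketched in a few lines. This is an illustration under stated assumptions, not the pass's code: `pack_factor` is assumed to be `8 // bits` (values per uint8 byte), the helper names are hypothetical, `ValueError` stands in for the pass's `_UnsupportedMoEError`, and the low-nibble-first order in `pack_int4` is an assumption. The real pass delegates layout to ORT's CUTLASS prepack helper, which produces a different byte arrangement.

```python
def pack_factor(bits):
    """Number of quantized values stored in one uint8 byte (assumed: 8 // bits)."""
    if bits not in (4, 8):
        raise ValueError(f"bits must be 4 or 8, got {bits}")
    return 8 // bits


def check_shapes(n, bits, block_size):
    """Reject shapes that cannot be packed cleanly; raises ValueError otherwise."""
    pf = pack_factor(bits)
    if block_size != 0:
        is_power_of_two = block_size & (block_size - 1) == 0
        if block_size < 16 or not is_power_of_two:
            raise ValueError("block_size must be 0 or a power of two >= 16")
        # Redundant for valid block sizes, but mirrors the check named above.
        if block_size % pf != 0:
            raise ValueError("block_size must be divisible by pack_factor")
    if n % pf != 0:
        raise ValueError(f"N={n} is not divisible by pack_factor={pf}")


def pack_int4(values):
    """Pack signed int4 values (-8..7) two per byte, low nibble first (assumed order)."""
    if len(values) % 2 != 0:
        raise ValueError("need an even number of int4 values")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        # Masking with 0xF keeps the two's-complement low 4 bits of each value.
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)
```

An odd `N` with int4, as in the `N=3` test case, leaves half a byte unfilled, which is why skipping the node with a clear error is preferable to handing such a shape to the prepack helper.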
f056fa8 to 289d56c
- Move `_convert_moe_to_qmoe`, `_convert_single_moe`, and the `_drop_unused_initializers` helper from the `OnnxMoEQuantization` class into module-level private functions, per Google's Python style guide preference for free functions over class methods when no class state is involved. The `OnnxMoEQuantization` class now only owns config defaulting and the `_run_for_config` entry point.
- Replace the hand-rolled orphan-initializer sweep with `onnx_ir.passes.common.RemoveUnusedNodesPass`, which also handles dead-node removal and keeps the cleanup consistent with the rest of the IR pass ecosystem.

No behaviour change: all 7 unit tests pass, lintrunner clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <11205048+justinchuby@users.noreply.github.com>
Contributor
Author
Follow-ups filed:
Contributor
Author
Sent the proposed ORT-side fix as microsoft/onnxruntime#28749. Once that lands and ships in a release, this pass can drop the
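Problem

Mobius and other ONNX exporters emit MoE blocks as fused `com.microsoft::MoE` nodes whose per-expert FC1/FC2 weights are 3-D fp16/bf16/fp32 initializers. The existing weight-quantization passes (`OnnxKQuantQuantization`, `OnnxBlockWiseRtnQuantization`, `OnnxBnb4Quantization`) only target standalone `MatMul` nodes, so for MoE models the per-expert weights (~80% of total parameters) stay at the model's compute dtype, leaving only a ~6% size reduction after quantization.

The correct target for these weights is `com.microsoft::QMoE`, which is now supported by both the CUDA and (experimental) CPU kernels in ORT main after microsoft/onnxruntime#28467. There is no existing pass in either Olive or ORT that performs the `MoE → QMoE` rewrite.

Solution

New `OnnxMoEQuantization` pass that walks the graph and rewrites every `com.microsoft::MoE` node:
- Finds every node whose `fc1_experts_weights` and `fc2_experts_weights` are 3-D static initializers.
- For each expert, calls ORT's `quantize_matmul_{4,8}bits` to produce per-expert quantized weights and symmetric fp16 scales, then CUTLASS-prepacks them via `pack_weights_for_cuda_mixed_gemm` so the QMoE kernels can read them directly.
- Stacks the per-expert tensors along axis 0 and registers them as new initializers (`uint8` weights + `fp16` scales).
- Replaces the MoE node with a QMoE node carrying the original activation/routing attributes plus `expert_weight_bits`, optional `block_size`, and `quant_type='int'`.

Configuration

- `bits`: 4 (default) or 8.
- `block_size`: 0 for per-row scales (default), or a power of two ≥ 16 for block-wise scales.
- `nodes_to_exclude`: nodes to leave unconverted.
- `force_arch`

Example

```json
{
  "input_model": {"type": "OnnxModel", "model_path": "decoder/model.onnx"},
  "passes": {
    "moe_quant": {
      "type": "OnnxMoEQuantization",
      "bits": 4,
      "block_size": 0
    }
  }
}
```

Runtime requirement

The pass uses `pack_weights_for_cuda_mixed_gemm` from ORT, which is only exported when ORT is built with `USE_CUDA`. A descriptive `RuntimeError` is raised at run time when the binding is unavailable, telling the user to install `onnxruntime-gpu >= 1.28` (or a recent nightly).

Limitations / out-of-scope

- `fc3` inputs (3-fold MoE variants) are not supported and trigger a warning-skip per node.
- Only symmetric int quantization is supported; `fp4`/`fp8`/`wfp4afp8` `quant_type`s are left for a follow-up.
- Calibration-aware quantization (GPTQ / AWQ) is out of scope; this pass is pure RTN.

Tests

5 unit tests in `test/passes/onnx/test_moe_quantization.py`:

| Test | Covers |
|---|---|
| `test_moe_to_qmoe_conversion` | End-to-end MoE → QMoE conversion with int4 + per-row scales |
| `test_moe_to_qmoe_blockwise` | Block-wise int4: scales have shape `[E, N, K/block_size]` and the pass emits the `block_size` attribute |
| `test_moe_to_qmoe_skip_when_not_initializer` | Graceful skip when weights are not static initializers |
| `test_invalid_bits_rejected` | `bits` validation |
| `test_invalid_block_size_rejected` | `block_size` validation |

The CUTLASS prepack helper is patched with a pass-through stub during tests so CPU-only CI can exercise the graph transform. All 5 tests pass locally, the existing 416 `test/passes/onnx` tests still pass, and `lintrunner` is clean.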
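For readers unfamiliar with the pass-through stub approach mentioned in the test notes, the pattern looks roughly like the sketch below. Everything here is a stand-in: `fake_bindings`, `prepack`, and `passthrough_stub` are hypothetical names, and the real helper's signature and module path are not shown in this PR page, so none of this should be read as the actual test code.

```python
import types
from unittest import mock

# Hypothetical stand-in for the ORT binding module. On a CPU-only build the
# real helper is unavailable, modelled here as a function that raises.
fake_bindings = types.SimpleNamespace()


def _unavailable(*args, **kwargs):
    raise RuntimeError("pack_weights_for_cuda_mixed_gemm needs a CUDA-enabled ORT build")


fake_bindings.pack_weights_for_cuda_mixed_gemm = _unavailable


def prepack(weight_bytes):
    """Code under test: looks the helper up at call time so a patch takes effect."""
    return fake_bindings.pack_weights_for_cuda_mixed_gemm(weight_bytes)


def passthrough_stub(weight_bytes, *args, **kwargs):
    """Return the input unchanged so the graph transform can still be exercised."""
    return weight_bytes


# Inside the context manager the stub replaces the helper; on exit the
# original (raising) function is restored automatically.
with mock.patch.object(fake_bindings, "pack_weights_for_cuda_mixed_gemm", passthrough_stub):
    patched_result = prepack(b"\x12\x34")
```

A pass-through stub verifies graph structure, initializer names, and shapes, but not the packed byte layout itself; that still needs a CUDA-enabled build to check.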