Add OnnxMoEQuantization pass (com.microsoft::MoE → QMoE) by justinchuby · Pull Request #2491 · microsoft/Olive

justinchuby · 2026-06-01T23:42:17Z

Problem

Mobius and other ONNX exporters emit MoE blocks as fused com.microsoft::MoE nodes whose per-expert FC1/FC2 weights are 3-D fp16/bf16/fp32 initializers. The existing weight-quantization passes (OnnxKQuantQuantization, OnnxBlockWiseRtnQuantization, OnnxBnb4Quantization) only target standalone MatMul nodes, so:

- The per-expert weights, which are typically ~80% of total parameters in a MoE model, stay at the model's compute dtype after running any of those passes.
- For Gemma 4 26B-A4B, this means int4 quantization yields only ~6% file-size reduction (~48 GB → ~45 GB).
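
As a rough sanity check on those figures, the arithmetic below estimates how much of the file the MatMul-only passes actually reached. It is only a sketch under two assumptions that are not stated in the PR: the quantized weights go from 16-bit to 4-bit, and scale/metadata overhead is ignored.

```python
# Back-of-the-envelope check of the quoted file sizes.
# Assumptions (not from the PR): 16-bit -> 4-bit weights, no scale overhead.
size_before_gb = 48.0
size_after_gb = 45.0

# Observed relative reduction in file size.
reduction = (size_before_gb - size_after_gb) / size_before_gb

# Going from 16 to 4 bits removes 75% of the bytes of whatever was quantized.
savings_per_quantized_byte = 1 - 4 / 16

# Implied share of the file that was actually quantized.
quantized_share = reduction / savings_per_quantized_byte

print(f"observed reduction: {reduction:.2%}")
print(f"implied quantized share: {quantized_share:.2%}")
```

Under these assumptions only a small minority of the bytes were touched, which is consistent with the bulk of the model sitting in MoE expert weights that the existing passes skip.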

The correct target for these weights is com.microsoft::QMoE, which is now supported by both the CUDA and (experimental) CPU kernels in ORT main after microsoft/onnxruntime#28467. There is no existing pass in either Olive or ORT that performs the MoE → QMoE rewrite.

Solution

New OnnxMoEQuantization pass that walks the graph and rewrites every com.microsoft::MoE node:

1. Validates that fc1_experts_weights and fc2_experts_weights are 3-D static initializers.
2. For each expert, calls ORT's pybind quantize_matmul_{4,8}bits to produce per-expert quantized weights and symmetric fp16 scales, then CUTLASS-prepacks them via pack_weights_for_cuda_mixed_gemm so the QMoE kernels can read them directly.
3. Stacks the per-expert tensors along axis 0 and registers them as new initializers (uint8 weights + fp16 scales).
4. Replaces the MoE node with a QMoE node, carrying over all original routing/activation attributes plus expert_weight_bits, optional block_size, and quant_type='int'.
5. Drops the orphan fp16 weight initializers.
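
The per-expert quantize-and-stack step can be illustrated with a small NumPy sketch. This is not the pass's code: it does not call ORT's quantize_matmul_{4,8}bits, it does not pack two int4 values per byte, and it skips the CUTLASS prepack entirely. The helper names and the assumed `[E, N, K]` weight layout are hypothetical and chosen only to show symmetric round-to-nearest quantization with per-row or block-wise scales.

```python
import numpy as np


def quantize_expert_rtn(weight: np.ndarray, bits: int = 4, block_size: int = 0):
    """Symmetric round-to-nearest quantization of one expert's [N, K] weight.

    Hypothetical helper: values are left unpacked in int8, with no CUTLASS layout.
    """
    n, k = weight.shape
    group = k if block_size == 0 else block_size
    if k % group != 0:
        raise ValueError("K must be divisible by block_size")
    qmax = 2 ** (bits - 1) - 1  # 7 for int4, 127 for int8

    # One scale per group of `group` consecutive values along K.
    blocks = weight.astype(np.float32).reshape(n, k // group, group)
    absmax = np.abs(blocks).max(axis=-1, keepdims=True)
    scale = np.where(absmax == 0, 1.0, absmax / qmax)
    q = np.clip(np.rint(blocks / scale), -qmax - 1, qmax).astype(np.int8)

    scale = scale.squeeze(-1)  # [N, K / group]
    if block_size == 0:
        scale = scale[:, 0]  # per-row scales: [N]
    return q.reshape(n, k), scale.astype(np.float16)


def quantize_all_experts(weights: np.ndarray, bits: int = 4, block_size: int = 0):
    """Quantize a 3-D [E, N, K] tensor expert by expert and stack along axis 0."""
    if weights.ndim != 3:
        raise ValueError("expected a 3-D [E, N, K] tensor")
    per_expert = [quantize_expert_rtn(w, bits, block_size) for w in weights]
    q = np.stack([p[0] for p in per_expert], axis=0)
    scales = np.stack([p[1] for p in per_expert], axis=0)
    return q, scales


rng = np.random.default_rng(0)
experts = rng.standard_normal((4, 8, 32)).astype(np.float16)
q, scales = quantize_all_experts(experts, bits=4, block_size=16)
print(q.shape, scales.shape)
```

With `block_size=16` the scales come out 3-D (`[E, N, K/block_size]`), and with `block_size=0` they collapse to one scale per row.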

Configuration

| Param | Default | Notes |
|---|---|---|
| `bits` | 4 | 4 (int4) or 8 (int8) |
| `block_size` | 0 | 0 = per-row scales; otherwise must be a power of two ≥ 16 |
| `nodes_to_exclude` | None | MoE node names to leave unquantized |
| `force_arch` | 80 | CUTLASS prepacking target SM (80 = Ampere, 90 = Hopper) |
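
The constraints on `bits` and `block_size` in the table could be checked along these lines. This is a minimal sketch with a hypothetical function name, not the validation code from the pass itself.

```python
def validate_moe_quant_config(bits: int, block_size: int) -> None:
    """Fail fast on values outside the documented ranges (hypothetical helper)."""
    if bits not in (4, 8):
        raise ValueError(f"bits must be 4 or 8, got {bits}")
    if block_size == 0:
        return  # 0 selects per-row scales
    # A positive integer is a power of two when exactly one bit is set.
    is_power_of_two = block_size > 0 and (block_size & (block_size - 1)) == 0
    if block_size < 16 or not is_power_of_two:
        raise ValueError(
            f"block_size must be 0 or a power of two >= 16, got {block_size}"
        )


validate_moe_quant_config(bits=4, block_size=0)   # per-row scales
validate_moe_quant_config(bits=8, block_size=32)  # block-wise scales
```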

Example

{
  "input_model": {"type": "OnnxModel", "model_path": "decoder/model.onnx"},
  "passes": {
    "moe_quant": {
      "type": "OnnxMoEQuantization",
      "bits": 4,
      "block_size": 0
    }
  }
}

Runtime requirement

The pass uses pack_weights_for_cuda_mixed_gemm from ORT, which is only exported when ORT is built with USE_CUDA. A descriptive RuntimeError is raised at run time when the binding is unavailable, telling the user to install onnxruntime-gpu >= 1.28 (or a recent nightly).
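
One way to produce such an error is to look the binding up at run time and raise if it is missing. The sketch below uses stand-in namespace objects in place of the real ORT binding module, whose import path is not given here; `require_binding` is a hypothetical name.

```python
from types import SimpleNamespace


def require_binding(module, name: str):
    """Return module.<name>, or raise a descriptive error if it is missing."""
    fn = getattr(module, name, None)
    if fn is None:
        raise RuntimeError(
            f"{name} is not available in this ONNX Runtime build. "
            "Install onnxruntime-gpu >= 1.28 (or a recent nightly)."
        )
    return fn


# Stand-ins for a CPU-only and a CUDA-enabled binding module (hypothetical).
cpu_only = SimpleNamespace()
with_cuda = SimpleNamespace(
    pack_weights_for_cuda_mixed_gemm=lambda weights, *args: weights
)

prepack = require_binding(with_cuda, "pack_weights_for_cuda_mixed_gemm")
```

Deferring the lookup to run time keeps the pass importable on CPU-only installs, so the failure only surfaces when the pass is actually executed.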

Limitations / out-of-scope

- fc3 inputs (3-fold MoE variants) trigger a warning-skip per node.
- Symmetric integer quantization only (matching the QMoE kernel's preferred layout). fp4 / fp8 / wfp4afp8 quant_types are left for a follow-up.
- No calibration-aware quantization (GPTQ / AWQ). This pass is pure RTN.

Tests

5 unit tests in test/passes/onnx/test_moe_quantization.py:

| Test | Coverage |
|---|---|
| `test_moe_to_qmoe_conversion` | End-to-end MoE→QMoE with int4 + per-row scales; checks node replacement, input slot order, attribute preservation, and uint8/fp16 shape/dtype |
| `test_moe_to_qmoe_blockwise` | block_size=16 produces 3-D scales `[E, N, K/block_size]` and emits the `block_size` attribute |
| `test_moe_to_qmoe_skip_when_not_initializer` | Dynamic-weight MoE nodes are skipped with a warning, not converted |
| `test_invalid_bits_rejected` | bits ∉ {4, 8} fails fast |
| `test_invalid_block_size_rejected` | non-power-of-two block_size fails fast |

The CUTLASS prepack helper is patched with a pass-through stub during tests so CPU-only CI can exercise the graph transform. All 5 tests pass locally, the existing 416 test/passes/onnx tests still pass, and lintrunner is clean.
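
The stubbing approach can be sketched with `unittest.mock`. The `bindings` namespace below is a hypothetical stand-in for whichever module owns the prepack helper, and the actual tests may patch a different target.

```python
from types import SimpleNamespace
from unittest import mock


def _cuda_only_prepack(weights, *args):
    # Mimics a helper that cannot run without a CUDA-enabled build.
    raise RuntimeError("CUDA build required")


# Hypothetical stand-in for the module that exposes the prepack helper.
bindings = SimpleNamespace(pack_weights_for_cuda_mixed_gemm=_cuda_only_prepack)


def prepack(weights):
    # Look the helper up at call time so that a patch is picked up.
    return bindings.pack_weights_for_cuda_mixed_gemm(weights)


def passthrough(weights, *args):
    # Pass-through stub: returns the bytes unchanged.
    return weights


with mock.patch.object(bindings, "pack_weights_for_cuda_mixed_gemm", passthrough):
    patched_result = prepack(b"\x12\x34")
```

Because the stub returns its input untouched, tests written this way check the graph rewrite and tensor shapes, not the packed byte layout.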

Copilot

Pull request overview

This PR introduces a new ONNX graph-rewrite pass to quantize fused MoE blocks exported as com.microsoft::MoE by rewriting them into com.microsoft::QMoE, enabling int4/int8 weight-only quantization of per-expert FC weights using ONNX Runtime’s quantization + CUTLASS prepack helpers.

Changes:

- Added OnnxMoEQuantization pass to convert MoE → QMoE, quantize per-expert FC1/FC2 weights, and register packed uint8 weights + fp16 scales.
- Added unit tests covering successful conversion, block-wise scales, skip behavior when weights are dynamic, and config validation.
- Registered the new pass in olive_config.json with CUDA EP support and int4/int8 precision metadata.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| `olive/passes/onnx/moe_quantization.py` | Implements the MoE→QMoE rewrite and per-expert quantization/packing pipeline. |
| `test/passes/onnx/test_moe_quantization.py` | Adds targeted unit tests for the new pass (conversion, blockwise scales, skips, config validation). |
| `olive/olive_config.json` | Registers the new pass and its supported providers/accelerators/precisions. |

Adds a new ONNX graph pass that rewrites every com.microsoft::MoE node into a com.microsoft::QMoE node with the per-expert FC1/FC2 weight initializers quantized to symmetric int4 (default) or int8, plus corresponding fp16 scale initializers.

Motivation: mobius (and similar exporters) emit the fused com.microsoft::MoE op with the per-expert weights as 3-D fp16/bf16/fp32 initializers. The existing weight-quantization passes (OnnxKQuantQuantization, OnnxBlockWiseRtnQuantization, OnnxBnb4Quantization) only target MatMul nodes, so for MoE models the per-expert weights (~80% of total parameters) stay at the model's compute dtype, leaving just ~6% size reduction after quantization. The QMoE op is the correct target for MoE weights and is supported by the CUDA + experimental CPU kernels in ORT main (PR microsoft/onnxruntime#28467).

Implementation:

- Walks the graph and finds every com.microsoft::MoE node whose fc1_experts_weights and fc2_experts_weights are 3-D static initializers.
- For each expert, calls ORT's pybind quantize_matmul_{4,8}bits to produce per-expert int4/int8 weights + symmetric fp16 scales, then CUTLASS-prepacks them via pack_weights_for_cuda_mixed_gemm so the QMoE kernels can consume the bytes directly.
- Stacks per-expert tensors along axis 0 and registers them as new initializers (uint8 weight + fp16 scale per expert).
- Replaces the MoE node with a QMoE node carrying the original activation/routing attributes plus expert_weight_bits, optional block_size, and quant_type='int'.
- Orphaned fp16 weight initializers are dropped.

Supports per-row scales (block_size=0, default) and block-wise scales (block_size ≥ 16, must be power of two). Nodes can be selectively excluded via nodes_to_exclude.

The pass requires a CUDA-enabled ONNX Runtime build because pack_weights_for_cuda_mixed_gemm is only exposed when ORT is compiled with USE_CUDA. A descriptive RuntimeError is raised at run time when the binding is unavailable.

Limitations / out-of-scope:

- fc3 inputs (3-fold MoE variants) are not supported and trigger a warning-skip per node.
- Only symmetric int quantization (matching the kernel's preferred layout). FP4 / FP8 / WFP4AFP8 quant_types are left for a follow-up.
- Calibration-aware quantization (GPTQ / AWQ) is out of scope; this pass is pure RTN.

Tests: 5 unit tests covering (a) end-to-end MoE → QMoE conversion with int4 + per-row scales, (b) block-wise int4, (c) graceful skip when weights are not static initializers, (d) bits validation, and (e) block_size validation. The CUTLASS prepack helper is patched during tests so CI without onnxruntime-gpu can still exercise the graph transform.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <11205048+justinchuby@users.noreply.github.com>

- _maybe_input now treats an ir.Value with empty name as missing, not present. ONNX represents unset optional inputs as empty-string slots; while onnx_ir typically normalises those to None, defensively handle the ir.Value(name='') case too so the fc3 reject path doesn't fire on MoE nodes that include empty placeholder slots for fc1_bias / fc2_bias.
- Validate N % pack_factor == 0 and block_size % pack_factor == 0 in _quantize_one_expert. These were latent failure modes where the CUTLASS prepack helper would either crash or produce a wrong layout; now we emit a clear _UnsupportedMoEError and the MoE node is skipped with a warning instead of being silently corrupted.
- Add a comment in _quantize_one_expert explaining that the 2-D scale / zero_point shapes match the upstream ORT test harness (test_qmoe_cuda.py::quant_dequant_blockwise). pybind11's buffer protocol accepts any contiguous shape as long as the element count matches, so this isn't a regression vs the 1-D layout used in rtn_quantization.py.

Two new unit tests cover the changes:

- test_moe_to_qmoe_handles_explicit_empty_optional_inputs: appends empty-string fc2_bias / fc3_W / fc3_bias slots to the MoE node and asserts the pass still converts it (fc3 reject path doesn't trigger).
- test_n_not_divisible_by_pack_factor_skipped: builds an MoE node with N=3 (odd) and asserts the conversion is skipped with a clean warning rather than crashing.

All 7 tests pass, lintrunner clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <11205048+justinchuby@users.noreply.github.com>
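
The two defensive checks described in this commit might look roughly like the sketch below. The names mirror the commit message but the bodies are illustrative only: `Value` is a minimal stand-in for `ir.Value`, and defining `pack_factor` as the number of quantized values per byte (`8 // bits`) is an assumption rather than something the commit states.

```python
from typing import Optional, Sequence


class Value:
    """Minimal stand-in for an IR value; only the name matters here."""

    def __init__(self, name: str):
        self.name = name


class UnsupportedMoEError(Exception):
    """Signals that a MoE node cannot be converted and should be skipped."""


def maybe_input(inputs: Sequence[Optional[Value]], index: int) -> Optional[Value]:
    """Return the input at `index`, or None if the slot is absent or empty."""
    if index >= len(inputs):
        return None
    value = inputs[index]
    if value is None or value.name == "":
        return None  # empty-string slot means "optional input not set"
    return value


def check_pack_alignment(n: int, block_size: int, bits: int) -> None:
    # Assumption: pack_factor is the number of quantized values per byte.
    pack_factor = 8 // bits
    if n % pack_factor != 0:
        raise UnsupportedMoEError(f"N={n} is not divisible by {pack_factor}")
    if block_size and block_size % pack_factor != 0:
        raise UnsupportedMoEError(
            f"block_size={block_size} is not divisible by {pack_factor}"
        )


node_inputs = [Value("x"), Value("router"), Value("fc1_w"), Value(""), None]
fc1_bias = maybe_input(node_inputs, 3)  # empty-name placeholder slot
```

Raising a dedicated exception lets the caller log a warning and leave that MoE node untouched instead of aborting the whole pass.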

- Move `_convert_moe_to_qmoe`, `_convert_single_moe`, and the `_drop_unused_initializers` helper from the `OnnxMoEQuantization` class into module-level private functions, per Google's Python style guide preference for free functions over class methods when no class state is involved. The `OnnxMoEQuantization` class now only owns config defaulting and the `_run_for_config` entry point.
- Replace the hand-rolled orphan-initializer sweep with `onnx_ir.passes.common.RemoveUnusedNodesPass`, which also handles dead-node removal and keeps the cleanup consistent with the rest of the IR pass ecosystem.

No behaviour change: all 7 unit tests pass, lintrunner clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <11205048+justinchuby@users.noreply.github.com>

justinchuby · 2026-06-02T01:52:25Z

Follow-ups filed:

- microsoft/onnxruntime#28748 ("com.microsoft::QMoE should prepack int4/int8 weights in PrePack(), like MatMulNBits does"): proposes an ORT-side fix in which QMoE prepacks its int4/int8 weights in PrePack(), as com.microsoft::MatMulNBits already does. That would let this pass quantize on a CPU-only host with no special steps. The current asymmetry is most likely an oversight inherited from the TensorRT-LLM-flavoured fpA_intB code, since MatMulNBits does it correctly.
- #2492 ("OnnxMoEQuantization: port pack_weights_for_cuda_mixed_gemm to numpy / cupy so the pass runs on CPU-only hosts"): tracks a numpy + optional-cupy port of pack_weights_for_cuda_mixed_gemm for this pass, so it can run on CPU-only hosts without waiting for the ORT-side fix to ship.

justinchuby · 2026-06-02T02:11:01Z

Sent the proposed ORT-side fix as microsoft/onnxruntime#28749. Once that lands and ships in a release, this pass can drop the pack_weights_for_cuda_mixed_gemm dependency entirely — the Olive pass would only need quantize_matmul_{4,8}bits (which is in every ORT build, CPU or CUDA).

Copilot AI review requested due to automatic review settings June 1, 2026 23:42

Copilot started reviewing on behalf of justinchuby June 1, 2026 23:42

Copilot AI reviewed Jun 1, 2026


Comment thread olive/passes/onnx/moe_quantization.py

Comment thread olive/passes/onnx/moe_quantization.py Outdated

justinchuby and others added 2 commits June 2, 2026 00:10

justinchuby force-pushed the moe-to-qmoe-conversion branch from f056fa8 to 289d56c June 2, 2026 00:11

justinchuby marked this pull request as draft June 2, 2026 00:13

This was referenced Jun 2, 2026

com.microsoft::QMoE should prepack int4/int8 weights in PrePack(), like MatMulNBits does microsoft/onnxruntime#28748

Open

OnnxMoEQuantization: port pack_weights_for_cuda_mixed_gemm to numpy / cupy so the pass runs on CPU-only hosts #2492

Open

justinchuby mentioned this pull request Jun 2, 2026

QMoE: prepack int4/int8 expert weights in PrePack hook (symmetric with MatMulNBits) microsoft/onnxruntime#28749

Open

justinchuby wants to merge 3 commits into microsoft:main from justinchuby:moe-to-qmoe-conversion
