Add OnnxMoEQuantization pass (com.microsoft::MoE → QMoE) #2491

Draft
justinchuby wants to merge 3 commits into microsoft:main from justinchuby:moe-to-qmoe-conversion

@justinchuby
Contributor

Problem

Mobius and other ONNX exporters emit MoE blocks as fused com.microsoft::MoE nodes whose per-expert FC1/FC2 weights are 3-D fp16/bf16/fp32 initializers. The existing weight-quantization passes (OnnxKQuantQuantization, OnnxBlockWiseRtnQuantization, OnnxBnb4Quantization) only target standalone MatMul nodes, so:

  • The per-expert weights — which are typically ~80% of total parameters in a MoE model — stay at the model's compute dtype after running any of those passes.
  • For Gemma 4 26B-A4B, this means int4 quantization yields only ~6% file-size reduction (~48 GB → ~45 GB).

The correct target for these weights is com.microsoft::QMoE, which is now supported by both the CUDA and (experimental) CPU kernels in ORT main after microsoft/onnxruntime#28467. There is no existing pass in either Olive or ORT that performs the MoE → QMoE rewrite.

Solution

New OnnxMoEQuantization pass that walks the graph and rewrites every com.microsoft::MoE node:

  1. Validates that fc1_experts_weights and fc2_experts_weights are 3-D static initializers.
  2. For each expert, calls ORT's pybind quantize_matmul_{4,8}bits to produce per-expert quantized weights and symmetric fp16 scales, then CUTLASS-prepacks them via pack_weights_for_cuda_mixed_gemm so the QMoE kernels can read them directly.
  3. Stacks the per-expert tensors along axis 0 and registers them as new initializers (uint8 weights + fp16 scales).
  4. Replaces the MoE node with a QMoE node, carrying over all original routing/activation attributes plus expert_weight_bits, optional block_size, and quant_type='int'.
  5. Drops the orphan fp16 weight initializers.
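
Steps 2 and 3 can be illustrated with a minimal NumPy sketch. This is not the pass's code: it stands in for ORT's `quantize_matmul_{4,8}bits` with a plain symmetric round-to-nearest routine, and it omits the CUTLASS prepack entirely. The function names, the low-nibble-first packing, and the implicit zero point of `2**(bits-1)` are assumptions made for illustration.

```python
import numpy as np


def quantize_expert_rtn(weight: np.ndarray, bits: int = 4):
    """Symmetric per-row RTN quantization of one expert's [N, K] weight.

    Returns (uint8 weights, fp16 per-row scales). For bits=4, two values are
    packed per byte, low nibble first (an assumed layout for illustration).
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for int4, 127 for int8
    w = weight.astype(np.float32)
    absmax = np.abs(w).max(axis=1, keepdims=True)
    scale = absmax / qmax
    scale[scale == 0] = 1.0  # avoid dividing by zero on all-zero rows
    q = np.clip(np.rint(w / scale), -qmax - 1, qmax).astype(np.int16)
    # Shift to unsigned storage with an implicit zero point of 2**(bits-1).
    u = (q + 2 ** (bits - 1)).astype(np.uint8)
    if bits == 4:
        if u.shape[1] % 2:
            raise ValueError("K must be even to pack two int4 values per byte")
        u = u[:, 0::2] | (u[:, 1::2] << 4)
    return u, scale[:, 0].astype(np.float16)


def quantize_experts(weights: np.ndarray, bits: int = 4):
    """Quantize a 3-D [E, N, K] tensor expert by expert, then stack on axis 0."""
    packed, scales = zip(*(quantize_expert_rtn(w, bits) for w in weights))
    return np.stack(packed, axis=0), np.stack(scales, axis=0)
```

With four experts of shape [8, 16], `quantize_experts` returns a uint8 tensor of shape [4, 8, 8] for int4 and fp16 scales of shape [4, 8].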

Configuration

| Param | Default | Notes |
| --- | --- | --- |
| bits | 4 | 4 (int4) or 8 (int8) |
| block_size | 0 | 0 = per-row scales; otherwise must be a power of two ≥ 16 |
| nodes_to_exclude | None | MoE node names to leave unquantized |
| force_arch | 80 | CUTLASS prepacking target SM (80 = Ampere, 90 = Hopper) |
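
The constraints on `bits` and `block_size` can be sketched as a small validation helper. The function name and error messages below are hypothetical; only the rules themselves (bits in {4, 8}, block_size either 0 or a power of two ≥ 16) come from the table.

```python
def validate_quant_config(bits: int, block_size: int) -> None:
    """Raise ValueError if the (bits, block_size) pair violates the table's rules."""
    if bits not in (4, 8):
        raise ValueError(f"bits must be 4 or 8, got {bits}")
    if block_size == 0:
        return  # 0 selects per-row scales
    # A positive integer is a power of two iff exactly one bit is set.
    is_power_of_two = block_size > 0 and (block_size & (block_size - 1)) == 0
    if block_size < 16 or not is_power_of_two:
        raise ValueError(
            f"block_size must be 0 or a power of two >= 16, got {block_size}"
        )
```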

Example

{
  "input_model": {"type": "OnnxModel", "model_path": "decoder/model.onnx"},
  "passes": {
    "moe_quant": {
      "type": "OnnxMoEQuantization",
      "bits": 4,
      "block_size": 0
    }
  }
}

Runtime requirement

The pass uses pack_weights_for_cuda_mixed_gemm from ORT, which is only exported when ORT is built with USE_CUDA. A descriptive RuntimeError is raised at run time when the binding is unavailable, telling the user to install onnxruntime-gpu >= 1.28 (or a recent nightly).
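
One way to structure such a check is sketched below. The helper name, the `module` parameter, and the message wording are hypothetical; the sketch does not assume which ORT module actually exposes the binding, so the caller passes the module object in.

```python
from types import SimpleNamespace


def require_prepack_binding(module, name: str = "pack_weights_for_cuda_mixed_gemm"):
    """Return the prepack callable from `module`, or raise a descriptive error."""
    fn = getattr(module, name, None)
    if fn is None:
        raise RuntimeError(
            f"{name} is not available in this ONNX Runtime build. It is only "
            "exported when ORT is built with USE_CUDA; install "
            "onnxruntime-gpu >= 1.28 (or a recent nightly)."
        )
    return fn


# Usage with stand-in module objects (not real ORT modules):
cpu_only_build = SimpleNamespace()
cuda_build = SimpleNamespace(pack_weights_for_cuda_mixed_gemm=lambda w: w)
prepack = require_prepack_binding(cuda_build)
```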

Limitations / out-of-scope

  • MoE nodes with fc3 inputs (3-fold MoE variants) are not supported; each such node is skipped with a warning.
  • Symmetric integer quantization only (matching the QMoE kernel's preferred layout). fp4 / fp8 / wfp4afp8 quant_types are left for a follow-up.
  • No calibration-aware quantization (GPTQ / AWQ). This pass is pure RTN.

Tests

5 unit tests in test/passes/onnx/test_moe_quantization.py:

| Test | Coverage |
| --- | --- |
| test_moe_to_qmoe_conversion | End-to-end MoE→QMoE with int4 + per-row scales; checks node replacement, input slot order, attribute preservation, and uint8/fp16 shape/dtype |
| test_moe_to_qmoe_blockwise | block_size=16 produces 3-D scales [E, N, K/block_size] and emits the block_size attribute |
| test_moe_to_qmoe_skip_when_not_initializer | Dynamic-weight MoE nodes are skipped with a warning, not converted |
| test_invalid_bits_rejected | bits ∉ {4, 8} fails fast |
| test_invalid_block_size_rejected | non-power-of-two block_size fails fast |
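
The block-wise scale shape checked by test_moe_to_qmoe_blockwise can be illustrated with a short NumPy sketch. It assumes symmetric absmax scaling per block of `block_size` consecutive elements along K; the function name is hypothetical and this is not the ORT quantizer.

```python
import numpy as np


def blockwise_scales(weights: np.ndarray, block_size: int, bits: int = 4) -> np.ndarray:
    """Compute symmetric fp16 scales of shape [E, N, K / block_size]."""
    e, n, k = weights.shape
    if k % block_size:
        raise ValueError(f"K={k} is not divisible by block_size={block_size}")
    qmax = 2 ** (bits - 1) - 1  # 7 for int4, 127 for int8
    # Split K into contiguous blocks and take the absolute maximum of each.
    blocks = weights.astype(np.float32).reshape(e, n, k // block_size, block_size)
    scale = np.abs(blocks).max(axis=-1) / qmax
    scale[scale == 0] = 1.0  # all-zero blocks get a neutral scale
    return scale.astype(np.float16)
```

For weights of shape [2, 4, 64] and block_size=16 the result has shape [2, 4, 4].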

The CUTLASS prepack helper is patched with a pass-through stub during tests so CPU-only CI can exercise the graph transform. All 5 tests pass locally, the existing 416 test/passes/onnx tests still pass, and lintrunner is clean.
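
The pass-through stub approach can be sketched with `unittest.mock.patch.object`. Everything here is a stand-in: `backend` mimics a CPU-only build that lacks the binding, and the stub's signature is hypothetical since the real helper's arguments are not shown in this PR description.

```python
from types import SimpleNamespace
from unittest.mock import patch

# Stand-in for a CPU-only build: the prepack binding is absent.
backend = SimpleNamespace()


def passthrough_prepack(weights, *args, **kwargs):
    """Test stub: return the quantized bytes unchanged instead of repacking."""
    return weights


def pack_for_qmoe(weights):
    """Hypothetical code under test that calls the CUDA-only helper."""
    return backend.pack_weights_for_cuda_mixed_gemm(weights)


def run_with_stub(weights):
    # create=True lets the patch add an attribute the target does not have;
    # the attribute is removed again when the context manager exits.
    with patch.object(
        backend, "pack_weights_for_cuda_mixed_gemm", passthrough_prepack, create=True
    ):
        return pack_for_qmoe(weights)
```

Because the stub leaves the bytes untouched, assertions in such tests can cover graph structure, shapes, and dtypes, but not the CUTLASS layout itself.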

Copilot AI review requested due to automatic review settings June 1, 2026 23:42
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a new ONNX graph-rewrite pass to quantize fused MoE blocks exported as com.microsoft::MoE by rewriting them into com.microsoft::QMoE, enabling int4/int8 weight-only quantization of per-expert FC weights using ONNX Runtime’s quantization + CUTLASS prepack helpers.

Changes:

  • Added OnnxMoEQuantization pass to convert MoE → QMoE, quantize per-expert FC1/FC2 weights, and register packed uint8 weights + fp16 scales.
  • Added unit tests covering successful conversion, block-wise scales, skip behavior when weights are dynamic, and config validation.
  • Registered the new pass in olive_config.json with CUDA EP support and int4/int8 precision metadata.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| olive/passes/onnx/moe_quantization.py | Implements the MoE→QMoE rewrite and per-expert quantization/packing pipeline. |
| test/passes/onnx/test_moe_quantization.py | Adds targeted unit tests for the new pass (conversion, blockwise scales, skips, config validation). |
| olive/olive_config.json | Registers the new pass and its supported providers/accelerators/precisions. |

Comment thread olive/passes/onnx/moe_quantization.py
Comment thread olive/passes/onnx/moe_quantization.py Outdated
justinchuby and others added 2 commits June 2, 2026 00:10
Adds a new ONNX graph pass that rewrites every com.microsoft::MoE node
into a com.microsoft::QMoE node with the per-expert FC1/FC2 weight
initializers quantized to symmetric int4 (default) or int8, plus
corresponding fp16 scale initializers.

Motivation: mobius (and similar exporters) emit the fused
com.microsoft::MoE op with the per-expert weights as 3-D fp16/bf16/fp32
initializers. The existing weight-quantization passes
(OnnxKQuantQuantization, OnnxBlockWiseRtnQuantization, OnnxBnb4Quantization)
only target MatMul nodes, so for MoE models the per-expert weights
(~80% of total parameters) stay at the model's compute dtype, leaving
just ~6% size reduction after quantization. The QMoE op is the correct
target for MoE weights and is supported by the CUDA + experimental CPU
kernels in ORT main (PR microsoft/onnxruntime#28467).

Implementation:

- Walks the graph and finds every com.microsoft::MoE node whose
  fc1_experts_weights and fc2_experts_weights are 3-D static initializers.
- For each expert, calls ORT's pybind quantize_matmul_{4,8}bits to
  produce per-expert int4/int8 weights + symmetric fp16 scales, then
  CUTLASS-prepacks them via pack_weights_for_cuda_mixed_gemm so the
  QMoE kernels can consume the bytes directly.
- Stacks per-expert tensors along axis 0 and registers them as new
  initializers (uint8 weight + fp16 scale per expert).
- Replaces the MoE node with a QMoE node carrying the original
  activation/routing attributes plus expert_weight_bits, optional
  block_size, and quant_type='int'.
- Orphaned fp16 weight initializers are dropped.

Supports per-row scales (block_size=0, default) and block-wise scales
(block_size ≥ 16, must be power of two). Nodes can be selectively
excluded via nodes_to_exclude.

The pass requires a CUDA-enabled ONNX Runtime build because
pack_weights_for_cuda_mixed_gemm is only exposed when ORT is compiled
with USE_CUDA. A descriptive RuntimeError is raised at run time when
the binding is unavailable.

Limitations / out-of-scope:

- fc3 inputs (3-fold MoE variants) are not supported and trigger a
  warning-skip per node.
- Only symmetric int quantization (matching the kernel's preferred
  layout). FP4 / FP8 / WFP4AFP8 quant_types are left for a follow-up.
- Calibration-aware quantization (GPTQ / AWQ) is out of scope; this
  pass is pure RTN.

Tests: 5 unit tests covering (a) end-to-end MoE → QMoE conversion
with int4 + per-row scales, (b) block-wise int4, (c) graceful skip
when weights are not static initializers, (d) bits validation, and
(e) block_size validation. The CUTLASS prepack helper is patched
during tests so CI without onnxruntime-gpu can still exercise the
graph transform.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <11205048+justinchuby@users.noreply.github.com>
- _maybe_input now treats an ir.Value with empty name as missing, not
  present. ONNX represents unset optional inputs as empty-string slots;
  while onnx_ir typically normalises those to None, defensively handle
  the ir.Value(name='') case too so the fc3 reject path doesn't fire on
  MoE nodes that include empty placeholder slots for fc1_bias / fc2_bias.

- Validate N % pack_factor == 0 and block_size % pack_factor == 0 in
  _quantize_one_expert. These were latent failure modes where the
  CUTLASS prepack helper would either crash or produce a wrong layout;
  now we emit a clear _UnsupportedMoEError and the MoE node is skipped
  with a warning instead of being silently corrupted.

- Add a comment in _quantize_one_expert explaining that the 2-D scale /
  zero_point shapes match the upstream ORT test harness
  (test_qmoe_cuda.py::quant_dequant_blockwise) — pybind11's buffer
  protocol accepts any contiguous shape as long as the element count
  matches, so this isn't a regression vs the 1-D layout used in
  rtn_quantization.py.
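
The optional-input handling described in the first item above can be sketched as follows. The `Value` dataclass is a stand-in for `ir.Value`, and `maybe_input` is a hypothetical free-standing version of `_maybe_input`; the actual helper's signature is not shown here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Value:
    """Minimal stand-in for ir.Value; only the name matters for this check."""
    name: Optional[str]


def maybe_input(inputs: Sequence[Optional[Value]], index: int) -> Optional[Value]:
    """Return the input at `index`, treating absent or unnamed slots as missing."""
    if index >= len(inputs):
        return None  # the slot was never supplied
    value = inputs[index]
    if value is None or not value.name:
        return None  # None or an empty-string name marks an unset optional input
    return value
```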

Two new unit tests cover the changes:

- test_moe_to_qmoe_handles_explicit_empty_optional_inputs: appends
  empty-string fc2_bias / fc3_W / fc3_bias slots to the MoE node and
  asserts the pass still converts it (fc3 reject path doesn't trigger).
- test_n_not_divisible_by_pack_factor_skipped: builds an MoE node with
  N=3 (odd) and asserts the conversion is skipped with a clean warning
  rather than crashing.
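
The divisibility checks exercised by the second test can be sketched like this. It assumes `pack_factor` means the number of quantized values stored per byte (`8 // bits`), which is consistent with an odd N being rejected for int4; the exception class and function name are hypothetical stand-ins for `_UnsupportedMoEError` and the checks inside `_quantize_one_expert`.

```python
class UnsupportedMoEError(ValueError):
    """Raised when an MoE node cannot be converted and should be skipped."""


def check_packable(n: int, block_size: int, bits: int) -> None:
    """Reject shapes that cannot be packed into whole bytes."""
    pack_factor = 8 // bits  # assumed: 2 values per byte for int4, 1 for int8
    if n % pack_factor != 0:
        raise UnsupportedMoEError(
            f"N={n} is not divisible by pack_factor={pack_factor}"
        )
    if block_size and block_size % pack_factor != 0:
        raise UnsupportedMoEError(
            f"block_size={block_size} is not divisible by pack_factor={pack_factor}"
        )
```

A caller would catch `UnsupportedMoEError`, log a warning, and leave the original MoE node in place.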

All 7 tests pass, lintrunner clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <11205048+justinchuby@users.noreply.github.com>
@justinchuby justinchuby force-pushed the moe-to-qmoe-conversion branch from f056fa8 to 289d56c Compare June 2, 2026 00:11
@justinchuby justinchuby marked this pull request as draft June 2, 2026 00:13
- Move `_convert_moe_to_qmoe`, `_convert_single_moe`, and the
  `_drop_unused_initializers` helper from the `OnnxMoEQuantization`
  class into module-level private functions, per Google's Python style
  guide preference for free functions over class methods when no class
  state is involved. The `OnnxMoEQuantization` class now only owns
  config defaulting and the `_run_for_config` entry point.
- Replace the hand-rolled orphan-initializer sweep with
  `onnx_ir.passes.common.RemoveUnusedNodesPass`, which also handles
  dead-node removal and keeps the cleanup consistent with the rest of
  the IR pass ecosystem.

No behaviour change: all 7 unit tests pass, lintrunner clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <11205048+justinchuby@users.noreply.github.com>
@justinchuby
Contributor Author

Follow-ups filed:

@justinchuby
Contributor Author

Sent the proposed ORT-side fix as microsoft/onnxruntime#28749. Once that lands and ships in a release, this pass can drop the pack_weights_for_cuda_mixed_gemm dependency entirely — the Olive pass would only need quantize_matmul_{4,8}bits (which is in every ORT build, CPU or CUDA).