
[NPUW][MoE Scaling]Support MoE expert layout variants and add FoldShapeComputeChain pass#36184

Open
intelgaoxiong wants to merge 1 commit into openvinotoolkit:master from intelgaoxiong:xiong/moe_new

intelgaoxiong (Contributor) commented Jun 2, 2026

Details:

MoE models exported with different opset versions produce expert output tensors with the singleton dimension in different positions:

  • Layout A: [num_experts, 1, token_num, hidden]
  • Layout B: [num_experts, token_num, 1, hidden]

Previously only Layout A was handled. This PR makes NPUW MoE inference layout-agnostic and adds a constant-folding pass to unblock pattern matching on static-shape graphs.

Validation job: https://cje-ir-prod01.devtools.intel.com/sai-npu-experience/job/Staging/job/ding/job/Validate/29/Validation_20report/

Changes

New: FoldShapeComputeChain pass (fold_const.hpp/cpp)

Four MatcherPass classes (FoldShapeOf, FoldGatherOfConst, FoldUnsqueezeOfConst, FoldConcatOfConsts) plus a ModelPass wrapper that runs the full pipeline in one call, which makes partitioning easier.
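The folding idea can be illustrated with a toy model that does not use OpenVINO types at all. This is a minimal sketch, assuming the chain is ShapeOf -> Gather -> Unsqueeze -> Concat on a tensor whose shape is fully static; the function names (`fold_shape_of`, `fold_gather`, `fold_unsqueeze`, `fold_concat`, `fold_chain`) and the example chain are hypothetical and are not taken from fold_const.hpp/cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a 1-D i64 constant; NOT an OpenVINO type.
using I64Vec = std::vector<int64_t>;

// ShapeOf on a static-shape tensor: the result is the shape itself.
I64Vec fold_shape_of(const I64Vec& static_shape) {
    return static_shape;
}

// Gather(constant, scalar index) along axis 0; negative indices count from the end.
int64_t fold_gather(const I64Vec& data, int64_t index) {
    const auto n = static_cast<int64_t>(data.size());
    if (index < 0) index += n;
    assert(index >= 0 && index < n);
    return data[static_cast<std::size_t>(index)];
}

// Unsqueeze(scalar constant): a 1-element 1-D constant.
I64Vec fold_unsqueeze(int64_t scalar) {
    return I64Vec{scalar};
}

// Concat of 1-D constants along axis 0.
I64Vec fold_concat(const std::vector<I64Vec>& parts) {
    I64Vec out;
    for (const auto& p : parts) out.insert(out.end(), p.begin(), p.end());
    return out;
}

// Hypothetical chain: Concat(Unsqueeze(Gather(ShapeOf(x), 0)), [-1]).
// With a static input shape every step is a constant, so the whole chain
// collapses to one constant that a pattern matcher can see directly.
I64Vec fold_chain(const I64Vec& input_shape) {
    const I64Vec shape = fold_shape_of(input_shape);
    const int64_t lead = fold_gather(shape, 0);
    const I64Vec lead_1d = fold_unsqueeze(lead);
    return fold_concat({lead_1d, I64Vec{-1}});
}
```

For an input of shape [4, 1, 8], `fold_chain` yields the constant [4, -1]; each step corresponds to one of the four matcher passes listed above.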

moe.cpp

  • GPTOSSRouter: removes ShapeOf/topk_convert from the formal pattern (both are resolved before matching) and uses any_input() for all Slice shape inputs.
  • GPTOSSExpert: decoding/prefill detection scans the middle dims (those between the expert dim and the hidden dim) instead of assuming the token dimension sits at a fixed index, so both layout variants are accepted.
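The middle-dim scan for GPTOSSExpert can be sketched as follows. This is a minimal illustration, assuming that "decoding" means a token count of 1 (all middle dims are singletons) and "prefill" means some middle dim is greater than 1; `ExpertMode`, `middle_token_count`, and `detect_mode` are hypothetical names, not the identifiers used in moe.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class ExpertMode { Decoding, Prefill };

// Returns the first non-singleton value among the middle dims
// (indices 1 .. rank-2), or 1 if every middle dim is a singleton.
int64_t middle_token_count(const std::vector<int64_t>& shape) {
    assert(shape.size() >= 3);  // need at least one middle dim
    for (std::size_t i = 1; i + 1 < shape.size(); ++i) {
        if (shape[i] != 1) return shape[i];
    }
    return 1;
}

// Assumption: one token per step means decoding, anything larger means prefill.
ExpertMode detect_mode(const std::vector<int64_t>& shape) {
    return middle_token_count(shape) == 1 ? ExpertMode::Decoding
                                          : ExpertMode::Prefill;
}
```

Because the scan looks at values rather than positions, [num_experts, 1, token_num, hidden] and [num_experts, token_num, 1, hidden] give the same answer.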

moe_transformation.cpp

Replaces update_reshape_constant_dimension (fixed negative index) with update_reshape_dimensions (range-based scan over middle dims), correctly handling both 3-D and 4-D reshape patterns for both layout variants.

moe_infer_utils.cpp / moe_resources.cpp

  • Extracts get_router_token_count(router_shape) helper to unify the two-layout token-dim detection in parse_selected_experts_from_router and gather_router_scores.
  • Accumulator buffer shape derivation now explicitly excludes the last dimension (hidden dim) from chunk-size substitution, preventing silent shape corruption when chunk_size == embed_dim.
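The shape-corruption case in the last bullet can be reproduced with a small sketch. It assumes the accumulator shape is derived by taking a template shape and replacing the dimension equal to `chunk_size` with a target token count; the helper names and the direction of the substitution are assumptions for illustration, not the code in moe_resources.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Shape = std::vector<int64_t>;

// Naive variant: replaces every dim equal to chunk_size, including the
// hidden dim when chunk_size happens to equal embed_dim.
Shape substitute_all(Shape shape, int64_t chunk_size, int64_t token_count) {
    for (auto& d : shape) {
        if (d == chunk_size) d = token_count;
    }
    return shape;
}

// Restricted variant: the last (hidden) dim is never considered.
Shape substitute_except_last(Shape shape, int64_t chunk_size, int64_t token_count) {
    assert(!shape.empty());
    for (std::size_t i = 0; i + 1 < shape.size(); ++i) {
        if (shape[i] == chunk_size) shape[i] = token_count;
    }
    return shape;
}
```

With a template of [4, 1, 64, 64], `chunk_size` 64 and a target of 256 tokens, the naive variant produces [4, 1, 256, 256], while the restricted variant produces [4, 1, 256, 64] and leaves the hidden dim intact.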

Tests

fold_const_test.cpp: three GTest cases verify FoldShapeComputeChain on a graph mirroring the actual router subgraph.

Tickets:

AI Assistance:

  • AI assistance used: no / yes
  • If yes, summarize how AI was used and what human validation was performed (build/tests/manual checks).

github-actions bot added the labels category: build (OpenVINO cmake script / infra), category: NPU (OpenVINO NPU plugin), and category: NPUW (NPUW plugin) on Jun 2, 2026
…ze to constant.

Support Qwen/GPT-OSS MoE layout differences throughout inference pipeline

GPT-OSS and Qwen use different 4-D tensor layouts for MoE expert output:
  GPT-OSS: [N, 1, T, H]  (singleton at dim 1)
  Qwen:    [N, T, 1, H]  (singleton at dim 2)
Both have identical flat memory strides; only shape metadata differs.
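The claim about identical flat memory can be checked with a short sketch, assuming dense row-major storage. `flat_offset`, `offset_gpt_oss`, and `offset_qwen` are hypothetical helpers written for this illustration only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Dims4 = std::array<std::size_t, 4>;

// Row-major flat offset of a 4-D index inside a dense tensor.
std::size_t flat_offset(const Dims4& shape, const Dims4& idx) {
    std::size_t off = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        assert(idx[i] < shape[i]);
        off = off * shape[i] + idx[i];
    }
    return off;
}

// Element (expert n, token t, hidden h) in the [N, 1, T, H] layout.
std::size_t offset_gpt_oss(std::size_t N, std::size_t T, std::size_t H,
                           std::size_t n, std::size_t t, std::size_t h) {
    return flat_offset(Dims4{N, 1, T, H}, Dims4{n, 0, t, h});
}

// The same element in the [N, T, 1, H] layout.
std::size_t offset_qwen(std::size_t N, std::size_t T, std::size_t H,
                        std::size_t n, std::size_t t, std::size_t h) {
    return flat_offset(Dims4{N, T, 1, H}, Dims4{n, t, 0, h});
}
```

Both functions reduce to (n * T + t) * H + h, which is why a buffer written under one layout can be read under the other without copying.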

Changes:
- moe_transformation.cpp: fix_token_count_for_expert_iterative now scans
  middle dims (1..n-2) by value instead of hardcoding second-to-last index,
  so both layouts are correctly patched for chunked prefill.
- moe_transformation.cpp: detect_and_transform_moe_downstream accepts both
  [N,1,H,W] and [N,H,1,W] parameter shapes for the ReduceSum pattern.
- moe_infer_utils.cpp: parse_selected_experts_from_router and
  gather_router_scores detect layout by checking which dim equals 1.
- moe_resources.cpp: expert_output_accumulator shape is derived from the
  compiled model output shape template instead of hardcoded [K,1,T,H].

Solved layout issue for GPT-OSS.

Fixed link error.

Runs the full shape-compute-chain folding pipeline in a single pass.

Add FoldConstTest.

Refine code.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>
intelgaoxiong marked this pull request as ready for review June 3, 2026 05:22
intelgaoxiong requested review from a team as code owners June 3, 2026 05:22
intelgaoxiong changed the title from "Xiong/moe new" to "[NPUW]Support MoE expert layout variants and add FoldShapeComputeChain pass" Jun 3, 2026
intelgaoxiong changed the title from "[NPUW]Support MoE expert layout variants and add FoldShapeComputeChain pass" to "[NPUW][MoE Scaling]Support MoE expert layout variants and add FoldShapeComputeChain pass" Jun 3, 2026
