Extend moe_3gemm to all oneDNN aware GPUs by peterchen-intel · Pull Request #35335 · openvinotoolkit/openvino

peterchen-intel · 2026-04-14T11:34:23Z

Details:

This pull request enhances support for the compressed Mixture-of-Experts (MoE) fusion chain in the Intel GPU plugin, particularly improving compatibility on non-systolic (non-IMAD) hardware. The changes ensure that the MoE fusion passes and their dependencies are applied more broadly, and that the required oneDNN engine is enabled when using the new MOE3GemmFusedCompressed operation, even on devices that do not natively support systolic operations.

MoE Fusion Pipeline Improvements:

The compressed MoE fusion chain (including ConvertTiledMoeBlockToGatherMatmuls and ConvertGatherMatmulToGatherMatmulCompressed) is now run on all devices, not just those with systolic (IMAD) support, improving model compatibility and performance across a wider range of hardware.

oneDNN Integration for MoE:

The execution config now automatically enables use_onednn when the MOE3GemmFusedCompressed operation is detected, ensuring the oneDNN engine is initialized for models that require it, even on non-systolic hardware. This is necessary for correct operation and performance of the fused MoE kernels.

Dependency Updates:

Added the include for moe_3gemm_fused_compressed.hpp to the runtime configuration source, ensuring the new operation is recognized and available during execution.

Limitation

This path depends on oneDNN, so it is supported only on the platforms that oneDNN supports.
Long contexts and large chunk sizes are limited because the OCL mixed kernel requires a large buffer (e.g. a 32K input with chunk size = 4K exceeds the 4 GB buffer limit).

Tickets:

CVS-182696 (perf_check results), CVS-182695

AI Assistance:

AI assistance used: yes
If yes, summarize how AI was used and what human validation was performed (build/tests/manual checks). AI did the debugging, analysis, and fixing under human guidance.

causing it to be skipped on MTL-class iGPU (12.70.x, XeHPG, no DPAS). This left raw FP32 weight-decompression chains that overwhelmed propagate_constants with ~56 GB of constant-folding memory.

Root cause of inference failure: moe_3gemm_swiglu_opt uses oneDNN internally (onednn_linear for gate/up/down matrix multiplications). oneDNN requires an in-order OCL queue. MTL uses an out-of-order queue by default because use_onednn is false when supports_immad=false.

Fix:
- Three MoE transformation passes (FuseVectorizedMOE3GEMM, ConvertMOEToMOECompressed, FuseMOE3GemmCompressed) run on all architectures. FuseMOE3GemmCompressed creates MOE3GemmFusedCompressed, which the OCL moe_3gemm_swiglu_opt kernel executes.
- Detect MOE3GemmFusedCompressed in apply_model_specific_options and force use_onednn=true so finalize_impl sets queue_type=in_order, satisfying the oneDNN in-order queue requirement.
- Fix moe_gather validate_impl to accept rank-2 input for models where the batch dimension is pre-flattened (Qwen3-style).
- Re-apply iGPU transfer skip (usm_shared -> usm_device) in network.cpp and program.cpp for integrated GPUs where both allocation types share system DRAM (xe2+ or 12.7x-class MTL/ARL-S).

Tested on machine (GPU uArch 12.70.4 / XeHPG / System memory 64 GB): model loads in 14 s, generates meaningful tokens, Unevictable stays below 120 MB.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Chen Peter <peter.chen@intel.com>

Instead of setting use_onednn=true when MOE3GemmFusedCompressed is detected, set m_queue_type=in_order directly. This is more precise: the only requirement is an in-order OCL command queue (for onednn_linear in moe_3gemm_swiglu_opt.cpp), not full oneDNN enablement for the whole model. Leaving use_onednn=false on non-systolic hardware (MTL, 12.70.x) ensures that oneDNN implementations for FC, convolution, GEMM etc. are not activated on hardware without DPAS units. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The MTL-class (12.70.x) iGPU has a separate GPU L3 cache from the CPU, so copying usm_shared -> usm_device does improve GPU access performance. Reverts the MTL condition added in the prior fix commit, keeping only the original xe2+ integrated GPU skip (which has true unified memory). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

In the correct fix, GEMM3_SWIGLU (Qwen3) always goes through FuseMOE3GemmCompressed -> MOE3GemmFusedCompressed, which creates a single fused primitive with no standalone moe_gather node. The rank-2 accept was only needed during an intermediate broken debug state where FuseMOE3GemmCompressed was wrongly blocked. moe_gather is only used by GEMM2_BIAS_SWIGLU_CLAMP models, whose input is rank-3. Restore original: input_pshapes.rank() != 3 || input_pshapes[2].is_dynamic() Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The previous fix set m_queue_type = in_order directly in apply_model_specific_options, but this left m_use_onednn = false. On non-systolic hardware (supports_immad=false), program.cpp only calls lo.enable_onednn_for<lstm_seq/gru_seq>() (making the onednn_impls_optimization_attribute non-empty, which triggers create_onednn_engine() in select_preferred_formats.cpp) when use_onednn=true. With use_onednn=false, the engine is never initialized, causing moe_3gemm_fused_compressed to crash at inference time with 'oneDNN engine not initialized'.

Fix: set m_use_onednn = true (not queue_type) when MOE3GemmFusedCompressed is detected. finalize_impl then sets queue_type = in_order because use_onednn=true, and the create_onednn_engine() call is correctly triggered.

This is safe on non-systolic hardware: FuseVectorizedFC (systolic FC) is gated independently on supports_immad, so no systolic ops are introduced by enabling use_onednn for the MoE path.

Verified: all 3 prompts pass with correct output on MTL iGPU (GPU_UARCH_VERSION=12.70.4, supports_immad=false).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
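The difference between the two approaches can be sketched with a toy model. Every name below (`ToyConfig`, `ToyProgram`, `Strategy`, `run_scenario`, and the simplified `finalize` and `build_program`) is a hypothetical stand-in rather than the plugin's real code. The sketch assumes only what the commit message states: finalization selects an in-order queue when `use_onednn` is true, and the oneDNN engine is created during program build only when `use_onednn` is true.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; not the intel_gpu plugin API.
enum class QueueType { out_of_order, in_order };
enum class Strategy { set_queue_only, set_use_onednn };

struct ToyConfig {
    bool use_onednn = false;                         // default on non-systolic hardware
    QueueType queue_type = QueueType::out_of_order;  // default queue in that case
};

struct ToyProgram {
    bool onednn_engine_initialized = false;
};

// Model-specific step: react to a fused MoE op using one of the two strategies.
inline void apply_model_specific_options(ToyConfig& cfg, bool has_fused_moe, Strategy s) {
    if (!has_fused_moe) return;
    if (s == Strategy::set_queue_only) {
        cfg.queue_type = QueueType::in_order;  // earlier attempt: queue only
    } else {
        cfg.use_onednn = true;                 // final fix: enable oneDNN
    }
}

// Finalization: use_onednn implies an in-order queue.
inline void finalize(ToyConfig& cfg) {
    if (cfg.use_onednn) cfg.queue_type = QueueType::in_order;
}

// Program build: the engine is created only when use_onednn is set.
inline ToyProgram build_program(const ToyConfig& cfg) {
    ToyProgram p;
    p.onednn_engine_initialized = cfg.use_onednn;
    return p;
}

// The fused MoE kernel needs both an in-order queue and an initialized engine.
inline bool run_scenario(Strategy s) {
    ToyConfig cfg;
    apply_model_specific_options(cfg, /*has_fused_moe=*/true, s);
    finalize(cfg);
    const ToyProgram p = build_program(cfg);
    return cfg.queue_type == QueueType::in_order && p.onednn_engine_initialized;
}
```

In this model, `run_scenario(Strategy::set_queue_only)` returns false because the engine is never created, while `run_scenario(Strategy::set_use_onednn)` returns true.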

Copilot

Pull request overview

Extends the Intel GPU plugin’s MoE GEMM3 (Qwen3-style GEMM3_SWIGLU) conversion/execution path so it is no longer limited to systolic-array devices, ensuring the MoE graph is structurally converted early and executed via the fused OpenCL kernel path.

Changes:

Always registers/runs ov::pass::FuseVectorizedMOE3GEMM (removes the non-systolic skip), enabling the downstream MOE → MOECompressed → MOE3GemmFusedCompressed pipeline across architectures.
Ensures use_onednn=true is enabled when MOE3GemmFusedCompressed is present so the GPU plugin uses an in-order queue and initializes oneDNN engine required by the fused MoE kernel.
Makes MOECompressed(GEMM3_SWIGLU) a hard error at program build time if it reaches primitive creation, catching missing fusion/pipeline misconfiguration early.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `src/plugins/intel_gpu/src/runtime/execution_config.cpp` | Enables `use_onednn` when `MOE3GemmFusedCompressed` is detected to force in-order queue + oneDNN engine init. |
| `src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp` | Removes architecture gating so `FuseVectorizedMOE3GEMM` runs unconditionally. |
| `src/plugins/intel_gpu/src/plugin/ops/moe.cpp` | Throws if `MOECompressed(GEMM3_SWIGLU)` reaches program build, enforcing the fused-kernel execution path. |
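The "hard error at program build" idea in `moe.cpp` can be illustrated with a minimal sketch. The types, the function name, the returned label, and the error text below are all hypothetical; the real plugin uses its own op classes and error macros. The only behavior taken from the review summary is that an unfused `MOECompressed(GEMM3_SWIGLU)` reaching primitive creation is treated as a pipeline misconfiguration and rejected.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins; not the real OpenVINO op or primitive types.
enum class MoeKind { GEMM3_SWIGLU, GEMM2_BIAS_SWIGLU_CLAMP };

struct ToyMoeCompressed {
    MoeKind kind;
};

// Called when an unfused MOECompressed node reaches primitive creation.
// GEMM3_SWIGLU should already have been fused, so reaching this point is an error.
inline std::string create_moe_primitive(const ToyMoeCompressed& op) {
    if (op.kind == MoeKind::GEMM3_SWIGLU) {
        throw std::runtime_error(
            "MOECompressed(GEMM3_SWIGLU) reached program build; "
            "expected it to be fused into MOE3GemmFusedCompressed");
    }
    // Other MoE kinds continue on their unfused path (label is illustrative).
    return "unfused";
}
```

Failing at build time rather than at inference time means a missing fusion pass shows up as a clear message when the model is compiled.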

rkazants

Tests?

peterchen-intel · 2026-04-19T05:49:53Z

Tests?

No. This is a fix, but it still has an issue with multi-chunk inputs.

paged_attention_opt__multi_tokens allocates a tmp_out scratch buffer sized total_tokens * heads_num * v_head_size * num_of_partitions * sizeof(float). For Qwen3-30B with chunk_size=4096 and 8K KV context this is 2 GB per layer. With 48 layers all executing sequentially, this totalled 96 GB of demand-paged USM device allocation. On Intel iGPU (ARLS, i915 driver), the driver pins the entire allocation as Unevictable on first GPU access regardless of pages touched, causing CL_OUT_OF_RESOURCES on a 31 GB machine.

Root cause: can_share_internal_buffer(false) in paged_attention_node unconditionally blocked the memory pool for ALL internal buffers. This was added in PR openvinotoolkit#33204 to prevent CPU/GPU races on lockable buffers (blocks_indexes_start/end, blocked_gws_subseq_mapping) written by prepare_internal_buffers(). However, it also blocked pool reuse for non-lockable GPU-only buffers (exp_sums, max_logits, tmp_out), which are safe to share across sequential layers.

Fix:
- Remove can_share_internal_buffer(false) from paged_attention_node; per-buffer lockability is already tracked via BufferDescriptor::m_lockable, so CPU-written (lockable=true, usm_host) buffers remain non-shareable while GPU-only (lockable=false, usm_device) buffers can be reused from the pool.
- In allocate_internal_buffers(): pass buffer_descs[i].m_lockable to the call (previously dropped, causing wrong alloc type on initial allocation).

Result: 48 layers share one 2 GB tmp_out buffer instead of allocating 48 separate 2 GB buffers. Peak Unevictable drops from OOM crash (~28+ GB) to ~18.9 GB on ARLS (Intel Arc 8086:7d67, Arrow Lake-S iGPU, 31 GB).

Verified: Qwen3-30B-A3B-Instruct-2507-int4-ov with chunk_size=4096, 8K prompt, ContinuousBatching on ARLS completes successfully with exit code 0 and 20 coherent output tokens.

Not affected on ARLH (supports_immad=true takes the micro_sdpa path, which does not allocate tmp_out at all).
Signed-off-by: Chen Peter <peter.chen@intel.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
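The arithmetic and the effect of pool reuse can be checked with a small sketch. The size formula is the one quoted in the commit message. The concrete inputs (heads_num = 32, v_head_size = 128, and 32 partitions for an 8K context) are assumptions chosen because they reproduce the stated 2 GB figure; they are not taken from this PR. `ToyPool` and `layers_total_bytes` are hypothetical and much simpler than the plugin's memory pool.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// tmp_out size formula as quoted in the commit message.
inline std::uint64_t tmp_out_bytes(std::uint64_t total_tokens,
                                   std::uint64_t heads_num,
                                   std::uint64_t v_head_size,
                                   std::uint64_t num_of_partitions) {
    return total_tokens * heads_num * v_head_size * num_of_partitions * sizeof(float);
}

// Toy pool (hypothetical): shareable buffers are returned to a free list and
// reused by the next request of the same size; non-shareable ones stay held.
class ToyPool {
public:
    void acquire(std::uint64_t bytes, bool shareable) {
        auto it = free_.find(bytes);
        if (shareable && it != free_.end() && it->second > 0) {
            --it->second;  // reuse an existing buffer, no new allocation
            return;
        }
        total_allocated_ += bytes;  // fresh allocation
    }
    void release(std::uint64_t bytes, bool shareable) {
        if (shareable) ++free_[bytes];
    }
    std::uint64_t total_allocated() const { return total_allocated_; }

private:
    std::map<std::uint64_t, int> free_;
    std::uint64_t total_allocated_ = 0;
};

// Runs `layers` sequential layers that each need one scratch buffer and
// returns the total number of bytes ever allocated.
inline std::uint64_t layers_total_bytes(int layers, std::uint64_t bytes, bool shareable) {
    ToyPool pool;
    for (int i = 0; i < layers; ++i) {
        pool.acquire(bytes, shareable);
        pool.release(bytes, shareable);
    }
    return pool.total_allocated();
}
```

Under these assumptions `tmp_out_bytes(4096, 32, 128, 32)` is 2 GiB. With sharing disabled, 48 layers allocate 48 such buffers (96 GiB); with sharing enabled, a single buffer is allocated and reused.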

peterchen-intel · 2026-04-22T12:41:10Z

Since #34974 was merged today, I will test Qwen3 and GPT-OSS again. The results are in CVS-182696.

ceciliapeng2011 · 2026-04-24T01:36:37Z

@e-ddykim related to #35479

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Co-authored-by: Chen Peter <peter.chen@intel.com>

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

+        // moe_3gemm_fused_compressed uses oneDNN internally for matrix multiplications
+        // (onednn_linear wrappers in moe_3gemm_swiglu_opt.cpp), which requires:
+        //   1. use_onednn=true so create_onednn_engine() is called during program build
+        //      (see program.cpp: lo.enable_onednn_for<lstm_seq/gru_seq> path which makes
+        //       onednn_impls_optimization_attribute non-empty, triggering engine init).
+        //   2. in-order OCL command queue (finalize_impl sets this when use_onednn=true).
+        // Auto-enable this only on architectures with oneDNN support, consistent with
+        // the LSTM/GRU path above, to avoid initializing oneDNN on unsupported devices.
+        if (ov::is_type<ov::intel_gpu::op::MOE3GemmFusedCompressed>(op) &&
+            info.arch >= cldnn::gpu_arch::xe_lp) {
+            m_use_onednn = true;
+        }
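The detection logic in the diff above can be exercised in isolation with a mock. The enum below is only an illustrative ordering in which `xe_lp` and newer architectures compare greater than older ones; the real `cldnn::gpu_arch` enum may have different members. Ops are represented as plain type-name strings instead of `ov::Node` objects, and `should_enable_onednn` is a hypothetical helper name.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative ordering only; not the actual cldnn::gpu_arch definition.
enum class ToyArch { unknown = 0, gen9, gen11, xe_lp, xe_hpg, xe2 };

struct ToyDeviceInfo {
    ToyArch arch;
};

// Returns true when the model contains the fused MoE op and the device
// architecture is one that this sketch treats as oneDNN-aware (xe_lp or newer).
inline bool should_enable_onednn(const std::vector<std::string>& op_types,
                                 const ToyDeviceInfo& info) {
    for (const auto& type : op_types) {
        if (type == "MOE3GemmFusedCompressed" && info.arch >= ToyArch::xe_lp) {
            return true;
        }
    }
    return false;
}
```

For a model containing `MOE3GemmFusedCompressed`, this returns true on `ToyArch::xe_hpg` and false on `ToyArch::gen9`; a model without the op never triggers it.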


peterchen-intel and others added 4 commits April 13, 2026 06:54

peterchen-intel requested review from a team as code owners April 14, 2026 11:34

github-actions Bot added the category: GPU OpenVINO GPU plugin label Apr 14, 2026

peterchen-intel requested a review from riverlijunjie April 14, 2026 11:34

peterchen-intel assigned e-ddykim Apr 14, 2026

peterchen-intel added the under_perf_check label Apr 14, 2026

peterchen-intel requested a review from Copilot April 15, 2026 00:25

Copilot started reviewing on behalf of peterchen-intel April 15, 2026 00:26

Copilot AI reviewed Apr 15, 2026

rkazants reviewed Apr 16, 2026

Merge branch 'master' into oom/fixing

8a030fd

peterchen-intel added the do_not_merge label Apr 19, 2026

peterchen-intel force-pushed the oom/fixing branch from 8575a7d to 1498dd1 Compare April 21, 2026 03:28

Merge branch 'master' into oom/fixing

2004557

peterchen-intel removed under_perf_check do_not_merge labels Apr 22, 2026

Merge branch 'master' into oom/fixing

c9eb5c6

peterchen-intel added the under_perf_check label Apr 22, 2026

peterchen-intel changed the title ~~Extend moe_3gemm to all Ultra series iGPU~~ Extend moe_3gemm to all GPUs and reduce internal buffer holding in paged_attention_opt Apr 22, 2026

riverlijunjie reviewed Apr 24, 2026

Comment thread src/plugins/intel_gpu/src/graph/paged_attention.cpp Outdated

Comment thread src/plugins/intel_gpu/src/graph/primitive_inst.cpp Outdated

Comment thread src/plugins/intel_gpu/src/graph/primitive_inst.cpp Outdated

Merge branch 'master' into oom/fixing

046c575

peterchen-intel force-pushed the oom/fixing branch from ef2519b to 2ee0c08 Compare May 22, 2026 09:36

Merge branch 'master' into oom/fixing

1284765

peterchen-intel changed the title ~~Extend moe_3gemm to all GPUs and reduce internal buffer holding in paged_attention_opt~~ Extend moe_3gemm to all Intel GPUs May 22, 2026

peterchen-intel requested review from Copilot and riverlijunjie May 22, 2026 14:58

Copilot started reviewing on behalf of peterchen-intel May 22, 2026 14:59

peterchen-intel requested a review from rkazants May 22, 2026 14:59

Copilot AI reviewed May 22, 2026

Support oneDNN known gpu_arch only

f8d0cc3

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

peterchen-intel requested a review from Copilot June 2, 2026 05:28

Copilot started reviewing on behalf of peterchen-intel June 2, 2026 05:28

Copilot AI reviewed Jun 2, 2026

Comment thread src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp Outdated

Comment thread src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp Outdated

device_info.arch >= cldnn::gpu_arch::xe_lp

3875448

peterchen-intel changed the title ~~Extend moe_3gemm to all Intel GPUs~~ Extend moe_3gemm to all oneDNN known GPUs Jun 2, 2026

peterchen-intel changed the title ~~Extend moe_3gemm to all oneDNN known GPUs~~ Extend moe_3gemm to all oneDNN aware GPUs Jun 2, 2026

peterchen-intel requested a review from Copilot June 2, 2026 08:45

Copilot started reviewing on behalf of peterchen-intel June 2, 2026 08:46

Merge branch 'master' into oom/fixing

ae72a35

Copilot AI reviewed Jun 2, 2026

Comment thread src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp Outdated

Comment thread src/plugins/intel_gpu/src/runtime/execution_config.cpp

peterchen-intel commented Jun 2, 2026

Comment thread src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp

peterchen-intel commented Jun 2, 2026

Comment thread src/plugins/intel_gpu/src/runtime/execution_config.cpp

peterchen-intel commented Jun 2, 2026

Comment thread src/plugins/intel_gpu/src/runtime/execution_config.cpp

For ENABLE_ONEDNN_FOR_GPU only

fa26fd1

Co-authored-by: Chen Peter <peter.chen@intel.com>

peterchen-intel commented Jun 2, 2026

Comment thread src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp Outdated

peterchen-intel added 2 commits June 2, 2026 20:15

Remove duplicate

4ce2b9a

Remove duplicate

dfd234c

peterchen-intel requested a review from Copilot June 2, 2026 12:18

Copilot started reviewing on behalf of peterchen-intel June 2, 2026 12:18

Copilot AI reviewed Jun 2, 2026

Define disable_moe_opt only if ENABLE_ONEDNN_FOR_GPU

8edab88
