Open
23 commits
a7fa364
intel_gpu: fix Qwen3 MoE GEMM3_SWIGLU on MTL-class (non-systolic) iGPU
peterchen-intel Apr 12, 2026
2bcc00a
intel_gpu: scope MOE3GemmFusedCompressed queue fix to in_order only
peterchen-intel Apr 14, 2026
d442d39
intel_gpu: revert skip_transfer_on_igpu extension to MTL
peterchen-intel Apr 14, 2026
4a0a504
intel_gpu: revert moe_gather.hpp rank-2 fix (not needed in final path)
peterchen-intel Apr 14, 2026
e1a7c83
intel_gpu: fix oneDNN engine init for MoE on non-systolic GPU
peterchen-intel Apr 14, 2026
8a030fd
Merge branch 'master' into oom/fixing
peterchen-intel Apr 19, 2026
1498dd1
intel_gpu: fix unnecessary tmp_out buffer per-layer in paged_attention
peterchen-intel Apr 21, 2026
2004557
Merge branch 'master' into oom/fixing
peterchen-intel Apr 21, 2026
c9eb5c6
Merge branch 'master' into oom/fixing
peterchen-intel Apr 22, 2026
046c575
Merge branch 'master' into oom/fixing
peterchen-intel Apr 28, 2026
3ebb016
Merge branch 'master' into oom/fixing
peterchen-intel May 19, 2026
990c0bc
Roll back the mistaken change
peterchen-intel May 20, 2026
39ba7fc
Merge branch 'master' into oom/fixing
peterchen-intel May 20, 2026
2ee0c08
intel_gpu: enable compressed MoE fusion chain on non-systolic devices
peterchen-intel May 22, 2026
1284765
Merge branch 'master' into oom/fixing
peterchen-intel May 22, 2026
f8d0cc3
Support oneDNN known gpu_arch only
peterchen-intel Jun 2, 2026
3875448
device_info.arch >= cldnn::gpu_arch::xe_lp
peterchen-intel Jun 2, 2026
ae72a35
Merge branch 'master' into oom/fixing
peterchen-intel Jun 2, 2026
fa26fd1
For ENABLE_ONEDNN_FOR_GPU only
peterchen-intel Jun 2, 2026
4ce2b9a
Remove duplicate
peterchen-intel Jun 2, 2026
dfd234c
Remove duplicate
peterchen-intel Jun 2, 2026
8edab88
Define disable_moe_opt only if ENABLE_ONEDNN_FOR_GPU
peterchen-intel Jun 3, 2026
fdfdc18
Merge branch 'master' into oom/fixing
peterchen-intel Jun 4, 2026
@@ -564,12 +564,13 @@ void TransformationsPipeline::apply(std::shared_ptr<ov::Model> func) {
return false;
}();

+#ifdef ENABLE_ONEDNN_FOR_GPU
const bool disable_moe_opt = GPU_DEBUG_VALUE_OR(config.get_disable_moe_opt(), false);

// MOE: TiledMoeBlock -> GatherMatmuls(compressed) -> MoeOp(compressed) -> MoeOpWithRouting(compressed).
-// Gated on supports_immad (systolic-only) and oneDNN (required for expert GEMM dispatch).
-// Note: even though we are already inside `if (supports_immad)`, oneDNN can still be explicitly disabled by the user.
-if (device_info.supports_immad && config.get_use_onednn()) {
+// Gated on oneDNN supports platforms.
+// Note: even though we are already inside `oneDNN supports platforms`, oneDNN can still be explicitly disabled by the user.
+if (device_info.arch >= cldnn::gpu_arch::xe_lp && config.get_use_onednn()) {
manager.register_pass<ov::pass::ConvertTiledMoeBlockToGatherMatmuls>();

// f32 listed because this pass runs before ConvertPrecision (line ~588);
@@ -591,6 +592,7 @@ void TransformationsPipeline::apply(std::shared_ptr<ov::Model> func) {
manager.register_pass<ov::intel_gpu::FuseMOE3GemmCompressed>();
}
}
+#endif // ENABLE_ONEDNN_FOR_GPU
manager.register_pass<ov::pass::GatedDeltaNetFusion>();
manager.register_pass<ov::pass::InitNodeInfo>();
manager.register_pass<EinsumDecomposition>();
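
The effect of the gate change in the hunk above can be sketched in isolation. This is a minimal illustration, not the plugin's code: `GpuArch` and `DeviceSketch` are hypothetical stand-ins for `cldnn::gpu_arch`, `device_info` and `config`, and the enumerator order is an assumption chosen only so that later architectures compare greater.

```cpp
#include <cstdint>

// Hypothetical stand-in for cldnn::gpu_arch. The enumerator order is an
// assumption: later architectures are given larger values.
enum class GpuArch : std::uint8_t { unknown = 0, gen9, gen11, xe_lp, xe_hpg, xe2 };

// Hypothetical stand-in for the device_info / config fields the gate reads.
struct DeviceSketch {
    GpuArch arch;
    bool supports_immad;  // true only on systolic hardware
    bool use_onednn;      // false when the user explicitly disabled oneDNN
};

// Gate before the change: systolic hardware with oneDNN enabled.
constexpr bool moe_gate_before(const DeviceSketch& d) {
    return d.supports_immad && d.use_onednn;
}

// Gate after the change: any architecture at or above xe_lp with oneDNN enabled.
constexpr bool moe_gate_after(const DeviceSketch& d) {
    return d.arch >= GpuArch::xe_lp && d.use_onednn;
}
```

Under these assumptions, a non-systolic device at or above `xe_lp` fails the old gate and passes the new one, while a device below `xe_lp`, or one with oneDNN disabled by the user, is rejected by the new gate.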
17 changes: 17 additions & 0 deletions src/plugins/intel_gpu/src/runtime/execution_config.cpp
@@ -7,6 +7,9 @@

#include "intel_gpu/op/indirect_sdpa.hpp"
#include "intel_gpu/op/kv_cache.hpp"
+#ifdef ENABLE_ONEDNN_FOR_GPU
+#include "intel_gpu/op/moe_3gemm_fused_compressed.hpp"
+#endif // ENABLE_ONEDNN_FOR_GPU
#include "intel_gpu/op/sdpa.hpp"
#include "intel_gpu/plugin/remote_context.hpp"
#include "intel_gpu/primitives/paged_attention.hpp"
@@ -217,13 +220,27 @@ void ExecutionConfig::apply_model_specific_options(const IRemoteContext* context
m_max_kernels_per_batch = 4;
}

+#ifdef ENABLE_ONEDNN_FOR_GPU
// Allow using onednn for models with LSTMSequence op as it's much more performant than existing ocl impl
// Onednn only support on Gen12 (XeLP) and later architectures
if ((ov::is_type<ov::op::v5::LSTMSequence>(op) || ov::is_type<ov::op::v5::GRUSequence>(op)) &&
info.arch >= cldnn::gpu_arch::xe_lp) {
m_use_onednn = true;
}

+// moe_3gemm_fused_compressed uses oneDNN internally for matrix multiplications
+// (onednn_linear wrappers in moe_3gemm_swiglu_opt.cpp), which requires:
+// 1. use_onednn=true so create_onednn_engine() is called during program build
+// (see program.cpp: lo.enable_onednn_for<lstm_seq/gru_seq> path which makes
+// onednn_impls_optimization_attribute non-empty, triggering engine init).
+// 2. in-order OCL command queue (finalize_impl sets this when use_onednn=true).
+// Auto-enable this only on architectures with oneDNN support, consistent with
+// the LSTM/GRU path above, to avoid initializing oneDNN on unsupported devices.
+if (ov::is_type<ov::intel_gpu::op::MOE3GemmFusedCompressed>(op) &&
+info.arch >= cldnn::gpu_arch::xe_lp) {
+m_use_onednn = true;
+}
+#endif //ENABLE_ONEDNN_FOR_GPU
if (auto multi_subgraph_op = ov::as_type_ptr<ov::op::util::MultiSubGraphOp>(op)) {
for (const auto& sub_graph : multi_subgraph_op->get_functions()) {
for (auto& sub_op : sub_graph->get_ops()) {
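
The auto-enable rule described in the comments of this hunk can be sketched as a standalone function. This is a simplified illustration under stated assumptions: `ArchSketch`, `OpKindSketch` and `auto_enable_onednn` are hypothetical names, the real code inspects `ov::Node` types with `ov::is_type` rather than an enum, and the enumerator order of `ArchSketch` is assumed.

```cpp
#include <vector>

// Hypothetical stand-in for cldnn::gpu_arch (ordering is an assumption).
enum class ArchSketch { unknown = 0, gen9, gen11, xe_lp, xe_hpg };

// Hypothetical tags for the op types the real code checks via ov::is_type.
enum class OpKindSketch { other, lstm_sequence, gru_sequence, moe_3gemm_fused_compressed };

// Returns the resulting use_onednn flag after scanning the model's ops.
// The flag is only ever switched on, never off, and only on xe_lp or later.
bool auto_enable_onednn(const std::vector<OpKindSketch>& ops, ArchSketch arch, bool use_onednn) {
    if (arch < ArchSketch::xe_lp) {
        return use_onednn;  // below xe_lp: leave the setting untouched
    }
    for (OpKindSketch op : ops) {
        if (op == OpKindSketch::lstm_sequence || op == OpKindSketch::gru_sequence ||
            op == OpKindSketch::moe_3gemm_fused_compressed) {
            return true;  // op depends on oneDNN, so force the flag on
        }
    }
    return use_onednn;
}
```

In this sketch a model containing the fused MoE op on an `xe_lp`-or-later device ends up with the flag set, while the same model on an older architecture keeps whatever value was configured.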