[OV MX][GPU] Fix MoE model cache load crashes #36152

Open
xzhan34 wants to merge 4 commits into openvinotoolkit:master from xzhan34:qwen3_omni_moe_dev

Conversation

@xzhan34 xzhan34 commented Jun 1, 2026

Details:

  • GPU: auto-enable oneDNN for MOE3GemmFusedCompressed ops - The MOE3GemmFusedCompressed kernel unconditionally calls get_onednn_stream(), but on non-IMMAD GPUs (e.g. Intel UHD 770) the queue is in out-of-order mode, which causes an assertion failure. Fix: detect MOE3GemmFusedCompressed ops during model analysis in apply_model_specific_options() and enable oneDNN (which switches to an in-order queue). This follows the same pattern as LSTMSequence/GRUSequence.
  • GPU: fix std::bad_function_call crash on compiled model cache load - When loading from the blob cache with shape changes, two issues caused crashes: (1) kv_cache load() did not restore the dispatch-data function of the zp_concat stage for compressed KV-cache models with zero-point inputs; (2) DispatchDataFunc::operator() called the inner std::function without a null check. Fix: restore zp_concat's update_dispatch_data_func in load() and add a null guard in DispatchDataFunc::operator().
  • GPU: serialize MoE prefill execution flags in compiled model cache - moe_3gemm_swiglu_opt_impl selects between three prefill execution paths based on boolean flags set from environment variables and hardware capabilities. These flags also determine internal buffer count (9 vs 15). The existing load() did not serialize these flags, causing mismatched execution paths after cache load. Fix: add save()/load() overrides for the three flags.
  • Unit tests - 12 new tests covering all three fixes: dispatch data null-safety, KV-cache serialization round-trip (with/without zp_concat), MoE prefill flags serialization, and oneDNN auto-enable for MOE3GemmFusedCompressed.

Tested on:

  • Intel UHD Graphics 770 (iGPU, non-IMMAD): previously crashed, now runs Qwen3-Omni-30B MoE successfully
  • Intel Arc A770 (dGPU, IMMAD): no regression

Tickets:

  • N/A

AI Assistance:

  • AI assistance used: yes
  • Claude was used to help draft commit messages, unit tests, and the PR description. All code changes were manually validated through builds and runtime testing on both iGPU and dGPU platforms.

@xzhan34 xzhan34 requested review from a team as code owners June 1, 2026 02:08
@github-actions github-actions Bot added the category: GPU OpenVINO GPU plugin label Jun 1, 2026
@wangleis wangleis changed the title from "[GPU] Fix MoE model cache load crashes and auto-enable oneDNN for MOE3GemmFusedCompressed" to "[OV MX][GPU] Fix MoE model cache load crashes and auto-enable oneDNN for MOE3GemmFusedCompressed" Jun 1, 2026
@wangleis wangleis requested a review from isanghao June 1, 2026 02:12
@wangleis wangleis requested a review from riverlijunjie June 1, 2026 02:12
Comment thread src/plugins/intel_gpu/src/runtime/execution_config.cpp Outdated
@xzhan34 xzhan34 force-pushed the qwen3_omni_moe_dev branch 3 times, most recently from 0e29ca9 to 65f9c40 Compare June 1, 2026 03:00
Comment thread src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp Outdated
@xzhan34 xzhan34 force-pushed the qwen3_omni_moe_dev branch 3 times, most recently from 75172ae to e33956a Compare June 1, 2026 08:19
Author

@xzhan34 xzhan34 left a comment

After testing, the code changes in both transformations_pipeline.cpp and execution_config.cpp turned out to be necessary; I have added more comments to explain them.

Comment thread src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp Outdated
@xzhan34 xzhan34 force-pushed the qwen3_omni_moe_dev branch 2 times, most recently from 9d0b1ee to 9b6b213 Compare June 1, 2026 09:46
@xzhan34 xzhan34 requested a review from e-ddykim June 1, 2026 09:52
@xzhan34 xzhan34 force-pushed the qwen3_omni_moe_dev branch 2 times, most recently from 343bf2a to c2f7833 Compare June 1, 2026 23:41
Comment thread src/plugins/intel_gpu/src/runtime/execution_config.cpp Outdated
Comment thread src/plugins/intel_gpu/tests/unit/module_tests/execution_config_test.cpp Outdated
@xzhan34 xzhan34 force-pushed the qwen3_omni_moe_dev branch 3 times, most recently from f21e11f to 84126d2 Compare June 2, 2026 05:44
@xzhan34 xzhan34 requested a review from e-ddykim June 2, 2026 05:46
Contributor

e-ddykim commented Jun 2, 2026

build_jenkins

@e-ddykim e-ddykim changed the title from "[OV MX][GPU] Fix MoE model cache load crashes and auto-enable oneDNN for MOE3GemmFusedCompressed" to "[OV MX][GPU] Fix MoE model cache load crashes" Jun 2, 2026
@e-ddykim e-ddykim requested a review from Copilot June 2, 2026 06:50
Contributor

Copilot AI left a comment

Pull request overview

This PR targets GPU plugin stability when using compiled model caching and MoE-related kernels, addressing two cache-load crash paths (kv_cache dispatch restoration and null-safe dispatch callbacks) and persisting MoE prefill execution-path flags across cache save/load. It also tightens MoE fusion pass gating to only run when oneDNN is enabled.

Changes:

  • Restore zp_concat stage dispatch-data function setup on kv_cache cache load (dynamic path) and make DispatchDataFunc::operator() null-safe.
  • Serialize/deserialize moe_3gemm_swiglu_opt_impl prefill execution flags to keep execution path and internal buffer counts consistent after cache load.
  • Add new unit tests covering dispatch-data null-safety and basic serialization round-trips for kv-cache stages / MoE flags.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/plugins/intel_gpu/tests/unit/test_cases/cache_serialization_test.cpp | Adds unit tests intended to cover cache serialization round-trips for kv-cache stage lists and MoE prefill flags. |
| src/plugins/intel_gpu/tests/unit/module_tests/dispatch_data_func_test.cpp | Adds regression tests for null-safe DispatchDataFunc invocation behavior. |
| src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp | Gates MoE fusion pipeline on supports_immad && use_onednn and adds explanatory comment. |
| src/plugins/intel_gpu/src/graph/impls/ocl/kv_cache.cpp | Restores zp_concat stage dispatch-data function initialization during cache load() for dynamic kv_cache. |
| src/plugins/intel_gpu/src/graph/impls/ocl_v2/moe/moe_3gemm_swiglu_opt.cpp | Persists three MoE prefill execution flags via save()/load(). |
| src/plugins/intel_gpu/src/graph/common_utils/kernel_generator_base.hpp | Adds a null guard in DispatchDataFunc::operator() to avoid std::bad_function_call. |

Comment thread src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp
Comment thread src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp Outdated
@xzhan34 xzhan34 force-pushed the qwen3_omni_moe_dev branch from 84126d2 to 8635ebd Compare June 2, 2026 12:00
@xzhan34 xzhan34 requested a review from Copilot June 2, 2026 12:02
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Comment thread src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp
xzhan34 added 4 commits June 3, 2026 10:55
The MOE3GemmFusedCompressed kernel dispatches expert GEMMs through oneDNN,
which requires an in-order OCL queue.  If oneDNN is disabled (e.g. via
OV_GPU_USE_ONEDNN=0 on an IMMAD GPU), the queue stays out-of-order and
the oneDNN stream creation asserts at ocl_stream.cpp:240.

Guard the FuseMOE3GemmCompressed transformation pass behind
config.get_use_onednn() so the op is never introduced into the graph
when oneDNN is unavailable.  The outer 'supports_immad' block already
prevents this on non-IMMAD GPUs; this inner guard handles the case
where oneDNN is explicitly disabled by the user on IMMAD hardware.

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>
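
The two-level guard described in the commit message above can be illustrated with a minimal sketch. The types `DeviceInfoSketch` and `ConfigSketch` and the function `should_fuse_moe3gemm` are hypothetical stand-ins invented for this example; they are not the OpenVINO GPU plugin types, and the real pass registration in transformations_pipeline.cpp may be structured differently.

```cpp
#include <cassert>

// Hypothetical stand-in for the device capabilities queried by the plugin.
struct DeviceInfoSketch {
    bool supports_immad;
};

// Hypothetical stand-in for the execution config (e.g. affected by OV_GPU_USE_ONEDNN).
struct ConfigSketch {
    bool use_onednn;
};

// Returns true only when the fused MoE op may safely be introduced into the graph.
// The hardware check models the outer guard; the oneDNN check models the inner
// guard that covers oneDNN being explicitly disabled on IMMAD hardware.
bool should_fuse_moe3gemm(const DeviceInfoSketch& dev, const ConfigSketch& cfg) {
    if (!dev.supports_immad) {
        return false;  // non-IMMAD GPU: the fused op is never introduced
    }
    return cfg.use_onednn;  // IMMAD GPU: fuse only if oneDNN is enabled
}
```

With this shape, an IMMAD device with oneDNN disabled yields `false`, so the op that needs the oneDNN stream would never be created in that configuration.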
When a GPU-compiled model is loaded from the blob cache and the first
inference uses a different input shape, the SHAPE_CHANGED flag triggers
update_dispatch_data_func() for every kernel stage.  Two issues caused
std::bad_function_call:

1. kv_cache: the load() deserializer restored dispatch-data functions
   for scatter_update, concat, beam_table, dq, and scale_concat stages
   but omitted the zp_concat stage.  On compressed KV-cache models with
   zero-point inputs (e.g. kv_int8), the first shape-changed inference
   called the null std::function and crashed.

   Fix: restore zp_concat's update_dispatch_data_func in load() using
   the same kernel-selector lookup pattern as scale_concat.

2. DispatchDataFunc::operator() (ocl_v2 framework) called the inner
   std::function without checking for null, even though the class
   explicitly supports construction from nullptr and KernelData
   defaults to nullptr.

   Fix: add a null guard so stages whose codegen legitimately returns
   no dispatch-data updater silently skip the call instead of crashing.

Reproducible with Qwen3-Omni-30B-A3B-Instruct on GPU with --cache-model:
compile on one prompt length, then infer with a different prompt length.

Signed-off-by: Xiaolin Zhang <xiaolin.zhang@intel.com>
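
The null guard from item 2 of the commit message above can be sketched as follows. This is a simplified illustration, not the actual class from kernel_generator_base.hpp: `DispatchDataFuncSketch`, `RuntimeParamsSketch`, and the `int&` output parameter are assumptions made for the example, and the real signature is likely different.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical, heavily simplified runtime parameters.
struct RuntimeParamsSketch {
    int shape_dim;
};

// Wrapper around std::function that may legitimately hold no callable.
class DispatchDataFuncSketch {
public:
    using Fn = std::function<void(const RuntimeParamsSketch&, int&)>;

    DispatchDataFuncSketch() = default;
    DispatchDataFuncSketch(std::nullptr_t) {}              // explicit "no updater" state
    DispatchDataFuncSketch(Fn fn) : m_fn(std::move(fn)) {}

    void operator()(const RuntimeParamsSketch& params, int& global_size) const {
        // Null guard: calling an empty std::function would throw
        // std::bad_function_call, so skip the call instead.
        if (!m_fn) {
            return;
        }
        m_fn(params, global_size);
    }

private:
    Fn m_fn;
};
```

A wrapper built from `nullptr` leaves `global_size` untouched when invoked, while a wrapper built from a lambda forwards the call as usual.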
moe_3gemm_swiglu_opt_impl selects between three prefill execution paths
(micro_gemm, grouped_gemm, per-expert onednn loop) based on three
boolean flags set in the constructor from environment variables and
hardware capabilities.  These flags also determine the number of
internal buffers (9 vs 15) via get_internal_buffer_descs().

The existing load() override did not serialize these flags.  After
loading from the compiled model cache the flags reverted to their
default-initialized values (all false), which could select a different
execution path and allocate a mismatched number of internal buffers
compared to the cached kernel stages.

Add a save() override that writes use_micro_gemm_prefill,
use_gpu_mask_gen_prefill, and use_grouped_gemm_prefill after the
base-class data.  Update load() to read them back, restoring the
exact execution configuration that was active when the model was
originally compiled.

Signed-off-by: Xiaolin Zhang <xiaolin.zhang@intel.com>
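
The save()/load() round-trip described in the commit message above can be sketched with standard streams. The struct `MoePrefillFlagsSketch` and its byte layout are assumptions for illustration only; the actual implementation uses the plugin's own binary buffer types and writes the flags after the base-class data, which this sketch does not model.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>

// Hypothetical holder for the three flags named in the commit message.
struct MoePrefillFlagsSketch {
    bool use_micro_gemm_prefill = false;
    bool use_gpu_mask_gen_prefill = false;
    bool use_grouped_gemm_prefill = false;

    // Write the flags in a fixed order, one byte each.
    void save(std::ostream& os) const {
        const char bytes[3] = {static_cast<char>(use_micro_gemm_prefill),
                               static_cast<char>(use_gpu_mask_gen_prefill),
                               static_cast<char>(use_grouped_gemm_prefill)};
        os.write(bytes, sizeof(bytes));
    }

    // Read the flags back in the same order they were written.
    void load(std::istream& is) {
        char bytes[3] = {0, 0, 0};
        is.read(bytes, sizeof(bytes));
        use_micro_gemm_prefill = bytes[0] != 0;
        use_gpu_mask_gen_prefill = bytes[1] != 0;
        use_grouped_gemm_prefill = bytes[2] != 0;
    }
};
```

Without the load() step, a freshly constructed object would keep all three flags at `false`, which is the mismatch the commit describes: the restored object could select a different prefill path than the one used when the blob was compiled.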
1. dispatch_data_func_test.cpp (new):
   - Tests the null-guard added to DispatchDataFunc::operator()
   - Verifies that calling a DispatchDataFunc constructed from nullptr
     does not throw std::bad_function_call
   - Verifies that a valid dispatch function is correctly invoked
   - Verifies that default-constructed KernelData's update_dispatch_data_func
     is safe to call (null-safe)

2. cache_serialization_test.cpp (new):
   - Tests KV-cache stages serialization round-trip including the zp_concat
     dispatch data func restoration path (save/load cycle)
   - Tests KV-cache stages without zp_concat (scatter + kv_cache only)
   - Tests KV-cache scatter-only path serialization
   - Tests MoE prefill flags (use_micro_gemm_prefill, use_gpu_mask_gen_prefill,
     use_grouped_gemm_prefill) round-trip for:
     * micro_gemm path
     * grouped_gemm path
     * fallback path (all false)
     * all-true path

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>
@xzhan34 xzhan34 force-pushed the qwen3_omni_moe_dev branch from 8635ebd to b195f83 Compare June 3, 2026 05:10
Labels

category: GPU OpenVINO GPU plugin