[OV MX][GPU] Fix MoE model cache load crashes by xzhan34 · Pull Request #36152 · openvinotoolkit/openvino

xzhan34 · 2026-06-01T02:08:53Z

Details:

GPU: auto-enable oneDNN for MOE3GemmFusedCompressed ops - The MOE3GemmFusedCompressed kernel unconditionally calls get_onednn_stream(), but on non-IMMAD GPUs (e.g. Intel UHD 770) the queue is in out-of-order mode, causing an assertion failure. Fix: detect MOE3GemmFusedCompressed ops during model analysis in apply_model_specific_options() and enable oneDNN (which switches to in-order queue). Follows the same pattern as LSTMSequence/GRUSequence.
GPU: fix std::bad_function_call crash on compiled model cache load - When loading from blob cache with shape changes, two issues caused crashes: (1) kv_cache load() omitted the zp_concat stage dispatch-data function restoration for compressed KV-cache models with zero-point inputs; (2) DispatchDataFunc::operator() called the inner std::function without null check. Fix: restore zp_concat's update_dispatch_data_func in load() and add null guard in DispatchDataFunc::operator().
GPU: serialize MoE prefill execution flags in compiled model cache - moe_3gemm_swiglu_opt_impl selects between three prefill execution paths based on boolean flags set from environment variables and hardware capabilities. These flags also determine internal buffer count (9 vs 15). The existing load() did not serialize these flags, causing mismatched execution paths after cache load. Fix: add save()/load() overrides for the three flags.
Unit tests - 12 new tests covering all three fixes: dispatch data null-safety, KV-cache serialization round-trip (with/without zp_concat), MoE prefill flags serialization, and oneDNN auto-enable for MOE3GemmFusedCompressed.

Tested on:

Intel UHD Graphics 770 (iGPU, non-IMMAD): previously crashed, now runs Qwen3-Omni-30B MoE successfully
Intel Arc A770 (dGPU, IMMAD): no regression

Tickets:

N/A

AI Assistance:

AI assistance used: yes
Claude was used to help draft commit messages, unit tests, and the PR description. All code changes were manually validated through builds and runtime testing on both iGPU and dGPU platforms.

xzhan34

After testing, the code changes are necessary in both transformations_pipeline.cpp and execution_config.cpp; more comments have been added.

e-ddykim · 2026-06-02T06:49:06Z

build_jenkins

Copilot

Pull request overview

This PR targets GPU plugin stability when using compiled model caching and MoE-related kernels, addressing two cache-load crash paths (kv_cache dispatch restoration and null-safe dispatch callbacks) and persisting MoE prefill execution-path flags across cache save/load. It also tightens MoE fusion pass gating to only run when oneDNN is enabled.

Changes:

Restore zp_concat stage dispatch-data function setup on kv_cache cache load (dynamic path) and make DispatchDataFunc::operator() null-safe.
Serialize/deserialize moe_3gemm_swiglu_opt_impl prefill execution flags to keep execution path and internal buffer counts consistent after cache load.
Add new unit tests covering dispatch-data null-safety and basic serialization round-trips for kv-cache stages / MoE flags.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| src/plugins/intel_gpu/tests/unit/test_cases/cache_serialization_test.cpp | Adds unit tests intended to cover cache serialization round-trips for kv-cache stage lists and MoE prefill flags. |
| src/plugins/intel_gpu/tests/unit/module_tests/dispatch_data_func_test.cpp | Adds regression tests for null-safe `DispatchDataFunc` invocation behavior. |
| src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp | Gates MoE fusion pipeline on `supports_immad && use_onednn` and adds explanatory comment. |
| src/plugins/intel_gpu/src/graph/impls/ocl/kv_cache.cpp | Restores `zp_concat` stage dispatch-data function initialization during cache `load()` for dynamic kv_cache. |
| src/plugins/intel_gpu/src/graph/impls/ocl_v2/moe/moe_3gemm_swiglu_opt.cpp | Persists three MoE prefill execution flags via `save()`/`load()`. |
| src/plugins/intel_gpu/src/graph/common_utils/kernel_generator_base.hpp | Adds a null guard in `DispatchDataFunc::operator()` to avoid `std::bad_function_call`. |

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

The MOE3GemmFusedCompressed kernel dispatches expert GEMMs through oneDNN, which requires an in-order OCL queue. If oneDNN is disabled (e.g. via OV_GPU_USE_ONEDNN=0 on an IMMAD GPU), the queue stays out-of-order and the oneDNN stream creation asserts at ocl_stream.cpp:240.

Guard the FuseMOE3GemmCompressed transformation pass behind config.get_use_onednn() so the op is never introduced into the graph when oneDNN is unavailable. The outer 'supports_immad' block already prevents this on non-IMMAD GPUs; this inner guard handles the case where oneDNN is explicitly disabled by the user on IMMAD hardware.

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>
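The nested guard described in this commit message can be sketched as follows. This is only an illustration under stated assumptions: `DeviceCaps`, `PluginConfig`, and `select_moe_passes` are hypothetical stand-ins, not the plugin's actual types or pass-registration API; only the pass name and the two conditions come from the text above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the device capabilities and the execution config.
struct DeviceCaps { bool supports_immad; };
struct PluginConfig { bool use_onednn; };

// Returns the names of the MoE fusion passes that would be registered.
inline std::vector<std::string> select_moe_passes(const DeviceCaps& caps,
                                                  const PluginConfig& cfg) {
    std::vector<std::string> passes;
    if (caps.supports_immad) {
        // Inner guard: skip the fusion when oneDNN was disabled by the user,
        // so the fused op never reaches a graph that runs on an out-of-order queue.
        if (cfg.use_onednn) {
            passes.push_back("FuseMOE3GemmCompressed");
        }
    }
    return passes;
}

// Small self-check: the pass is selected only when both conditions hold.
inline void check_gating() {
    assert(select_moe_passes(DeviceCaps{true}, PluginConfig{true}).size() == 1);
    assert(select_moe_passes(DeviceCaps{true}, PluginConfig{false}).empty());
    assert(select_moe_passes(DeviceCaps{false}, PluginConfig{true}).empty());
}
```

Calling `check_gating()` exercises the three combinations discussed above: IMMAD with oneDNN, IMMAD with oneDNN disabled, and non-IMMAD hardware.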

When a GPU-compiled model is loaded from the blob cache and the first inference uses a different input shape, the SHAPE_CHANGED flag triggers update_dispatch_data_func() for every kernel stage. Two issues caused std::bad_function_call:

1. kv_cache: the load() deserializer restored dispatch-data functions for scatter_update, concat, beam_table, dq, and scale_concat stages but omitted the zp_concat stage. On compressed KV-cache models with zero-point inputs (e.g. kv_int8), the first shape-changed inference called the null std::function and crashed. Fix: restore zp_concat's update_dispatch_data_func in load() using the same kernel-selector lookup pattern as scale_concat.
2. DispatchDataFunc::operator() (ocl_v2 framework) called the inner std::function without checking for null, even though the class explicitly supports construction from nullptr and KernelData defaults to nullptr. Fix: add a null guard so stages whose codegen legitimately returns no dispatch-data updater silently skip the call instead of crashing.

Reproducible with Qwen3-Omni-30B-A3B-Instruct on GPU with --cache-model: compile on one prompt length, then infer with a different prompt length.

Signed-off-by: Xiaolin Zhang <xiaolin.zhang@intel.com>
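The null guard from item 2 can be illustrated with a minimal wrapper around `std::function`. This is a sketch, not the plugin's code: `DispatchData`, `NullSafeDispatchFunc`, and `demo_dispatch` are hypothetical names, and the real `DispatchDataFunc` has a different signature. The `bool` return value is also an assumption added here so the skipped call is observable.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical stand-in for the per-stage dispatch data.
struct DispatchData { int global_work_size = 0; };

class NullSafeDispatchFunc {
public:
    using Fn = std::function<void(DispatchData&)>;

    NullSafeDispatchFunc() = default;
    NullSafeDispatchFunc(std::nullptr_t) {}  // explicitly "no updater"
    NullSafeDispatchFunc(Fn fn) : m_fn(std::move(fn)) {}

    // Returns true if an updater ran, false if the call was skipped.
    // Calling an empty std::function directly would throw std::bad_function_call.
    bool operator()(DispatchData& data) const {
        if (!m_fn) {
            return false;
        }
        m_fn(data);
        return true;
    }

private:
    Fn m_fn;
};

// Calls an empty wrapper, then a populated one, and returns the resulting size.
inline int demo_dispatch() {
    DispatchData data;
    NullSafeDispatchFunc empty(nullptr);
    bool ran = empty(data);  // skipped, no exception
    assert(!ran);
    NullSafeDispatchFunc updater([](DispatchData& d) { d.global_work_size = 64; });
    updater(data);
    return data.global_work_size;
}
```

The guard only makes a missing updater harmless; a stage that does need one, such as zp_concat in item 1, still has to have it restored in load().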

moe_3gemm_swiglu_opt_impl selects between three prefill execution paths (micro_gemm, grouped_gemm, per-expert onednn loop) based on three boolean flags set in the constructor from environment variables and hardware capabilities. These flags also determine the number of internal buffers (9 vs 15) via get_internal_buffer_descs().

The existing load() override did not serialize these flags. After loading from the compiled model cache the flags reverted to their default-initialized values (all false), which could select a different execution path and allocate a mismatched number of internal buffers compared to the cached kernel stages.

Add a save() override that writes use_micro_gemm_prefill, use_gpu_mask_gen_prefill, and use_grouped_gemm_prefill after the base-class data. Update load() to read them back, restoring the exact execution configuration that was active when the model was originally compiled.

Signed-off-by: Xiaolin Zhang <xiaolin.zhang@intel.com>
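The save()/load() round-trip for the three flags might look roughly like the sketch below. It uses a plain `std::stringstream` as the blob and a hypothetical `PrefillFlags` struct; the plugin's real serialization buffers, base-class data, and on-disk layout are not shown in this PR text and are not reproduced here. Only the three flag names are taken from the commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>

// Hypothetical container for the three prefill flags named in the commit message.
struct PrefillFlags {
    bool use_micro_gemm_prefill = false;
    bool use_gpu_mask_gen_prefill = false;
    bool use_grouped_gemm_prefill = false;
};

// Writes each flag as one byte, in a fixed order.
inline void save_flags(std::ostream& os, const PrefillFlags& f) {
    const std::uint8_t bytes[3] = {
        static_cast<std::uint8_t>(f.use_micro_gemm_prefill),
        static_cast<std::uint8_t>(f.use_gpu_mask_gen_prefill),
        static_cast<std::uint8_t>(f.use_grouped_gemm_prefill)};
    os.write(reinterpret_cast<const char*>(bytes), sizeof(bytes));
}

// Reads the flags back in the same order they were written.
inline void load_flags(std::istream& is, PrefillFlags& f) {
    std::uint8_t bytes[3] = {0, 0, 0};
    is.read(reinterpret_cast<char*>(bytes), sizeof(bytes));
    assert(is && "blob ended before the flags were read");
    f.use_micro_gemm_prefill = bytes[0] != 0;
    f.use_gpu_mask_gen_prefill = bytes[1] != 0;
    f.use_grouped_gemm_prefill = bytes[2] != 0;
}

// Saves the flags into an in-memory blob and loads them into a fresh object.
inline PrefillFlags round_trip(const PrefillFlags& in) {
    std::stringstream blob(std::ios::in | std::ios::out | std::ios::binary);
    save_flags(blob, in);
    PrefillFlags out;  // starts all-false, like a freshly loaded impl
    load_flags(blob, out);
    return out;
}
```

Without the load step, `out` would keep its all-false defaults, which is the mismatch the commit describes.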

1. dispatch_data_func_test.cpp (new):
   - Tests the null-guard added to DispatchDataFunc::operator()
   - Verifies that calling a DispatchDataFunc constructed from nullptr does not throw std::bad_function_call
   - Verifies that a valid dispatch function is correctly invoked
   - Verifies that default-constructed KernelData's update_dispatch_data_func is safe to call (null-safe)
2. cache_serialization_test.cpp (new):
   - Tests KV-cache stages serialization round-trip including the zp_concat dispatch data func restoration path (save/load cycle)
   - Tests KV-cache stages without zp_concat (scatter + kv_cache only)
   - Tests KV-cache scatter-only path serialization
   - Tests MoE prefill flags (use_micro_gemm_prefill, use_gpu_mask_gen_prefill, use_grouped_gemm_prefill) round-trip for:
     * micro_gemm path
     * grouped_gemm path
     * fallback path (all false)
     * all-true path

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

xzhan34 requested review from a team as code owners June 1, 2026 02:08

github-actions Bot added the category: GPU OpenVINO GPU plugin label Jun 1, 2026

wangleis changed the title ~~[GPU] Fix MoE model cache load crashes and auto-enable oneDNN for MOE3GemmFusedCompressed~~ [OV MX][GPU] Fix MoE model cache load crashes and auto-enable oneDNN for MOE3GemmFusedCompressed Jun 1, 2026

wangleis requested a review from isanghao June 1, 2026 02:12

wangleis assigned isanghao Jun 1, 2026

wangleis requested a review from riverlijunjie June 1, 2026 02:12

e-ddykim reviewed Jun 1, 2026

Comment thread src/plugins/intel_gpu/src/runtime/execution_config.cpp Outdated

xzhan34 force-pushed the qwen3_omni_moe_dev branch 3 times, most recently from 0e29ca9 to 65f9c40 Compare June 1, 2026 03:00

e-ddykim reviewed Jun 1, 2026

Comment thread src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp Outdated

xzhan34 force-pushed the qwen3_omni_moe_dev branch 3 times, most recently from 75172ae to e33956a Compare June 1, 2026 08:19

xzhan34 commented Jun 1, 2026

Comment thread src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp Outdated

xzhan34 force-pushed the qwen3_omni_moe_dev branch 2 times, most recently from 9d0b1ee to 9b6b213 Compare June 1, 2026 09:46

xzhan34 requested a review from e-ddykim June 1, 2026 09:52

xzhan34 force-pushed the qwen3_omni_moe_dev branch 2 times, most recently from 343bf2a to c2f7833 Compare June 1, 2026 23:41

e-ddykim reviewed Jun 2, 2026

Comment thread src/plugins/intel_gpu/src/runtime/execution_config.cpp Outdated

Comment thread src/plugins/intel_gpu/tests/unit/module_tests/execution_config_test.cpp Outdated

xzhan34 force-pushed the qwen3_omni_moe_dev branch 3 times, most recently from f21e11f to 84126d2 Compare June 2, 2026 05:44

xzhan34 requested a review from e-ddykim June 2, 2026 05:46

e-ddykim changed the title ~~[OV MX][GPU] Fix MoE model cache load crashes and auto-enable oneDNN for MOE3GemmFusedCompressed~~ [OV MX][GPU] Fix MoE model cache load crashes Jun 2, 2026

e-ddykim requested a review from Copilot June 2, 2026 06:50

Copilot started reviewing on behalf of e-ddykim June 2, 2026 06:50

Copilot AI reviewed Jun 2, 2026

Comment thread src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp

Comment thread src/plugins/intel_gpu/tests/unit/test_cases/cache_serialization_test.cpp

Comment thread src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp Outdated

xzhan34 force-pushed the qwen3_omni_moe_dev branch from 84126d2 to 8635ebd Compare June 2, 2026 12:00

xzhan34 requested a review from Copilot June 2, 2026 12:02

Copilot AI reviewed Jun 2, 2026

xzhan34 added 4 commits June 3, 2026 10:55

xzhan34 force-pushed the qwen3_omni_moe_dev branch from 8635ebd to b195f83 Compare June 3, 2026 05:10

xzhan34 wants to merge 4 commits into openvinotoolkit:master from xzhan34:qwen3_omni_moe_dev

4 participants
