[OV MX][GPU] Fix MoE model cache load crashes #36152
Open
xzhan34 wants to merge 4 commits into
e-ddykim
reviewed
Jun 1, 2026
0e29ca9 to
65f9c40
Compare
e-ddykim
reviewed
Jun 1, 2026
75172ae to
e33956a
Compare
xzhan34
commented
Jun 1, 2026
Author
xzhan34
left a comment
After testing, the code changes are necessary in both transformations_pipeline.cpp and execution_config.cpp; more comments have been added.
9d0b1ee to
9b6b213
Compare
343bf2a to
c2f7833
Compare
e-ddykim
reviewed
Jun 2, 2026
f21e11f to
84126d2
Compare
Contributor
|
build_jenkins |
Contributor
Pull request overview
This PR targets GPU plugin stability when using compiled model caching and MoE-related kernels, addressing two cache-load crash paths (kv_cache dispatch restoration and null-safe dispatch callbacks) and persisting MoE prefill execution-path flags across cache save/load. It also tightens MoE fusion pass gating to only run when oneDNN is enabled.
Changes:
- Restore `zp_concat` stage dispatch-data function setup on `kv_cache` cache load (dynamic path) and make `DispatchDataFunc::operator()` null-safe.
- Serialize/deserialize `moe_3gemm_swiglu_opt_impl` prefill execution flags to keep execution path and internal buffer counts consistent after cache load.
- Add new unit tests covering dispatch-data null-safety and basic serialization round-trips for kv-cache stages / MoE flags.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/plugins/intel_gpu/tests/unit/test_cases/cache_serialization_test.cpp | Adds unit tests intended to cover cache serialization round-trips for kv-cache stage lists and MoE prefill flags. |
| src/plugins/intel_gpu/tests/unit/module_tests/dispatch_data_func_test.cpp | Adds regression tests for null-safe DispatchDataFunc invocation behavior. |
| src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp | Gates MoE fusion pipeline on supports_immad && use_onednn and adds explanatory comment. |
| src/plugins/intel_gpu/src/graph/impls/ocl/kv_cache.cpp | Restores zp_concat stage dispatch-data function initialization during cache load() for dynamic kv_cache. |
| src/plugins/intel_gpu/src/graph/impls/ocl_v2/moe/moe_3gemm_swiglu_opt.cpp | Persists three MoE prefill execution flags via save()/load(). |
| src/plugins/intel_gpu/src/graph/common_utils/kernel_generator_base.hpp | Adds a null guard in DispatchDataFunc::operator() to avoid std::bad_function_call. |
84126d2 to
8635ebd
Compare
The MOE3GemmFusedCompressed kernel dispatches expert GEMMs through oneDNN, which requires an in-order OCL queue. If oneDNN is disabled (e.g. via OV_GPU_USE_ONEDNN=0 on an IMMAD GPU), the queue stays out-of-order and the oneDNN stream creation asserts at ocl_stream.cpp:240.

Guard the FuseMOE3GemmCompressed transformation pass behind config.get_use_onednn() so the op is never introduced into the graph when oneDNN is unavailable. The outer 'supports_immad' block already prevents this on non-IMMAD GPUs; this inner guard handles the case where oneDNN is explicitly disabled by the user on IMMAD hardware.

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>
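The nested guard described in this commit message can be pictured with the following minimal sketch. It is not the plugin's code: `DemoGpuConfig`, `select_moe_passes`, and `check_moe_gating` are hypothetical names, and the pass is represented only by its name as a string.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the plugin configuration; field names are illustrative.
struct DemoGpuConfig {
    bool supports_immad = false;
    bool use_onednn = false;
};

// Returns the names of the MoE fusion passes that would be registered.
inline std::vector<std::string> select_moe_passes(const DemoGpuConfig& cfg) {
    std::vector<std::string> passes;
    if (cfg.supports_immad) {
        // Inner guard: oneDNN may be disabled by the user even on IMMAD hardware.
        if (cfg.use_onednn) {
            passes.push_back("FuseMOE3GemmCompressed");
        }
    }
    return passes;
}

// Small self-check: the pass is registered only when both conditions hold.
inline void check_moe_gating() {
    DemoGpuConfig cfg;
    cfg.supports_immad = true;
    cfg.use_onednn = false;
    assert(select_moe_passes(cfg).empty());
    cfg.use_onednn = true;
    assert(select_moe_passes(cfg).size() == 1);
}
```

With both flags set the sketch returns one pass name; in every other combination it returns an empty list, which corresponds to the fused op never entering the graph.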
When a GPU-compiled model is loaded from the blob cache and the first inference uses a different input shape, the SHAPE_CHANGED flag triggers update_dispatch_data_func() for every kernel stage. Two issues caused std::bad_function_call:

1. kv_cache: the load() deserializer restored dispatch-data functions for scatter_update, concat, beam_table, dq, and scale_concat stages but omitted the zp_concat stage. On compressed KV-cache models with zero-point inputs (e.g. kv_int8), the first shape-changed inference called the null std::function and crashed.
   Fix: restore zp_concat's update_dispatch_data_func in load() using the same kernel-selector lookup pattern as scale_concat.
2. DispatchDataFunc::operator() (ocl_v2 framework) called the inner std::function without checking for null, even though the class explicitly supports construction from nullptr and KernelData defaults to nullptr.
   Fix: add a null guard so stages whose codegen legitimately returns no dispatch-data updater silently skip the call instead of crashing.

Reproducible with Qwen3-Omni-30B-A3B-Instruct on GPU with --cache-model: compile on one prompt length, then infer with a different prompt length.

Signed-off-by: Xiaolin Zhang <xiaolin.zhang@intel.com>
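The null guard from item 2 can be illustrated with a small standalone wrapper. This is only a sketch of the idea, assuming a callable that takes parameters and mutable kernel data; `NullSafeDispatchFunc`, `DemoParams`, and `DemoKernelData` are hypothetical names and do not reproduce the real `DispatchDataFunc` interface.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical stand-ins for the plugin's parameter and kernel-data types.
struct DemoParams { std::size_t batch = 1; };
struct DemoKernelData { std::size_t global_work_size = 0; };

class NullSafeDispatchFunc {
public:
    using Fn = std::function<void(const DemoParams&, DemoKernelData&)>;

    NullSafeDispatchFunc() = default;
    NullSafeDispatchFunc(std::nullptr_t) {}
    NullSafeDispatchFunc(Fn fn) : m_fn(std::move(fn)) {}

    // Calling an empty std::function throws std::bad_function_call,
    // so an unset updater is treated as a no-op instead.
    void operator()(const DemoParams& params, DemoKernelData& kd) const {
        if (!m_fn) {
            return;
        }
        m_fn(params, kd);
    }

    explicit operator bool() const { return static_cast<bool>(m_fn); }

private:
    Fn m_fn;
};

// Small self-check: a null updater leaves the kernel data untouched.
inline void check_null_safe_dispatch() {
    DemoParams params;
    DemoKernelData kd;
    NullSafeDispatchFunc empty(nullptr);
    empty(params, kd);
    assert(kd.global_work_size == 0);
}
```

The trade-off of a silent no-op is that a stage which was supposed to have an updater (such as the omitted `zp_concat` case) no longer fails loudly, which is why the commit also restores that function in `load()` rather than relying on the guard alone.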
moe_3gemm_swiglu_opt_impl selects between three prefill execution paths (micro_gemm, grouped_gemm, per-expert onednn loop) based on three boolean flags set in the constructor from environment variables and hardware capabilities. These flags also determine the number of internal buffers (9 vs 15) via get_internal_buffer_descs().

The existing load() override did not serialize these flags. After loading from the compiled model cache the flags reverted to their default-initialized values (all false), which could select a different execution path and allocate a mismatched number of internal buffers compared to the cached kernel stages.

Add a save() override that writes use_micro_gemm_prefill, use_gpu_mask_gen_prefill, and use_grouped_gemm_prefill after the base-class data. Update load() to read them back, restoring the exact execution configuration that was active when the model was originally compiled.

Signed-off-by: Xiaolin Zhang <xiaolin.zhang@intel.com>
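A save/load round-trip of the three flags might look like the sketch below. It uses plain standard-library streams rather than the plugin's binary buffer classes, and `PrefillFlags`, `save_flags`, `load_flags`, and `round_trip` are hypothetical helpers; only the three flag names come from the commit message.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>

// Hypothetical mirror of the three prefill flags named in the commit message.
struct PrefillFlags {
    bool use_micro_gemm_prefill = false;
    bool use_gpu_mask_gen_prefill = false;
    bool use_grouped_gemm_prefill = false;
};

// Write each flag as one byte, in a fixed order.
inline void save_flags(std::ostream& os, const PrefillFlags& f) {
    const char bytes[3] = {
        static_cast<char>(f.use_micro_gemm_prefill),
        static_cast<char>(f.use_gpu_mask_gen_prefill),
        static_cast<char>(f.use_grouped_gemm_prefill),
    };
    os.write(bytes, sizeof(bytes));
}

// Read the flags back in the same order they were written.
inline PrefillFlags load_flags(std::istream& is) {
    char bytes[3] = {0, 0, 0};
    is.read(bytes, sizeof(bytes));
    assert(is.gcount() == 3 && "truncated flag block");
    PrefillFlags f;
    f.use_micro_gemm_prefill = bytes[0] != 0;
    f.use_gpu_mask_gen_prefill = bytes[1] != 0;
    f.use_grouped_gemm_prefill = bytes[2] != 0;
    return f;
}

// Serialize into an in-memory blob and deserialize again.
inline PrefillFlags round_trip(const PrefillFlags& in) {
    std::stringstream blob(std::ios::in | std::ios::out | std::ios::binary);
    save_flags(blob, in);
    return load_flags(blob);
}
```

The point of the sketch is the ordering contract: the reader must consume exactly the fields the writer emitted, in the same sequence, otherwise the loaded object falls back to defaults or misreads later data.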
1. dispatch_data_func_test.cpp (new):
- Tests the null-guard added to DispatchDataFunc::operator()
- Verifies that calling a DispatchDataFunc constructed from nullptr
does not throw std::bad_function_call
- Verifies that a valid dispatch function is correctly invoked
- Verifies that default-constructed KernelData's update_dispatch_data_func
is safe to call (null-safe)
2. cache_serialization_test.cpp (new):
- Tests KV-cache stages serialization round-trip including the zp_concat
dispatch data func restoration path (save/load cycle)
- Tests KV-cache stages without zp_concat (scatter + kv_cache only)
- Tests KV-cache scatter-only path serialization
- Tests MoE prefill flags (use_micro_gemm_prefill, use_gpu_mask_gen_prefill,
use_grouped_gemm_prefill) round-trip for:
* micro_gemm path
* grouped_gemm path
* fallback path (all false)
* all-true path
Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>
8635ebd to
b195f83
Compare
Details:
- get_onednn_stream(), but on non-IMMAD GPUs (e.g. Intel UHD 770) the queue is in out-of-order mode, causing an assertion failure. Fix: detect MOE3GemmFusedCompressed ops during model analysis in `apply_model_specific_options()` and enable oneDNN (which switches to in-order queue). Follows the same pattern as LSTMSequence/GRUSequence.
- `load()` omitted the `zp_concat` stage dispatch-data function restoration for compressed KV-cache models with zero-point inputs; (2) `DispatchDataFunc::operator()` called the inner `std::function` without null check. Fix: restore `zp_concat`'s update_dispatch_data_func in `load()` and add null guard in `DispatchDataFunc::operator()`.
- `moe_3gemm_swiglu_opt_impl` selects between three prefill execution paths based on boolean flags set from environment variables and hardware capabilities. These flags also determine internal buffer count (9 vs 15). The existing `load()` did not serialize these flags, causing mismatched execution paths after cache load. Fix: add `save()`/`load()` overrides for the three flags.

Tested on:
Tickets:
AI Assistance: