[GPU] fix dGPU func testcases in smoke_ScaledAttnDynamic4D_GPU and smoke_MatMulCompressedWeights_extra_multiply #35343

Closed
yuanxion wants to merge 8 commits into openvinotoolkit:master from yuanxion:fix-ci-dgpu-tests-gha-scale-atten2

Conversation

@yuanxion
Contributor

Details

Fixes two Intel dGPU functional test cases:

  1. smoke_ScaledAttnDynamic4D_GPU/ScaledAttnLayerGPUTest.CompareWithRefs
  2. smoke_MatMulCompressedWeights_extra_multiply/MatmulWeightsDecompression.Inference

Description of the issue

Symptom

  1. smoke_ScaledAttnDynamic4D_GPU/ScaledAttnLayerGPUTest.CompareWithRefs failed on Intel dGPU when the graph carried a scalar or rank-1 placeholder input in the attention-mask slot. The SDPA OCL path treated that placeholder as a real runtime attention mask, which changed kernel configuration and input binding unexpectedly.
  2. smoke_MatMulCompressedWeights_extra_multiply/MatmulWeightsDecompression.Inference failed on dGPU for the group_size=2 + extra_multiply=1 + param_weights=1 configuration. The runtime failure surfaced during GPU program build / implementation selection.

Root cause

  1. The SDPA GPU path only checked whether an attention-mask input slot existed, but did not distinguish a real runtime mask tensor from a scalar / rank-1 placeholder. As a result, placeholder inputs were propagated through the real attention-mask path and affected JIT constants, kernel arguments, and execution logic.
  2. ConvertMatMulToFullyConnected converted a MatMul with parameter-based compressed weights into FullyConnected even when the decompressed weights still had non-trivial batch dimensions. That conversion is unsafe for this pattern: after extra multiply and transpose, the weights remained effectively per-batch / 3D, but were still fed into the FC path.
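
The distinction missing in the first root cause can be sketched as follows. This is only an illustration under stated assumptions: `Shape` and `is_runtime_attn_mask` are hypothetical names, not OpenVINO types or the helper added by this PR, and the rank threshold simply mirrors the description above (scalar and rank-1 inputs count as placeholders).

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for an input shape; not an OpenVINO type.
using Shape = std::vector<std::size_t>;

// Returns true only when the mask slot holds a tensor of rank >= 2.
// An empty optional models "no input connected to the mask slot".
bool is_runtime_attn_mask(const std::optional<Shape>& mask_shape) {
    if (!mask_shape.has_value()) {
        return false;  // slot is absent
    }
    // Rank 0 (scalar) and rank 1 are treated as placeholders, not real masks.
    return mask_shape->size() >= 2;
}
```

A check that only tests whether the slot exists would return true for the scalar and rank-1 cases as well, which matches the failure mode described in item 1.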

How to fix it

  1. Add a shared helper to identify whether the attention-mask input is a real runtime mask. Scalar and rank-1 placeholders are now excluded from the runtime attention-mask path, and the same logic is reused by the SDPA OCL implementations.
  2. Detect parameter-based compressed weights explicitly in the matcher result and block MatMul -> FullyConnected conversion when the aligned weight shape still contains non-1 batch dimensions. In that case, keep the original MatMul path instead of forcing FC.
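
The batch-dimension guard from the second fix might look roughly like the sketch below. It assumes static shapes and models only the added condition; `Shape`, `has_nontrivial_batch_dims`, and `may_convert_to_fc` are hypothetical names for illustration, and the real transformation applies further checks that are not shown here.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an aligned static shape; not an OpenVINO type.
using Shape = std::vector<std::size_t>;

// True when any dimension in front of the last two (the matrix dimensions)
// differs from 1, i.e. the weights are effectively per-batch.
bool has_nontrivial_batch_dims(const Shape& aligned_weight_shape) {
    if (aligned_weight_shape.size() <= 2) {
        return false;  // plain 2D (or smaller) weights have no batch dims
    }
    return std::any_of(aligned_weight_shape.begin(),
                       aligned_weight_shape.end() - 2,
                       [](std::size_t dim) { return dim != 1; });
}

// Models only the new guard: parameter-based compressed weights that still
// carry batch dimensions must stay on the MatMul path.
bool may_convert_to_fc(bool is_param_compressed_weight,
                       const Shape& aligned_weight_shape) {
    return !(is_param_compressed_weight &&
             has_nontrivial_batch_dims(aligned_weight_shape));
}
```

Under this sketch, an aligned weight shape of `{1, 16, 32}` would still be eligible for conversion, while `{4, 16, 32}` with parameter-based compressed weights would not.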

The code and line that caused this issue

  1. smoke_ScaledAttnDynamic4D_GPU
    if (i == attn_mask_idx && desc->attn_mask_val.has_value())
        continue;
  2. smoke_MatMulCompressedWeights_extra_multiply
    } else if (!is_compressed_weight || !supports_immad) {
        return std::make_tuple(false, std::move(shape_a_aligned), std::move(shape_b_aligned));
    }

Reproduction step and snapshot

  1. smoke_ScaledAttnDynamic4D_GPU
    ./ov_gpu_func_tests --device_suffix=1 --gtest_filter='smoke_ScaledAttnDynamic4D_GPU/ScaledAttnLayerGPUTest.CompareWithRefs/netPRC=f16_IS=[?.5.?.128]_[?.5.?.128]_[?.5.?.32]_[?.1.?.?]_TS=(2.5.100.128)_(2.5.1.128)_(2.5.387.128)_(2.5.100.128)_(2.5.1.128)_(2.5.387.128)_(2.5.100.32)_(2.5.1.32)_(2.5.387.32)_(1.1.100.100)_(1.1.1.1)_(2.1.387.387)_is_causal=0_has_attn=0_is_attn_const=1_has_scale=1_is_scale_const=1_with_transpose0_has_sink=0_'
  2. smoke_MatMulCompressedWeights_extra_multiply
    ./ov_gpu_func_tests --device_suffix=1 --gtest_filter='smoke_MatMulCompressedWeights_extra_multiply/MatmulWeightsDecompression.Inference/data_shape=[]_[1.4.16]__weights_shape=[16,32]_group_size=2_weights_precision=u8_activations_precision=f32_transpose_weights=0_decompression_subtract=0_reshape_on_decompression=0_extra_multiply=1_per_tensor_zp=0_param_weights=1_dyn_quan_group_size=0'

Problematic graph

N/A

Checklist

  • Is it a proper fix? (not a workaround)
  • Did you include test case for this fix, if necessary?
  • Did you review existing test that can be extended to cover this scenario? Which test did you review?

Tickets:

AI Assistance:

  • AI assistance used: yes
  • Used it to reproduce the failing dGPU functional tests, inspect the dumped GPU graphs, find the root cause, and add test cases.

@yuanxion yuanxion requested review from a team as code owners April 15, 2026 03:21
@github-actions github-actions Bot added the category: GPU OpenVINO GPU plugin label Apr 15, 2026
@yuanxion yuanxion requested a review from Copilot April 15, 2026 09:23
Contributor

Copilot AI left a comment


Pull request overview

Fixes two Intel dGPU functional test failures by (1) treating scalar/rank-1 SDPA attention-mask inputs as placeholders (not real runtime masks) in OCL SDPA implementations, and (2) preventing unsafe MatMul→FullyConnected conversion for parameter-based compressed weights that still carry non-trivial batch dimensions.

Changes:

  • Add shared SDPA helper to detect whether the attention-mask input is a real runtime mask, and reuse it across SDPA OCL implementations.
  • Update MatMul→FC transformation to block conversion for parameter-based compressed weights with non-1 aligned batch dimensions.
  • Add/extend unit tests for SDPA placeholder-mask behavior and MatMul→FC “no-convert” scenario.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

File | Description
--- | ---
src/plugins/intel_gpu/tests/unit/transformations/convert_matmul_to_fc_test.cpp | Adds a regression unit test ensuring MatMul→FC does not occur for parameter-based compressed weights with per-batch dimensions.
src/plugins/intel_gpu/tests/unit/test_cases/sdpa_gpu_test.cpp | Adds a unit test asserting scalar placeholder mask behaves like “no runtime mask” in SDPA OCL path.
src/plugins/intel_gpu/src/plugin/transformations/convert_matmul_to_fc.cpp | Extends aligned-shape check to detect parameter-based compressed weights and block unsafe FC conversion.
src/plugins/intel_gpu/src/graph/impls/ocl_v2/sdpa/sdpa_utils.hpp | Introduces has_runtime_attn_mask_input() helper to filter out scalar/1D placeholder masks.
src/plugins/intel_gpu/src/graph/impls/ocl_v2/sdpa/sdpa_ref.cpp | Uses the helper to set JIT constants and to skip binding placeholder mask inputs.
src/plugins/intel_gpu/src/graph/impls/ocl_v2/sdpa/sdpa_gen_opt.cpp | Uses the helper for JIT config and argument binding in the optimized SDPA generator.
src/plugins/intel_gpu/src/graph/impls/ocl_v2/sdpa/sdpa_gen_micro.cpp | Uses the helper to control mask-related JIT constants/args for the micro-kernel generator.

Comment thread src/plugins/intel_gpu/tests/unit/test_cases/sdpa_gpu_test.cpp Outdated
…efs func test cases

Signed-off-by: yuan.xiong <yuan.xiong@intel.com>
…ession.Inference func test cases

Signed-off-by: yuan.xiong <yuan.xiong@intel.com>
Signed-off-by: yuan.xiong <yuan.xiong@intel.com>
Signed-off-by: yuan.xiong <yuan.xiong@intel.com>
Signed-off-by: yuan.xiong <yuan.xiong@intel.com>
Signed-off-by: yuan.xiong <yuan.xiong@intel.com>
@yuanxion yuanxion force-pushed the fix-ci-dgpu-tests-gha-scale-atten2 branch from a68d87a to 996a20e Compare April 21, 2026 01:09
@yuanxion yuanxion marked this pull request as draft April 21, 2026 05:12
@yuanxion
Contributor Author

This PR is split into 2 PRs:
#35437 for smoke_ScaledAttnDynamic4D_GPU
#35442 for smoke_MatMulCompressedWeights_extra_multiply

This PR is no longer needed, so I am closing it.

@yuanxion yuanxion closed this May 12, 2026
