[GPU] Fix implicit concat fusing blocked for feature-axis and multi-user predecessors#36032
[GPU] Fix implicit concat fusing blocked for feature-axis and multi-user predecessors#36032deepaks2 wants to merge 9 commits into
Conversation
|
build_jenkins |
|
build_jenkins |
|
build_jenkins |
| // batch=2 with feature-axis concat is now correctly fused after removing the | ||
| // overly conservative batch>1 guard (which is only needed for batch-axis concat). |
There was a problem hiding this comment.
This comment is not needed here
| // batch=2 with feature-axis concat is now correctly fused after removing the | |
| // overly conservative batch>1 guard (which is only needed for batch-axis concat). |
| // padding on the input side. reorder and permute are excluded because some of their | ||
| // implementations copy over raw buffer byte ranges and would include padding bytes. | ||
| auto reads_padded_input_safely = [](const program_node& user) { | ||
| return user.is_type<convolution>() || |
There was a problem hiding this comment.
Will everything work fine when using onednn implementations, for example for convolution?
This is not checked in tests
Also what about can_be_optimized case?
There was a problem hiding this comment.
Added the check for can_be_optimized case
There was a problem hiding this comment.
I still don't see a test with onednn impl (for example for convolution)
| return; | ||
|
|
||
| // batch=2, feature-axis concat with b_fs_yx_fsv16 — previously blocked by the batch>1 guard | ||
| auto in_layout1 = layout{ ov::PartialShape{2, 16, 4, 4}, data_types::f32, format::bfyx }; |
There was a problem hiding this comment.
Can you use a more awkward shape for the test? And test 3 inputs
For example: [3, 16, 2, 3], [3, 32, 2, 3], [3, 16, 2, 3]
| std::vector<float> d1(16 * 4 * 4, 1.0f); | ||
| std::vector<float> d2(16 * 4 * 4, 2.0f); |
There was a problem hiding this comment.
Please use natural number values to detect memory overlaps (0, 1 .. mem_size)
There was a problem hiding this comment.
Thanks. updated the test to detect memory overlaps
|
build_jenkins |
|
build_jenkins |
3a66690 to
01bd5f8
Compare
|
build_jenkins |
…r batch>1 The batch=2 feature-axis case had is_implicit_concat=false, reflecting the old prepare_buffer_fusing behaviour where batch > 1 unconditionally disabled in-place fusing on the oneDNN path. After tightening that guard to apply only when concat_axis_index == 0 (batch-axis concat), feature-axis concat at batch=2 is correctly fused. Update is_implicit_concat to true to match the corrected behaviour; the existing diff_count == 0 assertion in the test body verifies the output values are bit-exact. Signed-off-by: S, Deepak <deepak.s@intel.com>
Replace NaN/Inf output checks with element-wise comparison against a reference network running an explicit (non-fused) concat over the same random inputs. This catches buffer aliasing bugs where the implicit path produces numerically valid but incorrect values. The batch>1 test switches to a two-network implicit-vs-explicit comparison to avoid hard-coding layout assumptions for block formats at batch>1. The multi-user test adds force_implementations to pin the oneDNN fusing code path regardless of the runtime heuristic. Signed-off-by: S, Deepak <deepak.s@intel.com>
…check The available_pred lambda guards output-padding support: whether a node type's kernel can write output with buffer gaps for in-place aliasing. The multi-user predecessor check in concat_in_place_optimization reused that lambda to validate non-concat consumers, but the question there is different: whether the consumer's kernel reads input by coordinate and therefore skips padding correctly. Introduce reads_padded_input_safely with that explicit semantic and use it for the multi-user consumer check. The list is narrower than available_pred: reorder and permute are excluded because some of their implementations copy over raw buffer byte ranges and would include padding bytes. The types retained (convolution, deconvolution, pooling, activation, eltwise, quantize) address input by explicit tensor coordinate in all GPU kernel variants. Signed-off-by: S, Deepak <deepak.s@intel.com>
01bd5f8 to
418cf02
Compare
Replace uniform constant fill (1.0f / 2.0f) with sequential 0..N-1 values so any buffer-overlap or aliasing regression produces a wrong value at a specific index rather than going undetected. Update output assertions to match: concat halves checked element-wise, relu output equals input, clamp output is 0.0f at index 0 and 1.0f elsewhere. Signed-off-by: S, Deepak <deepak.s@intel.com>
Replace uniform constant fill (1.0f / 2.0f) with sequential 0..N-1 values so any buffer-overlap or aliasing bug produces a mismatch at a specific index rather than going undetected. Update output assertions: concat halves and conv output checked element-wise against float(i); relu output is identity for all non-negative inputs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
3b151cd to
2970ebc
Compare
Force both shared_r and conv to b_fs_yx_fsv16 + oneDNN. Since their formats already match, reorder_inputs inserts no intermediate reorder, so oneDNN conv reads directly from the padded predecessor buffer — confirming the safety of reads_padded_input_safely for oneDNN convolution. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
abs preserves the full natural-number range of the input (0..N-1), making buffer-overlap bugs immediately visible at every index. clamp(0,1) was masking all values above 1, reducing detectability. Also swaps concat input order (other_r first) and offsets d2 by 512 so the two halves of the concat output are unambiguously distinguishable. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Swap concat input order (other_r first) so shared_r lands in the second slot, exercising a more asymmetric padding layout. Offset d2 by 512 so the two concat halves are unambiguously distinguishable — any buffer aliasing or ordering bug produces a clearly wrong value rather than going unnoticed when both halves held identical data. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reduce comments to a minimum, only for cases where the code is not obvious Signed-off-by: S, Deepak <deepak.s@intel.com>
prepare_buffer_fusing was skipping in-place (zero-copy) concat fusing due to two overly conservative guards on the oneDNN static path.
First, concat_out_l.batch() > 1 was returning false unconditionally for any concat when the output batch size exceeded one. The check was intended for batch-axis concat, where block formats are not contiguous across batch elements. For feature-axis (axis=1) concat the 64-byte alignment check already in place is the correct correctness gate; the batch check is only needed when concat_axis_index == 0.
Second, get_users().size() > 2 rejected any predecessor with more than two consumers regardless of what those consumers were. The existing available_pred lambda already enumerates node types whose kernels never access padded regions (convolution, pooling, activation, eltwise, reorder, etc.). Fusing is safe whenever every non-concat user of the predecessor passes that check.
On YOLOv8s FP16 explicit concat kernel count drops from 15 to 2 at batch=8, reducing latency by ~36% at batch=8 on Intel iGPU.
Two unit tests are added to prepare_buffer_fusing_test.cpp covering the corrected code paths; all 37 existing tests continue to pass.
Tickets:
CVS-187256
AI Assistance:
*AI assistance used: yes
AI was used for test generation and Human reviewed and run all unit tests manually