[GPU] Fix implicit concat fusing blocked for feature-axis and multi-user predecessors by deepaks2 · Pull Request #36032 · openvinotoolkit/openvino

deepaks2 · 2026-05-21T15:16:33Z

prepare_buffer_fusing was skipping in-place (zero-copy) concat fusing due to two overly conservative guards on the oneDNN static path.

First, concat_out_l.batch() > 1 was returning false unconditionally for any concat when the output batch size exceeded one. The check was intended for batch-axis concat, where block formats are not contiguous across batch elements. For feature-axis (axis=1) concat the 64-byte alignment check already in place is the correct correctness gate; the batch check is only needed when concat_axis_index == 0.

Second, get_users().size() > 2 rejected any predecessor with more than two consumers regardless of what those consumers were. The existing available_pred lambda already enumerates node types whose kernels never access padded regions (convolution, pooling, activation, eltwise, reorder, etc.). Fusing is safe whenever every non-concat user of the predecessor passes that check.

On YOLOv8s FP16 explicit concat kernel count drops from 15 to 2 at batch=8, reducing latency by ~36% at batch=8 on Intel iGPU.

Two unit tests are added to prepare_buffer_fusing_test.cpp covering the corrected code paths; all 37 existing tests continue to pass.

Tickets:
CVS-187256

AI Assistance:
*AI assistance used: yes
AI was used for test generation and Human reviewed and run all unit tests manually

clee30 · 2026-05-22T01:56:01Z

build_jenkins

p-durandin · 2026-05-22T12:28:23Z

build_jenkins

clee30 · 2026-05-26T03:55:01Z

build_jenkins

Lyamin-Roman · 2026-06-01T12:05:02Z

+// batch=2 with feature-axis concat is now correctly fused after removing the
+// overly conservative batch>1 guard (which is only needed for batch-axis concat).


This comment is not needed here

Suggested change

// batch=2 with feature-axis concat is now correctly fused after removing the

// overly conservative batch>1 guard (which is only needed for batch-axis concat).

Updated the code

Still not deleted

Lyamin-Roman · 2026-06-01T12:19:18Z

+// padding on the input side. reorder and permute are excluded because some of their
+// implementations copy over raw buffer byte ranges and would include padding bytes.
+auto reads_padded_input_safely = [](const program_node& user) {
+    return user.is_type<convolution>()   ||


Will everything work fine when using onednn implementations, for example for convolution?
This is not checked in tests
Also what about can_be_optimized case?

Added the check for can_be_optimized case

I still don't see a test with onednn impl (for example for convolution)

Lyamin-Roman · 2026-06-01T12:41:31Z

+        return;
+
+    // batch=2, feature-axis concat with b_fs_yx_fsv16 — previously blocked by the batch>1 guard
+    auto in_layout1 = layout{ ov::PartialShape{2, 16, 4, 4}, data_types::f32, format::bfyx };


Can you use a more awkward shape for the test? And test 3 inputs
For example: [3, 16, 2, 3], [3, 32, 2, 3], [3, 16, 2, 3]

Added new test. Please review

Lyamin-Roman · 2026-06-01T12:44:22Z

+    std::vector<float> d1(16 * 4 * 4, 1.0f);
+    std::vector<float> d2(16 * 4 * 4, 2.0f);


Please use natural number values to detect memory overlaps (0, 1 .. mem_size)

Thanks. updated the test to detect memory overlaps

Still not fixed

deepaks2 · 2026-06-02T08:56:29Z

build_jenkins

deepaks2 · 2026-06-02T12:59:14Z

build_jenkins

deepaks2 · 2026-06-02T14:12:46Z

build_jenkins

…r batch>1 The batch=2 feature-axis case had is_implicit_concat=false, reflecting the old prepare_buffer_fusing behaviour where batch > 1 unconditionally disabled in-place fusing on the oneDNN path. After tightening that guard to apply only when concat_axis_index == 0 (batch-axis concat), feature-axis concat at batch=2 is correctly fused. Update is_implicit_concat to true to match the corrected behaviour; the existing diff_count == 0 assertion in the test body verifies the output values are bit-exact. Signed-off-by: S, Deepak <deepak.s@intel.com>

Replace NaN/Inf output checks with element-wise comparison against a reference network running an explicit (non-fused) concat over the same random inputs. This catches buffer aliasing bugs where the implicit path produces numerically valid but incorrect values. The batch>1 test switches to a two-network implicit-vs-explicit comparison to avoid hard-coding layout assumptions for block formats at batch>1. The multi-user test adds force_implementations to pin the oneDNN fusing code path regardless of the runtime heuristic. Signed-off-by: S, Deepak <deepak.s@intel.com>

…check The available_pred lambda guards output-padding support: whether a node type's kernel can write output with buffer gaps for in-place aliasing. The multi-user predecessor check in concat_in_place_optimization reused that lambda to validate non-concat consumers, but the question there is different: whether the consumer's kernel reads input by coordinate and therefore skips padding correctly. Introduce reads_padded_input_safely with that explicit semantic and use it for the multi-user consumer check. The list is narrower than available_pred: reorder and permute are excluded because some of their implementations copy over raw buffer byte ranges and would include padding bytes. The types retained (convolution, deconvolution, pooling, activation, eltwise, quantize) address input by explicit tensor coordinate in all GPU kernel variants. Signed-off-by: S, Deepak <deepak.s@intel.com>

Replace uniform constant fill (1.0f / 2.0f) with sequential 0..N-1 values so any buffer-overlap or aliasing regression produces a wrong value at a specific index rather than going undetected. Update output assertions to match: concat halves checked element-wise, relu output equals input, clamp output is 0.0f at index 0 and 1.0f elsewhere. Signed-off-by: S, Deepak <deepak.s@intel.com>

Replace uniform constant fill (1.0f / 2.0f) with sequential 0..N-1 values so any buffer-overlap or aliasing bug produces a mismatch at a specific index rather than going undetected. Update output assertions: concat halves and conv output checked element-wise against float(i); relu output is identity for all non-negative inputs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Force both shared_r and conv to b_fs_yx_fsv16 + oneDNN. Since their formats already match, reorder_inputs inserts no intermediate reorder, so oneDNN conv reads directly from the padded predecessor buffer — confirming the safety of reads_padded_input_safely for oneDNN convolution. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

abs preserves the full natural-number range of the input (0..N-1), making buffer-overlap bugs immediately visible at every index. clamp(0,1) was masking all values above 1, reducing detectability. Also swaps concat input order (other_r first) and offsets d2 by 512 so the two halves of the concat output are unambiguously distinguishable. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Swap concat input order (other_r first) so shared_r lands in the second slot, exercising a more asymmetric padding layout. Offset d2 by 512 so the two concat halves are unambiguously distinguishable — any buffer aliasing or ordering bug produces a clearly wrong value rather than going unnoticed when both halves held identical data. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Reduce comments to a minimum, only for cases where the code is not obvious Signed-off-by: S, Deepak <deepak.s@intel.com>

deepaks2 requested review from a team as code owners May 21, 2026 15:16

github-actions Bot added the category: GPU OpenVINO GPU plugin label May 21, 2026

deepaks2 marked this pull request as draft May 21, 2026 15:16

sys-openvino-ci added the ExternalIntelPR External contributor from Intel label May 21, 2026

deepaks2 marked this pull request as ready for review May 22, 2026 05:25

Lyamin-Roman reviewed May 22, 2026

View reviewed changes

Comment thread src/plugins/intel_gpu/src/graph/graph_optimizer/prepare_buffer_fusing.cpp Outdated

Comment thread src/plugins/intel_gpu/src/graph/graph_optimizer/prepare_buffer_fusing.cpp

Lyamin-Roman reviewed May 22, 2026

View reviewed changes

Comment thread src/plugins/intel_gpu/tests/unit/passes/prepare_buffer_fusing_test.cpp Outdated

deepaks2 requested a review from Lyamin-Roman May 27, 2026 04:32

Lyamin-Roman reviewed Jun 1, 2026

View reviewed changes

deepaks2 requested a review from Lyamin-Roman June 2, 2026 08:56

deepaks2 force-pushed the gpu-optimize-concat-buffer-fusing branch 2 times, most recently from 3a66690 to 01bd5f8 Compare June 2, 2026 14:05

deepaks2 added 3 commits June 2, 2026 22:30

deepaks2 force-pushed the gpu-optimize-concat-buffer-fusing branch from 01bd5f8 to 418cf02 Compare June 2, 2026 14:39

deepaks2 and others added 2 commits June 2, 2026 22:51

deepaks2 force-pushed the gpu-optimize-concat-buffer-fusing branch from 3b151cd to 2970ebc Compare June 2, 2026 15:00

deepaks2 and others added 4 commits June 2, 2026 23:08

[GPU] Updating the comments in the source code

8dbc003

Reduce comments to a minimum, only for cases where the code is not obvious Signed-off-by: S, Deepak <deepak.s@intel.com>

		// batch=2 with feature-axis concat is now correctly fused after removing the
		// overly conservative batch>1 guard (which is only needed for batch-axis concat).

		std::vector<float> d1(16 * 4 * 4, 1.0f);
		std::vector<float> d2(16 * 4 * 4, 2.0f);

Conversation

deepaks2 commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

clee30 commented May 22, 2026

Uh oh!

p-durandin commented May 22, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

clee30 commented May 26, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

deepaks2 commented Jun 2, 2026

Uh oh!

deepaks2 commented Jun 2, 2026

Uh oh!

deepaks2 commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

deepaks2 commented May 21, 2026 •

edited

Loading