disable kv cache broadcast for better performance by nazanin-beheshti · Pull Request #36118 · openvinotoolkit/openvino

nazanin-beheshti · 2026-05-28T18:00:47Z

Details:

The MSFT Orca model uses a different num_head for the query than for the key/value inputs. This introduces a broadcast in the graph, which adds significant performance overhead. This ticket was created to eliminate the broadcast and achieve good conformance for the Orca pipeline.
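To give a rough sense of the overhead described above, the sketch below compares the element count of one key (or value) tensor stored with its native head count against the same tensor expanded to the query head count. The `AttentionDims` struct, both function names, and the layout `[batch, heads, seq_len, head_size]` are assumptions made for illustration; they are not taken from the PR or from the Orca model configuration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical attention configuration; the field names are illustrative only.
struct AttentionDims {
    std::size_t batch;
    std::size_t num_q_heads;
    std::size_t num_kv_heads;
    std::size_t seq_len;
    std::size_t head_size;
};

// Elements in one K (or V) tensor kept at its native head count.
inline std::size_t kv_elements_native(const AttentionDims& d) {
    return d.batch * d.num_kv_heads * d.seq_len * d.head_size;
}

// Elements in one K (or V) tensor after expansion to the query head count.
inline std::size_t kv_elements_broadcast(const AttentionDims& d) {
    // Grouped-query attention assumes the query heads split evenly over KV heads.
    assert(d.num_kv_heads != 0 && d.num_q_heads % d.num_kv_heads == 0);
    return d.batch * d.num_q_heads * d.seq_len * d.head_size;
}
```

With made-up values such as 32 query heads and 8 key/value heads, the expanded tensor holds four times as many elements as the native one, which is the extra data a broadcast node has to materialize.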

Tickets:

CVS-186733

AI Assistance:

AI assistance used: no

Copilot

Pull request overview

This PR aims to improve performance for models using Grouped-Query Attention (e.g., MSFT Orca) by avoiding KV-cache/head broadcast behavior in the generated OpenVINO graph, and adjusting ScaledDotProductAttention (SDPA) shape inference/tests accordingly.

Changes:

Removes KV head “broadcast via concat/reshape” logic from GroupQueryAttentionDecomposition to avoid expanding KV tensors to Query head count.
Updates SDPA shape inference/tests to stop propagating Key/Value batch dims into the output shape (reducing broadcast-driven shape effects).
Updates SDPA type-prop expectations/error-message assertions to match the new shape inference behavior.
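The idea behind dropping the explicit KV expansion can be sketched as follows: instead of copying each key/value head once per query head, an attention kernel can look up the shared KV head by index. This is a minimal standalone illustration, not the code in `GroupQueryAttentionDecomposition`. The function names, the flat `[num_kv_heads, seq_len, head_size]` storage, and the assumption that consecutive query heads share one KV head are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Maps a query head index to the KV head it shares.
// Assumption: consecutive query heads form one group per KV head.
inline std::size_t kv_head_for_query_head(std::size_t q_head,
                                          std::size_t num_q_heads,
                                          std::size_t num_kv_heads) {
    assert(num_kv_heads != 0 && num_q_heads % num_kv_heads == 0);
    assert(q_head < num_q_heads);
    const std::size_t group_size = num_q_heads / num_kv_heads;
    return q_head / group_size;
}

// Dot product of one query vector with one key row, reading K in place.
// K is stored flat as [num_kv_heads, seq_len, head_size] (assumed layout).
inline float qk_score(const std::vector<float>& q_vec,
                      const std::vector<float>& k,
                      std::size_t kv_head,
                      std::size_t pos,
                      std::size_t seq_len,
                      std::size_t head_size) {
    assert(q_vec.size() == head_size);
    const std::size_t offset = (kv_head * seq_len + pos) * head_size;
    assert(offset + head_size <= k.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < head_size; ++i) {
        acc += q_vec[i] * k[offset + i];
    }
    return acc;
}
```

In this sketch K is never duplicated: every query head in a group reads the same rows, so the key/value tensors keep their native head count.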

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `src/core/tests/type_prop/scaled_dot_product_attention.cpp` | Updates SDPA type-prop expected output shapes and thrown-message substrings to match new inference behavior. |
| `src/core/shape_inference/include/scaled_dot_product_attention_shape_inference.hpp` | Modifies SDPA shape validation/inference to avoid broadcast-merging Key/Value leading dims into output; currently contains a Value-shape validation bug. |
| `src/common/transformations/src/transformations/op_conversions/group_query_attention_decomposition.cpp` | Removes explicit KV broadcasting in GQA decomposition to avoid performance overhead from KV expansion. |

Comments suppressed due to low confidence (1)

src/core/shape_inference/include/scaled_dot_product_attention_shape_inference.hpp:50

[HIGH] After removing broadcast_merge_into(...) from Key/Value validation, the shape inference no longer checks that Key/Value leading dimensions are compatible with Query at all (e.g., mismatched batch dimension can now silently pass). If the intent is to support GQA without explicit broadcast nodes, consider reintroducing non-broadcasting compatibility checks for the true batch dims and a GQA head-dimension rule (Q heads divisible by KV heads), similar to FlashAttentionTile shape inference (src/plugins/intel_npu/src/ops/src/intel_npu/ops/flash_attention_tile.cpp:163-215). This would keep type inference rejecting genuinely incompatible shapes while still avoiding KV broadcast.

    const auto& key = input_shapes[1];
    const auto& key_rank = key.rank();
    if (key_rank.is_static()) {
        const bool& key_input_correctness =
            key_rank.get_length() >= 3 && DimType::merge(e_dim, e_dim, *(key.end() - 1));
        NODE_SHAPE_INFER_CHECK(op,
                               input_shapes,
                               key_input_correctness,
                               "Key input shape not compatible with other inputs.");
        s_dim = *(key.end() - 2);

        const bool& value_input_correctness =
-            value_rank.get_length() >= 3 &&
-            TRShape::broadcast_merge_into(n_dims,
-                                          TRShape(std::vector<DimType>(value.begin(), value.end() - 2)),
-                                          AutoBroadcastType::NUMPY) &&
-            DimType::merge(s_dim, s_dim, *(value.end() - 2));
+            key_rank.get_length() >= 3 && DimType::merge(e_dim, e_dim, *(key.end() - 1));
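
The non-broadcasting compatibility check suggested in the comment above could look roughly like the sketch below. It is a simplified standalone model, not OpenVINO code: it assumes fully static rank-4 shapes laid out as `[batch, heads, seq_len, head_size]`, whereas the real shape inference works on partial shapes with dynamic dimensions. The `Shape` alias, the function name `check_query_key_shapes`, and the error strings are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using Shape = std::vector<std::int64_t>;

// Returns an empty string when Query and Key are compatible, otherwise a reason.
// Assumed layout for both inputs: [batch, heads, seq_len, head_size].
inline std::string check_query_key_shapes(const Shape& query, const Shape& key) {
    if (query.size() != 4 || key.size() != 4) {
        return "rank must be 4";
    }
    for (std::size_t i = 0; i < 4; ++i) {
        if (query[i] <= 0 || key[i] <= 0) {
            return "dimensions must be positive";
        }
    }
    // True batch dimensions must match exactly; no broadcasting is allowed.
    if (query[0] != key[0]) {
        return "batch dimension mismatch";
    }
    // GQA rule: every KV head serves an equal-sized group of query heads.
    if (query[1] % key[1] != 0) {
        return "query heads not divisible by key/value heads";
    }
    // The per-head embedding size must agree for the Q x K^T product.
    if (query[3] != key[3]) {
        return "head size mismatch";
    }
    // Invariant implied by the checks above.
    assert(query[1] >= key[1]);
    return {};
}
```

A check of this kind rejects a mismatched batch dimension while still accepting a key tensor with fewer heads than the query, which is the case the broadcast was originally inserted to handle.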


Kotomi-Du · 2026-06-02T16:53:12Z

        std::make_shared<op::v13::ScaledDotProductAttention>(query, key, value, attention_mask, scale, causal);
    EXPECT_EQ(op->get_output_element_type(0), element::f64);
-    EXPECT_EQ(op->get_output_partial_shape(0), (PartialShape{4, 3, {4, 5}, {3, 7}}));
+    EXPECT_EQ(op->get_output_partial_shape(0), (PartialShape{{2, 4}, 3, {4, 5}, {3, 7}}));


How does the broadcast behavior affect a shape that has a range?

Kotomi-Du · 2026-06-02T16:53:49Z

build_jenkins

Kotomi-Du · 2026-06-03T06:13:32Z

        auto op = std::make_shared<op::v13::ScaledDotProductAttention>(query, key, value, attention_mask, causal),
        AssertFailure,
-        testing::HasSubstr("Value input shape not compatible with other inputs."));
+        testing::HasSubstr("Attention mask input shape not compatible with other inputs."));


It seems you added an unintentional change here, which is incorrect.

disable kv cache broadcast for better performance

2149629

nazanin-beheshti requested review from a team as code owners May 28, 2026 18:00

github-actions bot added the `category: Core` (OpenVINO Core, aka ngraph) and `category: transformations` (OpenVINO Runtime library - Transformations) labels May 28, 2026

nazanin-beheshti requested a review from Kotomi-Du May 28, 2026 18:01

nazanin-beheshti added 6 commits May 28, 2026 11:28

code style change

4122a8f

remove variables

6b71aa8

update test units

5ed8f69

format issue

b12ba5c

fix unit tests

e572438

clang format issue

5da2a50

nazanin-beheshti requested a review from Copilot June 2, 2026 15:54

Copilot started reviewing on behalf of nazanin-beheshti June 2, 2026 15:54

Copilot AI reviewed Jun 2, 2026

update value_input_correctness check

73af2b5

Kotomi-Du reviewed Jun 2, 2026

Comment thread src/core/shape_inference/include/scaled_dot_product_attention_shape_inference.hpp

nazanin-beheshti requested a review from mmikolajcz June 2, 2026 17:10

nazanin-beheshti added 2 commits June 2, 2026 16:11

revert unit test changes and broadcast_merge_into

8199420

change code format

cd37f67

Kotomi-Du reviewed Jun 3, 2026

Comment thread src/core/shape_inference/include/scaled_dot_product_attention_shape_inference.hpp

Kotomi-Du reviewed Jun 3, 2026

Comment thread src/core/tests/type_prop/scaled_dot_product_attention.cpp Outdated

Kotomi-Du reviewed Jun 3, 2026

nazanin-beheshti added 3 commits June 3, 2026 08:34

update code format

afd225e

remove extra line

42a4780

fix clang format

10e7308

disable kv cache broadcast for better performance #36118
nazanin-beheshti wants to merge 13 commits into openvinotoolkit:master from nazanin-beheshti:naz/disable-kv-cache-broadcast


3 participants
