[NPUW] Add block-based KV cache support for HFA and Pyramid attention by intelgaoxiong · Pull Request #35014 · openvinotoolkit/openvino

intelgaoxiong · 2026-03-29T08:35:09Z

Details:

What this PR does

Extends Host Flash Attention (HFA) and Pyramid Attention to operate with the block-split KV cache produced by SplitKVCacheIntoBlocks. After that transformation, a single past_key / past_value parameter is replaced by N independent block parameters, each feeding one input of a multi-input Concat node. Both attention backends must detect this new layout and adapt their compile-time model shaping and inference-time tensor dispatch accordingly.

This is Part 3/4 of the block-based KV cache feature series:

| Part | PR | Description |
|------|----|-------------|
| 1/4 | #35012 | `SplitKVCacheIntoBlocks` graph transformation |
| 2/4 | #35013 | `KVCacheBlockManager` — block allocation and lifecycle |
| 3/4 | this PR | HFA and Pyramid attention: block-KV compilation & runtime |
| 4/4 | #35018 | LLM inference pipeline integration |

Changes by section

Section 1 — Shared infrastructure

util.hpp/cpp — rename isPastKeyValues{Key,Value} → isPastKeyParam / isPastValueParam; add …Contiguous variants that match only single-parameter (non-block) names; used by both HFA and Pyramid to distinguish contiguous from block-split layouts.
sdpa_utils.hpp/cpp (new file) — extracts SDPAPatternNodes (holding vector<…> past_key_param_nodes / past_value_param_nodes for 1 or N elements), find_sdpa_pattern_nodes(), and find_mask_parameter(). Previously duplicated between HFA and Pyramid; now shared.
attention.hpp — extend function::Attention with past_key/value_block_variant_param_indices (ordered by Concat input); extend compiled::SDPAIndices with past_key_blocks / past_value_blocks vectors.
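
The contiguous-versus-block distinction that these shared helpers support can be sketched as follows. This is a minimal illustration with stand-in types, not the PR's code: `SdpaPatternSketch`, `is_past_key_contiguous_sketch`, and the parameter naming convention (names starting with `past_key_values` and ending in `.key` for the contiguous case) are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for SDPAPatternNodes: holds parameter names
// instead of graph nodes. One entry means contiguous, N entries mean blocks.
struct SdpaPatternSketch {
    std::vector<std::string> past_key_param_names;
    std::vector<std::string> past_value_param_names;
};

// True if `name` ends with `suffix`.
inline bool ends_with(const std::string& name, const std::string& suffix) {
    return name.size() >= suffix.size() &&
           name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// ASSUMED naming convention (not taken from the PR): a contiguous past-key
// parameter starts with "past_key_values" and ends with ".key".
inline bool is_past_key_contiguous_sketch(const std::string& name) {
    return name.rfind("past_key_values", 0) == 0 && ends_with(name, ".key");
}

// Block-split if there is more than one past-key parameter, or if the single
// parameter does not carry a contiguous-style name.
inline bool is_block_split_sketch(const SdpaPatternSketch& p) {
    assert(!p.past_key_param_names.empty());
    return p.past_key_param_names.size() > 1 ||
           !is_past_key_contiguous_sketch(p.past_key_param_names.front());
}
```

The name check matters for the single-block case: a model split into exactly one block still has one past-key parameter, so the count alone cannot tell it apart from the contiguous layout.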

Section 2 — Host Flash Attention

Compilation (host_flash_attention.cpp):

build_sdpa_param_mapping() now iterates all Concat inputs (not only the first) to discover and record every block-parameter index into _past_key_block_indices / _past_value_block_indices.
These indices are promoted into compiled::SDPAIndices.past_key_blocks / past_value_blocks for use at inference time.
The tile models (HFA_Tile, HFA_Final_Tile) are generated once and remain layout-agnostic; only the tensor sourcing strategy differs at inference time.

Inference (attn_subgraph.cpp):

Contiguous KV: single concatenated KV tensor; the tile loop iterates context_size / tile_size offsets, slicing or zero-copy-viewing into the tensor at each kv_offset.
Block KV: the tile loop iterates over past_key_blocks (one entry per block tensor); the final tile uses present_key_tensor. All tiles are dispatched through the same process_tile lambda regardless of source.
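
The two sourcing strategies can be sketched as two loops feeding one callback. This is a simplified model under stated assumptions: `TileSource`, `run_contiguous`, and `run_blocks` are hypothetical names, tensors are reduced to offsets and indices, and the context size is assumed to be a multiple of the tile size.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical description of where one tile's K/V data comes from.
struct TileSource {
    enum class Kind { ContiguousSlice, PastBlock, Present };
    Kind kind;
    std::size_t index;  // kv_offset for a slice, parameter index for a past block, unused for Present
};

// Stand-in for the shared process_tile lambda.
using ProcessTile = std::function<void(const TileSource&)>;

// Contiguous layout: walk context_size / tile_size offsets into one tensor.
inline void run_contiguous(std::size_t context_size, std::size_t tile_size,
                           const ProcessTile& process_tile) {
    assert(tile_size > 0 && context_size % tile_size == 0);
    for (std::size_t kv_offset = 0; kv_offset < context_size; kv_offset += tile_size) {
        process_tile(TileSource{TileSource::Kind::ContiguousSlice, kv_offset});
    }
}

// Block layout: one tile per past block, then a final tile from the present tensor.
inline void run_blocks(const std::vector<std::size_t>& past_key_blocks,
                       const ProcessTile& process_tile) {
    for (std::size_t block_param : past_key_blocks) {
        process_tile(TileSource{TileSource::Kind::PastBlock, block_param});
    }
    process_tile(TileSource{TileSource::Kind::Present, 0});
}
```

Because both loops call the same callback, the tile models themselves do not need to know which layout produced their inputs, which matches the layout-agnostic tile models described above.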

Section 3 — Pyramid Attention

Compilation (pyramid_attention.cpp):

Block-split detection — is_block_split = (past_key_param_nodes.size() > 1) or the past-key parameter name does not match isPastKeyParamContiguous().
Contiguous KV — for each pyramid step: clone model → set KV param shapes to current_past_length → reshape() → validate_nodes_and_infer_types().
Block KV — for each pyramid step: clone model → call shrink_concat_inputs() to keep exactly model_idx past block inputs in the Concat (model[0] → 0 blocks, model[1] → 1 block, …, model[k] → k blocks) → patch_broadcast_constants + patch_reshape_constants → validate_nodes_and_infer_types → collect_concat_block_indices to populate past_key/value_block_variant_param_indices. Precompute past_key_block_port_map (global index → variant port) and past_key_block_port_set for O(1) lookup at inference time.
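
The per-variant block metadata from the block-KV path can be sketched as below. This is an illustration only: `VariantBlockPorts`, `make_variant`, and `make_pyramid` are hypothetical names, the sketch assumes each variant keeps the first `model_idx` blocks, and `first_port` is an assumed offset standing in for wherever the block parameters happen to sit in a variant's parameter list.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical per-variant metadata, loosely mirroring the fields named in the PR.
struct VariantBlockPorts {
    std::vector<std::size_t> block_param_indices;           // ordered by Concat input
    std::unordered_map<std::size_t, std::size_t> port_map;  // global block index -> variant port
    std::unordered_set<std::size_t> port_set;                // ports that carry past blocks
};

// Build metadata for variant `model_idx`, which keeps exactly `model_idx` past blocks.
// ASSUMPTION: the kept blocks are blocks 0..model_idx-1 and occupy consecutive
// ports starting at `first_port`.
inline VariantBlockPorts make_variant(std::size_t model_idx, std::size_t first_port) {
    VariantBlockPorts v;
    for (std::size_t block = 0; block < model_idx; ++block) {
        const std::size_t port = first_port + block;
        v.block_param_indices.push_back(port);
        v.port_map.emplace(block, port);
        v.port_set.insert(port);
    }
    return v;
}

// model[0] has 0 blocks, model[1] has 1 block, ..., model[k] has k blocks.
inline std::vector<VariantBlockPorts> make_pyramid(std::size_t num_variants,
                                                   std::size_t first_port) {
    std::vector<VariantBlockPorts> pyramid;
    pyramid.reserve(num_variants);
    for (std::size_t k = 0; k < num_variants; ++k) {
        pyramid.push_back(make_variant(k, first_port));
    }
    return pyramid;
}
```

Precomputing both the map and the set trades a little memory per variant for constant-time lookups on the inference path, which is the motivation the description gives for these fields.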

Inference (attn_subgraph.cpp, just_sync_infer_request.cpp):

Contiguous KV: bind_function_input() calls util::view(tensor, param.dim, 0, past_len) to present each pyramid variant with a correctly sized KV slice; the mask is rebuilt per variant in prologue().
Block KV:
- Setup: alias_block_slots() pre-wires all global block-slot ports on the main request to block_0's buffer as a placeholder, so earlier generic binding code never touches unallocated slots.
- Binding: bind_function_input() → try_bind_block() looks up past_key_block_port_map[global_idx] to dispatch each incoming block tensor to the correct variant-local port. Variants that expose no port for a given block index (e.g. model[0] with 0 past blocks) consume the call silently without set_tensor.
- just_sync_infer_request.cpp: share_kv_block_buffers() shares block KV buffers across pyramid variant sub-requests to avoid redundant allocations.
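
The block-binding step above can be sketched as a map lookup with a silent fallthrough. This is a simplified model, not the PR's implementation: `VariantRequestSketch` and `try_bind_block_sketch` are hypothetical, tensors are represented by string labels, and writing into `bound` stands in for the real set_tensor call.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for one pyramid variant's sub-request.
struct VariantRequestSketch {
    std::unordered_map<std::size_t, std::size_t> past_key_block_port_map;  // global idx -> port
    std::map<std::size_t, std::string> bound;                              // port -> tensor label
};

// Dispatch one incoming block tensor to the variant-local port, if the variant has one.
// Returns true if a tensor was bound. A false return is not an error: the variant
// simply exposes no port for this block, so the call is consumed without binding.
inline bool try_bind_block_sketch(VariantRequestSketch& req, std::size_t global_idx,
                                  const std::string& tensor_label) {
    const auto it = req.past_key_block_port_map.find(global_idx);
    if (it == req.past_key_block_port_map.end()) {
        return false;  // e.g. model[0], which keeps no past blocks
    }
    req.bound[it->second] = tensor_label;  // stands in for set_tensor on the variant port
    return true;
}
```

With this shape, the caller can offer every global block to every variant and let each variant pick up only the blocks it actually exposes.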

Other:

base_sync_infer_request.cpp — replace scalar past_key / past_value checks with an is_past_kv() lambda; add block_mode + bind_block_ports() lambda in bind_pyramid_attention_inputs().
partitioning/patterns/sdpa.cpp — relax Concat input-count guard to allow multi-block inputs.

Tickets:

EISW-206740

AI Assistance:

AI assistance used: no / yes
If yes, summarize how AI was used and what human validation was performed (build/tests/manual checks).

…ock-aware HFA + Pyramid

Refactor the SDPA index structures and attention metadata to accommodate block-based KV cache layouts where past_key/value are split into N fixed-size block tensors instead of a single contiguous buffer.

Key changes:
- attention.hpp: rename SDPAIndices past_key/value -> past_key_blocks/past_value_blocks (vector<size_t>); extend PyramidAttentionInfo with per-variant block port sets and global param index lists
- sdpa_utils.cpp/hpp: new shared helpers extracted from pyramid_attention and host_flash_attention (build_sdpa_param_mapping, etc.)
- host_flash_attention: use block index loop in build_sdpa_param_mapping
- pyramid_attention: add is_block_split path; shrink_concat_inputs / collect_concat_block_indices helpers; populate block port metadata
- attn/attn_subgraph.cpp: alias block slots on main request; name-based input sharing for pyramid variants in block mode
- partitioning/patterns/sdpa.cpp: relax Concat input-count check to allow variable number of block inputs
- serialization.cpp + test: update field names to match SDPAIndices rename

Related-to: EISW-206740

Fixed clang-format.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

github-actions Bot added category: NPU OpenVINO NPU plugin category: NPUW NPUW plugin labels Mar 29, 2026

intelgaoxiong force-pushed the xiong/block_kv_pr3_hfa_decouple branch from bcef02a to 88c1504 Compare May 27, 2026 06:45

github-actions Bot added the category: build OpenVINO cmake script / infra label May 27, 2026

intelgaoxiong marked this pull request as ready for review May 27, 2026 06:49

intelgaoxiong requested review from a team as code owners May 27, 2026 06:49

intelgaoxiong force-pushed the xiong/block_kv_pr3_hfa_decouple branch from 88c1504 to f1f9080 Compare May 27, 2026 07:38

dylanneve1 added a commit to dylanneve1/openvino that referenced this pull request May 27, 2026

Merge xiong/block_kv_pr3_hfa_decouple (PR openvinotoolkit#35014) — Bl…

c7dd7c7

…ock-aware HFA + Pyramid

intelgaoxiong force-pushed the xiong/block_kv_pr3_hfa_decouple branch 2 times, most recently from 86b3295 to 2a50694 Compare June 2, 2026 12:59

intelgaoxiong added 4 commits June 2, 2026 15:55

Refine HFA and PyramidAttn.

8744fa4

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

Refine serialization.

f42d198

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

Fix for new line at EOF.

3eaa3f4

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

intelgaoxiong force-pushed the xiong/block_kv_pr3_hfa_decouple branch from 2a50694 to 3eaa3f4 Compare June 2, 2026 22:55

intelgaoxiong wants to merge 4 commits into openvinotoolkit:master from intelgaoxiong:xiong/block_kv_pr3_hfa_decouple
