[NPUW] Add block-based KV cache support for HFA and Pyramid attention #35014

Open
intelgaoxiong wants to merge 4 commits into openvinotoolkit:master from intelgaoxiong:xiong/block_kv_pr3_hfa_decouple

Contributor

@intelgaoxiong intelgaoxiong commented Mar 29, 2026

Details:

What this PR does

Extends Host Flash Attention (HFA) and Pyramid Attention to operate with the block-split KV cache produced by SplitKVCacheIntoBlocks. After that transformation, a single past_key / past_value parameter is replaced by N independent block parameters, each feeding one input of a multi-input Concat node. Both attention backends must detect this new layout and adapt their compile-time model shaping and inference-time tensor dispatch accordingly.

This is Part 3/4 of the block-based KV cache feature series:

| Part | PR | Description |
|------|----|-------------|
| 1/4 | #35012 | SplitKVCacheIntoBlocks graph transformation |
| 2/4 | #35013 | KVCacheBlockManager: block allocation and lifecycle |
| 3/4 | this PR | HFA and Pyramid attention: block-KV compilation & runtime |
| 4/4 | #35018 | LLM inference pipeline integration |

Changes by section

Section 1 — Shared infrastructure

  • util.hpp/cpp: rename isPastKeyValues{Key,Value} → isPastKeyParam / isPastValueParam; add …Contiguous variants that match only single-parameter (non-block) names. Both HFA and Pyramid use them to distinguish contiguous from block-split layouts.

  • sdpa_utils.hpp/cpp (new file): extracts SDPAPatternNodes (holding vector<…> past_key_param_nodes / past_value_param_nodes for 1 or N elements), find_sdpa_pattern_nodes(), and find_mask_parameter(). Previously duplicated between HFA and Pyramid; now shared.

  • attention.hpp: extend function::Attention with past_key/value_block_variant_param_indices (ordered by Concat input); extend compiled::SDPAIndices with past_key_blocks / past_value_blocks vectors.
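
As an illustration of how a general predicate and its …Contiguous counterpart can separate the two layouts, here is a minimal sketch. The parameter naming scheme, the function names, and the `is_block_split` helper are all assumptions made for this example; the PR's actual name patterns are not shown in this description.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <string_view>
#include <vector>

// ASSUMPTION: this naming scheme is invented for illustration only.
//   contiguous:  "past_key_values.<layer>.key"
//   block-split: "past_key_values.<layer>.key_block_<n>"
constexpr std::string_view kPrefix = "past_key_values.";
constexpr std::string_view kKeySuffix = ".key";
constexpr std::string_view kBlockTag = ".key_block_";

inline bool starts_with(std::string_view s, std::string_view p) {
    return s.size() >= p.size() && s.substr(0, p.size()) == p;
}

inline bool ends_with(std::string_view s, std::string_view p) {
    return s.size() >= p.size() && s.substr(s.size() - p.size()) == p;
}

// Matches only the single-parameter (non-block) form.
inline bool is_past_key_contiguous(std::string_view name) {
    return starts_with(name, kPrefix) && ends_with(name, kKeySuffix);
}

// Matches the block-split form: the block tag followed by at least one digit.
inline bool is_past_key_block(std::string_view name) {
    if (!starts_with(name, kPrefix)) return false;
    const auto pos = name.rfind(kBlockTag);
    if (pos == std::string_view::npos) return false;
    const auto digits = name.substr(pos + kBlockTag.size());
    if (digits.empty()) return false;
    for (unsigned char c : digits) {
        if (!std::isdigit(c)) return false;
    }
    return true;
}

// Matches either layout.
inline bool is_past_key(std::string_view name) {
    return is_past_key_contiguous(name) || is_past_key_block(name);
}

// Hypothetical helper: a layout is block-split when there is more than one
// past-key parameter, or when the only one does not look contiguous.
inline bool is_block_split(const std::vector<std::string>& past_key_params) {
    assert(!past_key_params.empty());
    return past_key_params.size() > 1 || !is_past_key_contiguous(past_key_params.front());
}
```

With these assumed names, `is_block_split({"past_key_values.3.key"})` reports a contiguous layout, while a single `past_key_values.3.key_block_0` parameter is still classified as block-split.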


Section 2 — Host Flash Attention

Compilation (host_flash_attention.cpp):

  • build_sdpa_param_mapping() now iterates all Concat inputs (not only the first) to discover and record every block-parameter index into _past_key_block_indices / _past_value_block_indices.
  • These indices are promoted into compiled::SDPAIndices.past_key_blocks / past_value_blocks for use at inference time.
  • The tile models (HFA_Tile, HFA_Final_Tile) are generated once and remain layout-agnostic; only the tensor sourcing strategy differs at inference time.

Inference (attn_subgraph.cpp):

  • Contiguous KV: a single concatenated KV tensor; the tile loop iterates context_size / tile_size offsets, slicing or zero-copy-viewing into the tensor at each kv_offset.
  • Block KV: loop over past_key_blocks (one entry per block tensor); the final tile uses present_key_tensor. All tiles are dispatched through the same process_tile lambda regardless of source.
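
The two sourcing strategies can be sketched as follows. This is a simplified model rather than the code in attn_subgraph.cpp: `TileView`, `run_contiguous`, and `run_blocks` are hypothetical names, plain `std::vector<float>` buffers stand in for tensors, and the sketch assumes one tile per block and a context length that is a whole number of tiles.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Non-owning view over `rows` KV rows; stands in for a tensor view.
struct TileView {
    const float* data;
    std::size_t rows;
};

// The single tile handler shared by both layouts.
using ProcessTile = std::function<void(const TileView&)>;

// Contiguous layout: walk one buffer in tile-sized steps and view it at
// each kv_offset without copying.
inline void run_contiguous(const std::vector<float>& past_kv,
                           std::size_t row_width,
                           std::size_t tile_rows,
                           const ProcessTile& process_tile) {
    assert(row_width > 0 && tile_rows > 0);
    const std::size_t context_rows = past_kv.size() / row_width;
    assert(context_rows % tile_rows == 0);  // sketch handles whole tiles only
    for (std::size_t kv_offset = 0; kv_offset < context_rows; kv_offset += tile_rows) {
        process_tile(TileView{past_kv.data() + kv_offset * row_width, tile_rows});
    }
}

// Block layout: ASSUMPTION one tile per block tensor, followed by a final
// tile sourced from the present KV tensor.
inline void run_blocks(const std::vector<std::vector<float>>& past_blocks,
                       const std::vector<float>& present_kv,
                       std::size_t row_width,
                       const ProcessTile& process_tile) {
    assert(row_width > 0);
    for (const auto& block : past_blocks) {
        process_tile(TileView{block.data(), block.size() / row_width});
    }
    process_tile(TileView{present_kv.data(), present_kv.size() / row_width});
}
```

Because both functions hand the same kind of view to the same callback, the per-tile computation does not need to know which layout produced the data, which is the property the tile models rely on.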

Section 3 — Pyramid Attention

Compilation (pyramid_attention.cpp):

  • Block-split detection: is_block_split is true when past_key_param_nodes.size() > 1 or the parameter name does not match isPastKeyParamContiguous().
  • Contiguous KV, for each pyramid step: clone model → set KV param shapes to current_past_length → reshape() → validate_nodes_and_infer_types().
  • Block KV, for each pyramid step: clone model → call shrink_concat_inputs() to keep exactly model_idx past block inputs in the Concat (model[0] → 0 blocks, model[1] → 1 block, …, model[k] → k blocks) → patch_broadcast_constants + patch_reshape_constants → validate_nodes_and_infer_types → collect_concat_block_indices to populate past_key/value_block_variant_param_indices. Precompute past_key_block_port_map (global index → variant port) and past_key_block_port_set for O(1) lookup at inference time.
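
The per-variant lookup tables described for the block path might be built along these lines. The sketch is illustrative only: `VariantBlockPorts` and `build_variant_ports` are hypothetical names, and it assumes that variant k keeps the first k blocks in Concat order and that block ports are numbered consecutively from `first_block_port`. Neither detail is confirmed by this description.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Per-variant lookup tables (hypothetical layout).
struct VariantBlockPorts {
    std::unordered_map<std::size_t, std::size_t> port_map;  // global block index -> variant-local port
    std::unordered_set<std::size_t> port_set;               // variant-local ports fed by past blocks
};

// ASSUMPTIONS: variant k keeps the first k blocks in Concat order, and its
// block ports are numbered consecutively starting at first_block_port.
inline std::vector<VariantBlockPorts> build_variant_ports(std::size_t num_variants,
                                                          std::size_t first_block_port) {
    assert(num_variants > 0);
    std::vector<VariantBlockPorts> variants(num_variants);
    for (std::size_t model_idx = 0; model_idx < num_variants; ++model_idx) {
        auto& variant = variants[model_idx];
        // model[0] gets no entries, model[1] gets one, ..., model[k] gets k.
        for (std::size_t global_idx = 0; global_idx < model_idx; ++global_idx) {
            const std::size_t port = first_block_port + global_idx;
            variant.port_map.emplace(global_idx, port);
            variant.port_set.insert(port);
        }
    }
    return variants;
}
```

Building both containers once at compile time means the inference path only performs hash lookups, which matches the stated goal of O(1) dispatch per block.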

Inference (attn_subgraph.cpp, just_sync_infer_request.cpp):

  • Contiguous KV: bind_function_input() calls util::view(tensor, param.dim, 0, past_len) to present each pyramid variant with a correctly sized KV slice; the mask is rebuilt per variant in prologue().
  • Block KV:
    • Setup: alias_block_slots() pre-wires all global block-slot ports on the main request to block_0's buffer as a placeholder, so earlier generic binding code never touches unallocated slots.
    • Binding: bind_function_input() → try_bind_block() looks up past_key_block_port_map[global_idx] to dispatch each incoming block tensor to the correct variant-local port. Variants that expose no port for a given block index (e.g. model[0] with 0 past blocks) consume the call silently without set_tensor.
    • just_sync_infer_request.cpp: share_kv_block_buffers() shares block KV buffers across pyramid variant sub-requests to avoid redundant allocations.
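
The placeholder aliasing and port dispatch for block KV can be modeled roughly as below. This is a sketch under stated assumptions, not the PR's implementation: `Request` and `Buffer` are stand-ins for an infer request and its tensors, the functions reuse the names from the description but their signatures are invented, and returning `bool` from `try_bind_block` to signal whether a tensor was actually set is a choice made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

using Buffer = const float*;  // stand-in for a tensor handle

struct Request {
    std::vector<Buffer> ports;  // one bound buffer per input port; nullptr means unbound
};

// Point every block-slot port at block 0's buffer so that generic binding
// code never sees an unallocated slot.
inline void alias_block_slots(Request& main_request,
                              const std::vector<std::size_t>& block_slot_ports,
                              Buffer block0) {
    assert(block0 != nullptr);
    for (std::size_t port : block_slot_ports) {
        assert(port < main_request.ports.size());
        main_request.ports[port] = block0;
    }
}

// Dispatch one incoming block to the variant-local port for global_idx.
// Returns true if a buffer was set; false if this variant exposes no port
// for that block, in which case the call is consumed without binding.
inline bool try_bind_block(Request& variant,
                           const std::unordered_map<std::size_t, std::size_t>& block_port_map,
                           std::size_t global_idx,
                           Buffer block) {
    const auto it = block_port_map.find(global_idx);
    if (it == block_port_map.end()) {
        return false;
    }
    assert(it->second < variant.ports.size());
    variant.ports[it->second] = block;
    return true;
}
```

A variant with an empty port map, such as model[0] in the description, returns `false` for every block and leaves its ports untouched.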

Other:

  • base_sync_infer_request.cpp: replace scalar past_key / past_value checks with an is_past_kv() lambda; add block_mode + bind_block_ports() lambda in bind_pyramid_attention_inputs().
  • partitioning/patterns/sdpa.cpp: relax the Concat input-count guard to allow multi-block inputs.

Tickets:

AI Assistance:

  • AI assistance used: no / yes
  • If yes, summarize how AI was used and what human validation was performed (build/tests/manual checks).

@github-actions github-actions Bot added category: NPU OpenVINO NPU plugin category: NPUW NPUW plugin labels Mar 29, 2026
@intelgaoxiong intelgaoxiong force-pushed the xiong/block_kv_pr3_hfa_decouple branch from bcef02a to 88c1504 Compare May 27, 2026 06:45
@github-actions github-actions Bot added the category: build OpenVINO cmake script / infra label May 27, 2026
@intelgaoxiong intelgaoxiong marked this pull request as ready for review May 27, 2026 06:49
@intelgaoxiong intelgaoxiong requested review from a team as code owners May 27, 2026 06:49
@intelgaoxiong intelgaoxiong force-pushed the xiong/block_kv_pr3_hfa_decouple branch from 88c1504 to f1f9080 Compare May 27, 2026 07:38
dylanneve1 added a commit to dylanneve1/openvino that referenced this pull request May 27, 2026
@intelgaoxiong intelgaoxiong force-pushed the xiong/block_kv_pr3_hfa_decouple branch 2 times, most recently from 86b3295 to 2a50694 Compare June 2, 2026 12:59
Refactor the SDPA index structures and attention metadata to accommodate
block-based KV cache layouts where past_key/value are split into N fixed-
size block tensors instead of a single contiguous buffer.

Key changes:
- attention.hpp: rename SDPAIndices past_key/value -> past_key_blocks/
  past_value_blocks (vector<size_t>); extend PyramidAttentionInfo with
  per-variant block port sets and global param index lists
- sdpa_utils.cpp/hpp: new shared helpers extracted from pyramid_attention
  and host_flash_attention (build_sdpa_param_mapping, etc.)
- host_flash_attention: use block index loop in build_sdpa_param_mapping
- pyramid_attention: add is_block_split path; shrink_concat_inputs /
  collect_concat_block_indices helpers; populate block port metadata
- attn/attn_subgraph.cpp: alias block slots on main request; name-based
  input sharing for pyramid variants in block mode
- partitioning/patterns/sdpa.cpp: relax Concat input-count check to allow
  variable number of block inputs
- serialization.cpp + test: update field names to match SDPAIndices rename

Related-to: EISW-206740

Fixed clang-format.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>
@intelgaoxiong intelgaoxiong force-pushed the xiong/block_kv_pr3_hfa_decouple branch from 2a50694 to 3eaa3f4 Compare June 2, 2026 22:55