
[NPUW] Enable mask skipping for fused flash attention #36077

Open
kkoryun wants to merge 9 commits into openvinotoolkit:master from kkoryun:enable_mask_skipping

Contributor

@kkoryun kkoryun commented May 26, 2026

Details:

This PR adds logic for working with an attention mask:

  • Skip mask operations (in kernel) for middle tiles that are fully filled
  • Use a mask for the final tile and the last line of attention

New subgraph without mask input added.
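
A minimal sketch of the tile classification described above, for illustration only. The helper name `tiles_needing_mask` and its parameters are hypothetical and do not come from the PR; the sketch assumes that KV tile `i` covers positions `[i * tile_size, (i + 1) * tile_size)` and that the final tile always keeps the mask.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the PR): for each KV tile, report whether
// the attention mask still has to be applied.
// Assumption: tile i covers positions [i * tile_size, (i + 1) * tile_size).
std::vector<bool> tiles_needing_mask(std::size_t valid_length,
                                     std::size_t tile_size,
                                     std::size_t num_tiles) {
    assert(tile_size > 0);
    std::vector<bool> needs_mask(num_tiles, false);
    for (std::size_t i = 0; i < num_tiles; ++i) {
        const std::size_t tile_end = (i + 1) * tile_size;
        // A tile is fully filled when all of its positions hold valid tokens.
        const bool fully_filled = tile_end <= valid_length;
        // The final tile keeps the mask regardless of how full it is.
        const bool is_final = (i + 1 == num_tiles);
        needs_mask[i] = is_final || !fully_filled;
    }
    return needs_mask;
}
```

With `valid_length = 10`, `tile_size = 4` and three tiles, the first two tiles are fully filled and could skip the mask, while the final tile keeps it.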

Tickets:

  • E*215657

AI Assistance:

  • AI assistance used: no / yes
  • If yes, summarize how AI was used and what human validation was performed (build/tests/manual checks).

@github-actions github-actions Bot added the category: NPU (OpenVINO NPU plugin) and category: NPUW (NPUW plugin) labels May 26, 2026
@kkoryun kkoryun marked this pull request as ready for review June 3, 2026 15:42
@kkoryun kkoryun requested review from a team as code owners June 3, 2026 15:42
@kkoryun kkoryun requested a review from Copilot June 3, 2026 15:58
Contributor

Copilot AI left a comment


Pull request overview

Enables skipping attention-mask processing for fused Host Flash Attention (HFA) regular tiles by introducing an alternate “no mask input” tiled subgraph and wiring runtime selection between the masked vs no-mask variants.

Changes:

  • Add generation/compilation plumbing for an additional regular-tile HFA model variant without the mask input.
  • Extend HFA runtime selector interface with current_length() to support mask-skipping decisions.
  • Update HFA runtime request setup/execution to optionally use the no-mask regular-tile infer request/model.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/plugins/intel_npu/src/plugin/npuw/host_flash_attention.hpp | Adds storage for a no-mask tile model/compiled model and extends selector API with current_length(). |
| src/plugins/intel_npu/src/plugin/npuw/host_flash_attention.cpp | Builds an optional no-mask tile model (fused path) and implements PositionIDs::current_length(). |
| src/plugins/intel_npu/src/plugin/npuw/compiled_model.cpp | Compiles/dumps the additional no-mask tile model when present. |
| src/plugins/intel_npu/src/plugin/npuw/attn/attn_subgraph.cpp | Creates/shares infer requests for the no-mask model and selects masked vs no-mask execution at runtime. |

Comment on lines 994 to +998
kv_tile_offset,
mask_tile_offset,
tile_size);
tile_size,
false,
use_mask);
Comment on lines +833 to +837
// If the regular tile is not fully filled, need to use the mask
const bool use_mask = (actual_kv_length + 1) % tile_size != 0;
const bool use_no_mask_model = !use_mask && hfa_desc->_compiled_tile_no_mask_model;
auto& regular_tile_request =
use_no_mask_model ? state.hfa_requests.infer_requests[HFARequestSet::REGULAR_TILE_NO_MASK]
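
The selection in the excerpt above can be sketched in isolation as follows. This is a simplified illustration, not the PR's code: `RegularTileVariant` and `select_regular_tile_variant` are hypothetical names, and the boolean `has_no_mask_model` stands in for the check on `hfa_desc->_compiled_tile_no_mask_model`. Only the condition `(actual_kv_length + 1) % tile_size != 0` is taken from the diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical enum for illustration; the PR indexes infer requests instead.
enum class RegularTileVariant { WithMask, NoMask };

// Hypothetical helper mirroring the condition quoted in the review excerpt.
RegularTileVariant select_regular_tile_variant(std::int64_t actual_kv_length,
                                               std::int64_t tile_size,
                                               bool has_no_mask_model) {
    assert(tile_size > 0 && actual_kv_length >= 0);
    // If the regular tile is not fully filled, the mask is still needed.
    const bool use_mask = (actual_kv_length + 1) % tile_size != 0;
    // Fall back to the masked variant when no no-mask model was compiled.
    const bool use_no_mask_model = !use_mask && has_no_mask_model;
    return use_no_mask_model ? RegularTileVariant::NoMask
                             : RegularTileVariant::WithMask;
}
```

Under these assumptions, `select_regular_tile_variant(7, 4, true)` picks the no-mask variant, while the same call with `has_no_mask_model = false` or with `actual_kv_length = 6` falls back to the masked one.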
@dmatveev dmatveev added this to the 2026.3 milestone Jun 3, 2026
Contributor

@esmirno esmirno left a comment


Overall LGTM, but it is better to always add tests; otherwise it is not clear what this change is fixing.

REGULAR_TILE = 0,
FINAL_TILE = 1,
COUNT = 2,
REGULAR_TILE_MASK = 0,
Contributor

If it is a regular tile with a mask, it may be better to have TILE_WITH_MASK and REGULAR_TILE.

state.hfa_requests.pipeline_requests[HFARequestSet::REGULAR_TILE] =
state.hfa_requests.pipeline_requests[HFARequestSet::REGULAR_TILE_MASK] =
hfa->_compiled_tile_model->create_infer_request();
if (hfa->_compiled_tile_no_mask_model) {
Contributor

This is a clear place to spot a problem: the initial no_mask_model might refer to a missing model, so it is better to have compiled_tile and compiled_tile_with_mask.

// ========================================================================
HostFlashAttention hfa;
hfa._tile_model = tile_model;
if (fused_flash_attention) {
Contributor

Do we have any tests for HFA? I think @intelgaoxiong introduced one. It is interesting that none of them fail with such a change. Could you please add some tests that show the mask behavior?
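
As a rough idea of what such a test could check, here is a self-contained reference sketch, not tied to the NPUW test infrastructure. It assumes an additive mask convention (0 keeps a position, negative infinity drops it) and that at least one position per row stays unmasked; `softmax_row` and `max_abs_diff` are hypothetical helpers. The property is that an all-zero mask over a fully filled tile gives the same result as no mask at all, while a mask with padded positions does not.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference: softmax over one row of attention scores.
// If mask is non-null, it is added to the scores first (assumed additive
// convention: 0 keeps a position, -infinity drops it).
std::vector<float> softmax_row(const std::vector<float>& scores,
                               const std::vector<float>* mask) {
    assert(!scores.empty());
    assert(mask == nullptr || mask->size() == scores.size());
    std::vector<float> x(scores);
    if (mask != nullptr) {
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += (*mask)[i];
        }
    }
    // Subtract the row maximum for numerical stability.
    const float max_v = *std::max_element(x.begin(), x.end());
    float sum = 0.0f;
    for (float& v : x) {
        v = std::exp(v - max_v);
        sum += v;
    }
    for (float& v : x) {
        v /= sum;
    }
    return x;
}

// Largest element-wise absolute difference between two equally sized rows.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float d = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        d = std::max(d, std::fabs(a[i] - b[i]));
    }
    return d;
}
```

A test built on this would compare `softmax_row(scores, nullptr)` against `softmax_row(scores, &zero_mask)` for a full tile, and then confirm that a mask with negative infinity in the padded positions zeroes those outputs.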


4 participants