[NPUW] Enable mask skipping for fused flash attention by kkoryun · Pull Request #36077 · openvinotoolkit/openvino

kkoryun · 2026-05-26T18:00:47Z

Details:

This PR adds logic for working with an attention mask:

Skip mask operations (in kernel) for middle tiles that are fully filled
Use a mask for the final tile and the last line of attention

A new subgraph without a mask input is added.
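The tile classification described above can be sketched roughly as follows. This is a simplified illustration, not code from the PR: `tiles_needing_mask` is a hypothetical helper, and it only models padding at the end of the KV range (it ignores the "last line of attention" case mentioned in the description).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the PR: for a KV range holding `valid_len`
// tokens split into tiles of `tile_size`, report which tiles need the mask.
std::vector<bool> tiles_needing_mask(std::size_t valid_len, std::size_t tile_size) {
    assert(tile_size > 0);
    // Round up so that a partially filled last tile is counted.
    const std::size_t num_tiles = (valid_len + tile_size - 1) / tile_size;
    std::vector<bool> needs_mask(num_tiles, false);
    for (std::size_t t = 0; t < num_tiles; ++t) {
        const std::size_t tile_end = (t + 1) * tile_size;
        // A tile is fully filled when every position in it holds a valid token;
        // only a tile that extends past `valid_len` needs masking.
        needs_mask[t] = tile_end > valid_len;
    }
    return needs_mask;
}
```

With 10 valid tokens and a tile size of 4, the first two tiles are fully filled and only the third would need the mask under this simplified model.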

Tickets:

E*215657

AI Assistance:

AI assistance used: no / yes
If yes, summarize how AI was used and what human validation was performed (build/tests/manual checks).

Copilot

Pull request overview

Enables skipping attention-mask processing for fused Host Flash Attention (HFA) regular tiles by introducing an alternate “no mask input” tiled subgraph and wiring runtime selection between the masked vs no-mask variants.

Changes:

Add generation/compilation plumbing for an additional regular-tile HFA model variant without the mask input.
Extend HFA runtime selector interface with current_length() to support mask-skipping decisions.
Update HFA runtime request setup/execution to optionally use the no-mask regular-tile infer request/model.
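A minimal sketch of what a selector exposing `current_length()` might look like is given below. The real interface in `host_flash_attention.hpp` is not shown on this page, so `SelectorSketch`, `PositionIdsSelectorSketch`, and the rule "length equals last position id plus one" are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the HFA runtime selector interface.
struct SelectorSketch {
    virtual ~SelectorSketch() = default;
    // Number of valid KV positions for the current inference step.
    virtual std::int64_t current_length() const = 0;
};

// Assumption: the length is derived from the last position id of the request.
class PositionIdsSelectorSketch final : public SelectorSketch {
public:
    explicit PositionIdsSelectorSketch(std::vector<std::int64_t> position_ids)
        : m_position_ids(std::move(position_ids)) {}

    std::int64_t current_length() const override {
        // Position ids are assumed zero-based, so last id + 1 is the length.
        return m_position_ids.empty() ? 0 : m_position_ids.back() + 1;
    }

private:
    std::vector<std::int64_t> m_position_ids;
};
```

Keeping the length query on the selector means the runtime can decide per step whether a tile is fully filled without inspecting the mask tensor itself.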

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/plugins/intel_npu/src/plugin/npuw/host_flash_attention.hpp | Adds storage for a no-mask tile model/compiled model and extends selector API with `current_length()`. |
| src/plugins/intel_npu/src/plugin/npuw/host_flash_attention.cpp | Builds an optional no-mask tile model (fused path) and implements `PositionIDs::current_length()`. |
| src/plugins/intel_npu/src/plugin/npuw/compiled_model.cpp | Compiles/dumps the additional no-mask tile model when present. |
| src/plugins/intel_npu/src/plugin/npuw/attn/attn_subgraph.cpp | Creates/shares infer requests for the no-mask model and selects masked vs no-mask execution at runtime. |

                                         kv_tile_offset,
                                         mask_tile_offset,
-                                         tile_size);
+                                         tile_size,
+                                         false,
+                                         use_mask);


+                        // If the regular tile is not fully filled, need to use the mask
+                        const bool use_mask = (actual_kv_length + 1) % tile_size != 0;
+                        const bool use_no_mask_model = !use_mask && hfa_desc->_compiled_tile_no_mask_model;
+                        auto& regular_tile_request =
+                            use_no_mask_model ? state.hfa_requests.infer_requests[HFARequestSet::REGULAR_TILE_NO_MASK]


esmirno

Overall LGTM, but it is better to always add tests; otherwise it is not clear what this change is fixing.

esmirno · 2026-06-03T21:52:20Z

-        REGULAR_TILE = 0,
-        FINAL_TILE = 1,
-        COUNT = 2,
+        REGULAR_TILE_MASK = 0,


If this is a regular tile with a mask, it may be better to have TILE_WITH_MASK and REGULAR_TILE.

esmirno · 2026-06-03T21:56:05Z

-        state.hfa_requests.pipeline_requests[HFARequestSet::REGULAR_TILE] =
+        state.hfa_requests.pipeline_requests[HFARequestSet::REGULAR_TILE_MASK] =
            hfa->_compiled_tile_model->create_infer_request();
+        if (hfa->_compiled_tile_no_mask_model) {


This is a clear place to spot a problem: at first glance, no_mask_model might read as referring to a missing model, so it would be better to have compiled_tile and compiled_tile_with_mask.
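The guarded request creation discussed here can be sketched with stand-in types. Everything below is illustrative: `FakeRequest`, `FakeCompiledModel`, `SlotSketch`, and `make_requests` are hypothetical, and the enum values other than `REGULAR_TILE_MASK = 0` are assumed because the full enum is not visible in the diff.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>

// Stand-in types; the real code uses OpenVINO compiled models and infer requests.
struct FakeRequest {
    int model_id;
};
struct FakeCompiledModel {
    int id;
    std::shared_ptr<FakeRequest> create_infer_request() const {
        return std::make_shared<FakeRequest>(FakeRequest{id});
    }
};

// Slot names follow the diff; values after REGULAR_TILE_MASK are assumed.
enum SlotSketch : std::size_t { REGULAR_TILE_MASK = 0, REGULAR_TILE_NO_MASK, FINAL_TILE, COUNT };

using RequestSlots = std::array<std::shared_ptr<FakeRequest>, COUNT>;

RequestSlots make_requests(const FakeCompiledModel& tile,
                           const FakeCompiledModel* tile_no_mask,  // may be null
                           const FakeCompiledModel& final_tile) {
    RequestSlots slots{};
    slots[REGULAR_TILE_MASK] = tile.create_infer_request();
    // The no-mask slot is filled only when the optional model exists.
    if (tile_no_mask != nullptr) {
        slots[REGULAR_TILE_NO_MASK] = tile_no_mask->create_infer_request();
    }
    slots[FINAL_TILE] = final_tile.create_infer_request();
    return slots;
}
```

Leaving the optional slot empty makes the "model not built" case explicit at every use site, which is the readability concern the naming suggestion is pointing at.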

esmirno · 2026-06-03T22:19:22Z

    // ========================================================================
    HostFlashAttention hfa;
    hfa._tile_model = tile_model;
+    if (fused_flash_attention) {


Do we have any tests for HFA? I think @intelgaoxiong introduced one. It is interesting that none of them fail with this change. Could you please add some tests that show the mask behavior?
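One property such a test could check is that skipping the mask is safe exactly when the mask is all zeros. The reference below is a hedged sketch written for this page, not the repository's test code: `softmax_row` is a hypothetical helper that applies an additive mask (0 keeps a position, negative infinity removes it) to one row of attention scores.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Reference softmax over one row of attention scores with an optional
// additive mask. Assumes at least one position is left unmasked.
std::vector<float> softmax_row(const std::vector<float>& scores,
                               const std::vector<float>* mask) {
    assert(!scores.empty());
    assert(mask == nullptr || mask->size() == scores.size());
    std::vector<float> out(scores.size());
    float max_val = -std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i < scores.size(); ++i) {
        out[i] = scores[i] + (mask ? (*mask)[i] : 0.0f);
        max_val = std::max(max_val, out[i]);
    }
    // Subtract the row maximum before exponentiating for numerical stability.
    float sum = 0.0f;
    for (float& v : out) {
        v = std::exp(v - max_val);
        sum += v;
    }
    for (float& v : out) {
        v /= sum;
    }
    return out;
}
```

For a fully filled tile the zero mask and the no-mask path should give identical rows, while a padded position masked with negative infinity should receive zero weight and change the remaining weights.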

kkoryun added 6 commits May 8, 2026 16:28

remove broadcast for fused hfa

89a9754

change compiler version to support GQA

ba3fcd6

Merge branch 'master' into enabling_gqa_hfa

d4dbbf5

refactoring

68ab3ac

added nomask subgraph

ff5cac7

revert host_flash_attention.cpp

9d133d4

github-actions Bot added category: NPU OpenVINO NPU plugin category: NPUW NPUW plugin labels May 26, 2026

kkoryun added 3 commits June 3, 2026 12:42

Merge branch 'master' into enable_mask_skipping

6a59259

fix

7b06aec

fixed false branch for no mask

8bc4f47

kkoryun marked this pull request as ready for review June 3, 2026 15:42

kkoryun requested review from a team as code owners June 3, 2026 15:42

kkoryun requested a review from Copilot June 3, 2026 15:58

Copilot started reviewing on behalf of kkoryun June 3, 2026 15:58

Copilot AI reviewed Jun 3, 2026


dmatveev added this to the 2026.3 milestone Jun 3, 2026

esmirno reviewed Jun 3, 2026


kkoryun wants to merge 9 commits into openvinotoolkit:master from kkoryun:enable_mask_skipping

4 participants
