local attention and cache system by LoserCheems · Pull Request #291 · HKUSTDial/flash-sparse-attention

LoserCheems · 2026-05-24T12:08:53Z

No description provided.

- Introduced a persistent cache for launch configurations, improving efficiency by storing tuned configurations per device.
- Removed redundant launch configuration functions and replaced them with a unified caching mechanism.
- Added functions to compute and sanitize device names, manage cache directories, and handle JSON read/write operations.
- Simplified the process of loading and storing launch configurations, allowing for better maintainability and readability.
- Updated the method for extracting the best configuration from autotuned kernels.
- Removed deprecated launch configuration functions for backward and decode kernels, streamlining the codebase.
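For readers who want a concrete picture of the per-device cache described above, here is a minimal sketch. It is not the code from this PR: every function name (`sanitize_device_name`, `cache_file`, `load_launch_config`, `store_launch_config`), the one-JSON-file-per-device layout, and the example config keys are assumptions chosen for illustration.

```python
import json
import os
import re
import tempfile


def sanitize_device_name(name: str) -> str:
    """Turn a device name into a filename-safe key (hypothetical helper)."""
    return re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_").lower()


def cache_file(cache_dir: str, device_name: str) -> str:
    """Assumed layout: one JSON file per device, directory created on demand."""
    os.makedirs(cache_dir, exist_ok=True)
    return os.path.join(cache_dir, sanitize_device_name(device_name) + ".json")


def read_json(path: str) -> dict:
    """Return the cached table, or an empty one if the file is missing or corrupt."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError):
        return {}
    return data if isinstance(data, dict) else {}


def write_json(path: str, data: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, sort_keys=True)
    os.replace(tmp, path)


def load_launch_config(cache_dir: str, device_name: str, key: str):
    """Look up a stored config; None means the kernel still needs tuning."""
    return read_json(cache_file(cache_dir, device_name)).get(key)


def store_launch_config(cache_dir: str, device_name: str, key: str, config: dict) -> None:
    """Record the best config found by autotuning under the given key."""
    path = cache_file(cache_dir, device_name)
    table = read_json(path)
    table[key] = config
    write_json(path, table)
```

A caller would try `load_launch_config` first and fall back to autotuning, then call `store_launch_config` with the winning configuration so later runs on the same device skip the search.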

…e operations

- Updated the backward functions in `flash_dense_bwd.py`, `flash_gated_bwd.py`, and `flash_sparse_bwd.py` to utilize a launch configuration template for kernel execution, improving performance tuning.
- Modified the decode functions in `flash_dense_dec.py`, `flash_gated_dec.py`, and `flash_sparse_dec.py` to implement similar launch configuration logic.
- Enhanced forward functions in `flash_dense_fwd.py`, `flash_gated_fwd.py`, and `flash_sparse_fwd.py` to adopt the new launch configuration approach.
- Removed redundant architecture checks and streamlined kernel selection logic.
- Added logic to store the best launch configuration for future use, enhancing autotuning capabilities.

LoserCheems added 30 commits May 16, 2026 10:48

Refactor block min/max functions to use window size parameters directly

7b08b9c

Refactor apply_mask function to streamline window size parameters and improve masking logic

44f8925

Fix normalization logic in combine kernels to handle zero sums correctly

9c4fbd8

Fix parameter naming for query heads in softmax and gate threshold functions

0f2cb55

Enhance num_splits_heuristic to ensure a minimum return value of 1 and add window_sizes_heuristic function for computing partitioned window sizes.

c19e4c1

Fix parameter naming for query heads in launch configuration functions

00f9eec

Add window size parameters to block min/max functions and update return values

faa0946

Refactor masking logic to prevent simultaneous application of causal and local masks

ba8c30f

Refactor kernel parameters to use window size variables and correct query heads naming

8b3b0eb

Refactor kernel parameters to include window size variables and correct query heads naming

3cb47af

Refactor kernel parameters to include window size variables and correct query heads naming

51afda4

Refactor _bwd_inner_dense_kernel to use window size parameters directly

d12554a

Refactor _bwd_inner_sparse_kernel to use window size parameters directly

dba9f28

Refactor _bwd_inner_gated_kernel to use window size parameters directly

cbfe34d

Refactor _fwd_dense_kernel and related functions to incorporate window size parameters for improved local attention handling

b96e248

Refactor _fwd_sparse_kernel and _flash_sparse_attn functions to integrate window size parameters for enhanced local attention processing

4df37c6

Refactor _fwd_gated_kernel and _flash_gated_attn functions to incorporate window size parameters for enhanced local attention processing

7125da9

Refactor _bwd_dense_kernel and related functions to incorporate window size parameters for improved local attention processing

ca02613

Refactor _bwd_sparse_kernel and related functions to incorporate window sizes for improved local attention processing

efdfac8

Refactor _bwd_gated_kernel and _flash_gated_attn functions to incorporate window size parameters for enhanced local attention processing

7c9bbb2

Refactor attention functions to replace window_size parameter with is_local for improved local attention handling

ebffbfe

Fix condition in _fwd_sparse_kernel to correctly handle local attention processing

3bf67e9

Enhance local attention handling in get_n_block_min_max by updating n_block_min calculation for split KV cases

d1f5b64

Refactor _dec_inner_dense_kernel to replace window size constants with parameters for improved flexibility

fdab34f

Refactor _dec_inner_sparse_kernel to replace window size constants with parameters for improved flexibility

bb33953

Refactor _dec_inner_gated_kernel to replace window size constants with parameters for improved flexibility

7ba6377

Refactor window_sizes_heuristic to add equal_bandwidth parameter for flexible partitioning

afbfdf1

Enhance get_n_block_min_max and get_n_block_min_before_local_mask by adding bounds checks for m_idx_max and refining n_block_min/max calculations for improved handling of split KV cases

5a92d3c

Refactor _dec_dense_kernel and _flash_dense_attn_decode to incorporate dynamic window sizes for improved local attention handling

83480b5

Refactor _dec_sparse_kernel and _flash_sparse_attn_decode to incorporate dynamic window sizes for improved local attention handling

cca319b
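The commits in this batch thread window sizes through the masking and block-range helpers. As a rough illustration of what that involves, here is a sketch in plain Python. It assumes the common sliding-window convention (the last query aligned with the last key, a left and a right window size) and assumes that the local mask wins when both flags are set; the PR only says the two masks are not applied together, not which one takes precedence. The names `is_visible`, `mask_rows`, and `n_block_range` are hypothetical and are not the repository's `apply_mask` or `get_n_block_min_max`.

```python
def is_visible(q_idx, k_idx, seqlen_q, seqlen_k,
               is_causal=False, is_local=False, window_left=0, window_right=0):
    """Return True if query q_idx may attend to key k_idx."""
    # Assumption: bottom-right alignment, so the last query sits on the last key.
    pos = q_idx + seqlen_k - seqlen_q
    if is_local:
        # Exactly one mask is applied; here local takes precedence (assumption).
        return pos - window_left <= k_idx <= pos + window_right
    if is_causal:
        return k_idx <= pos
    return True


def mask_rows(seqlen_q, seqlen_k, **kw):
    """Render the mask as strings, 'x' for visible and '.' for masked."""
    return ["".join("x" if is_visible(i, j, seqlen_q, seqlen_k, **kw) else "."
                    for j in range(seqlen_k))
            for i in range(seqlen_q)]


def n_block_range(m_block, block_m, block_n, seqlen_q, seqlen_k,
                  window_left, window_right):
    """Half-open range [n_min, n_max) of key blocks one query block can touch."""
    offset = seqlen_k - seqlen_q
    q_lo = m_block * block_m
    q_hi = min((m_block + 1) * block_m, seqlen_q) - 1
    k_lo = max(q_lo + offset - window_left, 0)
    k_hi = min(q_hi + offset + window_right, seqlen_k - 1)
    if k_hi < k_lo:
        return 0, 0  # nothing visible for this query block
    return k_lo // block_n, k_hi // block_n + 1
```

Restricting the key-block loop to `n_block_range` is what lets a local-attention kernel skip blocks that the window would mask out entirely.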

LoserCheems added 17 commits May 19, 2026 17:09

Refactor _dec_gated_kernel and _flash_gated_attn_decode to support dynamic window sizes for enhanced local attention handling

d310248

Refactor test cases to support is_local parameter for enhanced flexibility in dense and sparse attention handling

a415eae

Refactor reference score functions to support local attention handling with window size parameters

8eabb00

Remove window_size parameter from benchmark functions for consistency in attention handling

4997a9f

Refactor autotuner functions to support dynamic memory pruning for forward, backward, and decode configurations

d69f1c3

Refactor test cases to standardize parameters for dense and sparse attention handling

5a19ef8

Refactor get_m_block_min_max function to support split configurations for local attention handling

0364b44

Enhance backward kernel functions to support split configurations for local attention handling

5d5fde3

Enhance backward kernel functions to support split configurations for query-output handling

9e20ab0

Enhance backward kernel functions to support split configurations for query-output handling

e703354

Add support for split-QO configuration in attention functions

6bf22d7

Refactor grid functions to enhance backward and forward kernel handling

e417ffe

Add IS_CAUSAL and IS_LOCAL keys to autotuned kernel configurations

689039c

Remove static buffer pool and adjust compiled kernel cache size

9fbf2a2

Update is_autotune parameter description to clarify cache behavior

873e5bd
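One of the commits above fixes normalization in the combine kernels for zero sums. The PR does not show the fix, so the following is only a sketch of how such a case can arise and be guarded in a split-KV combine: when every split of a row is fully masked (plausible with a narrow local window), all log-sum-exp values are negative infinity and the rescaling weights sum to zero. `combine_splits` is a hypothetical name, and the code assumes each split's output is already normalized within that split.

```python
import math


def combine_splits(partial_out, partial_lse):
    """Merge per-split outputs for one row using their log-sum-exp values.

    partial_out: list of per-split output vectors (lists of floats).
    partial_lse: list of per-split log-sum-exp values, -inf for a masked split.
    """
    head_dim = len(partial_out[0])
    lse_max = max(partial_lse)
    if lse_max == -math.inf:
        # Every split was fully masked: return zeros instead of dividing by zero.
        return [0.0] * head_dim, -math.inf
    # Rescale each split relative to the largest LSE for numerical stability.
    weights = [math.exp(lse - lse_max) for lse in partial_lse]
    total = sum(weights)
    out = [sum(w * o[d] for w, o in zip(weights, partial_out)) / total
           for d in range(head_dim)]
    return out, lse_max + math.log(total)
```

Splits whose LSE is negative infinity get weight zero and drop out of the result, so a row is only zeroed when no split saw a visible key.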

LoserCheems merged commit 8826f58 into quant May 24, 2026
1 check passed

LoserCheems mentioned this pull request May 24, 2026

Revert "local attention and cache system" #292

Merged

LoserCheems merged 47 commits into quant from flex-window
