
local attention and cache system #291

Merged: LoserCheems merged 47 commits into quant from flex-window on May 24, 2026


@LoserCheems (Collaborator)

No description provided.

…d add window_sizes_heuristic function for computing partitioned window sizes.
…w size parameters for improved local attention handling
…rate window size parameters for enhanced local attention processing
…rate window size parameters for enhanced local attention processing
…w size parameters for improved local attention processing
…ow sizes for improved local attention processing
…rate window size parameters for enhanced local attention processing
…_local for improved local attention handling
…adding bounds checks for m_idx_max and refining n_block_min/max calculations for improved handling of split KV cases
…e dynamic window sizes for improved local attention handling
…ate dynamic window sizes for improved local attention handling
…namic window sizes for enhanced local attention handling
…ility in dense and sparse attention handling
- Introduced a persistent cache for launch configurations, improving efficiency by storing tuned configurations per device.
- Removed redundant launch configuration functions and replaced them with a unified caching mechanism.
- Added functions to compute and sanitize device names, manage cache directories, and handle JSON read/write operations.
- Simplified the process of loading and storing launch configurations, allowing for better maintainability and readability.
- Updated the method for extracting the best configuration from autotuned kernels.
- Removed deprecated launch configuration functions for backward and decode kernels, streamlining the codebase.
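The persistent cache described in the list above could look roughly like the following. This is a minimal sketch, not the code from this PR: the function names (`sanitize_device_name`, `cache_path`, `load_configs`, `store_config`), the one-JSON-file-per-device layout, and the example config keys are all assumptions made for illustration.

```python
import json
import os
import re
import tempfile
from pathlib import Path


def sanitize_device_name(name: str) -> str:
    """Turn a device name such as 'NVIDIA A100-SXM4-80GB' into a safe file stem."""
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_").lower()
    return cleaned or "unknown_device"


def cache_path(device_name: str, cache_dir: Path) -> Path:
    """Assumed layout: one JSON file per device inside the cache directory."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir / f"{sanitize_device_name(device_name)}.json"


def load_configs(path: Path) -> dict:
    """Return stored configs, or an empty dict if the file is missing or corrupt."""
    try:
        with path.open("r", encoding="utf-8") as f:
            data = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    return data if isinstance(data, dict) else {}


def store_config(path: Path, key: str, config: dict) -> None:
    """Merge one tuned config into the file, replacing it atomically."""
    configs = load_configs(path)
    configs[key] = config
    # Write to a temp file in the same directory, then swap it into place.
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(configs, f, indent=2, sort_keys=True)
    os.replace(tmp, path)
```

Writing through a temporary file and `os.replace` means a process interrupted mid-write leaves the previous cache file intact rather than a truncated one.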
…e operations

- Updated the backward functions in `flash_dense_bwd.py`, `flash_gated_bwd.py`, and `flash_sparse_bwd.py` to utilize a launch configuration template for kernel execution, improving performance tuning.
- Modified the decode functions in `flash_dense_dec.py`, `flash_gated_dec.py`, and `flash_sparse_dec.py` to implement similar launch configuration logic.
- Enhanced forward functions in `flash_dense_fwd.py`, `flash_gated_fwd.py`, and `flash_sparse_fwd.py` to adopt the new launch configuration approach.
- Removed redundant architecture checks and streamlined kernel selection logic.
- Added logic to store the best launch configuration for future use, enhancing autotuning capabilities.
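The "use a stored configuration, otherwise tune and remember the winner" flow described above can be sketched as follows. The helper `pick_launch_config`, its signature, and the `benchmark` callback are hypothetical stand-ins; the actual kernels in this PR rely on their own autotuning machinery, which is not reproduced here.

```python
from typing import Callable, Dict, Iterable, Tuple

Config = Dict[str, int]


def pick_launch_config(
    key: str,
    cache: Dict[str, Config],
    candidates: Iterable[Config],
    benchmark: Callable[[Config], float],
) -> Tuple[Config, bool]:
    """Return (config, was_cached).

    On a cache miss, time every candidate with `benchmark` (lower is better),
    keep the fastest, and store it under `key` for later calls.
    """
    if key in cache:
        return cache[key], True
    best = min(candidates, key=benchmark)
    cache[key] = best  # remember the winner so the next call skips tuning
    return best, False
```

A call such as `pick_launch_config("dense_fwd", cache, candidates, time_kernel)` would tune only on the first invocation per key; `time_kernel` is an assumed user-supplied timing function.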
@LoserCheems merged commit 8826f58 into quant on May 24, 2026
1 check passed