Update autotune cache system and local attention #293
Merged
… improve masking logic
…d add window_sizes_heuristic function for computing partitioned window sizes.
…uery heads naming
…ct query heads naming
…ct query heads naming
…w size parameters for improved local attention handling
…rate window size parameters for enhanced local attention processing
…rate window size parameters for enhanced local attention processing
…w size parameters for improved local attention processing
…ow sizes for improved local attention processing
…rate window size parameters for enhanced local attention processing
…_local for improved local attention handling
…_block_min calculation for split KV cases
…h parameters for improved flexibility
…th parameters for improved flexibility
…h parameters for improved flexibility
…flexible partitioning
…adding bounds checks for m_idx_max and refining n_block_min/max calculations for improved handling of split KV cases
…e dynamic window sizes for improved local attention handling
…ate dynamic window sizes for improved local attention handling
…namic window sizes for enhanced local attention handling
…ility in dense and sparse attention handling
…g with window size parameters
… in attention handling
…rward, backward, and decode configurations
… for local attention handling
… local attention handling
… query-output handling
… query-output handling
- Introduced a persistent cache for launch configurations, improving efficiency by storing tuned configurations per device.
- Removed redundant launch configuration functions and replaced them with a unified caching mechanism.
- Added functions to compute and sanitize device names, manage cache directories, and handle JSON read/write operations.
- Simplified the process of loading and storing launch configurations, allowing for better maintainability and readability.
- Updated the method for extracting the best configuration from autotuned kernels.
- Removed deprecated launch configuration functions for backward and decode kernels, streamlining the codebase.
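A minimal sketch of the per-device persistent cache described in the commit above, using only the Python standard library. The function names (`sanitize_device_name`, `cache_path`, `read_cache`, `write_cache`, `store_launch_config`, `load_launch_config`), the one-JSON-file-per-device layout, and the config keys are assumptions made for illustration; they are not taken from the PR's code.

```python
import json
import os
import re
import tempfile
from pathlib import Path
from typing import Optional


def sanitize_device_name(name: str) -> str:
    """Lowercase the name and collapse each non-alphanumeric run into '_'."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_") or "unknown"


def cache_path(cache_dir: Path, device_name: str) -> Path:
    """Return the JSON file for one device, creating the directory if needed."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir / f"{sanitize_device_name(device_name)}.json"


def read_cache(path: Path) -> dict:
    """Load the cache; a missing or corrupt file is treated as an empty cache."""
    try:
        with path.open("r", encoding="utf-8") as f:
            data = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    return data if isinstance(data, dict) else {}


def write_cache(path: Path, data: dict) -> None:
    """Write to a temp file in the same directory, then atomically replace."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, sort_keys=True)
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def store_launch_config(cache_dir: Path, device_name: str,
                        kernel_key: str, config: dict) -> None:
    """Insert or overwrite the tuned config for one kernel on one device."""
    path = cache_path(cache_dir, device_name)
    data = read_cache(path)
    data[kernel_key] = config
    write_cache(path, data)


def load_launch_config(cache_dir: Path, device_name: str,
                       kernel_key: str) -> Optional[dict]:
    """Return the stored config, or None if this kernel was never tuned here."""
    return read_cache(cache_path(cache_dir, device_name)).get(kernel_key)
```

Writing through a temporary file and `os.replace` is one way to keep a concurrent reader from seeing a half-written JSON file; whether the PR does this is not stated in the commit message.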
…e operations
- Updated the backward functions in `flash_dense_bwd.py`, `flash_gated_bwd.py`, and `flash_sparse_bwd.py` to utilize a launch configuration template for kernel execution, improving performance tuning.
- Modified the decode functions in `flash_dense_dec.py`, `flash_gated_dec.py`, and `flash_sparse_dec.py` to implement similar launch configuration logic.
- Enhanced forward functions in `flash_dense_fwd.py`, `flash_gated_fwd.py`, and `flash_sparse_fwd.py` to adopt the new launch configuration approach.
- Removed redundant architecture checks and streamlined kernel selection logic.
- Added logic to store the best launch configuration for future use, enhancing autotuning capabilities.
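The lookup-then-tune flow this commit describes could look roughly like the sketch below: reuse a stored launch configuration when one exists, otherwise run the autotuner once and record its best result. Everything here is hypothetical: `resolve_launch_config`, `make_key`, the key format, and the config fields are illustrative names, an in-memory dict stands in for the persistent cache, and `fake_autotune` stands in for an autotuned kernel launch.

```python
from typing import Callable, Dict

LaunchConfig = Dict[str, int]


def make_key(kernel: str, head_dim: int, is_local: bool) -> str:
    """Hypothetical cache key: kernel name plus the shape traits tuned over."""
    return f"{kernel}:d{head_dim}:local{int(is_local)}"


def resolve_launch_config(
    key: str,
    cache: Dict[str, LaunchConfig],
    autotune: Callable[[], LaunchConfig],
) -> LaunchConfig:
    """Return the cached config if present; otherwise tune once and store it."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    best = autotune()        # expensive path: benchmark candidate configs
    cache[key] = dict(best)  # store a copy so later calls skip tuning
    return best


# Usage: the second call is served from the cache without re-tuning.
calls = []


def fake_autotune() -> LaunchConfig:
    # Stand-in for an autotuned launch; the values are made up.
    calls.append(1)
    return {"BLOCK_M": 128, "BLOCK_N": 64, "num_warps": 4, "num_stages": 2}


cache: Dict[str, LaunchConfig] = {}
key = make_key("flash_dense_fwd", head_dim=128, is_local=False)
first = resolve_launch_config(key, cache, fake_autotune)
second = resolve_launch_config(key, cache, fake_autotune)
```

Keying on kernel name plus shape traits is an assumption; the commit message does not say which parameters the stored configurations are indexed by.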
No description provided.