CUDA LLM Kernel Optimization

High-performance CUDA operator library for LLM inference optimization, including FlashAttention and high-performance GEMM kernels.

Features

FlashAttention: online softmax, O(N) memory in sequence length N (the N×N score matrix is never materialized), causal mask support
High-Performance GEMM: FP32/FP16/INT8 mixed precision, Tensor Core (WMMA)
Progressive Optimization: Naive → Tiled → FlashAttention (double-buffered)
Register Tiling GEMM: 128×128 blocks + 8×8 register accumulation + double buffer pipeline
PyTorch Integration: pybind11 Python bindings, direct PyTorch Tensor I/O
Property Testing: property-based tests written with Hypothesis
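
To make the online softmax idea concrete, here is a minimal pure-Python sketch for a single query row. It is not the library's kernel: the function names (`softmax_reference`, `online_attention_row`) and the list-based data layout are assumptions chosen for illustration. It shows how one pass over the keys and values can keep only a running maximum, a running normalizer, and a running weighted sum instead of the full score row.

```python
import math


def softmax_reference(scores):
    """Two-pass softmax: needs the whole score row in memory."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def online_attention_row(q, keys, values, scale):
    """Attention output for one query, streaming over (key, value) pairs.

    Hypothetical helper for illustration only. State is O(head_dim):
    a running max, a running normalizer, and a running weighted sum.
    """
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = scale * sum(qi * ki for qi, ki in zip(q, k))
        new_max = max(running_max, score)
        # Rescale earlier partial results to the new maximum.
        correction = math.exp(running_max - new_max)
        weight = math.exp(score - new_max)
        running_sum = running_sum * correction + weight
        acc = [a * correction + weight * vi for a, vi in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

The streamed result should match the two-pass softmax up to floating-point rounding. In this sketch, a causal mask would amount to stopping the loop at the query's own position.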

Installation

pip install -r requirements.txt
pip install -e .

CMake Build

cmake --preset release
cmake --build --preset release

Usage

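import torch
from cuda_llm_ops import flash_attention, gemm, tensor_core_gemm

# Example inputs on the GPU. The shapes, dtypes, and the
# [batch, heads, seq_len, head_dim] layout are illustrative assumptions.
q, k, v = (torch.randn(2, 8, 1024, 64, device="cuda") for _ in range(3))
a = torch.randn(1024, 512, device="cuda", dtype=torch.float16)
b = torch.randn(512, 1024, device="cuda", dtype=torch.float16)

# FlashAttention (causal mask)
output = flash_attention(q, k, v, is_causal=True)

# High-performance GEMM
c = gemm(a, b, alpha=1.0, beta=0.0)

# Tensor Core GEMM (FP16 inputs, FP32 output)
c_fp32 = tensor_core_gemm(a, b)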
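
The `alpha` and `beta` arguments suggest the usual BLAS convention, `C = alpha * (A @ B) + beta * C`, though that is an assumption and not something this README states. Under that assumption, a small pure-Python reference with a tiled loop nest can serve as a mental model of what the tiled kernels compute. The name `gemm_reference`, the optional `c` argument, and the `tile` parameter are hypothetical and not part of `cuda_llm_ops`.

```python
def gemm_reference(a, b, c=None, alpha=1.0, beta=0.0, tile=2):
    """Return alpha * (a @ b) + beta * c for matrices stored as nested lists.

    Illustrative sketch only: assumes BLAS-style alpha/beta semantics.
    The loops walk tile x tile blocks, mirroring how a tiled kernel
    partitions the output and the reduction dimension.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):            # block row of the output
        for j0 in range(0, n, tile):        # block column of the output
            for k0 in range(0, k, tile):    # slice of the reduction dimension
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        partial = 0.0
                        for p in range(k0, min(k0 + tile, k)):
                            partial += a[i][p] * b[p][j]
                        out[i][j] += partial
    for i in range(m):
        for j in range(n):
            prior = c[i][j] if c is not None else 0.0
            out[i][j] = alpha * out[i][j] + beta * prior
    return out
```

With `beta=0.0`, as in the usage above, the prior contents of the output do not contribute and the result reduces to a scaled matrix product.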

Testing

pytest tests/ -v                         # All tests
pytest tests/ -v -m property             # Property tests
python benchmarks/benchmark_attention.py # Benchmarks

GPU Architecture Support

Architecture	Compute capability (SM)	Features
Volta	7.0	FP16 Tensor Core
Turing	7.5	FP16 + INT8
Ampere	8.0, 8.6	TF32 + async copy
Ada	8.9	FP8
Hopper	9.0	TMA + Warp Group MMA
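
As a rough illustration of how the table might be used at runtime, the sketch below maps a compute capability to the row it belongs to. The table contents come from above; the names `ARCH_FEATURES` and `describe_arch` are hypothetical and not part of this library.

```python
# Rows of the support table, keyed by (major, minor) compute capability.
ARCH_FEATURES = {
    (7, 0): ("Volta", "FP16 Tensor Core"),
    (7, 5): ("Turing", "FP16 + INT8"),
    (8, 0): ("Ampere", "TF32 + async copy"),
    (8, 6): ("Ampere", "TF32 + async copy"),
    (8, 9): ("Ada", "FP8"),
    (9, 0): ("Hopper", "TMA + Warp Group MMA"),
}


def describe_arch(major, minor):
    """Return (architecture, features) for a listed compute capability,
    or None when the capability does not appear in the table."""
    return ARCH_FEATURES.get((major, minor))
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple for the current device, so `describe_arch(*torch.cuda.get_device_capability())` would look up the running GPU.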

License

MIT License
