English | 简体中文
High-performance CUDA operator library for LLM inference optimization, including FlashAttention and high-performance GEMM kernels.
- FlashAttention: Online Softmax, O(N) memory, causal mask support
- High-Performance GEMM: FP32/FP16/INT8 mixed precision, Tensor Core (WMMA)
- Progressive Optimization: Naive → Tiled → FlashAttention (double-buffered)
- Register Tiling GEMM: 128×128 blocks + 8×8 register accumulation + double buffer pipeline
- PyTorch Integration: pybind11 Python bindings, direct PyTorch Tensor I/O
- Property Testing: Hypothesis-driven property-based tests
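The online-softmax recurrence behind the FlashAttention bullet can be sketched in NumPy: stream K/V in blocks while keeping only a running max, running sum, and running output, so memory stays O(N) in sequence length. This is an illustrative single-query sketch, not the library's CUDA kernel; all names here are made up for the example.

```python
import numpy as np

def online_softmax_attention(q, k, v, block=16):
    """Single-query attention via online softmax: K/V are consumed in
    blocks, and only running statistics are kept (no full score row)."""
    d = q.shape[-1]
    m = -np.inf           # running max of scores seen so far
    l = 0.0               # running sum of exp(score - m)
    o = np.zeros(d)       # running unnormalized weighted sum of values
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = kb @ q / np.sqrt(d)        # scores for this block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)      # rescale old accumulators
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        o = o * scale + p @ vb
        m = m_new
    return o / l

# Agrees with naive two-pass softmax attention:
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(64,)), rng.normal(size=(256, 64)), rng.normal(size=(256, 64))
s = k @ q / np.sqrt(64.0)
w = np.exp(s - s.max()) / np.exp(s - s.max()).sum()
assert np.allclose(online_softmax_attention(q, k, v), w @ v)
```

The same rescale-and-accumulate trick is what lets the CUDA kernel fuse softmax with the value reduction instead of materializing the full attention matrix.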
```bash
pip install -r requirements.txt
pip install -e .
```

```bash
cmake --preset release
cmake --build --preset release
```

```python
from cuda_llm_ops import flash_attention, gemm, tensor_core_gemm

# FlashAttention (causal mask)
output = flash_attention(q, k, v, is_causal=True)

# High-performance GEMM
c = gemm(a, b, alpha=1.0, beta=0.0)

# Tensor Core GEMM (FP16 → FP32)
c_fp32 = tensor_core_gemm(a, b)
```

```bash
pytest tests/ -v                          # All tests
pytest tests/ -v -m property              # Property tests
python benchmarks/benchmark_attention.py  # Benchmarks
```

| Arch | SM | Features |
|---|---|---|
| Volta | 7.0 | FP16 Tensor Core |
| Turing | 7.5 | FP16 + INT8 |
| Ampere | 8.0, 8.6 | TF32 + async copy |
| Ada | 8.9 | FP8 |
| Hopper | 9.0 | TMA + Warp Group MMA |
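The table above can be read as a lookup from compute capability to feature tier. A minimal sketch, assuming `(major, minor)` tuples as returned by `torch.cuda.get_device_capability()`; the helper name and dictionary are hypothetical and nothing here queries a real GPU.

```python
# Hypothetical mapping from CUDA compute capability to the feature tier
# listed in the table; entries mirror the table verbatim.
FEATURES = {
    (7, 0): "FP16 Tensor Core",       # Volta
    (7, 5): "FP16 + INT8",            # Turing
    (8, 0): "TF32 + async copy",      # Ampere
    (8, 6): "TF32 + async copy",      # Ampere
    (8, 9): "FP8",                    # Ada
    (9, 0): "TMA + Warp Group MMA",   # Hopper
}

def supported_features(capability):
    """Return the feature tier for a (major, minor) capability tuple."""
    return FEATURES.get(tuple(capability), "unsupported")
```

For example, `supported_features((8, 6))` returns `"TF32 + async copy"` for an Ampere consumer GPU.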
MIT License