English | 简体中文
# A Living Textbook for High-Performance CUDA Kernel Development
A systematic CUDA high-performance computing tutorial: from naive implementations to extreme optimization, covering the core operators needed by modern AI models (LLMs, diffusion models).
| Module | Description | Key Techniques |
|---|---|---|
| GEMM | Matrix multiplication optimization | Tiled → Register Blocked → Tensor Core |
| Attention | FlashAttention variants | Online Softmax, causal masking |
| Normalization | LayerNorm, RMSNorm | Warp shuffle, vectorized loads |
| Elementwise | Activation functions | GELU, SiLU, vectorized |
| Quantization | INT8/FP8 | Calibration, per-channel scaling |
| Fusion | Kernel fusion patterns | Bias+Act, LayerNorm+Residual |
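As a taste of the GEMM module's first optimization step, here is a minimal shared-memory tiled kernel. This is an illustrative sketch, not the repo's actual code: the kernel name `sgemm_tiled`, the `TILE` size, and the square row-major layout are all assumptions.

```cuda
#define TILE 32

// Hypothetical tiled GEMM sketch: C = A * B for square N x N row-major
// matrices. Each block computes one TILE x TILE output tile, staging
// operand tiles through shared memory to reduce global-memory traffic.
__global__ void sgemm_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N; t += TILE) {
        // Cooperatively load one tile of A and B (zero-pad out of range).
        As[threadIdx.y][threadIdx.x] = (row < N && t + threadIdx.x < N)
            ? A[row * N + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t + threadIdx.y < N && col < N)
            ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        // Accumulate the partial dot product for this tile.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < N && col < N)
        C[row * N + col] = acc;
}
```

The later levels in the module (register blocking, Tensor Cores) build on this same tiling structure while increasing the work done per thread.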
```bash
git clone https://github.com/LessUp/hpc-ai-optimization-lab.git
cd hpc-ai-optimization-lab
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
ctest --test-dir build --output-on-failure
```

Requirements:

- CUDA Toolkit 13.1+ (Hopper/Blackwell recommended)
- CMake 3.20+, C++20 compiler
- GPU: SM 8.0+ (Ampere or newer)
```
hpc-ai-optimization-lab/
├── src/             # Kernel implementations
│   ├── gemm/           # GEMM optimization levels
│   ├── attention/      # Attention kernels
│   ├── normalization/  # Norm kernels
│   ├── elementwise/    # Activation kernels
│   └── quantization/   # Quantization kernels
├── include/         # Public headers
├── tests/           # Google Test suite
├── benchmarks/      # Performance benchmarks
├── docs/            # Documentation
└── .github/workflows/  # CI
```
- Memory Hierarchy: Global → Shared → Register optimization
- Tensor Core Programming: WMMA / MMA for mixed-precision compute
- Async Operations: TMA, async copy, pipeline overlapping
- Warp-Level Primitives: Shuffle, vote, cooperative groups
- Kernel Fusion: Reducing HBM round-trips
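The warp-level primitives above are the workhorse of the normalization kernels' row reductions. A minimal sketch of a shuffle-based warp sum (the helper name `warp_reduce_sum` is a placeholder, not the repo's API):

```cuda
// Hypothetical warp-level sum reduction using __shfl_down_sync.
// All 32 lanes enter with a partial value; lane 0 exits with the warp total.
__inline__ __device__ float warp_reduce_sum(float val) {
    // Each step halves the number of lanes still carrying partial sums:
    // lane i adds the value from lane i + offset.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;  // valid on lane 0
}
```

Because the exchange happens entirely in registers, this avoids the shared-memory round-trips and barriers a block-wide tree reduction would need for the intra-warp stage.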
MIT License