English | 简体中文
# A Living Textbook for High-Performance CUDA Kernel Development
A systematic CUDA high-performance computing tutorial: from naive implementations to extreme optimization, covering the core operators needed by modern AI models (LLMs, diffusion models).
| Module | Description | Key Techniques |
|---|---|---|
| GEMM | Matrix multiplication optimization | Tiled → Register Blocked → Tensor Core |
| Attention | FlashAttention variants | Online Softmax, causal masking |
| Normalization | LayerNorm, RMSNorm | Warp shuffle, vectorized loads |
| Elementwise | Activation functions | GELU, SiLU, vectorized |
| Quantization | INT8/FP8 | Calibration, per-channel scaling |
| Fusion | Kernel fusion patterns | Bias+Act, LayerNorm+Residual |
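As a taste of the GEMM module's first optimization step, here is a minimal shared-memory tiled kernel. This is an illustrative sketch, not the repo's actual code: the kernel name `sgemm_tiled`, the `TILE` size, and the square row-major layout are all assumptions.

```cuda
#define TILE 32

// Hypothetical tiled GEMM sketch: C = A * B for square N x N row-major
// matrices. Each block computes one TILE x TILE output tile, staging
// operand tiles through shared memory to reduce global-memory traffic.
__global__ void sgemm_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N; t += TILE) {
        // Cooperatively load one tile of A and B (zero-pad out of range).
        As[threadIdx.y][threadIdx.x] = (row < N && t + threadIdx.x < N)
            ? A[row * N + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t + threadIdx.y < N && col < N)
            ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        // Accumulate the partial dot product for this tile.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < N && col < N)
        C[row * N + col] = acc;
}
```

The later levels in the module (register blocking, Tensor Cores) build on this same tiling structure while increasing the work done per thread.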
```bash
git clone https://github.com/LessUp/hpc-ai-optimization-lab.git
cd hpc-ai-optimization-lab
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
ctest --test-dir build --output-on-failure
```

Requirements:

- CUDA Toolkit 13.1+ (Hopper/Blackwell recommended)
- CMake 3.20+, C++20 compiler
- GPU: SM 8.0+ (Ampere or newer)
```
hpc-ai-optimization-lab/
├── src/             # Kernel implementations
│   ├── gemm/           # GEMM optimization levels
│   ├── attention/      # Attention kernels
│   ├── normalization/  # Norm kernels
│   ├── elementwise/    # Activation kernels
│   └── quantization/   # Quantization kernels
├── include/         # Public headers
├── tests/           # Google Test suite
├── benchmarks/      # Performance benchmarks
├── docs/            # Documentation
└── .github/workflows/  # CI
```
- Memory Hierarchy: Global → Shared → Register optimization
- Tensor Core Programming: WMMA / MMA for mixed-precision compute
- Async Operations: TMA, async copy, pipeline overlapping
- Warp-Level Primitives: Shuffle, vote, cooperative groups
- Kernel Fusion: Reducing HBM round-trips
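The warp-level primitives above are the workhorse of the normalization kernels' row reductions. A minimal sketch of a shuffle-based warp sum (the helper name `warp_reduce_sum` is a placeholder, not the repo's API):

```cuda
// Hypothetical warp-level sum reduction using __shfl_down_sync.
// All 32 lanes enter with a partial value; lane 0 exits with the warp total.
__inline__ __device__ float warp_reduce_sum(float val) {
    // Each step halves the number of lanes still carrying partial sums:
    // lane i adds the value from lane i + offset.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;  // valid on lane 0
}
```

Because the exchange happens entirely in registers, this avoids the shared-memory round-trips and barriers a block-wide tree reduction would need for the intra-warp stage.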
MIT License