A comprehensive collection of optimized CUDA kernels implementing fundamental deep learning operations and computational primitives. This project demonstrates efficient GPU programming techniques using CUDA C++ with PyTorch integration for various neural network components and mathematical operations.
- GPU: NVIDIA GeForce RTX 4060 or another Ada Lovelace-class GPU
- Memory: 8GB GDDR6 VRAM minimum
- Compute Capability: 8.9+
This repository contains hand-optimized CUDA kernel implementations for essential deep learning and computational operations, designed to showcase GPU programming best practices and performance optimization techniques. Each module includes both CUDA kernels and Python benchmarking scripts for performance evaluation against PyTorch's native implementations.
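The per-module benchmarking scripts follow a common pattern: warm up, time repeated launches, and report an average for comparison against the PyTorch baseline. A minimal CPU-only sketch of such a timing harness (the function name and signature here are illustrative, not the repo's actual API; real CUDA timing would also call `torch.cuda.synchronize()` before reading the clock):

```python
import time

def benchmark(fn, *args, warmup=3, iters=10):
    """Time fn(*args) over `iters` runs after `warmup` runs; return mean seconds.

    For real CUDA kernels you would synchronize the device before each
    clock read; this portable CPU sketch omits that step.
    """
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

# Example: compare two pure-Python "kernels" on the same input.
data = list(range(10_000))
t_loop = benchmark(lambda xs: [x * 2 for x in xs], data)
t_sum = benchmark(sum, data)
print(f"listcomp: {t_loop:.2e}s  sum: {t_sum:.2e}s")
```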
| Module | Directory | Description |
|---|---|---|
| Flash Attention | `FlashAttention/` | Memory-efficient attention computation |
| GELU (Gaussian Error Linear Unit) | `Gelu/` | FP32, FP32x4, FP16 implementations |
| Sigmoid | `sigmoid/` | FP32, FP32x4, FP16 implementations |
| Layer Normalization | `layernorm/` | FP32, FP32x4, FP16 implementations |
| Block Reduce Sum | `Reduce/` | FP32, FP32x4 hierarchical reduction |
| Matrix Transpose | `Transpose/` | Optimized memory access patterns |
| Prefix Sum (Scan) | `PrefixSum/` | Parallel scan algorithms |
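As a concrete reference for what the elementwise activation kernels compute, GELU is commonly implemented with the tanh approximation, and sigmoid is the logistic function. A pure-Python CPU reference is useful for validating kernel output (this assumes the tanh variant of GELU; the repo's FP32 kernel may use the exact erf form instead):

```python
import math

def gelu_tanh(x: float) -> float:
    """Tanh approximation: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))."""
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

def sigmoid(x: float) -> float:
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

print(gelu_tanh(0.0))  # 0.0: GELU vanishes at the origin
print(sigmoid(0.0))    # 0.5
```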
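One classic parallel scan algorithm is the Hillis-Steele inclusive scan: at each step, every element adds the value `stride` positions to its left, and the stride doubles each pass. A pure-Python sketch that emulates those parallel steps sequentially (the repo's CUDA kernels are the real implementation and may use a different scan variant, e.g. Blelloch):

```python
def inclusive_scan(xs):
    """Hillis-Steele inclusive prefix sum, emulating the parallel passes."""
    out = list(xs)
    stride = 1
    while stride < len(out):
        # On the GPU all threads read then write in lockstep
        # (double-buffered); here a snapshot stands in for that.
        prev = list(out)
        for i in range(stride, len(out)):
            out[i] = prev[i] + prev[i - stride]
        stride *= 2
    return out

print(inclusive_scan([1, 2, 3, 4]))  # [1, 3, 6, 10]
```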
- CUDA Toolkit: 11.0+ with nvcc compiler
- PyTorch: 1.12+ with CUDA support
- Python: 3.8+ with NumPy (the `time` module is part of the standard library)
- C++ Compiler: Compatible with C++17 standard
- Example environment: `conda activate py313`
This project demonstrates practical CUDA kernel development for deep learning applications, providing both educational value and production-ready kernel implementations optimized for modern GPU architectures.