A comprehensive collection of optimized CUDA kernels implementing fundamental deep learning operations and computational primitives. This project demonstrates efficient GPU programming techniques using CUDA C++ with PyTorch integration for various neural network components and mathematical operations.
- GPU: NVIDIA GeForce RTX 4060 or another Ada Lovelace-class GPU
- Memory: 8GB GDDR6 VRAM minimum
- Compute Capability: 8.9+
This repository contains hand-optimized CUDA kernel implementations for essential deep learning and computational operations, designed to showcase GPU programming best practices and performance optimization techniques. Each module includes both CUDA kernels and Python benchmarking scripts for performance evaluation against PyTorch's native implementations.
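The per-module benchmarking scripts follow a common pattern: warm up, time repeated launches, and report an average for comparison against the PyTorch baseline. A minimal CPU-only sketch of such a timing harness (the function name and signature here are illustrative, not the repo's actual API; real CUDA timing would also call `torch.cuda.synchronize()` before reading the clock):

```python
import time

def benchmark(fn, *args, warmup=3, iters=10):
    """Time fn(*args) over `iters` runs after `warmup` runs; return mean seconds.

    For real CUDA kernels you would synchronize the device before each
    clock read; this portable CPU sketch omits that step.
    """
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

# Example: compare two pure-Python "kernels" on the same input.
data = list(range(10_000))
t_loop = benchmark(lambda xs: [x * 2 for x in xs], data)
t_sum = benchmark(sum, data)
print(f"listcomp: {t_loop:.2e}s  sum: {t_sum:.2e}s")
```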
| Module | Directory | Description |
|---|---|---|
| Flash Attention | `FlashAttention/` | Memory-efficient attention computation |
| GELU (Gaussian Error Linear Unit) | `Gelu/` | FP32, FP32x4, FP16 implementations |
| Sigmoid | `sigmoid/` | FP32, FP32x4, FP16 implementations |
| Layer Normalization | `layernorm/` | FP32, FP32x4, FP16 implementations |
| Block Reduce Sum | `Reduce/` | FP32, FP32x4 hierarchical reduction |
| Matrix Transpose | `Transpose/` | Optimized memory access patterns |
| Prefix Sum (Scan) | `PrefixSum/` | Parallel scan algorithms |
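As a concrete reference for what the elementwise activation kernels compute, GELU is commonly implemented with the tanh approximation, and sigmoid is the logistic function. A pure-Python CPU reference is useful for validating kernel output (this assumes the tanh variant of GELU; the repo's FP32 kernel may use the exact erf form instead):

```python
import math

def gelu_tanh(x: float) -> float:
    """Tanh approximation: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))."""
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

def sigmoid(x: float) -> float:
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

print(gelu_tanh(0.0))  # 0.0: GELU vanishes at the origin
print(sigmoid(0.0))    # 0.5
```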
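One classic parallel scan algorithm is the Hillis-Steele inclusive scan: at each step, every element adds the value `stride` positions to its left, and the stride doubles each pass. A pure-Python sketch that emulates those parallel steps sequentially (the repo's CUDA kernels are the real implementation and may use a different scan variant, e.g. Blelloch):

```python
def inclusive_scan(xs):
    """Hillis-Steele inclusive prefix sum, emulating the parallel passes."""
    out = list(xs)
    stride = 1
    while stride < len(out):
        # On the GPU all threads read then write in lockstep
        # (double-buffered); here a snapshot stands in for that.
        prev = list(out)
        for i in range(stride, len(out)):
            out[i] = prev[i] + prev[i - stride]
        stride *= 2
    return out

print(inclusive_scan([1, 2, 3, 4]))  # [1, 3, 6, 10]
```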
- CUDA Toolkit: 11.0+ with nvcc compiler
- PyTorch: 1.12+ with CUDA support
- Python: 3.8+ with NumPy (the `time` module is part of the standard library)
- C++ Compiler: Compatible with C++17 standard
- Example environment: `conda activate py313`
This project demonstrates practical CUDA kernel development for deep learning applications, providing both educational value and production-ready kernel implementations optimized for modern GPU architectures.