ActivAai/cuda_kernel_practice


CUDA Deep Learning Kernels

A comprehensive collection of optimized CUDA kernels implementing fundamental deep learning operations and computational primitives. This project demonstrates efficient GPU programming techniques using CUDA C++ with PyTorch integration for various neural network components and mathematical operations.

Target Hardware

  • GPU: NVIDIA GeForce RTX 4060 or another Ada Lovelace-class GPU
  • Memory: 8GB GDDR6 VRAM minimum
  • Compute Capability: 8.9+

Project Overview

This repository contains hand-optimized CUDA kernel implementations for essential deep learning and computational operations, designed to showcase GPU programming best practices and performance optimization techniques. Each module includes both CUDA kernels and Python benchmarking scripts for performance evaluation against PyTorch's native implementations.

Implemented Kernels

  • Flash Attention (`FlashAttention/`) - Memory-efficient attention computation
  • GELU (Gaussian Error Linear Unit) (`Gelu/`) - FP32, FP32x4, and FP16 implementations
  • Sigmoid (`sigmoid/`) - FP32, FP32x4, and FP16 implementations
  • Layer Normalization (`layernorm/`) - FP32, FP32x4, and FP16 implementations
  • Block Reduce Sum (`Reduce/`) - FP32 and FP32x4 hierarchical reduction
  • Matrix Transpose (`Transpose/`) - Optimized memory access patterns
  • Prefix Sum (Scan) (`PrefixSum/`) - Parallel scan algorithms
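
To illustrate the elementwise kernels in this list, here is a minimal FP32 GELU sketch using the common tanh approximation. The kernel and variable names are illustrative, not the identifiers used in `Gelu/`:

```cuda
#include <math.h>

// Tanh approximation of GELU:
// 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
__global__ void gelu_fp32_kernel(const float* __restrict__ x,
                                 float* __restrict__ y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        float inner = 0.7978845608f * (v + 0.044715f * v * v * v);  // sqrt(2/pi)
        y[i] = 0.5f * v * (1.0f + tanhf(inner));
    }
}

// Launch with one thread per element, e.g.:
// gelu_fp32_kernel<<<(n + 255) / 256, 256>>>(d_x, d_y, n);
```

The FP32x4 variants in this repository follow the same pattern but load and store `float4` values, cutting the number of memory transactions per element by four.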
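
The hierarchical reduction named above typically proceeds in two stages: warp-level shuffles, then a shared-memory pass across warps. A sketch under those assumptions (names are illustrative, not taken from `Reduce/`):

```cuda
// Stage 1: reduce within a warp using register shuffles (no shared memory).
__inline__ __device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// Stage 2: each warp's partial sum goes through shared memory, and the
// first warp reduces those partials. `out` must be zero-initialized,
// since blocks combine their results with atomicAdd.
__global__ void block_reduce_sum_kernel(const float* __restrict__ in,
                                        float* __restrict__ out, int n) {
    __shared__ float warp_sums[32];            // one slot per warp (max 1024 threads)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (i < n) ? in[i] : 0.0f;

    val = warp_reduce_sum(val);                // within-warp reduction
    int lane = threadIdx.x & 31, wid = threadIdx.x >> 5;
    if (lane == 0) warp_sums[wid] = val;
    __syncthreads();

    if (wid == 0) {                            // first warp reduces warp partials
        val = (lane < (blockDim.x + 31) / 32) ? warp_sums[lane] : 0.0f;
        val = warp_reduce_sum(val);
        if (lane == 0) atomicAdd(out, val);
    }
}
```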
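
The "optimized memory access patterns" for transpose usually mean staging a tile in shared memory so that both the global read and the global write are coalesced. A minimal sketch of that technique (identifiers are illustrative, not the repository's):

```cuda
#define TILE 32

// Coalesced transpose of a rows x cols matrix. Each block reads a tile
// row-wise, then writes it column-wise through shared memory. The +1
// column of padding avoids shared-memory bank conflicts on the
// transposed read.
__global__ void transpose_kernel(const float* __restrict__ in,
                                 float* __restrict__ out,
                                 int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;       // transposed block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
}
```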

Development Environment

Software Dependencies

  • CUDA Toolkit: 11.0+ with the nvcc compiler
  • PyTorch: 1.12+ built with CUDA support
  • Python: 3.8+ with NumPy (benchmark timing uses the standard-library time module)
  • C++ Compiler: C++17-capable host compiler
  • Reference environment: conda activate py313

This project demonstrates practical CUDA kernel development for deep learning applications, providing both educational value and production-ready kernel implementations optimized for modern GPU architectures.

Languages

  • Cuda 89.1%
  • Python 8.0%
  • C++ 1.8%
  • Makefile 1.1%