Skip to content

FarahSarah/cuda-neural-network-optimizer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CUDA Parallelization for Shallow Neural Networks

ESI Logo

Overview

A comparative study of two CUDA parallelization strategies for shallow neural network training. The baseline offloads only matrix multiplications to the GPU, spending around 80% of execution time on host-device transfers. The alternative rethinks the full training pipeline to keep computation GPU-resident throughout.

Strategies

Baseline maps one thread per output matrix element and runs activation, loss, and weight updates on the CPU. Simple and correct, but memory-bound: weights are transferred to the GPU and back on every operation, shared memory goes unused, and execution is fully synchronous.

Alternative is built around three ideas:

  • Memory persistence — weights live on the GPU for the entire training run, copied once at initialization.
  • Kernel fusion — matrix multiplication and ReLU are merged into a single kernel; gradient ops are similarly consolidated. No intermediate CPU round-trips.
  • Asynchronous execution — double buffering with CUDA streams overlaps data transfer for batch N+1 with computation for batch N.

Shared memory tiling and GPU-based transpose further reduce memory bottlenecks.

Repository

├── baseline/                   # Reference CUDA implementation
├── code/                       # Optimized alternative implementation
├── results/                    # Performance plots
├── article.pdf                 # Full research paper
└── project_description.pdf     # Original project specification

Read the paper

The full analysis including kernel design, architectural justification, and all experimental results is in article.pdf.

References

[1] Brouthen, K. and Akeb, A. (2024). Exploring parallelization of shallow neural network using CUDA. Refer to “Rapport de la stratégie de référence” in the baseline/ folder.

About

Optimized CUDA implementation for training shallow neural networks with shared memory tiling, kernel fusion, and double buffering techniques. Includes baseline comparison and performance analysis.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors