CUDA Parallelization for Shallow Neural Networks

Overview

A comparative study of two CUDA parallelization strategies for shallow neural network training. The baseline offloads only matrix multiplications to the GPU, spending around 80% of execution time on host-device transfers. The alternative rethinks the full training pipeline to keep computation GPU-resident throughout.

Strategies

Baseline maps one thread per output matrix element and runs activation, loss, and weight updates on the CPU. Simple and correct, but memory-bound: weights are transferred to the GPU and back on every operation, shared memory goes unused, and execution is fully synchronous.

Alternative is built around three ideas:

Memory persistence — weights live on the GPU for the entire training run, copied once at initialization.
Kernel fusion — matrix multiplication and ReLU are merged into a single kernel; gradient ops are similarly consolidated. No intermediate CPU round-trips.
Asynchronous execution — double buffering with CUDA streams overlaps data transfer for batch N+1 with computation for batch N.

Shared memory tiling and GPU-based transpose further reduce memory bottlenecks.

Repository

├── baseline/                   # Reference CUDA implementation
├── code/                       # Optimized alternative implementation
├── results/                    # Performance plots
├── article.pdf                 # Full research paper
└── project_description.pdf     # Original project specification

Read the paper

The full analysis including kernel design, architectural justification, and all experimental results is in article.pdf.

References

[1] Brouthen, K. and Akeb, A. (2024). Exploring parallelization of shallow neural network using CUDA. Refer to “Rapport de la stratégie de référence” in the baseline/ folder.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CUDA Parallelization for Shallow Neural Networks

Overview

Strategies

Repository

Read the paper

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
baseline		baseline
code		code
results		results
README.md		README.md
article.pdf		article.pdf
project_description.pdf		project_description.pdf

Folders and files

Latest commit

History

Repository files navigation

CUDA Parallelization for Shallow Neural Networks

Overview

Strategies

Repository

Read the paper

References

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages