A comparative study of two CUDA parallelization strategies for shallow neural network training. The baseline offloads only matrix multiplications to the GPU, spending around 80% of execution time on host-device transfers. The alternative rethinks the full training pipeline to keep computation GPU-resident throughout.
The baseline maps one thread to each element of the output matrix and runs activation, loss, and weight updates on the CPU. It is simple and correct, but memory-bound: weights are copied to the GPU and back on every operation, shared memory goes unused, and execution is fully synchronous.
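As a rough illustration of the baseline pattern (not the code in baseline/), the sketch below pairs a one-thread-per-output-element kernel with a host wrapper that allocates, copies, and frees on every call. The names `matmulNaive` and `matmulRoundTrip`, the row-major layout, and the 16x16 block size are assumptions made for this example; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread computes one element of C = A (M x K) * B (K x N), row-major.
__global__ void matmulNaive(const float* A, const float* B, float* C,
                            int M, int K, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];  // every operand read from global memory
        C[row * N + col] = acc;
    }
}

// Hypothetical host wrapper: every call pays for allocation and
// blocking host-device copies in both directions.
std::vector<float> matmulRoundTrip(const std::vector<float>& A,
                                   const std::vector<float>& B,
                                   int M, int K, int N) {
    size_t aBytes = sizeof(float) * M * K;
    size_t bBytes = sizeof(float) * K * N;
    size_t cBytes = sizeof(float) * M * N;
    float *dA, *dB, *dC;
    cudaMalloc(&dA, aBytes);
    cudaMalloc(&dB, bBytes);
    cudaMalloc(&dC, cBytes);
    cudaMemcpy(dA, A.data(), aBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bBytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    matmulNaive<<<grid, block>>>(dA, dB, dC, M, K, N);

    std::vector<float> C(static_cast<size_t>(M) * N);
    // Blocking copy: waits for the kernel, then returns the result to the CPU,
    // where activation and weight updates would run in the baseline design.
    cudaMemcpy(C.data(), dC, cBytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

Calling a wrapper like this once per layer and per batch is what makes transfer time dominate: the arithmetic is small compared with the copies surrounding it.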
The alternative is built around three ideas:
- Memory persistence — weights live on the GPU for the entire training run, copied once at initialization.
- Kernel fusion — matrix multiplication and ReLU are merged into a single kernel; gradient ops are similarly consolidated. No intermediate CPU round-trips.
- Asynchronous execution — double buffering with CUDA streams overlaps data transfer for batch N+1 with computation for batch N.
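The three ideas above can be sketched together for a single forward layer. This is a minimal illustration under stated assumptions, not the implementation in code/: the names `matmulRelu` and `forwardAllBatches` are hypothetical, the layer has no bias, and only the forward pass is shown. In real training, weight updates create a dependency between consecutive batches that this sketch does not model.

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

// Kernel fusion: Y = ReLU(X * W) in one launch.
// X is (B x In), W is (In x Out), Y is (B x Out), all row-major.
__global__ void matmulRelu(const float* X, const float* W, float* Y,
                           int B, int In, int Out) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < B && col < Out) {
        float acc = 0.0f;
        for (int k = 0; k < In; ++k)
            acc += X[row * In + k] * W[k * Out + col];
        Y[row * Out + col] = fmaxf(acc, 0.0f);  // activation applied before the store
    }
}

// Hypothetical driver: X holds numBatches batches back to back.
std::vector<float> forwardAllBatches(const std::vector<float>& X,
                                     const std::vector<float>& W,
                                     int numBatches, int B, int In, int Out) {
    size_t xElems = static_cast<size_t>(B) * In;
    size_t yElems = static_cast<size_t>(B) * Out;

    // Memory persistence: weights are copied once and stay on the device.
    float* dW;
    cudaMalloc(&dW, sizeof(float) * In * Out);
    cudaMemcpy(dW, W.data(), sizeof(float) * In * Out, cudaMemcpyHostToDevice);

    // Pinned host buffers, so the asynchronous copies can overlap with kernels.
    float *hX, *hY;
    cudaMallocHost(&hX, sizeof(float) * xElems * numBatches);
    cudaMallocHost(&hY, sizeof(float) * yElems * numBatches);
    std::memcpy(hX, X.data(), sizeof(float) * xElems * numBatches);

    // Double buffering: two streams, each with its own device buffers.
    cudaStream_t stream[2];
    float *dX[2], *dY[2];
    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&stream[i]);
        cudaMalloc(&dX[i], sizeof(float) * xElems);
        cudaMalloc(&dY[i], sizeof(float) * yElems);
    }

    dim3 block(16, 16);
    dim3 grid((Out + block.x - 1) / block.x, (B + block.y - 1) / block.y);
    for (int n = 0; n < numBatches; ++n) {
        int b = n % 2;  // batch n+1 uses the other stream and may overlap with batch n
        cudaMemcpyAsync(dX[b], hX + n * xElems, sizeof(float) * xElems,
                        cudaMemcpyHostToDevice, stream[b]);
        matmulRelu<<<grid, block, 0, stream[b]>>>(dX[b], dW, dY[b], B, In, Out);
        cudaMemcpyAsync(hY + n * yElems, dY[b], sizeof(float) * yElems,
                        cudaMemcpyDeviceToHost, stream[b]);
    }
    cudaDeviceSynchronize();

    std::vector<float> Y(hY, hY + yElems * numBatches);
    for (int i = 0; i < 2; ++i) {
        cudaFree(dX[i]);
        cudaFree(dY[i]);
        cudaStreamDestroy(stream[i]);
    }
    cudaFreeHost(hX);
    cudaFreeHost(hY);
    cudaFree(dW);
    return Y;
}
```

Reusing a buffer pair is safe here because batches n and n+2 are queued on the same stream, and operations within one stream execute in order. Whether the copies and kernels actually overlap depends on the device's copy engines and on the batch size.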
Shared memory tiling and GPU-based transpose further reduce memory bottlenecks.
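For readers unfamiliar with shared memory tiling, the following is a standard textbook-style tiled matrix multiplication, given as a sketch rather than the kernel shipped in code/. The tile width of 16 and the names `matmulTiled` and `matmulTiledHost` are assumptions for illustration; the host wrapper exists only to make the example self-contained.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 16  // assumed tile width; the repository may use a different value

// C = A (M x K) * B (K x N), row-major. Each block stages TILE x TILE
// sub-blocks of A and B in shared memory so that every global-memory
// element is loaded once per tile instead of once per multiply.
__global__ void matmulTiled(const float* A, const float* B, float* C,
                            int M, int K, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Out-of-range entries are padded with zeros so edge tiles stay correct.
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next load overwrites
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}

// Hypothetical host wrapper used only to exercise the kernel.
std::vector<float> matmulTiledHost(const std::vector<float>& A,
                                   const std::vector<float>& B,
                                   int M, int K, int N) {
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * M * K);
    cudaMalloc(&dB, sizeof(float) * K * N);
    cudaMalloc(&dC, sizeof(float) * M * N);
    cudaMemcpy(dA, A.data(), sizeof(float) * M * K, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), sizeof(float) * K * N, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    matmulTiled<<<grid, block>>>(dA, dB, dC, M, K, N);

    std::vector<float> C(static_cast<size_t>(M) * N);
    cudaMemcpy(C.data(), dC, sizeof(float) * M * N, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

The two `__syncthreads()` calls are the part most often gotten wrong: the first guards reads of a tile that is still being written, the second guards the next iteration's writes against threads that are still reading.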
├── baseline/ # Reference CUDA implementation
├── code/ # Optimized alternative implementation
├── results/ # Performance plots
├── article.pdf # Full research paper
└── project_description.pdf # Original project specification
The full analysis, including kernel design, architectural justification, and all experimental results, is in article.pdf.
[1] Brouthen, K. and Akeb, A. (2024). Exploring parallelization of shallow neural network using CUDA. Refer to “Rapport de la stratégie de référence” in the baseline/ folder.
