This project contains CUDA implementations of matrix multiplication (GEMM).
- `naive_gemm.cu`: a straightforward GEMM kernel
- `opt_gemm.cu`: an optimized version using tiling and shared memory
- and so on.
- Input:
  - Matrix `A`: size `m × k`
  - Matrix `B`: size `k × n`
- Output:
  - Matrix `C`: size `m × n`, where `C = A × B`
All matrices are assumed to be row‑major float arrays.
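Given the row‑major convention above, a plain CPU reference is handy for validating kernel output. The sketch below is a hypothetical helper (the name `gemm_ref` is not from the repo), assuming the same `m`, `n`, `k` shapes:

```c
/* Hypothetical CPU reference for the convention above:
   row-major floats, C (m x n) = A (m x k) * B (k x n). */
void gemm_ref(const float *A, const float *B, float *C,
              int m, int n, int k) {
    for (int row = 0; row < m; ++row) {
        for (int col = 0; col < n; ++col) {
            float acc = 0.0f;
            for (int i = 0; i < k; ++i)
                acc += A[row * k + i] * B[i * n + col];  /* dot(A[row,:], B[:,col]) */
            C[row * n + col] = acc;
        }
    }
}
```

Comparing each kernel's output against this reference (within a small floating-point tolerance) is a simple correctness check.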
- `naive_gemm`
  - Each thread computes a single element `C[row, col]`.
  - The kernel reads `A[row, :]` and `B[:, col]` directly from global memory and performs the multiply–accumulate loop.
  - The implementation is simple but does not optimize memory reuse or access patterns.
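A naive one-thread-per-element kernel of this kind could look like the following sketch (the signature and launch geometry are assumptions, not the repo's exact code):

```cuda
// Sketch of a naive GEMM: one thread computes one C[row, col].
// Every A and B element is re-read from global memory on each use.
__global__ void naive_gemm(const float* A, const float* B, float* C,
                           int m, int n, int k) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m && col < n) {
        float acc = 0.0f;
        for (int i = 0; i < k; ++i)
            acc += A[row * k + i] * B[i * n + col];  // row-major indexing
        C[row * n + col] = acc;
    }
}
```

Because neighboring threads in a row of the block share the same `A[row, :]` but each reloads it from global memory, this version wastes bandwidth, which is exactly what the tiled variant addresses.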
- `opt_gemm`
  - Uses `TILE_SIZE × TILE_SIZE` tiles for computation.
  - Uses `__shared__` memory to cache tiles of `A` and `B` and reuse them within a block.
  - Organizes global memory accesses to be coalesced, improving bandwidth utilization.
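A tiled kernel following these three points could be sketched as below; `TILE_SIZE = 16`, the signature, and boundary handling are assumptions for illustration:

```cuda
#define TILE_SIZE 16

// Sketch of a tiled GEMM: each block computes a TILE_SIZE x TILE_SIZE
// tile of C, staging tiles of A and B through shared memory.
__global__ void opt_gemm(const float* A, const float* B, float* C,
                         int m, int n, int k) {
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (k + TILE_SIZE - 1) / TILE_SIZE; ++t) {
        // Cooperative loads: consecutive threadIdx.x values touch
        // consecutive global addresses, so the loads coalesce.
        int aCol = t * TILE_SIZE + threadIdx.x;
        int bRow = t * TILE_SIZE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] =
            (row < m && aCol < k) ? A[row * k + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < k && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();

        // Each element loaded above is reused TILE_SIZE times here.
        for (int i = 0; i < TILE_SIZE; ++i)
            acc += As[threadIdx.y][i] * Bs[i][threadIdx.x];
        __syncthreads();
    }
    if (row < m && col < n)
        C[row * n + col] = acc;
}
```

The two `__syncthreads()` calls are required: the first ensures a tile is fully loaded before any thread reads it, and the second ensures all threads are done reading before the tile is overwritten on the next iteration.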