This project contains CUDA implementations of matrix multiplication (GEMM).
- `naive_gemm.cu`: a straightforward GEMM kernel
- `opt_gemm.cu`: an optimized version using tiling and shared memory
- and so on.
- Input:
  - Matrix `A`: size `m × k`
  - Matrix `B`: size `k × n`
- Output:
  - Matrix `C`: size `m × n`, where `C = A × B`
All matrices are assumed to be row‑major float arrays.
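Given the row‑major convention above, a plain CPU reference is handy for validating kernel output. The sketch below is a hypothetical helper (the name `gemm_ref` is not from the repo), assuming the same `m`, `n`, `k` shapes:

```c
/* Hypothetical CPU reference for the convention above:
   row-major floats, C (m x n) = A (m x k) * B (k x n). */
void gemm_ref(const float *A, const float *B, float *C,
              int m, int n, int k) {
    for (int row = 0; row < m; ++row) {
        for (int col = 0; col < n; ++col) {
            float acc = 0.0f;
            for (int i = 0; i < k; ++i)
                acc += A[row * k + i] * B[i * n + col];  /* dot(A[row,:], B[:,col]) */
            C[row * n + col] = acc;
        }
    }
}
```

Comparing each kernel's output against this reference (within a small floating-point tolerance) is a simple correctness check.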
- `naive_gemm`
  - Each thread computes a single element `C[row, col]`.
  - The kernel reads `A[row, :]` and `B[:, col]` directly from global memory and performs the multiply–accumulate loop.
  - The implementation is simple but does not optimize memory reuse or access patterns.
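A naive one-thread-per-element kernel of this kind could look like the following sketch (the signature and launch geometry are assumptions, not the repo's exact code):

```cuda
// Sketch of a naive GEMM: one thread computes one C[row, col].
// Every A and B element is re-read from global memory on each use.
__global__ void naive_gemm(const float* A, const float* B, float* C,
                           int m, int n, int k) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m && col < n) {
        float acc = 0.0f;
        for (int i = 0; i < k; ++i)
            acc += A[row * k + i] * B[i * n + col];  // row-major indexing
        C[row * n + col] = acc;
    }
}
```

Because neighboring threads in a row of the block share the same `A[row, :]` but each reloads it from global memory, this version wastes bandwidth, which is exactly what the tiled variant addresses.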
- `opt_gemm`
  - Uses `TILE_SIZE × TILE_SIZE` tiles for computation.
  - Uses `__shared__` memory to cache tiles of `A` and `B` and reuse them within a block.
  - Organizes global memory accesses to be coalesced, improving bandwidth utilization.
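A tiled kernel following these three points could be sketched as below; `TILE_SIZE = 16`, the signature, and boundary handling are assumptions for illustration:

```cuda
#define TILE_SIZE 16

// Sketch of a tiled GEMM: each block computes a TILE_SIZE x TILE_SIZE
// tile of C, staging tiles of A and B through shared memory.
__global__ void opt_gemm(const float* A, const float* B, float* C,
                         int m, int n, int k) {
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (k + TILE_SIZE - 1) / TILE_SIZE; ++t) {
        // Cooperative loads: consecutive threadIdx.x values touch
        // consecutive global addresses, so the loads coalesce.
        int aCol = t * TILE_SIZE + threadIdx.x;
        int bRow = t * TILE_SIZE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] =
            (row < m && aCol < k) ? A[row * k + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < k && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();

        // Each element loaded above is reused TILE_SIZE times here.
        for (int i = 0; i < TILE_SIZE; ++i)
            acc += As[threadIdx.y][i] * Bs[i][threadIdx.x];
        __syncthreads();
    }
    if (row < m && col < n)
        C[row * n + col] = acc;
}
```

The two `__syncthreads()` calls are required: the first ensures a tile is fully loaded before any thread reads it, and the second ensures all threads are done reading before the tile is overwritten on the next iteration.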