Pair: Alexander Cueva & Bernard Li
This report details implementing, optimizing, and analyzing the performance of matrix-matrix multiplication. We explored various optimization techniques, including algorithmic changes and compiler flags, and evaluated their effectiveness using the Roofline performance model. Building on our previous homework, our algorithms benefitted from OMP parallel for directives as well as simd directives for loops where
This repository has:
- Three C++ implementations for matrix multiplication: a naive baseline, a loop-ordered optimization, and a block multiplication optimization.
- A
Makefileto compile the code with different optimization flags (-O1,-O2,-O3). - A Python script (
roofline.py) to generate Roofline models from the performance data. - CSV files containing the results of performance measurements for various matrix sizes and optimization levels.
- PNG images of the generated Roofline plots.
Three different C++ programs were used to measure the performance of matrix-matrix multiplication. All implementations use a single, contiguous block of memory for each matrix (e.g., new long double[rows * cols]), which is an efficient strategy that promotes good spatial locality.
This file contains the baseline, "naive" implementation using a standard ijk triple-loop structure, parallelized with OMP. This approach is simple to write but is known to have poor cache performance because it accesses the second matrix (B) in a column-wise fashion, leading to non-sequential memory access and frequent cache misses. Although this approach is paralleized with OMP, since
the result matrix cells do not rely on each other for computation, it is still
relatively slow due to these cache misses.
// parallel for: splits loop iterations among multiple threads
// collapse(2): merges 2 outer loops into a single iteration space,
// increasing the amount of parallel work to rows*cols threads
// schedule(static): divides rows*cols iterations as evenly as possible
// among threads before the loop starts, into chunks proportional to
// #iterations / #threads and assigns them statically to the threads
// efficient for loops where each iteration takes roughly the same time
#pragma omp parallel for collapse(2) schedule(static)
for (int i = 0; i < rows; i++) {
for (int j = 0; j < cols; j++) {
for (int k = 0; k < shared; k++) {
result[i * cols + j] += matrixA[i * shared + k] * matrixB[k * cols + j];
}
}
}This version reorders the loops to an ikj structure. This is an optimization for row-major matrices. In this ordering, the innermost loop iterates over j, resulting in sequential access patterns for both the result matrix C and the second input matrix B. The element from matrix A is fetched once and reused across the entire inner loop, significantly improving cache performance. The
OMP directives allow us to divide the work roughly evenly amongst all workers,
letting us take advantage of the inherently symmetrical/parallel nature of
matrix multiplication. Furthermore, the simd directive allows us to perform
the same operation (vectorized) on multiple data elements at once. This works
because of the tight loop where we update the results matrix.
// Optimized ikj loop order
// parallel for: splits loop iterations among multiple threads
// schedule(static): divides rows*cols iterations as evenly as possible among threads before loop starts (into chunks proportional to #iterations / #threads and assigns them statically to the threads)
// efficient for loops where each iteration takes roughly the same amount of time
#pragma omp parallel for schedule(static)
for (int i = 0; i < rows; i++) {
for (int k = 0; k < shared; k++) {
// a_val is private to each thread's stack (reduce memory access inside the inner loop)
long double a_val = matrixA[i * shared + k];
// simd: perform SIMD (same operation on multiple data elements at once)
#pragma omp simd
for (int j = 0; j < cols; j++) {
result[i * cols + j] += a_val * matrixB[k * cols + j];
}
}
}This optimization uses a block matrix multiplication (aka tiling). The matrices are divided into smaller, cache-sized blocks. The computation proceeds block by block, which dramatically increases data reuse due to temporal locality. Within each block, the ikj loop ordering is used to maintain efficient sequential memory access due to spatial locality. OMP directives parallize the for loops.
#pragma omp parallel for
for (int ii = 0; ii < rows; ii += blockSize) {
for (int kk = 0; kk < shared; kk += blockSize) {
for (int jj = 0; jj < cols; jj += blockSize) {
int iMax = std::min(ii + blockSize, rows);
int kMax = std::min(kk + blockSize, shared);
int jMax = std::min(jj + blockSize, cols);
for (int i = ii; i < iMax; i++) {
for (int k = kk; k < kMax; k++) {
long double a_ik = matrixA[i * shared + k];
for (int j = jj; j < jMax; j++) {
result[i * cols + j] += a_ik * matrixB[k * cols + j];
}
}
}
}
}
}The results stored in the results_*.csv files show that compiler optimizations have a massive impact on performance.
- No Flags: Performance is uniformly poor across all implementations.
- -O1: Provides a significant boost over the unoptimized version.
- -O2: Offers further improvements. This level enables a large number of optimizations, including instruction scheduling and automatic vectorization, that are highly effective for numerical code.
- -O3: Provides a marginal improvement over
-O2and in some cases can be slightly slower. For this reason,-O2is often the recommended "go-to" optimization level.
The OMP block multiplication and OMP ikj versions consistently outperform the naive ijk version at every optimization level, implying that while compiler optimizations are powerful, a better algorithm ultimately provides a stronger basis.
The Roofline model helps visualize the performance of an application in the context of the hardware's limitations—specifically, its peak memory bandwidth and peak computational rate. For this assignment, the assumed hardware specifications were:
- Peak Memory Bandwidth: 160 GB/s
- Peak FLOP Rate: 1 TFLOP/s (1000 GFLOP/s)
The ridge point of the model occurs at an arithmetic intensity (AI) of 1000 / 160 = 6.25 FLOPs/byte.
- If the AI of an application is less than 6.25, its performance is limited by memory bandwidth (it is memory-bound).
- If the AI is greater than 6.25, its performance is limited by the CPU's peak computational rate (it is compute-bound).
Figure 1: Roofline model for the OMP naive ijk implementation with -O3 optimization.
Figure 2: Roofline model for the OMP ijk multiplication implementation with -O3 optimization.
Figure 3: Roofline model for the OMP block multiplication implementation with -O3 optimization.
The plot shows that for all tested matrix sizes, the AI is greater than the ridge point. This means the matrix-matrix multiplication is compute-bound. Our goal is therefore to move the performance points vertically towards the "compute roof" of 1000 GFLOP/s.
The results show that even our best-optimized code does not reach the theoretical peak. However, the plots clearly show that the optimized versions (ikj and block) achieve higher performance (are higher on the chart) than the naive version. We should be able to reach the theoretical peak (or become memory-bound) with parallelization, which on CPUs can be via OpenMP. However, one can imagine
that all of this being computed on the CPU can make it difficult to reach this peak, which would more easily be hit via GPU usage.
Overall, we see that a simple change in loop ordering (ijk to ikj) provided a substantial performance boost by aligning memory access patterns with the row-major storage format. However, tiling provided the best performance by maximizing both spatial and temporal cache locality.
Compiler flags like -O2 and -O3 provide huge performance gains, but they cannot fix a fundamentally inefficient algorithm. The best results are achieved when an efficient algorithm is paired with aggressive compiler optimization. One more compiler optimization we did not implement in our code is the use of loop unrolling (#pragma unroll). This is especially effective on smaller matrices which can be easily unrolled and bounds entirely fit into local cache memory.
The Roofline model showed that our application is compute-bound, meaning further significant improvements would require more advanced techniques like parallelization via better OpenMP directives and better multiplciation algorithms. However, if one were to offload much of the work here to a more specialized unit like a GPU, we could more easily hit being memory bound rather than compute bound.
An important point to notice is that for our block multiplication OMP roofline model, it tends to perform better with higher matrix numbers, which makes sense as being able to partition out matrices into more blocks helps with parallelization. There was a possiblity to insert a simd directive in this code as well, but we ran into some correctness issues when doing so.


