Performance Optimization of Matrix-Matrix Multiplication on CPU

Pair: Alexander Cueva & Bernard Li

This report details implementing, optimizing, and analyzing the performance of matrix-matrix multiplication. We explored various optimization techniques, including algorithmic changes and compiler flags, and evaluated their effectiveness using the Roofline performance model. Building on our previous homework, our algorithms benefitted from OMP parallel for directives as well as simd directives for loops where

This repository has:

Three C++ implementations for matrix multiplication: a naive baseline, a loop-ordered optimization, and a block multiplication optimization.
A Makefile to compile the code with different optimization flags (-O1, -O2, -O3).
A Python script (roofline.py) to generate Roofline models from the performance data.
CSV files containing the results of performance measurements for various matrix sizes and optimization levels.
PNG images of the generated Roofline plots.

1. Implementation Details

Three different C++ programs were used to measure the performance of matrix-matrix multiplication. All implementations use a single, contiguous block of memory for each matrix (e.g., new long double[rows * cols]), which is an efficient strategy that promotes good spatial locality.

1.1. `main_omp.cc` - Naive `ijk` Implementation with OMP

This file contains the baseline, "naive" implementation using a standard ijk triple-loop structure, parallelized with OMP. This approach is simple to write but is known to have poor cache performance because it accesses the second matrix (B) in a column-wise fashion, leading to non-sequential memory access and frequent cache misses. Although this approach is paralleized with OMP, since the result matrix cells do not rely on each other for computation, it is still relatively slow due to these cache misses.

// parallel for: splits loop iterations among multiple threads
// collapse(2): merges 2 outer loops into a single iteration space,
// increasing the amount of parallel work to rows*cols threads
// schedule(static): divides rows*cols iterations as evenly as possible
// among threads before the loop starts, into chunks proportional to
// #iterations / #threads and assigns them statically to the threads
// efficient for loops where each iteration takes roughly the same time
#pragma omp parallel for collapse(2) schedule(static)
for (int i = 0; i < rows; i++) {
    for (int j = 0; j < cols; j++) {
        for (int k = 0; k < shared; k++) {
            result[i * cols + j] += matrixA[i * shared + k] * matrixB[k * cols + j];
        }
    }
}

1.2. `ikj_optimized_omp.cc` - Loop Reordering (`ikj`)

This version reorders the loops to an ikj structure. This is an optimization for row-major matrices. In this ordering, the innermost loop iterates over j, resulting in sequential access patterns for both the result matrix C and the second input matrix B. The element from matrix A is fetched once and reused across the entire inner loop, significantly improving cache performance. The OMP directives allow us to divide the work roughly evenly amongst all workers, letting us take advantage of the inherently symmetrical/parallel nature of matrix multiplication. Furthermore, the simd directive allows us to perform the same operation (vectorized) on multiple data elements at once. This works because of the tight loop where we update the results matrix.

// Optimized ikj loop order
// parallel for: splits loop iterations among multiple threads
// schedule(static): divides rows*cols iterations as evenly as possible among threads before loop starts (into chunks proportional to #iterations /  #threads and assigns them statically to the threads)
// efficient for loops where each iteration takes roughly the same amount of time
#pragma omp parallel for schedule(static)
for (int i = 0; i < rows; i++) {
for (int k = 0; k < shared; k++) {
    // a_val is private to each thread's stack (reduce memory access inside the inner loop)
    long double a_val = matrixA[i * shared + k];
    // simd: perform SIMD (same operation on multiple data elements at once)
    #pragma omp simd
    for (int j = 0; j < cols; j++) {
        result[i * cols + j] += a_val * matrixB[k * cols + j];
    }
}
}

1.3. `blockMultiply_opti_omp.cc` - Block Matrix Multiplication

This optimization uses a block matrix multiplication (aka tiling). The matrices are divided into smaller, cache-sized blocks. The computation proceeds block by block, which dramatically increases data reuse due to temporal locality. Within each block, the ikj loop ordering is used to maintain efficient sequential memory access due to spatial locality. OMP directives parallize the for loops.

#pragma omp parallel for 
for (int ii = 0; ii < rows; ii += blockSize) {
    for (int kk = 0; kk < shared; kk += blockSize) {
        for (int jj = 0; jj < cols; jj += blockSize) {
            int iMax = std::min(ii + blockSize, rows);
            int kMax = std::min(kk + blockSize, shared);
            int jMax = std::min(jj + blockSize, cols);

            for (int i = ii; i < iMax; i++) {
                for (int k = kk; k < kMax; k++) {
                    long double a_ik = matrixA[i * shared + k];
                    for (int j = jj; j < jMax; j++) {
                        result[i * cols + j] += a_ik * matrixB[k * cols + j];
                    }
                }
            }
        }
    }
}

2. Performance Analysis

2.1. Impact of Compiler Optimizations

The results stored in the results_*.csv files show that compiler optimizations have a massive impact on performance.

No Flags: Performance is uniformly poor across all implementations.
-O1: Provides a significant boost over the unoptimized version.
-O2: Offers further improvements. This level enables a large number of optimizations, including instruction scheduling and automatic vectorization, that are highly effective for numerical code.
-O3: Provides a marginal improvement over -O2 and in some cases can be slightly slower. For this reason, -O2 is often the recommended "go-to" optimization level.

The OMP block multiplication and OMP ikj versions consistently outperform the naive ijk version at every optimization level, implying that while compiler optimizations are powerful, a better algorithm ultimately provides a stronger basis.

2.2. Roofline Model Analysis

The Roofline model helps visualize the performance of an application in the context of the hardware's limitations—specifically, its peak memory bandwidth and peak computational rate. For this assignment, the assumed hardware specifications were:

Peak Memory Bandwidth: 160 GB/s
Peak FLOP Rate: 1 TFLOP/s (1000 GFLOP/s)

The ridge point of the model occurs at an arithmetic intensity (AI) of 1000 / 160 = 6.25 FLOPs/byte.

If the AI of an application is less than 6.25, its performance is limited by memory bandwidth (it is memory-bound).
If the AI is greater than 6.25, its performance is limited by the CPU's peak computational rate (it is compute-bound).

Figure 1: Roofline model for the OMP naive ijk implementation with -O3 optimization.

Figure 2: Roofline model for the OMP ijk multiplication implementation with -O3 optimization.

Figure 3: Roofline model for the OMP block multiplication implementation with -O3 optimization.

The plot shows that for all tested matrix sizes, the AI is greater than the ridge point. This means the matrix-matrix multiplication is compute-bound. Our goal is therefore to move the performance points vertically towards the "compute roof" of 1000 GFLOP/s.

The results show that even our best-optimized code does not reach the theoretical peak. However, the plots clearly show that the optimized versions (ikj and block) achieve higher performance (are higher on the chart) than the naive version. We should be able to reach the theoretical peak (or become memory-bound) with parallelization, which on CPUs can be via OpenMP. However, one can imagine that all of this being computed on the CPU can make it difficult to reach this peak, which would more easily be hit via GPU usage.

3. Conclusion

Overall, we see that a simple change in loop ordering (ijk to ikj) provided a substantial performance boost by aligning memory access patterns with the row-major storage format. However, tiling provided the best performance by maximizing both spatial and temporal cache locality.

Compiler flags like -O2 and -O3 provide huge performance gains, but they cannot fix a fundamentally inefficient algorithm. The best results are achieved when an efficient algorithm is paired with aggressive compiler optimization. One more compiler optimization we did not implement in our code is the use of loop unrolling (#pragma unroll). This is especially effective on smaller matrices which can be easily unrolled and bounds entirely fit into local cache memory.

The Roofline model showed that our application is compute-bound, meaning further significant improvements would require more advanced techniques like parallelization via better OpenMP directives and better multiplciation algorithms. However, if one were to offload much of the work here to a more specialized unit like a GPU, we could more easily hit being memory bound rather than compute bound.

An important point to notice is that for our block multiplication OMP roofline model, it tends to perform better with higher matrix numbers, which makes sense as being able to partition out matrices into more blocks helps with parallelization. There was a possiblity to insert a simd directive in this code as well, but we ran into some correctness issues when doing so.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitignore		.gitignore
Lecture03_OpenMP.pdf		Lecture03_OpenMP.pdf
Makefile		Makefile
README.md		README.md
blockMultiply_opti.cc		blockMultiply_opti.cc
blockMultiply_opti_omp.cc		blockMultiply_opti_omp.cc
bmo		bmo
bmo_mp		bmo_mp
ikj_optimized		ikj_optimized
ikj_optimized.cc		ikj_optimized.cc
ikj_optimized_omp		ikj_optimized_omp
ikj_optimized_omp.cc		ikj_optimized_omp.cc
main		main
main.cc		main.cc
main_omp		main_omp
main_omp.cc		main_omp.cc
results.csv		results.csv
results_base_O1.csv		results_base_O1.csv
results_base_O2.csv		results_base_O2.csv
results_base_O3.csv		results_base_O3.csv
results_base_no.csv		results_base_no.csv
results_base_omp_O0.csv		results_base_omp_O0.csv
results_base_omp_O1.csv		results_base_omp_O1.csv
results_base_omp_O2.csv		results_base_omp_O2.csv
results_base_omp_O3.csv		results_base_omp_O3.csv
results_base_omp_no.csv		results_base_omp_no.csv
results_block_O1.csv		results_block_O1.csv
results_block_O2.csv		results_block_O2.csv
results_block_O3.csv		results_block_O3.csv
results_block_no.csv		results_block_no.csv
results_block_omp_O0.csv		results_block_omp_O0.csv
results_block_omp_O1.csv		results_block_omp_O1.csv
results_block_omp_O2.csv		results_block_omp_O2.csv
results_block_omp_O3.csv		results_block_omp_O3.csv
results_block_omp_no.csv		results_block_omp_no.csv
results_ikj_O1.csv		results_ikj_O1.csv
results_ikj_O2.csv		results_ikj_O2.csv
results_ikj_O3.csv		results_ikj_O3.csv
results_ikj_no.csv		results_ikj_no.csv
results_ikj_omp_O0.csv		results_ikj_omp_O0.csv
results_ikj_omp_O1.csv		results_ikj_omp_O1.csv
results_ikj_omp_O2.csv		results_ikj_omp_O2.csv
results_ikj_omp_O3.csv		results_ikj_omp_O3.csv
results_ikj_omp_no.csv		results_ikj_omp_no.csv
roofline.py		roofline.py
roofline_model.png		roofline_model.png
roofline_model_base_O1.png		roofline_model_base_O1.png
roofline_model_base_O2.png		roofline_model_base_O2.png
roofline_model_base_O3.png		roofline_model_base_O3.png
roofline_model_base_no.png		roofline_model_base_no.png
roofline_model_base_omp_O0.png		roofline_model_base_omp_O0.png
roofline_model_base_omp_O1.png		roofline_model_base_omp_O1.png
roofline_model_base_omp_O2.png		roofline_model_base_omp_O2.png
roofline_model_base_omp_O3.png		roofline_model_base_omp_O3.png
roofline_model_base_omp_no.png		roofline_model_base_omp_no.png
roofline_model_block_O1.png		roofline_model_block_O1.png
roofline_model_block_O2.png		roofline_model_block_O2.png
roofline_model_block_O3.png		roofline_model_block_O3.png
roofline_model_block_no.png		roofline_model_block_no.png
roofline_model_block_omp_O0.png		roofline_model_block_omp_O0.png
roofline_model_block_omp_O1.png		roofline_model_block_omp_O1.png
roofline_model_block_omp_O2.png		roofline_model_block_omp_O2.png
roofline_model_block_omp_O3.png		roofline_model_block_omp_O3.png
roofline_model_block_omp_no.png		roofline_model_block_omp_no.png
roofline_model_ikj_O1.png		roofline_model_ikj_O1.png
roofline_model_ikj_O2.png		roofline_model_ikj_O2.png
roofline_model_ikj_O3.png		roofline_model_ikj_O3.png
roofline_model_ikj_no.png		roofline_model_ikj_no.png
roofline_model_omp_O0.png		roofline_model_omp_O0.png
roofline_model_omp_O1.png		roofline_model_omp_O1.png
roofline_model_omp_O2.png		roofline_model_omp_O2.png
roofline_model_omp_O3.png		roofline_model_omp_O3.png
roofline_model_omp_no.png		roofline_model_omp_no.png
settings.json		settings.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Performance Optimization of Matrix-Matrix Multiplication on CPU

1. Implementation Details

1.1. `main_omp.cc` - Naive `ijk` Implementation with OMP

1.2. `ikj_optimized_omp.cc` - Loop Reordering (`ikj`)

1.3. `blockMultiply_opti_omp.cc` - Block Matrix Multiplication

2. Performance Analysis

2.1. Impact of Compiler Optimizations

2.2. Roofline Model Analysis

3. Conclusion

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Performance Optimization of Matrix-Matrix Multiplication on CPU

1. Implementation Details

1.1. main_omp.cc - Naive ijk Implementation with OMP

1.2. ikj_optimized_omp.cc - Loop Reordering (ikj)

1.3. blockMultiply_opti_omp.cc - Block Matrix Multiplication

2. Performance Analysis

2.1. Impact of Compiler Optimizations

2.2. Roofline Model Analysis

3. Conclusion

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1.1. `main_omp.cc` - Naive `ijk` Implementation with OMP

1.2. `ikj_optimized_omp.cc` - Loop Reordering (`ikj`)

1.3. `blockMultiply_opti_omp.cc` - Block Matrix Multiplication

Packages