This repository contains a collection of small assignments exploring serial and parallel implementations using Pthreads, OpenMP, MPI, and CUDA. Each assignment demonstrates different parallelization techniques and performance considerations, accompanied by detailed reports analyzing the results.
Problems include:
- The Game of Life
- Gauss Elimination
- Readers/Writers on shared data structures
- Matrix Multiplication
- N particle simulation
This work was developed as part of the THP04 Parallel Systems Course at NKUA
-
You can find all implementations of the assignments under
src/[Framework]/. -
Scripts for gathering experiment data can be found under
scripts/, with plotting scripts being inscripts/graph-gen. -
Scripts will produce their own directories named
hwX_datainside ofscripts/, while plots will be saved in thegraphs-genfolder. -
Reports for all experiments can be found in the
reportsfolder
You can compile this project by running make all in the main directory.
All executables will be placed inside bin/, so you can execute each assignment using ./bin/[Framework]-[assigment_name] ....
You can run all data-gathering experiments using the bash script in the main directory, by running ./runall.bash
The graphs generation script requires specific Python packages to function. You can find the dependencies inside the scripts\graphs-gen\requirements.txt
Here you can find notes about each assignment.
The program can be executed by running:
$ ./bin/pthread-pi [number of threads] [number of iterations] [optional: exclusivity flag -s/-p]The optional flag was implemented for easier data gathering
-sruns only the serial implementation.-pruns only the parallel implementation.- Not specifying a flag will execute both implementations
General notes:
Our parallel implementation does not use synchronization mechanisms. Instead, it stores each thread's results and sums them up at the end.
The program contains two thread functions.
For mutex locking, run with:
$ ./bin/pthread-loop [-m] [thread_count] [incr_num]For atomic increments, use the following:
$ ./bin/pthread-loop [-a] [thread_count] [incr_num]The program contains two thread functions.
For standard implementation (multiple cache updates):
$ ./bin/pthread-parallel-sum [thread_count] [incr_num]For optimized cache-aware implementation:
$ ./bin/pthread-parallel-sum [thread_count] [incr_num] [-o]The program can be executed by running:
$ ./bin/pthread-parallel-list [priority flag -r/-w] [thread_count] [keys_count] [ops_count] [search_percentage] [insert_percentage]-wruns the writer's priority implementation.-rruns the reader's priority implementation.
General notes:
You can find detailed logic information about each implementation in the source code comments
The program can be executed by running:
$ ./bin/omp-game-of-life [#threads (0 for serial implementation)] [dimensions count] [generation count] [optional: per item or per row parallelism (i/r)]The optional flag was implemented to select the parallelism type
-ieach time a thread is assigned a cell.-reach time a thread is assigned a row.- Not specifying a flag will execute the per-row implementation
General notes:
- For our grid, we use an oversized table of (dim + 2)x(dim + 2). By filling the frame with zeros, we can make neighbour calculations easier, without the need for checking corner cases (e.g a[0][0]).
- As there is a clear dependency on the current generation to calculate the next, parallelism can only be used in the calculation process of a generation and NOT on the outer generation loop.
- Our per-row implementation allows threads to use the locality of items in the same row, while also preventing cache refreshes for the other threads when editing a cell.
The program can be executed by running:
$ ./bin/omp-gauss-elimination [matrice dimension] [-s (serial) | -p (parallel)] [-r (per row) | -c (per column)] [thread_count]The source code contains implementations of four algorithms in total (2 with serial implementation and 2 with parallel).
Parallelization occurs in the inner loop in both algorithms.
The program can be executed by running:
$ mpiexec [optional: -f ../machines] [-n num of processes] ../bin/mpi-game-of-life [dimensions count] [generation count]General notes:
- The root process initializes a (dim + 2)*(dim + 2) size grid and scatters it to every process in chunks of rows_per_proc = dim / commsz.
- Every process has its own local allocated chunk of size rows_per_proc + 2. These chunks have two extra upper and lower rows to make space for the fringes that it will receive from neighboring processes.
- Sends its (central) rows to neighboring processes as fringes (nonblocking send) and receives from the neighboring processes its own fringes.
- Calculates the (central) rows_per_proc with the help of the received fringes and repeats from the second step for generation count times.
The program can be executed by running:
$ mpiexec [optional: -f ../machines ] [-n num_of_processes] ../bin/mpi-table-matrix-mult [dimension]General notes:
- The program creates a random matrix and vector and performs the multiplication calculation upon it. To minimize cache misses we take as granted that the resulted random matrix is the TRANSPOSED version of the intended one.
- Contrary to standard, columns are represented as table rows and vice versa.
- Each process receives a part of the matrix and vector via the scatter function and performs its stand-alone calculations later merging them using reduction.
- Time takes into account only the calculation aspect (including communication times) and not time need for allocating memory