MiniSpark Engine

A miniature Apache Spark-like data processing engine written in C, built on POSIX threads and a hand-rolled thread pool.

Skills Demonstrated

Systems programming in C - manual memory management, file I/O, dynamic data structures
Concurrent programming - thread pool, mutex/condition variable synchronization, producer/consumer work queue
Parallel algorithm design - recursive DAG traversal, dependency-driven task scheduling, partition-level parallelism
OS-level APIs - sched_getaffinity, clock_gettime, POSIX threads
Testing - 21/21 tests passing, including a ThreadSanitizer run for concurrency correctness

Overview

MiniSpark implements the core execution model of Apache Spark on a single node. Programs declare lazy transformation pipelines over partitioned datasets (RDDs), and the engine materializes them in parallel using a fixed-size worker thread pool.

Built for UW-Madison CS 537 (Operating Systems, Spring 2025). Received full marks.

Architecture

RDDs (Resilient Distributed Datasets) Immutable, partitioned data collections that form the nodes of the processing graph. Partitions are populated lazily - no computation runs until an action is called.

Transformations (lazy - build the DAG)

Operation	Description
`map(rdd, fn)`	Apply `fn` to every element in every partition
`filter(rdd, fn, ctx)`	Keep only elements for which `fn` returns non-zero
`partitionBy(rdd, fn, n, ctx)`	Hash-repartition into `n` partitions
`join(rdd1, rdd2, fn, ctx)`	Inner join on matching keys

Actions (trigger materialization)

count(rdd) - materialize and return total element count
print(rdd, printer) - materialize and print each element

Simple DAG example - RDDFromFiles creates one partition per file, each holding a FILE*. A map transformation applies the mapper function to each partition lazily. Nothing executes until count or print is called.

Thread Pool Worker count is set to the number of available CPU cores via sched_getaffinity. The pool uses a circular bounded buffer protected by pthread_mutex_t, with pthread_cond_t for blocking workers when the queue is empty and waking them when work arrives. The main thread blocks until all in-flight partitions drain.

DAG Scheduler A recursive post-order traversal resolves dependencies before submitting tasks. No partition is enqueued until all partitions it depends on have been materialized, eliminating the deadlock risk of letting blocked workers hold pool threads.

The diagram below shows a wide-dependency DAG: two independent file pipelines (map -> partitionBy) run concurrently, then converge at a hash-partitioned join. RDD3 and RDD7 must both fully materialize before RDD8 can begin.

Metrics Monitor A dedicated monitoring thread drains a separate lock-protected metrics queue. Each completed task records creation time, scheduling time, and execution duration to metrics.log at microsecond resolution using clock_gettime(CLOCK_MONOTONIC).

Example Applications

Binary	DAG Pipeline
`linecount`	`RDDFromFiles` -> `map(GetLines)` -> `count`
`cat`	`RDDFromFiles` -> `map(GetLines)` -> `print`
`grep`	`RDDFromFiles` -> `map(GetLines)` -> `filter(StringContains)` -> `print`
`grepcount`	`RDDFromFiles` -> `map(GetLines)` -> `filter(StringContains)` -> `count`
`sumjoin`	Two file RDDs -> `map` -> `partitionBy` -> `join(SumJoiner)` -> `print`
`concurrency`	Stress test: many independent RDD chains materialized concurrently

Build

Recommended (any platform)

docker build -t minispark .
docker run --rm -v "$(pwd)":/app -w /app minispark make

Linux native (Ubuntu 22.04)

make        # compiles all 6 binaries into bin/
make clean  # removes object files and bin/

The codebase uses sched_getaffinity, a Linux-only syscall. Docker provides a consistent Ubuntu 22.04 build environment and is the easiest path on any platform.

Usage

./bin/linecount file1.txt file2.txt
./bin/grep needle file1.txt file2.txt
./bin/sumjoin left.txt right.txt

Testing

cd tests && ./run-tests.sh -c

The test suite was provided by the UW-Madison CS 537 course framework. Tests verify stdout, stderr, and return code against reference outputs.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
applications		applications
graphics		graphics
lib		lib
sample-files		sample-files
src		src
tests		tests
.gitignore		.gitignore
Dockerfile		Dockerfile
Makefile		Makefile
README.md		README.md
gdb_concurrent_debugging.md		gdb_concurrent_debugging.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MiniSpark Engine

Skills Demonstrated

Overview

Architecture

Example Applications

Build

Usage

Testing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

MiniSpark Engine

Skills Demonstrated

Overview

Architecture

Example Applications

Build

Usage

Testing

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages