Multi-Container Runtime

1. Team Information

Name	SRN
Zia Kadijah	PES2UG24CS620
Himani Nune	PES2UG24CS634

2. Build, Load, and Run Instructions

Prerequisites

Ubuntu 22.04 or 24.04 VM
Secure Boot OFF
Dependencies installed:

sudo apt update
sudo apt install -y build-essential linux-headers-$(uname -r)

Build

cd boilerplate
sudo make

Prepare Root Filesystems

# Run inside boilerplate/ (already the working directory after the Build step)
mkdir -p rootfs-base
wget https://dl-cdn.alpinelinux.org/alpine/v3.20/releases/x86_64/alpine-minirootfs-3.20.3-x86_64.tar.gz
tar -xzf alpine-minirootfs-3.20.3-x86_64.tar.gz -C rootfs-base
cp -a ./rootfs-base ./rootfs-alpha
cp -a ./rootfs-base ./rootfs-beta

Load Kernel Module

sudo insmod monitor.ko

# Verify device was created
ls -l /dev/container_monitor

# Verify module loaded cleanly
sudo dmesg | tail -5

Start the Supervisor

In Terminal 1:

sudo ./engine supervisor ./rootfs-base

Launch Containers

In Terminal 2:

# Start containers in background
sudo ./engine start alpha ./rootfs-alpha /bin/sh
sudo ./engine start beta ./rootfs-beta /bin/sh

# List running containers
sudo ./engine ps

# View container logs
sudo ./engine logs alpha

# Stop a container
sudo ./engine stop alpha

# Run a container in foreground (blocks until exit)
sudo ./engine run gamma ./rootfs-alpha /bin/hostname

Memory Limit Testing

# Copy memory_hog into rootfs first
cp memory_hog ./rootfs-alpha/

# Start with custom limits
sudo ./engine start hog ./rootfs-alpha /memory_hog --soft-mib 10 --hard-mib 20

# Watch kernel events
sudo dmesg | grep container_monitor

Scheduling Experiments

# Copy cpu_hog into rootfs
cp cpu_hog ./rootfs-alpha/
cp cpu_hog ./rootfs-beta/

# Run two containers with different priorities
sudo ./engine start cn-normal ./rootfs-alpha /cpu_hog
sudo ./engine start cn-nice ./rootfs-beta /cpu_hog --nice 15

# Compare completion
sudo ./engine ps

Unload Module and Clean Up

# Stop supervisor with Ctrl+C in Terminal 1, then:
sudo rmmod monitor
sudo dmesg | tail -5

3. Demo Screenshots

Screenshot 1 — Multi-Container Supervision

Two containers (alpha, beta) running concurrently under one supervisor process.

Screenshot 2 — Metadata Tracking

Output of the engine ps command, showing the container ID, host PID, and state of each tracked container.

Screenshot 3 — Bounded-Buffer Logging

Log file contents captured through the logging pipeline. Shows the per-container log file being created and read back via the logs command.

Screenshot 4 — CLI and IPC

A stop command issued over the UNIX domain socket. The supervisor responds and updates the container state from running to stopped.

Screenshot 5 — Soft-Limit Warning

dmesg output showing SOFT LIMIT warning event when container RSS exceeds the configured soft limit.

Screenshot 6 — Hard-Limit Enforcement

dmesg output showing the HARD LIMIT event and the container being killed, followed by engine ps output with the container state updated to killed.

Screenshot 7 — Scheduling Experiment

cn-normal (nice 0) completed in ~10s. cn-nice (nice +15) still running after 2+ minutes under the same workload.

Screenshot 8 — Clean Teardown

No zombie processes in ps aux. Socket /tmp/mini_runtime.sock removed. Module unloaded cleanly with no errors in dmesg.

After killing the supervisor:

4. Engineering Analysis

4.1 Isolation Mechanisms

Our runtime achieves process and filesystem isolation using three Linux namespaces, created via the clone() system call with the flags CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS.

The PID namespace gives each container its own isolated PID number space. The first process inside the container appears as PID 1 from its own perspective, even though the host kernel assigns it a real host PID. Processes inside cannot see or signal host processes or processes in other containers.

The UTS namespace allows each container to have its own hostname. We call sethostname() inside child_fn to set the container's hostname to its ID, so tools like hostname running inside the container return the container name rather than the host machine name.

The mount namespace gives the container its own view of the filesystem. We call chroot() into the Alpine Linux mini root filesystem, so the container's root / maps to the rootfs-alpha or rootfs-beta directory on the host. The container cannot navigate above this root. We also mount a fresh /proc inside using mount("proc", "/proc", "proc", ...) so process utilities work correctly inside the container.

What containers still share with the host: the kernel itself. All containers run on the same kernel with no virtualization boundary. We also did not create a network namespace (CLONE_NEWNET), so all containers share the host network stack. IPC and user namespaces are likewise shared in our implementation.
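The namespace setup described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual launch_container(): the struct child_config type, its fields, and the launch_sketch() name are hypothetical, and the MS_PRIVATE remount is an extra step that the text does not mention (it keeps the /proc mount from propagating back to the host on systems where mounts are shared by default).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

/* Hypothetical per-container settings; the field names are assumptions. */
struct child_config {
    const char *id;      /* container ID, reused as the hostname */
    const char *rootfs;  /* host path of the container root filesystem */
    char *const *argv;   /* command to execute inside the container */
};

/* The three namespace flags named in the text, plus SIGCHLD so the
 * parent is notified when the container's first process exits. */
int container_clone_flags(void)
{
    return CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS | SIGCHLD;
}

/* Runs as PID 1 inside the new namespaces. */
static int child_fn(void *arg)
{
    const struct child_config *cfg = arg;

    if (sethostname(cfg->id, strlen(cfg->id)) != 0)
        return 1;
    /* Assumption, not in the text: make mounts private so the /proc
     * mount below stays inside this mount namespace. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0)
        return 1;
    if (chroot(cfg->rootfs) != 0 || chdir("/") != 0)
        return 1;
    if (mount("proc", "/proc", "proc", 0, NULL) != 0)
        return 1;
    execv(cfg->argv[0], cfg->argv);
    return 127; /* only reached if execv fails */
}

/* Returns the host PID of the container, or -1 on failure.
 * Creating these namespaces requires root (CAP_SYS_ADMIN).
 * The stack is intentionally not freed on success in this sketch. */
pid_t launch_sketch(struct child_config *cfg)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;
    /* The stack grows downward on x86-64, so pass the top of the buffer. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE,
                      container_clone_flags(), cfg);
    if (pid < 0)
        free(stack);
    return pid;
}
```

The value returned by clone() is the host PID, which is what the supervisor would record, while the child sees itself as PID 1.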

4.2 Supervisor and Process Lifecycle

A long-running parent supervisor is essential because of how Unix process lifecycle works. When a child process exits, it becomes a zombie — its exit status is held in the kernel's process table until the parent calls wait() to collect it. If the parent exits immediately after spawning the child, the child gets reparented to PID 1 (init), which will eventually reap it. But this means the supervisor loses track of the child entirely and cannot update metadata, log the exit code, or enforce cleanup.

By keeping the supervisor alive, we maintain full visibility into every container's lifecycle. We install a SIGCHLD handler that calls waitpid(-1, &status, WNOHANG) in a loop, reaping every exited child without blocking the event loop. WNOHANG makes waitpid() return immediately when no child has exited, so the handler never stalls the main loop waiting on a container that is still running.

Parent-child relationships are tracked in a mutex-protected linked list of container_record_t structs. Each record stores the host PID, start time, state, memory limits, log path, and exit information. The SIGCHLD handler walks this list to find the matching PID and updates the state to exited or killed depending on whether the process exited normally or was terminated by a signal.
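A minimal sketch of the reaping loop and the exited/killed classification is shown here. The enum names and the on_exit_cb hook are hypothetical stand-ins for the real container_record_t update, which the text does not show.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical state names; the text only says "exited" or "killed". */
typedef enum { STATE_RUNNING, STATE_EXITED, STATE_KILLED } container_state_t;

/* Map a waitpid() status word to a container state. */
container_state_t classify_status(int status)
{
    if (WIFSIGNALED(status))
        return STATE_KILLED;   /* terminated by a signal, e.g. SIGKILL */
    if (WIFEXITED(status))
        return STATE_EXITED;   /* returned from main() or called exit() */
    return STATE_RUNNING;      /* stopped or continued: still alive */
}

/* Reap every child that has already exited, without blocking.
 * on_exit_cb is a hypothetical hook where the metadata list would be
 * updated. Returns the number of children reaped. */
int reap_children(void (*on_exit_cb)(pid_t, container_state_t, int))
{
    int reaped = 0;
    int status;
    pid_t pid;

    /* waitpid() returns 0 when children exist but none has exited yet,
     * and -1 (ECHILD) when there are no children left. */
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        if (on_exit_cb != NULL)
            on_exit_cb(pid, classify_status(status), status);
        reaped++;
    }
    return reaped;
}
```

Looping matters because several children can exit before the handler runs, and pending SIGCHLD signals are not queued individually.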

SIGINT and SIGTERM to the supervisor set a should_stop flag. The event loop checks this flag on each iteration and exits cleanly, allowing the logger thread to drain and join before the supervisor process terminates.
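The flag-based shutdown could look roughly like this. Only the should_stop name comes from the text; the helper function names are illustrative assumptions.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set from the signal handler, polled by the event loop. */
static volatile sig_atomic_t should_stop = 0;

static void handle_stop_signal(int signo)
{
    (void)signo;
    should_stop = 1; /* only set a flag: no locks or I/O in the handler */
}

/* Install the same handler for SIGINT and SIGTERM. Returns 0 on success. */
int install_stop_handlers(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handle_stop_signal;
    sigemptyset(&sa.sa_mask);
    /* No SA_RESTART: a blocking accept() fails with EINTR, which gives
     * the loop a chance to re-check should_stop promptly. */
    if (sigaction(SIGINT, &sa, NULL) != 0)
        return -1;
    return sigaction(SIGTERM, &sa, NULL);
}

/* What the event loop would test on each iteration. */
int stop_requested(void)
{
    return should_stop != 0;
}
```

An event loop built on this would run while stop_requested() is 0, then drain and join the logger thread before exiting.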

4.3 IPC, Threads, and Synchronization

Our project uses two distinct IPC mechanisms: pipes for log data and a UNIX domain socket for the control plane.

Pipes (log data path): Each container's stdout and stderr are redirected via dup2() into the write end of a pipe created in launch_container(). The supervisor retains the read end and spawns a per-container pipe_reader_thread that reads chunks of up to 4096 bytes at a time. Each chunk is wrapped in a log_item_t struct (tagged with the container ID) and pushed into a shared bounded_buffer_t.
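The redirection step can be illustrated with a small self-contained sketch. The capture_child_output() helper is hypothetical: it forks a child that simply writes a message where a real container would exec its command, and the parent plays the role of the pipe reader.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child whose stdout and stderr go into a pipe, let it write
 * `message`, and read what arrives into buf (NUL-terminated).
 * Returns the number of bytes captured, or -1 on error. */
ssize_t capture_child_output(const char *message, char *buf, size_t size)
{
    int fds[2];

    if (size == 0 || pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: point stdout and stderr at the write end. */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        dup2(fds[1], STDERR_FILENO);
        close(fds[1]);
        /* A real container would exec its command here. */
        ssize_t ignored = write(STDOUT_FILENO, message, strlen(message));
        (void)ignored;
        _exit(0);
    }

    /* Parent: close the write end so read() sees EOF once the child exits. */
    close(fds[1]);
    size_t total = 0;
    ssize_t n;
    while (total < size - 1 &&
           (n = read(fds[0], buf + total, size - 1 - total)) > 0)
        total += (size_t)n;
    buf[total] = '\0';
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return (ssize_t)total;
}
```

Closing the parent's copy of the write end is the step that is easiest to forget: without it the reader never sees end-of-file.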

A single logging_thread consumer pops items from the bounded buffer and appends them to per-container log files under the logs/ directory. This design decouples the speed of log production from log file I/O.

Bounded buffer synchronization: The bounded buffer has a capacity of 16 slots. It is protected by a pthread_mutex with two condition variables: not_full (producer waits here when all 16 slots are occupied) and not_empty (consumer waits here when the buffer is empty). Without this synchronization the following race conditions would exist:

Two pipe reader threads pushing simultaneously could corrupt the tail index and count, causing data to be written to the wrong slot or overwritten.
The consumer could read a slot whose count was incremented but whose data has not yet been written, reading garbage data.
Non-atomic updates to head, tail, and count could leave the buffer in an inconsistent state under concurrent access.
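A minimal sketch of such a bounded buffer follows. The capacity, the chunk size, and the names not_full and not_empty come from the text; the struct fields and the bb_* function names are assumptions, and a real implementation would also need a shutdown path so the consumer can exit.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define BUF_CAPACITY 16   /* capacity stated in the text */
#define CHUNK_MAX 4096    /* maximum chunk size stated in the text */

/* Field names are assumptions; the text names only the struct types. */
typedef struct {
    char container_id[32];
    size_t len;
    char data[CHUNK_MAX];
} log_item_t;

typedef struct {
    log_item_t slots[BUF_CAPACITY];
    size_t head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} bounded_buffer_t;

void bb_init(bounded_buffer_t *bb)
{
    bb->head = bb->tail = bb->count = 0;
    pthread_mutex_init(&bb->lock, NULL);
    pthread_cond_init(&bb->not_full, NULL);
    pthread_cond_init(&bb->not_empty, NULL);
}

/* Producer side: blocks while all slots are occupied. */
void bb_push(bounded_buffer_t *bb, const log_item_t *item)
{
    pthread_mutex_lock(&bb->lock);
    while (bb->count == BUF_CAPACITY)      /* loop guards against spurious wakeups */
        pthread_cond_wait(&bb->not_full, &bb->lock);
    bb->slots[bb->tail] = *item;           /* slot is filled before count changes */
    bb->tail = (bb->tail + 1) % BUF_CAPACITY;
    bb->count++;
    pthread_cond_signal(&bb->not_empty);
    pthread_mutex_unlock(&bb->lock);
}

/* Consumer side: blocks while the buffer is empty. */
void bb_pop(bounded_buffer_t *bb, log_item_t *out)
{
    pthread_mutex_lock(&bb->lock);
    while (bb->count == 0)
        pthread_cond_wait(&bb->not_empty, &bb->lock);
    *out = bb->slots[bb->head];
    bb->head = (bb->head + 1) % BUF_CAPACITY;
    bb->count--;
    pthread_cond_signal(&bb->not_full);
    pthread_mutex_unlock(&bb->lock);
}
```

Because head, tail, count, and the slot contents are only touched while the lock is held, none of the three races listed above can occur.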

UNIX domain socket (control plane): CLI commands connect to /tmp/mini_runtime.sock and send a control_request_t struct. The supervisor reads the struct, processes the command, and responds with a control_response_t. This is kept separate from the logging pipe so that control messages and log data never interfere with each other.
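The client side of this exchange might be sketched as below. The socket path and the CMD_PS name come from the text; the struct layouts, the other command names, and the helper functions are hypothetical, since the text names the types but not their fields.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

#define CONTROL_SOCK_PATH "/tmp/mini_runtime.sock"  /* path from the text */

/* Hypothetical layouts and command names (only CMD_PS appears in the text). */
typedef enum { CMD_START, CMD_RUN, CMD_PS, CMD_LOGS, CMD_STOP } command_kind_t;
typedef struct { command_kind_t kind; char container_id[32]; } control_request_t;
typedef struct { int status; char message[128]; } control_response_t;

/* Write exactly len bytes, retrying on short writes. Returns 0 or -1. */
int write_full(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes, retrying on short reads. Returns 0 or -1. */
int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0)
            return -1;  /* error, or peer closed before the struct was complete */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Connect to the supervisor's socket. Returns a descriptor or -1. */
int connect_control(const char *path)
{
    struct sockaddr_un addr;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A CLI command would call connect_control(CONTROL_SOCK_PATH), send one control_request_t with write_full(), and wait for one control_response_t with read_full(). Sending raw structs works here because both ends are the same binary on the same machine.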

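Container metadata list: The linked list of container_record_t structs is protected by metadata_lock, a pthread_mutex. Three code paths access it: the SIGCHLD handler (updates state on child exit), the event loop (reads and modifies state for CLI commands), and the CMD_PS handler (iterates the list to format output). Without the mutex, these could corrupt the list or observe a half-updated record. One caveat: pthread_mutex_lock() is not async-signal-safe, so if the handler fires while the interrupted thread already holds metadata_lock it will deadlock. A safer arrangement is to block SIGCHLD around critical sections, or to have the handler only set a flag and let the event loop do the reaping and list update.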

4.4 Memory Management and Enforcement

RSS (Resident Set Size) measures the number of physical memory pages currently mapped into a process's address space that are actually resident in RAM. We read RSS using get_mm_rss(mm) * PAGE_SIZE in the kernel module, where mm is the process's memory descriptor.

RSS does not measure pages that have been swapped out to disk, or pages that have been allocated via malloc but not yet touched (Linux allocates lazily: a page is faulted into physical memory only on first access). RSS does include resident pages shared with other processes, such as shared library code, and counts them in full for every process that maps them, so summing RSS across containers overstates real memory use.

Soft and hard limits implement different enforcement policies. The soft limit is a warning threshold — when a container's RSS first exceeds it, the kernel module logs a SOFT LIMIT warning to dmesg and sets a soft_limit_warned flag on the entry. The warning fires only once per container. This allows the supervisor to be notified that a container is memory-hungry without disrupting its execution. The hard limit is a termination threshold — when RSS exceeds it, the kernel module calls send_sig(SIGKILL, task, 1) to immediately terminate the process, then removes the entry from the monitored list.

Memory limit enforcement belongs in kernel space rather than user space for two reasons. First, a user-space monitor is subject to scheduler delays: it may sleep for hundreds of milliseconds between checks, during which a process could allocate far beyond its hard limit. The kernel timer fires at a predictable interval with much lower latency. Second, the monitor itself is out of reach: it runs inside the kernel, so no container process can kill, stop, or starve it, whereas a user-space monitor is an ordinary process exposed to all three. SIGKILL cannot be caught, blocked, or ignored no matter who sends it, so the advantage lies in where the check runs, not in the signal.
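The two-threshold policy can be modeled in plain user-space C, as a sketch of the decision logic only. The real check runs in the kernel module; the type names, the enum, and every field except soft_limit_warned are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { LIMIT_OK, LIMIT_SOFT_WARN, LIMIT_HARD_KILL } limit_action_t;

/* Field names other than soft_limit_warned are assumptions. */
typedef struct {
    size_t soft_limit_bytes;
    size_t hard_limit_bytes;
    bool soft_limit_warned;
} monitored_entry_t;

/* Decide what to do for one RSS sample of one container. */
limit_action_t check_limits(monitored_entry_t *e, size_t rss_bytes)
{
    if (rss_bytes > e->hard_limit_bytes)
        return LIMIT_HARD_KILL;       /* caller sends SIGKILL and drops the entry */
    if (rss_bytes > e->soft_limit_bytes && !e->soft_limit_warned) {
        e->soft_limit_warned = true;  /* warn only once per container */
        return LIMIT_SOFT_WARN;
    }
    return LIMIT_OK;
}
```

With the limits from the memory_hog example (10 MiB soft, 20 MiB hard), a 12 MiB sample produces one warning, later 12 MiB samples produce nothing, and a 25 MiB sample triggers the kill.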

4.5 Scheduling Behavior

We ran two containers simultaneously on the same single-core VM, both executing the cpu_hog workload which performs continuous CPU-bound computation and reports progress every second.

cn-normal was started with the default nice value of 0.
cn-nice was started with --nice 15, a strong deprioritization (nice values range up to 19, and any process may raise its own nice value without privileges).

Result: cn-normal completed its 10-second workload in approximately 10 seconds of wall time. cn-nice had not completed after more than 2 minutes under the same workload.

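Linux uses the Completely Fair Scheduler (CFS). CFS assigns CPU time in proportion to a per-task weight derived from the nice value. In the kernel's sched_prio_to_weight table the weight for nice 0 is 1024 and the weight for nice 15 is 36 (each nice step changes the weight by roughly 1.25x). A nice 0 task therefore receives roughly 28x more CPU time than a nice 15 task when the two compete on the same CPU, about 97% versus 3%. (Kernels 6.6 and later, including the 6.8 kernel that Ubuntu 24.04 ships, replace the CFS task-selection logic with EEVDF but keep the same nice-to-weight table, so these proportions still apply.) On a single-core VM with background system activity also competing for CPU, the nice +15 container was almost entirely starved.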
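The proportional-share arithmetic can be checked with a few lines of C, taking the weights for nice 0 through 19 from the kernel's sched_prio_to_weight table (1024 for nice 0, 36 for nice 15). The cpu_share() helper is an illustrative name, and it models only two always-runnable tasks on one CPU.

```c
#include <assert.h>

/* Weights for nice 0..19 from the kernel's sched_prio_to_weight table. */
static const int nice_weight[20] = {
    1024, 820, 655, 526, 423, 335, 272, 215, 172, 137,
     110,  87,  70,  56,  45,  36,  29,  23,  18,  15
};

/* Expected CPU share (0..1) of a task at nice_a while it competes with
 * one other always-runnable task at nice_b on the same CPU. */
double cpu_share(int nice_a, int nice_b)
{
    assert(nice_a >= 0 && nice_a < 20 && nice_b >= 0 && nice_b < 20);
    double wa = nice_weight[nice_a];
    double wb = nice_weight[nice_b];
    return wa / (wa + wb);
}
```

For the experiment's pairing, cpu_share(0, 15) is 1024 / 1060, about 0.966, and cpu_share(15, 0) is 36 / 1060, about 0.034.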

This demonstrates CFS's core design goal: fairness among equal-priority processes, with deliberate and substantial bias against higher-nice processes when CPU is contested. This behavior is useful for separating interactive workloads (low nice, responsive) from background batch jobs (high nice, runs whenever CPU is idle) without hard resource partitioning.

5. Design Decisions and Tradeoffs

Namespace isolation: We used clone() with chroot() rather than pivot_root(). chroot is simpler to implement and sufficient for demonstrating isolation. The tradeoff is that chroot is less secure than pivot_root — a privileged process with CAP_SYS_CHROOT can escape the chroot jail. For a production container runtime, pivot_root with an unmounted old root would be the correct choice. We chose chroot because the project goal is demonstrating isolation concepts, not production security hardening.

Supervisor architecture: The supervisor uses a single-threaded event loop that handles one CLI request at a time. The tradeoff is that a long-running command (like run waiting for a container to exit) blocks other CLI requests. A multi-threaded accept loop would solve this but introduces more complex locking requirements around the metadata list. We chose single-threaded simplicity to reduce deadlock risk for this implementation.

IPC and logging: We used pipes for log data and a UNIX domain socket for control. Pipes are unidirectional, so containers can only produce output — they cannot receive input from the supervisor. This is acceptable for this project since containers run non-interactive workloads. A bidirectional channel (like a pair of pipes or a socket per container) would be needed for interactive container support.

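Kernel monitor mutex vs spinlock: We used a mutex to protect the monitored list rather than a spinlock. Code that holds a spinlock must not sleep, but our timer callback calls kmalloc(GFP_KERNEL), which can sleep, and our ioctl handler also allocates memory. A spinlock would be appropriate if we only needed to iterate the list without allocation; because we allocate and free nodes inside the critical section, the mutex is the correct choice. One caveat: a mutex can only be taken in process context, so this design is valid only if the periodic check runs from a workqueue or kernel thread. A plain timer_list callback runs in atomic context, where neither mutex_lock() nor a GFP_KERNEL allocation is permitted.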

Scheduling experiments: We used nice values rather than cgroups CPU quota for the scheduling experiment. Nice values only affect relative scheduling priority — they don't enforce an absolute CPU time ceiling. Cgroups cpu.max would enforce a hard limit. We chose nice values because they are simpler to set via setpriority() without requiring cgroup hierarchy setup, and they are sufficient to demonstrate CFS priority behavior clearly.
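Applying a nice value with setpriority() takes only a few lines. In this sketch the apply_nice() and current_nice() helper names are assumptions; in a runtime the call would typically happen in the child just before exec.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/resource.h>

/* Apply a nice value to the calling process.
 * Returns 0 on success, -1 on failure (errno is set). */
int apply_nice(int nice_value)
{
    /* who == 0 with PRIO_PROCESS means "the caller". Raising the nice
     * value needs no privilege; lowering it does. */
    return setpriority(PRIO_PROCESS, 0, nice_value);
}

/* Read back the current nice value. getpriority() can legitimately
 * return -1, so errno is cleared first to tell errors apart. */
int current_nice(int *out)
{
    errno = 0;
    int value = getpriority(PRIO_PROCESS, 0);
    if (value == -1 && errno != 0)
        return -1;
    *out = value;
    return 0;
}
```

Nothing here caps CPU usage: a task at nice 15 still gets the whole CPU whenever nothing else is runnable, which is the difference from cpu.max noted above.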

6. Scheduler Experiment Results

Experiment Setup

Both containers ran the cpu_hog binary simultaneously on a single-core Ubuntu 24.04 VM. cpu_hog performs continuous CPU-bound computation and reports progress once per second.

Results

Container	Nice Value	Workload Duration	Wall-Clock Completion
cn-normal	0	10s	~10 seconds
cn-nice	+15	10s	>2 minutes (did not complete)

Observation

The cn-normal container completed its 10-second workload in approximately 10 seconds of real time, as expected. The cn-nice container with nice value +15 was still running after more than 2 minutes under the identical workload.

This result demonstrates that CFS strongly penalizes high-nice processes when CPU is contested. The weight ratio between nice 0 (1024) and nice 15 (36) means the normal-priority container received approximately 28x more CPU time per scheduling period. On a single-core VM with background system activity, the nice +15 container was effectively starved, receiving CPU time only in the gaps between all other runnable processes.

This behavior is consistent with CFS design goals: fairness among equal-priority processes, with deliberate bias favoring lower nice values to allow system administrators to express workload priority without hard resource partitioning.
