| Name | SRN |
|---|---|
| Zia Kadijah | PES2UG24CS620 |
| Himani Nune | PES2UG24CS634 |
- Ubuntu 22.04 or 24.04 VM
- Secure Boot OFF
- Dependencies installed:

```bash
sudo apt update
sudo apt install -y build-essential linux-headers-$(uname -r)
```

Build the project:

```bash
cd boilerplate
sudo make
```

Prepare the root filesystems:

```bash
cd boilerplate
mkdir rootfs-base
wget https://dl-cdn.alpinelinux.org/alpine/v3.20/releases/x86_64/alpine-minirootfs-3.20.3-x86_64.tar.gz
tar -xzf alpine-minirootfs-3.20.3-x86_64.tar.gz -C rootfs-base
cp -a ./rootfs-base ./rootfs-alpha
cp -a ./rootfs-base ./rootfs-beta
```

Load the kernel module:

```bash
sudo insmod monitor.ko
# Verify device was created
ls -l /dev/container_monitor
# Verify module loaded cleanly
sudo dmesg | tail -5
```

In Terminal 1:

```bash
sudo ./engine supervisor ./rootfs-base
```

In Terminal 2:
```bash
# Start containers in background
sudo ./engine start alpha ./rootfs-alpha /bin/sh
sudo ./engine start beta ./rootfs-beta /bin/sh
# List running containers
sudo ./engine ps
# View container logs
sudo ./engine logs alpha
# Stop a container
sudo ./engine stop alpha
# Run a container in foreground (blocks until exit)
sudo ./engine run gamma ./rootfs-alpha /bin/hostname
```

Test the memory limits:

```bash
# Copy memory_hog into rootfs first
cp memory_hog ./rootfs-alpha/
# Start with custom limits
sudo ./engine start hog ./rootfs-alpha /memory_hog --soft-mib 10 --hard-mib 20
# Watch kernel events
sudo dmesg | grep container_monitor
```

Run the scheduling experiment:

```bash
# Copy cpu_hog into rootfs
cp cpu_hog ./rootfs-alpha/
cp cpu_hog ./rootfs-beta/
# Run two containers with different priorities
sudo ./engine start cn-normal ./rootfs-alpha /cpu_hog
sudo ./engine start cn-nice ./rootfs-beta /cpu_hog --nice 15
# Compare completion
sudo ./engine ps
```

Clean up:

```bash
# Stop supervisor with Ctrl+C in Terminal 1, then:
sudo rmmod monitor
sudo dmesg | tail -5
```

- Two containers (`alpha`, `beta`) running concurrently under one supervisor process.
- Output of the `ps` command showing container ID, host PID, and state for each tracked container.
- Log file contents captured through the logging pipeline, showing the per-container log file created and readable via the `logs` command.
- `stop` command issued via the UNIX domain socket. The supervisor responds and updates the container state from running to stopped.
- `dmesg` output showing the SOFT LIMIT warning event when container RSS exceeds the configured soft limit.
- `dmesg` output showing the HARD LIMIT event and the container being killed, with `ps` output showing the container state updated to killed.
- `cn-normal` (nice 0) completed in ~10s. `cn-nice` (nice +15) still running after 2+ minutes under the same workload.
- No zombie processes in `ps aux`. Socket `/tmp/mini_runtime.sock` removed. Module unloaded cleanly with no errors in `dmesg`.
Our runtime achieves process and filesystem isolation using three Linux namespaces, created via the clone() system call with the flags CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS.
The PID namespace gives each container its own isolated PID number space. The first process inside the container appears as PID 1 from its own perspective, even though the host kernel assigns it a real host PID. Processes inside cannot see or signal host processes or processes in other containers.
The UTS namespace allows each container to have its own hostname. We call sethostname() inside child_fn to set the container's hostname to its ID, so tools like hostname running inside the container return the container name rather than the host machine name.
The mount namespace gives the container its own view of the filesystem. We call chroot() into the Alpine Linux mini root filesystem, so the container's root / maps to the rootfs-alpha or rootfs-beta directory on the host. The container cannot navigate above this root. We also mount a fresh /proc inside using mount("proc", "/proc", "proc", ...) so process utilities work correctly inside the container.
What the host kernel still shares with all containers: the kernel itself is shared. All containers run on the same kernel with no virtualization boundary. Additionally, we did not create a network namespace (CLONE_NEWNET), so all containers share the host network stack. IPC and user namespaces are also shared in our implementation.
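The namespace setup described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `struct child_args`, `container_clone_flags()`, and `launch_in_namespaces()` are hypothetical names, and the 1 MiB stack size is an assumption. Only `child_fn`, the three `CLONE_NEW*` flags, `sethostname()`, `chroot()`, and the `/proc` mount come from the text.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical argument bundle; the project's child_fn may take other fields. */
struct child_args {
    const char *id;      /* container ID, reused as the hostname */
    const char *rootfs;  /* for example "./rootfs-alpha" */
    char *const *argv;   /* command to exec inside the container */
};

/* The three namespace flags from the text, plus SIGCHLD as the exit signal. */
int container_clone_flags(void)
{
    return CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS | SIGCHLD;
}

/* Runs as PID 1 of the new PID namespace; returns only on failure. */
int child_fn(void *arg)
{
    const struct child_args *a = arg;
    assert(a && a->id && a->rootfs && a->argv);

    if (sethostname(a->id, strlen(a->id)) != 0) { perror("sethostname"); return 1; }
    if (chroot(a->rootfs) != 0 || chdir("/") != 0) { perror("chroot"); return 1; }
    if (mount("proc", "/proc", "proc", 0, NULL) != 0) { perror("mount"); return 1; }
    execv(a->argv[0], a->argv);
    perror("execv");
    return 1;
}

/* Needs root (CAP_SYS_ADMIN). Returns the child's host PID, or -1 on error. */
pid_t launch_in_namespaces(struct child_args *a)
{
    const size_t stack_size = 1024 * 1024;  /* assumed size */
    char *stack = malloc(stack_size);
    if (!stack) return -1;
    /* The stack grows down on x86-64, so pass the top of the allocation. */
    pid_t pid = clone(child_fn, stack + stack_size, container_clone_flags(), a);
    free(stack);  /* no CLONE_VM: the child runs on its own copy of this memory */
    return pid;
}
```

The value returned by `clone()` is the host PID, which is what the supervisor would record; inside the container the same process sees itself as PID 1.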
A long-running parent supervisor is essential because of how Unix process lifecycle works. When a child process exits, it becomes a zombie — its exit status is held in the kernel's process table until the parent calls wait() to collect it. If the parent exits immediately after spawning the child, the child gets reparented to PID 1 (init), which will eventually reap it. But this means the supervisor loses track of the child entirely and cannot update metadata, log the exit code, or enforce cleanup.
By keeping the supervisor alive, we maintain full visibility into every container's lifecycle. We install a SIGCHLD handler that calls waitpid(-1, &status, WNOHANG) in a loop, reaping all exited children immediately without blocking the event loop. We use WNOHANG so the handler returns immediately if no children are ready, avoiding a deadlock where the signal handler blocks the main loop.
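A non-blocking reap loop of the kind described here might look like the sketch below. The function name `reap_children` and the `on_reap` callback are hypothetical; the project's handler may be structured differently.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Reap every child that has already exited, without blocking.
 * `on_reap` is an optional hypothetical hook called once per reaped child.
 * Returns the number of children reaped in this call. */
int reap_children(void (*on_reap)(pid_t pid, int status))
{
    int reaped = 0;
    int saved_errno = errno;  /* keep errno intact if called from a handler */

    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, WNOHANG);
        if (pid > 0) {
            if (on_reap) on_reap(pid, status);
            reaped++;
            continue;
        }
        if (pid == -1 && errno == EINTR) continue;
        /* pid == 0: children exist but none has exited yet.
         * pid == -1 with ECHILD: there are no children at all. */
        break;
    }

    errno = saved_errno;
    return reaped;
}
```

The loop matters because several children can exit while a single SIGCHLD is pending; calling `waitpid()` only once per signal could leave zombies behind.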
Parent-child relationships are tracked in a mutex-protected linked list of container_record_t structs. Each record stores the host PID, start time, state, memory limits, log path, and exit information. The SIGCHLD handler walks this list to find the matching PID and updates the state to exited or killed depending on whether the process exited normally or was terminated by a signal.
SIGINT and SIGTERM to the supervisor set a should_stop flag. The event loop checks this flag on each iteration and exits cleanly, allowing the logger thread to drain and join before the supervisor process terminates.
Our project uses two distinct IPC mechanisms: pipes for log data and a UNIX domain socket for the control plane.
Pipes (log data path): Each container's stdout and stderr are redirected via dup2() into the write end of a pipe created in launch_container(). The supervisor retains the read end and spawns a per-container pipe_reader_thread that reads chunks of up to 4096 bytes at a time. Each chunk is wrapped in a log_item_t struct (tagged with the container ID) and pushed into a shared bounded_buffer_t.
A single logging_thread consumer pops items from the bounded buffer and appends them to per-container log files under the logs/ directory. This design decouples the speed of log production from log file I/O.
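The producer side of this pipeline (pipe, `dup2()`, 4096-byte reads) can be sketched in isolation. The helper `capture_output()` is hypothetical and collapses the reader thread and bounded buffer into a single in-memory copy, so it shows only how the child's output reaches the parent.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child whose stdout and stderr both feed one pipe, exec `argv` in it,
 * and copy what it writes into `out` (at most `cap` bytes are kept).
 * Returns the number of bytes stored, or -1 on error. */
ssize_t capture_output(char *const argv[], char *out, size_t cap)
{
    int fds[2];
    if (pipe(fds) != 0) return -1;

    pid_t pid = fork();
    if (pid < 0) { close(fds[0]); close(fds[1]); return -1; }

    if (pid == 0) {                      /* child: redirect, then exec */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        dup2(fds[1], STDERR_FILENO);
        close(fds[1]);
        execvp(argv[0], argv);
        _exit(127);                      /* exec failed */
    }

    close(fds[1]);  /* parent must drop the write end or read() never sees EOF */

    size_t total = 0;
    char chunk[4096];
    ssize_t n;
    while ((n = read(fds[0], chunk, sizeof chunk)) > 0) {
        size_t room = cap - total;
        size_t take = (size_t)n < room ? (size_t)n : room;
        memcpy(out + total, chunk, take);  /* excess output is drained and dropped */
        total += take;
    }

    close(fds[0]);
    waitpid(pid, NULL, 0);
    return (ssize_t)total;
}
```

In the design described above, each chunk read in the loop would instead be wrapped in a `log_item_t` and pushed into the shared buffer.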
Bounded buffer synchronization: The bounded buffer has a capacity of 16 slots. It is protected by a pthread_mutex with two condition variables: not_full (producer waits here when all 16 slots are occupied) and not_empty (consumer waits here when the buffer is empty). Without this synchronization the following race conditions would exist:
- Two pipe reader threads pushing simultaneously could corrupt the `tail` index and `count`, causing data to be written to the wrong slot or overwritten.
- The consumer could read a slot whose `count` was incremented but whose data has not yet been written, reading garbage data.
- Non-atomic updates to `head`, `tail`, and `count` could leave the buffer in an inconsistent state under concurrent access.
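A bounded buffer with this locking scheme can be sketched as below. The field layout of `log_item_t` and `bounded_buffer_t`, the `closed` flag, and the `bb_*` function names are assumptions made for illustration; only the 16-slot capacity, the mutex, and the `not_full` / `not_empty` condition variables come from the text.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define BB_CAPACITY 16
#define BB_CHUNK 4096

typedef struct {
    char container_id[32];   /* hypothetical field layout */
    size_t len;
    char data[BB_CHUNK];
} log_item_t;

typedef struct {
    log_item_t slots[BB_CAPACITY];
    size_t head, tail, count;
    int closed;              /* assumed shutdown flag */
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} bounded_buffer_t;

void bb_init(bounded_buffer_t *b)
{
    memset(b, 0, sizeof *b);
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->not_full, NULL);
    pthread_cond_init(&b->not_empty, NULL);
}

/* Blocks while the buffer is full. Returns 0, or -1 if the buffer is closed. */
int bb_push(bounded_buffer_t *b, const log_item_t *item)
{
    pthread_mutex_lock(&b->lock);
    while (b->count == BB_CAPACITY && !b->closed)
        pthread_cond_wait(&b->not_full, &b->lock);
    if (b->closed) { pthread_mutex_unlock(&b->lock); return -1; }
    b->slots[b->tail] = *item;            /* data is written before count changes */
    b->tail = (b->tail + 1) % BB_CAPACITY;
    b->count++;
    pthread_cond_signal(&b->not_empty);
    pthread_mutex_unlock(&b->lock);
    return 0;
}

/* Blocks while the buffer is empty. Returns 0, or -1 once closed and drained. */
int bb_pop(bounded_buffer_t *b, log_item_t *out)
{
    pthread_mutex_lock(&b->lock);
    while (b->count == 0 && !b->closed)
        pthread_cond_wait(&b->not_empty, &b->lock);
    if (b->count == 0) { pthread_mutex_unlock(&b->lock); return -1; }
    *out = b->slots[b->head];
    b->head = (b->head + 1) % BB_CAPACITY;
    b->count--;
    pthread_cond_signal(&b->not_full);
    pthread_mutex_unlock(&b->lock);
    return 0;
}

/* Wake every waiter so producers stop and the consumer can drain and exit. */
void bb_close(bounded_buffer_t *b)
{
    pthread_mutex_lock(&b->lock);
    b->closed = 1;
    pthread_cond_broadcast(&b->not_full);
    pthread_cond_broadcast(&b->not_empty);
    pthread_mutex_unlock(&b->lock);
}
```

The `while` loops around `pthread_cond_wait()` guard against spurious wakeups, and because the slot copy and the index updates happen under the same lock, none of the three races listed above can occur.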
UNIX domain socket (control plane): CLI commands connect to /tmp/mini_runtime.sock and send a control_request_t struct. The supervisor reads the struct, processes the command, and responds with a control_response_t. This is kept separate from the logging pipe so that control messages and log data never interfere with each other.
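The fixed-size request/response exchange can be sketched as follows. The struct fields, the command enum values other than `CMD_PS`, and the helper names are assumptions; the real `control_request_t` and `control_response_t` are not shown in this document. The sketch works on any connected stream socket, so it can be exercised with `socketpair()` as well as with a socket bound to `/tmp/mini_runtime.sock`.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wire format; the project's structs may differ. */
typedef enum { CMD_START = 1, CMD_STOP, CMD_PS, CMD_LOGS } command_kind_t;
typedef struct { int32_t kind; char container_id[32]; } control_request_t;
typedef struct { int32_t status; char message[128]; } control_response_t;

/* Stream sockets may deliver partial reads, so loop until `len` bytes arrive. */
int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n == 0) return -1;                       /* peer closed early */
        if (n < 0) { if (errno == EINTR) continue; return -1; }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

int write_full(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) { if (errno == EINTR) continue; return -1; }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Client side: send one request and wait for the reply. */
int send_request(int fd, const control_request_t *req, control_response_t *resp)
{
    if (write_full(fd, req, sizeof *req) != 0) return -1;
    return read_full(fd, resp, sizeof *resp);
}

/* Supervisor side: read one request, act on it, and reply. */
int serve_one(int fd)
{
    control_request_t req;
    control_response_t resp;
    if (read_full(fd, &req, sizeof req) != 0) return -1;
    req.container_id[sizeof req.container_id - 1] = '\0';  /* never trust the peer */
    memset(&resp, 0, sizeof resp);
    if (req.kind == CMD_STOP) {
        resp.status = 0;  /* a real supervisor would signal the container here */
        snprintf(resp.message, sizeof resp.message, "stopped %s", req.container_id);
    } else {
        resp.status = -1;
        snprintf(resp.message, sizeof resp.message, "unsupported command %d", (int)req.kind);
    }
    return write_full(fd, &resp, sizeof resp);
}
```

Sending raw structs is reasonable here because client and supervisor are built from the same source on the same machine; a protocol crossing machines or versions would need explicit serialization.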
Container metadata list: The linked list of container_record_t structs is protected by metadata_lock, a pthread_mutex. Three concurrent accessors exist: the SIGCHLD handler (updates state on child exit), the event loop (reads and modifies state for CLI commands), and the CMD_PS handler (iterates the list to format output). Without the mutex, these could corrupt the list or read stale state.
RSS (Resident Set Size) measures the number of physical memory pages currently mapped into a process's address space that are actually resident in RAM. We read RSS using get_mm_rss(mm) * PAGE_SIZE in the kernel module, where mm is the process's memory descriptor.
RSS does not measure pages that have been swapped out to disk, or pages that have been allocated via malloc but not yet touched (Linux allocates lazily: a page is faulted into physical memory only on first access). Shared pages need care: a resident page shared with other processes, such as shared library code, is counted in full in the RSS of every process that maps it, so per-container RSS can overstate the memory that a container uniquely consumes.
Soft and hard limits implement different enforcement policies. The soft limit is a warning threshold — when a container's RSS first exceeds it, the kernel module logs a SOFT LIMIT warning to dmesg and sets a soft_limit_warned flag on the entry. The warning fires only once per container. This allows the supervisor to be notified that a container is memory-hungry without disrupting its execution. The hard limit is a termination threshold — when RSS exceeds it, the kernel module calls send_sig(SIGKILL, task, 1) to immediately terminate the process, then removes the entry from the monitored list.
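The two-threshold policy can be modeled in plain user-space C, as in the sketch below. This is not kernel code: `monitor_entry_t`, `limit_action_t`, `check_limits()`, and `mib()` are hypothetical names, and the real module would log to dmesg and call `send_sig()` where this model simply returns an action. Only the `soft_limit_warned` flag and the warn-once and kill-on-exceed behavior come from the text.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { ACTION_NONE, ACTION_WARN, ACTION_KILL } limit_action_t;

/* Hypothetical per-container entry. */
typedef struct {
    size_t soft_bytes;
    size_t hard_bytes;
    bool soft_limit_warned;   /* set once so the warning fires a single time */
} monitor_entry_t;

/* Convert MiB to bytes, matching the --soft-mib / --hard-mib units. */
size_t mib(size_t n)
{
    return n << 20;
}

/* Decide what to do for one RSS sample. The hard limit is checked first so a
 * process that jumps straight past both thresholds is killed, not warned. */
limit_action_t check_limits(monitor_entry_t *e, size_t rss_bytes)
{
    if (rss_bytes > e->hard_bytes)
        return ACTION_KILL;
    if (rss_bytes > e->soft_bytes && !e->soft_limit_warned) {
        e->soft_limit_warned = true;
        return ACTION_WARN;
    }
    return ACTION_NONE;
}
```

With the limits from the `memory_hog` example (soft 10 MiB, hard 20 MiB), samples of 5, 12, 15, and 25 MiB would map to none, warn, none, and kill respectively.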
Memory limit enforcement belongs in kernel space rather than user space for two reasons. First, a user-space monitor is subject to scheduler delays: it may sleep for hundreds of milliseconds between checks, during which a process could allocate far beyond its hard limit. The kernel timer fires at a predictable interval with much lower latency. Second, kernel-space enforcement is harder to disrupt. SIGKILL cannot be caught, blocked, or ignored by the target regardless of who sends it, but a user-space monitor is itself a process that can be stopped, starved of CPU, or killed, whereas the kernel module keeps checking for as long as it is loaded.
We ran two containers simultaneously on the same single-core VM, both executing the cpu_hog workload which performs continuous CPU-bound computation and reports progress every second.
- cn-normal was started with the default nice value of 0.
- cn-nice was started with `--nice 15`, a strong deprioritization close to the maximum nice value of 19.
Result: cn-normal completed its 10-second workload in approximately 10 seconds of wall time. cn-nice had not completed after more than 2 minutes under the same workload.
Linux's Completely Fair Scheduler (CFS) assigns CPU time in proportion to a per-task weight derived from the nice value. (Kernels 6.6 and later, including the 6.8 kernel that Ubuntu 24.04 ships, replace CFS with the EEVDF scheduler but keep the same nice-to-weight table.) The weight for nice 0 is 1024; the weight for nice 15 is 36. A nice 0 process therefore receives roughly 28x more CPU time than a nice 15 process when the two compete on the same CPU. On a single-core VM with background system activity also competing for CPU, the nice +15 container was almost entirely starved.
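The proportional-share arithmetic can be checked with a short sketch. The table below reproduces the kernel's `sched_prio_to_weight` values; `weight_for_nice()` and `cpu_share()` are hypothetical helpers, and the model assumes exactly two always-runnable tasks on one CPU.

```c
/* Nice-to-weight table as defined in the Linux scheduler (nice -20 .. 19). */
static const int sched_prio_to_weight[40] = {
    88761, 71755, 56483, 46273, 36291,   /* -20 .. -16 */
    29154, 23254, 18705, 14949, 11916,   /* -15 .. -11 */
     9548,  7620,  6100,  4904,  3906,   /* -10 ..  -6 */
     3121,  2501,  1991,  1586,  1277,   /*  -5 ..  -1 */
     1024,   820,   655,   526,   423,   /*   0 ..   4 */
      335,   272,   215,   172,   137,   /*   5 ..   9 */
      110,    87,    70,    56,    45,   /*  10 ..  14 */
       36,    29,    23,    18,    15,   /*  15 ..  19 */
};

/* Look up the weight for a nice value, clamping to the valid range. */
int weight_for_nice(int nice)
{
    if (nice < -20) nice = -20;
    if (nice > 19) nice = 19;
    return sched_prio_to_weight[nice + 20];
}

/* Fraction of one CPU that a task at `nice_self` receives while competing
 * with a single other always-runnable task at `nice_other`. */
double cpu_share(int nice_self, int nice_other)
{
    double w_self = weight_for_nice(nice_self);
    double w_other = weight_for_nice(nice_other);
    return w_self / (w_self + w_other);
}
```

Under these assumptions `cpu_share(0, 15)` is 1024 / 1060, a little under 97%, which leaves the nice 15 task with a little over 3% of the CPU while both are runnable.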
This demonstrates CFS's core design goal: fairness among equal-priority processes, with deliberate and substantial bias against higher-nice processes when CPU is contested. This behavior is useful for separating interactive workloads (low nice, responsive) from background batch jobs (high nice, runs whenever CPU is idle) without hard resource partitioning.
Namespace isolation: We used clone() with chroot() rather than pivot_root(). chroot is simpler to implement and sufficient for demonstrating isolation. The tradeoff is that chroot is less secure than pivot_root — a privileged process with CAP_SYS_CHROOT can escape the chroot jail. For a production container runtime, pivot_root with an unmounted old root would be the correct choice. We chose chroot because the project goal is demonstrating isolation concepts, not production security hardening.
Supervisor architecture: The supervisor uses a single-threaded event loop that handles one CLI request at a time. The tradeoff is that a long-running command (like run waiting for a container to exit) blocks other CLI requests. A multi-threaded accept loop would solve this but introduces more complex locking requirements around the metadata list. We chose single-threaded simplicity to reduce deadlock risk for this implementation.
IPC and logging: We used pipes for log data and a UNIX domain socket for control. Pipes are unidirectional, so containers can only produce output — they cannot receive input from the supervisor. This is acceptable for this project since containers run non-interactive workloads. A bidirectional channel (like a pair of pipes or a socket per container) would be needed for interactive container support.
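Kernel monitor mutex vs spinlock: We used a mutex to protect the monitored list rather than a spinlock. Code must not sleep while holding a spinlock, but our timer callback calls kmalloc(GFP_KERNEL), which can sleep, and our ioctl handler also allocates memory. A spinlock would be appropriate if we only needed to iterate the list without allocation, but the mutex is the correct choice given that we allocate and free nodes inside the critical section. Taking a mutex is itself only legal in process context, so this design requires the periodic check to run from a workqueue or kernel thread rather than directly inside a timer_list callback, which executes in atomic (softirq) context.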
Scheduling experiments: We used nice values rather than cgroups CPU quota for the scheduling experiment. Nice values only affect relative scheduling priority — they don't enforce an absolute CPU time ceiling. Cgroups cpu.max would enforce a hard limit. We chose nice values because they are simpler to set via setpriority() without requiring cgroup hierarchy setup, and they are sufficient to demonstrate CFS priority behavior clearly.
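Applying a nice value with `setpriority()` might look like the sketch below. The helper names `apply_nice()` and `current_nice()` are hypothetical, and the sketch assumes the call is made in the child process before it execs the workload, so that only the container is affected.

```c
#include <errno.h>
#include <sys/resource.h>

/* Set the calling process's nice value. Raising it (lower priority) needs no
 * privilege; lowering it generally requires CAP_SYS_NICE. Returns 0 or -1. */
int apply_nice(int nice_value)
{
    if (nice_value < -20 || nice_value > 19) {
        errno = EINVAL;
        return -1;
    }
    return setpriority(PRIO_PROCESS, 0, nice_value);
}

/* Read the calling process's nice value into *out. getpriority() can return
 * -1 as a legitimate value, so errno is the only reliable error indicator. */
int current_nice(int *out)
{
    errno = 0;
    int value = getpriority(PRIO_PROCESS, 0);
    if (value == -1 && errno != 0) return -1;
    *out = value;
    return 0;
}
```

Because the nice value is inherited across `fork()` and preserved across `exec()`, setting it once in the child is enough for the whole workload.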
Both containers ran the cpu_hog binary simultaneously on a single-core Ubuntu 24.04 VM. cpu_hog performs continuous CPU-bound computation and reports progress once per second.
| Container | Nice Value | Workload Duration | Wall-Clock Completion |
|---|---|---|---|
| cn-normal | 0 | 10s | ~10 seconds |
| cn-nice | +15 | 10s | >2 minutes (did not complete) |
The cn-normal container completed its 10-second workload in approximately 10 seconds of real time, as expected. The cn-nice container with nice value +15 was still running after more than 2 minutes under the identical workload.
This result demonstrates that CFS strongly penalizes high-nice processes when CPU is contested. The weight ratio between nice 0 and nice 15 (1024 versus 36) means the normal-priority container received approximately 28x more CPU time per scheduling period. On a single-core VM with background system activity, the nice +15 container was effectively starved, receiving CPU time only in the gaps between all other runnable processes.
This behavior is consistent with CFS design goals: fairness among equal-priority processes, with deliberate bias favoring lower nice values to allow system administrators to express workload priority without hard resource partitioning.










