Platform: Ubuntu 22.04
Install:
sudo apt update && sudo apt install -y gcc build-essential e2fsprogs util-linux strace wget
You're going to create a real ext4 filesystem inside a regular file, mount it, and populate it with a real Linux root filesystem (Alpine). Sections after this operate on this image.
# Set your prompt to your SRN (last 3 digits only)
# Example: if your SRN is PES1UG24CS042, enter 042
export PS1="042\$ "# Create a 256 MB file full of zeros
dd if=/dev/zero of=orange.img bs=1M count=256
# Format it as ext4
mkfs.ext4 -L ORANGE orange.img
# Mount it
sudo mkdir -p /mnt/orange
sudo mount -o loop orange.img /mnt/orange
# Download Alpine Linux minirootfs and extract it
wget -q https://dl-cdn.alpinelinux.org/alpine/v3.21/releases/x86_64/alpine-minirootfs-3.21.3-x86_64.tar.gz
sudo tar xzf alpine-minirootfs-3.21.3-x86_64.tar.gz -C /mnt/orange
# See what we've got
df -h /mnt/orange
ls /mnt/orange# Record the filesystem metadata
sudo dumpe2fs -h orange.img 2>/dev/null | grep -E 'Block count|Block size|Inode count|Free blocks|Free inodes'Screenshot 0: Output of df -h /mnt/orange, ls /mnt/orange, and dumpe2fs summary.
Q0.1: From the
dumpe2fsoutput: how many total inodes exist, how many are free, and how many are used? How many total blocks, and what's the block size? Show the arithmetic to verify:used_inodes = total - free.
Filenames are a convenience for humans. The filesystem doesn't care about names — it identifies everything by inode number. An inode holds the file's metadata (size, permissions, timestamps, block pointers) but not the filename. The filename lives in the directory that contains the file.
This means you can have multiple names pointing to the same inode (hard links), or you can have a file with no name at all (deleted but still open).
Directory entry: Inode:
┌──────────────────┐ ┌─────────────────────┐
│ name: "hello.txt"│───────▶│ inode #41523 │
│ inode: 41523 │ │ size: 1024 │
└──────────────────┘ │ uid: 1000 │
┌──────────────────┐ │ blocks: [882, 883] │
│ name: "alias.txt"│───────▶│ link count: 2 │
│ inode: 41523 │ └─────────────────────┘
└──────────────────┘
Two names, one file. link count = 2.
# Create a test file on our filesystem
echo "original content" | sudo tee /mnt/orange/test.txt
# Hard link: another directory entry pointing to the SAME inode
sudo ln /mnt/orange/test.txt /mnt/orange/hard.txt
# Symbolic link: a separate file whose content is a path string
sudo ln -s /mnt/orange/test.txt /mnt/orange/soft.txt
# Compare inode numbers
ls -li /mnt/orange/test.txt /mnt/orange/hard.txt /mnt/orange/soft.txt
stat /mnt/orange/test.txt
stat /mnt/orange/soft.txtA filesystem can run out of inodes before running out of space. The number of inodes is fixed at format time.
# Create a tiny filesystem with very few inodes
dd if=/dev/zero of=tiny.img bs=1M count=10
mkfs.ext4 -N 16 tiny.img # only 16 inodes total!
sudo mkdir -p /mnt/tiny
sudo mount -o loop tiny.img /mnt/tiny
# Some inodes are already used by the FS itself (root dir, journal, etc.)
df -i /mnt/tiny
# Try to create files until we hit the inode limit
for i in $(seq 1 20); do
sudo touch /mnt/tiny/file_$i 2>&1
done
echo "--- after inode exhaustion ---"
df -i /mnt/tiny
df -h /mnt/tiny # space is still available!
sudo umount /mnt/tiny
rm -f tiny.imgQ1.1:
test.txtandhard.txtshare the same inode number. What is the link count shown bystat? Deletetest.txtwithsudo rm /mnt/orange/test.txt— doeshard.txtstill work? What happened to the link count? Why doesn't the data get freed?Q1.2:
soft.txthas a different inode number fromtest.txt. After deletingtest.txt, doescat /mnt/orange/soft.txtstill work? Why not? What's the fundamental difference between how the kernel resolves a hard link vs a symbolic link?Q1.3: In the inode exhaustion experiment, at what point did
touchstart failing? How much disk space was still free when that happened? Why can't ext4 just create more inodes on the fly?
Screenshot 1: ls -li showing inode numbers + link counts, and df -i /mnt/tiny before and after exhaustion.
When you call open(), the kernel doesn't just hand you the file. It walks the directory tree component by component, looks up each name in the parent directory's entries, follows the chain of inodes, and finally creates a file descriptor — an integer that's your process's handle to an open file. The kernel maintains three layers of bookkeeping:
Example matching the code below:
fd1 ──┐
├──▶ open file entry A: flags = O_RDWR, current offset = 0, refcount = 2 ──▶ inode #41523
fd3 ──┘
fd2 ─────▶ open file entry B: flags = O_RDONLY, current offset = 0, refcount = 1 ──▶ inode #41523
Per-process FD table:
- `fd1`, `fd2`, and `fd3` are just integer entries in the process's FD table
System-wide open file table:
- open file entry A is shared by `fd1` and `fd3` because `fd3 = dup(fd1)`
- open file entry B is separate because `fd2` came from a different `open()`
In-memory inode table:
- both open file entries still point to the same inode, because both refer to the same file
// fd_explore.c
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
int main(void) {
// Create a test file
int fd1 = open("/mnt/orange/fd_test.txt", O_CREAT | O_RDWR | O_TRUNC, 0644);
write(fd1, "ABCDEFGHIJ", 10);
// Two independent opens of the same file → two separate offsets
int fd2 = open("/mnt/orange/fd_test.txt", O_RDONLY);
// dup() shares the SAME open file entry → same offset
int fd3 = dup(fd1);
printf("fd1=%d fd2=%d fd3=%d\n\n", fd1, fd2, fd3);
// Read 3 bytes via fd1. Offset of fd1 (and fd3, since they share) advances.
char buf[16] = {0};
lseek(fd1, 0, SEEK_SET);
read(fd1, buf, 3);
printf("read(fd1, 3): \"%s\" → fd1 offset now at 3\n", buf);
// Read via fd3 — picks up where fd1 left off (offset = 3)
memset(buf, 0, sizeof(buf));
read(fd3, buf, 3);
printf("read(fd3, 3): \"%s\" → shared offset, continues from 3\n", buf);
// Read via fd2 — independent offset, starts at 0
memset(buf, 0, sizeof(buf));
read(fd2, buf, 3);
printf("read(fd2, 3): \"%s\" → independent open, starts at 0\n", buf);
// Show /proc/self/fd
printf("\n--- /proc/self/fd ---\n");
char cmd[128];
snprintf(cmd, sizeof(cmd), "ls -l /proc/%d/fd", getpid());
system(cmd);
close(fd1);
close(fd2);
close(fd3);
return 0;
}gcc -O0 -o fd_explore fd_explore.c
sudo ./fd_explore
# Use strace to see every syscall during a simple file operation
strace -e openat,read,write,close cat /mnt/orange/etc/hostname 2>&1Q2.1:
fd1andfd3share a file offset (becausefd3 = dup(fd1)), butfd2has its own. Afterread(fd1, 3), reading fromfd3returns "DEF" not "ABC". Why? Draw the three-layer diagram for this specific scenario.Q2.2: In the
straceoutput forcat /mnt/orange/etc/hostname, how manyread()calls doescatmake? What size buffer does it use? At what point does it see EOF (return value 0)?Q2.3: A file descriptor is just an integer index. If a process opens 1000 files, what limits prevent it from opening more? Run
ulimit -nto check. What happens if a server process hits this limit?
Screenshot 2: Output of ./fd_explore showing the three reads, and the strace output for cat.
A directory is just a file whose data contains a list of (name, inode) pairs. When you do ls, the kernel reads the directory's data blocks and parses these entries. ext4 uses a hash-tree (HTree) structure for large directories — essentially a B-tree keyed on filename hash — so that lookup doesn't degrade to a linear scan.
Simple directory (small, linear):
┌────────┬───────┬────────┬───────┐
│ "." │ ".. "│ "foo" │ "bar" │
│ ino=2 │ ino=2 │ ino=15 │ ino=23│
└────────┴───────┴────────┴───────┘
HTree directory (large, hashed):
┌──────────────────────┐
│ Root index block │
│ hash < 0x40 → blk 5 │
│ hash < 0x80 → blk 9 │
│ hash ≥ 0x80 → blk 12 │
└──────────────────────┘
│ │ │
▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐
│entries│ │entries│ │entries│ ← leaf blocks with actual (name, inode) pairs
└───────┘ └───────┘ └───────┘
# On our Alpine filesystem, let's look at directory sizes
sudo ls -lad /mnt/orange/etc /mnt/orange/usr
# Create a large directory and watch it grow
sudo mkdir -p /mnt/orange/bigdir
for i in $(seq 1 5000); do
sudo touch /mnt/orange/bigdir/file_$i
done
sudo ls -lad /mnt/orange/bigdir # directory size grew!
# Now time a lookup in a small vs large directory
echo "--- small directory (etc, ~100 entries) ---"
time sudo ls /mnt/orange/etc > /dev/null
echo "--- large directory (bigdir, 5000 entries) ---"
time sudo ls /mnt/orange/bigdir > /dev/null// readdir_raw.c — read directory entries using getdents64 syscall
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <string.h>
struct linux_dirent64 {
unsigned long long d_ino;
long long d_off;
unsigned short d_reclen;
unsigned char d_type;
char d_name[];
};
int main(int argc, char *argv[]) {
const char *dir = argc > 1 ? argv[1] : "/mnt/orange/etc";
int fd = open(dir, O_RDONLY | O_DIRECTORY);
if (fd < 0) { perror("open"); return 1; }
char buf[4096];
int count = 0;
int nread;
while ((nread = syscall(SYS_getdents64, fd, buf, sizeof(buf))) > 0) {
int pos = 0;
while (pos < nread) {
struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + pos);
if (count < 10) // print first 10 entries
printf(" inode=%-8llu type=%d reclen=%-4d name=\"%s\"\n",
d->d_ino, d->d_type, d->d_reclen, d->d_name);
pos += d->d_reclen;
count++;
}
}
printf(" ... total entries: %d\n", count);
close(fd);
return 0;
}gcc -O0 -o readdir_raw readdir_raw.c
sudo ./readdir_raw /mnt/orange/etc
sudo ./readdir_raw /mnt/orange/bigdirQ3.1: The directory
/mnt/orange/bigdirhas 5000 files inside it. What is the directory file's size (fromls -lad)? Why is it much larger than/mnt/orange/etc? A directory's size on disk grows as you add entries — but does it shrink if you delete entries? Trysudo rm /mnt/orange/bigdir/file_{1..4000}and check again.Q3.2: In the
readdir_rawoutput, each entry has ad_reclen(record length). Are all entries the same size? What determines the record length? Why would the kernel want variable-length records here instead of fixed-size ones?Q3.3: ext4 switches from linear directory entries to an HTree (hash-based) index when a directory grows beyond one block. Why is linear scan O(n) per lookup while HTree is O(1) amortized? For a directory with 100,000 files, roughly how many disk reads does each approach need to find one file?
Screenshot 3: Directory sizes for etc vs bigdir, and first 10 entries from readdir_raw.
How does a filesystem decide which blocks on disk hold a given file's data? Three classical strategies exist:
Contiguous: File occupies consecutive blocks.
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ A │ A │ A │ │ B │ B │ │ │
└───┴───┴───┴───┴───┴───┴───┴───┘
+ Fast sequential read (one seek)
- External fragmentation, can't grow easily
Linked: Each block has a pointer to the next.
┌───┐ ┌───┐ ┌───┐
│ A │──▶│ A │──▶│ A │──▶ NULL
└───┘ └───┘ └───┘
+ No external fragmentation, easy to grow
- Random access is O(n): to reach block k, follow k pointers
Indexed (ext4 extents): Inode holds an extent tree.
┌──────────────────────────────┐
│ Inode extent tree: │
│ [0-99] → disk blocks 500 │
│ [100-149]→ disk blocks 800 │
│ [150-151]→ disk block 1200 │
└──────────────────────────────┘
+ Fast random access, no external frag
- More metadata overhead
ext4 uses extents (a form of indexed allocation). Let's see it in action, then build a FAT (linked allocation) simulator:
# Create files of different sizes and check their extent layout
sudo dd if=/dev/urandom of=/mnt/orange/small.bin bs=4K count=1
sudo dd if=/dev/urandom of=/mnt/orange/medium.bin bs=4K count=100
sudo dd if=/dev/urandom of=/mnt/orange/large.bin bs=1M count=50
# filefrag shows the extent map — how many extents (fragments) each file has
sudo filefrag -v /mnt/orange/small.bin
sudo filefrag -v /mnt/orange/medium.bin
sudo filefrag -v /mnt/orange/large.bin// fat_sim.c — simulate FAT-style linked allocation and measure random access cost
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define DISK_BLOCKS 4096
#define FAT_FREE -1
#define FAT_EOF -2
int fat[DISK_BLOCKS]; // FAT table: fat[i] = next block, or EOF
// Allocate a file of n blocks using first-fit from FAT
int fat_alloc(int n) {
int first = -1, prev = -1;
int allocated = 0;
for (int i = 0; i < DISK_BLOCKS && allocated < n; i++) {
if (fat[i] == FAT_FREE) {
fat[i] = FAT_EOF; // tentatively mark as end
if (prev >= 0) fat[prev] = i;
if (first < 0) first = i;
prev = i;
allocated++;
}
}
if (allocated < n) { return -1; } // not enough space
return first;
}
// Random access: reach block at logical offset k in the chain
int fat_seek(int start, int k) {
int cur = start, hops = 0;
while (k > 0 && cur != FAT_EOF) {
cur = fat[cur];
k--;
hops++;
}
return hops;
}
int main(void) {
// Initialize FAT
for (int i = 0; i < DISK_BLOCKS; i++) fat[i] = FAT_FREE;
// Fragment the disk: allocate 64 small files, free every other one
int files[64];
for (int i = 0; i < 64; i++)
files[i] = fat_alloc(32); // 64 files × 32 blocks = 2048 blocks used
for (int i = 0; i < 64; i += 2) // free even-numbered files
for (int b = files[i]; b != FAT_EOF; ) {
int next = fat[b];
fat[b] = FAT_FREE;
b = next;
}
// Allocate a large file on the now-fragmented disk
int big_file = fat_alloc(512);
if (big_file < 0) { printf("Allocation failed!\n"); return 1; }
// Measure random access: seek to positions 0, 100, 200, 300, 400, 500
printf("%-12s %s\n", "Offset", "Hops (pointer follows)");
printf("----------------------------\n");
for (int off = 0; off <= 500; off += 100) {
int hops = fat_seek(big_file, off);
printf("%-12d %d\n", off, hops);
}
printf("\nIn a FAT, seeking to block N requires following N pointers.\n");
printf("ext4 extents: one tree lookup regardless of offset.\n");
return 0;
}gcc -O0 -o fat_sim fat_sim.c
./fat_simQ4.1: Look at the
filefragoutput forlarge.bin. How many extents does it have? If it's just 1 extent, the file is stored contiguously. If it's more than 1, the file is fragmented. Why might a freshly formatted filesystem produce fewer extents than a heavily used one?Q4.2: In the FAT simulator, seeking to offset 500 requires 500 hops. What is the time complexity of random access in FAT? Compare this to ext4 extents where an extent tree of depth 1 can address
4 × 340^1 = 1360extents — how many block reads does ext4 need for the same random seek?Q4.3: Contiguous allocation has the best read performance but worst flexibility. Linked (FAT) has no external fragmentation but O(n) seeks. Indexed (extents) balances both. Why did early systems like MS-DOS choose FAT despite its O(n) seek problem? (Think about the hardware constraints of the 1980s — small disks, small files, simple CPUs.)
Screenshot 4: filefrag -v output for the three files, and the FAT simulator output.
The filesystem needs to know which blocks are free and which are taken. ext4 divides the disk into block groups, and each block group has a block bitmap — one bit per block, 1 = used, 0 = free. Finding a free block means scanning this bitmap. The layout looks like this:
Block Group 0 Block Group 1 Block Group 2
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Superblock copy │ │ Superblock copy │ │ (no SB copy) │
│ GDT copy │ │ GDT copy │ │ │
│ Block Bitmap │ │ Block Bitmap │ │ Block Bitmap │
│ Inode Bitmap │ │ Inode Bitmap │ │ Inode Bitmap │
│ Inode Table │ │ Inode Table │ │ Inode Table │
│ Data Blocks │ │ Data Blocks │ │ Data Blocks │
└─────────────────┘ └─────────────────┘ └─────────────────┘
# See the block group layout of our filesystem
sudo dumpe2fs orange.img 2>/dev/null | head -60
# Focus on one block group's free block count
sudo dumpe2fs orange.img 2>/dev/null | grep -A2 "Group 0"// bitmap_sim.c — simulate bitmap-based free-space management
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define TOTAL_BLOCKS 256
#define BITS_PER_BYTE 8
unsigned char bitmap[TOTAL_BLOCKS / BITS_PER_BYTE]; // 32 bytes = 256 bits
static int is_free(int block) {
return !(bitmap[block / 8] & (1 << (block % 8)));
}
static void mark_used(int block) {
bitmap[block / 8] |= (1 << (block % 8));
}
static void mark_free(int block) {
bitmap[block / 8] &= ~(1 << (block % 8));
}
// First-fit: find first run of 'n' contiguous free blocks
static int first_fit(int n) {
int run = 0, start = -1;
for (int i = 0; i < TOTAL_BLOCKS; i++) {
if (is_free(i)) {
if (run == 0) start = i;
run++;
if (run == n) return start;
} else {
run = 0;
}
}
return -1;
}
// Best-fit: find smallest contiguous free run that fits 'n'
static int best_fit(int n) {
int best_start = -1, best_size = TOTAL_BLOCKS + 1;
int run = 0, start = -1;
for (int i = 0; i <= TOTAL_BLOCKS; i++) {
if (i < TOTAL_BLOCKS && is_free(i)) {
if (run == 0) start = i;
run++;
} else {
if (run >= n && run < best_size) {
best_size = run;
best_start = start;
}
run = 0;
}
}
return best_start;
}
static void print_bitmap(void) {
int free_count = 0;
for (int i = 0; i < TOTAL_BLOCKS; i++) {
if (i % 64 == 0) printf(" %3d: ", i);
printf("%c", is_free(i) ? '.' : '#');
if (is_free(i)) free_count++;
if ((i + 1) % 64 == 0) printf("\n");
}
printf(" Free: %d/%d blocks\n\n", free_count, TOTAL_BLOCKS);
}
int main(void) {
memset(bitmap, 0, sizeof(bitmap));
printf("=== Initial (all free) ===\n");
print_bitmap();
// Allocate files of various sizes
int sizes[] = {10, 5, 20, 8, 15, 30, 12};
int starts[7];
for (int f = 0; f < 7; f++) {
starts[f] = first_fit(sizes[f]);
if (starts[f] >= 0)
for (int i = 0; i < sizes[f]; i++)
mark_used(starts[f] + i);
}
printf("=== After 7 allocations ===\n");
print_bitmap();
// Free files 1, 3, 5 (create holes)
for (int f = 1; f <= 5; f += 2) {
for (int i = 0; i < sizes[f]; i++)
mark_free(starts[f] + i);
}
printf("=== After freeing files 1, 3, 5 (holes created) ===\n");
print_bitmap();
// Try to allocate 25 contiguous blocks with first-fit vs best-fit
int ff = first_fit(25);
int bf = best_fit(25);
printf("Allocating 25 contiguous blocks:\n");
printf(" First-fit: %s (start=%d)\n", ff >= 0 ? "SUCCESS" : "FAILED", ff);
printf(" Best-fit: %s (start=%d)\n", bf >= 0 ? "SUCCESS" : "FAILED", bf);
return 0;
}gcc -O0 -o bitmap_sim bitmap_sim.c
./bitmap_simQ5.1: In the bitmap visualization, what pattern do you see after freeing files 1, 3, 5? How many total free blocks exist vs the largest contiguous run? This is the same external fragmentation problem from Unit 3, but now at the filesystem level.
Q5.2: First-fit and best-fit gave different starting positions for the 25-block allocation. Which strategy leaves the larger remaining hole? Why might first-fit be preferred in practice despite not being "optimal"? (Think about scan time.)
Q5.3: Run
sudo dumpe2fs orange.img 2>/dev/null | grep "Free blocks". ext4 distributes free blocks across block groups. Why? What happens if all free blocks were concentrated in one group and a new file is created in a directory whose inode is in a different group?
Screenshot 5: Bitmap visualization output and dumpe2fs free block counts.
When multiple I/O requests are pending, the disk scheduler decides the order to service them. The classical model assumes a spinning hard disk: the disk arm moves to the requested cylinder, and the total distance traveled (in cylinders) determines performance. The goal is to minimize total seek distance.
For modern SSDs, this exact physical model no longer applies because there is no moving head and no cylinder seek. We still study these algorithms because they are the standard OS model for reasoning about scheduling tradeoffs such as throughput, fairness, and starvation. In this section, treat the cylinder numbers as abstract request positions, not as something you need to measure on your own machine.
FCFS: Service in arrival order. Simple, possibly terrible.
SSTF: Go to nearest pending request. Greedy, risk of starvation.
SCAN: Elevator — sweep one direction, reverse at the end.
C-SCAN: One-direction sweep, jump back to start without servicing.
Disk cylinders: 0 ──────────────────────────────────── 199
▲ ▲ ▲ ▲ ▲ ▲ ▲
Pending requests at various positions
// disk_sched.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAX_CYL 200
static int abs_val(int x) { return x < 0 ? -x : x; }
static int cmp_int(const void *a, const void *b) {
return *(int *)a - *(int *)b;
}
static int fcfs(int *req, int n, int head) {
int total = 0, pos = head;
printf(" FCFS: %d", pos);
for (int i = 0; i < n; i++) {
total += abs_val(req[i] - pos);
pos = req[i];
printf(" → %d", pos);
}
printf("\n");
return total;
}
static int sstf(int *req, int n, int head) {
int total = 0, pos = head;
int *done = calloc(n, sizeof(int));
printf(" SSTF: %d", pos);
for (int step = 0; step < n; step++) {
int best = -1, best_dist = MAX_CYL + 1;
for (int i = 0; i < n; i++) {
if (!done[i] && abs_val(req[i] - pos) < best_dist) {
best_dist = abs_val(req[i] - pos);
best = i;
}
}
done[best] = 1;
total += best_dist;
pos = req[best];
printf(" → %d", pos);
}
printf("\n");
free(done);
return total;
}
static int scan(int *req, int n, int head) {
int total = 0, pos = head;
int *sorted = malloc(n * sizeof(int));
memcpy(sorted, req, n * sizeof(int));
qsort(sorted, n, sizeof(int), cmp_int);
printf(" SCAN: %d", pos);
// Go right first
for (int i = 0; i < n; i++) {
if (sorted[i] >= pos) {
total += abs_val(sorted[i] - pos);
pos = sorted[i];
printf(" → %d", pos);
}
}
// Hit the end
total += abs_val((MAX_CYL - 1) - pos);
pos = MAX_CYL - 1;
printf(" → %d", pos);
// Reverse
for (int i = n - 1; i >= 0; i--) {
if (sorted[i] < head) {
total += abs_val(sorted[i] - pos);
pos = sorted[i];
printf(" → %d", pos);
}
}
printf("\n");
free(sorted);
return total;
}
static int cscan(int *req, int n, int head) {
int total = 0, pos = head;
int *sorted = malloc(n * sizeof(int));
memcpy(sorted, req, n * sizeof(int));
qsort(sorted, n, sizeof(int), cmp_int);
printf(" C-SCAN: %d", pos);
// Go right
for (int i = 0; i < n; i++) {
if (sorted[i] >= pos) {
total += abs_val(sorted[i] - pos);
pos = sorted[i];
printf(" → %d", pos);
}
}
// Jump to end, then to start
total += (MAX_CYL - 1 - pos) + (MAX_CYL - 1);
pos = 0;
printf(" → %d → %d", MAX_CYL - 1, 0);
// Service remaining from the start
for (int i = 0; i < n; i++) {
if (sorted[i] < head) {
total += abs_val(sorted[i] - pos);
pos = sorted[i];
printf(" → %d", pos);
}
}
printf("\n");
free(sorted);
return total;
}
int main(void) {
int requests[] = {98, 183, 37, 122, 14, 124, 65, 67};
int n = sizeof(requests) / sizeof(requests[0]);
int head = 53;
printf("Requests: ");
for (int i = 0; i < n; i++) printf("%d ", requests[i]);
printf("\nHead starts at: %d\n\n", head);
int f = fcfs(requests, n, head);
int s = sstf(requests, n, head);
int sc = scan(requests, n, head);
int c = cscan(requests, n, head);
printf("\n%-10s %s\n", "Algorithm", "Total Seek (cylinders)");
printf("────────────────────────────────\n");
printf("%-10s %d\n", "FCFS", f);
printf("%-10s %d\n", "SSTF", s);
printf("%-10s %d\n", "SCAN", sc);
printf("%-10s %d\n", "C-SCAN", c);
return 0;
}gcc -O0 -o disk_sched disk_sched.c
./disk_schedQ6.1: Which algorithm produced the smallest total seek distance? Which produced the largest? Does this match your expectation from the textbook?
Q6.2: SSTF is greedy — it always picks the nearest request. Under what request pattern would SSTF starve a request at one extreme of the disk? Why does SCAN avoid this problem?
Q6.3: Modern SSDs have no moving parts — seek time is essentially zero. Does disk scheduling still matter for SSDs? What becomes the bottleneck instead? (Think about the I/O queue depth and parallelism, not mechanical latency.)
Screenshot 6: Full output of ./disk_sched with all four algorithm traces and the summary table.
RAID (Redundant Array of Independent Disks) combines multiple disks for performance, redundancy, or both. Three levels matter:
RAID 0 (Striping): RAID 1 (Mirroring): RAID 5 (Striping + Distributed Parity):
┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
│ D0 │ │ D1 │ │ D0 │ │ D0 │ │ D0 │ │ D1 │ │ P0 │
│ D2 │ │ D3 │ │ D1 │ │ D1 │ │ D3 │ │ P1 │ │ D2 │
│ D4 │ │ D5 │ │ D2 │ │ D2 │ │ P2 │ │ D4 │ │ D5 │
└──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘
2× throughput 1 disk can fail 1 disk can fail
0 redundancy 50% capacity wasted only 1/N capacity wasted
RAID 5 parity works by XOR: P = D0 ⊕ D1. If any one disk dies, the missing data can be reconstructed by XORing the remaining disks.
// raid_sim.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define BLOCK_SIZE 8 // bytes per block (small for visibility)
#define NUM_STRIPES 4
#define NUM_DISKS 3 // RAID 5 with 3 disks
unsigned char disks[NUM_DISKS][NUM_STRIPES][BLOCK_SIZE];
static void print_disks(const char *label) {
printf("\n%s\n", label);
for (int d = 0; d < NUM_DISKS; d++) {
printf(" Disk %d: ", d);
for (int s = 0; s < NUM_STRIPES; s++) {
char tag;
// In RAID 5, parity rotates: stripe s has parity on disk (s % NUM_DISKS)
if (d == (s % NUM_DISKS))
tag = 'P';
else
tag = 'D';
printf("[%c:", tag);
for (int b = 0; b < BLOCK_SIZE; b++)
printf("%02x", disks[d][s][b]);
printf("] ");
}
printf("\n");
}
}
int main(void) {
srand(42);
// Write data blocks (skip parity disk for each stripe)
for (int s = 0; s < NUM_STRIPES; s++) {
int parity_disk = s % NUM_DISKS;
for (int d = 0; d < NUM_DISKS; d++) {
if (d != parity_disk) {
for (int b = 0; b < BLOCK_SIZE; b++)
disks[d][s][b] = rand() & 0xFF;
}
}
// Compute parity = XOR of all data blocks in this stripe
memset(disks[parity_disk][s], 0, BLOCK_SIZE);
for (int d = 0; d < NUM_DISKS; d++) {
if (d != parity_disk)
for (int b = 0; b < BLOCK_SIZE; b++)
disks[parity_disk][s][b] ^= disks[d][s][b];
}
}
print_disks("=== RAID 5 Array (healthy) ===");
// Simulate disk 1 failure
int failed_disk = 1;
unsigned char backup[NUM_STRIPES][BLOCK_SIZE];
memcpy(backup, disks[failed_disk], sizeof(backup)); // save for verification
memset(disks[failed_disk], 0, sizeof(disks[failed_disk]));
printf("\n*** DISK %d FAILED ***\n", failed_disk);
// Reconstruct disk 1 by XORing remaining disks
for (int s = 0; s < NUM_STRIPES; s++) {
for (int d = 0; d < NUM_DISKS; d++) {
if (d != failed_disk)
for (int b = 0; b < BLOCK_SIZE; b++)
disks[failed_disk][s][b] ^= disks[d][s][b];
}
}
print_disks("=== After reconstruction ===");
// Verify
int correct = (memcmp(backup, disks[failed_disk], sizeof(backup)) == 0);
printf("\nReconstruction %s\n", correct ? "CORRECT ✓" : "FAILED ✗");
// Capacity analysis
int total_blocks = NUM_DISKS * NUM_STRIPES;
int parity_blocks = NUM_STRIPES; // one parity per stripe
printf("\nCapacity: %d data + %d parity = %d total blocks\n",
total_blocks - parity_blocks, parity_blocks, total_blocks);
printf("Usable: %.0f%%\n", 100.0 * (total_blocks - parity_blocks) / total_blocks);
return 0;
}gcc -O0 -o raid_sim raid_sim.c
./raid_simQ7.1: Verify the reconstruction manually for stripe 0. Take the hex values from disk 0 and disk 2, XOR them byte by byte. Does the result match the reconstructed disk 1? Show at least the first 4 bytes of your work.
Q7.2: RAID 5 with 3 disks gives 66% usable capacity. What would it be with 5 disks? With 10? What's the formula for N disks? Why is RAID 5 more space-efficient than RAID 1 as you add disks?
Q7.3: What happens in RAID 5 if two disks fail simultaneously? Can the data be reconstructed? What RAID level handles double disk failures, and what's the tradeoff? (Look up RAID 6.)
Screenshot 7: Full RAID simulator output showing healthy array, failure, and successful reconstruction.
Swap space is disk area that acts as overflow for physical RAM. When the kernel needs a free frame and RAM is full, it writes (pages out) a victim page to swap and reclaims the frame. Linux supports two forms: swap partitions (dedicated disk partitions) and swap files (regular files used as swap).
# What swap is currently active?
swapon --show
free -m
# Create a swap file and examine its internal structure
dd if=/dev/zero of=myswap bs=1M count=64
mkswap myswap
# Look at what mkswap wrote — the swap header
xxd myswap | head -20 # first page is the header
xxd -s 4086 -l 10 myswap # magic string "SWAPSPACE2" near end of first page# Activate it (with lower priority than existing swap)
sudo chmod 600 myswap
sudo swapon --priority 5 myswap
swapon --show # now you should see two swap areas
# Write a program that forces swapping and observe which swap area gets used
cat << 'EOF' > swap_test.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#define MB (1024UL * 1024)
int main(void) {
// Allocate and touch 128 MB to create memory pressure
char *region = malloc(128 * MB);
if (!region) { perror("malloc"); return 1; }
memset(region, 'S', 128 * MB);
printf("Allocated 128 MB. PID: %d\n", getpid());
char cmd[128];
snprintf(cmd, sizeof(cmd),
"grep -E 'VmRSS|VmSwap' /proc/%d/status", getpid());
system(cmd);
printf("\nSwap usage:\n");
system("swapon --show");
free(region);
return 0;
}
EOF
gcc -O0 -o swap_test swap_test.c
./swap_test# When multiple swap areas exist, priority determines which one is used first
# Higher priority = used first
swapon --show # check the priorities
# Clean up
sudo swapoff myswap
rm -f myswap swap_test swap_test.cQ8.1: Look at the
xxdoutput of the swap header. Where is the "SWAPSPACE2" magic string located (exact byte offset)? Why does the swap area need a header at all — what information does the kernel need to store there?Q8.2: You have two swap areas with different priorities. If priority is the same, Linux stripes across them (round-robin). If priorities differ, it fills the higher-priority one first. Why would you want to put swap on a faster disk with higher priority?
Q8.3: Swap files and swap partitions serve the same purpose. What are the performance tradeoffs? A swap partition has a fixed, contiguous layout on disk. A swap file goes through the filesystem layer, so its blocks may not be contiguous. When would you prefer each?
Screenshot 8: mkswap header via xxd, swapon --show with both swap areas, and the swap_test output.
ext4 is a journaled filesystem. Before it modifies the actual data structures on disk (inodes, bitmaps, directory entries), it first writes a description of what it's about to do into a journal (a circular log area on the disk). If the system crashes mid-write, the journal tells fsck exactly what was in progress so it can either complete or roll back the operation. Without a journal, fsck has to scan the entire filesystem — which can take minutes on large disks.
Write sequence with journaling:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ 1. Write to │────▶│ 2. Commit │────▶│ 3. Write to │
│ journal │ │ (barrier) │ │ actual │
│ (intent) │ │ │ │ location │
└──────────────┘ └──────────────┘ └──────────────┘
│
┌──────▼──────┐
│ 4. Mark │
│ journal │
│ entry │
│ complete │
└─────────────┘
If crash happens at step 1-2: incomplete journal entry → discarded
If crash happens at step 3: journal is committed → replay on recovery
ext4 supports three journaling modes:
- journal — both data and metadata go through the journal (safest, slowest)
- ordered — only metadata is journaled, but data is flushed before metadata commit (default)
- writeback — only metadata is journaled, data can be written in any order (fastest, riskiest)
# Check the current journal mode on our filesystem
sudo dumpe2fs orange.img 2>/dev/null | grep -i "journal"
# Let's see the journal in action
# First, unmount and run fsck on a clean filesystem
sudo umount /mnt/orange
sudo fsck.ext4 -f orange.img # should report clean
# Now let's deliberately corrupt the filesystem and recover it
# Mount it again, write some files, then simulate a crash
sudo mount -o loop orange.img /mnt/orange
echo "important data" | sudo tee /mnt/orange/critical.txt
sudo sync # flush to disk
# Corrupt: overwrite a few bytes in the middle of the image
# This simulates a partial write / crash scenario
sudo umount /mnt/orange
dd if=/dev/urandom of=orange.img bs=1 count=512 seek=1048576 conv=notrunc
# Try to mount the corrupted filesystem
sudo mount -o loop orange.img /mnt/orange 2>&1
ls /mnt/orange 2>&1
sudo umount /mnt/orange 2>/dev/null
# Run fsck to repair
sudo fsck.ext4 -y orange.img
# Mount again and check what survived
sudo mount -o loop orange.img /mnt/orange
ls /mnt/orange
cat /mnt/orange/critical.txt 2>/dev/null# debugfs lets you inspect the raw filesystem structures
sudo debugfs -R "stats" orange.img 2>/dev/null | head -30
sudo debugfs -R "ls -l /" orange.img 2>/dev/null
sudo debugfs -R "stat /critical.txt" orange.img 2>/dev/nullQ9.1: What did
fsckreport? How many errors did it find and fix? Were any files lost tolost+found? Didcritical.txtsurvive the corruption?Q9.2: In
orderedmode (ext4 default), data is written before the metadata journal entry is committed. Why does this ordering matter? What could happen inwritebackmode if the system crashes after the metadata update is journaled but before the data blocks are written? (The file's inode says it has data, but the blocks contain garbage.)Q9.3: The journal itself occupies disk space (typically 128 MB on large filesystems).
debugfs -R "stats"shows the journal size. Is this wasted space? Consider: without a journal, a crash on a 2 TB filesystem could require scanning every inode and block — how long would that take at 100 MB/s read speed?
Screenshot 9: fsck repair output, debugfs stats showing journal info, and verification of file survival.
# Unmount the filesystem
sudo umount /mnt/orange 2>/dev/null
sudo rmdir /mnt/orange 2>/dev/null
# Remove tiny.img if still around
sudo umount /mnt/tiny 2>/dev/null
sudo rmdir /mnt/tiny 2>/dev/null
# Clean up all binaries, source files, and images
rm -f orange.img tiny.img alpine-minirootfs-*.tar.gz
rm -f fd_explore readdir_raw fat_sim bitmap_sim disk_sched raid_sim swap_test thrash
rm -f fd_explore.c readdir_raw.c fat_sim.c bitmap_sim.c disk_sched.c raid_sim.c swap_test.c
rm -f small.bin medium.bin large.bin myswap| # | What to capture |
|---|---|
| 0 | df, ls, and dumpe2fs summary of the Alpine filesystem |
| 1 | ls -li inode numbers + df -i before/after inode exhaustion |
| 2 | ./fd_explore output + strace of cat |
| 3 | Directory sizes + first 10 entries from readdir_raw |
| 4 | filefrag -v for three files + FAT simulator output |
| 5 | Bitmap visualization + dumpe2fs free block counts |
| 6 | ./disk_sched full output with all four algorithms |
| 7 | RAID simulator: healthy array, failure, reconstruction |
| 8 | xxd swap header + swapon --show + swap_test output |
| 9 | fsck repair output + debugfs stats + file survival check |
| Section | Questions |
|---|---|
| 0 | Q0.1 |
| 1 | Q1.1, Q1.2, Q1.3 |
| 2 | Q2.1, Q2.2, Q2.3 |
| 3 | Q3.1, Q3.2, Q3.3 |
| 4 | Q4.1, Q4.2, Q4.3 |
| 5 | Q5.1, Q5.2, Q5.3 |
| 6 | Q6.1, Q6.2, Q6.3 |
| 7 | Q7.1, Q7.2, Q7.3 |
| 8 | Q8.1, Q8.2, Q8.3 |
| 9 | Q9.1, Q9.2, Q9.3 |