Revisit ext4 image sizing + explore mke2fs alternatives #236

## Context

Surfaced while debugging the proof-of-pipeline CI workflow ([#233](https://github.com/CelestoAI/SmolVM/pull/233), failure on first run, fixed in [#235](https://github.com/CelestoAI/SmolVM/pull/235) by bumping openclaw rootfs from 2 GiB → 4 GiB).

Real-data measurement from the locally-built 4 GiB openclaw image:

```
Block size:               4 KiB
Block count:              1,048,576   (4 GiB)
Free blocks:                663,835
Reserved (5% for root):      52,428
FS overhead (journal+...):   36,936
Inodes used:                 65,597   (out of 262,144)
```

Real content: ~1.16 GiB. FS overhead: ~144 MiB. Reserved: ~205 MiB. **Total target: ~1.5 GiB.** A 2 GiB filesystem has the raw capacity but mke2fs's `-d` allocator fails reliably above ~70% utilization, so 2 GiB doesn't actually work in practice.
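As a quick check of the figures above, the block counts convert to MiB as follows. This is a minimal sketch that uses only the numbers quoted in the dump; the 1.16 GiB content figure is taken as stated rather than re-derived.

```python
BLOCK_SIZE = 4096          # bytes per block, from the dump above
MIB = 1024 * 1024
GIB = 1024 * MIB


def blocks_to_mib(blocks: int) -> float:
    """Convert a count of 4 KiB blocks to MiB."""
    return blocks * BLOCK_SIZE / MIB


reserved_mib = blocks_to_mib(52_428)   # reserved-for-root blocks
overhead_mib = blocks_to_mib(36_936)   # journal, inode tables, etc.
content_mib = 1.16 * 1024              # content figure as stated in the text

target_mib = content_mib + overhead_mib + reserved_mib
utilization_2gib = target_mib * MIB / (2 * GIB)

print(f"reserved:  {reserved_mib:.1f} MiB")
print(f"overhead:  {overhead_mib:.1f} MiB")
print(f"target:    {target_mib / 1024:.2f} GiB")
print(f"share of a 2 GiB image: {utilization_2gib:.0%}")
```

The target comes to roughly three quarters of a 2 GiB image, which is past the ~70% point where `mke2fs -d` was observed to fail.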

This worked when openclaw images were smaller; bit-rotted as Node 22 base + npm globals + apt deps grew. The fix in #235 unblocks CI but is a workaround — the underlying tool (`mke2fs -d`) is fragile and the sizing logic is static.

## What to revisit

### 1. Stop hardcoding sizes — dynamic sizing

Current pattern across all 5 builders: `rootfs_size_mb: int = <constant>`. These constants drift out of sync with content over time and we only notice when something fails.

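Better: measure the size of the tar export first (`du -sb`), then create the ext4 image at `max(min_size, content × 1.4)`. Note that a 1.4× factor puts content alone at about 71% of the image (1 / 1.4), which sits right at the allocator's ~70% failure threshold rather than well below it, and the journal, inode tables, and reserved blocks push utilization higher still. A factor nearer 1.6 (about 63%) leaves real margin. Either way, the size tracks content without manual tuning.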
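A minimal sketch of that sizing rule, assuming the builder can obtain the tar export's byte size before creating the image. The function names, the 512 MiB floor, and the 64 MiB rounding step are hypothetical, not taken from the existing builders; the multiplier is a parameter so it can be tuned against the observed failure threshold.

```python
from pathlib import Path

MIB = 1024 * 1024


def dynamic_rootfs_size_mb(
    content_bytes: int,
    min_size_mb: int = 512,    # hypothetical floor
    headroom_pct: int = 140,   # 140 means content x 1.4
    align_mb: int = 64,        # hypothetical rounding step
) -> int:
    """Return an image size in MiB: max(min_size, content x factor), rounded up."""
    if content_bytes < 0:
        raise ValueError("content_bytes must be non-negative")
    # Integer ceiling division avoids float rounding surprises.
    wanted_mb = -(-content_bytes * headroom_pct // (100 * MIB))
    size_mb = max(min_size_mb, wanted_mb)
    return -(-size_mb // align_mb) * align_mb


def content_share(content_bytes: int, size_mb: int) -> float:
    """Fraction of the image taken by content alone (ignores FS overhead)."""
    return content_bytes / (size_mb * MIB)


def tar_size_bytes(tar_path: Path) -> int:
    """Apparent size of the tar export, comparable to `du -sb` on one file."""
    return tar_path.stat().st_size


if __name__ == "__main__":
    content = 1000 * MIB
    for pct in (140, 160):
        size = dynamic_rootfs_size_mb(content, headroom_pct=pct)
        print(pct, size, f"{content_share(content, size):.0%}")
```

One caveat: a tar's size only approximates on-disk usage, since ext4 rounds every file up to whole blocks, so images with many small files may need a larger factor.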

### 2. Tweak `mkfs.ext4` flags

Quick wins available without changing the tool:

- **`-m 0`**: drop the 5% reserved-for-root. Pure waste for a single-user sandbox image. Recovers ~50 MiB per GiB of FS (~205 MiB on the 4 GiB image measured above).
- **`-T largefile`**: far fewer inodes, more blocks for data. Saves ~32 MiB of inode tables on a 2 GiB FS. Not safe for openclaw as-is: with the stock `mke2fs.conf`, `largefile` allocates one inode per 1 MiB, which is only ~4,096 inodes on a 4 GiB FS against the ~65k that openclaw uses (the default ratio gives 262,144). A milder ratio via `-i`, or an explicit count via `-N`, would be the workable variant.
- **`-O ^has_journal`**: drop the journal. Saves another ~32 MiB. **Unsafe in general** but defensible for a read-mostly published image.
- **`-E packed_meta_blocks=1`**: pack metadata for better locality. Might help allocator behavior, no real downside.

These could close the gap between "70% utilization fails" and "90% utilization succeeds" — buys headroom without changing the architecture.
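The flag combinations above could be assembled in one place, so each builder opts in per image. This is a sketch only: `mke2fs_argv` and its parameters are hypothetical names, and `-T largefile` is deliberately left out because it changes the inode budget and needs a per-image check.

```python
from typing import Optional


def mke2fs_argv(
    image: str,
    size_mb: int,
    source_dir: Optional[str] = None,
    reserved_pct: int = 0,
    journal: bool = True,
    packed_meta: bool = False,
) -> list[str]:
    """Build an mke2fs command line for an ext4 image file."""
    # -F lets mke2fs write to a regular file without prompting.
    argv = ["mke2fs", "-t", "ext4", "-F", "-m", str(reserved_pct)]
    if not journal:
        argv += ["-O", "^has_journal"]
    if packed_meta:
        argv += ["-E", "packed_meta_blocks=1"]
    if source_dir is not None:
        # Populate the new filesystem from a directory tree.
        argv += ["-d", source_dir]
    # Device (here an image file) comes first, then the size with an M suffix.
    argv += [image, f"{size_mb}M"]
    return argv


if __name__ == "__main__":
    print(" ".join(mke2fs_argv("rootfs.ext4", 2048, source_dir="./rootfs")))
```

Returning an argument list rather than running the command keeps the flag logic testable without e2fsprogs installed; the caller would pass the list to `subprocess.run(..., check=True)`.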

### 3. Build-then-shrink

Build a generously-sized FS (no allocator stress), populate it, then shrink it to the minimum with `resize2fs -M` and truncate the image file to match. Result: FS exactly the size needed, no manual tuning, no allocator failure mode. Costs an extra step in the builder.
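The shrink step might look like the sketch below. It assumes `e2fsck`, `resize2fs`, and `dumpe2fs` from e2fsprogs are on `PATH` and that the image is an unmounted regular file; `shrink_image` and `parse_fs_bytes` are hypothetical helper names.

```python
import os
import re
import subprocess


def parse_fs_bytes(dumpe2fs_header: str) -> int:
    """Filesystem size in bytes, from `dumpe2fs -h` output."""
    fields = dict(
        re.findall(
            r"^(Block count|Block size):\s+(\d+)\s*$",
            dumpe2fs_header,
            flags=re.MULTILINE,
        )
    )
    return int(fields["Block count"]) * int(fields["Block size"])


def shrink_image(image: str) -> int:
    """Shrink an ext4 image to its minimum size; return the new byte size."""
    # resize2fs wants a freshly checked filesystem before shrinking.
    # e2fsck exits 0 (clean) or 1 (errors corrected) on success.
    rc = subprocess.run(["e2fsck", "-f", "-y", image]).returncode
    if rc > 1:
        raise RuntimeError(f"e2fsck failed with exit code {rc}")
    subprocess.run(["resize2fs", "-M", image], check=True)
    header = subprocess.run(
        ["dumpe2fs", "-h", image], check=True, capture_output=True, text=True
    ).stdout
    size = parse_fs_bytes(header)
    # Cut the file to the filesystem size; a no-op if it already matches.
    os.truncate(image, size)
    return size
```

A minimum-size image has essentially no free blocks, so a published image that still writes SSH host keys at first boot would probably need to be grown back by a small margin after shrinking.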

### 4. Switch filesystems for published images

Real architectural alternative. Published images are **read-mostly** (writes happen at first boot for `/etc/ssh/ssh_host_*` and `/root/.ssh/authorized_keys`, then mostly nothing). Read-only FS options:

- **squashfs**: compressed read-only FS, in the mainline kernel since 2.6.29. Compresses well (~30-50% wire-size reduction on top of zstd-on-ext4). Boot path needs an overlay (overlayfs with a tmpfs upper layer) for writability.
- **erofs**: newer, faster, in the mainline kernel since 5.4 (when it left staging). Same overlay story.
- **dm-verity wrapping**: orthogonal but worth considering. It gives a verifiable read-only image where the kernel rejects tampering at runtime. Bonus security property for "we shipped this image and the user can prove it wasn't modified after the fact."

Cost: changes the rootfs boot story. The custom `/init` would need to mount the squashfs read-only, set up an overlay for writes, then continue. Non-trivial.
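To make the mount sequence concrete, the sketch below lists the commands such an `/init` would run, expressed as argument lists. The device path and mount points are hypothetical placeholders, and the real `/init` would likely be a shell script; this only shows the order of operations.

```python
def overlay_root_commands(
    squashfs_dev: str = "/dev/vda",   # hypothetical device holding the image
    lower: str = "/ro",               # hypothetical read-only mount point
    scratch: str = "/rw",             # hypothetical tmpfs for writes
    newroot: str = "/newroot",        # hypothetical merged root
) -> list[list[str]]:
    """Commands to mount squashfs read-only with a tmpfs-backed overlay."""
    # upperdir and workdir must live on the same filesystem (the tmpfs).
    upper = f"{scratch}/upper"
    work = f"{scratch}/work"
    return [
        ["mkdir", "-p", lower, scratch, newroot],
        ["mount", "-t", "squashfs", "-o", "ro", squashfs_dev, lower],
        ["mount", "-t", "tmpfs", "tmpfs", scratch],
        ["mkdir", "-p", upper, work],
        [
            "mount", "-t", "overlay", "overlay",
            "-o", f"lowerdir={lower},upperdir={upper},workdir={work}",
            newroot,
        ],
    ]


if __name__ == "__main__":
    for cmd in overlay_root_commands():
        print(" ".join(cmd))
```

With a tmpfs upper layer, anything written at first boot (such as the SSH host keys) disappears on reboot, so a persistent upper layer may be needed if those files must survive.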

### 5. Drop `mke2fs -d` entirely

Create an empty filesystem with `mkfs.ext4`, loop-mount it, and `cp -a` the content in. This takes a different allocator path (the kernel's ext4 driver, not mke2fs's userspace one), which may avoid the population-time allocator failures even at high utilization. We already have the loopfs helper for this on Linux.

## Recommendation order (cheapest → most architectural)

1. **Now (next PR after this stabilizes):** dynamic sizing (item 1 above) + `mkfs.ext4 -m 0` (item 2). One small builder change, eliminates the size-drift class of bug entirely.
2. **Later:** experiment with squashfs+overlayfs for published images specifically (item 4). Real wire-size win for distribution.
3. **Probably never:** the journal/inode flag tuning is fiddly per-image; not worth it if dynamic sizing handles the 90% case.

## References

- mke2fs(8): https://man7.org/linux/man-pages/man8/mke2fs.8.html
- ext4 reserved blocks behavior: `tune2fs -m 0`
- squashfs as initrd: https://www.kernel.org/doc/html/latest/filesystems/squashfs.html
- Originating PR: [#235](https://github.com/CelestoAI/SmolVM/pull/235)
