Revisit ext4 image sizing + explore mke2fs alternatives #236

@aniketmaurya

Context

Surfaced while debugging the proof-of-pipeline CI workflow (#233, failure on first run, fixed in #235 by bumping openclaw rootfs from 2 GiB → 4 GiB).

Real-data measurement from the locally-built 4 GiB openclaw image:

```
Block size: 4 KiB
Block count: 1,048,576 (4 GiB)
Free blocks: 663,835
Reserved (5% for root): 52,428
FS overhead (journal+...): 36,936
Inodes used: 65,597 (out of 262,144)
```

Real content: ~1.16 GiB. FS overhead: ~144 MiB. Reserved: ~205 MiB. Total target: ~1.5 GiB. A 2 GiB filesystem has the raw capacity, but ~1.5 GiB of 2 GiB is ~75% utilization and mke2fs's -d allocator fails reliably above ~70%, so 2 GiB doesn't actually work in practice.
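The block counts above convert to the quoted sizes as follows. This is only a sanity check of the arithmetic; treating "used" as total minus free blocks is an assumption about how the figures were derived.

```python
BLOCK_SIZE = 4096  # bytes, from "Block size: 4 KiB"
MIB = 1024 ** 2
GIB = 1024 ** 3


def blocks_to_mib(blocks: int) -> float:
    """Convert a count of 4 KiB blocks to MiB."""
    return blocks * BLOCK_SIZE / MIB


total_blocks = 1_048_576
free_blocks = 663_835
reserved_blocks = 52_428
overhead_blocks = 36_936

# Assumption: "used" means every block that is not free.
used_blocks = total_blocks - free_blocks

print(f"overhead: {blocks_to_mib(overhead_blocks):.0f} MiB")
print(f"reserved: {blocks_to_mib(reserved_blocks):.0f} MiB")
print(f"used:     {used_blocks * BLOCK_SIZE / GIB:.2f} GiB")
print(f"share of a 2 GiB image: {used_blocks * BLOCK_SIZE / (2 * GIB):.0%}")
```

This prints roughly 144 MiB, 205 MiB, 1.47 GiB and 73%, which lines up with the ~1.5 GiB total and puts the old 2 GiB image above the ~70% mark.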

This worked when openclaw images were smaller; bit-rotted as Node 22 base + npm globals + apt deps grew. The fix in #235 unblocks CI but is a workaround — the underlying tool (mke2fs -d) is fragile and the sizing logic is static.

What to revisit

1. Stop hardcoding sizes — dynamic sizing

Current pattern across all 5 builders: rootfs_size_mb: int = <constant>. These constants drift out of sync with content over time and we only notice when something fails.

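Better: measure the tar export's du -sb size first, then create the ext4 at max(min_size, content × 1.4). Caveat on the multiplier: 1.4× puts content alone at 1/1.4 ≈ 71% of the image, which is at the allocator's ~70% failure threshold rather than well below it, and FS overhead plus reserved blocks come on top. The factor probably needs to be higher (1.6× gives ~63%) or paired with -m 0, and should be tuned against real builds.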
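A minimal sketch of the sizing rule, assuming the content has been unpacked to a directory. The function names and the `headroom_percent` parameter are hypothetical, not the builders' existing API; the headroom is an integer percent so the rounding is exact.

```python
import os

MIB = 1024 ** 2


def content_bytes(root: str) -> int:
    """Sum apparent file sizes under root (roughly what `du -sb` reports)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # lstat: count a symlink's own size, do not follow it.
            total += os.lstat(os.path.join(dirpath, name)).st_size
    return total


def rootfs_size_mb(content: int, min_size_mb: int, headroom_percent: int = 140) -> int:
    """Return max(min_size, content * headroom), rounded up to whole MiB."""
    scaled = content * headroom_percent
    needed_mb = -(-scaled // (100 * MIB))  # ceiling division
    return max(min_size_mb, needed_mb)
```

For example, `rootfs_size_mb(content_bytes("/tmp/rootfs"), min_size_mb=512)` would replace a hardcoded constant; the path and the 512 MiB floor are placeholders.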

2. Tweak mkfs.ext4 flags

Quick wins available without changing the tool:

  • -m 0 — drop the 5% reserved-for-root. Pure waste for a single-user sandbox image. Recovers ~50 MiB on a 1 GiB FS.
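  • -T largefile: fewer inodes, more blocks for data. Saves ~32 MiB on a 2 GiB FS. Caution: with the stock mke2fs.conf, largefile means one inode per 1 MiB, i.e. only ~2,048 inodes on a 2 GiB FS. openclaw uses ~65k inodes, so largefile as-is would not fit it; the default ratio is fine, and a custom -i bytes-per-inode value would be the way to trim inodes safely.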
  • -O ^has_journal — drop the journal. Saves another ~32 MiB. Unsafe in general but defensible for a read-mostly published image.
  • -E packed_meta_blocks=1 — pack metadata for better locality. Might help allocator behavior, no real downside.

These could close the gap between "70% utilization fails" and "90% utilization succeeds" — buys headroom without changing the architecture.
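The flags above could be assembled along these lines. This is a sketch: `mke2fs_argv` and its keyword arguments are hypothetical names, and it assumes an e2fsprogs version that supports `-d` and accepts a size suffix such as `4096M`.

```python
def mke2fs_argv(
    image: str,
    rootdir: str,
    size_mb: int,
    *,
    drop_reserved: bool = True,
    drop_journal: bool = False,
    pack_metadata: bool = False,
) -> list[str]:
    """Build an mke2fs command line that populates `image` from `rootdir`."""
    argv = ["mke2fs", "-t", "ext4", "-d", rootdir]
    if drop_reserved:
        argv += ["-m", "0"]  # no blocks reserved for root
    if drop_journal:
        argv += ["-O", "^has_journal"]  # unsafe in general, see above
    if pack_metadata:
        argv += ["-E", "packed_meta_blocks=1"]
    argv += [image, f"{size_mb}M"]
    return argv
```

Keeping the riskier flags off by default lets `-m 0` ship first while the journal and metadata options stay opt-in per image.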

3. Build-then-shrink

Build a generously-sized FS (no allocator stress), populate, then resize2fs down to the minimum. Result: FS exactly the size needed, no manual tuning, no allocator failure mode. Costs an extra step in the builder.
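A rough sketch of the shrink step, assuming the image is an unmounted regular file. The helper names are hypothetical; `e2fsck -f`, `resize2fs -M` and `dumpe2fs -h` are the standard e2fsprogs commands.

```python
import re


def shrink_commands(image: str) -> list[list[str]]:
    """Commands to shrink an unmounted ext4 image to its minimum size."""
    return [
        ["e2fsck", "-f", "-y", image],  # resize2fs wants a freshly checked FS
        ["resize2fs", "-M", image],     # -M: shrink to the minimum size
    ]


def fs_size_bytes(dumpe2fs_header: str) -> int:
    """Compute the FS size from `dumpe2fs -h` output (block count * block size)."""
    fields = dict(
        re.findall(r"^(Block count|Block size):\s+(\d+)$", dumpe2fs_header, re.M)
    )
    return int(fields["Block count"]) * int(fields["Block size"])
```

`fs_size_bytes` can be used afterwards to confirm the result or to truncate the image file if it is still larger than the filesystem. Since `-M` leaves almost no free space and these images still write at first boot, some slack probably has to be added back after shrinking.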

4. Switch filesystems for published images

Real architectural alternative. Published images are read-mostly (writes happen at first boot for /etc/ssh/ssh_host_* and /root/.ssh/authorized_keys, then mostly nothing). Read-only FS options:

  • squashfs: compressed read-only FS. In the mainline kernel since Linux 2.6.29. Compresses well (~30-50% wire-size reduction on top of zstd-on-ext4). Boot path needs an overlay (overlayfs on tmpfs) for writability.
  • erofs: newer, faster, in mainline (out of staging) since Linux 5.4. Same overlay story.
  • dm-verity wrapping — orthogonal but worth considering: verifiable read-only image where the kernel rejects tampering at runtime. Bonus security property for "we shipped this image and the user can prove it wasn't modified after the fact."

Cost: changes the rootfs boot story. The custom /init would need to mount the squashfs read-only, set up an overlay for writes, then continue. Non-trivial.
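To make the boot-path change concrete, here is the mount sequence such an /init would need, written as a command plan. This is a sketch only: the mount points, the device argument and the function name are hypothetical, and the real /init may not be written in Python at all.

```python
def overlay_root_plan(
    squashfs_dev: str,
    ro: str = "/ro",
    rw: str = "/rw",
    newroot: str = "/newroot",
) -> list[list[str]]:
    """Mount commands for a read-only squashfs root with a tmpfs overlay."""
    upper, work = f"{rw}/upper", f"{rw}/work"
    opts = f"lowerdir={ro},upperdir={upper},workdir={work}"
    return [
        ["mount", "-t", "squashfs", "-o", "ro", squashfs_dev, ro],
        ["mount", "-t", "tmpfs", "tmpfs", rw],
        ["mkdir", "-p", upper, work],  # upperdir and workdir must share a FS
        ["mount", "-t", "overlay", "overlay", "-o", opts, newroot],
    ]
```

After these steps /init would switch root into `newroot`. With a tmpfs upper layer the first-boot writes (SSH host keys, authorized_keys) are lost on reboot, so a persistent upper layer may be needed if those must survive.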

5. Drop mke2fs -d entirely

Build empty mkfs.ext4, loop-mount, cp -a content in. Different allocator path (kernel's ext4 driver, not mke2fs's userspace one) — may avoid the population-time allocator failures even at high utilization. We already have the loopfs helper for this on Linux.
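The loop-mount route amounts to the following steps. This is a sketch with a hypothetical function name: it does not use the existing loopfs helper (whose API is not shown here), and the mount and umount steps need root on Linux.

```python
def loop_populate_plan(
    image: str, rootdir: str, mountpoint: str, size_mb: int
) -> list[list[str]]:
    """Commands to create an empty ext4 image and copy content in via the kernel."""
    return [
        ["truncate", "-s", f"{size_mb}M", image],  # sparse file of the target size
        ["mkfs.ext4", "-F", "-q", image],          # -F: target is a regular file
        ["mount", "-o", "loop", image, mountpoint],
        ["cp", "-a", f"{rootdir}/.", mountpoint],  # "/." also copies dotfiles
        ["umount", mountpoint],
    ]
```

Each entry can be passed to `subprocess.run(cmd, check=True)` in order. Running the same content through this path and through mke2fs -d at the same image size would show whether the kernel allocator really copes better at high utilization.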

Recommendation order (cheapest → most architectural)

  1. Now (next PR after this stabilizes): dynamic sizing (item 1 above) + mkfs.ext4 -m 0 (item 2). One small builder change, eliminates the size-drift class of bug entirely.
  2. Later: experiment with squashfs+overlayfs for published images specifically (item 4). Real wire-size win for distribution.
  3. Probably never: the journal/inode flag tuning is fiddly per-image; not worth it if dynamic sizing handles the 90% case.
