
bug: openshell sandbox create --from OOM-killed on large images #602

@minhdqdev


Agent Diagnostic

  • Investigated using the create-spike skill backed by principal-engineer-reviewer. Skills loaded: debug-openshell-cluster, openshell-cli, create-spike.
  • Traced sandbox create --from through run.rs → build.rs → push.rs
  • Confirmed three discrete full-image heap allocations in push_local_images() at push.rs:54–55
  • Verified bollard::upload_to_container accepts body_try_stream() — no API constraint blocks a streaming fix
  • Confirmed the tar crate size-in-header constraint requires a seekable (disk) intermediate; fully in-memory zero-copy is not feasible
  • Found that tempfile is already a dev-dependency; it would need promotion to a regular dependency
  • No test coverage exists for push.rs

Description

Running openshell sandbox create --from <dockerfile-dir> on a large sandbox image (~3.7 GB) causes the openshell process to be killed by the Linux OOM killer. The image export pipeline in crates/openshell-bootstrap/src/push.rs buffers the entire Docker image tar three times in memory before importing it into the gateway, producing ~11 GB peak allocation.

The three allocations in push_local_images():

  1. collect_export (push.rs:97–107) — collects the docker.export_images() stream into a Vec<u8> (~3.7 GB)
  2. wrap_in_tar (push.rs:114–131) — copies that Vec<u8> into a second tar-wrapped Vec<u8> (~3.7 GB); both are live simultaneously, peak ~7.4 GB
  3. upload_archive (push.rs:135–151) — calls Bytes::copy_from_slice(archive) creating a third copy (~3.7 GB); all three overlap in scope, peak ~11 GB

Expected: the export → tar-wrap → upload pipeline should stream data in O(chunk) memory rather than holding full copies of the image.

Reproduction Steps

  1. Build a large sandbox image (≥ 3 GB uncompressed):
    openshell sandbox create --from sandboxes/gemini
  2. Observe output:
    [progress] Exported 3745 MiB
    Killed
    
  3. Exit code is 137 (SIGKILL from OOM killer).

Environment

  • Image size: ~3.7 GB (base sandbox + @google/gemini-cli@0.34.0; community image OpenShell-Community#51, "feat: add Gemini CLI support")
  • OS: Linux
  • Docker: Docker Engine 28.2.2
  • OpenShell: 0.0.0 (output of openshell --version)
  • Host RAM: 23 GB total, ~13 GB available at time of failure
  • Swap: 8 GB total, ~12 MB free at time of failure

Logs

Out of memory: Killed process (openshell) total-vm:19913320kB, anon-rss:13626992kB

Agent-First Checklist

  • I pointed my agent at the repo and had it investigate this issue
  • I loaded relevant skills (e.g., debug-openshell-cluster, debug-inference, openshell-cli)
  • My agent could not resolve this — the diagnostic above explains why
