
bug: openshell sandbox create --from OOM-killed on large images #601

@minhdqdev


Agent Diagnostic

  • Investigated using the create-spike skill backed by principal-engineer-reviewer. Skills loaded: debug-openshell-cluster, openshell-cli, create-spike.
  • Traced sandbox create --from through run.rs → build.rs → push.rs
  • Confirmed three discrete full-image heap allocations in push_local_images() at push.rs:54–55
  • Verified bollard::upload_to_container accepts body_try_stream() — no API constraint blocks a streaming fix
  • Confirmed the tar crate size-in-header constraint requires a seekable (disk) intermediate; fully in-memory zero-copy is not feasible
  • Found tempfile is already a dev-dependency; a fix that spools to disk would need it promoted to a regular dependency
  • No test coverage exists for push.rs
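
The size-in-header constraint noted above follows from the tar format itself: each entry's 512-byte header records the payload length in an octal field that precedes the data on the wire, so a writer must know the full size before emitting any payload bytes (or seek back and patch the header — hence the seekable intermediate). A minimal std-only illustration of the header layout, not the tar crate's actual implementation:

```rust
/// Build a minimal 512-byte ustar-style header for an entry whose size is
/// known up front. Only the name, size, and checksum fields are filled in;
/// a real header also carries mode, uid/gid, mtime, and magic fields.
fn tar_header(name: &str, size: u64) -> [u8; 512] {
    let mut h = [0u8; 512];
    let n = name.len().min(100);
    h[..n].copy_from_slice(&name.as_bytes()[..n]);
    // The entry size lives at offset 124 as an 11-digit octal string plus
    // NUL. Because the header precedes the payload, the writer needs the
    // full size before streaming any data bytes.
    let oct = format!("{:011o}\0", size);
    h[124..136].copy_from_slice(oct.as_bytes());
    // The checksum is computed with its own field treated as eight spaces.
    h[148..156].copy_from_slice(b"        ");
    let sum: u32 = h.iter().map(|&b| u32::from(b)).sum();
    let chk = format!("{:06o}\0 ", sum);
    h[148..156].copy_from_slice(chk.as_bytes());
    h
}
```

An 11-digit octal size field tops out at 8 GiB − 1, comfortably above the 3.7 GB image here, so the constraint is purely about *when* the size is known, not whether it fits.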

Description

Running openshell sandbox create --from <dockerfile-dir> on a large sandbox image (~3.7 GB) causes the openshell process to be killed by the Linux OOM killer. The image export pipeline in crates/openshell-bootstrap/src/push.rs buffers the entire Docker image tar three times in memory before importing it into the gateway, producing ~11 GB peak allocation.

The three allocations in push_local_images():

  1. collect_export (push.rs:97–107) — streams docker.export_images() into a Vec<u8> (~3.7 GB)
  2. wrap_in_tar (push.rs:114–131) — copies that Vec<u8> into a second tar-wrapped Vec<u8> (~3.7 GB); both are live simultaneously, peak ~7.4 GB
  3. upload_archive (push.rs:135–151) — calls Bytes::copy_from_slice(archive) creating a third copy (~3.7 GB); all three overlap in scope, peak ~11 GB

Expected: the export → tar wrap → upload pipeline streams data in O(chunk) memory.
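
The expected behaviour can be sketched with std-only I/O: spool the export to a seekable temp file (so the tar wrap can learn the entry size from disk), then upload from disk in fixed-size chunks. The function names, spool path, and 64 KiB chunk size below are illustrative, not the actual push.rs API:

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Copy `src` into `dst` through a fixed-size buffer, so peak heap usage
/// is O(chunk) regardless of payload size.
fn chunked_copy<R: Read, W: Write>(src: &mut R, dst: &mut W) -> io::Result<u64> {
    let mut buf = [0u8; 64 * 1024]; // 64 KiB working buffer (illustrative size)
    let mut total = 0u64;
    loop {
        let n = src.read(&mut buf)?;
        if n == 0 {
            break;
        }
        dst.write_all(&buf[..n])?;
        total += n as u64;
    }
    Ok(total)
}

/// Pipeline shape: export stream -> spool file on disk -> chunked upload.
/// `export` stands in for the docker export stream and `upload` for the
/// gateway upload sink; both are hypothetical names for this sketch.
fn stream_pipeline<R: Read, W: Write>(export: &mut R, upload: &mut W) -> io::Result<u64> {
    let path = std::env::temp_dir().join("openshell-export.spool");
    let mut spool = File::create(&path)?; // replaces allocation 1's Vec<u8>
    let exported = chunked_copy(export, &mut spool)?;
    spool.flush()?;
    // The tar entry size is now known from the spool file's length, so the
    // tar wrap (allocation 2) could read from disk instead of copying a
    // second buffer; tar::Builder::append_file would consume the file here.
    let mut spool = File::open(&path)?;
    let len = spool.seek(SeekFrom::End(0))?;
    debug_assert_eq!(len, exported);
    spool.seek(SeekFrom::Start(0))?;
    let uploaded = chunked_copy(&mut spool, upload)?; // replaces allocation 3
    std::fs::remove_file(&path)?;
    Ok(uploaded)
}
```

With this shape, peak heap usage is one 64 KiB buffer instead of three concurrent whole-image copies; the cost is one extra ~3.7 GB write/read pass against disk.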

Reproduction Steps

  1. Build a large sandbox image (≥ 3 GB uncompressed):
    openshell sandbox create --from sandboxes/gemini
  2. Observe output:
    [progress] Exported 3745 MiB
    Killed
    
  3. Exit code is 137 (SIGKILL from OOM killer).

Environment

  • Image size: ~3.7 GB (base sandbox + @google/gemini-cli@0.34.0)
  • OS: Linux
  • Docker: Docker Engine 28.2.2
  • OpenShell: 0.0.0 (output of openshell --version)
  • Host RAM: 23 GB total, ~13 GB available at time of failure
  • Swap: 8 GB total, ~12 MB free at time of failure

Logs

Out of memory: Killed process (openshell) total-vm:19913320kB, anon-rss:13626992kB

Agent-First Checklist

  • I pointed my agent at the repo and had it investigate this issue
  • I loaded relevant skills (e.g., debug-openshell-cluster, debug-inference, openshell-cli)
  • My agent could not resolve this — the diagnostic above explains why

Metadata

  • Assignees: no one assigned
  • Labels: state:triage-needed (opened without agent diagnostics and needs triage)
  • Projects: none
  • Milestone: none
  • Relationships: none yet
  • Development: no branches or pull requests