
[infra] h200-nb_0 / h200-nb_1: enroot /mnt/image-storage full — can't unpack TRT-LLM v1.3.0rc14 image #1496

@functionstackx

Summary

Two H200 self-hosted runners — h200-nb_0 and h200-nb_1 — are unable to unpack newer TensorRT-LLM container images because their /mnt/image-storage/enroot/data/ partition is full. Every sweep job that lands on either runner fails identically:

enroot-mount: failed to create directory:
  /mnt/image-storage/enroot/data/pyxis_nvcr.io_nvidia_tensorrt-llm_release_1.3.0rc14-gharunner/var/run:
  No space left on device

…and pyxis: failed to create container filesystem during the squashfs extraction step. The benchmark script never runs.

As a temporary workaround, the h200 SLURM partition tag has been removed from these two nodes, so they currently pick up no jobs at all. They should go back into service once the disk is freed.

How we got here

The recently tagged nvcr.io/nvidia/tensorrt-llm/release:v1.3.0rc14 image is significantly larger than v1.1.0rc2.post2 (it bundles Python 3.12 + new CUDA libs). Old enroot caches haven't been pruned, so /mnt/image-storage filled up while extracting the new image.
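To confirm this on a node, it helps to see how full the partition is and which extracted containers take the most space. The following is a minimal sketch, not an established tool: the helper names used_pct and largest_entries are made up for illustration, and it assumes GNU coreutils and bash.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of enroot or pyxis.

# Print the usage percentage (integer, no % sign) of the filesystem holding $1.
used_pct() {
    df -P "$1" | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Print the N largest top-level entries under a directory as
# "<size in KiB> <path>", biggest first (N defaults to 10).
largest_entries() {
    local data_dir=$1 count=${2:-10}
    du -s -k -- "$data_dir"/* 2>/dev/null | sort -rn | head -n "$count"
}
```

On an affected node this would be called as used_pct /mnt/image-storage and largest_entries /mnt/image-storage/enroot/data. Both commands are read-only.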

Confirmed on these PRs (where every disk-related failure landed on h200-nb_* and every success landed on h200-dgxc-slurm_* or h200-cw_01):

  • #1491 gptoss-fp4-h200-trt: v1.3.0rc11 → v1.3.0rc14 (11/12 failures = disk; 1 = stale port-8888 leak on h200-cw_01)
  • #1487 dsr1-fp8-h200-trt (+mtp): v1.1.0rc2.post2 → v1.3.0rc14 (12/12 failures = disk)

Failed CI runs:

Suggested SRE fix

Any of:

  1. Prune the enroot image cache on both nodes:
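    ssh root@h200-nb_0  # and h200-nb_1
    enroot list  # see what's still there
    # nuclear option, drops everything; enroot remove expects container
    # names, so pass them explicitly rather than a quoted '*'
    enroot list | xargs -r enroot remove --force
    # OR: rm -rf /mnt/image-storage/enroot/data/<stale-pyxis-dirs>
    df -h /mnt/image-storage  # confirm space was freed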
  2. Expand /mnt/image-storage on these nodes — they're chronically near-full since the v1.1 → v1.3 image sizes diverged.
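  3. Add a periodic cron job or systemd timer that prunes pyxis directories older than ~7 days (e.g. find /mnt/image-storage/enroot/data -mindepth 1 -maxdepth 1 -type d -name 'pyxis_*' -mtime +7 -exec rm -rf {} +). The -mindepth 1 keeps the data directory itself out of the match set, and the -name filter limits deletion to pyxis-created containers.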

Once any of those is done, re-add the h200 SLURM partition tag and the affected PRs can be re-swept.
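The periodic prune from option 3 could be sketched as a small bash function that defaults to a dry run. This is an illustration under a few assumptions, not a vetted SRE script. The function name prune_stale is hypothetical. The pyxis_ prefix is taken from the path in the error message above. Directory mtime is only a rough proxy for "last used", so a long-running job's container could look stale.

```bash
#!/usr/bin/env bash
# prune_stale DATA_DIR [MAX_AGE_DAYS] [--delete]
# Prints top-level pyxis_* directories under DATA_DIR that match
# `find -mtime +MAX_AGE_DAYS` (default 7). Nothing is removed unless
# --delete is passed as the third argument.
prune_stale() {
    local data_dir=$1 max_age_days=${2:-7} mode=${3:-}

    # Refuse to run against a missing path so a typo cannot redirect find.
    [ -d "$data_dir" ] || { echo "no such directory: $data_dir" >&2; return 1; }

    local action=(-print)
    if [ "$mode" = "--delete" ]; then
        action=(-print -exec rm -rf -- {} +)
    fi

    # -mindepth 1 keeps DATA_DIR itself out of the match set.
    find "$data_dir" -mindepth 1 -maxdepth 1 -type d -name 'pyxis_*' \
        -mtime +"$max_age_days" "${action[@]}"
}
```

Running prune_stale /mnt/image-storage/enroot/data 7 first shows what would go. Adding --delete as the third argument performs the removal. It is probably safest to run this only while the node is drained, so no active job loses its container filesystem.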

Bonus issue on h200-cw_01

A separate one-off failure on h200-cw_01 during the same #1491 sweep was an "Address already in use" error on port 8888, left over from a previous trtllm-serve that didn't shut down cleanly. Worth a kill pass on that node too.
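A pre-flight check for the stale listener might look like the sketch below. The helper name port_in_use is hypothetical. It relies on bash's /dev/tcp redirection, so it only detects listeners reachable on 127.0.0.1 and needs a bash built with network redirections.

```bash
#!/usr/bin/env bash
# port_in_use PORT
# Succeeds (exit 0) if something accepts TCP connections on 127.0.0.1:PORT.
port_in_use() {
    local port=$1
    # The subshell opens the socket on fd 3 and closes it again on exit;
    # a refused connection makes the redirection fail with a non-zero status.
    (exec 3<>"/dev/tcp/127.0.0.1/$port") 2>/dev/null
}
```

If port_in_use 8888 succeeds before a job starts, running ss -ltnp 'sport = :8888' as root shows the owning PID, which can then be killed. That the owner is a leftover trtllm-serve process is taken from this report and should be verified before killing anything.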

cc @sre / @platform
