Summary
Two H200 self-hosted runners — h200-nb_0 and h200-nb_1 — are unable to unpack newer TensorRT-LLM container images because their /mnt/image-storage/enroot/data/ partition is full. Every sweep job that lands on either runner fails identically:
enroot-mount: failed to create directory:
/mnt/image-storage/enroot/data/pyxis_nvcr.io_nvidia_tensorrt-llm_release_1.3.0rc14-gharunner/var/run:
No space left on device
…and pyxis: failed to create container filesystem during the squashfs extraction step. The benchmark script never runs.
As a temporary workaround, the h200 SLURM partition tag has been removed from these two nodes, so they currently can't pick up jobs at all. They should be put back into service once the disk is freed.
How we got here
The recently-tagged nvcr.io/nvidia/tensorrt-llm/release:v1.3.0rc14 image is significantly larger than v1.1.0rc2.post2 (it bundles Python 3.12 + new CUDA libs). Old enroot caches haven't been pruned, so /mnt/image-storage filled up trying to extract the new image.
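To confirm which extracted images are eating the space before deleting anything, something like the following could be run on each node. This is a rough sketch, not a tested SRE tool: largest_entries is a hypothetical helper name, and it assumes GNU find/xargs/du are available on the runners.

```shell
#!/usr/bin/env bash
# Sketch: show the biggest top-level entries under an enroot data directory.
# largest_entries is a hypothetical helper, not part of enroot or pyxis.

# largest_entries DIR [COUNT]
# Prints "SIZE_KiB<TAB>PATH" for the COUNT largest immediate children of DIR,
# largest first.
largest_entries() {
    local dir="$1" count="${2:-10}"
    # -s: one total per entry, -k: KiB units so sort -n works,
    # -x: do not cross into other filesystems.
    find "$dir" -mindepth 1 -maxdepth 1 -print0 \
        | xargs -0 -r du -skx 2>/dev/null \
        | sort -rn \
        | head -n "$count"
}
```

Example: largest_entries /mnt/image-storage/enroot/data 20, followed by df -h /mnt/image-storage to see how much of the partition those entries account for.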
Confirmed on these PRs (where every disk failure landed on h200-nb_* and every success landed on h200-dgxc-slurm_* or h200-cw_01):
- #1491: gptoss-fp4-h200-trt v1.3.0rc11 → v1.3.0rc14 (11/12 failures = disk; 1 = stale port-8888 leak on h200-cw_01)
- #1487: dsr1-fp8-h200-trt (+mtp) v1.1.0rc2.post2 → v1.3.0rc14 (12/12 failures = disk)
Failed CI runs:
Suggested SRE fix
Any of:
- Prune the enroot image cache on both nodes:
ssh root@h200-nb_0 # and h200-nb_1
enroot list # see what's still there
enroot list | xargs -r enroot remove --force # nuclear option, drops everything (names passed explicitly rather than relying on enroot expanding a quoted '*')
# OR: rm -rf /mnt/image-storage/enroot/data/<stale-pyxis-dirs>
df -h /mnt/image-storage
- Expand /mnt/image-storage on these nodes; they have been chronically near-full since the v1.1 → v1.3 image sizes diverged.
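- Add a periodic cron job or systemd timer that prunes pyxis directories older than ~7d (e.g. find /mnt/image-storage/enroot/data -mindepth 1 -maxdepth 1 -type d -name 'pyxis_*' -mtime +7 -exec rm -rf {} +). The -mindepth 1 keeps find from ever matching the data directory itself, and -name 'pyxis_*' limits deletion to pyxis-created rootfs directories.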
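If the periodic prune route is taken, a dry-run mode makes it easier to review what would go before anything is deleted. A minimal sketch follows; prune_stale is a hypothetical helper, the pyxis_ prefix is taken from the directory name in the error above, and it assumes GNU find. One caveat: a directory's mtime only changes when entries directly inside it are added or removed, so it is a rough proxy for "last used", and this should only run when no job on the node is using these containers.

```shell
#!/usr/bin/env bash
# Sketch: prune stale pyxis rootfs directories, dry run by default.
# prune_stale is a hypothetical helper, not part of enroot or pyxis.

# prune_stale DIR DAYS [--delete]
# Without --delete, prints pyxis_* directories directly under DIR whose mtime
# is more than DAYS days old. With --delete, removes them.
prune_stale() {
    local dir="$1" days="$2" mode="${3:-}"
    # Refuse to run against an empty or missing path.
    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        echo "prune_stale: not a directory: $dir" >&2
        return 1
    fi
    if [ "$mode" = "--delete" ]; then
        find "$dir" -mindepth 1 -maxdepth 1 -type d -name 'pyxis_*' \
            -mtime "+$days" -exec rm -rf -- {} +
    else
        find "$dir" -mindepth 1 -maxdepth 1 -type d -name 'pyxis_*' \
            -mtime "+$days" -print
    fi
}
```

Example: prune_stale /mnt/image-storage/enroot/data 7 to review the list, then the same command with --delete appended once the list looks right.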
Once any of those is done, re-add the h200 SLURM partition tag and the affected PRs can be re-swept.
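Before putting the nodes back, it may be worth gating on free space so they don't immediately fail the same way. A possible check is sketched below; has_free_space is a hypothetical helper, and the threshold in the example is a placeholder, since the actual extracted size of the rc14 image isn't recorded in this issue.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if the filesystem holding PATH has enough room.
# has_free_space is a hypothetical helper; pick MIN_GIB to suit the image size.

# has_free_space PATH MIN_GIB
# Exit 0 if at least MIN_GIB GiB are available, 1 if not,
# 2 if the available space could not be determined.
has_free_space() {
    local path="$1" min_gib="$2" avail_kb
    # df -P gives one line per filesystem; -k reports 1024-byte blocks.
    # Column 4 of the data line is the available space.
    avail_kb=$(df -Pk "$path" 2>/dev/null | awk 'NR==2 {print $4}')
    [ -n "$avail_kb" ] || return 2
    [ "$avail_kb" -ge $(( min_gib * 1024 * 1024 )) ]
}
```

Example (100 is an arbitrary placeholder): has_free_space /mnt/image-storage 100 && echo "ok to re-add to the h200 partition".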
Bonus issue on h200-cw_01
A separate one-off failure on h200-cw_01 during the same #1491 sweep was an Address already in use on port 8888 — left over from a previous trtllm-serve that didn't shut down cleanly. Worth a kill pass on that node too.
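For the port leak, the leftover process can usually be found from the listening socket. The sketch below assumes ss from iproute2 is installed on the node and is run as root (ss only shows process details for processes the caller is allowed to inspect). Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: find PIDs of processes listening on a TCP port.
# pids_from_ss and listeners_on_port are hypothetical helper names.

# pids_from_ss
# Reads `ss -ltnp` output on stdin and prints the unique PIDs it mentions,
# one per line, in numeric order.
pids_from_ss() {
    grep -oE 'pid=[0-9]+' | cut -d= -f2 | sort -un
}

# listeners_on_port PORT
# Prints the PIDs of processes with a listening TCP socket on PORT.
listeners_on_port() {
    local port="$1"
    # -H: no header, -l: listening, -t: TCP, -n: numeric, -p: show processes.
    ss -H -ltnp "sport = :$port" | pids_from_ss
}
```

Example: listeners_on_port 8888 on h200-cw_01, check each PID with ps -fp PID to confirm it is the stale trtllm-serve, then kill it and re-run the check.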
cc @sre / @platform