[infra] h200-nb_0 / h200-nb_1: enroot /mnt/image-storage full — can't unpack TRT-LLM v1.3.0rc14 image

## Summary

Two H200 self-hosted runners — **`h200-nb_0`** and **`h200-nb_1`** — are unable to unpack newer TensorRT-LLM container images because their `/mnt/image-storage/enroot/data/` partition is full. Every sweep job that lands on either runner fails identically:

```
enroot-mount: failed to create directory:
  /mnt/image-storage/enroot/data/pyxis_nvcr.io_nvidia_tensorrt-llm_release_1.3.0rc14-gharunner/var/run:
  No space left on device
```

The same jobs also report `pyxis: failed to create container filesystem` during the squashfs extraction step, so the benchmark script never runs.

As a temporary workaround, the `h200` SLURM partition tag has been removed from these two nodes, so they currently can't pick up jobs at all. They should be put back into service once the disk is freed.

## How we got here

The recently-tagged `nvcr.io/nvidia/tensorrt-llm/release:v1.3.0rc14` image is significantly larger than `v1.1.0rc2.post2` (it bundles Python 3.12 + new CUDA libs). Old enroot caches haven't been pruned, so `/mnt/image-storage` filled up trying to extract the new image.

Confirmed on these PRs, where every disk-space failure landed on `h200-nb_*` and every success landed on `h200-dgxc-slurm_*` or `h200-cw_01`:

- [#1491](https://github.com/SemiAnalysisAI/InferenceX/pull/1491) — `gptoss-fp4-h200-trt` v1.3.0rc11 → v1.3.0rc14 (11/12 failures = disk; 1 = stale port-8888 leak on `h200-cw_01`)
- [#1487](https://github.com/SemiAnalysisAI/InferenceX/pull/1487) — `dsr1-fp8-h200-trt` (+mtp) v1.1.0rc2.post2 → v1.3.0rc14 (12/12 failures = disk)

Failed CI runs:

- https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26016892349
- https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26016868638

## Suggested SRE fix

Any of:

1. **Prune the enroot image cache** on both nodes:
   ```bash
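   ssh root@h200-nb_0  # repeat on h200-nb_1; run the rest inside the SSH session
   df -h /mnt/image-storage  # usage before cleanup
   enroot list  # see what's still there
   enroot list | xargs -r enroot remove --force  # nuclear option, drops everything enroot lists
   # OR: rm -rf /mnt/image-storage/enroot/data/<stale-pyxis-dirs>
   df -h /mnt/image-storage  # confirm space was freed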
   ```
2. **Expand `/mnt/image-storage`** on these nodes. They have been chronically near-full since image sizes grew between v1.1 and v1.3.
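3. **Add a periodic cron job or systemd timer** that prunes pyxis directories older than ~7d, e.g. `find /mnt/image-storage/enroot/data -mindepth 1 -maxdepth 1 -type d -name 'pyxis_*' -mtime +7 -exec rm -rf {} +`. The `-mindepth 1` keeps `find` from matching (and deleting) the data directory itself.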

Once any of those is done, re-add the `h200` SLURM partition tag and the affected PRs can be re-swept.
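
For the periodic prune in option 3, a possible shape for the script is sketched below. It is only a sketch: the function name `prune_stale_pyxis` and its `--delete` argument are made up for illustration, and it assumes the stale directories are the top-level `pyxis_*` entries seen in the error log. A directory's mtime only changes when its top-level entries change, so age here approximates "when it was unpacked", not "when it was last used". The sketch does not check whether a container is still in use by a running job.

```bash
# Hypothetical helper: prune_stale_pyxis DATA_DIR [MAX_AGE_DAYS] [--delete]
# Lists top-level pyxis_* directories older than MAX_AGE_DAYS (default 7).
# Nothing is removed unless --delete is passed as the third argument.
prune_stale_pyxis() {
    local data_dir="$1"
    local max_age_days="${2:-7}"
    local mode="${3:-}"

    if [ ! -d "$data_dir" ]; then
        echo "prune_stale_pyxis: not a directory: $data_dir" >&2
        return 1
    fi

    if [ "$mode" = "--delete" ]; then
        # -mindepth 1 ensures DATA_DIR itself is never a candidate.
        find "$data_dir" -mindepth 1 -maxdepth 1 -type d -name 'pyxis_*' \
            -mtime +"$max_age_days" -exec rm -rf -- {} +
    else
        # Dry run: print what would be removed.
        find "$data_dir" -mindepth 1 -maxdepth 1 -type d -name 'pyxis_*' \
            -mtime +"$max_age_days" -print
    fi
}
```

Running `prune_stale_pyxis /mnt/image-storage/enroot/data 7` first shows the candidates; adding `--delete` removes them. Defaulting to a dry run is a deliberate choice here, since the delete path is an unconditional `rm -rf`.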

## Bonus issue on `h200-cw_01`

A separate one-off failure on `h200-cw_01` during the same #1491 sweep was an `Address already in use` on port 8888 — left over from a previous `trtllm-serve` that didn't shut down cleanly. Worth a `kill` pass on that node too.
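
To find what is still holding the port before killing anything, a small filter over `ss` output is one option. The sketch below assumes the usual `ss -H -ltnp` column layout (local address and port in the fourth column); the function name `filter_listeners_by_port` is made up for illustration.

```bash
# Hypothetical helper: filter_listeners_by_port PORT
# Reads `ss -H -ltnp` output on stdin and prints only the listening sockets
# whose local port equals PORT (works for both IPv4 and [::]:PORT forms).
filter_listeners_by_port() {
    awk -v port="$1" '{
        n = split($4, parts, ":")      # local address:port is column 4
        if (parts[n] == port) print
    }'
}

# Example invocation on the node (run as root so -p can show processes
# owned by other users):
#   ss -H -ltnp | filter_listeners_by_port 8888
```

The process column of any matching line names the PID to `kill`, which avoids blindly killing every `trtllm-serve` on the node.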

cc @sre / @platform

