Summary
The OSDC/ARC runner's container-hooks store their RPC exec bookkeeping in /tmp/rpc-logs/*.stdout inside the job container. /tmp is world-writable and is routinely wiped by user CI scripts (rm -rf /tmp/* is an extremely common image/cleanup idiom). When a workflow's script deletes /tmp/*, it silently destroys the runner's own RPC state, and every subsequent exec into the container fails even though the job's main script succeeded.
This bit torchtitan when its H100 integration test migrated to OSDC: bash /install_deepep.sh runs sudo rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* during the "Run script" step. The main step exits 0 (tests pass), but the post-script steps (Surface failing tests, artifact upload, Post Checkout) all fail with:
Error: failed to run script step: RPC /exec failed after 3 attempts (500):
{"error": "[Errno 2] No such file or directory: '/tmp/rpc-logs/<uuid>.stdout'"}
[runner-container-hooks] FATAL: ...
##[error]Executing the custom container implementation failed.
Please contact your self hosted runner administrator.
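The failure mode can be reproduced outside the runner. The following is a minimal sketch, not the hooks' actual code: it assumes the RPC server opens a per-exec stdout file under a directory that was created once, and it uses a throwaway temp directory as a stand-in for the container's /tmp. The function name and file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def reproduce_missing_log_dir() -> str:
    """Return the error text raised when the log dir vanishes between execs."""
    tmp_root = Path(tempfile.mkdtemp())  # stand-in for the container's /tmp
    log_dir = tmp_root / "rpc-logs"
    log_dir.mkdir()  # created once, as if at container init

    try:
        # First exec: the directory exists, so writing the stdout file works.
        (log_dir / "first.stdout").write_text("ok\n")

        # The job script clears everything under the temp root (rm -rf /tmp/*).
        for child in list(tmp_root.iterdir()):
            shutil.rmtree(child)

        # Next exec: opening a new stdout file fails, the parent dir is gone.
        try:
            (log_dir / "second.stdout").open("w")
        except FileNotFoundError as err:
            return str(err)
        return ""
    finally:
        shutil.rmtree(tmp_root, ignore_errors=True)
```

On Linux the returned text has the same "[Errno 2] No such file or directory" shape as the error in the job log above, which is consistent with the log directory having been removed mid-script.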
Symptom vs. root cause
- The canned message says "This is a script/workflow error, not an infrastructure issue", which is misleading. The script merely cleaned /tmp; the breakage is that infra keeps critical RPC state in a user-writable path that scripts are expected to be able to clean.
- "Initialize containers" and "Run script" both succeed, so it looks like a flaky post-step infra death; in reality the damage is done mid-script and only surfaces on the next exec.
Affected
- Runners: OSDC/ARC, linux_job_v3.yml, label mt-l-bx86iamx-176-1800-h100-8 (and presumably any OSDC pod using the custom container implementation).
- Old linux.aws.h100.8 (linux_job_v2) runners were not affected: they don't keep container-hooks RPC logs under /tmp, which is why this only appeared after the OSDC migration.
Evidence
- rm: https://github.com/pytorch/torchtitan/actions/runs/26801406725/job/79008692892
Workaround already applied (consumer side)
torchtitan dropped /tmp/* from the rm in install_deepep.sh (pytorch/torchtitan#3479). But this is a sharp edge any OSDC consumer can hit — rm -rf /tmp/* is ubiquitous.
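For consumers that still need to reclaim space in /tmp, one alternative to dropping /tmp/* entirely is to clean it while sparing the runner's directory. This is a sketch under assumptions: clear_dir_except is a hypothetical helper, and it assumes rpc-logs is the only runner-owned entry under /tmp, which this report does not confirm.

```python
import shutil
from pathlib import Path


def clear_dir_except(root, keep=("rpc-logs",)):
    """Delete every top-level entry under root except the names in keep.

    Returns the sorted names of the entries that were removed.
    """
    removed = []
    for child in list(Path(root).iterdir()):
        if child.name in keep:
            continue  # leave runner-owned state untouched
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()  # regular files and symlinks
        removed.append(child.name)
    return sorted(removed)
```

This only narrows the sharp edge for one consumer; any other script that runs a plain rm -rf /tmp/* still hits it, which is why the infra-side hardening is proposed as well.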
Proposed infra hardening (OSDC)
The rpc-logs live in the job container's /tmp, not on the host. The hooks can't relocate them to a "safe" fixed path, because we don't control or know the user-supplied container image — /tmp is effectively the only path guaranteed to exist and be writable in an arbitrary container. So the fix has to keep using /tmp but stop assuming the dir persists between execs:
- (rejected) Store rpc-logs outside /tmp (e.g. under the runner's home or another non-/tmp path). Not viable: that path isn't guaranteed to exist or be writable inside an arbitrary user container; /tmp is the lowest-common-denominator writable location.
- (primary) Recreate/own the rpc-logs dir per-exec: mkdir -p the dir immediately before each exec instead of once at container init, so that an rm -rf /tmp/* between steps self-heals and the next exec succeeds.
- Detect + diagnose: when the rpc-logs path is missing, surface a clear, accurate message (e.g. "container /tmp was cleared by the job; recreating") instead of the misleading "script/workflow error, not an infrastructure issue."
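The primary fix and the diagnostic can be sketched together as follows. This is an illustration, not the container-hooks implementation: ensure_log_dir, run_exec, the logger name, and the exec_id parameter are all hypothetical, and the sketch assumes each exec writes its output to one file under the log directory.

```python
import logging
import subprocess
from pathlib import Path

log = logging.getLogger("rpc-exec-sketch")  # hypothetical logger name


def ensure_log_dir(log_dir) -> bool:
    """Equivalent of mkdir -p; returns True if the directory had to be created."""
    log_dir = Path(log_dir)
    missing = not log_dir.is_dir()
    log_dir.mkdir(parents=True, exist_ok=True)
    return missing


def run_exec(argv, log_dir, exec_id):
    """Run argv, capturing its output in <log_dir>/<exec_id>.stdout."""
    # Recreate the directory before every exec, not only at container init.
    if ensure_log_dir(log_dir):
        log.warning(
            "%s was missing; container /tmp was likely cleared by the job; "
            "recreating", log_dir,
        )
    stdout_path = Path(log_dir) / f"{exec_id}.stdout"
    with stdout_path.open("wb") as out:
        completed = subprocess.run(
            argv, stdout=out, stderr=subprocess.STDOUT, check=False
        )
    return completed.returncode, stdout_path
```

Calling ensure_log_dir once at container init keeps the warning from firing on the first exec, so it is only logged when the directory disappeared after init.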