[CI][OSDC] Runner container-hooks store RPC logs in /tmp, so user scripts doing 'rm -rf /tmp/*' break post-script steps #690

## Summary

The OSDC/ARC runner's container-hooks store their RPC exec bookkeeping in **`/tmp/rpc-logs/*.stdout`** inside the job container. `/tmp` is world-writable and is routinely wiped by user CI scripts (`rm -rf /tmp/*` is an extremely common image/cleanup idiom). When a workflow's script deletes `/tmp/*`, it silently destroys the runner's own RPC state, and **every subsequent `exec` into the container fails** even though the job's main script succeeded.

This bit torchtitan when its H100 integration test migrated to OSDC: `bash /install_deepep.sh` runs `sudo rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*` during the "Run script" step. The main step exits 0 (tests pass), but the post-script steps (Surface failing tests, artifact upload, Post Checkout) all fail with:

```
Error: failed to run script step: RPC /exec failed after 3 attempts (500):
{"error": "[Errno 2] No such file or directory: '/tmp/rpc-logs/<uuid>.stdout'"}
[runner-container-hooks] FATAL: ...
##[error]Executing the custom container implementation failed.
         Please contact your self hosted runner administrator.
```

## Symptom vs. root cause

- The canned message says *"This is a script/workflow error, not an infrastructure issue"*, which is misleading. The script only cleaned `/tmp`; the actual fault is that the infra keeps critical RPC state in a user-writable path that scripts can reasonably expect to clean.
- "Initialize containers" and "Run script" both **succeed**, so it looks like a flaky post-step infra death; in reality the damage is done mid-script and only surfaces on the next `exec`.
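
A minimal local sketch of that delayed failure, for illustration only. It assumes the hooks create the `rpc-logs` dir once at container init and then write one `<uuid>.stdout` file per exec without re-checking the directory; `open_exec_log` and `reproduce` are hypothetical names, not the hooks' real code, and a temp dir stands in for the container's `/tmp`.

```python
import errno
import shutil
import tempfile
import uuid
from pathlib import Path


def open_exec_log(log_dir: Path) -> Path:
    """Hypothetical stand-in for the hook creating a per-exec stdout log."""
    log_path = log_dir / f"{uuid.uuid4()}.stdout"
    log_path.write_text("")  # raises FileNotFoundError if log_dir is gone
    return log_path


def reproduce() -> int:
    """Return the errno raised by the first exec after the /tmp wipe (0 if none)."""
    with tempfile.TemporaryDirectory() as fake_tmp:
        log_dir = Path(fake_tmp) / "rpc-logs"
        log_dir.mkdir()         # assumed: created once, at container init
        open_exec_log(log_dir)  # "Run script" exec starts fine

        # The user script runs the equivalent of `rm -rf /tmp/*` mid-step.
        for child in list(Path(fake_tmp).iterdir()):
            shutil.rmtree(child)

        try:
            open_exec_log(log_dir)  # next exec, e.g. a post-script step
        except FileNotFoundError as exc:
            return exc.errno
    return 0


if __name__ == "__main__":
    print(reproduce() == errno.ENOENT)
```

Nothing fails at the moment of the wipe itself; the error only appears when the next exec tries to create its log file, which matches the `[Errno 2]` in the RPC response above.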

## Affected

- Runners: OSDC/ARC, `linux_job_v3.yml`, label `mt-l-bx86iamx-176-1800-h100-8` (and presumably any OSDC pod using the custom container implementation).
- Old `linux.aws.h100.8` (`linux_job_v2`) runners were **not** affected — they don't keep container-hooks RPC logs under `/tmp`, which is why this only appeared after the OSDC migration.

## Evidence

- Failing main run: https://github.com/pytorch/torchtitan/actions/runs/26777741392/job/78933742236
- Reproduced + isolated to the `rm`: https://github.com/pytorch/torchtitan/actions/runs/26801406725/job/79008692892
- Root-cause line: https://github.com/pytorch/torchtitan/blob/main/.ci/docker/common/install_deepep.sh#L20

## Workaround already applied (consumer side)

torchtitan dropped `/tmp/*` from the `rm` in `install_deepep.sh` (pytorch/torchtitan#3479). But this is a sharp edge any OSDC consumer can hit — `rm -rf /tmp/*` is ubiquitous.

## Proposed infra hardening (OSDC)

The rpc-logs live in the **job container's** `/tmp`, not on the host. The hooks can't relocate them to a "safe" fixed path, because we don't control or know the user-supplied container image — `/tmp` is effectively the only path guaranteed to exist and be writable in an arbitrary container. So the fix has to keep using `/tmp` but stop assuming the dir persists between execs:

- ~~Store `rpc-logs` outside `/tmp` (e.g. under the runner's home / a non-`/tmp` path)~~ — **not viable**: that path isn't guaranteed to exist or be writable inside an arbitrary user container; `/tmp` is the lowest-common-denominator writable location.
- **(primary) Recreate/own the rpc-logs dir per-exec** — `mkdir -p` the dir immediately before each `exec` instead of once at container init, so a `rm -rf /tmp/*` between steps self-heals and the next exec succeeds.
- **Detect + diagnose**: when the rpc-logs path is missing, surface a clear, accurate message (e.g. "container `/tmp` was cleared by the job; recreating") instead of the misleading "script/workflow error, not an infrastructure issue."
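
The per-exec recreate plus diagnosis could look roughly like the sketch below. This is not the hooks' actual implementation: `ensure_log_dir`, `exec_with_self_heal`, and the logger name are hypothetical, and the only assumption carried over from this issue is that each exec writes `<id>.stdout` under an `rpc-logs` dir.

```python
import logging
from pathlib import Path

log = logging.getLogger("rpc-hooks-sketch")  # hypothetical logger name


def ensure_log_dir(log_dir: Path) -> bool:
    """Recreate log_dir if it is missing; return True when it had to be recreated."""
    if log_dir.is_dir():
        return False
    log_dir.mkdir(parents=True, exist_ok=True)
    # Accurate diagnosis instead of "script/workflow error".
    log.warning("%s was missing (container /tmp cleared by the job?); recreating", log_dir)
    return True


def exec_with_self_heal(log_dir: Path, exec_id: str) -> Path:
    """Prepare the stdout log for one exec, tolerating a wiped /tmp."""
    ensure_log_dir(log_dir)  # runs before every exec, not only at container init
    log_path = log_dir / f"{exec_id}.stdout"
    log_path.touch()
    return log_path
```

This only covers a wipe that happens between execs. If the dir is deleted while an exec is still writing its log, the hook would additionally need to tolerate the missing file when it reads the log back.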


