
[CI][OSDC] Runner container-hooks store RPC logs in /tmp, so user scripts doing 'rm -rf /tmp/*' break post-script steps #690

Description

@huydhn

Summary

The OSDC/ARC runner's container-hooks store their RPC exec bookkeeping in /tmp/rpc-logs/*.stdout inside the job container. /tmp is world-writable and is routinely wiped by user CI scripts (rm -rf /tmp/* is an extremely common image/cleanup idiom). When a workflow's script deletes /tmp/*, it silently destroys the runner's own RPC state, and every subsequent exec into the container fails even though the job's main script succeeded.

This bit torchtitan when its H100 integration test migrated to OSDC: bash /install_deepep.sh runs sudo rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* during the "Run script" step. The main step exits 0 (tests pass), but the post-script steps (Surface failing tests, artifact upload, Post Checkout) all fail with:

Error: failed to run script step: RPC /exec failed after 3 attempts (500):
{"error": "[Errno 2] No such file or directory: '/tmp/rpc-logs/<uuid>.stdout'"}
[runner-container-hooks] FATAL: ...
##[error]Executing the custom container implementation failed.
         Please contact your self hosted runner administrator.
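The failure mode can be reproduced outside the runner. The following is a minimal sketch, not the container-hooks code: it assumes the rpc-logs directory is created once at container init and that each exec opens a fresh <uuid>.stdout file inside it. The function name simulate_tmp_wipe and the throwaway directory standing in for /tmp are illustrative only.

```python
import os
import shutil
import tempfile
import uuid


def simulate_tmp_wipe() -> str:
    """Reproduce the failure in a throwaway directory standing in for /tmp."""
    fake_tmp = tempfile.mkdtemp(prefix="fake-tmp-")  # stand-in for the container's /tmp
    try:
        rpc_logs = os.path.join(fake_tmp, "rpc-logs")
        os.mkdir(rpc_logs)  # assumption: created once, at container init

        # Equivalent of the user script's `rm -rf /tmp/*`:
        # delete every top-level entry under the fake /tmp.
        for entry in os.listdir(fake_tmp):
            shutil.rmtree(os.path.join(fake_tmp, entry))

        # The next exec tries to open its stdout log in the now-missing directory.
        stdout_path = os.path.join(rpc_logs, f"{uuid.uuid4()}.stdout")
        try:
            with open(stdout_path, "w"):
                pass
        except FileNotFoundError as err:
            return f"[Errno {err.errno}] {err.strerror}"
        return "ok"
    finally:
        shutil.rmtree(fake_tmp, ignore_errors=True)


if __name__ == "__main__":
    print(simulate_tmp_wipe())
```

The wipe itself raises nothing; the error only appears when the next log file is opened, which matches the "main step exits 0, post-script steps fail" pattern above.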

Symptom vs. root cause

  • The canned message says "This is a script/workflow error, not an infrastructure issue" — misleading. The script merely cleaned /tmp; the breakage is that infra keeps critical RPC state in a user-writable path that scripts are expected to be able to clean.
  • "Initialize containers" and "Run script" both succeed, so it looks like a flaky post-step infra death; in reality the damage is done mid-script and only surfaces on the next exec.

Affected

  • Runners: OSDC/ARC, linux_job_v3.yml, label mt-l-bx86iamx-176-1800-h100-8 (and presumably any OSDC pod using the custom container implementation).
  • Old linux.aws.h100.8 (linux_job_v2) runners were not affected — they don't keep container-hooks RPC logs under /tmp, which is why this only appeared after the OSDC migration.

Evidence

Workaround already applied (consumer side)

torchtitan dropped /tmp/* from the rm in install_deepep.sh (pytorch/torchtitan#3479). But this is a sharp edge any OSDC consumer can hit — rm -rf /tmp/* is ubiquitous.
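For consumers that still need to reclaim space in /tmp, one option is to clean it selectively rather than with a blanket rm -rf /tmp/*. The sketch below is not what torchtitan did (it simply dropped /tmp/* from the rm). The helper name clean_dir_except and the choice to preserve an entry named rpc-logs are assumptions based on the path in the error message above, and the approach couples the consumer to an internal detail of the hooks that may change.

```python
import os
import shutil


def clean_dir_except(root: str, keep: frozenset = frozenset({"rpc-logs"})) -> list:
    """Delete every top-level entry under root except names in keep.

    Returns the names that were removed, in sorted order.
    """
    removed = []
    for name in sorted(os.listdir(root)):
        if name in keep:
            continue  # leave the runner's bookkeeping directory alone
        path = os.path.join(root, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.remove(path)  # regular files and symlinks
        removed.append(name)
    return removed
```

Calling clean_dir_except("/tmp") inside the job container would remove everything at the top level of /tmp except rpc-logs. This only works around the problem on the consumer side; the infra-side hardening is still needed for every other workflow.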

Proposed infra hardening (OSDC)

The rpc-logs live in the job container's /tmp, not on the host. The hooks can't relocate them to a "safe" fixed path, because we don't control or know the user-supplied container image — /tmp is effectively the only path guaranteed to exist and be writable in an arbitrary container. So the fix has to keep using /tmp but stop assuming the dir persists between execs:

  • Store rpc-logs outside /tmp (e.g. under the runner's home or another non-/tmp path): not viable. That path isn't guaranteed to exist or be writable inside an arbitrary user container; /tmp is the lowest-common-denominator writable location.
  • (primary) Recreate/own the rpc-logs dir per exec: run mkdir -p on the dir immediately before each exec instead of once at container init, so that an rm -rf /tmp/* between steps self-heals and the next exec succeeds.
  • Detect + diagnose: when the rpc-logs path is missing, surface a clear, accurate message (e.g. "container /tmp was cleared by the job; recreating") instead of the misleading "script/workflow error, not an infrastructure issue."
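The "recreate per exec" and "detect + diagnose" items could be combined roughly as follows. This is a sketch under stated assumptions, not the hooks' actual implementation: the function name open_exec_log, the logger name, and the idea that each exec opens a <exec_id>.stdout file are all hypothetical and inferred only from the error message in this issue.

```python
import logging
import os

log = logging.getLogger("rpc-exec-sketch")  # hypothetical logger name


def open_exec_log(rpc_logs_dir: str, exec_id: str):
    """Ensure the rpc-logs dir exists right before an exec, then open its stdout log.

    The directory is (re)created on every call rather than once at container
    init, so a wipe of /tmp between execs does not break the next one.
    """
    if not os.path.isdir(rpc_logs_dir):
        # Accurate diagnosis instead of a generic "script/workflow error".
        log.warning(
            "%s is missing; container /tmp was likely cleared by the job; recreating",
            rpc_logs_dir,
        )
    # Equivalent of `mkdir -p`: no error if the directory already exists.
    os.makedirs(rpc_logs_dir, exist_ok=True)
    return open(os.path.join(rpc_logs_dir, f"{exec_id}.stdout"), "w")
```

This covers a wipe that happens between execs. It does not protect a log file that is deleted while its own exec is still running; whether that case needs handling depends on how the hooks read the log back, which this issue does not describe.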
