Proposal: AI-assisted root-cause analysis for OSDC step failures
Problem
When a job step fails on OSDC, the in-pod hook wrapper currently emits a generic message:
[OSDC] Step script exited with code 1. This is a script/workflow error, not an infrastructure issue. Check the step logs above for the actual failure.
(Source: osdc/modules/arc-runners/templates/runner.yaml.tpl, the wrapper.js ConfigMap.)
It correctly tells users "this is your workflow, not OSDC infra," but it doesn't help them find the actual cause — they still have to scroll the log. Example: https://github.com/pytorch/pytorch/actions/runs/26852479075/job/79190579763#step:11:57114
Idea
When a step fails, have the wrapper send the tail of the failing step's log to an Anthropic Claude model on Bedrock (same models/pattern as claude-autorevert-advisor.yml) and get back a short root-cause analysis. Print that analysis inline in the same job log, right under the existing [OSDC] error, so the likely cause and a suggested fix show up exactly where the user is already looking.
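To make "log tail" concrete, here is a minimal sketch of how the wrapper might bound what it sends. Everything in it is an assumption for discussion: the names `tailLog` and `buildPrompt`, the line/character limits, and the prompt wording are all hypothetical, and none of it is taken from the current wrapper.js.

```javascript
// Hypothetical helper (not in wrapper.js today): keep only the last
// `maxLines` lines of a step log, then cap the result at `maxChars`.
function tailLog(logText, maxLines, maxChars) {
  const lines = logText.split(/\r?\n/);
  const tail = lines.slice(-maxLines).join("\n");
  // If the tail is still too long, keep its end, since the proposal
  // is to send the tail of the log rather than the head.
  return tail.length > maxChars ? tail.slice(-maxChars) : tail;
}

// Hypothetical prompt builder; the wording and limits are illustrative only.
function buildPrompt(stepName, exitCode, logText) {
  return [
    `A CI step named "${stepName}" exited with code ${exitCode}.`,
    "Give a short root-cause analysis and one suggested fix.",
    "Log tail:",
    tailLog(logText, 200, 16000), // assumed limits, to be tuned
  ].join("\n");
}
```

The real limits would depend on the model's context budget and on how much latency is acceptable, both still TBD.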
Scope / intent
- Trigger on all non-zero step failures, fleet-wide.
- Analysis is a strictly optional, fail-open side effect. It runs under a bounded timeout, and any failure (no creds, network, model error, parse error, timeout) silently skips it and returns the job's real exit code unchanged. It must never block a job, fail a job, or delay it beyond that timeout.
- Reuse the autorevert Bedrock approach (Opus 4.6).
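The fail-open requirement above could look roughly like the sketch below. It is an assumption-heavy illustration, not the proposed implementation: `analyzeFailOpen`, `finishStep`, the injected `analyze` callback (standing in for the Bedrock call), and the printed prefix are all hypothetical names, and the timeout value is deliberately left as a parameter.

```javascript
// Hypothetical: `analyze` is any async function that returns analysis text
// (for example, a Bedrock call). Resolves to a string, or null on any problem.
async function analyzeFailOpen(analyze, timeoutMs) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(null), timeoutMs);
  });
  try {
    // Whichever settles first wins. Wrapping in then() also catches
    // synchronous throws from `analyze`.
    const result = await Promise.race([Promise.resolve().then(analyze), timeout]);
    return typeof result === "string" && result.trim() ? result : null;
  } catch (_err) {
    return null; // fail open: no creds, network, model error, parse error
  } finally {
    clearTimeout(timer);
  }
}

// Hypothetical step epilogue: print the analysis if there is one, and
// always return the step's real exit code unchanged.
async function finishStep(exitCode, analyze, timeoutMs, print = console.log) {
  if (exitCode !== 0) {
    const analysis = await analyzeFailOpen(analyze, timeoutMs);
    if (analysis) print(`[OSDC] Possible root cause (AI-generated):\n${analysis}`);
  }
  return exitCode;
}
```

One design point worth discussing: a timed-out request is abandoned here rather than cancelled, so a real version would probably also want to abort the in-flight call so it cannot hold the process open.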
A possible design and the feasibility constraints discovered so far are in a comment below. This is a discussion starter; details are still TBD.