AI-assisted root-cause analysis for OSDC step failures (inline in job logs)

## Proposal: AI-assisted root-cause analysis for OSDC step failures

### Problem

When a job step fails on OSDC, the in-pod hook wrapper currently emits a generic message:

> `[OSDC] Step script exited with code 1. This is a script/workflow error, not an infrastructure issue. Check the step logs above for the actual failure.`

(Source: `osdc/modules/arc-runners/templates/runner.yaml.tpl`, the `wrapper.js` ConfigMap.)

This correctly tells users that the failure is in their workflow rather than in OSDC infrastructure, but it does not help them find the actual cause: they still have to scroll through the log to locate it. Example: https://github.com/pytorch/pytorch/actions/runs/26852479075/job/79190579763#step:11:57114

### Idea

When a step fails, have the wrapper send the tail of the failing step's log to an Anthropic Claude model on **Bedrock** (same models/pattern as [`claude-autorevert-advisor.yml`](https://github.com/pytorch/pytorch/blob/main/.github/workflows/claude-autorevert-advisor.yml)), get back a short root-cause analysis, and print it **inline in the same job log**, directly under the existing `[OSDC]` error. The likely cause and suggested fix then show up exactly where the user is already looking.
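To make the data flow concrete, here is a minimal sketch of the pure pieces around the model call: trimming the log to a tail, building a request body, and formatting the reply for the job log. Everything here is an assumption for discussion, not the contents of `wrapper.js`: the helper names (`tailLines`, `buildRequestBody`, `formatAnalysis`), the 200-line limit, the `max_tokens` value, the prompt wording, and the `[OSDC][AI]` prefix are all hypothetical. The body shape follows the Anthropic Messages format as used on Bedrock; the actual invocation (client, model ID, credentials) is deliberately left out.

```javascript
// Hypothetical helpers; names and limits are illustrative only.
const MAX_TAIL_LINES = 200; // assumed cap on how much log is sent

// Keep only the last `maxLines` lines of the failing step's log.
function tailLines(log, maxLines = MAX_TAIL_LINES) {
  const lines = log.trimEnd().split(/\r?\n/);
  return lines.slice(-maxLines).join("\n");
}

// Build a request body in the Anthropic Messages format for Bedrock.
// The prompt text and max_tokens are placeholders.
function buildRequestBody(logTail, exitCode) {
  const prompt =
    `A CI step exited with code ${exitCode}. ` +
    "Give a short root-cause analysis and a suggested fix.\n\n" +
    logTail;
  return {
    anthropic_version: "bedrock-2023-05-31",
    max_tokens: 512,
    messages: [{ role: "user", content: [{ type: "text", text: prompt }] }],
  };
}

// Prefix every line so the analysis is clearly attributed in the job log.
function formatAnalysis(text) {
  return text
    .trim()
    .split(/\r?\n/)
    .map((line) => `[OSDC][AI] ${line}`)
    .join("\n");
}
```

Keeping these steps as pure functions would let the log trimming and output formatting be unit-tested without Bedrock access; only the network call in between needs credentials.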

### Scope / intent

- Trigger on **all non-zero step failures**, fleet-wide.
- Analysis is a **strictly optional, fail-open side effect**: it runs under a bounded timeout, and any failure (no creds, network, model error, parse error, timeout) silently skips it and returns the job's real exit code unchanged. It must never block, unduly delay, or fail a job.
- Reuse the autorevert Bedrock approach (Opus 4.6).
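The fail-open requirement above could be sketched roughly as follows. This is an illustration under assumptions, not a proposed implementation: `analyzeFailOpen` and `finishStep` are hypothetical names, the 10-second default timeout is a placeholder, and `analyze` stands in for whatever function ends up calling the model.

```javascript
// Hypothetical fail-open wrapper: returns the analysis text, or null on
// any error, empty result, or timeout. It never throws.
async function analyzeFailOpen(analyze, logTail, timeoutMs = 10000) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(null), timeoutMs);
  });
  try {
    // Whichever settles first wins; a slow model call is simply abandoned.
    const result = await Promise.race([analyze(logTail), timeout]);
    return typeof result === "string" && result.trim() ? result : null;
  } catch {
    return null; // no creds, network, model error, parse error, ...
  } finally {
    clearTimeout(timer); // do not keep the process alive for the timer
  }
}

// Hypothetical step epilogue: print the analysis if there is one, and
// always return the step's real exit code unchanged.
async function finishStep(exitCode, analyze, logTail, print = console.log) {
  if (exitCode !== 0) {
    const analysis = await analyzeFailOpen(analyze, logTail);
    if (analysis !== null) print(analysis);
  }
  return exitCode;
}
```

One design point worth discussing: `Promise.race` only stops waiting for the model call, it does not cancel it, so a real implementation would probably also want to abort the underlying request when the timeout fires.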

A possible design + the feasibility constraints discovered so far are in a comment below. This is a discussion starter — details still TBD.

