AI-assisted root-cause analysis for OSDC step failures (inline in job logs) #695

@huydhn

Proposal: AI-assisted root-cause analysis for OSDC step failures

Problem

When a job step fails on OSDC, the in-pod hook wrapper currently emits a generic message:

[OSDC] Step script exited with code 1. This is a script/workflow error, not an infrastructure issue. Check the step logs above for the actual failure.

(Source: osdc/modules/arc-runners/templates/runner.yaml.tpl, the wrapper.js ConfigMap.)

The message correctly tells users "this is your workflow, not OSDC infra," but it doesn't help them find the actual cause; they still have to scroll through the log. Example: https://github.com/pytorch/pytorch/actions/runs/26852479075/job/79190579763#step:11:57114

Idea

When a step fails, the wrapper would send the failing step's log tail to an Anthropic Claude model on Bedrock (same models/pattern as claude-autorevert-advisor.yml), get back a short root-cause analysis, and print it inline in the same job log, right under the existing [OSDC] error. The likely cause and a suggested fix then show up exactly where the user is already looking.
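
As a rough illustration of the "log tail in, inline analysis out" part of the idea, here is a minimal sketch in JavaScript (the language of wrapper.js). The helper names `tailLog` and `formatAnalysis`, the line and character limits, and the `[OSDC][AI]` prefix are all assumptions made for this example; none of them come from the existing wrapper.

```javascript
// Hypothetical helpers, not part of the current wrapper.js.

// Keep only the end of the step log, where the failure usually is.
// maxLines and maxChars are illustrative limits, not agreed values.
function tailLog(logText, maxLines = 200, maxChars = 16000) {
  const lines = logText.split(/\r?\n/);
  const tail = lines.slice(-maxLines).join('\n');
  // Trim from the front so the most recent output survives the cap.
  return tail.length > maxChars ? tail.slice(-maxChars) : tail;
}

// Prefix every line of the model's answer so it is easy to spot
// (and grep for) right under the existing [OSDC] error message.
// The "[OSDC][AI]" prefix is an assumption for this sketch.
function formatAnalysis(analysis) {
  return analysis
    .split(/\r?\n/)
    .map((line) => `[OSDC][AI] ${line}`)
    .join('\n');
}
```

Capping the tail keeps the request small and bounded regardless of how long the step log is; the right limits would depend on the model's context budget and on how much log is typically needed to see the real failure.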

Scope / intent

  • Trigger on all non-zero step failures, fleet-wide.
  • Analysis is a strictly optional, fail-open side effect: it runs under a bounded timeout, and any failure (no creds, network, model error, parse error, timeout) silently skips it and leaves the job's real exit code unchanged. It must never block, unduly delay, or fail a job.
  • Reuse the autorevert Bedrock approach (Opus 4.6).
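
The fail-open requirement above could look roughly like the following sketch. It assumes a hypothetical `analyze(logTail)` function that performs the model call and resolves to a string; `analyzeFailOpen`, `finishStep`, and the 20-second default timeout are likewise placeholders for discussion, not existing code or agreed values.

```javascript
// Hypothetical sketch of the fail-open behavior; `analyze` stands in
// for whatever function ends up calling the model.

// Resolves to the analysis text, or to null on ANY problem
// (thrown error, rejected promise, empty/non-string result, timeout).
async function analyzeFailOpen(analyze, logTail, timeoutMs = 20000) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(null), timeoutMs);
  });
  try {
    const result = await Promise.race([
      // Wrapping in then() also catches synchronous throws from analyze.
      Promise.resolve().then(() => analyze(logTail)),
      timeout,
    ]);
    return typeof result === 'string' && result.trim() !== '' ? result : null;
  } catch (_err) {
    return null; // Fail open: the analysis is simply skipped.
  } finally {
    clearTimeout(timer);
  }
}

// Always returns the step's real exit code, whatever the analysis does.
async function finishStep(exitCode, logTail, analyze, print = console.log) {
  if (exitCode !== 0) {
    const analysis = await analyzeFailOpen(analyze, logTail);
    if (analysis !== null) {
      try {
        print(analysis);
      } catch (_err) {
        // Even a printing failure must not affect the job.
      }
    }
  }
  return exitCode;
}
```

Note that `Promise.race` only stops waiting for the call; it does not cancel it. A real implementation would probably also need to abort the in-flight request (for example with an `AbortController`) so a hung connection cannot keep the wrapper process alive past the timeout.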

A possible design + the feasibility constraints discovered so far are in a comment below. This is a discussion starter — details still TBD.
