[Code Health] CI resource failures only surface as task-submit exit=1 #1205

Description

@vegetabledoww

Category

Robustness (potential edge-case failure)

Component

Tests

Description

st-onboard-a5 can fail with only a top-level task-submit / process exit message, while the visible pytest summaries near the end are passing. This makes it hard to identify which Resource-phase child pytest job actually failed.

Example run/job:

Observed tail:

--- L2 host_build_graph: PASS ...
--- L2 tensormap_and_ringbuffer: PASS ...
...
[npu-lock] Released locks for devices 4 5
=== Task failed (exit=1) ===
Process completed with exit code 1.

This does not look like a CI-machine failure: the self-hosted runner started successfully, checkout/setup/build succeeded, device locks were acquired and released cleanly, and the job was not cancelled or timed out. The failure appears to come from inside the task-submit-wrapped pytest command.

The relevant workflow command is:

PYTEST="python -m pytest examples tests/st --platform a5 --device ${DEVICE_RANGE} -v --clone-protocol ssh --require-pto-isa"
task-submit --timeout 1800 --max-time 1800 --device "$DEVICE_LIST" --run "$PYTEST --pto-session-timeout 1200 --pto-isa-commit ..."

The root pytest dispatcher runs a Resource phase for L3 / standalone resource-marked tests and collects child pytest results via parallel_scheduler.run_jobs(...). If a child returns non-zero, the parent marks the session failed, but the actionable failure detail is only printed inside the per-child GitHub group. When the Actions log is collapsed/truncated, the top-level failure has no clear child label or useful tail.
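
To illustrate why the detail gets lost, here is a minimal sketch of how child output ends up hidden once groups are collapsed. The emit_group, render_session, and visible_when_collapsed helpers are hypothetical stand-ins written for this issue, not the actual conftest.py code; only the ::group:: and ::endgroup:: workflow commands are real GitHub Actions syntax.

```python
import io
from contextlib import redirect_stdout


def emit_group(title: str, body: str) -> None:
    """Hypothetical stand-in for _emit_group(...): wrap body in a collapsible group."""
    print(f"::group::{title}")
    print(body)
    print("::endgroup::")


def render_session(children: list[tuple[str, int, str]]) -> str:
    """Render (label, returncode, output) triples the way the issue describes."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        failed = False
        for label, returncode, output in children:
            emit_group(label, output)
            failed = failed or returncode != 0
        # Only a generic verdict is printed outside the groups.
        print(f"session failed: {failed}")
    return buf.getvalue()


def visible_when_collapsed(log: str) -> list[str]:
    """Lines a reader sees with every group collapsed: group titles plus ungrouped lines."""
    visible, in_group = [], False
    for line in log.splitlines():
        if line.startswith("::group::"):
            visible.append(line[len("::group::"):])
            in_group = True
        elif line == "::endgroup::":
            in_group = False
        elif not in_group:
            visible.append(line)
    return visible
```

With one passing and one failing child, the collapsed view contains both group titles and the generic verdict, but nothing that says which child failed or why.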

In the referenced run, the visible Resource phase began with standalone worker tests and 2-device allreduce variants. The log shows the allreduce onephase and ring variants passing, but the middle of the Resource-phase output is truncated in the fetched job log, so the exact failing child is not obvious from the final Actions view.

This issue is about CI observability and failure surfacing. It does not assert that the underlying test failure is a CI-machine problem. The referenced PR also changed prepared-callable behavior so that TASK_READY now requires prior _CTRL_PREPARE; that may be related to the actual child failure, but the immediate issue is that the failed child is not visible from the top-level CI result.

Location

  • .github/workflows/ci.yml: st-onboard-a5, Run pytest scene tests (a5) wraps pytest in task-submit
  • conftest.py: _dispatch_test_phases(...) Resource phase uses parallel_scheduler.run_jobs(...)
  • conftest.py: _emit_group(...) prints child output inside collapsible GitHub groups
  • conftest.py: final session.testsfailed = 1 if (resource_failed or l2_failed) else 0 marks the parent failed without an uncollapsed Resource failure summary
  • simpler_setup/parallel_scheduler.py: JobResult already carries label, returncode, device_ids, output, and duration_s; enough information exists to summarize failed children

Proposed Fix

Improve Resource phase failure reporting in conftest.py:

  1. Keep collecting JobResults from parallel_scheduler.run_jobs(...) as today.
  2. After the Resource phase completes, if any result has returncode != 0, print an uncollapsed summary outside any GitHub group.
  3. Emit a GitHub Actions annotation for each failed child, for example:
::error title=Resource phase failed::<label> rc=<returncode> devices=<devices>
  4. Include the last 50-100 lines of the failed child output outside the collapsed group, or at minimum print the failing child labels and tell the reader which group to expand.

A minimal shape could be:

*** Resource phase failed: 1 child job(s) ***
- standalone test_xxx (rt=..., dev=...): rc=1 devices=[...]
  tail:
  ...

This would make hardware Resource failures actionable from the job tail without requiring manual log archaeology.
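
A minimal sketch of the summary itself, assuming the results are available as a list after the Resource phase. The JobResult dataclass here is a hypothetical mirror of the fields listed under Location (the real class in parallel_scheduler.py may differ), and format_resource_failures is an illustrative name, not an existing function.

```python
from dataclasses import dataclass


@dataclass
class JobResult:
    """Hypothetical mirror of the fields the issue lists; the real class may differ."""
    label: str
    returncode: int
    device_ids: list[int]
    output: str
    duration_s: float


def format_resource_failures(results: list[JobResult], tail_lines: int = 50) -> str:
    """Build an uncollapsed summary of failed children; empty string if none failed."""
    failed = [r for r in results if r.returncode != 0]
    if not failed:
        return ""
    lines = [f"*** Resource phase failed: {len(failed)} child job(s) ***"]
    for r in failed:
        lines.append(f"- {r.label}: rc={r.returncode} devices={r.device_ids}")
        lines.append("  tail:")
        # Guard against tail_lines <= 0, where a plain [-0:] slice would return everything.
        tail = r.output.splitlines()[-tail_lines:] if tail_lines > 0 else []
        lines.extend(f"  {line}" for line in tail)
    return "\n".join(lines)
```

The returned text would be printed once after the Resource phase, outside any group and before session.testsfailed is set, so that it lands near the job tail where the exit=1 message appears.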

Priority

Medium (minor risk, should fix in next few releases)
