test(ci): surface Resource phase child failures #1211

Open
vegetabledoww wants to merge 1 commit into hw-native-sys:main from vegetabledoww:fix/issue-1205-resource-summary

@vegetabledoww

Contributor

Summary

Fixes #1205.

When the root pytest dispatcher runs Resource-phase child jobs, a failing child currently leaves the top-level CI tail showing only task-submit / Process completed with exit code 1. To find the cause, a reviewer has to expand the right GitHub group and hope the relevant output was not truncated.

This PR makes Resource failures visible outside collapsed groups by printing:

  • a Resource failure summary with child label, rc, devices, and duration
  • a GitHub Actions ::error annotation for each failed child
  • the failed child output tail
  • a compact recap after the L2 phase, so the final job tail still identifies the Resource failure
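
As a rough illustration of the annotation item above, the sketch below builds a `::error` workflow command with the escaping GitHub Actions expects (`%`, carriage return, and newline in the message; additionally `:` and `,` in property values such as `title`). The helper names here are hypothetical and are not the ones used in this PR's conftest.py.

```python
def escape_data(value: str) -> str:
    """Escape a workflow-command message. '%' goes first so the
    escapes introduced afterwards are not escaped a second time."""
    return value.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def escape_property(value: str) -> str:
    """Escape a workflow-command property value such as title=..."""
    return escape_data(value).replace(":", "%3A").replace(",", "%2C")


def error_annotation(title: str, message: str) -> str:
    """Build one ::error line; printing it to stdout creates the annotation."""
    return f"::error title={escape_property(title)}::{escape_data(message)}"


if __name__ == "__main__":
    print(error_annotation("Resource phase failed",
                           "standalone test_xxx rc=-11 devices=[4]"))
```

Without the escaping, a multi-line message would be cut at the first newline, since a workflow command occupies a single log line.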

Notes

This addresses the CI observability problem from #1205. It does not attempt to fix the underlying a5 runtime_fatal_codes teardown behavior; #1206 mitigated that separately by taking the a5 onboard case offline.

Verification

  • Added unit coverage for Resource failure summary formatting, annotation escaping, tail truncation, and compact recap mode.
  • Static check: parsed the changed Python files successfully in the local checkout. Full pytest was not run locally because this Windows environment cannot import the repo root conftest.py due to Linux-only fcntl dependencies.

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0af25458-0616-458f-8254-15f7a1facf06

Walkthrough

Adds three private helpers to conftest.py — a GitHub Actions payload escaper, a tail-lines extractor, and _emit_resource_failure_summary — and wires them into _dispatch_test_phases to emit uncollapsed failure summaries with optional ::error annotations and output tails when any Resource-phase child job fails. Two unit tests exercise both the full and compact recap modes.

Resource Phase Failure Reporting

| Layer / File(s) | Summary |
| --- | --- |
| Helper functions and failure summary emitter (conftest.py) | Adds _escape_gha_payload, _tail_lines, and _emit_resource_failure_summary; introduces resource_results accumulator in _dispatch_test_phases; calls the summary function immediately after Resource jobs complete (with tail/annotations) and again after the L2 phase (compact recap, no tail/annotations). |
| Unit tests for _emit_resource_failure_summary (tests/ut/py/test_resource_failure_summary.py) | Dynamically loads root conftest.py and tests full mode (tail_lines=2, ::error annotation, heading, tail content) and compact recap mode (no ::error, no tail, single-line job summary with rc/devices/duration). |
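
To make the two modes in the walkthrough concrete, here is a minimal sketch of a failure summary with a full mode (with output tail) and a compact recap mode. It is written from the description above, not from the PR's code: `ChildResult`, `tail_lines`, and `format_summary` are hypothetical names, and the output layout only approximates the examples shown later in this thread.

```python
from dataclasses import dataclass


@dataclass
class ChildResult:
    """Hypothetical record for one Resource-phase child job."""
    label: str
    rc: int
    devices: list
    duration_s: float
    output: str


def tail_lines(text: str, n: int) -> list:
    """Return the last n lines of text; n <= 0 yields nothing."""
    if n <= 0:
        return []
    return text.splitlines()[-n:]


def format_summary(results, *, tail: int = 0, recap: bool = False) -> list:
    """Summarize failed children only; passing children are skipped."""
    failed = [r for r in results if r.rc != 0]
    if not failed:
        return []
    heading = "Resource phase failed recap" if recap else "Resource phase failed"
    lines = [f"*** {heading}: {len(failed)} child job(s) ***"]
    for r in failed:
        lines.append(f"  • {r.label}: rc={r.rc} devices={r.devices} "
                     f"duration={r.duration_s:.1f}s")
        if recap:
            # Compact mode: point at the group instead of repeating output.
            lines.append("    full output is in the Resource child group above")
        else:
            lines.extend("    " + line for line in tail_lines(r.output, tail))
    return lines
```

Calling `format_summary(results, tail=2)` right after the Resource jobs finish and `format_summary(results, recap=True)` after the L2 phase would mirror the two call sites the walkthrough describes.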

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 62.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The description matches the changeset and explains the Resource-failure visibility work and verification. |
| Linked Issues check | ✅ Passed | The changes implement the issue's required Resource-phase failure summary, annotations, and output tail recap. |
| Out of Scope Changes check | ✅ Passed | The diff appears focused on the stated Resource-failure reporting work and its tests, with no obvious unrelated additions. |
| Title check | ✅ Passed | The title is concise and accurately describes the main change: surfacing Resource phase child failures in CI. |


@vegetabledoww vegetabledoww changed the title test: surface Resource phase child failures test(ci): surface Resource phase child failures Jun 30, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces helper functions _github_actions_escape, _tail_text, and _emit_resource_failure_summary in conftest.py to print detailed summaries and GitHub Actions error annotations for failed Resource child jobs. It also integrates these summaries into the test dispatching workflow and adds comprehensive unit tests in tests/ut/py/test_resource_failure_summary.py. There are no review comments, so I have no feedback to provide.

@vegetabledoww vegetabledoww force-pushed the fix/issue-1205-resource-summary branch from a955e76 to 3d16fc7 Compare July 1, 2026 01:44
@vegetabledoww

vegetabledoww commented Jul 1, 2026

Contributor Author

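Error output before and after the change

Before the change, when a child pytest in the Resource phase failed, the detailed failure information appeared only inside the corresponding collapsed group. If the subsequent L2 phase went on to run and everything passed, the final CI tail usually showed only: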

--- L2 host_build_graph: PASS ...
--- L2 tensormap_and_ringbuffer: PASS ...

Process completed with exit code 1.

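From the final tail it is impossible to tell which Resource child failed; all it shows is that task-submit returned 1. Troubleshooting means manually expanding or searching the intermediate logs to find a line like:

standalone test_device_error_class_reaches_host_log[scope_deadlock] ... [FAIL rc=-11]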

After the change, a Resource child failure additionally prints an uncollapsed summary outside the collapsed group and emits a GitHub Actions annotation:

*** Resource phase failed: 1 child job(s) ***
::error title=Resource phase failed::standalone test_xxx rc=-11 devices=[4]

  • standalone test_xxx: rc=-11 devices=[4] duration=13.1s
    tail (last 80 line(s)):
    ----- begin Resource child tail: standalone test_xxx -----
    ...
    ----- end Resource child tail: standalone test_xxx -----

If the L2 phase runs afterwards, a compact recap is printed once more just before exit, so that the Resource failure is not pushed out of the job tail by the L2 PASS logs:

*** Resource phase failed recap: 1 child job(s) ***

  • standalone test_xxx: rc=-11 devices=[4] duration=13.1s
    full output is in the Resource child group above

Verified

python -m pytest tests/ut/py/test_resource_failure_summary.py -q

Result:

2 passed

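Coverage:

  • Only failed children are summarized; passing children are not included in the summary
  • label / rc / devices / duration are printed correctly
  • The GitHub Actions annotation is emitted correctly
  • % and newlines are escaped for workflow commands
  • The child output tail keeps only the last N lines
  • The compact recap does not repeat the long tail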

Also ran:

/data/xxx/simpler/.venv/bin/python -m ruff check conftest.py tests/ut/py/test_resource_failure_summary.py

Result:

All checks passed!

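Summary

  • Before: the CI tail shows only task-submit exit=1, so there is no direct way to tell which Resource child failed.
  • After: the CI tail directly shows the failed child's name, return code, devices, and duration, and the summary includes the tail of the failed output.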

@ChaoZheng109

Collaborator

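I think the output strategy here could be adjusted.

The full pytest output of each Resource child already sits inside ::group::...::endgroup::, and the whole point of collapsing is to keep the main CI log from getting too long. Printing the last 80 lines of tail to the outer level right after a child fails effectively expands the collapsed content a second time: the log is duplicated and the group loses its purpose.

A more valuable improvement might be to make the failed folded group itself easier to locate. The outer level now tells us which child failed, but the group title is mostly the label, for example standalone test_l3_l2_message_queue / l3 TestFoo. If one file contains several cases, or different files contain tests with the same name, the reviewer still has to match names to find which collapsed block holds the error.

Suggestions:

  • Include the full, or otherwise uniquely identifying, pytest nodeid in both the group header and the final summary, for example examples/.../test_x.py::test_case
  • Do not copy the tail to the outer level by default when a child finishes; keep a short notice, or simply rely on the [FAIL rc=...] status in the group header.
  • Print a single Resource failure recap at the very end of the dispatcher, listing nodeid / label / rc / devices / duration
  • If a root-cause hint is really needed, print a very short tail only in cases such as timeout / crash / no pytest summary, instead of expanding 80 lines by default.

That way the end of the CI log shows exactly which case failed, and the reviewer can search for the nodeid to jump straight to the corresponding folded group; the full traceback stays inside the group and is not expanded twice.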
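
One way the nodeid suggestion might look in practice is sketched below: the child's output is wrapped in a `::group::` / `::endgroup::` pair whose title carries the nodeid and the pass/fail status, so a reviewer can search the log for the nodeid. The function name, argument list, and exact title layout are assumptions for illustration, and the nodeid in the usage lines is a made-up example.

```python
def render_child_group(label, nodeid, rc, duration_s, devices, output):
    """Return the log lines for one child, folded under a searchable title."""
    status = "PASS" if rc == 0 else f"FAIL rc={rc}"
    header = (f"::group::{label} nodeid={nodeid} "
              f"[{status} {duration_s:.1f}s, devices={devices}]")
    # The status is part of the title, so the output has to be buffered
    # until the child has finished before the group can be printed.
    return [header, *output.splitlines(), "::endgroup::"]


if __name__ == "__main__":
    for line in render_child_group(
        "standalone test_x", "tests/test_x.py::test_case",  # hypothetical nodeid
        1, 10.0, [6], "captured pytest output\nE   AssertionError",
    ):
        print(line)
```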

Fixes hw-native-sys#1205 by making Resource-phase child pytest failures visible outside collapsed GitHub Actions groups.

The dispatcher now prints failed child labels, return codes, devices, durations, output tails, and a final recap after L2 so the top-level CI tail remains actionable. This does not change the underlying hardware/runtime failure behavior.
@vegetabledoww vegetabledoww force-pushed the fix/issue-1205-resource-summary branch from 3d16fc7 to 0d63347 Compare July 1, 2026 07:51
@vegetabledoww

Contributor Author

@ChaoZheng109 Adjusted along these lines and pushed a new commit, 0d633478.

Main changes:

  • The Job / JobResult for a Resource child now carries the pytest nodeid through
  • The folded group header now includes the nodeid, for example:
    ::group::standalone test_xxx (...) nodeid=tests/...::test_xxx [FAIL rc=1 10.0s, devices=[6]]
    
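  • Removed the default behavior of copying the last 80 lines of tail outside the group, to avoid duplicating the full pytest output inside the folded group.
  • The uncollapsed Resource failure summary now prints structured locating information: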
    - nodeid=...
      label=...
      rc=... devices=[...] duration=...s
      hint:
        ...
    
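  • Root-cause information is now a short hint that extracts only key lines such as RuntimeError, PTO2 runtime failed, PTO2 scheduler timeout, validate_runtime_impl, and AICore error.
  • The final recap stays compact: it lists only nodeid / label / rc / devices / duration and notes that the full output is in the Resource child group above.
  • The unit tests were updated accordingly to verify the nodeid, the annotation, the short hint, and that the compact recap does not repeat hidden output.

Verified: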

/data/jinzongquan/simpler/.venv/bin/python -m pytest tests/ut/py/test_resource_failure_summary.py -q
2 passed

/data/jinzongquan/simpler/.venv/bin/python -m ruff check conftest.py simpler_setup/parallel_scheduler.py tests/ut/py/test_resource_failure_summary.py
All checks passed!

In addition, I ran a real Resource failure probe on a2a3 onboard; an AICore while(true) triggers:

RuntimeError: run_prepared failed with code 507018
PTO2 runtime failed: orch_error_code=0 sched_error_code=100 runtime_status=-100
PTO2 scheduler timeout sub_class=S1:running-stalled

The summary now shows the failed child's nodeid, label, rc, devices, duration, and the short hint above outside the group; the full traceback remains in the folded group.
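
The short-hint idea can be sketched as a simple marker scan over the child output. The marker strings below are the ones listed in this comment; the function name, the substring matching, and the line cap are assumptions, and the actual extraction in the commit may differ.

```python
# Marker strings taken from the comment above.
HINT_MARKERS = (
    "RuntimeError",
    "PTO2 runtime failed",
    "PTO2 scheduler timeout",
    "validate_runtime_impl",
    "AICore error",
)


def extract_hints(output: str, max_lines: int = 5) -> list:
    """Collect up to max_lines distinct lines that contain a known marker."""
    hints = []
    for raw in output.splitlines():
        line = raw.strip()
        if line in hints:
            continue  # skip exact repeats, e.g. the same error logged twice
        if any(marker in line for marker in HINT_MARKERS):
            hints.append(line)
            if len(hints) >= max_lines:
                break
    return hints


if __name__ == "__main__":
    sample = "\n".join([
        "collected 1 item",
        "RuntimeError: run_prepared failed with code 507018",
        "PTO2 runtime failed: orch_error_code=0 sched_error_code=100 runtime_status=-100",
        "PTO2 scheduler timeout sub_class=S1:running-stalled",
    ])
    for hint in extract_hints(sample):
        print("  " + hint)
```

Capping the number of lines keeps the hint short even when a crashing child repeats the same class of error many times.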

Development

Successfully merging this pull request may close these issues.

[Code Health] CI resource failures only surface as task-submit exit=1

2 participants