Skip to content

fix(build): rebuild onboard a2a3 host_runtime on pto-isa commit change#1194

Open
doraemonmj wants to merge 1 commit into
hw-native-sys:mainfrom
doraemonmj:fix-1139-pto-isa-cache
Open

fix(build): rebuild onboard a2a3 host_runtime on pto-isa commit change#1194
doraemonmj wants to merge 1 commit into
hw-native-sys:mainfrom
doraemonmj:fix-1139-pto-isa-cache

Conversation

@doraemonmj

Copy link
Copy Markdown
Contributor

Problem (issue #1139)

pto-isa is a header-only dependency compiled into the a2a3 onboard
host_runtime.so (e.g. kSdmaMaxChan in sdma_workspace_manager.hpp). A
pto-isa bump that leaves the runtime repo HEAD untouched was invisible to
both build caches, so a plain reinstall served a stale binary and the SDMA
workspace query failed → ImportByKey -> 507899 cascade in allocate_domain.

Two layers each mis-judged "nothing changed":

  1. cmake persistent cache — the .git_commit stamp keyed only on the
    runtime repo HEAD, so a pto-isa-only change never cleared the per-target
    cache (git checkout also doesn't bump header mtimes, so cmake's own
    incremental check can't see it).
  2. ccache — with compiler_check=mtime and unchanged header mtimes after a
    checkout, the stale .o was served even after rm -rf build/.

pto_isa_build.json is rewritten every install regardless of whether the
binary recompiled, so the existing version guard couldn't catch this case and
the failure was silent.

Fix

  • cmake stamp now folds in the resolved pto-isa commit:
    runtime_sha:pto-isa=<isa_sha> (a2a3 onboard only). A bump mismatches the
    stored stamp → the per-target build cache is cleared → clean reconfigure.
  • ccache — the host_runtime compile command now carries
    SIMPLER_PTO_ISA_BUILD_COMMIT="<sha>" (unused in code; cache-bust only) so a
    pto-isa bump changes the command line → ccache miss → real recompile.

Both are required: clearing the cmake cache alone doesn't defeat ccache
underneath, and the define alone doesn't help if cmake still thinks it's
incrementally up to date. A plain reinstall now recompiles against the new
pto-isa with no manual ccache -C / rm -rf build/.

Scoped to a2a3 onboard — the only variant that embeds pto-isa headers today
(a5's SIMPLER_ENABLE_PTO_SDMA_WORKSPACE is currently off), matching
_requires_pto_isa_compat_validation().

Local test results

Build (real onboard a2a3 toolchain, pip install --no-build-isolation -e .):

  • build/lib/pto_isa_build.jsonpto_isa_commit: 32064ca0…
  • cmake stamp build/cache/a2a3/onboard/*/host/.git_commit
    be5f667f…:pto-isa=32064ca0… (new composite format)
  • ccache-bust define present in all 26 host TUs of
    compile_commands.json: SIMPLER_PTO_ISA_BUILD_COMMIT=\"32064ca0…\"

Onboard a2a3 (via task-submit, device 4):

tests/st/a2a3/tensormap_and_ringbuffer/spmd_basic/test_spmd_basic.py::TestSpmdBasic::test_run PASSED
1 passed in 14.39s   (exit=0)

Confirms the rebuilt host_runtime.so (carrying the new define across all TUs)
loads and executes on silicon — no link/load regression from the cache-bust
define.

pre-commit: markdownlint-cli2 / ruff check / ruff format / pyright all pass.

@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1b3a4c3b-b8a4-43b7-8327-4b30b433cf25

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description check ✅ Passed The description directly matches the cache invalidation fix and testing described in the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title clearly matches the main change: rebuilding the onboard a2a3 host_runtime when the pto-isa commit changes.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the build cache invalidation logic to fold the resolved pto-isa commit into the build stamp and the CMake compile definitions for a2a3 onboard builds. This ensures that updates to pto-isa headers automatically invalidate stale cache objects and force a recompile. The review feedback suggests improving the CMake environment variable check by explicitly comparing against an empty string, and refactoring the Python resolution logic to prioritize PTO_ISA_ROOT over the commit environment variable, raise an error on failure, and avoid speculative .strip() validation.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/a2a3/platform/onboard/host/CMakeLists.txt Outdated
Comment thread simpler_setup/runtime_builder.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@simpler_setup/runtime_builder.py`:
- Around line 188-192: The PTO-ISA fallback path in get_pto_isa_commit() returns
a resolved commit from PTO_ISA_ROOT, but the CMake ccache-busting logic in the
runtime builder only keys off SIMPLER_RUN_PTO_ISA_COMMIT. Update the environment
propagation in runtime_builder so the fallback commit also feeds the same CMake
env variable used by the cache stamp, and ensure the PTO-ISA resolution path and
the ccache-busting block both reference the same commit source.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 529cf819-5721-4bec-8a92-6c308700531d

📥 Commits

Reviewing files that changed from the base of the PR and between a8d2525 and d4d82a9.

📒 Files selected for processing (3)
  • docs/developer-guide.md
  • simpler_setup/runtime_builder.py
  • src/a2a3/platform/onboard/host/CMakeLists.txt

Comment thread simpler_setup/runtime_builder.py Outdated
@doraemonmj doraemonmj force-pushed the fix-1139-pto-isa-cache branch from d4d82a9 to e98795d Compare June 29, 2026 09:31
@ChaoZheng109

Copy link
Copy Markdown
Collaborator

Review

整体结论:修复正确、范围清晰、改动最小,核心机制和时序契约我都追过,成立。唯一实质缺口是缺回归测试,见下方 Should-fix。

机制确认

两层缓存串联,各有"什么都没变"的盲区,pto-isa-only 变更必须同时击穿两层,本 PR 做到了:

  • cmake 缓存:stamp 从 runtime HEAD 扩成复合 runtime_sha:pto-isa=<isa>(仅 a2a3 onboard),不同则 rmtree 整个 target 缓存 → 全新 reconfigure。
  • ccache:新增编译宏 SIMPLER_PTO_ISA_BUILD_COMMIT="<sha>" 扰动 ccache key,避免 compiler_check=mtime + 头文件 mtime 未变时复用旧 .o

锁步不变量已核对:stamp 与 cmake 宏必须同一个 commit。正常 pip install 路径由 ensure_pto_isa_root 提前设好 SIMPLER_RUN_PTO_ISA_COMMIT;懒加载路径下 _init_a2a3ensure("PTO_ISA_ROOT") 不设 commit 变量,_resolve_build_pto_isa_commitPTO_ISA_ROOT HEAD 推导并在主线程、_compile_target 线程池启动前写回 env,所以 cmake 读到的 $ENV{...} 始终与 stamp 一致。

边界情况也 OK:旧格式 .git_commit → 一次性干净重建;非 a2a3-onboard → 纯 sha,与旧格式相同不误失效,且该宏只在 a2a3 onboard 的 CMakeLists 里,无串扰;runtime HEAD 为空 → stamp 也为空,保留既有"取不到 → 干净重建"路径。

Should fix —— 补回归测试

修复核心逻辑(_build_cache_stamp / _resolve_build_pto_isa_commit)目前没有回归测试。这两个都是纯函数、可 monkeypatch,而 tests/ut/py/test_runtime_builder.py 里已有现成可扩展的范式(TestInvalidateCacheIfStaleTestRuntimeBuilderPtoIsaValidation)。按 .claude/rules/discipline.md §3,这是可在单测层复现/验证的缺陷,"构建破坏免测"例外不适用。建议补断言:

  1. a2a3/onboard 且设了 SIMPLER_RUN_PTO_ISA_COMMIT → 复合 stamp runtime_sha:pto-isa=<isa>;
  2. 其他 arch/variant → 纯 runtime sha;
  3. runtime HEAD 为 "" → stamp 也为 ""(保留 unavailable → 干净重建路径)。

PR 描述里的 "Local test results" 是一次性手动验证,不构成进仓的回归防线。

Consider

  • stamp 变化日志有误导:_invalidate_cache_if_stale 打印 stamp[:20],runtime sha 有 40 字符,pto-isa-only 变更时前 20 字符相同,日志会显示成 build stamp changed (<同> → <同>),恰好把变化藏起来。建议加宽截取或单独打印 pto-isa= 段——这正是下一个排查 Onboard a2a3 host_runtime.so not rebuilt after a pto-isa update (stale ccache/cmake cache) → SDMA query fails, allocate_domain ImportByKey 507899 #1139 的人第一眼要看的信息。
  • 宏门控用 arch/variant 而非 SIMPLER_ENABLE_PTO_SDMA_WORKSPACE:若将来 a2a3 onboard 关掉 SDMA workspace,pto-isa 变更仍会强制重编译一个已不内嵌 pto-isa 的目标。属安全方向(过度失效,绝不会用过期对象),与既有 _requires_pto_isa_compat_validation scoping 一致,可接受;建议加一行注释说明这是有意为之。

Verdict

建议合入,合入前补上单测(Should-fix)。两个 Consider 属打磨。

@doraemonmj doraemonmj changed the title fix(build): rebuild onboard a2a3 host_runtime on pto-isa commit change (#1139) fix(build): rebuild onboard a2a3 host_runtime on pto-isa commit change Jun 30, 2026
pto-isa is a header-only dependency baked into the a2a3 onboard
host_runtime.so at compile time (e.g. kSdmaMaxChan). A pto-isa bump that
leaves the runtime repo HEAD untouched was invisible to both build caches,
so a reinstall served a stale binary -> SDMA query failure -> 507899.

- cmake cache stamp now folds in the resolved pto-isa commit, so a bump
  mismatches the stored stamp and clears the per-target build cache.
- host_runtime compile command carries SIMPLER_PTO_ISA_BUILD_COMMIT so the
  ccache key changes on a bump (git checkout does not bump header mtimes,
  which otherwise yields a stale ccache hit under compiler_check=mtime).

Together a plain reinstall recompiles against the new pto-isa with no
manual 'ccache -C' / 'rm -rf build/'. Scoped to a2a3 onboard, the only
variant that embeds pto-isa headers today.
@doraemonmj doraemonmj force-pushed the fix-1139-pto-isa-cache branch from e98795d to f8bbf29 Compare June 30, 2026 02:41
@doraemonmj

Copy link
Copy Markdown
Contributor Author

close #1139 @ChaoZheng109

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants