Summary
After updating pto-isa (without changing the simpler/runtime repo HEAD), reinstalling the
runtime (pip install .../pypto/runtime) silently produces a broken onboard a2a3
libhost_runtime.so that is compiled against the old pto-isa headers. The build's
cache-invalidation logic does not account for pto-isa changes, and the global ccache then serves
the stale object on the reinstall, so neither a normal reinstall nor rm -rf build/ fixes it.
This is a build-system correctness bug: a pto-isa update is not reflected in a rebuilt
host_runtime.so unless the user knows to clear ccache.
Symptom (real failure observed)
allocate_domain (dynamic comm domain over HCCL) fails during IPC setup:
[SDMA] Created 40 STARS streams OK
[SDMA] aclrtSynchronizeStream (aicpu) failed
[comm rank 0] alloc_domain: ImportByKey(peer_dr=1 pid=...) -> 507899
[comm rank 1] alloc_domain: ImportByKey(peer_dr=0 pid=...) -> 507899
destroy_comm_stream: aclrtSynchronizeStream during stream teardown failed: 507018
RuntimeError: alloc_domain(allocation_id=0) failed on 2/2 chips ... comm_alloc_domain_windows failed with code -1
The ImportByKey -> 507899 is a secondary failure. The primary cause is the SDMA workspace
query failing (aclrtSynchronizeStream (aicpu) failed).
Root cause
allocate_domain → ensure_sdma_workspace() → pto::comm::sdma::SdmaWorkspaceManager
(pto-isa: include/pto/npu/comm/async/sdma/sdma_workspace_manager.hpp, header-only).
pto-isa commit e19897e7 ("modify async comm isa for 48 channel") changed kSdmaMaxChan 40 → 48.
Init() calls CreateStarsStreams(detail::kSdmaMaxChan), so the stream count is baked into
host_runtime.so at compile time (it is logged as Created N STARS streams).
A host_runtime.so linked from the stale pre-update (40-channel) object creates 40 STARS streams,
and its AICPU workspace-query path then fails at aclrtSynchronizeStream, cascading into the
ImportByKey -> 507899 above.
We confirmed this by diffing a passing vs failing install: only
a2a3/onboard/*/libhost_runtime.so differed; nm -D showed the SDMA symbols differ; the
40 vs 48 STARS streams log line pinned it to the compile-time kSdmaMaxChan.
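The install-tree comparison described above can be approximated with a small script. This is a minimal sketch, not the tooling actually used for the investigation: `file_digests` and `differing_files` are hypothetical helper names, and the two roots are assumed to be the passing and the failing install directories.

```python
import hashlib
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def differing_files(good_root: Path, bad_root: Path) -> list[str]:
    """Return relative paths that are missing on one side or differ in content."""
    good = file_digests(good_root)
    bad = file_digests(bad_root)
    return sorted(p for p in good.keys() | bad.keys() if good.get(p) != bad.get(p))
```

Calling `differing_files(passing_install, failing_install)` returns the relative paths that differ. In the case reported here, that would be expected to list only the onboard libhost_runtime.so files.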
Why current safeguards miss it
- cmake cache invalidation keys on the runtime repo HEAD only.
simpler_setup/runtime_builder.py:
get_binaries(): current_commit = _get_git_head(PROJECT_ROOT) (the runtime repo)
_compile_target(): _invalidate_cache_if_stale(cache_dir/target, current_commit)
The helper's own comment notes "git does not update file mtimes on checkout, so cmake's
incremental build can't detect stale objects." That reasoning is correct — but it is only
applied to the runtime repo's HEAD. A pto-isa-only change (runtime HEAD unchanged) does not
invalidate the cache, and pto-isa's headers come in via -I$PTO_ISA_ROOT/include whose mtimes
are likewise not bumped by a git checkout. So cmake's incremental build considers the object up to date.
- ccache also serves the stale object. The toolchain compiles via the ccache wrapper
(/usr/lib64/ccache/g++) into a global CCACHE_DIR. Even after rm -rf .../runtime/build, a
reinstall gets ccache hits and links the stale comm_hccl.o. (Observed with ccache 3.7.12,
compiler_check = mtime, default sloppiness.)
Net effect: a pto-isa update is invisible to the runtime rebuild.
Reproduction
- Build/install the runtime once with pto-isa at the old (40-channel) commit.
- Update pto-isa to a commit that changes kSdmaMaxChan (e.g. one that includes e19897e7).
- Reinstall the runtime: pip install --force-reinstall --no-deps --no-cache-dir .../pypto/runtime (runtime HEAD unchanged).
- Run any kernel that uses orch.allocate_domain (e.g. an EP-2 MoE) → fails with the symptom above.
  [SDMA] Created 40 STARS streams confirms the stale build.
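The stale-build check in the last step can be automated by reading the stream count out of the log. This is a minimal sketch: `stars_stream_count` and `is_stale_build` are hypothetical helpers, and the default of 48 assumes the checked-out pto-isa includes e19897e7; adjust it if kSdmaMaxChan changes again.

```python
import re

# Matches the log line quoted in the symptom, e.g. "[SDMA] Created 40 STARS streams OK".
STARS_LINE = re.compile(r"\[SDMA\] Created (\d+) STARS streams")


def stars_stream_count(log_text: str):
    """Return the stream count from the first matching log line, or None."""
    match = STARS_LINE.search(log_text)
    return int(match.group(1)) if match else None


def is_stale_build(log_text: str, expected_channels: int = 48) -> bool:
    """True when the log reports a stream count other than the expected one.

    expected_channels is an assumption: it must equal kSdmaMaxChan in the
    pto-isa checkout the runtime is supposed to be built against.
    """
    count = stars_stream_count(log_text)
    return count is not None and count != expected_channels
```

A log with no SDMA line is treated as inconclusive rather than stale, so the check does not flag runs that never reach workspace setup.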
Workaround (verified)
Force a real recompile against the updated pto-isa:
ccache -C # or: export CCACHE_DISABLE=1 for the build
rm -rf .../pypto/runtime/build
PTO_ISA_ROOT=/path/to/pto-isa \
pip install --force-reinstall --no-deps --no-cache-dir .../pypto/runtime
After this the binary creates 48 STARS streams and allocate_domain succeeds.
Suggested fix
Make the onboard-a2a3 host build's cache invalidation aware of the pto-isa commit, not just the
runtime repo HEAD. Options:
- Fold the resolved pto-isa HEAD (already recorded in pto_isa_build.json →
  write_pto_isa_build_metadata) into the .git_commit stamp that _invalidate_cache_if_stale
  compares, so a pto-isa change clears the per-target cmake cache.
- Additionally guard ccache: since git checkouts don't bump header mtimes and compiler_check
  defaults to mtime, consider setting CCACHE_COMPILERCHECK=content (or CCACHE_SLOPPINESS-free
  content hashing) for the runtime build, or mixing the pto-isa commit into the ccache key via
  CCACHE_EXTRAFILES / a -D define so a pto-isa bump forces a miss.
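The first option could look roughly like the following. This is a sketch under stated assumptions, not the actual code in simpler_setup/runtime_builder.py: `composite_stamp` and `invalidate_if_stale` are hypothetical stand-ins, the stamp format is made up for illustration, and only the .git_commit file name is taken from this report.

```python
import shutil
from pathlib import Path

STAMP_NAME = ".git_commit"  # stamp file name as mentioned in this report


def composite_stamp(runtime_head: str, pto_isa_head: str) -> str:
    """Combine both commits so a change in either one yields a different stamp.

    The "runtime=...;pto-isa=..." format is an illustrative assumption.
    """
    return f"runtime={runtime_head};pto-isa={pto_isa_head}"


def invalidate_if_stale(cache_dir: Path, stamp: str) -> bool:
    """Wipe cache_dir when its recorded stamp differs; return True if wiped."""
    stamp_file = cache_dir / STAMP_NAME
    recorded = stamp_file.read_text().strip() if stamp_file.is_file() else None
    if recorded == stamp:
        return False  # both commits unchanged: keep the incremental build
    if cache_dir.exists():
        shutil.rmtree(cache_dir)
    cache_dir.mkdir(parents=True)
    stamp_file.write_text(stamp)
    return True
```

With the runtime HEAD held at fcc33bcb, moving pto-isa to b9122ec5 changes the stamp and clears the per-target directory, whereas a stamp built from the runtime HEAD alone would compare equal. This covers only the cmake cache; the ccache side still needs one of the measures in the second option.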
Environment
- simpler/runtime HEAD: fcc33bcb
- pto-isa HEAD: b9122ec5 (contains e19897e7 "48 channel")
- ccache 3.7.12, compiler_check = mtime, default sloppiness
- platform: a2a3 onboard, CANN 9.0.0