Onboard a2a3 host_runtime.so not rebuilt after a pto-isa update (stale ccache/cmake cache) → SDMA query fails, allocate_domain ImportByKey 507899 #1139

@zhangqi-chen

Summary

After updating pto-isa (without changing the simpler/runtime repo HEAD), reinstalling the
runtime (pip install .../pypto/runtime) silently produces a broken onboard a2a3
libhost_runtime.so that is compiled against the old pto-isa headers. The build's
cache-invalidation logic does not account for pto-isa changes, and the global ccache then serves
the stale object on the reinstall, so neither a normal reinstall nor rm -rf build/ fixes it.

This is a build-system correctness bug: a pto-isa update is not reflected in a rebuilt
host_runtime.so unless the user knows to clear ccache.

Symptom (real failure observed)

allocate_domain (dynamic comm domain over HCCL) fails during IPC setup:

[SDMA] Created 40 STARS streams OK
[SDMA] aclrtSynchronizeStream (aicpu) failed
[comm rank 0] alloc_domain: ImportByKey(peer_dr=1 pid=...) -> 507899
[comm rank 1] alloc_domain: ImportByKey(peer_dr=0 pid=...) -> 507899
destroy_comm_stream: aclrtSynchronizeStream during stream teardown failed: 507018
RuntimeError: alloc_domain(allocation_id=0) failed on 2/2 chips ... comm_alloc_domain_windows failed with code -1

The ImportByKey -> 507899 is a secondary failure. The primary cause is the SDMA workspace
query failing (aclrtSynchronizeStream (aicpu) failed).

Root cause

allocate_domain → ensure_sdma_workspace() → pto::comm::sdma::SdmaWorkspaceManager
(pto-isa: include/pto/npu/comm/async/sdma/sdma_workspace_manager.hpp, header-only).

pto-isa commit e19897e7 ("modify async comm isa for 48 channel") changed kSdmaMaxChan 40 → 48.
Init() calls CreateStarsStreams(detail::kSdmaMaxChan), so the stream count is baked into
host_runtime.so at compile time (it is logged as Created N STARS streams).

A host_runtime.so linked from the pre-update (40-channel) object creates 40 STARS streams,
and its AICPU workspace-query path then fails at aclrtSynchronizeStream, cascading into the
ImportByKey -> 507899 above.

We confirmed this by diffing a passing vs failing install: only
a2a3/onboard/*/libhost_runtime.so differed; nm -D showed the SDMA symbols differ; the
40 vs 48 STARS streams log line pinned it to the compile-time kSdmaMaxChan.
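
The passing-vs-failing comparison can be sketched as a content-hash diff of two install trees. This is a minimal illustration, not the procedure actually used: the function names are hypothetical, and the two root directories are whatever paths the passing and failing installs live under.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root: Path) -> dict[str, str]:
    """Map each regular file's path (relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def differing_files(good_root: Path, bad_root: Path) -> list[str]:
    """List relative paths that are missing on one side or differ in content."""
    good, bad = tree_digests(good_root), tree_digests(bad_root)
    return sorted(k for k in good.keys() | bad.keys() if good.get(k) != bad.get(k))
```

Calling differing_files(passing_install, failing_install) on the two installs should, per the observation above, report only the a2a3/onboard/*/libhost_runtime.so entries.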

Why current safeguards miss it

  1. cmake cache invalidation keys on the runtime repo HEAD only.
    simpler_setup/runtime_builder.py:

    • get_binaries(): current_commit = _get_git_head(PROJECT_ROOT) (the runtime repo)
    • _compile_target(): _invalidate_cache_if_stale(cache_dir/target, current_commit)
      The helper's own comment notes "git does not update file mtimes on checkout, so cmake's
      incremental build can't detect stale objects." That reasoning is correct, but it is only
      applied to the runtime repo's HEAD. A pto-isa-only change (runtime HEAD unchanged) does not
      invalidate the cache, and pto-isa's headers come in via -I$PTO_ISA_ROOT/include whose mtimes
      are likewise not bumped by a git checkout. So cmake's incremental build treats the object as
      up to date.
  2. ccache also serves the stale object. The toolchain compiles via the ccache wrapper
    (/usr/lib64/ccache/g++) into a global CCACHE_DIR. Even after rm -rf .../runtime/build, a
    reinstall gets ccache hits and links the stale comm_hccl.o. (Observed with ccache 3.7.12,
    compiler_check = mtime, default sloppiness.)

Net effect: a pto-isa update is invisible to the runtime rebuild.

Reproduction

  1. Build/install the runtime once with pto-isa at the old (40-channel) commit.
  2. Update pto-isa to a commit that changes kSdmaMaxChan (e.g. include e19897e7).
  3. pip install --force-reinstall --no-deps --no-cache-dir .../pypto/runtime (runtime HEAD unchanged).
  4. Run any kernel that uses orch.allocate_domain (e.g. an EP-2 MoE) → fails with the symptom above.
    [SDMA] Created 40 STARS streams confirms the stale build.
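
A quick way to automate the check in step 4 is to pull the stream count out of the runtime log and compare it with the channel count of the checked-out pto-isa. The sketch below assumes the log line format shown in the symptom section. The helper names are hypothetical, and the expected value of 48 is only correct for pto-isa commits that include e19897e7.

```python
import re

# Matches the log line format quoted in this issue, e.g.
# "[SDMA] Created 40 STARS streams OK".
STARS_RE = re.compile(r"\[SDMA\] Created (\d+) STARS streams")


def stars_stream_count(log_text: str):
    """Return the stream count from the first matching log line, or None."""
    match = STARS_RE.search(log_text)
    return int(match.group(1)) if match else None


def looks_stale(log_text: str, expected_channels: int = 48) -> bool:
    """True when the log reports a stream count other than the expected one.

    expected_channels must match kSdmaMaxChan in the pto-isa checkout in use;
    48 is an assumption valid only after commit e19897e7.
    """
    count = stars_stream_count(log_text)
    return count is not None and count != expected_channels
```

A log with no [SDMA] line is treated as inconclusive, not stale, so the check does not misfire on runs that never touch allocate_domain.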

Workaround (verified)

Force a real recompile against the updated pto-isa:

ccache -C                       # or: export CCACHE_DISABLE=1 for the build
rm -rf .../pypto/runtime/build
PTO_ISA_ROOT=/path/to/pto-isa \
  pip install --force-reinstall --no-deps --no-cache-dir .../pypto/runtime

After this the binary creates 48 STARS streams and allocate_domain succeeds.

Suggested fix

Make the onboard-a2a3 host build's cache invalidation aware of the pto-isa commit, not just the
runtime repo HEAD. Options:

  • Fold the resolved pto-isa HEAD (already recorded in pto_isa_build.json by
    write_pto_isa_build_metadata) into the .git_commit stamp that
    _invalidate_cache_if_stale compares, so a pto-isa change clears the per-target cmake cache.
  • Additionally guard ccache: since git checkouts don't bump header mtimes and compiler_check
    defaults to mtime, consider setting CCACHE_COMPILERCHECK=content (or
    CCACHE_SLOPPINESS-free content hashing) for the runtime build, or mixing the pto-isa commit into
    the ccache key via CCACHE_EXTRAFILES / a -D define so a pto-isa bump forces a miss.
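
Both options can be sketched together. The code below is a hedged illustration and does not reproduce the actual runtime_builder.py code: all function names are hypothetical, and the stamp format and the assumption that the .git_commit stamp lives inside the per-target cache directory are guesses. CCACHE_EXTRAFILES is ccache's documented way to add extra files to the hash that identifies a build.

```python
import os
import shutil
import subprocess
from pathlib import Path


def git_head(repo: Path) -> str:
    """Return the commit hash of HEAD for the repository at `repo`."""
    return subprocess.check_output(
        ["git", "-C", str(repo), "rev-parse", "HEAD"], text=True
    ).strip()


def combined_stamp(runtime_head: str, pto_isa_head: str) -> str:
    """Build one stamp string that changes when either repository moves."""
    return f"runtime={runtime_head};pto-isa={pto_isa_head}"


def invalidate_if_stale(cache_dir: Path, stamp: str) -> bool:
    """Recreate cache_dir when its stored stamp differs from `stamp`.

    Returns True if the directory was (re)created, False if it was kept.
    """
    stamp_file = cache_dir / ".git_commit"
    if stamp_file.is_file() and stamp_file.read_text().strip() == stamp:
        return False
    if cache_dir.exists():
        shutil.rmtree(cache_dir)
    cache_dir.mkdir(parents=True)
    stamp_file.write_text(stamp + "\n")
    return True


def ccache_env(stamp_file: Path, base_env=None) -> dict:
    """Return an environment in which ccache also hashes the stamp file.

    Any existing CCACHE_EXTRAFILES entries are preserved; the separator is
    os.pathsep, matching ccache's platform-dependent list separator.
    """
    env = dict(os.environ if base_env is None else base_env)
    extra = [p for p in env.get("CCACHE_EXTRAFILES", "").split(os.pathsep) if p]
    extra.append(str(stamp_file))
    env["CCACHE_EXTRAFILES"] = os.pathsep.join(extra)
    return env
```

A build step would compute combined_stamp(git_head(PROJECT_ROOT), git_head(pto_isa_root)), call invalidate_if_stale on the target's cache directory, and pass ccache_env(...) to the compiler subprocess. Because the stamp file's content changes with the pto-isa commit, a pto-isa bump then forces both a cmake cache reset and a ccache miss.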

Environment

  • simpler/runtime HEAD: fcc33bcb
  • pto-isa HEAD: b9122ec5 (contains e19897e7 "48 channel")
  • ccache 3.7.12, compiler_check = mtime, default sloppiness
  • platform: a2a3 onboard, CANN 9.0.0
