Add: SDMA workspace overlay + async completion demo on a5 onboard #1179

Open
jvjhfhg wants to merge 1 commit into hw-native-sys:main from jvjhfhg:feat/comm-a5-sdma

@jvjhfhg jvjhfhg commented Jun 27, 2026

Collaborator

Layers the host-side SDMA workspace allocation on top of the comm backend from the previous commit. Until CANN exposes the missing SDMA primitives on a5, this overlay is the only piece of comm work that fails on real a5 silicon -- aclnnShmemSdmaStarsQuery raises an AICPU exception (InnerCode=0x715002a) that aborts the entire ACL thread context. Dropping this commit therefore unblocks the non-SDMA comm demos (async_notify_demo etc.) without touching the deferred-completion runtime, which is already SDMA-aware on the kernel side (dormant until a kernel registers an SDMA condition).

  • Wire SdmaWorkspaceManager into comm_alloc_windows under SIMPLER_ENABLE_PTO_SDMA_WORKSPACE: pre-allocates the per-rank workspace via aclnnShmemSdmaStarsQuery and overlays the result into CommContext.workSpace/.workSpaceSize. On CANN 8.5 the dlsym fails by design and we demote to "no workspace" rather than failing comm_init.
  • a5 onboard CMakeLists forces SIMPLER_ENABLE_PTO_SDMA_WORKSPACE ON, requires PTO_ISA_ROOT (with FATAL_ERROR message pointing to the workspace coupling), adds pto-isa headers to the include path, and links libnnopbase.
  • runtime_compiler._init_a5 enforces the same PTO_ISA_ROOT env contract as _init_a2a3.
  • Migrate sdma_async_completion_demo to examples/a5/ (kernels + orch byte-identical with the a2a3 version; test.py platform-renamed).
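
A minimal Python sketch of the demotion described in the first bullet, assuming a library handle that exposes symbols as attributes (as `ctypes.CDLL` handles do, raising `AttributeError` for a missing symbol). `resolve_optional_symbol` and `select_workspace_mode` are hypothetical names used only for illustration; the actual change lives in the C++ `SdmaWorkspaceManager` and is not reproduced here.

```python
def resolve_optional_symbol(library, name):
    """Return the named symbol if the library exports it, else None."""
    try:
        return getattr(library, name)
    except AttributeError:
        # Missing symbol: treated as an expected condition, not an error.
        return None


def select_workspace_mode(library, symbol="aclnnShmemSdmaStarsQuery"):
    """Pick a mode without failing initialization when the symbol is absent."""
    query = resolve_optional_symbol(library, symbol)
    if query is None:
        return "no-workspace"    # demote instead of aborting comm init
    return "sdma-workspace"      # symbol present: workspace can be pre-allocated


class FakeLibrary:
    """Stand-in for a loaded shared library (assumption for this sketch)."""


if __name__ == "__main__":
    old_runtime = FakeLibrary()
    new_runtime = FakeLibrary()
    new_runtime.aclnnShmemSdmaStarsQuery = lambda: 0
    print(select_workspace_mode(old_runtime))
    print(select_workspace_mode(new_runtime))
```

The point of the pattern is that the absence of the symbol selects a reduced mode rather than propagating an error. Note that it only covers a symbol that is missing; it does not help when the symbol resolves but the call itself raises, which is the a5 failure the description reports.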

@jvjhfhg jvjhfhg changed the title [WIP] Add: SDMA workspace overlay + async completion demo on a5 onboard Add: SDMA workspace overlay + async completion demo on a5 onboard Jun 27, 2026
@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c52ac82d-b10f-40a5-8c07-c412d2acb3a9

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Adds an A5 SDMA async-completion demo: new device kernels and orchestration code, host-side workspace support, updated A5 build/runtime checks for PTO_ISA_ROOT, and a smoke test that builds, runs, and validates the two-device flow.

Changes

SDMA async completion demo

| Layer | File(s) | Summary |
| --- | --- | --- |
| A5 build and runtime contract | simpler_setup/runtime_compiler.py, src/a5/platform/onboard/host/CMakeLists.txt | A5 host setup requires PTO_ISA_ROOT, adds its include path, and forces SIMPLER_ENABLE_PTO_SDMA_WORKSPACE into host_runtime compile and link settings. |
| HCCL workspace ownership | src/a5/platform/onboard/host/comm_hccl.cpp | The HCCL host handle conditionally includes and owns SdmaWorkspaceManager under SIMPLER_ENABLE_PTO_SDMA_WORKSPACE, and the nearby window-allocation comment is updated. |
| Producer and consumer kernels | examples/a5/tensormap_and_ringbuffer/sdma_async_completion_demo/kernels/aiv/kernel_sdma_tget_async.cpp, examples/a5/tensormap_and_ringbuffer/sdma_async_completion_demo/kernels/aiv/kernel_consumer.cpp | Adds the peer-window SDMA kernel_entry and the tile-processing consumer kernel entrypoint used by the demo. |
| Orchestration and smoke test | examples/a5/tensormap_and_ringbuffer/sdma_async_completion_demo/kernels/orchestration/sdma_async_completion_orch.cpp, examples/a5/tensormap_and_ringbuffer/sdma_async_completion_demo/test_sdma_async_completion_demo.py | Adds orchestration entrypoints with four-argument validation and producer/consumer task submission, plus the Python smoke test that builds the chip callable, runs on two devices, and checks out and result. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • hw-native-sys/simpler#823 — Adds the a5 comm port and deferred-completion/SDMA backend that this demo and workspace plumbing build on.
  • hw-native-sys/simpler#1166 — Updates SDMA/PTO-ISA header and include-path handling that matches the new workspace-gated host and kernel code here.

Poem

A bunny hopped through tiles so neat,
With SDMA drums beneath my feet.
One peer-window carrot, bright and plain,
Made peer_input + 1 shine again.
Thump-thump — the test passed cleanly! 🐇

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the SDMA workspace overlay and a5 async completion demo migration. |
| Description check | ✅ Passed | The description is on-topic and matches the comm backend overlay, a5 build updates, and demo migration. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces an SDMA deferred completion demo for the onboard a5 platform, adding consumer and async SDMA TGET kernels, orchestration logic, and a Python smoke test. It also enables the PTO-ISA async SDMA workspace pre-allocation by default, making PTO_ISA_ROOT a hard requirement for the a5 onboard host runtime. The review feedback suggests tightening argument validation in the orchestration code to prevent potential out-of-bounds access and robustly checking for empty PTO_ISA_ROOT environment variables in CMake.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +31 to +34
if (orch_args.tensor_count() + orch_args.scalar_count() != 4) {
    LOG_ERROR("sdma_async_completion_demo: expected 4 args");
    return;
}

medium

The current check only verifies that the sum of tensor_count() and scalar_count() is 4. If the orchestrator is invoked with an unexpected combination of arguments (e.g., 2 tensors and 2 scalars), accessing orch_args.tensor(2) or other indices will result in an out-of-bounds access and potentially crash. It is safer to explicitly validate that tensor_count() is exactly 3 and scalar_count() is exactly 1.

Suggested change
-if (orch_args.tensor_count() + orch_args.scalar_count() != 4) {
-    LOG_ERROR("sdma_async_completion_demo: expected 4 args");
-    return;
-}
+if (orch_args.tensor_count() != 3 || orch_args.scalar_count() != 1) {
+    LOG_ERROR("sdma_async_completion_demo: expected 3 tensors and 1 scalar");
+    return;
+}
References
  1. Ensure that index-based accessors perform bounds checks to prevent undefined behavior or out-of-bounds memory access.
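
To make the gap concrete, here is a small Python model of the two guards. It is only a sketch: `ArgCounts`, `passes_sum_check`, and `passes_split_check` are hypothetical names that stand in for the counts reported by `orch_args`, not part of the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArgCounts:
    """Hypothetical stand-in for orch_args.tensor_count() / scalar_count()."""
    tensors: int
    scalars: int


def passes_sum_check(args, total=4):
    # Original guard: only the combined count is validated.
    return args.tensors + args.scalars == total


def passes_split_check(args, tensors=3, scalars=1):
    # Suggested guard: the exact tensor/scalar split is validated.
    return args.tensors == tensors and args.scalars == scalars


if __name__ == "__main__":
    for counts in (ArgCounts(3, 1), ArgCounts(2, 2), ArgCounts(4, 0)):
        print(counts, passes_sum_check(counts), passes_split_check(counts))
```

All three combinations satisfy the sum check, but only the 3-tensor, 1-scalar case satisfies the split check. With 2 tensors, `tensor(2)` is out of range; with 4 tensors and no scalars, `scalar(0)` is.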

Comment on lines +44 to +48
if(NOT DEFINED ENV{PTO_ISA_ROOT})
    message(FATAL_ERROR
        "a5 onboard host_runtime requires PTO_ISA_ROOT "
        "(SIMPLER_ENABLE_PTO_SDMA_WORKSPACE is forced ON; needs pto-isa headers + CANN 9.0+)")
endif()

medium

Checking NOT DEFINED ENV{PTO_ISA_ROOT} does not protect against cases where the environment variable is defined but empty. If PTO_ISA_ROOT is empty, appending it to CMAKE_CUSTOM_INCLUDE_DIRS will result in /include, which is incorrect. Checking against an empty string using STREQUAL "" and quoting the variable expansion handles both undefined and empty cases robustly.

if("$ENV{PTO_ISA_ROOT}" STREQUAL "")
    message(FATAL_ERROR
        "a5 onboard host_runtime requires PTO_ISA_ROOT "
        "(SIMPLER_ENABLE_PTO_SDMA_WORKSPACE is forced ON; needs pto-isa headers + CANN 9.0+)")
endif()
References
  1. In CMake, when checking if a string variable is empty or non-empty, explicitly check against an empty string using STREQUAL "" and quote the variable expansion.
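
The same distinction can be modeled in Python, which may be useful for the matching check in `runtime_compiler._init_a5`. This is a sketch under assumptions: `is_defined` and `include_dir` are hypothetical helpers, and `/opt/pto-isa` is an illustrative path, not one taken from the repository.

```python
import os


def is_defined(env, name="PTO_ISA_ROOT"):
    """Model of the original guard: true even when the value is empty."""
    return name in env


def include_dir(env, name="PTO_ISA_ROOT"):
    """Model of the suggested guard: reject undefined and empty values alike."""
    root = env.get(name, "")
    if root == "":
        raise ValueError(f"{name} must be set to a non-empty path")
    return root + "/include"


if __name__ == "__main__":
    empty = {"PTO_ISA_ROOT": ""}
    # The definedness check accepts an empty value...
    print(is_defined(empty))
    # ...and a naive append then produces a root-level include path.
    print(repr(empty["PTO_ISA_ROOT"] + "/include"))
    # A non-empty value yields the expected directory.
    print(include_dir({"PTO_ISA_ROOT": "/opt/pto-isa"}))
    # The real process environment can be checked the same way.
    print(is_defined(os.environ))
```

Reading the variable with an empty-string default collapses the "undefined" and "empty" cases into one comparison, which is the same effect the quoted `STREQUAL ""` test has in CMake.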

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@examples/a5/tensormap_and_ringbuffer/sdma_async_completion_demo/kernels/orchestration/sdma_async_completion_orch.cpp`:
- Around line 31-39: The current guard in sdma_async_completion_orch.cpp only
checks the total argument count, so a bad tensor/scalar mix can still reach
tensor(2) and scalar(0). Update the validation around the orchestration argument
parsing to verify the exact split expected by Tensor accessors and the comm_ctx
scalar, not just orch_args.tensor_count() + orch_args.scalar_count(). Keep the
existing error handling in the same flow so invalid inputs are rejected before
from_tensor_arg() and reinterpret_cast<CommContext *> are used.

In
`@examples/a5/tensormap_and_ringbuffer/sdma_async_completion_demo/test_sdma_async_completion_demo.py`:
- Around line 27-37: The test is still importing and using the deprecated
task-interface alias ContinuousTensor instead of the renamed Tensor type, so
update the import list in the sdma_async_completion demo test to use Tensor and
replace any ContinuousTensor references in the test setup with Tensor. Keep the
rest of the task-interface imports unchanged and ensure the test only depends on
the current public symbol from simpler.task_interface.
- Around line 66-86: The child callables built in the loop around
CoreCallable.build are advertising the wrong ABI because both entries reuse the
parent’s 4-arg signature. Update each child metadata entry to match the actual
kernel interface for kernel_sdma_tget_async.cpp and kernel_consumer.cpp, so the
producer and consumer callables each expose their real argument
directions/counts instead of the parent signature.

In `@src/a5/platform/onboard/host/CMakeLists.txt`:
- Around line 44-49: The current PTO_ISA_ROOT check in the host CMake logic only
verifies that the environment variable is defined, so an empty or nonexistent
path still reaches the include path append and fails later. Update the
validation near the CMakeLists.txt guard around the `PTO_ISA_ROOT` handling to
also reject empty values and paths that do not exist before `list(APPEND
CMAKE_CUSTOM_INCLUDE_DIRS ...)`, and keep the fatal error in the same
`host_runtime` setup path so configuration fails immediately with a clear
message.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 120b4b64-2bb8-4a14-88f8-98090fe3ab51

📥 Commits

Reviewing files that changed from the base of the PR and between 47a411c and 781c4e2.

📒 Files selected for processing (7)
  • examples/a5/tensormap_and_ringbuffer/sdma_async_completion_demo/kernels/aiv/kernel_consumer.cpp
  • examples/a5/tensormap_and_ringbuffer/sdma_async_completion_demo/kernels/aiv/kernel_sdma_tget_async.cpp
  • examples/a5/tensormap_and_ringbuffer/sdma_async_completion_demo/kernels/orchestration/sdma_async_completion_orch.cpp
  • examples/a5/tensormap_and_ringbuffer/sdma_async_completion_demo/test_sdma_async_completion_demo.py
  • simpler_setup/runtime_compiler.py
  • src/a5/platform/onboard/host/CMakeLists.txt
  • src/a5/platform/onboard/host/comm_hccl.cpp

Comment on lines +31 to +39
if (orch_args.tensor_count() + orch_args.scalar_count() != 4) {
    LOG_ERROR("sdma_async_completion_demo: expected 4 args");
    return;
}

Tensor input = from_tensor_arg(orch_args.tensor(0));
Tensor out = from_tensor_arg(orch_args.tensor(1));
Tensor result = from_tensor_arg(orch_args.tensor(2));
auto *comm_ctx = reinterpret_cast<CommContext *>(static_cast<uintptr_t>(orch_args.scalar(0)));

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Validate the tensor/scalar split, not just total arg count.

A call with 4 args but the wrong mix can pass this guard and still fail when accessing tensor(2) or scalar(0).

Proposed fix
-    if (orch_args.tensor_count() + orch_args.scalar_count() != 4) {
-        LOG_ERROR("sdma_async_completion_demo: expected 4 args");
+    if (orch_args.tensor_count() != 3 || orch_args.scalar_count() != 1) {
+        LOG_ERROR("sdma_async_completion_demo: expected 3 tensor args and 1 scalar arg");
         return;
     }

Comment on lines +27 to +37
from simpler.task_interface import (
    ArgDirection,
    CallConfig,
    ChipCallable,
    CommBufferSpec,
    ContinuousTensor,
    CoreCallable,
    DataType,
    TaskArgs,
    TensorArgType,
)

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Use the renamed Tensor task-interface type.

This new test reintroduces ContinuousTensor; switch to the current hard-renamed symbol to avoid depending on an old alias. Based on learnings, renamed public Python types/classes are hard renames and old names such as ContinuousTensor should be removed across the repo.

Proposed fix
 from simpler.task_interface import (
     ArgDirection,
     CallConfig,
     ChipCallable,
     CommBufferSpec,
-    ContinuousTensor,
     CoreCallable,
     DataType,
     TaskArgs,
+    Tensor,
     TensorArgType,
 )
@@
-                        ContinuousTensor.make(
+                        Tensor.make(
                             data=domain.buffer_ptrs["input_window"],
                             shapes=(N,),
                             dtype=DataType.FLOAT32,

Also applies to: 162-168

Source: Learnings

Comment on lines +66 to +86
children = []
for func_id, rel in [
    (0, "kernels/aiv/kernel_sdma_tget_async.cpp"),
    (1, "kernels/aiv/kernel_consumer.cpp"),
]:
    kernel = kc.compile_incore(
        source_path=os.path.join(HERE, rel),
        core_type="aiv",
        pto_isa_root=pto_isa_root,
        extra_include_dirs=extra_includes,
    )
    if not platform.endswith("sim"):
        kernel = extract_text_section(kernel)
    children.append(
        (
            func_id,
            CoreCallable.build(
                signature=[ArgDirection.IN, ArgDirection.OUT, ArgDirection.OUT, ArgDirection.IN],
                binary=kernel,
            ),
        )

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Give each child callable its actual ABI signature.

The producer kernel is submitted with 3 args and the consumer with 2 args, but both child metadata entries advertise the parent’s 4-arg signature. If runtime validation uses this metadata, child dispatch can reject valid submissions or misdescribe the callable ABI.

Proposed fix
-    for func_id, rel in [
-        (0, "kernels/aiv/kernel_sdma_tget_async.cpp"),
-        (1, "kernels/aiv/kernel_consumer.cpp"),
+    for func_id, rel, signature in [
+        (0, "kernels/aiv/kernel_sdma_tget_async.cpp", [ArgDirection.IN, ArgDirection.OUT, ArgDirection.IN]),
+        (1, "kernels/aiv/kernel_consumer.cpp", [ArgDirection.IN, ArgDirection.OUT]),
     ]:
@@
                 func_id,
                 CoreCallable.build(
-                    signature=[ArgDirection.IN, ArgDirection.OUT, ArgDirection.OUT, ArgDirection.IN],
+                    signature=signature,
                     binary=kernel,
                 ),
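
A self-contained sketch of the per-child signature table, for checking the argument counts in isolation. The `ArgDirection` enum here is a local stand-in for `simpler.task_interface.ArgDirection`, and `matches_submission` is a hypothetical helper; whether the runtime actually validates submissions against this metadata is not confirmed by this thread.

```python
from enum import Enum


class ArgDirection(Enum):
    """Local stand-in for simpler.task_interface.ArgDirection (assumption)."""
    IN = "in"
    OUT = "out"


# Signature the original loop attached to both children (the parent's 4 args).
PARENT_SIGNATURE = [ArgDirection.IN, ArgDirection.OUT, ArgDirection.OUT, ArgDirection.IN]

# Per-child signatures from the proposed fix, keyed by func_id.
CHILD_SIGNATURES = {
    0: ("kernels/aiv/kernel_sdma_tget_async.cpp",
        [ArgDirection.IN, ArgDirection.OUT, ArgDirection.IN]),
    1: ("kernels/aiv/kernel_consumer.cpp",
        [ArgDirection.IN, ArgDirection.OUT]),
}


def matches_submission(signature, submitted_arg_count):
    """True when the advertised signature length equals the submitted arg count."""
    return len(signature) == submitted_arg_count


if __name__ == "__main__":
    # Producer is submitted with 3 args, consumer with 2 (per the review comment).
    for func_id, submitted in ((0, 3), (1, 2)):
        path, signature = CHILD_SIGNATURES[func_id]
        print(path,
              matches_submission(signature, submitted),
              matches_submission(PARENT_SIGNATURE, submitted))
```

Keeping the signature next to the source path in one table means adding a kernel forces its ABI to be stated, rather than silently inheriting the parent's.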

Comment on lines +44 to +49
if(NOT DEFINED ENV{PTO_ISA_ROOT})
    message(FATAL_ERROR
        "a5 onboard host_runtime requires PTO_ISA_ROOT "
        "(SIMPLER_ENABLE_PTO_SDMA_WORKSPACE is forced ON; needs pto-isa headers + CANN 9.0+)")
endif()
list(APPEND CMAKE_CUSTOM_INCLUDE_DIRS "$ENV{PTO_ISA_ROOT}/include")

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Fail fast when PTO_ISA_ROOT is empty or invalid.

DEFINED ENV{PTO_ISA_ROOT} still passes for an empty or nonexistent path, then line 49 appends a broken include directory and defers the failure to compilation.

Proposed fix
-if(NOT DEFINED ENV{PTO_ISA_ROOT})
+if(NOT DEFINED ENV{PTO_ISA_ROOT}
+   OR "$ENV{PTO_ISA_ROOT}" STREQUAL ""
+   OR NOT EXISTS "$ENV{PTO_ISA_ROOT}/include")
     message(FATAL_ERROR
-        "a5 onboard host_runtime requires PTO_ISA_ROOT "
+        "a5 onboard host_runtime requires PTO_ISA_ROOT to point to a valid pto-isa checkout "
         "(SIMPLER_ENABLE_PTO_SDMA_WORKSPACE is forced ON; needs pto-isa headers + CANN 9.0+)")
 endif()

@jvjhfhg jvjhfhg force-pushed the feat/comm-a5-sdma branch 2 times, most recently from 28fceb4 to c0829e5 Compare June 29, 2026 03:34
@jvjhfhg jvjhfhg force-pushed the feat/comm-a5-sdma branch from c0829e5 to 901b2e3 Compare June 30, 2026 01:36