Add: SDMA workspace overlay + async completion demo on a5 onboard#1179
Add: SDMA workspace overlay + async completion demo on a5 onboard#1179jvjhfhg wants to merge 1 commit into
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughAdds an A5 SDMA async-completion demo: new device kernels and orchestration code, host-side workspace support, updated A5 build/runtime checks for ChangesSDMA async completion demo
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces an SDMA deferred completion demo for the onboard a5 platform, adding consumer and async SDMA TGET kernels, orchestration logic, and a Python smoke test. It also enables the PTO-ISA async SDMA workspace pre-allocation by default, making PTO_ISA_ROOT a hard requirement for the a5 onboard host runtime. The review feedback suggests tightening argument validation in the orchestration code to prevent potential out-of-bounds access and robustly checking for empty PTO_ISA_ROOT environment variables in CMake.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if (orch_args.tensor_count() + orch_args.scalar_count() != 4) { | ||
| LOG_ERROR("sdma_async_completion_demo: expected 4 args"); | ||
| return; | ||
| } |
There was a problem hiding this comment.
The current check only verifies that the sum of tensor_count() and scalar_count() is 4. If the orchestrator is invoked with an unexpected combination of arguments (e.g., 2 tensors and 2 scalars), accessing orch_args.tensor(2) or other indices will result in an out-of-bounds access and potentially crash. It is safer to explicitly validate that tensor_count() is exactly 3 and scalar_count() is exactly 1.
| if (orch_args.tensor_count() + orch_args.scalar_count() != 4) { | |
| LOG_ERROR("sdma_async_completion_demo: expected 4 args"); | |
| return; | |
| } | |
| if (orch_args.tensor_count() != 3 || orch_args.scalar_count() != 1) { | |
| LOG_ERROR("sdma_async_completion_demo: expected 3 tensors and 1 scalar"); | |
| return; | |
| } |
References
- Ensure that index-based accessors perform bounds checks to prevent undefined behavior or out-of-bounds memory access.
| if(NOT DEFINED ENV{PTO_ISA_ROOT}) | ||
| message(FATAL_ERROR | ||
| "a5 onboard host_runtime requires PTO_ISA_ROOT " | ||
| "(SIMPLER_ENABLE_PTO_SDMA_WORKSPACE is forced ON; needs pto-isa headers + CANN 9.0+)") | ||
| endif() |
There was a problem hiding this comment.
Checking NOT DEFINED ENV{PTO_ISA_ROOT} does not protect against cases where the environment variable is defined but empty. If PTO_ISA_ROOT is empty, appending it to CMAKE_CUSTOM_INCLUDE_DIRS will result in /include, which is incorrect. Checking against an empty string using STREQUAL "" and quoting the variable expansion handles both undefined and empty cases robustly.
if("$ENV{PTO_ISA_ROOT}" STREQUAL "")
message(FATAL_ERROR
"a5 onboard host_runtime requires PTO_ISA_ROOT "
"(SIMPLER_ENABLE_PTO_SDMA_WORKSPACE is forced ON; needs pto-isa headers + CANN 9.0+)")
endif()
References
- In CMake, when checking if a string variable is empty or non-empty, explicitly check against an empty string using STREQUAL "" and quote the variable expansion.
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@examples/a5/tensormap_and_ringbuffer/sdma_async_completion_demo/kernels/orchestration/sdma_async_completion_orch.cpp`:
- Around line 31-39: The current guard in sdma_async_completion_orch.cpp only
checks the total argument count, so a bad tensor/scalar mix can still reach
tensor(2) and scalar(0). Update the validation around the orchestration argument
parsing to verify the exact split expected by Tensor accessors and the comm_ctx
scalar, not just orch_args.tensor_count() + orch_args.scalar_count(). Keep the
existing error handling in the same flow so invalid inputs are rejected before
from_tensor_arg() and reinterpret_cast<CommContext *> are used.
In
`@examples/a5/tensormap_and_ringbuffer/sdma_async_completion_demo/test_sdma_async_completion_demo.py`:
- Around line 27-37: The test is still importing and using the deprecated
task-interface alias ContinuousTensor instead of the renamed Tensor type, so
update the import list in the sdma_async_completion demo test to use Tensor and
replace any ContinuousTensor references in the test setup with Tensor. Keep the
rest of the task-interface imports unchanged and ensure the test only depends on
the current public symbol from simpler.task_interface.
- Around line 66-86: The child callables built in the loop around
CoreCallable.build are advertising the wrong ABI because both entries reuse the
parent’s 4-arg signature. Update each child metadata entry to match the actual
kernel interface for kernel_sdma_tget_async.cpp and kernel_consumer.cpp, so the
producer and consumer callables each expose their real argument
directions/counts instead of the parent signature.
In `@src/a5/platform/onboard/host/CMakeLists.txt`:
- Around line 44-49: The current PTO_ISA_ROOT check in the host CMake logic only
verifies that the environment variable is defined, so an empty or nonexistent
path still reaches the include path append and fails later. Update the
validation near the CMakeLists.txt guard around the `PTO_ISA_ROOT` handling to
also reject empty values and paths that do not exist before `list(APPEND
CMAKE_CUSTOM_INCLUDE_DIRS ...)`, and keep the fatal error in the same
`host_runtime` setup path so configuration fails immediately with a clear
message.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 120b4b64-2bb8-4a14-88f8-98090fe3ab51
📒 Files selected for processing (7)
examples/a5/tensormap_and_ringbuffer/sdma_async_completion_demo/kernels/aiv/kernel_consumer.cppexamples/a5/tensormap_and_ringbuffer/sdma_async_completion_demo/kernels/aiv/kernel_sdma_tget_async.cppexamples/a5/tensormap_and_ringbuffer/sdma_async_completion_demo/kernels/orchestration/sdma_async_completion_orch.cppexamples/a5/tensormap_and_ringbuffer/sdma_async_completion_demo/test_sdma_async_completion_demo.pysimpler_setup/runtime_compiler.pysrc/a5/platform/onboard/host/CMakeLists.txtsrc/a5/platform/onboard/host/comm_hccl.cpp
| if (orch_args.tensor_count() + orch_args.scalar_count() != 4) { | ||
| LOG_ERROR("sdma_async_completion_demo: expected 4 args"); | ||
| return; | ||
| } | ||
|
|
||
| Tensor input = from_tensor_arg(orch_args.tensor(0)); | ||
| Tensor out = from_tensor_arg(orch_args.tensor(1)); | ||
| Tensor result = from_tensor_arg(orch_args.tensor(2)); | ||
| auto *comm_ctx = reinterpret_cast<CommContext *>(static_cast<uintptr_t>(orch_args.scalar(0))); |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Validate the tensor/scalar split, not just total arg count.
A call with 4 args but the wrong mix can pass this guard and still fail when accessing tensor(2) or scalar(0).
Proposed fix
- if (orch_args.tensor_count() + orch_args.scalar_count() != 4) {
- LOG_ERROR("sdma_async_completion_demo: expected 4 args");
+ if (orch_args.tensor_count() != 3 || orch_args.scalar_count() != 1) {
+ LOG_ERROR("sdma_async_completion_demo: expected 3 tensor args and 1 scalar arg");
return;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (orch_args.tensor_count() + orch_args.scalar_count() != 4) { | |
| LOG_ERROR("sdma_async_completion_demo: expected 4 args"); | |
| return; | |
| } | |
| Tensor input = from_tensor_arg(orch_args.tensor(0)); | |
| Tensor out = from_tensor_arg(orch_args.tensor(1)); | |
| Tensor result = from_tensor_arg(orch_args.tensor(2)); | |
| auto *comm_ctx = reinterpret_cast<CommContext *>(static_cast<uintptr_t>(orch_args.scalar(0))); | |
| if (orch_args.tensor_count() != 3 || orch_args.scalar_count() != 1) { | |
| LOG_ERROR("sdma_async_completion_demo: expected 3 tensor args and 1 scalar arg"); | |
| return; | |
| } | |
| Tensor input = from_tensor_arg(orch_args.tensor(0)); | |
| Tensor out = from_tensor_arg(orch_args.tensor(1)); | |
| Tensor result = from_tensor_arg(orch_args.tensor(2)); | |
| auto *comm_ctx = reinterpret_cast<CommContext *>(static_cast<uintptr_t>(orch_args.scalar(0))); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@examples/a5/tensormap_and_ringbuffer/sdma_async_completion_demo/kernels/orchestration/sdma_async_completion_orch.cpp`
around lines 31 - 39, The current guard in sdma_async_completion_orch.cpp only
checks the total argument count, so a bad tensor/scalar mix can still reach
tensor(2) and scalar(0). Update the validation around the orchestration argument
parsing to verify the exact split expected by Tensor accessors and the comm_ctx
scalar, not just orch_args.tensor_count() + orch_args.scalar_count(). Keep the
existing error handling in the same flow so invalid inputs are rejected before
from_tensor_arg() and reinterpret_cast<CommContext *> are used.
| from simpler.task_interface import ( | ||
| ArgDirection, | ||
| CallConfig, | ||
| ChipCallable, | ||
| CommBufferSpec, | ||
| ContinuousTensor, | ||
| CoreCallable, | ||
| DataType, | ||
| TaskArgs, | ||
| TensorArgType, | ||
| ) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Use the renamed Tensor task-interface type.
This new test reintroduces ContinuousTensor; switch to the current hard-renamed symbol to avoid depending on an old alias. Based on learnings, renamed public Python types/classes are hard renames and old names such as ContinuousTensor should be removed across the repo.
Proposed fix
from simpler.task_interface import (
ArgDirection,
CallConfig,
ChipCallable,
CommBufferSpec,
- ContinuousTensor,
CoreCallable,
DataType,
TaskArgs,
+ Tensor,
TensorArgType,
)
@@
- ContinuousTensor.make(
+ Tensor.make(
data=domain.buffer_ptrs["input_window"],
shapes=(N,),
dtype=DataType.FLOAT32,Also applies to: 162-168
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@examples/a5/tensormap_and_ringbuffer/sdma_async_completion_demo/test_sdma_async_completion_demo.py`
around lines 27 - 37, The test is still importing and using the deprecated
task-interface alias ContinuousTensor instead of the renamed Tensor type, so
update the import list in the sdma_async_completion demo test to use Tensor and
replace any ContinuousTensor references in the test setup with Tensor. Keep the
rest of the task-interface imports unchanged and ensure the test only depends on
the current public symbol from simpler.task_interface.
Source: Learnings
| children = [] | ||
| for func_id, rel in [ | ||
| (0, "kernels/aiv/kernel_sdma_tget_async.cpp"), | ||
| (1, "kernels/aiv/kernel_consumer.cpp"), | ||
| ]: | ||
| kernel = kc.compile_incore( | ||
| source_path=os.path.join(HERE, rel), | ||
| core_type="aiv", | ||
| pto_isa_root=pto_isa_root, | ||
| extra_include_dirs=extra_includes, | ||
| ) | ||
| if not platform.endswith("sim"): | ||
| kernel = extract_text_section(kernel) | ||
| children.append( | ||
| ( | ||
| func_id, | ||
| CoreCallable.build( | ||
| signature=[ArgDirection.IN, ArgDirection.OUT, ArgDirection.OUT, ArgDirection.IN], | ||
| binary=kernel, | ||
| ), | ||
| ) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Give each child callable its actual ABI signature.
The producer kernel is submitted with 3 args and the consumer with 2 args, but both child metadata entries advertise the parent’s 4-arg signature. If runtime validation uses this metadata, child dispatch can reject valid submissions or misdescribe the callable ABI.
Proposed fix
- for func_id, rel in [
- (0, "kernels/aiv/kernel_sdma_tget_async.cpp"),
- (1, "kernels/aiv/kernel_consumer.cpp"),
+ for func_id, rel, signature in [
+ (0, "kernels/aiv/kernel_sdma_tget_async.cpp", [ArgDirection.IN, ArgDirection.OUT, ArgDirection.IN]),
+ (1, "kernels/aiv/kernel_consumer.cpp", [ArgDirection.IN, ArgDirection.OUT]),
]:
@@
func_id,
CoreCallable.build(
- signature=[ArgDirection.IN, ArgDirection.OUT, ArgDirection.OUT, ArgDirection.IN],
+ signature=signature,
binary=kernel,
),📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| children = [] | |
| for func_id, rel in [ | |
| (0, "kernels/aiv/kernel_sdma_tget_async.cpp"), | |
| (1, "kernels/aiv/kernel_consumer.cpp"), | |
| ]: | |
| kernel = kc.compile_incore( | |
| source_path=os.path.join(HERE, rel), | |
| core_type="aiv", | |
| pto_isa_root=pto_isa_root, | |
| extra_include_dirs=extra_includes, | |
| ) | |
| if not platform.endswith("sim"): | |
| kernel = extract_text_section(kernel) | |
| children.append( | |
| ( | |
| func_id, | |
| CoreCallable.build( | |
| signature=[ArgDirection.IN, ArgDirection.OUT, ArgDirection.OUT, ArgDirection.IN], | |
| binary=kernel, | |
| ), | |
| ) | |
| children = [] | |
| for func_id, rel, signature in [ | |
| (0, "kernels/aiv/kernel_sdma_tget_async.cpp", [ArgDirection.IN, ArgDirection.OUT, ArgDirection.IN]), | |
| (1, "kernels/aiv/kernel_consumer.cpp", [ArgDirection.IN, ArgDirection.OUT]), | |
| ]: | |
| kernel = kc.compile_incore( | |
| source_path=os.path.join(HERE, rel), | |
| core_type="aiv", | |
| pto_isa_root=pto_isa_root, | |
| extra_include_dirs=extra_includes, | |
| ) | |
| if not platform.endswith("sim"): | |
| kernel = extract_text_section(kernel) | |
| children.append( | |
| ( | |
| func_id, | |
| CoreCallable.build( | |
| signature=signature, | |
| binary=kernel, | |
| ), |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@examples/a5/tensormap_and_ringbuffer/sdma_async_completion_demo/test_sdma_async_completion_demo.py`
around lines 66 - 86, The child callables built in the loop around
CoreCallable.build are advertising the wrong ABI because both entries reuse the
parent’s 4-arg signature. Update each child metadata entry to match the actual
kernel interface for kernel_sdma_tget_async.cpp and kernel_consumer.cpp, so the
producer and consumer callables each expose their real argument
directions/counts instead of the parent signature.
| if(NOT DEFINED ENV{PTO_ISA_ROOT}) | ||
| message(FATAL_ERROR | ||
| "a5 onboard host_runtime requires PTO_ISA_ROOT " | ||
| "(SIMPLER_ENABLE_PTO_SDMA_WORKSPACE is forced ON; needs pto-isa headers + CANN 9.0+)") | ||
| endif() | ||
| list(APPEND CMAKE_CUSTOM_INCLUDE_DIRS "$ENV{PTO_ISA_ROOT}/include") |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Fail fast when PTO_ISA_ROOT is empty or invalid.
DEFINED ENV{PTO_ISA_ROOT} still passes for an empty or nonexistent path, then line 49 appends a broken include directory and defers the failure to compilation.
Proposed fix
-if(NOT DEFINED ENV{PTO_ISA_ROOT})
+if(NOT DEFINED ENV{PTO_ISA_ROOT}
+ OR "$ENV{PTO_ISA_ROOT}" STREQUAL ""
+ OR NOT EXISTS "$ENV{PTO_ISA_ROOT}/include")
message(FATAL_ERROR
- "a5 onboard host_runtime requires PTO_ISA_ROOT "
+ "a5 onboard host_runtime requires PTO_ISA_ROOT to point to a valid pto-isa checkout "
"(SIMPLER_ENABLE_PTO_SDMA_WORKSPACE is forced ON; needs pto-isa headers + CANN 9.0+)")
endif()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if(NOT DEFINED ENV{PTO_ISA_ROOT}) | |
| message(FATAL_ERROR | |
| "a5 onboard host_runtime requires PTO_ISA_ROOT " | |
| "(SIMPLER_ENABLE_PTO_SDMA_WORKSPACE is forced ON; needs pto-isa headers + CANN 9.0+)") | |
| endif() | |
| list(APPEND CMAKE_CUSTOM_INCLUDE_DIRS "$ENV{PTO_ISA_ROOT}/include") | |
| if(NOT DEFINED ENV{PTO_ISA_ROOT} | |
| OR "$ENV{PTO_ISA_ROOT}" STREQUAL "" | |
| OR NOT EXISTS "$ENV{PTO_ISA_ROOT}/include") | |
| message(FATAL_ERROR | |
| "a5 onboard host_runtime requires PTO_ISA_ROOT to point to a valid pto-isa checkout " | |
| "(SIMPLER_ENABLE_PTO_SDMA_WORKSPACE is forced ON; needs pto-isa headers + CANN 9.0+)") | |
| endif() | |
| list(APPEND CMAKE_CUSTOM_INCLUDE_DIRS "$ENV{PTO_ISA_ROOT}/include") |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/a5/platform/onboard/host/CMakeLists.txt` around lines 44 - 49, The
current PTO_ISA_ROOT check in the host CMake logic only verifies that the
environment variable is defined, so an empty or nonexistent path still reaches
the include path append and fails later. Update the validation near the
CMakeLists.txt guard around the `PTO_ISA_ROOT` handling to also reject empty
values and paths that do not exist before `list(APPEND CMAKE_CUSTOM_INCLUDE_DIRS
...)`, and keep the fatal error in the same `host_runtime` setup path so
configuration fails immediately with a clear message.
28fceb4 to
c0829e5
Compare
Layers the host-side SDMA workspace allocation on top of the comm backend from the previous commit. Until CANN exposes the missing SDMA primitives on a5, this overlay is the only piece of comm work that fails on real a5 silicon -- aclnnShmemSdmaStarsQuery raises an AICPU exception (InnerCode=0x715002a) that aborts the entire ACL thread context. Dropping this commit therefore unblocks the non-SDMA comm demos (async_notify_demo etc.) without touching the deferred-completion runtime, which is already SDMA-aware on the kernel side (dormant until a kernel registers an SDMA condition). - Wire SdmaWorkspaceManager into comm_alloc_windows under SIMPLER_ENABLE_PTO_SDMA_WORKSPACE: pre-allocates the per-rank workspace via aclnnShmemSdmaStarsQuery and overlays the result into CommContext.workSpace/.workSpaceSize. On CANN 8.5 the dlsym fails by design and we demote to "no workspace" rather than failing comm_init. - a5 onboard CMakeLists forces SIMPLER_ENABLE_PTO_SDMA_WORKSPACE ON, requires PTO_ISA_ROOT (with FATAL_ERROR message pointing to the workspace coupling), adds pto-isa headers to the include path, and links libnnopbase. - runtime_compiler._init_a5 enforces the same PTO_ISA_ROOT env contract as _init_a2a3. - Migrate sdma_async_completion_demo to examples/a5/ (kernels + orch byte-identical with the a2a3 version; test.py platform- renamed).
c0829e5 to
901b2e3
Compare
Layers the host-side SDMA workspace allocation on top of the comm backend from the previous commit. Until CANN exposes the missing SDMA primitives on a5, this overlay is the only piece of comm work that fails on real a5 silicon -- aclnnShmemSdmaStarsQuery raises an AICPU exception (InnerCode=0x715002a) that aborts the entire ACL thread context. Dropping this commit therefore unblocks the non-SDMA comm demos (async_notify_demo etc.) without touching the deferred-completion runtime, which is already SDMA-aware on the kernel side (dormant until a kernel registers an SDMA condition).