Add: host_build_graph runtime (host-orchestration variant of tensormap)#1185
Add: host_build_graph runtime (host-orchestration variant of tensormap)#1185ChaoWao wants to merge 1 commit into
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughReplaces the ChangesPTO2 Host-Orchestration Runtime
Sequence DiagramsequenceDiagram
participant Host as Host (runtime_maker)
participant HostOrch as Orchestration SO (host)
participant AICPU as AICPU executor
participant Sched as SchedulerContext
participant AICore as AICore executor
Host->>Host: stage tensors to device, svm_register
Host->>Host: reserve/init/wire PTO2Runtime arena
Host->>HostOrch: run_host_orchestration (host SM mirror)
HostOrch->>Host: submit_task calls → wire_task (inline fanout)
Host->>Host: relocate host→device pointers in arena image
Host->>AICPU: H2D upload prebuilt arena
AICPU->>Sched: boot thread: reattach arena, wire SM, on_orchestration_done
Sched->>Sched: resolve_and_dispatch loop (ready queue, dispatch, completion)
Sched->>AICore: publish PTO2DispatchPayload (scheduler→AICore register write)
AICore->>AICore: execute_task(payload), write FIN register
AICore->>Sched: AICoreCompletionMailbox (deferred/normal done)
Sched->>AICPU: completed_tasks == total_tasks → shutdown
AICPU->>Host: runtime_destroy, framework_bind_runtime(nullptr)
Host->>Host: validate_runtime_impl: D2H copy-back, svm_unregister
Estimated code review effort🎯 5 (Critical) | ⏱️ ~180 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request transitions the host_build_graph runtime to a host-orchestrated model (PTO2) where the task graph is built on the host and executed scheduler-only on the device. Feedback from the review highlights several critical issues: redundant MMIO reads in the speculative early-dispatch loop, unsafe memcpy operations due to layout incompatibility in PTO2TensorMapEntry, and potential infinite hangs on the host because the fallback get_sys_cnt_aicpu() returns zero. Additionally, the review points out undefined behavior from unaligned pointer access, redundant loops over PTO2_MAX_RING_DEPTH since ring is a single member, and opportunities to optimize hash table lookups using bitwise AND instead of modulo.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
44aae09 to
698f596
Compare
There was a problem hiding this comment.
Actionable comments posted: 6
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
🟠 Major comments (27)
src/a2a3/runtime/host_build_graph/runtime/scheduler/scheduler_context.h-144-147 (1)
144-147: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winMake the orchestration handoff atomic
volatiledoes not synchronize the scheduler/orchestrator handoff, so the flag and thetotal_tasks_publication still race across threads. Switchorchestrator_done_tostd::atomic<bool>and update the reads/writes inscheduler_cold_path.cppandscheduler_dispatch.cppto use acquire/release operations.🔒 Proposed header-side fix
- volatile bool orchestrator_done_{false}; + std::atomic<bool> orchestrator_done_{false};🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/scheduler/scheduler_context.h` around lines 144 - 147, The scheduler/orchestrator handoff in SchedulerContext is still using a non-synchronizing flag, so the publication of total_tasks_ can race with scheduler reads. Change orchestrator_done_ in SchedulerContext to std::atomic<bool>, then update every write/read of that flag in scheduler_cold_path.cpp and scheduler_dispatch.cpp to use release on the orchestrator side and acquire on the scheduler side. Keep completed_ consistent with the same atomic handoff pattern so the initialization state is visible across threads through the existing scheduler/orchestrator paths.src/a2a3/runtime/host_build_graph/runtime/pto_ring_buffer.h-285-287 (1)
285-287: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winGuard
local_task_id_before incrementing.local_task_id_++is signedint32_toverflow atINT32_MAX, which is undefined behavior before the ring-mask logic can recover. Returning-1here is safe becauseprepare_task()already treats negativetask_idas allocation failure and stops before publishing the descriptor.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/pto_ring_buffer.h` around lines 285 - 287, Guard local_task_id_ before incrementing in commit_task() to avoid signed int32_t overflow at INT32_MAX; check for the terminal value first and return -1 on failure so prepare_task() can stop cleanly before publishing. Update the task_id assignment logic in PTO ring buffer commit flow to use a safe pre-increment check rather than local_task_id_++ overflow, while preserving the existing negative-id failure path.src/a2a3/runtime/host_build_graph/runtime/runtime.h-337-343 (1)
337-343: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy liftSplit the host-only tensor ledger out of
RuntimeRuntimeis copied to device as a rawsizeof(Runtime)blob, sostd::vector<TensorPair> tensor_pairs_becomes part of the device image and ties the ABI to libstdc++ layout. Move the ledger to a host-only wrapper or copy only a trivially-copyable device prefix.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/runtime.h` around lines 337 - 343, The Runtime layout currently includes the host-only tensor_pairs_ ledger, but Runtime is copied as a raw sizeof(Runtime) blob, so the std::vector control block leaks into the device image and locks the ABI to libstdc++ details. Refactor the Runtime ownership model so the device-copied portion is a trivially copyable prefix and move tensor_pairs_ into a host-only wrapper or separate host-side structure; update any runtime_maker.cpp and validate_runtime_impl usage to read the ledger from that host-only path instead of from Runtime.src/a2a3/runtime/host_build_graph/runtime/pto_types.h-234-248 (1)
234-248: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winReset
launch_specwith the rest of the Arg state.
reset()is the reuse path, but it leaveslaunch_specuntouched. After one submission sets a non-default launch shape, the next recycledArginherits it silently.Suggested fix
void clear() { Base::clear(); `#if` PTO2_PROFILING dump_arg_selection_.clear(); `#endif` explicit_deps_ = nullptr; explicit_dep_count_ = 0; allow_early_resolve_ = false; + launch_spec = PTO2LaunchSpec{}; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/pto_types.h` around lines 234 - 248, The Arg reuse path in reset() leaves launch_spec stale because it is not cleared alongside the other state. Update the clear()/reset() flow in the Arg type so launch_spec is reset together with explicit_deps_, explicit_dep_count_, allow_early_resolve_, has_error, and error_msg, ensuring recycled Arg instances do not inherit a previous submission’s launch shape.src/a2a3/runtime/host_build_graph/orchestration/pto_orchestration_api.h-96-107 (1)
96-107: 🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy liftKeep
PTO2Runtimeon a single shared definition. This header redefines the runtime object with only the first two fields, whilepto_runtime2.hdefines the full layout. That relies on an implicit cross-component ABI contract rather than a C++ type guarantee; centralize the shared prefix or add an accessor forpending_scope_modeinstead.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/orchestration/pto_orchestration_api.h` around lines 96 - 107, The runtime type is being redefined here with only a partial layout, which duplicates the full `PTO2Runtime` definition from the main runtime header and depends on an implicit ABI contract. Update the orchestration API to use a single shared definition or a shared prefix type, and expose `pending_scope_mode` through an accessor or inline helper instead of directly embedding a second `PTO2Runtime` struct definition. Use the `PTO2Runtime` and `pending_scope_mode` symbols to locate the overlap and remove the duplicated layout.src/a2a3/runtime/host_build_graph/runtime/pto_runtime2_types.h-25-26 (1)
25-26: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winUse a unique include guard for this runtime header.
This file’s own comment notes mixed inclusion with the
tensormap_and_ringbuffercopy; keeping the copied guard makes the first included runtime header win and can hide the other one.Proposed fix
-#ifndef SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_RUNTIME2_TYPES_H_ -#define SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_RUNTIME2_TYPES_H_ +#ifndef SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_PTO_RUNTIME2_TYPES_H_ +#define SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_PTO_RUNTIME2_TYPES_H_ @@ -#endif // SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_RUNTIME2_TYPES_H_ +#endif // SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_PTO_RUNTIME2_TYPES_H_Also applies to: 544-544
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/pto_runtime2_types.h` around lines 25 - 26, The include guard in pto_runtime2_types.h is still using the copied tensormap_and_ringbuffer symbol, which can cause one runtime header to suppress the other. Update the guard macro in this header to a unique name tied to pto_runtime2_types, and make sure the matching endif comment stays consistent so this header can be included independently.src/a2a3/runtime/host_build_graph/runtime/aicore_completion_mailbox.h-12-13 (1)
12-13: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winRename this include guard to avoid cross-runtime suppression.
The guard names
tensormap_and_ringbuffereven though the file is underhost_build_graph. If both runtimes’ mailbox headers are visible in one TU, this can make one copy suppress the other.Proposed fix
-#ifndef SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_AICORE_COMPLETION_MAILBOX_H_ -#define SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_AICORE_COMPLETION_MAILBOX_H_ +#ifndef SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_AICORE_COMPLETION_MAILBOX_H_ +#define SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_AICORE_COMPLETION_MAILBOX_H_ @@ -#endif // SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_AICORE_COMPLETION_MAILBOX_H_ +#endif // SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_AICORE_COMPLETION_MAILBOX_H_Also applies to: 189-189
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/aicore_completion_mailbox.h` around lines 12 - 13, The include guard in aicore_completion_mailbox.h is using the wrong runtime-specific prefix, which can cause cross-runtime header suppression when both mailbox headers are included together. Update the guard in the aicore_completion_mailbox header to a unique host_build_graph-specific name and make sure the matching closing directive uses the same symbol so it no longer collides with the tensormap_and_ringbuffer runtime.src/a2a3/runtime/host_build_graph/runtime/pto_runtime2_types.h-70-96 (1)
70-96: 🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy liftDon’t cap scope tasks with the default window size when the actual window is runtime-configurable.
Lines 70-72 say the task window is passed at runtime, but Line 96 fixes
PTO2_SCOPE_TASKS_CAPto the default16384 * depth. A valid larger runtime window can fatal early inscope_tasks_pushbefore the ring is actually full. Size this buffer/cap from the resolved runtimetask_window_size.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/pto_runtime2_types.h` around lines 70 - 96, The scope task cap is incorrectly tied to the compile-time default `PTO2_TASK_WINDOW_SIZE`, which can cause `scope_tasks_push` to overflow early when the runtime-provided window is larger. Update the `PTO2_SCOPE_TASKS_CAP` sizing logic in `pto_runtime2_types.h` so it derives from the resolved runtime `task_window_size` used by `runtime_create_from_sm()` and the ring depth, and make sure any code using the cap or `scope_tasks_push` reads the runtime value instead of the default macro.src/a2a3/runtime/host_build_graph/common/pto_runtime_status.h-18-53 (1)
18-53: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winRename the copied include guard to the host-build-graph path.
The guard still uses the
tensormap_and_ringbuffernamespace. In mixed-runtime TUs, that can cause this status helper to be skipped or replaced by the sibling runtime’s header.Proposed fix
-#ifndef SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_COMMON_PTO_RUNTIME_STATUS_H_ -#define SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_COMMON_PTO_RUNTIME_STATUS_H_ +#ifndef SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_COMMON_PTO_RUNTIME_STATUS_H_ +#define SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_COMMON_PTO_RUNTIME_STATUS_H_ @@ -#endif // SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_COMMON_PTO_RUNTIME_STATUS_H_ +#endif // SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_COMMON_PTO_RUNTIME_STATUS_H_🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/common/pto_runtime_status.h` around lines 18 - 53, The include guard in pto_runtime_status.h still matches the older tensormap_and_ringbuffer path, which can collide with the sibling runtime header and cause the wrong definition to be skipped in mixed builds. Update the guard macro to a unique host_build_graph-specific name and keep it consistent with the header’s current location; verify the change around the top-level include guard symbols in pto_runtime_status.h so this helper is not accidentally shared with the other runtime variant.src/a2a3/runtime/host_build_graph/runtime/pto_constants.h-12-19 (1)
12-19: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winUse a
host_build_graph-specific include guard. This file still uses thetensormap_and_ringbufferguard prefix, so builds that expose both runtime trees can skip onepto_constants.hcopy entirely. Rename the guard and trailing#endifcomment to matchsrc/a2a3/runtime/host_build_graph/runtime/pto_constants.h.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/pto_constants.h` around lines 12 - 19, The include guard in pto_constants.h still uses the tensormap_and_ringbuffer prefix, which can cause collisions when both runtime trees are included. Update the guard macro name and matching trailing `#endif` comment in pto_constants.h to a host_build_graph-specific symbol that uniquely reflects src/a2a3/runtime/host_build_graph/runtime/pto_constants.h.src/a2a3/runtime/host_build_graph/runtime/pto_submit_types.h-152-153 (1)
152-153: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winPreserve the positive
block_numinvariant.
set_block_num(0)or a negative value violates the documented[0, block_num)dispatch contract and can produce an invalid launch shape. Please reject invalid values at this boundary or before populatingPTO2LaunchSpec.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/pto_submit_types.h` around lines 152 - 153, The block_num setter currently accepts zero and negative values, which breaks the documented dispatch contract and can lead to invalid launch specs. Update `set_block_num` in `PTO2LaunchSpec` (or add validation before it is populated) so only positive values are accepted, and reject invalid inputs at this boundary with an appropriate failure path to preserve the `block_num` invariant.src/a2a3/runtime/host_build_graph/runtime/aicore_completion_mailbox_types.h-12-13 (1)
12-13: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winUse a host_build_graph-specific header guard.
This new host_build_graph header uses a
TENSORMAP_AND_RINGBUFFERguard, creating a collision risk with unchanged runtime headers.Proposed fix
-#ifndef SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_AICORE_COMPLETION_MAILBOX_TYPES_H_ -#define SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_AICORE_COMPLETION_MAILBOX_TYPES_H_ +#ifndef SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_AICORE_COMPLETION_MAILBOX_TYPES_H_ +#define SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_AICORE_COMPLETION_MAILBOX_TYPES_H_ ... -#endif // SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_AICORE_COMPLETION_MAILBOX_TYPES_H_ +#endif // SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_AICORE_COMPLETION_MAILBOX_TYPES_H_Also applies to: 67-67
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/aicore_completion_mailbox_types.h` around lines 12 - 13, The header guard in aicore_completion_mailbox_types.h is using the wrong namespace and can collide with unrelated runtime headers. Update the include guard in the host_build_graph-specific header to a unique host_build_graph symbol, and make sure the matching closing guard uses the same new name; locate and fix the guard constants at the top and bottom of the file.src/a2a3/runtime/host_build_graph/runtime/pto_completion_token.h-12-13 (1)
12-13: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winUse a host_build_graph-specific header guard.
The guard macro still names
TENSORMAP_AND_RINGBUFFER; if a TU includes same-named headers from both runtimes, one header can suppress the other.Proposed fix
-#ifndef SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_COMPLETION_TOKEN_H_ -#define SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_COMPLETION_TOKEN_H_ +#ifndef SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_PTO_COMPLETION_TOKEN_H_ +#define SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_PTO_COMPLETION_TOKEN_H_ ... -#endif // SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_COMPLETION_TOKEN_H_ +#endif // SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_PTO_COMPLETION_TOKEN_H_Also applies to: 45-45
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/pto_completion_token.h` around lines 12 - 13, The header guard in pto_completion_token.h still uses the shared TENSORMAP_AND_RINGBUFFER macro, which can clash with other runtimes. Update the guard to a host_build_graph-specific symbol in the include guard at the top of the file and the matching closing directive so the pto_completion_token header is uniquely protected. Keep the change consistent with the existing pto_completion_token.h identifiers.src/a2a3/runtime/host_build_graph/runtime/backend/sdma/sdma_completion_kernel.h-87-142 (1)
87-142: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winPropagate and flush async-registration failures.
register_pto_async_event()records errors for invalid engine / event-check failure, butsend_request_entry()still returnstrue. Also, theBuildAsyncSessionfailure path writescompletion_error_codeand returns beforedefer_flush(), so the scheduler may miss the error.Proposed fix direction
-inline __aicore__ void register_sdma_event_record(AsyncCtx &ctx, volatile __gm__ void *record_addr) { +inline __aicore__ bool register_sdma_event_record(AsyncCtx &ctx, volatile __gm__ void *record_addr) { CompletionToken token{ reinterpret_cast<uint64_t>(record_addr), 0, COMPLETION_ENGINE_SDMA, COMPLETION_TYPE_SDMA_EVENT_RECORD, 0 }; - (void)register_completion_condition(ctx, token); + return register_completion_condition(ctx, token); } template <typename PtoAsyncEvent, typename PtoAsyncSession> -inline __aicore__ void +inline __aicore__ bool register_pto_async_event(AsyncCtx &ctx, const PtoAsyncEvent &event, const PtoAsyncSession &session) { if (ctx.task_token.is_invalid() || ctx.completion_count == nullptr || ctx.completion_entries == nullptr) { (void)event.Wait(session); - return; + return true; } if (event.handle == 0) { - return; + return true; } const uint32_t engine = static_cast<uint32_t>(event.engine); if (engine != static_cast<uint32_t>(::pto::comm::DmaEngine::SDMA)) { defer_error(ctx, PTO2_ERROR_ASYNC_COMPLETION_INVALID); - return; + return false; } ... )) { defer_error(ctx, PTO2_ERROR_ASYNC_COMPLETION_INVALID); - return; + return false; } + bool ok = true; for (uint32_t queue_id = 0; queue_id < queue_num; ++queue_id) { - register_sdma_event_record(ctx, ::pto::comm::sdma::detail::GetEventRecord(recv_workspace, queue_id)); + ok = register_sdma_event_record(ctx, ::pto::comm::sdma::detail::GetEventRecord(recv_workspace, queue_id)) && ok; } + return ok; } ... if (!pto::comm::BuildAsyncSession(desc.scratch, desc.workspace, session, desc.sync_id)) { pto2::detail::defer_error(ctx, PTO2_ERROR_ASYNC_COMPLETION_INVALID); + pto2::detail::defer_flush(ctx); return false; } ... - pto2::detail::register_pto_async_event(ctx, event, session); + bool ok = pto2::detail::register_pto_async_event(ctx, event, session); pto2::detail::defer_flush(ctx); - return true; + return ok; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/backend/sdma/sdma_completion_kernel.h` around lines 87 - 142, `send_request_entry()` currently ignores failures from `register_pto_async_event()` and can return true even after `defer_error()` is set, while the `BuildAsyncSession` failure path exits before any flush. Update `send_request_entry()` and `register_pto_async_event()` so async-registration failures are propagated back to the caller and `defer_flush(ctx)` is always reached whenever an error is recorded, using the existing `AsyncCtx`, `register_pto_async_event`, `BuildAsyncSession`, and `defer_error/defer_flush` flow to ensure the scheduler observes the failure.src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp-706-748 (1)
706-748: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winAdd failure cleanup for staged tensor allocations and SVM maps.
After earlier tensors are pushed into
tensor_pairs_, any laterreturn -1in binding leaks their device allocations and SVM registrations. Use a local cleanup guard or shared cleanup helper until ownership is fully transferred tovalidate_runtime_impl.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp` around lines 706 - 748, The tensor staging path in runtime_maker.cpp leaks previously allocated device buffers and SVM registrations when a later tensor fails after entries have already been added to runtime->tensor_pairs_. Update the allocation/staging flow around device_malloc, device_memset/copy_to_device, push_back on tensor_pairs_, and svm_register_via_runner so any failure performs cleanup of all tensors staged so far before returning -1. Use a local cleanup guard or a shared helper that frees the accumulated dev_ptrs and unregisters their SVM mappings until ownership is fully transferred to validate_runtime_impl.src/a2a3/runtime/host_build_graph/host/dep_gen_replay.cpp-607-615 (1)
607-615: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winDo not emit deps.json from partial overflow chains.
Continuing with a partial dep list silently drops explicit dependencies while still writing a “source of truth” deps.json. Treat malformed chains as fatal and return before serialization.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/host/dep_gen_replay.cpp` around lines 607 - 615, The dep_gen replay path in dep_gen_replay.cpp currently logs a warning and continues when chain_complete is false, which still allows a partial dep list to be serialized into deps.json. Update the replay logic around the chain_complete check so malformed overflow chains are treated as fatal: stop processing and return before deps_data and dc are used for serialization, rather than falling through with full_deps_buf. Keep the change localized to the replay/serialization flow in the host dep generation code so deps.json is only emitted from complete dependency chains.src/a2a3/runtime/host_build_graph/aicpu/aicpu_executor.cpp-131-145 (1)
131-145: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winReject single-thread execution when no scheduler thread will run.
The
0 → 1fixup makessched_thread_num_ == 0; withorch_to_sched_ == false, the boot thread skips dispatch and submitted tasks never execute. Require at least one scheduler-capable thread or force a supported transition mode.Also applies to: 229-247
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/aicpu/aicpu_executor.cpp` around lines 131 - 145, The single-thread fallback in aicpu_executor.cpp allows a configuration where sched_thread_num_ becomes 0 and, with orch_to_sched_ disabled, nothing ever drives task dispatch. Update the initialization logic in the executor’s setup path (the block that sets aicpu_thread_num_, sched_thread_num_, and orch_to_sched_) to reject this unsupported combination by validating that at least one scheduler-capable thread will run, or by forcing a supported orch_to_sched_ transition mode before calling sched_ctx_.init. Apply the same guard in the other referenced initialization block as well so both code paths enforce the same constraint.src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp-277-280 (1)
277-280: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winDo not treat a failed PTO2 header read as runtime success.
Returning
0here makesvalidate_runtime_implproceed with copy-back even when runtime status is unavailable. Return an error/status sentinel so cleanup still runs but result copy-back is skipped.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp` around lines 277 - 280, The PTO2 header read failure in runtime_maker::validate_runtime_impl is currently returned as success, which lets copy-back continue even when runtime status is unavailable. Update the error path after runtime->host_api.copy_from_device(...) so it returns a non-success sentinel instead of 0, while preserving cleanup execution. Make sure the change is applied in the validate_runtime_impl flow that handles host_header/PTO2SharedMemoryHeader reads, so failed header reads skip result copy-back.src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp-318-323 (1)
318-323: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick winTighten temp-SO permissions and unlink on every failure path.
0755exposes callable SO contents to other users, and thedlopen/dlsymerror paths return without unlinking the tempfile.Proposed fix
- if (fchmod(fd, 0755) != 0) { + if (fchmod(fd, 0700) != 0) { close(fd); unlink(tmpl); return false; @@ void *handle = dlopen(so_path.c_str(), RTLD_NOW | RTLD_LOCAL); if (handle == nullptr) { LOG_ERROR("host-orch: dlopen failed: %s", dlerror()); + unlink(so_path.c_str()); return -1; } @@ LOG_ERROR("host-orch: dlsym('%s') failed: %s", orch_func_name, dlerror()); dlclose(handle); + unlink(so_path.c_str()); return -1; } @@ LOG_ERROR("host-orch: orch .so does not export framework_bind_runtime: %s", dlerror()); dlclose(handle); + unlink(so_path.c_str()); return -1; }Also applies to: 606-627
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp` around lines 318 - 323, The temp shared object handling in the runtime maker is too permissive and leaves files behind on some failures. Update the temp-file permission setup around the write/load flow in runtime_maker’s SO creation logic to use stricter permissions than 0755, and make sure every early return from the write_all_bytes, dlopen, and dlsym failure paths closes the fd and unlinks tmpl before returning false. Keep the cleanup consistent in the same code path so no tempfile survives any error.Source: Linters/SAST tools
src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp-979-987 (1)
979-987: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winValidate packed-output copy size and target tensor identity.
graph_out_sizecomes from the device header and is copied intopair.host_ptrwithout checkinggraph_out_size <= pair.size. Also, “first output tensor” is approximated as the first tensor that needs copy-back, which can be an INOUT tensor. Store argument direction inTensorPairor otherwise identify the real first OUT before applyinggraph_output_ptr.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp` around lines 979 - 987, The packed-output path in runtime_maker.cpp is using graph_out_size directly and treating the first copy-back tensor as the packed-output target, which can write past pair.host_ptr or attach graph_output_ptr to an INOUT tensor. Update the copy logic in the host graph runtime to validate that graph_out_size never exceeds pair.size before calling runtime->host_api.copy_from_device, and use TensorPair metadata to distinguish OUT from INOUT so the packed buffer is only applied to the real first output tensor. Keep the fix localized around the first_output_tensor handling and the copy_from_device call in the output-copy loop.src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp-748-748 (1)
748-748: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winCheck the SVM registration result before host orchestration dereferences tensor addresses.
The host orchestrator later dereferences
buffer.addrdirectly. If registration fails, or returns a non-identity host VA, this path can crash or read unmapped memory.Proposed fix
- svm_register_via_runner(dev_ptr, size); + void *mapped_ptr = svm_register_via_runner(dev_ptr, size); + if (size != 0 && mapped_ptr == nullptr) { + LOG_ERROR("Failed to SVM-register tensor %d", i); + runtime->host_api.device_free(dev_ptr); + return -1; + } + if (mapped_ptr != nullptr && mapped_ptr != dev_ptr) { + LOG_ERROR("SVM mapping for tensor %d is non-identity: dev=%p host=%p", i, dev_ptr, mapped_ptr); + svm_unregister_via_runner(dev_ptr); + runtime->host_api.device_free(dev_ptr); + return -1; + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp` at line 748, The SVM path in runtime_maker’s host orchestration should not assume registration always succeeds or preserves the original address. Capture and validate the result of svm_register_via_runner before any later use of buffer.addr, and fail fast or skip orchestration if registration does not succeed. Also verify the returned host virtual address matches the expected identity mapping before dereferencing tensor addresses, so the later buffer.addr access in the host orchestration path cannot touch unmapped memory.src/a2a3/runtime/host_build_graph/runtime/orchestrator_core/pto_orchestrator.cpp-454-456 (1)
454-456: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winStop preparation if
scope_tasks_push()latches fatal.When
scope_tasks_push()overflows, it records a fatal error butprepare_task()still returnstrue, allowing the task to be registered and wired without a scope-release entry.Suggested fix
// fanin_count is set by scheduler during wiring scope_tasks_push(orch, out->slot_state); - return true; + return !orch->fatal; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/orchestrator_core/pto_orchestrator.cpp` around lines 454 - 456, scope_tasks_push() can latch a fatal overflow error, but prepare_task() still returns true and continues wiring the task. Update prepare_task() in pto_orchestrator.cpp to check the result/state after calling scope_tasks_push(orch, out->slot_state) and return false immediately if a fatal was latched, so the task is not registered without a matching scope-release entry.src/a2a3/runtime/host_build_graph/orchestration/common.cpp-56-63 (1)
56-63: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick winDo not pass an unescaped shared-object path through the shell.
dl_info.dli_fnameis interpolated into apopen()command. A library path containing shell metacharacters can alter the diagnostic command; avoid the shell or quote the path.Suggested fix
+static std::string shell_quote(const char *s) { + std::string out = "'"; + for (; *s; ++s) { + out += (*s == '\'') ? "'\\''" : std::string(1, *s); + } + out += "'"; + return out; +} + static std::string addr_to_line(const char *executable, void *addr, std::string *inline_chain = nullptr) { - char cmd[512]; - snprintf(cmd, sizeof(cmd), "addr2line -e %s -f -C -p -i %p 2>/dev/null", executable, addr); + std::string cmd = "addr2line -e " + shell_quote(executable) + " -f -C -p -i "; + char addr_buf[32]; + snprintf(addr_buf, sizeof(addr_buf), "%p", addr); + cmd += addr_buf; + cmd += " 2>/dev/null"; @@ - FILE *pipe = popen(cmd, "r"); + FILE *pipe = popen(cmd.c_str(), "r");🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/orchestration/common.cpp` around lines 56 - 63, The addr_to_line helper currently interpolates the executable/shared-object path directly into the popen() shell command, which can be unsafe when dl_info.dli_fname contains shell metacharacters. Update addr_to_line to avoid invoking the shell with an unescaped path—either switch to a non-shell process execution approach or properly quote/escape the executable argument before building the addr2line command, and keep the existing inline_chain/raw_output behavior intact.src/a2a3/runtime/host_build_graph/runtime/orchestrator_core/pto_orchestrator.cpp-751-760 (1)
751-760: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winReject unsupported explicit dependency rings before resolving by local id.
dep_ring_idis captured but the lookup always usesorch->sm_header->ring; withPTO2_MAX_RING_DEPTH == 1, a task id from another ring can alias a live ring-0 local id and create the wrong edge.Suggested fix
} uint8_t dep_ring_id = dep_task_id.ring(); + if (dep_ring_id >= PTO2_MAX_RING_DEPTH) { + orch->report_fatal( + PTO2_ERROR_INVALID_ARGS, __FUNCTION__, "Explicit dependency references unsupported ring %u", + static_cast<unsigned>(dep_ring_id) + ); + return result; + } PTO2SharedMemoryRingHeader &dep_ring = orch->sm_header->ring;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/orchestrator_core/pto_orchestrator.cpp` around lines 751 - 760, The dependency resolution in the fan-in path is using orch->sm_header->ring even though dep_task_id carries a dep_ring_id, which can incorrectly alias tasks from unsupported rings. In the PTO2 orchestrator logic, reject any dependency whose ring does not match the supported ring-0 setup before resolving by local task id, and only proceed to get_slot_by_task_id and append_fanin_or_fail for valid same-ring dependencies.src/a2a3/runtime/host_build_graph/runtime/scheduler/scheduler_cold_path.cpp-697-735 (1)
697-735: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winValidate scheduler thread and core topology counts before indexing.
active_sched_threads_is used as a divisor and as an index bound, andaiv_worker_ids_[2 * ci + 1]assumes two AIVs per AIC. Add explicit checks foractive_sched_threads_ > 0 && <= MAX_AICPU_THREADSandaiv_count_ >= 2 * aic_count_before these loops.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/scheduler/scheduler_cold_path.cpp` around lines 697 - 735, The round-robin core assignment in scheduler_cold_path.cpp assumes valid topology counts but does not validate them before division or indexing. In the scheduling setup around active_sched_threads_, clusters_per_thread, and the aic_worker_ids_/aiv_worker_ids_ lookups, add explicit guards that active_sched_threads_ is greater than 0 and no larger than MAX_AICPU_THREADS, and that aiv_count_ is at least 2 * aic_count_ before entering the loops. Keep the existing logic in place after these checks so the indexing of aiv_worker_ids_[2 * ci] and aiv_worker_ids_[2 * ci + 1] is always safe.src/a2a3/runtime/host_build_graph/runtime/scheduler/scheduler_cold_path.cpp-959-969 (1)
959-969: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winInitialize AIV sub-block IDs for
active_sched_threads_, notsched_thread_num_.When
sched_thread_num_ <= 0,assign_cores_to_threads()uses all AICPU threads, but this loop runs zero iterations and leaves AIV1sub_block_idat the memset default0instead of1.Proposed fix
- for (int32_t t = 0; t < sched_thread_num_; t++) { + for (int32_t t = 0; t < active_sched_threads_; t++) {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/scheduler/scheduler_cold_path.cpp` around lines 959 - 969, The AIV sub-block initialization loop is using sched_thread_num_ instead of the actual active thread count, so it can skip setup when sched_thread_num_ <= 0 and leave AIV1 sub_block_id at the default value. Update the loop in scheduler_cold_path.cpp to iterate over active_sched_threads_ (the same thread set used by assign_cores_to_threads()), and keep the existing payload_per_core_ / CoreTracker initialization for aiv0_id and aiv1_id so both AIV sub-block IDs are always assigned correctly.src/a2a3/runtime/host_build_graph/runtime/scheduler/scheduler_completion.cpp-558-569 (1)
558-569: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winGuard the 32-thread drain barrier mask.
1u << active_sched_threads_is undefined whenactive_sched_threads_ == 32, so a max-width scheduler can compute a badall_ackedmask and hang in drain mode.Proposed fix
- uint32_t all_acked = (1u << active_sched_threads_) - 1; + uint32_t all_acked = (active_sched_threads_ >= 32) ? UINT32_MAX : ((1u << active_sched_threads_) - 1);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/scheduler/scheduler_completion.cpp` around lines 558 - 569, The drain barrier mask in scheduler_completion.cpp is built with a 32-bit shift that becomes undefined when active_sched_threads_ can reach 32. Update the mask logic in the drain/ack path around the all_acked computation and the drain_ack_mask checks so it safely handles the max-width scheduler case, using a width-safe mask construction that avoids shifting by 32 while preserving the existing ack barrier behavior in the completion loop.
🟡 Minor comments (11)
src/a2a3/runtime/host_build_graph/runtime/runtime.h-29-30 (1)
29-30: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winRename the header guard to
SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_RUNTIME_H_. It collides withsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.h, so including both headers in one TU can skip one definition.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/runtime.h` around lines 29 - 30, The header guard in runtime.h is colliding with the one used by the tensormap_and_ringbuffer runtime header, so update the include guard symbol in this file to SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_RUNTIME_H_. Make sure the `#ifndef/`#define pair in the runtime.h header matches the new unique guard name so both headers can be included in the same translation unit without one being skipped.src/a2a3/runtime/host_build_graph/host/dep_gen_replay.h-73-74 (1)
73-74: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winFix the header guard to match
host_build_graph.The current guard uses
TENSORMAP_AND_RINGBUFFER, which risks colliding with a sibling runtime header and skipping this declaration.Proposed fix
-#ifndef SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_HOST_DEP_GEN_REPLAY_H_ -#define SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_HOST_DEP_GEN_REPLAY_H_ +#ifndef SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_HOST_DEP_GEN_REPLAY_H_ +#define SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_HOST_DEP_GEN_REPLAY_H_ @@ -#endif // SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_HOST_DEP_GEN_REPLAY_H_ +#endif // SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_HOST_DEP_GEN_REPLAY_H_Also applies to: 106-106
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/host/dep_gen_replay.h` around lines 73 - 74, The header guard in dep_gen_replay.h is using the wrong identifier and can collide with a sibling runtime header, so update the guard to match the host_build_graph path instead of TENSORMAP_AND_RINGBUFFER. Make sure both the `#ifndef` and `#define` symbols in the header guard are changed consistently in this file so the HostDepGenReplay declaration is not skipped by the preprocessor.src/a2a3/runtime/host_build_graph/runtime/orchestrator_core/pto_ring_buffer.cpp-108-112 (1)
108-112: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winAvoid advertising a new runtime tuning knob in the deadlock diagnostics.
These messages direct users to
PTO2_RING_DEP_POOL; if this PR is introducing that env/config surface, it needs measured justification in a separate change. Otherwise, keep the remediation to the compile-time constant to avoid documenting an unsupported knob. Based on learnings, new configuration knobs or environment-variable overrides should not be introduced speculatively without a measured workload demonstrating the need.Also applies to: 174-177
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/orchestrator_core/pto_ring_buffer.cpp` around lines 108 - 112, The deadlock diagnostics in pto_ring_buffer.cpp are advertising a new runtime tuning knob that should not be introduced here. Update the LOG_ERROR messages in the spill-pool capacity check so they only point to the compile-time constant (PTO2_DEP_LIST_POOL_SIZE in pto_runtime2_types.h) and remove the PTO2_RING_DEP_POOL runtime env reference from the diagnostics; also apply the same change to the related LOG_ERROR block at the other noted location.Source: Learnings
src/a2a3/runtime/host_build_graph/runtime/scheduler/pto_scheduler.h-792-795 (1)
792-795: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winKeep the wiring-deadlock remediation to supported configuration surfaces.
This diagnostic tells users to raise
PTO2_RING_DEP_POOL; if that runtime knob is new in this PR, it needs measured justification separately. Otherwise, point only at the compile-time pool sizing constant. Based on learnings, new configuration knobs or environment-variable overrides should not be introduced speculatively without a measured workload demonstrating the need.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/scheduler/pto_scheduler.h` around lines 792 - 795, The deadlock diagnostic in the scheduler’s remediation path should not point users at an unsupported runtime knob; update the LOG_ERROR message in pto_scheduler so it only references the compile-time pool sizing constant, and remove the suggestion to raise PTO2_RING_DEP_POOL unless that configuration surface is explicitly supported and justified elsewhere. Keep the guidance aligned with the existing fanout/dependency-pool check and the surrounding open-scope remediation text.Source: Learnings
src/a2a3/runtime/host_build_graph/runtime/scheduler/pto_scheduler.cpp-51-65 (1)
51-65: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winInclude
<utility>forstd::exchange.std::exchangeis declared in<utility>; adding it avoids relying on transitive includes that can break builds.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/scheduler/pto_scheduler.cpp` around lines 51 - 65, The scheduler_get_profiling function uses std::exchange but the file does not explicitly include the header that declares it, so add the proper include for <utility> near the other includes in pto_scheduler.cpp. Keep the existing logic in scheduler_get_profiling unchanged, and make sure the new include directly covers all std::exchange calls used to reset the profiling counters.src/a2a3/runtime/host_build_graph/docs/device_log_profiling.md-89-90 (1)
89-90: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winDon’t hardcode a 3-thread scheduler in the host_build_graph guide.
The note above correctly says this runtime is scheduler-only on device, but Block 2 and the grep example still assume only Threads 0/1/2 emit scheduler logs. On a 4-thread host_build_graph run that hides Thread 3’s scheduler output.
Also applies to: 173-174
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/docs/device_log_profiling.md` around lines 89 - 90, Update the host_build_graph device_log_profiling guide to avoid hardcoding a 3-thread scheduler assumption in the text and grep example. In the affected section, revise the wording around the scheduler summaries so it refers to all scheduler threads generically instead of only Thread 0/1/2, and adjust the example filter to include any scheduler thread output so Thread 3 is not omitted in 4-thread runs.src/a2a3/runtime/host_build_graph/docs/SCALAR_DATA_ACCESS.md-17-20 (1)
17-20: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winMatch the documented
set_tensor_datasignature to the real API.The wrapper exposed in
pto_orchestration_api.htakesconst Tensor&, notTensor&. Documenting the mutable form is misleading here becauseTaskOutputTensors::get_ref()commonly yields aconst Tensor&, and readers will assume that path is unsupported when it is actually the intended usage.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/docs/SCALAR_DATA_ACCESS.md` around lines 17 - 20, The documented set_tensor_data signature is using the mutable Tensor& form, but the actual API in pto_orchestration_api.h accepts const Tensor&. Update the SCALAR_DATA_ACCESS docs to match the real wrapper signature and reflect that TaskOutputTensors::get_ref() returning const Tensor& is the intended usage; use the set_tensor_data symbol as the reference point when editing the documentation.src/a2a3/runtime/host_build_graph/docs/RUNTIME_LOGIC.md-784-788 (1)
784-788: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winUse the host_build_graph runtime in the example config.
The example still sets
RUNTIME_CONFIG["runtime"]totensormap_and_ringbuffer, so copying it verbatim selects the wrong runtime.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/docs/RUNTIME_LOGIC.md` around lines 784 - 788, The example runtime config is using the wrong runtime value, so update the RUNTIME_CONFIG snippet in the host_build_graph docs to use the host_build_graph runtime instead of tensormap_and_ringbuffer. Keep the change localized to the example configuration block and ensure the runtime key clearly matches the host_build_graph workflow shown in RUNTIME_LOGIC.md.src/a2a3/runtime/host_build_graph/docs/RUNTIME_LOGIC.md-94-98 (1)
94-98: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winKeep the variant-specific behavior scoped consistently.
These sections still describe device-orchestrator mechanics as if they apply to
host_build_graph(Thread 3as orchestrator, on-device orchestration SO load, the single-ring note undertensormap_and_ringbuffer). That contradicts the scheduler-only host-orchestration model established earlier in the file and will send readers to the wrong execution path.Also applies to: 547-555, 691-697
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/docs/RUNTIME_LOGIC.md` around lines 94 - 98, The host_build_graph docs still mix in device-orchestrator and on-device execution details that do not apply to the scheduler-only host model. Update the affected sections in RUNTIME_LOGIC.md, especially the references around the Thread 3 orchestrator, the on-device orchestration SO load, and the tensormap_and_ringbuffer single-ring note, so they consistently describe host_build_graph behavior only and align with the earlier host-orchestration model.src/a2a3/runtime/host_build_graph/docs/RUNTIME_LOGIC.md-393-409 (1)
393-409: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winFix the task-state numeric mismatch.
Section 6 defines
CONSUMEDas state2, but the scheduler section later refers toCONSUMED (4). These numeric values are exactly what people use when reading dumps/logs, so the inconsistency will mislead debugging.Also applies to: 587-598
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/docs/RUNTIME_LOGIC.md` around lines 393 - 409, Update the task-state documentation in the scheduler/runtime logic docs so the numeric value for CONSUMED is consistent everywhere. In the Task State Machine section, align the later reference to CONSUMED with the state definition used by task_state[] and the numbered states in the same document, and verify any related mentions in the scheduler description (including the other referenced section) use the same symbol-to-number mapping.src/a2a3/runtime/host_build_graph/docs/profiling_levels.md-82-145 (1)
82-145: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winSplit the profiling output examples by runtime variant.
The file-level note says host_build_graph never emits on-device orchestrator timing lines, but the Level 1 output/count sections still present
orch_*logs andN_orch-based formulas as the normative output. In a host_build_graph doc, that will send readers looking for device-log lines that cannot exist.Also applies to: 435-441
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/docs/profiling_levels.md` around lines 82 - 145, The Level 1 profiling section mixes host_build_graph output with on-device orch timing lines, which makes the documented counts and examples misleading. Rework the `profiling_levels.md` Level 1 content so the `orch_*` logs and `N_orch` formulas are clearly separated into the runtime variant where they actually apply, and keep the host_build_graph-facing output focused on the scheduler-side lines. Update the `orch_to_sched_` note and the `LOG_INFO_V9 count` formulas to match that split, and ensure the example blocks reflect only the variant they describe.
🧹 Nitpick comments (3)
src/a2a3/runtime/host_build_graph/runtime/pto_ring_buffer.h (1)
463-470: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAvoid advertising new runtime env knobs in fatal diagnostics.
These logs prescribe
PTO2_RING_HEAP,PTO2_RING_TASK_WINDOW, andPTO2_RING_DEP_POOL. If these are new in this PR, remove the env guidance or split it into a measured follow-up; otherwise ensure the knobs are already supported and documented. Based on learnings, “do not introduce new configuration knobs or environment-variable overrides speculatively” and require measured workload justification for new tunables. <retrieved_learnings>Also applies to: 529-532, 712-715
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/pto_ring_buffer.h` around lines 463 - 470, The fatal diagnostics in the ring buffer code are advertising new environment-variable tunables that should not be introduced speculatively. Update the LOG_ERROR messages in pto_ring_buffer.h at the ring heap, task window, and dep pool failure paths to remove the PTO2_RING_HEAP, PTO2_RING_TASK_WINDOW, and PTO2_RING_DEP_POOL guidance unless those knobs are already implemented and documented; if not, keep the message focused on the failure and any already-supported recovery options. Use the existing error handling blocks around the ring buffer checks to locate and adjust the messages consistently.Source: Learnings
src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp (1)
228-230: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAvoid adding new env-driven runtime knobs without measured justification.
These
PTO2_RING_*andPTO2_ORCH_TO_SCHEDoverrides add new configurable surface. Prefer the existing per-call/runtime config path or constants unless this PR includes a measured workload proving the need. Based on learnings, “do not introduce new configuration knobs or environment-variable overrides … speculatively.”Also applies to: 758-764
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp` around lines 228 - 230, These new PTO2_RING_* environment-driven overrides add speculative runtime knobs and should be removed unless backed by measured workload evidence. Update runtime_maker.cpp to avoid calling apply_env_ring_values for PTO2_RING_TASK_WINDOW, PTO2_RING_HEAP, and PTO2_RING_DEP_POOL (and the similar PTO2_ORCH_TO_SCHED usage elsewhere), and instead rely on the existing per-call/runtime configuration path or fixed constants. Keep the surrounding runtime initialization logic intact, using the existing symbols like apply_env_ring_values and eff_task_window_sizes only where the configuration is already justified.Source: Learnings
src/a2a3/runtime/host_build_graph/runtime/scheduler/scheduler_cold_path.cpp (1)
13-14: 🎯 Functional Correctness | 🔵 Trivial | 💤 Low valueInclude
<cstring>directly inscheduler_cold_path.cppso thememsetcalls here don’t depend onruntime.h’s transitive#include <string.h>.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/host_build_graph/runtime/scheduler/scheduler_cold_path.cpp` around lines 13 - 14, The memset usage in scheduler_cold_path.cpp should not rely on runtime.h’s transitive include of string.h. Add a direct <cstring> include in scheduler_cold_path.cpp alongside the existing includes so the memset calls are self-contained and the dependency is explicit.
2cfb5ee to
df205c9
Compare
Introduces the host_build_graph (hbg) runtime: the host-orchestration variant of tensormap_and_ringbuffer. The orchestrator runs on the host to completion — building the whole task graph, relocating it to device addresses, and H2D'ing the image — then the device boots scheduler-only. Only the orchestration timing/location differs from tensormap; the data structures and dispatch are shared. Pipeline (host): stage tensors (device alloc + H2D) and SVM-map them so the host can read control tensors directly; dlopen the orchestration .so and run it; wire fanout inline during submit (build dep_pool/fanout_head + seed the ready queue on the host); relocate every cross-task pointer to its final device address (range-based, two-delta: SM region + arena region); H2D the populated SM + arena. Device: attach the already-device-addressed image and dispatch from the seeded ready queue; no on-device orchestrator, no pointer fixup, no execution-time reclaim. Completion is tracked by completed_tasks_, independent of last_task_alive. Simplifications enabled by host-orchestration: - No device-orch path (the device boots scheduler-only). - No execution-time reclaim: the whole graph is host-resident and runs once, so advance_ring_pointers / reset_for_reuse / the COMPLETED->CONSUMED flip are removed. Consumer waits key on fanout_refcount. - Single ring: with no reclaim and a whole-graph-resident ring, the per-scope-depth multi-ring split is gone (PTO2_MAX_RING_DEPTH == 1); rings[]/ring_id are physically removed. All scope depths map to ring 0. - Host reads device control tensors (e.g. paged_attention's context_lens / block_table) via SVM (a2a3 halHostRegister identity map; sim is already a host pointer), so get_tensor_data dereferences buffer.addr directly. The SVM map/unmap is exposed through the per-thread DeviceRunner bridge, not the Runtime.host_api struct, so the HostApi ABI shared with tensormap_and_ringbuffer / a5 is untouched. tensormap_and_ringbuffer and a5 runtimes are not modified. Verified on a2a3 onboard: 12 passed / 1 skipped (host_build_graph suite, incl. paged_attention reading control tensors via SVM).
Summary
Adds the
host_build_graph(hbg) runtime for a2a3 — a host-orchestration variant of tensormap_and_ringbuffer. The orchestrator runs on the host CPU instead of on-device, reading control tensors host-side (via SVM identity-map) to shape the task graph, then H2Ds the built image; the device boots scheduler-only.orchestrator_core/(get_tensor_data/set_tensor_data),scheduler/,shared/,backend/sdma/, 4-fieldTensorPair(skips D2H for read-only inputs).svm_register_via_runner→halHostRegister(DEV_SVM_MAP_HOST)) so the host orchestrator can dereference staged device tensors; identity-map asserted at staging. Routed through the per-thread DeviceRunner, NOT theRuntime.host_apistruct, so tensormap/a5 HostApi ABI is untouched.Rebased onto current main
Rebased onto the latest
mainand adapted to the shared-layer refactors that landed while the branch sat unmerged:HostApimoved out ofRuntimeinto sharedcommon/host_api.h, passed explicitly asconst HostApi *apitoregister_callable_impl/bind_callable_to_runtime_impl/validate_runtime_impl(mirrors tensormap_and_ringbuffer).prepare_callable_implrenamed toregister_callable_impl.runtime_device_copy_size()(hbg returnssizeof(Runtime); copies the whole object).get_worker_count/get_workers/get_aicpu_*) added over hbg's flat fields so the shared platform layer compiles against either runtime variant.l2_swimlane_aicpu_record_sched_phasecall sites to the shared-only snapshot signature.a5 scope
a5 hbg is a separate, older copy and is not migrated to the new host-orch architecture here — it lacks unified memory, so the SVM identity-map path the a2a3 host orchestrator relies on does not apply. a5 hbg migration is deferred until a5 unified memory lands. The shared-platform
svm_registerdefault (returnsnullptr) is correct for a5 today.Testing
-Werror(a2a3sim, a5sim, a2a3, a5).