Skip to content

Add: host_build_graph runtime (host-orchestration variant of tensormap)#1185

Open
ChaoWao wants to merge 1 commit into
hw-native-sys:mainfrom
ChaoWao:worktree-foamy-wondering-thimble
Open

Add: host_build_graph runtime (host-orchestration variant of tensormap)#1185
ChaoWao wants to merge 1 commit into
hw-native-sys:mainfrom
ChaoWao:worktree-foamy-wondering-thimble

Conversation

@ChaoWao

@ChaoWao ChaoWao commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds the host_build_graph (hbg) runtime for a2a3 — a host-orchestration variant of tensormap_and_ringbuffer. The orchestrator runs on the host CPU instead of on-device, reading control tensors host-side (via SVM identity-map) to shape the task graph, then H2Ds the built image; the device boots scheduler-only.

  • New host-orch runtime: orchestrator_core/ (get_tensor_data/set_tensor_data), scheduler/, shared/, backend/sdma/, 4-field TensorPair (skips D2H for read-only inputs).
  • SVM-map seam (svm_register_via_runnerhalHostRegister(DEV_SVM_MAP_HOST)) so the host orchestrator can dereference staged device tensors; identity-map asserted at staging. Routed through the per-thread DeviceRunner, NOT the Runtime.host_api struct, so tensormap/a5 HostApi ABI is untouched.
  • 6 a2a3 hbg examples: bgemm, dump_tensor, matmul, paged_attention, prepared_callable, vector_example.

Rebased onto current main

Rebased onto the latest main and adapted to the shared-layer refactors that landed while the branch sat unmerged:

a5 scope

a5 hbg is a separate, older copy and is not migrated to the new host-orch architecture here — it lacks unified memory, so the SVM identity-map path the a2a3 host orchestrator relies on does not apply. a5 hbg migration is deferred until a5 unified memory lands. The shared-platform svm_register default (returns nullptr) is correct for a5 today.

Testing

  • All 4 platforms × 2 runtimes build clean with -Werror (a2a3sim, a5sim, a2a3, a5).
  • a2a3sim hbg scene tests: 10/10 pass (all 6 examples).
  • a5sim hbg scene tests: 7/7 pass.
  • a5sim + a2a3sim tensormap regression checks pass (shared-platform changes).
  • Onboard (real silicon) validation pending — sim's SVM/cache_ops are no-ops, so the SVM path and scheduler changes are only truly validated on hardware.

@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 243b003e-9c5d-480b-8aa8-a7fb43fb58fc

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Replaces the host_build_graph task-graph runtime with a full PTO2 host-orchestration system. The host now runs the orchestrator to completion, builds a populated arena image (ring buffers, TensorMap, dispatcher payloads), relocates pointers to device addresses, and H2Ds it; the device boots a scheduler-only path with async AICore completion support.

Changes

PTO2 Host-Orchestration Runtime

Layer / File(s) Summary
Core ABI types and constants
runtime/pto_constants.h, runtime/pto_runtime2_types.h, runtime/pto_submit_types.h, runtime/pto_completion_token.h, common/pto_runtime_status.h, common/intrinsic.h, runtime/pto2_dispatch_payload.h
Introduces all foundational PTO2 types: alignment constants, task state/allocation/payload/slot structs with cache-line assertions, ActiveMask/MixedKernels/PTO2LaunchSpec submit contracts, the 512-byte PTO2DispatchPayload per-core ABI, SPMD GlobalContext/LocalContext/AsyncCtx kernel-facing structs, error-code macros, and CompletionToken/CompletionPollResult.
Shared memory layout and ring buffer data structures
runtime/pto_shared_memory.h, runtime/pto_ring_buffer.h, runtime/runtime.h
Defines PTO2SharedMemoryHeader/PTO2SharedMemoryRingHeader/PTO2RingFlowControl/PTO2SharedMemoryHandle plus host-safe address helpers. Introduces PTO2TaskAllocator (unified task+heap ring with deadlock detection), PTO2FaninPool, PTO2DepListPool, PTO2RingSet, and fanin iteration utilities. Replaces the old task-graph Runtime API with orchestration-centric accessors, prebuilt-arena fields, and tensor_pairs_.
TensorMap dependency tracking, task arg types, and orchestration API
runtime/pto_tensormap.h, runtime/pto_dep_compute.h, runtime/pto_types.h, orchestration/pto_orchestration_api.h, orchestration/pto_arg_with_deps.h
Adds PTO2TensorMap (hash-table/ring-pool with multi-stage check_overlap and lazy invalidation) and compute_task_fanin/register_task_outputs helpers. Introduces the Arg/L0TaskArgs/TensorRef/TaskOutputTensors argument container hierarchy and the orchestration-facing pto_orchestration_api.h (ops table, alloc_tensors/rt_submit_* wrappers, PTO2ScopeGuard RAII, get/set_tensor_data templates, logging macros) plus L0TaskArgsWithDeps convenience accumulator.
Async completion infrastructure
runtime/aicore_completion_mailbox_types.h, runtime/aicore_completion_mailbox.h, runtime/pto_async_kernel_api.h, runtime/pto_async_wait.h, runtime/backend/sdma/sdma_completion_kernel.h, runtime/backend/sdma/sdma_completion_scheduler.h
Defines the full AICore↔AICPU deferred completion path: DeferredCompletionEntry/DeferredCompletionSlab shared types, MPSC lock-free AICoreCompletionMailbox, AICore-side register_completion_condition/defer_flush, scheduler-side AsyncWaitList drain/poll/retire, and SDMA backend helpers (SdmaRequestDescriptor, send_request_entry, poll/retire_sdma_event_record).
PTO2Runtime, orchestrator, and scheduler interface headers
runtime/pto_runtime2.h, runtime/pto_orchestrator.h, runtime/common.h, runtime/scheduler/scheduler_context.h
Declares PTO2Runtime struct and PTO2RuntimeOps ops table, arena lifecycle API (reserve/init/wire/finalize/destroy), PTO2OrchestratorState with its layout reservation and task-submission API, and SchedulerContext class with public lifecycle entry points and all private dispatch/completion/cold-path method signatures. Also adds the framework_bind_runtime TLS bridge.
Orchestrator core implementation
runtime/orchestrator_core/pto_orchestrator.cpp, .../pto_ring_buffer.cpp, .../pto_runtime2.cpp, orchestration/common.cpp
Implements the host orchestrator: fanin deduplication (PTO2FaninBuilder), task preparation, TensorMap capacity backpressure (deadlock backstop), the main submit_task_common pipeline (dep-gen, compute_task_fanin, TensorMap insert, descriptor write, wire_task), submit_task/submit_dummy_task/alloc_tensors/mark_done, pool reclaim with cycle-based deadlock detection, wait_for_tensor_ready/get_tensor_data/set_tensor_data, and orchestration SO assertion/stack-trace infrastructure.
Scheduler: ready queue, dispatch, completion, and cold path
runtime/scheduler/pto_scheduler.h, pto_scheduler.cpp, scheduler_dispatch.cpp, scheduler_completion.cpp, scheduler_cold_path.cpp
Full scheduler: lock-free MPMC ready queue, SPSC wiring queue, PTO2SchedulerState hot-path (wiring fanin, fanout locking, speculative early-dispatch via CAS/doorbell/staged-core-mask, on_task_complete/on_task_release), resolve_and_dispatch main loop (completion polling, drain-mode protocol, dispatch phases, deferred release, hang timeout), slot-transition logic, deferred completion routing through the MPSC mailbox, and cold-path diagnostics (handshake, core assignment, emergency shutdown, stall logging, L2 swimlane summary).
Host runtime maker, AICPU/AICore executors, and dep-gen replay
host/runtime_maker.cpp, aicpu/aicpu_executor.cpp, aicore/aicore_executor.cpp, host/dep_gen_replay.h, host/dep_gen_replay.cpp, host/host_orch_compat_stubs.cpp, host/runtime_compile_info.cpp
Rewrites runtime_maker.cpp for RT2: per-ring config resolution, tensor staging with SVM registration, host prebuilt arena construction, host orchestrator execution against a host SM mirror, pointer relocation, H2D upload, and validate_runtime_impl with fatal-status-aware copy-back. Refactors AicpuExecutor to delegate to SchedulerContext. Updates AICore executor to dispatch via PTO2DispatchPayload with not_ready gate and task_token-based L2 swimlane identity. Adds two-pass dep_gen_replay_emit_deps_json and weak host compat stubs.
Platform SVM helpers, build config, and documentation
platform/onboard/host/device_runner.h, .../device_runner.cpp, host_build_graph/build_config.py, docs/RUNTIME_LOGIC.md, docs/SCALAR_DATA_ACCESS.md, docs/SUBMIT_BY_CLUSTER.md, docs/device_log_profiling.md, docs/profiling_levels.md
Adds DeviceRunner::svm_register/svm_unregister via lazy-loaded HAL. Updates build config include/source directories. Rewrites RUNTIME_LOGIC.md as a full PTO2 system design document and adds four new documentation pages covering scalar data access, cluster submission, device log profiling, and profiling levels.

Sequence Diagram

sequenceDiagram
  participant Host as Host (runtime_maker)
  participant HostOrch as Orchestration SO (host)
  participant AICPU as AICPU executor
  participant Sched as SchedulerContext
  participant AICore as AICore executor

  Host->>Host: stage tensors to device, svm_register
  Host->>Host: reserve/init/wire PTO2Runtime arena
  Host->>HostOrch: run_host_orchestration (host SM mirror)
  HostOrch->>Host: submit_task calls → wire_task (inline fanout)
  Host->>Host: relocate host→device pointers in arena image
  Host->>AICPU: H2D upload prebuilt arena
  AICPU->>Sched: boot thread: reattach arena, wire SM, on_orchestration_done
  Sched->>Sched: resolve_and_dispatch loop (ready queue, dispatch, completion)
  Sched->>AICore: publish PTO2DispatchPayload (scheduler→AICore register write)
  AICore->>AICore: execute_task(payload), write FIN register
  AICore->>Sched: AICoreCompletionMailbox (deferred/normal done)
  Sched->>AICPU: completed_tasks == total_tasks → shutdown
  AICPU->>Host: runtime_destroy, framework_bind_runtime(nullptr)
  Host->>Host: validate_runtime_impl: D2H copy-back, svm_unregister
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~180 minutes

Possibly related PRs

  • hw-native-sys/simpler#878: Both PRs modify aicore_executor.cpp in the L2 swimlane profiling path; this PR changes how swimlane identity/join keys are derived from exec_payload->local_context.async_ctx.task_token.raw and task_id.
  • hw-native-sys/simpler#1051: Both PRs implement direction-aware D2H copy-back by tagging TensorPair with needs_copy_back and skipping copy for ArgDirection::IN tensors in validate_runtime_impl.
  • hw-native-sys/simpler#1074: Both PRs modify SchedulerContext::handle_drain_mode in scheduler_completion.cpp to add early-exit when is_completed() is set, preventing deadlock in the sync_start drain barrier.

Poem

🐇 The rabbit hops from host to device, quite the feat,
Orchestration done on CPU — the graph complete!
Ring buffers spin, the TensorMap tracks each slice,
Payloads land on AICore, dispatched in a trice.
No more task-graph runtime — PTO2 takes the stage,
512 bytes per core, a beautiful new age! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.59% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding the host_build_graph runtime as a host-orchestration variant.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The description clearly matches the changeset: it introduces host_build_graph, host orchestration, SVM mapping, and runtime refactors.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request transitions the host_build_graph runtime to a host-orchestrated model (PTO2) where the task graph is built on the host and executed scheduler-only on the device. Feedback from the review highlights several critical issues: redundant MMIO reads in the speculative early-dispatch loop, unsafe memcpy operations due to layout incompatibility in PTO2TensorMapEntry, and potential infinite hangs on the host because the fallback get_sys_cnt_aicpu() returns zero. Additionally, the review points out undefined behavior from unaligned pointer access, redundant loops over PTO2_MAX_RING_DEPTH since ring is a single member, and opportunities to optimize hash table lookups using bitwise AND instead of modulo.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp
Comment thread src/a2a3/runtime/host_build_graph/runtime/pto_tensormap.h
Comment thread src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp
Comment thread src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp
Comment thread src/a2a3/runtime/host_build_graph/runtime/pto_tensormap.h
Comment thread src/a2a3/runtime/host_build_graph/runtime/pto_tensormap.h
@ChaoWao ChaoWao force-pushed the worktree-foamy-wondering-thimble branch from 44aae09 to 698f596 Compare June 29, 2026 03:41

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🟠 Major comments (27)
src/a2a3/runtime/host_build_graph/runtime/scheduler/scheduler_context.h-144-147 (1)

144-147: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Make the orchestration handoff atomic

volatile does not synchronize the scheduler/orchestrator handoff, so the flag and the total_tasks_ publication still race across threads. Switch orchestrator_done_ to std::atomic<bool> and update the reads/writes in scheduler_cold_path.cpp and scheduler_dispatch.cpp to use acquire/release operations.

🔒 Proposed header-side fix
-    volatile bool orchestrator_done_{false};
+    std::atomic<bool> orchestrator_done_{false};
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/runtime/scheduler/scheduler_context.h`
around lines 144 - 147, The scheduler/orchestrator handoff in SchedulerContext
is still using a non-synchronizing flag, so the publication of total_tasks_ can
race with scheduler reads. Change orchestrator_done_ in SchedulerContext to
std::atomic<bool>, then update every write/read of that flag in
scheduler_cold_path.cpp and scheduler_dispatch.cpp to use release on the
orchestrator side and acquire on the scheduler side. Keep completed_ consistent
with the same atomic handoff pattern so the initialization state is visible
across threads through the existing scheduler/orchestrator paths.
src/a2a3/runtime/host_build_graph/runtime/pto_ring_buffer.h-285-287 (1)

285-287: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Guard local_task_id_ before incrementing. local_task_id_++ is signed int32_t overflow at INT32_MAX, which is undefined behavior before the ring-mask logic can recover. Returning -1 here is safe because prepare_task() already treats negative task_id as allocation failure and stops before publishing the descriptor.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/runtime/pto_ring_buffer.h` around lines 285
- 287, Guard local_task_id_ before incrementing in commit_task() to avoid signed
int32_t overflow at INT32_MAX; check for the terminal value first and return -1
on failure so prepare_task() can stop cleanly before publishing. Update the
task_id assignment logic in PTO ring buffer commit flow to use a safe
pre-increment check rather than local_task_id_++ overflow, while preserving the
existing negative-id failure path.
src/a2a3/runtime/host_build_graph/runtime/runtime.h-337-343 (1)

337-343: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Split the host-only tensor ledger out of Runtime Runtime is copied to device as a raw sizeof(Runtime) blob, so std::vector<TensorPair> tensor_pairs_ becomes part of the device image and ties the ABI to libstdc++ layout. Move the ledger to a host-only wrapper or copy only a trivially-copyable device prefix.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/runtime/runtime.h` around lines 337 - 343,
The Runtime layout currently includes the host-only tensor_pairs_ ledger, but
Runtime is copied as a raw sizeof(Runtime) blob, so the std::vector control
block leaks into the device image and locks the ABI to libstdc++ details.
Refactor the Runtime ownership model so the device-copied portion is a trivially
copyable prefix and move tensor_pairs_ into a host-only wrapper or separate
host-side structure; update any runtime_maker.cpp and validate_runtime_impl
usage to read the ledger from that host-only path instead of from Runtime.
src/a2a3/runtime/host_build_graph/runtime/pto_types.h-234-248 (1)

234-248: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Reset launch_spec with the rest of the Arg state.

reset() is the reuse path, but it leaves launch_spec untouched. After one submission sets a non-default launch shape, the next recycled Arg inherits it silently.

Suggested fix
     void clear() {
         Base::clear();
 `#if` PTO2_PROFILING
         dump_arg_selection_.clear();
 `#endif`
         explicit_deps_ = nullptr;
         explicit_dep_count_ = 0;
         allow_early_resolve_ = false;
+        launch_spec = PTO2LaunchSpec{};
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/runtime/pto_types.h` around lines 234 -
248, The Arg reuse path in reset() leaves launch_spec stale because it is not
cleared alongside the other state. Update the clear()/reset() flow in the Arg
type so launch_spec is reset together with explicit_deps_, explicit_dep_count_,
allow_early_resolve_, has_error, and error_msg, ensuring recycled Arg instances
do not inherit a previous submission’s launch shape.
src/a2a3/runtime/host_build_graph/orchestration/pto_orchestration_api.h-96-107 (1)

96-107: 🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

Keep PTO2Runtime on a single shared definition. This header redefines the runtime object with only the first two fields, while pto_runtime2.h defines the full layout. That relies on an implicit cross-component ABI contract rather than a C++ type guarantee; centralize the shared prefix or add an accessor for pending_scope_mode instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/orchestration/pto_orchestration_api.h`
around lines 96 - 107, The runtime type is being redefined here with only a
partial layout, which duplicates the full `PTO2Runtime` definition from the main
runtime header and depends on an implicit ABI contract. Update the orchestration
API to use a single shared definition or a shared prefix type, and expose
`pending_scope_mode` through an accessor or inline helper instead of directly
embedding a second `PTO2Runtime` struct definition. Use the `PTO2Runtime` and
`pending_scope_mode` symbols to locate the overlap and remove the duplicated
layout.
src/a2a3/runtime/host_build_graph/runtime/pto_runtime2_types.h-25-26 (1)

25-26: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Use a unique include guard for this runtime header.

This file’s own comment notes mixed inclusion with the tensormap_and_ringbuffer copy; keeping the copied guard makes the first included runtime header win and can hide the other one.

Proposed fix
-#ifndef SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_RUNTIME2_TYPES_H_
-#define SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_RUNTIME2_TYPES_H_
+#ifndef SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_PTO_RUNTIME2_TYPES_H_
+#define SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_PTO_RUNTIME2_TYPES_H_
@@
-#endif  // SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_RUNTIME2_TYPES_H_
+#endif  // SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_PTO_RUNTIME2_TYPES_H_

Also applies to: 544-544

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/runtime/pto_runtime2_types.h` around lines
25 - 26, The include guard in pto_runtime2_types.h is still using the copied
tensormap_and_ringbuffer symbol, which can cause one runtime header to suppress
the other. Update the guard macro in this header to a unique name tied to
pto_runtime2_types, and make sure the matching endif comment stays consistent so
this header can be included independently.
src/a2a3/runtime/host_build_graph/runtime/aicore_completion_mailbox.h-12-13 (1)

12-13: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Rename this include guard to avoid cross-runtime suppression.

The guard names tensormap_and_ringbuffer even though the file is under host_build_graph. If both runtimes’ mailbox headers are visible in one TU, this can make one copy suppress the other.

Proposed fix
-#ifndef SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_AICORE_COMPLETION_MAILBOX_H_
-#define SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_AICORE_COMPLETION_MAILBOX_H_
+#ifndef SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_AICORE_COMPLETION_MAILBOX_H_
+#define SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_AICORE_COMPLETION_MAILBOX_H_
@@
-#endif  // SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_AICORE_COMPLETION_MAILBOX_H_
+#endif  // SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_AICORE_COMPLETION_MAILBOX_H_

Also applies to: 189-189

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/runtime/aicore_completion_mailbox.h` around
lines 12 - 13, The include guard in aicore_completion_mailbox.h is using the
wrong runtime-specific prefix, which can cause cross-runtime header suppression
when both mailbox headers are included together. Update the guard in the
aicore_completion_mailbox header to a unique host_build_graph-specific name and
make sure the matching closing directive uses the same symbol so it no longer
collides with the tensormap_and_ringbuffer runtime.
src/a2a3/runtime/host_build_graph/runtime/pto_runtime2_types.h-70-96 (1)

70-96: 🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

Don’t cap scope tasks with the default window size when the actual window is runtime-configurable.

Lines 70-72 say the task window is passed at runtime, but Line 96 fixes PTO2_SCOPE_TASKS_CAP to the default 16384 * depth. A valid larger runtime window can fatal early in scope_tasks_push before the ring is actually full. Size this buffer/cap from the resolved runtime task_window_size.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/runtime/pto_runtime2_types.h` around lines
70 - 96, The scope task cap is incorrectly tied to the compile-time default
`PTO2_TASK_WINDOW_SIZE`, which can cause `scope_tasks_push` to overflow early
when the runtime-provided window is larger. Update the `PTO2_SCOPE_TASKS_CAP`
sizing logic in `pto_runtime2_types.h` so it derives from the resolved runtime
`task_window_size` used by `runtime_create_from_sm()` and the ring depth, and
make sure any code using the cap or `scope_tasks_push` reads the runtime value
instead of the default macro.
src/a2a3/runtime/host_build_graph/common/pto_runtime_status.h-18-53 (1)

18-53: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Rename the copied include guard to the host-build-graph path.

The guard still uses the tensormap_and_ringbuffer namespace. In mixed-runtime TUs, that can cause this status helper to be skipped or replaced by the sibling runtime’s header.

Proposed fix
-#ifndef SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_COMMON_PTO_RUNTIME_STATUS_H_
-#define SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_COMMON_PTO_RUNTIME_STATUS_H_
+#ifndef SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_COMMON_PTO_RUNTIME_STATUS_H_
+#define SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_COMMON_PTO_RUNTIME_STATUS_H_
@@
-#endif  // SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_COMMON_PTO_RUNTIME_STATUS_H_
+#endif  // SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_COMMON_PTO_RUNTIME_STATUS_H_
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/common/pto_runtime_status.h` around lines
18 - 53, The include guard in pto_runtime_status.h still matches the older
tensormap_and_ringbuffer path, which can collide with the sibling runtime header
and cause the wrong definition to be skipped in mixed builds. Update the guard
macro to a unique host_build_graph-specific name and keep it consistent with the
header’s current location; verify the change around the top-level include guard
symbols in pto_runtime_status.h so this helper is not accidentally shared with
the other runtime variant.
src/a2a3/runtime/host_build_graph/runtime/pto_constants.h-12-19 (1)

12-19: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Use a host_build_graph-specific include guard. This file still uses the tensormap_and_ringbuffer guard prefix, so builds that expose both runtime trees can skip one pto_constants.h copy entirely. Rename the guard and trailing #endif comment to match src/a2a3/runtime/host_build_graph/runtime/pto_constants.h.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/runtime/pto_constants.h` around lines 12 -
19, The include guard in pto_constants.h still uses the tensormap_and_ringbuffer
prefix, which can cause collisions when both runtime trees are included. Update
the guard macro name and matching trailing `#endif` comment in pto_constants.h to
a host_build_graph-specific symbol that uniquely reflects
src/a2a3/runtime/host_build_graph/runtime/pto_constants.h.
src/a2a3/runtime/host_build_graph/runtime/pto_submit_types.h-152-153 (1)

152-153: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Preserve the positive block_num invariant.

set_block_num(0) or a negative value violates the documented [0, block_num) dispatch contract and can produce an invalid launch shape. Please reject invalid values at this boundary or before populating PTO2LaunchSpec.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/runtime/pto_submit_types.h` around lines
152 - 153, The block_num setter currently accepts zero and negative values,
which breaks the documented dispatch contract and can lead to invalid launch
specs. Update `set_block_num` in `PTO2LaunchSpec` (or add validation before it
is populated) so only positive values are accepted, and reject invalid inputs at
this boundary with an appropriate failure path to preserve the `block_num`
invariant.
src/a2a3/runtime/host_build_graph/runtime/aicore_completion_mailbox_types.h-12-13 (1)

12-13: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Use a host_build_graph-specific header guard.

This new host_build_graph header uses a TENSORMAP_AND_RINGBUFFER guard, creating a collision risk with unchanged runtime headers.

Proposed fix
-#ifndef SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_AICORE_COMPLETION_MAILBOX_TYPES_H_
-#define SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_AICORE_COMPLETION_MAILBOX_TYPES_H_
+#ifndef SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_AICORE_COMPLETION_MAILBOX_TYPES_H_
+#define SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_AICORE_COMPLETION_MAILBOX_TYPES_H_
...
-#endif  // SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_AICORE_COMPLETION_MAILBOX_TYPES_H_
+#endif  // SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_AICORE_COMPLETION_MAILBOX_TYPES_H_

Also applies to: 67-67

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/runtime/aicore_completion_mailbox_types.h`
around lines 12 - 13, The header guard in aicore_completion_mailbox_types.h is
using the wrong namespace and can collide with unrelated runtime headers. Update
the include guard in the host_build_graph-specific header to a unique
host_build_graph symbol, and make sure the matching closing guard uses the same
new name; locate and fix the guard constants at the top and bottom of the file.
src/a2a3/runtime/host_build_graph/runtime/pto_completion_token.h-12-13 (1)

12-13: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Use a host_build_graph-specific header guard.

The guard macro still names TENSORMAP_AND_RINGBUFFER; if a TU includes same-named headers from both runtimes, one header can suppress the other.

Proposed fix
-#ifndef SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_COMPLETION_TOKEN_H_
-#define SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_COMPLETION_TOKEN_H_
+#ifndef SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_PTO_COMPLETION_TOKEN_H_
+#define SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_PTO_COMPLETION_TOKEN_H_
...
-#endif  // SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_COMPLETION_TOKEN_H_
+#endif  // SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_PTO_COMPLETION_TOKEN_H_

Also applies to: 45-45

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/runtime/pto_completion_token.h` around
lines 12 - 13, The header guard in pto_completion_token.h still uses the shared
TENSORMAP_AND_RINGBUFFER macro, which can clash with other runtimes. Update the
guard to a host_build_graph-specific symbol in the include guard at the top of
the file and the matching closing directive so the pto_completion_token header
is uniquely protected. Keep the change consistent with the existing
pto_completion_token.h identifiers.
src/a2a3/runtime/host_build_graph/runtime/backend/sdma/sdma_completion_kernel.h-87-142 (1)

87-142: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Propagate and flush async-registration failures.

register_pto_async_event() records errors for invalid engine / event-check failure, but send_request_entry() still returns true. Also, the BuildAsyncSession failure path writes completion_error_code and returns before defer_flush(), so the scheduler may miss the error.

Proposed fix direction
-inline __aicore__ void register_sdma_event_record(AsyncCtx &ctx, volatile __gm__ void *record_addr) {
+inline __aicore__ bool register_sdma_event_record(AsyncCtx &ctx, volatile __gm__ void *record_addr) {
     CompletionToken token{
         reinterpret_cast<uint64_t>(record_addr), 0, COMPLETION_ENGINE_SDMA, COMPLETION_TYPE_SDMA_EVENT_RECORD, 0
     };
-    (void)register_completion_condition(ctx, token);
+    return register_completion_condition(ctx, token);
 }
 
 template <typename PtoAsyncEvent, typename PtoAsyncSession>
-inline __aicore__ void
+inline __aicore__ bool
 register_pto_async_event(AsyncCtx &ctx, const PtoAsyncEvent &event, const PtoAsyncSession &session) {
     if (ctx.task_token.is_invalid() || ctx.completion_count == nullptr || ctx.completion_entries == nullptr) {
         (void)event.Wait(session);
-        return;
+        return true;
     }
     if (event.handle == 0) {
-        return;
+        return true;
     }
 
     const uint32_t engine = static_cast<uint32_t>(event.engine);
     if (engine != static_cast<uint32_t>(::pto::comm::DmaEngine::SDMA)) {
         defer_error(ctx, PTO2_ERROR_ASYNC_COMPLETION_INVALID);
-        return;
+        return false;
     }
...
         )) {
         defer_error(ctx, PTO2_ERROR_ASYNC_COMPLETION_INVALID);
-        return;
+        return false;
     }
+    bool ok = true;
     for (uint32_t queue_id = 0; queue_id < queue_num; ++queue_id) {
-        register_sdma_event_record(ctx, ::pto::comm::sdma::detail::GetEventRecord(recv_workspace, queue_id));
+        ok = register_sdma_event_record(ctx, ::pto::comm::sdma::detail::GetEventRecord(recv_workspace, queue_id)) && ok;
     }
+    return ok;
 }
...
     if (!pto::comm::BuildAsyncSession(desc.scratch, desc.workspace, session, desc.sync_id)) {
         pto2::detail::defer_error(ctx, PTO2_ERROR_ASYNC_COMPLETION_INVALID);
+        pto2::detail::defer_flush(ctx);
         return false;
     }
...
-    pto2::detail::register_pto_async_event(ctx, event, session);
+    bool ok = pto2::detail::register_pto_async_event(ctx, event, session);
     pto2::detail::defer_flush(ctx);
-    return true;
+    return ok;
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@src/a2a3/runtime/host_build_graph/runtime/backend/sdma/sdma_completion_kernel.h`
around lines 87 - 142, `send_request_entry()` currently ignores failures from
`register_pto_async_event()` and can return true even after `defer_error()` is
set, while the `BuildAsyncSession` failure path exits before any flush. Update
`send_request_entry()` and `register_pto_async_event()` so async-registration
failures are propagated back to the caller and `defer_flush(ctx)` is always
reached whenever an error is recorded, using the existing `AsyncCtx`,
`register_pto_async_event`, `BuildAsyncSession`, and `defer_error/defer_flush`
flow to ensure the scheduler observes the failure.
src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp-706-748 (1)

706-748: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Add failure cleanup for staged tensor allocations and SVM maps.

After earlier tensors are pushed into tensor_pairs_, any later return -1 in binding leaks their device allocations and SVM registrations. Use a local cleanup guard or shared cleanup helper until ownership is fully transferred to validate_runtime_impl.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp` around lines 706 -
748, The tensor staging path in runtime_maker.cpp leaks previously allocated
device buffers and SVM registrations when a later tensor fails after entries
have already been added to runtime->tensor_pairs_. Update the allocation/staging
flow around device_malloc, device_memset/copy_to_device, push_back on
tensor_pairs_, and svm_register_via_runner so any failure performs cleanup of
all tensors staged so far before returning -1. Use a local cleanup guard or a
shared helper that frees the accumulated dev_ptrs and unregisters their SVM
mappings until ownership is fully transferred to validate_runtime_impl.
src/a2a3/runtime/host_build_graph/host/dep_gen_replay.cpp-607-615 (1)

607-615: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Do not emit deps.json from partial overflow chains.

Continuing with a partial dep list silently drops explicit dependencies while still writing a “source of truth” deps.json. Treat malformed chains as fatal and return before serialization.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/host/dep_gen_replay.cpp` around lines 607 -
615, The dep_gen replay path in dep_gen_replay.cpp currently logs a warning and
continues when chain_complete is false, which still allows a partial dep list to
be serialized into deps.json. Update the replay logic around the chain_complete
check so malformed overflow chains are treated as fatal: stop processing and
return before deps_data and dc are used for serialization, rather than falling
through with full_deps_buf. Keep the change localized to the
replay/serialization flow in the host dep generation code so deps.json is only
emitted from complete dependency chains.
src/a2a3/runtime/host_build_graph/aicpu/aicpu_executor.cpp-131-145 (1)

131-145: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Reject single-thread execution when no scheduler thread will run.

The 0 → 1 fixup makes sched_thread_num_ == 0; with orch_to_sched_ == false, the boot thread skips dispatch and submitted tasks never execute. Require at least one scheduler-capable thread or force a supported transition mode.

Also applies to: 229-247

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/aicpu/aicpu_executor.cpp` around lines 131
- 145, The single-thread fallback in aicpu_executor.cpp allows a configuration
where sched_thread_num_ becomes 0 and, with orch_to_sched_ disabled, nothing
ever drives task dispatch. Update the initialization logic in the executor’s
setup path (the block that sets aicpu_thread_num_, sched_thread_num_, and
orch_to_sched_) to reject this unsupported combination by validating that at
least one scheduler-capable thread will run, or by forcing a supported
orch_to_sched_ transition mode before calling sched_ctx_.init. Apply the same
guard in the other referenced initialization block as well so both code paths
enforce the same constraint.
src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp-277-280 (1)

277-280: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Do not treat a failed PTO2 header read as runtime success.

Returning 0 here makes validate_runtime_impl proceed with copy-back even when runtime status is unavailable. Return an error/status sentinel so cleanup still runs but result copy-back is skipped.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp` around lines 277 -
280, The PTO2 header read failure in runtime_maker::validate_runtime_impl is
currently returned as success, which lets copy-back continue even when runtime
status is unavailable. Update the error path after
runtime->host_api.copy_from_device(...) so it returns a non-success sentinel
instead of 0, while preserving cleanup execution. Make sure the change is
applied in the validate_runtime_impl flow that handles
host_header/PTO2SharedMemoryHeader reads, so failed header reads skip result
copy-back.
src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp-318-323 (1)

318-323: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Tighten temp-SO permissions and unlink on every failure path.

0755 exposes callable SO contents to other users, and the dlopen/dlsym error paths return without unlinking the tempfile.

Proposed fix
-    if (fchmod(fd, 0755) != 0) {
+    if (fchmod(fd, 0700) != 0) {
         close(fd);
         unlink(tmpl);
         return false;
@@
         void *handle = dlopen(so_path.c_str(), RTLD_NOW | RTLD_LOCAL);
         if (handle == nullptr) {
             LOG_ERROR("host-orch: dlopen failed: %s", dlerror());
+            unlink(so_path.c_str());
             return -1;
         }
@@
             LOG_ERROR("host-orch: dlsym('%s') failed: %s", orch_func_name, dlerror());
             dlclose(handle);
+            unlink(so_path.c_str());
             return -1;
         }
@@
             LOG_ERROR("host-orch: orch .so does not export framework_bind_runtime: %s", dlerror());
             dlclose(handle);
+            unlink(so_path.c_str());
             return -1;
         }

Also applies to: 606-627

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp` around lines 318 -
323, The temp shared object handling in the runtime maker is too permissive and
leaves files behind on some failures. Update the temp-file permission setup
around the write/load flow in runtime_maker’s SO creation logic to use stricter
permissions than 0755, and make sure every early return from the
write_all_bytes, dlopen, and dlsym failure paths closes the fd and unlinks tmpl
before returning false. Keep the cleanup consistent in the same code path so no
tempfile survives any error.

Source: Linters/SAST tools

src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp-979-987 (1)

979-987: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Validate packed-output copy size and target tensor identity.

graph_out_size comes from the device header and is copied into pair.host_ptr without checking graph_out_size <= pair.size. Also, “first output tensor” is approximated as the first tensor that needs copy-back, which can be an INOUT tensor. Store argument direction in TensorPair or otherwise identify the real first OUT before applying graph_output_ptr.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp` around lines 979 -
987, The packed-output path in runtime_maker.cpp is using graph_out_size
directly and treating the first copy-back tensor as the packed-output target,
which can write past pair.host_ptr or attach graph_output_ptr to an INOUT
tensor. Update the copy logic in the host graph runtime to validate that
graph_out_size never exceeds pair.size before calling
runtime->host_api.copy_from_device, and use TensorPair metadata to distinguish
OUT from INOUT so the packed buffer is only applied to the real first output
tensor. Keep the fix localized around the first_output_tensor handling and the
copy_from_device call in the output-copy loop.
src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp-748-748 (1)

748-748: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Check the SVM registration result before host orchestration dereferences tensor addresses.

The host orchestrator later dereferences buffer.addr directly. If registration fails, or returns a non-identity host VA, this path can crash or read unmapped memory.

Proposed fix
-        svm_register_via_runner(dev_ptr, size);
+        void *mapped_ptr = svm_register_via_runner(dev_ptr, size);
+        if (size != 0 && mapped_ptr == nullptr) {
+            LOG_ERROR("Failed to SVM-register tensor %d", i);
+            runtime->host_api.device_free(dev_ptr);
+            return -1;
+        }
+        if (mapped_ptr != nullptr && mapped_ptr != dev_ptr) {
+            LOG_ERROR("SVM mapping for tensor %d is non-identity: dev=%p host=%p", i, dev_ptr, mapped_ptr);
+            svm_unregister_via_runner(dev_ptr);
+            runtime->host_api.device_free(dev_ptr);
+            return -1;
+        }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp` at line 748, The
SVM path in runtime_maker’s host orchestration should not assume registration
always succeeds or preserves the original address. Capture and validate the
result of svm_register_via_runner before any later use of buffer.addr, and fail
fast or skip orchestration if registration does not succeed. Also verify the
returned host virtual address matches the expected identity mapping before
dereferencing tensor addresses, so the later buffer.addr access in the host
orchestration path cannot touch unmapped memory.
src/a2a3/runtime/host_build_graph/runtime/orchestrator_core/pto_orchestrator.cpp-454-456 (1)

454-456: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Stop preparation if scope_tasks_push() latches fatal.

When scope_tasks_push() overflows, it records a fatal error but prepare_task() still returns true, allowing the task to be registered and wired without a scope-release entry.

Suggested fix
     // fanin_count is set by scheduler during wiring
     scope_tasks_push(orch, out->slot_state);
 
-    return true;
+    return !orch->fatal;
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@src/a2a3/runtime/host_build_graph/runtime/orchestrator_core/pto_orchestrator.cpp`
around lines 454 - 456, scope_tasks_push() can latch a fatal overflow error, but
prepare_task() still returns true and continues wiring the task. Update
prepare_task() in pto_orchestrator.cpp to check the result/state after calling
scope_tasks_push(orch, out->slot_state) and return false immediately if a fatal
was latched, so the task is not registered without a matching scope-release
entry.
src/a2a3/runtime/host_build_graph/orchestration/common.cpp-56-63 (1)

56-63: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Do not pass an unescaped shared-object path through the shell.

dl_info.dli_fname is interpolated into a popen() command. A library path containing shell metacharacters can alter the diagnostic command; avoid the shell or quote the path.

Suggested fix
+static std::string shell_quote(const char *s) {
+    std::string out = "'";
+    for (; *s; ++s) {
+        out += (*s == '\'') ? "'\\''" : std::string(1, *s);
+    }
+    out += "'";
+    return out;
+}
+
 static std::string addr_to_line(const char *executable, void *addr, std::string *inline_chain = nullptr) {
-    char cmd[512];
-    snprintf(cmd, sizeof(cmd), "addr2line -e %s -f -C -p -i %p 2>/dev/null", executable, addr);
+    std::string cmd = "addr2line -e " + shell_quote(executable) + " -f -C -p -i ";
+    char addr_buf[32];
+    snprintf(addr_buf, sizeof(addr_buf), "%p", addr);
+    cmd += addr_buf;
+    cmd += " 2>/dev/null";
@@
-    FILE *pipe = popen(cmd, "r");
+    FILE *pipe = popen(cmd.c_str(), "r");
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/orchestration/common.cpp` around lines 56 -
63, The addr_to_line helper currently interpolates the executable/shared-object
path directly into the popen() shell command, which can be unsafe when
dl_info.dli_fname contains shell metacharacters. Update addr_to_line to avoid
invoking the shell with an unescaped path—either switch to a non-shell process
execution approach or properly quote/escape the executable argument before
building the addr2line command, and keep the existing inline_chain/raw_output
behavior intact.
src/a2a3/runtime/host_build_graph/runtime/orchestrator_core/pto_orchestrator.cpp-751-760 (1)

751-760: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Reject unsupported explicit dependency rings before resolving by local id.

dep_ring_id is captured but the lookup always uses orch->sm_header->ring; with PTO2_MAX_RING_DEPTH == 1, a task id from another ring can alias a live ring-0 local id and create the wrong edge.

Suggested fix
         }
         uint8_t dep_ring_id = dep_task_id.ring();
+        if (dep_ring_id >= PTO2_MAX_RING_DEPTH) {
+            orch->report_fatal(
+                PTO2_ERROR_INVALID_ARGS, __FUNCTION__, "Explicit dependency references unsupported ring %u",
+                static_cast<unsigned>(dep_ring_id)
+            );
+            return result;
+        }
         PTO2SharedMemoryRingHeader &dep_ring = orch->sm_header->ring;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@src/a2a3/runtime/host_build_graph/runtime/orchestrator_core/pto_orchestrator.cpp`
around lines 751 - 760, The dependency resolution in the fan-in path is using
orch->sm_header->ring even though dep_task_id carries a dep_ring_id, which can
incorrectly alias tasks from unsupported rings. In the PTO2 orchestrator logic,
reject any dependency whose ring does not match the supported ring-0 setup
before resolving by local task id, and only proceed to get_slot_by_task_id and
append_fanin_or_fail for valid same-ring dependencies.
src/a2a3/runtime/host_build_graph/runtime/scheduler/scheduler_cold_path.cpp-697-735 (1)

697-735: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Validate scheduler thread and core topology counts before indexing.

active_sched_threads_ is used as a divisor and as an index bound, and aiv_worker_ids_[2 * ci + 1] assumes two AIVs per AIC. Add explicit checks for active_sched_threads_ > 0 && <= MAX_AICPU_THREADS and aiv_count_ >= 2 * aic_count_ before these loops.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/runtime/scheduler/scheduler_cold_path.cpp`
around lines 697 - 735, The round-robin core assignment in
scheduler_cold_path.cpp assumes valid topology counts but does not validate them
before division or indexing. In the scheduling setup around
active_sched_threads_, clusters_per_thread, and the
aic_worker_ids_/aiv_worker_ids_ lookups, add explicit guards that
active_sched_threads_ is greater than 0 and no larger than MAX_AICPU_THREADS,
and that aiv_count_ is at least 2 * aic_count_ before entering the loops. Keep
the existing logic in place after these checks so the indexing of
aiv_worker_ids_[2 * ci] and aiv_worker_ids_[2 * ci + 1] is always safe.
src/a2a3/runtime/host_build_graph/runtime/scheduler/scheduler_cold_path.cpp-959-969 (1)

959-969: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Initialize AIV sub-block IDs for active_sched_threads_, not sched_thread_num_.

When sched_thread_num_ <= 0, assign_cores_to_threads() uses all AICPU threads, but this loop runs zero iterations and leaves AIV1 sub_block_id at the memset default 0 instead of 1.

Proposed fix
-    for (int32_t t = 0; t < sched_thread_num_; t++) {
+    for (int32_t t = 0; t < active_sched_threads_; t++) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/runtime/scheduler/scheduler_cold_path.cpp`
around lines 959 - 969, The AIV sub-block initialization loop is using
sched_thread_num_ instead of the actual active thread count, so it can skip
setup when sched_thread_num_ <= 0 and leave AIV1 sub_block_id at the default
value. Update the loop in scheduler_cold_path.cpp to iterate over
active_sched_threads_ (the same thread set used by assign_cores_to_threads()),
and keep the existing payload_per_core_ / CoreTracker initialization for aiv0_id
and aiv1_id so both AIV sub-block IDs are always assigned correctly.
src/a2a3/runtime/host_build_graph/runtime/scheduler/scheduler_completion.cpp-558-569 (1)

558-569: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Guard the 32-thread drain barrier mask.

1u << active_sched_threads_ is undefined when active_sched_threads_ == 32, so a max-width scheduler can compute a bad all_acked mask and hang in drain mode.

Proposed fix
-    uint32_t all_acked = (1u << active_sched_threads_) - 1;
+    uint32_t all_acked = (active_sched_threads_ >= 32) ? UINT32_MAX : ((1u << active_sched_threads_) - 1);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/runtime/scheduler/scheduler_completion.cpp`
around lines 558 - 569, The drain barrier mask in scheduler_completion.cpp is
built with a 32-bit shift that becomes undefined when active_sched_threads_ can
reach 32. Update the mask logic in the drain/ack path around the all_acked
computation and the drain_ack_mask checks so it safely handles the max-width
scheduler case, using a width-safe mask construction that avoids shifting by 32
while preserving the existing ack barrier behavior in the completion loop.
🟡 Minor comments (11)
src/a2a3/runtime/host_build_graph/runtime/runtime.h-29-30 (1)

29-30: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Rename the header guard to SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_RUNTIME_H_. It collides with src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.h, so including both headers in one TU can skip one definition.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/runtime/runtime.h` around lines 29 - 30,
The header guard in runtime.h is colliding with the one used by the
tensormap_and_ringbuffer runtime header, so update the include guard symbol in
this file to SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_RUNTIME_RUNTIME_H_. Make sure the
`#ifndef/`#define pair in the runtime.h header matches the new unique guard name
so both headers can be included in the same translation unit without one being
skipped.
src/a2a3/runtime/host_build_graph/host/dep_gen_replay.h-73-74 (1)

73-74: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Fix the header guard to match host_build_graph.

The current guard uses TENSORMAP_AND_RINGBUFFER, which risks colliding with a sibling runtime header and skipping this declaration.

Proposed fix
-#ifndef SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_HOST_DEP_GEN_REPLAY_H_
-#define SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_HOST_DEP_GEN_REPLAY_H_
+#ifndef SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_HOST_DEP_GEN_REPLAY_H_
+#define SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_HOST_DEP_GEN_REPLAY_H_
@@
-#endif  // SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_HOST_DEP_GEN_REPLAY_H_
+#endif  // SRC_A2A3_RUNTIME_HOST_BUILD_GRAPH_HOST_DEP_GEN_REPLAY_H_

Also applies to: 106-106

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/host/dep_gen_replay.h` around lines 73 -
74, The header guard in dep_gen_replay.h is using the wrong identifier and can
collide with a sibling runtime header, so update the guard to match the
host_build_graph path instead of TENSORMAP_AND_RINGBUFFER. Make sure both the
`#ifndef` and `#define` symbols in the header guard are changed consistently in this
file so the HostDepGenReplay declaration is not skipped by the preprocessor.
src/a2a3/runtime/host_build_graph/runtime/orchestrator_core/pto_ring_buffer.cpp-108-112 (1)

108-112: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Avoid advertising a new runtime tuning knob in the deadlock diagnostics.

These messages direct users to PTO2_RING_DEP_POOL; if this PR is introducing that env/config surface, it needs measured justification in a separate change. Otherwise, keep the remediation to the compile-time constant to avoid documenting an unsupported knob. Based on learnings, new configuration knobs or environment-variable overrides should not be introduced speculatively without a measured workload demonstrating the need.

Also applies to: 174-177

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@src/a2a3/runtime/host_build_graph/runtime/orchestrator_core/pto_ring_buffer.cpp`
around lines 108 - 112, The deadlock diagnostics in pto_ring_buffer.cpp are
advertising a new runtime tuning knob that should not be introduced here. Update
the LOG_ERROR messages in the spill-pool capacity check so they only point to
the compile-time constant (PTO2_DEP_LIST_POOL_SIZE in pto_runtime2_types.h) and
remove the PTO2_RING_DEP_POOL runtime env reference from the diagnostics; also
apply the same change to the related LOG_ERROR block at the other noted
location.

Source: Learnings

src/a2a3/runtime/host_build_graph/runtime/scheduler/pto_scheduler.h-792-795 (1)

792-795: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Keep the wiring-deadlock remediation to supported configuration surfaces.

This diagnostic tells users to raise PTO2_RING_DEP_POOL; if that runtime knob is new in this PR, it needs measured justification separately. Otherwise, point only at the compile-time pool sizing constant. Based on learnings, new configuration knobs or environment-variable overrides should not be introduced speculatively without a measured workload demonstrating the need.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/runtime/scheduler/pto_scheduler.h` around
lines 792 - 795, The deadlock diagnostic in the scheduler’s remediation path
should not point users at an unsupported runtime knob; update the LOG_ERROR
message in pto_scheduler so it only references the compile-time pool sizing
constant, and remove the suggestion to raise PTO2_RING_DEP_POOL unless that
configuration surface is explicitly supported and justified elsewhere. Keep the
guidance aligned with the existing fanout/dependency-pool check and the
surrounding open-scope remediation text.

Source: Learnings

src/a2a3/runtime/host_build_graph/runtime/scheduler/pto_scheduler.cpp-51-65 (1)

51-65: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Include <utility> for std::exchange. std::exchange is declared in <utility>; adding it avoids relying on transitive includes that can break builds.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/runtime/scheduler/pto_scheduler.cpp` around
lines 51 - 65, The scheduler_get_profiling function uses std::exchange but the
file does not explicitly include the header that declares it, so add the proper
include for <utility> near the other includes in pto_scheduler.cpp. Keep the
existing logic in scheduler_get_profiling unchanged, and make sure the new
include directly covers all std::exchange calls used to reset the profiling
counters.
src/a2a3/runtime/host_build_graph/docs/device_log_profiling.md-89-90 (1)

89-90: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Don’t hardcode a 3-thread scheduler in the host_build_graph guide.

The note above correctly says this runtime is scheduler-only on device, but Block 2 and the grep example still assume only Threads 0/1/2 emit scheduler logs. On a 4-thread host_build_graph run that hides Thread 3’s scheduler output.

Also applies to: 173-174

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/docs/device_log_profiling.md` around lines
89 - 90, Update the host_build_graph device_log_profiling guide to avoid
hardcoding a 3-thread scheduler assumption in the text and grep example. In the
affected section, revise the wording around the scheduler summaries so it refers
to all scheduler threads generically instead of only Thread 0/1/2, and adjust
the example filter to include any scheduler thread output so Thread 3 is not
omitted in 4-thread runs.
src/a2a3/runtime/host_build_graph/docs/SCALAR_DATA_ACCESS.md-17-20 (1)

17-20: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Match the documented set_tensor_data signature to the real API.

The wrapper exposed in pto_orchestration_api.h takes const Tensor&, not Tensor&. Documenting the mutable form is misleading here because TaskOutputTensors::get_ref() commonly yields a const Tensor&, and readers will assume that path is unsupported when it is actually the intended usage.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/docs/SCALAR_DATA_ACCESS.md` around lines 17
- 20, The documented set_tensor_data signature is using the mutable Tensor&
form, but the actual API in pto_orchestration_api.h accepts const Tensor&.
Update the SCALAR_DATA_ACCESS docs to match the real wrapper signature and
reflect that TaskOutputTensors::get_ref() returning const Tensor& is the
intended usage; use the set_tensor_data symbol as the reference point when
editing the documentation.
src/a2a3/runtime/host_build_graph/docs/RUNTIME_LOGIC.md-784-788 (1)

784-788: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Use the host_build_graph runtime in the example config.

The example still sets RUNTIME_CONFIG["runtime"] to tensormap_and_ringbuffer, so copying it verbatim selects the wrong runtime.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/docs/RUNTIME_LOGIC.md` around lines 784 -
788, The example runtime config is using the wrong runtime value, so update the
RUNTIME_CONFIG snippet in the host_build_graph docs to use the host_build_graph
runtime instead of tensormap_and_ringbuffer. Keep the change localized to the
example configuration block and ensure the runtime key clearly matches the
host_build_graph workflow shown in RUNTIME_LOGIC.md.
src/a2a3/runtime/host_build_graph/docs/RUNTIME_LOGIC.md-94-98 (1)

94-98: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Keep the variant-specific behavior scoped consistently.

These sections still describe device-orchestrator mechanics as if they apply to host_build_graph (Thread 3 as orchestrator, on-device orchestration SO load, the single-ring note under tensormap_and_ringbuffer). That contradicts the scheduler-only host-orchestration model established earlier in the file and will send readers to the wrong execution path.

Also applies to: 547-555, 691-697

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/docs/RUNTIME_LOGIC.md` around lines 94 -
98, The host_build_graph docs still mix in device-orchestrator and on-device
execution details that do not apply to the scheduler-only host model. Update the
affected sections in RUNTIME_LOGIC.md, especially the references around the
Thread 3 orchestrator, the on-device orchestration SO load, and the
tensormap_and_ringbuffer single-ring note, so they consistently describe
host_build_graph behavior only and align with the earlier host-orchestration
model.
src/a2a3/runtime/host_build_graph/docs/RUNTIME_LOGIC.md-393-409 (1)

393-409: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Fix the task-state numeric mismatch.

Section 6 defines CONSUMED as state 2, but the scheduler section later refers to CONSUMED (4). These numeric values are exactly what people use when reading dumps/logs, so the inconsistency will mislead debugging.

Also applies to: 587-598

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/docs/RUNTIME_LOGIC.md` around lines 393 -
409, Update the task-state documentation in the scheduler/runtime logic docs so
the numeric value for CONSUMED is consistent everywhere. In the Task State
Machine section, align the later reference to CONSUMED with the state definition
used by task_state[] and the numbered states in the same document, and verify
any related mentions in the scheduler description (including the other
referenced section) use the same symbol-to-number mapping.
src/a2a3/runtime/host_build_graph/docs/profiling_levels.md-82-145 (1)

82-145: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Split the profiling output examples by runtime variant.

The file-level note says host_build_graph never emits on-device orchestrator timing lines, but the Level 1 output/count sections still present orch_* logs and N_orch-based formulas as the normative output. In a host_build_graph doc, that will send readers looking for device-log lines that cannot exist.

Also applies to: 435-441

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/docs/profiling_levels.md` around lines 82 -
145, The Level 1 profiling section mixes host_build_graph output with on-device
orch timing lines, which makes the documented counts and examples misleading.
Rework the `profiling_levels.md` Level 1 content so the `orch_*` logs and
`N_orch` formulas are clearly separated into the runtime variant where they
actually apply, and keep the host_build_graph-facing output focused on the
scheduler-side lines. Update the `orch_to_sched_` note and the `LOG_INFO_V9
count` formulas to match that split, and ensure the example blocks reflect only
the variant they describe.
🧹 Nitpick comments (3)
src/a2a3/runtime/host_build_graph/runtime/pto_ring_buffer.h (1)

463-470: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Avoid advertising new runtime env knobs in fatal diagnostics.

These logs prescribe PTO2_RING_HEAP, PTO2_RING_TASK_WINDOW, and PTO2_RING_DEP_POOL. If these are new in this PR, remove the env guidance or split it into a measured follow-up; otherwise ensure the knobs are already supported and documented. Based on learnings, “do not introduce new configuration knobs or environment-variable overrides speculatively” and require measured workload justification for new tunables. <retrieved_learnings>

Also applies to: 529-532, 712-715

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/runtime/pto_ring_buffer.h` around lines 463
- 470, The fatal diagnostics in the ring buffer code are advertising new
environment-variable tunables that should not be introduced speculatively.
Update the LOG_ERROR messages in pto_ring_buffer.h at the ring heap, task
window, and dep pool failure paths to remove the PTO2_RING_HEAP,
PTO2_RING_TASK_WINDOW, and PTO2_RING_DEP_POOL guidance unless those knobs are
already implemented and documented; if not, keep the message focused on the
failure and any already-supported recovery options. Use the existing error
handling blocks around the ring buffer checks to locate and adjust the messages
consistently.

Source: Learnings

src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp (1)

228-230: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Avoid adding new env-driven runtime knobs without measured justification.

These PTO2_RING_* and PTO2_ORCH_TO_SCHED overrides add new configurable surface. Prefer the existing per-call/runtime config path or constants unless this PR includes a measured workload proving the need. Based on learnings, “do not introduce new configuration knobs or environment-variable overrides … speculatively.”

Also applies to: 758-764

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp` around lines 228 -
230, These new PTO2_RING_* environment-driven overrides add speculative runtime
knobs and should be removed unless backed by measured workload evidence. Update
runtime_maker.cpp to avoid calling apply_env_ring_values for
PTO2_RING_TASK_WINDOW, PTO2_RING_HEAP, and PTO2_RING_DEP_POOL (and the similar
PTO2_ORCH_TO_SCHED usage elsewhere), and instead rely on the existing
per-call/runtime configuration path or fixed constants. Keep the surrounding
runtime initialization logic intact, using the existing symbols like
apply_env_ring_values and eff_task_window_sizes only where the configuration is
already justified.

Source: Learnings

src/a2a3/runtime/host_build_graph/runtime/scheduler/scheduler_cold_path.cpp (1)

13-14: 🎯 Functional Correctness | 🔵 Trivial | 💤 Low value

Include <cstring> directly in scheduler_cold_path.cpp so the memset calls here don’t depend on runtime.h’s transitive #include <string.h>.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/host_build_graph/runtime/scheduler/scheduler_cold_path.cpp`
around lines 13 - 14, The memset usage in scheduler_cold_path.cpp should not
rely on runtime.h’s transitive include of string.h. Add a direct <cstring>
include in scheduler_cold_path.cpp alongside the existing includes so the memset
calls are self-contained and the dependency is explicit.

Comment thread src/a2a3/runtime/host_build_graph/runtime/pto_async_wait.h
Comment thread src/a2a3/runtime/host_build_graph/runtime/pto_dep_compute.h
Comment thread src/a2a3/runtime/host_build_graph/runtime/pto_tensormap.h
Comment thread src/a2a3/runtime/host_build_graph/runtime/scheduler/pto_scheduler.h
@ChaoWao ChaoWao force-pushed the worktree-foamy-wondering-thimble branch 6 times, most recently from 2cfb5ee to df205c9 Compare July 3, 2026 00:54
Introduces the host_build_graph (hbg) runtime: the host-orchestration
variant of tensormap_and_ringbuffer. The orchestrator runs on the host to
completion — building the whole task graph, relocating it to device
addresses, and H2D'ing the image — then the device boots scheduler-only.
Only the orchestration timing/location differs from tensormap; the data
structures and dispatch are shared.

Pipeline (host): stage tensors (device alloc + H2D) and SVM-map them so the
host can read control tensors directly; dlopen the orchestration .so and run
it; wire fanout inline during submit (build dep_pool/fanout_head + seed the
ready queue on the host); relocate every cross-task pointer to its final
device address (range-based, two-delta: SM region + arena region); H2D the
populated SM + arena. Device: attach the already-device-addressed image and
dispatch from the seeded ready queue; no on-device orchestrator, no pointer
fixup, no execution-time reclaim. Completion is tracked by completed_tasks_,
independent of last_task_alive.

Simplifications enabled by host-orchestration:
- No device-orch path (the device boots scheduler-only).
- No execution-time reclaim: the whole graph is host-resident and runs once,
  so advance_ring_pointers / reset_for_reuse / the COMPLETED->CONSUMED flip
  are removed. Consumer waits key on fanout_refcount.
- Single ring: with no reclaim and a whole-graph-resident ring, the
  per-scope-depth multi-ring split is gone (PTO2_MAX_RING_DEPTH == 1);
  rings[]/ring_id are physically removed. All scope depths map to ring 0.
- Host reads device control tensors (e.g. paged_attention's context_lens /
  block_table) via SVM (a2a3 halHostRegister identity map; sim is already a
  host pointer), so get_tensor_data dereferences buffer.addr directly.

The SVM map/unmap is exposed through the per-thread DeviceRunner bridge, not
the Runtime.host_api struct, so the HostApi ABI shared with
tensormap_and_ringbuffer / a5 is untouched. tensormap_and_ringbuffer and a5
runtimes are not modified.

Verified on a2a3 onboard: 12 passed / 1 skipped (host_build_graph suite,
incl. paged_attention reading control tensors via SVM).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant