Category
Technical Debt (cleanup, refactor)
Component
Host Runtime (with AICPU Scheduler on the read side)
Description
PTO2_SCHEDULER_TIMEOUT_MS is a per-device, run-invariant value (the AICPU scheduler no-progress watchdog). Although it is semantically per-device config, it is currently carried as a field of the per-run runtime arena layout (PTO2RuntimeArenaLayout::scheduler_timeout_ms) and re-transmitted on every run as part of the host-to-device (H2D) copy of the full arena image.
This is a structural mismatch with two concrete downsides:
- The layout becomes a dumping ground. PTO2RuntimeArenaLayout describes the per-run arena (ring sizes, tensor_map, scope caps: things that genuinely change per run). A per-device watchdog timeout has nothing to do with ring/tensor layout. Every future per-device knob that "just rides the layout" compounds this.
- Read path is per-run for a value that never changes per run. The host re-reads the env (resolve_scheduler_timeout_ms()) every run and re-writes it into the freshly-rebuilt arena image; the device re-reads it from rt_->prebuilt_layout on every boot.
Ring sizes (PTO2_RING_*) legitimately belong in the layout (they are per-run). The mismatch is only for run-invariant per-device config like the scheduler timeout.
There is now a purpose-built channel for exactly this: InitArgs. A recent refactor introduced InitArgs (src/a5/platform/include/common/kernel_args.h:130), documented verbatim as "per-device one-shot invariants ... uploaded once at worker init via the simpler_aicpu_init entry, before any register_callable/exec launch ... so they no longer ride on the per-run KernelArgs: latched once into the resident AICPU SO globals and surviving every subsequent per-task launch." It currently carries device_id, log_level, log_info_v. The scheduler timeout is the same category of value and belongs here.
- Host send: ensure_aicpu_init_launched() (src/common/platform/onboard/host/device_runner_base.cpp:364) fills InitArgs (:374) and launches KernelNames::InitName exactly once per runner, guarded by aicpu_init_launched_ (:380, aicpu_num=1).
- Device latch precedent: InitArgs.log_info_v is latched into the resident AICPU global g_log_info_v (src/common/platform/onboard/aicpu/device_log.cpp:36), "latched once per device ... not re-pushed per run."
This supersedes an earlier note in this issue that claimed there was no transmit-once channel — that reasoning only considered the per-run launch tier. InitArgs is a genuine transmit-once-per-device path.
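The device-side half of this channel can be sketched as follows. This is a minimal illustration, not the actual code: InitArgsSketch, simpler_aicpu_init_sketch, g_scheduler_timeout_ms, and no_progress_timed_out are hypothetical stand-ins, the field types are guesses, and the comparison used by the watchdog is an assumption about how the scheduler consumes the value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of InitArgs. The real struct lives in kernel_args.h and
// its exact field types are not shown in this issue.
struct InitArgsSketch {
    int32_t device_id;
    int32_t log_level;
    int32_t log_info_v;
    uint32_t scheduler_timeout_ms;  // proposed new field
};

// Resident global in the AICPU SO, analogous to g_log_info_v.
static uint32_t g_scheduler_timeout_ms = 0;

// Stand-in for the simpler_aicpu_init entry: runs once per device at worker
// init and copies the per-device invariants into resident globals.
void simpler_aicpu_init_sketch(const InitArgsSketch& args) {
    g_scheduler_timeout_ms = args.scheduler_timeout_ms;
}

// What the scheduler would read instead of rt_->prebuilt_layout.
uint32_t latched_scheduler_timeout_ms() {
    return g_scheduler_timeout_ms;
}

// Illustrative watchdog check (assumed semantics): no progress for at least
// the configured timeout counts as expired.
bool no_progress_timed_out(uint64_t idle_ms) {
    return idle_ms >= latched_scheduler_timeout_ms();
}
```

Because the global is written only at init, per-run code paths treat it as read-only and the arena layout no longer needs to carry it.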
Location
Current placement to remove:
- src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h:117 (scheduler_timeout_ms field in PTO2RuntimeArenaLayout)
- src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp:248 (per-run resolve_scheduler_timeout_ms() env read, written into the layout at :499)
- src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp:606 (device read of rt_->prebuilt_layout.scheduler_timeout_ms)
Target channel to reuse:
- InitArgs struct: src/a5/platform/include/common/kernel_args.h:130
- Host one-shot launch: src/common/platform/onboard/host/device_runner_base.cpp:364 (ensure_aicpu_init_launched)
- Device latch precedent: src/common/platform/onboard/aicpu/device_log.cpp:36 (g_log_info_v)
a2a3 mirrors under src/a2a3/...; sim variants under src/*/platform/sim/....
Proposed Fix
Recommended: carry scheduler_timeout_ms in InitArgs (the per-device one-shot channel that already exists for device_id / log config):
- Add uint32_t scheduler_timeout_ms; to InitArgs (kernel_args.h).
- Host: in ensure_aicpu_init_launched(), stamp init_args.scheduler_timeout_ms from the env value resolved once at init. resolve_onboard_timeout_config() already reads the scheduler env at attach for ordering validation and currently discards it; retain that value instead of discarding it. The per-run getenv in runtime_maker is then deleted.
- Device: simpler_aicpu_init latches it into a resident AICPU SO global (next to the device_id / g_log_info_v latches).
- Scheduler: scheduler_dispatch.cpp reads that global instead of rt_->prebuilt_layout.scheduler_timeout_ms.
- Remove scheduler_timeout_ms from PTO2RuntimeArenaLayout and the per-run resolve_scheduler_timeout_ms().
- Apply symmetrically across the four quadrants (onboard/sim x a5/a2a3).
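The host-side steps above might look roughly like the sketch below. It assumes the timeout has already been resolved at attach and is handed to the runner; DeviceRunnerSketch, InitArgsSketch, and the injected launch callback are hypothetical names used only to show the one-shot guard, and the real ensure_aicpu_init_launched() launches KernelNames::InitName rather than calling a callback.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical, trimmed-down InitArgs (the real one also carries log config).
struct InitArgsSketch {
    int32_t device_id;
    uint32_t scheduler_timeout_ms;  // proposed new field
};

class DeviceRunnerSketch {
public:
    // Stand-in for the real init-kernel launch.
    using LaunchFn = std::function<void(const InitArgsSketch&)>;

    DeviceRunnerSketch(int32_t device_id, uint32_t scheduler_timeout_ms, LaunchFn launch)
        : device_id_(device_id),
          scheduler_timeout_ms_(scheduler_timeout_ms),
          launch_(std::move(launch)) {
        assert(launch_ && "a launch callback is required");
    }

    // Fills InitArgs and launches the init entry at most once per runner.
    void ensure_aicpu_init_launched() {
        if (aicpu_init_launched_) {
            return;  // already uploaded; later runs do not re-send the value
        }
        InitArgsSketch init_args{};
        init_args.device_id = device_id_;
        init_args.scheduler_timeout_ms = scheduler_timeout_ms_;
        launch_(init_args);
        aicpu_init_launched_ = true;
    }

private:
    int32_t device_id_;
    uint32_t scheduler_timeout_ms_;  // resolved once at attach, kept for init
    LaunchFn launch_;
    bool aicpu_init_launched_ = false;
};
```

Calling ensure_aicpu_init_launched() before every run is then harmless: only the first call uploads the value, so nothing per-run needs to carry it.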
This is a true transmit-once-per-device path: the value leaves the per-run arena and the per-run KernelArgs entirely, is uploaded once at init, latched into AICPU SO globals, and consumed read-only by every subsequent run — exactly how device_id / log config already work. No new device buffer, no per-run pointer, no per-run getenv. InitArgs being strictly per-device (vs per-callable) means there is not even a re-stamp concern.
No new env gate is introduced — PTO2_SCHEDULER_TIMEOUT_MS already exists; only its landing/transport changes. Existing per-case tests that set different values (tests/st/runtime_fatal_codes, tests/st/aicore_op_timeout) are per-process and set the env before init, so an init-time read does not break them.
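An init-time read of the existing env variable could be sketched as below. The parsing rules and the fallback parameter are assumptions for illustration; the real default and validation (including the ordering check in resolve_onboard_timeout_config()) are not shown in this issue, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <limits>

// Parses a decimal millisecond value; returns `fallback` for missing,
// malformed, signed, or out-of-range input. (Assumed policy, not the real one.)
uint32_t parse_timeout_ms_sketch(const char* text, uint32_t fallback) {
    if (text == nullptr || !std::isdigit(static_cast<unsigned char>(*text))) {
        return fallback;  // unset, empty, or does not start with a digit
    }
    errno = 0;
    char* end = nullptr;
    const unsigned long value = std::strtoul(text, &end, 10);
    if (errno != 0 || *end != '\0') {
        return fallback;  // overflow or trailing garbage such as "12ms"
    }
    if (value > std::numeric_limits<uint32_t>::max()) {
        return fallback;
    }
    return static_cast<uint32_t>(value);
}

// Reads PTO2_SCHEDULER_TIMEOUT_MS once per process; later calls return the
// cached value even if the environment changes afterwards.
uint32_t resolve_scheduler_timeout_once_sketch(uint32_t fallback) {
    static const uint32_t cached =
        parse_timeout_ms_sketch(std::getenv("PTO2_SCHEDULER_TIMEOUT_MS"), fallback);
    return cached;
}
```

Since the tests mentioned above set the variable before init and run one case per process, a cached read like this would see the intended value in each case.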
Alternatives considered (inferior, kept for the record): an inline scalar in the per-run KernelArgs (fixes categorization but stays per-run); a separate persistent device buffer modeled on device_wall_dev_ptr_ (data once, but the pointer still free-rides KernelArgs per run — only worthwhile for a large/growing config blob); or the per-callable RegisterCallableArgs register tier (transmit-once but per-callable, so less clean than the strictly per-device InitArgs).
Priority
Low (no impact today, good to fix eventually)