refactor(runtime): latch scheduler timeout per-device via InitArgs, drop it from per-run layout#1223
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR relocates the AICPU scheduler no-progress watchdog timeout from the per-run runtime arena sizing layout to a per-device ChangesScheduler Timeout Channel Migration
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request refactors how the AICPU scheduler no-progress watchdog timeout override is configured and propagated. Instead of embedding scheduler_timeout_ms in the prebuilt runtime arena layout (PTO2RuntimeArenaLayout), it is now passed per-device via InitArgs during initialization (simpler_aicpu_init) and stored in a resident global variable. This change is applied across both a2a3 and a5 platforms, including their respective simulation runners. There are no review comments to address, and the implementation looks clean and consistent.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
9cbd328 to
b7f1592
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/a5/platform/sim/host/device_runner.cpp`:
- Around line 391-400: The scheduler-timeout latch is being executed from run()
on every invocation instead of once during device initialization. Move the
resolve_runtime_timeout_config and set_scheduler_timeout_ms_func_ call into the
existing aicpu_so_loaded_-guarded init path, alongside the other one-time setup
in device_runner::run/ensure_aicpu_init_launched, so the env is parsed and the
SO boundary crossed only once per device.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e99d221f-5eca-4df6-befb-b7c03385b5fb
📒 Files selected for processing (18)
src/a2a3/platform/include/common/kernel_args.hsrc/a2a3/platform/onboard/aicpu/kernel.cppsrc/a2a3/platform/sim/host/device_runner.cppsrc/a2a3/platform/sim/host/device_runner.hsrc/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cppsrc/a5/platform/include/common/kernel_args.hsrc/a5/platform/onboard/aicpu/kernel.cppsrc/a5/platform/sim/host/device_runner.cppsrc/a5/platform/sim/host/device_runner.hsrc/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cppsrc/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.hsrc/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cppsrc/common/platform/include/aicpu/aicpu_device_config.hsrc/common/platform/include/host/runtime_timeout_config.hsrc/common/platform/onboard/host/device_runner_base.cppsrc/common/platform/shared/aicpu/aicpu_device_config.cpp
b7f1592 to
faae3d3
Compare
…rop it from per-run layout PTO2_SCHEDULER_TIMEOUT_MS is a per-device, run-invariant value (the AICPU scheduler no-progress watchdog), yet it rode the per-run runtime arena layout (PTO2RuntimeArenaLayout::scheduler_timeout_ms) and was re-read from env + re-transmitted on every run. Move it onto the per-device one-shot InitArgs channel — the same path device_id / log config already use — so it is read once at init, latched into the resident AICPU SO global, and consumed read-only by every run. - InitArgs gains scheduler_timeout_ms (a5 + a2a3); host stamps it once in ensure_aicpu_init_launched from the timeout config resolved at attach. - resolve_onboard_timeout_config keeps the scheduler override now (0 = no override -> device keeps compile-time SCHEDULER_TIMEOUT_CYCLES). - New set/get_scheduler_timeout_ms in platform_regs (mirrors set_orch_device_id); simpler_aicpu_init latches it; scheduler_dispatch reads the resident-SO global instead of rt_->prebuilt_layout. - Remove scheduler_timeout_ms from PTO2RuntimeArenaLayout and the per-run resolve_scheduler_timeout_ms() in runtime_maker (both runtimes). - sim: dlsym set_scheduler_timeout_ms and honor PTO2_SCHEDULER_TIMEOUT_MS at run (the override used to flow through the shared runtime_maker). No new env gate; PTO2_SCHEDULER_TIMEOUT_MS behavior is preserved. Verified: all four quadrants (onboard/sim x a5/a2a3) build; runtime_fatal_codes scheduler_timeout passes on a5sim + a2a3sim; a5 trb st suite (20) green on a5sim. Closes hw-native-sys#1220
…of platform_regs
orch_device_id is the ACL device ordinal latched once per device by
simpler_aicpu_init (from InitArgs.device_id) and read by the AICPU executor to
make the staged orchestration SO filename unique per device. It is a per-device
run-invariant knob — the same category as scheduler_timeout — and has nothing to
do with per-core register addressing, so platform_regs was the wrong home.
Now that aicpu_device_config exists as the dedicated home for per-device AICPU
config latched by simpler_aicpu_init, move set/get_orch_device_id + the global
there, alongside set/get_scheduler_timeout_ms. platform_regs is left strictly for
per-core register access.
- aicpu_device_config.{h,cpp}: add set/get_orch_device_id + g_orch_device_id.
- platform_regs.{h,cpp} (a5 + a2a3): remove them.
- aicpu_executor.cpp (a5 + a2a3): include aicpu_device_config.h for the
get_orch_device_id consumer (still needs platform_regs.h for get_platform_regs).
- kernel.cpp / sim device_runner: unchanged — the symbol name is identical, only
its defining TU moved (same AICPU SO; sim dlsym still resolves it).
No behavior change. Verified: all four quadrants build; runtime_fatal_codes
scheduler_timeout (which exercises the register->orch-SO path) passes on
a5sim + a2a3sim.
Stacked on hw-native-sys#1223 (which introduces aicpu_device_config).
…of platform_regs (#1228) orch_device_id is the ACL device ordinal latched once per device by simpler_aicpu_init (from InitArgs.device_id) and read by the AICPU executor to make the staged orchestration SO filename unique per device. It is a per-device run-invariant knob — the same category as scheduler_timeout — and has nothing to do with per-core register addressing, so platform_regs was the wrong home. Now that aicpu_device_config exists as the dedicated home for per-device AICPU config latched by simpler_aicpu_init, move set/get_orch_device_id + the global there, alongside set/get_scheduler_timeout_ms. platform_regs is left strictly for per-core register access. - aicpu_device_config.{h,cpp}: add set/get_orch_device_id + g_orch_device_id. - platform_regs.{h,cpp} (a5 + a2a3): remove them. - aicpu_executor.cpp (a5 + a2a3): include aicpu_device_config.h for the get_orch_device_id consumer (still needs platform_regs.h for get_platform_regs). - kernel.cpp / sim device_runner: unchanged — the symbol name is identical, only its defining TU moved (same AICPU SO; sim dlsym still resolves it). No behavior change. Verified: all four quadrants build; runtime_fatal_codes scheduler_timeout (which exercises the register->orch-SO path) passes on a5sim + a2a3sim. Stacked on #1223 (which introduces aicpu_device_config).
What
PTO2_SCHEDULER_TIMEOUT_MS(the AICPU scheduler no-progress watchdog) is a per-device, run-invariant value, but it rode the per-run runtime arena layout (ArenaSizingKey::scheduler_timeout_ms): the host re-read the env and re-wrote it into the freshly-rebuilt arena image every run, and the device re-read it from the layout on every boot.This moves it onto the per-device one-shot
InitArgschannel — the same pathdevice_id/ log config already use — so it is read once at init, latched into a resident AICPU SO global, and consumed read-only by every run. The per-run runtime layout no longer carries per-device config.Resolves #1220.
How
InitArgsgainsscheduler_timeout_ms(a5 + a2a3). Host stamps it once inensure_aicpu_init_launched()from the timeout config resolved at attach.resolve_onboard_timeout_config()now keeps the scheduler override (previously parsed for ordering validation, then discarded).0= no override → the device keeps its compile-timeSCHEDULER_TIMEOUT_CYCLES.set/get_scheduler_timeout_mslive in a dedicated common AICPU device-config file (src/common/platform/{include,shared}/aicpu/aicpu_device_config.*) — the extensible home for run-invariant per-device knobs latched bysimpler_aicpu_init. Deliberately not inplatform_regs(kept strictly per-core register addressing).simpler_aicpu_initlatches it;scheduler_dispatchreads the resident-SO global instead of the arena layout.scheduler_timeout_msfromArenaSizingKeyand the per-runresolve_scheduler_timeout_ms()inruntime_maker(both runtimes); dropped the now-unusedruntime_timeout_config.hinclude there.set_scheduler_timeout_msand honorPTO2_SCHEDULER_TIMEOUT_MSat run (the override used to flow through the sharedruntime_maker).No new env gate —
PTO2_SCHEDULER_TIMEOUT_MSbehavior is preserved; only its landing/transport changes. The value leaves both the per-run arena and the per-runKernelArgsentirely. Rebased onto currentmain(adapts to the #1219ArenaSizingKey/ArenaOffsetslayout split).Test
runtime_fatal_codes::scheduler_timeoutpasses on a5sim and a2a3sim (confirmsPTO2_SCHEDULER_TIMEOUT_MS=500is honored end-to-end through the new latch path; a broken wiring would fall back to the long default and hang).runtime_fatal_codessim suite green; a5tensormap_and_ringbufferST suite: 20 passed on a5sim (normal path intact).Onboard hardware runs (a5/a2a3) still recommended before merge.
🤖 Generated with Claude Code