
refactor(runtime): latch scheduler timeout per-device via InitArgs, drop it from per-run layout#1223

Merged
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
ChaoZheng109:fix/scheduler-timeout-to-initargs
Jul 1, 2026


@ChaoZheng109 ChaoZheng109 commented Jun 30, 2026

Collaborator

What

PTO2_SCHEDULER_TIMEOUT_MS (the AICPU scheduler no-progress watchdog) is a per-device, run-invariant value, but it rode the per-run runtime arena layout (ArenaSizingKey::scheduler_timeout_ms). On every run the host re-read the env and re-wrote the value into the freshly rebuilt arena image, and on every boot the device re-read it from the layout.

This moves it onto the per-device one-shot InitArgs channel — the same path device_id / log config already use — so it is read once at init, latched into a resident AICPU SO global, and consumed read-only by every run. The per-run runtime layout no longer carries per-device config.
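
The read-once, latch, read-only flow described above can be sketched roughly as follows. This is a minimal illustration and not the PR's code: `InitArgsSketch`, the `device_config_sketch` namespace, `aicpu_init`, `timeout_seen_by_run`, the `g_latched` flag, and the `uint32_t` field type are all assumptions made for the example; only the names `scheduler_timeout_ms` and `set/get_scheduler_timeout_ms` come from the description.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the one-shot init payload. Only the field named in
// this PR is modeled; its uint32_t type is an assumption.
struct InitArgsSketch {
    int32_t device_id = 0;
    uint32_t scheduler_timeout_ms = 0;  // 0 means "no override"
};

namespace device_config_sketch {

// Resident state: stays valid for as long as the AICPU SO stays loaded.
uint32_t g_scheduler_timeout_ms = 0;
bool g_latched = false;  // illustration only, not claimed to exist in the PR

void set_scheduler_timeout_ms(uint32_t ms) { g_scheduler_timeout_ms = ms; }
uint32_t get_scheduler_timeout_ms() { return g_scheduler_timeout_ms; }

// Stand-in for simpler_aicpu_init: called once per device, latches the knob.
void aicpu_init(const InitArgsSketch& args) {
    assert(!g_latched && "init is expected to run once per device");
    set_scheduler_timeout_ms(args.scheduler_timeout_ms);
    g_latched = true;
}

// Stand-in for a per-run consumer such as the scheduler dispatch loop:
// it only reads the latched value and never touches the env or the arena.
uint32_t timeout_seen_by_run() { return get_scheduler_timeout_ms(); }

}  // namespace device_config_sketch
```

In this sketch the host would fill `scheduler_timeout_ms` once, `aicpu_init` would run once per device, and every later run would call only the getter.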

Resolves #1220.

How

  • InitArgs gains scheduler_timeout_ms (a5 + a2a3). Host stamps it once in ensure_aicpu_init_launched() from the timeout config resolved at attach.
  • resolve_onboard_timeout_config() now keeps the scheduler override (previously parsed for ordering validation, then discarded). 0 = no override → the device keeps its compile-time SCHEDULER_TIMEOUT_CYCLES.
  • New set/get_scheduler_timeout_ms live in a dedicated common AICPU device-config file (src/common/platform/{include,shared}/aicpu/aicpu_device_config.*) — the extensible home for run-invariant per-device knobs latched by simpler_aicpu_init. Deliberately not in platform_regs (kept strictly per-core register addressing). simpler_aicpu_init latches it; scheduler_dispatch reads the resident-SO global instead of the arena layout.
  • Removed scheduler_timeout_ms from ArenaSizingKey and the per-run resolve_scheduler_timeout_ms() in runtime_maker (both runtimes); dropped the now-unused runtime_timeout_config.h include there.
  • sim: dlsym set_scheduler_timeout_ms and honor PTO2_SCHEDULER_TIMEOUT_MS at run (the override used to flow through the shared runtime_maker).
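
The "0 = no override" convention in the list above could look roughly like the sketch below. The constants `kDefaultTimeoutCycles` and `kCyclesPerMs`, both function names, and the choice to map malformed input to 0 are assumptions for illustration; the PR text does not give the real default, the counter frequency, or the error handling, and the ordering validation done by `resolve_onboard_timeout_config()` is not modeled here.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstdlib>

// Hypothetical constants: the real compile-time default and the counter
// frequency are not stated in this PR.
constexpr uint64_t kDefaultTimeoutCycles = 5000000000ULL;
constexpr uint64_t kCyclesPerMs = 50000ULL;

// Host side: turn the env text into an override in milliseconds.
// Returns 0 ("no override") for unset, empty, malformed or out-of-range input.
uint32_t parse_timeout_override_ms(const char* text) {
    if (text == nullptr || text[0] < '0' || text[0] > '9') return 0;
    errno = 0;
    char* end = nullptr;
    const unsigned long long value = std::strtoull(text, &end, 10);
    if (errno != 0 || *end != '\0' || value > UINT32_MAX) return 0;
    return static_cast<uint32_t>(value);
}

// Device side: 0 keeps the compile-time default, anything else is converted.
uint64_t effective_timeout_cycles(uint32_t override_ms) {
    if (override_ms == 0) return kDefaultTimeoutCycles;
    return static_cast<uint64_t>(override_ms) * kCyclesPerMs;
}

// Usage: real code would pass std::getenv("PTO2_SCHEDULER_TIMEOUT_MS").
void timeout_override_example() {
    assert(effective_timeout_cycles(parse_timeout_override_ms("500")) == 25000000ULL);
    assert(effective_timeout_cycles(parse_timeout_override_ms(nullptr)) == kDefaultTimeoutCycles);
}
```

Using 0 as the sentinel keeps the payload field a plain integer, at the cost that a literal zero timeout cannot be requested.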

No new env gate — PTO2_SCHEDULER_TIMEOUT_MS behavior is preserved; only its landing/transport changes. The value leaves both the per-run arena and the per-run KernelArgs entirely. Rebased onto current main (adapts to the #1219 ArenaSizingKey/ArenaOffsets layout split).

Test

  • All four quadrants (onboard/sim × a5/a2a3) build clean.
  • runtime_fatal_codes::scheduler_timeout passes on a5sim and a2a3sim (confirms PTO2_SCHEDULER_TIMEOUT_MS=500 is honored end-to-end through the new latch path; a broken wiring would fall back to the long default and hang).
  • Full runtime_fatal_codes sim suite green; a5 tensormap_and_ringbuffer ST suite: 20 passed on a5sim (normal path intact).

Onboard hardware runs (a5/a2a3) still recommended before merge.

🤖 Generated with Claude Code

@coderabbitai

coderabbitai Bot commented Jun 30, 2026


Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3ac72b1f-799c-4689-93cd-941b1bed97a5

Walkthrough

This PR relocates the AICPU scheduler no-progress watchdog timeout from the per-run runtime arena sizing layout to a per-device InitArgs channel. A new shared aicpu_device_config module latches the value once at device init; arena sizing structs, runtime_maker.cpp, and scheduler_dispatch.cpp are updated across a2a3 and a5, onboard and sim variants.

Changes

Scheduler Timeout Channel Migration

| Layer | File(s) | Summary |
| --- | --- | --- |
| Per-device config module | src/common/platform/include/aicpu/aicpu_device_config.h, src/common/platform/shared/aicpu/aicpu_device_config.cpp, src/common/platform/include/host/runtime_timeout_config.h | New set_scheduler_timeout_ms/get_scheduler_timeout_ms functions backed by a module-level global, plus a default-initialized scheduler_timeout_ms in RuntimeTimeoutConfig. |
| InitArgs extension | src/a2a3/platform/include/common/kernel_args.h, src/a5/platform/include/common/kernel_args.h | InitArgs gains a scheduler_timeout_ms field defaulting to 0. |
| Host one-shot init latch | src/common/platform/onboard/host/device_runner_base.cpp | resolve_onboard_timeout_config computes a validated scheduler override; ensure_aicpu_init_launched stamps it into InitArgs. |
| Onboard device latch | src/a2a3/platform/onboard/aicpu/kernel.cpp, src/a5/platform/onboard/aicpu/kernel.cpp | simpler_aicpu_init calls set_scheduler_timeout_ms from the latched InitArgs value. |
| Sim DeviceRunner wiring | src/a2a3/platform/sim/host/device_runner.{h,cpp}, src/a5/platform/sim/host/device_runner.{h,cpp} | Adds an optional dlsym'd set_scheduler_timeout_ms_func_, loaded at binary load time and invoked in run() using resolved runtime timeout config. |
| Arena sizing cleanup | src/a2a3/runtime/.../runtime_maker.cpp, src/a2a3/runtime/.../pto_runtime2.h, src/a5/runtime/.../runtime_maker.cpp, src/a5/runtime/.../pto_runtime2.h | Removes scheduler_timeout_ms from ArenaSizingConfig/ArenaSizingKey and its resolution/writing; adds pto2_read_runtime_status() helper for PTO2 error-code-based runtime status. |
| Scheduler dispatch read path | src/a2a3/runtime/.../scheduler_dispatch.cpp, src/a5/runtime/.../scheduler_dispatch.cpp | Scheduler hang timeout now derives from get_scheduler_timeout_ms() instead of the arena layout's sizing field. |
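
The "optional dlsym'd" wiring in the Sim DeviceRunner row can be illustrated with the sketch below. To keep it self-contained, symbol lookup is injected through a `SymbolResolver` callback standing in for `dlsym(handle, name)`; that callback, `SimRunnerSketch`, `load_symbols`, `apply_timeout`, and the fake resolvers are all hypothetical names, and only `set_scheduler_timeout_ms_func_` is taken from the summary.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

using SetTimeoutFn = void (*)(uint32_t);
// Stand-in for dlsym(handle, name): returns nullptr when the symbol is absent.
using SymbolResolver = void* (*)(const char* name);

struct SimRunnerSketch {
    SetTimeoutFn set_scheduler_timeout_ms_func_ = nullptr;

    // Called when the AICPU binary is loaded; a missing symbol is tolerated.
    void load_symbols(SymbolResolver resolve) {
        set_scheduler_timeout_ms_func_ =
            reinterpret_cast<SetTimeoutFn>(resolve("set_scheduler_timeout_ms"));
    }

    // Forwards the override only if the setter was resolved.
    bool apply_timeout(uint32_t ms) const {
        if (set_scheduler_timeout_ms_func_ == nullptr) return false;
        set_scheduler_timeout_ms_func_(ms);
        return true;
    }
};

// Fake "shared object" used to exercise both outcomes.
uint32_t g_seen_ms = 0;
void fake_set_timeout(uint32_t ms) { g_seen_ms = ms; }
void* resolver_with_symbol(const char* name) {
    return std::strcmp(name, "set_scheduler_timeout_ms") == 0
               ? reinterpret_cast<void*>(&fake_set_timeout)
               : nullptr;
}
void* resolver_without_symbol(const char*) { return nullptr; }

void optional_symbol_example() {
    SimRunnerSketch runner;
    runner.load_symbols(resolver_without_symbol);
    assert(!runner.apply_timeout(500));  // older SO: silently skipped
    runner.load_symbols(resolver_with_symbol);
    assert(runner.apply_timeout(500) && g_seen_ms == 500);
}
```

Treating the symbol as optional means a simulator binary built without the setter would still load, and the device would simply keep its default timeout.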

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • hw-native-sys/simpler#1201: Extends the same per-device InitArgs payload used for simpler_aicpu_init, which this PR builds on directly.
  • hw-native-sys/simpler#1219: Previously moved scheduler_timeout_ms into prebuilt_layout.sizing, the exact field this PR now removes.
  • hw-native-sys/simpler#1127: Touches the same scheduler timeout propagation and set/get_scheduler_timeout_ms wiring in scheduler_dispatch.cpp.

Poem

A timeout hopped from run to run,
Re-read each time beneath the sun.
Now latched just once, per-device true,
Through InitArgs it hops on through.
No more re-getenv, no more strife —
One watchdog value, set for life. 🐇⏱️

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Out of Scope Changes check | ⚠️ Warning | runtime_maker also adds PTO2 runtime-status parsing and copy-back behavior changes, which are not part of the scheduler-timeout refactor. | Move the runtime-status/copy-back changes to a separate PR unless they are required for #1220. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 56.25% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly describes the main refactor: moving scheduler timeout to per-device InitArgs and removing it from the per-run layout. |
| Description check | ✅ Passed | The description matches the change set and explains the InitArgs latch, the removed per-run path, and the a5/a2a3 sim/onboard coverage. |
| Linked Issues check | ✅ Passed | The PR satisfies #1220 by latching scheduler timeout once via InitArgs, removing the per-run layout/read path, and applying it across a5/a2a3 and sim. |


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors how the AICPU scheduler no-progress watchdog timeout override is configured and propagated. Instead of embedding scheduler_timeout_ms in the prebuilt runtime arena layout (PTO2RuntimeArenaLayout), it is now passed per-device via InitArgs during initialization (simpler_aicpu_init) and stored in a resident global variable. This change is applied across both a2a3 and a5 platforms, including their respective simulation runners. There are no review comments to address, and the implementation looks clean and consistent.


@ChaoZheng109 ChaoZheng109 force-pushed the fix/scheduler-timeout-to-initargs branch from 9cbd328 to b7f1592 Compare July 1, 2026 01:08

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/a5/platform/sim/host/device_runner.cpp`:
- Around line 391-400: The scheduler-timeout latch is being executed from run()
on every invocation instead of once during device initialization. Move the
resolve_runtime_timeout_config and set_scheduler_timeout_ms_func_ call into the
existing aicpu_so_loaded_-guarded init path, alongside the other one-time setup
in device_runner::run/ensure_aicpu_init_launched, so the env is parsed and the
SO boundary crossed only once per device.
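
The once-per-device shape this comment asks for might look like the sketch below. It is an assumption-laden illustration: `SimRunnerOnceSketch`, `ensure_init`, `resolve_calls`, and the injected `resolve_timeout_ms` callback are hypothetical, and only the `aicpu_so_loaded_` guard name comes from the comment.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

struct SimRunnerOnceSketch {
    bool aicpu_so_loaded_ = false;  // guard named in the review comment
    int resolve_calls = 0;          // counts how often the env would be parsed
    uint32_t latched_ms = 0;

    // One-time init: resolve the config, latch it, then mark init as done.
    void ensure_init(const std::function<uint32_t()>& resolve_timeout_ms) {
        if (aicpu_so_loaded_) return;
        latched_ms = resolve_timeout_ms();
        ++resolve_calls;
        aicpu_so_loaded_ = true;
    }

    // Per-run path: relies on the latched value, never re-resolves it.
    uint32_t run(const std::function<uint32_t()>& resolve_timeout_ms) {
        ensure_init(resolve_timeout_ms);
        return latched_ms;
    }
};

void once_guard_example() {
    SimRunnerOnceSketch runner;
    assert(runner.run([] { return 500u; }) == 500u);
    assert(runner.run([] { return 900u; }) == 500u);  // later value is ignored
    assert(runner.resolve_calls == 1);
}
```

The second `run` call shows the consequence of latching: a value that changes after init is not picked up until the device is re-initialized.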
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e99d221f-5eca-4df6-befb-b7c03385b5fb

📥 Commits

Reviewing files that changed from the base of the PR and between 62adb13 and b7f1592.

📒 Files selected for processing (18)
  • src/a2a3/platform/include/common/kernel_args.h
  • src/a2a3/platform/onboard/aicpu/kernel.cpp
  • src/a2a3/platform/sim/host/device_runner.cpp
  • src/a2a3/platform/sim/host/device_runner.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
  • src/a5/platform/include/common/kernel_args.h
  • src/a5/platform/onboard/aicpu/kernel.cpp
  • src/a5/platform/sim/host/device_runner.cpp
  • src/a5/platform/sim/host/device_runner.h
  • src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
  • src/common/platform/include/aicpu/aicpu_device_config.h
  • src/common/platform/include/host/runtime_timeout_config.h
  • src/common/platform/onboard/host/device_runner_base.cpp
  • src/common/platform/shared/aicpu/aicpu_device_config.cpp

Comment thread src/a5/platform/sim/host/device_runner.cpp Outdated
@ChaoZheng109 ChaoZheng109 force-pushed the fix/scheduler-timeout-to-initargs branch from b7f1592 to faae3d3 Compare July 1, 2026 01:38
…rop it from per-run layout

PTO2_SCHEDULER_TIMEOUT_MS is a per-device, run-invariant value (the AICPU
scheduler no-progress watchdog), yet it rode the per-run runtime arena
layout (PTO2RuntimeArenaLayout::scheduler_timeout_ms) and was re-read from
env + re-transmitted on every run. Move it onto the per-device one-shot
InitArgs channel — the same path device_id / log config already use — so it
is read once at init, latched into the resident AICPU SO global, and
consumed read-only by every run.

- InitArgs gains scheduler_timeout_ms (a5 + a2a3); host stamps it once in
  ensure_aicpu_init_launched from the timeout config resolved at attach.
- resolve_onboard_timeout_config keeps the scheduler override now
  (0 = no override -> device keeps compile-time SCHEDULER_TIMEOUT_CYCLES).
- New set/get_scheduler_timeout_ms in the common aicpu_device_config file
  (not platform_regs); simpler_aicpu_init latches it; scheduler_dispatch
  reads the resident-SO global instead of rt_->prebuilt_layout.
- Remove scheduler_timeout_ms from PTO2RuntimeArenaLayout and the per-run
  resolve_scheduler_timeout_ms() in runtime_maker (both runtimes).
- sim: dlsym set_scheduler_timeout_ms and honor PTO2_SCHEDULER_TIMEOUT_MS
  at run (the override used to flow through the shared runtime_maker).

No new env gate; PTO2_SCHEDULER_TIMEOUT_MS behavior is preserved. Verified:
all four quadrants (onboard/sim x a5/a2a3) build; runtime_fatal_codes
scheduler_timeout passes on a5sim + a2a3sim; a5 trb st suite (20) green on
a5sim.

Closes hw-native-sys#1220
@ChaoWao ChaoWao merged commit 2c98abc into hw-native-sys:main Jul 1, 2026
16 checks passed
ChaoZheng109 added a commit to ChaoZheng109/simpler that referenced this pull request Jul 1, 2026
…of platform_regs

orch_device_id is the ACL device ordinal latched once per device by
simpler_aicpu_init (from InitArgs.device_id) and read by the AICPU executor to
make the staged orchestration SO filename unique per device. It is a per-device
run-invariant knob — the same category as scheduler_timeout — and has nothing to
do with per-core register addressing, so platform_regs was the wrong home.

Now that aicpu_device_config exists as the dedicated home for per-device AICPU
config latched by simpler_aicpu_init, move set/get_orch_device_id + the global
there, alongside set/get_scheduler_timeout_ms. platform_regs is left strictly for
per-core register access.

- aicpu_device_config.{h,cpp}: add set/get_orch_device_id + g_orch_device_id.
- platform_regs.{h,cpp} (a5 + a2a3): remove them.
- aicpu_executor.cpp (a5 + a2a3): include aicpu_device_config.h for the
  get_orch_device_id consumer (still needs platform_regs.h for get_platform_regs).
- kernel.cpp / sim device_runner: unchanged — the symbol name is identical, only
  its defining TU moved (same AICPU SO; sim dlsym still resolves it).

No behavior change. Verified: all four quadrants build; runtime_fatal_codes
scheduler_timeout (which exercises the register->orch-SO path) passes on
a5sim + a2a3sim.

Stacked on hw-native-sys#1223 (which introduces aicpu_device_config).
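
As a rough illustration of why the device ordinal is latched, the sketch below derives a per-device staged SO filename from it. The naming scheme (`<stem>_dev<N>.so`), the `orch_sketch` namespace, `staged_so_name`, the `-1` initial value, and the `int32_t` type are all hypothetical; the commit message only says the ordinal makes the filename unique per device, not how.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

namespace orch_sketch {

// Resident per-device global, latched once at init (type is an assumption).
int32_t g_orch_device_id = -1;

void set_orch_device_id(int32_t id) { g_orch_device_id = id; }
int32_t get_orch_device_id() { return g_orch_device_id; }

// Hypothetical naming scheme: append the latched ordinal to the file stem so
// two devices staging the same orchestration SO do not collide.
std::string staged_so_name(const std::string& stem) {
    return stem + "_dev" + std::to_string(get_orch_device_id()) + ".so";
}

}  // namespace orch_sketch

void orch_device_id_example() {
    orch_sketch::set_orch_device_id(0);
    const std::string first = orch_sketch::staged_so_name("orch");
    orch_sketch::set_orch_device_id(3);
    const std::string second = orch_sketch::staged_so_name("orch");
    assert(first == "orch_dev0.so" && second == "orch_dev3.so");
}
```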
ChaoWao pushed a commit that referenced this pull request Jul 1, 2026
…of platform_regs (#1228)


Development

Successfully merging this pull request may close these issues.

[Code Health] Per-device runtime config (scheduler timeout) rides the per-run arena layout instead of a per-device channel

2 participants