Refactor: split trb bind_callable into lifecycle helpers #1215
Code Review
This pull request refactors bind_callable_to_runtime_impl in both the a2a3 and a5 runtime makers by extracting several modular helper functions to handle arena sizing, device arguments staging, environment flags, static arenas, runtime image building, and launch state binding. Feedback highlights a potential state leakage issue in both implementations where the orch_to_sched flag is set to true if the PTO2_ORCH_TO_SCHED environment variable is truthy, but is never reset to false if the variable is unset or falsy, which could persist incorrect configurations across multiple runs.
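The leakage described in the review can be illustrated with a minimal sketch. Only the names orch_to_sched and PTO2_ORCH_TO_SCHED come from the review; the LaunchFlags struct, the helper names, and the truthiness rule (set, non-empty, not "0") are assumptions made for illustration and may not match the actual runtime code.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for the runtime's launch configuration; only the
// orch_to_sched field name is taken from the review comment.
struct LaunchFlags {
    bool orch_to_sched = false;
};

// Assumed truthiness rule: the variable is set, non-empty, and not "0".
inline bool env_is_truthy(const char* name) {
    const char* v = std::getenv(name);
    return v != nullptr && v[0] != '\0' && std::strcmp(v, "0") != 0;
}

// Leaky form: the flag is only ever set, so a value latched by an earlier
// run survives after the variable has been cleared.
inline void latch_flags_leaky(LaunchFlags& f) {
    if (env_is_truthy("PTO2_ORCH_TO_SCHED")) {
        f.orch_to_sched = true;
    }
}

// Assigning unconditionally makes each run reflect the current environment.
inline void latch_flags_reset(LaunchFlags& f) {
    f.orch_to_sched = env_is_truthy("PTO2_ORCH_TO_SCHED");
}
```

With the variable set for one run and unset for the next, the leaky form keeps orch_to_sched at true on the second run, while the unconditional assignment returns it to false.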
The ~200-line bind_callable_to_runtime_impl folded three distinct
lifecycles into one function. Split it into named steps so the entry
point reads as the lifecycles it orchestrates:
- resolve_arena_sizing (per-config): ring sizing + derived heap/SM
sizes + scheduler timeout — the layout input half, host arithmetic
- stage_device_args (per-run): the only signature-aware step —
H2D copy / pure-OUTPUT zeroing / copy-back recording
- apply_orch_sched_env_flags (per-run): latch the orch->sched env gates
- ensure_static_arenas (per-config): reserve + acquire the static pools
- build_runtime_image (per-config): pure host image build, no device
touch — the hook a later image-cache stage can memoize
- bind_launch_state (per-run): publish args + rtMemcpy + record base
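The shape of the resulting entry point can be sketched roughly as follows. Only the step names come from the list above; the signatures, the Trace type, the int return codes, and the assumption that the steps run in the listed order are all hypothetical. The real helpers take runtime and device state that is not shown here.

```cpp
#include <string>
#include <vector>

// Toy trace standing in for the real runtime state (hypothetical).
using Trace = std::vector<std::string>;

// Each stub records its name and reports success; the real steps do the
// work described in the list above.
int resolve_arena_sizing(Trace& t)       { t.push_back("resolve_arena_sizing"); return 0; }
int stage_device_args(Trace& t)          { t.push_back("stage_device_args"); return 0; }
int apply_orch_sched_env_flags(Trace& t) { t.push_back("apply_orch_sched_env_flags"); return 0; }
int ensure_static_arenas(Trace& t)       { t.push_back("ensure_static_arenas"); return 0; }
int build_runtime_image(Trace& t)        { t.push_back("build_runtime_image"); return 0; }
int bind_launch_state(Trace& t)          { t.push_back("bind_launch_state"); return 0; }

// Orchestrator sketch: call the steps in order and stop at the first
// nonzero return code, so a failing step skips everything after it.
int bind_callable_sketch(Trace& t) {
    using Step = int (*)(Trace&);
    const Step steps[] = {
        resolve_arena_sizing,
        stage_device_args,
        apply_orch_sched_env_flags,
        ensure_static_arenas,
        build_runtime_image,
        bind_launch_state,
    };
    for (Step step : steps) {
        int rc = step(t);
        if (rc != 0) {
            return rc;
        }
    }
    return 0;
}
```

Calling bind_callable_sketch on an empty Trace leaves six entries in it, one per step, in the order listed.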
Behavior is byte-identical: TIMING logs, the simpler_run.bind.{args,
prebuilt} STRACE spans, log ordering, and error paths are preserved.
The host DeviceArena stays a caller-owned local passed by reference
(it is non-copyable/non-movable), so the image outlives the call until
upload.
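The ownership constraint on the host arena can be sketched as below. HostArena and build_image_into are hypothetical stand-ins, not the real DeviceArena API; the point is only that a type with deleted copy and move operations cannot be returned by value from a helper, so the caller declares it and the helper fills it through a reference.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the host DeviceArena: neither copyable nor
// movable, so it cannot be returned from or passed into a helper by value.
class HostArena {
public:
    HostArena() = default;
    HostArena(const HostArena&) = delete;
    HostArena& operator=(const HostArena&) = delete;
    HostArena(HostArena&&) = delete;
    HostArena& operator=(HostArena&&) = delete;

    // Append n zero-initialized bytes and return the offset where they start.
    std::size_t reserve(std::size_t n) {
        std::size_t offset = bytes_.size();
        bytes_.resize(offset + n);
        return offset;
    }

    std::size_t size() const { return bytes_.size(); }

private:
    std::vector<unsigned char> bytes_;
};

// The helper builds into the caller's arena in place. Because the caller
// owns the object, the image stays alive after this function returns and
// remains valid until the caller uploads it.
std::size_t build_image_into(HostArena& arena, std::size_t image_bytes) {
    return arena.reserve(image_bytes);
}
```

In this arrangement the orchestrator declares the arena as a local, passes it to the build step, and later passes the same object to the upload step, so no copy or move is ever needed.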
Also re-syncs the drifted a2a3/a5 runtime_maker copies: a5 adopts the
STRACE markers, common/strace.h include, and pto2_-prefixed naming that
were pure drift, leaving the two files byte-identical.
Verified on sim (behavior unchanged): a2a3sim trb ST 30 passed/1
skipped, a5sim trb ST 20 passed.
aa8fd1e to f4ba6c9
Summary

The ~200-line bind_callable_to_runtime_impl (trb host side) folded three distinct lifecycles into one function. This splits it into named steps so the entry point reads as the lifecycles it orchestrates:

- resolve_arena_sizing (per-config): ring sizing + derived heap/SM sizes + scheduler timeout (the layout input half, pure host arithmetic)
- stage_device_args (per-run): the only signature-aware step, covering H2D copy / pure-OUTPUT zeroing / copy-back recording
- apply_orch_sched_env_flags (per-run): latch the orch→sched env gates
- ensure_static_arenas (per-config): reserve + acquire the static pools
- build_runtime_image (per-config): pure host image build, no device touch (the hook a later image-cache stage can memoize)
- bind_launch_state (per-run): publish args + rtMemcpy + record device base

bind_callable_to_runtime_impl collapses to a ~45-line orchestrator.

Behavior is byte-identical: all TIMING: logs, the simpler_run.bind.{args,prebuilt} STRACE spans (consumed by pypto-serving), log ordering, and error paths are preserved. The host DeviceArena stays a caller-owned local passed by reference (it is non-copyable/non-movable), so the image outlives the call until upload.

Also re-syncs the drifted a2a3/a5 copies: a5 adopts the STRACE markers, common/strace.h include, and pto2_-prefixed naming that were pure drift; the two runtime_maker.cpp files are now byte-identical.

This is the host-side function-split groundwork; the device-ABI layout split, static host_api relocation, and register-time image caching are separate follow-up stages.

Testing

tests/st/{a2a3,a5}/tensormap_and_ringbuffer: a2a3sim 30 passed / 1 skipped, a5sim 20 passed