Add async chip callable register/run overlap#1090
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughAdds ChangesCallable Prepare Overlap Design Document
Estimated code review effort🎯 2 (Simple) | ⏱️ ~15 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a design document for the 'Callable Prepare Overlap Plan', which outlines a pipeline optimization to overlap the preparation of a future callable with the execution of the current one. The reviewer provided valuable feedback on the proposed design, suggesting the addition of an intermediate state (such as CANCELLING or UNREGISTERING) for the PREPARING to EMPTY transition to prevent race conditions, and recommending that the document explicitly note the requirement for the new chip-child control thread to attach to the thread-local CANN device context.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| ```text | ||
| EMPTY -> PREPARING -> PREPARED | ||
| -> FAILED | ||
| PREPARING -> EMPTY (unregister/cancel after worker reaches a safe point) | ||
| PREPARED -> UNREGISTERING -> EMPTY | ||
| FAILED -> UNREGISTERING -> EMPTY | ||
| ``` |
There was a problem hiding this comment.
In the proposed state machine, when unregister_callable is called during the PREPARING state, it requests cooperative cancellation and waits for the active prepare worker to reach a safe point. However, unlike the PREPARED and FAILED states which transition to UNREGISTERING before reaching EMPTY, there is no intermediate state shown for PREPARING during this waiting period.
To prevent race conditions and clearly signal to other threads that the slot is being cancelled/torn down (and cannot be transitioned to PREPARED or reused), consider introducing an intermediate state (e.g., CANCELLING or routing through UNREGISTERING) for the PREPARING -> EMPTY transition as well.
References
- Ensure documentation and diagrams accurately reflect implementation details regarding resource lifecycles, especially when persistence is used to maintain internal state like caches.
| - a chip-child control thread that services the control mailbox while the task | ||
| thread is blocked in `run_prepared_from_blob()`. |
There was a problem hiding this comment.
Since CANN device contexts are thread-local (as implemented in DeviceRunnerBase::attach_current_thread), any new thread spawned on the child side—such as the proposed chip-child control thread—must explicitly attach to the correct device context (e.g., via rtSetDevice or attach_current_thread) before executing any device-level operations (like allocation, H2D copies, or prewarm).
It would be highly beneficial to explicitly document this requirement in the L3 design section to ensure it is handled correctly during implementation.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/callable-prepare-overlap-plan.md`:
- Around line 147-153: Add explicit rules specifying the outcome of
`run_prepared(cid)` for the `EMPTY` and `UNREGISTERING` states. After the
existing rules for `PREPARED`, `PREPARING`, and `FAILED` states, add two new
bullet points that document what `run_prepared()` does when called on a
non-registered callable or one that is being unregistered, ensuring these cases
result in clear error outcomes rather than blocking or falling through.
- Around line 252-255: The markdown file contains sentences with PR reference
`#1089` that are wrapped in a way that causes these references to appear at the
beginning of lines, which markdownlint will interpret as ATX heading syntax.
Rewrap the sentences containing PR `#1089` references so that the hash symbol and
PR number never start a new line and instead flow naturally within the text
paragraphs. Apply this fix to the sections mentioned in the diff around lines
252-255 and also check and fix the same issue in the section around lines
289-293.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 973e454f-1103-4180-a252-1b652c75919d
📒 Files selected for processing (1)
docs/callable-prepare-overlap-plan.md
| - `run_prepared(cid)` may proceed only from `PREPARED`. | ||
| - `run_prepared(cid)` waits if the slot is `PREPARING`. | ||
| - `run_prepared(cid)` fails with the recorded error if the slot is `FAILED`. | ||
| - `unregister_callable(cid)` removes public visibility, requests cooperative | ||
| cancellation for `PREPARING`, and waits for active prepare or run users before | ||
| releasing resources. | ||
| - callable-id reuse must not expose stale AICPU or host-side state. |
There was a problem hiding this comment.
Specify the EMPTY / UNREGISTERING outcome for run_prepared().
The rules cover PREPARING and FAILED, but not the not-registered path. The current C API already treats "no prep state" as a negative error, so this needs to be explicit here; otherwise an implementation can accidentally block or fall through on an empty slot.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/callable-prepare-overlap-plan.md` around lines 147 - 153, Add explicit
rules specifying the outcome of `run_prepared(cid)` for the `EMPTY` and
`UNREGISTERING` states. After the existing rules for `PREPARED`, `PREPARING`,
and `FAILED` states, add two new bullet points that document what
`run_prepared()` does when called on a non-registered callable or one that is
being unregistered, ensuring these cases result in clear error outcomes rather
than blocking or falling through.
| prewarm work that PR #1089 folded into prepare. This plan does not change PR | ||
| #1089's serialization policy: any prewarm portion that PR #1089 keeps | ||
| serialized with an active run remains serialized and must be reported as | ||
| blocked or non-overlapped prepare time. If PR #1089 has not landed, this plan |
There was a problem hiding this comment.
Reflow the PR #1089`` sentences before they become accidental headings.
The wrapped lines starting with #1089 are likely to be parsed by markdownlint as ATX headings. Rewrap those sentences so the hash never begins a line.
Also applies to: 289-293
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 253-253: No space after hash on atx style heading
(MD018, no-missing-space-atx)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/callable-prepare-overlap-plan.md` around lines 252 - 255, The markdown
file contains sentences with PR reference `#1089` that are wrapped in a way that
causes these references to appear at the beginning of lines, which markdownlint
will interpret as ATX heading syntax. Rewrap the sentences containing PR `#1089`
references so that the hash symbol and PR number never start a new line and
instead flow naturally within the text paragraphs. Apply this fix to the
sections mentioned in the diff around lines 252-255 and also check and fix the
same issue in the section around lines 289-293.
Source: Linters/SAST tools
aad9b22 to
9f9d588
Compare
- Add L3 Worker.run_async as an async DAG queue while keeping sync run ordered through the same lane - Add L2 run/register lanes plus chip-only register_async and unregister_async with tombstone/deferred free - Let chip-child async run/register/unregister controls overlap TASK_RUNNING without widening memory/domain control semantics - Add unit coverage and task-submit hardware overlap acceptance
- Document level-specific async Worker APIs and chip-only guards - Describe TASK_RUNNING mailbox overlap for async chip controls - Record tombstone/deferred-free unregister semantics in formal docs
9f9d588 to
275e41f
Compare
Summary
This PR adds async chip-callable register/run/unregister support so chip
callable preparation can overlap with already submitted chip execution.
PR #1089 made chip callable register/prepare do real prewarm work. This PR
keeps the synchronous API behavior, but adds async handles and child-side run /
register lanes so a later callable can be prepared while an earlier callable is
already running.
Pipeline
TASK_RUNNINGis the key handoff point: once the child has copied the runpayload out of the mailbox, selected async controls can temporarily use the
same mailbox and then restore
TASK_RUNNING.Usage
L3
L2
register_async()/unregister_async()are chip-callable only. Non-chiptargets or handles raise
TypeError.Hardware Overlap Results
Measured on real
a2a3hardware throughtask-submit.Task setup:
C1:repeat_vector_add, long run controlled byrepeat_count=1000000C2:simple_vector_addtrials=5device=2C2 registerandC2 runL3 End-to-End
Task:
task_20260626_155236_381078019428Correctness:
max_diff=0.000e+00L2 Direct Worker End-to-End
Task:
task_20260626_155304_382795311062Correctness:
max_diff=0.000e+00Notes
register()/run()/unregister()remain available.earlier async run.
handle are rejected, but native unregister/free waits until in-flight runs
release the private slot.
run(orch_fn)orrun_async(orch_fn).