Add async chip callable register/run overlap by puddingfjz · Pull Request #1090 · hw-native-sys/simpler

puddingfjz · 2026-06-18T16:02:44Z

Summary

This PR adds async chip-callable register/run/unregister support so chip
callable preparation can overlap with already submitted chip execution.

PR #1089 made chip callable register/prepare do real prewarm work. This PR
keeps the synchronous API behavior, but adds async handles and child-side run /
register lanes so a later callable can be prepared while an earlier callable is
already running.
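The two-lane idea can be illustrated with a small host-only sketch. This is not the PR's implementation: it assumes each lane behaves like a single-threaded FIFO executor, and `run_c1`, `prepare_c2`, and the event names are hypothetical stand-ins for the real chip work.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

c1_started = threading.Event()
c2_prepared = threading.Event()
log = []


def run_c1():
    """Stand-in for a long C1 run; it only finishes once C2 is prepared."""
    c1_started.set()
    overlapped = c2_prepared.wait(timeout=5.0)
    log.append("run C1 done")
    return overlapped


def prepare_c2():
    """Stand-in for C2 register/prewarm; starts while C1 is still running."""
    c1_started.wait(timeout=5.0)
    log.append("prepare C2 done")
    c2_prepared.set()
    return "h2"


# One worker per lane keeps each lane strictly ordered (assumption of this sketch).
with ThreadPoolExecutor(max_workers=1) as run_lane, \
        ThreadPoolExecutor(max_workers=1) as register_lane:
    run_h = run_lane.submit(run_c1)
    reg_h = register_lane.submit(prepare_c2)
    h2 = reg_h.result()                           # register completes mid-run
    run_h2 = run_lane.submit(lambda: "run C2 done")  # queued behind C1
    overlapped = run_h.result()
    c2_result = run_h2.result()
```

Because the run lane has a single worker, the C2 run cannot start before C1 finishes, while the register lane is free to make progress in the meantime.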

Pipeline

Parent L3 Worker
  run_async(orch_fn for C1)
    -> DAG run queue
        -> submit_next_level(C1)
            -> child mailbox TASK_READY

Chip child
  TASK_READY(C1)
    -> copy args out of mailbox
    -> pin C1 private slot
    -> enqueue C1 on child run lane
    -> publish TASK_RUNNING

Parent L3 Worker
  register_async(C2)
    -> control mailbox can enter during TASK_RUNNING
        -> child register lane prepares/prewarms C2

Chip child
  run lane:      run C1
  register lane: prepare/prewarm C2

TASK_RUNNING is the key handoff point: once the child has copied the run
payload out of the mailbox, selected async controls can temporarily use the
same mailbox and then restore TASK_RUNNING.
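A toy model of that handoff is sketched below. It is only an illustration under stated assumptions: the `Mailbox` class, the `CONTROL` state name, and the inline `handler` call are hypothetical, and the real child services controls on its register lane rather than inline under a lock.

```python
import enum
import threading


class MailboxState(enum.Enum):
    IDLE = "IDLE"
    TASK_READY = "TASK_READY"
    TASK_RUNNING = "TASK_RUNNING"
    CONTROL = "CONTROL"  # hypothetical: mailbox borrowed by an async control


class Mailbox:
    """Toy single-slot mailbox; every name here is illustrative."""

    def __init__(self):
        self._lock = threading.Lock()
        self.state = MailboxState.IDLE
        self.payload = None

    def post_task(self, payload):
        with self._lock:
            if self.state is not MailboxState.IDLE:
                raise RuntimeError("mailbox busy")
            self.payload = payload
            self.state = MailboxState.TASK_READY

    def take_task(self):
        """Child side: copy args out of the mailbox, then publish TASK_RUNNING."""
        with self._lock:
            if self.state is not MailboxState.TASK_READY:
                raise RuntimeError("no task ready")
            args = dict(self.payload)  # private copy; mailbox is reusable now
            self.payload = None
            self.state = MailboxState.TASK_RUNNING
            return args

    def borrow_for_control(self, request, handler):
        """Parent side: use the mailbox for a control, then restore TASK_RUNNING."""
        with self._lock:
            if self.state is not MailboxState.TASK_RUNNING:
                raise RuntimeError("controls may only enter during TASK_RUNNING")
            self.state = MailboxState.CONTROL
            self.payload = request
            try:
                return handler(request)
            finally:
                self.payload = None
                self.state = MailboxState.TASK_RUNNING

    def finish_task(self):
        with self._lock:
            if self.state is not MailboxState.TASK_RUNNING:
                raise RuntimeError("no task running")
            self.state = MailboxState.IDLE


mb = Mailbox()
mb.post_task({"callable": "C1"})
c1_args = mb.take_task()
reply = mb.borrow_for_control(
    {"op": "register", "callable": "C2"},
    lambda req: "prepared " + req["callable"],
)
state_after_control = mb.state
mb.finish_task()
```

The point of the sketch is the ordering: the control is only admitted after the run payload has been copied out, and the mailbox reads `TASK_RUNNING` again once the control returns.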

Usage

L3

h1 = worker.register(c1)

run_h = worker.run_async(orch_fn_for_c1)

h2 = worker.register_async(c2).wait()

# C2 run is queued normally through the DAG path.
timing = worker.run(orch_fn_for_c2)

run_h.wait()
worker.unregister_async(h2).wait()

L2

h1 = worker.register(c1)

run_h1 = worker.run_async(h1, c1_args, cfg)

h2 = worker.register_async(c2).wait()
run_h2 = worker.run_async(h2, c2_args, cfg)

run_h1.wait()
run_h2.wait()
worker.unregister_async(h2).wait()

register_async() / unregister_async() are chip-callable only. Non-chip
targets or handles raise TypeError.
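A guard of this kind might look like the following sketch. The marker classes and the `require_chip_callable` helper are hypothetical names chosen for illustration; the PR does not show how the check is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipCallable:
    """Hypothetical marker type for a chip-level callable."""
    name: str


@dataclass(frozen=True)
class SubWorkerCallable:
    """Hypothetical marker type for any non-chip target."""
    name: str


def require_chip_callable(target, api_name):
    """Return target unchanged, or raise TypeError for non-chip targets."""
    if not isinstance(target, ChipCallable):
        raise TypeError(
            f"{api_name}() is chip-callable only, got {type(target).__name__}"
        )
    return target


accepted = require_chip_callable(ChipCallable("c2"), "register_async")
try:
    require_chip_callable(SubWorkerCallable("s1"), "register_async")
    rejected = None
except TypeError as exc:
    rejected = str(exc)
```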

Hardware Overlap Results

Measured on real a2a3 hardware through task-submit.

Task setup:

C1: repeat_vector_add, long run controlled by repeat_count=1000000
C2: simple_vector_add
trials=5
device=2
Timing includes C2 register and C2 run

L3 End-to-End

serial  = run(C1) + register_async(C2).wait() + run(C2)
overlap = run_async(C1)
          + register_async(C2).wait() while C1 is running
          + run_async(C2) queued after C1
          + wait both

Task: task_20260626_155236_381078019428

| Delay before C2 register | Serial median | Overlap median | Reduction | Speedup |
| --- | --- | --- | --- | --- |
| 0 ms | 26046.3 us | 19471.1 us | 25.2% | 1.34x |
| 2 ms | 26215.6 us | 17477.7 us | 33.3% | 1.50x |

Correctness: max_diff=0.000e+00

L2 Direct Worker End-to-End

serial  = run(C1) + register_async(C2).wait() + run(C2)
overlap = run_async(C1)
          + register_async(C2).wait() while C1 is running
          + run_async(C2) queued on the same L2 run lane
          + wait both

Task: task_20260626_155304_382795311062

| Delay before C2 register | Serial median | Overlap median | Reduction | Speedup |
| --- | --- | --- | --- | --- |
| 0 ms | 17392.9 us | 10913.2 us | 37.3% | 1.59x |
| 2 ms | 17315.7 us | 12424.0 us | 28.3% | 1.39x |

Correctness: max_diff=0.000e+00
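For reference, the Reduction and Speedup columns in both tables follow from the medians as `(serial - overlap) / serial` and `serial / overlap`. The short check below recomputes them from the reported numbers; the `overlap_gain` helper name is just for this sketch.

```python
def overlap_gain(serial_us, overlap_us):
    """Return (reduction in percent, speedup factor) for two median timings."""
    reduction_pct = (serial_us - overlap_us) / serial_us * 100.0
    speedup = serial_us / overlap_us
    return reduction_pct, speedup


# (level, delay before C2 register) -> (serial median us, overlap median us)
medians = {
    ("L3", "0 ms"): (26046.3, 19471.1),
    ("L3", "2 ms"): (26215.6, 17477.7),
    ("L2", "0 ms"): (17392.9, 10913.2),
    ("L2", "2 ms"): (17315.7, 12424.0),
}

gains = {key: overlap_gain(*pair) for key, pair in medians.items()}
for (level, delay), (reduction_pct, speedup) in gains.items():
    print(f"{level} {delay}: reduction={reduction_pct:.1f}% speedup={speedup:.2f}x")
```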

Notes

- Sync register() / run() / unregister() remain available.
- Sync and async runs share one run queue, so a sync run cannot overtake an
  earlier async run.
- Final chip unregister uses tombstone/deferred free: new runs through the
  handle are rejected, but native unregister/free waits until in-flight runs
  release the private slot.
- L3 does not expose a direct public chip-run API; L3 execution goes through
  run(orch_fn) or run_async(orch_fn).
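The tombstone/deferred-free note can be read as a small reference-counting rule. The sketch below is one possible shape for it, not the PR's code: `ChipSlot`, its method names, and the `native_free` callback are all hypothetical.

```python
import threading


class ChipSlot:
    """Toy private slot with tombstone + deferred free (illustrative only)."""

    def __init__(self, native_free):
        self._lock = threading.Lock()
        self._in_flight = 0
        self._tombstoned = False
        self._freed = False
        self._native_free = native_free

    def acquire_for_run(self):
        """Pin the slot for a run; rejected once the handle is tombstoned."""
        with self._lock:
            if self._tombstoned:
                raise RuntimeError("handle was unregistered")
            self._in_flight += 1

    def release_after_run(self):
        with self._lock:
            self._in_flight -= 1
            do_free = self._claim_free_locked()
        if do_free:
            self._native_free()

    def unregister(self):
        """Tombstone immediately; free now only if nothing is in flight."""
        with self._lock:
            self._tombstoned = True
            do_free = self._claim_free_locked()
        if do_free:
            self._native_free()

    def _claim_free_locked(self):
        # Exactly one caller wins the right to free, and only after the last run.
        if self._tombstoned and self._in_flight == 0 and not self._freed:
            self._freed = True
            return True
        return False


events = []
slot = ChipSlot(lambda: events.append("freed"))
slot.acquire_for_run()      # C run is in flight
slot.unregister()           # tombstone; free is deferred
freed_early = "freed" in events
try:
    slot.acquire_for_run()  # new run through the handle
except RuntimeError:
    events.append("rejected")
slot.release_after_run()    # last in-flight run releases the slot
```

In this model the native free happens exactly once, on whichever of `unregister()` or the final `release_after_run()` comes last.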

coderabbitai · 2026-06-18T16:03:06Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 42dd4335-ef18-4e04-8420-f47a9669bb2b


📝 Walkthrough

Adds docs/callable-prepare-overlap-plan.md, a new draft design document specifying how callable-level prepare for a subsequent callable can overlap with the run of a prior callable. The document covers API shape, a slot state machine, locking and snapshot rules, a double-buffered staging/publish contract, L3 control-lane requirements, and an implementation plan.

Changes

Callable Prepare Overlap Design Document

Layer / File(s)	Summary
Goals, non-goals, and current constraints `docs/callable-prepare-overlap-plan.md`	States the overlap goal (prepare C2 during run C1), explicitly excludes CUDA Graph/stable-buffer replay, and enumerates concurrency constraints and shared-state audit items.
Public API shape and Python async registration `docs/callable-prepare-overlap-plan.md`	Retains blocking `prepare_callable(C)` as a composition of `prepare_callable_async` plus `wait_prepare`; defines `Worker.register_async()` returning a `CallableHandle` usable by `Worker.run`.
Slot state machine, locking model, and snapshot rules `docs/callable-prepare-overlap-plan.md`	Specifies `EMPTY/PREPARING/PREPARED/FAILED/UNREGISTERING` transitions, per-callable registry lock to avoid serializing prepares, `run_prepared` snapshot discipline, and generation-matching for stale-completion prevention.
Staging/publish contract, failure/unregister semantics, and AICPU boundary `docs/callable-prepare-overlap-plan.md`	Defines double-buffered staging where prepares write unpublished artifacts and publish atomically; specifies run-side immutable snapshot requirements, AICPU first-sighting discipline, cooperative cancellation, safe-point unregister, and independence boundary with AICPU prewarm PR `#1089`.
L3 control-lane requirements and optimization scope `docs/callable-prepare-overlap-plan.md`	Explains why existing L3 mailbox paths block overlap, specifies a separate control mailbox and chip-child control thread, excludes stable buffers from this plan, and lists follow-up resource optimization candidates.
Implementation plan, validation criteria, and references `docs/callable-prepare-overlap-plan.md`	Step-by-step plan covering prerequisite leak fix, slot state/RAII, `run_prepared` refactor, async APIs, Python GIL-release paths, and L3 control lane; functional and performance validation criteria; and a references list.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Possibly related PRs

hw-native-sys/simpler#891: Defines the callable-level identity registration and CallableHandle model that this overlap plan lists as part of the async prepare/run pipelining scope.

Poem

🐇 A rabbit once staged a grand scheme,
To prepare the next hop mid-stream!
Async and locked,
Snapshots are docked,
Overlapping runs — what a dream! 🌀

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The description clearly matches the documentation-only design plan for async callable prepare overlap and related register/run semantics.
Title check	✅ Passed	The title matches the main change: a draft design for async callable register/prepare and run overlap.


gemini-code-assist

Code Review

This pull request introduces a design document for the 'Callable Prepare Overlap Plan', which outlines a pipeline optimization to overlap the preparation of a future callable with the execution of the current one. The reviewer provided valuable feedback on the proposed design, suggesting the addition of an intermediate state (such as CANCELLING or UNREGISTERING) for the PREPARING to EMPTY transition to prevent race conditions, and recommending that the document explicitly note the requirement for the new chip-child control thread to attach to the thread-local CANN device context.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-18T16:04:08Z

+```text
+EMPTY -> PREPARING -> PREPARED
+                    -> FAILED
+PREPARING -> EMPTY        (unregister/cancel after worker reaches a safe point)
+PREPARED -> UNREGISTERING -> EMPTY
+FAILED   -> UNREGISTERING -> EMPTY
+```


In the proposed state machine, when unregister_callable is called during the PREPARING state, it requests cooperative cancellation and waits for the active prepare worker to reach a safe point. However, unlike the PREPARED and FAILED states which transition to UNREGISTERING before reaching EMPTY, there is no intermediate state shown for PREPARING during this waiting period.

To prevent race conditions and clearly signal to other threads that the slot is being cancelled/torn down (and cannot be transitioned to PREPARED or reused), consider introducing an intermediate state (e.g., CANCELLING or routing through UNREGISTERING) for the PREPARING -> EMPTY transition as well.
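One way to picture this suggestion is as an explicit transition table. The sketch below assumes a `CANCELLING` state is added as the reviewer proposes; the `SlotState` enum, the `ALLOWED` table, and the `transition` helper are illustrative names, not code from the design document.

```python
import enum


class SlotState(enum.Enum):
    EMPTY = enum.auto()
    PREPARING = enum.auto()
    PREPARED = enum.auto()
    FAILED = enum.auto()
    CANCELLING = enum.auto()      # assumed intermediate state from the review
    UNREGISTERING = enum.auto()


ALLOWED = {
    SlotState.EMPTY: {SlotState.PREPARING},
    SlotState.PREPARING: {SlotState.PREPARED, SlotState.FAILED, SlotState.CANCELLING},
    SlotState.CANCELLING: {SlotState.EMPTY},
    SlotState.PREPARED: {SlotState.UNREGISTERING},
    SlotState.FAILED: {SlotState.UNREGISTERING},
    SlotState.UNREGISTERING: {SlotState.EMPTY},
}


def transition(current, target):
    """Return target if the move is legal, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


# unregister arrives while a prepare is still in progress
state = transition(SlotState.EMPTY, SlotState.PREPARING)
state = transition(state, SlotState.CANCELLING)

# a late prepare worker can no longer publish PREPARED
try:
    transition(state, SlotState.PREPARED)
    late_publish = "accepted"
except ValueError:
    late_publish = "rejected"

state = transition(state, SlotState.EMPTY)  # worker reached its safe point
```

With the intermediate state in the table, other threads can tell a slot under teardown apart from one that is still legitimately preparing.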

References

Ensure documentation and diagrams accurately reflect implementation details regarding resource lifecycles, especially when persistence is used to maintain internal state like caches.

gemini-code-assist · 2026-06-18T16:04:08Z

+- a chip-child control thread that services the control mailbox while the task
+  thread is blocked in `run_prepared_from_blob()`.


Since CANN device contexts are thread-local (as implemented in DeviceRunnerBase::attach_current_thread), any new thread spawned on the child side—such as the proposed chip-child control thread—must explicitly attach to the correct device context (e.g., via rtSetDevice or attach_current_thread) before executing any device-level operations (like allocation, H2D copies, or prewarm).

It would be highly beneficial to explicitly document this requirement in the L3 design section to ensure it is handled correctly during implementation.
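The pitfall can be reproduced without any device at all, since the issue is that thread-local state is not inherited by new threads. The sketch below uses Python's `threading.local` as a stand-in for the CANN per-thread context; `attach_current_thread` and `device_op` here are hypothetical toy functions, not the real `DeviceRunnerBase` API.

```python
import threading

_ctx = threading.local()


def attach_current_thread(device_id):
    """Toy stand-in for a per-thread device attach."""
    _ctx.device_id = device_id


def device_op(name):
    """Toy device operation that requires an attached context."""
    device_id = getattr(_ctx, "device_id", None)
    if device_id is None:
        raise RuntimeError(f"{name}: current thread has no device context")
    return f"{name} on device {device_id}"


results = {}


def control_thread(attach_first):
    try:
        if attach_first:
            attach_current_thread(2)
        results[attach_first] = device_op("prewarm")
    except RuntimeError as exc:
        results[attach_first] = f"error: {exc}"


attach_current_thread(2)  # the spawning thread is attached...
for attach_first in (False, True):
    t = threading.Thread(target=control_thread, args=(attach_first,))
    t.start()
    t.join()              # ...but a new thread only works if it attaches itself
```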

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/callable-prepare-overlap-plan.md`:
- Around line 147-153: Add explicit rules specifying the outcome of
`run_prepared(cid)` for the `EMPTY` and `UNREGISTERING` states. After the
existing rules for `PREPARED`, `PREPARING`, and `FAILED` states, add two new
bullet points that document what `run_prepared()` does when called on a
non-registered callable or one that is being unregistered, ensuring these cases
result in clear error outcomes rather than blocking or falling through.
- Around line 252-255: The markdown file contains sentences with PR reference
`#1089` that are wrapped in a way that causes these references to appear at the
beginning of lines, which markdownlint will interpret as ATX heading syntax.
Rewrap the sentences containing PR `#1089` references so that the hash symbol and
PR number never start a new line and instead flow naturally within the text
paragraphs. Apply this fix to the sections mentioned in the diff around lines
252-255 and also check and fix the same issue in the section around lines
289-293.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 973e454f-1103-4180-a252-1b652c75919d

📥 Commits

Reviewing files that changed from the base of the PR and between cdbea27 and aad9b22.

📒 Files selected for processing (1)

docs/callable-prepare-overlap-plan.md

coderabbitai · 2026-06-18T16:07:45Z

+- `run_prepared(cid)` may proceed only from `PREPARED`.
+- `run_prepared(cid)` waits if the slot is `PREPARING`.
+- `run_prepared(cid)` fails with the recorded error if the slot is `FAILED`.
+- `unregister_callable(cid)` removes public visibility, requests cooperative
+  cancellation for `PREPARING`, and waits for active prepare or run users before
+  releasing resources.
+- callable-id reuse must not expose stale AICPU or host-side state.


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Specify the EMPTY / UNREGISTERING outcome for run_prepared().

The rules cover PREPARING and FAILED, but not the not-registered path. The current C API already treats "no prep state" as a negative error, so this needs to be explicit here; otherwise an implementation can accidentally block or fall through on an empty slot.
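A per-state dispatch that leaves no state unhandled could be sketched as follows. This is an illustration of the requested rule set only: the exception type, the returned strings, and the `run_prepared_outcome` name are assumptions, and the real C API reports errors as negative return codes rather than exceptions.

```python
import enum


class SlotState(enum.Enum):
    EMPTY = enum.auto()
    PREPARING = enum.auto()
    PREPARED = enum.auto()
    FAILED = enum.auto()
    UNREGISTERING = enum.auto()


class NotRegisteredError(LookupError):
    """Hypothetical error for EMPTY / UNREGISTERING slots."""


def run_prepared_outcome(state, recorded_error=None):
    """Map every slot state to an explicit run_prepared() outcome."""
    if state is SlotState.PREPARED:
        return "run"
    if state is SlotState.PREPARING:
        return "wait"
    if state is SlotState.FAILED:
        raise RuntimeError(recorded_error or "prepare failed")
    if state is SlotState.EMPTY:
        raise NotRegisteredError("callable is not registered")
    if state is SlotState.UNREGISTERING:
        raise NotRegisteredError("callable is being unregistered")
    raise AssertionError(f"unhandled state {state!r}")


outcomes = {}
for s in SlotState:
    try:
        outcomes[s.name] = run_prepared_outcome(s, recorded_error="prewarm failed")
    except (RuntimeError, NotRegisteredError) as exc:
        outcomes[s.name] = f"{type(exc).__name__}: {exc}"
```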


coderabbitai · 2026-06-18T16:07:45Z

+prewarm work that PR #1089 folded into prepare. This plan does not change PR
+#1089's serialization policy: any prewarm portion that PR #1089 keeps
+serialized with an active run remains serialized and must be reported as
+blocked or non-overlapped prepare time. If PR #1089 has not landed, this plan


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reflow the PR `#1089` sentences before they become accidental headings.

The wrapped lines starting with #1089 are likely to be parsed by markdownlint as ATX headings. Rewrap those sentences so the hash never begins a line.

Also applies to: 289-293
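A quick way to find such lines before running the linter is a one-pattern scan. The sketch below is narrower than MD018 itself: it only flags lines that begin with `#` followed by a digit, which is an assumption chosen to match PR-number references like the ones quoted above.

```python
import re

# A line that starts with "#<digits>" looks like a PR reference that was
# wrapped to the start of a line.
HASH_REF_AT_LINE_START = re.compile(r"^#\d+")


def find_hash_led_lines(text):
    """Return 1-based line numbers that start with a '#<number>' reference."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if HASH_REF_AT_LINE_START.match(line)
    ]


sample = (
    "prewarm work that PR #1089 folded into prepare. This plan does not change PR\n"
    "#1089's serialization policy: any prewarm portion that PR #1089 keeps\n"
    "serialized with an active run remains serialized and must be reported as\n"
)
flagged = find_hash_led_lines(sample)
```

References that appear mid-line, such as the `#1089` on the first sample line, are not flagged; only the wrapped one on the second line is.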

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 253-253: No space after hash on atx style heading

(MD018, no-missing-space-atx)


Source: Linters/SAST tools

- Add L3 Worker.run_async as an async DAG queue while keeping sync run ordered through the same lane
- Add L2 run/register lanes plus chip-only register_async and unregister_async with tombstone/deferred free
- Let chip-child async run/register/unregister controls overlap TASK_RUNNING without widening memory/domain control semantics
- Add unit coverage and task-submit hardware overlap acceptance


gemini-code-assist Bot reviewed Jun 18, 2026


coderabbitai Bot reviewed Jun 18, 2026


puddingfjz changed the title from "Add: callable prepare overlap" to "Add async chip callable register/run overlap" on Jun 26, 2026

puddingfjz force-pushed the feat/document-callable-prepare-overlap-plan branch from aad9b22 to 9f9d588 on June 26, 2026 08:22

puddingfjz added 3 commits June 26, 2026 16:30

Update: document chip callable async flow

573ea3a

- Document level-specific async Worker APIs and chip-only guards
- Describe TASK_RUNNING mailbox overlap for async chip controls
- Record tombstone/deferred-free unregister semantics in formal docs

Fix: align prewarm args with current kernel ABI

275e41f

puddingfjz force-pushed the feat/document-callable-prepare-overlap-plan branch from 9f9d588 to 275e41f on June 26, 2026 08:38

puddingfjz added 2 commits June 26, 2026 17:00

Fix: satisfy async PR CI checks

a34bf85

Fix: keep async register payload writable

7210ad3

puddingfjz wants to merge 5 commits into hw-native-sys:main from puddingfjz:feat/document-callable-prepare-overlap-plan
