Add async chip callable register/run overlap #1090

Open
puddingfjz wants to merge 5 commits into hw-native-sys:main from puddingfjz:feat/document-callable-prepare-overlap-plan

@puddingfjz puddingfjz commented Jun 18, 2026

Summary

This PR adds async chip-callable register/run/unregister support so chip
callable preparation can overlap with already submitted chip execution.

PR #1089 made chip callable register/prepare do real prewarm work. This PR
keeps the synchronous API behavior, but adds async handles and child-side run /
register lanes so a later callable can be prepared while an earlier callable is
already running.

Pipeline

Parent L3 Worker
  run_async(orch_fn for C1)
    -> DAG run queue
        -> submit_next_level(C1)
            -> child mailbox TASK_READY

Chip child
  TASK_READY(C1)
    -> copy args out of mailbox
    -> pin C1 private slot
    -> enqueue C1 on child run lane
    -> publish TASK_RUNNING

Parent L3 Worker
  register_async(C2)
    -> control mailbox can enter during TASK_RUNNING
        -> child register lane prepares/prewarms C2

Chip child
  run lane:      run C1
  register lane: prepare/prewarm C2

TASK_RUNNING is the key handoff point: once the child has copied the run
payload out of the mailbox, selected async controls can temporarily use the
same mailbox and then restore TASK_RUNNING.
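
The handoff can be pictured with a toy single-slot mailbox. This is only a sketch of the ordering described above, not the PR's implementation: the `Mailbox` class, the `CONTROL` state name, and the two queue "lanes" are hypothetical names introduced for illustration.

```python
import enum
import queue
import threading


class MailboxState(enum.Enum):
    IDLE = "IDLE"
    TASK_READY = "TASK_READY"
    TASK_RUNNING = "TASK_RUNNING"
    CONTROL = "CONTROL"  # hypothetical name for "an async control owns the mailbox"


class Mailbox:
    """Toy single-slot mailbox; all names here are illustrative assumptions."""

    def __init__(self):
        self.lock = threading.Lock()
        self.state = MailboxState.IDLE
        self.payload = None
        self.log = []

    def _set(self, state):
        self.state = state
        self.log.append(state.value)

    def post_task(self, args):
        # Parent side: place the run payload in the mailbox.
        with self.lock:
            assert self.state is MailboxState.IDLE
            self.payload = args
            self._set(MailboxState.TASK_READY)

    def accept_task(self, run_lane):
        # Child side: copy the args out, enqueue the run, publish TASK_RUNNING.
        with self.lock:
            assert self.state is MailboxState.TASK_READY
            args = dict(self.payload)
            self.payload = None
            run_lane.put(args)
            self._set(MailboxState.TASK_RUNNING)

    def control(self, request, register_lane):
        # An async control may borrow the mailbox only during TASK_RUNNING,
        # and it restores TASK_RUNNING before returning.
        with self.lock:
            assert self.state is MailboxState.TASK_RUNNING
            self._set(MailboxState.CONTROL)
            register_lane.put(request)
            self._set(MailboxState.TASK_RUNNING)


mailbox = Mailbox()
run_lane, register_lane = queue.Queue(), queue.Queue()
mailbox.post_task({"callable": "C1"})
mailbox.accept_task(run_lane)
mailbox.control({"register": "C2"}, register_lane)
```

The point of the sketch is that the mailbox is free for reuse as soon as the payload has been copied out, so the control request does not have to wait for C1 to finish.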

Usage

L3

h1 = worker.register(c1)

run_h = worker.run_async(orch_fn_for_c1)

h2 = worker.register_async(c2).wait()

# C2 run is queued normally through the DAG path.
timing = worker.run(orch_fn_for_c2)

run_h.wait()
worker.unregister_async(h2).wait()

L2

h1 = worker.register(c1)

run_h1 = worker.run_async(h1, c1_args, cfg)

h2 = worker.register_async(c2).wait()
run_h2 = worker.run_async(h2, c2_args, cfg)

run_h1.wait()
run_h2.wait()
worker.unregister_async(h2).wait()

register_async() / unregister_async() are chip-callable only. Non-chip
targets or handles raise TypeError.
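
A minimal sketch of such a guard, assuming a simple type check. `ChipCallable`, `HostCallable`, and `check_chip_target` are hypothetical stand-ins, not the Worker's actual types or helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipCallable:
    """Hypothetical stand-in for a chip-level callable."""
    name: str


@dataclass(frozen=True)
class HostCallable:
    """Hypothetical stand-in for any non-chip target."""
    name: str


def check_chip_target(target):
    # Reject anything that is not a chip callable before queuing async work.
    if not isinstance(target, ChipCallable):
        raise TypeError(
            f"register_async() accepts chip callables only, got {type(target).__name__}"
        )
    return target


accepted = check_chip_target(ChipCallable("c2"))
try:
    check_chip_target(HostCallable("orch"))
    rejected = False
except TypeError:
    rejected = True
```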

Hardware Overlap Results

Measured on real a2a3 hardware through task-submit.

Task setup:

  • C1: repeat_vector_add, long run controlled by repeat_count=1000000
  • C2: simple_vector_add
  • trials=5
  • device=2
  • Timing includes C2 register and C2 run

L3 End-to-End

serial  = run(C1) + register_async(C2).wait() + run(C2)
overlap = run_async(C1)
          + register_async(C2).wait() while C1 is running
          + run_async(C2) queued after C1
          + wait both

Task: task_20260626_155236_381078019428

| Delay before C2 register | Serial median | Overlap median | Reduction | Speedup |
| --- | --- | --- | --- | --- |
| 0 ms | 26046.3 us | 19471.1 us | 25.2% | 1.34x |
| 2 ms | 26215.6 us | 17477.7 us | 33.3% | 1.50x |

Correctness: max_diff=0.000e+00

L2 Direct Worker End-to-End

serial  = run(C1) + register_async(C2).wait() + run(C2)
overlap = run_async(C1)
          + register_async(C2).wait() while C1 is running
          + run_async(C2) queued on the same L2 run lane
          + wait both

Task: task_20260626_155304_382795311062

| Delay before C2 register | Serial median | Overlap median | Reduction | Speedup |
| --- | --- | --- | --- | --- |
| 0 ms | 17392.9 us | 10913.2 us | 37.3% | 1.59x |
| 2 ms | 17315.7 us | 12424.0 us | 28.3% | 1.39x |

Correctness: max_diff=0.000e+00
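
The Reduction and Speedup columns in both result tables can be recomputed from the medians. The sketch below assumes the usual definitions, reduction = (serial - overlap) / serial and speedup = serial / overlap; the helper name `overlap_gain` is made up for this example.

```python
def overlap_gain(serial_us, overlap_us):
    """Return (reduction in percent, speedup factor), rounded as in the tables."""
    reduction_pct = (serial_us - overlap_us) / serial_us * 100.0
    speedup = serial_us / overlap_us
    return round(reduction_pct, 1), round(speedup, 2)


# (level, delay before C2 register in ms) -> (serial median, overlap median) in us
medians = {
    ("L3", 0): (26046.3, 19471.1),
    ("L3", 2): (26215.6, 17477.7),
    ("L2", 0): (17392.9, 10913.2),
    ("L2", 2): (17315.7, 12424.0),
}

gains = {key: overlap_gain(*pair) for key, pair in medians.items()}
for (level, delay_ms), (reduction, speedup) in gains.items():
    print(f"{level} delay={delay_ms} ms: reduction={reduction}% speedup={speedup:.2f}x")
```

Under those definitions the computed values agree with the reported ones for all four rows.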

Notes

  • Sync register() / run() / unregister() remain available.
  • Sync and async runs share one run queue, so a sync run cannot overtake an
    earlier async run.
  • Final chip unregister uses tombstone/deferred free: new runs through the
    handle are rejected, but native unregister/free waits until in-flight runs
    release the private slot.
  • L3 does not expose a direct public chip-run API; L3 execution goes through
    run(orch_fn) or run_async(orch_fn).
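
The tombstone/deferred-free note can be illustrated with a small reference-counted handle. This is a sketch under assumptions: `SlotHandle` and its method names are hypothetical, and the real native unregister path is not shown.

```python
import threading


class SlotHandle:
    """Hypothetical handle: tombstoned on unregister, freed after the last run."""

    def __init__(self, free_fn):
        self._lock = threading.Lock()
        self._inflight = 0
        self._tombstoned = False
        self._freed = False
        self._free_fn = free_fn

    def begin_run(self):
        with self._lock:
            if self._tombstoned:
                raise RuntimeError("handle is unregistered; new runs are rejected")
            self._inflight += 1  # pin the private slot for this run

    def end_run(self):
        with self._lock:
            self._inflight -= 1  # release the pin
            self._maybe_free()

    def unregister(self):
        with self._lock:
            self._tombstoned = True  # reject new runs immediately
            self._maybe_free()       # free now only if nothing is in flight

    def _maybe_free(self):
        if self._tombstoned and self._inflight == 0 and not self._freed:
            self._freed = True
            self._free_fn()


freed = []
handle = SlotHandle(lambda: freed.append("slot"))
handle.begin_run()
handle.unregister()
freed_while_running = list(freed)  # still empty: a run holds the slot
try:
    handle.begin_run()
    new_run_rejected = False
except RuntimeError:
    new_run_rejected = True
handle.end_run()  # the last in-flight run releases the slot, so the free happens here
```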

@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 42dd4335-ef18-4e04-8420-f47a9669bb2b

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Adds docs/callable-prepare-overlap-plan.md, a new draft design document specifying how callable-level prepare for a subsequent callable can overlap with the run of a prior callable. The document covers API shape, a slot state machine, locking and snapshot rules, a double-buffered staging/publish contract, L3 control-lane requirements, and an implementation plan.

Changes

Callable Prepare Overlap Design Document

| Layer | File(s) | Summary |
| --- | --- | --- |
| Goals, non-goals, and current constraints | docs/callable-prepare-overlap-plan.md | States the overlap goal (prepare C2 during run C1), explicitly excludes CUDA Graph/stable-buffer replay, and enumerates concurrency constraints and shared-state audit items. |
| Public API shape and Python async registration | docs/callable-prepare-overlap-plan.md | Retains blocking prepare_callable(C) as a composition of prepare_callable_async plus wait_prepare; defines Worker.register_async() returning a CallableHandle usable by Worker.run. |
| Slot state machine, locking model, and snapshot rules | docs/callable-prepare-overlap-plan.md | Specifies EMPTY/PREPARING/PREPARED/FAILED/UNREGISTERING transitions, per-callable registry lock to avoid serializing prepares, run_prepared snapshot discipline, and generation-matching for stale-completion prevention. |
| Staging/publish contract, failure/unregister semantics, and AICPU boundary | docs/callable-prepare-overlap-plan.md | Defines double-buffered staging where prepares write unpublished artifacts and publish atomically; specifies run-side immutable snapshot requirements, AICPU first-sighting discipline, cooperative cancellation, safe-point unregister, and independence boundary with AICPU prewarm PR #1089. |
| L3 control-lane requirements and optimization scope | docs/callable-prepare-overlap-plan.md | Explains why existing L3 mailbox paths block overlap, specifies a separate control mailbox and chip-child control thread, excludes stable buffers from this plan, and lists follow-up resource optimization candidates. |
| Implementation plan, validation criteria, and references | docs/callable-prepare-overlap-plan.md | Step-by-step plan covering prerequisite leak fix, slot state/RAII, run_prepared refactor, async APIs, Python GIL-release paths, and L3 control lane; functional and performance validation criteria; and a references list. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Possibly related PRs

  • hw-native-sys/simpler#891: Defines the callable-level identity registration and CallableHandle model that this overlap plan lists as part of the async prepare/run pipelining scope.

Poem

🐇 A rabbit once staged a grand scheme,
To prepare the next hop mid-stream!
Async and locked,
Snapshots are docked,
Overlapping runs — what a dream! 🌀

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The description clearly matches the documentation-only design plan for async callable prepare overlap and related register/run semantics. |
| Title check | ✅ Passed | The title matches the main change: a draft design for async callable register/prepare and run overlap. |

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a design document for the 'Callable Prepare Overlap Plan', which outlines a pipeline optimization to overlap the preparation of a future callable with the execution of the current one. The reviewer provided valuable feedback on the proposed design, suggesting the addition of an intermediate state (such as CANCELLING or UNREGISTERING) for the PREPARING to EMPTY transition to prevent race conditions, and recommending that the document explicitly note the requirement for the new chip-child control thread to attach to the thread-local CANN device context.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread docs/callable-prepare-overlap-plan.md Outdated
Comment on lines +137 to +143
```text
EMPTY -> PREPARING -> PREPARED
                   -> FAILED
PREPARING -> EMPTY (unregister/cancel after worker reaches a safe point)
PREPARED -> UNREGISTERING -> EMPTY
FAILED -> UNREGISTERING -> EMPTY
```

medium

In the proposed state machine, when unregister_callable is called during the PREPARING state, it requests cooperative cancellation and waits for the active prepare worker to reach a safe point. However, unlike the PREPARED and FAILED states which transition to UNREGISTERING before reaching EMPTY, there is no intermediate state shown for PREPARING during this waiting period.

To prevent race conditions and clearly signal to other threads that the slot is being cancelled/torn down (and cannot be transitioned to PREPARED or reused), consider introducing an intermediate state (e.g., CANCELLING or routing through UNREGISTERING) for the PREPARING -> EMPTY transition as well.
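
One way to read this suggestion is as a transition table in which PREPARING can no longer jump straight to EMPTY. The sketch below assumes the `CANCELLING` variant of the proposal; the enum and `transition` helper are illustrative, not code from the design document.

```python
import enum


class SlotState(enum.Enum):
    EMPTY = enum.auto()
    PREPARING = enum.auto()
    PREPARED = enum.auto()
    FAILED = enum.auto()
    CANCELLING = enum.auto()  # proposed intermediate state (assumption)
    UNREGISTERING = enum.auto()


ALLOWED = {
    SlotState.EMPTY: {SlotState.PREPARING},
    SlotState.PREPARING: {SlotState.PREPARED, SlotState.FAILED, SlotState.CANCELLING},
    SlotState.CANCELLING: {SlotState.EMPTY},  # only after the worker hits a safe point
    SlotState.PREPARED: {SlotState.UNREGISTERING},
    SlotState.FAILED: {SlotState.UNREGISTERING},
    SlotState.UNREGISTERING: {SlotState.EMPTY},
}


def transition(current, new):
    """Return the new state, or raise if the edge is not in the table."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new


state = transition(SlotState.PREPARING, SlotState.CANCELLING)
try:
    # A late prepare completion must not publish a slot that is being cancelled.
    transition(state, SlotState.PREPARED)
    late_publish_blocked = False
except ValueError:
    late_publish_blocked = True
```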

References
  1. Ensure documentation and diagrams accurately reflect implementation details regarding resource lifecycles, especially when persistence is used to maintain internal state like caches.

Comment thread docs/callable-prepare-overlap-plan.md Outdated
Comment on lines +281 to +282
- a chip-child control thread that services the control mailbox while the task
thread is blocked in `run_prepared_from_blob()`.

medium

Since CANN device contexts are thread-local (as implemented in DeviceRunnerBase::attach_current_thread), any new thread spawned on the child side—such as the proposed chip-child control thread—must explicitly attach to the correct device context (e.g., via rtSetDevice or attach_current_thread) before executing any device-level operations (like allocation, H2D copies, or prewarm).

It would be highly beneficial to explicitly document this requirement in the L3 design section to ensure it is handled correctly during implementation.
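
The pitfall is the same one that any thread-local state has, and it can be shown with a Python analogy. This sketch uses `threading.local` as a stand-in for the device context; the `attach_current_thread` function here is a hypothetical Python helper that merely borrows the name of the C++ method mentioned above, and no CANN call is made.

```python
import threading

_ctx = threading.local()  # stand-in for a thread-local device context


def attach_current_thread(device_id):
    # Hypothetical analogue of attaching the calling thread to a device.
    _ctx.device_id = device_id


def current_device():
    return getattr(_ctx, "device_id", None)


seen = {}


def control_thread(attach):
    # A freshly spawned thread starts with no context of its own.
    if attach:
        attach_current_thread(0)
    seen[attach] = current_device()


attach_current_thread(0)  # the spawning thread is attached...
for attach in (False, True):
    worker = threading.Thread(target=control_thread, args=(attach,))
    worker.start()
    worker.join()
# ...but only the worker that attached itself sees a device.
```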

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/callable-prepare-overlap-plan.md`:
- Around line 147-153: Add explicit rules specifying the outcome of
`run_prepared(cid)` for the `EMPTY` and `UNREGISTERING` states. After the
existing rules for `PREPARED`, `PREPARING`, and `FAILED` states, add two new
bullet points that document what `run_prepared()` does when called on a
non-registered callable or one that is being unregistered, ensuring these cases
result in clear error outcomes rather than blocking or falling through.
- Around line 252-255: The markdown file contains sentences with PR reference
`#1089` that are wrapped in a way that causes these references to appear at the
beginning of lines, which markdownlint will interpret as ATX heading syntax.
Rewrap the sentences containing PR `#1089` references so that the hash symbol and
PR number never start a new line and instead flow naturally within the text
paragraphs. Apply this fix to the sections mentioned in the diff around lines
252-255 and also check and fix the same issue in the section around lines
289-293.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 973e454f-1103-4180-a252-1b652c75919d

📥 Commits

Reviewing files that changed from the base of the PR and between cdbea27 and aad9b22.

📒 Files selected for processing (1)
  • docs/callable-prepare-overlap-plan.md

Comment thread docs/callable-prepare-overlap-plan.md Outdated
Comment on lines +147 to +153
- `run_prepared(cid)` may proceed only from `PREPARED`.
- `run_prepared(cid)` waits if the slot is `PREPARING`.
- `run_prepared(cid)` fails with the recorded error if the slot is `FAILED`.
- `unregister_callable(cid)` removes public visibility, requests cooperative
cancellation for `PREPARING`, and waits for active prepare or run users before
releasing resources.
- callable-id reuse must not expose stale AICPU or host-side state.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Specify the EMPTY / UNREGISTERING outcome for run_prepared().

The rules cover PREPARING and FAILED, but not the not-registered path. The current C API already treats "no prep state" as a negative error, so this needs to be explicit here; otherwise an implementation can accidentally block or fall through on an empty slot.
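
A sketch of a dispatch that makes every state's outcome explicit. It is an assumption that the not-registered cases surface as a distinct error; `LookupError` and the function name `run_prepared_action` are chosen for this example only and do not reflect the C API's error codes.

```python
import enum


class SlotState(enum.Enum):
    EMPTY = enum.auto()
    PREPARING = enum.auto()
    PREPARED = enum.auto()
    FAILED = enum.auto()
    UNREGISTERING = enum.auto()


def run_prepared_action(state, recorded_error=None):
    """Decide what run_prepared should do for a slot in `state`."""
    if state is SlotState.PREPARED:
        return "run"
    if state is SlotState.PREPARING:
        return "wait"
    if state is SlotState.FAILED:
        raise RuntimeError(recorded_error or "prepare failed")
    # EMPTY and UNREGISTERING: fail fast instead of blocking or falling through.
    raise LookupError(f"callable is not runnable in state {state.name}")


outcomes = {}
for slot_state in SlotState:
    try:
        outcomes[slot_state.name] = run_prepared_action(slot_state, "boom")
    except (RuntimeError, LookupError) as exc:
        outcomes[slot_state.name] = type(exc).__name__
```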

Comment thread docs/callable-prepare-overlap-plan.md Outdated
Comment on lines +252 to +255
prewarm work that PR #1089 folded into prepare. This plan does not change PR
#1089's serialization policy: any prewarm portion that PR #1089 keeps
serialized with an active run remains serialized and must be reported as
blocked or non-overlapped prepare time. If PR #1089 has not landed, this plan

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reflow the `PR #1089` sentences before they become accidental headings.

The wrapped lines starting with #1089 are likely to be parsed by markdownlint as ATX headings. Rewrap those sentences so the hash never begins a line.

Also applies to: 289-293

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 253-253: No space after hash on atx style heading

(MD018, no-missing-space-atx)
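
A quick way to spot such lines before running the linter is a small script. The regular expression below is a rough approximation of what MD018 flags (one or more `#` at the start of a line, followed by a character that is neither `#` nor whitespace); it is not markdownlint's own implementation, and the helper name is made up for this sketch.

```python
import re

# Lines that begin with '#' characters immediately followed by text.
HASH_NO_SPACE = re.compile(r"^#+[^#\s]")


def lines_starting_with_bare_hash(text):
    """Return 1-based numbers of lines that look like headings missing a space."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if HASH_NO_SPACE.match(line)
    ]


sample = (
    "prewarm work that PR #1089 folded into prepare. This plan does not change PR\n"
    "#1089's serialization policy: any prewarm portion that PR #1089 keeps\n"
    "# A real heading\n"
)
flagged = lines_starting_with_bare_hash(sample)
```

On the sample, only the second line is flagged; a reference such as `#1089` in the middle of a line and a well-formed heading are both left alone.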

Source: Linters/SAST tools

@puddingfjz puddingfjz changed the title from "Add: callable prepare overlap" to "Add async chip callable register/run overlap" on Jun 26, 2026
@puddingfjz puddingfjz force-pushed the feat/document-callable-prepare-overlap-plan branch from aad9b22 to 9f9d588 on June 26, 2026 08:22
- Add L3 Worker.run_async as an async DAG queue while keeping sync run ordered through the same lane
- Add L2 run/register lanes plus chip-only register_async and unregister_async with tombstone/deferred free
- Let chip-child async run/register/unregister controls overlap TASK_RUNNING without widening memory/domain control semantics
- Add unit coverage and task-submit hardware overlap acceptance
- Document level-specific async Worker APIs and chip-only guards
- Describe TASK_RUNNING mailbox overlap for async chip controls
- Record tombstone/deferred-free unregister semantics in formal docs
@puddingfjz puddingfjz force-pushed the feat/document-callable-prepare-overlap-plan branch from 9f9d588 to 275e41f on June 26, 2026 08:38