feat(runtime): L3 post-fork host-buffer registration by doraemonmj · Pull Request #1190 · hw-native-sys/simpler

doraemonmj · 2026-06-29T06:36:59Z

Closes #1027.

Problem

L3 chip children are forked lazily on the first Worker.run(). A host tensor created after that — the natural dynamic-shape serving pattern — is not in the children's address space: the orch fn runs in the parent and the per-task args carry a raw parent VA that is unmapped (or stale) in the child. Today serving must preallocate every input/output buffer (at max shape) before the worker is created.

Change

Add Worker.register_host_buffer(tensor) / unregister_host_buffer(handle):

- A separate named shared-memory buffer is mapped into every chip child post-fork and kept mapped (broadcast via the existing NEXT_LEVEL control path, with the same partial-failure rollback as _CTRL_PY_REGISTER).
- Before the runtime dereferences a task, the child rewrites the mailbox blob's host pointers (Tensor.buffer.addr) to its own mapping. This is a pure-Python blob rewrite with no runtime C++ change.
- Each run() mirrors the tensor through the shm: H2D copy-in before the task, D2H copy-out after the run drains.
- submit_next_level validates host-tensor visibility and raises an actionable error for an unregistered post-fork tensor. Fork-inherited vs post-fork is decided by the buffer's backing inode plus a fork-time VA snapshot, so a fresh post-fork torch.empty (its own mmap) and a post-fork share_memory_ tensor (new inode, even if it reused a freed VA) are both rejected.
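
The copy-in / copy-out mirroring described in the list above can be illustrated with the standard library alone. This is a minimal sketch, not the PR's implementation: `mirror_through_shm` and `double_in_place` are hypothetical names, the second mapping is attached in the same process rather than in a forked chip child, and the task is an ordinary Python callable standing in for the runtime.

```python
from multiprocessing import shared_memory
from typing import Callable


def mirror_through_shm(payload: bytes, task: Callable[[memoryview, int], None]) -> bytes:
    """Copy `payload` into a named shm, run `task` on a second mapping, copy back."""
    n = len(payload)
    parent = shared_memory.SharedMemory(create=True, size=n)
    try:
        parent.buf[:n] = payload  # copy-in before the task
        # A child would attach by name after receiving it over a control channel
        # (assumption: here the attach happens in the same process).
        child = shared_memory.SharedMemory(name=parent.name)
        try:
            task(child.buf, n)  # the task only ever touches the second mapping
        finally:
            child.close()
        return bytes(parent.buf[:n])  # copy-out after the task has finished
    finally:
        parent.close()
        parent.unlink()


def double_in_place(buf: memoryview, n: int) -> None:
    """Stand-in for a task that writes its result into the mapped buffer."""
    for i in range(n):
        buf[i] = (buf[i] * 2) % 256


result = mirror_through_shm(bytes([1, 2, 3]), double_in_place)
```

The two copies are what the scope section calls memcpy rather than zero-copy: the original buffer and the named segment are distinct memory, so data has to be moved in both directions.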

Sub-views of a registered buffer resolve automatically; views overrunning the registered size raise.
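
A rough sketch of how sub-view resolution and the overrun check could work, assuming the child keeps a sorted table of registered parent ranges. The tuple layout `(parent_lo, parent_hi, child_base)` and the function name `translate` are illustrative assumptions, not the PR's actual data structures.

```python
import bisect
from typing import List, Tuple

# (parent_lo, parent_hi, child_base), sorted by parent_lo. Hypothetical layout.
Range = Tuple[int, int, int]


def translate(ranges: List[Range], addr: int, nbytes: int) -> int:
    """Map a parent address inside a registered range to the child's mapping."""
    starts = [r[0] for r in ranges]
    # Rightmost range whose start is <= addr.
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        raise LookupError(f"address {addr:#x} is not in any registered buffer")
    lo, hi, child_base = ranges[i]
    if addr >= hi:
        raise LookupError(f"address {addr:#x} is not in any registered buffer")
    if addr + nbytes > hi:
        # A view that starts inside the buffer but extends past its end.
        raise ValueError(f"view at {addr:#x} overruns the registered size")
    return child_base + (addr - lo)
```

With `ranges = [(0x1000, 0x2000, 0x9000)]`, a 16-byte view at `0x1010` resolves to `0x9010`, while a view that starts at `0x1FF0` and spans 32 bytes is rejected.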

Per the issue decision this delivers B + C + D and defers A (transparent auto-mapping, which would have to guess child visibility from a raw data_ptr and risk silent corruption when torch recycles the pointer).

Scope / limits (documented in `comm-domain.md`)

- memcpy, not zero-copy (a registered buffer is a separate shm); true zero-copy is a later optimization.
- orch.copy_to (the explicit low-level staging path) is out of scope; its src must still be fork-inherited.
- A small non-shared post-fork tensor sub-sliced out of a fork-time heap arena can slip the check (anonymous, inside a fork range); always share_memory_ or register host tensors for chip dispatch.

Files

- python/simpler/worker.py, python/simpler/orchestrator.py: implementation (Python only)
- docs/comm-domain.md: lifetime/visibility contract
- tests/st/a2a3/tensormap_and_ringbuffer/test_l3_host_buffer_registration.py: one a2a3 scene test covering the mechanism (B) and the error path (C, both shared and anonymous post-fork memory)

Testing

- a2a3sim and a2a3 onboard (via task-submit): the scene test passes.
- pre-commit (ruff / pyright / markdownlint / english-only / headers) clean.

gemini-code-assist

Code Review

This pull request implements host-buffer registration to resolve host tensor visibility issues for post-fork chip children (issue #1027), allowing post-fork host tensors to be mapped via shared memory and validated during orchestrator submission. The review feedback suggests securing the host buffer registry operations with self._registry_lock to prevent race conditions in concurrent environments, and simplifying the parsing of /proc/self/maps to make it more robust and readable.
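
For readers unfamiliar with the `/proc/self/maps` format the review refers to, a compact parser might look like the sketch below. It is an illustration under stated assumptions: `MapRange`, `parse_maps_line`, `read_self_maps`, and `covered_at_fork` are hypothetical names, and the coverage check follows the range-plus-inode rule requested in the CodeRabbit comments rather than the code actually merged.

```python
from typing import List, NamedTuple, Optional


class MapRange(NamedTuple):
    lo: int
    hi: int
    inode: int


def parse_maps_line(line: str) -> Optional[MapRange]:
    """Parse one 'lo-hi perms offset dev inode [path]' line; None if malformed."""
    # Limit the split so a pathname containing spaces stays in the last field.
    fields = line.split(None, 5)
    if len(fields) < 5:
        return None
    try:
        lo_s, hi_s = fields[0].split("-")
        return MapRange(int(lo_s, 16), int(hi_s, 16), int(fields[4]))
    except ValueError:
        return None


def read_self_maps(path: str = "/proc/self/maps") -> List[MapRange]:
    """Return the parsed mappings, or an empty list where the file is unavailable."""
    try:
        with open(path) as f:
            return [r for r in map(parse_maps_line, f) if r is not None]
    except OSError:
        return []


def covered_at_fork(snapshot: List[MapRange], addr: int, nbytes: int, inode: int) -> bool:
    """True if [addr, addr + nbytes) lies inside one snapshot range with the same inode."""
    return any(
        r.inode == inode and r.lo <= addr and addr + nbytes <= r.hi
        for r in snapshot
    )
```

Returning an empty list on `OSError` keeps the helper usable on platforms without procfs; how the real code should degrade in that case is a policy decision this sketch does not settle.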


coderabbitai · 2026-06-29T06:41:33Z

Walkthrough

Adds register_host_buffer / unregister_host_buffer APIs to Worker so host tensors created after chip children are forked can be used safely in worker.run(). Includes shared-memory staging, child-side VA translation and blob pointer rewriting, orchestrator wiring, unit and integration tests, and updated documentation.

Changes

Post-fork Host Buffer Registration

| Layer | File(s) | Summary |
| --- | --- | --- |
| Constants, types, and parent-side init state | `python/simpler/worker.py` | Adds `bisect`, `TensorArgType` imports, `_CTRL_MAP_HOST`/`_CTRL_UNMAP_HOST` wire constants, blob layout structs, `_HostBufEntry`, `HostBufferHandle`, `_host_nbytes`, and `Worker.__init__` state fields for the registration control plane. |
| Chip child: control handlers, blob rewriting, and teardown | `python/simpler/worker.py` | Adds `host_buf_table`/`host_buf_ranges` to the chip loop, `_handle_ctrl_map_host`/`_handle_ctrl_unmap_host`, blob host-pointer rewriting before `run_prepared_from_blob`, dispatch for new control commands, teardown cleanup, and refactored `_read_ctrl_staged_shm_name` usage. |
| Parent-side registration API, fork snapshot, and run lifecycle | `python/simpler/worker.py` | Adds fork-time `/proc/self/maps` snapshot, `register_host_buffer`/`unregister_host_buffer`/`_release_all_host_buffers`, `_stage_host_buffers_for_chip_submit`, `_find_host_buf_entry`, `_flush_host_buffer_copyback`, per-run state reset, output copyback on success, and `close()` release. |
| Orchestrator staging call wiring | `python/simpler/orchestrator.py` | `submit_next_level` and `submit_next_level_group` call `_stage_host_buffers_for_chip_submit` for `LOCAL_CHIP` targets when a `Worker` is bound. |
| Tests and documentation | `tests/ut/py/test_worker/test_host_buffer_registration.py`, `tests/st/a2a3/tensormap_and_ringbuffer/test_l3_host_buffer_registration.py`, `docs/comm-domain.md` | Unit tests cover error-path rejection and `_read_self_maps` degradation; L3 integration test covers happy-path post-fork registration; documentation adds host-tensor visibility rules and scope/limits section. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 A new tunnel dug beneath the fork's divide,
Shared memory mapped on the child's inside.
Pointers rewritten, blobs set aright,
Post-fork tensors finally shine bright.
Register once, then run all day—
No stale COW data along the way! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 65.12%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The description is brief but directly describes the feature and its purpose, so it is on-topic. |
| Linked Issues check | ✅ Passed | The implementation matches `#1027` by adding registration APIs, validation, docs, and tests for post-fork host tensors. |
| Out of Scope Changes check | ✅ Passed | The modified files are all directly tied to the host-buffer registration feature and its documentation or tests. |
| Title check | ✅ Passed | The title is concise and accurately highlights the main change: L3 post-fork host-buffer registration in runtime. |


coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (1)

tests/st/a2a3/tensormap_and_ringbuffer/test_l3_host_buffer_registration.py (1)
120-126: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use plain post-fork tensors for the happy-path coverage.

These buffers are still moved to .share_memory_() after the fork, so this scene only proves the file-backed source case. The new API is supposed to make any post-fork host tensor usable by copying through the registration shm, so switching a, b, and out to ordinary torch.full / torch.zeros would cover the common dynamic-shape path end-to-end.
Proposed test change
-        a = torch.full((SIZE,), 5.0, dtype=torch.float32).share_memory_()
-        b = torch.full((SIZE,), 7.0, dtype=torch.float32).share_memory_()
-        out = torch.zeros(SIZE, dtype=torch.float32).share_memory_()
+        a = torch.full((SIZE,), 5.0, dtype=torch.float32)
+        b = torch.full((SIZE,), 7.0, dtype=torch.float32)
+        out = torch.zeros(SIZE, dtype=torch.float32)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/st/a2a3/tensormap_and_ringbuffer/test_l3_host_buffer_registration.py`
around lines 120 - 126, The happy-path test in the host buffer registration flow
is still using .share_memory_() tensors, so it only covers the file-backed case
instead of the intended post-fork tensor registration path. Update the setup in
test_l3_host_buffer_registration to create a, b, and out as plain
torch.full/torch.zeros tensors, then pass them through
worker.register_host_buffer so the test exercises the registration shm copy
behavior end-to-end.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/comm-domain.md`:
- Around line 153-160: The host-tensor visibility guidance in the shared-memory
section is too strict and confusing: it should state that the cutoff is the
first chip fork, not Worker.init(), and that the original tensor storage only
needs to be child-visible for the fork-inherited path. Update the wording around
worker.run(...), orch.submit_next_level(...), and orch.copy_to so the table
clearly distinguishes fork-inherited from registered post-fork, and make sure
the registered path is described as working via the buffer registration shim
rather than requiring the tensor itself to be shared before Worker.init().

In `@python/simpler/orchestrator.py`:
- Around line 193-194: The host-buffer staging path in
orchestrator._stage_host_buffers_for_chip_submit is mutating
self._worker._pending_host_copyback before all validation is complete, so
partial failures can leave stale copybacks queued. Update the LOCAL_CHIP submit
flow in orchestrator._submit_chip_group (and the related call site around the
c_args loop) to make staging transactional: snapshot or isolate
_pending_host_copyback before calling _stage_host_buffers_for_chip_submit, and
restore/discard it if any exception is raised before submit completes. Ensure
the rollback covers both the per-c_args staging and the outer group loop so a
failed validation never leaves prior D2H copybacks behind.

In `@python/simpler/worker.py`:
- Around line 3398-3407: The fork snapshot logic in the worker’s chip-path only
records inode membership and captures it too late, so
`dw.init()`/prewarm-created mappings can leak into the fork state and later
accept post-fork remaps. Update the fork snapshot around `_read_self_maps()` in
the chip-fork path to preserve the full `(lo, hi, inode)` ranges captured
immediately before the fork, and then change the validation at the `addr`/tensor
size check to require the requested `[addr, addr + tensor_nbytes)` interval to
be covered by a matching fork-captured range with the same inode, not just a
matching inode set.
- Around line 4190-4194: `unregister_host_buffer()` and
`_release_all_host_buffers()` should always free the parent shared memory even
if `_broadcast_host_unmap()` raises. Wrap the broadcast and per-entry cleanup in
the same best-effort try/except/finally pattern used elsewhere so
`entry.shm.close()` and `entry.shm.unlink()` still run for each entry, and
ensure `self._worker`/`_hierarchical_started` handling in
`python/simpler/worker.py` does not abort cleanup early.
- Around line 385-390: `unregister_host_buffer()` currently removes entries
using only `handle.data_ptr`, so an old `HostBufferHandle` can accidentally
unregister a newer registration at the same pointer. Update the `Worker`
host-buffer registry logic to validate the full handle identity before deleting,
using `HostBufferHandle.token` as part of the lookup/match, and add an owner
identifier if needed so handles from one `Worker` cannot be used to unregister
another worker’s buffer.
- Around line 4268-4271: The per-submit staging in the buffer copy-in path can
overwrite a newer same-run output with stale parent data, so update the logic
around the submit staging block in worker.py to avoid re-copying overlapping
registered buffers after a producer has already written them. Track the
output/INOUT ranges staged for the current run inside the submit/dispatch flow
that calls submit_next_level, and skip the ctypes.memmove copy-in for later
INPUT overlaps, or refactor registered-buffer copy-in to a run-level
pre-execution phase before tasks can run.
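
Two of the `worker.py` findings above (full handle identity on unregister, and freeing parent memory even when the unmap broadcast fails) can be sketched together. This is a toy model, not the PR's code: `Handle`, `Registry`, and the injected `broadcast_unmap` / `free` callables are hypothetical stand-ins for `HostBufferHandle`, the `Worker` registry, `_broadcast_host_unmap()`, and the shm close/unlink step. The sketch re-raises the broadcast error after cleanup; swallowing it instead would be an equally valid policy.

```python
import secrets
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Handle:
    """Hypothetical stand-in for HostBufferHandle."""
    data_ptr: int
    token: str


class Registry:
    def __init__(self, broadcast_unmap: Callable[[int], None], free: Callable[[int], None]):
        self._tokens: Dict[int, str] = {}
        self._broadcast_unmap = broadcast_unmap
        self._free = free

    def register(self, data_ptr: int) -> Handle:
        if data_ptr in self._tokens:
            raise ValueError(f"{data_ptr:#x} is already registered")
        token = secrets.token_hex(8)  # fresh identity for every registration
        self._tokens[data_ptr] = token
        return Handle(data_ptr, token)

    def unregister(self, handle: Handle) -> None:
        # Match on pointer AND token, so a stale handle cannot remove a newer
        # registration that happens to reuse the same address.
        if self._tokens.get(handle.data_ptr) != handle.token:
            raise KeyError("stale or foreign handle")
        del self._tokens[handle.data_ptr]
        try:
            self._broadcast_unmap(handle.data_ptr)
        finally:
            # Runs even if the broadcast raised, so parent memory is not leaked.
            self._free(handle.data_ptr)
```

Scoping handles to a single owner, as the review also suggests, would need an extra field such as a per-`Worker` identifier; that is left out here to keep the sketch short.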


📥 Commits

Reviewing files that changed from the base of the PR and between b6091ee and b0c39d2.

📒 Files selected for processing (5)

docs/comm-domain.md
python/simpler/orchestrator.py
python/simpler/worker.py
tests/st/a2a3/tensormap_and_ringbuffer/test_l3_host_buffer_registration.py
tests/ut/py/test_worker/test_host_buffer_registration.py

Host tensors created after the L3 chip children are forked (lazily on the first run) were invisible to those children, forcing serving to preallocate all buffers before worker creation. Add Worker.register_host_buffer / unregister_host_buffer: a named shm is mapped into every chip child and the per-task mailbox blob's host pointers are rewritten to the child's own mapping before the runtime dereferences them, so a later run can H2D/D2H through it. Pure Python (worker.py / orchestrator.py); no runtime C++ change. submit_next_level validates host-tensor visibility and raises an actionable error for an unregistered post-fork tensor (fork-inherited vs post-fork is decided by backing inode + fork VA snapshot). Lifetime/visibility contract documented in comm-domain.md; one a2a3 scene test covers the mechanism and the error path.

gemini-code-assist Bot reviewed Jun 29, 2026

Comment thread python/simpler/worker.py Outdated

Comment thread python/simpler/worker.py Outdated

doraemonmj force-pushed the l3-host-buffer-registration branch 2 times, most recently from 7ba711f to b0c39d2, on June 29, 2026 11:05

coderabbitai Bot reviewed Jun 29, 2026

Comment thread docs/comm-domain.md Outdated

Comment thread python/simpler/orchestrator.py Outdated

Comment thread python/simpler/worker.py

Comment thread python/simpler/worker.py Outdated

Comment thread python/simpler/worker.py Outdated

Comment thread python/simpler/worker.py Outdated

doraemonmj force-pushed the l3-host-buffer-registration branch from 97e9cb1 to b0c39d2 on June 29, 2026 11:15

doraemonmj changed the title from ~~feat(runtime): L3 post-fork host-buffer registration (#1027)~~ to feat(runtime): L3 post-fork host-buffer registration on Jun 30, 2026

doraemonmj force-pushed the l3-host-buffer-registration branch 2 times, most recently from ef8e301 to 8426d0d, on June 30, 2026 11:05

doraemonmj force-pushed the l3-host-buffer-registration branch from 8426d0d to 2e85364 on June 30, 2026 12:16

doraemonmj wants to merge 1 commit into hw-native-sys:main from doraemonmj:l3-host-buffer-registration
