feat(runtime): L3 post-fork host-buffer registration #1190

Open
doraemonmj wants to merge 1 commit into hw-native-sys:main from doraemonmj:l3-host-buffer-registration

Conversation

@doraemonmj

Contributor

Closes #1027.

Problem

L3 chip children are forked lazily on the first Worker.run(). A host tensor created after that — the natural dynamic-shape serving pattern — is not in the children's address space: the orch fn runs in the parent and the per-task args carry a raw parent VA that is unmapped (or stale) in the child. Today serving must preallocate every input/output buffer (at max shape) before the worker is created.

Change

Add Worker.register_host_buffer(tensor) / unregister_host_buffer(handle):

  • A separate named shared-memory buffer is mapped into every chip child post-fork and kept mapped (broadcast via the existing NEXT_LEVEL control path, with the same partial-failure rollback as _CTRL_PY_REGISTER).
  • Before the runtime dereferences a task, the child rewrites the mailbox blob's host pointers (Tensor.buffer.addr) to its own mapping — pure-Python blob rewrite, no runtime C++ change.
  • Each run() mirrors the tensor through the shm: H2D copy-in before the task, D2H copy-out after the run drains.
  • submit_next_level validates host-tensor visibility and raises an actionable error for an unregistered post-fork tensor. Fork-inherited vs post-fork is decided by the buffer's backing inode plus a fork-time VA snapshot, so a fresh post-fork torch.empty (its own mmap) and a post-fork share_memory_ tensor (new inode, even if it reused a freed VA) are both rejected.

Sub-views of a registered buffer resolve automatically; views overrunning the registered size raise.
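
The sub-view rule above can be sketched as a sorted range lookup. This is a minimal illustration, not the PR's code: `Registered`, `RangeTable`, and `translate` are hypothetical names, and the sketch assumes registered parent ranges never overlap.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Registered:
    """Hypothetical record for one registered buffer."""
    parent_addr: int  # start VA of the buffer in the parent
    nbytes: int       # registered size in bytes
    child_addr: int   # start VA of the child's own mapping of the shm


class RangeTable:
    """Hypothetical table of non-overlapping parent ranges, sorted by start."""

    def __init__(self):
        self._starts = []   # sorted parent start addresses
        self._entries = []  # entries in the same order as _starts

    def add(self, entry):
        i = bisect.bisect_left(self._starts, entry.parent_addr)
        self._starts.insert(i, entry.parent_addr)
        self._entries.insert(i, entry)

    def translate(self, addr, nbytes):
        """Map a parent VA range onto the child mapping, or raise."""
        # Rightmost registered start that is <= addr.
        i = bisect.bisect_right(self._starts, addr) - 1
        if i < 0:
            raise KeyError(f"address {addr:#x} is not in a registered buffer")
        entry = self._entries[i]
        end = entry.parent_addr + entry.nbytes
        if addr >= end:
            raise KeyError(f"address {addr:#x} is not in a registered buffer")
        if addr + nbytes > end:
            raise ValueError("view overruns the registered size")
        return entry.child_addr + (addr - entry.parent_addr)
```

A view that starts inside a registered range keeps its offset in the child mapping; a view that starts inside but extends past the registered size raises instead of being silently truncated.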

Per the issue decision this delivers B + C + D and defers A (transparent auto-mapping, which would have to guess child visibility from a raw data_ptr and risk silent corruption when torch recycles the pointer).

Scope / limits (documented in comm-domain.md)

  • memcpy, not zero-copy (a registered buffer is a separate shm); true zero-copy is a later optimization.
  • orch.copy_to (the explicit low-level staging path) is out of scope — its src must still be fork-inherited.
  • A small non-shared post-fork tensor sub-sliced out of a fork-time heap arena can slip the check (anonymous, inside a fork range); always share_memory_ or register host tensors for chip dispatch.
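
To make the "memcpy, not zero-copy" limit concrete, here is a rough single-process sketch of mirroring data through a separate named shared-memory segment. It uses only the standard-library `multiprocessing.shared_memory` module; `mirror_roundtrip` is a hypothetical helper, and the second attachment merely stands in for a chip child's mapping.

```python
from multiprocessing import shared_memory


def mirror_roundtrip(payload: bytes) -> bytes:
    """Copy payload into a named shm, mutate it via a second mapping, copy it back.

    Hypothetical helper; payload must be non-empty because a zero-size
    segment cannot be created.
    """
    n = len(payload)
    parent = shared_memory.SharedMemory(create=True, size=n)
    try:
        # Copy-in: the source bytes are copied into the shm, not shared in place.
        parent.buf[:n] = payload
        # Attach by name; this stands in for the mapping a child would hold.
        peer = shared_memory.SharedMemory(name=parent.name)
        try:
            for i in range(n):
                # The "task" writes through its own mapping of the same segment.
                peer.buf[i] = (peer.buf[i] + 1) % 256
        finally:
            peer.close()
        # Copy-out: results are copied from the shm back to the caller.
        return bytes(parent.buf[:n])
    finally:
        parent.close()
        parent.unlink()
```

Each direction costs one copy of the buffer, which is the overhead a later zero-copy optimization would remove.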

Files

  • python/simpler/worker.py, python/simpler/orchestrator.py — implementation (Python only)
  • docs/comm-domain.md — lifetime/visibility contract
  • tests/st/a2a3/tensormap_and_ringbuffer/test_l3_host_buffer_registration.py — one a2a3 scene test covering the mechanism (B) and the error path (C, both shared and anonymous post-fork memory)

Testing

  • a2a3sim and a2a3 onboard (via task-submit): the scene test passes.
  • pre-commit (ruff / pyright / markdownlint / english-only / headers) clean.

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements host-buffer registration to resolve host tensor visibility issues for post-fork chip children (issue #1027), allowing post-fork host tensors to be mapped via shared memory and validated during orchestrator submission. The review feedback suggests securing the host buffer registry operations with self._registry_lock to prevent race conditions in concurrent environments, and simplifying the parsing of /proc/self/maps to make it more robust and readable.
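
As one possible shape for a simpler `/proc/self/maps` parser, the sketch below reads only the address range and inode fields and skips anything malformed. It also shows an interval check against a fork-time snapshot in the spirit of the PR description. `parse_maps`, `find_inode`, and `is_fork_inherited` are hypothetical names, not functions from this PR, and the check shares the limitation the PR documents for anonymous memory (inode 0) inside a fork-time range.

```python
def parse_maps(text):
    """Return (lo, hi, inode) for each well-formed line of maps-style text."""
    ranges = []
    for line in text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(maxsplit=5)
        if len(fields) < 5:
            continue  # malformed or truncated line
        try:
            lo_s, hi_s = fields[0].split("-")
            ranges.append((int(lo_s, 16), int(hi_s, 16), int(fields[4])))
        except ValueError:
            continue  # non-numeric address or inode
    return ranges


def find_inode(ranges, addr, nbytes):
    """Inode of the single mapping covering [addr, addr + nbytes), else None."""
    for lo, hi, inode in ranges:
        if lo <= addr and addr + nbytes <= hi:
            return inode
    return None


def is_fork_inherited(snapshot, current, addr, nbytes):
    """True if the interval is covered now and at fork time by the same inode."""
    now = find_inode(current, addr, nbytes)
    return now is not None and now == find_inode(snapshot, addr, nbytes)
```

In use, `snapshot` would come from parsing the maps text captured just before the fork and `current` from parsing it at submit time; a reused VA backed by a new inode then fails the comparison.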

Comment thread python/simpler/worker.py Outdated
Comment thread python/simpler/worker.py Outdated
@coderabbitai

coderabbitai Bot commented Jun 29, 2026


Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e8e27210-db66-4500-a2e8-9e6e1ea19c2b

📝 Walkthrough

Adds register_host_buffer / unregister_host_buffer APIs to Worker so host tensors created after chip children are forked can be used safely in worker.run(). Includes shared-memory staging, child-side VA translation and blob pointer rewriting, orchestrator wiring, unit and integration tests, and updated documentation.

Changes

Post-fork Host Buffer Registration

| Layer / File(s) | Summary |
| --- | --- |
| Constants, types, and parent-side init state (python/simpler/worker.py) | Adds bisect, TensorArgType imports, _CTRL_MAP_HOST/_CTRL_UNMAP_HOST wire constants, blob layout structs, _HostBufEntry, HostBufferHandle, _host_nbytes, and Worker.__init__ state fields for the registration control plane. |
| Chip child: control handlers, blob rewriting, and teardown (python/simpler/worker.py) | Adds host_buf_table/host_buf_ranges to the chip loop, _handle_ctrl_map_host/_handle_ctrl_unmap_host, blob host-pointer rewriting before run_prepared_from_blob, dispatch for new control commands, teardown cleanup, and refactored _read_ctrl_staged_shm_name usage. |
| Parent-side registration API, fork snapshot, and run lifecycle (python/simpler/worker.py) | Adds fork-time /proc/self/maps snapshot, register_host_buffer/unregister_host_buffer/_release_all_host_buffers, _stage_host_buffers_for_chip_submit, _find_host_buf_entry, _flush_host_buffer_copyback, per-run state reset, output copyback on success, and close() release. |
| Orchestrator staging call wiring (python/simpler/orchestrator.py) | submit_next_level and submit_next_level_group call _stage_host_buffers_for_chip_submit for LOCAL_CHIP targets when a Worker is bound. |
| Tests and documentation (tests/ut/py/test_worker/test_host_buffer_registration.py, tests/st/a2a3/tensormap_and_ringbuffer/test_l3_host_buffer_registration.py, docs/comm-domain.md) | Unit tests cover error-path rejection and _read_self_maps degradation; L3 integration test covers happy-path post-fork registration; documentation adds host-tensor visibility rules and scope/limits section. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 A new tunnel dug beneath the fork's divide,
Shared memory mapped on the child's inside.
Pointers rewritten, blobs set aright,
Post-fork tensors finally shine bright.
Register once, then run all day—
No stale COW data along the way! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 65.12%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The description is brief but directly describes the feature and its purpose, so it is on-topic. |
| Linked Issues check | ✅ Passed | The implementation matches #1027 by adding registration APIs, validation, docs, and tests for post-fork host tensors. |
| Out of Scope Changes check | ✅ Passed | The modified files are all directly tied to the host-buffer registration feature and its documentation or tests. |
| Title check | ✅ Passed | The title is concise and accurately highlights the main change: L3 post-fork host-buffer registration in runtime. |

@doraemonmj doraemonmj force-pushed the l3-host-buffer-registration branch 2 times, most recently from 7ba711f to b0c39d2 Compare June 29, 2026 11:05

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (1)
tests/st/a2a3/tensormap_and_ringbuffer/test_l3_host_buffer_registration.py (1)

120-126: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use plain post-fork tensors for the happy-path coverage.

These buffers are still moved to .share_memory_() after the fork, so this scene only proves the file-backed source case. The new API is supposed to make any post-fork host tensor usable by copying through the registration shm, so switching a, b, and out to ordinary torch.full / torch.zeros would cover the common dynamic-shape path end-to-end.

Proposed test change
-        a = torch.full((SIZE,), 5.0, dtype=torch.float32).share_memory_()
-        b = torch.full((SIZE,), 7.0, dtype=torch.float32).share_memory_()
-        out = torch.zeros(SIZE, dtype=torch.float32).share_memory_()
+        a = torch.full((SIZE,), 5.0, dtype=torch.float32)
+        b = torch.full((SIZE,), 7.0, dtype=torch.float32)
+        out = torch.zeros(SIZE, dtype=torch.float32)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/st/a2a3/tensormap_and_ringbuffer/test_l3_host_buffer_registration.py`
around lines 120 - 126, The happy-path test in the host buffer registration flow
is still using .share_memory_() tensors, so it only covers the file-backed case
instead of the intended post-fork tensor registration path. Update the setup in
test_l3_host_buffer_registration to create a, b, and out as plain
torch.full/torch.zeros tensors, then pass them through
worker.register_host_buffer so the test exercises the registration shm copy
behavior end-to-end.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/comm-domain.md`:
- Around line 153-160: The host-tensor visibility guidance in the shared-memory
section is too strict and confusing: it should state that the cutoff is the
first chip fork, not Worker.init(), and that the original tensor storage only
needs to be child-visible for the fork-inherited path. Update the wording around
worker.run(...), orch.submit_next_level(...), and orch.copy_to so the table
clearly distinguishes fork-inherited from registered post-fork, and make sure
the registered path is described as working via the buffer registration shim
rather than requiring the tensor itself to be shared before Worker.init().

In `@python/simpler/orchestrator.py`:
- Around line 193-194: The host-buffer staging path in
orchestrator._stage_host_buffers_for_chip_submit is mutating
self._worker._pending_host_copyback before all validation is complete, so
partial failures can leave stale copybacks queued. Update the LOCAL_CHIP submit
flow in orchestrator._submit_chip_group (and the related call site around the
c_args loop) to make staging transactional: snapshot or isolate
_pending_host_copyback before calling _stage_host_buffers_for_chip_submit, and
restore/discard it if any exception is raised before submit completes. Ensure
the rollback covers both the per-c_args staging and the outer group loop so a
failed validation never leaves prior D2H copybacks behind.

In `@python/simpler/worker.py`:
- Around line 3398-3407: The fork snapshot logic in the worker’s chip-path only
records inode membership and captures it too late, so
`dw.init()`/prewarm-created mappings can leak into the fork state and later
accept post-fork remaps. Update the fork snapshot around `_read_self_maps()` in
the chip-fork path to preserve the full `(lo, hi, inode)` ranges captured
immediately before the fork, and then change the validation at the `addr`/tensor
size check to require the requested `[addr, addr + tensor_nbytes)` interval to
be covered by a matching fork-captured range with the same inode, not just a
matching inode set.
- Around line 4190-4194: `unregister_host_buffer()` and
`_release_all_host_buffers()` should always free the parent shared memory even
if `_broadcast_host_unmap()` raises. Wrap the broadcast and per-entry cleanup in
the same best-effort try/except/finally pattern used elsewhere so
`entry.shm.close()` and `entry.shm.unlink()` still run for each entry, and
ensure `self._worker`/`_hierarchical_started` handling in
`python/simpler/worker.py` does not abort cleanup early.
- Around line 385-390: `unregister_host_buffer()` currently removes entries
using only `handle.data_ptr`, so an old `HostBufferHandle` can accidentally
unregister a newer registration at the same pointer. Update the `Worker`
host-buffer registry logic to validate the full handle identity before deleting,
using `HostBufferHandle.token` as part of the lookup/match, and add an owner
identifier if needed so handles from one `Worker` cannot be used to unregister
another worker’s buffer.
- Around line 4268-4271: The per-submit staging in the buffer copy-in path can
overwrite a newer same-run output with stale parent data, so update the logic
around the submit staging block in worker.py to avoid re-copying overlapping
registered buffers after a producer has already written them. Track the
output/INOUT ranges staged for the current run inside the submit/dispatch flow
that calls submit_next_level, and skip the ctypes.memmove copy-in for later
INPUT overlaps, or refactor registered-buffer copy-in to a run-level
pre-execution phase before tasks can run.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8571f588-69b2-4469-92f4-973ecd6ca07b

📥 Commits

Reviewing files that changed from the base of the PR and between b6091ee and b0c39d2.

📒 Files selected for processing (5)
  • docs/comm-domain.md
  • python/simpler/orchestrator.py
  • python/simpler/worker.py
  • tests/st/a2a3/tensormap_and_ringbuffer/test_l3_host_buffer_registration.py
  • tests/ut/py/test_worker/test_host_buffer_registration.py

Comment thread docs/comm-domain.md Outdated
Comment thread python/simpler/orchestrator.py Outdated
Comment thread python/simpler/worker.py
Comment thread python/simpler/worker.py Outdated
Comment thread python/simpler/worker.py Outdated
Comment thread python/simpler/worker.py Outdated
@doraemonmj doraemonmj force-pushed the l3-host-buffer-registration branch from 97e9cb1 to b0c39d2 Compare June 29, 2026 11:15
@doraemonmj doraemonmj changed the title feat(runtime): L3 post-fork host-buffer registration (#1027) feat(runtime): L3 post-fork host-buffer registration Jun 30, 2026
@doraemonmj doraemonmj force-pushed the l3-host-buffer-registration branch 2 times, most recently from ef8e301 to 8426d0d Compare June 30, 2026 11:05
Host tensors created after the L3 chip children are forked (lazily on the
first run) were invisible to those children, forcing serving to preallocate
all buffers before worker creation. Add Worker.register_host_buffer /
unregister_host_buffer: a named shm is mapped into every chip child and the
per-task mailbox blob's host pointers are rewritten to the child's own mapping
before the runtime dereferences them, so a later run can H2D/D2H through it.
Pure Python (worker.py / orchestrator.py) — no runtime C++ change.

submit_next_level validates host-tensor visibility and raises an actionable
error for an unregistered post-fork tensor (fork-inherited vs post-fork is
decided by backing inode + fork VA snapshot). Lifetime/visibility contract
documented in comm-domain.md; one a2a3 scene test covers the mechanism and the
error path.
@doraemonmj doraemonmj force-pushed the l3-host-buffer-registration branch from 8426d0d to 2e85364 Compare June 30, 2026 12:16
Development

Successfully merging this pull request may close these issues.

L3 worker cannot see host tensors created after worker startup

1 participant