feat(runtime): L3 post-fork host-buffer registration#1190
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements host-buffer registration to resolve host tensor visibility issues for post-fork chip children (issue #1027), allowing post-fork host tensors to be mapped via shared memory and validated during orchestrator submission. The review feedback suggests securing the host buffer registry operations with self._registry_lock to prevent race conditions in concurrent environments, and simplifying the parsing of /proc/self/maps to make it more robust and readable.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughAdds ChangesPost-fork Host Buffer Registration
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
7ba711f to
b0c39d2
Compare
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (1)
tests/st/a2a3/tensormap_and_ringbuffer/test_l3_host_buffer_registration.py (1)
120-126: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse plain post-fork tensors for the happy-path coverage.
These buffers are still moved to
.share_memory_()after the fork, so this scene only proves the file-backed source case. The new API is supposed to make any post-fork host tensor usable by copying through the registration shm, so switchinga,b, andoutto ordinarytorch.full/torch.zeroswould cover the common dynamic-shape path end-to-end.Proposed test change
- a = torch.full((SIZE,), 5.0, dtype=torch.float32).share_memory_() - b = torch.full((SIZE,), 7.0, dtype=torch.float32).share_memory_() - out = torch.zeros(SIZE, dtype=torch.float32).share_memory_() + a = torch.full((SIZE,), 5.0, dtype=torch.float32) + b = torch.full((SIZE,), 7.0, dtype=torch.float32) + out = torch.zeros(SIZE, dtype=torch.float32)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/st/a2a3/tensormap_and_ringbuffer/test_l3_host_buffer_registration.py` around lines 120 - 126, The happy-path test in the host buffer registration flow is still using .share_memory_() tensors, so it only covers the file-backed case instead of the intended post-fork tensor registration path. Update the setup in test_l3_host_buffer_registration to create a, b, and out as plain torch.full/torch.zeros tensors, then pass them through worker.register_host_buffer so the test exercises the registration shm copy behavior end-to-end.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/comm-domain.md`:
- Around line 153-160: The host-tensor visibility guidance in the shared-memory
section is too strict and confusing: it should state that the cutoff is the
first chip fork, not Worker.init(), and that the original tensor storage only
needs to be child-visible for the fork-inherited path. Update the wording around
worker.run(...), orch.submit_next_level(...), and orch.copy_to so the table
clearly distinguishes fork-inherited from registered post-fork, and make sure
the registered path is described as working via the buffer registration shim
rather than requiring the tensor itself to be shared before Worker.init().
In `@python/simpler/orchestrator.py`:
- Around line 193-194: The host-buffer staging path in
orchestrator._stage_host_buffers_for_chip_submit is mutating
self._worker._pending_host_copyback before all validation is complete, so
partial failures can leave stale copybacks queued. Update the LOCAL_CHIP submit
flow in orchestrator._submit_chip_group (and the related call site around the
c_args loop) to make staging transactional: snapshot or isolate
_pending_host_copyback before calling _stage_host_buffers_for_chip_submit, and
restore/discard it if any exception is raised before submit completes. Ensure
the rollback covers both the per-c_args staging and the outer group loop so a
failed validation never leaves prior D2H copybacks behind.
In `@python/simpler/worker.py`:
- Around line 3398-3407: The fork snapshot logic in the worker’s chip-path only
records inode membership and captures it too late, so
`dw.init()`/prewarm-created mappings can leak into the fork state and later
accept post-fork remaps. Update the fork snapshot around `_read_self_maps()` in
the chip-fork path to preserve the full `(lo, hi, inode)` ranges captured
immediately before the fork, and then change the validation at the `addr`/tensor
size check to require the requested `[addr, addr + tensor_nbytes)` interval to
be covered by a matching fork-captured range with the same inode, not just a
matching inode set.
- Around line 4190-4194: `unregister_host_buffer()` and
`_release_all_host_buffers()` should always free the parent shared memory even
if `_broadcast_host_unmap()` raises. Wrap the broadcast and per-entry cleanup in
the same best-effort try/except/finally pattern used elsewhere so
`entry.shm.close()` and `entry.shm.unlink()` still run for each entry, and
ensure `self._worker`/`_hierarchical_started` handling in
`python/simpler/worker.py` does not abort cleanup early.
- Around line 385-390: `unregister_host_buffer()` currently removes entries
using only `handle.data_ptr`, so an old `HostBufferHandle` can accidentally
unregister a newer registration at the same pointer. Update the `Worker`
host-buffer registry logic to validate the full handle identity before deleting,
using `HostBufferHandle.token` as part of the lookup/match, and add an owner
identifier if needed so handles from one `Worker` cannot be used to unregister
another worker’s buffer.
- Around line 4268-4271: The per-submit staging in the buffer copy-in path can
overwrite a newer same-run output with stale parent data, so update the logic
around the submit staging block in worker.py to avoid re-copying overlapping
registered buffers after a producer has already written them. Track the
output/INOUT ranges staged for the current run inside the submit/dispatch flow
that calls submit_next_level, and skip the ctypes.memmove copy-in for later
INPUT overlaps, or refactor registered-buffer copy-in to a run-level
pre-execution phase before tasks can run.
---
Nitpick comments:
In `@tests/st/a2a3/tensormap_and_ringbuffer/test_l3_host_buffer_registration.py`:
- Around line 120-126: The happy-path test in the host buffer registration flow
is still using .share_memory_() tensors, so it only covers the file-backed case
instead of the intended post-fork tensor registration path. Update the setup in
test_l3_host_buffer_registration to create a, b, and out as plain
torch.full/torch.zeros tensors, then pass them through
worker.register_host_buffer so the test exercises the registration shm copy
behavior end-to-end.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8571f588-69b2-4469-92f4-973ecd6ca07b
📒 Files selected for processing (5)
docs/comm-domain.mdpython/simpler/orchestrator.pypython/simpler/worker.pytests/st/a2a3/tensormap_and_ringbuffer/test_l3_host_buffer_registration.pytests/ut/py/test_worker/test_host_buffer_registration.py
97e9cb1 to
b0c39d2
Compare
ef8e301 to
8426d0d
Compare
) Host tensors created after the L3 chip children are forked (lazily on the first run) were invisible to those children, forcing serving to preallocate all buffers before worker creation. Add Worker.register_host_buffer / unregister_host_buffer: a named shm is mapped into every chip child and the per-task mailbox blob's host pointers are rewritten to the child's own mapping before the runtime dereferences them, so a later run can H2D/D2H through it. Pure Python (worker.py / orchestrator.py) — no runtime C++ change. submit_next_level validates host-tensor visibility and raises an actionable error for an unregistered post-fork tensor (fork-inherited vs post-fork is decided by backing inode + fork VA snapshot). Lifetime/visibility contract documented in comm-domain.md; one a2a3 scene test covers the mechanism and the error path.
8426d0d to
2e85364
Compare
Closes #1027.
Problem
L3 chip children are forked lazily on the first
Worker.run(). A host tensor created after that — the natural dynamic-shape serving pattern — is not in the children's address space: the orch fn runs in the parent and the per-task args carry a raw parent VA that is unmapped (or stale) in the child. Today serving must preallocate every input/output buffer (at max shape) before the worker is created.Change
Add
Worker.register_host_buffer(tensor)/unregister_host_buffer(handle):_CTRL_PY_REGISTER).Tensor.buffer.addr) to its own mapping — pure-Python blob rewrite, no runtime C++ change.run()mirrors the tensor through the shm: H2D copy-in before the task, D2H copy-out after the run drains.submit_next_levelvalidates host-tensor visibility and raises an actionable error for an unregistered post-fork tensor. Fork-inherited vs post-fork is decided by the buffer's backing inode plus a fork-time VA snapshot, so a fresh post-forktorch.empty(its own mmap) and a post-forkshare_memory_tensor (new inode, even if it reused a freed VA) are both rejected.Sub-views of a registered buffer resolve automatically; views overrunning the registered size raise.
Per the issue decision this delivers B + C + D and defers A (transparent auto-mapping, which would have to guess child visibility from a raw
data_ptrand risk silent corruption when torch recycles the pointer).Scope / limits (documented in
comm-domain.md)orch.copy_to(the explicit low-level staging path) is out of scope — itssrcmust still be fork-inherited.share_memory_or register host tensors for chip dispatch.Files
python/simpler/worker.py,python/simpler/orchestrator.py— implementation (Python only)docs/comm-domain.md— lifetime/visibility contracttests/st/a2a3/tensormap_and_ringbuffer/test_l3_host_buffer_registration.py— one a2a3 scene test covering the mechanism (B) and the error path (C, both shared and anonymous post-fork memory)Testing