hw-native-sys · doraemonmj · Jun 29, 2026
diff --git a/docs/comm-domain.md b/docs/comm-domain.md
@@ -148,7 +148,101 @@ with orch.allocate_domain(...) as handle:
 
 ---
 
-## 6. Examples
+## 6. Host tensor visibility for `worker.run`
+
+A host tensor passed to `worker.run(...)` / `orch.submit_next_level(...)` is
+ultimately dereferenced from the forked chip child, not the parent. For the
+design rationale behind the choices below (copy vs zero-copy, explicit
+registration, the procfs/inode classification, and the alternatives that were
+deferred) see [`host-buffer-registration-design.md`](host-buffer-registration-design.md).
+Either the
+tensor's own storage must already be reachable there (fork-inherited), or
+`worker.register_host_buffer(...)` must provide a child-visible shm mirror — in
+the registered case the tensor storage itself need not be child-visible. Three
+sources are legal:
+
+| Source | How | Why it works |
+| ------ | --- | ------------ |
+| **fork-inherited** | `tensor.share_memory_()` **before the chip children are forked** (i.e. before the first `Worker.run()`) | the child inherits the MAP_SHARED page at fork |
+| **registered post-fork** | `worker.register_host_buffer(tensor)` after the chips exist | maps a shm into every child for the buffer's lifetime |
+| anything else | — | raises an actionable error before dispatch |
+
+The chip children are forked lazily on the **first** `run()`. A host tensor
+created after that — the natural dynamic-shape serving pattern — is invisible to
+the children unless registered:
+
+```python
+worker = Worker(level=3, ...); worker.register(chip); worker.init()
+worker.run(orch0, ...)                          # forks the chips
+
+hidden = torch.empty((tokens, hidden_size)).share_memory_()   # created post-fork
+out    = torch.empty((batch, vocab)).share_memory_()
+h_hidden = worker.register_host_buffer(hidden)  # map into every chip child
+h_out    = worker.register_host_buffer(out)
+try:
+    for batch in batches:
+        fill(hidden); worker.run(orch, ...)     # H2D copy-in, D2H copy-out per run
+        use(out)
+finally:
+    worker.unregister_host_buffer(h_hidden)
+    worker.unregister_host_buffer(h_out)
+```
+
+**Register once, reuse many runs.** Registration maps a shm into each child and
+keeps it mapped; every `run` mirrors the tensor through it (H2D copy-in before
+the task, D2H copy-out after the run drains). Register the buffer — or a
+fixed-size superset you sub-slice — once and reuse it; re-registering per run
+pays a map/unmap broadcast each time. A sub-view (slice) of a registered buffer
+is resolved automatically.
+
+**Error path.** An unregistered post-fork host tensor raises before any dispatch:
+
+> Host tensor 0x… is not visible to the L3 chip child (created after fork, not
+> registered). Call worker.register_host_buffer(tensor) before run(), or allocate
+> it with .share_memory_() before init().
+
+### Scope / limits (v1)
+
+- **memcpy, not zero-copy.** A registered buffer is a *separate* shm; each run
+  copies `tensor → shm` (in) and `shm → tensor` (out). For a large hot-path
+  tensor this is a double copy. True zero-copy (mapping the tensor's own storage)
+  is a later optimization.
+- **`orch.copy_to` is the unmanaged low-level path.** Registration covers the
+  `run` / `submit_next_level` host-tensor args (the post-fork host-tensor
+  scenario). The
+  explicit `orch.copy_to(src=tensor.data_ptr())` staging path (§5) is *not*
+  translated or validated by `register_host_buffer` — its `src` must still be
+  fork-inherited (`.share_memory_()` before the chip children are forked, i.e.
+  before the first `run()`).
+- **Anonymous post-fork heap.** A *large* post-fork `torch.empty` (its own mmap)
+  is correctly rejected. A *small* non-shared tensor the allocator sub-slices out
+  of a fork-time heap arena can slip past the check (anonymous, inside a fork
+  range) and read stale data in the child — always `share_memory_` or register
+  host tensors used for chip dispatch.
+- **Non-procfs platforms (e.g. macOS).** The reachability check reads
+  `/proc/self/maps`; where it is unavailable the fork snapshot is empty, so an
+  unregistered host tensor cannot be classified and is **passed through
+  unvalidated** (rather than rejected) — a fork-inherited tensor is the common
+  legitimate case and must keep working. The first such pass-through emits a
+  one-time `UserWarning`; the caller is then responsible for ensuring the tensor
+  is fork-inherited or registered. Error path C above is enforced only where
+  procfs exists (Linux, including onboard).
+- **No in-run produce-then-consume of the same registered buffer.** Copy-in
+  (`tensor → shm`) runs per `submit_next_level`, while tasks may already be in
+  flight before `drain()`. If one task writes a registered buffer and a later
+  dependent task in the *same* `run` reads it, the consumer's copy-in can overwrite
+  the producer's result with the stale parent tensor. Use a registered buffer as a
+  run input *or* a run output, not as an intermediate handed between tasks within
+  one run; chain results through device buffers instead.
+- **Fork-inherited anonymous memory is copy-on-write, hence stale.** Even a tensor
+  the child legitimately inherited is only useful as a *live* input if it is
+  MAP_SHARED: anonymous (non-`share_memory_`) pages are COW, so writes the parent
+  makes *after* fork do not reach the child. A live input must be file-backed
+  (`.share_memory_()` before `init()`) or registered.
+
+---
+
+## 7. Examples
 
 - `examples/workers/l3/allreduce_distributed/` — single domain, PTO-ISA remote
   reads over the window.

diff --git a/python/simpler/orchestrator.py b/python/simpler/orchestrator.py
@@ -187,6 +187,19 @@ def submit_next_level(
                 raise TypeError("RemoteTensorRef is only supported for RemoteCallable NEXT_LEVEL submits")
             remote_sidecar = None
         _validate_remote_sidecar_access(c_args, remote_sidecar)
+        # Mirror any registered post-fork host buffers into their shm (H2D) and
+        # reject host tensors the chip child cannot see (issue #1027). Only the
+        # LOCAL_CHIP path dereferences raw host pointers in the forked child.
+        if target_namespace == "LOCAL_CHIP" and self._worker is not None:
+            staged_len = len(self._worker._pending_host_copyback)
+            try:
+                self._worker._stage_host_buffers_for_chip_submit(c_args)
+            except BaseException:
+                # Staging appends D2H copybacks as it walks; if it aborts the
+                # submit never happens, so drop the copybacks it queued or a later
+                # clean drain would flush stale shm into those outputs.
+                del self._worker._pending_host_copyback[staged_len:]
+                raise
         final_worker_ids = _remote_data_eligible_worker_ids(remote_sidecar, eligible_worker_ids)
         cpp_worker_id = int(worker)
         captured_refs = self._worker._capture_remote_sidecar_refs(remote_sidecar) if self._worker is not None else []
@@ -249,6 +262,18 @@ def submit_next_level_group(
         if remote_sidecars is not None:
             for c_args, remote_sidecar in zip(c_args_list, remote_sidecars):
                 _validate_remote_sidecar_access(c_args, remote_sidecar)
+        # Stage + validate registered/post-fork host buffers for chip dispatch
+        # (issue #1027), same as the single submit path.
+        if target_namespace == "LOCAL_CHIP" and self._worker is not None:
+            staged_len = len(self._worker._pending_host_copyback)
+            try:
+                for c_args in c_args_list:
+                    self._worker._stage_host_buffers_for_chip_submit(c_args)
+            except BaseException:
+                # Roll back copybacks queued for every c_args staged so far; the
+                # group submit never happens, so none of them must survive.
+                del self._worker._pending_host_copyback[staged_len:]
+                raise
         worker_id_sets = (
             [
                 _remote_data_eligible_worker_ids(remote_sidecar, eligible_worker_ids)