Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
96 changes: 95 additions & 1 deletion docs/comm-domain.md
Original file line number Diff line number Diff line change
Expand Up @@ -148,7 +148,101 @@ with orch.allocate_domain(...) as handle:

---

## 6. Examples
## 6. Host tensor visibility for `worker.run`

A host tensor passed to `worker.run(...)` / `orch.submit_next_level(...)` is
ultimately dereferenced from the forked chip child, not the parent. For the
design rationale behind the choices below (copy vs zero-copy, explicit
registration, the procfs/inode classification, and the alternatives that were
deferred) see [`host-buffer-registration-design.md`](host-buffer-registration-design.md).
Either the
tensor's own storage must already be reachable there (fork-inherited), or
`worker.register_host_buffer(...)` must provide a child-visible shm mirror — in
the registered case the tensor storage itself need not be child-visible. Three
sources are legal:

| Source | How | Why it works |
| ------ | --- | ------------ |
| **fork-inherited** | `tensor.share_memory_()` **before the chip children are forked** (i.e. before the first `Worker.run()`) | the child inherits the MAP_SHARED page at fork |
| **registered post-fork** | `worker.register_host_buffer(tensor)` after the chips exist | maps a shm into every child for the buffer's lifetime |
| anything else | — | raises an actionable error before dispatch |

The chip children are forked lazily on the **first** `run()`. A host tensor
created after that — the natural dynamic-shape serving pattern — is invisible to
the children unless registered:

```python
worker = Worker(level=3, ...); worker.register(chip); worker.init()
worker.run(orch0, ...) # forks the chips

hidden = torch.empty((tokens, hidden_size)).share_memory_() # created post-fork
out = torch.empty((batch, vocab)).share_memory_()
h_hidden = worker.register_host_buffer(hidden) # map into every chip child
h_out = worker.register_host_buffer(out)
try:
for batch in batches:
fill(hidden); worker.run(orch, ...) # H2D copy-in, D2H copy-out per run
use(out)
finally:
worker.unregister_host_buffer(h_hidden)
worker.unregister_host_buffer(h_out)
```

**Register once, reuse many runs.** Registration maps a shm into each child and
keeps it mapped; every `run` mirrors the tensor through it (H2D copy-in before
the task, D2H copy-out after the run drains). Register the buffer — or a
fixed-size superset you sub-slice — once and reuse it; re-registering per run
pays a map/unmap broadcast each time. A sub-view (slice) of a registered buffer
is resolved automatically.

**Error path.** An unregistered post-fork host tensor raises before any dispatch:

> Host tensor 0x… is not visible to the L3 chip child (created after fork, not
> registered). Call worker.register_host_buffer(tensor) before run(), or allocate
> it with .share_memory_() before init().

### Scope / limits (v1)

- **memcpy, not zero-copy.** A registered buffer is a *separate* shm; each run
copies `tensor → shm` (in) and `shm → tensor` (out). For a large hot-path
tensor this is a double copy. True zero-copy (mapping the tensor's own storage)
is a later optimization.
- **`orch.copy_to` is the unmanaged low-level path.** Registration covers the
`run` / `submit_next_level` host-tensor args (the post-fork host-tensor
scenario). The
explicit `orch.copy_to(src=tensor.data_ptr())` staging path (§5) is *not*
translated or validated by `register_host_buffer` — its `src` must still be
fork-inherited (`.share_memory_()` before the chip children are forked, i.e.
before the first `run()`).
- **Anonymous post-fork heap.** A *large* post-fork `torch.empty` (its own mmap)
is correctly rejected. A *small* non-shared tensor the allocator sub-slices out
of a fork-time heap arena can slip past the check (anonymous, inside a fork
range) and read stale data in the child — always `share_memory_` or register
host tensors used for chip dispatch.
- **Non-procfs platforms (e.g. macOS).** The reachability check reads
`/proc/self/maps`; where it is unavailable the fork snapshot is empty, so an
unregistered host tensor cannot be classified and is **passed through
unvalidated** (rather than rejected) — a fork-inherited tensor is the common
legitimate case and must keep working. The first such pass-through emits a
one-time `UserWarning`; the caller is then responsible for ensuring the tensor
is fork-inherited or registered. Error path C above is enforced only where
procfs exists (Linux, including onboard).
- **No in-run produce-then-consume of the same registered buffer.** Copy-in
(`tensor → shm`) runs per `submit_next_level`, while tasks may already be in
flight before `drain()`. If one task writes a registered buffer and a later
dependent task in the *same* `run` reads it, the consumer's copy-in can overwrite
the producer's result with the stale parent tensor. Use a registered buffer as a
run input *or* a run output, not as an intermediate handed between tasks within
one run; chain results through device buffers instead.
- **Fork-inherited anonymous memory is copy-on-write, hence stale.** Even a tensor
the child legitimately inherited is only useful as a *live* input if it is
MAP_SHARED: anonymous (non-`share_memory_`) pages are COW, so writes the parent
makes *after* fork do not reach the child. A live input must be file-backed
(`.share_memory_()` before `init()`) or registered.

---

## 7. Examples

- `examples/workers/l3/allreduce_distributed/` — single domain, PTO-ISA remote
reads over the window.
Expand Down
25 changes: 25 additions & 0 deletions python/simpler/orchestrator.py
Original file line number Diff line number Diff line change
Expand Up @@ -187,6 +187,19 @@ def submit_next_level(
raise TypeError("RemoteTensorRef is only supported for RemoteCallable NEXT_LEVEL submits")
remote_sidecar = None
_validate_remote_sidecar_access(c_args, remote_sidecar)
# Mirror any registered post-fork host buffers into their shm (H2D) and
# reject host tensors the chip child cannot see (issue #1027). Only the
# LOCAL_CHIP path dereferences raw host pointers in the forked child.
if target_namespace == "LOCAL_CHIP" and self._worker is not None:
staged_len = len(self._worker._pending_host_copyback)
try:
self._worker._stage_host_buffers_for_chip_submit(c_args)
except BaseException:
# Staging appends D2H copybacks as it walks; if it aborts the
# submit never happens, so drop the copybacks it queued or a later
# clean drain would flush stale shm into those outputs.
del self._worker._pending_host_copyback[staged_len:]
raise
final_worker_ids = _remote_data_eligible_worker_ids(remote_sidecar, eligible_worker_ids)
cpp_worker_id = int(worker)
captured_refs = self._worker._capture_remote_sidecar_refs(remote_sidecar) if self._worker is not None else []
Expand Down Expand Up @@ -249,6 +262,18 @@ def submit_next_level_group(
if remote_sidecars is not None:
for c_args, remote_sidecar in zip(c_args_list, remote_sidecars):
_validate_remote_sidecar_access(c_args, remote_sidecar)
# Stage + validate registered/post-fork host buffers for chip dispatch
# (issue #1027), same as the single submit path.
if target_namespace == "LOCAL_CHIP" and self._worker is not None:
staged_len = len(self._worker._pending_host_copyback)
try:
for c_args in c_args_list:
self._worker._stage_host_buffers_for_chip_submit(c_args)
except BaseException:
# Roll back copybacks queued for every c_args staged so far; the
# group submit never happens, so none of them must survive.
del self._worker._pending_host_copyback[staged_len:]
raise
worker_id_sets = (
[
_remote_data_eligible_worker_ids(remote_sidecar, eligible_worker_ids)
Expand Down
Loading
Loading