Category
Technical Debt (cleanup, refactor)
Component
Host Runtime
Description
Two related device-memory hygiene problems in the onboard host runner, surfaced while investigating #1019 (which turned out to be unrelated — an op-timeout window, fixed by #1035).
1. The AICPU orchestration .so is uploaded to the device twice.
upload_chip_callable_buffer() uploads the whole ChipCallable (header + storage_) into chip_callable_buffers_. storage_ starts with the orchestration .so — ChipCallable::binary_data() is the orch .so (the host itself reads it that way: runtime_maker.cpp does orch_so_binary = callable->binary_data()).
register_callable() then uploads just the orch .so bytes again into a separate pool orch_so_dedup_, and points the AICPU at it via runtime.set_dev_orch_so(addr, size).
So the same orch bytes occupy device GM twice. The orch already lives at a known offset inside the chip buffer: chip_dev + offsetof(ChipCallable, storage_), length binary_size(). orch_so_dedup_ exists only because the AICPU reads dev_orch_so (the standalone copy) and dlopens it, instead of pointing into the chip buffer.
2. chip_callable_buffers_ is never freed on unregister — only at finalize().
unregister_callable() decrements + frees orch_so_dedup_ (it has a refcount), but does not touch chip_callable_buffers_.
ChipCallableBuffer has no refcount field (unlike OrchSoBuffer), and the comment says "never freed until finalize".
- Result: in a long-lived worker (e.g. the L2
st_worker pool reuses one ChipWorker per (runtime, device) for a whole xdist session), every distinct chip-callable buffer accumulates on-device until the runner finalizes, even after its callable is unregistered. The header comment at the chip_callable_buffers_ declaration already claims "refcounted by hash", but that was specified and never implemented.
Neither is a crash today (the #1019 repro showed max-live=1 for a single callable), but both are real growth/duplication that scale with callable churn.
Location
src/common/platform/onboard/host/device_runner_base.cpp
upload_chip_callable_buffer() ~385–431 (uploads whole ChipCallable incl. orch)
register_callable() ~476–535 (separate orch .so upload → orch_so_dedup_)
prepare_orch_so() ~437–470 (set_dev_orch_so)
unregister_callable() 571–594 (frees orch_so_dedup_, not chip_callable_buffers_)
- finalize-only free of
chip_callable_buffers_ ~680–689
src/common/platform/onboard/host/device_runner_base.h
struct ChipCallableBuffer 625–628 (no refcount), map 629
struct OrchSoBuffer 654–657 (has refcount), map 660
src/common/task_interface/callable.h:117–118 — ChipCallable::binary_data()/binary_size() = orch .so
src/{a5,a2a3}/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp ~133–134 — orch_so = callable->binary_data()
src/{a5,a2a3}/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp ~269–318 — AICPU reads dev_orch_so → writes temp file → dlopen
Proposed Fix
Unify into a single refcounted device buffer:
prepare_orch_so (device-orch path): set dev_orch_so = chip_dev + offsetof(ChipCallable, storage_), size = binary_size() — point the AICPU at the orch bytes already inside the chip buffer.
- Delete
register_callable's separate orch upload + the orch_so_dedup_ map/refcount.
- Add
refcount to ChipCallableBuffer and free-on-unregister in unregister_callable() (mirror the existing orch_so_dedup_ pattern), so the unified buffer (orch + kernels) is released when its last callable unregisters.
- Start with a failing repro test: register N distinct callables → unregister → assert the chip pool is freed (today it isn't).
Must verify / preserve (the make-or-break check):
- Byte-identity:
upload_chip_callable_buffer runs patch_chip_callable_scratch_for_device (patches child resolved_addr_). Confirm that patching does not touch the orch region storage_[0 .. binary_size), so the in-buffer orch is byte-identical to what dlopen needs.
- hbg path:
host_build_graph doesn't H2D the orch (host dlopen via state.host_dlopen_handle). This change is for the device-orch (tensormap_and_ringbuffer) path only; the hbg branch must stay.
Priority
Medium (minor risk, should fix in next few releases)
Category
Technical Debt (cleanup, refactor)
Component
Host Runtime
Description
Two related device-memory hygiene problems in the onboard host runner, surfaced while investigating #1019 (which turned out to be unrelated — an op-timeout window, fixed by #1035).
1. The AICPU orchestration
.sois uploaded to the device twice.upload_chip_callable_buffer()uploads the wholeChipCallable(header +storage_) intochip_callable_buffers_.storage_starts with the orchestration.so—ChipCallable::binary_data()is the orch.so(the host itself reads it that way:runtime_maker.cppdoesorch_so_binary = callable->binary_data()).register_callable()then uploads just the orch.sobytes again into a separate poolorch_so_dedup_, and points the AICPU at it viaruntime.set_dev_orch_so(addr, size).So the same orch bytes occupy device GM twice. The orch already lives at a known offset inside the chip buffer:
chip_dev + offsetof(ChipCallable, storage_), lengthbinary_size().orch_so_dedup_exists only because the AICPU readsdev_orch_so(the standalone copy) anddlopens it, instead of pointing into the chip buffer.2.
chip_callable_buffers_is never freed on unregister — only atfinalize().unregister_callable()decrements + freesorch_so_dedup_(it has a refcount), but does not touchchip_callable_buffers_.ChipCallableBufferhas norefcountfield (unlikeOrchSoBuffer), and the comment says "never freed until finalize".st_workerpool reuses oneChipWorkerper (runtime, device) for a whole xdist session), every distinct chip-callable buffer accumulates on-device until the runner finalizes, even after its callable is unregistered. The header comment at thechip_callable_buffers_declaration already claims "refcounted by hash", but that was specified and never implemented.Neither is a crash today (the #1019 repro showed max-live=1 for a single callable), but both are real growth/duplication that scale with callable churn.
Location
src/common/platform/onboard/host/device_runner_base.cppupload_chip_callable_buffer()~385–431(uploads whole ChipCallable incl. orch)register_callable()~476–535(separate orch.soupload →orch_so_dedup_)prepare_orch_so()~437–470(set_dev_orch_so)unregister_callable()571–594(freesorch_so_dedup_, notchip_callable_buffers_)chip_callable_buffers_~680–689src/common/platform/onboard/host/device_runner_base.hstruct ChipCallableBuffer625–628(norefcount), map629struct OrchSoBuffer654–657(hasrefcount), map660src/common/task_interface/callable.h:117–118—ChipCallable::binary_data()/binary_size()= orch.sosrc/{a5,a2a3}/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp~133–134—orch_so = callable->binary_data()src/{a5,a2a3}/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp~269–318— AICPU readsdev_orch_so→ writes temp file →dlopenProposed Fix
Unify into a single refcounted device buffer:
prepare_orch_so(device-orch path): setdev_orch_so = chip_dev + offsetof(ChipCallable, storage_), size= binary_size()— point the AICPU at the orch bytes already inside the chip buffer.register_callable's separate orch upload + theorch_so_dedup_map/refcount.refcounttoChipCallableBufferand free-on-unregister inunregister_callable()(mirror the existingorch_so_dedup_pattern), so the unified buffer (orch + kernels) is released when its last callable unregisters.Must verify / preserve (the make-or-break check):
upload_chip_callable_bufferrunspatch_chip_callable_scratch_for_device(patches childresolved_addr_). Confirm that patching does not touch the orch regionstorage_[0 .. binary_size), so the in-buffer orch is byte-identical to whatdlopenneeds.host_build_graphdoesn't H2D the orch (hostdlopenviastate.host_dlopen_handle). This change is for the device-orch (tensormap_and_ringbuffer) path only; the hbg branch must stay.Priority
Medium (minor risk, should fix in next few releases)