Skip to content

[Code Health] AICPU orch .so is double-uploaded (orch_so_dedup_ duplicates the ChipCallable buffer) and chip_callable_buffers_ leaks until finalize #1082

Description

@ChaoWao

Category

Technical Debt (cleanup, refactor)

Component

Host Runtime

Description

Two related device-memory hygiene problems in the onboard host runner, surfaced while investigating #1019 (which turned out to be unrelated — an op-timeout window, fixed by #1035).

1. The AICPU orchestration .so is uploaded to the device twice.

  • upload_chip_callable_buffer() uploads the whole ChipCallable (header + storage_) into chip_callable_buffers_. storage_ starts with the orchestration .soChipCallable::binary_data() is the orch .so (the host itself reads it that way: runtime_maker.cpp does orch_so_binary = callable->binary_data()).
  • register_callable() then uploads just the orch .so bytes again into a separate pool orch_so_dedup_, and points the AICPU at it via runtime.set_dev_orch_so(addr, size).

So the same orch bytes occupy device GM twice. The orch already lives at a known offset inside the chip buffer: chip_dev + offsetof(ChipCallable, storage_), length binary_size(). orch_so_dedup_ exists only because the AICPU reads dev_orch_so (the standalone copy) and dlopens it, instead of pointing into the chip buffer.

2. chip_callable_buffers_ is never freed on unregister — only at finalize().

  • unregister_callable() decrements + frees orch_so_dedup_ (it has a refcount), but does not touch chip_callable_buffers_.
  • ChipCallableBuffer has no refcount field (unlike OrchSoBuffer), and the comment says "never freed until finalize".
  • Result: in a long-lived worker (e.g. the L2 st_worker pool reuses one ChipWorker per (runtime, device) for a whole xdist session), every distinct chip-callable buffer accumulates on-device until the runner finalizes, even after its callable is unregistered. The header comment at the chip_callable_buffers_ declaration already claims "refcounted by hash", but that was specified and never implemented.

Neither is a crash today (the #1019 repro showed max-live=1 for a single callable), but both are real growth/duplication that scale with callable churn.

Location

  • src/common/platform/onboard/host/device_runner_base.cpp
    • upload_chip_callable_buffer() ~385431 (uploads whole ChipCallable incl. orch)
    • register_callable() ~476535 (separate orch .so upload → orch_so_dedup_)
    • prepare_orch_so() ~437470 (set_dev_orch_so)
    • unregister_callable() 571594 (frees orch_so_dedup_, not chip_callable_buffers_)
    • finalize-only free of chip_callable_buffers_ ~680689
  • src/common/platform/onboard/host/device_runner_base.h
    • struct ChipCallableBuffer 625628 (no refcount), map 629
    • struct OrchSoBuffer 654657 (has refcount), map 660
  • src/common/task_interface/callable.h:117118ChipCallable::binary_data()/binary_size() = orch .so
  • src/{a5,a2a3}/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp ~133134orch_so = callable->binary_data()
  • src/{a5,a2a3}/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp ~269318 — AICPU reads dev_orch_so → writes temp file → dlopen

Proposed Fix

Unify into a single refcounted device buffer:

  1. prepare_orch_so (device-orch path): set dev_orch_so = chip_dev + offsetof(ChipCallable, storage_), size = binary_size() — point the AICPU at the orch bytes already inside the chip buffer.
  2. Delete register_callable's separate orch upload + the orch_so_dedup_ map/refcount.
  3. Add refcount to ChipCallableBuffer and free-on-unregister in unregister_callable() (mirror the existing orch_so_dedup_ pattern), so the unified buffer (orch + kernels) is released when its last callable unregisters.
  4. Start with a failing repro test: register N distinct callables → unregister → assert the chip pool is freed (today it isn't).

Must verify / preserve (the make-or-break check):

  • Byte-identity: upload_chip_callable_buffer runs patch_chip_callable_scratch_for_device (patches child resolved_addr_). Confirm that patching does not touch the orch region storage_[0 .. binary_size), so the in-buffer orch is byte-identical to what dlopen needs.
  • hbg path: host_build_graph doesn't H2D the orch (host dlopen via state.host_dlopen_handle). This change is for the device-orch (tensormap_and_ringbuffer) path only; the hbg branch must stay.

Priority

Medium (minor risk, should fix in next few releases)

Metadata

Metadata

Assignees

Labels

code healthTechnical debt, robustness, code quality

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions