From 267842da7d05df728336d8102ed9374db979312f Mon Sep 17 00:00:00 2001 From: puddingfjz <2811443837@qq.com> Date: Mon, 29 Jun 2026 19:53:41 +0800 Subject: [PATCH 1/9] docs: add TRB temporary buffer plan --- docs/trb-serial-tensor-buffer-pool-plan.md | 590 +++++++++++++++++++++ 1 file changed, 590 insertions(+) create mode 100644 docs/trb-serial-tensor-buffer-pool-plan.md diff --git a/docs/trb-serial-tensor-buffer-pool-plan.md b/docs/trb-serial-tensor-buffer-pool-plan.md new file mode 100644 index 000000000..757db30fc --- /dev/null +++ b/docs/trb-serial-tensor-buffer-pool-plan.md @@ -0,0 +1,590 @@ +# TRB Temporary Variable Buffer Implementation Plan + +**Date**: 2026-06-29 +**Status**: implementation plan + +## Decision + +The target optimization is a runtime-side temporary variable buffer for +ordinary non-child tensors in the `tensormap_and_ringbuffer` path. +This plan uses "temporary variable buffer" for the same concept as +临时变量 buffer. + +The serving constraint is important: + +- Do not change Qwen3 model code or kernel signatures. +- Keep the current hidden input boundary: hidden is still produced on host and + copied H2D for each run. +- Preserve existing child-memory behavior. Device-resident weights, RoPE + tables, LM head, and KV cache are already handled by the caller as + child-memory / `DeviceTensor` style inputs. +- Do not try to infer model-specific maximum tensor sizes inside the runtime. + The serving or runner owner must provide the temporary-buffer memory budget + when enabling this optimization. + +Therefore the near-term optimization is: + +```text +ordinary non-child host tensor + -> acquire device slice from the temporary variable buffer + -> H2D or device memset + -> run + -> D2H if the host still needs the output + -> end the run and make the temporary variable buffer reusable +``` + +This reduces repeated `device_malloc()` / `device_free()` in the hot path. It +does not remove required H2D or D2H copies. 
+ +## Non-Goals + +- Do not convert every non-child tensor into user-visible child-memory. +- Do not add full-bind cross-run overlap. +- Do not allocate a second per-run tensor-buffer set by default. +- Do not add output double buffering for logits. +- Do not add dirty/version tracking for Qwen3 hidden or small metadata tensors. +- Do not skip hidden H2D while model/kernel code still expects hidden input. +- Do not add a new env var or macro gate without explicit approval. +- Do not change `worker.malloc()` / `worker.free()` public semantics. +- Do not change copy-back policy in this plan. + +## Current Code Shape + +Current TRB bind does this for every non-child tensor: + +- `src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp` + `bind_callable_to_runtime_impl()` +- allocate a fresh device buffer; +- copy input / INOUT tensors H2D; +- memset pure OUT tensors; +- record `TensorPair`. + +Current validate does this: + +- read PTO2 status/header; +- copy OUTPUT / INOUT tensors D2H when `needs_copy_back` is true; +- call `device_free()` for every recorded device pointer; +- clear `tensor_pairs_`. + +Existing child-memory does not need changes: + +- child-memory tensors are passed through directly; +- runtime does not H2D, D2H, or free them; +- ownership stays with the caller. + +## Target Ownership Model + +Replace the current implicit `TensorPair` ownership assumption with an explicit +lease model. + +Current implicit assumption: + +```text +every TensorPair.dev_ptr was allocated by this run +validate must device_free() every TensorPair.dev_ptr +``` + +Target explicit model: + +```cpp +enum class TensorReleaseKind { + Free, + BufferNoop, + ExternalNoop, +}; + +struct TensorLease { + void *host_ptr; + void *dev_ptr; + size_t size; + bool needs_copy_back; + TensorReleaseKind release_kind; +}; +``` + +Expected use: + +- `BufferNoop`: normal non-child TRB temporary-buffer slice. 
The per-tensor + release is a no-op; the whole buffer becomes reusable at run end. +- `Free`: existing per-run allocation path when the temporary-buffer + optimization is disabled. +- `ExternalNoop`: only for future explicit external device tensors, not needed + for current child-memory pass-through because those tensors are not recorded. + +It is acceptable to keep the member name `tensor_pairs_` temporarily if that +keeps the diff smaller, but the recorded object must carry release ownership. +The preferred cleanup is to rename it to `tensor_leases_` in both TRB runtime +headers. + +## Temporary Buffer Location + +Add the temporary variable buffer to `DeviceRunnerBase`, not to `Runtime`. + +Reason: + +- `DeviceRunnerBase` already owns `MemoryAllocator`. +- it already serializes device allocation/free through `device_mem_mu_`; +- the buffer lifetime should match the worker/device context; +- `Runtime` is per invocation and should only record leases for one run. + +Affected platform bases: + +- `src/common/platform/onboard/host/device_runner_base.{h,cpp}` +- `src/common/platform/sim/host/device_runner_base.{h,cpp}` + +The sim and onboard implementations should expose the same internal methods: + +```cpp +bool configure_temporary_buffer(size_t max_temporary_buffer_bytes); +bool begin_temporary_buffer_run(); +void *acquire_temporary_buffer_slice(size_t bytes, size_t alignment); +void end_temporary_buffer_run(); +void clear_temporary_buffer(); +``` + +The `alignment` argument is internal to the runtime/platform layer. It must +preserve the same or stricter alignment guarantee that callers previously got +from allocating each tensor with `device_malloc()`. The caller does not provide +a new model-specific alignment value. + +The exact error plumbing should follow local runtime style. The logical +contract is that configuration, begin, and acquire failures are observable by +the caller and are not silently ignored. 
+ +`finalize_common()` / sim finalize must clear all retained temporary-buffer +chunks before final allocator teardown. If finalize sees an active +temporary-buffer run, that is a programming error: log it, release retained +chunks, and make the behavior explicit in the implementation contract. + +## Budget Contract + +The runtime should not compute Qwen3-specific maximum temporary-buffer size. +The serving or runner owner that enables this optimization must provide: + +```text +max_temporary_buffer_bytes +``` + +Definition: + +```text +maximum total aligned device bytes required by all ordinary non-child tensor +temporary-buffer allocations in one run_prepared() invocation for this runner +``` + +Rules: + +- The value is an aggregate byte budget, not a tensor count. +- The value must include alignment padding and a safety margin. +- The value covers hidden, small metadata tensors, and output buffers + that are not child-memory. +- Child-memory tensors are not counted. +- If the budget is zero or missing, keep the existing per-run malloc/free path + unless a compatibility rollout explicitly chooses to fail fast. +- If a run needs more than the configured budget, fail with a clear error that + reports required bytes and configured bytes. +- If a positive budget is configured, budget exhaustion is a configuration + error. Do not silently fall back to per-run `device_malloc()` for that tensor + or run. + +This keeps model-shape knowledge in serving code, where `max_batch_size`, +`max_token_num`, hidden size, vocab padding, and metadata shapes are already +known. + +This does not require every application caller to pass a budget on every run. +It means the component that integrates a model runner must provide a positive +runner-scoped budget if it wants this optimization. Without that budget, the +runtime stays on the current allocation/free behavior. 
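As an illustration of the budget rules above, the serving side could derive `max_temporary_buffer_bytes` roughly as follows. This is a minimal sketch, not code from the patch: `align_up` and `compute_temporary_buffer_budget` are hypothetical helper names, and expressing the safety margin as a percentage is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>

// Hypothetical helper: round `bytes` up to a multiple of `alignment`.
// `alignment` must be a nonzero power of two.
constexpr std::size_t align_up(std::size_t bytes, std::size_t alignment) {
    return (bytes + alignment - 1) & ~(alignment - 1);
}

// Hypothetical helper: sum the aligned sizes of every ordinary non-child
// tensor used by one run, then add a safety margin given in percent.
// Child-memory tensors are deliberately not part of `tensor_bytes`.
inline std::size_t compute_temporary_buffer_budget(
    std::initializer_list<std::size_t> tensor_bytes, std::size_t alignment, std::size_t margin_percent
) {
    std::size_t total = 0;
    for (std::size_t bytes : tensor_bytes) {
        total += align_up(bytes, alignment);
    }
    return align_up(total + total * margin_percent / 100, alignment);
}
```

The margin matters more once the budget is split across several chunks, because the unused tail of one chunk cannot serve a slice that must be contiguous.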
+ +## Configuration Ingress + +The budget is configured once per runner before the first +temporary-buffer-backed `run_prepared()` call on that runner. The expected +owner is serving or model-runner setup code, after maximum shapes are known. + +Preferred call path: + +```text +Qwen3 serving / runner setup + -> worker or runner temporary-buffer configuration API + -> DeviceRunnerBase::configure_temporary_buffer(bytes) +``` + +The implementation must add a concrete configuration entrypoint above +`DeviceRunnerBase`. The `HostApi` begin/acquire/end callbacks consume the +already configured runner buffer; they must not receive or infer the budget +per tensor or per `run_prepared()` call. + +Rules: + +- missing or zero budget disables the temporary-buffer optimization; +- positive budget enables temporary-buffer-backed allocation for TRB + non-child tensors; +- positive budget configuration must allocate retained device chunks before + the first temporary-buffer-backed run; +- reconfiguration is allowed only when no temporary-buffer run is active; +- repeated configuration with the same positive budget should be a no-op; +- `clear()` should run only when disabling the optimization or when the + configured budget changes; +- invalid reconfiguration must fail with a clear error; +- diagnostics must expose the configured budget. + +## Buffer Behavior + +Implement the temporary variable buffer as a per-run bump allocator over +retained device chunks. 
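A per-run bump allocator over retained chunks can be sketched as below. This is an illustrative sketch under stated assumptions, not the contents of `temporary_variable_buffer.h`: the names `ChunkSketch` and `BumpBufferSketch` are hypothetical, chunk memory is handed in by the caller instead of being allocated on a device, and budget accounting and diagnostics are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical chunk record: base address, capacity, and the bump offset.
struct ChunkSketch {
    std::uintptr_t base;
    std::size_t capacity;
    std::size_t offset;
};

class BumpBufferSketch {
public:
    // Configuration-time step: register a retained chunk.
    void add_chunk(void *base, std::size_t capacity) {
        chunks_.push_back({reinterpret_cast<std::uintptr_t>(base), capacity, 0});
    }

    // Reset every chunk offset; the caller serializes runs on this runner.
    void begin_run() {
        for (ChunkSketch &chunk : chunks_) {
            chunk.offset = 0;
        }
        active_ = true;
    }

    // Return the next aligned slice from the first chunk that can hold it,
    // or nullptr. No allocation happens here, so the hot path stays cheap.
    void *acquire(std::size_t bytes, std::size_t alignment) {
        if (!active_ || alignment == 0 || (alignment & (alignment - 1)) != 0) {
            return nullptr;
        }
        const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
        for (ChunkSketch &chunk : chunks_) {
            std::uintptr_t aligned = (chunk.base + chunk.offset + mask) & ~mask;
            std::size_t start = static_cast<std::size_t>(aligned - chunk.base);
            if (start <= chunk.capacity && bytes <= chunk.capacity - start) {
                chunk.offset = start + bytes;
                return reinterpret_cast<void *>(aligned);
            }
        }
        return nullptr;  // no retained chunk can hold a contiguous slice
    }

    // Mark the buffer inactive; chunks stay retained for the next run.
    void end_run() { active_ = false; }

private:
    std::vector<ChunkSketch> chunks_;
    bool active_ = false;
};
```

Because `begin_run()` only rewinds offsets, the first slice of every run lands at the same address as in the previous run, which is the serial reuse this plan is after.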
+ +Basic lifecycle: + +```text +configure(max_temporary_buffer_bytes) + if unchanged, keep existing retained chunks + if disabling or changing the budget, clear existing retained chunks + record the budget + allocate retained device chunks up to the configured budget + +begin_run() + reset chunk offsets to zero + caller must ensure no other temporary-buffer-backed run is active + +acquire(bytes, alignment) + align the chunk offset to the device allocation alignment + return the next aligned slice from existing chunks + fail if the slice would exceed max_temporary_buffer_bytes + +end_run() + mark the temporary variable buffer inactive + keep chunks retained for the next run + +clear() + free all retained chunks +``` + +Pseudocode: + +```cpp +void *TemporaryVariableBuffer::acquire(size_t bytes, size_t alignment) { + if (!active_) { + return nullptr; // temporary buffer is disabled or not in a run + } + + if (void *ptr = try_allocate_from_existing_chunks(bytes, alignment)) { + return ptr; + } + + return nullptr; // configured budget or chunk layout is insufficient +} +``` + +`acquire()` must not call `device_malloc()` on the run hot path. Device chunk +allocation happens during configuration, not lazily while binding a run. + +This is "serial" reuse: + +```text +run N uses temporary-buffer slices +validate for run N copies back required outputs +validate ends the temporary-buffer run +run N+1 may reuse the same temporary-buffer memory +``` + +The temporary variable buffer is not a cross-run overlap mechanism. It is +acquired only inside the active `run_prepared()` call. A later run may use it +only after the previous `run_prepared()` reaches validate and ends the buffer +run. + +## Concurrency Assumption + +This plan assumes one active `run_prepared()` lifecycle per runner for +temporary-buffer-backed tensors. 
+ +The implementation does not add: + +- locking around the full bind/run/validate lifecycle; +- active-run guards; +- fallback malloc/free for concurrent binds; +- double buffering. + +If two host threads call `run_prepared()` concurrently on the same runner while +temporary buffering is enabled, behavior is unsupported. The caller or serving +scheduler is responsible for serializing same-runner runs. + +Future same-runner concurrency must add a run-lifecycle mutex, active-run +guard, fallback-to-malloc behavior, or true double buffering. That work is +outside this implementation plan. + +## Segmented Chunks + +The configured budget is an aggregate limit. It should not require one huge +contiguous HBM allocation. + +Preferred implementation: + +```cpp +struct Chunk { + void *base; + size_t capacity; + size_t offset; +}; + +std::vector<Chunk> chunks_; +size_t max_temporary_buffer_bytes_; +``` + +Allocation policy: + +- Support multiple chunks so the implementation does not depend on the largest + contiguous allocatable HBM block. +- Allocate retained chunks during positive-budget configuration. Do not add + chunks lazily from `acquire()` during bind. +- Never let total retained chunk capacity exceed + `max_temporary_buffer_bytes`. +- A tensor slice must be contiguous within one chunk. If a single tensor is + larger than every retained chunk, configuration must create a large-enough + chunk within the same aggregate budget or fail before the run. + +This keeps the hot path deterministic after warmup while avoiding the fragility +of one large allocation when HBM is fragmented by weights, KV cache, runtime +control buffers, or driver allocations. + +## Host API Wiring + +Do not change the public `device_malloc_ctx()` / `device_free_ctx()` APIs. +Those are used by explicit caller-owned device memory and must keep real +malloc/free semantics.
+ +Instead, expose internal temporary-buffer callbacks through the runtime +`HostApi`: + +```cpp +bool (*begin_temporary_buffer_run)(); +void *(*acquire_temporary_buffer_slice)(size_t size, size_t alignment); +void (*end_temporary_buffer_run)(); +``` + +Wire these callbacks in: + +- `src/common/platform/onboard/host/c_api_shared.cpp` +- `src/common/platform/sim/host/c_api_shared.cpp` + +HostApi compatibility matters. If common platform code initializes HostApi +fields for multiple runtime variants, add the fields consistently to those +HostApi definitions or guard the wiring so non-TRB runtimes compile cleanly. +Only TRB should use the temporary-buffer callbacks in this plan. + +## Bind Path Changes + +In TRB `bind_callable_to_runtime_impl()`: + +1. Begin a temporary-buffer run before processing non-child tensors when the + optimization is enabled. +2. Keep the child-memory branch unchanged. +3. For every non-child tensor, acquire a temporary-buffer slice: + + ```text + if temporary variable buffer is enabled: + dev_ptr = acquire_temporary_buffer_slice(size, alignment) + fail clearly if the configured budget is insufficient + do not fall back to device_malloc() + release_kind = BufferNoop + else: + dev_ptr = device_malloc(size) + release_kind = Free + ``` + +4. Record a `TensorLease` immediately after a device pointer is acquired. + This lets the failure path release every acquired buffer. +5. Preserve current copy behavior: + + - `ArgDirection::OUT`: device memset when available; + - otherwise: H2D copy from host. + +6. Set `needs_copy_back` from the current signature logic. +7. On failure before bind succeeds, release through the recorded release kind + and end the temporary-buffer run exactly once. + +Do not add dirty/version skip logic. Hidden and small metadata still need the +same copy semantics as today. + +## Validate Path Changes + +In TRB `validate_runtime_impl()`: + +1. Keep PTO2 status/header readback behavior. +2. 
Keep D2H copy when `needs_copy_back` is true. +3. Replace unconditional `device_free()` with release dispatch: + + ```text + Free -> device_free(dev_ptr) + BufferNoop -> do nothing for this tensor + ExternalNoop -> do nothing + ``` + +4. Clear the per-run lease vector at the end. +5. End the temporary-buffer run after all copy-back and cleanup decisions. +6. On runtime failure, skip tensor copy-back as today, but still release every + `Free` allocation and end the temporary-buffer run correctly. + +For Qwen3 with host sampling, logits copy-back remains required. The temporary +variable buffer only removes repeated allocation/free around that output +buffer. + +## Bind / Validate Cleanup Contract + +Cleanup ownership must be explicit: + +```text +bind owns cleanup until bind succeeds +validate owns cleanup after bind succeeds +``` + +If `begin_temporary_buffer_run()` succeeds, exactly one matching +`end_temporary_buffer_run()` must run. This applies whether bind, H2D, memset, +run, status readback, D2H, or validation fails. + +Bind should use a local cleanup guard: + +```text +temp_run_active = false + +if temporary buffer is enabled: + begin_temporary_buffer_run() + temp_run_active = true + +for tensor in tensors: + acquire or malloc dev_ptr + record TensorLease immediately + copy or memset + +runtime.temporary_buffer_run_active = temp_run_active +release bind cleanup guard +``` + +Before the cleanup guard is released, bind failure cleanup must: + +- release all recorded `Free` leases with `device_free()`; +- leave `BufferNoop` and `ExternalNoop` tensor leases as per-tensor no-ops; +- end the temporary-buffer run if `temp_run_active` is true; +- clear recorded leases. + +After bind succeeds, validate cleanup must perform the same release dispatch +and end the temporary-buffer run if +`runtime.temporary_buffer_run_active` is true. 
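The cleanup contract above can be exercised with a small host-only model. The sketch below is illustrative and uses hypothetical names throughout (`ScopeGuardSketch`, `FakeHostSketch`, `bind_sketch`, `validate_sketch`); it counts calls instead of touching device memory and simulates a copy failure with a `fail_at` index.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

enum class ReleaseKindSketch { Free, BufferNoop, ExternalNoop };

struct LeaseSketch {
    int id;
    ReleaseKindSketch kind;
};

// Hypothetical stand-in for HostApi plus the runner state; it only counts.
struct FakeHostSketch {
    int free_calls = 0;
    int end_calls = 0;
    bool run_active = false;
};

// Minimal dismissable scope guard: runs `fn` on scope exit unless dismissed.
class ScopeGuardSketch {
public:
    explicit ScopeGuardSketch(std::function<void()> fn) : fn_(std::move(fn)) {}
    ~ScopeGuardSketch() {
        if (armed_) fn_();
    }
    ScopeGuardSketch(const ScopeGuardSketch &) = delete;
    ScopeGuardSketch &operator=(const ScopeGuardSketch &) = delete;
    void dismiss() { armed_ = false; }

private:
    std::function<void()> fn_;
    bool armed_ = true;
};

// Release dispatch: only Free leases reach the allocator.
inline void release_leases(std::vector<LeaseSketch> &leases, FakeHostSketch &host) {
    for (const LeaseSketch &lease : leases) {
        if (lease.kind == ReleaseKindSketch::Free) ++host.free_calls;
    }
    leases.clear();
}

inline void end_run_if_active(FakeHostSketch &host) {
    if (host.run_active) {
        ++host.end_calls;
        host.run_active = false;
    }
}

// Bind owns cleanup until it succeeds; `fail_at` simulates a failed copy.
inline bool bind_sketch(
    std::vector<LeaseSketch> &leases, FakeHostSketch &host, bool use_buffer, int tensors, int fail_at
) {
    if (use_buffer) host.run_active = true;
    ScopeGuardSketch cleanup([&]() {
        release_leases(leases, host);
        end_run_if_active(host);
    });
    for (int i = 0; i < tensors; ++i) {
        // Record the lease before the copy so a failure can still release it.
        leases.push_back({i, use_buffer ? ReleaseKindSketch::BufferNoop : ReleaseKindSketch::Free});
        if (i == fail_at) return false;
    }
    cleanup.dismiss();  // ownership of cleanup moves to validate
    return true;
}

// Validate owns cleanup after bind succeeds.
inline void validate_sketch(std::vector<LeaseSketch> &leases, FakeHostSketch &host) {
    release_leases(leases, host);
    end_run_if_active(host);
}
```

Recording the lease before the copy is the detail that lets a single guard cover every failure point in bind.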
+ +## Copy Behavior + +This plan intentionally keeps data movement behavior unchanged: + +- hidden remains H2D every decode step; +- seq / chunk / block metadata remains H2D when passed as host tensors; +- logits remains D2H for host sampling; +- OUTPUT / INOUT copy-back still follows existing `needs_copy_back` logic; +- PTO2 status/header D2H remains unchanged. + +Any future H2D/D2H avoidance would need a separate correctness contract for +which tensor content is device-resident, host-visible, dirty, or final. That is +not needed for the allocation/free optimization in this plan. + +## Qwen3 Tensor Classification + +With the current model/kernel boundary, treat Qwen3 tensors as follows: + +- weights / RoPE / LM head: existing child-memory / `DeviceTensor`; + do not touch. +- KV cache: existing child-memory / `DeviceTensor`; do not touch. +- hidden: ordinary non-child tensor; temporary-buffer slice; keep H2D. +- seq / chunk metadata: ordinary non-child tensor; temporary-buffer slice; + keep H2D. +- block_table / slot_mapping: ordinary non-child tensor; temporary-buffer + slice; keep H2D. +- logits: ordinary output tensor; temporary-buffer slice; keep D2H. + +The hidden H2D copy cannot be removed without changing the model/kernel +boundary to accept token ids and perform embedding lookup on device. That is +out of scope for this plan. + +## Logging And Metrics + +Add lightweight temporary-buffer counters, preferably exposed only through +debug logs or existing diagnostics: + +- configured temporary-buffer budget; +- retained chunk count; +- retained chunk bytes; +- current run used bytes; +- high-water used bytes; +- buffer-backed allocation count; +- `Free` allocation count; +- budget-exceeded count. + +Do not add a new behavior env var for this. If rollout needs a gate, ask for +explicit approval and document the default before adding it. + +Keep hot-path per-tensor logs out of `LOG_INFO_V0`. 
Use debug or aggregate +summary logs so performance runs are not perturbed. + +## Tests + +Minimum focused tests: + +- buffer unit test: begin, allocate several aligned slices, end, and next run + reuses the same base memory; +- buffer unit test: configured budget is enforced with a clear error; +- buffer unit test: segmented chunks work when one chunk cannot satisfy the + aggregate budget; +- buffer unit test: finalize frees retained chunks exactly once; +- runtime bind/validate test with a fake `HostApi`: repeated run records fewer + allocator calls while preserving H2D/D2H counts; +- child-memory regression: child-memory tensor is still pass-through and is not + recorded for temporary-buffer release; +- OUT tensor regression: pure OUT still receives device memset before run; +- error-path regression: failed copy or failed run releases every `Free` + allocation and ends the temporary-buffer run exactly once. + +Recommended integration checks: + +- existing prepared-callable ST for a2a3 TRB; +- existing prepared-callable ST for a5 TRB; +- qwen3 steady decode benchmark before/after with quiet logs. + +Benchmark and correctness validation must use the supported single-active-run +usage model. Hardware runs must still use `task-submit` when available. 
+ +## Acceptance Criteria + +Accept the implementation only if all of these hold: + +- correctness tests are unchanged; +- no change to public child-memory semantics; +- no change to public `worker.malloc/free` semantics; +- the optimization requires a caller-provided `max_temporary_buffer_bytes` + and does not compute Qwen3 shape maxima inside the runtime; +- if a run exceeds the configured temporary-buffer budget, the error reports + required bytes and configured bytes; +- after warmup, steady decode `device_malloc` / `device_free` calls for + non-child temporary-buffer allocation drop materially; +- H2D/D2H bytes remain explainable and do not silently disappear; +- retained temporary-buffer HBM is bounded by `max_temporary_buffer_bytes`; +- live allocation count does not grow across repeated steady decode; +- `host_wall` improves by at least 1 ms or 5 percent on the target workload, or + allocator timing shows the expected reduction even if end-to-end impact is + smaller; +- `device_wall` and device-log `Total` do not regress by more than 1 percent; +- p99 latency does not regress by more than 5 percent. + +## Deferred Work + +These ideas are not part of this implementation: + +- dirty/version contracts for ordinary host tensors; +- skipping hidden H2D; +- device-side embedding lookup; +- device-side sampling; +- output copy-back elimination for logits; +- cross-run full-bind overlap; +- full tensor double buffering; +- runtime arena double buffering; +- AICPU init/teardown overlap. + +They can be revisited only if measurements show the temporary variable buffer +no longer addresses the dominant host-side overhead. 
From 0c61eb48dafeb0c14d47fc818241033dfd5068e8 Mon Sep 17 00:00:00 2001 From: puddingfjz <2811443837@qq.com> Date: Tue, 30 Jun 2026 12:06:40 +0800 Subject: [PATCH 2/9] Add: TRB temporary buffer reuse - Add a retained temporary variable buffer owned by DeviceRunnerBase and wire it through HostApi, C ABI, ChipWorker, and Worker configuration.\n- Convert TRB tensor cleanup to explicit leases so configured runs reuse buffer slices while disabled runs keep malloc/free semantics.\n- Cover buffer lifecycle, TRB bind/validate cleanup, child-memory regressions, and Python configuration entrypoints. --- docs/trb-serial-tensor-buffer-pool-plan.md | 4 +- python/bindings/task_interface.cpp | 10 + python/simpler/task_interface.py | 17 + python/simpler/worker.py | 23 ++ src/a2a3/platform/sim/host/device_runner.cpp | 1 + .../host/runtime_maker.cpp | 140 ++++++-- .../runtime/pto_types.h | 2 +- .../runtime/runtime.h | 15 +- src/a5/platform/sim/host/device_runner.cpp | 1 + .../host/runtime_maker.cpp | 140 ++++++-- .../runtime/pto_types.h | 2 +- .../runtime/runtime.h | 15 +- src/common/platform/include/common/host_api.h | 9 +- .../include/host/temporary_variable_buffer.h | 289 ++++++++++++++++ .../platform/onboard/host/c_api_shared.cpp | 22 ++ .../onboard/host/device_runner_base.cpp | 63 +++- .../onboard/host/device_runner_base.h | 8 + src/common/platform/sim/host/c_api_shared.cpp | 22 ++ .../platform/sim/host/device_runner_base.cpp | 54 +++ .../platform/sim/host/device_runner_base.h | 9 + src/common/worker/chip_worker.cpp | 31 +- src/common/worker/chip_worker.h | 6 + src/common/worker/pto_runtime_c_api.h | 8 + tests/ut/cpp/CMakeLists.txt | 30 ++ .../common/test_temporary_variable_buffer.cpp | 143 ++++++++ .../common/test_trb_runtime_temp_buffer.cpp | 316 ++++++++++++++++++ tests/ut/py/test_chip_worker.py | 18 + tests/ut/py/test_worker/test_host_worker.py | 36 ++ 28 files changed, 1354 insertions(+), 80 deletions(-) create mode 100644 
src/common/platform/include/host/temporary_variable_buffer.h create mode 100644 tests/ut/cpp/common/test_temporary_variable_buffer.cpp create mode 100644 tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp diff --git a/docs/trb-serial-tensor-buffer-pool-plan.md b/docs/trb-serial-tensor-buffer-pool-plan.md index 757db30fc..0f5fa913f 100644 --- a/docs/trb-serial-tensor-buffer-pool-plan.md +++ b/docs/trb-serial-tensor-buffer-pool-plan.md @@ -206,7 +206,9 @@ Preferred call path: ```text Qwen3 serving / runner setup - -> worker or runner temporary-buffer configuration API + -> Worker(level=2, max_temporary_buffer_bytes=...) + or Worker.configure_temporary_buffer(bytes) + or ChipWorker.configure_temporary_buffer(bytes) -> DeviceRunnerBase::configure_temporary_buffer(bytes) ``` diff --git a/python/bindings/task_interface.cpp b/python/bindings/task_interface.cpp index 43c2e6e9c..02276b59c 100644 --- a/python/bindings/task_interface.cpp +++ b/python/bindings/task_interface.cpp @@ -917,6 +917,16 @@ NB_MODULE(_task_interface, m) { "host_build_graph variants. Mirrors aicpu_dlopen_count for the " "host-orchestration path; 0 on device-orch variants." ) + .def( + "configure_temporary_buffer", &ChipWorker::configure_temporary_buffer, + nb::arg("max_temporary_buffer_bytes"), + "Configure the runner-scoped TRB temporary variable buffer. " + "Pass 0 to disable and return to per-run malloc/free." + ) + .def_prop_ro( + "temporary_buffer_budget", &ChipWorker::temporary_buffer_budget, + "Configured temporary-buffer budget in bytes, or 0 when disabled." 
+ ) .def("malloc", &ChipWorker::malloc, nb::arg("size")) .def("free", &ChipWorker::free, nb::arg("ptr")) .def("copy_to", &ChipWorker::copy_to, nb::arg("dst"), nb::arg("src"), nb::arg("size")) diff --git a/python/simpler/task_interface.py b/python/simpler/task_interface.py index b29e24911..3dd290ac6 100644 --- a/python/simpler/task_interface.py +++ b/python/simpler/task_interface.py @@ -1192,6 +1192,23 @@ def host_dlopen_count(self): """Number of host-side orch SO dlopens (host_build_graph variants).""" return self._impl.host_dlopen_count + def configure_temporary_buffer(self, max_temporary_buffer_bytes: int) -> None: + """Configure the runner-scoped TRB temporary variable buffer. + + ``0`` disables the optimization and keeps the existing per-run + malloc/free path. A positive value is an aggregate byte budget for + ordinary non-child tensors in one ``run_prepared`` invocation. + """ + budget = int(max_temporary_buffer_bytes) + if budget < 0: + raise ValueError("max_temporary_buffer_bytes must be non-negative") + self._impl.configure_temporary_buffer(budget) + + @property + def temporary_buffer_budget(self) -> int: + """Configured temporary-buffer budget in bytes, or 0 when disabled.""" + return int(self._impl.temporary_buffer_budget) + def malloc(self, size): """Allocate memory. 
Returns a pointer (uint64).""" return int(self._impl.malloc(int(size))) diff --git a/python/simpler/worker.py b/python/simpler/worker.py index 13cb56ab9..b8ded40e9 100644 --- a/python/simpler/worker.py +++ b/python/simpler/worker.py @@ -2875,6 +2875,11 @@ def _init_level2(self) -> None: self._chip_worker = ChipWorker() self._chip_worker.init(device_id, binaries) + max_temporary_buffer_bytes = int(self._config.get("max_temporary_buffer_bytes", 0)) + if max_temporary_buffer_bytes < 0: + raise ValueError("Worker max_temporary_buffer_bytes must be non-negative") + if max_temporary_buffer_bytes > 0: + self._chip_worker.configure_temporary_buffer(max_temporary_buffer_bytes) # Pre-warm any registered ChipCallable so the first run(handle, …) # does not pay the H2D upload cost. @@ -3786,6 +3791,17 @@ def copy_from(self, dst: int, src: int, size: int, worker_id: int = 0) -> None: assert self._orch is not None self._orch.copy_from(worker_id, dst, src, size) + def configure_temporary_buffer(self, max_temporary_buffer_bytes: int) -> None: + """Configure the level-2 TRB temporary variable buffer for this Worker.""" + budget = int(max_temporary_buffer_bytes) + if budget < 0: + raise ValueError("max_temporary_buffer_bytes must be non-negative") + if self.level != 2: + raise NotImplementedError("Worker.configure_temporary_buffer currently supports level 2 only") + self._config["max_temporary_buffer_bytes"] = budget + if self._chip_worker is not None: + self._chip_worker.configure_temporary_buffer(budget) + # ------------------------------------------------------------------ # run — uniform entry point # ------------------------------------------------------------------ @@ -3891,6 +3907,13 @@ def host_dlopen_count(self) -> int: return 0 return self._chip_worker.host_dlopen_count + @property + def temporary_buffer_budget(self) -> int: + """L2 only: configured TRB temporary-buffer budget in bytes.""" + if self.level != 2 or self._chip_worker is None: + return 
int(self._config.get("max_temporary_buffer_bytes", 0)) + return self._chip_worker.temporary_buffer_budget + # ------------------------------------------------------------------ # close # ------------------------------------------------------------------ diff --git a/src/a2a3/platform/sim/host/device_runner.cpp b/src/a2a3/platform/sim/host/device_runner.cpp index c8e0929b6..cb85663c2 100644 --- a/src/a2a3/platform/sim/host/device_runner.cpp +++ b/src/a2a3/platform/sim/host/device_runner.cpp @@ -683,6 +683,7 @@ int DeviceRunner::finalize() { gm_heap_arena_.release(); gm_sm_arena_.release(); runtime_arena_pool_.release(); + clear_temporary_buffer(); cached_gm_heap_size_ = 0; cached_gm_sm_size_ = 0; cached_runtime_arena_size_ = 0; diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp index 4b7205db8..34494a7ea 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp @@ -51,6 +51,7 @@ #include "common/strace.h" #include "common/unified_log.h" #include "host/platform_compile_info.h" +#include "host/raii_scope_guard.h" #include "utils/device_arena.h" #include "prepare_callable_common.h" @@ -267,6 +268,43 @@ static int32_t pto2_read_runtime_status(Runtime *runtime, const HostApi *api, PT return runtime_status_from_error_codes(orch_error_code, sched_error_code); } +static void release_tensor_leases(Runtime *runtime, const HostApi *api) { + int freed = 0; + int buffer_noop = 0; + int external_noop = 0; + for (TensorLease &lease : runtime->tensor_leases_) { + if (lease.dev_ptr == nullptr) { + continue; + } + switch (lease.release_kind) { + case TensorReleaseKind::Free: + api->device_free(lease.dev_ptr); + ++freed; + break; + case TensorReleaseKind::BufferNoop: + ++buffer_noop; + break; + case TensorReleaseKind::ExternalNoop: + ++external_noop; + break; + } + } + LOG_DEBUG("Released tensor leases: 
freed=%d buffer_noop=%d external_noop=%d", freed, buffer_noop, external_noop); + runtime->tensor_leases_.clear(); +} + +static void end_temporary_buffer_run_if_active(const HostApi *api, bool &active) { + if (!active) { + return; + } + if (api->end_temporary_buffer_run == nullptr) { + LOG_ERROR("Temporary buffer run is active but end_temporary_buffer_run is not wired"); + } else { + api->end_temporary_buffer_run(); + } + active = false; +} + /** * Stage the per-callable resources (kernel binaries + orchestration SO) into * the supplied runtime so a subsequent bind_callable_to_runtime_impl can use @@ -334,7 +372,7 @@ struct ArenaStaticSizes { }; // Device pointers to the per-Worker static pools that DeviceRunner keeps alive -// across runs (freed in DeviceRunner::finalize(), never in tensor_pairs_). +// across runs (freed in DeviceRunner::finalize(), never in tensor_leases_). struct StaticArenaPtrs { void *gm_heap; void *gm_sm; @@ -416,11 +454,11 @@ static bool derive_arena_static_sizes(const ArenaSizingConfig &sizing, ArenaStat // host tensor pointer with a freshly staged device pointer (H2D copy-in, or an // on-device zero for pure-OUTPUT buffers), and record the host/device pair for // copy-back. Read-only INPUT tensors skip copy-back. On failure the partially -// staged device_args / tensor_pairs_ stay owned by the caller's Runtime, which +// staged device_args / tensor_leases_ stay owned by the caller's Runtime, which // frees them in validate_runtime_impl. 
static bool stage_device_args( Runtime *runtime, const HostApi *api, const ChipStorageTaskArgs *orch_args, const ArgDirection *signature, - int sig_count, ChipStorageTaskArgs *out + int sig_count, bool use_temporary_buffer, size_t temporary_buffer_budget, ChipStorageTaskArgs *out ) { int tensor_count = orch_args->tensor_count(); int scalar_count = orch_args->scalar_count(); @@ -439,7 +477,21 @@ static bool stage_device_args( void *host_ptr = reinterpret_cast<void *>(static_cast<uintptr_t>(t.buffer.addr)); size_t size = static_cast<size_t>(t.nbytes()); - void *dev_ptr = api->device_malloc(size); + void *dev_ptr = nullptr; + TensorReleaseKind release_kind = TensorReleaseKind::Free; + if (use_temporary_buffer) { + dev_ptr = api->acquire_temporary_buffer_slice(size, DeviceArena::kDefaultBaseAlign); + release_kind = TensorReleaseKind::BufferNoop; + if (dev_ptr == nullptr) { + LOG_ERROR( + "Temporary buffer acquire failed for tensor %d: tensor bytes=%zu configured bytes=%zu", i, size, + temporary_buffer_budget + ); + return false; + } + } else { + dev_ptr = api->device_malloc(size); + } if (dev_ptr == nullptr) { LOG_ERROR("Failed to allocate device memory for tensor %d", i); return false; @@ -460,7 +512,9 @@ static bool stage_device_args( } if (rc != 0) { LOG_ERROR("Failed to stage tensor %d to device", i); - api->device_free(dev_ptr); + if (release_kind == TensorReleaseKind::Free) { + api->device_free(dev_ptr); + } return false; } // Read-only INPUT tensors are never written by the kernel, so there is @@ -470,7 +524,7 @@ static bool stage_device_args( // tensor entries). Anything not provably IN keeps the safe default of // copying back.
bool needs_copy_back = !(signature != nullptr && i < sig_count && signature[i] == ArgDirection::IN); - runtime->tensor_pairs_.push_back({host_ptr, dev_ptr, size, needs_copy_back}); + runtime->tensor_leases_.push_back({host_ptr, dev_ptr, size, needs_copy_back, release_kind}); LOG_INFO_V0(" Tensor %d: %zu bytes at %p", i, size, dev_ptr); t.buffer.addr = reinterpret_cast(dev_ptr); @@ -697,6 +751,8 @@ extern "C" int bind_callable_to_runtime_impl( int tensor_count = orch_args->tensor_count(); int scalar_count = orch_args->scalar_count(); LOG_INFO_V0("RT2 bind: %d tensors + %d scalars, device orchestration mode", tensor_count, scalar_count); + runtime->tensor_leases_.clear(); + runtime->temporary_buffer_run_active_ = false; int64_t t_total_start = _now_ms(); @@ -705,8 +761,35 @@ extern "C" int bind_callable_to_runtime_impl( return -1; } + size_t temporary_buffer_budget = api->temporary_buffer_budget == nullptr ? 0 : api->temporary_buffer_budget(); + bool use_temporary_buffer = temporary_buffer_budget > 0; + if (use_temporary_buffer && (api->begin_temporary_buffer_run == nullptr || + api->acquire_temporary_buffer_slice == nullptr || + api->end_temporary_buffer_run == nullptr)) { + LOG_ERROR("Temporary buffer budget is configured but HostApi temporary-buffer callbacks are not wired"); + return -1; + } + + bool temp_run_active = false; + if (use_temporary_buffer) { + if (!api->begin_temporary_buffer_run()) { + LOG_ERROR("Failed to begin temporary buffer run"); + return -1; + } + temp_run_active = true; + runtime->temporary_buffer_run_active_ = true; + } + + auto bind_cleanup = RAIIScopeGuard([&]() { + release_tensor_leases(runtime, api); + end_temporary_buffer_run_if_active(api, temp_run_active); + runtime->temporary_buffer_run_active_ = false; + }); + ChipStorageTaskArgs device_args; - if (!stage_device_args(runtime, api, orch_args, signature, sig_count, &device_args)) { + if (!stage_device_args( + runtime, api, orch_args, signature, sig_count, use_temporary_buffer, 
temporary_buffer_budget, &device_args + )) { return -1; } @@ -751,6 +834,8 @@ extern "C" int bind_callable_to_runtime_impl( LOG_INFO_V0("TIMING: prebuilt_runtime_arena = %" PRId64 "ms", t_prebuilt_end - t_prebuilt_start); LOG_INFO_V0("TIMING: total_init_runtime_impl = %" PRId64 "ms", t_total_end - t_total_start); + runtime->temporary_buffer_run_active_ = temp_run_active; + bind_cleanup.dismiss(); return 0; } @@ -759,8 +844,8 @@ extern "C" int bind_callable_to_runtime_impl( * * This function: * 1. Copies recorded tensors from device back to host - * 2. Frees device memory for recorded tensors - * 3. Clears tensor pair state + * 2. Releases recorded tensor leases + * 3. Clears tensor lease state * * @param runtime Pointer to Runtime * @return 0 on success, -1 on failure @@ -780,10 +865,10 @@ extern "C" int validate_runtime_impl(Runtime *runtime, const HostApi *api) { LOG_INFO_V0("=== Copying Results Back to Host ==="); // Copy all recorded tensors from device back to host - TensorPair *tensor_pairs = runtime->tensor_pairs_.data(); - int tensor_pair_count = static_cast<int>(runtime->tensor_pairs_.size()); + TensorLease *tensor_leases = runtime->tensor_leases_.data(); + int tensor_lease_count = static_cast<int>(runtime->tensor_leases_.size()); - LOG_INFO_V0("Tensor pairs to process: %d", tensor_pair_count); + LOG_INFO_V0("Tensor leases to process: %d", tensor_lease_count); // PTO2 (device orchestration): graph output may be in packed buffer uint64_t graph_out_ptr = 0; @@ -832,31 +917,31 @@ extern "C" int validate_runtime_impl(Runtime *runtime, const HostApi *api) { LOG_WARN("Skipping tensor copy-back because PTO2 runtime reported fatal status"); } else { bool first_output_tensor = true; - for (int i = 0; i < tensor_pair_count; i++) { - const TensorPair &pair = tensor_pairs[i]; + for (int i = 0; i < tensor_lease_count; i++) { + const TensorLease &lease = tensor_leases[i]; // Skip if device pointer is null - if (pair.dev_ptr == nullptr) { + if (lease.dev_ptr == nullptr) { 
LOG_WARN("Tensor %d has null device pointer, skipping", i); continue; } // If host pointer is null, this is a device-only allocation (no copy-back) - if (pair.host_ptr == nullptr) { + if (lease.host_ptr == nullptr) { LOG_INFO_V0("Tensor %d: device-only allocation (no copy-back)", i); continue; } // Read-only INPUT tensors were uploaded H2D but the kernel never // wrote them — copying them back (potentially ~GB) is pure waste. - // They are still device_free'd in the cleanup loop below. - if (!pair.needs_copy_back) { + // They are still released through release_kind below. + if (!lease.needs_copy_back) { LOG_INFO_V0("Tensor %d: read-only input, skipping copy-back", i); continue; } - void *src_ptr = pair.dev_ptr; - size_t copy_size = pair.size; + void *src_ptr = lease.dev_ptr; + size_t copy_size = lease.size; // Use graph_output_ptr for the first output tensor if available if (first_output_tensor && graph_out_ptr != 0 && graph_out_size > 0) { @@ -866,24 +951,20 @@ extern "C" int validate_runtime_impl(Runtime *runtime, const HostApi *api) { first_output_tensor = false; } - int copy_rc = api->copy_from_device(pair.host_ptr, src_ptr, copy_size); + int copy_rc = api->copy_from_device(lease.host_ptr, src_ptr, copy_size); if (copy_rc != 0) { LOG_ERROR("Failed to copy tensor %d from device: %d", i, copy_rc); rc = copy_rc; } else { - LOG_INFO_V0("Tensor %d: %zu bytes copied to host", i, pair.size); + LOG_INFO_V0("Tensor %d: %zu bytes copied to host", i, lease.size); } } } // Cleanup device tensors LOG_INFO_V0("=== Cleaning Up ==="); - for (int i = 0; i < tensor_pair_count; i++) { - if (tensor_pairs[i].dev_ptr != nullptr) { - api->device_free(tensor_pairs[i].dev_ptr); - } - } - LOG_INFO_V0("Freed %d device allocations", tensor_pair_count); + release_tensor_leases(runtime, api); + end_temporary_buffer_run_if_active(api, runtime->temporary_buffer_run_active_); // Clear the per-run dispatch-table entries staged by register_callable_impl. 
// The underlying chip-callable device buffer is pool-managed by @@ -900,9 +981,6 @@ extern "C" int validate_runtime_impl(Runtime *runtime, const HostApi *api) { } runtime->clear_registered_kernels(); - // Clear tensor pairs - runtime->tensor_pairs_.clear(); - LOG_INFO_V0("=== Finalize Complete ==="); if (rc == 0 && runtime_status != 0) { diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h index 036d69b22..97442cff3 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h @@ -19,7 +19,7 @@ * defined in tensor.h. * * This header is independent of orch_build_graph_runtime.h to allow inclusion from runtime.h - * without type conflicts (Handshake, TensorPair, HostApi). + * without type conflicts (Handshake, TensorLease, HostApi). */ #ifndef SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_TYPES_H_ diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.h index 8a41434de..8dcdc51cb 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.h +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.h @@ -103,11 +103,16 @@ struct Handshake { volatile uint32_t aicore_regs_ready; // AICore ID reported: 0=pending, 1=done } __attribute__((aligned(64))); +enum class TensorReleaseKind { + Free, + BufferNoop, + ExternalNoop, +}; + /** - * Tensor pair for tracking host-device memory mappings. - * Used for copy-back during finalize. + * Tensor lease for tracking host-device memory mappings and release ownership. */ -struct TensorPair { +struct TensorLease { void *host_ptr; void *dev_ptr; size_t size; @@ -115,6 +120,7 @@ struct TensorPair { // so the end-of-run D2H copy-back is skipped. OUTPUT/INOUT/unknown // keep the safe default of copying back. 
bool needs_copy_back = true; + TensorReleaseKind release_kind = TensorReleaseKind::Free; }; /** @@ -315,7 +321,8 @@ class Runtime { // Host-side tensor ledger for D2H copy-back at finalize. Populated by // runtime_maker.cpp from orch_args at bind time, then iterated in // validate_runtime_impl. Host-only (after `dev`): never uploaded. - std::vector<TensorPair> tensor_pairs_; + std::vector<TensorLease> tensor_leases_; + bool temporary_buffer_run_active_ = false; }; // `dev` must be the first member so the narrowed H2D copy starts at offset 0. diff --git a/src/a5/platform/sim/host/device_runner.cpp b/src/a5/platform/sim/host/device_runner.cpp index 81ea8ea14..be935397b 100644 --- a/src/a5/platform/sim/host/device_runner.cpp +++ b/src/a5/platform/sim/host/device_runner.cpp @@ -652,6 +652,7 @@ int DeviceRunner::finalize() { gm_heap_arena_.release(); gm_sm_arena_.release(); runtime_arena_pool_.release(); + clear_temporary_buffer(); cached_gm_heap_size_ = 0; cached_gm_sm_size_ = 0; cached_runtime_arena_size_ = 0; diff --git a/src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp b/src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp index 4b7205db8..34494a7ea 100644 --- a/src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp +++ b/src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp @@ -51,6 +51,7 @@ #include "common/strace.h" #include "common/unified_log.h" #include "host/platform_compile_info.h" +#include "host/raii_scope_guard.h" #include "utils/device_arena.h" #include "prepare_callable_common.h" @@ -267,6 +268,43 @@ static int32_t pto2_read_runtime_status(Runtime *runtime, const HostApi *api, PT return runtime_status_from_error_codes(orch_error_code, sched_error_code); } +static void release_tensor_leases(Runtime *runtime, const HostApi *api) { + int freed = 0; + int buffer_noop = 0; + int external_noop = 0; + for (TensorLease &lease : runtime->tensor_leases_) { + if (lease.dev_ptr == nullptr) { + continue; + } + switch (lease.release_kind) { + case 
TensorReleaseKind::Free: + api->device_free(lease.dev_ptr); + ++freed; + break; + case TensorReleaseKind::BufferNoop: + ++buffer_noop; + break; + case TensorReleaseKind::ExternalNoop: + ++external_noop; + break; + } + } + LOG_DEBUG("Released tensor leases: freed=%d buffer_noop=%d external_noop=%d", freed, buffer_noop, external_noop); + runtime->tensor_leases_.clear(); +} + +static void end_temporary_buffer_run_if_active(const HostApi *api, bool &active) { + if (!active) { + return; + } + if (api->end_temporary_buffer_run == nullptr) { + LOG_ERROR("Temporary buffer run is active but end_temporary_buffer_run is not wired"); + } else { + api->end_temporary_buffer_run(); + } + active = false; +} + /** * Stage the per-callable resources (kernel binaries + orchestration SO) into * the supplied runtime so a subsequent bind_callable_to_runtime_impl can use @@ -334,7 +372,7 @@ struct ArenaStaticSizes { }; // Device pointers to the per-Worker static pools that DeviceRunner keeps alive -// across runs (freed in DeviceRunner::finalize(), never in tensor_pairs_). +// across runs (freed in DeviceRunner::finalize(), never in tensor_leases_). struct StaticArenaPtrs { void *gm_heap; void *gm_sm; @@ -416,11 +454,11 @@ static bool derive_arena_static_sizes(const ArenaSizingConfig &sizing, ArenaStat // host tensor pointer with a freshly staged device pointer (H2D copy-in, or an // on-device zero for pure-OUTPUT buffers), and record the host/device pair for // copy-back. Read-only INPUT tensors skip copy-back. On failure the partially -// staged device_args / tensor_pairs_ stay owned by the caller's Runtime, which +// staged device_args / tensor_leases_ stay owned by the caller's Runtime, which // frees them in validate_runtime_impl. 
static bool stage_device_args( Runtime *runtime, const HostApi *api, const ChipStorageTaskArgs *orch_args, const ArgDirection *signature, - int sig_count, ChipStorageTaskArgs *out + int sig_count, bool use_temporary_buffer, size_t temporary_buffer_budget, ChipStorageTaskArgs *out ) { int tensor_count = orch_args->tensor_count(); int scalar_count = orch_args->scalar_count(); @@ -439,7 +477,21 @@ static bool stage_device_args( void *host_ptr = reinterpret_cast<void *>(static_cast<uintptr_t>(t.buffer.addr)); size_t size = static_cast<size_t>(t.nbytes()); - void *dev_ptr = api->device_malloc(size); + void *dev_ptr = nullptr; + TensorReleaseKind release_kind = TensorReleaseKind::Free; + if (use_temporary_buffer) { + dev_ptr = api->acquire_temporary_buffer_slice(size, DeviceArena::kDefaultBaseAlign); + release_kind = TensorReleaseKind::BufferNoop; + if (dev_ptr == nullptr) { + LOG_ERROR( + "Temporary buffer acquire failed for tensor %d: tensor bytes=%zu configured bytes=%zu", i, size, + temporary_buffer_budget + ); + return false; + } + } else { + dev_ptr = api->device_malloc(size); + } if (dev_ptr == nullptr) { LOG_ERROR("Failed to allocate device memory for tensor %d", i); return false; @@ -460,7 +512,9 @@ static bool stage_device_args( } if (rc != 0) { LOG_ERROR("Failed to stage tensor %d to device", i); - api->device_free(dev_ptr); + if (release_kind == TensorReleaseKind::Free) { + api->device_free(dev_ptr); + } return false; } // Read-only INPUT tensors are never written by the kernel, so there is @@ -470,7 +524,7 @@ static bool stage_device_args( // tensor entries). Anything not provably IN keeps the safe default of // copying back. 
bool needs_copy_back = !(signature != nullptr && i < sig_count && signature[i] == ArgDirection::IN); - runtime->tensor_pairs_.push_back({host_ptr, dev_ptr, size, needs_copy_back}); + runtime->tensor_leases_.push_back({host_ptr, dev_ptr, size, needs_copy_back, release_kind}); LOG_INFO_V0(" Tensor %d: %zu bytes at %p", i, size, dev_ptr); t.buffer.addr = reinterpret_cast(dev_ptr); @@ -697,6 +751,8 @@ extern "C" int bind_callable_to_runtime_impl( int tensor_count = orch_args->tensor_count(); int scalar_count = orch_args->scalar_count(); LOG_INFO_V0("RT2 bind: %d tensors + %d scalars, device orchestration mode", tensor_count, scalar_count); + runtime->tensor_leases_.clear(); + runtime->temporary_buffer_run_active_ = false; int64_t t_total_start = _now_ms(); @@ -705,8 +761,35 @@ extern "C" int bind_callable_to_runtime_impl( return -1; } + size_t temporary_buffer_budget = api->temporary_buffer_budget == nullptr ? 0 : api->temporary_buffer_budget(); + bool use_temporary_buffer = temporary_buffer_budget > 0; + if (use_temporary_buffer && (api->begin_temporary_buffer_run == nullptr || + api->acquire_temporary_buffer_slice == nullptr || + api->end_temporary_buffer_run == nullptr)) { + LOG_ERROR("Temporary buffer budget is configured but HostApi temporary-buffer callbacks are not wired"); + return -1; + } + + bool temp_run_active = false; + if (use_temporary_buffer) { + if (!api->begin_temporary_buffer_run()) { + LOG_ERROR("Failed to begin temporary buffer run"); + return -1; + } + temp_run_active = true; + runtime->temporary_buffer_run_active_ = true; + } + + auto bind_cleanup = RAIIScopeGuard([&]() { + release_tensor_leases(runtime, api); + end_temporary_buffer_run_if_active(api, temp_run_active); + runtime->temporary_buffer_run_active_ = false; + }); + ChipStorageTaskArgs device_args; - if (!stage_device_args(runtime, api, orch_args, signature, sig_count, &device_args)) { + if (!stage_device_args( + runtime, api, orch_args, signature, sig_count, use_temporary_buffer, 
temporary_buffer_budget, &device_args + )) { return -1; } @@ -751,6 +834,8 @@ extern "C" int bind_callable_to_runtime_impl( LOG_INFO_V0("TIMING: prebuilt_runtime_arena = %" PRId64 "ms", t_prebuilt_end - t_prebuilt_start); LOG_INFO_V0("TIMING: total_init_runtime_impl = %" PRId64 "ms", t_total_end - t_total_start); + runtime->temporary_buffer_run_active_ = temp_run_active; + bind_cleanup.dismiss(); return 0; } @@ -759,8 +844,8 @@ extern "C" int bind_callable_to_runtime_impl( * * This function: * 1. Copies recorded tensors from device back to host - * 2. Frees device memory for recorded tensors - * 3. Clears tensor pair state + * 2. Releases recorded tensor leases + * 3. Clears tensor lease state * * @param runtime Pointer to Runtime * @return 0 on success, -1 on failure @@ -780,10 +865,10 @@ extern "C" int validate_runtime_impl(Runtime *runtime, const HostApi *api) { LOG_INFO_V0("=== Copying Results Back to Host ==="); // Copy all recorded tensors from device back to host - TensorPair *tensor_pairs = runtime->tensor_pairs_.data(); - int tensor_pair_count = static_cast<int>(runtime->tensor_pairs_.size()); + TensorLease *tensor_leases = runtime->tensor_leases_.data(); + int tensor_lease_count = static_cast<int>(runtime->tensor_leases_.size()); - LOG_INFO_V0("Tensor pairs to process: %d", tensor_pair_count); + LOG_INFO_V0("Tensor leases to process: %d", tensor_lease_count); // PTO2 (device orchestration): graph output may be in packed buffer uint64_t graph_out_ptr = 0; @@ -832,31 +917,31 @@ extern "C" int validate_runtime_impl(Runtime *runtime, const HostApi *api) { LOG_WARN("Skipping tensor copy-back because PTO2 runtime reported fatal status"); } else { bool first_output_tensor = true; - for (int i = 0; i < tensor_pair_count; i++) { - const TensorPair &pair = tensor_pairs[i]; + for (int i = 0; i < tensor_lease_count; i++) { + const TensorLease &lease = tensor_leases[i]; // Skip if device pointer is null - if (pair.dev_ptr == nullptr) { + if (lease.dev_ptr == nullptr) { 
LOG_WARN("Tensor %d has null device pointer, skipping", i); continue; } // If host pointer is null, this is a device-only allocation (no copy-back) - if (pair.host_ptr == nullptr) { + if (lease.host_ptr == nullptr) { LOG_INFO_V0("Tensor %d: device-only allocation (no copy-back)", i); continue; } // Read-only INPUT tensors were uploaded H2D but the kernel never // wrote them — copying them back (potentially ~GB) is pure waste. - // They are still device_free'd in the cleanup loop below. - if (!pair.needs_copy_back) { + // They are still released through release_kind below. + if (!lease.needs_copy_back) { LOG_INFO_V0("Tensor %d: read-only input, skipping copy-back", i); continue; } - void *src_ptr = pair.dev_ptr; - size_t copy_size = pair.size; + void *src_ptr = lease.dev_ptr; + size_t copy_size = lease.size; // Use graph_output_ptr for the first output tensor if available if (first_output_tensor && graph_out_ptr != 0 && graph_out_size > 0) { @@ -866,24 +951,20 @@ extern "C" int validate_runtime_impl(Runtime *runtime, const HostApi *api) { first_output_tensor = false; } - int copy_rc = api->copy_from_device(pair.host_ptr, src_ptr, copy_size); + int copy_rc = api->copy_from_device(lease.host_ptr, src_ptr, copy_size); if (copy_rc != 0) { LOG_ERROR("Failed to copy tensor %d from device: %d", i, copy_rc); rc = copy_rc; } else { - LOG_INFO_V0("Tensor %d: %zu bytes copied to host", i, pair.size); + LOG_INFO_V0("Tensor %d: %zu bytes copied to host", i, lease.size); } } } // Cleanup device tensors LOG_INFO_V0("=== Cleaning Up ==="); - for (int i = 0; i < tensor_pair_count; i++) { - if (tensor_pairs[i].dev_ptr != nullptr) { - api->device_free(tensor_pairs[i].dev_ptr); - } - } - LOG_INFO_V0("Freed %d device allocations", tensor_pair_count); + release_tensor_leases(runtime, api); + end_temporary_buffer_run_if_active(api, runtime->temporary_buffer_run_active_); // Clear the per-run dispatch-table entries staged by register_callable_impl. 
// The underlying chip-callable device buffer is pool-managed by @@ -900,9 +981,6 @@ extern "C" int validate_runtime_impl(Runtime *runtime, const HostApi *api) { } runtime->clear_registered_kernels(); - // Clear tensor pairs - runtime->tensor_pairs_.clear(); - LOG_INFO_V0("=== Finalize Complete ==="); if (rc == 0 && runtime_status != 0) { diff --git a/src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_types.h b/src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_types.h index 821d6ce3a..391d6c96b 100644 --- a/src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_types.h +++ b/src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_types.h @@ -19,7 +19,7 @@ * defined in tensor.h. * * This header is independent of orch_build_graph_runtime.h to allow inclusion from runtime.h - * without type conflicts (Handshake, TensorPair, HostApi). + * without type conflicts (Handshake, TensorLease, HostApi). */ #ifndef SRC_A5_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_TYPES_H_ diff --git a/src/a5/runtime/tensormap_and_ringbuffer/runtime/runtime.h b/src/a5/runtime/tensormap_and_ringbuffer/runtime/runtime.h index 694c3e0f4..17655ab1d 100644 --- a/src/a5/runtime/tensormap_and_ringbuffer/runtime/runtime.h +++ b/src/a5/runtime/tensormap_and_ringbuffer/runtime/runtime.h @@ -111,11 +111,16 @@ struct Handshake { volatile uint32_t aicore_regs_ready; // AICore ID reported: 0=pending, 1=done } __attribute__((aligned(64))); +enum class TensorReleaseKind { + Free, + BufferNoop, + ExternalNoop, +}; + /** - * Tensor pair for tracking host-device memory mappings. - * Used for copy-back during finalize. + * Tensor lease for tracking host-device memory mappings and release ownership. */ -struct TensorPair { +struct TensorLease { void *host_ptr; void *dev_ptr; size_t size; @@ -123,6 +128,7 @@ struct TensorPair { // so the end-of-run D2H copy-back is skipped. OUTPUT/INOUT/unknown // keep the safe default of copying back. 
bool needs_copy_back = true; + TensorReleaseKind release_kind = TensorReleaseKind::Free; }; /** @@ -329,7 +335,8 @@ class Runtime { // Host-side tensor ledger for D2H copy-back at finalize. Populated by // runtime_maker.cpp from orch_args at bind time, then iterated in // validate_runtime_impl. Host-only (after `dev`): never uploaded. - std::vector<TensorPair> tensor_pairs_; + std::vector<TensorLease> tensor_leases_; + bool temporary_buffer_run_active_ = false; }; // `dev` must be the first member so the narrowed H2D copy starts at offset 0. diff --git a/src/common/platform/include/common/host_api.h b/src/common/platform/include/common/host_api.h index a6ed89430..65051d4e8 100644 --- a/src/common/platform/include/common/host_api.h +++ b/src/common/platform/include/common/host_api.h @@ -34,6 +34,13 @@ struct HostApi { // null on backends that don't wire it; callers must fall back to // copy_to_device. int (*device_memset)(void *dev_ptr, int value, size_t size); + // Runner-scoped temporary variable buffer. A zero budget disables the + // optimization. Only trb bind consumes these callbacks; public + // device_malloc/device_free keep real allocation semantics. + size_t (*temporary_buffer_budget)(); + bool (*begin_temporary_buffer_run)(); + void *(*acquire_temporary_buffer_slice)(size_t size, size_t alignment); + void (*end_temporary_buffer_run)(); // Commit the three per-Worker pooled regions (PTO2 GM heap, PTO2 shared // memory, trb prebuilt runtime arena) as three independent device // allocations. `runtime_arena_size == 0` skips the third region (hbg @@ -44,7 +51,7 @@ struct HostApi { // memory / prebuilt runtime arena. setup_static_arena must have already // committed the relevant region; the returned pointer is owned by the // DeviceRunner and freed in `DeviceRunner::finalize()` — do NOT pass it - // to device_free or record it in `tensor_pairs_`. + // to device_free or record it as an owned tensor lease. 
// // acquire_pooled_runtime_arena is trb-only — the runtime-arena region is // only committed when setup_static_arena was invoked with diff --git a/src/common/platform/include/host/temporary_variable_buffer.h b/src/common/platform/include/host/temporary_variable_buffer.h new file mode 100644 index 000000000..0dc1312b0 --- /dev/null +++ b/src/common/platform/include/host/temporary_variable_buffer.h @@ -0,0 +1,289 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ * ----------------------------------------------------------------------------------------------------------- + */ +#ifndef SRC_COMMON_PLATFORM_INCLUDE_HOST_TEMPORARY_VARIABLE_BUFFER_H_ +#define SRC_COMMON_PLATFORM_INCLUDE_HOST_TEMPORARY_VARIABLE_BUFFER_H_ + +#include <algorithm> +#include <cstddef> +#include <cstdint> +#include <limits> +#include <string> +#include <utility> +#include <vector> + +class TemporaryVariableBuffer { +public: + using AllocFn = void *(*)(void *ctx, size_t size); + using FreeFn = void (*)(void *ctx, void *ptr); + + static constexpr size_t kDefaultAlignment = 1024; + + struct Stats { + size_t configured_budget_bytes{0}; + size_t retained_chunk_count{0}; + size_t retained_chunk_bytes{0}; + size_t current_run_used_bytes{0}; + size_t high_water_used_bytes{0}; + size_t buffer_backed_allocation_count{0}; + size_t budget_exceeded_count{0}; + bool active{false}; + }; + + TemporaryVariableBuffer(AllocFn alloc, FreeFn free_fn, void *ctx) : + alloc_(alloc), + free_(free_fn), + ctx_(ctx) {} + + ~TemporaryVariableBuffer() { clear(); } + + TemporaryVariableBuffer(const TemporaryVariableBuffer &) = delete; + TemporaryVariableBuffer &operator=(const TemporaryVariableBuffer &) = delete; + + bool configure(size_t max_temporary_buffer_bytes); + bool begin_run(); + void *acquire(size_t bytes, size_t alignment); + void end_run(); + void clear(); + + bool enabled() const { return max_temporary_buffer_bytes_ > 0; } + bool active() const { return active_; } + size_t budget() const { return max_temporary_buffer_bytes_; } + Stats stats() const; + const std::string &last_error() const { return last_error_; } + +private: + struct Chunk { + void *raw_base{nullptr}; + void *base{nullptr}; + size_t capacity{0}; + size_t raw_size{0}; + size_t offset{0}; + }; + + static bool is_power_of_two(size_t value) { return value != 0 && (value & (value - 1)) == 0; } + + static size_t align_up(size_t value, size_t alignment) { return (value + alignment - 1) & ~(alignment - 1); } + + 
static void *align_ptr(void *ptr, size_t alignment) { + const 
uintptr_t raw = reinterpret_cast<uintptr_t>(ptr); + return reinterpret_cast<void *>((raw + alignment - 1) & ~(static_cast<uintptr_t>(alignment) - 1)); + } + + bool allocate_chunks(size_t budget); + bool allocate_chunk(size_t capacity, Chunk *out); + void release_chunks(); + void set_error(std::string msg) { last_error_ = std::move(msg); } + + AllocFn alloc_{nullptr}; + FreeFn free_{nullptr}; + void *ctx_{nullptr}; + + std::vector<Chunk> chunks_; + size_t max_temporary_buffer_bytes_{0}; + size_t retained_chunk_bytes_{0}; + size_t current_run_used_bytes_{0}; + size_t high_water_used_bytes_{0}; + size_t buffer_backed_allocation_count_{0}; + size_t budget_exceeded_count_{0}; + bool active_{false}; + std::string last_error_; +}; + +inline bool TemporaryVariableBuffer::configure(size_t max_temporary_buffer_bytes) { + if (active_) { + set_error("cannot reconfigure temporary buffer while a run is active"); + return false; + } + if (max_temporary_buffer_bytes == max_temporary_buffer_bytes_ && + (max_temporary_buffer_bytes == 0 || !chunks_.empty())) { + last_error_.clear(); + return true; + } + + clear(); + if (max_temporary_buffer_bytes == 0) { + return true; + } + + max_temporary_buffer_bytes_ = max_temporary_buffer_bytes; + if (!allocate_chunks(max_temporary_buffer_bytes)) { + std::string error = last_error_; + clear(); + last_error_ = std::move(error); + return false; + } + last_error_.clear(); + return true; +} + +inline bool TemporaryVariableBuffer::begin_run() { + if (active_) { + set_error("temporary buffer run is already active"); + return false; + } + if (max_temporary_buffer_bytes_ == 0) { + set_error("temporary buffer is disabled"); + return false; + } + if (chunks_.empty()) { + set_error("temporary buffer has no retained chunks"); + return false; + } + for (Chunk &chunk : chunks_) { + chunk.offset = 0; + } + current_run_used_bytes_ = 0; + active_ = true; + last_error_.clear(); + return true; +} + +inline void *TemporaryVariableBuffer::acquire(size_t 
bytes, size_t alignment) { + if (!active_) { + 
set_error("temporary buffer acquire requested outside an active run"); + return nullptr; + } + if (alignment == 0) { + alignment = 1; + } + if (!is_power_of_two(alignment)) { + set_error("temporary buffer alignment must be a power of two"); + return nullptr; + } + + size_t min_padding = std::numeric_limits::max(); + for (Chunk &chunk : chunks_) { + const size_t aligned_offset = align_up(chunk.offset, alignment); + if (aligned_offset < chunk.offset) { + continue; + } + min_padding = std::min(min_padding, aligned_offset - chunk.offset); + if (bytes > chunk.capacity || aligned_offset > chunk.capacity - bytes) { + continue; + } + void *ptr = static_cast(chunk.base) + aligned_offset; + current_run_used_bytes_ += (aligned_offset - chunk.offset) + bytes; + chunk.offset = aligned_offset + bytes; + ++buffer_backed_allocation_count_; + last_error_.clear(); + return ptr; + } + + ++budget_exceeded_count_; + if (min_padding == std::numeric_limits::max()) { + min_padding = 0; + } + size_t required_bytes = std::numeric_limits::max(); + if (current_run_used_bytes_ <= std::numeric_limits::max() - min_padding) { + const size_t used_with_padding = current_run_used_bytes_ + min_padding; + if (bytes <= std::numeric_limits::max() - used_with_padding) { + required_bytes = used_with_padding + bytes; + } + } + set_error( + "temporary buffer budget exceeded: required bytes " + std::to_string(required_bytes) + ", configured bytes " + + std::to_string(max_temporary_buffer_bytes_) + ); + return nullptr; +} + +inline void TemporaryVariableBuffer::end_run() { + if (!active_) { + return; + } + if (current_run_used_bytes_ > high_water_used_bytes_) { + high_water_used_bytes_ = current_run_used_bytes_; + } + active_ = false; +} + +inline void TemporaryVariableBuffer::clear() { + release_chunks(); + max_temporary_buffer_bytes_ = 0; + retained_chunk_bytes_ = 0; + current_run_used_bytes_ = 0; + high_water_used_bytes_ = 0; + buffer_backed_allocation_count_ = 0; + budget_exceeded_count_ = 0; + active_ = 
false; + last_error_.clear(); +} + +inline TemporaryVariableBuffer::Stats TemporaryVariableBuffer::stats() const { + return Stats{ + max_temporary_buffer_bytes_, chunks_.size(), + retained_chunk_bytes_, current_run_used_bytes_, + high_water_used_bytes_, buffer_backed_allocation_count_, + budget_exceeded_count_, active_, + }; +} + +inline bool TemporaryVariableBuffer::allocate_chunks(size_t budget) { + size_t remaining = budget; + size_t candidate = budget; + while (remaining > 0) { + if (candidate > remaining) { + candidate = remaining; + } + Chunk chunk; + if (allocate_chunk(candidate, &chunk)) { + retained_chunk_bytes_ += candidate; + chunks_.push_back(chunk); + remaining -= candidate; + candidate = remaining; + continue; + } + if (candidate <= 1) { + set_error( + "failed to allocate retained temporary-buffer chunks for configured bytes " + std::to_string(budget) + ); + release_chunks(); + retained_chunk_bytes_ = 0; + return false; + } + candidate = candidate / 2; + if (candidate == 0) { + candidate = 1; + } + } + return true; +} + +inline bool TemporaryVariableBuffer::allocate_chunk(size_t capacity, Chunk *out) { + if (alloc_ == nullptr || free_ == nullptr || out == nullptr) { + set_error("temporary buffer allocator callbacks are not configured"); + return false; + } + if (capacity > std::numeric_limits<size_t>::max() - (kDefaultAlignment - 1)) { + set_error("temporary buffer chunk size overflows size_t"); + return false; + } + const size_t raw_size = capacity + kDefaultAlignment - 1; + void *raw = alloc_(ctx_, raw_size); + if (raw == nullptr) { + return false; + } + *out = Chunk{raw, align_ptr(raw, kDefaultAlignment), capacity, raw_size, 0}; + return true; +} + +inline void TemporaryVariableBuffer::release_chunks() { + if (free_ != nullptr) { + for (Chunk &chunk : chunks_) { + if (chunk.raw_base != nullptr) { + free_(ctx_, chunk.raw_base); + } + } + } + chunks_.clear(); +} + +#endif // SRC_COMMON_PLATFORM_INCLUDE_HOST_TEMPORARY_VARIABLE_BUFFER_H_ diff --git 
a/src/common/platform/onboard/host/c_api_shared.cpp b/src/common/platform/onboard/host/c_api_shared.cpp index 5924c4184..b562e5848 100644 --- a/src/common/platform/onboard/host/c_api_shared.cpp +++ b/src/common/platform/onboard/host/c_api_shared.cpp @@ -230,6 +230,24 @@ int copy_from_device_ctx(DeviceContextHandle ctx, void *host_ptr, const void *de } } +int configure_temporary_buffer_ctx(DeviceContextHandle ctx, size_t max_temporary_buffer_bytes) { + if (ctx == NULL) return -1; + try { + return static_cast(ctx)->configure_temporary_buffer(max_temporary_buffer_bytes) ? 0 : -1; + } catch (...) { + return -1; + } +} + +size_t get_temporary_buffer_budget_ctx(DeviceContextHandle ctx) { + if (ctx == NULL) return 0; + try { + return static_cast(ctx)->temporary_buffer_budget(); + } catch (...) { + return 0; + } +} + int finalize_device(DeviceContextHandle ctx) { if (ctx == NULL) return -1; try { @@ -505,6 +523,10 @@ int simpler_run( api.copy_to_device = copy_to_device; api.copy_from_device = copy_from_device; api.device_memset = device_memset; + api.temporary_buffer_budget = temporary_buffer_budget; + api.begin_temporary_buffer_run = begin_temporary_buffer_run; + api.acquire_temporary_buffer_slice = acquire_temporary_buffer_slice; + api.end_temporary_buffer_run = end_temporary_buffer_run; api.setup_static_arena = setup_static_arena_wrapper; api.acquire_pooled_gm_heap = acquire_pooled_gm_heap_wrapper; api.acquire_pooled_gm_sm = acquire_pooled_gm_sm_wrapper; diff --git a/src/common/platform/onboard/host/device_runner_base.cpp b/src/common/platform/onboard/host/device_runner_base.cpp index 623402609..e2b1734e8 100644 --- a/src/common/platform/onboard/host/device_runner_base.cpp +++ b/src/common/platform/onboard/host/device_runner_base.cpp @@ -115,6 +115,7 @@ HostRuntimeTimeoutConfig resolve_onboard_timeout_config() { } // namespace DeviceRunnerBase::DeviceRunnerBase() : + temporary_buffer_(&arena_alloc_trampoline, &arena_free_trampoline, &mem_alloc_), 
gm_heap_arena_(&arena_alloc_trampoline, &arena_free_trampoline, &mem_alloc_), gm_sm_arena_(&arena_alloc_trampoline, &arena_free_trampoline, &mem_alloc_), runtime_arena_pool_(&arena_alloc_trampoline, &arena_free_trampoline, &mem_alloc_) {} @@ -139,6 +140,60 @@ int DeviceRunnerBase::device_memset(void *dev_ptr, int value, std::size_t bytes) return aclrtMemset(dev_ptr, bytes, value, bytes); } +bool DeviceRunnerBase::configure_temporary_buffer(std::size_t max_temporary_buffer_bytes) { + if (!temporary_buffer_.configure(max_temporary_buffer_bytes)) { + LOG_ERROR( + "configure_temporary_buffer(%zu) failed: %s", max_temporary_buffer_bytes, + temporary_buffer_.last_error().c_str() + ); + return false; + } + auto stats = temporary_buffer_.stats(); + LOG_DEBUG( + "Temporary buffer configured: budget=%zu retained_chunks=%zu retained_bytes=%zu", stats.configured_budget_bytes, + stats.retained_chunk_count, stats.retained_chunk_bytes + ); + return true; +} + +std::size_t DeviceRunnerBase::temporary_buffer_budget() const { return temporary_buffer_.budget(); } + +bool DeviceRunnerBase::begin_temporary_buffer_run() { + if (!temporary_buffer_.begin_run()) { + LOG_ERROR("begin_temporary_buffer_run failed: %s", temporary_buffer_.last_error().c_str()); + return false; + } + return true; +} + +void *DeviceRunnerBase::acquire_temporary_buffer_slice(std::size_t bytes, std::size_t alignment) { + void *ptr = temporary_buffer_.acquire(bytes, alignment); + if (ptr == nullptr) { + LOG_ERROR( + "acquire_temporary_buffer_slice failed: required bytes=%zu configured bytes=%zu: %s", bytes, + temporary_buffer_.budget(), temporary_buffer_.last_error().c_str() + ); + } + return ptr; +} + +void DeviceRunnerBase::end_temporary_buffer_run() { + temporary_buffer_.end_run(); + auto stats = temporary_buffer_.stats(); + LOG_DEBUG( + "Temporary buffer run ended: used=%zu high_water=%zu allocations=%zu budget_exceeded=%zu", + stats.current_run_used_bytes, stats.high_water_used_bytes, 
stats.buffer_backed_allocation_count, + stats.budget_exceeded_count + ); +} + +void DeviceRunnerBase::clear_temporary_buffer() { + if (temporary_buffer_.active()) { + LOG_ERROR("clear_temporary_buffer called while a temporary-buffer run is active"); + } + temporary_buffer_.clear(); +} + int DeviceRunnerBase::l3_l2_orch_comm_init(void *control_block, size_t control_block_size) { if (!l3_l2_orch_comm_supported()) { return PTO_RUNTIME_ERR_UNSUPPORTED; @@ -479,10 +534,8 @@ int DeviceRunnerBase::ensure_binaries_loaded() { } if (dispatcher_so_binary_.empty()) { - LOG_ERROR( - "DeviceRunner: dispatcher SO bytes not provided; pass dispatcher_path through ChipWorker.init " - "(RuntimeBinaries.dispatcher_path)" - ); + LOG_ERROR("DeviceRunner: dispatcher SO bytes not provided; pass dispatcher_path through ChipWorker.init " + "(RuntimeBinaries.dispatcher_path)"); return -1; } @@ -1036,6 +1089,8 @@ int DeviceRunnerBase::finalize_common() { prebuilt_runtime_arena_cache_runtime_arena_base_ = nullptr; prebuilt_runtime_arena_cache_image_.clear(); + clear_temporary_buffer(); + // Free the 8-byte device_wall buffer (allocated lazily in run()) while // mem_alloc_ and the device context are still live. 
free_tensor() routes // through mem_alloc_.free(), so it must run before mem_alloc_.finalize() diff --git a/src/common/platform/onboard/host/device_runner_base.h b/src/common/platform/onboard/host/device_runner_base.h index 8bf16a077..a1dd77153 100644 --- a/src/common/platform/onboard/host/device_runner_base.h +++ b/src/common/platform/onboard/host/device_runner_base.h @@ -62,6 +62,7 @@ #include "host/runtime_timeout_config.h" #include "host/scope_stats_collector.h" #include "host/tensor_dump_collector.h" +#include "host/temporary_variable_buffer.h" #include "prepare_callable_common.h" /** @@ -90,6 +91,12 @@ class DeviceRunnerBase : public L3L2OrchCommBackend { int copy_to_device(void *dev_ptr, const void *host_ptr, std::size_t bytes); int copy_from_device(void *host_ptr, const void *dev_ptr, std::size_t bytes); int device_memset(void *dev_ptr, int value, std::size_t bytes); + bool configure_temporary_buffer(std::size_t max_temporary_buffer_bytes); + std::size_t temporary_buffer_budget() const; + bool begin_temporary_buffer_run(); + void *acquire_temporary_buffer_slice(std::size_t bytes, std::size_t alignment); + void end_temporary_buffer_run(); + void clear_temporary_buffer(); int l3_l2_orch_comm_init(void *control_block, size_t control_block_size); int l3_l2_orch_comm_shutdown(); @@ -800,6 +807,7 @@ class DeviceRunnerBase : public L3L2OrchCommBackend { host::LoadAicpuOp load_aicpu_op_; MemoryAllocator mem_alloc_; + TemporaryVariableBuffer temporary_buffer_; DeviceArena gm_heap_arena_; DeviceArena gm_sm_arena_; DeviceArena runtime_arena_pool_; diff --git a/src/common/platform/sim/host/c_api_shared.cpp b/src/common/platform/sim/host/c_api_shared.cpp index c43b06f8f..1048b9df8 100644 --- a/src/common/platform/sim/host/c_api_shared.cpp +++ b/src/common/platform/sim/host/c_api_shared.cpp @@ -223,6 +223,24 @@ int copy_from_device_ctx(DeviceContextHandle ctx, void *host_ptr, const void *de } } +int configure_temporary_buffer_ctx(DeviceContextHandle ctx, size_t 
max_temporary_buffer_bytes) { + if (ctx == NULL) return -1; + try { + return static_cast(ctx)->configure_temporary_buffer(max_temporary_buffer_bytes) ? 0 : -1; + } catch (...) { + return -1; + } +} + +size_t get_temporary_buffer_budget_ctx(DeviceContextHandle ctx) { + if (ctx == NULL) return 0; + try { + return static_cast(ctx)->temporary_buffer_budget(); + } catch (...) { + return 0; + } +} + int finalize_device(DeviceContextHandle ctx) { if (ctx == NULL) return -1; try { @@ -456,6 +474,10 @@ int simpler_run( api.copy_to_device = copy_to_device; api.copy_from_device = copy_from_device; api.device_memset = device_memset; + api.temporary_buffer_budget = temporary_buffer_budget; + api.begin_temporary_buffer_run = begin_temporary_buffer_run; + api.acquire_temporary_buffer_slice = acquire_temporary_buffer_slice; + api.end_temporary_buffer_run = end_temporary_buffer_run; api.setup_static_arena = setup_static_arena_wrapper; api.acquire_pooled_gm_heap = acquire_pooled_gm_heap_wrapper; api.acquire_pooled_gm_sm = acquire_pooled_gm_sm_wrapper; diff --git a/src/common/platform/sim/host/device_runner_base.cpp b/src/common/platform/sim/host/device_runner_base.cpp index 547fe58b8..d55ae17af 100644 --- a/src/common/platform/sim/host/device_runner_base.cpp +++ b/src/common/platform/sim/host/device_runner_base.cpp @@ -262,6 +262,60 @@ int SimDeviceRunnerBase::device_memset(void *dev_ptr, int value, size_t bytes) { return 0; } +bool SimDeviceRunnerBase::configure_temporary_buffer(size_t max_temporary_buffer_bytes) { + if (!temporary_buffer_.configure(max_temporary_buffer_bytes)) { + LOG_ERROR( + "configure_temporary_buffer(%zu) failed: %s", max_temporary_buffer_bytes, + temporary_buffer_.last_error().c_str() + ); + return false; + } + auto stats = temporary_buffer_.stats(); + LOG_DEBUG( + "Temporary buffer configured: budget=%zu retained_chunks=%zu retained_bytes=%zu", stats.configured_budget_bytes, + stats.retained_chunk_count, stats.retained_chunk_bytes + ); + return true; +} + 
+size_t SimDeviceRunnerBase::temporary_buffer_budget() const { return temporary_buffer_.budget(); } + +bool SimDeviceRunnerBase::begin_temporary_buffer_run() { + if (!temporary_buffer_.begin_run()) { + LOG_ERROR("begin_temporary_buffer_run failed: %s", temporary_buffer_.last_error().c_str()); + return false; + } + return true; +} + +void *SimDeviceRunnerBase::acquire_temporary_buffer_slice(size_t bytes, size_t alignment) { + void *ptr = temporary_buffer_.acquire(bytes, alignment); + if (ptr == nullptr) { + LOG_ERROR( + "acquire_temporary_buffer_slice failed: required bytes=%zu configured bytes=%zu: %s", bytes, + temporary_buffer_.budget(), temporary_buffer_.last_error().c_str() + ); + } + return ptr; +} + +void SimDeviceRunnerBase::end_temporary_buffer_run() { + temporary_buffer_.end_run(); + auto stats = temporary_buffer_.stats(); + LOG_DEBUG( + "Temporary buffer run ended: used=%zu high_water=%zu allocations=%zu budget_exceeded=%zu", + stats.current_run_used_bytes, stats.high_water_used_bytes, stats.buffer_backed_allocation_count, + stats.budget_exceeded_count + ); +} + +void SimDeviceRunnerBase::clear_temporary_buffer() { + if (temporary_buffer_.active()) { + LOG_ERROR("clear_temporary_buffer called while a temporary-buffer run is active"); + } + temporary_buffer_.clear(); +} + int SimDeviceRunnerBase::l3_l2_orch_comm_init(void *control_block, size_t control_block_size) { return l3_l2_orch_comm_service_.start(this, control_block, control_block_size); } diff --git a/src/common/platform/sim/host/device_runner_base.h b/src/common/platform/sim/host/device_runner_base.h index a147bc015..a0a7ed550 100644 --- a/src/common/platform/sim/host/device_runner_base.h +++ b/src/common/platform/sim/host/device_runner_base.h @@ -51,11 +51,13 @@ #include "host/tensor_dump_collector.h" #include "host/pmu_collector.h" #include "host/scope_stats_collector.h" +#include "host/temporary_variable_buffer.h" #include "runtime.h" class SimDeviceRunnerBase : public L3L2OrchCommBackend { 
public: SimDeviceRunnerBase() : + temporary_buffer_(&arena_alloc_trampoline, &arena_free_trampoline, &mem_alloc_), gm_heap_arena_(&arena_alloc_trampoline, &arena_free_trampoline, &mem_alloc_), gm_sm_arena_(&arena_alloc_trampoline, &arena_free_trampoline, &mem_alloc_), runtime_arena_pool_(&arena_alloc_trampoline, &arena_free_trampoline, &mem_alloc_) {} @@ -94,6 +96,12 @@ class SimDeviceRunnerBase : public L3L2OrchCommBackend { int copy_to_device(void *dev_ptr, const void *host_ptr, size_t bytes); int copy_from_device(void *host_ptr, const void *dev_ptr, size_t bytes); int device_memset(void *dev_ptr, int value, size_t bytes); + bool configure_temporary_buffer(size_t max_temporary_buffer_bytes); + size_t temporary_buffer_budget() const; + bool begin_temporary_buffer_run(); + void *acquire_temporary_buffer_slice(size_t bytes, size_t alignment); + void end_temporary_buffer_run(); + void clear_temporary_buffer(); int l3_l2_orch_comm_init(void *control_block, size_t control_block_size); int l3_l2_orch_comm_shutdown(); @@ -185,6 +193,7 @@ class SimDeviceRunnerBase : public L3L2OrchCommBackend { std::vector aicore_kernel_binary_; MemoryAllocator mem_alloc_; + TemporaryVariableBuffer temporary_buffer_; // Three independent per-Worker arenas, each backing a single pooled // region (PTO2 GM heap / PTO2 shared memory / trb prebuilt runtime diff --git a/src/common/worker/chip_worker.cpp b/src/common/worker/chip_worker.cpp index 5a51d5a48..9cc3db3eb 100644 --- a/src/common/worker/chip_worker.cpp +++ b/src/common/worker/chip_worker.cpp @@ -101,6 +101,10 @@ void ChipWorker::init( device_free_ctx_fn_ = load_symbol(handle, "device_free_ctx"); copy_to_device_ctx_fn_ = load_symbol(handle, "copy_to_device_ctx"); copy_from_device_ctx_fn_ = load_symbol(handle, "copy_from_device_ctx"); + configure_temporary_buffer_ctx_fn_ = + load_symbol(handle, "configure_temporary_buffer_ctx"); + get_temporary_buffer_budget_ctx_fn_ = + load_symbol(handle, "get_temporary_buffer_budget_ctx"); 
get_runtime_size_fn_ = load_symbol(handle, "get_runtime_size"); simpler_init_fn_ = load_symbol(handle, "simpler_init"); register_callable_fn_ = load_symbol(handle, "simpler_register_callable"); @@ -184,6 +188,8 @@ void ChipWorker::init( device_free_ctx_fn_ = nullptr; copy_to_device_ctx_fn_ = nullptr; copy_from_device_ctx_fn_ = nullptr; + configure_temporary_buffer_ctx_fn_ = nullptr; + get_temporary_buffer_budget_ctx_fn_ = nullptr; get_runtime_size_fn_ = nullptr; simpler_init_fn_ = nullptr; register_callable_fn_ = nullptr; @@ -224,6 +230,8 @@ void ChipWorker::init( device_free_ctx_fn_ = nullptr; copy_to_device_ctx_fn_ = nullptr; copy_from_device_ctx_fn_ = nullptr; + configure_temporary_buffer_ctx_fn_ = nullptr; + get_temporary_buffer_budget_ctx_fn_ = nullptr; get_runtime_size_fn_ = nullptr; simpler_init_fn_ = nullptr; register_callable_fn_ = nullptr; @@ -279,6 +287,8 @@ void ChipWorker::finalize() { device_free_ctx_fn_ = nullptr; copy_to_device_ctx_fn_ = nullptr; copy_from_device_ctx_fn_ = nullptr; + configure_temporary_buffer_ctx_fn_ = nullptr; + get_temporary_buffer_budget_ctx_fn_ = nullptr; get_runtime_size_fn_ = nullptr; register_callable_fn_ = nullptr; run_fn_ = nullptr; @@ -368,6 +378,23 @@ size_t ChipWorker::host_dlopen_count() const { return get_host_dlopen_count_fn_(device_ctx_); } +void ChipWorker::configure_temporary_buffer(size_t max_temporary_buffer_bytes) { + if (!initialized_) { + throw std::runtime_error("ChipWorker not initialized; call init() first"); + } + int rc = configure_temporary_buffer_ctx_fn_(device_ctx_, max_temporary_buffer_bytes); + if (rc != 0) { + throw std::runtime_error("configure_temporary_buffer failed with code " + std::to_string(rc)); + } +} + +size_t ChipWorker::temporary_buffer_budget() const { + if (!initialized_) { + return 0; + } + return get_temporary_buffer_budget_ctx_fn_(device_ctx_); +} + void *ChipWorker::create_comm_stream_checked(const char *op_name) { int rc = ensure_acl_ready_fn_(device_ctx_, device_id_); if (rc != 
0) { @@ -679,8 +706,8 @@ void ChipWorker::comm_destroy(uint64_t comm_handle) { } int rc = destroy_comm_session(*session); - while (!comm_sessions_.empty() && comm_sessions_.back().handle == nullptr && - comm_sessions_.back().stream == nullptr) { + while (!comm_sessions_.empty() && comm_sessions_.back().handle == nullptr && comm_sessions_.back().stream == nullptr + ) { comm_sessions_.pop_back(); } diff --git a/src/common/worker/chip_worker.h b/src/common/worker/chip_worker.h index 086e5b6b6..cc3ca0a29 100644 --- a/src/common/worker/chip_worker.h +++ b/src/common/worker/chip_worker.h @@ -82,6 +82,8 @@ class ChipWorker { void free(uint64_t ptr); void copy_to(uint64_t dst, uint64_t src, size_t size); void copy_from(uint64_t dst, uint64_t src, size_t size); + void configure_temporary_buffer(size_t max_temporary_buffer_bytes); + size_t temporary_buffer_budget() const; void l3_l2_orch_comm_init(uint64_t control_block_addr, size_t control_block_size); void l3_l2_orch_comm_shutdown(); @@ -138,6 +140,8 @@ class ChipWorker { using DeviceFreeCtxFn = void (*)(void *, void *); using CopyToDeviceCtxFn = int (*)(void *, void *, const void *, size_t); using CopyFromDeviceCtxFn = int (*)(void *, void *, const void *, size_t); + using ConfigureTemporaryBufferCtxFn = int (*)(void *, size_t); + using GetTemporaryBufferBudgetCtxFn = size_t (*)(void *); using GetRuntimeSizeFn = size_t (*)(); // From host_runtime.so. 
Single platform-side init that does (a) thread // attach + device-id record, (b) executor binary takeover, (c) onboard @@ -192,6 +196,8 @@ class ChipWorker { DeviceFreeCtxFn device_free_ctx_fn_ = nullptr; CopyToDeviceCtxFn copy_to_device_ctx_fn_ = nullptr; CopyFromDeviceCtxFn copy_from_device_ctx_fn_ = nullptr; + ConfigureTemporaryBufferCtxFn configure_temporary_buffer_ctx_fn_ = nullptr; + GetTemporaryBufferBudgetCtxFn get_temporary_buffer_budget_ctx_fn_ = nullptr; GetRuntimeSizeFn get_runtime_size_fn_ = nullptr; SimplerInitFn simpler_init_fn_ = nullptr; SimplerRegisterCallableFn register_callable_fn_ = nullptr; diff --git a/src/common/worker/pto_runtime_c_api.h b/src/common/worker/pto_runtime_c_api.h index 380827084..74e927216 100644 --- a/src/common/worker/pto_runtime_c_api.h +++ b/src/common/worker/pto_runtime_c_api.h @@ -24,6 +24,8 @@ * - sizing: get_runtime_size * - device-mem: device_malloc_ctx, device_free_ctx, * copy_to_device_ctx, copy_from_device_ctx + * - temp-buffer: configure_temporary_buffer_ctx, + * get_temporary_buffer_budget_ctx * - prepared run: simpler_register_callable, simpler_run, unregister_callable, * get_aicpu_dlopen_count, get_host_dlopen_count * - L3-L2 orch: l3_l2_orch_comm_init_ctx, @@ -91,6 +93,12 @@ int copy_to_device_ctx(DeviceContextHandle ctx, void *dev_ptr, const void *host_ /** Copy device memory to a host pointer within the given device context. */ int copy_from_device_ctx(DeviceContextHandle ctx, void *host_ptr, const void *dev_ptr, size_t size); +/** Configure the runner-scoped temporary variable buffer. Zero disables it. */ +int configure_temporary_buffer_ctx(DeviceContextHandle ctx, size_t max_temporary_buffer_bytes); + +/** Return the configured temporary-buffer budget, or 0 when disabled. */ +size_t get_temporary_buffer_budget_ctx(DeviceContextHandle ctx); + /** * One-shot platform-side init. Called once by ChipWorker::init() right * after dlopen, before any other entry. 
Three responsibilities, in order: diff --git a/tests/ut/cpp/CMakeLists.txt b/tests/ut/cpp/CMakeLists.txt index 794673731..72dd96191 100644 --- a/tests/ut/cpp/CMakeLists.txt +++ b/tests/ut/cpp/CMakeLists.txt @@ -378,7 +378,37 @@ target_include_directories(test_runtime_orch_so PRIVATE ${CMAKE_SOURCE_DIR}/../../../src/a2a3/platform/include ) add_common_utils_test(test_device_arena common/test_device_arena.cpp) +add_common_utils_test(test_temporary_variable_buffer common/test_temporary_variable_buffer.cpp) add_common_utils_test(test_l3_l2_orch_comm common/test_l3_l2_orch_comm.cpp) + +add_executable(test_trb_runtime_temp_buffer + common/test_trb_runtime_temp_buffer.cpp + ${CMAKE_SOURCE_DIR}/../../../src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp + ${CMAKE_SOURCE_DIR}/../../../src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/runtime.cpp + ${CMAKE_SOURCE_DIR}/../../../src/common/platform/shared/host/platform_compile_info.cpp +) +target_compile_definitions(test_trb_runtime_temp_buffer PRIVATE SIMPLER_PLATFORM_NAME="a2a3sim") +target_include_directories(test_trb_runtime_temp_buffer PRIVATE + ${GTEST_INCLUDE_DIRS} + ${CMAKE_SOURCE_DIR}/../../../src/a2a3/runtime/tensormap_and_ringbuffer/host + ${CMAKE_SOURCE_DIR}/../../../src/a2a3/runtime/tensormap_and_ringbuffer/orchestration + ${CMAKE_SOURCE_DIR}/../../../src/a2a3/runtime/tensormap_and_ringbuffer/runtime + ${CMAKE_SOURCE_DIR}/../../../src/a2a3/runtime/tensormap_and_ringbuffer/common + ${CMAKE_SOURCE_DIR}/../../../src/a2a3/platform/include + ${CMAKE_SOURCE_DIR}/../../../src/common/platform/include + ${CMAKE_SOURCE_DIR}/../../../src/common/task_interface + ${CMAKE_SOURCE_DIR}/../../../src/common/log/include + ${CMAKE_SOURCE_DIR}/../../../src/common +) +target_link_libraries(test_trb_runtime_temp_buffer PRIVATE + a2a3_rt_objs + ${GTEST_MAIN_LIB} + ${GTEST_LIB} + pthread +) +add_test(NAME test_trb_runtime_temp_buffer COMMAND test_trb_runtime_temp_buffer) 
+set_tests_properties(test_trb_runtime_temp_buffer PROPERTIES LABELS "no_hardware") + add_executable(test_l3_l2_orch_endpoint common/test_l3_l2_orch_endpoint.cpp stubs/test_stubs.cpp diff --git a/tests/ut/cpp/common/test_temporary_variable_buffer.cpp b/tests/ut/cpp/common/test_temporary_variable_buffer.cpp new file mode 100644 index 000000000..8490d4d2c --- /dev/null +++ b/tests/ut/cpp/common/test_temporary_variable_buffer.cpp @@ -0,0 +1,143 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + * ----------------------------------------------------------------------------------------------------------- + */ +// Unit tests for host/temporary_variable_buffer.h. 
+ +#include <cstddef> +#include <cstdint> +#include <cstdlib> +#include <string> +#include <unordered_set> + +#include <gtest/gtest.h> + +#include "host/temporary_variable_buffer.h" + +namespace { + +struct MockBackend { + int alloc_count = 0; + int free_count = 0; + size_t max_alloc_size = 0; + std::unordered_set<void *> live; + + void *alloc(size_t size) { + if (max_alloc_size != 0 && size > max_alloc_size) { + return nullptr; + } + void *ptr = nullptr; + if (posix_memalign(&ptr, TemporaryVariableBuffer::kDefaultAlignment, size) != 0) { + return nullptr; + } + ++alloc_count; + live.insert(ptr); + return ptr; + } + + void free(void *ptr) { + ++free_count; + EXPECT_EQ(live.count(ptr), 1u) << "free called on a pointer that is not live"; + live.erase(ptr); + std::free(ptr); + } +}; + +void *mock_alloc(void *ctx, size_t size) { return static_cast<MockBackend *>(ctx)->alloc(size); } +void mock_free(void *ctx, void *ptr) { static_cast<MockBackend *>(ctx)->free(ptr); } + +bool is_aligned(const void *ptr, size_t alignment) { return (reinterpret_cast<uintptr_t>(ptr) & (alignment - 1)) == 0; } + +} // namespace + +TEST(TemporaryVariableBufferTest, BeginAcquireEndReusesRetainedMemory) { + MockBackend backend; + TemporaryVariableBuffer buffer(mock_alloc, mock_free, &backend); + + ASSERT_TRUE(buffer.configure(4096)) << buffer.last_error(); + EXPECT_EQ(backend.alloc_count, 1); + ASSERT_TRUE(buffer.begin_run()) << buffer.last_error(); + void *first = buffer.acquire(512, TemporaryVariableBuffer::kDefaultAlignment); + ASSERT_NE(first, nullptr) << buffer.last_error(); + EXPECT_TRUE(is_aligned(first, TemporaryVariableBuffer::kDefaultAlignment)); + void *second = buffer.acquire(256, 256); + ASSERT_NE(second, nullptr) << buffer.last_error(); + EXPECT_TRUE(is_aligned(second, 256)); + buffer.end_run(); + + ASSERT_TRUE(buffer.begin_run()) << buffer.last_error(); + void *again = buffer.acquire(512, TemporaryVariableBuffer::kDefaultAlignment); + EXPECT_EQ(again, first); + buffer.end_run(); + + EXPECT_EQ(backend.alloc_count, 1); + EXPECT_EQ(backend.free_count, 0); +
EXPECT_EQ(buffer.stats().high_water_used_bytes, 768u); +} + +TEST(TemporaryVariableBufferTest, ConfiguredBudgetIsEnforcedWithClearError) { + MockBackend backend; + TemporaryVariableBuffer buffer(mock_alloc, mock_free, &backend); + + ASSERT_TRUE(buffer.configure(1024)) << buffer.last_error(); + ASSERT_TRUE(buffer.begin_run()) << buffer.last_error(); + ASSERT_NE(buffer.acquire(768, 256), nullptr) << buffer.last_error(); + EXPECT_EQ(buffer.acquire(512, 256), nullptr); + EXPECT_NE(buffer.last_error().find("required bytes 1280"), std::string::npos); + EXPECT_NE(buffer.last_error().find("configured bytes 1024"), std::string::npos); + EXPECT_EQ(buffer.stats().budget_exceeded_count, 1u); + buffer.end_run(); +} + +TEST(TemporaryVariableBufferTest, SegmentedChunksSatisfyAggregateBudget) { + MockBackend backend; + backend.max_alloc_size = 2047; + TemporaryVariableBuffer buffer(mock_alloc, mock_free, &backend); + + ASSERT_TRUE(buffer.configure(2048)) << buffer.last_error(); + EXPECT_EQ(buffer.stats().retained_chunk_count, 2u); + EXPECT_EQ(buffer.stats().retained_chunk_bytes, 2048u); + + ASSERT_TRUE(buffer.begin_run()) << buffer.last_error(); + void *first = buffer.acquire(900, TemporaryVariableBuffer::kDefaultAlignment); + void *second = buffer.acquire(900, TemporaryVariableBuffer::kDefaultAlignment); + ASSERT_NE(first, nullptr) << buffer.last_error(); + ASSERT_NE(second, nullptr) << buffer.last_error(); + EXPECT_NE(first, second); + EXPECT_TRUE(is_aligned(first, TemporaryVariableBuffer::kDefaultAlignment)); + EXPECT_TRUE(is_aligned(second, TemporaryVariableBuffer::kDefaultAlignment)); + buffer.end_run(); +} + +TEST(TemporaryVariableBufferTest, ClearFreesRetainedChunksExactlyOnce) { + MockBackend backend; + backend.max_alloc_size = 2047; + TemporaryVariableBuffer buffer(mock_alloc, mock_free, &backend); + + ASSERT_TRUE(buffer.configure(2048)) << buffer.last_error(); + EXPECT_EQ(backend.alloc_count, 2); + buffer.clear(); + EXPECT_EQ(backend.free_count, 2); + 
EXPECT_TRUE(backend.live.empty()); + EXPECT_EQ(buffer.budget(), 0u); + + buffer.clear(); + EXPECT_EQ(backend.free_count, 2); +} + +TEST(TemporaryVariableBufferTest, ActiveReconfigurationFailsClearly) { + MockBackend backend; + TemporaryVariableBuffer buffer(mock_alloc, mock_free, &backend); + + ASSERT_TRUE(buffer.configure(1024)) << buffer.last_error(); + ASSERT_TRUE(buffer.begin_run()) << buffer.last_error(); + EXPECT_FALSE(buffer.configure(2048)); + EXPECT_NE(buffer.last_error().find("cannot reconfigure"), std::string::npos); + buffer.end_run(); +} diff --git a/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp b/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp new file mode 100644 index 000000000..7fa431692 --- /dev/null +++ b/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp @@ -0,0 +1,316 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + * ----------------------------------------------------------------------------------------------------------- + */ +// Host-side fake HostApi tests for a2a3 TRB bind/validate tensor leases. 
+ +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "arg_direction.h" +#include "pto_runtime2_types.h" +#include "runtime.h" +#include "task_args.h" +#include "utils/device_arena.h" + +extern "C" int bind_callable_to_runtime_impl( + Runtime *runtime, const ChipStorageTaskArgs *orch_args, void *host_orch_func_ptr, const ArgDirection *signature, + int sig_count, const uint64_t *ring_task_window, const uint64_t *ring_heap, const uint64_t *ring_dep_pool +); +extern "C" int validate_runtime_impl(Runtime *runtime); + +namespace { + +size_t align_up(size_t value, size_t alignment) { return (value + alignment - 1) & ~(alignment - 1); } + +struct FakeHostApi { + int device_malloc_count = 0; + int device_free_count = 0; + int copy_to_count = 0; + int copy_from_count = 0; + int device_memset_count = 0; + int setup_static_arena_count = 0; + int temp_begin_count = 0; + int temp_end_count = 0; + int temp_acquire_attempts = 0; + int temp_acquire_successes = 0; + int fail_copy_to_on_call = 0; + size_t temp_budget = 0; + size_t temp_offset = 0; + bool temp_active = false; + void *temp_pool = nullptr; + std::unordered_set live_mallocs; + std::vector gm_heap; + std::vector gm_sm; + std::vector runtime_arena; + + ~FakeHostApi() { release_all(); } + + void release_all() { + for (void *ptr : live_mallocs) { + std::free(ptr); + } + live_mallocs.clear(); + if (temp_pool != nullptr) { + std::free(temp_pool); + temp_pool = nullptr; + } + } + + void reset(size_t budget = 0) { + release_all(); + *this = FakeHostApi(); + temp_budget = budget; + if (budget > 0) { + ASSERT_EQ(posix_memalign(&temp_pool, DeviceArena::kDefaultBaseAlign, budget), 0); + std::memset(temp_pool, 0, budget); + } + } +}; + +FakeHostApi *g_fake = nullptr; + +void *fake_device_malloc(size_t size) { + void *ptr = std::malloc(std::max(size, 1)); + if (ptr == nullptr) { + return nullptr; + } + ++g_fake->device_malloc_count; + g_fake->live_mallocs.insert(ptr); + return ptr; +} + +void 
fake_device_free(void *ptr) { + if (ptr == nullptr) { + return; + } + ++g_fake->device_free_count; + EXPECT_EQ(g_fake->live_mallocs.count(ptr), 1u); + g_fake->live_mallocs.erase(ptr); + std::free(ptr); +} + +int fake_copy_to_device(void *dev_ptr, const void *host_ptr, size_t size) { + ++g_fake->copy_to_count; + if (g_fake->fail_copy_to_on_call != 0 && g_fake->copy_to_count == g_fake->fail_copy_to_on_call) { + return -7; + } + std::memcpy(dev_ptr, host_ptr, size); + return 0; +} + +int fake_copy_from_device(void *host_ptr, const void *dev_ptr, size_t size) { + ++g_fake->copy_from_count; + std::memcpy(host_ptr, dev_ptr, size); + return 0; +} + +int fake_device_memset(void *dev_ptr, int value, size_t size) { + ++g_fake->device_memset_count; + std::memset(dev_ptr, value, size); + return 0; +} + +size_t fake_temporary_buffer_budget() { return g_fake->temp_budget; } + +bool fake_begin_temporary_buffer_run() { + if (g_fake->temp_budget == 0 || g_fake->temp_pool == nullptr || g_fake->temp_active) { + return false; + } + ++g_fake->temp_begin_count; + g_fake->temp_offset = 0; + g_fake->temp_active = true; + return true; +} + +void *fake_acquire_temporary_buffer_slice(size_t size, size_t alignment) { + ++g_fake->temp_acquire_attempts; + const size_t offset = align_up(g_fake->temp_offset, alignment == 0 ? 
1 : alignment); + if (!g_fake->temp_active || offset > g_fake->temp_budget || size > g_fake->temp_budget - offset) { + return nullptr; + } + void *ptr = static_cast(g_fake->temp_pool) + offset; + g_fake->temp_offset = offset + size; + ++g_fake->temp_acquire_successes; + return ptr; +} + +void fake_end_temporary_buffer_run() { + EXPECT_TRUE(g_fake->temp_active); + ++g_fake->temp_end_count; + g_fake->temp_active = false; +} + +int fake_setup_static_arena(size_t gm_heap_size, size_t gm_sm_size, size_t runtime_arena_size) { + ++g_fake->setup_static_arena_count; + g_fake->gm_heap.assign(gm_heap_size, 0); + g_fake->gm_sm.assign(gm_sm_size, 0); + g_fake->runtime_arena.assign(runtime_arena_size, 0); + return 0; +} + +void *fake_acquire_pooled_gm_heap() { return g_fake->gm_heap.empty() ? nullptr : g_fake->gm_heap.data(); } +void *fake_acquire_pooled_gm_sm() { return g_fake->gm_sm.empty() ? nullptr : g_fake->gm_sm.data(); } +void *fake_acquire_pooled_runtime_arena() { + return g_fake->runtime_arena.empty() ? nullptr : g_fake->runtime_arena.data(); +} +uint64_t fake_upload_chip_callable_buffer(const void * /* callable */) { return 0; } + +HostApi make_host_api() { + return HostApi{ + fake_device_malloc, + fake_device_free, + fake_copy_to_device, + fake_copy_from_device, + fake_device_memset, + fake_temporary_buffer_budget, + fake_begin_temporary_buffer_run, + fake_acquire_temporary_buffer_slice, + fake_end_temporary_buffer_run, + fake_setup_static_arena, + fake_acquire_pooled_gm_heap, + fake_acquire_pooled_gm_sm, + fake_acquire_pooled_runtime_arena, + fake_upload_chip_callable_buffer, + }; +} + +Tensor make_tensor(std::vector &storage, bool child_memory = false) { + Tensor tensor; + uint32_t shape[1] = {static_cast(storage.size())}; + tensor.init_external(storage.data(), storage.size(), shape, 1, DataType::UINT8, 0, false, child_memory ? 
1 : 0); + return tensor; +} + +ChipStorageTaskArgs make_args(std::vector<uint8_t> &input, std::vector<uint8_t> &output) { + ChipStorageTaskArgs args; + args.add_tensor(make_tensor(input)); + args.add_tensor(make_tensor(output)); + return args; +} + +int bind_runtime(Runtime &runtime, const ChipStorageTaskArgs &args, const ArgDirection *signature, int sig_count) { + uint64_t ring_task_window[PTO2_MAX_RING_DEPTH] = {4, 4, 4, 4}; + uint64_t ring_heap[PTO2_MAX_RING_DEPTH] = {1024, 1024, 1024, 1024}; + uint64_t ring_dep_pool[PTO2_MAX_RING_DEPTH] = {4, 4, 4, 4}; + return bind_callable_to_runtime_impl( + &runtime, &args, nullptr, signature, sig_count, ring_task_window, ring_heap, ring_dep_pool + ); +} + +class TrbRuntimeTempBufferTest : public ::testing::Test { +protected: + void SetUp() override { g_fake = &fake_; } + void TearDown() override { + fake_.release_all(); + g_fake = nullptr; + } + + Runtime make_runtime() { + Runtime runtime; + runtime.host_api = make_host_api(); + return runtime; + } + + FakeHostApi fake_; +}; + +} // namespace + +TEST_F(TrbRuntimeTempBufferTest, PositiveBudgetUsesTemporarySlicesWithoutChangingCopies) { + std::vector<uint8_t> input(64, 7); + std::vector<uint8_t> output(64, 0); + ChipStorageTaskArgs args = make_args(input, output); + ArgDirection signature[2] = {ArgDirection::IN, ArgDirection::OUT}; + + fake_.reset(0); + Runtime malloc_runtime = make_runtime(); + ASSERT_EQ(bind_runtime(malloc_runtime, args, signature, 2), 0); + EXPECT_EQ(fake_.device_malloc_count, 2); + EXPECT_EQ(fake_.copy_to_count, 2); + EXPECT_EQ(fake_.device_memset_count, 1); + ASSERT_EQ(validate_runtime_impl(&malloc_runtime), 0); + EXPECT_EQ(fake_.device_free_count, 2); + EXPECT_EQ(fake_.copy_from_count, 2); + EXPECT_EQ(fake_.temp_begin_count, 0); + + fake_.reset(4096); + Runtime buffer_runtime = make_runtime(); + ASSERT_EQ(bind_runtime(buffer_runtime, args, signature, 2), 0); + EXPECT_EQ(fake_.device_malloc_count, 0); + EXPECT_EQ(fake_.temp_begin_count, 1); + EXPECT_EQ(fake_.temp_acquire_successes, 2); + 
EXPECT_EQ(fake_.copy_to_count, 2); + EXPECT_EQ(fake_.device_memset_count, 1); + ASSERT_EQ(validate_runtime_impl(&buffer_runtime), 0); + EXPECT_EQ(fake_.device_free_count, 0); + EXPECT_EQ(fake_.temp_end_count, 1); + EXPECT_EQ(fake_.copy_from_count, 2); +} + +TEST_F(TrbRuntimeTempBufferTest, ChildMemoryIsPassThroughAndPureOutStillMemsets) { + fake_.reset(4096); + Runtime runtime = make_runtime(); + std::vector<uint8_t> child(64, 3); + std::vector<uint8_t> output(64, 0); + ChipStorageTaskArgs args; + args.add_tensor(make_tensor(child, true)); + args.add_tensor(make_tensor(output)); + ArgDirection signature[2] = {ArgDirection::IN, ArgDirection::OUT}; + + ASSERT_EQ(bind_runtime(runtime, args, signature, 2), 0); + EXPECT_EQ(fake_.temp_acquire_successes, 1); + EXPECT_EQ(fake_.device_malloc_count, 0); + EXPECT_EQ(fake_.copy_to_count, 1); + EXPECT_EQ(fake_.device_memset_count, 1); + ASSERT_EQ(validate_runtime_impl(&runtime), 0); + EXPECT_EQ(fake_.device_free_count, 0); + EXPECT_EQ(fake_.temp_end_count, 1); +} + +TEST_F(TrbRuntimeTempBufferTest, BudgetExhaustionFailsWithoutMallocFallbackAndEndsRun) { + fake_.reset(1024); + Runtime runtime = make_runtime(); + std::vector<uint8_t> input(768, 1); + std::vector<uint8_t> output(768, 0); + ChipStorageTaskArgs args = make_args(input, output); + ArgDirection signature[2] = {ArgDirection::IN, ArgDirection::OUT}; + + EXPECT_EQ(bind_runtime(runtime, args, signature, 2), -1); + EXPECT_EQ(fake_.device_malloc_count, 0); + EXPECT_EQ(fake_.temp_begin_count, 1); + EXPECT_EQ(fake_.temp_acquire_attempts, 2); + EXPECT_EQ(fake_.temp_acquire_successes, 1); + EXPECT_EQ(fake_.temp_end_count, 1); + EXPECT_FALSE(runtime.temporary_buffer_run_active_); + EXPECT_TRUE(runtime.tensor_leases_.empty()); +} + +TEST_F(TrbRuntimeTempBufferTest, FailedCopyReleasesRecordedFreeLease) { + fake_.reset(0); + fake_.fail_copy_to_on_call = 1; + Runtime runtime = make_runtime(); + std::vector<uint8_t> input(64, 9); + ChipStorageTaskArgs args; + args.add_tensor(make_tensor(input)); + ArgDirection signature[1] = 
{ArgDirection::IN}; + + EXPECT_EQ(bind_runtime(runtime, args, signature, 1), -1); + EXPECT_EQ(fake_.device_malloc_count, 1); + EXPECT_EQ(fake_.device_free_count, 1); + EXPECT_TRUE(runtime.tensor_leases_.empty()); + EXPECT_FALSE(runtime.temporary_buffer_run_active_); +} diff --git a/tests/ut/py/test_chip_worker.py b/tests/ut/py/test_chip_worker.py index fe6efe4e5..2a2cb6aeb 100644 --- a/tests/ut/py/test_chip_worker.py +++ b/tests/ut/py/test_chip_worker.py @@ -239,6 +239,11 @@ def test_l3_l2_orch_comm_shutdown_before_init_raises(self): with pytest.raises(RuntimeError, match="not initialized"): worker.l3_l2_orch_comm_shutdown() + def test_configure_temporary_buffer_before_init_raises(self): + worker = _ChipWorker() + with pytest.raises(RuntimeError, match="not initialized"): + worker.configure_temporary_buffer(4096) + # ============================================================================ # Python-level ChipWorker wrapper tests @@ -271,6 +276,8 @@ def __init__(self): self.unregistered = [] self.aicpu_dlopen_count = 0 self.host_dlopen_count = 0 + self.temporary_buffer_budget = 0 + self.configured_temporary_buffers = [] def register_callable(self, slot, callable_obj): self.prepared.append((slot, callable_obj)) @@ -281,6 +288,10 @@ def run(self, slot, args, config): def unregister_callable(self, slot): self.unregistered.append(slot) + def configure_temporary_buffer(self, budget): + self.configured_temporary_buffers.append(budget) + self.temporary_buffer_budget = budget + worker = ChipWorker() fake = FakeImpl() worker._impl = fake @@ -304,6 +315,13 @@ def unregister_callable(self, slot): worker.unregister_callable(second) assert fake.unregistered == [0] + worker.configure_temporary_buffer(4096) + assert fake.configured_temporary_buffers == [4096] + assert worker.temporary_buffer_budget == 4096 + + with pytest.raises(ValueError, match="max_temporary_buffer_bytes"): + worker.configure_temporary_buffer(-1) + def test_public_wrapper_rejects_raw_slot_run(self): from 
_task_interface import ChipStorageTaskArgs # noqa: PLC0415 from simpler.task_interface import ChipWorker # noqa: PLC0415 # pyright: ignore[reportAttributeAccessIssue] diff --git a/tests/ut/py/test_worker/test_host_worker.py b/tests/ut/py/test_worker/test_host_worker.py index d169ffcfd..00abd7b0f 100644 --- a/tests/ut/py/test_worker/test_host_worker.py +++ b/tests/ut/py/test_worker/test_host_worker.py @@ -106,6 +106,16 @@ def _slot_for(worker: Worker, handle: CallableHandle) -> int: return worker._identity_registry[handle.digest].slot_id +class _FakeChipWorker: + def __init__(self) -> None: + self.configured_temporary_buffers: list[int] = [] + self.temporary_buffer_budget = 0 + + def configure_temporary_buffer(self, budget: int) -> None: + self.configured_temporary_buffers.append(budget) + self.temporary_buffer_budget = budget + + class _FakeControlResult: def __init__(self, worker_type: str, worker_id: int = 0, ok: bool = True, error_message: str = ""): self.worker_type = worker_type @@ -122,6 +132,32 @@ def _chip_payload_shm(callable_obj: ChipCallable) -> SharedMemory: return shm +def test_l2_worker_configure_temporary_buffer_records_and_forwards(): + worker = Worker(level=2, platform="a2a3sim", runtime="tensormap_and_ringbuffer") + + assert worker.temporary_buffer_budget == 0 + worker.configure_temporary_buffer(8192) + assert worker._config["max_temporary_buffer_bytes"] == 8192 + assert worker.temporary_buffer_budget == 8192 + + fake_chip = _FakeChipWorker() + worker._chip_worker = fake_chip + worker.configure_temporary_buffer(16384) + assert fake_chip.configured_temporary_buffers == [16384] + assert worker.temporary_buffer_budget == 16384 + + with pytest.raises(ValueError, match="max_temporary_buffer_bytes"): + worker.configure_temporary_buffer(-1) + + +def test_temporary_buffer_configuration_is_l2_only(): + worker = Worker(level=3, num_sub_workers=0) + + with pytest.raises(NotImplementedError, match="level 2"): + worker.configure_temporary_buffer(1024) + assert 
"max_temporary_buffer_bytes" not in worker._config + + def _chip_digest(callable_obj: ChipCallable, *, platform: str = "", runtime: str = "") -> bytes: descriptor = build_chip_callable_descriptor(target=callable_obj, platform=platform, runtime=runtime) return hashid_to_digest(compute_callable_hashid(descriptor)) From 8c4d2bd6d73bb4bbe3daa59b824914d5dfedbcc8 Mon Sep 17 00:00:00 2001 From: puddingfjz <2811443837@qq.com> Date: Tue, 30 Jun 2026 14:50:11 +0800 Subject: [PATCH 3/9] Fix: propagate TRB temporary buffer to L3 chip children --- python/simpler/worker.py | 21 ++++++-- tests/ut/py/test_worker/test_host_worker.py | 53 +++++++++++++++++++-- 2 files changed, 67 insertions(+), 7 deletions(-) diff --git a/python/simpler/worker.py b/python/simpler/worker.py index b8ded40e9..d23a122dd 100644 --- a/python/simpler/worker.py +++ b/python/simpler/worker.py @@ -1152,6 +1152,7 @@ def _chip_process_loop( log_info_v: int = 5, platform: str = "", runtime: str = "", + max_temporary_buffer_bytes: int = 0, ) -> None: """Runs in forked child process. Loads host_runtime.so in own address space. 
@@ -1167,6 +1168,11 @@ def _chip_process_loop( try: cw = ChipWorker() cw.init(device_id, bins, log_level=log_level, log_info_v=log_info_v) + temporary_buffer_budget = int(max_temporary_buffer_bytes) + if temporary_buffer_budget < 0: + raise ValueError("max_temporary_buffer_bytes must be non-negative") + if temporary_buffer_budget > 0: + cw.configure_temporary_buffer(temporary_buffer_budget) except Exception as e: _tb.print_exc() # Write the message so any parent reader that *does* inspect this @@ -2892,6 +2898,9 @@ def _init_hierarchical(self) -> None: device_ids = self._config.get("device_ids", []) n_sub = self._config.get("num_sub_workers", 0) heap_ring_size = self._config.get("heap_ring_size", None) + max_temporary_buffer_bytes = int(self._config.get("max_temporary_buffer_bytes", 0)) + if max_temporary_buffer_bytes < 0: + raise ValueError("Worker max_temporary_buffer_bytes must be non-negative") if self.level >= 4 and device_ids: raise RuntimeError("Worker level >= 4 must use add_worker(); device_ids are only supported on L3 Workers") @@ -2975,6 +2984,9 @@ def _start_hierarchical(self) -> None: # noqa: PLR0912 -- three parallel fork l """Fork child processes and start C++ scheduler. Called on first run().""" device_ids = self._config.get("device_ids", []) n_sub = self._config.get("num_sub_workers", 0) + max_temporary_buffer_bytes = int(self._config.get("max_temporary_buffer_bytes", 0)) + if max_temporary_buffer_bytes < 0: + raise ValueError("Worker max_temporary_buffer_bytes must be non-negative") try: # Fork children from an immutable snapshot. 
The state transition @@ -3034,6 +3046,7 @@ def _start_hierarchical(self) -> None: # noqa: PLR0912 -- three parallel fork l chip_log_info_v, str(self._config["platform"]), str(self._config["runtime"]), + max_temporary_buffer_bytes, ) os._exit(0) else: @@ -3792,12 +3805,14 @@ def copy_from(self, dst: int, src: int, size: int, worker_id: int = 0) -> None: self._orch.copy_from(worker_id, dst, src, size) def configure_temporary_buffer(self, max_temporary_buffer_bytes: int) -> None: - """Configure the level-2 TRB temporary variable buffer for this Worker.""" + """Configure the TRB temporary variable buffer for this Worker.""" budget = int(max_temporary_buffer_bytes) if budget < 0: raise ValueError("max_temporary_buffer_bytes must be non-negative") - if self.level != 2: - raise NotImplementedError("Worker.configure_temporary_buffer currently supports level 2 only") + if self.level not in (2, 3): + raise NotImplementedError("Worker.configure_temporary_buffer currently supports level 2 and level 3 only") + if self.level == 3 and self._hierarchical_start_state == "started": + raise RuntimeError("Worker.configure_temporary_buffer for level 3 must be called before hierarchy startup") self._config["max_temporary_buffer_bytes"] = budget if self._chip_worker is not None: self._chip_worker.configure_temporary_buffer(budget) diff --git a/tests/ut/py/test_worker/test_host_worker.py b/tests/ut/py/test_worker/test_host_worker.py index 00abd7b0f..469b5fb92 100644 --- a/tests/ut/py/test_worker/test_host_worker.py +++ b/tests/ut/py/test_worker/test_host_worker.py @@ -18,6 +18,7 @@ from multiprocessing.shared_memory import SharedMemory import pytest +import simpler.worker as worker_mod from _task_interface import MAX_REGISTERED_CALLABLE_IDS # pyright: ignore[reportMissingImports] from simpler.callable_identity import ( CallableHandle, @@ -150,12 +151,56 @@ def test_l2_worker_configure_temporary_buffer_records_and_forwards(): worker.configure_temporary_buffer(-1) -def 
test_temporary_buffer_configuration_is_l2_only(): +def test_temporary_buffer_configuration_records_for_l3_children(): worker = Worker(level=3, num_sub_workers=0) - with pytest.raises(NotImplementedError, match="level 2"): - worker.configure_temporary_buffer(1024) - assert "max_temporary_buffer_bytes" not in worker._config + worker.configure_temporary_buffer(1024) + assert worker._config["max_temporary_buffer_bytes"] == 1024 + assert worker.temporary_buffer_budget == 1024 + + +def test_chip_process_loop_configures_temporary_buffer(monkeypatch): + events: list[tuple] = [] + + class FakeChipWorker: + def init(self, device_id, bins, *, log_level, log_info_v): + events.append(("init", device_id, bins, log_level, log_info_v)) + + def configure_temporary_buffer(self, budget: int) -> None: + events.append(("configure_temporary_buffer", budget)) + + def finalize(self) -> None: + events.append(("finalize",)) + + def fake_run_chip_main_loop(cw, *_args, chip_platform, chip_runtime): + events.append(("main_loop", cw, chip_platform, chip_runtime)) + + monkeypatch.setattr(worker_mod, "ChipWorker", FakeChipWorker) + monkeypatch.setattr(worker_mod, "_run_chip_main_loop", fake_run_chip_main_loop) + + shm = SharedMemory(create=True, size=MAILBOX_SIZE) + try: + assert shm.buf is not None + worker_mod._chip_process_loop( + shm.buf, + "bins", + 7, + {}, + {}, + {}, + platform="a2a3", + runtime="tensormap_and_ringbuffer", + max_temporary_buffer_bytes=4096, + ) + finally: + shm.close() + shm.unlink() + + assert events[0] == ("init", 7, "bins", 1, 5) + assert events[1] == ("configure_temporary_buffer", 4096) + assert events[2][0] == "main_loop" + assert events[2][2:] == ("a2a3", "tensormap_and_ringbuffer") + assert events[3] == ("finalize",) def _chip_digest(callable_obj: ChipCallable, *, platform: str = "", runtime: str = "") -> bytes: From 433070f12a93315daa91c823a59f4e46db561e42 Mon Sep 17 00:00:00 2001 From: puddingfjz <2811443837@qq.com> Date: Wed, 1 Jul 2026 10:43:54 +0800 Subject: 
[PATCH 4/9] fix: align TRB temporary buffer allocations --- .../host/runtime_maker.cpp | 3 ++- .../host/runtime_maker.cpp | 3 ++- .../include/host/temporary_variable_buffer.h | 16 +++------------- .../common/test_temporary_variable_buffer.cpp | 12 ++++++++---- .../cpp/common/test_trb_runtime_temp_buffer.cpp | 4 ++-- 5 files changed, 17 insertions(+), 21 deletions(-) diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp index 34494a7ea..103055664 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp @@ -52,6 +52,7 @@ #include "common/unified_log.h" #include "host/platform_compile_info.h" #include "host/raii_scope_guard.h" +#include "host/temporary_variable_buffer.h" #include "utils/device_arena.h" #include "prepare_callable_common.h" @@ -480,7 +481,7 @@ static bool stage_device_args( void *dev_ptr = nullptr; TensorReleaseKind release_kind = TensorReleaseKind::Free; if (use_temporary_buffer) { - dev_ptr = api->acquire_temporary_buffer_slice(size, DeviceArena::kDefaultBaseAlign); + dev_ptr = api->acquire_temporary_buffer_slice(size, TemporaryVariableBuffer::kDefaultAlignment); release_kind = TensorReleaseKind::BufferNoop; if (dev_ptr == nullptr) { LOG_ERROR( diff --git a/src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp b/src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp index 34494a7ea..103055664 100644 --- a/src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp +++ b/src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp @@ -52,6 +52,7 @@ #include "common/unified_log.h" #include "host/platform_compile_info.h" #include "host/raii_scope_guard.h" +#include "host/temporary_variable_buffer.h" #include "utils/device_arena.h" #include "prepare_callable_common.h" @@ -480,7 +481,7 @@ static bool stage_device_args( void *dev_ptr = nullptr; 
TensorReleaseKind release_kind = TensorReleaseKind::Free; if (use_temporary_buffer) { - dev_ptr = api->acquire_temporary_buffer_slice(size, DeviceArena::kDefaultBaseAlign); + dev_ptr = api->acquire_temporary_buffer_slice(size, TemporaryVariableBuffer::kDefaultAlignment); release_kind = TensorReleaseKind::BufferNoop; if (dev_ptr == nullptr) { LOG_ERROR( diff --git a/src/common/platform/include/host/temporary_variable_buffer.h b/src/common/platform/include/host/temporary_variable_buffer.h index 0dc1312b0..19feea589 100644 --- a/src/common/platform/include/host/temporary_variable_buffer.h +++ b/src/common/platform/include/host/temporary_variable_buffer.h @@ -13,7 +13,6 @@ #include #include -#include #include #include #include @@ -24,7 +23,7 @@ class TemporaryVariableBuffer { using AllocFn = void *(*)(void *ctx, size_t size); using FreeFn = void (*)(void *ctx, void *ptr); - static constexpr size_t kDefaultAlignment = 1024; + static constexpr size_t kDefaultAlignment = 32; struct Stats { size_t configured_budget_bytes{0}; @@ -72,11 +71,6 @@ class TemporaryVariableBuffer { static size_t align_up(size_t value, size_t alignment) { return (value + alignment - 1) & ~(alignment - 1); } - static void *align_ptr(void *ptr, size_t alignment) { - const uintptr_t raw = reinterpret_cast(ptr); - return reinterpret_cast((raw + alignment - 1) & ~(static_cast(alignment) - 1)); - } - bool allocate_chunks(size_t budget); bool allocate_chunk(size_t capacity, Chunk *out); void release_chunks(); @@ -262,16 +256,12 @@ inline bool TemporaryVariableBuffer::allocate_chunk(size_t capacity, Chunk *out) set_error("temporary buffer allocator callbacks are not configured"); return false; } - if (capacity > std::numeric_limits::max() - (kDefaultAlignment - 1)) { - set_error("temporary buffer chunk size overflows size_t"); - return false; - } - const size_t raw_size = capacity + kDefaultAlignment - 1; + const size_t raw_size = capacity; void *raw = alloc_(ctx_, raw_size); if (raw == nullptr) { return 
false; } - *out = Chunk{raw, align_ptr(raw, kDefaultAlignment), capacity, raw_size, 0}; + *out = Chunk{raw, raw, capacity, raw_size, 0}; return true; } diff --git a/tests/ut/cpp/common/test_temporary_variable_buffer.cpp b/tests/ut/cpp/common/test_temporary_variable_buffer.cpp index 8490d4d2c..6f5c27b3d 100644 --- a/tests/ut/cpp/common/test_temporary_variable_buffer.cpp +++ b/tests/ut/cpp/common/test_temporary_variable_buffer.cpp @@ -26,6 +26,7 @@ struct MockBackend { int alloc_count = 0; int free_count = 0; size_t max_alloc_size = 0; + size_t total_alloc_bytes = 0; std::unordered_set live; void *alloc(size_t size) { @@ -37,6 +38,7 @@ struct MockBackend { return nullptr; } ++alloc_count; + total_alloc_bytes += size; live.insert(ptr); return ptr; } @@ -62,13 +64,14 @@ TEST(TemporaryVariableBufferTest, BeginAcquireEndReusesRetainedMemory) { ASSERT_TRUE(buffer.configure(4096)) << buffer.last_error(); EXPECT_EQ(backend.alloc_count, 1); + EXPECT_EQ(backend.total_alloc_bytes, 4096u); ASSERT_TRUE(buffer.begin_run()) << buffer.last_error(); void *first = buffer.acquire(512, TemporaryVariableBuffer::kDefaultAlignment); ASSERT_NE(first, nullptr) << buffer.last_error(); EXPECT_TRUE(is_aligned(first, TemporaryVariableBuffer::kDefaultAlignment)); - void *second = buffer.acquire(256, 256); + void *second = buffer.acquire(256, TemporaryVariableBuffer::kDefaultAlignment); ASSERT_NE(second, nullptr) << buffer.last_error(); - EXPECT_TRUE(is_aligned(second, 256)); + EXPECT_TRUE(is_aligned(second, TemporaryVariableBuffer::kDefaultAlignment)); buffer.end_run(); ASSERT_TRUE(buffer.begin_run()) << buffer.last_error(); @@ -87,8 +90,8 @@ TEST(TemporaryVariableBufferTest, ConfiguredBudgetIsEnforcedWithClearError) { ASSERT_TRUE(buffer.configure(1024)) << buffer.last_error(); ASSERT_TRUE(buffer.begin_run()) << buffer.last_error(); - ASSERT_NE(buffer.acquire(768, 256), nullptr) << buffer.last_error(); - EXPECT_EQ(buffer.acquire(512, 256), nullptr); + ASSERT_NE(buffer.acquire(768, 
TemporaryVariableBuffer::kDefaultAlignment), nullptr) << buffer.last_error(); + EXPECT_EQ(buffer.acquire(512, TemporaryVariableBuffer::kDefaultAlignment), nullptr); EXPECT_NE(buffer.last_error().find("required bytes 1280"), std::string::npos); EXPECT_NE(buffer.last_error().find("configured bytes 1024"), std::string::npos); EXPECT_EQ(buffer.stats().budget_exceeded_count, 1u); @@ -103,6 +106,7 @@ TEST(TemporaryVariableBufferTest, SegmentedChunksSatisfyAggregateBudget) { ASSERT_TRUE(buffer.configure(2048)) << buffer.last_error(); EXPECT_EQ(buffer.stats().retained_chunk_count, 2u); EXPECT_EQ(buffer.stats().retained_chunk_bytes, 2048u); + EXPECT_EQ(backend.total_alloc_bytes, 2048u); ASSERT_TRUE(buffer.begin_run()) << buffer.last_error(); void *first = buffer.acquire(900, TemporaryVariableBuffer::kDefaultAlignment); diff --git a/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp b/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp index 7fa431692..f50dd7016 100644 --- a/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp +++ b/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp @@ -24,7 +24,7 @@ #include "pto_runtime2_types.h" #include "runtime.h" #include "task_args.h" -#include "utils/device_arena.h" +#include "host/temporary_variable_buffer.h" extern "C" int bind_callable_to_runtime_impl( Runtime *runtime, const ChipStorageTaskArgs *orch_args, void *host_orch_func_ptr, const ArgDirection *signature, @@ -75,7 +75,7 @@ struct FakeHostApi { *this = FakeHostApi(); temp_budget = budget; if (budget > 0) { - ASSERT_EQ(posix_memalign(&temp_pool, DeviceArena::kDefaultBaseAlign, budget), 0); + ASSERT_EQ(posix_memalign(&temp_pool, TemporaryVariableBuffer::kDefaultAlignment, budget), 0); std::memset(temp_pool, 0, budget); } } From f80230bea8f80dc4d7714a32b599f1352658a1dc Mon Sep 17 00:00:00 2001 From: puddingfjz <2811443837@qq.com> Date: Wed, 1 Jul 2026 11:22:14 +0800 Subject: [PATCH 5/9] Fix: cover temp-buffer copy failure cleanup --- 
.../common/test_trb_runtime_temp_buffer.cpp | 22 +++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp b/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp index f50dd7016..094741246 100644 --- a/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp +++ b/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp @@ -314,3 +314,25 @@ TEST_F(TrbRuntimeTempBufferTest, FailedCopyReleasesRecordedFreeLease) { EXPECT_TRUE(runtime.tensor_leases_.empty()); EXPECT_FALSE(runtime.temporary_buffer_run_active_); } + +TEST_F(TrbRuntimeTempBufferTest, FailedCopyAfterTemporaryRunBeginsEndsRunOnce) { + fake_.reset(4096); + fake_.fail_copy_to_on_call = 1; + Runtime runtime = make_runtime(); + std::vector<uint8_t> input(64, 9); + ChipStorageTaskArgs args; + args.add_tensor(make_tensor(input)); + ArgDirection signature[1] = {ArgDirection::IN}; + + EXPECT_EQ(bind_runtime(runtime, args, signature, 1), -1); + EXPECT_EQ(fake_.device_malloc_count, 0); + EXPECT_EQ(fake_.device_free_count, 0); + EXPECT_EQ(fake_.temp_begin_count, 1); + EXPECT_EQ(fake_.temp_acquire_attempts, 1); + EXPECT_EQ(fake_.temp_acquire_successes, 1); + EXPECT_EQ(fake_.copy_to_count, 1); + EXPECT_EQ(fake_.temp_end_count, 1); + EXPECT_FALSE(fake_.temp_active); + EXPECT_TRUE(runtime.tensor_leases_.empty()); + EXPECT_FALSE(runtime.temporary_buffer_run_active_); +} From 3bf8392ab8bf55bef7dd90ba9dcb521152a26f62 Mon Sep 17 00:00:00 2001 From: puddingfjz <2811443837@qq.com> Date: Wed, 1 Jul 2026 12:16:20 +0800 Subject: [PATCH 6/9] Fix: adapt TRB temp buffer to shared HostApi --- .../platform/onboard/host/c_api_shared.cpp | 30 +++++++++++ src/common/platform/sim/host/c_api_shared.cpp | 30 +++++++++++ .../common/test_trb_runtime_temp_buffer.cpp | 52 ++++++++++++------- 3 files changed, 93 insertions(+), 19 deletions(-) diff --git a/src/common/platform/onboard/host/c_api_shared.cpp b/src/common/platform/onboard/host/c_api_shared.cpp index b562e5848..65b2669b5 100644 --- 
a/src/common/platform/onboard/host/c_api_shared.cpp +++ b/src/common/platform/onboard/host/c_api_shared.cpp @@ -120,6 +120,36 @@ static int device_memset(void *dev_ptr, int value, size_t size) { } } +static size_t temporary_buffer_budget() { + try { + return current_runner()->temporary_buffer_budget(); + } catch (...) { + return 0; + } +} + +static bool begin_temporary_buffer_run() { + try { + return current_runner()->begin_temporary_buffer_run(); + } catch (...) { + return false; + } +} + +static void *acquire_temporary_buffer_slice(size_t size, size_t alignment) { + try { + return current_runner()->acquire_temporary_buffer_slice(size, alignment); + } catch (...) { + return nullptr; + } +} + +static void end_temporary_buffer_run() { + try { + current_runner()->end_temporary_buffer_run(); + } catch (...) {} +} + static uint64_t upload_chip_callable_buffer_wrapper(const void *callable) { try { return current_runner()->upload_chip_callable_buffer(static_cast(callable)); diff --git a/src/common/platform/sim/host/c_api_shared.cpp b/src/common/platform/sim/host/c_api_shared.cpp index 1048b9df8..94af2b32d 100644 --- a/src/common/platform/sim/host/c_api_shared.cpp +++ b/src/common/platform/sim/host/c_api_shared.cpp @@ -117,6 +117,36 @@ static int device_memset(void *dev_ptr, int value, size_t size) { } } +static size_t temporary_buffer_budget() { + try { + return current_runner()->temporary_buffer_budget(); + } catch (...) { + return 0; + } +} + +static bool begin_temporary_buffer_run() { + try { + return current_runner()->begin_temporary_buffer_run(); + } catch (...) { + return false; + } +} + +static void *acquire_temporary_buffer_slice(size_t size, size_t alignment) { + try { + return current_runner()->acquire_temporary_buffer_slice(size, alignment); + } catch (...) { + return nullptr; + } +} + +static void end_temporary_buffer_run() { + try { + current_runner()->end_temporary_buffer_run(); + } catch (...) 
{} +} + static uint64_t upload_chip_callable_buffer_wrapper(const void *callable) { try { return current_runner()->upload_chip_callable_buffer(static_cast(callable)); diff --git a/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp b/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp index 094741246..e16bad90f 100644 --- a/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp +++ b/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp @@ -27,10 +27,11 @@ #include "host/temporary_variable_buffer.h" extern "C" int bind_callable_to_runtime_impl( - Runtime *runtime, const ChipStorageTaskArgs *orch_args, void *host_orch_func_ptr, const ArgDirection *signature, - int sig_count, const uint64_t *ring_task_window, const uint64_t *ring_heap, const uint64_t *ring_dep_pool + Runtime *runtime, const HostApi *api, const ChipStorageTaskArgs *orch_args, void *host_orch_func_ptr, + const ArgDirection *signature, int sig_count, const uint64_t *ring_task_window, const uint64_t *ring_heap, + const uint64_t *ring_dep_pool ); -extern "C" int validate_runtime_impl(Runtime *runtime); +extern "C" int validate_runtime_impl(Runtime *runtime, const HostApi *api); namespace { @@ -167,6 +168,18 @@ void *fake_acquire_pooled_gm_sm() { return g_fake->gm_sm.empty() ? nullptr : g_f void *fake_acquire_pooled_runtime_arena() { return g_fake->runtime_arena.empty() ? 
nullptr : g_fake->runtime_arena.data(); } +bool fake_lookup_prebuilt_runtime_arena_cache( + uint64_t /* hash */, const void * /* key_data */, size_t /* key_size */, void ** /* gm_heap_base */, + void ** /* sm_base */, void ** /* runtime_arena_base */, size_t * /* runtime_off */, + const void ** /* image_data */, size_t * /* image_size */ +) { + return false; +} +void fake_mark_prebuilt_runtime_arena_cached( + uint64_t /* hash */, const void * /* key_data */, size_t /* key_size */, void * /* gm_heap_base */, + void * /* sm_base */, void * /* runtime_arena_base */, size_t /* runtime_off */, const void * /* image_data */, + size_t /* image_size */ +) {} uint64_t fake_upload_chip_callable_buffer(const void * /* callable */) { return 0; } HostApi make_host_api() { @@ -184,6 +197,8 @@ HostApi make_host_api() { fake_acquire_pooled_gm_heap, fake_acquire_pooled_gm_sm, fake_acquire_pooled_runtime_arena, + fake_lookup_prebuilt_runtime_arena_cache, + fake_mark_prebuilt_runtime_arena_cached, fake_upload_chip_callable_buffer, }; } @@ -202,12 +217,14 @@ ChipStorageTaskArgs make_args(std::vector &input, std::vector return args; } -int bind_runtime(Runtime &runtime, const ChipStorageTaskArgs &args, const ArgDirection *signature, int sig_count) { +int bind_runtime( + Runtime &runtime, const HostApi &api, const ChipStorageTaskArgs &args, const ArgDirection *signature, int sig_count +) { uint64_t ring_task_window[PTO2_MAX_RING_DEPTH] = {4, 4, 4, 4}; uint64_t ring_heap[PTO2_MAX_RING_DEPTH] = {1024, 1024, 1024, 1024}; uint64_t ring_dep_pool[PTO2_MAX_RING_DEPTH] = {4, 4, 4, 4}; return bind_callable_to_runtime_impl( - &runtime, &args, nullptr, signature, sig_count, ring_task_window, ring_heap, ring_dep_pool + &runtime, &api, &args, nullptr, signature, sig_count, ring_task_window, ring_heap, ring_dep_pool ); } @@ -219,13 +236,10 @@ class TrbRuntimeTempBufferTest : public ::testing::Test { g_fake = nullptr; } - Runtime make_runtime() { - Runtime runtime; - runtime.host_api = 
make_host_api(); - return runtime; - } + Runtime make_runtime() { return Runtime{}; } FakeHostApi fake_; + HostApi api_ = make_host_api(); }; } // namespace @@ -238,24 +252,24 @@ TEST_F(TrbRuntimeTempBufferTest, PositiveBudgetUsesTemporarySlicesWithoutChangin fake_.reset(0); Runtime malloc_runtime = make_runtime(); - ASSERT_EQ(bind_runtime(malloc_runtime, args, signature, 2), 0); + ASSERT_EQ(bind_runtime(malloc_runtime, api_, args, signature, 2), 0); EXPECT_EQ(fake_.device_malloc_count, 2); EXPECT_EQ(fake_.copy_to_count, 2); EXPECT_EQ(fake_.device_memset_count, 1); - ASSERT_EQ(validate_runtime_impl(&malloc_runtime), 0); + ASSERT_EQ(validate_runtime_impl(&malloc_runtime, &api_), 0); EXPECT_EQ(fake_.device_free_count, 2); EXPECT_EQ(fake_.copy_from_count, 2); EXPECT_EQ(fake_.temp_begin_count, 0); fake_.reset(4096); Runtime buffer_runtime = make_runtime(); - ASSERT_EQ(bind_runtime(buffer_runtime, args, signature, 2), 0); + ASSERT_EQ(bind_runtime(buffer_runtime, api_, args, signature, 2), 0); EXPECT_EQ(fake_.device_malloc_count, 0); EXPECT_EQ(fake_.temp_begin_count, 1); EXPECT_EQ(fake_.temp_acquire_successes, 2); EXPECT_EQ(fake_.copy_to_count, 2); EXPECT_EQ(fake_.device_memset_count, 1); - ASSERT_EQ(validate_runtime_impl(&buffer_runtime), 0); + ASSERT_EQ(validate_runtime_impl(&buffer_runtime, &api_), 0); EXPECT_EQ(fake_.device_free_count, 0); EXPECT_EQ(fake_.temp_end_count, 1); EXPECT_EQ(fake_.copy_from_count, 2); @@ -271,12 +285,12 @@ TEST_F(TrbRuntimeTempBufferTest, ChildMemoryIsPassThroughAndPureOutStillMemsets) args.add_tensor(make_tensor(output)); ArgDirection signature[2] = {ArgDirection::IN, ArgDirection::OUT}; - ASSERT_EQ(bind_runtime(runtime, args, signature, 2), 0); + ASSERT_EQ(bind_runtime(runtime, api_, args, signature, 2), 0); EXPECT_EQ(fake_.temp_acquire_successes, 1); EXPECT_EQ(fake_.device_malloc_count, 0); EXPECT_EQ(fake_.copy_to_count, 1); EXPECT_EQ(fake_.device_memset_count, 1); - ASSERT_EQ(validate_runtime_impl(&runtime), 0); + 
ASSERT_EQ(validate_runtime_impl(&runtime, &api_), 0); EXPECT_EQ(fake_.device_free_count, 0); EXPECT_EQ(fake_.temp_end_count, 1); } @@ -289,7 +303,7 @@ TEST_F(TrbRuntimeTempBufferTest, BudgetExhaustionFailsWithoutMallocFallbackAndEn ChipStorageTaskArgs args = make_args(input, output); ArgDirection signature[2] = {ArgDirection::IN, ArgDirection::OUT}; - EXPECT_EQ(bind_runtime(runtime, args, signature, 2), -1); + EXPECT_EQ(bind_runtime(runtime, api_, args, signature, 2), -1); EXPECT_EQ(fake_.device_malloc_count, 0); EXPECT_EQ(fake_.temp_begin_count, 1); EXPECT_EQ(fake_.temp_acquire_attempts, 2); @@ -308,7 +322,7 @@ TEST_F(TrbRuntimeTempBufferTest, FailedCopyReleasesRecordedFreeLease) { args.add_tensor(make_tensor(input)); ArgDirection signature[1] = {ArgDirection::IN}; - EXPECT_EQ(bind_runtime(runtime, args, signature, 1), -1); + EXPECT_EQ(bind_runtime(runtime, api_, args, signature, 1), -1); EXPECT_EQ(fake_.device_malloc_count, 1); EXPECT_EQ(fake_.device_free_count, 1); EXPECT_TRUE(runtime.tensor_leases_.empty()); @@ -324,7 +338,7 @@ TEST_F(TrbRuntimeTempBufferTest, FailedCopyAfterTemporaryRunBeginsEndsRunOnce) { args.add_tensor(make_tensor(input)); ArgDirection signature[1] = {ArgDirection::IN}; - EXPECT_EQ(bind_runtime(runtime, args, signature, 1), -1); + EXPECT_EQ(bind_runtime(runtime, api_, args, signature, 1), -1); EXPECT_EQ(fake_.device_malloc_count, 0); EXPECT_EQ(fake_.device_free_count, 0); EXPECT_EQ(fake_.temp_begin_count, 1); From cd8c36ee23171985b1c9b39ff8ab8d6ac35cd61f Mon Sep 17 00:00:00 2001 From: puddingfjz <2811443837@qq.com> Date: Wed, 1 Jul 2026 12:44:11 +0800 Subject: [PATCH 7/9] Fix: satisfy pre-commit after HostApi rebase - Replace the non-English temporary-buffer note in the plan doc - Group chip child-process startup settings to keep the helper under ruff's argument limit - Apply clang-format changes reported by CI --- docs/trb-serial-tensor-buffer-pool-plan.md | 2 +- python/simpler/worker.py | 38 ++++++++++++------- .../host/runtime_maker.cpp 
| 6 +-- .../host/runtime_maker.cpp | 6 +-- .../onboard/host/device_runner_base.cpp | 6 ++- src/common/worker/chip_worker.cpp | 4 +- .../common/test_trb_runtime_temp_buffer.cpp | 4 +- tests/ut/py/test_worker/test_host_worker.py | 8 ++-- 8 files changed, 44 insertions(+), 30 deletions(-) diff --git a/docs/trb-serial-tensor-buffer-pool-plan.md b/docs/trb-serial-tensor-buffer-pool-plan.md index 0f5fa913f..7eee7bf1c 100644 --- a/docs/trb-serial-tensor-buffer-pool-plan.md +++ b/docs/trb-serial-tensor-buffer-pool-plan.md @@ -8,7 +8,7 @@ The target optimization is a runtime-side temporary variable buffer for ordinary non-child tensors in the `tensormap_and_ringbuffer` path. This plan uses "temporary variable buffer" for the same concept as -临时变量 buffer. +temporary tensor storage. The serving constraint is important: diff --git a/python/simpler/worker.py b/python/simpler/worker.py index d23a122dd..428267db7 100644 --- a/python/simpler/worker.py +++ b/python/simpler/worker.py @@ -267,6 +267,15 @@ class _CallableRegistration: eligible_worker_ids: tuple[int, ...] = () +@dataclass(frozen=True) +class _ChipProcessConfig: + log_level: int = 1 + log_info_v: int = 5 + platform: str = "" + runtime: str = "" + max_temporary_buffer_bytes: int = 0 + + @dataclass(frozen=True) class RemoteCallable: """Import-path descriptor for a parent-facing remote L3 callable.""" @@ -1148,11 +1157,7 @@ def _chip_process_loop( registry: dict[int, Any], identity_table: dict[bytes, int], identity_refs: dict[bytes, int], - log_level: int = 1, - log_info_v: int = 5, - platform: str = "", - runtime: str = "", - max_temporary_buffer_bytes: int = 0, + config: _ChipProcessConfig | None = None, ) -> None: """Runs in forked child process. Loads host_runtime.so in own address space. 
@@ -1165,10 +1170,13 @@ def _chip_process_loop( """ import traceback as _tb # noqa: PLC0415 + if config is None: + config = _ChipProcessConfig() + try: cw = ChipWorker() - cw.init(device_id, bins, log_level=log_level, log_info_v=log_info_v) - temporary_buffer_budget = int(max_temporary_buffer_bytes) + cw.init(device_id, bins, log_level=config.log_level, log_info_v=config.log_info_v) + temporary_buffer_budget = int(config.max_temporary_buffer_bytes) if temporary_buffer_budget < 0: raise ValueError("max_temporary_buffer_bytes must be non-negative") if temporary_buffer_budget > 0: @@ -1202,8 +1210,8 @@ def _chip_process_loop( registry, identity_table, identity_refs, - chip_platform=platform, - chip_runtime=runtime, + chip_platform=config.platform, + chip_runtime=config.runtime, ) finally: cw.finalize() @@ -3042,11 +3050,13 @@ def _start_hierarchical(self) -> None: # noqa: PLR0912 -- three parallel fork l callable_kind="CHIP_CALLABLE", target_namespace="LOCAL_CHIP", ), - chip_log_level, - chip_log_info_v, - str(self._config["platform"]), - str(self._config["runtime"]), - max_temporary_buffer_bytes, + _ChipProcessConfig( + log_level=chip_log_level, + log_info_v=chip_log_info_v, + platform=str(self._config["platform"]), + runtime=str(self._config["runtime"]), + max_temporary_buffer_bytes=max_temporary_buffer_bytes, + ), ) os._exit(0) else: diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp index 103055664..386b8286a 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp @@ -764,9 +764,9 @@ extern "C" int bind_callable_to_runtime_impl( size_t temporary_buffer_budget = api->temporary_buffer_budget == nullptr ? 
0 : api->temporary_buffer_budget(); bool use_temporary_buffer = temporary_buffer_budget > 0; - if (use_temporary_buffer && (api->begin_temporary_buffer_run == nullptr || - api->acquire_temporary_buffer_slice == nullptr || - api->end_temporary_buffer_run == nullptr)) { + if (use_temporary_buffer && + (api->begin_temporary_buffer_run == nullptr || api->acquire_temporary_buffer_slice == nullptr || + api->end_temporary_buffer_run == nullptr)) { LOG_ERROR("Temporary buffer budget is configured but HostApi temporary-buffer callbacks are not wired"); return -1; } diff --git a/src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp b/src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp index 103055664..386b8286a 100644 --- a/src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp +++ b/src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp @@ -764,9 +764,9 @@ extern "C" int bind_callable_to_runtime_impl( size_t temporary_buffer_budget = api->temporary_buffer_budget == nullptr ? 
0 : api->temporary_buffer_budget(); bool use_temporary_buffer = temporary_buffer_budget > 0; - if (use_temporary_buffer && (api->begin_temporary_buffer_run == nullptr || - api->acquire_temporary_buffer_slice == nullptr || - api->end_temporary_buffer_run == nullptr)) { + if (use_temporary_buffer && + (api->begin_temporary_buffer_run == nullptr || api->acquire_temporary_buffer_slice == nullptr || + api->end_temporary_buffer_run == nullptr)) { LOG_ERROR("Temporary buffer budget is configured but HostApi temporary-buffer callbacks are not wired"); return -1; } diff --git a/src/common/platform/onboard/host/device_runner_base.cpp b/src/common/platform/onboard/host/device_runner_base.cpp index e2b1734e8..971774b9a 100644 --- a/src/common/platform/onboard/host/device_runner_base.cpp +++ b/src/common/platform/onboard/host/device_runner_base.cpp @@ -534,8 +534,10 @@ int DeviceRunnerBase::ensure_binaries_loaded() { } if (dispatcher_so_binary_.empty()) { - LOG_ERROR("DeviceRunner: dispatcher SO bytes not provided; pass dispatcher_path through ChipWorker.init " - "(RuntimeBinaries.dispatcher_path)"); + LOG_ERROR( + "DeviceRunner: dispatcher SO bytes not provided; pass dispatcher_path through ChipWorker.init " + "(RuntimeBinaries.dispatcher_path)" + ); return -1; } diff --git a/src/common/worker/chip_worker.cpp b/src/common/worker/chip_worker.cpp index 9cc3db3eb..8d57b817b 100644 --- a/src/common/worker/chip_worker.cpp +++ b/src/common/worker/chip_worker.cpp @@ -706,8 +706,8 @@ void ChipWorker::comm_destroy(uint64_t comm_handle) { } int rc = destroy_comm_session(*session); - while (!comm_sessions_.empty() && comm_sessions_.back().handle == nullptr && comm_sessions_.back().stream == nullptr - ) { + while (!comm_sessions_.empty() && comm_sessions_.back().handle == nullptr && + comm_sessions_.back().stream == nullptr) { comm_sessions_.pop_back(); } diff --git a/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp b/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp index 
e16bad90f..aec9c2f38 100644 --- a/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp +++ b/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp @@ -170,8 +170,8 @@ void *fake_acquire_pooled_runtime_arena() { } bool fake_lookup_prebuilt_runtime_arena_cache( uint64_t /* hash */, const void * /* key_data */, size_t /* key_size */, void ** /* gm_heap_base */, - void ** /* sm_base */, void ** /* runtime_arena_base */, size_t * /* runtime_off */, - const void ** /* image_data */, size_t * /* image_size */ + void ** /* sm_base */, void ** /* runtime_arena_base */, size_t * /* runtime_off */, const void ** /* image_data */, + size_t * /* image_size */ ) { return false; } diff --git a/tests/ut/py/test_worker/test_host_worker.py b/tests/ut/py/test_worker/test_host_worker.py index 469b5fb92..968232a17 100644 --- a/tests/ut/py/test_worker/test_host_worker.py +++ b/tests/ut/py/test_worker/test_host_worker.py @@ -188,9 +188,11 @@ def fake_run_chip_main_loop(cw, *_args, chip_platform, chip_runtime): {}, {}, {}, - platform="a2a3", - runtime="tensormap_and_ringbuffer", - max_temporary_buffer_bytes=4096, + worker_mod._ChipProcessConfig( + platform="a2a3", + runtime="tensormap_and_ringbuffer", + max_temporary_buffer_bytes=4096, + ), ) finally: shm.close() From 078843fc75821fd1ba9ce779af0bef677ce1776e Mon Sep 17 00:00:00 2001 From: puddingfjz <2811443837@qq.com> Date: Wed, 1 Jul 2026 17:51:57 +0800 Subject: [PATCH 8/9] Implement auto realloc TRB temporary buffer --- ...lloc-temporary-buffer-modification-plan.md | 335 +++++++++++++++++ ...auto-temporary-buffer-modification-plan.md | 340 ++++++++++++++++++ python/bindings/task_interface.cpp | 10 +- python/simpler/task_interface.py | 19 +- python/simpler/worker.py | 59 ++- .../host/runtime_maker.cpp | 43 ++- .../host/runtime_maker.cpp | 43 ++- src/common/platform/include/common/host_api.h | 15 +- .../include/host/temporary_variable_buffer.h | 275 ++++++++------ .../platform/onboard/host/c_api_shared.cpp | 25 +- 
.../onboard/host/device_runner_base.cpp | 28 +- .../onboard/host/device_runner_base.h | 6 +- src/common/platform/sim/host/c_api_shared.cpp | 25 +- .../platform/sim/host/device_runner_base.cpp | 28 +- .../platform/sim/host/device_runner_base.h | 6 +- src/common/worker/chip_worker.cpp | 28 +- src/common/worker/chip_worker.h | 9 +- src/common/worker/pto_runtime_c_api.h | 10 +- .../common/test_temporary_variable_buffer.cpp | 142 +++++--- .../common/test_trb_runtime_temp_buffer.cpp | 58 ++- tests/ut/py/test_chip_worker.py | 19 +- tests/ut/py/test_worker/test_host_worker.py | 49 +-- 22 files changed, 1167 insertions(+), 405 deletions(-) create mode 100644 docs/trb-auto-realloc-temporary-buffer-modification-plan.md create mode 100644 docs/trb-auto-temporary-buffer-modification-plan.md diff --git a/docs/trb-auto-realloc-temporary-buffer-modification-plan.md b/docs/trb-auto-realloc-temporary-buffer-modification-plan.md new file mode 100644 index 000000000..7208f9c97 --- /dev/null +++ b/docs/trb-auto-realloc-temporary-buffer-modification-plan.md @@ -0,0 +1,335 @@ +# TRB AUTO Realloc Temporary Buffer Modification Plan + +**Date**: 2026-07-01 +**Status**: replacement modification plan + +## Purpose + +This document replaces the previous chunk-growth AUTO plan for PR 1198 while +keeping that older plan file intact for review history. + +The new target is simpler: + +- no multi-chunk automatic growth mechanism; +- one retained temporary-buffer allocation per runner; +- at each run, compute the whole temporary-buffer requirement before staging; +- if the retained buffer is too small, free it first and allocate one new + buffer for the current run; +- use 1024-byte address alignment for the temporary buffer. + +The previous plan remains in +`docs/trb-auto-temporary-buffer-modification-plan.md`. + +## Target Behavior + +Temporary buffering still has two modes: + +- `off`: use the existing per-run `device_malloc()` / `device_free()` path. 
+- `auto`: use one runner-scoped retained temporary buffer. + +The default mode is `off`. + +AUTO mode does not take a caller-provided byte budget. The retained buffer +starts empty. On each TRB bind, the host builds a run plan for all ordinary +non-child tensors that would use temporary storage. The plan is packed with +1024-byte alignment. If the current retained buffer is large enough, the run +reuses it. If it is not large enough, the implementation frees the old +retained buffer and allocates one new retained buffer for this run. + +There is no incremental chunk growth and no per-acquire allocation. After +`begin_temporary_buffer_run(plan)` succeeds, every +`acquire_temporary_buffer_slice()` must be satisfied from the retained buffer. +A miss after a successful begin is a bug in plan/acquire consistency and must +fail clearly. It must not fall back to ordinary `device_malloc()`. + +## Alignment Contract + +Use a single alignment constant for this feature: + +```cpp +static constexpr size_t kTemporaryBufferAlignment = 1024; +``` + +Apply it to both: + +- the retained buffer base address exposed to tensor slices; +- every tensor slice offset allocated from that retained buffer. + +If the platform allocator does not guarantee 1024-byte alignment directly, the +temporary buffer must over-allocate and store both addresses: + +```cpp +struct Buffer { + void *raw_base; + void *base; // 1024-byte aligned address used by slices + size_t capacity; // usable bytes from base + size_t offset; +}; +``` + +Only `raw_base` is passed to the platform free callback. The usable +`capacity` is the bytes available from aligned `base`. 
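The over-allocation rule in the Alignment Contract can be sketched as follows. This is only an illustration under stated assumptions: `std::malloc` / `std::free` stand in for the platform allocator callbacks, and `allocate_aligned_buffer` and `release_buffer` are hypothetical helper names that do not appear in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

static constexpr size_t kTemporaryBufferAlignment = 1024;

struct Buffer {
    void *raw_base = nullptr;  // pointer returned by the allocator; the only one passed to free
    void *base = nullptr;      // 1024-byte aligned address used by slices
    size_t capacity = 0;       // usable bytes from base
    size_t offset = 0;
};

// Hypothetical helper: std::malloc is an assumption standing in for the
// platform allocation callback, which may not guarantee 1024-byte alignment.
inline bool allocate_aligned_buffer(size_t required, Buffer *out) {
    assert(out != nullptr);
    if (required > SIZE_MAX - (kTemporaryBufferAlignment - 1)) {
        return false;  // the padded size would overflow size_t
    }
    // Over-allocate by (alignment - 1) bytes so an aligned base always fits.
    const size_t raw_size = required + kTemporaryBufferAlignment - 1;
    void *raw = std::malloc(raw_size);
    if (raw == nullptr) {
        return false;
    }
    const uintptr_t mask = static_cast<uintptr_t>(kTemporaryBufferAlignment - 1);
    const uintptr_t aligned = (reinterpret_cast<uintptr_t>(raw) + mask) & ~mask;
    out->raw_base = raw;
    out->base = reinterpret_cast<void *>(aligned);
    out->capacity = required;  // bytes guaranteed to be usable from base
    out->offset = 0;
    return true;
}

// Hypothetical helper: frees raw_base, never the aligned base.
inline void release_buffer(Buffer *buf) {
    std::free(buf->raw_base);
    *buf = Buffer{};
}
```

Keeping `raw_base` separate from `base` is what lets slices start at an aligned address while the free callback still receives the exact pointer the allocator returned.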
+ +The required capacity for a run is computed with the same 1024-byte alignment +rule as real acquire: + +```text +offset = 0 +for item in plan: + offset = align_up(offset, 1024) + offset += item.bytes +required = offset +``` + +The implementation may round `required` up to 1024 bytes before storing it as +capacity, but it must not use a coarse fixed MiB chunk granularity. + +## Run Planning + +Before staging tensors in TRB bind, build a plan using the same filtering and +ordering as real acquire: + +```text +for tensor in orch_args, in real bind order: + if tensor.is_child_memory(): + skip + else: + append {bytes=tensor.nbytes(), alignment=1024} +``` + +The plan includes ordinary non-child input, INOUT, and output tensors. Child +memory stays pass-through and is not included. + +Zero-byte tensors should not force a retained-buffer allocation. The plan and +real acquire path must handle them consistently. The preferred behavior is to +skip zero-byte tensors in the temporary-buffer plan and avoid consuming buffer +capacity for them. + +## Host API Shape + +Use plan-based AUTO callbacks, not a byte-budget API: + +```cpp +struct TemporaryBufferPlanItem { + size_t bytes; + size_t alignment; +}; + +bool (*temporary_buffer_enabled)(); +bool (*begin_temporary_buffer_run)( + const TemporaryBufferPlanItem *items, size_t item_count); +void *(*acquire_temporary_buffer_slice)(size_t bytes, size_t alignment); +void (*end_temporary_buffer_run)(); +``` + +`begin_temporary_buffer_run()` computes the packed required size and ensures +the retained buffer is large enough for the whole run. 
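The packed-size rule that `begin_temporary_buffer_run()` relies on can be sketched like this. It is a minimal sketch, assuming the plan skips zero-byte items as the Run Planning section prefers; `align_up` and `packed_size` are hypothetical helper names, not functions confirmed by the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct TemporaryBufferPlanItem {
    size_t bytes;
    size_t alignment;
};

static constexpr size_t kTemporaryBufferAlignment = 1024;

// Round value up to the next multiple of a power-of-two alignment.
inline size_t align_up(size_t value, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

// Hypothetical helper: computes the bytes one retained buffer needs so that
// every planned slice starts on a 1024-byte boundary.
inline size_t packed_size(const std::vector<TemporaryBufferPlanItem> &plan) {
    size_t offset = 0;
    for (const auto &item : plan) {
        if (item.bytes == 0) {
            continue;  // assumption: zero-byte tensors consume no capacity
        }
        offset = align_up(offset, kTemporaryBufferAlignment);
        offset += item.bytes;
    }
    return offset;
}
```

Because the last item is not padded, a plan of 100, 1024, and 1 bytes needs 2049 bytes rather than 3072; real acquire has to walk the same items in the same order for the plan to remain valid.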
+ +## Buffer State + +The implementation should store a single retained buffer, not a vector of +chunks: + +```cpp +Buffer buffer_; +size_t retained_bytes_; +size_t current_run_used_bytes_; +size_t high_water_used_bytes_; +bool enabled_; +bool active_; +``` + +Maintain these invariants: + +- `retained_bytes_ == buffer_.capacity`; +- `retained_bytes_ == 0` when `buffer_.raw_base == nullptr`; +- `buffer_.base` is 1024-byte aligned when non-null; +- `buffer_.offset` is reset to zero only after begin succeeds; +- `current_run_used_bytes_` is reset to zero only after begin succeeds; +- real acquire increments `current_run_used_bytes_` by padding plus bytes; +- `end_temporary_buffer_run()` updates `high_water_used_bytes_`; +- clear/finalize releases `raw_base` and resets all retained-buffer state. + +Useful diagnostics are: + +- `retained_bytes`; +- `high_water_used_bytes`; +- `realloc_count`; +- `realloc_failed_count`; +- `buffer_backed_allocation_count`. + +Do not expose a public budget getter. + +## Begin-Run Resize Logic + +`begin_temporary_buffer_run(plan)` owns the resize decision: + +```text +if AUTO is disabled: + return false + +if active_ is true: + fail clearly; do not reset offset + return false + +required = packed_size(plan, alignment=1024) + +if retained_bytes_ >= required: + buffer_.offset = 0 + current_run_used_bytes_ = 0 + active_ = true + return true + +free existing retained buffer +retained_bytes_ = 0 + +if required == 0: + buffer_.offset = 0 + current_run_used_bytes_ = 0 + active_ = true + return true + +allocate one new retained buffer with usable capacity >= required +if allocation fails: + active_ = false + return false + +buffer_.offset = 0 +retained_bytes_ = new usable capacity +current_run_used_bytes_ = 0 +active_ = true +return true +``` + +This is intentionally not transactional with respect to the old retained +buffer. If a larger run requires resize and the new allocation fails, the old +retained buffer has already been released. 
That follows the required +free-then-allocate behavior and avoids keeping two large temporary buffers +alive at once. + +## Real Acquire Logic + +After begin succeeds, real acquire is a single-buffer bump allocator: + +```text +if not active: + fail + +alignment = max(requested_alignment, 1024) +aligned = align_up(buffer_.offset, alignment) + +if bytes does not fit in buffer_.capacity - aligned: + fail clearly + +ptr = buffer_.base + aligned +buffer_.offset = aligned + bytes +current_run_used_bytes_ += aligned - old_offset + bytes +return ptr +``` + +The caller must pass 1024 for temporary tensor slices. The implementation +should still validate that any requested alignment is a power of two and use +at least 1024. + +## Cleanup And Lifetime + +Release the retained buffer when: + +- AUTO is disabled; +- an explicit clear path is called; +- runner/device context finalizes; +- `begin_temporary_buffer_run(plan)` needs a larger buffer. + +Do not shrink merely because a later run is smaller. Smaller later runs reuse +the larger retained buffer until one of the release events above occurs. + +If finalize sees an active temporary-buffer run, log a programming error and +still release the retained buffer before allocator teardown. + +## Implementation Steps + +1. Update `TemporaryVariableBuffer`. + - Replace chunk-vector state with a single retained buffer. + - Remove suffix growth and repeated simulation. + - Add 1024-byte alignment for base and slices. + - Add packed-size computation for the whole run plan. + - Implement free-then-allocate resize in begin-run. + +2. Update onboard and sim `DeviceRunnerBase`. + - Keep AUTO enable/disable APIs. + - Remove chunk-specific diagnostics. + - Report retained bytes, high-water, realloc count, and realloc failures. + +3. Update common `HostApi`. + - Keep `TemporaryBufferPlanItem`. + - Keep `temporary_buffer_enabled()`. + - Keep plan-based `begin_temporary_buffer_run(items, item_count)`. 
+ - Do not restore `temporary_buffer_budget()`. + +4. Update TRB bind path for a2a3 and a5. + - Build the plan from ordinary non-child tensors before staging. + - Use 1024-byte alignment in the plan and real acquire. + - Begin AUTO run before staging. + - Fail clearly if begin or acquire fails. + - Keep child-memory, H2D, memset, and copy-back semantics unchanged. + +5. Update Python/C++ public API. + - Keep mode-based configuration, for example + `configure_temporary_buffer_auto(bool enabled)`. + - Keep `temporary_buffer_mode = "off" | "auto"`. + - Do not reintroduce caller-provided byte budgets. + +6. Update tests. + - Cover initial empty AUTO begin and allocation. + - Cover same-shape reuse with no realloc. + - Cover larger later run freeing old buffer and allocating one new buffer. + - Cover smaller later run not shrinking. + - Cover allocation failure after old buffer is freed. + - Cover 1024-byte base and slice alignment. + - Keep TRB child-memory, OUT memset, and error-cleanup regressions. + +## Test Plan + +Run focused unit tests first: + +```text +tests/ut/cpp/common/test_temporary_variable_buffer.cpp +tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp +tests/ut/py/test_chip_worker.py +tests/ut/py/test_worker/test_host_worker.py +``` + +Then run TRB prepared-callable coverage for both architectures where +available: + +```text +a2a3 TRB prepared-callable ST +a5 TRB prepared-callable ST +``` + +Hardware tests must use `task-submit`. + +For performance validation, use Qwen3 Path A with the same matrix already +requested for PR 1198: + +- skill-default model/input/output setting; +- batch size 1 and 16; +- short input and 256-token input; +- output length 20, 256, and 512; +- compare AUTO enabled vs disabled on the same NPU where possible. + +## Acceptance Criteria + +- No public caller-provided temporary-buffer byte budget remains. +- AUTO starts empty and does not allocate until the first planned run. 
+- The retained temporary buffer is a single allocation, not retained chunks. +- All temporary-buffer slice addresses are 1024-byte aligned. +- Same-shape repeated runs reuse the retained buffer without reallocating. +- A larger later run frees the old retained buffer before allocating a new + one. +- A smaller later run does not shrink the retained buffer. +- Allocation failure during resize leaves no retained old buffer behind. +- Acquire failure after successful begin fails clearly and never falls back to + ordinary malloc. +- Child-memory pass-through, OUT memset, and copy-back semantics are + unchanged. diff --git a/docs/trb-auto-temporary-buffer-modification-plan.md b/docs/trb-auto-temporary-buffer-modification-plan.md new file mode 100644 index 000000000..58b5ed625 --- /dev/null +++ b/docs/trb-auto-temporary-buffer-modification-plan.md @@ -0,0 +1,340 @@ +# TRB AUTO Temporary Buffer Modification Plan + +**Date**: 2026-07-01 +**Status**: implementation modification plan + +## Purpose + +This document describes how to modify the existing PR 1198 temporary-buffer +implementation from an explicit byte-budget design to an AUTO self-sizing +design. + +The existing design plan remains in +`docs/trb-serial-tensor-buffer-pool-plan.md`. This document is the concrete +change plan for updating code, tests, and public API surface. + +## Target Behavior + +Temporary buffering has two modes: + +- `off`: keep the current per-run `device_malloc()` / `device_free()` path. +- `auto`: enable retained temporary chunks that grow from observed run plans. + +The default mode is `off`, preserving existing behavior unless the caller +explicitly enables AUTO mode. AUTO mode must not require callers to provide +`max_temporary_buffer_bytes`. The buffer starts empty, grows when a run cannot +fit in retained chunks, and does not automatically shrink. + +After steady decode shapes converge, AUTO should perform no temporary-tensor +device allocation or free on repeated same-shape runs. 
+ +## Design Changes + +### Remove Explicit Budget Semantics + +Remove public and internal behavior that treats a numeric byte budget as the +configuration contract: + +- public `max_temporary_buffer_bytes` worker config; +- public `configure_temporary_buffer(bytes)` budget API; +- public `temporary_buffer_budget` getter, unless replaced by diagnostic-only + retained/high-water reporting; +- fail-fast `"required X / configured Y"` budget-exceeded path. + +Do not replace the budget with a hidden large number. AUTO must be represented +as AUTO, not as a disguised explicit budget. + +### Add AUTO Mode Configuration + +Expose mode configuration instead of byte sizing. Acceptable shapes are: + +```text +temporary_buffer_mode = "off" | "auto" +``` + +or an equivalent bool/enum API: + +```cpp +configure_temporary_buffer_auto(bool enabled); +``` + +Rules: + +- enabling AUTO does not allocate retained HBM immediately; +- disabling AUTO clears retained chunks when no run is active; +- reconfiguration while a temporary-buffer run is active fails clearly; +- default configuration is `off`; +- `worker.malloc()` and `worker.free()` semantics stay unchanged. + +### Add Run Planning + +Before staging tensors in TRB bind, build a plan using the exact same tensor +filtering and ordering as real acquire: + +```text +for tensor in orch_args, in real bind order: + if tensor.is_child_memory(): + skip + else: + append {bytes=tensor.nbytes(), alignment=default_alignment} +``` + +The plan includes ordinary non-child input, INOUT, and output tensors. Child +memory stays pass-through and is not included. 
+ +### Change Host API Shape + +Replace budget-based HostApi usage with plan-based AUTO callbacks: + +```cpp +struct TemporaryBufferPlanItem { + size_t bytes; + size_t alignment; +}; + +bool (*temporary_buffer_enabled)(); +bool (*begin_temporary_buffer_run)( + const TemporaryBufferPlanItem *items, size_t item_count); +void *(*acquire_temporary_buffer_slice)(size_t bytes, size_t alignment); +void (*end_temporary_buffer_run)(); +``` + +`begin_temporary_buffer_run()` performs simulation and any required growth. +After it succeeds, real `acquire_temporary_buffer_slice()` should only perform +first-fit bump allocation over retained chunks. + +### Implement Simulation-Based Growth + +The buffer owns retained chunks: + +```cpp +struct Chunk { + void *raw_base; + void *base; + size_t capacity; + size_t offset; +}; + +std::vector<Chunk> chunks_; +size_t retained_bytes_; +size_t current_run_used_bytes_; +size_t high_water_used_bytes_; +``` + +Maintain this invariant: + +```text +retained_bytes_ == sum(chunk.capacity for chunk in chunks_) +``` + +Counter lifecycle: + +- `retained_bytes_` is updated only when a retained chunk allocation succeeds + or when chunks are cleared; +- `current_run_used_bytes_` is reset to zero when begin succeeds; +- real acquire adds consumed bytes, including alignment padding, to + `current_run_used_bytes_`; +- `end_temporary_buffer_run()` updates `high_water_used_bytes_` from + `current_run_used_bytes_`; +- clear/finalize resets both run counters.
+ +Growth happens in `begin_temporary_buffer_run(plan)`: + +```text +if AUTO is disabled: + return false + +if active_ is true: + fail clearly; do not reset offsets + return false + +checkpoint chunk count and retained_bytes_ +simulate plan against retained chunks + +if simulation succeeds: + reset real chunk offsets + current_run_used_bytes_ = 0 + mark active + return true + +if simulation fails at item i: + remaining = packed_size_in_empty_chunk(plan[i:]) + retained = retained_bytes_ + if retained == 0: + new_chunk_size = remaining + else: + new_chunk_size = max(retained, remaining) + allocate one new retained chunk + retained_bytes_ += new_chunk_size + repeat simulation from the beginning + +if allocating a new chunk fails: + free chunks allocated after the checkpoint + restore chunk count and retained_bytes_ + active_ remains false + return false +``` + +`packed_size_in_empty_chunk()` uses the same alignment rule as real acquire: + +```text +offset = 0 +for item in suffix: + offset = align_up(offset, item.alignment) + offset += item.bytes +return offset +``` + +This rule avoids per-tensor grow in the real bind path, handles repeated +large tensors, and lets retained capacity approximately double when it already +exists. + +Growth is transactional at begin-run granularity. Newly allocated chunks are +committed only if `begin_temporary_buffer_run(plan)` succeeds. If growth fails, +only chunks allocated during that begin attempt are released; older retained +chunks remain available for later runs. + +Do not add a fixed MiB-size chunk granularity. Correctness comes from tensor +offset alignment. The implementation may align `new_chunk_size` to the default +slice alignment for simpler arithmetic. + +This plan assumes every tensor `nbytes()` value and aggregate temporary-buffer +plan size fits in `size_t`; handling values outside the `size_t` range is out +of scope for this modification. 
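The simulation-and-growth rule described above can be sketched on chunk capacities alone. This is an illustrative model, not the patch's implementation: chunks are reduced to a `std::vector<size_t>` of capacities, every chunk base is assumed to be aligned already, allocation failure and rollback are not modeled, and `simulate`, `next_chunk_size`, and `grow_until_fit` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct PlanItem {
    size_t bytes;
    size_t alignment;
};

inline size_t align_up(size_t value, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

// Packed size of plan[first:] when placed into one empty chunk.
inline size_t packed_size_in_empty_chunk(const std::vector<PlanItem> &plan, size_t first) {
    size_t offset = 0;
    for (size_t i = first; i < plan.size(); ++i) {
        offset = align_up(offset, plan[i].alignment);
        offset += plan[i].bytes;
    }
    return offset;
}

// First-fit bump simulation. Returns the index of the first item that does
// not fit, or plan.size() when the whole plan fits.
inline size_t simulate(const std::vector<size_t> &capacities, const std::vector<PlanItem> &plan) {
    std::vector<size_t> offsets(capacities.size(), 0);
    for (size_t i = 0; i < plan.size(); ++i) {
        bool placed = false;
        for (size_t c = 0; c < capacities.size() && !placed; ++c) {
            const size_t aligned = align_up(offsets[c], plan[i].alignment);
            if (aligned <= capacities[c] && plan[i].bytes <= capacities[c] - aligned) {
                offsets[c] = aligned + plan[i].bytes;
                placed = true;
            }
        }
        if (!placed) {
            return i;
        }
    }
    return plan.size();
}

// Growth rule from the plan: an empty buffer takes exactly the remaining
// suffix; otherwise the new chunk is at least as large as what is retained.
inline size_t next_chunk_size(size_t retained, size_t remaining) {
    return retained == 0 ? remaining : std::max(retained, remaining);
}

// Adds chunks until the plan fits and returns the resulting capacities.
inline std::vector<size_t> grow_until_fit(std::vector<size_t> capacities, const std::vector<PlanItem> &plan) {
    size_t retained = 0;
    for (size_t capacity : capacities) {
        retained += capacity;
    }
    for (;;) {
        const size_t failed = simulate(capacities, plan);
        if (failed == plan.size()) {
            return capacities;
        }
        const size_t new_size = next_chunk_size(retained, packed_size_in_empty_chunk(plan, failed));
        capacities.push_back(new_size);
        retained += new_size;
    }
}
```

In this model a repeated same-shape plan adds no chunk on the second call, which is the steady-state property the plan aims for.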
+ +### Preserve Real Acquire Semantics + +After planning succeeds, real acquire uses the same first-fit bump rule: + +```text +for chunk in chunks: + aligned = align_up(chunk.offset, alignment) + if bytes fits in chunk.capacity - aligned: + return chunk.base + aligned +return nullptr +``` + +A null return after successful planning is a plan/acquire mismatch. It must +fail clearly and run normal cleanup. It must not silently fall back to +ordinary `device_malloc()`. + +### Cleanup And Lifetime + +AUTO chunks are retained across runs and are not automatically shrunk. + +Release retained chunks only when: + +- AUTO is disabled; +- an explicit clear path is called; +- runner/device context finalizes. + +If finalize sees an active temporary-buffer run, log a programming error and +still release retained chunks before allocator teardown. + +## Implementation Steps + +1. Update C++ `TemporaryVariableBuffer`. + - Replace budget configuration with AUTO enable/disable. + - Add plan-item simulation. + - Add suffix-size growth. + - Store and maintain `retained_bytes_`. + - Remove budget-exceeded error state. + +2. Update onboard and sim `DeviceRunnerBase`. + - Rename budget methods to AUTO-mode methods. + - Keep clear/finalize behavior. + - Expose diagnostics for retained bytes, high-water, grow count, and grow + failure count. + +3. Update common `HostApi`. + - Add `TemporaryBufferPlanItem`. + - Replace `temporary_buffer_budget()` usage with + `temporary_buffer_enabled()`. + - Change `begin_temporary_buffer_run()` to accept plan items. + - Wire both onboard and sim c-api shared implementations. + +4. Update TRB bind path for a2a3 and a5. + - Build the plan from ordinary non-child tensors before staging. + - Begin AUTO run with that plan. + - Acquire slices in the same order used by the plan. + - Keep child-memory pass-through unchanged. + - Keep H2D, memset, and copy-back semantics unchanged. + +5. Update Python/C++ public API. + - Remove byte-budget config paths in this PR. 
+ - Add mode-based config for Worker and ChipWorker. + - Ensure level-3 child process setup forwards AUTO mode, not bytes. + +6. Update tests. + - Convert budget tests into AUTO growth tests. + - Add repeated max-size tensor simulation coverage. + - Add no-shrink coverage. + - Add grow-failure no-fallback coverage. + - Keep child-memory, OUT memset, and error-cleanup regressions. + +7. Update docs and PR metadata. + - Keep the main plan consistent with AUTO semantics. + - Remove references that present explicit byte budget as the target API. + - Update PR title/body from docs-only to implementation feature work. + +## Test Plan + +Run focused unit tests first: + +```text +tests/ut/cpp/common/test_temporary_variable_buffer.cpp +tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp +tests/ut/py/test_chip_worker.py +tests/ut/py/test_worker/test_host_worker.py +``` + +Then run TRB prepared-callable coverage for both architectures where +available: + +```text +a2a3 TRB prepared-callable ST +a5 TRB prepared-callable ST +``` + +Hardware tests must use `task-submit`. + +For performance validation, use Qwen3 Path A: + +- steady decode same-shape run, to confirm warmup then zero temporary-tensor + allocation/free; +- short input and 256-token context prefill cases, to quantify AUTO grow + jitter and p99 impact; +- output lengths that include short and long decode, so growth effects are not + confused with decode kernel timing. + +## Acceptance Criteria + +- No public caller-provided temporary-buffer byte budget remains. +- AUTO starts empty and does not allocate until the first planned run. +- Same-shape repeated runs reuse retained chunks without additional growth. +- Larger later runs grow and then stabilize. +- Smaller later runs do not shrink retained chunks. +- A failed begin rolls back chunks allocated during that begin attempt. +- Grow failure fails clearly and does not fall back to ordinary malloc. 
+- `retained_bytes_` remains equal to the sum of retained chunk capacities. +- `current_run_used_bytes_` and `high_water_used_bytes_` follow the documented + lifecycle. +- H2D/D2H bytes remain unchanged and explainable. +- Child-memory semantics remain unchanged. +- `worker.malloc()` / `worker.free()` semantics remain unchanged. +- Steady decode allocation/free count drops materially after warmup. +- Prefill growth jitter is measured and reported separately. + +## Risks + +AUTO gives up the explicit-budget fail-fast diagnostic. A true memory-pressure +failure becomes an allocator grow failure, so the error must include requested +tensor bytes, remaining suffix bytes, retained bytes, chunk count, and the +underlying allocator status. + +AUTO can allocate during prefill when sequence length grows. That may add +latency jitter in the same phase where timeout-related failures have been +observed. Prefill must be measured separately from steady decode. + +AUTO does not provide a bounded-memory serving contract. If serving needs a +hard memory cap, that is a separate design from this PR. diff --git a/python/bindings/task_interface.cpp b/python/bindings/task_interface.cpp index 02276b59c..15a240058 100644 --- a/python/bindings/task_interface.cpp +++ b/python/bindings/task_interface.cpp @@ -918,14 +918,8 @@ NB_MODULE(_task_interface, m) { "host-orchestration path; 0 on device-orch variants." ) .def( - "configure_temporary_buffer", &ChipWorker::configure_temporary_buffer, - nb::arg("max_temporary_buffer_bytes"), - "Configure the runner-scoped TRB temporary variable buffer. " - "Pass 0 to disable and return to per-run malloc/free." - ) - .def_prop_ro( - "temporary_buffer_budget", &ChipWorker::temporary_buffer_budget, - "Configured temporary-buffer budget in bytes, or 0 when disabled." + "configure_temporary_buffer_auto", &ChipWorker::configure_temporary_buffer_auto, nb::arg("enabled") = true, + "Enable or disable the runner-scoped TRB AUTO temporary variable buffer." 
) .def("malloc", &ChipWorker::malloc, nb::arg("size")) .def("free", &ChipWorker::free, nb::arg("ptr")) diff --git a/python/simpler/task_interface.py b/python/simpler/task_interface.py index 3dd290ac6..1673f08ca 100644 --- a/python/simpler/task_interface.py +++ b/python/simpler/task_interface.py @@ -1192,22 +1192,9 @@ def host_dlopen_count(self): """Number of host-side orch SO dlopens (host_build_graph variants).""" return self._impl.host_dlopen_count - def configure_temporary_buffer(self, max_temporary_buffer_bytes: int) -> None: - """Configure the runner-scoped TRB temporary variable buffer. - - ``0`` disables the optimization and keeps the existing per-run - malloc/free path. A positive value is an aggregate byte budget for - ordinary non-child tensors in one ``run_prepared`` invocation. - """ - budget = int(max_temporary_buffer_bytes) - if budget < 0: - raise ValueError("max_temporary_buffer_bytes must be non-negative") - self._impl.configure_temporary_buffer(budget) - - @property - def temporary_buffer_budget(self) -> int: - """Configured temporary-buffer budget in bytes, or 0 when disabled.""" - return int(self._impl.temporary_buffer_budget) + def configure_temporary_buffer_auto(self, enabled: bool = True) -> None: + """Enable or disable the runner-scoped TRB AUTO temporary variable buffer.""" + self._impl.configure_temporary_buffer_auto(bool(enabled)) def malloc(self, size): """Allocate memory. 
Returns a pointer (uint64).""" diff --git a/python/simpler/worker.py b/python/simpler/worker.py index 428267db7..bee0930d8 100644 --- a/python/simpler/worker.py +++ b/python/simpler/worker.py @@ -273,7 +273,14 @@ class _ChipProcessConfig: log_info_v: int = 5 platform: str = "" runtime: str = "" - max_temporary_buffer_bytes: int = 0 + temporary_buffer_mode: str = "off" + + +def _normalize_temporary_buffer_mode(mode: Any) -> str: + normalized = str(mode).lower() + if normalized not in ("off", "auto"): + raise ValueError("temporary_buffer_mode must be 'off' or 'auto'") + return normalized @dataclass(frozen=True) @@ -1176,11 +1183,8 @@ def _chip_process_loop( try: cw = ChipWorker() cw.init(device_id, bins, log_level=config.log_level, log_info_v=config.log_info_v) - temporary_buffer_budget = int(config.max_temporary_buffer_bytes) - if temporary_buffer_budget < 0: - raise ValueError("max_temporary_buffer_bytes must be non-negative") - if temporary_buffer_budget > 0: - cw.configure_temporary_buffer(temporary_buffer_budget) + if _normalize_temporary_buffer_mode(config.temporary_buffer_mode) == "auto": + cw.configure_temporary_buffer_auto(True) except Exception as e: _tb.print_exc() # Write the message so any parent reader that *does* inspect this @@ -1374,6 +1378,9 @@ def __init__( **config, ) -> None: self.level = level + if "max_temporary_buffer_bytes" in config: + raise ValueError("max_temporary_buffer_bytes has been removed; use temporary_buffer_mode='auto'") + config["temporary_buffer_mode"] = _normalize_temporary_buffer_mode(config.get("temporary_buffer_mode", "off")) self._config = config self._callable_registry: dict[int, Any] = {} self._identity_registry: dict[bytes, _CallableIdentityState] = {} @@ -2889,11 +2896,8 @@ def _init_level2(self) -> None: self._chip_worker = ChipWorker() self._chip_worker.init(device_id, binaries) - max_temporary_buffer_bytes = int(self._config.get("max_temporary_buffer_bytes", 0)) - if max_temporary_buffer_bytes < 0: - raise 
ValueError("Worker max_temporary_buffer_bytes must be non-negative") - if max_temporary_buffer_bytes > 0: - self._chip_worker.configure_temporary_buffer(max_temporary_buffer_bytes) + if _normalize_temporary_buffer_mode(self._config.get("temporary_buffer_mode", "off")) == "auto": + self._chip_worker.configure_temporary_buffer_auto(True) # Pre-warm any registered ChipCallable so the first run(handle, …) # does not pay the H2D upload cost. @@ -2906,9 +2910,7 @@ def _init_hierarchical(self) -> None: device_ids = self._config.get("device_ids", []) n_sub = self._config.get("num_sub_workers", 0) heap_ring_size = self._config.get("heap_ring_size", None) - max_temporary_buffer_bytes = int(self._config.get("max_temporary_buffer_bytes", 0)) - if max_temporary_buffer_bytes < 0: - raise ValueError("Worker max_temporary_buffer_bytes must be non-negative") + _normalize_temporary_buffer_mode(self._config.get("temporary_buffer_mode", "off")) if self.level >= 4 and device_ids: raise RuntimeError("Worker level >= 4 must use add_worker(); device_ids are only supported on L3 Workers") @@ -2992,9 +2994,7 @@ def _start_hierarchical(self) -> None: # noqa: PLR0912 -- three parallel fork l """Fork child processes and start C++ scheduler. Called on first run().""" device_ids = self._config.get("device_ids", []) n_sub = self._config.get("num_sub_workers", 0) - max_temporary_buffer_bytes = int(self._config.get("max_temporary_buffer_bytes", 0)) - if max_temporary_buffer_bytes < 0: - raise ValueError("Worker max_temporary_buffer_bytes must be non-negative") + temporary_buffer_mode = _normalize_temporary_buffer_mode(self._config.get("temporary_buffer_mode", "off")) try: # Fork children from an immutable snapshot. 
The state transition @@ -3055,7 +3055,7 @@ def _start_hierarchical(self) -> None: # noqa: PLR0912 -- three parallel fork l log_info_v=chip_log_info_v, platform=str(self._config["platform"]), runtime=str(self._config["runtime"]), - max_temporary_buffer_bytes=max_temporary_buffer_bytes, + temporary_buffer_mode=temporary_buffer_mode, ), ) os._exit(0) @@ -3814,18 +3814,15 @@ def copy_from(self, dst: int, src: int, size: int, worker_id: int = 0) -> None: assert self._orch is not None self._orch.copy_from(worker_id, dst, src, size) - def configure_temporary_buffer(self, max_temporary_buffer_bytes: int) -> None: - """Configure the TRB temporary variable buffer for this Worker.""" - budget = int(max_temporary_buffer_bytes) - if budget < 0: - raise ValueError("max_temporary_buffer_bytes must be non-negative") + def configure_temporary_buffer_auto(self, enabled: bool = True) -> None: + """Enable or disable the TRB AUTO temporary variable buffer.""" if self.level not in (2, 3): - raise NotImplementedError("Worker.configure_temporary_buffer currently supports level 2 and level 3 only") + raise NotImplementedError("Worker.configure_temporary_buffer_auto supports level 2 and level 3 only") if self.level == 3 and self._hierarchical_start_state == "started": - raise RuntimeError("Worker.configure_temporary_buffer for level 3 must be called before hierarchy startup") - self._config["max_temporary_buffer_bytes"] = budget + raise RuntimeError("Worker.configure_temporary_buffer_auto for level 3 must be called before hierarchy startup") + self._config["temporary_buffer_mode"] = "auto" if enabled else "off" if self._chip_worker is not None: - self._chip_worker.configure_temporary_buffer(budget) + self._chip_worker.configure_temporary_buffer_auto(enabled) # ------------------------------------------------------------------ # run — uniform entry point @@ -3933,11 +3930,9 @@ def host_dlopen_count(self) -> int: return self._chip_worker.host_dlopen_count @property - def 
temporary_buffer_budget(self) -> int:
-        """L2 only: configured TRB temporary-buffer budget in bytes."""
-        if self.level != 2 or self._chip_worker is None:
-            return int(self._config.get("max_temporary_buffer_bytes", 0))
-        return self._chip_worker.temporary_buffer_budget
+    def temporary_buffer_mode(self) -> str:
+        """Configured TRB temporary-buffer mode: ``off`` or ``auto``."""
+        return _normalize_temporary_buffer_mode(self._config.get("temporary_buffer_mode", "off"))

     # ------------------------------------------------------------------
     # close
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
index 386b8286a..12302bc38 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
@@ -451,6 +451,20 @@ static bool derive_arena_static_sizes(const ArenaSizingConfig &sizing, ArenaStat
     return true;
 }

+static void
+build_temporary_buffer_plan(const ChipStorageTaskArgs *orch_args, std::vector<TemporaryBufferPlanItem> *out) {
+    out->clear();
+    int tensor_count = orch_args->tensor_count();
+    out->reserve(tensor_count);
+    for (int i = 0; i < tensor_count; i++) {
+        Tensor t = orch_args->tensor(i);
+        if (t.is_child_memory() || t.nbytes() == 0) {
+            continue;
+        }
+        out->push_back({static_cast<size_t>(t.nbytes()), TemporaryVariableBuffer::kTemporaryBufferAlignment});
+    }
+}
+
 // per-run: the only signature-aware step. Copy the orch args, replacing each
 // host tensor pointer with a freshly staged device pointer (H2D copy-in, or an
 // on-device zero for pure-OUTPUT buffers), and record the host/device pair for
@@ -459,7 +473,7 @@ static bool derive_arena_static_sizes(const ArenaSizingConfig &sizing, ArenaStat
 // frees them in validate_runtime_impl.
static bool stage_device_args( Runtime *runtime, const HostApi *api, const ChipStorageTaskArgs *orch_args, const ArgDirection *signature, - int sig_count, bool use_temporary_buffer, size_t temporary_buffer_budget, ChipStorageTaskArgs *out + int sig_count, bool use_temporary_buffer, ChipStorageTaskArgs *out ) { int tensor_count = orch_args->tensor_count(); int scalar_count = orch_args->scalar_count(); @@ -477,17 +491,19 @@ static bool stage_device_args( void *host_ptr = reinterpret_cast(static_cast(t.buffer.addr)); size_t size = static_cast(t.nbytes()); + if (size == 0) { + t.buffer.addr = 0; + out->add_tensor(t); + continue; + } void *dev_ptr = nullptr; TensorReleaseKind release_kind = TensorReleaseKind::Free; if (use_temporary_buffer) { - dev_ptr = api->acquire_temporary_buffer_slice(size, TemporaryVariableBuffer::kDefaultAlignment); + dev_ptr = api->acquire_temporary_buffer_slice(size, TemporaryVariableBuffer::kTemporaryBufferAlignment); release_kind = TensorReleaseKind::BufferNoop; if (dev_ptr == nullptr) { - LOG_ERROR( - "Temporary buffer acquire failed for tensor %d: tensor bytes=%zu configured bytes=%zu", i, size, - temporary_buffer_budget - ); + LOG_ERROR("AUTO temporary buffer acquire failed for tensor %d: tensor bytes=%zu", i, size); return false; } } else { @@ -762,18 +778,21 @@ extern "C" int bind_callable_to_runtime_impl( return -1; } - size_t temporary_buffer_budget = api->temporary_buffer_budget == nullptr ? 
0 : api->temporary_buffer_budget();
-    bool use_temporary_buffer = temporary_buffer_budget > 0;
+    bool use_temporary_buffer = api->temporary_buffer_enabled != nullptr && api->temporary_buffer_enabled();
     if (use_temporary_buffer &&
         (api->begin_temporary_buffer_run == nullptr || api->acquire_temporary_buffer_slice == nullptr ||
          api->end_temporary_buffer_run == nullptr)) {
-        LOG_ERROR("Temporary buffer budget is configured but HostApi temporary-buffer callbacks are not wired");
+        LOG_ERROR("AUTO temporary buffer is enabled but HostApi temporary-buffer callbacks are not wired");
         return -1;
     }

+    std::vector<TemporaryBufferPlanItem> temporary_buffer_plan;
     bool temp_run_active = false;
     if (use_temporary_buffer) {
-        if (!api->begin_temporary_buffer_run()) {
+        build_temporary_buffer_plan(orch_args, &temporary_buffer_plan);
+        const TemporaryBufferPlanItem *plan_data =
+            temporary_buffer_plan.empty() ? nullptr : temporary_buffer_plan.data();
+        if (!api->begin_temporary_buffer_run(plan_data, temporary_buffer_plan.size())) {
             LOG_ERROR("Failed to begin temporary buffer run");
             return -1;
         }
@@ -788,9 +807,7 @@ extern "C" int bind_callable_to_runtime_impl(
     });

     ChipStorageTaskArgs device_args;
-    if (!stage_device_args(
-            runtime, api, orch_args, signature, sig_count, use_temporary_buffer, temporary_buffer_budget, &device_args
-        )) {
+    if (!stage_device_args(runtime, api, orch_args, signature, sig_count, use_temporary_buffer, &device_args)) {
         return -1;
     }

diff --git a/src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp b/src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
index 386b8286a..12302bc38 100644
--- a/src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
+++ b/src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
@@ -451,6 +451,20 @@ static bool derive_arena_static_sizes(const ArenaSizingConfig &sizing, ArenaStat
     return true;
 }

+static void
+build_temporary_buffer_plan(const ChipStorageTaskArgs *orch_args, std::vector<TemporaryBufferPlanItem> *out) {
+    out->clear();
+    int tensor_count
= orch_args->tensor_count(); + out->reserve(tensor_count); + for (int i = 0; i < tensor_count; i++) { + Tensor t = orch_args->tensor(i); + if (t.is_child_memory() || t.nbytes() == 0) { + continue; + } + out->push_back({static_cast(t.nbytes()), TemporaryVariableBuffer::kTemporaryBufferAlignment}); + } +} + // per-run: the only signature-aware step. Copy the orch args, replacing each // host tensor pointer with a freshly staged device pointer (H2D copy-in, or an // on-device zero for pure-OUTPUT buffers), and record the host/device pair for @@ -459,7 +473,7 @@ static bool derive_arena_static_sizes(const ArenaSizingConfig &sizing, ArenaStat // frees them in validate_runtime_impl. static bool stage_device_args( Runtime *runtime, const HostApi *api, const ChipStorageTaskArgs *orch_args, const ArgDirection *signature, - int sig_count, bool use_temporary_buffer, size_t temporary_buffer_budget, ChipStorageTaskArgs *out + int sig_count, bool use_temporary_buffer, ChipStorageTaskArgs *out ) { int tensor_count = orch_args->tensor_count(); int scalar_count = orch_args->scalar_count(); @@ -477,17 +491,19 @@ static bool stage_device_args( void *host_ptr = reinterpret_cast(static_cast(t.buffer.addr)); size_t size = static_cast(t.nbytes()); + if (size == 0) { + t.buffer.addr = 0; + out->add_tensor(t); + continue; + } void *dev_ptr = nullptr; TensorReleaseKind release_kind = TensorReleaseKind::Free; if (use_temporary_buffer) { - dev_ptr = api->acquire_temporary_buffer_slice(size, TemporaryVariableBuffer::kDefaultAlignment); + dev_ptr = api->acquire_temporary_buffer_slice(size, TemporaryVariableBuffer::kTemporaryBufferAlignment); release_kind = TensorReleaseKind::BufferNoop; if (dev_ptr == nullptr) { - LOG_ERROR( - "Temporary buffer acquire failed for tensor %d: tensor bytes=%zu configured bytes=%zu", i, size, - temporary_buffer_budget - ); + LOG_ERROR("AUTO temporary buffer acquire failed for tensor %d: tensor bytes=%zu", i, size); return false; } } else { @@ -762,18 +778,21 @@ 
extern "C" int bind_callable_to_runtime_impl( return -1; } - size_t temporary_buffer_budget = api->temporary_buffer_budget == nullptr ? 0 : api->temporary_buffer_budget(); - bool use_temporary_buffer = temporary_buffer_budget > 0; + bool use_temporary_buffer = api->temporary_buffer_enabled != nullptr && api->temporary_buffer_enabled(); if (use_temporary_buffer && (api->begin_temporary_buffer_run == nullptr || api->acquire_temporary_buffer_slice == nullptr || api->end_temporary_buffer_run == nullptr)) { - LOG_ERROR("Temporary buffer budget is configured but HostApi temporary-buffer callbacks are not wired"); + LOG_ERROR("AUTO temporary buffer is enabled but HostApi temporary-buffer callbacks are not wired"); return -1; } + std::vector temporary_buffer_plan; bool temp_run_active = false; if (use_temporary_buffer) { - if (!api->begin_temporary_buffer_run()) { + build_temporary_buffer_plan(orch_args, &temporary_buffer_plan); + const TemporaryBufferPlanItem *plan_data = + temporary_buffer_plan.empty() ? nullptr : temporary_buffer_plan.data(); + if (!api->begin_temporary_buffer_run(plan_data, temporary_buffer_plan.size())) { LOG_ERROR("Failed to begin temporary buffer run"); return -1; } @@ -788,9 +807,7 @@ extern "C" int bind_callable_to_runtime_impl( }); ChipStorageTaskArgs device_args; - if (!stage_device_args( - runtime, api, orch_args, signature, sig_count, use_temporary_buffer, temporary_buffer_budget, &device_args - )) { + if (!stage_device_args(runtime, api, orch_args, signature, sig_count, use_temporary_buffer, &device_args)) { return -1; } diff --git a/src/common/platform/include/common/host_api.h b/src/common/platform/include/common/host_api.h index 65051d4e8..460472aed 100644 --- a/src/common/platform/include/common/host_api.h +++ b/src/common/platform/include/common/host_api.h @@ -14,6 +14,11 @@ #include #include +struct TemporaryBufferPlanItem { + size_t bytes; + size_t alignment; +}; + /** * Host API function pointers for device memory operations. 
* Allows a runtime to use pluggable device-memory backends. @@ -34,11 +39,11 @@ struct HostApi { // null on backends that don't wire it; callers must fall back to // copy_to_device. int (*device_memset)(void *dev_ptr, int value, size_t size); - // Runner-scoped temporary variable buffer. A zero budget disables the - // optimization. Only trb bind consumes these callbacks; public - // device_malloc/device_free keep real allocation semantics. - size_t (*temporary_buffer_budget)(); - bool (*begin_temporary_buffer_run)(); + // Runner-scoped AUTO temporary variable buffer. Only trb bind consumes + // these callbacks; public device_malloc/device_free keep real allocation + // semantics. + bool (*temporary_buffer_enabled)(); + bool (*begin_temporary_buffer_run)(const TemporaryBufferPlanItem *items, size_t item_count); void *(*acquire_temporary_buffer_slice)(size_t size, size_t alignment); void (*end_temporary_buffer_run)(); // Commit the three per-Worker pooled regions (PTO2 GM heap, PTO2 shared diff --git a/src/common/platform/include/host/temporary_variable_buffer.h b/src/common/platform/include/host/temporary_variable_buffer.h index 19feea589..34eeedd04 100644 --- a/src/common/platform/include/host/temporary_variable_buffer.h +++ b/src/common/platform/include/host/temporary_variable_buffer.h @@ -13,26 +13,29 @@ #include #include +#include #include #include #include -#include + +#include "common/host_api.h" class TemporaryVariableBuffer { public: using AllocFn = void *(*)(void *ctx, size_t size); using FreeFn = void (*)(void *ctx, void *ptr); - static constexpr size_t kDefaultAlignment = 32; + static constexpr size_t kTemporaryBufferAlignment = 1024; + static constexpr size_t kDefaultAlignment = kTemporaryBufferAlignment; struct Stats { - size_t configured_budget_bytes{0}; - size_t retained_chunk_count{0}; - size_t retained_chunk_bytes{0}; + bool enabled{false}; + size_t retained_bytes{0}; size_t current_run_used_bytes{0}; size_t high_water_used_bytes{0}; size_t 
buffer_backed_allocation_count{0}; - size_t budget_exceeded_count{0}; + size_t realloc_count{0}; + size_t realloc_failed_count{0}; bool active{false}; }; @@ -46,20 +49,19 @@ class TemporaryVariableBuffer { TemporaryVariableBuffer(const TemporaryVariableBuffer &) = delete; TemporaryVariableBuffer &operator=(const TemporaryVariableBuffer &) = delete; - bool configure(size_t max_temporary_buffer_bytes); - bool begin_run(); + bool configure_auto(bool enabled); + bool begin_run(const TemporaryBufferPlanItem *items, size_t item_count); void *acquire(size_t bytes, size_t alignment); void end_run(); void clear(); - bool enabled() const { return max_temporary_buffer_bytes_ > 0; } + bool enabled() const { return enabled_; } bool active() const { return active_; } - size_t budget() const { return max_temporary_buffer_bytes_; } Stats stats() const; const std::string &last_error() const { return last_error_; } private: - struct Chunk { + struct Buffer { void *raw_base{nullptr}; void *base{nullptr}; size_t capacity{0}; @@ -69,72 +71,104 @@ class TemporaryVariableBuffer { static bool is_power_of_two(size_t value) { return value != 0 && (value & (value - 1)) == 0; } - static size_t align_up(size_t value, size_t alignment) { return (value + alignment - 1) & ~(alignment - 1); } + static bool align_up_checked(size_t value, size_t alignment, size_t *out); + static bool align_ptr_checked(void *ptr, size_t alignment, void **out); - bool allocate_chunks(size_t budget); - bool allocate_chunk(size_t capacity, Chunk *out); - void release_chunks(); + bool validate_plan_item(const TemporaryBufferPlanItem &item); + bool packed_plan_size(const TemporaryBufferPlanItem *items, size_t item_count, size_t *out); + bool allocate_buffer(size_t required_bytes); + void release_buffer(); + void reset_run_state(); void set_error(std::string msg) { last_error_ = std::move(msg); } AllocFn alloc_{nullptr}; FreeFn free_{nullptr}; void *ctx_{nullptr}; - std::vector chunks_; - size_t 
max_temporary_buffer_bytes_{0};
-    size_t retained_chunk_bytes_{0};
+    Buffer buffer_;
     size_t current_run_used_bytes_{0};
     size_t high_water_used_bytes_{0};
     size_t buffer_backed_allocation_count_{0};
-    size_t budget_exceeded_count_{0};
+    size_t realloc_count_{0};
+    size_t realloc_failed_count_{0};
+    bool enabled_{false};
     bool active_{false};
     std::string last_error_;
 };

-inline bool TemporaryVariableBuffer::configure(size_t max_temporary_buffer_bytes) {
+inline bool TemporaryVariableBuffer::align_up_checked(size_t value, size_t alignment, size_t *out) {
+    if (out == nullptr || !is_power_of_two(alignment)) {
+        return false;
+    }
+    const size_t padding = alignment - 1;
+    if (value > std::numeric_limits<size_t>::max() - padding) {
+        return false;
+    }
+    *out = (value + padding) & ~padding;
+    return true;
+}
+
+inline bool TemporaryVariableBuffer::align_ptr_checked(void *ptr, size_t alignment, void **out) {
+    if (out == nullptr || ptr == nullptr || !is_power_of_two(alignment)) {
+        return false;
+    }
+    uintptr_t raw = reinterpret_cast<uintptr_t>(ptr);
+    const uintptr_t padding = static_cast<uintptr_t>(alignment - 1);
+    if (raw > std::numeric_limits<uintptr_t>::max() - padding) {
+        return false;
+    }
+    uintptr_t aligned = (raw + padding) & ~padding;
+    *out = reinterpret_cast<void *>(aligned);
+    return true;
+}
+
+inline bool TemporaryVariableBuffer::configure_auto(bool enabled) {
     if (active_) {
         set_error("cannot reconfigure temporary buffer while a run is active");
         return false;
     }
-    if (max_temporary_buffer_bytes == max_temporary_buffer_bytes_ &&
-        (max_temporary_buffer_bytes == 0 || !chunks_.empty())) {
+    if (enabled == enabled_) {
         last_error_.clear();
         return true;
     }
-    clear();
-    if (max_temporary_buffer_bytes == 0) {
-        return true;
-    }
-
-    max_temporary_buffer_bytes_ = max_temporary_buffer_bytes;
-    if (!allocate_chunks(max_temporary_buffer_bytes)) {
-        std::string error = last_error_;
+    enabled_ = enabled;
+    if (!enabled_) {
         clear();
-        last_error_ = std::move(error);
-        return false;
     }
     last_error_.clear();
     return true;
 }
-inline bool TemporaryVariableBuffer::begin_run() { +inline bool TemporaryVariableBuffer::begin_run(const TemporaryBufferPlanItem *items, size_t item_count) { if (active_) { set_error("temporary buffer run is already active"); return false; } - if (max_temporary_buffer_bytes_ == 0) { + if (!enabled_) { set_error("temporary buffer is disabled"); return false; } - if (chunks_.empty()) { - set_error("temporary buffer has no retained chunks"); + if (items == nullptr && item_count != 0) { + set_error("temporary buffer plan items pointer is null"); + return false; + } + + size_t required = 0; + if (!packed_plan_size(items, item_count, &required)) { return false; } - for (Chunk &chunk : chunks_) { - chunk.offset = 0; + + if (buffer_.capacity < required) { + release_buffer(); + if (required != 0 && !allocate_buffer(required)) { + ++realloc_failed_count_; + set_error("temporary buffer AUTO realloc failed: required bytes " + std::to_string(required)); + return false; + } } - current_run_used_bytes_ = 0; + + reset_run_state(); active_ = true; last_error_.clear(); return true; @@ -146,47 +180,38 @@ inline void *TemporaryVariableBuffer::acquire(size_t bytes, size_t alignment) { return nullptr; } if (alignment == 0) { - alignment = 1; + alignment = kTemporaryBufferAlignment; } if (!is_power_of_two(alignment)) { set_error("temporary buffer alignment must be a power of two"); return nullptr; } + alignment = std::max(alignment, kTemporaryBufferAlignment); - size_t min_padding = std::numeric_limits::max(); - for (Chunk &chunk : chunks_) { - const size_t aligned_offset = align_up(chunk.offset, alignment); - if (aligned_offset < chunk.offset) { - continue; - } - min_padding = std::min(min_padding, aligned_offset - chunk.offset); - if (bytes > chunk.capacity || aligned_offset > chunk.capacity - bytes) { - continue; - } - void *ptr = static_cast(chunk.base) + aligned_offset; - current_run_used_bytes_ += (aligned_offset - chunk.offset) + bytes; - chunk.offset = aligned_offset + bytes; - 
++buffer_backed_allocation_count_; - last_error_.clear(); - return ptr; + if (buffer_.base == nullptr) { + set_error("temporary buffer acquire requested with no retained buffer"); + return nullptr; } - ++budget_exceeded_count_; - if (min_padding == std::numeric_limits::max()) { - min_padding = 0; + size_t aligned_offset = 0; + if (!align_up_checked(buffer_.offset, alignment, &aligned_offset)) { + set_error("temporary buffer acquire alignment overflow"); + return nullptr; } - size_t required_bytes = std::numeric_limits::max(); - if (current_run_used_bytes_ <= std::numeric_limits::max() - min_padding) { - const size_t used_with_padding = current_run_used_bytes_ + min_padding; - if (bytes <= std::numeric_limits::max() - used_with_padding) { - required_bytes = used_with_padding + bytes; - } + if (bytes > buffer_.capacity || aligned_offset > buffer_.capacity - bytes) { + set_error( + "temporary buffer acquire missed after successful plan: tensor bytes " + std::to_string(bytes) + + ", retained bytes " + std::to_string(buffer_.capacity) + ); + return nullptr; } - set_error( - "temporary buffer budget exceeded: required bytes " + std::to_string(required_bytes) + ", configured bytes " + - std::to_string(max_temporary_buffer_bytes_) - ); - return nullptr; + + void *ptr = static_cast(buffer_.base) + aligned_offset; + current_run_used_bytes_ += (aligned_offset - buffer_.offset) + bytes; + buffer_.offset = aligned_offset + bytes; + ++buffer_backed_allocation_count_; + last_error_.clear(); + return ptr; } inline void TemporaryVariableBuffer::end_run() { @@ -200,80 +225,102 @@ inline void TemporaryVariableBuffer::end_run() { } inline void TemporaryVariableBuffer::clear() { - release_chunks(); - max_temporary_buffer_bytes_ = 0; - retained_chunk_bytes_ = 0; + release_buffer(); + enabled_ = false; current_run_used_bytes_ = 0; high_water_used_bytes_ = 0; buffer_backed_allocation_count_ = 0; - budget_exceeded_count_ = 0; + realloc_count_ = 0; + realloc_failed_count_ = 0; active_ = 
false; last_error_.clear(); } inline TemporaryVariableBuffer::Stats TemporaryVariableBuffer::stats() const { return Stats{ - max_temporary_buffer_bytes_, chunks_.size(), - retained_chunk_bytes_, current_run_used_bytes_, - high_water_used_bytes_, buffer_backed_allocation_count_, - budget_exceeded_count_, active_, + enabled_, + buffer_.capacity, + current_run_used_bytes_, + high_water_used_bytes_, + buffer_backed_allocation_count_, + realloc_count_, + realloc_failed_count_, + active_, }; } -inline bool TemporaryVariableBuffer::allocate_chunks(size_t budget) { - size_t remaining = budget; - size_t candidate = budget; - while (remaining > 0) { - if (candidate > remaining) { - candidate = remaining; - } - Chunk chunk; - if (allocate_chunk(candidate, &chunk)) { - retained_chunk_bytes_ += candidate; - chunks_.push_back(chunk); - remaining -= candidate; - candidate = remaining; - continue; - } - if (candidate <= 1) { - set_error( - "failed to allocate retained temporary-buffer chunks for configured bytes " + std::to_string(budget) - ); - release_chunks(); - retained_chunk_bytes_ = 0; +inline bool TemporaryVariableBuffer::validate_plan_item(const TemporaryBufferPlanItem &item) { + if (item.alignment == 0 || !is_power_of_two(item.alignment)) { + set_error("temporary buffer plan alignment must be a power of two"); + return false; + } + return true; +} + +inline bool +TemporaryVariableBuffer::packed_plan_size(const TemporaryBufferPlanItem *items, size_t item_count, size_t *out) { + if (out == nullptr) { + set_error("temporary buffer packed size received invalid arguments"); + return false; + } + size_t offset = 0; + for (size_t i = 0; i < item_count; ++i) { + if (!validate_plan_item(items[i])) { return false; } - candidate = candidate / 2; - if (candidate == 0) { - candidate = 1; + const size_t alignment = std::max(items[i].alignment, kTemporaryBufferAlignment); + size_t aligned = 0; + if (!align_up_checked(offset, alignment, &aligned) || + items[i].bytes > 
std::numeric_limits<size_t>::max() - aligned) {
+            set_error("temporary buffer plan size overflow");
+            return false;
         }
+        offset = aligned + items[i].bytes;
     }
+    *out = offset;
     return true;
 }

-inline bool TemporaryVariableBuffer::allocate_chunk(size_t capacity, Chunk *out) {
-    if (alloc_ == nullptr || free_ == nullptr || out == nullptr) {
+inline bool TemporaryVariableBuffer::allocate_buffer(size_t required_bytes) {
+    if (alloc_ == nullptr || free_ == nullptr) {
         set_error("temporary buffer allocator callbacks are not configured");
         return false;
     }
-    const size_t raw_size = capacity;
+    size_t capacity = 0;
+    if (!align_up_checked(required_bytes, kTemporaryBufferAlignment, &capacity)) {
+        set_error("temporary buffer capacity overflow");
+        return false;
+    }
+    if (capacity > std::numeric_limits<size_t>::max() - (kTemporaryBufferAlignment - 1)) {
+        set_error("temporary buffer raw allocation size overflow");
+        return false;
+    }
+    const size_t raw_size = capacity + (kTemporaryBufferAlignment - 1);
     void *raw = alloc_(ctx_, raw_size);
     if (raw == nullptr) {
         return false;
     }
-    *out = Chunk{raw, raw, capacity, raw_size, 0};
+    void *base = nullptr;
+    if (!align_ptr_checked(raw, kTemporaryBufferAlignment, &base)) {
+        free_(ctx_, raw);
+        set_error("temporary buffer base alignment overflow");
+        return false;
+    }
+    buffer_ = Buffer{raw, base, capacity, raw_size, 0};
+    ++realloc_count_;
     return true;
 }

-inline void TemporaryVariableBuffer::release_chunks() {
-    if (free_ != nullptr) {
-        for (Chunk &chunk : chunks_) {
-            if (chunk.raw_base != nullptr) {
-                free_(ctx_, chunk.raw_base);
-            }
-        }
+inline void TemporaryVariableBuffer::release_buffer() {
+    if (buffer_.raw_base != nullptr && free_ != nullptr) {
+        free_(ctx_, buffer_.raw_base);
     }
-    chunks_.clear();
+    buffer_ = Buffer{};
+}
+
+inline void TemporaryVariableBuffer::reset_run_state() {
+    buffer_.offset = 0;
+    current_run_used_bytes_ = 0;
 }

 #endif  // SRC_COMMON_PLATFORM_INCLUDE_HOST_TEMPORARY_VARIABLE_BUFFER_H_
diff --git
a/src/common/platform/onboard/host/c_api_shared.cpp b/src/common/platform/onboard/host/c_api_shared.cpp index 65b2669b5..d90ddb4b2 100644 --- a/src/common/platform/onboard/host/c_api_shared.cpp +++ b/src/common/platform/onboard/host/c_api_shared.cpp @@ -120,17 +120,17 @@ static int device_memset(void *dev_ptr, int value, size_t size) { } } -static size_t temporary_buffer_budget() { +static bool temporary_buffer_enabled() { try { - return current_runner()->temporary_buffer_budget(); + return current_runner()->temporary_buffer_enabled(); } catch (...) { - return 0; + return false; } } -static bool begin_temporary_buffer_run() { +static bool begin_temporary_buffer_run(const TemporaryBufferPlanItem *items, size_t item_count) { try { - return current_runner()->begin_temporary_buffer_run(); + return current_runner()->begin_temporary_buffer_run(items, item_count); } catch (...) { return false; } @@ -260,24 +260,15 @@ int copy_from_device_ctx(DeviceContextHandle ctx, void *host_ptr, const void *de } } -int configure_temporary_buffer_ctx(DeviceContextHandle ctx, size_t max_temporary_buffer_bytes) { +int configure_temporary_buffer_auto_ctx(DeviceContextHandle ctx, int enabled) { if (ctx == NULL) return -1; try { - return static_cast(ctx)->configure_temporary_buffer(max_temporary_buffer_bytes) ? 0 : -1; + return static_cast(ctx)->configure_temporary_buffer_auto(enabled != 0) ? 0 : -1; } catch (...) { return -1; } } -size_t get_temporary_buffer_budget_ctx(DeviceContextHandle ctx) { - if (ctx == NULL) return 0; - try { - return static_cast(ctx)->temporary_buffer_budget(); - } catch (...) 
{ - return 0; - } -} - int finalize_device(DeviceContextHandle ctx) { if (ctx == NULL) return -1; try { @@ -553,7 +544,7 @@ int simpler_run( api.copy_to_device = copy_to_device; api.copy_from_device = copy_from_device; api.device_memset = device_memset; - api.temporary_buffer_budget = temporary_buffer_budget; + api.temporary_buffer_enabled = temporary_buffer_enabled; api.begin_temporary_buffer_run = begin_temporary_buffer_run; api.acquire_temporary_buffer_slice = acquire_temporary_buffer_slice; api.end_temporary_buffer_run = end_temporary_buffer_run; diff --git a/src/common/platform/onboard/host/device_runner_base.cpp b/src/common/platform/onboard/host/device_runner_base.cpp index 971774b9a..ea680fb45 100644 --- a/src/common/platform/onboard/host/device_runner_base.cpp +++ b/src/common/platform/onboard/host/device_runner_base.cpp @@ -140,26 +140,20 @@ int DeviceRunnerBase::device_memset(void *dev_ptr, int value, std::size_t bytes) return aclrtMemset(dev_ptr, bytes, value, bytes); } -bool DeviceRunnerBase::configure_temporary_buffer(std::size_t max_temporary_buffer_bytes) { - if (!temporary_buffer_.configure(max_temporary_buffer_bytes)) { - LOG_ERROR( - "configure_temporary_buffer(%zu) failed: %s", max_temporary_buffer_bytes, - temporary_buffer_.last_error().c_str() - ); +bool DeviceRunnerBase::configure_temporary_buffer_auto(bool enabled) { + if (!temporary_buffer_.configure_auto(enabled)) { + LOG_ERROR("configure_temporary_buffer_auto(%d) failed: %s", enabled, temporary_buffer_.last_error().c_str()); return false; } auto stats = temporary_buffer_.stats(); - LOG_DEBUG( - "Temporary buffer configured: budget=%zu retained_chunks=%zu retained_bytes=%zu", stats.configured_budget_bytes, - stats.retained_chunk_count, stats.retained_chunk_bytes - ); + LOG_DEBUG("Temporary buffer AUTO configured: enabled=%d retained_bytes=%zu", stats.enabled, stats.retained_bytes); return true; } -std::size_t DeviceRunnerBase::temporary_buffer_budget() const { return 
temporary_buffer_.budget(); } +bool DeviceRunnerBase::temporary_buffer_enabled() const { return temporary_buffer_.enabled(); } -bool DeviceRunnerBase::begin_temporary_buffer_run() { - if (!temporary_buffer_.begin_run()) { +bool DeviceRunnerBase::begin_temporary_buffer_run(const TemporaryBufferPlanItem *items, std::size_t item_count) { + if (!temporary_buffer_.begin_run(items, item_count)) { LOG_ERROR("begin_temporary_buffer_run failed: %s", temporary_buffer_.last_error().c_str()); return false; } @@ -170,8 +164,8 @@ void *DeviceRunnerBase::acquire_temporary_buffer_slice(std::size_t bytes, std::s void *ptr = temporary_buffer_.acquire(bytes, alignment); if (ptr == nullptr) { LOG_ERROR( - "acquire_temporary_buffer_slice failed: required bytes=%zu configured bytes=%zu: %s", bytes, - temporary_buffer_.budget(), temporary_buffer_.last_error().c_str() + "acquire_temporary_buffer_slice failed: bytes=%zu retained_bytes=%zu: %s", bytes, + temporary_buffer_.stats().retained_bytes, temporary_buffer_.last_error().c_str() ); } return ptr; @@ -181,9 +175,9 @@ void DeviceRunnerBase::end_temporary_buffer_run() { temporary_buffer_.end_run(); auto stats = temporary_buffer_.stats(); LOG_DEBUG( - "Temporary buffer run ended: used=%zu high_water=%zu allocations=%zu budget_exceeded=%zu", + "Temporary buffer run ended: used=%zu high_water=%zu allocations=%zu reallocs=%zu realloc_failed=%zu", stats.current_run_used_bytes, stats.high_water_used_bytes, stats.buffer_backed_allocation_count, - stats.budget_exceeded_count + stats.realloc_count, stats.realloc_failed_count ); } diff --git a/src/common/platform/onboard/host/device_runner_base.h b/src/common/platform/onboard/host/device_runner_base.h index a1dd77153..693ed7aa7 100644 --- a/src/common/platform/onboard/host/device_runner_base.h +++ b/src/common/platform/onboard/host/device_runner_base.h @@ -91,9 +91,9 @@ class DeviceRunnerBase : public L3L2OrchCommBackend { int copy_to_device(void *dev_ptr, const void *host_ptr, std::size_t bytes); 
int copy_from_device(void *host_ptr, const void *dev_ptr, std::size_t bytes); int device_memset(void *dev_ptr, int value, std::size_t bytes); - bool configure_temporary_buffer(std::size_t max_temporary_buffer_bytes); - std::size_t temporary_buffer_budget() const; - bool begin_temporary_buffer_run(); + bool configure_temporary_buffer_auto(bool enabled); + bool temporary_buffer_enabled() const; + bool begin_temporary_buffer_run(const TemporaryBufferPlanItem *items, std::size_t item_count); void *acquire_temporary_buffer_slice(std::size_t bytes, std::size_t alignment); void end_temporary_buffer_run(); void clear_temporary_buffer(); diff --git a/src/common/platform/sim/host/c_api_shared.cpp b/src/common/platform/sim/host/c_api_shared.cpp index 94af2b32d..6e7604d73 100644 --- a/src/common/platform/sim/host/c_api_shared.cpp +++ b/src/common/platform/sim/host/c_api_shared.cpp @@ -117,17 +117,17 @@ static int device_memset(void *dev_ptr, int value, size_t size) { } } -static size_t temporary_buffer_budget() { +static bool temporary_buffer_enabled() { try { - return current_runner()->temporary_buffer_budget(); + return current_runner()->temporary_buffer_enabled(); } catch (...) { - return 0; + return false; } } -static bool begin_temporary_buffer_run() { +static bool begin_temporary_buffer_run(const TemporaryBufferPlanItem *items, size_t item_count) { try { - return current_runner()->begin_temporary_buffer_run(); + return current_runner()->begin_temporary_buffer_run(items, item_count); } catch (...) { return false; } @@ -253,24 +253,15 @@ int copy_from_device_ctx(DeviceContextHandle ctx, void *host_ptr, const void *de } } -int configure_temporary_buffer_ctx(DeviceContextHandle ctx, size_t max_temporary_buffer_bytes) { +int configure_temporary_buffer_auto_ctx(DeviceContextHandle ctx, int enabled) { if (ctx == NULL) return -1; try { - return static_cast(ctx)->configure_temporary_buffer(max_temporary_buffer_bytes) ? 
0 : -1; + return static_cast(ctx)->configure_temporary_buffer_auto(enabled != 0) ? 0 : -1; } catch (...) { return -1; } } -size_t get_temporary_buffer_budget_ctx(DeviceContextHandle ctx) { - if (ctx == NULL) return 0; - try { - return static_cast(ctx)->temporary_buffer_budget(); - } catch (...) { - return 0; - } -} - int finalize_device(DeviceContextHandle ctx) { if (ctx == NULL) return -1; try { @@ -504,7 +495,7 @@ int simpler_run( api.copy_to_device = copy_to_device; api.copy_from_device = copy_from_device; api.device_memset = device_memset; - api.temporary_buffer_budget = temporary_buffer_budget; + api.temporary_buffer_enabled = temporary_buffer_enabled; api.begin_temporary_buffer_run = begin_temporary_buffer_run; api.acquire_temporary_buffer_slice = acquire_temporary_buffer_slice; api.end_temporary_buffer_run = end_temporary_buffer_run; diff --git a/src/common/platform/sim/host/device_runner_base.cpp b/src/common/platform/sim/host/device_runner_base.cpp index d55ae17af..32b35777a 100644 --- a/src/common/platform/sim/host/device_runner_base.cpp +++ b/src/common/platform/sim/host/device_runner_base.cpp @@ -262,26 +262,20 @@ int SimDeviceRunnerBase::device_memset(void *dev_ptr, int value, size_t bytes) { return 0; } -bool SimDeviceRunnerBase::configure_temporary_buffer(size_t max_temporary_buffer_bytes) { - if (!temporary_buffer_.configure(max_temporary_buffer_bytes)) { - LOG_ERROR( - "configure_temporary_buffer(%zu) failed: %s", max_temporary_buffer_bytes, - temporary_buffer_.last_error().c_str() - ); +bool SimDeviceRunnerBase::configure_temporary_buffer_auto(bool enabled) { + if (!temporary_buffer_.configure_auto(enabled)) { + LOG_ERROR("configure_temporary_buffer_auto(%d) failed: %s", enabled, temporary_buffer_.last_error().c_str()); return false; } auto stats = temporary_buffer_.stats(); - LOG_DEBUG( - "Temporary buffer configured: budget=%zu retained_chunks=%zu retained_bytes=%zu", stats.configured_budget_bytes, - stats.retained_chunk_count, 
stats.retained_chunk_bytes - ); + LOG_DEBUG("Temporary buffer AUTO configured: enabled=%d retained_bytes=%zu", stats.enabled, stats.retained_bytes); return true; } -size_t SimDeviceRunnerBase::temporary_buffer_budget() const { return temporary_buffer_.budget(); } +bool SimDeviceRunnerBase::temporary_buffer_enabled() const { return temporary_buffer_.enabled(); } -bool SimDeviceRunnerBase::begin_temporary_buffer_run() { - if (!temporary_buffer_.begin_run()) { +bool SimDeviceRunnerBase::begin_temporary_buffer_run(const TemporaryBufferPlanItem *items, size_t item_count) { + if (!temporary_buffer_.begin_run(items, item_count)) { LOG_ERROR("begin_temporary_buffer_run failed: %s", temporary_buffer_.last_error().c_str()); return false; } @@ -292,8 +286,8 @@ void *SimDeviceRunnerBase::acquire_temporary_buffer_slice(size_t bytes, size_t a void *ptr = temporary_buffer_.acquire(bytes, alignment); if (ptr == nullptr) { LOG_ERROR( - "acquire_temporary_buffer_slice failed: required bytes=%zu configured bytes=%zu: %s", bytes, - temporary_buffer_.budget(), temporary_buffer_.last_error().c_str() + "acquire_temporary_buffer_slice failed: bytes=%zu retained_bytes=%zu: %s", bytes, + temporary_buffer_.stats().retained_bytes, temporary_buffer_.last_error().c_str() ); } return ptr; @@ -303,9 +297,9 @@ void SimDeviceRunnerBase::end_temporary_buffer_run() { temporary_buffer_.end_run(); auto stats = temporary_buffer_.stats(); LOG_DEBUG( - "Temporary buffer run ended: used=%zu high_water=%zu allocations=%zu budget_exceeded=%zu", + "Temporary buffer run ended: used=%zu high_water=%zu allocations=%zu reallocs=%zu realloc_failed=%zu", stats.current_run_used_bytes, stats.high_water_used_bytes, stats.buffer_backed_allocation_count, - stats.budget_exceeded_count + stats.realloc_count, stats.realloc_failed_count ); } diff --git a/src/common/platform/sim/host/device_runner_base.h b/src/common/platform/sim/host/device_runner_base.h index a0a7ed550..9acad2f8f 100644 --- 
a/src/common/platform/sim/host/device_runner_base.h +++ b/src/common/platform/sim/host/device_runner_base.h @@ -96,9 +96,9 @@ class SimDeviceRunnerBase : public L3L2OrchCommBackend { int copy_to_device(void *dev_ptr, const void *host_ptr, size_t bytes); int copy_from_device(void *host_ptr, const void *dev_ptr, size_t bytes); int device_memset(void *dev_ptr, int value, size_t bytes); - bool configure_temporary_buffer(size_t max_temporary_buffer_bytes); - size_t temporary_buffer_budget() const; - bool begin_temporary_buffer_run(); + bool configure_temporary_buffer_auto(bool enabled); + bool temporary_buffer_enabled() const; + bool begin_temporary_buffer_run(const TemporaryBufferPlanItem *items, size_t item_count); void *acquire_temporary_buffer_slice(size_t bytes, size_t alignment); void end_temporary_buffer_run(); void clear_temporary_buffer(); diff --git a/src/common/worker/chip_worker.cpp b/src/common/worker/chip_worker.cpp index 8d57b817b..9b518a1d8 100644 --- a/src/common/worker/chip_worker.cpp +++ b/src/common/worker/chip_worker.cpp @@ -101,10 +101,8 @@ void ChipWorker::init( device_free_ctx_fn_ = load_symbol(handle, "device_free_ctx"); copy_to_device_ctx_fn_ = load_symbol(handle, "copy_to_device_ctx"); copy_from_device_ctx_fn_ = load_symbol(handle, "copy_from_device_ctx"); - configure_temporary_buffer_ctx_fn_ = - load_symbol(handle, "configure_temporary_buffer_ctx"); - get_temporary_buffer_budget_ctx_fn_ = - load_symbol(handle, "get_temporary_buffer_budget_ctx"); + configure_temporary_buffer_auto_ctx_fn_ = + load_symbol(handle, "configure_temporary_buffer_auto_ctx"); get_runtime_size_fn_ = load_symbol(handle, "get_runtime_size"); simpler_init_fn_ = load_symbol(handle, "simpler_init"); register_callable_fn_ = load_symbol(handle, "simpler_register_callable"); @@ -188,8 +186,7 @@ void ChipWorker::init( device_free_ctx_fn_ = nullptr; copy_to_device_ctx_fn_ = nullptr; copy_from_device_ctx_fn_ = nullptr; - configure_temporary_buffer_ctx_fn_ = nullptr; - 
get_temporary_buffer_budget_ctx_fn_ = nullptr; + configure_temporary_buffer_auto_ctx_fn_ = nullptr; get_runtime_size_fn_ = nullptr; simpler_init_fn_ = nullptr; register_callable_fn_ = nullptr; @@ -230,8 +227,7 @@ void ChipWorker::init( device_free_ctx_fn_ = nullptr; copy_to_device_ctx_fn_ = nullptr; copy_from_device_ctx_fn_ = nullptr; - configure_temporary_buffer_ctx_fn_ = nullptr; - get_temporary_buffer_budget_ctx_fn_ = nullptr; + configure_temporary_buffer_auto_ctx_fn_ = nullptr; get_runtime_size_fn_ = nullptr; simpler_init_fn_ = nullptr; register_callable_fn_ = nullptr; @@ -287,8 +283,7 @@ void ChipWorker::finalize() { device_free_ctx_fn_ = nullptr; copy_to_device_ctx_fn_ = nullptr; copy_from_device_ctx_fn_ = nullptr; - configure_temporary_buffer_ctx_fn_ = nullptr; - get_temporary_buffer_budget_ctx_fn_ = nullptr; + configure_temporary_buffer_auto_ctx_fn_ = nullptr; get_runtime_size_fn_ = nullptr; register_callable_fn_ = nullptr; run_fn_ = nullptr; @@ -378,23 +373,16 @@ size_t ChipWorker::host_dlopen_count() const { return get_host_dlopen_count_fn_(device_ctx_); } -void ChipWorker::configure_temporary_buffer(size_t max_temporary_buffer_bytes) { +void ChipWorker::configure_temporary_buffer_auto(bool enabled) { if (!initialized_) { throw std::runtime_error("ChipWorker not initialized; call init() first"); } - int rc = configure_temporary_buffer_ctx_fn_(device_ctx_, max_temporary_buffer_bytes); + int rc = configure_temporary_buffer_auto_ctx_fn_(device_ctx_, enabled ? 
1 : 0); if (rc != 0) { - throw std::runtime_error("configure_temporary_buffer failed with code " + std::to_string(rc)); + throw std::runtime_error("configure_temporary_buffer_auto failed with code " + std::to_string(rc)); } } -size_t ChipWorker::temporary_buffer_budget() const { - if (!initialized_) { - return 0; - } - return get_temporary_buffer_budget_ctx_fn_(device_ctx_); -} - void *ChipWorker::create_comm_stream_checked(const char *op_name) { int rc = ensure_acl_ready_fn_(device_ctx_, device_id_); if (rc != 0) { diff --git a/src/common/worker/chip_worker.h b/src/common/worker/chip_worker.h index cc3ca0a29..d2be14389 100644 --- a/src/common/worker/chip_worker.h +++ b/src/common/worker/chip_worker.h @@ -82,8 +82,7 @@ class ChipWorker { void free(uint64_t ptr); void copy_to(uint64_t dst, uint64_t src, size_t size); void copy_from(uint64_t dst, uint64_t src, size_t size); - void configure_temporary_buffer(size_t max_temporary_buffer_bytes); - size_t temporary_buffer_budget() const; + void configure_temporary_buffer_auto(bool enabled); void l3_l2_orch_comm_init(uint64_t control_block_addr, size_t control_block_size); void l3_l2_orch_comm_shutdown(); @@ -140,8 +139,7 @@ class ChipWorker { using DeviceFreeCtxFn = void (*)(void *, void *); using CopyToDeviceCtxFn = int (*)(void *, void *, const void *, size_t); using CopyFromDeviceCtxFn = int (*)(void *, void *, const void *, size_t); - using ConfigureTemporaryBufferCtxFn = int (*)(void *, size_t); - using GetTemporaryBufferBudgetCtxFn = size_t (*)(void *); + using ConfigureTemporaryBufferAutoCtxFn = int (*)(void *, int); using GetRuntimeSizeFn = size_t (*)(); // From host_runtime.so. 
Single platform-side init that does (a) thread // attach + device-id record, (b) executor binary takeover, (c) onboard @@ -196,8 +194,7 @@ class ChipWorker { DeviceFreeCtxFn device_free_ctx_fn_ = nullptr; CopyToDeviceCtxFn copy_to_device_ctx_fn_ = nullptr; CopyFromDeviceCtxFn copy_from_device_ctx_fn_ = nullptr; - ConfigureTemporaryBufferCtxFn configure_temporary_buffer_ctx_fn_ = nullptr; - GetTemporaryBufferBudgetCtxFn get_temporary_buffer_budget_ctx_fn_ = nullptr; + ConfigureTemporaryBufferAutoCtxFn configure_temporary_buffer_auto_ctx_fn_ = nullptr; GetRuntimeSizeFn get_runtime_size_fn_ = nullptr; SimplerInitFn simpler_init_fn_ = nullptr; SimplerRegisterCallableFn register_callable_fn_ = nullptr; diff --git a/src/common/worker/pto_runtime_c_api.h b/src/common/worker/pto_runtime_c_api.h index 74e927216..133e5e8f7 100644 --- a/src/common/worker/pto_runtime_c_api.h +++ b/src/common/worker/pto_runtime_c_api.h @@ -24,8 +24,7 @@ * - sizing: get_runtime_size * - device-mem: device_malloc_ctx, device_free_ctx, * copy_to_device_ctx, copy_from_device_ctx - * - temp-buffer: configure_temporary_buffer_ctx, - * get_temporary_buffer_budget_ctx + * - temp-buffer: configure_temporary_buffer_auto_ctx * - prepared run: simpler_register_callable, simpler_run, unregister_callable, * get_aicpu_dlopen_count, get_host_dlopen_count * - L3-L2 orch: l3_l2_orch_comm_init_ctx, @@ -93,11 +92,8 @@ int copy_to_device_ctx(DeviceContextHandle ctx, void *dev_ptr, const void *host_ /** Copy device memory to a host pointer within the given device context. */ int copy_from_device_ctx(DeviceContextHandle ctx, void *host_ptr, const void *dev_ptr, size_t size); -/** Configure the runner-scoped temporary variable buffer. Zero disables it. */ -int configure_temporary_buffer_ctx(DeviceContextHandle ctx, size_t max_temporary_buffer_bytes); - -/** Return the configured temporary-buffer budget, or 0 when disabled. 
*/ -size_t get_temporary_buffer_budget_ctx(DeviceContextHandle ctx); +/** Enable or disable the runner-scoped AUTO temporary variable buffer. */ +int configure_temporary_buffer_auto_ctx(DeviceContextHandle ctx, int enabled); /** * One-shot platform-side init. Called once by ChipWorker::init() right diff --git a/tests/ut/cpp/common/test_temporary_variable_buffer.cpp b/tests/ut/cpp/common/test_temporary_variable_buffer.cpp index 6f5c27b3d..026b4ae43 100644 --- a/tests/ut/cpp/common/test_temporary_variable_buffer.cpp +++ b/tests/ut/cpp/common/test_temporary_variable_buffer.cpp @@ -34,7 +34,7 @@ struct MockBackend { return nullptr; } void *ptr = nullptr; - if (posix_memalign(&ptr, TemporaryVariableBuffer::kDefaultAlignment, size) != 0) { + if (posix_memalign(&ptr, TemporaryVariableBuffer::kTemporaryBufferAlignment, size) != 0) { return nullptr; } ++alloc_count; @@ -58,90 +58,144 @@ bool is_aligned(const void *ptr, size_t alignment) { return (reinterpret_cast(second) - static_cast(first), 1024); + EXPECT_EQ(buffer.stats().current_run_used_bytes, 1280u); buffer.end_run(); + EXPECT_EQ(buffer.stats().high_water_used_bytes, 1280u); +} - ASSERT_TRUE(buffer.begin_run()) << buffer.last_error(); - void *again = buffer.acquire(512, TemporaryVariableBuffer::kDefaultAlignment); +TEST(TemporaryVariableBufferTest, RepeatedSamePlanReusesRetainedBuffer) { + MockBackend backend; + TemporaryVariableBuffer buffer(mock_alloc, mock_free, &backend); + TemporaryBufferPlanItem plan[] = {{512, TemporaryVariableBuffer::kTemporaryBufferAlignment}}; + + ASSERT_TRUE(buffer.configure_auto(true)) << buffer.last_error(); + ASSERT_TRUE(buffer.begin_run(plan, 1)) << buffer.last_error(); + void *first = buffer.acquire(512, TemporaryVariableBuffer::kTemporaryBufferAlignment); + ASSERT_NE(first, nullptr) << buffer.last_error(); + buffer.end_run(); + + ASSERT_TRUE(buffer.begin_run(plan, 1)) << buffer.last_error(); + void *again = buffer.acquire(512, TemporaryVariableBuffer::kTemporaryBufferAlignment); 
EXPECT_EQ(again, first); buffer.end_run(); EXPECT_EQ(backend.alloc_count, 1); EXPECT_EQ(backend.free_count, 0); - EXPECT_EQ(buffer.stats().high_water_used_bytes, 768u); + EXPECT_EQ(buffer.stats().realloc_count, 1u); } -TEST(TemporaryVariableBufferTest, ConfiguredBudgetIsEnforcedWithClearError) { +TEST(TemporaryVariableBufferTest, LargerPlanFreesOldBufferBeforeAllocatingNewOne) { MockBackend backend; TemporaryVariableBuffer buffer(mock_alloc, mock_free, &backend); + TemporaryBufferPlanItem small_plan[] = {{512, TemporaryVariableBuffer::kTemporaryBufferAlignment}}; + TemporaryBufferPlanItem large_plan[] = { + {2048, TemporaryVariableBuffer::kTemporaryBufferAlignment}, + {2048, TemporaryVariableBuffer::kTemporaryBufferAlignment}, + }; + + ASSERT_TRUE(buffer.configure_auto(true)) << buffer.last_error(); + ASSERT_TRUE(buffer.begin_run(small_plan, 1)) << buffer.last_error(); + ASSERT_NE(buffer.acquire(512, TemporaryVariableBuffer::kTemporaryBufferAlignment), nullptr) << buffer.last_error(); + buffer.end_run(); + EXPECT_EQ(buffer.stats().retained_bytes, 1024u); - ASSERT_TRUE(buffer.configure(1024)) << buffer.last_error(); - ASSERT_TRUE(buffer.begin_run()) << buffer.last_error(); - ASSERT_NE(buffer.acquire(768, TemporaryVariableBuffer::kDefaultAlignment), nullptr) << buffer.last_error(); - EXPECT_EQ(buffer.acquire(512, TemporaryVariableBuffer::kDefaultAlignment), nullptr); - EXPECT_NE(buffer.last_error().find("required bytes 1280"), std::string::npos); - EXPECT_NE(buffer.last_error().find("configured bytes 1024"), std::string::npos); - EXPECT_EQ(buffer.stats().budget_exceeded_count, 1u); + ASSERT_TRUE(buffer.begin_run(large_plan, 2)) << buffer.last_error(); + EXPECT_EQ(backend.alloc_count, 2); + EXPECT_EQ(backend.free_count, 1); + EXPECT_EQ(buffer.stats().retained_bytes, 4096u); + EXPECT_EQ(buffer.stats().realloc_count, 2u); + ASSERT_NE(buffer.acquire(2048, TemporaryVariableBuffer::kTemporaryBufferAlignment), nullptr) << buffer.last_error(); + 
ASSERT_NE(buffer.acquire(2048, TemporaryVariableBuffer::kTemporaryBufferAlignment), nullptr) << buffer.last_error(); buffer.end_run(); } -TEST(TemporaryVariableBufferTest, SegmentedChunksSatisfyAggregateBudget) { +TEST(TemporaryVariableBufferTest, SmallerPlanDoesNotShrinkRetainedBuffer) { MockBackend backend; - backend.max_alloc_size = 2047; TemporaryVariableBuffer buffer(mock_alloc, mock_free, &backend); + TemporaryBufferPlanItem large_plan[] = {{4096, TemporaryVariableBuffer::kTemporaryBufferAlignment}}; + TemporaryBufferPlanItem small_plan[] = {{512, TemporaryVariableBuffer::kTemporaryBufferAlignment}}; - ASSERT_TRUE(buffer.configure(2048)) << buffer.last_error(); - EXPECT_EQ(buffer.stats().retained_chunk_count, 2u); - EXPECT_EQ(buffer.stats().retained_chunk_bytes, 2048u); - EXPECT_EQ(backend.total_alloc_bytes, 2048u); + ASSERT_TRUE(buffer.configure_auto(true)) << buffer.last_error(); + ASSERT_TRUE(buffer.begin_run(large_plan, 1)) << buffer.last_error(); + buffer.end_run(); + const size_t retained_after_large = buffer.stats().retained_bytes; - ASSERT_TRUE(buffer.begin_run()) << buffer.last_error(); - void *first = buffer.acquire(900, TemporaryVariableBuffer::kDefaultAlignment); - void *second = buffer.acquire(900, TemporaryVariableBuffer::kDefaultAlignment); - ASSERT_NE(first, nullptr) << buffer.last_error(); - ASSERT_NE(second, nullptr) << buffer.last_error(); - EXPECT_NE(first, second); - EXPECT_TRUE(is_aligned(first, TemporaryVariableBuffer::kDefaultAlignment)); - EXPECT_TRUE(is_aligned(second, TemporaryVariableBuffer::kDefaultAlignment)); + ASSERT_TRUE(buffer.begin_run(small_plan, 1)) << buffer.last_error(); buffer.end_run(); + EXPECT_EQ(buffer.stats().retained_bytes, retained_after_large); + EXPECT_EQ(backend.alloc_count, 1); + EXPECT_EQ(backend.free_count, 0); } -TEST(TemporaryVariableBufferTest, ClearFreesRetainedChunksExactlyOnce) { +TEST(TemporaryVariableBufferTest, ReallocFailureLeavesOldBufferReleased) { MockBackend backend; - backend.max_alloc_size = 
2047; TemporaryVariableBuffer buffer(mock_alloc, mock_free, &backend); + TemporaryBufferPlanItem small_plan[] = {{512, TemporaryVariableBuffer::kTemporaryBufferAlignment}}; + TemporaryBufferPlanItem too_large_plan[] = {{4096, TemporaryVariableBuffer::kTemporaryBufferAlignment}}; - ASSERT_TRUE(buffer.configure(2048)) << buffer.last_error(); - EXPECT_EQ(backend.alloc_count, 2); + ASSERT_TRUE(buffer.configure_auto(true)) << buffer.last_error(); + ASSERT_TRUE(buffer.begin_run(small_plan, 1)) << buffer.last_error(); + buffer.end_run(); + ASSERT_EQ(buffer.stats().retained_bytes, 1024u); + + backend.max_alloc_size = 1024; + EXPECT_FALSE(buffer.begin_run(too_large_plan, 1)); + EXPECT_NE(buffer.last_error().find("AUTO realloc failed"), std::string::npos); + EXPECT_FALSE(buffer.stats().active); + EXPECT_EQ(buffer.stats().retained_bytes, 0u); + EXPECT_EQ(backend.free_count, 1); + EXPECT_TRUE(backend.live.empty()); + EXPECT_EQ(buffer.stats().realloc_failed_count, 1u); +} + +TEST(TemporaryVariableBufferTest, ClearFreesRetainedBufferExactlyOnce) { + MockBackend backend; + TemporaryVariableBuffer buffer(mock_alloc, mock_free, &backend); + TemporaryBufferPlanItem plan[] = {{1024, TemporaryVariableBuffer::kTemporaryBufferAlignment}}; + + ASSERT_TRUE(buffer.configure_auto(true)) << buffer.last_error(); + ASSERT_TRUE(buffer.begin_run(plan, 1)) << buffer.last_error(); + buffer.end_run(); + EXPECT_EQ(backend.alloc_count, 1); buffer.clear(); - EXPECT_EQ(backend.free_count, 2); + EXPECT_EQ(backend.free_count, 1); EXPECT_TRUE(backend.live.empty()); - EXPECT_EQ(buffer.budget(), 0u); + EXPECT_FALSE(buffer.enabled()); + EXPECT_EQ(buffer.stats().retained_bytes, 0u); buffer.clear(); - EXPECT_EQ(backend.free_count, 2); + EXPECT_EQ(backend.free_count, 1); } TEST(TemporaryVariableBufferTest, ActiveReconfigurationFailsClearly) { MockBackend backend; TemporaryVariableBuffer buffer(mock_alloc, mock_free, &backend); + TemporaryBufferPlanItem plan[] = {{1024, 
TemporaryVariableBuffer::kTemporaryBufferAlignment}}; - ASSERT_TRUE(buffer.configure(1024)) << buffer.last_error(); - ASSERT_TRUE(buffer.begin_run()) << buffer.last_error(); - EXPECT_FALSE(buffer.configure(2048)); + ASSERT_TRUE(buffer.configure_auto(true)) << buffer.last_error(); + ASSERT_TRUE(buffer.begin_run(plan, 1)) << buffer.last_error(); + EXPECT_FALSE(buffer.configure_auto(false)); EXPECT_NE(buffer.last_error().find("cannot reconfigure"), std::string::npos); buffer.end_run(); } diff --git a/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp b/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp index aec9c2f38..be6b0a769 100644 --- a/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp +++ b/tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp @@ -49,8 +49,11 @@ struct FakeHostApi { int temp_acquire_attempts = 0; int temp_acquire_successes = 0; int fail_copy_to_on_call = 0; - size_t temp_budget = 0; + size_t temp_capacity = 0; size_t temp_offset = 0; + size_t temp_plan_count = 0; + size_t temp_plan_required_bytes = 0; + bool temp_enabled = false; bool temp_active = false; void *temp_pool = nullptr; std::unordered_set live_mallocs; @@ -71,13 +74,14 @@ struct FakeHostApi { } } - void reset(size_t budget = 0) { + void reset(size_t capacity = 0) { release_all(); *this = FakeHostApi(); - temp_budget = budget; - if (budget > 0) { - ASSERT_EQ(posix_memalign(&temp_pool, TemporaryVariableBuffer::kDefaultAlignment, budget), 0); - std::memset(temp_pool, 0, budget); + temp_enabled = capacity > 0; + temp_capacity = capacity; + if (capacity > 0) { + ASSERT_EQ(posix_memalign(&temp_pool, TemporaryVariableBuffer::kTemporaryBufferAlignment, capacity), 0); + std::memset(temp_pool, 0, capacity); } } }; @@ -125,13 +129,25 @@ int fake_device_memset(void *dev_ptr, int value, size_t size) { return 0; } -size_t fake_temporary_buffer_budget() { return g_fake->temp_budget; } +bool fake_temporary_buffer_enabled() { return g_fake->temp_enabled; } -bool 
fake_begin_temporary_buffer_run() { - if (g_fake->temp_budget == 0 || g_fake->temp_pool == nullptr || g_fake->temp_active) { +bool fake_begin_temporary_buffer_run(const TemporaryBufferPlanItem *items, size_t item_count) { + ++g_fake->temp_begin_count; + if (!g_fake->temp_enabled || g_fake->temp_capacity == 0 || g_fake->temp_pool == nullptr || g_fake->temp_active || + (items == nullptr && item_count != 0)) { return false; } - ++g_fake->temp_begin_count; + size_t offset = 0; + for (size_t i = 0; i < item_count; ++i) { + const size_t alignment = std::max(items[i].alignment, TemporaryVariableBuffer::kTemporaryBufferAlignment); + offset = align_up(offset, alignment); + if (items[i].bytes > g_fake->temp_capacity || offset > g_fake->temp_capacity - items[i].bytes) { + return false; + } + offset += items[i].bytes; + } + g_fake->temp_plan_count = item_count; + g_fake->temp_plan_required_bytes = offset; g_fake->temp_offset = 0; g_fake->temp_active = true; return true; @@ -139,8 +155,10 @@ bool fake_begin_temporary_buffer_run() { void *fake_acquire_temporary_buffer_slice(size_t size, size_t alignment) { ++g_fake->temp_acquire_attempts; - const size_t offset = align_up(g_fake->temp_offset, alignment == 0 ? 1 : alignment); - if (!g_fake->temp_active || offset > g_fake->temp_budget || size > g_fake->temp_budget - offset) { + const size_t effective_alignment = + std::max(alignment == 0 ? 
size_t{1} : alignment, TemporaryVariableBuffer::kTemporaryBufferAlignment); + const size_t offset = align_up(g_fake->temp_offset, effective_alignment); + if (!g_fake->temp_active || offset > g_fake->temp_capacity || size > g_fake->temp_capacity - offset) { return nullptr; } void *ptr = static_cast(g_fake->temp_pool) + offset; @@ -189,7 +207,7 @@ HostApi make_host_api() { fake_copy_to_device, fake_copy_from_device, fake_device_memset, - fake_temporary_buffer_budget, + fake_temporary_buffer_enabled, fake_begin_temporary_buffer_run, fake_acquire_temporary_buffer_slice, fake_end_temporary_buffer_run, @@ -244,7 +262,7 @@ class TrbRuntimeTempBufferTest : public ::testing::Test { } // namespace -TEST_F(TrbRuntimeTempBufferTest, PositiveBudgetUsesTemporarySlicesWithoutChangingCopies) { +TEST_F(TrbRuntimeTempBufferTest, AutoEnabledUsesTemporarySlicesWithoutChangingCopies) { std::vector input(64, 7); std::vector output(64, 0); ChipStorageTaskArgs args = make_args(input, output); @@ -266,6 +284,8 @@ TEST_F(TrbRuntimeTempBufferTest, PositiveBudgetUsesTemporarySlicesWithoutChangin ASSERT_EQ(bind_runtime(buffer_runtime, api_, args, signature, 2), 0); EXPECT_EQ(fake_.device_malloc_count, 0); EXPECT_EQ(fake_.temp_begin_count, 1); + EXPECT_EQ(fake_.temp_plan_count, 2u); + EXPECT_EQ(fake_.temp_plan_required_bytes, 1088u); EXPECT_EQ(fake_.temp_acquire_successes, 2); EXPECT_EQ(fake_.copy_to_count, 2); EXPECT_EQ(fake_.device_memset_count, 1); @@ -286,6 +306,8 @@ TEST_F(TrbRuntimeTempBufferTest, ChildMemoryIsPassThroughAndPureOutStillMemsets) ArgDirection signature[2] = {ArgDirection::IN, ArgDirection::OUT}; ASSERT_EQ(bind_runtime(runtime, api_, args, signature, 2), 0); + EXPECT_EQ(fake_.temp_plan_count, 1u); + EXPECT_EQ(fake_.temp_plan_required_bytes, 64u); EXPECT_EQ(fake_.temp_acquire_successes, 1); EXPECT_EQ(fake_.device_malloc_count, 0); EXPECT_EQ(fake_.copy_to_count, 1); @@ -295,7 +317,7 @@ TEST_F(TrbRuntimeTempBufferTest, ChildMemoryIsPassThroughAndPureOutStillMemsets) 
     EXPECT_EQ(fake_.temp_end_count, 1);
 }

-TEST_F(TrbRuntimeTempBufferTest, BudgetExhaustionFailsWithoutMallocFallbackAndEndsRun) {
+TEST_F(TrbRuntimeTempBufferTest, TemporaryPlanFailureFailsWithoutMallocFallback) {
     fake_.reset(1024);
     Runtime runtime = make_runtime();
     std::vector input(768, 1);
@@ -306,9 +328,9 @@ TEST_F(TrbRuntimeTempBufferTest, BudgetExhaustionFailsWithoutMallocFallbackAndEn
     EXPECT_EQ(bind_runtime(runtime, api_, args, signature, 2), -1);
     EXPECT_EQ(fake_.device_malloc_count, 0);
     EXPECT_EQ(fake_.temp_begin_count, 1);
-    EXPECT_EQ(fake_.temp_acquire_attempts, 2);
-    EXPECT_EQ(fake_.temp_acquire_successes, 1);
-    EXPECT_EQ(fake_.temp_end_count, 1);
+    EXPECT_EQ(fake_.temp_acquire_attempts, 0);
+    EXPECT_EQ(fake_.temp_acquire_successes, 0);
+    EXPECT_EQ(fake_.temp_end_count, 0);
     EXPECT_FALSE(runtime.temporary_buffer_run_active_);
     EXPECT_TRUE(runtime.tensor_leases_.empty());
 }
diff --git a/tests/ut/py/test_chip_worker.py b/tests/ut/py/test_chip_worker.py
index 2a2cb6aeb..83f33bf0b 100644
--- a/tests/ut/py/test_chip_worker.py
+++ b/tests/ut/py/test_chip_worker.py
@@ -242,7 +242,7 @@ def test_l3_l2_orch_comm_shutdown_before_init_raises(self):
     def test_configure_temporary_buffer_before_init_raises(self):
         worker = _ChipWorker()
         with pytest.raises(RuntimeError, match="not initialized"):
-            worker.configure_temporary_buffer(4096)
+            worker.configure_temporary_buffer_auto(True)


 # ============================================================================
@@ -276,8 +276,7 @@ def __init__(self):
                 self.unregistered = []
                 self.aicpu_dlopen_count = 0
                 self.host_dlopen_count = 0
-                self.temporary_buffer_budget = 0
-                self.configured_temporary_buffers = []
+                self.configured_temporary_buffer_auto = []

             def register_callable(self, slot, callable_obj):
                 self.prepared.append((slot, callable_obj))
@@ -288,9 +287,8 @@ def run(self, slot, args, config):
             def unregister_callable(self, slot):
                 self.unregistered.append(slot)

-            def configure_temporary_buffer(self, budget):
-                self.configured_temporary_buffers.append(budget)
-                self.temporary_buffer_budget = budget
+            def configure_temporary_buffer_auto(self, enabled):
+                self.configured_temporary_buffer_auto.append(enabled)

         worker = ChipWorker()
         fake = FakeImpl()
@@ -315,12 +313,9 @@ def configure_temporary_buffer(self, budget):
         worker.unregister_callable(second)
         assert fake.unregistered == [0]

-        worker.configure_temporary_buffer(4096)
-        assert fake.configured_temporary_buffers == [4096]
-        assert worker.temporary_buffer_budget == 4096
-
-        with pytest.raises(ValueError, match="max_temporary_buffer_bytes"):
-            worker.configure_temporary_buffer(-1)
+        worker.configure_temporary_buffer_auto(True)
+        worker.configure_temporary_buffer_auto(False)
+        assert fake.configured_temporary_buffer_auto == [True, False]

     def test_public_wrapper_rejects_raw_slot_run(self):
         from _task_interface import ChipStorageTaskArgs  # noqa: PLC0415
diff --git a/tests/ut/py/test_worker/test_host_worker.py b/tests/ut/py/test_worker/test_host_worker.py
index 968232a17..adc963f72 100644
--- a/tests/ut/py/test_worker/test_host_worker.py
+++ b/tests/ut/py/test_worker/test_host_worker.py
@@ -109,12 +109,10 @@ def _slot_for(worker: Worker, handle: CallableHandle) -> int:

 class _FakeChipWorker:
     def __init__(self) -> None:
-        self.configured_temporary_buffers: list[int] = []
-        self.temporary_buffer_budget = 0
+        self.configured_temporary_buffer_auto: list[bool] = []

-    def configure_temporary_buffer(self, budget: int) -> None:
-        self.configured_temporary_buffers.append(budget)
-        self.temporary_buffer_budget = budget
+    def configure_temporary_buffer_auto(self, enabled: bool) -> None:
+        self.configured_temporary_buffer_auto.append(enabled)


 class _FakeControlResult:
@@ -133,30 +131,35 @@ def _chip_payload_shm(callable_obj: ChipCallable) -> SharedMemory:
     return shm


-def test_l2_worker_configure_temporary_buffer_records_and_forwards():
+def test_l2_worker_configure_temporary_buffer_auto_records_and_forwards():
     worker = Worker(level=2, platform="a2a3sim", runtime="tensormap_and_ringbuffer")
-    assert worker.temporary_buffer_budget == 0
-    worker.configure_temporary_buffer(8192)
-    assert worker._config["max_temporary_buffer_bytes"] == 8192
-    assert worker.temporary_buffer_budget == 8192
+    assert worker.temporary_buffer_mode == "off"
+    worker.configure_temporary_buffer_auto(True)
+    assert worker._config["temporary_buffer_mode"] == "auto"
+    assert worker.temporary_buffer_mode == "auto"

     fake_chip = _FakeChipWorker()
     worker._chip_worker = fake_chip
-    worker.configure_temporary_buffer(16384)
-    assert fake_chip.configured_temporary_buffers == [16384]
-    assert worker.temporary_buffer_budget == 16384
-
-    with pytest.raises(ValueError, match="max_temporary_buffer_bytes"):
-        worker.configure_temporary_buffer(-1)
+    worker.configure_temporary_buffer_auto(False)
+    assert fake_chip.configured_temporary_buffer_auto == [False]
+    assert worker.temporary_buffer_mode == "off"


 def test_temporary_buffer_configuration_records_for_l3_children():
     worker = Worker(level=3, num_sub_workers=0)
-    worker.configure_temporary_buffer(1024)
-    assert worker._config["max_temporary_buffer_bytes"] == 1024
-    assert worker.temporary_buffer_budget == 1024
+    worker.configure_temporary_buffer_auto(True)
+    assert worker._config["temporary_buffer_mode"] == "auto"
+    assert worker.temporary_buffer_mode == "auto"
+
+
+def test_temporary_buffer_rejects_removed_budget_config():
+    with pytest.raises(ValueError, match="max_temporary_buffer_bytes"):
+        Worker(level=2, max_temporary_buffer_bytes=1024)
+
+    with pytest.raises(ValueError, match="temporary_buffer_mode"):
+        Worker(level=2, temporary_buffer_mode="bad")


 def test_chip_process_loop_configures_temporary_buffer(monkeypatch):
@@ -166,8 +169,8 @@ class FakeChipWorker:
         def init(self, device_id, bins, *, log_level, log_info_v):
             events.append(("init", device_id, bins, log_level, log_info_v))

-        def configure_temporary_buffer(self, budget: int) -> None:
-            events.append(("configure_temporary_buffer", budget))
+        def configure_temporary_buffer_auto(self, enabled: bool) -> None:
+            events.append(("configure_temporary_buffer_auto", enabled))

         def finalize(self) -> None:
             events.append(("finalize",))
@@ -191,7 +194,7 @@ def fake_run_chip_main_loop(cw, *_args, chip_platform, chip_runtime):
             worker_mod._ChipProcessConfig(
                 platform="a2a3",
                 runtime="tensormap_and_ringbuffer",
-                max_temporary_buffer_bytes=4096,
+                temporary_buffer_mode="auto",
             ),
         )
     finally:
@@ -199,7 +202,7 @@ def fake_run_chip_main_loop(cw, *_args, chip_platform, chip_runtime):
         shm.unlink()

     assert events[0] == ("init", 7, "bins", 1, 5)
-    assert events[1] == ("configure_temporary_buffer", 4096)
+    assert events[1] == ("configure_temporary_buffer_auto", True)
     assert events[2][0] == "main_loop"
     assert events[2][2:] == ("a2a3", "tensormap_and_ringbuffer")
     assert events[3] == ("finalize",)

From bfac4371cd993bbaad4f4bc51c7b1f7a0a541f5e Mon Sep 17 00:00:00 2001
From: puddingfjz <2811443837@qq.com>
Date: Wed, 1 Jul 2026 18:38:25 +0800
Subject: [PATCH 9/9] Fix TRB cleanup after main merge

---
 .../host/runtime_maker.cpp | 15 ---------------
 .../host/runtime_maker.cpp | 15 ---------------
 2 files changed, 30 deletions(-)

diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
index ea2d502e6..27b8c1972 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
@@ -981,21 +981,6 @@ extern "C" int validate_runtime_impl(Runtime *runtime, const HostApi *api) {
     release_tensor_leases(runtime, api);
     end_temporary_buffer_run_if_active(api, runtime->temporary_buffer_run_active_);

-    // Clear the per-run dispatch-table entries staged by register_callable_impl.
-    // The underlying chip-callable device buffer is pool-managed by
-    // DeviceRunner (keyed by content hash) and bulk-freed in
-    // DeviceRunner::finalize(); re-running the same callable repeatedly
-    // should not re-upload.
-    int kernel_count = runtime->get_registered_kernel_count();
-    for (int i = 0; i < kernel_count; i++) {
-        int func_id = runtime->get_registered_kernel_func_id(i);
-        runtime->set_function_bin_addr(func_id, 0);
-    }
-    if (kernel_count > 0) {
-        LOG_INFO_V0("Cleared %d kernel dispatch-table entries", kernel_count);
-    }
-    runtime->clear_registered_kernels();
-
     LOG_INFO_V0("=== Finalize Complete ===");

     if (rc == 0 && runtime_status != 0) {
diff --git a/src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp b/src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
index ea2d502e6..27b8c1972 100644
--- a/src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
+++ b/src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
@@ -981,21 +981,6 @@ extern "C" int validate_runtime_impl(Runtime *runtime, const HostApi *api) {
     release_tensor_leases(runtime, api);
     end_temporary_buffer_run_if_active(api, runtime->temporary_buffer_run_active_);

-    // Clear the per-run dispatch-table entries staged by register_callable_impl.
-    // The underlying chip-callable device buffer is pool-managed by
-    // DeviceRunner (keyed by content hash) and bulk-freed in
-    // DeviceRunner::finalize(); re-running the same callable repeatedly
-    // should not re-upload.
-    int kernel_count = runtime->get_registered_kernel_count();
-    for (int i = 0; i < kernel_count; i++) {
-        int func_id = runtime->get_registered_kernel_func_id(i);
-        runtime->set_function_bin_addr(func_id, 0);
-    }
-    if (kernel_count > 0) {
-        LOG_INFO_V0("Cleared %d kernel dispatch-table entries", kernel_count);
-    }
-    runtime->clear_registered_kernels();
-
     LOG_INFO_V0("=== Finalize Complete ===");

     if (rc == 0 && runtime_status != 0) {