hw-native-sys · puddingfjz · Jun 29, 2026 · Jun 30, 2026 · Jun 30, 2026 · Jul 1, 2026
diff --git a/docs/trb-auto-realloc-temporary-buffer-modification-plan.md b/docs/trb-auto-realloc-temporary-buffer-modification-plan.md
@@ -0,0 +1,335 @@
+# TRB AUTO Realloc Temporary Buffer Modification Plan
+
+**Date**: 2026-07-01
+**Status**: replacement modification plan
+
+## Purpose
+
+This document replaces the previous chunk-growth AUTO plan for PR 1198 while
+keeping that older plan file intact for review history.
+
+The new target is simpler:
+
+- no multi-chunk automatic growth mechanism;
+- one retained temporary-buffer allocation per runner;
+- at each run, compute the whole temporary-buffer requirement before staging;
+- if the retained buffer is too small, free it first and allocate one new
+  buffer for the current run;
+- use 1024-byte address alignment for the temporary buffer.
+
+The previous plan remains in
+`docs/trb-auto-temporary-buffer-modification-plan.md`.
+
+## Target Behavior
+
+Temporary buffering still has two modes:
+
+- `off`: use the existing per-run `device_malloc()` / `device_free()` path.
+- `auto`: use one runner-scoped retained temporary buffer.
+
+The default mode is `off`.
+
+AUTO mode does not take a caller-provided byte budget. The retained buffer
+starts empty. On each TRB bind, the host builds a run plan for all ordinary
+non-child tensors that would use temporary storage. The plan is packed with
+1024-byte alignment. If the current retained buffer is large enough, the run
+reuses it. If it is not large enough, the implementation frees the old
+retained buffer and allocates one new retained buffer for this run.
+
+There is no incremental chunk growth and no per-acquire allocation. After
+`begin_temporary_buffer_run(plan)` succeeds, every
+`acquire_temporary_buffer_slice()` must be satisfied from the retained buffer.
+A miss after a successful begin is a bug in plan/acquire consistency and must
+fail clearly. It must not fall back to ordinary `device_malloc()`.
+
+## Alignment Contract
+
+Use a single alignment constant for this feature:
+
+```cpp
+static constexpr size_t kTemporaryBufferAlignment = 1024;
+```
+
+Apply it to both:
+
+- the retained buffer base address exposed to tensor slices;
+- every tensor slice offset allocated from that retained buffer.
+
+If the platform allocator does not guarantee 1024-byte alignment directly, the
+temporary buffer must over-allocate by up to 1023 bytes (alignment minus one)
+and store both the raw and the aligned address:
+
+```cpp
+struct Buffer {
+    void *raw_base;  // address returned by the platform allocator
+    void *base;      // 1024-byte aligned address used by slices
+    size_t capacity; // usable bytes from base
+    size_t offset;   // bump offset from base for the current run
+};
+
+Only `raw_base` is passed to the platform free callback. The usable
+`capacity` is the number of bytes available starting at the aligned `base`,
+which can be smaller than the raw allocation size.
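
The over-allocation scheme can be sketched as follows. This is a minimal illustration, not the PR's implementation: `make_aligned_buffer`, `release_buffer`, and the use of `std::malloc` / `std::free` as stand-ins for the platform allocator callbacks are assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

static constexpr std::size_t kTemporaryBufferAlignment = 1024;

struct Buffer {
    void *raw_base = nullptr;  // pointer returned by the allocator; the only one ever freed
    void *base = nullptr;      // 1024-byte aligned address used by slices
    std::size_t capacity = 0;  // usable bytes from base
    std::size_t offset = 0;    // bump offset from base
};

// Hypothetical helper: std::malloc stands in for the platform allocator.
// Requests (alignment - 1) extra bytes so that an aligned base with
// `required` usable bytes always fits inside the raw allocation.
inline bool make_aligned_buffer(std::size_t required, Buffer *out) {
    const std::size_t slack = kTemporaryBufferAlignment - 1;
    if (required == 0 || required > SIZE_MAX - slack) return false;
    void *raw = std::malloc(required + slack);
    if (raw == nullptr) return false;
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(raw);
    const std::uintptr_t aligned = (addr + slack) & ~static_cast<std::uintptr_t>(slack);
    out->raw_base = raw;
    out->base = reinterpret_cast<void *>(aligned);
    out->capacity = required;  // conservative: bytes guaranteed to exist after base
    out->offset = 0;
    return true;
}

// Frees the raw pointer, never the aligned one, and clears all state.
inline void release_buffer(Buffer *buf) {
    std::free(buf->raw_base);
    *buf = Buffer{};
}
```

Recording `capacity` as exactly `required` keeps the usable size independent of where the allocator happened to place the raw pointer.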
+
+The required capacity for a run is computed with the same 1024-byte alignment
+rule as real acquire:
+
+```text
+offset = 0
+for item in plan:
+    offset = align_up(offset, 1024)
+    offset += item.bytes
+required = offset
+```
+
+The implementation may round `required` up to a multiple of 1024 bytes before
+storing it as capacity, but it must not use a coarse fixed MiB chunk
+granularity.
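
The packing loop above translates almost directly into C++. The sketch below is illustrative only: `align_up` and `packed_size` are assumed helper names, and overflow checks are left out for brevity.

```cpp
#include <cstddef>
#include <vector>

struct TemporaryBufferPlanItem {
    std::size_t bytes;
    std::size_t alignment;
};

// Rounds value up to the next multiple of alignment (must be a power of two).
constexpr std::size_t align_up(std::size_t value, std::size_t alignment) {
    return (value + alignment - 1) & ~(alignment - 1);
}

// Mirrors the packing rule: align the running offset to 1024, then add the
// item size. The result is the minimum usable capacity for the run.
inline std::size_t packed_size(const std::vector<TemporaryBufferPlanItem> &plan) {
    std::size_t offset = 0;
    for (const TemporaryBufferPlanItem &item : plan) {
        offset = align_up(offset, 1024);
        offset += item.bytes;
    }
    return offset;
}
```

For items of 100, 1024, and 1 bytes the slice offsets are 0, 1024, and 2048, so `required` is 2049. Rounding that up to a multiple of 1024 would store a capacity of 3072.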
+
+## Run Planning
+
+Before staging tensors in TRB bind, build a plan using the same filtering and
+ordering as real acquire:
+
+```text
+for tensor in orch_args, in real bind order:
+    if tensor.is_child_memory():
+        skip
+    else:
+        append {bytes=tensor.nbytes(), alignment=1024}
+```
+
+The plan includes ordinary non-child input, INOUT, and output tensors. Child
+memory stays pass-through and is not included.
+
+Zero-byte tensors should not force a retained-buffer allocation. The plan and
+real acquire path must handle them consistently. The preferred behavior is to
+skip zero-byte tensors in the temporary-buffer plan and avoid consuming buffer
+capacity for them.
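
One way to keep planning and real acquire consistent is to route both through a single predicate. The sketch below assumes a hypothetical `TensorArg` type and the helper names `uses_temporary_buffer` and `build_plan`; the real tensor type and bind loop are not shown in this plan.

```cpp
#include <cstddef>
#include <vector>

struct TemporaryBufferPlanItem {
    std::size_t bytes;
    std::size_t alignment;
};

// Hypothetical stand-in for one bound tensor argument.
struct TensorArg {
    std::size_t nbytes;
    bool child_memory;
};

// True when the tensor takes a slice of the retained buffer. Plan building and
// real acquire would both call this so that they filter identically.
inline bool uses_temporary_buffer(const TensorArg &t) {
    return !t.child_memory && t.nbytes != 0;
}

// Walks the arguments in bind order and records one plan item per tensor that
// needs temporary storage; child memory and zero-byte tensors are skipped.
inline std::vector<TemporaryBufferPlanItem> build_plan(const std::vector<TensorArg> &args) {
    std::vector<TemporaryBufferPlanItem> plan;
    for (const TensorArg &t : args) {
        if (uses_temporary_buffer(t)) {
            plan.push_back({t.nbytes, 1024});
        }
    }
    return plan;
}
```

With this shape, a run whose tensors are all zero-byte or child memory produces an empty plan and therefore a required size of zero.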
+
+## Host API Shape
+
+Use plan-based AUTO callbacks, not a byte-budget API:
+
+```cpp
+struct TemporaryBufferPlanItem {
+    size_t bytes;
+    size_t alignment;
+};
+
+bool (*temporary_buffer_enabled)();
+bool (*begin_temporary_buffer_run)(
+    const TemporaryBufferPlanItem *items, size_t item_count);
+void *(*acquire_temporary_buffer_slice)(size_t bytes, size_t alignment);
+void (*end_temporary_buffer_run)();
+```
+
+`begin_temporary_buffer_run()` computes the packed required size and ensures
+the retained buffer is large enough for the whole run.
+
+## Buffer State
+
+The implementation should store a single retained buffer, not a vector of
+chunks:
+
+```cpp
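+Buffer buffer_;                  // the single retained buffer
+size_t retained_bytes_;          // mirrors buffer_.capacity
+size_t current_run_used_bytes_;  // padding plus bytes acquired in this run
+size_t high_water_used_bytes_;   // largest per-run usage seen so far
+bool enabled_;                   // AUTO mode is on
+bool active_;                    // between a successful begin and end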
+
+Maintain these invariants:
+
+- `retained_bytes_ == buffer_.capacity`;
+- `retained_bytes_ == 0` when `buffer_.raw_base == nullptr`;
+- `buffer_.base` is 1024-byte aligned when non-null;
+- `buffer_.offset` is reset to zero only after begin succeeds;
+- `current_run_used_bytes_` is reset to zero only after begin succeeds;
+- real acquire increments `current_run_used_bytes_` by padding plus bytes;
+- `end_temporary_buffer_run()` raises `high_water_used_bytes_` to
+  `current_run_used_bytes_` when the run exceeded it, and clears `active_`;
+- clear/finalize releases `raw_base` and resets all retained-buffer state.
+
+Useful diagnostics are:
+
+- `retained_bytes`;
+- `high_water_used_bytes`;
+- `realloc_count`;
+- `realloc_failed_count`;
+- `buffer_backed_allocation_count`.
+
+Do not expose a public budget getter.
+
+## Begin-Run Resize Logic
+
+`begin_temporary_buffer_run(plan)` owns the resize decision:
+
+```text
+if AUTO is disabled:
+    return false
+
+if active_ is true:
+    fail clearly; do not reset offset
+    return false
+
+required = packed_size(plan, alignment=1024)
+
+if retained_bytes_ >= required:
+    buffer_.offset = 0
+    current_run_used_bytes_ = 0
+    active_ = true
+    return true
+
+free existing retained buffer and reset buffer_ (raw_base, base, capacity)
+retained_bytes_ = 0
+
+# required > 0 here: a zero-byte plan always passes the capacity check above,
+# so an empty run never allocates
+allocate one new retained buffer with usable capacity >= required
+if allocation fails:
+    active_ = false
+    return false
+
+buffer_.offset = 0
+retained_bytes_ = new usable capacity
+current_run_used_bytes_ = 0
+active_ = true
+return true
+```
+
+This is intentionally not transactional with respect to the old retained
+buffer. If a larger run requires resize and the new allocation fails, the old
+retained buffer has already been released. That follows the required
+free-then-allocate behavior and avoids keeping two large temporary buffers
+alive at once.
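
The free-then-allocate decision can be modelled in a few lines. This is a simplified sketch under stated assumptions: the `RetainedBuffer` class, its `Alloc` / `Free` callback types, and the method names are hypothetical, and alignment and slice bookkeeping are omitted.

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Simplified model of the begin-run resize decision. Alloc and Free are
// hypothetical stand-ins for the platform allocator callbacks.
class RetainedBuffer {
public:
    using Alloc = std::function<void *(std::size_t)>;
    using Free = std::function<void(void *)>;

    RetainedBuffer(Alloc alloc, Free free_fn)
        : alloc_(std::move(alloc)), free_(std::move(free_fn)) {}

    bool begin_run(std::size_t required) {
        if (active_) return false;  // nested begin: leave all state untouched
        if (retained_bytes_ < required) {
            release();  // free first so two large buffers never coexist
            base_ = alloc_(required);
            if (base_ == nullptr) return false;  // the old buffer is already gone
            retained_bytes_ = required;
        }
        offset_ = 0;
        active_ = true;
        return true;
    }

    void end_run() { active_ = false; }

    void release() {
        if (base_ != nullptr) free_(base_);
        base_ = nullptr;
        retained_bytes_ = 0;
    }

    std::size_t retained_bytes() const { return retained_bytes_; }

private:
    Alloc alloc_;
    Free free_;
    void *base_ = nullptr;
    std::size_t retained_bytes_ = 0;
    std::size_t offset_ = 0;
    bool active_ = false;
};
```

In this model a smaller later run reuses the larger buffer, and a failed resize leaves `retained_bytes()` at zero, which matches the non-transactional behavior described above.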
+
+## Real Acquire Logic
+
+After begin succeeds, real acquire is a single-buffer bump allocator:
+
+```text
+if not active:
+    fail
+
+alignment = max(requested_alignment, 1024)
+old_offset = buffer_.offset
+aligned = align_up(old_offset, alignment)
+
+if aligned > buffer_.capacity or bytes > buffer_.capacity - aligned:
+    fail clearly
+
+ptr = buffer_.base + aligned
+buffer_.offset = aligned + bytes
+current_run_used_bytes_ += aligned - old_offset + bytes
+return ptr
+```
+
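+The caller must pass 1024 for temporary tensor slices. The implementation
+should still validate that any requested alignment is a power of two and use
+at least 1024. Because `base` is only guaranteed to be 1024-byte aligned and
+the run plan is packed at 1024 bytes, a larger requested alignment is not
+covered by the plan and may make acquire fail.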
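
A bump-allocator acquire with the validation described here might look like the sketch below. `BumpState`, `acquire_slice`, and `is_power_of_two` are hypothetical names, addresses are modelled as `std::uintptr_t`, and a return value of 0 stands in for "fail clearly".

```cpp
#include <cstddef>
#include <cstdint>

static constexpr std::size_t kTemporaryBufferAlignment = 1024;

constexpr bool is_power_of_two(std::size_t v) {
    return v != 0 && (v & (v - 1)) == 0;
}

// Hypothetical single-buffer bump state; base is assumed 1024-byte aligned.
struct BumpState {
    std::uintptr_t base;
    std::size_t capacity;
    std::size_t offset;
    std::size_t current_run_used_bytes;
    bool active;
};

// Returns the slice address, or 0 on failure. State is left untouched on
// failure and there is no fallback to another allocator.
inline std::uintptr_t acquire_slice(BumpState &s, std::size_t bytes,
                                    std::size_t requested_alignment) {
    if (!s.active || !is_power_of_two(requested_alignment)) return 0;
    const std::size_t alignment = requested_alignment > kTemporaryBufferAlignment
                                      ? requested_alignment
                                      : kTemporaryBufferAlignment;
    const std::size_t old_offset = s.offset;
    if (old_offset > SIZE_MAX - (alignment - 1)) return 0;  // align-up would overflow
    const std::size_t aligned = (old_offset + alignment - 1) & ~(alignment - 1);
    // Two-step bounds check avoids unsigned underflow in capacity - aligned.
    if (aligned > s.capacity || bytes > s.capacity - aligned) return 0;
    s.offset = aligned + bytes;
    s.current_run_used_bytes += (aligned - old_offset) + bytes;  // padding plus bytes
    return s.base + aligned;
}
```

Capturing `old_offset` before aligning is what lets the usage counter include the padding skipped between slices.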
+
+## Cleanup And Lifetime
+
+Release the retained buffer when:
+
+- AUTO is disabled;
+- an explicit clear path is called;
+- runner/device context finalizes;
+- `begin_temporary_buffer_run(plan)` needs a larger buffer.
+
+Do not shrink merely because a later run is smaller. Smaller later runs reuse
+the larger retained buffer until one of the release events above occurs.
+
+If finalize sees an active temporary-buffer run, log a programming error and
+still release the retained buffer before allocator teardown.
+
+## Implementation Steps
+
+1. Update `TemporaryVariableBuffer`.
+   - Replace chunk-vector state with a single retained buffer.
+   - Remove suffix growth and repeated simulation.
+   - Add 1024-byte alignment for base and slices.
+   - Add packed-size computation for the whole run plan.
+   - Implement free-then-allocate resize in begin-run.
+
+2. Update onboard and sim `DeviceRunnerBase`.
+   - Keep AUTO enable/disable APIs.
+   - Remove chunk-specific diagnostics.
+   - Report retained bytes, high-water, realloc count, and realloc failures.
+
+3. Update common `HostApi`.
+   - Keep `TemporaryBufferPlanItem`.
+   - Keep `temporary_buffer_enabled()`.
+   - Keep plan-based `begin_temporary_buffer_run(items, item_count)`.
+   - Do not restore `temporary_buffer_budget()`.
+
+4. Update TRB bind path for a2a3 and a5.
+   - Build the plan from ordinary non-child tensors before staging.
+   - Use 1024-byte alignment in the plan and real acquire.
+   - Begin AUTO run before staging.
+   - Fail clearly if begin or acquire fails.
+   - Keep child-memory, H2D, memset, and copy-back semantics unchanged.
+
+5. Update Python/C++ public API.
+   - Keep mode-based configuration, for example
+     `configure_temporary_buffer_auto(bool enabled)`.
+   - Keep `temporary_buffer_mode = "off" | "auto"`.
+   - Do not reintroduce caller-provided byte budgets.
+
+6. Update tests.
+   - Cover initial empty AUTO begin and allocation.
+   - Cover same-shape reuse with no realloc.
+   - Cover larger later run freeing old buffer and allocating one new buffer.
+   - Cover smaller later run not shrinking.
+   - Cover allocation failure after old buffer is freed.
+   - Cover 1024-byte base and slice alignment.
+   - Keep TRB child-memory, OUT memset, and error-cleanup regressions.
+
+## Test Plan
+
+Run focused unit tests first:
+
+```text
+tests/ut/cpp/common/test_temporary_variable_buffer.cpp
+tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp
+tests/ut/py/test_chip_worker.py
+tests/ut/py/test_worker/test_host_worker.py
+```
+
+Then run TRB prepared-callable coverage for both architectures where
+available:
+
+```text
+a2a3 TRB prepared-callable ST
+a5 TRB prepared-callable ST
+```
+
+Hardware tests must use `task-submit`.
+
+For performance validation, use Qwen3 Path A with the same matrix already
+requested for PR 1198:
+
+- skill-default model/input/output setting;
+- batch size 1 and 16;
+- short input and 256-token input;
+- output length 20, 256, and 512;
+- compare AUTO enabled vs disabled on the same NPU where possible.
+
+## Acceptance Criteria
+
+- No public caller-provided temporary-buffer byte budget remains.
+- AUTO starts empty and does not allocate until the first planned run.
+- The retained temporary buffer is a single allocation, not retained chunks.
+- All temporary-buffer slice addresses are 1024-byte aligned.
+- Same-shape repeated runs reuse the retained buffer without reallocating.
+- A larger later run frees the old retained buffer before allocating a new
+  one.
+- A smaller later run does not shrink the retained buffer.
+- Allocation failure during resize leaves no retained old buffer behind.
+- Acquire failure after successful begin fails clearly and never falls back to
+  ordinary malloc.
+- Child-memory pass-through, OUT memset, and copy-back semantics are
+  unchanged.