335 changes: 335 additions & 0 deletions docs/trb-auto-realloc-temporary-buffer-modification-plan.md
@@ -0,0 +1,335 @@
# TRB AUTO Realloc Temporary Buffer Modification Plan

**Date**: 2026-07-01
**Status**: replacement modification plan

## Purpose

This document replaces the previous chunk-growth AUTO plan for PR 1198 while
keeping that older plan file intact for review history.

The new target is simpler:

- no multi-chunk automatic growth mechanism;
- one retained temporary-buffer allocation per runner;
- at each run, compute the whole temporary-buffer requirement before staging;
- if the retained buffer is too small, free it first and allocate one new
buffer for the current run;
- use 1024-byte address alignment for the temporary buffer.

The previous plan remains in
`docs/trb-auto-temporary-buffer-modification-plan.md`.

## Target Behavior

Temporary buffering still has two modes:

- `off`: use the existing per-run `device_malloc()` / `device_free()` path.
- `auto`: use one runner-scoped retained temporary buffer.

The default mode is `off`.

AUTO mode does not take a caller-provided byte budget. The retained buffer
starts empty. On each TRB bind, the host builds a run plan for all ordinary
non-child tensors that would use temporary storage. The plan is packed with
1024-byte alignment. If the current retained buffer is large enough, the run
reuses it. If it is not large enough, the implementation frees the old
retained buffer and allocates one new retained buffer for this run.

There is no incremental chunk growth and no per-acquire allocation. After
`begin_temporary_buffer_run(plan)` succeeds, every
`acquire_temporary_buffer_slice()` must be satisfied from the retained buffer.
A miss after a successful begin is a bug in plan/acquire consistency and must
fail clearly. It must not fall back to ordinary `device_malloc()`.

## Alignment Contract

Use a single alignment constant for this feature:

```cpp
static constexpr size_t kTemporaryBufferAlignment = 1024;
```

Apply it to both:

- the retained buffer base address exposed to tensor slices;
- every tensor slice offset allocated from that retained buffer.

If the platform allocator does not guarantee 1024-byte alignment directly, the
temporary buffer must over-allocate and store both addresses:

```cpp
struct Buffer {
    void *raw_base;   // address returned by the platform allocator
    void *base;       // 1024-byte aligned address used by slices
    size_t capacity;  // usable bytes from base
    size_t offset;    // bump offset from base within the current run
};
```

Only `raw_base` is passed to the platform free callback. The usable
`capacity` is the bytes available from aligned `base`.
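
The over-allocation rule can be sketched as follows. This is a minimal illustration, not the planned implementation: `allocate_aligned()` and `release()` are hypothetical helper names, and `std::malloc` / `std::free` stand in for the platform allocate and free callbacks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

static constexpr size_t kTemporaryBufferAlignment = 1024;

struct Buffer {
    void *raw_base = nullptr;  // address returned by the allocator
    void *base = nullptr;      // 1024-byte aligned address used by slices
    size_t capacity = 0;       // usable bytes from base
    size_t offset = 0;         // bump offset within the current run
};

// Hypothetical helper: over-allocate so that an aligned base with at least
// `required` usable bytes always exists inside the raw allocation.
inline bool allocate_aligned(Buffer &buf, size_t required) {
    // The raw pointer is at most (alignment - 1) bytes short of the next
    // aligned address, so this much slack is always enough.
    const size_t slack = kTemporaryBufferAlignment - 1;
    if (required > SIZE_MAX - slack) return false;  // size overflow
    void *raw = std::malloc(required + slack);      // stand-in for the platform allocator
    if (raw == nullptr) return false;
    const uintptr_t raw_addr = reinterpret_cast<uintptr_t>(raw);
    const uintptr_t aligned = (raw_addr + slack) & ~static_cast<uintptr_t>(slack);
    buf.raw_base = raw;
    buf.base = reinterpret_cast<void *>(aligned);
    buf.capacity = required;  // conservative: bytes guaranteed usable from base
    buf.offset = 0;
    return true;
}

// Hypothetical helper: only raw_base goes back to the allocator.
inline void release(Buffer &buf) {
    std::free(buf.raw_base);
    buf = Buffer{};
}
```

Storing `capacity = required` is the conservative choice in this sketch; the raw allocation may hold a few more usable bytes past that, but the plan never needs them.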

The required capacity for a run is computed with the same 1024-byte alignment
rule as real acquire:

```text
offset = 0
for item in plan:
    offset = align_up(offset, 1024)
    offset += item.bytes
required = offset
```

The implementation may round `required` up to the next multiple of 1024 bytes
before storing it as capacity, but it must not round up to a coarse fixed
MiB chunk granularity.
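
A minimal C++ sketch of the packing rule, assuming a `std::vector` of plan items for convenience. `align_up()` and `packed_size()` are illustrative names, and overflow checks are left out for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

static constexpr size_t kTemporaryBufferAlignment = 1024;

struct TemporaryBufferPlanItem {
    size_t bytes;
    size_t alignment;
};

// Round value up to the next multiple of align (align must be a power of two).
constexpr size_t align_up(size_t value, size_t align) {
    return (value + align - 1) & ~(align - 1);
}

// Packed size of a run: every slice starts on a 1024-byte boundary,
// and no padding is added after the last slice.
inline size_t packed_size(const std::vector<TemporaryBufferPlanItem> &plan) {
    size_t offset = 0;
    for (const TemporaryBufferPlanItem &item : plan) {
        offset = align_up(offset, kTemporaryBufferAlignment);
        offset += item.bytes;
    }
    return offset;
}
```

For a plan with slices of 100, 2048, and 1 bytes, the slices start at offsets 0, 1024, and 3072, so `required` is 3073 bytes. Rounding that up to a multiple of 1024 gives a capacity of 4096 bytes.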

## Run Planning

Before staging tensors in TRB bind, build a plan using the same filtering and
ordering as real acquire:

```text
for tensor in orch_args, in real bind order:
    if tensor.is_child_memory():
        skip
    else:
        append {bytes=tensor.nbytes(), alignment=1024}
```

The plan includes ordinary non-child input, INOUT, and output tensors. Child
memory stays pass-through and is not included.

Zero-byte tensors should not force a retained-buffer allocation. The plan and
real acquire path must handle them consistently. The preferred behavior is to
skip zero-byte tensors in the temporary-buffer plan and avoid consuming buffer
capacity for them.
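
The filtering rules above, including the preferred zero-byte skip, might look like this in C++. The `Tensor` struct is a hypothetical stand-in that only mirrors the `is_child_memory()` and `nbytes()` calls used in the pseudocode; the real tensor type and the `build_plan()` name are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

static constexpr size_t kTemporaryBufferAlignment = 1024;

struct TemporaryBufferPlanItem {
    size_t bytes;
    size_t alignment;
};

// Hypothetical stand-in for the real tensor argument type.
struct Tensor {
    size_t bytes;
    bool child;
    bool is_child_memory() const { return child; }
    size_t nbytes() const { return bytes; }
};

// Build the run plan in bind order. The real acquire path must apply the
// same two skips, or plan and acquire will disagree on offsets.
inline std::vector<TemporaryBufferPlanItem> build_plan(const std::vector<Tensor> &orch_args) {
    std::vector<TemporaryBufferPlanItem> plan;
    for (const Tensor &tensor : orch_args) {
        if (tensor.is_child_memory()) continue;  // child memory stays pass-through
        if (tensor.nbytes() == 0) continue;      // zero-byte tensors consume no capacity
        plan.push_back(TemporaryBufferPlanItem{tensor.nbytes(), kTemporaryBufferAlignment});
    }
    return plan;
}
```

With this filter, a run whose tensors are all child memory or zero-byte produces an empty plan, so the required size is zero and no retained buffer is allocated.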

## Host API Shape

Use plan-based AUTO callbacks, not a byte-budget API:

```cpp
struct TemporaryBufferPlanItem {
    size_t bytes;
    size_t alignment;
};

bool (*temporary_buffer_enabled)();
bool (*begin_temporary_buffer_run)(
    const TemporaryBufferPlanItem *items, size_t item_count);
void *(*acquire_temporary_buffer_slice)(size_t bytes, size_t alignment);
void (*end_temporary_buffer_run)();
```

`begin_temporary_buffer_run()` computes the packed required size and ensures
the retained buffer is large enough for the whole run.
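
A caller-side sketch of how the four callbacks could be sequenced. Several details are assumptions rather than part of this plan: the `HostApi` struct layout, the `StageResult` enum, the `stage_run()` name, a null return from acquire on failure, and ending the run immediately when an acquire fails.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct TemporaryBufferPlanItem {
    size_t bytes;
    size_t alignment;
};

// Hypothetical grouping of the callbacks listed above.
struct HostApi {
    bool (*temporary_buffer_enabled)();
    bool (*begin_temporary_buffer_run)(const TemporaryBufferPlanItem *items, size_t item_count);
    void *(*acquire_temporary_buffer_slice)(size_t bytes, size_t alignment);
    void (*end_temporary_buffer_run)();
};

enum class StageResult { kNotEnabled, kOk, kFailed };

// Acquire one slice per plan item. On kOk the caller runs the kernel and
// then calls end_temporary_buffer_run() itself.
inline StageResult stage_run(const HostApi &api,
                             const std::vector<TemporaryBufferPlanItem> &plan,
                             std::vector<void *> &slices) {
    slices.clear();
    if (!api.temporary_buffer_enabled()) {
        return StageResult::kNotEnabled;  // mode "off": caller keeps the per-run malloc path
    }
    if (!api.begin_temporary_buffer_run(plan.data(), plan.size())) {
        return StageResult::kFailed;
    }
    for (const TemporaryBufferPlanItem &item : plan) {
        void *slice = api.acquire_temporary_buffer_slice(item.bytes, item.alignment);
        if (slice == nullptr) {
            // Plan/acquire mismatch: fail the run, never fall back to malloc.
            api.end_temporary_buffer_run();
            slices.clear();
            return StageResult::kFailed;
        }
        slices.push_back(slice);
    }
    return StageResult::kOk;
}
```

Keeping `kNotEnabled` separate from `kFailed` lets the bind path distinguish "AUTO is off, use the existing path" from "AUTO is on and the run must fail".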

## Buffer State

The implementation should store a single retained buffer, not a vector of
chunks:

```cpp
Buffer buffer_;
size_t retained_bytes_;
size_t current_run_used_bytes_;
size_t high_water_used_bytes_;
bool enabled_;
bool active_;
```

Maintain these invariants:

- `retained_bytes_ == buffer_.capacity`;
- `retained_bytes_ == 0` when `buffer_.raw_base == nullptr`;
- `buffer_.base` is 1024-byte aligned when non-null;
- `buffer_.offset` is reset to zero only after begin succeeds;
- `current_run_used_bytes_` is reset to zero only after begin succeeds;
- real acquire increments `current_run_used_bytes_` by padding plus bytes;
- `end_temporary_buffer_run()` updates `high_water_used_bytes_`;
- clear/finalize releases `raw_base` and resets all retained-buffer state.

Useful diagnostics are:

- `retained_bytes`;
- `high_water_used_bytes`;
- `realloc_count`;
- `realloc_failed_count`;
- `buffer_backed_allocation_count`.

Do not expose a public budget getter.

## Begin-Run Resize Logic

`begin_temporary_buffer_run(plan)` owns the resize decision:

```text
if AUTO is disabled:
    return false

if active_ is true:
    fail clearly; do not reset offset
    return false

required = packed_size(plan, alignment=1024)

if retained_bytes_ >= required:
    buffer_.offset = 0
    current_run_used_bytes_ = 0
    active_ = true
    return true

free existing retained buffer
retained_bytes_ = 0

if required == 0:
    buffer_.offset = 0
    current_run_used_bytes_ = 0
    active_ = true
    return true

allocate one new retained buffer with usable capacity >= required
if allocation fails:
    active_ = false
    return false

buffer_.offset = 0
retained_bytes_ = new usable capacity
current_run_used_bytes_ = 0
active_ = true
return true
```

This is intentionally not transactional with respect to the old retained
buffer. If a larger run requires resize and the new allocation fails, the old
retained buffer has already been released. That follows the required
free-then-allocate behavior and avoids keeping two large temporary buffers
alive at once.
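
The resize policy can be exercised without a device by injecting the allocator. The sketch below is illustrative only: the `RetainedBuffer` class, its `std::function` callbacks, and counting the first allocation in `realloc_count` are assumptions. It takes the already-packed `required` size and omits the AUTO enabled flag.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <functional>
#include <utility>

static constexpr size_t kTemporaryBufferAlignment = 1024;

// Hypothetical single-buffer state with injected allocate/free callbacks.
class RetainedBuffer {
public:
    using AllocFn = std::function<void *(size_t)>;
    using FreeFn = std::function<void(void *)>;

    RetainedBuffer(AllocFn alloc, FreeFn free_fn)
        : alloc_(std::move(alloc)), free_(std::move(free_fn)) {}
    RetainedBuffer(const RetainedBuffer &) = delete;
    RetainedBuffer &operator=(const RetainedBuffer &) = delete;
    ~RetainedBuffer() { release(); }

    // `required` is the packed size of the whole run plan.
    bool begin(size_t required) {
        if (active_) return false;         // nested begin: leave offset untouched
        if (retained_bytes_ < required) {  // too small: free first, then allocate
            release();
            const size_t slack = kTemporaryBufferAlignment - 1;
            void *raw = (required <= SIZE_MAX - slack) ? alloc_(required + slack) : nullptr;
            if (raw == nullptr) {
                ++realloc_failed_count_;
                return false;              // the old buffer is already gone
            }
            raw_base_ = raw;
            base_ = (reinterpret_cast<uintptr_t>(raw) + slack) & ~static_cast<uintptr_t>(slack);
            retained_bytes_ = required;
            ++realloc_count_;
        }
        offset_ = 0;                       // reset only once begin is known to succeed
        active_ = true;
        return true;
    }

    void end() { active_ = false; }

    // Clear/finalize path: only raw_base_ is passed to the free callback.
    void release() {
        if (raw_base_ != nullptr) free_(raw_base_);
        raw_base_ = nullptr;
        base_ = 0;
        retained_bytes_ = 0;
        active_ = false;
    }

    size_t retained_bytes() const { return retained_bytes_; }
    size_t realloc_count() const { return realloc_count_; }
    size_t realloc_failed_count() const { return realloc_failed_count_; }
    uintptr_t base() const { return base_; }

private:
    AllocFn alloc_;
    FreeFn free_;
    void *raw_base_ = nullptr;
    uintptr_t base_ = 0;
    size_t retained_bytes_ = 0;
    size_t offset_ = 0;
    size_t realloc_count_ = 0;
    size_t realloc_failed_count_ = 0;
    bool active_ = false;
};
```

Constructed with wrappers around `std::malloc` and `std::free`, a sequence of `begin(4096)`, `begin(1024)`, `begin(8192)` allocates twice: the second run reuses the 4096-byte buffer, and the third frees it before allocating. If that third allocation fails, `retained_bytes()` is 0 afterwards, which is the non-transactional behavior described above.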

## Real Acquire Logic

After begin succeeds, real acquire is a single-buffer bump allocator:

```text
if not active_:
    fail

require requested_alignment is a power of two
alignment = max(requested_alignment, 1024)
old_offset = buffer_.offset
aligned = align_up(old_offset, alignment)

if aligned > buffer_.capacity or bytes > buffer_.capacity - aligned:
    fail clearly

ptr = buffer_.base + aligned
buffer_.offset = aligned + bytes
current_run_used_bytes_ += aligned - old_offset + bytes
return ptr
```

The caller must pass 1024 for temporary tensor slices. The implementation
should still validate that any requested alignment is a power of two and use
at least 1024.
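
A minimal C++ sketch of the bump acquire, written as a free function so it stays self-contained. The `acquire_slice()` name, passing `active` and the usage counter as parameters, and returning `nullptr` on failure are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

static constexpr size_t kTemporaryBufferAlignment = 1024;

struct Buffer {
    void *raw_base;
    void *base;       // 1024-byte aligned address used by slices
    size_t capacity;  // usable bytes from base
    size_t offset;
};

constexpr bool is_power_of_two(size_t value) {
    return value != 0 && (value & (value - 1)) == 0;
}

// Returns nullptr on any failure and leaves the buffer state unchanged.
inline void *acquire_slice(Buffer &buf, bool active, size_t bytes,
                           size_t requested_alignment, size_t &current_run_used_bytes) {
    if (!active || !is_power_of_two(requested_alignment)) return nullptr;
    const size_t alignment = std::max(requested_alignment, kTemporaryBufferAlignment);
    const size_t old_offset = buf.offset;
    if (old_offset > SIZE_MAX - (alignment - 1)) return nullptr;  // align_up would overflow
    // Aligning the offset aligns the address only while base is itself aligned
    // to `alignment`; base is guaranteed to be 1024-byte aligned, no more.
    const size_t aligned = (old_offset + alignment - 1) & ~(alignment - 1);
    // Check aligned first so the subtraction below cannot underflow.
    if (aligned > buf.capacity || bytes > buf.capacity - aligned) return nullptr;
    void *ptr = static_cast<char *>(buf.base) + aligned;
    buf.offset = aligned + bytes;
    current_run_used_bytes += (aligned - old_offset) + bytes;  // padding plus bytes
    return ptr;
}
```

With a 4096-byte buffer, acquiring 100 bytes and then 2048 bytes returns slices at offsets 0 and 1024 and leaves `offset` at 3072. A further 2048-byte request fails because only 1024 bytes remain after alignment.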

## Cleanup And Lifetime

Release the retained buffer when:

- AUTO is disabled;
- an explicit clear path is called;
- runner/device context finalizes;
- `begin_temporary_buffer_run(plan)` needs a larger buffer.

Do not shrink merely because a later run is smaller. Smaller later runs reuse
the larger retained buffer until one of the release events above occurs.

If finalize sees an active temporary-buffer run, log a programming error and
still release the retained buffer before allocator teardown.

## Implementation Steps

1. Update `TemporaryVariableBuffer`.
   - Replace chunk-vector state with a single retained buffer.
   - Remove suffix growth and repeated simulation.
   - Add 1024-byte alignment for base and slices.
   - Add packed-size computation for the whole run plan.
   - Implement free-then-allocate resize in begin-run.

2. Update onboard and sim `DeviceRunnerBase`.
   - Keep AUTO enable/disable APIs.
   - Remove chunk-specific diagnostics.
   - Report retained bytes, high-water, realloc count, and realloc failures.

3. Update common `HostApi`.
   - Keep `TemporaryBufferPlanItem`.
   - Keep `temporary_buffer_enabled()`.
   - Keep plan-based `begin_temporary_buffer_run(items, item_count)`.
   - Do not restore `temporary_buffer_budget()`.

4. Update TRB bind path for a2a3 and a5.
   - Build the plan from ordinary non-child tensors before staging.
   - Use 1024-byte alignment in the plan and real acquire.
   - Begin AUTO run before staging.
   - Fail clearly if begin or acquire fails.
   - Keep child-memory, H2D, memset, and copy-back semantics unchanged.

5. Update Python/C++ public API.
   - Keep mode-based configuration, for example
     `configure_temporary_buffer_auto(bool enabled)`.
   - Keep `temporary_buffer_mode = "off" | "auto"`.
   - Do not reintroduce caller-provided byte budgets.

6. Update tests.
   - Cover initial empty AUTO begin and allocation.
   - Cover same-shape reuse with no realloc.
   - Cover larger later run freeing old buffer and allocating one new buffer.
   - Cover smaller later run not shrinking.
   - Cover allocation failure after old buffer is freed.
   - Cover 1024-byte base and slice alignment.
   - Keep TRB child-memory, OUT memset, and error-cleanup regressions.

## Test Plan

Run focused unit tests first:

```text
tests/ut/cpp/common/test_temporary_variable_buffer.cpp
tests/ut/cpp/common/test_trb_runtime_temp_buffer.cpp
tests/ut/py/test_chip_worker.py
tests/ut/py/test_worker/test_host_worker.py
```

Then run TRB prepared-callable coverage for both architectures where
available:

```text
a2a3 TRB prepared-callable ST
a5 TRB prepared-callable ST
```

Hardware tests must use `task-submit`.

For performance validation, use Qwen3 Path A with the same matrix already
requested for PR 1198:

- skill-default model/input/output setting;
- batch size 1 and 16;
- short input and 256-token input;
- output length 20, 256, and 512;
- compare AUTO enabled vs disabled on the same NPU where possible.

## Acceptance Criteria

- No public caller-provided temporary-buffer byte budget remains.
- AUTO starts empty and does not allocate until the first planned run.
- The retained temporary buffer is a single allocation, not retained chunks.
- All temporary-buffer slice addresses are 1024-byte aligned.
- Same-shape repeated runs reuse the retained buffer without reallocating.
- A larger later run frees the old retained buffer before allocating a new
one.
- A smaller later run does not shrink the retained buffer.
- Allocation failure during resize leaves no retained old buffer behind.
- Acquire failure after successful begin fails clearly and never falls back to
ordinary malloc.
- Child-memory pass-through, OUT memset, and copy-back semantics are
unchanged.