
docs: add TRB temporary buffer plan #1198

Open
puddingfjz wants to merge 7 commits into hw-native-sys:main from puddingfjz:docs/trb-temp-buffer-plan-20260629

Conversation

@puddingfjz

Contributor

Summary

  • Add an implementation plan for TRB temporary variable buffer reuse.
  • Document runner-scoped budget configuration, preallocated retained chunks, serial run assumption, and bind/validate cleanup ownership.
  • Clarify non-goals: no model/kernel changes, no hidden H2D removal, no cross-run double buffering, and no fallback on positive-budget exhaustion.
  • Follow up on the optimization direction discussed in Add: run latency optimization assessment #1186.

Testing

  • Not run; documentation-only change.

@coderabbitai

coderabbitai Bot commented Jun 29, 2026


Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 51ab601f-70d9-4a11-b8b1-d67e3b3953c8

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough


Adds docs/trb-serial-tensor-buffer-pool-plan.md, a 590-line design document specifying a lease-based bump allocator for reusing pre-retained device memory across decode steps in the tensormap_and_ringbuffer path, covering ownership model, DeviceRunnerBase placement, HostApi wiring, bind/validate path changes, Qwen3 tensor classification, metrics, tests, acceptance criteria, and deferred work.

Changes

TRB Serial Tensor Buffer Pool Plan

| Layer / File(s) | Summary |
|---|---|
| Scope, non-goals, and lease ownership model (docs/trb-serial-tensor-buffer-pool-plan.md) | Establishes the optimization's decision scope, explicit non-goals, the current bind/validate allocation/free flow for non-child tensors, and replaces implicit TensorPair ownership with an explicit lease model using release kinds. |
| DeviceRunnerBase placement, budget contract, and bump allocator lifecycle (docs/trb-serial-tensor-buffer-pool-plan.md) | Specifies relocating buffer management to DeviceRunnerBase, defines the aggregate max_temporary_buffer_bytes budget contract with alignment/padding and error rules, configuration ingress rules, segmented-chunk retained memory approach, HostApi wiring, and the per-run configure/begin/acquire/end/clear lifecycle. |
| Concurrency assumptions and bind/validate path changes (docs/trb-serial-tensor-buffer-pool-plan.md) | States the single-active-run concurrency assumption and enumerates non-additions (no locking, no fallback malloc, no double buffering), then details bind path changes (begin run, acquire slices, record leases), validate path changes (release-kind dispatch replacing device_free, clear leases, end run), cleanup contract, and preserved copy semantics including Qwen3 tensor classification. |
| Metrics, tests, acceptance criteria, and deferred work (docs/trb-serial-tensor-buffer-pool-plan.md) | Defines lightweight counter/metrics requirements, minimum unit and integration tests, acceptance criteria for correctness/compatibility/performance, and explicitly deferred items. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Poem

🐇 A plan arrives, fresh from the burrow below,
Bump allocators reuse memory's flow,
Leases replace frees on the hot decode path,
One runner, one run — no concurrency wrath,
The doc hops in, and the rabbit says: go! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly matches the documented TRB temporary buffer plan addition. |
| Description check | ✅ Passed | The description accurately summarizes the documentation-only TRB temporary buffer plan and its non-goals. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces an implementation plan for a TRB Temporary Variable Buffer to optimize memory allocation overhead. The review feedback highlights three key areas for improvement: simplifying the cleanup contract by leveraging the existing validate_runtime_impl gateway, resolving a design contradiction between the aggregate budget contract and segmented chunk allocation, and adding a lightweight active-run guard to prevent silent data corruption from concurrent runs.


Comment on lines +434 to +472
Cleanup ownership must be explicit:

```text
bind owns cleanup until bind succeeds
validate owns cleanup after bind succeeds
```

If `begin_temporary_buffer_run()` succeeds, exactly one matching
`end_temporary_buffer_run()` must run. This applies whether bind, H2D, memset,
run, status readback, D2H, or validation fails.

Bind should use a local cleanup guard:

```text
temp_run_active = false

if temporary buffer is enabled:
begin_temporary_buffer_run()
temp_run_active = true

for tensor in tensors:
acquire or malloc dev_ptr
record TensorLease immediately
copy or memset

runtime.temporary_buffer_run_active = temp_run_active
release bind cleanup guard
```

Before the cleanup guard is released, bind failure cleanup must:

- release all recorded `Free` leases with `device_free()`;
- leave `BufferNoop` and `ExternalNoop` tensor leases as per-tensor no-ops;
- end the temporary-buffer run if `temp_run_active` is true;
- clear recorded leases.

After bind succeeds, validate cleanup must perform the same release dispatch
and end the temporary-buffer run if
`runtime.temporary_buffer_run_active` is true.


medium

The proposed cleanup contract introduces a local cleanup guard in bind to handle failures before bind succeeds, and a separate cleanup path in validate. However, in the existing c_api_shared.cpp implementation, validate_runtime_impl is always called by the caller (run_prepared) if bind_callable_to_runtime_impl fails:

rc = bind_callable_to_runtime_impl(...);
if (rc != 0) {
    r->set_gm_sm_ptr(nullptr);
    validate_runtime_impl(r);
    return rc;
}

Because validate_runtime_impl is already the designated single cleanup gateway for any failure after Runtime construction, having bind perform its own manual cleanup of the temporary-buffer run and leases duplicates the release dispatch logic and increases the risk of double-cleanup or mismatched state.

Suggestion:
Simplify the contract by letting validate_runtime_impl handle all cleanup on failure. When begin_temporary_buffer_run() succeeds, immediately set runtime.temporary_buffer_run_active = true and record leases in runtime.tensor_leases_ as they are acquired. If any subsequent step in bind fails, simply return an error and let the caller's existing validate_runtime_impl call handle the rollback of Free leases and end the temporary-buffer run.

References
  1. Ensure documentation and diagrams accurately reflect implementation details regarding resource lifecycles, especially when persistence is used to maintain internal state like caches.
  2. Ensure that code snippets provided in documentation are kept in sync with the actual implementation in the codebase, including robust error handling (e.g., try-except blocks) and edge-case guards.
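
The single-gateway contract suggested above can be sketched roughly as follows. This is a minimal illustration, not the code in c_api_shared.cpp: `RuntimeSketch`, `bind_sketch`, `validate_sketch`, and the two counters are hypothetical names, and the counters stand in for the real `device_free()` and `end_temporary_buffer_run()` calls. Only the release kinds (`Free`, `BufferNoop`, `ExternalNoop`) come from the plan.

```cpp
#include <cassert>
#include <vector>

// Release kinds named after the plan's lease model.
enum class ReleaseKind { Free, BufferNoop, ExternalNoop };

struct TensorLease {
    void *dev_ptr;
    ReleaseKind kind;
};

// Hypothetical stand-in for the runtime state shared by bind and validate.
struct RuntimeSketch {
    std::vector<TensorLease> tensor_leases;
    bool temporary_buffer_run_active = false;
    int device_free_calls = 0;  // illustration only: counts device_free() calls
    int end_run_calls = 0;      // illustration only: counts end-run calls
};

// Bind records state as it goes and, on failure, returns without cleaning up.
int bind_sketch(RuntimeSketch &rt, bool fail_midway) {
    rt.temporary_buffer_run_active = true;  // set right after begin succeeds
    rt.tensor_leases.push_back({nullptr, ReleaseKind::Free});
    if (fail_midway) {
        return -1;  // the caller is expected to invoke validate_sketch()
    }
    rt.tensor_leases.push_back({nullptr, ReleaseKind::BufferNoop});
    return 0;
}

// Validate is the only place that dispatches on release kind and ends the run.
void validate_sketch(RuntimeSketch &rt) {
    for (const TensorLease &lease : rt.tensor_leases) {
        if (lease.kind == ReleaseKind::Free) {
            ++rt.device_free_calls;  // device_free(lease.dev_ptr) would go here
        }
        // BufferNoop and ExternalNoop leases need no per-tensor release.
    }
    rt.tensor_leases.clear();
    if (rt.temporary_buffer_run_active) {
        ++rt.end_run_calls;  // end_temporary_buffer_run() would go here
        rt.temporary_buffer_run_active = false;
    }
}
```

Because validate clears the leases and the active flag, calling it a second time is a no-op in this sketch, which is the property that guards against double cleanup.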

Comment on lines +334 to +345
Allocation policy:

- Support multiple chunks so the implementation does not depend on the largest
  contiguous allocatable HBM block.
- Allocate retained chunks during positive-budget configuration. Do not add
  chunks lazily from `acquire()` during bind.
- Never let total retained chunk capacity exceed
  `max_temporary_buffer_bytes`.
- A tensor slice must be contiguous within one chunk. If a single tensor is
  larger than every retained chunk, configuration must create a large-enough
  chunk within the same aggregate budget or fail before the run.


medium

There is a design contradiction between the "Budget Contract" and "Segmented Chunks" sections:

  1. The "Budget Contract" states that the runtime does not know model-specific tensor sizes and only receives an aggregate max_temporary_buffer_bytes limit.
  2. The "Segmented Chunks" section states that chunks are allocated statically during configuration (not lazily during acquire), and that if a single tensor is larger than every retained chunk, the configuration must create a large-enough chunk or fail before the run.

If the configuration API only receives the aggregate max_temporary_buffer_bytes and has no knowledge of individual tensor sizes, it cannot know how to segment the budget into chunks that are guaranteed to fit the largest contiguous tensor. For example, if the budget is split into multiple chunks, a single large tensor might fail to find a contiguous slot even if the aggregate budget is sufficient.

Suggestion:
Clarify how the chunk allocation policy decides the individual chunk sizes during configuration. If the largest tensor size is unknown, the runner should default to allocating a single contiguous chunk of max_temporary_buffer_bytes to guarantee that any tensor within the budget can be allocated contiguously, or the configuration API should accept a list of chunk sizes instead of a single aggregate limit.
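
To make the concern concrete, here is a small host-side sketch of a first-fit bump allocator over fixed chunks. The class `ChunkedBump`, the `Slice` struct, and the byte sizes are all hypothetical and chosen only for illustration; plain offsets stand in for device pointers, and alignment padding is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical slice descriptor: which chunk, and the byte offset inside it.
struct Slice {
    std::size_t chunk;
    std::size_t offset;
};

// Hypothetical chunked bump allocator. Chunk capacities are fixed up front,
// mirroring "allocate retained chunks during configuration".
class ChunkedBump {
public:
    explicit ChunkedBump(std::vector<std::size_t> capacities)
        : capacity_(std::move(capacities)), used_(capacity_.size(), 0) {}

    // A slice must be contiguous inside one chunk: first fit, no fallback.
    bool acquire(std::size_t bytes, Slice &out) {
        for (std::size_t i = 0; i < capacity_.size(); ++i) {
            if (capacity_[i] - used_[i] >= bytes) {
                out = Slice{i, used_[i]};
                used_[i] += bytes;
                return true;
            }
        }
        return false;  // no single chunk has enough room left
    }

private:
    std::vector<std::size_t> capacity_;
    std::vector<std::size_t> used_;
};
```

With a 1024-byte aggregate budget, `ChunkedBump(std::vector<std::size_t>{512, 512})` rejects a 768-byte request even though 1024 bytes are free in total, while `ChunkedBump(std::vector<std::size_t>{1024})` serves it. That is the gap the configuration step cannot close without knowing the largest tensor size.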

Comment on lines +308 to +315
If two host threads call `run_prepared()` concurrently on the same runner while
temporary buffering is enabled, behavior is unsupported. The caller or serving
scheduler is responsible for serializing same-runner runs.

Future same-runner concurrency must add a run-lifecycle mutex, active-run
guard, fallback-to-malloc behavior, or true double buffering. That work is
outside this implementation plan.


medium

While the plan assumes a single active run per runner and treats concurrent same-runner runs as unsupported, failing to guard against concurrency can lead to silent data corruption (e.g., if a second thread calls begin_temporary_buffer_run() and resets the chunk offsets to zero while the first thread's run is still active).

Suggestion:
Instead of leaving concurrent runs completely unguarded, implement a lightweight active-run guard (e.g., using a std::atomic<bool> or std::atomic_flag on the runner). If begin_temporary_buffer_run() is called while a run is already active, it should fail fast with a clear error or assertion. This provides a safety net against programming errors at the cost of one atomic operation per run, which should be negligible on the hot path.
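
A minimal sketch of such a guard, assuming it lives on the runner next to the buffer state. The class name `ActiveRunGuard` is hypothetical; the two method names mirror the plan's lifecycle calls but are not the plan's actual signatures.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical active-run guard built on std::atomic<bool>.
class ActiveRunGuard {
public:
    // Returns false if a run is already active, so the caller can fail fast
    // instead of resetting chunk offsets underneath the first run.
    bool begin_temporary_buffer_run() {
        bool expected = false;
        return active_.compare_exchange_strong(expected, true,
                                               std::memory_order_acq_rel,
                                               std::memory_order_acquire);
    }

    // Marks the run as finished so the next begin can succeed.
    void end_temporary_buffer_run() {
        active_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> active_{false};
};
```

The guard does not make concurrent runs work; it only turns the unsupported case into an explicit failure on the second `begin_temporary_buffer_run()` call.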

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
docs/trb-serial-tensor-buffer-pool-plan.md (3)

11-11: 📐 Maintainability & Code Quality | 🔵 Trivial

Consider using English consistently throughout the document.

Line 11 uses Chinese characters "临时变量" ("temporary variable") while the rest of the document is entirely in English. For consistency and to avoid encoding issues in some toolchains, consider using only English: "This plan uses 'temporary variable buffer' for the same concept as 'linshi biancun buffer'."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/trb-serial-tensor-buffer-pool-plan.md` at line 11, The document mixes
Chinese and English, so update the referenced text in the plan to use English
consistently. Replace the Chinese phrase in the buffer description with an
English equivalent in the same section of
docs/trb-serial-tensor-buffer-pool-plan.md, keeping the wording aligned with the
existing terminology used elsewhere in the document.

154-157: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Clarify finalize cleanup ordering when a run is active.

The text says to "log it, release retained chunks" if finalize sees an active run. For defensive correctness, explicitly state whether end_temporary_buffer_run() should be called before clear() to reset the active_ flag, or if clear() alone is sufficient because the runner is being destroyed. This prevents implementers from leaving the flag in an inconsistent state if any post-finalize diagnostics check it.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/trb-serial-tensor-buffer-pool-plan.md` around lines 154 - 157, Clarify
the finalize cleanup order in finalize_common()/sim finalize when an active
temporary-buffer run is present: specify whether end_temporary_buffer_run() must
be called before clear() to reset active_, or whether clear() alone is
sufficient because the runner is being torn down. Update the contract wording so
implementers know how to handle the active run state consistently and avoid
leaving diagnostics-visible flags stale.

526-552: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Add unit tests for configuration edge cases.

Consider adding focused unit tests for:

  • configure_temporary_buffer with same budget twice is a no-op (no reallocation);
  • configure_temporary_buffer while a run is active fails with clear error;
  • begin_temporary_buffer_run failure path (e.g., OOM during chunk allocation) is handled correctly.

These complement the existing test list and guard against reconfiguration bugs.
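
The first two cases could be exercised against a host-only model along these lines. `TempBufferConfigSketch` and its allocation counter are hypothetical, and the sketch assumes that any configure call during an active run is rejected, including one with an unchanged budget; the plan may choose differently.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical host-only model of the configuration rules under test.
class TempBufferConfigSketch {
public:
    bool configure_temporary_buffer(std::size_t budget) {
        if (active_) {
            return false;  // assumed rule: no reconfiguration during a run
        }
        if (budget == budget_) {
            return true;  // same budget: keep retained chunks, no reallocation
        }
        budget_ = budget;
        ++allocations_;  // stands in for releasing and re-retaining chunks
        return true;
    }

    bool begin_temporary_buffer_run() {
        if (active_) {
            return false;
        }
        active_ = true;
        return true;
    }

    void end_temporary_buffer_run() { active_ = false; }

    int allocations() const { return allocations_; }

private:
    std::size_t budget_ = 0;
    bool active_ = false;
    int allocations_ = 0;
};
```

A test would then call `configure_temporary_buffer(1024)` twice and expect `allocations()` to stay at 1, and expect a configure call between begin and end to return false.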

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/trb-serial-tensor-buffer-pool-plan.md` around lines 526 - 552, Add
focused unit tests for the temporary-buffer configuration edge cases: verify
configure_temporary_buffer is a no-op when called twice with the same budget,
verify configure_temporary_buffer rejects changes while a run is active with a
clear error, and verify begin_temporary_buffer_run handles allocation
failure/OOM correctly. Place these alongside the existing buffer unit tests and
use the same temporary-buffer pool APIs and run lifecycle symbols so the new
coverage matches the current single-active-run model.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/trb-serial-tensor-buffer-pool-plan.md`:
- Around line 448-460: The bind pseudocode currently assumes
begin_temporary_buffer_run() always succeeds and sets temp_run_active=true
unconditionally, but the HostApi callback returns a bool. Update the bind flow
to check the return value from begin_temporary_buffer_run() and fail the bind
path immediately with a clear error if it returns false, before any TensorLease
recording or device memory work; keep temp_run_active=true only after a
successful begin_temporary_buffer_run() call.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e992fafc-1d81-49ff-a3d7-3cea3aeb9bfd

📥 Commits

Reviewing files that changed from the base of the PR and between 86bd60a and 185bc6e.

📒 Files selected for processing (1)
  • docs/trb-serial-tensor-buffer-pool-plan.md

Comment on lines +448 to +460
temp_run_active = false

if temporary buffer is enabled:
    begin_temporary_buffer_run()
    temp_run_active = true

for tensor in tensors:
    acquire or malloc dev_ptr
    record TensorLease immediately
    copy or memset

runtime.temporary_buffer_run_active = temp_run_active
release bind cleanup guard


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Add error handling for begin_temporary_buffer_run() failure in bind pseudocode.

The pseudocode sets temp_run_active = true unconditionally after begin_temporary_buffer_run(), but the actual callback returns bool. If begin fails (e.g., configuration error, out of memory), the bind path should fail early rather than proceeding with temp_run_active = true. Add:

if temporary buffer is enabled:
    if !begin_temporary_buffer_run():
        fail with clear error
    temp_run_active = true

This aligns with the actual bool return type specified in the HostApi wiring section (line 360).
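
In C++ terms the checked flow might look like the sketch below. `HostApiSketch`, `BindState`, and `bind_with_begin_check` are hypothetical names; only the `bool` return of `begin_temporary_buffer_run()` is taken from the review comment, and the per-tensor work is reduced to a counter.

```cpp
#include <cassert>
#include <functional>

// Hypothetical subset of HostApi: only the begin callback, returning bool.
struct HostApiSketch {
    std::function<bool()> begin_temporary_buffer_run;
};

// Hypothetical bind-side state, tracked so the early exit is observable.
struct BindState {
    bool temp_run_active = false;
    int tensors_bound = 0;
};

int bind_with_begin_check(const HostApiSketch &api, bool buffer_enabled,
                          int tensor_count, BindState &state) {
    if (buffer_enabled) {
        if (!api.begin_temporary_buffer_run()) {
            return -1;  // fail before any lease recording or device work
        }
        state.temp_run_active = true;  // only after a successful begin
    }
    for (int i = 0; i < tensor_count; ++i) {
        ++state.tensors_bound;  // acquire or malloc, record lease, copy or memset
    }
    return 0;
}
```

When the callback returns false, no tensor is touched and `temp_run_active` stays false, so cleanup has no temporary-buffer run to end.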


@puddingfjz puddingfjz force-pushed the docs/trb-temp-buffer-plan-20260629 branch from 185bc6e to befb5c5 Compare July 1, 2026 03:46
- Add a retained temporary variable buffer owned by DeviceRunnerBase and wire it through HostApi, C ABI, ChipWorker, and Worker configuration.
- Convert TRB tensor cleanup to explicit leases so configured runs reuse buffer slices while disabled runs keep malloc/free semantics.
- Cover buffer lifecycle, TRB bind/validate cleanup, child-memory regressions, and Python configuration entrypoints.
@puddingfjz puddingfjz force-pushed the docs/trb-temp-buffer-plan-20260629 branch from befb5c5 to 3bf8392 Compare July 1, 2026 04:17
- Replace the non-English temporary-buffer note in the plan doc
- Group chip child-process startup settings to keep the helper under ruff's argument limit
- Apply clang-format changes reported by CI