docs: add TRB temporary buffer plan by puddingfjz · Pull Request #1198 · hw-native-sys/simpler

puddingfjz · 2026-06-29T12:00:33Z

Summary

Add an implementation plan for TRB temporary variable buffer reuse.
Document runner-scoped budget configuration, preallocated retained chunks, serial run assumption, and bind/validate cleanup ownership.
Clarify non-goals: no model/kernel changes, no hidden H2D removal, no cross-run double buffering, and no fallback on positive-budget exhaustion.
Follow up on the optimization direction discussed in Add: run latency optimization assessment #1186.

Testing

Not run; documentation-only change.

coderabbitai · 2026-06-29T12:00:49Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 51ab601f-70d9-4a11-b8b1-d67e3b3953c8

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds docs/trb-serial-tensor-buffer-pool-plan.md, a 590-line design document specifying a lease-based bump allocator for reusing pre-retained device memory across decode steps in the tensormap_and_ringbuffer path, covering ownership model, DeviceRunnerBase placement, HostApi wiring, bind/validate path changes, Qwen3 tensor classification, metrics, tests, acceptance criteria, and deferred work.

Changes

TRB Serial Tensor Buffer Pool Plan

Layer / File(s)	Summary
Scope, non-goals, and lease ownership model `docs/trb-serial-tensor-buffer-pool-plan.md`	Establishes the optimization's decision scope, explicit non-goals, the current bind/validate allocation/free flow for non-child tensors, and replaces implicit `TensorPair` ownership with an explicit lease model using release kinds.
DeviceRunnerBase placement, budget contract, and bump allocator lifecycle `docs/trb-serial-tensor-buffer-pool-plan.md`	Specifies relocating buffer management to `DeviceRunnerBase`, defines the aggregate `max_temporary_buffer_bytes` budget contract with alignment/padding and error rules, configuration ingress rules, segmented-chunk retained memory approach, HostApi wiring, and the per-run `configure/begin/acquire/end/clear` lifecycle.
Concurrency assumptions and bind/validate path changes `docs/trb-serial-tensor-buffer-pool-plan.md`	States the single-active-run concurrency assumption and enumerates non-additions (no locking, no fallback malloc, no double buffering), then details bind path changes (begin run, acquire slices, record leases), validate path changes (release-kind dispatch replacing `device_free`, clear leases, end run), cleanup contract, and preserved copy semantics including Qwen3 tensor classification.
Metrics, tests, acceptance criteria, and deferred work `docs/trb-serial-tensor-buffer-pool-plan.md`	Defines lightweight counter/metrics requirements, minimum unit and integration tests, acceptance criteria for correctness/compatibility/performance, and explicitly deferred items.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Poem

🐇 A plan arrives, fresh from the burrow below,
Bump allocators reuse memory's flow,
Leases replace frees on the hot decode path,
One runner, one run — no concurrency wrath,
The doc hops in, and the rabbit says: go! 🥕

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly matches the documented TRB temporary buffer plan addition.
Description check	✅ Passed	The description accurately summarizes the documentation-only TRB temporary buffer plan and its non-goals.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

gemini-code-assist

Code Review

This pull request introduces an implementation plan for a TRB Temporary Variable Buffer to optimize memory allocation overhead. The review feedback highlights three key areas for improvement: simplifying the cleanup contract by leveraging the existing validate_runtime_impl gateway, resolving a design contradiction between the aggregate budget contract and segmented chunk allocation, and adding a lightweight active-run guard to prevent silent data corruption from concurrent runs.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-29T12:02:05Z

+Cleanup ownership must be explicit:
+
+```text
+bind owns cleanup until bind succeeds
+validate owns cleanup after bind succeeds
+```
+
+If `begin_temporary_buffer_run()` succeeds, exactly one matching
+`end_temporary_buffer_run()` must run. This applies whether bind, H2D, memset,
+run, status readback, D2H, or validation fails.
+
+Bind should use a local cleanup guard:
+
+```text
+temp_run_active = false
+
+if temporary buffer is enabled:
+    begin_temporary_buffer_run()
+    temp_run_active = true
+
+for tensor in tensors:
+    acquire or malloc dev_ptr
+    record TensorLease immediately
+    copy or memset
+
+runtime.temporary_buffer_run_active = temp_run_active
+release bind cleanup guard
+```
+
+Before the cleanup guard is released, bind failure cleanup must:
+
+- release all recorded `Free` leases with `device_free()`;
+- leave `BufferNoop` and `ExternalNoop` tensor leases as per-tensor no-ops;
+- end the temporary-buffer run if `temp_run_active` is true;
+- clear recorded leases.
+
+After bind succeeds, validate cleanup must perform the same release dispatch
+and end the temporary-buffer run if
+`runtime.temporary_buffer_run_active` is true.


The proposed cleanup contract introduces a local cleanup guard in bind to handle failures before bind succeeds, and a separate cleanup path in validate. However, in the existing c_api_shared.cpp implementation, validate_runtime_impl is always called by the caller (run_prepared) if bind_callable_to_runtime_impl fails:

rc = bind_callable_to_runtime_impl(...); if (rc != 0) { r->set_gm_sm_ptr(nullptr); validate_runtime_impl(r); return rc; }

Because validate_runtime_impl is already the designated single cleanup gateway for any failure after Runtime construction, having bind perform its own manual cleanup of the temporary-buffer run and leases duplicates the release dispatch logic and increases the risk of double-cleanup or mismatched state.

Suggestion:
Simplify the contract by letting validate_runtime_impl handle all cleanup on failure. When begin_temporary_buffer_run() succeeds, immediately set runtime.temporary_buffer_run_active = true and record leases in runtime.tensor_leases_ as they are acquired. If any subsequent step in bind fails, simply return an error and let the caller's existing validate_runtime_impl call handle the rollback of Free leases and end the temporary-buffer run.

References

Ensure documentation and diagrams accurately reflect implementation details regarding resource lifecycles, especially when persistence is used to maintain internal state like caches.

Ensure that code snippets provided in documentation are kept in sync with the actual implementation in the codebase, including robust error handling (e.g., try-except blocks) and edge-case guards.

gemini-code-assist · 2026-06-29T12:02:05Z

+Allocation policy:
+
+- Support multiple chunks so the implementation does not depend on the largest
+  contiguous allocatable HBM block.
+- Allocate retained chunks during positive-budget configuration. Do not add
+  chunks lazily from `acquire()` during bind.
+- Never let total retained chunk capacity exceed
+  `max_temporary_buffer_bytes`.
+- A tensor slice must be contiguous within one chunk. If a single tensor is
+  larger than every retained chunk, configuration must create a large-enough
+  chunk within the same aggregate budget or fail before the run.
+


There is a design contradiction between the "Budget Contract" and "Segmented Chunks" sections:

The "Budget Contract" states that the runtime does not know model-specific tensor sizes and only receives an aggregate max_temporary_buffer_bytes limit.

The "Segmented Chunks" section states that chunks are allocated statically during configuration (not lazily during acquire), and that if a single tensor is larger than every retained chunk, the configuration must create a large-enough chunk or fail before the run.

If the configuration API only receives the aggregate max_temporary_buffer_bytes and has no knowledge of individual tensor sizes, it cannot know how to segment the budget into chunks that are guaranteed to fit the largest contiguous tensor. For example, if the budget is split into multiple chunks, a single large tensor might fail to find a contiguous slot even if the aggregate budget is sufficient.

Suggestion:
Clarify how the chunk allocation policy decides the individual chunk sizes during configuration. If the largest tensor size is unknown, the runner should default to allocating a single contiguous chunk of max_temporary_buffer_bytes to guarantee that any tensor within the budget can be allocated contiguously, or the configuration API should accept a list of chunk sizes instead of a single aggregate limit.

gemini-code-assist · 2026-06-29T12:02:05Z

+If two host threads call `run_prepared()` concurrently on the same runner while
+temporary buffering is enabled, behavior is unsupported. The caller or serving
+scheduler is responsible for serializing same-runner runs.
+
+Future same-runner concurrency must add a run-lifecycle mutex, active-run
+guard, fallback-to-malloc behavior, or true double buffering. That work is
+outside this implementation plan.
+


While the plan assumes a single active run per runner and treats concurrent same-runner runs as unsupported, failing to guard against concurrency can lead to silent data corruption (e.g., if a second thread calls begin_temporary_buffer_run() and resets the chunk offsets to zero while the first thread's run is still active).

Suggestion:
Instead of leaving concurrent runs completely unguarded, implement a lightweight active-run guard (e.g., using a std::atomic<bool> or std::atomic_flag on the runner). If begin_temporary_buffer_run() is called while a run is already active, it should fail-fast with a clear error or assertion. This provides a robust safety net against programming errors with zero performance overhead on the hot path.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

docs/trb-serial-tensor-buffer-pool-plan.md (3)
11-11: 📐 Maintainability & Code Quality | 🔵 Trivial

Consider using English consistently throughout the document.

Line 11 uses Chinese characters "临时变量" while the rest of the document is entirely in English. For consistency and to avoid encoding issues in some toolchains, consider using only English: "This plan uses 'temporary variable buffer' for the same concept as 'linshi biancun buffer'."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/trb-serial-tensor-buffer-pool-plan.md` at line 11, The document mixes
Chinese and English, so update the referenced text in the plan to use English
consistently. Replace the Chinese phrase in the buffer description with an
English equivalent in the same section of
docs/trb-serial-tensor-buffer-pool-plan.md, keeping the wording aligned with the
existing terminology used elsewhere in the document.
154-157: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Clarify finalize cleanup ordering when a run is active.

The text says to "log it, release retained chunks" if finalize sees an active run. For defensive correctness, explicitly state whether end_temporary_buffer_run() should be called before clear() to reset the active_ flag, or if clear() alone is sufficient because the runner is being destroyed. This prevents implementers from leaving the flag in an inconsistent state if any post-finalize diagnostics check it.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/trb-serial-tensor-buffer-pool-plan.md` around lines 154 - 157, Clarify
the finalize cleanup order in finalize_common()/sim finalize when an active
temporary-buffer run is present: specify whether end_temporary_buffer_run() must
be called before clear() to reset active_, or whether clear() alone is
sufficient because the runner is being torn down. Update the contract wording so
implementers know how to handle the active run state consistently and avoid
leaving diagnostics-visible flags stale.
526-552: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Add unit tests for configuration edge cases.

Consider adding focused unit tests for:

configure_temporary_buffer with same budget twice is a no-op (no reallocation);

configure_temporary_buffer while a run is active fails with clear error;

begin_temporary_buffer_run failure path (e.g., OOM during chunk allocation) is handled correctly.

These complement the existing test list and guard against reconfiguration bugs.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/trb-serial-tensor-buffer-pool-plan.md` around lines 526 - 552, Add
focused unit tests for the temporary-buffer configuration edge cases: verify
configure_temporary_buffer is a no-op when called twice with the same budget,
verify configure_temporary_buffer rejects changes while a run is active with a
clear error, and verify begin_temporary_buffer_run handles allocation
failure/OOM correctly. Place these alongside the existing buffer unit tests and
use the same temporary-buffer pool APIs and run lifecycle symbols so the new
coverage matches the current single-active-run model.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/trb-serial-tensor-buffer-pool-plan.md`:
- Around line 448-460: The bind pseudocode currently assumes
begin_temporary_buffer_run() always succeeds and sets temp_run_active=true
unconditionally, but the HostApi callback returns a bool. Update the bind flow
to check the return value from begin_temporary_buffer_run() and fail the bind
path immediately with a clear error if it returns false, before any TensorLease
recording or device memory work; keep temp_run_active=true only after a
successful begin_temporary_buffer_run() call.

---

Nitpick comments:
In `@docs/trb-serial-tensor-buffer-pool-plan.md`:
- Line 11: The document mixes Chinese and English, so update the referenced text
in the plan to use English consistently. Replace the Chinese phrase in the
buffer description with an English equivalent in the same section of
docs/trb-serial-tensor-buffer-pool-plan.md, keeping the wording aligned with the
existing terminology used elsewhere in the document.
- Around line 154-157: Clarify the finalize cleanup order in
finalize_common()/sim finalize when an active temporary-buffer run is present:
specify whether end_temporary_buffer_run() must be called before clear() to
reset active_, or whether clear() alone is sufficient because the runner is
being torn down. Update the contract wording so implementers know how to handle
the active run state consistently and avoid leaving diagnostics-visible flags
stale.
- Around line 526-552: Add focused unit tests for the temporary-buffer
configuration edge cases: verify configure_temporary_buffer is a no-op when
called twice with the same budget, verify configure_temporary_buffer rejects
changes while a run is active with a clear error, and verify
begin_temporary_buffer_run handles allocation failure/OOM correctly. Place these
alongside the existing buffer unit tests and use the same temporary-buffer pool
APIs and run lifecycle symbols so the new coverage matches the current
single-active-run model.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e992fafc-1d81-49ff-a3d7-3cea3aeb9bfd

📥 Commits

Reviewing files that changed from the base of the PR and between 86bd60a and 185bc6e.

📒 Files selected for processing (1)

docs/trb-serial-tensor-buffer-pool-plan.md

coderabbitai · 2026-06-29T12:04:43Z

+temp_run_active = false
+
+if temporary buffer is enabled:
+    begin_temporary_buffer_run()
+    temp_run_active = true
+
+for tensor in tensors:
+    acquire or malloc dev_ptr
+    record TensorLease immediately
+    copy or memset
+
+runtime.temporary_buffer_run_active = temp_run_active
+release bind cleanup guard


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Add error handling for begin_temporary_buffer_run() failure in bind pseudocode.

The pseudocode sets temp_run_active = true unconditionally after begin_temporary_buffer_run(), but the actual callback returns bool. If begin fails (e.g., configuration error, out of memory), the bind path should fail early rather than proceeding with temp_run_active = true. Add:

if temporary buffer is enabled: if !begin_temporary_buffer_run(): fail with clear error temp_run_active = true

This aligns with the actual bool return type specified in the HostApi wiring section (line 360).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/trb-serial-tensor-buffer-pool-plan.md` around lines 448 - 460, The bind pseudocode currently assumes begin_temporary_buffer_run() always succeeds and sets temp_run_active=true unconditionally, but the HostApi callback returns a bool. Update the bind flow to check the return value from begin_temporary_buffer_run() and fail the bind path immediately with a clear error if it returns false, before any TensorLease recording or device memory work; keep temp_run_active=true only after a successful begin_temporary_buffer_run() call.

- Add a retained temporary variable buffer owned by DeviceRunnerBase and wire it through HostApi, C ABI, ChipWorker, and Worker configuration.\n- Convert TRB tensor cleanup to explicit leases so configured runs reuse buffer slices while disabled runs keep malloc/free semantics.\n- Cover buffer lifecycle, TRB bind/validate cleanup, child-memory regressions, and Python configuration entrypoints.

- Replace the non-English temporary-buffer note in the plan doc - Group chip child-process startup settings to keep the helper under ruff's argument limit - Apply clang-format changes reported by CI

gemini-code-assist Bot reviewed Jun 29, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 29, 2026

View reviewed changes

puddingfjz force-pushed the docs/trb-temp-buffer-plan-20260629 branch from 185bc6e to befb5c5 Compare July 1, 2026 03:46

puddingfjz added 6 commits July 1, 2026 11:55

docs: add TRB temporary buffer plan

267842d

Fix: propagate TRB temporary buffer to L3 chip children

8c4d2bd

fix: align TRB temporary buffer allocations

433070f

Fix: cover temp-buffer copy failure cleanup

f80230b

Fix: adapt TRB temp buffer to shared HostApi

3bf8392

puddingfjz force-pushed the docs/trb-temp-buffer-plan-20260629 branch from befb5c5 to 3bf8392 Compare July 1, 2026 04:17

Fix: satisfy pre-commit after HostApi rebase

cd8c36e

- Replace the non-English temporary-buffer note in the plan doc - Group chip child-process startup settings to keep the helper under ruff's argument limit - Apply clang-format changes reported by CI

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

docs: add TRB temporary buffer plan#1198

docs: add TRB temporary buffer plan#1198
puddingfjz wants to merge 7 commits into
hw-native-sys:mainfrom
puddingfjz:docs/trb-temp-buffer-plan-20260629

puddingfjz commented Jun 29, 2026

Uh oh!

coderabbitai Bot commented Jun 29, 2026 •

edited

Loading

Review skipped

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 29, 2026

Uh oh!

gemini-code-assist Bot Jun 29, 2026

Uh oh!

gemini-code-assist Bot Jun 29, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

puddingfjz commented Jun 29, 2026

Summary

Testing

Uh oh!

coderabbitai Bot commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jun 29, 2026 •

edited

Loading