Add: run latency optimization assessment by puddingfjz · Pull Request #1186 · hw-native-sys/simpler

puddingfjz · 2026-06-29T04:27:17Z

Summary

Add a run-level latency optimization assessment for the L2
tensormap_and_ringbuffer path.
Document timing terms for host_wall, device_wall, and device-log Total.
Classify optimization candidates around tensor binding, validate/copy-back,
runtime arena caching, topology caching, logging, and device-wall overhead.
Clarify that full-bind and output-double-buffer pipelines are not planned by
default because they require a second device tensor-buffer set.
Document which staging paths need extra HBM and which remain host-only.

Testing

Docs only; no runtime tests run.
Ran markdown line-width check with awk.
Ran git diff --staged --check.
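
For reviewers who want to reproduce the line-width check without awk, a minimal Python sketch follows. The PR does not state the awk command or the width limit, so the 80-column default and the function name `overlong_lines` are assumptions for illustration only.

```python
from typing import List, Tuple


def overlong_lines(text: str, max_width: int = 80) -> List[Tuple[int, int]]:
    """Return (line_number, length) for every line longer than max_width.

    The 80-column default is an assumed limit, not one taken from the PR.
    """
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > max_width
    ]


if __name__ == "__main__":
    sample = "short line\n" + "x" * 81
    for number, length in overlong_lines(sample):
        print(f"line {number}: {length} columns")
```

An empty result means every line fits within the limit.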

coderabbitai · 2026-06-29T04:27:33Z

Walkthrough

A new 619-line documentation file docs/run-latency-optimization-assessment.md is added, assessing run-level latency optimization options for the tensormap_and_ringbuffer path, including staging levels, a memory/double-buffering contract, eight prioritized optimization candidates with acceptance criteria, decision rules, a measurement plan, and references.

Changes

Run Latency Optimization Assessment

Layer / File(s)	Summary
Scope, timing terms, and staging framework `docs/run-latency-optimization-assessment.md`	Defines document metadata and scope, introduces timing terms (`host_wall`, `device_wall`, `Total`, `Orch`, `Sched`), and details three staging approaches (host-only, full bind, device-control) with HBM and isolation implications.
Double-buffering and memory duplication contract `docs/run-latency-optimization-assessment.md`	Specifies the double-buffering shape, peak HBM impact under staging, categorizes which device data would be duplicated vs. not, and establishes the memory decision list as a contract.
Optimization candidates #1–#8 `docs/run-latency-optimization-assessment.md`	Enumerates the eight optimization candidates (split timers, tensor binding/pooling, validate/copy-back/free improvements, arena/args/kernel-args caching, topology/launch metadata caching, device-wall overlap, cross-run pipelining, logging overhead), each with gate conditions and acceptance criteria.
Decision rules, measurement plan, and references `docs/run-latency-optimization-assessment.md`	Provides recommended ordering after measurement, rejection/postponement criteria, before/after metric thresholds for accepting complexity, and a references section.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A rabbit reads docs by the moonlight glow,
Timing and buffers all neatly in rows,
Split timers first, then the arenas cache,
Double-buffer deferred — let's not be too rash!
🐇 Measure twice, optimize once, that's the way to go!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title is concise and accurately summarizes the docs-only latency optimization assessment.
Description check	✅ Passed	The description matches the documented assessment and validation steps in the changeset.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


gemini-code-assist

Code Review

This pull request introduces a comprehensive run-latency optimization assessment document for the L2 tensormap_and_ringbuffer path, detailing staging levels, memory implications, optimization candidates, and measurement plans. The review feedback focuses on improving the precision and grammar of the documentation, specifically clarifying that Runtime and KernelArgs are separate allocations requiring a 'set of slots' rather than a single 'slot', and correcting the phrasing of writing to an output slot.


gemini-code-assist · 2026-06-29T04:28:23Z

+Small device-control staging may be cheap, for example an extra `Runtime` and
+`KernelArgs` slot. Starting N+1's AICPU scheduler init during N's `Total` is


Since Runtime and KernelArgs are separate allocations/structures, referring to them as a single "slot" is technically inaccurate. It is clearer to refer to them as a "set of slots" or "slots".

Suggested change

Before:

Small device-control staging may be cheap, for example an extra `Runtime` and
`KernelArgs` slot. Starting N+1's AICPU scheduler init during N's `Total` is

After:

Small device-control staging may be cheap, for example an extra set of `Runtime` and
`KernelArgs` slots. Starting N+1's AICPU scheduler init during N's `Total` is

References

Ensure documentation and diagrams accurately reflect implementation details regarding resource lifecycles, especially when persistence is used to maintain internal state like caches.

gemini-code-assist · 2026-06-29T04:28:23Z

+- Resident `Runtime` and `KernelArgs` slots are small control buffers; keeping
+  one slot is preferred.


Similar to the previous point, Runtime and KernelArgs are separate allocations, so "keeping one slot" should be corrected to "keeping one set of slots" or "keeping one slot set" for precision.

Suggested change

Before:

- Resident `Runtime` and `KernelArgs` slots are small control buffers; keeping
  one slot is preferred.

After:

- Resident `Runtime` and `KernelArgs` slots are small control buffers; keeping
  one set of slots is preferred.

References

Ensure documentation and diagrams accurately reflect implementation details regarding resource lifecycles, especially when persistence is used to maintain internal state like caches.

gemini-code-assist · 2026-06-29T04:28:23Z

+- Output double buffering is also not planned by default when it keeps run N
+  outputs live while run N+1 writes a second output slot.


Grammatically, a run "writes to" a slot rather than "writes" a slot. Adding "to" improves readability and precision.

Suggested change

Before:

- Output double buffering is also not planned by default when it keeps run N
  outputs live while run N+1 writes a second output slot.

After:

- Output double buffering is also not planned by default when it keeps run N
  outputs live while run N+1 writes to a second output slot.

coderabbitai

🧹 Nitpick comments (2)

docs/run-latency-optimization-assessment.md (2)
562-619: 📐 Maintainability & Code Quality | 🔵 Trivial

Consider explicitly ordering logging optimization in the recommended sequence.

Candidate 8 (Logging) is absent from the numbered default order. Since noisy logs can perturb the split-timing measurements that step 1 relies on, consider adding "Set quiet log level and verify measurement fidelity" as step 0 or 1a. Without this, teams following the doc may collect noisy baseline measurements.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/run-latency-optimization-assessment.md` around lines 562 - 619, The
default optimization sequence in the Decision Rules is missing the logging step,
which should be called out before measurements are trusted. Update the numbered
recommendation order in this section to explicitly include Candidate 8 by adding
a quiet-logging/measurement-fidelity step near the start (for example, before or
alongside add split timers) so readers know to reduce log noise before
collecting baselines. Keep the guidance aligned with the existing
measurement-first flow and reference the Decision Rules and Measurement Plan
sections for consistency.
61-65: 📐 Maintainability & Code Quality | 🔵 Trivial

Clarify "Total" vs. "Effective" terminology.

The document defines Total as the device-log union of Orch and Sched, but the upstream docs/dfx/l2-timing.md heading lists Effective alongside Orch/Sched and states device_wall is "strictly larger than Orch/Sched/Effective". If Total and Effective are the same concept, add a note mapping them; if they differ, define the distinction. Without this, readers may search the upstream docs for Total and not find it.

Also applies to: 73-74
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/run-latency-optimization-assessment.md` around lines 61 - 65, The
terminology in the latency section is inconsistent because “Total” is used here
where the upstream timing docs use “Effective”; update the description to either
explicitly map “Total” to “Effective” or define how they differ. Adjust the
affected wording in the latency summary and the later note that references the
same concept so readers can match the terms across this document and
docs/dfx/l2-timing.md, especially around the Total/Orch/Sched definitions.
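
To make the relationship between these terms concrete, here is a minimal sketch that treats `Total` as the union of the `Orch` and `Sched` intervals, as the review comment describes. It assumes each phase is logged as a single (start, end) pair in microseconds; the interval values, the `device_wall` figure, and the function name `union_length` are hypothetical and not taken from the repository.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_us, end_us); units are an assumption


def union_length(intervals: List[Interval]) -> float:
    """Time covered by at least one interval, counting overlaps once."""
    covered = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if end < start:
            raise ValueError(f"interval ends before it starts: {(start, end)}")
        if cur_end is None or start > cur_end:
            # Gap before this interval: close out the previous merged span.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching: extend the current merged span.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered


if __name__ == "__main__":
    orch = (0.0, 40.0)      # hypothetical Orch interval
    sched = (10.0, 100.0)   # hypothetical Sched interval
    total = union_length([orch, sched])
    device_wall = 130.0     # hypothetical; expected to exceed Total
    print(f"Total={total} us, device_wall - Total={device_wall - total} us")
```

The difference `device_wall - Total` is the device-side overhead outside the logged phases, which is the quantity the device-wall candidates would try to reduce.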


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 05fd838a-b118-4f6e-9119-5c1f3185c2b6

📥 Commits

Reviewing files that changed from the base of the PR and between 53e291d and 0031064.

📒 Files selected for processing (1)

docs/run-latency-optimization-assessment.md

puddingfjz · 2026-06-29T05:11:51Z

Cross-run overlap stages and device buffer requirements

Step	Phase in run N+1	Needs a double device buffer for cross-run overlap?
Parse args / generate bind plan / check shape and direction	host pre-launch	No. Purely host-only.
topology / block-dim / launch metadata	host pre-launch	No.
full bind: tensor `device_malloc` + H2D/memset	host pre-launch	Yes, a second tensor-buffer set. Not done by default.
Copy `Runtime` / `KernelArgs` to device	host pre-launch	A second small slot is needed only if run N may still read the old slot.
runtime arena image upload	host pre-launch	If run N is still using the old arena, a second device runtime-arena slot is needed. Avoided by default.
AICPU attach arena / wire pointers / reset SM/mailbox	AICPU init	If it overlaps with run N's `Total`, it needs independent arena/SM/control state. Not done by default.
PTO2 shared memory / scheduler state init	AICPU init	Needs a second set of shared/control state. Avoided by default.
orchestration graph build	`Total / Orch`	Running concurrently with run N's `Total` is close to true concurrent runs and needs full isolation. Not done by default.
scheduler dispatch / AICore execute / polling	`Total / Sched`	Same as above. Not done by default.
AICore reads/writes tensor buffers	`Total`	If run N still needs the old data when run N+1 writes, a second tensor/output buffer set is needed.
GM heap usage	attached at init, then used during `Total`	Concurrent independent cursors need partitioning or a second heap. Not done by default.
scheduler shutdown / runtime_destroy	AICPU teardown	Usually no second buffer set is needed; if run N+1 init starts early, state isolation is required.
validate D2H / copy-back / free	host post-sync	Not part of `device_wall`; if run N+1 reuses a buffer early, run N's validate must have completed first.
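
The table above can also be held as data so that tooling or a reviewer can query which steps are safe to overlap. The sketch below is illustrative only: it encodes an abbreviated subset of the rows, and the `BufferNeed` labels, the `Stage` type, and `host_only_stages` are hypothetical names introduced here, not identifiers from the repository.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class BufferNeed(Enum):
    NONE = "no extra device buffer"
    CONDITIONAL = "extra slot only if run N may still read the old one"
    SECOND_SET = "second device buffer or state set"


@dataclass(frozen=True)
class Stage:
    name: str
    phase: str  # where the step sits in run N+1
    need: BufferNeed


# Abbreviated subset of the table rows; the three-way labels are our own.
STAGES: List[Stage] = [
    Stage("parse args / bind plan / shape check", "host pre-launch", BufferNeed.NONE),
    Stage("topology / block-dim / launch metadata", "host pre-launch", BufferNeed.NONE),
    Stage("full bind (device_malloc + H2D/memset)", "host pre-launch", BufferNeed.SECOND_SET),
    Stage("Runtime / KernelArgs copy to device", "host pre-launch", BufferNeed.CONDITIONAL),
    Stage("PTO2 shared memory / scheduler state init", "AICPU init", BufferNeed.SECOND_SET),
]


def host_only_stages(stages: List[Stage]) -> List[str]:
    """Names of stages that can overlap run N without extra device memory."""
    return [stage.name for stage in stages if stage.need is BufferNeed.NONE]


if __name__ == "__main__":
    for name in host_only_stages(STAGES):
        print(name)
```

Keeping the classification explicit makes the default policy easy to state: only the `NONE` rows are overlapped unless measurement justifies the extra HBM.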

Add: run latency optimization assessment

0031064


puddingfjz mentioned this pull request Jun 29, 2026

docs: add TRB temporary buffer plan #1198

Open

puddingfjz wants to merge 1 commit into hw-native-sys:main from puddingfjz:feat/run-latency-optimization-assessment
