Add: run latency optimization assessment #1186

Open
puddingfjz wants to merge 1 commit into
hw-native-sys:mainfrom
puddingfjz:feat/run-latency-optimization-assessment

@puddingfjz

Contributor

Summary

  • Add a run-level latency optimization assessment for the L2
    tensormap_and_ringbuffer path.
  • Document timing terms for host_wall, device_wall, and device-log Total.
  • Classify optimization candidates around tensor binding, validate/copy-back,
    runtime arena caching, topology caching, logging, and device-wall overhead.
  • Clarify that full-bind and output-double-buffer pipelines are not planned by
    default because they require a second device tensor-buffer set.
  • Document which staging paths need extra HBM and which remain host-only.

Testing

  • Docs only; no runtime tests run.
  • Ran markdown line-width check with awk.
  • Ran git diff --staged --check.
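
The line-width check listed above was run with awk; the exact command and limit are not given in this PR. A rough Python equivalent might look like the sketch below. The 80-column limit and the `find_long_lines` helper are assumptions made for illustration, not the project's actual tooling.

```python
from typing import Iterable, List, Tuple


def find_long_lines(lines: Iterable[str], max_width: int = 80) -> List[Tuple[int, int]]:
    """Return (line_number, width) for every line wider than max_width.

    max_width=80 is an assumed limit; the PR does not state the real one.
    """
    too_long = []
    for number, line in enumerate(lines, start=1):
        width = len(line.rstrip("\n"))  # ignore the trailing newline
        if width > max_width:
            too_long.append((number, width))
    return too_long


# Three sample lines: 10, 81, and 80 characters wide.
sample = ["short line\n", "x" * 81 + "\n", "y" * 80 + "\n"]
print(find_long_lines(sample))  # [(2, 81)]
```

To check a real file, pass an open file object, for example `find_long_lines(open(path, encoding="utf-8"))`.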

@coderabbitai

coderabbitai Bot commented Jun 29, 2026


📝 Walkthrough

A new 619-line documentation file docs/run-latency-optimization-assessment.md is added, assessing run-level latency optimization options for the tensormap_and_ringbuffer path, including staging levels, a memory/double-buffering contract, eight prioritized optimization candidates with acceptance criteria, decision rules, a measurement plan, and references.

Changes

Run Latency Optimization Assessment

| Layer / File(s) | Summary |
| --- | --- |
| Scope, timing terms, and staging framework (`docs/run-latency-optimization-assessment.md`) | Defines document metadata and scope, introduces timing terms (host_wall, device_wall, Total, Orch, Sched), and details three staging approaches (host-only, full bind, device-control) with HBM and isolation implications. |
| Double-buffering and memory duplication contract (`docs/run-latency-optimization-assessment.md`) | Specifies the double-buffering shape, peak HBM impact under staging, categorizes which device data would be duplicated vs. not, and establishes the memory decision list as a contract. |
| Optimization candidates 1 through 8 (`docs/run-latency-optimization-assessment.md`) | Enumerates all eight optimization candidates (split timers, tensor binding/pooling, validate/copy-back/free improvements, arena/args/kernel-args caching, topology/launch metadata caching, device-wall overlap, cross-run pipelining, logging overhead), each with gate conditions and acceptance criteria. |
| Decision rules, measurement plan, and references (`docs/run-latency-optimization-assessment.md`) | Provides recommended ordering after measurement, rejection/postponement criteria, before/after metric thresholds for accepting complexity, and a references section. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A rabbit reads docs by the moonlight glow,
Timing and buffers all neatly in rows,
Split timers first, then the arenas cache,
Double-buffer deferred — let's not be too rash!
🐇 Measure twice, optimize once, that's the way to go!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title is concise and accurately summarizes the docs-only latency optimization assessment. |
| Description check | ✅ Passed | The description matches the documented assessment and validation steps in the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comprehensive run-latency optimization assessment document for the L2 tensormap_and_ringbuffer path, detailing staging levels, memory implications, optimization candidates, and measurement plans. The review feedback focuses on improving the precision and grammar of the documentation, specifically clarifying that Runtime and KernelArgs are separate allocations requiring a 'set of slots' rather than a single 'slot', and correcting the phrasing of writing to an output slot.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +133 to +134
Small device-control staging may be cheap, for example an extra `Runtime` and
`KernelArgs` slot. Starting N+1's AICPU scheduler init during N's `Total` is

medium

Since Runtime and KernelArgs are separate allocations/structures, referring to them as a single "slot" is technically inaccurate. It is clearer to refer to them as a "set of slots" or "slots".

Suggested change

Original:
Small device-control staging may be cheap, for example an extra `Runtime` and
`KernelArgs` slot. Starting N+1's AICPU scheduler init during N's `Total` is

Suggested:
Small device-control staging may be cheap, for example an extra set of `Runtime` and
`KernelArgs` slots. Starting N+1's AICPU scheduler init during N's `Total` is

References
  1. Ensure documentation and diagrams accurately reflect implementation details regarding resource lifecycles, especially when persistence is used to maintain internal state like caches.

Comment on lines +385 to +386
- Resident `Runtime` and `KernelArgs` slots are small control buffers; keeping
one slot is preferred.

medium

Similar to the previous point, Runtime and KernelArgs are separate allocations, so "keeping one slot" should be corrected to "keeping one set of slots" or "keeping one slot set" for precision.

Suggested change

Original:
- Resident `Runtime` and `KernelArgs` slots are small control buffers; keeping
one slot is preferred.

Suggested:
- Resident `Runtime` and `KernelArgs` slots are small control buffers; keeping
one set of slots is preferred.

References
  1. Ensure documentation and diagrams accurately reflect implementation details regarding resource lifecycles, especially when persistence is used to maintain internal state like caches.

Comment on lines +498 to +499
- Output double buffering is also not planned by default when it keeps run N
outputs live while run N+1 writes a second output slot.

medium

Grammatically, a run "writes to" a slot rather than "writes" a slot. Adding "to" improves readability and precision.

Suggested change

Original:
- Output double buffering is also not planned by default when it keeps run N
outputs live while run N+1 writes a second output slot.

Suggested:
- Output double buffering is also not planned by default when it keeps run N
outputs live while run N+1 writes to a second output slot.

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
docs/run-latency-optimization-assessment.md (2)

562-619: 📐 Maintainability & Code Quality | 🔵 Trivial

Consider explicitly ordering logging optimization in the recommended sequence.

Candidate 8 (Logging) is absent from the numbered default order. Since noisy logs can perturb the split-timing measurements that step 1 relies on, consider adding "Set quiet log level and verify measurement fidelity" as step 0 or 1a. Without this, teams following the doc may collect noisy baseline measurements.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/run-latency-optimization-assessment.md` around lines 562 - 619, The
default optimization sequence in the Decision Rules is missing the logging step,
which should be called out before measurements are trusted. Update the numbered
recommendation order in this section to explicitly include Candidate 8 by adding
a quiet-logging/measurement-fidelity step near the start (for example, before or
alongside add split timers) so readers know to reduce log noise before
collecting baselines. Keep the guidance aligned with the existing
measurement-first flow and reference the Decision Rules and Measurement Plan
sections for consistency.

61-65: 📐 Maintainability & Code Quality | 🔵 Trivial

Clarify "Total" vs. "Effective" terminology.

The document defines Total as the device-log union of Orch and Sched, but the upstream docs/dfx/l2-timing.md heading lists Effective alongside Orch/Sched and states device_wall is "strictly larger than Orch/Sched/Effective". If Total and Effective are the same concept, add a note mapping them; if they differ, define the distinction. Without this, readers may search the upstream docs for Total and not find it.

Also applies to: 73-74
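
To make the "union of Orch and Sched" reading concrete, here is a minimal sketch. It assumes each phase is a single `(start, end)` interval of device-log timestamps and that "union" means the time covered by at least one phase; the timestamps and the `union_duration` helper are hypothetical, and whether this equals the upstream `Effective` value is exactly the open question raised above.

```python
from typing import List, Tuple

# (start, end) in microseconds; hypothetical device-log timestamps.
Interval = Tuple[float, float]


def union_duration(intervals: List[Interval]) -> float:
    """Length of time covered by at least one of the given intervals."""
    covered = 0.0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged span.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap: extend the current merged span.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


orch = (0.0, 40.0)     # assumed orchestration window
sched = (10.0, 100.0)  # assumed scheduler window, overlapping Orch
total = union_duration([orch, sched])
print(total)  # 100.0
```

Under this reading, overlapping time is counted once, so `Total` can be smaller than the sum of the `Orch` and `Sched` durations (here 100 rather than 130).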

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/run-latency-optimization-assessment.md` around lines 61 - 65, The
terminology in the latency section is inconsistent because “Total” is used here
where the upstream timing docs use “Effective”; update the description to either
explicitly map “Total” to “Effective” or define how they differ. Adjust the
affected wording in the latency summary and the later note that references the
same concept so readers can match the terms across this document and
docs/dfx/l2-timing.md, especially around the Total/Orch/Sched definitions.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 05fd838a-b118-4f6e-9119-5c1f3185c2b6

📥 Commits

Reviewing files that changed from the base of the PR and between 53e291d and 0031064.

📒 Files selected for processing (1)
  • docs/run-latency-optimization-assessment.md

@puddingfjz

Contributor Author

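Cross-run overlap stages and device buffer requirements

| Step | Phase in run N+1 | Is a double device buffer needed for cross-run overlap? |
| --- | --- | --- |
| Parse args / generate bind plan / check shape / direction | host pre-launch | No. Purely host-only. |
| topology / block-dim / launch metadata | host pre-launch | No. |
| full bind: tensor device_malloc + H2D/memset | host pre-launch | Yes, a second set of tensor buffers. Not done by default. |
| Copy Runtime / KernelArgs to device | host pre-launch | A second small slot is needed only if run N may still read the old slot. |
| runtime arena image upload | host pre-launch | If run N is still using the old arena, a second device runtime-arena slot is needed. Avoided by default. |
| AICPU attach arena / wire pointers / reset SM/mailbox | AICPU init | If this overlaps with run N's Total, it needs independent arena/SM/control state. Not done by default. |
| PTO2 shared memory / scheduler state init | AICPU init | Needs a second set of shared/control state. Avoided by default. |
| orchestration graph build | Total / Orch | Running concurrently with run N's Total is close to true concurrent runs and needs full isolation. Not done by default. |
| scheduler dispatch / AICore execute / polling | Total / Sched | Same as above. Not done by default. |
| AICore reads/writes tensor buffer | Total | If run N still needs the old data when run N+1 writes, a second set of tensor/output buffers is needed. |
| GM heap usage | attached at init, then used during Total | Concurrent independent cursors need partitioning or a second heap. Not done by default. |
| scheduler shutdown / runtime_destroy | AICPU teardown | Usually no second buffer set is needed; if run N+1 init starts early, state isolation is needed. |
| validate D2H / copy-back / free | host post-sync | Not part of device_wall; if run N+1 reuses the buffers early, run N's validate must already have completed. |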
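
The constraint in the last row (with a single buffer set, run N+1 must not reuse the buffers before run N's validate has finished) can be sketched as a small ownership guard. `DeviceBufferSet` and its methods are hypothetical names used only for illustration; they are not part of the runtime described in this PR.

```python
class DeviceBufferSet:
    """Hypothetical single tensor-buffer set shared by consecutive runs."""

    def __init__(self) -> None:
        self.owner_run = None
        self.validated = True  # nothing is pending before the first run

    def bind(self, run_id: int) -> None:
        """Hand the buffers to run_id; refuse if the previous run is unvalidated."""
        if not self.validated:
            raise RuntimeError(
                f"run {self.owner_run} has not finished validate/copy-back"
            )
        self.owner_run = run_id
        self.validated = False

    def mark_validated(self, run_id: int) -> None:
        """Record that run_id's validate D2H / copy-back has completed."""
        if run_id != self.owner_run:
            raise ValueError(f"run {run_id} does not own these buffers")
        self.validated = True


buffers = DeviceBufferSet()
buffers.bind(run_id=0)
try:
    buffers.bind(run_id=1)  # too early: run 0 has not been validated yet
    early_reuse_blocked = False
except RuntimeError:
    early_reuse_blocked = True

buffers.mark_validated(run_id=0)
buffers.bind(run_id=1)  # allowed once run 0's validate is done
print(early_reuse_blocked, buffers.owner_run)  # True 1
```

Avoiding this wait is what would require the second tensor/output buffer set that the table marks as not done by default.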
