diff --git a/docs/run-latency-optimization-assessment.md b/docs/run-latency-optimization-assessment.md new file mode 100644 index 000000000..9c1b8face --- /dev/null +++ b/docs/run-latency-optimization-assessment.md @@ -0,0 +1,619 @@ +# Run Latency Optimization Assessment + +**Date**: 2026-06-29 +**Status**: assessment + +## Summary + +This note evaluates run-level latency optimizations for the L2 +`tensormap_and_ringbuffer` path. The focus is the per-token gap between host +wall time, device wall time, and device-log `Total`. + +The main conclusion is: + +- Host-only cross-run staging is low risk but probably low value. +- Overlapping the current full bind path is not a planned optimization because + it needs a second set of per-run device tensor buffers. +- Simple double buffering does not make one run's AICPU init/teardown overlap + with that same run's `Total`; that ordering is a real dependency. +- Cross-run AICPU init overlap is plausible only with isolated per-run + scheduler/control state, or with a longer-lived executor that avoids most + per-run init/teardown. +- The first work item should be measurement: split bind, runtime arena, + device-copy, validate, and AICPU init/teardown timers before implementing a + pipeline. + +## Scope + +This is separate from callable-level prepare overlap. Callable prepare overlap +prepares a future callable while the current callable runs. Run-level work is +the per-invocation work still paid by every `run_prepared()` call: + +- bind this run's tensor and scalar arguments; +- allocate or acquire per-run device buffers; +- copy or memset tensor payloads; +- stage runtime args and kernel args; +- launch AICPU/AICore work; +- copy outputs back and free per-run buffers. + +The current host call order is: + +```text +run_prepared() + bind callable snapshot + bind_callable_to_runtime_impl() + DeviceRunner::run() + validate_runtime_impl() +``` + +For TRB, `bind_callable_to_runtime_impl()` already performs device work for +non-child-memory tensors: it calls `device_malloc()`, then H2D copy or device +memset, before `DeviceRunner::run()` launches. + +## Timing Terms + +The important timing terms are: + +- `host_wall`: host-side `steady_clock` around `run_prepared()`. Includes + bind, launch/sync, and validate. +- `device_wall`: full on-NPU AICPU exec wall. Earliest AICPU entry start to + latest AICPU entry end. +- `Total`: device-log window from `min(orch_start, sched_start)` to + `max(orch_end, sched_end)`. +- `Orch`: device-log orchestrator graph-build window. +- `Sched`: device-log scheduler/execute window across scheduler threads. + +Useful relationship: + +```text +host_wall = host_pre_launch + launch/sync + host_post_sync +device_wall = AICPU init + Total + AICPU teardown +``` + +`Total` is the useful AICPU orchestration plus scheduler/execution window. It +does not include host bind/validate, nor AICPU init/teardown. + +## Staging Levels + +### 1. Host-Only Staging + +Host-only staging means preparing run N+1 while all outputs of the preparation +remain in host memory. It does not call device allocation, device copy, device +memset, runtime-arena upload, or AICPU/AICore launch APIs. + +Examples: + +- parse N+1 `ChipStorageTaskArgs`; +- count tensors/scalars and inspect shape/stride/dtype; +- classify tensor direction and child-memory status; +- decide copy-back policy; +- choose logical double-buffer slots; +- build a host-side bind plan; +- prepare host-side `Runtime` template fields; +- compute block dim, ring sizes, and launch descriptors. + +Expected benefit is likely small. Once `device_malloc`, H2D copy, memset, +runtime upload, and validate/free are excluded, the remaining host CPU work is +usually tens of microseconds to a few hundred microseconds, and only likely +reaches low single-digit milliseconds for very large argument sets or +unexpectedly expensive host code. + +Use host-only overlap only if measurement proves: + +- host-only plan/build time is consistently above 1 ms, or above 5 percent of + steady TPOT; +- the implementation does not force extra device memory; +- the resulting pipeline does not complicate error handling or ownership. + +### 2. Current Full Bind Staging + +The current full TRB bind is not host-only. For every non-child-memory tensor +it: + +- allocates a device buffer; +- copies input/INOUT tensors to device; +- memsets pure output tensors on device; +- records tensor pairs for later D2H and free. + +Therefore, a cross-run pipeline that simply starts today's full bind for run +N+1 while run N is executing needs run N+1 device buffers to coexist with run +N's live buffers. This can hide real `host_pre_launch` time, but it consumes +extra HBM and may contend for allocator locks, H2D bandwidth, and HBM. + +Do not make this the default optimization path. Treat it as out of scope unless +there is an explicit product decision to spend the extra HBM for a second +per-run tensor-buffer set. + +### 3. Device-Control Staging + +Device-control staging means preparing device-side `Runtime`, `KernelArgs`, +PTO2 shared memory, runtime arena image, or even AICPU scheduler state for run +N+1 before run N completes. + +Small device-control staging may be cheap, for example an extra `Runtime` and +`KernelArgs` slot. Starting N+1's AICPU scheduler init during N's `Total` is +much stronger. It requires: + +- separate `Runtime` / `KernelArgs` slots; +- scheduler state that does not overwrite run N; +- AICore handshake/register state that does not disturb run N's active cores; +- a launch/control protocol that is not serialized behind the current + `DeviceRunner::run()` stream sync; +- spare AICPU capacity so N+1 init does not slow N's scheduler/orchestrator. + +Until those isolation properties exist, prefer reducing per-run init/teardown +or using a persistent executor over trying to overlap AICPU init directly. +If this path stages device-side state for N+1, it may need extra small control +slots, an extra runtime arena image, or an independent PTO2 shared-memory +region. Extra small control slots are conditional; extra arena or shared-memory +regions are larger HBM costs and should be avoided by default. + +## Double Buffering And Memory + +The useful cross-run shape is: + +```text +run N: live buffer A ---- execution ---- validate/free A +run N+1: prepare buffer B ---------------- launch B +``` + +Double buffering does not overwrite run N's live memory. It either uses a +separate run N+1 slot or waits until run N releases its slot. + +If a device-side staging path is explicitly accepted, there are two +implementation choices: + +- Allocate N+1 buffers while run N is executing. This hides allocation but can + add allocator variance and fail if free HBM is low. +- Preallocate A/B slots during warmup. This removes allocator latency from the + hot pipeline, but reserves the extra HBM for the worker's lifetime. + +These choices are memory-lifetime mechanisms, not default recommendations for +duplicating tensor buffers, runtime arenas, or PTO2 shared memory. + +In both cases, if the staged object is device-side, peak HBM increases: + +```text +peak_hbm_with_pipeline + = steady_hbm_without_pipeline + + staged_run_N_plus_1_bytes + + allocator/safety margin +``` + +This does not "steal" an already-owned run N buffer. It can still reduce the +free HBM available to run N if both runs allocate from the same device memory +pool and run N performs late allocations. A production design should reserve a +pipeline pool up front or gate staging by measured free HBM. + +Likely duplicated device memory: + +- per-run input/output tensor buffers, if full bind is overlapped; this is not + planned by default; +- `Runtime` device copy; +- `KernelArgs` device copy; +- PTO2 shared memory, if N and N+1 both need independent staged state; +- runtime arena image, if upload overlaps execution; +- diagnostic/device-wall buffers, if N is read after N+1 starts. + +Data that should not be duplicated: + +- uploaded callable/kernel buffers; +- AICPU-prewarmed orchestration SO handles; +- model weights; +- stable KV/cache buffers, when update semantics are in-place and ordered; +- long-lived GM heap region, unless two logical runs need independent heap + cursors at the same time; that concurrent-heap case is not planned by + default. + +Device-buffer duplication classes: + +- No duplication: host-only plans, topology metadata, log-level changes, + resident weights, ordered KV/cache updates, and child-memory pass-through. +- Small control duplication: extra `Runtime`, `KernelArgs`, diagnostic, or + device-wall slots. This is still device memory, but it is not a second tensor + buffer set. +- Arena duplication: an extra runtime arena image or independent PTO2 + shared-memory region. These can be MB-scale or larger, so avoid duplicating + them by default. +- Tensor-buffer duplication: full-bind staging or output double buffering where + run N data stays live while run N+1 writes another buffer. This is not planned + by default. + +Memory decision list: + +- Host-only plan/template: no extra device memory; allowed. +- Runtime/KernelArgs staging: one extra small slot set; conditional. +- Runtime arena upload: one extra runtime arena image; avoid by default. +- PTO2 independent staged state: second shared-memory region; avoid by + default. +- Output retained while next writes: second output buffer; avoid by default. +- Full bind tensor staging: second per-run tensor buffer set; not planned. +- True concurrent device runs: full per-run isolation; not planned. + +This section describes the memory contract only. It does not make double +buffering a first-line optimization; the decision status is captured in +`Cross-Run Pipeline / Double Buffering` below. + +## Optimization Candidates + +### 1. Add Split Timing First + +Current timing is too coarse. `args_malloc_copy` mixes host loops, allocation, +H2D copy, and memset. It cannot prove host-only overlap value. + +Add timers for: + +- callable snapshot bind; +- host-only bind plan construction; +- tensor device allocation; +- tensor H2D copy or memset; +- runtime arena host build; +- runtime arena device upload; +- `Runtime` device copy; +- `KernelArgs` device copy; +- launch/sync; +- validate status/header read; +- validate tensor D2H copy; +- validate free; +- AICPU init, `Total`, and teardown. + +Acceptance: + +- timer overhead is below 1 percent in quiet mode; +- timings are reported with a stable naming scheme; +- p50/p90/p99 are collected for steady decode. + +### 2. Per-Run Tensor Binding + +Current work: + +- allocate device memory for each non-child tensor; +- copy input and INOUT tensors H2D; +- memset pure output tensors on device; +- record tensor pairs for later D2H and free. + +Optimization directions: + +- Keep stable input/output buffers resident across decode steps when shape and + lifetime are stable. +- Pass device-resident child memory through instead of rematerializing host + tensors. +- Pool per-shape tensor buffers and reuse them across runs. +- Keep pooling serial by default: reuse one logical slot after the previous + owner releases it, instead of allocating a second per-run tensor-buffer set. +- Skip H2D for immutable data that is already resident on device, including + KV cache, weights, and constant inputs. +- Skip D2H for intermediate outputs that immediately feed another device run. +- Use smaller explicit output descriptors when the host only needs a scalar or + compact final result. + +Device-buffer duplication note: + +- Resident immutable data should be shared, not duplicated. +- Pooling does not require two device buffers if producer/consumer lifetimes are + serial. +- A second tensor buffer is needed only when old tensor contents must stay live + while the next run writes another logical version; avoid this by default. + +Potential conditions: + +- tensor allocation/copy/memset is consistently above 1 ms; +- tensor shapes are stable across many decode steps; +- large input tensors are copied every run without changing contents; +- KV cache, weights, or constant inputs are repeatedly uploaded from host; +- output tensors are copied back only to feed another device run. + +Acceptance: + +- `host_wall` drops by at least 5 percent, or at least 1 ms for steady decode; +- `device_wall` and `Total` do not increase by more than 1 percent; +- HBM stays inside a documented per-worker budget; +- repeated runs do not grow live allocation count. + +### 3. Validate / Copy-Back / Free + +Code reference: + +- `src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp`: + `validate_runtime_impl()`. + +Current work: + +- read PTO2 shared-memory header back to host; +- copy OUTPUT and INOUT tensors D2H; +- free all per-run tensor device allocations; +- clear dispatch-table and tensor-pair state. + +Optimization directions: + +- Keep outputs on device when the next stage can consume device pointers. +- Copy back only declared final outputs, not every writable tensor. +- Copy only the final small result that must be visible to host; avoid copying + intermediate large tensors. +- Replace per-run free with a buffer-pool release. +- Reuse or pool output buffers so `device_free` moves from every run to a + longer-lived owner or lifecycle boundary. +- Batch small D2H copies when correctness allows. +- Split error-status readback from full output validation. +- Make status/header D2H a lighter-weight error snapshot path, separate from + successful hot-path output materialization. + +Device-buffer duplication note: + +- Keeping an output on device does not require a second buffer if ownership is + transferred to the next consumer and the slot is released before reuse. +- It does require output double buffering if run N's output remains live while + run N+1 writes the same logical output slot. Avoid that for large tensors + unless explicitly accepted. + +Potential conditions: + +- D2H output copy or per-run free is visible in `host_post_sync`; +- workload has multi-stage device pipelines where host does not inspect + intermediate tensors; +- output size is large compared with the host-consumed result. + +Acceptance: + +- `host_post_sync` drops by at least 1 ms, or D2H bytes per token drop by at + least 50 percent; +- no output is skipped without an ownership and consumer contract; +- error paths still copy enough state for diagnostics. + +### 4. Runtime Arena, Runtime Args, And Kernel Args + +Current work: + +- derive effective ring sizes; +- reserve and commit a host arena; +- initialize the prebuilt PTO2 runtime image on host; +- upload the image into the pooled runtime arena each run; +- allocate/copy/free `Runtime` and `KernelArgs` device slots. + +Optimization directions: + +- Cache runtime arena layout metadata and image by `(task_window, heap, + dep_pool)`. +- Patch only fields that differ for the current run. +- Keep resident `Runtime` and `KernelArgs` slots per runner. +- Use two small `Runtime` / `KernelArgs` slots only if N+1 control staging + overlaps N. +- Patch and copy only changed bytes if a safe ABI is introduced. + +Device-buffer duplication note: + +- Resident `Runtime` and `KernelArgs` slots are small control buffers; keeping + one slot is preferred. +- Two slots are only needed when N+1 device-side staging overlaps N. +- Host-side runtime arena layout/image cache does not need another device + buffer. +- Overlapping N+1 runtime-arena device upload while N is still running needs a + second device runtime-arena slot. The runtime arena can be large, so avoid + this by default; prefer host-side cache and upload only after the current run + no longer needs the device arena. + +Potential conditions: + +- runtime arena build/upload is consistently above 0.5 to 1 ms; +- `Runtime` / `KernelArgs` allocation or copy is visible in `host_pre_launch`; +- the workload repeats the same ring sizes and task-window shape. + +Acceptance: + +- warmed runs show measurable `host_pre_launch` reduction; +- AICPU attach/wire/reset remains correct for every run; +- cached images invalidate cleanly when ABI, ring sizes, or platform config + changes; +- error recovery and finalize free resident slots exactly once. + +### 5. Topology, Block-Dim, And Launch Metadata + +Current work: + +- resolve block dim; +- on a5, probe AICPU topology and compute allowed CPUs; +- fill launch metadata into `Runtime`; +- derive launch counts from requested AICPU thread count and platform limits. + +Optimization directions: + +- Cache topology probe results per `(device_id, process, platform)`. +- Cache a5 topology and allowed CPU lists by requested AICPU thread count. +- Avoid repeated block-dim queries when config pins a known value. +- Keep launch metadata templates for repeated decode shapes. + +Potential conditions: + +- topology probe or block-dim query is above 0.2 to 0.5 ms per run; +- the same worker runs many tokens on one device without changing launch shape; +- launch metadata is identical across steady decode steps. + +Acceptance: + +- launch metadata matches the uncached path bit-for-bit for the same device; +- cache clears on device reset/finalize; +- wrong-arch or wrong-SKU failures remain fail-fast; +- `host_pre_launch` drops without changing `device_wall` or `Total`. + +### 6. Device Wall Init/Teardown + +Status: deferred / conditional. Do not treat this as a first-line overlap +optimization. + +Decision note: + +- Same-run AICPU init, `Total`, and teardown cannot be overlapped by double + buffering; their order is a real dependency. +- Cross-run N+1 init during N's `Total` is theoretically possible, but only + after scheduler/control state, AICore register state, and per-run args slots + are isolated. +- The safer direction is to reduce or persist init/teardown work, not to make + simple double buffering carry it. + +Current work inside `device_wall`: + +- AICPU executor init; +- scheduler init; +- AICore handshake and assignment; +- runtime attach/wire/reset; +- scheduler shutdown; +- AICore register deinit; +- executor deinit and runtime destroy. + +Optimization directions: + +- Measure AICPU init and teardown separately from `Total`. +- Preserve static per-core assignment and metadata across runs. +- Keep AICore worker state alive when queues can be reset safely. +- Convert repeated launch/shutdown into a long-lived executor that receives + work through a device mailbox or ring. + +Gate conditions: + +- measured signal remains after logging/profiling noise is controlled: + `device_wall - Total` is consistently above 3 ms, or init/teardown alone is + above 1 ms for steady decode; +- the workload runs many homogeneous decode steps on the same device; +- split timing proves this is still a bottleneck after lower-risk host and + binding optimizations are applied. + +Acceptance: + +- `device_wall - Total` drops by at least 30 percent, or by at least 1 ms; +- `Total` does not increase by more than 1 percent; +- no increase in AICore op-timeout, AICPU exception, or stream-sync failures; +- emergency shutdown still leaves the device recoverable. + +### 7. Cross-Run Pipeline / Double Buffering + +Status: deferred / conditional. Use this only after split timing proves there +is enough stageable work. + +Decision note: + +- Host-only pipeline is feasible but expected to be low value unless host-only + plan/build time independently measures above 1 ms or 5 percent of TPOT. +- Full-bind pipeline is not planned by default because it requires run N+1's + tensor buffers to coexist with run N's live buffers. +- Output double buffering is also not planned by default when it keeps run N + outputs live while run N+1 writes a second output slot. +- Double buffering should not be used as the current plan for AICPU + init/teardown overlap; that becomes device-control pipelining with much + stronger isolation requirements. + +Optimization directions: + +- If only host-only staging is expensive, build a host bind-plan pipeline. +- Do not add A/B tensor slots for full-bind overlap unless the extra device + buffer set is explicitly accepted. +- Do not add output double buffers for large tensors unless the consumer + lifetime proves they are necessary and HBM is explicitly budgeted. +- Prefer preallocated small control/arg slots only for accepted device-side + staging. +- Gate any device-side N+1 staging by HBM headroom. +- Keep launch ordering explicit: N+1 may not consume staged state until its + staging is complete and N's required resources are released. + +Gate conditions: + +- host-only plan/build is measured above 1 ms or 5 percent of TPOT; +- run N has enough `Total` time to cover N+1 staging; +- no second per-run tensor or output buffer set is required, unless explicitly + accepted; +- active-run inflation is less than 20 percent of hidden staging time. + +Acceptance: + +- TPOT improves by at least 5 percent, or at least 1 ms absolute; +- p99 does not regress by more than 5 percent; +- peak HBM stays inside the documented budget; +- run N `Total` does not materially increase from background copy/allocation; +- error and timeout paths release or quarantine both slots correctly. + +### 8. Logging And Diagnostic Overhead + +Current work: + +- hot paths contain many `LOG_INFO_V0` and `LOG_INFO_V9` records; +- tensor binding and validate paths print per-tensor `LOG_INFO_V0` records; +- external reports indicate some `printf` paths can be multi-ms. + +Optimization directions: + +- Ensure performance runs use a quiet log level. +- Move per-tensor and per-thread logs behind higher verbosity. +- Reduce `LOG_INFO_V0` in hot paths, especially per-tensor bind and validate + prints. +- Use counters or compact summaries instead of repeated formatted strings. +- Verify device-log timing collection does not perturb the target workload. + +Potential conditions: + +- `host_wall`, `device_wall`, or `Total` changes materially when log level + changes; +- device log contains repeated per-tensor/per-task records in steady decode. + +Acceptance: + +- quiet mode keeps required error diagnostics; +- performance variance drops; +- removing logs does not change correctness or device synchronization. + +## Decision Rules + +Recommended default order after measurements: + +1. Add split timers. +2. Optimize tensor residency, child-memory use, buffer pooling, and H2D/D2H + avoidance. +3. Cache runtime arena metadata/images and keep small arg slots resident. +4. Cache topology, block-dim, and launch metadata if they measure visible. +5. If `device_wall - Total` remains a measured bottleneck, consider persistent + executor or reduced teardown. +6. Do not pursue full-bind or output-double-buffer pipelines unless the second + device tensor-buffer set is explicitly accepted. +7. Consider host-only cross-run staging only if it independently measures above + 1 ms or 5 percent of TPOT. + +Reject or postpone an optimization when: + +- the measured component is below 1 ms and below 5 percent of TPOT; +- the change increases run N `Total` by more than 1 percent; +- the change improves average TPOT but worsens p99 by more than 5 percent; +- HBM headroom cannot cover any explicitly accepted device-side staging; +- the design requires a second tensor/output buffer set without explicit + acceptance; +- ownership on error/timeout paths is unclear. + +## Measurement Plan + +For any optimization above, collect before/after: + +- `host_wall`; +- `device_wall`; +- device-log `Total`, `Orch`, and `Sched`; +- split `host_pre_launch`; +- split `host_post_sync`; +- `device_wall - Total`; +- HBM live allocation high-water mark; +- background-staging active-run inflation; +- p50/p90/p99 over steady decode tokens. + +Recommended thresholds: + +- Treat changes within +/-2 percent as noise unless repeated across devices. +- Require at least 5 percent TPOT improvement, or at least 1 ms absolute + improvement for steady decode, before accepting a complexity-increasing + optimization. +- Reject optimizations that improve average latency but worsen p99 by more than + 5 percent without an explicit serving-policy reason. + +## References + +- `docs/dfx/l2-timing.md` +- `docs/callable-prepare-overlap-plan.md` +- `src/common/platform/onboard/host/c_api_shared.cpp` +- `src/common/platform/onboard/host/device_runner_base.cpp` +- `src/a5/platform/onboard/host/device_runner.cpp` +- `src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp` +- `src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp`