Context
PR #989 landed the batched publish optimization (one `wmb()` per claim) from `poursoul/simpler:a2a3-sched-opt`. That commit originally bundled six dispatch-path optimizations; only one shipped in #989 (plus the cross-task gating + cross-thread stagger fix added during merge). The other five remain unmerged on `poursoul/a2a3-sched-opt`:

- Batched publish: shipped in #989 (98849e81)
- #2 SPMD arg sharing
- #3 `PTO2DispatchPayload` re-layout (control block first, `args[]` at tail)
- #4 One-time context-pointer init
- #5 AICPU prefetch (`__builtin_prefetch` payload + slab)
- #6 `fast_sys_cnt` (inline `mrs cntvct_el0`)

Original branch: `poursoul/simpler:a2a3-sched-opt`, commit `40944786`.
This issue tracks the remaining five so they don't get lost. Each item below has its gain source, scope, and the rebase constraint from upstream changes since the original branch base (`fecf7c97`).

#2 SPMD arg sharing
What: AICPU writes the task's tensor pointers + scalars once into a per-task `dispatch_args_template` (in GM) at submit time. Each `build_payload` then only writes `task_args` (GM pointer) + `arg_count` into the per-core dispatch payload. AICore burst-copies `args[0..arg_count)` from the shared template into its per-core `args[]` before invoking the kernel.
Gain source:
- AICPU `build_payload` inner loops (`for tensor_count` + `for scalar_count`) eliminated.
- For SPMD with `block_num=B` and `N+M=K` total args: AICPU stores drop from `B × K × 8` bytes to `K × 8` bytes (one-time) + `B × 2` words (per-dispatch).
- Eliminates per-block RFO misses on the `args[]` cache lines (each block previously dirtied them).

Cost: AICore pays one extra GM burst-load per dispatch to fetch the shared template into local `args[]`.
Estimated gain: Hundreds of ns to ~µs per SPMD task with many args. Zero gain on `block_num=1` tasks.
Most relevant workload: decode_layer-style, i.e. many tensor pointers per kernel (KV-cache slots, projections) over many SPMD blocks.
Upstream rebase risk: HIGH.
- #1056 raised `CORE_MAX_TENSOR_ARGS` 16 → 32 and lowered scalars 32 → 16. Args buffer layout sized differently.
- #1093 unified TaskArgs on strided Tensor, dropped `ContinuousTensor`. The `PTO2TaskPayload` type to which poursoul added `dispatch_args_template` has been refactored.
- Requires reconciling with the new payload type system and verifying the template lifetime fits the new task submit flow.

Couples with: #3, #4 (shared layout dependency)
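The arg-sharing flow in #2 can be pictured with a small host-side sketch. It is not the code from the original branch: `DispatchArgsTemplate`, `CorePayload`, `kMaxArgs`, and the helper functions are hypothetical stand-ins, and plain `std::memcpy` stands in for the AICore burst copy. Only the field names `task_args`, `arg_count`, and `args[]` come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kMaxArgs = 32;  // hypothetical cap, for the sketch only

// Hypothetical stand-in for the per-task template, written once at submit time.
struct DispatchArgsTemplate {
    std::uint64_t args[kMaxArgs];
};

// Hypothetical stand-in for the per-core dispatch payload.
struct CorePayload {
    const std::uint64_t* task_args;  // pointer to the shared template
    std::uint32_t arg_count;         // number of valid template entries
    std::uint64_t args[kMaxArgs];    // per-core copy that the kernel reads
};

// AICPU side, per dispatch: two stores, independent of the arg count K.
inline void build_payload_shared(CorePayload& payload,
                                 const DispatchArgsTemplate& tmpl,
                                 std::uint32_t arg_count) {
    payload.task_args = tmpl.args;
    payload.arg_count = arg_count;
}

// AICore side: copy args[0..arg_count) from the template before the kernel runs.
inline void load_args_from_template(CorePayload& payload) {
    std::memcpy(payload.args, payload.task_args,
                payload.arg_count * sizeof(std::uint64_t));
}

// Number of AICPU stores for B blocks and K args when every block gets
// its own copy of all K args.
constexpr std::size_t stores_per_block_copy(std::size_t blocks, std::size_t args) {
    return blocks * args;
}

// With a shared template: K one-time stores plus two control stores per block.
constexpr std::size_t stores_shared_template(std::size_t blocks, std::size_t args) {
    return args + 2 * blocks;
}
```

Under these assumptions, a task with B = 24 blocks and K = 20 args goes from 480 AICPU stores to 68, while a `block_num=1` task goes from 20 to 22, which is consistent with the "zero gain" note for single-block tasks.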

#3 `PTO2DispatchPayload` re-layout
What: Move the per-dispatch-written control fields (`function_bin_addr`, `task_args`, `arg_count`, `local_context`, `global_context`) to the leading cache lines of the struct. Move `args[]` (256B array) to the tail with `alignas(64)`. Struct total size stays 576B (hardware ABI constraint).
Gain source:
- AICPU per-dispatch writes hit only the first 1-2 cache lines (control block); previously they were spread across 2-3 lines.
- AICore's first dcci-then-read on the dispatch path lands on `function_bin_addr` (offset 0) immediately, so it can issue the kernel jump earlier.
- Fewer dirty cache lines per dispatch → less NoC writeback bandwidth for AICore coherence.

Cost: Layout only; no extra ops.
Estimated gain: 1-2 fewer cache line dirties per dispatch (~ns direct savings); shaves ~10-100 ns off AICore's wake-to-start critical path depending on NoC latency.
Upstream rebase risk: HIGH.
- #1056 resized args/scalar caps inside `pto2_dispatch_payload.h`. Layout has already been touched.
- #1079 (speculative early-dispatch) reads the payload from a different code path; layout assumptions need to stay consistent.

Couples with: #2 (needs the new `task_args` / `arg_count` fields).

#4 One-time context-pointer init
What: At handshake init, write `args[PAYLOAD_LOCAL_CONTEXT_INDEX]` and `args[PAYLOAD_GLOBAL_CONTEXT_INDEX]` once per `(core_id, buf_idx)` pair. Remove these two stores from `build_payload`.
Gain source:
- These slots hold pointers to the `local_context` / `global_context` fields inside the same dispatch_payload buffer, so the values are fixed across all dispatches for a given (core, buffer).
- Per-dispatch AICPU saves 2 stores (~2 ns direct).
- The `args[]` cache line containing these indexes stays clean across dispatches → AICore's dcci doesn't pull stale state.

Estimated gain: ~2 ns + one cache line kept clean per dispatch. Trivial alone, meaningful as part of the layout cleanup.
Upstream rebase risk: MEDIUM. Depends on #3's layout decisions and the new `args[]` sizing from #1056.
Couples with: #3 (requires stable `args[]` layout across dispatches).
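Items #3 and #4 interact, so a combined sketch may help: control fields first, `args[]` at a 64-byte-aligned tail, and the two context-pointer slots written once at init rather than per dispatch. Everything here is illustrative. The context types, their sizes, the `reserved` filler, and the two index values are placeholders chosen only so the struct totals 576 bytes with a 256-byte `args[]`; the real `PTO2DispatchPayload` layout is not reproduced in this issue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Placeholder context types; the real sizes are not given in the issue.
struct LocalContextSketch  { std::uint32_t words[8]; };
struct GlobalContextSketch { std::uint32_t words[8]; };

constexpr std::size_t kArgSlots = 32;            // 32 x 8 B = 256 B
constexpr std::size_t kLocalContextIndex  = 30;  // hypothetical slot index
constexpr std::size_t kGlobalContextIndex = 31;  // hypothetical slot index

struct alignas(64) DispatchPayloadSketch {
    // Control block: everything written per dispatch sits at the front.
    std::uint64_t function_bin_addr;            // offset 0
    std::uint64_t task_args;
    std::uint32_t arg_count;
    LocalContextSketch  local_context;
    GlobalContextSketch global_context;
    unsigned char reserved[236];                // filler, placeholder only
    // Tail: args[] starts on its own cache line.
    alignas(64) std::uint64_t args[kArgSlots];
};

static_assert(offsetof(DispatchPayloadSketch, function_bin_addr) == 0,
              "first read lands on function_bin_addr");
static_assert(offsetof(DispatchPayloadSketch, global_context) +
                  sizeof(GlobalContextSketch) <= 128,
              "control block fits in the first two cache lines");
static_assert(offsetof(DispatchPayloadSketch, args) % 64 == 0,
              "args[] is cache-line aligned");
static_assert(sizeof(DispatchPayloadSketch) == 576, "total size unchanged");

// One-time init per (core, buffer): the context addresses never change
// for a given buffer, so these two stores do not need to repeat.
inline void init_context_slots(DispatchPayloadSketch& p) {
    p.args[kLocalContextIndex]  = reinterpret_cast<std::uintptr_t>(&p.local_context);
    p.args[kGlobalContextIndex] = reinterpret_cast<std::uintptr_t>(&p.global_context);
}

// Per-dispatch path: touches only the leading control fields.
inline void build_payload_sketch(DispatchPayloadSketch& p, std::uint64_t fn,
                                 std::uint64_t task_args, std::uint32_t arg_count) {
    p.function_bin_addr = fn;
    p.task_args = task_args;
    p.arg_count = arg_count;
}
```

The `static_assert`s are the useful part of the sketch: encoding the layout constraints at compile time would let a later change to the args caps (as in #1056) fail the build instead of silently shifting fields.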
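
#5 AICPU prefetch
What: At the top of `prepare_subtask_to_core` (before any store into the payload or slab), issue three software prefetches (`__builtin_prefetch` on the payload and slab). The `(1, 3)` arguments mean prefetch-for-write, highest temporal locality.
Gain source:
- Cross-core scheduler rotation means each per-core buffer is cold-cache when its turn comes round again.
- Without prefetch: the first store hits a Read-For-Ownership miss → ~100 ns blocking while the line is fetched and ownership acquired.
- With prefetch: async RFO is issued ahead of the actual writes; by the time the `build_payload` stores fire, the line is in L1 with exclusive ownership.

Estimated gain: ~80-100 ns per dispatch when buffer is cold (common in steady-state cross-core rotation). Marginal cost (~3 ns) when buffer happens to be hot.
Upstream rebase risk: LOW. 3-line standalone addition; no struct or layout dependencies.
Independent: can be its own PR.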
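A minimal sketch of the prefetch pattern in #5, assuming a GCC or Clang toolchain (which provide `__builtin_prefetch`). The struct types, the function name `prepare_subtask_sketch`, and the choice of which three addresses to prefetch are assumptions for illustration; the original three lines are not reproduced in this issue.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the per-core payload and slab buffers.
struct PayloadSketch { alignas(64) std::uint64_t words[72]; };  // 576 bytes
struct SlabSketch    { alignas(64) std::uint64_t words[8];  };  // 64 bytes

// rw = 1 asks for the line in a writable state; locality = 3 asks to keep
// it in all cache levels. Both arguments must be compile-time constants.
inline void prefetch_for_write(const void* addr) {
    __builtin_prefetch(addr, 1, 3);
}

inline void prepare_subtask_sketch(PayloadSketch& payload, SlabSketch& slab,
                                   std::uint64_t function_bin_addr) {
    // Issue the prefetches before any store, so the ownership requests
    // can overlap with the work that precedes the stores.
    prefetch_for_write(&payload.words[0]);  // first payload cache line
    prefetch_for_write(&payload.words[8]);  // second payload cache line
    prefetch_for_write(&slab.words[0]);     // slab line

    // ... other per-dispatch bookkeeping would run here ...

    payload.words[0] = function_bin_addr;   // stores that ideally now hit in cache
    slab.words[0] = 1;
}
```

A prefetch is only a hint and has no architectural side effect, so the sketch is functionally identical with or without it; whether it pays off depends on the buffer actually being cold, which is what the micro-benchmark needs to confirm.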

#6 `fast_sys_cnt`
What: Replace `get_sys_cnt_aicpu()` (out-of-line function in `device_time.cpp`) with a `static inline __attribute__((always_inline))` wrapper in the same TU as the dispatch hot path.
The actual register read is identical (`cntvct_el0`, the chip-wide system timer at 50 MHz, which is the same register both AICPU and AICore-side `get_sys_cnt()` resolve to). The win is purely function-call elimination.
Why not `pmccntr_el0` (per-core CPU cycle counter, ~30× higher resolution): per the original commit's note, EL0 access on the a2a3 AICPU traps-and-emulates (→ 507018 op timeout); enable attempts get masked by the platform. Plus per-core PMU counters aren't synchronized across cores, so they wouldn't be usable for cross-core / cross-tier dispatch ↔ AICore-start measurement anyway.
Gain source:
- Each profiling timestamp sample drops from `bl get_sys_cnt_aicpu` + frame + `mrs` + `ret` (~5 ns) to just `mrs` (~0 ns call-overhead).
- AICPU instruction cache footprint on the dispatch hot path is slightly smaller.

Estimated gain: ~5 ns × (samples per dispatch) × (dispatches). Only with `--enable-l2-swimlane >= 2`; zero in release / no-profiling builds.
Upstream rebase risk: LOW. Standalone, 6 lines, drops next to existing `get_sys_cnt_aicpu` call sites.
Independent: can be its own PR.
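For reference, an inline counter read of the kind #6 describes might look like the sketch below. The name `fast_sys_cnt` and the `mrs cntvct_el0` read come from this issue; the exact body in the original commit is not shown here, so treat this as an assumption. The non-AArch64 branch is only a fallback so the sketch builds off-target, and `ticks_to_ns` is a hypothetical helper based on the stated 50 MHz rate.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

static inline __attribute__((always_inline)) std::uint64_t fast_sys_cnt() {
#if defined(__aarch64__)
    // Read the virtual counter directly; no call, no stack frame.
    std::uint64_t cnt;
    asm volatile("mrs %0, cntvct_el0" : "=r"(cnt));
    return cnt;
#else
    // Off-target fallback for host builds only (not part of the proposal).
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// At 50 MHz, one tick is 20 ns.
constexpr std::uint64_t kSysCntHz = 50'000'000;

constexpr std::uint64_t ticks_to_ns(std::uint64_t ticks) {
    return ticks * (1'000'000'000 / kSysCntHz);
}
```

A typical use would bracket a region with two reads and convert the difference with `ticks_to_ns`. Because one tick is 20 ns, the inlining saves call overhead on the sampling path but does not make the timestamps themselves any finer.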
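
Suggested ordering
- #6 `fast_sys_cnt`
- #5 AICPU prefetch: standalone, 3 lines, but needs a micro-benchmark on a real workload (decode-style) to confirm the cache-cold assumption holds today (after #1079 speculative-early-dispatch changed the prepare path).
- #2 SPMD arg sharing: largest gain on the table (per-block AICPU args store elimination scales linearly with `block_num`), but requires non-trivial rebase work against #1056 (args sizing), #1093 (TaskArgs type unification), #1079 (speculative early-dispatch). Consider redesigning rather than mechanically cherry-picking; the original `dispatch_args_template` design may conflict with the new speculative path's payload assumptions.

Measurement plan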
Each item should be benched on:
- `spmd_serial_chain_mix` (PR #988, "Add: a2a3 SPMD MIX serial-chain busy-wait swimlane example"): clean SPMD with controlled kernel duration, validates `block_num`-scaled paths
- `spmd_sync_start_stress` × 10: regression gate (this is the test that caught the cross-task batching bug in PR #989)

Level-2 L2 swimlane is the primary observability channel for per-dispatch timing impacts. `tools/benchmark_rounds.sh` covers the wall-time view.

References
- #989 (98849e81).
- #988: `spmd_serial_chain_mix` example useful for measuring SPMD-fanout-bound paths.
- `poursoul/simpler:a2a3-sched-opt`, commit `40944786`.
- `docs/investigations/2026-06-aicore-cold-start-warmup.md`: related cold-start finding (NoC routing is the dominant cause of first-task head overhead, not I-cache; affects any prefetch evaluation under item #5).
- `docs/investigations/2026-06-cross-task-batched-publish.md`: the cross-task hoist that was attempted, gated on sync_start in the merged version of #989.
- #1056 (args caps), #1079 (speculative early-dispatch), #1093 (TaskArgs unification).