[Performance] Remaining dispatch-path optimizations from poursoul/a2a3-sched-opt

## Context

PR #989 landed the **batched publish** optimization (one `wmb()` per claim) from `poursoul/simpler:a2a3-sched-opt`. That commit originally bundled **six** dispatch-path optimizations; only one shipped in #989 (plus the cross-task gating + cross-thread stagger fix added during merge). The other **five remain unmerged** on `poursoul/a2a3-sched-opt`:

| # | Item | Status |
|---|---|---|
| 1 | Batched publish (one wmb per claim) | ✅ shipped in #989 (commit `98849e81`) |
| 2 | SPMD arg sharing (AICore burst-copy from template) | ❌ not shipped |
| 3 | `PTO2DispatchPayload` re-layout (control block first, args[] at tail) | ❌ not shipped |
| 4 | One-time context-pointer init (handshake-time) | ❌ not shipped |
| 5 | AICPU prefetch (`__builtin_prefetch` payload + slab) | ❌ not shipped |
| 6 | `fast_sys_cnt` (inline `mrs cntvct_el0`) | ❌ not shipped |

Original branch: `poursoul/simpler:a2a3-sched-opt`, commit `40944786`.

This issue tracks the remaining five so they don't get lost. Each item below has its gain source, scope, and the rebase constraint from upstream changes since the original branch base (`fecf7c97`).

---

## #2 SPMD arg sharing

**What**: AICPU writes the task's tensor pointers + scalars once into a per-task `dispatch_args_template` (in GM) at submit time. Each `build_payload` then only writes `task_args` (GM pointer) + `arg_count` into the per-core dispatch payload. AICore burst-copies `args[0..arg_count)` from the shared template into its per-core `args[]` before invoking the kernel.

**Gain source**:
- AICPU `build_payload` inner loops (`for tensor_count` + `for scalar_count`) eliminated.
- For SPMD with `block_num=B` and `N+M=K` total args: AICPU stores drop from `B × K × 8B` to `K × 8B` (one-time) + `B × 2 words` (per-dispatch).
- Eliminates per-block RFO misses on the args[] cache lines (each block previously dirtied them).

**Cost**: AICore pays one extra GM burst-load per dispatch to fetch the shared template into local args[].

**Estimated gain**: Hundreds of ns to ~µs per SPMD task with many args. Zero gain on `block_num=1` tasks.
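
The store counts above can be illustrated with a small host-side sketch. It is not the branch's code: `DispatchSlotSketch`, `write_template`, `build_payload_shared`, `consume_args`, and `kMaxArgs` are hypothetical names, and the "AICore burst-copy" is stood in for by a plain `memcpy`. It only shows the shape of the scheme (write the template once, store a pointer plus a count per block, copy on the consumer side) and the resulting `B × K` versus `K + 2B` producer store counts.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kMaxArgs = 32;  // hypothetical cap, for illustration only

// Hypothetical per-core dispatch slot; the real payload type differs.
struct DispatchSlotSketch {
    const uint64_t* task_args;  // points at the shared per-task template
    uint32_t arg_count;
    uint64_t args[kMaxArgs];    // filled on the consumer side
};

// Submit time: the producer writes the K args once into the shared template.
inline void write_template(uint64_t* tmpl, const std::vector<uint64_t>& args) {
    std::memcpy(tmpl, args.data(), args.size() * sizeof(uint64_t));
}

// Per dispatch: the producer stores only the pointer and the count.
inline void build_payload_shared(DispatchSlotSketch& slot, const uint64_t* tmpl,
                                 uint32_t count) {
    slot.task_args = tmpl;
    slot.arg_count = count;
}

// Consumer side: copy args[0..arg_count) from the template into local args[].
inline void consume_args(DispatchSlotSketch& slot) {
    std::memcpy(slot.args, slot.task_args, slot.arg_count * sizeof(uint64_t));
}

// Producer store counts for B blocks and K args, before and after sharing.
constexpr std::size_t stores_per_block_copy(std::size_t B, std::size_t K) { return B * K; }
constexpr std::size_t stores_shared_template(std::size_t B, std::size_t K) { return K + 2 * B; }
```

Under these assumptions, a task with `B = 24` blocks and `K = 20` args goes from 480 producer stores to 68; with `B = 1` the shared scheme does slightly more work, which matches the "zero gain on `block_num=1`" note.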

**Most relevant workload**: decode_layer-style — many tensor pointers per kernel (KV-cache slots, projections) over many SPMD blocks.

**Upstream rebase risk**: HIGH.
- `#1056` raised `CORE_MAX_TENSOR_ARGS` 16 → 32 and lowered scalars 32 → 16, so the args buffer is now sized differently.
- `#1093` unified TaskArgs on strided Tensor and dropped `ContinuousTensor`. The `PTO2TaskPayload` type, to which the original branch added `dispatch_args_template`, has since been refactored.
- Requires reconciling with the new payload type system and verifying the template lifetime fits the new task submit flow.

**Couples with**: #3, #4 (shared layout dependency)

---

## #3 `PTO2DispatchPayload` re-layout

**What**: Move the per-dispatch-written control fields (`function_bin_addr`, `task_args`, `arg_count`, `local_context`, `global_context`) to the **leading** cache lines of the struct. Move `args[]` (256B array) to the **tail** with `alignas(64)`. Struct total size stays 576B (hardware ABI constraint).
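
A layout of this shape can be pinned down with `static_assert`s so that later changes to the args caps fail at compile time rather than silently shifting offsets. The sketch below is illustrative only: the field types, the context struct sizes, and the `other` filler are assumptions; only the 576B total, the 256B `args[]` at the tail, and the "control block first" ordering come from the description above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical context types; the real sizes are not stated in this issue.
struct LocalContextSketch  { uint64_t words[4]; };
struct GlobalContextSketch { uint64_t words[4]; };

struct alignas(64) DispatchPayloadSketch {
    // Control block: everything written per dispatch, in the leading lines.
    uint64_t function_bin_addr;   // offset 0, first thing the consumer reads
    uint64_t task_args;           // GM pointer to the shared template (#2)
    uint32_t arg_count;
    uint32_t reserved0;           // assumed padding field
    LocalContextSketch local_context;
    GlobalContextSketch global_context;

    // Filler standing in for fields this issue does not describe.
    alignas(64) unsigned char other[192];

    // args[] moved to the tail on its own cache-line boundary (256 B).
    alignas(64) uint64_t args[32];
};

static_assert(offsetof(DispatchPayloadSketch, function_bin_addr) == 0,
              "function_bin_addr must lead the struct");
static_assert(offsetof(DispatchPayloadSketch, global_context) +
                  sizeof(GlobalContextSketch) <= 128,
              "control block must fit in the first two cache lines");
static_assert(offsetof(DispatchPayloadSketch, args) % 64 == 0,
              "args[] must start on a cache-line boundary");
static_assert(offsetof(DispatchPayloadSketch, args) + sizeof(uint64_t[32]) ==
                  sizeof(DispatchPayloadSketch),
              "args[] must be the tail of the struct");
static_assert(sizeof(DispatchPayloadSketch) == 576, "total size must stay 576 B");
```

Keeping the asserts next to the real struct would also give `#1079`'s reader path a single place where layout assumptions are stated.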

**Gain source**:
- AICPU per-dispatch writes hit only the first 1-2 cache lines (control block); previously spread across 2-3 lines.
- AICore's first dcci-then-read on the dispatch path lands on `function_bin_addr` (offset 0) immediately, so it can issue the kernel jump earlier.
- Fewer dirty cache lines per dispatch → less NoC writeback bandwidth for AICore coherence.

**Cost**: Just layout — no extra ops.

**Estimated gain**: 1-2 fewer cache line dirties per dispatch (~ns direct savings); shaves ~10-100 ns off AICore's wake-to-start critical path depending on NoC latency.

**Upstream rebase risk**: HIGH.
- `#1056` resized args/scalar caps inside `pto2_dispatch_payload.h`. Layout has already been touched.
- `#1079` (speculative early-dispatch) reads the payload from a different code path; layout assumptions need to stay consistent.

**Couples with**: #2 (needs the new `task_args` / `arg_count` fields).

---

## #4 One-time context-pointer init

**What**: At handshake init, write `args[PAYLOAD_LOCAL_CONTEXT_INDEX]` and `args[PAYLOAD_GLOBAL_CONTEXT_INDEX]` once per `(core_id, buf_idx)` pair. Remove these two stores from `build_payload`.

**Gain source**:
- These slots hold pointers to `local_context` / `global_context` fields **inside the same dispatch_payload buffer** — the values are fixed across all dispatches for a given (core, buffer).
- Per-dispatch AICPU saves 2 stores (~2 ns direct).
- The args[] cache line containing these two slots stays clean across dispatches → AICore's dcci doesn't pull stale state.

**Estimated gain**: ~2 ns + one cache line kept clean per dispatch. Trivial alone, meaningful as part of the layout cleanup.
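
As a rough sketch of the idea, the handshake-time initialization could look like the following. The struct, the index values, and `init_context_slots` are hypothetical stand-ins (the real `PAYLOAD_*_INDEX` values and context field types live in the runtime headers); the 72-core, dual-buffer shape is taken from the #5 discussion.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kNumCores = 72;  // per the #5 discussion
constexpr std::size_t kNumBufs = 2;    // dual-buffer
// Hypothetical index values, for illustration only.
constexpr std::size_t PAYLOAD_LOCAL_CONTEXT_INDEX = 30;
constexpr std::size_t PAYLOAD_GLOBAL_CONTEXT_INDEX = 31;

// Hypothetical payload shape: context fields live inside the same buffer.
struct PayloadSketch {
    uint64_t local_context;
    uint64_t global_context;
    uint64_t args[32];
};

// Run once at handshake: the addresses are fixed for each (core_id, buf_idx),
// so the per-dispatch path never needs to rewrite these two slots.
inline void init_context_slots(PayloadSketch (&payloads)[kNumCores][kNumBufs]) {
    for (std::size_t core = 0; core < kNumCores; ++core) {
        for (std::size_t buf = 0; buf < kNumBufs; ++buf) {
            PayloadSketch& p = payloads[core][buf];
            p.args[PAYLOAD_LOCAL_CONTEXT_INDEX] =
                reinterpret_cast<std::uintptr_t>(&p.local_context);
            p.args[PAYLOAD_GLOBAL_CONTEXT_INDEX] =
                reinterpret_cast<std::uintptr_t>(&p.global_context);
        }
    }
}
```

This only stays correct if nothing on the dispatch path overwrites those two `args[]` slots afterwards, which is why the item depends on #3's layout being settled first.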

**Upstream rebase risk**: MEDIUM. Depends on #3's layout decisions and the new args[] sizing from `#1056`.

**Couples with**: #3 (requires stable args[] layout across dispatches).

---

## #5 AICPU prefetch

**What**: At the top of `prepare_subtask_to_core` (before any store into the payload or slab), issue three software prefetches:

```cpp
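// Write-intent prefetch of the first two 64B cache lines of the payload...
__builtin_prefetch(&payload, 1, 3);
__builtin_prefetch(reinterpret_cast<const char*>(&payload) + 64, 1, 3);
// ...and the first cache line of the deferred slab.
__builtin_prefetch(deferred_slab, 1, 3);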
```

`(1, 3)` = prefetch-for-write, highest temporal locality.

**Gain source**:
- 72 cores × dual-buffer = 144 payload + 144 slab buffers ≈ 36+KB, exceeds typical AICPU L1.
- Cross-core scheduler rotation means each per-core buffer is cold-cache when its turn comes round again.
- Without prefetch: first store hits **Read-For-Ownership miss → ~100 ns blocking** while line is fetched and ownership acquired.
- With prefetch: async RFO issued ahead of the actual writes; by the time `build_payload` stores fire, line is in L1 with exclusive ownership.

**Estimated gain**: ~80-100 ns per dispatch when buffer is cold (common in steady-state cross-core rotation). Marginal cost (~3 ns) when buffer happens to be hot.
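
If the number of lines to prefetch needs tuning during benchmarking, the three calls could be wrapped in a small helper. This is a sketch under assumptions: `prefetch_for_write` and `kCacheLine` are hypothetical names, the 64B line size is inferred from the `+ 64` offset above, and `__builtin_prefetch` is a GCC/Clang builtin.

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed line size

// Issue write-intent, high-locality prefetches for the first `lines` cache
// lines starting at `base`. Returns the number of prefetches issued.
inline std::size_t prefetch_for_write(const void* base, std::size_t lines) {
    const char* p = static_cast<const char*>(base);
    for (std::size_t i = 0; i < lines; ++i) {
        // rw = 1 (prefetch for write), locality = 3 (keep in all cache levels)
        __builtin_prefetch(p + i * kCacheLine, 1, 3);
    }
    return lines;
}
```

With this helper, the sequence above would read `prefetch_for_write(&payload, 2); prefetch_for_write(deferred_slab, 1);`, and the line counts become the knob to vary in the micro-benchmark.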

**Upstream rebase risk**: LOW. 3-line standalone addition; no struct or layout dependencies.

**Independent**: can be its own PR.

---

## #6 `fast_sys_cnt`

**What**: Replace `get_sys_cnt_aicpu()` (out-of-line function in `device_time.cpp`) with a `static inline __attribute__((always_inline))` wrapper in the same TU as the dispatch hot path:

```cpp
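#include <cstdint>

namespace {
// Reads the AArch64 virtual counter directly. Forced inline so the hot path
// pays for a single `mrs` instead of a call/return pair.
static inline __attribute__((always_inline)) uint64_t fast_sys_cnt() {
    uint64_t t;
    asm volatile("mrs %0, cntvct_el0" : "=r"(t));
    return t;
}
}  // namespace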
```

The actual register read is identical (`cntvct_el0`, the chip-wide system timer at 50 MHz — same register both AICPU and AICore-side `get_sys_cnt()` resolve to). The win is purely function-call elimination.
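
For interpreting the samples, a 50 MHz counter advances once every 20 ns. A minimal conversion sketch, assuming the 50 MHz rate stated above holds on the target (`kSysCntHz` and `ticks_to_ns` are hypothetical names, not existing helpers):

```cpp
#include <cstdint>

constexpr uint64_t kSysCntHz = 50000000ULL;            // 50 MHz, as stated above
constexpr uint64_t kNsPerTick = 1000000000ULL / kSysCntHz;  // 20 ns per tick

// Convert a tick delta from the system counter to nanoseconds.
// Intended for deltas; very large absolute tick values could overflow.
constexpr uint64_t ticks_to_ns(uint64_t ticks) {
    return ticks * kNsPerTick;
}

static_assert(kNsPerTick == 20, "50 MHz implies 20 ns resolution");
```

One consequence is that a saving of a few nanoseconds per sample is below the timer's own resolution, so the benefit would show up as reduced total profiling overhead rather than in any single timestamp delta.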

**Why not `pmccntr_el0` (per-core CPU cycle counter, ~30× higher resolution)**: per the original commit's note, EL0 access on the a2a3 AICPU is trapped and emulated (leading to a 507018 op timeout), and attempts to enable it are masked by the platform. Per-core PMU counters also aren't synchronized across cores, so they wouldn't be usable for cross-core / cross-tier dispatch ↔ AICore-start measurement anyway.

**Gain source**:
- Each profiling timestamp sample drops from `bl get_sys_cnt_aicpu` + frame + `mrs` + `ret` (~5 ns) to just `mrs` (~0 ns call-overhead).
- AICPU instruction cache footprint on the dispatch hot path slightly smaller.

**Estimated gain**: ~5 ns × (samples per dispatch) × (dispatches). Only with `--enable-l2-swimlane >= 2`; zero in release / no-profiling builds.

**Upstream rebase risk**: LOW. Standalone, 6 lines, drops next to existing `get_sys_cnt_aicpu` call sites.

**Independent**: can be its own PR.

---

## Suggested ordering

| Priority | Item(s) | Why |
|---|---|---|
| 1 | **#6 `fast_sys_cnt`** | Smallest scope, zero risk, drops next to existing call sites. Standalone PR, ~6 lines. |
| 2 | **#5 AICPU prefetch** | Standalone, 3 lines, but needs micro-benchmark on a real workload (decode-style) to confirm cache-cold assumption holds today (after `#1079` speculative-early-dispatch changed the prepare path). |
| 3 | **#2 + #3 + #4 bundle** | Largest gain on the table (per-block AICPU args store elimination scales linearly with `block_num`), but requires non-trivial rebase work against `#1056` (args sizing), `#1093` (TaskArgs type unification), `#1079` (speculative early-dispatch). Consider redesigning rather than mechanically cherry-picking — the original `dispatch_args_template` design may conflict with the new speculative path's payload assumptions. |

## Measurement plan

Each item should be benched on:
1. **`spmd_serial_chain_mix`** (PR #988) — clean SPMD with controlled kernel duration, validates `block_num`-scaled paths
2. **A real decode workload** — qwen3 decode_layer (per the PR #989 measurement table) to capture cross-task batch interaction and cache effects
3. **`spmd_sync_start_stress` × 10** — regression gate (this is the test that caught the cross-task batching bug in PR #989)

Level-2 L2 swimlane is the primary observability channel for per-dispatch timing impacts. `tools/benchmark_rounds.sh` covers the wall-time view.

## References

- PR #989: where #1 shipped (commit `98849e81`).
- PR #988: `spmd_serial_chain_mix` example useful for measuring SPMD-fanout-bound paths.
- Branch with all six bundled: `poursoul/simpler:a2a3-sched-opt`, commit `40944786`.
- `docs/investigations/2026-06-aicore-cold-start-warmup.md`: related cold-start finding (NoC routing, not I-cache, is the dominant cause of first-task head overhead; this affects any prefetch evaluation under #5).
- `docs/investigations/2026-06-cross-task-batched-publish.md`: the cross-task hoist that was attempted, gated on sync_start in the merged version of #989.
- Upstream changes interacting with these: `#1056` (args caps), `#1079` (speculative early-dispatch), `#1093` (TaskArgs unification).

