[Performance] Remaining dispatch-path optimizations from poursoul/a2a3-sched-opt #1103

@hw-native-sys-bot

Context

PR #989 landed the batched publish optimization (one wmb() per claim) from poursoul/simpler:a2a3-sched-opt. That commit originally bundled six dispatch-path optimizations; only one shipped in #989 (plus the cross-task gating + cross-thread stagger fix added during merge). The other five remain unmerged on poursoul/a2a3-sched-opt:

| # | Item | Status |
|---|------|--------|
| 1 | Batched publish (one wmb per claim) | ✅ shipped in #989 (commit 98849e81) |
| 2 | SPMD arg sharing (AICore burst-copy from template) | ❌ not shipped |
| 3 | PTO2DispatchPayload re-layout (control block first, args[] at tail) | ❌ not shipped |
| 4 | One-time context-pointer init (handshake-time) | ❌ not shipped |
| 5 | AICPU prefetch (__builtin_prefetch payload + slab) | ❌ not shipped |
| 6 | fast_sys_cnt (inline mrs cntvct_el0) | ❌ not shipped |

Original branch: poursoul/simpler:a2a3-sched-opt, commit 40944786.

This issue tracks the remaining five so they don't get lost. Each item below has its gain source, scope, and the rebase constraint from upstream changes since the original branch base (fecf7c97).


#2 SPMD arg sharing

What: AICPU writes the task's tensor pointers + scalars once into a per-task dispatch_args_template (in GM) at submit time. Each build_payload then only writes task_args (GM pointer) + arg_count into the per-core dispatch payload. AICore burst-copies args[0..arg_count) from the shared template into its per-core args[] before invoking the kernel.

Gain source:

  • AICPU build_payload inner loops (for tensor_count + for scalar_count) eliminated.
  • For SPMD with block_num=B and N+M=K total args: AICPU stores drop from B × K × 8B to K × 8B (one-time) + B × 2 words (per-dispatch).
  • Eliminates per-block RFO misses on the args[] cache lines (each block previously dirtied them).

Cost: AICore pays one extra GM burst-load per dispatch to fetch the shared template into local args[].
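
The split described above can be sketched on the host as follows. This is a minimal illustration, not the branch's code: `DispatchArgsTemplate`, `PerCorePayload`, `fill_template`, `load_args`, and `kMaxArgs` are hypothetical names, and a plain `memcpy` stands in for the AICore burst copy from GM.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// All names here are hypothetical stand-ins, not the real PTO2 types.
constexpr std::size_t kMaxArgs = 32;

struct DispatchArgsTemplate {    // shared per task, written once at submit time
    uint64_t args[kMaxArgs];
    uint32_t arg_count;
};

struct PerCorePayload {          // one per (core, buffer)
    const uint64_t* task_args;   // points at the shared template
    uint32_t arg_count;
    uint64_t args[kMaxArgs];     // core-local copy, filled by the consumer
};

// Submit time (AICPU side): pack tensor pointers, then scalars, exactly once.
inline void fill_template(DispatchArgsTemplate& t,
                          const uint64_t* tensors, uint32_t n,
                          const uint64_t* scalars, uint32_t m) {
    assert(n + m <= kMaxArgs);
    for (uint32_t i = 0; i < n; ++i) t.args[i] = tensors[i];
    for (uint32_t i = 0; i < m; ++i) t.args[n + i] = scalars[i];
    t.arg_count = n + m;
}

// Dispatch time (AICPU side): two stores per block, no per-argument loop.
inline void build_payload(PerCorePayload& p, const DispatchArgsTemplate& t) {
    p.task_args = t.args;
    p.arg_count = t.arg_count;
}

// Kernel entry (AICore side): one bulk copy from the shared template.
inline void load_args(PerCorePayload& p) {
    std::memcpy(p.args, p.task_args, p.arg_count * sizeof(uint64_t));
}

// AICPU store counts for B blocks and K = N + M arguments.
constexpr uint64_t stores_before(uint64_t B, uint64_t K) { return B * K; }
constexpr uint64_t stores_after(uint64_t B, uint64_t K) { return K + 2 * B; }
static_assert(stores_before(24, 20) == 480, "per-block copies of every arg");
static_assert(stores_after(24, 20) == 68, "one template plus two words per block");
```

With the illustrative values B = 24 and K = 20, the AICPU store count drops from 480 to 68. For B = 1 the count is K + 2 rather than K, which is consistent with the "zero gain on block_num=1" note.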

Estimated gain: Hundreds of ns to ~µs per SPMD task with many args. Zero gain on block_num=1 tasks.

Most relevant workload: decode_layer-style — many tensor pointers per kernel (KV-cache slots, projections) over many SPMD blocks.

Upstream rebase risk: HIGH.

  • #1056 raised CORE_MAX_TENSOR_ARGS 16 → 32 and lowered the scalar cap 32 → 16, so the args buffer is now sized differently than on the original branch.
  • #1093 unified TaskArgs on strided Tensor and dropped ContinuousTensor. The PTO2TaskPayload type, to which poursoul added dispatch_args_template, has since been refactored.
  • Requires reconciling with the new payload type system and verifying that the template lifetime fits the new task submit flow.

Couples with: #3, #4 (shared layout dependency)


#3 PTO2DispatchPayload re-layout

What: Move the per-dispatch-written control fields (function_bin_addr, task_args, arg_count, local_context, global_context) to the leading cache lines of the struct. Move args[] (256B array) to the tail with alignas(64). Struct total size stays 576B (hardware ABI constraint).
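
A layout of this shape can be pinned down with compile-time checks. The sketch below is an assumption-laden illustration: the context sizes, the `reserved` filler, and the struct name are invented so that the totals match the 576B and 256B figures quoted above; the real field sizes in pto2_dispatch_payload.h may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical re-layout. Context sizes and the reserved filler are guesses
// chosen only so the struct totals 576 B with a 256 B args[] tail.
struct alignas(64) DispatchPayloadSketch {
    // Cache line 0: written by AICPU on every dispatch.
    uint64_t function_bin_addr;   // offset 0, first field AICore reads
    uint64_t task_args;           // GM address of the shared args template
    uint32_t arg_count;
    uint32_t pad0;
    uint8_t  local_context[40];   // assumed size
    // Cache line 1.
    uint8_t  global_context[64];  // assumed size
    // Fields not modeled in this sketch.
    uint8_t  reserved[192];
    // Tail: args[] starts on its own cache line.
    alignas(64) uint64_t args[32];
};

static_assert(offsetof(DispatchPayloadSketch, function_bin_addr) == 0,
              "kernel address must be the first field");
static_assert(offsetof(DispatchPayloadSketch, global_context) +
                  sizeof(DispatchPayloadSketch::global_context) <= 128,
              "control block must fit in the first two cache lines");
static_assert(offsetof(DispatchPayloadSketch, args) % 64 == 0,
              "args[] must be cache-line aligned");
static_assert(sizeof(DispatchPayloadSketch) == 576,
              "total size is fixed by the hardware ABI");

// Runtime check that a real buffer honours the 64 B alignment.
inline void check_placement(const DispatchPayloadSketch* p) {
    assert(reinterpret_cast<std::uintptr_t>(p) % 64 == 0);
}
```

Keeping checks like these next to the struct definition would make a future resize (such as the one in #1056) fail at compile time instead of silently shifting the control block.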

Gain source:

  • AICPU per-dispatch writes hit only the first 1-2 cache lines (the control block); previously they were spread across 2-3 lines.
  • AICore's first dcci-then-read on the dispatch path lands on function_bin_addr (offset 0) immediately, so it can issue the kernel jump earlier.
  • Fewer dirty cache lines per dispatch → less NoC writeback bandwidth spent on AICore coherence.

Cost: Just layout — no extra ops.

Estimated gain: 1-2 fewer cache line dirties per dispatch (~ns direct savings); shaves ~10-100 ns off AICore's wake-to-start critical path depending on NoC latency.

Upstream rebase risk: HIGH.

  • #1056 resized args/scalar caps inside pto2_dispatch_payload.h. Layout has already been touched.
  • #1079 (speculative early-dispatch) reads the payload from a different code path; layout assumptions need to stay consistent.

Couples with: #2 (needs the new task_args / arg_count fields).


#4 One-time context-pointer init

What: At handshake init, write args[PAYLOAD_LOCAL_CONTEXT_INDEX] and args[PAYLOAD_GLOBAL_CONTEXT_INDEX] once per (core_id, buf_idx) pair. Remove these two stores from build_payload.

Gain source:

  • These slots hold pointers to local_context / global_context fields inside the same dispatch_payload buffer — the values are fixed across all dispatches for a given (core, buffer).
  • Per-dispatch AICPU saves 2 stores (~2 ns direct).
  • The args[] cache line containing these indexes stays clean across dispatches → AICore's dcci doesn't pull stale state.
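
A minimal host-side sketch of the idea, assuming a simplified payload: `Payload`, `g_payloads`, `init_context_slots`, and the index constants are hypothetical stand-ins for the real buffers and for PAYLOAD_LOCAL_CONTEXT_INDEX / PAYLOAD_GLOBAL_CONTEXT_INDEX.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for PAYLOAD_LOCAL_CONTEXT_INDEX / PAYLOAD_GLOBAL_CONTEXT_INDEX.
constexpr int kLocalCtxIdx = 30;
constexpr int kGlobalCtxIdx = 31;
constexpr int kCores = 4;   // tiny demo sizes, not the real core count
constexpr int kBufs = 2;

struct Payload {            // simplified, hypothetical payload
    uint64_t local_context;
    uint64_t global_context;
    uint64_t args[32];
};

Payload g_payloads[kCores][kBufs];

// Handshake time: runs once per (core_id, buf_idx). The pointers target
// fields inside the same buffer, so they never change afterwards.
inline void init_context_slots(int core_id, int buf_idx) {
    assert(core_id >= 0 && core_id < kCores && buf_idx >= 0 && buf_idx < kBufs);
    Payload& p = g_payloads[core_id][buf_idx];
    p.args[kLocalCtxIdx] = reinterpret_cast<std::uintptr_t>(&p.local_context);
    p.args[kGlobalCtxIdx] = reinterpret_cast<std::uintptr_t>(&p.global_context);
}

// Dispatch time: writes task arguments only and leaves the context slots alone.
inline void build_payload(Payload& p, const uint64_t* task_args, int n) {
    assert(n >= 0 && n <= kLocalCtxIdx);
    for (int i = 0; i < n; ++i) p.args[i] = task_args[i];
}
```

The property to verify after a rebase is that no dispatch-time writer touches the two context slots, since the AICore side would then read whatever the handshake left there.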

Estimated gain: ~2 ns + one cache line kept clean per dispatch. Trivial alone, meaningful as part of the layout cleanup.

Upstream rebase risk: MEDIUM. Depends on #3's layout decisions and the new args[] sizing from #1056.

Couples with: #3 (requires stable args[] layout across dispatches).


#5 AICPU prefetch

What: At the top of prepare_subtask_to_core (before any store into the payload or slab), issue three software prefetches:

__builtin_prefetch(&payload, 1, 3);
__builtin_prefetch(reinterpret_cast<const char*>(&payload) + 64, 1, 3);
__builtin_prefetch(deferred_slab, 1, 3);

(1, 3) = prefetch-for-write, highest temporal locality.
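
For context, the placement inside the function might look like the sketch below. The `Payload` and `Slab` types, their sizes, and the function signature are assumptions for illustration; only the three `__builtin_prefetch` calls come from the text above. `__builtin_prefetch` is a GCC/Clang builtin and acts as a hint with no effect on program results.

```cpp
#include <cassert>
#include <cstdint>

struct alignas(64) Payload { uint64_t words[72]; };  // 576 B stand-in
struct alignas(64) Slab    { uint64_t words[8];  };  // size is a guess

// Hypothetical shape of the hot-path function; the real signature differs.
inline void prepare_subtask_to_core_sketch(Payload& payload,
                                           Slab* deferred_slab,
                                           uint64_t fn_addr) {
    assert(deferred_slab != nullptr);

    // Issue the write-intent prefetches before any store to these buffers.
    __builtin_prefetch(&payload, 1, 3);
    __builtin_prefetch(reinterpret_cast<const char*>(&payload) + 64, 1, 3);
    __builtin_prefetch(deferred_slab, 1, 3);

    // Any bookkeeping that does not touch payload or slab would go here,
    // giving the prefetches time to complete.

    // First stores into the (ideally now-owned) cache lines.
    payload.words[0] = fn_addr;   // first cache line
    payload.words[8] = 0;         // second cache line (byte offset 64)
    deferred_slab->words[0] = fn_addr;
}
```

How much the prefetch helps depends on how much independent work sits between the prefetches and the first store, which is one more reason to confirm the gain with a micro-benchmark.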

Gain source:

  • 72 cores × dual-buffer = 144 payload + 144 slab buffers ≈ 36+KB, exceeds typical AICPU L1.
  • Cross-core scheduler rotation means each per-core buffer is cold-cache when its turn comes round again.
  • Without prefetch: first store hits Read-For-Ownership miss → ~100 ns blocking while line is fetched and ownership acquired.
  • With prefetch: async RFO issued ahead of the actual writes; by the time build_payload stores fire, line is in L1 with exclusive ownership.

Estimated gain: ~80-100 ns per dispatch when buffer is cold (common in steady-state cross-core rotation). Marginal cost (~3 ns) when buffer happens to be hot.

Upstream rebase risk: LOW. 3-line standalone addition; no struct or layout dependencies.

Independent: can be its own PR.


#6 fast_sys_cnt

What: Replace get_sys_cnt_aicpu() (out-of-line function in device_time.cpp) with a static inline __attribute__((always_inline)) wrapper in the same TU as the dispatch hot path:

namespace {
static inline __attribute__((always_inline)) uint64_t fast_sys_cnt() {
    uint64_t t;
    asm volatile("mrs %0, cntvct_el0" : "=r"(t));
    return t;
}
}

The actual register read is identical (cntvct_el0, the chip-wide system timer at 50 MHz — same register both AICPU and AICore-side get_sys_cnt() resolve to). The win is purely function-call elimination.
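
To put the 50 MHz figure in perspective, here is a small sketch of tick-to-nanosecond conversion. The helper names (`fast_sys_cnt_portable`, `ticks_to_ns`, `elapsed_ns`) are hypothetical, and the non-AArch64 branch is a host-only fallback so the sketch compiles off-target; it is not the device timer.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

constexpr uint64_t kSysCntHz = 50000000ULL;                 // 50 MHz, per the text
constexpr uint64_t kNsPerTick = 1000000000ULL / kSysCntHz;  // 20 ns per tick
static_assert(kNsPerTick == 20, "one tick of a 50 MHz counter is 20 ns");

inline uint64_t fast_sys_cnt_portable() {
#if defined(__aarch64__)
    uint64_t t;
    asm volatile("mrs %0, cntvct_el0" : "=r"(t));
    return t;
#else
    // Host fallback (assumption): emulate a 50 MHz counter from steady_clock.
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now().time_since_epoch())
                  .count();
    return static_cast<uint64_t>(ns) / kNsPerTick;
#endif
}

constexpr uint64_t ticks_to_ns(uint64_t ticks) { return ticks * kNsPerTick; }

inline uint64_t elapsed_ns(uint64_t start, uint64_t end) {
    assert(end >= start);
    return ticks_to_ns(end - start);
}
```

Since one tick is 20 ns, a call-overhead saving of a few nanoseconds per sample is below the counter's own resolution. It would be expected to show up as reduced perturbation of the measured path in aggregate rather than as a visible change in any single timestamp delta.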

Why not pmccntr_el0 (per-core CPU cycle counter, ~30× higher resolution): per the original commit's note, EL0 access on the a2a3 AICPU traps-and-emulates (→ 507018 op timeout), and attempts to enable it get masked by the platform. In addition, per-core PMU counters aren't synchronized across cores, so they wouldn't be usable for cross-core / cross-tier dispatch ↔ AICore-start measurement anyway.

Gain source:

  • Each profiling timestamp sample drops from bl get_sys_cnt_aicpu + frame + mrs + ret (~5 ns) to just mrs (~0 ns call-overhead).
  • AICPU instruction cache footprint on the dispatch hot path slightly smaller.

Estimated gain: ~5 ns × (samples per dispatch) × (dispatches). Only with --enable-l2-swimlane >= 2; zero in release / no-profiling builds.

Upstream rebase risk: LOW. Standalone, 6 lines, drops next to existing get_sys_cnt_aicpu call sites.

Independent: can be its own PR.


Suggested ordering

| Priority | Item(s) | Why |
|----------|---------|-----|
| 1 | #6 fast_sys_cnt | Smallest scope, zero risk, drops next to existing call sites. Standalone PR, ~6 lines. |
| 2 | #5 AICPU prefetch | Standalone, 3 lines, but needs a micro-benchmark on a real workload (decode-style) to confirm the cache-cold assumption still holds today (after #1079 speculative-early-dispatch changed the prepare path). |
| 3 | #2 + #3 + #4 bundle | Largest gain on the table (per-block AICPU args store elimination scales linearly with block_num), but requires non-trivial rebase work against #1056 (args sizing), #1093 (TaskArgs type unification), and #1079 (speculative early-dispatch). Consider redesigning rather than mechanically cherry-picking: the original dispatch_args_template design may conflict with the new speculative path's payload assumptions. |

Measurement plan

Each item should be benched on:

  1. spmd_serial_chain_mix (PR #988, "Add: a2a3 SPMD MIX serial-chain busy-wait swimlane example"): clean SPMD with controlled kernel duration; validates block_num-scaled paths
  2. A real decode workload: qwen3 decode_layer (per the PR #989 measurement table), to capture cross-task batch interaction and cache effects
  3. spmd_sync_start_stress × 10: regression gate (this is the test that caught the cross-task batching bug in PR #989)

Level-2 L2 swimlane is the primary observability channel for per-dispatch timing impacts. tools/benchmark_rounds.sh covers the wall-time view.
