Add: fully_distributed_within_core runtime — SPMD on-core orchestration#1142
Add: fully_distributed_within_core runtime — SPMD on-core orchestration#1142hengliao1972 wants to merge 17 commits into
Conversation
Introduce a new runtime where orchestration, scheduling, and execution all
run in SPMD fashion on the AICore workers themselves, with AICPU reduced to a
thin init/handoff/teardown stub (no scheduler). Each core builds, owns, and
executes its own tasks.
Design pillars (see docs/fully_distributed_within_core.md):
- Task ownership via a claim race over two global cursors (cube_cursor for
AIC-anchored, vector_cursor for AIV-only).
- owner = builder = executor, with core-type matching.
- Full per-core duplicate TensorMap for dependency discovery.
- Per-core private task ring + one global completion-flag ring driving a
run-ahead, pull-based execution loop.
- block.won anchor/follower co-ownership for multi-core (MIX / 2V) tasks.
- Bounded GM output-heap ring reclaimed by a global completion frontier.
Includes: - docs/fully_distributed_within_core.md (authoritative design, in Chinese).
- src/{a2a3,a5}/runtime/fully_distributed_within_core/ runtime skeleton +
dist_engine (a2a3 validated on a2a3sim via benchmark_bgemm + mix_coown).
- examples/a2a3/fully_distributed_within_core/ migrated examples.
- tests/st/a2a3/fully_distributed_within_core/ migrated ST tests.
Co-authored-by: Cursor <cursoragent@cursor.com>
|
Important Review skippedToo many files! This PR contains 297 files, which is 147 over the limit of 150. To get a review, narrow the scope: Upgrade to a paid plan to raise the limit. ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (297)
You can disable this status message by setting the Use the checkbox below for a quick retry:
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces the fully_distributed_within_core runtime mode, enabling orchestration, scheduling, and execution to run entirely on the AI cores in an SPMD fashion, bypassing the AICPU. It includes comprehensive documentation, core executor updates, and several new test cases and examples (such as paged_attention and qwen3_14b_decode). The reviewer provided valuable feedback focusing on robustness, security, and performance. Key recommendations include: explicitly resetting persistent global/static variables (like initialized_) in aicpu_executor.cpp to prevent state leakage across launches; gating expensive MMIO register reads (like DATA_MAIN_BASE) in tight spin loops to reduce CPU overhead; clamping task overhead metrics to a minimum of 0 to prevent negative values caused by clock skew; using RAII cleanup structs for temporary .so files to guarantee deletion on all exit paths; validating signed integer count/size parameters to prevent signed-to-unsigned conversion overflows in orchestration kernels; and avoiding angle brackets in Markdown bash code blocks to prevent input redirection parsing.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| int32_t AicpuExecutor::init(Runtime *runtime) { | ||
| bool expected = false; | ||
| if (!initialized_.compare_exchange_strong(expected, true, std::memory_order_acq_rel, std::memory_order_acquire)) { | ||
| return 0; | ||
| } |
There was a problem hiding this comment.
On platforms where static or global variables persist across launches on the same loaded binary (such as AICPU), always explicitly reset or initialize these variables at the start of each execution/launch to prevent state leakage and undefined behavior between runs. If a previous launch aborted or crashed, deinit might not be called, leaving initialized_ as true. On the next launch, init will return 0 immediately without re-initializing. Consider detecting a new launch (e.g., by comparing the runtime pointer) to safely reset the executor state.
References
- On platforms where static or global variables persist across launches on the same loaded binary (such as AICPU), always explicitly reset or initialize these variables at the start of each execution/launch to prevent state leakage and undefined behavior between runs.
| while (true) { | ||
| uint32_t reg_val = static_cast<uint32_t>(read_reg(RegId::DATA_MAIN_BASE)); | ||
| if (reg_val == AICORE_EXIT_SIGNAL) { | ||
| write_reg(RegId::COND, AICORE_EXITED_VALUE); | ||
| break; | ||
| } | ||
| SPIN_WAIT_HINT(); | ||
| } |
There was a problem hiding this comment.
Reading system registers (such as DATA_MAIN_BASE) via MMIO reads inside a tight spin loop can be extremely expensive and cause high CPU overhead and latency impact. Consider gating this check to run periodically (e.g., every 1024 spins) to minimize overhead.
uint32_t spin_count = 0;
while (true) {
if (spin_count++ % 1024 == 0) {
uint32_t reg_val = static_cast<uint32_t>(read_reg(RegId::DATA_MAIN_BASE));
if (reg_val == AICORE_EXIT_SIGNAL) {
write_reg(RegId::COND, AICORE_EXITED_VALUE);
break;
}
}
SPIN_WAIT_HINT();
}References
- Avoid reading system counters (which can be expensive MMIO reads) or performing complex structural checks on every iteration of a tight spin loop. Instead, gate these checks to run periodically (e.g., every 1024 spins) to minimize CPU overhead and latency impact.
| LOG_INFO_V9( | ||
| "Thread %d: task+heap_alloc: %.3fus (%.1f%%) work=%.3fus wait=%.3fus atomics=%" PRIu64 "", | ||
| thread_idx, cycles_to_us(p.alloc_cycle), p.alloc_cycle * 100.0 / total, | ||
| cycles_to_us(p.alloc_cycle - p.alloc_wait_cycle), cycles_to_us(p.alloc_wait_cycle), | ||
| static_cast<uint64_t>(p.alloc_atomic_count) | ||
| ); |
There was a problem hiding this comment.
When calculating task overhead metrics, clamp the values to a minimum of 0 to prevent negative overhead accumulation. This avoids distortion of metrics caused by dual-issue concurrent execution or clock synchronization skew between different processors.
LOG_INFO_V9(
"Thread %d: task+heap_alloc: %.3fus (%.1f%%) work=%.3fus wait=%.3fus atomics=%" PRIu64 "",
thread_idx, cycles_to_us(p.alloc_cycle), p.alloc_cycle * 100.0 / total,
cycles_to_us(p.alloc_cycle > p.alloc_wait_cycle ? p.alloc_cycle - p.alloc_wait_cycle : 0), cycles_to_us(p.alloc_wait_cycle),
static_cast<uint64_t>(p.alloc_atomic_count)
);| ); | ||
| LOG_INFO_V9( | ||
| "Thread %d: fanin+ready : %.3fus (%.1f%%) work=%.3fus wait=%.3fus", thread_idx, | ||
| cycles_to_us(p.fanin_cycle), p.fanin_cycle * 100.0 / total, | ||
| cycles_to_us(p.fanin_cycle - p.fanin_wait_cycle), cycles_to_us(p.fanin_wait_cycle) |
There was a problem hiding this comment.
When calculating task overhead metrics, clamp the values to a minimum of 0 to prevent negative overhead accumulation. This avoids distortion of metrics caused by dual-issue concurrent execution or clock synchronization skew between different processors.
LOG_INFO_V9(
"Thread %d: fanin+ready : %.3fus (%.1f%%) work=%.3fus wait=%.3fus", thread_idx,
cycles_to_us(p.fanin_cycle), p.fanin_cycle * 100.0 / total,
cycles_to_us(p.fanin_cycle > p.fanin_wait_cycle ? p.fanin_cycle - p.fanin_wait_cycle : 0), cycles_to_us(p.fanin_wait_cycle)
);| ); | ||
| unlink(so_path); | ||
| continue; | ||
| } | ||
| file_created = true; | ||
| LOG_INFO_V0("Thread %d: Created SO file at %s (%zu bytes)", thread_idx, so_path, so_size); | ||
| } | ||
|
|
||
| if (!file_created) { | ||
| LOG_ERROR("Thread %d: Failed to create SO file in any candidate path", thread_idx); | ||
| // Unblock scheduler threads before returning so they don't spin forever. | ||
| runtime_init_ready_.store(true, std::memory_order_release); | ||
| return -1; | ||
| } | ||
|
|
||
| dlerror(); | ||
| void *handle = dlopen(so_path, RTLD_LAZY | RTLD_LOCAL); | ||
| const char *dlopen_err = dlerror(); | ||
| if (handle == nullptr) { | ||
| LOG_ERROR("Thread %d: dlopen failed: %s", thread_idx, dlopen_err ? dlopen_err : "unknown"); | ||
| unlink(so_path); | ||
| // Unblock scheduler threads before returning so they don't spin forever. | ||
| runtime_init_ready_.store(true, std::memory_order_release); | ||
| return -1; | ||
| } | ||
| LOG_INFO_V0("Thread %d: dlopen succeeded, handle=%p", thread_idx, handle); |
There was a problem hiding this comment.
Using an RAII cleanup struct is much safer and cleaner than manually unlinking the temporary .so file on every error path. This guarantees deletion on all exit paths, including any unexpected early returns or future modifications.
if (!file_created) {
LOG_ERROR("Thread %d: Failed to create SO file in any candidate path", thread_idx);
// Unblock scheduler threads before returning so they don't spin forever.
runtime_init_ready_.store(true, std::memory_order_release);
return -1;
}
struct SoCleanup {
const char* path;
SoCleanup(const char* p) : path(p) {}
~SoCleanup() { if (path && path[0] != '\0') { unlink(path); } }
void cancel() { path = nullptr; }
} cleanup(so_path);
dlerror();
void *handle = dlopen(so_path, RTLD_LAZY | RTLD_LOCAL);
const char *dlopen_err = dlerror();
if (handle == nullptr) {
LOG_ERROR("Thread %d: dlopen failed: %s", thread_idx, dlopen_err ? dlopen_err : "unknown");
// Unblock scheduler threads before returning so they don't spin forever.
runtime_init_ready_.store(true, std::memory_order_release);
return -1;
}
LOG_INFO_V0("Thread %d: dlopen succeeded, handle=%p", thread_idx, handle);
// Unlink the on-disk SO immediately: dlopen has already mmap'd
// the image, so the kernel keeps the inode alive until the
// matching dlclose / process exit. This prevents stale
// libdevice_orch_<pid>_<cid>.so files from accumulating in
// /tmp when child processes exit via os._exit(0), which skips
// ~AicpuExecutor (worker.py: _sub/_chip/_child loops).
cleanup.cancel();
unlink(so_path);References
- Use secure temporary file creation methods like
mkstempinstead of predictable paths, and manage the file's lifetime using an RAII cleanup struct to guarantee deletion on all exit paths.
| # Extract orchestrator profiling (Thread 3) | ||
| grep "Thread 3:" <logfile> | ||
|
|
||
| # Extract scheduler profiling (Threads 0/1/2) | ||
| grep -E "Thread [012]:" <logfile> |
There was a problem hiding this comment.
In Markdown bash code blocks, avoid using angle brackets for placeholders (e.g., <logfile>) to prevent them from being parsed as input redirection. Use safe, quoted placeholders like "path/to/logfile" instead.
| # Extract orchestrator profiling (Thread 3) | |
| grep "Thread 3:" <logfile> | |
| # Extract scheduler profiling (Threads 0/1/2) | |
| grep -E "Thread [012]:" <logfile> | |
| # Extract orchestrator profiling (Thread 3) | |
| grep "Thread 3:" "path/to/logfile" | |
| # Extract scheduler profiling (Threads 0/1/2) | |
| grep -E "Thread [012]:" "path/to/logfile" |
References
- In Markdown bash code blocks, avoid using angle brackets for placeholders (e.g.,
<path>) to prevent them from being parsed as input redirection. Use safe, quoted placeholders like "path/to/file" instead.
| uint32_t cl_idx[1] = {static_cast<uint32_t>(b_idx)}; | ||
| uint64_t cur_seq = static_cast<uint64_t>(get_tensor_data<int32_t>(context_lens, 1, cl_idx)); | ||
| uint64_t bn_this_batch = (cur_seq + block_size - 1) / block_size; |
There was a problem hiding this comment.
When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations, always validate that the count is non-negative to prevent signed-to-unsigned conversion overflows.
uint32_t cl_idx[1] = {static_cast<uint32_t>(b_idx)};
int32_t seq_len = get_tensor_data<int32_t>(context_lens, 1, cl_idx);
always_assert(seq_len >= 0);
uint64_t cur_seq = static_cast<uint64_t>(seq_len);
uint64_t bn_this_batch = (cur_seq + block_size - 1) / block_size;References
- When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations (such as memcpy or array indexing), always validate that the count is non-negative (e.g., count >= 0) to prevent bypassing bounds checks and causing out-of-bounds writes or signed-to-unsigned conversion overflows.
| PTO2_SCOPE(PTO2ScopeMode::MANUAL) { | ||
| CYCLE_COUNT_LAP(prof_scope); | ||
| uint64_t cur_offset = b_idx * q_head_num + q_idx * q_tile; |
There was a problem hiding this comment.
When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations, always validate that the count is non-negative to prevent signed-to-unsigned conversion overflows.
uint32_t cl_idx[1] = {static_cast<uint32_t>(b_idx)};
int32_t seq_len = get_tensor_data<int32_t>(context_lens, 1, cl_idx);
always_assert(seq_len >= 0);
uint64_t cur_seq = static_cast<uint64_t>(seq_len);
uint64_t bn_this_batch = (cur_seq + block_size - 1) / block_size;References
- When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations (such as memcpy or array indexing), always validate that the count is non-negative (e.g., count >= 0) to prevent bypassing bounds checks and causing out-of-bounds writes or signed-to-unsigned conversion overflows.
| void *query_ptr = orch_args.tensor(0).ref().data_as<void>(); | ||
| void *kc_ptr = orch_args.tensor(1).ref().data_as<void>(); | ||
| void *vc_ptr = orch_args.tensor(2).ref().data_as<void>(); |
There was a problem hiding this comment.
When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations, always validate that the count is non-negative to prevent signed-to-unsigned conversion overflows.
uint32_t cl_idx[1] = {static_cast<uint32_t>(b_idx)};
int32_t seq_len = get_tensor_data<int32_t>(context_lens, 1, cl_idx);
always_assert(seq_len >= 0);
uint64_t cur_seq = static_cast<uint64_t>(seq_len);
uint64_t bn_this_batch = (cur_seq + block_size - 1) / block_size;References
- When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations (such as memcpy or array indexing), always validate that the count is non-negative (e.g., count >= 0) to prevent bypassing bounds checks and causing out-of-bounds writes or signed-to-unsigned conversion overflows.
| size_t idx_ctx_len = b; | ||
| int32_t ctx_len = static_cast<int32_t *>(orch_args.tensor(7).ref().data_as<void>())[idx_ctx_len]; | ||
| int64_t pos = (static_cast<int64_t>(ctx_len) - 1); | ||
| int64_t ctx_blocks = ((static_cast<int64_t>(ctx_len) + 255) / 256); |
There was a problem hiding this comment.
When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations, always validate that the count is non-negative to prevent signed-to-unsigned conversion overflows.
size_t idx_ctx_len = b;
int32_t ctx_len = static_cast<int32_t *>(orch_args.tensor(7).ref().data_as<void>())[idx_ctx_len];
always_assert(ctx_len > 0);
int64_t pos = (static_cast<int64_t>(ctx_len) - 1);
int64_t ctx_blocks = ((static_cast<int64_t>(ctx_len) + 255) / 256);References
- When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations (such as memcpy or array indexing), always validate that the count is non-negative (e.g., count >= 0) to prevent bypassing bounds checks and causing out-of-bounds writes or signed-to-unsigned conversion overflows.
…ibuted_within_core Engine (dist_engine): rework the core loop to be execute-first — drain ready (sub)tasks and block.won deposits before claiming at most one new task per iteration, and shrink the private ring (kPrivateSlots 8->4) so a single fast core can no longer greedily claim a long run of consecutive long tasks and starve its peers. The out-of-order window now comes from the core-count dimension, not a deep per-core ring (docs §6, §6.1). Swimlane: env-gated (PTO_DIST_SWIMLANE) Chrome-trace exporter built into the engine (the centralized L2 collector is incompatible with the AICPU stub). Each executed (sub)task is one duration event laid out by physical block (pid) and lane AIC/AIV0/AIV1 (tid). aicpu_executor dumps it once all workers finish. New simpler_setup/tools/dist_swimlane_render.py renders a static Gantt PNG; opt-in FullCore24 bgemm case (block_dim=24) fills all 24 AIC + 48 AIV lanes for full-core visualization. Function names: leaf CoreCallable carries no name, so scene_test injects the incore func_id->name map from the CALLABLE spec into the captured swimlane JSON post-run, so Perfetto and the renderer show GEMM/ADD instead of f0/f1. Co-authored-by: Cursor <cursoragent@cursor.com>
Add the captured per-core swimlane PNG (and its source Chrome-trace JSON) for benchmark_bgemm FullCore24 (24 AIC + 48 AIV, 240 GEMM + 240 ADD) and reference it from §6.1, illustrating how the execute-first claim race spreads GEMM evenly across all 24 AIC lanes (load balance). Includes the PTO_DIST_SWIMLANE capture + dist_swimlane_render reproduction steps. Co-authored-by: Cursor <cursoragent@cursor.com>
… cost Engine: add PTO_DIST_SKIP_EXEC gate. When set, execute_slot skips the incore kernel call (every (sub)task is treated as 0-cost and completes instantly) but keeps all ownership/completion/frontier bookkeeping, so the distributed loop terminates identically. This isolates the pure cost of on-core SPMD orchestration + claim race + scheduling from kernel work. New example examples/a2a3/fully_distributed_within_core/runtime_overhead_test: reuses the benchmark_bgemm workload (orchestration + GEMM/ADD incores, referenced not duplicated) and sweeps block_dim (1/2/12/24 blocks = 3..72 cores on a2a3sim), reporting the on-device orchestrator wall per config so the overhead-vs-core-count scaling is visible. Skip-exec runs need golden off; the class is also a valid SceneTestCase (manual cases) for the normal kernels-on path. Finding: device orchestration wall grows ~linearly with core count because every core replays the full orchestration (SPMD) and contends on the shared cursors — the overhead the run-ahead execution is meant to amortize once real kernels run. Co-authored-by: Cursor <cursoragent@cursor.com>
…overhead Document the PTO_DIST_SKIP_EXEC overhead experiment and the runtime_overhead_test benchmark: sweeping block_dim 1/2/12/24 (3..72 cores) on a2a3sim shows the on-device orchestration wall grows ~linearly with core count (≈11× at 24 blocks) because every core replays the full orchestration (SPMD) and contends on the shared cursors — the fixed cost that real-kernel parallelism amortizes. Co-authored-by: Cursor <cursoragent@cursor.com>
- sim aicore kernel (a2a3 + a5): release pthread TLS keys on .so unload via a destructor and abort on pthread_key_create failure, preventing key-pool exhaustion (1024 cap) that caused a SIGSEGV at high block_dim after many dlopen/dlclose cycles in device_runner. - runtime_overhead_test: add --bind CPU-affinity control (none|node:<list>| cpu:<list>); table header now reports the bound physical-core count so pinned-vs-unpinned and over/under-subscription effects are self-describing. - docs/fully_distributed_within_core.md: add §6.3 on CPU pinning, the leak fix, and the best exclusive-binding sweep (120 cores) including us/task.
…tead of O(N^2) The per-core duplicate producer map used a flat array with linear-scan lookup/insert and never retired entries. For single-buffer/many-tile workloads (bgemm) `count` grew with the whole run, making submit O(N^2): isolated 1-block sweeps showed device wall scaling super-linearly (tasks x8 -> device x11, us/task 3.23 -> 4.53). Rewrite DistTensorMap to mirror tensormap_and_ringbuffer's proven PTO2TensorMap: hash buckets by buffer base + doubly-linked bucket chains + per-task entry chains + free list + lazy invalidation + per-task cleanup_retired. Retirement uses the deterministic N-H threshold (H span contract, same bound that recycles the GM heap region), so chains stay bounded by the live H-window and every core's map evolves identically. insert always links a fresh entry (no in-place replace) so cleanup can free a task's entries via its chain; lookup returns the max overlapping producer, equivalent to the old replace-in-place semantics. Also defer built[] assembly until after the claim so the ~2/3 of cores that fail type_match / lose the race skip the tensor copies. Result (1-block isolation, skip-exec): O(N^2) -> O(N); 3840 tasks 34.76 -> 4.01 ms (8.7x), us/task 4.53 -> 0.52; neutral-to-better at 480 tasks. Golden validated on bgemm / paged_attention / paged_attention_ringbuffer / mix_coown. Analysis documented in docs §6.4. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Updated with The per-core duplicate producer map was a flat array with linear-scan lookup/insert that never retired entries; for single-buffer/many-tile workloads (bgemm) Rewrote Result (1-block, skip-exec): 3840 tasks 34.76 → 4.01 ms (8.7×), us/task 4.53 → 0.52; neutral-to-better at 480 tasks. Golden validated on bgemm / paged_attention / paged_attention_ringbuffer / mix_coown. Analysis added in docs §6.4. |
…inning + drain fast paths - device_runner: pin each AICore sim thread 1:1 into one NUMA node via PTO_SIM_AICORE_NUMA_NODE so the AICore working set stays node-local (avoids cross-NUMA noise on this 8-node host); PTO_SIM_AICORE_PIN_VERBOSE logs per-thread placement. - runtime_overhead_test: add --aicore-numa to drive the above; default batch 480->1000 (~2000 tasks) for better fixed-cost amortization; docs. - dist_engine: low-risk orchestration micro-opts (A.1) — empty-ring fast path in drain_phase_b, and a monotone per-block any_pub flag so follower drain_block_won/has_pending_won short-circuit the per-slot won scan for single-core workloads (bgemm), behaviour-identical otherwise. - docs: single-NUMA evaluation methodology, thread-pinning results, baseline/scaling overhead analysis.
…rm-aware overhead --blocks The AICore single-NUMA thread pinning added in f896159 uses cpu_set_t / sched_setaffinity / sched_getcpu, which are Linux-only and broke the macOS host build (unknown type 'cpu_set_t', undeclared 'sched_getcpu'). Wrap the per-thread affinity call in `#if defined(__linux__)`; on other hosts (macOS) AICore pinning is a no-op. Linux behavior is unchanged. Make runtime_overhead_test --blocks default platform-aware: macOS hosts have few physical cores so default to '1-2'; Linux dev boxes default to '1-13'. Explicit --blocks still overrides. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…on losers) Hoist the anchor-type + cursor claim ahead of the producer-map operations so the ~2/3 of cores that fail type_match / lose the race skip the fan-in lookup entirely. They still perform the unconditional output insert, so every core's duplicate TensorMap stays identical (§4); the winner computes fan-in before its own insert (preserving INOUT lookup-before-insert). This is the work that load-balancing across cores can actually amortize: more blocks -> each core wins fewer tasks -> skips more fan-in lookups. Measured dev-vs-1blk (tasks=4000) flattens from 1.7x/2.2x (2/4 blocks) to ~0.7-1.1x. The per-core "floor" (output materialization + output insert, inherent to the per-core duplicate map) is unchanged, so 1-block is flat. Golden validated: bgemm / paged_attention / paged_attention_ringbuffer / mix_coown. Docs §6.4 updated. Co-authored-by: Cursor <cursoragent@cursor.com>
… % G (G=4) Split each per-anchor-type claim cursor (cube_cursor / vector_cursor) into kCursorShards (=4) independent, cache-line-padded sub-cursors. A task id N claims on shard (N % kCursorShards); the shard is a pure function of N (identical on every core, no worker partitioning), so claim semantics stay byte-for-byte equivalent to a single cursor — exactly one owner per task, every core eligible, same progress/skew, same determinism. Sharding ONLY spreads the CAS traffic across G cache lines to cut the false-sharing / coherence contention that dominated us/task at higher core counts (§6.5). Each sub-cursor is alignas(64)-padded so adjacent shards never share a line; all entries init to -1. Golden checks pass (bgemm 24/3-block, mix_coown, paged_attention). Docs: rewrite §6.6 to prove the single-cursor equivalence (vs. the inferior worker-partitioning variant), add §6.5 core-scaling overhead analysis, and correct the H definition to be SCOPE-derived (PC exiting a scope ends the visibility of its tensors) rather than a hardcoded constant; the a2a3 prototype's kHDefault is documented as a conservative stand-in. Co-authored-by: Cursor <cursoragent@cursor.com>
…g results Record the landed sharding (G=4) impact on single-NUMA orchestration overhead: - single cursor -> G=4: ~6-12% lower us/task at mid/high blocks (8-13), curve stays clean and monotonic. - G=4 vs G=8: G=8 is worse (8-13% at mid/high blocks) in the single-NUMA, <=39-core regime -> G=4 is the sweet spot; larger G only pays off cross-NUMA. Also update the §6.5 closing note (sharding + winner-only fan-in now implemented).
Pure clang-format pass with no logic change, split out so the following feature/fix commits carry only semantic diffs (easier review).
fully_distributed_within_core can busy-wait each incore's reference duration (example_exec_time_ns, from the CALLABLE incores) in place of running the real kernel, so a fast a2a3sim run reflects measured on-hardware durations. The switch forces skip-golden since outputs are not computed. - CallConfig carries use_example_exec_time + example_exec_time_ns[64]; the wire layout is kept in sync in worker.py (_CFG_FMT) and remote_wire. - run_prepared plumbs the two fields; a weak runtime_apply_example_exec_time hook lets only fully_distributed implement it (strong override in its runtime_maker), every other runtime links the weak no-op. - scene_test maps CALLABLE example_exec_time_ns -> config and rejects the switch for any runtime other than fully_distributed_within_core; conftest adds --use-example-exec-time. - dist_engine execute_slot busy-waits example_exec_time_ns_[func_id] ns when set, else runs the real kernel. - Add paged_attention_unroll example (a2a3sim) with CaseSimSmall for fast feature validation and example_exec_time_ns on the incores.
…-tagged swimlane Fix a large-scale a2a3sim crash in dist_alloc_tensors and enrich the distributed swimlane so per-stage cost (not just kernel time) is visible. Crash root cause: the heap reclaim back-pressure computed the live window as the UNSIGNED subtraction `heap_next - vend[F-H]`. dist_alloc_tensors was replayed by every core unconditionally, so a core lagging the global frontier by more than H tasks has heap_next < vend[F-H]; the subtraction wrapped to ~2^64, spuriously tripped the "heap ring too small" fatal, and the empty TaskOutputTensors then failed a downstream get_ref assert (SIGABRT). Only the busy-wait replay path (--use-example-exec-time) hit it, at batch>=16 — real kernels keep the frontier close enough that no core lags past H. Fix: give dist_alloc_tensors the same single-owner election dist_submit_impl uses. Materialize outputs + producer-map registration stay per-core (deterministic), then a claim on a new alloc_cursor elects one owner; only that owner runs the reclaim back-pressure and publishes completion. The winner is by construction at/ahead of the frontier (its task is not yet done, so F < N), so the window subtraction can no longer underflow — no arithmetic guard needed. Materializing before the (now winner-only) back-pressure also means a genuine fatal returns a populated result instead of the empty one that asserted. Swimlane: TraceEvent gains a phase (kernel / alloc / build / ringbp / drain_won / replay) so the exported Chrome trace shows the work between kernels — claim/ build, alloc, the ring/heap back-pressure spin (ringbp), and the per-core replay of lost claims — not just kernel spans. The two winner back-pressure spins are timed as a separate ringbp phase so dependency/slot WAITING is no longer misattributed to build (build itself is sub-microsecond; the spin can be hundreds of us under a small ring / few blocks). Capture stays zero-cost when PTO_DIST_SWIMLANE is unset; the trace vector is reserved up front when on so push_back never reallocs mid-run. scene_test's func-id->name pass prefixes non-kernel phases so a task's build span no longer collides with its kernel span on the same lane.
fix(fully_distributed_within_core): alloc 单owner选举修复 heap 回收下溢 + phase化泳道
概述
新增 runtime
fully_distributed_within_core:编排(orchestration)、调度(scheduling)与执行(execution)全部以 SPMD 方式运行在 AICore 自身之上,AICPU 完全不参与关键路径(仅保留 init / handoff / teardown 桩)。不存在独立调度器——每个核自行构建、拥有并执行自己的任务。权威设计见
docs/fully_distributed_within_core.md。设计支柱
cube_cursor用于 AIC-anchored,vector_cursor用于 AIV-only)扫过同一确定性 submit id 空间。本次更新
1)负载均衡策略改进——避免极度不均衡
旧的核循环是"先把私有环填满、再开始执行"。由于抢占(claim)+ 填环极快、而任务执行很慢,一个快核会把连续很多个任务一口气抢光直到填满私有环,再串行地慢慢执行;其它核因 cursor 已被推到很远而无任务可抢,造成严重负载不均衡与按序的 Head-of-Line Blocking。乱序窗口本应来自 核数 × 私有环 这个维度,但深私有环把窗口压在了单核内部。
改为 执行优先、认领其次、一次只认领一个(execute-first run-ahead):
kPrivateSlots8 → 4 缩小,把单核 run-ahead 上限收紧;乱序能力交还给核数维度。详见
docs/fully_distributed_within_core.md§6 / §6.1。2)新增每核执行泳道图(swimlane)
由于本 runtime 没有中心化 AICPU 调度器,原 L2 swimlane 采集器不适用。新增引擎内置、环境变量门控的 Chrome-trace 导出:
PTO_DIST_SWIMLANE=<path>即开启(未设为 no-op);每个被执行的(子)任务记为一个 duration 事件,按**物理 block(pid)**与 **lane AIC/AIV0/AIV1(tid)**排布,直观展示 execute-first 竞争如何把工作摊到各核(负载均衡)。aicpu_executor在所有 worker 完成后一次性导出 JSON(可直接拖入 https://ui.perfetto.dev/ )。simpler_setup/tools/dist_swimlane_render.py,把 JSON 渲染为静态 Gantt PNG。benchmark_bgemm::FullCore24(block_dim=24),填满 24 AIC + 48 AIV 共 72 条 lane,便于满核可视化。CoreCallable不携带名字,故由scene_test在捕获后把func_id → name(来自 CALLABLE spec)注入泳道 JSON,使 Perfetto 与渲染图显示GEMM/ADD而非f0/f1。3)编排/调度开销实测(runtime_overhead_test)
为测量全分布式模式"无中心调度器"的代价,引擎新增门控
PTO_DIST_SKIP_EXEC=1:置位后execute_slot跳过 incore kernel 调用(每个子任务当 0 代价瞬时完成),但保留全部 ownership/完成/frontier 簿记,循环照常终止。这样测得的片上墙钟只反映 orchestration + claim race + scheduling。新增基准
examples/a2a3/fully_distributed_within_core/runtime_overhead_test/:复用 bgemm 工作负载,扫block_dim(1 block = 1 AIC + 2 AIV),报告每档片上编排墙钟。a2a3sim(约 960 任务,中位数):结论:纯编排/调度墙钟随核数近线性增长(3→72 核约 11×),因为每个核都完整重放整段编排(SPMD)且在共享 cursor 上竞争;少核时增量很小(2 块仅 +20%),核越多越陡。这部分固定开销要靠真实 kernel 执行被多核并行摊薄来回本(本实验故意跳过执行,只暴露开销)。详见设计文档 §6.2。
包含内容
docs/fully_distributed_within_core.md— 权威设计文档(中文,含 §6.1 负载均衡论证、§6.2 开销实测)。src/{a2a3,a5}/runtime/fully_distributed_within_core/— runtime 骨架与dist_engine(execute-first 循环 + 泳道导出 + skip-exec 门控)。examples/a2a3/fully_distributed_within_core/— 迁移的示例(含FullCore24满核用例、runtime_overhead_test开销基准)。tests/st/a2a3/fully_distributed_within_core/— 迁移的 ST 测试。simpler_setup/tools/dist_swimlane_render.py— 泳道图渲染脚本。验证(a2a3sim)
benchmark_bgemm(Case0 / Bgemm64 / FullCore24)、vector_example、mix_coown(MIX 共同所有权)均 PASS。FullCore24满核运行得到 72 lane / 480 事件的泳道图,图例正确显示 GEMM/ADD。runtime_overhead_test扫 1/2/12/24 blocks 全部 PASS,得到上表的编排开销 scaling。备注 / 后续
Made with Cursor