perf(runtime): overlap AICore handshake wakeups; batch the release barrier#1214
perf(runtime): overlap AICore handshake wakeups; batch the release barrier#1214ChaoWao wants to merge 1 commit into
Conversation
…rrier handshake_all_cores ran two costs serially that can be parallelized: 1. Step 2 blocked on core i (wait aicore_regs_ready, init regs, wait aicore_done) before looking at core i+1, so the 72 AICore cores' wakeup latencies summed. The cores wake and advance independently, so this is now two phase-batched sweeps (poll every outstanding core per pass, service the ready ones): the per-core wakeup waits overlap instead of accumulating. The handshake flags are GM reads, not the nGnRE MMIO reg window, so sweeping is not subject to the serial-LDR constraint that COND polling is. 2. Step 1 raised aicpu_ready inside the per-core loop with a barrier each iteration (71 redundant barriers). The task pointers are now published with one barrier, then aicpu_ready is raised for all cores — one barrier suffices since AICore only relies on "all task stores visible before any aicpu_ready". Measured (qwen3-14B 3.5k decode, a2a3 onboard, PTO2_RING_TASK_WINDOW=524288): preamble 329us -> 150us/step. Output tokens identical. The residual ~150us is the physical floor: AICore launch (rtKernelLaunch lazy binary load) + NoC cold-wakeup, plus one structurally-required GM round-trip to bind logical block_idx <-> runtime-assigned physical_core_id (the register channel cannot bootstrap that binding — it needs the physical core id the binding is establishing, and has no host-preclearable "not ready" sentinel). Applied symmetrically to a2a3 and a5. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request refactors the core handshake process in scheduler_cold_path.cpp (for both a2a3 and a5 runtimes) by splitting it into two non-blocking sweeps (Sweep A and Sweep B) to overlap core wakeups and reduce latency. The review feedback highlights a correctness bug where core_exec_states_[i].reg_addr is left uninitialized if validation fails during Sweep A, which prevents emergency_shutdown() from correctly cleaning up initialized registers. To resolve this, reg_addr should be populated immediately in Sweep A, and the redundant assignment in Sweep B can be removed.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| uint64_t reg_addr = regs[physical_core_id]; | ||
| reg_addr_of[i] = reg_addr; | ||
| platform_init_aicore_regs(reg_addr); | ||
| OUT_OF_ORDER_STORE_BARRIER(); | ||
| hank->aicpu_regs_ready = 1; |
There was a problem hiding this comment.
Correctness Bug: Uninitialized core_exec_states_[i].reg_addr on Early Exit
By splitting the handshake into Sweep A and Sweep B, core_exec_states_[i].reg_addr is only populated during Sweep B. However, if any core fails validation (e.g., invalid physical_core_id) during Sweep A, handshake_failed is set to true, and the function exits early before Sweep B is ever executed.
This triggers emergency_shutdown(), which relies on core_exec_states_[i].reg_addr != 0 to deinitialize the registers of already-initialized cores. Since core_exec_states_[i].reg_addr remains 0 for all cores, emergency_shutdown() fails to deinitialize any registers, leaving them in an active/initialized state.
Populating core_exec_states_[i].reg_addr immediately after initialization in Sweep A ensures that emergency_shutdown() can correctly clean up resources on failure.
Additionally, if a dependent kernel launch fails after another kernel (such as the AICore worker) has already been launched and is waiting (spinning in a handshake), ensure that the device is recovered or marked unusable (e.g., by calling recover_device_or_mark_unusable) before returning to prevent the orphaned kernel from spinning until timeout and poisoning the device context.
uint64_t reg_addr = regs[physical_core_id];
reg_addr_of[i] = reg_addr;
core_exec_states_[i].reg_addr = reg_addr;
platform_init_aicore_regs(reg_addr);
OUT_OF_ORDER_STORE_BARRIER();
hank->aicpu_regs_ready = 1;References
- If a dependent kernel launch fails after another kernel has already been launched and is waiting, ensure that the device is recovered or marked unusable (e.g., by calling
recover_device_or_mark_unusable) before returning to prevent the orphaned kernel from spinning and poisoning the device context.
| uint64_t reg_addr = regs[physical_core_id]; | ||
| reg_addr_of[i] = reg_addr; | ||
| platform_init_aicore_regs(reg_addr); | ||
| OUT_OF_ORDER_STORE_BARRIER(); | ||
| hank->aicpu_regs_ready = 1; |
There was a problem hiding this comment.
Correctness Bug: Uninitialized core_exec_states_[i].reg_addr on Early Exit
By splitting the handshake into Sweep A and Sweep B, core_exec_states_[i].reg_addr is only populated during Sweep B. However, if any core fails validation (e.g., invalid physical_core_id) during Sweep A, handshake_failed is set to true, and the function exits early before Sweep B is ever executed.
This triggers emergency_shutdown(), which relies on core_exec_states_[i].reg_addr != 0 to deinitialize the registers of already-initialized cores. Since core_exec_states_[i].reg_addr remains 0 for all cores, emergency_shutdown() fails to deinitialize any registers, leaving them in an active/initialized state.
Populating core_exec_states_[i].reg_addr immediately after initialization in Sweep A ensures that emergency_shutdown() can correctly clean up resources on failure.
Additionally, if a dependent kernel launch fails after another kernel (such as the AICore worker) has already been launched and is waiting (spinning in a handshake), ensure that the device is recovered or marked unusable (e.g., by calling recover_device_or_mark_unusable) before returning to prevent the orphaned kernel from spinning until timeout and poisoning the device context.
uint64_t reg_addr = regs[physical_core_id];
reg_addr_of[i] = reg_addr;
core_exec_states_[i].reg_addr = reg_addr;
platform_init_aicore_regs(reg_addr);
OUT_OF_ORDER_STORE_BARRIER();
hank->aicpu_regs_ready = 1;References
- If a dependent kernel launch fails after another kernel has already been launched and is waiting, ensure that the device is recovered or marked unusable (e.g., by calling
recover_device_or_mark_unusable) before returning to prevent the orphaned kernel from spinning and poisoning the device context.
| CoreType type = hank->core_type; | ||
| uint64_t reg_addr = reg_addr_of[i]; | ||
| core_exec_states_[i].reg_addr = reg_addr; | ||
| core_exec_states_[i].cond_ptr = get_reg_ptr(reg_addr, RegId::COND); |
There was a problem hiding this comment.
Redundant Assignment Cleanup
Since core_exec_states_[i].reg_addr is now populated during Sweep A to ensure correct cleanup in emergency_shutdown(), the redundant assignment in Sweep B can be removed.
CoreType type = hank->core_type;
uint64_t reg_addr = reg_addr_of[i];
core_exec_states_[i].cond_ptr = get_reg_ptr(reg_addr, RegId::COND);| CoreType type = hank->core_type; | ||
| uint64_t reg_addr = reg_addr_of[i]; | ||
| core_exec_states_[i].reg_addr = reg_addr; | ||
| core_exec_states_[i].cond_ptr = get_reg_ptr(reg_addr, RegId::COND); |
There was a problem hiding this comment.
Redundant Assignment Cleanup
Since core_exec_states_[i].reg_addr is now populated during Sweep A to ensure correct cleanup in emergency_shutdown(), the redundant assignment in Sweep B can be removed.
CoreType type = hank->core_type;
uint64_t reg_addr = reg_addr_of[i];
core_exec_states_[i].cond_ptr = get_reg_ptr(reg_addr, RegId::COND);
📝 WalkthroughWalkthrough
ChangesOut-of-order sweep handshake refactor
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`:
- Around line 762-781: The failure path in the scheduler cold-path
initialization leaves some initialized register addresses untracked because
`reg_addr` is only stored in Sweep B, so `emergency_shutdown()` can miss regs
set up before an invalid `physical_core_id` triggers `handshake_failed`. Update
the `scheduler_cold_path.cpp` flow around the `reg_addr` initialization and
`emergency_shutdown()` handling so every successfully initialized `reg_addr` is
persisted immediately in `core_exec_states_[i].reg_addr` before any possible
early return, ensuring shutdown can deinitialize all regs consistently.
In
`@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`:
- Around line 766-785: The reg address for each initialized core is being
recorded too late, so the failure path in `scheduler_cold_path.cpp` can miss or
reuse stale state after `platform_init_aicore_regs(reg_addr)` has already run.
Move the assignment of the initialized reg address into the same Sweep A path
where `platform_init_aicore_regs` and `hank->aicpu_regs_ready` are set, using
`core_exec_states_[i].reg_addr` so `emergency_shutdown` and the later deinit
logic can see the correct value. Keep the update paired with the existing core
state writes in the loop that handles `physical_core_id`, `reg_addr_of`, and
`regs_phase_done`.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 45bbc16f-c2ed-470f-8f92-4b7d16d5d40b
📒 Files selected for processing (2)
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cppsrc/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
| uint64_t reg_addr = regs[physical_core_id]; | ||
| reg_addr_of[i] = reg_addr; | ||
| platform_init_aicore_regs(reg_addr); | ||
| OUT_OF_ORDER_STORE_BARRIER(); | ||
| hank->aicpu_regs_ready = 1; | ||
| #if PTO2_PROFILING | ||
| // Record physical_core_id for PMU init later (CoreExecState has no room | ||
| // for this field under PTO2_PROFILING). | ||
| physical_core_ids_[i] = physical_core_id; | ||
| physical_core_ids_[i] = physical_core_id; | ||
| #endif | ||
| #if !PTO2_PROFILING | ||
| core_exec_states_[i].worker_id = i; | ||
| core_exec_states_[i].physical_core_id = physical_core_id; | ||
| core_exec_states_[i].core_type = type; | ||
| core_exec_states_[i].physical_core_id = physical_core_id; | ||
| #endif | ||
|
|
||
| if (type == CoreType::AIC) { | ||
| aic_worker_ids_[aic_count_++] = i; | ||
| LOG_INFO_V0("Core %d: AIC, physical_id=%u, reg_addr=0x%lx", i, physical_core_id, reg_addr); | ||
| } else { | ||
| aiv_worker_ids_[aiv_count_++] = i; | ||
| LOG_INFO_V0("Core %d: AIV, physical_id=%u, reg_addr=0x%lx", i, physical_core_id, reg_addr); | ||
| regs_phase_done[i] = true; | ||
| remaining--; | ||
| } | ||
| } | ||
| OUT_OF_ORDER_STORE_BARRIER(); | ||
|
|
||
| if (handshake_failed) { | ||
| emergency_shutdown(runtime); | ||
| return -1; |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify a2a3 emergency_shutdown can clean regs initialized during Sweep A.
rg -n -C4 'SchedulerContext::emergency_shutdown|platform_deinit_aicore_regs|core_exec_states_\[[^]]+\]\.reg_addr|reg_addr_of' \
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cppRepository: hw-native-sys/simpler
Length of output: 4635
🏁 Script executed:
#!/bin/bash
sed -n '720,820p' src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
printf '\n----\n'
sed -n '954,980p' src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cppRepository: hw-native-sys/simpler
Length of output: 5525
🏁 Script executed:
rg -n 'core_exec_states_\[[^]]+\]\.reg_addr\s*=|reg_addr\s*=' src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cppRepository: hw-native-sys/simpler
Length of output: 554
Persist reg_addr before the failure return (src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp:762-798, 958-967) emergency_shutdown() only deinitializes via core_exec_states_[i].reg_addr, which is populated in Sweep B; if Sweep A bails on an invalid physical_core_id, any regs already initialized there are skipped.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`
around lines 762 - 781, The failure path in the scheduler cold-path
initialization leaves some initialized register addresses untracked because
`reg_addr` is only stored in Sweep B, so `emergency_shutdown()` can miss regs
set up before an invalid `physical_core_id` triggers `handshake_failed`. Update
the `scheduler_cold_path.cpp` flow around the `reg_addr` initialization and
`emergency_shutdown()` handling so every successfully initialized `reg_addr` is
persisted immediately in `core_exec_states_[i].reg_addr` before any possible
early return, ensuring shutdown can deinitialize all regs consistently.
| uint64_t reg_addr = regs[physical_core_id]; | ||
| reg_addr_of[i] = reg_addr; | ||
| platform_init_aicore_regs(reg_addr); | ||
| OUT_OF_ORDER_STORE_BARRIER(); | ||
| hank->aicpu_regs_ready = 1; | ||
| #if PTO2_PROFILING | ||
| physical_core_ids_[i] = physical_core_id; | ||
| physical_core_ids_[i] = physical_core_id; | ||
| #endif | ||
|
|
||
| #if !PTO2_PROFILING | ||
| core_exec_states_[i].worker_id = i; | ||
| core_exec_states_[i].physical_core_id = physical_core_id; | ||
| core_exec_states_[i].core_type = type; | ||
| core_exec_states_[i].physical_core_id = physical_core_id; | ||
| #endif | ||
|
|
||
| if (type == CoreType::AIC) { | ||
| aic_worker_ids_[aic_count_++] = i; | ||
| LOG_INFO_V0("Core %d: AIC, physical_id=%u, reg_addr=0x%lx", i, physical_core_id, reg_addr); | ||
| } else { | ||
| aiv_worker_ids_[aiv_count_++] = i; | ||
| LOG_INFO_V0("Core %d: AIV, physical_id=%u, reg_addr=0x%lx", i, physical_core_id, reg_addr); | ||
| regs_phase_done[i] = true; | ||
| remaining--; | ||
| } | ||
| } | ||
| OUT_OF_ORDER_STORE_BARRIER(); | ||
|
|
||
| if (handshake_failed) { | ||
| emergency_shutdown(runtime); | ||
| return -1; |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Record initialized reg addresses before the failure path.
platform_init_aicore_regs(reg_addr) can run in Sweep A, but core_exec_states_[i].reg_addr is only assigned in Sweep B. If any later core reports an invalid physical ID, Line 784 calls emergency_shutdown, whose Line 970 deinit check skips these newly initialized regs or may use stale reg addrs from a previous run.
Proposed fix
bool regs_phase_done[RUNTIME_MAX_WORKER] = {false};
uint64_t reg_addr_of[RUNTIME_MAX_WORKER] = {0};
+ for (int32_t i = 0; i < cores_total_num_; i++) {
+ core_exec_states_[i].reg_addr = 0;
+ }
@@
uint64_t reg_addr = regs[physical_core_id];
reg_addr_of[i] = reg_addr;
platform_init_aicore_regs(reg_addr);
+ core_exec_states_[i].reg_addr = reg_addr;
OUT_OF_ORDER_STORE_BARRIER();
hank->aicpu_regs_ready = 1;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| uint64_t reg_addr = regs[physical_core_id]; | |
| reg_addr_of[i] = reg_addr; | |
| platform_init_aicore_regs(reg_addr); | |
| OUT_OF_ORDER_STORE_BARRIER(); | |
| hank->aicpu_regs_ready = 1; | |
| #if PTO2_PROFILING | |
| physical_core_ids_[i] = physical_core_id; | |
| physical_core_ids_[i] = physical_core_id; | |
| #endif | |
| #if !PTO2_PROFILING | |
| core_exec_states_[i].worker_id = i; | |
| core_exec_states_[i].physical_core_id = physical_core_id; | |
| core_exec_states_[i].core_type = type; | |
| core_exec_states_[i].physical_core_id = physical_core_id; | |
| #endif | |
| if (type == CoreType::AIC) { | |
| aic_worker_ids_[aic_count_++] = i; | |
| LOG_INFO_V0("Core %d: AIC, physical_id=%u, reg_addr=0x%lx", i, physical_core_id, reg_addr); | |
| } else { | |
| aiv_worker_ids_[aiv_count_++] = i; | |
| LOG_INFO_V0("Core %d: AIV, physical_id=%u, reg_addr=0x%lx", i, physical_core_id, reg_addr); | |
| regs_phase_done[i] = true; | |
| remaining--; | |
| } | |
| } | |
| OUT_OF_ORDER_STORE_BARRIER(); | |
| if (handshake_failed) { | |
| emergency_shutdown(runtime); | |
| return -1; | |
| bool regs_phase_done[RUNTIME_MAX_WORKER] = {false}; | |
| uint64_t reg_addr_of[RUNTIME_MAX_WORKER] = {0}; | |
| for (int32_t i = 0; i < cores_total_num_; i++) { | |
| core_exec_states_[i].reg_addr = 0; | |
| } | |
| uint64_t reg_addr = regs[physical_core_id]; | |
| reg_addr_of[i] = reg_addr; | |
| platform_init_aicore_regs(reg_addr); | |
| core_exec_states_[i].reg_addr = reg_addr; | |
| OUT_OF_ORDER_STORE_BARRIER(); | |
| hank->aicpu_regs_ready = 1; | |
| `#if` PTO2_PROFILING | |
| physical_core_ids_[i] = physical_core_id; | |
| `#endif` | |
| `#if` !PTO2_PROFILING | |
| core_exec_states_[i].physical_core_id = physical_core_id; | |
| `#endif` | |
| regs_phase_done[i] = true; | |
| remaining--; | |
| } | |
| } | |
| OUT_OF_ORDER_STORE_BARRIER(); | |
| if (handshake_failed) { | |
| emergency_shutdown(runtime); | |
| return -1; |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`
around lines 766 - 785, The reg address for each initialized core is being
recorded too late, so the failure path in `scheduler_cold_path.cpp` can miss or
reuse stale state after `platform_init_aicore_regs(reg_addr)` has already run.
Move the assignment of the initialized reg address into the same Sweep A path
where `platform_init_aicore_regs` and `hank->aicpu_regs_ready` are set, using
`core_exec_states_[i].reg_addr` so `emergency_shutdown` and the later deinit
logic can see the correct value. Keep the update paired with the existing core
state writes in the loop that handles `physical_core_id`, `reg_addr_of`, and
`regs_phase_done`.
Summary
handshake_all_cores(run once perrun_prepared, on the orchestrator thread,inside the
preambledevice phase) ran two costs serially that can beparallelized. Both AICore cores wake and advance their handshake phases
independently, so blocking on one core before looking at the next made their
latencies sum instead of overlap.
Change 1 — sweep Step 2 instead of per-core blocking
The old loop did, for each core:
while(wait aicore_regs_ready)→ init regs →while(wait aicore_done), fully draining coreibefore touching corei+1.That serializes 72 cores' wakeup waits. Replaced with two phase-batched
sweeps: poll every still-outstanding core per pass and service whichever are
ready. The per-core wakeup waits now overlap (≈ slowest single core, not the
sum). The handshake flags are GM reads, not the
nGnREMMIO registerwindow, so sweeping is not subject to the serial-LDR constraint that
RegId::CONDpolling is.
Change 2 — batch the Step 1 release barrier
Step 1 raised
aicpu_readyinside the per-core loop with anOUT_OF_ORDER_STORE_BARRIER()each iteration (71 redundant barriers). Thetask pointers are now published with one barrier, then
aicpu_readyis raisedfor all cores. One barrier suffices — AICore only relies on "all
taskstoresglobally visible before any
aicpu_readystore".Applied symmetrically to a2a3 and a5.
Measured —
preambledevice phase onlyqwen3-14B, 3.5k-context decode, a2a3 onboard,
PTO2_RING_TASK_WINDOW=524288,averaged over the 19 decode steps:
preambleNet −179 µs/step (−54%). Output token IDs identical to baseline.
Why it stops at ~150 µs (the residual is a physical / protocol floor)
The remaining
preambleis dominated by costs no AICPU-side code change canremove:
rtKernelLaunchWithHandleV2lazilyloads the AICore kernel binary onto the device on first launch, and each core
must be woken across the NoC before it reaches the handshake point. The sweep
already overlaps these waits to the slowest single core; that core's wakeup
is a hardware floor.
block_idx↔runtime-assigned
physical_core_id. Three follow-on ideas to move this offGM were each ruled out by verified hardware constraints:
cannot write
DATA_MAIN_BASE/FAST_PATH_ENABLE— the SPR write isrejected by the CCEC backend and an MMIO STR to the SPR window hangs the
chip (
.claude/rules/ascend.md,docs/hardware/mmio-performance.md). Reginit must stay on the AICPU.
physical_core_id/ signals over COND instead of GM: to poll acore's COND the AICPU needs that core's
reg_addr = regs[physical_core_id]— which is exactly the binding the handshake is establishing. The register
channel cannot bootstrap the binding it depends on.
distinguish "AICPU has not initialized my regs yet" from "done", which
needs a host-preclearable "not ready" sentinel. GM flags have one (host
zeroes
workers[]each run); SPRs (DMB/COND) do not — the host cannot writethem, so a stale value would be misread. The init-ack must stay a GM flag.
So
preamble's floor is: AICore binary load + NoC wakeup + one GM round-trip forthe logical↔physical binding. The accumulation (sweep) and the redundant
barriers (batch) — the parts that were software overhead — are removed here.
Testing
paths:
vector_example,async_notify_demo,sdma_async_completion_demo,paged_attention,multi_round_paged_attention,batch_paged_attention— all pass.identical to a2a3).