perf(runtime): overlap AICore handshake wakeups; batch the release barrier by ChaoWao · Pull Request #1214 · hw-native-sys/simpler

ChaoWao · 2026-06-30T11:16:06Z

Summary

handshake_all_cores (run once per run_prepared, on the orchestrator thread,
inside the preamble device phase) paid two costs serially that do not need to
be serial. The AICore cores wake and advance their handshake phases
independently, so blocking on one core before looking at the next made their
latencies sum instead of overlap.

Change 1 — sweep Step 2 instead of per-core blocking

The old loop did, for each core: while(wait aicore_regs_ready) → init regs →
while(wait aicore_done), fully draining core i before touching core i+1.
That serializes 72 cores' wakeup waits. Replaced with two phase-batched
sweeps: poll every still-outstanding core per pass and service whichever are
ready. The per-core wakeup waits now overlap (≈ slowest single core, not the
sum). The handshake flags are GM reads, not the nGnRE MMIO register
window, so sweeping is not subject to the serial-LDR constraint that RegId::COND
polling is.
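
The difference between the two loop shapes can be sketched with a toy latency model. This is only an illustration under stated assumptions, not the code in scheduler_cold_path.cpp: `CoreTiming`, `serial_blocking_ticks` and `two_sweep_ticks` are hypothetical names, time is counted in abstract ticks, and servicing a ready core is assumed to be instantaneous.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical timing of one core, in abstract ticks (not a real runtime type).
struct CoreTiming {
    int wake;        // tick at which the core raises aicore_regs_ready
    int done_delay;  // ticks from the AICPU's ack until the core raises aicore_done
};

// Old shape: fully drain core i before touching core i+1.
int serial_blocking_ticks(const std::vector<CoreTiming>& cores) {
    int tick = 0;
    for (const CoreTiming& c : cores) {
        tick = std::max(tick, c.wake);  // spin until this core's regs_ready
        tick += c.done_delay;           // init regs, ack, spin until its done
    }
    return tick;
}

// New shape: two phase-batched sweeps. Each pass polls every core that is
// still outstanding and services whichever are ready.
int two_sweep_ticks(const std::vector<CoreTiming>& cores) {
    const std::size_t n = cores.size();
    std::vector<int> ack_tick(n, -1);  // -1 means "not acked yet"
    std::vector<bool> done(n, false);
    int tick = 0;

    // Sweep A: wait for regs_ready on every core, ack each as it shows up.
    std::size_t remaining = n;
    while (remaining > 0) {
        for (std::size_t i = 0; i < n; ++i) {
            if (ack_tick[i] < 0 && tick >= cores[i].wake) {
                ack_tick[i] = tick;
                --remaining;
            }
        }
        if (remaining > 0) ++tick;  // nothing more to do this pass; time advances
    }

    // Sweep B: wait for done on every core. Cores acked early in Sweep A have
    // been making progress in the meantime.
    remaining = n;
    while (remaining > 0) {
        for (std::size_t i = 0; i < n; ++i) {
            if (!done[i] && tick >= ack_tick[i] + cores[i].done_delay) {
                done[i] = true;
                --remaining;
            }
        }
        if (remaining > 0) ++tick;
    }
    return tick;
}
```

With three cores timed `{3, 2}`, `{1, 4}` and `{2, 1}`, the serial shape finishes at tick 10 while the two sweeps finish at tick 5, which is the slowest single core's wake plus done delay.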

Change 2 — batch the Step 1 release barrier

Step 1 raised aicpu_ready inside the per-core loop with an
OUT_OF_ORDER_STORE_BARRIER() each iteration (71 redundant barriers). The
task pointers are now published with one barrier, then aicpu_ready is raised
for all cores. One barrier suffices — AICore only relies on "all task stores
globally visible before any aicpu_ready store".
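
A minimal sketch of the two publication orders follows, assuming a release fence as a stand-in for `OUT_OF_ORDER_STORE_BARRIER()`. `Slot`, `store_barrier`, `publish_per_core` and `publish_batched` are hypothetical names introduced for illustration; the real handshake struct and barrier macro are not reproduced here. Each function returns how many barriers it issued.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-core handshake slot (assumed layout, illustration only).
struct Slot {
    std::uint64_t task = 0;
    std::atomic<std::uint32_t> aicpu_ready{0};
};

// Stand-in for the store barrier: a release fence, counted for the demo.
inline void store_barrier(int& barrier_count) {
    std::atomic_thread_fence(std::memory_order_release);
    ++barrier_count;
}

// Old shape: one barrier per core, between its task store and its ready flag.
int publish_per_core(std::vector<Slot>& slots, const std::vector<std::uint64_t>& tasks) {
    int barriers = 0;
    for (std::size_t i = 0; i < slots.size(); ++i) {
        slots[i].task = tasks[i];
        store_barrier(barriers);
        slots[i].aicpu_ready.store(1, std::memory_order_relaxed);
    }
    return barriers;
}

// New shape: publish every task pointer, issue one barrier, then raise every
// ready flag. All task stores are ordered before any aicpu_ready store.
int publish_batched(std::vector<Slot>& slots, const std::vector<std::uint64_t>& tasks) {
    int barriers = 0;
    for (std::size_t i = 0; i < slots.size(); ++i) {
        slots[i].task = tasks[i];
    }
    store_barrier(barriers);
    for (std::size_t i = 0; i < slots.size(); ++i) {
        slots[i].aicpu_ready.store(1, std::memory_order_relaxed);
    }
    return barriers;
}
```

For 72 slots the per-core shape issues 72 barriers and the batched shape issues 1, matching the 71 redundant barriers counted above. The sketch models only the barrier between the two groups of stores.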

Applied symmetrically to a2a3 and a5.

Measured — `preamble` device phase only

qwen3-14B, 3.5k-context decode, a2a3 onboard, PTO2_RING_TASK_WINDOW=524288,
averaged over the 19 decode steps:

`preamble`	value
before (per-core blocking + per-iter barrier)	329 µs
after sweep Step 2	213 µs
after sweep + batched Step 1 barrier	150 µs

Net −179 µs/step (−54%): −116 µs from the Step 2 sweep and −63 µs from the batched Step 1 barrier. Output token IDs identical to baseline.

Why it stops at ~150 µs (the residual is a physical / protocol floor)

The remaining preamble is dominated by costs no AICPU-side code change can
remove:

- AICore launch + NoC cold-wakeup: rtKernelLaunchWithHandleV2 lazily
  loads the AICore kernel binary onto the device on first launch, and each core
  must be woken across the NoC before it reaches the handshake point. The sweep
  already overlaps these waits to the slowest single core; that core's wakeup
  is a hardware floor.
- One structurally-required GM round-trip to bind logical block_idx ↔
  runtime-assigned physical_core_id. Three follow-on ideas to move this off
  GM were each ruled out by verified hardware constraints:
  - Core initializes its own dispatch regs (to drop the round-trip): AICore
    cannot write DATA_MAIN_BASE / FAST_PATH_ENABLE; the SPR write is
    rejected by the CCEC backend and an MMIO STR to the SPR window hangs the
    chip (.claude/rules/ascend.md, docs/hardware/mmio-performance.md). Reg
    init must stay on the AICPU.
  - Report physical_core_id / signals over COND instead of GM: to poll a
    core's COND the AICPU needs that core's reg_addr = regs[physical_core_id],
    which is exactly the binding the handshake is establishing. The register
    channel cannot bootstrap the binding it depends on.
  - Acknowledge reg-init over DMB instead of a GM flag: the core must
    distinguish "AICPU has not initialized my regs yet" from "done", which
    needs a host-preclearable "not ready" sentinel. GM flags have one (host
    zeroes workers[] each run); SPRs (DMB/COND) do not, because the host cannot
    write them, so a stale value would be misread. The init-ack must stay a GM flag.

So preamble's floor is: AICore binary load + NoC wakeup + one GM round-trip for
the logical↔physical binding. The accumulation (sweep) and the redundant
barriers (batch) — the parts that were software overhead — are removed here.
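
The sentinel argument behind the third ruled-out idea can be sketched as a toy model. It assumes a one-word ack channel whose value survives from the previous run; `AckChannel` and `stale_ack_visible` are hypothetical names, and nothing here touches real GM or SPR state.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical one-word ack channel that persists across runs.
struct AckChannel {
    std::uint32_t value;   // 1 means "AICPU has initialized my regs"
    bool host_can_clear;   // true for a GM flag, false for an SPR such as DMB/COND
};

// Returns true if a core polling at the start of a new run would see a stale
// ack, that is, before the AICPU has initialized anything in this run.
bool stale_ack_visible(bool host_can_clear) {
    AckChannel ch{1, host_can_clear};  // value left behind by the previous run
    if (ch.host_can_clear) {
        ch.value = 0;                  // host zeroes the flag before each run
    }
    return ch.value == 1;              // what the core reads on its first poll
}
```

Under these assumptions `stale_ack_visible(true)` is false and `stale_ack_visible(false)` is true: only the channel the host can pre-clear gives the core a trustworthy "not ready" state.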

Testing

- a2a3 onboard: output tokens identical to baseline across the run.
- a2a3 onboard correctness across the handshake → dispatch → completion
  paths: vector_example, async_notify_demo, sdma_async_completion_demo,
  paged_attention, multi_round_paged_attention, batch_paged_attention; all pass.
- a5 onboard: not run locally (no a5 silicon on this box); relies on CI. The
  change is structurally identical to a2a3.

…rrier

handshake_all_cores ran two costs serially that can be parallelized:

1. Step 2 blocked on core i (wait aicore_regs_ready, init regs, wait aicore_done) before looking at core i+1, so the 72 AICore cores' wakeup latencies summed. The cores wake and advance independently, so this is now two phase-batched sweeps (poll every outstanding core per pass, service the ready ones): the per-core wakeup waits overlap instead of accumulating. The handshake flags are GM reads, not the nGnRE MMIO reg window, so sweeping is not subject to the serial-LDR constraint that COND polling is.
2. Step 1 raised aicpu_ready inside the per-core loop with a barrier each iteration (71 redundant barriers). The task pointers are now published with one barrier, then aicpu_ready is raised for all cores. One barrier suffices since AICore only relies on "all task stores visible before any aicpu_ready".

Measured (qwen3-14B 3.5k decode, a2a3 onboard, PTO2_RING_TASK_WINDOW=524288): preamble 329us -> 150us/step. Output tokens identical.

The residual ~150us is the physical floor: AICore launch (rtKernelLaunch lazy binary load) + NoC cold-wakeup, plus one structurally-required GM round-trip to bind logical block_idx <-> runtime-assigned physical_core_id (the register channel cannot bootstrap that binding: it needs the physical core id the binding is establishing, and has no host-preclearable "not ready" sentinel).

Applied symmetrically to a2a3 and a5.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request refactors the core handshake process in scheduler_cold_path.cpp (for both a2a3 and a5 runtimes) by splitting it into two non-blocking sweeps (Sweep A and Sweep B) to overlap core wakeups and reduce latency. The review feedback highlights a correctness bug where core_exec_states_[i].reg_addr is left uninitialized if validation fails during Sweep A, which prevents emergency_shutdown() from correctly cleaning up initialized registers. To resolve this, reg_addr should be populated immediately in Sweep A, and the redundant assignment in Sweep B can be removed.

gemini-code-assist · 2026-06-30T11:18:09Z

+            uint64_t reg_addr = regs[physical_core_id];
+            reg_addr_of[i] = reg_addr;
+            platform_init_aicore_regs(reg_addr);
+            OUT_OF_ORDER_STORE_BARRIER();
+            hank->aicpu_regs_ready = 1;


Correctness Bug: Uninitialized core_exec_states_[i].reg_addr on Early Exit

By splitting the handshake into Sweep A and Sweep B, core_exec_states_[i].reg_addr is only populated during Sweep B. However, if any core fails validation (e.g., invalid physical_core_id) during Sweep A, handshake_failed is set to true, and the function exits early before Sweep B is ever executed.

This triggers emergency_shutdown(), which relies on core_exec_states_[i].reg_addr != 0 to deinitialize the registers of already-initialized cores. Since core_exec_states_[i].reg_addr remains 0 for all cores, emergency_shutdown() fails to deinitialize any registers, leaving them in an active/initialized state.

Populating core_exec_states_[i].reg_addr immediately after initialization in Sweep A ensures that emergency_shutdown() can correctly clean up resources on failure.

Additionally, if a dependent kernel launch fails after another kernel (such as the AICore worker) has already been launched and is waiting (spinning in a handshake), ensure that the device is recovered or marked unusable (e.g., by calling recover_device_or_mark_unusable) before returning to prevent the orphaned kernel from spinning until timeout and poisoning the device context.

    uint64_t reg_addr = regs[physical_core_id];
    reg_addr_of[i] = reg_addr;
    core_exec_states_[i].reg_addr = reg_addr;
    platform_init_aicore_regs(reg_addr);
    OUT_OF_ORDER_STORE_BARRIER();
    hank->aicpu_regs_ready = 1;

References

If a dependent kernel launch fails after another kernel has already been launched and is waiting, ensure that the device is recovered or marked unusable (e.g., by calling recover_device_or_mark_unusable) before returning to prevent the orphaned kernel from spinning and poisoning the device context.
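
The failure mode described in this comment can be reproduced in a small stand-alone model. It is a sketch under stated assumptions, not the runtime code: `FakePlatform`, `CoreState` and `leaked_after_failed_sweep` are hypothetical, the register table and reported ids are made up, and the shutdown loop only mirrors the `reg_addr != 0` check the review describes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical per-core state; only the field relevant to cleanup is modelled.
struct CoreState {
    std::uint64_t reg_addr = 0;
};

// Hypothetical platform that remembers which register windows are initialized.
struct FakePlatform {
    std::set<std::uint64_t> live;
    void init_regs(std::uint64_t addr) { live.insert(addr); }
    void deinit_regs(std::uint64_t addr) { live.erase(addr); }
};

// Runs a Sweep-A-like loop in which the last core reports an invalid id, then
// an emergency-shutdown-like cleanup. Returns how many register windows are
// still initialized afterwards.
std::size_t leaked_after_failed_sweep(bool record_in_sweep_a) {
    const std::vector<std::uint64_t> regs = {0x1000, 0x2000, 0x3000};
    const std::vector<std::uint32_t> reported_ids = {0, 1, 99};  // 99 is out of range
    std::vector<CoreState> states(reported_ids.size());
    FakePlatform platform;
    bool handshake_failed = false;

    for (std::size_t i = 0; i < reported_ids.size(); ++i) {
        const std::uint32_t id = reported_ids[i];
        if (id >= regs.size()) {       // validation failure for this core
            handshake_failed = true;
            continue;
        }
        const std::uint64_t reg_addr = regs[id];
        platform.init_regs(reg_addr);
        if (record_in_sweep_a) {
            states[i].reg_addr = reg_addr;  // the fix: persist immediately
        }
    }

    if (handshake_failed) {
        // Cleanup only sees cores whose reg_addr was recorded.
        for (const CoreState& s : states) {
            if (s.reg_addr != 0) platform.deinit_regs(s.reg_addr);
        }
    }
    return platform.live.size();
}
```

With recording deferred to a later sweep, both initialized windows stay live after the failure path; with the address persisted in the first sweep, none do.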

gemini-code-assist · 2026-06-30T11:18:09Z

+            CoreType type = hank->core_type;
+            uint64_t reg_addr = reg_addr_of[i];
+            core_exec_states_[i].reg_addr = reg_addr;
+            core_exec_states_[i].cond_ptr = get_reg_ptr(reg_addr, RegId::COND);


Redundant Assignment Cleanup

Since core_exec_states_[i].reg_addr is now populated during Sweep A to ensure correct cleanup in emergency_shutdown(), the redundant assignment in Sweep B can be removed.

    CoreType type = hank->core_type;
    uint64_t reg_addr = reg_addr_of[i];
    core_exec_states_[i].cond_ptr = get_reg_ptr(reg_addr, RegId::COND);

coderabbitai · 2026-06-30T11:18:41Z

📝 Walkthrough

SchedulerContext::handshake_all_cores in both the a2a3 and a5 variants is refactored from a serial per-core blocking loop into a batched two-phase pattern: Step 1 publishes all payload pointers with a single barrier then asserts aicpu_ready for all cores with a second barrier; Step 2 splits into two out-of-order polling sweeps (Sweep A: regs_ready/validate/init/ack; Sweep B: done/latch state).

Changes

Out-of-order sweep handshake refactor

Layer / File(s)	Summary
Batched payload publication (Step 1) `src/a2a3/.../scheduler_cold_path.cpp`, `src/a5/.../scheduler_cold_path.cpp`	Both files change Step 1 to write all per-core task pointers, issue one store barrier, set `aicpu_ready=1` for all cores, then issue a second barrier—removing the previous per-core barrier between task store and `aicpu_ready`.
Sweep A: regs_ready polling, validation, and ack `src/a2a3/.../scheduler_cold_path.cpp`, `src/a5/.../scheduler_cold_path.cpp`	Serial `regs_ready→init→ack` block replaced with a sweep that polls `aicore_regs_ready` across all cores in repeated passes, validates `physical_core_id`, calls `platform_init_aicore_regs`, sets `aicpu_regs_ready=1` with a barrier per core, tracks completion via `regs_phase_done[]`, and defers `emergency_shutdown`/`return -1` until after the full sweep if `handshake_failed` is set.
Sweep B: done polling and state latch `src/a2a3/.../scheduler_cold_path.cpp`, `src/a5/.../scheduler_cold_path.cpp`	Second sweep polls `aicore_done` across all cores and latches `reg_addr`, `cond_ptr`, `worker_id`, `core_type` into `core_exec_states_[i]`, populates `aic_worker_ids_`/`aiv_worker_ids_` by `core_type`, tracked with `done_phase_done[]`.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main performance change: overlapping AICore wakeups and batching the release barrier.
Description check	✅ Passed	The description is directly related to the changes and explains the handshake sweep and barrier batching in detail.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`:
- Around line 762-781: The failure path in the scheduler cold-path
initialization leaves some initialized register addresses untracked because
`reg_addr` is only stored in Sweep B, so `emergency_shutdown()` can miss regs
set up before an invalid `physical_core_id` triggers `handshake_failed`. Update
the `scheduler_cold_path.cpp` flow around the `reg_addr` initialization and
`emergency_shutdown()` handling so every successfully initialized `reg_addr` is
persisted immediately in `core_exec_states_[i].reg_addr` before any possible
early return, ensuring shutdown can deinitialize all regs consistently.

In
`@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`:
- Around line 766-785: The reg address for each initialized core is being
recorded too late, so the failure path in `scheduler_cold_path.cpp` can miss or
reuse stale state after `platform_init_aicore_regs(reg_addr)` has already run.
Move the assignment of the initialized reg address into the same Sweep A path
where `platform_init_aicore_regs` and `hank->aicpu_regs_ready` are set, using
`core_exec_states_[i].reg_addr` so `emergency_shutdown` and the later deinit
logic can see the correct value. Keep the update paired with the existing core
state writes in the loop that handles `physical_core_id`, `reg_addr_of`, and
`regs_phase_done`.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 45bbc16f-c2ed-470f-8f92-4b7d16d5d40b

📥 Commits

Reviewing files that changed from the base of the PR and between 11d03d9 and 445329d.

📒 Files selected for processing (2)

src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp

coderabbitai · 2026-06-30T11:24:17Z

+            uint64_t reg_addr = regs[physical_core_id];
+            reg_addr_of[i] = reg_addr;
+            platform_init_aicore_regs(reg_addr);
+            OUT_OF_ORDER_STORE_BARRIER();
+            hank->aicpu_regs_ready = 1;
 #if PTO2_PROFILING
-        // Record physical_core_id for PMU init later (CoreExecState has no room
-        // for this field under PTO2_PROFILING).
-        physical_core_ids_[i] = physical_core_id;
+            physical_core_ids_[i] = physical_core_id;
 #endif
 #if !PTO2_PROFILING
-        core_exec_states_[i].worker_id = i;
-        core_exec_states_[i].physical_core_id = physical_core_id;
-        core_exec_states_[i].core_type = type;
+            core_exec_states_[i].physical_core_id = physical_core_id;
 #endif
-
-        if (type == CoreType::AIC) {
-            aic_worker_ids_[aic_count_++] = i;
-            LOG_INFO_V0("Core %d: AIC, physical_id=%u, reg_addr=0x%lx", i, physical_core_id, reg_addr);
-        } else {
-            aiv_worker_ids_[aiv_count_++] = i;
-            LOG_INFO_V0("Core %d: AIV, physical_id=%u, reg_addr=0x%lx", i, physical_core_id, reg_addr);
+            regs_phase_done[i] = true;
+            remaining--;
        }
    }
+    OUT_OF_ORDER_STORE_BARRIER();

    if (handshake_failed) {
        emergency_shutdown(runtime);
        return -1;


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

    #!/bin/bash
    # Verify a2a3 emergency_shutdown can clean regs initialized during Sweep A.
    rg -n -C4 'SchedulerContext::emergency_shutdown|platform_deinit_aicore_regs|core_exec_states_\[[^]]+\]\.reg_addr|reg_addr_of' \
      src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp

Repository: hw-native-sys/simpler

Length of output: 4635

🏁 Script executed:

    #!/bin/bash
    sed -n '720,820p' src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
    printf '\n----\n'
    sed -n '954,980p' src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp

Repository: hw-native-sys/simpler

Length of output: 5525

🏁 Script executed:

rg -n 'core_exec_states_\[[^]]+\]\.reg_addr\s*=|reg_addr\s*=' src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp

Repository: hw-native-sys/simpler

Length of output: 554

Persist reg_addr before the failure return (src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp:762-798, 958-967)

emergency_shutdown() only deinitializes via core_exec_states_[i].reg_addr, which is populated in Sweep B; if Sweep A bails on an invalid physical_core_id, any regs already initialized there are skipped.

coderabbitai · 2026-06-30T11:24:18Z

+            uint64_t reg_addr = regs[physical_core_id];
+            reg_addr_of[i] = reg_addr;
+            platform_init_aicore_regs(reg_addr);
+            OUT_OF_ORDER_STORE_BARRIER();
+            hank->aicpu_regs_ready = 1;
 #if PTO2_PROFILING
-        physical_core_ids_[i] = physical_core_id;
+            physical_core_ids_[i] = physical_core_id;
 #endif
-
 #if !PTO2_PROFILING
-        core_exec_states_[i].worker_id = i;
-        core_exec_states_[i].physical_core_id = physical_core_id;
-        core_exec_states_[i].core_type = type;
+            core_exec_states_[i].physical_core_id = physical_core_id;
 #endif
-
-        if (type == CoreType::AIC) {
-            aic_worker_ids_[aic_count_++] = i;
-            LOG_INFO_V0("Core %d: AIC, physical_id=%u, reg_addr=0x%lx", i, physical_core_id, reg_addr);
-        } else {
-            aiv_worker_ids_[aiv_count_++] = i;
-            LOG_INFO_V0("Core %d: AIV, physical_id=%u, reg_addr=0x%lx", i, physical_core_id, reg_addr);
+            regs_phase_done[i] = true;
+            remaining--;
        }
    }
+    OUT_OF_ORDER_STORE_BARRIER();

    if (handshake_failed) {
        emergency_shutdown(runtime);
        return -1;


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Record initialized reg addresses before the failure path.

platform_init_aicore_regs(reg_addr) can run in Sweep A, but core_exec_states_[i].reg_addr is only assigned in Sweep B. If any later core reports an invalid physical ID, Line 784 calls emergency_shutdown, whose Line 970 deinit check skips these newly initialized regs or may use stale reg addrs from a previous run.

Proposed fix

 bool regs_phase_done[RUNTIME_MAX_WORKER] = {false};
 uint64_t reg_addr_of[RUNTIME_MAX_WORKER] = {0};
+for (int32_t i = 0; i < cores_total_num_; i++) {
+    core_exec_states_[i].reg_addr = 0;
+}
@@
 uint64_t reg_addr = regs[physical_core_id];
 reg_addr_of[i] = reg_addr;
 platform_init_aicore_regs(reg_addr);
+core_exec_states_[i].reg_addr = reg_addr;
 OUT_OF_ORDER_STORE_BARRIER();
 hank->aicpu_regs_ready = 1;

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

    bool regs_phase_done[RUNTIME_MAX_WORKER] = {false};
    uint64_t reg_addr_of[RUNTIME_MAX_WORKER] = {0};
    for (int32_t i = 0; i < cores_total_num_; i++) {
        core_exec_states_[i].reg_addr = 0;
    }

            uint64_t reg_addr = regs[physical_core_id];
            reg_addr_of[i] = reg_addr;
            platform_init_aicore_regs(reg_addr);
            core_exec_states_[i].reg_addr = reg_addr;
            OUT_OF_ORDER_STORE_BARRIER();
            hank->aicpu_regs_ready = 1;
#if PTO2_PROFILING
            physical_core_ids_[i] = physical_core_id;
#endif
#if !PTO2_PROFILING
            core_exec_states_[i].physical_core_id = physical_core_id;
#endif
            regs_phase_done[i] = true;
            remaining--;
        }
    }
    OUT_OF_ORDER_STORE_BARRIER();

    if (handshake_failed) {
        emergency_shutdown(runtime);
        return -1;

ChaoWao wants to merge 1 commit into hw-native-sys:main from ChaoWao:handshake-overlap
