Skip to content

perf(runtime): overlap AICore handshake wakeups; batch the release barrier#1214

Open
ChaoWao wants to merge 1 commit into
hw-native-sys:mainfrom
ChaoWao:handshake-overlap
Open

perf(runtime): overlap AICore handshake wakeups; batch the release barrier#1214
ChaoWao wants to merge 1 commit into
hw-native-sys:mainfrom
ChaoWao:handshake-overlap

Conversation

@ChaoWao

@ChaoWao ChaoWao commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Summary

handshake_all_cores (run once per run_prepared, on the orchestrator thread,
inside the preamble device phase) ran two costs serially that can be
parallelized. Both AICore cores wake and advance their handshake phases
independently, so blocking on one core before looking at the next made their
latencies sum instead of overlap.

Change 1 — sweep Step 2 instead of per-core blocking

The old loop did, for each core: while(wait aicore_regs_ready) → init regs →
while(wait aicore_done), fully draining core i before touching core i+1.
That serializes 72 cores' wakeup waits. Replaced with two phase-batched
sweeps
: poll every still-outstanding core per pass and service whichever are
ready. The per-core wakeup waits now overlap (≈ slowest single core, not the
sum). The handshake flags are GM reads, not the nGnRE MMIO register
window, so sweeping is not subject to the serial-LDR constraint that RegId::COND
polling is.

Change 2 — batch the Step 1 release barrier

Step 1 raised aicpu_ready inside the per-core loop with an
OUT_OF_ORDER_STORE_BARRIER() each iteration (71 redundant barriers). The
task pointers are now published with one barrier, then aicpu_ready is raised
for all cores. One barrier suffices — AICore only relies on "all task stores
globally visible before any aicpu_ready store".

Applied symmetrically to a2a3 and a5.

Measured — preamble device phase only

qwen3-14B, 3.5k-context decode, a2a3 onboard, PTO2_RING_TASK_WINDOW=524288,
averaged over the 19 decode steps:

preamble value
before (per-core blocking + per-iter barrier) 329 µs
after sweep Step 2 213 µs
after sweep + batched Step 1 barrier 150 µs

Net −179 µs/step (−54%). Output token IDs identical to baseline.

Why it stops at ~150 µs (the residual is a physical / protocol floor)

The remaining preamble is dominated by costs no AICPU-side code change can
remove:

  1. AICore launch + NoC cold-wakeuprtKernelLaunchWithHandleV2 lazily
    loads the AICore kernel binary onto the device on first launch, and each core
    must be woken across the NoC before it reaches the handshake point. The sweep
    already overlaps these waits to the slowest single core; that core's wakeup
    is a hardware floor.
  2. One structurally-required GM round-trip to bind logical block_idx
    runtime-assigned physical_core_id. Three follow-on ideas to move this off
    GM were each ruled out by verified hardware constraints:
    • Core initializes its own dispatch regs (to drop the round-trip): AICore
      cannot write DATA_MAIN_BASE / FAST_PATH_ENABLE — the SPR write is
      rejected by the CCEC backend and an MMIO STR to the SPR window hangs the
      chip (.claude/rules/ascend.md, docs/hardware/mmio-performance.md). Reg
      init must stay on the AICPU.
    • Report physical_core_id / signals over COND instead of GM: to poll a
      core's COND the AICPU needs that core's reg_addr = regs[physical_core_id]
      — which is exactly the binding the handshake is establishing. The register
      channel cannot bootstrap the binding it depends on.
    • Acknowledge reg-init over DMB instead of a GM flag: the core must
      distinguish "AICPU has not initialized my regs yet" from "done", which
      needs a host-preclearable "not ready" sentinel. GM flags have one (host
      zeroes workers[] each run); SPRs (DMB/COND) do not — the host cannot write
      them, so a stale value would be misread. The init-ack must stay a GM flag.

So preamble's floor is: AICore binary load + NoC wakeup + one GM round-trip for
the logical↔physical binding. The accumulation (sweep) and the redundant
barriers (batch) — the parts that were software overhead — are removed here.

Testing

  • a2a3 onboard, output tokens identical to baseline across the run.
  • a2a3 onboard correctness across the handshake → dispatch → completion
    paths: vector_example, async_notify_demo, sdma_async_completion_demo,
    paged_attention, multi_round_paged_attention, batch_paged_attention — all pass.
  • a5 onboard (no a5 silicon on this box — relies on CI; change is structurally
    identical to a2a3).

…rrier

handshake_all_cores ran two costs serially that can be parallelized:

1. Step 2 blocked on core i (wait aicore_regs_ready, init regs, wait
   aicore_done) before looking at core i+1, so the 72 AICore cores' wakeup
   latencies summed. The cores wake and advance independently, so this is now
   two phase-batched sweeps (poll every outstanding core per pass, service the
   ready ones): the per-core wakeup waits overlap instead of accumulating. The
   handshake flags are GM reads, not the nGnRE MMIO reg window, so sweeping is
   not subject to the serial-LDR constraint that COND polling is.

2. Step 1 raised aicpu_ready inside the per-core loop with a barrier each
   iteration (71 redundant barriers). The task pointers are now published with
   one barrier, then aicpu_ready is raised for all cores — one barrier suffices
   since AICore only relies on "all task stores visible before any aicpu_ready".

Measured (qwen3-14B 3.5k decode, a2a3 onboard, PTO2_RING_TASK_WINDOW=524288):
preamble 329us -> 150us/step. Output tokens identical.

The residual ~150us is the physical floor: AICore launch (rtKernelLaunch lazy
binary load) + NoC cold-wakeup, plus one structurally-required GM round-trip to
bind logical block_idx <-> runtime-assigned physical_core_id (the register
channel cannot bootstrap that binding — it needs the physical core id the
binding is establishing, and has no host-preclearable "not ready" sentinel).
Applied symmetrically to a2a3 and a5.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the core handshake process in scheduler_cold_path.cpp (for both a2a3 and a5 runtimes) by splitting it into two non-blocking sweeps (Sweep A and Sweep B) to overlap core wakeups and reduce latency. The review feedback highlights a correctness bug where core_exec_states_[i].reg_addr is left uninitialized if validation fails during Sweep A, which prevents emergency_shutdown() from correctly cleaning up initialized registers. To resolve this, reg_addr should be populated immediately in Sweep A, and the redundant assignment in Sweep B can be removed.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +762 to +766
uint64_t reg_addr = regs[physical_core_id];
reg_addr_of[i] = reg_addr;
platform_init_aicore_regs(reg_addr);
OUT_OF_ORDER_STORE_BARRIER();
hank->aicpu_regs_ready = 1;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Correctness Bug: Uninitialized core_exec_states_[i].reg_addr on Early Exit

By splitting the handshake into Sweep A and Sweep B, core_exec_states_[i].reg_addr is only populated during Sweep B. However, if any core fails validation (e.g., invalid physical_core_id) during Sweep A, handshake_failed is set to true, and the function exits early before Sweep B is ever executed.

This triggers emergency_shutdown(), which relies on core_exec_states_[i].reg_addr != 0 to deinitialize the registers of already-initialized cores. Since core_exec_states_[i].reg_addr remains 0 for all cores, emergency_shutdown() fails to deinitialize any registers, leaving them in an active/initialized state.

Populating core_exec_states_[i].reg_addr immediately after initialization in Sweep A ensures that emergency_shutdown() can correctly clean up resources on failure.

Additionally, if a dependent kernel launch fails after another kernel (such as the AICore worker) has already been launched and is waiting (spinning in a handshake), ensure that the device is recovered or marked unusable (e.g., by calling recover_device_or_mark_unusable) before returning to prevent the orphaned kernel from spinning until timeout and poisoning the device context.

            uint64_t reg_addr = regs[physical_core_id];
            reg_addr_of[i] = reg_addr;
            core_exec_states_[i].reg_addr = reg_addr;
            platform_init_aicore_regs(reg_addr);
            OUT_OF_ORDER_STORE_BARRIER();
            hank->aicpu_regs_ready = 1;
References
  1. If a dependent kernel launch fails after another kernel has already been launched and is waiting, ensure that the device is recovered or marked unusable (e.g., by calling recover_device_or_mark_unusable) before returning to prevent the orphaned kernel from spinning and poisoning the device context.

Comment on lines +766 to +770
uint64_t reg_addr = regs[physical_core_id];
reg_addr_of[i] = reg_addr;
platform_init_aicore_regs(reg_addr);
OUT_OF_ORDER_STORE_BARRIER();
hank->aicpu_regs_ready = 1;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Correctness Bug: Uninitialized core_exec_states_[i].reg_addr on Early Exit

By splitting the handshake into Sweep A and Sweep B, core_exec_states_[i].reg_addr is only populated during Sweep B. However, if any core fails validation (e.g., invalid physical_core_id) during Sweep A, handshake_failed is set to true, and the function exits early before Sweep B is ever executed.

This triggers emergency_shutdown(), which relies on core_exec_states_[i].reg_addr != 0 to deinitialize the registers of already-initialized cores. Since core_exec_states_[i].reg_addr remains 0 for all cores, emergency_shutdown() fails to deinitialize any registers, leaving them in an active/initialized state.

Populating core_exec_states_[i].reg_addr immediately after initialization in Sweep A ensures that emergency_shutdown() can correctly clean up resources on failure.

Additionally, if a dependent kernel launch fails after another kernel (such as the AICore worker) has already been launched and is waiting (spinning in a handshake), ensure that the device is recovered or marked unusable (e.g., by calling recover_device_or_mark_unusable) before returning to prevent the orphaned kernel from spinning until timeout and poisoning the device context.

            uint64_t reg_addr = regs[physical_core_id];
            reg_addr_of[i] = reg_addr;
            core_exec_states_[i].reg_addr = reg_addr;
            platform_init_aicore_regs(reg_addr);
            OUT_OF_ORDER_STORE_BARRIER();
            hank->aicpu_regs_ready = 1;
References
  1. If a dependent kernel launch fails after another kernel has already been launched and is waiting, ensure that the device is recovered or marked unusable (e.g., by calling recover_device_or_mark_unusable) before returning to prevent the orphaned kernel from spinning and poisoning the device context.

Comment on lines +796 to +799
CoreType type = hank->core_type;
uint64_t reg_addr = reg_addr_of[i];
core_exec_states_[i].reg_addr = reg_addr;
core_exec_states_[i].cond_ptr = get_reg_ptr(reg_addr, RegId::COND);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Redundant Assignment Cleanup

Since core_exec_states_[i].reg_addr is now populated during Sweep A to ensure correct cleanup in emergency_shutdown(), the redundant assignment in Sweep B can be removed.

            CoreType type = hank->core_type;
            uint64_t reg_addr = reg_addr_of[i];
            core_exec_states_[i].cond_ptr = get_reg_ptr(reg_addr, RegId::COND);

Comment on lines +800 to +803
CoreType type = hank->core_type;
uint64_t reg_addr = reg_addr_of[i];
core_exec_states_[i].reg_addr = reg_addr;
core_exec_states_[i].cond_ptr = get_reg_ptr(reg_addr, RegId::COND);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Redundant Assignment Cleanup

Since core_exec_states_[i].reg_addr is now populated during Sweep A to ensure correct cleanup in emergency_shutdown(), the redundant assignment in Sweep B can be removed.

            CoreType type = hank->core_type;
            uint64_t reg_addr = reg_addr_of[i];
            core_exec_states_[i].cond_ptr = get_reg_ptr(reg_addr, RegId::COND);

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

SchedulerContext::handshake_all_cores in both the a2a3 and a5 variants is refactored from a serial per-core blocking loop into a batched two-phase pattern: Step 1 publishes all payload pointers with a single barrier then asserts aicpu_ready for all cores with a second barrier; Step 2 splits into two out-of-order polling sweeps (Sweep A: regs_ready/validate/init/ack; Sweep B: done/latch state).

Changes

Out-of-order sweep handshake refactor

Layer / File(s) Summary
Batched payload publication (Step 1)
src/a2a3/.../scheduler_cold_path.cpp, src/a5/.../scheduler_cold_path.cpp
Both files change Step 1 to write all per-core task pointers, issue one store barrier, set aicpu_ready=1 for all cores, then issue a second barrier—removing the previous per-core barrier between task store and aicpu_ready.
Sweep A: regs_ready polling, validation, and ack
src/a2a3/.../scheduler_cold_path.cpp, src/a5/.../scheduler_cold_path.cpp
Serial regs_ready→init→ack block replaced with a sweep that polls aicore_regs_ready across all cores in repeated passes, validates physical_core_id, calls platform_init_aicore_regs, sets aicpu_regs_ready=1 with a barrier per core, tracks completion via regs_phase_done[], and defers emergency_shutdown/return -1 until after the full sweep if handshake_failed is set.
Sweep B: done polling and state latch
src/a2a3/.../scheduler_cold_path.cpp, src/a5/.../scheduler_cold_path.cpp
Second sweep polls aicore_done across all cores and latches reg_addr, cond_ptr, worker_id, core_type into core_exec_states_[i], populates aic_worker_ids_/aiv_worker_ids_ by core_type, tracked with done_phase_done[].

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 Two sweeps now dance where one loop stood,
barriers batch as they always should.
Regs_ready, done—no more waiting in line,
out-of-order polls let all cores shine.
The rabbit hops fast, no serial delay! 🚀

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main performance change: overlapping AICore wakeups and batching the release barrier.
Description check ✅ Passed The description is directly related to the changes and explains the handshake sweep and barrier batching in detail.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`:
- Around line 762-781: The failure path in the scheduler cold-path
initialization leaves some initialized register addresses untracked because
`reg_addr` is only stored in Sweep B, so `emergency_shutdown()` can miss regs
set up before an invalid `physical_core_id` triggers `handshake_failed`. Update
the `scheduler_cold_path.cpp` flow around the `reg_addr` initialization and
`emergency_shutdown()` handling so every successfully initialized `reg_addr` is
persisted immediately in `core_exec_states_[i].reg_addr` before any possible
early return, ensuring shutdown can deinitialize all regs consistently.

In
`@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`:
- Around line 766-785: The reg address for each initialized core is being
recorded too late, so the failure path in `scheduler_cold_path.cpp` can miss or
reuse stale state after `platform_init_aicore_regs(reg_addr)` has already run.
Move the assignment of the initialized reg address into the same Sweep A path
where `platform_init_aicore_regs` and `hank->aicpu_regs_ready` are set, using
`core_exec_states_[i].reg_addr` so `emergency_shutdown` and the later deinit
logic can see the correct value. Keep the update paired with the existing core
state writes in the loop that handles `physical_core_id`, `reg_addr_of`, and
`regs_phase_done`.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 45bbc16f-c2ed-470f-8f92-4b7d16d5d40b

📥 Commits

Reviewing files that changed from the base of the PR and between 11d03d9 and 445329d.

📒 Files selected for processing (2)
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp

Comment on lines +762 to 781
uint64_t reg_addr = regs[physical_core_id];
reg_addr_of[i] = reg_addr;
platform_init_aicore_regs(reg_addr);
OUT_OF_ORDER_STORE_BARRIER();
hank->aicpu_regs_ready = 1;
#if PTO2_PROFILING
// Record physical_core_id for PMU init later (CoreExecState has no room
// for this field under PTO2_PROFILING).
physical_core_ids_[i] = physical_core_id;
physical_core_ids_[i] = physical_core_id;
#endif
#if !PTO2_PROFILING
core_exec_states_[i].worker_id = i;
core_exec_states_[i].physical_core_id = physical_core_id;
core_exec_states_[i].core_type = type;
core_exec_states_[i].physical_core_id = physical_core_id;
#endif

if (type == CoreType::AIC) {
aic_worker_ids_[aic_count_++] = i;
LOG_INFO_V0("Core %d: AIC, physical_id=%u, reg_addr=0x%lx", i, physical_core_id, reg_addr);
} else {
aiv_worker_ids_[aiv_count_++] = i;
LOG_INFO_V0("Core %d: AIV, physical_id=%u, reg_addr=0x%lx", i, physical_core_id, reg_addr);
regs_phase_done[i] = true;
remaining--;
}
}
OUT_OF_ORDER_STORE_BARRIER();

if (handshake_failed) {
emergency_shutdown(runtime);
return -1;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify a2a3 emergency_shutdown can clean regs initialized during Sweep A.
rg -n -C4 'SchedulerContext::emergency_shutdown|platform_deinit_aicore_regs|core_exec_states_\[[^]]+\]\.reg_addr|reg_addr_of' \
  src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp

Repository: hw-native-sys/simpler

Length of output: 4635


🏁 Script executed:

#!/bin/bash
sed -n '720,820p' src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
printf '\n----\n'
sed -n '954,980p' src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp

Repository: hw-native-sys/simpler

Length of output: 5525


🏁 Script executed:

rg -n 'core_exec_states_\[[^]]+\]\.reg_addr\s*=|reg_addr\s*=' src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp

Repository: hw-native-sys/simpler

Length of output: 554


Persist reg_addr before the failure return (src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp:762-798, 958-967) emergency_shutdown() only deinitializes via core_exec_states_[i].reg_addr, which is populated in Sweep B; if Sweep A bails on an invalid physical_core_id, any regs already initialized there are skipped.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`
around lines 762 - 781, The failure path in the scheduler cold-path
initialization leaves some initialized register addresses untracked because
`reg_addr` is only stored in Sweep B, so `emergency_shutdown()` can miss regs
set up before an invalid `physical_core_id` triggers `handshake_failed`. Update
the `scheduler_cold_path.cpp` flow around the `reg_addr` initialization and
`emergency_shutdown()` handling so every successfully initialized `reg_addr` is
persisted immediately in `core_exec_states_[i].reg_addr` before any possible
early return, ensuring shutdown can deinitialize all regs consistently.

Comment on lines +766 to 785
uint64_t reg_addr = regs[physical_core_id];
reg_addr_of[i] = reg_addr;
platform_init_aicore_regs(reg_addr);
OUT_OF_ORDER_STORE_BARRIER();
hank->aicpu_regs_ready = 1;
#if PTO2_PROFILING
physical_core_ids_[i] = physical_core_id;
physical_core_ids_[i] = physical_core_id;
#endif

#if !PTO2_PROFILING
core_exec_states_[i].worker_id = i;
core_exec_states_[i].physical_core_id = physical_core_id;
core_exec_states_[i].core_type = type;
core_exec_states_[i].physical_core_id = physical_core_id;
#endif

if (type == CoreType::AIC) {
aic_worker_ids_[aic_count_++] = i;
LOG_INFO_V0("Core %d: AIC, physical_id=%u, reg_addr=0x%lx", i, physical_core_id, reg_addr);
} else {
aiv_worker_ids_[aiv_count_++] = i;
LOG_INFO_V0("Core %d: AIV, physical_id=%u, reg_addr=0x%lx", i, physical_core_id, reg_addr);
regs_phase_done[i] = true;
remaining--;
}
}
OUT_OF_ORDER_STORE_BARRIER();

if (handshake_failed) {
emergency_shutdown(runtime);
return -1;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Record initialized reg addresses before the failure path.

platform_init_aicore_regs(reg_addr) can run in Sweep A, but core_exec_states_[i].reg_addr is only assigned in Sweep B. If any later core reports an invalid physical ID, Line 784 calls emergency_shutdown, whose Line 970 deinit check skips these newly initialized regs or may use stale reg addrs from a previous run.

Proposed fix
     bool regs_phase_done[RUNTIME_MAX_WORKER] = {false};
     uint64_t reg_addr_of[RUNTIME_MAX_WORKER] = {0};
+    for (int32_t i = 0; i < cores_total_num_; i++) {
+        core_exec_states_[i].reg_addr = 0;
+    }
 
@@
             uint64_t reg_addr = regs[physical_core_id];
             reg_addr_of[i] = reg_addr;
             platform_init_aicore_regs(reg_addr);
+            core_exec_states_[i].reg_addr = reg_addr;
             OUT_OF_ORDER_STORE_BARRIER();
             hank->aicpu_regs_ready = 1;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
uint64_t reg_addr = regs[physical_core_id];
reg_addr_of[i] = reg_addr;
platform_init_aicore_regs(reg_addr);
OUT_OF_ORDER_STORE_BARRIER();
hank->aicpu_regs_ready = 1;
#if PTO2_PROFILING
physical_core_ids_[i] = physical_core_id;
physical_core_ids_[i] = physical_core_id;
#endif
#if !PTO2_PROFILING
core_exec_states_[i].worker_id = i;
core_exec_states_[i].physical_core_id = physical_core_id;
core_exec_states_[i].core_type = type;
core_exec_states_[i].physical_core_id = physical_core_id;
#endif
if (type == CoreType::AIC) {
aic_worker_ids_[aic_count_++] = i;
LOG_INFO_V0("Core %d: AIC, physical_id=%u, reg_addr=0x%lx", i, physical_core_id, reg_addr);
} else {
aiv_worker_ids_[aiv_count_++] = i;
LOG_INFO_V0("Core %d: AIV, physical_id=%u, reg_addr=0x%lx", i, physical_core_id, reg_addr);
regs_phase_done[i] = true;
remaining--;
}
}
OUT_OF_ORDER_STORE_BARRIER();
if (handshake_failed) {
emergency_shutdown(runtime);
return -1;
bool regs_phase_done[RUNTIME_MAX_WORKER] = {false};
uint64_t reg_addr_of[RUNTIME_MAX_WORKER] = {0};
for (int32_t i = 0; i < cores_total_num_; i++) {
core_exec_states_[i].reg_addr = 0;
}
uint64_t reg_addr = regs[physical_core_id];
reg_addr_of[i] = reg_addr;
platform_init_aicore_regs(reg_addr);
core_exec_states_[i].reg_addr = reg_addr;
OUT_OF_ORDER_STORE_BARRIER();
hank->aicpu_regs_ready = 1;
`#if` PTO2_PROFILING
physical_core_ids_[i] = physical_core_id;
`#endif`
`#if` !PTO2_PROFILING
core_exec_states_[i].physical_core_id = physical_core_id;
`#endif`
regs_phase_done[i] = true;
remaining--;
}
}
OUT_OF_ORDER_STORE_BARRIER();
if (handshake_failed) {
emergency_shutdown(runtime);
return -1;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`
around lines 766 - 785, The reg address for each initialized core is being
recorded too late, so the failure path in `scheduler_cold_path.cpp` can miss or
reuse stale state after `platform_init_aicore_regs(reg_addr)` has already run.
Move the assignment of the initialized reg address into the same Sweep A path
where `platform_init_aicore_regs` and `hank->aicpu_regs_ready` are set, using
`core_exec_states_[i].reg_addr` so `emergency_shutdown` and the later deinit
logic can see the correct value. Keep the update paired with the existing core
state writes in the loop that handles `physical_core_id`, `reg_addr_of`, and
`regs_phase_done`.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant