Integrate TraCR as the runtime/kernel profiler for Simpler#1173
Integrate TraCR as the runtime/kernel profiler for Simpler#1173noabauma wants to merge 29 commits into
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request integrates the TraCR profiling and tracing tool into the a2a3 platform, enabling device-side trace collection and host-side downloading. The changes include adding the TraCR submodule, CMake build configurations, API helpers, and instrumenting the orchestration and execution paths with tracing markers. The review feedback highlights several critical areas for improvement: correcting format specifiers for uint64_t variables to avoid undefined behavior, replacing manual memory management with std::vector to prevent leaks, adding defensive null-pointer and bounds checks, quoting CMake string variables, and adding spin hints to tight loops to reduce CPU consumption.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| PROF_INC(prof_make_count, 4); | ||
| CYCLE_COUNT_LAP(prof_make_tensor); | ||
|
|
||
| LOG_INFO_V0("Thread %d: Orch PTO2_SCOPE loop: #batch=%u, q_loop=%u", g_TraCR_thread_idx, batch, q_loop); |
There was a problem hiding this comment.
batch and q_loop are of type uint64_t. Using %u as the format specifier can lead to undefined behavior, truncated output, or compilation errors under -Werror. Use " PRIu64 instead.
LOG_INFO_V0("Thread %d: Orch PTO2_SCOPE loop: #batch=%" PRIu64 ", q_loop=%" PRIu64, g_TraCR_thread_idx, batch, q_loop);| size_t size = sizeof(TraCR::Payload) * TraCR::CAPACITY * runtime.aicpu_thread_num; | ||
| TraCR::Payload* tracrData = reinterpret_cast<TraCR::Payload *>(std::malloc(size)); | ||
| int rc = device_runner->copy_from_device(reinterpret_cast<void*>(tracrData), | ||
| reinterpret_cast<void*>(runtime.tracrData_), size); | ||
| if (rc != 0) { | ||
| LOG_ERROR("device_runner->copy_from_device 'tracrData' failed rc=%d", rc); | ||
| return rc; | ||
| } | ||
|
|
||
| // Download the tracrDataSizes_ from Device to Host | ||
| size = sizeof(size_t) * runtime.aicpu_thread_num; | ||
| size_t* tracrDataSizes = reinterpret_cast<size_t *>(std::malloc(size)); | ||
| rc = device_runner->copy_from_device(reinterpret_cast<void*>(tracrDataSizes), | ||
| reinterpret_cast<void*>(runtime.tracrDataSizes_), size); | ||
| if (rc != 0) { | ||
| LOG_ERROR("device_runner->copy_from_device 'tracrDataSizes' failed rc=%d", rc); | ||
| return rc; | ||
| } | ||
|
|
||
| // Now, store the traces into '~/ascend/tracr/' | ||
| tracr_dir = "~/ascend/tracr_" + std::to_string(sampleID++) + "/proc." + std::to_string(1000 + device_runner->device_id()); | ||
| rc = TracrData2BTS(tracrData, tracrDataSizes, runtime.aicpu_thread_num); | ||
| if (rc != 0) { | ||
| LOG_ERROR("TracrData2BTS() failed"); | ||
| return rc; | ||
| } | ||
|
|
||
| // Free tmp Host TraCR data placeholders | ||
| std::free(reinterpret_cast<void *>(tracrData)); | ||
| std::free(reinterpret_cast<void *>(tracrDataSizes)); |
There was a problem hiding this comment.
Using manual std::malloc and std::free is prone to memory leaks if any of the intermediate operations (like copy_from_device or TracrData2BTS) fail and return early. Using std::vector automatically manages the memory lifetime and prevents leaks on error paths.
size_t size = sizeof(TraCR::Payload) * TraCR::CAPACITY * runtime.aicpu_thread_num;
std::vector<TraCR::Payload> tracrData(TraCR::CAPACITY * runtime.aicpu_thread_num);
int rc = device_runner->copy_from_device(reinterpret_cast<void*>(tracrData.data()),
reinterpret_cast<void*>(runtime.tracrData_), size);
if (rc != 0) {
LOG_ERROR("device_runner->copy_from_device 'tracrData' failed rc=%d", rc);
return rc;
}
// Download the tracrDataSizes_ from Device to Host
size = sizeof(size_t) * runtime.aicpu_thread_num;
std::vector<size_t> tracrDataSizes(runtime.aicpu_thread_num);
rc = device_runner->copy_from_device(reinterpret_cast<void*>(tracrDataSizes.data()),
reinterpret_cast<void*>(runtime.tracrDataSizes_), size);
if (rc != 0) {
LOG_ERROR("device_runner->copy_from_device 'tracrDataSizes' failed rc=%d", rc);
return rc;
}
// Now, store the traces into '~/ascend/tracr/'
tracr_dir = "~/ascend/tracr_" + std::to_string(sampleID++) + "/proc." + std::to_string(1000 + device_runner->device_id());
rc = TracrData2BTS(tracrData.data(), tracrDataSizes.data(), runtime.aicpu_thread_num);
if (rc != 0) {
LOG_ERROR("TracrData2BTS() failed");
return rc;
}
// Free device TraCR memory data placeholder
device_runner->free_tensor(runtime.tracrData_);
device_runner->free_tensor(runtime.tracrDataSizes_);| runtime.tracrDataSizes_ = device_runner->allocate_tensor(runtime.aicpu_thread_num * sizeof(size_t)); | ||
| if (runtime.tracrDataSizes_ == nullptr) { | ||
| LOG_ERROR("runtime.tracrDataSizes_: alloc %zu bytes failed", size); | ||
| return -1; | ||
| } |
There was a problem hiding this comment.
If the allocation of runtime.tracrDataSizes_ fails, the function returns -1 without freeing runtime.tracrData_, leading to a memory leak. Additionally, the log message prints size (which is the size of tracrData_) instead of the actual size of tracrDataSizes_.
runtime.tracrDataSizes_ = device_runner->allocate_tensor(runtime.aicpu_thread_num * sizeof(size_t));
if (runtime.tracrDataSizes_ == nullptr) {
size_t sizes_bytes = runtime.aicpu_thread_num * sizeof(size_t);
LOG_ERROR("runtime.tracrDataSizes_: alloc %zu bytes failed", sizes_bytes);
device_runner->free_tensor(runtime.tracrData_);
runtime.tracrData_ = nullptr;
return -1;
}| if (tracrThread->_traceIdx > 0) { | ||
| TraCR::Payload* tracrData = reinterpret_cast<TraCR::Payload*>(runtime->tracrData_); | ||
| const size_t payload_size = tracrThread->_traceIdx * sizeof(TraCR::Payload); | ||
|
|
||
| std::memcpy( | ||
| &tracrData[g_TraCR_thread_idx * TraCR::CAPACITY], | ||
| tracrThread->_traces.data(), | ||
| payload_size | ||
| ); | ||
| } | ||
|
|
||
| size_t* tracrDataSizes = reinterpret_cast<size_t*>(runtime->tracrDataSizes_); | ||
| tracrDataSizes[g_TraCR_thread_idx] = tracrThread->_traceIdx; |
There was a problem hiding this comment.
Defensive programming requires checking that runtime->tracrData_ and runtime->tracrDataSizes_ are not nullptr before dereferencing them. Additionally, g_TraCR_thread_idx must be validated to be within [0, runtime->aicpu_thread_num) to prevent out-of-bounds memory corruption.
if (g_TraCR_thread_idx >= 0 && g_TraCR_thread_idx < runtime->aicpu_thread_num) {
if (runtime->tracrData_ != nullptr && tracrThread->_traceIdx > 0) {
TraCR::Payload* tracrData = reinterpret_cast<TraCR::Payload*>(runtime->tracrData_);
const size_t payload_size = tracrThread->_traceIdx * sizeof(TraCR::Payload);
std::memcpy(
&tracrData[g_TraCR_thread_idx * TraCR::CAPACITY],
tracrThread->_traces.data(),
payload_size
);
}
if (runtime->tracrDataSizes_ != nullptr) {
size_t* tracrDataSizes = reinterpret_cast<size_t*>(runtime->tracrDataSizes_);
tracrDataSizes[g_TraCR_thread_idx] = tracrThread->_traceIdx;
}
} else {
LOG_ERROR("[TraCR] thread index %d out of bounds (max=%d)", g_TraCR_thread_idx, runtime->aicpu_thread_num);
}References
- When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations, always validate that the count is non-negative to prevent bypassing bounds checks.
- Ensure that index-based accessors using signed integers perform bounds checks to prevent undefined behavior, out-of-bounds memory access, or signed-to-unsigned conversion overflows.
| # TraCR threads capacity (default is 1<<20 ~= 1 million traces per thread = ~17MB per thread buffer size) | ||
| set(TRACR_CAPACITY "" CACHE STRING "Optional TraCR buffer capacity (empty = use internal default)") | ||
|
|
||
| if(NOT TRACR_CAPACITY STREQUAL "") |
There was a problem hiding this comment.
According to the general rules, when checking if a string variable is non-empty, you should quote the variable expansion to prevent unexpected behavior if the string value matches a boolean keyword or is numeric.
if(NOT "${TRACR_CAPACITY}" STREQUAL "")
References
- In CMake, when checking if a string variable is non-empty, explicitly check against an empty string using
NOT ... STREQUAL ""and quote the variable expansion.
| TRACR_POLICY_IGNORE_IF_FULL | ||
| ) | ||
|
|
||
| if(NOT TRACR_POLICY STREQUAL "") |
There was a problem hiding this comment.
According to the general rules, when checking if a string variable is non-empty, you should quote the variable expansion to prevent unexpected behavior if the string value matches a boolean keyword or is numeric.
if(NOT "${TRACR_POLICY}" STREQUAL "")
References
- In CMake, when checking if a string variable is non-empty, explicitly check against an empty string using
NOT ... STREQUAL ""and quote the variable expansion.
| for(int i = 0; i < int(RUNTIME_MAX_WORKER/3); ++i) { | ||
| channel_names.push_back("AICube_" + std::to_string(i)); | ||
| } | ||
| for(int i = 0; i < int(2*RUNTIME_MAX_WORKER/3); ++i) { | ||
| channel_names.push_back("AIVector_" + std::to_string(i)); | ||
| } |
There was a problem hiding this comment.
Using the hardcoded RUNTIME_MAX_WORKER macro creates an implicit dependency on external headers and populates channel_names with inactive channels. Use the active runtime.worker_count instead to ensure the metadata accurately reflects the running configuration.
for(int i = 0; i < int(runtime.worker_count/3); ++i) {
channel_names.push_back("AICube_" + std::to_string(i));
}
for(int i = 0; i < int(2*runtime.worker_count/3); ++i) {
channel_names.push_back("AIVector_" + std::to_string(i));
}| #ifdef INDEP_ORCH | ||
| INSTRUMENTATION_MARK_SET(g_TraCR_thread_idx, Barrier, orchestrator_done_); | ||
| LOG_INFO_V9("[TraCR] Thread %d: Waiting before the Orch to finish: %d, orchestrator_done_=%d", g_TraCR_thread_idx, g_TraCR_thread_idx_counter.load(), orchestrator_done_); | ||
| while (!orchestrator_done_){}; |
There was a problem hiding this comment.
There was a problem hiding this comment.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1)
882-893: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
init_failed_exits leak the TraCR lifecycle.After
TRACR_START(), every return path needs to pair the reset/finalize sequence. The earlyinit_failed_return skips both, leaving the per-run counter and thread-local tracer state live for the next execution.Also applies to: 909-911
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp` around lines 882 - 893, The early return in aicpu_executor.cpp after g_aicpu_executor.init() bypasses the TraCR cleanup, so make every init_failed_ exit in this init-wait loop pair TRACR_START() with the matching reset/finalize sequence before returning. Update the error path around g_aicpu_executor.init_done_ / init_failed_ so it performs the same TraCR lifecycle teardown as the normal exit path, including the additional return site referenced in the follow-up comment.src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp (1)
390-402: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winDon't reset the core lane until the core is actually idle.
running_donecan immediately flow intopromote_pending_to_running(core)below, so clearing the marker here drops the remaining in-flight task from the trace. Any completion path that leaves one slot loaded should transition back toRunning_Task_Single, not blank the lane.Also applies to: 406-418
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp` around lines 390 - 402, The completion path in scheduler_completion.cpp is clearing the core lane too early in the `running_done` branch, which drops in-flight work from the trace. Update the logic around `complete_slot_task(...)` and the `promote_pending_to_running(core)` flow so the lane is only reset when the core is truly idle; if one slot remains loaded, transition it back to `Running_Task_Single` instead of blanking the lane. Keep the `INSTRUMENTATION_MARK_RESET(...)` call aligned with the actual idle state in this completion path.
🧹 Nitpick comments (1)
.gitmodules (1)
1-4: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winPin submodule to a specific branch or tag for reproducible builds.
The TraCR submodule lacks
branchortagpinning. Without this, builds are non-deterministic and may break if upstream pushes incompatible changes.[submodule "tools/tracr"] path = tools/tracr url = https://github.com/huawei-csl/TracR.git + branch = main🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.gitmodules around lines 1 - 4, The tools/tracr submodule entry is not pinned, so update the submodule configuration to reference a specific branch or tag for deterministic builds. Modify the existing submodule definition for tools/tracr in .gitmodules to include explicit pinning metadata, and keep the URL/path mapping intact so future checkouts resolve to the intended TraCR version.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/a2a3/platform/onboard/host/device_runner.cpp`:
- Around line 270-278: The TraCR allocation/error path in
device_runner::StoreTracrData is not unwind-safe, so failed runs can skip
cleanup and leak buffers. Make the DevAllocTraCR()/runtime pointer setup
transactional or add a single all-exit cleanup path that always runs before
returning, ensuring the StoreTracrData free path and
teardown_shared_collectors_after_run() are executed even on failures. Keep the
final export-error return only after teardown has completed, and preserve the
existing error logging around DevAllocTraCR and later failure points.
In `@src/a2a3/platform/sim/aicpu/CMakeLists.txt`:
- Around line 77-79: The TraCR post-processing include is only registered from
the sim-only aicpu CMake target, so onboard builds never get the shared
post-process step. Move the include of tracr_postprocessing_script.cmake out of
the aicpu-specific CMakeLists and into a shared parent CMake entry that is
evaluated for all platforms, ensuring it is added exactly once for both sim and
onboard builds.
In `@src/a2a3/platform/sim/host/device_runner.cpp`:
- Around line 238-246: The TraCR setup in device_runner.cpp is not unwind-safe:
`DevAllocTraCR()` can partially initialize runtime state, and later early
returns in `StoreTracrData()` / the export path skip the only cleanup and
collector shutdown sequence. Refactor the `StoreTracrData()` flow to funnel all
exits through a single TraCR cleanup path that frees any allocated TraCR buffers
and always runs the collector stop/reconcile/export teardown before returning,
and defer propagating export failure until after that shutdown sequence
completes.
In `@src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp`:
- Around line 799-807: `TRACR_START()` is treating `g_TraCR_thread_idx == 0` as
the session owner, but that value only means “first arriver,” not “last to
finish.” Update the lifecycle in `TRACR_START()` and the corresponding teardown
path near the related `INSTRUMENTATION_END()` handling so the global trace
session is started once and ended only when all participating threads have
completed. Ensure `g_TraCR_thread_idx_counter` is not reset by thread 0 while
other `INSTRUMENTATION_THREAD_INIT()` threads may still be active.
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp`:
- Around line 819-823: The INDEP_ORCH wait in scheduler_dispatch.cpp uses
orchestrator_done_ as an unsynchronized spin flag, which can lead to a stale
read and an infinite loop. Update the barrier in the scheduler dispatch path to
use a synchronized/atomic latch for orchestrator_done_ (or an existing atomic
equivalent), and make the polling loop in the INDEP_ORCH block use acquire
semantics together with SPIN_WAIT_HINT() so the wait is safe on weakly ordered
CPUs.
In `@tools/tracr`:
- Line 1: The TraCR submodule pin is incompatible with the header path expected
by the build, so update the gitlink in tools/tracr to a revision that provides
include/tracr/tracr.hpp or adjust tracr_enable/CMake helper to reference the
actual header location for the current TraCR commit. Use the existing
tracr_enable configuration logic and the tools/tracr submodule reference to
locate the mismatch and keep the pinned revision and header path consistent.
In `@tools/tracr_postprocessing_script.cmake`:
- Around line 11-29: The current `set(BUILD_DIR ...)`, `configure_file(...)`,
and `set_target_properties(tracr_process ...)` logic is hard-coding outputs
relative to `CMAKE_CURRENT_LIST_DIR`, which points at the source tree and can
collide across builds. Update `tracr_postprocessing_script.cmake` to derive the
output location from `CMAKE_CURRENT_BINARY_DIR` (or an injected binary output
variable) and use that for both the `configure_file` destination of `state.cfg`
and the `RUNTIME_OUTPUT_DIRECTORY` of `tracr_process`, so artifacts land in the
active binary tree instead of `tools/../build/output/bin/`.
In `@tools/tracr_simpler_api.hpp`:
- Around line 241-250: The allocation path in the runtime setup leaves
runtime.tracrData_ live if the follow-up allocate_tensor call for
runtime.tracrDataSizes_ fails; update the same function to explicitly release or
null out runtime.tracrData_ before returning -1 so the partially initialized
state is cleaned up consistently. Use the existing runtime.tracrData_ and
runtime.tracrDataSizes_ symbols in the device allocation block to keep the
failure path safe for retries and cleanup.
- Around line 188-227: The tracing download flow in the TraCR helper leaks both
host and device buffers on early returns. Update the block around the
copy_from_device calls, TracrData2BTS, and StoreTracrMetaData so cleanup is
centralized with RAII or a scope guard, and make sure both tracrData and
tracrDataSizes are checked for null before use. Also ensure runtime.tracrData_
and runtime.tracrDataSizes_ are always released even when any rc check fails,
not only on the success path.
- Around line 33-40: The header currently defines non-inline globals sampleID
and tracr_dir, which can cause multiple-definition linker errors and unsafe
static initialization; update the declarations in tracr_simpler_api.hpp so these
variables are either marked inline or moved to a .cpp file with extern
declarations in the header. For sampleID, prefer lazy inline initialization via
getSampleID() so std::stoul is not run during problematic static initialization
and remains tied to the existing helper.
In `@tools/tracr.cmake`:
- Around line 29-56: The TraCR header probe in tracr_enable() runs even when
BUILD_TRACR is off, causing unnecessary configure failures. Move the
TRACR_INCLUDE_DIR setup, tracr.hpp existence check, and
target_include_directories call inside the BUILD_TRACR branch in tracr_enable(),
or return early before probing when BUILD_TRACR is false, so downstream targets
can invoke it safely without requiring the submodule.
---
Outside diff comments:
In `@src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp`:
- Around line 882-893: The early return in aicpu_executor.cpp after
g_aicpu_executor.init() bypasses the TraCR cleanup, so make every init_failed_
exit in this init-wait loop pair TRACR_START() with the matching reset/finalize
sequence before returning. Update the error path around
g_aicpu_executor.init_done_ / init_failed_ so it performs the same TraCR
lifecycle teardown as the normal exit path, including the additional return site
referenced in the follow-up comment.
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp`:
- Around line 390-402: The completion path in scheduler_completion.cpp is
clearing the core lane too early in the `running_done` branch, which drops
in-flight work from the trace. Update the logic around `complete_slot_task(...)`
and the `promote_pending_to_running(core)` flow so the lane is only reset when
the core is truly idle; if one slot remains loaded, transition it back to
`Running_Task_Single` instead of blanking the lane. Keep the
`INSTRUMENTATION_MARK_RESET(...)` call aligned with the actual idle state in
this completion path.
---
Nitpick comments:
In @.gitmodules:
- Around line 1-4: The tools/tracr submodule entry is not pinned, so update the
submodule configuration to reference a specific branch or tag for deterministic
builds. Modify the existing submodule definition for tools/tracr in .gitmodules
to include explicit pinning metadata, and keep the URL/path mapping intact so
future checkouts resolve to the intended TraCR version.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 88334310-c768-4931-80be-a1387c83c6b4
📒 Files selected for processing (22)
.gitmodulesexamples/a2a3/tensormap_and_ringbuffer/paged_attention/kernels/orchestration/paged_attention_orch.cppsimpler_setup/kernel_compiler.pysrc/a2a3/platform/onboard/aicpu/CMakeLists.txtsrc/a2a3/platform/onboard/host/CMakeLists.txtsrc/a2a3/platform/onboard/host/device_runner.cppsrc/a2a3/platform/sim/aicpu/CMakeLists.txtsrc/a2a3/platform/sim/host/CMakeLists.txtsrc/a2a3/platform/sim/host/device_runner.cppsrc/a2a3/runtime/host_build_graph/runtime/pto_runtime2_types.hsrc/a2a3/runtime/host_build_graph/runtime/runtime.hsrc/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpptools/tracrtools/tracr.cmaketools/tracr_postprocessing_script.cmaketools/tracr_simpler_api.hpptools/tracr_simpler_markers.hpp
| // Initialize TraCR memory on the device | ||
| #ifdef ENABLE_TRACR | ||
| // LOG_INFO_V9("[TraCR] thread[%d] DevAllocTraCR device_id_=%d", sched_getcpu(), device_id_); | ||
| rc = DevAllocTraCR(this, runtime); | ||
| if (rc != 0) { | ||
| LOG_ERROR("DevAllocTraCR failed rc=%d", rc); | ||
| return rc; | ||
| } | ||
| #endif |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Make the TraCR lifecycle unwind-safe.
DevAllocTraCR() can fail after populating one runtime pointer, and every later return before Line 487 skips the only shown free path in StoreTracrData(). The new Line 490 return also bypasses teardown_shared_collectors_after_run() at Line 496. Failed runs will leak TraCR buffers and skip the collector stop/export path. Either guard the TraCR buffers with an all-exit cleanup path here or make the helper transactional, then return the export error only after teardown has run.
Also applies to: 485-493
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/a2a3/platform/onboard/host/device_runner.cpp` around lines 270 - 278, The
TraCR allocation/error path in device_runner::StoreTracrData is not unwind-safe,
so failed runs can skip cleanup and leak buffers. Make the
DevAllocTraCR()/runtime pointer setup transactional or add a single all-exit
cleanup path that always runs before returning, ensuring the StoreTracrData free
path and teardown_shared_collectors_after_run() are executed even on failures.
Keep the final export-error return only after teardown has completed, and
preserve the existing error logging around DevAllocTraCR and later failure
points.
| # TODO: move this somewhere such that EVERY platform launches this once. Placing this here is hacky... | ||
| include(${CMAKE_CURRENT_SOURCE_DIR}/../../../../../tools/tracr_postprocessing_script.cmake) | ||
|
|
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift
Hoist TraCR post-processing out of the sim-only target.
Lines 77-79 register tracr_postprocessing_script.cmake only from the sim aicpu target, and the TODO already notes this should run once for every platform. As written, onboard builds miss the advertised post-process step entirely. Move this include to a shared parent CMake entry so it is added exactly once for both sim and onboard builds.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/a2a3/platform/sim/aicpu/CMakeLists.txt` around lines 77 - 79, The TraCR
post-processing include is only registered from the sim-only aicpu CMake target,
so onboard builds never get the shared post-process step. Move the include of
tracr_postprocessing_script.cmake out of the aicpu-specific CMakeLists and into
a shared parent CMake entry that is evaluated for all platforms, ensuring it is
added exactly once for both sim and onboard builds.
| // Initialize TraCR memory on the device | ||
| #ifdef ENABLE_TRACR | ||
| rc = DevAllocTraCR(this, runtime); | ||
| if (rc != 0) { | ||
| LOG_ERROR("DevAllocTraCR failed rc=%d", rc); | ||
| return rc; | ||
| } | ||
| #endif | ||
|
|
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Make the TraCR lifecycle unwind-safe.
DevAllocTraCR() can fail after populating one runtime pointer, and every later return before Line 529 skips the only shown free path in StoreTracrData(). The new Line 532 return also bypasses the collector stop/reconcile/export sequence at Lines 538-572. Failed runs will leak TraCR buffers and skip collector teardown. Add an all-exit TraCR cleanup path and defer the export failure until after the collector shutdown path has completed.
Also applies to: 527-535
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/a2a3/platform/sim/host/device_runner.cpp` around lines 238 - 246, The
TraCR setup in device_runner.cpp is not unwind-safe: `DevAllocTraCR()` can
partially initialize runtime state, and later early returns in
`StoreTracrData()` / the export path skip the only cleanup and collector
shutdown sequence. Refactor the `StoreTracrData()` flow to funnel all exits
through a single TraCR cleanup path that frees any allocated TraCR buffers and
always runs the collector stop/reconcile/export teardown before returning, and
defer propagating export failure until after that shutdown sequence completes.
| inline void TRACR_START() { | ||
| g_TraCR_thread_idx = g_TraCR_thread_idx_counter.fetch_add(1, std::memory_order_relaxed); | ||
|
|
||
| if (g_TraCR_thread_idx == 0) { | ||
| INSTRUMENTATION_START(); | ||
| } else { | ||
| INSTRUMENTATION_THREAD_INIT(); | ||
| } | ||
| } |
There was a problem hiding this comment.
🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win
Don't end the trace session on “TraCR thread 0”.
TRACR_START() assigns g_TraCR_thread_idx in arrival order, so index 0 is just the first caller, not the last finisher. If that thread executes INSTRUMENTATION_END() and resets the global counter while siblings are still running, the remaining threads race session teardown and can reuse thread slots inside the same run.
Also applies to: 835-839
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp` around
lines 799 - 807, `TRACR_START()` is treating `g_TraCR_thread_idx == 0` as the
session owner, but that value only means “first arriver,” not “last to finish.”
Update the lifecycle in `TRACR_START()` and the corresponding teardown path near
the related `INSTRUMENTATION_END()` handling so the global trace session is
started once and ended only when all participating threads have completed.
Ensure `g_TraCR_thread_idx_counter` is not reset by thread 0 while other
`INSTRUMENTATION_THREAD_INIT()` threads may still be active.
| #ifdef INDEP_ORCH | ||
| INSTRUMENTATION_MARK_SET(g_TraCR_thread_idx, Barrier, orchestrator_done_); | ||
| LOG_INFO_V9("[TraCR] Thread %d: Waiting before the Orch to finish: %d, orchestrator_done_=%d", g_TraCR_thread_idx, g_TraCR_thread_idx_counter.load(), orchestrator_done_); | ||
| while (!orchestrator_done_){}; | ||
| #endif |
There was a problem hiding this comment.
🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win
The INDEP_ORCH barrier is a racy infinite-spin.
This wait reads orchestrator_done_ as an unsynchronized boolean, while the same flag is reset with plain assignment elsewhere. On weakly ordered cores the scheduler can spin forever on a stale false. Make this latch atomic (or reuse an existing synchronized one) and poll it with acquire semantics plus SPIN_WAIT_HINT().
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp`
around lines 819 - 823, The INDEP_ORCH wait in scheduler_dispatch.cpp uses
orchestrator_done_ as an unsynchronized spin flag, which can lead to a stale
read and an infinite loop. Update the barrier in the scheduler dispatch path to
use a synchronized/atomic latch for orchestrator_done_ (or an existing atomic
equivalent), and make the polling loop in the INDEP_ORCH block use acquire
semantics together with SPIN_WAIT_HINT() so the wait is safe on weakly ordered
CPUs.
| set(BUILD_DIR "${CMAKE_CURRENT_LIST_DIR}/../build/output/bin/") | ||
|
|
||
| message(STATUS "TraCR: REAL_SOURCE_DIR: '${CMAKE_CURRENT_LIST_DIR}'") | ||
|
|
||
| # Paraver format configuration file | ||
| configure_file( | ||
| ${CMAKE_CURRENT_LIST_DIR}/tracr/postprocessing/paraver/state.cfg | ||
| ${BUILD_DIR}/state.cfg | ||
| COPYONLY | ||
| ) | ||
|
|
||
| add_executable(tracr_process ${CMAKE_CURRENT_LIST_DIR}/tracr/postprocessing/tracr_process.cpp) | ||
|
|
||
| tracr_enable(tracr_process) | ||
|
|
||
| # Set the output directory for the compiled executable | ||
| set_target_properties(tracr_process PROPERTIES | ||
| RUNTIME_OUTPUT_DIRECTORY ${BUILD_DIR} | ||
| ) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Emit state.cfg and tracr_process into the binary tree, not the source tree.
CMAKE_CURRENT_LIST_DIR resolves to tools/, so this hard-codes outputs under the repository checkout (../build/output/bin). Out-of-tree or parallel builds will stomp each other's artifacts, and the files won't follow the active build directory. Base this on CMAKE_CURRENT_BINARY_DIR (or a caller-provided binary output dir) instead.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/tracr_postprocessing_script.cmake` around lines 11 - 29, The current
`set(BUILD_DIR ...)`, `configure_file(...)`, and
`set_target_properties(tracr_process ...)` logic is hard-coding outputs relative
to `CMAKE_CURRENT_LIST_DIR`, which points at the source tree and can collide
across builds. Update `tracr_postprocessing_script.cmake` to derive the output
location from `CMAKE_CURRENT_BINARY_DIR` (or an injected binary output variable)
and use that for both the `configure_file` destination of `state.cfg` and the
`RUNTIME_OUTPUT_DIRECTORY` of `tracr_process`, so artifacts land in the active
binary tree instead of `tools/../build/output/bin/`.
| size_t getSampleID() { | ||
| const auto env = std::getenv("PYPTO_RUN_SAMPLE_ID"); | ||
| return env ? std::stoul(env) : 0; | ||
| } | ||
| size_t sampleID = getSampleID(); | ||
|
|
||
| std::string tracr_dir = "~/ascend/tracr/proc.1"; | ||
|
|
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major
Move non-inline global variables out of the header or mark them inline.
Defining variables (sampleID, tracr_dir) in a header without inline creates external linkage. Including this header in multiple translation units will cause linker errors (multiple definition). Additionally, the initialization of sampleID uses std::stoul, which can throw an exception during static initialization, causing the program to crash before standard error handling is active.
Since C++17 is available, prefer declaring these as inline variables with lazy initialization to ensure single-definition and safe initialization:
Recommended fix
inline size_t sampleID = []() {
const char* env = std::getenv("PYPTO_RUN_SAMPLE_ID");
return env ? std::stoul(env) : 0;
}();
inline std::string tracr_dir = "~/ascend/tracr/proc.1";Alternatively, move the definitions to a corresponding .cpp file and declare them extern in the header.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/tracr_simpler_api.hpp` around lines 33 - 40, The header currently
defines non-inline globals sampleID and tracr_dir, which can cause
multiple-definition linker errors and unsafe static initialization; update the
declarations in tracr_simpler_api.hpp so these variables are either marked
inline or moved to a .cpp file with extern declarations in the header. For
sampleID, prefer lazy inline initialization via getSampleID() so std::stoul is
not run during problematic static initialization and remains tied to the
existing helper.
| size_t size = sizeof(TraCR::Payload) * TraCR::CAPACITY * runtime.aicpu_thread_num; | ||
| TraCR::Payload* tracrData = reinterpret_cast<TraCR::Payload *>(std::malloc(size)); | ||
| int rc = device_runner->copy_from_device(reinterpret_cast<void*>(tracrData), | ||
| reinterpret_cast<void*>(runtime.tracrData_), size); | ||
| if (rc != 0) { | ||
| LOG_ERROR("device_runner->copy_from_device 'tracrData' failed rc=%d", rc); | ||
| return rc; | ||
| } | ||
|
|
||
| // Download the tracrDataSizes_ from Device to Host | ||
| size = sizeof(size_t) * runtime.aicpu_thread_num; | ||
| size_t* tracrDataSizes = reinterpret_cast<size_t *>(std::malloc(size)); | ||
| rc = device_runner->copy_from_device(reinterpret_cast<void*>(tracrDataSizes), | ||
| reinterpret_cast<void*>(runtime.tracrDataSizes_), size); | ||
| if (rc != 0) { | ||
| LOG_ERROR("device_runner->copy_from_device 'tracrDataSizes' failed rc=%d", rc); | ||
| return rc; | ||
| } | ||
|
|
||
| // Now, store the traces into '~/ascend/tracr/' | ||
| tracr_dir = "~/ascend/tracr_" + std::to_string(sampleID++) + "/proc." + std::to_string(1000 + device_runner->device_id()); | ||
| rc = TracrData2BTS(tracrData, tracrDataSizes, runtime.aicpu_thread_num); | ||
| if (rc != 0) { | ||
| LOG_ERROR("TracrData2BTS() failed"); | ||
| return rc; | ||
| } | ||
|
|
||
| // Free tmp Host TraCR data placeholders | ||
| std::free(reinterpret_cast<void *>(tracrData)); | ||
| std::free(reinterpret_cast<void *>(tracrDataSizes)); | ||
|
|
||
| // Free device TraCR memory data placeholder | ||
| device_runner->free_tensor(runtime.tracrData_); | ||
| device_runner->free_tensor(runtime.tracrDataSizes_); | ||
|
|
||
| rc = StoreTracrMetaData(runtime); | ||
| if (rc != 0) { | ||
| LOG_ERROR("StoreTracrMetaData failed: %d", rc); | ||
| return rc; | ||
| } |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
These error paths leak both host buffers and device trace tensors.
Every return between the first malloc() and StoreTracrMetaData() skips some combination of tracrData, tracrDataSizes, runtime.tracrData_, and runtime.tracrDataSizes_. A failed copy or file write will leave tracing buffers allocated for the rest of the process. Please switch this block to RAII/scope-guard cleanup and reject null malloc() results before calling copy_from_device().
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/tracr_simpler_api.hpp` around lines 188 - 227, The tracing download
flow in the TraCR helper leaks both host and device buffers on early returns.
Update the block around the copy_from_device calls, TracrData2BTS, and
StoreTracrMetaData so cleanup is centralized with RAII or a scope guard, and
make sure both tracrData and tracrDataSizes are checked for null before use.
Also ensure runtime.tracrData_ and runtime.tracrDataSizes_ are always released
even when any rc check fails, not only on the success path.
| runtime.tracrData_ = device_runner->allocate_tensor(size); | ||
| if (runtime.tracrData_ == nullptr) { | ||
| LOG_ERROR("runtime.tracrData_: alloc %zu bytes failed", size); | ||
| return -1; | ||
| } | ||
| // LOG_INFO_V9("Device alloc start of size=%u, %p", size, runtime.tracrData_); | ||
| runtime.tracrDataSizes_ = device_runner->allocate_tensor(runtime.aicpu_thread_num * sizeof(size_t)); | ||
| if (runtime.tracrDataSizes_ == nullptr) { | ||
| LOG_ERROR("runtime.tracrDataSizes_: alloc %zu bytes failed", size); | ||
| return -1; |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Rollback runtime.tracrData_ if the second device allocation fails.
When allocate_tensor(runtime.aicpu_thread_num * sizeof(size_t)) fails, the function returns with the first device buffer still live in runtime.tracrData_. That leaks device memory and leaves runtime in a half-initialized state for any retry/error cleanup path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/tracr_simpler_api.hpp` around lines 241 - 250, The allocation path in
the runtime setup leaves runtime.tracrData_ live if the follow-up
allocate_tensor call for runtime.tracrDataSizes_ fails; update the same function
to explicitly release or null out runtime.tracrData_ before returning -1 so the
partially initialized state is cleaned up consistently. Use the existing
runtime.tracrData_ and runtime.tracrDataSizes_ symbols in the device allocation
block to keep the failure path safe for retries and cleanup.
| # Create the TraCR include directory path | ||
| set(TRACR_INCLUDE_DIR | ||
| ${TRACR_ROOT_DIR}/tracr/include | ||
| ) | ||
|
|
||
| # Check if it even exists | ||
| if (NOT EXISTS "${TRACR_INCLUDE_DIR}/tracr/tracr.hpp") | ||
| message(FATAL_ERROR | ||
| "tracr.hpp not found at ${TRACR_INCLUDE_DIR}/tracr/tracr.hpp" | ||
| ) | ||
| endif() | ||
|
|
||
| # Append the nlohmann json path as well | ||
| set(TRACR_INCLUDE_DIR | ||
| ${TRACR_INCLUDE_DIR} | ||
| ${TRACR_ROOT_DIR} | ||
| ${TRACR_ROOT_DIR}/tracr/extern | ||
| ) | ||
|
|
||
| # --- include the directories --- | ||
| target_include_directories(${target} PRIVATE | ||
| ${TRACR_INCLUDE_DIR} | ||
| ) | ||
|
|
||
| # --- compiler flags of TraCR --- | ||
| if (BUILD_TRACR) | ||
| # Flag to enable/disable TraCR calls at compile time | ||
| target_compile_definitions(${target} PRIVATE ENABLE_TRACR) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win
Don't probe the TraCR submodule unless BUILD_TRACR is enabled.
tracr_enable() is invoked unconditionally by downstream targets, but the tracr.hpp existence check runs before the if(BUILD_TRACR) guard. That turns every default/off build into a hard configure failure instead of a no-op, which is exactly what CI is doing now. Move the include-dir setup and header validation inside the enabled branch, or return early when BUILD_TRACR is false.
Suggested fix
function(tracr_enable target)
message(STATUS "Enabling TraCR '${BUILD_TRACR}' for target: ${target}")
if (NOT TARGET ${target})
message(FATAL_ERROR "Target '${target}' does not exist.")
endif()
+
+ if (NOT BUILD_TRACR)
+ return()
+ endif()
# Create the TraCR include directory path
set(TRACR_INCLUDE_DIR
${TRACR_ROOT_DIR}/tracr/include
)
@@
- # --- compiler flags of TraCR ---
- if (BUILD_TRACR)
+ # --- compiler flags of TraCR ---
+ if (BUILD_TRACR)
# Flag to enable/disable TraCR calls at compile time
target_compile_definitions(${target} PRIVATE ENABLE_TRACR)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Create the TraCR include directory path | |
| set(TRACR_INCLUDE_DIR | |
| ${TRACR_ROOT_DIR}/tracr/include | |
| ) | |
| # Check if it even exists | |
| if (NOT EXISTS "${TRACR_INCLUDE_DIR}/tracr/tracr.hpp") | |
| message(FATAL_ERROR | |
| "tracr.hpp not found at ${TRACR_INCLUDE_DIR}/tracr/tracr.hpp" | |
| ) | |
| endif() | |
| # Append the nlohmann json path as well | |
| set(TRACR_INCLUDE_DIR | |
| ${TRACR_INCLUDE_DIR} | |
| ${TRACR_ROOT_DIR} | |
| ${TRACR_ROOT_DIR}/tracr/extern | |
| ) | |
| # --- include the directories --- | |
| target_include_directories(${target} PRIVATE | |
| ${TRACR_INCLUDE_DIR} | |
| ) | |
| # --- compiler flags of TraCR --- | |
| if (BUILD_TRACR) | |
| # Flag to enable/disable TraCR calls at compile time | |
| target_compile_definitions(${target} PRIVATE ENABLE_TRACR) | |
| # Create the TraCR include directory path | |
| set(TRACR_INCLUDE_DIR | |
| ${TRACR_ROOT_DIR}/tracr/include | |
| ) | |
| # Check if it even exists | |
| if (NOT EXISTS "${TRACR_INCLUDE_DIR}/tracr/tracr.hpp") | |
| message(FATAL_ERROR | |
| "tracr.hpp not found at ${TRACR_INCLUDE_DIR}/tracr/tracr.hpp" | |
| ) | |
| endif() | |
| # Append the nlohmann json path as well | |
| set(TRACR_INCLUDE_DIR | |
| ${TRACR_INCLUDE_DIR} | |
| ${TRACR_ROOT_DIR} | |
| ${TRACR_ROOT_DIR}/tracr/extern | |
| ) | |
| # --- include the directories --- | |
| target_include_directories(${target} PRIVATE | |
| ${TRACR_INCLUDE_DIR} | |
| ) | |
| # --- compiler flags of TraCR --- | |
| if (BUILD_TRACR) | |
| # Flag to enable/disable TraCR calls at compile time | |
| target_compile_definitions(${target} PRIVATE ENABLE_TRACR) |
🧰 Tools
🪛 GitHub Actions: CI / 12_pre-commit.txt
[error] 36-36: CMake Error: tracr.hpp not found at /home/runner/work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp (triggered by tracr_enable; Call Stack: CMakeLists.txt:75).
[error] 36-36: CMake Error: tracr.hpp not found at /home/runner/work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp (triggered by tracr_enable; Call Stack: CMakeLists.txt:80).
🪛 GitHub Actions: CI / 13_ut (ubuntu-latest, 3.10).txt
[error] 36-36: CMake error: tracr.hpp not found at /home/runner/work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp (triggered by tracr_enable).
🪛 GitHub Actions: CI / 4_profiling-flags-smoke.txt
[error] 36-36: CMake configuration failed: tracr.hpp not found at /home/runner/work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp.
🪛 GitHub Actions: CI / 7_packaging-matrix (ubuntu-latest, 3.10).txt
[error] 36-36: CMake Error: tracr.hpp not found at /home/runner/work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp (triggered by tracr_enable in CMakeLists.txt)
🪛 GitHub Actions: CI / 8_packaging-matrix (macos-latest, 3.10).txt
[error] 36-36: CMake configuration failed: tracr.hpp not found at /Users/runner/work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp (Call Stack: CMakeLists.txt:75 (tracr_enable)).
[error] 36-36: CMake configuration failed: tracr.hpp not found at /Users/runner/work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp (Call Stack: CMakeLists.txt:80 (tracr_enable)).
🪛 GitHub Actions: CI / 9_ut (macos-latest, 3.10).txt
[error] 36-36: CMake configuration failed: tracr.hpp not found at /Users/runner/work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp.
🪛 GitHub Actions: CI / packaging-matrix (macos-latest, 3.10)
[error] 36-36: CMake Error: tracr.hpp not found at /Users/runner/work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp (triggered by tracr_enable).
🪛 GitHub Actions: CI / packaging-matrix (ubuntu-latest, 3.10)
[error] 36-36: CMake Error: tracr.hpp not found at /home/runner/work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp (called from CMakeLists.txt:75 for aicpu_kernel and CMakeLists.txt:80 for host_runtime).
🪛 GitHub Actions: CI / pre-commit
[error] 36-36: CMake error: tracr.hpp not found at /home/runner/work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp (triggered from tracr_enable).
🪛 GitHub Actions: CI / profiling-flags-smoke
[error] 36-36: CMake Error: tracr.hpp not found at /home/runner/work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp (triggered by tracr_enable).
🪛 GitHub Actions: CI / st-onboard-a2a3
[error] 36-36: CMake error: tracr.hpp not found at /data/ci-runner/simpler-ci/npu-runner-1/_work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp (triggered via tracr_enable in CMakeLists.txt).
🪛 GitHub Actions: CI / st-onboard-a5
[error] 36-36: CMake configuration failed: tracr.hpp not found at /home/pyptouser/simpler_a5ci/actions-runner/_work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp (Call Stack: CMakeLists.txt -> tracr_enable)
🪛 GitHub Actions: CI / st-sim-a2a3 (macos-latest, 3.10)
[error] 36-36: CMake failed: tracr.hpp not found at /Users/runner/work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp. Call Stack: CMakeLists.txt:80 (tracr_enable) (HOST target).
[error] 36-36: CMake failed: tracr.hpp not found at /Users/runner/work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp. Call Stack: CMakeLists.txt:75 (tracr_enable) (AICPU target).
🪛 GitHub Actions: CI / st-sim-a2a3 (ubuntu-latest, 3.10)
[error] 36-36: CMake Error: tracr.hpp not found at /home/runner/work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp. (CMakeLists.txt:80 (tracr_enable))
[error] 36-36: CMake Error: tracr.hpp not found at /home/runner/work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp. (CMakeLists.txt:75 (tracr_enable))
🪛 GitHub Actions: CI / st-sim-a5 (ubuntu-latest, 3.10)
[error] 36-36: CMake configuration failed: tracr.hpp not found at /home/runner/work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp (CMake Error at tracr.cmake:36 (message)).
🪛 GitHub Actions: CI / ut (macos-latest, 3.10)
[error] 36-36: CMake Error: tracr.hpp not found at /Users/runner/work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp (tracr_enable).
🪛 GitHub Actions: CI / ut (ubuntu-latest, 3.10)
[error] 36-36: CMake configuration failed for AICPU (exit code 1): tracr.hpp not found at /home/runner/work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp
[error] 36-36: CMake configuration failed for HOST (exit code 1): tracr.hpp not found at /home/runner/work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp
🪛 GitHub Actions: CI / ut-a2a3
[error] 36-36: CMake Error: tracr.hpp not found at /data/ci-runner/simpler-ci/npu-runner-2/_work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp (via tracr_enable from CMakeLists.txt)
🪛 GitHub Actions: CI / ut-a5
[error] 36-36: CMake failed because required header is missing: "tracr.hpp not found at /home/pyptouser/simpler_a5ci/actions-runner/_work/simpler/simpler/tools/tracr/include/tracr/tracr.hpp"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/tracr.cmake` around lines 29 - 56, The TraCR header probe in
tracr_enable() runs even when BUILD_TRACR is off, causing unnecessary configure
failures. Move the TRACR_INCLUDE_DIR setup, tracr.hpp existence check, and
target_include_directories call inside the BUILD_TRACR branch in tracr_enable(),
or return early before probing when BUILD_TRACR is false, so downstream targets
can invoke it safely without requiring the submodule.
Source: Pipeline failures
Brings tracr (main+tracr) up to main f4d14bf. Only aicpu_executor.cpp conflicted: kept main's refactors (RAII AicpuPhaseScope phases hw-native-sys#1183, simpler_aicpu_register_callable rename hw-native-sys#1207/hw-native-sys#1210, nested arena-setup scope) and re-wove the TraCR hooks (Allocating marker, TRACR_START/ FINALIZE, aicpu_execute entry/finalize, tracr_getcpu). All other TraCR footprint files (scheduler .cpp markers, device_runner, runtime.h, pto_runtime2_types.h) auto-merged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…1216) main hw-native-sys#1216 split the trb Runtime into a device-copied DeviceRuntimeLaunchDesc (`dev`) + host-only tail, so aicpu_thread_num / worker_count / the tracr pointers are no longer direct `Runtime` members on trb (they live in `dev`, reached via accessors); host_build_graph keeps them flat. The templated shared TraCR code (tracr_simpler_api.hpp) and the trb TRACR_FINALIZE used the old direct-field API and failed to compile under BUILD_TRACR=ON. - Add get/set_tracr_data + get/set_tracr_data_sizes accessors to both Runtime variants (mirroring get_aicpu_thread_num), identical signatures so the shared header compiles against either. - Switch tracr_simpler_api.hpp (DevAllocTraCR / StoreTracrData / StoreTracrMetaData) and aicpu_executor.cpp TRACR_FINALIZE to the accessors (get_aicpu_thread_num / get_worker_count / get/set_tracr_data*). Verified: BUILD_TRACR=ON build is clean (simpler-cann9 docker). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Whitespace/brace-style only, applied via the repo's pinned pre-commit clang-format hook (v21.1.0) scoped to the TraCR diff vs main. No semantic changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Formatting only, via the repo's pinned ruff hooks (v0.14.8). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
check-headers requires the standard "# Copyright (c) PyPTO Contributors." header (PY_HEADER) for .cmake files; tracr.cmake and tracr_postprocessing_script.cmake carried a different Huawei header with an extra leading rule line. Also picks up trailing-whitespace / end-of-file fixes on tracr.cmake. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
end-of-file-fixer requires a trailing newline; tracr_simpler_api.hpp and tracr_simpler_markers.hpp were missing it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Upstreams TraCR into Simpler as a
first-class profiler, replacing the separate development branch. No more
out-of-tree branch to rebase — TraCR lives in
tools/tracr/and is wired intothe
tensormap_and_ringbufferruntime build behindBUILD_TRACR=ON.What is TraCR
A lightweight, nanoscale C++ instrumentation library. Markers store fixed
16-byte records into a lock-free per-thread ring buffer; on device the
timestamps come from the hardware counter, and the buffers are downloaded
after a run to
~/ascend/tracr_<N>/and post-processed (tracr_process) intoPerfetto / Paraver timelines.
It captures the full on-device picture: the AICPU scheduler threads
(orchestration / scheduling / phase / drain markers) and per-core task
execution on the AICube / AIVector units.
Why TraCR
no locks, fixed-size records, HW-counter timestamps — designed to perturb the
hot path as little as possible. Crucially, it does not write to the AICPU
device_logon the hot path, so it avoids the op-serialization /op-execute-timeout pitfalls of log-based profiling — which is why it should be
meaningfully lower-overhead than the current device-log (
PTO2_PROFILING)approach.
orchestration) and kernel level — the visibility needed to profile PyPTO /
Simpler kernels.
Perfetto trace you open in
ui.perfetto.dev.supported, in-tree tool instead of a side branch.
Where it's already been used
qwen3-14B decode attention (
fa_fused+online_softmax) in pypto-lib —captured the full decode-layer timeline (AICPU + 24 AICube + 48 AIVector) (see pypto-lib #607.
Optimized pypto vs CCE paged attention — the follow-up to the
6x in Paged Attention benchmark vs pypto #986 , comparing the
optimized attention against the CCE highperf kernel
High performance Paged Attention A2A3 ST Test #899 at matched shapes.
DeepSeek V4 CSA decode-attention kernel pypto-lib #590
General PyPTO / Simpler kernel profiling.
Future work
Repo: https://github.com/huawei-csl/TracR