
Fix: reduce profiling buffer drops across collectors #1162

Open
zmnobug wants to merge 3 commits into hw-native-sys:main from zmnobug:l2-swimlane-full-buffer-optimization

@zmnobug zmnobug commented Jun 25, 2026

Contributor

Summary

  • Split the shared profiling host pipeline into sharded drain/refill
    workers, collector shards, and a lightweight replenish path.
  • Move split-runtime free-queue refill onto the drain path: when a ready
    entry is popped, that drain shard tops up the originating free queue
    before handing the full buffer to its collector shard.
  • Change recycled device buffers from one global per-kind pool to
    shard-local pools, one per (collector shard × buffer kind) pair. The
    former broad pool lock is narrowed to pointer mappings only.
  • Keep the per-free-queue writer lock as a safety guard because some
    runtime paths can reassign producer ownership across AICPU threads.
  • Improve device-side publication/backpressure behavior: publish full
    buffers before recovering replacements, order ready-entry writes before
    tail publication, use bounded queue waits, preserve total drop accounting,
    and reuse phase buffers when ready queues are full.
  • Synchronize the arch-specific L2/PMU/DepGen mitigation paths for both
    a2a3 and a5, and apply the shared split host framework across
    TensorDump and ScopeStats.
  • Harden AICPU profiling disabled/base=0 paths by resetting cached runtime
    state, so a disabled launch cannot reuse stale header, pool, or current
    buffer pointers from a prior launch.
  • Harden host non-SVM copy/range paths by checking narrow read/write and
    buffer-copy return values before advancing queues or delivering buffers
    to collectors.
  • Harden non-L2 collector drain paths by validating ready-entry indices
    before resolving BufferState/free_queue, so a malformed or stale device
    entry is dropped instead of letting host refill an out-of-range
    free_queue.
  • Update docs and add sharded buffer-pool unit coverage.
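
The shard-local recycled-pool layout from the summary could look roughly like the sketch below. It is only an illustration: `ShardedRecycledPool`, `kNumKinds`, and every method name are hypothetical and are not taken from the PR's `buffer_pool_manager.h`. The point it shows is that each (collector shard, buffer kind) pair owns its own pool and lock, while a single shared lock remains only for device-to-host pointer mappings.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

constexpr std::size_t kNumKinds = 4;  // assumed number of buffer kinds

// Hypothetical model, not the PR's BufferPoolManager.
class ShardedRecycledPool {
public:
    explicit ShardedRecycledPool(std::size_t shard_count) : shards_(shard_count) {
        assert(shard_count > 0);
    }

    // Return a drained buffer to the pool owned by its collector shard.
    void recycle(std::size_t shard, std::size_t kind, std::uint64_t dev_ptr) {
        Pool &pool = pool_for(shard, kind);
        std::lock_guard<std::mutex> lock(pool.mu);
        pool.free.push_back(dev_ptr);
    }

    // Take a buffer for refill; returns 0 when the shard-local pool is empty.
    std::uint64_t take(std::size_t shard, std::size_t kind) {
        Pool &pool = pool_for(shard, kind);
        std::lock_guard<std::mutex> lock(pool.mu);
        if (pool.free.empty()) return 0;
        const std::uint64_t ptr = pool.free.back();
        pool.free.pop_back();
        return ptr;
    }

    // Pointer mappings are the only state behind the shared lock.
    void map(std::uint64_t dev_ptr, void *host_ptr) {
        std::lock_guard<std::mutex> lock(map_mu_);
        dev_to_host_[dev_ptr] = host_ptr;
    }

    void *lookup(std::uint64_t dev_ptr) {
        std::lock_guard<std::mutex> lock(map_mu_);
        const auto it = dev_to_host_.find(dev_ptr);
        return it == dev_to_host_.end() ? nullptr : it->second;
    }

private:
    struct Pool {
        std::mutex mu;
        std::vector<std::uint64_t> free;
    };
    struct Shard {
        std::array<Pool, kNumKinds> kinds;
    };

    Pool &pool_for(std::size_t shard, std::size_t kind) {
        return shards_[shard % shards_.size()].kinds[kind % kNumKinds];
    }

    std::vector<Shard> shards_;
    std::mutex map_mu_;
    std::unordered_map<std::uint64_t, void *> dev_to_host_;
};
```

With this shape, a drain shard that recycles and refills buffers of one kind only contends with its own collector shard, which is the locality the summary describes.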

Observed Effect

  • paged_attention stress with temporary pressure
    (PLATFORM_PROF_BUFFER_SIZE=4, PLATFORM_PROF_BUFFERS_PER_CORE=1):
    PERF drops decreased from about 32,969 to about 3,000, roughly a
    91% reduction.
  • qwen3_14b_decode stress under the same temporary pressure: PERF drops
    decreased from 68 to 28, roughly a 59% reduction.
  • Synthetic direct producer, target 30 GB/s for 5 ms:
    • Initial direct baseline: host effective receive median improved from
      1.8368 GB/s to 3.8720 GB/s, about +110.8%.
    • Initial direct baseline: host effective receive mean improved from
      1.4680 GB/s to 3.8949 GB/s, about +165.3%.
    • Initial direct baseline: median drop rate decreased from 93.88% to
      87.11%.
    • Stricter 4-ready baseline: host effective receive median improved from
      2.0096 GB/s to 3.8720 GB/s, about +92.7%.
    • Stricter 4-ready baseline: median PERF drop decreased from
      4,378,000 to 4,087,000; median drop rate decreased from 93.31%
      to 87.11%.

This is a mitigation and a host-management architecture improvement, not a
complete elimination of drops. Under the extreme temporary buffer=4 burst
stress, residual drops remain.
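
The percentages quoted above can be re-derived from the raw before/after numbers. The helper below is a small illustrative sketch (the name `percent_change` is made up here); it computes the relative change in percent, positive for an increase and negative for a reduction.

```cpp
#include <cassert>
#include <cmath>

// Relative change from `before` to `after`, in percent.
inline double percent_change(double before, double after) {
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}
```

For example, `percent_change(1.8368, 3.8720)` evaluates to roughly +110.8, matching the median receive-rate improvement reported for the initial direct baseline.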

Testing

  • conda run -n zm_pypto pre-commit run --files docs/profiling-framework.md src/common/platform/include/host/buffer_pool_manager.h src/common/platform/include/host/profiler_base.h tests/ut/cpp/common/test_buffer_pool_manager.cpp
  • conda run -n zm_pypto pre-commit run --files src/a5/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp src/a5/platform/shared/aicpu/pmu_collector_aicpu.cpp src/a5/platform/shared/aicpu/dep_gen_collector_aicpu.cpp src/common/platform/shared/aicpu/tensor_dump_aicpu.cpp src/common/platform/include/host/profiler_base.h
  • git diff --check --cached
  • CCACHE_DISABLE=1 PIP_CACHE_DIR=/tmp/pip-cache-simpler pip install --no-build-isolation -e .
  • Synthetic direct producer rerun: 7/7 runs passed on hardware
    device 2; attempted median 30.0288 GB/s, host effective receive
    median 3.8720 GB/s.
  • Linking the local test_buffer_pool_manager executable is blocked by the
    already-documented local GoogleTest ABI mismatch; the target builds
    successfully up to the link stage.

Related: #1161

@coderabbitai

coderabbitai Bot commented Jun 25, 2026


Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8f8867f2-c41d-4512-ad11-bd55ec3d0d37

Walkthrough

The PR updates profiling framework docs and shared contracts for split management, adds synchronization to host buffer pools, changes profiler runtime start/stop and worker loops, seeds more host buffers at initialization, and updates AICPU drop accounting and queue backpressure handling.

Changes

Split L2 swimlane profiling

| Layer | File(s) | Summary |
| --- | --- | --- |
| Shared contracts and counters | docs/profiling-framework.md, src/a2a3/platform/include/common/l2_swimlane_profiling.h, src/a2a3/platform/include/common/platform_config.h, src/a2a3/platform/include/host/l2_swimlane_collector.h | Shared profiling docs and public headers now describe split management, the 4-kind L2Swimlane buffer model, the new active-head drop counters, and the module metadata refresh hook. |
| Buffer pool locking | src/common/platform/include/host/buffer_pool_manager.h | Host buffer pool mappings, recycled-pool access, and free-queue writer access now use pool_mutex_ and striped mutexes. |
| ProfilerBase helpers | src/common/platform/include/host/profiler_base.h | ProfilerBase traits and queue helpers now support optional split management, refreshed queue indices, and capacity-checked free-queue pushes. |
| ProfilerBase workers and lifecycle | src/common/platform/include/host/profiler_base.h | ProfilerBase startup, shutdown, and loop bodies now launch separate drain and replenish workers, join them on stop, and log loop and buffer counters. |
| Host seeding and reconciliation | src/a2a3/platform/shared/host/l2_swimlane_collector.cpp | Host-side initialization now seeds multiple free-queue slots per kind, and reconciliation now aggregates and reports split drop counters. |
| AICPU backpressure and drops | src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp | AICPU queue rotation, buffer recovery, and phase recording now use bounded waits and record separate ready-queue-full and free-queue-empty drops. |

Sequence Diagram(s)

sequenceDiagram
  participant ProfilerBase
  participant L2SwimlaneModule
  participant ProfilerAlgorithms
  participant BufferPoolManager
  participant mgmt_drain_loop
  participant mgmt_replenish_loop

  ProfilerBase->>L2SwimlaneModule: refresh_replenish_metadata(header)
  ProfilerBase->>mgmt_drain_loop: launch drain thread(s)
  ProfilerBase->>mgmt_replenish_loop: launch replenish thread
  mgmt_drain_loop->>ProfilerAlgorithms: try_pop_aicpu_entry(refresh_indices=true)
  mgmt_replenish_loop->>BufferPoolManager: drain_done_into_recycled()
  mgmt_replenish_loop->>ProfilerAlgorithms: proactive_replenish()
  ProfilerBase->>mgmt_drain_loop: join() on stop()
  ProfilerBase->>mgmt_replenish_loop: join() on stop()
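
The start/stop lifecycle in the diagram can be sketched as follows. This is a simplified model under stated assumptions, not the code in `profiler_base.h`: the class name `SplitMgmtWorkers` and its tick counters are hypothetical, and the loop body is a stand-in for popping ready entries or draining done buffers. It only illustrates launching N drain workers plus one replenish worker and joining all of them on `stop()`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical model of the drain/replenish worker lifecycle.
class SplitMgmtWorkers {
public:
    explicit SplitMgmtWorkers(std::size_t drain_threads) : drain_threads_(drain_threads) {}
    ~SplitMgmtWorkers() { stop(); }

    void start() {
        assert(threads_.empty());
        running_.store(true, std::memory_order_release);
        for (std::size_t i = 0; i < drain_threads_; ++i) {
            threads_.emplace_back([this] { loop(drain_ticks_); });
        }
        threads_.emplace_back([this] { loop(replenish_ticks_); });
    }

    // Signal all workers to exit, then join them.
    void stop() {
        running_.store(false, std::memory_order_release);
        for (std::thread &t : threads_) {
            if (t.joinable()) t.join();
        }
        threads_.clear();
    }

    std::size_t thread_count() const { return threads_.size(); }
    unsigned long drain_ticks() const { return drain_ticks_.load(); }
    unsigned long replenish_ticks() const { return replenish_ticks_.load(); }

private:
    // Stand-in loop body; runs at least once so a final pass happens after stop().
    void loop(std::atomic<unsigned long> &ticks) {
        do {
            ticks.fetch_add(1, std::memory_order_relaxed);
            std::this_thread::yield();
        } while (running_.load(std::memory_order_acquire));
    }

    std::size_t drain_threads_;
    std::atomic<bool> running_{false};
    std::atomic<unsigned long> drain_ticks_{0};
    std::atomic<unsigned long> replenish_ticks_{0};
    std::vector<std::thread> threads_;
};
```

The do-while shape is a design choice of this sketch: each worker makes one last pass even if `stop()` races with its startup, so late entries are not left undrained.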

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

  • hw-native-sys/simpler issue 1161 — The split drain/refill changes and split drop accounting match the profiling buffer drop and host drain/refill behavior tracked there.

Possibly related PRs

  • hw-native-sys/simpler#939: The L2SwimlaneActiveHead padding and counter changes extend the cache-line layout refactor introduced there.
  • hw-native-sys/simpler#944: Both PRs change BufferPoolManager state access around dev_to_host_ and malloc_shadows_.
  • hw-native-sys/simpler#1152: The multi-buffer free-queue seeding in init_phase_pools follows the same phase pool initialization path adjusted there.

Poem

I hopped through queues at moonlit pace,
And found new threads to share the race.
Four buffer kinds now twinkle bright,
With split drains humming through the night.
I nibbled drops and counted true,
Then left a fluffy pawprint: zoop! 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.11% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title clearly matches the main change: reducing profiling buffer drops in the collectors pipeline. |
| Description check | ✅ Passed | The description accurately matches the documented changes to split host management, add backpressure and counters, seed queues, and update docs. |


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces an optional split management mode for the profiling framework, separating the management path into dedicated drain/refill and background replenish threads to prevent queue backpressure and micro-burst drops. It also adds diagnostic drop counters and enhances thread-safety using striped mutexes and a pool mutex. The code review identified several important issues and improvements: tight spin loops in the AICPU collector should gate system counter reads to avoid expensive MMIO overhead, and bounds/null checks must be added to prevent out-of-bounds access and null pointer dereferences. Additionally, atomic loads in self-correcting re-polling loops should be optimized to use relaxed memory ordering instead of acquire semantics.


Comment thread src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp
Comment thread src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp
Comment thread src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp
Comment thread src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp
Comment thread src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp
Comment thread src/common/platform/include/host/profiler_base.h Outdated
Comment thread src/common/platform/include/host/profiler_base.h Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp (1)

781-792: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Reuse the dropped phase buffer instead of orphaning it.

On ready_queue-full, this clears current_buf_ptr/current_buf_out after dropping the records, but the buffer is not enqueued, recycled, or returned to free_queue. That permanently removes a phase buffer from circulation for the run; reset count and keep it active, matching the task-buffer path.

Proposed fix
         state->head.dropped_record_count += full_buf->count;
         state->head.ready_queue_full_drop_record_count += full_buf->count;
         full_buf->count = 0;
-        *current_buf_out = nullptr;
-        state->head.current_buf_ptr = 0;
         wmb();
         return;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp` around lines
781 - 792, The ready_queue-full handling in l2_swimlane_collector_aicpu.cpp is
orphaning the phase buffer by clearing current_buf_ptr and current_buf_out after
counting the drop, which permanently removes it from circulation. Update the
enqueue-failure path in the phase-buffer branch around the rc != 0 handling so
the dropped buffer is reused like the task-buffer path: reset full_buf->count,
keep the buffer active/current, and avoid nulling out the current buffer state
unless it is actually handed off to free_queue or another queue.
🧹 Nitpick comments (1)
src/a2a3/platform/include/common/l2_swimlane_profiling.h (1)

286-294: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Pin this shared-memory layout with static_asserts.

The new counters make the padding math load-bearing again. A compile-time size/alignment check will catch the next field tweak before host and device builds drift apart.

Suggested change
 struct L2SwimlaneActiveHead {
     volatile uint64_t current_buf_ptr;       // 8 — active buffer device address (0 = none)
     volatile uint32_t current_buf_seq;       // 4 — monotonic seq / AICore rotation generation
     volatile uint32_t total_record_count;    // 4 — producer-attempted writes
     volatile uint32_t dropped_record_count;  // 4 — producer-dropped writes
     volatile uint32_t free_queue_empty_drop_record_count;
     volatile uint32_t ready_queue_full_drop_record_count;
     uint32_t pad[9];  // 36 → 64B
 } __attribute__((aligned(64)));
+
+static_assert(sizeof(L2SwimlaneActiveHead) == 64, "L2SwimlaneActiveHead must remain 64 bytes");
+static_assert(alignof(L2SwimlaneActiveHead) == 64, "L2SwimlaneActiveHead must remain 64-byte aligned");
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/include/common/l2_swimlane_profiling.h` around lines 286 -
294, Add compile-time layout checks for L2SwimlaneActiveHead to lock down the
shared-memory ABI. Use static_asserts near the struct definition to verify the
final sizeof and alignment remain the expected 64 bytes/64-byte alignment, so
any future field or padding change in L2SwimlaneActiveHead fails fast before
host/device layouts drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp`:
- Around line 132-145: `wait_for_free_queue_entry` in `L2SwimlaneFreeQueue`
currently lets callers observe `tail` and then immediately read the slot
pointer, which can race on weakly ordered AICPU memory. Add an acquire barrier
at the point where the free queue entry is consumed so the `buffer_ptrs[head %
PLATFORM_PROF_SLOT_COUNT]` read cannot happen before the producer’s slot write
is visible; update the call sites that use `wait_for_free_queue_entry` in the
AICPU collector flow to pair the existing release/write ordering with this
acquire step.

In `@src/common/platform/include/host/profiler_base.h`:
- Around line 37-42: The ProfilerBase module-trait comment is incomplete because
it omits the newly required kMgmtDrainThreadCount and the optional
refresh_replenish_metadata(...) hook. Update the trait contract block in
ProfilerBase to advertise these symbols alongside kSplitMgmtFunctions so new
module authors can see the full extension surface in one place.
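
The ordering concern in the first inline comment can be illustrated with a host-side model that uses `std::atomic` in place of the device barriers. This is a sketch under assumptions: `FreeQueueModel` and `kSlotCount` are made-up names (the latter standing in for `PLATFORM_PROF_SLOT_COUNT`), and the real AICPU code uses its own barrier primitives rather than C++ atomics. The producer writes the slot and then publishes `tail` with release semantics; the consumer loads `tail` with acquire semantics before reading the slot.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kSlotCount = 8;  // stand-in for PLATFORM_PROF_SLOT_COUNT

// Hypothetical single-producer/single-consumer model of a free queue.
struct FreeQueueModel {
    std::array<std::uint64_t, kSlotCount> buffer_ptrs{};
    std::atomic<std::uint32_t> head{0};  // advanced by the consumer
    std::atomic<std::uint32_t> tail{0};  // advanced by the producer

    // Producer: write the slot first, then publish it with a release store.
    bool push(std::uint64_t ptr) {
        assert(ptr != 0);  // 0 is reserved for "no buffer"
        const std::uint32_t t = tail.load(std::memory_order_relaxed);
        const std::uint32_t h = head.load(std::memory_order_acquire);
        if (t - h >= kSlotCount) return false;  // queue full
        buffer_ptrs[t % kSlotCount] = ptr;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }

    // Consumer: the acquire load of tail orders the slot read after the
    // producer's slot write. Returns 0 when the queue is empty.
    std::uint64_t pop() {
        const std::uint32_t h = head.load(std::memory_order_relaxed);
        const std::uint32_t t = tail.load(std::memory_order_acquire);
        if (h == t) return 0;
        const std::uint64_t ptr = buffer_ptrs[h % kSlotCount];
        head.store(h + 1, std::memory_order_release);
        return ptr;
    }
};
```

Without the acquire on the consumer side, a weakly ordered core could read the slot before the producer's write is visible, which is the race the comment describes.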


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ab26346f-4a42-41d4-9ae2-5e46df6fa0e3

📥 Commits

Reviewing files that changed from the base of the PR and between abc62d8 and 48bf5e9.

📒 Files selected for processing (8)
  • docs/profiling-framework.md
  • src/a2a3/platform/include/common/l2_swimlane_profiling.h
  • src/a2a3/platform/include/common/platform_config.h
  • src/a2a3/platform/include/host/l2_swimlane_collector.h
  • src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp
  • src/a2a3/platform/shared/host/l2_swimlane_collector.cpp
  • src/common/platform/include/host/buffer_pool_manager.h
  • src/common/platform/include/host/profiler_base.h

Comment thread src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp
Comment thread src/common/platform/include/host/profiler_base.h Outdated
@zmnobug zmnobug force-pushed the l2-swimlane-full-buffer-optimization branch 7 times, most recently from 534396d to 47f7159 Compare June 29, 2026 11:59
@ChaoZheng109

Collaborator

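Suggestion: this PR needs a matching set of a5 changes

The drop mitigation in this PR (bounded backpressure + publish-then-acquire + per-cause drop counters + sharded mgmt/collector) takes effect for the a2a3 arch-specific collectors, but a5 has a parallel, independent implementation and this PR does not touch any file under src/a5/. If merged as-is, the two architectures will diverge in capability.

What a5 already gets automatically (shared code, no separate sync needed, but it must be verified on a5)

  • framework: src/common/platform/include/host/{profiler_base,buffer_pool_manager}.h
  • scope_stats / tensor_dump AICPU + host (both under src/common/)

⚠️ Side effect: because the host traits of these two collectors also live in src/common, on a5 they have already switched to the split + backpressure path, yet the PR's drop measurements were taken only on a2a3. I suggest adding one functional + drop verification run on a5 onboard (st-onboard-a5 currently only shows that it does not hang).

Missing on a5 and needing sync (arch-specific, missed by this PR)

  1. AICPU backpressure / ordered publication / per-cause drop counters (a5 currently has none)
    • src/a5/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp
    • src/a5/platform/shared/aicpu/pmu_collector_aicpu.cpp
    • src/a5/platform/shared/aicpu/dep_gen_collector_aicpu.cpp
    • This includes the wmb pairing on ready-queue enqueue that a2a3 gained in this PR (it was missing in the old a2a3 PMU/DepGen code); please check the same-named a5 paths as well.
  2. ABI: add the two drop-counter fields to L2SwimlaneActiveHead (a5 still has pad[11] and no such fields)
    • src/a5/platform/include/common/l2_swimlane_profiling.h (pad[11] → pad[9] + alignof static_assert)
  3. host module split traits (all three are missing on a5, so it always runs the single-threaded mgmt_loop and gets no benefit from sharding)
    • src/a5/platform/include/host/{l2_swimlane,pmu,dep_gen}_collector.h:
      kSplitMgmtFunctions / kMgmtDrainThreadCount / kCollectorThreadCount
  4. host multi-threaded collector concurrency adaptation (make the counters atomic + lock the collected vectors per index, matching what a2a3 l2_swimlane_collector.cpp does)
    • src/a5/platform/shared/host/{l2_swimlane,pmu,dep_gen}_collector.cpp
  5. platform_config comments / backpressure constant floor (the backpressure cycles are derived from PLATFORM_PROF_SYS_CNT_FREQ; a5 runs at a different frequency but the formula is the same, so the seeding comments and the floor need to be confirmed)
    • src/a5/platform/include/common/platform_config.h
  6. To be evaluated: the AICore-as-producer side (a5 has aicore/l2_swimlane_collector_aicore.h / pmu_collector_aicore.h). a2a3 also changed the aicore rotate path in this PR; please confirm whether the a5 counterpart needs the same sync.

Suggested handling (pick one)

  • (A) Complete the a5 changes within this PR so that both arches get the mitigation at the same time; or
  • (B) Open a follow-up a5 PR and link it here, and state in this PR's description that the current scope is a2a3 only, so that readers do not assume a5's l2/pmu/dep_gen have also been mitigated.

Either way, please make this explicit before merging: until the a5 sync lands, a5's l2_swimlane / pmu / dep_gen get no drop mitigation at all.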
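
Item 5 above says the backpressure cycle budget is derived from the system counter frequency with a floor. A minimal sketch of that derivation and of a bounded wait is shown below. All names (`wait_budget_cycles`, `bounded_wait`), the microsecond budget parameter, and the floor value are assumptions for illustration; the real constants live in `platform_config.h` and are not reproduced here.

```cpp
#include <cassert>
#include <cstdint>

// Convert a wait budget in microseconds into counter cycles, with a floor.
// The frequency would come from something like PLATFORM_PROF_SYS_CNT_FREQ.
inline std::uint64_t wait_budget_cycles(std::uint64_t sys_cnt_freq_hz,
                                        std::uint64_t budget_us,
                                        std::uint64_t floor_cycles) {
    const std::uint64_t cycles = sys_cnt_freq_hz * budget_us / 1000000u;
    return cycles < floor_cycles ? floor_cycles : cycles;
}

// Poll `ready` until it returns true or the cycle budget is spent.
// `read_counter` is a stand-in for reading the system counter.
template <typename ReadCounter, typename Ready>
bool bounded_wait(ReadCounter read_counter, Ready ready, std::uint64_t budget_cycles) {
    const std::uint64_t start = read_counter();
    do {
        if (ready()) return true;
    } while (read_counter() - start < budget_cycles);
    return false;  // caller records a drop instead of stalling further
}
```

Because the same formula is used on both arches, only the frequency input changes; the floor keeps the budget from collapsing to a handful of cycles on a slow counter.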

@zmnobug zmnobug force-pushed the l2-swimlane-full-buffer-optimization branch 3 times, most recently from 2d87705 to a1c9e95 Compare June 30, 2026 10:44
@zmnobug zmnobug changed the title Fix: reduce L2 swimlane profiling drops Fix: reduce profiling buffer drops across collectors Jun 30, 2026
- Add host ready/done queue sharding and collector thread sharding in the shared profiling framework.

- Enable split mgmt and collector sharding for PMU, DepGen, TensorDump, and ScopeStats.

- Make AICPU writers publish full buffers before recovering replacement buffers, with bounded queue waits and publish barriers.

- Apply the same a5 arch-specific L2, PMU, and DepGen drop-mitigation paths as a2a3.

- Split recycled buffers by collector shard and kind so drain/refill mostly stays local.

- Move split runtime free-queue refill onto the drain path and leave replenish to drain done buffers only.

- Add acquire ordering and ready-entry validation fixes for non-L2 collector paths.

- Narrow the former pool lock to pointer mappings and update the sharding unit test and docs.
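
One drain step as described in these bullets (pop a ready entry, validate it, refill the originating free queue from the shard-local recycled pool, then hand the full buffer to the collector) might be modeled like this. The struct and field names are hypothetical and single-threaded for clarity; they are not the PR's `ProfilerBase` or `BufferState` types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical ready-queue entry: which free queue it came from, and the full buffer.
struct ReadyEntryModel {
    std::uint32_t free_queue_index;
    std::uint64_t full_buf;
};

struct DrainShardModel {
    std::deque<ReadyEntryModel> ready;                   // published by the device
    std::vector<std::deque<std::uint64_t>> free_queues;  // one per producer
    std::vector<std::uint64_t> recycled;                 // shard-local recycled buffers
    std::vector<std::uint64_t> to_collector;             // full buffers handed off
    std::size_t dropped_entries = 0;

    // Returns false when there was nothing to drain.
    bool drain_one() {
        if (ready.empty()) return false;
        const ReadyEntryModel entry = ready.front();
        ready.pop_front();
        // Validate the index before touching any free queue.
        if (entry.free_queue_index >= free_queues.size()) {
            ++dropped_entries;  // malformed or stale entry: drop it
            return true;
        }
        // Top up the originating free queue first, so the producer can continue.
        if (!recycled.empty()) {
            free_queues[entry.free_queue_index].push_back(recycled.back());
            recycled.pop_back();
        }
        // Only then hand the full buffer to the collector shard.
        to_collector.push_back(entry.full_buf);
        return true;
    }
};
```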
The reset_*_cached_state() helpers added to the a5 L2/PMU/DepGen and common
TensorDump collectors guard against reusing stale file-local statics on an
enabled->disabled launch. That path is already unreachable: the AICPU
record/complete entrypoints are gated at their call sites by the per-launch
enable switch (refreshed each register/prepare), e.g. l2_swimlane_aicpu_complete_task
runs only under `if (l2_swimlane_enabled && level >= AICPU_TIMING)` in both
scheduler_completion.cpp and aicpu_executor.cpp. A disabled launch never enters
the collectors, so the cached pointers are never dereferenced and the reset is
dead defensive code unrelated to this PR's buffer-drop-reduction goal.

Restore the setters, dep_gen finalize, and set_dump_args_enabled to their prior
form. The buffer-switch protocol changes (publish-before-recover, bounded waits,
wmb ordering, drop-cause counters) are untouched. a2a3 never carried these resets,
so both arches are now consistent again.
@ChaoZheng109

Collaborator

Pushed a small follow-up commit (3a866247) — removing the reset_*_cached_state() additions (a5 L2/PMU/DepGen + common tensor_dump), and restoring the setters / dep_gen_aicpu_finalize / set_dump_args_enabled to their prior form. The buffer-switch protocol work (publish-before-recover, bounded waits, the wmb() ordering fix in enqueue_*_ready_buffer, drop-cause counters) is untouched.

Why: the reset guards against reusing stale file-local statics on an enabled→disabled launch, but that path is already unreachable — the AICPU record/complete entrypoints are gated at their call sites by the per-launch enable switch, not just the internal state == nullptr check. e.g. l2_swimlane_aicpu_complete_task only runs under if (l2_swimlane_enabled && level >= AICPU_TIMING) in both runtimes:

  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp:250
  • src/a2a3/runtime/host_build_graph/aicpu/aicpu_executor.cpp:767

A disabled launch never enters the collectors, so the cached pointers are never dereferenced and the reset is dead defensive code — unrelated to this PR's drop-reduction goal. Dropping it also restores a2a3↔a5 symmetry (a2a3 never carried these resets). Filed and closed the standalone tracking issue #1232 with this rationale.

If there's an ungated call path into a collector on a disabled launch that I've missed, please point it out — in that case we'd keep the reset and apply it symmetrically to a2a3 instead. Also worth trimming the corresponding "Harden AICPU profiling disabled/base=0 paths…" bullet from the PR description so it matches the code.

Two minor notes from reading the device-side rewrite, unrelated to the above:

  • switch_dump_meta_buffer went from DUMP_SPIN_WAIT_LIMIT = 1,000,000 iterations to the shared ~20 µs bounded wait — a good stall reduction, but it also shrinks the host replenish window substantially; worth a line in the PR notes since it shifts the drop/stall trade-off.
  • Every records-path try_pop_* advances free_queue.head before the == 0 slot check (consumes the slot), whereas aicore_rotate checks before advancing — minor inconsistency, pick one convention.
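
The difference between the two pop conventions in the second note can be seen in a small single-threaded model. The names below are hypothetical and the code is not taken from the collectors; it only contrasts advancing `head` before the zero-slot check with checking before advancing.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kSlots = 4;  // illustrative slot count

struct FreeQueueView {
    std::array<std::uint64_t, kSlots> buffer_ptrs{};
    std::uint32_t head = 0;
    std::uint32_t tail = 0;
};

// Convention 1: advance head first, then check. A zero slot is consumed.
inline std::uint64_t pop_advance_then_check(FreeQueueView &q) {
    if (q.head == q.tail) return 0;
    const std::uint64_t ptr = q.buffer_ptrs[q.head % kSlots];
    ++q.head;
    return ptr;  // may be 0; the slot is gone either way
}

// Convention 2: check first. A zero slot is left in place for a later retry.
inline std::uint64_t pop_check_then_advance(FreeQueueView &q) {
    if (q.head == q.tail) return 0;
    const std::uint64_t ptr = q.buffer_ptrs[q.head % kSlots];
    if (ptr == 0) return 0;
    ++q.head;
    return ptr;
}
```

Both return 0 for a zero slot, but only the second leaves `head` unchanged, so mixing the two conventions makes queue occupancy harder to reason about.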

All five profiling collectors (L2/PMU/DepGen/TensorDump/ScopeStats) set
kSplitMgmtFunctions = true, so the non-split mgmt_loop() path was dead code and
the flag was a config dimension with only one value. Per the repo's
env-macro-gating discipline (a gate with no remaining choice is worse than
none), collapse it:

- Remove kSplitMgmtFunctions from all collector Modules and delete the
  ProfilerModuleOptions SFINAE detector.
- Make the pre-start proactive_replenish and the drain/replenish thread spawn
  unconditional in ProfilerBase::start().
- Delete the now-unreferenced mgmt_loop().
- Update the Module-concept docblock, the SVM/host-shadow comment (which
  described the removed bulk-mirror-per-tick behavior), and
  docs/profiling-framework.md.

Pure dead-code removal: every collector was already on the split path, so
runtime behavior is unchanged. Thread-sizing traits (kMgmtDrainThreadCount /
kCollectorThreadCount) are untouched and still default to 1 via their own
SFINAE, so a Module defining neither gets one drain + one replenish + one
collector thread.
@ChaoZheng109

Collaborator

Follow-up commit (618b7f71) — collapse kSplitMgmtFunctions into unconditional split mgmt.

All five collectors (L2/PMU/DepGen/TensorDump/ScopeStats) set kSplitMgmtFunctions = true, so the non-split mgmt_loop() path was dead code and the flag was a config dimension with only one live value. Per the repo's env-macro-gating discipline (a gate with no remaining choice is worse than none), this:

  • removes kSplitMgmtFunctions from all collector Modules + deletes the ProfilerModuleOptions SFINAE detector;
  • makes the pre-start proactive_replenish and the drain/replenish thread spawn unconditional in ProfilerBase::start();
  • deletes the now-unreferenced mgmt_loop();
  • updates the Module-concept docblock, the SVM/host-shadow comment (which described the removed bulk-mirror-per-tick behavior), and docs/profiling-framework.md.

It's a pure dead-code removal — every collector was already on the split path, so runtime behavior is unchanged. The thread-sizing traits (kMgmtDrainThreadCount / kCollectorThreadCount) are untouched and still default to 1 via their own SFINAE, so a Module defining neither would get one drain + one replenish + one collector thread.
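
The "default to 1 via SFINAE" behavior for the thread-sizing traits can be sketched as below. This is an assumed shape, not the detector in `profiler_base.h`: `MgmtDrainThreadCount`, `DefaultModule`, and `ShardedModule` are hypothetical names, and only `kMgmtDrainThreadCount` comes from the PR discussion.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// C++11-compatible void_t helper.
template <typename...>
struct make_void {
    using type = void;
};
template <typename... Ts>
using void_t = typename make_void<Ts...>::type;

// Primary template: used when Module does not define kMgmtDrainThreadCount.
template <typename Module, typename = void>
struct MgmtDrainThreadCount : std::integral_constant<std::size_t, 1> {};

// Specialization: picked when Module::kMgmtDrainThreadCount exists.
template <typename Module>
struct MgmtDrainThreadCount<Module, void_t<decltype(Module::kMgmtDrainThreadCount)>>
    : std::integral_constant<std::size_t, Module::kMgmtDrainThreadCount> {};

// Hypothetical modules for illustration only.
struct DefaultModule {};
struct ShardedModule {
    static constexpr std::size_t kMgmtDrainThreadCount = 4;
};

static_assert(MgmtDrainThreadCount<DefaultModule>::value == 1, "falls back to 1");
static_assert(MgmtDrainThreadCount<ShardedModule>::value == 4, "uses the module's value");
```

A module that omits the trait therefore needs no boilerplate, which is consistent with the stated default of one drain, one replenish, and one collector thread.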

Verification note: validated locally via pre-commit (clang-format / clang-tidy / cpplint / markdownlint all pass), but I did not run a full build/onboard — profiler_base.h is header-only and the real template instantiations live in each collector's TU. Please let CI's build + sim/onboard jobs backstop this before merge. If you'd rather keep kSplitMgmtFunctions as a real extension point for a future non-split Module, feel free to drop this commit — there's no non-split user today, which is the whole reason for collapsing it.


2 participants