Fix: reduce profiling buffer drops across collectors#1162
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThe PR updates profiling framework docs and shared contracts for split management, adds synchronization to host buffer pools, changes profiler runtime start/stop and worker loops, seeds more host buffers at initialization, and updates AICPU drop accounting and queue backpressure handling. ChangesSplit L2 swimlane profiling
Sequence Diagram(s)sequenceDiagram
participant ProfilerBase
participant L2SwimlaneModule
participant ProfilerAlgorithms
participant BufferPoolManager
participant mgmt_drain_loop
participant mgmt_replenish_loop
ProfilerBase->>L2SwimlaneModule: refresh_replenish_metadata(header)
ProfilerBase->>mgmt_drain_loop: launch drain thread(s)
ProfilerBase->>mgmt_replenish_loop: launch replenish thread
mgmt_drain_loop->>ProfilerAlgorithms: try_pop_aicpu_entry(refresh_indices=true)
mgmt_replenish_loop->>BufferPoolManager: drain_done_into_recycled()
mgmt_replenish_loop->>ProfilerAlgorithms: proactive_replenish()
ProfilerBase->>mgmt_drain_loop: join() on stop()
ProfilerBase->>mgmt_replenish_loop: join() on stop()
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces an optional split management mode for the profiling framework, separating the management path into dedicated drain/refill and background replenish threads to prevent queue backpressure and micro-burst drops. It also adds diagnostic drop counters and enhances thread-safety using striped mutexes and a pool mutex. The code review identified several important issues and improvements: tight spin loops in the AICPU collector should gate system counter reads to avoid expensive MMIO overhead, and bounds/null checks must be added to prevent out-of-bounds access and null pointer dereferences. Additionally, atomic loads in self-correcting re-polling loops should be optimized to use relaxed memory ordering instead of acquire semantics.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp (1)
781-792: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winReuse the dropped phase buffer instead of orphaning it.
On ready_queue-full, this clears
current_buf_ptr/current_buf_outafter dropping the records, but the buffer is not enqueued, recycled, or returned tofree_queue. That permanently removes a phase buffer from circulation for the run; resetcountand keep it active, matching the task-buffer path.Proposed fix
state->head.dropped_record_count += full_buf->count; state->head.ready_queue_full_drop_record_count += full_buf->count; full_buf->count = 0; - *current_buf_out = nullptr; - state->head.current_buf_ptr = 0; wmb(); return;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp` around lines 781 - 792, The ready_queue-full handling in l2_swimlane_collector_aicpu.cpp is orphaning the phase buffer by clearing current_buf_ptr and current_buf_out after counting the drop, which permanently removes it from circulation. Update the enqueue-failure path in the phase-buffer branch around the rc != 0 handling so the dropped buffer is reused like the task-buffer path: reset full_buf->count, keep the buffer active/current, and avoid nulling out the current buffer state unless it is actually handed off to free_queue or another queue.
🧹 Nitpick comments (1)
src/a2a3/platform/include/common/l2_swimlane_profiling.h (1)
286-294: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winPin this shared-memory layout with
static_asserts.The new counters make the padding math load-bearing again. A compile-time size/alignment check will catch the next field tweak before host and device builds drift apart.
Suggested change
struct L2SwimlaneActiveHead { volatile uint64_t current_buf_ptr; // 8 — active buffer device address (0 = none) volatile uint32_t current_buf_seq; // 4 — monotonic seq / AICore rotation generation volatile uint32_t total_record_count; // 4 — producer-attempted writes volatile uint32_t dropped_record_count; // 4 — producer-dropped writes volatile uint32_t free_queue_empty_drop_record_count; volatile uint32_t ready_queue_full_drop_record_count; uint32_t pad[9]; // 36 → 64B } __attribute__((aligned(64))); + +static_assert(sizeof(L2SwimlaneActiveHead) == 64, "L2SwimlaneActiveHead must remain 64 bytes"); +static_assert(alignof(L2SwimlaneActiveHead) == 64, "L2SwimlaneActiveHead must remain 64-byte aligned");🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/platform/include/common/l2_swimlane_profiling.h` around lines 286 - 294, Add compile-time layout checks for L2SwimlaneActiveHead to lock down the shared-memory ABI. Use static_asserts near the struct definition to verify the final sizeof and alignment remain the expected 64 bytes/64-byte alignment, so any future field or padding change in L2SwimlaneActiveHead fails fast before host/device layouts drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp`:
- Around line 132-145: `wait_for_free_queue_entry` in `L2SwimlaneFreeQueue`
currently lets callers observe `tail` and then immediately read the slot
pointer, which can race on weakly ordered AICPU memory. Add an acquire barrier
at the point where the free queue entry is consumed so the `buffer_ptrs[head %
PLATFORM_PROF_SLOT_COUNT]` read cannot happen before the producer’s slot write
is visible; update the call sites that use `wait_for_free_queue_entry` in the
AICPU collector flow to pair the existing release/write ordering with this
acquire step.
In `@src/common/platform/include/host/profiler_base.h`:
- Around line 37-42: The ProfilerBase module-trait comment is incomplete because
it omits the newly required kMgmtDrainThreadCount and the optional
refresh_replenish_metadata(...) hook. Update the trait contract block in
ProfilerBase to advertise these symbols alongside kSplitMgmtFunctions so new
module authors can see the full extension surface in one place.
---
Outside diff comments:
In `@src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp`:
- Around line 781-792: The ready_queue-full handling in
l2_swimlane_collector_aicpu.cpp is orphaning the phase buffer by clearing
current_buf_ptr and current_buf_out after counting the drop, which permanently
removes it from circulation. Update the enqueue-failure path in the phase-buffer
branch around the rc != 0 handling so the dropped buffer is reused like the
task-buffer path: reset full_buf->count, keep the buffer active/current, and
avoid nulling out the current buffer state unless it is actually handed off to
free_queue or another queue.
---
Nitpick comments:
In `@src/a2a3/platform/include/common/l2_swimlane_profiling.h`:
- Around line 286-294: Add compile-time layout checks for L2SwimlaneActiveHead
to lock down the shared-memory ABI. Use static_asserts near the struct
definition to verify the final sizeof and alignment remain the expected 64
bytes/64-byte alignment, so any future field or padding change in
L2SwimlaneActiveHead fails fast before host/device layouts drift.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ab26346f-4a42-41d4-9ae2-5e46df6fa0e3
📒 Files selected for processing (8)
docs/profiling-framework.mdsrc/a2a3/platform/include/common/l2_swimlane_profiling.hsrc/a2a3/platform/include/common/platform_config.hsrc/a2a3/platform/include/host/l2_swimlane_collector.hsrc/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cppsrc/a2a3/platform/shared/host/l2_swimlane_collector.cppsrc/common/platform/include/host/buffer_pool_manager.hsrc/common/platform/include/host/profiler_base.h
534396d to
47f7159
Compare
建议:本 PR 需要一份对等的 a5 同步修改本 PR 的丢弃缓解(有界背压 + publish-then-acquire + drop 分桶 + 分片 mgmt/collector)对 a2a3 的 arch-specific collector 生效,但 a5 有一套平行的独立实现,本 PR 未触及任何 a5 已经自动拿到(共享代码,无需单独同步,但需在 a5 验证)
a5 缺失、需要同步(arch-specific,PR 漏改)
建议处理方式(二选一)
无论哪种,合并前请明确:在 a5 同步落地之前,a5 的 |
2d87705 to
a1c9e95
Compare
- Add host ready/done queue sharding and collector thread sharding in the shared profiling framework. - Enable split mgmt and collector sharding for PMU, DepGen, TensorDump, and ScopeStats. - Make AICPU writers publish full buffers before recovering replacement buffers, with bounded queue waits and publish barriers. - Apply the same a5 arch-specific L2, PMU, and DepGen drop-mitigation paths as a2a3. - Split recycled buffers by collector shard and kind so drain/refill mostly stays local. - Move split runtime free-queue refill onto the drain path and leave replenish to drain done buffers only. - Add acquire ordering and ready-entry validation fixes for non-L2 collector paths. - Narrow the former pool lock to pointer mappings and update the sharding unit test and docs.
a1c9e95 to
9f7587d
Compare
The reset_*_cached_state() helpers added to the a5 L2/PMU/DepGen and common TensorDump collectors guard against reusing stale file-local statics on an enabled->disabled launch. That path is already unreachable: the AICPU record/complete entrypoints are gated at their call sites by the per-launch enable switch (refreshed each register/prepare), e.g. l2_swimlane_aicpu_complete_task runs only under `if (l2_swimlane_enabled && level >= AICPU_TIMING)` in both scheduler_completion.cpp and aicpu_executor.cpp. A disabled launch never enters the collectors, so the cached pointers are never dereferenced and the reset is dead defensive code unrelated to this PR's buffer-drop-reduction goal. Restore the setters, dep_gen finalize, and set_dump_args_enabled to their prior form. The buffer-switch protocol changes (publish-before-recover, bounded waits, wmb ordering, drop-cause counters) are untouched. a2a3 never carried these resets, so both arches are now consistent again.
|
Pushed a small follow-up commit ( Why: the reset guards against reusing stale file-local statics on an enabled→disabled launch, but that path is already unreachable — the AICPU record/complete entrypoints are gated at their call sites by the per-launch enable switch, not just the internal
A disabled launch never enters the collectors, so the cached pointers are never dereferenced and the reset is dead defensive code — unrelated to this PR's drop-reduction goal. Dropping it also restores a2a3↔a5 symmetry (a2a3 never carried these resets). Filed and closed the standalone tracking issue #1232 with this rationale. If there's an ungated call path into a collector on a disabled launch that I've missed, please point it out — in that case we'd keep the reset and apply it symmetrically to a2a3 instead. Also worth trimming the corresponding "Harden AICPU profiling disabled/base=0 paths…" bullet from the PR description so it matches the code. Two minor notes from reading the device-side rewrite, unrelated to the above:
|
All five profiling collectors (L2/PMU/DepGen/TensorDump/ScopeStats) set kSplitMgmtFunctions = true, so the non-split mgmt_loop() path was dead code and the flag was a config dimension with only one value. Per the repo's env-macro-gating discipline (a gate with no remaining choice is worse than none), collapse it: - Remove kSplitMgmtFunctions from all collector Modules and delete the ProfilerModuleOptions SFINAE detector. - Make the pre-start proactive_replenish and the drain/replenish thread spawn unconditional in ProfilerBase::start(). - Delete the now-unreferenced mgmt_loop(). - Update the Module-concept docblock, the SVM/host-shadow comment (which described the removed bulk-mirror-per-tick behavior), and docs/profiling-framework.md. Pure dead-code removal: every collector was already on the split path, so runtime behavior is unchanged. Thread-sizing traits (kMgmtDrainThreadCount / kCollectorThreadCount) are untouched and still default to 1 via their own SFINAE, so a Module defining neither gets one drain + one replenish + one collector thread.
|
Follow-up commit ( All five collectors (L2/PMU/DepGen/TensorDump/ScopeStats) set
It's a pure dead-code removal — every collector was already on the split path, so runtime behavior is unchanged. The thread-sizing traits ( Verification note: validated locally via pre-commit (clang-format / clang-tidy / cpplint / markdownlint all pass), but I did not run a full build/onboard — |
Summary
workers, collector shards, and a lightweight replenish path.
entry is popped, that drain shard tops up the originating free queue
before handing the full buffer to its collector shard.
shard-local
collector shard x buffer kindpools. The former broadpool lock is narrowed to pointer mappings only.
runtime paths can reassign producer ownership across AICPU threads.
buffers before recovering replacements, order ready-entry writes before
tail publication, use bounded queue waits, preserve total drop accounting,
and reuse phase buffers when ready queues are full.
a2a3 and a5, and apply the shared split host framework across
TensorDump and ScopeStats.
state, so a disabled launch cannot reuse stale header, pool, or current
buffer pointers from a prior launch.
buffer-copy return values before advancing queues or delivering buffers
to collectors.
before resolving BufferState/free_queue, so a malformed or stale device
entry is dropped instead of letting host refill an out-of-range
free_queue.
Observed Effect
paged_attentionstress with temporary pressure(
PLATFORM_PROF_BUFFER_SIZE=4,PLATFORM_PROF_BUFFERS_PER_CORE=1):PERF drops decreased from about
32,969to about3,000, roughly a91%reduction.qwen3_14b_decodestress under the same temporary pressure: PERF dropsdecreased from
68to28, roughly a59%reduction.30 GB/sfor5 ms:1.8368 GB/sto3.8720 GB/s, about+110.8%.1.4680 GB/sto3.8949 GB/s, about+165.3%.93.88%to87.11%.2.0096 GB/sto3.8720 GB/s, about+92.7%.4,378,000to4,087,000; median drop rate decreased from93.31%to
87.11%.This is a mitigation and host-management architecture improvement, not a
complete drop elimination. Under the extreme temporary
buffer=4burststress, residual drops remain.
Testing
conda run -n zm_pypto pre-commit run --files docs/profiling-framework.md src/common/platform/include/host/buffer_pool_manager.h src/common/platform/include/host/profiler_base.h tests/ut/cpp/common/test_buffer_pool_manager.cppconda run -n zm_pypto pre-commit run --files src/a5/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp src/a5/platform/shared/aicpu/pmu_collector_aicpu.cpp src/a5/platform/shared/aicpu/dep_gen_collector_aicpu.cpp src/common/platform/shared/aicpu/tensor_dump_aicpu.cpp src/common/platform/include/host/profiler_base.hgit diff --check --cachedCCACHE_DISABLE=1 PIP_CACHE_DIR=/tmp/pip-cache-simpler pip install --no-build-isolation -e .7/7runs passed on hardwaredevice
2; attempted median30.0288 GB/s, host effective receivemedian
3.8720 GB/s.test_buffer_pool_managerexecutable link is blocked by thealready-documented local GoogleTest ABI mismatch; the target compiles up
to link stage.
Related: #1161