Platform
a2a3 (Ascend 910B/C hardware)
Runtime Variant
tensormap_and_ringbuffer
Summary
L2 swimlane profiling can drop records when device-side AICPU profiling data is produced faster than the host can drain, recycle, and return buffers.
The issue is about the device-to-host profiling buffer lifecycle: AICPU writes records into profiling buffers, publishes full buffers to the ready queue, and depends on host-side management threads to drain those buffers and refill the free queue. Under bursty profiling traffic, the host path may not return buffers quickly enough, causing device-side drops.
This should be investigated independently from any specific model or library example. A synthetic AICPU producer is useful here because it can generate normal L2 swimlane profiling records at a controlled rate and let the host consume them through the regular profiling path.
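The buffer lifecycle described above can be sketched as a toy discrete-time model. This is only an illustration, not the runtime's implementation: the function name `simulate_buffer_lifecycle`, the per-tick granularity, and all numbers are assumptions made for this sketch. It shows how a bursty producer can drop buffers even when its average rate matches what the host can drain.

```python
from collections import deque


def simulate_buffer_lifecycle(num_buffers, produce_schedule, host_drain_per_tick):
    """Toy model of the free-queue / ready-queue lifecycle (hypothetical).

    produce_schedule: list with the number of buffers the device tries to
        fill and publish on each tick.
    host_drain_per_tick: how many ready buffers the host can drain and
        return to the free queue on each tick.
    Returns (delivered, dropped) counted in whole buffers.
    """
    free_queue = deque(range(num_buffers))
    ready_queue = deque()
    delivered = 0
    dropped = 0
    for wanted in produce_schedule:
        # Device side: take a free buffer, fill it, publish it as ready.
        for _ in range(wanted):
            if free_queue:
                ready_queue.append(free_queue.popleft())
            else:
                dropped += 1  # no free buffer available: data is lost
        # Host side: drain a bounded number of ready buffers and recycle them.
        for _ in range(min(host_drain_per_tick, len(ready_queue))):
            free_queue.append(ready_queue.popleft())
            delivered += 1
    return delivered, dropped


steady = simulate_buffer_lifecycle(4, [2] * 8, host_drain_per_tick=2)
bursty = simulate_buffer_lifecycle(4, [8, 0, 0, 0, 8, 0, 0, 0], host_drain_per_tick=2)
print("steady:", steady)
print("bursty:", bursty)
```

Both schedules attempt 16 buffers over 8 ticks. In this model the steady schedule loses nothing, while the bursty one drops half of its buffers because the pool of 4 is exhausted inside each burst.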
Related: #997
Git Commit ID
5a6ad1f2cab80b9eea3e4350a2c93a80d53b6403
CANN Version
Not captured.
Driver Version
Not captured.
Host Platform
Linux (aarch64)
Reproduction
Use a temporary synthetic L2 swimlane producer to isolate the profiling path from real model execution:
- Short-circuit the normal AICore path so the kernel does not run real compute work.
- In the AICPU entry path, start one AICPU thread that writes normal L2SwimlaneAicpuTaskRecord buffers for a short fixed duration.
- Publish those buffers through the existing L2 swimlane ready-queue/free-queue mechanism.
- Let the normal host L2 swimlane management and collector path drain the data.
- Check whether the host keeps up or whether device-side profiling drops are reported.
The Python test used to launch the kernel should only act as a harness; the reproduction should not depend on pageattention, qwen, or any other specific workload behavior.
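The shape of this reproduction can be prototyped on the host alone before touching device code. The sketch below is a Python mock under stated assumptions, not the AICPU producer: `run_synthetic_producer`, its parameters, and the integer buffer IDs are hypothetical stand-ins, and a real producer would write L2SwimlaneAicpuTaskRecord entries on the device. It only mirrors the structure of the steps above: one producer thread running for a fixed duration, a free queue and a ready queue, and a collector that drains and recycles.

```python
import queue
import threading
import time


def run_synthetic_producer(num_buffers=8, duration_s=0.05,
                           produce_interval_s=0.0005, host_delay_s=0.001):
    """Host-only mock of the producer/collector loop (all names hypothetical)."""
    free_q = queue.Queue()
    ready_q = queue.Queue()
    for buf_id in range(num_buffers):
        free_q.put(buf_id)
    # Each counter is written by exactly one thread.
    stats = {"attempted": 0, "published": 0, "dropped": 0, "drained": 0}
    producer_done = threading.Event()

    def producer():
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            stats["attempted"] += 1
            try:
                buf_id = free_q.get_nowait()
            except queue.Empty:
                stats["dropped"] += 1  # no free buffer: count a drop
            else:
                ready_q.put(buf_id)  # publish the "full" buffer
                stats["published"] += 1
            time.sleep(produce_interval_s)
        producer_done.set()

    def host_collector():
        while True:
            try:
                buf_id = ready_q.get(timeout=0.01)
            except queue.Empty:
                # Stop only once the producer has finished and nothing is left.
                if producer_done.is_set() and ready_q.empty():
                    return
                continue
            time.sleep(host_delay_s)  # stand-in for drain and collect cost
            stats["drained"] += 1
            free_q.put(buf_id)  # return the buffer to the free queue

    threads = [threading.Thread(target=producer),
               threading.Thread(target=host_collector)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats


if __name__ == "__main__":
    print(run_synthetic_producer())
```

The drop count depends on timing and will vary between runs. The invariants that should always hold are `attempted == published + dropped` and `drained == published`, which is the kind of accounting a harness could check against the real counters.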
Expected Performance
For a controlled device-side producer below the intended hardware bandwidth envelope, the host profiling path should be able to drain and recycle buffers without scattered profiling record loss, or should expose clear backpressure/overflow behavior that is easy to reason about.
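One way to reason about the expected behavior is a back-of-envelope estimate of how long the buffer pool can absorb a burst. The sketch below is a simplified fluid model with assumptions of its own: the helper name `seconds_until_pool_exhausted` is hypothetical, the rates are treated as constant, and drained buffers are assumed to return to the free queue with no extra recycle latency. The numbers in the example are placeholders, not measurements.

```python
import math


def seconds_until_pool_exhausted(pool_bytes, produce_bytes_per_s, drain_bytes_per_s):
    """Time until the free pool runs dry under constant rates (toy model).

    Returns math.inf when the host drains at least as fast as the device
    produces, in which case the pool never empties in this model.
    """
    deficit = produce_bytes_per_s - drain_bytes_per_s
    if deficit <= 0:
        return math.inf
    return pool_bytes / deficit


# Placeholder numbers: 64 MiB pool, 200 MB/s burst, 150 MB/s host drain.
pool = 64 * 1024 * 1024
print(seconds_until_pool_exhausted(pool, 200e6, 150e6))
print(seconds_until_pool_exhausted(pool, 100e6, 150e6))
```

Under this model, a burst shorter than the returned time is absorbed without drops, and a longer one is not. Drops seen well inside that window would point at recycle latency or control-path stalls rather than raw throughput.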
Actual Performance
Under bursty device-side profiling production, L2 swimlane can still report device-side dropped profiling records. This indicates that host-side drain/refill/collector progress can lag behind the AICPU producer and fail to return free buffers in time.
Profiling Data (Optional)
N/A. This issue intentionally tracks the problem statement and reproduction direction only; detailed benchmark numbers should live in the relevant PR or investigation notes.
Additional Context
The main question is whether the limiting factor is host-side profiling control-path throughput, buffer lifecycle design, or a lower-level device-to-host bandwidth limit. The synthetic AICPU producer should help separate this from workload-specific behavior.