[Performance] L2 swimlane profiling drops under 30GB/s AICPU producer

### Platform

a2a3 (Ascend 910B/C hardware)

### Runtime Variant

tensormap_and_ringbuffer

### Summary

L2 swimlane profiling can drop records when device-side AICPU profiling data is produced faster than the host can drain, recycle, and return buffers.

The issue concerns the device-to-host profiling buffer lifecycle: AICPU writes records into profiling buffers, publishes full buffers to the ready queue, and depends on host-side management threads to drain those buffers and refill the free queue. Under bursty profiling traffic, the host path may not return buffers quickly enough, so the device runs out of free buffers and drops records.
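To make the lifecycle concrete, here is a minimal toy model of the free-queue/ready-queue handoff. It is only a sketch: `BufferPool`, `simulate`, the tick-based timing, and the pool sizes are all hypothetical and do not mirror the runtime's actual data structures. It shows that a burst can cause drops even when the average production rate matches what the host can drain.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BufferPool:
    """Toy free-queue / ready-queue model (hypothetical, not the runtime's types)."""
    num_buffers: int
    records_per_buffer: int
    free: deque = field(default_factory=deque)
    ready: deque = field(default_factory=deque)
    current: Optional[int] = None  # buffer the producer is currently filling
    fill: int = 0
    written: int = 0
    dropped: int = 0

    def __post_init__(self):
        self.free.extend(range(self.num_buffers))

    def produce(self, n):
        """Device side: write n records, dropping when no free buffer exists."""
        for _ in range(n):
            if self.current is None:
                if not self.free:
                    self.dropped += 1
                    continue
                self.current = self.free.popleft()
                self.fill = 0
            self.fill += 1
            self.written += 1
            if self.fill == self.records_per_buffer:
                self.ready.append(self.current)  # publish the full buffer
                self.current = None

    def drain(self, max_buffers):
        """Host side: drain up to max_buffers ready buffers and recycle them."""
        for _ in range(min(max_buffers, len(self.ready))):
            self.free.append(self.ready.popleft())


def simulate(records_per_tick, drain_per_tick, num_buffers=4, records_per_buffer=8):
    """Run one produce step and one drain step per tick; return (written, dropped)."""
    pool = BufferPool(num_buffers, records_per_buffer)
    for n in records_per_tick:
        pool.produce(n)
        pool.drain(drain_per_tick)
    return pool.written, pool.dropped


steady = simulate([8] * 8, drain_per_tick=1)        # 64 records, evenly spread
bursty = simulate([64] + [0] * 7, drain_per_tick=1)  # 64 records in one burst
print(steady, bursty)
```

In this model the steady run loses nothing, while the burst overruns the four-buffer pool before the host has a chance to return any buffer.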

This should be investigated independently of any specific model or library example. A synthetic AICPU producer is useful here because it can generate normal L2 swimlane profiling records at a controlled rate and let the host consume them through the regular profiling path.

Related: #997

### Git Commit ID

5a6ad1f2cab80b9eea3e4350a2c93a80d53b6403

### CANN Version

Not captured.

### Driver Version

Not captured.

### Host Platform

Linux (aarch64)

### Reproduction

Use a temporary synthetic L2 swimlane producer to isolate the profiling path from real model execution:

1. Short-circuit the normal AICore path so the kernel does not run real compute work.
2. In the AICPU entry path, start one AICPU thread that writes normal `L2SwimlaneAicpuTaskRecord` buffers for a short fixed duration.
3. Publish those buffers through the existing L2 swimlane ready-queue/free-queue mechanism.
4. Let the normal host L2 swimlane management and collector path drain the data.
5. Check whether the host keeps up or whether device-side profiling drops are reported.

The Python test used to launch the kernel should only act as a harness; the reproduction should not depend on pageattention, qwen, or any other specific workload behavior.
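When sizing the synthetic producer from step 2, the target rate has to be turned into a buffer count and a publish interval. The helper below is a sketch of that arithmetic for the harness side. `plan_producer` is a hypothetical name, and the record size and records-per-buffer values in the example call are placeholders, not the real `L2SwimlaneAicpuTaskRecord` layout.

```python
import math


def plan_producer(target_bytes_per_s, record_bytes, records_per_buffer, duration_s):
    """Translate a target producer bandwidth into buffer counts and pacing.

    All inputs are assumptions supplied by the caller; nothing here is read
    from the runtime.
    """
    buffer_bytes = record_bytes * records_per_buffer
    total_bytes = target_bytes_per_s * duration_s
    total_buffers = math.ceil(total_bytes / buffer_bytes)
    return {
        "buffer_bytes": buffer_bytes,
        "total_buffers": total_buffers,
        "total_records": total_buffers * records_per_buffer,
        # Interval between published buffers needed to hold the target rate.
        "buffer_period_ns": buffer_bytes * 1e9 / target_bytes_per_s,
    }


# Placeholder numbers: 1 GB/s for 10 ms, 64-byte records, 1024 records per buffer.
plan = plan_producer(1_000_000_000, 64, 1024, 0.01)
print(plan)
```

Keeping this calculation in the harness lets the same device-side loop be driven at several rates without changing the kernel.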

### Expected Performance

For a controlled device-side producer running below the intended hardware bandwidth envelope, the host profiling path should drain and recycle buffers without scattered profiling record loss. If it cannot keep up, it should expose clear backpressure or overflow behavior that is easy to reason about.

### Actual Performance

Under bursty device-side profiling production, L2 swimlane can still report device-side dropped profiling records. This indicates that host-side drain/refill/collector progress can lag behind the AICPU producer and fail to return free buffers in time.

### Profiling Data (Optional)

N/A. This issue intentionally tracks the problem statement and reproduction direction only; detailed benchmark numbers should live in the relevant PR or investigation notes.

### Additional Context

The main question is whether the limiting factor is host-side profiling control-path throughput, buffer lifecycle design, or a lower-level device-to-host bandwidth limit. The synthetic AICPU producer should help separate this from workload-specific behavior.
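One way to frame the buffer lifecycle question is to estimate how many buffers a burst needs when the host drains more slowly than the device produces. The sketch below uses a simple fluid model and is only a lower bound. `buffers_needed` is a hypothetical helper, and the 10 GB/s drain rate, 64 MiB burst, and 1 MiB buffer size in the example are illustrative assumptions, not measurements.

```python
import math


def buffers_needed(burst_bytes, produce_bytes_per_s, drain_bytes_per_s, buffer_bytes):
    """Lower bound on pool size that absorbs one burst without drops.

    Fluid model (an assumption): the producer writes at a constant rate for
    the whole burst while the host drains at a constant rate. It ignores
    per-buffer handoff latency and any fixed host-side overhead.
    """
    burst_s = burst_bytes / produce_bytes_per_s
    # Data still waiting on the device side when the burst ends.
    backlog = max(0.0, burst_bytes - drain_bytes_per_s * burst_s)
    # One extra buffer for the one the producer is currently filling.
    return math.ceil(backlog / buffer_bytes) + 1


MIB = 1 << 20
needed = buffers_needed(64 * MIB, 30e9, 10e9, 1 * MIB)
print(needed)
```

If the measured drop behavior is far worse than this bound suggests for the configured pool, that would point toward control-path latency rather than raw drain bandwidth.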

