[Performance] L2 swimlane profiling drops under 30GB/s AICPU producer #1161

Description

@zmnobug

Platform

a2a3 (Ascend 910B/C hardware)

Runtime Variant

tensormap_and_ringbuffer

Summary

L2 swimlane profiling can drop records when device-side AICPU profiling data is produced faster than the host can drain, recycle, and return buffers.

The issue is about the device-to-host profiling buffer lifecycle: AICPU writes records into profiling buffers, publishes full buffers to the ready queue, and depends on host-side management threads to drain those buffers and refill the free queue. Under bursty profiling traffic, the host path may not return buffers quickly enough, causing device-side drops.
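To make the lifecycle concrete, here is a toy discrete-time model of the free-queue/ready-queue handoff. It is only a sketch, not the runtime's code: the function name `simulate_buffer_pool`, the tick granularity, and the pool size are assumptions chosen for illustration. It shows how a burst can cause drops even when the average production rate matches the host's drain rate.

```python
from collections import deque


def simulate_buffer_pool(num_buffers, production_per_tick, drain_per_tick):
    """Toy model of the profiling buffer lifecycle (hypothetical, not runtime code).

    num_buffers:          size of the shared buffer pool
    production_per_tick:  sequence of buffers the producer wants to publish each tick
    drain_per_tick:       max buffers the host drains and returns each tick
    Returns (published, dropped) buffer counts.
    """
    free = deque(range(num_buffers))  # buffers available to the producer
    ready = deque()                   # full buffers waiting for the host
    published = 0
    dropped = 0

    for wanted in production_per_tick:
        # Producer side: take a free buffer, fill it, publish it to the ready queue.
        for _ in range(wanted):
            if free:
                ready.append(free.popleft())
                published += 1
            else:
                # No free buffer came back in time: the data is lost.
                dropped += 1
        # Host side: drain ready buffers and return them to the free queue.
        for _ in range(min(drain_per_tick, len(ready))):
            free.append(ready.popleft())

    return published, dropped


# Same average rate (2 buffers per tick), different burst shape.
steady = simulate_buffer_pool(4, [2, 2, 2], drain_per_tick=2)
bursty = simulate_buffer_pool(4, [6, 0, 0], drain_per_tick=2)
print(steady)  # (6, 0)
print(bursty)  # (4, 2)
```

In this model, the pool size bounds the largest burst that can be absorbed, while the drain rate only bounds the sustained rate; the real system may behave differently if buffers are returned asynchronously or in batches.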

This should be investigated independently from any specific model or library example. A synthetic AICPU producer is useful here because it can generate normal L2 swimlane profiling records at a controlled rate and let the host consume them through the regular profiling path.

Related: #997

Git Commit ID

5a6ad1f2cab80b9eea3e4350a2c93a80d53b6403

CANN Version

Not captured.

Driver Version

Not captured.

Host Platform

Linux (aarch64)

Reproduction

Use a temporary synthetic L2 swimlane producer to isolate the profiling path from real model execution:

  1. Short-circuit the normal AICore path so the kernel does not run real compute work.
  2. In the AICPU entry path, start one AICPU thread that writes normal L2SwimlaneAicpuTaskRecord buffers for a short fixed duration.
  3. Publish those buffers through the existing L2 swimlane ready-queue/free-queue mechanism.
  4. Let the normal host L2 swimlane management and collector path drain the data.
  5. Check whether the host keeps up or whether device-side profiling drops are reported.

The Python test used to launch the kernel should only act as a harness; the reproduction should not depend on pageattention, qwen, or any other specific workload behavior.
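Step 5 could be expressed in the harness as a small accounting check. The sketch below is hypothetical: `SyntheticRunStats`, `evaluate_run`, and the three counters are illustrative names, and it assumes the harness can obtain the number of records the producer attempted, the number the host collected, and the device-side drop count. Whether the runtime exposes these values in this form is not confirmed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticRunStats:
    """Counters for one synthetic run (hypothetical names, not a runtime API)."""
    records_attempted: int  # records the synthetic producer tried to write
    records_collected: int  # records the host collector actually received
    records_dropped: int    # drops reported by the device side


def evaluate_run(stats: SyntheticRunStats) -> dict:
    """Summarize whether the host kept up with the producer."""
    # Records that are neither collected nor reported as dropped point to
    # an accounting gap rather than a plain throughput problem.
    unaccounted = (
        stats.records_attempted - stats.records_collected - stats.records_dropped
    )
    drop_ratio = (
        stats.records_dropped / stats.records_attempted
        if stats.records_attempted
        else 0.0
    )
    return {
        "host_kept_up": stats.records_dropped == 0 and unaccounted == 0,
        "drop_ratio": drop_ratio,
        "unaccounted": unaccounted,
    }


result = evaluate_run(SyntheticRunStats(1000, 900, 100))
print(result["host_kept_up"], result["drop_ratio"], result["unaccounted"])
```

Separating `unaccounted` from `records_dropped` keeps silent loss distinguishable from reported drops, which matters when deciding whether the overflow behavior is easy to reason about.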

Expected Performance

For a controlled device-side producer running below the intended hardware bandwidth envelope, the host profiling path should drain and recycle buffers without scattered loss of profiling records. Failing that, it should expose clear backpressure or overflow behavior that is easy to reason about.

Actual Performance

Under bursty device-side profiling production, L2 swimlane can still report device-side dropped profiling records. This indicates that host-side drain/refill/collector progress can lag behind the AICPU producer and fail to return free buffers in time.

Profiling Data (Optional)

N/A. This issue intentionally tracks the problem statement and reproduction direction only; detailed benchmark numbers should live in the relevant PR or investigation notes.

Additional Context

The main question is whether the limiting factor is host-side profiling control-path throughput, buffer lifecycle design, or a lower-level device-to-host bandwidth limit. The synthetic AICPU producer should help separate this from workload-specific behavior.
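A back-of-envelope calculation can help frame which of these limits is plausible. The sketch below uses the 30 GB/s figure from the issue title as the producer rate; the buffer size, pool size, and host drain rate are placeholder assumptions, not measured or documented values, and the helper names are hypothetical.

```python
import math


def required_buffers_per_second(producer_bytes_per_s, buffer_bytes):
    """Buffers the host must drain and return per second to match the producer."""
    return producer_bytes_per_s / buffer_bytes


def seconds_until_pool_exhausted(pool_buffers, buffer_bytes,
                                 producer_bytes_per_s, drain_bytes_per_s):
    """Time until the free queue runs dry, starting from a fully free pool."""
    deficit = producer_bytes_per_s - drain_bytes_per_s
    if deficit <= 0:
        return math.inf  # host keeps up on average; only bursts can cause drops
    return pool_buffers * buffer_bytes / deficit


PRODUCER_RATE = 30e9      # bytes/s, taken from the issue title
BUFFER_BYTES = 1 << 20    # ASSUMPTION: 1 MiB per profiling buffer
POOL_BUFFERS = 64         # ASSUMPTION: 64 buffers in the pool
HOST_DRAIN_RATE = 20e9    # ASSUMPTION: host sustains 20 GB/s

print(required_buffers_per_second(PRODUCER_RATE, BUFFER_BYTES))
print(seconds_until_pool_exhausted(POOL_BUFFERS, BUFFER_BYTES,
                                   PRODUCER_RATE, HOST_DRAIN_RATE))
```

If the required buffer turnover is far above what the host control path can schedule, the bottleneck is more likely the control path or lifecycle design; if raw copy throughput falls short, it points toward the device-to-host bandwidth limit.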
