Skip to content

[Code Health] Reset AICPU profiling collectors' cached static state on disabled/base=0 launch (persistent .so hazard), symmetrically across a2a3 + a5 #1232

Description

@ChaoZheng109

Category

Robustness (potential edge-case failure)

Component

Platform (a2a3 / a2a3sim)

Description

The AICPU profiling collectors (L2Swimlane, PMU, DepGen, TensorDump, ScopeStats) keep file-local static state that persists across launches, because the AICPU .so is loaded once and reused. This state includes:

  • cached header pointers (s_*_header),
  • pool arrays (s_*_pools[]),
  • "current buffer" pointers (s_current_*_buffers[] / current_buf_ptr),
  • orch thread index (s_orch_thread_idx),
  • base / level globals (g_*_base, g_*_level).

Hazard — enabled → disabled launch sequence: when a subsequent launch turns profiling off, it calls set_platform_*_base(0) and/or set_*_enabled(false) but does not run *_aicpu_init (init only runs when profiling is enabled). The statics therefore retain the previous launch's pointers, which may point at memory that has since been freed/reallocated. Any record/complete path that becomes reachable in that state can dereference a stale pointer (potential use-after-free).

This is a pre-existing / latent hazard, not introduced by any single change. The a2a3 l2_swimlane_aicpu_init comment already references the same class of bug and its prior fix (#936). There is no proven live crash path today — record sites are gated (s_phase_initialized, is_*_enabled) — so this is filed as defense-in-depth / robustness rather than an active bug.

Why now / symmetry: PR #1162 introduced a per-collector reset_*_cached_state() helper (nulls all cached statics) and wired it into the disabled/base=0 setters — but only for a5 (L2/PMU/DepGen) and common (tensor_dump; scope_stats partial). The a2a3 arch-specific L2/PMU/DepGen collectors, and scope_stats' set_*_enabled(false) path, were left uncovered. Per review, PR #1162 is dropping this hardening from its scope entirely (it is unrelated to that PR's buffer-drop-reduction goal) so it can land focused; this issue tracks doing the hardening properly and symmetrically across both arches and all five collectors as a standalone change.

Acceptance: each collector defines a reset_*_cached_state() that nulls header + pools + current-buffer pointers + orch idx + level globals, and it is invoked from both set_platform_*_base(0) and set_*_enabled(false), on both a2a3 and a5 (and once in common for tensor_dump / scope_stats). Add a small onboard/sim check that an enabled→disabled→(record attempted) sequence does not touch stale state.

Reference mainline commit where the hazard exists: c080089d (merge-base of #1162).

Location

  • src/a2a3/platform/shared/aicpu/l2_swimlane_collector_aicpu.cppset_platform_l2_swimlane_base / set_l2_swimlane_enabled (no reset)
  • src/a2a3/platform/shared/aicpu/pmu_collector_aicpu.cppset_platform_pmu_base / set_pmu_enabled (no reset)
  • src/a2a3/platform/shared/aicpu/dep_gen_collector_aicpu.cppset_platform_dep_gen_base / set_dep_gen_enabled (no reset)
  • src/common/platform/shared/aicpu/scope_stats_collector_aicpu.cppset_scope_stats_enabled(false) does not reset header/state
  • (reference implementation to mirror) src/a5/platform/shared/aicpu/{l2_swimlane_collector_aicpu,pmu_collector_aicpu,dep_gen_collector_aicpu}.cppreset_*_cached_state() + call sites
  • (reference) src/common/platform/shared/aicpu/tensor_dump_aicpu.cppreset_dump_runtime_state()

Proposed Fix

Mirror the a5 pattern onto a2a3 (and complete scope_stats): add reset_l2_swimlane_cached_state() / reset_pmu_cached_state() / reset_dep_gen_cached_state() to the a2a3 collectors and call them from both the base == 0 and !enable branches of the respective setters; route set_scope_stats_enabled(false) through the same reset. Keep the reset helper as the single source of truth (also usable from *_finalize, as a5 dep_gen does). Do this as one focused PR so a2a3 and a5 stay in lockstep (the framework was unified in #944 and is meant to be reviewed against both arches).

Priority

Medium (minor risk, should fix in next few releases)

Related: #1161, #1162

Metadata

Metadata

Assignees

No one assigned

    Labels

    code healthTechnical debt, robustness, code quality

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions