
feat(hpm): Add hpmevent for CUTE#18

Merged: yu-yake2002 merged 1 commit into master from dev-hpm-v2r2a on Jun 4, 2026

ecall73 commented Jun 2, 2026

WARNING

THIS FEATURE SPANS A NESTED SUBMODULE BOUNDARY.

  • MERGE ORDER MUST BE INSIDE-OUT: XSAI/CUTE FIRST, THEN XSAI, THEN xsai-env.
  • THE PARENT REPOSITORY MUST ONLY POINT TO A FETCHABLE CHILD COMMIT THAT IS ALREADY MERGED OR OTHERWISE STABLE ON THE CHILD REMOTE.
  • DO NOT MERGE THE XSAI PR WHILE ITS CUTE GITLINK STILL POINTS TO A LOCAL-ONLY OR PR-ONLY COMMIT.

ASSOCIATED PR: XSAI#70

Summary

This PR connects CUTE performance events into XiangShan's native HPM infrastructure instead of keeping a separate CUTE-local PMU/CSR path.

The design goal is intentionally conservative:

  • no new CSR registers
  • no change to cycle / instret architectural behavior
  • keep the existing backend HPM prefix stable
  • append all CUTE events at the tail of the existing backend / mem event pools
  • preserve the existing 6-bit perf event path

In other words, this change makes CUTE events software-visible through the CPU core's existing mhpmevent / mhpmcounter mechanism rather than through an independent CSR family.

This PR also depends on a preceding backend event-pool stabilization step that landed as:

  • fix(HPM): align backend event index to kunminghu-v2

That earlier change is a prerequisite for this PR, not incidental cleanup. Without it, the backend event pool would still contain matrix-related insertions interleaved into the original kunminghu-v2 ordering, and the newly added CUTE backend events would not have a stable software-visible numbering base.

What Changes

CUTE side

  • Add lightweight perf probe bundles:
    • TaskControllerPerfProbe
    • LocalMMUPerfProbe
  • Add a minimal sideband output bundle:
    • CutePerfToCoreIO
  • Export CUTE perf candidates from:
    • TaskController
    • LocalMMU
  • Assemble those raw candidates in CUTETOP
  • Register the assembled perf sideband once at the CUTETOP output boundary before it crosses into the CPU core

XSAI side

  • Forward the CUTE perf sideband through:
    • XSCuteTop
    • XSTile
    • XSCore
  • Extend PerfCounterIO with:
    • perfEventsMatrixBackend
    • perfEventsMatrixMem
  • Preserve the original kunminghu-v2 backend 0..94 prefix by reusing the previously introduced base/ext split in:
    • Rename
    • CtrlBlock
    • Backend
  • Append backend-side CUTE events to the backend event pool tail
  • Append mem-side CUTE events to the MemBlock event pool tail

What this PR does not do

  • does not add a new AME-specific CSR map
  • does not restore the abandoned standalone PFEvent + HPerfMonitor + ame_* CSR implementation
  • does not modify Frontend
  • does not add CUTE-originated cache-internal events into CoupledL2/HuanCun

Event Placement

The new events are split by meaning (backend-side versus mem-side) rather than placed into a separate PMU domain.

Backend prefix stabilization

Before appending CUTE backend events, the backend event pool is first stabilized so that:

  • backend event IDs 0..94 remain identical to kunminghu-v2
  • previously added matrix backend events are moved behind that stable prefix

This is implemented by splitting the backend event pool into a stable base segment and an extension segment:

  • Rename exposes base/ext perf event views
  • CtrlBlock keeps the original rename-free old prefix in its base view
  • Backend rebuilds the final backend pool as:
    • stable old prefix first
    • pre-existing matrix backend extension next
    • newly added CUTE backend events last

As a result, this PR does not place the new CUTE backend events directly after 94. They are appended after the already-existing backend extension region created by the earlier backend index-alignment fix.

The pre-existing backend extension region occupies IDs 95..104 and consists of:

Event                                     ID
rename_stall_cycle_mx                     95
me_freelist_1_4_valid                     96
me_freelist_2_4_valid                     97
me_freelist_3_4_valid                     98
me_freelist_4_4_valid                     99
IssueQueueMsetmtilexriwmfMrelease_full    100
issueQueue_enq_fire_cnt                   101
IssueQueueMsetmtilexrmfwmf_full           102
IssueQueueMmaMarith_full                  103
IssueQueueMls_full                        104

Backend-appended events

The following 9 events are appended after the existing backend event pool tail:

  1. amu_active_cycle
  2. amu_retire
  3. amu_comp_done
  4. amu_release_done
  5. amu_mte_active
  6. amu_mma_nonfp
  7. amu_mma_fp16
  8. amu_mma_bf16
  9. amu_mma_tf32

For the current implementation, these appear after the already-stabilized backend prefix and extension region, so the final backend event IDs are:

Event               ID
amu_active_cycle    105
amu_retire          106
amu_comp_done       107
amu_release_done    108
amu_mte_active      109
amu_mma_nonfp       110
amu_mma_fp16        111
amu_mma_bf16        112
amu_mma_tf32        113
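
The ID assignment above can be cross-checked with a small positional model. This is only an illustrative Python sketch, not the Chisel code in Backend.scala: the helper name build_backend_ids and the constant names are hypothetical, and it assumes IDs are assigned purely by position in the concatenated pool (stable prefix, then matrix extension, then CUTE tail).

```python
# Illustrative model only: IDs are assumed to follow position in the pool.
STABLE_PREFIX_LEN = 95  # kunminghu-v2 backend events occupy IDs 0..94

# Pre-existing matrix backend extension region (IDs 95..104 in the PR).
MATRIX_EXT = [
    "rename_stall_cycle_mx",
    "me_freelist_1_4_valid",
    "me_freelist_2_4_valid",
    "me_freelist_3_4_valid",
    "me_freelist_4_4_valid",
    "IssueQueueMsetmtilexriwmfMrelease_full",
    "issueQueue_enq_fire_cnt",
    "IssueQueueMsetmtilexrmfwmf_full",
    "IssueQueueMmaMarith_full",
    "IssueQueueMls_full",
]

# Newly appended CUTE backend events, in the order listed in the PR.
CUTE_BACKEND = [
    "amu_active_cycle",
    "amu_retire",
    "amu_comp_done",
    "amu_release_done",
    "amu_mte_active",
    "amu_mma_nonfp",
    "amu_mma_fp16",
    "amu_mma_bf16",
    "amu_mma_tf32",
]


def build_backend_ids(prefix_len, ext, cute):
    """Assign IDs by position: stable prefix, then extension, then CUTE tail."""
    ids = {}
    next_id = prefix_len  # the prefix itself is untouched, so start after it
    for name in ext + cute:
        ids[name] = next_id
        next_id += 1
    return ids


backend_ids = build_backend_ids(STABLE_PREFIX_LEN, MATRIX_EXT, CUTE_BACKEND)
print(backend_ids["amu_active_cycle"], backend_ids["amu_mma_tf32"])  # 105 113
```

Because every new event is appended at the tail, adding or removing a CUTE event in this model never shifts an ID in the 0..104 range.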

Mem-appended events

The following 12 events are appended after the existing MemBlock event pool tail:

  1. amu_load_a_done
  2. amu_load_b_done
  3. amu_load_c_done
  4. amu_store_done
  5. amu_aml_active
  6. amu_bml_active
  7. amu_cml_load_active
  8. amu_cml_store_active
  9. amu_mem_rd_req
  10. amu_mem_wr_req
  11. amu_mem_rd_32B_req
  12. amu_mem_wr_32B_req

For the current implementation, the final mem event IDs are:

Event                   ID
amu_load_a_done         145
amu_load_b_done         146
amu_load_c_done         147
amu_store_done          148
amu_aml_active          149
amu_bml_active          150
amu_cml_load_active     151
amu_cml_store_active    152
amu_mem_rd_req          153
amu_mem_wr_req          154
amu_mem_rd_32B_req      155
amu_mem_wr_32B_req      156

Event Semantics

Backend probes

  • amu_active_cycle: the CUTE ownedWork signal, exposed to HPM under a more explicit public name
  • amu_retire: CUTE task retire pulse
  • amu_comp_done: compute completion pulse
  • amu_release_done: release issue/completion pulse in the current scheduler model
  • amu_mte_active: MTE busy cycle
  • amu_mma_nonfp/fp16/bf16/tf32: compute completion classified by MMA data type

Mem probes

  • amu_load_*_done / amu_store_done: loader/store completion pulses
  • amu_*_active: loader/store busy-cycle style probes
  • amu_mem_rd_req / amu_mem_wr_req: LocalMMU outgoing read/write request fires
  • amu_mem_rd_32B_req / amu_mem_wr_32B_req: outgoing traffic counted in 32B units

The 32B unit choice is deliberate. It preserves the existing 6-bit perf event width and avoids widening the global perf infrastructure just to carry byte-count values.
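
A quick arithmetic sketch of why the unit matters, written in Python for illustration. The 6-bit width and the 32B unit come from this PR; the 64-byte request size and the round-up behavior are assumptions made for the example and are not taken from LocalMMU.scala. The helper names are hypothetical.

```python
PERF_EVENT_WIDTH = 6                           # perf event width stated in the PR
MAX_EVENT_VALUE = (1 << PERF_EVENT_WIDTH) - 1  # largest value 6 bits can carry: 63
UNIT_BYTES = 32                                # traffic unit chosen by the PR


def units_32b(request_bytes):
    """Convert a request size to 32B units (round-up is an assumption)."""
    return (request_bytes + UNIT_BYTES - 1) // UNIT_BYTES


def fits_in_event(value):
    """True if the value can be carried on the 6-bit perf event path."""
    return 0 <= value <= MAX_EVENT_VALUE


request_bytes = 64  # hypothetical request size, e.g. one cache line
print(fits_in_event(request_bytes))             # False: 64 exceeds 63
print(fits_in_event(units_32b(request_bytes)))  # True: 2 units fit easily
```

Software that wants bytes can multiply the counter value by 32 after reading it.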

Timing Note

The original sideband wiring from CUTE into the core was effectively combinational until the event reached the native HPM logic.

To reduce physical-design risk, the assembled CUTE perf sideband is now registered once at the CUTETOP output boundary before crossing into the CPU core. This keeps the fix narrow:

  • the event definitions remain unchanged
  • the native HPM path remains unchanged
  • only the cross-module perf sideband gains one cycle of latency

This is a timing-oriented implementation detail, not a software-visible semantic change.
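
The claim that the register stage is not software-visible can be illustrated with a cycle-level model. This is a behavioral Python sketch, not the Chisel in CUTETOP.scala; register_stage is a hypothetical helper and the pulse sequence is made-up sample data.

```python
def register_stage(samples, reset_value=0):
    """Model one register: the output at cycle n is the input at cycle n-1.

    Returns the delayed sequence and the value still held in the register
    after the last cycle (not yet seen by the consumer).
    """
    delayed = []
    state = reset_value
    for sample in samples:
        delayed.append(state)  # consumer sees last cycle's value
        state = sample         # register captures this cycle's value
    return delayed, state


# Made-up per-cycle event values leaving CUTE (0 = no event that cycle).
raw = [0, 1, 0, 2, 0, 0]
delayed, in_flight = register_stage(raw)

print(delayed)                             # [0, 0, 1, 0, 2, 0]
print(sum(raw), sum(delayed) + in_flight)  # 3 3
```

Every event arrives exactly one cycle later, and the accumulated count matches once the register has drained, which is why only latency changes and not the totals a counter eventually reports.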

Implementation Details

XSAI/CUTE

  • Bundles.scala
    • add CutePerfEventCounts
    • add TaskControllerPerfProbe
    • add LocalMMUPerfProbe
    • add CutePerfToCoreIO
  • TaskController.scala
    • export done/retire/active probe signals
  • LocalMMU.scala
    • export request-fire and 32B-unit traffic probes
  • CUTETOP.scala
    • expose perf
    • assemble backend/mem raw candidate events
    • add one output-side register stage for the perf sideband

XSAI

  • cutewrapper/XSCuteTop.scala
    • forward the CUTE perf sideband
  • xiangshan/XSTile.scala
    • connect CUTE perf sideband into the core
    • provide zero default when CUTE is absent
  • xiangshan/XSCore.scala
    • consume the sideband and map it into backend/mem perf inputs
  • xiangshan/backend/fu/CSR.scala
    • extend PerfCounterIO with backend/mem CUTE perf inputs
  • xiangshan/backend/rename/Rename.scala
    • keep the old rename/free-list prefix intact
    • move pre-existing matrix rename events into the backend extension region
  • xiangshan/backend/CtrlBlock.scala
    • expose backend base/ext event views so the old ctrlblock prefix remains stable
  • xiangshan/backend/Backend.scala
    • reuse the stabilized backend base/ext ordering introduced by fix(HPM): align backend event index to kunminghu-v2
    • append backend CUTE events at the backend pool tail
  • xiangshan/mem/MemBlock.scala
    • append mem CUTE events at the mem pool tail

Validation

  • Scala/elaboration-level compile check passed:
    • mill -i xiangshan.compile
    • make xsai
  • AM test case updated and used to validate native HPM programming through:
    • mhpmevent11..18 for backend-side CUTE events
    • mhpmevent19..26 for mem-side CUTE events
  • The test confirmed:
    • backend CUTE events can be selected and counted through native HPM
    • mem CUTE request / traffic events can be selected and counted through native HPM
    • per-case counter reset by writing 0 to mhpmcounter* behaves as expected

Review Focus

  • correctness of the backend/mem event split
  • stability of existing backend and mem event prefixes
  • correctness of the final event ordering and IDs
  • correctness of 32B-unit traffic accounting
  • safety of the single register stage added at the CUTETOP perf boundary
  • nested-submodule integration between XSAI/CUTE and XSAI

@yu-yake2002 yu-yake2002 merged commit 9323565 into master Jun 4, 2026
@ecall73 ecall73 deleted the dev-hpm-v2r2a branch June 5, 2026 06:20