feat(hpm): Add hpmevent for CUTE by ecall73 · Pull Request #18 · OpenXiangShan/CUTE

ecall73 · 2026-06-02T17:42:11Z

WARNING

THIS FEATURE SPANS A NESTED SUBMODULE BOUNDARY.

MERGE ORDER MUST BE INSIDE-OUT: XSAI/CUTE FIRST, THEN XSAI, THEN xsai-env.
THE PARENT REPOSITORY MUST ONLY POINT TO A FETCHABLE CHILD COMMIT THAT IS ALREADY MERGED OR OTHERWISE STABLE ON THE CHILD REMOTE.
DO NOT MERGE THE XSAI PR WHILE ITS CUTE GITLINK STILL POINTS TO A LOCAL-ONLY OR PR-ONLY COMMIT.

ASSOCIATED PR: XSAI#70

Summary

This PR connects CUTE performance events into XiangShan's native HPM infrastructure instead of keeping a separate CUTE-local PMU/CSR path.

The design goal is intentionally conservative:

- no new CSR registers
- no change to cycle / instret architectural behavior
- keep the existing backend HPM prefix stable
- append all CUTE events at the tail of the existing backend / mem event pools
- preserve the existing 6-bit perf event path

In other words, this change makes CUTE events software-visible through the CPU core's existing mhpmevent / mhpmcounter mechanism rather than through an independent CSR family.

This PR also depends on a preceding backend event-pool stabilization step that landed as:

fix(HPM): align backend event index to kunminghu-v2

That earlier change is a prerequisite for this PR, not incidental cleanup. Without it, the backend event pool would still contain matrix-related insertions interleaved into the original kunminghu-v2 ordering, and the newly added CUTE backend events would have no stable software-visible numbering base.

What Changes

CUTE side

- Add lightweight perf probe bundles:
  - TaskControllerPerfProbe
  - LocalMMUPerfProbe
- Add a minimal sideband output bundle:
  - CutePerfToCoreIO
- Export CUTE perf candidates from:
  - TaskController
  - LocalMMU
- Assemble those raw candidates in CUTETOP
- Register the assembled perf sideband once at the CUTETOP output boundary before it crosses into the CPU core

XSAI side

- Forward the CUTE perf sideband through:
  - XSCuteTop
  - XSTile
  - XSCore
- Extend PerfCounterIO with:
  - perfEventsMatrixBackend
  - perfEventsMatrixMem
- Preserve the original kunminghu-v2 backend 0..94 prefix by reusing the previously introduced base/ext split in:
  - Rename
  - CtrlBlock
  - Backend
- Append backend-side CUTE events to the backend event pool tail
- Append mem-side CUTE events to the MemBlock event pool tail

What this PR does not do

- does not add a new AME-specific CSR map
- does not restore the abandoned standalone PFEvent + HPerfMonitor + ame_* CSR implementation
- does not modify Frontend
- does not add CUTE-originated cache-internal events into CoupledL2/HuanCun

Event Placement

The new events are split by function: each one is appended to the backend or mem event pool it logically belongs to, rather than placed into a separate PMU domain.

Backend prefix stabilization

Before appending CUTE backend events, the backend event pool is first stabilized so that:

- backend event IDs 0..94 remain identical to kunminghu-v2
- previously added matrix backend events are moved behind that stable prefix

This is implemented by splitting the backend event pool into a stable base segment and an extension segment:

- Rename exposes base/ext perf event views
- CtrlBlock keeps the original rename-free old prefix in its base view
- Backend rebuilds the final backend pool as:
  - stable old prefix first
  - pre-existing matrix backend extension next
  - newly added CUTE backend events last

As a result, this PR does not place the new CUTE backend events directly after 94. They are appended after the already-existing backend extension region created by the earlier backend index-alignment fix.

The pre-existing backend extension region occupies IDs 95..104 and consists of:

Event	ID
`rename_stall_cycle_mx`	95
`me_freelist_1_4_valid`	96
`me_freelist_2_4_valid`	97
`me_freelist_3_4_valid`	98
`me_freelist_4_4_valid`	99
`IssueQueueMsetmtilexriwmfMrelease_full`	100
`issueQueue_enq_fire_cnt`	101
`IssueQueueMsetmtilexrmfwmf_full`	102
`IssueQueueMmaMarith_full`	103
`IssueQueueMls_full`	104

Backend-appended events

The following 9 events are appended after the existing backend event pool tail:

- amu_active_cycle
- amu_retire
- amu_comp_done
- amu_release_done
- amu_mte_active
- amu_mma_nonfp
- amu_mma_fp16
- amu_mma_bf16
- amu_mma_tf32

For the current implementation, these appear after the already-stabilized backend prefix and extension region, so the final backend event IDs are:

Event	ID
`amu_active_cycle`	105
`amu_retire`	106
`amu_comp_done`	107
`amu_release_done`	108
`amu_mte_active`	109
`amu_mma_nonfp`	110
`amu_mma_fp16`	111
`amu_mma_bf16`	112
`amu_mma_tf32`	113
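The three-segment ordering described above can be sketched as a plain list concatenation. This is only an illustrative model in Python, not the Chisel code in Backend.scala: the placeholder names `base_0..base_94` and the helper functions are hypothetical, while the extension and CUTE event names and the expected IDs are taken from the tables in this description.

```python
# Illustrative model of the backend event pool ordering (not the RTL).
BASE_LEN = 95  # IDs 0..94: the stable kunminghu-v2 prefix

# Pre-existing matrix backend extension region (IDs 95..104 per the table above).
EXT_EVENTS = [
    "rename_stall_cycle_mx",
    "me_freelist_1_4_valid",
    "me_freelist_2_4_valid",
    "me_freelist_3_4_valid",
    "me_freelist_4_4_valid",
    "IssueQueueMsetmtilexriwmfMrelease_full",
    "issueQueue_enq_fire_cnt",
    "IssueQueueMsetmtilexrmfwmf_full",
    "IssueQueueMmaMarith_full",
    "IssueQueueMls_full",
]

# Newly appended CUTE backend events.
CUTE_BACKEND_EVENTS = [
    "amu_active_cycle",
    "amu_retire",
    "amu_comp_done",
    "amu_release_done",
    "amu_mte_active",
    "amu_mma_nonfp",
    "amu_mma_fp16",
    "amu_mma_bf16",
    "amu_mma_tf32",
]


def build_backend_pool():
    """Concatenate: stable prefix, then extension, then CUTE events."""
    # Placeholder names stand in for the real kunminghu-v2 events (hypothetical).
    base = [f"base_{i}" for i in range(BASE_LEN)]
    return base + EXT_EVENTS + CUTE_BACKEND_EVENTS


def event_ids(pool):
    """Map each event name to its position in the pool, i.e. its event ID."""
    return {name: idx for idx, name in enumerate(pool)}


backend_ids = event_ids(build_backend_pool())
print(backend_ids["rename_stall_cycle_mx"],
      backend_ids["amu_active_cycle"],
      backend_ids["amu_mma_tf32"])
```

Because the CUTE events are the last segment, any later growth of the pool can follow the same append-only pattern without disturbing the IDs listed above.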

Mem-appended events

The following 12 events are appended after the existing MemBlock event pool tail:

- amu_load_a_done
- amu_load_b_done
- amu_load_c_done
- amu_store_done
- amu_aml_active
- amu_bml_active
- amu_cml_load_active
- amu_cml_store_active
- amu_mem_rd_req
- amu_mem_wr_req
- amu_mem_rd_32B_req
- amu_mem_wr_32B_req

For the current implementation, the final mem event IDs are:

Event	ID
`amu_load_a_done`	145
`amu_load_b_done`	146
`amu_load_c_done`	147
`amu_store_done`	148
`amu_aml_active`	149
`amu_bml_active`	150
`amu_cml_load_active`	151
`amu_cml_store_active`	152
`amu_mem_rd_req`	153
`amu_mem_wr_req`	154
`amu_mem_rd_32B_req`	155
`amu_mem_wr_32B_req`	156
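The append-at-tail property for the mem pool can be checked with a small model. This sketch assumes the pre-existing MemBlock pool holds 145 events, which is inferred only from the first CUTE mem ID being 145; the placeholder names `mem_base_*` and the helper `append_tail` are hypothetical.

```python
# Illustrative check that appending at the tail leaves existing IDs untouched.
MEM_BASE_LEN = 145  # assumption: inferred from the first CUTE mem ID (145)

CUTE_MEM_EVENTS = [
    "amu_load_a_done",
    "amu_load_b_done",
    "amu_load_c_done",
    "amu_store_done",
    "amu_aml_active",
    "amu_bml_active",
    "amu_cml_load_active",
    "amu_cml_store_active",
    "amu_mem_rd_req",
    "amu_mem_wr_req",
    "amu_mem_rd_32B_req",
    "amu_mem_wr_32B_req",
]


def append_tail(existing, new_events):
    """Return a new pool with new_events placed after every existing event."""
    return list(existing) + list(new_events)


old_pool = [f"mem_base_{i}" for i in range(MEM_BASE_LEN)]  # placeholder names
new_pool = append_tail(old_pool, CUTE_MEM_EVENTS)

# Every pre-existing event keeps its original index.
prefix_stable = new_pool[:MEM_BASE_LEN] == old_pool

mem_ids = {name: new_pool.index(name) for name in CUTE_MEM_EVENTS}
print(prefix_stable, mem_ids["amu_load_a_done"], mem_ids["amu_mem_wr_32B_req"])
```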

Event Semantics

Backend probes

- amu_active_cycle: exported from CUTE ownedWork, exposed to HPM under the more explicit public name amu_active_cycle
- amu_retire: CUTE task retire pulse
- amu_comp_done: compute completion pulse
- amu_release_done: release issue/completion pulse in the current scheduler model
- amu_mte_active: MTE busy cycle
- amu_mma_nonfp/fp16/bf16/tf32: compute completion classified by MMA data type

Mem probes

- amu_load_*_done / amu_store_done: loader/store completion pulses
- amu_*_active: loader/store busy-cycle style probes
- amu_mem_rd_req / amu_mem_wr_req: LocalMMU outgoing read/write request fires
- amu_mem_rd_32B_req / amu_mem_wr_32B_req: outgoing traffic counted in 32B units

The 32B unit choice is deliberate. It preserves the existing 6-bit perf event width and avoids widening the global perf infrastructure just to carry byte-count values.
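A short arithmetic sketch shows why counting in 32B units helps. It assumes the 6-bit width refers to the per-cycle value an event can carry (0..63), and it assumes partial units are rounded up; neither the rounding rule nor the helper names come from the PR, so treat them as illustrative only.

```python
# Illustrative arithmetic for 32B-unit traffic accounting (assumptions noted).
UNIT_BYTES = 32
EVENT_VALUE_BITS = 6
MAX_EVENT_VALUE = (1 << EVENT_VALUE_BITS) - 1  # largest value a 6-bit field holds


def bytes_to_units(nbytes):
    """Convert a byte count to 32B units, rounding up (rounding is an assumption)."""
    return (nbytes + UNIT_BYTES - 1) // UNIT_BYTES


def fits_in_event_value(value):
    """True if the value is representable in the 6-bit event value field."""
    return 0 <= value <= MAX_EVENT_VALUE


# A 64-byte request is 2 units, which fits; the raw byte count 64 would not.
print(bytes_to_units(64), fits_in_event_value(bytes_to_units(64)),
      fits_in_event_value(64))

# The largest per-cycle transfer representable in units.
print(MAX_EVENT_VALUE * UNIT_BYTES)
```

Under these assumptions a single event value can describe up to 2016 bytes in one cycle, whereas a raw byte count would overflow the field at 64 bytes.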

Timing Note

The original sideband wiring from CUTE into the core was effectively combinational until the event reached the native HPM logic.

To reduce physical-design risk, the assembled CUTE perf sideband is now registered once at the CUTETOP output boundary before crossing into the CPU core. This keeps the fix narrow:

- the event definitions remain unchanged
- the native HPM path remains unchanged
- only the cross-module perf sideband gains one cycle of latency

This is a timing-oriented implementation detail, not a software-visible semantic change.
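The effect of the single register stage can be modeled as a one-cycle delay on the per-cycle event values. The sketch below is a behavioral model with a hypothetical helper `register_stage` and made-up sample values; it only illustrates that the delay shifts when increments arrive, not how many arrive.

```python
# Behavioral model of one register stage on the perf sideband (illustrative).
def register_stage(samples, reset_value=0):
    """Delay a per-cycle sample stream by one cycle.

    Returns (delayed_stream, in_flight), where in_flight is the value still
    held in the register after the last modeled cycle.
    """
    delayed = []
    reg = reset_value
    for sample in samples:
        delayed.append(reg)  # the output this cycle is last cycle's input
        reg = sample         # capture this cycle's input for the next cycle
    return delayed, reg


# Hypothetical per-cycle event values leaving CUTETOP.
samples = [0, 1, 1, 0, 2, 0]
delayed, in_flight = register_stage(samples)

print(delayed)
print(sum(samples), sum(delayed) + in_flight)
```

Every increment is either already delivered or still held in the register, so the accumulated counter value converges to the same total one cycle later.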

Implementation Details

`XSAI/CUTE`

- Bundles.scala
  - add CutePerfEventCounts
  - add TaskControllerPerfProbe
  - add LocalMMUPerfProbe
  - add CutePerfToCoreIO
- TaskController.scala
  - export done/retire/active probe signals
- LocalMMU.scala
  - export request-fire and 32B-unit traffic probes
- CUTETOP.scala
  - expose perf
  - assemble backend/mem raw candidate events
  - add one output-side register stage for the perf sideband

`XSAI`

- cutewrapper/XSCuteTop.scala
  - forward the CUTE perf sideband
- xiangshan/XSTile.scala
  - connect CUTE perf sideband into the core
  - provide zero default when CUTE is absent
- xiangshan/XSCore.scala
  - consume the sideband and map it into backend/mem perf inputs
- xiangshan/backend/fu/CSR.scala
  - extend PerfCounterIO with backend/mem CUTE perf inputs
- xiangshan/backend/rename/Rename.scala
  - keep the old rename/free-list prefix intact
  - move pre-existing matrix rename events into the backend extension region
- xiangshan/backend/CtrlBlock.scala
  - expose backend base/ext event views so the old ctrlblock prefix remains stable
- xiangshan/backend/Backend.scala
  - reuse the stabilized backend base/ext ordering introduced by fix(HPM): align backend event index to kunminghu-v2
  - append backend CUTE events at the backend pool tail
- xiangshan/mem/MemBlock.scala
  - append mem CUTE events at the mem pool tail

Validation

- Scala/elaboration-level compile check passed:
  - mill -i xiangshan.compile
  - make xsai
- AM test case updated and used to validate native HPM programming through:
  - mhpmevent11..18 for backend-side CUTE events
  - mhpmevent19..26 for mem-side CUTE events
- The test confirmed:
  - backend CUTE events can be selected and counted through native HPM
  - mem CUTE request / traffic events can be selected and counted through native HPM
  - per-case counter reset by writing 0 to mhpmcounter* behaves as expected
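For readers mapping the counter ranges above to CSR numbers, the following sketch computes the standard RISC-V machine-mode addresses, where mhpmeventN is 0x320 + N and mhpmcounterN is 0xB00 + N for N in 3..31. The helper names are hypothetical, and the sketch does not model the bit layout of the mhpmevent value itself, which is specific to the core and not described in this PR.

```python
# CSR address helper for the HPM registers used in validation (illustrative).
MHPMEVENT_BASE = 0x320    # mhpmevent3 lives at 0x323
MHPMCOUNTER_BASE = 0xB00  # mhpmcounter3 lives at 0xB03


def _check_index(n):
    if not 3 <= n <= 31:
        raise ValueError(f"HPM index must be in 3..31, got {n}")


def mhpmevent_csr(n):
    """CSR address of mhpmeventN."""
    _check_index(n)
    return MHPMEVENT_BASE + n


def mhpmcounter_csr(n):
    """CSR address of mhpmcounterN."""
    _check_index(n)
    return MHPMCOUNTER_BASE + n


# Counter ranges used by the AM test, as stated in the validation notes.
BACKEND_RANGE = range(11, 19)  # mhpmevent11..18
MEM_RANGE = range(19, 27)      # mhpmevent19..26

for label, indices in (("backend", BACKEND_RANGE), ("mem", MEM_RANGE)):
    first, last = indices[0], indices[-1]
    print(label, hex(mhpmevent_csr(first)), hex(mhpmevent_csr(last)),
          hex(mhpmcounter_csr(first)), hex(mhpmcounter_csr(last)))
```

Each range holds eight counters, so a test that covers all 9 backend or 12 mem CUTE events would need to reprogram the selectors between cases, which is consistent with the per-case counter reset mentioned above.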

Review Focus

- correctness of the backend/mem event split
- stability of existing backend and mem event prefixes
- correctness of the final event ordering and IDs
- correctness of 32B-unit traffic accounting
- safety of the single register stage added at the CUTETOP perf boundary
- nested-submodule integration between XSAI/CUTE and XSAI

feat(hpm): Add hpmevent for CUTE

5acdab1

ecall73 requested review from Wonicon, cailuoshan, wakafa1 and yu-yake2002 June 2, 2026 17:42

ecall73 self-assigned this Jun 2, 2026

ecall73 mentioned this pull request Jun 2, 2026

feat(hpm): Add hpmevent for CUTE OpenXiangShan/XSAI#70

Merged

wakafa1 approved these changes Jun 4, 2026

yu-yake2002 merged commit 9323565 into master Jun 4, 2026

ecall73 deleted the dev-hpm-v2r2a branch June 5, 2026 06:20
