feat(hpm): Add hpmevent for CUTE #18
Merged
wakafa1
approved these changes
Jun 4, 2026
WARNING
This feature spans a nested submodule boundary.
Merge `XSAI/CUTE` first, then `XSAI`, then `xsai-env`. Do not merge the `XSAI` PR while its `CUTE` gitlink still points to a local-only or PR-only commit.
Associated PR: XSAI#70
Summary
This PR connects CUTE performance events into XiangShan's native HPM infrastructure instead of keeping a separate CUTE-local PMU/CSR path.
The design goal is intentionally conservative:
- `cycle/instret` architectural behavior
In other words, this change makes CUTE events software-visible through the CPU core's existing `mhpmevent/mhpmcounter` mechanism rather than through an independent CSR family.
This PR also depends on a preceding backend event-pool stabilization step that landed as:
- `fix(HPM): align backend event index to kunminghu-v2`
That earlier change is part of the story here, not just incidental cleanup. Without it, the backend event pool would still contain matrix-related insertions interleaved into the original kunminghu-v2 ordering, and the newly added CUTE backend events would not have a stable software-visible numbering base.
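The numbering argument above can be illustrated with a small model. This is a minimal sketch, not the Chisel implementation: it assumes an event ID is simply the event's position in the pool, uses placeholder names (`base_evt_*`, `ext_evt_*`) for the existing events, and takes the segment sizes (95 base events, 10 extension events, 9 CUTE backend events) from the event placement details later in this description. The insertion point used for the interleaved layout is arbitrary.

```python
# Hypothetical model: an event's ID is its index in the pool.
BASE = [f"base_evt_{i}" for i in range(95)]  # stands in for kunminghu-v2 IDs 0..94
EXT = [f"ext_evt_{i}" for i in range(10)]    # stands in for the extension region
CUTE = [
    "amu_active_cycle", "amu_retire", "amu_comp_done", "amu_release_done",
    "amu_mte_active", "amu_mma_nonfp", "amu_mma_fp16", "amu_mma_bf16",
    "amu_mma_tf32",
]


def ids(pool):
    """Map each event name to its index-based ID."""
    return {name: idx for idx, name in enumerate(pool)}


def interleaved(base, extra, at):
    """Insert extra events in the middle of the base pool (unstable layout)."""
    return base[:at] + extra + base[at:]


def appended(base, *segments):
    """Keep the base pool as a prefix and append every other segment after it."""
    pool = list(base)
    for segment in segments:
        pool += segment
    return pool


def moved(base, pool):
    """List the base events whose ID differs between the base pool and `pool`."""
    before, after = ids(base), ids(pool)
    return [name for name in base if before[name] != after[name]]


unstable = interleaved(BASE, EXT, at=40)  # arbitrary insertion point
stable = appended(BASE, EXT, CUTE)

print(len(moved(BASE, unstable)))  # 55 base events change ID
print(len(moved(BASE, stable)))    # 0 base events change ID
print(ids(stable)["amu_active_cycle"], ids(stable)["amu_mma_tf32"])  # 105 113
```

Under these assumptions, an append-only layout leaves every base ID untouched, and the CUTE backend events land contiguously after the extension region.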
What Changes
CUTE side
- `TaskControllerPerfProbe`
- `LocalMMUPerfProbe`
- `CutePerfToCoreIO`
- `TaskController`
- `LocalMMU`
- `CUTETOP`
- Register the perf sideband at the `CUTETOP` output boundary before it crosses into the CPU core
XSAI side
- `XSCuteTop`
- `XSTile`
- `XSCore`
- `PerfCounterIO` with:
  - `perfEventsMatrixBackend`
  - `perfEventsMatrixMem`
- Keep the `0..94` prefix by reusing the previously introduced base/ext split in:
  - `Rename`
  - `CtrlBlock`
  - `Backend`
What this PR does not do
- `PFEvent + HPerfMonitor + ame_* CSR` implementation
- `Frontend`
- `CoupledL2/HuanCun`
Event Placement
The new events are split by meaning rather than placed into a separate PMU domain.
Backend prefix stabilization
Before appending CUTE backend events, the backend event pool is first stabilized so that:
- IDs `0..94` remain identical to `kunminghu-v2`
This is implemented by splitting the backend event pool into a stable base segment and an extension segment:
- `Rename` exposes base/ext perf event views
- `CtrlBlock` keeps the original rename-free old prefix in its base view
- `Backend` rebuilds the final backend pool as:
As a result, this PR does not place the new CUTE backend events directly after `94`. They are appended after the already-existing backend extension region created by the earlier backend index-alignment fix.
The pre-existing backend extension region occupies IDs `95..104` and consists of:
- `rename_stall_cycle_mx`
- `me_freelist_1_4_valid`
- `me_freelist_2_4_valid`
- `me_freelist_3_4_valid`
- `me_freelist_4_4_valid`
- `IssueQueueMsetmtilexriwmfMrelease_full`
- `issueQueue_enq_fire_cnt`
- `IssueQueueMsetmtilexrmfwmf_full`
- `IssueQueueMmaMarith_full`
- `IssueQueueMls_full`
Backend-appended events
The following 9 events are appended after the existing backend event pool tail:
- `amu_active_cycle`
- `amu_retire`
- `amu_comp_done`
- `amu_release_done`
- `amu_mte_active`
- `amu_mma_nonfp`
- `amu_mma_fp16`
- `amu_mma_bf16`
- `amu_mma_tf32`
For the current implementation, these appear after the already-stabilized backend prefix and extension region, so the final backend event IDs are:
- `amu_active_cycle`
- `amu_retire`
- `amu_comp_done`
- `amu_release_done`
- `amu_mte_active`
- `amu_mma_nonfp`
- `amu_mma_fp16`
- `amu_mma_bf16`
- `amu_mma_tf32`
Mem-appended events
The following 12 events are appended after the existing MemBlock event pool tail:
- `amu_load_a_done`
- `amu_load_b_done`
- `amu_load_c_done`
- `amu_store_done`
- `amu_aml_active`
- `amu_bml_active`
- `amu_cml_load_active`
- `amu_cml_store_active`
- `amu_mem_rd_req`
- `amu_mem_wr_req`
- `amu_mem_rd_32B_req`
- `amu_mem_wr_32B_req`
For the current implementation, the final mem event IDs are:
- `amu_load_a_done`
- `amu_load_b_done`
- `amu_load_c_done`
- `amu_store_done`
- `amu_aml_active`
- `amu_bml_active`
- `amu_cml_load_active`
- `amu_cml_store_active`
- `amu_mem_rd_req`
- `amu_mem_wr_req`
- `amu_mem_rd_32B_req`
- `amu_mem_wr_32B_req`
Event Semantics
Backend probes
- `amu_active_cycle`: exported from CUTE `ownedWork`, exposed to HPM under the more explicit public name `amu_active_cycle`
- `amu_retire`: CUTE task retire pulse
- `amu_comp_done`: compute completion pulse
- `amu_release_done`: release issue/completion pulse in the current scheduler model
- `amu_mte_active`: MTE busy cycle
- `amu_mma_nonfp/fp16/bf16/tf32`: compute completion classified by MMA data type
Mem probes
- `amu_load_*_done` / `amu_store_done`: loader/store completion pulses
- `amu_*_active`: loader/store busy-cycle style probes
- `amu_mem_rd_req` / `amu_mem_wr_req`: LocalMMU outgoing read/write request fires
- `amu_mem_rd_32B_req` / `amu_mem_wr_32B_req`: outgoing traffic counted in `32B` units
The `32B` unit choice is deliberate. It preserves the existing 6-bit perf event width and avoids widening the global perf infrastructure just to carry byte-count values.
Timing Note
The original sideband wiring from CUTE into the core was effectively combinational until the event reached the native HPM logic.
To reduce physical-design risk, the assembled CUTE perf sideband is now registered once at the `CUTETOP` output boundary before crossing into the CPU core. This keeps the fix narrow.
This is a timing-oriented implementation detail, not a software-visible semantic change.
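As a rough illustration of why a single register stage does not change what software observes, the sketch below models the sideband as a per-cycle list of event values and delays it by one cycle. The function name `register_once` and the sample pulse train are hypothetical. The model assumes the register resets to 0 and that the counter simply accumulates whatever arrives each cycle.

```python
def register_once(samples, reset=0):
    """Model one pipeline register: the output at cycle n is the input at cycle n-1.

    Returns the delayed stream and the value still held in the register
    after the last modelled cycle.
    """
    delayed = []
    held = reset
    for sample in samples:
        delayed.append(held)  # the register drives last cycle's value
        held = sample         # and captures this cycle's value
    return delayed, held


# Hypothetical per-cycle event values, ending with idle cycles.
pulses = [0, 1, 0, 0, 1, 1, 0, 0]
delayed, in_flight = register_once(pulses)

print(delayed)                                   # [0, 0, 1, 0, 0, 1, 1, 0]
print(sum(pulses) == sum(delayed) + in_flight)   # True
```

In this model every event is counted exactly once, only one cycle later, which is consistent with treating the register as a timing detail rather than a semantic change.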
Implementation Details
XSAI/CUTE
- `Bundles.scala`
  - `CutePerfEventCounts`
  - `TaskControllerPerfProbe`
  - `LocalMMUPerfProbe`
  - `CutePerfToCoreIO`
- `TaskController.scala`
- `LocalMMU.scala`
  - `32B`-unit traffic probes
- `CUTETOP.scala`
  - `perf`
XSAI
- `cutewrapper/XSCuteTop.scala`
- `xiangshan/XSTile.scala`
- `xiangshan/XSCore.scala`
- `xiangshan/backend/fu/CSR.scala`
  - `PerfCounterIO` with backend/mem CUTE perf inputs
- `xiangshan/backend/rename/Rename.scala`
- `xiangshan/backend/CtrlBlock.scala`
- `xiangshan/backend/Backend.scala`
  - `fix(HPM): align backend event index to kunminghu-v2`
- `xiangshan/mem/MemBlock.scala`
Validation
- `mill -i xiangshan.compile`
- `make xsai`
- `mhpmevent11..18` for backend-side CUTE events
- `mhpmevent19..26` for mem-side CUTE events
- `0` to `mhpmcounter*` behaves as expected
Review Focus
- `32B`-unit traffic accounting
- `CUTETOP` perf boundary
- `XSAI/CUTE` and `XSAI`
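For reviewers looking at the `32B`-unit traffic accounting, the arithmetic behind the unit choice can be sketched as follows. This is an illustrative model only. It assumes the 6-bit perf event width bounds the value an event can report in one cycle, and that partial units round up; neither the rounding rule nor the helper names (`to_32b_units`, `fits_event_width`) come from the PR.

```python
UNIT_BYTES = 32
EVENT_WIDTH_BITS = 6
MAX_EVENT_VALUE = (1 << EVENT_WIDTH_BITS) - 1  # largest value a 6-bit field holds


def to_32b_units(num_bytes):
    """Convert a byte count to 32B units, rounding up (rounding is an assumption)."""
    if num_bytes < 0:
        raise ValueError("byte count must be non-negative")
    return (num_bytes + UNIT_BYTES - 1) // UNIT_BYTES


def fits_event_width(value):
    """Check whether a per-cycle value fits in the 6-bit event field."""
    return 0 <= value <= MAX_EVENT_VALUE


request_bytes = 64  # hypothetical request size
print(fits_event_width(request_bytes))                # False: 64 needs 7 bits
print(to_32b_units(request_bytes))                    # 2
print(fits_event_width(to_32b_units(request_bytes)))  # True
print(MAX_EVENT_VALUE * UNIT_BYTES)                   # 2016 bytes representable per cycle
```

Under these assumptions, a raw byte count of 64 already overflows a 6-bit field, while the same traffic expressed in `32B` units fits with room to spare.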