feat(pmu): integrate CUTEPMU AME perf counters#67

Closed
ecall73 wants to merge 1 commit into master from dev-pmu

ecall73 commented May 28, 2026

DO NOT MERGE THIS BEFORE MERGING CUTE#15


Summary

This PR adds the AME PMU path for CUTE and wires it into the XSAI CSR/perf plumbing.
This description covers both layers:

  • XSAI/CUTE: add CUTEPMU, local probes, and AME perf sideband
  • XSAI: connect the sideband into XSCore, XSTile, and NewCSR

The new PMU is meant to be software-facing and lightweight:

  • ame_mcycle / ame_minstret are exposed as fixed AME counters
  • ame_mhpmevent3..31 are writable programmable event selectors
  • ame_mhpmcounter3..31 count the selected events through an independent 8-bit AME increment path
  • the legacy 6-bit perf path stays unchanged
  • AME counters are only generated when MatAcc == CUTE

CSR Map

AME counter CSRs

[cols="^2,^2,^2,^2,10",options="header"]
|===
| CSR | Address | Privilege | Access | Description
| ame_scounteren | 0x5E6 | SRW | counteren gate | S-mode counter permission bits
| ame_hcounteren | 0x6C6 | HRW | counteren gate | H-mode counter permission bits
| ame_mcounteren | 0x7E8 | MRW | counteren gate | M-mode counter permission bits
| ame_mcountinhibit | 0x7E9 | MRW | inhibit bits | AME counter inhibit control
| ame_mhpmevent3..31 | 0xBC3..0xBDF | MRW | event cfg | 29 programmable AME event selectors
| ame_mcycle | 0xBE0 | MRW | fixed counter | AME cycle counter
| ame_minstret | 0xBE2 | MRW | fixed counter | AME retire counter
| ame_mhpmcounter3..31 | 0xBE3..0xBFF | MRW | programmable counters | AME programmable counters
| ame_cycle | 0xCE0 | URO | shadow | U-mode shadow of ame_mcycle
| ame_instret | 0xCE2 | URO | shadow | U-mode shadow of ame_minstret
| ame_hpmcounter3..31 | 0xCE3..0xCFF | URO | shadow | U-mode shadow of ame_mhpmcounter3..31
| ame_scountovf | 0xDE0 | SRO | overflow shadow | AME overflow vector shadow
|===

Notes

  • ame_mhpmevent3..31 use the same RISC-V event-field layout as the existing XiangShan perf counter family.
  • ame_cycle / ame_instret / ame_hpmcounter* are read-only shadows.
  • Because the target is RV64, no *h high-half counter CSRs are added.
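
The address ranges in the table line up so that counter number n sits at a fixed offset from each base. A minimal sketch of that arithmetic, assuming the ranges are linear in n (the table only gives the endpoints); the function and constant names are illustrative, not taken from CSRConst.scala:

```python
# Base addresses read off the CSR map table; n is the counter number.
AME_MHPMEVENT_BASE = 0xBC0  # ame_mhpmevent3..31   -> 0xBC3..0xBDF
AME_MCOUNTER_BASE = 0xBE0   # ame_mcycle (0), ame_minstret (2), ame_mhpmcounter3..31
AME_UCOUNTER_BASE = 0xCE0   # U-mode read-only shadows


def ame_csr_addrs(n: int) -> dict:
    """Return the CSR addresses tied to programmable AME counter n (3..31).

    Assumes address = base + n, which matches the endpoints in the table.
    """
    if not 3 <= n <= 31:
        raise ValueError("programmable AME counters are numbered 3..31")
    return {
        "ame_mhpmevent": AME_MHPMEVENT_BASE + n,
        "ame_mhpmcounter": AME_MCOUNTER_BASE + n,
        "ame_hpmcounter": AME_UCOUNTER_BASE + n,
    }
```

For example, `ame_csr_addrs(3)` gives 0xBC3, 0xBE3, and 0xCE3, the first entries of the three programmable ranges.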

CSR Bit Layouts

ame_mcountinhibit / ame_*counteren

[cols="^2,^2,10",options="header"]
|===
| Bit(s) | Name | Meaning
| 0 | CY | inhibit AME cycle counting
| 2 | IR | inhibit AME instruction-retire counting
| 31:3 | HPM3..31 | inhibit programmable AME HPM counters 3..31
|===
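
As a software-side illustration of this layout, the sketch below builds a value for ame_mcountinhibit. It assumes bit n of the 31:3 field corresponds to programmable counter n; the helper name is hypothetical.

```python
CY_BIT = 0  # inhibit AME cycle counting
IR_BIT = 2  # inhibit AME instruction-retire counting


def inhibit_mask(cycle: bool = False, instret: bool = False, hpm=()) -> int:
    """Build an ame_mcountinhibit value.

    hpm is an iterable of programmable counter numbers (3..31);
    bit n is assumed to inhibit counter n.
    """
    mask = 0
    if cycle:
        mask |= 1 << CY_BIT
    if instret:
        mask |= 1 << IR_BIT
    for n in hpm:
        if not 3 <= n <= 31:
            raise ValueError("programmable AME counters are numbered 3..31")
        mask |= 1 << n
    return mask
```

Inhibiting the cycle counter together with counters 3 and 31 yields `0x80000009`.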

ame_mhpmevent3..31

[cols="^2,^2,10",options="header"]
|===
| Bit(s) | Name | Meaning
| 63 | OF | sticky overflow flag, driven by counter overflow
| 62 | MINH | M-mode inhibit
| 61 | SINH | S-mode inhibit
| 60 | UINH | U-mode inhibit
| 59 | VSINH | VS-mode inhibit
| 58 | VUINH | VU-mode inhibit
| 54:50 | OPTYPE2 | event combination op for event group 2
| 49:45 | OPTYPE1 | event combination op for event group 1
| 44:40 | OPTYPE0 | event combination op for event group 0
| 39:30 | EVENT3 | event id group 3
| 29:20 | EVENT2 | event id group 2
| 19:10 | EVENT1 | event id group 1
| 9:0 | EVENT0 | event id group 0
|===
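
To make the field positions concrete, here is a small sketch that packs and unpacks an ame_mhpmevent value using the bit ranges from the table. The helper names are hypothetical, and the OPTYPE encodings are not described in this PR, so the example leaves them at zero.

```python
# (high bit, low bit) for each field, copied from the layout table.
FIELDS = {
    "OF": (63, 63), "MINH": (62, 62), "SINH": (61, 61), "UINH": (60, 60),
    "VSINH": (59, 59), "VUINH": (58, 58),
    "OPTYPE2": (54, 50), "OPTYPE1": (49, 45), "OPTYPE0": (44, 40),
    "EVENT3": (39, 30), "EVENT2": (29, 20), "EVENT1": (19, 10), "EVENT0": (9, 0),
}


def pack_mhpmevent(**values: int) -> int:
    """Pack named field values into a 64-bit ame_mhpmevent word."""
    word = 0
    for name, value in values.items():
        hi, lo = FIELDS[name]
        width = hi - lo + 1
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << lo
    return word


def unpack_mhpmevent(word: int) -> dict:
    """Split a 64-bit ame_mhpmevent word into its named fields."""
    return {
        name: (word >> lo) & ((1 << (hi - lo + 1)) - 1)
        for name, (hi, lo) in FIELDS.items()
    }
```

For instance, `pack_mhpmevent(EVENT0=16, UINH=1)` selects event id 16 in group 0 and sets the U-mode inhibit bit.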

ame_scountovf

[cols="^2,^2,10",options="header"]
|===
| Bit(s) | Name | Meaning
| 31:3 | OFVEC | overflow vector for AME HPM counters 3..31
| 2:0 | - | reserved / zero
|===
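
A reader of ame_scountovf would typically turn the vector into a list of counter numbers. A minimal sketch, assuming bit n of OFVEC reports programmable counter n; the function name is illustrative:

```python
def overflowed_counters(scountovf: int) -> list:
    """List the programmable AME counters (3..31) whose overflow bit is set.

    Bits 2:0 are reserved and ignored.
    """
    return [n for n in range(3, 32) if (scountovf >> n) & 1]
```

A raw value of `0b101000` would therefore report counters 3 and 5.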

Permission gating

AME counters use an independent counteren chain:

  • ame_mcounteren gates machine/supervisor/hypervisor visibility
  • ame_hcounteren gates VS/VU visibility
  • ame_scounteren gates access from user-level modes below S
  • AME URO shadow counters are still subject to the corresponding permission checks
  • ame_scountovf follows the same read-mask style as the existing scountovf
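
The chain above can be modeled roughly as follows. This is only a sketch: it assumes the AME chain mirrors the standard RISC-V counteren rules (M always allowed, S/HS gated by the M-level register, VS additionally by the H-level register, U and VU additionally by the S-level register). The PR does not spell out the exact rules, and the function name and mode strings are hypothetical.

```python
def shadow_readable(mode: str, n: int, mcounteren: int,
                    hcounteren: int, scounteren: int) -> bool:
    """Decide whether a read of AME shadow counter n is permitted.

    Assumption: same chaining as the standard mcounteren / hcounteren /
    scounteren registers; the real logic lives in CSRPermitModule.scala.
    """
    m = bool((mcounteren >> n) & 1)
    h = bool((hcounteren >> n) & 1)
    s = bool((scounteren >> n) & 1)
    if mode == "M":
        return True
    if mode in ("S", "HS"):
        return m
    if mode == "U":
        return m and s
    if mode == "VS":
        return m and h
    if mode == "VU":
        return m and h and s
    raise ValueError(f"unknown privilege mode: {mode}")
```

Under these assumptions, a U-mode read of counter 3 needs bit 3 set in both ame_mcounteren and ame_scounteren.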

Event Table

CUTEPMU uses a single event pool of 20 entries. Entry 0 is reserved as noEvent;
the remaining 19 entries (ids 1..19) are wired from TaskController and LocalMMU.

[cols="^2,^2,^2,10",options="header"]
|===
| ID | Name | Source | Description
| 0 | noEvent | - | reserved
| 1 | amu_load_a_done | TaskController | A-load completion
| 2 | amu_load_b_done | TaskController | B-load completion
| 3 | amu_load_c_done | TaskController | C-load completion
| 4 | amu_store_done | TaskController | store completion
| 5 | amu_comp_done | TaskController | compute completion
| 6 | amu_release_done | TaskController | release completion
| 7 | amu_mma_nonfp | TaskController | non-FP MMA completion
| 8 | amu_mma_fp16 | TaskController | FP16 MMA completion
| 9 | amu_mma_bf16 | TaskController | BF16 MMA completion
| 10 | amu_mma_tf32 | TaskController | TF32 MMA completion
| 11 | amu_aml_active | TaskController | AML busy cycle
| 12 | amu_bml_active | TaskController | BML busy cycle
| 13 | amu_cml_load_active | TaskController | CML-load busy cycle
| 14 | amu_mte_active | TaskController | MTE busy cycle
| 15 | amu_cml_store_active | TaskController | CML-store busy cycle
| 16 | amu_mem_rd_req | LocalMMU | read request fire
| 17 | amu_mem_wr_req | LocalMMU | write request fire
| 18 | amu_mem_rd_bytes_req | LocalMMU | read request bytes
| 19 | amu_mem_wr_bytes_req | LocalMMU | write request bytes
|===

Slot mapping

  • Each ame_mhpmevent3..31 selector maps one-to-one onto a counter slot and carries that slot's selected event id.
  • AmeCounterNum is 29, so the programmable window covers exactly 3..31.
  • The implementation supports only the currently wired 19 events; unused ids stay reserved for future expansion.
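
From the software side, the slot mapping amounts to assigning event ids to counters 3..31. The sketch below shows one way to plan that assignment from the event table; `plan_slots` and the dictionary are illustrative helpers, not part of the PR.

```python
# Event ids copied from the event table; id 0 is the reserved noEvent.
AME_EVENTS = {
    "noEvent": 0, "amu_load_a_done": 1, "amu_load_b_done": 2,
    "amu_load_c_done": 3, "amu_store_done": 4, "amu_comp_done": 5,
    "amu_release_done": 6, "amu_mma_nonfp": 7, "amu_mma_fp16": 8,
    "amu_mma_bf16": 9, "amu_mma_tf32": 10, "amu_aml_active": 11,
    "amu_bml_active": 12, "amu_cml_load_active": 13, "amu_mte_active": 14,
    "amu_cml_store_active": 15, "amu_mem_rd_req": 16, "amu_mem_wr_req": 17,
    "amu_mem_rd_bytes_req": 18, "amu_mem_wr_bytes_req": 19,
}
AME_COUNTER_NUM = 29  # programmable counters 3..31
FIRST_SLOT = 3


def plan_slots(event_names) -> dict:
    """Assign each requested event to the next free counter, starting at 3.

    Returns {counter number: event id}; raises KeyError for unknown names.
    """
    names = list(event_names)
    if len(names) > AME_COUNTER_NUM:
        raise ValueError("at most 29 programmable AME counters are available")
    return {FIRST_SLOT + i: AME_EVENTS[name] for i, name in enumerate(names)}
```

Each resulting event id would then be written into the EVENT0 field of the matching ame_mhpmevent selector.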

Behavior

  • ame_mcycle counts AME-owned work cycles.
  • ame_minstret counts retired AME instructions.
  • Programmable HPM counters count one selected event each.
  • Counter overflow sets OF and feeds ame_scountovf.
  • When enableAme is false, AME CSR logic and the AME sideband are not generated.
  • ame_mhpmevent combination semantics follow the existing XiangShan 4-event composition scheme.
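
The counting and overflow behavior can be pictured with a small software model. It is a sketch under stated assumptions: counters are taken to be 64 bits wide, each step adds an increment that fits the 8-bit AME path, and OF is sticky until software clears it. The class name is hypothetical and the model is not derived from the RTL.

```python
MASK64 = (1 << 64) - 1


class AmeCounterModel:
    """Toy model of one programmable AME counter and its OF flag."""

    def __init__(self) -> None:
        self.value = 0
        self.of = False  # assumed sticky; mirrors the OF bit in ame_mhpmevent

    def step(self, increment: int, inhibited: bool = False) -> None:
        """Apply one cycle's increment unless the counter is inhibited."""
        if not 0 <= increment < 256:
            raise ValueError("increment must fit the 8-bit AME path")
        if inhibited:
            return
        total = self.value + increment
        if total > MASK64:
            self.of = True  # wrap-around sets the overflow flag
        self.value = total & MASK64
```

In this model, the OF flags of counters 3..31 would be the bits collected into ame_scountovf.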

Implementation Details

XSAI/CUTE

  • Bundles.scala
    • add PerfEventAme(value: UInt(8.W))
    • add AmeCSRWriteBundle(addr: UInt(12.W), data: UInt(64.W))
    • add AmePerfFromCSRIO / AmePerfToCoreIO / CutePerfIO
    • add TaskControllerPerfProbe and LocalMMUPerfProbe
  • CUTEParameters.scala
    • add AmeCounterNum = 29
    • keep outsideDataWidthByte as the byte unit used by memory byte counters
  • TaskController.scala
    • export the done/active probes used by AME PMU
    • keep the existing task ownership and retire logic unchanged
  • LocalMMU.scala
    • export request-side read/write and byte probes
    • byte counters use PopCount(RequestMask)
  • CUTETOP.scala
    • instantiate CUTEPMU
    • wire fromCSR.csrW, taskProbe, mmuProbe, and toCore
  • CUTEPMU.scala
    • implement PFEventAme, HPerfCounterAme, HPerfMonitorAme
    • map the 20-entry event pool into per-slot programmable counters

XSAI

  • XSTile.scala
    • add the AME perf sideband between XSCute and XSCore
    • gate the wiring with enableAme = HasMatrixExtension && (MatAccKey == MatAcc.CUTE)
  • XSCuteTop.scala
    • forward the AME CSR write broadcast and return AME perf increments
  • XSCore.scala
    • extend the core IO with AME perf inputs/outputs
  • CSR.scala
    • extend PerfCounterIO with AME-domain counters
  • NewCSR.scala
    • consume AME perf inputs
    • wire AME counteren / countinhibit / mhpmevent / shadow counters
  • MachineLevel.scala
    • add ame_mcounteren, ame_mcountinhibit, ame_mhpmevent3..31, ame_mcycle, ame_minstret, ame_mhpmcounter3..31
  • SupervisorLevel.scala
    • add ame_scounteren, ame_scountovf
  • HypervisorLevel.scala
    • add ame_hcounteren
  • Unprivileged.scala
    • expose ame_cycle, ame_instret, ame_hpmcounter*
  • CSRPermitModule.scala
    • add AME address decoding and permission-chain logic
  • CSRConst.scala
    • define the final AME CSR address map

Review Focus

  • AME address map and privilege gating
  • mhpmevent combination and OF behavior
  • CUTEPMU event coverage and the 19 currently implemented event ids
  • keeping the AME path isolated from the legacy 6-bit perf path
  • nested-submodule boundary between XSAI/CUTE and XSAI

ecall73 closed this Jun 2, 2026