MPAM: Pull Request: CPU-less feature, numa id as domain id, performance fix #328

Open
fyu1 wants to merge 10 commits into NVIDIA:24.04_linux-nvidia-6.17-next from fyu1:24.04_linux-nvidia-6.17-next

@fyu1 fyu1 commented Feb 23, 2026

Please merge the following MPAM commits:

  1. DGX-15400 MPAM MBW monitoring events missing in 6.17 devel. This issue is fixed by this commit:
    NVIDIA: SAUCE: arm_mpam: Fix missing mbm_local_bytes and mbm_total_bytes

  2. DGX-15561 MPAM: stream performance degradation on 6.17-devel and 6.19-rc upstream v3. This issue is fixed by this commit:
    NVIDIA: SAUCE: arm_mpam: Fix memory access performance issue due to too small mbw_min

  3. CPU-less memory node enabling and NUMA node id as domain id. Commits:

NVIDIA: SAUCE: arm_mpam: Fix support for CPU-less NUMA nodes in memory...
NVIDIA: SAUCE: arm_mpam: Add memory type checks to support mbw monitor event assignment mode
NVIDIA: SAUCE: arm_mpam: Handle CPU-less numa nodes
NVIDIA: SAUCE: arm_mpam: Include all associated MSC components during domain setup
NVIDIA: SAUCE: arm_mpam: Sort the domain list by domain-id

  4. MBW_MIN support. Commits:
    NVIDIA: SAUCE: arm_mpam: Add support for MBW_MIN

  5. misc fixes:

NVIDIA: SAUCE: fs/resctrl: Export the closid/rmid to user-space
NVIDIA: SAUCE: arm_mpam: Avoid MSC teardown for the SW programming errors

shankerd04 and others added 9 commits February 23, 2026 18:34
resctrl expects the domain list to be sorted by id.

Do that.

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
[ morse: Pulled out of a larger patch ]
Signed-off-by: James Morse <james.morse@arm.com>
(forward ported from commit 2549a35ffbfd18d785bb35b39107de93d4bd3c7f https://git-master.nvidia.com/r/a/linux-stable)
[fenghuay: Remove "FIX ME" in the subject to avoid confusion.]
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
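
For readers following along, the ordering requirement in the commit above can be illustrated with a minimal user-space sketch. It is not the driver's code: `struct domain` and `domain_insert_sorted()` are hypothetical names, and a plain singly linked list stands in for the kernel's list primitives. The sketch only shows one way to keep a list ordered by id at insertion time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a resctrl domain header; not the kernel struct. */
struct domain {
	int id;
	struct domain *next;
};

/*
 * Insert 'd' into the list at *head so that ids stay in ascending order.
 * Returns 0 on success, -1 if a domain with the same id is already present.
 */
static int domain_insert_sorted(struct domain **head, struct domain *d)
{
	struct domain **pos = head;

	assert(head != NULL && d != NULL);

	/* Walk to the first entry whose id is not smaller than the new one. */
	while (*pos != NULL && (*pos)->id < d->id)
		pos = &(*pos)->next;

	if (*pos != NULL && (*pos)->id == d->id)
		return -1;

	/* Splice the new entry in front of that position. */
	d->next = *pos;
	*pos = d;
	return 0;
}
```

Inserting in order keeps every later walk of the list sorted without a separate sort pass, which is the property the commit message says resctrl relies on.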
… domain setup

The current MPAM driver only considers the first component associated
with an online/offline CPU during domain creation and teardown. This
is insufficient, as CPU-initiated traffic may traverse multiple MSCs
before reaching the target, and each MSC must be programmed consistently
for proper resource partitioning.

Update the MPAM driver to include all components associated with a
given CPU during domain setup/teardown to expose expected schemata
to userspace for effective resource control.

Change-Id: I1eb106495f4e2d4d50cd3d7f2c41800a314764c3
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
(forward ported from commit fe7dfd164dda542070ca533715c8ec53b1b08fe0 https://git-master.nvidia.com/r/a/linux-stable)
[fenghuay: solve conflicts, change cpu parameter in
mpam_resctrl_offline_domain_hdr(), change cpu parameter in
mpam_resctrl_alloc_domain_cpu(), change dom->comp to dom->ctrl_comp]
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
…rors

No need to destroy the MSC instance on user/admin programming errors,
since they do not cause any functional issues.

Change-Id: I7734c7d63e8f38d038ba202dcb1da8102183a2eb
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
(cherry picked from commit abb499798dfe50a93a8e8b376af85e0cf614cb5f https://git-master.nvidia.com/r/a/linux-stable)
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
MPAM supports minimum bandwidth partitioning. Add logic to handle
MBW_MIN. The unimplemented bits in MPAMCFG_MBW_MIN are RAZ/WI, so
masking is unnecessary. Apply the same logic to MPAMCFG_MBW_MAX
and MPAMCFG_CMAX to simplify the code and match 'cat schemata'
values to user programmed inputs.

Change-Id: I5b1ce4be69a5d75e8814ebaad7acfe061add2e0b
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
(cherry picked from commit 0e5902b38181666cd4a247eadd30c4e6cbcea1c0 https://git-master.nvidia.com/r/a/linux-stable)
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
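
The RAZ/WI argument in the commit above can be checked with a small software model. Everything here is an assumption made for illustration: `struct fake_reg`, `impl_mask()`, `reg_write()` and `reg_read()` are hypothetical, and the model assumes a 16-bit field in which only the most significant bits are implemented. It shows that a software-side mask before the write changes nothing when the hardware already ignores writes to unimplemented bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 16-bit control field: only the top 'impl_bits'
 * bits are implemented, the remaining low bits are RAZ/WI. */
struct fake_reg {
	unsigned int impl_bits;	/* 1..16 */
	uint16_t stored;
};

/* Mask covering the implemented (most significant) bits. */
static uint16_t impl_mask(unsigned int impl_bits)
{
	assert(impl_bits >= 1 && impl_bits <= 16);
	return (uint16_t)(0xFFFFu << (16 - impl_bits));
}

/* Model of the hardware: writes to unimplemented bits are ignored. */
static void reg_write(struct fake_reg *r, uint16_t val)
{
	r->stored = val & impl_mask(r->impl_bits);
}

/* Model of the hardware: unimplemented bits read as zero. */
static uint16_t reg_read(const struct fake_reg *r)
{
	return r->stored;
}
```

In this model, writing `0xABCD` to a field with 7 implemented bits reads back as `0xAA00` whether or not software masks the value first. A driver that wants `cat schemata` to echo the user's input would therefore report its cached value rather than the register readback.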
Control and monitor groups have a CLOSID and/or RMID that is used to
count the cache usage and memory bandwidth of tasks in this group.

Not all of MPAM's counters can be exposed via resctrl, as each counter
also needs a monitor to be allocated. It is unlikely there are enough
monitors for every RMID to have a monitor permanently allocated.

To allow counters to be read via perf, the RMID that a control
or monitor group is using needs exposing to user-space. This can be
passed back to perf as a parameter. MPAM's PMG values are not
unique, the PARTID needs to be provided too. Perf allows a number of
u64 arguments, which is not enough to encode a control/monitor group
name.

Similarly, there has been some interest in allowing cgroup to manage
the tasks file for resctrl. Exposing a unique identifier for each
control or monitor group will allow cgroups to point to a resctrl
group that holds its configuration.

Provide a file in each control or monitor group that returns a unique
identifier. When passed back to the kernel, resctrl can decode this
into a closid/rmid, or just identify the control or monitor group.

The value is XORed with a value picked at boot as obfuscation. This
is to prevent user-space from relying on the layout of this field,
or re-using values between boots of the system. This is to allow the
kernel to change the layout of this field in the future.

Change-Id: I5ce7fcbbfb90edc8a104ecc0fec2d7ec0b8583e4
Signed-off-by: James Morse <james.morse@arm.com>
[sonthineni: Fix build warning messages for v6.13]
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
(cherry picked from commit 808be5354fb01bcb62d2405631f61fc874d5747c https://git-master.nvidia.com/r/a/linux-stable)
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
In a NUMA system, each node may include CPUs, memory, MPAM MSC
instances, or any combination thereof. Some high-end servers may
have NUMA nodes that include MPAM MSC but no CPUs. In such cases,
associate all possible CPUs for those MSCs.

Change-Id: Id3e26278b7ced9e7866f8ec6c77f99430e5dad60
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
(cherry picked from commit c92f60d532b4d281592c26f3a409998a568c4150 https://git-master.nvidia.com/r/a/linux-stable)
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
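
The fallback described in the commit above reduces to a single decision, sketched here with a 64-bit bitmap standing in for a kernel cpumask. `cpu_mask_t` and `msc_affinity()` are hypothetical names, and the sketch assumes at most 64 CPUs.

```c
#include <assert.h>
#include <stdint.h>

/* One bit per CPU; a hypothetical stand-in for a kernel cpumask. */
typedef uint64_t cpu_mask_t;

/*
 * Pick the CPUs an MSC is associated with: the CPUs of its NUMA node,
 * or every possible CPU when the node has none.
 */
static cpu_mask_t msc_affinity(cpu_mask_t node_cpus, cpu_mask_t possible_cpus)
{
	/* A node cannot contain CPUs that are not possible CPUs. */
	assert((node_cpus & ~possible_cpus) == 0);

	return node_cpus ? node_cpus : possible_cpus;
}
```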
Fix sighting: DGX-15400 MPAM MBW monitoring events missing in 6.17 devel

Memory bandwidth monitoring event mbm_local_bytes is missing due to
bugs in MPAM driver:

1. type is not passed to arg in mpam_msmon_read();
2. After llc occupancy event is handled, mpam_resctrl_pick_counters()
   returns without continuing to handle mbw local and total bytes
   events.

Fix both issues so that the mbw local and total bytes events can be enabled.

Fixes: 2470378 ("NVIDIA: SAUCE: arm_mpam: Use long MBWU counters if supported")
Fixes: 977c7eb ("NVIDIA: SAUCE: untested: arm_mpam: resctrl: pick classes for use as mbm counters")
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
…r event assignment mode

The mbm_total_bytes event is now in the mon_MB_xx file. Add a class memory
type check so that the event is allowed there.

This enables NUMA NID support for mb_event counter assignment mode.

Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
…y bandwidth monitoring and control

Fix multiple issues that prevent MBM and MBA from working on CPU-less NUMA nodes.

Add mutex_lock/_unlock(&domain_list_lock) for proper synchronization.

Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
@fyu1 fyu1 requested review from clsotog and jamieNguyenNVIDIA and removed request for clsotog February 23, 2026 19:54

fyu1 commented Feb 23, 2026

min_hw_granule = ~max_hw_value;
if (mpam_has_feature(mpam_feat_mbw_max, cfg)) {
	u16 delta = ((5 * MPAMCFG_MBW_MAX_MAX) / 100) - 1;

Collaborator

This needs a comment on why 5% less than MAX_BW instead of == MAX_BW.


clsotog commented Feb 24, 2026

@fyu1
In commit b39b9c3, the sign-off does not look correct, as Jamie pointed out in another MR.

Also, some commits cite https://git-master.nvidia.com/r/a/linux-stable as their cherry-pick source, but when I click that link I get Not Found.

…oo small mbw_min

DGX-15561 MPAM: stream performance degradation on 6.17-devel and 6.19-rc
upstream v3

mbw_min sets the minimum memory bandwidth. If mbw_min is set too small
at boot time, memory bandwidth can be low under memory contention.
In some cases, this value is 1, which means memory bandwidth can
be as low as 1% of total memory bandwidth. This degrades memory access
performance.

According to T241-MPAM-4 erratum:
    In the T241 implementation of memory-bandwidth partitioning, in the
    absence of contention for bandwidth, the minimum bandwidth setting
    can affect the amount of achieved bandwidth. Specifically, the
    achieved bandwidth in the absence of contention can settle to any
    value between the values of MPAMCFG_MBW_MIN and MPAMCFG_MBW_MAX.
    Also, if MPAMCFG_MBW_MIN is set to zero (below 0.78125%), once a core
    enters a throttled state, it will never leave that state.
    The first issue is not a concern if the MPAM software allows
    programming MPAMCFG_MBW_MIN through the sysfs interface. This patch
    ensures that MBW_MIN=1 (0.78125%) is programmed whenever
    MPAMCFG_MBW_MIN=0 is requested.
    In the scenario where resctrl doesn't support the MBW_MIN
    interface via sysfs, to achieve bandwidth closer to MBW_MAX in the
    absence of contention, software should configure a relatively narrow
    gap between MBW_MIN and MBW_MAX. The recommendation is to use a 5%
    gap to mitigate the problem.

The new workaround is:

1. Set mbw_min to 95% of mbw_max so memory bandwidth will be used as
   much as possible.
2. If for any reason the calculated 95% of mbw_max is smaller than
   1, mbw_min falls back to 1 to avoid entering the throttled state.

This is backported from MPAM series 2 v5 that is being reviewed on LKML:
https://lore.kernel.org/lkml/20260224175720.2663924-39-ben.horgan@arm.com/

Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
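
The two rules of the workaround, as worded in the commit message above, can be written out as a short calculation. This is a sketch of the arithmetic only, not the patch's code: `mbw_min_from_max()` is a hypothetical name, the values are treated as plain unsigned 16-bit register units, and the 95% is taken relative to mbw_max as the message states. The diff hunk quoted in review instead derives a delta from 5% of `MPAMCFG_MBW_MAX_MAX`, which this sketch does not reproduce.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Derive mbw_min from mbw_max:
 *   1. take 95% of mbw_max (integer division rounds down);
 *   2. never go below 1, so the "MBW_MIN == 0" throttling case is avoided.
 * Note: for mbw_max == 0 the result (1) exceeds mbw_max; a real driver
 * would have to decide how to treat that corner case.
 */
static uint16_t mbw_min_from_max(uint16_t mbw_max)
{
	/* Widen first so the multiplication cannot overflow 16 bits. */
	uint32_t min = ((uint32_t)mbw_max * 95u) / 100u;

	if (min < 1)
		min = 1;

	assert(min <= UINT16_MAX);
	return (uint16_t)min;
}
```

For example, an mbw_max of 100 units gives an mbw_min of 95, and an mbw_max of 1 falls back to the floor of 1.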
@fyu1 fyu1 force-pushed the 24.04_linux-nvidia-6.17-next branch from b39b9c3 to 769cf7e Compare February 24, 2026 20:35
5 participants