2024-03-29 01:07:09

by Moger, Babu

Subject: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)


This series adds support for Assignable Bandwidth Monitoring Counters
(ABMC). The feature is also called QoS RMID Pinning.

The feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC). The documentation is available at
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537

The patches are based on top of commit
cd80c2c94699913f9334414189487ff3f93cf0b5 (tip/master)

# Introduction

AMD hardware can support 256 or more RMIDs. However, the bandwidth monitoring
feature only guarantees that RMIDs currently assigned to a processor will
be tracked by hardware. The counters of any other RMIDs which are no longer
being tracked will be reset to zero. The MBM event counters return
"Unavailable" for the RMIDs that are not active.

Users can create 256 or more monitor groups, but only a limited number of
groups can give guaranteed monitoring numbers. With ever-changing
configurations, there is no way to know definitively which of these groups
will be active at a given point in time. Users do not have the option to
monitor a group or set of groups for a certain period of time without
worrying about the RMID being reset in between.

The ABMC feature provides an option to the user to assign an RMID to the
hardware counter and monitor the bandwidth for a longer duration.
The assigned RMID will be active until the user unassigns it manually.
There is no need to worry about counters being reset during this period.
Additionally, the user can specify a bitmask identifying the specific
bandwidth types from the given source to track with the counter.

Without ABMC enabled, monitoring works in the current (legacy) mode, without
the assignment option.

# Linux Implementation

The Linux resctrl subsystem provides an interface to count a maximum of two
memory bandwidth events per group, from a combination of the available total
and local events. Keeping the current interface, users can assign a maximum
of two ABMC counters per group. Users also have the option to assign only
one counter to a group. If the system runs out of assignable ABMC
counters, the kernel reports an error. Users then need to unassign an already
assigned counter to make room for new assignments.


# Examples

a. Check if ABMC support is available
#mount -t resctrl resctrl /sys/fs/resctrl/

#cat /sys/fs/resctrl/info/L3_MON/mbm_assign
[abmc]
legacy_mbm

The Linux kernel detected the ABMC feature and it is enabled.
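
A tool that consumes this file only needs the bracketed entry. The following
is a minimal userspace sketch, assuming the file content has already been read
into a string; the helper name active_mbm_mode() is hypothetical and not part
of this series.

```c
#include <string.h>

/*
 * Copy the bracketed (currently enabled) mode out of the text read from
 * info/L3_MON/mbm_assign, e.g. "[abmc]\nlegacy_mbm\n" yields "abmc".
 * Returns 0 on success, -1 if there is no bracketed entry or it does
 * not fit into the caller's buffer.
 */
static int active_mbm_mode(const char *text, char *mode, size_t len)
{
	const char *lb = strchr(text, '[');
	const char *rb = lb ? strchr(lb, ']') : NULL;
	size_t n;

	if (!rb)
		return -1;

	n = (size_t)(rb - lb - 1);
	if (n == 0 || n >= len)
		return -1;

	memcpy(mode, lb + 1, n);
	mode[n] = '\0';
	return 0;
}
```

The same helper applies after switching modes, when the brackets move to the
legacy_mbm entry.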

b. Check how many ABMC counters are available.

#cat /sys/fs/resctrl/info/L3_MON/mbm_assign_cntrs
32

c. Create a few resctrl groups.

# mkdir /sys/fs/resctrl/mon_groups/default_mon1
# mkdir /sys/fs/resctrl/non_defult_group
# mkdir /sys/fs/resctrl/non_defult_group/mon_groups/non_default_mon1

d. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control
to list and modify the group's assignment states.

The list uses the following format:

* Default CTRL_MON group:
"//<domain_id>=<assignment_flags>"

* Non-default CTRL_MON group:
"<CTRL_MON group>//<domain_id>=<assignment_flags>"

* Child MON group of default CTRL_MON group:
"/<MON group>/<domain_id>=<assignment_flags>"

* Child MON group of non-default CTRL_MON group:
"<CTRL_MON group>/<MON group>/<domain_id>=<assignment_flags>"

Assignment flags can be one of the following:

t MBM total event is assigned
l MBM local event is assigned
tl Both total and local MBM events are assigned
_ None of the MBM events are assigned

Examples:

# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
non_defult_group//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
non_defult_group/non_default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
/default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;

There are four groups, and all of them have both the local and total events assigned.

"//" - This is the default CTRL_MON group

"non_defult_group//" - This is a non-default CTRL_MON group

"/default_mon1/" - This is a child MON group of the default group

"non_defult_group/non_default_mon1/" - This is a child MON group of the non-default group

=tl means both total and local events are assigned.
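
To make the list format concrete, here is a small userspace sketch that splits
one line of mbm_assign_control into its CTRL_MON group, MON group and
per-domain flags. The struct and function names (group_state,
parse_assign_line) and the size limits are assumptions chosen for
illustration; they do not come from the kernel patches.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_DOMAINS 16	/* illustrative limit, not from the series */
#define NAME_LEN    64	/* illustrative limit, not from the series */

struct domain_state {
	int id;
	int total;	/* 't' flag present */
	int local;	/* 'l' flag present */
};

struct group_state {
	char ctrl[NAME_LEN];	/* empty: default CTRL_MON group */
	char mon[NAME_LEN];	/* empty: the CTRL_MON group itself */
	int nr_domains;
	struct domain_state dom[MAX_DOMAINS];
};

/* Parse "<ctrl>/<mon>/<id>=<flags>;<id>=<flags>;..." into *g. */
static int parse_assign_line(const char *line, struct group_state *g)
{
	const char *s1 = strchr(line, '/');
	const char *s2 = s1 ? strchr(s1 + 1, '/') : NULL;
	const char *p;

	if (!s2 || (size_t)(s1 - line) >= NAME_LEN ||
	    (size_t)(s2 - s1 - 1) >= NAME_LEN)
		return -1;

	/* Zeroing first keeps both names NUL-terminated after memcpy(). */
	memset(g, 0, sizeof(*g));
	memcpy(g->ctrl, line, (size_t)(s1 - line));
	memcpy(g->mon, s1 + 1, (size_t)(s2 - s1 - 1));

	p = s2 + 1;
	while (*p && *p != '\n') {
		struct domain_state *d;
		char *end;
		long id;

		if (g->nr_domains == MAX_DOMAINS)
			return -1;
		id = strtol(p, &end, 10);
		if (end == p || *end != '=')
			return -1;

		d = &g->dom[g->nr_domains++];
		d->id = (int)id;
		for (p = end + 1; *p && *p != ';' && *p != '\n'; p++) {
			if (*p == 't')
				d->total = 1;
			else if (*p == 'l')
				d->local = 1;
			else if (*p != '_')
				return -1;
		}
		if (*p == ';')
			p++;
	}
	return g->nr_domains ? 0 : -1;
}
```

Because group names cannot contain '/', the first two slashes are enough to
separate the two names from the domain list.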

e. Update the group assignment states using the interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control.

The write format is similar to the list format above, with the addition of
an op-code for the assignment operation.

* Default CTRL_MON group:
"//<domain_id><op-code><assignment_flags>"

* Non-default CTRL_MON group:
"<CTRL_MON group>//<domain_id><op-code><assignment_flags>"

* Child MON group of default CTRL_MON group:
"/<MON group>/<domain_id><op-code><assignment_flags>"

* Child MON group of non-default CTRL_MON group:
"<CTRL_MON group>/<MON group>/<domain_id><op-code><assignment_flags>"

Op-code can be one of the following:

= Update the assignment to match the flags
+ Assign the event(s) given by the flags
- Unassign the event(s) given by the flags
_ Unassign all the events
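
One way to read the op-code table is as set operations on a two-bit event
mask. The sketch below models that reading in plain C; the names EVT_TOTAL,
EVT_LOCAL and apply_assign_op() are hypothetical, and the kernel
implementation in this series may be structured differently.

```c
#define EVT_TOTAL 0x1	/* 't' flag */
#define EVT_LOCAL 0x2	/* 'l' flag */

/* Convert a flag string such as "tl", "t" or "_" into an event mask. */
static int flags_to_mask(const char *flags)
{
	int mask = 0;

	for (; *flags; flags++) {
		if (*flags == 't')
			mask |= EVT_TOTAL;
		else if (*flags == 'l')
			mask |= EVT_LOCAL;
		else if (*flags != '_')
			return -1;
	}
	return mask;
}

/*
 * Return the new event mask after applying <op><flags> to the current
 * mask, or -1 if the op-code or the flags are not recognized.
 */
static int apply_assign_op(int cur, char op, const char *flags)
{
	int mask = flags_to_mask(flags);

	if (mask < 0)
		return -1;

	switch (op) {
	case '=':
		return mask;		/* replace the assignment */
	case '+':
		return cur | mask;	/* add events, keep existing ones */
	case '-':
		return cur & ~mask;	/* drop only the named events */
	case '_':
		return 0;		/* drop everything */
	default:
		return -1;
	}
}
```

For example, a group that starts with both events assigned and receives "-l"
ends up with only the total event.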


Initial group status:

# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
non_default_ctrl_mon_grp//0=tl;1=tl;
non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
//0=tl;1=tl;
/child_default_mon_grp/0=tl;1=tl;


To update the default group to assign only the total event:
# echo "//0=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

Assignment status after the update:
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
non_default_ctrl_mon_grp//0=tl;1=tl;
non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
//0=t;1=t;
/child_default_mon_grp/0=tl;1=tl;

To update the MON group child_default_mon_grp to remove the local event:
# echo "/child_default_mon_grp/0-l" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

Assignment status after the update:
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
//0=t;1=t;
/child_default_mon_grp/0=t;1=t;
non_default_ctrl_mon_grp//0=tl;1=tl;
non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;

To update the MON group non_default_ctrl_mon_grp/child_non_default_mon_grp to
remove both local and total events:
# echo "non_default_ctrl_mon_grp/child_non_default_mon_grp/0_" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

Assignment status after the update:
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
//0=t;1=t;
/child_default_mon_grp/0=t;1=t;
non_default_ctrl_mon_grp//0=tl;1=tl;
non_default_ctrl_mon_grp/child_non_default_mon_grp/0=_;1=_;


f. Read the events mbm_total_bytes and mbm_local_bytes of the default group.
Reading the events does not change with ABMC. If an event is unassigned
when it is read, the read returns "Unavailable".

# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
779247936
# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
765207488

g. Users will have the option to go back to legacy_mbm mode if required.
This can be done using the following command.

# echo "legacy_mbm" > /sys/fs/resctrl/info/L3_MON/mbm_assign
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign
abmc
[legacy_mbm]


h. Check the bandwidth configuration for the group. Note that the bandwidth
configuration has domain scope. The total event defaults to 0x7F (to
count all the events) and the local event defaults to 0x15 (to count all
the local NUMA events). The event bitmap decoding is available at
https://www.kernel.org/doc/Documentation/x86/resctrl.rst
in section "mbm_total_bytes_config", "mbm_local_bytes_config":

#cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
0=0x7f;1=0x7f

#cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
0=0x15;1=0x15
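
As a reading aid for these bitmaps, the sketch below prints which bandwidth
sources a configuration value selects. The bit descriptions are quoted from
memory of the resctrl documentation referenced above and should be checked
against it; the function name describe_mbm_config() is hypothetical.

```c
#include <stdio.h>

#define MBM_NR_EVENT_BITS 7

/* Bit meanings as described in the "mbm_total_bytes_config" section. */
static const char *const mbm_event_names[MBM_NR_EVENT_BITS] = {
	"Reads to memory in the local NUMA domain",
	"Reads to memory in the non-local NUMA domain",
	"Non-temporal writes to local NUMA domain",
	"Non-temporal writes to non-local NUMA domain",
	"Reads to slow memory in the local NUMA domain",
	"Reads to slow memory in the non-local NUMA domain",
	"Dirty Victims from the QOS domain to all types of memory",
};

/* Print every source selected by mask; return how many bits were set. */
static int describe_mbm_config(unsigned int mask, FILE *out)
{
	int bit, n = 0;

	for (bit = 0; bit < MBM_NR_EVENT_BITS; bit++) {
		if (mask & (1u << bit)) {
			fprintf(out, "  bit %d: %s\n", bit,
				mbm_event_names[bit]);
			n++;
		}
	}
	return n;
}
```

Under this decoding, 0x15 selects bits 0, 2 and 4, which are the three
local-domain sources, and 0x7F selects all seven.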

j. Change the bandwidth source for domain 0 for the total event to count only reads.
Note that this change affects total events on domain 0.

#echo 0=0x33 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
#cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
0=0x33;1=0x7F

k. Now read the total event again. The mbm_total_bytes should display
only the read events.

#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
314101

l. Unmount resctrl

#umount /sys/fs/resctrl/

---
v3:
This series adds the support for global assignment mode discussed in
the thread. https://lore.kernel.org/lkml/[email protected]/
Removed the individual assignment mode and included the global assignment interface.
Added following interface files.
a. /sys/fs/resctrl/info/L3_MON/mbm_assign
Used for displaying the current assignment mode and switching between
ABMC and legacy mode.
b. /sys/fs/resctrl/info/L3_MON/mbm_assign_control
Used for listing the groups' assignment states and modifying them.
c. Most of the changes are related to the new interface.
d. Addressed the comments from Reinette, James and Peter.
e. I hope I have addressed most of the major feedback discussed. If I missed
something, it was not intentional. Please feel free to comment.
f. Sending this as an RFC as per Reinette's comment. So, this is still open
for discussion.

v2:
a. The major change is the way ABMC is enabled. Earlier, the user needed to
remount with -o abmc to enable the ABMC feature. That option is removed.
Now users can enable ABMC with "echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable".

b. Added new word 21 to x86/cpufeatures.h.

c. Display unsupported if user attempts to read the events when ABMC is enabled
and event is not assigned.

d. Display monitor_state as "Unsupported" when ABMC is disabled.

e. Text updates and rebase to latest tip tree (as of Jan 18).

f. This series is still work in progress. I am yet to hear from ARM developers.

v2:
https://lore.kernel.org/lkml/[email protected]/

v1:
https://lore.kernel.org/lkml/[email protected]/


Babu Moger (17):
x86/resctrl: Add support for Assignable Bandwidth Monitoring Counters
(ABMC)
x86/resctrl: Add ABMC feature in the command line options
x86/resctrl: Detect Assignable Bandwidth Monitoring feature details
x86/resctrl: Introduce resctrl_file_fflags_init
x86/resctrl: Introduce the interface to display the assignment state
x86/resctrl: Introduce interface to display number of ABMC counters
x86/resctrl: Add support to enable/disable ABMC feature
x86/resctrl: Initialize assignable counters bitmap
x86/resctrl: Introduce assign state for the mon group
x86/resctrl: Add data structures for ABMC assignment
x86/resctrl: Introduce mbm_total_cfg and mbm_local_cfg
x86/resctrl: Add the functionality to assign the RMID
x86/resctrl: Add the functionality to unassign the RMID
x86/resctrl: Enable ABMC by default on resctrl mount
x86/resctrl: Introduce the interface switch between ABMC and
legacy_mbm
x86/resctrl: Introduce interface to list assignment states of all the
groups
x86/resctrl: Introduce interface to modify assignment states of the
groups

.../admin-guide/kernel-parameters.txt | 2 +-
Documentation/arch/x86/resctrl.rst | 144 ++++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/msr-index.h | 2 +
arch/x86/kernel/cpu/cpuid-deps.c | 3 +
arch/x86/kernel/cpu/resctrl/core.c | 25 +-
arch/x86/kernel/cpu/resctrl/internal.h | 56 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 24 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 714 +++++++++++++++++-
arch/x86/kernel/cpu/scattered.c | 1 +
include/linux/resctrl.h | 12 +
11 files changed, 964 insertions(+), 20 deletions(-)

--
2.34.1


2024-03-29 01:07:38

by Moger, Babu

Subject: [RFC PATCH v3 02/17] x86/resctrl: Add ABMC feature in the command line options

Add the command line options to enable or disable the new resctrl feature
ABMC (Assignable Bandwidth Monitoring Counters).

Signed-off-by: Babu Moger <[email protected]>

---
v3: No changes

v2: No changes
---
Documentation/admin-guide/kernel-parameters.txt | 2 +-
Documentation/arch/x86/resctrl.rst | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 2 ++
3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bb884c14b2f6..b3a2e7f72462 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5551,7 +5551,7 @@
rdt= [HW,X86,RDT]
Turn on/off individual RDT features. List is:
cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
- mba, smba, bmec.
+ mba, smba, bmec, abmc.
E.g. to turn on cmt and turn off mba use:
rdt=cmt,!mba

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 6c245582d8fb..68df7751d1f5 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -26,6 +26,7 @@ MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation) "mba"
SMBA (Slow Memory Bandwidth Allocation) ""
BMEC (Bandwidth Monitoring Event Configuration) ""
+ABMC (Assignable Bandwidth Monitoring Counters) ""
=============================================== ================================

Historically, new features were made visible by default in /proc/cpuinfo. This
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 83e40341583e..57a8c6f30dd6 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -664,6 +664,7 @@ enum {
RDT_FLAG_MBA,
RDT_FLAG_SMBA,
RDT_FLAG_BMEC,
+ RDT_FLAG_ABMC,
};

#define RDT_OPT(idx, n, f) \
@@ -689,6 +690,7 @@ static struct rdt_options rdt_options[] __initdata = {
RDT_OPT(RDT_FLAG_MBA, "mba", X86_FEATURE_MBA),
RDT_OPT(RDT_FLAG_SMBA, "smba", X86_FEATURE_SMBA),
RDT_OPT(RDT_FLAG_BMEC, "bmec", X86_FEATURE_BMEC),
+ RDT_OPT(RDT_FLAG_ABMC, "abmc", X86_FEATURE_ABMC),
};
#define NUM_RDT_OPTIONS ARRAY_SIZE(rdt_options)

--
2.34.1


2024-03-29 01:07:50

by Moger, Babu

Subject: [RFC PATCH v3 01/17] x86/resctrl: Add support for Assignable Bandwidth Monitoring Counters (ABMC)

AMD hardware can support 256 or more RMIDs. However, the bandwidth monitoring
feature only guarantees that RMIDs currently assigned to a processor will
be tracked by hardware. The counters of any other RMIDs which are no longer
being tracked will be reset to zero. The MBM event counters return
"Unavailable" for the RMIDs that are not active.

Users can create 256 or more monitor groups, but only a limited number of
groups can give guaranteed monitoring numbers. With ever-changing
configurations, there is no way to know definitively which of these groups
will be active at a given point in time. Users do not have the option to
monitor a group or set of groups for a certain period of time without
worrying about the RMID being reset in between.

The ABMC feature provides an option to the user to assign an RMID to the
hardware counter and monitor the bandwidth for a longer duration.
The assigned RMID will be active until the user unassigns it manually.
There is no need to worry about counters being reset during this period.
Additionally, the user can specify a bitmask identifying the specific
bandwidth types from the given source to track with the counter.

The Linux resctrl subsystem provides an interface to count a maximum of two
memory bandwidth events per group, from a combination of the available total
and local events. Keeping the current interface, users can assign a maximum
of two ABMC counters per group. Users also have the option to assign only
one counter to a group. If the system runs out of assignable ABMC
counters, the kernel reports an error. Users then need to unassign an already
assigned counter to make room for new assignments.

AMD hardware provides a total of 32 ABMC counters when the feature is supported.

The feature can be detected via CPUID_Fn80000020_EBX_x00 bit 5.
Bits Description
5 ABMC (Assignable Bandwidth Monitoring Counters)

The feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Signed-off-by: Babu Moger <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
---
v3: Change because of rebase. Actual patch did not change.

v2: Added dependency on X86_FEATURE_BMEC.
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/kernel/cpu/cpuid-deps.c | 3 +++
arch/x86/kernel/cpu/scattered.c | 1 +
3 files changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index a38f8f9ba657..342b82ec15be 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -466,6 +466,7 @@
* Reuse free bits when adding new feature flags!
*/
#define X86_FEATURE_AMD_LBR_PMC_FREEZE (21*32+ 0) /* AMD LBR and PMC Freeze */
+#define X86_FEATURE_ABMC (21*32+ 1) /* "" Assignable Bandwidth Monitoring Counters */

/*
* BUG word(s)
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index b7174209d855..c1f2abb209b4 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -70,6 +70,9 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_CQM_MBM_LOCAL, X86_FEATURE_CQM_LLC },
{ X86_FEATURE_BMEC, X86_FEATURE_CQM_MBM_TOTAL },
{ X86_FEATURE_BMEC, X86_FEATURE_CQM_MBM_LOCAL },
+ { X86_FEATURE_ABMC, X86_FEATURE_CQM_MBM_TOTAL },
+ { X86_FEATURE_ABMC, X86_FEATURE_CQM_MBM_LOCAL },
+ { X86_FEATURE_ABMC, X86_FEATURE_BMEC },
{ X86_FEATURE_AVX512_BF16, X86_FEATURE_AVX512VL },
{ X86_FEATURE_AVX512_FP16, X86_FEATURE_AVX512BW },
{ X86_FEATURE_ENQCMD, X86_FEATURE_XSAVES },
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index a515328d9d7d..930655f22f75 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -47,6 +47,7 @@ static const struct cpuid_bit cpuid_bits[] = {
{ X86_FEATURE_MBA, CPUID_EBX, 6, 0x80000008, 0 },
{ X86_FEATURE_SMBA, CPUID_EBX, 2, 0x80000020, 0 },
{ X86_FEATURE_BMEC, CPUID_EBX, 3, 0x80000020, 0 },
+ { X86_FEATURE_ABMC, CPUID_EBX, 5, 0x80000020, 0 },
{ X86_FEATURE_PERFMON_V2, CPUID_EAX, 0, 0x80000022, 0 },
{ X86_FEATURE_AMD_LBR_V2, CPUID_EAX, 1, 0x80000022, 0 },
{ X86_FEATURE_AMD_LBR_PMC_FREEZE, CPUID_EAX, 2, 0x80000022, 0 },
--
2.34.1


2024-03-29 01:07:55

by Moger, Babu

Subject: [RFC PATCH v3 03/17] x86/resctrl: Detect Assignable Bandwidth Monitoring feature details

ABMC feature details are reported via CPUID Fn8000_0020_EBX_x5.
Bits Description
15:0 MAX_ABMC Maximum Supported Assignable Bandwidth
Monitoring Counter ID + 1

The feature details are documented in APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).
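
For readers who want to check the same CPUID fields from userspace, here is a
minimal sketch. It assumes a GCC or Clang toolchain that provides
__get_cpuid_count() in <cpuid.h>; the helper names abmc_supported(),
abmc_num_counters() and abmc_probe() are hypothetical. It combines the feature
bit from patch 01 (Fn8000_0020 EBX, subleaf 0, bit 5) with the counter field
described above, and mirrors the "+ 1" used in resctrl_arch_has_abmc().

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

#define ABMC_FEATURE_BIT 5

/* EBX of CPUID Fn8000_0020, subleaf 0: bit 5 advertises ABMC. */
static int abmc_supported(uint32_t ebx_subleaf0)
{
	return (ebx_subleaf0 >> ABMC_FEATURE_BIT) & 1;
}

/* EBX of CPUID Fn8000_0020, subleaf 5: bits 15:0 hold the max counter ID. */
static uint32_t abmc_num_counters(uint32_t ebx_subleaf5)
{
	return (ebx_subleaf5 & 0xFFFF) + 1;
}

#if defined(__x86_64__) || defined(__i386__)
/* Return the number of ABMC counters, or 0 if ABMC is not reported. */
static uint32_t abmc_probe(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(0x80000020, 0, &eax, &ebx, &ecx, &edx) ||
	    !abmc_supported(ebx))
		return 0;
	if (!__get_cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx))
		return 0;
	return abmc_num_counters(ebx);
}
#endif
```

With a field value of 31, abmc_num_counters() yields 32, matching the counter
count quoted elsewhere in this series.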

Signed-off-by: Babu Moger <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
---
v3: Removed changes related to mon_features.
Moved rdt_cpu_has to core.c and added new function resctrl_arch_has_abmc.
Also moved the fields mbm_assign_capable and mbm_assign_cntrs to
rdt_resource. (James)

v2: Changed the field name to mbm_assign_capable from abmc_capable.
---
arch/x86/kernel/cpu/resctrl/core.c | 17 +++++++++++++++++
arch/x86/kernel/cpu/resctrl/internal.h | 1 +
arch/x86/kernel/cpu/resctrl/monitor.c | 3 +++
include/linux/resctrl.h | 12 ++++++++++++
4 files changed, 33 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 57a8c6f30dd6..bb82b392cf5d 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -740,6 +740,23 @@ bool __init rdt_cpu_has(int flag)
return ret;
}

+inline bool __init resctrl_arch_has_abmc(struct rdt_resource *r)
+{
+ bool ret = rdt_cpu_has(X86_FEATURE_ABMC);
+ u32 eax, ebx, ecx, edx;
+
+ if (ret) {
+ /*
+ * Query CPUID_Fn80000020_EBX_x05 for number of
+ * ABMC counters
+ */
+ cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
+ r->mbm_assign_cntrs = (ebx & 0xFFFF) + 1;
+ }
+
+ return ret;
+}
+
static __init bool get_mem_config(void)
{
struct rdt_hw_resource *hw_res = &rdt_resources_all[RDT_RESOURCE_MBA];
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index c99f26ebe7a6..c4ae6f3993aa 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -584,6 +584,7 @@ void free_rmid(u32 closid, u32 rmid);
int rdt_get_mon_l3_config(struct rdt_resource *r);
void __exit rdt_put_mon_l3_config(void);
bool __init rdt_cpu_has(int flag);
+bool __init resctrl_arch_has_abmc(struct rdt_resource *r);
void mon_event_count(void *info);
int rdtgroup_mondata_show(struct seq_file *m, void *arg);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index c34a35ec0f03..e5938bf53d5a 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1055,6 +1055,9 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
mbm_local_event.configurable = true;
mbm_config_rftype_init("mbm_local_bytes_config");
}
+
+ if (resctrl_arch_has_abmc(r))
+ r->mbm_assign_capable = ABMC_ASSIGN;
}

l3_mon_evt_init(r);
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index a365f67131ec..a1ee9afabff3 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -150,6 +150,14 @@ struct resctrl_membw {
struct rdt_parse_data;
struct resctrl_schema;

+/**
+ * enum mbm_assign_type - The type of assignable monitoring.
+ * @ABMC_ASSIGN: Assignable Bandwidth Monitoring Counters.
+ */
+enum mbm_assign_type {
+ ABMC_ASSIGN = 0x01,
+};
+
/**
* struct rdt_resource - attributes of a resctrl resource
* @rid: The index of the resource
@@ -168,6 +176,8 @@ struct resctrl_schema;
* @evt_list: List of monitoring events
* @fflags: flags to choose base and info files
* @cdp_capable: Is the CDP feature available on this resource
+ * @mbm_assign_capable: Is the system capable of supporting monitor assignment?
+ * @mbm_assign_cntrs: Maximum number of assignable counters
*/
struct rdt_resource {
int rid;
@@ -188,6 +198,8 @@ struct rdt_resource {
struct list_head evt_list;
unsigned long fflags;
bool cdp_capable;
+ bool mbm_assign_capable;
+ u32 mbm_assign_cntrs;
};

/**
--
2.34.1


2024-03-29 01:08:12

by Moger, Babu

Subject: [RFC PATCH v3 04/17] x86/resctrl: Introduce resctrl_file_fflags_init

Consolidate multiple fflags initialization into one function.

Remove thread_throttle_mode_init, mbm_config_rftype_init and
consolidate them into resctrl_file_fflags_init.

Signed-off-by: Babu Moger <[email protected]>
---
v3: No changes.

v2: New patch. New function to consolidate fflags initialization
---
arch/x86/kernel/cpu/resctrl/core.c | 4 +++-
arch/x86/kernel/cpu/resctrl/internal.h | 4 ++--
arch/x86/kernel/cpu/resctrl/monitor.c | 6 ++++--
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 16 +++-------------
4 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index bb82b392cf5d..50e9ec5e547b 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -229,7 +229,9 @@ static bool __get_mem_config_intel(struct rdt_resource *r)
r->membw.throttle_mode = THREAD_THROTTLE_PER_THREAD;
else
r->membw.throttle_mode = THREAD_THROTTLE_MAX;
- thread_throttle_mode_init();
+
+ resctrl_file_fflags_init("thread_throttle_mode",
+ RFTYPE_CTRL_INFO | RFTYPE_RES_MB);

r->alloc_capable = true;

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index c4ae6f3993aa..722388621403 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -602,8 +602,8 @@ void cqm_handle_limbo(struct work_struct *work);
bool has_busy_rmid(struct rdt_domain *d);
void __check_limbo(struct rdt_domain *d, bool force_free);
void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
-void __init thread_throttle_mode_init(void);
-void __init mbm_config_rftype_init(const char *config);
+void __init resctrl_file_fflags_init(const char *config,
+ unsigned long fflags);
void rdt_staged_configs_clear(void);
bool closid_allocated(unsigned int closid);
int resctrl_find_cleanest_closid(void);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index e5938bf53d5a..735b449039c1 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1049,11 +1049,13 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)

if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
mbm_total_event.configurable = true;
- mbm_config_rftype_init("mbm_total_bytes_config");
+ resctrl_file_fflags_init("mbm_total_bytes_config",
+ RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) {
mbm_local_event.configurable = true;
- mbm_config_rftype_init("mbm_local_bytes_config");
+ resctrl_file_fflags_init("mbm_local_bytes_config",
+ RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
}

if (resctrl_arch_has_abmc(r))
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 011e17efb1a6..dda71fb6c10e 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2022,24 +2022,14 @@ static struct rftype *rdtgroup_get_rftype_by_name(const char *name)
return NULL;
}

-void __init thread_throttle_mode_init(void)
-{
- struct rftype *rft;
-
- rft = rdtgroup_get_rftype_by_name("thread_throttle_mode");
- if (!rft)
- return;
-
- rft->fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB;
-}
-
-void __init mbm_config_rftype_init(const char *config)
+void __init resctrl_file_fflags_init(const char *config,
+ unsigned long fflags)
{
struct rftype *rft;

rft = rdtgroup_get_rftype_by_name(config);
if (rft)
- rft->fflags = RFTYPE_MON_INFO | RFTYPE_RES_CACHE;
+ rft->fflags = fflags;
}

/**
--
2.34.1


2024-03-29 01:08:28

by Moger, Babu

Subject: [RFC PATCH v3 05/17] x86/resctrl: Introduce the interface to display the assignment state

The ABMC feature provides an option to the user to assign an RMID
to the hardware counter and monitor the bandwidth for a longer duration.
The system can be in only one mode at a time (legacy monitor mode or ABMC
mode). By default, ABMC mode is disabled.

Provide an interface to display the monitor mode on the system:
$cat /sys/fs/resctrl/info/L3_MON/mbm_assign
abmc

When the feature is enabled:
$cat /sys/fs/resctrl/info/L3_MON/mbm_assign
[abmc]

Signed-off-by: Babu Moger <[email protected]>
---
v3: New patch to display ABMC capability.
---
Documentation/arch/x86/resctrl.rst | 5 +++++
arch/x86/kernel/cpu/resctrl/monitor.c | 4 +++-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 17 +++++++++++++++++
3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 68df7751d1f5..cd973a013525 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -257,6 +257,11 @@ with the following files:
# cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
0=0x30;1=0x30;3=0x15;4=0x15

+"mbm_assign":
+ Available when assignable monitoring features are supported.
+ Reports the list of assignable features supported and the enclosed brackets
+ indicate the feature is enabled.
+
"max_threshold_occupancy":
Read/write file provides the largest value (in
bytes) at which a previously used LLC_occupancy
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 735b449039c1..48d1957ea5a3 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1058,8 +1058,10 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
}

- if (resctrl_arch_has_abmc(r))
+ if (resctrl_arch_has_abmc(r)) {
r->mbm_assign_capable = ABMC_ASSIGN;
+ resctrl_file_fflags_init("mbm_assign", RFTYPE_MON_INFO);
+ }
}

l3_mon_evt_init(r);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index dda71fb6c10e..5ec807e8dd38 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -846,6 +846,17 @@ static int rdtgroup_rmid_show(struct kernfs_open_file *of,
return ret;
}

+static int rdtgroup_mbm_assign_show(struct kernfs_open_file *of,
+ struct seq_file *s, void *v)
+{
+ struct rdt_resource *r = of->kn->parent->priv;
+
+ if (r->mbm_assign_capable)
+ seq_puts(s, "abmc\n");
+
+ return 0;
+}
+
#ifdef CONFIG_PROC_CPU_RESCTRL

/*
@@ -1903,6 +1914,12 @@ static struct rftype res_common_files[] = {
.seq_show = mbm_local_bytes_config_show,
.write = mbm_local_bytes_config_write,
},
+ {
+ .name = "mbm_assign",
+ .mode = 0444,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .seq_show = rdtgroup_mbm_assign_show,
+ },
{
.name = "cpus",
.mode = 0644,
--
2.34.1


2024-03-29 01:08:43

by Moger, Babu

Subject: [RFC PATCH v3 06/17] x86/resctrl: Introduce interface to display number of ABMC counters

The ABMC feature provides an option to the user to pin (or assign) the
RMID to the hardware counter and monitor the bandwidth for a longer
duration. There are only a limited number of hardware counters.

Provide the interface to display the number of ABMC counters supported.

Signed-off-by: Babu Moger <[email protected]>
---
v3: Changed the field name to mbm_assign_cntrs.

v2: Changed the field name to mbm_assignable_counters from abmc_counters.
---
Documentation/arch/x86/resctrl.rst | 4 ++++
arch/x86/kernel/cpu/resctrl/monitor.c | 1 +
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 16 ++++++++++++++++
3 files changed, 21 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index cd973a013525..e06ffddb64f6 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -262,6 +262,10 @@ with the following files:
Reports the list of assignable features supported and the enclosed brackets
indicate the feature is enabled.

+"mbm_assign_cntrs":
+ The number of assignable counters available when the assignable monitoring
+ feature is supported.
+
"max_threshold_occupancy":
Read/write file provides the largest value (in
bytes) at which a previously used LLC_occupancy
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 48d1957ea5a3..56dc49021540 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1061,6 +1061,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
if (resctrl_arch_has_abmc(r)) {
r->mbm_assign_capable = ABMC_ASSIGN;
resctrl_file_fflags_init("mbm_assign", RFTYPE_MON_INFO);
+ resctrl_file_fflags_init("mbm_assign_cntrs", RFTYPE_MON_INFO);
}
}

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 5ec807e8dd38..05f551bc316e 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -857,6 +857,16 @@ static int rdtgroup_mbm_assign_show(struct kernfs_open_file *of,
return 0;
}

+static int rdtgroup_mbm_assign_cntrs_show(struct kernfs_open_file *of,
+ struct seq_file *s, void *v)
+{
+ struct rdt_resource *r = of->kn->parent->priv;
+
+ seq_printf(s, "%d\n", r->mbm_assign_cntrs);
+
+ return 0;
+}
+
#ifdef CONFIG_PROC_CPU_RESCTRL

/*
@@ -1920,6 +1930,12 @@ static struct rftype res_common_files[] = {
.kf_ops = &rdtgroup_kf_single_ops,
.seq_show = rdtgroup_mbm_assign_show,
},
+ {
+ .name = "mbm_assign_cntrs",
+ .mode = 0444,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .seq_show = rdtgroup_mbm_assign_cntrs_show,
+ },
{
.name = "cpus",
.mode = 0644,
--
2.34.1


2024-03-29 01:09:15

by Moger, Babu

Subject: [RFC PATCH v3 08/17] x86/resctrl: Initialize assignable counters bitmap

AMD hardware provides a set of counters when the ABMC feature is supported.
These counters are used for assigning events to a resctrl group.

Introduce the bitmap assign_cntrs_free_map to allocate and free the
counters.

Signed-off-by: Babu Moger <[email protected]>

---
v3: Changed the bitmap name to assign_cntrs_free_map. Removed abmc
from the name.

v2: Changed the bitmap name to assignable_counter_free_map from
abmc_counter_free_map.
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index f49073c86884..2c7583e7b541 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -186,6 +186,22 @@ bool closid_allocated(unsigned int closid)
return !test_bit(closid, &closid_free_map);
}

+static u64 assign_cntrs_free_map;
+static u32 assign_cntrs_free_map_len;
+
+static void assign_cntrs_init(void)
+{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+
+ if (r->mbm_assign_cntrs > 64) {
+ r->mbm_assign_cntrs = 64;
+ WARN(1, "Cannot support more than 64 Assignable counters\n");
+ }
+
+ assign_cntrs_free_map = r->mbm_assign_cntrs ? GENMASK_ULL(r->mbm_assign_cntrs - 1, 0) : 0;
+ assign_cntrs_free_map_len = r->mbm_assign_cntrs;
+}
+
/**
* rdtgroup_mode_by_closid - Return mode of resource group with closid
* @closid: closid if the resource group
@@ -2459,6 +2475,9 @@ static int resctrl_abmc_setup(enum resctrl_res_level l, bool enable)
struct rdt_resource *r = &rdt_resources_all[l].r_resctrl;
struct rdt_domain *d;

+ /* Reset the counters bitmap */
+ assign_cntrs_init();
+
/* Update QOS_CFG MSR on all the CPUs in cpu_mask */
list_for_each_entry(d, &r->domains, list) {
on_each_cpu_mask(&d->cpu_mask, resctrl_abmc_msrwrite, &enable, 1);
--
2.34.1
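
The free-map bookkeeping introduced in this patch can be modelled in user
space. The sketch below is only an illustration under stated assumptions: the
names free_map, cntrs_init, cntrs_alloc and cntrs_free are hypothetical and do
not appear in the series. A set bit means the counter is free, and the
64-counter case is handled separately because shifting a 64-bit value by 64 is
undefined in C.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical user-space model: a set bit means the counter is free. */
static uint64_t free_map;

/* Mark the first ncounters counters (at most 64) as free. */
void cntrs_init(unsigned int ncounters)
{
	assert(ncounters <= 64);
	/* A shift by 64 is undefined, so the full-width case is special. */
	free_map = (ncounters == 64) ? UINT64_MAX
				     : ((UINT64_C(1) << ncounters) - 1);
}

/* Return the lowest free counter id, or -ENOSPC when none is left. */
int cntrs_alloc(void)
{
	int id;

	if (!free_map)
		return -ENOSPC;
	/* free_map is non-zero, so this scan stops before id reaches 64. */
	for (id = 0; !(free_map & (UINT64_C(1) << id)); id++)
		;
	free_map &= ~(UINT64_C(1) << id);
	return id;
}

/* Return a counter to the pool. */
void cntrs_free(unsigned int id)
{
	assert(id < 64);
	free_map |= UINT64_C(1) << id;
}
```

Keeping every shift on a 64-bit operand is the point of the sketch: a plain
int shift such as (1 << id) stops being valid once id reaches 31.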


2024-03-29 01:09:31

by Moger, Babu

Subject: [RFC PATCH v3 09/17] x86/resctrl: Introduce assign state for the mon group

The ABMC feature provides an option to the user to assign an RMID to
a hardware counter and monitor the bandwidth for a longer duration.
The assigned RMID will be active until the user unassigns the RMID.

Add a new field mon_state to the mongroup data structure to represent the
assignment state of the group. This field is used only when the ABMC
feature is enabled.

Signed-off-by: Babu Moger <[email protected]>
---
v3: Changed the field name to mon_state. Also this state is not visible to
users directly as part of our global assign approach.

v2: Added check to display "Unsupported" when user tries to access
monitor state when ABMC is not enabled.
---
arch/x86/kernel/cpu/resctrl/internal.h | 9 +++++++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 8 ++++++++
2 files changed, 17 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 8238ee437369..b559b3a4555e 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -99,6 +99,13 @@ cpumask_any_housekeeping(const struct cpumask *mask, int exclude_cpu)
/* ABMC ENABLE */
#define ABMC_ENABLE BIT(0)

+/*
+ * monitor group's state when ABMC is supported
+ */
+#define ASSIGN_NONE 0
+#define ASSIGN_TOTAL BIT(0)
+#define ASSIGN_LOCAL BIT(1)
+
struct rdt_fs_context {
struct kernfs_fs_context kfc;
bool enable_cdpl2;
@@ -202,12 +209,14 @@ enum rdtgrp_mode {
* @parent: parent rdtgrp
* @crdtgrp_list: child rdtgroup node list
* @rmid: rmid for this rdtgroup
+ * @mon_state: Assignment state of the group
*/
struct mongroup {
struct kernfs_node *mon_data_kn;
struct rdtgroup *parent;
struct list_head crdtgrp_list;
u32 rmid;
+ u32 mon_state;
};

/**
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 2c7583e7b541..54ae2e6bf612 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2473,6 +2473,7 @@ static void resctrl_abmc_msrwrite(void *arg)
static int resctrl_abmc_setup(enum resctrl_res_level l, bool enable)
{
struct rdt_resource *r = &rdt_resources_all[l].r_resctrl;
+ struct rdtgroup *prgrp, *crgrp;
struct rdt_domain *d;

/* Reset the counters bitmap */
@@ -2484,6 +2485,13 @@ static int resctrl_abmc_setup(enum resctrl_res_level l, bool enable)
resctrl_arch_reset_rmid_all(r, d);
}

+ /* Reset assign state for all the monitor groups */
+ list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
+ prgrp->mon.mon_state = ASSIGN_NONE;
+ list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list)
+ crgrp->mon.mon_state = ASSIGN_NONE;
+ }
+
return 0;
}

--
2.34.1


2024-03-29 01:09:44

by Moger, Babu

Subject: [RFC PATCH v3 10/17] x86/resctrl: Add data structures for ABMC assignment

ABMC (Assignable Bandwidth Monitoring Counters) can be configured by
writing to the L3_QOS_ABMC_CFG MSR. When ABMC is enabled, the user can
configure a counter by writing to L3_QOS_ABMC_CFG with the CfgEn field set,
while specifying the Bandwidth Source, Bandwidth Types, and Counter
Identifier. Add the MSR definition and individual field definitions.

MSR L3_QOS_ABMC_CFG (C000_03FDh) definitions.

==========================================================================
Bits    Mnemonic    Description                Access Type    Reset Value
==========================================================================
63      CfgEn       Configuration Enable       R/W            0

62      CtrEn       Counter Enable             R/W            0

61:53   –           Reserved                   MBZ            0

52:48   CtrID       Counter Identifier         R/W            0

47      IsCOS       BwSrc field is a COS       R/W            0
                    (not an RMID)

46:44   –           Reserved                   MBZ            0

43:32   BwSrc       Bandwidth Source           R/W            0
                    (RMID or COS)

31:0    BwType      Bandwidth types to         R/W            0
                    track for this counter
==========================================================================

The feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Signed-off-by: Babu Moger <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537

---
v3: No changes.
v2: No changes.
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 23 +++++++++++++++++++++++
2 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index f16ee50b1a23..ab01abfab089 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1166,6 +1166,7 @@
#define MSR_IA32_SMBA_BW_BASE 0xc0000280
#define MSR_IA32_EVT_CFG_BASE 0xc0000400
#define MSR_IA32_L3_QOS_EXT_CFG 0xc00003ff
+#define MSR_IA32_L3_QOS_ABMC_CFG 0xc00003fd

/* MSR_IA32_VMX_MISC bits */
#define MSR_IA32_VMX_MISC_INTEL_PT (1ULL << 14)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index b559b3a4555e..41b06d46ea74 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -106,6 +106,9 @@ cpumask_any_housekeeping(const struct cpumask *mask, int exclude_cpu)
#define ASSIGN_TOTAL BIT(0)
#define ASSIGN_LOCAL BIT(1)

+/* Maximum assignable counters per resctrl group */
+#define MAX_ASSIGN_CNTRS 2
+
struct rdt_fs_context {
struct kernfs_fs_context kfc;
bool enable_cdpl2;
@@ -210,6 +213,7 @@ enum rdtgrp_mode {
* @crdtgrp_list: child rdtgroup node list
* @rmid: rmid for this rdtgroup
* @mon_state: Assignment state of the group
+ * @abmc_ctr_id: ABMC counterids assigned to this group
*/
struct mongroup {
struct kernfs_node *mon_data_kn;
@@ -217,6 +221,7 @@ struct mongroup {
struct list_head crdtgrp_list;
u32 rmid;
u32 mon_state;
+ u32 abmc_ctr_id[MAX_ASSIGN_CNTRS];
};

/**
@@ -566,6 +571,24 @@ union cpuid_0x10_x_edx {
unsigned int full;
};

+/*
+ * L3_QOS_ABMC_CFG MSR details. ABMC counters can be configured
+ * by writing to L3_QOS_ABMC_CFG.
+ */
+union l3_qos_abmc_cfg {
+ struct {
+ unsigned long bw_type :32,
+ bw_src :12,
+ rsvrd1 : 3,
+ is_cos : 1,
+ ctr_id : 5,
+ rsvrd : 9,
+ ctr_en : 1,
+ cfg_en : 1;
+ } split;
+ unsigned long full;
+};
+
void rdt_last_cmd_clear(void);
void rdt_last_cmd_puts(const char *s);
__printf(1, 2)
--
2.34.1
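
The field layout in the L3_QOS_ABMC_CFG table above can also be expressed
with explicit shifts. The sketch below is a user-space illustration only:
the helper abmc_cfg_pack and the ABMC_*_SHIFT names are hypothetical and are
not part of this series, which uses a bitfield union instead. CfgEn is always
set here because that is what a configuration write requires.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions taken from the L3_QOS_ABMC_CFG table in the patch. */
#define ABMC_CFG_EN_SHIFT	63
#define ABMC_CTR_EN_SHIFT	62
#define ABMC_CTR_ID_SHIFT	48	/* bits 52:48 */
#define ABMC_IS_COS_SHIFT	47
#define ABMC_BW_SRC_SHIFT	32	/* bits 43:32 */

/* Hypothetical helper: pack the fields into a 64-bit MSR value. */
uint64_t abmc_cfg_pack(unsigned int ctr_en, unsigned int ctr_id,
		       unsigned int is_cos, unsigned int bw_src,
		       uint32_t bw_type)
{
	assert(ctr_id < 32);	/* CtrID is 5 bits wide */
	assert(bw_src < 4096);	/* BwSrc is 12 bits wide */

	return (UINT64_C(1) << ABMC_CFG_EN_SHIFT) |
	       ((uint64_t)(ctr_en & 1) << ABMC_CTR_EN_SHIFT) |
	       ((uint64_t)ctr_id << ABMC_CTR_ID_SHIFT) |
	       ((uint64_t)(is_cos & 1) << ABMC_IS_COS_SHIFT) |
	       ((uint64_t)bw_src << ABMC_BW_SRC_SHIFT) |
	       bw_type;		/* bits 31:0 */
}
```

For example, enabling counter 3 for RMID 0x12 with bandwidth types 0x7f gives
0xC00300120000007F. Explicit shifts do not depend on how the compiler orders
bitfields, which is the main reason to consider this form.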


2024-03-29 01:10:23

by Moger, Babu

Subject: [RFC PATCH v3 12/17] x86/resctrl: Add the functionality to assign the RMID

With support for the ABMC (Assignable Bandwidth Monitoring Counters)
feature, the user has the option to assign an RMID to a hardware counter,
or unassign it, and monitor the bandwidth for a longer duration.

Provide the interface to assign the counter to the group.

The ABMC feature implements a pair of MSRs, L3_QOS_ABMC_CFG (MSR
C000_03FDh) and L3_QOS_ABMC_DSC (MSR C000_03FEh). Each logical processor
implements a separate copy of these registers. Attempts to read or write
these MSRs when ABMC is not enabled will result in a #GP(0) exception.

Individual assignable bandwidth counters are configured by writing to
L3_QOS_ABMC_CFG MSR and specifying the Counter ID, Bandwidth Source, and
Bandwidth Types. Reading L3_QOS_ABMC_DSC returns the configuration of the
counter specified by L3_QOS_ABMC_CFG [CtrID].

The feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Signed-off-by: Babu Moger <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
---
v3: Removed the static from the prototype of rdtgroup_assign_abmc.
The function is not called directly from user anymore. These
changes are related to global assignment interface.

v2: Minor text changes in commit message.
---
arch/x86/kernel/cpu/resctrl/internal.h | 1 +
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 86 ++++++++++++++++++++++++++
2 files changed, 87 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 88453c86474b..9d84c80104f9 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -651,6 +651,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
void __init resctrl_file_fflags_init(const char *config,
unsigned long fflags);
void arch_domain_mbm_evt_config(struct rdt_hw_domain *hw_dom);
+ssize_t rdtgroup_assign_abmc(struct rdtgroup *rdtgrp, u32 evtid, int mon_state);
void rdt_staged_configs_clear(void);
bool closid_allocated(unsigned int closid);
int resctrl_find_cleanest_closid(void);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 7f54788a58de..cfbdaf8b5f83 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -202,6 +202,18 @@ static void assign_cntrs_init(void)
assign_cntrs_free_map_len = r->mbm_assign_cntrs;
}

+static int assign_cntrs_alloc(void)
+{
+ u32 counterid;
+
+ if (!assign_cntrs_free_map)
+ return -ENOSPC;
+ counterid = __ffs64(assign_cntrs_free_map);
+ assign_cntrs_free_map &= ~BIT_ULL(counterid);
+
+ return counterid;
+}
+
/**
* rdtgroup_mode_by_closid - Return mode of resource group with closid
* @closid: closid if the resource group
@@ -1848,6 +1860,80 @@ static ssize_t mbm_local_bytes_config_write(struct kernfs_open_file *of,
return ret ?: nbytes;
}

+static void rdtgroup_abmc_msrwrite(void *info)
+{
+ u64 *msrval = info;
+
+ wrmsrl(MSR_IA32_L3_QOS_ABMC_CFG, *msrval);
+}
+
+static void rdtgroup_abmc_domain(struct rdt_domain *d, struct rdtgroup *rdtgrp,
+ u32 evtid, int index, bool assign)
+{
+ struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ union l3_qos_abmc_cfg abmc_cfg = { 0 };
+ struct arch_mbm_state *arch_mbm;
+
+ abmc_cfg.split.cfg_en = 1;
+ abmc_cfg.split.ctr_en = assign ? 1 : 0;
+ abmc_cfg.split.ctr_id = rdtgrp->mon.abmc_ctr_id[index];
+ abmc_cfg.split.bw_src = rdtgrp->mon.rmid;
+
+ /*
+ * Read the event configuration from the domain and pass it as
+ * bw_type.
+ */
+ if (evtid == QOS_L3_MBM_TOTAL_EVENT_ID) {
+ abmc_cfg.split.bw_type = hw_dom->mbm_total_cfg;
+ arch_mbm = &hw_dom->arch_mbm_total[rdtgrp->mon.rmid];
+ } else {
+ abmc_cfg.split.bw_type = hw_dom->mbm_local_cfg;
+ arch_mbm = &hw_dom->arch_mbm_local[rdtgrp->mon.rmid];
+ }
+
+ smp_call_function_any(&d->cpu_mask, rdtgroup_abmc_msrwrite, &abmc_cfg, 1);
+
+ /* Reset the internal counters */
+ if (arch_mbm)
+ memset(arch_mbm, 0, sizeof(struct arch_mbm_state));
+}
+
+ssize_t rdtgroup_assign_abmc(struct rdtgroup *rdtgrp, u32 evtid, int mon_state)
+{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ int counterid = 0, index;
+ struct rdt_domain *d;
+
+ if (rdtgrp->mon.mon_state & mon_state) {
+ rdt_last_cmd_puts("ABMC counter is assigned already\n");
+ return 0;
+ }
+
+ index = mon_event_config_index_get(evtid);
+ if (index == INVALID_CONFIG_INDEX) {
+ pr_warn_once("Invalid event id %d\n", evtid);
+ return -EINVAL;
+ }
+
+ /*
+ * Allocate a new counter and update domains
+ */
+ counterid = assign_cntrs_alloc();
+ if (counterid < 0) {
+ rdt_last_cmd_puts("Out of ABMC counters\n");
+ return -ENOSPC;
+ }
+
+ rdtgrp->mon.abmc_ctr_id[index] = counterid;
+
+ list_for_each_entry(d, &r->domains, list)
+ rdtgroup_abmc_domain(d, rdtgrp, evtid, index, 1);
+
+ rdtgrp->mon.mon_state |= mon_state;
+
+ return 0;
+}
+
/* rdtgroup information files for one cache resource. */
static struct rftype res_common_files[] = {
{
--
2.34.1


2024-03-29 01:11:42

by Moger, Babu

Subject: [RFC PATCH v3 16/17] x86/resctrl: Introduce interface to list assignment states of all the groups

Introduce rdtgroup_mbm_assign_control_show to list the assignment states of
all the resctrl groups.

The list uses the following format:

- Default CTRL_MON group:
"//<domain_id>=<assignment_flags>"

- Non-default CTRL_MON group:
"<CTRL_MON group>//<domain_id>=<assignment_flags>"

- Child MON group of default CTRL_MON group:
"/<MON group>/<domain_id>=<assignment_flags>"

- Child MON group of non-default CTRL_MON group:
"<CTRL_MON group>/<MON group>/<domain_id>=<assignment_flags>"

Signed-off-by: Babu Moger <[email protected]>
---
v3: New patch.
Addresses the feedback to provide the global assignment interface.
https://lore.kernel.org/lkml/[email protected]/
---
Documentation/arch/x86/resctrl.rst | 51 ++++++++++++++++
arch/x86/kernel/cpu/resctrl/monitor.c | 1 +
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 85 ++++++++++++++++++++++++++
3 files changed, 137 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 3e441b828765..2d96565501ab 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -278,6 +278,57 @@ with the following files:
The number of assignable counters available when the assignable monitoring
feature is supported.

+"mbm_assign_control":
+ Available when assignable monitoring features are supported.
+ Reports the resctrl group and assignment status of each group.
+
+ List follows the following format:
+
+ * Default CTRL_MON group:
+ "//<domain_id>=<assignment_flags>"
+
+ * Non-default CTRL_MON group:
+ "<CTRL_MON group>//<domain_id>=<assignment_flags>"
+
+ * Child MON group of default CTRL_MON group:
+ "/<MON group>/<domain_id>=<assignment_flags>"
+
+ * Child MON group of non-default CTRL_MON group:
+ "<CTRL_MON group>/<MON group>/<domain_id>=<assignment_flags>"
+
+ Assignment flags can be one of the following:
+ ::
+
+ t MBM total event is assigned
+ l MBM local event is assigned
+ tl Both total and local MBM events are assigned
+ _ None of the MBM events are assigned
+
+ Examples:
+ ::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+ //0=tl;1=tl;
+ /child_default_mon_grp/0=t;1=t;
+ non_default_ctrl_mon_grp//0=l;1=l;
+ non_default_ctrl_mon_grp/child_non_default_mon_grp/0=_;1=_;
+
+ //0=tl;1=tl;
+ Both the events on the default group are assigned.
+
+ /child_default_mon_grp/0=t;1=t;
+ Only total event on this mon group is assigned. This is a child
+ monitor group of the default control mon group.
+
+ non_default_ctrl_mon_grp//0=l;1=l;
+ Only local event on this control mon group is assigned. This is a
+ non default control mon group.
+
+ non_default_ctrl_mon_grp/child_non_default_mon_grp/0=_;1=_;
+ None of events are assigned on this mon group. This is a child
+ monitor group of the non default control mon group.
+
+
"max_threshold_occupancy":
Read/write file provides the largest value (in
bytes) at which a previously used LLC_occupancy
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 8677dbf6de43..8a2e2afc85e8 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1062,6 +1062,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
r->mbm_assign_capable = ABMC_ASSIGN;
resctrl_file_fflags_init("mbm_assign", RFTYPE_MON_INFO);
resctrl_file_fflags_init("mbm_assign_cntrs", RFTYPE_MON_INFO);
+ resctrl_file_fflags_init("mbm_assign_control", RFTYPE_MON_INFO);
}
}

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index a6e0ef2631ae..9fd37b6c3b24 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -932,6 +932,85 @@ static ssize_t rdtgroup_mbm_assign_write(struct kernfs_open_file *of,
return ret ?: nbytes;
}

+static char *mon_state_to_str(int mon_state, char *str)
+{
+ char *tmp = str;
+
+ switch (mon_state) {
+ case ASSIGN_NONE:
+ *tmp++ = '_';
+ break;
+ case (ASSIGN_TOTAL | ASSIGN_LOCAL):
+ *tmp++ = 't';
+ *tmp++ = 'l';
+ break;
+ case ASSIGN_TOTAL:
+ *tmp++ = 't';
+ break;
+ case ASSIGN_LOCAL:
+ *tmp++ = 'l';
+ break;
+ default:
+ break;
+ }
+
+ *tmp = '\0';
+ return str;
+}
+
+static int rdtgroup_mbm_assign_control_show(struct kernfs_open_file *of,
+ struct seq_file *s, void *v)
+{
+ struct rdt_resource *r = of->kn->parent->priv;
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ struct rdt_domain *dom;
+ struct rdtgroup *rdtg;
+ int grp_default = 0;
+ char str[10];
+
+ if (!hw_res->abmc_enabled) {
+ rdt_last_cmd_puts("ABMC feature is not enabled\n");
+ return -EINVAL;
+ }
+
+ mutex_lock(&rdtgroup_mutex);
+
+ list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
+ struct rdtgroup *crg;
+
+ if (rdtg == &rdtgroup_default) {
+ grp_default = 1;
+ seq_puts(s, "//");
+ } else {
+ grp_default = 0;
+ seq_printf(s, "%s//", rdtg->kn->name);
+ }
+
+ list_for_each_entry(dom, &r->domains, list)
+ seq_printf(s, "%d=%s;", dom->id,
+ mon_state_to_str(rdtg->mon.mon_state, str));
+ seq_putc(s, '\n');
+
+ list_for_each_entry(crg, &rdtg->mon.crdtgrp_list,
+ mon.crdtgrp_list) {
+ if (grp_default)
+ seq_printf(s, "/%s/", crg->kn->name);
+ else
+ seq_printf(s, "%s/%s/", rdtg->kn->name,
+ crg->kn->name);
+
+ list_for_each_entry(dom, &r->domains, list)
+ seq_printf(s, "%d=%s;", dom->id,
+ mon_state_to_str(crg->mon.mon_state, str));
+ seq_putc(s, '\n');
+ }
+ }
+
+ mutex_unlock(&rdtgroup_mutex);
+
+ return 0;
+}
+
static int rdtgroup_mbm_assign_cntrs_show(struct kernfs_open_file *of,
struct seq_file *s, void *v)
{
@@ -2119,6 +2198,12 @@ static struct rftype res_common_files[] = {
.seq_show = rdtgroup_mbm_assign_show,
.write = rdtgroup_mbm_assign_write,
},
+ {
+ .name = "mbm_assign_control",
+ .mode = 0444,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .seq_show = rdtgroup_mbm_assign_control_show,
+ },
{
.name = "mbm_assign_cntrs",
.mode = 0444,
--
2.34.1
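
The line format listed by mbm_assign_control can be reproduced with a small
user-space formatter. This is a sketch under stated assumptions: state_flags
and format_line are hypothetical names, and the ASSIGN_* values mirror the
definitions added earlier in the series. An empty string stands for the
default CTRL_MON group, or for "no child MON group".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define ASSIGN_NONE	0
#define ASSIGN_TOTAL	(1 << 0)
#define ASSIGN_LOCAL	(1 << 1)

/* Map an assignment state to its flag string: "t", "l", "tl" or "_". */
const char *state_flags(int state)
{
	switch (state & (ASSIGN_TOTAL | ASSIGN_LOCAL)) {
	case ASSIGN_TOTAL | ASSIGN_LOCAL:
		return "tl";
	case ASSIGN_TOTAL:
		return "t";
	case ASSIGN_LOCAL:
		return "l";
	default:
		return "_";
	}
}

/*
 * Hypothetical formatter for one output line. Pass "" as ctrl for the
 * default CTRL_MON group and "" as mon when the line describes a
 * CTRL_MON group itself. Returns the length the full line needs.
 */
int format_line(char *buf, size_t size, const char *ctrl, const char *mon,
		const int *dom_ids, int ndoms, int state)
{
	int i, len;

	assert(buf && size > 0);
	len = snprintf(buf, size, "%s/%s/", ctrl, mon);
	/* Stop appending once the buffer is full; snprintf truncates safely. */
	for (i = 0; i < ndoms && len >= 0 && (size_t)len < size; i++)
		len += snprintf(buf + len, size - len, "%d=%s;",
				dom_ids[i], state_flags(state));
	return len;
}
```

With domains 0 and 1, an empty ctrl and mon, and both events assigned, the
buffer holds "//0=tl;1=tl;", matching the first example line in the
documentation hunk of this patch.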


2024-03-29 01:12:27

by Moger, Babu

Subject: [RFC PATCH v3 14/17] x86/resctrl: Enable ABMC by default on resctrl mount

Enable ABMC by default if assignment feature is available. Also assign
the monitor counters for CTRL_MON and MON groups as they are created.

If for any reason the assignment fails, report the warnings and continue.
It is not required to fail the group creation for assignment failures.
Users will have the option to modify the assignments later.

Signed-off-by: Babu Moger <[email protected]>

---
v3: This is a new patch. Patch addresses the upstream comment to enable
ABMC feature by default if the feature is available.
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 76 ++++++++++++++++++++++++++
1 file changed, 76 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b430ffa554a9..2e58024e95e2 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2750,6 +2750,7 @@ static void rdt_disable_ctx(void)
{
resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false);
resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false);
+ resctrl_arch_set_abmc_enabled(RDT_RESOURCE_L3, false);
set_mba_sc(false);

resctrl_debug = false;
@@ -2780,6 +2781,8 @@ static int rdt_enable_ctx(struct rdt_fs_context *ctx)
if (ctx->enable_debug)
resctrl_debug = true;

+ resctrl_arch_set_abmc_enabled(RDT_RESOURCE_L3, true);
+
return 0;

out_cdpl3:
@@ -2876,6 +2879,48 @@ static void schemata_list_destroy(void)
}
}

+static int resctrl_mbm_assign(struct rdtgroup *rdtgrp)
+{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ int ret = 0;
+
+ if (!hw_res->abmc_enabled)
+ return 0;
+
+ if (is_mbm_total_enabled())
+ ret = rdtgroup_assign_abmc(rdtgrp, QOS_L3_MBM_TOTAL_EVENT_ID,
+ ASSIGN_TOTAL);
+
+ if (!ret && is_mbm_local_enabled())
+ ret = rdtgroup_assign_abmc(rdtgrp, QOS_L3_MBM_LOCAL_EVENT_ID,
+ ASSIGN_LOCAL);
+
+ return ret;
+}
+
+static int resctrl_mbm_unassign(struct rdtgroup *rdtgrp)
+{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ int ret = 0;
+
+ if (!hw_res->abmc_enabled)
+ return 0;
+
+ if (is_mbm_total_enabled())
+ ret = rdtgroup_unassign_abmc(rdtgrp,
+ QOS_L3_MBM_TOTAL_EVENT_ID,
+ ASSIGN_TOTAL);
+
+ if (!ret && is_mbm_local_enabled())
+ ret = rdtgroup_unassign_abmc(rdtgrp,
+ QOS_L3_MBM_LOCAL_EVENT_ID,
+ ASSIGN_LOCAL);
+
+ return ret;
+}
+
static int rdt_get_tree(struct fs_context *fc)
{
struct rdt_fs_context *ctx = rdt_fc2context(fc);
@@ -2935,6 +2980,14 @@ static int rdt_get_tree(struct fs_context *fc)
if (ret < 0)
goto out_mongrp;
rdtgroup_default.mon.mon_data_kn = kn_mondata;
+
+ /*
+ * Assign the monitor counters if they are available. If assignment
+ * fails, report a warning and continue. It is not necessary to
+ * fail here.
+ */
+ if (resctrl_mbm_assign(&rdtgroup_default) < 0)
+ rdt_last_cmd_puts("Monitor assignment failed\n");
}

ret = rdt_pseudo_lock_init();
@@ -3216,6 +3269,8 @@ static void rdt_kill_sb(struct super_block *sb)
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);

+ resctrl_mbm_unassign(&rdtgroup_default);
+
rdt_disable_ctx();

/*Put everything back to default values. */
@@ -3754,6 +3809,14 @@ static int rdtgroup_mkdir_mon(struct kernfs_node *parent_kn,
goto out_unlock;
}

+ /*
+ * Assign the monitor counters if they are available. If assignment
+ * fails, report a warning and continue. It is not necessary to
+ * fail here.
+ */
+ if (resctrl_mbm_assign(rdtgrp) < 0)
+ rdt_last_cmd_puts("Monitor assignment failed\n");
+
kernfs_activate(rdtgrp->kn);

/*
@@ -3798,6 +3861,14 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
if (ret)
goto out_closid_free;

+ /*
+ * Assign the monitor counters if they are available. If assignment
+ * fails, report a warning and continue. It is not necessary to
+ * fail here.
+ */
+ if (resctrl_mbm_assign(rdtgrp) < 0)
+ rdt_last_cmd_puts("Monitor assignment failed\n");
+
kernfs_activate(rdtgrp->kn);

ret = rdtgroup_init_alloc(rdtgrp);
@@ -3893,6 +3964,9 @@ static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
update_closid_rmid(tmpmask, NULL);

rdtgrp->flags = RDT_DELETED;
+
+ resctrl_mbm_unassign(rdtgrp);
+
free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);

/*
@@ -3939,6 +4013,8 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
update_closid_rmid(tmpmask, NULL);

+ resctrl_mbm_unassign(rdtgrp);
+
free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
closid_free(rdtgrp->closid);

--
2.34.1


2024-03-29 01:14:19

by Moger, Babu

Subject: [RFC PATCH v3 17/17] x86/resctrl: Introduce interface to modify assignment states of the groups

Introduce rdtgroup_mbm_assign_control_write to assign MBM events.
The assignment state can be updated by writing to this interface.
An assignment on one domain is applied to all the domains: the user can
pass one valid domain and the assignment will be updated on all the
available domains.

The format is similar to the list format, with the addition of an op-code
for the assignment operation.

* Default CTRL_MON group:
"//<domain_id><op-code><assignment_flags>"

* Non-default CTRL_MON group:
"<CTRL_MON group>//<domain_id><op-code><assignment_flags>"

* Child MON group of default CTRL_MON group:
"/<MON group>/<domain_id><op-code><assignment_flags>"

* Child MON group of non-default CTRL_MON group:
"<CTRL_MON group>/<MON group>/<domain_id><op-code><assignment_flags>"

Op-code can be one of the following:

= Update the assignment to match the flags
+ Assign a new state
- Unassign an existing state
_ Unassign all the states

Signed-off-by: Babu Moger <[email protected]>
---

v3: New patch.
Addresses the feedback to provide the global assignment interface.
https://lore.kernel.org/lkml/[email protected]/
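
The effect of each op-code on a group's assignment state, as described in the
commit message above, can be sketched as a pure function. This is an
illustration only: apply_op and ASSIGN_ALL are hypothetical names, and the
patch itself issues separate assign and unassign calls rather than computing a
new state value.

```c
#include <assert.h>

#define ASSIGN_NONE	0
#define ASSIGN_TOTAL	(1 << 0)
#define ASSIGN_LOCAL	(1 << 1)
#define ASSIGN_ALL	(ASSIGN_TOTAL | ASSIGN_LOCAL)

/*
 * Hypothetical helper: return the state that results from applying op
 * with the given flags to the current state, or -1 for an unknown op.
 */
int apply_op(int cur, char op, int flags)
{
	assert((cur & ~ASSIGN_ALL) == 0 && (flags & ~ASSIGN_ALL) == 0);

	switch (op) {
	case '=':	/* make the state match the flags exactly */
		return flags;
	case '+':	/* assign the given events, keep the others */
		return cur | flags;
	case '-':	/* unassign the given events, keep the others */
		return cur & ~flags;
	case '_':	/* unassign everything */
		return ASSIGN_NONE;
	default:
		return -1;
	}
}
```

For instance, starting from "tl", the write "0-l" leaves "t", and "0_" leaves
"_", which matches the documentation examples in this patch.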
---
Documentation/arch/x86/resctrl.rst | 71 ++++++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 236 ++++++++++++++++++++++++-
2 files changed, 306 insertions(+), 1 deletion(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 2d96565501ab..64ec70637c66 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -328,6 +328,77 @@ with the following files:
None of events are assigned on this mon group. This is a child
monitor group of the non default control mon group.

+ Assignment state can be updated by writing to this interface.
+
+ NOTE: An assignment on one domain is applied to all the domains. The
+ user can pass one valid domain and the assignment will be updated on
+ all the available domains.
+
+ Format is similar to the list format with addition of op-code for the
+ assignment operation.
+
+ * Default CTRL_MON group:
+ "//<domain_id><op-code><assignment_flags>"
+
+ * Non-default CTRL_MON group:
+ "<CTRL_MON group>//<domain_id><op-code><assignment_flags>"
+
+ * Child MON group of default CTRL_MON group:
+ "/<MON group>/<domain_id><op-code><assignment_flags>"
+
+ * Child MON group of non-default CTRL_MON group:
+ "<CTRL_MON group>/<MON group>/<domain_id><op-code><assignment_flags>"
+
+ Op-code can be one of the following:
+ ::
+
+ = Update the assignment to match the flags
+ + Assign a new state
+ - Unassign an existing state
+ _ Unassign all the states
+
+ Examples:
+ ::
+
+ Initial group status:
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+ //0=tl;1=tl;
+ /child_default_mon_grp/0=tl;1=tl;
+ non_default_ctrl_mon_grp//0=tl;1=tl;
+ non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
+
+ To update the default group to assign only total event.
+ # echo "//0=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+
+ Assignment status after the update:
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+ //0=t;1=t;
+ /child_default_mon_grp/0=tl;1=tl;
+ non_default_ctrl_mon_grp//0=tl;1=tl;
+ non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
+
+ To update the MON group child_default_mon_grp to remove local event:
+ # echo "/child_default_mon_grp/0-l" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+
+ Assignment status after the update:
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+ //0=t;1=t;
+ /child_default_mon_grp/0=t;1=t;
+ non_default_ctrl_mon_grp//0=tl;1=tl;
+ non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
+
+ To update the MON group non_default_ctrl_mon_grp/child_non_default_mon_grp to
+ remove both local and total events:
+ # echo "non_default_ctrl_mon_grp/child_non_default_mon_grp/0_" >
+ /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+
+ Assignment status after the update:
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+ //0=t;1=t;
+ /child_default_mon_grp/0=t;1=t;
+ non_default_ctrl_mon_grp//0=tl;1=tl;
+ non_default_ctrl_mon_grp/child_non_default_mon_grp/0=_;1=_;
+

"max_threshold_occupancy":
Read/write file provides the largest value (in
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 9fd37b6c3b24..7f8b1386287a 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -958,6 +958,30 @@ static char *mon_state_to_str(int mon_state, char *str)
return str;
}

+static int str_to_mon_state(char *flag)
+{
+ int i, mon_state = 0;
+
+ for (i = 0; i < strlen(flag); i++) {
+ switch (*(flag + i)) {
+ case 't':
+ mon_state |= ASSIGN_TOTAL;
+ break;
+ case 'l':
+ mon_state |= ASSIGN_LOCAL;
+ break;
+ case '_':
+ mon_state = ASSIGN_NONE;
+ break;
+ default:
+ mon_state = ASSIGN_NONE;
+ break;
+ }
+ }
+
+ return mon_state;
+}
+
static int rdtgroup_mbm_assign_control_show(struct kernfs_open_file *of,
struct seq_file *s, void *v)
{
@@ -1011,6 +1035,215 @@ static int rdtgroup_mbm_assign_control_show(struct kernfs_open_file *of,
return 0;
}

+static struct rdtgroup *resctrl_get_rdtgroup(enum rdt_group_type rtype, char *p_grp, char *c_grp)
+{
+ struct rdtgroup *rdtg, *crg;
+
+ if (rtype == RDTCTRL_GROUP && *p_grp == '\0') {
+ return &rdtgroup_default;
+ } else if (rtype == RDTCTRL_GROUP) {
+ list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list)
+ if (!strcmp(p_grp, rdtg->kn->name))
+ return rdtg;
+ } else if (rtype == RDTMON_GROUP) {
+ list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
+ if (!strcmp(p_grp, rdtg->kn->name)) {
+ list_for_each_entry(crg, &rdtg->mon.crdtgrp_list,
+ mon.crdtgrp_list) {
+ if (!strcmp(c_grp, crg->kn->name))
+ return crg;
+ }
+ }
+ }
+ }
+
+ return NULL;
+}
+
+static int resctrl_process_flags(enum rdt_group_type rtype, char *p_grp, char *c_grp, char *tok)
+{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ int op, mon_state, assign_state, unassign_state;
+ char *dom_str, *id_str, *op_str;
+ struct rdtgroup *rdt_grp;
+ struct rdt_domain *d;
+ unsigned long dom_id;
+ int ret = 0, found = 0;
+
+ rdt_grp = resctrl_get_rdtgroup(rtype, p_grp, c_grp);
+
+ if (!rdt_grp) {
+ rdt_last_cmd_puts("Not a valid resctrl group\n");
+ return -EINVAL;
+ }
+
+next:
+ if (!tok || tok[0] == '\0')
+ return 0;
+
+ /* Start processing the strings for each domain */
+ dom_str = strim(strsep(&tok, ";"));
+
+ op_str = strpbrk(dom_str, "=+-_");
+
+ if (op_str) {
+ op = *op_str;
+ } else {
+ rdt_last_cmd_puts("Missing operation =, +, -, _ character\n");
+ return -EINVAL;
+ }
+
+ id_str = strsep(&dom_str, "=+-_");
+
+ if (!id_str || kstrtoul(id_str, 10, &dom_id)) {
+ rdt_last_cmd_puts("Missing domain id\n");
+ return -EINVAL;
+ }
+
+ /* Verify if the dom_id is valid */
+ list_for_each_entry(d, &r->domains, list) {
+ if (d->id == dom_id) {
+ found = 1;
+ break;
+ }
+ }
+ if (!found) {
+ rdt_last_cmd_printf("Invalid domain id %ld\n", dom_id);
+ return -EINVAL;
+ }
+
+ if (op != '_')
+ mon_state = str_to_mon_state(dom_str);
+
+ assign_state = 0;
+ unassign_state = 0;
+
+ switch (op) {
+ case '+':
+ assign_state = mon_state;
+ break;
+ case '-':
+ unassign_state = mon_state;
+ break;
+ case '=':
+ assign_state = mon_state;
+ unassign_state = (ASSIGN_TOTAL | ASSIGN_LOCAL) & ~assign_state;
+ break;
+ case '_':
+ unassign_state = ASSIGN_TOTAL | ASSIGN_LOCAL;
+ break;
+ default:
+ break;
+ }
+
+ if (assign_state & ASSIGN_TOTAL)
+ ret = rdtgroup_assign_abmc(rdt_grp, QOS_L3_MBM_TOTAL_EVENT_ID,
+ ASSIGN_TOTAL);
+ if (ret)
+ goto out_fail;
+
+ if (assign_state & ASSIGN_LOCAL)
+ ret = rdtgroup_assign_abmc(rdt_grp, QOS_L3_MBM_LOCAL_EVENT_ID,
+ ASSIGN_LOCAL);
+
+ if (ret)
+ goto out_fail;
+
+ if (unassign_state & ASSIGN_TOTAL)
+ ret = rdtgroup_unassign_abmc(rdt_grp, QOS_L3_MBM_TOTAL_EVENT_ID,
+ ASSIGN_TOTAL);
+ if (ret)
+ goto out_fail;
+
+ if (unassign_state & ASSIGN_LOCAL)
+ ret = rdtgroup_unassign_abmc(rdt_grp, QOS_L3_MBM_LOCAL_EVENT_ID,
+ ASSIGN_LOCAL);
+ if (ret)
+ goto out_fail;
+
+ goto next;
+
+out_fail:
+
+ return -EINVAL;
+}
+
+static ssize_t rdtgroup_mbm_assign_control_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct rdt_resource *r = of->kn->parent->priv;
+ char *token, *cmon_grp, *mon_grp;
+ struct rdt_hw_resource *hw_res;
+ int ret = 0;
+
+ hw_res = resctrl_to_arch_res(r);
+ if (!hw_res->abmc_enabled)
+ return -EINVAL;
+
+ /* Valid input requires a trailing newline */
+ if (nbytes == 0 || buf[nbytes - 1] != '\n')
+ return -EINVAL;
+
+ buf[nbytes - 1] = '\0';
+ rdt_last_cmd_clear();
+
+ cpus_read_lock();
+ mutex_lock(&rdtgroup_mutex);
+
+ while ((token = strsep(&buf, "\n")) != NULL) {
+ if (strstr(token, "//")) {
+ /*
+ * The control mon group processing:
+ * default CTRL_MON group: "//<flags>"
+ * non-default CTRL_MON group: "<CTRL_MON group>//flags"
+ * The CTRL_MON group will be empty string if it is a
+ * default group.
+ */
+ cmon_grp = strsep(&token, "//");
+
+ /*
+ * strsep returns empty string for contiguous delimiters.
+ * Make sure to check for two consecutive delimiters and
+ * advance the token.
+ */
+ mon_grp = strsep(&token, "//");
+ if (*mon_grp != '\0') {
+ rdt_last_cmd_printf("Invalid CTRL_MON group format %s\n", token);
+ ret = -EINVAL;
+ break;
+ }
+
+ ret = resctrl_process_flags(RDTCTRL_GROUP, cmon_grp, mon_grp, token);
+ if (ret)
+ break;
+ } else if (strstr(token, "/")) {
+ /*
+ * Mon group processing:
+ * MON_GROUP inside default CTRL_MON group: "/<MON group>/<flags>"
+ * MON_GROUP within CTRL_MON group: "<CTRL_MON group>/<MON group>/<flags>"
+ */
+ cmon_grp = strsep(&token, "/");
+
+ /* Extract the MON_GROUP. It cannot be empty string */
+ mon_grp = strsep(&token, "/");
+ if (*mon_grp == '\0') {
+ rdt_last_cmd_printf("Invalid MON_GROUP format %s\n", token);
+ ret = -EINVAL;
+ break;
+ }
+
+ ret = resctrl_process_flags(RDTMON_GROUP, cmon_grp, mon_grp, token);
+ if (ret)
+ break;
+ }
+ }
+
+ mutex_unlock(&rdtgroup_mutex);
+ cpus_read_unlock();
+
+ return ret ?: nbytes;
+}
+
static int rdtgroup_mbm_assign_cntrs_show(struct kernfs_open_file *of,
struct seq_file *s, void *v)
{
@@ -2200,9 +2433,10 @@ static struct rftype res_common_files[] = {
},
{
.name = "mbm_assign_control",
- .mode = 0444,
+ .mode = 0644,
.kf_ops = &rdtgroup_kf_single_ops,
.seq_show = rdtgroup_mbm_assign_control_show,
+ .write = rdtgroup_mbm_assign_control_write,
},
{
.name = "mbm_assign_cntrs",
--
2.34.1
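
The operator handling in resctrl_process_flags() can be illustrated with a
small user-space sketch. This is not code from the series: the ASSIGN_*
values, flags_to_mask() and parse_domain_token() are hypothetical names
chosen for the example, and only the "<domain_id><op><flags>" part of a
line is parsed. It shows how '+', '-', '=' and '_' translate into a set of
events to assign and a set to unassign.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumed values for this sketch; the series defines ASSIGN_* elsewhere. */
#define ASSIGN_NONE  0
#define ASSIGN_TOTAL (1 << 0)
#define ASSIGN_LOCAL (1 << 1)

/* Translate a flag string such as "tl" into an event bitmask. */
static int flags_to_mask(const char *flags)
{
	int mask = ASSIGN_NONE;

	for (; *flags; flags++) {
		if (*flags == 't')
			mask |= ASSIGN_TOTAL;
		else if (*flags == 'l')
			mask |= ASSIGN_LOCAL;
		else
			mask = ASSIGN_NONE; /* '_' or unknown resets the mask */
	}
	return mask;
}

/*
 * Split one "<domain_id><op><flags>" token and compute which events the
 * operation assigns and unassigns. Returns 0 on success, -1 on bad input.
 */
static int parse_domain_token(const char *tok, unsigned long *dom_id,
			      int *assign, int *unassign)
{
	const char *op;
	char *end;
	int mask;

	assert(tok && dom_id && assign && unassign);

	op = strpbrk(tok, "=+-_");
	if (!op || op == tok)
		return -1; /* missing operator or missing domain id */

	*dom_id = strtoul(tok, &end, 10);
	if (end != op)
		return -1; /* domain id is not a plain decimal number */

	mask = (*op == '_') ? ASSIGN_NONE : flags_to_mask(op + 1);
	*assign = 0;
	*unassign = 0;

	switch (*op) {
	case '+': /* add the listed events, leave the others alone */
		*assign = mask;
		break;
	case '-': /* remove the listed events */
		*unassign = mask;
		break;
	case '=': /* listed events assigned, all others unassigned */
		*assign = mask;
		*unassign = (ASSIGN_TOTAL | ASSIGN_LOCAL) & ~mask;
		break;
	case '_': /* nothing assigned */
		*unassign = ASSIGN_TOTAL | ASSIGN_LOCAL;
		break;
	}
	return 0;
}
```

For example, parsing "0=t" yields assign = ASSIGN_TOTAL and
unassign = ASSIGN_LOCAL, while "1+l" only adds the local event and leaves
the total event untouched.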


2024-03-29 01:16:59

by Moger, Babu

Subject: [RFC PATCH v3 07/17] x86/resctrl: Add support to enable/disable ABMC feature

Add the functionality to enable/disable ABMC feature.

ABMC is enabled by setting the enable bit (bit 0) in the L3_QOS_EXT_CFG
MSR. When the state of ABMC changes, the new value must be written on all
logical processors in the QOS Domain.

The ABMC feature details are documented in APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Signed-off-by: Babu Moger <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
---
v3: No changes.

v2: Few text changes in commit message.
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 12 ++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 76 +++++++++++++++++++++++++-
3 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 05956bd8bacf..f16ee50b1a23 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1165,6 +1165,7 @@
#define MSR_IA32_MBA_BW_BASE 0xc0000200
#define MSR_IA32_SMBA_BW_BASE 0xc0000280
#define MSR_IA32_EVT_CFG_BASE 0xc0000400
+#define MSR_IA32_L3_QOS_EXT_CFG 0xc00003ff

/* MSR_IA32_VMX_MISC bits */
#define MSR_IA32_VMX_MISC_INTEL_PT (1ULL << 14)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 722388621403..8238ee437369 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -96,6 +96,9 @@ cpumask_any_housekeeping(const struct cpumask *mask, int exclude_cpu)
return cpu;
}

+/* ABMC ENABLE */
+#define ABMC_ENABLE BIT(0)
+
struct rdt_fs_context {
struct kernfs_fs_context kfc;
bool enable_cdpl2;
@@ -433,6 +436,7 @@ struct rdt_parse_data {
* @mbm_cfg_mask: Bandwidth sources that can be tracked when Bandwidth
* Monitoring Event Configuration (BMEC) is supported.
* @cdp_enabled: CDP state of this resource
+ * @abmc_enabled: ABMC feature is enabled
*
* Members of this structure are either private to the architecture
* e.g. mbm_width, or accessed via helpers that provide abstraction. e.g.
@@ -448,6 +452,7 @@ struct rdt_hw_resource {
unsigned int mbm_width;
unsigned int mbm_cfg_mask;
bool cdp_enabled;
+ bool abmc_enabled;
};

static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r)
@@ -491,6 +496,13 @@ static inline bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)

int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);

+static inline bool resctrl_arch_get_abmc_enabled(enum resctrl_res_level l)
+{
+ return rdt_resources_all[l].abmc_enabled;
+}
+
+int resctrl_arch_set_abmc_enabled(enum resctrl_res_level l, bool enable);
+
/*
* To return the common struct rdt_resource, which is contained in struct
* rdt_hw_resource, walk the resctrl member of struct rdt_hw_resource.
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 05f551bc316e..f49073c86884 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -850,9 +850,15 @@ static int rdtgroup_mbm_assign_show(struct kernfs_open_file *of,
struct seq_file *s, void *v)
{
struct rdt_resource *r = of->kn->parent->priv;
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);

- if (r->mbm_assign_capable)
+ if (r->mbm_assign_capable && hw_res->abmc_enabled) {
+ seq_puts(s, "[abmc]\n");
+ seq_puts(s, "legacy_mbm\n");
+ } else if (r->mbm_assign_capable) {
seq_puts(s, "abmc\n");
+ seq_puts(s, "[legacy_mbm]\n");
+ }

return 0;
}
@@ -2433,6 +2439,74 @@ int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable)
return 0;
}

+static void resctrl_abmc_msrwrite(void *arg)
+{
+ bool *enable = arg;
+ u64 msrval;
+
+ rdmsrl(MSR_IA32_L3_QOS_EXT_CFG, msrval);
+
+ if (*enable)
+ msrval |= ABMC_ENABLE;
+ else
+ msrval &= ~ABMC_ENABLE;
+
+ wrmsrl(MSR_IA32_L3_QOS_EXT_CFG, msrval);
+}
+
+static int resctrl_abmc_setup(enum resctrl_res_level l, bool enable)
+{
+ struct rdt_resource *r = &rdt_resources_all[l].r_resctrl;
+ struct rdt_domain *d;
+
+ /* Update QOS_CFG MSR on all the CPUs in cpu_mask */
+ list_for_each_entry(d, &r->domains, list) {
+ on_each_cpu_mask(&d->cpu_mask, resctrl_abmc_msrwrite, &enable, 1);
+ resctrl_arch_reset_rmid_all(r, d);
+ }
+
+ return 0;
+}
+
+static int resctrl_abmc_enable(enum resctrl_res_level l)
+{
+ struct rdt_hw_resource *hw_res = &rdt_resources_all[l];
+ int ret = 0;
+
+ if (!hw_res->abmc_enabled) {
+ ret = resctrl_abmc_setup(l, true);
+ if (!ret)
+ hw_res->abmc_enabled = true;
+ }
+
+ return ret;
+}
+
+static void resctrl_abmc_disable(enum resctrl_res_level l)
+{
+ struct rdt_hw_resource *hw_res = &rdt_resources_all[l];
+
+ if (hw_res->abmc_enabled) {
+ resctrl_abmc_setup(l, false);
+ hw_res->abmc_enabled = false;
+ }
+}
+
+int resctrl_arch_set_abmc_enabled(enum resctrl_res_level l, bool enable)
+{
+ struct rdt_hw_resource *hw_res = &rdt_resources_all[l];
+
+ if (!hw_res->r_resctrl.mbm_assign_capable)
+ return -EINVAL;
+
+ if (enable)
+ return resctrl_abmc_enable(l);
+
+ resctrl_abmc_disable(l);
+
+ return 0;
+}
+
/*
* We don't allow rdtgroup directories to be created anywhere
* except the root directory. Thus when looking for the rdtgroup
--
2.34.1
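
The read-modify-write done by resctrl_abmc_msrwrite() can be sketched in
user space as a pure function on the register value. This is an
illustration only: abmc_update_cfg() is a hypothetical helper, and the real
code reads and writes the MSR with rdmsrl()/wrmsrl() on each CPU of the
domain. The point is that only bit 0 changes and every other bit of the
register is preserved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ABMC_ENABLE (UINT64_C(1) << 0) /* bit 0, as in the patch */

/* Return the new register value with only the ABMC enable bit changed. */
static uint64_t abmc_update_cfg(uint64_t msrval, bool enable)
{
	uint64_t orig = msrval;

	if (enable)
		msrval |= ABMC_ENABLE;
	else
		msrval &= ~ABMC_ENABLE;

	/* All bits other than the enable bit must be untouched. */
	assert((msrval & ~ABMC_ENABLE) == (orig & ~ABMC_ENABLE));
	return msrval;
}
```

With an initial value of 0xf0, enabling gives 0xf1 and disabling again
returns 0xf0.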


2024-03-29 01:17:20

by Moger, Babu

Subject: [RFC PATCH v3 15/17] x86/resctrl: Introduce the interface switch between ABMC and legacy_mbm

Introduce rdtgroup_mbm_assign_write to switch between ABMC and legacy_mbm.

By default, ABMC is enabled on resctrl mount if the feature is available.
However, the user has the option to switch back to legacy_mbm if required.

Signed-off-by: Babu Moger <[email protected]>

---
v3: New patch to address the review comments from upstream.
---
Documentation/arch/x86/resctrl.rst | 14 ++++++++-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 39 +++++++++++++++++++++++++-
2 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index e06ffddb64f6..3e441b828765 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -260,7 +260,19 @@ with the following files:
"mbm_assign":
Available when assignable monitoring features are supported.
Reports the list of assignable features supported and the enclosed brackets
- indicate the feature is enabled.
+ indicate the feature is enabled. Users will have the option to switch
+ between the monitoring features.
+ Examples:
+
+ * To enable ABMC feature:
+ ::
+
+ # echo "abmc" > /sys/fs/resctrl/info/L3_MON/mbm_assign
+
+ * To enable the legacy monitoring feature:
+ ::
+
+ # echo "legacy_mbm" > /sys/fs/resctrl/info/L3_MON/mbm_assign

"mbm_assign_cntrs":
The number of assignable counters available when the assignable monitoring
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 2e58024e95e2..a6e0ef2631ae 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -896,6 +896,42 @@ static int rdtgroup_mbm_assign_show(struct kernfs_open_file *of,
return 0;
}

+/*
+ * rdtgroup_mbm_assign_write - Switch between ABMC and legacy_mbm monitoring
+ */
+static ssize_t rdtgroup_mbm_assign_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct rdt_resource *r = of->kn->parent->priv;
+ int ret;
+
+ if (!r->mbm_assign_capable)
+ return -EINVAL;
+
+ /* Valid input requires a trailing newline */
+ if (nbytes == 0 || buf[nbytes - 1] != '\n')
+ return -EINVAL;
+
+ buf[nbytes - 1] = '\0';
+
+ cpus_read_lock();
+ mutex_lock(&rdtgroup_mutex);
+
+ rdt_last_cmd_clear();
+
+ if (!strcmp(buf, "legacy_mbm"))
+ ret = resctrl_arch_set_abmc_enabled(RDT_RESOURCE_L3, false);
+ else if (!strcmp(buf, "abmc"))
+ ret = resctrl_arch_set_abmc_enabled(RDT_RESOURCE_L3, true);
+ else
+ ret = -EINVAL;
+
+ mutex_unlock(&rdtgroup_mutex);
+ cpus_read_unlock();
+
+ return ret ?: nbytes;
+}
+
static int rdtgroup_mbm_assign_cntrs_show(struct kernfs_open_file *of,
struct seq_file *s, void *v)
{
@@ -2078,9 +2114,10 @@ static struct rftype res_common_files[] = {
},
{
.name = "mbm_assign",
- .mode = 0444,
+ .mode = 0644,
.kf_ops = &rdtgroup_kf_single_ops,
.seq_show = rdtgroup_mbm_assign_show,
+ .write = rdtgroup_mbm_assign_write,
},
{
.name = "mbm_assign_cntrs",
--
2.34.1
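
The input validation in rdtgroup_mbm_assign_write() can be tried out in
user space with a sketch like the one below. The enum and
parse_mbm_assign() are hypothetical names for this example, not part of the
series. As in the patch, the buffer must end with a newline, which is
stripped before the exact string comparison; this is why the
`echo "abmc"` form in the documentation works, since echo appends the
newline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum mbm_mode {
	MBM_MODE_INVALID = -1,
	MBM_MODE_LEGACY = 0,
	MBM_MODE_ABMC = 1,
};

/*
 * Validate a buffer written to mbm_assign. The buffer is modified in
 * place: the trailing newline is replaced by a NUL terminator.
 */
static enum mbm_mode parse_mbm_assign(char *buf, size_t nbytes)
{
	assert(buf);

	/* Valid input requires a trailing newline. */
	if (nbytes == 0 || buf[nbytes - 1] != '\n')
		return MBM_MODE_INVALID;

	buf[nbytes - 1] = '\0';

	if (!strcmp(buf, "legacy_mbm"))
		return MBM_MODE_LEGACY;
	if (!strcmp(buf, "abmc"))
		return MBM_MODE_ABMC;

	return MBM_MODE_INVALID;
}
```

A write of "abmc" without the newline, or of any other string, is rejected,
which corresponds to the -EINVAL paths in the patch.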


2024-03-29 01:17:28

by Moger, Babu

Subject: [RFC PATCH v3 11/17] x86/resctrl: Introduce mbm_total_cfg and mbm_local_cfg

If the BMEC (Bandwidth Monitoring Event Configuration) feature is
supported, the bandwidth events can be configured to track specific events.
The event configuration is domain specific. ABMC (Assignable Bandwidth
Monitoring Counters) feature needs event configuration information to
assign RMID to the hardware counter. Currently, this information is not
available.

Save the event configuration information in the rdt_hw_domain, so it can
be used for RMID assignment.

Signed-off-by: Babu Moger <[email protected]>

---
v3: Minor changes related to rebase in mbm_config_write_domain.

v2: No changes.
---
arch/x86/kernel/cpu/resctrl/core.c | 2 ++
arch/x86/kernel/cpu/resctrl/internal.h | 3 +++
arch/x86/kernel/cpu/resctrl/monitor.c | 11 +++++++++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 16 +++++++++++++++-
4 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 50e9ec5e547b..ed4f6d49d737 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -555,6 +555,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;
}

+ arch_domain_mbm_evt_config(hw_dom);
+
list_add_tail_rcu(&d->list, add_pos);

err = resctrl_online_domain(r, d);
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 41b06d46ea74..88453c86474b 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -385,6 +385,8 @@ struct rdt_hw_domain {
u32 *ctrl_val;
struct arch_mbm_state *arch_mbm_total;
struct arch_mbm_state *arch_mbm_local;
+ u32 mbm_total_cfg;
+ u32 mbm_local_cfg;
};

static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
@@ -648,6 +650,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free);
void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
void __init resctrl_file_fflags_init(const char *config,
unsigned long fflags);
+void arch_domain_mbm_evt_config(struct rdt_hw_domain *hw_dom);
void rdt_staged_configs_clear(void);
bool closid_allocated(unsigned int closid);
int resctrl_find_cleanest_closid(void);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 56dc49021540..8677dbf6de43 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1090,3 +1090,14 @@ void __init intel_rdt_mbm_apply_quirk(void)
mbm_cf_rmidthreshold = mbm_cf_table[cf_index].rmidthreshold;
mbm_cf = mbm_cf_table[cf_index].cf;
}
+
+void arch_domain_mbm_evt_config(struct rdt_hw_domain *hw_dom)
+{
+ if (mbm_total_event.configurable)
+ hw_dom->mbm_total_cfg = MAX_EVT_CONFIG_BITS;
+
+ if (mbm_local_event.configurable)
+ hw_dom->mbm_local_cfg = READS_TO_LOCAL_MEM |
+ NON_TEMP_WRITE_TO_LOCAL_MEM |
+ READS_TO_LOCAL_S_MEM;
+}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 54ae2e6bf612..7f54788a58de 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1703,6 +1703,7 @@ static void mon_event_config_write(void *info)
static void mbm_config_write_domain(struct rdt_resource *r,
struct rdt_domain *d, u32 evtid, u32 val)
{
+ struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
struct mon_config_info mon_info = {0};

/*
@@ -1712,7 +1713,7 @@ static void mbm_config_write_domain(struct rdt_resource *r,
mon_info.evtid = evtid;
mondata_config_read(d, &mon_info);
if (mon_info.mon_config == val)
- return;
+ goto out;

mon_info.mon_config = val;

@@ -1725,6 +1726,16 @@ static void mbm_config_write_domain(struct rdt_resource *r,
smp_call_function_any(&d->cpu_mask, mon_event_config_write,
&mon_info, 1);

+ /*
+ * Update event config value in the domain when user changes it.
+ */
+ if (evtid == QOS_L3_MBM_TOTAL_EVENT_ID)
+ hw_dom->mbm_total_cfg = val;
+ else if (evtid == QOS_L3_MBM_LOCAL_EVENT_ID)
+ hw_dom->mbm_local_cfg = val;
+ else
+ goto out;
+
/*
* When an Event Configuration is changed, the bandwidth counters
* for all RMIDs and Events will be cleared by the hardware. The
@@ -1735,6 +1746,9 @@ static void mbm_config_write_domain(struct rdt_resource *r,
* mbm_local and mbm_total counts for all the RMIDs.
*/
resctrl_arch_reset_rmid_all(r, d);
+
+out:
+ return;
}

static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
--
2.34.1
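
The per-domain caching added by this patch can be modelled in user space as
follows. This is a sketch under assumptions: the bit positions and event
IDs below are written from memory of the kernel's resctrl headers and
should be checked against the tree, and struct dom_evt_cfg,
dom_evt_cfg_init() and dom_evt_cfg_update() are hypothetical names. It
shows the default values stored when a domain comes online and the update
performed when the user changes an event configuration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout of the BMEC event configuration. */
#define READS_TO_LOCAL_MEM           (1u << 0)
#define NON_TEMP_WRITE_TO_LOCAL_MEM  (1u << 2)
#define READS_TO_LOCAL_S_MEM         (1u << 4)
#define MAX_EVT_CONFIG_BITS          0x7fu /* bits 6..0 */

/* Assumed event IDs. */
#define QOS_L3_MBM_TOTAL_EVENT_ID 0x02u
#define QOS_L3_MBM_LOCAL_EVENT_ID 0x03u

struct dom_evt_cfg {
	uint32_t mbm_total_cfg;
	uint32_t mbm_local_cfg;
};

/* Defaults: total tracks every source, local tracks local sources only. */
static void dom_evt_cfg_init(struct dom_evt_cfg *c)
{
	assert(c);
	c->mbm_total_cfg = MAX_EVT_CONFIG_BITS;
	c->mbm_local_cfg = READS_TO_LOCAL_MEM |
			   NON_TEMP_WRITE_TO_LOCAL_MEM |
			   READS_TO_LOCAL_S_MEM;
}

/* Cache a new configuration; returns false for an unknown event ID. */
static bool dom_evt_cfg_update(struct dom_evt_cfg *c, uint32_t evtid,
			       uint32_t val)
{
	assert(c);
	if (evtid == QOS_L3_MBM_TOTAL_EVENT_ID)
		c->mbm_total_cfg = val;
	else if (evtid == QOS_L3_MBM_LOCAL_EVENT_ID)
		c->mbm_local_cfg = val;
	else
		return false;
	return true;
}
```

Under these assumptions the defaults come out as 0x7f for the total event
and 0x15 for the local event.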


2024-03-29 01:17:51

by Moger, Babu

Subject: [RFC PATCH v3 13/17] x86/resctrl: Add the functionality to unassign the RMID

With the ABMC (Assignable Bandwidth Monitoring Counters) feature, the user
can assign an RMID to a hardware counter, or unassign it, and monitor the
bandwidth for a longer duration.

Provide the functionality to unassign the counter from the group.

The feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Signed-off-by: Babu Moger <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537

---
v3: Removed the static from the prototype of rdtgroup_unassign_abmc.
The function is not called directly from user anymore. These
changes are related to global assignment interface.

v2: No changes.
---
arch/x86/kernel/cpu/resctrl/internal.h | 3 +++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 30 ++++++++++++++++++++++++++
2 files changed, 33 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 9d84c80104f9..90f0bac3ef3a 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -652,6 +652,9 @@ void __init resctrl_file_fflags_init(const char *config,
unsigned long fflags);
void arch_domain_mbm_evt_config(struct rdt_hw_domain *hw_dom);
ssize_t rdtgroup_assign_abmc(struct rdtgroup *rdtgrp, u32 evtid, int mon_state);
+ssize_t rdtgroup_unassign_abmc(struct rdtgroup *rdtgrp, u32 evtid,
+ int mon_state);
+void assign_cntrs_free(int counterid);
void rdt_staged_configs_clear(void);
bool closid_allocated(unsigned int closid);
int resctrl_find_cleanest_closid(void);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index cfbdaf8b5f83..b430ffa554a9 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -214,6 +214,11 @@ static int assign_cntrs_alloc(void)
return counterid;
}

+void assign_cntrs_free(int counterid)
+{
+ assign_cntrs_free_map |= 1 << counterid;
+}
+
/**
* rdtgroup_mode_by_closid - Return mode of resource group with closid
* @closid: closid if the resource group
@@ -1934,6 +1939,31 @@ ssize_t rdtgroup_assign_abmc(struct rdtgroup *rdtgrp, u32 evtid, int mon_state)
return 0;
}

+ssize_t rdtgroup_unassign_abmc(struct rdtgroup *rdtgrp, u32 evtid,
+ int mon_state)
+{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_domain *d;
+ int index;
+
+ index = mon_event_config_index_get(evtid);
+ if (index == INVALID_CONFIG_INDEX) {
+ pr_warn_once("Invalid event id %d\n", evtid);
+ return -EINVAL;
+ }
+
+ if (rdtgrp->mon.mon_state & mon_state) {
+ list_for_each_entry(d, &r->domains, list)
+ rdtgroup_abmc_domain(d, rdtgrp, evtid, index, 0);
+
+ assign_cntrs_free(rdtgrp->mon.abmc_ctr_id[index]);
+ }
+
+ rdtgrp->mon.mon_state &= ~mon_state;
+
+ return 0;
+}
+
/* rdtgroup information files for one cache resource. */
static struct rftype res_common_files[] = {
{
--
2.34.1
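
The free-map bookkeeping behind assign_cntrs_free() can be sketched in user
space as below. This is an illustration, not the series' code: NUM_CNTRS,
cntr_alloc() and cntr_free() are hypothetical, and the allocation policy
(lowest free counter first) is an assumption, since assign_cntrs_alloc() is
not shown in this patch. A set bit means the counter is free. The sketch
shifts an unsigned constant so that the top counter ID does not shift into
the sign bit of an int.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CNTRS 32 /* hypothetical number of hardware counters */

static uint32_t free_map = UINT32_MAX; /* bit set = counter is free */

/* Allocate the lowest free counter; returns -1 if none is left. */
static int cntr_alloc(void)
{
	for (int id = 0; id < NUM_CNTRS; id++) {
		uint32_t bit = UINT32_C(1) << id;

		if (free_map & bit) {
			free_map &= ~bit;
			return id;
		}
	}
	return -1;
}

/* Return a counter to the free map. */
static void cntr_free(int id)
{
	assert(id >= 0 && id < NUM_CNTRS);
	free_map |= UINT32_C(1) << id;
}
```

After two allocations return counters 0 and 1, freeing counter 0 makes it
the next one handed out again.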


2024-04-04 00:31:13

by Peter Newman

Subject: Re: [RFC PATCH v3 07/17] x86/resctrl: Add support to enable/disable ABMC feature

Hi Babu,

On Thu, Mar 28, 2024 at 6:07 PM Babu Moger <[email protected]> wrote:
> struct rdt_fs_context {
> struct kernfs_fs_context kfc;
> bool enable_cdpl2;
> @@ -433,6 +436,7 @@ struct rdt_parse_data {
> * @mbm_cfg_mask: Bandwidth sources that can be tracked when Bandwidth
> * Monitoring Event Configuration (BMEC) is supported.
> * @cdp_enabled: CDP state of this resource
> + * @abmc_enabled: ABMC feature is enabled
> *
> * Members of this structure are either private to the architecture
> * e.g. mbm_width, or accessed via helpers that provide abstraction. e.g.
> @@ -448,6 +452,7 @@ struct rdt_hw_resource {
> unsigned int mbm_width;
> unsigned int mbm_cfg_mask;
> bool cdp_enabled;
> + bool abmc_enabled;
> };
>
> static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r)
> @@ -491,6 +496,13 @@ static inline bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)
>
> int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
>
> +static inline bool resctrl_arch_get_abmc_enabled(enum resctrl_res_level l)
> +{
> + return rdt_resources_all[l].abmc_enabled;
> +}

This inline definition will not work in either this file or
fs/resctrl/internal.h, following James's change[1] moving the code.

resctrl_arch-definitions are either declared in linux/resctrl.h or
defined inline in a file like asm/resctrl.h.


> +
> +int resctrl_arch_set_abmc_enabled(enum resctrl_res_level l, bool enable);
> +
> /*
> * To return the common struct rdt_resource, which is contained in struct
> * rdt_hw_resource, walk the resctrl member of struct rdt_hw_resource.
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 05f551bc316e..f49073c86884 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -850,9 +850,15 @@ static int rdtgroup_mbm_assign_show(struct kernfs_open_file *of,
> struct seq_file *s, void *v)
> {
> struct rdt_resource *r = of->kn->parent->priv;
> + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>
> - if (r->mbm_assign_capable)
> + if (r->mbm_assign_capable && hw_res->abmc_enabled) {
> + seq_puts(s, "[abmc]\n");
> + seq_puts(s, "legacy_mbm\n");
> + } else if (r->mbm_assign_capable) {
> seq_puts(s, "abmc\n");
> + seq_puts(s, "[legacy_mbm]\n");
> + }

This looks like it would move to fs/resctrl/rdtgroup.c where it's not
possible to dereference an rdt_hw_resource struct.

It might be helpful to try building your changes on top of James's
change[1] to get an idea of how this would fit in post-refactoring.
I'll stop pointing out inconsistencies with his portability scheme
now.

>
> return 0;
> }
> @@ -2433,6 +2439,74 @@ int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable)
> return 0;
> }
>
> +static void resctrl_abmc_msrwrite(void *arg)
> +{
> + bool *enable = arg;
> + u64 msrval;
> +
> + rdmsrl(MSR_IA32_L3_QOS_EXT_CFG, msrval);
> +
> + if (*enable)
> + msrval |= ABMC_ENABLE;
> + else
> + msrval &= ~ABMC_ENABLE;
> +
> + wrmsrl(MSR_IA32_L3_QOS_EXT_CFG, msrval);
> +}
> +
> +static int resctrl_abmc_setup(enum resctrl_res_level l, bool enable)
> +{
> + struct rdt_resource *r = &rdt_resources_all[l].r_resctrl;
> + struct rdt_domain *d;
> +
> + /* Update QOS_CFG MSR on all the CPUs in cpu_mask */
> + list_for_each_entry(d, &r->domains, list) {
> + on_each_cpu_mask(&d->cpu_mask, resctrl_abmc_msrwrite, &enable, 1);
> + resctrl_arch_reset_rmid_all(r, d);
> + }
> +
> + return 0;
> +}
> +
> +static int resctrl_abmc_enable(enum resctrl_res_level l)
> +{
> + struct rdt_hw_resource *hw_res = &rdt_resources_all[l];
> + int ret = 0;
> +
> + if (!hw_res->abmc_enabled) {
> + ret = resctrl_abmc_setup(l, true);
> + if (!ret)
> + hw_res->abmc_enabled = true;

Presumably this would be called holding the rdtgroup_mutex? Perhaps a
lockdep assertion somewhere would be appropriate?

Thanks!
-Peter

[1] https://lore.kernel.org/lkml/[email protected]/

2024-04-04 15:20:17

by Moger, Babu

Subject: Re: [RFC PATCH v3 07/17] x86/resctrl: Add support to enable/disable ABMC feature

Hi Peter,

On 4/3/24 19:30, Peter Newman wrote:
> Hi Babu,
>
> On Thu, Mar 28, 2024 at 6:07 PM Babu Moger <[email protected]> wrote:
>> struct rdt_fs_context {
>> struct kernfs_fs_context kfc;
>> bool enable_cdpl2;
>> @@ -433,6 +436,7 @@ struct rdt_parse_data {
>> * @mbm_cfg_mask: Bandwidth sources that can be tracked when Bandwidth
>> * Monitoring Event Configuration (BMEC) is supported.
>> * @cdp_enabled: CDP state of this resource
>> + * @abmc_enabled: ABMC feature is enabled
>> *
>> * Members of this structure are either private to the architecture
>> * e.g. mbm_width, or accessed via helpers that provide abstraction. e.g.
>> @@ -448,6 +452,7 @@ struct rdt_hw_resource {
>> unsigned int mbm_width;
>> unsigned int mbm_cfg_mask;
>> bool cdp_enabled;
>> + bool abmc_enabled;
>> };
>>
>> static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r)
>> @@ -491,6 +496,13 @@ static inline bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)
>>
>> int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
>>
>> +static inline bool resctrl_arch_get_abmc_enabled(enum resctrl_res_level l)
>> +{
>> + return rdt_resources_all[l].abmc_enabled;
>> +}
>
> This inline definition will not work in either this file or
> fs/resctrl/internal.h, following James's change[1] moving the code.

Yea. I see..
>
> resctrl_arch-definitions are either declared in linux/resctrl.h or
> defined inline in a file like asm/resctrl.h.

ok.
>
>
>> +
>> +int resctrl_arch_set_abmc_enabled(enum resctrl_res_level l, bool enable);
>> +
>> /*
>> * To return the common struct rdt_resource, which is contained in struct
>> * rdt_hw_resource, walk the resctrl member of struct rdt_hw_resource.
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 05f551bc316e..f49073c86884 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -850,9 +850,15 @@ static int rdtgroup_mbm_assign_show(struct kernfs_open_file *of,
>> struct seq_file *s, void *v)
>> {
>> struct rdt_resource *r = of->kn->parent->priv;
>> + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>>
>> - if (r->mbm_assign_capable)
>> + if (r->mbm_assign_capable && hw_res->abmc_enabled) {
>> + seq_puts(s, "[abmc]\n");
>> + seq_puts(s, "legacy_mbm\n");
>> + } else if (r->mbm_assign_capable) {
>> seq_puts(s, "abmc\n");
>> + seq_puts(s, "[legacy_mbm]\n");
>> + }
>
> This looks like it would move to fs/resctrl/rdtgroup.c where it's not
> possible to dereference an rdt_hw_resource struct.
>
> It might be helpful to try building your changes on top of James's
> change[1] to get an idea of how this would fit in post-refactoring.
> I'll stop pointing out inconsistencies with his portability scheme
> now.

Considering the complexity of James changes, I was hoping my series will
go first. It would be difficult for me to make changes based on transient
patch series. I would think it would be best to base the patches based on
tip/master.

>
>>
>> return 0;
>> }
>> @@ -2433,6 +2439,74 @@ int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable)
>> return 0;
>> }
>>
>> +static void resctrl_abmc_msrwrite(void *arg)
>> +{
>> + bool *enable = arg;
>> + u64 msrval;
>> +
>> + rdmsrl(MSR_IA32_L3_QOS_EXT_CFG, msrval);
>> +
>> + if (*enable)
>> + msrval |= ABMC_ENABLE;
>> + else
>> + msrval &= ~ABMC_ENABLE;
>> +
>> + wrmsrl(MSR_IA32_L3_QOS_EXT_CFG, msrval);
>> +}
>> +
>> +static int resctrl_abmc_setup(enum resctrl_res_level l, bool enable)
>> +{
>> + struct rdt_resource *r = &rdt_resources_all[l].r_resctrl;
>> + struct rdt_domain *d;
>> +
>> + /* Update QOS_CFG MSR on all the CPUs in cpu_mask */
>> + list_for_each_entry(d, &r->domains, list) {
>> + on_each_cpu_mask(&d->cpu_mask, resctrl_abmc_msrwrite, &enable, 1);
>> + resctrl_arch_reset_rmid_all(r, d);
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static int resctrl_abmc_enable(enum resctrl_res_level l)
>> +{
>> + struct rdt_hw_resource *hw_res = &rdt_resources_all[l];
>> + int ret = 0;
>> +
>> + if (!hw_res->abmc_enabled) {
>> + ret = resctrl_abmc_setup(l, true);
>> + if (!ret)
>> + hw_res->abmc_enabled = true;
>
> Presumably this would be called holding the rdtgroup_mutex? Perhaps a
> lockdep assertion somewhere would be appropriate?

Yes. Sure. Will add it in the next revision.

>
> Thanks!
> -Peter
>
> [1] https://lore.kernel.org/lkml/[email protected]/

--
Thanks
Babu Moger

2024-04-04 17:36:50

by Peter Newman

Subject: Re: [RFC PATCH v3 07/17] x86/resctrl: Add support to enable/disable ABMC feature

Hi Babu,

On Thu, Apr 4, 2024 at 8:16 AM Moger, Babu <[email protected]> wrote:

> On 4/3/24 19:30, Peter Newman wrote:
> > This looks like it would move to fs/resctrl/rdtgroup.c where it's not
> > possible to dereference an rdt_hw_resource struct.
> >
> > It might be helpful to try building your changes on top of James's
> > change[1] to get an idea of how this would fit in post-refactoring.
> > I'll stop pointing out inconsistencies with his portability scheme
> > now.
>
> Considering the complexity of James changes, I was hoping my series will
> go first. It would be difficult for me to make changes based on transient
> patch series. I would think it would be best to base the patches based on
> tip/master.

I don't need you to push the patches to the mailing list based on
James's series. I was just asking you to try building locally on top
of the refactoring changes. You are putting in the effort trying to
make this code portable (i.e., inventing new
resctrl_arch_-interfaces), so it would be sensible to check your work
locally.

However, I am the main stakeholder who cares about MPAM and ABMC
working in the same kernel, so I can continue to give feedback on
portability as I compose the series' together.

-Peter

2024-04-04 18:36:27

by Moger, Babu

Subject: Re: [RFC PATCH v3 07/17] x86/resctrl: Add support to enable/disable ABMC feature

Hi Peter,

On 4/4/24 12:36, Peter Newman wrote:
> Hi Babu,
>
> On Thu, Apr 4, 2024 at 8:16 AM Moger, Babu <[email protected]> wrote:
>
>> On 4/3/24 19:30, Peter Newman wrote:
>>> This looks like it would move to fs/resctrl/rdtgroup.c where it's not
>>> possible to dereference an rdt_hw_resource struct.
>>>
>>> It might be helpful to try building your changes on top of James's
>>> change[1] to get an idea of how this would fit in post-refactoring.
>>> I'll stop pointing out inconsistencies with his portability scheme
>>> now.
>>
>> Considering the complexity of James changes, I was hoping my series will
>> go first. It would be difficult for me to make changes based on transient
>> patch series. I would think it would be best to base the patches based on
>> tip/master.
>
> I don't need you to push the patches to the mailing list based on
> James's series. I was just asking you to try building locally on top
> of the refactoring changes. You are putting in the effort trying to
> make this code portable (i.e., inventing new
> resctrl_arch_-interfaces), so it would be sensible to check your work
> locally.

I am really not focusing much on portability in this series.
I named it to match resctrl_arch_set_cdp_enabled.
Yes. I got your concerns. I plan to check against James's changes in the
next revision.

>
> However, I am the main stakeholder who cares about MPAM and ABMC
> working in the same kernel, so I can continue to give feedback on
> portability as I compose the series' together.

Agree. Please continue your feedback.
--
Thanks
Babu Moger

2024-04-04 18:45:09

by Reinette Chatre

Subject: Re: [RFC PATCH v3 07/17] x86/resctrl: Add support to enable/disable ABMC feature

Hi Peter,

On 4/3/2024 5:30 PM, Peter Newman wrote:

..
>
> Presumably this would be called holding the rdtgroup_mutex? Perhaps a
> lockdep assertion somewhere would be appropriate?
>

Considering that you are digging into the implementation already, can
it be assumed that you approve (while considering how "soft RMID" may
build on this) of the new interface as described in the cover letter?

Reinette

2024-04-04 19:01:25

by Peter Newman

Subject: Re: [RFC PATCH v3 07/17] x86/resctrl: Add support to enable/disable ABMC feature

Hi Reinette,

On Thu, Apr 4, 2024 at 11:43 AM Reinette Chatre
<[email protected]> wrote:
>
> Hi Peter,
>
> On 4/3/2024 5:30 PM, Peter Newman wrote:
>
> ...
> >
> > Presumably this would be called holding the rdtgroup_mutex? Perhaps a
> > lockdep assertion somewhere would be appropriate?
> >
>
> Considering that you are digging into the implementation already, can
> it be assumed that you approve (while considering how "soft RMID" may
> build on this) of the new interface as described in the cover letter?

Yes, I believe we came back to an agreement when discussing the last
series. I'll look over the cover letter in this series just to make
sure everything is there.

-Peter

2024-04-04 19:19:26

by Peter Newman

Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On Thu, Mar 28, 2024 at 6:07 PM Babu Moger <[email protected]> wrote:
> The list follows the following format:
>
> * Default CTRL_MON group:
> "//<domain_id>=<assignment_flags>"
>
> * Non-default CTRL_MON group:
> "<CTRL_MON group>//<domain_id>=<assignment_flags>"
>
> * Child MON group of default CTRL_MON group:
> "/<MON group>/<domain_id>=<assignment_flags>"
>
> * Child MON group of non-default CTRL_MON group:
> "<CTRL_MON group>/<MON group>/<domain_id>=<assignment_flags>"
>
> Assignment flags can be one of the following:
>
> t MBM total event is assigned
> l MBM local event is assigned
> tl Both total and local MBM events are assigned
> _ None of the MBM events are assigned
>
> Examples:
>
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> non_defult_group//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> non_defult_group/non_default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> //0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> /default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
>
> There are four groups and all the groups have local and total event assigned.
>
> "//" - This is a default CONTROL MON group
>
> "non_defult_group//" - This is non default CONTROL MON group
>
> "/default_mon1/" - This is Child MON group of the defult group
>
> "non_defult_group/non_default_mon1/" - This is child MON group of the non default group
>
> =tl means both total and local events are assigned.

I recall there was supposed to be a way to perform the same update on
all domains together so that it isn't tedious to not do per-domain
customizations. (And also to avoid serializing programming all the
domains the same way.)


>
> .../admin-guide/kernel-parameters.txt | 2 +-
> Documentation/arch/x86/resctrl.rst | 144 ++++
> arch/x86/include/asm/cpufeatures.h | 1 +
> arch/x86/include/asm/msr-index.h | 2 +
> arch/x86/kernel/cpu/cpuid-deps.c | 3 +
> arch/x86/kernel/cpu/resctrl/core.c | 25 +-
> arch/x86/kernel/cpu/resctrl/internal.h | 56 +-
> arch/x86/kernel/cpu/resctrl/monitor.c | 24 +-
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 714 +++++++++++++++++-
> arch/x86/kernel/cpu/scattered.c | 1 +
> include/linux/resctrl.h | 12 +
> 11 files changed, 964 insertions(+), 20 deletions(-)
>
> --
> 2.34.1
>

This should be fine for me to get started with. I'll see if I can work
backwards from the patches adding the parsing code to see how I'll
work the software implementation in.

Thanks!
-Peter

2024-04-04 20:03:04

by Moger, Babu

Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Peter,


On 4/4/24 14:08, Peter Newman wrote:
> Hi Babu,
>
> On Thu, Mar 28, 2024 at 6:07 PM Babu Moger <[email protected]> wrote:
>> The list follows the following format:
>>
>> * Default CTRL_MON group:
>> "//<domain_id>=<assignment_flags>"
>>
>> * Non-default CTRL_MON group:
>> "<CTRL_MON group>//<domain_id>=<assignment_flags>"
>>
>> * Child MON group of default CTRL_MON group:
>> "/<MON group>/<domain_id>=<assignment_flags>"
>>
>> * Child MON group of non-default CTRL_MON group:
>> "<CTRL_MON group>/<MON group>/<domain_id>=<assignment_flags>"
>>
>> Assignment flags can be one of the following:
>>
>> t MBM total event is assigned
>> l MBM local event is assigned
>> tl Both total and local MBM events are assigned
>> _ None of the MBM events are assigned
>>
>> Examples:
>>
>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> non_defult_group//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
>> non_defult_group/non_default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
>> //0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
>> /default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
>>
>> There are four groups and all the groups have local and total event assigned.
>>
>> "//" - This is a default CONTROL MON group
>>
>> "non_defult_group//" - This is non default CONTROL MON group
>>
>> "/default_mon1/" - This is Child MON group of the defult group
>>
>> "non_defult_group/non_default_mon1/" - This is child MON group of the non default group
>>
>> =tl means both total and local events are assigned.
>
> I recall there was supposed to be a way to perform the same update on
> all domains together so that it isn't tedious to not do per-domain

Yes. Correct. Reinette suggested that "no domains" should mean ALL the domains.

Example:

Initial list:
$ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
non_def_ctrl_mon_grep//0=_;1=_;2=_;3=_;4=_;5=_;6=_;7=_;
//0=_;1=_;2=_;3=_;4=_;5=_;6=_;7=_;

Two groups and no events assigned.


To assign the total event on all the domains, the command will look like this:

$ echo "//=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

Parsing becomes ugly here. I look for the domain number after the name. Now I
have to add some ugly checks there.


I also thought about something like this:

$ echo "//FFFF=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

FFFF means all the domains. But there could also be a domain numbered FFFF.

So, I dropped the idea.
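
The "no domain id means all domains" rule described above can be sketched in userspace C as follows. This is only an illustration under stated assumptions: parse_dom_token() and DOM_ALL are hypothetical names that do not appear in the series, and only the '=' op-code is handled.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define DOM_ALL	(-1L)	/* hypothetical marker: apply to every domain */

/*
 * Split one "<domain_id>=<flags>" token. An empty domain id, as in "=t",
 * selects all domains. Returns 0 on success or -EINVAL on a malformed token.
 */
static int parse_dom_token(const char *tok, long *dom_id, const char **flags)
{
	const char *eq = strchr(tok, '=');
	char *end;

	if (!eq)
		return -EINVAL;

	*flags = eq + 1;

	if (eq == tok) {		/* nothing before '=' */
		*dom_id = DOM_ALL;
		return 0;
	}

	/* The id must be a non-negative decimal number ending right at '='. */
	errno = 0;
	*dom_id = strtol(tok, &end, 10);
	if (errno || end != eq || *dom_id < 0)
		return -EINVAL;

	return 0;
}
```

Because the "all domains" case is signalled by the absence of an id rather than by a reserved value such as FFFF, it cannot collide with a real domain number.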


> customizations. (And also to avoid serializing programming all the
> domains the same way.)

One more thing with respect to domains:

This series updates all the domains when assignment is requested.
Makes it easy to implement.

For example:

$ echo "//0=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

This command will assign the total event on all the domains of the default
group, even though the user passed only domain 0.

I am looking at supporting domain-specific assignment right now.
If your use case is specific to each domain, then I can add that support in
the next revision.

>
>
>>
>> .../admin-guide/kernel-parameters.txt | 2 +-
>> Documentation/arch/x86/resctrl.rst | 144 ++++
>> arch/x86/include/asm/cpufeatures.h | 1 +
>> arch/x86/include/asm/msr-index.h | 2 +
>> arch/x86/kernel/cpu/cpuid-deps.c | 3 +
>> arch/x86/kernel/cpu/resctrl/core.c | 25 +-
>> arch/x86/kernel/cpu/resctrl/internal.h | 56 +-
>> arch/x86/kernel/cpu/resctrl/monitor.c | 24 +-
>> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 714 +++++++++++++++++-
>> arch/x86/kernel/cpu/scattered.c | 1 +
>> include/linux/resctrl.h | 12 +
>> 11 files changed, 964 insertions(+), 20 deletions(-)
>>
>> --
>> 2.34.1
>>
>
> This should be fine for me to get started with. I'll see if I can work
> backwards from the patches adding the parsing code to see how I'll
> work the software implementation in.
>
> Thanks!
> -Peter

--
Thanks
Babu Moger

2024-04-16 19:01:14

by Peter Newman

Subject: Re: [RFC PATCH v3 09/17] x86/resctrl: Introduce assign state for the mon group

Hi Babu,

On Thu, Mar 28, 2024 at 6:08 PM Babu Moger <[email protected]> wrote:
>
> +/*
> + * monitor group's state when ABMC is supported
> + */
> +#define ASSIGN_NONE 0
> +#define ASSIGN_TOTAL BIT(0)
> +#define ASSIGN_LOCAL BIT(1)

We already have an enumeration for the monitoring events (i.e.,
QOS_L3_MBM_TOTAL_EVENT_ID), which should already be suitable for
maintaining a bitmap of which events have assigned monitors.

-Peter

2024-04-16 19:53:18

by Moger, Babu

Subject: Re: [RFC PATCH v3 09/17] x86/resctrl: Introduce assign state for the mon group

Hi Peter,

On 4/16/24 13:52, Peter Newman wrote:
> Hi Babu,
>
> On Thu, Mar 28, 2024 at 6:08 PM Babu Moger <[email protected]> wrote:
>>
>> +/*
>> + * monitor group's state when ABMC is supported
>> + */
>> +#define ASSIGN_NONE 0
>> +#define ASSIGN_TOTAL BIT(0)
>> +#define ASSIGN_LOCAL BIT(1)
>
> We already have an enumeration for the monitoring events (i.e.,
> QOS_L3_MBM_TOTAL_EVENT_ID), which should already be suitable for
> maintaining a bitmap of which events have assigned monitors.
>

/*
 * Event IDs, the values match those used to program IA32_QM_EVTSEL before
 * reading IA32_QM_CTR on RDT systems.
 */
enum resctrl_event_id {
	QOS_L3_OCCUP_EVENT_ID		= 0x01,
	QOS_L3_MBM_TOTAL_EVENT_ID	= 0x02,
	QOS_L3_MBM_LOCAL_EVENT_ID	= 0x03,
};

I think you are referring to this definition. We need just one bit for
each event. The QOS_L3_MBM_LOCAL_EVENT_ID value (0x03, with both bit 0 and
bit 1 set) does not work for us as a bitmask here.

I can change the definition to something like this.

+#define ASSIGN_NONE 0
+#define ASSIGN_TOTAL BIT(QOS_L3_MBM_TOTAL_EVENT_ID)
+#define ASSIGN_LOCAL BIT(QOS_L3_MBM_LOCAL_EVENT_ID)

Is that what you meant?
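
A small userspace sketch of that proposal, assuming a stand-in for the kernel's BIT() macro, shows why the raw event IDs overlap as masks while the BIT()-derived values do not. event_assigned() is a hypothetical helper added only for illustration.

```c
/* Event IDs as quoted above. */
enum resctrl_event_id {
	QOS_L3_OCCUP_EVENT_ID		= 0x01,
	QOS_L3_MBM_TOTAL_EVENT_ID	= 0x02,
	QOS_L3_MBM_LOCAL_EVENT_ID	= 0x03,
};

/* Userspace stand-in for the kernel's BIT() macro. */
#define BIT(nr)		(1UL << (nr))

#define ASSIGN_NONE	0
#define ASSIGN_TOTAL	BIT(QOS_L3_MBM_TOTAL_EVENT_ID)	/* 0x4 */
#define ASSIGN_LOCAL	BIT(QOS_L3_MBM_LOCAL_EVENT_ID)	/* 0x8 */

/* Hypothetical helper: is the given event set in the state bitmap? */
static int event_assigned(unsigned long state, enum resctrl_event_id evt)
{
	return (state & BIT(evt)) != 0;
}
```

With this layout a caller only needs the event ID; the state bit is derived from it instead of being passed as a second name for the same event.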

--
Thanks
Babu Moger

2024-04-17 17:45:41

by Peter Newman

Subject: Re: [RFC PATCH v3 17/17] x86/resctrl: Introduce interface to modify assignment states of the groups

Hi Babu,

On Thu, Mar 28, 2024 at 6:10 PM Babu Moger <[email protected]> wrote:
>
> Introduce rdtgroup_mbm_assign_control_write to assign mbm events.
> Assignment state can be updated by writing to this interface.
> Assignment states are applied on all the domains. Assignment on one
> domain applied on all the domains. User can pass one valid domain and
> assignment will be updated on all the available domains.

It sounds like you said the same thing 3 times in a row.


> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index 2d96565501ab..64ec70637c66 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -328,6 +328,77 @@ with the following files:
> None of events are assigned on this mon group. This is a child
> monitor group of the non default control mon group.
>
> + Assignment state can be updated by writing to this interface.
> +
> + NOTE: Assignment on one domain applied on all the domains. User can
> + pass one valid domain and assignment will be updated on all the
> + available domains.

How would different assignments to different domains work? If the
allocations are global, then the allocated monitor ID is available to
all domains whether they use it or not.


> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 9fd37b6c3b24..7f8b1386287a 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1011,6 +1035,215 @@ static int rdtgroup_mbm_assign_control_show(struct kernfs_open_file *of,
> return 0;
> }
>
> +static struct rdtgroup *resctrl_get_rdtgroup(enum rdt_group_type rtype, char *p_grp, char *c_grp)
> +{
> + struct rdtgroup *rdtg, *crg;
> +
> + if (rtype == RDTCTRL_GROUP && *p_grp == '\0') {
> + return &rdtgroup_default;
> + } else if (rtype == RDTCTRL_GROUP) {
> + list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list)
> + if (!strcmp(p_grp, rdtg->kn->name))
> + return rdtg;
> + } else if (rtype == RDTMON_GROUP) {
> + list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
> + if (!strcmp(p_grp, rdtg->kn->name)) {
> + list_for_each_entry(crg, &rdtg->mon.crdtgrp_list,
> + mon.crdtgrp_list) {
> + if (!strcmp(c_grp, crg->kn->name))
> + return crg;
> + }
> + }
> + }
> + }
> +
> + return NULL;
> +}
> +
> +static int resctrl_process_flags(enum rdt_group_type rtype, char *p_grp, char *c_grp, char *tok)
> +{
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + int op, mon_state, assign_state, unassign_state;
> + char *dom_str, *id_str, *op_str;
> + struct rdtgroup *rdt_grp;
> + struct rdt_domain *d;
> + unsigned long dom_id;
> + int ret, found = 0;
> +
> + rdt_grp = resctrl_get_rdtgroup(rtype, p_grp, c_grp);
> +
> + if (!rdt_grp) {
> + rdt_last_cmd_puts("Not a valid resctrl group\n");
> + return -EINVAL;
> + }
> +
> +next:
> + if (!tok || tok[0] == '\0')
> + return 0;
> +
> + /* Start processing the strings for each domain */
> + dom_str = strim(strsep(&tok, ";"));
> +
> + op_str = strpbrk(dom_str, "=+-_");
> +
> + if (op_str) {
> + op = *op_str;
> + } else {
> + rdt_last_cmd_puts("Missing operation =, +, -, _ character\n");
> + return -EINVAL;
> + }
> +
> + id_str = strsep(&dom_str, "=+-_");
> +
> + if (!id_str || kstrtoul(id_str, 10, &dom_id)) {
> + rdt_last_cmd_puts("Missing domain id\n");
> + return -EINVAL;
> + }
> +
> + /* Verify if the dom_id is valid */
> + list_for_each_entry(d, &r->domains, list) {
> + if (d->id == dom_id) {
> + found = 1;
> + break;
> + }
> + }
> + if (!found) {
> + rdt_last_cmd_printf("Invalid domain id %ld\n", dom_id);
> + return -EINVAL;
> + }
> +
> + if (op != '_')
> + mon_state = str_to_mon_state(dom_str);
> +
> + assign_state = 0;
> + unassign_state = 0;
> +
> + switch (op) {
> + case '+':
> + assign_state = mon_state;
> + break;
> + case '-':
> + unassign_state = mon_state;
> + break;
> + case '=':
> + assign_state = mon_state;
> + unassign_state = (ASSIGN_TOTAL | ASSIGN_LOCAL) & ~assign_state;
> + break;
> + case '_':
> + unassign_state = ASSIGN_TOTAL | ASSIGN_LOCAL;
> + break;
> + default:
> + break;
> + }
> +
> + if (assign_state & ASSIGN_TOTAL)
> + ret = rdtgroup_assign_abmc(rdt_grp, QOS_L3_MBM_TOTAL_EVENT_ID,
> + ASSIGN_TOTAL);

Related to my comments yesterday[1], it seems redundant for an
interface to need two names for the same event.


> + if (ret)
> + goto out_fail;
> +
> + if (assign_state & ASSIGN_LOCAL)
> + ret = rdtgroup_assign_abmc(rdt_grp, QOS_L3_MBM_LOCAL_EVENT_ID,
> + ASSIGN_LOCAL);
> +
> + if (ret)
> + goto out_fail;
> +
> + if (unassign_state & ASSIGN_TOTAL)
> + ret = rdtgroup_unassign_abmc(rdt_grp, QOS_L3_MBM_TOTAL_EVENT_ID,
> + ASSIGN_TOTAL);
> + if (ret)
> + goto out_fail;
> +
> + if (unassign_state & ASSIGN_LOCAL)
> + ret = rdtgroup_unassign_abmc(rdt_grp, QOS_L3_MBM_LOCAL_EVENT_ID,
> + ASSIGN_LOCAL);
> + if (ret)
> + goto out_fail;
> +
> + goto next;

I saw that each call to rdtgroup_assign_abmc() allocates a counter.
Does that mean assigning to multiple domains (in the same or multiple
commands) allocates a new counter (or pair of counters) in every
domain?

Thanks!
-Peter

[1] https://lore.kernel.org/lkml/CALPaoCj_yb_muT78jFQ5gL0wkifohSAVwxMDTm2FX_2YVpANdw@mail.gmail.com/

2024-04-17 19:39:25

by Moger, Babu

Subject: Re: [RFC PATCH v3 17/17] x86/resctrl: Introduce interface to modify assignment states of the groups

Hi Peter,

On 4/17/24 12:45, Peter Newman wrote:
> Hi Babu,
>
> On Thu, Mar 28, 2024 at 6:10 PM Babu Moger <[email protected]> wrote:
>>
>> Introduce rdtgroup_mbm_assign_control_write to assign mbm events.
>> Assignment state can be updated by writing to this interface.
>> Assignment states are applied on all the domains. Assignment on one
>> domain applied on all the domains. User can pass one valid domain and
>> assignment will be updated on all the available domains.
>
> It sounds like you said the same thing 3 times in a row.

Sure. Will change it. With the introduction of domain specific assignment,
I can change it to something like this below.
------------------
"Introduce rdtgroup_mbm_assign_control_write to assign mbm events.

By default, the assignment is applied on all the domains when a new group
is created if the hardware counter is available at the time. This
interface provides the option to modify the assignment specific to each
domain."
------------------

>
>
>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>> index 2d96565501ab..64ec70637c66 100644
>> --- a/Documentation/arch/x86/resctrl.rst
>> +++ b/Documentation/arch/x86/resctrl.rst
>> @@ -328,6 +328,77 @@ with the following files:
>> None of events are assigned on this mon group. This is a child
>> monitor group of the non default control mon group.
>>
>> + Assignment state can be updated by writing to this interface.
>> +
>> + NOTE: Assignment on one domain applied on all the domains. User can
>> + pass one valid domain and assignment will be updated on all the
>> + available domains.
>
> How would different assignments to different domains work? If the
> allocations are global, then the allocated monitor ID is available to
> all domains whether they use it or not.

That is correct.
[A] Hardware counters (max 2 per group) are allocated at the group level.
So, those counters are available to all the domains in that group. I will
maintain a bitmap at the domain level. The bitmap will be set on the
domains where the assignment is applied and IPIs are sent. IPIs will not be
sent to other domains.
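
The bookkeeping described in [A] could look roughly like the userspace sketch below. It is an assumption-laden illustration, not the series' data structures: struct grp_assign, grp_assign_dom(), MAX_DOMAINS and the next_free allocator stand-in are all hypothetical.

```c
#include <stdbool.h>

#define MAX_DOMAINS	64	/* hypothetical limit for this sketch */

enum { EVT_TOTAL, EVT_LOCAL, EVT_COUNT };

/* Hypothetical per-group bookkeeping. */
struct grp_assign {
	int cntr_id[EVT_COUNT];			/* group-wide counter, -1 if none */
	unsigned long long dom_map[EVT_COUNT];	/* bit N set: active in domain N */
};

static void grp_init(struct grp_assign *g)
{
	for (int e = 0; e < EVT_COUNT; e++) {
		g->cntr_id[e] = -1;
		g->dom_map[e] = 0;
	}
}

/*
 * Mark the event as assigned in one domain. The counter itself is allocated
 * once per group (next_free stands in for a real allocator). Returns true
 * when the domain was not yet programmed, i.e. an IPI would be needed.
 */
static bool grp_assign_dom(struct grp_assign *g, int evt, int dom, int *next_free)
{
	if (dom < 0 || dom >= MAX_DOMAINS)
		return false;

	if (g->cntr_id[evt] < 0)
		g->cntr_id[evt] = (*next_free)++;

	if (g->dom_map[evt] & (1ULL << dom))
		return false;

	g->dom_map[evt] |= 1ULL << dom;
	return true;
}
```

Assigning the same event in a second domain reuses the group's counter and only sets another bit, which matches the point that a counter is consumed per group rather than per domain.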

>
>
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 9fd37b6c3b24..7f8b1386287a 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -1011,6 +1035,215 @@ static int rdtgroup_mbm_assign_control_show(struct kernfs_open_file *of,
>> return 0;
>> }
>>
>> +static struct rdtgroup *resctrl_get_rdtgroup(enum rdt_group_type rtype, char *p_grp, char *c_grp)
>> +{
>> + struct rdtgroup *rdtg, *crg;
>> +
>> + if (rtype == RDTCTRL_GROUP && *p_grp == '\0') {
>> + return &rdtgroup_default;
>> + } else if (rtype == RDTCTRL_GROUP) {
>> + list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list)
>> + if (!strcmp(p_grp, rdtg->kn->name))
>> + return rdtg;
>> + } else if (rtype == RDTMON_GROUP) {
>> + list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
>> + if (!strcmp(p_grp, rdtg->kn->name)) {
>> + list_for_each_entry(crg, &rdtg->mon.crdtgrp_list,
>> + mon.crdtgrp_list) {
>> + if (!strcmp(c_grp, crg->kn->name))
>> + return crg;
>> + }
>> + }
>> + }
>> + }
>> +
>> + return NULL;
>> +}
>> +
>> +static int resctrl_process_flags(enum rdt_group_type rtype, char *p_grp, char *c_grp, char *tok)
>> +{
>> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>> + int op, mon_state, assign_state, unassign_state;
>> + char *dom_str, *id_str, *op_str;
>> + struct rdtgroup *rdt_grp;
>> + struct rdt_domain *d;
>> + unsigned long dom_id;
>> + int ret, found = 0;
>> +
>> + rdt_grp = resctrl_get_rdtgroup(rtype, p_grp, c_grp);
>> +
>> + if (!rdt_grp) {
>> + rdt_last_cmd_puts("Not a valid resctrl group\n");
>> + return -EINVAL;
>> + }
>> +
>> +next:
>> + if (!tok || tok[0] == '\0')
>> + return 0;
>> +
>> + /* Start processing the strings for each domain */
>> + dom_str = strim(strsep(&tok, ";"));
>> +
>> + op_str = strpbrk(dom_str, "=+-_");
>> +
>> + if (op_str) {
>> + op = *op_str;
>> + } else {
>> + rdt_last_cmd_puts("Missing operation =, +, -, _ character\n");
>> + return -EINVAL;
>> + }
>> +
>> + id_str = strsep(&dom_str, "=+-_");
>> +
>> + if (!id_str || kstrtoul(id_str, 10, &dom_id)) {
>> + rdt_last_cmd_puts("Missing domain id\n");
>> + return -EINVAL;
>> + }
>> +
>> + /* Verify if the dom_id is valid */
>> + list_for_each_entry(d, &r->domains, list) {
>> + if (d->id == dom_id) {
>> + found = 1;
>> + break;
>> + }
>> + }
>> + if (!found) {
>> + rdt_last_cmd_printf("Invalid domain id %ld\n", dom_id);
>> + return -EINVAL;
>> + }
>> +
>> + if (op != '_')
>> + mon_state = str_to_mon_state(dom_str);
>> +
>> + assign_state = 0;
>> + unassign_state = 0;
>> +
>> + switch (op) {
>> + case '+':
>> + assign_state = mon_state;
>> + break;
>> + case '-':
>> + unassign_state = mon_state;
>> + break;
>> + case '=':
>> + assign_state = mon_state;
>> + unassign_state = (ASSIGN_TOTAL | ASSIGN_LOCAL) & ~assign_state;
>> + break;
>> + case '_':
>> + unassign_state = ASSIGN_TOTAL | ASSIGN_LOCAL;
>> + break;
>> + default:
>> + break;
>> + }
>> +
>> + if (assign_state & ASSIGN_TOTAL)
>> + ret = rdtgroup_assign_abmc(rdt_grp, QOS_L3_MBM_TOTAL_EVENT_ID,
>> + ASSIGN_TOTAL);
>
> Related to my comments yesterday[1], it seems redundant for an
> interface to need two names for the same event.

Yes. I will remove one of these parameters.

>
>
>> + if (ret)
>> + goto out_fail;
>> +
>> + if (assign_state & ASSIGN_LOCAL)
>> + ret = rdtgroup_assign_abmc(rdt_grp, QOS_L3_MBM_LOCAL_EVENT_ID,
>> + ASSIGN_LOCAL);
>> +
>> + if (ret)
>> + goto out_fail;
>> +
>> + if (unassign_state & ASSIGN_TOTAL)
>> + ret = rdtgroup_unassign_abmc(rdt_grp, QOS_L3_MBM_TOTAL_EVENT_ID,
>> + ASSIGN_TOTAL);
>> + if (ret)
>> + goto out_fail;
>> +
>> + if (unassign_state & ASSIGN_LOCAL)
>> + ret = rdtgroup_unassign_abmc(rdt_grp, QOS_L3_MBM_LOCAL_EVENT_ID,
>> + ASSIGN_LOCAL);
>> + if (ret)
>> + goto out_fail;
>> +
>> + goto next;
>
> I saw that each call to rdtgroup_assign_abmc() allocates a counter.
> Does that mean assigning to multiple domains (in the same or multiple
> commands) allocates a new counter (or pair of counters) in every
> domain?

No. Counter allocation is at the group level, which is global. I will
maintain a bitmap in the domain to determine whether the counter is assigned
or unassigned in that specific domain. Please see the comment above [A].

>
> Thanks!
> -Peter
>
> [1] https://lore.kernel.org/lkml/CALPaoCj_yb_muT78jFQ5gL0wkifohSAVwxMDTm2FX_2YVpANdw@mail.gmail.com/

--
Thanks
Babu Moger

2024-04-17 21:03:25

by Peter Newman

Subject: Re: [RFC PATCH v3 17/17] x86/resctrl: Introduce interface to modify assignment states of the groups

Hi Babu,

On Wed, Apr 17, 2024 at 12:39 PM Moger, Babu <[email protected]> wrote:
> On 4/17/24 12:45, Peter Newman wrote:
> > On Thu, Mar 28, 2024 at 6:10 PM Babu Moger <[email protected]> wrote:
> >> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> >> index 2d96565501ab..64ec70637c66 100644
> >> --- a/Documentation/arch/x86/resctrl.rst
> >> +++ b/Documentation/arch/x86/resctrl.rst
> >> @@ -328,6 +328,77 @@ with the following files:
> >> None of events are assigned on this mon group. This is a child
> >> monitor group of the non default control mon group.
> >>
> >> + Assignment state can be updated by writing to this interface.
> >> +
> >> + NOTE: Assignment on one domain applied on all the domains. User can
> >> + pass one valid domain and assignment will be updated on all the
> >> + available domains.
> >
> > How would different assignments to different domains work? If the
> > allocations are global, then the allocated monitor ID is available to
> > all domains whether they use it or not.
>
> That is correct.
> [A] Hardware counters(max 2 per group) are allocated at the group level.
> So, those counters are available to all the domains on that group. I will
> maintain a bitmap at the domain level. The bitmap will be set on the
> domains where assignment is applied and IPIs are sent. IPIs will not be
> sent to other domains.

Unless the monitor allocation is scoped at the domain level, I don't
see much point in implementing the per-domain parsing today, as the
only benefit is avoiding IPIs to domains whose counters you don't plan
to read.

-Peter

2024-04-17 22:52:39

by Moger, Babu

Subject: Re: [RFC PATCH v3 17/17] x86/resctrl: Introduce interface to modify assignment states of the groups

Hi Peter,

On 4/17/2024 3:56 PM, Peter Newman wrote:
> Hi Babu,
>
> On Wed, Apr 17, 2024 at 12:39 PM Moger, Babu <[email protected]> wrote:
>> On 4/17/24 12:45, Peter Newman wrote:
>>> On Thu, Mar 28, 2024 at 6:10 PM Babu Moger <[email protected]> wrote:
>>>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>>>> index 2d96565501ab..64ec70637c66 100644
>>>> --- a/Documentation/arch/x86/resctrl.rst
>>>> +++ b/Documentation/arch/x86/resctrl.rst
>>>> @@ -328,6 +328,77 @@ with the following files:
>>>> None of events are assigned on this mon group. This is a child
>>>> monitor group of the non default control mon group.
>>>>
>>>> + Assignment state can be updated by writing to this interface.
>>>> +
>>>> + NOTE: Assignment on one domain applied on all the domains. User can
>>>> + pass one valid domain and assignment will be updated on all the
>>>> + available domains.
>>> How would different assignments to different domains work? If the
>>> allocations are global, then the allocated monitor ID is available to
>>> all domains whether they use it or not.
>> That is correct.
>> [A] Hardware counters(max 2 per group) are allocated at the group level.
>> So, those counters are available to all the domains on that group. I will
>> maintain a bitmap at the domain level. The bitmap will be set on the
>> domains where assignment is applied and IPIs are sent. IPIs will not be
>> sent to other domains.
> Unless the monitor allocation is scoped at the domain level, I don't
> see much point in implementing the per-domain parsing today, as the
> only benefit is avoiding IPIs to domains whose counters you don't plan
> to read.

In that case, let's remove the domain-specific assignments. We can avoid
some code complexity.

thanks

Babu


2024-04-22 16:33:53

by Dave Martin

Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On Thu, Mar 28, 2024 at 08:06:33PM -0500, Babu Moger wrote:
>
> This series adds the support for Assignable Bandwidth Monitoring Counters
> (ABMC). It is also called QoS RMID Pinning feature
>
> The feature details are documented in the APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> Monitoring (ABMC). The documentation is available at
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>
> The patches are based on top of commit
> cd80c2c94699913f9334414189487ff3f93cf0b5 (tip/master)

A few very general comments from me here, since I'm not very familiar
with this topic...


> # Introduction
>
> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
> feature only guarantees that RMIDs currently assigned to a processor will
> be tracked by hardware. The counters of any other RMIDs which are no longer
> being tracked will be reset to zero. The MBM event counters return
> "Unavailable" for the RMIDs that are not active.
>
> Users can create 256 or more monitor groups. But there can be only limited
> number of groups that can give guaranteed monitoring numbers. With ever
> changing configurations there is no way to definitely know which of these
> groups will be active for certain point of time. Users do not have the
> option to monitor a group or set of groups for certain period of time
> without worrying about RMID being reset in between.
>
> The ABMC feature provides an option to the user to assign an RMID to the
> hardware counter and monitor the bandwidth for a longer duration.
> The assigned RMID will be active until the user unassigns it manually.
> There is no need to worry about counters being reset during this period.
> Additionally, the user can specify a bitmask identifying the specific
> bandwidth types from the given source to track with the counter.
>
> Without ABMC enabled, monitoring will work in current mode without
> assignment option.
>
> # Linux Implementation
>
> Linux resctrl subsystem provides the interface to count maximum of two
> memory bandwidth events per group, from a combination of available total
> and local events. Keeping the current interface, users can assign a maximum
> of 2 ABMC counters per group. User will also have the option to assign only
> one counter to the group. If the system runs out of assignable ABMC
> counters, kernel will display an error. Users need to unassign an already
> assigned counter to make space for new assignments.
>
>
> # Examples
>
> a. Check if ABMC support is available
> #mount -t resctrl resctrl /sys/fs/resctrl/
>
> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign
> [abmc]
> legacy_mbm
>
> Linux kernel detected ABMC feature and it is enabled.
>
> b. Check how many ABMC counters are available.
>
> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_cntrs
> 32
>
> c. Create few resctrl groups.
>
> # mkdir /sys/fs/resctrl/mon_groups/default_mon1
> # mkdir /sys/fs/resctrl/non_defult_group
> # mkdir /sys/fs/resctrl/non_defult_group/mon_groups/non_default_mon1
>
> d. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> to list and modify the group's assignment states.
>
> The list follows the following format:

This section looks possibly inconsistent with (e.)

Is (d.) the userspace read format, with (e.) being the format written by
userspace?


> * Default CTRL_MON group:
> "//<domain_id>=<assignment_flags>"
>
> * Non-default CTRL_MON group:
> "<CTRL_MON group>//<domain_id>=<assignment_flags>"
>
> * Child MON group of default CTRL_MON group:
> "/<MON group>/<domain_id>=<assignment_flags>"
>
> * Child MON group of non-default CTRL_MON group:
> "<CTRL_MON group>/<MON group>/<domain_id>=<assignment_flags>"
>
> Assignment flags can be one of the following:
>
> t MBM total event is assigned

With my MPAM hat on this looks a bit weird, although I suppose it
follows on from the way "mbm_total_bytes" and "mbm_local_bytes" are
already exposed in resctrlfs.

From an abstract point of view, "total" and "local" are just event
selection criteria, additional to those in mbm_cfg_mask. The different
way they are treated in the hardware feels like an x86 implementation
detail.

For MPAM we don't currently distinguish local from non-local traffic, so
I guess this just reduces to a simple on-off (i.e., "t" or nothing),
which I guess is tolerable.

This might want more thought if there is an expectation that more
categories will be added here, though (?)

> l MBM local event is assigned
> tl Both total and local MBM events are assigned
> _ None of the MBM events are assigned

This use of '_' seems unusual. Can we not just have the empty string
for "nothing assigned"?

Since every assignment is terminated by ';' or end-of-line, I don't
think that there would be any parsing ambiguity (?)
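
To illustrate the point about ambiguity, here is a minimal userspace sketch of a flag parser in which the empty string simply means "nothing assigned". flags_to_mask() is a hypothetical name, and the ASSIGN_* values are assumed to be plain single-bit masks.

```c
#include <errno.h>

#define ASSIGN_TOTAL	(1 << 0)
#define ASSIGN_LOCAL	(1 << 1)

/*
 * Convert a flags string ("t", "l", "tl", "lt" or "") into a bitmask.
 * Parsing stops at ';', end-of-line or end-of-string, so an empty flag
 * field yields 0 without needing a '_' placeholder.
 * Returns a non-negative mask, or -EINVAL on an unknown flag character.
 */
static int flags_to_mask(const char *s)
{
	int mask = 0;

	for (; *s && *s != ';' && *s != '\n'; s++) {
		switch (*s) {
		case 't':
			mask |= ASSIGN_TOTAL;
			break;
		case 'l':
			mask |= ASSIGN_LOCAL;
			break;
		default:
			return -EINVAL;
		}
	}

	return mask;
}
```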

>
> Examples:
>
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> non_defult_group//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> non_defult_group/non_default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> //0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> /default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
>
> There are four groups and all the groups have local and total event assigned.
>
> "//" - This is a default CONTROL MON group
>
> "non_defult_group//" - This is non default CONTROL MON group
>
> "/default_mon1/" - This is Child MON group of the defult group
>
> "non_defult_group/non_default_mon1/" - This is child MON group of the non default group
>
> =tl means both total and local events are assigned.
>
> e. Update the group assignment states using the interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control.
>
> The write format is similar to the above list format with addition of
> op-code for the assignment operation.

With my resctrl newbie hat on:

It feels a bit complex (for the kernel) to have userspace needing to
write a script into a magic file that we need to parse, specifying
updates to a bunch of controls already visible as objects in resctrlfs
in their own right.

What's the expected use case here?

If userspace really does need to switch lots of events simultaneously
then I guess the overhead of enumerating and poking lots of individual
files might be unacceptable though, and we would still need some global
interfaces for operations such as "unassign everything"...


OTOH, the proposed approach is not so different from the way the
schemata files already work.

>
> * Default CTRL_MON group:
> "//<domain_id><op-code><assignment_flags>"
>
> * Non-default CTRL_MON group:
> "<CTRL_MON group>//<domain_id><op-code><assignment_flags>"
>
> * Child MON group of default CTRL_MON group:
> "/<MON group>/<domain_id><op-code><assignment_flags>"
>
> * Child MON group of non-default CTRL_MON group:
> "<CTRL_MON group>/<MON group>/<domain_id><op-code><assignment_flags>"
>
> Op-code can be one of the following:
>
> = Update the assignment to match the flags
> + Assign a new state
> - Unassign a new state
> _ Unassign all the states

If we adopt "empty string" to mean "no events", then

<foo>/<bar>/<domain>=

would unassign all events, so '_' would not be needed as a separate
syntax.
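
To illustrate the "empty string means no events" idea, here is a minimal
userspace sketch in Python. It is not the kernel parser: the helper names
(parse_assignment, parse_line) and the flag table are hypothetical, and it
only assumes the "<domain_id>=<flags>" entries terminated by ';' that are
described above.

```python
# Hypothetical flag table: only "t" and "l" are described in this thread.
VALID_FLAGS = {"t": "total", "l": "local"}


def parse_assignment(entry):
    """Parse one "<domain_id>=<flags>" entry into (domain, set of events).

    An empty flags string is accepted and means "no events assigned".
    """
    domain, sep, flags = entry.partition("=")
    if not sep or not domain.isdecimal():
        raise ValueError(f"malformed entry: {entry!r}")
    events = set()
    for ch in flags:
        if ch not in VALID_FLAGS:
            raise ValueError(f"unknown flag {ch!r} in {entry!r}")
        events.add(VALID_FLAGS[ch])
    return int(domain), events


def parse_line(line):
    """Split on ';' and skip the empty piece left by a trailing ';'."""
    return dict(parse_assignment(e) for e in line.split(";") if e)
```

Because every entry still contains '=', an entry such as "1=" stays
unambiguous: it parses to domain 1 with an empty event set.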

[...]

Cheers
---Dave

2024-04-22 16:37:46

by Dave Martin

Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On Thu, Apr 04, 2024 at 03:02:45PM -0500, Moger, Babu wrote:
> Hi Peter,
>
>
> On 4/4/24 14:08, Peter Newman wrote:
> > Hi Babu,
> >
> > On Thu, Mar 28, 2024 at 6:07 PM Babu Moger <[email protected]> wrote:
> >> The list follows the following format:
> >>
> >> * Default CTRL_MON group:
> >> "//<domain_id>=<assignment_flags>"
> >>
> >> * Non-default CTRL_MON group:
> >> "<CTRL_MON group>//<domain_id>=<assignment_flags>"
> >>
> >> * Child MON group of default CTRL_MON group:
> >> "/<MON group>/<domain_id>=<assignment_flags>"
> >>
> >> * Child MON group of non-default CTRL_MON group:
> >> "<CTRL_MON group>/<MON group>/<domain_id>=<assignment_flags>"
> >>
> >> Assignment flags can be one of the following:
> >>
> >> t MBM total event is assigned
> >> l MBM local event is assigned
> >> tl Both total and local MBM events are assigned
> >> _ None of the MBM events are assigned
> >>
> >> Examples:
> >>
> >> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >> non_defult_group//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> >> non_defult_group/non_default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> >> //0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> >> /default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> >>
> >> There are four groups and all the groups have local and total event assigned.
> >>
> >> "//" - This is a default CONTROL MON group
> >>
> >> "non_defult_group//" - This is non default CONTROL MON group
> >>
> >> "/default_mon1/" - This is Child MON group of the defult group
> >>
> >> "non_defult_group/non_default_mon1/" - This is child MON group of the non default group
> >>
> >> =tl means both total and local events are assigned.
> >
> > I recall there was supposed to be a way to perform the same update on
> > all domains together so that it isn't tedious to not do per-domain
>
> Yes. Correct. Reinette suggested to have "no domains" means ALL the domains.

Would "*" be more intuitive?

Whatever is done here to describe the "wildcard node", would it be worth
having the node field parse the same way in the "schemata" files?

Is there any merit in having range match expressions, e.g. something like

0-3,8-11=foo;4-7,12-*=bar

(The latter is obvious feature creep though, so a real use case for this
would be needed to justify it. I don't have one right now...)
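
For what it's worth, a rough Python sketch of how such range expressions
could be expanded. This is purely illustrative: expand_domains and
parse_ranged are made-up names, the set of valid domain ids is passed in
by the caller, and "N-*" is assumed to mean "from N up to the highest
known domain".

```python
def expand_domains(spec, all_domains):
    """Expand a spec such as "0-3,8-11" or "12-*" into sorted domain ids."""
    known = sorted(all_domains)
    selected = set()
    for part in spec.split(","):
        if part == "*":
            # A bare "*" selects every known domain.
            selected.update(known)
            continue
        lo, dash, hi = part.partition("-")
        start = int(lo)
        if not dash:
            end = start                      # single id, e.g. "5"
        elif hi == "*":
            end = known[-1] if known else start  # open-ended range
        else:
            end = int(hi)
        selected.update(d for d in known if start <= d <= end)
    return sorted(selected)


def parse_ranged(line, all_domains):
    """Map each selected domain to its value; later entries override."""
    result = {}
    for entry in line.split(";"):
        if not entry:
            continue
        spec, _, value = entry.partition("=")
        for d in expand_domains(spec, all_domains):
            result[d] = value
    return result
```

With 16 domains (0 to 15), the example line above would give "foo" to
domains 0-3 and 8-11 and "bar" to domains 4-7 and 12-15.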

[...]

Cheers
---Dave

2024-04-22 18:24:12

by Peter Newman

Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Dave,

On Mon, Apr 22, 2024 at 9:33 AM Dave Martin <[email protected]> wrote:
>
> Hi Babu,
>
> On Thu, Mar 28, 2024 at 08:06:33PM -0500, Babu Moger wrote:
> > Assignment flags can be one of the following:
> >
> > t MBM total event is assigned
>
> With my MPAM hat on this looks a bit weird, although I suppose it
> follows on from the way "mbm_total_bytes" and "mbm_local_bytes" are
> already exposed in resctrlfs.
>
> From an abstract point of view, "total" and "local" are just event
> selection criteria, additional to those in mbm_cfg_mask. The different
> way they are treated in the hardware feels like an x86 implementation
> detail.
>
> For MPAM we don't currently distinguish local from non-local traffic, so
> I guess this just reduces to a simple on-off (i.e., "t" or nothing),
> which I guess is tolerable.
>
> This might want more thought if there is an expectation that more
> categories will be added here, though (?)

There should be a path forward whenever we start supporting
user-configured counter classes. I assume the letters a-z will be
enough to cover all the counter classes which could be used at once.

>
> > l MBM local event is assigned
> > tl Both total and local MBM events are assigned
> > _ None of the MBM events are assigned
>
> This use of '_' seems unusual. Can we not just have the empty string
> for "nothing assigned"?
>
> Since every assignment is terminated by ';' or end-of-line, I don't
> think that there would be any parsing ambiguity (?)
>
> >
> > Examples:
> >
> > # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> > non_defult_group//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> > non_defult_group/non_default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> > //0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> > /default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> >
> > There are four groups and all the groups have local and total event assigned.
> >
> > "//" - This is a default CONTROL MON group
> >
> > "non_defult_group//" - This is non default CONTROL MON group
> >
> > "/default_mon1/" - This is Child MON group of the defult group
> >
> > "non_defult_group/non_default_mon1/" - This is child MON group of the non default group
> >
> > =tl means both total and local events are assigned.
> >
> > e. Update the group assignment states using the interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control.
> >
> > The write format is similar to the above list format with addition of
> > op-code for the assignment operation.
>
> With my resctrl newbie hat on:
>
> It feels a bit complex (for the kernel) to have userspace needing to
> write a script into a magic file that we need to parse, specifying
> updates to a bunch of controls already visible as objects in resctrlfs
> in their own right.
>
> What's the expected use case here?

I went over the use case of iterating a small number of monitors over
a much larger number of monitoring groups here:

https://lore.kernel.org/lkml/CALPaoCi=PCWr6U5zYtFPmyaFHU_iqZtZL-LaHC2mYxbETXk3ig@mail.gmail.com/

>
> If userspace really does need to switch lots of events simultaneously
> then I guess the overhead of enumerating and poking lots of individual
> files might be unacceptable though, and we would still need some global
> interfaces for operations such as "unassign everything"...

My main goal is for the number of parallel IPI batches to all the
domains (or write syscalls) to be O(num_rmids / num_monitors) rather
than O(num_rmids * num_monitors) as I need to know how frequently we
can afford to sample the current memory bandwidth of the maximum
number of monitoring groups supported.
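
As a rough illustration of that goal (a sketch only, with a hypothetical
helper name, and assuming one batched write can assign up to num_monitors
groups at once), rotating the monitors over all groups takes about
num_groups / num_monitors rounds:

```python
def monitor_rounds(groups, num_monitors):
    """Split the monitoring groups into rounds of at most num_monitors.

    Each round stands for one batched assignment update, so the number
    of rounds is ceil(len(groups) / num_monitors).
    """
    if num_monitors <= 0:
        raise ValueError("num_monitors must be positive")
    return [groups[i:i + num_monitors]
            for i in range(0, len(groups), num_monitors)]
```

For example, 256 groups with 32 assignable monitors would need 8 rounds
to sample every group once.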

Thanks!
-Peter

2024-04-22 20:44:41

by Moger, Babu

Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Dave,

On 4/22/24 11:34, Dave Martin wrote:
> Hi Babu,
>
> On Thu, Apr 04, 2024 at 03:02:45PM -0500, Moger, Babu wrote:
>> Hi Peter,
>>
>>
>> On 4/4/24 14:08, Peter Newman wrote:
>>> Hi Babu,
>>>
>>> On Thu, Mar 28, 2024 at 6:07 PM Babu Moger <[email protected]> wrote:
>>>> The list follows the following format:
>>>>
>>>> * Default CTRL_MON group:
>>>> "//<domain_id>=<assignment_flags>"
>>>>
>>>> * Non-default CTRL_MON group:
>>>> "<CTRL_MON group>//<domain_id>=<assignment_flags>"
>>>>
>>>> * Child MON group of default CTRL_MON group:
>>>> "/<MON group>/<domain_id>=<assignment_flags>"
>>>>
>>>> * Child MON group of non-default CTRL_MON group:
>>>> "<CTRL_MON group>/<MON group>/<domain_id>=<assignment_flags>"
>>>>
>>>> Assignment flags can be one of the following:
>>>>
>>>> t MBM total event is assigned
>>>> l MBM local event is assigned
>>>> tl Both total and local MBM events are assigned
>>>> _ None of the MBM events are assigned
>>>>
>>>> Examples:
>>>>
>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>> non_defult_group//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
>>>> non_defult_group/non_default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
>>>> //0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
>>>> /default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
>>>>
>>>> There are four groups and all the groups have local and total event assigned.
>>>>
>>>> "//" - This is a default CONTROL MON group
>>>>
>>>> "non_defult_group//" - This is non default CONTROL MON group
>>>>
>>>> "/default_mon1/" - This is Child MON group of the defult group
>>>>
>>>> "non_defult_group/non_default_mon1/" - This is child MON group of the non default group
>>>>
>>>> =tl means both total and local events are assigned.
>>>
>>> I recall there was supposed to be a way to perform the same update on
>>> all domains together so that it isn't tedious to not do per-domain
>>
>> Yes. Correct. Reinette suggested to have "no domains" means ALL the domains.
>
> Would "*" be more intuitive?

We could, but I don't see the need for a wildcard ("*") or ranges, or
for the complexity that comes with them.

Even in schemata processing we don't use wildcards or ranges, and the
documentation does not mention them either.
https://www.kernel.org/doc/Documentation/x86/resctrl.rst

Domains (or nodes) are processed one by one. Some examples:

# cat schemata
SMBA:0=2048;1=2048;2=2048;3=2048
MB:0=2048;1=2048;2=2048;3=2048
L3:0=ffff;1=ffff;2=ffff;3=ffff

# echo "SMBA:1=64" > schemata
# cat schemata
SMBA:0=2048;1= 64;2=2048;3=2048
MB:0=2048;1=2048;2=2048;3=2048
L3:0=ffff;1=ffff;2=ffff;3=ffff



>
> Whatever is done here to describe the "wildcard node", would it be worth
> having the node field parse the same way in the "schemata" files?
>
> Is there any merit in having range match expressions, e.g. something like
>
> 0-3,8-11=foo;4-7,12-*=bar
>
> (The latter is obvious feature creep though, so a real use case for this
> would be needed to justify it. I don't have one right now...)
>
> [...]
>
> Cheers
> ---Dave

--
Thanks
Babu Moger

2024-04-23 12:39:52

by Dave Martin

Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

On Mon, Apr 22, 2024 at 03:44:26PM -0500, Moger, Babu wrote:
> Hi Dave,
>
> On 4/22/24 11:34, Dave Martin wrote:
> > Hi Babu,
> >
> > On Thu, Apr 04, 2024 at 03:02:45PM -0500, Moger, Babu wrote:
> >> Hi Peter,
> >>
> >>
> >> On 4/4/24 14:08, Peter Newman wrote:
> >>> Hi Babu,
> >>>
> >>> On Thu, Mar 28, 2024 at 6:07 PM Babu Moger <[email protected]> wrote:
> >>>> The list follows the following format:
> >>>>
> >>>> * Default CTRL_MON group:
> >>>> "//<domain_id>=<assignment_flags>"
> >>>>
> >>>> * Non-default CTRL_MON group:
> >>>> "<CTRL_MON group>//<domain_id>=<assignment_flags>"
> >>>>
> >>>> * Child MON group of default CTRL_MON group:
> >>>> "/<MON group>/<domain_id>=<assignment_flags>"
> >>>>
> >>>> * Child MON group of non-default CTRL_MON group:
> >>>> "<CTRL_MON group>/<MON group>/<domain_id>=<assignment_flags>"
> >>>>
> >>>> Assignment flags can be one of the following:
> >>>>
> >>>> t MBM total event is assigned
> >>>> l MBM local event is assigned
> >>>> tl Both total and local MBM events are assigned
> >>>> _ None of the MBM events are assigned
> >>>>
> >>>> Examples:
> >>>>
> >>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>> non_defult_group//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> >>>> non_defult_group/non_default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> >>>> //0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> >>>> /default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> >>>>
> >>>> There are four groups and all the groups have local and total event assigned.
> >>>>
> >>>> "//" - This is a default CONTROL MON group
> >>>>
> >>>> "non_defult_group//" - This is non default CONTROL MON group
> >>>>
> >>>> "/default_mon1/" - This is Child MON group of the defult group
> >>>>
> >>>> "non_defult_group/non_default_mon1/" - This is child MON group of the non default group
> >>>>
> >>>> =tl means both total and local events are assigned.
> >>>
> >>> I recall there was supposed to be a way to perform the same update on
> >>> all domains together so that it isn't tedious to not do per-domain
> >>
> >> Yes. Correct. Reinette suggested to have "no domains" means ALL the domains.
> >
> > Would "*" be more intuitive?
>
> We could. But I don't see the need for wildcard ("*") or ranges and
> complexity that comes with that.

For "*", I mean that this would just stand for "all cpus", not a generic
string match; apologies if I didn't make that clear.

I think that an explicit "*" is still a less surprising way to say
"everything" than "" (which if it means anything at all, usually means
"nothing").

I may have misunderstood the intention here: _if_ the intention is to
provide a way to enable/disable an event in all domains without having
to enumerate them all one by one, then I think "*" is preferable syntax
to "". That was my only real suggestion here.

>
> Even in schemata processing we don't use the wildcard or ranges and also
> there is no mention of that in documentation.
> https://www.kernel.org/doc/Documentation/x86/resctrl.rst

I know, though writing the schemata files can be tedious and annoying,
since their content is often very repetitive, so ...

>
> Domains(or nodes) are processed one by one. Some examples.
>
> # cat schemata
> SMBA:0=2048;1=2048;2=2048;3=2048
> MB:0=2048;1=2048;2=2048;3=2048
> L3:0=ffff;1=ffff;2=ffff;3=ffff
>
> # echo "SMBA:1=64" > schemata
> # cat schemata
> SMBA:0=2048;1= 64;2=2048;3=2048
> MB:0=2048;1=2048;2=2048;3=2048
> L3:0=ffff;1=ffff;2=ffff;3=ffff

.. it would be convenient to be able to do something like

# echo "SMBA:*=64" >schemata
# grep SMBA: schemata
SMBA:0= 64;1= 64;2= 64;3= 64

Anyway, this is nothing directly to do with this series; just a
thought.


> > Whatever is done here to describe the "wildcard node", would it be worth
> > having the node field parse the same way in the "schemata" files?
> >
> > Is there any merit in having range match expressions, e.g. something like
> >
> > 0-3,8-11=foo;4-7,12-*=bar
> >
> > (The latter is obvious feature creep though, so a real use case for this
> > would be needed to justify it. I don't have one right now...)

[...]

> Thanks
> Babu Moger

I do agree that unless someone jumps up and down saying this would
help their use case, this is probably a step too far.

Just thinking aloud (and this kind of feature could be added later in a
backwards compatible way if someone really needs it).

Cheers
---Dave

2024-04-23 12:40:32

by Dave Martin

Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Peter,

On Mon, Apr 22, 2024 at 11:23:50AM -0700, Peter Newman wrote:
> Hi Dave,
>
> On Mon, Apr 22, 2024 at 9:33 AM Dave Martin <[email protected]> wrote:
> >
> > Hi Babu,
> >
> > On Thu, Mar 28, 2024 at 08:06:33PM -0500, Babu Moger wrote:
> > > Assignment flags can be one of the following:
> > >
> > > t MBM total event is assigned
> >
> > With my MPAM hat on this looks a bit weird, although I suppose it
> > follows on from the way "mbm_total_bytes" and "mbm_local_bytes" are
> > already exposed in resctrlfs.
> >
> > From an abstract point of view, "total" and "local" are just event
> > selection criteria, additional to those in mbm_cfg_mask. The different
> > way they are treated in the hardware feels like an x86 implementation
> > detail.
> >
> > For MPAM we don't currently distinguish local from non-local traffic, so
> > I guess this just reduces to a simple on-off (i.e., "t" or nothing),
> > which I guess is tolerable.
> >
> > This might want more thought if there is an expectation that more
> > categories will be added here, though (?)
>
> There should be a path forward whenever we start supporting
> user-configured counter classes. I assume the letters a-z will be
> enough to cover all the counter classes which could be used at once.

Ack, though I'd appreciate a response on the point about "_" below in
case people missed it.

>
> >
> > > l MBM local event is assigned
> > > tl Both total and local MBM events are assigned
> > > _ None of the MBM events are assigned
> >
> > This use of '_' seems unusual. Can we not just have the empty string
> > for "nothing assigned"?
> >
> > Since every assignment is terminated by ';' or end-of-line, I don't
> > think that there would be any parsing ambiguity (?)
> >
> > >
> > > Examples:
> > >
> > > # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> > > non_defult_group//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> > > non_defult_group/non_default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> > > //0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> > > /default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> > >
> > > There are four groups and all the groups have local and total event assigned.
> > >
> > > "//" - This is a default CONTROL MON group
> > >
> > > "non_defult_group//" - This is non default CONTROL MON group
> > >
> > > "/default_mon1/" - This is Child MON group of the defult group
> > >
> > > "non_defult_group/non_default_mon1/" - This is child MON group of the non default group
> > >
> > > =tl means both total and local events are assigned.
> > >
> > > e. Update the group assignment states using the interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control.
> > >
> > > The write format is similar to the above list format with addition of
> > > op-code for the assignment operation.
> >
> > With my resctrl newbie hat on:
> >
> > It feels a bit complex (for the kernel) to have userspace needing to
> > write a script into a magic file that we need to parse, specifying
> > updates to a bunch of controls already visible as objects in resctrlfs
> > in their own right.
> >
> > What's the expected use case here?
>
> I went over the use case of iterating a small number of monitors over
> a much larger number of monitoring groups here:
>
> https://lore.kernel.org/lkml/CALPaoCi=PCWr6U5zYtFPmyaFHU_iqZtZL-LaHC2mYxbETXk3ig@mail.gmail.com/
>
> >
> > If userspace really does need to switch lots of events simultaneously
> > then I guess the overhead of enumerating and poking lots of individual
> > files might be unacceptable though, and we would still need some global
> > interfaces for operations such as "unassign everything"...
>
> My main goal is for the number of parallel IPI batches to all the
> domains (or write syscalls) to be O(num_rmids / num_monitors) rather
> than O(num_rmids * num_monitors) as I need to know how frequently we
> can afford to sample the current memory bandwidth of the maximum
> number of monitoring groups supported.

Fair enough; I wasn't fully aware of the background discussions.

Cheers
---Dave

2024-04-23 15:43:40

by Moger, Babu

Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Dave,

On 4/23/24 07:38, Dave Martin wrote:
> Hi Peter,
>
> On Mon, Apr 22, 2024 at 11:23:50AM -0700, Peter Newman wrote:
>> Hi Dave,
>>
>> On Mon, Apr 22, 2024 at 9:33 AM Dave Martin <[email protected]> wrote:
>>>
>>> Hi Babu,
>>>
>>> On Thu, Mar 28, 2024 at 08:06:33PM -0500, Babu Moger wrote:
>>>> Assignment flags can be one of the following:
>>>>
>>>> t MBM total event is assigned
>>>
>>> With my MPAM hat on this looks a bit weird, although I suppose it
>>> follows on from the way "mbm_total_bytes" and "mbm_local_bytes" are
>>> already exposed in resctrlfs.
>>>
>>> From an abstract point of view, "total" and "local" are just event
>>> selection criteria, additional to those in mbm_cfg_mask. The different
>>> way they are treated in the hardware feels like an x86 implementation
>>> detail.
>>>
>>> For MPAM we don't currently distinguish local from non-local traffic, so
>>> I guess this just reduces to a simple on-off (i.e., "t" or nothing),
>>> which I guess is tolerable.
>>>
>>> This might want more thought if there is an expectation that more
>>> categories will be added here, though (?)
>>
>> There should be a path forward whenever we start supporting
>> user-configured counter classes. I assume the letters a-z will be
>> enough to cover all the counter classes which could be used at once.
>
> Ack, though I'd appreciate a response on the point about "_" below in
> case people missed it.

It was based on the dynamic debug interface and on Reinette's suggestion.
https://www.kernel.org/doc/html/v4.10/admin-guide/dynamic-debug-howto.html
(Look for "No flags are set").

We tried to follow a similar interface.
--
Thanks
Babu Moger

2024-04-23 16:25:24

by Dave Martin

Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On Tue, Apr 23, 2024 at 10:43:25AM -0500, Moger, Babu wrote:
> Hi Dave,
>
> On 4/23/24 07:38, Dave Martin wrote:
> > Hi Peter,
> >
> > On Mon, Apr 22, 2024 at 11:23:50AM -0700, Peter Newman wrote:
> >> Hi Dave,
> >>
> >> On Mon, Apr 22, 2024 at 9:33 AM Dave Martin <[email protected]> wrote:
> >>>
> >>> Hi Babu,
> >>>
> >>> On Thu, Mar 28, 2024 at 08:06:33PM -0500, Babu Moger wrote:
> >>>> Assignment flags can be one of the following:
> >>>>
> >>>> t MBM total event is assigned
> >>>
> >>> With my MPAM hat on this looks a bit weird, although I suppose it
> >>> follows on from the way "mbm_total_bytes" and "mbm_local_bytes" are
> >>> already exposed in resctrlfs.
> >>>
> >>> From an abstract point of view, "total" and "local" are just event
> >>> selection criteria, additional to those in mbm_cfg_mask. The different
> >>> way they are treated in the hardware feels like an x86 implementation
> >>> detail.
> >>>
> >>> For MPAM we don't currently distinguish local from non-local traffic, so
> >>> I guess this just reduces to a simple on-off (i.e., "t" or nothing),
> >>> which I guess is tolerable.
> >>>
> >>> This might want more thought if there is an expectation that more
> >>> categories will be added here, though (?)
> >>
> >> There should be a path forward whenever we start supporting
> >> user-configured counter classes. I assume the letters a-z will be
> >> enough to cover all the counter classes which could be used at once.
> >
> > Ack, though I'd appreciate a response on the point about "_" below in
> > case people missed it.
>
> It was based on the dynamic debug interface and also Reinette's suggestion
> as well.
> https://www.kernel.org/doc/html/v4.10/admin-guide/dynamic-debug-howto.html
> (Look for "No flags are set").
>
> We tried to use that similar interface.

Fair enough; I haven't touched dynamic debug for quite a while and had
forgotten about this convention being used there.

Apologies for the noise on that!

Cheers
---Dave

2024-04-24 04:15:25

by Reinette Chatre

Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)



On 4/23/2024 5:37 AM, Dave Martin wrote:
> On Mon, Apr 22, 2024 at 03:44:26PM -0500, Moger, Babu wrote:
>> Hi Dave,
>>
>> On 4/22/24 11:34, Dave Martin wrote:
>>> Hi Babu,
>>>
>>> On Thu, Apr 04, 2024 at 03:02:45PM -0500, Moger, Babu wrote:
>>>> Hi Peter,
>>>>
>>>>
>>>> On 4/4/24 14:08, Peter Newman wrote:
>>>>> Hi Babu,
>>>>>
>>>>> On Thu, Mar 28, 2024 at 6:07 PM Babu Moger <[email protected]> wrote:
>>>>>> The list follows the following format:
>>>>>>
>>>>>> * Default CTRL_MON group:
>>>>>> "//<domain_id>=<assignment_flags>"
>>>>>>
>>>>>> * Non-default CTRL_MON group:
>>>>>> "<CTRL_MON group>//<domain_id>=<assignment_flags>"
>>>>>>
>>>>>> * Child MON group of default CTRL_MON group:
>>>>>> "/<MON group>/<domain_id>=<assignment_flags>"
>>>>>>
>>>>>> * Child MON group of non-default CTRL_MON group:
>>>>>> "<CTRL_MON group>/<MON group>/<domain_id>=<assignment_flags>"
>>>>>>
>>>>>> Assignment flags can be one of the following:
>>>>>>
>>>>>> t MBM total event is assigned
>>>>>> l MBM local event is assigned
>>>>>> tl Both total and local MBM events are assigned
>>>>>> _ None of the MBM events are assigned
>>>>>>
>>>>>> Examples:
>>>>>>
>>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>> non_defult_group//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
>>>>>> non_defult_group/non_default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
>>>>>> //0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
>>>>>> /default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
>>>>>>
>>>>>> There are four groups and all the groups have local and total event assigned.
>>>>>>
>>>>>> "//" - This is a default CONTROL MON group
>>>>>>
>>>>>> "non_defult_group//" - This is non default CONTROL MON group
>>>>>>
>>>>>> "/default_mon1/" - This is Child MON group of the defult group
>>>>>>
>>>>>> "non_defult_group/non_default_mon1/" - This is child MON group of the non default group
>>>>>>
>>>>>> =tl means both total and local events are assigned.
>>>>>
>>>>> I recall there was supposed to be a way to perform the same update on
>>>>> all domains together so that it isn't tedious to not do per-domain
>>>>
>>>> Yes. Correct. Reinette suggested to have "no domains" means ALL the domains.
>>>
>>> Would "*" be more intuitive?
>>
>> We could. But I don't see the need for wildcard ("*") or ranges and
>> complexity that comes with that.
>
> For "*", I mean that this would just stand for "all cpus", not a generic
> string match; apologies if I didn't make that clear.

(reading this by replacing "all cpus" with "all domains")

This sounds reasonable to me. It may indeed make the parsing simpler by
not needing the ugly checks Babu mentioned in [1].

Reinette

[1] https://lore.kernel.org/lkml/[email protected]/

2024-04-24 14:16:35

by Dave Martin

Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

On Tue, Apr 23, 2024 at 09:15:07PM -0700, Reinette Chatre wrote:
>
>
> On 4/23/2024 5:37 AM, Dave Martin wrote:
> > On Mon, Apr 22, 2024 at 03:44:26PM -0500, Moger, Babu wrote:
> >> Hi Dave,
> >>
> >> On 4/22/24 11:34, Dave Martin wrote:
> >>> Hi Babu,
> >>>
> >>> On Thu, Apr 04, 2024 at 03:02:45PM -0500, Moger, Babu wrote:
> >>>> Hi Peter,
> >>>>
> >>>>
> >>>> On 4/4/24 14:08, Peter Newman wrote:
> >>>>> Hi Babu,
> >>>>>
> >>>>> On Thu, Mar 28, 2024 at 6:07 PM Babu Moger <[email protected]> wrote:
> >>>>>> The list follows the following format:
> >>>>>>
> >>>>>> * Default CTRL_MON group:
> >>>>>> "//<domain_id>=<assignment_flags>"
> >>>>>>
> >>>>>> * Non-default CTRL_MON group:
> >>>>>> "<CTRL_MON group>//<domain_id>=<assignment_flags>"
> >>>>>>
> >>>>>> * Child MON group of default CTRL_MON group:
> >>>>>> "/<MON group>/<domain_id>=<assignment_flags>"
> >>>>>>
> >>>>>> * Child MON group of non-default CTRL_MON group:
> >>>>>> "<CTRL_MON group>/<MON group>/<domain_id>=<assignment_flags>"
> >>>>>>
> >>>>>> Assignment flags can be one of the following:
> >>>>>>
> >>>>>> t MBM total event is assigned
> >>>>>> l MBM local event is assigned
> >>>>>> tl Both total and local MBM events are assigned
> >>>>>> _ None of the MBM events are assigned
> >>>>>>
> >>>>>> Examples:
> >>>>>>
> >>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>> non_defult_group//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> >>>>>> non_defult_group/non_default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> >>>>>> //0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> >>>>>> /default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> >>>>>>
> >>>>>> There are four groups and all the groups have local and total event assigned.
> >>>>>>
> >>>>>> "//" - This is a default CONTROL MON group
> >>>>>>
> >>>>>> "non_defult_group//" - This is non default CONTROL MON group
> >>>>>>
> >>>>>> "/default_mon1/" - This is Child MON group of the defult group
> >>>>>>
> >>>>>> "non_defult_group/non_default_mon1/" - This is child MON group of the non default group
> >>>>>>
> >>>>>> =tl means both total and local events are assigned.
> >>>>>
> >>>>> I recall there was supposed to be a way to perform the same update on
> >>>>> all domains together so that it isn't tedious to not do per-domain
> >>>>
> >>>> Yes. Correct. Reinette suggested to have "no domains" means ALL the domains.
> >>>
> >>> Would "*" be more intuitive?
> >>
> >> We could. But I don't see the need for wildcard ("*") or ranges and
> >> complexity that comes with that.
> >
> > For "*", I mean that this would just stand for "all cpus", not a generic
> > string match; apologies if I didn't make that clear.
>
> (reading this by replacing "all cpus" with "all domains")
>
> This sounds reasonable to me. It may indeed make the parsing simpler by
> not needing the ugly checks Babu mentioned in [1].
>
> Reinette
>
> [1] https://lore.kernel.org/lkml/[email protected]/

Ack, I meant "all domains", sorry!

Note, should we try to detect things like:

<resource>:0=fee;1=fie;*=foe;0=fum

..?

Either we treat conflicting assignments as an error, or we do them all
in the order specified, so that assignments on the right override those
on the left (which is what the schemata parsing in ctrlmondata.c:
parse_line() seems to do today if I understand the code correctly).

In the latter case,

<resource>:*=fee;1=fie

would set all nodes except 1 to "fee", and node 1 to "fie", which might
be useful (or at least, convenient).

If we're worried about that being exposed as ABI and used by userspace,
we might want to disallow it explicitly.
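
To make the two options concrete, here is a small Python sketch (not the
kernel's parse_line(); apply_assignments and its strict parameter are
hypothetical, and "L3" below merely stands in for <resource>). With
strict=False, entries are applied left to right so later ones override
earlier ones; with strict=True, a second assignment to the same domain is
rejected.

```python
def apply_assignments(line, all_domains, strict=False):
    """Apply "<resource>:<dom>=<val>;..." entries in the order given.

    "*" as the domain stands for every id in all_domains.
    """
    resource, _, body = line.partition(":")
    result = {}
    for entry in body.split(";"):
        if not entry:
            continue
        dom, sep, value = entry.partition("=")
        if not sep:
            raise ValueError(f"malformed entry: {entry!r}")
        targets = list(all_domains) if dom == "*" else [int(dom)]
        for d in targets:
            if strict and d in result:
                # Policy 1: treat conflicting assignments as an error.
                raise ValueError(f"conflicting assignment for domain {d}")
            # Policy 2 (default): assignments on the right win.
            result[d] = value
    return resource, result
```

In the non-strict mode, "L3:*=fee;1=fie" over domains 0-3 leaves domain 1
at "fie" and all other domains at "fee".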

Cheers
---Dave

2024-04-24 19:10:47

by Moger, Babu

Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Dave,

On 4/24/24 09:16, Dave Martin wrote:
> On Tue, Apr 23, 2024 at 09:15:07PM -0700, Reinette Chatre wrote:
>>
>>
>> On 4/23/2024 5:37 AM, Dave Martin wrote:
>>> On Mon, Apr 22, 2024 at 03:44:26PM -0500, Moger, Babu wrote:
>>>> Hi Dave,
>>>>
>>>> On 4/22/24 11:34, Dave Martin wrote:
>>>>> Hi Babu,
>>>>>
>>>>> On Thu, Apr 04, 2024 at 03:02:45PM -0500, Moger, Babu wrote:
>>>>>> Hi Peter,
>>>>>>
>>>>>>
>>>>>> On 4/4/24 14:08, Peter Newman wrote:
>>>>>>> Hi Babu,
>>>>>>>
>>>>>>> On Thu, Mar 28, 2024 at 6:07 PM Babu Moger <[email protected]> wrote:
>>>>>>>> The list follows the following format:
>>>>>>>>
>>>>>>>> * Default CTRL_MON group:
>>>>>>>> "//<domain_id>=<assignment_flags>"
>>>>>>>>
>>>>>>>> * Non-default CTRL_MON group:
>>>>>>>> "<CTRL_MON group>//<domain_id>=<assignment_flags>"
>>>>>>>>
>>>>>>>> * Child MON group of default CTRL_MON group:
>>>>>>>> "/<MON group>/<domain_id>=<assignment_flags>"
>>>>>>>>
>>>>>>>> * Child MON group of non-default CTRL_MON group:
>>>>>>>> "<CTRL_MON group>/<MON group>/<domain_id>=<assignment_flags>"
>>>>>>>>
>>>>>>>> Assignment flags can be one of the following:
>>>>>>>>
>>>>>>>> t MBM total event is assigned
>>>>>>>> l MBM local event is assigned
>>>>>>>> tl Both total and local MBM events are assigned
>>>>>>>> _ None of the MBM events are assigned
>>>>>>>>
>>>>>>>> Examples:
>>>>>>>>
>>>>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>> non_defult_group//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
>>>>>>>> non_defult_group/non_default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
>>>>>>>> //0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
>>>>>>>> /default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
>>>>>>>>
>>>>>>>> There are four groups and all the groups have local and total event assigned.
>>>>>>>>
>>>>>>>> "//" - This is a default CONTROL MON group
>>>>>>>>
>>>>>>>> "non_defult_group//" - This is non default CONTROL MON group
>>>>>>>>
>>>>>>>> "/default_mon1/" - This is Child MON group of the defult group
>>>>>>>>
>>>>>>>> "non_defult_group/non_default_mon1/" - This is child MON group of the non default group
>>>>>>>>
>>>>>>>> =tl means both total and local events are assigned.
>>>>>>>
>>>>>>> I recall there was supposed to be a way to perform the same update on
>>>>>>> all domains together so that it isn't tedious to not do per-domain
>>>>>>
>>>>>> Yes. Correct. Reinette suggested to have "no domains" means ALL the domains.
>>>>>
>>>>> Would "*" be more intuitive?
>>>>
>>>> We could. But I don't see the need for wildcard ("*") or ranges and
>>>> complexity that comes with that.
>>>
>>> For "*", I mean that this would just stand for "all cpus", not a generic
>>> string match; apologies if I didn't make that clear.
>>
>> (reading this by replacing "all cpus" with "all domains")
>>
>> This sounds reasonable to me. It may indeed make the parsing simpler by
>> not needing the ugly checks Babu mentioned in [1].

Sure. I will plan to address the "all domains" ("*") option in the next revision.

>>
>> Reinette
>>
>> [1] https://lore.kernel.org/lkml/[email protected]/
>
> Ack, I meant "all domains", sorry!
>
> Note, should we try to detect things like:
>
> <resource>:0=fee;1=fie;*=foe;0=fum
>
> ..?
>
> Either we treat conflicting assignments as an error, or we do them all
> in the order specified, so that assignments on the right override those
> on the left (which is what the schemata parsing in ctrlmondata.c:
> parse_line() seems to do today if I understand the code correctly).
>
> In the latter case,
>
> <resource>:*=fee;1=fie
>
> would set all nodes except 1 to "fee", and node 1 to "fie", which might
> be useful (or at least, convenient).
>
> If we're worried about that being exposed as ABI and used by userspace,
> we might want to disallow it explicitly.
>

Sure. Right now we are not planning to support domain-specific
assignments, but I will keep the option open for future support.
--
Thanks
Babu Moger
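
For illustration only, the "assignments on the right override those on
the left" behaviour discussed above can be modelled in user space as
below. This is a sketch, not code from the series or from parse_line():
apply_domain_assignments() is a hypothetical name, the input is the part
of the line after "<resource>:", and "*" is assumed to mean "all
domains" as proposed in this thread.

```python
def apply_domain_assignments(spec, domain_ids):
    """Apply 'id=value' entries left to right; '*' targets every domain."""
    result = {}
    for entry in spec.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing ';'
        key, sep, value = entry.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in {entry!r}")
        if key == "*":
            targets = list(domain_ids)
        else:
            dom = int(key)
            if dom not in domain_ids:
                raise ValueError(f"unknown domain {dom}")
            targets = [dom]
        for dom in targets:
            result[dom] = value  # later entries override earlier ones
    return result


# "*=fee;1=fie" sets every domain to "fee", then domain 1 to "fie".
print(apply_domain_assignments("*=fee;1=fie", [0, 1, 2]))
# prints {0: 'fee', 1: 'fie', 2: 'fee'}
```

Treating conflicting entries as an error instead would only need a check
for a domain that is already present in result before it is overwritten.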

2024-05-01 17:48:58

by Peter Newman

Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On Thu, Mar 28, 2024 at 6:07 PM Babu Moger <[email protected]> wrote:
>
>
> This series adds the support for Assignable Bandwidth Monitoring Counters
> (ABMC). It is also called QoS RMID Pinning feature
>
> The feature details are documented in the APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> Monitoring (ABMC). The documentation is available at
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>
> The patches are based on top of commit
> cd80c2c94699913f9334414189487ff3f93cf0b5 (tip/master)
>
> # Introduction
>
> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
> feature only guarantees that RMIDs currently assigned to a processor will
> be tracked by hardware. The counters of any other RMIDs which are no longer
> being tracked will be reset to zero. The MBM event counters return
> "Unavailable" for the RMIDs that are not active.
>
> Users can create 256 or more monitor groups. But there can be only limited
> number of groups that can give guaranteed monitoring numbers. With ever
> changing configurations there is no way to definitely know which of these
> groups will be active for certain point of time. Users do not have the
> option to monitor a group or set of groups for certain period of time
> without worrying about RMID being reset in between.
>
> The ABMC feature provides an option to the user to assign an RMID to the
> hardware counter and monitor the bandwidth for a longer duration.
> The assigned RMID will be active until the user unassigns it manually.
> There is no need to worry about counters being reset during this period.
> Additionally, the user can specify a bitmask identifying the specific
> bandwidth types from the given source to track with the counter.
>
> Without ABMC enabled, monitoring will work in current mode without
> assignment option.
>
> # Linux Implementation
>
> Linux resctrl subsystem provides the interface to count maximum of two
> memory bandwidth events per group, from a combination of available total
> and local events. Keeping the current interface, users can assign a maximum
> of 2 ABMC counters per group. User will also have the option to assign only
> one counter to the group. If the system runs out of assignable ABMC
> counters, kernel will display an error. Users need to unassign an already
> assigned counter to make space for new assignments.
>
>
> # Examples
>
> a. Check if ABMC support is available
> #mount -t resctrl resctrl /sys/fs/resctrl/
>
> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign
> [abmc]
> legacy_mbm
>
> Linux kernel detected ABMC feature and it is enabled.
>
> b. Check how many ABMC counters are available.
>
> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_cntrs
> 32
>
> c. Create few resctrl groups.
>
> # mkdir /sys/fs/resctrl/mon_groups/default_mon1
> # mkdir /sys/fs/resctrl/non_defult_group
> # mkdir /sys/fs/resctrl/non_defult_group/mon_groups/non_default_mon1
>
> d. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> to list and modify the group's assignment states.
>
> The list follows the following format:
>
> * Default CTRL_MON group:
> "//<domain_id>=<assignment_flags>"
>
> * Non-default CTRL_MON group:
> "<CTRL_MON group>//<domain_id>=<assignment_flags>"
>
> * Child MON group of default CTRL_MON group:
> "/<MON group>/<domain_id>=<assignment_flags>"
>
> * Child MON group of non-default CTRL_MON group:
> "<CTRL_MON group>/<MON group>/<domain_id>=<assignment_flags>"
>
> Assignment flags can be one of the following:
>
> t MBM total event is assigned
> l MBM local event is assigned
> tl Both total and local MBM events are assigned
> _ None of the MBM events are assigned
>

I was able to successfully build a kernel where this interface is
adapted to work with both real ABMC on hardware that supports it and
my software workaround for older hardware.

My prototype is based on a refactored version of the codebase
supporting MPAM, but the capabilities of the MPAM hardware look
similar enough to ABMC that I'm not concerned about the feasibility.

The FS layer is informed by the arch layer (through rdt_resource
fields) how many assignable monitors are available and whether a
monitor is assigned to an entire group or a single event in a group.
Also, the FS layer can assume that monitors are indexed contiguously,
allowing it to host the data structures managing FS-level view of
monitor usage.

I used the following resctrl_arch interfaces to propagate assignments
to the implementation:

void resctrl_arch_assign_monitor(struct rdt_domain *d, u32 mon_id,
                                 u32 closid, u32 rmid, int evtid);
void resctrl_arch_unassign_monitor(struct rdt_domain *d, u32 mon_id);

I chose to allow reassigning an assigned monitor without calling
unassign first. This is important when monitors are unassigned and
assigned in a single write to mbm_assign_control, as it allows all
updates to be performed in a single round of parallel IPIs to the
domains.


>
> g. Users will have the option to go back to legacy_mbm mode if required.
> This can be done using the following command.
>
> # echo "legacy_mbm" > /sys/fs/resctrl/info/L3_MON/mbm_assign
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign
> abmc
> [legacy_mbm]

I chose to make this a mount option to simplify the management of the
monitor tracking data structures. They are simply allocated at mount
time and deallocated at unmount.

I called the option "mon_assign": The mount option parser calls
resctrl_arch_mon_assign_enable() to determine whether the
implementation supports assignment in some form. If it returns an
error, the mount fails. When successful, the assignable monitor count
is made non-zero in the appropriate rdt_resource, triggering the
behavior change in the FS layer.

I'm still not sure if it's a good idea to enable monitor assignment by
default. This would be a major disruption in the MBM usage model
triggered by moving software between AMD CPU models. I thought the
safest option was to disallow creating more monitoring groups than
monitors unless the option is selected. Given that nobody else
complained about monitoring HW limitations on the mailing list, I
assumed few users create enough monitoring groups to be impacted.

Thanks!
-Peter
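
As a rough user-space model of the behaviour described in this message
(contiguously indexed monitors, and an assign that may overwrite an
existing assignment without an unassign first), something like the
following could be used. It is only a sketch: DomainMonitors and
apply_batch() are hypothetical names, the tuple layout is an assumption,
and the single pass over the domains merely stands in for the single
round of IPIs mentioned above.

```python
class DomainMonitors:
    """Toy model of one domain's assignable monitors, indexed 0..n-1."""

    def __init__(self, num_monitors):
        # None marks a disabled monitor; otherwise (closid, rmid, evtid).
        self.slots = [None] * num_monitors

    def assign(self, mon_id, closid, rmid, evtid):
        # Overwrites any previous assignment, so no unassign is needed first.
        self.slots[mon_id] = (closid, rmid, evtid)

    def unassign(self, mon_id):
        # Only disables the monitor; nothing about its previous use is needed.
        self.slots[mon_id] = None


def apply_batch(domains, updates):
    """Apply every queued update to each domain in one pass."""
    for dom in domains:
        for op, mon_id, *args in updates:
            if op == "assign":
                dom.assign(mon_id, *args)
            else:
                dom.unassign(mon_id)


domains = [DomainMonitors(32) for _ in range(2)]
apply_batch(domains, [("assign", 0, 0, 5, 1)])
# Monitor 0 is reassigned directly, and monitor 1 is disabled, in one batch.
apply_batch(domains, [("assign", 0, 0, 9, 2), ("unassign", 1)])
```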

2024-05-02 16:25:51

by Moger, Babu

Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Peter,

On 5/1/24 12:48, Peter Newman wrote:
> Hi Babu,
>
> On Thu, Mar 28, 2024 at 6:07 PM Babu Moger <[email protected]> wrote:
>>
>>
>> This series adds the support for Assignable Bandwidth Monitoring Counters
>> (ABMC). It is also called QoS RMID Pinning feature
>>
>> The feature details are documented in the APM listed below [1].
>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>> Monitoring (ABMC). The documentation is available at
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>
>> The patches are based on top of commit
>> cd80c2c94699913f9334414189487ff3f93cf0b5 (tip/master)
>>
>> # Introduction
>>
>> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
>> feature only guarantees that RMIDs currently assigned to a processor will
>> be tracked by hardware. The counters of any other RMIDs which are no longer
>> being tracked will be reset to zero. The MBM event counters return
>> "Unavailable" for the RMIDs that are not active.
>>
>> Users can create 256 or more monitor groups. But there can be only limited
>> number of groups that can give guaranteed monitoring numbers. With ever
>> changing configurations there is no way to definitely know which of these
>> groups will be active for certain point of time. Users do not have the
>> option to monitor a group or set of groups for certain period of time
>> without worrying about RMID being reset in between.
>>
>> The ABMC feature provides an option to the user to assign an RMID to the
>> hardware counter and monitor the bandwidth for a longer duration.
>> The assigned RMID will be active until the user unassigns it manually.
>> There is no need to worry about counters being reset during this period.
>> Additionally, the user can specify a bitmask identifying the specific
>> bandwidth types from the given source to track with the counter.
>>
>> Without ABMC enabled, monitoring will work in current mode without
>> assignment option.
>>
>> # Linux Implementation
>>
>> Linux resctrl subsystem provides the interface to count maximum of two
>> memory bandwidth events per group, from a combination of available total
>> and local events. Keeping the current interface, users can assign a maximum
>> of 2 ABMC counters per group. User will also have the option to assign only
>> one counter to the group. If the system runs out of assignable ABMC
>> counters, kernel will display an error. Users need to unassign an already
>> assigned counter to make space for new assignments.
>>
>>
>> # Examples
>>
>> a. Check if ABMC support is available
>> #mount -t resctrl resctrl /sys/fs/resctrl/
>>
>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign
>> [abmc]
>> legacy_mbm
>>
>> Linux kernel detected ABMC feature and it is enabled.
>>
>> b. Check how many ABMC counters are available.
>>
>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_cntrs
>> 32
>>
>> c. Create few resctrl groups.
>>
>> # mkdir /sys/fs/resctrl/mon_groups/default_mon1
>> # mkdir /sys/fs/resctrl/non_defult_group
>> # mkdir /sys/fs/resctrl/non_defult_group/mon_groups/non_default_mon1
>>
>> d. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> to list and modify the group's assignment states.
>>
>> The list follows the following format:
>>
>> * Default CTRL_MON group:
>> "//<domain_id>=<assignment_flags>"
>>
>> * Non-default CTRL_MON group:
>> "<CTRL_MON group>//<domain_id>=<assignment_flags>"
>>
>> * Child MON group of default CTRL_MON group:
>> "/<MON group>/<domain_id>=<assignment_flags>"
>>
>> * Child MON group of non-default CTRL_MON group:
>> "<CTRL_MON group>/<MON group>/<domain_id>=<assignment_flags>"
>>
>> Assignment flags can be one of the following:
>>
>> t MBM total event is assigned
>> l MBM local event is assigned
>> tl Both total and local MBM events are assigned
>> _ None of the MBM events are assigned
>>
>
> I was able to successfully build a kernel where this interface is
> adapted to work with both real ABMC on hardware that supports it and
> my software workaround for older hardware.

Thanks for trying that out. Good to know.

>
> My prototype is based on a refactored version of the codebase
> supporting MPAM, but the capabilities of the MPAM hardware look
> similar enough to ABMC that I'm not concerned about the feasibility.

That is good.

>
> The FS layer is informed by the arch layer (through rdt_resource
> fields) how many assignable monitors are available and whether a
> monitor is assigned to an entire group or a single event in a group.
> Also, the FS layer can assume that monitors are indexed contiguously,
> allowing it to host the data structures managing FS-level view of
> monitor usage.
>
> I used the following resctrl_arch-interfaces to propagate assignments
> to the implementation:
>
> void resctrl_arch_assign_monitor(struct rdt_domain *d, u32 mon_id, u32
> closid, u32 rmid, int evtid);

Sure. I can add these in next version.

A few comments:

AMD does not need closid for assignment. I assume ARM requires closid.

What is mon_id here?


> void resctrl_arch_unassign_monitor(struct rdt_domain *d, u32 mon_id);

We need rmid and evtid for unassign interface here.


>
> I chose to allow reassigning an assigned monitor without calling
> unassign first. This is important when monitors are unassigned and
> assigned in a single write to mbm_assign_control, as it allows all
> updates to be performed in a single round of parallel IPIs to the
> domains.

Yes. It is not required to call unassign before assign. The hardware (AMD)
supports it.

But we only have 32 counters. We need to know which counter we are going
to use for assignment. If all the counters are already assigned, then we
can't figure out the counter id without calling unassign first. Using an
arbitrary counter will overwrite an already assigned counter.

>
>
>>
>> g. Users will have the option to go back to legacy_mbm mode if required.
>> This can be done using the following command.
>>
>> # echo "legacy_mbm" > /sys/fs/resctrl/info/L3_MON/mbm_assign
>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign
>> abmc
>> [legacy_mbm]
>
> I chose to make this a mount option to simplify the management of the
> monitor tracking data structures. They are simply allocated at mount
> time and deallocated at unmount.

Initially I added it as a mount option.
Based on our earlier discussion, we decided to use the assign feature by
default if the hardware supports it. Users don't have to worry about the details.
>
> I called the option "mon_assign": The mount option parser calls
> resctrl_arch_mon_assign_enable() to determine whether the
> implementation supports assignment in some form. If it returns an
> error, the mount fails. When successful, the assignable monitor count
> is made non-zero in the appropriate rdt_resource, triggering the
> behavior change in the FS layer.
>
> I'm still not sure if it's a good idea to enable monitor assignment by
> default. This would be a major disruption in the MBM usage model
> triggered by moving software between AMD CPU models. I thought the

Why will it be a disruption? Why do you think a mount option will solve the
problem? As always, there will be an option to go back to legacy mode, right?

> safest option was to disallow creating more monitoring groups than
> monitors unless the option is selected. Given that nobody else

The current code allows creating more groups, but it will report "Monitor
assignment failed" when it runs out of monitors.

> complained about monitoring HW limitations on the mailing list, I
> assumed few users create enough monitoring groups to be impacted.
>
> Thanks!
> -Peter

--
Thanks
Babu Moger

2024-05-02 16:36:15

by Dave Martin

Subject: Re: [RFC PATCH v3 17/17] x86/resctrl: Introduce interface to modify assignment states of the groups

On Thu, Mar 28, 2024 at 08:06:50PM -0500, Babu Moger wrote:
> Introduce rdtgroup_mbm_assign_control_write to assign mbm events.
> Assignment state can be updated by writing to this interface.
> Assignment states are applied on all the domains. Assignment on one
> domain applied on all the domains. User can pass one valid domain and
> assignment will be updated on all the available domains.
>
> Format is similar to the list format with addition of op-code for the
> assignment operation.
>
> * Default CTRL_MON group:
> "//<domain_id><op-code><assignment_flags>"
>
> * Non-default CTRL_MON group:
> "<CTRL_MON group>//<domain_id><op-code><assignment_flags>"
>
> * Child MON group of default CTRL_MON group:
> "/<MON group>/<domain_id><op-code><assignment_flags>"
>
> * Child MON group of non-default CTRL_MON group:
> "<CTRL_MON group>/<MON group>/<domain_id><op-code><assignment_flags>"
>
> Op-code can be one of the following:
>
> = Update the assignment to match the flags
> + Assign a new state
> - Unassign a new state
> _ Unassign all the states
>
> Signed-off-by: Babu Moger <[email protected]>
> ---
>
> v3: New patch.
> Addresses the feedback to provide the global assignment interface.
> https://lore.kernel.org/lkml/[email protected]/
> ---
> Documentation/arch/x86/resctrl.rst | 71 ++++++++
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 236 ++++++++++++++++++++++++-
> 2 files changed, 306 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index 2d96565501ab..64ec70637c66 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -328,6 +328,77 @@ with the following files:
> None of events are assigned on this mon group. This is a child
> monitor group of the non default control mon group.
>
> + Assignment state can be updated by writing to this interface.
> +
> + NOTE: Assignment on one domain applied on all the domains. User can
> + pass one valid domain and assignment will be updated on all the
> + available domains.
> +
> + Format is similar to the list format with addition of op-code for the
> + assignment operation.
> +
> + * Default CTRL_MON group:
> + "//<domain_id><op-code><assignment_flags>"
> +
> + * Non-default CTRL_MON group:
> + "<CTRL_MON group>//<domain_id><op-code><assignment_flags>"
> +
> + * Child MON group of default CTRL_MON group:
> + "/<MON group>/<domain_id><op-code><assignment_flags>"
> +
> + * Child MON group of non-default CTRL_MON group:
> + "<CTRL_MON group>/<MON group>/<domain_id><op-code><assignment_flags>"

The final bullet seems to cover everything, if we allow <CTRL_MON group>
and <MON group> to be independently empty strings to indicate the
default control and/or monitoring group respectively.

Would that be simpler than treating this as four separate cases?

Also, will this go wrong if someone creates a resctrl group with '\n'
(i.e., a newline character) in the name?

> +
> + Op-code can be one of the following:
> + ::
> +
> + = Update the assignment to match the flags
> + + Assign a new state
> + - Unassign a new state
> + _ Unassign all the states

I can't remember whether I already asked this, but is "_" really
needed here?

Wouldn't it be the case that

//*_

would mean just the same thing as

//*=_

..? (assuming the "*" = "all domains" convention already discussed)

Maybe I'm missing something here.

[...]

Cheers
---Dave
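
To make the "one generic form" suggestion above concrete, here is a
small user-space sketch in which a single pattern covers all four
documented cases by letting either group name be empty. It is not the
parser from the patch: parse_assignment() and the dictionary keys are
hypothetical, only one domain per entry is handled, and "*" as a domain
is the convention discussed in this thread rather than something v3
implements.

```python
import re

# <CTRL_MON group>/<MON group>/<domain_id><op-code><assignment_flags>
# Either group name may be empty, which selects the default group.
_ENTRY = re.compile(r"([^/\n]*)/([^/\n]*)/(\d+|\*)([=+\-_])(tl|t|l|_)?")


def parse_assignment(entry):
    """Split one entry into its group names, domain, op-code and flags."""
    m = _ENTRY.fullmatch(entry)
    if m is None:
        raise ValueError(f"malformed entry: {entry!r}")
    ctrl, mon, domain, op, flags = m.groups()
    return {
        "ctrl_group": ctrl or None,  # None stands for the default CTRL_MON group
        "mon_group": mon or None,    # None stands for "no child MON group"
        "domain": domain,
        "op": op,
        "flags": flags or "",
    }


print(parse_assignment("//0=tl"))
print(parse_assignment("non_defult_group/non_default_mon1/0+l"))
```

With this shape, "//*_" and "//*=_" both parse cleanly; whether they
should mean the same thing is a policy decision outside the parser.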

2024-05-02 17:50:43

by Peter Newman

Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On Thu, May 2, 2024 at 9:25 AM Moger, Babu <[email protected]> wrote:
> On 5/1/24 12:48, Peter Newman wrote:
> > The FS layer is informed by the arch layer (through rdt_resource
> > fields) how many assignable monitors are available and whether a
> > monitor is assigned to an entire group or a single event in a group.
> > Also, the FS layer can assume that monitors are indexed contiguously,
> > allowing it to host the data structures managing FS-level view of
> > monitor usage.
> >
> > I used the following resctrl_arch-interfaces to propagate assignments
> > to the implementation:
> >
> > void resctrl_arch_assign_monitor(struct rdt_domain *d, u32 mon_id, u32
> > closid, u32 rmid, int evtid);
>
> Sure. I can add these in next version.
>
> Few comments..
>
> AMD does not need closid for assignment. I assume ARM requires closid.

Correct, MPAM needs a CLOSID+RMID (PARTID+PMG) to identify a
monitoring group. The CLOSID parameter is ignored on x86.

>
> What is mon_id here?

On ABMC, the value is programmed into L3_QOS_ABMC_CFG.CtrID


>
> > void resctrl_arch_unassign_monitor(struct rdt_domain *d, u32 mon_id);
>
> We need rmid and evtid for unassign interface here.

From my reading of the ABMC specification, it does not look necessary
to program BwSrc or BwType when changing L3_QOS_ABMC_CFG.CtrEn to 0
for a particular CtrID. This interface only disables a counter, so it
should not need to know about how it was previously used when assign
is able to reassign, as assign will always reset the arch_mbm data.

I do not see any harm in the arch_mbm data being stale while the
counter is unassigned, because the data is not accessed when reading
the hardware counter fails. In general, resctrl_arch_rmid_read()
cannot return any information if the hardware counter is not readable
at the time it is called.

>
>
> >
> > I chose to allow reassigning an assigned monitor without calling
> > unassign first. This is important when monitors are unassigned and
> > assigned in a single write to mbm_assign_control, as it allows all
> > updates to be performed in a single round of parallel IPIs to the
> > domains.
>
> Yes. It is not required to call unassign before assign. Hardware(AMD)
> supports it.
>
> But, we only have 32 counters. We need to know which counter we are going
> to use for assignment. If all the counters already assigned, then we can't
> figure out the counter id without calling unassign first. Using the random
> counter will overwrite the already assigned counter.

I made the caller of resctrl_arch_assign_monitor() responsible for
selecting which monitor to assign. As long as the user orders the
unassign operations before the assign operations in a write to
mbm_assign_control, the FS code will be able to find an available
monitor ID.


> > I chose to make this a mount option to simplify the management of the
> > monitor tracking data structures. They are simply allocated at mount
> > time and deallocated at unmount.
>
> Initially I added it as an mount option.
> Based on our earlier discussion, we decided to use the assign feature by
> default if hardware supports it. Users don't have to worry about the details.
> >
> > I called the option "mon_assign": The mount option parser calls
> > resctrl_arch_mon_assign_enable() to determine whether the
> > implementation supports assignment in some form. If it returns an
> > error, the mount fails. When successful, the assignable monitor count
> > is made non-zero in the appropriate rdt_resource, triggering the
> > behavior change in the FS layer.
> >
> > I'm still not sure if it's a good idea to enable monitor assignment by
> > default. This would be a major disruption in the MBM usage model
> > triggered by moving software between AMD CPU models. I thought the
>
> Why will it be a disruption? Why do you think mount option will solve the
> problem? As always, there will be option to go back to legacy mode. right?
>
> > safest option was to disallow creating more monitoring groups than
> > monitors unless the option is selected. Given that nobody else
>
> Current code allows to create more groups, but it will report "Monitor
> assignment failed" when it runs out of monitors.

Ok that should be fine then.

However, I don't think it's necessary to support dynamically changing
the usage model of monitoring groups without remounting. I believe it
makes it more difficult for the FS code to generically manage monitor
assignment.

-Peter
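
The ordering point above (unassign operations placed before assign
operations in one write, so that a free monitor ID can be found) can be
illustrated with a toy model. This is an assumption-laden sketch, not
the FS code: process_write() is a hypothetical name, groups are plain
strings, and each group is given at most one monitor.

```python
def process_write(ops, assigned, num_monitors):
    """Apply (action, group) ops in order; 'assigned' maps group -> monitor ID."""
    for action, group in ops:
        if action == "unassign":
            assigned.pop(group, None)
            continue
        if group in assigned:
            continue  # already has a monitor; nothing to allocate
        free = set(range(num_monitors)) - set(assigned.values())
        if not free:
            raise RuntimeError(f"no free monitor for {group}")
        assigned[group] = min(free)
    return assigned


state = {"a": 0, "b": 1}  # both monitors of a 2-monitor system are in use
# Unassign first, then assign: "c" picks up the monitor released by "a".
print(process_write([("unassign", "a"), ("assign", "c")], state, 2))
# prints {'b': 1, 'c': 0}
```

With the two operations in the opposite order, the assign for "c" would
fail because no monitor is free at the time it is processed.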

2024-05-02 17:52:39

by Reinette Chatre

Subject: Re: [RFC PATCH v3 17/17] x86/resctrl: Introduce interface to modify assignment states of the groups

Hi Dave,

On 5/2/2024 9:21 AM, Dave Martin wrote:
> On Thu, Mar 28, 2024 at 08:06:50PM -0500, Babu Moger wrote:
>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>> index 2d96565501ab..64ec70637c66 100644
>> --- a/Documentation/arch/x86/resctrl.rst
>> +++ b/Documentation/arch/x86/resctrl.rst
>> @@ -328,6 +328,77 @@ with the following files:
>> None of events are assigned on this mon group. This is a child
>> monitor group of the non default control mon group.
>>
>> + Assignment state can be updated by writing to this interface.
>> +
>> + NOTE: Assignment on one domain applied on all the domains. User can
>> + pass one valid domain and assignment will be updated on all the
>> + available domains.
>> +
>> + Format is similar to the list format with addition of op-code for the
>> + assignment operation.
>> +
>> + * Default CTRL_MON group:
>> + "//<domain_id><op-code><assignment_flags>"
>> +
>> + * Non-default CTRL_MON group:
>> + "<CTRL_MON group>//<domain_id><op-code><assignment_flags>"
>> +
>> + * Child MON group of default CTRL_MON group:
>> + "/<MON group>/<domain_id><op-code><assignment_flags>"
>> +
>> + * Child MON group of non-default CTRL_MON group:
>> + "<CTRL_MON group>/<MON group>/<domain_id><op-code><assignment_flags>"
>
> The final bullet seems to cover everything, if we allow <CTRL_MON group>
> and <MON group> to be independently empty strings to indicate the
> default control and/or monitoring group respectively.
>
> Would that be simpler than treating this as four separate cases?
>
> Also, will this go wrong if someone creates a resctrl group with '\n'
> (i.e., a newline character) in the name?

There is a check for this in rdtgroup_mkdir().

>
>> +
>> + Op-code can be one of the following:
>> + ::
>> +
>> + = Update the assignment to match the flags
>> + + Assign a new state
>> + - Unassign a new state
>> + _ Unassign all the states
>
> I can't remember whether I already asked this, but is "_" really
> needed here?

Asked twice:
https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/[email protected]/

Answered:
https://lore.kernel.org/lkml/[email protected]/

You seemed ok with answer then:
https://lore.kernel.org/lkml/[email protected]/

>
> Wouldn't it be the case that
>
> //*_
>
> would mean just the same thing as
>
> //*=_
>
> ...? (assuming the "*" = "all domains" convention already discussed)
>
> Maybe I'm missing something here.

I believe having an explicit operator ("+", "=", or "-") simplifies
parsing while providing an interface consistent with what users are already
used to.

Reinette
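
As a worked illustration of the op-codes under discussion, the sketch
below applies one op-code to a group's current set of assigned events.
It reflects one reading of the documented table, not the patch itself:
apply_op() is a hypothetical helper, "+" and "-" are assumed to add and
remove the listed events, and "_" as a flag is taken to mean "no
events".

```python
def apply_op(current, op, flags=""):
    """Return the new set of assigned events ('t', 'l') after one op-code."""
    current = set(current)
    requested = set(flags) - {"_"}  # "_" as a flag means "no events"
    if not requested <= {"t", "l"}:
        raise ValueError(f"unknown flags: {flags!r}")
    if op == "=":
        return requested            # replace the assignment outright
    if op == "+":
        return current | requested  # assign the listed events as well
    if op == "-":
        return current - requested  # unassign only the listed events
    if op == "_":
        return set()                # unassign everything
    raise ValueError(f"unknown op-code: {op!r}")


both = {"t", "l"}
# Under this reading, the "_" op-code and "=_" give the same result.
print(apply_op(both, "_") == apply_op(both, "=", "_"))  # prints True
```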

2024-05-02 18:12:11

by Moger, Babu

Subject: Re: [RFC PATCH v3 17/17] x86/resctrl: Introduce interface to modify assignment states of the groups

Hi Dave,


On 5/2/24 12:52, Reinette Chatre wrote:
> Hi Dave,
>
> On 5/2/2024 9:21 AM, Dave Martin wrote:
>> On Thu, Mar 28, 2024 at 08:06:50PM -0500, Babu Moger wrote:
>>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>>> index 2d96565501ab..64ec70637c66 100644
>>> --- a/Documentation/arch/x86/resctrl.rst
>>> +++ b/Documentation/arch/x86/resctrl.rst
>>> @@ -328,6 +328,77 @@ with the following files:
>>> None of events are assigned on this mon group. This is a child
>>> monitor group of the non default control mon group.
>>>
>>> + Assignment state can be updated by writing to this interface.
>>> +
>>> + NOTE: Assignment on one domain applied on all the domains. User can
>>> + pass one valid domain and assignment will be updated on all the
>>> + available domains.
>>> +
>>> + Format is similar to the list format with addition of op-code for the
>>> + assignment operation.
>>> +
>>> + * Default CTRL_MON group:
>>> + "//<domain_id><op-code><assignment_flags>"
>>> +
>>> + * Non-default CTRL_MON group:
>>> + "<CTRL_MON group>//<domain_id><op-code><assignment_flags>"
>>> +
>>> + * Child MON group of default CTRL_MON group:
>>> + "/<MON group>/<domain_id><op-code><assignment_flags>"
>>> +
>>> + * Child MON group of non-default CTRL_MON group:
>>> + "<CTRL_MON group>/<MON group>/<domain_id><op-code><assignment_flags>"
>>
>> The final bullet seems to cover everything, if we allow <CTRL_MON group>
>> and <MON group> to be independently empty strings to indicate the
>> default control and/or monitoring group respectively.
>>
>> Would that be simpler than treating this as four separate cases?

That is correct. I will add a generic format before this description and
then add these 4 cases. That way it will be clearer.


>>
>> Also, will this go wrong if someone creates a resctrl group with '\n'
>> (i.e., a newline character) in the name?
>
> There is a check for this in rdtgroup_mkdir().
>
>>
>>> +
>>> + Op-code can be one of the following:
>>> + ::
>>> +
>>> + = Update the assignment to match the flags
>>> + + Assign a new state
>>> + - Unassign a new state
>>> + _ Unassign all the states
>>
>> I can't remember whether I already asked this, but is "_" really
>> needed here?
>
> Asked twice:
> https://lore.kernel.org/lkml/[email protected]/
> https://lore.kernel.org/lkml/[email protected]/
>
> Answered:
> https://lore.kernel.org/lkml/[email protected]/
>
> You seemed ok with answer then:
> https://lore.kernel.org/lkml/[email protected]/
>
>>
>> Wouldn't it be the case that
>>
>> //*_
>>
>> would mean just the same thing as
>>
>> //*=_
>>
>> ...? (assuming the "*" = "all domains" convention already discussed)
>>
>> Maybe I'm missing something here.
>
> I believe have an explicit operator ("+", "=", or "-") simplifies
> parsing while providing an interface consistent with what users are already
> used to.
>
> Reinette

--
Thanks
Babu Moger

2024-05-02 20:15:11

by Moger, Babu

Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Peter,

On 5/2/24 12:50, Peter Newman wrote:
> Hi Babu,
>
> On Thu, May 2, 2024 at 9:25 AM Moger, Babu <[email protected]> wrote:
>> On 5/1/24 12:48, Peter Newman wrote:
>>> The FS layer is informed by the arch layer (through rdt_resource
>>> fields) how many assignable monitors are available and whether a
>>> monitor is assigned to an entire group or a single event in a group.
>>> Also, the FS layer can assume that monitors are indexed contiguously,
>>> allowing it to host the data structures managing FS-level view of
>>> monitor usage.
>>>
>>> I used the following resctrl_arch-interfaces to propagate assignments
>>> to the implementation:
>>>
>>> void resctrl_arch_assign_monitor(struct rdt_domain *d, u32 mon_id, u32
>>> closid, u32 rmid, int evtid);
>>
>> Sure. I can add these in next version.
>>
>> Few comments..
>>
>> AMD does not need closid for assignment. I assume ARM requires closid.
>
> Correct, MPAM needs a CLOSID+RMID (PARTID+PMG) to identify a
> monitoring group. The CLOSID parameter is ignored on x86.
>
>>
>> What is mon_id here?
>
> On ABMC, the value is programmed into L3_QOS_ABMC_CFG.CtrID

ok.

>
>
>>
>>> void resctrl_arch_unassign_monitor(struct rdt_domain *d, u32 mon_id);
>>
>> We need rmid and evtid for unassign interface here.
>
> From my reading of the ABMC specification, it does not look necessary
> to program BwSrc or BwType when changing L3_QOS_ABMC_CFG.CtrEn to 0
> for a particular CtrID. This interface only disables a counter, so it
> should not need to know about how it was previously used when assign
> is able to reassign, as assign will always reset the arch_mbm data.

Yes. That is correct. We may not need to set BwSrc or BwType for unassign.

But, we need evtid to update the monitor state of the rdtgroup.
>
> I do not see any harm in the arch_mbm data being stale while the
> counter is unassigned, because the data is not accessed when reading
> the hardware counter fails. In general, resctrl_arch_rmid_read()
> cannot return any information if the hardware counter is not readable
> at the time it is called.

Ok. Let me check about keeping the stale arch_mbm data after unassign.
It may be okay.
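
For illustration only (this is not code from the series): a minimal userspace sketch of the behaviour discussed above, where unassign only disables a counter and a later assign resets the saved state. The names cntr_state, cntr_assign(), cntr_unassign() and cntr_read() are hypothetical, and prev_count merely stands in for the arch_mbm data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_CNTRS 32	/* assumption: 32 counters, as mentioned in this thread */

/* Hypothetical per-counter bookkeeping; not a structure from the series. */
struct cntr_state {
	bool enabled;
	uint32_t rmid;
	int evtid;
	uint64_t prev_count;	/* stands in for the arch_mbm data */
};

static struct cntr_state cntrs[NUM_CNTRS];

/* Assign (or reassign) a counter; the saved state is always reset. */
static void cntr_assign(uint32_t id, uint32_t rmid, int evtid)
{
	assert(id < NUM_CNTRS);
	cntrs[id] = (struct cntr_state){
		.enabled = true, .rmid = rmid, .evtid = evtid, .prev_count = 0,
	};
}

/* Unassign only disables the counter; rmid, evtid and prev_count go stale. */
static void cntr_unassign(uint32_t id)
{
	assert(id < NUM_CNTRS);
	cntrs[id].enabled = false;
}

/* Reads fail while the counter is disabled, so stale state is never used. */
static bool cntr_read(uint32_t id, uint64_t hw_count, uint64_t *delta)
{
	assert(id < NUM_CNTRS);
	if (!cntrs[id].enabled)
		return false;
	*delta = hw_count - cntrs[id].prev_count;
	cntrs[id].prev_count = hw_count;
	return true;
}
```

In this sketch the evtid of the last assignment stays recorded in the table after unassign, so a caller holding only the counter id could still look it up.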


>
>>
>>
>>>
>>> I chose to allow reassigning an assigned monitor without calling
>>> unassign first. This is important when monitors are unassigned and
>>> assigned in a single write to mbm_assign_control, as it allows all
>>> updates to be performed in a single round of parallel IPIs to the
>>> domains.
>>
>> Yes. It is not required to call unassign before assign. Hardware(AMD)
>> supports it.
>>
>> But, we only have 32 counters. We need to know which counter we are going
>> to use for assignment. If all the counters already assigned, then we can't
>> figure out the counter id without calling unassign first. Using the random
>> counter will overwrite the already assigned counter.
>
> I made the caller of resctrl_arch_assign_monitor() responsible for
> selecting which monitor to assign. As long as the user orders the
> unassign operations before the assign operations in a write to
> mbm_assign_control, the FS code will be able to find an available
> monitor ID.

How does resctrl_arch_assign_monitor() select the monitor id (or
counter id) if all of them are already assigned?

In this series the monitor ids are allocated using assign_cntrs_alloc().
rdtgroup_assign_abmc() calls assign_cntrs_alloc() to get a monitor id and
reports an error if it cannot get a free one.

The expectation is that the user unassigns an event from another group (or
the same group) before calling assign.
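
For illustration only: a minimal userspace sketch of a counter allocator that fails once every counter is in use, which is why an unassign has to come before the next assign. This is not the assign_cntrs_alloc() implementation from the series; cntr_alloc(), cntr_free() and the free_cntrs bitmap are hypothetical names, and the count of 32 is taken from the discussion above.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CNTRS 32	/* assumption: 32 assignable counters */

/* Hypothetical free-counter bitmap: bit n set means counter n is free. */
static uint32_t free_cntrs = UINT32_MAX;

/* Return a free counter id, or -1 when every counter is already assigned. */
static int cntr_alloc(void)
{
	for (int id = 0; id < NUM_CNTRS; id++) {
		if (free_cntrs & (UINT32_C(1) << id)) {
			free_cntrs &= ~(UINT32_C(1) << id);
			return id;
		}
	}
	return -1;
}

/* Give a counter back so that a later assign can reuse it. */
static void cntr_free(int id)
{
	assert(id >= 0 && id < NUM_CNTRS);
	free_cntrs |= UINT32_C(1) << id;
}
```

With all 32 counters handed out, cntr_alloc() returns -1 until some counter is released with cntr_free(), mirroring the "unassign first, then assign" ordering.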

Are you expecting something else here?

>
>
>>> I chose to make this a mount option to simplify the management of the
>>> monitor tracking data structures. They are simply allocated at mount
>>> time and deallocated at unmount.
>>
>> Initially I added it as an mount option.
>> Based on our earlier discussion, we decided to use the assign feature by
>> default if hardware supports it. Users don't have to worry about the details.
>>>
>>> I called the option "mon_assign": The mount option parser calls
>>> resctrl_arch_mon_assign_enable() to determine whether the
>>> implementation supports assignment in some form. If it returns an
>>> error, the mount fails. When successful, the assignable monitor count
>>> is made non-zero in the appropriate rdt_resource, triggering the
>>> behavior change in the FS layer.
>>>
>>> I'm still not sure if it's a good idea to enable monitor assignment by
>>> default. This would be a major disruption in the MBM usage model
>>> triggered by moving software between AMD CPU models. I thought the
>>
>> Why will it be a disruption? Why do you think mount option will solve the
>> problem? As always, there will be option to go back to legacy mode. right?
>>
>>> safest option was to disallow creating more monitoring groups than
>>> monitors unless the option is selected. Given that nobody else
>>
>> Current code allows to create more groups, but it will report "Monitor
>> assignment failed" when it runs out of monitors.
>
> Ok that should be fine then.
>
> However, I don't think it's necessary to support dynamically changing
> the usage model of monitoring groups without remounting. I believe it
> makes it more difficult for the FS code to generically manage monitor
> assignment.

Are you suggesting enabling ABMC by default when available, and then
providing a mount option to switch back to legacy mode?
I am fine with that if we all agree on it.
--
Thanks
Babu Moger

2024-05-02 23:00:37

by Reinette Chatre

[permalink] [raw]
Subject: Re: [RFC PATCH v3 17/17] x86/resctrl: Introduce interface to modify assignment states of the groups

Hi Babu,

On 4/17/2024 3:52 PM, Moger, Babu wrote:
> Hi Peter,
>
> On 4/17/2024 3:56 PM, Peter Newman wrote:
>> Hi Babu,
>>
>> On Wed, Apr 17, 2024 at 12:39 PM Moger, Babu <[email protected]> wrote:
>>> On 4/17/24 12:45, Peter Newman wrote:
>>>> On Thu, Mar 28, 2024 at 6:10 PM Babu Moger <[email protected]> wrote:
>>>>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>>>>> index 2d96565501ab..64ec70637c66 100644
>>>>> --- a/Documentation/arch/x86/resctrl.rst
>>>>> +++ b/Documentation/arch/x86/resctrl.rst
>>>>> @@ -328,6 +328,77 @@ with the following files:
>>>>>           None of events are assigned on this mon group. This is a child
>>>>>           monitor group of the non default control mon group.
>>>>>
>>>>> +       Assignment state can be updated by writing to this interface.
>>>>> +
>>>>> +       NOTE: Assignment on one domain applied on all the domains. User can
>>>>> +       pass one valid domain and assignment will be updated on all the
>>>>> +       available domains.
>>>> How would different assignments to different domains work? If the
>>>> allocations are global, then the allocated monitor ID is available to
>>>> all domains whether they use it or not.
>>> That is correct.
>>> [A] Hardware counters(max 2 per group) are allocated at the group level.
>>> So, those counters are available to all the domains on that group. I will
>>> maintain a bitmap at the domain level. The bitmap will be set on the
>>> domains where assignment is applied and IPIs are sent. IPIs will not be
>>> sent to other domains.
>> Unless the monitor allocation is scoped at the domain level, I don't
>> see much point in implementing the per-domain parsing today, as the
>> only benefit is avoiding IPIs to domains whose counters you don't plan
>> to read.
>
> In that case lets remove the domain specific assignments. We can avoid some code complexity.
>

As I understand counters are scoped at the domain level and it is
an implementation choice to make the allocation global. (Similar to
the decision to make CLOSIDs global.)

Could you please elaborate how you plan to remove domain specific
assignments? I do think it needs to remain as part of the user interface
so I wonder if this may look like only "*=<flags>" is supported on
these systems and attempting to assign an individual domain may fail
with "not supported".

Reinette


2024-05-02 23:22:23

by Reinette Chatre

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Peter and Babu,

On 5/2/2024 1:14 PM, Moger, Babu wrote:
> On 5/2/24 12:50, Peter Newman wrote:
>> On Thu, May 2, 2024 at 9:25 AM Moger, Babu <[email protected]> wrote:
>>> On 5/1/24 12:48, Peter Newman wrote:
..

>>>> I chose to make this a mount option to simplify the management of the
>>>> monitor tracking data structures. They are simply allocated at mount
>>>> time and deallocated at unmount.
>>>
>>> Initially I added it as an mount option.
>>> Based on our earlier discussion, we decided to use the assign feature by
>>> default if hardware supports it. Users don't have to worry about the details.
>>>>
>>>> I called the option "mon_assign": The mount option parser calls
>>>> resctrl_arch_mon_assign_enable() to determine whether the
>>>> implementation supports assignment in some form. If it returns an
>>>> error, the mount fails. When successful, the assignable monitor count
>>>> is made non-zero in the appropriate rdt_resource, triggering the
>>>> behavior change in the FS layer.
>>>>
>>>> I'm still not sure if it's a good idea to enable monitor assignment by
>>>> default. This would be a major disruption in the MBM usage model
>>>> triggered by moving software between AMD CPU models. I thought the
>>>
>>> Why will it be a disruption? Why do you think mount option will solve the
>>> problem? As always, there will be option to go back to legacy mode. right?
>>>
>>>> safest option was to disallow creating more monitoring groups than
>>>> monitors unless the option is selected. Given that nobody else
>>>
>>> Current code allows to create more groups, but it will report "Monitor
>>> assignment failed" when it runs out of monitors.
>>
>> Ok that should be fine then.
>>
>> However, I don't think it's necessary to support dynamically changing
>> the usage model of monitoring groups without remounting. I believe it
>> makes it more difficult for the FS code to generically manage monitor
>> assignment.
>
> Are you suggesting to enable ABMC by default when available?

I do think ABMC should be enabled by default when available and it looks
to be what this series aims to do [1]. The way I reason about this is
that legacy user space gets more reliable monitoring behavior without
needing to change behavior.

I thought there was discussion about communicating to user space
when an attempt is made to read data from an event that does not
have a counter assigned. Something like below but I did not notice this
in this series.

# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
Unassigned
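
For illustration only: a minimal sketch of how a userspace reader might tell these cases apart. "Unassigned" is only the string proposed above and is not implemented in this series; "Unavailable" is the existing string for inactive RMIDs. The names mbm_status and mbm_parse() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

enum mbm_status { MBM_OK, MBM_UNASSIGNED, MBM_UNAVAILABLE, MBM_INVALID };

/*
 * Classify one line read from an mbm_*_bytes file.
 * On MBM_OK the byte count is stored in *bytes; otherwise *bytes is untouched.
 */
static enum mbm_status mbm_parse(const char *line, unsigned long long *bytes)
{
	char buf[64];
	char *end;
	size_t len;
	unsigned long long val;

	assert(line != NULL && bytes != NULL);

	/* Work on a copy with the trailing newline removed. */
	len = strcspn(line, "\n");
	if (len == 0 || len >= sizeof(buf))
		return MBM_INVALID;
	memcpy(buf, line, len);
	buf[len] = '\0';

	if (strcmp(buf, "Unassigned") == 0)
		return MBM_UNASSIGNED;
	if (strcmp(buf, "Unavailable") == 0)
		return MBM_UNAVAILABLE;

	/* Anything else must be a plain decimal number. */
	if (buf[0] < '0' || buf[0] > '9')
		return MBM_INVALID;
	errno = 0;
	val = strtoull(buf, &end, 10);
	if (errno != 0 || *end != '\0')
		return MBM_INVALID;
	*bytes = val;
	return MBM_OK;
}
```

A monitoring tool could use the distinct MBM_UNASSIGNED result to decide that it needs to assign a counter, rather than treating the read as a transient failure.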

>
> Then provide the mount option switch back to legacy mode?
> I am fine with that if we all agree on that.

Why is a mount option needed? I think we should avoid requiring a remount
unless required and I would like to understand why it is required here.

Peter: could you please elaborate what you mean with it makes it more
difficult for the FS code to generically manage monitor assignment?

Why would user space be required to recreate all control and monitor
groups if wanting to change how memory bandwidth monitoring is done?

From this implementation it has been difficult to understand the impact
of switching between ABMC and legacy.

Reinette

[1] https://lore.kernel.org/lkml/e898059f3c182886b1c16353be7db76d9b852b02.1711674410.git.babu.moger@amd.com/

2024-05-03 01:08:28

by Peter Newman

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Reinette,

On Thu, May 2, 2024 at 4:21 PM Reinette Chatre
<[email protected]> wrote:
>
> Hi Peter and Babu,
>
> On 5/2/2024 1:14 PM, Moger, Babu wrote:
> > Are you suggesting to enable ABMC by default when available?
>
> I do think ABMC should be enabled by default when available and it looks
> to be what this series aims to do [1]. The way I reason about this is
> that legacy user space gets more reliable monitoring behavior without
> needing to change behavior.

I don't like that for a monitor assignment-aware user, following the
creation of new monitoring groups, there will be less monitors
available for assignment. If the user wants precise control over where
monitors are allocated, they would need to manually unassign the
automatically-assigned monitor after creating new groups.

It's an annoyance, but I'm not sure if it would break any realistic
usage model. Maybe if the monitoring agent operates independently of
whoever creates monitoring groups it could result in brief periods
where less monitors than expected are available because whoever just
created a new monitoring group hasn't given the automatically-assigned
monitors back yet.

>
> I thought there was discussion about communicating to user space
> when an attempt is made to read data from an event that does not
> have a counter assigned. Something like below but I did not notice this
> in this series.
>
> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> Unassigned
>
> >
> > Then provide the mount option switch back to legacy mode?
> > I am fine with that if we all agree on that.
>
> Why is a mount option needed? I think we should avoid requiring a remount
> unless required and I would like to understand why it is required here.
>
> Peter: could you please elaborate what you mean with it makes it more
> difficult for the FS code to generically manage monitor assignment?
>
> Why would user space be required to recreate all control and monitor
> groups if wanting to change how memory bandwidth monitoring is done?

I was looking at this more from the perspective of whether it's
necessary to support the live transition of the groups' configuration
back and forth between programming models. I find it very unlikely
for the userspace controller software to change its mind about the
programming model for monitoring in a running system, so I thought
this would be in the same category as choosing at mount time whether
or not to use CDP or the MBA software controller.

Also, in the software implementation of monitor assignment for older
AMD processors, which is based on allocating a subset of RMIDs, I'm
concerned that the context switch handler would want to read the
monitors associated with the incoming thread's current group to
determine whether it should use one of the tracked RMIDs. I believe it
would be cleaner if the lifetime of the generic monitor-tracking
structures would last until the static branches gating
__resctrl_sched_in() could be disabled.

>
> From this implementation it has been difficult to understand the impact
> of switching between ABMC and legacy.

I'll see if there's a good way to share my software monitor assignment
prototype so it's clearer how the user interface would interact with
diverse implementations. Unfortunately, it's difficult to see the
required abstraction boundaries without the fs/resctrl refactoring
changes[1] applied. It would also require my changes[2] for reading a
thread's RMID from the FS structures to prevent monitor assignments
from forcing an update of all task_structs in the system.

-Peter

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/

2024-05-03 14:53:49

by Dave Martin

[permalink] [raw]
Subject: Re: [RFC PATCH v3 17/17] x86/resctrl: Introduce interface to modify assignment states of the groups

On Thu, May 02, 2024 at 10:52:15AM -0700, Reinette Chatre wrote:
> Hi Dave,
>
> On 5/2/2024 9:21 AM, Dave Martin wrote:
> > On Thu, Mar 28, 2024 at 08:06:50PM -0500, Babu Moger wrote:
> >> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> >> index 2d96565501ab..64ec70637c66 100644
> >> --- a/Documentation/arch/x86/resctrl.rst
> >> +++ b/Documentation/arch/x86/resctrl.rst
> >> @@ -328,6 +328,77 @@ with the following files:
> >> None of events are assigned on this mon group. This is a child
> >> monitor group of the non default control mon group.
> >>
> >> + Assignment state can be updated by writing to this interface.
> >> +
> >> + NOTE: Assignment on one domain applied on all the domains. User can
> >> + pass one valid domain and assignment will be updated on all the
> >> + available domains.
> >> +
> >> + Format is similar to the list format with addition of op-code for the
> >> + assignment operation.
> >> +
> >> + * Default CTRL_MON group:
> >> + "//<domain_id><op-code><assignment_flags>"
> >> +
> >> + * Non-default CTRL_MON group:
> >> + "<CTRL_MON group>//<domain_id><op-code><assignment_flags>"
> >> +
> >> + * Child MON group of default CTRL_MON group:
> >> + "/<MON group>/<domain_id><op-code><assignment_flags>"
> >> +
> >> + * Child MON group of non-default CTRL_MON group:
> >> + "<CTRL_MON group>/<MON group>/<domain_id><op-code><assignment_flags>"
> >
> > The final bullet seems to cover everything, if we allow <CTRL_MON group>
> > and <MON group> to be independently empty strings to indicate the
> > default control and/or monitoring group respectively.
> >
> > Would that be simpler than treating this as four separate cases?
> >
> > Also, will this go wrong if someone creates a resctrl group with '\n'
> > (i.e., a newline character) in the name?
>
> There is a check for this in rdtgroup_mkdir().

Ah, right. Found it. I guess that works.

On a (sort of) related point, are there any concerns about namespace
clashes in resctrlfs? This looks like a potential issue for the resctrl
top-level directory at least.

It's not clear to me how userspace can pick a name for a resctrl group
that is guaranteed not to clash with the name of one of resctrl's own
files in a future kernel.

(Note, this has nothing to do with this series; I haven't been sure where to
fit this into the discussion...)

>
> >
> >> +
> >> + Op-code can be one of the following:
> >> + ::
> >> +
> >> + = Update the assignment to match the flags
> >> + + Assign a new state
> >> + - Unassign a new state
> >> + _ Unassign all the states
> >
> > I can't remember whether I already asked this, but is "_" really
> > needed here?
>
> Asked twice:
> https://lore.kernel.org/lkml/[email protected]/
> https://lore.kernel.org/lkml/[email protected]/
>
> Answered:
> https://lore.kernel.org/lkml/[email protected]/
>
> You seemed ok with answer then:
> https://lore.kernel.org/lkml/[email protected]/

There, I was asking about "_" meaning "no flags" in "=_".

>
> >
> > Wouldn't it be the case that
> >
> > //*_
> >
> > would mean just the same thing as
> >
> > //*=_
> >
> > ...? (assuming the "*" = "all domains" convention already discussed)
> >
> > Maybe I'm missing something here.
>
> I believe have an explicit operator ("+", "=", or "-") simplifies
> parsing while providing an interface consistent with what users are already
> used to.
>
> Reinette

That was the point I was trying to make here, apologies if I wasn't
clear.

Cheers
---Dave

2024-05-03 16:18:29

by Moger, Babu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 17/17] x86/resctrl: Introduce interface to modify assignment states of the groups

Hi Reinette,

Email issues. Responding again..

On 5/2/2024 6:00 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 4/17/2024 3:52 PM, Moger, Babu wrote:
>> Hi Peter,
>>
>> On 4/17/2024 3:56 PM, Peter Newman wrote:
>>> Hi Babu,
>>>
>>> On Wed, Apr 17, 2024 at 12:39 PM Moger, Babu <[email protected]> wrote:
>>>> On 4/17/24 12:45, Peter Newman wrote:
>>>>> On Thu, Mar 28, 2024 at 6:10 PM Babu Moger <[email protected]> wrote:
>>>>>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>>>>>> index 2d96565501ab..64ec70637c66 100644
>>>>>> --- a/Documentation/arch/x86/resctrl.rst
>>>>>> +++ b/Documentation/arch/x86/resctrl.rst
>>>>>> @@ -328,6 +328,77 @@ with the following files:
>>>>>>           None of events are assigned on this mon group. This is a child
>>>>>>           monitor group of the non default control mon group.
>>>>>>
>>>>>> +       Assignment state can be updated by writing to this interface.
>>>>>> +
>>>>>> +       NOTE: Assignment on one domain applied on all the domains. User can
>>>>>> +       pass one valid domain and assignment will be updated on all the
>>>>>> +       available domains.
>>>>> How would different assignments to different domains work? If the
>>>>> allocations are global, then the allocated monitor ID is available to
>>>>> all domains whether they use it or not.
>>>> That is correct.
>>>> [A] Hardware counters(max 2 per group) are allocated at the group level.
>>>> So, those counters are available to all the domains on that group. I will
>>>> maintain a bitmap at the domain level. The bitmap will be set on the
>>>> domains where assignment is applied and IPIs are sent. IPIs will not be
>>>> sent to other domains.
>>> Unless the monitor allocation is scoped at the domain level, I don't
>>> see much point in implementing the per-domain parsing today, as the
>>> only benefit is avoiding IPIs to domains whose counters you don't plan
>>> to read.
>>
>> In that case lets remove the domain specific assignments. We can avoid some code complexity.
>>
>
> As I understand counters are scoped at the domain level and it is
> an implementation choice to make the allocation global. (Similar to
> the decision to make CLOSIDs global.)
>
> Could you please elaborate how you plan to remove domain specific
> assignments? I do think it needs to remain as part of the user interface
> so I wonder if this may look like only "*=<flags>" is supported on
> these systems and attempting to assign an individual domain may fail
> with "not supported".

This series applies the assignment to all the domains.

For example:

# echo "//0=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

Here the user wants to assign a monitor to the total event on domain 0.
But this series applies the assignment to all the domains in the system, so
IPIs will be sent to all the domains.
Basically this is equivalent to

# echo "//*=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
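
For illustration only: a minimal userspace sketch of splitting one entry of the "<CTRL_MON group>/<MON group>/<domain_id><op-code><assignment_flags>" format, treating empty group names as the default groups and "*" as all domains. This is not the parser from the series; struct assign_req, parse_assign() and ALL_DOMAINS are hypothetical names, and only the explicit "=", "+" and "-" operators are accepted here.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define ALL_DOMAINS (-1)

struct assign_req {
	char ctrl[64];	/* empty string: default CTRL_MON group */
	char mon[64];	/* empty string: the CTRL_MON group itself */
	int domain;	/* domain id, or ALL_DOMAINS for "*" */
	char op;	/* '=', '+' or '-' */
	char flags[16];	/* e.g. "t", or "_" for no flags */
};

/* Copy src[0..len) into dst as a NUL-terminated string; fail if too long. */
static int copy_field(char *dst, size_t cap, const char *src, size_t len)
{
	if (len >= cap)
		return -1;
	memcpy(dst, src, len);
	dst[len] = '\0';
	return 0;
}

/* Parse one "<ctrl>/<mon>/<domain><op><flags>" entry. Returns 0 or -1. */
static int parse_assign(const char *s, struct assign_req *r)
{
	const char *s1, *s2, *p;
	char *end;
	long dom;
	size_t len;

	assert(s != NULL && r != NULL);

	/* The two '/' separators delimit the (possibly empty) group names. */
	s1 = strchr(s, '/');
	if (!s1)
		return -1;
	s2 = strchr(s1 + 1, '/');
	if (!s2)
		return -1;
	if (copy_field(r->ctrl, sizeof(r->ctrl), s, (size_t)(s1 - s)) ||
	    copy_field(r->mon, sizeof(r->mon), s1 + 1, (size_t)(s2 - s1 - 1)))
		return -1;

	/* Domain: "*" or a decimal id. */
	p = s2 + 1;
	if (*p == '*') {
		r->domain = ALL_DOMAINS;
		p++;
	} else {
		if (*p < '0' || *p > '9')
			return -1;
		dom = strtol(p, &end, 10);
		if (dom > INT_MAX)
			return -1;
		r->domain = (int)dom;
		p = end;
	}

	/* An explicit operator is required before the flags. */
	if (*p != '=' && *p != '+' && *p != '-')
		return -1;
	r->op = *p++;

	len = strcspn(p, "\n");
	if (len == 0)
		return -1;
	return copy_field(r->flags, sizeof(r->flags), p, len);
}
```

With this sketch, "//0=t" and "//*=t" differ only in the parsed domain field, so treating them as equivalent (as this series does) amounts to ignoring that field when applying the assignment.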


I was thinking of adding domain-specific assignment in the next version.
That involves adding a new field in rdt_domain to keep track of assignment.
Peter suggested it may not add much value for his usage model.
Thanks
- Babu Moger

2024-05-03 20:46:54

by Moger, Babu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Peter,

On 5/2/2024 7:57 PM, Peter Newman wrote:
> Hi Reinette,
>
> On Thu, May 2, 2024 at 4:21 PM Reinette Chatre
> <[email protected]> wrote:
>>
>> Hi Peter and Babu,
>>
>> On 5/2/2024 1:14 PM, Moger, Babu wrote:
>>> Are you suggesting to enable ABMC by default when available?
>>
>> I do think ABMC should be enabled by default when available and it looks
>> to be what this series aims to do [1]. The way I reason about this is
>> that legacy user space gets more reliable monitoring behavior without
>> needing to change behavior.
>
> I don't like that for a monitor assignment-aware user, following the
> creation of new monitoring groups, there will be less monitors
> available for assignment. If the user wants precise control over where
> monitors are allocated, they would need to manually unassign the
> automatically-assigned monitor after creating new groups.
>
> It's an annoyance, but I'm not sure if it would break any realistic
> usage model. Maybe if the monitoring agent operates independently of

Yes, it is an annoyance.

But if you think about it, normal users don't create too many groups.
They won't have to worry about the assign/unassign headache if we enable
monitor assignment automatically. Also, there is the pqos tool, which uses
this interface. It does not have to know about assign/unassign.


> whoever creates monitoring groups it could result in brief periods
> where less monitors than expected are available because whoever just
> created a new monitoring group hasn't given the automatically-assigned
> monitors back yet.
>
>>
>> I thought there was discussion about communicating to user space
>> when an attempt is made to read data from an event that does not
>> have a counter assigned. Something like below but I did not notice this
>> in this series.
>>
>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> Unassigned
>>
>>>
>>> Then provide the mount option switch back to legacy mode?
>>> I am fine with that if we all agree on that.
>>
>> Why is a mount option needed? I think we should avoid requiring a remount
>> unless required and I would like to understand why it is required here.
>>
>> Peter: could you please elaborate what you mean with it makes it more
>> difficult for the FS code to generically manage monitor assignment?
>>
>> Why would user space be required to recreate all control and monitor
>> groups if wanting to change how memory bandwidth monitoring is done?
>
> I was looking at this more from the perspective of whether it's
> necessary to support the live transition of the groups' configuration
> back and forth between programming models. I find it very unlikely
> for the userspace controller software to change its mind about the
> programming model for monitoring in a running system, so I thought
> this would be in the same category as choosing at mount time whether
> or not to use CDP or the MBA software controller.

A good point about the mount option is that we don't create extra files for
monitor assignment in /sys/fs/resctrl when we mount with the legacy option.

>
> Also, in the software implementation of monitor assignment for older
> AMD processors, which is based on allocating a subset of RMIDs, I'm
> concerned that the context switch handler would want to read the
> monitors associated with the incoming thread's current group to
> determine whether it should use one of the tracked RMIDs. I believe it
> would be cleaner if the lifetime of the generic monitor-tracking
> structures would last until the static branches gating
> __resctrl_sched_in() could be disabled.
>
>>
>> From this implementation it has been difficult to understand the impact
>> of switching between ABMC and legacy.
>
> I'll see if there's a good way to share my software monitor assignment
> prototype so it's clearer how the user interface would interact with
> diverse implementations. Unfortunately, it's difficult to see the
> required abstraction boundaries without the fs/resctrl refactoring
> changes[1] applied. It would also require my changes[2] for reading a
> thread's RMID from the FS structures to prevent monitor assignments
> from forcing an update of all task_structs in the system.
>
> -Peter
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/lkml/[email protected]/
>

--
- Babu Moger

2024-05-03 21:00:42

by Peter Newman

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On Fri, May 3, 2024 at 1:44 PM Moger, Babu <[email protected]> wrote:
>
> Hi Peter,
>
> On 5/2/2024 7:57 PM, Peter Newman wrote:
> > Hi Reinette,
> >
> > On Thu, May 2, 2024 at 4:21 PM Reinette Chatre
> >> I do think ABMC should be enabled by default when available and it looks
> >> to be what this series aims to do [1]. The way I reason about this is
> >> that legacy user space gets more reliable monitoring behavior without
> >> needing to change behavior.
> >
> > I don't like that for a monitor assignment-aware user, following the
> > creation of new monitoring groups, there will be less monitors
> > available for assignment. If the user wants precise control over where
> > monitors are allocated, they would need to manually unassign the
> > automatically-assigned monitor after creating new groups.
> >
> > It's an annoyance, but I'm not sure if it would break any realistic
> > usage model. Maybe if the monitoring agent operates independently of
>
> Yes. Its annoyance.
>
> But if you think about it, normal users don't create too many groups.
> They wont have to worry about assign/unassign headache if we enable
> monitor assignment automatically. Also there is pqos tool which uses
> this interface. It does not have to know about assign/unassign stuff.

Thinking about this again, I don't think it's much of a concern
because the automatic assignment on mongroup creation behavior can be
trivially disabled using a boolean flag.

-Peter

2024-05-03 21:16:01

by Reinette Chatre

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Peter,

On 5/2/2024 5:57 PM, Peter Newman wrote:
> Hi Reinette,
>
> On Thu, May 2, 2024 at 4:21 PM Reinette Chatre
> <[email protected]> wrote:
>>
>> Hi Peter and Babu,
>>
>> On 5/2/2024 1:14 PM, Moger, Babu wrote:
>>> Are you suggesting to enable ABMC by default when available?
>>
>> I do think ABMC should be enabled by default when available and it looks
>> to be what this series aims to do [1]. The way I reason about this is
>> that legacy user space gets more reliable monitoring behavior without
>> needing to change behavior.
>
> I don't like that for a monitor assignment-aware user, following the
> creation of new monitoring groups, there will be less monitors
> available for assignment. If the user wants precise control over where
> monitors are allocated, they would need to manually unassign the
> automatically-assigned monitor after creating new groups.
>
> It's an annoyance, but I'm not sure if it would break any realistic
> usage model. Maybe if the monitoring agent operates independently of
> whoever creates monitoring groups it could result in brief periods
> where less monitors than expected are available because whoever just
> created a new monitoring group hasn't given the automatically-assigned
> monitors back yet.
>

I will respond in other thread.


>>
>> I thought there was discussion about communicating to user space
>> when an attempt is made to read data from an event that does not
>> have a counter assigned. Something like below but I did not notice this
>> in this series.
>>
>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> Unassigned
>>
>>>
>>> Then provide the mount option switch back to legacy mode?
>>> I am fine with that if we all agree on that.
>>
>> Why is a mount option needed? I think we should avoid requiring a remount
>> unless required and I would like to understand why it is required here.
>>
>> Peter: could you please elaborate what you mean with it makes it more
>> difficult for the FS code to generically manage monitor assignment?
>>
>> Why would user space be required to recreate all control and monitor
>> groups if wanting to change how memory bandwidth monitoring is done?
>
> I was looking at this more from the perspective of whether it's
> necessary to support the live transition of the groups' configuration
> back and forth between programming models. I find it very unlikely
> for the userspace controller software to change its mind about the
> programming model for monitoring in a running system, so I thought
> this would be in the same category as choosing at mount time whether
> or not to use CDP or the MBA software controller.

This seems reasonable to me if only considering ABMC and legacy. When
also taking into account soft-RMID it is no longer obvious to me. I do
still have an impression that the soft-RMID solution impacts context switch
duration so I am considering the scenario where user space may want to
use soft-RMID for portions of time to get an idea of workload behavior and
then dynamically move to less accurate measurements to not impact the
workloads all the time.

In this case perhaps more like how user space can dynamically change power
saving mode based on requirements of responsiveness etc.


> Also, in the software implementation of monitor assignment for older
> AMD processors, which is based on allocating a subset of RMIDs, I'm
> concerned that the context switch handler would want to read the
> monitors associated with the incoming thread's current group to
> determine whether it should use one of the tracked RMIDs. I believe it
> would be cleaner if the lifetime of the generic monitor-tracking
> structures would last until the static branches gating
> __resctrl_sched_in() could be disabled.

Yes, this falls under the umbrella of needing to understand the impact
of switching between mechanisms that is not obvious to me.

>
>>
>> From this implementation it has been difficult to understand the impact
>> of switching between ABMC and legacy.
>
> I'll see if there's a good way to share my software monitor assignment
> prototype so it's clearer how the user interface would interact with
> diverse implementations. Unfortunately, it's difficult to see the
> required abstraction boundaries without the fs/resctrl refactoring
> changes[1] applied. It would also require my changes[2] for reading a
> thread's RMID from the FS structures to prevent monitor assignments
> from forcing an update of all task_structs in the system.
>
> -Peter
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/lkml/[email protected]/

2024-05-03 21:16:12

by Reinette Chatre

[permalink] [raw]
Subject: Re: [RFC PATCH v3 17/17] x86/resctrl: Introduce interface to modify assignment states of the groups

Hi Dave,

On 5/3/2024 7:53 AM, Dave Martin wrote:
> On Thu, May 02, 2024 at 10:52:15AM -0700, Reinette Chatre wrote:
>> Hi Dave,
>>
>> On 5/2/2024 9:21 AM, Dave Martin wrote:
>>> On Thu, Mar 28, 2024 at 08:06:50PM -0500, Babu Moger wrote:
>>>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>>>> index 2d96565501ab..64ec70637c66 100644
>>>> --- a/Documentation/arch/x86/resctrl.rst
>>>> +++ b/Documentation/arch/x86/resctrl.rst
>>>> @@ -328,6 +328,77 @@ with the following files:
>>>> None of events are assigned on this mon group. This is a child
>>>> monitor group of the non default control mon group.
>>>>
>>>> + Assignment state can be updated by writing to this interface.
>>>> +
>>>> + NOTE: Assignment on one domain applied on all the domains. User can
>>>> + pass one valid domain and assignment will be updated on all the
>>>> + available domains.
>>>> +
>>>> + Format is similar to the list format with addition of op-code for the
>>>> + assignment operation.
>>>> +
>>>> + * Default CTRL_MON group:
>>>> + "//<domain_id><op-code><assignment_flags>"
>>>> +
>>>> + * Non-default CTRL_MON group:
>>>> + "<CTRL_MON group>//<domain_id><op-code><assignment_flags>"
>>>> +
>>>> + * Child MON group of default CTRL_MON group:
>>>> + "/<MON group>/<domain_id><op-code><assignment_flags>"
>>>> +
>>>> + * Child MON group of non-default CTRL_MON group:
>>>> + "<CTRL_MON group>/<MON group>/<domain_id><op-code><assignment_flags>"
>>>
>>> The final bullet seems to cover everything, if we allow <CTRL_MON group>
>>> and <MON group> to be independently empty strings to indicate the
>>> default control and/or monitoring group respectively.
>>>
>>> Would that be simpler than treating this as four separate cases?
>>>
>>> Also, will this go wrong if someone creates a resctrl group with '\n'
>>> (i.e., a newline character) in the name?
>>
>> There is a check for this in rdtgroup_mkdir().
>
> Ah, right. Found it. I guess that works.
>
> On a (sort of) related point, are there any concerns about namespace
> clashes in resctrlfs? This looks like a potential issue for the resctrl
> top-level directory at least.
>
> It's not clear to me how userspace can pick a name for a resctrl group
> that is guaranteed not to clash with the name of one of resctrl's own
> files in a future kernel.
>
> (Note, this is nothing to do with this series; I haven't been sure where to
> fit this into the discussion...)

It is not obvious to me what scenario you have in mind. Could you please
give an example?

>
>>
>>>
>>>> +
>>>> + Op-code can be one of the following:
>>>> + ::
>>>> +
>>>> + = Update the assignment to match the flags
>>>> + + Assign a new state
>>>> + - Unassign a new state
>>>> + _ Unassign all the states
>>>
>>> I can't remember whether I already asked this, but is "_" really
>>> needed here?
>>
>> Asked twice:
>> https://lore.kernel.org/lkml/[email protected]/
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> Answered:
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> You seemed ok with answer then:
>> https://lore.kernel.org/lkml/[email protected]/
>
> There, I was asking about "_" meaning "no flags" in "=_".

Apologies. I did not notice the difference. Yes, I agree, "_" is
not expected to be an operator.

Reinette

2024-05-03 21:16:16

by Reinette Chatre

Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Peter,

On 5/3/2024 2:00 PM, Peter Newman wrote:
> Hi Babu,
>
> On Fri, May 3, 2024 at 1:44 PM Moger, Babu <[email protected]> wrote:
>>
>> Hi Peter,
>>
>> On 5/2/2024 7:57 PM, Peter Newman wrote:
>>> Hi Reinette,
>>>
>>> On Thu, May 2, 2024 at 4:21 PM Reinette Chatre
>>>> I do think ABMC should be enabled by default when available and it looks
>>>> to be what this series aims to do [1]. The way I reason about this is
>>>> that legacy user space gets more reliable monitoring behavior without
>>>> needing to change behavior.
>>>
>>> I don't like that for a monitor assignment-aware user, following the
>>> creation of new monitoring groups, there will be fewer monitors
>>> available for assignment. If the user wants precise control over where
>>> monitors are allocated, they would need to manually unassign the
>>> automatically-assigned monitor after creating new groups.
>>>
>>> It's an annoyance, but I'm not sure if it would break any realistic
>>> usage model. Maybe if the monitoring agent operates independently of
>>
>> Yes. It's an annoyance.
>>
>> But if you think about it, normal users don't create too many groups.
>> They won't have to worry about assign/unassign headache if we enable
>> monitor assignment automatically. Also there is pqos tool which uses
>> this interface. It does not have to know about assign/unassign stuff.
>
> Thinking about this again, I don't think it's much of a concern
> because the automatic assignment on mongroup creation behavior can be
> trivially disabled using a boolean flag.

This could be a config option.

Reinette


2024-05-03 21:16:50

by Reinette Chatre

Subject: Re: [RFC PATCH v3 17/17] x86/resctrl: Introduce interface to modify assignment states of the groups

Hi Babu,

On 5/3/2024 9:14 AM, Moger, Babu wrote:
> On 5/2/2024 6:00 PM, Reinette Chatre wrote:
>> On 4/17/2024 3:52 PM, Moger, Babu wrote:
>>> On 4/17/2024 3:56 PM, Peter Newman wrote:
>>>> On Wed, Apr 17, 2024 at 12:39 PM Moger, Babu <[email protected]> wrote:
>>>>> On 4/17/24 12:45, Peter Newman wrote:
>>>>>> On Thu, Mar 28, 2024 at 6:10 PM Babu Moger <[email protected]> wrote:
>>>>>>> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>>>>>>> index 2d96565501ab..64ec70637c66 100644
>>>>>>> --- a/Documentation/arch/x86/resctrl.rst
>>>>>>> +++ b/Documentation/arch/x86/resctrl.rst
>>>>>>> @@ -328,6 +328,77 @@ with the following files:
>>>>>>>            None of events are assigned on this mon group. This is a child
>>>>>>>            monitor group of the non default control mon group.
>>>>>>>
>>>>>>> +       Assignment state can be updated by writing to this interface.
>>>>>>> +
>>>>>>> +       NOTE: Assignment on one domain applied on all the domains. User can
>>>>>>> +       pass one valid domain and assignment will be updated on all the
>>>>>>> +       available domains.
>>>>>> How would different assignments to different domains work? If the
>>>>>> allocations are global, then the allocated monitor ID is available to
>>>>>> all domains whether they use it or not.
>>>>> That is correct.
>>>>> [A] Hardware counters(max 2 per group) are allocated at the group level.
>>>>> So, those counters are available to all the domains on that group. I will
>>>>> maintain a bitmap at the domain level. The bitmap will be set on the
>>>>> domains where assignment is applied and IPIs are sent. IPIs will not be
>>>>> sent to other domains.
>>>> Unless the monitor allocation is scoped at the domain level, I don't
>>>> see much point in implementing the per-domain parsing today, as the
>>>> only benefit is avoiding IPIs to domains whose counters you don't plan
>>>> to read.
>>>
>>> In that case lets remove the domain specific assignments. We can avoid some code complexity.
>>>
>>
>> As I understand counters are scoped at the domain level and it is
>> an implementation choice to make the allocation global. (Similar to
>> the decision to make CLOSIDs global.)
>>
>> Could you please elaborate how you plan to remove domain specific
>> assignments? I do think it needs to remain as part of the user interface
>> so I wonder if this may look like only "*=<flags>" is supported on
>> these systems and attempting to assign an individual domain may fail
>> with "not supported".
>
> This series applies the assignment to all the domains.
>
> For example:
>
> # echo "//0=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
> User here wants to assign a monitor to total event on domain 0.
> But this series applies monitor to all the domains in the system. IPIs will be sent to all the domains.

I would like to recommend against this. (a) this is not what the API
says will happen, (b) behavior like this may result in users having scripts
with syntax like above expecting changes to all domains and when/if
AMD or another architecture decides to implement per-domain assignment
it will break user space.

> Basically this is equivalent to
>
> # echo "//*=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
>
> I was thinking of adding domain specific assignment in next version.
> That involves adding a new field in rdt_domain to keep track of
> assignment. Peter suggested it may not be much of a value add for his
> usage model.

I do not have insight into how all users will end up using this.

Reinette

2024-05-03 23:24:34

by Reinette Chatre

Subject: Re: [RFC PATCH v3 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On 3/28/2024 6:06 PM, Babu Moger wrote:

> a. Check if ABMC support is available
> #mount -t resctrl resctrl /sys/fs/resctrl/
>
> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign
> [abmc]
> legacy_mbm
>
> Linux kernel detected ABMC feature and it is enabled.

Please note that this adds the "abmc" feature to the resctrl
*filesystem* that supports more architectures than just AMD. Calling the
resctrl filesystem feature "abmc" means that (a) AMD needs to be ok with
other architectures calling their features that are
similar-but-maybe-not-identical-to-AMD-ABMC "abmc", or (b) this needs
a new generic name.

> b. Check how many ABMC counters are available.
>
> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_cntrs
> 32
>
> c. Create few resctrl groups.
>
> # mkdir /sys/fs/resctrl/mon_groups/default_mon1
> # mkdir /sys/fs/resctrl/non_defult_group

Can this be non_default_group instead? Seems like non_defult_group is used
consistently but its spelling is unexpected.

> # mkdir /sys/fs/resctrl/non_defult_group/mon_groups/non_default_mon1
>
> d. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> to list and modify the group's assignment states.
>
> The list follows the following format:
>
> * Default CTRL_MON group:
> "//<domain_id>=<assignment_flags>"
>
> * Non-default CTRL_MON group:
> "<CTRL_MON group>//<domain_id>=<assignment_flags>"
>
> * Child MON group of default CTRL_MON group:
> "/<MON group>/<domain_id>=<assignment_flags>"
>
> * Child MON group of non-default CTRL_MON group:
> "<CTRL_MON group>/<MON group>/<domain_id>=<assignment_flags>"
>
> Assignment flags can be one of the following:
>
> t MBM total event is assigned
> l MBM local event is assigned
> tl Both total and local MBM events are assigned
> _ None of the MBM events are assigned
>
> Examples:
>
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> non_defult_group//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> non_defult_group/non_default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> //0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
> /default_mon1/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;
>
> There are four groups and all the groups have local and total event assigned.
>
> "//" - This is a default CONTROL MON group
>
> "non_defult_group//" - This is non default CONTROL MON group
>
> "/default_mon1/" - This is Child MON group of the defult group
>
> "non_defult_group/non_default_mon1/" - This is child MON group of the non default group
>
> =tl means both total and local events are assigned.
>
> e. Update the group assignment states using the interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control.
>
> The write format is similar to the above list format with addition of
> op-code for the assignment operation.
>
> * Default CTRL_MON group:
> "//<domain_id><op-code><assignment_flags>"
>
> * Non-default CTRL_MON group:
> "<CTRL_MON group>//<domain_id><op-code><assignment_flags>"
>
> * Child MON group of default CTRL_MON group:
> "/<MON group>/<domain_id><op-code><assignment_flags>"
>
> * Child MON group of non-default CTRL_MON group:
> "<CTRL_MON group>/<MON group>/<domain_id><op-code><assignment_flags>"
>
> Op-code can be one of the following:
>
> = Update the assignment to match the flags
> + Assign a new state
> - Unassign a new state
> _ Unassign all the states

As mentioned in https://lore.kernel.org/lkml/ZjO9hpuLz%[email protected]/
the "_" is not an operator but is instead viewed as a part of <assignment_flags>.
It is expected to be used with "=", to unset flags it will be used as below:

echo "non_default_ctrl_mon_grp/child_non_default_mon_grp/0=_" ...
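
To make the intended grammar concrete, here is a minimal user-space sketch of
parsing one "<domain_id><op-code><assignment_flags>" token. It is not code from
this series: the names parse_assign_token(), struct assign_req, ASSIGN_* and
DOMAIN_ALL are hypothetical, and rejecting "_" after "+" or "-" is an
assumption about how the rule above would be enforced.

```c
#include <ctype.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag bits; illustrative names, not from the series. */
#define ASSIGN_TOTAL 0x1u
#define ASSIGN_LOCAL 0x2u
#define DOMAIN_ALL   (-1)	/* stands in for a "*" domain */

struct assign_req {
	int domain;		/* domain id, or DOMAIN_ALL */
	char op;		/* '=', '+' or '-' */
	unsigned int flags;	/* ASSIGN_* bits; 0 means "_" (no events) */
};

/*
 * Parse one token such as "0=_", "1+l" or "*=tl".
 * Returns true on success. "_" is treated as a flag value that is only
 * accepted after "=", never as an op-code.
 */
static bool parse_assign_token(const char *tok, struct assign_req *req)
{
	const char *p = tok;
	char *end;

	if (*p == '*') {
		req->domain = DOMAIN_ALL;
		p++;
	} else {
		long id;

		/* Require a digit so strtol() cannot consume a sign. */
		if (!isdigit((unsigned char)*p))
			return false;
		id = strtol(p, &end, 10);
		if (id > INT_MAX)
			return false;
		req->domain = (int)id;
		p = end;
	}

	if (*p != '=' && *p != '+' && *p != '-')
		return false;
	req->op = *p++;

	req->flags = 0;
	if (strcmp(p, "_") == 0)
		return req->op == '=';

	if (*p == '\0')
		return false;
	for (; *p; p++) {
		if (*p == 't')
			req->flags |= ASSIGN_TOTAL;
		else if (*p == 'l')
			req->flags |= ASSIGN_LOCAL;
		else
			return false;
	}
	return true;
}
```

With this sketch "0=_" parses as "set domain 0 to no events", while the "0_"
form used in one of the examples below is rejected.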

>
>
> Initial group status:
>
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> non_default_ctrl_mon_grp//0=tl;1=tl;
> non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
> //0=tl;1=tl;
> /child_default_mon_grp/0=tl;1=tl;
>
>
> To update the default group to assign only total event.
> # echo "//0=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
> Assignment status after the update:
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> non_default_ctrl_mon_grp//0=tl;1=tl;
> non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
> //0=t;1=t;
> /child_default_mon_grp/0=tl;1=tl;

As mentioned in https://lore.kernel.org/lkml/[email protected]/
using "0=t" is expected to only impact domain #0, not all domains. Similar for
other examples below.
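
As an illustration of the per-domain behavior I would expect, below is a
minimal user-space sketch. It only models the bookkeeping, not the IPIs or the
hardware programming, and apply_assign(), ASSIGN_* and DOMAIN_ALL are
hypothetical names that do not appear in this series.

```c
#include <stddef.h>

#define ASSIGN_TOTAL 0x1u
#define ASSIGN_LOCAL 0x2u
#define DOMAIN_ALL   (-1)	/* stands in for a "*" domain */

/*
 * Apply one operation to the per-domain assignment state.
 * states[i] holds the ASSIGN_* bits of domain i.
 * Returns 0 on success, -1 for an unknown domain or op-code.
 */
static int apply_assign(unsigned int *states, size_t ndomains,
			int domain, char op, unsigned int flags)
{
	size_t first = 0, last = ndomains;
	size_t i;

	if (domain != DOMAIN_ALL) {
		if (domain < 0 || (size_t)domain >= ndomains)
			return -1;
		/* A numeric domain id touches that domain only. */
		first = (size_t)domain;
		last = first + 1;
	}

	if (op != '=' && op != '+' && op != '-')
		return -1;

	for (i = first; i < last; i++) {
		if (op == '=')
			states[i] = flags;
		else if (op == '+')
			states[i] |= flags;
		else
			states[i] &= ~flags;
	}
	return 0;
}
```

Starting from "0=tl;1=tl", writing "0=t" would then read back as "0=t;1=tl",
and only "*=t" would produce "0=t;1=t".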

>
> To update the MON group child_default_mon_grp to remove local event:
> # echo "/child_default_mon_grp/0-l" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
> Assignment status after the update:
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> //0=t;1=t;
> /child_default_mon_grp/0=t;1=t;
> non_default_ctrl_mon_grp//0=tl;1=tl;
> non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
>
> To update the MON group non_default_ctrl_mon_grp/child_non_default_mon_grp to
> remove both local and total events:
> # echo "non_default_ctrl_mon_grp/child_non_default_mon_grp/0_" >
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
> Assignment status after the update:
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> //0=t;1=t;
> /child_default_mon_grp/0=t;1=t;
> non_default_ctrl_mon_grp//0=tl;1=tl;
> non_default_ctrl_mon_grp/child_non_default_mon_grp/0=_;1=_;
>
>
> f. Read the event mbm_total_bytes and mbm_local_bytes of the default group.
> There is no change in reading the evetns with ABMC. If the event is unassigned

evetns -> events

> when reading, then the read will come back as Unavailable.
>
> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> 779247936
> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> 765207488
>
> g. Users will have the option to go back to legacy_mbm mode if required.
> This can be done using the following command.
>
> # echo "legacy_mbm" > /sys/fs/resctrl/info/L3_MON/mbm_assign
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign
> abmc
> [legacy_mbm]
>

This needs a mention about how state is impacted when a user makes this
switch. For example, if switching from "legacy" to abmc ... if there
are fewer than "num counters" monitor groups, will they get counters
assigned dynamically? What happens to feature specific resctrl files?
What happens to the counters themselves, are they reset? What else
happens during this switch?

>
> h. Check the bandwidth configuration for the group. Note that bandwidth
> configuration has a domain scope. Total event defaults to 0x7F (to
> count all the events) and local event defaults to 0x15 (to count all
> the local numa events). The event bitmap decoding is available at
> https://www.kernel.org/doc/Documentation/x86/resctrl.rst
> in section "mbm_total_bytes_config", "mbm_local_bytes_config":
>
> #cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
> 0=0x7f;1=0x7f
>
> #cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
> 0=0x15;1=0x15
>
> j. Change the bandwidth source for domain 0 for the total event to count only reads.
> Note that this change affects total events on domain 0.
>
> #echo 0=0x33 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
> #cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
> 0=0x33;1=0x7F
>
> k. Now read the total event again. The mbm_total_bytes should display
> only the read events.
>
> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> 314101
>
> l. Unmount the resctrl
>
> #umount /sys/fs/resctrl/
>
> ---

Reinette

2024-05-03 23:25:23

by Reinette Chatre

Subject: Re: [RFC PATCH v3 01/17] x86/resctrl: Add support for Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On 3/28/2024 6:06 PM, Babu Moger wrote:
> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
> feature only guarantees that RMIDs currently assigned to a processor will
> be tracked by hardware. The counters of any other RMIDs which are no longer
> being tracked will be reset to zero. The MBM event counters return
> "Unavailable" for the RMIDs that are not active.

I think it will be helpful to use consistent terms. For example, above uses
"tracked by hardware" as well as "active". "tracked by hardware" seems more
specific to me and I think it would help to understand this work if this is
used consistently.

>
> Users can create 256 or more monitor groups. But there can be only limited

I think you write "Users can create 256 or more monitor groups." to match
with earlier "AMD hardware can support 256 or more RMIDs.". Can this be made
specific with "Users can create as many monitor groups as RMIDs supported."?
(please feel free to improve)

> number of groups that can give guaranteed monitoring numbers. With ever
> changing configurations there is no way to definitely know which of these
> groups will be active for certain point of time. Users do not have the
> option to monitor a group or set of groups for certain period of time
> without worrying about RMID being reset in between.
>
> The ABMC feature provides an option to the user to assign an RMID to the
> hardware counter and monitor the bandwidth for a longer duration.
> The assigned RMID will be active until the user unassigns it manually.
> There is no need to worry about counters being reset during this period.
> Additionally, the user can specify a bitmask identifying the specific
> bandwidth types from the given source to track with the counter.
>
> Linux resctrl subsystem provides the interface to count maximum of two
> memory bandwidth events per group, from a combination of available total
> and local events. Keeping the current interface, users can assign a maximum
> of 2 ABMC counters per group. User will also have the option to assign only
> one counter to the group. If the system runs out of assignable ABMC
> counters, kernel will display an error. Users need to unassign an already
> assigned counter to make space for new assignments.
>
> AMD hardware provides total of 32 ABMC counters when supported.

I am not sure if you want to mention this. As written this sounds like
a hardcoded value but it is clear from later patches the number of counters
is learned from hardware.

>
> The feature can be detected via CPUID_Fn80000020_EBX_x00 bit 5.
> Bits Description
> 5 ABMC (Assignable Bandwidth Monitoring Counters)
>
> The feature details are documented in APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> Monitoring (ABMC).
>
> Signed-off-by: Babu Moger <[email protected]>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537

Reinette


2024-05-03 23:26:38

by Reinette Chatre

Subject: Re: [RFC PATCH v3 03/17] x86/resctrl: Detect Assignable Bandwidth Monitoring feature details

Hi Babu,

On 3/28/2024 6:06 PM, Babu Moger wrote:
> ABMC feature details are reported via CPUID Fn8000_0020_EBX_x5.
> Bits Description
> 15:0 MAX_ABMC Maximum Supported Assignable Bandwidth
> Monitoring Counter ID + 1
>
> The feature details are documented in APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> Monitoring (ABMC).
>
> Signed-off-by: Babu Moger <[email protected]>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
> ---
> v3: Removed changes related to mon_features.
> Moved rdt_cpu_has to core.c and added new function resctrl_arch_has_abmc.
> Also moved the fields mbm_assign_capable and mbm_assign_cntrs to
> rdt_resource. (James)
>
> v2: Changed the field name to mbm_assign_capable from abmc_capable.
> ---
> arch/x86/kernel/cpu/resctrl/core.c | 17 +++++++++++++++++
> arch/x86/kernel/cpu/resctrl/internal.h | 1 +
> arch/x86/kernel/cpu/resctrl/monitor.c | 3 +++
> include/linux/resctrl.h | 12 ++++++++++++
> 4 files changed, 33 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 57a8c6f30dd6..bb82b392cf5d 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -740,6 +740,23 @@ bool __init rdt_cpu_has(int flag)
> return ret;
> }
>
> +inline bool __init resctrl_arch_has_abmc(struct rdt_resource *r)
> +{
> + bool ret = rdt_cpu_has(X86_FEATURE_ABMC);
> + u32 eax, ebx, ecx, edx;
> +
> + if (ret) {
> + /*
> + * Query CPUID_Fn80000020_EBX_x05 for number of
> + * ABMC counters
> + */
> + cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
> + r->mbm_assign_cntrs = (ebx & 0xFFFF) + 1;
> + }
> +
> + return ret;
> +}

It is not clear to me why this function is needed. I went back to
read James' comment and it sounds to me as though he expected it
to be called from non-arch code ... but this is only called
from rdt_get_mon_l3_config() which is very much architecture specific
and will remain in arch/x86 where rdt_cpu_has() will be accessible.

> +
> static __init bool get_mem_config(void)
> {
> struct rdt_hw_resource *hw_res = &rdt_resources_all[RDT_RESOURCE_MBA];
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index c99f26ebe7a6..c4ae6f3993aa 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -584,6 +584,7 @@ void free_rmid(u32 closid, u32 rmid);
> int rdt_get_mon_l3_config(struct rdt_resource *r);
> void __exit rdt_put_mon_l3_config(void);
> bool __init rdt_cpu_has(int flag);
> +bool __init resctrl_arch_has_abmc(struct rdt_resource *r);
> void mon_event_count(void *info);
> int rdtgroup_mondata_show(struct seq_file *m, void *arg);
> void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index c34a35ec0f03..e5938bf53d5a 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -1055,6 +1055,9 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
> mbm_local_event.configurable = true;
> mbm_config_rftype_init("mbm_local_bytes_config");
> }
> +
> + if (resctrl_arch_has_abmc(r))
> + r->mbm_assign_capable = ABMC_ASSIGN;
> }

This is confusing to me in two ways:
(a) why are different layers of abstraction needed to initialize r->mbm_assign_capable
and r->mbm_assign_cntrs? Can they not just be assigned at the same time?
(b) r->mbm_assign_capable is a bool ... but it is assigned an enum? Why is
this enum needed for this?

>
> l3_mon_evt_init(r);
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index a365f67131ec..a1ee9afabff3 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -150,6 +150,14 @@ struct resctrl_membw {
> struct rdt_parse_data;
> struct resctrl_schema;
>
> +/**
> + * enum mbm_assign_type - The type of assignable monitoring.
> + * @ABMC_ASSIGN: Assignable Bandwidth Monitoring Counters.
> + */
> +enum mbm_assign_type {
> + ABMC_ASSIGN = 0x01,
> +};
> +

Either the resource is mbm_assign_capable or not ... it is not clear
to me why an enum is needed.

> /**
> * struct rdt_resource - attributes of a resctrl resource
> * @rid: The index of the resource
> @@ -168,6 +176,8 @@ struct resctrl_schema;
> * @evt_list: List of monitoring events
> * @fflags: flags to choose base and info files
> * @cdp_capable: Is the CDP feature available on this resource
> + * @mbm_assign_capable: Does system capable of supporting monitor assignment?

"Does system capable" -> "Is system capable"?

> + * @mbm_assign_cntrs: Maximum number of assignable counters
> */
> struct rdt_resource {
> int rid;
> @@ -188,6 +198,8 @@ struct rdt_resource {
> struct list_head evt_list;
> unsigned long fflags;
> bool cdp_capable;
> + bool mbm_assign_capable;
> + u32 mbm_assign_cntrs;
> };

Please check tabs vs spaces (in this whole series please).

I'm thinking that a new "MBM specific" struct within
struct rdt_resource will be helpful to clearly separate the MBM related
data. This will be similar to struct resctrl_cache for
cache allocation and struct resctrl_membw for memory bandwidth
allocation.
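
A rough user-space sketch of what I have in mind, combined with the earlier
comments about using a plain bool and setting both fields in one place. The
struct name resctrl_mon_sketch and the helper mon_assign_init() are
hypothetical, has_abmc stands in for rdt_cpu_has(X86_FEATURE_ABMC), and ebx
stands in for the EBX output of CPUID Fn8000_0020 subleaf 5, decoded exactly
as the quoted patch does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical grouping of the MBM assignment data; not from the series. */
struct resctrl_mon_sketch {
	bool     mbm_assign_capable;	/* assignable counters are supported */
	uint32_t mbm_assign_cntrs;	/* number of assignable counters */
};

/*
 * Fill both fields at the same time. Bits 15:0 of ebx are decoded the
 * same way as in the quoted patch: (ebx & 0xFFFF) + 1.
 */
static void mon_assign_init(struct resctrl_mon_sketch *mon,
			    bool has_abmc, uint32_t ebx)
{
	mon->mbm_assign_capable = has_abmc;
	mon->mbm_assign_cntrs = has_abmc ? (ebx & 0xFFFF) + 1 : 0;
}
```

In the kernel this would be a member of struct rdt_resource, next to the
existing cache and membw members.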

>
> /**

Reinette

2024-05-03 23:28:43

by Reinette Chatre

Subject: Re: [RFC PATCH v3 05/17] x86/resctrl: Introduce the interface to display the assignment state

Hi Babu,

On 3/28/2024 6:06 PM, Babu Moger wrote:
> The ABMC feature provides an option to the user to assign an RMID
> to the hardware counter and monitor the bandwidth for a longer duration.
> System can be in only one mode at a time (Legacy Monitor mode or ABMC
> mode). By default, ABMC mode is disabled.

"By default, ABMC mode is disabled." seems to contradict later work.

>
> Provide an interface to display the monitor mode on the system.
> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign
> abmc

This example seems to contradict earlier statements in two ways:
(a) it only shows one mode vs. there are two modes (legacy or ABMC)
(b) there is no active mode vs. one mode is always active.
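
For reference, a minimal user-space sketch of output that would be consistent
with the cover letter, where both modes are always listed and exactly one is
bracketed. format_mbm_assign() is a hypothetical helper, not a function from
this series; the mode strings are the ones used in the cover letter.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Write both monitoring modes into buf and bracket the active one.
 * Returns the snprintf() result (length that was or would be written).
 */
static int format_mbm_assign(char *buf, size_t len, bool abmc_enabled)
{
	return snprintf(buf, len, "%s\n%s\n",
			abmc_enabled ? "[abmc]" : "abmc",
			abmc_enabled ? "legacy_mbm" : "[legacy_mbm]");
}
```

Either way the reader always sees two lines and can tell which mode is
active.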

>
> When the feature is enabled
> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign
> [abmc]
>
> Signed-off-by: Babu Moger <[email protected]>
> ---
> v3: New patch to display ABMC capability.
> ---
> Documentation/arch/x86/resctrl.rst | 5 +++++
> arch/x86/kernel/cpu/resctrl/monitor.c | 4 +++-
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 17 +++++++++++++++++
> 3 files changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index 68df7751d1f5..cd973a013525 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -257,6 +257,11 @@ with the following files:
> # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
> 0=0x30;1=0x30;3=0x15;4=0x15
>
> +"mbm_assign":
> + Available when assignable monitoring features are supported.
> + Reports the list of assignable features supported and the enclosed brackets
> + indicate the feature is enabled.

"indicate the feature is enabled" -> "indicate which feature is enabled" or
"indicates the currently enabled feature" or ...?

> +
> "max_threshold_occupancy":
> Read/write file provides the largest value (in
> bytes) at which a previously used LLC_occupancy
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 735b449039c1..48d1957ea5a3 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -1058,8 +1058,10 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
> RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
> }
>
> - if (resctrl_arch_has_abmc(r))
> + if (resctrl_arch_has_abmc(r)) {
> r->mbm_assign_capable = ABMC_ASSIGN;
> + resctrl_file_fflags_init("mbm_assign", RFTYPE_MON_INFO);

I think this will need some more thought when considering the fs/arch split.
The architecture can be expected to set r->mbm_assign_capable as above but
having the architecture meddle with the fs flags does not seem like the right
thing to do. I think that RFTYPE_MON_INFO may not be accessible to arch code
anyway.

> + }
> }
>
> l3_mon_evt_init(r);
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index dda71fb6c10e..5ec807e8dd38 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -846,6 +846,17 @@ static int rdtgroup_rmid_show(struct kernfs_open_file *of,
> return ret;
> }
>
> +static int rdtgroup_mbm_assign_show(struct kernfs_open_file *of,
> + struct seq_file *s, void *v)
> +{
> + struct rdt_resource *r = of->kn->parent->priv;
> +
> + if (r->mbm_assign_capable)
> + seq_puts(s, "abmc\n");
> +
> + return 0;
> +}

Should it print "legacy" if not mbm_assign_capable? Or actually, I think
the expectation is that this file will only be accessible if
r->mbm_assign_capable is true ... so having that if (r->mbm_assign_capable)
check is not clear to me ... if that is false then it would be a kernel
bug, no?

> +
> #ifdef CONFIG_PROC_CPU_RESCTRL
>
> /*
> @@ -1903,6 +1914,12 @@ static struct rftype res_common_files[] = {
> .seq_show = mbm_local_bytes_config_show,
> .write = mbm_local_bytes_config_write,
> },
> + {
> + .name = "mbm_assign",
> + .mode = 0444,
> + .kf_ops = &rdtgroup_kf_single_ops,
> + .seq_show = rdtgroup_mbm_assign_show,
> + },
> {
> .name = "cpus",
> .mode = 0644,

Reinette

2024-05-03 23:28:44

by Reinette Chatre

Subject: Re: [RFC PATCH v3 04/17] x86/resctrl: Introduce resctrl_file_fflags_init

Hi Babu,

In shortlog, please use () to indicate function:
resctrl_file_fflags_init().

On 3/28/2024 6:06 PM, Babu Moger wrote:
> Consolidate multiple fflags initialization into one function.
>
> Remove thread_throttle_mode_init, mbm_config_rftype_init and
> consolidate them into resctrl_file_fflags_init.

This changelog has no context and only describes what the code does,
which can be learned from reading the patch. Could you please
update changelog with context and motivation for this change?

Reinette

2024-05-03 23:30:54

by Reinette Chatre

Subject: Re: [RFC PATCH v3 07/17] x86/resctrl: Add support to enable/disable ABMC feature

Hi Babu,

On 3/28/2024 6:06 PM, Babu Moger wrote:
> Add the functionality to enable/disable ABMC feature.
>
> ABMC is enabled by setting enabled bit(0) in MSR L3_QOS_EXT_CFG. When the
> state of ABMC is changed, it must be changed to the updated value on all
> logical processors in the QOS Domain.

This patch does much more than enable what is mentioned above. There is little
information about what this patch aims to accomplish. Without this it makes
review difficult.
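
For comparison, the part that the changelog does describe is small. A minimal
user-space sketch of it follows, assuming only what is stated above (enable is
bit 0 of L3_QOS_EXT_CFG). abmc_cfg_value() is a hypothetical helper, and the
sketch does not model the MSR access itself or the requirement to update every
logical processor in the domain.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of L3_QOS_EXT_CFG enables ABMC, per the quoted changelog. */
#define ABMC_ENABLE (UINT64_C(1) << 0)

/* Return the new register value with the ABMC enable bit set or cleared. */
static uint64_t abmc_cfg_value(uint64_t msrval, bool enable)
{
	return enable ? (msrval | ABMC_ENABLE) : (msrval & ~ABMC_ENABLE);
}
```

Everything beyond this read-modify-write, such as the changes to the
mbm_assign output, deserves its own description in the changelog.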

>
> The ABMC feature details are documented in APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> Monitoring (ABMC).
>
> Signed-off-by: Babu Moger <[email protected]>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
> ---
> v3: No changes.
>
> v2: Few text changes in commit message.
> ---
> arch/x86/include/asm/msr-index.h | 1 +
> arch/x86/kernel/cpu/resctrl/internal.h | 12 ++++
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 76 +++++++++++++++++++++++++-
> 3 files changed, 88 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 05956bd8bacf..f16ee50b1a23 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1165,6 +1165,7 @@
> #define MSR_IA32_MBA_BW_BASE 0xc0000200
> #define MSR_IA32_SMBA_BW_BASE 0xc0000280
> #define MSR_IA32_EVT_CFG_BASE 0xc0000400
> +#define MSR_IA32_L3_QOS_EXT_CFG 0xc00003ff
>
> /* MSR_IA32_VMX_MISC bits */
> #define MSR_IA32_VMX_MISC_INTEL_PT (1ULL << 14)
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 722388621403..8238ee437369 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -96,6 +96,9 @@ cpumask_any_housekeeping(const struct cpumask *mask, int exclude_cpu)
> return cpu;
> }
>
> +/* ABMC ENABLE */

Can this comment be made more useful?

> +#define ABMC_ENABLE BIT(0)
> +
> struct rdt_fs_context {
> struct kernfs_fs_context kfc;
> bool enable_cdpl2;
> @@ -433,6 +436,7 @@ struct rdt_parse_data {
> * @mbm_cfg_mask: Bandwidth sources that can be tracked when Bandwidth
> * Monitoring Event Configuration (BMEC) is supported.
> * @cdp_enabled: CDP state of this resource
> + * @abmc_enabled: ABMC feature is enabled
> *
> * Members of this structure are either private to the architecture
> * e.g. mbm_width, or accessed via helpers that provide abstraction. e.g.
> @@ -448,6 +452,7 @@ struct rdt_hw_resource {
> unsigned int mbm_width;
> unsigned int mbm_cfg_mask;
> bool cdp_enabled;
> + bool abmc_enabled;
> };
>
> static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r)
> @@ -491,6 +496,13 @@ static inline bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)
>
> int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
>
> +static inline bool resctrl_arch_get_abmc_enabled(enum resctrl_res_level l)
> +{
> + return rdt_resources_all[l].abmc_enabled;
> +}
> +
> +int resctrl_arch_set_abmc_enabled(enum resctrl_res_level l, bool enable);
> +
> /*
> * To return the common struct rdt_resource, which is contained in struct
> * rdt_hw_resource, walk the resctrl member of struct rdt_hw_resource.
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 05f551bc316e..f49073c86884 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -850,9 +850,15 @@ static int rdtgroup_mbm_assign_show(struct kernfs_open_file *of,
> struct seq_file *s, void *v)
> {
> struct rdt_resource *r = of->kn->parent->priv;
> + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>
> - if (r->mbm_assign_capable)
> + if (r->mbm_assign_capable && hw_res->abmc_enabled) {
> + seq_puts(s, "[abmc]\n");
> + seq_puts(s, "legacy_mbm\n");
> + } else if (r->mbm_assign_capable) {
> seq_puts(s, "abmc\n");
> + seq_puts(s, "[legacy_mbm]\n");
> + }
>
> return 0;
> }
> @@ -2433,6 +2439,74 @@ int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable)
> return 0;
> }
>
> +static void resctrl_abmc_msrwrite(void *arg)
> +{
> + bool *enable = arg;
> + u64 msrval;
> +
> + rdmsrl(MSR_IA32_L3_QOS_EXT_CFG, msrval);
> +
> + if (*enable)
> + msrval |= ABMC_ENABLE;
> + else
> + msrval &= ~ABMC_ENABLE;
> +
> + wrmsrl(MSR_IA32_L3_QOS_EXT_CFG, msrval);
> +}
> +
> +static int resctrl_abmc_setup(enum resctrl_res_level l, bool enable)
> +{
> + struct rdt_resource *r = &rdt_resources_all[l].r_resctrl;
> + struct rdt_domain *d;
> +
> + /* Update QOS_CFG MSR on all the CPUs in cpu_mask */

"all the CPUs in cpu_mask" -> "all the CPUs associated with the resource"?

> + list_for_each_entry(d, &r->domains, list) {
> + on_each_cpu_mask(&d->cpu_mask, resctrl_abmc_msrwrite, &enable, 1);
> + resctrl_arch_reset_rmid_all(r, d);

Could the changelog please explain why this is needed and what the impact of
this is?

> + }
> +
> + return 0;
> +}

I think the naming can be changed to make these easier to understand. For example,
resctrl_abmc_msrwrite() -> resctrl_abmc_set_one()
resctrl_abmc_setup() -> resctrl_abmc_set_all()

> +
> +static int resctrl_abmc_enable(enum resctrl_res_level l)
> +{
> + struct rdt_hw_resource *hw_res = &rdt_resources_all[l];
> + int ret = 0;
> +
> + if (!hw_res->abmc_enabled) {
> + ret = resctrl_abmc_setup(l, true);
> + if (!ret)
> + hw_res->abmc_enabled = true;
> + }
> +
> + return ret;
> +}
> +
> +static void resctrl_abmc_disable(enum resctrl_res_level l)
> +{
> + struct rdt_hw_resource *hw_res = &rdt_resources_all[l];
> +
> + if (hw_res->abmc_enabled) {
> + resctrl_abmc_setup(l, false);
> + hw_res->abmc_enabled = false;
> + }
> +}
> +
> +int resctrl_arch_set_abmc_enabled(enum resctrl_res_level l, bool enable)
> +{
> + struct rdt_hw_resource *hw_res = &rdt_resources_all[l];
> +
> + if (!hw_res->r_resctrl.mbm_assign_capable)
> + return -EINVAL;
> +
> + if (enable)
> + return resctrl_abmc_enable(l);
> +
> + resctrl_abmc_disable(l);
> +
> + return 0;
> +}

Why is resctrl_arch_set_abmc_enabled() necessary? It seems to add an unnecessary
layer of abstraction.

> +
> /*
> * We don't allow rdtgroup directories to be created anywhere
> * except the root directory. Thus when looking for the rdtgroup

Reinette

2024-05-03 23:31:42

by Reinette Chatre

Subject: Re: [RFC PATCH v3 08/17] x86/resctrl: Initialize assignable counters bitmap

Hi Babu,

On 3/28/2024 6:06 PM, Babu Moger wrote:
> AMD Hardware provides a set of counters when the ABMC feature is supported.
> These counters are used for assigning events to the resctrl group.
>
> Introduce the bitmap assign_cntrs_free_map to allocate and free the
> counters.
>
> Signed-off-by: Babu Moger <[email protected]>
>
> ---
> v3: Changed the bitmap name to assign_cntrs_free_map. Removed abmc
> from the name.
>
> v2: Changed the bitmap name to assignable_counter_free_map from
> abmc_counter_free_map.
> ---
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index f49073c86884..2c7583e7b541 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -186,6 +186,22 @@ bool closid_allocated(unsigned int closid)
> return !test_bit(closid, &closid_free_map);
> }
>
> +static u64 assign_cntrs_free_map;
> +static u32 assign_cntrs_free_map_len;

Please provide a summary in comments of what these globals are and how they
are used.

> +
> +static void assign_cntrs_init(void)
> +{
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +
> + if (r->mbm_assign_cntrs > 64) {
> + r->mbm_assign_cntrs = 64;
> + WARN(1, "Cannot support more than 64 Assignable counters\n");

I am a bit confused here. The configuration registers are introduced in patch #10
and if I counted right there are 5 bits for the counter id. It thus seems to me
as though there needs to be some checking during enumeration time to ensure
that all counters enumerated can be configured.
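
To make the arithmetic concrete: CtrID spans bits 52:48 in the L3_QOS_ABMC_CFG
layout from patch #10, which is 5 bits, so at most 32 counters are addressable
even though the check above caps the count at 64. A minimal user-space sketch
of an enumeration-time clamp, assuming that layout; the macro and helper names
are hypothetical and not part of the series:

```c
#include <assert.h>
#include <stdint.h>

/* CtrID occupies bits 52:48 of L3_QOS_ABMC_CFG, i.e. 5 bits. */
#define ABMC_CTR_ID_BITS 5u
#define ABMC_MAX_CNTRS   (1u << ABMC_CTR_ID_BITS)

static_assert(ABMC_MAX_CNTRS == 32, "CtrID is 5 bits wide");

/*
 * Hypothetical helper: limit the enumerated counter count to what the
 * CtrID field can address, so every counter handed out can be configured.
 */
static uint32_t abmc_clamp_cntrs(uint32_t enumerated)
{
	return enumerated > ABMC_MAX_CNTRS ? ABMC_MAX_CNTRS : enumerated;
}
```

With a clamp like this applied at enumeration time, the later bitmap setup
would not need its own upper-bound check.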

> + }
> +
> + assign_cntrs_free_map = BIT_MASK(r->mbm_assign_cntrs) - 1;

Please use bitmap API. For example, bitmap_fill()
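
As a rough illustration of what bitmap_fill() buys here, the user-space
stand-in below sets the low nbits bits of a 64-bit map and handles the
full-width case explicitly. It is not kernel code: cntrs_free_map and
cntrs_fill() are made-up names, and a kernel version would presumably use
DECLARE_BITMAP() with bitmap_fill(). If I read BIT_MASK() correctly, it takes
the bit number modulo BITS_PER_LONG, so BIT_MASK(64) - 1 would evaluate to 0
and leave no counters free, which is the case the sketch guards against.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CNTRS 64u

/* User-space stand-in for a kernel bitmap of up to 64 bits. */
static uint64_t cntrs_free_map;

/*
 * Mimics bitmap_fill(map, nbits): mark the low nbits counters as free.
 * nbits == 64 is handled separately because shifting a 64-bit value by
 * its full width is undefined in C.
 */
static void cntrs_fill(uint32_t nbits)
{
	assert(nbits <= MAX_CNTRS);

	if (nbits == MAX_CNTRS)
		cntrs_free_map = UINT64_MAX;
	else
		cntrs_free_map = (UINT64_C(1) << nbits) - 1;
}
```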

> + assign_cntrs_free_map_len = r->mbm_assign_cntrs;
> +}
> +
> /**
> * rdtgroup_mode_by_closid - Return mode of resource group with closid
> * @closid: closid if the resource group
> @@ -2459,6 +2475,9 @@ static int resctrl_abmc_setup(enum resctrl_res_level l, bool enable)
> struct rdt_resource *r = &rdt_resources_all[l].r_resctrl;
> struct rdt_domain *d;
>
> + /* Reset the counters bitmap */
> + assign_cntrs_init();
> +

(At this point it is unclear when resctrl_abmc_setup() is called to understand
if reset of bitmap may be appropriate. Please do expand all changelogs to help
readers along with how this implementation is intended to work.)

> /* Update QOS_CFG MSR on all the CPUs in cpu_mask */
> list_for_each_entry(d, &r->domains, list) {
> on_each_cpu_mask(&d->cpu_mask, resctrl_abmc_msrwrite, &enable, 1);

Reinette

2024-05-03 23:32:20

by Reinette Chatre

Subject: Re: [RFC PATCH v3 10/17] x86/resctrl: Add data structures for ABMC assignment

Hi Babu,

On 3/28/2024 6:06 PM, Babu Moger wrote:
> ABMC (Assignable Bandwidth Monitoring Counters) counters can be configured
> by writing to L3_QOS_ABMC_CFG MSR. When ABMC is enabled, the user can
> configure a counter by writing to L3_QOS_ABMC_CFG setting the CfgEn field
> while specifying the Bandwidth Source, Bandwidth Types, and Counter
> Identifier. Add the MSR definition and individual field definitions.
>
> MSR L3_QOS_ABMC_CFG (C000_03FDh) definitions.
>
> ==========================================================================
> Bits    Mnemonic  Description             Access Type  Reset Value
> ==========================================================================
> 63      CfgEn     Configuration Enable    R/W          0
>
> 62      CtrEn     Counter Enable          R/W          0
>
> 61:53   –         Reserved                MBZ          0
>
> 52:48   CtrID     Counter Identifier      R/W          0
>
> 47      IsCOS     BwSrc field is a COS    R/W          0
>                   (not an RMID)
>
> 46:44   –         Reserved                MBZ          0
>
> 43:32   BwSrc     Bandwidth Source        R/W          0
>                   (RMID or COS)
>
> 31:0    BwType    Bandwidth types to      R/W          0
>                   track for this counter
> ==========================================================================
>
> The feature details are documented in the APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> Monitoring (ABMC).

This changelog is purely a summary of the hardware architecture. I have not come
across a clear explanation on how this architecture is intended to be supported
by resctrl. When would resctrl need/want to set particular fields? What is
the mapping to resctrl?

>
> Signed-off-by: Babu Moger <[email protected]>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>
> ---
> v3: No changes.
> v2: No changes.
> ---
> arch/x86/include/asm/msr-index.h | 1 +
> arch/x86/kernel/cpu/resctrl/internal.h | 23 +++++++++++++++++++++++
> 2 files changed, 24 insertions(+)
>
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index f16ee50b1a23..ab01abfab089 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1166,6 +1166,7 @@
> #define MSR_IA32_SMBA_BW_BASE 0xc0000280
> #define MSR_IA32_EVT_CFG_BASE 0xc0000400
> #define MSR_IA32_L3_QOS_EXT_CFG 0xc00003ff
> +#define MSR_IA32_L3_QOS_ABMC_CFG 0xc00003fd
>
> /* MSR_IA32_VMX_MISC bits */
> #define MSR_IA32_VMX_MISC_INTEL_PT (1ULL << 14)
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index b559b3a4555e..41b06d46ea74 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -106,6 +106,9 @@ cpumask_any_housekeeping(const struct cpumask *mask, int exclude_cpu)
> #define ASSIGN_TOTAL BIT(0)
> #define ASSIGN_LOCAL BIT(1)
>
> +/* Maximum assignable counters per resctrl group */
> +#define MAX_ASSIGN_CNTRS 2
> +
> struct rdt_fs_context {
> struct kernfs_fs_context kfc;
> bool enable_cdpl2;
> @@ -210,6 +213,7 @@ enum rdtgrp_mode {
> * @crdtgrp_list: child rdtgroup node list
> * @rmid: rmid for this rdtgroup
> * @mon_state: Assignment state of the group
> + * @abmc_ctr_id: ABMC counterids assigned to this group
> */
> struct mongroup {
> struct kernfs_node *mon_data_kn;
> @@ -217,6 +221,7 @@ struct mongroup {
> struct list_head crdtgrp_list;
> u32 rmid;
> u32 mon_state;
> + u32 abmc_ctr_id[MAX_ASSIGN_CNTRS];
> };
>
> /**
> @@ -566,6 +571,24 @@ union cpuid_0x10_x_edx {
> unsigned int full;
> };
>
> +/*
> + * L3_QOS_ABMC_CFG MSR details. ABMC counters can be configured
> + * by writing to L3_QOS_ABMC_CFG.

There are many fields in this structure ... how is resctrl expected
to set these fields in order to configure a counter? Please expand the
comments.
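
For reference, one way to spell out the field usage is with explicit shifts
and masks taken from the bit positions in the changelog table. This is only a
user-space sketch under that assumption: the macro names and
abmc_cfg_encode() are hypothetical, and IsCOS is left clear because the series
only passes an RMID as the bandwidth source. Explicit shifts also avoid
relying on how the compiler lays out bit-fields.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions as listed in the L3_QOS_ABMC_CFG table. */
#define ABMC_CFG_EN        (UINT64_C(1) << 63)  /* CfgEn: apply this configuration */
#define ABMC_CTR_EN        (UINT64_C(1) << 62)  /* CtrEn: counter starts counting  */
#define ABMC_CTR_ID_SHIFT  48                   /* CtrID: bits 52:48               */
#define ABMC_CTR_ID_MASK   UINT64_C(0x1f)
#define ABMC_BW_SRC_SHIFT  32                   /* BwSrc: bits 43:32               */
#define ABMC_BW_SRC_MASK   UINT64_C(0xfff)

/*
 * Build the MSR value that assigns (assign != 0) or unassigns counter
 * ctr_id for the given RMID, tracking the bandwidth types in bw_type.
 */
static uint64_t abmc_cfg_encode(uint32_t ctr_id, uint32_t rmid,
				uint32_t bw_type, int assign)
{
	uint64_t val = ABMC_CFG_EN;

	assert(ctr_id <= ABMC_CTR_ID_MASK);
	assert(rmid <= ABMC_BW_SRC_MASK);

	if (assign)
		val |= ABMC_CTR_EN;
	val |= (uint64_t)ctr_id << ABMC_CTR_ID_SHIFT;
	val |= (uint64_t)rmid << ABMC_BW_SRC_SHIFT;
	val |= bw_type;                         /* BwType: bits 31:0 */

	return val;
}
```

For example, assigning counter 3 to RMID 0x10 with bw_type 0x7f yields
0xC00300100000007F; the same call with assign set to 0 clears bit 62 only.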

> + */
> +union l3_qos_abmc_cfg {
> + struct {
> + unsigned long bw_type :32,
> + bw_src :12,
> + rsvrd1 : 3,

Considering how "reserved" is spelled it is
unexpected to see "rsvrd"


> + is_cos : 1,
> + ctr_id : 5,
> + rsvrd : 9,
> + ctr_en : 1,
> + cfg_en : 1;
> + } split;
> + unsigned long full;
> +};
> +
> void rdt_last_cmd_clear(void);
> void rdt_last_cmd_puts(const char *s);
> __printf(1, 2)


Reinette

2024-05-03 23:33:18

by Reinette Chatre

Subject: Re: [RFC PATCH v3 11/17] x86/resctrl: Introduce mbm_total_cfg and mbm_local_cfg

Hi Babu,

On 3/28/2024 6:06 PM, Babu Moger wrote:
> If the BMEC (Bandwidth Monitoring Event Configuration) feature is
> supported, the bandwidth events can be configured to track specific events.
> The event configuration is domain specific. ABMC (Assignable Bandwidth
> Monitoring Counters) feature needs event configuration information to
> assign RMID to the hardware counter. Currently, this information is not
> available.

hmmm ... "Currently, this information is not available." does not sound
accurate. Perhaps it can be made more specific with something like:
"Event configurations are not stored in resctrl but instead always
read from or written to hardware directly when prompted by user space."
(feel free to improve)

>
> Save the event configuration information in the rdt_hw_domain, so it can
> be used while for RMID assignment.

"be used while for RMID assignment" -> "be used for RMID assignment"?

>
> Signed-off-by: Babu Moger <[email protected]>
>
> ---
> v3: Minor changes related to rebase in mbm_config_write_domain.
>
> v2: No changes.
> ---
> arch/x86/kernel/cpu/resctrl/core.c | 2 ++
> arch/x86/kernel/cpu/resctrl/internal.h | 3 +++
> arch/x86/kernel/cpu/resctrl/monitor.c | 11 +++++++++++
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 16 +++++++++++++++-
> 4 files changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 50e9ec5e547b..ed4f6d49d737 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -555,6 +555,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
> return;
> }
>
> + arch_domain_mbm_evt_config(hw_dom);
> +
> list_add_tail_rcu(&d->list, add_pos);
>
> err = resctrl_online_domain(r, d);
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 41b06d46ea74..88453c86474b 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -385,6 +385,8 @@ struct rdt_hw_domain {
> u32 *ctrl_val;
> struct arch_mbm_state *arch_mbm_total;
> struct arch_mbm_state *arch_mbm_local;
> + u32 mbm_total_cfg;
> + u32 mbm_local_cfg;

(please fix tabs/spaces)

> };
>
> static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
> @@ -648,6 +650,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free);
> void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
> void __init resctrl_file_fflags_init(const char *config,
> unsigned long fflags);
> +void arch_domain_mbm_evt_config(struct rdt_hw_domain *hw_dom);
> void rdt_staged_configs_clear(void);
> bool closid_allocated(unsigned int closid);
> int resctrl_find_cleanest_closid(void);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 56dc49021540..8677dbf6de43 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -1090,3 +1090,14 @@ void __init intel_rdt_mbm_apply_quirk(void)
> mbm_cf_rmidthreshold = mbm_cf_table[cf_index].rmidthreshold;
> mbm_cf = mbm_cf_table[cf_index].cf;
> }
> +
> +void arch_domain_mbm_evt_config(struct rdt_hw_domain *hw_dom)
> +{
> + if (mbm_total_event.configurable)
> + hw_dom->mbm_total_cfg = MAX_EVT_CONFIG_BITS;
> +
> + if (mbm_local_event.configurable)
> + hw_dom->mbm_local_cfg = READS_TO_LOCAL_MEM |
> + NON_TEMP_WRITE_TO_LOCAL_MEM |
> + READS_TO_LOCAL_S_MEM;
> +}

Shouldn't the defaults be discovered from hardware?

Reinette


2024-05-03 23:34:14

by Reinette Chatre

Subject: Re: [RFC PATCH v3 12/17] x86/resctrl: Add the functionality to assign the RMID

Hi Babu,

On 3/28/2024 6:06 PM, Babu Moger wrote:
> With the support of ABMC (Assignable Bandwidth Monitoring Counters)
> feature, the user has the option to assign or unassign the RMID to
> hardware counter and monitor the bandwidth for the longer duration.

What is meant by "the longer duration" (this term is used throughout
this series)? Perhaps "for as long as a hardware counter is assigned"?

>
> Provide the interface to assign the counter to the group.
>
> The ABMC feature implements a pair of MSRs, L3_QOS_ABMC_CFG (MSR
> C000_03FDh) and L3_QOS_ABMC_DSC (MSR C000_03FEh). Each logical processor
> implements a separate copy of these registers. Attempts to read or write
> these MSRs when ABMC is not enabled will result in a #GP(0) exception.
>
> Individual assignable bandwidth counters are configured by writing to
> L3_QOS_ABMC_CFG MSR and specifying the Counter ID, Bandwidth Source, and
> Bandwidth Types. Reading L3_QOS_ABMC_DSC returns the configuration of the
> counter specified by L3_QOS_ABMC_CFG [CtrID].

This mentions the AMD architecture parts needing configuration but not what
resctrl parts are used to accomplish this configuration. It is difficult to
understand this work without this connection.

>
> The feature details are documented in the APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> Monitoring (ABMC).
>
> Signed-off-by: Babu Moger <[email protected]>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
> ---
> v3: Removed the static from the prototype of rdtgroup_assign_abmc.
> The function is not called directly from user anymore. These
> changes are related to global assignment interface.
>
> v2: Minor text changes in commit message.
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 1 +
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 86 ++++++++++++++++++++++++++
> 2 files changed, 87 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 88453c86474b..9d84c80104f9 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -651,6 +651,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
> void __init resctrl_file_fflags_init(const char *config,
> unsigned long fflags);
> void arch_domain_mbm_evt_config(struct rdt_hw_domain *hw_dom);
> +ssize_t rdtgroup_assign_abmc(struct rdtgroup *rdtgrp, u32 evtid, int mon_state);
> void rdt_staged_configs_clear(void);
> bool closid_allocated(unsigned int closid);
> int resctrl_find_cleanest_closid(void);
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 7f54788a58de..cfbdaf8b5f83 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -202,6 +202,18 @@ static void assign_cntrs_init(void)
> assign_cntrs_free_map_len = r->mbm_assign_cntrs;
> }
>
> +static int assign_cntrs_alloc(void)
> +{
> + u32 counterid = ffs(assign_cntrs_free_map);
> +
> + if (counterid == 0)
> + return -ENOSPC;
> + counterid--;
> + assign_cntrs_free_map &= ~(1 << counterid);
> +
> + return counterid;

Use bitmap API ... eg. find_first_bit() (eliminates
need to adjust counterid), __clear_bit()
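
One possible shape for such an allocator, written as plain user-space C so it
can be compiled on its own. cntr_alloc() and cntr_free() are hypothetical
names standing in for find_first_bit()/__clear_bit() and __set_bit() on a
kernel bitmap. As far as I can tell, ffs() takes an int and 1 << counterid is
an int shift, so the quoted version would not cover counter ids beyond the
low 32 bits of the 64-bit map; the sketch uses 64-bit operations throughout.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Find the lowest free counter among the first nbits, mark it in use and
 * return its id, or -ENOSPC when none is free. A set bit means "free".
 */
static int cntr_alloc(uint64_t *free_map, unsigned int nbits)
{
	assert(nbits <= 64);

	for (unsigned int id = 0; id < nbits; id++) {
		uint64_t bit = UINT64_C(1) << id;

		if (*free_map & bit) {
			*free_map &= ~bit;
			return (int)id;
		}
	}

	return -ENOSPC;
}

/* Return a previously allocated counter to the free pool. */
static void cntr_free(uint64_t *free_map, unsigned int id)
{
	assert(id < 64);
	*free_map |= UINT64_C(1) << id;
}
```

Because the id returned is already zero-based, no adjustment of the kind
ffs() needs is required.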

> +}
> +
> /**
> * rdtgroup_mode_by_closid - Return mode of resource group with closid
> * @closid: closid if the resource group
> @@ -1848,6 +1860,80 @@ static ssize_t mbm_local_bytes_config_write(struct kernfs_open_file *of,
> return ret ?: nbytes;
> }
>
> +static void rdtgroup_abmc_msrwrite(void *info)
> +{
> + u64 *msrval = info;
> +
> + wrmsrl(MSR_IA32_L3_QOS_ABMC_CFG, *msrval);
> +}
> +
> +static void rdtgroup_abmc_domain(struct rdt_domain *d, struct rdtgroup *rdtgrp,
> + u32 evtid, int index, bool assign)
> +{
> + struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> + union l3_qos_abmc_cfg abmc_cfg = { 0 };
> + struct arch_mbm_state *arch_mbm;
> +
> + abmc_cfg.split.cfg_en = 1;
> + abmc_cfg.split.ctr_en = assign ? 1 : 0;
> + abmc_cfg.split.ctr_id = rdtgrp->mon.abmc_ctr_id[index];
> + abmc_cfg.split.bw_src = rdtgrp->mon.rmid;
> +
> + /*
> + * Read the event configuration from the domain and pass it as
> + * bw_type.
> + */
> + if (evtid == QOS_L3_MBM_TOTAL_EVENT_ID) {
> + abmc_cfg.split.bw_type = hw_dom->mbm_total_cfg;
> + arch_mbm = &hw_dom->arch_mbm_total[rdtgrp->mon.rmid];
> + } else {
> + abmc_cfg.split.bw_type = hw_dom->mbm_local_cfg;
> + arch_mbm = &hw_dom->arch_mbm_local[rdtgrp->mon.rmid];
> + }
> +
> + smp_call_function_any(&d->cpu_mask, rdtgroup_abmc_msrwrite, &abmc_cfg, 1);
> +
> + /* Reset the internal counters */
> + if (arch_mbm)
> + memset(arch_mbm, 0, sizeof(struct arch_mbm_state));
> +}
> +
> +ssize_t rdtgroup_assign_abmc(struct rdtgroup *rdtgrp, u32 evtid, int mon_state)
> +{
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + int counterid = 0, index;
> + struct rdt_domain *d;
> +
> + if (rdtgrp->mon.mon_state & mon_state) {
> + rdt_last_cmd_puts("ABMC counter is assigned already\n");
> + return 0;
> + }
> +
> + index = mon_event_config_index_get(evtid);
> + if (index == INVALID_CONFIG_INDEX) {
> + pr_warn_once("Invalid event id %d\n", evtid);
> + return -EINVAL;
> + }
> +
> + /*
> + * Allocate a new counter and update domains
> + */
> + counterid = assign_cntrs_alloc();
> + if (counterid < 0) {
> + rdt_last_cmd_puts("Out of ABMC counters\n");
> + return -ENOSPC;
> + }
> +
> + rdtgrp->mon.abmc_ctr_id[index] = counterid;
> +
> + list_for_each_entry(d, &r->domains, list)
> + rdtgroup_abmc_domain(d, rdtgrp, evtid, index, 1);
> +
> + rdtgrp->mon.mon_state |= mon_state;
> +
> + return 0;
> +}
> +
> /* rdtgroup information files for one cache resource. */
> static struct rftype res_common_files[] = {
> {

It is not clear to me where the filesystem and architecture boundaries
are, but I understand that you and Peter already discussed this and I look
forward to the next version that will make this easier to understand.

Reinette