2023-12-01 00:58:07

by Babu Moger

Subject: [PATCH 00/15] x86/resctrl : Support AMD QoS RMID Pinning feature

This series adds support for the AMD QoS RMID Pinning feature. It is also
called the ABMC (Assignable Bandwidth Monitoring Counters) feature.

The feature details are available in APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC). The documentation is available at
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537

The patches are based on top of commit
346887b65d89ae987698bc1efd8e5536bd180b3f (tip/master)

# Introduction

AMD hardware can support 256 or more RMIDs. However, the bandwidth
monitoring feature only guarantees that RMIDs currently assigned to a
processor will be tracked by hardware. The counters of any other RMIDs
which are no longer being tracked will be reset to zero. The MBM event
counters return "Unavailable" for the RMIDs that are not active.

Users can create 256 or more monitor groups, but only a limited number of
groups can be given guaranteed monitoring numbers. With an ever-changing
system configuration, there is no way to know definitively which of these
groups will be active at a given point in time. Users do not have the
option to monitor a group or set of groups for a certain period of time
without worrying about the RMID being reset in between.

The ABMC feature provides an option to pin (or assign) the RMID to the
hardware counter and monitor the bandwidth for a longer duration. The
pinned RMID will be active until the user unpins (or unassigns) it. There
is no need to worry about counters being reset during this period.
Additionally, the user can specify a bitmask identifying the specific
bandwidth types from the given source to track with the counter.

# Linux Implementation

The hardware provides a total of 32 counters available for assignment.
Each Linux resctrl group can be assigned a maximum of 2 counters: one for
mbm_total_bytes and one for mbm_local_bytes. Users also have the option to
assign only one counter to the group. If the system runs out of assignable
counters, the kernel reports an error when the user attempts a new
counter assignment. Users need to unassign counters already in use to free
them for new assignments.
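
The counter budget described above can be modelled in a few lines of C.
This is only a userspace sketch of the accounting, not code from the
series: the names abmc_pool, mon_group, group_assign, group_unassign and
the EV_* flags are hypothetical, and the pool size of 32 is taken from the
text above.

```c
#include <assert.h>
#include <errno.h>

#define ABMC_NUM_COUNTERS 32	/* assumed pool size, per the text above */

/* Hypothetical per-event flags: one counter per assigned event. */
enum { EV_TOTAL = 1 << 0, EV_LOCAL = 1 << 1 };

struct abmc_pool { int in_use; };		/* counters currently handed out */
struct mon_group { unsigned int assigned; };	/* bitmask of EV_* flags */

/* Assign a counter to one event of a group; -ENOSPC when the pool is empty. */
static int group_assign(struct abmc_pool *p, struct mon_group *g, unsigned int ev)
{
	assert(p && g && (ev == EV_TOTAL || ev == EV_LOCAL));

	if (g->assigned & ev)
		return 0;		/* already assigned, nothing to do */
	if (p->in_use >= ABMC_NUM_COUNTERS)
		return -ENOSPC;

	p->in_use++;
	g->assigned |= ev;
	return 0;
}

/* Release the counter held by one event of a group, if any. */
static void group_unassign(struct abmc_pool *p, struct mon_group *g, unsigned int ev)
{
	assert(p && g);

	if (g->assigned & ev) {
		g->assigned &= ~ev;
		p->in_use--;
	}
}
```

With both events assigned, 16 groups exhaust a 32-counter pool; a 17th
group can only be monitored after some other group gives a counter back.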

# Examples

a. Check if ABMC support is available
#mount -t resctrl resctrl /sys/fs/resctrl/
#cat /sys/fs/resctrl/info/L3_MON/mon_features
llc_occupancy
mbm_total_bytes
mbm_total_bytes_config
mbm_local_bytes
mbm_local_bytes_config
abmc_capable ← Linux kernel detected ABMC feature.

b. Mount with ABMC support
#umount /sys/fs/resctrl/
#mount -o abmc -t resctrl resctrl /sys/fs/resctrl/

c. Read the monitor states. There will be a new file "monitor_state"
for each monitor group when the ABMC feature is enabled. By default,
both total and local MBM events are in the "unassign" state.

#cat /sys/fs/resctrl/monitor_state
total=unassign;local=unassign

d. Read the events mbm_total_bytes and mbm_local_bytes. Note that MBM
events are not available until the user assigns the events explicitly.
Users need to assign the counters to monitor the events in this mode.

#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
Unavailable

#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
Unavailable

e. Assign a h/w counter to the total event and read the monitor_state.

#echo total=assign > /sys/fs/resctrl/monitor_state
#cat /sys/fs/resctrl/monitor_state
total=assign;local=unassign

f. Now that the total event is assigned, read the total event.

#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
6136000

g. Assign a h/w counter to both total and local events and read the monitor_state.

#echo "total=assign;local=assign" > /sys/fs/resctrl/monitor_state
#cat /sys/fs/resctrl/monitor_state
total=assign;local=assign

h. Now that both total and local events are assigned, read the events.

#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
6136000
#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
58694

i. Check the bandwidth configuration for the group. Note that bandwidth
configuration has a domain scope. The total event defaults to 0x7F (to
count all the events) and the local event defaults to 0x15
(to count all the local NUMA events). The event bitmap decoding is
available in https://www.kernel.org/doc/Documentation/x86/resctrl.rst
in the sections "mbm_total_bytes_config" and "mbm_local_bytes_config":

#cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
0=0x7f;1=0x7f

#cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
0=0x15;1=0x15

j. Change the bandwidth source for domain 0 for the total event to count only reads.
Note that this change affects events on domain 0.

#echo "0=0x33" > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
#cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
0=0x33;1=0x7f

k. Now read the total event again. The mbm_total_bytes should display
only the read events.

#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
6136000

l. Unmount the resctrl

#umount /sys/fs/resctrl/

NOTE: For simplicity, these examples are run on the default resctrl group.
Similar experiments can be run on non-default groups.
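
The masks used in examples i and j (0x7f, 0x15, 0x33) can be decoded bit by
bit. The sketch below is a userspace illustration only; the bit
descriptions are paraphrased from the kernel's resctrl documentation for
the event configuration files, and the names bw_type_names and
decode_bw_types are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Bit meanings, paraphrased from the kernel's resctrl documentation. */
static const char *const bw_type_names[7] = {
	"reads to memory in the local NUMA domain",		/* bit 0 */
	"reads to memory in the non-local NUMA domain",		/* bit 1 */
	"non-temporal writes to the local NUMA domain",		/* bit 2 */
	"non-temporal writes to the non-local NUMA domain",	/* bit 3 */
	"reads to slow memory in the local NUMA domain",	/* bit 4 */
	"reads to slow memory in the non-local NUMA domain",	/* bit 5 */
	"dirty victims from the QoS domain to all memory",	/* bit 6 */
};

/* Print one line per selected event type; return how many are selected. */
static int decode_bw_types(unsigned int mask)
{
	int count = 0;

	assert((mask & ~0x7fu) == 0);	/* only bits 0-6 are defined here */

	for (int bit = 0; bit < 7; bit++) {
		if (mask & (1u << bit)) {
			printf("bit %d: %s\n", bit, bw_type_names[bit]);
			count++;
		}
	}
	return count;
}
```

Calling decode_bw_types(0x15) lists the three local event types (bits 0, 2
and 4), while 0x33 selects the four read types (bits 0, 1, 4 and 5), which
is why example j describes it as counting only reads.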
---

Babu Moger (15):
x86/resctrl: Remove hard-coded memory bandwidth limit
x86/resctrl: Remove hard-coded memory bandwidth event configuration
x86/resctrl: Add support for Assignable Bandwidth Monitoring Counters
(ABMC)
x86/resctrl: Add ABMC feature in the command line options
x86/resctrl: Detect ABMC feature details
x86/resctrl: Add the mount option for ABMC feature
x86/resctrl: Add support to enable/disable ABMC feature
x86/resctrl: Introduce interface to display number of ABMC counters
x86/resctrl: Add interface to display monitor state of the group
x86/resctrl: Initialize ABMC counters bitmap
x86/resctrl: Add data structures for ABMC assignment
x86/resctrl: Introduce mbm_total_cfg and mbm_local_cfg
x86/resctrl: Add the interface to assign a ABMC counter
x86/resctrl: Add interface unassign a ABMC counter
x86/resctrl: Update ABMC assignment on event configuration changes

.../admin-guide/kernel-parameters.txt | 2 +-
Documentation/arch/x86/resctrl.rst | 52 +++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/msr-index.h | 2 +
arch/x86/kernel/cpu/cpuid-deps.c | 2 +
arch/x86/kernel/cpu/resctrl/core.c | 23 +-
arch/x86/kernel/cpu/resctrl/internal.h | 49 ++-
arch/x86/kernel/cpu/resctrl/monitor.c | 22 +
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 415 +++++++++++++++++-
arch/x86/kernel/cpu/scattered.c | 1 +
include/linux/resctrl.h | 2 +
11 files changed, 562 insertions(+), 9 deletions(-)

--
2.34.1


2023-12-01 00:58:08

by Babu Moger

Subject: [PATCH 06/15] x86/resctrl: Add the mount option for ABMC feature

Add the mount option for ABMC (Assignable Bandwidth Monitoring Counters)
feature.

Signed-off-by: Babu Moger <[email protected]>
---
Documentation/arch/x86/resctrl.rst | 2 ++
arch/x86/kernel/cpu/resctrl/internal.h | 1 +
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 5 +++++
3 files changed, 8 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 1293cb6cba98..19e906f629d4 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -50,6 +50,8 @@ mount options are:
"debug":
Make debug files accessible. Available debug files are annotated with
"Available only with debug option".
+"abmc":
+ Enable ABMC (Assignable Bandwidth Monitoring Counters) feature.

L2 and L3 CDP are controlled separately.

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 0b22be85a444..b8f3a0b1ca41 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -56,6 +56,7 @@ struct rdt_fs_context {
bool enable_cdpl3;
bool enable_mba_mbps;
bool enable_debug;
+ bool enable_abmc;
};

static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index feeb57ee7888..a4328e12a8f6 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2687,6 +2687,7 @@ enum rdt_param {
Opt_cdpl2,
Opt_mba_mbps,
Opt_debug,
+ Opt_abmc,
nr__rdt_params
};

@@ -2695,6 +2696,7 @@ static const struct fs_parameter_spec rdt_fs_parameters[] = {
fsparam_flag("cdpl2", Opt_cdpl2),
fsparam_flag("mba_MBps", Opt_mba_mbps),
fsparam_flag("debug", Opt_debug),
+ fsparam_flag("abmc", Opt_abmc),
{}
};

@@ -2723,6 +2725,9 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
case Opt_debug:
ctx->enable_debug = true;
return 0;
+ case Opt_abmc:
+ ctx->enable_abmc = true;
+ return 0;
}

return -EINVAL;
--
2.34.1

2023-12-01 00:58:17

by Babu Moger

Subject: [PATCH 08/15] x86/resctrl: Introduce interface to display number of ABMC counters

The ABMC feature provides an option to the user to pin (or assign) the
RMID to the hardware counter and monitor the bandwidth for a longer
duration. There are only a limited number of hardware counters.

Provide the interface to display the number of ABMC counters supported.

Signed-off-by: Babu Moger <[email protected]>
---
Documentation/arch/x86/resctrl.rst | 4 ++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 29 +++++++++++++++++++++++++-
2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 19e906f629d4..87aa8eec71b7 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -266,6 +266,10 @@ with the following files:
# cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
0=0x30;1=0x30;3=0x15;4=0x15

+"abmc_counters":
+ Available when ABMC feature is enabled. The number of ABMC counters
+ available for assignment.
+
"max_threshold_occupancy":
Read/write file provides the largest value (in
bytes) at which a previously used LLC_occupancy
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 7f6ed903ba17..897707694cc8 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -811,6 +811,17 @@ static int rdtgroup_rmid_show(struct kernfs_open_file *of,
return ret;
}

+static int rdtgroup_abmc_counters_show(struct kernfs_open_file *of,
+ struct seq_file *s, void *v)
+{
+ struct rdt_resource *r = of->kn->parent->priv;
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+
+ seq_printf(s, "%d\n", hw_res->abmc_counters);
+
+ return 0;
+}
+
#ifdef CONFIG_PROC_CPU_RESCTRL

/*
@@ -1861,6 +1872,12 @@ static struct rftype res_common_files[] = {
.seq_show = mbm_local_bytes_config_show,
.write = mbm_local_bytes_config_write,
},
+ {
+ .name = "abmc_counters",
+ .mode = 0444,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .seq_show = rdtgroup_abmc_counters_show,
+ },
{
.name = "cpus",
.mode = 0644,
@@ -2419,12 +2436,22 @@ static void resctrl_abmc_disable(enum resctrl_res_level l)
int resctrl_arch_set_abmc_enabled(enum resctrl_res_level l, bool enable)
{
struct rdt_hw_resource *hw_res = &rdt_resources_all[l];
+ struct rftype *rft;

if (!hw_res->r_resctrl.abmc_capable)
return -EINVAL;

- if (enable)
+ if (enable) {
+ rft = rdtgroup_get_rftype_by_name("abmc_counters");
+ if (rft)
+ rft->fflags = RFTYPE_MON_INFO;
+
return resctrl_abmc_enable(l);
+ }
+
+ rft = rdtgroup_get_rftype_by_name("abmc_counters");
+ if (rft)
+ rft->fflags &= ~RFTYPE_MON_INFO;

resctrl_abmc_disable(l);

--
2.34.1
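
A monitoring tool could read the new file before deciding how many groups
it can pin. The following is a minimal userspace sketch under the
assumption that the file appears as
/sys/fs/resctrl/info/L3_MON/abmc_counters when resctrl is mounted with the
abmc option; the helper name read_abmc_counters is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Return the counter count stored in the file at 'path', or -1 if the
 * file cannot be opened or does not start with a decimal number.
 */
static int read_abmc_counters(const char *path)
{
	FILE *f;
	int n;

	assert(path);

	f = fopen(path, "r");
	if (!f)
		return -1;	/* e.g. ABMC not enabled, file not present */

	if (fscanf(f, "%d", &n) != 1)
		n = -1;

	fclose(f);
	return n;
}
```

A return value of -1 lets the caller fall back to the non-ABMC behaviour
when the file is absent.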

2023-12-01 00:58:34

by Babu Moger

Subject: [PATCH 04/15] x86/resctrl: Add ABMC feature in the command line options

Add the command line options to enable or disable the new resctrl feature
ABMC (Assignable Bandwidth Monitoring Counters).

Signed-off-by: Babu Moger <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 2 +-
Documentation/arch/x86/resctrl.rst | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 2 ++
3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 65731b060e3f..59a9e486fbbf 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5475,7 +5475,7 @@
rdt= [HW,X86,RDT]
Turn on/off individual RDT features. List is:
cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
- mba, smba, bmec.
+ mba, smba, bmec, abmc.
E.g. to turn on cmt and turn off mba use:
rdt=cmt,!mba

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df64a9d..d816ded93c22 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -26,6 +26,7 @@ MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation) "mba"
SMBA (Slow Memory Bandwidth Allocation) ""
BMEC (Bandwidth Monitoring Event Configuration) ""
+ABMC (Assignable Bandwidth Monitoring Counters) ""
=============================================== ================================

Historically, new features were made visible by default in /proc/cpuinfo. This
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 3fbae10b662d..a257017b4de5 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -678,6 +678,7 @@ enum {
RDT_FLAG_MBA,
RDT_FLAG_SMBA,
RDT_FLAG_BMEC,
+ RDT_FLAG_ABMC,
};

#define RDT_OPT(idx, n, f) \
@@ -703,6 +704,7 @@ static struct rdt_options rdt_options[] __initdata = {
RDT_OPT(RDT_FLAG_MBA, "mba", X86_FEATURE_MBA),
RDT_OPT(RDT_FLAG_SMBA, "smba", X86_FEATURE_SMBA),
RDT_OPT(RDT_FLAG_BMEC, "bmec", X86_FEATURE_BMEC),
+ RDT_OPT(RDT_FLAG_ABMC, "abmc", X86_FEATURE_ABMC),
};
#define NUM_RDT_OPTIONS ARRAY_SIZE(rdt_options)

--
2.34.1
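
The rdt= list syntax shown in kernel-parameters.txt (a name turns a feature
on, a leading '!' turns it off) can be illustrated with a small userspace
parser. This is a sketch for illustration, not the kernel's option
handling; rdt_option_state is a hypothetical helper, and letting the last
mention of a name win is an assumption of the sketch.

```c
#include <assert.h>
#include <string.h>

/*
 * Look up 'name' in a comma-separated rdt= list such as "cmt,!mba".
 * Returns 1 if the feature is turned on, 0 if it is turned off with a
 * leading '!', and -1 if the list does not mention it.
 */
static int rdt_option_state(const char *list, const char *name)
{
	size_t name_len;
	int state = -1;

	assert(list && name);
	name_len = strlen(name);

	while (*list) {
		const char *end = strchr(list, ',');
		size_t len = end ? (size_t)(end - list) : strlen(list);
		int off = (len > 0 && list[0] == '!');
		const char *tok = list + off;
		size_t tok_len = len - (size_t)off;

		if (tok_len == name_len && strncmp(tok, name, name_len) == 0)
			state = off ? 0 : 1;

		list += len;
		if (*list == ',')
			list++;
	}
	return state;
}
```

For example, rdt_option_state("cmt,!mba", "mba") reports that mba is turned
off, and a list containing "!abmc" would disable the new feature in the
same way.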

2023-12-01 00:58:34

by Babu Moger

Subject: [PATCH 07/15] x86/resctrl: Add support to enable/disable ABMC feature

Set up the system to enable or disable the ABMC feature. ABMC is disabled
by default. The user needs to mount resctrl with the -o abmc option to
enable the feature.

ABMC is enabled by setting the enable bit (bit 0) in the L3_QOS_EXT_CFG
MSR. When the state of ABMC is changed, it must be changed to the updated
value on all logical processors in the QOS Domain.

The ABMC feature details are available in APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Signed-off-by: Babu Moger <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 10 ++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 79 +++++++++++++++++++++++++-
3 files changed, 89 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 737a52b89e64..a2086aad580c 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1123,6 +1123,7 @@
#define MSR_IA32_MBA_BW_BASE 0xc0000200
#define MSR_IA32_SMBA_BW_BASE 0xc0000280
#define MSR_IA32_EVT_CFG_BASE 0xc0000400
+#define MSR_IA32_L3_QOS_EXT_CFG 0xc00003ff

/* MSR_IA32_VMX_MISC bits */
#define MSR_IA32_VMX_MISC_INTEL_PT (1ULL << 14)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index b8f3a0b1ca41..2801bc0dc132 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -50,6 +50,9 @@
/* Dirty Victims to All Types of Memory */
#define DIRTY_VICTIMS_TO_ALL_MEM BIT(6)

+/* ABMC ENABLE */
+#define ABMC_ENABLE BIT(0)
+
struct rdt_fs_context {
struct kernfs_fs_context kfc;
bool enable_cdpl2;
@@ -395,6 +398,7 @@ struct rdt_parse_data {
* @mon_scale: cqm counter * mon_scale = occupancy in bytes
* @mbm_width: Monitor width, to detect and correct for overflow.
* @cdp_enabled: CDP state of this resource
+ * @abmc_enabled: ABMC feature is enabled
*
* Members of this structure are either private to the architecture
* e.g. mbm_width, or accessed via helpers that provide abstraction. e.g.
@@ -410,6 +414,7 @@ struct rdt_hw_resource {
unsigned int mon_scale;
unsigned int mbm_width;
bool cdp_enabled;
+ bool abmc_enabled;
};

static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r)
@@ -455,6 +460,11 @@ static inline bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)

int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);

+static inline bool resctrl_arch_get_abmc_enabled(enum resctrl_res_level l)
+{
+ return rdt_resources_all[l].abmc_enabled;
+}
+
/*
* To return the common struct rdt_resource, which is contained in struct
* rdt_hw_resource, walk the resctrl member of struct rdt_hw_resource.
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index a4328e12a8f6..7f6ed903ba17 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2365,6 +2365,72 @@ int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable)
return 0;
}

+static void resctrl_abmc_msrwrite(void *arg)
+{
+ bool *enable = arg;
+ u64 msrval;
+
+ rdmsrl(MSR_IA32_L3_QOS_EXT_CFG, msrval);
+
+ if (*enable)
+ msrval |= ABMC_ENABLE;
+ else
+ msrval &= ~ABMC_ENABLE;
+
+ wrmsrl(MSR_IA32_L3_QOS_EXT_CFG, msrval);
+}
+
+static int resctrl_abmc_setup(enum resctrl_res_level l, bool enable)
+{
+ struct rdt_resource *r = &rdt_resources_all[l].r_resctrl;
+ struct rdt_domain *d;
+
+ /* Update QOS_CFG MSR on all the CPUs in cpu_mask */
+ list_for_each_entry(d, &r->domains, list)
+ on_each_cpu_mask(&d->cpu_mask, resctrl_abmc_msrwrite, &enable, 1);
+
+ return 0;
+}
+
+static int resctrl_abmc_enable(enum resctrl_res_level l)
+{
+ struct rdt_hw_resource *hw_res = &rdt_resources_all[l];
+ int ret = 0;
+
+ if (!hw_res->abmc_enabled) {
+ ret = resctrl_abmc_setup(l, true);
+ if (!ret)
+ hw_res->abmc_enabled = true;
+ }
+
+ return ret;
+}
+
+static void resctrl_abmc_disable(enum resctrl_res_level l)
+{
+ struct rdt_hw_resource *hw_res = &rdt_resources_all[l];
+
+ if (hw_res->abmc_enabled) {
+ resctrl_abmc_setup(l, false);
+ hw_res->abmc_enabled = false;
+ }
+}
+
+int resctrl_arch_set_abmc_enabled(enum resctrl_res_level l, bool enable)
+{
+ struct rdt_hw_resource *hw_res = &rdt_resources_all[l];
+
+ if (!hw_res->r_resctrl.abmc_capable)
+ return -EINVAL;
+
+ if (enable)
+ return resctrl_abmc_enable(l);
+
+ resctrl_abmc_disable(l);
+
+ return 0;
+}
+
/*
* We don't allow rdtgroup directories to be created anywhere
* except the root directory. Thus when looking for the rdtgroup
@@ -2449,7 +2515,7 @@ static void rdt_disable_ctx(void)
resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false);
resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false);
set_mba_sc(false);
-
+ resctrl_arch_set_abmc_enabled(RDT_RESOURCE_L3, false);
resctrl_debug = false;
}

@@ -2475,11 +2541,19 @@ static int rdt_enable_ctx(struct rdt_fs_context *ctx)
goto out_cdpl3;
}

+ if (ctx->enable_abmc) {
+ ret = resctrl_arch_set_abmc_enabled(RDT_RESOURCE_L3, true);
+ if (ret)
+ goto out_mba_mbps;
+ }
+
if (ctx->enable_debug)
resctrl_debug = true;

return 0;

+out_mba_mbps:
+ set_mba_sc(false);
out_cdpl3:
resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false);
out_cdpl2:
@@ -3802,6 +3876,9 @@ static int rdtgroup_show_options(struct seq_file *seq, struct kernfs_root *kf)
if (resctrl_debug)
seq_puts(seq, ",debug");

+ if (resctrl_arch_get_abmc_enabled(RDT_RESOURCE_L3))
+ seq_puts(seq, ",abmc");
+
return 0;
}

--
2.34.1
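
The requirement that every logical processor in a QOS domain sees the same
ABMC state can be modelled without touching real MSRs. In the sketch below
an array stands in for the per-CPU copies of L3_QOS_EXT_CFG; the function
name abmc_set_enabled and the array model are assumptions made for
illustration, and only bit 0 is modified, mirroring the read-modify-write
in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ABMC_ENABLE (UINT64_C(1) << 0)	/* bit 0 of L3_QOS_EXT_CFG */

/*
 * Apply the new ABMC state to every simulated per-CPU register value
 * in one domain, leaving all other bits untouched.
 */
static void abmc_set_enabled(uint64_t *ext_cfg, size_t ncpus, bool enable)
{
	assert(ext_cfg || ncpus == 0);

	for (size_t i = 0; i < ncpus; i++) {
		if (enable)
			ext_cfg[i] |= ABMC_ENABLE;
		else
			ext_cfg[i] &= ~ABMC_ENABLE;
	}
}
```

Setting and clearing only the enable bit keeps whatever else is stored in
the register intact, which is why the patch reads the MSR before writing it
back.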

2023-12-01 00:58:40

by Babu Moger

Subject: [PATCH 11/15] x86/resctrl: Add data structures for ABMC assignment

Assignable Bandwidth Monitoring Counters (ABMC) can be configured by
writing to the L3_QOS_ABMC_CFG MSR. When ABMC is enabled, the user can
configure a counter by writing to L3_QOS_ABMC_CFG, setting the CfgEn field
while specifying the Bandwidth Source, Bandwidth Types, and Counter
Identifier. Add the MSR definition and individual field definitions.

MSR L3_QOS_ABMC_CFG (C000_03FDh) definitions.

==========================================================================
Bits    Mnemonic  Description                  Access Type  Reset Value
==========================================================================
63      CfgEn     Configuration Enable         R/W          0

62      CtrEn     Counter Enable               R/W          0

61:53   –         Reserved                     MBZ          0

52:48   CtrID     Counter Identifier           R/W          0

47      IsCOS     BwSrc field is a COS         R/W          0
                  (not an RMID)

46:44   –         Reserved                     MBZ          0

43:32   BwSrc     Bandwidth Source             R/W          0
                  (RMID or COS)

31:0    BwType    Bandwidth types to           R/W          0
                  track for this counter
==========================================================================

The feature details are available in APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Signed-off-by: Babu Moger <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 22 ++++++++++++++++++++++
2 files changed, 23 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a2086aad580c..ec85f6733eda 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1124,6 +1124,7 @@
#define MSR_IA32_SMBA_BW_BASE 0xc0000280
#define MSR_IA32_EVT_CFG_BASE 0xc0000400
#define MSR_IA32_L3_QOS_EXT_CFG 0xc00003ff
+#define MSR_IA32_L3_QOS_ABMC_CFG 0xc00003fd

/* MSR_IA32_VMX_MISC bits */
#define MSR_IA32_VMX_MISC_INTEL_PT (1ULL << 14)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index bc36acd152be..ca4b551dc808 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -59,6 +59,8 @@
#define TOTAL_ASSIGN BIT(0)
#define LOCAL_ASSIGN BIT(1)

+#define ABMC_MAX_PER_GROUP 2
+
struct rdt_fs_context {
struct kernfs_fs_context kfc;
bool enable_cdpl2;
@@ -168,6 +170,7 @@ enum rdtgrp_mode {
* @crdtgrp_list: child rdtgroup node list
* @rmid: rmid for this rdtgroup
* @monitor_state: ABMC state of the group
+ * @abmc_ctr_id: ABMC counterids assigned to this group
*/
struct mongroup {
struct kernfs_node *mon_data_kn;
@@ -175,6 +178,7 @@ struct mongroup {
struct list_head crdtgrp_list;
u32 rmid;
u32 monitor_state;
+ u32 abmc_ctr_id[ABMC_MAX_PER_GROUP];
};

/**
@@ -527,6 +531,24 @@ union cpuid_0x10_x_edx {
unsigned int full;
};

+/*
+ * L3_QOS_ABMC_CFG MSR details. ABMC counters can be configured
+ * by writing to L3_QOS_ABMC_CFG.
+ */
+union l3_qos_abmc_cfg {
+ struct {
+ unsigned long bw_type :32,
+ bw_src :12,
+ rsvrd1 : 3,
+ is_cos : 1,
+ ctr_id : 5,
+ rsvrd : 9,
+ ctr_en : 1,
+ cfg_en : 1;
+ } split;
+ unsigned long full;
+};
+
void rdt_last_cmd_clear(void);
void rdt_last_cmd_puts(const char *s);
__printf(1, 2)
--
2.34.1
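
The field layout in the table of this commit message can also be expressed
with explicit shifts and masks, which avoids depending on how a compiler
lays out bitfields. This is a standalone sketch rather than code from the
series; abmc_cfg_pack, abmc_cfg_ctr_id and the ABMC_CFG_* macro names are
hypothetical, and only the bit positions come from the table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ABMC_CFG_CFG_EN		(UINT64_C(1) << 63)	/* CfgEn,  bit 63 */
#define ABMC_CFG_CTR_EN		(UINT64_C(1) << 62)	/* CtrEn,  bit 62 */
#define ABMC_CFG_CTR_ID_SHIFT	48			/* CtrID,  bits 52:48 */
#define ABMC_CFG_BW_SRC_SHIFT	32			/* BwSrc,  bits 43:32 */

/*
 * Build an L3_QOS_ABMC_CFG value for an RMID source (IsCOS left at 0).
 * CfgEn is always set; CtrEn follows 'ctr_en'.
 */
static uint64_t abmc_cfg_pack(unsigned int ctr_id, unsigned int rmid,
			      uint32_t bw_type, bool ctr_en)
{
	uint64_t val = ABMC_CFG_CFG_EN;

	assert(ctr_id < 32);	/* CtrID is a 5-bit field */
	assert(rmid < 4096);	/* BwSrc is a 12-bit field */

	if (ctr_en)
		val |= ABMC_CFG_CTR_EN;
	val |= (uint64_t)ctr_id << ABMC_CFG_CTR_ID_SHIFT;
	val |= (uint64_t)rmid << ABMC_CFG_BW_SRC_SHIFT;
	val |= bw_type;		/* BwType, bits 31:0 */

	return val;
}

/* Extract the 5-bit counter identifier from a packed value. */
static unsigned int abmc_cfg_ctr_id(uint64_t val)
{
	return (unsigned int)((val >> ABMC_CFG_CTR_ID_SHIFT) & 0x1f);
}
```

Packing counter 5 for RMID 0x12 with bandwidth types 0x7f and the counter
enabled sets bits 63 and 62 together with the three fields, which makes the
layout easy to check against the table.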

2023-12-01 00:58:42

by Babu Moger

Subject: [PATCH 10/15] x86/resctrl: Initialize ABMC counters bitmap

AMD hardware provides 32 ABMC (Assignable Bandwidth Monitoring Counters)
counters when supported. These hardware counters are assigned (or pinned)
to the RMID of a group.

Introduce the bitmap abmc_free_map to allocate and free counters.

Signed-off-by: Babu Moger <[email protected]>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index edb679b22b7b..f72d6d8c12df 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -164,6 +164,22 @@ static bool closid_allocated(unsigned int closid)
return (closid_free_map & (1 << closid)) == 0;
}

+static u64 abmc_free_map;
+static u32 abmc_free_map_len;
+
+static void abmc_counters_init(void)
+{
+ struct rdt_hw_resource *hw_res = &rdt_resources_all[RDT_RESOURCE_L3];
+
+ if (hw_res->abmc_counters > 64) {
+ hw_res->abmc_counters = 64;
+ WARN(1, "Cannot support more than 64 abmc counters\n");
+ }
+
+ abmc_free_map = BIT_MASK(hw_res->abmc_counters) - 1;
+ abmc_free_map_len = hw_res->abmc_counters;
+}
+
/**
* rdtgroup_mode_by_closid - Return mode of resource group with closid
* @closid: closid if the resource group
@@ -2715,6 +2731,7 @@ static int rdt_get_tree(struct fs_context *fc)
{
struct rdt_fs_context *ctx = rdt_fc2context(fc);
unsigned long flags = RFTYPE_CTRL_BASE;
+ struct rdt_hw_resource *hw_res;
struct rdt_domain *dom;
struct rdt_resource *r;
int ret;
@@ -2745,6 +2762,12 @@ static int rdt_get_tree(struct fs_context *fc)

closid_init();

+ r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ hw_res = resctrl_to_arch_res(r);
+
+ if (r->abmc_capable && hw_res->abmc_enabled)
+ abmc_counters_init();
+
if (rdt_mon_capable)
flags |= RFTYPE_MON;

@@ -2789,7 +2812,6 @@ static int rdt_get_tree(struct fs_context *fc)
static_branch_enable_cpuslocked(&rdt_enable_key);

if (is_mbm_enabled()) {
- r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
list_for_each_entry(dom, &r->domains, list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
}
--
2.34.1
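
The free-map idea from this patch can be tried out in userspace. The
sketch below is not the patch's code: it uses explicit 64-bit constants
throughout so that counter IDs of 32 and above, and a pool of exactly 64
counters, are handled without an out-of-range shift. The names free_map,
counters_init, counters_alloc and counters_free are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

static uint64_t free_map;	/* bit N set means counter N is free */

/* Mark counters 0..n-1 as free; n may be at most 64. */
static void counters_init(unsigned int n)
{
	assert(n <= 64);

	/* Shifting a 64-bit value by 64 is undefined, so special-case it. */
	free_map = (n == 64) ? UINT64_MAX : (UINT64_C(1) << n) - 1;
}

/* Hand out the lowest free counter ID, or -ENOSPC if none is left. */
static int counters_alloc(void)
{
	for (int id = 0; id < 64; id++) {
		if (free_map & (UINT64_C(1) << id)) {
			free_map &= ~(UINT64_C(1) << id);
			return id;
		}
	}
	return -ENOSPC;
}

/* Return a counter ID to the pool. */
static void counters_free(int id)
{
	assert(id >= 0 && id < 64);
	free_map |= UINT64_C(1) << id;
}
```

After counters_init(32), IDs are handed out in ascending order, the 33rd
request fails with -ENOSPC, and a freed ID becomes the next one allocated.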

2023-12-01 00:58:48

by Babu Moger

Subject: [PATCH 13/15] x86/resctrl: Add the interface to assign a ABMC counter

With the support of the ABMC (Assignable Bandwidth Monitoring Counters)
feature, the user has the option to pin (or assign) or unpin (or unassign)
the RMID to a hardware counter and monitor the bandwidth for a longer
duration.

Provide the interface to pin (or assign) the counter to the group.

The ABMC feature implements a pair of MSRs, L3_QOS_ABMC_CFG (MSR
C000_03FDh) and L3_QOS_ABMC_DSC (MSR C000_03FEh). Each logical processor
implements a separate copy of these registers. Attempts to read or write
these MSRs when ABMC is not enabled will result in a #GP(0) exception.

Individual assignable bandwidth counters are configured by writing to
L3_QOS_ABMC_CFG MSR and specifying the Counter ID, Bandwidth Source, and
Bandwidth Types. Reading L3_QOS_ABMC_DSC returns the configuration of the
counter specified by L3_QOS_ABMC_CFG [CtrID].

The feature details are available in APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Signed-off-by: Babu Moger <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
---
Documentation/arch/x86/resctrl.rst | 7 ++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 152 ++++++++++++++++++++++++-
2 files changed, 158 insertions(+), 1 deletion(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index d3df7d467eec..65306e7d01b6 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -410,6 +410,13 @@ When monitoring is enabled all MON groups will also contain:
# cat /sys/fs/resctrl/monitor_state
total=assign;local=assign

+ The user needs to pin (or assign) RMID to read the MBM event in
+ ABMC mode. Each event can be assigned or unassigned separately.
+ Example::
+
+ # echo total=assign > /sys/fs/resctrl/monitor_state
+ # echo total=assign;local=assign > /sys/fs/resctrl/monitor_state
+
"mon_hw_id":
Available only with debug option. The identifier used by hardware
for the monitor group. On x86 this is the RMID.
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 047aea628e2e..671ff732992c 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -180,6 +180,18 @@ static void abmc_counters_init(void)
abmc_free_map_len = hw_res->abmc_counters;
}

+static int abmc_counters_alloc(void)
+{
+ u32 counterid = ffs(abmc_free_map);
+
+ if (counterid == 0)
+ return -ENOSPC;
+ counterid--;
+ abmc_free_map &= ~(1 << counterid);
+
+ return counterid;
+}
+
/**
* rdtgroup_mode_by_closid - Return mode of resource group with closid
* @closid: closid if the resource group
@@ -1583,6 +1595,143 @@ static inline unsigned int mon_event_config_index_get(u32 evtid)
}
}

+static void rdtgroup_abmc_msrwrite(void *info)
+{
+ u64 *msrval = info;
+
+ wrmsrl(MSR_IA32_L3_QOS_ABMC_CFG, *msrval);
+}
+
+static void rdtgroup_abmc_domain(struct rdt_domain *d,
+ struct rdtgroup *rdtgrp,
+ u32 evtid, int index, bool assign)
+{
+ struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ union l3_qos_abmc_cfg abmc_cfg = { 0 };
+ struct arch_mbm_state *arch_mbm;
+
+ abmc_cfg.split.cfg_en = 1;
+ abmc_cfg.split.ctr_en = assign ? 1 : 0;
+ abmc_cfg.split.ctr_id = rdtgrp->mon.abmc_ctr_id[index];
+ abmc_cfg.split.bw_src = rdtgrp->mon.rmid;
+
+ /*
+ * Read the event configuration from the domain and pass it as
+ * bw_type.
+ */
+ if (evtid == QOS_L3_MBM_TOTAL_EVENT_ID) {
+ abmc_cfg.split.bw_type = hw_dom->mbm_total_cfg;
+ arch_mbm = &hw_dom->arch_mbm_total[rdtgrp->mon.rmid];
+ } else {
+ abmc_cfg.split.bw_type = hw_dom->mbm_local_cfg;
+ arch_mbm = &hw_dom->arch_mbm_local[rdtgrp->mon.rmid];
+ }
+
+ smp_call_function_any(&d->cpu_mask, rdtgroup_abmc_msrwrite, &abmc_cfg, 1);
+
+ /* Reset the internal counters */
+ if (arch_mbm)
+ memset(arch_mbm, 0, sizeof(struct arch_mbm_state));
+}
+
+static ssize_t rdtgroup_assign_abmc(struct rdtgroup *rdtgrp,
+ struct rdt_resource *r,
+ u32 evtid, int mon_state)
+{
+ int counterid = 0, index;
+ struct rdt_domain *d;
+
+ if (rdtgrp->mon.monitor_state & mon_state) {
+ rdt_last_cmd_puts("ABMC counter is assigned already\n");
+ return 0;
+ }
+
+ index = mon_event_config_index_get(evtid);
+ if (index == INVALID_CONFIG_INDEX) {
+ pr_warn_once("Invalid event id %d\n", evtid);
+ return -EINVAL;
+ }
+
+ /*
+ * Allocate a new counter and update domains
+ */
+ counterid = abmc_counters_alloc();
+ if (counterid < 0) {
+ rdt_last_cmd_puts("Out of ABMC counters\n");
+ return -ENOSPC;
+ }
+
+ rdtgrp->mon.abmc_ctr_id[index] = counterid;
+
+ list_for_each_entry(d, &r->domains, list)
+ rdtgroup_abmc_domain(d, rdtgrp, evtid, index, 1);
+
+ rdtgrp->mon.monitor_state |= mon_state;
+
+ return 0;
+}
+
+/**
+ * rdtgroup_monitor_state_write - Modify the resource group's assign
+ *
+ */
+static ssize_t rdtgroup_monitor_state_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ char *abmc_str, *event_str;
+ struct rdtgroup *rdtgrp;
+ int ret = 0, mon_state;
+ u32 evtid;
+
+ rdtgrp = rdtgroup_kn_lock_live(of->kn);
+ if (!rdtgrp) {
+ rdtgroup_kn_unlock(of->kn);
+ return -ENOENT;
+ }
+
+ rdt_last_cmd_clear();
+
+ while (buf && buf[0] != '\0') {
+ /* Start processing the strings for each domain */
+ abmc_str = strim(strsep(&buf, ";"));
+ event_str = strsep(&abmc_str, "=");
+
+ if (event_str && abmc_str) {
+ if (!strcmp(event_str, "total")) {
+ mon_state = TOTAL_ASSIGN;
+ evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
+ } else if (!strcmp(event_str, "local")) {
+ mon_state = LOCAL_ASSIGN;
+ evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
+ } else {
+ rdt_last_cmd_puts("Invalid ABMC event\n");
+ ret = -EINVAL;
+ break;
+ }
+
+ if (!strcmp(abmc_str, "assign")) {
+ ret = rdtgroup_assign_abmc(rdtgrp, r, evtid, mon_state);
+ if (ret) {
+ rdt_last_cmd_puts("ABMC assign failed\n");
+ break;
+ }
+ } else {
+ rdt_last_cmd_puts("Invalid ABMC event\n");
+ ret = -EINVAL;
+ break;
+ }
+ } else {
+ rdt_last_cmd_puts("Invalid ABMC input\n");
+ ret = -EINVAL;
+ break;
+ }
+ }
+
+ rdtgroup_kn_unlock(of->kn);
+ return ret ?: nbytes;
+}
+
static void mon_event_config_read(void *info)
{
struct mon_config_info *mon_info = info;
@@ -1944,9 +2093,10 @@ static struct rftype res_common_files[] = {
},
{
.name = "monitor_state",
- .mode = 0444,
+ .mode = 0644,
.kf_ops = &rdtgroup_kf_single_ops,
.seq_show = rdtgroup_monitor_state_show,
+ .write = rdtgroup_monitor_state_write,
},
{
.name = "tasks",
--
2.34.1
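
The "event=state" syntax accepted by monitor_state can be exercised in
userspace with a small parser. This is an illustrative sketch, not the
kernel function from the diff: parse_monitor_state is a hypothetical name,
the input is assumed to be already trimmed (no trailing newline), and the
"unassign" keyword is accepted as well, as in the documentation examples of
the series.

```c
#include <assert.h>
#include <string.h>

#define TOTAL_ASSIGN	(1u << 0)
#define LOCAL_ASSIGN	(1u << 1)

/*
 * Parse a string such as "total=assign;local=unassign".
 * On success return 0 and report the events to assign and unassign as
 * bitmasks; return -1 on an unknown event, unknown state or missing '='.
 */
static int parse_monitor_state(const char *s, unsigned int *assign,
			       unsigned int *unassign)
{
	assert(s && assign && unassign);
	*assign = *unassign = 0;

	while (*s) {
		const char *end = strchr(s, ';');
		size_t len = end ? (size_t)(end - s) : strlen(s);
		const char *eq = memchr(s, '=', len);
		size_t klen, vlen;
		unsigned int bit;

		if (!eq)
			return -1;
		klen = (size_t)(eq - s);
		vlen = len - klen - 1;

		if (klen == 5 && !strncmp(s, "total", 5))
			bit = TOTAL_ASSIGN;
		else if (klen == 5 && !strncmp(s, "local", 5))
			bit = LOCAL_ASSIGN;
		else
			return -1;

		if (vlen == 6 && !strncmp(eq + 1, "assign", 6))
			*assign |= bit;
		else if (vlen == 8 && !strncmp(eq + 1, "unassign", 8))
			*unassign |= bit;
		else
			return -1;

		s += len;
		if (*s == ';')
			s++;
	}
	return 0;
}
```

Because the sketch never modifies its input, it can be run directly on
string literals; the kernel code instead splits the writable buffer in
place with strsep().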

2023-12-01 00:58:54

by Babu Moger

Subject: [PATCH 14/15] x86/resctrl: Add interface unassign a ABMC counter

With the support of the ABMC (Assignable Bandwidth Monitoring Counters)
feature, the user has the option to pin (or assign) or unpin (or unassign)
the RMID to a hardware counter and monitor the bandwidth for a longer
duration.

Provide the interface to unpin (or unassign) the counter.

Signed-off-by: Babu Moger <[email protected]>
---
Documentation/arch/x86/resctrl.rst | 11 ++++++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 36 ++++++++++++++++++++++++++
2 files changed, 47 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 65306e7d01b6..b42b59a7ba3c 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -417,6 +417,17 @@ When monitoring is enabled all MON groups will also contain:
# echo total=assign > /sys/fs/resctrl/monitor_state
# echo total=assign;local=assign > /sys/fs/resctrl/monitor_state

+ The user needs to unpin (or unassign) the counter to release it.
+ Example::
+
+ # echo total=unassign > /sys/fs/resctrl/monitor_state
+ # cat /sys/fs/resctrl/monitor_state
+ total=unassign;local=assign
+
+ # echo "total=unassign;local=unassign" > /sys/fs/resctrl/monitor_state
+ # cat /sys/fs/resctrl/monitor_state
+ total=unassign;local=unassign
+
"mon_hw_id":
Available only with debug option. The identifier used by hardware
for the monitor group. On x86 this is the RMID.
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 671ff732992c..6eca47673344 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -192,6 +192,11 @@ static int abmc_counters_alloc(void)
return counterid;
}

+void abmc_counters_free(int counterid)
+{
+ abmc_free_map |= 1 << counterid;
+}
+
/**
* rdtgroup_mode_by_closid - Return mode of resource group with closid
* @closid: closid if the resource group
@@ -1671,6 +1676,31 @@ static ssize_t rdtgroup_assign_abmc(struct rdtgroup *rdtgrp,
return 0;
}

+static ssize_t rdtgroup_unassign_abmc(struct rdtgroup *rdtgrp,
+ struct rdt_resource *r,
+ u32 evtid, int mon_state)
+{
+ struct rdt_domain *d;
+ int index;
+
+ index = mon_event_config_index_get(evtid);
+ if (index == INVALID_CONFIG_INDEX) {
+ pr_warn_once("Invalid event id %d\n", evtid);
+ return -EINVAL;
+ }
+
+ if (rdtgrp->mon.monitor_state & mon_state) {
+ list_for_each_entry(d, &r->domains, list)
+ rdtgroup_abmc_domain(d, rdtgrp, evtid, index, 0);
+
+ abmc_counters_free(rdtgrp->mon.abmc_ctr_id[index]);
+ }
+
+ rdtgrp->mon.monitor_state &= ~mon_state;
+
+ return 0;
+}
+
/**
* rdtgroup_monitor_state_write - Modify the resource group's ABMC assignment state
*
@@ -1716,6 +1746,12 @@ static ssize_t rdtgroup_monitor_state_write(struct kernfs_open_file *of,
rdt_last_cmd_puts("ABMC assign failed\n");
break;
}
+ } else if (!strcmp(abmc_str, "unassign")) {
+ ret = rdtgroup_unassign_abmc(rdtgrp, r, evtid, mon_state);
+ if (ret) {
+ rdt_last_cmd_puts("ABMC unassign failed\n");
+ break;
+ }
} else {
rdt_last_cmd_puts("Invalid ABMC event\n");
ret = -EINVAL;
--
2.34.1
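
The counter bookkeeping behind abmc_counters_free() can be illustrated with a small free-map allocator. This is a hedged userspace sketch in C: only the tail of abmc_counters_alloc() is visible in this section, so the allocation side here is an assumption, and counter_alloc(), counter_free(), free_map and NUM_COUNTERS are hypothetical names. The sketch uses an unsigned constant for the shift, because shifting a signed 1 left by 31 is undefined behavior in C when int is 32 bits wide.

```c
#include <stdint.h>

#define NUM_COUNTERS 32	/* the cover letter mentions 32 assignable counters */

/* A set bit means the counter with that ID is free. */
static uint32_t free_map = UINT32_MAX;

/* Return the lowest free counter ID, or -1 if none is left. */
static int counter_alloc(void)
{
	for (int id = 0; id < NUM_COUNTERS; id++) {
		if (free_map & (UINT32_C(1) << id)) {
			free_map &= ~(UINT32_C(1) << id);
			return id;
		}
	}
	return -1;
}

/* Mark a counter ID as free again; out-of-range IDs are ignored. */
static void counter_free(int id)
{
	if (id >= 0 && id < NUM_COUNTERS)
		free_map |= UINT32_C(1) << id;
}
```

Once all 32 IDs are handed out, counter_alloc() returns -1 until some ID is released with counter_free(), which matches the cover letter's note that users must unassign counters before making new assignments.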

2023-12-01 00:58:55

by Babu Moger

[permalink] [raw]
Subject: [PATCH 09/15] x86/resctrl: Add interface to display monitor state of the group

The ABMC feature lets the user pin (or assign) an RMID to a hardware
counter and monitor the bandwidth for a longer duration. The RMID stays
active until the user unpins (or unassigns) it.

Add a new file, monitor_state, to the resctrl group interface to display
the assignment state of the group. The file is available when the resctrl
filesystem is mounted with "-o abmc".

By default, monitor_state is initialized to the unassigned state.
$ cat /sys/fs/resctrl/monitor_state
total=unassign;local=unassign

Signed-off-by: Babu Moger <[email protected]>
---
Documentation/arch/x86/resctrl.rst | 20 ++++++++++++++
arch/x86/kernel/cpu/resctrl/internal.h | 8 ++++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 36 ++++++++++++++++++++++++++
3 files changed, 64 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 87aa8eec71b7..d3df7d467eec 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -390,6 +390,26 @@ When monitoring is enabled all MON groups will also contain:
the sum for all tasks in the CTRL_MON group and all tasks in
MON groups. Please see example section for more details on usage.

+"monitor_state":
+ Available when the ABMC feature is enabled. ABMC lets the user pin
+ (or assign) an RMID to a hardware counter and monitor the
+ bandwidth for a longer duration. The RMID stays active until the
+ user unpins (or unassigns) it manually. Each group has two
+ assignable events, which are unassigned by default. The "total"
+ field reports the state of the MBM total bytes event and the
+ "local" field reports the state of the MBM local bytes event.
+
+ Example::
+
+ # cat /sys/fs/resctrl/monitor_state
+ total=unassign;local=unassign
+
+ When the events are assigned, the output will look like below.
+ Example::
+
+ # cat /sys/fs/resctrl/monitor_state
+ total=assign;local=assign
+
"mon_hw_id":
Available only with debug option. The identifier used by hardware
for the monitor group. On x86 this is the RMID.
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 2801bc0dc132..bc36acd152be 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -53,6 +53,12 @@
/* ABMC ENABLE */
#define ABMC_ENABLE BIT(0)

+/*
+ * monitor group's state when ABMC is enabled
+ */
+#define TOTAL_ASSIGN BIT(0)
+#define LOCAL_ASSIGN BIT(1)
+
struct rdt_fs_context {
struct kernfs_fs_context kfc;
bool enable_cdpl2;
@@ -161,12 +167,14 @@ enum rdtgrp_mode {
* @parent: parent rdtgrp
* @crdtgrp_list: child rdtgroup node list
* @rmid: rmid for this rdtgroup
+ * @monitor_state: ABMC state of the group
*/
struct mongroup {
struct kernfs_node *mon_data_kn;
struct rdtgroup *parent;
struct list_head crdtgrp_list;
u32 rmid;
+ u32 monitor_state;
};

/**
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 897707694cc8..edb679b22b7b 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -779,6 +779,26 @@ static int rdtgroup_tasks_show(struct kernfs_open_file *of,
return ret;
}

+static int rdtgroup_monitor_state_show(struct kernfs_open_file *of,
+ struct seq_file *s, void *v)
+{
+ struct rdtgroup *rdtgrp;
+ int ret = 0;
+
+ rdtgrp = rdtgroup_kn_lock_live(of->kn);
+ if (rdtgrp)
+ seq_printf(s, "total=%s;local=%s\n",
+ rdtgrp->mon.monitor_state & TOTAL_ASSIGN ?
+ "assign" : "unassign",
+ rdtgrp->mon.monitor_state & LOCAL_ASSIGN ?
+ "assign" : "unassign");
+ else
+ ret = -ENOENT;
+ rdtgroup_kn_unlock(of->kn);
+
+ return ret;
+}
+
static int rdtgroup_closid_show(struct kernfs_open_file *of,
struct seq_file *s, void *v)
{
@@ -1895,6 +1915,12 @@ static struct rftype res_common_files[] = {
.flags = RFTYPE_FLAGS_CPUS_LIST,
.fflags = RFTYPE_BASE,
},
+ {
+ .name = "monitor_state",
+ .mode = 0444,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .seq_show = rdtgroup_monitor_state_show,
+ },
{
.name = "tasks",
.mode = 0644,
@@ -2446,6 +2472,12 @@ int resctrl_arch_set_abmc_enabled(enum resctrl_res_level l, bool enable)
if (rft)
rft->fflags = RFTYPE_MON_INFO;

+ rft = rdtgroup_get_rftype_by_name("monitor_state");
+ if (rft)
+ rft->fflags = RFTYPE_MON_BASE;
+
+ rdtgroup_default.mon.monitor_state = 0;
+
return resctrl_abmc_enable(l);
}

@@ -2453,6 +2485,10 @@ int resctrl_arch_set_abmc_enabled(enum resctrl_res_level l, bool enable)
if (rft)
rft->fflags &= ~RFTYPE_MON_INFO;

+ rft = rdtgroup_get_rftype_by_name("monitor_state");
+ if (rft)
+ rft->fflags &= ~RFTYPE_MON_BASE;
+
resctrl_abmc_disable(l);

return 0;
--
2.34.1
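
The output format produced by rdtgroup_monitor_state_show() is easy to reproduce in userspace. The following is a small sketch in C that mirrors the seq_printf() call with snprintf(); format_monitor_state() is a hypothetical name, and the bit definitions are copied from the patch.

```c
#include <stdio.h>

#define TOTAL_ASSIGN (1u << 0)
#define LOCAL_ASSIGN (1u << 1)

/*
 * Render the two state bits as "total=<s>;local=<s>\n".
 * Returns the number of characters snprintf() would have written.
 */
static int format_monitor_state(unsigned int state, char *out, size_t len)
{
	return snprintf(out, len, "total=%s;local=%s\n",
			(state & TOTAL_ASSIGN) ? "assign" : "unassign",
			(state & LOCAL_ASSIGN) ? "assign" : "unassign");
}
```

A state of 0 renders as the default "total=unassign;local=unassign" line shown in the commit message.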

2023-12-01 00:58:56

by Babu Moger

[permalink] [raw]
Subject: [PATCH 15/15] x86/resctrl: Update ABMC assignment on event configuration changes

When the ABMC (Assignable Bandwidth Monitoring Counters) feature is enabled,
bandwidth events can be read using either of the following methods.

1. The contents of a specific counter can be read by setting the following
fields in QM_EVTSEL: [ExtendedEvtID]=1, [EvtID]=L3CacheABMC and setting
[RMID] to the desired counter ID. Reading QM_CTR will then return the
contents of the specified counter. The E bit will be set if the counter
configuration was invalid, or if an invalid counter ID was set in the
QM_EVTSEL[RMID] field. Supporting this method requires changes in
rmid_read interface.

2. Alternatively, the contents of a counter may be read by specifying an
RMID and setting the [EvtID] to L3BWMonEvtn where n= {0,1}. If an
assignable bandwidth counter is monitoring that RMID with a BwType bitmask
that matches a QOS_EVT_CFG_n, that counter’s value will be returned when
reading QM_CTR. However, if multiple counters have the same configuration,
QM_CTR will return the value of the counter with the lowest CtrID.

Method 2 is supported here. For the ABMC counter assignment to work,
the assignment must be updated so that BwType matches the contents of the
MSR QOS_EVT_CFG_n. So, update the ABMC assignment when the event
configuration changes.

The feature details are available in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Signed-off-by: Babu Moger <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 40 ++++++++++++++++++++++++++
1 file changed, 40 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 6eca47673344..11890b4afb9f 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1768,6 +1768,38 @@ static ssize_t rdtgroup_monitor_state_write(struct kernfs_open_file *of,
return ret ?: nbytes;
}

+static void rdtgroup_update_abmc(struct rdt_resource *r,
+ struct rdt_domain *d, u32 evtid)
+{
+ struct rdtgroup *prgrp, *crgrp;
+ int index, mon_state;
+
+ if (evtid == QOS_L3_MBM_TOTAL_EVENT_ID)
+ mon_state = TOTAL_ASSIGN;
+ else
+ mon_state = LOCAL_ASSIGN;
+
+ index = mon_event_config_index_get(evtid);
+ if (index == INVALID_CONFIG_INDEX) {
+ pr_warn_once("Invalid event id %d\n", evtid);
+ return;
+ }
+
+ /*
+ * Update the assignment for all the monitor groups if the group
+ * is configured with ABMC assignment.
+ */
+ list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
+ if (prgrp->mon.monitor_state & mon_state)
+ rdtgroup_abmc_domain(d, prgrp, evtid, index, 1);
+
+ list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) {
+ if (crgrp->mon.monitor_state & mon_state)
+ rdtgroup_abmc_domain(d, crgrp, evtid, index, 1);
+ }
+ }
+}
+
static void mon_event_config_read(void *info)
{
struct mon_config_info *mon_info = info;
@@ -1852,6 +1884,7 @@ static void mon_event_config_write(void *info)
static int mbm_config_write_domain(struct rdt_resource *r,
struct rdt_domain *d, u32 evtid, u32 val)
{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
struct mon_config_info mon_info = {0};
int ret = 0;
@@ -1892,6 +1925,13 @@ static int mbm_config_write_domain(struct rdt_resource *r,
else
goto out;

+ /*
+ * Event configuration changed for the domain, so update
+ * the ABMC assignment.
+ */
+ if (hw_res->abmc_enabled)
+ rdtgroup_update_abmc(r, d, evtid);
+
/*
* When an Event Configuration is changed, the bandwidth counters
* for all RMIDs and Events will be cleared by the hardware. The
--
2.34.1
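
The update step in rdtgroup_update_abmc() amounts to "reprogram every group that has this event assigned". A minimal userspace sketch in C of that idea follows. It is not the patch's code: struct mon_group, ctr_bw_type and update_assignments() are hypothetical, and the parent/child group lists of one domain are flattened into a plain array.

```c
#include <stddef.h>
#include <stdint.h>

#define TOTAL_ASSIGN (1u << 0)
#define LOCAL_ASSIGN (1u << 1)

/* Hypothetical stand-in for a monitor group and its two counter configs. */
struct mon_group {
	unsigned int monitor_state;	/* TOTAL_ASSIGN and/or LOCAL_ASSIGN */
	uint32_t ctr_bw_type[2];	/* index 0: total, index 1: local */
};

/*
 * Apply a new event configuration to every group that has the event
 * at 'index' assigned. Returns the number of groups updated.
 */
static size_t update_assignments(struct mon_group *groups, size_t n,
				 int index, uint32_t new_cfg)
{
	unsigned int bit = (index == 0) ? TOTAL_ASSIGN : LOCAL_ASSIGN;
	size_t updated = 0;

	for (size_t i = 0; i < n; i++) {
		if (groups[i].monitor_state & bit) {
			groups[i].ctr_bw_type[index] = new_cfg;
			updated++;
		}
	}
	return updated;
}
```

Groups without the event assigned are skipped, so they keep whatever configuration they had.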

2023-12-01 00:58:56

by Babu Moger

[permalink] [raw]
Subject: [PATCH 12/15] x86/resctrl: Introduce mbm_total_cfg and mbm_local_cfg

If the BMEC (Bandwidth Monitoring Event Configuration) feature is
supported, the bandwidth events can be configured. The event configuration
is domain specific. The ABMC (Assignable Bandwidth Monitoring Counters)
feature needs the event configuration to assign the hardware counters.

Save the event configuration information in the rdt_hw_domain, so it can
be used for ABMC assignment.

Signed-off-by: Babu Moger <[email protected]>
---
arch/x86/kernel/cpu/resctrl/core.c | 2 ++
arch/x86/kernel/cpu/resctrl/internal.h | 3 +++
arch/x86/kernel/cpu/resctrl/monitor.c | 11 +++++++++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 11 +++++++++++
4 files changed, 27 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 278698a74c49..5ac9991e81bc 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -556,6 +556,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;
}

+ arch_domain_mbm_evt_config(hw_dom);
+
list_add_tail(&d->list, add_pos);

err = resctrl_online_domain(r, d);
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index ca4b551dc808..bc1756a596f0 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -346,6 +346,8 @@ struct rdt_hw_domain {
u32 *ctrl_val;
struct arch_mbm_state *arch_mbm_total;
struct arch_mbm_state *arch_mbm_local;
+ u32 mbm_total_cfg;
+ u32 mbm_local_cfg;
};

static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
@@ -605,5 +607,6 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
void __init thread_throttle_mode_init(void);
void __init mbm_config_rftype_init(const char *config);
void rdt_staged_configs_clear(void);
+void arch_domain_mbm_evt_config(struct rdt_hw_domain *hw_dom);

#endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index c611b16ba259..34d3b0c7f2c6 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -854,3 +854,14 @@ void __init intel_rdt_mbm_apply_quirk(void)
mbm_cf_rmidthreshold = mbm_cf_table[cf_index].rmidthreshold;
mbm_cf = mbm_cf_table[cf_index].cf;
}
+
+void arch_domain_mbm_evt_config(struct rdt_hw_domain *hw_dom)
+{
+ if (mbm_total_event.configurable)
+ hw_dom->mbm_total_cfg = resctrl_max_evt_bitmask;
+
+ if (mbm_local_event.configurable)
+ hw_dom->mbm_local_cfg = READS_TO_LOCAL_MEM |
+ NON_TEMP_WRITE_TO_LOCAL_MEM |
+ READS_TO_LOCAL_S_MEM;
+}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index f72d6d8c12df..047aea628e2e 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1667,6 +1667,7 @@ static void mon_event_config_write(void *info)
static int mbm_config_write_domain(struct rdt_resource *r,
struct rdt_domain *d, u32 evtid, u32 val)
{
+ struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
struct mon_config_info mon_info = {0};
int ret = 0;

@@ -1696,6 +1697,16 @@ static int mbm_config_write_domain(struct rdt_resource *r,
smp_call_function_any(&d->cpu_mask, mon_event_config_write,
&mon_info, 1);

+ /*
+ * Update event config value in the domain when user changes it.
+ */
+ if (evtid == QOS_L3_MBM_TOTAL_EVENT_ID)
+ hw_dom->mbm_total_cfg = val;
+ else if (evtid == QOS_L3_MBM_LOCAL_EVENT_ID)
+ hw_dom->mbm_local_cfg = val;
+ else
+ goto out;
+
/*
* When an Event Configuration is changed, the bandwidth counters
* for all RMIDs and Events will be cleared by the hardware. The
--
2.34.1
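
The per-domain caching introduced here can be sketched in userspace C as below. This is an illustration under assumptions, not the patch's code: struct domain_cfg, enum mbm_event and both functions are hypothetical, and the bit positions of the three bandwidth-type flags are assumed from the kernel's resctrl internal.h, which is not quoted in this thread. The sketch also sets both defaults unconditionally, whereas the patch does so only when the event is configurable.

```c
#include <stdint.h>

/* Bit positions are an assumption; they are not shown in this thread. */
#define READS_TO_LOCAL_MEM		(1u << 0)
#define NON_TEMP_WRITE_TO_LOCAL_MEM	(1u << 2)
#define READS_TO_LOCAL_S_MEM		(1u << 4)

/* Hypothetical stand-in for the two MBM event IDs. */
enum mbm_event { MBM_TOTAL, MBM_LOCAL };

/* Hypothetical stand-in for the two fields added to rdt_hw_domain. */
struct domain_cfg {
	uint32_t mbm_total_cfg;
	uint32_t mbm_local_cfg;
};

/* Seed the cached values when a domain comes online. */
static void domain_cfg_init(struct domain_cfg *d, uint32_t max_evt_bitmask)
{
	d->mbm_total_cfg = max_evt_bitmask;
	d->mbm_local_cfg = READS_TO_LOCAL_MEM |
			   NON_TEMP_WRITE_TO_LOCAL_MEM |
			   READS_TO_LOCAL_S_MEM;
}

/* Record a user-written configuration; returns -1 for an unknown event. */
static int domain_cfg_update(struct domain_cfg *d, enum mbm_event evt,
			     uint32_t val)
{
	switch (evt) {
	case MBM_TOTAL:
		d->mbm_total_cfg = val;
		return 0;
	case MBM_LOCAL:
		d->mbm_local_cfg = val;
		return 0;
	}
	return -1;
}
```

Keeping the last written value in the domain lets a later counter assignment reuse it without reading the configuration MSR again.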

2023-12-01 00:59:25

by Babu Moger

[permalink] [raw]
Subject: [PATCH 05/15] x86/resctrl: Detect ABMC feature details

ABMC feature details are reported via CPUID Fn8000_0020_EBX_x5.

Bits    Description
15:0    MAX_ABMC Maximum Supported Assignable Bandwidth
        Monitoring Counter ID + 1

Detect the feature details and update
/sys/fs/resctrl/info/L3_MON/mon_features.

The feature details are available in APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Signed-off-by: Babu Moger <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
---
Documentation/arch/x86/resctrl.rst | 7 +++++++
arch/x86/kernel/cpu/resctrl/core.c | 17 +++++++++++++++++
arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 3 +++
include/linux/resctrl.h | 2 ++
5 files changed, 31 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index d816ded93c22..1293cb6cba98 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -197,6 +197,13 @@ with the following files:
mbm_local_bytes
mbm_local_bytes_config

+ If the system supports Assignable Bandwidth Monitoring
+ Counters (ABMC), the output also lists "abmc_capable".
+ Example::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mon_features
+ abmc_capable
+
"mbm_total_bytes_config", "mbm_local_bytes_config":
Read/write files containing the configuration for the mbm_total_bytes
and mbm_local_bytes events, respectively, when the Bandwidth
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index a257017b4de5..278698a74c49 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -303,6 +303,17 @@ static void rdt_get_cdp_l2_config(void)
rdt_get_cdp_config(RDT_RESOURCE_L2);
}

+static void rdt_get_abmc_cfg(struct rdt_resource *r)
+{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ u32 eax, ebx, ecx, edx;
+
+ r->abmc_capable = true;
+ /* Query CPUID_Fn80000020_EBX_x05 for number of ABMC counters */
+ cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
+ hw_res->abmc_counters = (ebx & 0xFFFF) + 1;
+}
+
static void
mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
{
@@ -815,6 +826,12 @@ static __init bool get_rdt_alloc_resources(void)
if (get_slow_mem_config())
ret = true;

+ if (rdt_cpu_has(X86_FEATURE_ABMC)) {
+ r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ rdt_get_abmc_cfg(r);
+ ret = true;
+ }
+
return ret;
}

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 524d8bec1439..0b22be85a444 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -388,6 +388,7 @@ struct rdt_parse_data {
* resctrl_arch_get_num_closid() to avoid confusion
* with struct resctrl_schema's property of the same name,
* which has been corrected for features like CDP.
+ * @abmc_counters: Maximum number of ABMC counters supported
* @msr_base: Base MSR address for CBMs
* @msr_update: Function pointer to update QOS MSRs
* @mon_scale: cqm counter * mon_scale = occupancy in bytes
@@ -401,6 +402,7 @@ struct rdt_parse_data {
struct rdt_hw_resource {
struct rdt_resource r_resctrl;
u32 num_closid;
+ u32 abmc_counters;
unsigned int msr_base;
void (*msr_update) (struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 6c22718dbaa2..feeb57ee7888 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1077,6 +1077,9 @@ static int rdt_mon_features_show(struct kernfs_open_file *of,
seq_printf(seq, "%s_config\n", mevt->name);
}

+ if (r->abmc_capable)
+ seq_printf(seq, "abmc_capable\n");
+
return 0;
}

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 66942d7fba7f..656af479a19b 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -162,6 +162,7 @@ struct resctrl_schema;
* @evt_list: List of monitoring events
* @fflags: flags to choose base and info files
* @cdp_capable: Is the CDP feature available on this resource
+ * @abmc_capable: Is the system capable of supporting the ABMC feature?
*/
struct rdt_resource {
int rid;
@@ -182,6 +183,7 @@ struct rdt_resource {
struct list_head evt_list;
unsigned long fflags;
bool cdp_capable;
+ bool abmc_capable;
};

/**
--
2.34.1
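
The field extraction in rdt_get_abmc_cfg() can be checked in isolation. The sketch below, in C, mirrors the patch's arithmetic on a caller-supplied EBX value instead of executing CPUID; abmc_counters_from_ebx() is a hypothetical helper. Whether the hardware field already includes the "+ 1" is not settled by the text in this thread, so the sketch simply follows what the patch computes.

```c
#include <stdint.h>

/*
 * Derive the counter count from CPUID Fn8000_0020 EBX (sub-leaf 5),
 * the same way the patch does: take bits 15:0 and add one.
 */
static uint32_t abmc_counters_from_ebx(uint32_t ebx)
{
	return (ebx & 0xFFFFu) + 1;
}
```

With this computation, an EBX[15:0] value of 31 yields 32 counters, and the upper 16 bits of EBX are ignored.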

2023-12-05 00:14:17

by Peter Newman

[permalink] [raw]
Subject: Re: [PATCH 00/15] x86/resctrl : Support AMD QoS RMID Pinning feature

[+James]

Hi James,

On Thu, Nov 30, 2023 at 4:57 PM Babu Moger <[email protected]> wrote:
>
> These series adds the support for AMD QoS RMID Pinning feature. It is also
> called ABMC (Assignable Bandwidth Monitoring Counters) feature.
>
> The feature details are available in APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> Monitoring (ABMC). The documentation is available at
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>
> The patches are based on top of commit
> 346887b65d89ae987698bc1efd8e5536bd180b3f (tip/master)
>
> # Introduction
>
> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
> feature only guarantees that RMIDs currently assigned to a processor will
> be tracked by hardware. The counters of any other RMIDs which are no
> longer being tracked will be reset to zero. The MBM event counters return
> "Unavailable" for the RMIDs that are not active.
>
> Users can create 256 or more monitor groups. But there can be only limited
> number of groups that can be give guaranteed monitoring numbers. With ever
> changing system configuration, there is no way to definitely know which of
> these groups will be active for certain point of time. Users do not have
> the option to monitor a group or set of groups for certain period of time
> without worrying about RMID being reset in between.
>
> The ABMC feature provides an option to pin (or assign) the RMID to the
> hardware counter and monitor the bandwidth for a longer duration. The
> pinned RMID will be active until the user unpins (or unassigns) it. There
> is no need to worry about counters being reset during this period.
> Additionally, the user can specify a bitmask identifying the specific
> bandwidth types from the given source to track with the counter.
>
> # Linux Implementation
>
> Hardware provides total of 32 counters available for assignment.
> Each Linux resctrl group can be assigned a maximum of 2 counters. One for
> mbm_total_bytes and one for mbm_local_bytes. Users also have the option to
> assign only one counter to the group. If the system runs out of assignable
> counters, the kernel will display the error when the user attempts a new
> counter assignment. Users need to unassign already used counters for new
> assignments.
>
> # Examples
>
> a. Check if ABMC support is available
> #mount -t resctrl resctrl /sys/fs/resctrl/
> #cat /sys/fs/resctrl/info/L3_MON/mon_features
> llc_occupancy
> mbm_total_bytes
> mbm_total_bytes_config
> mbm_local_bytes
> mbm_local_bytes_config
> abmc_capable ← Linux kernel detected ABMC feature.
>
> b. Mount with ABMC support
> #umount /sys/fs/resctrl/
> #mount -o abmc -t resctrl resctrl /sys/fs/resctrl/
>
> c. Read the monitor states. There will be new file "monitor_state"
> for each monitor group when ABMC feature is enabled. By default,
> both total and local MBM events are in "unassign" state.
>
> #cat /sys/fs/resctrl/monitor_state
> total=unassign;local=unassign
>
> d. Read the event mbm_total_bytes and mbm_local_bytes. Note that MBA
> events are not available until the user assigns the events explicitly.
> Users need to assign the counters to monitor the events in this mode.
>
> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> Unavailable
>
> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> Unavailable
>
> e. Assign a h/w counter to the total event and read the monitor_state.
>
> #echo total=assign > /sys/fs/resctrl/monitor_state
> #cat /sys/fs/resctrl/monitor_state
> total=assign;local=unassign
>
> f. Now that the total event is assigned. Read the total event.
>
> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> 6136000
>
> g. Assign a h/w counter to both total and local events and read the monitor_state.
>
> #echo "total=assign;local=assign" > /sys/fs/resctrl/monitor_state
> #cat /sys/fs/resctrl/monitor_state
> total=assign;local=assign
>
> h. Now that both total and local events are assigned, read the events.
>
> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> 6136000
> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> 58694

We had briefly discussed this topic of explicit counter assignment in
resctrl earlier this year[1], but you didn't want it to be unique to
MPAM.

Now that a similar capability exists on AMD and an interface is being
proposed, we can talk about this in the context of MPAM again.

With some generalization and refinements, I expect this proposal could
be applied to assigning a limited number of MBWU monitors to
monitoring groups.

Also, I had proposed in another thread[2] applying such an interface
to previous AMD hardware where the monitor assignments cannot be
directly controlled to avoid or reduce the overhead in my soft RMID
proposal.

Thanks!
-Peter

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/CALPaoCjg-W3w8OKLHP_g6Evoo03fbgaOQZrGTLX6vdSLp70=SA@mail.gmail.com/

2023-12-05 16:49:34

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 07/15] x86/resctrl: Add support to enable/disable ABMC feature

Hi Babu,

kernel test robot noticed the following build warnings:

[auto build test WARNING on next-20231130]
[cannot apply to tip/x86/core linus/master v6.7-rc3 v6.7-rc2 v6.7-rc1 v6.7-rc4]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Babu-Moger/x86-resctrl-Remove-hard-coded-memory-bandwidth-limit/20231201-090621
base: next-20231130
patch link: https://lore.kernel.org/r/20231201005720.235639-8-babu.moger%40amd.com
patch subject: [PATCH 07/15] x86/resctrl: Add support to enable/disable ABMC feature
config: i386-randconfig-013-20231202 (https://download.01.org/0day-ci/archive/20231206/[email protected]/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231206/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

>> arch/x86/kernel/cpu/resctrl/rdtgroup.c:2419:5: warning: no previous prototype for 'resctrl_arch_set_abmc_enabled' [-Wmissing-prototypes]
2419 | int resctrl_arch_set_abmc_enabled(enum resctrl_res_level l, bool enable)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~


vim +/resctrl_arch_set_abmc_enabled +2419 arch/x86/kernel/cpu/resctrl/rdtgroup.c

2418
> 2419 int resctrl_arch_set_abmc_enabled(enum resctrl_res_level l, bool enable)
2420 {
2421 struct rdt_hw_resource *hw_res = &rdt_resources_all[l];
2422
2423 if (!hw_res->r_resctrl.abmc_capable)
2424 return -EINVAL;
2425
2426 if (enable)
2427 return resctrl_abmc_enable(l);
2428
2429 resctrl_abmc_disable(l);
2430
2431 return 0;
2432 }
2433

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-12-05 17:40:24

by Babu Moger

[permalink] [raw]
Subject: Re: [PATCH 07/15] x86/resctrl: Add support to enable/disable ABMC feature



On 12/5/23 10:48, kernel test robot wrote:
> Hi Babu,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on next-20231130]
> [cannot apply to tip/x86/core linus/master v6.7-rc3 v6.7-rc2 v6.7-rc1 v6.7-rc4]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Babu-Moger/x86-resctrl-Remove-hard-coded-memory-bandwidth-limit/20231201-090621
> base: next-20231130
> patch link: https://lore.kernel.org/r/20231201005720.235639-8-babu.moger%40amd.com
> patch subject: [PATCH 07/15] x86/resctrl: Add support to enable/disable ABMC feature
> config: i386-randconfig-013-20231202 (https://download.01.org/0day-ci/archive/20231206/[email protected]/config)
> compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231206/[email protected]/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <[email protected]>
> | Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>
> All warnings (new ones prefixed by >>):
>
>>> arch/x86/kernel/cpu/resctrl/rdtgroup.c:2419:5: warning: no previous prototype for 'resctrl_arch_set_abmc_enabled' [-Wmissing-prototypes]
> 2419 | int resctrl_arch_set_abmc_enabled(enum resctrl_res_level l, bool enable)
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>

Yes. Got it. This function requires a prototype in
arch/x86/kernel/cpu/resctrl/internal.h.

Will add it in the next revision after the other comments.

Thanks
Babu Moger
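
For reference, the -Wmissing-prototypes warning goes away once a declaration is visible before the non-static definition. The following is a minimal sketch of that pattern in C, with hypothetical stand-ins throughout: enum res_level, set_abmc_enabled() and the two arrays are not the kernel's names, and in the kernel the declaration would sit in the header named in the reply rather than in the same file.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for enum resctrl_res_level. */
enum res_level { RES_L3, RES_L2, RES_NUM_LEVELS };

/* Prototype: normally placed in a shared header. */
int set_abmc_enabled(enum res_level l, bool enable);

/* Hypothetical per-resource capability and state tables. */
static bool abmc_capable[RES_NUM_LEVELS] = { [RES_L3] = true };
static bool abmc_enabled[RES_NUM_LEVELS];

/* The definition now has a previous prototype, so W=1 builds stay quiet. */
int set_abmc_enabled(enum res_level l, bool enable)
{
	if (!abmc_capable[l])
		return -EINVAL;

	abmc_enabled[l] = enable;
	return 0;
}
```

For a helper used in only one file, marking it static is the other common way to silence the same warning.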


2023-12-05 17:56:32

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 14/15] x86/resctrl: Add interface unassign a ABMC counter

Hi Babu,

kernel test robot noticed the following build warnings:

[auto build test WARNING on next-20231130]
[cannot apply to tip/x86/core linus/master v6.7-rc3 v6.7-rc2 v6.7-rc1 v6.7-rc4]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Babu-Moger/x86-resctrl-Remove-hard-coded-memory-bandwidth-limit/20231201-090621
base: next-20231130
patch link: https://lore.kernel.org/r/20231201005720.235639-15-babu.moger%40amd.com
patch subject: [PATCH 14/15] x86/resctrl: Add interface unassign a ABMC counter
config: i386-randconfig-013-20231202 (https://download.01.org/0day-ci/archive/20231206/[email protected]/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231206/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

>> arch/x86/kernel/cpu/resctrl/rdtgroup.c:195:6: warning: no previous prototype for 'abmc_counters_free' [-Wmissing-prototypes]
195 | void abmc_counters_free(int counterid)
| ^~~~~~~~~~~~~~~~~~
arch/x86/kernel/cpu/resctrl/rdtgroup.c:2675:5: warning: no previous prototype for 'resctrl_arch_set_abmc_enabled' [-Wmissing-prototypes]
2675 | int resctrl_arch_set_abmc_enabled(enum resctrl_res_level l, bool enable)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~


vim +/abmc_counters_free +195 arch/x86/kernel/cpu/resctrl/rdtgroup.c

194
> 195 void abmc_counters_free(int counterid)
196 {
197 abmc_free_map |= 1 << counterid;
198 }
199

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-12-05 18:11:00

by Babu Moger

[permalink] [raw]
Subject: Re: [PATCH 14/15] x86/resctrl: Add interface unassign a ABMC counter



On 12/5/23 11:55, kernel test robot wrote:
> Hi Babu,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on next-20231130]
> [cannot apply to tip/x86/core linus/master v6.7-rc3 v6.7-rc2 v6.7-rc1 v6.7-rc4]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Babu-Moger/x86-resctrl-Remove-hard-coded-memory-bandwidth-limit/20231201-090621
> base: next-20231130
> patch link: https://lore.kernel.org/r/20231201005720.235639-15-babu.moger%40amd.com
> patch subject: [PATCH 14/15] x86/resctrl: Add interface unassign a ABMC counter
> config: i386-randconfig-013-20231202 (https://download.01.org/0day-ci/archive/20231206/[email protected]/config)
> compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231206/[email protected]/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <[email protected]>
> | Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>
> All warnings (new ones prefixed by >>):
>
>>> arch/x86/kernel/cpu/resctrl/rdtgroup.c:195:6: warning: no previous prototype for 'abmc_counters_free' [-Wmissing-prototypes]
> 195 | void abmc_counters_free(int counterid)

Yes. Got it. This function requires a prototype in
arch/x86/kernel/cpu/resctrl/internal.h.

Will add it in the next revision after the other comments.
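
For illustration only, here is a minimal userspace sketch of the shape of the
fix: the non-static function gets a declaration ahead of its definition (in
the kernel that declaration would sit in internal.h). Only
abmc_counters_free() and abmc_free_map come from the patch; the init and
alloc helpers, the use of uint32_t, and the unsigned shift are assumptions
made for this example, not the patch code.

```c
#include <stdint.h>

/* The cover letter states 32 assignable counters. */
#define ABMC_NUM_COUNTERS 32

/*
 * Prototypes. In the kernel these would be declared in
 * arch/x86/kernel/cpu/resctrl/internal.h, which is what silences
 * -Wmissing-prototypes for the non-static definitions below.
 * abmc_counters_init() and abmc_counters_alloc() are hypothetical helpers.
 */
void abmc_counters_init(void);
int abmc_counters_alloc(void);
void abmc_counters_free(int counterid);

/* One bit per counter; a set bit means the counter is free. */
static uint32_t abmc_free_map;

void abmc_counters_init(void)
{
	abmc_free_map = UINT32_MAX;	/* all 32 counters start out free */
}

/* Return the lowest free counter id, or -1 if none are left. */
int abmc_counters_alloc(void)
{
	for (int i = 0; i < ABMC_NUM_COUNTERS; i++) {
		if (abmc_free_map & (UINT32_C(1) << i)) {
			abmc_free_map &= ~(UINT32_C(1) << i);
			return i;
		}
	}
	return -1;
}

/* Mark a counter as free again; unsigned shift keeps bit 31 well defined. */
void abmc_counters_free(int counterid)
{
	abmc_free_map |= UINT32_C(1) << counterid;
}
```

The other option the compiler suggests, marking the function static, only
applies if nothing outside rdtgroup.c calls it.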
--
Thanks
Babu Moger

2023-12-05 18:51:38

by kernel test robot

Subject: Re: [PATCH 07/15] x86/resctrl: Add support to enable/disable ABMC feature

Hi Babu,

kernel test robot noticed the following build warnings:

[auto build test WARNING on next-20231130]
[cannot apply to tip/x86/core linus/master v6.7-rc3 v6.7-rc2 v6.7-rc1 v6.7-rc4]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Babu-Moger/x86-resctrl-Remove-hard-coded-memory-bandwidth-limit/20231201-090621
base: next-20231130
patch link: https://lore.kernel.org/r/20231201005720.235639-8-babu.moger%40amd.com
patch subject: [PATCH 07/15] x86/resctrl: Add support to enable/disable ABMC feature
config: i386-randconfig-141-20231202 (https://download.01.org/0day-ci/archive/20231206/[email protected]/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231206/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

>> arch/x86/kernel/cpu/resctrl/rdtgroup.c:2419:5: warning: no previous prototype for function 'resctrl_arch_set_abmc_enabled' [-Wmissing-prototypes]
int resctrl_arch_set_abmc_enabled(enum resctrl_res_level l, bool enable)
^
arch/x86/kernel/cpu/resctrl/rdtgroup.c:2419:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
int resctrl_arch_set_abmc_enabled(enum resctrl_res_level l, bool enable)
^
static
1 warning generated.


vim +/resctrl_arch_set_abmc_enabled +2419 arch/x86/kernel/cpu/resctrl/rdtgroup.c

2418
> 2419 int resctrl_arch_set_abmc_enabled(enum resctrl_res_level l, bool enable)
2420 {
2421 struct rdt_hw_resource *hw_res = &rdt_resources_all[l];
2422
2423 if (!hw_res->r_resctrl.abmc_capable)
2424 return -EINVAL;
2425
2426 if (enable)
2427 return resctrl_abmc_enable(l);
2428
2429 resctrl_abmc_disable(l);
2430
2431 return 0;
2432 }
2433

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-12-05 23:18:07

by Reinette Chatre

Subject: Re: [PATCH 00/15] x86/resctrl : Support AMD QoS RMID Pinning feature

(+James)

Hi Babu,

On 11/30/2023 4:57 PM, Babu Moger wrote:
> These series adds the support for AMD QoS RMID Pinning feature. It is also

"These series" - is this series part of a bigger work?

> called ABMC (Assignable Bandwidth Monitoring Counters) feature.
>
> The feature details are available in APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> Monitoring (ABMC). The documentation is available at
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>
> The patches are based on top of commit
> 346887b65d89ae987698bc1efd8e5536bd180b3f (tip/master)
>
> # Introduction
>
> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
> feature only guarantees that RMIDs currently assigned to a processor will
> be tracked by hardware. The counters of any other RMIDs which are no
> longer being tracked will be reset to zero. The MBM event counters return
> "Unavailable" for the RMIDs that are not active.
>
> Users can create 256 or more monitor groups. But there can be only limited
> number of groups that can be give guaranteed monitoring numbers. With ever
> changing system configuration, there is no way to definitely know which of
> these groups will be active for certain point of time. Users do not have
> the option to monitor a group or set of groups for certain period of time
> without worrying about RMID being reset in between.
>
> The ABMC feature provides an option to pin (or assign) the RMID to the
> hardware counter and monitor the bandwidth for a longer duration. The
> pinned RMID will be active until the user unpins (or unassigns) it. There
> is no need to worry about counters being reset during this period.
> Additionally, the user can specify a bitmask identifying the specific
> bandwidth types from the given source to track with the counter.
>
> # Linux Implementation
>
> Hardware provides total of 32 counters available for assignment.
> Each Linux resctrl group can be assigned a maximum of 2 counters. One for
> mbm_total_bytes and one for mbm_local_bytes. Users also have the option to
> assign only one counter to the group. If the system runs out of assignable
> counters, the kernel will display the error when the user attempts a new
> counter assignment. Users need to unassign already used counters for new
> assignments.
>
> # Examples
>
> a. Check if ABMC support is available
> #mount -t resctrl resctrl /sys/fs/resctrl/
> #cat /sys/fs/resctrl/info/L3_MON/mon_features
> llc_occupancy
> mbm_total_bytes
> mbm_total_bytes_config
> mbm_local_bytes
> mbm_local_bytes_config
> abmc_capable ← Linux kernel detected ABMC feature.

(Please start thinking about a new name that is not the AMD feature
name. This is added to the resctrl filesystem, which is the generic interface
used to work with different architectures. It thus needs to be generalized
to what the user requires and how it can be accommodated by the hardware ...
this is already expected to be needed by MPAM and having an AMD feature
name could cause confusion.)

>
> b. Mount with ABMC support
> #umount /sys/fs/resctrl/
> #mount -o abmc -t resctrl resctrl /sys/fs/resctrl/
>

hmmm ... so this requires the user to mount resctrl, determine if the
feature is supported, unmount resctrl, remount resctrl with feature enabled.
Could you please elaborate what prevents this feature from being enabled
without needing to remount resctrl?

> c. Read the monitor states. There will be new file "monitor_state"
> for each monitor group when ABMC feature is enabled. By default,
> both total and local MBM events are in "unassign" state.
>
> #cat /sys/fs/resctrl/monitor_state
> total=unassign;local=unassign
>
> d. Read the event mbm_total_bytes and mbm_local_bytes. Note that MBA
> events are not available until the user assigns the events explicitly.
> Users need to assign the counters to monitor the events in this mode.
>
> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> Unavailable

How is the llc_occupancy event impacted when ABMC is enabled? Can all RMIDs
still be used to track cache occupancy?

>
> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> Unavailable

I believe that "Unavailable" already has an accepted meaning within current
interface and is associated with temporary failure. Even the AMD spec states "This
is generally a temporary condition and subsequent reads may succeed". In the
scenario above there is no chance that this counter would produce a value later.
I do not think it is ideal to overload existing interface with different meanings
associated with a new hardware specific feature ... something like "Disabled" seems
more appropriate.

Considering this we may even consider using these files themselves as a
way to enable the counters if they are disabled. For example, just
"echo 1 > /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes" can be used
to enable this counter. No need for a new "monitor_state". Please note that this
is not an official proposal since there are two other use cases that still need to
be considered as we await James's feedback on how this may work for MPAM and
also how this may be useful on AMD hardware that does not support ABMC but
users may want to get similar benefits ([1])

>
> e. Assign a h/w counter to the total event and read the monitor_state.
>
> #echo total=assign > /sys/fs/resctrl/monitor_state
> #cat /sys/fs/resctrl/monitor_state
> total=assign;local=unassign
>
> f. Now that the total event is assigned. Read the total event.
>
> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> 6136000
>
> g. Assign a h/w counter to both total and local events and read the monitor_state.
>
> #echo "total=assign;local=assign" > /sys/fs/resctrl/monitor_state
> #cat /sys/fs/resctrl/monitor_state
> total=assign;local=assign
>
> h. Now that both total and local events are assigned, read the events.
>
> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> 6136000
> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> 58694

It looks like if not all RMIDs associated with parent and child groups
have counters then the accumulated counters would just treat the "unassigned"
ones as zero?

>
> i. Check the bandwidth configuration for the group. Note that bandwidth
> configuration has a domain scope. Total event defaults to 0x7F (to
> count all the events) and local event defaults to 0x15
> (to count all the local numa events). The event bitmap decoding is
> available in https://www.kernel.org/doc/Documentation/x86/resctrl.rst
> in section "mbm_total_bytes_config", "mbm_local_bytes_config":
>
> #cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
> 0=0x7f;1=0x7f
>
> #cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
> 0=0x15;1=0xi15


These would not be available if system does not support BMEC. From
patch #3 it does not seem as though ABMC is dependent on BMEC.

Is ABMC dependent on BMEC or are they just using the same
config bits?

>
> j. Change the bandwidth source for domain 0 for the total event to count only reads.
> Note that this change effects events on the domain 0.
>
> #echo total=0x33 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config

typo?

> #cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
> 0=0x33;1=0x7F
>
> k. Now read the total event again. The mbm_total_bytes should display
> only the read events.
>
> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> 6136000

hmmm ... seems like there is a need to make the MBM events configurable even
if BMEC is not supported.

Reinette


[1] https://lore.kernel.org/lkml/CALPaoCjg-W3w8OKLHP_g6Evoo03fbgaOQZrGTLX6vdSLp70=SA@mail.gmail.com/

2023-12-06 15:41:50

by Babu Moger

Subject: Re: [PATCH 00/15] x86/resctrl : Support AMD QoS RMID Pinning feature

Hi Reinette,

On 12/5/23 17:17, Reinette Chatre wrote:
> (+James)
>
> Hi Babu,
>
> On 11/30/2023 4:57 PM, Babu Moger wrote:
>> These series adds the support for AMD QoS RMID Pinning feature. It is also
>
> "These series" - is this series part of a bigger work?

No.
There are some plans to optimize rmid_reads. Peter is planning to
work on that. But both are independent of each other.

>
>> called ABMC (Assignable Bandwidth Monitoring Counters) feature.
>>
>> The feature details are available in APM listed below [1].
>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>> Monitoring (ABMC). The documentation is available at
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>
>> The patches are based on top of commit
>> 346887b65d89ae987698bc1efd8e5536bd180b3f (tip/master)
>>
>> # Introduction
>>
>> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
>> feature only guarantees that RMIDs currently assigned to a processor will
>> be tracked by hardware. The counters of any other RMIDs which are no
>> longer being tracked will be reset to zero. The MBM event counters return
>> "Unavailable" for the RMIDs that are not active.
>>
>> Users can create 256 or more monitor groups. But there can be only limited
>> number of groups that can be give guaranteed monitoring numbers. With ever
>> changing system configuration, there is no way to definitely know which of
>> these groups will be active for certain point of time. Users do not have
>> the option to monitor a group or set of groups for certain period of time
>> without worrying about RMID being reset in between.
>>
>> The ABMC feature provides an option to pin (or assign) the RMID to the
>> hardware counter and monitor the bandwidth for a longer duration. The
>> pinned RMID will be active until the user unpins (or unassigns) it. There
>> is no need to worry about counters being reset during this period.
>> Additionally, the user can specify a bitmask identifying the specific
>> bandwidth types from the given source to track with the counter.
>>
>> # Linux Implementation
>>
>> Hardware provides total of 32 counters available for assignment.
>> Each Linux resctrl group can be assigned a maximum of 2 counters. One for
>> mbm_total_bytes and one for mbm_local_bytes. Users also have the option to
>> assign only one counter to the group. If the system runs out of assignable
>> counters, the kernel will display the error when the user attempts a new
>> counter assignment. Users need to unassign already used counters for new
>> assignments.
>>
>> # Examples
>>
>> a. Check if ABMC support is available
>> #mount -t resctrl resctrl /sys/fs/resctrl/
>> #cat /sys/fs/resctrl/info/L3_MON/mon_features
>> llc_occupancy
>> mbm_total_bytes
>> mbm_total_bytes_config
>> mbm_local_bytes
>> mbm_local_bytes_config
>> abmc_capable ← Linux kernel detected ABMC feature.
>
> (Please start thinking about a new name that is not the AMD feature
> name. This is added to resctrl filesystem that is the generic interface
> used to work with different architectures. This thus needs to be generalized
> to what user requires and how it can be accommodated by the hardware ...
> this is already expected to be needed by MPAM and having a AMD feature
> name could cause confusion.)

Yes. Agree.

How about "assign_capable"?

>
>>
>> b. Mount with ABMC support
>> #umount /sys/fs/resctrl/
>> #mount -o abmc -t resctrl resctrl /sys/fs/resctrl/
>>
>
> hmmm ... so this requires the user to mount resctrl, determine if the
> feature is supported, unmount resctrl, remount resctrl with feature enabled.
> Could you please elaborate what prevents this feature from being enabled
> without needing to remount resctrl?

Spec says
"Enabling ABMC: ABMC is enabled by setting L3_QOS_EXT_CFG.ABMC_En=1 (see
Figure 19-7). When the state of ABMC_En is changed, it must be changed to
the updated value on all logical processors in the QOS Domain.
Upon transitions of the ABMC_En the following actions take place:
All ABMC assignable bandwidth counters are reset to 0.
The L3 default mode bandwidth counters are reset to 0.
The L3_QOS_ABMC_CFG MSR is reset to 0."

So, all the monitoring group counters will be reset.

It is technically possible to enable it without a remount. But ABMC mode
requires a few new files (in each group), which I add when resctrl is
mounted with "-o abmc". I thought that was the better option.

Otherwise we need to add these files whenever ABMC is supported (not only
when it is enabled), and add another file in /sys/fs/resctrl/info/L3_MON to
enable the feature on the fly.

Both are acceptable options. Any thoughts?


>
>> c. Read the monitor states. There will be new file "monitor_state"
>> for each monitor group when ABMC feature is enabled. By default,
>> both total and local MBM events are in "unassign" state.
>>
>> #cat /sys/fs/resctrl/monitor_state
>> total=unassign;local=unassign
>>
>> d. Read the event mbm_total_bytes and mbm_local_bytes. Note that MBA
>> events are not available until the user assigns the events explicitly.
>> Users need to assign the counters to monitor the events in this mode.
>>
>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> Unavailable
>
> How is the llc_occupancy event impacted when ABMC is enabled? Can all RMIDs
> still be used to track cache occupancy?

The llc_occupancy event is not impacted by ABMC mode. It can still be used to
track cache occupancy.

>
>>
>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>> Unavailable
>
> I believe that "Unavailable" already has an accepted meaning within current
> interface and is associated with temporary failure. Even the AMD spec states "This
> is generally a temporary condition and subsequent reads may succeed". In the
> scenario above there is no chance that this counter would produce a value later.
> I do not think it is ideal to overload existing interface with different meanings
> associated with a new hardware specific feature ... something like "Disabled" seems
> more appropriate.

Hardware still reports it as unavailable. Also, there are some error cases
in which hardware can report unavailable. We may not be able to tell the two
apart.

>
> Considering this we may even consider using these files themselves as a
> way to enable the counters if they are disabled. For example, just
> "echo 1 > /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes" can be used

I am not sure about this. This is specific to domain 0. This group can
have CPUs from multiple domains. I think we should have the interface for
all the domains (not for a specific domain).

> to enable this counter. No need for a new "monitor_state". Please note that this
> is not an official proposal since there are two other use cases that still need to
> be considered as we await James's feedback on how this may work for MPAM and
> also how this may be useful on AMD hardware that does not support ABMC but
> users may want to get similar benefits ([1])

OK. Let's wait for James's comments.
>
>>
>> e. Assign a h/w counter to the total event and read the monitor_state.
>>
>> #echo total=assign > /sys/fs/resctrl/monitor_state
>> #cat /sys/fs/resctrl/monitor_state
>> total=assign;local=unassign
>>
>> f. Now that the total event is assigned. Read the total event.
>>
>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> 6136000
>>
>> g. Assign a h/w counter to both total and local events and read the monitor_state.
>>
>> #echo "total=assign;local=assign" > /sys/fs/resctrl/monitor_state
>> #cat /sys/fs/resctrl/monitor_state
>> total=assign;local=assign
>>
>> h. Now that both total and local events are assigned, read the events.
>>
>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> 6136000
>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>> 58694
>
> It looks like if not all RMIDs asssociated with parent and child groups
> have counters then the accumulated counters would just treat the "unassigned"
> as zero?

That is correct.

>
>>
>> i. Check the bandwidth configuration for the group. Note that bandwidth
>> configuration has a domain scope. Total event defaults to 0x7F (to
>> count all the events) and local event defaults to 0x15
>> (to count all the local numa events). The event bitmap decoding is
>> available in https://www.kernel.org/doc/Documentation/x86/resctrl.rst
>> in section "mbm_total_bytes_config", "mbm_local_bytes_config":
>>
>> #cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
>> 0=0x7f;1=0x7f
>>
>> #cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
>> 0=0x15;1=0xi15
>
>
> These would not be available if system does not support BMEC. From
> patch #3 it does not seem as though ABMC is dependent on BMEC.
>
> Is ABMC dependent on BMEC or are they just using the same
> config bits?

Good question. They don't have to be dependent on each other. To keep the
rmid_read interface the same, we made ABMC dependent on BMEC. I will add
the dependency in patch 3.

I have added an explanation in patch 15.
https://lore.kernel.org/lkml/[email protected]/


>
>>
>> j. Change the bandwidth source for domain 0 for the total event to count only reads.
>> Note that this change effects events on the domain 0.
>>
>> #echo total=0x33 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
>
> typo?

Yes. Cut-and-paste mistake. Will fix it.

>
>> #cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
>> 0=0x33;1=0x7F
>>
>> k. Now read the total event again. The mbm_total_bytes should display
>> only the read events.
>>
>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> 6136000
>
> hmmm ... seems like there is a need to make the MBM events configurable even
> if BMEC is not supported.

Yes, in ABMC mode. Will add the dependency. Will use the standard mode if
BMEC and ABMC are not available.

>
> Reinette
>
>
> [1] https://lore.kernel.org/lkml/CALPaoCjg-W3w8OKLHP_g6Evoo03fbgaOQZrGTLX6vdSLp70=SA@mail.gmail.com/

--
Thanks
Babu Moger

2023-12-06 18:50:40

by Reinette Chatre

Subject: Re: [PATCH 00/15] x86/resctrl : Support AMD QoS RMID Pinning feature

Hi Babu,

On 12/6/2023 7:40 AM, Moger, Babu wrote:
> Hi Reinette,
>
> On 12/5/23 17:17, Reinette Chatre wrote:
>> (+James)
>>
>> Hi Babu,
>>
>> On 11/30/2023 4:57 PM, Babu Moger wrote:
>>> These series adds the support for AMD QoS RMID Pinning feature. It is also
>>
>> "These series" - is this series part of a bigger work?
>
> No.
> There are some some plans to optimize rmid_reads. Peter is planning to
> work on that. But both are independent of each other.

I would propose that you use "This series" instead to avoid creating
the wrong impression.

>>> a. Check if ABMC support is available
>>> #mount -t resctrl resctrl /sys/fs/resctrl/
>>> #cat /sys/fs/resctrl/info/L3_MON/mon_features
>>> llc_occupancy
>>> mbm_total_bytes
>>> mbm_total_bytes_config
>>> mbm_local_bytes
>>> mbm_local_bytes_config
>>> abmc_capable ← Linux kernel detected ABMC feature.
>>
>> (Please start thinking about a new name that is not the AMD feature
>> name. This is added to resctrl filesystem that is the generic interface
>> used to work with different architectures. This thus needs to be generalized
>> to what user requires and how it can be accommodated by the hardware ...
>> this is already expected to be needed by MPAM and having a AMD feature
>> name could cause confusion.)
>
> Yes. Agree.
>
> How about "assign_capable"?

Let's wait to learn more about the other use cases.

>
>>
>>>
>>> b. Mount with ABMC support
>>> #umount /sys/fs/resctrl/
>>> #mount -o abmc -t resctrl resctrl /sys/fs/resctrl/
>>>
>>
>> hmmm ... so this requires the user to mount resctrl, determine if the
>> feature is supported, unmount resctrl, remount resctrl with feature enabled.
>> Could you please elaborate what prevents this feature from being enabled
>> without needing to remount resctrl?
>
> Spec says
> "Enabling ABMC: ABMC is enabled by setting L3_QOS_EXT_CFG.ABMC_En=1 (see
> Figure 19-7). When the state of ABMC_En is changed, it must be changed to
> the updated value on all logical processors in the QOS Domain.
> Upon transitions of the ABMC_En the following actions take place:
> All ABMC assignable bandwidth counters are reset to 0.
> The L3 default mode bandwidth counters are reset to 0.
> The L3_QOS_ABMC_CFG MSR is reset to 0."
>
> So, all the monitoring group counters will be reset.
>
> It is technically possible to enable without remount. But ABMC mode
> requires few new files(in each group) which I added when mounted with "-o
> abmc". Thought it is a better option.
>
> Otherwise we need to add these files when ABMC is supported(not when
> enabled). Need to add another file in /sys/fs/resctrl/info/L3_MON to
> enable the feature on the fly.
>
> Both are acceptable options. Any thoughts?

The new resctrl files in info/ could always be present. For example,
user space may want to know how many counters are available before
enabling the feature.

It is not yet obvious to me that this feature requires new files
in monitor groups.

>>> c. Read the monitor states. There will be new file "monitor_state"
>>> for each monitor group when ABMC feature is enabled. By default,
>>> both total and local MBM events are in "unassign" state.
>>>
>>> #cat /sys/fs/resctrl/monitor_state
>>> total=unassign;local=unassign
>>>
>>> d. Read the event mbm_total_bytes and mbm_local_bytes. Note that MBA
>>> events are not available until the user assigns the events explicitly.
>>> Users need to assign the counters to monitor the events in this mode.
>>>
>>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>> Unavailable
>>
>> How is the llc_occupancy event impacted when ABMC is enabled? Can all RMIDs
>> still be used to track cache occupancy?
>
> llc_occupancy event is not impacted by ABMC mode. It can be still used to
> track cache occupancy.
>
>>
>>>
>>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>> Unavailable
>>
>> I believe that "Unavailable" already has an accepted meaning within current
>> interface and is associated with temporary failure. Even the AMD spec states "This
>> is generally a temporary condition and subsequent reads may succeed". In the
>> scenario above there is no chance that this counter would produce a value later.
>> I do not think it is ideal to overload existing interface with different meanings
>> associated with a new hardware specific feature ... something like "Disabled" seems
>> more appropriate.
>
> Hardware still reports it as unavailable. Also, there are some error cases
> hardware can report unavailable. We may not be able to differentiate that.

This highlights that this resctrl feature is currently latched to AMD's
ABMC. I do not think we should require that this resctrl feature is backed
by hardware that can support reads of counters that are disabled. A counter
read really only needs to be sent to hardware if it is enabled.
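
As a rough sketch of that idea (not a proposal for the actual code): the read
path could check the software state first and only consult hardware when a
counter is assigned. The struct, its fields, and the function name are
hypothetical, and "Disabled" is just the string floated earlier in this
thread, not an agreed interface.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-event state; none of these names come from the series. */
struct mbm_event_state {
	bool assigned;		/* software view: a counter is assigned */
	bool hw_unavailable;	/* stand-in for hardware's "unavailable" reply */
	uint64_t bytes;		/* stand-in for the value hardware would return */
};

/*
 * Format what a read of the event file would show. Hardware state is
 * consulted only when a counter is assigned, so "Unavailable" keeps its
 * existing meaning of a temporary failure.
 */
int mbm_event_show(const struct mbm_event_state *ev, char *buf, size_t len)
{
	if (!ev->assigned)
		return snprintf(buf, len, "Disabled\n");

	if (ev->hw_unavailable)
		return snprintf(buf, len, "Unavailable\n");

	return snprintf(buf, len, "%llu\n", (unsigned long long)ev->bytes);
}
```

With this split, hardware that cannot be read while a counter is unassigned
would never be asked to.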

>> Considering this we may even consider using these files themselves as a
>> way to enable the counters if they are disabled. For example, just
>> "echo 1 > /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes" can be used
>
> I am not sure about this. This is specific to domain 0. This group can
> have cpus from multiple domains. I think we should have the interface for
> all the domains(not for specific domain).

Are the ABMC registers not per CPU? This is unclear to me at this time
since the changelog of patch #13 states it is per-CPU, yet the code
uses smp_call_function_any().

Even so, this needs to take other use cases into account. So far Peter
mentioned the scenario where enabling of one counter would do so for all
events associated with that counter and then there could also be a global
enable/disable.

>
>> to enable this counter. No need for a new "monitor_state". Please note that this
>> is not an official proposal since there are two other use cases that still need to
>> be considered as we await James's feedback on how this may work for MPAM and
>> also how this may be useful on AMD hardware that does not support ABMC but
>> users may want to get similar benefits ([1])
>
> Ok. Lets wait for James comments.

Reinette

2023-12-07 16:13:02

by Babu Moger

Subject: Re: [PATCH 00/15] x86/resctrl : Support AMD QoS RMID Pinning feature

Hi Reinette,

On 12/6/23 12:49, Reinette Chatre wrote:
> Hi Babu,
>
> On 12/6/2023 7:40 AM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 12/5/23 17:17, Reinette Chatre wrote:
>>> (+James)
>>>
>>> Hi Babu,
>>>
>>> On 11/30/2023 4:57 PM, Babu Moger wrote:
>>>> These series adds the support for AMD QoS RMID Pinning feature. It is also
>>>
>>> "These series" - is this series part of a bigger work?
>>
>> No.
>> There are some some plans to optimize rmid_reads. Peter is planning to
>> work on that. But both are independent of each other.
>
> I would propose that you use "This series" instead to avoid creating
> wrong impression.

Sure.

>
>>>> a. Check if ABMC support is available
>>>> #mount -t resctrl resctrl /sys/fs/resctrl/
>>>> #cat /sys/fs/resctrl/info/L3_MON/mon_features
>>>> llc_occupancy
>>>> mbm_total_bytes
>>>> mbm_total_bytes_config
>>>> mbm_local_bytes
>>>> mbm_local_bytes_config
>>>> abmc_capable ← Linux kernel detected ABMC feature.
>>>
>>> (Please start thinking about a new name that is not the AMD feature
>>> name. This is added to resctrl filesystem that is the generic interface
>>> used to work with different architectures. This thus needs to be generalized
>>> to what user requires and how it can be accommodated by the hardware ...
>>> this is already expected to be needed by MPAM and having a AMD feature
>>> name could cause confusion.)
>>
>> Yes. Agree.
>>
>> How about "assign_capable"?
>
> Let's wait to learn more about other use case.
>
>>
>>>
>>>>
>>>> b. Mount with ABMC support
>>>> #umount /sys/fs/resctrl/
>>>> #mount -o abmc -t resctrl resctrl /sys/fs/resctrl/
>>>>
>>>
>>> hmmm ... so this requires the user to mount resctrl, determine if the
>>> feature is supported, unmount resctrl, remount resctrl with feature enabled.
>>> Could you please elaborate what prevents this feature from being enabled
>>> without needing to remount resctrl?
>>
>> Spec says
>> "Enabling ABMC: ABMC is enabled by setting L3_QOS_EXT_CFG.ABMC_En=1 (see
>> Figure 19-7). When the state of ABMC_En is changed, it must be changed to
>> the updated value on all logical processors in the QOS Domain.
>> Upon transitions of the ABMC_En the following actions take place:
>> All ABMC assignable bandwidth counters are reset to 0.
>> The L3 default mode bandwidth counters are reset to 0.
>> The L3_QOS_ABMC_CFG MSR is reset to 0."
>>
>> So, all the monitoring group counters will be reset.
>>
>> It is technically possible to enable without remount. But ABMC mode
>> requires few new files(in each group) which I added when mounted with "-o
>> abmc". Thought it is a better option.
>>
>> Otherwise we need to add these files when ABMC is supported(not when
>> enabled). Need to add another file in /sys/fs/resctrl/info/L3_MON to
>> enable the feature on the fly.
>>
>> Both are acceptable options. Any thoughts?
>
> The new resctrl files in info/ could always be present. For example,
> user space may want to know how many counters are available before
> enabling the feature.
>
> It is not yet obvious to me that this feature requires new files
> in monitor groups.

There are two MBM events (total and local) in each group.
We should provide an interface to assign each event independently.
A user can assign only one event in a group. We should also provide an
option to assign both events in the group. This needs to be done at
the resctrl group level.
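
To make the per-group semantics concrete, here is a small userspace sketch of
how a "total=assign;local=unassign" string from the cover letter examples
could be parsed so that each event is set independently. The struct and
function names are hypothetical, and the sketch assumes the trailing newline
from echo has already been stripped; it is not the parser from the patches.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical per-group view: true means a counter is assigned. */
struct monitor_state {
	bool total;
	bool local;
};

/*
 * Parse "key=value" pairs separated by ';'. Keys are "total" and "local",
 * values are "assign" and "unassign". Events not mentioned keep their
 * current state. Returns 0 on success, or -1 on malformed input, in which
 * case *st is left untouched.
 */
int monitor_state_parse(const char *s, struct monitor_state *st)
{
	struct monitor_state tmp = *st;

	while (*s) {
		size_t tok = strcspn(s, ";");		/* length of this pair */
		const char *eq = memchr(s, '=', tok);
		size_t klen, vlen;
		bool *field;

		if (!eq)
			return -1;
		klen = (size_t)(eq - s);
		vlen = tok - klen - 1;

		if (klen == 5 && !strncmp(s, "total", 5))
			field = &tmp.total;
		else if (klen == 5 && !strncmp(s, "local", 5))
			field = &tmp.local;
		else
			return -1;

		if (vlen == 6 && !strncmp(eq + 1, "assign", 6))
			*field = true;
		else if (vlen == 8 && !strncmp(eq + 1, "unassign", 8))
			*field = false;
		else
			return -1;

		s += tok;
		if (*s == ';')
			s++;
	}

	*st = tmp;	/* commit only after the whole string parsed cleanly */
	return 0;
}
```

Writing "total=assign" alone leaves the local event as it was, which matches
example (e) in the cover letter.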

>
>>>> c. Read the monitor states. There will be new file "monitor_state"
>>>> for each monitor group when ABMC feature is enabled. By default,
>>>> both total and local MBM events are in "unassign" state.
>>>>
>>>> #cat /sys/fs/resctrl/monitor_state
>>>> total=unassign;local=unassign
>>>>
>>>> d. Read the event mbm_total_bytes and mbm_local_bytes. Note that MBA
>>>> events are not available until the user assigns the events explicitly.
>>>> Users need to assign the counters to monitor the events in this mode.
>>>>
>>>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>> Unavailable
>>>
>>> How is the llc_occupancy event impacted when ABMC is enabled? Can all RMIDs
>>> still be used to track cache occupancy?
>>
>> llc_occupancy event is not impacted by ABMC mode. It can be still used to
>> track cache occupancy.
>>
>>>
>>>>
>>>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>> Unavailable
>>>
>>> I believe that "Unavailable" already has an accepted meaning within current
>>> interface and is associated with temporary failure. Even the AMD spec states "This
>>> is generally a temporary condition and subsequent reads may succeed". In the
>>> scenario above there is no chance that this counter would produce a value later.
>>> I do not think it is ideal to overload existing interface with different meanings
>>> associated with a new hardware specific feature ... something like "Disabled" seems
>>> more appropriate.
>>
>> Hardware still reports it as unavailable. Also, there are some error cases
>> hardware can report unavailable. We may not be able to differentiate that.
>
> This highlights that this resctrl feature is currently latched to AMD's
> ABMC. I do not think we should require that this resctrl feature is backed
> by hardware that can support reads of counters that are disabled. A counter
> read really only needs to be sent to hardware if it is enabled.
>
>>> Considering this we may even consider using these files themselves as a
>>> way to enable the counters if they are disabled. For example, just
>>> "echo 1 > /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes" can be used
>>
>> I am not sure about this. This is specific to domain 0. This group can
>> have cpus from multiple domains. I think we should have the interface for
>> all the domains(not for specific domain).
>
> Are the ABMC registers not per CPU? This is unclear to me at this time
> since changelog of patch #13 states it is per-CPU but yet the code
> uses smp_call_function_any().

Here are the clarifications from the hardware engineer about this.

# While configuring the counter, do we have to write L3_QOS_ABMC_CFG
on all the logical processors in a domain?

No. In order to configure a specific counter, you only need to write it
on a single logical processor in a domain. Configuring the actual ABMC
counter is a side-effect of the write to this register. And the actual
ABMC counter configuration is a global state.

"Each logical processor implements a separate copy of these registers"
identifies that if you write a 5 to L3_QOS_ABMC_CFG on C0T0, you will not
read a 5 from the L3_QOS_ABMC_CFG register on C1T0.
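
A toy model of the behaviour described above, as a sketch only: each
logical processor keeps its own copy of the register, while the counter
configuration that results from a write is shared. The class, the field
names, and the use of None for a never-written register copy are
assumptions for illustration; this does not describe the real MSR layout
or its reset value.

```python
from dataclasses import dataclass, field


@dataclass
class QosDomain:
    """Toy model of one QoS domain (hypothetical, not the real MSR layout)."""
    cpus: frozenset
    # Private register copy per logical processor: cpu -> last value written.
    cfg_copy: dict = field(default_factory=dict)
    # Shared counter configuration: counter id -> RMID.
    counters: dict = field(default_factory=dict)

    def write_cfg(self, cpu, ctr_id, rmid):
        """Model a write to L3_QOS_ABMC_CFG on one logical processor."""
        if cpu not in self.cpus:
            raise ValueError(f"{cpu} is not in this domain")
        self.cfg_copy[cpu] = (ctr_id, rmid)  # only this CPU's copy changes
        self.counters[ctr_id] = rmid         # side effect: shared counter state

    def read_cfg(self, cpu):
        """Model a read of L3_QOS_ABMC_CFG; None means never written here."""
        if cpu not in self.cpus:
            raise ValueError(f"{cpu} is not in this domain")
        return self.cfg_copy.get(cpu)


domain = QosDomain(cpus=frozenset({"C0T0", "C1T0"}))
domain.write_cfg("C0T0", ctr_id=5, rmid=12)
```

After the single write on C0T0, the shared table holds the configuration
for counter 5, while read_cfg("C1T0") still does not return what was
written on C0T0.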


>
> Even so, this needs to take other use cases into account. So far Peter
> mentioned the scenario where enabling of one counter would do so for all
> events associated with that counter and then there could also be a global
> enable/disable.
>
>>
>>> to enable this counter. No need for a new "monitor_state". Please note that this
>>> is not an official proposal since there are two other use cases that still need to
>>> be considered as we await James's feedback on how this may work for MPAM and
>>> also how this may be useful on AMD hardware that does not support ABMC but
>>> users may want to get similar benefits ([1])
>>
>> Ok. Lets wait for James comments.
>
> Reinette
>

--
Thanks
Babu Moger

2023-12-07 19:29:56

by Reinette Chatre

Subject: Re: [PATCH 00/15] x86/resctrl : Support AMD QoS RMID Pinning feature

Hi Babu,

On 12/7/2023 8:12 AM, Moger, Babu wrote:
> On 12/6/23 12:49, Reinette Chatre wrote:
>> On 12/6/2023 7:40 AM, Moger, Babu wrote:
>>> On 12/5/23 17:17, Reinette Chatre wrote:
>>>> On 11/30/2023 4:57 PM, Babu Moger wrote:


>>>>> b. Mount with ABMC support
>>>>> #umount /sys/fs/resctrl/
>>>>> #mount -o abmc -t resctrl resctrl /sys/fs/resctrl/
>>>>>
>>>>
>>>> hmmm ... so this requires the user to mount resctrl, determine if the
>>>> feature is supported, unmount resctrl, remount resctrl with feature enabled.
>>>> Could you please elaborate what prevents this feature from being enabled
>>>> without needing to remount resctrl?
>>>
>>> Spec says
>>> "Enabling ABMC: ABMC is enabled by setting L3_QOS_EXT_CFG.ABMC_En=1 (see
>>> Figure 19-7). When the state of ABMC_En is changed, it must be changed to
>>> the updated value on all logical processors in the QOS Domain.
>>> Upon transitions of the ABMC_En the following actions take place:
>>> All ABMC assignable bandwidth counters are reset to 0.
>>> The L3 default mode bandwidth counters are reset to 0.
>>> The L3_QOS_ABMC_CFG MSR is reset to 0."
>>>
>>> So, all the monitoring group counters will be reset.
>>>
>>> It is technically possible to enable without remount. But ABMC mode
>>> requires few new files(in each group) which I added when mounted with "-o
>>> abmc". Thought it is a better option.
>>>
>>> Otherwise we need to add these files when ABMC is supported(not when
>>> enabled). Need to add another file in /sys/fs/resctrl/info/L3_MON to
>>> enable the feature on the fly.
>>>
>>> Both are acceptable options. Any thoughts?
>>
>> The new resctrl files in info/ could always be present. For example,
>> user space may want to know how many counters are available before
>> enabling the feature.
>>
>> It is not yet obvious to me that this feature requires new files
>> in monitor groups.
>
> There are two MBM events(total and local) in each group.
> We should provide an interface to assign each event independently.
> User can assign only one event in a group. We should also provide an
> option assign both the events in the group. This needs to be done at
> resctrl group level.

Understood. I would like to start by considering how (if at all) existing
files may be used, thus my example of using mbm_total_bytes, before adding
more files.


...

>>>>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>> Unavailable
>>>>
>>>> I believe that "Unavailable" already has an accepted meaning within current
>>>> interface and is associated with temporary failure. Even the AMD spec states "This
>>>> is generally a temporary condition and subsequent reads may succeed". In the
>>>> scenario above there is no chance that this counter would produce a value later.
>>>> I do not think it is ideal to overload existing interface with different meanings
>>>> associated with a new hardware specific feature ... something like "Disabled" seems
>>>> more appropriate.
>>>
>>> Hardware still reports it as unavailable. Also, there are some error cases
>>> hardware can report unavailable. We may not be able to differentiate that.
>>
>> This highlights that this resctrl feature is currently latched to AMD's
>> ABMC. I do not think we should require that this resctrl feature is backed
>> by hardware that can support reads of counters that are disabled. A counter
>> read really only needs to be sent to hardware if it is enabled.
>>
>>>> Considering this we may even consider using these files themselves as a
>>>> way to enable the counters if they are disabled. For example, just
>>>> "echo 1 > /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes" can be used
>>>
>>> I am not sure about this. This is specific to domain 0. This group can
>>> have cpus from multiple domains. I think we should have the interface for
>>> all the domains(not for specific domain).
>>
>> Are the ABMC registers not per CPU? This is unclear to me at this time
>> since changelog of patch #13 states it is per-CPU but yet the code
>> uses smp_call_function_any().
>
> Here are the clarifications from hardware engineer about this.
>
> # While configuring the counter, should we have to write (L3_QOS_ABMC_CFG)
> on all the logical processors in a domain?
>
> No. In order to configure a specific counter, you only need to write it
> on a single logical processor in a domain. Configuring the actual ABMC
> counter is a side-effect of the write to this register. And the actual
> ABMC counter configuration is a global state.
>
> "Each logical processor implements a separate copy of these registers"
> identifies that if you write a 5 to L3_QOS_ABMC_CFG on C0T0, you will not
> read a 5 from the L3_QOS_ABMC_CFG register on C1T0.

Thank you for this information. Would reading L3_QOS_ABMC_DSC register on
C1T0 return the configuration written to L3_QOS_ABMC_CFG on C0T0?

Even so, you do confirm that the counter configuration is per domain. If I
understand correctly the implementation in this series assumes the counters
are programmed identically on all domains, but theoretically the system can support
domains with different counter configurations. For example, if a resource group
is limited to CPUs in one domain it would be unnecessary to consume the other
domain's counters.
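
To make the example concrete, here is a minimal sketch of how the set of
domains that actually need a counter could be derived from a group's
CPUs. The function name and the CPU-to-domain map are hypothetical and
are not taken from the series.

```python
def domains_needing_counters(group_cpus, cpu_to_domain):
    """Return the sorted domain ids that contain at least one CPU of the group."""
    return sorted({cpu_to_domain[cpu] for cpu in group_cpus})


# Hypothetical topology: CPUs 0-1 in domain 0, CPUs 2-3 in domain 1.
cpu_to_domain = {0: 0, 1: 0, 2: 1, 3: 1}

# A group limited to CPUs in domain 0 only needs a counter there.
one_domain = domains_needing_counters([0, 1], cpu_to_domain)

# A group spanning both domains needs a counter in each.
two_domains = domains_needing_counters([1, 2], cpu_to_domain)
```

With identical programming on all domains, the first group would still
consume a counter in domain 1 even though none of its CPUs run there.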

This also ties into what this feature may morph into when considering the
non-ABMC AMD hardware needing similar interface as well as MPAM. I understand
for MPAM that resources are required for a counter but I do not know their
scope.

Reinette

2023-12-07 23:07:42

by Moger, Babu

Subject: Re: [PATCH 00/15] x86/resctrl : Support AMD QoS RMID Pinning feature

Hi Reinette,

On 12/7/2023 1:29 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 12/7/2023 8:12 AM, Moger, Babu wrote:
>> On 12/6/23 12:49, Reinette Chatre wrote:
>>> On 12/6/2023 7:40 AM, Moger, Babu wrote:
>>>> On 12/5/23 17:17, Reinette Chatre wrote:
>>>>> On 11/30/2023 4:57 PM, Babu Moger wrote:
>
>>>>>> b. Mount with ABMC support
>>>>>> #umount /sys/fs/resctrl/
>>>>>> #mount -o abmc -t resctrl resctrl /sys/fs/resctrl/
>>>>>>
>>>>> hmmm ... so this requires the user to mount resctrl, determine if the
>>>>> feature is supported, unmount resctrl, remount resctrl with feature enabled.
>>>>> Could you please elaborate what prevents this feature from being enabled
>>>>> without needing to remount resctrl?
>>>> Spec says
>>>> "Enabling ABMC: ABMC is enabled by setting L3_QOS_EXT_CFG.ABMC_En=1 (see
>>>> Figure 19-7). When the state of ABMC_En is changed, it must be changed to
>>>> the updated value on all logical processors in the QOS Domain.
>>>> Upon transitions of the ABMC_En the following actions take place:
>>>> All ABMC assignable bandwidth counters are reset to 0.
>>>> The L3 default mode bandwidth counters are reset to 0.
>>>> The L3_QOS_ABMC_CFG MSR is reset to 0."
>>>>
>>>> So, all the monitoring group counters will be reset.
>>>>
>>>> It is technically possible to enable without remount. But ABMC mode
>>>> requires few new files(in each group) which I added when mounted with "-o
>>>> abmc". Thought it is a better option.
>>>>
>>>> Otherwise we need to add these files when ABMC is supported(not when
>>>> enabled). Need to add another file in /sys/fs/resctrl/info/L3_MON to
>>>> enable the feature on the fly.
>>>>
>>>> Both are acceptable options. Any thoughts?
>>> The new resctrl files in info/ could always be present. For example,
>>> user space may want to know how many counters are available before
>>> enabling the feature.
>>>
>>> It is not yet obvious to me that this feature requires new files
>>> in monitor groups.
>> There are two MBM events(total and local) in each group.
>> We should provide an interface to assign each event independently.
>> User can assign only one event in a group. We should also provide an
>> option assign both the events in the group. This needs to be done at
>> resctrl group level.
> Understood. I would like to start by considering how (if at all) existing
> files may be used, thus my example of using mbm_total_bytes, before adding
> more files.
>
>
> ...
>
>>>>>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>> Unavailable
>>>>> I believe that "Unavailable" already has an accepted meaning within current
>>>>> interface and is associated with temporary failure. Even the AMD spec states "This
>>>>> is generally a temporary condition and subsequent reads may succeed". In the
>>>>> scenario above there is no chance that this counter would produce a value later.
>>>>> I do not think it is ideal to overload existing interface with different meanings
>>>>> associated with a new hardware specific feature ... something like "Disabled" seems
>>>>> more appropriate.
>>>> Hardware still reports it as unavailable. Also, there are some error cases
>>>> hardware can report unavailable. We may not be able to differentiate that.
>>> This highlights that this resctrl feature is currently latched to AMD's
>>> ABMC. I do not think we should require that this resctrl feature is backed
>>> by hardware that can support reads of counters that are disabled. A counter
>>> read really only needs to be sent to hardware if it is enabled.
>>>
>>>>> Considering this we may even consider using these files themselves as a
>>>>> way to enable the counters if they are disabled. For example, just
>>>>> "echo 1 > /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes" can be used
>>>> I am not sure about this. This is specific to domain 0. This group can
>>>> have cpus from multiple domains. I think we should have the interface for
>>>> all the domains(not for specific domain).
>>> Are the ABMC registers not per CPU? This is unclear to me at this time
>>> since changelog of patch #13 states it is per-CPU but yet the code
>>> uses smp_call_function_any().
>> Here are the clarifications from hardware engineer about this.
>>
>> # While configuring the counter, should we have to write (L3_QOS_ABMC_CFG)
>> on all the logical processors in a domain?
>>
>> No. In order to configure a specific counter, you only need to write it
>> on a single logical processor in a domain. Configuring the actual ABMC
>> counter is a side-effect of the write to this register. And the actual
>> ABMC counter configuration is a global state.
>>
>> "Each logical processor implements a separate copy of these registers"
>> identifies that if you write a 5 to L3_QOS_ABMC_CFG on C0T0, you will not
>> read a 5 from the L3_QOS_ABMC_CFG register on C1T0.
> Thank you for this information. Would reading L3_QOS_ABMC_DSC register on
> C1T0 return the configuration written to L3_QOS_ABMC_CFG on C0T0 ?

Yes, because the counter configuration is global. Reading
L3_QOS_ABMC_DSC will return the configuration of the counter specified by
L3_QOS_ABMC_CFG[CtrID].

>
> Even so, you do confirm that the counter configuration is per domain. If I
> understand correctly the implementation in this series assumes the counters
> are programmed identically on all domains, but theoretically the system can support
> domains with different counter configurations. For example, if a resource group
> is limited to CPUs in one domain it would be unnecessary to consume the other
> domain's counters.
Yes. It is programmed on all the domains. Separating the domain
configuration will require more changes. I am not planning to address it
in this series.
>
> This also ties into what this feature may morph into when considering the
> non-ABMC AMD hardware needing similar interface as well as MPAM. I understand
> for MPAM that resources are required for a counter but I do not know their
> scope.
>
> Reinette

2023-12-07 23:27:07

by Reinette Chatre

Subject: Re: [PATCH 00/15] x86/resctrl : Support AMD QoS RMID Pinning feature

Hi Babu,

On 12/7/2023 3:07 PM, Moger, Babu wrote:
> On 12/7/2023 1:29 PM, Reinette Chatre wrote:
>> On 12/7/2023 8:12 AM, Moger, Babu wrote:
>>> On 12/6/23 12:49, Reinette Chatre wrote:
>>>> On 12/6/2023 7:40 AM, Moger, Babu wrote:
>>>>> On 12/5/23 17:17, Reinette Chatre wrote:
>>>>>> On 11/30/2023 4:57 PM, Babu Moger wrote:
>>
>>>>>>> b. Mount with ABMC support
>>>>>>>     #umount /sys/fs/resctrl/
>>>>>>>     #mount  -o abmc -t resctrl resctrl /sys/fs/resctrl/
>>>>>>>     
>>>>>> hmmm ... so this requires the user to mount resctrl, determine if the
>>>>>> feature is supported, unmount resctrl, remount resctrl with feature enabled.
>>>>>> Could you please elaborate what prevents this feature from being enabled
>>>>>> without needing to remount resctrl?
>>>>> Spec says
>>>>> "Enabling ABMC: ABMC is enabled by setting L3_QOS_EXT_CFG.ABMC_En=1 (see
>>>>> Figure 19-7). When the state of ABMC_En is changed, it must be changed to
>>>>> the updated value on all logical processors in the QOS Domain.
>>>>> Upon transitions of the ABMC_En the following actions take place:
>>>>> All ABMC assignable bandwidth counters are reset to 0.
>>>>> The L3 default mode bandwidth counters are reset to 0.
>>>>> The L3_QOS_ABMC_CFG MSR is reset to 0."
>>>>>
>>>>> So, all the monitoring group counters will be reset.
>>>>>
>>>>> It is technically possible to enable without remount. But ABMC mode
>>>>> requires few new files(in each group) which I added when mounted with "-o
>>>>> abmc". Thought it is a better option.
>>>>>
>>>>> Otherwise we need to add these files when ABMC is supported(not when
>>>>> enabled). Need to add another file in /sys/fs/resctrl/info/L3_MON to
>>>>> enable the feature on the fly.
>>>>>
>>>>> Both are acceptable options. Any thoughts?
>>>> The new resctrl files in info/ could always be present. For example,
>>>> user space may want to know how many counters are available before
>>>> enabling the feature.
>>>>
>>>> It is not yet obvious to me that this feature requires new files
>>>> in monitor groups.
>>> There are two MBM events(total and local) in each group.
>>> We should provide an interface to assign each event independently.
>>> User can assign only one event in a group. We should also provide an
>>> option assign both the events in the group. This needs to be done at
>>> resctrl group level.
>> Understood. I would like to start by considering how (if at all) existing
>> files may be used, thus my example of using mbm_total_bytes, before adding
>> more files.
>>
>>
>> ...
>>
>>>>>>>     #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>     Unavailable
>>>>>> I believe that "Unavailable" already has an accepted meaning within current
>>>>>> interface and is associated with temporary failure. Even the AMD spec states "This
>>>>>> is generally a temporary condition and subsequent reads may succeed". In the
>>>>>> scenario above there is no chance that this counter would produce a value later.
>>>>>> I do not think it is ideal to overload existing interface with different meanings
>>>>>> associated with a new hardware specific feature ... something like "Disabled" seems
>>>>>> more appropriate.
>>>>> Hardware still reports it as unavailable. Also, there are some error cases
>>>>> hardware can report unavailable. We may not be able to differentiate that.
>>>> This highlights that this resctrl feature is currently latched to AMD's
>>>> ABMC. I do not think we should require that this resctrl feature is backed
>>>> by hardware that can support reads of counters that are disabled. A counter
>>>> read really only needs to be sent to hardware if it is enabled.
>>>>
>>>>>> Considering this we may even consider using these files themselves as a
>>>>>> way to enable the counters if they are disabled. For example, just
>>>>>> "echo 1 > /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes" can be used
>>>>> I am not sure about this. This is specific to domain 0. This group can
>>>>> have cpus from multiple domains. I think we should have the interface for
>>>>> all the domains(not for specific domain).
>>>> Are the ABMC registers not per CPU? This is unclear to me at this time
>>>> since changelog of patch #13 states it is per-CPU but yet the code
>>>> uses smp_call_function_any().
>>> Here are the clarifications from hardware engineer about this.
>>>
>>> # While configuring the counter, should we have to write (L3_QOS_ABMC_CFG)
>>> on all the logical processors in a domain?
>>>
>>> No.  In order to configure a specific counter, you only need to write it
>>> on a  single logical processor in a domain.  Configuring the actual ABMC
>>> counter is a side-effect of the write to this register.  And the actual
>>> ABMC counter configuration is a  global state.
>>>
>>> "Each logical processor implements a separate copy of these registers"
>>> identifies that if you write a 5 to L3_QOS_ABMC_CFG on C0T0, you will not
>>> read a 5 from the L3_QOS_ABMC_CFG register on C1T0.
>> Thank you for this information. Would reading L3_QOS_ABMC_DSC register on
>> C1T0 return the configuration written to L3_QOS_ABMC_CFG on C0T0 ?
>
> Yes. Because the counter configuration is global. Reading L3_QOS_ABMC_DSC
> will return the configuration of the counter specified by QOS_ABMC_CFG[CtrID].


To confirm, when you say "global" you mean within a domain?

>
>>
>> Even so, you do confirm that the counter configuration is per domain. If I
>> understand correctly the implementation in this series assumes the counters
>> are programmed identically on all domains, but theoretically the system can support
>> domains with different counter configurations. For example, if a resource group
>> is limited to CPUs in one domain it would be unnecessary to consume the other
>> domain's counters.
> Yes. It is programmed on all the domains. Separating the domain
> configuration will require more changes. I am not planning to address
> in this series.

That may be ok. The priority is to consider how users want to interact with this
feature and create a suitable interface to support this. This version may not
separate domain configuration, but we do not want to create an interface that
prevents such an enhancement in the future. Especially since it is already known
that hardware supports it.

Reinette

2023-12-07 23:34:43

by Moger, Babu

Subject: Re: [PATCH 00/15] x86/resctrl : Support AMD QoS RMID Pinning feature

Hi Reinette,

On 12/7/2023 5:26 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 12/7/2023 3:07 PM, Moger, Babu wrote:
>> On 12/7/2023 1:29 PM, Reinette Chatre wrote:
>>> On 12/7/2023 8:12 AM, Moger, Babu wrote:
>>>> On 12/6/23 12:49, Reinette Chatre wrote:
>>>>> On 12/6/2023 7:40 AM, Moger, Babu wrote:
>>>>>> On 12/5/23 17:17, Reinette Chatre wrote:
>>>>>>> On 11/30/2023 4:57 PM, Babu Moger wrote:
>>>>>>>> b. Mount with ABMC support
>>>>>>>>     #umount /sys/fs/resctrl/
>>>>>>>>     #mount  -o abmc -t resctrl resctrl /sys/fs/resctrl/
>>>>>>>>
>>>>>>> hmmm ... so this requires the user to mount resctrl, determine if the
>>>>>>> feature is supported, unmount resctrl, remount resctrl with feature enabled.
>>>>>>> Could you please elaborate what prevents this feature from being enabled
>>>>>>> without needing to remount resctrl?
>>>>>> Spec says
>>>>>> "Enabling ABMC: ABMC is enabled by setting L3_QOS_EXT_CFG.ABMC_En=1 (see
>>>>>> Figure 19-7). When the state of ABMC_En is changed, it must be changed to
>>>>>> the updated value on all logical processors in the QOS Domain.
>>>>>> Upon transitions of the ABMC_En the following actions take place:
>>>>>> All ABMC assignable bandwidth counters are reset to 0.
>>>>>> The L3 default mode bandwidth counters are reset to 0.
>>>>>> The L3_QOS_ABMC_CFG MSR is reset to 0."
>>>>>>
>>>>>> So, all the monitoring group counters will be reset.
>>>>>>
>>>>>> It is technically possible to enable without remount. But ABMC mode
>>>>>> requires few new files(in each group) which I added when mounted with "-o
>>>>>> abmc". Thought it is a better option.
>>>>>>
>>>>>> Otherwise we need to add these files when ABMC is supported(not when
>>>>>> enabled). Need to add another file in /sys/fs/resctrl/info/L3_MON to
>>>>>> enable the feature on the fly.
>>>>>>
>>>>>> Both are acceptable options. Any thoughts?
>>>>> The new resctrl files in info/ could always be present. For example,
>>>>> user space may want to know how many counters are available before
>>>>> enabling the feature.
>>>>>
>>>>> It is not yet obvious to me that this feature requires new files
>>>>> in monitor groups.
>>>> There are two MBM events(total and local) in each group.
>>>> We should provide an interface to assign each event independently.
>>>> User can assign only one event in a group. We should also provide an
>>>> option assign both the events in the group. This needs to be done at
>>>> resctrl group level.
>>> Understood. I would like to start by considering how (if at all) existing
>>> files may be used, thus my example of using mbm_total_bytes, before adding
>>> more files.
>>>
>>>
>>> ...
>>>
>>>>>>>>     #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>     Unavailable
>>>>>>> I believe that "Unavailable" already has an accepted meaning within current
>>>>>>> interface and is associated with temporary failure. Even the AMD spec states "This
>>>>>>> is generally a temporary condition and subsequent reads may succeed". In the
>>>>>>> scenario above there is no chance that this counter would produce a value later.
>>>>>>> I do not think it is ideal to overload existing interface with different meanings
>>>>>>> associated with a new hardware specific feature ... something like "Disabled" seems
>>>>>>> more appropriate.
>>>>>> Hardware still reports it as unavailable. Also, there are some error cases
>>>>>> hardware can report unavailable. We may not be able to differentiate that.
>>>>> This highlights that this resctrl feature is currently latched to AMD's
>>>>> ABMC. I do not think we should require that this resctrl feature is backed
>>>>> by hardware that can support reads of counters that are disabled. A counter
>>>>> read really only needs to be sent to hardware if it is enabled.
>>>>>
>>>>>>> Considering this we may even consider using these files themselves as a
>>>>>>> way to enable the counters if they are disabled. For example, just
>>>>>>> "echo 1 > /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes" can be used
>>>>>> I am not sure about this. This is specific to domain 0. This group can
>>>>>> have cpus from multiple domains. I think we should have the interface for
>>>>>> all the domains(not for specific domain).
>>>>> Are the ABMC registers not per CPU? This is unclear to me at this time
>>>>> since changelog of patch #13 states it is per-CPU but yet the code
>>>>> uses smp_call_function_any().
>>>> Here are the clarifications from hardware engineer about this.
>>>>
>>>> # While configuring the counter, should we have to write (L3_QOS_ABMC_CFG)
>>>> on all the logical processors in a domain?
>>>>
>>>> No.  In order to configure a specific counter, you only need to write it
>>>> on a  single logical processor in a domain.  Configuring the actual ABMC
>>>> counter is a side-effect of the write to this register.  And the actual
>>>> ABMC counter configuration is a  global state.
>>>>
>>>> "Each logical processor implements a separate copy of these registers"
>>>> identifies that if you write a 5 to L3_QOS_ABMC_CFG on C0T0, you will not
>>>> read a 5 from the L3_QOS_ABMC_CFG register on C1T0.
>>> Thank you for this information. Would reading L3_QOS_ABMC_DSC register on
>>> C1T0 return the configuration written to L3_QOS_ABMC_CFG on C0T0 ?
>> Yes. Because the counter configuration is global. Reading L3_QOS_ABMC_DSC
>> will return the configuration of the counter specified by QOS_ABMC_CFG[CtrID].
>
> To confirm, when you say "global" you mean within a domain?

Yes. That is correct.


>
>>> Even so, you do confirm that the counter configuration is per domain. If I
>>> understand correctly the implementation in this series assumes the counters
>>> are programmed identically on all domains, but theoretically the system can support
>>> domains with different counter configurations. For example, if a resource group
>>> is limited to CPUs in one domain it would be unnecessary to consume the other
>>> domain's counters.
>> Yes. It is programmed on all the domains. Separating the domain
>> configuration will require more changes. I am not planning to address
>> in this series.
> That may be ok. The priority is to consider how users want to interact with this
> feature and create a suitable interface to support this. This version may not
> separate domain configuration, but we do not want to create an interface that
> prevents such an enhancement in the future. Especially since it is already known
> that hardware supports it.

Yes. Understood.

Thanks

Babu

2023-12-08 19:46:21

by Peter Newman

Subject: Re: [PATCH 00/15] x86/resctrl : Support AMD QoS RMID Pinning feature

On Tue, Dec 5, 2023 at 3:17 PM Reinette Chatre
<[email protected]> wrote:
> On 11/30/2023 4:57 PM, Babu Moger wrote:
> > c. Read the monitor states. There will be new file "monitor_state"
> > for each monitor group when ABMC feature is enabled. By default,
> > both total and local MBM events are in "unassign" state.
> >
> > #cat /sys/fs/resctrl/monitor_state
> > total=unassign;local=unassign
> >
> > d. Read the event mbm_total_bytes and mbm_local_bytes. Note that MBA
> > events are not available until the user assigns the events explicitly.
> > Users need to assign the counters to monitor the events in this mode.
> >
> > #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> > Unavailable
>
> How is the llc_occupancy event impacted when ABMC is enabled? Can all RMIDs
> still be used to track cache occupancy?
>
> >
> > #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> > Unavailable
>
> I believe that "Unavailable" already has an accepted meaning within current
> interface and is associated with temporary failure. Even the AMD spec states "This
> is generally a temporary condition and subsequent reads may succeed". In the
> scenario above there is no chance that this counter would produce a value later.
> I do not think it is ideal to overload existing interface with different meanings
> associated with a new hardware specific feature ... something like "Disabled" seems
> more appropriate.

Could we hide event counter files if they're not enabled? Is there
value in displaying the value of a non-running counter that will be
reset the next time it's enabled?


>
> Considering this we may even consider using these files themselves as a
> way to enable the counters if they are disabled. For example, just
> "echo 1 > /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes" can be used
> to enable this counter. No need for a new "monitor_state". Please note that this
> is not an official proposal since there are two other use cases that still need to
> be considered as we await James's feedback on how this may work for MPAM and
> also how this may be useful on AMD hardware that does not support ABMC but
> users may want to get similar benefits ([1])

We plan to use the ABMC counters as a window to sample the MB/s rate
of a very large number of groups, so there's a serious concern about
the number of write syscalls this will take, as they will add up
quickly for a large RMID*domain count.

To that end, the ideal would be the ability to re-assign all ABMC
counters on all domains in a single system call.
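
A rough back-of-the-envelope sketch of the concern, assuming one write
per event per domain for every group (the function name and the example
domain count are hypothetical; the 256 groups and 2 events per group
follow the cover letter):

```python
def assignment_writes(num_groups, events_per_group, num_domains):
    """Writes needed to assign every event of every group once, one file write each.

    Counts only assignment writes; unassign writes would add to the total.
    """
    return num_groups * events_per_group * num_domains


# 256 monitor groups, 2 MBM events each, 8 domains (domain count is made up).
per_file_writes = assignment_writes(256, 2, 8)  # 256 * 2 * 8 = 4096

# A single batched operation would replace all of those with one system call.
batched_writes = 1
```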

-Peter

2023-12-08 20:10:10

by Reinette Chatre

Subject: Re: [PATCH 00/15] x86/resctrl : Support AMD QoS RMID Pinning feature

Hi Peter,

On 12/8/2023 11:45 AM, Peter Newman wrote:
> On Tue, Dec 5, 2023 at 3:17 PM Reinette Chatre
> <[email protected]> wrote:
>> On 11/30/2023 4:57 PM, Babu Moger wrote:
>>> c. Read the monitor states. There will be new file "monitor_state"
>>> for each monitor group when ABMC feature is enabled. By default,
>>> both total and local MBM events are in "unassign" state.
>>>
>>> #cat /sys/fs/resctrl/monitor_state
>>> total=unassign;local=unassign
>>>
>>> d. Read the event mbm_total_bytes and mbm_local_bytes. Note that MBA
>>> events are not available until the user assigns the events explicitly.
>>> Users need to assign the counters to monitor the events in this mode.
>>>
>>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>> Unavailable
>>
>> How is the llc_occupancy event impacted when ABMC is enabled? Can all RMIDs
>> still be used to track cache occupancy?
>>
>>>
>>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>> Unavailable
>>
>> I believe that "Unavailable" already has an accepted meaning within current
>> interface and is associated with temporary failure. Even the AMD spec states "This
>> is generally a temporary condition and subsequent reads may succeed". In the
>> scenario above there is no chance that this counter would produce a value later.
>> I do not think it is ideal to overload existing interface with different meanings
>> associated with a new hardware specific feature ... something like "Disabled" seems
>> more appropriate.
>
> Could we hide event counter files if they're not enabled? Is there
> value in displaying the value of a non-running counter that will be
> reset the next time it's enabled?

It may be possible to hide the counter file when it is disabled, but in that
case it is not clear to me how to communicate to user space that the counter
is available and can be enabled. Hiding the file also loses one mechanism
to actually enable the counter. It is not required to display a stale value
when a counter is disabled; text like "Disabled" can be used instead.

>> Considering this we may even consider using these files themselves as a
>> way to enable the counters if they are disabled. For example, just
>> "echo 1 > /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes" can be used
>> to enable this counter. No need for a new "monitor_state". Please note that this
>> is not an official proposal since there are two other use cases that still need to
>> be considered as we await James's feedback on how this may work for MPAM and
>> also how this may be useful on AMD hardware that does not support ABMC but
>> users may want to get similar benefits ([1])
>
> We plan to use the ABMC counters as a window to sample the MB/s rate
> of a very large number of groups, so there's a serious concern about
> the number of write syscalls this will take, as they will add up
> quickly for a large RMID*domain count.
>
> To that end, the ideal would be the ability to re-assign all ABMC
> counters on all domains in a single system call.

Understood. I've already pointed out that this is a use case needing
to be considered. Please see [1] - search for "global enable/disable".

Reinette

[1] https://lore.kernel.org/lkml/[email protected]/
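
The distinction drawn above between a temporary "Unavailable" and a permanent "Disabled" can be sketched as a small formatting routine. This is a user-space illustration only: the enum, its member names and mbm_event_show() are hypothetical and do not exist in resctrl.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical counter states; the names are illustrative only. */
enum mbm_cntr_state {
	MBM_CNTR_DISABLED,	/* no hardware counter assigned to the event */
	MBM_CNTR_UNAVAILABLE,	/* assigned, but this read failed temporarily */
	MBM_CNTR_VALID,		/* assigned and the read returned a count */
};

/*
 * Format what a read of an MBM event file could return.
 * Returns the number of characters written, as snprintf() does.
 */
int mbm_event_show(char *buf, size_t len, enum mbm_cntr_state state,
		   uint64_t bytes)
{
	assert(buf && len > 0);

	switch (state) {
	case MBM_CNTR_DISABLED:
		return snprintf(buf, len, "Disabled\n");
	case MBM_CNTR_UNAVAILABLE:
		return snprintf(buf, len, "Unavailable\n");
	case MBM_CNTR_VALID:
		return snprintf(buf, len, "%" PRIu64 "\n", bytes);
	}
	return -1;
}
```

Keeping the two strings separate lets a reader of the file decide whether to retry the read or to enable the counter first.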



2023-12-08 22:59:00

by Babu Moger

Subject: Re: [PATCH 00/15] x86/resctrl : Support AMD QoS RMID Pinning feature

Hi Reinette/Peter,

> -----Original Message-----
> From: Reinette Chatre <[email protected]>
> Sent: Thursday, December 7, 2023 1:29 PM
> To: Moger, Babu <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; James Morse
> <[email protected]>
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; Phillips, Kim <[email protected]>;
> [email protected]; [email protected];
> [email protected]; [email protected]; Dadhania, Nikunj
> <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> Giani, Dhaval <[email protected]>
> Subject: Re: [PATCH 00/15] x86/resctrl : Support AMD QoS RMID Pinning
> feature
>
> Hi Babu,
>
> On 12/7/2023 8:12 AM, Moger, Babu wrote:
> > On 12/6/23 12:49, Reinette Chatre wrote:
> >> On 12/6/2023 7:40 AM, Moger, Babu wrote:
> >>> On 12/5/23 17:17, Reinette Chatre wrote:
> >>>> On 11/30/2023 4:57 PM, Babu Moger wrote:
>
>
> >>>>> b. Mount with ABMC support
> >>>>> #umount /sys/fs/resctrl/
> >>>>> #mount -o abmc -t resctrl resctrl /sys/fs/resctrl/
> >>>>>
> >>>>
> >>>> hmmm ... so this requires the user to mount resctrl, determine if
> >>>> the feature is supported, unmount resctrl, remount resctrl with feature
> enabled.
> >>>> Could you please elaborate what prevents this feature from being
> >>>> enabled without needing to remount resctrl?
> >>>
> >>> Spec says
> >>> "Enabling ABMC: ABMC is enabled by setting
> L3_QOS_EXT_CFG.ABMC_En=1
> >>> (see Figure 19-7). When the state of ABMC_En is changed, it must be
> >>> changed to the updated value on all logical processors in the QOS Domain.
> >>> Upon transitions of the ABMC_En the following actions take place:
> >>> All ABMC assignable bandwidth counters are reset to 0.
> >>> The L3 default mode bandwidth counters are reset to 0.
> >>> The L3_QOS_ABMC_CFG MSR is reset to 0."
> >>>
> >>> So, all the monitoring group counters will be reset.
> >>>
> >>> It is technically possible to enable without remount. But ABMC mode
> >>> requires few new files(in each group) which I added when mounted
> >>> with "-o abmc". Thought it is a better option.
> >>>
> >>> Otherwise we need to add these files when ABMC is supported(not when
> >>> enabled). Need to add another file in /sys/fs/resctrl/info/L3_MON to
> >>> enable the feature on the fly.
> >>>
> >>> Both are acceptable options. Any thoughts?

I think we have not concluded on this yet. I will remove the requirement to
remount the filesystem to use ABMC. That way users can move back and
forth between the modes without having to remount. We need to take care of
the extra cleanup of state (data structures) when the user moves back and
forth. Hopefully, I should be able to take care of that.

Thanks
Babu
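
The reset behaviour quoted from the spec is the reason switching modes on the fly needs extra cleanup. A toy user-space model of one QOS domain, assuming nothing beyond the three reset actions listed in the quote, could look like the following. The struct, its field names and the size of the default-mode array are invented for illustration; treating a same-value write as "no transition, no reset" is also an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_ABMC_CNTRS		32	/* counters available per the cover letter */
#define NUM_DEFAULT_CNTRS	4	/* arbitrary size for this model */

/* Toy model of the per-domain state touched by an ABMC_En transition. */
struct qos_domain_model {
	bool abmc_en;
	uint64_t abmc_cntr[NUM_ABMC_CNTRS];
	uint64_t default_cntr[NUM_DEFAULT_CNTRS];
	uint64_t abmc_cfg;		/* stands in for L3_QOS_ABMC_CFG */
};

void model_set_abmc(struct qos_domain_model *d, bool enable)
{
	assert(d != NULL);

	/* Assumption: writing the current value is not a transition. */
	if (d->abmc_en == enable)
		return;

	d->abmc_en = enable;
	/* The three actions listed in the spec quote. */
	memset(d->abmc_cntr, 0, sizeof(d->abmc_cntr));
	memset(d->default_cntr, 0, sizeof(d->default_cntr));
	d->abmc_cfg = 0;
}
```

Any software state that shadows these counters (previous MSR values, overflow bookkeeping) would have to be cleared at the same point, which is the cleanup referred to above.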

2023-12-12 18:02:59

by Babu Moger

Subject: [PATCH v2 1/2] x86/resctrl: Remove hard-coded memory bandwidth limit

The QOS Memory Bandwidth Enforcement Limit is reported by
CPUID_Fn80000020_EAX_x01 and CPUID_Fn80000020_EAX_x02.
Bits Description
31:0 BW_LEN: Size of the QOS Memory Bandwidth Enforcement Limit.

Newer processors can support a higher bandwidth limit than the current
hard-coded value. Remove the hard-coded value and detect the limit using
CPUID. Also update the register variables eax and edx to match the
AMD CPUID definition.

The CPUID details are documented in the PPR listed below [1].
[1] Processor Programming Reference (PPR) Vol 1.1 for AMD Family 19h Model
11h B1 - 55901 Rev 0.25.

Fixes: 4d05bf71f157 ("x86/resctrl: Introduce AMD QOS feature")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <[email protected]>

---
v2: Earlier sent as a part of the ABMC feature series.
https://lore.kernel.org/lkml/[email protected]/
Sending it separately now. Addressed comments from Reinette about registers
being used from the Intel definition. Also updated the commit message.
---
arch/x86/kernel/cpu/resctrl/core.c | 10 ++++------
arch/x86/kernel/cpu/resctrl/internal.h | 1 -
2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 19e0681f0435..d04371e851b4 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -231,9 +231,7 @@ static bool __get_mem_config_intel(struct rdt_resource *r)
static bool __rdt_get_mem_config_amd(struct rdt_resource *r)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- union cpuid_0x10_3_eax eax;
- union cpuid_0x10_x_edx edx;
- u32 ebx, ecx, subleaf;
+ u32 eax, ebx, ecx, edx, subleaf;

/*
* Query CPUID_Fn80000020_EDX_x01 for MBA and
@@ -241,9 +239,9 @@ static bool __rdt_get_mem_config_amd(struct rdt_resource *r)
*/
subleaf = (r->rid == RDT_RESOURCE_SMBA) ? 2 : 1;

- cpuid_count(0x80000020, subleaf, &eax.full, &ebx, &ecx, &edx.full);
- hw_res->num_closid = edx.split.cos_max + 1;
- r->default_ctrl = MAX_MBA_BW_AMD;
+ cpuid_count(0x80000020, subleaf, &eax, &ebx, &ecx, &edx);
+ hw_res->num_closid = edx + 1;
+ r->default_ctrl = 1 << eax;

/* AMD does not use delay */
r->membw.delay_linear = false;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa15f0a2..d2979748fae4 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -18,7 +18,6 @@
#define MBM_OVERFLOW_INTERVAL 1000
#define MAX_MBA_BW 100u
#define MBA_IS_LINEAR 0x4
-#define MAX_MBA_BW_AMD 0x800
#define MBM_CNTR_WIDTH_OFFSET_AMD 20

#define RMID_VAL_ERROR BIT_ULL(63)
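
As a quick check of the new computation, the user-space sketch below mirrors "r->default_ctrl = 1 << eax" from the hunk above. The function name is made up, and bw_len stands in for the BW_LEN value that CPUID would return. The sketch uses an unsigned constant and asserts that the shift count is below 32, since larger shifts are undefined in C.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical user-space mirror of "r->default_ctrl = 1 << eax".
 * bw_len stands for BW_LEN from CPUID_Fn80000020_EAX_x01 or _x02.
 */
uint32_t mba_default_ctrl(uint32_t bw_len)
{
	/* Shifting a 32-bit value by 32 or more is undefined behaviour. */
	assert(bw_len < 32);

	return UINT32_C(1) << bw_len;
}
```

A BW_LEN of 11 reproduces the removed MAX_MBA_BW_AMD constant (0x800), while a part reporting 12 would get 0x1000 without any code change.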


2023-12-12 18:03:20

by Babu Moger

Subject: [PATCH v2 2/2] x86/resctrl: Remove hard-coded memory bandwidth event configuration

If the BMEC (Bandwidth Monitoring Event Configuration) feature is
supported, the bandwidth events can be configured. The maximum supported
bandwidth bitmask can be determined from the following CPUID leaf.

CPUID_Fn80000020_ECX_x03 [Platform QoS Monitoring Bandwidth Event
Configuration] Read-only. Reset: 0000_007Fh.
Bits Description
31:7 Reserved
6:0 Identifies the bandwidth sources that can be tracked.

The bandwidth sources can change with processor generations.
Currently, this information is hard-coded. Remove the hard-coded value
and detect it using CPUID. Also print the valid bitmask when the
user tries to configure an invalid value.

The CPUID details are documented in the PPR listed below [1].
[1] Processor Programming Reference (PPR) Vol 1.1 for AMD Family 19h Model
11h B1 - 55901 Rev 0.25.

Fixes: dc2a3e857981 ("x86/resctrl: Add interface to read mbm_total_bytes_config")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <[email protected]>

---
v2: Earlier sent as a part of the ABMC feature series.
https://lore.kernel.org/lkml/[email protected]/
But this is not related to ABMC. Sending it separately now.
Removed the global resctrl_max_evt_bitmask. Added event_mask as part of
the resource.
---
arch/x86/kernel/cpu/resctrl/internal.h | 5 ++---
arch/x86/kernel/cpu/resctrl/monitor.c | 6 ++++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 18 ++++++++++--------
3 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index d2979748fae4..3e2f505614d8 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -50,9 +50,6 @@
/* Dirty Victims to All Types of Memory */
#define DIRTY_VICTIMS_TO_ALL_MEM BIT(6)

-/* Max event bits supported */
-#define MAX_EVT_CONFIG_BITS GENMASK(6, 0)
-
struct rdt_fs_context {
struct kernfs_fs_context kfc;
bool enable_cdpl2;
@@ -394,6 +391,7 @@ struct rdt_parse_data {
* @msr_update: Function pointer to update QOS MSRs
* @mon_scale: cqm counter * mon_scale = occupancy in bytes
* @mbm_width: Monitor width, to detect and correct for overflow.
+ * @event_mask: Max supported event bitmask.
* @cdp_enabled: CDP state of this resource
*
* Members of this structure are either private to the architecture
@@ -408,6 +406,7 @@ struct rdt_hw_resource {
struct rdt_resource *r);
unsigned int mon_scale;
unsigned int mbm_width;
+ unsigned int event_mask;
bool cdp_enabled;
};

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..30bf919edfda 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -813,6 +813,12 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
return ret;

if (rdt_cpu_has(X86_FEATURE_BMEC)) {
+ u32 eax, ebx, ecx, edx;
+
+ /* Detect list of bandwidth sources that can be tracked */
+ cpuid_count(0x80000020, 3, &eax, &ebx, &ecx, &edx);
+ hw_res->event_mask = ecx;
+
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
mbm_total_event.configurable = true;
mbm_config_rftype_init("mbm_total_bytes_config");
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 69a1de92384a..8a1e9fdab974 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1537,17 +1537,14 @@ static void mon_event_config_read(void *info)
{
struct mon_config_info *mon_info = info;
unsigned int index;
- u64 msrval;
+ u32 h;

index = mon_event_config_index_get(mon_info->evtid);
if (index == INVALID_CONFIG_INDEX) {
pr_warn_once("Invalid event id %d\n", mon_info->evtid);
return;
}
- rdmsrl(MSR_IA32_EVT_CFG_BASE + index, msrval);
-
- /* Report only the valid event configuration bits */
- mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
+ rdmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, h);
}

static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
@@ -1557,6 +1554,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo

static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
struct mon_config_info mon_info = {0};
struct rdt_domain *dom;
bool sep = false;
@@ -1571,7 +1569,9 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
mon_info.evtid = evtid;
mondata_config_read(dom, &mon_info);

- seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
+ /* Report only the valid event configuration bits */
+ seq_printf(s, "%d=0x%02x", dom->id,
+ mon_info.mon_config & hw_res->event_mask);
sep = true;
}
seq_puts(s, "\n");
@@ -1617,12 +1617,14 @@ static void mon_event_config_write(void *info)
static int mbm_config_write_domain(struct rdt_resource *r,
struct rdt_domain *d, u32 evtid, u32 val)
{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
struct mon_config_info mon_info = {0};
int ret = 0;

/* mon_config cannot be more than the supported set of events */
- if (val > MAX_EVT_CONFIG_BITS) {
- rdt_last_cmd_puts("Invalid event configuration\n");
+ if ((val & hw_res->event_mask) != val) {
+ rdt_last_cmd_printf("Invalid input: The maximum valid bitmask is 0x%02x\n",
+ hw_res->event_mask);
return -EINVAL;
}
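
The last hunk changes the validity test from a magnitude comparison to a subset test. The stand-alone sketch below contrasts the two. The function names are invented, and the mask 0x5f (bit 5 unsupported) is a made-up value for a hypothetical part, not one taken from any processor documentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define OLD_MAX_EVT_CONFIG_BITS	UINT32_C(0x7f)	/* GENMASK(6, 0) */

/* Old check: anything numerically up to 0x7f is accepted. */
bool old_config_valid(uint32_t val)
{
	return val <= OLD_MAX_EVT_CONFIG_BITS;
}

/* New check: every bit set in val must also be set in the supported mask. */
bool new_config_valid(uint32_t val, uint32_t supported_mask)
{
	return (val & supported_mask) == val;
}

/* Example on a hypothetical part whose supported mask is 0x5f. */
void config_check_example(void)
{
	/* Bit 5 alone passes the old check but is not a supported source. */
	assert(old_config_valid(0x20));
	assert(!new_config_valid(0x20, 0x5f));
	/* A value made only of supported bits passes both. */
	assert(old_config_valid(0x1f));
	assert(new_config_valid(0x1f, 0x5f));
}
```

The subset form is the one that stays correct when the set of supported sources has holes.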



2023-12-15 01:24:48

by Reinette Chatre

Subject: Re: [PATCH v2 2/2] x86/resctrl: Remove hard-coded memory bandwidth event configuration

Hi Babu,

On 12/12/2023 10:02 AM, Babu Moger wrote:
> If the BMEC (Bandwidth Monitoring Event Configuration) feature is
> supported, the bandwidth events can be configured. The maximum supported
> bandwidth bitmask can be determined by following CPUID command.
>
> CPUID_Fn80000020_ECX_x03 [Platform QoS Monitoring Bandwidth Event
> Configuration] Read-only. Reset: 0000_007Fh.
> Bits Description
> 31:7 Reserved
> 6:0 Identifies the bandwidth sources that can be tracked.
>
> The bandwidth sources can change with the processor generations.
> Currently, this information is hard-coded. Remove the hard-coded value
> and detect using CPUID command. Also print the valid bitmask when the
> user tries to configure invalid value.
>
> The CPUID details are documentation in the PPR listed below [1].
> [1] Processor Programming Reference (PPR) Vol 1.1 for AMD Family 19h Model
> 11h B1 - 55901 Rev 0.25.
>
> Fixes: dc2a3e857981 ("x86/resctrl: Add interface to read mbm_total_bytes_config")
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
> Signed-off-by: Babu Moger <[email protected]>
>
> ---
> v2: Earlier Sent as a part of ABMC feature.
> https://lore.kernel.org/lkml/[email protected]/
> But this is not related to ABMC. Sending it separate now.
> Removed the global resctrl_max_evt_bitmask. Added event_mask as part of
> the resource.
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 5 ++---
> arch/x86/kernel/cpu/resctrl/monitor.c | 6 ++++++
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 18 ++++++++++--------
> 3 files changed, 18 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index d2979748fae4..3e2f505614d8 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -50,9 +50,6 @@
> /* Dirty Victims to All Types of Memory */
> #define DIRTY_VICTIMS_TO_ALL_MEM BIT(6)
>
> -/* Max event bits supported */
> -#define MAX_EVT_CONFIG_BITS GENMASK(6, 0)
> -
> struct rdt_fs_context {
> struct kernfs_fs_context kfc;
> bool enable_cdpl2;
> @@ -394,6 +391,7 @@ struct rdt_parse_data {
> * @msr_update: Function pointer to update QOS MSRs
> * @mon_scale: cqm counter * mon_scale = occupancy in bytes
> * @mbm_width: Monitor width, to detect and correct for overflow.
> + * @event_mask: Max supported event bitmask.

This is a very generic name and description for this feature. Note that in
resctrl monitoring an "event" is already clear (see members of enum resctrl_event_id)
so a generic type of "event_mask" can easily cause confusion with existing
concept of events. How about "mbm_cfg_mask"? Please also make the description
more detailed - it could include that this is unique to BMEC.

> * @cdp_enabled: CDP state of this resource
> *
> * Members of this structure are either private to the architecture
> @@ -408,6 +406,7 @@ struct rdt_hw_resource {
> struct rdt_resource *r);
> unsigned int mon_scale;
> unsigned int mbm_width;
> + unsigned int event_mask;
> bool cdp_enabled;
> };
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index f136ac046851..30bf919edfda 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -813,6 +813,12 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
> return ret;
>
> if (rdt_cpu_has(X86_FEATURE_BMEC)) {
> + u32 eax, ebx, ecx, edx;
> +
> + /* Detect list of bandwidth sources that can be tracked */
> + cpuid_count(0x80000020, 3, &eax, &ebx, &ecx, &edx);
> + hw_res->event_mask = ecx;
> +

This has the same issue as I mentioned in V1. Note that this treats
reserved bits as valid values. I think this is a risky thing to do. For example
when this code is run on future hardware the currently reserved bits may have
values with different meaning than what this code uses it for.

> if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
> mbm_total_event.configurable = true;
> mbm_config_rftype_init("mbm_total_bytes_config");
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 69a1de92384a..8a1e9fdab974 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1537,17 +1537,14 @@ static void mon_event_config_read(void *info)
> {
> struct mon_config_info *mon_info = info;
> unsigned int index;
> - u64 msrval;
> + u32 h;
>
> index = mon_event_config_index_get(mon_info->evtid);
> if (index == INVALID_CONFIG_INDEX) {
> pr_warn_once("Invalid event id %d\n", mon_info->evtid);
> return;
> }
> - rdmsrl(MSR_IA32_EVT_CFG_BASE + index, msrval);
> -
> - /* Report only the valid event configuration bits */
> - mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
> + rdmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, h);

I do not think this code needed to be changed. We do not want to treat
reserved bits as valid values.

> }
>
> static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
> @@ -1557,6 +1554,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
>
> static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
> {
> + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> struct mon_config_info mon_info = {0};
> struct rdt_domain *dom;
> bool sep = false;
> @@ -1571,7 +1569,9 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
> mon_info.evtid = evtid;
> mondata_config_read(dom, &mon_info);
>
> - seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
> + /* Report only the valid event configuration bits */
> + seq_printf(s, "%d=0x%02x", dom->id,
> + mon_info.mon_config & hw_res->event_mask);
> sep = true;
> }
> seq_puts(s, "\n");
> @@ -1617,12 +1617,14 @@ static void mon_event_config_write(void *info)
> static int mbm_config_write_domain(struct rdt_resource *r,
> struct rdt_domain *d, u32 evtid, u32 val)
> {
> + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> struct mon_config_info mon_info = {0};
> int ret = 0;
>
> /* mon_config cannot be more than the supported set of events */
> - if (val > MAX_EVT_CONFIG_BITS) {
> - rdt_last_cmd_puts("Invalid event configuration\n");
> + if ((val & hw_res->event_mask) != val) {
> + rdt_last_cmd_printf("Invalid input: The maximum valid bitmask is 0x%02x\n",
> + hw_res->event_mask);
> return -EINVAL;
> }
>
>
>

Reinette
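
The concern about reserved bits in the CPUID output can be shown with a few lines of user-space C. The macro and function names below are illustrative, and the ECX values are made up; the only fact taken from the thread is that bits 6:0 carry the field and bits 31:7 are reserved.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 6:0 of CPUID_Fn80000020_ECX_x03; bits 31:7 are reserved. */
#define MBM_CFG_FIELD_MASK	UINT32_C(0x7f)

/* Keep only the defined bits of the raw ECX value. */
uint32_t mbm_cfg_mask_from_ecx(uint32_t ecx)
{
	return ecx & MBM_CFG_FIELD_MASK;
}

/* Example with made-up ECX values. */
void mbm_cfg_mask_example(void)
{
	/* A reserved bit (bit 31) set by some future part is dropped. */
	assert(mbm_cfg_mask_from_ecx(UINT32_C(0x8000007f)) == 0x7f);
	/* A part that lacks bit 5 keeps its narrower mask. */
	assert(mbm_cfg_mask_from_ecx(UINT32_C(0x5f)) == 0x5f);
}
```

Without the mask, whatever a future processor places in bits 31:7 would be treated as additional bandwidth sources.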

2023-12-15 02:21:20

by Reinette Chatre

Subject: Re: [PATCH v2 1/2] x86/resctrl: Remove hard-coded memory bandwidth limit

Hi Babu,

On 12/12/2023 10:02 AM, Babu Moger wrote:
> The QOS Memory Bandwidth Enforcement Limit is reported by
> CPUID_Fn80000020_EAX_x01 and CPUID_Fn80000020_EAX_x02.
> Bits Description
> 31:0 BW_LEN: Size of the QOS Memory Bandwidth Enforcement Limit.
>
> Newer processors can support higher bandwidth limit than the current
> hard-coded value. Remove the hard-coded value and detect using CPUID
> command. Also update the register variables eax and edx to match the
> AMD CPUID definition.
>
> The CPUID details are documentation in the PPR listed below [1].
> [1] Processor Programming Reference (PPR) Vol 1.1 for AMD Family 19h Model
> 11h B1 - 55901 Rev 0.25.
>
> Fixes: 4d05bf71f157 ("x86/resctrl: Introduce AMD QOS feature")
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
> Signed-off-by: Babu Moger <[email protected]>
>
> ---

Reviewed-by: Reinette Chatre <[email protected]>

Reinette

2024-01-02 19:52:45

by Babu Moger

Subject: Re: [PATCH v2 1/2] x86/resctrl: Remove hard-coded memory bandwidth limit

Hi Reinette,

On 12/14/23 20:20, Reinette Chatre wrote:
> Hi Babu,
>
> On 12/12/2023 10:02 AM, Babu Moger wrote:
>> The QOS Memory Bandwidth Enforcement Limit is reported by
>> CPUID_Fn80000020_EAX_x01 and CPUID_Fn80000020_EAX_x02.
>> Bits Description
>> 31:0 BW_LEN: Size of the QOS Memory Bandwidth Enforcement Limit.
>>
>> Newer processors can support higher bandwidth limit than the current
>> hard-coded value. Remove the hard-coded value and detect using CPUID
>> command. Also update the register variables eax and edx to match the
>> AMD CPUID definition.
>>
>> The CPUID details are documentation in the PPR listed below [1].
>> [1] Processor Programming Reference (PPR) Vol 1.1 for AMD Family 19h Model
>> 11h B1 - 55901 Rev 0.25.
>>
>> Fixes: 4d05bf71f157 ("x86/resctrl: Introduce AMD QOS feature")
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>> Signed-off-by: Babu Moger <[email protected]>
>>
>> ---
>
> Reviewed-by: Reinette Chatre <[email protected]>

Thank You
Babu Moger

2024-01-02 20:01:16

by Babu Moger

Subject: Re: [PATCH v2 2/2] x86/resctrl: Remove hard-coded memory bandwidth event configuration

Hi Reinette,

Sorry for the late response. I was out of the office for a couple of weeks.

On 12/14/23 19:24, Reinette Chatre wrote:
> Hi Babu,
>
> On 12/12/2023 10:02 AM, Babu Moger wrote:
>> If the BMEC (Bandwidth Monitoring Event Configuration) feature is
>> supported, the bandwidth events can be configured. The maximum supported
>> bandwidth bitmask can be determined by following CPUID command.
>>
>> CPUID_Fn80000020_ECX_x03 [Platform QoS Monitoring Bandwidth Event
>> Configuration] Read-only. Reset: 0000_007Fh.
>> Bits Description
>> 31:7 Reserved
>> 6:0 Identifies the bandwidth sources that can be tracked.
>>
>> The bandwidth sources can change with the processor generations.
>> Currently, this information is hard-coded. Remove the hard-coded value
>> and detect using CPUID command. Also print the valid bitmask when the
>> user tries to configure invalid value.
>>
>> The CPUID details are documentation in the PPR listed below [1].
>> [1] Processor Programming Reference (PPR) Vol 1.1 for AMD Family 19h Model
>> 11h B1 - 55901 Rev 0.25.
>>
>> Fixes: dc2a3e857981 ("x86/resctrl: Add interface to read mbm_total_bytes_config")
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>> Signed-off-by: Babu Moger <[email protected]>
>>
>> ---
>> v2: Earlier Sent as a part of ABMC feature.
>> https://lore.kernel.org/lkml/[email protected]/
>> But this is not related to ABMC. Sending it separate now.
>> Removed the global resctrl_max_evt_bitmask. Added event_mask as part of
>> the resource.
>> ---
>> arch/x86/kernel/cpu/resctrl/internal.h | 5 ++---
>> arch/x86/kernel/cpu/resctrl/monitor.c | 6 ++++++
>> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 18 ++++++++++--------
>> 3 files changed, 18 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>> index d2979748fae4..3e2f505614d8 100644
>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>> @@ -50,9 +50,6 @@
>> /* Dirty Victims to All Types of Memory */
>> #define DIRTY_VICTIMS_TO_ALL_MEM BIT(6)
>>
>> -/* Max event bits supported */
>> -#define MAX_EVT_CONFIG_BITS GENMASK(6, 0)
>> -
>> struct rdt_fs_context {
>> struct kernfs_fs_context kfc;
>> bool enable_cdpl2;
>> @@ -394,6 +391,7 @@ struct rdt_parse_data {
>> * @msr_update: Function pointer to update QOS MSRs
>> * @mon_scale: cqm counter * mon_scale = occupancy in bytes
>> * @mbm_width: Monitor width, to detect and correct for overflow.
>> + * @event_mask: Max supported event bitmask.
>
> This is a very generic name and description for this feature. Note that in
> resctrl monitoring an "event" is already clear (see members of enum resctrl_event_id)
> so a generic type of "event_mask" can easily cause confusion with existing
> concept of events. How about "mbm_cfg_mask"? Please also make the description

That should be fine.

> more detailed - it could include that this is unique to BMEC.

Sure.

>
>> * @cdp_enabled: CDP state of this resource
>> *
>> * Members of this structure are either private to the architecture
>> @@ -408,6 +406,7 @@ struct rdt_hw_resource {
>> struct rdt_resource *r);
>> unsigned int mon_scale;
>> unsigned int mbm_width;
>> + unsigned int event_mask;
>> bool cdp_enabled;
>> };
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index f136ac046851..30bf919edfda 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -813,6 +813,12 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>> return ret;
>>
>> if (rdt_cpu_has(X86_FEATURE_BMEC)) {
>> + u32 eax, ebx, ecx, edx;
>> +
>> + /* Detect list of bandwidth sources that can be tracked */
>> + cpuid_count(0x80000020, 3, &eax, &ebx, &ecx, &edx);
>> + hw_res->event_mask = ecx;
>> +
>
> This has the same issue as I mentioned in V1. Note that this treats
> reserved bits as valid values. I think this is a risky thing to do. For example
> when this code is run on future hardware the currently reserved bits may have
> values with different meaning than what this code uses it for.

Sure. Will use the mask MAX_EVT_CONFIG_BITS.
hw_res->mbm_cfg_mask = ecx & MAX_EVT_CONFIG_BITS;

>
>> if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
>> mbm_total_event.configurable = true;
>> mbm_config_rftype_init("mbm_total_bytes_config");
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 69a1de92384a..8a1e9fdab974 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -1537,17 +1537,14 @@ static void mon_event_config_read(void *info)
>> {
>> struct mon_config_info *mon_info = info;
>> unsigned int index;
>> - u64 msrval;
>> + u32 h;
>>
>> index = mon_event_config_index_get(mon_info->evtid);
>> if (index == INVALID_CONFIG_INDEX) {
>> pr_warn_once("Invalid event id %d\n", mon_info->evtid);
>> return;
>> }
>> - rdmsrl(MSR_IA32_EVT_CFG_BASE + index, msrval);
>> -
>> - /* Report only the valid event configuration bits */
>> - mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
>> + rdmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, h);
>
> I do not think this code needed to be changed. We do not want to treat
> reserved bits as valid values.

The logic is still the same. We don't have access to rdt_hw_resource in
this function. So, I just moved the masking to mbm_config_show while printing.

Thanks
Babu

>
>> }
>>
>> static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
>> @@ -1557,6 +1554,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
>>
>> static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
>> {
>> + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>> struct mon_config_info mon_info = {0};
>> struct rdt_domain *dom;
>> bool sep = false;
>> @@ -1571,7 +1569,9 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
>> mon_info.evtid = evtid;
>> mondata_config_read(dom, &mon_info);
>>
>> - seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
>> + /* Report only the valid event configuration bits */
>> + seq_printf(s, "%d=0x%02x", dom->id,
>> + mon_info.mon_config & hw_res->event_mask);
>> sep = true;
>> }
>> seq_puts(s, "\n");
>> @@ -1617,12 +1617,14 @@ static void mon_event_config_write(void *info)
>> static int mbm_config_write_domain(struct rdt_resource *r,
>> struct rdt_domain *d, u32 evtid, u32 val)
>> {
>> + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>> struct mon_config_info mon_info = {0};
>> int ret = 0;
>>
>> /* mon_config cannot be more than the supported set of events */
>> - if (val > MAX_EVT_CONFIG_BITS) {
>> - rdt_last_cmd_puts("Invalid event configuration\n");
>> + if ((val & hw_res->event_mask) != val) {
>> + rdt_last_cmd_printf("Invalid input: The maximum valid bitmask is 0x%02x\n",
>> + hw_res->event_mask);
>> return -EINVAL;
>> }
>>
>>
>>
>
> Reinette

--
Thanks
Babu Moger

2024-01-03 18:38:38

by Reinette Chatre

Subject: Re: [PATCH v2 2/2] x86/resctrl: Remove hard-coded memory bandwidth event configuration

Hi Babu,

On 1/2/2024 12:00 PM, Moger, Babu wrote:
> On 12/14/23 19:24, Reinette Chatre wrote:
>> On 12/12/2023 10:02 AM, Babu Moger wrote:

>>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>>> index f136ac046851..30bf919edfda 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>>> @@ -813,6 +813,12 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>>> return ret;
>>>
>>> if (rdt_cpu_has(X86_FEATURE_BMEC)) {
>>> + u32 eax, ebx, ecx, edx;
>>> +
>>> + /* Detect list of bandwidth sources that can be tracked */
>>> + cpuid_count(0x80000020, 3, &eax, &ebx, &ecx, &edx);
>>> + hw_res->event_mask = ecx;
>>> +
>>
>> This has the same issue as I mentioned in V1. Note that this treats
>> reserved bits as valid values. I think this is a risky thing to do. For example
>> when this code is run on future hardware the currently reserved bits may have
>> values with different meaning than what this code uses it for.
>
> Sure. Will use the mask MAX_EVT_CONFIG_BITS.
> hw_res->mbm_cfg_mask = ecx & MAX_EVT_CONFIG_BITS;
>
>>
>>> if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
>>> mbm_total_event.configurable = true;
>>> mbm_config_rftype_init("mbm_total_bytes_config");
>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> index 69a1de92384a..8a1e9fdab974 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> @@ -1537,17 +1537,14 @@ static void mon_event_config_read(void *info)
>>> {
>>> struct mon_config_info *mon_info = info;
>>> unsigned int index;
>>> - u64 msrval;
>>> + u32 h;
>>>
>>> index = mon_event_config_index_get(mon_info->evtid);
>>> if (index == INVALID_CONFIG_INDEX) {
>>> pr_warn_once("Invalid event id %d\n", mon_info->evtid);
>>> return;
>>> }
>>> - rdmsrl(MSR_IA32_EVT_CFG_BASE + index, msrval);
>>> -
>>> - /* Report only the valid event configuration bits */
>>> - mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
>>> + rdmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, h);
>>
>> I do not think this code needed to be changed. We do not want to treat
>> reserved bits as valid values.
>
> The logic is still the same. We don't have access to rdt_hw_resource in
> this function. So, I just moved the masking to mbm_config_show while printing.

Why do you need access to rdt_hw_resource? This comment is not about the bandwidth
events supported by the device but instead about the bits used to represent these events.
This is the same issue as in rdt_get_mon_l3_config(). The above change returns
reserved bits as valid while the original code ensured that only the bits used for
the field are returned (through the usage of MAX_EVT_CONFIG_BITS).

Reinette
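
The difference between the two read paths can be modelled in user space as follows. rdmsr() hands back the low and high 32-bit halves of the register, so storing the low half directly keeps any reserved bits set in bits 31:7, whereas the original code masked the 64-bit value down to the field. The struct and function names are invented for this sketch, and the MSR value in the usage note is made up.

```c
#include <assert.h>
#include <stdint.h>

#define EVT_CFG_FIELD_MASK	UINT64_C(0x7f)	/* bits 6:0, GENMASK(6, 0) */

/* The two halves that a low/high style MSR read produces. */
struct msr_halves {
	uint32_t low;
	uint32_t high;
};

struct msr_halves split_msr(uint64_t msrval)
{
	struct msr_halves h = {
		.low = (uint32_t)msrval,
		.high = (uint32_t)(msrval >> 32),
	};

	/* Recombining the halves must give back the original value. */
	assert((((uint64_t)h.high << 32) | h.low) == msrval);
	return h;
}

/* Original behaviour: keep only the bits that belong to the field. */
uint32_t mon_config_masked(uint64_t msrval)
{
	return (uint32_t)(msrval & EVT_CFG_FIELD_MASK);
}
```

For a made-up register value of 0x1000000ff, the low half is 0xff (bit 7 is reserved but still present), while the masked result is 0x7f.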

2024-01-03 21:03:20

by Babu Moger

Subject: Re: [PATCH v2 2/2] x86/resctrl: Remove hard-coded memory bandwidth event configuration

Hi Reinette,

On 1/3/24 12:38, Reinette Chatre wrote:
> Hi Babu,
>
> On 1/2/2024 12:00 PM, Moger, Babu wrote:
>> On 12/14/23 19:24, Reinette Chatre wrote:
>>> On 12/12/2023 10:02 AM, Babu Moger wrote:
>
>>>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>>>> index f136ac046851..30bf919edfda 100644
>>>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>>>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>>>> @@ -813,6 +813,12 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>>>> return ret;
>>>>
>>>> if (rdt_cpu_has(X86_FEATURE_BMEC)) {
>>>> + u32 eax, ebx, ecx, edx;
>>>> +
>>>> + /* Detect list of bandwidth sources that can be tracked */
>>>> + cpuid_count(0x80000020, 3, &eax, &ebx, &ecx, &edx);
>>>> + hw_res->event_mask = ecx;
>>>> +
>>>
>>> This has the same issue as I mentioned in V1. Note that this treats
>>> reserved bits as valid values. I think this is a risky thing to do. For example
>>> when this code is run on future hardware the currently reserved bits may have
>>> values with different meaning than what this code uses it for.
>>
>> Sure. Will use the mask MAX_EVT_CONFIG_BITS.
>> hw_res->mbm_cfg_mask = ecx & MAX_EVT_CONFIG_BITS;
>>
>>>
>>>> if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
>>>> mbm_total_event.configurable = true;
>>>> mbm_config_rftype_init("mbm_total_bytes_config");
>>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> index 69a1de92384a..8a1e9fdab974 100644
>>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> @@ -1537,17 +1537,14 @@ static void mon_event_config_read(void *info)
>>>> {
>>>> struct mon_config_info *mon_info = info;
>>>> unsigned int index;
>>>> - u64 msrval;
>>>> + u32 h;
>>>>
>>>> index = mon_event_config_index_get(mon_info->evtid);
>>>> if (index == INVALID_CONFIG_INDEX) {
>>>> pr_warn_once("Invalid event id %d\n", mon_info->evtid);
>>>> return;
>>>> }
>>>> - rdmsrl(MSR_IA32_EVT_CFG_BASE + index, msrval);
>>>> -
>>>> - /* Report only the valid event configuration bits */
>>>> - mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
>>>> + rdmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, h);
>>>
>>> I do not think this code needed to be changed. We do not want to treat
>>> reserved bits as valid values.
>>
>> The logic is still the same. We don't have access to rdt_hw_resource in
>> this function. So, I just moved the masking to mbm_config_show while printing.
>
> Why do you need access to rdt_hw_resource? This comment is not about the bandwidth
> events supported by the device but instead the bits used to represent these events.
> This is the same issue as in rdt_get_mon_l3_config. The above change returns
> reserved bits as valid while the original code ensured that only bits used for
> field are returned (through the usage of MAX_EVT_CONFIG_BITS).

We are already saving the valid bits in hw_res->mbm_cfg_mask during the init.

hw_res->mbm_cfg_mask = ecx & MAX_EVT_CONFIG_BITS;

I thought we could use it here directly to mask any unsupported bits, so I
rearranged the code here.
--
Thanks
Babu Moger

2024-01-03 21:40:57

by Reinette Chatre

Subject: Re: [PATCH v2 2/2] x86/resctrl: Remove hard-coded memory bandwidth event configuration

Hi Babu,

On 1/3/2024 1:03 PM, Moger, Babu wrote:
> On 1/3/24 12:38, Reinette Chatre wrote:
>> On 1/2/2024 12:00 PM, Moger, Babu wrote:
>>> On 12/14/23 19:24, Reinette Chatre wrote:
>>>> On 12/12/2023 10:02 AM, Babu Moger wrote:
>>
>>>>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>>>>> index f136ac046851..30bf919edfda 100644
>>>>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>>>>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>>>>> @@ -813,6 +813,12 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>>>>> return ret;
>>>>>
>>>>> if (rdt_cpu_has(X86_FEATURE_BMEC)) {
>>>>> + u32 eax, ebx, ecx, edx;
>>>>> +
>>>>> + /* Detect list of bandwidth sources that can be tracked */
>>>>> + cpuid_count(0x80000020, 3, &eax, &ebx, &ecx, &edx);
>>>>> + hw_res->event_mask = ecx;
>>>>> +
>>>>
>>>> This has the same issue as I mentioned in V1. Note that this treats
>>>> reserved bits as valid values. I think this is a risky thing to do. For example
>>>> when this code is run on future hardware the currently reserved bits may have
>>>> values with different meaning than what this code uses it for.
>>>
>>> Sure. Will use the mask MAX_EVT_CONFIG_BITS.
>>> hw_res->mbm_cfg_mask = ecx & MAX_EVT_CONFIG_BITS;
>>>
>>>>
>>>>> if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
>>>>> mbm_total_event.configurable = true;
>>>>> mbm_config_rftype_init("mbm_total_bytes_config");
>>>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>>> index 69a1de92384a..8a1e9fdab974 100644
>>>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>>> @@ -1537,17 +1537,14 @@ static void mon_event_config_read(void *info)
>>>>> {
>>>>> struct mon_config_info *mon_info = info;
>>>>> unsigned int index;
>>>>> - u64 msrval;
>>>>> + u32 h;
>>>>>
>>>>> index = mon_event_config_index_get(mon_info->evtid);
>>>>> if (index == INVALID_CONFIG_INDEX) {
>>>>> pr_warn_once("Invalid event id %d\n", mon_info->evtid);
>>>>> return;
>>>>> }
>>>>> - rdmsrl(MSR_IA32_EVT_CFG_BASE + index, msrval);
>>>>> -
>>>>> - /* Report only the valid event configuration bits */
>>>>> - mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
>>>>> + rdmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, h);
>>>>
>>>> I do not think this code needed to be changed. We do not want to treat
>>>> reserved bits as valid values.
>>>
>>> The logic is still the same. We don't have access to rdt_hw_resource in
>>> this function. So, I just moved the masking to mbm_config_show while printing.
>>
>> Why do you need access to rdt_hw_resource? This comment is not about the bandwidth
>> events supported by the device but instead the bits used to represent these events.
>> This is the same issue as in rdt_get_mon_l3_config. The above change returns
>> reserved bits as valid while the original code ensured that only bits used for
>> field are returned (through the usage of MAX_EVT_CONFIG_BITS).
>
> We are already saving the valid bits in hw_res->mbm_cfg_mask during the init.
>
> hw_res->mbm_cfg_mask = ecx & MAX_EVT_CONFIG_BITS;
>
> I thought we can use it here directly to mask any unsupported bits. So, I
> re-arranged the code here.

I am not sure where you mean when you say "use it here" since mbm_cfg_mask is not
used in mon_event_config_read(). My comment is related to mon_event_config_read()
that can reasonably be expected to, and thus should, return the current "mon event
config" value and nothing more.

Reinette


2024-01-04 13:50:37

by Babu Moger

Subject: Re: [PATCH v2 2/2] x86/resctrl: Remove hard-coded memory bandwidth event configuration

Hi Reinette,

On 1/3/24 15:40, Reinette Chatre wrote:
> Hi Babu,
>
> On 1/3/2024 1:03 PM, Moger, Babu wrote:
>> On 1/3/24 12:38, Reinette Chatre wrote:
>>> On 1/2/2024 12:00 PM, Moger, Babu wrote:
>>>> On 12/14/23 19:24, Reinette Chatre wrote:
>>>>> On 12/12/2023 10:02 AM, Babu Moger wrote:
>>>
>>>>>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>>>>>> index f136ac046851..30bf919edfda 100644
>>>>>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>>>>>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>>>>>> @@ -813,6 +813,12 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>>>>>> return ret;
>>>>>>
>>>>>> if (rdt_cpu_has(X86_FEATURE_BMEC)) {
>>>>>> + u32 eax, ebx, ecx, edx;
>>>>>> +
>>>>>> + /* Detect list of bandwidth sources that can be tracked */
>>>>>> + cpuid_count(0x80000020, 3, &eax, &ebx, &ecx, &edx);
>>>>>> + hw_res->event_mask = ecx;
>>>>>> +
>>>>>
>>>>> This has the same issue as I mentioned in V1. Note that this treats
>>>>> reserved bits as valid values. I think this is a risky thing to do. For example
>>>>> when this code is run on future hardware the currently reserved bits may have
>>>>> values with different meaning than what this code uses it for.
>>>>
>>>> Sure. Will use the mask MAX_EVT_CONFIG_BITS.
>>>> hw_res->mbm_cfg_mask = ecx & MAX_EVT_CONFIG_BITS;
>>>>
>>>>>
>>>>>> if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
>>>>>> mbm_total_event.configurable = true;
>>>>>> mbm_config_rftype_init("mbm_total_bytes_config");
>>>>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>>>> index 69a1de92384a..8a1e9fdab974 100644
>>>>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>>>> @@ -1537,17 +1537,14 @@ static void mon_event_config_read(void *info)
>>>>>> {
>>>>>> struct mon_config_info *mon_info = info;
>>>>>> unsigned int index;
>>>>>> - u64 msrval;
>>>>>> + u32 h;
>>>>>>
>>>>>> index = mon_event_config_index_get(mon_info->evtid);
>>>>>> if (index == INVALID_CONFIG_INDEX) {
>>>>>> pr_warn_once("Invalid event id %d\n", mon_info->evtid);
>>>>>> return;
>>>>>> }
>>>>>> - rdmsrl(MSR_IA32_EVT_CFG_BASE + index, msrval);
>>>>>> -
>>>>>> - /* Report only the valid event configuration bits */
>>>>>> - mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
>>>>>> + rdmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, h);
>>>>>
>>>>> I do not think this code needed to be changed. We do not want to treat
>>>>> reserved bits as valid values.
>>>>
>>>> The logic is still the same. We don't have access to rdt_hw_resource in
>>>> this function. So, I just moved the masking to mbm_config_show while printing.
>>>
>>> Why do you need access to rdt_hw_resource? This comment is not about the bandwidth
>>> events supported by the device but instead the bits used to represent these events.
>>> This is the same issue as in rdt_get_mon_l3_config. The above change returns
>>> reserved bits as valid while the original code ensured that only bits used for
>>> field are returned (through the usage of MAX_EVT_CONFIG_BITS).
>>
>> We are already saving the valid bits in hw_res->mbm_cfg_mask during the init.
>>
>> hw_res->mbm_cfg_mask = ecx & MAX_EVT_CONFIG_BITS;
>>
>> I thought we can use it here directly to mask any unsupported bits. So, I
>> re-arranged the code here.
>
> I am not sure where you mean when you say "use it here" since mbm_cfg_mask is not
> used in mon_event_config_read(). My comment is related to mon_event_config_read()
> that can reasonably be expected to, and thus should, return the current "mon event
> config" value and nothing more.
>

OK, sure. Let's keep the same code as before.

--
Thanks
Babu Moger

2024-01-04 21:22:44

by Babu Moger

Subject: [PATCH v3 1/2] x86/resctrl: Remove hard-coded memory bandwidth limit

The QOS Memory Bandwidth Enforcement Limit is reported by
CPUID_Fn80000020_EAX_x01 and CPUID_Fn80000020_EAX_x02.
Bits Description
31:0 BW_LEN: Size of the QOS Memory Bandwidth Enforcement Limit.

Newer processors can support higher bandwidth limit than the current
hard-coded value. Remove the hard-coded value and detect using CPUID
command. Also update the register variables eax and edx to match the
AMD CPUID definition.

The CPUID details are documentation in the PPR listed below [1].
[1] Processor Programming Reference (PPR) Vol 1.1 for AMD Family 19h Model
11h B1 - 55901 Rev 0.25.

Fixes: 4d05bf71f157 ("x86/resctrl: Introduce AMD QOS feature")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
---
v3: No change. Just updated with Reviewed-by

v2: Earlier Sent as a part of ABMC feature.
https://lore.kernel.org/lkml/[email protected]/
Sending it separate now. Addressed comments from Reinette about registers
being used from Intel definition. Also updated commit message.
---
arch/x86/kernel/cpu/resctrl/core.c | 10 ++++------
arch/x86/kernel/cpu/resctrl/internal.h | 1 -
2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 19e0681f0435..d04371e851b4 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -231,9 +231,7 @@ static bool __get_mem_config_intel(struct rdt_resource *r)
static bool __rdt_get_mem_config_amd(struct rdt_resource *r)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- union cpuid_0x10_3_eax eax;
- union cpuid_0x10_x_edx edx;
- u32 ebx, ecx, subleaf;
+ u32 eax, ebx, ecx, edx, subleaf;

/*
* Query CPUID_Fn80000020_EDX_x01 for MBA and
@@ -241,9 +239,9 @@ static bool __rdt_get_mem_config_amd(struct rdt_resource *r)
*/
subleaf = (r->rid == RDT_RESOURCE_SMBA) ? 2 : 1;

- cpuid_count(0x80000020, subleaf, &eax.full, &ebx, &ecx, &edx.full);
- hw_res->num_closid = edx.split.cos_max + 1;
- r->default_ctrl = MAX_MBA_BW_AMD;
+ cpuid_count(0x80000020, subleaf, &eax, &ebx, &ecx, &edx);
+ hw_res->num_closid = edx + 1;
+ r->default_ctrl = 1 << eax;

/* AMD does not use delay */
r->membw.delay_linear = false;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa15f0a2..d2979748fae4 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -18,7 +18,6 @@
#define MBM_OVERFLOW_INTERVAL 1000
#define MAX_MBA_BW 100u
#define MBA_IS_LINEAR 0x4
-#define MAX_MBA_BW_AMD 0x800
#define MBM_CNTR_WIDTH_OFFSET_AMD 20

#define RMID_VAL_ERROR BIT_ULL(63)
--
2.34.1
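
As a worked example of the "1 << eax" computation in the patch above, here is a minimal user-space sketch. The helper name mba_default_ctrl() is hypothetical, bw_len stands in for the BW_LEN value read from CPUID, and the range check is an addition of this sketch (shifting a 32-bit value by 32 or more is undefined behavior in C); the patch itself does not perform such a check.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring "r->default_ctrl = 1 << eax".
 * Returns 0 for an out-of-range bw_len; that convention exists only in
 * this sketch and is not taken from the patch.
 */
static uint32_t mba_default_ctrl(uint32_t bw_len)
{
	if (bw_len >= 32)
		return 0;
	return UINT32_C(1) << bw_len;
}
```

With bw_len = 11 the result is 0x800, the value of the removed MAX_MBA_BW_AMD constant; a larger bw_len yields a correspondingly larger limit without any change to the code.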


2024-01-04 21:22:55

by Babu Moger

Subject: [PATCH v3 2/2] x86/resctrl: Remove hard-coded memory bandwidth event configuration

If the BMEC (Bandwidth Monitoring Event Configuration) feature is
supported, the bandwidth events can be configured. The maximum supported
bandwidth bitmask can be determined by following CPUID command.

CPUID_Fn80000020_ECX_x03 [Platform QoS Monitoring Bandwidth Event
Configuration] Read-only. Reset: 0000_007Fh.
Bits Description
31:7 Reserved
6:0 Identifies the bandwidth sources that can be tracked.

The bandwidth sources can change with the processor generations.
Remove the hard-coded value and detect using CPUID command. Also,
print the valid bitmask when the user tries to configure invalid value.

The CPUID details are documentation in the PPR listed below [1].
[1] Processor Programming Reference (PPR) Vol 1.1 for AMD Family 19h Model
11h B1 - 55901 Rev 0.25.

Fixes: dc2a3e857981 ("x86/resctrl: Add interface to read mbm_total_bytes_config")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <[email protected]>
---
v3: Changed the event_mask name to mbm_cfg_mask. Added comments about the field.
Reverted the masking of event configuration to original code.
Few minor comment changes.

v2: Earlier sent as a part of ABMC feature.
https://lore.kernel.org/lkml/[email protected]/
But this is not related to ABMC. Sending it separate now.
Removed the global resctrl_max_evt_bitmask. Added event_mask as part of
the resource.
---
arch/x86/kernel/cpu/resctrl/internal.h | 3 +++
arch/x86/kernel/cpu/resctrl/monitor.c | 6 ++++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 6 ++++--
3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index d2979748fae4..e3dc35a00a19 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -394,6 +394,8 @@ struct rdt_parse_data {
* @msr_update: Function pointer to update QOS MSRs
* @mon_scale: cqm counter * mon_scale = occupancy in bytes
* @mbm_width: Monitor width, to detect and correct for overflow.
+ * @mbm_cfg_mask: Bandwidth sources that can be tracked when Bandwidth
+ * Monitoring Event Configuration (BMEC) is supported.
* @cdp_enabled: CDP state of this resource
*
* Members of this structure are either private to the architecture
@@ -408,6 +410,7 @@ struct rdt_hw_resource {
struct rdt_resource *r);
unsigned int mon_scale;
unsigned int mbm_width;
+ unsigned int mbm_cfg_mask;
bool cdp_enabled;
};

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..acca577e2b06 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -813,6 +813,12 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
return ret;

if (rdt_cpu_has(X86_FEATURE_BMEC)) {
+ u32 eax, ebx, ecx, edx;
+
+ /* Detect list of bandwidth sources that can be tracked */
+ cpuid_count(0x80000020, 3, &eax, &ebx, &ecx, &edx);
+ hw_res->mbm_cfg_mask = ecx & MAX_EVT_CONFIG_BITS;
+
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
mbm_total_event.configurable = true;
mbm_config_rftype_init("mbm_total_bytes_config");
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 69a1de92384a..5b5a8f0ffb2f 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1617,12 +1617,14 @@ static void mon_event_config_write(void *info)
static int mbm_config_write_domain(struct rdt_resource *r,
struct rdt_domain *d, u32 evtid, u32 val)
{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
struct mon_config_info mon_info = {0};
int ret = 0;

/* mon_config cannot be more than the supported set of events */
- if (val > MAX_EVT_CONFIG_BITS) {
- rdt_last_cmd_puts("Invalid event configuration\n");
+ if ((val & hw_res->mbm_cfg_mask) != val) {
+ rdt_last_cmd_printf("Invalid input: The maximum valid bitmask is 0x%02x\n",
+ hw_res->mbm_cfg_mask);
return -EINVAL;
}

--
2.34.1
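
The change from "val > MAX_EVT_CONFIG_BITS" to "(val & hw_res->mbm_cfg_mask) != val" in the patch above replaces a magnitude comparison with a subset test. A minimal user-space sketch of the difference follows; both function names are hypothetical, and the mask values used in the note after the code are illustrative assumptions rather than values reported by any particular processor.

```c
#include <stdbool.h>
#include <stdint.h>

/* Old style: accept any value that is numerically within the limit. */
static bool valid_by_magnitude(uint32_t val, uint32_t max_bits)
{
	return val <= max_bits;
}

/* New style: every bit set in val must also be set in the supported mask. */
static bool valid_by_mask(uint32_t val, uint32_t mbm_cfg_mask)
{
	return (val & mbm_cfg_mask) == val;
}
```

For instance, if a processor reported a mask of 0x5f (bit 5 unsupported), the value 0x20 would pass the magnitude check against 0x7f but fail the mask check, because it requests a bandwidth source the hardware does not track.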


2024-01-05 21:14:39

by Reinette Chatre

Subject: Re: [PATCH v3 1/2] x86/resctrl: Remove hard-coded memory bandwidth limit

Hi Babu,

On 1/4/2024 1:21 PM, Babu Moger wrote:
> The QOS Memory Bandwidth Enforcement Limit is reported by
> CPUID_Fn80000020_EAX_x01 and CPUID_Fn80000020_EAX_x02.
> Bits Description
> 31:0 BW_LEN: Size of the QOS Memory Bandwidth Enforcement Limit.
>
> Newer processors can support higher bandwidth limit than the current
> hard-coded value. Remove the hard-coded value and detect using CPUID
> command. Also update the register variables eax and edx to match the
> AMD CPUID definition.
>
> The CPUID details are documentation in the PPR listed below [1].

"are documentation" -> "are documented"?

> [1] Processor Programming Reference (PPR) Vol 1.1 for AMD Family 19h Model
> 11h B1 - 55901 Rev 0.25.
>
> Fixes: 4d05bf71f157 ("x86/resctrl: Introduce AMD QOS feature")
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
> Signed-off-by: Babu Moger <[email protected]>
> Reviewed-by: Reinette Chatre <[email protected]>

Looking at "Ordering of commit tags" in Documentation/process/maintainer-tip.rst
I believe "Link:" should be the last entry.

Reinette

2024-01-05 21:18:37

by Reinette Chatre

Subject: Re: [PATCH v3 2/2] x86/resctrl: Remove hard-coded memory bandwidth event configuration

Hi Babu,

On 1/4/2024 1:21 PM, Babu Moger wrote:
> If the BMEC (Bandwidth Monitoring Event Configuration) feature is
> supported, the bandwidth events can be configured. The maximum supported
> bandwidth bitmask can be determined by following CPUID command.
>
> CPUID_Fn80000020_ECX_x03 [Platform QoS Monitoring Bandwidth Event
> Configuration] Read-only. Reset: 0000_007Fh.
> Bits Description
> 31:7 Reserved
> 6:0 Identifies the bandwidth sources that can be tracked.
>
> The bandwidth sources can change with the processor generations.
> Remove the hard-coded value and detect using CPUID command. Also,

I do not think "Remove the hard-coded value" is accurate anymore.

> print the valid bitmask when the user tries to configure invalid value.
>
> The CPUID details are documentation in the PPR listed below [1].

"are documentation" -> "are documented"

> [1] Processor Programming Reference (PPR) Vol 1.1 for AMD Family 19h Model
> 11h B1 - 55901 Rev 0.25.
>
> Fixes: dc2a3e857981 ("x86/resctrl: Add interface to read mbm_total_bytes_config")
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
> Signed-off-by: Babu Moger <[email protected]>

Same comment about "Link:" as for patch 1/2.

> ---
> v3: Changed the event_mask name to mbm_cfg_mask. Added comments about the field.
> Reverted the masking of event configuration to original code.
> Few minor comment changes.
>
> v2: Earlier sent as a part of ABMC feature.
> https://lore.kernel.org/lkml/[email protected]/
> But this is not related to ABMC. Sending it separate now.
> Removed the global resctrl_max_evt_bitmask. Added event_mask as part of
> the resource.
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 3 +++
> arch/x86/kernel/cpu/resctrl/monitor.c | 6 ++++++
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 6 ++++--
> 3 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index d2979748fae4..e3dc35a00a19 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -394,6 +394,8 @@ struct rdt_parse_data {
> * @msr_update: Function pointer to update QOS MSRs
> * @mon_scale: cqm counter * mon_scale = occupancy in bytes
> * @mbm_width: Monitor width, to detect and correct for overflow.
> + * @mbm_cfg_mask: Bandwidth sources that can be tracked when Bandwidth
> + * Monitoring Event Configuration (BMEC) is supported.
> * @cdp_enabled: CDP state of this resource
> *
> * Members of this structure are either private to the architecture
> @@ -408,6 +410,7 @@ struct rdt_hw_resource {
> struct rdt_resource *r);
> unsigned int mon_scale;
> unsigned int mbm_width;
> + unsigned int mbm_cfg_mask;
> bool cdp_enabled;
> };
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index f136ac046851..acca577e2b06 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -813,6 +813,12 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
> return ret;
>
> if (rdt_cpu_has(X86_FEATURE_BMEC)) {
> + u32 eax, ebx, ecx, edx;
> +
> + /* Detect list of bandwidth sources that can be tracked */
> + cpuid_count(0x80000020, 3, &eax, &ebx, &ecx, &edx);
> + hw_res->mbm_cfg_mask = ecx & MAX_EVT_CONFIG_BITS;
> +
> if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
> mbm_total_event.configurable = true;
> mbm_config_rftype_init("mbm_total_bytes_config");
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 69a1de92384a..5b5a8f0ffb2f 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1617,12 +1617,14 @@ static void mon_event_config_write(void *info)
> static int mbm_config_write_domain(struct rdt_resource *r,
> struct rdt_domain *d, u32 evtid, u32 val)

Not specific to this patch, but since the valid mask is per resource I do not think
it is necessary to check user provided value for every domain. The user provided value
can be checked earlier and only once in mon_config_write() before iterating over all
domains to write the value.

> {
> + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> struct mon_config_info mon_info = {0};
> int ret = 0;
>
> /* mon_config cannot be more than the supported set of events */
> - if (val > MAX_EVT_CONFIG_BITS) {
> - rdt_last_cmd_puts("Invalid event configuration\n");
> + if ((val & hw_res->mbm_cfg_mask) != val) {
> + rdt_last_cmd_printf("Invalid input: The maximum valid bitmask is 0x%02x\n",
> + hw_res->mbm_cfg_mask);

I think keeping "Invalid event configuration" is useful to create a detailed message of:
"Invalid event configuration: maximum valid bitmask is 0x%02x"

Reinette
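
The two suggestions above (validate the user value once before iterating over domains, and keep "Invalid event configuration" in the message) can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel implementation: write_config_all_domains() is a hypothetical stand-in for mon_config_write(), the domain list is modeled as a plain array, and snprintf() into a caller buffer stands in for rdt_last_cmd_printf().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for mon_config_write(): check the value once
 * against the per-resource mask, then apply it to every domain.
 * Returns 0 on success and -1 on invalid input. On failure no domain is
 * modified and msg receives the wording suggested in the review.
 */
static int write_config_all_domains(uint32_t val, uint32_t mbm_cfg_mask,
				    uint32_t *domain_cfg, size_t ndomains,
				    char *msg, size_t msglen)
{
	size_t i;

	if ((val & mbm_cfg_mask) != val) {
		snprintf(msg, msglen,
			 "Invalid event configuration: maximum valid bitmask is 0x%02x\n",
			 (unsigned int)mbm_cfg_mask);
		return -1;
	}

	/* The value is known to be valid here, so no per-domain check is needed. */
	for (i = 0; i < ndomains; i++)
		domain_cfg[i] = val;

	return 0;
}
```

Because the mask belongs to the resource rather than to a domain, a single check up front gives the same result as repeating it for each domain, and it guarantees that an invalid value never reaches any of them.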

2024-01-05 23:51:52

by Moger, Babu

Subject: Re: [PATCH v3 1/2] x86/resctrl: Remove hard-coded memory bandwidth limit

Hi Reinette,

On 1/5/2024 3:14 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 1/4/2024 1:21 PM, Babu Moger wrote:
>> The QOS Memory Bandwidth Enforcement Limit is reported by
>> CPUID_Fn80000020_EAX_x01 and CPUID_Fn80000020_EAX_x02.
>> Bits Description
>> 31:0 BW_LEN: Size of the QOS Memory Bandwidth Enforcement Limit.
>>
>> Newer processors can support higher bandwidth limit than the current
>> hard-coded value. Remove the hard-coded value and detect using CPUID
>> command. Also update the register variables eax and edx to match the
>> AMD CPUID definition.
>>
>> The CPUID details are documentation in the PPR listed below [1].
> "are documentation" -> "are documented"?
Sure.
>
>> [1] Processor Programming Reference (PPR) Vol 1.1 for AMD Family 19h Model
>> 11h B1 - 55901 Rev 0.25.
>>
>> Fixes: 4d05bf71f157 ("x86/resctrl: Introduce AMD QOS feature")
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>> Signed-off-by: Babu Moger <[email protected]>
>> Reviewed-by: Reinette Chatre <[email protected]>
> Looking at "Ordering of commit tags" in Documentation/process/maintainer-tip.rst
> I believe "Link:" should be the last entry.

Sure.

thanks

Babu


2024-01-06 00:13:34

by Moger, Babu

Subject: Re: [PATCH v3 2/2] x86/resctrl: Remove hard-coded memory bandwidth event configuration

Hi Reinette,

On 1/5/2024 3:18 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 1/4/2024 1:21 PM, Babu Moger wrote:
>> If the BMEC (Bandwidth Monitoring Event Configuration) feature is
>> supported, the bandwidth events can be configured. The maximum supported
>> bandwidth bitmask can be determined by following CPUID command.
>>
>> CPUID_Fn80000020_ECX_x03 [Platform QoS Monitoring Bandwidth Event
>> Configuration] Read-only. Reset: 0000_007Fh.
>> Bits Description
>> 31:7 Reserved
>> 6:0 Identifies the bandwidth sources that can be tracked.
>>
>> The bandwidth sources can change with the processor generations.
>> Remove the hard-coded value and detect using CPUID command. Also,
> I do not think "Remove the hard-coded value" is accurate anymore.

Will change it to:

"Read the supported bandwidth sources using CPUID command. Also,"

I also need to update the subject line.

>
>> print the valid bitmask when the user tries to configure invalid value.
>>
>> The CPUID details are documentation in the PPR listed below [1].
> "are documentation" -> "are documented"
Sure.
>
>> [1] Processor Programming Reference (PPR) Vol 1.1 for AMD Family 19h Model
>> 11h B1 - 55901 Rev 0.25.
>>
>> Fixes: dc2a3e857981 ("x86/resctrl: Add interface to read mbm_total_bytes_config")
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>> Signed-off-by: Babu Moger <[email protected]>
> Same comment about "Link:" as for patch 1/2.
Sure.
>
>> ---
>> v3: Changed the event_mask name to mbm_cfg_mask. Added comments about the field.
>> Reverted the masking of event configuration to original code.
>> Few minor comment changes.
>>
>> v2: Earlier sent as a part of ABMC feature.
>> https://lore.kernel.org/lkml/[email protected]/
>> But this is not related to ABMC. Sending it separate now.
>> Removed the global resctrl_max_evt_bitmask. Added event_mask as part of
>> the resource.
>> ---
>> arch/x86/kernel/cpu/resctrl/internal.h | 3 +++
>> arch/x86/kernel/cpu/resctrl/monitor.c | 6 ++++++
>> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 6 ++++--
>> 3 files changed, 13 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>> index d2979748fae4..e3dc35a00a19 100644
>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>> @@ -394,6 +394,8 @@ struct rdt_parse_data {
>> * @msr_update: Function pointer to update QOS MSRs
>> * @mon_scale: cqm counter * mon_scale = occupancy in bytes
>> * @mbm_width: Monitor width, to detect and correct for overflow.
>> + * @mbm_cfg_mask: Bandwidth sources that can be tracked when Bandwidth
>> + * Monitoring Event Configuration (BMEC) is supported.
>> * @cdp_enabled: CDP state of this resource
>> *
>> * Members of this structure are either private to the architecture
>> @@ -408,6 +410,7 @@ struct rdt_hw_resource {
>> struct rdt_resource *r);
>> unsigned int mon_scale;
>> unsigned int mbm_width;
>> + unsigned int mbm_cfg_mask;
>> bool cdp_enabled;
>> };
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index f136ac046851..acca577e2b06 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -813,6 +813,12 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>> return ret;
>>
>> if (rdt_cpu_has(X86_FEATURE_BMEC)) {
>> + u32 eax, ebx, ecx, edx;
>> +
>> + /* Detect list of bandwidth sources that can be tracked */
>> + cpuid_count(0x80000020, 3, &eax, &ebx, &ecx, &edx);
>> + hw_res->mbm_cfg_mask = ecx & MAX_EVT_CONFIG_BITS;
>> +
>> if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
>> mbm_total_event.configurable = true;
>> mbm_config_rftype_init("mbm_total_bytes_config");
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 69a1de92384a..5b5a8f0ffb2f 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -1617,12 +1617,14 @@ static void mon_event_config_write(void *info)
>> static int mbm_config_write_domain(struct rdt_resource *r,
>> struct rdt_domain *d, u32 evtid, u32 val)
> Not specific to this patch, but since the valid mask is per resource I do not think
> it is necessary to check user provided value for every domain. The user provided value
> can be checked earlier and only once in mon_config_write() before iterating over all
> domains to write the value.
Yes. Agree.
>
>> {
>> + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>> struct mon_config_info mon_info = {0};
>> int ret = 0;
>>
>> /* mon_config cannot be more than the supported set of events */
>> - if (val > MAX_EVT_CONFIG_BITS) {
>> - rdt_last_cmd_puts("Invalid event configuration\n");
>> + if ((val & hw_res->mbm_cfg_mask) != val) {
>> + rdt_last_cmd_printf("Invalid input: The maximum valid bitmask is 0x%02x\n",
>> + hw_res->mbm_cfg_mask);
> I think keeping "Invalid event configuration" is useful to create a detailed message of:
> "Invalid event configuration: maximum valid bitmask is 0x%02x"

Sure.

Thanks

Babu


2024-01-11 21:37:04

by Babu Moger

Subject: [PATCH v4 1/2] x86/resctrl: Remove hard-coded memory bandwidth limit

The QOS Memory Bandwidth Enforcement Limit is reported by
CPUID_Fn80000020_EAX_x01 and CPUID_Fn80000020_EAX_x02.
Bits Description
31:0 BW_LEN: Size of the QOS Memory Bandwidth Enforcement Limit.

Newer processors can support a higher bandwidth limit than the current
hard-coded value. Remove the hard-coded value and detect the limit using the
CPUID command. Also update the register variables eax and edx to match the
AMD CPUID definition.

The CPUID details are documented in the PPR listed below [1].
[1] Processor Programming Reference (PPR) Vol 1.1 for AMD Family 19h Model
11h B1 - 55901 Rev 0.25.

Fixes: 4d05bf71f157 ("x86/resctrl: Introduce AMD QOS feature")
Signed-off-by: Babu Moger <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
---
v4: Minor text changes and re-order of commit tags.
v3: No change. Just updated with Reviewed-by

v2: Earlier Sent as a part of ABMC feature.
https://lore.kernel.org/lkml/[email protected]/
Sending it separate now. Addressed comments from Reinette about registers
being used from Intel definition. Also updated commit message.
---
arch/x86/kernel/cpu/resctrl/core.c | 10 ++++------
arch/x86/kernel/cpu/resctrl/internal.h | 1 -
2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 19e0681f0435..d04371e851b4 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -231,9 +231,7 @@ static bool __get_mem_config_intel(struct rdt_resource *r)
static bool __rdt_get_mem_config_amd(struct rdt_resource *r)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- union cpuid_0x10_3_eax eax;
- union cpuid_0x10_x_edx edx;
- u32 ebx, ecx, subleaf;
+ u32 eax, ebx, ecx, edx, subleaf;

/*
* Query CPUID_Fn80000020_EDX_x01 for MBA and
@@ -241,9 +239,9 @@ static bool __rdt_get_mem_config_amd(struct rdt_resource *r)
*/
subleaf = (r->rid == RDT_RESOURCE_SMBA) ? 2 : 1;

- cpuid_count(0x80000020, subleaf, &eax.full, &ebx, &ecx, &edx.full);
- hw_res->num_closid = edx.split.cos_max + 1;
- r->default_ctrl = MAX_MBA_BW_AMD;
+ cpuid_count(0x80000020, subleaf, &eax, &ebx, &ecx, &edx);
+ hw_res->num_closid = edx + 1;
+ r->default_ctrl = 1 << eax;

/* AMD does not use delay */
r->membw.delay_linear = false;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa15f0a2..d2979748fae4 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -18,7 +18,6 @@
#define MBM_OVERFLOW_INTERVAL 1000
#define MAX_MBA_BW 100u
#define MBA_IS_LINEAR 0x4
-#define MAX_MBA_BW_AMD 0x800
#define MBM_CNTR_WIDTH_OFFSET_AMD 20

#define RMID_VAL_ERROR BIT_ULL(63)
--
2.34.1
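
As a rough illustration of the change above (not part of the patch): the new default is obtained by shifting 1 left by the BW_LEN value that CPUID returns in EAX. The helper name mba_default_from_bw_len() is hypothetical, and the guard against shift counts of 32 or more is an assumption added for this sketch; the kernel code shifts without such a check. A BW_LEN of 11 would reproduce the removed constant MAX_MBA_BW_AMD (0x800).

```c
#include <stdint.h>

/*
 * Hypothetical userspace helper, not kernel code. It mirrors the patch's
 * "r->default_ctrl = 1 << eax", where eax holds BW_LEN as reported by
 * CPUID_Fn80000020_EAX_x01 (MBA) or CPUID_Fn80000020_EAX_x02 (SMBA).
 *
 * Assumption for this sketch only: return 0 when bw_len >= 32, because
 * shifting a 32-bit value by its full width or more is undefined in C.
 */
static uint32_t mba_default_from_bw_len(uint32_t bw_len)
{
	if (bw_len >= 32)
		return 0;

	return (uint32_t)1 << bw_len;
}
```

Calling mba_default_from_bw_len(11) gives 0x800, so a processor that reports a larger BW_LEN would get a correspondingly larger default limit instead of the old fixed value.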


2024-01-11 21:37:19

by Babu Moger

Subject: [PATCH v4 2/2] x86/resctrl: Read supported bandwidth sources using CPUID command

If the BMEC (Bandwidth Monitoring Event Configuration) feature is
supported, the bandwidth events can be configured. The maximum supported
bandwidth bitmask can be determined by following CPUID command.

CPUID_Fn80000020_ECX_x03 [Platform QoS Monitoring Bandwidth Event
Configuration] Read-only. Reset: 0000_007Fh.
Bits Description
31:7 Reserved
6:0 Identifies the bandwidth sources that can be tracked.

The bandwidth sources can change with the processor generations.
Read the supported bandwidth sources using the CPUID command.

While at it, move the mask checking to mon_config_write() before
iterating over all the domains. Also, print the valid bitmask when
the user tries to configure invalid event configuration value.

The CPUID details are documented in the PPR listed below [1].
[1] Processor Programming Reference (PPR) Vol 1.1 for AMD Family 19h Model
11h B1 - 55901 Rev 0.25.

Fixes: dc2a3e857981 ("x86/resctrl: Add interface to read mbm_total_bytes_config")
Signed-off-by: Babu Moger <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
---
v4: Minor text changes and re-order of commit tags.
Moved the mask check to mon_config_write() before iterating over all the
domains.

v3: Changed the event_mask name to mbm_cfg_mask. Added comments about the field.
Reverted the masking of event configuration to original code.
Few minor comment changes.

v2: Earlier sent as a part of ABMC feature.
https://lore.kernel.org/lkml/[email protected]/
But this is not related to ABMC. Sending it separate now.
Removed the global resctrl_max_evt_bitmask. Added event_mask as part of
the resource.
---
arch/x86/kernel/cpu/resctrl/internal.h | 3 +++
arch/x86/kernel/cpu/resctrl/monitor.c | 6 ++++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 14 ++++++++------
3 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index d2979748fae4..e3dc35a00a19 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -394,6 +394,8 @@ struct rdt_parse_data {
* @msr_update: Function pointer to update QOS MSRs
* @mon_scale: cqm counter * mon_scale = occupancy in bytes
* @mbm_width: Monitor width, to detect and correct for overflow.
+ * @mbm_cfg_mask: Bandwidth sources that can be tracked when Bandwidth
+ * Monitoring Event Configuration (BMEC) is supported.
* @cdp_enabled: CDP state of this resource
*
* Members of this structure are either private to the architecture
@@ -408,6 +410,7 @@ struct rdt_hw_resource {
struct rdt_resource *r);
unsigned int mon_scale;
unsigned int mbm_width;
+ unsigned int mbm_cfg_mask;
bool cdp_enabled;
};

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..acca577e2b06 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -813,6 +813,12 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
return ret;

if (rdt_cpu_has(X86_FEATURE_BMEC)) {
+ u32 eax, ebx, ecx, edx;
+
+ /* Detect list of bandwidth sources that can be tracked */
+ cpuid_count(0x80000020, 3, &eax, &ebx, &ecx, &edx);
+ hw_res->mbm_cfg_mask = ecx & MAX_EVT_CONFIG_BITS;
+
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
mbm_total_event.configurable = true;
mbm_config_rftype_init("mbm_total_bytes_config");
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 69a1de92384a..8e9c96d0ee84 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1620,12 +1620,6 @@ static int mbm_config_write_domain(struct rdt_resource *r,
struct mon_config_info mon_info = {0};
int ret = 0;

- /* mon_config cannot be more than the supported set of events */
- if (val > MAX_EVT_CONFIG_BITS) {
- rdt_last_cmd_puts("Invalid event configuration\n");
- return -EINVAL;
- }
-
/*
* Read the current config value first. If both are the same then
* no need to write it again.
@@ -1663,6 +1657,7 @@ static int mbm_config_write_domain(struct rdt_resource *r,

static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
char *dom_str = NULL, *id_str;
unsigned long dom_id, val;
struct rdt_domain *d;
@@ -1686,6 +1681,13 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
return -EINVAL;
}

+ /* mon_config cannot be more than the supported set of events */
+ if ((val & hw_res->mbm_cfg_mask) != val) {
+ rdt_last_cmd_printf("Invalid event configuration: The maximum valid "
+ "bitmask is 0x%02x\n", hw_res->mbm_cfg_mask);
+ return -EINVAL;
+ }
+
list_for_each_entry(d, &r->domains, list) {
if (d->id == dom_id) {
ret = mbm_config_write_domain(r, d, evtid, val);
--
2.34.1
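
The detection step in monitor.c above keeps only the low seven bits of ECX, since bits 31:7 are reserved. A minimal sketch of that masking, assuming the 7-bit field from the commit message; EVT_CONFIG_BITS and mbm_cfg_mask_from_ecx() are hypothetical names standing in for the kernel's MAX_EVT_CONFIG_BITS and the inline assignment:

```c
#include <stdint.h>

/* Bits 6:0 of CPUID_Fn80000020_ECX_x03, per the commit message. */
#define EVT_CONFIG_BITS 0x7fu

/*
 * Hypothetical helper, not kernel code: derive the supported bandwidth
 * source mask from a raw ECX value by discarding reserved bits 31:7.
 */
static uint32_t mbm_cfg_mask_from_ecx(uint32_t ecx)
{
	return ecx & EVT_CONFIG_BITS;
}
```

With the documented reset value 0000_007Fh the result is 0x7f. A raw value with reserved bits set, such as 0xffffff95, would be reduced to 0x15.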


2024-01-12 19:03:16

by Reinette Chatre

Subject: Re: [PATCH v4 2/2] x86/resctrl: Read supported bandwidth sources using CPUID command

Hi Babu,

On 1/11/2024 1:36 PM, Babu Moger wrote:

> @@ -1686,6 +1681,13 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
> return -EINVAL;
> }
>
> + /* mon_config cannot be more than the supported set of events */

copy&paste error? There is no mon_config in this function.

(copy&paste difficulties reminds me of [1])

> + if ((val & hw_res->mbm_cfg_mask) != val) {
> + rdt_last_cmd_printf("Invalid event configuration: The maximum valid "
> + "bitmask is 0x%02x\n", hw_res->mbm_cfg_mask);

checkpatch.pl should have warned about this split of text across two lines.
Logging functions and single strings are allowed to exceed the max line length.
If you just merge the two lines then checkpatch.pl may still warn for resctrl strings
but that is because it does not recognize rdt_last_cmd_printf() as a logging function.

You can also just shorten the string so this patch passes the checkpatch.pl check.
For example,
"Invalid event configuration: maximum valid mask is 0x%02x\n"
or
"Invalid event configuration: maximum is 0x%02x\n"
or ?

Reinette

[1] https://lore.kernel.org/lkml/[email protected]/

2024-01-12 20:38:47

by Moger, Babu

Subject: Re: [PATCH v4 2/2] x86/resctrl: Read supported bandwidth sources using CPUID command

Hi  Reinette,

On 1/12/2024 1:02 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 1/11/2024 1:36 PM, Babu Moger wrote:
>
>> @@ -1686,6 +1681,13 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
>> return -EINVAL;
>> }
>>
>> + /* mon_config cannot be more than the supported set of events */
> copy&paste error? There is no mon_config in this function.
Yea. it should be mbm_cfg_mask.  Will fix it.
>
> (copy&paste difficulties reminds me of [1])
>
>> + if ((val & hw_res->mbm_cfg_mask) != val) {
>> + rdt_last_cmd_printf("Invalid event configuration: The maximum valid "
>> + "bitmask is 0x%02x\n", hw_res->mbm_cfg_mask);
> checkpatch.pl should have warned about this split of text across two lines.
> Logging functions and single strings are allowed to exceed the max line length.
> If you just merge the two lines then checkpatch.pl may still warn for resctrl strings
> but that is because it does not recognize rdt_last_cmd_printf() as a logging function.
>
> You can also just shorten the string so this patch passes the checkpatch.pl check.
> For example,
> "Invalid event configuration: maximum valid mask is 0x%02x\n"
> or
> "Invalid event configuration: maximum is 0x%02x\n"
> or ?

Yes. Checkpatch reported error when I split the text.

How about this?. Checkpatch is happy.

rdt_last_cmd_printf("Invalid event configuration: max valid mask is 0x%02x\n",
+                                   hw_res->mbm_cfg_mask);

>
> Reinette
>
> [1] https://lore.kernel.org/lkml/[email protected]/

We spent quite a bit of time on this earlier. Yea. It did not go the way
I wanted it. Now, I needed to get back to higher priority tasks.  Will
pick it up once, I am done with current priorities.

Thanks

Babu



2024-01-12 21:24:28

by Reinette Chatre

Subject: Re: [PATCH v4 2/2] x86/resctrl: Read supported bandwidth sources using CPUID command

Hi Babu,

On 1/12/2024 12:38 PM, Moger, Babu wrote:
> Hi  Reinette,
>
> On 1/12/2024 1:02 PM, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 1/11/2024 1:36 PM, Babu Moger wrote:
>>
>>> @@ -1686,6 +1681,13 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
>>>           return -EINVAL;
>>>       }
>>>   +    /* mon_config cannot be more than the supported set of events */
>> copy&paste error? There is no mon_config in this function.
> Yea. it should be mbm_cfg_mask.  Will fix it.

I do not think it is correct to replace mon_config with mbm_cfg_mask. Is this comment
not referring to the user provided value (that is checked against mbm_cfg_mask)? So
perhaps something like:

/* Check value from user against supported events. */
or
/* Value from user cannot be more than the supported set of events. */

Please feel free to improve.

>>
>> (copy&paste difficulties reminds me of [1])
>>
>>> +    if ((val & hw_res->mbm_cfg_mask) != val) {
>>> +        rdt_last_cmd_printf("Invalid event configuration: The maximum valid "
>>> +                    "bitmask is 0x%02x\n", hw_res->mbm_cfg_mask);
>> checkpatch.pl should have warned about this split of text across two lines.
>> Logging functions and single strings are allowed to exceed the max line length.
>> If you just merge the two lines then checkpatch.pl may still warn for resctrl strings
>> but that is because it does not recognize rdt_last_cmd_printf() as a logging function.
>>
>> You can also just shorten the string so this patch passes the checkpatch.pl check.
>> For example,
>> "Invalid event configuration: maximum valid mask is 0x%02x\n"
>> or
>> "Invalid event configuration: maximum is 0x%02x\n"
>> or ?
>
> Yes. Checkpatch reported error when I split the text.
>
> How about this?. Checkpatch is happy.
>
> rdt_last_cmd_printf("Invalid event configuration: max valid mask is 0x%02x\n",
> +                                   hw_res->mbm_cfg_mask);
>

Looks good.

Reinette

2024-01-12 21:54:25

by Moger, Babu

Subject: Re: [PATCH v4 2/2] x86/resctrl: Read supported bandwidth sources using CPUID command

Hi Reinette,

On 1/12/2024 3:24 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 1/12/2024 12:38 PM, Moger, Babu wrote:
>> Hi  Reinette,
>>
>> On 1/12/2024 1:02 PM, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 1/11/2024 1:36 PM, Babu Moger wrote:
>>>
>>>> @@ -1686,6 +1681,13 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
>>>>           return -EINVAL;
>>>>       }
>>>>   +    /* mon_config cannot be more than the supported set of events */
>>> copy&paste error? There is no mon_config in this function.
>> Yea. it should be mbm_cfg_mask.  Will fix it.
> I do not think it is correct to replace mon_config with mbm_cfg_mask. Is this comment
> not referring to the user provided value (that is checked against mbm_cfg_mask)? So
> perhaps something like:
>
> /* Check value from user against supported events. */
> or
> /* Value from user cannot be more than the supported set of events. */

This looks good. Thanks

Babu


2024-01-15 22:53:01

by Babu Moger

Subject: [PATCH v5 1/2] x86/resctrl: Remove hard-coded memory bandwidth limit

The QOS Memory Bandwidth Enforcement Limit is reported by
CPUID_Fn80000020_EAX_x01 and CPUID_Fn80000020_EAX_x02.
Bits Description
31:0 BW_LEN: Size of the QOS Memory Bandwidth Enforcement Limit.

Newer processors can support higher bandwidth limit than the current
hard-coded value. Remove the hard-coded value and detect using CPUID
command. Also update the register variables eax and edx to match the
AMD CPUID definition.

The CPUID details are documented in the PPR listed below [1].
[1] Processor Programming Reference (PPR) Vol 1.1 for AMD Family 19h Model
11h B1 - 55901 Rev 0.25.

Fixes: 4d05bf71f157 ("x86/resctrl: Introduce AMD QOS feature")
Signed-off-by: Babu Moger <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
---
v5: No changes.
v4: Minor text changes and re-order of commit tags.
v3: No change. Just updated with Reviewed-by

v2: Earlier Sent as a part of ABMC feature.
https://lore.kernel.org/lkml/[email protected]/
Sending it separate now. Addressed comments from Reinette about registers
being used from Intel definition. Also updated commit message.
---
arch/x86/kernel/cpu/resctrl/core.c | 10 ++++------
arch/x86/kernel/cpu/resctrl/internal.h | 1 -
2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 19e0681f0435..d04371e851b4 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -231,9 +231,7 @@ static bool __get_mem_config_intel(struct rdt_resource *r)
static bool __rdt_get_mem_config_amd(struct rdt_resource *r)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- union cpuid_0x10_3_eax eax;
- union cpuid_0x10_x_edx edx;
- u32 ebx, ecx, subleaf;
+ u32 eax, ebx, ecx, edx, subleaf;

/*
* Query CPUID_Fn80000020_EDX_x01 for MBA and
@@ -241,9 +239,9 @@ static bool __rdt_get_mem_config_amd(struct rdt_resource *r)
*/
subleaf = (r->rid == RDT_RESOURCE_SMBA) ? 2 : 1;

- cpuid_count(0x80000020, subleaf, &eax.full, &ebx, &ecx, &edx.full);
- hw_res->num_closid = edx.split.cos_max + 1;
- r->default_ctrl = MAX_MBA_BW_AMD;
+ cpuid_count(0x80000020, subleaf, &eax, &ebx, &ecx, &edx);
+ hw_res->num_closid = edx + 1;
+ r->default_ctrl = 1 << eax;

/* AMD does not use delay */
r->membw.delay_linear = false;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa15f0a2..d2979748fae4 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -18,7 +18,6 @@
#define MBM_OVERFLOW_INTERVAL 1000
#define MAX_MBA_BW 100u
#define MBA_IS_LINEAR 0x4
-#define MAX_MBA_BW_AMD 0x800
#define MBM_CNTR_WIDTH_OFFSET_AMD 20

#define RMID_VAL_ERROR BIT_ULL(63)
--
2.34.1


2024-01-15 22:53:33

by Babu Moger

Subject: [PATCH v5 2/2] x86/resctrl: Read supported bandwidth sources using CPUID command

If the BMEC (Bandwidth Monitoring Event Configuration) feature is
supported, the bandwidth events can be configured. The maximum supported
bandwidth bitmask can be determined by following CPUID command.

CPUID_Fn80000020_ECX_x03 [Platform QoS Monitoring Bandwidth Event
Configuration] Read-only. Reset: 0000_007Fh.
Bits Description
31:7 Reserved
6:0 Identifies the bandwidth sources that can be tracked.

The bandwidth sources can change with the processor generations.
Read the supported bandwidth sources using the CPUID command.

While at it, move the mask checking to mon_config_write() before
iterating over all the domains. Also, print the valid bitmask when
the user tries to configure invalid event configuration value.

The CPUID details are documented in the PPR listed below [1].
[1] Processor Programming Reference (PPR) Vol 1.1 for AMD Family 19h Model
11h B1 - 55901 Rev 0.25.

Fixes: dc2a3e857981 ("x86/resctrl: Add interface to read mbm_total_bytes_config")
Signed-off-by: Babu Moger <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
---
v5: Revised the text in mon_config_write when user tries invalid config.
Few other comment update.

v4: Minor text changes and re-order of commit tags.
Moved the mask check to mon_config_write() before iterating over all the
domains.

v3: Changed the event_mask name to mbm_cfg_mask. Added comments about the field.
Reverted the masking of event configuration to original code.
Few minor comment changes.

v2: Earlier sent as a part of ABMC feature.
https://lore.kernel.org/lkml/[email protected]/
But this is not related to ABMC. Sending it separate now.
Removed the global resctrl_max_evt_bitmask. Added event_mask as part of
the resource.
---
arch/x86/kernel/cpu/resctrl/internal.h | 3 +++
arch/x86/kernel/cpu/resctrl/monitor.c | 6 ++++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 14 ++++++++------
3 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index d2979748fae4..e3dc35a00a19 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -394,6 +394,8 @@ struct rdt_parse_data {
* @msr_update: Function pointer to update QOS MSRs
* @mon_scale: cqm counter * mon_scale = occupancy in bytes
* @mbm_width: Monitor width, to detect and correct for overflow.
+ * @mbm_cfg_mask: Bandwidth sources that can be tracked when Bandwidth
+ * Monitoring Event Configuration (BMEC) is supported.
* @cdp_enabled: CDP state of this resource
*
* Members of this structure are either private to the architecture
@@ -408,6 +410,7 @@ struct rdt_hw_resource {
struct rdt_resource *r);
unsigned int mon_scale;
unsigned int mbm_width;
+ unsigned int mbm_cfg_mask;
bool cdp_enabled;
};

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..acca577e2b06 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -813,6 +813,12 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
return ret;

if (rdt_cpu_has(X86_FEATURE_BMEC)) {
+ u32 eax, ebx, ecx, edx;
+
+ /* Detect list of bandwidth sources that can be tracked */
+ cpuid_count(0x80000020, 3, &eax, &ebx, &ecx, &edx);
+ hw_res->mbm_cfg_mask = ecx & MAX_EVT_CONFIG_BITS;
+
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
mbm_total_event.configurable = true;
mbm_config_rftype_init("mbm_total_bytes_config");
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 69a1de92384a..2b69e560b05f 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1620,12 +1620,6 @@ static int mbm_config_write_domain(struct rdt_resource *r,
struct mon_config_info mon_info = {0};
int ret = 0;

- /* mon_config cannot be more than the supported set of events */
- if (val > MAX_EVT_CONFIG_BITS) {
- rdt_last_cmd_puts("Invalid event configuration\n");
- return -EINVAL;
- }
-
/*
* Read the current config value first. If both are the same then
* no need to write it again.
@@ -1663,6 +1657,7 @@ static int mbm_config_write_domain(struct rdt_resource *r,

static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
char *dom_str = NULL, *id_str;
unsigned long dom_id, val;
struct rdt_domain *d;
@@ -1686,6 +1681,13 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
return -EINVAL;
}

+ /* Value from user cannot be more than the supported set of events */
+ if ((val & hw_res->mbm_cfg_mask) != val) {
+ rdt_last_cmd_printf("Invalid event configuration: max valid mask is 0x%02x\n",
+ hw_res->mbm_cfg_mask);
+ return -EINVAL;
+ }
+
list_for_each_entry(d, &r->domains, list) {
if (d->id == dom_id) {
ret = mbm_config_write_domain(r, d, evtid, val);
--
2.34.1
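
The difference between the removed check (val > MAX_EVT_CONFIG_BITS) and the new one ((val & mbm_cfg_mask) != val) can be shown with a small sketch. This is illustrative only: the function names are hypothetical, 0x7f is assumed as the old upper bound from the 7-bit field in the commit message, and the mask 0x3f is a made-up example of a processor that does not support bit 6.

```c
#include <stdbool.h>
#include <stdint.h>

/* Old style: rejects only values numerically above the 7-bit maximum. */
static bool evt_cfg_valid_old(uint32_t val)
{
	return val <= 0x7fu;
}

/*
 * New style: every bit set in val must also be set in the mask of
 * supported bandwidth sources, so an unsupported bit is caught even
 * when the value is numerically small.
 */
static bool evt_cfg_valid_new(uint32_t val, uint32_t mbm_cfg_mask)
{
	return (val & mbm_cfg_mask) == val;
}
```

With the hypothetical mask 0x3f, the value 0x40 passes the magnitude check but fails the subset check, which is the case the patch is meant to reject. A value such as 0x15 passes both.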


2024-01-16 19:56:07

by Reinette Chatre

Subject: Re: [PATCH v5 2/2] x86/resctrl: Read supported bandwidth sources using CPUID command

Hi Babu,

On 1/15/2024 2:52 PM, Babu Moger wrote:
> If the BMEC (Bandwidth Monitoring Event Configuration) feature is
> supported, the bandwidth events can be configured. The maximum supported
> bandwidth bitmask can be determined by following CPUID command.
>
> CPUID_Fn80000020_ECX_x03 [Platform QoS Monitoring Bandwidth Event
> Configuration] Read-only. Reset: 0000_007Fh.
> Bits Description
> 31:7 Reserved
> 6:0 Identifies the bandwidth sources that can be tracked.
>
> The bandwidth sources can change with the processor generations.
> Read the supported bandwidth sources using the CPUID command.
>
> While at it, move the mask checking to mon_config_write() before
> iterating over all the domains. Also, print the valid bitmask when
> the user tries to configure invalid event configuration value.
>
> The CPUID details are documented in the PPR listed below [1].
> [1] Processor Programming Reference (PPR) Vol 1.1 for AMD Family 19h Model
> 11h B1 - 55901 Rev 0.25.
>
> Fixes: dc2a3e857981 ("x86/resctrl: Add interface to read mbm_total_bytes_config")
> Signed-off-by: Babu Moger <[email protected]>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
> ---

Thank you.

Reviewed-by: Reinette Chatre <[email protected]>

Reinette


2024-01-16 22:31:49

by Moger, Babu

Subject: Re: [PATCH v5 2/2] x86/resctrl: Read supported bandwidth sources using CPUID command

Hi Reinette,

On 1/16/2024 1:44 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 1/15/2024 2:52 PM, Babu Moger wrote:
>> If the BMEC (Bandwidth Monitoring Event Configuration) feature is
>> supported, the bandwidth events can be configured. The maximum supported
>> bandwidth bitmask can be determined by following CPUID command.
>>
>> CPUID_Fn80000020_ECX_x03 [Platform QoS Monitoring Bandwidth Event
>> Configuration] Read-only. Reset: 0000_007Fh.
>> Bits Description
>> 31:7 Reserved
>> 6:0 Identifies the bandwidth sources that can be tracked.
>>
>> The bandwidth sources can change with the processor generations.
>> Read the supported bandwidth sources using the CPUID command.
>>
>> While at it, move the mask checking to mon_config_write() before
>> iterating over all the domains. Also, print the valid bitmask when
>> the user tries to configure invalid event configuration value.
>>
>> The CPUID details are documented in the PPR listed below [1].
>> [1] Processor Programming Reference (PPR) Vol 1.1 for AMD Family 19h Model
>> 11h B1 - 55901 Rev 0.25.
>>
>> Fixes: dc2a3e857981 ("x86/resctrl: Add interface to read mbm_total_bytes_config")
>> Signed-off-by: Babu Moger <[email protected]>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>> ---
> Thank you.
>
> Reviewed-by: Reinette Chatre <[email protected]>


Thank you

Babu Moger


2024-01-19 18:22:45

by Babu Moger

Subject: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)


This series adds support for Assignable Bandwidth Monitoring Counters
(ABMC). It is also called the QoS RMID Pinning feature.

The feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC). The documentation is available at
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537

The patches are based on top of commit
1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master)

# Introduction

AMD hardware can support 256 or more RMIDs. However, the bandwidth
monitoring feature only guarantees that RMIDs currently assigned to a
processor will be tracked by hardware. The counters of any other RMIDs
which are no longer being tracked will be reset to zero. The MBM event
counters return "Unavailable" for the RMIDs that are not active.

Users can create 256 or more monitor groups, but only a limited number of
groups can be given guaranteed monitoring numbers. With ever-changing
configurations, there is no way to know definitively which of these groups
will be active at a given point in time. Users do not have the option to
monitor a group or set of groups for a certain period of time without
worrying about the RMID being reset in between.

The ABMC feature provides an option to the user to assign an RMID to the
hardware counter and monitor the bandwidth for a longer duration.
The assigned RMID will be active until the user unassigns it manually.
There is no need to worry about counters being reset during this period.
Additionally, the user can specify a bitmask identifying the specific
bandwidth types from the given source to track with the counter.

Without ABMC enabled, monitoring will work in current mode without
assignment option.

# Linux Implementation

The Linux resctrl subsystem provides an interface to count a maximum of two
memory bandwidth events per group, from a combination of available total
and local events. Keeping the current interface, users can assign a maximum
of 2 ABMC counters per group. Users will also have the option to assign only
one counter to the group. If the system runs out of assignable ABMC
counters, the kernel will display an error. Users need to unassign an already
assigned counter to make space for new assignments.


# Examples

a. Check if ABMC support is available
#mount -t resctrl resctrl /sys/fs/resctrl/

#cat /sys/fs/resctrl/info/L3_MON/mon_features
llc_occupancy
mbm_total_bytes
mbm_total_bytes_config
mbm_local_bytes
mbm_local_bytes_config
mbm_assign_capable ← Linux kernel detected ABMC feature

b. Check if ABMC is enabled. By default, ABMC feature is disabled.
Monitoring works in legacy monitor mode when ABMC is not enabled.

#cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
0

c. There will be a new file "monitor_state" for each monitor group when the
ABMC feature is supported. However, monitor_state reads "Unsupported" if
ABMC is disabled.

#cat /sys/fs/resctrl/monitor_state
Unsupported

d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC
enabled, monitoring will work in current mode without assignment option.

# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
779247936
# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
765207488

e. Enable ABMC mode.

#echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
#cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
1

f. Read the monitor states. By default, both total and local MBM
events are in "unassign" state.

#cat /sys/fs/resctrl/monitor_state
total=unassign;local=unassign

g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode,
the MBM events are not available until the user assigns the events
explicitly.

#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
Unsupported

#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
Unsupported

h. The event llc_occupancy is not affected by ABMC mode. Users can still
read the llc_occupancy.

#cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy
557056

i. Now assign the total event and read the monitor_state.

#echo total=assign > /sys/fs/resctrl/monitor_state
#cat /sys/fs/resctrl/monitor_state
total=assign;local=unassign

j. Now that the total event is assigned, read the total event.

#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
6136000

k. Now assign the local event and read the monitor_state.

#echo local=assign > /sys/fs/resctrl/monitor_state
#cat /sys/fs/resctrl/monitor_state
total=assign;local=assign

Users can also assign both total and local events in one single
command.

l. Now that both total and local events are assigned, read the events.

#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
6136000
#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
58694

m. Check the bandwidth configuration for the group. Note that bandwidth
configuration has a domain scope. Total event defaults to 0x7F (to
count all the events) and local event defaults to 0x15 (to count all
the local numa events). The event bitmap decoding is available at
https://www.kernel.org/doc/Documentation/x86/resctrl.rst
in section "mbm_total_bytes_config", "mbm_local_bytes_config":

#cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
0=0x7f;1=0x7f

#cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
0=0x15;1=0x15

n. Change the bandwidth source for domain 0 for the total event to count only reads.
Note that this change affects total events on domain 0.

#echo 0=0x33 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
#cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
0=0x33;1=0x7f

o. Now read the total event again. The mbm_total_bytes should display
only the read events.

#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
314101

p. Unmount the resctrl

#umount /sys/fs/resctrl/

NOTE: For simplicity these examples are run on the default resctrl group.
Similar experiments can be run on non-default groups.
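
The masks used in steps m through o can be decoded bit by bit. The sketch below is illustrative and not part of the series. The bit descriptions follow the "mbm_total_bytes_config" table in the resctrl documentation referenced in step m, and decode_evt_config() is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Bandwidth sources for bits 0..6, as listed in the resctrl documentation. */
static const char *const evt_bit_names[7] = {
	"Reads to memory in the local NUMA domain",
	"Reads to memory in the non-local NUMA domain",
	"Non-temporal writes to local NUMA domain",
	"Non-temporal writes to non-local NUMA domain",
	"Reads to slow memory in the local NUMA domain",
	"Reads to slow memory in the non-local NUMA domain",
	"Dirty victims from the QOS domain to all types of memory",
};

/*
 * Hypothetical helper: print one line per bandwidth source selected in
 * val and return how many of bits 0..6 are set. Pass out == NULL to
 * count without printing.
 */
static int decode_evt_config(uint32_t val, FILE *out)
{
	int count = 0;

	for (size_t bit = 0; bit < 7; bit++) {
		if (!(val & ((uint32_t)1 << bit)))
			continue;
		if (out)
			fprintf(out, "bit %zu: %s\n", bit, evt_bit_names[bit]);
		count++;
	}

	return count;
}
```

For example, decode_evt_config(0x33, stdout) lists bits 0, 1, 4 and 5, which are the four read sources, while 0x15 selects bits 0, 2 and 4, the local NUMA sources.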
---
v2:
a. Major change is the way ABMC is enabled. Earlier, user needed to remount
with -o abmc to enable ABMC feature. Removed that option now.
Now users can enable ABMC by "echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable".

b. Added new word 21 to x86/cpufeatures.h.

c. Display unsupported if user attempts to read the events when ABMC is enabled
and event is not assigned.

d. Display monitor_state as "Unsupported" when ABMC is disabled.

e. Text updates and rebase to latest tip tree (as of Jan 18).

f. This series is still work in progress. I am yet to hear from ARM developers.

v1 :
https://lore.kernel.org/lkml/[email protected]/

Babu Moger (17):
x86/cpufeatures: Add word 21 for scattered CPUID features
x86/resctrl: Add support for Assignable Bandwidth Monitoring Counters
(ABMC)
x86/resctrl: Add ABMC feature in the command line options
x86/resctrl: Detect Assignable Bandwidth Monitoring feature details
x86/resctrl: Introduce resctrl_file_fflags_init
x86/resctrl: Introduce interface to display number of ABMC counters
x86/resctrl: Add support to enable/disable ABMC feature
x86/resctrl: Introduce the interface to display ABMC state
x86/resctrl: Introdruce rdtgroup_assign_enable_write
x86/resctrl: Add interface to display monitor state of the group
x86/resctrl: Report Unsupported when MBM events are read
x86/resctrl: Initialize assignable counters bitmap
x86/resctrl: Add data structures for ABMC assignment
x86/resctrl: Introduce mbm_total_cfg and mbm_local_cfg
x86/resctrl: Add the interface to assign the RMID
x86/resctrl: Add the interface unassign the RMID
x86/resctrl: Update RMID assignments on event configuration changes

.../admin-guide/kernel-parameters.txt | 2 +-
Documentation/arch/x86/resctrl.rst | 62 +++
arch/x86/include/asm/cpufeature.h | 6 +-
arch/x86/include/asm/cpufeatures.h | 7 +-
arch/x86/include/asm/disabled-features.h | 3 +-
arch/x86/include/asm/msr-index.h | 2 +
arch/x86/include/asm/required-features.h | 3 +-
arch/x86/kernel/cpu/cpuid-deps.c | 3 +
arch/x86/kernel/cpu/resctrl/core.c | 25 +-
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 15 +
arch/x86/kernel/cpu/resctrl/internal.h | 54 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 26 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 466 +++++++++++++++++-
arch/x86/kernel/cpu/scattered.c | 1 +
include/linux/resctrl.h | 2 +
15 files changed, 651 insertions(+), 26 deletions(-)

--
2.34.1


2024-01-19 18:22:55

by Babu Moger

Subject: [PATCH v2 01/17] x86/cpufeatures: Add word 21 for scattered CPUID features

Word 11, which was reserved for various scattered CPUID feature bits,
has run out of bits. Add word 21 for additional scattered feature bits.
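
For readers unfamiliar with the feature-word layout, here is a minimal
userspace sketch of how an encoded feature number such as (21*32+ 0) maps
to a word index and a bit position. The names feature_word(),
feature_bit(), feature_is_set() and NCAPINTS_SKETCH are hypothetical and
exist only for this illustration; they are not kernel helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: models the (word * 32 + bit) encoding used by the feature macros. */
#define NCAPINTS_SKETCH 22	/* value of NCAPINTS after this patch: words 0..21 */

/* Word index of an encoded feature number. */
static inline unsigned int feature_word(unsigned int feature)
{
	return feature / 32;
}

/* Bit position of the feature inside its word. */
static inline unsigned int feature_bit(unsigned int feature)
{
	return feature % 32;
}

/* Test a feature in a capability array of NCAPINTS_SKETCH 32-bit words. */
static inline int feature_is_set(const uint32_t caps[NCAPINTS_SKETCH],
				 unsigned int feature)
{
	/* A feature in word 21 is only addressable once the array has 22 words. */
	assert(feature_word(feature) < NCAPINTS_SKETCH);
	return (caps[feature_word(feature)] >> feature_bit(feature)) & 1u;
}
```

With 21 words, a feature encoded in word 21 would index past the end of
the array, which is why NCAPINTS and the mask checks are bumped together.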

Signed-off-by: Babu Moger <[email protected]>

---
v2: This is a new patch in v2. Added because feature word 11 is now full.
---
arch/x86/include/asm/cpufeature.h | 6 ++++--
arch/x86/include/asm/cpufeatures.h | 6 +++++-
arch/x86/include/asm/disabled-features.h | 3 ++-
arch/x86/include/asm/required-features.h | 3 ++-
4 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index a26bebbdff87..de394d8f6d16 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -91,8 +91,9 @@ extern const char * const x86_bug_flags[NBUGINTS*32];
CHECK_BIT_IN_MASK_WORD(REQUIRED_MASK, 18, feature_bit) || \
CHECK_BIT_IN_MASK_WORD(REQUIRED_MASK, 19, feature_bit) || \
CHECK_BIT_IN_MASK_WORD(REQUIRED_MASK, 20, feature_bit) || \
+ CHECK_BIT_IN_MASK_WORD(REQUIRED_MASK, 21, feature_bit) || \
REQUIRED_MASK_CHECK || \
- BUILD_BUG_ON_ZERO(NCAPINTS != 21))
+ BUILD_BUG_ON_ZERO(NCAPINTS != 22))

#define DISABLED_MASK_BIT_SET(feature_bit) \
( CHECK_BIT_IN_MASK_WORD(DISABLED_MASK, 0, feature_bit) || \
@@ -116,8 +117,9 @@ extern const char * const x86_bug_flags[NBUGINTS*32];
CHECK_BIT_IN_MASK_WORD(DISABLED_MASK, 18, feature_bit) || \
CHECK_BIT_IN_MASK_WORD(DISABLED_MASK, 19, feature_bit) || \
CHECK_BIT_IN_MASK_WORD(DISABLED_MASK, 20, feature_bit) || \
+ CHECK_BIT_IN_MASK_WORD(DISABLED_MASK, 21, feature_bit) || \
DISABLED_MASK_CHECK || \
- BUILD_BUG_ON_ZERO(NCAPINTS != 21))
+ BUILD_BUG_ON_ZERO(NCAPINTS != 22))

#define cpu_has(c, bit) \
(__builtin_constant_p(bit) && REQUIRED_MASK_BIT_SET(bit) ? 1 : \
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 29cb275a219d..26bd99a35eae 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -13,7 +13,7 @@
/*
* Defines x86 CPU feature bits
*/
-#define NCAPINTS 21 /* N 32-bit words worth of info */
+#define NCAPINTS 22 /* N 32-bit words worth of info */
#define NBUGINTS 2 /* N 32-bit bug flags */

/*
@@ -458,6 +458,10 @@
#define X86_FEATURE_IBPB_BRTYPE (20*32+28) /* "" MSR_PRED_CMD[IBPB] flushes all branch type predictions */
#define X86_FEATURE_SRSO_NO (20*32+29) /* "" CPU is not affected by SRSO */

+/*
+ * Extended auxiliary flags: For features scattered in various CPUID levels, word 21
+ */
+
/*
* BUG word(s)
*/
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 36d0c1e05e60..784335a74f95 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -143,6 +143,7 @@
#define DISABLED_MASK18 (DISABLE_IBT)
#define DISABLED_MASK19 0
#define DISABLED_MASK20 0
-#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 21)
+#define DISABLED_MASK21 0
+#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 22)

#endif /* _ASM_X86_DISABLED_FEATURES_H */
diff --git a/arch/x86/include/asm/required-features.h b/arch/x86/include/asm/required-features.h
index 7ba1726b71c7..e9187ddd3d1f 100644
--- a/arch/x86/include/asm/required-features.h
+++ b/arch/x86/include/asm/required-features.h
@@ -99,6 +99,7 @@
#define REQUIRED_MASK18 0
#define REQUIRED_MASK19 0
#define REQUIRED_MASK20 0
-#define REQUIRED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 21)
+#define REQUIRED_MASK21 0
+#define REQUIRED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 22)

#endif /* _ASM_X86_REQUIRED_FEATURES_H */
--
2.34.1


2024-01-19 18:23:10

by Babu Moger

Subject: [PATCH v2 02/17] x86/resctrl: Add support for Assignable Bandwidth Monitoring Counters (ABMC)

AMD hardware can support 256 or more RMIDs. However, the bandwidth monitoring
feature only guarantees that RMIDs currently assigned to a processor will
be tracked by hardware. The counters of any other RMIDs which are no longer
being tracked will be reset to zero. The MBM event counters return
"Unavailable" for the RMIDs that are not active.

Users can create 256 or more monitor groups, but only a limited number of
groups can be given guaranteed monitoring numbers. With ever-changing
configurations, there is no way to know definitively which of these groups
will be active at a given point in time. Users do not have the option to
monitor a group or set of groups for a certain period of time without
worrying about the RMID being reset in between.

The ABMC feature provides an option to the user to assign an RMID to the
hardware counter and monitor the bandwidth for a longer duration.
The assigned RMID will be active until the user unassigns it manually.
There is no need to worry about counters being reset during this period.
Additionally, the user can specify a bitmask identifying the specific
bandwidth types from the given source to track with the counter.

The Linux resctrl subsystem provides an interface to count a maximum of two
memory bandwidth events per group, from a combination of available total
and local events. Keeping the current interface, users can assign a maximum
of 2 ABMC counters per group. Users also have the option to assign only
one counter to a group. If the system runs out of assignable ABMC
counters, the kernel will display an error. Users need to unassign an
already assigned counter to make space for new assignments.

AMD hardware provides a total of 32 ABMC counters when supported.
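
The assign/unassign accounting described above can be modeled with a
simple bitmap. The following is only a userspace sketch under the
assumption of 32 counters; counter_alloc(), counter_free() and
free_counters are hypothetical names for this illustration and are not
the functions added by this series.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_ABMC_COUNTERS 32	/* total stated in the commit message */

/* One bit per counter; a set bit means the counter is free. */
static uint32_t free_counters = UINT32_MAX;

/* Return a free counter ID, or -1 when every counter is already assigned. */
static int counter_alloc(void)
{
	for (int id = 0; id < NUM_ABMC_COUNTERS; id++) {
		if (free_counters & (UINT32_C(1) << id)) {
			free_counters &= ~(UINT32_C(1) << id);
			return id;
		}
	}
	return -1;
}

/* Release a counter so that a later assignment can reuse it. */
static void counter_free(int id)
{
	assert(id >= 0 && id < NUM_ABMC_COUNTERS);
	free_counters |= UINT32_C(1) << id;
}
```

With two counters per group (total and local), 32 counters cover at most
16 fully monitored groups in this model; further requests fail until a
counter is released.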

The feature can be detected via CPUID_Fn80000020_EBX_x00 bit 5.
Bits Description
5 ABMC (Assignable Bandwidth Monitoring Counters)
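
As an illustration of the detection bit, a minimal sketch that tests EBX
bit 5 on a register value the caller has already read. cpuid_bit_set()
and abmc_supported() are hypothetical helper names for this example; the
kernel itself uses the cpuid_bits[] table in scattered.c, as shown in the
diff below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ABMC_CPUID_BIT 5u	/* EBX bit 5 of leaf 0x80000020, subleaf 0 */

/* Test one bit of a 32-bit CPUID output register. */
static bool cpuid_bit_set(uint32_t reg, unsigned int bit)
{
	assert(bit < 32);
	return (reg >> bit) & 1u;
}

/* ebx: EBX as returned by CPUID for leaf 0x80000020, subleaf 0. */
static bool abmc_supported(uint32_t ebx)
{
	return cpuid_bit_set(ebx, ABMC_CPUID_BIT);
}
```

In userspace the EBX value could be obtained with __get_cpuid_count()
from the compiler's <cpuid.h> on x86; that step is left out so the
sketch stays portable.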

The feature details are documented in APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Signed-off-by: Babu Moger <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537

---
v2: Added dependency on X86_FEATURE_BMEC.
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/kernel/cpu/cpuid-deps.c | 3 +++
arch/x86/kernel/cpu/scattered.c | 1 +
3 files changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 26bd99a35eae..ea57e4515da6 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -461,6 +461,7 @@
/*
* Extended auxiliary flags: For features scattered in various CPUID levels, word 21
*/
+#define X86_FEATURE_ABMC (21*32+ 0) /* "" Assignable Bandwidth Monitoring Counters */

/*
* BUG word(s)
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index e462c1d3800a..44e8423628b7 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -70,6 +70,9 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_CQM_MBM_LOCAL, X86_FEATURE_CQM_LLC },
{ X86_FEATURE_BMEC, X86_FEATURE_CQM_MBM_TOTAL },
{ X86_FEATURE_BMEC, X86_FEATURE_CQM_MBM_LOCAL },
+ { X86_FEATURE_ABMC, X86_FEATURE_CQM_MBM_TOTAL },
+ { X86_FEATURE_ABMC, X86_FEATURE_CQM_MBM_LOCAL },
+ { X86_FEATURE_ABMC, X86_FEATURE_BMEC },
{ X86_FEATURE_AVX512_BF16, X86_FEATURE_AVX512VL },
{ X86_FEATURE_AVX512_FP16, X86_FEATURE_AVX512BW },
{ X86_FEATURE_ENQCMD, X86_FEATURE_XSAVES },
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 0dad49a09b7a..698f2ccb9ac1 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -47,6 +47,7 @@ static const struct cpuid_bit cpuid_bits[] = {
{ X86_FEATURE_MBA, CPUID_EBX, 6, 0x80000008, 0 },
{ X86_FEATURE_SMBA, CPUID_EBX, 2, 0x80000020, 0 },
{ X86_FEATURE_BMEC, CPUID_EBX, 3, 0x80000020, 0 },
+ { X86_FEATURE_ABMC, CPUID_EBX, 5, 0x80000020, 0 },
{ X86_FEATURE_PERFMON_V2, CPUID_EAX, 0, 0x80000022, 0 },
{ X86_FEATURE_AMD_LBR_V2, CPUID_EAX, 1, 0x80000022, 0 },
{ 0, 0, 0, 0, 0 }
--
2.34.1


2024-01-19 18:23:25

by Babu Moger

Subject: [PATCH v2 03/17] x86/resctrl: Add ABMC feature in the command line options

Add the command line options to enable or disable the new resctrl feature
ABMC (Assignable Bandwidth Monitoring Counters).
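
To illustrate the option syntax (for example "rdt=cmt,!mba"), here is a
rough userspace sketch of looking up one name in such a list. It is not
the kernel's parser; rdt_opt_lookup() and enum opt_state are hypothetical
names, and the only behavior assumed is the documented one: a name turns
the feature on and a leading '!' turns it off.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical result codes for this sketch. */
enum opt_state { OPT_DEFAULT, OPT_FORCE_ON, OPT_FORCE_OFF };

/*
 * Look up one feature name in a comma-separated option value such as
 * "cmt,!mba,abmc". A leading '!' on a token means "turn off".
 */
static enum opt_state rdt_opt_lookup(const char *list, const char *name)
{
	size_t name_len;

	assert(list != NULL && name != NULL);
	name_len = strlen(name);

	while (*list) {
		size_t tok_len = strcspn(list, ",");
		const char *tok = list;
		size_t len = tok_len;
		bool off = false;

		if (len > 0 && tok[0] == '!') {
			off = true;
			tok++;
			len--;
		}
		/* Compare whole tokens so "mba" does not match "smba". */
		if (len == name_len && strncmp(tok, name, len) == 0)
			return off ? OPT_FORCE_OFF : OPT_FORCE_ON;

		list += tok_len;
		if (*list == ',')
			list++;
	}
	return OPT_DEFAULT;
}
```

With this patch applied, "abmc" and "!abmc" become valid tokens in the
same list.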

Signed-off-by: Babu Moger <[email protected]>

---
v2: No changes
---
Documentation/admin-guide/kernel-parameters.txt | 2 +-
Documentation/arch/x86/resctrl.rst | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 2 ++
3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b28d80b5af33..68b2c4f799b7 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5492,7 +5492,7 @@
rdt= [HW,X86,RDT]
Turn on/off individual RDT features. List is:
cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
- mba, smba, bmec.
+ mba, smba, bmec, abmc.
E.g. to turn on cmt and turn off mba use:
rdt=cmt,!mba

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df64a9d..d816ded93c22 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -26,6 +26,7 @@ MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation) "mba"
SMBA (Slow Memory Bandwidth Allocation) ""
BMEC (Bandwidth Monitoring Event Configuration) ""
+ABMC (Assignable Bandwidth Monitoring Counters) ""
=============================================== ================================

Historically, new features were made visible by default in /proc/cpuinfo. This
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 19e0681f0435..4efe2d6a9eb7 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -678,6 +678,7 @@ enum {
RDT_FLAG_MBA,
RDT_FLAG_SMBA,
RDT_FLAG_BMEC,
+ RDT_FLAG_ABMC,
};

#define RDT_OPT(idx, n, f) \
@@ -703,6 +704,7 @@ static struct rdt_options rdt_options[] __initdata = {
RDT_OPT(RDT_FLAG_MBA, "mba", X86_FEATURE_MBA),
RDT_OPT(RDT_FLAG_SMBA, "smba", X86_FEATURE_SMBA),
RDT_OPT(RDT_FLAG_BMEC, "bmec", X86_FEATURE_BMEC),
+ RDT_OPT(RDT_FLAG_ABMC, "abmc", X86_FEATURE_ABMC),
};
#define NUM_RDT_OPTIONS ARRAY_SIZE(rdt_options)

--
2.34.1


2024-01-19 18:23:41

by Babu Moger

Subject: [PATCH v2 04/17] x86/resctrl: Detect Assignable Bandwidth Monitoring feature details

ABMC feature details are reported via CPUID Fn8000_0020_EBX_x5.
Bits Description
15:0 MAX_ABMC Maximum Supported Assignable Bandwidth
Monitoring Counter ID + 1

Detect the feature details and update
/sys/fs/resctrl/info/L3_MON/mon_features.

If the system supports Assignable Bandwidth Monitoring Counters (ABMC),
the output will have additional text.
$ cat /sys/fs/resctrl/info/L3_MON/mon_features
mbm_assign_capable

The feature details are documented in APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Signed-off-by: Babu Moger <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537

---
v2: Changed the field name to mbm_assign_capable from abmc_capable.
---
Documentation/arch/x86/resctrl.rst | 7 +++++++
arch/x86/kernel/cpu/resctrl/core.c | 17 +++++++++++++++++
arch/x86/kernel/cpu/resctrl/internal.h | 3 +++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 3 +++
include/linux/resctrl.h | 2 ++
5 files changed, 32 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index d816ded93c22..ecc6c65bdaca 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -197,6 +197,13 @@ with the following files:
mbm_local_bytes
mbm_local_bytes_config

+ If the system supports Assignable Bandwidth Monitoring
+ Counters (ABMC), the output will have additional text.
+ Example::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mon_features
+ mbm_assign_capable
+
"mbm_total_bytes_config", "mbm_local_bytes_config":
Read/write files containing the configuration for the mbm_total_bytes
and mbm_local_bytes events, respectively, when the Bandwidth
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 4efe2d6a9eb7..f40ee271a5c7 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -303,6 +303,17 @@ static void rdt_get_cdp_l2_config(void)
rdt_get_cdp_config(RDT_RESOURCE_L2);
}

+static void rdt_get_abmc_cfg(struct rdt_resource *r)
+{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ u32 eax, ebx, ecx, edx;
+
+ r->mbm_assign_capable = true;
+ /* Query CPUID_Fn80000020_EBX_x05 for number of ABMC counters */
+ cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
+ hw_res->mbm_assignable_counters = (ebx & 0xFFFF) + 1;
+}
+
static void
mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
{
@@ -815,6 +826,12 @@ static __init bool get_rdt_alloc_resources(void)
if (get_slow_mem_config())
ret = true;

+ if (rdt_cpu_has(X86_FEATURE_ABMC)) {
+ r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ rdt_get_abmc_cfg(r);
+ ret = true;
+ }
+
return ret;
}

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa15f0a2..01eb0522b42b 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -391,6 +391,8 @@ struct rdt_parse_data {
* resctrl_arch_get_num_closid() to avoid confusion
* with struct resctrl_schema's property of the same name,
* which has been corrected for features like CDP.
+ * @mbm_assignable_counters:
+ * Maximum number of assignable ABMC counters
* @msr_base: Base MSR address for CBMs
* @msr_update: Function pointer to update QOS MSRs
* @mon_scale: cqm counter * mon_scale = occupancy in bytes
@@ -404,6 +406,7 @@ struct rdt_parse_data {
struct rdt_hw_resource {
struct rdt_resource r_resctrl;
u32 num_closid;
+ u32 mbm_assignable_counters;
unsigned int msr_base;
void (*msr_update) (struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 69a1de92384a..9b82ba977d98 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1077,6 +1077,9 @@ static int rdt_mon_features_show(struct kernfs_open_file *of,
seq_printf(seq, "%s_config\n", mevt->name);
}

+ if (r->mbm_assign_capable)
+ seq_printf(seq, "mbm_assign_capable\n");
+
return 0;
}

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 66942d7fba7f..1751a7b0a369 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -162,6 +162,7 @@ struct resctrl_schema;
* @evt_list: List of monitoring events
* @fflags: flags to choose base and info files
* @cdp_capable: Is the CDP feature available on this resource
+ * @mbm_assign_capable: Is the system capable of supporting monitor assignment?
*/
struct rdt_resource {
int rid;
@@ -182,6 +183,7 @@ struct rdt_resource {
struct list_head evt_list;
unsigned long fflags;
bool cdp_capable;
+ bool mbm_assign_capable;
};

/**
--
2.34.1


2024-01-19 18:24:31

by Babu Moger

Subject: [PATCH v2 07/17] x86/resctrl: Add support to enable/disable ABMC feature

Add the functionality to enable/disable the ABMC feature. By default,
ABMC is disabled.

ABMC is enabled by setting the enable bit (bit 0) in MSR L3_QOS_EXT_CFG. When
the state of ABMC is changed, the updated value must be written on all
logical processors in the QOS domain.
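
The bit manipulation itself can be sketched as a pure function on the
register value. This is an illustration only: abmc_cfg_update() and
ABMC_ENABLE_BIT are hypothetical names, and the actual MSR access in the
patch is done with rdmsrl()/wrmsrl() on each CPU of the domain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ABMC_ENABLE_BIT (UINT64_C(1) << 0)	/* bit 0 of L3_QOS_EXT_CFG */

/* Return the new register value; all other bits are preserved. */
static uint64_t abmc_cfg_update(uint64_t msrval, bool enable)
{
	uint64_t updated = enable ? (msrval | ABMC_ENABLE_BIT)
				  : (msrval & ~ABMC_ENABLE_BIT);

	/* Only bit 0 may differ from the value that was read. */
	assert(((updated ^ msrval) & ~ABMC_ENABLE_BIT) == 0);
	return updated;
}
```

The read-modify-write form matters because the register may carry other
configuration bits that must not be cleared when toggling ABMC.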

The ABMC feature details are documented in APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Signed-off-by: Babu Moger <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
---
v2: A few text changes in the commit message.
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 12 +++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 70 +++++++++++++++++++++++++-
3 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index f1bd7b91b3c6..ac0ce88a5978 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1126,6 +1126,7 @@
#define MSR_IA32_MBA_BW_BASE 0xc0000200
#define MSR_IA32_SMBA_BW_BASE 0xc0000280
#define MSR_IA32_EVT_CFG_BASE 0xc0000400
+#define MSR_IA32_L3_QOS_EXT_CFG 0xc00003ff

/* MSR_IA32_VMX_MISC bits */
#define MSR_IA32_VMX_MISC_INTEL_PT (1ULL << 14)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 52ba2fc5c6c4..3467221f2af5 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -54,6 +54,9 @@
/* Max event bits supported */
#define MAX_EVT_CONFIG_BITS GENMASK(6, 0)

+/* ABMC ENABLE */
+#define ABMC_ENABLE BIT(0)
+
struct rdt_fs_context {
struct kernfs_fs_context kfc;
bool enable_cdpl2;
@@ -398,6 +401,7 @@ struct rdt_parse_data {
* @mon_scale: cqm counter * mon_scale = occupancy in bytes
* @mbm_width: Monitor width, to detect and correct for overflow.
* @cdp_enabled: CDP state of this resource
+ * @abmc_enabled: ABMC feature is enabled
*
* Members of this structure are either private to the architecture
* e.g. mbm_width, or accessed via helpers that provide abstraction. e.g.
@@ -413,6 +417,7 @@ struct rdt_hw_resource {
unsigned int mon_scale;
unsigned int mbm_width;
bool cdp_enabled;
+ bool abmc_enabled;
};

static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r)
@@ -458,6 +463,13 @@ static inline bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)

int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);

+static inline bool resctrl_arch_get_abmc_enabled(enum resctrl_res_level l)
+{
+ return rdt_resources_all[l].abmc_enabled;
+}
+
+int resctrl_arch_set_abmc_enabled(enum resctrl_res_level l, bool enable);
+
/*
* To return the common struct rdt_resource, which is contained in struct
* rdt_hw_resource, walk the resctrl member of struct rdt_hw_resource.
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 53be5cd1c28e..2fb26227cbec 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2372,6 +2372,74 @@ int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable)
return 0;
}

+static void resctrl_abmc_msrwrite(void *arg)
+{
+ bool *enable = arg;
+ u64 msrval;
+
+ rdmsrl(MSR_IA32_L3_QOS_EXT_CFG, msrval);
+
+ if (*enable)
+ msrval |= ABMC_ENABLE;
+ else
+ msrval &= ~ABMC_ENABLE;
+
+ wrmsrl(MSR_IA32_L3_QOS_EXT_CFG, msrval);
+}
+
+static int resctrl_abmc_setup(enum resctrl_res_level l, bool enable)
+{
+ struct rdt_resource *r = &rdt_resources_all[l].r_resctrl;
+ struct rdt_domain *d;
+
+ /* Update L3_QOS_EXT_CFG MSR on all the CPUs in cpu_mask */
+ list_for_each_entry(d, &r->domains, list) {
+ on_each_cpu_mask(&d->cpu_mask, resctrl_abmc_msrwrite, &enable, 1);
+ resctrl_arch_reset_rmid_all(r, d);
+ }
+
+ return 0;
+}
+
+static int resctrl_abmc_enable(enum resctrl_res_level l)
+{
+ struct rdt_hw_resource *hw_res = &rdt_resources_all[l];
+ int ret = 0;
+
+ if (!hw_res->abmc_enabled) {
+ ret = resctrl_abmc_setup(l, true);
+ if (!ret)
+ hw_res->abmc_enabled = true;
+ }
+
+ return ret;
+}
+
+static void resctrl_abmc_disable(enum resctrl_res_level l)
+{
+ struct rdt_hw_resource *hw_res = &rdt_resources_all[l];
+
+ if (hw_res->abmc_enabled) {
+ resctrl_abmc_setup(l, false);
+ hw_res->abmc_enabled = false;
+ }
+}
+
+int resctrl_arch_set_abmc_enabled(enum resctrl_res_level l, bool enable)
+{
+ struct rdt_hw_resource *hw_res = &rdt_resources_all[l];
+
+ if (!hw_res->r_resctrl.mbm_assign_capable)
+ return -EINVAL;
+
+ if (enable)
+ return resctrl_abmc_enable(l);
+
+ resctrl_abmc_disable(l);
+
+ return 0;
+}
+
/*
* We don't allow rdtgroup directories to be created anywhere
* except the root directory. Thus when looking for the rdtgroup
@@ -2456,7 +2524,7 @@ static void rdt_disable_ctx(void)
resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false);
resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false);
set_mba_sc(false);
-
+ resctrl_arch_set_abmc_enabled(RDT_RESOURCE_L3, false);
resctrl_debug = false;
}

--
2.34.1


2024-01-19 18:24:46

by Babu Moger

Subject: [PATCH v2 05/17] x86/resctrl: Introduce resctrl_file_fflags_init

Consolidate multiple fflags initialization into one function.

Remove thread_throttle_mode_init, mbm_config_rftype_init and
consolidate them into resctrl_file_fflags_init.
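
The shape of the consolidation, reduced to a userspace model: one helper
looks up a file entry by name and stores whatever flags the caller
passes, instead of one init function per file with hard-coded flags.
struct file_type, the FLAG_* values, file_table and both function names
are hypothetical stand-ins for struct rftype, the RFTYPE_* flags and the
kernel helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for struct rftype and the RFTYPE_* flags. */
struct file_type {
	const char *name;
	unsigned long fflags;
};

#define FLAG_INFO 0x1ul
#define FLAG_MON  0x2ul
#define FLAG_CTRL 0x4ul

static struct file_type file_table[] = {
	{ "thread_throttle_mode",   0 },
	{ "mbm_total_bytes_config", 0 },
	{ "mbm_local_bytes_config", 0 },
};

/* Find a table entry by file name, or NULL if there is none. */
static struct file_type *file_type_by_name(const char *name)
{
	size_t n = sizeof(file_table) / sizeof(file_table[0]);

	for (size_t i = 0; i < n; i++)
		if (strcmp(file_table[i].name, name) == 0)
			return &file_table[i];
	return NULL;
}

/* Single helper: the caller supplies the flags; unknown names are ignored. */
static void file_fflags_init(const char *name, unsigned long fflags)
{
	struct file_type *ft;

	assert(name != NULL);
	ft = file_type_by_name(name);
	if (ft)
		ft->fflags = fflags;
}
```

New files, such as the ABMC ones added later in the series, then need
only one extra call rather than a new init function.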

Signed-off-by: Babu Moger <[email protected]>

---
v2: New patch. New function to consolidate fflags initialization
---
arch/x86/kernel/cpu/resctrl/core.c | 4 +++-
arch/x86/kernel/cpu/resctrl/internal.h | 4 ++--
arch/x86/kernel/cpu/resctrl/monitor.c | 6 ++++--
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 16 +++-------------
4 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index f40ee271a5c7..a38609c82b9e 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -221,7 +221,9 @@ static bool __get_mem_config_intel(struct rdt_resource *r)
r->membw.throttle_mode = THREAD_THROTTLE_PER_THREAD;
else
r->membw.throttle_mode = THREAD_THROTTLE_MAX;
- thread_throttle_mode_init();
+
+ resctrl_file_fflags_init("thread_throttle_mode",
+ RFTYPE_CTRL_INFO | RFTYPE_RES_MB);

r->alloc_capable = true;

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 01eb0522b42b..52ba2fc5c6c4 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -565,8 +565,8 @@ void cqm_handle_limbo(struct work_struct *work);
bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
void __check_limbo(struct rdt_domain *d, bool force_free);
void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
-void __init thread_throttle_mode_init(void);
-void __init mbm_config_rftype_init(const char *config);
+void __init resctrl_file_fflags_init(const char *config,
+ unsigned long fflags);
void rdt_staged_configs_clear(void);

#endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..a6c336b6de61 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -815,11 +815,13 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
if (rdt_cpu_has(X86_FEATURE_BMEC)) {
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
mbm_total_event.configurable = true;
- mbm_config_rftype_init("mbm_total_bytes_config");
+ resctrl_file_fflags_init("mbm_total_bytes_config",
+ RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) {
mbm_local_event.configurable = true;
- mbm_config_rftype_init("mbm_local_bytes_config");
+ resctrl_file_fflags_init("mbm_local_bytes_config",
+ RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
}
}

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 9b82ba977d98..3e233251e7ed 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1980,24 +1980,14 @@ static struct rftype *rdtgroup_get_rftype_by_name(const char *name)
return NULL;
}

-void __init thread_throttle_mode_init(void)
-{
- struct rftype *rft;
-
- rft = rdtgroup_get_rftype_by_name("thread_throttle_mode");
- if (!rft)
- return;
-
- rft->fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB;
-}
-
-void __init mbm_config_rftype_init(const char *config)
+void __init resctrl_file_fflags_init(const char *config,
+ unsigned long fflags)
{
struct rftype *rft;

rft = rdtgroup_get_rftype_by_name(config);
if (rft)
- rft->fflags = RFTYPE_MON_INFO | RFTYPE_RES_CACHE;
+ rft->fflags = fflags;
}

/**
--
2.34.1


2024-01-19 18:25:03

by Babu Moger

Subject: [PATCH v2 09/17] x86/resctrl: Introduce rdtgroup_assign_enable_write

Introduce rdtgroup_assign_enable_write to enable ABMC mode.

Users can enable the feature with the following command:
$ echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
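
The same write can also be issued from a program. A minimal sketch,
assuming the caller passes the path of the mbm_assign_enable file from
this patch; mbm_assign_enable_set() is a hypothetical helper name, and
the error handling is deliberately simple.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Write "1" or "0" to the file at path.
 * Returns 0 on success and -1 on any open or write error.
 */
static int mbm_assign_enable_set(const char *path, bool enable)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fputs(enable ? "1\n" : "0\n", f) >= 0;
	/* fclose() flushes the buffer; a failed flush is a failed write. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

On a system with this series applied, the path would be
/sys/fs/resctrl/info/L3_MON/mbm_assign_enable, and the write would
normally require root privileges.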

Signed-off-by: Babu Moger <[email protected]>

---
v2: This is a new patch to enable/disable ABMC without mounting the filesystem.
---
Documentation/arch/x86/resctrl.rst | 6 +++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 34 +++++++++++++++++++++++++-
2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index f94a4d314690..f09239cb93e8 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -273,6 +273,12 @@ with the following files:
assignment for monitoring. Feature provides an option to assign the RMID
to the hardware counter and monitor the bandwidth for a longer duration.
The assigned RMID will be active until the user unassigns it.
+ By default, the feature is disabled. It can be enabled by writing 1.
+ ::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
+ 0
+ # echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable

"max_threshold_occupancy":
Read/write file provides the largest value (in
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 4f160dbf6376..9c8db9562c91 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -833,6 +833,37 @@ static int rdtgroup_mbm_assign_enable_show(struct kernfs_open_file *of,
return 0;
}

+static ssize_t rdtgroup_mbm_assign_enable_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ struct rdt_resource *r = of->kn->parent->priv;
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ struct rdtgroup *rdtgrp;
+ unsigned int enable;
+ int ret;
+
+ if (!r->mbm_assign_capable)
+ return -EINVAL;
+
+ ret = kstrtouint(buf, 0, &enable);
+ if (ret)
+ return ret;
+
+ rdtgrp = rdtgroup_kn_lock_live(of->kn);
+ if (!rdtgrp) {
+ rdtgroup_kn_unlock(of->kn);
+ return -EINVAL;
+ }
+
+ if (hw_res->abmc_enabled != enable)
+ ret = resctrl_arch_set_abmc_enabled(r->rid, enable);
+
+ rdtgroup_kn_unlock(of->kn);
+
+ return ret ?: nbytes;
+}
+
#ifdef CONFIG_PROC_CPU_RESCTRL

/*
@@ -1891,9 +1922,10 @@ static struct rftype res_common_files[] = {
},
{
.name = "mbm_assign_enable",
- .mode = 0444,
+ .mode = 0644,
.kf_ops = &rdtgroup_kf_single_ops,
.seq_show = rdtgroup_mbm_assign_enable_show,
+ .write = rdtgroup_mbm_assign_enable_write,
},
{
.name = "cpus",
--
2.34.1


2024-01-19 18:25:12

by Babu Moger

Subject: [PATCH v2 06/17] x86/resctrl: Introduce interface to display number of ABMC counters

The ABMC feature provides an option to the user to pin (or assign) the
RMID to the hardware counter and monitor the bandwidth for a longer
duration. There are only a limited number of hardware counters.

Provide the interface to display the number of ABMC counters supported.

Signed-off-by: Babu Moger <[email protected]>

---
v2: Changed the field name to mbm_assignable_counters from abmc_counters.
---
Documentation/arch/x86/resctrl.rst | 4 ++++
arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 17 +++++++++++++++++
3 files changed, 25 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index ecc6c65bdaca..73eeb50fd0b5 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -264,6 +264,10 @@ with the following files:
# cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
0=0x30;1=0x30;3=0x15;4=0x15

+"mbm_assignable_counters":
+ Available when ABMC feature is supported. The number of assignable bandwidth
+ monitoring counters available.
+
"max_threshold_occupancy":
Read/write file provides the largest value (in
bytes) at which a previously used LLC_occupancy
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index a6c336b6de61..fa492ea820f0 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -823,6 +823,10 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
resctrl_file_fflags_init("mbm_local_bytes_config",
RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
}
+
+ if (rdt_cpu_has(X86_FEATURE_ABMC))
+ resctrl_file_fflags_init("mbm_assignable_counters",
+ RFTYPE_MON_INFO);
}

l3_mon_evt_init(r);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 3e233251e7ed..53be5cd1c28e 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -811,6 +811,17 @@ static int rdtgroup_rmid_show(struct kernfs_open_file *of,
return ret;
}

+static int rdtgroup_mbm_assignable_counters_show(struct kernfs_open_file *of,
+ struct seq_file *s, void *v)
+{
+ struct rdt_resource *r = of->kn->parent->priv;
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+
+ seq_printf(s, "%d\n", hw_res->mbm_assignable_counters);
+
+ return 0;
+}
+
#ifdef CONFIG_PROC_CPU_RESCTRL

/*
@@ -1861,6 +1872,12 @@ static struct rftype res_common_files[] = {
.seq_show = mbm_local_bytes_config_show,
.write = mbm_local_bytes_config_write,
},
+ {
+ .name = "mbm_assignable_counters",
+ .mode = 0444,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .seq_show = rdtgroup_mbm_assignable_counters_show,
+ },
{
.name = "cpus",
.mode = 0644,
--
2.34.1


2024-01-19 18:25:20

by Babu Moger

Subject: [PATCH v2 10/17] x86/resctrl: Add interface to display monitor state of the group

The ABMC feature provides an option to the user to assign an RMID to
a hardware counter and monitor the bandwidth for a longer duration.
The assigned RMID will be active until the user unassigns the RMID.

Add a new field monitor_state in the resctrl group interface to display the
assignment state of the group. This field is available when the ABMC feature
is supported on the system.

By default, the monitor_state is initialized to the unassigned state when
ABMC is enabled.
$ cat /sys/fs/resctrl/monitor_state
total=unassign;local=unassign
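
For tooling that consumes this file, a small sketch of formatting and
parsing the state line. TOTAL_ASSIGN and LOCAL_ASSIGN use the bit values
defined in this patch; monitor_state_format() and monitor_state_parse()
are hypothetical userspace helpers, not part of the series.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOTAL_ASSIGN (1u << 0)	/* same bit values as the patch */
#define LOCAL_ASSIGN (1u << 1)

/* Format a state bitmask the way the commit message shows it. */
static int monitor_state_format(unsigned int state, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "total=%s;local=%s",
			(state & TOTAL_ASSIGN) ? "assign" : "unassign",
			(state & LOCAL_ASSIGN) ? "assign" : "unassign");
}

/* Parse a line read from monitor_state back into a bitmask. */
static unsigned int monitor_state_parse(const char *line)
{
	unsigned int state = 0;

	assert(line != NULL);
	/* "total=unassign" does not contain the substring "total=assign". */
	if (strstr(line, "total=assign"))
		state |= TOTAL_ASSIGN;
	if (strstr(line, "local=assign"))
		state |= LOCAL_ASSIGN;
	return state;
}
```

Any other content, such as the "Unsupported" text shown when ABMC is
disabled, parses to 0 in this sketch, so a caller that needs to tell the
two cases apart should check for that string first.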

Signed-off-by: Babu Moger <[email protected]>
---
v2: Added check to display "Unsupported" when user tries to access
monitor state when ABMC is not enabled.
---
Documentation/arch/x86/resctrl.rst | 20 ++++++++++++
arch/x86/kernel/cpu/resctrl/internal.h | 8 +++++
arch/x86/kernel/cpu/resctrl/monitor.c | 2 ++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 45 ++++++++++++++++++++++++++
4 files changed, 75 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index f09239cb93e8..4f89d5d1b61f 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -400,6 +400,26 @@ When monitoring is enabled all MON groups will also contain:
the sum for all tasks in the CTRL_MON group and all tasks in
MON groups. Please see example section for more details on usage.

+"monitor_state":
+ Available when the ABMC feature is supported. The ABMC feature provides an
+ option to the user to assign an RMID to a hardware counter and
+ monitor the bandwidth for a longer duration. The RMID will
+ be active until the user unassigns it manually. Each group has
+ two events that are assignable. By default, the events are
+ unassigned. Index 0 holds the monitor_state for MBM total bytes.
+ Index 1 holds the monitor_state for MBM local bytes.
+
+ Example::
+
+ # cat /sys/fs/resctrl/monitor_state
+ total=unassign;local=unassign
+
+ When the events are assigned, the output will look like below.
+ Example::
+
+ # cat /sys/fs/resctrl/monitor_state
+ total=assign;local=assign
+
"mon_hw_id":
Available only with debug option. The identifier used by hardware
for the monitor group. On x86 this is the RMID.
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 3467221f2af5..865101c5e1c2 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -57,6 +57,12 @@
/* ABMC ENABLE */
#define ABMC_ENABLE BIT(0)

+/*
+ * monitor group's state when ABMC is enabled
+ */
+#define TOTAL_ASSIGN BIT(0)
+#define LOCAL_ASSIGN BIT(1)
+
struct rdt_fs_context {
struct kernfs_fs_context kfc;
bool enable_cdpl2;
@@ -163,12 +169,14 @@ enum rdtgrp_mode {
* @parent: parent rdtgrp
* @crdtgrp_list: child rdtgroup node list
* @rmid: rmid for this rdtgroup
+ * @monitor_state: Assignment state of the group
*/
struct mongroup {
struct kernfs_node *mon_data_kn;
struct rdtgroup *parent;
struct list_head crdtgrp_list;
u32 rmid;
+ u32 monitor_state;
};

/**
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index a45084e30738..de43be2252cc 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -829,6 +829,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
RFTYPE_MON_INFO);
resctrl_file_fflags_init("mbm_assign_enable",
RFTYPE_MON_INFO);
+ resctrl_file_fflags_init("monitor_state",
+ RFTYPE_MON_BASE);
}
}

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 9c8db9562c91..7cae6ac13954 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -779,6 +779,36 @@ static int rdtgroup_tasks_show(struct kernfs_open_file *of,
return ret;
}

+static int rdtgroup_monitor_state_show(struct kernfs_open_file *of,
+ struct seq_file *s, void *v)
+{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ struct rdtgroup *rdtgrp;
+ int ret = 0;
+
+ if (!hw_res->abmc_enabled) {
+ rdt_last_cmd_puts("Assignable Bandwidth Monitoring is not enabled\n");
+ seq_printf(s, "Unsupported\n");
+ return ret;
+ }
+
+ rdt_last_cmd_clear();
+
+ rdtgrp = rdtgroup_kn_lock_live(of->kn);
+ if (rdtgrp)
+ seq_printf(s, "total=%s;local=%s\n",
+ rdtgrp->mon.monitor_state & TOTAL_ASSIGN ?
+ "assign" : "unassign",
+ rdtgrp->mon.monitor_state & LOCAL_ASSIGN ?
+ "assign" : "unassign");
+ else
+ ret = -EINVAL;
+ rdtgroup_kn_unlock(of->kn);
+
+ return ret;
+}
+
static int rdtgroup_closid_show(struct kernfs_open_file *of,
struct seq_file *s, void *v)
{
@@ -1944,6 +1974,12 @@ static struct rftype res_common_files[] = {
.flags = RFTYPE_FLAGS_CPUS_LIST,
.fflags = RFTYPE_BASE,
},
+ {
+ .name = "monitor_state",
+ .mode = 0444,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .seq_show = rdtgroup_monitor_state_show,
+ },
{
.name = "tasks",
.mode = 0644,
@@ -2439,6 +2475,7 @@ static void resctrl_abmc_msrwrite(void *arg)
static int resctrl_abmc_setup(enum resctrl_res_level l, bool enable)
{
struct rdt_resource *r = &rdt_resources_all[l].r_resctrl;
+ struct rdtgroup *prgrp, *crgrp;
struct rdt_domain *d;

/* Update QOS_CFG MSR on all the CPUs in cpu_mask */
@@ -2447,6 +2484,14 @@ static int resctrl_abmc_setup(enum resctrl_res_level l, bool enable)
resctrl_arch_reset_rmid_all(r, d);
}

+ /* Reset monitor state for all the monitor groups */
+ list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
+ memset(&prgrp->mon.monitor_state, 0, sizeof(prgrp->mon.monitor_state));
+
+ list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list)
+ memset(&crgrp->mon.monitor_state, 0, sizeof(crgrp->mon.monitor_state));
+ }
+
return 0;
}

--
2.34.1


2024-01-19 18:25:37

by Babu Moger

Subject: [PATCH v2 11/17] x86/resctrl: Report Unsupported when MBM events are read

Hardware reports "Unavailable" when a user reads an MBM event while ABMC
is enabled and the event is not assigned to a counter. "Unavailable" is
also reported in other error cases.

To differentiate these cases, skip reading the event and report
"Unsupported" so that users can take corrective action.

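The check added here can be summarized as a small predicate. This is a
user-space sketch, not the patch itself: mbm_read_unsupported() is a
hypothetical name, and the event ID values are assumed to match the
kernel's enum resctrl_event_id.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror the kernel's resctrl event IDs. */
#define QOS_L3_OCCUP_EVENT_ID		0x01
#define QOS_L3_MBM_TOTAL_EVENT_ID	0x02
#define QOS_L3_MBM_LOCAL_EVENT_ID	0x03

/* Same bit positions as in this series. */
#define TOTAL_ASSIGN (1u << 0)
#define LOCAL_ASSIGN (1u << 1)

/*
 * Hypothetical predicate: true when a read of @evtid should be
 * answered with "Unsupported" instead of touching the hardware.
 */
static bool mbm_read_unsupported(bool abmc_enabled, unsigned int evtid,
				 unsigned int monitor_state)
{
	unsigned int needed;

	/* Occupancy reads and legacy mode are not affected. */
	if (!abmc_enabled || evtid == QOS_L3_OCCUP_EVENT_ID)
		return false;

	needed = (evtid == QOS_L3_MBM_TOTAL_EVENT_ID) ? TOTAL_ASSIGN
						      : LOCAL_ASSIGN;

	return !(monitor_state & needed);
}
```

With ABMC enabled and only the total event assigned, a read of the local
event is the case that yields "Unsupported".
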
Signed-off-by: Babu Moger <[email protected]>
---
v2: New patch based on feedback.
---
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index beccb0e87ba7..cc4c41eede25 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -542,12 +542,14 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
+ struct rdt_hw_resource *hw_res;
u32 resid, evtid, domid;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
union mon_data_bits md;
struct rdt_domain *d;
struct rmid_read rr;
+ int mon_state;
int ret = 0;

rdtgrp = rdtgroup_kn_lock_live(of->kn);
@@ -568,6 +570,19 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
goto out;
}

+ hw_res = resctrl_to_arch_res(r);
+ if (hw_res->abmc_enabled && evtid != QOS_L3_OCCUP_EVENT_ID) {
+ if (evtid == QOS_L3_MBM_TOTAL_EVENT_ID)
+ mon_state = TOTAL_ASSIGN;
+ else
+ mon_state = LOCAL_ASSIGN;
+
+ if (!(rdtgrp->mon.monitor_state & mon_state)) {
+ seq_puts(m, "Unsupported\n");
+ goto out;
+ }
+ }
+
mon_event_read(&rr, r, d, rdtgrp, evtid, false);

if (rr.err == -EIO)
--
2.34.1


2024-01-19 18:26:45

by Babu Moger

Subject: [PATCH v2 15/17] x86/resctrl: Add the interface to assign the RMID

With the ABMC (Assignable Bandwidth Monitoring Counters) feature, the
user can assign an RMID to a hardware counter, or unassign it, and
monitor the bandwidth for a longer duration.

Provide the interface to assign the counter to the group.
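
The interface takes a list of event=action pairs separated by ';', for
example total=assign;local=assign. A user-space sketch of that parsing
follows. It is not the kernel implementation: parse_monitor_state() is a
hypothetical name, it also accepts "unassign" for completeness, and
unlike the patch it updates the state only when the whole request is
valid.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same bit positions as in this series. */
#define TOTAL_ASSIGN (1u << 0)
#define LOCAL_ASSIGN (1u << 1)

/*
 * Hypothetical parser: apply a request such as
 * "total=assign;local=unassign" to *state.
 * Returns 0 on success and -1 on malformed input; *state is left
 * untouched on failure.
 */
static int parse_monitor_state(const char *req, unsigned int *state)
{
	unsigned int new_state;

	assert(req && state);
	new_state = *state;

	while (*req) {
		/* One "event=action" token ends at ';', '\n' or NUL. */
		size_t len = strcspn(req, ";\n");
		const char *eq = memchr(req, '=', len);
		size_t klen, vlen;
		unsigned int bit;

		if (!eq)
			return -1;

		klen = (size_t)(eq - req);
		vlen = len - klen - 1;

		if (klen == 5 && !memcmp(req, "total", 5))
			bit = TOTAL_ASSIGN;
		else if (klen == 5 && !memcmp(req, "local", 5))
			bit = LOCAL_ASSIGN;
		else
			return -1;

		if (vlen == 6 && !memcmp(eq + 1, "assign", 6))
			new_state |= bit;
		else if (vlen == 8 && !memcmp(eq + 1, "unassign", 8))
			new_state &= ~bit;
		else
			return -1;

		req += len;
		if (*req)	/* skip the ';' or '\n' separator */
			req++;
	}

	*state = new_state;
	return 0;
}
```

Note that when the request is written from a shell, the ';' has to be
quoted, otherwise the shell treats it as a command separator.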

The ABMC feature implements a pair of MSRs, L3_QOS_ABMC_CFG (MSR
C000_03FDh) and L3_QOS_ABMC_DSC (MSR C000_03FEh). Each logical processor
implements a separate copy of these registers. Attempts to read or write
these MSRs when ABMC is not enabled will result in a #GP(0) exception.

Individual assignable bandwidth counters are configured by writing to
L3_QOS_ABMC_CFG MSR and specifying the Counter ID, Bandwidth Source, and
Bandwidth Types. Reading L3_QOS_ABMC_DSC returns the configuration of the
counter specified by L3_QOS_ABMC_CFG [CtrID].

The feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Signed-off-by: Babu Moger <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
---
v2: Minor text changes in commit message.
---
Documentation/arch/x86/resctrl.rst | 7 ++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 160 ++++++++++++++++++++++++-
2 files changed, 166 insertions(+), 1 deletion(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 4f89d5d1b61f..2729c6fe6127 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -420,6 +420,13 @@ When monitoring is enabled all MON groups will also contain:
# cat /sys/fs/resctrl/monitor_state
total=assign;local=assign

+ The user needs to pin (or assign) the RMID to read the MBM event in
+ ABMC mode. Each event can be assigned or unassigned separately.
+ Example::
+
+ # echo total=assign > /sys/fs/resctrl/monitor_state
+ # echo "total=assign;local=assign" > /sys/fs/resctrl/monitor_state
+
"mon_hw_id":
Available only with debug option. The identifier used by hardware
for the monitor group. On x86 this is the RMID.
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index df8d2390fc69..3447fc4ff2e9 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -180,6 +180,18 @@ static void assignable_counters_init(void)
assignable_counter_free_map_len = hw_res->mbm_assignable_counters;
}

+static int assignable_counters_alloc(void)
+{
+ u32 counterid = ffs(assignable_counter_free_map);
+
+ if (counterid == 0)
+ return -ENOSPC;
+ counterid--;
+ assignable_counter_free_map &= ~(1 << counterid);
+
+ return counterid;
+}
+
/**
* rdtgroup_mode_by_closid - Return mode of resource group with closid
* @closid: closid if the resource group
@@ -1635,6 +1647,151 @@ static inline unsigned int mon_event_config_index_get(u32 evtid)
}
}

+static void rdtgroup_abmc_msrwrite(void *info)
+{
+ u64 *msrval = info;
+
+ wrmsrl(MSR_IA32_L3_QOS_ABMC_CFG, *msrval);
+}
+
+static void rdtgroup_abmc_domain(struct rdt_domain *d,
+ struct rdtgroup *rdtgrp,
+ u32 evtid, int index, bool assign)
+{
+ struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ union l3_qos_abmc_cfg abmc_cfg = { 0 };
+ struct arch_mbm_state *arch_mbm;
+
+ abmc_cfg.split.cfg_en = 1;
+ abmc_cfg.split.ctr_en = assign ? 1 : 0;
+ abmc_cfg.split.ctr_id = rdtgrp->mon.abmc_ctr_id[index];
+ abmc_cfg.split.bw_src = rdtgrp->mon.rmid;
+
+ /*
+ * Read the event configuration from the domain and pass it as
+ * bw_type.
+ */
+ if (evtid == QOS_L3_MBM_TOTAL_EVENT_ID) {
+ abmc_cfg.split.bw_type = hw_dom->mbm_total_cfg;
+ arch_mbm = &hw_dom->arch_mbm_total[rdtgrp->mon.rmid];
+ } else {
+ abmc_cfg.split.bw_type = hw_dom->mbm_local_cfg;
+ arch_mbm = &hw_dom->arch_mbm_local[rdtgrp->mon.rmid];
+ }
+
+ smp_call_function_any(&d->cpu_mask, rdtgroup_abmc_msrwrite, &abmc_cfg, 1);
+
+ /* Reset the internal counters */
+ if (arch_mbm)
+ memset(arch_mbm, 0, sizeof(struct arch_mbm_state));
+}
+
+static ssize_t rdtgroup_assign_abmc(struct rdtgroup *rdtgrp,
+ struct rdt_resource *r,
+ u32 evtid, int mon_state)
+{
+ int counterid = 0, index;
+ struct rdt_domain *d;
+
+ if (rdtgrp->mon.monitor_state & mon_state) {
+ rdt_last_cmd_puts("ABMC counter is assigned already\n");
+ return 0;
+ }
+
+ index = mon_event_config_index_get(evtid);
+ if (index == INVALID_CONFIG_INDEX) {
+ pr_warn_once("Invalid event id %d\n", evtid);
+ return -EINVAL;
+ }
+
+ /*
+ * Allocate a new counter and update domains
+ */
+ counterid = assignable_counters_alloc();
+ if (counterid < 0) {
+ rdt_last_cmd_puts("Out of ABMC counters\n");
+ return -ENOSPC;
+ }
+
+ rdtgrp->mon.abmc_ctr_id[index] = counterid;
+
+ list_for_each_entry(d, &r->domains, list)
+ rdtgroup_abmc_domain(d, rdtgrp, evtid, index, 1);
+
+ rdtgrp->mon.monitor_state |= mon_state;
+
+ return 0;
+}
+
+/**
+ * rdtgroup_monitor_state_write - Interface to assign/unassign an RMID.
+ *
+ * Return: 0 for success
+ */
+static ssize_t rdtgroup_monitor_state_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ char *abmc_str, *event_str;
+ struct rdtgroup *rdtgrp;
+ int ret = 0, mon_state;
+ u32 evtid;
+
+ rdtgrp = rdtgroup_kn_lock_live(of->kn);
+ if (!rdtgrp) {
+ rdtgroup_kn_unlock(of->kn);
+ return -EINVAL;
+ }
+
+ if (!hw_res->abmc_enabled) {
+ rdt_last_cmd_puts("ABMC is not enabled\n");
+ rdtgroup_kn_unlock(of->kn);
+ return -EINVAL;
+ }
+
+ rdt_last_cmd_clear();
+
+ while (buf && buf[0] != '\0') {
+ /* Start processing the strings for each domain */
+ abmc_str = strim(strsep(&buf, ";"));
+ event_str = strsep(&abmc_str, "=");
+
+ if (event_str && abmc_str) {
+ if (!strcmp(event_str, "total")) {
+ mon_state = TOTAL_ASSIGN;
+ evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
+ } else if (!strcmp(event_str, "local")) {
+ mon_state = LOCAL_ASSIGN;
+ evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
+ } else {
+ rdt_last_cmd_puts("Invalid ABMC event\n");
+ ret = -EINVAL;
+ break;
+ }
+
+ if (!strcmp(abmc_str, "assign")) {
+ ret = rdtgroup_assign_abmc(rdtgrp, r, evtid, mon_state);
+ if (ret) {
+ rdt_last_cmd_puts("ABMC assign failed\n");
+ break;
+ }
+ } else {
+ rdt_last_cmd_puts("Invalid ABMC event\n");
+ ret = -EINVAL;
+ break;
+ }
+ } else {
+ rdt_last_cmd_puts("Invalid ABMC input\n");
+ ret = -EINVAL;
+ break;
+ }
+ }
+
+ rdtgroup_kn_unlock(of->kn);
+ return ret ?: nbytes;
+}
+
static void mon_event_config_read(void *info)
{
struct mon_config_info *mon_info = info;
@@ -2003,9 +2160,10 @@ static struct rftype res_common_files[] = {
},
{
.name = "monitor_state",
- .mode = 0444,
+ .mode = 0644,
.kf_ops = &rdtgroup_kf_single_ops,
.seq_show = rdtgroup_monitor_state_show,
+ .write = rdtgroup_monitor_state_write,
},
{
.name = "tasks",
--
2.34.1


2024-01-19 18:26:54

by Babu Moger

Subject: [PATCH v2 13/17] x86/resctrl: Add data structures for ABMC assignment

ABMC (Assignable Bandwidth Monitoring Counters) can be configured by
writing to the L3_QOS_ABMC_CFG MSR. When ABMC is enabled, the user can
configure a counter by writing to L3_QOS_ABMC_CFG with the CfgEn field
set, while specifying the Bandwidth Source, Bandwidth Types, and Counter
Identifier. Add the MSR definition and individual field definitions.

MSR L3_QOS_ABMC_CFG (C000_03FDh) definitions.

==========================================================================
Bits    Mnemonic  Description               Access Type  Reset Value
==========================================================================
63      CfgEn     Configuration Enable      R/W          0

62      CtrEn     Counter Enable            R/W          0

61:53   –         Reserved                  MBZ          0

52:48   CtrID     Counter Identifier        R/W          0

47      IsCOS     BwSrc field is a COS      R/W          0
                  (not an RMID)

46:44   –         Reserved                  MBZ          0

43:32   BwSrc     Bandwidth Source          R/W          0
                  (RMID or COS)

31:0    BwType    Bandwidth types to        R/W          0
                  track for this counter
==========================================================================
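
As a cross-check of the layout above, the register value can also be
built with explicit shifts and masks. This is a user-space sketch under
the bit positions listed in the table; the macro and function names are
hypothetical and are not part of this patch, which uses a bitfield union
instead.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions taken from the L3_QOS_ABMC_CFG table. */
#define ABMC_CFG_EN		(1ULL << 63)
#define ABMC_CTR_EN		(1ULL << 62)
#define ABMC_CTR_ID_SHIFT	48
#define ABMC_CTR_ID_MASK	0x1fULL		/* bits 52:48 */
#define ABMC_BW_SRC_SHIFT	32
#define ABMC_BW_SRC_MASK	0xfffULL	/* bits 43:32 */

/*
 * Hypothetical helper: build an L3_QOS_ABMC_CFG value for an RMID
 * source. IsCOS (bit 47) and the reserved bits are left clear.
 */
static uint64_t abmc_cfg_pack(int ctr_en, unsigned int ctr_id,
			      unsigned int rmid, uint32_t bw_type)
{
	uint64_t val = ABMC_CFG_EN;

	assert(ctr_id <= ABMC_CTR_ID_MASK && rmid <= ABMC_BW_SRC_MASK);

	if (ctr_en)
		val |= ABMC_CTR_EN;
	val |= (uint64_t)ctr_id << ABMC_CTR_ID_SHIFT;
	val |= (uint64_t)rmid << ABMC_BW_SRC_SHIFT;
	val |= bw_type;

	return val;
}

/* Hypothetical helper: extract the CtrID field from a packed value. */
static unsigned int abmc_cfg_ctr_id(uint64_t val)
{
	return (unsigned int)((val >> ABMC_CTR_ID_SHIFT) & ABMC_CTR_ID_MASK);
}
```

For example, enabling counter 3 for RMID 0x12 with BwType 0x7f gives
0xC00300120000007F.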

The feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Signed-off-by: Babu Moger <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
---
v2: No changes.
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 23 +++++++++++++++++++++++
2 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index ac0ce88a5978..148c2b8a2264 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1127,6 +1127,7 @@
#define MSR_IA32_SMBA_BW_BASE 0xc0000280
#define MSR_IA32_EVT_CFG_BASE 0xc0000400
#define MSR_IA32_L3_QOS_EXT_CFG 0xc00003ff
+#define MSR_IA32_L3_QOS_ABMC_CFG 0xc00003fd

/* MSR_IA32_VMX_MISC bits */
#define MSR_IA32_VMX_MISC_INTEL_PT (1ULL << 14)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 865101c5e1c2..130854dc8b7f 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -63,6 +63,9 @@
#define TOTAL_ASSIGN BIT(0)
#define LOCAL_ASSIGN BIT(1)

+/* Maximum assignable counters per resctrl group */
+#define ABMC_MAX_CTR_PER_GROUP 2
+
struct rdt_fs_context {
struct kernfs_fs_context kfc;
bool enable_cdpl2;
@@ -170,6 +173,7 @@ enum rdtgrp_mode {
* @crdtgrp_list: child rdtgroup node list
* @rmid: rmid for this rdtgroup
* @monitor_state: Assignment state of the group
+ * @abmc_ctr_id: ABMC counterids assigned to this group
*/
struct mongroup {
struct kernfs_node *mon_data_kn;
@@ -177,6 +181,7 @@ struct mongroup {
struct list_head crdtgrp_list;
u32 rmid;
u32 monitor_state;
+ u32 abmc_ctr_id[ABMC_MAX_CTR_PER_GROUP];
};

/**
@@ -532,6 +537,24 @@ union cpuid_0x10_x_edx {
unsigned int full;
};

+/*
+ * L3_QOS_ABMC_CFG MSR details. ABMC counters can be configured
+ * by writing to L3_QOS_ABMC_CFG.
+ */
+union l3_qos_abmc_cfg {
+ struct {
+ unsigned long bw_type :32,
+ bw_src :12,
+ rsvrd1 : 3,
+ is_cos : 1,
+ ctr_id : 5,
+ rsvrd : 9,
+ ctr_en : 1,
+ cfg_en : 1;
+ } split;
+ unsigned long full;
+};
+
void rdt_last_cmd_clear(void);
void rdt_last_cmd_puts(const char *s);
__printf(1, 2)
--
2.34.1


2024-01-19 18:26:59

by Babu Moger

Subject: [PATCH v2 16/17] x86/resctrl: Add the interface to unassign the RMID

With the ABMC (Assignable Bandwidth Monitoring Counters) feature, the
user can assign an RMID to a hardware counter, or unassign it, and
monitor the bandwidth for a longer duration.

Provide the interface to unassign the RMID.

Signed-off-by: Babu Moger <[email protected]>
---
v2: No changes.
---
Documentation/arch/x86/resctrl.rst | 11 ++++++++
arch/x86/kernel/cpu/resctrl/internal.h | 1 +
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 36 ++++++++++++++++++++++++++
3 files changed, 48 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 2729c6fe6127..4ba9b1275a2b 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -427,6 +427,17 @@ When monitoring is enabled all MON groups will also contain:
# echo total=assign > /sys/fs/resctrl/monitor_state
# echo "total=assign;local=assign" > /sys/fs/resctrl/monitor_state

+ The user needs to unassign the counter to release it.
+ Example::
+
+ # echo total=unassign > /sys/fs/resctrl/monitor_state
+ # cat /sys/fs/resctrl/monitor_state
+ total=unassign;local=assign
+
+ # echo "total=unassign;local=unassign" > /sys/fs/resctrl/monitor_state
+ # cat /sys/fs/resctrl/monitor_state
+ total=unassign;local=unassign
+
"mon_hw_id":
Available only with debug option. The identifier used by hardware
for the monitor group. On x86 this is the RMID.
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index e109c0388762..ca3193986b4f 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -614,5 +614,6 @@ void __init resctrl_file_fflags_init(const char *config,
unsigned long fflags);
void rdt_staged_configs_clear(void);
void arch_domain_mbm_evt_config(struct rdt_hw_domain *hw_dom);
+void assignable_counters_free(int counterid);

#endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 3447fc4ff2e9..869fab878087 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -192,6 +192,11 @@ static int assignable_counters_alloc(void)
return counterid;
}

+void assignable_counters_free(int counterid)
+{
+ assignable_counter_free_map |= 1 << counterid;
+}
+
/**
* rdtgroup_mode_by_closid - Return mode of resource group with closid
* @closid: closid if the resource group
@@ -1723,6 +1728,31 @@ static ssize_t rdtgroup_assign_abmc(struct rdtgroup *rdtgrp,
return 0;
}

+static ssize_t rdtgroup_unassign_abmc(struct rdtgroup *rdtgrp,
+ struct rdt_resource *r,
+ u32 evtid, int mon_state)
+{
+ struct rdt_domain *d;
+ int index;
+
+ index = mon_event_config_index_get(evtid);
+ if (index == INVALID_CONFIG_INDEX) {
+ pr_warn_once("Invalid event id %d\n", evtid);
+ return -EINVAL;
+ }
+
+ if (rdtgrp->mon.monitor_state & mon_state) {
+ list_for_each_entry(d, &r->domains, list)
+ rdtgroup_abmc_domain(d, rdtgrp, evtid, index, 0);
+
+ assignable_counters_free(rdtgrp->mon.abmc_ctr_id[index]);
+ }
+
+ rdtgrp->mon.monitor_state &= ~mon_state;
+
+ return 0;
+}
+
/**
* rdtgroup_monitor_state_write - Interface to assign/unassign an RMID.
*
@@ -1776,6 +1806,12 @@ static ssize_t rdtgroup_monitor_state_write(struct kernfs_open_file *of,
rdt_last_cmd_puts("ABMC assign failed\n");
break;
}
+ } else if (!strcmp(abmc_str, "unassign")) {
+ ret = rdtgroup_unassign_abmc(rdtgrp, r, evtid, mon_state);
+ if (ret) {
+ rdt_last_cmd_puts("ABMC unassign failed\n");
+ break;
+ }
} else {
rdt_last_cmd_puts("Invalid ABMC event\n");
ret = -EINVAL;
--
2.34.1


2024-01-19 18:27:06

by Babu Moger

Subject: [PATCH v2 14/17] x86/resctrl: Introduce mbm_total_cfg and mbm_local_cfg

If the BMEC (Bandwidth Monitoring Event Configuration) feature is
supported, the bandwidth events can be configured to track specific events.
The event configuration is domain specific. The ABMC (Assignable Bandwidth
Monitoring Counters) feature needs the event configuration to assign an
RMID to a hardware counter. Currently, this information is not saved
anywhere.

Save the event configuration information in the rdt_hw_domain, so it can
be used for RMID assignment.
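
The bookkeeping can be sketched in user space as below. This is only an
illustration: the struct and function names are hypothetical, the bit
values of the event-type flags and the event IDs are assumed to match
the kernel's definitions, and the sketch sets the defaults
unconditionally whereas the patch does so only for configurable events.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror the kernel's resctrl event IDs. */
#define QOS_L3_MBM_TOTAL_EVENT_ID	0x02
#define QOS_L3_MBM_LOCAL_EVENT_ID	0x03

/* Assumed bit values of the BMEC event-type flags. */
#define READS_TO_LOCAL_MEM		(1u << 0)
#define NON_TEMP_WRITE_TO_LOCAL_MEM	(1u << 2)
#define READS_TO_LOCAL_S_MEM		(1u << 4)
#define MAX_EVT_CONFIG_BITS		0x7fu

/* Hypothetical stand-in for the two fields added to rdt_hw_domain. */
struct domain_evt_cfg {
	uint32_t mbm_total_cfg;
	uint32_t mbm_local_cfg;
};

/* Record the default event configuration for a new domain. */
static void domain_evt_cfg_init(struct domain_evt_cfg *cfg)
{
	cfg->mbm_total_cfg = MAX_EVT_CONFIG_BITS;
	cfg->mbm_local_cfg = READS_TO_LOCAL_MEM |
			     NON_TEMP_WRITE_TO_LOCAL_MEM |
			     READS_TO_LOCAL_S_MEM;
}

/*
 * Remember a user-supplied configuration so that a later counter
 * assignment can reuse it. Returns 0 on success, -1 for an event
 * that has no configuration.
 */
static int domain_evt_cfg_update(struct domain_evt_cfg *cfg,
				 unsigned int evtid, uint32_t val)
{
	if (evtid == QOS_L3_MBM_TOTAL_EVENT_ID)
		cfg->mbm_total_cfg = val;
	else if (evtid == QOS_L3_MBM_LOCAL_EVENT_ID)
		cfg->mbm_local_cfg = val;
	else
		return -1;

	return 0;
}
```

Under these assumptions the defaults work out to 0x7f for the total
event and 0x15 for the local event.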

Signed-off-by: Babu Moger <[email protected]>
---
v2: No changes.
---
arch/x86/kernel/cpu/resctrl/core.c | 2 ++
arch/x86/kernel/cpu/resctrl/internal.h | 3 +++
arch/x86/kernel/cpu/resctrl/monitor.c | 11 +++++++++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 11 +++++++++++
4 files changed, 27 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index a38609c82b9e..e0ba43387afe 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -558,6 +558,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;
}

+ arch_domain_mbm_evt_config(hw_dom);
+
list_add_tail(&d->list, add_pos);

err = resctrl_online_domain(r, d);
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 130854dc8b7f..e109c0388762 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -349,6 +349,8 @@ struct rdt_hw_domain {
u32 *ctrl_val;
struct arch_mbm_state *arch_mbm_total;
struct arch_mbm_state *arch_mbm_local;
+ u32 mbm_total_cfg;
+ u32 mbm_local_cfg;
};

static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
@@ -611,5 +613,6 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
void __init resctrl_file_fflags_init(const char *config,
unsigned long fflags);
void rdt_staged_configs_clear(void);
+void arch_domain_mbm_evt_config(struct rdt_hw_domain *hw_dom);

#endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index de43be2252cc..ec480015980c 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -854,3 +854,14 @@ void __init intel_rdt_mbm_apply_quirk(void)
mbm_cf_rmidthreshold = mbm_cf_table[cf_index].rmidthreshold;
mbm_cf = mbm_cf_table[cf_index].cf;
}
+
+void arch_domain_mbm_evt_config(struct rdt_hw_domain *hw_dom)
+{
+ if (mbm_total_event.configurable)
+ hw_dom->mbm_total_cfg = MAX_EVT_CONFIG_BITS;
+
+ if (mbm_local_event.configurable)
+ hw_dom->mbm_local_cfg = READS_TO_LOCAL_MEM |
+ NON_TEMP_WRITE_TO_LOCAL_MEM |
+ READS_TO_LOCAL_S_MEM;
+}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 19b0ebf4f435..df8d2390fc69 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1719,6 +1719,7 @@ static void mon_event_config_write(void *info)
static int mbm_config_write_domain(struct rdt_resource *r,
struct rdt_domain *d, u32 evtid, u32 val)
{
+ struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
struct mon_config_info mon_info = {0};
int ret = 0;

@@ -1748,6 +1749,16 @@ static int mbm_config_write_domain(struct rdt_resource *r,
smp_call_function_any(&d->cpu_mask, mon_event_config_write,
&mon_info, 1);

+ /*
+ * Update event config value in the domain when user changes it.
+ */
+ if (evtid == QOS_L3_MBM_TOTAL_EVENT_ID)
+ hw_dom->mbm_total_cfg = val;
+ else if (evtid == QOS_L3_MBM_LOCAL_EVENT_ID)
+ hw_dom->mbm_local_cfg = val;
+ else
+ goto out;
+
/*
* When an Event Configuration is changed, the bandwidth counters
* for all RMIDs and Events will be cleared by the hardware. The
--
2.34.1


2024-01-19 18:27:33

by Babu Moger

Subject: [PATCH v2 08/17] x86/resctrl: Introduce the interface to display ABMC state

The ABMC feature lets the user assign an RMID to a hardware counter and
monitor the bandwidth for a longer duration. The system can be in only
one mode at a time (Legacy Monitor mode or ABMC mode). By default, ABMC
mode is disabled.

Provide an interface to display the monitor mode on the system.
$cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
0

Signed-off-by: Babu Moger <[email protected]>
---
v2: This is a new patch to display ABMC mode.
---
Documentation/arch/x86/resctrl.rst | 6 ++++++
arch/x86/kernel/cpu/resctrl/monitor.c | 5 ++++-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 17 +++++++++++++++++
3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 73eeb50fd0b5..f94a4d314690 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -268,6 +268,12 @@ with the following files:
Available when ABMC feature is supported. The number of assignable bandwidth
monitoring counters available.

+"mbm_assign_enable":
+ Available when ABMC feature is supported. System supports RMID counter
+ assignment for monitoring. Feature provides an option to assign the RMID
+ to the hardware counter and monitor the bandwidth for a longer duration.
+ The assigned RMID will be active until the user unassigns it.
+
"max_threshold_occupancy":
Read/write file provides the largest value (in
bytes) at which a previously used LLC_occupancy
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index fa492ea820f0..a45084e30738 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -824,9 +824,12 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
}

- if (rdt_cpu_has(X86_FEATURE_ABMC))
+ if (rdt_cpu_has(X86_FEATURE_ABMC)) {
resctrl_file_fflags_init("mbm_assignable_counters",
RFTYPE_MON_INFO);
+ resctrl_file_fflags_init("mbm_assign_enable",
+ RFTYPE_MON_INFO);
+ }
}

l3_mon_evt_init(r);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 2fb26227cbec..4f160dbf6376 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -822,6 +822,17 @@ static int rdtgroup_mbm_assignable_counters_show(struct kernfs_open_file *of,
return 0;
}

+static int rdtgroup_mbm_assign_enable_show(struct kernfs_open_file *of,
+ struct seq_file *s, void *v)
+{
+ struct rdt_resource *r = of->kn->parent->priv;
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+
+ seq_printf(s, "%d\n", hw_res->abmc_enabled);
+
+ return 0;
+}
+
#ifdef CONFIG_PROC_CPU_RESCTRL

/*
@@ -1878,6 +1889,12 @@ static struct rftype res_common_files[] = {
.kf_ops = &rdtgroup_kf_single_ops,
.seq_show = rdtgroup_mbm_assignable_counters_show,
},
+ {
+ .name = "mbm_assign_enable",
+ .mode = 0444,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .seq_show = rdtgroup_mbm_assign_enable_show,
+ },
{
.name = "cpus",
.mode = 0644,
--
2.34.1


2024-01-19 18:29:18

by Babu Moger

Subject: [PATCH v2 12/17] x86/resctrl: Initialize assignable counters bitmap

AMD hardware provides a set of ABMC counters when the feature is supported.
An RMID is assigned to one of these hardware counters to monitor a group.

Introduce the bitmap assignable_counter_free_map to allocate and free
counters.
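
The idea of a free map can be sketched in user space as follows. This is
an illustration only, with hypothetical names, and it is not the patch's
code: it uses 64-bit shifts throughout and handles the 64-counter case
explicitly, since shifting a 64-bit value by 64 is undefined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Bit n set means counter n is free. */
static uint64_t free_map;

/* Mark the first @n counters (at most 64) as free. */
static void counters_init(unsigned int n)
{
	if (n > 64)
		n = 64;

	free_map = (n == 64) ? UINT64_MAX : ((1ULL << n) - 1);
}

/* Return the lowest free counter ID, or -1 if none is left. */
static int counters_alloc(void)
{
	int i;

	for (i = 0; i < 64; i++) {
		if (free_map & (1ULL << i)) {
			free_map &= ~(1ULL << i);
			return i;
		}
	}

	return -1;
}

/* Return counter @id to the free map. */
static void counters_free(int id)
{
	assert(id >= 0 && id < 64);
	free_map |= 1ULL << id;
}
```

With two counters, two allocations return 0 and 1, a third fails, and
freeing counter 0 makes it the next one handed out.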

Signed-off-by: Babu Moger <[email protected]>
---
v2: Changed the bitmap name to assignable_counter_free_map from
abmc_counter_free_map.
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 7cae6ac13954..19b0ebf4f435 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -164,6 +164,22 @@ static bool closid_allocated(unsigned int closid)
return (closid_free_map & (1 << closid)) == 0;
}

+static u64 assignable_counter_free_map;
+static u32 assignable_counter_free_map_len;
+
+static void assignable_counters_init(void)
+{
+ struct rdt_hw_resource *hw_res = &rdt_resources_all[RDT_RESOURCE_L3];
+
+ if (hw_res->mbm_assignable_counters > 64) {
+ hw_res->mbm_assignable_counters = 64;
+ WARN(1, "Cannot support more than 64 Assignable counters\n");
+ }
+
+ assignable_counter_free_map = BIT_MASK(hw_res->mbm_assignable_counters) - 1;
+ assignable_counter_free_map_len = hw_res->mbm_assignable_counters;
+}
+
/**
* rdtgroup_mode_by_closid - Return mode of resource group with closid
* @closid: closid if the resource group
@@ -2777,6 +2793,10 @@ static int rdt_get_tree(struct fs_context *fc)

closid_init();

+ r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ if (r->mbm_assign_capable)
+ assignable_counters_init();
+
if (rdt_mon_capable)
flags |= RFTYPE_MON;

@@ -2821,7 +2841,6 @@ static int rdt_get_tree(struct fs_context *fc)
static_branch_enable_cpuslocked(&rdt_enable_key);

if (is_mbm_enabled()) {
- r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
list_for_each_entry(dom, &r->domains, list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
}
--
2.34.1


2024-01-19 18:31:35

by Babu Moger

Subject: [PATCH v2 17/17] x86/resctrl: Update RMID assignments on event configuration changes

When the ABMC (Assignable Bandwidth Monitoring Counters) feature is
enabled, bandwidth events can be read using either of the following methods.

1. The contents of a specific counter can be read by setting the following
fields in QM_EVTSEL: [ExtendedEvtID]=1, [EvtID]=L3CacheABMC and setting
[RMID] to the desired counter ID. Reading QM_CTR will then return the
contents of the specified counter. The E bit will be set if the counter
configuration was invalid, or if an invalid counter ID was set in the
QM_EVTSEL[RMID] field. The rmid_read interface (resctrl_arch_rmid_read)
does not support this method currently. Supporting this method will
require changes in rmid_read interface.

2. Alternatively, the contents of a counter may be read by specifying an
RMID and setting the [EvtID] to L3BWMonEvtn where n= {0,1}. If an
assignable bandwidth counter is monitoring that RMID with a BwType bitmask
that matches a QOS_EVT_CFG_n, that counter’s value will be returned when
reading QM_CTR. However, if multiple counters have the same configuration,
QM_CTR will return the value of the counter with the lowest CtrID.

Method 2 can be supported without any changes to the rmid_read interface.
However, this requires the contents of the QOS_EVT_CFG_[0,1] MSRs to match
the BwType used when assigning the total and local events, respectively.
So, whenever the event configuration changes, the ABMC assignment needs to
be updated to match it.
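
The resynchronization step can be sketched in user space as below. The
struct and function names are hypothetical, and the sketch only models
the bookkeeping: where it stores the new BwType, the kernel would
reprogram the counter through L3_QOS_ABMC_CFG.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same bit positions as in this series. */
#define TOTAL_ASSIGN (1u << 0)
#define LOCAL_ASSIGN (1u << 1)

/* Hypothetical model of a monitor group's assignment state. */
struct mon_group {
	unsigned int monitor_state;
	uint32_t bw_type[2];	/* index 0: total, index 1: local */
};

/*
 * After an event configuration change, push @new_cfg to every group
 * that has the event selected by @state_bit assigned. Returns the
 * number of groups that were updated.
 */
static size_t resync_bw_type(struct mon_group *grps, size_t n,
			     unsigned int state_bit, uint32_t new_cfg)
{
	size_t i, updated = 0;
	int index;

	assert(state_bit == TOTAL_ASSIGN || state_bit == LOCAL_ASSIGN);
	index = (state_bit == TOTAL_ASSIGN) ? 0 : 1;

	for (i = 0; i < n; i++) {
		if (!(grps[i].monitor_state & state_bit))
			continue;	/* event not assigned, nothing to do */

		grps[i].bw_type[index] = new_cfg;
		updated++;
	}

	return updated;
}
```

Groups without an assignment for the changed event are skipped, so their
state is left as it was.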

The feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).

Signed-off-by: Babu Moger <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
---
v2: Minor text changes.
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 40 ++++++++++++++++++++++++++
1 file changed, 40 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 869fab878087..91a20e601ffd 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1828,6 +1828,38 @@ static ssize_t rdtgroup_monitor_state_write(struct kernfs_open_file *of,
return ret ?: nbytes;
}

+static void rdtgroup_update_abmc(struct rdt_resource *r,
+ struct rdt_domain *d, u32 evtid)
+{
+ struct rdtgroup *prgrp, *crgrp;
+ int index, mon_state;
+
+ if (evtid == QOS_L3_MBM_TOTAL_EVENT_ID)
+ mon_state = TOTAL_ASSIGN;
+ else
+ mon_state = LOCAL_ASSIGN;
+
+ index = mon_event_config_index_get(evtid);
+ if (index == INVALID_CONFIG_INDEX) {
+ pr_warn_once("Invalid event id %d\n", evtid);
+ return;
+ }
+
+ /*
+ * Update the assignment for all the monitor groups if the group
+ * is configured with ABMC assignment.
+ */
+ list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
+ if (prgrp->mon.monitor_state & mon_state)
+ rdtgroup_abmc_domain(d, prgrp, evtid, index, 1);
+
+ list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) {
+ if (crgrp->mon.monitor_state & mon_state)
+ rdtgroup_abmc_domain(d, crgrp, evtid, index, 1);
+ }
+ }
+}
+
static void mon_event_config_read(void *info)
{
struct mon_config_info *mon_info = info;
@@ -1912,6 +1944,7 @@ static void mon_event_config_write(void *info)
static int mbm_config_write_domain(struct rdt_resource *r,
struct rdt_domain *d, u32 evtid, u32 val)
{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
struct mon_config_info mon_info = {0};
int ret = 0;
@@ -1952,6 +1985,13 @@ static int mbm_config_write_domain(struct rdt_resource *r,
else
goto out;

+ /*
+ * Event configuration changed for the domain, so update
+ * the ABMC assignment.
+ */
+ if (hw_res->abmc_enabled)
+ rdtgroup_update_abmc(r, d, evtid);
+
/*
* When an Event Configuration is changed, the bandwidth counters
* for all RMIDs and Events will be cleared by the hardware. The
--
2.34.1


2024-01-19 18:33:16

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

+James

On 1/19/2024 10:22 AM, Babu Moger wrote:
>
> f. This series is still work in progress. I am yet to hear from ARM developers.

Please at least include James in your submissions to make Arm aware of this work.

Reinette

2024-01-19 20:36:07

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)


On 1/19/2024 12:32 PM, Reinette Chatre wrote:
> +James
>
> On 1/19/2024 10:22 AM, Babu Moger wrote:
>>
>> f. This series is still work in progress. I am yet to hear from ARM developers.
> Please at least include James in your submissions to make Arm aware of this work.

Thank you

Babu Moger


2024-01-23 10:39:28

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] x86/resctrl: Remove hard-coded memory bandwidth limit

On Mon, Jan 15, 2024 at 04:52:27PM -0600, Babu Moger wrote:
> Fixes: 4d05bf71f157 ("x86/resctrl: Introduce AMD QOS feature")

What's the point of this Fixes tag? You want this backported to stable?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-01-23 15:07:13

by Babu Moger

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] x86/resctrl: Remove hard-coded memory bandwidth limit

Hi Boris,

On 1/23/24 04:36, Borislav Petkov wrote:
> On Mon, Jan 15, 2024 at 04:52:27PM -0600, Babu Moger wrote:
>> Fixes: 4d05bf71f157 ("x86/resctrl: Introduce AMD QOS feature")
>
> What's the point of this Fixes tag? You want this backported to stable?
>
Yes. That is the intention. This applies to both these patches.
--
Thanks
Babu Moger

2024-01-23 15:44:31

by tip-bot2 for Tony Luck

[permalink] [raw]
Subject: [tip: x86/cache] x86/resctrl: Read supported bandwidth sources from CPUID

The following commit has been merged into the x86/cache branch of tip:

Commit-ID: 54e35eb8611cce5550d3d7689679b1a91c864f28
Gitweb: https://git.kernel.org/tip/54e35eb8611cce5550d3d7689679b1a91c864f28
Author: Babu Moger <[email protected]>
AuthorDate: Mon, 15 Jan 2024 16:52:28 -06:00
Committer: Borislav Petkov (AMD) <[email protected]>
CommitterDate: Tue, 23 Jan 2024 16:26:42 +01:00

x86/resctrl: Read supported bandwidth sources from CPUID

If the BMEC (Bandwidth Monitoring Event Configuration) feature is
supported, the bandwidth events can be configured. The maximum supported
bandwidth bitmask can be read from CPUID:

CPUID_Fn80000020_ECX_x03 [Platform QoS Monitoring Bandwidth Event Configuration]
Bits Description
31:7 Reserved
6:0 Identifies the bandwidth sources that can be tracked.

While at it, move the mask checking to mon_config_write() before
iterating over all the domains. Also, print the valid bitmask when the
user tries to configure invalid event configuration value.

The CPUID details are documented in the Processor Programming Reference
(PPR) Vol 1.1 for AMD Family 19h Model 11h B1 - 55901 Rev 0.25 in the
Link tag.

Fixes: dc2a3e857981 ("x86/resctrl: Add interface to read mbm_total_bytes_config")
Signed-off-by: Babu Moger <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Link: https://lore.kernel.org/r/669896fa512c7451319fa5ca2fdb6f7e015b5635.1705359148.git.babu.moger@amd.com
---
arch/x86/kernel/cpu/resctrl/internal.h | 3 +++
arch/x86/kernel/cpu/resctrl/monitor.c | 6 ++++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 14 ++++++++------
3 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index d297974..e3dc35a 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -394,6 +394,8 @@ struct rdt_parse_data {
* @msr_update: Function pointer to update QOS MSRs
* @mon_scale: cqm counter * mon_scale = occupancy in bytes
* @mbm_width: Monitor width, to detect and correct for overflow.
+ * @mbm_cfg_mask: Bandwidth sources that can be tracked when Bandwidth
+ * Monitoring Event Configuration (BMEC) is supported.
* @cdp_enabled: CDP state of this resource
*
* Members of this structure are either private to the architecture
@@ -408,6 +410,7 @@ struct rdt_hw_resource {
struct rdt_resource *r);
unsigned int mon_scale;
unsigned int mbm_width;
+ unsigned int mbm_cfg_mask;
bool cdp_enabled;
};

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac0..acca577 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -813,6 +813,12 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
return ret;

if (rdt_cpu_has(X86_FEATURE_BMEC)) {
+ u32 eax, ebx, ecx, edx;
+
+ /* Detect list of bandwidth sources that can be tracked */
+ cpuid_count(0x80000020, 3, &eax, &ebx, &ecx, &edx);
+ hw_res->mbm_cfg_mask = ecx & MAX_EVT_CONFIG_BITS;
+
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
mbm_total_event.configurable = true;
mbm_config_rftype_init("mbm_total_bytes_config");
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 69a1de9..2b69e56 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1620,12 +1620,6 @@ static int mbm_config_write_domain(struct rdt_resource *r,
struct mon_config_info mon_info = {0};
int ret = 0;

- /* mon_config cannot be more than the supported set of events */
- if (val > MAX_EVT_CONFIG_BITS) {
- rdt_last_cmd_puts("Invalid event configuration\n");
- return -EINVAL;
- }
-
/*
* Read the current config value first. If both are the same then
* no need to write it again.
@@ -1663,6 +1657,7 @@ out:

static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
char *dom_str = NULL, *id_str;
unsigned long dom_id, val;
struct rdt_domain *d;
@@ -1686,6 +1681,13 @@ next:
return -EINVAL;
}

+ /* Value from user cannot be more than the supported set of events */
+ if ((val & hw_res->mbm_cfg_mask) != val) {
+ rdt_last_cmd_printf("Invalid event configuration: max valid mask is 0x%02x\n",
+ hw_res->mbm_cfg_mask);
+ return -EINVAL;
+ }
+
list_for_each_entry(d, &r->domains, list) {
if (d->id == dom_id) {
ret = mbm_config_write_domain(r, d, evtid, val);
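
The difference between the removed range check (val > MAX_EVT_CONFIG_BITS)
and the new subset check ((val & mask) != val) can be illustrated with a
minimal user-space sketch. The helper names below are hypothetical and not
part of the patch, and the mask value 0x5f used in the usage note is an
assumed example, not a value reported by any particular processor.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when every bit set in val is also set in
 * mask, i.e. the request only names supported bandwidth sources.
 * This mirrors the new "(val & mask) != val" rejection test.
 */
static bool evt_config_is_subset(uint32_t val, uint32_t mask)
{
	return (val & mask) == val;
}

/*
 * Hypothetical helper modelling the removed check, which only compared
 * the value against a numeric maximum and so could not detect a single
 * unsupported bit below that maximum.
 */
static bool evt_config_below_max(uint32_t val, uint32_t max)
{
	return val <= max;
}
```

With an assumed supported mask of 0x5f (bit 5 unsupported), the value 0x20
passes evt_config_below_max(0x20, 0x7f) but fails
evt_config_is_subset(0x20, 0x5f), which is the case the CPUID-derived mask
is meant to catch.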

2024-01-23 16:09:47

by tip-bot2 for Tony Luck

[permalink] [raw]
Subject: [tip: x86/cache] x86/resctrl: Remove hard-coded memory bandwidth limit

The following commit has been merged into the x86/cache branch of tip:

Commit-ID: 0976783bb123f30981bc1e7a14d9626a6f63aeac
Gitweb: https://git.kernel.org/tip/0976783bb123f30981bc1e7a14d9626a6f63aeac
Author: Babu Moger <[email protected]>
AuthorDate: Mon, 15 Jan 2024 16:52:27 -06:00
Committer: Borislav Petkov (AMD) <[email protected]>
CommitterDate: Tue, 23 Jan 2024 16:22:51 +01:00

x86/resctrl: Remove hard-coded memory bandwidth limit

The QOS Memory Bandwidth Enforcement Limit is reported by
CPUID_Fn80000020_EAX_x01 and CPUID_Fn80000020_EAX_x02:

Bits Description
31:0 BW_LEN: Size of the QOS Memory Bandwidth Enforcement Limit.

Newer processors can support higher bandwidth limit than the current
hard-coded value. Remove latter and detect using CPUID instead. Also,
update the register variables eax and edx to match the AMD CPUID
definition.

The CPUID details are documented in the Processor Programming Reference
(PPR) Vol 1.1 for AMD Family 19h Model 11h B1 - 55901 Rev 0.25 in the
Link tag below.

Fixes: 4d05bf71f157 ("x86/resctrl: Introduce AMD QOS feature")
Signed-off-by: Babu Moger <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Link: https://lore.kernel.org/r/c26a8ca79d399ed076cf8bf2e9fbc58048808289.1705359148.git.babu.moger@amd.com
---
arch/x86/kernel/cpu/resctrl/core.c | 10 ++++------
arch/x86/kernel/cpu/resctrl/internal.h | 1 -
2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index d29ebe3..aa9810a 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -231,9 +231,7 @@ static bool __get_mem_config_intel(struct rdt_resource *r)
static bool __rdt_get_mem_config_amd(struct rdt_resource *r)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- union cpuid_0x10_3_eax eax;
- union cpuid_0x10_x_edx edx;
- u32 ebx, ecx, subleaf;
+ u32 eax, ebx, ecx, edx, subleaf;

/*
* Query CPUID_Fn80000020_EDX_x01 for MBA and
@@ -241,9 +239,9 @@ static bool __rdt_get_mem_config_amd(struct rdt_resource *r)
*/
subleaf = (r->rid == RDT_RESOURCE_SMBA) ? 2 : 1;

- cpuid_count(0x80000020, subleaf, &eax.full, &ebx, &ecx, &edx.full);
- hw_res->num_closid = edx.split.cos_max + 1;
- r->default_ctrl = MAX_MBA_BW_AMD;
+ cpuid_count(0x80000020, subleaf, &eax, &ebx, &ecx, &edx);
+ hw_res->num_closid = edx + 1;
+ r->default_ctrl = 1 << eax;

/* AMD does not use delay */
r->membw.delay_linear = false;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa1..d297974 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -18,7 +18,6 @@
#define MBM_OVERFLOW_INTERVAL 1000
#define MAX_MBA_BW 100u
#define MBA_IS_LINEAR 0x4
-#define MAX_MBA_BW_AMD 0x800
#define MBM_CNTR_WIDTH_OFFSET_AMD 20

#define RMID_VAL_ERROR BIT_ULL(63)
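
As a worked example of the new "r->default_ctrl = 1 << eax" computation,
the sketch below derives the limit from a BW_LEN value in user space. The
helper name is hypothetical, and returning 0 for an out-of-range length is
an assumption made only to keep the shift well defined in C; the patch
itself does not include such a guard.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring "r->default_ctrl = 1 << eax", where
 * bw_len stands for the BW_LEN field read from CPUID.
 */
static uint32_t mba_default_ctrl(uint32_t bw_len)
{
	/* Shifting a 32-bit value by 32 or more is undefined in C. */
	if (bw_len >= 32)
		return 0;

	return UINT32_C(1) << bw_len;
}
```

A BW_LEN of 11 gives 1 << 11 = 0x800, which equals the removed
MAX_MBA_BW_AMD constant; a processor reporting a larger BW_LEN would get a
correspondingly higher limit.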

2024-02-02 04:09:34

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On 1/19/2024 10:22 AM, Babu Moger wrote:
>
> These series adds the support for Assignable Bandwidth Monitoring Counters

Not a good start ([1]).

> (ABMC). It is also called QoS RMID Pinning feature
>
> The feature details are documented in the APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> Monitoring (ABMC). The documentation is available at
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>
> The patches are based on top of commit
> 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master)
>
> # Introduction
>
> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
> feature only guarantees that RMIDs currently assigned to a processor will
> be tracked by hardware. The counters of any other RMIDs which are no longer
> being tracked will be reset to zero. The MBM event counters return
> "Unavailable" for the RMIDs that are not active.
>
> Users can create 256 or more monitor groups. But there can be only limited
> number of groups that can be give guaranteed monitoring numbers. With ever

"can be given"?

> changing configurations there is no way to definitely know which of these
> groups will be active for certain point of time. Users do not have the
> option to monitor a group or set of groups for certain period of time
> without worrying about RMID being reset in between.
>
> The ABMC feature provides an option to the user to assign an RMID to the
> hardware counter and monitor the bandwidth for a longer duration.
> The assigned RMID will be active until the user unassigns it manually.
> There is no need to worry about counters being reset during this period.
> Additionally, the user can specify a bitmask identifying the specific
> bandwidth types from the given source to track with the counter.
>
> Without ABMC enabled, monitoring will work in current mode without
> assignment option.
>
> # Linux Implementation
>
> Linux resctrl subsystem provides the interface to count maximum of two
> memory bandwidth events per group, from a combination of available total
> and local events. Keeping the current interface, users can assign a maximum
> of 2 ABMC counters per group. User will also have the option to assign only
> one counter to the group. If the system runs out of assignable ABMC
> counters, kernel will display an error. Users need to unassign an already
> assigned counter to make space for new assignments.
>
>
> # Examples
>
> a. Check if ABMC support is available
> #mount -t resctrl resctrl /sys/fs/resctrl/
>
> #cat /sys/fs/resctrl/info/L3_MON/mon_features
> llc_occupancy
> mbm_total_bytes
> mbm_total_bytes_config
> mbm_local_bytes
> mbm_local_bytes_config
> mbm_assign_capable ← Linux kernel detected ABMC feature
>
> b. Check if ABMC is enabled. By default, ABMC feature is disabled.
> Monitoring works in legacy monitor mode when ABMC is not enabled.
>
> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
> 0
>

With the introduction of "mbm_assign_enable" the entry in mon_features seems
to provide duplicate information.

> c. There will be new file "monitor_state" for each monitor group when ABMC
> feature is supported. However, monitor_state is not available if ABMC is
> disabled.
>
> #cat /sys/fs/resctrl/monitor_state
> Unsupported

This sounds potentially confusing since users will still be able to monitor
the groups ...

>
> d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC
> enabled, monitoring will work in current mode without assignment option.
>
> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> 779247936
> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> 765207488
>
> e. Enable ABMC mode.
>
> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
> 1
>
> f. Read the monitor states. By default, both total and local MBM
> events are in "unassign" state.
>
> #cat /sys/fs/resctrl/monitor_state
> total=unassign;local=unassign

This interface does not seem to take into account that hardware
can support assignment per domain. I understand that this is
not something you want to implement at this time but the user interface
has to accommodate such an enhancement. This was already mentioned, and
you did acknowledge the point [3], so a new version that does not
reflect this is unexpected.

My previous suggestions do seem to still stand and I also am not able to
see how Peter's requests [2] were considered. This same interface needs to
accommodate usages apart from ABMC. For example, how to use this interface
to address the same counter issue on AMD hardware without ABMC, and MPAM
(pending James's feedback).

I understand that until we hear from Arm we do not know all the requirements
that this interface needs to support, but I do expect this interface to
at least consider requirements and usage scenarios that are already known.

> g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode,
> the MBA events are not available until the user assigns the events
> explicitly.
>
> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> Unsupported
>
> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> Unsupported
>

This needs some more thought to accommodate Peter's scenario where the counter
can be expected to return the final count after the counter is disabled.

> h. The event llc_occupancy is not affected by ABMC mode. Users can still
> read the llc_occupancy.
>
> #cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy
> 557056
>
> i. Now assign the total event and read the monitor_state.
>
> #echo total=assign > /sys/fs/resctrl/monitor_state
> #cat /sys/fs/resctrl/monitor_state
> total=assign;local=unassign
>

I do not see the "global assign/unassign" scenario addressed.

This version seems to ignore (without discussion) a lot of earlier
feedback.

Reinette

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/
[3] https://lore.kernel.org/lkml/[email protected]/

2024-02-02 05:04:38

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

+James

On 2/1/2024 8:09 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 1/19/2024 10:22 AM, Babu Moger wrote:
>>
>> These series adds the support for Assignable Bandwidth Monitoring Counters
>
> Not a good start ([1]).
>
>> (ABMC). It is also called QoS RMID Pinning feature
>>
>> The feature details are documented in the APM listed below [1].
>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>> Monitoring (ABMC). The documentation is available at
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>
>> The patches are based on top of commit
>> 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master)
>>
>> # Introduction
>>
>> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
>> feature only guarantees that RMIDs currently assigned to a processor will
>> be tracked by hardware. The counters of any other RMIDs which are no longer
>> being tracked will be reset to zero. The MBM event counters return
>> "Unavailable" for the RMIDs that are not active.
>>
>> Users can create 256 or more monitor groups. But there can be only limited
>> number of groups that can be give guaranteed monitoring numbers. With ever
>
> "can be given"?
>
>> changing configurations there is no way to definitely know which of these
>> groups will be active for certain point of time. Users do not have the
>> option to monitor a group or set of groups for certain period of time
>> without worrying about RMID being reset in between.
>>
>> The ABMC feature provides an option to the user to assign an RMID to the
>> hardware counter and monitor the bandwidth for a longer duration.
>> The assigned RMID will be active until the user unassigns it manually.
>> There is no need to worry about counters being reset during this period.
>> Additionally, the user can specify a bitmask identifying the specific
>> bandwidth types from the given source to track with the counter.
>>
>> Without ABMC enabled, monitoring will work in current mode without
>> assignment option.
>>
>> # Linux Implementation
>>
>> Linux resctrl subsystem provides the interface to count maximum of two
>> memory bandwidth events per group, from a combination of available total
>> and local events. Keeping the current interface, users can assign a maximum
>> of 2 ABMC counters per group. User will also have the option to assign only
>> one counter to the group. If the system runs out of assignable ABMC
>> counters, kernel will display an error. Users need to unassign an already
>> assigned counter to make space for new assignments.
>>
>>
>> # Examples
>>
>> a. Check if ABMC support is available
>> #mount -t resctrl resctrl /sys/fs/resctrl/
>>
>> #cat /sys/fs/resctrl/info/L3_MON/mon_features
>> llc_occupancy
>> mbm_total_bytes
>> mbm_total_bytes_config
>> mbm_local_bytes
>> mbm_local_bytes_config
>> mbm_assign_capable ← Linux kernel detected ABMC feature
>>
>> b. Check if ABMC is enabled. By default, ABMC feature is disabled.
>> Monitoring works in legacy monitor mode when ABMC is not enabled.
>>
>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>> 0
>>
>
> With the introduction of "mbm_assign_enable" the entry in mon_features seems
> to provide duplicate information.
>
>> c. There will be new file "monitor_state" for each monitor group when ABMC
>> feature is supported. However, monitor_state is not available if ABMC is
>> disabled.
>>
>> #cat /sys/fs/resctrl/monitor_state
>> Unsupported
>
> This sounds potentially confusing since users will still be able to monitor
> the groups ...
>
>>
>> d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC
>> enabled, monitoring will work in current mode without assignment option.
>>
>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> 779247936
>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>> 765207488
>>
>> e. Enable ABMC mode.
>>
>> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>> 1
>>
>> f. Read the monitor states. By default, both total and local MBM
>> events are in "unassign" state.
>>
>> #cat /sys/fs/resctrl/monitor_state
>> total=unassign;local=unassign
>
> This interface does not seem to take into account that hardware
> can support assignment per domain. I understand that this is
> not something you want to implement at this time but the user interface
> has to accommodate such an enhancement. This was already mentioned, and
> you did acknowledge the point [3] to this new version that does not
> reflect this is unexpected.
>
> My previous suggestions do seem to still stand and and I also am not able to
> see how Peter's requests [2] were considered. This same interface needs to
> accommodate usages apart from ABMC. For example, how to use this interface
> to address the same counter issue on AMD hardware without ABMC, and MPAM
> (pending James's feedback).
>
> I understand that until we hear from Arm we do not know all the requirements
> that this interface needs to support, but I do expect this interface to
> at least consider requirements and usage scenarios that are already known.
>
>> g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode,
>> the MBA events are not available until the user assigns the events
>> explicitly.
>>
>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> Unsupported
>>
>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>> Unsupported
>>
>
> This needs some more thought to accommodate Peter's scenario where the counter
> can be expected to return the final count after the counter is disabled.
>
>> h. The event llc_occupancy is not affected by ABMC mode. Users can still
>> read the llc_occupancy.
>>
>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy
>> 557056
>>
>> i. Now assign the total event and read the monitor_state.
>>
>> #echo total=assign > /sys/fs/resctrl/monitor_state
>> #cat /sys/fs/resctrl/monitor_state
>> total=assign;local=unassign
>>
>
> I do not see the "global assign/unassign" scenario addressed.
>
> This version seems to ignore (without discussion) a lot of earlier
> feedback.
>
> Reinette
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/
> [3] https://lore.kernel.org/lkml/[email protected]/

2024-02-02 21:58:13

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Reinette,

On 2/1/2024 10:09 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 1/19/2024 10:22 AM, Babu Moger wrote:
>> These series adds the support for Assignable Bandwidth Monitoring Counters
> Not a good start ([1]).

Yea. My bad.

>
>> (ABMC). It is also called QoS RMID Pinning feature
>>
>> The feature details are documented in the APM listed below [1].
>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>> Monitoring (ABMC). The documentation is available at
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>
>> The patches are based on top of commit
>> 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master)
>>
>> # Introduction
>>
>> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
>> feature only guarantees that RMIDs currently assigned to a processor will
>> be tracked by hardware. The counters of any other RMIDs which are no longer
>> being tracked will be reset to zero. The MBM event counters return
>> "Unavailable" for the RMIDs that are not active.
>>
>> Users can create 256 or more monitor groups. But there can be only limited
>> number of groups that can be give guaranteed monitoring numbers. With ever
> "can be given"?

"can give guaranteed monitoring numbers."

I feel this looks better.

>
>> changing configurations there is no way to definitely know which of these
>> groups will be active for certain point of time. Users do not have the
>> option to monitor a group or set of groups for certain period of time
>> without worrying about RMID being reset in between.
>>
>> The ABMC feature provides an option to the user to assign an RMID to the
>> hardware counter and monitor the bandwidth for a longer duration.
>> The assigned RMID will be active until the user unassigns it manually.
>> There is no need to worry about counters being reset during this period.
>> Additionally, the user can specify a bitmask identifying the specific
>> bandwidth types from the given source to track with the counter.
>>
>> Without ABMC enabled, monitoring will work in current mode without
>> assignment option.
>>
>> # Linux Implementation
>>
>> Linux resctrl subsystem provides the interface to count maximum of two
>> memory bandwidth events per group, from a combination of available total
>> and local events. Keeping the current interface, users can assign a maximum
>> of 2 ABMC counters per group. User will also have the option to assign only
>> one counter to the group. If the system runs out of assignable ABMC
>> counters, kernel will display an error. Users need to unassign an already
>> assigned counter to make space for new assignments.
>>
>>
>> # Examples
>>
>> a. Check if ABMC support is available
>> #mount -t resctrl resctrl /sys/fs/resctrl/
>>
>> #cat /sys/fs/resctrl/info/L3_MON/mon_features
>> llc_occupancy
>> mbm_total_bytes
>> mbm_total_bytes_config
>> mbm_local_bytes
>> mbm_local_bytes_config
>> mbm_assign_capable ← Linux kernel detected ABMC feature
>>
>> b. Check if ABMC is enabled. By default, ABMC feature is disabled.
>> Monitoring works in legacy monitor mode when ABMC is not enabled.
>>
>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>> 0
>>
> With the introduction of "mbm_assign_enable" the entry in mon_features seems
> to provide duplicate information.

ok. We can remove the text in mon_features and keep mbm_assign_enable.
We need this to enable and disable the feature.

>
>> c. There will be new file "monitor_state" for each monitor group when ABMC
>> feature is supported. However, monitor_state is not available if ABMC is
>> disabled.
>>
>> #cat /sys/fs/resctrl/monitor_state
>> Unsupported
> This sounds potentially confusing since users will still be able to monitor
> the groups ...

How about "Assignment-Unsupported"?

>
>>
>> d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC
>> enabled, monitoring will work in current mode without assignment option.
>>
>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> 779247936
>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>> 765207488
>>
>> e. Enable ABMC mode.
>>
>> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>> 1
>>
>> f. Read the monitor states. By default, both total and local MBM
>> events are in "unassign" state.
>>
>> #cat /sys/fs/resctrl/monitor_state
>> total=unassign;local=unassign
> This interface does not seem to take into account that hardware
> can support assignment per domain. I understand that this is
> not something you want to implement at this time but the user interface
> has to accommodate such an enhancement. This was already mentioned, and
> you did acknowledge the point [3] to this new version that does not
> reflect this is unexpected.

Yes. Domain-level assignment is not supported at this point. Do you want
me to mention that explicitly here?

Please elaborate on what you meant here.

>
> My previous suggestions do seem to still stand and and I also am not able to
> see how Peter's requests [2] were considered. This same interface needs to
> accommodate usages apart from ABMC. For example, how to use this interface
> to address the same counter issue on AMD hardware without ABMC, and MPAM
> (pending James's feedback).

Yes, agreed. Peter's comments are not addressed. I am not at all clear about
the details of Peter's and James's requirements.

With respect to ABMC here are my requirements.

a. Assignment needs to be done at the group level.

b. The user should be able to assign each event individually. Assigning
multiple events in one command should also be supported.

c. I have no plans to implement domain-level assignment. It is done at
the system level.

d. We need only a couple of states: assigned and unassigned.

e. monitor_state is the name of the file for the user interface. We can
change that based on comments.
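
Requirements (b) and (d) can be sketched as a small user-space parser for
requests of the form "total=assign;local=unassign". This is only an
illustration of the proposed syntax, not the kernel implementation. The
function name is hypothetical, and the bit values given to TOTAL_ASSIGN and
LOCAL_ASSIGN are assumptions (only the names appear in the series).

```c
#include <stddef.h>
#include <string.h>

#define TOTAL_ASSIGN 0x1u	/* assumed bit value */
#define LOCAL_ASSIGN 0x2u	/* assumed bit value */

/*
 * Apply a request such as "total=assign;local=unassign" to *state.
 * Returns 0 on success and -1 on malformed input, in which case *state
 * is left unchanged. A trailing newline is not handled here.
 */
static int monitor_state_apply(const char *req, unsigned int *state)
{
	unsigned int new_state = *state;
	const char *p = req;

	while (*p) {
		const char *end = strchr(p, ';');
		size_t len = end ? (size_t)(end - p) : strlen(p);
		const char *eq = memchr(p, '=', len);
		size_t klen, vlen;
		unsigned int bit;

		if (!eq)
			return -1;
		klen = (size_t)(eq - p);
		vlen = len - klen - 1;

		/* Event name: "total" or "local". */
		if (klen == 5 && !memcmp(p, "total", 5))
			bit = TOTAL_ASSIGN;
		else if (klen == 5 && !memcmp(p, "local", 5))
			bit = LOCAL_ASSIGN;
		else
			return -1;

		/* Only two states are needed: assign and unassign. */
		if (vlen == 6 && !memcmp(eq + 1, "assign", 6))
			new_state |= bit;
		else if (vlen == 8 && !memcmp(eq + 1, "unassign", 8))
			new_state &= ~bit;
		else
			return -1;

		/* Step over this token and its separator, if any. */
		p += len;
		if (*p == ';')
			p++;
	}

	*state = new_state;
	return 0;
}
```

Committing the new state only after the whole request has parsed means a
multi-event command either applies completely or not at all.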

Peter, James,

Please comment on what you want to achieve in "assignment" based on the
features you are working on.

Do you want to add new states?

>
> I understand that until we hear from Arm we do not know all the requirements
> that this interface needs to support, but I do expect this interface to
> at least consider requirements and usage scenarios that are already known.

Sure. Will try that in the next version. Let's continue the discussion.


>> g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode,
>> the MBA events are not available until the user assigns the events
>> explicitly.
>>
>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> Unsupported
>>
>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>> Unsupported
>>
> This needs some more thought to accommodate Peter's scenario where the counter
> can be expected to return the final count after the counter is disabled.

I am not sure how to achieve this with ABMC. This may be applicable to
soft RMID only. In the case of "soft RMID", previous readings are saved in
the soft RMID state.

>
>> h. The event llc_occupancy is not affected by ABMC mode. Users can still
>> read the llc_occupancy.
>>
>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy
>> 557056
>>
>> i. Now assign the total event and read the monitor_state.
>>
>> #echo total=assign > /sys/fs/resctrl/monitor_state
>> #cat /sys/fs/resctrl/monitor_state
>> total=assign;local=unassign
>>
> I do not see the "global assign/unassign" scenario addressed.

I am not at all clear on the meaning of "global assign/unassign". Does it mean
looping through all the groups and assigning the RMIDs?

It may not work in many cases. In the case of ABMC, we have only a limited
number of hardware counters. It will fail after the hardware runs out of
counters. It is better done selectively, based on which group the user is
interested in.

But it can be done later if we find a use case for that.
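
The concern about running out of counters can be sketched with a minimal
bitmap pool. This is an illustration only, not code from the series: the
names are hypothetical, and the pool size of 32 is an assumption taken from
the v1 cover letter.

```c
#include <stdint.h>

#define ABMC_NUM_CNTRS 32	/* assumed pool size */

/* Hypothetical pool: bit i set means counter i is assigned. */
struct cntr_pool {
	uint32_t used;
};

/* Return the lowest free counter id, or -1 when the pool is exhausted. */
static int cntr_alloc(struct cntr_pool *pool)
{
	for (int i = 0; i < ABMC_NUM_CNTRS; i++) {
		if (!(pool->used & (UINT32_C(1) << i))) {
			pool->used |= UINT32_C(1) << i;
			return i;
		}
	}
	return -1;
}

/* Release a counter so that a later assignment can reuse it. */
static void cntr_free(struct cntr_pool *pool, int id)
{
	if (id >= 0 && id < ABMC_NUM_CNTRS)
		pool->used &= ~(UINT32_C(1) << id);
}
```

A "global assign" that loops over every group would start failing at the
33rd request in this model, whereas selective assignment lets the user free
a counter before assigning it to the group of interest.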

>
> This version seems to ignore (without discussion) a lot of earlier
> feedback.

Please feel free to comment. There are various threads of discussion, and I
may have missed some.

Thanks

Babu

>
> Reinette
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/
> [3] https://lore.kernel.org/lkml/[email protected]/

2024-02-05 22:38:55

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On 2/2/2024 1:57 PM, Moger, Babu wrote:
> On 2/1/2024 10:09 PM, Reinette Chatre wrote:
>> On 1/19/2024 10:22 AM, Babu Moger wrote:
>>> These series adds the support for Assignable Bandwidth Monitoring Counters
>> Not a good start ([1]).
>
> Yea. My bad.
>
>>
>>> (ABMC). It is also called QoS RMID Pinning feature
>>>
>>> The feature details are documented in the APM listed below [1].
>>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>>> Monitoring (ABMC). The documentation is available at
>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>>
>>> The patches are based on top of commit
>>> 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master)
>>>
>>> # Introduction
>>>
>>> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
>>> feature only guarantees that RMIDs currently assigned to a processor will
>>> be tracked by hardware. The counters of any other RMIDs which are no longer
>>> being tracked will be reset to zero. The MBM event counters return
>>> "Unavailable" for the RMIDs that are not active.
>>>
>>> Users can create 256 or more monitor groups. But there can be only limited
>>> number of groups that can be give guaranteed monitoring numbers.  With ever
>> "can be given"?
>
> "can give guaranteed monitoring numbers."
>
> I feel this looks better.

Sounds good. Thank you.

>
>>
>>> changing configurations there is no way to definitely know which of these
>>> groups will be active for certain point of time. Users do not have the
>>> option to monitor a group or set of groups for certain period of time
>>> without worrying about RMID being reset in between.
>>>
>>> The ABMC feature provides an option to the user to assign an RMID to the
>>> hardware counter and monitor the bandwidth for a longer duration.
>>> The assigned RMID will be active until the user unassigns it manually.
>>> There is no need to worry about counters being reset during this period.
>>> Additionally, the user can specify a bitmask identifying the specific
>>> bandwidth types from the given source to track with the counter.
>>>
>>> Without ABMC enabled, monitoring will work in current mode without
>>> assignment option.
>>>
>>> # Linux Implementation
>>>
>>> Linux resctrl subsystem provides the interface to count maximum of two
>>> memory bandwidth events per group, from a combination of available total
>>> and local events. Keeping the current interface, users can assign a maximum
>>> of 2 ABMC counters per group. User will also have the option to assign only
>>> one counter to the group. If the system runs out of assignable ABMC
>>> counters, kernel will display an error. Users need to unassign an already
>>> assigned counter to make space for new assignments.
>>>
>>>
>>> # Examples
>>>
>>> a. Check if ABMC support is available
>>>     #mount -t resctrl resctrl /sys/fs/resctrl/
>>>
>>>     #cat /sys/fs/resctrl/info/L3_MON/mon_features
>>>     llc_occupancy
>>>     mbm_total_bytes
>>>     mbm_total_bytes_config
>>>     mbm_local_bytes
>>>     mbm_local_bytes_config
>>>     mbm_assign_capable ←  Linux kernel detected ABMC feature
>>>
>>> b. Check if ABMC is enabled. By default, ABMC feature is disabled.
>>>     Monitoring works in legacy monitor mode when ABMC is not enabled.
>>>
>>>     #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>     0
>>>
>> With the introduction of "mbm_assign_enable" the entry in mon_features seems
>> to provide duplicate information.
>
> ok. We can remove the text in mon_features and keep mbm_assign_enable. We need this to enable and disable the feature.

This could be improved beyond a binary "enable"/"disable" interface to user space.
For example, the hardware can discover which "mbm counter assign" related feature
(I'm counting the "soft RMID" here as one of the "mbm counter assign" related
features) is supported on the platform and it can be presented to the user like:

# cat /sys/fs/resctrl/info/L3_MON/mbm_assign
[feature_1] feature_2 feature_3

The output indicates which features are supported by the platform and the brackets indicate
which feature is enabled.
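
To illustrate how user space might consume the bracketed format suggested above, here is a minimal sketch. It assumes the file holds a single whitespace-separated line in which at most one token is wrapped in brackets. The function name and the feature names are placeholders for illustration only, not part of any agreed interface.

```python
def parse_mbm_assign(text):
    """Split '[feature_1] feature_2 ...' into (supported, enabled).

    'supported' lists every feature name in file order; 'enabled' is the
    bracketed name, or None when no token is bracketed.
    """
    supported = []
    enabled = None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # strip the brackets marking the active feature
            enabled = token
        supported.append(token)
    return supported, enabled


# Placeholder content, as it might be read from info/L3_MON/mbm_assign.
supported, enabled = parse_mbm_assign("[feature_1] feature_2 feature_3\n")
print("supported:", supported)
print("enabled:", enabled)
```

A single-line format like this lets a tool learn both capability and current state with one read.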


>>> c. There will be new file "monitor_state" for each monitor group when ABMC
>>>     feature is supported. However, monitor_state is not available if ABMC is
>>>     disabled.
>>>     
>>>     #cat /sys/fs/resctrl/monitor_state
>>>     Unsupported
>> This sounds potentially confusing since users will still be able to monitor
>> the groups ...
> How about "Assignment-Unsupported"?

(please see later)

>>
>>>     
>>> d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC
>>>     enabled, monitoring will work in current mode without assignment option.
>>>     
>>>     # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>     779247936
>>>     # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>     765207488
>>>     
>>> e. Enable ABMC mode.
>>>
>>>     #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>     #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>     1
>>>
>>> f. Read the monitor states. By default, both total and local MBM
>>>     events are in "unassign" state.
>>>     
>>>     #cat /sys/fs/resctrl/monitor_state
>>>     total=unassign;local=unassign
>> This interface does not seem to take into account that hardware
>> can support assignment per domain. I understand that this is
>> not something you want to implement at this time but the user interface
>> has to accommodate such an enhancement. This was already mentioned, and
>> you did acknowledge the point [3] to this new version that does not
>> reflect this is unexpected.
>
> Yea. Domain level assignment is not supported at this point. Do you want me to explicitly mention here?
>
> Please elaborate what you meant here.

You have made it clear on several occasions that you do not intend to support
domain level assignment. That may be ok but the interface you create should
not prevent future support of domain level assignment.

If my point is not clear, could you please share how this interface is able to
support domain level assignment in the future?

I am starting to think that we need a file similar to the schemata file
for group and domain level monitor configurations.

>> My previous suggestions do seem to still stand and and I also am not able to
>> see how Peter's requests [2] were considered. This same interface needs to
>> accommodate usages apart from ABMC. For example, how to use this interface
>> to address the same counter issue on AMD hardware without ABMC, and MPAM
>> (pending James's feedback).
>
> Yea. Agree. Peter's comments are not addressed. I am not all clear
> about details of Peters and James requirement.

Peter listed his requirements in [1]. That email thread is a worthwhile read
for the use cases.

I believe that James is aware of this work and do hope to hear from him.

>
> With respect to ABMC here are my requirements.
>
> a.  Assignment needs to be done at group level.
>
> b. User should be able to assign each event individually. Multiple events assignment(in one command) should be supported.
>
> c. I have no plans to implement domain level assignment. It is done at system level.
>
> d. We need only couple of states.  Assigned and unassigned.
>
> e. monitor_state is name of file for user interface. We can change that based on comments.
>
> Peter, James,
>
> Please comment on what you want achieve in "assignment" based on the features you are working on.
>
> Do you want to add new states?
>
>>
>> I understand that until we hear from Arm we do not know all the requirements
>> that this interface needs to support, but I do expect this interface to
>> at least consider requirements and usage scenarios that are already known.
>
> Sure. Will try that in the next version. Lets continue the discussion.
>
>
>>> g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode,
>>>     the MBA events are not available until the user assigns the events
>>>     explicitly.
>>>     
>>>     #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>     Unsupported
>>>     
>>>     #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>     Unsupported
>>>
>> This needs some more thought to accommodate Peter's scenario where the counter
>> can be expected to return the final count after the counter is disabled.
>
> I am not sure how to achieve this with ABMC. This may be applicable
> to soft rmid only. In case of "soft rmid", previous readings are
> saved in the soft rmid state.

Right. Please consider this work in two parts, first, there is a generic
interface that aims to support ABMC, "soft RMID", and MPAM. Second, there
is using this interface to support ABMC.

>>> h. The event llc_occupancy is not affected by ABMC mode. Users can still
>>>     read the llc_occupancy.
>>>
>>>     #cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy
>>>     557056
>>>
>>> i. Now assign the total event and read the monitor_state.
>>>     
>>>     #echo total=assign > /sys/fs/resctrl/monitor_state
>>>     #cat /sys/fs/resctrl/monitor_state
>>>     total=assign;local=unassign
>>>     
>> I do not see the "global assign/unassign" scenario addressed.
>
> I am not all clear on meaning of "global assign/unassign". Does it
> mean looping thru all the groups and assign the RMIDs?

Please see [1].


> It may not work in many cases. In case of ABMC, we have only limited
> number of hw counters. It will fail after hardware runs out of
> counters. It is better done selectively based on which group user is
> interested in.

Right. This is one more item where the generic interface needs to
accommodate different hardware implementations. Perhaps this could
be one of the "features" exposed by (global) mbm_assign that the
user can "enable"/"disable" on demand?

> But it can be done later if we find a use case for that.

There already exists a use case as presented by Peter in support
of AMD hardware without ABMC, no?

>> This version seems to ignore (without discussion) a lot of earlier
>> feedback.
>
> Please feel free comment. There are various threads of discussion. I may have missed.
>


Reinette

[1] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/

2024-02-08 17:44:54

by Babu Moger

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Reinette,

I am trying to propose a few things here to move forward, based on my
assumptions. Please point out anything I missed.

On 2/5/24 16:38, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/2/2024 1:57 PM, Moger, Babu wrote:
>> On 2/1/2024 10:09 PM, Reinette Chatre wrote:
>>> On 1/19/2024 10:22 AM, Babu Moger wrote:
>>>> These series adds the support for Assignable Bandwidth Monitoring Counters
>>> Not a good start ([1]).
>>
>> Yea. My bad.
>>
>>>
>>>> (ABMC). It is also called QoS RMID Pinning feature
>>>>
>>>> The feature details are documented in the  APM listed below [1].
>>>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>>>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>>>> Monitoring (ABMC). The documentation is available at
>>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>>>
>>>> The patches are based on top of commit
>>>> 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master)
>>>>
>>>> # Introduction
>>>>
>>>> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
>>>> feature only guarantees that RMIDs currently assigned to a processor will
>>>> be tracked by hardware. The counters of any other RMIDs which are no longer
>>>> being tracked will be reset to zero. The MBM event counters return
>>>> "Unavailable" for the RMIDs that are not active.
>>>>
>>>> Users can create 256 or more monitor groups. But there can be only limited
>>>> number of groups that can be give guaranteed monitoring numbers.  With ever
>>> "can be given"?
>>
>> "can give guaranteed monitoring numbers."
>>
>> I feel this looks better.
>
> Sounds good. Thank you.
>
>>
>>>
>>>> changing configurations there is no way to definitely know which of these
>>>> groups will be active for certain point of time. Users do not have the
>>>> option to monitor a group or set of groups for certain period of time
>>>> without worrying about RMID being reset in between.
>>>>
>>>> The ABMC feature provides an option to the user to assign an RMID to the
>>>> hardware counter and monitor the bandwidth for a longer duration.
>>>> The assigned RMID will be active until the user unassigns it manually.
>>>> There is no need to worry about counters being reset during this period.
>>>> Additionally, the user can specify a bitmask identifying the specific
>>>> bandwidth types from the given source to track with the counter.
>>>>
>>>> Without ABMC enabled, monitoring will work in current mode without
>>>> assignment option.
>>>>
>>>> # Linux Implementation
>>>>
>>>> Linux resctrl subsystem provides the interface to count maximum of two
>>>> memory bandwidth events per group, from a combination of available total
>>>> and local events. Keeping the current interface, users can assign a maximum
>>>> of 2 ABMC counters per group. User will also have the option to assign only
>>>> one counter to the group. If the system runs out of assignable ABMC
>>>> counters, kernel will display an error. Users need to unassign an already
>>>> assigned counter to make space for new assignments.
>>>>
>>>>
>>>> # Examples
>>>>
>>>> a. Check if ABMC support is available
>>>>     #mount -t resctrl resctrl /sys/fs/resctrl/
>>>>
>>>>     #cat /sys/fs/resctrl/info/L3_MON/mon_features
>>>>     llc_occupancy
>>>>     mbm_total_bytes
>>>>     mbm_total_bytes_config
>>>>     mbm_local_bytes
>>>>     mbm_local_bytes_config
>>>>     mbm_assign_capable ←  Linux kernel detected ABMC feature
>>>>
>>>> b. Check if ABMC is enabled. By default, ABMC feature is disabled.
>>>>     Monitoring works in legacy monitor mode when ABMC is not enabled.
>>>>
>>>>     #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>     0
>>>>
>>> With the introduction of "mbm_assign_enable" the entry in mon_features seems
>>> to provide duplicate information.
>>
>> ok. We can remove the text in mon_features and keep mbm_assign_enable. We need this to enable and disable the feature.
>
> This could be improved beyond a binary "enable"/"disable" interface to user space.
> For example, the hardware can discover which "mbm counter assign" related feature
> (I'm counting the "soft RMID" here as one of the "mbm counter assign" related
> features) is supported on the platform and it can be presented to the user like:
>
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign
> [feature_1] feature_2 feature_3

How about this?
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign
ABMC:Capable
SOFT-RMID:Capable

To enable ABMC
# echo ABMC:enable > /sys/fs/resctrl/info/L3_MON/mbm_assign

When ABMC is enabled:
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign
ABMC:Enable
SOFT-RMID:Capable
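
As a rough illustration of the "NAME:State" layout proposed here, the sketch below parses one feature per line into a dictionary. It assumes the state strings are exactly those shown in the example ("Capable", "Enable") and compares them case-insensitively, because the write example uses lowercase "enable". The helper names are hypothetical.

```python
def parse_feature_states(text):
    """Parse lines such as 'ABMC:Capable' into {'ABMC': 'Capable'}."""
    states = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, state = line.partition(":")
        states[name] = state
    return states


def enabled_features(states):
    """Return the names whose state reads as enabled, ignoring case."""
    return [name for name, state in states.items() if state.lower() == "enable"]


# Sample content copied from the proposal above.
states = parse_feature_states("ABMC:Enable\nSOFT-RMID:Capable\n")
print(states)
print(enabled_features(states))
```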

> The output indicates which features are supported by the platform and the brackets indicate
> which feature is enabled.
>
>
>>>> c. There will be new file "monitor_state" for each monitor group when ABMC
>>>>     feature is supported. However, monitor_state is not available if ABMC is
>>>>     disabled.
>>>>     
>>>>     #cat /sys/fs/resctrl/monitor_state
>>>>     Unsupported
>>> This sounds potentially confusing since users will still be able to monitor
>>> the groups ...
>> How about "Assignment-Unsupported"?
>
> (please see later)
>
>>>
>>>>     
>>>> d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC
>>>>     enabled, monitoring will work in current mode without assignment option.
>>>>     
>>>>     # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>>     779247936
>>>>     # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>     765207488
>>>>     
>>>> e. Enable ABMC mode.
>>>>
>>>>     #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>     #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>     1
>>>>
>>>> f. Read the monitor states. By default, both total and local MBM
>>>>     events are in "unassign" state.
>>>>     
>>>>     #cat /sys/fs/resctrl/monitor_state
>>>>     total=unassign;local=unassign
>>> This interface does not seem to take into account that hardware
>>> can support assignment per domain. I understand that this is
>>> not something you want to implement at this time but the user interface
>>> has to accommodate such an enhancement. This was already mentioned, and
>>> you did acknowledge the point [3] to this new version that does not
>>> reflect this is unexpected.
>>
>> Yea. Domain level assignment is not supported at this point. Do you want me to explicitly mention here?
>>
>> Please elaborate what you meant here.
>
> You have made it clear on several occasions that you do not intend to support
> domain level assignment. That may be ok but the interface you create should
> not prevent future support of domain level assignment.
>
> If my point is not clear, could you please share how this interface is able to
> support domain level assignment in the future?
>
> I am starting to think that we need a file similar to the schemata file
> for group and domain level monitor configurations.

Something like this?

By default
#cat /sys/fs/resctrl/monitor_state
default:0=total=assign,local=assign;1=total=assign,local=assign

With ABMC,
#cat /sys/fs/resctrl/monitor_state
ABMC:0=total=unassign,local=unassign;1=total=unassign,local=unassign

>
>>> My previous suggestions do seem to still stand and and I also am not able to
>>> see how Peter's requests [2] were considered. This same interface needs to
>>> accommodate usages apart from ABMC. For example, how to use this interface
>>> to address the same counter issue on AMD hardware without ABMC, and MPAM
>>> (pending James's feedback).
>>
>> Yea. Agree. Peter's comments are not addressed. I am not all clear
>> about details of Peters and James requirement.
>
> Peter listed his requirements in [1]. That email thread is a worthwhile read
> for the use cases.
>
> I believe that James is aware of this work and do hope to hear from him.
>
>>
>> With respect to ABMC here are my requirements.
>>
>> a.  Assignment needs to be done at group level.
>>
>> b. User should be able to assign each event individually. Multiple events assignment(in one command) should be supported.
>>
>> c. I have no plans to implement domain level assignment. It is done at system level.
>>
>> d. We need only couple of states.  Assigned and unassigned.
>>
>> e. monitor_state is name of file for user interface. We can change that based on comments.
>>
>> Peter, James,
>>
>> Please comment on what you want achieve in "assignment" based on the features you are working on.
>>
>> Do you want to add new states?
>>
>>>
>>> I understand that until we hear from Arm we do not know all the requirements
>>> that this interface needs to support, but I do expect this interface to
>>> at least consider requirements and usage scenarios that are already known.
>>
>> Sure. Will try that in the next version. Lets continue the discussion.
>>
>>
>>>> g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode,
>>>>     the MBA events are not available until the user assigns the events
>>>>     explicitly.
>>>>     
>>>>     #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>>     Unsupported
>>>>     
>>>>     #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>     Unsupported
>>>>
>>> This needs some more thought to accommodate Peter's scenario where the counter
>>> can be expected to return the final count after the counter is disabled.
>>
>> I am not sure how to achieve this with ABMC. This may be applicable
>> to soft rmid only. In case of "soft rmid", previous readings are
>> saved in the soft rmid state.
>
> Right. Please consider this work in two parts, first, there is a generic
> interface that aims to support ABMC, "soft RMID", and MPAM. Second, there
> is using this interface to support ABMC.

Yea. But it is tough without knowing all the details of the other features.

How about?

#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
ABMC:Unassigned
#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
ABMC:Unassigned

>
>>>> h. The event llc_occupancy is not affected by ABMC mode. Users can still
>>>>     read the llc_occupancy.
>>>>
>>>>     #cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy
>>>>     557056
>>>>
>>>> i. Now assign the total event and read the monitor_state.
>>>>     
>>>>     #echo total=assign > /sys/fs/resctrl/monitor_state
>>>>     #cat /sys/fs/resctrl/monitor_state
>>>>     total=assign;local=unassign
>>>>     
>>> I do not see the "global assign/unassign" scenario addressed.
>>
>> I am not all clear on meaning of "global assign/unassign". Does it
>> mean looping thru all the groups and assign the RMIDs?
>
> Please see [1].
>
>
>> It may not work in many cases. In case of ABMC, we have only limited
>> number of hw counters. It will fail after hardware runs out of
>> counters. It is better done selectively based on which group user is
>> interested in.
>
> Right. This is one more item where the generic interface needs to
> accommodate different hardware implementations. Perhaps this could
> be one of the "features" exposed by (global) mbm_assign that the
> user can "enable"/"disable" on demand?
>
>> But it can be done later if we find a use case for that.
>
> There already exists a use case as presented by Peter in support
> of AMD hardware without ABMC, no?

Yes, there is a use case. But it seems the use case is mostly applicable
to the soft-rmid feature.

We can tie the global assign only to soft-rmid.

# echo SOFT-RMID:enable > /sys/fs/resctrl/info/L3_MON/mbm_assign

Because this is soft-rmid, call the global assign method.

# echo ABMC:enable > /sys/fs/resctrl/info/L3_MON/mbm_assign

Because this is ABMC, do only the steps required to enable ABMC.
Don't do individual assignment.

>
>>> This version seems to ignore (without discussion) a lot of earlier
>>> feedback.
>>
>> Please feel free comment. There are various threads of discussion. I may have missed.
>>
>
>
> Reinette
>
> [1] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/

--
Thanks
Babu Moger

2024-02-16 20:18:51

by Peter Newman

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On Thu, Feb 8, 2024 at 9:29 AM Moger, Babu <[email protected]> wrote:
> On 2/5/24 16:38, Reinette Chatre wrote:
> > This could be improved beyond a binary "enable"/"disable" interface to user space.
> > For example, the hardware can discover which "mbm counter assign" related feature
> > (I'm counting the "soft RMID" here as one of the "mbm counter assign" related
> > features) is supported on the platform and it can be presented to the user like:
> >
> > # cat /sys/fs/resctrl/info/L3_MON/mbm_assign
> > [feature_1] feature_2 feature_3
>
> How about this?
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign
> ABMC:Capable
> SOFT-RMID:Capable
>
> To enable ABMC
> # echo ABMC:enable > /sys/fs/resctrl/info/L3_MON/mbm_assign
>
> When ABMC is enabled:
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign
> ABMC:Enable
> SOFT-RMID:Capable

There would be no need to use soft RMIDs on a system that supports
ABMC, so I can't think of a reason why the underlying implementation
would matter to our users. The user should only have to request the
interface where monitors must be assigned manually. The mount would
succeed if the system has a way to support the interface.


> > You have made it clear on several occasions that you do not intend to support
> > domain level assignment. That may be ok but the interface you create should
> > not prevent future support of domain level assignment.
> >
> > If my point is not clear, could you please share how this interface is able to
> > support domain level assignment in the future?
> >
> > I am starting to think that we need a file similar to the schemata file
> > for group and domain level monitor configurations.
>
> Something like this?
>
> By default
> #cat /sys/fs/resctrl/monitor_state
> default:0=total=assign,local=assign;1=total=assign,local=assign
>
> With ABMC,
> #cat /sys/fs/resctrl/monitor_state
> ABMC:0=total=unassign,local=unassign;1=total=unassign,local=unassign

The benefit from all the string parsing in this interface is only
halving the number of monitor_state sysfs writes we'd need compared to
creating a separate file for mbm_local and mbm_total. Given that our
use case is to assign the 32 assignable counters to read the bandwidth
of ~256 monitoring groups, this isn't a substantial gain to help us. I
think you should just focus on providing the necessary control
granularity without trying to consolidate writes in this interface. I
will propose an additional interface below to optimize our use case.

Whether mbm_total and mbm_local are combined in the group directories
or not, I don't see why you wouldn't just repeat the same file
interface in the domain directories for a user needing finer-grained
controls.


> >> Peter, James,
> >>
> >> Please comment on what you want achieve in "assignment" based on the features you are working on.

I prototyped and tested the following additional interface for the
large-scale, batch use case that we're primarily concerned about:

info/L3_MON/mbm_{local,total}_bytes_assigned

Writing a whitespace-delimited list of mongroup directory paths does
the following:
1. unassigns all monitors for the given counter
2. assigns a monitor to each mongroup referenced in the write
3. batches per-domain register updates resulting from the assignments
into a single IPI for each domain

This interface allows us to do fewer sysfs writes and IPIs on systems
with more assignable monitoring resources, rather than doing more.

The reference to a mongroup when reading/writing the above node is the
resctrl-root-relative path to the monitoring group. There is probably
a more concise way to refer to the groups, but my prototype used
kernfs_walk_and_get() to locate each rdtgroup struct.
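
The write semantics described above can be modeled in user space as a toy sketch. This is not the prototype's kernel code: the class, its fields, the group-to-domain map, and the error raised when a write names more groups than there are counters are all assumptions made for illustration. The default of 32 counters is taken from the figure quoted in this thread.

```python
class AssignedFile:
    """Toy model of one info/L3_MON/mbm_{local,total}_bytes_assigned node."""

    def __init__(self, group_domains, num_counters=32):
        # Hypothetical map: resctrl-root-relative mongroup path -> domain ids.
        self.group_domains = group_domains
        self.num_counters = num_counters
        self.assigned = []
        self.ipis_sent = 0

    def write(self, text):
        # De-duplicate while keeping the order given by the writer.
        paths = list(dict.fromkeys(text.split()))
        for path in paths:
            if path not in self.group_domains:
                raise FileNotFoundError(path)
        if len(paths) > self.num_counters:
            raise OSError("out of assignable counters")
        # Steps 1 and 2: drop every old assignment, then assign the new list.
        touched = set()
        for path in self.assigned + paths:
            touched.update(self.group_domains[path])
        self.assigned = paths
        # Step 3: one batched IPI per domain whose registers changed.
        self.ipis_sent += len(touched)

    def read(self):
        return " ".join(self.assigned)


node = AssignedFile({"g1": [0, 1], "g2": [0, 1], "mon_groups/g3": [1]})
node.write("g1 g2")          # two groups, two domains: two IPIs in this model
node.write("mon_groups/g3")  # replaces the previous assignment wholesale
print(node.read(), node.ipis_sent)
```

In this model the IPI count tracks the number of affected domains, not the number of groups named in the write, which is the point of batching.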

I would also like to add that in the software-ABMC prototype I made,
because it's based on assignment of a small number of RMIDs,
assignment results in all counters being assigned at once. On
implementations where per-counter assignments aren't possible,
assignment through such a resource would be allowed to assign more
resources than explicitly requested.

This would allow an implementation only capable of global assignment
to assign resources to all groups when a non-empty string is written
to the proposed file nodes, and all resources to be unassigned when an
empty string is written. Reading back from the file nodes would tell
the user how much was actually assigned.
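
A minimal sketch of that global-only behavior, under the assumption that the backend cannot assign per group: any non-empty write assigns every known group, an empty write unassigns everything, and a read reports what was actually assigned. The class and group names are hypothetical.

```python
class GlobalOnlyAssignedFile:
    """Toy model of a backend that can only assign all groups or none."""

    def __init__(self, all_groups):
        self.all_groups = list(all_groups)
        self.assigned = []

    def write(self, text):
        # The requested names are only a hint: non-empty means "assign all".
        self.assigned = list(self.all_groups) if text.strip() else []

    def read(self):
        return " ".join(self.assigned)


node = GlobalOnlyAssignedFile(["g1", "g2", "g3"])
node.write("g2")      # the user asked for one group
print(node.read())    # reading back shows that more was assigned
node.write("")        # an empty string unassigns everything
print(repr(node.read()))
```

The read-back step is what keeps the interface honest: the user can always compare the request with the result.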

Thanks!
-Peter

2024-02-19 18:02:31

by Babu Moger

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Peter,

On 2/16/24 14:18, Peter Newman wrote:
> Hi Babu,
>
> On Thu, Feb 8, 2024 at 9:29 AM Moger, Babu <[email protected]> wrote:
>> On 2/5/24 16:38, Reinette Chatre wrote:
>>> This could be improved beyond a binary "enable"/"disable" interface to user space.
>>> For example, the hardware can discover which "mbm counter assign" related feature
>>> (I'm counting the "soft RMID" here as one of the "mbm counter assign" related
>>> features) is supported on the platform and it can be presented to the user like:
>>>
>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign
>>> [feature_1] feature_2 feature_3
>>
>> How about this?
>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign
>> ABMC:Capable
>> SOFT-RMID:Capable
>>
>> To enable ABMC
>> # echo ABMC:enable > /sys/fs/resctrl/info/L3_MON/mbm_assign
>>
>> When ABMC is enabled:
>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign
>> ABMC:Enable
>> SOFT-RMID:Capable
>
> There would be no need to use soft RMIDs on a system that supports
> ABMC, so I can't think of a reason why the underlying implementation
> would matter to our users. The user should only have to request the
> interface where monitors must be assigned manually. The mount would
> succeed if the system has a way to support the interface.

Ok, sure. I will exclude soft-rmid from this interface.

For now, let's keep this only for ABMC.
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign
ABMC:Capable

Or

# cat /sys/fs/resctrl/info/L3_MON/mbm_assign
ABMC:Enable

>
>
>>> You have made it clear on several occasions that you do not intend to support
>>> domain level assignment. That may be ok but the interface you create should
>>> not prevent future support of domain level assignment.
>>>
>>> If my point is not clear, could you please share how this interface is able to
>>> support domain level assignment in the future?
>>>
>>> I am starting to think that we need a file similar to the schemata file
>>> for group and domain level monitor configurations.
>>
>> Something like this?
>>
>> By default
>> #cat /sys/fs/resctrl/monitor_state
>> default:0=total=assign,local=assign;1=total=assign,local=assign
>>
>> With ABMC,
>> #cat /sys/fs/resctrl/monitor_state
>> ABMC:0=total=unassign,local=unassign;1=total=unassign,local=unassign
>
> The benefit from all the string parsing in this interface is only
> halving the number of monitor_state sysfs writes we'd need compared to
> creating a separate file for mbm_local and mbm_total. Given that our
> use case is to assign the 32 assignable counters to read the bandwidth
> of ~256 monitoring groups, this isn't a substantial gain to help us. I
> think you should just focus on providing the necessary control
> granularity without trying to consolidate writes in this interface. I

Ok. It looks like we need to provide a way to assign the RMIDs to
individual domains in this interface. I wasn't planning that now, but it
can be done without many changes.

Something like this (typos corrected: '=' replaced with '-'):

#cat /sys/fs/resctrl/monitor_state
ABMC:0=total-unassign,local-unassign;1=total-unassign,local-unassign

To assign:

#echo "ABMC:0=total-assign,local-assign" > /sys/fs/resctrl/monitor_state
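
For illustration, the corrected format could be parsed and regenerated along these lines. This is a sketch that assumes the layout is exactly "MODE:domain=event-state,event-state;domain=..." as in the example above. The function names are hypothetical, and the real grammar may well change during review.

```python
def parse_monitor_state(line):
    """Parse 'ABMC:0=total-unassign,local-unassign;1=...' into (mode, domains)."""
    mode, _, body = line.strip().partition(":")
    domains = {}
    for entry in body.split(";"):
        dom_id, _, events = entry.partition("=")
        # Each event is 'name-state'; split on the first '-' only.
        domains[int(dom_id)] = dict(ev.split("-", 1) for ev in events.split(","))
    return mode, domains


def format_monitor_state(mode, domains):
    """Inverse of parse_monitor_state(), emitting domains in numeric order."""
    parts = []
    for dom_id in sorted(domains):
        events = ",".join(f"{ev}-{state}" for ev, state in domains[dom_id].items())
        parts.append(f"{dom_id}={events}")
    return f"{mode}:" + ";".join(parts)


mode, domains = parse_monitor_state(
    "ABMC:0=total-unassign,local-unassign;1=total-unassign,local-unassign"
)
domains[0] = {"total": "assign", "local": "assign"}  # assign both events in domain 0
print(format_monitor_state(mode, {0: domains[0]}))
```

Using '-' between event and state leaves '=' free to separate a domain id from its event list, so each level of the string has its own delimiter.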


> will propose an additional interface below to optimize our use case.
>
> Whether mbm_total and mbm_local are combined in the group directories
> or not, I don't see why you wouldn't just repeat the same file
> interface in the domain directories for a user needing finer-grained
> controls.

I don't see the need for the same file inside each domain directory at the
group level when the above command can assign the RMIDs per domain.

>
>
>>>> Peter, James,
>>>>
>>>> Please comment on what you want achieve in "assignment" based on the features you are working on.
>
> I prototyped and tested the following additional interface for the
> large-scale, batch use case that we're primarily concerned about:
>
> info/L3_MON/mbm_{local,total}_bytes_assigned
>
> Writing a whitespace-delimited list of mongroup directory paths does
> the following:
> 1. unassign all monitors for the given counter
> 2. assigns a monitor to each mongroup referenced in the write
> 3. batches per-domain register updates resulting from the assignments
> into a single IPI for each domain
>
> This interface allows us to do less sysfs writes and IPIs on systems
> with more assignable monitoring resources, rather than doing more.
>
> The reference to a mongroup when reading/writing the above node is the
> resctrl-root-relative path to the monitoring group. There is probably
> a more concise way to refer to the groups, but my prototype used
> kernfs_walk_and_get() to locate each rdtgroup struct.
>
> I would also like to add that in the software-ABMC prototype I made,
> because it's based on assignment of a small number of RMIDs,
> assignment results in all counters being assigned at once. On
> implementations where per-counter assignments aren't possible,
> assignment through such a resource would be allowed to assign more
> resources than explicitly requested.
>
> This would allow an implementation only capable of global assignment
> to assign resources to all groups when a non-empty string is written
> to the proposed file nodes, and all resources to be unassigned when an
> empty string is written. Reading back from the file nodes would tell
> the user how much was actually assigned.

Yes. This interface can be extended to ABMC as a global assignment option.
If your patches are ready, I can add them on top of my ABMC series.
If you would rather add the support later, I will go ahead with the current
base ABMC support.
Let me know.
--
Thanks
Babu Moger

2024-02-20 15:22:05

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On 19/01/2024 18:22, Babu Moger wrote:
> These series adds the support for Assignable Bandwidth Monitoring Counters
> (ABMC). It is also called QoS RMID Pinning feature
>
> The feature details are documented in the APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> Monitoring (ABMC). The documentation is available at
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>
> The patches are based on top of commit
> 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master)
>
> # Introduction
>
> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
> feature only guarantees that RMIDs currently assigned to a processor will
> be tracked by hardware. The counters of any other RMIDs which are no longer
> being tracked will be reset to zero. The MBM event counters return
> "Unavailable" for the RMIDs that are not active.
>
> Users can create 256 or more monitor groups. But there can be only limited
> number of groups that can be give guaranteed monitoring numbers. With ever
> changing configurations there is no way to definitely know which of these
> groups will be active for certain point of time. Users do not have the
> option to monitor a group or set of groups for certain period of time
> without worrying about RMID being reset in between.
>
> The ABMC feature provides an option to the user to assign an RMID to the
> hardware counter and monitor the bandwidth for a longer duration.
> The assigned RMID will be active until the user unassigns it manually.
> There is no need to worry about counters being reset during this period.
> Additionally, the user can specify a bitmask identifying the specific
> bandwidth types from the given source to track with the counter.

At a high level, if existing software can't use the counters, I'd prefer we move them into
perf. We're currently re-inventing the perf wheel. (this argument doesn't hold for the
llc_occupancy, which is a state not counter!)

But if this lets someone 'pin' the counters for the groups they monitor, then use existing
tools, that seems a good enough argument for doing this.


> Without ABMC enabled, monitoring will work in current mode without
> assignment option.

To check I understand: the counters will get spuriously reset at the whim of the hardware?


> # Linux Implementation
>
> Linux resctrl subsystem provides the interface to count maximum of two
> memory bandwidth events per group, from a combination of available total
> and local events. Keeping the current interface, users can assign a maximum
> of 2 ABMC counters per group. User will also have the option to assign only
> one counter to the group. If the system runs out of assignable ABMC
> counters, kernel will display an error. Users need to unassign an already
> assigned counter to make space for new assignments.
>
>
> # Examples
>
> a. Check if ABMC support is available
> #mount -t resctrl resctrl /sys/fs/resctrl/
>
> #cat /sys/fs/resctrl/info/L3_MON/mon_features
> llc_occupancy
> mbm_total_bytes
> mbm_total_bytes_config
> mbm_local_bytes
> mbm_local_bytes_config
> mbm_assign_capable ← Linux kernel detected ABMC feature
>
> b. Check if ABMC is enabled. By default, ABMC feature is disabled.
> Monitoring works in legacy monitor mode when ABMC is not enabled.
>
> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
> 0
>
> c. There will be new file "monitor_state" for each monitor group when ABMC
> feature is supported. However, monitor_state is not available if ABMC is
> disabled.
> #cat /sys/fs/resctrl/monitor_state
> Unsupported
>
> d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC
> enabled, monitoring will work in current mode without assignment option.
>
> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> 779247936
> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> 765207488
>
> e. Enable ABMC mode.
>
> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
> 1

Why does this mode need enabling? Can't it be enabled automatically on hardware that
supports it, or enabled implicitly when the first assignment attempt arrives?

I guess this is really needed for a reset - could we implement that instead? This way
there isn't an extra step user-space has to do to make the assignments work.


> f. Read the monitor states. By default, both total and local MBM
> events are in "unassign" state.
>
> #cat /sys/fs/resctrl/monitor_state
> total=unassign;local=unassign


> g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode,
> the MBM events are not available until the user assigns the events
> explicitly.

How does this fit with "monitoring will work in current mode without assignment option.".
You mentioned the hardware resets the counters when this mode is enabled, does it also
refuse to count until the MSR is programmed?

If so - is there any mileage in auto-assigning the first N RMID to counters when the
groups are created? This way existing user-space tools work until they exceed the limits
of hardware. From that point a counter needs to be unassigned from another group. (we'd
need to make it easy to find which groups have a counter assigned)
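
A minimal userspace sketch of the auto-assignment idea above, assuming a fixed pool of 32 counters (the number quoted in the cover letter) and one counter per group. The names `counter_pool`, `pool_assign` and `pool_unassign` are hypothetical and are not part of the posted series; a real implementation would also have to handle the separate total and local events.

```c
#include <errno.h>
#include <stdbool.h>

#define NUM_COUNTERS 32	/* assumption: the count quoted in the cover letter */

/* Hypothetical pool: owner[i] is the RMID holding counter i, or -1 if free. */
struct counter_pool {
	int owner[NUM_COUNTERS];
};

static void pool_init(struct counter_pool *p)
{
	for (int i = 0; i < NUM_COUNTERS; i++)
		p->owner[i] = -1;
}

/* Called at group creation: take the first free counter, or -ENOSPC. */
static int pool_assign(struct counter_pool *p, int rmid)
{
	for (int i = 0; i < NUM_COUNTERS; i++) {
		if (p->owner[i] < 0) {
			p->owner[i] = rmid;
			return i;
		}
	}
	return -ENOSPC;
}

/* Release the counter held by the RMID; returns false if it held none. */
static bool pool_unassign(struct counter_pool *p, int rmid)
{
	for (int i = 0; i < NUM_COUNTERS; i++) {
		if (p->owner[i] == rmid) {
			p->owner[i] = -1;
			return true;
		}
	}
	return false;
}
```

With this shape, the first 32 groups get a counter at creation time, and group 33 sees -ENOSPC until some other group is unassigned. Walking `owner[]` is also one way to answer "which groups currently hold a counter".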


> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> Unsupported
>
> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> Unsupported
>
> h. The event llc_occupancy is not affected by ABMC mode. Users can still
> read the llc_occupancy.
>
> #cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy
> 557056

{
MPAM would be the same - because llc_occupancy isn't a counter but a view of the
state, it's possible to multiplex a single llc_occupancy counter behind the scenes
to provide the value for as many groups as needed. I suspect any other
architecture would have the same property.
}

> i. Now assign the total event and read the monitor_state.
>
> #echo total=assign > /sys/fs/resctrl/monitor_state
> #cat /sys/fs/resctrl/monitor_state
> total=assign;local=unassign
>
> j. Now that the total event is assigned. Read the total event.
>
> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> 6136000
>
> k. Now assign the local event and read the monitor_state.
>
> #echo local=assign > /sys/fs/resctrl/monitor_state
> #cat /sys/fs/resctrl/monitor_state
> total=assign;local=assign
>
> Users can also assign both total and local events in one single
> command.
>
> l. Now that both total and local events are assigned, read the events.
>
> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> 6136000
> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> 58694

(the bandwidth configuration stuff is the existing BMEC support right?)

From user-space's perspective MPAM could be made to look the same.

There ought to be some indication to user-space of how many counters it can assign, this
number might be different for different resources. This won't be a problem today, but if
we had 'mbm_total_bytes' on the L2 cache, the number of counters may be different.

MPAM platforms are unlikely to support both 'mbm_total' and 'mbm_local', I think this is
just a documentation problem to say that mbm_local can't be configured if it's not
supported - user-space can't blindly assign both.

If the configuration is changed over time - I bet user-space needs a quick way to find
where the counters are currently assigned - walking the tree to find out is a bit rubbish.
A file that lists the "control_group_name[/mon_group_name]" would help.


Thanks,

James

2024-02-20 15:22:10

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Peter, Babu,

On 16/02/2024 20:18, Peter Newman wrote:
> On Thu, Feb 8, 2024 at 9:29 AM Moger, Babu <[email protected]> wrote:
>> On 2/5/24 16:38, Reinette Chatre wrote:
>>> You have made it clear on several occasions that you do not intend to support
>>> domain level assignment. That may be ok but the interface you create should
>>> not prevent future support of domain level assignment.
>>>
>>> If my point is not clear, could you please share how this interface is able to
>>> support domain level assignment in the future?
>>>
>>> I am starting to think that we need a file similar to the schemata file
>>> for group and domain level monitor configurations.
>>
>> Something like this?
>>
>> By default
>> #cat /sys/fs/resctrl/monitor_state
>> default:0=total=assign,local=assign;1=total=assign,local=assign
>>
>> With ABMC,
>> #cat /sys/fs/resctrl/monitor_state
>> ABMC:0=total=unassign,local=unassign;1=total=unassign,local=unassign
>
> The benefit from all the string parsing in this interface is only
> halving the number of monitor_state sysfs writes we'd need compared to
> creating a separate file for mbm_local and mbm_total. Given that our
> use case is to assign the 32 assignable counters to read the bandwidth
> of ~256 monitoring groups, this isn't a substantial gain to help us. I
> think you should just focus on providing the necessary control
> granularity without trying to consolidate writes in this interface. I
> will propose an additional interface below to optimize our use case.
>
> Whether mbm_total and mbm_local are combined in the group directories
> or not, I don't see why you wouldn't just repeat the same file
> interface in the domain directories for a user needing finer-grained
> controls.

I don't follow why this has to be done globally. resctrl allows CLOSID to have different
configurations for different purposes between different domains (as long as tasks are
pinned to CPUs). It feels a bit odd that these counters can't be considered as per-domain too.

MPAM can equally allocate monitors/counters per-domain. If we are ever going to have
per-domain assignment, I think it's worth the extra work to do that now and avoid the extra
user-space interface baggage from the global version.


>>>> Peter, James,
>>>>
>>>> Please comment on what you want achieve in "assignment" based on the features you are working on.
>
> I prototyped and tested the following additional interface for the
> large-scale, batch use case that we're primarily concerned about:
>
> info/L3_MON/mbm_{local,total}_bytes_assigned
>
> Writing a whitespace-delimited list of mongroup directory paths does

| mkdir /sys/fs/resctrl/my\ group

string parsing in the kernel is rarely fun!


> the following:
> 1. unassign all monitors for the given counter
> 2. assigns a monitor to each mongroup referenced in the write
> 3. batches per-domain register updates resulting from the assignments
> into a single IPI for each domain
>
> This interface allows us to do less sysfs writes and IPIs on systems
> with more assignable monitoring resources, rather than doing more.
>
> The reference to a mongroup when reading/writing the above node is the
> resctrl-root-relative path to the monitoring group. There is probably
> a more concise way to refer to the groups, but my prototype used
> kernfs_walk_and_get() to locate each rdtgroup struct.

If this file were re-used for finding where the monitors were currently allocated, using
the name would be a natural fit for building a path to un-assign one group.


> I would also like to add that in the software-ABMC prototype I made,
> because it's based on assignment of a small number of RMIDs,
> assignment results in all counters being assigned at once. On
> implementations where per-counter assignments aren't possible,
> assignment through such a resource would be allowed to assign more
> resources than explicitly requested.
>
> This would allow an implementation only capable of global assignment

Do we know if this exists? Given the configurations have to be different for a domain, I'd
be surprised if counter configuration is somehow distributed between domains.


> to assign resources to all groups when a non-empty string is written
> to the proposed file nodes, and all resources to be unassigned when an
> empty string is written. Reading back from the file nodes would tell
> the user how much was actually assigned.

What do you mean by 'how much'? Is this allowed to fail early? That feels a bit
counter-intuitive. As this starts with a reset, if the number of counters is known - it
should be easy for user-space to know it can only write X tokens into that file.


Thanks,

James

2024-02-20 18:02:09

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v2 04/17] x86/resctrl: Detect Assignable Bandwidth Monitoring feature details

Hi Babu,

On 19/01/2024 18:22, Babu Moger wrote:
> ABMC feature details are reported via CPUID Fn8000_0020_EBX_x5.
> Bits Description
> 15:0 MAX_ABMC Maximum Supported Assignable Bandwidth
> Monitoring Counter ID + 1
>
> Detect the feature details and update
> /sys/fs/resctrl/info/L3_MON/mon_features.
>
> If the system supports Assignable Bandwidth Monitoring Counters (ABMC),
> the output will have additional text.
> $ cat /sys/fs/resctrl/info/L3_MON/mon_features
> mbm_assign_capable
>
> The feature details are documented in APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> Monitoring (ABMC).

> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 4efe2d6a9eb7..f40ee271a5c7 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -303,6 +303,17 @@ static void rdt_get_cdp_l2_config(void)
> rdt_get_cdp_config(RDT_RESOURCE_L2);
> }
>
> +static void rdt_get_abmc_cfg(struct rdt_resource *r)
> +{
> + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> + u32 eax, ebx, ecx, edx;
> +
> + r->mbm_assign_capable = true;
> + /* Query CPUID_Fn80000020_EBX_x05 for number of ABMC counters */
> + cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);

> + hw_res->mbm_assignable_counters = (ebx & 0xFFFF) + 1;

Please put the mbm_assignable_counters field in struct rdt_resource. The filesystem code
needs to access this to allocate/free counters and report how many are available.
After all this gets split and the filesystem code moves to /fs/, the rdt_hw_resource
structure is inaccessible to the filesystem code.
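
As a rough illustration of the request above, the decoded count could live in the arch-independent structure rather than the hw one. The sketch below is plain userspace C: `struct resource_sketch` is a hypothetical stand-in for struct rdt_resource, and the EBX value is passed in rather than read with cpuid_count(). The arithmetic mirrors the patch as posted; whether the `+ 1` matches the APM field definition is not something this sketch settles.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the arch-independent resource structure. */
struct resource_sketch {
	bool mbm_assign_capable;
	uint32_t mbm_assignable_counters;
};

/* Mirrors the arithmetic in the patch: EBX[15:0] plus one. */
static uint32_t abmc_counters_from_ebx(uint32_t ebx)
{
	return (ebx & 0xFFFF) + 1;
}

/* Fill the filesystem-visible fields from a raw EBX value. */
static void sketch_get_abmc_cfg(struct resource_sketch *r, uint32_t ebx)
{
	r->mbm_assign_capable = true;
	r->mbm_assignable_counters = abmc_counters_from_ebx(ebx);
}
```

Because both fields sit in the generic structure, code that reports the value to user-space would not need resctrl_to_arch_res() at all.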


> +}
> +
> static void
> mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
> {
> @@ -815,6 +826,12 @@ static __init bool get_rdt_alloc_resources(void)
> if (get_slow_mem_config())
> ret = true;
>
> + if (rdt_cpu_has(X86_FEATURE_ABMC)) {
> + r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + rdt_get_abmc_cfg(r);

> + ret = true;

Does it make sense to report rdt_alloc_capable if the SoC has ABMC, but nothing that can
be configured?

This code would probably make more sense in get_rdt_mon_resources().


> + }
> +
> return ret;
> }
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index a4f1aa15f0a2..01eb0522b42b 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -391,6 +391,8 @@ struct rdt_parse_data {
> * resctrl_arch_get_num_closid() to avoid confusion
> * with struct resctrl_schema's property of the same name,
> * which has been corrected for features like CDP.

> + * @mbm_assignable_counters:
> + * Maximum number of assignable ABMC counters

As above, please move this to struct rdt_resource. The 'hw' version becomes private to the
arch code after the move to /fs/.


> * @msr_base: Base MSR address for CBMs
> * @msr_update: Function pointer to update QOS MSRs
> * @mon_scale: cqm counter * mon_scale = occupancy in bytes


Thanks,

James


2024-02-20 18:11:41

by Peter Newman

[permalink] [raw]
Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi James,

On Tue, Feb 20, 2024 at 7:21 AM James Morse <[email protected]> wrote:
> On 16/02/2024 20:18, Peter Newman wrote:
> > On Thu, Feb 8, 2024 at 9:29 AM Moger, Babu <[email protected]> wrote:
> >> On 2/5/24 16:38, Reinette Chatre wrote:
> >>> You have made it clear on several occasions that you do not intend to support
> >>> domain level assignment. That may be ok but the interface you create should
> >>> not prevent future support of domain level assignment.
> >>>
> >>> If my point is not clear, could you please share how this interface is able to
> >>> support domain level assignment in the future?
> >>>
> >>> I am starting to think that we need a file similar to the schemata file
> >>> for group and domain level monitor configurations.
> >>
> >> Something like this?
> >>
> >> By default
> >> #cat /sys/fs/resctrl/monitor_state
> >> default:0=total=assign,local=assign;1=total=assign,local=assign
> >>
> >> With ABMC,
> >> #cat /sys/fs/resctrl/monitor_state
> >> ABMC:0=total=unassign,local=unassign;1=total=unassign,local=unassign
> >
> > The benefit from all the string parsing in this interface is only
> > halving the number of monitor_state sysfs writes we'd need compared to
> > creating a separate file for mbm_local and mbm_total. Given that our
> > use case is to assign the 32 assignable counters to read the bandwidth
> > of ~256 monitoring groups, this isn't a substantial gain to help us. I
> > think you should just focus on providing the necessary control
> > granularity without trying to consolidate writes in this interface. I
> > will propose an additional interface below to optimize our use case.
> >
> > Whether mbm_total and mbm_local are combined in the group directories
> > or not, I don't see why you wouldn't just repeat the same file
> > interface in the domain directories for a user needing finer-grained
> > controls.
>
> I don't follow why this has to be done globally. resctrl allows CLOSID to have different
> configurations for different purposes between different domains (as long as tasks are
> pinned to CPUs). It feels a bit odd that these counters can't be considered as per-domain too.

Assigning to all domains at once would allow us to better parallelize
the resulting IPIs when we do need to iterate a small set of monitors
over a large list of groups.


> > I prototyped and tested the following additional interface for the
> > large-scale, batch use case that we're primarily concerned about:
> >
> > info/L3_MON/mbm_{local,total}_bytes_assigned
> >
> > Writing a whitespace-delimited list of mongroup directory paths does
>
> | mkdir /sys/fs/resctrl/my\ group
>
> string parsing in the kernel is rarely fun!

Hopefully restricting to a newline-delimited list will keep this fun
and easy then.

Otherwise if referring to many groups in a single write isn't a viable
path forward, I'll still need to find a way to address the
fs/syscall/IPI overhead of measuring the bandwidth of a large number
of groups.
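
A small userspace sketch of the newline-delimited variant, to show that group names containing spaces survive intact. The function name `split_group_paths` and the example paths are hypothetical; the prototype under discussion is not shown in this thread, and kernel code would more likely lean on an existing helper such as strsep().

```c
#include <stddef.h>
#include <string.h>

/*
 * Split buf in place on '\n' and store up to max non-empty lines in out.
 * Returns the number of lines stored, or -1 if there are more than max.
 */
static int split_group_paths(char *buf, char **out, int max)
{
	int n = 0;
	char *line = buf;

	while (line && *line) {
		char *nl = strchr(line, '\n');

		if (nl)
			*nl = '\0';	/* terminate the current line */
		if (*line) {		/* skip empty lines */
			if (n == max)
				return -1;
			out[n++] = line;
		}
		line = nl ? nl + 1 : NULL;
	}
	return n;
}
```

Each returned pointer would then be handed to whatever lookup resolves a resctrl-root-relative path to its group.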

>
>
> > the following:
> > 1. unassign all monitors for the given counter
> > 2. assigns a monitor to each mongroup referenced in the write
> > 3. batches per-domain register updates resulting from the assignments
> > into a single IPI for each domain
> >
> > This interface allows us to do less sysfs writes and IPIs on systems
> > with more assignable monitoring resources, rather than doing more.
> >
> > The reference to a mongroup when reading/writing the above node is the
> > resctrl-root-relative path to the monitoring group. There is probably
> > a more concise way to refer to the groups, but my prototype used
> > kernfs_walk_and_get() to locate each rdtgroup struct.
>
> If this file were re-used for finding where the monitors were currently allocated, using
> the name would be a natural fit for building a path to un-assign one group.
>
>
> > I would also like to add that in the software-ABMC prototype I made,
> > because it's based on assignment of a small number of RMIDs,
> > assignment results in all counters being assigned at once. On
> > implementations where per-counter assignments aren't possible,
> > assignment through such a resource would be allowed to assign more
> > resources than explicitly requested.
> >
> > This would allow an implementation only capable of global assignment
>
> Do we know if this exists? Given the configurations have to be different for a domain, I'd
> be surprised if counter configuration is somehow distributed between domains.

It's currently only a proposal[1] for mitigating the context switch
overhead cost of soft RMIDs. I'm looking at the other alternative
first, though.


> > to assign resources to all groups when a non-empty string is written
> > to the proposed file nodes, and all resources to be unassigned when an
> > empty string is written. Reading back from the file nodes would tell
> > the user how much was actually assigned.
>
> > What do you mean by 'how much'? Is this allowed to fail early? That feels a bit
> counter-intuitive. As this starts with a reset, if the number of counters is known - it
> should be easy for user-space to know it can only write X tokens into that file.

I was referring to the operation assigning more groups than requested
if the implementation is only capable of a master enable/disable for
all monitoring: reading back would indicate that all monitoring groups
are in the assigned list.

There would otherwise be an interface telling the user how many
monitors can be assigned, so there's no reason to expect this
operation to fail, short of the user doing something silly like
deleting a group while it's concurrently being assigned.

-Peter

[1] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/

2024-02-20 18:15:00

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v2 06/17] x86/resctrl: Introduce interface to display number of ABMC counters

Hi Babu,

On 19/01/2024 18:22, Babu Moger wrote:
> The ABMC feature provides an option to the user to pin (or assign) the
> RMID to the hardware counter and monitor the bandwidth for a longer
> duration. There are only a limited number of hardware counters.
>
> Provide the interface to display the number of ABMC counters supported.


> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index a6c336b6de61..fa492ea820f0 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -823,6 +823,10 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
> resctrl_file_fflags_init("mbm_local_bytes_config",
> RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
> }
> +

> + if (rdt_cpu_has(X86_FEATURE_ABMC))

Please put this in a header and call it something like
resctrl_arch_has_assignable_counters(). These X86 feature definition macros aren't
available on other architectures!


> + resctrl_file_fflags_init("mbm_assignable_counters",
> + RFTYPE_MON_INFO);
> }
>
> l3_mon_evt_init(r);
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 3e233251e7ed..53be5cd1c28e 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -811,6 +811,17 @@ static int rdtgroup_rmid_show(struct kernfs_open_file *of,
> return ret;
> }
>
> +static int rdtgroup_mbm_assignable_counters_show(struct kernfs_open_file *of,
> + struct seq_file *s, void *v)
> +{
> + struct rdt_resource *r = of->kn->parent->priv;

> + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);

(After the move out to /fs/ the resctrl_to_arch_res() macro is private to the arch code.
Needing to do this when providing a value to user-space is the indication that the value
should be in struct rdt_resource instead!)


> + seq_printf(s, "%d\n", hw_res->mbm_assignable_counters);
> +
> + return 0;
> +}
> +
> #ifdef CONFIG_PROC_CPU_RESCTRL
>
> /*


Thanks,

James

2024-02-20 18:15:35

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On 20/02/2024 15:21, James Morse wrote:
> There ought to be some indication to user-space of how many counters it can assign, this
> number might be different for different resources. This won't be a problem today, but if
> we had 'mbm_total_bytes' on the L2 cache, the number of counters may be different.

I found this in patch 6 - sorry for the noise!


Thanks,

James

2024-02-20 20:48:58

by Babu Moger

[permalink] [raw]
Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi James,

On 2/20/24 09:21, James Morse wrote:
> Hi Babu,
>
> On 19/01/2024 18:22, Babu Moger wrote:
>> These series adds the support for Assignable Bandwidth Monitoring Counters
>> (ABMC). It is also called QoS RMID Pinning feature
>>
>> The feature details are documented in the APM listed below [1].
>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>> Monitoring (ABMC). The documentation is available at
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>
>> The patches are based on top of commit
>> 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master)
>>
>> # Introduction
>>
>> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
>> feature only guarantees that RMIDs currently assigned to a processor will
>> be tracked by hardware. The counters of any other RMIDs which are no longer
>> being tracked will be reset to zero. The MBM event counters return
>> "Unavailable" for the RMIDs that are not active.
>>
>> Users can create 256 or more monitor groups. But there can be only limited
>> number of groups that can be give guaranteed monitoring numbers. With ever
>> changing configurations there is no way to definitely know which of these
>> groups will be active for certain point of time. Users do not have the
>> option to monitor a group or set of groups for certain period of time
>> without worrying about RMID being reset in between.
>>
>> The ABMC feature provides an option to the user to assign an RMID to the
>> hardware counter and monitor the bandwidth for a longer duration.
>> The assigned RMID will be active until the user unassigns it manually.
>> There is no need to worry about counters being reset during this period.
>> Additionally, the user can specify a bitmask identifying the specific
>> bandwidth types from the given source to track with the counter.
>
> At a high level, if existing software can't use the counters, I'd prefer we move them into
> perf. We're currently re-inventing the perf wheel. (this argument doesn't hold for the
> llc_occupancy, which is a state not counter!)
>
> But if this lets someone 'pin' the counters for the groups they monitor, then use existing
> tools, that seems a good enough argument for doing this.

I am not sure I fully understand this, but yes: this feature provides the
option to pin the counters to the monitor group.

>
>
>> Without ABMC enabled, monitoring will work in current mode without
>> assignment option.
>
> To check I understand: the counters will get spuriously reset at the whim of the hardware?

[1] Not spuriously. Hardware can keep track of only a certain number of
counters simultaneously (the active counters). If there are more monitor
groups than the hardware can track, then only the most recent associations
are kept active. The active set can change based on user actions (RMID
association changes made by the user).

This feature can help to pin a counter so it does not reset.
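
The behaviour described in [1] can be pictured with a toy model. This is only a sketch of the description in this thread, not of real hardware: the slot count `TRACKED`, the "evict the least recently associated RMID" policy and all names are assumptions made for illustration.

```c
#include <stdint.h>

#define TRACKED 4	/* hypothetical; the real limit is not stated in this thread */

struct slot {
	int rmid;		/* -1 when unused */
	uint64_t bytes;		/* count accumulated while tracked */
	uint64_t last_use;	/* logical time of the latest association */
};

struct tracker {
	struct slot slots[TRACKED];
	uint64_t clock;
};

static void tracker_init(struct tracker *t)
{
	t->clock = 0;
	for (int i = 0; i < TRACKED; i++)
		t->slots[i] = (struct slot){ .rmid = -1 };
}

/* Associate an RMID: refresh its slot if tracked, else evict the oldest. */
static struct slot *tracker_associate(struct tracker *t, int rmid)
{
	struct slot *victim = &t->slots[0];

	for (int i = 0; i < TRACKED; i++) {
		struct slot *s = &t->slots[i];

		if (s->rmid == rmid) {
			s->last_use = ++t->clock;
			return s;
		}
		if (s->last_use < victim->last_use)
			victim = s;
	}
	/* Not tracked: the evicted RMID's count is lost (reset to zero). */
	victim->rmid = rmid;
	victim->bytes = 0;
	victim->last_use = ++t->clock;
	return victim;
}

/* Returns 0 and fills *bytes if tracked, -1 ("Unavailable") otherwise. */
static int tracker_read(const struct tracker *t, int rmid, uint64_t *bytes)
{
	for (int i = 0; i < TRACKED; i++) {
		if (t->slots[i].rmid == rmid) {
			*bytes = t->slots[i].bytes;
			return 0;
		}
	}
	return -1;
}
```

In this model, pinning would amount to exempting a slot from eviction, which is the guarantee ABMC assignment is meant to give.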

>
>
>> # Linux Implementation
>>
>> Linux resctrl subsystem provides the interface to count maximum of two
>> memory bandwidth events per group, from a combination of available total
>> and local events. Keeping the current interface, users can assign a maximum
>> of 2 ABMC counters per group. User will also have the option to assign only
>> one counter to the group. If the system runs out of assignable ABMC
>> counters, kernel will display an error. Users need to unassign an already
>> assigned counter to make space for new assignments.
>>
>>
>> # Examples
>>
>> a. Check if ABMC support is available
>> #mount -t resctrl resctrl /sys/fs/resctrl/
>>
>> #cat /sys/fs/resctrl/info/L3_MON/mon_features
>> llc_occupancy
>> mbm_total_bytes
>> mbm_total_bytes_config
>> mbm_local_bytes
>> mbm_local_bytes_config
>> mbm_assign_capable ← Linux kernel detected ABMC feature
>>
>> b. Check if ABMC is enabled. By default, ABMC feature is disabled.
>> Monitoring works in legacy monitor mode when ABMC is not enabled.
>>
>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>> 0
>>
>> c. There will be new file "monitor_state" for each monitor group when ABMC
>> feature is supported. However, monitor_state is not available if ABMC is
>> disabled.
>> #cat /sys/fs/resctrl/monitor_state
>> Unsupported
>>
>> d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC
>> enabled, monitoring will work in current mode without assignment option.
>>
>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> 779247936
>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>> 765207488
>>
>> e. Enable ABMC mode.
>>
>> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>> 1
>
> Why does this mode need enabling? Can't it be enabled automatically on hardware that
> supports it, or enabled implicitly when the first assignment attempt arrives?
>
> I guess this is really needed for a reset - could we implement that instead? This way
> there isn't an extra step user-space has to do to make the assignments work.

New features are mostly added as opt-in, so I kept it that way. If we
enable this feature automatically, then we have to provide an option to
disable it.

>
>
>> f. Read the monitor states. By default, both total and local MBM
>> events are in "unassign" state.
>>
>> #cat /sys/fs/resctrl/monitor_state
>> total=unassign;local=unassign
>
>
>> g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode,
>> the MBM events are not available until the user assigns the events
>> explicitly.
>
> How does this fit with "monitoring will work in current mode without assignment option.".

See my response above. [1]

> You mentioned the hardware resets the counters when this mode is enabled, does it also
> refuse to count until the MSR is programmed?

Yes. That is correct. We need to program the MSRs to start counting again.

>
> If so - is there any mileage in auto-assigning the first N RMID to counters when the
> groups are created? This way existing user-space tools work until they exceed the limits
> of hardware. From that point a counter needs to be unassigned from another group. (we'd
> need to make it easy to find which groups have a counter assigned)

Yes. That is correct. To show the state of assignment, I have added a
monitor_state file in each group that reports whether counters are assigned
to that group.

>
>
>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> Unsupported
>>
>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>> Unsupported
>>
>> h. The event llc_occupancy is not affected by ABMC mode. Users can still
>> read the llc_occupancy.
>>
>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy
>> 557056
>
> {
> MPAM would be the same - because llc_occupancy isn't a counter but a view of the
> state, it's possible to multiplex a single llc_occupancy counter behind the scenes
> to provide the value for as many groups as needed. I suspect any other
> architecture would have the same property.

ok. Good to know.

> }
>
>> i. Now assign the total event and read the monitor_state.
>>
>> #echo total=assign > /sys/fs/resctrl/monitor_state
>> #cat /sys/fs/resctrl/monitor_state
>> total=assign;local=unassign
>>
>> j. Now that the total event is assigned. Read the total event.
>>
>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> 6136000
>>
>> k. Now assign the local event and read the monitor_state.
>>
>> #echo local=assign > /sys/fs/resctrl/monitor_state
>> #cat /sys/fs/resctrl/monitor_state
>> total=assign;local=assign
>>
>> Users can also assign both total and local events in one single
>> command.
>>
>> l. Now that both total and local events are assigned, read the events.
>>
>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> 6136000
>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>> 58694
>
> (the bandwidth configuration stuff is the existing BMEC support right?)

Yes, correct.

>
> From user-space's perspective MPAM could be made to look the same.
>
> There ought to be some indication to user-space of how many counters it can assign, this
> number might be different for different resources. This won't be a problem today, but if
> we had 'mbm_total_bytes' on the L2 cache, the number of counters may be different.
>
> MPAM platforms are unlikely to support both 'mbm_total' and 'mbm_local', I think this is

Ok. Good to know.

> just a documentation problem to say that mbm_local can't be configured if it's not
> supported - user-space can't blindly assign both.
>
> If the configuration is changed over time - I bet user-space needs a quick way to find
> where the counters are currently assigned - walking the tree to find out is a bit rubbish.
> A file that lists the "control_group_name[/mon_group_name]" would help.

Looks like you already found it here:

https://lore.kernel.org/lkml/c16cac16c813a203390229d77d5ab37ebc923d95.1705688539.git.babu.moger@amd.com/

>
>
> Thanks,
>
> James

--
Thanks
Babu Moger

2024-02-20 21:24:17

by Babu Moger

Subject: Re: [PATCH v2 06/17] x86/resctrl: Introduce interface to display number of ABMC counters

Hi James,

On 2/20/24 12:14, James Morse wrote:
> Hi Babu,
>
> On 19/01/2024 18:22, Babu Moger wrote:
>> The ABMC feature provides an option to the user to pin (or assign) the
>> RMID to the hardware counter and monitor the bandwidth for a longer
>> duration. There are only a limited number of hardware counters.
>>
>> Provide the interface to display the number of ABMC counters supported.
>
>
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index a6c336b6de61..fa492ea820f0 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -823,6 +823,10 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>> resctrl_file_fflags_init("mbm_local_bytes_config",
>> RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
>> }
>> +
>
>> + if (rdt_cpu_has(X86_FEATURE_ABMC))
>
> Please put this in a header and calling it something like
> resctrl_arch_has_assignable_counters(). These X86 feature definition macros aren't
> available on other architectures!

Sure. Will do.

>
>
>> + resctrl_file_fflags_init("mbm_assignable_counters",
>> + RFTYPE_MON_INFO);
>> }
>>
>> l3_mon_evt_init(r);
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 3e233251e7ed..53be5cd1c28e 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -811,6 +811,17 @@ static int rdtgroup_rmid_show(struct kernfs_open_file *of,
>> return ret;
>> }
>>
>> +static int rdtgroup_mbm_assignable_counters_show(struct kernfs_open_file *of,
>> + struct seq_file *s, void *v)
>> +{
>> + struct rdt_resource *r = of->kn->parent->priv;
>
>> + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>
> (After the move out to /fs/ the resctrl_to_arch_res() macro is private to the arch code.
> Needing to do this when providing a value to user-space is the indication that the value
> should be in struct rdt_resource instead!)

Ok. Sure. Will do.

>
>
>> + seq_printf(s, "%d\n", hw_res->mbm_assignable_counters);
>> +
>> + return 0;
>> +}
>> +
>> #ifdef CONFIG_PROC_CPU_RESCTRL
>>
>> /*
>
>
> Thanks,
>
> James

--
Thanks
Babu Moger

2024-02-20 21:28:45

by Babu Moger

Subject: Re: [PATCH v2 04/17] x86/resctrl: Detect Assignable Bandwidth Monitoring feature details

Hi James,

On 2/20/24 11:56, James Morse wrote:
> Hi Babu,
>
> On 19/01/2024 18:22, Babu Moger wrote:
>> ABMC feature details are reported via CPUID Fn8000_0020_EBX_x5.
>> Bits Description
>> 15:0 MAX_ABMC Maximum Supported Assignable Bandwidth
>> Monitoring Counter ID + 1
>>
>> Detect the feature details and update
>> /sys/fs/resctrl/info/L3_MON/mon_features.
>>
>> If the system supports Assignable Bandwidth Monitoring Counters (ABMC),
>> the output will have additional text.
>> $ cat /sys/fs/resctrl/info/L3_MON/mon_features
>> mbm_assign_capable
>>
>> The feature details are documented in APM listed below [1].
>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>> Monitoring (ABMC).
>
>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>> index 4efe2d6a9eb7..f40ee271a5c7 100644
>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>> @@ -303,6 +303,17 @@ static void rdt_get_cdp_l2_config(void)
>> rdt_get_cdp_config(RDT_RESOURCE_L2);
>> }
>>
>> +static void rdt_get_abmc_cfg(struct rdt_resource *r)
>> +{
>> + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>> + u32 eax, ebx, ecx, edx;
>> +
>> + r->mbm_assign_capable = true;
>> + /* Query CPUID_Fn80000020_EBX_x05 for number of ABMC counters */
>> + cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
>
>> + hw_res->mbm_assignable_counters = (ebx & 0xFFFF) + 1;
>
> Please put the mbm_assignable_counters field in struct rdt_resource. The filesystem code
> needs to access this to allocate/free counters and report how many are available.
> After all this gets split and the filesystem code moves to /fs/, the rdt_hw_resource
> structure is inaccessible to the filesystem code.

Ok. Sure.
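
For readers following along, the counter count computed in the hunk above can be sketched in plain user-space C. This is only an illustration under assumptions: abmc_counters_from_ebx() is a hypothetical helper that is not part of the patch, and it simply mirrors the quoted expression that masks bits 15:0 of EBX and adds one.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): mirrors the expression
 * used in rdt_get_abmc_cfg() in the quoted hunk.
 */
static inline uint32_t abmc_counters_from_ebx(uint32_t ebx)
{
	/* Bits 15:0 of CPUID Fn8000_0020_EBX_x5 carry the MAX_ABMC field. */
	return (ebx & 0xFFFFu) + 1u;
}
```

With the low 16 bits of EBX equal to 31, this yields the 32 counters mentioned in the cover letter; the upper 16 bits are ignored.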

>
>
>> +}
>> +
>> static void
>> mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
>> {
>> @@ -815,6 +826,12 @@ static __init bool get_rdt_alloc_resources(void)
>> if (get_slow_mem_config())
>> ret = true;
>>
>> + if (rdt_cpu_has(X86_FEATURE_ABMC)) {
>> + r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>> + rdt_get_abmc_cfg(r);
>
>> + ret = true;
>
> Does it make sense to report rdt_alloc_capable if the SoC has ABMC, but nothing that can
> be configured?

Good catch. Will move it to get_rdt_mon_resources().
>
> This code would probably make more sense in the get_rdt_mon_resources().

Sure.
>
>
>> + }
>> +
>> return ret;
>> }
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>> index a4f1aa15f0a2..01eb0522b42b 100644
>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>> @@ -391,6 +391,8 @@ struct rdt_parse_data {
>> * resctrl_arch_get_num_closid() to avoid confusion
>> * with struct resctrl_schema's property of the same name,
>> * which has been corrected for features like CDP.
>
>> + * @mbm_assignable_counters:
>> + * Maximum number of assignable ABMC counters
>
> As above, please move this to struct rdt_resource. The 'hw' version becomes private to the
> arch code after the move to /fs/.
>
>
>> * @msr_base: Base MSR address for CBMs
>> * @msr_update: Function pointer to update QOS MSRs
>> * @mon_scale: cqm counter * mon_scale = occupancy in bytes
>
>
> Thanks,
>
> James
>

--
Thanks
Babu Moger

2024-02-23 17:18:51

by Reinette Chatre

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)



On 2/20/2024 12:48 PM, Moger, Babu wrote:
> On 2/20/24 09:21, James Morse wrote:
>> On 19/01/2024 18:22, Babu Moger wrote:

>>> e. Enable ABMC mode.
>>>
>>> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>> 1
>>
>> Why does this mode need enabling? Can't it be enabled automatically on hardware that
>> supports it, or enabled implicitly when the first assignment attempt arrives?
>>
>> I guess this is really needed for a reset - could we implement that instead? This way
>> there isn't an extra step user-space has to do to make the assignments work.
>
> Mostly the new features are added as an opt-in method. So, kept it that
> way. If we enable this feature automatically, then we have provide an
> option to disable it.
>

At the same time it sounds to me like ABMC can improve current users'
experience without requiring them to do anything. This sounds appealing.
For example, if I understand correctly, it may be possible to start resctrl
with ABMC enabled by default and the number of monitoring groups (currently
exposed to user space via "num_rmids") limited to the number of counters
supported by ABMC. Existing users would then by default obtain better behavior
of counters not resetting.

The "new feature" could then be viewed as adding support for more monitoring
groups than what hardware can support concurrently.

Reinette

2024-02-23 20:11:37

by Babu Moger

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Reinette,

On 2/23/24 11:17, Reinette Chatre wrote:
>
>
> On 2/20/2024 12:48 PM, Moger, Babu wrote:
>> On 2/20/24 09:21, James Morse wrote:
>>> On 19/01/2024 18:22, Babu Moger wrote:
>
>>>> e. Enable ABMC mode.
>>>>
>>>> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>> 1
>>>
>>> Why does this mode need enabling? Can't it be enabled automatically on hardware that
>>> supports it, or enabled implicitly when the first assignment attempt arrives?
>>>
>>> I guess this is really needed for a reset - could we implement that instead? This way
>>> there isn't an extra step user-space has to do to make the assignments work.
>>
>> Mostly the new features are added as an opt-in method. So, kept it that
>> way. If we enable this feature automatically, then we have provide an
>> option to disable it.
>>
>
> At the same time it sounds to me like ABMC can improve current users'
> experience without requiring them to do anything. This sounds appealing.
> For example, if I understand correctly, it may be possible to start resctrl
> with ABMC enabled by default and the number of monitoring groups (currently
> exposed to user space via "num_rmids") limited to the number of counters
> supported by ABMC. Existing users would then by default obtain better behavior
> of counters not resetting.

Yes, I like the idea. But it will break compatibility with the pqos tool
(intel_cmt_cat utility). pqos monitoring will not work unless the tool
adds support for ABMC enablement. The ABMC feature requires an extra step
to assign the counters before monitoring works.

>
> The "new feature" could then be viewed as adding support for more monitoring
> groups than what hardware can support concurrently.
>
> Reinette

--
Thanks
Babu Moger

2024-02-23 21:47:39

by Moger, Babu

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)


On 2/20/2024 12:11 PM, Peter Newman wrote:
> Hi James,
>
> On Tue, Feb 20, 2024 at 7:21 AM James Morse <[email protected]> wrote:
>> On 16/02/2024 20:18, Peter Newman wrote:
>>> On Thu, Feb 8, 2024 at 9:29 AM Moger, Babu <[email protected]> wrote:
>>>> On 2/5/24 16:38, Reinette Chatre wrote:
>>>>> You have made it clear on several occasions that you do not intend to support
>>>>> domain level assignment. That may be ok but the interface you create should
>>>>> not prevent future support of domain level assignment.
>>>>>
>>>>> If my point is not clear, could you please share how this interface is able to
>>>>> support domain level assignment in the future?
>>>>>
>>>>> I am starting to think that we need a file similar to the schemata file
>>>>> for group and domain level monitor configurations.
>>>> Something like this?
>>>>
>>>> By default
>>>> #cat /sys/fs/resctrl/monitor_state
>>>> default:0=total=assign,local=assign;1=total=assign,local=assign
>>>>
>>>> With ABMC,
>>>> #cat /sys/fs/resctrl/monitor_state
>>>> ABMC:0=total=unassign,local=unassign;1=total=unassign,local=unassign
>>> The benefit from all the string parsing in this interface is only
>>> halving the number of monitor_state sysfs writes we'd need compared to
>>> creating a separate file for mbm_local and mbm_total. Given that our
>>> use case is to assign the 32 assignable counters to read the bandwidth
>>> of ~256 monitoring groups, this isn't a substantial gain to help us. I
>>> think you should just focus on providing the necessary control
>>> granularity without trying to consolidate writes in this interface. I
>>> will propose an additional interface below to optimize our use case.
>>>
>>> Whether mbm_total and mbm_local are combined in the group directories
>>> or not, I don't see why you wouldn't just repeat the same file
>>> interface in the domain directories for a user needing finer-grained
>>> controls.
>> I don't follow why this has to be done globally. resctrl allows CLOSID to have different
>> configurations for different purposes between different domains (as long as tasks are
>> pinned to CPUs). It feels a bit odd that these counters can't be considered as per-domain too.
> Assigning to all domains at once would allow us to better parallelize
> the resulting IPIs when we do need to iterate a small set of monitors
> over a large list of groups.

I am planning to work on v3 of this series. For now, I will exclude the
global assignment option from this series.

We can add global assignment support later, when we get time to optimize
assignments.

Thanks

Babu Moger


2024-02-23 22:21:48

by Reinette Chatre

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On 2/23/2024 12:11 PM, Moger, Babu wrote:
> On 2/23/24 11:17, Reinette Chatre wrote:
>>
>>
>> On 2/20/2024 12:48 PM, Moger, Babu wrote:
>>> On 2/20/24 09:21, James Morse wrote:
>>>> On 19/01/2024 18:22, Babu Moger wrote:
>>
>>>>> e. Enable ABMC mode.
>>>>>
>>>>> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>> 1
>>>>
>>>> Why does this mode need enabling? Can't it be enabled automatically on hardware that
>>>> supports it, or enabled implicitly when the first assignment attempt arrives?
>>>>
>>>> I guess this is really needed for a reset - could we implement that instead? This way
>>>> there isn't an extra step user-space has to do to make the assignments work.
>>>
>>> Mostly the new features are added as an opt-in method. So, kept it that
>>> way. If we enable this feature automatically, then we have provide an
>>> option to disable it.
>>>
>>
>> At the same time it sounds to me like ABMC can improve current users'
>> experience without requiring them to do anything. This sounds appealing.
>> For example, if I understand correctly, it may be possible to start resctrl
>> with ABMC enabled by default and the number of monitoring groups (currently
>> exposed to user space via "num_rmids") limited to the number of counters
>> supported by ABMC. Existing users would then by default obtain better behavior
>> of counters not resetting.
>
> Yes, I like the idea. But i will break compatibility with pqos
> tool(intel_cmt_cat utility). pqos tool monitoring will not work without
> supporting ABMC enablement in the tool. ABMC feature requires an extra
> step to assign the counters for monitor to work.

I am considering two scenarios, the "default behavior" is what a user will
experience when booting resctrl on an ABMC system and the "new feature
behavior" where a user can take full advantage of all that ABMC (and soft
RMID, and MPAM) can offer.

So, first, on an ABMC system in the "default behavior" scenario I expect
that resctrl can do required ABMC counter configuration automatically at
the time a monitor group is created. In this "default behavior" scenario
resctrl would expose "num_rmids" to be half of the number of assignable
counters. When a user then creates a monitor group two counters will be
used and configured to count the local and total bytes respectively. If
two counters are not available then ENOSPC is returned, just like when the
system is out of closid/rmid. With this "default behavior" user space thus gets
improved behavior without making any changes on its part. I do not have
insight into how many counters ABMC could be expected to expose though ...
so some users may be surprised at how few monitor groups can be created
with new hardware? This may not be an issue since that would accurately
reflect how many _reliable_ monitor groups can be created and if user needs
more monitor groups then that would be a time to explore the "new feature"
that requires changes in how user interacts with resctrl.
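
To make the "default behavior" concrete, here is a rough user-space sketch of the allocation policy described above: each new monitor group takes two counters (total and local) or fails with -ENOSPC. All names here (struct cntr_pool, cntr_alloc(), cntr_free(), mon_group_assign_default()) are hypothetical and only illustrate the idea; this is not how resctrl implements it, and the bitmap assumes at most 32 counters.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical counter pool; assumes num_cntrs <= 32. */
struct cntr_pool {
	uint32_t in_use;        /* bit i set => counter i is assigned */
	unsigned int num_cntrs; /* number of assignable counters */
};

/* Return the lowest free counter ID, or -ENOSPC if none is left. */
static int cntr_alloc(struct cntr_pool *p)
{
	for (unsigned int i = 0; i < p->num_cntrs; i++) {
		if (!(p->in_use & (1u << i))) {
			p->in_use |= 1u << i;
			return (int)i;
		}
	}
	return -ENOSPC;
}

static void cntr_free(struct cntr_pool *p, int id)
{
	p->in_use &= ~(1u << id);
}

/* Reserve one counter for total and one for local, or nothing at all. */
static int mon_group_assign_default(struct cntr_pool *p, int *total, int *local)
{
	int t = cntr_alloc(p);
	int l;

	if (t < 0)
		return t;

	l = cntr_alloc(p);
	if (l < 0) {
		cntr_free(p, t); /* roll back so the pool is unchanged */
		return l;
	}

	*total = t;
	*local = l;
	return 0;
}
```

With a pool of 32 counters the first 16 calls succeed and the 17th returns -ENOSPC, which matches "num_rmids" being half of the number of assignable counters.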

Apart from the "default behavior" there are two options to consider ...
(a) the "original" behavior(? I do not know what to call it) - this would be
where user space wants(?) to have the current non-ABMC behavior on an ABMC
system, where the previous "num_rmids" monitor groups can be created but
the counters are reset unpredictably ... should this still be supported
on ABMC systems though?
(b) the "new feature" behavior where user space gets full benefit of ABMC
that allows user space to create any number of monitor groups but then
user space needs to let hardware (via resctrl) know which
events should be counted.

I expect that only (b) above would require user space change. Considering
that per documentation, "num_rmids" means "This is the upper bound for how
many "CTRL_MON" + "MON" groups can be created" I expect that "num_rmids"
becomes undefined when "new feature" is enabled. When this new feature is enabled
then user space is no longer limited by number of RMIDs on how many monitor
groups can be created and this is the point that the user interface that you
and Peter have ideas about comes into play. Specifically, user space needing
a way to specify:
(a) "let me create more monitor groups than the hardware can support"/"let me
control which events/monitor groups are counted"
(like the "mbm_assign" file in your proposal)
(b) "here are the events that need to be counted"
(like the "monitor_state" and "mbm_{local,total}_bytes_assigned" proposals)

Reinette




2024-02-26 18:07:12

by Babu Moger

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Reinette,

On 2/23/24 16:21, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/23/2024 12:11 PM, Moger, Babu wrote:
>> On 2/23/24 11:17, Reinette Chatre wrote:
>>>
>>>
>>> On 2/20/2024 12:48 PM, Moger, Babu wrote:
>>>> On 2/20/24 09:21, James Morse wrote:
>>>>> On 19/01/2024 18:22, Babu Moger wrote:
>>>
>>>>>> e. Enable ABMC mode.
>>>>>>
>>>>>> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>>> 1
>>>>>
>>>>> Why does this mode need enabling? Can't it be enabled automatically on hardware that
>>>>> supports it, or enabled implicitly when the first assignment attempt arrives?
>>>>>
>>>>> I guess this is really needed for a reset - could we implement that instead? This way
>>>>> there isn't an extra step user-space has to do to make the assignments work.
>>>>
>>>> Mostly the new features are added as an opt-in method. So, kept it that
>>>> way. If we enable this feature automatically, then we have provide an
>>>> option to disable it.
>>>>
>>>
>>> At the same time it sounds to me like ABMC can improve current users'
>>> experience without requiring them to do anything. This sounds appealing.
>>> For example, if I understand correctly, it may be possible to start resctrl
>>> with ABMC enabled by default and the number of monitoring groups (currently
>>> exposed to user space via "num_rmids") limited to the number of counters
>>> supported by ABMC. Existing users would then by default obtain better behavior
>>> of counters not resetting.
>>
>> Yes, I like the idea. But i will break compatibility with pqos
>> tool(intel_cmt_cat utility). pqos tool monitoring will not work without
>> supporting ABMC enablement in the tool. ABMC feature requires an extra
>> step to assign the counters for monitor to work.
>
> I am considering two scenarios, the "default behavior" is what a user will
> experience when booting resctrl on an ABMC system and the "new feature
> behavior" where a user can take full advantage of all that ABMC (and soft
> RMID, and MPAM) can offer.
>
> So, first, on an ABMC system in the "default behavior" scenario I expect
> that resctrl can do required ABMC counter configuration automatically at
> the time a monitor group is created. In this "default behavior" scenario
> resctrl would expose "num_rmids" to be half of the number of assignable
> counters. When a user then creates a monitor group two counters will be
> used and configured to count the local and total bytes respectively. If
> two counters are not available then ENOSPC returned, just like when system
> is out of closid/rmid. With this "default behavior" user space thus gets
> improved behavior without making any changes on its part. I do not have

We can automatically assign the h/w counters when a monitor group is
created, until we run out of h/w counters. That is a good idea. By default
the user will not notice any difference in ABMC mode.

> insight into how many counters ABMC could be expected to expose though ...
> so some users may be surprised at how few monitor groups can be created
> with new hardware? This may not be an issue since that would accurately
> reflect how many _reliable_ monitor groups can be created and if user needs
> more monitor groups then that would be a time to explore the "new feature"
> that requires changes in how user interacts with resctrl.

Currently, 32 h/w counters are available to configure. With two counters
for each group, we can create 16 groups (15 new groups plus the default
group). That should be fine, as the pqos tool creates only 16 groups when
it is started.

>
> Apart from the "default behavior" there are two options to consider ...
> (a) the "original" behavior(? I do not know what to call it) - this would be
> where user space wants(?) to have the current non-ABMC behavior on an ABMC
> system, where the previous "num_rmids" monitor groups can be created but
> the counters are reset unpredictably ... should this still be supported
> on ABMC systems though?

I would say yes. If for some reason (hardware or software issues) the user
is not able to use ABMC mode, they have the option to go back to legacy mode.

> (b) the "new feature" behavior where user space gets full benefit of ABMC
> that allows user space to create any number of monitor groups but then
> user space needs to let hardware (via resctrl) know which
> events should be counted.

Is this "new feature" enabled by default when ABMC is available?

Or do we need to provide an interface to enable this feature?


>
> I expect that only (b) above would require user space change. Considering
> that per documentation, "num_rmids" means "This is the upper bound for how
> many "CTRL_MON" + "MON" groups can be created" I expect that "num_rmids"
> becomes undefined when "new feature" is enabled. When this new feature is enabled
> then user space is no longer limited by number of RMIDs on how many monitor

With ABMC, we will have a new field "mbm_assignable_counters". We don't
have to change the definition of "num_rmids".

> groups can be created and this is the point that the user interface that you
> and Peter have ideas about comes into play. Specifically, user space needing
> a way to specify:
> (a) "let me create more monitor groups that the hardware can support"/"let me
> control which events/monitor groups are counted"
> (like the "mbm_assign" file in your proposal)
> (b) "here are the events that need to be counted"
> (like the "monitor_state" and "mbm_{local,total}_bytes_assigned" proposals)

With the global assignment option out of the way for now (it may be
introduced later), we can provide two interfaces.

1. /sys/fs/resctrl/info/L3_MON/mbm_assign
This will be enabled by default when ABMC is available. Users can disable
this option to go back to legacy mode.

2. /sys/fs/resctrl/monitor_state.
This can be used to individually assign or unassign the counters in each group.

When assigned:
#cat /sys/fs/resctrl/monitor_state
0=total-assign,local-assign;1=total-assign,local-assign

When unassigned:
#cat /sys/fs/resctrl/monitor_state
0=total-unassign,local-unassign;1=total-unassign,local-unassign
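
To show what a tool would have to do with this format, here is a small user-space parsing sketch. It assumes the per-domain layout shown above ("<domain>=total-<state>,local-<state>" entries separated by ';') stays as proposed, which is not settled. struct dom_state and parse_monitor_state() are hypothetical names for illustration only.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-domain view of one monitor_state entry. */
struct dom_state {
	int  id;
	bool total_assigned;
	bool local_assigned;
};

/* Map "assign"/"unassign" to a bool; returns -1 for anything else. */
static int parse_state(const char *s, bool *assigned)
{
	if (strcmp(s, "assign") == 0)
		*assigned = true;
	else if (strcmp(s, "unassign") == 0)
		*assigned = false;
	else
		return -1;
	return 0;
}

/*
 * Parse e.g. "0=total-assign,local-unassign;1=total-assign,local-assign".
 * Returns the number of domains parsed, or -1 on malformed input or
 * when more than 'max' domains are present.
 */
static int parse_monitor_state(const char *s, struct dom_state *out, int max)
{
	int n = 0;

	while (*s && *s != '\n') {
		char total[16], local[16];
		int id, used = 0;

		if (n == max)
			return -1;
		if (sscanf(s, "%d=total-%15[a-z],local-%15[a-z]%n",
			   &id, total, local, &used) != 3)
			return -1;
		out[n].id = id;
		if (parse_state(total, &out[n].total_assigned) ||
		    parse_state(local, &out[n].local_assigned))
			return -1;
		n++;

		s += used;
		if (*s == ';')
			s++;            /* move on to the next domain entry */
		else if (*s && *s != '\n')
			return -1;      /* unexpected trailing characters */
	}
	return n;
}
```

The amount of string handling needed even for this simple case is worth keeping in mind when deciding on the final format.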


Thoughts?
--
Thanks
Babu Moger

2024-02-26 21:22:17

by Reinette Chatre

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On 2/26/2024 9:59 AM, Moger, Babu wrote:
> On 2/23/24 16:21, Reinette Chatre wrote:
>> On 2/23/2024 12:11 PM, Moger, Babu wrote:
>>> On 2/23/24 11:17, Reinette Chatre wrote:
>>>>
>>>>
>>>> On 2/20/2024 12:48 PM, Moger, Babu wrote:
>>>>> On 2/20/24 09:21, James Morse wrote:
>>>>>> On 19/01/2024 18:22, Babu Moger wrote:
>>>>
>>>>>>> e. Enable ABMC mode.
>>>>>>>
>>>>>>> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>>>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>>>> 1
>>>>>>
>>>>>> Why does this mode need enabling? Can't it be enabled automatically on hardware that
>>>>>> supports it, or enabled implicitly when the first assignment attempt arrives?
>>>>>>
>>>>>> I guess this is really needed for a reset - could we implement that instead? This way
>>>>>> there isn't an extra step user-space has to do to make the assignments work.
>>>>>
>>>>> Mostly the new features are added as an opt-in method. So, kept it that
>>>>> way. If we enable this feature automatically, then we have provide an
>>>>> option to disable it.
>>>>>
>>>>
>>>> At the same time it sounds to me like ABMC can improve current users'
>>>> experience without requiring them to do anything. This sounds appealing.
>>>> For example, if I understand correctly, it may be possible to start resctrl
>>>> with ABMC enabled by default and the number of monitoring groups (currently
>>>> exposed to user space via "num_rmids") limited to the number of counters
>>>> supported by ABMC. Existing users would then by default obtain better behavior
>>>> of counters not resetting.
>>>
>>> Yes, I like the idea. But i will break compatibility with pqos
>>> tool(intel_cmt_cat utility). pqos tool monitoring will not work without
>>> supporting ABMC enablement in the tool. ABMC feature requires an extra
>>> step to assign the counters for monitor to work.
>>
>> I am considering two scenarios, the "default behavior" is what a user will
>> experience when booting resctrl on an ABMC system and the "new feature
>> behavior" where a user can take full advantage of all that ABMC (and soft
>> RMID, and MPAM) can offer.
>>
>> So, first, on an ABMC system in the "default behavior" scenario I expect
>> that resctrl can do required ABMC counter configuration automatically at
>> the time a monitor group is created. In this "default behavior" scenario
>> resctrl would expose "num_rmids" to be half of the number of assignable
>> counters. When a user then creates a monitor group two counters will be
>> used and configured to count the local and total bytes respectively. If
>> two counters are not available then ENOSPC returned, just like when system
>> is out of closid/rmid. With this "default behavior" user space thus gets
>> improved behavior without making any changes on its part. I do not have
>
> We can automatically assign the h/w counter when monitor group is created
> until we run out of h/w counters. That is good idea. By default user will
> not notice any difference in ABMC mode.
>
>> insight into how many counters ABMC could be expected to expose though ...
>> so some users may be surprised at how few monitor groups can be created
>> with new hardware? This may not be an issue since that would accurately
>> reflect how many _reliable_ monitor groups can be created and if user needs
>> more monitor groups then that would be a time to explore the "new feature"
>> that requires changes in how user interacts with resctrl.
>
> Currently, 32 h/w counters are available to configure. With two counters
> for each group, we can create 16 groups(15 new groups plus the default
> group). That should be fine as pqos tool creates only 16 groups when it is
> started.

User space can never assume that a certain number of groups can
be created.

>> Apart from the "default behavior" there are two options to consider ...
>> (a) the "original" behavior(? I do not know what to call it) - this would be
>> where user space wants(?) to have the current non-ABMC behavior on an ABMC
>> system, where the previous "num_rmids" monitor groups can be created but
>> the counters are reset unpredictably ... should this still be supported
>> on ABMC systems though?
>
> I would say yes. For some reason user(hardware or software issues) is not
> able to use ABMC mode, they have an option to go back to legacy mode.

I see. Should this perhaps be protected behind the resctrl "debug" mount option?

>> (b) the "new feature" behavior where user space gets full benefit of ABMC
>> that allows user space to create any number of monitor groups but then
>> user space needs to let hardware (via resctrl) know which
>> events should be counted.
>
> Is this "new feature" is enabled by default when ABMC is available?

Not in this design, no. In these scenarios ABMC will be available and enabled
in both the "default" and "new feature" behavior. The difference is that no
user space changes are needed in the "default" scenario, where resctrl limits
the number of monitor groups so that every monitor group is backed by hardware
counters.
When the "new feature" is enabled on a system where ABMC is available and
enabled, user space is able to create more monitor groups than there are
hardware counters, and a new user interface is required to manage associating
counters with monitor events.

>
> Or we need to provide an interface to enable this feature?

Yes, an interface will be needed to enable this feature.

>
>
>>
>> I expect that only (b) above would require user space change. Considering
>> that per documentation, "num_rmids" means "This is the upper bound for how
>> many "CTRL_MON" + "MON" groups can be created" I expect that "num_rmids"
>> becomes undefined when "new feature" is enabled. When this new feature is enabled
>> then user space is no longer limited by number of RMIDs on how many monitor
>
> With ABMC, we will have a new field "mbm_assignable_counters". We don't
> have to change the definition of "num_rmids".

The problem here is that "num_rmids" is (as per Documentation/arch/x86/resctrl.rst)
documented to be an upper bound for how many monitor groups can be created.
As I understand, when ABMC is enabled and its full capability exposed to user
space then there is no limit to how many monitor groups can be created, no?

For example, if I understand correctly, theoretically, when ABMC is enabled then
"num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
is not unsigned, tbd if number of directories may also be limited by kernfs).
User space could theoretically create more monitor groups than the number of
rmids that a resource claims to support using current upstream enumeration.
Instead, it is the "mbm_assignable_counters" that is of interest, that is what
user space uses to determine how many of the (potentially very large number of)
monitor groups/monitor events can be counted at any particular time.
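
As a rough illustration of how user space could pick up that value, here is a minimal sketch that reads a single integer from a resctrl info file. The path is an assumption based on the v2 patch registering "mbm_assignable_counters" with RFTYPE_MON_INFO, and read_info_value() is a hypothetical helper, not an existing API.

```c
#include <stdio.h>

/* Assumed location; the file name and directory may change in later versions. */
#define MBM_ASSIGNABLE_PATH \
	"/sys/fs/resctrl/info/L3_MON/mbm_assignable_counters"

/*
 * Read one non-negative decimal integer from a resctrl info file.
 * Returns the value, or -1 if the file is missing or malformed.
 */
static long read_info_value(const char *path)
{
	FILE *f = fopen(path, "r");
	long val;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1 || val < 0)
		val = -1;
	fclose(f);
	return val;
}
```

A tool would call read_info_value(MBM_ASSIGNABLE_PATH) and treat -1 as "assignable counters not supported here", rather than inferring anything from "num_rmids".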

>> groups can be created and this is the point that the user interface that you
>> and Peter have ideas about comes into play. Specifically, user space needing
>> a way to specify:
>> (a) "let me create more monitor groups that the hardware can support"/"let me
>> control which events/monitor groups are counted"
>> (like the "mbm_assign" file in your proposal)
>> (b) "here are the events that need to be counted"
>> (like the "monitor_state" and "mbm_{local,total}_bytes_assigned" proposals)
>
> With global assignment option out of way for now(may be introduced later),
> we can provide two interfaces.
>
> 1. /sys/fs/resctrl/info/L3_MON/mbm_assign
> This will be enabled by default when ABMC is available. Users can disable
> this option to go back to legacy mode.

Potentially (all names are placeholders, and each will only be visible on systems
that actually support that particular mode):
legacy [default] new_feature soft_rmid

>
> 2. /sys/fs/resctrl/monitor_state.
> This can used to individually assign or unassign the counters in each group.
>
> When assigned:
> #cat /sys/fs/resctrl/monitor_state
> 0=total-assign,local-assign;1=total-assign,local-assign
>
> When unassigned:
> #cat /sys/fs/resctrl/monitor_state
> 0=total-unassign,local-unassign;1=total-unassign,local-unassign
>
>
> Thoughts?

How do you expect this interface to be used? I understand the mechanics
of this interface but on a higher level, do you expect user space to
once in a while assign a new counter to a single event or monitor group
(for which a fine grained interface works) or do you expect user space to
shift multiple counters across several monitor events at intervals?

Across resctrl's lifetime we have seen examples of user space wanting
to accomplish more with a single resctrl interaction. For example, the
support you added for moving multiple tasks to a group, and the feature
from Peter for moving a monitor group.

I thus think that it would be valuable to consider more efficient
interfaces from the beginning. I do not think that this is the type
of work that is an optimization to be delayed until an unspecified later
time. Instead, the multiple ways the interface may be used can be considered
from the start, with the most suitable interface created from the beginning.
Specifically, why does resctrl need to be "extended" at a later time to support
a global assignment as proposed by Peter? Why can it not be done as the original
and (ideally) only mechanism?

Reinette

2024-02-27 18:28:21

by Peter Newman

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On Tue, Feb 27, 2024 at 10:12 AM Moger, Babu wrote:
>
> On 2/26/24 15:20, Reinette Chatre wrote:
> >
> > For example, if I understand correctly, theoretically, when ABMC is enabled then
> > "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
> > is not unsigned, tbd if number of directories may also be limited by kernfs).
> > User space could theoretically create more monitor groups than the number of
> > rmids that a resource claims to support using current upstream enumeration.
>
> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
> more than this limit(r->num_rmid).
>
> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
> RMID to assign the monitoring. So, assignment limit is
> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.

That is not entirely true. As long as you don't need to maintain
bandwidth counts for unassigned monitoring groups, there's no need to
allocate a HW RMID to a monitoring group.

In my soft-ABMC prototype, where a limited number of HW RMIDs are
allocated to assigned monitoring groups, I was forced to replace the
HW RMID value stored in the task_struct with a pointer to the struct
mongroup, since the RMID value assigned to the mongroup would
frequently change, resulting in excessive walks down the tasklist to
find all of the tasks using the previous value.
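
A rough sketch of that indirection, for illustration only. This is not code
from the prototype: the names MonGroup, Task, hw_rmid and rmid_for are
hypothetical stand-ins for struct mongroup, task_struct and the lookup done
when the RMID is needed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MonGroup:
    """Hypothetical stand-in for the kernel's struct mongroup."""
    name: str
    hw_rmid: Optional[int] = None  # None while no HW RMID is allocated


@dataclass
class Task:
    """Hypothetical stand-in for task_struct: holds a group reference, not an RMID."""
    pid: int
    group: MonGroup


def rmid_for(task: Task) -> Optional[int]:
    # Resolved at use time, so it always reflects the RMID currently
    # allocated to the task's group.
    return task.group.hw_rmid


grp = MonGroup("mon1")
tasks = [Task(pid, grp) for pid in range(1000, 1004)]

grp.hw_rmid = 5   # group becomes assigned
first = [rmid_for(t) for t in tasks]

grp.hw_rmid = 9   # RMID changes: one store, no walk over the task list
second = [rmid_for(t) for t in tasks]
```

With the RMID stored per task instead, the second step would require visiting
every task in the group to update its copy.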

However, the number of hardware monitor group identifiers supported
(i.e., RMID, PARTID:PMG) is usually high enough that I don't think
there's much motivation to support unlimited monitoring groups. In
both soft-RMID and soft-ABMC, I didn't bother supporting more groups
than num_rmids, because the number was large enough.

-Peter

2024-02-27 18:42:49

by Babu Moger

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Reinette,

On 2/26/24 15:20, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>> On 2/23/24 16:21, Reinette Chatre wrote:
>>> On 2/23/2024 12:11 PM, Moger, Babu wrote:
>>>> On 2/23/24 11:17, Reinette Chatre wrote:
>>>>>
>>>>>
>>>>> On 2/20/2024 12:48 PM, Moger, Babu wrote:
>>>>>> On 2/20/24 09:21, James Morse wrote:
>>>>>>> On 19/01/2024 18:22, Babu Moger wrote:
>>>>>
>>>>>>>> e. Enable ABMC mode.
>>>>>>>>
>>>>>>>> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>>>>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>>>>> 1
>>>>>>>
>>>>>>> Why does this mode need enabling? Can't it be enabled automatically on hardware that
>>>>>>> supports it, or enabled implicitly when the first assignment attempt arrives?
>>>>>>>
>>>>>>> I guess this is really needed for a reset - could we implement that instead? This way
>>>>>>> there isn't an extra step user-space has to do to make the assignments work.
>>>>>>
>>>>>> Mostly the new features are added as an opt-in method. So, kept it that
>>>>>> way. If we enable this feature automatically, then we have provide an
>>>>>> option to disable it.
>>>>>>
>>>>>
>>>>> At the same time it sounds to me like ABMC can improve current users'
>>>>> experience without requiring them to do anything. This sounds appealing.
>>>>> For example, if I understand correctly, it may be possible to start resctrl
>>>>> with ABMC enabled by default and the number of monitoring groups (currently
>>>>> exposed to user space via "num_rmids") limited to the number of counters
>>>>> supported by ABMC. Existing users would then by default obtain better behavior
>>>>> of counters not resetting.
>>>>
>>>> Yes, I like the idea. But i will break compatibility with pqos
>>>> tool(intel_cmt_cat utility). pqos tool monitoring will not work without
>>>> supporting ABMC enablement in the tool. ABMC feature requires an extra
>>>> step to assign the counters for monitor to work.
>>>
>>> I am considering two scenarios, the "default behavior" is what a user will
>>> experience when booting resctrl on an ABMC system and the "new feature
>>> behavior" where a user can take full advantage of all that ABMC (and soft
>>> RMID, and MPAM) can offer.
>>>
>>> So, first, on an ABMC system in the "default behavior" scenario I expect
>>> that resctrl can do required ABMC counter configuration automatically at
>>> the time a monitor group is created. In this "default behavior" scenario
>>> resctrl would expose "num_rmids" to be half of the number of assignable
>>> counters. When a user then creates a monitor group two counters will be
>>> used and configured to count the local and total bytes respectively. If
>>> two counters are not available then ENOSPC returned, just like when system
>>> is out of closid/rmid. With this "default behavior" user space thus gets
>>> improved behavior without making any changes on its part. I do not have
>>
>> We can automatically assign the h/w counter when monitor group is created
>> until we run out of h/w counters. That is good idea. By default user will
>> not notice any difference in ABMC mode.
>>
>>> insight into how many counters ABMC could be expected to expose though ...
>>> so some users may be surprised at how few monitor groups can be created
>>> with new hardware? This may not be an issue since that would accurately
>>> reflect how many _reliable_ monitor groups can be created and if user needs
>>> more monitor groups then that would be a time to explore the "new feature"
>>> that requires changes in how user interacts with resctrl.
>>
>> Currently, 32 h/w counters are available to configure. With two counters
>> for each group, we can create 16 groups(15 new groups plus the default
>> group). That should be fine as pqos tool creates only 16 groups when it is
>> started.
>
> user space can never assume that a certain number of groups can
> be created.
>
>>> Apart from the "default behavior" there are two options to consider ...
>>> (a) the "original" behavior(? I do not know what to call it) - this would be
>>> where user space wants(?) to have the current non-ABMC behavior on an ABMC
>>> system, where the previous "num_rmids" monitor groups can be created but
>>> the counters are reset unpredictably ... should this still be supported
>>> on ABMC systems though?
>>
>> I would say yes. For some reason user(hardware or software issues) is not
>> able to use ABMC mode, they have an option to go back to legacy mode.
>
> I see. Should this perhaps be protected behind the resctrl "debug" mount option?

The debug option gives the wrong impression. It is better to keep the option
open to enable the feature in normal mode.

>
>>> (b) the "new feature" behavior where user space gets full benefit of ABMC
>>> that allows user space to create any number of monitor groups but then
>>> user space needs to let hardware (via resctrl) know which
>>> events should be counted.
>>
>> Is this "new feature" is enabled by default when ABMC is available?
>
> Not in this design, no. In these scenarios ABMC will be available and enabled
> in both the "default" and "new feature" behavior. The difference is no user
> space changes are needed in "default" scenario and resctrl limits the number
> of monitor groups to support all monitor groups to be backed by hardware
> counters.
> When "new feature" is enabled when ABMC is available and enabled then
> user space is able to create more monitor groups than available hardware
> counters and new user interface is required to manage associating counters
> with monitor events.

ok. That sounds good.

>
>>
>> Or we need to provide an interface to enable this feature?
>
> Yes, an interface will be needed to enable this feature.

ok.

>
>>
>>
>>>
>>> I expect that only (b) above would require user space change. Considering
>>> that per documentation, "num_rmids" means "This is the upper bound for how
>>> many "CTRL_MON" + "MON" groups can be created" I expect that "num_rmids"
>>> becomes undefined when "new feature" is enabled. When this new feature is enabled
>>> then user space is no longer limited by number of RMIDs on how many monitor
>>
>> With ABMC, we will have a new field "mbm_assignable_counters". We don't
>> have to change the definition of "num_rmids".
>
> The problem here is that "num_rmids" is (as per Documentation/arch/x86/resctrl.rst)
> documented to be an upper bound for how many monitor groups can be created.
> As I understand, when ABMC is enabled and its full capability exposed to user
> space then there is no limit to how many monitor groups can be created, no?

No. That is not correct. The number of monitor groups is still limited by
num_rmids. But assignment is limited by mbm_assignable_counters. More below.

>
> For example, if I understand correctly, theoretically, when ABMC is enabled then
> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
> is not unsigned, tbd if number of directories may also be limited by kernfs).
> User space could theoretically create more monitor groups than the number of
> rmids that a resource claims to support using current upstream enumeration.

CPU or task association still uses PQR_ASSOC (MSR C8Fh). There are only 11
bits (depending on the specific h/w) to represent RMIDs. So, we cannot create
more groups than this limit (r->num_rmid).

In the case of ABMC, h/w pairs a separate counter (their number is given by
mbm_assignable_counters) with an RMID to do the monitoring. So, the assignment
limit is mbm_assignable_counters. The limit on the number of mon groups is
still r->num_rmid.
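
To make the two limits concrete, here is a small arithmetic sketch. The 11-bit
RMID width and the 32 counters are the figures mentioned in this thread and
depend on the hardware; the function names are made up for illustration, and
the real r->num_rmid is enumerated from hardware rather than computed like
this.

```python
def rmid_encoding_limit(rmid_bits: int) -> int:
    """Number of distinct RMID values that fit in rmid_bits bits."""
    return 1 << rmid_bits


def max_fully_assigned_groups(counters: int, events_per_group: int = 2) -> int:
    """Groups that can have all their events counted at the same time."""
    return counters // events_per_group


RMID_BITS = 11   # figure from this thread; depends on the specific h/w
COUNTERS = 32    # mbm_assignable_counters as quoted in this thread

print(rmid_encoding_limit(RMID_BITS))          # ceiling on mon groups
print(max_fully_assigned_groups(COUNTERS))     # total + local per group
print(max_fully_assigned_groups(COUNTERS, 1))  # one event per group
```

The first number bounds how many groups can exist; the other two bound how
many of them can be counted at any one time.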

> Instead, it is the "mbm_assignable_counters" that is of interest, that is what
> user space uses to determine how many of the (potentially very large number of)
> monitor groups/monitor events can be counted at any particular time.
>
>>> groups can be created and this is the point that the user interface that you
>>> and Peter have ideas about comes into play. Specifically, user space needing
>>> a way to specify:
>>> (a) "let me create more monitor groups that the hardware can support"/"let me
>>> control which events/monitor groups are counted"
>>> (like the "mbm_assign" file in your proposal)
>>> (b) "here are the events that need to be counted"
>>> (like the "monitor_state" and "mbm_{local,total}_bytes_assigned" proposals)
>>
>> With global assignment option out of way for now(may be introduced later),
>> we can provide two interfaces.
>>
>> 1. /sys/fs/resctrl/info/L3_MON/mbm_assign
>> This will be enabled by default when ABMC is available. Users can disable
>> this option to go back to legacy mode.
>
> Potentially (all naming placeholders that will only be visible on systems that
> actually supports particular mode):
> legacy [default] new_feature soft_rmid

ok

>
>>
>> 2. /sys/fs/resctrl/monitor_state.
>> This can used to individually assign or unassign the counters in each group.
>>
>> When assigned:
>> #cat /sys/fs/resctrl/monitor_state
>> 0=total-assign,local-assign;1=total-assign,local-assign
>>
>> When unassigned:
>> #cat /sys/fs/resctrl/monitor_state
>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign
>>
>>
>> Thoughts?
>
> How do you expect this interface to be used? I understand the mechanics
> of this interface but on a higher level, do you expect user space to
> once in a while assign a new counter to a single event or monitor group
> (for which a fine grained interface works) or do you expect user space to
> shift multiple counters across several monitor events at intervals?

I think we should provide both options. I was thinking of providing the
fine-grained interface first.

A few use cases:
1. User wants to assign only one event (total or local) per group.
In this case, they can assign 32 events in 32 different groups.

#echo 0=total-assign > /sys/fs/resctrl/monitor_state
or
#echo 0=local-assign > /sys/fs/resctrl/monitor_state

When done:

#echo 0=total-unassign > /sys/fs/resctrl/monitor_state
or
#echo 0=local-unassign > /sys/fs/resctrl/monitor_state

Note: 0 is the domain here.


2. User wants to assign both "local" and "total" events per group. In this
case, they can assign 32 events in 16 different groups.

#echo 0=local-assign,total-assign > /sys/fs/resctrl/monitor_state

When done:

#echo 0=local-unassign,total-unassign > /sys/fs/resctrl/monitor_state

3. A combination of 1 and 2.

4. Assign counters to multiple groups at once. I consider this as global
assignment. This can be achieved with 1 and 2 by user space looping through
all the groups of interest. Peter is worried about system call latency
here. He wants to optimize this. I was thinking this can be done later.
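
The counter budget behind use cases 1 to 3 can be sketched as below. This is
only a model under stated assumptions: CounterPool and groups_that_fit are
hypothetical names, the 32-counter figure is the one quoted in this thread,
the string format follows the placeholder "monitor_state" proposal, and the
use of ENOSPC is an assumption borrowed from the "default behavior"
discussion.

```python
import errno

TOTAL_COUNTERS = 32  # figure quoted in this thread; hardware dependent


class CounterPool:
    """Hypothetical model of the pool of assignable h/w counters."""

    def __init__(self, total: int = TOTAL_COUNTERS):
        self.free = total
        self.assigned = set()  # (group, event) pairs holding a counter

    def assign(self, group: str, *events: str) -> str:
        new = [(group, e) for e in events if (group, e) not in self.assigned]
        if len(new) > self.free:
            # errno choice is an assumption, not part of the proposal
            raise OSError(errno.ENOSPC, "no free counters")
        self.assigned.update(new)
        self.free -= len(new)
        # String user space would write for domain 0 (placeholder format)
        return "0=" + ",".join(f"{e}-assign" for e in events)


def groups_that_fit(*events: str) -> int:
    """Count how many groups can be assigned the given events before failure."""
    pool = CounterPool()
    count = 0
    while True:
        try:
            pool.assign(f"grp{count}", *events)
        except OSError:
            return count
        count += 1


request = CounterPool().assign("grp0", "local", "total")
```

Here groups_that_fit("total") models use case 1 and
groups_that_fit("total", "local") models use case 2.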

>
> Across resctrl's lifetime we have seen examples of user space wanting
> to accomplish more with a single resctrl interaction. For example moving
> multiple tasks to a group that you added support for and moving a monitor
> group feature from Peter.
>
> I thus think that it would be valuable to consider more efficient
> interfaces from the beginning. I do not think that this is the type
> of work that is an optimization to be delayed until an unspecified later
> time, but instead multiple usage of interface can be considered from the
> start with a most optimal interface created from the beginning. Specifically,
> why does resctrl need to be "extended" to support a global assignment as proposed
> by Peter at a later time, why can it not be done as the original and (ideally)
> only mechanism?
>
> Reinette

--
Thanks
Babu Moger

2024-02-27 20:07:19

by Babu Moger

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Peter,

On 2/27/24 12:26, Peter Newman wrote:
> Hi Babu,
>
> On Tue, Feb 27, 2024 at 10:12 AM Moger, Babu wrote:
>>
>> On 2/26/24 15:20, Reinette Chatre wrote:
>>>
>>> For example, if I understand correctly, theoretically, when ABMC is enabled then
>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
>>> is not unsigned, tbd if number of directories may also be limited by kernfs).
>>> User space could theoretically create more monitor groups than the number of
>>> rmids that a resource claims to support using current upstream enumeration.
>>
>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
>> more than this limit(r->num_rmid).
>>
>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
>> RMID to assign the monitoring. So, assignment limit is
>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.
>
> That is not entirely true. As long as you don't need to maintain
> bandwidth counts for unassigned monitoring groups, there's no need to
> allocate a HW RMID to a monitoring group.

We don't need to allocate a h/w counter for an unassigned group.
My proposal is to allocate a h/w counter only if the user requests an assignment.
The limit for assigned events at a time is mbm_assignable_counters (32 right
now).

>
> In my soft-ABMC prototype, where a limited number of HW RMIDs are
> allocated to assigned monitoring groups, I was forced to replace the
> HW RMID value stored in the task_struct to a pointer to the struct
> mongroup, since the RMID value assigned to the mongroup would
> frequently change, resulting in excessive walks down the tasklist to
> find all of the tasks using the previous value.
>
> However, the number of hardware monitor group identifiers supported
> (i.e., RMID, PARTID:PMG) is usually high enough that I don't think
> there's much motivation to support unlimited monitoring groups. In
> both soft-RMID and soft-ABMC, I didn't bother supporting more groups
> than num_rmids, because the number was large enough.

What is soft-ABMC?

--
Thanks
Babu Moger

2024-02-27 20:34:06

by Peter Newman

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On Tue, Feb 27, 2024 at 11:37 AM Moger, Babu wrote:
> On 2/27/24 12:26, Peter Newman wrote:
> > On Tue, Feb 27, 2024 at 10:12 AM Moger, Babu wrote:
> >>
> >> On 2/26/24 15:20, Reinette Chatre wrote:
> >>>
> >>> For example, if I understand correctly, theoretically, when ABMC is enabled then
> >>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
> >>> is not unsigned, tbd if number of directories may also be limited by kernfs).
> >>> User space could theoretically create more monitor groups than the number of
> >>> rmids that a resource claims to support using current upstream enumeration.
> >>
> >> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
> >> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
> >> more than this limit(r->num_rmid).
> >>
> >> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
> >> RMID to assign the monitoring. So, assignment limit is
> >> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.
> >
> > That is not entirely true. As long as you don't need to maintain
> > bandwidth counts for unassigned monitoring groups, there's no need to
> > allocate a HW RMID to a monitoring group.
>
> We don't need to allocate a h/w counter for unassigned group.
> My proposal is to allocate h/w counter only if user requests a assignment.
> The limit for assigned events at time is mbm_assignable_counters(32 right
> now).

I said "RMID", not "counter". The point is, the main purpose served by
the RMID in an unassigned mongroup is providing a unique value to
write into the task_struct to indicate group membership.

>
> >
> > In my soft-ABMC prototype, where a limited number of HW RMIDs are
> > allocated to assigned monitoring groups, I was forced to replace the
> > HW RMID value stored in the task_struct to a pointer to the struct
> > mongroup, since the RMID value assigned to the mongroup would
> > frequently change, resulting in excessive walks down the tasklist to
> > find all of the tasks using the previous value.
> >
> > However, the number of hardware monitor group identifiers supported
> > (i.e., RMID, PARTID:PMG) is usually high enough that I don't think
> > there's much motivation to support unlimited monitoring groups. In
> > both soft-RMID and soft-ABMC, I didn't bother supporting more groups
> > than num_rmids, because the number was large enough.
>
> What is soft-ABMC?

It's the term I'm using to describe[1] the approach of using the
monitor assignment interface to allocate a small number of RMIDs to
monitoring groups.

-Peter

[1] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/

2024-02-27 20:42:52

by Babu Moger

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Peter,

On 2/27/24 14:06, Peter Newman wrote:
> Hi Babu,
>
> On Tue, Feb 27, 2024 at 11:37 AM Moger, Babu wrote:
>> On 2/27/24 12:26, Peter Newman wrote:
>>> On Tue, Feb 27, 2024 at 10:12 AM Moger, Babu wrote:
>>>>
>>>> On 2/26/24 15:20, Reinette Chatre wrote:
>>>>>
>>>>> For example, if I understand correctly, theoretically, when ABMC is enabled then
>>>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
>>>>> is not unsigned, tbd if number of directories may also be limited by kernfs).
>>>>> User space could theoretically create more monitor groups than the number of
>>>>> rmids that a resource claims to support using current upstream enumeration.
>>>>
>>>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
>>>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
>>>> more than this limit(r->num_rmid).
>>>>
>>>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
>>>> RMID to assign the monitoring. So, assignment limit is
>>>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.
>>>
>>> That is not entirely true. As long as you don't need to maintain
>>> bandwidth counts for unassigned monitoring groups, there's no need to
>>> allocate a HW RMID to a monitoring group.
>>
>> We don't need to allocate a h/w counter for unassigned group.
>> My proposal is to allocate h/w counter only if user requests a assignment.
>> The limit for assigned events at time is mbm_assignable_counters(32 right
>> now).
>
> I said "RMID", not "counter". The point is, the main purpose served by
> the RMID in an unassigned mongroup is providing a unique value to
> write into the task_struct to indicate group membership.

In the case of ABMC, CPU (or task) association still uses the RMID value stored
in the "struct mongroup" data structure. The same value is written to
PQR_ASSOC (MSR C8Fh). It needs to be a valid value. Hope that makes sense.

>
>>
>>>
>>> In my soft-ABMC prototype, where a limited number of HW RMIDs are
>>> allocated to assigned monitoring groups, I was forced to replace the
>>> HW RMID value stored in the task_struct to a pointer to the struct
>>> mongroup, since the RMID value assigned to the mongroup would
>>> frequently change, resulting in excessive walks down the tasklist to
>>> find all of the tasks using the previous value.

You are using this pointer as a unique value. This will work as long as you
are not writing this value to the PQR_ASSOC MSR.


>>>
>>> However, the number of hardware monitor group identifiers supported
>>> (i.e., RMID, PARTID:PMG) is usually high enough that I don't think
>>> there's much motivation to support unlimited monitoring groups. In
>>> both soft-RMID and soft-ABMC, I didn't bother supporting more groups
>>> than num_rmids, because the number was large enough.
>>
>> What is soft-ABMC?
>
> It's the term I'm using to describe[1] the approach of using the
> monitor assignment interface to allocate a small number of RMIDs to
> monitoring groups.
>
> -Peter
>
> [1] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/

--
Thanks
Babu Moger

2024-02-27 23:50:43

by Reinette Chatre

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On 2/27/2024 10:12 AM, Moger, Babu wrote:
> On 2/26/24 15:20, Reinette Chatre wrote:
>> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>>> On 2/23/24 16:21, Reinette Chatre wrote:

>>>> Apart from the "default behavior" there are two options to consider ...
>>>> (a) the "original" behavior(? I do not know what to call it) - this would be
>>>> where user space wants(?) to have the current non-ABMC behavior on an ABMC
>>>> system, where the previous "num_rmids" monitor groups can be created but
>>>> the counters are reset unpredictably ... should this still be supported
>>>> on ABMC systems though?
>>>
>>> I would say yes. For some reason user(hardware or software issues) is not
>>> able to use ABMC mode, they have an option to go back to legacy mode.
>>
>> I see. Should this perhaps be protected behind the resctrl "debug" mount option?
>
> The debug option gives wrong impression. It is better to keep the option
> open to enable the feature in normal mode.

You mentioned that it would only be needed when there are hardware or
software issues ... so debug does sound appropriate. Could you please give
an example of how debug option gives wrong impression? Why would you want
users to keep using "legacy" mode on an ABMC system?

..

>> For example, if I understand correctly, theoretically, when ABMC is enabled then
>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
>> is not unsigned, tbd if number of directories may also be limited by kernfs).
>> User space could theoretically create more monitor groups than the number of
>> rmids that a resource claims to support using current upstream enumeration.
>
> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
> more than this limit(r->num_rmid).
>
> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
> RMID to assign the monitoring. So, assignment limit is
> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.

I see. Thank you for clarifying. This does make enabling simpler and one
less user interface item that needs changing.

..

>>> 2. /sys/fs/resctrl/monitor_state.
>>> This can used to individually assign or unassign the counters in each group.
>>>
>>> When assigned:
>>> #cat /sys/fs/resctrl/monitor_state
>>> 0=total-assign,local-assign;1=total-assign,local-assign
>>>
>>> When unassigned:
>>> #cat /sys/fs/resctrl/monitor_state
>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign
>>>
>>>
>>> Thoughts?
>>
>> How do you expect this interface to be used? I understand the mechanics
>> of this interface but on a higher level, do you expect user space to
>> once in a while assign a new counter to a single event or monitor group
>> (for which a fine grained interface works) or do you expect user space to
>> shift multiple counters across several monitor events at intervals?
>
> I think we should provide both the options. I was thinking of providing
> fine grained interface first.

Could you please provide a motivation for why two interfaces, one inefficient
and one not, should be created and maintained? Users can still do fine grained
assignment with a global assignment interface.

Reinette

2024-02-28 18:00:33

by Babu Moger

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Reinette,

On 2/27/24 17:50, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/27/2024 10:12 AM, Moger, Babu wrote:
>> On 2/26/24 15:20, Reinette Chatre wrote:
>>> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>>>> On 2/23/24 16:21, Reinette Chatre wrote:
>
>>>>> Apart from the "default behavior" there are two options to consider ...
>>>>> (a) the "original" behavior(? I do not know what to call it) - this would be
>>>>> where user space wants(?) to have the current non-ABMC behavior on an ABMC
>>>>> system, where the previous "num_rmids" monitor groups can be created but
>>>>> the counters are reset unpredictably ... should this still be supported
>>>>> on ABMC systems though?
>>>>
>>>> I would say yes. For some reason user(hardware or software issues) is not
>>>> able to use ABMC mode, they have an option to go back to legacy mode.
>>>
>>> I see. Should this perhaps be protected behind the resctrl "debug" mount option?
>>
>> The debug option gives wrong impression. It is better to keep the option
>> open to enable the feature in normal mode.
>
> You mentioned that it would only be needed when there are hardware or
> software issues ... so debug does sound appropriate. Could you please give
> an example of how debug option gives wrong impression? Why would you want
> users to keep using "legacy" mode on an ABMC system?

I don't have a strong argument here. I am fine as long as there is a way
to go back to legacy mode if required. We can provide legacy option in
debug mode.

>
> ...
>
>>> For example, if I understand correctly, theoretically, when ABMC is enabled then
>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
>>> is not unsigned, tbd if number of directories may also be limited by kernfs).
>>> User space could theoretically create more monitor groups than the number of
>>> rmids that a resource claims to support using current upstream enumeration.
>>
>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
>> more than this limit(r->num_rmid).
>>
>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
>> RMID to assign the monitoring. So, assignment limit is
>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.
>
> I see. Thank you for clarifying. This does make enabling simpler and one
> less user interface item that needs changing.
>
> ...
>
>>>> 2. /sys/fs/resctrl/monitor_state.
>>>> This can used to individually assign or unassign the counters in each group.
>>>>
>>>> When assigned:
>>>> #cat /sys/fs/resctrl/monitor_state
>>>> 0=total-assign,local-assign;1=total-assign,local-assign
>>>>
>>>> When unassigned:
>>>> #cat /sys/fs/resctrl/monitor_state
>>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign
>>>>
>>>>
>>>> Thoughts?
>>>
>>> How do you expect this interface to be used? I understand the mechanics
>>> of this interface but on a higher level, do you expect user space to
>>> once in a while assign a new counter to a single event or monitor group
>>> (for which a fine grained interface works) or do you expect user space to
>>> shift multiple counters across several monitor events at intervals?
>>
>> I think we should provide both the options. I was thinking of providing
>> fine grained interface first.
>
> Could you please provide a motivation for why two interfaces, one inefficient
> and one not, should be created and maintained? Users can still do fine grained
> assignment with a global assignment interface.

Let's consider them one by one.

1. Fine grained assignment.

It will be part of the mongroup (or control mongroup). The user has access
to the group and can query the group's current status before assigning or
unassigning.

$cd /sys/fs/resctrl/ctrl_mon1
$cat /sys/fs/resctrl/ctrl_mon1/monitor_state
0=total-unassign,local-unassign;1=total-unassign,local-unassign;

Assign the total event

$echo 0=total-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state

Assign the local event

$echo 0=local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state

Assign both events:

$echo 0=total-assign,local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state

Check the assignment status.

$cat /sys/fs/resctrl/ctrl_mon1/monitor_state
0=total-assign,local-assign;1=total-unassign,local-unassign;

- The user interface is simple.

- Assignment will fail if all the h/w counters are exhausted. The user then
needs to unassign a counter from another group and use that counter here.
This can be done by querying the monitor state of another group.

- A monitor group's details (cpus, tasks) are part of the group, so it is
better to keep the assignment state inside the group as well.

Note: The interface names used here are only examples.


2. global assignment:

I would assume the interface file will be in /sys/fs/resctrl/info/L3_MON/
directory.

In case there are 100 mongroups, we need to have a way to list current
assignment status for these groups. I am not sure how to list status of
these 100 groups.

If the user wants to assign the local event (or total) in a specific group
in this list of 100 groups, I am not sure how to provide an interface for
that. Should we pass the name of the mongroup? That will involve looping
through the groups using kernfs_walk_and_get(). This may be ok if we are
dealing with a very small number of groups.

--
Thanks
Babu Moger

2024-02-28 20:11:48

by Reinette Chatre

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On 2/28/2024 9:59 AM, Moger, Babu wrote:
> On 2/27/24 17:50, Reinette Chatre wrote:
>> On 2/27/2024 10:12 AM, Moger, Babu wrote:
>>> On 2/26/24 15:20, Reinette Chatre wrote:
>>>> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>>>>> On 2/23/24 16:21, Reinette Chatre wrote:
>>

>>>> For example, if I understand correctly, theoretically, when ABMC is enabled then
>>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
>>>> is not unsigned, tbd if number of directories may also be limited by kernfs).
>>>> User space could theoretically create more monitor groups than the number of
>>>> rmids that a resource claims to support using current upstream enumeration.
>>>
>>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
>>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
>>> more than this limit(r->num_rmid).
>>>
>>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
>>> RMID to assign the monitoring. So, assignment limit is
>>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.
>>
>> I see. Thank you for clarifying. This does make enabling simpler and one
>> less user interface item that needs changing.
>>
>> ...
>>
>>>>> 2. /sys/fs/resctrl/monitor_state.
>>>>> This can used to individually assign or unassign the counters in each group.
>>>>>
>>>>> When assigned:
>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>> 0=total-assign,local-assign;1=total-assign,local-assign
>>>>>
>>>>> When unassigned:
>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign
>>>>>
>>>>>
>>>>> Thoughts?
>>>>
>>>> How do you expect this interface to be used? I understand the mechanics
>>>> of this interface but on a higher level, do you expect user space to
>>>> once in a while assign a new counter to a single event or monitor group
>>>> (for which a fine grained interface works) or do you expect user space to
>>>> shift multiple counters across several monitor events at intervals?
>>>
>>> I think we should provide both the options. I was thinking of providing
>>> fine grained interface first.
>>
>> Could you please provide a motivation for why two interfaces, one inefficient
>> and one not, should be created and maintained? Users can still do fine grained
>> assignment with a global assignment interface.
>
> Lets consider one by one.
>
> 1. Fine grained assignment.
>
> It will be part of the mongroup(or control mongroup). User has the access
> to the group and can query the group's current status before assigning or
> unassigning.
>
> $cd /sys/fs/resctrl/ctrl_mon1
> $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
> 0=total-unassign,local-unassign;1=total-unassign,local-unassign;
>
> Assign the total event
>
> $echo 0=total-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>
> Assign the local event
>
> $echo 0=local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>
> Assign both events:
>
> $echo 0=total-assign,local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>
> Check the assignment status.
>
> $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
> 0=total-assign,local-assign;1=total-unassign,local-unassign;
>
> -User interface is simple.

This should not be the only motivation. Please do not sacrifice efficiency
and usability just to have a simple interface. One can also argue that this
interface can only be considered simple from the kernel implementation
perspective; from user space it seems complicated. For example, as James pointed
out earlier [1], user space would need to walk the entire resctrl tree to find
out where counters are assigned. Peter also pointed out how the multiple syscalls
needed when adjusting hundreds of monitor groups are inefficient. Please take all
feedback into account.

You consider "simple interface" as a motivation, but there seem to be at least
two arguments against this interface. Please consider these in your comparison
between interfaces. These are things that should be noted and make their way to
the cover letter.

>
> -Assignment will fail if all the h/w counters are exhausted. User needs to
> unassign a counter from another group and use that counter here. This can
> be done just querying the monitor state of another group.

Right ... and as you state there can be hundreds of monitor groups that
user space would need to walk and query to get this information.
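
To make the cost of the per-group approach concrete, here is a minimal user-space sketch, not part of any patch. It assumes the hypothetical per-group "monitor_state" file from the example above exists in every group directory; the helper name is made up for illustration. Finding out where counters are assigned takes a directory walk plus one open/read per group:

```python
import os

def collect_monitor_state(root="/sys/fs/resctrl", filename="monitor_state"):
    """Return {group path relative to root: file contents}.

    'monitor_state' is the hypothetical per-group file discussed in this
    thread; it is not an existing resctrl file.
    """
    states = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # info/ describes resources, it is not a resource group.
        if dirpath == root and "info" in dirnames:
            dirnames.remove("info")
        if filename in filenames:
            # One open + read per group: this is the repeated syscall cost.
            with open(os.path.join(dirpath, filename)) as f:
                states[os.path.relpath(dirpath, root)] = f.read().strip()
    return states
```

With hundreds of monitor groups this loop issues hundreds of reads, which is the inefficiency a single global file would avoid.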

>
> -Monitor group's details(cpus, tasks) are part of the group. So, it is
> better to have assignment state inside the group.

The assignment state should be clear from the event file.

> Note: Used interface names here just to give example.
>
>
> 2. global assignment:
>
> I would assume the interface file will be in /sys/fs/resctrl/info/L3_MON/
> directory.
>
> In case there are 100 mongroups, we need to have a way to list current
> assignment status for these groups. I am not sure how to list status of
> these 100 groups.

The kernel has many examples of interfaces that manage the status of a large
number of entities. I am thinking, for example, we can learn a lot from
how dynamic debug works. On my system I see:

$ wc -l /sys/kernel/debug/dynamic_debug/control
5359 /sys/kernel/debug/dynamic_debug/control

>
> If user is wants to assign the local event(or total) in a specific group
> in this list of 100 groups, I am not sure how to provide interface for
> that. Should we pass the name of mongroup? That will involve looping
> through using the call kernfs_walk_and_get. This may be ok if we are
> dealing with very small number of groups.
>

What is your concern when needing to modify a large number of groups?
Are you concerned about the size of the writes needing to be parsed? It looks
like kernfs does support writes of larger than PAGE_SIZE, but it is not clear
to me that such large sizes will be required.

There is also kernfs_find_and_get() that may be more convenient to use.
I believe user space needs to provide control group name for a global
interface (the same name can be used by monitor groups belonging to
different control groups), and that can be used to narrow search.

Reading your message I do not find any motivation _against_ a global
interface, except that it is not obvious to you how such an interface may look
or work. That is fair. Peter seems to have ideas and a working implementation
that can be used as reference. So far I have only seen one comment [2] from James
that was skeptical about the global interface, but the reason given is that MPAM
allocates counters per domain, which is the same as ABMC. We will need more
information from James here on what is required since he did not respond to
Peter.

Below is a *hypothetical* interface to start a discussion that explores how
to support fine grained assignment in an interface that aims to be easy to use
by user space. Obviously Peter is also working on something so there
are many viewpoints to consider.

File info/L3_MON/mbm_assign_control:
#control_group/mon_group/flags
ctrl_a/mon_a/00=_;01=_
ctrl_a/mon_b/00=l;01=t
ctrl_b/mon_c/00=lt;01=lt

Above file displays to user:
* No counters are assigned to monitor group mon_a within control group ctrl_a
* Counter for local MBM is assigned to domain 0 of monitor group mon_b within
control group ctrl_a
* Counter for total MBM is assigned to domain 1 of monitor group mon_b within
control group ctrl_a
* Counters for local and total MBM are assigned to both domains of monitor
group mon_c within control group ctrl_b

With above interface user space can, with a single read, get insight into
how counters are assigned across all monitor groups.
User space can write to the file to modify the flags. If a new counter is
assigned when no more counters are available then the write will fail.
Potentially, if changes are made in the order provided by the user, then
the user will be able to unassign counters from one group and re-assign them
to another group with a single write.
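
As a rough illustration of how user space might consume such a file, here is a minimal sketch. It assumes the hypothetical "control_group/mon_group/flags" layout shown above, with "l" for local, "t" for total and "_" for no counter; the function name is invented for this example and nothing here reflects an agreed format:

```python
def parse_assign_control(text):
    """Parse the hypothetical mbm_assign_control layout into
    {(ctrl_group, mon_group): {domain_id: set_of_flags}}."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comment
        # Only the first two "/" separate names; the rest is the flags field.
        ctrl, mon, domains = line.split("/", 2)
        per_domain = {}
        for entry in domains.split(";"):
            dom, flags = entry.split("=", 1)
            # "_" means "no counter assigned", so it maps to an empty set.
            per_domain[int(dom)] = set(flags) - {"_"}
        table[(ctrl, mon)] = per_domain
    return table
```

A single read of the file followed by one call gives the assignment state of every group, for example `parse_assign_control(open(path).read())`.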

I provide this purely to generate some ideas and gather more thoughts on
a global interface.

Reinette

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/






2024-02-29 20:37:29

by Babu Moger

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Reinette,

On 2/28/24 14:04, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/28/2024 9:59 AM, Moger, Babu wrote:
>> On 2/27/24 17:50, Reinette Chatre wrote:
>>> On 2/27/2024 10:12 AM, Moger, Babu wrote:
>>>> On 2/26/24 15:20, Reinette Chatre wrote:
>>>>> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>>>>>> On 2/23/24 16:21, Reinette Chatre wrote:
>>>
>
>>>>> For example, if I understand correctly, theoretically, when ABMC is enabled then
>>>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
>>>>> is not unsigned, tbd if number of directories may also be limited by kernfs).
>>>>> User space could theoretically create more monitor groups than the number of
>>>>> rmids that a resource claims to support using current upstream enumeration.
>>>>
>>>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
>>>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
>>>> more than this limit(r->num_rmid).
>>>>
>>>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
>>>> RMID to assign the monitoring. So, assignment limit is
>>>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.
>>>
>>> I see. Thank you for clarifying. This does make enabling simpler and one
>>> less user interface item that needs changing.
>>>
>>> ...
>>>
>>>>>> 2. /sys/fs/resctrl/monitor_state.
>>>>>> This can used to individually assign or unassign the counters in each group.
>>>>>>
>>>>>> When assigned:
>>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>>> 0=total-assign,local-assign;1=total-assign,local-assign
>>>>>>
>>>>>> When unassigned:
>>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign
>>>>>>
>>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> How do you expect this interface to be used? I understand the mechanics
>>>>> of this interface but on a higher level, do you expect user space to
>>>>> once in a while assign a new counter to a single event or monitor group
>>>>> (for which a fine grained interface works) or do you expect user space to
>>>>> shift multiple counters across several monitor events at intervals?
>>>>
>>>> I think we should provide both the options. I was thinking of providing
>>>> fine grained interface first.
>>>
>>> Could you please provide a motivation for why two interfaces, one inefficient
>>> and one not, should be created and maintained? Users can still do fine grained
>>> assignment with a global assignment interface.
>>
>> Lets consider one by one.
>>
>> 1. Fine grained assignment.
>>
>> It will be part of the mongroup(or control mongroup). User has the access
>> to the group and can query the group's current status before assigning or
>> unassigning.
>>
>> $cd /sys/fs/resctrl/ctrl_mon1
>> $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign;
>>
>> Assign the total event
>>
>> $echo 0=total-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>
>> Assign the local event
>>
>> $echo 0=local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>
>> Assign both events:
>>
>> $echo 0=total-assign,local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>
>> Check the assignment status.
>>
>> $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
>> 0=total-assign,local-assign;1=total-unassign,local-unassign;
>>
>> -User interface is simple.
>
> This should not be the only motivation. Please do not sacrifice efficiency
> and usability just to have a simple interface. One can also argue that this
> interface can only be considered simple from the kernel implementation perspective,
> from user space it seems complicated. For example, as James pointed out earlier [1],
> user space would need to walk the entire resctrl to find out where counters are
> assigned. Peter also pointed out how the multiple syscalls needed when adjusting
> hundreds of monitor groups is inefficient. Please take all feedback into account.
>
> You consider "simple interface" as a motivation, there seems to be at least two
> arguments against this interface. Please consider these in your comparison
> between interfaces. These are things that should be noted and make their way to
> the cover letter.
>
>>
>> -Assignment will fail if all the h/w counters are exhausted. User needs to
>> unassign a counter from another group and use that counter here. This can
>> be done just querying the monitor state of another group.
>
> Right ... and as you state there can be hundreds of monitor groups that
> user space would need to walk and query to get this information.
>
>>
>> -Monitor group's details(cpus, tasks) are part of the group. So, it is
>> better to have assignment state inside the group.
>
> The assignment state should be clear from the event file.
>
>> Note: Used interface names here just to give example.
>>
>>
>> 2. global assignment:
>>
>> I would assume the interface file will be in /sys/fs/resctrl/info/L3_MON/
>> directory.
>>
>> In case there are 100 mongroups, we need to have a way to list current
>> assignment status for these groups. I am not sure how to list status of
>> these 100 groups.
>
> The kernel has many examples of interfaces that manages status of a large
> number of entities. I am thinking, for example, we can learn a lot from
> how dynamic debug works. On my system I see:
>
> $ wc -l /sys/kernel/debug/dynamic_debug/control
> 5359 /sys/kernel/debug/dynamic_debug/control
>
>>
>> If user is wants to assign the local event(or total) in a specific group
>> in this list of 100 groups, I am not sure how to provide interface for
>> that. Should we pass the name of mongroup? That will involve looping
>> through using the call kernfs_walk_and_get. This may be ok if we are
>> dealing with very small number of groups.
>>
>
> What is your concern when needing to modify a large number of groups?
> Are you concerned about the size of the writes needing to be parsed? It looks
> like kernfs does support writes of larger than PAGE_SIZE, but it is not clear
> to me that such large sizes will be required.
>
> There is also kernfs_find_and_get() that may be more convenient to use.

Will look at this. There is also kernfs_name and kernfs_path.

> I believe user space needs to provide control group name for a global
> interface (the same name can be used by monitor groups belonging to
> different control groups), and that can be used to narrow search.
>
> Reading your message I do not find any motivation _against_ a global
> interface, except that it is not obvious to you how such interface may look
> or work. That is fair. Peter seems to have ideas and a working implementation
> that can be used as reference. So far I have only seen one comment [2] from James
> that was skeptical about the global interface but the reason notes that MPAM
> allocates counters per domain, which is the same as ABMC so we will need more
> information from James here on what is required since he did not respond to
> Peter.
>
> Below is a *hypothetical* interface to start a discussion that explores how
> to support fine grained assignment in an interface that aims to be easy to use
> by user space. Obviously Peter is also working on something so there
> are many viewpoints to consider.
>
> File info/L3_MON/mbm_assign_control:
> #control_group/mon_group/flags
> ctrl_a/mon_a/00=_;01=_
> ctrl_a/mon_b/00=l;01=t
> ctrl_b/mon_c/00=lt;01=lt

I think you left out a few things here (like the default control_mon group).

To make it clearer, let me list all the groups here based on this.

When none of the counters are assigned:

$cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
resctrl/00=none,none;01=none,none (#default control_mon group)
resctrl/mon_a/00=none,none;01=none,none (#mon group)
resctrl/ctrl_a/00=none,none;01=none,none (#control_mon group)
resctrl/ctrl_a/mon_ab/00=none,none;01=none,none (#mon group)


When some counters are assigned:

$echo "resctrl/00=total,local" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
(#assigning counters to the default group)

$echo "resctrl/mon_a/00=total;01=total" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
(#assigning counters to a mon group)

$echo "resctrl/ctrl_a/00=local;01=local" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

$echo "resctrl/ctrl_a/mon_ab/00=total,local;01=total,local" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

$cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
resctrl/00=total,local;01=none,none (#default control_mon group)
resctrl/mon_a/00=total,none;01=total,none (#mon group)
resctrl/ctrl_a/00=none,local;01=none,local (#control_mon group)
resctrl/ctrl_a/mon_ab/00=total,local;01=total,local (#mon group)


A few comments about this approach:
1. This will involve a lot of text processing in the kernel. I will need to
figure out which calls to use for this processing.

2. In this approach there is no way to list the assignment of a single
group (like group resctrl/ctrl_a/mon_ab alone).

3. This is similar to the fine grained approach we discussed, but at the
global level.

I would like to get Peter's and James's comments on this approach.

>
> Above file displays to user:
> * No counters are assigned to monitor group mon_a within control group ctrl_a
> * Counter for local MBM is assigned to domain 0 of monitor group mon_b within
> control group ctrl_a
> * Counter for total MBM is assigned to domain 1 of monitor group mon_b within
> control group ctrl_a
> * Counters for local and total MBM are assigned to both domains of monitor
> group mon_c within control group ctrl_b
>
> With above interface user space can, with a single read, get insight into
> how counters are assigned across all monitor groups.
> User space can write to the file to modify the flags. If assigning a new
> counter when no more counters are available then the write will fail.
> Potentially, if changes are made in order provided by the user then
> the user will be able to unassign counters from one group and re-assign to
> another group with a single write.
>
> I provide this purely to generate some ideas and gather more thoughts on
> a global interface.
>
> Reinette
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/lkml/[email protected]/
>
>
>
>
>

--
Thanks
Babu Moger

2024-02-29 21:50:57

by Reinette Chatre

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On 2/29/2024 12:37 PM, Moger, Babu wrote:
> On 2/28/24 14:04, Reinette Chatre wrote:
>> On 2/28/2024 9:59 AM, Moger, Babu wrote:
>>> On 2/27/24 17:50, Reinette Chatre wrote:
>>>> On 2/27/2024 10:12 AM, Moger, Babu wrote:
>>>>> On 2/26/24 15:20, Reinette Chatre wrote:
>>>>>> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>>>>>>> On 2/23/24 16:21, Reinette Chatre wrote:
>>>>
>>
>>>>>> For example, if I understand correctly, theoretically, when ABMC is enabled then
>>>>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
>>>>>> is not unsigned, tbd if number of directories may also be limited by kernfs).
>>>>>> User space could theoretically create more monitor groups than the number of
>>>>>> rmids that a resource claims to support using current upstream enumeration.
>>>>>
>>>>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
>>>>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
>>>>> more than this limit(r->num_rmid).
>>>>>
>>>>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
>>>>> RMID to assign the monitoring. So, assignment limit is
>>>>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.
>>>>
>>>> I see. Thank you for clarifying. This does make enabling simpler and one
>>>> less user interface item that needs changing.
>>>>
>>>> ...
>>>>
>>>>>>> 2. /sys/fs/resctrl/monitor_state.
>>>>>>> This can used to individually assign or unassign the counters in each group.
>>>>>>>
>>>>>>> When assigned:
>>>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>>>> 0=total-assign,local-assign;1=total-assign,local-assign
>>>>>>>
>>>>>>> When unassigned:
>>>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign
>>>>>>>
>>>>>>>
>>>>>>> Thoughts?
>>>>>>
>>>>>> How do you expect this interface to be used? I understand the mechanics
>>>>>> of this interface but on a higher level, do you expect user space to
>>>>>> once in a while assign a new counter to a single event or monitor group
>>>>>> (for which a fine grained interface works) or do you expect user space to
>>>>>> shift multiple counters across several monitor events at intervals?
>>>>>
>>>>> I think we should provide both the options. I was thinking of providing
>>>>> fine grained interface first.
>>>>
>>>> Could you please provide a motivation for why two interfaces, one inefficient
>>>> and one not, should be created and maintained? Users can still do fine grained
>>>> assignment with a global assignment interface.
>>>
>>> Lets consider one by one.
>>>
>>> 1. Fine grained assignment.
>>>
>>> It will be part of the mongroup(or control mongroup). User has the access
>>> to the group and can query the group's current status before assigning or
>>> unassigning.
>>>
>>> $cd /sys/fs/resctrl/ctrl_mon1
>>> $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign;
>>>
>>> Assign the total event
>>>
>>> $echo 0=total-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>
>>> Assign the local event
>>>
>>> $echo 0=local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>
>>> Assign both events:
>>>
>>> $echo 0=total-assign,local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>
>>> Check the assignment status.
>>>
>>> $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
>>> 0=total-assign,local-assign;1=total-unassign,local-unassign;
>>>
>>> -User interface is simple.
>>
>> This should not be the only motivation. Please do not sacrifice efficiency
>> and usability just to have a simple interface. One can also argue that this
>> interface can only be considered simple from the kernel implementation perspective,
>> from user space it seems complicated. For example, as James pointed out earlier [1],
>> user space would need to walk the entire resctrl to find out where counters are
>> assigned. Peter also pointed out how the multiple syscalls needed when adjusting
>> hundreds of monitor groups is inefficient. Please take all feedback into account.
>>
>> You consider "simple interface" as a motivation, there seems to be at least two
>> arguments against this interface. Please consider these in your comparison
>> between interfaces. These are things that should be noted and make their way to
>> the cover letter.
>>
>>>
>>> -Assignment will fail if all the h/w counters are exhausted. User needs to
>>> unassign a counter from another group and use that counter here. This can
>>> be done just querying the monitor state of another group.
>>
>> Right ... and as you state there can be hundreds of monitor groups that
>> user space would need to walk and query to get this information.
>>
>>>
>>> -Monitor group's details(cpus, tasks) are part of the group. So, it is
>>> better to have assignment state inside the group.
>>
>> The assignment state should be clear from the event file.
>>
>>> Note: Used interface names here just to give example.
>>>
>>>
>>> 2. global assignment:
>>>
>>> I would assume the interface file will be in /sys/fs/resctrl/info/L3_MON/
>>> directory.
>>>
>>> In case there are 100 mongroups, we need to have a way to list current
>>> assignment status for these groups. I am not sure how to list status of
>>> these 100 groups.
>>
>> The kernel has many examples of interfaces that manages status of a large
>> number of entities. I am thinking, for example, we can learn a lot from
>> how dynamic debug works. On my system I see:
>>
>> $ wc -l /sys/kernel/debug/dynamic_debug/control
>> 5359 /sys/kernel/debug/dynamic_debug/control
>>
>>>
>>> If user is wants to assign the local event(or total) in a specific group
>>> in this list of 100 groups, I am not sure how to provide interface for
>>> that. Should we pass the name of mongroup? That will involve looping
>>> through using the call kernfs_walk_and_get. This may be ok if we are
>>> dealing with very small number of groups.
>>>
>>
>> What is your concern when needing to modify a large number of groups?
>> Are you concerned about the size of the writes needing to be parsed? It looks
>> like kernfs does support writes of larger than PAGE_SIZE, but it is not clear
>> to me that such large sizes will be required.
>>
>> There is also kernfs_find_and_get() that may be more convenient to use.
>
> Will look at this. There is also kernfs_name and kernfs_path.
>
>> I believe user space needs to provide control group name for a global
>> interface (the same name can be used by monitor groups belonging to
>> different control groups), and that can be used to narrow search.
>>
>> Reading your message I do not find any motivation _against_ a global
>> interface, except that it is not obvious to you how such interface may look
>> or work. That is fair. Peter seems to have ideas and a working implementation
>> that can be used as reference. So far I have only seen one comment [2] from James
>> that was skeptical about the global interface but the reason notes that MPAM
>> allocates counters per domain, which is the same as ABMC so we will need more
>> information from James here on what is required since he did not respond to
>> Peter.
>>
>> Below is a *hypothetical* interface to start a discussion that explores how
>> to support fine grained assignment in an interface that aims to be easy to use
>> by user space. Obviously Peter is also working on something so there
>> are many viewpoints to consider.
>>
>> File info/L3_MON/mbm_assign_control:
>> #control_group/mon_group/flags
>> ctrl_a/mon_a/00=_;01=_
>> ctrl_a/mon_b/00=l;01=t
>> ctrl_b/mon_c/00=lt;01=lt
>
> I think you left few things here(Like the default control_mon group).

No. Similar to proc_resctrl_show() the fields can be empty for
the default group or mon groups belonging to control group.

>
> To make more clear, let me list all the groups here based this.
>
> When none of the counters assigned:
>
> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> resctrl/00=none,none;01=none,none (#default control_mon group)
> resctrl/mon_a/00=none,none;01=none,none (#mon group)
> resctrl/ctrl_a/00=none,none;01=none,none (#control_mon group)
> resctrl/ctrl_a/mon_ab/00=none,none;01=none,none (#mon group)

I am concerned that inconsistent use of "/" will make parsing hard.

I find "resctrl" and all the "none" redundant. It is not clear what
this improves.
Why have:
resctrl/00=none,none;01=none,none
when this could do:
//00=_;01=_


> When some counters are assigned:
>
> $echo "resctrl/00=total,local" >
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to
> default group)
>
> $echo "resctrl/mon_a/00=total;01=total" >
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to mon
> group)
>
> $echo "resctrl/ctrl_a/00=local;01=local" >
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
> $echo "resctrl/ctrl_a/mon_ab/00=total,local;01=total,local" >
> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>

We could learn some more lessons from dynamic debug (see
Documentation/admin-guide/dynamic-debug-howto.rst).
For example, "=" can be used to make an assignment while "+"
can be used to add a counter and "-" can be used to remove a counter.
"=_" can be used to remove counters from all events in that domain.

The interface should also support assign/un-assign to multiple groups with
a single write. To start this could use '\n' as separator as is the custom
with other resctrl interfaces.
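
To illustrate the intended semantics of the three operators, here is a small sketch of how a per-group update string could be applied to the current per-domain flags. The syntax ("00+t;01=_"), the flag letters and the function name are assumptions made for this example only; nothing here is an agreed format:

```python
import re

# domain id, operator, flags; "_" stands for "no counters"
_ENTRY = re.compile(r"^(\d+)([=+-])([lt_]*)$")

def apply_update(current, update):
    """Return a new {domain: set_of_flags} after applying 'update'.

    "=" replaces the flags of a domain, "+" adds flags, "-" removes flags.
    'current' is left untouched.
    """
    result = {dom: set(flags) for dom, flags in current.items()}
    for entry in update.split(";"):
        m = _ENTRY.match(entry)
        if m is None:
            raise ValueError("malformed entry: %r" % entry)
        dom = int(m.group(1))
        op = m.group(2)
        flags = set(m.group(3)) - {"_"}
        have = result.setdefault(dom, set())
        if op == "=":
            result[dom] = flags
        elif op == "+":
            have |= flags   # in-place union on the set stored in result
        else:
            have -= flags   # in-place difference
    return result
```

A write covering several groups could then be handled by splitting the input on '\n' and applying each line to its group in the order given.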

> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> resctrl/00=total,local;01=none,none (#default control_mon group)
> resctrl/mon_a/00=total,none;01=total,none (#mon group)
> resctrl/ctrl_a/00=none,local;01=none,local (#control_mon group)
> resctrl/ctrl_a/mon_ab/00=total,local;01=total,local (#mon group)
>
>
> Few comments about this approach:
> 1.This will involve lots of text processing in the kernel. Will need to
> figure out calls for these processing.

I see that additional parsing will be needed to determine control group
and monitor group. For these it sounds like you already have a few options
for kernfs API to use.
Apart from that, parsing the counter assignment will be similar to what
was done in your previous versions. I think parsing will be easier if it
does not try to use words for the events but just uses one-letter flags.
There is then no need to look for "," when parsing the events; just
parse one character at a time, where each character has a
specific meaning.
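
A user-space sketch of the one-character-at-a-time idea follows. The mapping of "t" and "l" to the existing mbm_total_bytes and mbm_local_bytes events, the "_" placeholder and the function name are assumptions for illustration; kernel code would of course look different:

```python
# Assumed flag letters for this sketch; not an agreed interface.
FLAG_EVENTS = {"t": "mbm_total_bytes", "l": "mbm_local_bytes"}

def parse_flags(flags):
    """Translate a flags string such as "lt" into a list of event names."""
    if flags == "_":
        return []  # no counters assigned in this domain
    events = []
    for ch in flags:
        # Each character stands alone, so no separator handling is needed.
        if ch not in FLAG_EVENTS:
            raise ValueError("unknown flag: %r" % ch)
        if FLAG_EVENTS[ch] not in events:
            events.append(FLAG_EVENTS[ch])
    return events
```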

>
> 2.In this approach there is no way to list assignment of a single
> group(like group resctrl/ctrl_a/mon_ab alone).

Should the kernel be responsible for enabling this? User space can just
do a "cat mbm_assign_control | grep mon_ab". Is this not sufficient?
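
If a plain grep turns out to be too loose (a pattern like "mon_ab" would also match a group named "mon_abc"), user space can match on the full "control_group/mon_group/" prefix instead. A minimal sketch, again assuming the hypothetical line format from earlier in this thread and an invented helper name:

```python
def lines_for_group(text, ctrl, mon):
    """Return only the lines that belong to exactly ctrl/mon."""
    prefix = "%s/%s/" % (ctrl, mon)
    return [ln for ln in text.splitlines() if ln.startswith(prefix)]
```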

>
> 3. This is similar to fine grained approach we discussed but in global level.

That is what I have been trying to get across. This has full benefit of the
original implementation while also addressing all problems raised against it.

>
> Want to get Pater/James comments about this approach.
(Peter)

Of course. I look forward to that. Once agreed, it may also be worthwhile to
approach the x86 maintainers with an RFC of the proposed new user interface to
get their guidance. This is where it is important to keep track of all the
requirements, as well as the pros and cons of the different options.

Reinette

2024-03-01 20:37:55

by Babu Moger

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Reinette,

On 2/29/24 15:50, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/29/2024 12:37 PM, Moger, Babu wrote:
>> On 2/28/24 14:04, Reinette Chatre wrote:
>>> On 2/28/2024 9:59 AM, Moger, Babu wrote:
>>>> On 2/27/24 17:50, Reinette Chatre wrote:
>>>>> On 2/27/2024 10:12 AM, Moger, Babu wrote:
>>>>>> On 2/26/24 15:20, Reinette Chatre wrote:
>>>>>>> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>>>>>>>> On 2/23/24 16:21, Reinette Chatre wrote:
>>>>>
>>>
>>>>>>> For example, if I understand correctly, theoretically, when ABMC is enabled then
>>>>>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
>>>>>>> is not unsigned, tbd if number of directories may also be limited by kernfs).
>>>>>>> User space could theoretically create more monitor groups than the number of
>>>>>>> rmids that a resource claims to support using current upstream enumeration.
>>>>>>
>>>>>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
>>>>>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
>>>>>> more than this limit(r->num_rmid).
>>>>>>
>>>>>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
>>>>>> RMID to assign the monitoring. So, assignment limit is
>>>>>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.
>>>>>
>>>>> I see. Thank you for clarifying. This does make enabling simpler and one
>>>>> less user interface item that needs changing.
>>>>>
>>>>> ...
>>>>>
>>>>>>>> 2. /sys/fs/resctrl/monitor_state.
>>>>>>>> This can used to individually assign or unassign the counters in each group.
>>>>>>>>
>>>>>>>> When assigned:
>>>>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>>>>> 0=total-assign,local-assign;1=total-assign,local-assign
>>>>>>>>
>>>>>>>> When unassigned:
>>>>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>>>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign
>>>>>>>>
>>>>>>>>
>>>>>>>> Thoughts?
>>>>>>>
>>>>>>> How do you expect this interface to be used? I understand the mechanics
>>>>>>> of this interface but on a higher level, do you expect user space to
>>>>>>> once in a while assign a new counter to a single event or monitor group
>>>>>>> (for which a fine grained interface works) or do you expect user space to
>>>>>>> shift multiple counters across several monitor events at intervals?
>>>>>>
>>>>>> I think we should provide both the options. I was thinking of providing
>>>>>> fine grained interface first.
>>>>>
>>>>> Could you please provide a motivation for why two interfaces, one inefficient
>>>>> and one not, should be created and maintained? Users can still do fine grained
>>>>> assignment with a global assignment interface.
>>>>
>>>> Lets consider one by one.
>>>>
>>>> 1. Fine grained assignment.
>>>>
>>>> It will be part of the mongroup(or control mongroup). User has the access
>>>> to the group and can query the group's current status before assigning or
>>>> unassigning.
>>>>
>>>> $cd /sys/fs/resctrl/ctrl_mon1
>>>> $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign;
>>>>
>>>> Assign the total event
>>>>
>>>> $echo 0=total-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>>
>>>> Assign the local event
>>>>
>>>> $echo 0=local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>>
>>>> Assign both events:
>>>>
>>>> $echo 0=total-assign,local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>>
>>>> Check the assignment status.
>>>>
>>>> $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>> 0=total-assign,local-assign;1=total-unassign,local-unassign;
>>>>
>>>> -User interface is simple.
>>>
>>> This should not be the only motivation. Please do not sacrifice efficiency
>>> and usability just to have a simple interface. One can also argue that this
>>> interface can only be considered simple from the kernel implementation perspective,
>>> from user space it seems complicated. For example, as James pointed out earlier [1],
>>> user space would need to walk the entire resctrl to find out where counters are
>>> assigned. Peter also pointed out how the multiple syscalls needed when adjusting
>>> hundreds of monitor groups is inefficient. Please take all feedback into account.
>>>
>>> You consider "simple interface" as a motivation, there seems to be at least two
>>> arguments against this interface. Please consider these in your comparison
>>> between interfaces. These are things that should be noted and make their way to
>>> the cover letter.
>>>
>>>>
>>>> -Assignment will fail if all the h/w counters are exhausted. User needs to
>>>> unassign a counter from another group and use that counter here. This can
>>>> be done just querying the monitor state of another group.
>>>
>>> Right ... and as you state there can be hundreds of monitor groups that
>>> user space would need to walk and query to get this information.
>>>
>>>>
>>>> -Monitor group's details(cpus, tasks) are part of the group. So, it is
>>>> better to have assignment state inside the group.
>>>
>>> The assignment state should be clear from the event file.
>>>
>>>> Note: Used interface names here just to give example.
>>>>
>>>>
>>>> 2. global assignment:
>>>>
>>>> I would assume the interface file will be in /sys/fs/resctrl/info/L3_MON/
>>>> directory.
>>>>
>>>> In case there are 100 mongroups, we need to have a way to list current
>>>> assignment status for these groups. I am not sure how to list status of
>>>> these 100 groups.
>>>
>>> The kernel has many examples of interfaces that manages status of a large
>>> number of entities. I am thinking, for example, we can learn a lot from
>>> how dynamic debug works. On my system I see:
>>>
>>> $ wc -l /sys/kernel/debug/dynamic_debug/control
>>> 5359 /sys/kernel/debug/dynamic_debug/control
>>>
>>>>
>>>> If user is wants to assign the local event(or total) in a specific group
>>>> in this list of 100 groups, I am not sure how to provide interface for
>>>> that. Should we pass the name of mongroup? That will involve looping
>>>> through using the call kernfs_walk_and_get. This may be ok if we are
>>>> dealing with very small number of groups.
>>>>
>>>
>>> What is your concern when needing to modify a large number of groups?
>>> Are you concerned about the size of the writes needing to be parsed? It looks
>>> like kernfs does support writes of larger than PAGE_SIZE, but it is not clear
>>> to me that such large sizes will be required.
>>>
>>> There is also kernfs_find_and_get() that may be more convenient to use.
>>
>> Will look at this. There is also kernfs_name and kernfs_path.
>>
>>> I believe user space needs to provide control group name for a global
>>> interface (the same name can be used by monitor groups belonging to
>>> different control groups), and that can be used to narrow search.
>>>
>>> Reading your message I do not find any motivation _against_ a global
>>> interface, except that it is not obvious to you how such interface may look
>>> or work. That is fair. Peter seems to have ideas and a working implementation
>>> that can be used as reference. So far I have only seen one comment [2] from James
>>> that was skeptical about the global interface but the reason notes that MPAM
>>> allocates counters per domain, which is the same as ABMC so we will need more
>>> information from James here on what is required since he did not respond to
>>> Peter.
>>>
>>> Below is a *hypothetical* interface to start a discussion that explores how
>>> to support fine grained assignment in an interface that aims to be easy to use
>>> by user space. Obviously Peter is also working on something so there
>>> are many viewpoints to consider.
>>>
>>> File info/L3_MON/mbm_assign_control:
>>> #control_group/mon_group/flags
>>> ctrl_a/mon_a/00=_;01=_
>>> ctrl_a/mon_b/00=l;01=t
>>> ctrl_b/mon_c/00=lt;01=lt
>>
>> I think you left few things here(Like the default control_mon group).
>
> No. Similar to proc_resctrl_show() the fields can be empty for
> the default group or mon groups belonging to control group.

ok. Need to understand this better. Hope I learn while doing this work.

>
>>
>> To make more clear, let me list all the groups here based this.
>>
>> When none of the counters assigned:
>>
>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> resctrl/00=none,none;01=none,none (#default control_mon group)
>> resctrl/mon_a/00=none,none;01=none,none (#mon group)
>> resctrl/ctrl_a/00=none,none;01=none,none (#control_mon group)
>> resctrl/ctrl_a/mon_ab/00=none,none;01=none,none (#mon group)
>
> I am concerned that inconsistent use of "/" will make parsing hard.

Do you mean, you don't want to see multiple "/"?

resctrl/ctrl_a/mon_ab/

Change to

mon_ab/

>
> I find "resctrl" and all the "none" redundant. It is not clear what
> this improves.
> Why have:
> resctrl/00=none,none;01=none,none
> when this could do:
> //00=_;01=_

ok.

"//" meaning root of resctrl filesystem?


>
>
>> When some counters are assigned:
>>
>> $echo "resctrl/00=total,local" >
>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to
>> default group)
>>
>> $echo "resctrl/mon_a/00=total;01=total" >
>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to mon
>> group)
>>
>> $echo "resctrl/ctrl_a/00=local;01=local" >
>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>
>> $echo "resctrl/ctrl_a/mon_ab/00=total,local;01=total,local" >
>> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>>
>
> We could learn some more lessons from dynamic debug (see
> Documentation/admin-guide/dynamic-debug-howto.rst).
> For example, "=" can be used to make an assignment while "+"
> can be used to add a counter and "-" can be used to remove a counter.
> "=_" can be used to remove counters from all events in that domain.

Yes. Looked at dynamic debug. I am still learning this interface. Some
examples below based on my understanding.

To assign counters to the default group on domains 0 and 1:
$echo "//00=+lt;01=+lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

To assign counters to a mon group inside the default group:
$echo "mon_a/00=+t;01=+t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

To assign counters to a control mon group inside the default group:
$echo "ctrl_a/00=+l;01=+l" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

To assign counters to a control mon group inside another control group:
$echo "mon_ab/00=+lt;01=+lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

To unassign counters from a control mon group inside another control group:
$echo "mon_ab/00=-lt;01=-lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

To unassign all the counters on a specific group:
$echo "mon_ab/00=_" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

It does not matter whether it is a control group or a mon group. We just
need the name of the group in this interface.

Listing will be

$cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
//00=lt;01=lt
/mon_a/00=t;01=t
/ctrl_a/00=l;01=l
/mon_ab/00=_;01=_

>
> The interface should also support assign/un-assign to multiple groups with
> a single write. To start this could use '\n' as separator as is the custom
> with other resctrl interfaces.

Yes. that should be fine.

>
>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> resctrl/00=total,local;01=none,none (#default control_mon group)
>> resctrl/mon_a/00=total,none;01=total,none (#mon group)
>> resctrl/ctrl_a/00=none,local;01=none,local (#control_mon group)
>> resctrl/ctrl_a/mon_ab/00=total,local;01=total,local (#mon group)
>>
>>
>> Few comments about this approach:
>> 1.This will involve lots of text processing in the kernel. Will need to
>> figure out calls for these processing.
>
> I see that additional parsing will be needed to determine control group
> and monitor group. For these it sounds like you already have a few options
> for kernfs API to use.
> Apart from that the counter assignment will be similar parsing as what
> was done in your previous versions. I think parsing will be easier if it
> does not try to use words for the events but just use one letter flags.
> For example, there is thus no need to look for "," in the parsing of the
> events, just parse one character at a time where each character has a
> specific meaning.

ok.

>
>>
>> 2.In this approach there is no way to list assignment of a single
>> group(like group resctrl/ctrl_a/mon_ab alone).
>
> Should the kernel be responsible for enabling this? User space can just
> do a "cat mbm_assign_control | grep mon_ab". Is this not sufficient?

That may be ok. Peter, Please comment on this.

>
>>
>> 3. This is similar to fine grained approach we discussed but in global level.
>
> That is what I have been trying to get across. This has full benefit of the
> original implementation while also addressing all problems raised against it.
>
>>
>> Want to get Pater/James comments about this approach.
> (Peter)
>
> Of course. I look forward to that. Once agreed it may also be worthwhile to
> approach x86 maintainers with an RFC of the proposed new user interface to learn
> their guidance. This is where it is important to keep track of all the requirements,
> as well as pros and cons of different options.

Ok. Sure. I am fine with making the next version an RFC.

>
> Reinette

--
Thanks
Babu Moger

2024-03-01 23:20:53

by Reinette Chatre

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On 3/1/2024 12:36 PM, Moger, Babu wrote:
> On 2/29/24 15:50, Reinette Chatre wrote:
>> On 2/29/2024 12:37 PM, Moger, Babu wrote:

..

>>> To make more clear, let me list all the groups here based this.
>>>
>>> When none of the counters assigned:
>>>
>>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>> resctrl/00=none,none;01=none,none (#default control_mon group)
>>> resctrl/mon_a/00=none,none;01=none,none (#mon group)
>>> resctrl/ctrl_a/00=none,none;01=none,none (#control_mon group)
>>> resctrl/ctrl_a/mon_ab/00=none,none;01=none,none (#mon group)
>>
>> I am concerned that inconsistent use of "/" will make parsing hard.
>
> Do you mean, you don't want to see multiple "/"?

No. I think that having a consistent number of "/" will be easier to
parse. In the above example, some lines have one "/", some have two, and
some have three. That seems complicated to parse.

I was thinking that it will make interpreting and parsing easier if there
are always exactly two "/".

(You may find things to be different once you work on the parsing code
though.)

In summary:
* for monitoring of default CTRL_MON group: "//<flags>"
* for MON_GROUP inside default CTRL_MON group: "/<MON group>/<flags>"
* for monitoring of non-default CTRL_MON group: "<CTRL_MON group>//<flags>"
* for MON_GROUP within CTRL_MON group: "<CTRL_MON group>/<MON group>/<flags>"

What do you think?
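
To make the benefit of a fixed number of "/" concrete, here is a small Python sketch (illustrative only, not the kernel implementation) that splits a line on exactly two "/" and lets an empty field stand for the default CTRL_MON group or for "no MON group". The function name `split_group` is an assumption made for the example.

```python
def split_group(line):
    """Split '<CTRL_MON group>/<MON group>/<flags>' into its three fields.

    An empty first field means the default CTRL_MON group; an empty second
    field means the CTRL_MON group itself rather than a MON group inside it.
    """
    parts = line.split("/")
    if len(parts) != 3:
        raise ValueError("expected exactly two '/' separators")
    ctrl, mon, flags = parts
    return ctrl, mon, flags
```

With this shape, all four cases in the summary above go through the same code path, and a line such as `mon_ab/00=_` with a single "/" is rejected instead of being guessed at.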

>
> resctrl/ctrl_a/mon_ab/
>
> Change to
>
> mon_ab/

rather:
ctrl_a/mon_ab/<flags>

>
>>
>> I find "resctrl" and all the "none" redundant. It is not clear what
>> this improves.
>> Why have:
>> resctrl/00=none,none;01=none,none
>> when this could do:
>> //00=_;01=_
>
> ok.
>
> "//" meaning root of resctrl filesystem?

More specifically, monitoring of default control group. It is not intended to
specify a pathname.

>>> When some counters are assigned:
>>>
>>> $echo "resctrl/00=total,local" >
>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to
>>> default group)
>>>
>>> $echo "resctrl/mon_a/00=total;01=total" >
>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to mon
>>> group)
>>>
>>> $echo "resctrl/ctrl_a/00=local;01=local" >
>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>
>>> $echo "resctrl/ctrl_a/mon_ab/00=total,local;01=total,local" >
>>> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>>>
>>
>> We could learn some more lessons from dynamic debug (see Documentation/admin-guide/dynamic-debug-howto.rst). For example, "=" can be used to make an assignment while "+"
>> can be used to add a counter and "-" can be used to remove a counter.
>> "=_" can be used to remove counters from all events in that domain.
>
> Yes. Looked at dynamic debug. I am still learning this interface. Some examples below based on my understanding.
>
> To assign a counters to default group on domain 0.
> $echo "//00=+lt;01=+lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

It should not be necessary to use both "=" and "+" in the same assignment.
I think of "=" as "assign" and "+" as append ("-" as remove).

An example of this, just focusing on default group.

#hypothetical start state of no counters assigned
$ cat mbm_assign_control
#control_group/monitor_group/flags
//00=_;01=_

#assign counter to total MBM of both domains
$ echo "//00=t;01=t" > mbm_assign_control
$ cat mbm_assign_control
#control_group/monitor_group/flags
//00=t;01=t

#add counter to local MBM of both domains without impacting total MBM counters
$echo "//00+l;01+l" > mbm_assign_control
$ cat mbm_assign_control
#control_group/monitor_group/flags
//00=tl;01=tl

#remove local MBM counters without impacting total MBM counters
$echo "//00-l;01-l" > mbm_assign_control
$ cat mbm_assign_control
#control_group/monitor_group/flags
//00=t;01=t

#assign local MBM counters, removing total MBM counters while doing so
$echo "//00=l;01=l" > mbm_assign_control
$ cat mbm_assign_control
#control_group/monitor_group/flags
//00=l;01=l

#remove all counters
$echo "//00=_;01=_" > mbm_assign_control
$ cat mbm_assign_control
#control_group/monitor_group/flags
//00=_;01=_
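
The "=", "+" and "-" semantics walked through above map directly onto set operations. The following Python sketch replays the same sequence for a single domain; it is only a model of the proposed behaviour, and the name `apply_action` is hypothetical.

```python
def apply_action(current, action, flags):
    """Return the new flag set for one domain.

    current: set of single-letter flags currently assigned, e.g. {"t"}
    action:  "=" (assign), "+" (append) or "-" (remove)
    flags:   flag letters such as "lt", or "_" for "no flags"
    """
    given = set() if flags == "_" else set(flags)
    if action == "=":
        return given            # replace whatever was there
    if action == "+":
        return current | given  # add without touching existing flags
    if action == "-":
        return current - given  # remove only the named flags
    raise ValueError(f"unknown action {action!r}")


state = set()                         # //00=_
state = apply_action(state, "=", "t")  # //00=t
state = apply_action(state, "+", "l")  # //00=tl
state = apply_action(state, "-", "l")  # //00=t
state = apply_action(state, "=", "l")  # //00=l
state = apply_action(state, "=", "_")  # //00=_
```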


>
> To assign a counters to mon group inside the default group.
> $echo "mon_a/00=+t;01=+t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

I think it will simplify parsing if number of "/" is consistent:
$echo "/mon_a/00=t;01=t" > ...

>
> To assign a counters to control mon group inside the default group.

It is not clear to me what you mean with this.

> $echo "ctrl_a/00=+l;01=+l"  > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

this looks similar to previous example, so I think it will be hard for parser
to know whether it is dealing with control group or monitor group.
I am not sure I understand your example, but this may perhaps be:

$ echo "ctrl_a//00=l;01=l" > ...

>
> To assign a counters to control mon group inside another control group.

I do not know what you mean with "another control group"

> $echo "mon_ab/00=+lt;01=+lt" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_contro

How will parser know which control group? I was expecting:
$ echo "<CTRL_MON group>/<MON group>/<flags>"

>
> To unassign a counters to control mon group inside another control group.
> $echo "mon_ab/00=-lt;01=-lt" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>
> To unassign all the counters on a specific group.
> $echo "mon_ab/00=_" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>
> It does not matter control group or mon group. We just need to name of the group in this interface.

It matters because users can have monitor groups with the same name within
different control groups.

Reinette

2024-03-04 19:35:09

by Moger, Babu

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Reinette,

On 3/1/2024 5:20 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 3/1/2024 12:36 PM, Moger, Babu wrote:
>> On 2/29/24 15:50, Reinette Chatre wrote:
>>> On 2/29/2024 12:37 PM, Moger, Babu wrote:
> ...
>
>>>> To make more clear, let me list all the groups here based this.
>>>>
>>>> When none of the counters assigned:
>>>>
>>>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>> resctrl/00=none,none;01=none,none (#default control_mon group)
>>>> resctrl/mon_a/00=none,none;01=none,none (#mon group)
>>>> resctrl/ctrl_a/00=none,none;01=none,none (#control_mon group)
>>>> resctrl/ctrl_a/mon_ab/00=none,none;01=none,none (#mon group)
>>> I am concerned that inconsistent use of "/" will make parsing hard.
>> Do you mean, you don't want to see multiple "/"?
> No. I think that having a consistent number of "/" will be easier to
> parse. In the above example, there are instances of 1, 2, as well as
> three "/" among the lines. That seems complicated to parse.
>
> I was thinking that it will make interpreting and parsing easier if there
> consistently are just always two "/".
>
> (You may find things to be different once you work on the parsing code
> though.)
>
> In summary:
> * for monitoring of default CTRL_MON group: "//<flags>"
> * for MON_GROUP inside default CTRL_MON group: "/<MON group>/<flags>"
> * for monitoring of non-default CTRL_MON group: "<CTRL_MON group>//flags"
> * for MON_GROUP within CTRL_MON group: "<CTRL_MON group>/<MON group>/<flags>"
>
> What do you think?
Looks like you tried to keep two "/" in all the options. Looks good for
the most part. I will keep the options open for changes when we start
implementing.
>
>> resctrl/ctrl_a/mon_ab/
>>
>> Change to
>>
>> mon_ab/
> rather:
> ctrl_a/mon_ab/<flags>
Sure.
>
>>> I find "resctrl" and all the "none" redundant. It is not clear what
>>> this improves.
>>> Why have:
>>> resctrl/00=none,none;01=none,none
>>> when this could do:
>>> //00=_;01=_
>> ok.
>>
>> "//" meaning root of resctrl filesystem?
> More specifically, monitoring of default control group. It is not intended to
> specify a pathname.
>
>>>> When some counters are assigned:
>>>>
>>>> $echo "resctrl/00=total,local" >
>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to
>>>> default group)
>>>>
>>>> $echo "resctrl/mon_a/00=total;01=total" >
>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to mon
>>>> group)
>>>>
>>>> $echo "resctrl/ctrl_a/00=local;01=local" >
>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>
>>>> $echo "resctrl/ctrl_a/mon_ab/00=total,local;01=total,local" >
>>>> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>>>>
>>> We could learn some more lessons from dynamic debug (see Documentation/admin-guide/dynamic-debug-howto.rst). For example, "=" can be used to make an assignment while "+"
>>> can be used to add a counter and "-" can be used to remove a counter.
>>> "=_" can be used to remove counters from all events in that domain.
>> Yes. Looked at dynamic debug. I am still learning this interface. Some examples below based on my understanding.
>>
>> To assign a counters to default group on domain 0.
>> $echo "//00=+lt;01=+lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> It should not be necessary to use both "=" and "+" in the same assignment.
> I think of "=" as "assign" and "+" as append ("-" as remove).
Here are our options.

a. assign one event (+)

b. unassign one event (-)

c. assign both (maybe ++?)

d. unassign both (_)

I think append ("=") is not required while assigning. It is confusing.

Assign and add both involve the same action.

How about this? This might be easy to parse. Maybe add a space (" ") after
the domain id.

<group>/<domain id> <action><event>; <domain id> <action><event>

>
> An example of this, just focusing on default group.
>
> #hypothetical start state of no counters assigned
> $ cat mbm_assign_control
> #control_group/monitor_group/flags
> //00=_;01=_
Looks good.
>
> #assign counter to total MBM of both domains
> $ echo "//00=t;01=t" > mbm_assign_control
There is no difference between assign and add. Just add the total MBM event.

$ echo "//00 +t;01 +t" > mbm_assign_control

> $ cat mbm_assign_control
> #control_group/monitor_group/flags
> //00=t;01=t
good.
>
> #add counter to local MBM of both domains without impacting total MBM counters
> $echo "//00+l;01+l" > mbm_assign_control
It is not required to know whether the total MBM event is already
assigned or not. Just assign the event of your interest. If it is
already assigned then the kernel just ignores it. The kernel has all
the assignment status information.

$echo "//00 +l;01 +l" > mbm_assign_control

We will know the full status of the assignment when we list again.

> $ cat mbm_assign_control
> #control_group/monitor_group/flags
> //00=tl;01=tl
Good.
>
> #remove local MBM counters without impacting total MBM counters
> $echo "//00-l;01-l" > mbm_assign_control
Remove local MBM counters. We don't need to know about total MBM counter.

$echo "//00 -l;01 -l" > mbm_assign_control

> $ cat mbm_assign_control
> #control_group/monitor_group/flags
> //00=t;01=t
Good.
>
> #assign local MBM counters, removing total MBM counters while doing so
> $echo "//00=l;01=l" > mbm_assign_control
This is confusing again. Just remove the total event and add the local
event in two commands.

$echo "//00 -t;01 -t" > mbm_assign_control
$echo "//00 +l;01 +l" > mbm_assign_control

> $ cat mbm_assign_control
> #control_group/monitor_group/flags
> //00=l;01=l
good
>
> #remove all counters
> $echo "//00=_;01=_" > mbm_assign_control
> $ cat mbm_assign_control
> #control_group/monitor_group/flags
> //00=_;01=_
>
>
>> To assign a counters to mon group inside the default group.
>> $echo "mon_a/00=+t;01=+t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> I think it will simplify parsing if number of "/" is consistent:
> $echo "/mon_a/00=t;01=t" > ...
>
>> To assign a counters to control mon group inside the default group.
> It is not clear to me what you mean with this.
>
>> $echo "ctrl_a/00=+l;01=+l"  > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> this looks similar to previous example, so I think it will be hard for parser
> to know whether it is dealing with control group or monitor group.
> I am not sure I understand your example, but this may perhaps be:
>
> echo "ctrl_a//00=l;01=l > ...
>
>> To assign a counters to control mon group inside another control group.
> I do not know what you mean with "another control group"
>
>> $echo "mon_ab/00=+lt;01=+lt" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_contro
> How will parser know which control group? I was expecting:
> $ echo "<CTRL_MON group>/<MON group>/<flags>"
Sure.
>
>> To unassign a counters to control mon group inside another control group.
>> $echo "mon_ab/00=-lt;01=-lt" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>>
>> To unassign all the counters on a specific group.
>> $echo "mon_ab/00=_" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>>
>> It does not matter control group or mon group. We just need to name of the group in this interface.
> It matters because users can have monitor groups with the same name within
> different control groups.

Agree.

I will list all the example again once we agree on specific format.

Thanks

Babu


2024-03-04 19:58:36

by Reinette Chatre

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On 3/4/2024 11:34 AM, Moger, Babu wrote:
> On 3/1/2024 5:20 PM, Reinette Chatre wrote:
>> On 3/1/2024 12:36 PM, Moger, Babu wrote:
>>> On 2/29/24 15:50, Reinette Chatre wrote:
>>>> On 2/29/2024 12:37 PM, Moger, Babu wrote:
>>

>>> To assign a counters to default group on domain 0.
>>> $echo "//00=+lt;01=+lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> It should not be necessary to use both "=" and "+" in the same assignment.
>> I think of "=" as "assign" and "+" as append ("-" as remove).
> Here are our options.
>
> a. assign one event (+)

I prefer that we use consistent interface with what users may be used to
in other kernel interfaces, like dynamic debug.
Considering that, "+" will not be "assign one event" but instead
(let me copy text from dynamic debug to help):
"+ add the given flags"

So + will add (append) the provided flags to the matching domain, it
can be multiple flags and does not impact existing flags.

>
> b. unassign one event (-)

"- remove the given flags" - it can be multiple flags that should be
removed from domain.

>
> c. assign both (++ may be?)

No. Please do not constrain the interface with what needs to be supported
for ABMC. We may want to add other flags in the future, do not limit it to
two flags.

>
> d. unassign both (_)

"=_" will unassign all flags without consideration of which flags
are set. User can also use "-l" to just unassign local MBM, "-t" to
unassign total MBM, or "-lt" to unassign local and total MBM specifically.

>
> I think append ( "=") is not required while assigning.  It is confusing.

"=" is not append. It is assign:

" = set the flags to the given flags"

>
> Assign or Add both involve same action.
>
> How about this? This might be easy to parse. May be space (" ") after the domain id.

Why a space?

>
> <group>/<domain id> <action><event>; <domain id> <action><event>
>

<control group>/<monitor group>/<domain id><action><flags or _>;<domain id><action><flags or _>
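
For illustration, a line in this format can be parsed with very little code. The Python sketch below is a user-space model only (the kernel parser would be C and may end up looking different); the regular expression, the lowercase-letter flag alphabet and the name `parse_line` are assumptions made for the example.

```python
import re

# One domain entry: <domain id><action><flags or _>, e.g. "00=lt", "01+l", "00=_"
DOMAIN_RE = re.compile(r"^(\d+)([=+-])(_|[a-z]+)$")


def parse_line(line):
    """Parse '<control group>/<monitor group>/<domain entries>'.

    Returns (ctrl, mon, ops) where ops is a list of
    (domain id, action, flags) tuples in the order they were written.
    """
    parts = line.strip().split("/")
    if len(parts) != 3:
        raise ValueError("expected exactly two '/' separators")
    ctrl, mon, entries = parts
    ops = []
    for entry in entries.split(";"):
        m = DOMAIN_RE.match(entry)
        if m is None:
            raise ValueError(f"bad domain entry {entry!r}")
        domain, action, flags = m.groups()
        ops.append((int(domain), action, flags))
    return ctrl, mon, ops
```

Because each entry carries exactly one action character, forms such as `00=+lt` or `00 +t` (with a space) are rejected by this sketch rather than silently accepted.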

Reinette


2024-03-04 22:24:42

by Moger, Babu

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Reinette,

On 3/4/2024 1:58 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 3/4/2024 11:34 AM, Moger, Babu wrote:
>> On 3/1/2024 5:20 PM, Reinette Chatre wrote:
>>> On 3/1/2024 12:36 PM, Moger, Babu wrote:
>>>> On 2/29/24 15:50, Reinette Chatre wrote:
>>>>> On 2/29/2024 12:37 PM, Moger, Babu wrote:
>>>> To assign a counters to default group on domain 0.
>>>> $echo "//00=+lt;01=+lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>> It should not be necessary to use both "=" and "+" in the same assignment.
>>> I think of "=" as "assign" and "+" as append ("-" as remove).
>> Here are our options.
>>
>> a. assign one event (+)
> I prefer that we use consistent interface with what users may be used to
> in other kernel interfaces, like dynamic debug.
> Considering that, "+" will not be "assign one event" but instead
> (let me copy text from dynamic debug to help):
> "+ add the given flags"
>
> So + will add (append) the provided flags to the matching domain, it
> can be multiple flags and does not impact existing flags.

ok. Sure.


>
>> b. unassign one event (-)
> "- remove the given flags" - it can be multiple flags that should be
> removed from domain.
>
>> c. assign both (++ may be?)
> No. Please do not constrain the interface with what needs to be supported
> for ABMC. We may want to add other flags in the future, do not limit it to
> two flags.
ok Sure.
>
>> d. unassign both (_)
> "=_" will unassign all flags without consideration of which flags
> are set. User can also use "-l" to just unassign local MBM, "-t" to
> unassign total MBM, or "-lt" to unassign local and total MBM specifically.

oh ok. got it.


>
>> I think append ( "=") is not required while assigning.  It is confusing.
> "=" is not append. It is assign:
>
> " = set the flags to the given flags"
ok.
>
>> Assign or Add both involve same action.
>>
>> How about this? This might be easy to parse. May be space (" ") after the domain id.
> Why a space?
>
>> <group>/<domain id> <action><event>; <domain id> <action><event>
>>
> <control group>/<monitor group/<domain id><action><flags or _>;<domain id><action><flags or _>


Based on our discussion, I am listing a few examples here. Let me know if
I missed something.

   $mount -t resctrl resctrl /sys/fs/resctrl/

1. Assign both local and total counters to default group on domain 0 and 1.
   $echo "//00=lt;01=lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

   $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
   //00=lt;01=lt

2. Assign a total event to a mon group inside the default group for both
domains 0 and 1.

   $mkdir /sys/fs/resctrl/mon_groups/mon_a
   $echo "/mon_a/00+t;01+t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

   $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
   //00=lt;01=lt
   /mon_a/00=t;01=t

3. Assign a local event to a non-default control mon group for both
domains 0 and 1.
   $mkdir /sys/fs/resctrl/ctrl_a
   $echo "/ctrl_a/00=l;01=l" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

   $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
   //00=lt;01=lt
   /mon_a/00=t;01=t
   /ctrl_a/00=l;01=l

4. Assign both counters to a mon group inside another control
group (non-default).
   $mkdir /sys/fs/resctrl/ctrl_a/mon_ab/
   $echo "ctrl_a/mon_ab/00=lt;01=lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

   $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
   //00=lt;01=lt
   /mon_a/00=t;01=t
   /ctrl_a/00=l;01=l
   ctrl_a/mon_ab/00=lt;01=lt

5. Unassign a counter from a mon group inside another (non-default) control
group.
   $echo "ctrl_a/mon_ab/00-l;01-l" >
/sys/fs/resctrl/nfo/L3_MON/mbm_assign_control

  $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
  //00=lt;01=lt
  /mon_a/00=t;01=t
  /ctrl_a/00=l;01=l
  ctrl_a/mon_ab/00=t;01=t

6. Unassign all the counters on a specific group.
   $echo "ctrl_a/mon_ab/00=_" >
/sys/fs/resctrl/nfo/L3_MON/mbm_assign_control

   $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
   //00=lt;01=lt
   /mon_a/00=t;01=t
   /ctrl_a/00=l;01=l
   ctrl_a/mon_ab/00=_;01=_

Thanks
Babu Moger


2024-03-05 14:58:28

by Moger, Babu

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Peter,

I want to get your feedback on this approach before I start working on
it.  Please look at the few examples below.

Thanks

Babu

On 3/4/2024 4:24 PM, Moger, Babu wrote:
> Hi Reinette,
>
> On 3/4/2024 1:58 PM, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 3/4/2024 11:34 AM, Moger, Babu wrote:
>>> On 3/1/2024 5:20 PM, Reinette Chatre wrote:
>>>> On 3/1/2024 12:36 PM, Moger, Babu wrote:
>>>>> On 2/29/24 15:50, Reinette Chatre wrote:
>>>>>> On 2/29/2024 12:37 PM, Moger, Babu wrote:
>>>>> To assign a counters to default group on domain 0.
>>>>> $echo "//00=+lt;01=+lt" >
>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>> It should not be necessary to use both "=" and "+" in the same
>>>> assignment.
>>>> I think of "=" as "assign" and "+" as append ("-" as remove).
>>> Here are our options.
>>>
>>> a. assign one event (+)
>> I prefer that we use consistent interface with what users may be used to
>> in other kernel interfaces, like dynamic debug.
>> Considering that, "+" will not be "assign one event" but instead
>> (let me copy text from dynamic debug to help):
>> "+    add the given flags"
>>
>> So + will add (append) the provided flags to the matching domain, it
>> can be multiple flags and does not impact existing flags.
>
> ok. Sure.
>
>
>>
>>> b. unassign one event (-)
>> "-    remove the given flags" - it can be multiple flags that should be
>> removed from domain.
>>
>>> c. assign both (++ may be?)
>> No. Please do not constrain the interface with what needs to be
>> supported
>> for ABMC. We may want to add other flags in the future, do not limit
>> it to
>> two flags.
> ok Sure.
>>
>>> d. unassign both (_)
>> "=_" will unassign all flags without consideration of which flags
>> are set. User can also use "-l" to just unassign local MBM, "-t" to
>> unassign total MBM, or "-lt" to unassign local and total MBM
>> specifically.
>
> oh ok. got it.
>
>
>>
>>> I think append ( "=") is not required while assigning.  It is
>>> confusing.
>> "=" is not append. It is assign:
>>
>> " =    set the flags to the given flags"
> ok.
>>
>>> Assign or Add both involve same action.
>>>
>>> How about this? This might be easy to parse. May be space (" ")
>>> after the domain id.
>> Why a space?
>>
>>> <group>/<domain id> <action><event>; <domain id> <action><event>
>>>
>> <control group>/<monitor group/<domain id><action><flags or
>> _>;<domain id><action><flags or _>
>
>
> Based on our discussion, I am listing few examples here. Let me know
> if I missed something.
>
>   mount  -t resctrl resctrl /sys/fs/resctrl/
>
> 1. Assign both local and total counters to default group on domain 0
> and 1.
>    $echo "//00=lt;01=lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
>    $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>    //00=lt;01=lt
>
> 2. Assign a total event to mon group inside the default group for both
> domain 0 and 1.
>
>    $mkdir /sys/fs/resctrl/mon_groups/mon_a
>    $echo "/mon_a/00+t;01+t" >
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
>    $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>    //00=lt;01=lt
>    /mon_a/00=t;01=t
>
> 3. Assign a local event to non-default control mon group both domain 0
> and 1.
>    $mkdir /sys/fs/resctrl/ctrl_a
>    $echo "/ctrl_a/00=l;01=l"  >
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
>    $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>    //00=lt;01=lt
>    /mon_a/00=t;01=t
>    /ctrl_a/00=l;01=l
>
> 4. Assign a both counters to mon group inside another control
> group(non-default).
>    $mkdir /sys/fs/resctrl/ctrl_a/mon_ab/
>    $echo "ctrl_a/mon_ab/00=lt;01=lt" >
> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_contro
>
>    $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>    //00=lt;01=lt
>    /mon_a/00=t;01=t
>    /ctrl_a/00=l;01=l
>    ctrl_a/mon_ab/00=lt;01=lt
>
> 5. Unassign a counter to mon group inside another control
> group(non-default).
>    $echo "ctrl_a/mon_ab/00-l;01-l" >
> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>
>   $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>   //00=lt;01=lt
>   /mon_a/00=t;01=t
>   /ctrl_a/00=l;01=l
>   ctrl_a/mon_ab/00=t;01=t
>
> 6. Unassign all the counters on a specific group.
>    $echo "ctrl_a/mon_ab/00=_" >
> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>
>    $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>    //00=lt;01=lt
>    /mon_a/00=t;01=t
>    /ctrl_a/00=l;01=l
>    ctrl_a/mon_ab/00=_;01=_
>
> Thanks
> Babu Moger
>

2024-03-05 17:12:43

by Reinette Chatre

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On 3/4/2024 2:24 PM, Moger, Babu wrote:

> Based on our discussion, I am listing few examples here. Let me know if I missed something.
>
>   mount  -t resctrl resctrl /sys/fs/resctrl/

When creating examples it may help to accompany it with an overview of
which groups exist.

>
> 1. Assign both local and total counters to default group on domain 0 and 1.
>    $echo "//00=lt;01=lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
>    $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>    //00=lt;01=lt

I also think it will be useful to always print a small header that guides
the interpretation. For example,

$ cat mbm_assign_control
#control_group/monitor_group/flags
...

>
> 2. Assign a total event to mon group inside the default group for both domain 0 and 1.
>
>    $mkdir /sys/fs/resctrl/mon_groups/mon_a
>    $echo "/mon_a/00+t;01+t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
>    $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>    //00=lt;01=lt
>    /mon_a/00=t;01=t

For an example of "+" I think understanding the output will be easier if the "before" view with
existing flags is available. For example,
if it was
$cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
#control_group/monitor_group/flags
/mon_a/00=l;01=l

then
$echo "/mon_a/00+t;01+t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

would result in:
$cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
#control_group/monitor_group/flags
/mon_a/00=lt;01=lt

An example like above would make it easier to understand how it is different
from using "=" like in example 1.
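
To make the difference between the three actions concrete, here is a small
sketch in Python. It is only a model of the semantics discussed in this
thread ("=" sets, "+" adds, "-" removes, "_" means no flags), not kernel
code. The function name apply_op and the "l before t" display order are
assumptions made for this illustration.

```python
KNOWN_FLAGS = "lt"  # assumed display order: local MBM, then total MBM


def apply_op(current, op, flags):
    """Return the flag string that results from applying <op><flags> to current."""
    have = set(current) - {"_"}   # "_" stands for "no flags assigned"
    given = set(flags) - {"_"}
    unknown = (have | given) - set(KNOWN_FLAGS)
    if unknown:
        raise ValueError(f"unknown flags: {sorted(unknown)}")
    if op == "=":        # set the flags to the given flags
        have = given
    elif op == "+":      # add the given flags, keep the existing ones
        have = have | given
    elif op == "-":      # remove the given flags
        have = have - given
    else:
        raise ValueError(f"unknown action: {op!r}")
    return "".join(f for f in KNOWN_FLAGS if f in have) or "_"


# Starting from "l" on a domain, "+t" and "=t" give different results.
print(apply_op("l", "+", "t"))
print(apply_op("l", "=", "t"))
```

With the "before" state 00=l, "+t" keeps the local flag and adds total, while
"=t" replaces the existing flags with total only.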

>
> 3. Assign a local event to non-default control mon group both domain 0 and 1.
>    $mkdir /sys/fs/resctrl/ctrl_a
>    $echo "/ctrl_a/00=l;01=l"  > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

I think this should be:
$echo "ctrl_a//00=l;01=l" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

>
>    $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>    //00=lt;01=lt
>    /mon_a/00=t;01=t
>    /ctrl_a/00=l;01=l

ctrl_a//00=l;01=l

>
> 4. Assign a both counters to mon group inside another control group(non-default).
>    $mkdir /sys/fs/resctrl/ctrl_a/mon_ab/

Above will not work.

$ mkdir /sys/fs/resctrl/ctrl_a
$ mkdir /sys/fs/resctrl/ctrl_a/mon_groups/mon_ab

>    $echo "ctrl_a/mon_ab/00=lt;01=lt" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_contro

(watch out for typos in examples)

>
>    $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>    //00=lt;01=lt
>    /mon_a/00=t;01=t
>    /ctrl_a/00=l;01=l
>    ctrl_a/mon_ab/00=lt;01=lt
>
> 5. Unassign a counter to mon group inside another control group(non-default).
>    $echo "ctrl_a/mon_ab/00-l;01-l" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>
>   $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>   //00=lt;01=lt
>   /mon_a/00=t;01=t
>   /ctrl_a/00=l;01=l
>   ctrl_a/mon_ab/00=t;01=t

ack.

>
> 6. Unassign all the counters on a specific group.
>    $echo "ctrl_a/mon_ab/00=_" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control

(watch for typos)

Note that this only unassigns the counters on domain 0.

>
>    $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>    //00=lt;01=lt
>    /mon_a/00=t;01=t
>    /ctrl_a/00=l;01=l
>    ctrl_a/mon_ab/00=_;01=_

ctrl_a/mon_ab/00=_;01=t
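
The per-domain behaviour can be modelled with a short Python sketch. It
assumes the line format discussed in this thread,
<control group>/<monitor group>/<domain id><action><flags or _>, and keeps
the state in a plain dictionary. parse_line, apply_line and show are
hypothetical names for this illustration, and only the "l" and "t" flags are
accepted.

```python
import re

# One "<domain id><action><flags or _>" item, for example "00=lt" or "01-l".
ITEM = re.compile(r"^(\d+)([=+-])([lt]+|_)$")


def parse_line(line):
    """Split "<control group>/<monitor group>/<items>" into its parts."""
    ctrl, mon, items = line.split("/", 2)
    ops = []
    for item in items.split(";"):
        m = ITEM.match(item)
        if m is None:
            raise ValueError(f"bad item: {item!r}")
        ops.append((m.group(1), m.group(2), m.group(3)))
    return ctrl, mon, ops


def apply_line(state, line):
    """Update only the domains that are named in the line."""
    ctrl, mon, ops = parse_line(line)
    domains = state[(ctrl, mon)]
    for dom, op, flags in ops:
        have = set(domains[dom]) - {"_"}
        given = set(flags) - {"_"}
        if op == "=":
            have = given
        elif op == "+":
            have = have | given
        else:  # "-"
            have = have - given
        domains[dom] = "".join(f for f in "lt" if f in have) or "_"


def show(state, key):
    """Render one group in the same form the file would list it."""
    ctrl, mon = key
    items = ";".join(f"{d}={f}" for d, f in sorted(state[key].items()))
    return f"{ctrl}/{mon}/{items}"


state = {("ctrl_a", "mon_ab"): {"00": "t", "01": "t"}}
apply_line(state, "ctrl_a/mon_ab/00=_")
print(show(state, ("ctrl_a", "mon_ab")))
```

Because the write names only domain 00, domain 01 keeps its previous flags.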

To address some earlier requirements I think it will be helpful to also
show an example of multiple groups changed with a single write.

Reinette


2024-03-05 19:59:37

by Babu Moger

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Reinette,

On 3/5/24 11:12, Reinette Chatre wrote:
> Hi Babu,
>
> On 3/4/2024 2:24 PM, Moger, Babu wrote:
>
>> Based on our discussion, I am listing few examples here. Let me know if I missed something.
>>
>>   mount  -t resctrl resctrl /sys/fs/resctrl/
>
> When creating examples it may help to accompany it with an overview of
> which groups exist.

Ok. I can add a line about types of groups before the examples.

>
>>
>> 1. Assign both local and total counters to default group on domain 0 and 1.
>>    $echo "//00=lt;01=lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>
>>    $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>    //00=lt;01=lt
>
> I also think it will be useful to always print a small header that guides
> the interpretation. For example,
>
> $ cat mbm_assign_control
> #control_group/monitor_group/flags
> ...

Ok. Sure

>
>>
>> 2. Assign a total event to mon group inside the default group for both domain 0 and 1.
>>
>>    $mkdir /sys/fs/resctrl/mon_groups/mon_a
>>    $echo "/mon_a/00+t;01+t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>
>>    $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>    //00=lt;01=lt
>>    /mon_a/00=t;01=t
>
> For an example of "+" I think understanding the output will be easier if the "before" view with
> existing flags is available. For example,
> if it was
> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> #control_group/monitor_group/flags
> /mon_a/00=l;01=l
>
> then
> $echo "/mon_a/00+t;01+t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

Yes. Makes Sense.
>
> would result in:
> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> #control_group/monitor_group/flags
> /mon_a/00=lt;01=lt
>
> An example like above would make it easier to understand how it is different
> from using "=" like in example 1.

Yes. Sure.
>
>>
>> 3. Assign a local event to non-default control mon group both domain 0 and 1.
>>    $mkdir /sys/fs/resctrl/ctrl_a
>>    $echo "/ctrl_a/00=l;01=l"  > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
> I think this should be:
> $echo "ctrl_a//00=l;01=l" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

Ok.

>
>>
>>    $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>    //00=lt;01=lt
>>    /mon_a/00=t;01=t
>>    /ctrl_a/00=l;01=l
>
> ctrl_a//00=l;01=l

Yes.
>
>>
>> 4. Assign a both counters to mon group inside another control group(non-default).
>>    $mkdir /sys/fs/resctrl/ctrl_a/mon_ab/
>
> Above will not work.

Yes. Missed that.
>
> $ mkdir /sys/fs/resctrl/ctrl_a
> $ mkdir /sys/fs/resctrl/ctrl_a/mon_groups/mon_ab
>
>>    $echo "ctrl_a/mon_ab/00=lt;01=lt" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_contro
>
> (watch out for typos in examples)

Sure.
>
>>
>>    $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>    //00=lt;01=lt
>>    /mon_a/00=t;01=t
>>    /ctrl_a/00=l;01=l
>>    ctrl_a/mon_ab/00=lt;01=lt
>>
>> 5. Unassign a counter to mon group inside another control group(non-default).
>>    $echo "ctrl_a/mon_ab/00-l;01-l" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>>
>>   $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>   //00=lt;01=lt
>>   /mon_a/00=t;01=t
>>   /ctrl_a/00=l;01=l
>>   ctrl_a/mon_ab/00=t;01=t
>
> ack.
>
>>
>> 6. Unassign all the counters on a specific group.
>>    $echo "ctrl_a/mon_ab/00=_" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>
> (watch for typos)
>
> Note that this only did unassign on domain 0.

Yes.
>
>>
>>    $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>    //00=lt;01=lt
>>    /mon_a/00=t;01=t
>>    /ctrl_a/00=l;01=l
>>    ctrl_a/mon_ab/00=_;01=_
>
> ctrl_a/mon_ab/00=_;01=t
>
> To address some earlier requirements I think it will be helpful to also
> show an example of multiple groups changed with a single write.

Oh, OK. Two groups delimited by "\n".

--
Thanks
Babu Moger

2024-03-07 18:58:06

by Peter Newman

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On Mon, Mar 4, 2024 at 2:24 PM Moger, Babu wrote:
> Based on our discussion, I am listing few examples here. Let me know if
> I missed something.
>
> mount -t resctrl resctrl /sys/fs/resctrl/
>
> 1. Assign both local and total counters to default group on domain 0 and 1.
> $echo "//00=lt;01=lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> //00=lt;01=lt
>
> 2. Assign a total event to mon group inside the default group for both
> domain 0 and 1.
>
> $mkdir /sys/fs/resctrl/mon_groups/mon_a
> $echo "/mon_a/00+t;01+t" >
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> //00=lt;01=lt
> /mon_a/00=t;01=t
>
> 3. Assign a local event to non-default control mon group both domain 0
> and 1.
> $mkdir /sys/fs/resctrl/ctrl_a
> $echo "/ctrl_a/00=l;01=l" >
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> //00=lt;01=lt
> /mon_a/00=t;01=t
> /ctrl_a/00=l;01=l
>
> 4. Assign a both counters to mon group inside another control
> group(non-default).
> $mkdir /sys/fs/resctrl/ctrl_a/mon_ab/
> $echo "ctrl_a/mon_ab/00=lt;01=lt" >
> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_contro
>
> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> //00=lt;01=lt
> /mon_a/00=t;01=t
> /ctrl_a/00=l;01=l
> ctrl_a/mon_ab/00=lt;01=lt
>
> 5. Unassign a counter to mon group inside another control
> group(non-default).
> $echo "ctrl_a/mon_ab/00-l;01-l" >
> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>
> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> //00=lt;01=lt
> /mon_a/00=t;01=t
> /ctrl_a/00=l;01=l
> ctrl_a/mon_ab/00=t;01=t
>
> 6. Unassign all the counters on a specific group.
> $echo "ctrl_a/mon_ab/00=_" >
> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>
> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> //00=lt;01=lt
> /mon_a/00=t;01=t
> /ctrl_a/00=l;01=l
> ctrl_a/mon_ab/00=_;01=_

The use case I'm interested in is iterating 32 counters over 256
groups[1]. If it's not possible to reassign 32 counters in a single
write system call, with just one IPI per domain per batch reassignment
operation, then I don't see any advantage over the original proposal
with the assignment control file in every group directory. We already
had fine-grained control placing assign/unassign nodes throughout the
directory hierarchy, with the scope implicit in the directory
location.

The interface I proposed in [1] aims to reduce the per-domain IPIs by
a factor of the number of counters, rather than sending off 2 rounds
of IPIs to each domain for each monitoring group.

-Peter

[1] https://lore.kernel.org/lkml/CALPaoChhKJiMAueFtgCTc7ffO++S5DJCySmxqf9ZDmhR9RQapw@mail.gmail.com/

2024-03-07 20:41:32

by Reinette Chatre

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Peter,

On 3/7/2024 10:57 AM, Peter Newman wrote:
> Hi Babu,
>
> On Mon, Mar 4, 2024 at 2:24 PM Moger, Babu wrote:
>> Based on our discussion, I am listing few examples here. Let me know if
>> I missed something.
>>
>> mount -t resctrl resctrl /sys/fs/resctrl/
>>
>> 1. Assign both local and total counters to default group on domain 0 and 1.
>> $echo "//00=lt;01=lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>
>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> //00=lt;01=lt
>>
>> 2. Assign a total event to mon group inside the default group for both
>> domain 0 and 1.
>>
>> $mkdir /sys/fs/resctrl/mon_groups/mon_a
>> $echo "/mon_a/00+t;01+t" >
>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>
>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> //00=lt;01=lt
>> /mon_a/00=t;01=t
>>
>> 3. Assign a local event to non-default control mon group both domain 0
>> and 1.
>> $mkdir /sys/fs/resctrl/ctrl_a
>> $echo "/ctrl_a/00=l;01=l" >
>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>
>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> //00=lt;01=lt
>> /mon_a/00=t;01=t
>> /ctrl_a/00=l;01=l
>>
>> 4. Assign a both counters to mon group inside another control
>> group(non-default).
>> $mkdir /sys/fs/resctrl/ctrl_a/mon_ab/
>> $echo "ctrl_a/mon_ab/00=lt;01=lt" >
>> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_contro
>>
>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> //00=lt;01=lt
>> /mon_a/00=t;01=t
>> /ctrl_a/00=l;01=l
>> ctrl_a/mon_ab/00=lt;01=lt
>>
>> 5. Unassign a counter to mon group inside another control
>> group(non-default).
>> $echo "ctrl_a/mon_ab/00-l;01-l" >
>> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>>
>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> //00=lt;01=lt
>> /mon_a/00=t;01=t
>> /ctrl_a/00=l;01=l
>> ctrl_a/mon_ab/00=t;01=t
>>
>> 6. Unassign all the counters on a specific group.
>> $echo "ctrl_a/mon_ab/00=_" >
>> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>>
>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> //00=lt;01=lt
>> /mon_a/00=t;01=t
>> /ctrl_a/00=l;01=l
>> ctrl_a/mon_ab/00=_;01=_
>
> The use case I'm interested in is iterating 32 counters over 256
> groups[1]. If it's not possible to reassign 32 counters in a single
> write system call, with just one IPI per domain per batch reassignment
> operation, then I don't see any advantage over the original proposal
> with the assignment control file in every group directory. We already
> had fine-grained control placing assign/unassign nodes throughout the
> directory hierarchy, with the scope implicit in the directory
> location.

The intent of this interface is to support modification of several
groups with a single write. These examples only show impact to a single
group at a time, but multiple groups can be modified by separating
configurations with a "\n". I believe Babu was planning to add some
of these examples in his next iteration since it is not obvious yet.
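
As a rough sketch of what such a multi-group request could look like, the
Python helper below builds one payload with a line per group, each line
terminated by "\n". The name build_batch, the two-digit domain ids and the
"c1/m1" style group paths are assumptions taken from the examples in this
thread, not a settled format.

```python
def build_batch(unassign, assign, domains, flags="lt"):
    """Build one payload that unassigns some groups and assigns others.

    unassign, assign: iterables of "<control group>/<monitor group>" strings.
    domains: iterable of integer domain ids.
    """
    domains = list(domains)  # allow generators and ranges to be reused
    lines = []
    for group in unassign:
        lines.append(group + "/" + ";".join(f"{d:02d}=_" for d in domains))
    for group in assign:
        lines.append(group + "/" + ";".join(f"{d:02d}={flags}" for d in domains))
    # Each group's configuration is terminated by a newline.
    return "".join(line + "\n" for line in lines)


payload = build_batch(["c1/m1", "c1/m2"], ["c1/m16", "c1/m17"], range(2))
print(payload, end="")
```

The resulting string is meant to be handed to the kernel in one write, so that
all groups in it are changed by a single request.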

>
> The interface I proposed in [1] aims to reduce the per-domain IPIs by
> a factor of the number of counters, rather than sending off 2 rounds
> of IPIs to each domain for each monitoring group.

As I understood it, the interface you proposed focuses on one use case,
while the goal is to find an interface that supports all requirements.
With this proposed interface it is possible to make large-scale changes
with a single sysfs write.

Reinette

2024-03-07 22:53:47

by Reinette Chatre

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Peter,

On 3/7/2024 2:33 PM, Peter Newman wrote:
> Hi Reinette,
>
> On Thu, Mar 7, 2024 at 12:41 PM Reinette Chatre wrote:
>>
>> Hi Peter,
>>
>> On 3/7/2024 10:57 AM, Peter Newman wrote:
>>> Hi Babu,
>>>
>>> On Mon, Mar 4, 2024 at 2:24 PM Moger, Babu wrote:
>>>> Based on our discussion, I am listing few examples here. Let me know if
>>>> I missed something.
>>>>
>>>> mount -t resctrl resctrl /sys/fs/resctrl/
>>>>
>>>> 1. Assign both local and total counters to default group on domain 0 and 1.
>>>> $echo "//00=lt;01=lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>
>>>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>> //00=lt;01=lt
>>>>
>>>> 2. Assign a total event to mon group inside the default group for both
>>>> domain 0 and 1.
>>>>
>>>> $mkdir /sys/fs/resctrl/mon_groups/mon_a
>>>> $echo "/mon_a/00+t;01+t" >
>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>
>>>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>> //00=lt;01=lt
>>>> /mon_a/00=t;01=t
>>>>
>>>> 3. Assign a local event to non-default control mon group both domain 0
>>>> and 1.
>>>> $mkdir /sys/fs/resctrl/ctrl_a
>>>> $echo "/ctrl_a/00=l;01=l" >
>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>
>>>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>> //00=lt;01=lt
>>>> /mon_a/00=t;01=t
>>>> /ctrl_a/00=l;01=l
>>>>
>>>> 4. Assign a both counters to mon group inside another control
>>>> group(non-default).
>>>> $mkdir /sys/fs/resctrl/ctrl_a/mon_ab/
>>>> $echo "ctrl_a/mon_ab/00=lt;01=lt" >
>>>> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_contro
>>>>
>>>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>> //00=lt;01=lt
>>>> /mon_a/00=t;01=t
>>>> /ctrl_a/00=l;01=l
>>>> ctrl_a/mon_ab/00=lt;01=lt
>>>>
>>>> 5. Unassign a counter to mon group inside another control
>>>> group(non-default).
>>>> $echo "ctrl_a/mon_ab/00-l;01-l" >
>>>> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>>>>
>>>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>> //00=lt;01=lt
>>>> /mon_a/00=t;01=t
>>>> /ctrl_a/00=l;01=l
>>>> ctrl_a/mon_ab/00=t;01=t
>>>>
>>>> 6. Unassign all the counters on a specific group.
>>>> $echo "ctrl_a/mon_ab/00=_" >
>>>> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>>>>
>>>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>> //00=lt;01=lt
>>>> /mon_a/00=t;01=t
>>>> /ctrl_a/00=l;01=l
>>>> ctrl_a/mon_ab/00=_;01=_
>>>
>>> The use case I'm interested in is iterating 32 counters over 256
>>> groups[1]. If it's not possible to reassign 32 counters in a single
>>> write system call, with just one IPI per domain per batch reassignment
>>> operation, then I don't see any advantage over the original proposal
>>> with the assignment control file in every group directory. We already
>>> had fine-grained control placing assign/unassign nodes throughout the
>>> directory hierarchy, with the scope implicit in the directory
>>> location.
>>
>> The intent of this interface is to support modification of several
>> groups with a single write. These examples only show impact to a single
>> group at a time, but multiple groups can be modified by separating
>> configurations with a "\n". I believe Babu was planning to add some
>> of these examples in his next iteration since it is not obvious yet.
>>
>>>
>>> The interface I proposed in [1] aims to reduce the per-domain IPIs by
>>> a factor of the number of counters, rather than sending off 2 rounds
>>> of IPIs to each domain for each monitoring group.
>>
>> I understood the proposed interface appeared to focus on one use case
>> while the goal is to find an interface to support all requirements.
>> With this proposed interface it it possible to make large scale changes
>> with a single sysfs write.
>
> Ok I see you requested[1] one such example earlier.
>
> From what I've read, is this what you had in mind of reassigning 32
> counters from the first 16 groups to the next?
>
> I had found that it's hard to get a single write() syscall out of a
> string containing newlines, so I'm using one explicit call:

Apologies but this is not clear to me, could you please elaborate?

If you are referring to testing via shell you can try ANSI-C Quoting like:
echo -n $'c1/m1/00=_\nc2/m2/00=_\n'

>
> write([mbm_assign_control fd],
> "/c1/m1/00=_;02=_;03=_;04=_;05=_;06=_;07=_;08=_;09=_;10=_;11=_;12=_;13=_;14=_;15=_\n"
> "/c1/m2/00=_;01=_;02=_;03=_;04=_;05=_;06=_;07=_;08=_;09=_;10=_;11=_;12=_;13=_;14=_;15=_\n"
> "/c1/m3/00=_;01=_;02=_;03=_;04=_;05=_;06=_;07=_;08=_;09=_;10=_;11=_;12=_;13=_;14=_;15=_\n"
> [...]
> "/c1/m14/00=_;01=_;02=_;03=_;04=_;05=_;06=_;07=_;08=_;09=_;10=_;11=_;12=_;13=_;14=_;15=_\n"
> "/c1/m15/00=_;01=_;02=_;03=_;04=_;05=_;06=_;07=_;08=_;09=_;10=_;11=_;12=_;13=_;14=_;15=_\n"
> "/c1/m16/00=lt;01=lt;02=lt;03=lt;04=lt;05=lt;06=lt;07=lt;08=lt;09=lt;10=lt;11=lt;12=lt;13=lt;14=lt;15=lt\n"
> "/c1/m17/00=lt;01=lt;02=lt;03=lt;04=lt;05=lt;06=lt;07=lt;08=lt;09=lt;10=lt;11=lt;12=lt;13=lt;14=lt;15=lt\n"
> "/c1/m18/00=lt;01=lt;02=lt;03=lt;04=lt;05=lt;06=lt;07=lt;08=lt;09=lt;10=lt;11=lt;12=lt;13=lt;14=lt;15=lt\n"
> [...]
> "/c1/m30/00=lt;01=lt;02=lt;03=lt;04=lt;05=lt;06=lt;07=lt;08=lt;09=lt;10=lt;11=lt;12=lt;13=lt;14=lt;15=lt\n"
> "/c1/m31/00=lt;01=lt;02=lt;03=lt;04=lt;05=lt;06=lt;07=lt;08=lt;09=lt;10=lt;11=lt;12=lt;13=lt;14=lt;15=lt\n",
> size);

(so far no "/" needed as prefix)

We could also consider some syntax to mean "all domains". For example,
if no domain given then it can mean "all domains"?
So, your example could possibly also be accomplished with a

c1/m1/=_\nc1/m2/=_\nc1/m3/=_\n [...] c1/m16/=lt\nc1/m17/=lt\nc1/m18/=lt\n [...]
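
One way to picture the "no domain given means all domains" idea is as an
expansion step. The Python sketch below assumes that an item starting
directly with an action character applies to every known domain. This is
only an illustration of the suggestion above, and expand_all_domains is a
hypothetical name.

```python
def expand_all_domains(line, domains):
    """Expand items without a domain id into one item per domain.

    line: "<control group>/<monitor group>/<items>", for example "c1/m1/=_".
    domains: list of domain id strings, for example ["00", "01"].
    """
    ctrl, mon, items = line.split("/", 2)
    out = []
    for item in items.split(";"):
        if item and item[0] in "=+-":
            # No domain id in front of the action: apply to all domains.
            out.extend(f"{dom}{item}" for dom in domains)
        else:
            # Domain id present: keep the item unchanged.
            out.append(item)
    return f"{ctrl}/{mon}/" + ";".join(out)


print(expand_all_domains("c1/m1/=_", ["00", "01", "02"]))
print(expand_all_domains("c1/m16/=lt", ["00", "01", "02"]))
```

On a system with many domains the short form keeps each line at a fixed
length instead of growing with the number of domains.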

Any thoughts?

Reinette

2024-03-07 23:14:47

by Peter Newman

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Reinette,

On Thu, Mar 7, 2024 at 2:53 PM Reinette Chatre wrote:
>
> Hi Peter,
>
> On 3/7/2024 2:33 PM, Peter Newman wrote:
> > Hi Reinette,
> >
> > On Thu, Mar 7, 2024 at 12:41 PM Reinette Chatre wrote:
> >> I understood the proposed interface appeared to focus on one use case
> >> while the goal is to find an interface to support all requirements.
> >> With this proposed interface it it possible to make large scale changes
> >> with a single sysfs write.
> >
> > Ok I see you requested[1] one such example earlier.
> >
> > From what I've read, is this what you had in mind of reassigning 32
> > counters from the first 16 groups to the next?
> >
> > I had found that it's hard to get a single write() syscall out of a
> > string containing newlines, so I'm using one explicit call:
>
> Apologies but this is not clear to me, could you please elaborate?
>
> If you are referring to testing via shell you can try ANSI-C Quoting like:
> echo -n $'c1/m1/00=_\nc2/m2/00=_\n'

The echo command uses buffered output through printf() and
putchar()[1]. The behavior of the buffering seems to be a write() call
after each newline, causing the kernel to see the request below as 32
individual commands.
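
One way to sidestep the buffering question in a test is to skip stdio
entirely and hand the whole request to the kernel through the raw file
descriptor. The Python sketch below does that with os.open() and os.write().
The helper name write_once is made up for this illustration, and the
mbm_assign_control path mentioned afterwards is the one proposed in this
thread, not an existing interface.

```python
import os


def write_once(path, payload):
    """Send the whole payload with a single os.write() on a raw descriptor."""
    data = payload.encode("ascii")
    fd = os.open(path, os.O_WRONLY)
    try:
        # os.write() does no userspace buffering, so embedded newlines do
        # not split the request into several writes.
        written = os.write(fd, data)
    finally:
        os.close(fd)
    if written != len(data):
        raise OSError(f"short write: {written} of {len(data)} bytes")
    return written
```

A test could then call write_once() with the proposed
/sys/fs/resctrl/info/L3_MON/mbm_assign_control path and a payload such as
"c1/m1/00=_\nc2/m2/00=_\n".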

>
> >
> > write([mbm_assign_control fd],
> > "/c1/m1/00=_;02=_;03=_;04=_;05=_;06=_;07=_;08=_;09=_;10=_;11=_;12=_;13=_;14=_;15=_\n"
> > "/c1/m2/00=_;01=_;02=_;03=_;04=_;05=_;06=_;07=_;08=_;09=_;10=_;11=_;12=_;13=_;14=_;15=_\n"
> > "/c1/m3/00=_;01=_;02=_;03=_;04=_;05=_;06=_;07=_;08=_;09=_;10=_;11=_;12=_;13=_;14=_;15=_\n"
> > [...]
> > "/c1/m14/00=_;01=_;02=_;03=_;04=_;05=_;06=_;07=_;08=_;09=_;10=_;11=_;12=_;13=_;14=_;15=_\n"
> > "/c1/m15/00=_;01=_;02=_;03=_;04=_;05=_;06=_;07=_;08=_;09=_;10=_;11=_;12=_;13=_;14=_;15=_\n"
> > "/c1/m16/00=lt;01=lt;02=lt;03=lt;04=lt;05=lt;06=lt;07=lt;08=lt;09=lt;10=lt;11=lt;12=lt;13=lt;14=lt;15=lt\n"
> > "/c1/m17/00=lt;01=lt;02=lt;03=lt;04=lt;05=lt;06=lt;07=lt;08=lt;09=lt;10=lt;11=lt;12=lt;13=lt;14=lt;15=lt\n"
> > "/c1/m18/00=lt;01=lt;02=lt;03=lt;04=lt;05=lt;06=lt;07=lt;08=lt;09=lt;10=lt;11=lt;12=lt;13=lt;14=lt;15=lt\n"
> > [...]
> > "/c1/m30/00=lt;01=lt;02=lt;03=lt;04=lt;05=lt;06=lt;07=lt;08=lt;09=lt;10=lt;11=lt;12=lt;13=lt;14=lt;15=lt\n"
> > "/c1/m31/00=lt;01=lt;02=lt;03=lt;04=lt;05=lt;06=lt;07=lt;08=lt;09=lt;10=lt;11=lt;12=lt;13=lt;14=lt;15=lt\n",
> > size);
>
> (so far no "/" needed as prefix)
>
> We could also consider some syntax to mean "all domains". For example,
> if no domain given then it can mean "all domains"?
> So, your example could possibly also be accomplished with a
>
> c1/m1/=_\nc1/m2/=_\nc1/m3/=_\n [...] c1/m16/=lt\nc1/m17/=lt\nc1/m18/=_\n [...]
>
> Any thoughts?

Yes, that would be helpful. The AMD implementations we use typically
have 16 domains or more.

Thanks!
-Peter

[1] https://git.savannah.gnu.org/cgit/bash.git/tree/builtins/echo.def

2024-03-08 03:50:36

by Babu Moger

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Reinette/Peter,


On 3/7/24 16:53, Reinette Chatre wrote:
> Hi Peter,
>
> On 3/7/2024 2:33 PM, Peter Newman wrote:
>> Hi Reinette,
>>
>> On Thu, Mar 7, 2024 at 12:41 PM Reinette Chatre wrote:
>>>
>>> Hi Peter,
>>>
>>> On 3/7/2024 10:57 AM, Peter Newman wrote:
>>>> Hi Babu,
>>>>
>>>> On Mon, Mar 4, 2024 at 2:24 PM Moger, Babu wrote:
>>>>> Based on our discussion, I am listing few examples here. Let me know if
>>>>> I missed something.
>>>>>
>>>>> mount -t resctrl resctrl /sys/fs/resctrl/
>>>>>
>>>>> 1. Assign both local and total counters to default group on domain 0 and 1.
>>>>> $echo "//00=lt;01=lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>
>>>>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>> //00=lt;01=lt
>>>>>
>>>>> 2. Assign a total event to mon group inside the default group for both
>>>>> domain 0 and 1.
>>>>>
>>>>> $mkdir /sys/fs/resctrl/mon_groups/mon_a
>>>>> $echo "/mon_a/00+t;01+t" >
>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>
>>>>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>> //00=lt;01=lt
>>>>> /mon_a/00=t;01=t
>>>>>
>>>>> 3. Assign a local event to non-default control mon group both domain 0
>>>>> and 1.
>>>>> $mkdir /sys/fs/resctrl/ctrl_a
>>>>> $echo "/ctrl_a/00=l;01=l" >
>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>
>>>>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>> //00=lt;01=lt
>>>>> /mon_a/00=t;01=t
>>>>> /ctrl_a/00=l;01=l
>>>>>
>>>>> 4. Assign a both counters to mon group inside another control
>>>>> group(non-default).
>>>>> $mkdir /sys/fs/resctrl/ctrl_a/mon_ab/
>>>>> $echo "ctrl_a/mon_ab/00=lt;01=lt" >
>>>>> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_contro
>>>>>
>>>>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>> //00=lt;01=lt
>>>>> /mon_a/00=t;01=t
>>>>> /ctrl_a/00=l;01=l
>>>>> ctrl_a/mon_ab/00=lt;01=lt
>>>>>
>>>>> 5. Unassign a counter to mon group inside another control
>>>>> group(non-default).
>>>>> $echo "ctrl_a/mon_ab/00-l;01-l" >
>>>>> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>>>>>
>>>>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>> //00=lt;01=lt
>>>>> /mon_a/00=t;01=t
>>>>> /ctrl_a/00=l;01=l
>>>>> ctrl_a/mon_ab/00=t;01=t
>>>>>
>>>>> 6. Unassign all the counters on a specific group.
>>>>> $echo "ctrl_a/mon_ab/00=_" >
>>>>> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>>>>>
>>>>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>> //00=lt;01=lt
>>>>> /mon_a/00=t;01=t
>>>>> /ctrl_a/00=l;01=l
>>>>> ctrl_a/mon_ab/00=_;01=_
>>>>
>>>> The use case I'm interested in is iterating 32 counters over 256
>>>> groups[1]. If it's not possible to reassign 32 counters in a single
>>>> write system call, with just one IPI per domain per batch reassignment
>>>> operation, then I don't see any advantage over the original proposal
>>>> with the assignment control file in every group directory. We already
>>>> had fine-grained control placing assign/unassign nodes throughout the
>>>> directory hierarchy, with the scope implicit in the directory
>>>> location.
>>>
>>> The intent of this interface is to support modification of several
>>> groups with a single write. These examples only show impact to a single
>>> group at a time, but multiple groups can be modified by separating
>>> configurations with a "\n". I believe Babu was planning to add some
>>> of these examples in his next iteration since it is not obvious yet.
>>>
>>>>
>>>> The interface I proposed in [1] aims to reduce the per-domain IPIs by
>>>> a factor of the number of counters, rather than sending off 2 rounds
>>>> of IPIs to each domain for each monitoring group.
>>>
>>> I understood the proposed interface appeared to focus on one use case
>>> while the goal is to find an interface to support all requirements.
>>> With this proposed interface it is possible to make large scale changes
>>> with a single sysfs write.
>>
>> Ok I see you requested[1] one such example earlier.
>>
>> From what I've read, is this what you had in mind of reassigning 32
>> counters from the first 16 groups to the next?
>>
>> I had found that it's hard to get a single write() syscall out of a
>> string containing newlines, so I'm using one explicit call:
>
> Apologies but this is not clear to me, could you please elaborate?
>
> If you are referring to testing via shell you can try ANSI-C Quoting like:
> echo -n $'c1/m1/00=_\nc2/m2/00=_\n'
>
>>
>> write([mbm_assign_control fd],
>> "/c1/m1/00=_;01=_;02=_;03=_;04=_;05=_;06=_;07=_;08=_;09=_;10=_;11=_;12=_;13=_;14=_;15=_\n"
>> "/c1/m2/00=_;01=_;02=_;03=_;04=_;05=_;06=_;07=_;08=_;09=_;10=_;11=_;12=_;13=_;14=_;15=_\n"
>> "/c1/m3/00=_;01=_;02=_;03=_;04=_;05=_;06=_;07=_;08=_;09=_;10=_;11=_;12=_;13=_;14=_;15=_\n"
>> [...]
>> "/c1/m14/00=_;01=_;02=_;03=_;04=_;05=_;06=_;07=_;08=_;09=_;10=_;11=_;12=_;13=_;14=_;15=_\n"
>> "/c1/m15/00=_;01=_;02=_;03=_;04=_;05=_;06=_;07=_;08=_;09=_;10=_;11=_;12=_;13=_;14=_;15=_\n"
>> "/c1/m16/00=lt;01=lt;02=lt;03=lt;04=lt;05=lt;06=lt;07=lt;08=lt;09=lt;10=lt;11=lt;12=lt;13=lt;14=lt;15=lt\n"
>> "/c1/m17/00=lt;01=lt;02=lt;03=lt;04=lt;05=lt;06=lt;07=lt;08=lt;09=lt;10=lt;11=lt;12=lt;13=lt;14=lt;15=lt\n"
>> "/c1/m18/00=lt;01=lt;02=lt;03=lt;04=lt;05=lt;06=lt;07=lt;08=lt;09=lt;10=lt;11=lt;12=lt;13=lt;14=lt;15=lt\n"
>> [...]
>> "/c1/m30/00=lt;01=lt;02=lt;03=lt;04=lt;05=lt;06=lt;07=lt;08=lt;09=lt;10=lt;11=lt;12=lt;13=lt;14=lt;15=lt\n"
>> "/c1/m31/00=lt;01=lt;02=lt;03=lt;04=lt;05=lt;06=lt;07=lt;08=lt;09=lt;10=lt;11=lt;12=lt;13=lt;14=lt;15=lt\n",
>> size);
>
> (so far no "/" needed as prefix)
>
> We could also consider some syntax to mean "all domains". For example,
> if no domain given then it can mean "all domains"?

Yes, sounds good to me. I will let you know if I run into any trouble when I
start working on it.

I am also thinking about replacing the newline requirement for multiple
groups. Domains would be separated by "," and groups by ";".

Something like this..

"/c1/m1/00=_,01=_;/c1/m2/00=_,01=_;/c1/m3/00=lt,01=lt"

Thoughts?

> So, your example could possibly also be accomplished with a
>
> c1/m1/=_\nc1/m2/=_\nc1/m3/=_\n [...] c1/m16/=lt\nc1/m17/=lt\nc1/m18/=_\n [...]
>
> Any thoughts?
>
> Reinette

--
Thanks
Babu Moger

2024-03-08 17:13:30

by Reinette Chatre

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Peter,

On 3/7/2024 3:14 PM, Peter Newman wrote:
> On Thu, Mar 7, 2024 at 2:53 PM Reinette Chatre
> <[email protected]> wrote:
>> On 3/7/2024 2:33 PM, Peter Newman wrote:
>>> On Thu, Mar 7, 2024 at 12:41 PM Reinette Chatre
>>> <[email protected]> wrote:
>>>> I understood the proposed interface appeared to focus on one use case
>>>> while the goal is to find an interface to support all requirements.
>>>> With this proposed interface it is possible to make large scale changes
>>>> with a single sysfs write.
>>>
>>> Ok I see you requested[1] one such example earlier.
>>>
>>> From what I've read, is this what you had in mind of reassigning 32
>>> counters from the first 16 groups to the next?
>>>
>>> I had found that it's hard to get a single write() syscall out of a
>>> string containing newlines, so I'm using one explicit call:
>>
>> Apologies but this is not clear to me, could you please elaborate?
>>
>> If you are referring to testing via shell you can try ANSI-C Quoting like:
>> echo -n $'c1/m1/00=_\nc2/m2/00=_\n'
>
> The echo command uses buffered output through printf() and
> putchar()[1]. The behavior of the buffering seems to be a write() call
> after each newline, causing the kernel to see the request below as 32
> individual commands.

I see different behavior. Just to confirm I added a printk() in
rdtgroup_schemata_write():

diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 7471f6b747b6..00d9809a1bac 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -384,6 +384,7 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,

 	rdt_staged_configs_clear();

+	printk("%s:%d Parsing %s\n", __func__, __LINE__, buf);
 	while ((tok = strsep(&buf, "\n")) != NULL) {
 		resname = strim(strsep(&tok, ":"));
 		if (!tok) {

I believe the behavior you are referring to is when a user does something
like:
# echo -e "MB:0=90\nL3:0=7ff0" > schemata

Then, indeed it is two separate writes:
[ 636.391304] rdtgroup_schemata_write:387 Parsing MB:0=90
[ 636.397773] rdtgroup_schemata_write:387 Parsing L3:0=7ff0

When using ANSI-C Quoting I see a single write:

# echo -n $'MB:0=90\nL3:0=7ff0\n' > schemata

[ 655.879331] rdtgroup_schemata_write:387 Parsing MB:0=90
L3:0=7ff0

Reinette

2024-03-08 17:26:47

by Reinette Chatre

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On 3/7/2024 7:50 PM, Moger, Babu wrote:
> I am also thinking about replacing the newline requirement for multiple
> groups. Domains would be separated by "," and groups by ";".
>
> Something like this..
>
> "/c1/m1/00=_,01=_;/c1/m2/00=_,01=_;/c1/m3/00=lt,01=lt"
>
> Thoughts?
>

I would prefer that resctrl uses as consistent an interface as possible
between the different files. There are a few files that already
take domains as input (schemata, mbm_total_bytes_config,
mbm_local_bytes_config) and they all separate domains by ";".
I thus find it most appropriate to stick with ";" between domains.

Regarding separation of groups, in the schemata file for example it is
already customary to separate groups (resources in that case) with "\n".
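
A minimal user-space sketch of the separator convention described above:
groups (lines) are split on "\n" and domain entries within a line on ";".
The struct and function names (batch_stats, scan_batch) are hypothetical and
this is not kernel code; it only assumes the "<group path>/<domain>=<flags>"
layout used in the examples earlier in this thread, which may still change.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical summary of one batch written to mbm_assign_control. */
struct batch_stats {
	int groups;	/* non-empty lines, separated by "\n" */
	int entries;	/* domain entries, separated by ";" within a line */
};

/*
 * Walk a batch without modifying it. For each line, the domain entries
 * are assumed to start after the last '/' of the group path.
 */
static struct batch_stats scan_batch(const char *buf)
{
	struct batch_stats st = { 0, 0 };

	while (*buf) {
		size_t line_len = strcspn(buf, "\n");
		const char *end = buf + line_len;
		const char *entries = buf;
		const char *p;

		if (line_len > 0) {
			st.groups++;

			/* Skip past the group path. */
			for (p = buf; p < end; p++)
				if (*p == '/')
					entries = p + 1;

			/* One entry, plus one more per ";" separator. */
			if (entries < end) {
				st.entries++;
				for (p = entries; p < end; p++)
					if (*p == ';')
						st.entries++;
			}
		}

		/* Step over the newline that ended this line, if any. */
		buf = end;
		if (*buf == '\n')
			buf++;
	}

	return st;
}
```

With this layout, scan_batch("c1/m1/00=_;01=_\nc1/m2/00=lt\n") reports two
groups and three domain entries.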

Reinette


2024-03-11 15:51:31

by Babu Moger

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Reinette,

On 2/27/24 17:50, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/27/2024 10:12 AM, Moger, Babu wrote:
>> On 2/26/24 15:20, Reinette Chatre wrote:
>>> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>>>> On 2/23/24 16:21, Reinette Chatre wrote:
>
>>>>> Apart from the "default behavior" there are two options to consider ...
>>>>> (a) the "original" behavior(? I do not know what to call it) - this would be
>>>>> where user space wants(?) to have the current non-ABMC behavior on an ABMC
>>>>> system, where the previous "num_rmids" monitor groups can be created but
>>>>> the counters are reset unpredictably ... should this still be supported
>>>>> on ABMC systems though?
>>>>
>>>> I would say yes. For some reason user(hardware or software issues) is not
>>>> able to use ABMC mode, they have an option to go back to legacy mode.
>>>
>>> I see. Should this perhaps be protected behind the resctrl "debug" mount option?
>>
>> The debug option gives wrong impression. It is better to keep the option
>> open to enable the feature in normal mode.
>
> You mentioned that it would only be needed when there are hardware or
> software issues ... so debug does sound appropriate. Could you please give
> an example of how debug option gives wrong impression? Why would you want
> users to keep using "legacy" mode on an ABMC system?

We may not be able to use the "-o debug" option to enable "legacy_mbm".
With the debug option it would always go to legacy mbm even if ABMC is supported.

For example, when ABMC is supported, I could not mount resctrl with the debug
option to test ABMC.

I need a way to mount resctrl with both ABMC and the debug option. I can
add "-o legacy_mbm" to enable legacy_mbm.

Thanks
Babu

2024-03-12 13:30:42

by Babu Moger

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)


On 3/8/24 11:20, Reinette Chatre wrote:
> Hi Babu,
>
> On 3/7/2024 7:50 PM, Moger, Babu wrote:
>> I am also thinking about replacing the newline requirement for multiple
>> groups. Domains would be separated by "," and groups by ";".
>>
>> Something like this..
>>
>> "/c1/m1/00=_,01=_;/c1/m2/00=_,01=_;/c1/m3/00=lt,01=lt"
>>
>> Thoughts?
>>
>
> I would prefer that resctrl uses as consistent an interface as possible
> between the different files. There are a few files that already
> take domains as input (schemata, mbm_total_bytes_config,
> mbm_local_bytes_config) and they all separate domains by ";".
> I thus find it most appropriate to stick with ";" between domains.
>
> Regarding separation of groups, in the schemata file for example it is
> already customary to separate groups (resources in that case) with "\n".
>

Ok. Sure.

--
Thanks
Babu Moger

2024-03-12 15:14:10

by Reinette Chatre

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On 3/11/2024 8:40 AM, Moger, Babu wrote:
> On 2/27/24 17:50, Reinette Chatre wrote:
>> On 2/27/2024 10:12 AM, Moger, Babu wrote:
>>> On 2/26/24 15:20, Reinette Chatre wrote:
>>>> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>>>>> On 2/23/24 16:21, Reinette Chatre wrote:
>>
>>>>>> Apart from the "default behavior" there are two options to consider ...
>>>>>> (a) the "original" behavior(? I do not know what to call it) - this would be
>>>>>> where user space wants(?) to have the current non-ABMC behavior on an ABMC
>>>>>> system, where the previous "num_rmids" monitor groups can be created but
>>>>>> the counters are reset unpredictably ... should this still be supported
>>>>>> on ABMC systems though?
>>>>>
>>>>> I would say yes. For some reason user(hardware or software issues) is not
>>>>> able to use ABMC mode, they have an option to go back to legacy mode.
>>>>
>>>> I see. Should this perhaps be protected behind the resctrl "debug" mount option?
>>>
>>> The debug option gives wrong impression. It is better to keep the option
>>> open to enable the feature in normal mode.
>>
>> You mentioned that it would only be needed when there are hardware or
>> software issues ... so debug does sound appropriate. Could you please give
>> an example of how debug option gives wrong impression? Why would you want
>> users to keep using "legacy" mode on an ABMC system?
>
> We may not be able to use the "-o debug" option to enable "legacy_mbm".
> With the debug option it would always go to legacy mbm even if ABMC is supported.
>
> For example, when ABMC is supported, I could not mount resctrl with the debug
> option to test ABMC.
>
> I need a way to mount resctrl with both ABMC and the debug option. I can
> add "-o legacy_mbm" to enable legacy_mbm.

I do not think it is necessary to add a unique debug option for this.
What if instead the "-o debug" mount option exposes the "original/legacy"
behavior in the control file? This would enable users to only use this
behavior when in "debug" mode while still able to switch between
legacy and assigned counters.

Reinette


2024-03-12 17:10:05

by Babu Moger

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Reinette,

On 3/12/24 10:13, Reinette Chatre wrote:
> Hi Babu,
>
> On 3/11/2024 8:40 AM, Moger, Babu wrote:
>> On 2/27/24 17:50, Reinette Chatre wrote:
>>> On 2/27/2024 10:12 AM, Moger, Babu wrote:
>>>> On 2/26/24 15:20, Reinette Chatre wrote:
>>>>> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>>>>>> On 2/23/24 16:21, Reinette Chatre wrote:
>>>
>>>>>>> Apart from the "default behavior" there are two options to consider ...
>>>>>>> (a) the "original" behavior(? I do not know what to call it) - this would be
>>>>>>> where user space wants(?) to have the current non-ABMC behavior on an ABMC
>>>>>>> system, where the previous "num_rmids" monitor groups can be created but
>>>>>>> the counters are reset unpredictably ... should this still be supported
>>>>>>> on ABMC systems though?
>>>>>>
>>>>>> I would say yes. For some reason user(hardware or software issues) is not
>>>>>> able to use ABMC mode, they have an option to go back to legacy mode.
>>>>>
>>>>> I see. Should this perhaps be protected behind the resctrl "debug" mount option?
>>>>
>>>> The debug option gives wrong impression. It is better to keep the option
>>>> open to enable the feature in normal mode.
>>>
>>> You mentioned that it would only be needed when there are hardware or
>>> software issues ... so debug does sound appropriate. Could you please give
>>> an example of how debug option gives wrong impression? Why would you want
>>> users to keep using "legacy" mode on an ABMC system?
>>
>> We may not be able to use the "-o debug" option to enable "legacy_mbm".
>> With the debug option it would always go to legacy mbm even if ABMC is supported.
>>
>> For example, when ABMC is supported, I could not mount resctrl with the debug
>> option to test ABMC.
>>
>> I need a way to mount resctrl with both ABMC and the debug option. I can
>> add "-o legacy_mbm" to enable legacy_mbm.
>
> I do not think it is necessary to add a unique debug option for this.

It makes the code simple.

> What if instead the "-o debug" mount option exposes the "original/legacy"

Can you please elaborate on this?

Did you mean following command to enable legacy mode?

$echo "original/legacy" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

It feels like overkill and is confusing.

> behavior in the control file? This would enable users to only use this
> behavior when in "debug" mode while still able to switch between
> legacy and assigned counters.
>
> Reinette
>

--
Thanks
Babu Moger

2024-03-12 17:16:56

by Reinette Chatre

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Babu,

On 3/12/2024 10:07 AM, Moger, Babu wrote:
> On 3/12/24 10:13, Reinette Chatre wrote:
>> On 3/11/2024 8:40 AM, Moger, Babu wrote:
>>> On 2/27/24 17:50, Reinette Chatre wrote:
>>>> On 2/27/2024 10:12 AM, Moger, Babu wrote:
>>>>> On 2/26/24 15:20, Reinette Chatre wrote:
>>>>>> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>>>>>>> On 2/23/24 16:21, Reinette Chatre wrote:
>>>>
>>>>>>>> Apart from the "default behavior" there are two options to consider ...
>>>>>>>> (a) the "original" behavior(? I do not know what to call it) - this would be
>>>>>>>> where user space wants(?) to have the current non-ABMC behavior on an ABMC
>>>>>>>> system, where the previous "num_rmids" monitor groups can be created but
>>>>>>>> the counters are reset unpredictably ... should this still be supported
>>>>>>>> on ABMC systems though?
>>>>>>>
>>>>>>> I would say yes. For some reason user(hardware or software issues) is not
>>>>>>> able to use ABMC mode, they have an option to go back to legacy mode.
>>>>>>
>>>>>> I see. Should this perhaps be protected behind the resctrl "debug" mount option?
>>>>>
>>>>> The debug option gives wrong impression. It is better to keep the option
>>>>> open to enable the feature in normal mode.
>>>>
>>>> You mentioned that it would only be needed when there are hardware or
>>>> software issues ... so debug does sound appropriate. Could you please give
>>>> an example of how debug option gives wrong impression? Why would you want
>>>> users to keep using "legacy" mode on an ABMC system?
>>>
>>> We may not be able to use the "-o debug" option to enable "legacy_mbm".
>>> With the debug option it would always go to legacy mbm even if ABMC is supported.
>>>
>>> For example, when ABMC is supported, I could not mount resctrl with the debug
>>> option to test ABMC.
>>>
>>> I need a way to mount resctrl with both ABMC and the debug option. I can
>>> add "-o legacy_mbm" to enable legacy_mbm.
>>
>> I do not think it is necessary to add a unique debug option for this.
>
> It makes the code simple.
>
>> What if instead the "-o debug" mount option exposes the "original/legacy"
>
> Can you please elaborate on this?
>
> Did you mean following command to enable legacy mode?
>
> $echo "original/legacy" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
> It feels like overkill and is confusing.

I used the "original/legacy" text to make it clear which behavior I was
referring to. It was not a proposal for a label used by user space to
select the behavior.

Isn't /sys/fs/resctrl/info/L3_MON/mbm_assign_control the file that will
assign the counters to domains? That should not be the file used to
select the behavior. You had /sys/fs/resctrl/info/L3_MON/mbm_assign
with which user space selects behavior, no?


Reinette

2024-03-12 17:33:25

by Babu Moger

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Reinette,

On 3/12/24 12:15, Reinette Chatre wrote:
> Hi Babu,
>
> On 3/12/2024 10:07 AM, Moger, Babu wrote:
>> On 3/12/24 10:13, Reinette Chatre wrote:
>>> On 3/11/2024 8:40 AM, Moger, Babu wrote:
>>>> On 2/27/24 17:50, Reinette Chatre wrote:
>>>>> On 2/27/2024 10:12 AM, Moger, Babu wrote:
>>>>>> On 2/26/24 15:20, Reinette Chatre wrote:
>>>>>>> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>>>>>>>> On 2/23/24 16:21, Reinette Chatre wrote:
>>>>>
>>>>>>>>> Apart from the "default behavior" there are two options to consider ...
>>>>>>>>> (a) the "original" behavior(? I do not know what to call it) - this would be
>>>>>>>>> where user space wants(?) to have the current non-ABMC behavior on an ABMC
>>>>>>>>> system, where the previous "num_rmids" monitor groups can be created but
>>>>>>>>> the counters are reset unpredictably ... should this still be supported
>>>>>>>>> on ABMC systems though?
>>>>>>>>
>>>>>>>> I would say yes. For some reason user(hardware or software issues) is not
>>>>>>>> able to use ABMC mode, they have an option to go back to legacy mode.
>>>>>>>
>>>>>>> I see. Should this perhaps be protected behind the resctrl "debug" mount option?
>>>>>>
>>>>>> The debug option gives wrong impression. It is better to keep the option
>>>>>> open to enable the feature in normal mode.
>>>>>
>>>>> You mentioned that it would only be needed when there are hardware or
>>>>> software issues ... so debug does sound appropriate. Could you please give
>>>>> an example of how debug option gives wrong impression? Why would you want
>>>>> users to keep using "legacy" mode on an ABMC system?
>>>>
>>>> We may not be able to use the "-o debug" option to enable "legacy_mbm".
>>>> With the debug option it would always go to legacy mbm even if ABMC is supported.
>>>>
>>>> For example, when ABMC is supported, I could not mount resctrl with the debug
>>>> option to test ABMC.
>>>>
>>>> I need a way to mount resctrl with both ABMC and the debug option. I can
>>>> add "-o legacy_mbm" to enable legacy_mbm.
>>>
>>> I do not think it is necessary to add a unique debug option for this.
>>
>> It makes the code simple.
>>
>>> What if instead the "-o debug" mount option exposes the "original/legacy"
>>
>> Can you please elaborate on this?
>>
>> Did you mean following command to enable legacy mode?
>>
>> $echo "original/legacy" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>
>> It feels like overkill and is confusing.
>
> I used the "original/legacy" text to make it clear which behavior I was
> referring to. It was not a proposal for a label used by user space to
> select the behavior.
>
> Isn't /sys/fs/resctrl/info/L3_MON/mbm_assign_control the file that will
> assign the counters to domains? That should not be the file used to

ok.

> select the behavior. You had /sys/fs/resctrl/info/L3_MON/mbm_assign
> with which user space selects behavior, no?

Yes. I think we can use this file (/sys/fs/resctrl/info/L3_MON/mbm_assign).
Thanks
Babu Moger

2024-03-07 22:33:49

by Peter Newman

Subject: Re: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Hi Reinette,

On Thu, Mar 7, 2024 at 12:41 PM Reinette Chatre
<[email protected]> wrote:
>
> Hi Peter,
>
> On 3/7/2024 10:57 AM, Peter Newman wrote:
> > Hi Babu,
> >
> > On Mon, Mar 4, 2024 at 2:24 PM Moger, Babu <[email protected]> wrote:
> >> Based on our discussion, I am listing few examples here. Let me know if
> >> I missed something.
> >>
> >> mount -t resctrl resctrl /sys/fs/resctrl/
> >>
> >> 1. Assign both local and total counters to default group on domain 0 and 1.
> >> $echo "//00=lt;01=lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>
> >> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >> //00=lt;01=lt
> >>
> >> 2. Assign a total event to mon group inside the default group for both
> >> domain 0 and 1.
> >>
> >> $mkdir /sys/fs/resctrl/mon_groups/mon_a
> >> $echo "/mon_a/00+t;01+t" >
> >> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>
> >> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >> //00=lt;01=lt
> >> /mon_a/00=t;01=t
> >>
> >> 3. Assign a local event to non-default control mon group both domain 0
> >> and 1.
> >> $mkdir /sys/fs/resctrl/ctrl_a
> >> $echo "/ctrl_a/00=l;01=l" >
> >> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>
> >> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >> //00=lt;01=lt
> >> /mon_a/00=t;01=t
> >> /ctrl_a/00=l;01=l
> >>
> >> 4. Assign both counters to a mon group inside another control
> >> group (non-default).
> >> $mkdir /sys/fs/resctrl/ctrl_a/mon_ab/
> >> $echo "ctrl_a/mon_ab/00=lt;01=lt" >
> >> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>
> >> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >> //00=lt;01=lt
> >> /mon_a/00=t;01=t
> >> /ctrl_a/00=l;01=l
> >> ctrl_a/mon_ab/00=lt;01=lt
> >>
> >> 5. Unassign a counter from a mon group inside another control
> >> group (non-default).
> >> $echo "ctrl_a/mon_ab/00-l;01-l" >
> >> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>
> >> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >> //00=lt;01=lt
> >> /mon_a/00=t;01=t
> >> /ctrl_a/00=l;01=l
> >> ctrl_a/mon_ab/00=t;01=t
> >>
> >> 6. Unassign all the counters on a specific group.
> >> $echo "ctrl_a/mon_ab/00=_" >
> >> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>
> >> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >> //00=lt;01=lt
> >> /mon_a/00=t;01=t
> >> /ctrl_a/00=l;01=l
> >> ctrl_a/mon_ab/00=_;01=_
> >
> > The use case I'm interested in is iterating 32 counters over 256
> > groups[1]. If it's not possible to reassign 32 counters in a single
> > write system call, with just one IPI per domain per batch reassignment
> > operation, then I don't see any advantage over the original proposal
> > with the assignment control file in every group directory. We already
> > had fine-grained control placing assign/unassign nodes throughout the
> > directory hierarchy, with the scope implicit in the directory
> > location.
>
> The intent of this interface is to support modification of several
> groups with a single write. These examples only show impact to a single
> group at a time, but multiple groups can be modified by separating
> configurations with a "\n". I believe Babu was planning to add some
> of these examples in his next iteration since it is not obvious yet.
>
> >
> > The interface I proposed in [1] aims to reduce the per-domain IPIs by
> > a factor of the number of counters, rather than sending off 2 rounds
> > of IPIs to each domain for each monitoring group.
>
> I understood the proposed interface appeared to focus on one use case
> while the goal is to find an interface to support all requirements.
> With this proposed interface it is possible to make large scale changes
> with a single sysfs write.

Ok I see you requested[1] one such example earlier.

From what I've read, is this what you had in mind of reassigning 32
counters from the first 16 groups to the next?

I had found that it's hard to get a single write() syscall out of a
string containing newlines, so I'm using one explicit call:

write([mbm_assign_control fd],
"/c1/m1/00=_;01=_;02=_;03=_;04=_;05=_;06=_;07=_;08=_;09=_;10=_;11=_;12=_;13=_;14=_;15=_\n"
"/c1/m2/00=_;01=_;02=_;03=_;04=_;05=_;06=_;07=_;08=_;09=_;10=_;11=_;12=_;13=_;14=_;15=_\n"
"/c1/m3/00=_;01=_;02=_;03=_;04=_;05=_;06=_;07=_;08=_;09=_;10=_;11=_;12=_;13=_;14=_;15=_\n"
[...]
"/c1/m14/00=_;01=_;02=_;03=_;04=_;05=_;06=_;07=_;08=_;09=_;10=_;11=_;12=_;13=_;14=_;15=_\n"
"/c1/m15/00=_;01=_;02=_;03=_;04=_;05=_;06=_;07=_;08=_;09=_;10=_;11=_;12=_;13=_;14=_;15=_\n"
"/c1/m16/00=lt;01=lt;02=lt;03=lt;04=lt;05=lt;06=lt;07=lt;08=lt;09=lt;10=lt;11=lt;12=lt;13=lt;14=lt;15=lt\n"
"/c1/m17/00=lt;01=lt;02=lt;03=lt;04=lt;05=lt;06=lt;07=lt;08=lt;09=lt;10=lt;11=lt;12=lt;13=lt;14=lt;15=lt\n"
"/c1/m18/00=lt;01=lt;02=lt;03=lt;04=lt;05=lt;06=lt;07=lt;08=lt;09=lt;10=lt;11=lt;12=lt;13=lt;14=lt;15=lt\n"
[...]
"/c1/m30/00=lt;01=lt;02=lt;03=lt;04=lt;05=lt;06=lt;07=lt;08=lt;09=lt;10=lt;11=lt;12=lt;13=lt;14=lt;15=lt\n"
"/c1/m31/00=lt;01=lt;02=lt;03=lt;04=lt;05=lt;06=lt;07=lt;08=lt;09=lt;10=lt;11=lt;12=lt;13=lt;14=lt;15=lt\n",
size);
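
A minimal sketch of how a batch like the one above could be assembled in user
space and handed to the kernel in one write() call. The helper names
(append_group_line, write_batch) and the buffer handling are hypothetical,
and the line syntax simply follows the examples in this thread, so it may
not match what is eventually merged.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Append one "<group>/00=<flags>;01=<flags>;...\n" line at offset len.
 * Returns the new length of the string in buf, or -1 if the line does
 * not fit in cap bytes (including the terminating NUL).
 */
static long append_group_line(char *buf, size_t cap, size_t len,
			      const char *group, int ndomains,
			      const char *flags)
{
	int n, d;

	n = snprintf(buf + len, cap - len, "%s/", group);
	if (n < 0 || (size_t)n >= cap - len)
		return -1;
	len += (size_t)n;

	for (d = 0; d < ndomains; d++) {
		/* ";" goes between domain entries, not before the first. */
		n = snprintf(buf + len, cap - len, "%s%02d=%s",
			     d ? ";" : "", d, flags);
		if (n < 0 || (size_t)n >= cap - len)
			return -1;
		len += (size_t)n;
	}

	/* "\n" terminates the group so that the next one can follow. */
	n = snprintf(buf + len, cap - len, "\n");
	if (n < 0 || (size_t)n >= cap - len)
		return -1;

	return (long)(len + (size_t)n);
}

/*
 * Submit the whole batch with a single write() so that the kernel sees
 * one request. A short write is reported as an error instead of being
 * retried, since a retry would split the batch.
 */
static int write_batch(int fd, const char *buf, size_t len)
{
	ssize_t ret = write(fd, buf, len);

	return (ret == (ssize_t)len) ? 0 : -1;
}
```

A caller would invoke append_group_line() once per group, passing "_" for
the groups losing their counters and "lt" for the groups gaining them, and
then pass the accumulated buffer and length to write_batch() on a file
descriptor opened on mbm_assign_control.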

Thanks!
-Peter

[1] https://lore.kernel.org/lkml/d29a28c8-1180-45ec-bd87-d2e8a8124c42@intel.com/