2022-09-07 18:13:24

by Moger, Babu

[permalink] [raw]
Subject: [PATCH v4 00/13] x86/resctrl: Support for AMD QoS new features and bug fix

New AMD processors can now support following QoS features.

1. Slow Memory Bandwidth Allocation (SMBA)
With this feature, the QOS enforcement policies can be applied
to the external slow memory connected to the host. QOS enforcement
is accomplished by assigning a Class Of Service (COS) to a processor
and specifying allocations or limits for that COS for each resource
to be allocated.

Currently, CXL.memory is the only supported "slow" memory device. With
the support of SMBA feature the hardware enables bandwidth allocation
on the slow memory devices.

2. Bandwidth Monitoring Event Configuration (BMEC)
The bandwidth monitoring events mbm_total_event and mbm_local_event
are set to count all the total and local reads/writes respectively.
With the introduction of slow memory, the two counters are not enough
to count all the different types are memory events. With the feature
BMEC, the users have the option to configure mbm_total_event and
mbm_local_event to count the specific type of events.

Following are the bitmaps of events supported.
Bits Description
6 Dirty Victims from the QOS domain to all types of memory
5 Reads to slow memory in the non-local NUMA domain
4 Reads to slow memory in the local NUMA domain
3 Non-temporal writes to non-local NUMA domain
2 Non-temporal writes to local NUMA domain
1 Reads to memory in the non-local NUMA domain
0 Reads to memory in the local NUMA domain

This series adds support for these features. Added a bug fix and a code cleanup.

Feature description is available in the specification, "AMD64 Technology Platform Quality
of Service Extensions, Revision: 1.03 Publication # 56375 Revision: 1.03 Issue Date: February 2022".

Link: https://www.amd.com/en/support/tech-docs/amd64-technology-platform-quality-service-extensions
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537

---
4:
Got numerios of comments from Reinette Chatre. Addressed most of them.
Summary of changes.
1. Removed mon_configurable under /sys/fs/resctrl/info/L3_MON/.
2. Updated mon_features texts if the BMEC is supported.
3. Added more explanation about the slow memory support.
4. Replaced smp_call_function_many with on_each_cpu_mask call.
5. Removed arch_has_empty_bitmaps
6. Few other text changes.
7. Removed Reviewed-by if the patch is modified.
8. Rebased the patches to latest tip.

v3:
https://lore.kernel.org/lkml/166117559756.6695.16047463526634290701.stgit@bmoger-ubuntu/
a. Rebased the patches to latest tip. Resolved some conflicts.
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
b. Taken care of feedback from Bagas Sanjaya.
c. Added Reviewed by from Mingo.
Note: I am still looking for comments from Reinette or Fenghua.

v2:
https://lore.kernel.org/lkml/165938717220.724959.10931629283087443782.stgit@bmoger-ubuntu/
a. Rebased the patches to latest stable tree (v5.18.15). Resolved some conflicts.
b. Added the patch to fix CBM issue on AMD. This was originally discussed
https://lore.kernel.org/lkml/[email protected]/

v1:
https://lore.kernel.org/lkml/165757543252.416408.13547339307237713464.stgit@bmoger-ubuntu/

Babu Moger (13):
x86/resctrl: Fix min_cbm_bits for AMD
x86/resctrl: Remove arch_has_empty_bitmaps
x86/cpufeatures: Add Slow Memory Bandwidth Allocation feature flag
x86/resctrl: Add a new resource type RDT_RESOURCE_SMBA
x86/cpufeatures: Add Bandwidth Monitoring Event Configuration feature flag
x86/resctrl: Include new features in command line options
x86/resctrl: Detect and configure Slow Memory Bandwidth allocation
x86/resctrl : Introduce data structure to support monitor configuration
x86/resctrl: Add sysfs interface files to read/write event configuration
x86/resctrl: Add the sysfs interface to read the event configuration
x86/resctrl: Add sysfs interface to write the event configuration
x86/resctrl: Replace smp_call_function_many with on_each_cpu_mask
Documentation/x86: Update resctrl_ui.rst for new features


.../admin-guide/kernel-parameters.txt | 2 +-
Documentation/x86/resctrl.rst | 148 +++++++++-
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/kernel/cpu/resctrl/core.c | 58 +++-
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 5 +-
arch/x86/kernel/cpu/resctrl/internal.h | 32 ++-
arch/x86/kernel/cpu/resctrl/monitor.c | 9 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 272 +++++++++++++++---
arch/x86/kernel/cpu/scattered.c | 2 +
include/linux/resctrl.h | 6 +-
10 files changed, 479 insertions(+), 57 deletions(-)

--


2022-09-07 18:15:25

by Moger, Babu

[permalink] [raw]
Subject: [PATCH v4 13/13] Documentation/x86: Update resctrl_ui.rst for new features

Update the documentation for the new features:
1. Slow Memory Bandwidth allocation (SMBA).
With this feature, the QOS enforcement policies can be applied
to the external slow memory connected to the host. QOS enforcement
is accomplished by assigning a Class Of Service (COS) to a processor
and specifying allocations or limits for that COS for each resource
to be allocated.

2. Bandwidth Monitoring Event Configuration (BMEC).
The bandwidth monitoring events mbm_total_bytes and mbm_local_bytes
are set to count all the total and local reads/writes respectively.
With the introduction of slow memory, the two counters are not
enough to count all the different types are memory events. With the
feature BMEC, the users have the option to configure mbm_total_bytes
and mbm_local_bytes to count the specific type of events.

Also add configuration instructions with examples.

Signed-off-by: Babu Moger <[email protected]>
---
Documentation/x86/resctrl.rst | 148 ++++++++++++++++++++++++++++++++++++++++-
1 file changed, 146 insertions(+), 2 deletions(-)

diff --git a/Documentation/x86/resctrl.rst b/Documentation/x86/resctrl.rst
index 71a531061e4e..56581587c1a3 100644
--- a/Documentation/x86/resctrl.rst
+++ b/Documentation/x86/resctrl.rst
@@ -17,14 +17,16 @@ AMD refers to this feature as AMD Platform Quality of Service(AMD QoS).
This feature is enabled by the CONFIG_X86_CPU_RESCTRL and the x86 /proc/cpuinfo
flag bits:

-============================================= ================================
+=============================================== ================================
RDT (Resource Director Technology) Allocation "rdt_a"
CAT (Cache Allocation Technology) "cat_l3", "cat_l2"
CDP (Code and Data Prioritization) "cdp_l3", "cdp_l2"
CQM (Cache QoS Monitoring) "cqm_llc", "cqm_occup_llc"
MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation) "mba"
-============================================= ================================
+SMBA (Slow Memory Bandwidth Allocation) "smba"
+BMEC (Bandwidth Monitoring Event Configuration) "bmec"
+=============================================== ================================

To use the feature mount the file system::

@@ -161,6 +163,23 @@ with the following files:
"mon_features":
Lists the monitoring events if
monitoring is enabled for the resource.
+ Example output:
+
+ # cat /sys/fs/resctrl/info/L3_MON/mon_features
+ llc_occupancy
+ mbm_total_bytes
+ mbm_local_bytes
+
+ If the system supports Bandwidth Monitoring Event
+ Configuration (BMEC), then the bandwidth events will
+ be configurable. Then the output will be.
+
+ # cat /sys/fs/resctrl/info/L3_MON/mon_features
+ llc_occupancy
+ mbm_total_bytes
+ mbm_total_config
+ mbm_local_bytes
+ mbm_local_config

"max_threshold_occupancy":
Read/write file provides the largest value (in
@@ -264,6 +283,32 @@ When monitoring is enabled all MON groups will also contain:
the sum for all tasks in the CTRL_MON group and all tasks in
MON groups. Please see example section for more details on usage.

+"mbm_total_config", "mbm_local_config":
+ This contains the current event configuration for the events
+ mbm_total_bytes and mbm_local_bytes, respectively, when the
+ Bandwidth Monitoring Event Configuration (BMEC) feature is supported.
+ These files are organized by L3 domains under the subdirectories
+ "mon_L3_00" and "mon_L3_01". When BMEC is supported, the events
+ mbm_local_bytes and mbm_total_bytes are configurable.
+
+ Following are the types of events supported:
+
+ ==== ========================================================
+ Bits Description
+ ==== ========================================================
+ 6 Dirty Victims from the QOS domain to all types of memory
+ 5 Reads to slow memory in the non-local NUMA domain
+ 4 Reads to slow memory in the local NUMA domain
+ 3 Non-temporal writes to non-local NUMA domain
+ 2 Non-temporal writes to local NUMA domain
+ 1 Reads to memory in the non-local NUMA domain
+ 0 Reads to memory in the local NUMA domain
+ ==== ========================================================
+
+ By default, the mbm_total_bytes configuration is set to 0x7F to count
+ all the event types and the mbm_local_bytes configuration is set to
+ 0x15 to count all the local memory events.
+
Resource allocation rules
-------------------------

@@ -464,6 +509,19 @@ Memory bandwidth domain is L3 cache.

MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...

+Slow Memory bandwidth Allocation (when supported)
+-------------------------------------------------
+Currently, CXL.memory is the only supported "slow" memory device.
+With the support of SMBA feature the hardware enables bandwidth
+allocation on the slow memory devices. If there are multiple slow
+memory devices in the system, then the throttling logic groups all
+the slow sources together and applies the limit on them as a whole.
+
+Slow Memory b/w domain is L3 cache.
+::
+
+ SMBA:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
+
Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
@@ -479,6 +537,44 @@ which you wish to change. E.g.
L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff

+Reading/writing the schemata file (on AMD systems)
+--------------------------------------------------
+Reading the schemata file will show the state of all resources
+on all domains. When writing the memory bandwidth allocation you
+only need to specify those values in an absolute number expressed
+in 1/8 GB/s increments. To allocate bandwidth limit of 2GB, you
+need to specify the value 16 (16 * 1/8 = 2). E.g.
+::
+
+ # cat schemata
+ MB:0=2048;1=2048;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+ # echo "MB:1=16" > schemata
+ # cat schemata
+ MB:0=2048;1= 16;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+Reading/writing the schemata file (on AMD systems) with slow memory
+-------------------------------------------------------------------
+Reading the schemata file will show the state of all resources
+on all domains. When writing the memory bandwidth allocation you
+only need to specify those values in an absolute number expressed
+in 1/8 GB/s increments. To allocate bandwidth limit of 8GB, you
+need to specify the value 64 (64 * 1/8 = 8). E.g.
+::
+
+ # cat schemata
+ SMBA:0=2048;1=2048;2=2048;3=2048
+ MB:0=2048;1=2048;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+ # echo "SMBA:1=64" > schemata
+ # cat schemata
+ SMBA:0=2048;1= 64;2=2048;3=2048
+ MB:0=2048;1=2048;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
Cache Pseudo-Locking
====================
CAT enables a user to specify the amount of cache space that an
@@ -1210,6 +1306,54 @@ View the llc occupancy snapshot::
# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
11234000

+Example 5 (Configure and Monitor specific event types)
+------------------------------------------------------
+
+A single socket system which has real time tasks running on cores 0-4
+and non real time tasks on other CPUs. We want to monitor the memory
+bandwidth allocation for specific events.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir p1
+
+Move the CPUs 0-4 over to p1::
+
+ # echo 0xf > p1/cpus
+
+View the current mbm_local_bytes::
+
+ # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/mbm_local_bytes
+ 112501
+
+Change the mbm_local_bytes to count mon-temporal writes to both local
+and non-local NUMA domain. Refer to event supported bitmap under
+mbm_local_config::
+
+ # echo 0xc > /sys/fs/resctrl/p1/mon_data/mon_L3_00/mbm_local_config
+
+View the updated mbm_local_bytes::
+
+ # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/mbm_local_bytes
+ 12601
+
+Similar experiment on mbm_total_bytes. First view the current mbm_total_bytes::
+
+ # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/mbm_total_bytes
+ 1532501
+
+Change the mbm_total_bytes to count only reads to slow memory on both local
+and non-local NUMA domain. Refer to event supported bitmap under
+mbm_total_config::
+
+ # echo 0x30 > /sys/fs/resctrl/p1/mon_data/mon_L3_00/mbm_total_config
+
+View the updated mbm_total_bytes::
+
+ # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/mbm_total_bytes
+ 104562
+
Intel RDT Errata
================



2022-09-07 18:15:43

by Moger, Babu

[permalink] [raw]
Subject: [PATCH v4 11/13] x86/resctrl: Add sysfs interface to write the event configuration

Add the sysfs interface to write the event configuration for the
MBM configurable events. The event configuration can be changed by
writing to the sysfs file for that specific event.

Following are the types of events supported:

==== ===========================================================
Bits Description
==== ===========================================================
6 Dirty Victims from the QOS domain to all types of memory
5 Reads to slow memory in the non-local NUMA domain
4 Reads to slow memory in the local NUMA domain
3 Non-temporal writes to non-local NUMA domain
2 Non-temporal writes to local NUMA domain
1 Reads to memory in the non-local NUMA domain
0 Reads to memory in the local NUMA domain
==== ===========================================================

By default, the mbm_total_bytes configuration is set to 0x7F to count
all the event types and the mbm_local_bytes configuration is set to
0x15 to count all the local memory events.

For example:
To change the mbm_total_bytes to count all the reads, run the command.
$echo 0x33 > /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_config

To change the mbm_local_bytes to count all the slow memory reads, run
the command.
$echo 0x30 > /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_config

Signed-off-by: Babu Moger <[email protected]>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 112 ++++++++++++++++++++++++++++++++
1 file changed, 112 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 6f067c1ac7c1..59b484eb1267 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -330,9 +330,121 @@ int rdtgroup_mondata_config_show(struct seq_file *m, void *arg)
return ret;
}

+/*
+ * This is called via IPI to read the CQM/MBM counters
+ * in a domain.
+ */
+void mon_event_config_write(void *info)
+{
+ struct mon_config_info *mon_info = info;
+ u32 msr_index;
+
+ switch (mon_info->evtid) {
+ case QOS_L3_MBM_TOTAL_EVENT_ID:
+ msr_index = 0;
+ break;
+ case QOS_L3_MBM_LOCAL_EVENT_ID:
+ msr_index = 1;
+ break;
+ default:
+ /* Not expected to come here */
+ return;
+ }
+
+ wrmsr(MSR_IA32_EVT_CFG_BASE + msr_index, mon_info->mon_config, 0);
+}
+
+ssize_t rdtgroup_mondata_config_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mon_config_info mon_info;
+ struct rdt_hw_resource *hw_res;
+ struct rdtgroup *rdtgrp;
+ struct rdt_resource *r;
+ unsigned int mon_config;
+ cpumask_var_t cpu_mask;
+ union mon_data_bits md;
+ struct rdt_domain *d;
+ u32 resid, domid;
+ int ret = 0, cpu;
+
+ ret = kstrtouint(buf, 0, &mon_config);
+ if (ret)
+ return ret;
+
+ rdt_last_cmd_clear();
+
+ /* mon_config cannot be more than the supported set of events */
+ if (mon_config > MAX_EVT_CONFIG_BITS) {
+ rdt_last_cmd_puts("Invalid event configuration\n");
+ return -EINVAL;
+ }
+
+ cpus_read_lock();
+ rdtgrp = rdtgroup_kn_lock_live(of->kn);
+ if (!rdtgrp) {
+ return -ENOENT;
+ goto e_unlock;
+ }
+
+ if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) {
+ ret = -ENOMEM;
+ goto e_unlock;
+ }
+
+ md.priv = of->kn->priv;
+ resid = md.u.rid;
+ domid = md.u.domid;
+
+ hw_res = &rdt_resources_all[resid];
+ r = &hw_res->r_resctrl;
+ d = rdt_find_domain(r, domid, NULL);
+ if (IS_ERR_OR_NULL(d)) {
+ ret = -ENOENT;
+ goto e_cpumask;
+ }
+
+ /*
+ * Read the current config value first. If both are same
+ * then we dont need to write again
+ */
+ mon_info.evtid = md.u.evtid;
+ mondata_config_read(d, &mon_info);
+ if (mon_info.mon_config == mon_config)
+ goto e_cpumask;
+
+ mon_info.mon_config = mon_config;
+
+ /* Pick all the CPUs in the domain instance */
+ for_each_cpu(cpu, &d->cpu_mask)
+ cpumask_set_cpu(cpu, cpu_mask);
+
+ /* Update MSR_IA32_EVT_CFG_BASE MSR on all the CPUs in cpu_mask */
+ on_each_cpu_mask(cpu_mask, mon_event_config_write, &mon_info, 1);
+
+ /*
+ * When an Event Configuration is changed, the bandwidth counters
+ * for all RMIDs and Events will be cleared, and the U-bit for every
+ * RMID will be set on the next read to any BwEvent for every RMID.
+ * Clear the mbm_local and mbm_total counts for all the RMIDs.
+ */
+ memset(d->mbm_local, 0, sizeof(struct mbm_state) * r->num_rmid);
+ memset(d->mbm_total, 0, sizeof(struct mbm_state) * r->num_rmid);
+
+e_cpumask:
+ free_cpumask_var(cpu_mask);
+
+e_unlock:
+ rdtgroup_kn_unlock(of->kn);
+ cpus_read_unlock();
+
+ return ret ?: nbytes;
+}
+
static const struct kernfs_ops kf_mondata_config_ops = {
.atomic_write_len = PAGE_SIZE,
.seq_show = rdtgroup_mondata_config_show,
+ .write = rdtgroup_mondata_config_write,
};

static bool is_cpu_list(struct kernfs_open_file *of)


2022-09-07 18:23:24

by Moger, Babu

[permalink] [raw]
Subject: [PATCH v4 03/13] x86/cpufeatures: Add Slow Memory Bandwidth Allocation feature flag

Adds the new AMD feature X86_FEATURE_SMBA. With this feature, the QOS
enforcement policies can be applied to external slow memory connected
to the host. QOS enforcement is accomplished by assigning a Class Of
Service (COS) to a processor and specifying allocations or limits for
that COS for each resource to be allocated.

This feature is identified by the CPUID Function 8000_0020_EBX_x0.

CPUID Fn8000_0020_EBX_x0 AMD Bandwidth Enforcement Feature Identifiers (ECX=0)
Bits Field Name Description
2 L3SBE L3 external slow memory bandwidth enforcement

Currently, CXL.memory is the only supported "slow" memory device. With the
support of SMBA feature the hardware enables bandwidth allocation on the
slow memory devices. If there are multiple slow memory devices in the system,
then the throttling logic groups all the slow sources together and applies
the limit on them as a whole.

The presence of the SMBA feature(with CXL.memory) is independent of whether
slow memory device is actually present in the system. If there is no slow
memory in the system, then setting a SMBA limit will have no impact on the
performance of the system.

Presence of CXL memory can be identified by numactl command.

$numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
node 0 size: 63678 MB node 0 free: 59542 MB
node 1 cpus:
node 1 size: 16122 MB
node 1 free: 15627 MB
node distances:
node 0 1
0: 10 50
1: 50 10

CPU list for CXL memory will be emply. The cpu-cxl node distance is greater
than cpu-to-cpu distances. Node 1 has the CXL memory in this case. CXL
memory can also be identified using ACPI SRAT table and memory maps.

Feature description is available in the specification, "AMD64 Technology Platform Quality
of Service Extensions, Revision: 1.03 Publication # 56375 Revision: 1.03 Issue Date: February 2022".

Link: https://www.amd.com/en/support/tech-docs/amd64-technology-platform-quality-service-extensions
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/kernel/cpu/scattered.c | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 235dc85c91c3..1815435c9c88 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -304,6 +304,7 @@
#define X86_FEATURE_UNRET (11*32+15) /* "" AMD BTB untrain return */
#define X86_FEATURE_USE_IBPB_FW (11*32+16) /* "" Use IBPB during runtime firmware calls */
#define X86_FEATURE_RSB_VMEXIT_LITE (11*32+17) /* "" Fill RSB on VM exit when EIBRS is enabled */
+#define X86_FEATURE_SMBA (11*32+18) /* SLOW Memory Bandwidth Allocation */

/* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
#define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index fd44b54c90d5..885ecf46abb2 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -44,6 +44,7 @@ static const struct cpuid_bit cpuid_bits[] = {
{ X86_FEATURE_CPB, CPUID_EDX, 9, 0x80000007, 0 },
{ X86_FEATURE_PROC_FEEDBACK, CPUID_EDX, 11, 0x80000007, 0 },
{ X86_FEATURE_MBA, CPUID_EBX, 6, 0x80000008, 0 },
+ { X86_FEATURE_SMBA, CPUID_EBX, 2, 0x80000020, 0 },
{ X86_FEATURE_PERFMON_V2, CPUID_EAX, 0, 0x80000022, 0 },
{ 0, 0, 0, 0, 0 }
};


2022-09-07 18:27:12

by Moger, Babu

[permalink] [raw]
Subject: [PATCH v4 07/13] x86/resctrl: Detect and configure Slow Memory Bandwidth allocation

The QoS slow memory configuration details are available via
CPUID_Fn80000020_EDX_x02. Detect the available details and
initialize the rest to defaults.

Signed-off-by: Babu Moger <[email protected]>
---
arch/x86/kernel/cpu/resctrl/core.c | 29 +++++++++++++++++++++++++++--
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 2 +-
arch/x86/kernel/cpu/resctrl/internal.h | 1 +
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 9 ++++++---
4 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 53fbc3acad81..56c96607259c 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -227,9 +227,15 @@ static bool __rdt_get_mem_config_amd(struct rdt_resource *r)
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
union cpuid_0x10_3_eax eax;
union cpuid_0x10_x_edx edx;
- u32 ebx, ecx;
+ u32 ebx, ecx, subleaf;
+
+ /*
+ * Query CPUID_Fn80000020_EDX_x01 for MBA and
+ * CPUID_Fn80000020_EDX_x02 for SMBA
+ */
+ subleaf = (r->rid == RDT_RESOURCE_SMBA) ? 2 : 1;

- cpuid_count(0x80000020, 1, &eax.full, &ebx, &ecx, &edx.full);
+ cpuid_count(0x80000020, subleaf, &eax.full, &ebx, &ecx, &edx.full);
hw_res->num_closid = edx.split.cos_max + 1;
r->default_ctrl = MAX_MBA_BW_AMD;

@@ -791,6 +797,19 @@ static __init bool get_mem_config(void)
return false;
}

+static __init bool get_slow_mem_config(void)
+{
+ struct rdt_hw_resource *hw_res = &rdt_resources_all[RDT_RESOURCE_SMBA];
+
+ if (!rdt_cpu_has(X86_FEATURE_SMBA))
+ return false;
+
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
+ return __rdt_get_mem_config_amd(&hw_res->r_resctrl);
+
+ return false;
+}
+
static __init bool get_rdt_alloc_resources(void)
{
struct rdt_resource *r;
@@ -821,6 +840,9 @@ static __init bool get_rdt_alloc_resources(void)
if (get_mem_config())
ret = true;

+ if (get_slow_mem_config())
+ ret = true;
+
return ret;
}

@@ -910,6 +932,9 @@ static __init void rdt_init_res_defs_amd(void)
} else if (r->rid == RDT_RESOURCE_MBA) {
hw_res->msr_base = MSR_IA32_MBA_BW_BASE;
hw_res->msr_update = mba_wrmsr_amd;
+ } else if (r->rid == RDT_RESOURCE_SMBA) {
+ hw_res->msr_base = MSR_IA32_SMBA_BW_BASE;
+ hw_res->msr_update = mba_wrmsr_amd;
}
}
}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 7f38c8bd8429..480600b8e4cf 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -202,7 +202,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
unsigned long dom_id;

if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
- r->rid == RDT_RESOURCE_MBA) {
+ (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)) {
rdt_last_cmd_puts("Cannot pseudo-lock MBA resource\n");
return -EINVAL;
}
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 24a1dfeb6cb2..c049a274383c 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -14,6 +14,7 @@
#define MSR_IA32_L2_CBM_BASE 0xd10
#define MSR_IA32_MBA_THRTL_BASE 0xd50
#define MSR_IA32_MBA_BW_BASE 0xc0000200
+#define MSR_IA32_SMBA_BW_BASE 0xc0000280

#define MSR_IA32_QM_CTR 0x0c8e
#define MSR_IA32_QM_EVTSEL 0x0c8d
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index f276aff521e8..fc5286067201 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1218,7 +1218,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)

list_for_each_entry(s, &resctrl_schema_all, list) {
r = s->res;
- if (r->rid == RDT_RESOURCE_MBA)
+ if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
continue;
has_cache = true;
list_for_each_entry(d, &r->domains, list) {
@@ -1399,7 +1399,8 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
ctrl = resctrl_arch_get_config(r, d,
rdtgrp->closid,
schema->conf_type);
- if (r->rid == RDT_RESOURCE_MBA)
+ if (r->rid == RDT_RESOURCE_MBA ||
+ r->rid == RDT_RESOURCE_SMBA)
size = ctrl;
else
size = rdtgroup_cbm_to_size(r, d, ctrl);
@@ -2807,7 +2808,9 @@ static int rdtgroup_init_alloc(struct rdtgroup *rdtgrp)

list_for_each_entry(s, &resctrl_schema_all, list) {
r = s->res;
- if (r->rid == RDT_RESOURCE_MBA) {
+ if (r->rid == RDT_RESOURCE_MBA ||
+ r->rid == RDT_RESOURCE_SMBA) {
+
rdtgroup_init_mba(r);
} else {
ret = rdtgroup_init_cat(s, rdtgrp->closid);


2022-09-07 18:30:22

by Moger, Babu

[permalink] [raw]
Subject: [PATCH v4 06/13] x86/resctrl: Include new features in command line options

Add the command line options to disable the new features.
smba : Slow Memory Bandwidth Allocation
mbec : Bandwidth Monitor Event Configuration.

Signed-off-by: Babu Moger <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 2 +-
arch/x86/kernel/cpu/resctrl/core.c | 4 ++++
2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index d7f30902fda0..c109fecb93ea 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5169,7 +5169,7 @@
rdt= [HW,X86,RDT]
Turn on/off individual RDT features. List is:
cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
- mba.
+ mba, smba, bmec.
E.g. to turn on cmt and turn off mba use:
rdt=cmt,!mba

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index a7e9aabff8c8..53fbc3acad81 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -700,6 +700,8 @@ enum {
RDT_FLAG_L2_CAT,
RDT_FLAG_L2_CDP,
RDT_FLAG_MBA,
+ RDT_FLAG_SMBA,
+ RDT_FLAG_BMEC,
};

#define RDT_OPT(idx, n, f) \
@@ -723,6 +725,8 @@ static struct rdt_options rdt_options[] __initdata = {
RDT_OPT(RDT_FLAG_L2_CAT, "l2cat", X86_FEATURE_CAT_L2),
RDT_OPT(RDT_FLAG_L2_CDP, "l2cdp", X86_FEATURE_CDP_L2),
RDT_OPT(RDT_FLAG_MBA, "mba", X86_FEATURE_MBA),
+ RDT_OPT(RDT_FLAG_SMBA, "smba", X86_FEATURE_SMBA),
+ RDT_OPT(RDT_FLAG_BMEC, "bmec", X86_FEATURE_BMEC),
};
#define NUM_RDT_OPTIONS ARRAY_SIZE(rdt_options)



2022-09-07 18:57:14

by Moger, Babu

[permalink] [raw]
Subject: [PATCH v4 02/13] x86/resctrl: Remove arch_has_empty_bitmaps

The field arch_has_empty_bitmaps is not required anymore. The field
min_cbm_bits is enough to validate the CBM (capacity bit mask) if the
architecture can support the zero CBM or not.

Suggested-by: Reinette Chatre <[email protected]>
Signed-off-by: Babu Moger <[email protected]>
---
arch/x86/kernel/cpu/resctrl/core.c | 2 --
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 3 +--
include/linux/resctrl.h | 6 +++---
3 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index a5c51a14fbce..c2657754672e 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -869,7 +869,6 @@ static __init void rdt_init_res_defs_intel(void)
if (r->rid == RDT_RESOURCE_L3 ||
r->rid == RDT_RESOURCE_L2) {
r->cache.arch_has_sparse_bitmaps = false;
- r->cache.arch_has_empty_bitmaps = false;
r->cache.arch_has_per_cpu_cfg = false;
r->cache.min_cbm_bits = 1;
} else if (r->rid == RDT_RESOURCE_MBA) {
@@ -890,7 +889,6 @@ static __init void rdt_init_res_defs_amd(void)
if (r->rid == RDT_RESOURCE_L3 ||
r->rid == RDT_RESOURCE_L2) {
r->cache.arch_has_sparse_bitmaps = true;
- r->cache.arch_has_empty_bitmaps = true;
r->cache.arch_has_per_cpu_cfg = true;
r->cache.min_cbm_bits = 0;
} else if (r->rid == RDT_RESOURCE_MBA) {
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 87666275eed9..7f38c8bd8429 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -98,8 +98,7 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
return false;
}

- if ((!r->cache.arch_has_empty_bitmaps && val == 0) ||
- val > r->default_ctrl) {
+ if ((r->cache.min_cbm_bits > 0 && val == 0) || val > r->default_ctrl) {
rdt_last_cmd_puts("Mask out of range\n");
return false;
}
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 21deb5212bbd..46ed8589857c 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -72,11 +72,12 @@ struct rdt_domain {
/**
* struct resctrl_cache - Cache allocation related data
* @cbm_len: Length of the cache bit mask
- * @min_cbm_bits: Minimum number of consecutive bits to be set
+ * @min_cbm_bits: Minimum number of consecutive bits to be set.
+ * The value 0 means the architecture can support
+ * zero CBM.
* @shareable_bits: Bitmask of shareable resource with other
* executing entities
* @arch_has_sparse_bitmaps: True if a bitmap like f00f is valid.
- * @arch_has_empty_bitmaps: True if the '0' bitmap is valid.
* @arch_has_per_cpu_cfg: True if QOS_CFG register for this cache
* level has CPU scope.
*/
@@ -85,7 +86,6 @@ struct resctrl_cache {
unsigned int min_cbm_bits;
unsigned int shareable_bits;
bool arch_has_sparse_bitmaps;
- bool arch_has_empty_bitmaps;
bool arch_has_per_cpu_cfg;
};



2022-09-07 19:00:40

by Moger, Babu

[permalink] [raw]
Subject: [PATCH v4 09/13] x86/resctrl: Add sysfs interface files to read/write event configuration

Add two new sysfs files to read/write the event configuration if
the feature Bandwidth Monitoring Event Configuration (BMEC) is
supported. The file mbm_local_config is for the configuration
of the event mbm_local_bytes and the file mbm_total_config is
for the configuration of mbm_total_bytes.

$ls /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local*
/sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
/sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_config

$ls /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total*
/sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
/sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_config

Signed-off-by: Babu Moger <[email protected]>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 40 ++++++++++++++++++++++++--------
1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index f55a693fa958..da11fdad204d 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -254,6 +254,10 @@ static const struct kernfs_ops kf_mondata_ops = {
.seq_show = rdtgroup_mondata_show,
};

+static const struct kernfs_ops kf_mondata_config_ops = {
+ .atomic_write_len = PAGE_SIZE,
+};
+
static bool is_cpu_list(struct kernfs_open_file *of)
{
struct rftype *rft = of->kn->priv;
@@ -2478,24 +2482,40 @@ static struct file_system_type rdt_fs_type = {
.kill_sb = rdt_kill_sb,
};

-static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
+static int mon_addfile(struct kernfs_node *parent_kn, struct mon_evt *mevt,
void *priv)
{
- struct kernfs_node *kn;
+ struct kernfs_node *kn_evt, *kn_evt_config;
int ret = 0;

- kn = __kernfs_create_file(parent_kn, name, 0444,
- GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
- &kf_mondata_ops, priv, NULL, NULL);
- if (IS_ERR(kn))
- return PTR_ERR(kn);
+ kn_evt = __kernfs_create_file(parent_kn, mevt->name, 0444,
+ GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
+ &kf_mondata_ops, priv, NULL, NULL);
+ if (IS_ERR(kn_evt))
+ return PTR_ERR(kn_evt);

- ret = rdtgroup_kn_set_ugid(kn);
+ ret = rdtgroup_kn_set_ugid(kn_evt);
if (ret) {
- kernfs_remove(kn);
+ kernfs_remove(kn_evt);
return ret;
}

+ if (mevt->configurable) {
+ kn_evt_config = __kernfs_create_file(parent_kn,
+ mevt->config_name, 0644,
+ GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
+ &kf_mondata_config_ops, priv, NULL, NULL);
+ if (IS_ERR(kn_evt_config))
+ return PTR_ERR(kn_evt_config);
+
+ ret = rdtgroup_kn_set_ugid(kn_evt_config);
+ if (ret) {
+ kernfs_remove(kn_evt_config);
+ kernfs_remove(kn_evt);
+ return ret;
+ }
+ }
+
return ret;
}

@@ -2550,7 +2570,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
priv.u.domid = d->id;
list_for_each_entry(mevt, &r->evt_list, list) {
priv.u.evtid = mevt->evtid;
- ret = mon_addfile(kn, mevt->name, priv.priv);
+ ret = mon_addfile(kn, mevt, priv.priv);
if (ret)
goto out_destroy;



2022-09-07 19:05:44

by Moger, Babu

[permalink] [raw]
Subject: [PATCH v4 01/13] x86/resctrl: Fix min_cbm_bits for AMD

AMD systems support zero CBM (capacity bit mask) for L3 allocation.
That is reflected in rdt_init_res_defs_amd() by:

r->cache.arch_has_empty_bitmaps = true;

However given the unified code in cbm_validate(), checking for:
val == 0 && !arch_has_empty_bitmaps

is not enough because of another check in cbm_validate():

if ((zero_bit - first_bit) < r->cache.min_cbm_bits)

The default value of r->cache.min_cbm_bits = 1.

Leading to:

$ cd /sys/fs/resctrl
$ mkdir foo
$ cd foo
$ echo L3:0=0 > schemata
-bash: echo: write error: Invalid argument
$ cat /sys/fs/resctrl/info/last_cmd_status
Need at least 1 bits in the mask

Fix the issue by initializing the min_cbm_bits to 0 for AMD. Also,
remove the default setting of min_cbm_bits and initialize it separately.

After the fix
$ cd /sys/fs/resctrl
$ mkdir foo
$ cd foo
$ echo L3:0=0 > schemata
$ cat /sys/fs/resctrl/info/last_cmd_status
ok

Link: https://lore.kernel.org/lkml/[email protected]/
Fixes: 316e7f901f5a ("x86/resctrl: Add struct rdt_cache::arch_has_{sparse, empty}_bitmaps")
Signed-off-by: Stephane Eranian <[email protected]>
Signed-off-by: Babu Moger <[email protected]>
Reviewed-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/resctrl/core.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index bb1c3f5f60c8..a5c51a14fbce 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -66,9 +66,6 @@ struct rdt_hw_resource rdt_resources_all[] = {
.rid = RDT_RESOURCE_L3,
.name = "L3",
.cache_level = 3,
- .cache = {
- .min_cbm_bits = 1,
- },
.domains = domain_init(RDT_RESOURCE_L3),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
@@ -83,9 +80,6 @@ struct rdt_hw_resource rdt_resources_all[] = {
.rid = RDT_RESOURCE_L2,
.name = "L2",
.cache_level = 2,
- .cache = {
- .min_cbm_bits = 1,
- },
.domains = domain_init(RDT_RESOURCE_L2),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
@@ -877,6 +871,7 @@ static __init void rdt_init_res_defs_intel(void)
r->cache.arch_has_sparse_bitmaps = false;
r->cache.arch_has_empty_bitmaps = false;
r->cache.arch_has_per_cpu_cfg = false;
+ r->cache.min_cbm_bits = 1;
} else if (r->rid == RDT_RESOURCE_MBA) {
hw_res->msr_base = MSR_IA32_MBA_THRTL_BASE;
hw_res->msr_update = mba_wrmsr_intel;
@@ -897,6 +892,7 @@ static __init void rdt_init_res_defs_amd(void)
r->cache.arch_has_sparse_bitmaps = true;
r->cache.arch_has_empty_bitmaps = true;
r->cache.arch_has_per_cpu_cfg = true;
+ r->cache.min_cbm_bits = 0;
} else if (r->rid == RDT_RESOURCE_MBA) {
hw_res->msr_base = MSR_IA32_MBA_BW_BASE;
hw_res->msr_update = mba_wrmsr_amd;


2022-09-08 04:14:17

by Bagas Sanjaya

[permalink] [raw]
Subject: Re: [PATCH v4 13/13] Documentation/x86: Update resctrl_ui.rst for new features

On Wed, Sep 07, 2022 at 01:01:29PM -0500, Babu Moger wrote:
> Update the documentation for the new features:
> 1. Slow Memory Bandwidth allocation (SMBA).
> With this feature, the QOS enforcement policies can be applied
> to the external slow memory connected to the host. QOS enforcement
> is accomplished by assigning a Class Of Service (COS) to a processor
> and specifying allocations or limits for that COS for each resource
> to be allocated.
>
> 2. Bandwidth Monitoring Event Configuration (BMEC).
> The bandwidth monitoring events mbm_total_bytes and mbm_local_bytes
> are set to count all the total and local reads/writes respectively.
> With the introduction of slow memory, the two counters are not
> enough to count all the different types are memory events. With the
> feature BMEC, the users have the option to configure mbm_total_bytes
> and mbm_local_bytes to count the specific type of events.
>
> Also add configuration instructions with examples.
>
> Signed-off-by: Babu Moger <[email protected]>
> ---
> Documentation/x86/resctrl.rst | 148 ++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 146 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/x86/resctrl.rst b/Documentation/x86/resctrl.rst
> index 71a531061e4e..56581587c1a3 100644
> --- a/Documentation/x86/resctrl.rst
> +++ b/Documentation/x86/resctrl.rst
> @@ -17,14 +17,16 @@ AMD refers to this feature as AMD Platform Quality of Service(AMD QoS).
> This feature is enabled by the CONFIG_X86_CPU_RESCTRL and the x86 /proc/cpuinfo
> flag bits:
>
> -============================================= ================================
> +=============================================== ================================
> RDT (Resource Director Technology) Allocation "rdt_a"
> CAT (Cache Allocation Technology) "cat_l3", "cat_l2"
> CDP (Code and Data Prioritization) "cdp_l3", "cdp_l2"
> CQM (Cache QoS Monitoring) "cqm_llc", "cqm_occup_llc"
> MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local"
> MBA (Memory Bandwidth Allocation) "mba"
> -============================================= ================================
> +SMBA (Slow Memory Bandwidth Allocation) "smba"
> +BMEC (Bandwidth Monitoring Event Configuration) "bmec"
> +=============================================== ================================
>
> To use the feature mount the file system::
>
> @@ -161,6 +163,23 @@ with the following files:
> "mon_features":
> Lists the monitoring events if
> monitoring is enabled for the resource.
> + Example output:
> +
> + # cat /sys/fs/resctrl/info/L3_MON/mon_features
> + llc_occupancy
> + mbm_total_bytes
> + mbm_local_bytes
> +
> + If the system supports Bandwidth Monitoring Event
> + Configuration (BMEC), then the bandwidth events will
> + be configurable. Then the output will be.
> +
> + # cat /sys/fs/resctrl/info/L3_MON/mon_features
> + llc_occupancy
> + mbm_total_bytes
> + mbm_total_config
> + mbm_local_bytes
> + mbm_local_config
>
Use code blocks for terminal output above:

---- >8 ----
t a/Documentation/x86/resctrl.rst b/Documentation/x86/resctrl.rst
index 56581587c1a331..6474cf655792bf 100644
--- a/Documentation/x86/resctrl.rst
+++ b/Documentation/x86/resctrl.rst
@@ -163,23 +163,23 @@ with the following files:
"mon_features":
Lists the monitoring events if
monitoring is enabled for the resource.
- Example output:
+ Example::

- # cat /sys/fs/resctrl/info/L3_MON/mon_features
- llc_occupancy
- mbm_total_bytes
- mbm_local_bytes
+ # cat /sys/fs/resctrl/info/L3_MON/mon_features
+ llc_occupancy
+ mbm_total_bytes
+ mbm_local_bytes

If the system supports Bandwidth Monitoring Event
Configuration (BMEC), then the bandwidth events will
- be configurable. Then the output will be.
+ be configurable. The output will be::

- # cat /sys/fs/resctrl/info/L3_MON/mon_features
- llc_occupancy
- mbm_total_bytes
- mbm_total_config
- mbm_local_bytes
- mbm_local_config
+ # cat /sys/fs/resctrl/info/L3_MON/mon_features
+ llc_occupancy
+ mbm_total_bytes
+ mbm_total_config
+ mbm_local_bytes
+ mbm_local_config

"max_threshold_occupancy":
Read/write file provides the largest value (in

> + Following are the types of events supported:
> +
> + ==== ========================================================
> + Bits Description
> + ==== ========================================================
> + 6 Dirty Victims from the QOS domain to all types of memory
> + 5 Reads to slow memory in the non-local NUMA domain
> + 4 Reads to slow memory in the local NUMA domain
> + 3 Non-temporal writes to non-local NUMA domain
> + 2 Non-temporal writes to local NUMA domain
> + 1 Reads to memory in the non-local NUMA domain
> + 0 Reads to memory in the local NUMA domain
> + ==== ========================================================
> +

The table above looks OK.

Thanks.

--
An old man doll... just what I always wanted! - Clara


Attachments:
(No filename) (5.70 kB)
signature.asc (235.00 B)
Download all attachments

2022-09-08 10:15:02

by Bagas Sanjaya

[permalink] [raw]
Subject: Re: [PATCH v4 13/13] Documentation/x86: Update resctrl_ui.rst for new features

On Thu, Sep 08, 2022 at 11:07:59AM +0700, Bagas Sanjaya wrote:
> Use code blocks for terminal output above:
>
> ---- >8 ----
> t a/Documentation/x86/resctrl.rst b/Documentation/x86/resctrl.rst
> index 56581587c1a331..6474cf655792bf 100644
> --- a/Documentation/x86/resctrl.rst
> +++ b/Documentation/x86/resctrl.rst
> @@ -163,23 +163,23 @@ with the following files:
> "mon_features":
> Lists the monitoring events if
> monitoring is enabled for the resource.
> - Example output:
> + Example::
>
> - # cat /sys/fs/resctrl/info/L3_MON/mon_features
> - llc_occupancy
> - mbm_total_bytes
> - mbm_local_bytes
> + # cat /sys/fs/resctrl/info/L3_MON/mon_features
> + llc_occupancy
> + mbm_total_bytes
> + mbm_local_bytes
>
> If the system supports Bandwidth Monitoring Event
> Configuration (BMEC), then the bandwidth events will
> - be configurable. Then the output will be.
> + be configurable. The output will be::
>
> - # cat /sys/fs/resctrl/info/L3_MON/mon_features
> - llc_occupancy
> - mbm_total_bytes
> - mbm_total_config
> - mbm_local_bytes
> - mbm_local_config
> + # cat /sys/fs/resctrl/info/L3_MON/mon_features
> + llc_occupancy
> + mbm_total_bytes
> + mbm_total_config
> + mbm_local_bytes
> + mbm_local_config

Hi Babu,

The suggestion diff above looks corrupted, so here is the proper one:

---- >8 ----
diff --git a/Documentation/x86/resctrl.rst b/Documentation/x86/resctrl.rst
index 56581587c1a331..6474cf655792bf 100644
--- a/Documentation/x86/resctrl.rst
+++ b/Documentation/x86/resctrl.rst
@@ -163,23 +163,23 @@ with the following files:
"mon_features":
Lists the monitoring events if
monitoring is enabled for the resource.
- Example output:
+ Example::

- # cat /sys/fs/resctrl/info/L3_MON/mon_features
- llc_occupancy
- mbm_total_bytes
- mbm_local_bytes
+ # cat /sys/fs/resctrl/info/L3_MON/mon_features
+ llc_occupancy
+ mbm_total_bytes
+ mbm_local_bytes

If the system supports Bandwidth Monitoring Event
Configuration (BMEC), then the bandwidth events will
- be configurable. Then the output will be.
+ be configurable. The output will be::

- # cat /sys/fs/resctrl/info/L3_MON/mon_features
- llc_occupancy
- mbm_total_bytes
- mbm_total_config
- mbm_local_bytes
- mbm_local_config
+ # cat /sys/fs/resctrl/info/L3_MON/mon_features
+ llc_occupancy
+ mbm_total_bytes
+ mbm_total_config
+ mbm_local_bytes
+ mbm_local_config

"max_threshold_occupancy":
Read/write file provides the largest value (in

Thanks.

--
An old man doll... just what I always wanted! - Clara


Attachments:
(No filename) (3.37 kB)
signature.asc (235.00 B)
Download all attachments

2022-09-08 14:06:11

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH v4 13/13] Documentation/x86: Update resctrl_ui.rst for new features


On 9/8/22 04:26, Bagas Sanjaya wrote:
> On Thu, Sep 08, 2022 at 11:07:59AM +0700, Bagas Sanjaya wrote:
>> Use code blocks for terminal output above:
>>
>> ---- >8 ----
>> t a/Documentation/x86/resctrl.rst b/Documentation/x86/resctrl.rst
>> index 56581587c1a331..6474cf655792bf 100644
>> --- a/Documentation/x86/resctrl.rst
>> +++ b/Documentation/x86/resctrl.rst
>> @@ -163,23 +163,23 @@ with the following files:
>> "mon_features":
>> Lists the monitoring events if
>> monitoring is enabled for the resource.
>> - Example output:
>> + Example::
>>
>> - # cat /sys/fs/resctrl/info/L3_MON/mon_features
>> - llc_occupancy
>> - mbm_total_bytes
>> - mbm_local_bytes
>> + # cat /sys/fs/resctrl/info/L3_MON/mon_features
>> + llc_occupancy
>> + mbm_total_bytes
>> + mbm_local_bytes
>>
>> If the system supports Bandwidth Monitoring Event
>> Configuration (BMEC), then the bandwidth events will
>> - be configurable. Then the output will be.
>> + be configurable. The output will be::
>>
>> - # cat /sys/fs/resctrl/info/L3_MON/mon_features
>> - llc_occupancy
>> - mbm_total_bytes
>> - mbm_total_config
>> - mbm_local_bytes
>> - mbm_local_config
>> + # cat /sys/fs/resctrl/info/L3_MON/mon_features
>> + llc_occupancy
>> + mbm_total_bytes
>> + mbm_total_config
>> + mbm_local_bytes
>> + mbm_local_config
> Hi Babu,
>
> The suggestion diff above looks corrupted, so here is the proper one:

Sanjaya,

Sure, Thank you.

Babu

>
> ---- >8 ----
> diff --git a/Documentation/x86/resctrl.rst b/Documentation/x86/resctrl.rst
> index 56581587c1a331..6474cf655792bf 100644
> --- a/Documentation/x86/resctrl.rst
> +++ b/Documentation/x86/resctrl.rst
> @@ -163,23 +163,23 @@ with the following files:
> "mon_features":
> Lists the monitoring events if
> monitoring is enabled for the resource.
> - Example output:
> + Example::
>
> - # cat /sys/fs/resctrl/info/L3_MON/mon_features
> - llc_occupancy
> - mbm_total_bytes
> - mbm_local_bytes
> + # cat /sys/fs/resctrl/info/L3_MON/mon_features
> + llc_occupancy
> + mbm_total_bytes
> + mbm_local_bytes
>
> If the system supports Bandwidth Monitoring Event
> Configuration (BMEC), then the bandwidth events will
> - be configurable. Then the output will be.
> + be configurable. The output will be::
>
> - # cat /sys/fs/resctrl/info/L3_MON/mon_features
> - llc_occupancy
> - mbm_total_bytes
> - mbm_total_config
> - mbm_local_bytes
> - mbm_local_config
> + # cat /sys/fs/resctrl/info/L3_MON/mon_features
> + llc_occupancy
> + mbm_total_bytes
> + mbm_total_config
> + mbm_local_bytes
> + mbm_local_config
>
> "max_threshold_occupancy":
> Read/write file provides the largest value (in
>
> Thanks.
>
--
Thanks
Babu Moger

2022-09-09 17:06:09

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v4 01/13] x86/resctrl: Fix min_cbm_bits for AMD

Hi Babu,

On 07/09/2022 18:59, Babu Moger wrote:
> AMD systems support zero CBM (capacity bit mask) for L3 allocation.
> That is reflected in rdt_init_res_defs_amd() by:
>
> r->cache.arch_has_empty_bitmaps = true;
>
> However given the unified code in cbm_validate(), checking for:
> val == 0 && !arch_has_empty_bitmaps
>
> is not enough because of another check in cbm_validate():
>
> if ((zero_bit - first_bit) < r->cache.min_cbm_bits)

Right, the Intel version had this, but the AMD didn't. I evidently only thought about this
the !arch_has_empty_bitmaps way round! Sorry about that.


> The default value of r->cache.min_cbm_bits = 1.
>
> Leading to:
>
> $ cd /sys/fs/resctrl
> $ mkdir foo
> $ cd foo
> $ echo L3:0=0 > schemata
> -bash: echo: write error: Invalid argument
> $ cat /sys/fs/resctrl/info/last_cmd_status
> Need at least 1 bits in the mask
>
> Fix the issue by initializing the min_cbm_bits to 0 for AMD. Also,
> remove the default setting of min_cbm_bits and initialize it separately.
>
> After the fix
> $ cd /sys/fs/resctrl
> $ mkdir foo
> $ cd foo
> $ echo L3:0=0 > schemata
> $ cat /sys/fs/resctrl/info/last_cmd_status
> ok
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Fixes: 316e7f901f5a ("x86/resctrl: Add struct rdt_cache::arch_has_{sparse, empty}_bitmaps")

> Signed-off-by: Stephane Eranian <[email protected]>
> Signed-off-by: Babu Moger <[email protected]>

Er, who is the author if this patch? If you are reposting Stephane's patch then there
needs to be a 'From: ' at the top of the email so that git preserves the ownership. You
may need some incantation of "git commit --amend --author=" to fix this in your tree.

As its a fix, have you posted this separately? Mixing fixes and new-code makes it hard for
the maintainer to spot what needs to be taken for the next -rc.


> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index bb1c3f5f60c8..a5c51a14fbce 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -66,9 +66,6 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .rid = RDT_RESOURCE_L3,
> .name = "L3",
> .cache_level = 3,
> - .cache = {
> - .min_cbm_bits = 1,
> - },
> .domains = domain_init(RDT_RESOURCE_L3),
> .parse_ctrlval = parse_cbm,
> .format_str = "%d=%0*x",
> @@ -83,9 +80,6 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .rid = RDT_RESOURCE_L2,
> .name = "L2",
> .cache_level = 2,
> - .cache = {
> - .min_cbm_bits = 1,
> - },
> .domains = domain_init(RDT_RESOURCE_L2),
> .parse_ctrlval = parse_cbm,
> .format_str = "%d=%0*x",
> @@ -877,6 +871,7 @@ static __init void rdt_init_res_defs_intel(void)
> r->cache.arch_has_sparse_bitmaps = false;
> r->cache.arch_has_empty_bitmaps = false;
> r->cache.arch_has_per_cpu_cfg = false;
> + r->cache.min_cbm_bits = 1;
> } else if (r->rid == RDT_RESOURCE_MBA) {
> hw_res->msr_base = MSR_IA32_MBA_THRTL_BASE;
> hw_res->msr_update = mba_wrmsr_intel;
> @@ -897,6 +892,7 @@ static __init void rdt_init_res_defs_amd(void)
> r->cache.arch_has_sparse_bitmaps = true;
> r->cache.arch_has_empty_bitmaps = true;
> r->cache.arch_has_per_cpu_cfg = true;
> + r->cache.min_cbm_bits = 0;
> } else if (r->rid == RDT_RESOURCE_MBA) {
> hw_res->msr_base = MSR_IA32_MBA_BW_BASE;
> hw_res->msr_update = mba_wrmsr_amd;
>
>


Reviewed-by: James Morse <[email protected]>

Thanks,

James

2022-09-12 15:41:10

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH v4 01/13] x86/resctrl: Fix min_cbm_bits for AMD

Hi James,

On 9/9/22 12:00, James Morse wrote:
> Hi Babu,
>
> On 07/09/2022 18:59, Babu Moger wrote:
>> AMD systems support zero CBM (capacity bit mask) for L3 allocation.
>> That is reflected in rdt_init_res_defs_amd() by:
>>
>> r->cache.arch_has_empty_bitmaps = true;
>>
>> However given the unified code in cbm_validate(), checking for:
>> val == 0 && !arch_has_empty_bitmaps
>>
>> is not enough because of another check in cbm_validate():
>>
>> if ((zero_bit - first_bit) < r->cache.min_cbm_bits)
> Right, the Intel version had this, but the AMD didn't. I evidently only thought about this
> the !arch_has_empty_bitmaps way round! Sorry about that.

>
>
>> The default value of r->cache.min_cbm_bits = 1.
>>
>> Leading to:
>>
>> $ cd /sys/fs/resctrl
>> $ mkdir foo
>> $ cd foo
>> $ echo L3:0=0 > schemata
>> -bash: echo: write error: Invalid argument
>> $ cat /sys/fs/resctrl/info/last_cmd_status
>> Need at least 1 bits in the mask
>>
>> Fix the issue by initializing the min_cbm_bits to 0 for AMD. Also,
>> remove the default setting of min_cbm_bits and initialize it separately.
>>
>> After the fix
>> $ cd /sys/fs/resctrl
>> $ mkdir foo
>> $ cd foo
>> $ echo L3:0=0 > schemata
>> $ cat /sys/fs/resctrl/info/last_cmd_status
>> ok
>>
>> Link: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20220517001234.3137157-1-eranian%40google.com%2F&amp;data=05%7C01%7Cbabu.moger%40amd.com%7Cc5b955e9726344c550f008da9284dedd%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637983396695085653%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=wIVM5%2BrCwfWDpIrLkDtycgoCd4PWMC3D8y%2FAjshIW%2FQ%3D&amp;reserved=0
>> Fixes: 316e7f901f5a ("x86/resctrl: Add struct rdt_cache::arch_has_{sparse, empty}_bitmaps")
>> Signed-off-by: Stephane Eranian <[email protected]>
>> Signed-off-by: Babu Moger <[email protected]>
> Er, who is the author if this patch? If you are reposting Stephane's patch then there
> needs to be a 'From: ' at the top of the email so that git preserves the ownership. You

I can add Stephane's name in "From" if it is right thing to do. But this
patch was modified from the original version Stephane posted.

https://lore.kernel.org/lkml/[email protected]/


> may need some incantation of "git commit --amend --author=" to fix this in your tree.
>
> As its a fix, have you posted this separately? Mixing fixes and new-code makes it hard for
> the maintainer to spot what needs to be taken for the next -rc.

ok. I can send this separate in next version.


>
>
>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>> index bb1c3f5f60c8..a5c51a14fbce 100644
>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>> @@ -66,9 +66,6 @@ struct rdt_hw_resource rdt_resources_all[] = {
>> .rid = RDT_RESOURCE_L3,
>> .name = "L3",
>> .cache_level = 3,
>> - .cache = {
>> - .min_cbm_bits = 1,
>> - },
>> .domains = domain_init(RDT_RESOURCE_L3),
>> .parse_ctrlval = parse_cbm,
>> .format_str = "%d=%0*x",
>> @@ -83,9 +80,6 @@ struct rdt_hw_resource rdt_resources_all[] = {
>> .rid = RDT_RESOURCE_L2,
>> .name = "L2",
>> .cache_level = 2,
>> - .cache = {
>> - .min_cbm_bits = 1,
>> - },
>> .domains = domain_init(RDT_RESOURCE_L2),
>> .parse_ctrlval = parse_cbm,
>> .format_str = "%d=%0*x",
>> @@ -877,6 +871,7 @@ static __init void rdt_init_res_defs_intel(void)
>> r->cache.arch_has_sparse_bitmaps = false;
>> r->cache.arch_has_empty_bitmaps = false;
>> r->cache.arch_has_per_cpu_cfg = false;
>> + r->cache.min_cbm_bits = 1;
>> } else if (r->rid == RDT_RESOURCE_MBA) {
>> hw_res->msr_base = MSR_IA32_MBA_THRTL_BASE;
>> hw_res->msr_update = mba_wrmsr_intel;
>> @@ -897,6 +892,7 @@ static __init void rdt_init_res_defs_amd(void)
>> r->cache.arch_has_sparse_bitmaps = true;
>> r->cache.arch_has_empty_bitmaps = true;
>> r->cache.arch_has_per_cpu_cfg = true;
>> + r->cache.min_cbm_bits = 0;
>> } else if (r->rid == RDT_RESOURCE_MBA) {
>> hw_res->msr_base = MSR_IA32_MBA_BW_BASE;
>> hw_res->msr_update = mba_wrmsr_amd;
>>
>>
>
> Reviewed-by: James Morse <[email protected]>


Thank You.

Babu

2022-09-16 16:01:34

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 01/13] x86/resctrl: Fix min_cbm_bits for AMD

Hi Babu,

On 9/12/2022 7:54 AM, Moger, Babu wrote:
> On 9/9/22 12:00, James Morse wrote:
>> On 07/09/2022 18:59, Babu Moger wrote:


>>> Fixes: 316e7f901f5a ("x86/resctrl: Add struct rdt_cache::arch_has_{sparse, empty}_bitmaps")
>>> Signed-off-by: Stephane Eranian <[email protected]>
>>> Signed-off-by: Babu Moger <[email protected]>
>> Er, who is the author if this patch? If you are reposting Stephane's patch then there
>> needs to be a 'From: ' at the top of the email so that git preserves the ownership. You
>
> I can add Stephane's name in "From" if it is right thing to do. But this
> patch was modified from the original version Stephane posted.
>
> https://lore.kernel.org/lkml/[email protected]/

True, but also please consider what Stephane posted originally:
https://lore.kernel.org/lkml/[email protected]/

A "Co-developed-by" may help with attribution:

Co-developed-by: Stephane Eranian <[email protected]>
Signed-off-by: Stephane Eranian <[email protected]>
Signed-off-by: Babu Moger <[email protected]>

Reinette

2022-09-16 16:01:37

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 01/13] x86/resctrl: Fix min_cbm_bits for AMD

Hi Babu,

On 9/7/2022 10:59 AM, Babu Moger wrote:
> AMD systems support zero CBM (capacity bit mask) for L3 allocation.

Above mentions "for L3 allocation", but the change impacts
L3 as well as L2 allocation. Perhaps just
"cache allocation"?

> That is reflected in rdt_init_res_defs_amd() by:
>
> r->cache.arch_has_empty_bitmaps = true;
>
> However given the unified code in cbm_validate(), checking for:
> val == 0 && !arch_has_empty_bitmaps
>
> is not enough because of another check in cbm_validate():
>
> if ((zero_bit - first_bit) < r->cache.min_cbm_bits)
>
> The default value of r->cache.min_cbm_bits = 1.
>
> Leading to:
>
> $ cd /sys/fs/resctrl
> $ mkdir foo
> $ cd foo
> $ echo L3:0=0 > schemata
> -bash: echo: write error: Invalid argument
> $ cat /sys/fs/resctrl/info/last_cmd_status
> Need at least 1 bits in the mask
>
> Fix the issue by initializing the min_cbm_bits to 0 for AMD. Also,
> remove the default setting of min_cbm_bits and initialize it separately.
>
> After the fix
> $ cd /sys/fs/resctrl
> $ mkdir foo
> $ cd foo
> $ echo L3:0=0 > schemata
> $ cat /sys/fs/resctrl/info/last_cmd_status
> ok
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Fixes: 316e7f901f5a ("x86/resctrl: Add struct rdt_cache::arch_has_{sparse, empty}_bitmaps")
> Signed-off-by: Stephane Eranian <[email protected]>
> Signed-off-by: Babu Moger <[email protected]>
> Reviewed-by: Ingo Molnar <[email protected]>

(apart from the changelog nitpick)

Thank you for clarifying the way forward for this fix.

Reviewed-by: Reinette Chatre <[email protected]>

Reinette


2022-09-16 16:02:00

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 03/13] x86/cpufeatures: Add Slow Memory Bandwidth Allocation feature flag

Hi Babu,

On 9/7/2022 11:00 AM, Babu Moger wrote:
> Adds the new AMD feature X86_FEATURE_SMBA. With this feature, the QOS

Adds -> Add

> enforcement policies can be applied to external slow memory connected
> to the host. QOS enforcement is accomplished by assigning a Class Of
> Service (COS) to a processor and specifying allocations or limits for
> that COS for each resource to be allocated.
>
> This feature is identified by the CPUID Function 8000_0020_EBX_x0.
>
> CPUID Fn8000_0020_EBX_x0 AMD Bandwidth Enforcement Feature Identifiers (ECX=0)
> Bits Field Name Description
> 2 L3SBE L3 external slow memory bandwidth enforcement
>
> Currently, CXL.memory is the only supported "slow" memory device. With the
> support of SMBA feature the hardware enables bandwidth allocation on the
> slow memory devices. If there are multiple slow memory devices in the system,
> then the throttling logic groups all the slow sources together and applies
> the limit on them as a whole.

The above is a useful addition (made in patch 13/13) to the documentation ...

>
> The presence of the SMBA feature(with CXL.memory) is independent of whether
> slow memory device is actually present in the system. If there is no slow
> memory in the system, then setting a SMBA limit will have no impact on the
> performance of the system.

... could the above snippet also please be added to the documentation?

>
> Presence of CXL memory can be identified by numactl command.
>
> $numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
> node 0 size: 63678 MB node 0 free: 59542 MB
> node 1 cpus:
> node 1 size: 16122 MB
> node 1 free: 15627 MB
> node distances:
> node 0 1
> 0: 10 50
> 1: 50 10
>
> CPU list for CXL memory will be emply. The cpu-cxl node distance is greater

emply -> empty?

> than cpu-to-cpu distances. Node 1 has the CXL memory in this case. CXL
> memory can also be identified using ACPI SRAT table and memory maps.
>
> Feature description is available in the specification, "AMD64 Technology Platform Quality
> of Service Extensions, Revision: 1.03 Publication # 56375 Revision: 1.03 Issue Date: February 2022".

Please shorten these lines to the recommended 75 chars per line.

>
> Link: https://www.amd.com/en/support/tech-docs/amd64-technology-platform-quality-service-extensions
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
> Signed-off-by: Babu Moger <[email protected]>
> ---
> arch/x86/include/asm/cpufeatures.h | 1 +
> arch/x86/kernel/cpu/scattered.c | 1 +
> 2 files changed, 2 insertions(+)
>
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 235dc85c91c3..1815435c9c88 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -304,6 +304,7 @@
> #define X86_FEATURE_UNRET (11*32+15) /* "" AMD BTB untrain return */
> #define X86_FEATURE_USE_IBPB_FW (11*32+16) /* "" Use IBPB during runtime firmware calls */
> #define X86_FEATURE_RSB_VMEXIT_LITE (11*32+17) /* "" Fill RSB on VM exit when EIBRS is enabled */
> +#define X86_FEATURE_SMBA (11*32+18) /* SLOW Memory Bandwidth Allocation */

Why is "SLOW" in all caps?

>
> /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
> #define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */
> diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
> index fd44b54c90d5..885ecf46abb2 100644
> --- a/arch/x86/kernel/cpu/scattered.c
> +++ b/arch/x86/kernel/cpu/scattered.c
> @@ -44,6 +44,7 @@ static const struct cpuid_bit cpuid_bits[] = {
> { X86_FEATURE_CPB, CPUID_EDX, 9, 0x80000007, 0 },
> { X86_FEATURE_PROC_FEEDBACK, CPUID_EDX, 11, 0x80000007, 0 },
> { X86_FEATURE_MBA, CPUID_EBX, 6, 0x80000008, 0 },
> + { X86_FEATURE_SMBA, CPUID_EBX, 2, 0x80000020, 0 },
> { X86_FEATURE_PERFMON_V2, CPUID_EAX, 0, 0x80000022, 0 },
> { 0, 0, 0, 0, 0 }
> };
>
>

Could you please follow the coding style (wrt tabs vs. spaces) of the area you
are contributing to here? Please do so in all patches in this series.

Reinette

2022-09-16 16:04:07

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 02/13] x86/resctrl: Remove arch_has_empty_bitmaps

Hi Babu,

On 9/7/2022 11:00 AM, Babu Moger wrote:
> The field arch_has_empty_bitmaps is not required anymore. The field
> min_cbm_bits is enough to validate the CBM (capacity bit mask) if the
> architecture can support the zero CBM or not.
>
> Suggested-by: Reinette Chatre <[email protected]>
> Signed-off-by: Babu Moger <[email protected]>
> ---

Thank you.

Reviewed-by: Reinette Chatre <[email protected]>

Reinette

2022-09-16 16:05:09

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 06/13] x86/resctrl: Include new features in command line options

Hi Babu,

On 9/7/2022 11:00 AM, Babu Moger wrote:
> Add the command line options to disable the new features.
> smba : Slow Memory Bandwidth Allocation
> mbec : Bandwidth Monitor Event Configuration.

mbec -> bmec ?

>
> Signed-off-by: Babu Moger <[email protected]>

Reinette

2022-09-16 16:36:59

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 09/13] x86/resctrl: Add sysfs interface files to read/write event configuration

Hi Babu,

On 9/7/2022 11:01 AM, Babu Moger wrote:
> Add two new sysfs files to read/write the event configuration if
> the feature Bandwidth Monitoring Event Configuration (BMEC) is
> supported. The file mbm_local_config is for the configuration
> of the event mbm_local_bytes and the file mbm_total_config is
> for the configuration of mbm_total_bytes.
>
> $ls /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local*
> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_config
>
> $ls /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total*
> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_config
>

This patch makes the mbm*config files per monitor group. Looking
ahead at later patches how the configuration is set it is not clear
to me that this is the right place for these configuration files.

Looking ahead to patch 10 there is neither rmid nor closid within
the (MSR_IA32_EVT_CFG_BASE + index) register - it only takes
the bits indicating what access types needs to be counted. Also
in patch 10 I understand that the scope of this register is per L3 cache
domain.

Considering this, why is the sysfs file associated with each
monitor group?

For example, consider the following scenario:
# cd /sys/fs/resctrl
# mkdir g2
# mkdir mon_groups/m1
# mkdir mon_groups/m2
# find . | grep mbm_local_config
./mon_data/mon_L3_00/mbm_local_config
./mon_data/mon_L3_01/mbm_local_config
./g2/mon_data/mon_L3_00/mbm_local_config
./g2/mon_data/mon_L3_01/mbm_local_config
./mon_groups/m2/mon_data/mon_L3_00/mbm_local_config
./mon_groups/m2/mon_data/mon_L3_01/mbm_local_config
./mon_groups/m1/mon_data/mon_L3_00/mbm_local_config
./mon_groups/m1/mon_data/mon_L3_01/mbm_local_config


From what I understand, the following sysfs files are
associated with cache domain #0 and thus writing to any of these
files would change the same configuration:
./mon_data/mon_L3_00/mbm_local_config
./g2/mon_data/mon_L3_00/mbm_local_config
./mon_groups/m2/mon_data/mon_L3_00/mbm_local_config
./mon_groups/m1/mon_data/mon_L3_00/mbm_local_config

Could you please correct me where I am wrong?


> Signed-off-by: Babu Moger <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 40 ++++++++++++++++++++++++--------
> 1 file changed, 30 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index f55a693fa958..da11fdad204d 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -254,6 +254,10 @@ static const struct kernfs_ops kf_mondata_ops = {
> .seq_show = rdtgroup_mondata_show,
> };
>
> +static const struct kernfs_ops kf_mondata_config_ops = {
> + .atomic_write_len = PAGE_SIZE,
> +};
> +

Please use coding style (tabs vs spaces) that is consistent with area
you are contributing to.

> static bool is_cpu_list(struct kernfs_open_file *of)
> {
> struct rftype *rft = of->kn->priv;
> @@ -2478,24 +2482,40 @@ static struct file_system_type rdt_fs_type = {
> .kill_sb = rdt_kill_sb,
> };
>
> -static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
> +static int mon_addfile(struct kernfs_node *parent_kn, struct mon_evt *mevt,
> void *priv)
> {
> - struct kernfs_node *kn;
> + struct kernfs_node *kn_evt, *kn_evt_config;
> int ret = 0;
>
> - kn = __kernfs_create_file(parent_kn, name, 0444,
> - GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
> - &kf_mondata_ops, priv, NULL, NULL);
> - if (IS_ERR(kn))
> - return PTR_ERR(kn);
> + kn_evt = __kernfs_create_file(parent_kn, mevt->name, 0444,
> + GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
> + &kf_mondata_ops, priv, NULL, NULL);

Please run your series through checkpatch (alignment issue above)

> + if (IS_ERR(kn_evt))
> + return PTR_ERR(kn_evt);
>
> - ret = rdtgroup_kn_set_ugid(kn);
> + ret = rdtgroup_kn_set_ugid(kn_evt);
> if (ret) {
> - kernfs_remove(kn);
> + kernfs_remove(kn_evt);
> return ret;
> }
>
> + if (mevt->configurable) {
> + kn_evt_config = __kernfs_create_file(parent_kn,
> + mevt->config_name, 0644,
> + GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
> + &kf_mondata_config_ops, priv, NULL, NULL);
> + if (IS_ERR(kn_evt_config))
> + return PTR_ERR(kn_evt_config);
> +

Since an error is returned here it seems that some cleanup (kn_evt) is missing?


> + ret = rdtgroup_kn_set_ugid(kn_evt_config);
> + if (ret) {
> + kernfs_remove(kn_evt_config);
> + kernfs_remove(kn_evt);
> + return ret;
> + }
> + }
> +
> return ret;
> }
>

Reinette

2022-09-16 16:43:36

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 11/13] x86/resctrl: Add sysfs interface to write the event configuration

Hi Babu,

On 9/7/2022 11:01 AM, Babu Moger wrote:

...

> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 6f067c1ac7c1..59b484eb1267 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -330,9 +330,121 @@ int rdtgroup_mondata_config_show(struct seq_file *m, void *arg)
> return ret;
> }
>
> +/*
> + * This is called via IPI to read the CQM/MBM counters
> + * in a domain.

copy&paste from previous patch?

> + */
> +void mon_event_config_write(void *info)
> +{
> + struct mon_config_info *mon_info = info;
> + u32 msr_index;
> +
> + switch (mon_info->evtid) {
> + case QOS_L3_MBM_TOTAL_EVENT_ID:
> + msr_index = 0;
> + break;
> + case QOS_L3_MBM_LOCAL_EVENT_ID:
> + msr_index = 1;
> + break;
> + default:
> + /* Not expected to come here */
> + return;
> + }
> +
> + wrmsr(MSR_IA32_EVT_CFG_BASE + msr_index, mon_info->mon_config, 0);
> +}
> +
> +ssize_t rdtgroup_mondata_config_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes, loff_t off)
> +{
> + struct mon_config_info mon_info;
> + struct rdt_hw_resource *hw_res;
> + struct rdtgroup *rdtgrp;
> + struct rdt_resource *r;
> + unsigned int mon_config;
> + cpumask_var_t cpu_mask;
> + union mon_data_bits md;
> + struct rdt_domain *d;
> + u32 resid, domid;
> + int ret = 0, cpu;
> +
> + ret = kstrtouint(buf, 0, &mon_config);
> + if (ret)
> + return ret;
> +
> + rdt_last_cmd_clear();
> +
> + /* mon_config cannot be more than the supported set of events */
> + if (mon_config > MAX_EVT_CONFIG_BITS) {
> + rdt_last_cmd_puts("Invalid event configuration\n");
> + return -EINVAL;
> + }
> +
> + cpus_read_lock();
> + rdtgrp = rdtgroup_kn_lock_live(of->kn);
> + if (!rdtgrp) {
> + return -ENOENT;
> + goto e_unlock;
> + }
> +
> + if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) {
> + ret = -ENOMEM;
> + goto e_unlock;
> + }
> +
> + md.priv = of->kn->priv;
> + resid = md.u.rid;
> + domid = md.u.domid;
> +
> + hw_res = &rdt_resources_all[resid];
> + r = &hw_res->r_resctrl;
> + d = rdt_find_domain(r, domid, NULL);
> + if (IS_ERR_OR_NULL(d)) {
> + ret = -ENOENT;
> + goto e_cpumask;
> + }
> +
> + /*
> + * Read the current config value first. If both are same
> + * then we dont need to write again
> + */
> + mon_info.evtid = md.u.evtid;
> + mondata_config_read(d, &mon_info);
> + if (mon_info.mon_config == mon_config)
> + goto e_cpumask;
> +
> + mon_info.mon_config = mon_config;
> +
> + /* Pick all the CPUs in the domain instance */
> + for_each_cpu(cpu, &d->cpu_mask)
> + cpumask_set_cpu(cpu, cpu_mask);
> +
> + /* Update MSR_IA32_EVT_CFG_BASE MSR on all the CPUs in cpu_mask */
> + on_each_cpu_mask(cpu_mask, mon_event_config_write, &mon_info, 1);

If this is required then could you please add a comment why every CPU in
the domain needs to be updated? Until now configuration changes
only needed to be made on one CPU per domain.
Even in the previous patch when the user reads the current configuration
value it is only done on one CPU in the domain ... to me that implies
that the scope is per domain and only one CPU in the domain needs to be
changed.

> +
> + /*
> + * When an Event Configuration is changed, the bandwidth counters
> + * for all RMIDs and Events will be cleared, and the U-bit for every
> + * RMID will be set on the next read to any BwEvent for every RMID.
> + * Clear the mbm_local and mbm_total counts for all the RMIDs.
> + */

This is a snippet that was copied from the hardware spec and since it is
inserted into the kernel driver the context makes it hard to understand.
Could it be translated into what it means in this context?
Perhaps something like:

/*
* When an Event Configuration is changed, the bandwidth counters
* for all RMIDs and Events will be cleared by the hardware. The
* hardware also sets MSR_IA32_QM_CTR.Unavailable (bit 62) for every
* RMID on the next read to any event for every RMID. Subsequent
* reads will have MSR_IA32_QM_CTR.Unavailable (bit 62) cleared
* while it is tracked by the hardware.
* Clear the mbm_local and mbm_total counts for all the RMIDs.
*/

Please fixup where I got it wrong.

> + memset(d->mbm_local, 0, sizeof(struct mbm_state) * r->num_rmid);
> + memset(d->mbm_total, 0, sizeof(struct mbm_state) * r->num_rmid);
> +

Reinette

2022-09-16 18:35:09

by Moger, Babu

[permalink] [raw]
Subject: RE: [PATCH v4 01/13] x86/resctrl: Fix min_cbm_bits for AMD

[AMD Official Use Only - General]



> -----Original Message-----
> From: Reinette Chatre <[email protected]>
> Sent: Friday, September 16, 2022 10:53 AM
> To: Moger, Babu <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected]
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; Das1, Sandipan
> <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [PATCH v4 01/13] x86/resctrl: Fix min_cbm_bits for AMD
>
> Hi Babu,
>
> On 9/7/2022 10:59 AM, Babu Moger wrote:
> > AMD systems support zero CBM (capacity bit mask) for L3 allocation.
>
> Above mentions "for L3 allocation", but the change impacts
> L3 as well as L2 allocation. Perhaps just "cache allocation"?
>
> > That is reflected in rdt_init_res_defs_amd() by:
> >
> > r->cache.arch_has_empty_bitmaps = true;
> >
> > However given the unified code in cbm_validate(), checking for:
> > val == 0 && !arch_has_empty_bitmaps
> >
> > is not enough because of another check in cbm_validate():
> >
> > if ((zero_bit - first_bit) < r->cache.min_cbm_bits)
> >
> > The default value of r->cache.min_cbm_bits = 1.
> >
> > Leading to:
> >
> > $ cd /sys/fs/resctrl
> > $ mkdir foo
> > $ cd foo
> > $ echo L3:0=0 > schemata
> > -bash: echo: write error: Invalid argument
> > $ cat /sys/fs/resctrl/info/last_cmd_status
> > Need at least 1 bits in the mask
> >
> > Fix the issue by initializing the min_cbm_bits to 0 for AMD. Also,
> > remove the default setting of min_cbm_bits and initialize it separately.
> >
> > After the fix
> > $ cd /sys/fs/resctrl
> > $ mkdir foo
> > $ cd foo
> > $ echo L3:0=0 > schemata
> > $ cat /sys/fs/resctrl/info/last_cmd_status
> > ok
> >
> > Link:
> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore
> > .kernel.org%2Flkml%2F20220517001234.3137157-1-
> eranian%40google.com%2F&
> >
> amp;data=05%7C01%7Cbabu.moger%40amd.com%7Cfdccc20e6a234fb3872a0
> 8da97fb
> >
> 94fc%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C6379894040498
> 44076%7
> >
> CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI
> 6Ik1
> >
> haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=DIpkBAkI7lAjj5QQ4
> 5VahssT
> > dWGj%2FcUwGJDiHXNYzz8%3D&amp;reserved=0
> > Fixes: 316e7f901f5a ("x86/resctrl: Add struct
> > rdt_cache::arch_has_{sparse, empty}_bitmaps")
> > Signed-off-by: Stephane Eranian <[email protected]>
> > Signed-off-by: Babu Moger <[email protected]>
> > Reviewed-by: Ingo Molnar <[email protected]>
>
> (apart from the changelog nitpick)
>
> Thank you for clarifying the way forward for this fix.
>
> Reviewed-by: Reinette Chatre <[email protected]>

Thank you
Babu Moger

2022-09-16 18:43:40

by Moger, Babu

[permalink] [raw]
Subject: RE: [PATCH v4 01/13] x86/resctrl: Fix min_cbm_bits for AMD

[AMD Official Use Only - General]



> -----Original Message-----
> From: Reinette Chatre <[email protected]>
> Sent: Friday, September 16, 2022 10:53 AM
> To: Moger, Babu <[email protected]>; James Morse
> <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected]
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; Das1, Sandipan
> <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH v4 01/13] x86/resctrl: Fix min_cbm_bits for AMD
>
> Hi Babu,
>
> On 9/12/2022 7:54 AM, Moger, Babu wrote:
> > On 9/9/22 12:00, James Morse wrote:
> >> On 07/09/2022 18:59, Babu Moger wrote:
>
>
> >>> Fixes: 316e7f901f5a ("x86/resctrl: Add struct
> >>> rdt_cache::arch_has_{sparse, empty}_bitmaps")
> >>> Signed-off-by: Stephane Eranian <[email protected]>
> >>> Signed-off-by: Babu Moger <[email protected]>
> >> Er, who is the author if this patch? If you are reposting Stephane's
> >> patch then there needs to be a 'From: ' at the top of the email so
> >> that git preserves the ownership. You
> >
> > I can add Stephane's name in "From" if it is right thing to do. But
> > this patch was modified from the original version Stephane posted.
> >
> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore
> > .kernel.org%2Flkml%2F20220517001234.3137157-1-
> eranian%40google.com%2F&
> >
> amp;data=05%7C01%7Cbabu.moger%40amd.com%7C69a28ad3fcfe444404a50
> 8da97fb
> >
> 82ee%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C6379894037788
> 16201%7
> >
> CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI
> 6Ik1
> >
> haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=OnLKz6cBiypVvv1x
> 8PSv1JUz
> > ugilG1Gpwgkcz55ydqI%3D&amp;reserved=0
>
> True, but also please consider what Stephane posted originally:
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kern
> el.org%2Flkml%2F20220516055055.2734840-1-
> eranian%40google.com%2F&amp;data=05%7C01%7Cbabu.moger%40amd.com
> %7C69a28ad3fcfe444404a508da97fb82ee%7C3dd8961fe4884e608e11a82d994
> e183d%7C0%7C0%7C637989403778816201%7CUnknown%7CTWFpbGZsb3d8ey
> JWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7
> C3000%7C%7C%7C&amp;sdata=2mTjDYC9B%2Fr6HR6SlwToWXewsWub2rfPQp
> c9JIkETus%3D&amp;reserved=0
>
> A "Co-developed-by" may help with attribution:
>
> Co-developed-by: Stephane Eranian <[email protected]>
> Signed-off-by: Stephane Eranian <[email protected]>
> Signed-off-by: Babu Moger <[email protected]>
>
Yes, Sounds good to me.
Thanks
Babu

2022-09-16 19:59:15

by Moger, Babu

[permalink] [raw]
Subject: RE: [PATCH v4 03/13] x86/cpufeatures: Add Slow Memory Bandwidth Allocation feature flag

[AMD Official Use Only - General]

Hi Reinette,

> -----Original Message-----
> From: Reinette Chatre <[email protected]>
> Sent: Friday, September 16, 2022 10:54 AM
> To: Moger, Babu <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected]
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; Das1, Sandipan
> <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [PATCH v4 03/13] x86/cpufeatures: Add Slow Memory Bandwidth
> Allocation feature flag
>
> Hi Babu,
>
> On 9/7/2022 11:00 AM, Babu Moger wrote:
> > Adds the new AMD feature X86_FEATURE_SMBA. With this feature, the QOS
>
> Adds -> Add

Sure

>
> > enforcement policies can be applied to external slow memory connected
> > to the host. QOS enforcement is accomplished by assigning a Class Of
> > Service (COS) to a processor and specifying allocations or limits for
> > that COS for each resource to be allocated.
> >
> > This feature is identified by the CPUID Function 8000_0020_EBX_x0.
> >
> > CPUID Fn8000_0020_EBX_x0 AMD Bandwidth Enforcement Feature
> Identifiers (ECX=0)
> > Bits Field Name Description
> > 2 L3SBE L3 external slow memory bandwidth enforcement
> >
> > Currently, CXL.memory is the only supported "slow" memory device. With
> > the support of SMBA feature the hardware enables bandwidth allocation
> > on the slow memory devices. If there are multiple slow memory devices
> > in the system, then the throttling logic groups all the slow sources
> > together and applies the limit on them as a whole.
>
> The above is a useful addition (made in patch 13/13) to the documentation ...
>
> >
> > The presence of the SMBA feature(with CXL.memory) is independent of
> > whether slow memory device is actually present in the system. If there
> > is no slow memory in the system, then setting a SMBA limit will have
> > no impact on the performance of the system.
>
> ... could the above snippet also please be added to the documentation?
>

Ok Sure.

> >
> > Presence of CXL memory can be identified by numactl command.
> >
> > $numactl -H
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 node 0 size:
> > 63678 MB node 0 free: 59542 MB node 1 cpus:
> > node 1 size: 16122 MB
> > node 1 free: 15627 MB
> > node distances:
> > node 0 1
> > 0: 10 50
> > 1: 50 10
> >
> > CPU list for CXL memory will be emply. The cpu-cxl node distance is
> > greater
>
> emply -> empty?

ok
>
> > than cpu-to-cpu distances. Node 1 has the CXL memory in this case. CXL
> > memory can also be identified using ACPI SRAT table and memory maps.
> >
> > Feature description is available in the specification, "AMD64
> > Technology Platform Quality of Service Extensions, Revision: 1.03 Publication
> # 56375 Revision: 1.03 Issue Date: February 2022".
>
> Please shorten these lines to the recommended 75 chars per line.

Sure
>
> >
> > Link:
> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.
> > amd.com%2Fen%2Fsupport%2Ftech-docs%2Famd64-technology-platform-
> quality
> > -service-
> extensions&amp;data=05%7C01%7Cbabu.moger%40amd.com%7C60553f32
> >
> e9ab4ed1219c08da97fbc048%7C3dd8961fe4884e608e11a82d994e183d%7C0%
> 7C0%7C
> >
> 637989404770663129%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMD
> AiLCJQIjo
> >
> iV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdat
> a=WY4H
> > EzWHdMpMUUR%2FBnBupwdHuQ6O2RfdrfcGx4TkXfI%3D&amp;reserved=0
> > Link:
> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugz
> >
> illa.kernel.org%2Fshow_bug.cgi%3Fid%3D206537&amp;data=05%7C01%7Cbab
> u.m
> >
> oger%40amd.com%7C60553f32e9ab4ed1219c08da97fbc048%7C3dd8961fe488
> 4e608e
> >
> 11a82d994e183d%7C0%7C0%7C637989404770663129%7CUnknown%7CTWFpb
> GZsb3d8ey
> >
> JWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7
> C300
> >
> 0%7C%7C%7C&amp;sdata=7e7sGH0iaEW8mlst5mK3fn9wy%2FYRhDU%2BbBm
> PWzSrGL4%3
> > D&amp;reserved=0
> > Signed-off-by: Babu Moger <[email protected]>
> > ---
> > arch/x86/include/asm/cpufeatures.h | 1 +
> > arch/x86/kernel/cpu/scattered.c | 1 +
> > 2 files changed, 2 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/cpufeatures.h
> > b/arch/x86/include/asm/cpufeatures.h
> > index 235dc85c91c3..1815435c9c88 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -304,6 +304,7 @@
> > #define X86_FEATURE_UNRET (11*32+15) /* "" AMD BTB
> untrain return */
> > #define X86_FEATURE_USE_IBPB_FW (11*32+16) /* "" Use IBPB
> during runtime firmware calls */
> > #define X86_FEATURE_RSB_VMEXIT_LITE (11*32+17) /* "" Fill RSB on
> VM exit when EIBRS is enabled */
> > +#define X86_FEATURE_SMBA (11*32+18) /* SLOW Memory
> Bandwidth Allocation */
>
> Why is "SLOW" in all caps?

Will correct it.

>
> >
> > /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
> > #define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI
> instructions */
> > diff --git a/arch/x86/kernel/cpu/scattered.c
> > b/arch/x86/kernel/cpu/scattered.c index fd44b54c90d5..885ecf46abb2
> > 100644
> > --- a/arch/x86/kernel/cpu/scattered.c
> > +++ b/arch/x86/kernel/cpu/scattered.c
> > @@ -44,6 +44,7 @@ static const struct cpuid_bit cpuid_bits[] = {
> > { X86_FEATURE_CPB, CPUID_EDX, 9, 0x80000007, 0 },
> > { X86_FEATURE_PROC_FEEDBACK, CPUID_EDX, 11, 0x80000007, 0 },
> > { X86_FEATURE_MBA, CPUID_EBX, 6, 0x80000008, 0 },
> > + { X86_FEATURE_SMBA, CPUID_EBX, 2, 0x80000020, 0 },
> > { X86_FEATURE_PERFMON_V2, CPUID_EAX, 0, 0x80000022, 0 },
> > { 0, 0, 0, 0, 0 }
> > };
> >
> >
>
> Could you please follow the coding style (wrt tabs vs. spaces) of the area you
> are contributing to here? Please do so in all patches in this series.

Sure. Thanks
Babu Moger

2022-09-16 20:14:09

by Moger, Babu

[permalink] [raw]
Subject: RE: [PATCH v4 02/13] x86/resctrl: Remove arch_has_empty_bitmaps

[AMD Official Use Only - General]



> -----Original Message-----
> From: Reinette Chatre <[email protected]>
> Sent: Friday, September 16, 2022 10:54 AM
> To: Moger, Babu <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected]
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; Das1, Sandipan
> <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [PATCH v4 02/13] x86/resctrl: Remove arch_has_empty_bitmaps
>
> Hi Babu,
>
> On 9/7/2022 11:00 AM, Babu Moger wrote:
> > The field arch_has_empty_bitmaps is not required anymore. The field
> > min_cbm_bits is enough to validate the CBM (capacity bit mask) if the
> > architecture can support the zero CBM or not.
> >
> > Suggested-by: Reinette Chatre <[email protected]>
> > Signed-off-by: Babu Moger <[email protected]>
> > ---
>
> Thank you.
>
> Reviewed-by: Reinette Chatre <[email protected]>

Thank You.
Babu Moger

2022-09-16 21:02:56

by Moger, Babu

[permalink] [raw]
Subject: RE: [PATCH v4 06/13] x86/resctrl: Include new features in command line options

[AMD Official Use Only - General]



> -----Original Message-----
> From: Reinette Chatre <[email protected]>
> Sent: Friday, September 16, 2022 10:55 AM
> To: Moger, Babu <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected]
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; Das1, Sandipan
> <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [PATCH v4 06/13] x86/resctrl: Include new features in command
> line options
>
> Hi Babu,
>
> On 9/7/2022 11:00 AM, Babu Moger wrote:
> > Add the command line options to disable the new features.
> > smba : Slow Memory Bandwidth Allocation mbec : Bandwidth Monitor Event
> > Configuration.
>
> mbec -> bmec ?

Yes. Thanks
Babu

2022-09-19 16:00:44

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH v4 09/13] x86/resctrl: Add sysfs interface files to read/write event configuration

Hi Reinette,

On 9/16/22 10:58, Reinette Chatre wrote:
> Hi Babu,
>
> On 9/7/2022 11:01 AM, Babu Moger wrote:
>> Add two new sysfs files to read/write the event configuration if
>> the feature Bandwidth Monitoring Event Configuration (BMEC) is
>> supported. The file mbm_local_config is for the configuration
>> of the event mbm_local_bytes and the file mbm_total_config is
>> for the configuration of mbm_total_bytes.
>>
>> $ls /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local*
>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_config
>>
>> $ls /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total*
>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_config
>>
> This patch makes the mbm*config files per monitor group. Looking
> ahead at later patches how the configuration is set it is not clear
> to me that this is the right place for these configuration files.
>
> Looking ahead to patch 10 there is neither rmid nor closid within
> the (MSR_IA32_EVT_CFG_BASE + index) register - it only takes
> the bits indicating what access types needs to be counted. Also
> in patch 10 I understand that the scope of this register is per L3 cache
> domain.
Yes. Scope of  MSR_IA32_EVT_CFG_BASE per L3 domain.
>
> Considering this, why is the sysfs file associated with each
> monitor group?
Please see the response below.
>
> For example, consider the following scenario:
> # cd /sys/fs/resctrl
> # mkdir g2
> # mkdir mon_groups/m1
> # mkdir mon_groups/m2
> # find . | grep mbm_local_config
> ./mon_data/mon_L3_00/mbm_local_config
> ./mon_data/mon_L3_01/mbm_local_config
> ./g2/mon_data/mon_L3_00/mbm_local_config
> ./g2/mon_data/mon_L3_01/mbm_local_config
> ./mon_groups/m2/mon_data/mon_L3_00/mbm_local_config
> ./mon_groups/m2/mon_data/mon_L3_01/mbm_local_config
> ./mon_groups/m1/mon_data/mon_L3_00/mbm_local_config
> ./mon_groups/m1/mon_data/mon_L3_01/mbm_local_config
>
>
> From what I understand, the following sysfs files are
> associated with cache domain #0 and thus writing to any of these
> files would change the same configuration:
> ./mon_data/mon_L3_00/mbm_local_config
> ./g2/mon_data/mon_L3_00/mbm_local_config
> ./mon_groups/m2/mon_data/mon_L3_00/mbm_local_config
> ./mon_groups/m1/mon_data/mon_L3_00/mbm_local_config
>
> Could you please correct me where I am wrong?

For example, we have CPUs 0-7 in domain 0. We have two counters which are
configurable.

Lets consider same example as your mentioned about.

g2 is a control group.

m1 and m2 are monitor group.

We can have control group g2 with CPUs 0-7 to limit the L3 bandwidth (or
memory bandwidth with required schemata setting).

We can have mon group m1 with cpus 0-3 to monitor mbm_local_bytes.

We can have mon group m2 with cpus  4-7 to monitor mbm_total_bytes.

Each group is independently, monitoring two separate thing. Without having
sysfs file (mbm_local_config and mbm_total_config) in each monitor group,
we wont be able to configure the above configuration.


>
>> Signed-off-by: Babu Moger <[email protected]>
>> ---
>> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 40 ++++++++++++++++++++++++--------
>> 1 file changed, 30 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index f55a693fa958..da11fdad204d 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -254,6 +254,10 @@ static const struct kernfs_ops kf_mondata_ops = {
>> .seq_show = rdtgroup_mondata_show,
>> };
>>
>> +static const struct kernfs_ops kf_mondata_config_ops = {
>> + .atomic_write_len = PAGE_SIZE,
>> +};
>> +
> Please use coding style (tabs vs spaces) that is consistent with area
> you are contributing to.
Sure
>
>> static bool is_cpu_list(struct kernfs_open_file *of)
>> {
>> struct rftype *rft = of->kn->priv;
>> @@ -2478,24 +2482,40 @@ static struct file_system_type rdt_fs_type = {
>> .kill_sb = rdt_kill_sb,
>> };
>>
>> -static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
>> +static int mon_addfile(struct kernfs_node *parent_kn, struct mon_evt *mevt,
>> void *priv)
>> {
>> - struct kernfs_node *kn;
>> + struct kernfs_node *kn_evt, *kn_evt_config;
>> int ret = 0;
>>
>> - kn = __kernfs_create_file(parent_kn, name, 0444,
>> - GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
>> - &kf_mondata_ops, priv, NULL, NULL);
>> - if (IS_ERR(kn))
>> - return PTR_ERR(kn);
>> + kn_evt = __kernfs_create_file(parent_kn, mevt->name, 0444,
>> + GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
>> + &kf_mondata_ops, priv, NULL, NULL);
> Please run your series through checkpatch (alignment issue above)
Sure
>
>> + if (IS_ERR(kn_evt))
>> + return PTR_ERR(kn_evt);
>>
>> - ret = rdtgroup_kn_set_ugid(kn);
>> + ret = rdtgroup_kn_set_ugid(kn_evt);
>> if (ret) {
>> - kernfs_remove(kn);
>> + kernfs_remove(kn_evt);
>> return ret;
>> }
>>
>> + if (mevt->configurable) {
>> + kn_evt_config = __kernfs_create_file(parent_kn,
>> + mevt->config_name, 0644,
>> + GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
>> + &kf_mondata_config_ops, priv, NULL, NULL);
>> + if (IS_ERR(kn_evt_config))
>> + return PTR_ERR(kn_evt_config);
>> +
> Since an error is returned here it seems that some cleanup (kn_evt) is missing?

Yes. That is correct.  Will fix it.

Thanks

Babu

>
>
>> + ret = rdtgroup_kn_set_ugid(kn_evt_config);
>> + if (ret) {
>> + kernfs_remove(kn_evt_config);
>> + kernfs_remove(kn_evt);
>> + return ret;
>> + }
>> + }
>> +
>> return ret;
>> }
>>
> Reinette

--
Thanks
Babu Moger

2022-09-19 17:41:18

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 09/13] x86/resctrl: Add sysfs interface files to read/write event configuration

Hi Babu,

On 9/19/2022 8:46 AM, Moger, Babu wrote:
> Hi Reinette,
>
> On 9/16/22 10:58, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 9/7/2022 11:01 AM, Babu Moger wrote:
>>> Add two new sysfs files to read/write the event configuration if
>>> the feature Bandwidth Monitoring Event Configuration (BMEC) is
>>> supported. The file mbm_local_config is for the configuration
>>> of the event mbm_local_bytes and the file mbm_total_config is
>>> for the configuration of mbm_total_bytes.
>>>
>>> $ls /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local*
>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_config
>>>
>>> $ls /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total*
>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_config
>>>
>> This patch makes the mbm*config files per monitor group. Looking
>> ahead at later patches how the configuration is set it is not clear
>> to me that this is the right place for these configuration files.
>>
>> Looking ahead to patch 10 there is neither rmid nor closid within
>> the (MSR_IA32_EVT_CFG_BASE + index) register - it only takes
>> the bits indicating what access types needs to be counted. Also
>> in patch 10 I understand that the scope of this register is per L3 cache
>> domain.
> Yes. Scope of  MSR_IA32_EVT_CFG_BASE per L3 domain.
>>
>> Considering this, why is the sysfs file associated with each
>> monitor group?
> Please see the response below.
>>
>> For example, consider the following scenario:
>> # cd /sys/fs/resctrl
>> # mkdir g2
>> # mkdir mon_groups/m1
>> # mkdir mon_groups/m2
>> # find . | grep mbm_local_config
>> ./mon_data/mon_L3_00/mbm_local_config
>> ./mon_data/mon_L3_01/mbm_local_config
>> ./g2/mon_data/mon_L3_00/mbm_local_config
>> ./g2/mon_data/mon_L3_01/mbm_local_config
>> ./mon_groups/m2/mon_data/mon_L3_00/mbm_local_config
>> ./mon_groups/m2/mon_data/mon_L3_01/mbm_local_config
>> ./mon_groups/m1/mon_data/mon_L3_00/mbm_local_config
>> ./mon_groups/m1/mon_data/mon_L3_01/mbm_local_config
>>
>>
>> From what I understand, the following sysfs files are
>> associated with cache domain #0 and thus writing to any of these
>> files would change the same configuration:
>> ./mon_data/mon_L3_00/mbm_local_config
>> ./g2/mon_data/mon_L3_00/mbm_local_config
>> ./mon_groups/m2/mon_data/mon_L3_00/mbm_local_config
>> ./mon_groups/m1/mon_data/mon_L3_00/mbm_local_config
>>
>> Could you please correct me where I am wrong?
>
> For example, we have CPUs 0-7 in domain 0. We have two counters which are
> configurable.
>
> Lets consider same example as your mentioned about.
>
> g2 is a control group.
>
> m1 and m2 are monitor group.
>
> We can have control group g2 with CPUs 0-7 to limit the L3 bandwidth (or
> memory bandwidth with required schemata setting).
>
> We can have mon group m1 with cpus 0-3 to monitor mbm_local_bytes.
>
> We can have mon group m2 with cpus  4-7 to monitor mbm_total_bytes.
>
> Each group is independently, monitoring two separate thing. Without having

Right, because monitoring, the actual counting of the events, is per monitor
group. When a monitor group is created a new RMID is created and when the
counter is read it is per-RMID.

The event configuration is independent from the RMID using the counter.

> sysfs file (mbm_local_config and mbm_total_config) in each monitor group,
> we wont be able to configure the above configuration.

I do not understand this reasoning. From what I understand the
event configuration is independent from the monitoring group. Thus, changing
an event configuration for one monitoring group would impact all
monitoring groups using that event counter. This implementation associates
an event configuration with each monitoring group and by doing so it
implies that it is unique to the monitoring group, but that is not
how it works.

Reinette








2022-09-19 18:09:11

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH v4 11/13] x86/resctrl: Add sysfs interface to write the event configuration


On 9/16/22 11:17, Reinette Chatre wrote:
> Hi Babu,
>
> On 9/7/2022 11:01 AM, Babu Moger wrote:
>
> ...
>
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 6f067c1ac7c1..59b484eb1267 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -330,9 +330,121 @@ int rdtgroup_mondata_config_show(struct seq_file *m, void *arg)
>> return ret;
>> }
>>
>> +/*
>> + * This is called via IPI to read the CQM/MBM counters
>> + * in a domain.
> copy&paste from previous patch?
Will correct it.
>
>> + */
>> +void mon_event_config_write(void *info)
>> +{
>> + struct mon_config_info *mon_info = info;
>> + u32 msr_index;
>> +
>> + switch (mon_info->evtid) {
>> + case QOS_L3_MBM_TOTAL_EVENT_ID:
>> + msr_index = 0;
>> + break;
>> + case QOS_L3_MBM_LOCAL_EVENT_ID:
>> + msr_index = 1;
>> + break;
>> + default:
>> + /* Not expected to come here */
>> + return;
>> + }
>> +
>> + wrmsr(MSR_IA32_EVT_CFG_BASE + msr_index, mon_info->mon_config, 0);
>> +}
>> +
>> +ssize_t rdtgroup_mondata_config_write(struct kernfs_open_file *of,
>> + char *buf, size_t nbytes, loff_t off)
>> +{
>> + struct mon_config_info mon_info;
>> + struct rdt_hw_resource *hw_res;
>> + struct rdtgroup *rdtgrp;
>> + struct rdt_resource *r;
>> + unsigned int mon_config;
>> + cpumask_var_t cpu_mask;
>> + union mon_data_bits md;
>> + struct rdt_domain *d;
>> + u32 resid, domid;
>> + int ret = 0, cpu;
>> +
>> + ret = kstrtouint(buf, 0, &mon_config);
>> + if (ret)
>> + return ret;
>> +
>> + rdt_last_cmd_clear();
>> +
>> + /* mon_config cannot be more than the supported set of events */
>> + if (mon_config > MAX_EVT_CONFIG_BITS) {
>> + rdt_last_cmd_puts("Invalid event configuration\n");
>> + return -EINVAL;
>> + }
>> +
>> + cpus_read_lock();
>> + rdtgrp = rdtgroup_kn_lock_live(of->kn);
>> + if (!rdtgrp) {
>> + return -ENOENT;
>> + goto e_unlock;
>> + }
>> +
>> + if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) {
>> + ret = -ENOMEM;
>> + goto e_unlock;
>> + }
>> +
>> + md.priv = of->kn->priv;
>> + resid = md.u.rid;
>> + domid = md.u.domid;
>> +
>> + hw_res = &rdt_resources_all[resid];
>> + r = &hw_res->r_resctrl;
>> + d = rdt_find_domain(r, domid, NULL);
>> + if (IS_ERR_OR_NULL(d)) {
>> + ret = -ENOENT;
>> + goto e_cpumask;
>> + }
>> +
>> + /*
>> + * Read the current config value first. If both are same
>> + * then we dont need to write again
>> + */
>> + mon_info.evtid = md.u.evtid;
>> + mondata_config_read(d, &mon_info);
>> + if (mon_info.mon_config == mon_config)
>> + goto e_cpumask;
>> +
>> + mon_info.mon_config = mon_config;
>> +
>> + /* Pick all the CPUs in the domain instance */
>> + for_each_cpu(cpu, &d->cpu_mask)
>> + cpumask_set_cpu(cpu, cpu_mask);
>> +
>> + /* Update MSR_IA32_EVT_CFG_BASE MSR on all the CPUs in cpu_mask */
>> + on_each_cpu_mask(cpu_mask, mon_event_config_write, &mon_info, 1);
> If this is required then could you please add a comment why every CPU in
> the domain needs to be updated? Until now configuration changes
> only needed to be made on one CPU per domain.
> Even in the previous patch when the user reads the current configuration
> value it is only done on one CPU in the domain ... to me that implies
> that the scope is per domain and only one CPU in the domain needs to be
> changed.

Yes, This register is domain specific. However the hardware team
recommends it to update on all the CPU threads. It is not clear in the
specs right now.

I have asked them to make that clear in the specs next time.

>
>> +
>> + /*
>> + * When an Event Configuration is changed, the bandwidth counters
>> + * for all RMIDs and Events will be cleared, and the U-bit for every
>> + * RMID will be set on the next read to any BwEvent for every RMID.
>> + * Clear the mbm_local and mbm_total counts for all the RMIDs.
>> + */
> This is a snippet that was copied from the hardware spec and since it is
> inserted into the kernel driver the context makes it hard to understand.
> Could it be translated into what it means in this context?
> Perhaps something like:
>
> /*
> * When an Event Configuration is changed, the bandwidth counters
> * for all RMIDs and Events will be cleared by the hardware. The
> * hardware also sets MSR_IA32_QM_CTR.Unavailable (bit 62) for every
> * RMID on the next read to any event for every RMID. Subsequent
> * reads will have MSR_IA32_QM_CTR.Unavailable (bit 62) cleared
> * while it is tracked by the hardware.
> * Clear the mbm_local and mbm_total counts for all the RMIDs.
> */
>
> Please fixup where I got it wrong.

Looks good.

Thanks Babu

>
>> + memset(d->mbm_local, 0, sizeof(struct mbm_state) * r->num_rmid);
>> + memset(d->mbm_total, 0, sizeof(struct mbm_state) * r->num_rmid);
>> +
> Reinette

--
Thanks
Babu Moger

2022-09-19 21:22:23

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH v4 09/13] x86/resctrl: Add sysfs interface files to read/write event configuration

Hi Reinette,

On 9/19/22 11:42, Reinette Chatre wrote:
> Hi Babu,
>
> On 9/19/2022 8:46 AM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 9/16/22 10:58, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 9/7/2022 11:01 AM, Babu Moger wrote:
>>>> Add two new sysfs files to read/write the event configuration if
>>>> the feature Bandwidth Monitoring Event Configuration (BMEC) is
>>>> supported. The file mbm_local_config is for the configuration
>>>> of the event mbm_local_bytes and the file mbm_total_config is
>>>> for the configuration of mbm_total_bytes.
>>>>
>>>> $ls /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local*
>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_config
>>>>
>>>> $ls /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total*
>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_config
>>>>
>>> This patch makes the mbm*config files per monitor group. Looking
>>> ahead at later patches how the configuration is set it is not clear
>>> to me that this is the right place for these configuration files.
>>>
>>> Looking ahead to patch 10 there is neither rmid nor closid within
>>> the (MSR_IA32_EVT_CFG_BASE + index) register - it only takes
>>> the bits indicating what access types needs to be counted. Also
>>> in patch 10 I understand that the scope of this register is per L3 cache
>>> domain.
>> Yes. Scope of  MSR_IA32_EVT_CFG_BASE per L3 domain.
>>> Considering this, why is the sysfs file associated with each
>>> monitor group?
>> Please see the response below.
>>> For example, consider the following scenario:
>>> # cd /sys/fs/resctrl
>>> # mkdir g2
>>> # mkdir mon_groups/m1
>>> # mkdir mon_groups/m2
>>> # find . | grep mbm_local_config
>>> ./mon_data/mon_L3_00/mbm_local_config
>>> ./mon_data/mon_L3_01/mbm_local_config
>>> ./g2/mon_data/mon_L3_00/mbm_local_config
>>> ./g2/mon_data/mon_L3_01/mbm_local_config
>>> ./mon_groups/m2/mon_data/mon_L3_00/mbm_local_config
>>> ./mon_groups/m2/mon_data/mon_L3_01/mbm_local_config
>>> ./mon_groups/m1/mon_data/mon_L3_00/mbm_local_config
>>> ./mon_groups/m1/mon_data/mon_L3_01/mbm_local_config
>>>
>>>
>>> From what I understand, the following sysfs files are
>>> associated with cache domain #0 and thus writing to any of these
>>> files would change the same configuration:
>>> ./mon_data/mon_L3_00/mbm_local_config
>>> ./g2/mon_data/mon_L3_00/mbm_local_config
>>> ./mon_groups/m2/mon_data/mon_L3_00/mbm_local_config
>>> ./mon_groups/m1/mon_data/mon_L3_00/mbm_local_config
>>>
>>> Could you please correct me where I am wrong?
>> For example, we have CPUs 0-7 in domain 0. We have two counters which are
>> configurable.
>>
>> Lets consider same example as your mentioned about.
>>
>> g2 is a control group.
>>
>> m1 and m2 are monitor group.
>>
>> We can have control group g2 with CPUs 0-7 to limit the L3 bandwidth (or
>> memory bandwidth with required schemata setting).
>>
>> We can have mon group m1 with cpus 0-3 to monitor mbm_local_bytes.
>>
>> We can have mon group m2 with cpus  4-7 to monitor mbm_total_bytes.
>>
>> Each group is independently, monitoring two separate thing. Without having
> Right, because monitoring, the actual counting of the events, is per monitor
> group. When a monitor group is created a new RMID is created and when the
> counter is read it is per-RMID.
>
> The event configuration is independent from the RMID using the counter.
>
>> sysfs file (mbm_local_config and mbm_total_config) in each monitor group,
>> we wont be able to configure the above configuration.
> I do not understand this reasoning. From what I understand the
> event configuration is independent from the monitoring group. Thus, changing
> an event configuration for one monitoring group would impact all
> monitoring groups using that event counter. This implementation associates
> an event configuration with each monitoring group and by doing so it
> implies that it is unique to the monitoring group, but that is not
> how it works.

The event configuration is designed per L3 domain. The mon_data is also
per domain (like mon_L3_00.. mon_L3_01 etc). So, added the event
configuration file inside each domain. We have all the information inside
the domain. Thought, that is right place. I am open for suggestions.

Thanks

Babu


2022-09-19 21:24:59

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 09/13] x86/resctrl: Add sysfs interface files to read/write event configuration

Hi Babu,

On 9/19/2022 1:26 PM, Moger, Babu wrote:
> On 9/19/22 11:42, Reinette Chatre wrote:
>> On 9/19/2022 8:46 AM, Moger, Babu wrote:
>>> On 9/16/22 10:58, Reinette Chatre wrote:
>>>> On 9/7/2022 11:01 AM, Babu Moger wrote:
>>>>> Add two new sysfs files to read/write the event configuration if
>>>>> the feature Bandwidth Monitoring Event Configuration (BMEC) is
>>>>> supported. The file mbm_local_config is for the configuration
>>>>> of the event mbm_local_bytes and the file mbm_total_config is
>>>>> for the configuration of mbm_total_bytes.
>>>>>
>>>>> $ls /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local*
>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_config
>>>>>
>>>>> $ls /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total*
>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_config
>>>>>
>>>> This patch makes the mbm*config files per monitor group. Looking
>>>> ahead at later patches how the configuration is set it is not clear
>>>> to me that this is the right place for these configuration files.
>>>>
>>>> Looking ahead to patch 10 there is neither rmid nor closid within
>>>> the (MSR_IA32_EVT_CFG_BASE + index) register - it only takes
>>>> the bits indicating what access types needs to be counted. Also
>>>> in patch 10 I understand that the scope of this register is per L3 cache
>>>> domain.
>>> Yes. Scope of  MSR_IA32_EVT_CFG_BASE per L3 domain.
>>>> Considering this, why is the sysfs file associated with each
>>>> monitor group?
>>> Please see the response below.
>>>> For example, consider the following scenario:
>>>> # cd /sys/fs/resctrl
>>>> # mkdir g2
>>>> # mkdir mon_groups/m1
>>>> # mkdir mon_groups/m2
>>>> # find . | grep mbm_local_config
>>>> ./mon_data/mon_L3_00/mbm_local_config
>>>> ./mon_data/mon_L3_01/mbm_local_config
>>>> ./g2/mon_data/mon_L3_00/mbm_local_config
>>>> ./g2/mon_data/mon_L3_01/mbm_local_config
>>>> ./mon_groups/m2/mon_data/mon_L3_00/mbm_local_config
>>>> ./mon_groups/m2/mon_data/mon_L3_01/mbm_local_config
>>>> ./mon_groups/m1/mon_data/mon_L3_00/mbm_local_config
>>>> ./mon_groups/m1/mon_data/mon_L3_01/mbm_local_config
>>>>
>>>>
>>>> From what I understand, the following sysfs files are
>>>> associated with cache domain #0 and thus writing to any of these
>>>> files would change the same configuration:
>>>> ./mon_data/mon_L3_00/mbm_local_config
>>>> ./g2/mon_data/mon_L3_00/mbm_local_config
>>>> ./mon_groups/m2/mon_data/mon_L3_00/mbm_local_config
>>>> ./mon_groups/m1/mon_data/mon_L3_00/mbm_local_config
>>>>
>>>> Could you please correct me where I am wrong?
>>> For example, we have CPUs 0-7 in domain 0. We have two counters which are
>>> configurable.
>>>
>>> Lets consider same example as your mentioned about.
>>>
>>> g2 is a control group.
>>>
>>> m1 and m2 are monitor group.
>>>
>>> We can have control group g2 with CPUs 0-7 to limit the L3 bandwidth (or
>>> memory bandwidth with required schemata setting).
>>>
>>> We can have mon group m1 with cpus 0-3 to monitor mbm_local_bytes.
>>>
>>> We can have mon group m2 with cpus  4-7 to monitor mbm_total_bytes.
>>>
>>> Each group is independently, monitoring two separate thing. Without having
>> Right, because monitoring, the actual counting of the events, is per monitor
>> group. When a monitor group is created a new RMID is created and when the
>> counter is read it is per-RMID.
>>
>> The event configuration is independent from the RMID using the counter.
>>
>>> sysfs file (mbm_local_config and mbm_total_config) in each monitor group,
>>> we wont be able to configure the above configuration.
>> I do not understand this reasoning. From what I understand the
>> event configuration is independent from the monitoring group. Thus, changing
>> an event configuration for one monitoring group would impact all
>> monitoring groups using that event counter. This implementation associates
>> an event configuration with each monitoring group and by doing so it
>> implies that it is unique to the monitoring group, but that is not
>> how it works.
>
> The event configuration is designed per L3 domain. The mon_data is also
> per domain (like mon_L3_00.. mon_L3_01 etc). So, added the event
> configuration file inside each domain. We have all the information inside
> the domain. Thought, that is right place. I am open for suggestions.

It is not clear to me if you are also seeing all the duplication that
accompanies this implementation. As you can see in the example I provided in
https://lore.kernel.org/lkml/[email protected]/,
if I understand the implementation correctly, there will be several
configuration files scattered through resctrl that all configure the same
value. I asked you to correct me where I am wrong but you did not correct me.
Instead you keep repeating that placing the files in the duplicate locations
is convenient. I can see how this is convenient for you but please do consider
that having these duplicate configuration files scattered through resctrl
makes for a very confusing user interface and unexpected behavior. Users
would expect that a configuration associated with a monitor group impacts
that monitor group only - not all monitor groups associated with that
domain.

User API is hard so this does need careful thought. Perhaps the architects
can chime in here.

One option could be:
# cd /sys/fs/resctrl/info/L3_MON
# cat mbm_total_config
0=7f;1=7f
# cat mbm_local_config
0=15;1=15

It would be clear when changing mem_total_config or mbm_local_config that
it would impact all monitoring groups within all resource groups. What do
you think?

Reinette

2022-09-20 17:24:54

by Moger, Babu

[permalink] [raw]
Subject: RE: [PATCH v4 09/13] x86/resctrl: Add sysfs interface files to read/write event configuration

[AMD Official Use Only - General]

Hi Reinette,

> -----Original Message-----
> From: Reinette Chatre <[email protected]>
> Sent: Monday, September 19, 2022 4:07 PM
> To: Moger, Babu <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected]
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; Das1, Sandipan
> <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [PATCH v4 09/13] x86/resctrl: Add sysfs interface files to
> read/write event configuration
>
> Hi Babu,
>
> On 9/19/2022 1:26 PM, Moger, Babu wrote:
> > On 9/19/22 11:42, Reinette Chatre wrote:
> >> On 9/19/2022 8:46 AM, Moger, Babu wrote:
> >>> On 9/16/22 10:58, Reinette Chatre wrote:
> >>>> On 9/7/2022 11:01 AM, Babu Moger wrote:
> >>>>> Add two new sysfs files to read/write the event configuration if
> >>>>> the feature Bandwidth Monitoring Event Configuration (BMEC) is
> >>>>> supported. The file mbm_local_config is for the configuration of
> >>>>> the event mbm_local_bytes and the file mbm_total_config is for the
> >>>>> configuration of mbm_total_bytes.
> >>>>>
> >>>>> $ls /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local*
> >>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> >>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_config
> >>>>>
> >>>>> $ls /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total*
> >>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> >>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_config
> >>>>>
> >>>> This patch makes the mbm*config files per monitor group. Looking
> >>>> ahead at later patches how the configuration is set it is not clear
> >>>> to me that this is the right place for these configuration files.
> >>>>
> >>>> Looking ahead to patch 10 there is neither rmid nor closid within
> >>>> the (MSR_IA32_EVT_CFG_BASE + index) register - it only takes the
> >>>> bits indicating what access types needs to be counted. Also in
> >>>> patch 10 I understand that the scope of this register is per L3
> >>>> cache domain.
> >>> Yes. Scope of? MSR_IA32_EVT_CFG_BASE per L3 domain.
> >>>> Considering this, why is the sysfs file associated with each
> >>>> monitor group?
> >>> Please see the response below.
> >>>> For example, consider the following scenario:
> >>>> # cd /sys/fs/resctrl
> >>>> # mkdir g2
> >>>> # mkdir mon_groups/m1
> >>>> # mkdir mon_groups/m2
> >>>> # find . | grep mbm_local_config
> >>>> ./mon_data/mon_L3_00/mbm_local_config
> >>>> ./mon_data/mon_L3_01/mbm_local_config
> >>>> ./g2/mon_data/mon_L3_00/mbm_local_config
> >>>> ./g2/mon_data/mon_L3_01/mbm_local_config
> >>>> ./mon_groups/m2/mon_data/mon_L3_00/mbm_local_config
> >>>> ./mon_groups/m2/mon_data/mon_L3_01/mbm_local_config
> >>>> ./mon_groups/m1/mon_data/mon_L3_00/mbm_local_config
> >>>> ./mon_groups/m1/mon_data/mon_L3_01/mbm_local_config
> >>>>
> >>>>
> >>>> From what I understand, the following sysfs files are associated
> >>>> with cache domain #0 and thus writing to any of these files would
> >>>> change the same configuration:
> >>>> ./mon_data/mon_L3_00/mbm_local_config
> >>>> ./g2/mon_data/mon_L3_00/mbm_local_config
> >>>> ./mon_groups/m2/mon_data/mon_L3_00/mbm_local_config
> >>>> ./mon_groups/m1/mon_data/mon_L3_00/mbm_local_config
> >>>>
> >>>> Could you please correct me where I am wrong?
> >>> For example, we have CPUs 0-7 in domain 0. We have two counters
> >>> which are configurable.
> >>>
> >>> Lets consider same example as your mentioned about.
> >>>
> >>> g2 is a control group.
> >>>
> >>> m1 and m2 are monitor group.
> >>>
> >>> We can have control group g2 with CPUs 0-7 to limit the L3 bandwidth
> >>> (or memory bandwidth with required schemata setting).
> >>>
> >>> We can have mon group m1 with cpus 0-3 to monitor mbm_local_bytes.
> >>>
> >>> We can have mon group m2 with cpus? 4-7 to monitor mbm_total_bytes.
> >>>
> >>> Each group is independently, monitoring two separate thing. Without
> >>> having
> >> Right, because monitoring, the actual counting of the events, is per
> >> monitor group. When a monitor group is created a new RMID is created
> >> and when the counter is read it is per-RMID.
> >>
> >> The event configuration is independent from the RMID using the counter.
> >>
> >>> sysfs file (mbm_local_config and mbm_total_config) in each monitor
> >>> group, we wont be able to configure the above configuration.
> >> I do not understand this reasoning. From what I understand the event
> >> configuration is independent from the monitoring group. Thus,
> >> changing an event configuration for one monitoring group would impact
> >> all monitoring groups using that event counter. This implementation
> >> associates an event configuration with each monitoring group and by
> >> doing so it implies that it is unique to the monitoring group, but
> >> that is not how it works.
> >
> > The event configuration is designed per L3 domain. The mon_data is
> > also per domain (like mon_L3_00.. mon_L3_01 etc). So, added the event
> > configuration file inside each domain. We have all the information
> > inside the domain. Thought, that is right place. I am open for suggestions.
>
> It is not clear to me if you are also seeing all the duplication that accompanies
> this implementation. As you can see in the example I provided in
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kern
> el.org%2Flkml%2F13294a8f-e76f-a6a9-284c-
> 67adbc80ec7c%40intel.com%2F&amp;data=05%7C01%7Cbabu.moger%40amd.
> com%7Cc22190a25ac044ec5f5408da9a82f5b7%7C3dd8961fe4884e608e11a82
> d994e183d%7C0%7C0%7C637992184504699692%7CUnknown%7CTWFpbGZsb3
> d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3
> D%7C3000%7C%7C%7C&amp;sdata=uPuGOlwaIgwJ9VnwNOS%2B4mUrqJnS65
> OdrEsEXtztUbU%3D&amp;reserved=0,
> if I understand the implementation correctly, there will be several
> configuration files scattered through resctrl that all configure the same value. I
> asked you to correct me where I am wrong but you did not correct me.
> Instead you keep repeating that placing the files in the duplicate locations is
> convenient. I can see how this is convenient for you but please do consider that
> having these duplicate configuration files scattered through resctrl makes for a
> very confusing user interface and unexpected behavior. Users would expect
> that a configuration associated with a monitor group impacts that monitor
> group only - not all monitor groups associated with that domain.
>
> User API is hard so this does need careful thought. Perhaps the architects can
> chime in here.
>
> One option could be:
> # cd /sys/fs/resctrl/info/L3_MON
> # cat mbm_total_config
> 0=7f;1=7f
> # cat mbm_local_config
> 0=15;1=15

I think this should work.
# cat mbm_total_config
0=7f;1=7f
I would think 0 and 1 are domain ids here.

We have to provide interface to write also.
#echo "0=0x70" > mbm_total_config (update mbm_total_config for domain 0)
#echo 1=0x10 > mbm_local_config (update mbm_local_config for domain 1)

We will have to parse the string and update the specific domains.

>
> It would be clear when changing mem_total_config or mbm_local_config that
> it would impact all monitoring groups within all resource groups. What do you
> think?

Yes. Thank you. It should work. As long and we have the ways to modify(and read) the specific L3 domains then it should be fine. Let me start on that. Will reply if I see any major issues.
Thanks
Babu