2023-07-13 16:54:51

by Tony Luck

Subject: [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems

There isn't a simple hardware enumeration to indicate to software that
a system is running with Sub-NUMA Clustering enabled.

Compare the number of NUMA nodes with the number of L3 caches to calculate
the number of Sub-NUMA nodes per L3 cache.
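
Roughly, with illustrative names (the real calculation is in
find_snc_ways() in patch 7 of this series, which also skips
memory-only NUMA nodes):

	snc_ways = num_numa_nodes_with_cpus / num_distinct_l3_caches;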

When Sub-NUMA clustering mode is enabled in BIOS setup, the RMID counters
are distributed equally between the SNC nodes within each socket.

E.g. if there are 400 RMID counters, and the system is configured with
two SNC nodes per socket, then RMID counter 0..199 are used on SNC node
0 on the socket, and RMID counter 200..399 on SNC node 1.

A model specific MSR (0xca0) can change the configuration of the RMIDs
when SNC mode is enabled.

The MSR controls the interpretation of the RMID field in the
IA32_PQR_ASSOC MSR so that the appropriate hardware counters within the
SNC node are updated. If reconfigured from default, RMIDs are divided
evenly across clusters.

Also initialize a per-cpu RMID offset value. Use this to calculate the
value to write to the IA32_QM_EVTSEL MSR when reading RMID event values.
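
In outline (these two lines are from patch 7 of this series):

	/* each CPU records which slice of the RMID space its SNC node owns */
	this_cpu_write(rmid_offset, (cpu_to_node(cpu) % snc_ways) * r->num_rmid);

	/* ... and every event read applies that offset */
	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + this_cpu_read(rmid_offset));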

N.B. this works well for well-behaved NUMA applications that access
memory predominantly from the local memory node. For applications that
access memory across multiple nodes it may be necessary for the user
to read counters for all SNC nodes on a socket and add the values to
get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
all that different from applications that span across multiple sockets
in a legacy system.

Signed-off-by: Tony Luck <[email protected]>

---

Changes since v2:

* Rebased to v6.5-rc1

Peter Newman: Found that I'd reversed the actions writing to the new
MSR to enable/disable RMID remapping for SNC mode.
* Fixed.

Peter Newman: Provided Reviewed-by: and Tested-by: tags

* Included in this series.

Randy Dunlap: Reported a run-on sentence in the documentation.

* Broke the sentence into two as suggested.

Shaopeng Tan: Reported that the CMT resctrl self-test failed

* Added extra patch to the series to make the resctrl test detect when SNC
mode is enabled and adjust effective cache size.

* I also patched the rdtgroup_cbm_to_size() function to adjust the cache
size reported in the "size" files in resctrl groups when SNC is active.

Shaopeng Tan: Noted the for_each_capable_rdt_resource() macro is no longer used.

* Deleted the definition of this macro.

Tony Luck (8):
x86/resctrl: Refactor in preparation for node-scoped resources
x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c
x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
x86/resctrl: Add code to setup monitoring at L3 or NODE scope.
x86/resctrl: Add package scoped resource
x86/resctrl: Update documentation with Sub-NUMA cluster changes
x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
selftests/resctrl: Adjust effective L3 cache size when SNC enabled

Documentation/arch/x86/resctrl.rst | 10 +-
include/linux/resctrl.h | 5 +-
arch/x86/include/asm/resctrl.h | 2 +
arch/x86/kernel/cpu/resctrl/internal.h | 20 ++-
tools/testing/selftests/resctrl/resctrl.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 154 ++++++++++++++++++--
arch/x86/kernel/cpu/resctrl/monitor.c | 24 +--
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 6 +-
tools/testing/selftests/resctrl/resctrlfs.c | 57 ++++++++
10 files changed, 248 insertions(+), 33 deletions(-)


base-commit: 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5
--
2.40.1



2023-07-13 16:56:54

by Tony Luck

Subject: [PATCH v3 4/8] x86/resctrl: Add code to setup monitoring at L3 or NODE scope.

When Sub-NUMA cluster is enabled (snc_ways > 1) use the RDT_RESOURCE_NODE
instead of RDT_RESOURCE_L3 for all monitoring operations.

The mon_scale and num_rmid values from CPUID(0xf,0x1),(EBX,ECX) must be
scaled down by the number of Sub-NUMA Clusters.

A subsequent change will detect sub-NUMA cluster mode and set
"snc_ways". For now set to one (meaning each L3 cache spans one
node).

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 7 +++++++
arch/x86/kernel/cpu/resctrl/core.c | 7 ++++++-
arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++--
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
4 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 243017096ddf..38bac0062c82 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -430,6 +430,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);

extern struct dentry *debugfs_resctrl;

+extern int snc_ways;
+
enum resctrl_res_level {
RDT_RESOURCE_L3,
RDT_RESOURCE_L2,
@@ -447,6 +449,11 @@ enum resctrl_scope {
SCOPE_NODE,
};

+static inline int get_mbm_res_level(void)
+{
+ return snc_ways > 1 ? RDT_RESOURCE_NODE : RDT_RESOURCE_L3;
+}
+
static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(res);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index e4bd3072927c..6fe9f87d4403 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,11 @@ int max_name_width, max_data_width;
*/
bool rdt_alloc_capable;

+/*
+ * How many Sub-NUMA Cluster nodes share a single L3 cache
+ */
+int snc_ways = 1;
+
static void
mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r);
@@ -831,7 +836,7 @@ static __init bool get_rdt_alloc_resources(void)

static __init bool get_rdt_mon_resources(void)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_resource *r = &rdt_resources_all[get_mbm_res_level()].r_resctrl;

if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 9be6ffdd01ae..da3f36212898 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -787,8 +787,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
int ret;

resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
- hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
- r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+ hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_ways;
+ r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_ways;
hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;

if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 418658f0a9ad..d037f3da9e55 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2524,7 +2524,7 @@ static int rdt_get_tree(struct fs_context *fc)
static_branch_enable_cpuslocked(&rdt_enable_key);

if (is_mbm_enabled()) {
- r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ r = &rdt_resources_all[get_mbm_res_level()].r_resctrl;
list_for_each_entry(dom, &r->domains, list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
}
--
2.40.1


2023-07-13 16:58:59

by Tony Luck

Subject: [PATCH v3 2/8] x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c

Monitoring may be scoped at L3 cache granularity (legacy) or at the
node level (systems with Sub-NUMA Cluster enabled).

Save the struct rdt_resource pointer that was used to initialize
the monitoring code and use it instead of the hard-coded
RDT_RESOURCE_L3.

No functional change.

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
---
arch/x86/kernel/cpu/resctrl/monitor.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ded1fc7cb7cb..9be6ffdd01ae 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -30,6 +30,8 @@ struct rmid_entry {
struct list_head list;
};

+static struct rdt_resource *mon_resource;
+
/**
* @rmid_free_lru A least recently used list of free RMIDs
* These RMIDs are guaranteed to have an occupancy less than the
@@ -268,7 +270,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
*/
void __check_limbo(struct rdt_domain *d, bool force_free)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_resource *r = mon_resource;
struct rmid_entry *entry;
u32 crmid = 1, nrmid;
bool rmid_dirty;
@@ -333,7 +335,7 @@ int alloc_rmid(void)

static void add_rmid_to_limbo(struct rmid_entry *entry)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_resource *r = mon_resource;
struct rdt_domain *d;
int cpu, err;
u64 val = 0;
@@ -645,7 +647,7 @@ void cqm_handle_limbo(struct work_struct *work)

mutex_lock(&rdtgroup_mutex);

- r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ r = mon_resource;
d = container_of(work, struct rdt_domain, cqm_limbo.work);

__check_limbo(d, false);
@@ -681,7 +683,7 @@ void mbm_handle_overflow(struct work_struct *work)
if (!static_branch_likely(&rdt_mon_enable_key))
goto out_unlock;

- r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ r = mon_resource;
d = container_of(work, struct rdt_domain, mbm_over.work);

list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
@@ -759,9 +761,9 @@ static struct mon_evt mbm_local_event = {
/*
* Initialize the event list for the resource.
*
- * Note that MBM events are also part of RDT_RESOURCE_L3 resource
- * because as per the SDM the total and local memory bandwidth
- * are enumerated as part of L3 monitoring.
+ * Monitor events can either be part of RDT_RESOURCE_L3 resource,
+ * or they may be per NUMA node on systems with sub-NUMA cluster
+ * enabled and are then in the RDT_RESOURCE_NODE resource.
*/
static void l3_mon_evt_init(struct rdt_resource *r)
{
@@ -773,6 +775,8 @@ static void l3_mon_evt_init(struct rdt_resource *r)
list_add_tail(&mbm_total_event.list, &r->evt_list);
if (is_mbm_local_enabled())
list_add_tail(&mbm_local_event.list, &r->evt_list);
+
+ mon_resource = r;
}

int __init rdt_get_mon_l3_config(struct rdt_resource *r)
--
2.40.1


2023-07-13 17:06:22

by Tony Luck

Subject: [PATCH v3 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled

Sub-NUMA Cluster divides CPUs sharing an L3 cache into separate NUMA
nodes. Systems may support splitting into either two or four nodes.

When SNC mode is enabled, the effective amount of L3 cache available
for allocation is divided by the number of nodes per L3.

Detect which SNC mode is active by comparing the number of CPUs that
share a cache with CPU0 against the number of CPUs on node0. E.g. if
CPU0's L3 cache is shared by 48 CPUs but only 24 CPUs are on node0,
then two SNC nodes share each L3 cache.

Reported-by: "Shaopeng Tan (Fujitsu)" <[email protected]>
Closes: https://lore.kernel.org/r/TYAPR01MB6330B9B17686EF426D2C3F308B25A@TYAPR01MB6330.jpnprd01.prod.outlook.com
Signed-off-by: Tony Luck <[email protected]>
---
tools/testing/selftests/resctrl/resctrl.h | 1 +
tools/testing/selftests/resctrl/resctrlfs.c | 57 +++++++++++++++++++++
2 files changed, 58 insertions(+)

diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
index 87e39456dee0..a8b43210b573 100644
--- a/tools/testing/selftests/resctrl/resctrl.h
+++ b/tools/testing/selftests/resctrl/resctrl.h
@@ -13,6 +13,7 @@
#include <signal.h>
#include <dirent.h>
#include <stdbool.h>
+#include <ctype.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/mount.h>
diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
index fb00245dee92..79eecbf9f863 100644
--- a/tools/testing/selftests/resctrl/resctrlfs.c
+++ b/tools/testing/selftests/resctrl/resctrlfs.c
@@ -130,6 +130,61 @@ int get_resource_id(int cpu_no, int *resource_id)
return 0;
}

+/*
+ * Count the number of CPUs in a /sys bitmap
+ */
+static int count_sys_bitmap_bits(char *name)
+{
+ FILE *fp = fopen(name, "r");
+ int count = 0, c;
+
+ if (!fp)
+ return 0;
+
+ while ((c = fgetc(fp)) != EOF) {
+ if (!isxdigit(c))
+ continue;
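+ /* each case below falls through: count the set bits in this hex digit */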
+ switch (c) {
+ case 'f':
+ count++;
+ case '7': case 'b': case 'd': case 'e':
+ count++;
+ case '3': case '5': case '6': case '9': case 'a': case 'c':
+ count++;
+ case '1': case '2': case '4': case '8':
+ count++;
+ }
+ }
+ fclose(fp);
+
+ return count;
+}
+
+/*
+ * Detect SNC by comparing #CPUs in node0 with #CPUs sharing LLC with CPU0.
+ * Try to get this right, even if a few CPUs are offline so that the number
+ * of CPUs in node0 is not exactly half or a quarter of the CPUs sharing the
+ * LLC of CPU0.
+ */
+static int snc_ways(void)
+{
+ int node_cpus, cache_cpus;
+
+ node_cpus = count_sys_bitmap_bits("/sys/devices/system/node/node0/cpumap");
+ cache_cpus = count_sys_bitmap_bits("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map");
+
+ if (!node_cpus || !cache_cpus) {
+ fprintf(stderr, "Warning: could not determine Sub-NUMA Cluster mode\n");
+ return 1;
+ }
+
+ /*
+ * Use a rounded ratio so that a few offline CPUs (in node0 or
+ * elsewhere on the socket) don't skew the result.
+ */
+ return (cache_cpus + node_cpus / 2) / node_cpus;
+}
+
/*
* get_cache_size - Get cache size for a specified CPU
* @cpu_no: CPU number
@@ -190,6 +245,8 @@ int get_cache_size(int cpu_no, char *cache_type, unsigned long *cache_size)
break;
}

+ if (cache_num == 3)
+ *cache_size /= snc_ways();
return 0;
}

--
2.40.1


2023-07-13 17:06:43

by Tony Luck

Subject: [PATCH v3 5/8] x86/resctrl: Add package scoped resource

Some Intel features require setting a package scoped model specific
register.

Add a new resource that builds domains for each package.

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
---
include/linux/resctrl.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 6 ++++--
arch/x86/kernel/cpu/resctrl/core.c | 23 +++++++++++++++++++----
3 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 25051daa6655..f504f6263fec 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -167,6 +167,7 @@ struct rdt_resource {
int rid;
bool alloc_capable;
bool mon_capable;
+ bool pkg_actions;
int num_rmid;
int scope;
struct resctrl_cache cache;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 38bac0062c82..67340c83392f 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -438,6 +438,7 @@ enum resctrl_res_level {
RDT_RESOURCE_MBA,
RDT_RESOURCE_SMBA,
RDT_RESOURCE_NODE,
+ RDT_RESOURCE_PKG,

/* Must be the last */
RDT_NUM_RESOURCES,
@@ -447,6 +448,7 @@ enum resctrl_scope {
SCOPE_L2_CACHE = 2,
SCOPE_L3_CACHE = 3,
SCOPE_NODE,
+ SCOPE_PKG,
};

static inline int get_mbm_res_level(void)
@@ -478,9 +480,9 @@ int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
r <= &rdt_resources_all[RDT_NUM_RESOURCES - 1].r_resctrl; \
r = resctrl_inc(r))

-#define for_each_capable_rdt_resource(r) \
+#define for_each_domain_needed_rdt_resource(r) \
for_each_rdt_resource(r) \
- if (r->alloc_capable || r->mon_capable)
+ if (r->alloc_capable || r->mon_capable || r->pkg_actions)

#define for_each_alloc_capable_rdt_resource(r) \
for_each_rdt_resource(r) \
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 6fe9f87d4403..af3be3c2db96 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -127,6 +127,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
.fflags = 0,
},
},
+ [RDT_RESOURCE_PKG] =
+ {
+ .r_resctrl = {
+ .rid = RDT_RESOURCE_PKG,
+ .name = "PKG",
+ .scope = SCOPE_PKG,
+ .domains = domain_init(RDT_RESOURCE_PKG),
+ .fflags = 0,
+ },
+ },
};

/*
@@ -504,9 +514,14 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)

static int get_domain_id(int cpu, enum resctrl_scope scope)
{
- if (scope == SCOPE_NODE)
+ switch (scope) {
+ case SCOPE_NODE:
return cpu_to_node(cpu);
- return get_cpu_cacheinfo_id(cpu, scope);
+ case SCOPE_PKG:
+ return topology_physical_package_id(cpu);
+ default:
+ return get_cpu_cacheinfo_id(cpu, scope);
+ }
}

/*
@@ -630,7 +645,7 @@ static int resctrl_online_cpu(unsigned int cpu)
struct rdt_resource *r;

mutex_lock(&rdtgroup_mutex);
- for_each_capable_rdt_resource(r)
+ for_each_domain_needed_rdt_resource(r)
domain_add_cpu(cpu, r);
/* The cpu is set in default rdtgroup after online. */
cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
@@ -657,7 +672,7 @@ static int resctrl_offline_cpu(unsigned int cpu)
struct rdt_resource *r;

mutex_lock(&rdtgroup_mutex);
- for_each_capable_rdt_resource(r)
+ for_each_domain_needed_rdt_resource(r)
domain_remove_cpu(cpu, r);
list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
--
2.40.1


2023-07-13 17:07:47

by Tony Luck

Subject: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

Add a placeholder in the array of struct rdt_hw_resource to be used
for event monitoring of systems with Sub-NUMA Cluster enabled.

Update get_domain_id() to handle SCOPE_NODE.

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 4 +++-
arch/x86/kernel/cpu/resctrl/core.c | 12 ++++++++++++
2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 8275b8a74f7e..243017096ddf 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -435,6 +435,7 @@ enum resctrl_res_level {
RDT_RESOURCE_L2,
RDT_RESOURCE_MBA,
RDT_RESOURCE_SMBA,
+ RDT_RESOURCE_NODE,

/* Must be the last */
RDT_NUM_RESOURCES,
@@ -442,7 +443,8 @@ enum resctrl_res_level {

enum resctrl_scope {
SCOPE_L2_CACHE = 2,
- SCOPE_L3_CACHE = 3
+ SCOPE_L3_CACHE = 3,
+ SCOPE_NODE,
};

static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 6571514752f3..e4bd3072927c 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -112,6 +112,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
.fflags = RFTYPE_RES_MB,
},
},
+ [RDT_RESOURCE_NODE] =
+ {
+ .r_resctrl = {
+ .rid = RDT_RESOURCE_NODE,
+ .name = "L3",
+ .scope = SCOPE_NODE,
+ .domains = domain_init(RDT_RESOURCE_NODE),
+ .fflags = 0,
+ },
+ },
};

/*
@@ -489,6 +499,8 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)

static int get_domain_id(int cpu, enum resctrl_scope scope)
{
+ if (scope == SCOPE_NODE)
+ return cpu_to_node(cpu);
return get_cpu_cacheinfo_id(cpu, scope);
}

--
2.40.1


2023-07-13 17:07:59

by Tony Luck

Subject: [PATCH v3 1/8] x86/resctrl: Refactor in preparation for node-scoped resources

Sub-NUMA cluster systems provide monitoring resources at the NUMA
node scope instead of the L3 cache scope.

Rename the cache_level field in struct rdt_resource to the more
generic "scope" and add symbolic names and a helper function.

No functional change.

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
---
include/linux/resctrl.h | 4 ++--
arch/x86/kernel/cpu/resctrl/internal.h | 5 +++++
arch/x86/kernel/cpu/resctrl/core.c | 17 +++++++++++------
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
5 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8334eeacfec5..25051daa6655 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -150,7 +150,7 @@ struct resctrl_schema;
* @alloc_capable: Is allocation available on this machine
* @mon_capable: Is monitor feature available on this machine
* @num_rmid: Number of RMIDs available
- * @cache_level: Which cache level defines scope of this resource
+ * @scope: Scope of this resource (cache level or NUMA node)
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
* @domains: All domains for this resource
@@ -168,7 +168,7 @@ struct rdt_resource {
bool alloc_capable;
bool mon_capable;
int num_rmid;
- int cache_level;
+ int scope;
struct resctrl_cache cache;
struct resctrl_membw membw;
struct list_head domains;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 85ceaf9a31ac..8275b8a74f7e 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -440,6 +440,11 @@ enum resctrl_res_level {
RDT_NUM_RESOURCES,
};

+enum resctrl_scope {
+ SCOPE_L2_CACHE = 2,
+ SCOPE_L3_CACHE = 3
+};
+
static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(res);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 030d3b409768..6571514752f3 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L3,
.name = "L3",
- .cache_level = 3,
+ .scope = SCOPE_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_L3),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
@@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L2,
.name = "L2",
- .cache_level = 2,
+ .scope = SCOPE_L2_CACHE,
.domains = domain_init(RDT_RESOURCE_L2),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
@@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_MBA,
.name = "MB",
- .cache_level = 3,
+ .scope = SCOPE_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_MBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
@@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_SMBA,
.name = "SMBA",
- .cache_level = 3,
+ .scope = 3,
.domains = domain_init(RDT_RESOURCE_SMBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
@@ -487,6 +487,11 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
return 0;
}

+static int get_domain_id(int cpu, enum resctrl_scope scope)
+{
+ return get_cpu_cacheinfo_id(cpu, scope);
+}
+
/*
* domain_add_cpu - Add a cpu to a resource's domain list.
*
@@ -502,7 +507,7 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
*/
static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ int id = get_domain_id(cpu, r->scope);
struct list_head *add_pos = NULL;
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;
@@ -552,7 +557,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)

static void domain_remove_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ int id = get_domain_id(cpu, r->scope);
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;

diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 458cb7419502..42f124ffb968 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -297,7 +297,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);

for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == plr->s->res->cache_level) {
+ if (ci->info_list[i].level == plr->s->res->scope) {
plr->line_size = ci->info_list[i].coherency_line_size;
return 0;
}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 725344048f85..418658f0a9ad 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1348,7 +1348,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
num_b = bitmap_weight(&cbm, r->cache.cbm_len);
ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == r->cache_level) {
+ if (ci->info_list[i].level == r->scope) {
size = ci->info_list[i].size / r->cache.cbm_len * num_b;
break;
}
--
2.40.1


2023-07-13 17:08:16

by Tony Luck

Subject: [PATCH v3 7/8] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

There isn't a simple hardware enumeration to indicate to software that
a system is running with Sub-NUMA Cluster enabled.

Compare the number of NUMA nodes with the number of L3 caches to calculate
the number of Sub-NUMA nodes per L3 cache.

When Sub-NUMA cluster mode is enabled in BIOS setup the RMID counters
are distributed equally between the SNC nodes within each socket.

E.g. if there are 400 RMID counters, and the system is configured with
two SNC nodes per socket, then RMID counter 0..199 are used on SNC node
0 on the socket, and RMID counter 200..399 on SNC node 1.

A model specific MSR (0xca0) can change the configuration of the RMIDs
when SNC mode is enabled.

The MSR controls the interpretation of the RMID field in the
IA32_PQR_ASSOC MSR so that the appropriate hardware counters
within the SNC node are updated.

Also initialize a per-cpu RMID offset value. Use this
to calculate the value to write to the IA32_QM_EVTSEL MSR when
reading RMID event values.

N.B. this works well for well-behaved NUMA applications that access
memory predominantly from the local memory node. For applications that
access memory across multiple nodes it may be necessary for the user
to read counters for all SNC nodes on a socket and add the values to
get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
all that different from applications that span across multiple sockets
in a legacy system.

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
Tested-by: Peter Newman <[email protected]>
---
arch/x86/include/asm/resctrl.h | 2 +
arch/x86/kernel/cpu/resctrl/core.c | 99 +++++++++++++++++++++++++-
arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
4 files changed, 100 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 255a78d9d906..f95e69bacc65 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -35,6 +35,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);

+DECLARE_PER_CPU(int, rmid_offset);
+
/*
* __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
*
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index af3be3c2db96..a03ff1a95624 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@

#define pr_fmt(fmt) "resctrl: " fmt

+#include <linux/cpu.h>
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/cacheinfo.h>
#include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>

+#include <asm/cpu_device_id.h>
#include <asm/intel-family.h>
#include <asm/resctrl.h>
#include "internal.h"
@@ -524,6 +527,39 @@ static int get_domain_id(int cpu, enum resctrl_scope scope)
}
}

+DEFINE_PER_CPU(int, rmid_offset);
+
+static void set_per_cpu_rmid_offset(int cpu, struct rdt_resource *r)
+{
+ this_cpu_write(rmid_offset, (cpu_to_node(cpu) % snc_ways) * r->num_rmid);
+}
+
+/*
+ * This MSR provides for configuration of RMIDs on Sub-NUMA Cluster
+ * systems.
+ * Bit0 = 1 (default) For legacy configuration
+ * Bit0 = 0 RMIDs are divided evenly between SNC nodes.
+ */
+#define MSR_RMID_SNC_CONFIG 0xCA0
+
+static void snc_add_pkg(void)
+{
+ u64 msrval;
+
+ rdmsrl(MSR_RMID_SNC_CONFIG, msrval);
+ msrval &= ~BIT_ULL(0);
+ wrmsrl(MSR_RMID_SNC_CONFIG, msrval);
+}
+
+static void snc_remove_pkg(void)
+{
+ u64 msrval;
+
+ rdmsrl(MSR_RMID_SNC_CONFIG, msrval);
+ msrval |= BIT_ULL(0);
+ wrmsrl(MSR_RMID_SNC_CONFIG, msrval);
+}
+
/*
* domain_add_cpu - Add a cpu to a resource's domain list.
*
@@ -555,6 +591,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
cpumask_set_cpu(cpu, &d->cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
rdt_domain_reconfigure_cdp(r);
+ if (r->mon_capable)
+ set_per_cpu_rmid_offset(cpu, r);
return;
}

@@ -573,11 +611,17 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;
}

- if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
- domain_free(hw_dom);
- return;
+ if (r->mon_capable) {
+ if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+ domain_free(hw_dom);
+ return;
+ }
+ set_per_cpu_rmid_offset(cpu, r);
}

+ if (r->pkg_actions)
+ snc_add_pkg();
+
list_add_tail(&d->list, add_pos);

err = resctrl_online_domain(r, d);
@@ -613,6 +657,9 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
d->plr->d = NULL;
domain_free(hw_dom);

+ if (r->pkg_actions)
+ snc_remove_pkg();
+
return;
}

@@ -899,11 +946,57 @@ static __init bool get_rdt_resources(void)
return (rdt_mon_capable || rdt_alloc_capable);
}

+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+ {}
+};
+
+/*
+ * There isn't a simple enumeration bit to show whether SNC mode
+ * is enabled. Look at the ratio of number of NUMA nodes to the
+ * number of distinct L3 caches. Take care to skip memory-only nodes.
+ */
+static __init int find_snc_ways(void)
+{
+ unsigned long *node_caches;
+ int mem_only_nodes = 0;
+ int cpu, node, ret;
+
+ if (!x86_match_cpu(snc_cpu_ids))
+ return 1;
+
+ node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
+ if (!node_caches)
+ return 1;
+
+ cpus_read_lock();
+ for_each_node(node) {
+ cpu = cpumask_first(cpumask_of_node(node));
+ if (cpu < nr_cpu_ids)
+ set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
+ else
+ mem_only_nodes++;
+ }
+ cpus_read_unlock();
+
+ ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);
+ kfree(node_caches);
+
+ if (ret > 1)
+ rdt_resources_all[RDT_RESOURCE_PKG].r_resctrl.pkg_actions = true;
+
+ return ret;
+}
+
static __init void rdt_init_res_defs_intel(void)
{
struct rdt_hw_resource *hw_res;
struct rdt_resource *r;

+ snc_ways = find_snc_ways();
+
for_each_rdt_resource(r) {
hw_res = resctrl_to_arch_res(r);

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index da3f36212898..74db99d299e1 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -160,7 +160,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
* IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
* are error bits.
*/
- wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+ wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + this_cpu_read(rmid_offset));
rdmsrl(MSR_IA32_QM_CTR, msr_val);

if (msr_val & RMID_VAL_ERROR)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index d037f3da9e55..1a9c38b018ba 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1354,7 +1354,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
}
}

- return size;
+ return size / snc_ways;
}

/**
--
2.40.1


2023-07-18 20:46:02

by Reinette Chatre

Subject: Re: [PATCH v3 1/8] x86/resctrl: Refactor in preparation for node-scoped resources

Hi Tony,

On 7/13/2023 9:32 AM, Tony Luck wrote:
> Sub-NUMA cluster systems provide monitoring resources at the NUMA
> node scope instead of the L3 cache scope.
>
> Rename the cache_level field in struct rdt_resource to the more
> generic "scope" and add symbolic names and a helper function.

Can the changelog elaborate how the helper function is intended
to be used? When the changelog just states "add a helper function" it
is unnecessary since that is clear from the code.

>
> No functional change.
>
> Signed-off-by: Tony Luck <[email protected]>
> Reviewed-by: Peter Newman <[email protected]>
> ---

...

> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 030d3b409768..6571514752f3 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_L3,
> .name = "L3",
> - .cache_level = 3,
> + .scope = SCOPE_L3_CACHE,
> .domains = domain_init(RDT_RESOURCE_L3),
> .parse_ctrlval = parse_cbm,
> .format_str = "%d=%0*x",
> @@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_L2,
> .name = "L2",
> - .cache_level = 2,
> + .scope = SCOPE_L2_CACHE,
> .domains = domain_init(RDT_RESOURCE_L2),
> .parse_ctrlval = parse_cbm,
> .format_str = "%d=%0*x",
> @@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_MBA,
> .name = "MB",
> - .cache_level = 3,
> + .scope = SCOPE_L3_CACHE,
> .domains = domain_init(RDT_RESOURCE_MBA),
> .parse_ctrlval = parse_bw,
> .format_str = "%d=%*u",
> @@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_SMBA,
> .name = "SMBA",
> - .cache_level = 3,
> + .scope = 3,

Should this be SCOPE_L3_CACHE?

> .domains = domain_init(RDT_RESOURCE_SMBA),
> .parse_ctrlval = parse_bw,
> .format_str = "%d=%*u",

...

> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 458cb7419502..42f124ffb968 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -297,7 +297,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
> plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
>
> for (i = 0; i < ci->num_leaves; i++) {
> - if (ci->info_list[i].level == plr->s->res->cache_level) {
> + if (ci->info_list[i].level == plr->s->res->scope) {
> plr->line_size = ci->info_list[i].coherency_line_size;
> return 0;
> }
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 725344048f85..418658f0a9ad 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1348,7 +1348,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
> num_b = bitmap_weight(&cbm, r->cache.cbm_len);
> ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
> for (i = 0; i < ci->num_leaves; i++) {
> - if (ci->info_list[i].level == r->cache_level) {
> + if (ci->info_list[i].level == r->scope) {
> size = ci->info_list[i].size / r->cache.cbm_len * num_b;
> break;
> }

The last two hunks are red flags to me. Clearly the "cache_level"->"scope"
change is done in preparation for "scope" to be assigned more values than
2 or 3. Yet the code continues to use these values as cache levels, comparing
them to cacheinfo->level, for which I only expect cache levels 2 or 3 to be valid.
The above two hunks thus now have potential for errors when rdt_resource->scope
has a value that is not 2 or 3.

Even if these functions may not be called if rdt_resource->scope is not 2 or 3,
this change makes the code harder to understand and maintain because now it
requires users to know in which flows particular functions can be called and/or
when code paths with invalid values are "ok".

Reinette


2023-07-18 20:55:44

by Reinette Chatre

Subject: Re: [PATCH v3 7/8] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

Hi Tony,

On 7/13/2023 9:32 AM, Tony Luck wrote:
> There isn't a simple hardware enumeration to indicate to software that
> a system is running with Sub-NUMA Cluster enabled.

This changelog appears to _almost_ be identical to the cover letter. There
is no problem with changelog and cover letter being identical but it makes
things confusing when there are slight differences between the text.
For example, "Sub-NUMA Cluster" vs "Sub-NUMA Clustering". With this difference
between the two the reader is left wondering what behind the difference is.

>
> Compare the number of NUMA nodes with the number of L3 caches to calculate
> the number of Sub-NUMA nodes per L3 cache.
>
> When Sub-NUMA cluster mode is enabled in BIOS setup the RMID counters
> are distributed equally between the SNC nodes within each socket.
>
> E.g. if there are 400 RMID counters, and the system is configured with
> two SNC nodes per socket, then RMID counter 0..199 are used on SNC node
> 0 on the socket, and RMID counter 200..399 on SNC node 1.
>
> A model specific MSR (0xca0) can change the configuration of the RMIDs
> when SNC mode is enabled.
>
> The MSR controls the interpretation of the RMID field in the
> IA32_PQR_ASSOC MSR so that the appropriate hardware counters
> within the SNC node are updated.
>
> Also initialize a per-cpu RMID offset value. Use this
> to calculate the value to write to the IA32_QM_EVTSEL MSR when
> reading RMID event values.
>
> N.B. this works well for well-behaved NUMA applications that access
> memory predominantly from the local memory node. For applications that
> access memory across multiple nodes it may be necessary for the user
> to read counters for all SNC nodes on a socket and add the values to
> get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
> all that different from applications that span across multiple sockets
> in a legacy system.
>
> Signed-off-by: Tony Luck <[email protected]>
> Reviewed-by: Peter Newman <[email protected]>
> Tested-by: Peter Newman <[email protected]>
> ---
> arch/x86/include/asm/resctrl.h | 2 +
> arch/x86/kernel/cpu/resctrl/core.c | 99 +++++++++++++++++++++++++-
> arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
> 4 files changed, 100 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> index 255a78d9d906..f95e69bacc65 100644
> --- a/arch/x86/include/asm/resctrl.h
> +++ b/arch/x86/include/asm/resctrl.h
> @@ -35,6 +35,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
> DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
> DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
>
> +DECLARE_PER_CPU(int, rmid_offset);
> +
> /*
> * __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
> *
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index af3be3c2db96..a03ff1a95624 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -16,11 +16,14 @@
>
> #define pr_fmt(fmt) "resctrl: " fmt
>
> +#include <linux/cpu.h>
> #include <linux/slab.h>
> #include <linux/err.h>
> #include <linux/cacheinfo.h>
> #include <linux/cpuhotplug.h>
> +#include <linux/mod_devicetable.h>
>
> +#include <asm/cpu_device_id.h>
> #include <asm/intel-family.h>
> #include <asm/resctrl.h>
> #include "internal.h"
> @@ -524,6 +527,39 @@ static int get_domain_id(int cpu, enum resctrl_scope scope)
> }
> }
>
> +DEFINE_PER_CPU(int, rmid_offset);
> +
> +static void set_per_cpu_rmid_offset(int cpu, struct rdt_resource *r)
> +{
> + this_cpu_write(rmid_offset, (cpu_to_node(cpu) % snc_ways) * r->num_rmid);
> +}

Does this mean that per-cpu data is used as a way to keep "per SNC node" data?
Why is it required for this to be per-cpu data instead of, for example, the
offset computed when it is needed?

> +
> +/*
> + * This MSR provides for configuration of RMIDs on Sub-NUMA Cluster
> + * systems.
> + * Bit0 = 1 (default) For legacy configuration
> + * Bit0 = 0 RMIDs are divided evenly between SNC nodes.
> + */
> +#define MSR_RMID_SNC_CONFIG 0xCA0

Please move to msr-index.h. For reference:
97fa21f65c3e ("x86/resctrl: Move MSR defines into msr-index.h")

> +
> +static void snc_add_pkg(void)
> +{
> + u64 msrval;
> +
> + rdmsrl(MSR_RMID_SNC_CONFIG, msrval);
> + msrval &= ~BIT_ULL(0);
> + wrmsrl(MSR_RMID_SNC_CONFIG, msrval);
> +}
> +
> +static void snc_remove_pkg(void)
> +{
> + u64 msrval;
> +
> + rdmsrl(MSR_RMID_SNC_CONFIG, msrval);
> + msrval |= BIT_ULL(0);
> + wrmsrl(MSR_RMID_SNC_CONFIG, msrval);
> +}
> +
> /*
> * domain_add_cpu - Add a cpu to a resource's domain list.
> *
> @@ -555,6 +591,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
> cpumask_set_cpu(cpu, &d->cpu_mask);
> if (r->cache.arch_has_per_cpu_cfg)
> rdt_domain_reconfigure_cdp(r);
> + if (r->mon_capable)
> + set_per_cpu_rmid_offset(cpu, r);
> return;
> }
>
> @@ -573,11 +611,17 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
> return;
> }
>
> - if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> - domain_free(hw_dom);
> - return;
> + if (r->mon_capable) {
> + if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> + domain_free(hw_dom);
> + return;
> + }
> + set_per_cpu_rmid_offset(cpu, r);
> }
>
> + if (r->pkg_actions)
> + snc_add_pkg();
> +

This seems like an unnecessary use of a resctrl resource.


> list_add_tail(&d->list, add_pos);
>
> err = resctrl_online_domain(r, d);
> @@ -613,6 +657,9 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> d->plr->d = NULL;
> domain_free(hw_dom);
>
> + if (r->pkg_actions)
> + snc_remove_pkg();
> +
> return;
> }
>
> @@ -899,11 +946,57 @@ static __init bool get_rdt_resources(void)
> return (rdt_mon_capable || rdt_alloc_capable);
> }
>
> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> + X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
> + X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
> + X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
> + {}
> +};
> +
> +/*
> + * There isn't a simple enumeration bit to show whether SNC mode
> + * is enabled. Look at the ratio of number of NUMA nodes to the
> + * number of distinct L3 caches. Take care to skip memory-only nodes.
> + */
> +static __init int find_snc_ways(void)

Based on the function comment and function name it is not clear what
this function is intended to do. In caches, "ways" has a particular meaning;
what does "ways" mean in this context?

> +{
> + unsigned long *node_caches;
> + int mem_only_nodes = 0;
> + int cpu, node, ret;
> +
> + if (!x86_match_cpu(snc_cpu_ids))
> + return 1;
> +
> + node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
> + if (!node_caches)
> + return 1;
> +
> + cpus_read_lock();
> + for_each_node(node) {
> + cpu = cpumask_first(cpumask_of_node(node));
> + if (cpu < nr_cpu_ids)
> + set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> + else
> + mem_only_nodes++;
> + }
> + cpus_read_unlock();
> +
> + ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);
> + kfree(node_caches);
> +
> + if (ret > 1)
> + rdt_resources_all[RDT_RESOURCE_PKG].r_resctrl.pkg_actions = true;
> +
> + return ret;
> +}
> +
> static __init void rdt_init_res_defs_intel(void)
> {
> struct rdt_hw_resource *hw_res;
> struct rdt_resource *r;
>
> + snc_ways = find_snc_ways();
> +
> for_each_rdt_resource(r) {
> hw_res = resctrl_to_arch_res(r);
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index da3f36212898..74db99d299e1 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -160,7 +160,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
> * are error bits.
> */
> - wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> + wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + this_cpu_read(rmid_offset));
> rdmsrl(MSR_IA32_QM_CTR, msr_val);
>
> if (msr_val & RMID_VAL_ERROR)
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index d037f3da9e55..1a9c38b018ba 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1354,7 +1354,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
> }
> }
>
> - return size;
> + return size / snc_ways;
> }
>
> /**

The last hunk does not seem to be covered in the changelog.

Reinette

2023-07-18 21:02:06

by Reinette Chatre

Subject: Re: [PATCH v3 2/8] x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c

Hi Tony,

Regarding subject and change: Why is the focus on just monitor.c?
Hardcoding of RDT_RESOURCE_L3 as monitoring resource is done
elsewhere also (rdtgroup.c:rdt_get_tree()) - why not treat all
hardcoding?

On 7/13/2023 9:32 AM, Tony Luck wrote:

...

> @@ -759,9 +761,9 @@ static struct mon_evt mbm_local_event = {
> /*
> * Initialize the event list for the resource.
> *
> - * Note that MBM events are also part of RDT_RESOURCE_L3 resource
> - * because as per the SDM the total and local memory bandwidth
> - * are enumerated as part of L3 monitoring.
> + * Monitor events can either be part of RDT_RESOURCE_L3 resource,
> + * or they may be per NUMA node on systems with sub-NUMA cluster
> + * enabled and are then in the RDT_RESOURCE_NODE resource.
> */
> static void l3_mon_evt_init(struct rdt_resource *r)
> {
> @@ -773,6 +775,8 @@ static void l3_mon_evt_init(struct rdt_resource *r)
> list_add_tail(&mbm_total_event.list, &r->evt_list);
> if (is_mbm_local_enabled())
> list_add_tail(&mbm_local_event.list, &r->evt_list);
> +
> + mon_resource = r;
> }
>

This does not seem like the right place for this initialization.
l3_mon_evt_init() has a single job that the function comment clearly
states: "Initialize the event list for the resource". What does
the global mon_resource have to do with the event list?

Would get_rdt_mon_resources() not be more appropriate?

Although, looking ahead it is not clear to me why this is needed.
I'll try to focus my responses to the individual patches in this
regard.

> int __init rdt_get_mon_l3_config(struct rdt_resource *r)

Reinette

2023-07-18 21:09:23

by Reinette Chatre

Subject: Re: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

Hi Tony,

On 7/13/2023 9:32 AM, Tony Luck wrote:
> Add a placeholder in the array of struct rdt_hw_resource to be used
> for event monitoring of systems with Sub-NUMA Cluster enabled.

Could you please elaborate why a new resource is required?


>
> Update get_domain_id() to handle SCOPE_NODE.
>
> Signed-off-by: Tony Luck <[email protected]>
> Reviewed-by: Peter Newman <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 4 +++-
> arch/x86/kernel/cpu/resctrl/core.c | 12 ++++++++++++
> 2 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 8275b8a74f7e..243017096ddf 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -435,6 +435,7 @@ enum resctrl_res_level {
> RDT_RESOURCE_L2,
> RDT_RESOURCE_MBA,
> RDT_RESOURCE_SMBA,
> + RDT_RESOURCE_NODE,
>
> /* Must be the last */
> RDT_NUM_RESOURCES,
> @@ -442,7 +443,8 @@ enum resctrl_res_level {
>
> enum resctrl_scope {
> SCOPE_L2_CACHE = 2,
> - SCOPE_L3_CACHE = 3
> + SCOPE_L3_CACHE = 3,
> + SCOPE_NODE,
> };

A new resource _and_ a new scope are added. Could the changelog please
explain why this is required?

>
> static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 6571514752f3..e4bd3072927c 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -112,6 +112,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .fflags = RFTYPE_RES_MB,
> },
> },
> + [RDT_RESOURCE_NODE] =
> + {
> + .r_resctrl = {
> + .rid = RDT_RESOURCE_NODE,
> + .name = "L3",
> + .scope = SCOPE_NODE,
> + .domains = domain_init(RDT_RESOURCE_NODE),
> + .fflags = 0,
> + },
> + },
> };

So the new resource has the same name, from user perspective,
as RDT_RESOURCE_L3. From this perspective it thus seems to be a
shadow of RDT_RESOURCE_L3 that is used as alternative for some properties
of the actual RDT_RESOURCE_L3? This is starting to look as though this
solution is wrenching itself into current architecture.

From what I can tell the monitoring in SNC environment needs a different
domain list because of the change in scope. What else is needed in the
resource that is different from the existing L3 resource? Could the
monitoring scope of a resource not instead be made distinct from its
allocation scope? By default monitoring and allocation scope will be
the same and thus use the same domain list but when SNC is enabled
then monitoring uses a different domain list.

> /*
> @@ -489,6 +499,8 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>
> static int get_domain_id(int cpu, enum resctrl_scope scope)
> {
> + if (scope == SCOPE_NODE)
> + return cpu_to_node(cpu);
> return get_cpu_cacheinfo_id(cpu, scope);
> }
>

Reinette

2023-07-18 21:10:15

by Reinette Chatre

Subject: Re: [PATCH v3 5/8] x86/resctrl: Add package scoped resource

Hi Tony,

On 7/13/2023 9:32 AM, Tony Luck wrote:
> Some Intel features require setting a package scoped model specific
> register.
>
> Add a new resource that builds domains for each package.

If I understand correctly the only purpose of this new resource
is to know when the first CPU associated with a package
comes online. Am I not reading this right? Using a resctrl resource
for this purpose seems inappropriate and unnecessary while also
making the code very hard to follow.

Reinette

2023-07-18 21:25:46

by Reinette Chatre

Subject: Re: [PATCH v3 4/8] x86/resctrl: Add code to setup monitoring at L3 or NODE scope.

Hi Tony,

Regarding subject: "Add code" is not necessary.

On 7/13/2023 9:32 AM, Tony Luck wrote:
> When Sub-NUMA cluster is enabled (snc_ways > 1) use the RDT_RESOURCE_NODE
> instead of RDT_RESOURCE_L3 for all monitoring operations.

This duplication of resource does not look right to me.
RDT_RESOURCE_NODE now contains the monitoring data for RDT_RESOURCE_L3
with related structures within RDT_RESOURCE_L3 going unused.

>
> The mon_scale and num_rmid values from CPUID(0xf,0x1),(EBX,ECX) must be
> scaled down by the number of Sub-NUMA Clusters.
>
> A subsequent change will detect sub-NUMA cluster mode and set
> "snc_ways". For now set to one (meaning each L3 cache spans one
> node).
>
> Signed-off-by: Tony Luck <[email protected]>
> Reviewed-by: Peter Newman <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 7 +++++++
> arch/x86/kernel/cpu/resctrl/core.c | 7 ++++++-
> arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++--
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
> 4 files changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 243017096ddf..38bac0062c82 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -430,6 +430,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>
> extern struct dentry *debugfs_resctrl;
>
> +extern int snc_ways;
> +
> enum resctrl_res_level {
> RDT_RESOURCE_L3,
> RDT_RESOURCE_L2,
> @@ -447,6 +449,11 @@ enum resctrl_scope {
> SCOPE_NODE,
> };
>
> +static inline int get_mbm_res_level(void)
> +{
> + return snc_ways > 1 ? RDT_RESOURCE_NODE : RDT_RESOURCE_L3;
> +}

Need to return the enum here? It may be simpler for this helper to
just return a pointer to the resource. (Although the need for a
separate resource is still not clear to me.)

> +
> static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
> {
> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(res);
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index e4bd3072927c..6fe9f87d4403 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -48,6 +48,11 @@ int max_name_width, max_data_width;
> */
> bool rdt_alloc_capable;
>
> +/*
> + * How many Sub-NUMA Cluster nodes share a single L3 cache
> + */
> +int snc_ways = 1;
> +

Since snc_ways is always used I think the comment should provide
more detail on the possible values it may have. For example, to
a reader it may not be obvious what the value of snc_ways should be
if SNC is disabled. Also, what does "ways" refer to? (Also mentioned
later).

> static void
> mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
> struct rdt_resource *r);
> @@ -831,7 +836,7 @@ static __init bool get_rdt_alloc_resources(void)
>
> static __init bool get_rdt_mon_resources(void)
> {
> - struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + struct rdt_resource *r = &rdt_resources_all[get_mbm_res_level()].r_resctrl;
>
> if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
> rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 9be6ffdd01ae..da3f36212898 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -787,8 +787,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
> int ret;
>
> resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
> - hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
> - r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
> + hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_ways;
> + r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_ways;
> hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
>
> if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 418658f0a9ad..d037f3da9e55 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2524,7 +2524,7 @@ static int rdt_get_tree(struct fs_context *fc)
> static_branch_enable_cpuslocked(&rdt_enable_key);
>
> if (is_mbm_enabled()) {
> - r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + r = &rdt_resources_all[get_mbm_res_level()].r_resctrl;
> list_for_each_entry(dom, &r->domains, list)
> mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
> }

This final hunk makes me wonder why the monitor.c:
mon_resource is necessary at all. A single helper used everywhere
may be simpler.

Reinette

2023-07-18 23:49:57

by Tony Luck

Subject: Re: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

On Tue, Jul 18, 2023 at 01:40:32PM -0700, Reinette Chatre wrote:
> > + [RDT_RESOURCE_NODE] =
> > + {
> > + .r_resctrl = {
> > + .rid = RDT_RESOURCE_NODE,
> > + .name = "L3",
> > + .scope = SCOPE_NODE,
> > + .domains = domain_init(RDT_RESOURCE_NODE),
> > + .fflags = 0,
> > + },
> > + },
> > };
>
> So the new resource has the same name, from user perspective,
> as RDT_RESOURCE_L3. From this perspective it thus seems to be a
> shadow of RDT_RESOURCE_L3 that is used as alternative for some properties
> of the actual RDT_RESOURCE_L3? This is starting to look as though this
> solution is wrenching itself into current architecture.
>
> From what I can tell the monitoring in SNC environment needs a different
> domain list because of the change in scope. What else is needed in the
> resource that is different from the existing L3 resource? Could the
> monitoring scope of a resource not instead be made distinct from its
> allocation scope? By default monitoring and allocation scope will be
> the same and thus use the same domain list but when SNC is enabled
> then monitoring uses a different domain list.

Answering this part first, because my choice here affects a bunch
of the code that also raised comments from you.

The crux of the issue is that when SNC mode is enabled the scope
for L3 monitoring functions changes to "node" scope, while the
scope of L3 control functions (CAT, CDP) remains at L3 cache scope.

My solution was to just create a new resource. But you have an
interesting alternate solution. Add an extra domain list to the
resource structure to allow creation of distinct domain lists
for this case where the scope for control and monitor functions
differs.

So change the resource structure like this:

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8334eeacfec5..01590aa59a67 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -168,10 +168,12 @@ struct rdt_resource {
bool alloc_capable;
bool mon_capable;
int num_rmid;
- int cache_level;
+ int ctrl_scope;
+ int mon_scope;
struct resctrl_cache cache;
struct resctrl_membw membw;
- struct list_head domains;
+ struct list_head ctrl_domains;
+ struct list_head mon_domains;
char *name;
int data_width;
u32 default_ctrl;

and build/use separate domain lists for when this resource is
being referenced for allocation/monitoring. E.g. domain_add_cpu()
would check "r->alloc_capable" and add a cpu to the ctrl_domains
list based on the ctrl_scope value. It would do the same with
mon_capable / mon_domains / mon_scope.

If ctrl_scope == mon_scope, just build one list as you suggest above.
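
As a rough sketch (domain_add_cpu_to_list() is an assumed helper that
resolves the scope to a domain id and inserts the CPU; when
ctrl_scope == mon_scope it could be taught to share one list):

	static void domain_add_cpu(int cpu, struct rdt_resource *r)
	{
		if (r->alloc_capable)
			domain_add_cpu_to_list(cpu, r, r->ctrl_scope,
					       &r->ctrl_domains);
		if (r->mon_capable)
			domain_add_cpu_to_list(cpu, r, r->mon_scope,
					       &r->mon_domains);
	}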

Maybe there are more places that walk the list of control domains than
walk the list of monitor domains. Need to audit this set:

$ git grep list_for_each.*domains -- arch/x86/kernel/cpu/resctrl
arch/x86/kernel/cpu/resctrl/core.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/core.c: list_for_each(l, &r->domains) {
arch/x86/kernel/cpu/resctrl/ctrlmondata.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/ctrlmondata.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/ctrlmondata.c: list_for_each_entry(dom, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/monitor.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/pseudo_lock.c: list_for_each_entry(d_i, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list)
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r_l->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list)
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &s->res->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {

Maybe "domains" can keep its name and make a "list_for_each_monitor_domain()" macro
to pick the right list to walk?
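
Such a macro might look like this (sketch; assumes the monitor list is
named "mon_domains" as in the structure change above):

	#define list_for_each_monitor_domain(d, r)				\
		list_for_each_entry(d, (r)->mon_scope ? &(r)->mon_domains	\
						      : &(r)->domains, list)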


I don't think this will reduce the amount of code change in a
significant way. But it may be conceptually easier to follow
what is going on.

-Tony

2023-07-18 23:50:49

by Tony Luck

[permalink] [raw]
Subject: Re: [PATCH v3 5/8] x86/resctrl: Add package scoped resource

On Tue, Jul 18, 2023 at 01:43:45PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 7/13/2023 9:32 AM, Tony Luck wrote:
> > Some Intel features require setting a package scoped model specific
> > register.
> >
> > Add a new resource that builds domains for each package.
>
> If I understand correctly the only purpose of this new resource
> is to know when the first CPU associated with a package
> comes online. Am I not reading this right? Using a resctrl resource
> for this purpose seems inappropriate and unnecessary while also
> making the code very hard to follow.

Reinette,

Yes. You understand.

I agree that this is blatant abuse of the resource structures. I can find
another way to perform an action when the first CPU of a package comes
online, and when the last CPU of a package goes offline.
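
For example (a sketch, not tested; snc_remap_rmids() is a hypothetical
helper that writes the SNC configuration MSR), a cpuhp online callback
could check whether this CPU is alone in its package:

	static int snc_cpu_online(unsigned int cpu)
	{
		/*
		 * topology_core_cpumask() covers the online CPUs in this
		 * CPU's package, so a weight of 1 means this is the first
		 * CPU of the package to come online.
		 */
		if (cpumask_weight(topology_core_cpumask(cpu)) == 1)
			snc_remap_rmids(cpu);
		return 0;
	}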

-Tony

2023-07-19 00:05:33

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

Hi Tony,

On 7/18/2023 3:57 PM, Tony Luck wrote:
> On Tue, Jul 18, 2023 at 01:40:32PM -0700, Reinette Chatre wrote:
>>> + [RDT_RESOURCE_NODE] =
>>> + {
>>> + .r_resctrl = {
>>> + .rid = RDT_RESOURCE_NODE,
>>> + .name = "L3",
>>> + .scope = SCOPE_NODE,
>>> + .domains = domain_init(RDT_RESOURCE_NODE),
>>> + .fflags = 0,
>>> + },
>>> + },
>>> };
>>
>> So the new resource has the same name, from user perspective,
>> as RDT_RESOURCE_L3. From this perspective it thus seems to be a
>> shadow of RDT_RESOURCE_L3 that is used as alternative for some properties
>> of the actual RDT_RESOURCE_L3? This is starting to look as though this
>> solution is wrenching itself into current architecture.
>>
>> From what I can tell the monitoring in an SNC environment needs a different
>> domain list because of the change in scope. What else is needed in the
>> resource that is different from the existing L3 resource? Could the
>> monitoring scope of a resource not instead be made distinct from its
>> allocation scope? By default monitoring and allocation scope will be
>> the same and thus use the same domain list but when SNC is enabled
>> then monitoring uses a different domain list.
>
> Answering this part first, because my choice here affects a bunch
> of the code that also raised comments from you.

Indeed.

>
> The crux of the issue is that when SNC mode is enabled the scope
> for L3 monitoring functions changes to "node" scope, while the
> scope of L3 control functions (CAT, CDP) remains at L3 cache scope.
>
> My solution was to just create a new resource. But you have an
> interesting alternate solution: add an extra domain list to the
> resource structure to allow creation of distinct domain lists
> for this case where the scope for control and monitor functions
> differs.
>
> So change the resource structure like this:
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 8334eeacfec5..01590aa59a67 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -168,10 +168,12 @@ struct rdt_resource {
> bool alloc_capable;
> bool mon_capable;
> int num_rmid;
> - int cache_level;
> + int ctrl_scope;
> + int mon_scope;

I am not sure about getting rid of cache_level so fast.
I see regarding the current problem being solved that
ctrl_scope would have the same values as cache_level but
I find that adding this level of indirection while keeping
the comparison with cacheinfo->level to create a trap
for future mistakes.


> struct resctrl_cache cache;
> struct resctrl_membw membw;
> - struct list_head domains;
> + struct list_head ctrl_domains;
> + struct list_head mon_domains;
> char *name;
> int data_width;
> u32 default_ctrl;
>
> and build/use separate domain lists for when this resource is
> being referenced for allocation/monitoring. E.g. domain_add_cpu()
> would check "r->alloc_capable" and add a cpu to the ctrl_domains
> list based on the ctrl_scope value. It would do the same with
> mon_capable / mon_domains / mon_scope.
>
> If ctrl_scope == mon_scope, just build one list as you suggest above.

Yes, this is the idea. Thank you for considering it. Something else
to consider that may make this even cleaner/simpler would be to review
struct rdt_domain and struct rdt_hw_domain members for "monitor" vs "control"
usage. These structs could potentially be split further into separate
"control" and "monitor" variants. For example, "struct rdt_domain" split into
"struct rdt_ctrl_domain" and "struct rdt_mon_domain". If there is a clean
split then resctrl can always create two lists with the unnecessary duplication
eliminated when two domain lists are created. This would also
eliminate the need to scatter ctrl_scope == mon_scope checks throughout.

>
> Maybe there are more places that walk the list of control domains than
> walk the list of monitor domains. Need to audit this set:
>
> $ git grep list_for_each.*domains -- arch/x86/kernel/cpu/resctrl
> arch/x86/kernel/cpu/resctrl/core.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/core.c: list_for_each(l, &r->domains) {
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c: list_for_each_entry(dom, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/monitor.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c: list_for_each_entry(d_i, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list)
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r_l->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list)
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &s->res->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
>
> Maybe "domains" can keep its name and make a "list_for_each_monitor_domain()" macro
> to pick the right list to walk?

It is not clear to me how "domains" can keep its name. If I understand
the macro would be useful if scope always needs to be considered. I wonder
if the list walkers may not mostly just walk the appropriate list directly
if resctrl always creates separate "control domain" and "monitor domain"
lists.
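
That is, with both lists always present a walker could just pick its
list directly, e.g. (sketch, assuming a "mon_domains" field):

	static void setup_mbm_overflow(struct rdt_resource *r)
	{
		struct rdt_domain *dom;

		list_for_each_entry(dom, &r->mon_domains, list)
			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
	}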

> I don't think this will reduce the amount of code change in a
> significant way. But it may be conceptually easier to follow
> what is going on.

Reducing the amount of code changed is not a goal to me. If I understand
correctly I think that adapting resctrl to support different monitor and
control scope could create a foundation into which SNC can slot in smoothly.

Reinette



2023-07-19 00:36:55

by Tony Luck

[permalink] [raw]
Subject: RE: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

> Yes, this is the idea. Thank you for considering it. Something else
> to consider that may make this even cleaner/simpler would be to review
> struct rdt_domain and struct rdt_hw_domain members for "monitor" vs "control"
> usage. These structs could potentially be split further into separate
> "control" and "monitor" variants. For example, "struct rdt_domain" split into
> "struct rdt_ctrl_domain" and "struct rdt_mon_domain". If there is a clean
> split then resctrl can always create two lists with the unnecessary duplication
> eliminated when two domain lists are created. This would also
> eliminate the need to scatter ctrl_scope == mon_scope checks throughout.

You might like what I'm doing in the "resctrl2" re-write[1]. Arch independent code
that maintains the domain lists for a resource via a cpuhp notifier just has this
for the domain structure:

struct resctrl_domain {
struct list_head list;
struct cpumask cpu_mask;
int id;
int cache_size;
};

Each module managing a resource decides what extra information it wants to
carry in the domain. So the above structure is common to all, but it is followed
by whatever the resource module wants. E.g. the CBM masks for each CLOSid
for the CAT module. The module tells core code the size to allocate.
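
As a sketch of that layout (names hypothetical), a CAT module might
wrap the common structure like this and tell the core to allocate
sizeof(struct cat_domain) plus the per-CLOSID array:

	struct cat_domain {
		struct resctrl_domain	common;	/* must be first */
		u32			cbm[];	/* per-CLOSID masks */
	};

	static inline struct cat_domain *to_cat_domain(struct resctrl_domain *d)
	{
		return container_of(d, struct cat_domain, common);
	}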

"cache_size" is only there because the cache topology bits needed to discover
sizes of caches aren't exported. Both the "size" file and pseudo-locking need
to know the size.

It's also possible that you may hate it. There is zero sharing of resource structures
even if they have the same scope. This is because all modules are independently
loadable.

-Tony

[1] WIP snapshot at git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git
branch resctrl2_v65rc1. That doesn't have pseudo-locking, but most of the rest
of existing resctrl functionality is there.

2023-07-19 02:56:25

by Shaopeng Tan (Fujitsu)

[permalink] [raw]
Subject: RE: [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems

Hi Tony,

I ran selftests/resctrl in my environment;
the test result is "not ok".

Processor in my environment:
Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz

kernel:
$ uname -r
6.5.0-rc1+

Result :
Sub-NUMA enable:
xxx@xxx:~/linux_v6.5_rc1l$ sudo make -C tools/testing/selftests/resctrl run_tests
make: Entering directory '/.../tools/testing/selftests/resctrl'
TAP version 13
1..1
# timeout set to 120
# selftests: resctrl: resctrl_tests
# TAP version 13
# # Pass: Check kernel supports resctrl filesystem
# # Pass: Check resctrl mountpoint "/sys/fs/resctrl" exists
# # resctrl filesystem not mounted
# # dmesg: [ 3.060018] resctrl: L3 allocation detected
# # dmesg: [ 3.098180] resctrl: MB allocation detected
# # dmesg: [ 3.118507] resctrl: L3 monitoring detected
# 1..4
# # Starting MBM BW change ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Mounting resctrl to "/sys/fs/resctrl"
# # Benchmark PID: 14784
# # Writing benchmark parameters to resctrl FS
# # Write schema "MB:0=100" to resctrl FS
# # Checking for pass/fail
# # Fail: Check MBM diff within 5%
# # avg_diff_per: 100%
# # Span (MB): 250
# # avg_bw_imc: 14185
# # avg_bw_resc: 28389
# not ok 1 MBM: bw change
# # Intel MBM may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.
# # Starting MBA Schemata change ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Mounting resctrl to "/sys/fs/resctrl"
# # Benchmark PID: 14787
# # Writing benchmark parameters to resctrl FS
# # Write schema "MB:0=100" to resctrl FS
# # Write schema "MB:0=90" to resctrl FS
# # Write schema "MB:0=80" to resctrl FS
# # Write schema "MB:0=70" to resctrl FS
# # Write schema "MB:0=60" to resctrl FS
# # Write schema "MB:0=50" to resctrl FS
# # Write schema "MB:0=40" to resctrl FS
# # Write schema "MB:0=30" to resctrl FS
# # Write schema "MB:0=20" to resctrl FS
# # Write schema "MB:0=10" to resctrl FS
# # Results are displayed in (MB)
# # Fail: Check MBA diff within 5% for schemata 100
# # avg_diff_per: 99%
# # avg_bw_imc: 14179
# # avg_bw_resc: 28340
# # Fail: Check MBA diff within 5% for schemata 90
# # avg_diff_per: 100%
# # avg_bw_imc: 9244
# # avg_bw_resc: 18497
# # Fail: Check MBA diff within 5% for schemata 80
# # avg_diff_per: 100%
# # avg_bw_imc: 9249
# # avg_bw_resc: 18504
# # Fail: Check MBA diff within 5% for schemata 70
# # avg_diff_per: 100%
# # avg_bw_imc: 9250
# # avg_bw_resc: 18506
# # Fail: Check MBA diff within 5% for schemata 60
# # avg_diff_per: 100%
# # avg_bw_imc: 7521
# # avg_bw_resc: 15055
# # Fail: Check MBA diff within 5% for schemata 50
# # avg_diff_per: 100%
# # avg_bw_imc: 7455
# # avg_bw_resc: 14917
# # Fail: Check MBA diff within 5% for schemata 40
# # avg_diff_per: 100%
# # avg_bw_imc: 5962
# # avg_bw_resc: 11934
# # Fail: Check MBA diff within 5% for schemata 30
# # avg_diff_per: 100%
# # avg_bw_imc: 4208
# # avg_bw_resc: 8436
# # Fail: Check MBA diff within 5% for schemata 20
# # avg_diff_per: 98%
# # avg_bw_imc: 2972
# # avg_bw_resc: 5909
# # Fail: Check MBA diff within 5% for schemata 10
# # avg_diff_per: 99%
# # avg_bw_imc: 1715
# # avg_bw_resc: 3426
# # Fail: Check schemata change using MBA
# # At least one test failed
# not ok 2 MBA: schemata change
# # Starting CMT test ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Mounting resctrl to "/sys/fs/resctrl"
# # Cache size :6488064
# # Benchmark PID: 14793
# # Writing benchmark parameters to resctrl FS
# # Checking for pass/fail
# # Fail: Check cache miss rate within 15%
# # Percent diff=91
# # Number of bits: 5
# # Average LLC val: 5640192
# # Cache span (bytes): 2949120
# not ok 3 CMT: test
# # Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.
# # Starting CAT test ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Mounting resctrl to "/sys/fs/resctrl"
# # Cache size :6488064
# # Writing benchmark parameters to resctrl FS
# # Write schema "L3:0=3f" to resctrl FS
# # Checking for pass/fail
# # Fail: Check cache miss rate within 4%
# # Percent diff=6
# # Number of bits: 6
# # Average LLC val: 51475
# # Cache span (lines): 55296
# not ok 4 CAT: test
# # Totals: pass:0 fail:4 xfail:0 xpass:0 skip:0 error:0
not ok 1 selftests: resctrl: resctrl_tests # exit=1
make: Leaving directory '/...l/tools/testing/selftests/resctrl'

Sub-NUMA disable:
xxx@xxx:~/linux_v6.5_rc1l$ sudo make -C tools/testing/selftests/resctrl run_tests
...
# # Starting CAT test ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Mounting resctrl to "/sys/fs/resctrl"
# # Cache size :6488064
# # Writing benchmark parameters to resctrl FS
# # Write schema "L3:0=3f" to resctrl FS
# # Checking for pass/fail
# # Fail: Check cache miss rate within 4%
# # Percent diff=6
# # Number of bits: 6
# # Average LLC val: 51899
# # Cache span (lines): 55296
# not ok 4 CAT: test
# # Totals: pass:3 fail:1 xfail:0 xpass:0 skip:0 error:0
not ok 1 selftests: resctrl: resctrl_tests # exit=1
make: Leaving directory '/.../tools/testing/selftests/resctrl'

Best regards,
Shaopeng TAN

2023-07-20 00:48:03

by Tony Luck

[permalink] [raw]
Subject: Re: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

Here's a quick hack to see how things might look with
separate domain lists in the "L3" resource.

For testing purposes on a non-SNC system I set ->mon_scope =
MON_SCOPE_NODE, but made domain_add_cpu() allocate the mondomains
list based on L3 scope ... just so I could check that I found all
the places where monitoring needs to use the mondomains list.
The kernel doesn't crash when running tools/testing/selftests/resctrl,
and the tests all pass. But that doesn't mean I didn't miss something.

Some restructuring of control vs. monitoring initialization might
avoid some of the code I duplicated in domain_add_cpu(). But this
is intended just as an "Is this what you meant?" before I dig deeper.

Overall, I think it is a cleaner approach than making a new
"L3" resource with different scope just for the SNC monitoring.

-Tony

---

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8334eeacfec5..e4b653088a22 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -151,9 +151,11 @@ struct resctrl_schema;
* @mon_capable: Is monitor feature available on this machine
* @num_rmid: Number of RMIDs available
* @cache_level: Which cache level defines scope of this resource
+ * @mon_scope: Scope of this resource if different from cache_level
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
* @domains: All domains for this resource
+ * @mondomains: Monitor domains for this resource (if mon_scope != 0)
* @name: Name to use in "schemata" file.
* @data_width: Character width of data when displaying
* @default_ctrl: Specifies default cache cbm or memory B/W percent.
@@ -169,9 +171,11 @@ struct rdt_resource {
bool mon_capable;
int num_rmid;
int cache_level;
+ int mon_scope;
struct resctrl_cache cache;
struct resctrl_membw membw;
struct list_head domains;
+ struct list_head mondomains;
char *name;
int data_width;
u32 default_ctrl;
@@ -184,6 +188,8 @@ struct rdt_resource {
bool cdp_capable;
};

+#define MON_SCOPE_NODE 1
+
/**
* struct resctrl_schema - configuration abilities of a resource presented to
* user-space
@@ -217,8 +223,8 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,

u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d, bool mon_setup);
+void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d, bool mon_teardown);

/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 85ceaf9a31ac..c5e2ac2a60cf 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -511,7 +511,7 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
+struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
struct list_head **pos);
ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 030d3b409768..545d563ba956 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -57,7 +57,7 @@ static void
mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r);

-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define domain_init(id, field) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.field)

struct rdt_hw_resource rdt_resources_all[] = {
[RDT_RESOURCE_L3] =
@@ -66,7 +66,9 @@ struct rdt_hw_resource rdt_resources_all[] = {
.rid = RDT_RESOURCE_L3,
.name = "L3",
.cache_level = 3,
- .domains = domain_init(RDT_RESOURCE_L3),
+ .mon_scope = MON_SCOPE_NODE, //FAKE
+ .domains = domain_init(RDT_RESOURCE_L3, domains),
+ .mondomains = domain_init(RDT_RESOURCE_L3, mondomains),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
@@ -80,7 +82,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.rid = RDT_RESOURCE_L2,
.name = "L2",
.cache_level = 2,
- .domains = domain_init(RDT_RESOURCE_L2),
+ .domains = domain_init(RDT_RESOURCE_L2, domains),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
@@ -94,7 +96,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.rid = RDT_RESOURCE_MBA,
.name = "MB",
.cache_level = 3,
- .domains = domain_init(RDT_RESOURCE_MBA),
+ .domains = domain_init(RDT_RESOURCE_MBA, domains),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
.fflags = RFTYPE_RES_MB,
@@ -106,7 +108,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.rid = RDT_RESOURCE_SMBA,
.name = "SMBA",
.cache_level = 3,
- .domains = domain_init(RDT_RESOURCE_SMBA),
+ .domains = domain_init(RDT_RESOURCE_SMBA, domains),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
.fflags = RFTYPE_RES_MB,
@@ -384,14 +386,15 @@ void rdt_ctrl_update(void *arg)
}

/*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Find a domain in one of the lists for a resource that
+ * matches input resource id
*
* Search resource r's domain list to find the resource id. If the resource
* id is found in a domain, return the domain. Otherwise, if requested by
* caller, return the first domain whose id is bigger than the input id.
* The domain list is sorted by id in ascending order.
*/
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
+struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
struct list_head **pos)
{
struct rdt_domain *d;
@@ -400,7 +403,7 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
if (id < 0)
return ERR_PTR(-ENODEV);

- list_for_each(l, &r->domains) {
+ list_for_each(l, h) {
d = list_entry(l, struct rdt_domain, list);
/* When id is found, return its domain. */
if (id == d->id)
@@ -508,7 +511,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
struct rdt_domain *d;
int err;

- d = rdt_find_domain(r, id, &add_pos);
+ d = rdt_find_domain(&r->domains, id, &add_pos);
if (IS_ERR(d)) {
pr_warn("Couldn't find cache id for CPU %d\n", cpu);
return;
@@ -536,6 +539,44 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;
}

+ if (!r->mon_scope && r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+ domain_free(hw_dom);
+ return;
+ }
+
+ list_add_tail(&d->list, add_pos);
+
+ err = resctrl_online_domain(r, d, r->mon_scope == 0);
+ if (err) {
+ list_del(&d->list);
+ domain_free(hw_dom);
+ }
+
+ if (r->mon_scope != MON_SCOPE_NODE)
+ return;
+
+ //id = cpu_to_node(cpu);
+ id = get_cpu_cacheinfo_id(cpu, r->cache_level); // FAKE
+ add_pos = NULL;
+ d = rdt_find_domain(&r->mondomains, id, &add_pos);
+ if (IS_ERR(d)) {
+ pr_warn("Couldn't find node id for CPU %d\n", cpu);
+ return;
+ }
+
+ if (d) {
+ cpumask_set_cpu(cpu, &d->cpu_mask);
+ return;
+ }
+
+ hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+ if (!hw_dom)
+ return;
+
+ d = &hw_dom->d_resctrl;
+ d->id = id;
+ cpumask_set_cpu(cpu, &d->cpu_mask);
+
if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
domain_free(hw_dom);
return;
@@ -543,7 +584,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)

list_add_tail(&d->list, add_pos);

- err = resctrl_online_domain(r, d);
+ err = resctrl_online_domain(r, d, true);
if (err) {
list_del(&d->list);
domain_free(hw_dom);
@@ -556,7 +597,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;

- d = rdt_find_domain(r, id, NULL);
+ d = rdt_find_domain(&r->domains, id, NULL);
if (IS_ERR_OR_NULL(d)) {
pr_warn("Couldn't find cache id for CPU %d\n", cpu);
return;
@@ -565,7 +606,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)

cpumask_clear_cpu(cpu, &d->cpu_mask);
if (cpumask_empty(&d->cpu_mask)) {
- resctrl_offline_domain(r, d);
+ resctrl_offline_domain(r, d, r->mon_scope == 0);
list_del(&d->list);

/*
@@ -579,7 +620,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
return;
}

- if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
+ if (r->mon_scope == 0 && r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
cancel_delayed_work(&d->mbm_over);
mbm_setup_overflow_handler(d, 0);
@@ -590,6 +631,23 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
cqm_setup_limbo_handler(d, 0);
}
}
+
+ if (r->mon_scope != MON_SCOPE_NODE)
+ return;
+
+ id = cpu_to_node(cpu);
+ d = rdt_find_domain(&r->mondomains, id, NULL);
+ if (IS_ERR_OR_NULL(d)) {
+ pr_warn("Couldn't find node id for CPU %d\n", cpu);
+ return;
+ }
+
+ cpumask_clear_cpu(cpu, &d->cpu_mask);
+ if (cpumask_empty(&d->cpu_mask)) {
+ resctrl_offline_domain(r, d, true);
+ list_del(&d->list);
+ domain_free(hw_dom);
+ }
}

static void clear_closid_rmid(int cpu)
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b44c487727d4..80033cb698d0 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -545,6 +545,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
struct rdt_resource *r;
union mon_data_bits md;
struct rdt_domain *d;
+ struct list_head *h;
struct rmid_read rr;
int ret = 0;

@@ -560,7 +561,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
evtid = md.u.evtid;

r = &rdt_resources_all[resid].r_resctrl;
- d = rdt_find_domain(r, domid, NULL);
+ h = r->mon_scope ? &r->mondomains : &r->domains;
+ d = rdt_find_domain(h, domid, NULL);
if (IS_ERR_OR_NULL(d)) {
ret = -ENOENT;
goto out;
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ded1fc7cb7cb..08085202582a 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -335,12 +335,14 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
struct rdt_domain *d;
+ struct list_head *h;
int cpu, err;
u64 val = 0;

entry->busy = 0;
cpu = get_cpu();
- list_for_each_entry(d, &r->domains, list) {
+ h = r->mon_scope ? &r->mondomains : &r->domains;
+ list_for_each_entry(d, h, list) {
if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
err = resctrl_arch_rmid_read(r, d, entry->rmid,
QOS_L3_OCCUP_EVENT_ID,
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 725344048f85..fb5b23fcb6d4 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1492,11 +1492,13 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
{
struct mon_config_info mon_info = {0};
struct rdt_domain *dom;
+ struct list_head *h;
bool sep = false;

mutex_lock(&rdtgroup_mutex);

- list_for_each_entry(dom, &r->domains, list) {
+ h = r->mon_scope ? &r->mondomains : &r->domains;
+ list_for_each_entry(dom, h, list) {
if (sep)
seq_puts(s, ";");

@@ -1599,6 +1601,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
char *dom_str = NULL, *id_str;
unsigned long dom_id, val;
struct rdt_domain *d;
+ struct list_head *h;
int ret = 0;

next:
@@ -1619,7 +1622,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
return -EINVAL;
}

- list_for_each_entry(d, &r->domains, list) {
+ h = r->mon_scope ? &r->mondomains : &r->domains;
+ list_for_each_entry(d, h, list) {
if (d->id == dom_id) {
ret = mbm_config_write_domain(r, d, evtid, val);
if (ret)
@@ -2465,6 +2469,7 @@ static int rdt_get_tree(struct fs_context *fc)
struct rdt_fs_context *ctx = rdt_fc2context(fc);
struct rdt_domain *dom;
struct rdt_resource *r;
+ struct list_head *h;
int ret;

cpus_read_lock();
@@ -2525,7 +2530,8 @@ static int rdt_get_tree(struct fs_context *fc)

if (is_mbm_enabled()) {
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- list_for_each_entry(dom, &r->domains, list)
+ h = r->mon_scope ? &r->mondomains : &r->domains;
+ list_for_each_entry(dom, h, list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
}

@@ -2917,9 +2923,11 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdtgroup *prgrp)
{
struct rdt_domain *dom;
+ struct list_head *h;
int ret;

- list_for_each_entry(dom, &r->domains, list) {
+ h = r->mon_scope ? &r->mondomains : &r->domains;
+ list_for_each_entry(dom, h, list) {
ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
if (ret)
return ret;
@@ -3708,14 +3716,14 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
kfree(d->mbm_local);
}

-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d, bool mon_teardown)
{
lockdep_assert_held(&rdtgroup_mutex);

if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
mba_sc_domain_destroy(r, d);

- if (!r->mon_capable)
+ if (!mon_teardown || !r->mon_capable)
return;

/*
@@ -3773,7 +3781,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d, bool mon_setup)
{
int err;

@@ -3783,7 +3791,7 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
/* RDT_RESOURCE_MBA is never mon_capable */
return mba_sc_domain_allocate(r, d);

- if (!r->mon_capable)
+ if (!mon_setup || !r->mon_capable)
return 0;

err = domain_setup_mon_state(r, d);

2023-07-20 18:12:48

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

Hi Tony,

On 7/18/2023 5:11 PM, Luck, Tony wrote:
>> Yes, this is the idea. Thank you for considering it. Something else
>> to consider that may make this even cleaner/simpler would be to review
>> struct rdt_domain and struct rdt_hw_domain members for "monitor" vs "control"
>> usage. These structs could potentially be split further into separate
>> "control" and "monitor" variants. For example, "struct rdt_domain" split into
>> "struct rdt_ctrl_domain" and "struct rdt_mon_domain". If there is a clean
>> split then resctrl can always create two lists with the unnecessary duplication
>> eliminated when two domain lists are created. This would also
>> eliminate the need to scatter ctrl_scope == mon_scope checks throughout.
>
> You might like what I'm doing in the "resctrl2" re-write[1]. Arch independent code
> that maintains the domain lists for a resource via a cpuhp notifier just has this
> for the domain structure:
>
> struct resctrl_domain {
> struct list_head list;
> struct cpumask cpu_mask;
> int id;
> int cache_size;
> };
>
> Each module managing a resource decides what extra information it wants to
> carry in the domain. So the above structure is common to all, but it is followed
> by whatever the resource module wants. E.g. the CBM masks for each CLOSid
> for the CAT module. The module tells core code the size to allocate.

hmmm ... what I am *hearing* you say is that the goodness from the
rewrite can be added to resctrl? :)

> "cache_size" is only there because the cache topology bits needed to discover
> sizes of caches aren't exported. Both the "size" file and pseudo-locking need
> to know the size.
>
> It's also possible that you may hate it. There is zero sharing of resource structures
> even if they have the same scope. This is because all modules are independently
> loadable.

Apologies but I am still unable to understand the problem statement that
motivates the rewrite.

Reinette

2023-07-20 18:13:34

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

Hi Tony,

On 7/19/2023 5:20 PM, Tony Luck wrote:
> Here's a quick hack to see how things might look with
> separate domain lists in the "L3" resource.
>
> For testing purposes on a non-SNC system I set ->mon_scope =
> MON_SCOPE_NODE, but made domain_add_cpu() allocate the mondomains
> list based on L3 scope ... just so I could check that I found all
> the places where monitoring needs to use the mondomains list.
> The kernel doesn't crash when running tools/testing/selftests/resctrl,
> and the tests all pass. But that doesn't mean I didn't miss something.
>
> Some restructuring of control vs. monitoring initialization might
> avoid some of the code I duplicated in domain_add_cpu(). But this
> is intended just as an "Is this what you meant?" before I dig deeper.

Thank you for considering the approach. I find that this sample moves
towards the idea while also highlighting what else can be considered.
I do not know if you already considered these ideas and found it flawed
so I will try to make it explicit so that you can point out to me where
things will fall apart.

The sample code introduces a new list "mondomains" that is intended to
be used when the monitoring scope is different from the allocation scope.
This introduces duplication when the monitoring and allocation scope is
different. Each list, "domains" and "mondomains" will host structures
that can accommodate both monitoring and allocation data, with the data
not relevant to the list going unused as it is unnecessarily duplicated.
Additionally this forces significant portions of resctrl to now always
consider whether the monitoring and allocation scope is different ...
note how this sample now has code like below scattered throughout.
h = r->mon_scope ? &r->mondomains : &r->domains;

I also find the domain_add_cpu() becoming intricate as it needs to
navigate all the different scenarios.

This unnecessary duplication, new environment checks needed throughout,
and additional code complexities are red flags to me that this solution
is not well integrated.

To deal with these complexities I would like to consider if it may
make things simpler to always (irrespective of allocation and
monitoring scope) maintain allocation and monitoring domain lists.
Each list need only carry data appropriate to its use ... the allocation
list only has data relevant to allocation, the monitoring list only
has data relevant to monitoring. This is the struct rdt_domain related
split I mentioned previously.
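
A sketch of the suggested split (fields chosen for illustration only):

	struct rdt_ctrl_domain {
		struct list_head	list;
		int			id;
		struct cpumask		cpu_mask;
		u32			*mbps_val;	/* control-only state */
	};

	struct rdt_mon_domain {
		struct list_head	list;
		int			id;
		struct cpumask		cpu_mask;
		struct mbm_state	*mbm_total;	/* monitor-only state */
		struct mbm_state	*mbm_local;
	};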

Code could become something like:

resctrl_online_cpu()
{
...
for_each_alloc_capable_rdt_resource(r)
alloc_domain_add_cpu(...)
for_each_mon_capable_rdt_resource(r)
mon_domain_add_cpu(...)
...
}

This would reduce complication in domain_add_cpu() since each domain list
only need to concern itself with monitoring or allocation.

Even resctrl_online_domain() can be simplified significantly by
making it specific to allocation or monitoring. For example,
resctrl_online_mon_domain() would only and always just run
the monitoring related code.

With the separate allocation and monitoring domain lists there
may no longer be a need for scattering code with checks like:
h = r->mon_scope ? &r->mondomains : &r->domains;
This would be because the code can directly pick the domain
list it is operating on.

What do you think? The above is just refactoring of existing
code and from what I can tell this would make supporting
SNC straightforward.

Reinette

2023-07-20 22:18:27

by Tony Luck

[permalink] [raw]
Subject: RE: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

> To deal with these complexities I would like to consider if it may
> make things simpler to always (irrespective of allocation and
> monitoring scope) maintain allocation and monitoring domain lists.
> Each list need only carry data appropriate to its use ... the allocation
> list only has data relevant to allocation, the monitoring list only
> has data relevant to monitoring. This is the struct rdt_domain related
> split I mentioned previously.
>
> Code could become something like:

> resctrl_online_cpu()
> {
> ...
> for_each_alloc_capable_rdt_resource(r)
> alloc_domain_add_cpu(...)
> for_each_mon_capable_rdt_resource(r)
> mon_domain_add_cpu(...)
> ...
> }

> This would reduce complication in domain_add_cpu() since each domain list
> only need to concern itself with monitoring or allocation.

This does seem a worthy target.

I started on a patch to do this ... but I'm not sure I have the stamina or the time
to see it through.

I split struct rdt_domain into rdt_ctrl_domain and rdt_mon_domain. But that
led to also splitting the rdt_hw_domain structure into two, and then splitting
the resctrl_to_arch_dom() function, and then another and another.

That process will eventually converge (there are a finite number of lines
of code) ... but it will be a big patch. I don't see how to stage it a piece
at a time.
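
For example, the arch-private wrapper and its accessor each end up with
two variants; a sketch of the monitor side (field names illustrative):

	struct rdt_hw_mon_domain {
		struct rdt_mon_domain	d_resctrl;
		struct arch_mbm_state	*arch_mbm_total;
		struct arch_mbm_state	*arch_mbm_local;
	};

	static inline struct rdt_hw_mon_domain *
	resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
	{
		return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
	}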

-Tony

2023-07-22 19:12:57

by Tony Luck

[permalink] [raw]
Subject: [PATCH v4 0/7] Add support for Sub-NUMA cluster (SNC) systems

The Sub-NUMA cluster feature on some Intel processors partitions
the CPUs that share an L3 cache into two or more sets. This plays
havoc with the Resource Director Technology (RDT) monitoring features.
Prior to this patch Intel has advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID
counters in the same way. This allows the monitoring features
to be used (with the caveat that memory accesses between different
SNC NUMA nodes may still not be counted accurately).

Signed-off-by: Tony Luck <[email protected]>

---

Changes since v3:

Reinette provided the most excellent suggestion that this series
could better achieve its objective if it enabled separate domain
lists for control & monitoring within a resource, rather than
creating a whole new resource to support the separate node scope needed
for SNC monitoring. Thus all the pre-amble patches from the previous
version have gone, replaced by patches 1-4 of this new series.

Note to anyone backporting this to some older Linux kernel version.
You may be able to skip parts 2-4. These provide separate domain
structures for control and monitor with just the fields needed for
each. But this is largely cosmetic.

Of the code from v3 that survived to v4 the following changes have
been made (also from Reinette's review of v3).

1) Rename "snc_ways" to "snc_nodes_per_l3_cache" to avoid the confusing
use of "ways" which means something entirely different when talking
about caches.
2) Move the #define for MSR_RMID_SNC_CONFIG to <asm/msr-index.h> along
with all the other RDT MSRs.
3) Don't use a per-CPU variable "rmid_offset". Just calculate the value
needed at the one place where it is used (a sketch follows this list).
4) Don't create an entire resource structure with package scoped domains
just to set the SNC MSR.
5) Add comment in the commit message about adjusting the value shown in
the "size" files in each resctrl ctrl_mon directory.

This one not from Reinette:
6) Prevent mounting in "mba_MBps" mode when SNC mode is enabled. This
would just be confusing since monitoring is done at the node scope while
control is still at package scope.
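
For item 3, a sketch of calculating the offset at the point of use
(illustrative only, not the literal patch code; assumes SNC nodes on a
socket are numbered consecutively):

	static u32 snc_adjusted_rmid(struct rdt_resource *r, u32 rmid)
	{
		if (snc_nodes_per_l3_cache > 1)
			rmid += (cpu_to_node(smp_processor_id()) %
				 snc_nodes_per_l3_cache) * r->num_rmid;
		return rmid;
	}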

Tony Luck (7):
x86/resctrl: Create separate domains for control and monitoring
x86/resctrl: Split the rdt_domain structures
x86/resctrl: Change monitor code to use rdt_mondomain
x86/resctrl: Delete unused fields from struct rdt_domain
x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
x86/resctrl: Update documentation with Sub-NUMA cluster changes
selftests/resctrl: Adjust effective L3 cache size when SNC enabled

Documentation/arch/x86/resctrl.rst | 10 +-
include/linux/resctrl.h | 50 +++-
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 40 ++-
tools/testing/selftests/resctrl/resctrl.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 289 ++++++++++++++++----
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 6 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 58 ++--
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 54 ++--
tools/testing/selftests/resctrl/resctrlfs.c | 57 ++++
10 files changed, 427 insertions(+), 139 deletions(-)


base-commit: fdf0eaf11452d72945af31804e2a1048ee1b574c
--
2.40.1


2023-07-22 19:31:25

by Tony Luck

[permalink] [raw]
Subject: [PATCH v4 1/7] x86/resctrl: Create separate domains for control and monitoring

First step towards supporting resource control where the scope of
control operations is not the same as monitor operations.

Add an extra list in the rdt_resource structure. For now this will
just duplicate the existing list of domains based on the L3 cache
scope.

Refactor the domain_add_cpu() and domain_remove_cpu() functions to
build separate lists for r->alloc_capable and r->mon_capable
resources. Note that only the "L3" domain currently supports
both types.

Change all places where monitoring functions walk the list of
domains to use the new "mondomains" list instead of the old
"domains" list.

Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 10 +-
arch/x86/kernel/cpu/resctrl/internal.h | 2 +-
arch/x86/kernel/cpu/resctrl/core.c | 195 +++++++++++++++-------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 2 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 30 ++--
6 files changed, 167 insertions(+), 74 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8334eeacfec5..1267d56f9e76 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -151,9 +151,11 @@ struct resctrl_schema;
* @mon_capable: Is monitor feature available on this machine
* @num_rmid: Number of RMIDs available
* @cache_level: Which cache level defines scope of this resource
+ * @mon_scope: Scope of this resource if different from cache_level
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
* @domains: All domains for this resource
+ * @mondomains: Monitor domains for this resource
* @name: Name to use in "schemata" file.
* @data_width: Character width of data when displaying
* @default_ctrl: Specifies default cache cbm or memory B/W percent.
@@ -169,9 +171,11 @@ struct rdt_resource {
bool mon_capable;
int num_rmid;
int cache_level;
+ int mon_scope;
struct resctrl_cache cache;
struct resctrl_membw membw;
struct list_head domains;
+ struct list_head mondomains;
char *name;
int data_width;
u32 default_ctrl;
@@ -217,8 +221,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,

u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);

/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 85ceaf9a31ac..c5e2ac2a60cf 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -511,7 +511,7 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
+struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
struct list_head **pos);
ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 030d3b409768..274605aaa026 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -57,7 +57,7 @@ static void
mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r);

-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define domain_init(id, field) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.field)

struct rdt_hw_resource rdt_resources_all[] = {
[RDT_RESOURCE_L3] =
@@ -66,7 +66,9 @@ struct rdt_hw_resource rdt_resources_all[] = {
.rid = RDT_RESOURCE_L3,
.name = "L3",
.cache_level = 3,
- .domains = domain_init(RDT_RESOURCE_L3),
+ .mon_scope = 3,
+ .domains = domain_init(RDT_RESOURCE_L3, domains),
+ .mondomains = domain_init(RDT_RESOURCE_L3, mondomains),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
@@ -80,7 +82,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.rid = RDT_RESOURCE_L2,
.name = "L2",
.cache_level = 2,
- .domains = domain_init(RDT_RESOURCE_L2),
+ .domains = domain_init(RDT_RESOURCE_L2, domains),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
@@ -94,7 +96,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.rid = RDT_RESOURCE_MBA,
.name = "MB",
.cache_level = 3,
- .domains = domain_init(RDT_RESOURCE_MBA),
+ .domains = domain_init(RDT_RESOURCE_MBA, domains),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
.fflags = RFTYPE_RES_MB,
@@ -106,7 +108,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.rid = RDT_RESOURCE_SMBA,
.name = "SMBA",
.cache_level = 3,
- .domains = domain_init(RDT_RESOURCE_SMBA),
+ .domains = domain_init(RDT_RESOURCE_SMBA, domains),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
.fflags = RFTYPE_RES_MB,
@@ -384,14 +386,15 @@ void rdt_ctrl_update(void *arg)
}

/*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Find a domain in one of the lists for a resource that
+ * matches input resource id
*
* Search resource r's domain list to find the resource id. If the resource
* id is found in a domain, return the domain. Otherwise, if requested by
* caller, return the first domain whose id is bigger than the input id.
* The domain list is sorted by id in ascending order.
*/
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
+struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
struct list_head **pos)
{
struct rdt_domain *d;
@@ -400,7 +403,7 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
if (id < 0)
return ERR_PTR(-ENODEV);

- list_for_each(l, &r->domains) {
+ list_for_each(l, h) {
d = list_entry(l, struct rdt_domain, list);
/* When id is found, return its domain. */
if (id == d->id)
@@ -487,6 +490,94 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
return 0;
}

+static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
+{
+ int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ struct list_head *add_pos = NULL;
+ struct rdt_hw_domain *hw_dom;
+ struct rdt_domain *d;
+ int err;
+
+ d = rdt_find_domain(&r->domains, id, &add_pos);
+ if (IS_ERR(d)) {
+ pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+ return;
+ }
+
+ if (d) {
+ cpumask_set_cpu(cpu, &d->cpu_mask);
+ if (r->cache.arch_has_per_cpu_cfg)
+ rdt_domain_reconfigure_cdp(r);
+ return;
+ }
+
+ hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+ if (!hw_dom)
+ return;
+
+ d = &hw_dom->d_resctrl;
+ d->id = id;
+ cpumask_set_cpu(cpu, &d->cpu_mask);
+
+ rdt_domain_reconfigure_cdp(r);
+
+ if (domain_setup_ctrlval(r, d)) {
+ domain_free(hw_dom);
+ return;
+ }
+
+ list_add_tail(&d->list, add_pos);
+
+ err = resctrl_online_ctrl_domain(r, d);
+ if (err) {
+ list_del(&d->list);
+ domain_free(hw_dom);
+ }
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+ int id = get_cpu_cacheinfo_id(cpu, r->mon_scope);
+ struct list_head *add_pos = NULL;
+ struct rdt_hw_domain *hw_dom;
+ struct rdt_domain *d;
+ int err;
+
+ d = rdt_find_domain(&r->mondomains, id, &add_pos);
+ if (IS_ERR(d)) {
+ pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+ return;
+ }
+
+ if (d) {
+ cpumask_set_cpu(cpu, &d->cpu_mask);
+ if (r->cache.arch_has_per_cpu_cfg)
+ rdt_domain_reconfigure_cdp(r);
+ return;
+ }
+
+ hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+ if (!hw_dom)
+ return;
+
+ d = &hw_dom->d_resctrl;
+ d->id = id;
+ cpumask_set_cpu(cpu, &d->cpu_mask);
+
+ if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+ domain_free(hw_dom);
+ return;
+ }
+
+ list_add_tail(&d->list, add_pos);
+
+ err = resctrl_online_mon_domain(r, d);
+ if (err) {
+ list_del(&d->list);
+ domain_free(hw_dom);
+ }
+}
+
/*
* domain_add_cpu - Add a cpu to a resource's domain list.
*
@@ -502,61 +593,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
*/
static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
- struct list_head *add_pos = NULL;
- struct rdt_hw_domain *hw_dom;
- struct rdt_domain *d;
- int err;
-
- d = rdt_find_domain(r, id, &add_pos);
- if (IS_ERR(d)) {
- pr_warn("Couldn't find cache id for CPU %d\n", cpu);
- return;
- }
-
- if (d) {
- cpumask_set_cpu(cpu, &d->cpu_mask);
- if (r->cache.arch_has_per_cpu_cfg)
- rdt_domain_reconfigure_cdp(r);
- return;
- }
-
- hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
- if (!hw_dom)
- return;
-
- d = &hw_dom->d_resctrl;
- d->id = id;
- cpumask_set_cpu(cpu, &d->cpu_mask);
-
- rdt_domain_reconfigure_cdp(r);
-
- if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
- domain_free(hw_dom);
- return;
- }
-
- if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
- domain_free(hw_dom);
- return;
- }
-
- list_add_tail(&d->list, add_pos);
-
- err = resctrl_online_domain(r, d);
- if (err) {
- list_del(&d->list);
- domain_free(hw_dom);
- }
+ if (r->alloc_capable)
+ domain_add_cpu_ctrl(cpu, r);
+ if (r->mon_capable)
+ domain_add_cpu_mon(cpu, r);
}

-static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
{
int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;

- d = rdt_find_domain(r, id, NULL);
+ d = rdt_find_domain(&r->domains, id, NULL);
if (IS_ERR_OR_NULL(d)) {
pr_warn("Couldn't find cache id for CPU %d\n", cpu);
return;
@@ -565,7 +614,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)

cpumask_clear_cpu(cpu, &d->cpu_mask);
if (cpumask_empty(&d->cpu_mask)) {
- resctrl_offline_domain(r, d);
+ resctrl_offline_ctrl_domain(r, d);
list_del(&d->list);

/*
@@ -578,6 +627,30 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)

return;
}
+}
+
+static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
+{
+ int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ struct rdt_hw_domain *hw_dom;
+ struct rdt_domain *d;
+
+ d = rdt_find_domain(&r->mondomains, id, NULL);
+ if (IS_ERR_OR_NULL(d)) {
+ pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+ return;
+ }
+ hw_dom = resctrl_to_arch_dom(d);
+
+ cpumask_clear_cpu(cpu, &d->cpu_mask);
+ if (cpumask_empty(&d->cpu_mask)) {
+ resctrl_offline_mon_domain(r, d);
+ list_del(&d->list);
+
+ domain_free(hw_dom);
+
+ return;
+ }

if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
@@ -592,6 +665,14 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
}
}

+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+{
+ if (r->alloc_capable)
+ domain_remove_cpu_ctrl(cpu, r);
+ if (r->mon_capable)
+ domain_remove_cpu_mon(cpu, r);
+}
+
static void clear_closid_rmid(int cpu)
{
struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b44c487727d4..839df83d1a0a 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -560,7 +560,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
evtid = md.u.evtid;

r = &rdt_resources_all[resid].r_resctrl;
- d = rdt_find_domain(r, domid, NULL);
+ d = rdt_find_domain(&r->mondomains, domid, NULL);
if (IS_ERR_OR_NULL(d)) {
ret = -ENOENT;
goto out;
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ded1fc7cb7cb..66beca785535 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)

entry->busy = 0;
cpu = get_cpu();
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->mondomains, list) {
if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
err = resctrl_arch_rmid_read(r, d, entry->rmid,
QOS_L3_OCCUP_EVENT_ID,
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 725344048f85..27753eb5d513 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1496,7 +1496,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid

mutex_lock(&rdtgroup_mutex);

- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->mondomains, list) {
if (sep)
seq_puts(s, ";");

@@ -1619,7 +1619,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
return -EINVAL;
}

- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->mondomains, list) {
if (d->id == dom_id) {
ret = mbm_config_write_domain(r, d, evtid, val);
if (ret)
@@ -2525,7 +2525,7 @@ static int rdt_get_tree(struct fs_context *fc)

if (is_mbm_enabled()) {
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- list_for_each_entry(dom, &r->domains, list)
+ list_for_each_entry(dom, &r->mondomains, list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
}

@@ -2919,7 +2919,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_domain *dom;
int ret;

- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->mondomains, list) {
ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
if (ret)
return ret;
@@ -3708,15 +3708,17 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
kfree(d->mbm_local);
}

-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
{
lockdep_assert_held(&rdtgroup_mutex);

if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
mba_sc_domain_destroy(r, d);
+}

- if (!r->mon_capable)
- return;
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+ lockdep_assert_held(&rdtgroup_mutex);

/*
* If resctrl is mounted, remove all the
@@ -3773,18 +3775,22 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
{
- int err;
-
lockdep_assert_held(&rdtgroup_mutex);

if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
/* RDT_RESOURCE_MBA is never mon_capable */
return mba_sc_domain_allocate(r, d);

- if (!r->mon_capable)
- return 0;
+ return 0;
+}
+
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+ int err;
+
+ lockdep_assert_held(&rdtgroup_mutex);

err = domain_setup_mon_state(r, d);
if (err)
--
2.40.1


2023-07-22 19:31:37

by Tony Luck

[permalink] [raw]
Subject: [PATCH v4 7/7] selftests/resctrl: Adjust effective L3 cache size when SNC enabled

Sub-NUMA Cluster divides CPUs sharing an L3 cache into separate NUMA
nodes. Systems may support splitting into either two or four nodes.

When SNC mode is enabled, the effective amount of L3 cache available
for allocation is divided by the number of nodes per L3.

Detect which SNC mode is active by comparing the number of CPUs
that share a cache with CPU0 against the number of CPUs on node0.
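
To make the comparison concrete, here is a sketch with made-up
numbers (the real detection is done by snc_ways() in the patch
below):

    /*
     * Illustration only: an SNC-2 system where 56 CPUs share the
     * L3 cache of CPU0.
     *
     *   cache_cpus = 56;  bits set in cpu0/cache/index3/shared_cpu_map
     *   node_cpus  = 28;  bits set in node/node0/cpumap
     *
     * cache_cpus / node_cpus == 2  =>  two SNC nodes per L3 cache
     */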

Reported-by: "Shaopeng Tan (Fujitsu)" <[email protected]>
Closes: https://lore.kernel.org/r/TYAPR01MB6330B9B17686EF426D2C3F308B25A@TYAPR01MB6330.jpnprd01.prod.outlook.com
Signed-off-by: Tony Luck <[email protected]>
---
tools/testing/selftests/resctrl/resctrl.h | 1 +
tools/testing/selftests/resctrl/resctrlfs.c | 57 +++++++++++++++++++++
2 files changed, 58 insertions(+)

diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
index 87e39456dee0..a8b43210b573 100644
--- a/tools/testing/selftests/resctrl/resctrl.h
+++ b/tools/testing/selftests/resctrl/resctrl.h
@@ -13,6 +13,7 @@
#include <signal.h>
#include <dirent.h>
#include <stdbool.h>
+#include <ctype.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/mount.h>
diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
index fb00245dee92..79eecbf9f863 100644
--- a/tools/testing/selftests/resctrl/resctrlfs.c
+++ b/tools/testing/selftests/resctrl/resctrlfs.c
@@ -130,6 +130,61 @@ int get_resource_id(int cpu_no, int *resource_id)
return 0;
}

+/*
+ * Count number of CPUs in a /sys bit map
+ */
+static int count_sys_bitmap_bits(char *name)
+{
+ FILE *fp = fopen(name, "r");
+ int count = 0, c;
+
+ if (!fp)
+ return 0;
+
+ while ((c = fgetc(fp)) != EOF) {
+ if (!isxdigit(c))
+ continue;
+ switch (c) {
+ case 'f':
+ count++;
+ case '7': case 'b': case 'd': case 'e':
+ count++;
+ case '3': case '5': case '6': case '9': case 'a': case 'c':
+ count++;
+ case '1': case '2': case '4': case '8':
+ count++;
+ }
+ }
+ fclose(fp);
+
+ return count;
+}
+
+/*
+ * Detect SNC by comparing #CPUs in node0 with #CPUs sharing LLC with CPU0.
+ * Try to get this right, even if a few CPUs are offline so that the number
+ * of CPUs in node0 is not exactly half or a quarter of the CPUs sharing the
+ * LLC of CPU0.
+ */
+static int snc_ways(void)
+{
+ int node_cpus, cache_cpus;
+
+ node_cpus = count_sys_bitmap_bits("/sys/devices/system/node/node0/cpumap");
+ cache_cpus = count_sys_bitmap_bits("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map");
+
+ if (!node_cpus || !cache_cpus) {
+ fprintf(stderr, "Warning: could not determine Sub-NUMA Cluster mode\n");
+ return 1;
+ }
+
+ if (3 * node_cpus < cache_cpus)
+ return 4;
+ else if (3 * node_cpus < 2 * cache_cpus)
+ return 2;
+ return 1;
+}
+
/*
* get_cache_size - Get cache size for a specified CPU
* @cpu_no: CPU number
@@ -190,6 +245,8 @@ int get_cache_size(int cpu_no, char *cache_type, unsigned long *cache_size)
break;
}

+ if (cache_num == 3)
+ *cache_size /= snc_ways();
return 0;
}

--
2.40.1


2023-07-22 19:31:37

by Tony Luck

[permalink] [raw]
Subject: [PATCH v4 5/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

There isn't a simple hardware enumeration to indicate to software that
a system is running with Sub-NUMA Cluster enabled.

Compare the number of NUMA nodes with the number of L3 caches to calculate
the number of Sub-NUMA nodes per L3 cache.

When Sub-NUMA Cluster mode is enabled in BIOS setup, the RMID counters
are distributed equally between the SNC nodes within each socket.

E.g. if there are 400 RMID counters, and the system is configured with
two SNC nodes per socket, then RMID counters 0..199 are used on SNC node
0 on the socket, and RMID counters 200..399 on SNC node 1.

A model specific MSR (0xca0) can change the configuration of the RMIDs
when SNC mode is enabled.

The MSR controls the interpretation of the RMID field in the
IA32_PQR_ASSOC MSR so that the appropriate hardware counters
within the SNC node are updated.

To read the RMID counters, an offset must be used to get data
from the physical counter associated with the SNC node. As in
the example above with 400 RMID counters, Linux sees only 200
counters. No special action is needed to read a counter from
the first SNC node on a socket. But to read Linux-visible
counter 50 on the second SNC node, the kernel must load 250
into the QM_EVTSEL MSR.
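
The offset arithmetic amounts to the following (a minimal sketch;
the names are illustrative, the actual code in the patch below uses
cpu_to_node() and r->num_rmid):

    /* Illustration only: map a Linux-visible RMID to the physical RMID */
    physical_rmid = rmid + snc_node_on_socket * rmids_per_snc_node;

    /*
     * e.g. Linux RMID 50 on SNC node 1 with 200 RMIDs per node:
     * 50 + 1 * 200 = 250 is loaded into QM_EVTSEL
     */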

N.B. this works well for well-behaved NUMA applications that access
memory predominantly from the local memory node. For applications that
access memory across multiple nodes it may be necessary for the user
to read counters for all SNC nodes on a socket and add the values to
get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
all that different from applications that span across multiple sockets
in a legacy system.

The cache allocation feature still provides the same number of
bits in a mask to control allocation into the L3 cache. But each
of those ways has its capacity reduced because the cache is divided
between the SNC nodes. Adjust the value reported in the resctrl
"size" file accordingly.

Mounting the file system with the "mba_MBps" option is disabled
when SNC mode is enabled. This is because the measurement of bandwidth
is per SNC node, while the MBA throttling controls are still at
the L3 cache scope.

Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 2 +
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 2 +
arch/x86/kernel/cpu/resctrl/core.c | 82 +++++++++++++++++++++++++-
arch/x86/kernel/cpu/resctrl/monitor.c | 18 +++++-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 +-
6 files changed, 103 insertions(+), 6 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 80a89d171eba..576dc21bd990 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -200,6 +200,8 @@ struct rdt_resource {
bool cdp_capable;
};

+#define MON_SCOPE_NODE 100
+
/**
* struct resctrl_schema - configuration abilities of a resource presented to
* user-space
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3aedae61af4f..4b624a37d64a 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1087,6 +1087,7 @@
#define MSR_IA32_QM_CTR 0xc8e
#define MSR_IA32_PQR_ASSOC 0xc8f
#define MSR_IA32_L3_CBM_BASE 0xc90
+#define MSR_RMID_SNC_CONFIG 0xca0
#define MSR_IA32_L2_CBM_BASE 0xd10
#define MSR_IA32_MBA_THRTL_BASE 0xd50

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 016ef0373c5a..00a330bc5ced 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);

extern struct dentry *debugfs_resctrl;

+extern int snc_nodes_per_l3_cache;
+
enum resctrl_res_level {
RDT_RESOURCE_L3,
RDT_RESOURCE_L2,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 0161362b0c3e..1331add347fc 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@

#define pr_fmt(fmt) "resctrl: " fmt

+#include <linux/cpu.h>
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/cacheinfo.h>
#include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>

+#include <asm/cpu_device_id.h>
#include <asm/intel-family.h>
#include <asm/resctrl.h>
#include "internal.h"
@@ -48,6 +51,13 @@ int max_name_width, max_data_width;
*/
bool rdt_alloc_capable;

+/*
+ * Number of SNC nodes that share each L3 cache.
+ * Default is 1 for systems that do not support
+ * SNC, or have SNC disabled.
+ */
+int snc_nodes_per_l3_cache = 1;
+
static void
mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r);
@@ -543,9 +553,16 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
}
}

+static int get_mon_scope_id(int cpu, int scope)
+{
+ if (scope == MON_SCOPE_NODE)
+ return cpu_to_node(cpu);
+ return get_cpu_cacheinfo_id(cpu, scope);
+}
+
static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->mon_scope);
+ int id = get_mon_scope_id(cpu, r->mon_scope);
struct list_head *add_pos = NULL;
struct rdt_hw_mondomain *hw_mondom;
struct rdt_mondomain *d;
@@ -692,11 +709,28 @@ static void clear_closid_rmid(int cpu)
wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
}

+static void snc_remap_rmids(int cpu)
+{
+ u64 val;
+
+ /* Only need to enable once per package */
+ if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
+ return;
+
+ rdmsrl(MSR_RMID_SNC_CONFIG, val);
+ val &= ~BIT_ULL(0);
+ wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
static int resctrl_online_cpu(unsigned int cpu)
{
struct rdt_resource *r;

mutex_lock(&rdtgroup_mutex);
+
+ if (snc_nodes_per_l3_cache > 1)
+ snc_remap_rmids(cpu);
+
for_each_capable_rdt_resource(r)
domain_add_cpu(cpu, r);
/* The cpu is set in default rdtgroup after online. */
@@ -951,11 +985,57 @@ static __init bool get_rdt_resources(void)
return (rdt_mon_capable || rdt_alloc_capable);
}

+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+ {}
+};
+
+/*
+ * There isn't a simple enumeration bit to show whether SNC mode
+ * is enabled. Look at the ratio of number of NUMA nodes to the
+ * number of distinct L3 caches. Take care to skip memory-only nodes.
+ */
+static __init int get_snc_config(void)
+{
+ unsigned long *node_caches;
+ int mem_only_nodes = 0;
+ int cpu, node, ret;
+
+ if (!x86_match_cpu(snc_cpu_ids))
+ return 1;
+
+ node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
+ if (!node_caches)
+ return 1;
+
+ cpus_read_lock();
+ for_each_node(node) {
+ cpu = cpumask_first(cpumask_of_node(node));
+ if (cpu < nr_cpu_ids)
+ set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
+ else
+ mem_only_nodes++;
+ }
+ cpus_read_unlock();
+
+ ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);
+ kfree(node_caches);
+
+ if (ret > 1)
+ rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = MON_SCOPE_NODE;
+
+ return ret;
+}
+
static __init void rdt_init_res_defs_intel(void)
{
struct rdt_hw_resource *hw_res;
struct rdt_resource *r;

+ snc_nodes_per_l3_cache = get_snc_config();
+
for_each_rdt_resource(r) {
hw_res = resctrl_to_arch_res(r);

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 0d9605fccb34..4ca064e62911 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)

static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ int cpu = get_cpu();
+ int rmid_offset = 0;
u64 msr_val;

+ /*
+ * When SNC mode is on, need to compute the offset to read the
+ * physical RMID counter for the node to which this CPU belongs
+ */
+ if (snc_nodes_per_l3_cache > 1)
+ rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+
/*
* As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
* with a valid event code for supported resource type and the bits
@@ -158,9 +168,11 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
* IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
* are error bits.
*/
- wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+ wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
rdmsrl(MSR_IA32_QM_CTR, msr_val);

+ put_cpu();
+
if (msr_val & RMID_VAL_ERROR)
return -EIO;
if (msr_val & RMID_VAL_UNAVAIL)
@@ -783,8 +795,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
int ret;

resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
- hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
- r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+ hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+ r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;

if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 4a268df9b456..d831b21f7389 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1354,7 +1354,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
}
}

- return size;
+ return size / snc_nodes_per_l3_cache;
}

/**
@@ -2587,7 +2587,7 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
ctx->enable_cdpl2 = true;
return 0;
case Opt_mba_mbps:
- if (!supports_mba_mbps())
+ if (!supports_mba_mbps() || snc_nodes_per_l3_cache > 1)
return -EINVAL;
ctx->enable_mba_mbps = true;
return 0;
--
2.40.1


2023-07-22 19:31:41

by Tony Luck

[permalink] [raw]
Subject: [PATCH v4 2/7] x86/resctrl: Split the rdt_domain structures

The rdt_domain and rdt_hw_domain structures contain an amalgam of
fields used by control and monitoring features. Now that there
are separate domain lists for control/monitoring, these can be
divided between two structures.

First step: Add new domain structures for monitoring with the
fields that are needed. Leave these fields in the legacy structure
so compilation won't fail. They will be deleted once all the
monitoring code has been converted to use the new structure.

Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 28 +++++++++++++++++++++++++-
arch/x86/kernel/cpu/resctrl/internal.h | 17 +++++++++++++++-
2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 1267d56f9e76..475912662e47 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -53,7 +53,7 @@ struct resctrl_staged_config {
};

/**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain - group of CPUs sharing a resctrl control resource
* @list: all instances of this resource
* @id: unique id for this instance
* @cpu_mask: which CPUs share this resource
@@ -86,6 +86,32 @@ struct rdt_domain {
u32 *mbps_val;
};

+/**
+ * struct rdt_mondomain - group of CPUs sharing a resctrl monitor resource
+ * @list: all instances of this resource
+ * @id: unique id for this instance
+ * @cpu_mask: which CPUs share this resource
+ * @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
+ * @mbm_total: saved state for MBM total bandwidth
+ * @mbm_local: saved state for MBM local bandwidth
+ * @mbm_over: worker to periodically read MBM h/w counters
+ * @cqm_limbo: worker to periodically read CQM h/w counters
+ * @mbm_work_cpu: worker CPU for MBM h/w counters
+ * @cqm_work_cpu: worker CPU for CQM h/w counters
+ */
+struct rdt_mondomain {
+ struct list_head list;
+ int id;
+ struct cpumask cpu_mask;
+ unsigned long *rmid_busy_llc;
+ struct mbm_state *mbm_total;
+ struct mbm_state *mbm_local;
+ struct delayed_work mbm_over;
+ struct delayed_work cqm_limbo;
+ int mbm_work_cpu;
+ int cqm_work_cpu;
+};
+
/**
* struct resctrl_cache - Cache allocation related data
* @cbm_len: Length of the cache bit mask
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index c5e2ac2a60cf..e956090a874e 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -320,7 +320,7 @@ struct arch_mbm_state {

/**
* struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
- * a resource
+ * a control resource
* @d_resctrl: Properties exposed to the resctrl file system
* @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID)
* @arch_mbm_total: arch private state for MBM total bandwidth
@@ -335,6 +335,21 @@ struct rdt_hw_domain {
struct arch_mbm_state *arch_mbm_local;
};

+/**
+ * struct rdt_hw_mondomain - Arch private attributes of a set of CPUs that share
+ * a monitor resource
+ * @d_resctrl: Properties exposed to the resctrl file system
+ * @arch_mbm_total: arch private state for MBM total bandwidth
+ * @arch_mbm_local: arch private state for MBM local bandwidth
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_mondomain {
+ struct rdt_mondomain d_resctrl;
+ struct arch_mbm_state *arch_mbm_total;
+ struct arch_mbm_state *arch_mbm_local;
+};
+
static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
{
return container_of(r, struct rdt_hw_domain, d_resctrl);
--
2.40.1


2023-07-22 19:38:26

by Tony Luck

[permalink] [raw]
Subject: [PATCH v4 6/7] x86/resctrl: Update documentation with Sub-NUMA cluster changes

With Sub-NUMA Cluster mode enabled, the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
their name refer to Sub-NUMA nodes instead of L3 cache ids.

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
---
Documentation/arch/x86/resctrl.rst | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index cb05d90111b4..4d9ddb91751d 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -345,9 +345,13 @@ When control is enabled all CTRL_MON groups will also contain:
When monitoring is enabled all MON groups will also contain:

"mon_data":
- This contains a set of files organized by L3 domain and by
- RDT event. E.g. on a system with two L3 domains there will
- be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
+ This contains a set of files organized by L3 domain or by NUMA
+ node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+ or enabled respectively) and by RDT event. E.g. on a system with
+ SNC mode disabled with two L3 domains there will be subdirectories
+ "mon_L3_00" and "mon_L3_01". The numerical suffix refers to the
+ L3 cache id. With SNC enabled the directory names are the same,
+ but the numerical suffix refers to the node id. Each of these
directories have one file per event (e.g. "llc_occupancy",
"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
files provide a read out of the current value of the event for
--
2.40.1


2023-07-22 19:38:26

by Tony Luck

[permalink] [raw]
Subject: [PATCH v4 3/7] x86/resctrl: Change monitor code to use rdt_mondomain

A few functions need to be duplicated to provide versions to
operate on control and monitor domains respectively. But most
of the changes are just fixing argument and return value types.

Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 10 +++---
arch/x86/kernel/cpu/resctrl/internal.h | 21 +++++++-----
arch/x86/kernel/cpu/resctrl/core.c | 40 ++++++++++++++---------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 4 +--
arch/x86/kernel/cpu/resctrl/monitor.c | 38 ++++++++++-----------
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 24 +++++++-------
6 files changed, 75 insertions(+), 62 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 475912662e47..663bbc427c4b 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -248,9 +248,9 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, enum resctrl_conf_type type);
int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mondomain *d);
void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mondomain *d);

/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
@@ -266,7 +266,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
* Return:
* 0 on success, or -EIO, -EINVAL etc on error.
*/
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mondomain *d,
u32 rmid, enum resctrl_event_id eventid, u64 *val);

/**
@@ -279,7 +279,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mondomain *d,
u32 rmid, enum resctrl_event_id eventid);

/**
@@ -291,7 +291,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mondomain *d);

extern unsigned int resctrl_rmid_realloc_threshold;
extern unsigned int resctrl_rmid_realloc_limit;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index e956090a874e..401af6ccf272 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -106,7 +106,7 @@ union mon_data_bits {
struct rmid_read {
struct rdtgroup *rgrp;
struct rdt_resource *r;
- struct rdt_domain *d;
+ struct rdt_mondomain *d;
enum resctrl_event_id evtid;
bool first;
int err;
@@ -355,6 +355,11 @@ static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
return container_of(r, struct rdt_hw_domain, d_resctrl);
}

+static inline struct rdt_hw_mondomain *resctrl_to_arch_mondom(struct rdt_mondomain *r)
+{
+ return container_of(r, struct rdt_hw_mondomain, d_resctrl);
+}
+
/**
* struct msr_param - set a range of MSRs from a domain
* @res: The resource to use
@@ -526,8 +531,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
umode_t mask);
-struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
- struct list_head **pos);
+void *rdt_find_domain(struct list_head *h, int id,
+ struct list_head **pos);
ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
int rdtgroup_schemata_show(struct kernfs_open_file *of,
@@ -556,17 +561,17 @@ bool __init rdt_cpu_has(int flag);
void mon_event_count(void *info);
int rdtgroup_mondata_show(struct seq_file *m, void *arg);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_mondomain *d, struct rdtgroup *rdtgrp,
int evtid, int first);
-void mbm_setup_overflow_handler(struct rdt_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_mondomain *dom,
unsigned long delay_ms);
void mbm_handle_overflow(struct work_struct *work);
void __init intel_rdt_mbm_apply_quirk(void);
bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
+void cqm_setup_limbo_handler(struct rdt_mondomain *dom, unsigned long delay_ms);
void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
-void __check_limbo(struct rdt_domain *d, bool force_free);
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mondomain *d);
+void __check_limbo(struct rdt_mondomain *d, bool force_free);
void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
void __init thread_throttle_mode_init(void);
void __init mbm_config_rftype_init(const char *config);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 274605aaa026..0161362b0c3e 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -393,9 +393,12 @@ void rdt_ctrl_update(void *arg)
* id is found in a domain, return the domain. Otherwise, if requested by
* caller, return the first domain whose id is bigger than the input id.
* The domain list is sorted by id in ascending order.
+ *
+ * N.B. Returned value may be either a pointer to "struct rdt_domain" or
+ * to "struct rdt_mondomain" depending on which domain list is scanned.
*/
-struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
- struct list_head **pos)
+void *rdt_find_domain(struct list_head *h, int id,
+ struct list_head **pos)
{
struct rdt_domain *d;
struct list_head *l;
@@ -434,10 +437,15 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
}

static void domain_free(struct rdt_hw_domain *hw_dom)
+{
+ kfree(hw_dom->ctrl_val);
+ kfree(hw_dom);
+}
+
+static void mondomain_free(struct rdt_hw_mondomain *hw_dom)
{
kfree(hw_dom->arch_mbm_total);
kfree(hw_dom->arch_mbm_local);
- kfree(hw_dom->ctrl_val);
kfree(hw_dom);
}

@@ -467,7 +475,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
* @num_rmid: The size of the MBM counter array
* @hw_dom: The domain that owns the allocated arrays
*/
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mondomain *hw_dom)
{
size_t tsize;

@@ -539,8 +547,8 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_cpu_cacheinfo_id(cpu, r->mon_scope);
struct list_head *add_pos = NULL;
- struct rdt_hw_domain *hw_dom;
- struct rdt_domain *d;
+ struct rdt_hw_mondomain *hw_mondom;
+ struct rdt_mondomain *d;
int err;

d = rdt_find_domain(&r->mondomains, id, &add_pos);
@@ -556,16 +564,16 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
return;
}

- hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
- if (!hw_dom)
+ hw_mondom = kzalloc_node(sizeof(*hw_mondom), GFP_KERNEL, cpu_to_node(cpu));
+ if (!hw_mondom)
return;

- d = &hw_dom->d_resctrl;
+ d = &hw_mondom->d_resctrl;
d->id = id;
cpumask_set_cpu(cpu, &d->cpu_mask);

- if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
- domain_free(hw_dom);
+ if (arch_domain_mbm_alloc(r->num_rmid, hw_mondom)) {
+ mondomain_free(hw_mondom);
return;
}

@@ -574,7 +582,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
err = resctrl_online_mon_domain(r, d);
if (err) {
list_del(&d->list);
- domain_free(hw_dom);
+ mondomain_free(hw_mondom);
}
}

@@ -632,22 +640,22 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
- struct rdt_hw_domain *hw_dom;
- struct rdt_domain *d;
+ struct rdt_hw_mondomain *hw_mondom;
+ struct rdt_mondomain *d;

d = rdt_find_domain(&r->mondomains, id, NULL);
if (IS_ERR_OR_NULL(d)) {
pr_warn("Couldn't find cache id for CPU %d\n", cpu);
return;
}
- hw_dom = resctrl_to_arch_dom(d);
+ hw_mondom = resctrl_to_arch_mondom(d);

cpumask_clear_cpu(cpu, &d->cpu_mask);
if (cpumask_empty(&d->cpu_mask)) {
resctrl_offline_mon_domain(r, d);
list_del(&d->list);

- domain_free(hw_dom);
+ mondomain_free(hw_mondom);

return;
}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 839df83d1a0a..86fc5b0e3d39 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -521,7 +521,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
}

void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_mondomain *d, struct rdtgroup *rdtgrp,
int evtid, int first)
{
/*
@@ -544,7 +544,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
union mon_data_bits md;
- struct rdt_domain *d;
+ struct rdt_mondomain *d;
struct rmid_read rr;
int ret = 0;

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 66beca785535..0d9605fccb34 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -170,7 +170,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
return 0;
}

-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mondomain *hw_dom,
u32 rmid,
enum resctrl_event_id eventid)
{
@@ -189,10 +189,10 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
return NULL;
}

-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mondomain *d,
u32 rmid, enum resctrl_event_id eventid)
{
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_mondomain *hw_dom = resctrl_to_arch_mondom(d);
struct arch_mbm_state *am;

am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -208,9 +208,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
* Assumes that hardware counters are also reset and thus that there is
* no need to record initial non-zero counts.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mondomain *d)
{
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_mondomain *hw_dom = resctrl_to_arch_mondom(d);

if (is_mbm_total_enabled())
memset(hw_dom->arch_mbm_total, 0,
@@ -229,11 +229,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
return chunks >> shift;
}

-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mondomain *d,
u32 rmid, enum resctrl_event_id eventid, u64 *val)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_mondomain *hw_dom = resctrl_to_arch_mondom(d);
struct arch_mbm_state *am;
u64 msr_val, chunks;
int ret;
@@ -266,7 +266,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
* decrement the count. If the busy count gets to zero on an RMID, we
* free the RMID
*/
-void __check_limbo(struct rdt_domain *d, bool force_free)
+void __check_limbo(struct rdt_mondomain *d, bool force_free)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
struct rmid_entry *entry;
@@ -305,7 +305,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
}
}

-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mondomain *d)
{
return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
}
@@ -334,7 +334,7 @@ int alloc_rmid(void)
static void add_rmid_to_limbo(struct rmid_entry *entry)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- struct rdt_domain *d;
+ struct rdt_mondomain *d;
int cpu, err;
u64 val = 0;

@@ -383,7 +383,7 @@ void free_rmid(u32 rmid)
list_add_tail(&entry->list, &rmid_free_lru);
}

-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
+static struct mbm_state *get_mbm_state(struct rdt_mondomain *d, u32 rmid,
enum resctrl_event_id evtid)
{
switch (evtid) {
@@ -516,7 +516,7 @@ void mon_event_count(void *info)
* throttle MSRs already have low percentage values. To avoid
* unnecessarily restricting such rdtgroups, we also increase the bandwidth.
*/
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mondomain *dom_mbm)
{
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
@@ -600,7 +600,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
}
}

-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
+static void mbm_update(struct rdt_resource *r, struct rdt_mondomain *d, int rmid)
{
struct rmid_read rr;

@@ -641,12 +641,12 @@ void cqm_handle_limbo(struct work_struct *work)
unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
int cpu = smp_processor_id();
struct rdt_resource *r;
- struct rdt_domain *d;
+ struct rdt_mondomain *d;

mutex_lock(&rdtgroup_mutex);

r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- d = container_of(work, struct rdt_domain, cqm_limbo.work);
+ d = container_of(work, struct rdt_mondomain, cqm_limbo.work);

__check_limbo(d, false);

@@ -656,7 +656,7 @@ void cqm_handle_limbo(struct work_struct *work)
mutex_unlock(&rdtgroup_mutex);
}

-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void cqm_setup_limbo_handler(struct rdt_mondomain *dom, unsigned long delay_ms)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;
@@ -674,7 +674,7 @@ void mbm_handle_overflow(struct work_struct *work)
int cpu = smp_processor_id();
struct list_head *head;
struct rdt_resource *r;
- struct rdt_domain *d;
+ struct rdt_mondomain *d;

mutex_lock(&rdtgroup_mutex);

@@ -682,7 +682,7 @@ void mbm_handle_overflow(struct work_struct *work)
goto out_unlock;

r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- d = container_of(work, struct rdt_domain, mbm_over.work);
+ d = container_of(work, struct rdt_mondomain, mbm_over.work);

list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
mbm_update(r, d, prgrp->mon.rmid);
@@ -701,7 +701,7 @@ void mbm_handle_overflow(struct work_struct *work)
mutex_unlock(&rdtgroup_mutex);
}

-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void mbm_setup_overflow_handler(struct rdt_mondomain *dom, unsigned long delay_ms)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 27753eb5d513..4a268df9b456 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1483,7 +1483,7 @@ static void mon_event_config_read(void *info)
mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
}

-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
+static void mondata_config_read(struct rdt_mondomain *d, struct mon_config_info *mon_info)
{
smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
}
@@ -1491,7 +1491,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
{
struct mon_config_info mon_info = {0};
- struct rdt_domain *dom;
+ struct rdt_mondomain *dom;
bool sep = false;

mutex_lock(&rdtgroup_mutex);
@@ -1548,7 +1548,7 @@ static void mon_event_config_write(void *info)
}

static int mbm_config_write_domain(struct rdt_resource *r,
- struct rdt_domain *d, u32 evtid, u32 val)
+ struct rdt_mondomain *d, u32 evtid, u32 val)
{
struct mon_config_info mon_info = {0};
int ret = 0;
@@ -1598,7 +1598,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
{
char *dom_str = NULL, *id_str;
unsigned long dom_id, val;
- struct rdt_domain *d;
+ struct rdt_mondomain *d;
int ret = 0;

next:
@@ -2463,7 +2463,7 @@ static void schemata_list_destroy(void)
static int rdt_get_tree(struct fs_context *fc)
{
struct rdt_fs_context *ctx = rdt_fc2context(fc);
- struct rdt_domain *dom;
+ struct rdt_mondomain *dom;
struct rdt_resource *r;
int ret;

@@ -2845,7 +2845,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
}

static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
- struct rdt_domain *d,
+ struct rdt_mondomain *d,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
union mon_data_bits priv;
@@ -2894,7 +2894,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
* and "monitor" groups with given domain id.
*/
static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- struct rdt_domain *d)
+ struct rdt_mondomain *d)
{
struct kernfs_node *parent_kn;
struct rdtgroup *prgrp, *crgrp;
@@ -2916,7 +2916,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_resource *r,
struct rdtgroup *prgrp)
{
- struct rdt_domain *dom;
+ struct rdt_mondomain *dom;
int ret;

list_for_each_entry(dom, &r->mondomains, list) {
@@ -3701,7 +3701,7 @@ static int __init rdtgroup_setup_root(void)
return ret;
}

-static void domain_destroy_mon_state(struct rdt_domain *d)
+static void domain_destroy_mon_state(struct rdt_mondomain *d)
{
bitmap_free(d->rmid_busy_llc);
kfree(d->mbm_total);
@@ -3716,7 +3716,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
mba_sc_domain_destroy(r, d);
}

-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mondomain *d)
{
lockdep_assert_held(&rdtgroup_mutex);

@@ -3745,7 +3745,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
domain_destroy_mon_state(d);
}

-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mondomain *d)
{
size_t tsize;

@@ -3786,7 +3786,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mondomain *d)
{
int err;

--
2.40.1


2023-07-22 19:38:32

by Tony Luck

[permalink] [raw]
Subject: Re: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

On Thu, Jul 20, 2023 at 09:56:50PM +0000, Luck, Tony wrote:
> This does seem a worthy target.
>
> I started on a patch to do this ... but I'm not sure I have the stamina or the time
> to see it through.

I was being a wuss. I came back at this on Friday from a slightly
different perspective, and it all came together fairly easily.

New series posted here:
https://lore.kernel.org/all/[email protected]/

-Tony

2023-07-22 21:31:52

by Tony Luck

[permalink] [raw]
Subject: Re: [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems

On Wed, Jul 19, 2023 at 02:43:20AM +0000, Shaopeng Tan (Fujitsu) wrote:
> Hi tony,
>
> I ran selftest/resctrl in my environment,
> the test result is "not ok".
>
> Processor in my environment:
> Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz
>
> kernel:
> $ uname -r
> 6.5.0-rc1+
>
> Result :
> Sub-NUMA enable:
> xxx@xxx:~/linux_v6.5_rc1l$ sudo make -C tools/testing/selftests/resctrl run_tests
> make: Entering directory '/.../tools/testing/selftests/resctrl'

I see most tests pass. Just one fail on my most recent run with the
v4 patch series:

# # Fail: Check MBA diff within 5% for schemata 10
# # avg_diff_per: 7%
# # avg_bw_imc: 883
# # avg_bw_resc: 815
# # Fail: Check schemata change using MBA

But just missed the 5% target by a small amount,
not the near total failures that you see.

I wonder if there is a cross-SNC node memory
allocation issue. Can you try running the test
bound to a CPU in one node:

$ taskset -c 1 sudo make -C tools/testing/selftests/resctrl run_tests

Try with different "-c" arguments to bind to different nodes. Do you
see different results on different nodes?

-Tony

2023-07-26 03:58:29

by Drew Fustini

[permalink] [raw]
Subject: Re: [PATCH v4 0/7] Add support for Sub-NUMA cluster (SNC) systems

On Sat, Jul 22, 2023 at 12:07:33PM -0700, Tony Luck wrote:
> The Sub-NUMA cluster feature on some Intel processors partitions
> the CPUs that share an L3 cache into two or more sets. This plays
> havoc with the Resource Director Technology (RDT) monitoring features.
> Prior to this patch Intel has advised that SNC and RDT are incompatible.
>
> Some of these CPUs support an MSR that can partition the RMID
> counters in the same way. This allows for monitoring features
> to be used (with the caveat that memory accesses between different
> SNC NUMA nodes may still not be counted accuratlely.
>
> Signed-off-by: Tony Luck <[email protected]>
>
> ---
>
> Changes since v3:
>
> Reinette provided the most excellent suggestion that this series
> could better achieve its objective if it enabled separate domain
> lists for control & monitoring within a resource, rather than
> creating a whole new resource to support separate node scope needed
> for SNC monitoring. Thus all the pre-amble patches from the previous
> version have gone, replaced by patches 1-4 of this new series.

[This comment is unrelated to Sub-NUMA support so please disregard if
this is the wrong place to make these comments]

I think that the resctrl interface for RISC-V CBQRI could also benefit
from separate domain lists for control and monitoring.

For example, the bandwidth controller QoS register [1] interface allows
a device to implement both bandwidth usage monitoring and bandwidth
allocation. The resctrl proof-of-concept [2] had to awkwardly create two
domains for each memory controller in our example SoC, one that would
contain the MBA resource and one that would contain the L3 resource to
represent MBM files like local_bytes.

This resulted in a very odd looking schemata that would be hard for the
user to understand:

# cat /sys/fs/resctrl/schemata
MB:4= 80;6= 80;8= 80
L2:0=0fff;1=0fff
L3:2=ffff;3=0000;5=0000;7=0000

Where:

Domain 0 is L2 cache controller 0 capacity allocation
Domain 1 is L2 cache controller 1 capacity allocation
Domain 2 is L3 cache controller capacity allocation

Domain 4 is Memory controller 0 bandwidth allocation
Domain 6 is Memory controller 1 bandwidth allocation
Domain 8 is Memory controller 2 bandwidth allocation

Domain 3 is Memory controller 0 bandwidth monitoring
Domain 5 is Memory controller 1 bandwidth monitoring
Domain 7 is Memory controller 2 bandwidth monitoring

But there is no value in having the domains created for the purposes of
bandwidth monitoring in schemata.
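
If control and monitoring domains were kept on separate lists, the
schemata code could walk only the control list. A rough sketch using
the list names from Tony's series (show_ctrl_entry() is a made-up
helper writing one domain's entry to the seq_file s; this is not
actual resctrl or CBQRI code):

    struct rdt_domain *dom;

    /* Only control domains contribute schemata entries */
    list_for_each_entry(dom, &r->domains, list)
        show_ctrl_entry(s, r, dom);

    /*
     * Monitoring-only domains stay on r->mondomains and are never
     * walked here, so they never appear in the schemata file.
     */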

I've not yet fully understood how the new approach in this patch series
could help the situation for CBQRI, but I thought I would mention that
separate lists for control and monitoring might be useful.

Thanks,
Drew

[1] https://github.com/riscv-non-isa/riscv-cbqri/blob/main/qos_bandwidth.adoc
[2] https://lore.kernel.org/linux-riscv/[email protected]/

2023-07-26 14:35:06

by Tony Luck

[permalink] [raw]
Subject: Re: [PATCH v4 0/7] Add support for Sub-NUMA cluster (SNC) systems

On Tue, Jul 25, 2023 at 08:10:52PM -0700, Drew Fustini wrote:
> I think that the resctrl interface for RISC-V CBQRI could also benefit
> from separate domain lists for control and monitoring.
>
> For example, the bandwidth controller QoS register [1] interface allows
> a device to implement both bandwidth usage monitoring and bandwidth
> allocation. The resctrl proof-of-concept [2] had to awkwardly create two
> domains for each memory controller in our example SoC, one that would
> contain the MBA resource and one that would contain the L3 resource to
> represent MBM files like local_bytes.
>
> This resulted in a very odd looking schemata that would be hard for the
> user to understand:
>
> # cat /sys/fs/resctrl/schemata
> MB:4= 80;6= 80;8= 80
> L2:0=0fff;1=0fff
> L3:2=ffff;3=0000;5=0000;7=0000
>
> Where:
>
> Domain 0 is L2 cache controller 0 capacity allocation
> Domain 1 is L2 cache controller 1 capacity allocation
> Domain 2 is L3 cache controller capacity allocation
>
> Domain 4 is Memory controller 0 bandwidth allocation
> Domain 6 is Memory controller 1 bandwidth allocation
> Domain 8 is Memory controller 2 bandwidth allocation
>
> Domain 3 is Memory controller 0 bandwidth monitoring
> Domain 5 is Memory controller 1 bandwidth monitoring
> Domain 7 is Memory controller 2 bandwidth monitoring
>
> But there is no value in having the domains created for the purposes of
> bandwidth monitoring in schemata.

There's certainly no value in exposing those domain numbers
in the schemata file. There should also be some way for users
to decode the ids. On x86 the "id" is exposed in sysfs. Though
the user does need to work to get all the details:

$ cat /sys/devices/system/cpu/cpu36/cache/index3/level
3
$ cat /sys/devices/system/cpu/cpu36/cache/index3/id
1
$ cat /sys/devices/system/cpu/cpu36/cache/index3/shared_cpu_list
36-71,108-143

This shows that the L3 cache with id "1" is shared by CPUs 36-71,108-143.

X86 also has independent domain numbers for each resource. So the
L2 ones count 0, 1, 2, ... and so do the L3 ones: 0, 1, 2 and the
MBA ones: 0, 1, 2

That fits well with the /sys decoding ... but maybe your approach of
not repeating domain numbers across different resources is less
confusing?

Note that in my resctrl re-write, where each resource is handled by
a separate loadable module, it may be hard for you to keep the unique
domain numbering scheme as resource modules are unaware of each other.
Though perhaps it's just an arch-specific hook to provide domain numbers.
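
Something like this, perhaps (purely illustrative, no such hook
exists today):

    /* Illustration only: let the arch supply the id for a new domain */
    int resctrl_arch_domain_id(struct rdt_resource *r, int cpu);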

> I've not yet fully understood how the new approach in this patch series
> could help the situation for CBQRI, but I thought I would mention that
> separate lists for control and monitoring might be useful.

Good. It's nice to know there's potentially another use case for
this split besides SNC.

-Tony

2023-08-11 18:28:27

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 1/7] x86/resctrl: Create separate domains for control and monitoring

Hi Tony,

On 7/22/2023 12:07 PM, Tony Luck wrote:
> First step towards supporting resource control where the scope of
> control operations is not the same as monitor operations.

Each changelog should stand on its own merit. This changelog
appears to be written as a continuation of the cover letter.

Please do ensure that each patch first establishes the context
before it describes the problem and solution. For example,
as a context this changelog can start by describing what the
resctrl domains list represents.

>
> Add an extra list in the rdt_resource structure. For this will
> just duplicate the existing list of domains based on the L3 cache
> scope.

The above paragraph does not make this change appealing at all.

> Refactor the domain_add_cpu() and domain_remove() functions to

domain_remove() -> domain_remove_cpu()

> build separate lists for r->alloc_capable and r->mon_capable
> resources. Note that only the "L3" domain currently supports
> both types.

"L3" domain -> "L3" resource?

>
> Change all places where monitoring functions walk the list of
> domains to use the new "mondomains" list instead of the old
> "domains" list.

I would not refer to it as "the old domains list" as it creates the
impression that it is being replaced. The changelog makes no mention
that the domains list will remain and be dedicated to control domains.
I think this is important to include in the description of this change.

>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> include/linux/resctrl.h | 10 +-
> arch/x86/kernel/cpu/resctrl/internal.h | 2 +-
> arch/x86/kernel/cpu/resctrl/core.c | 195 +++++++++++++++-------
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 2 +-
> arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 30 ++--
> 6 files changed, 167 insertions(+), 74 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 8334eeacfec5..1267d56f9e76 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -151,9 +151,11 @@ struct resctrl_schema;
> * @mon_capable: Is monitor feature available on this machine
> * @num_rmid: Number of RMIDs available
> * @cache_level: Which cache level defines scope of this resource
> + * @mon_scope: Scope of this resource if different from cache_level

I think this addition should be deferred. As it is here, the "if different
from cache_level" also creates many questions (when will it be different?
how will it be determined that the scope is different in order to know that
mon_scope should be used?)

Looking ahead on how mon_scope is used there does not seem to be an "if"
involved at all ... mon_scope is always the monitoring scope.

> * @cache: Cache allocation related data
> * @membw: If the component has bandwidth controls, their properties.
> * @domains: All domains for this resource

A change to the domains comment would also help - to highlight that it is
now dedicated to control domains.

> + * @mondomains: Monitor domains for this resource
> * @name: Name to use in "schemata" file.
> * @data_width: Character width of data when displaying
> * @default_ctrl: Specifies default cache cbm or memory B/W percent.
> @@ -169,9 +171,11 @@ struct rdt_resource {
> bool mon_capable;
> int num_rmid;
> int cache_level;
> + int mon_scope;
> struct resctrl_cache cache;
> struct resctrl_membw membw;
> struct list_head domains;
> + struct list_head mondomains;
> char *name;
> int data_width;
> u32 default_ctrl;

...

> @@ -384,14 +386,15 @@ void rdt_ctrl_update(void *arg)
> }
>
> /*
> - * rdt_find_domain - Find a domain in a resource that matches input resource id
> + * rdt_find_domain - Find a domain in one of the lists for a resource that
> + * matches input resource id
> *

This change makes the function more vague. I think the original summary is
still accurate; how the list is used can be described in the details below.
I see more changes to this function are upcoming and I will comment more
at those sites.

> * Search resource r's domain list to find the resource id. If the resource
> * id is found in a domain, return the domain. Otherwise, if requested by
> * caller, return the first domain whose id is bigger than the input id.
> * The domain list is sorted by id in ascending order.
> */
> -struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> +struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
> struct list_head **pos)
> {
> struct rdt_domain *d;
> @@ -400,7 +403,7 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> if (id < 0)
> return ERR_PTR(-ENODEV);
>
> - list_for_each(l, &r->domains) {
> + list_for_each(l, h) {
> d = list_entry(l, struct rdt_domain, list);
> /* When id is found, return its domain. */
> if (id == d->id)
> @@ -487,6 +490,94 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> return 0;
> }
>
> +static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> +{
> + int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> + struct list_head *add_pos = NULL;
> + struct rdt_hw_domain *hw_dom;
> + struct rdt_domain *d;
> + int err;
> +
> + d = rdt_find_domain(&r->domains, id, &add_pos);
> + if (IS_ERR(d)) {
> + pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> + return;
> + }
> +
> + if (d) {
> + cpumask_set_cpu(cpu, &d->cpu_mask);
> + if (r->cache.arch_has_per_cpu_cfg)
> + rdt_domain_reconfigure_cdp(r);
> + return;
> + }
> +
> + hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> + if (!hw_dom)
> + return;
> +
> + d = &hw_dom->d_resctrl;
> + d->id = id;
> + cpumask_set_cpu(cpu, &d->cpu_mask);
> +
> + rdt_domain_reconfigure_cdp(r);
> +
> + if (domain_setup_ctrlval(r, d)) {
> + domain_free(hw_dom);
> + return;
> + }
> +
> + list_add_tail(&d->list, add_pos);
> +
> + err = resctrl_online_ctrl_domain(r, d);
> + if (err) {
> + list_del(&d->list);
> + domain_free(hw_dom);
> + }
> +}
> +
> +static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> +{
> + int id = get_cpu_cacheinfo_id(cpu, r->mon_scope);

Using a different scope variable but continuing to treat it
as a cache level creates unnecessary confusion at this point.

> + struct list_head *add_pos = NULL;
> + struct rdt_hw_domain *hw_dom;
> + struct rdt_domain *d;
> + int err;
> +
> + d = rdt_find_domain(&r->mondomains, id, &add_pos);
> + if (IS_ERR(d)) {
> + pr_warn("Couldn't find cache id for CPU %d\n", cpu);

Note for future change ... this continues to refer to monitor scope as
a cache id. I did not see this changed in the later patch that actually
changes how scope is used.

> + return;
> + }
> +
> + if (d) {
> + cpumask_set_cpu(cpu, &d->cpu_mask);
> + if (r->cache.arch_has_per_cpu_cfg)
> + rdt_domain_reconfigure_cdp(r);

Copy & paste error?

> + return;
> + }
> +
> + hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> + if (!hw_dom)
> + return;
> +
> + d = &hw_dom->d_resctrl;
> + d->id = id;
> + cpumask_set_cpu(cpu, &d->cpu_mask);
> +
> + if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> + domain_free(hw_dom);
> + return;
> + }
> +
> + list_add_tail(&d->list, add_pos);
> +
> + err = resctrl_online_mon_domain(r, d);
> + if (err) {
> + list_del(&d->list);
> + domain_free(hw_dom);
> + }
> +}
> +
> /*
> * domain_add_cpu - Add a cpu to a resource's domain list.
> *
> @@ -502,61 +593,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> */
> static void domain_add_cpu(int cpu, struct rdt_resource *r)
> {
> - int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> - struct list_head *add_pos = NULL;
> - struct rdt_hw_domain *hw_dom;
> - struct rdt_domain *d;
> - int err;
> -
> - d = rdt_find_domain(r, id, &add_pos);
> - if (IS_ERR(d)) {
> - pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> - return;
> - }
> -
> - if (d) {
> - cpumask_set_cpu(cpu, &d->cpu_mask);
> - if (r->cache.arch_has_per_cpu_cfg)
> - rdt_domain_reconfigure_cdp(r);
> - return;
> - }
> -
> - hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> - if (!hw_dom)
> - return;
> -
> - d = &hw_dom->d_resctrl;
> - d->id = id;
> - cpumask_set_cpu(cpu, &d->cpu_mask);
> -
> - rdt_domain_reconfigure_cdp(r);
> -
> - if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
> - domain_free(hw_dom);
> - return;
> - }
> -
> - if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> - domain_free(hw_dom);
> - return;
> - }
> -
> - list_add_tail(&d->list, add_pos);
> -
> - err = resctrl_online_domain(r, d);
> - if (err) {
> - list_del(&d->list);
> - domain_free(hw_dom);
> - }
> + if (r->alloc_capable)
> + domain_add_cpu_ctrl(cpu, r);
> + if (r->mon_capable)
> + domain_add_cpu_mon(cpu, r);
> }

A resource could be both alloc and mon capable ... both
domain_add_cpu_ctrl() and domain_add_cpu_mon() can fail.
Should domain_add_cpu_mon() still be run for a CPU if
domain_add_cpu_ctrl() failed?

Looking ahead, the CPU should probably also not be added
to the default group's mask if a failure occurred.
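
One possible shape, assuming the two helpers were changed to return
an error (just a sketch to illustrate the concern, not code from the
series):

    static int domain_add_cpu(int cpu, struct rdt_resource *r)
    {
        int err = 0;

        if (r->alloc_capable)
            err = domain_add_cpu_ctrl(cpu, r);

        /*
         * Skip monitoring setup if control setup failed, and let
         * the caller avoid adding the CPU to the default group.
         */
        if (!err && r->mon_capable)
            err = domain_add_cpu_mon(cpu, r);

        return err;
    }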

> -static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> +static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
> {
> int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> struct rdt_hw_domain *hw_dom;
> struct rdt_domain *d;
>
> - d = rdt_find_domain(r, id, NULL);
> + d = rdt_find_domain(&r->domains, id, NULL);
> if (IS_ERR_OR_NULL(d)) {
> pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> return;
> @@ -565,7 +614,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>
> cpumask_clear_cpu(cpu, &d->cpu_mask);
> if (cpumask_empty(&d->cpu_mask)) {
> - resctrl_offline_domain(r, d);
> + resctrl_offline_ctrl_domain(r, d);
> list_del(&d->list);
>
> /*
> @@ -578,6 +627,30 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>
> return;
> }
> +}
> +
> +static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> +{
> + int id = get_cpu_cacheinfo_id(cpu, r->cache_level);

Introducing mon_scope can really be deferred ... here the monitoring code
is not using mon_scope anyway.

> + struct rdt_hw_domain *hw_dom;
> + struct rdt_domain *d;
> +
> + d = rdt_find_domain(&r->mondomains, id, NULL);
> + if (IS_ERR_OR_NULL(d)) {
> + pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> + return;
> + }
> + hw_dom = resctrl_to_arch_dom(d);
> +
> + cpumask_clear_cpu(cpu, &d->cpu_mask);
> + if (cpumask_empty(&d->cpu_mask)) {
> + resctrl_offline_mon_domain(r, d);
> + list_del(&d->list);
> +
> + domain_free(hw_dom);
> +
> + return;
> + }
>
> if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
> if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {

Reinette

2023-08-11 18:53:00

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 0/7] Add support for Sub-NUMA cluster (SNC) systems

Hi Tony,

On 7/22/2023 12:07 PM, Tony Luck wrote:
> The Sub-NUMA cluster feature on some Intel processors partitions
> the CPUs that share an L3 cache into two or more sets. This plays
> havoc with the Resource Director Technology (RDT) monitoring features.
> Prior to this patch Intel has advised that SNC and RDT are incompatible.
>
> Some of these CPUs support an MSR that can partition the RMID
> counters in the same way. This allows for monitoring features
> to be used (with the caveat that memory accesses between different
> SNC NUMA nodes may still not be counted accuratlely.

accuratlely. -> accurately).

Is there any guidance on the scenarios under which memory accesses
may not be counted accurately, how users can detect when this
is the case, or any techniques users can use to avoid this?

Since this question has come up during this series I do think it
will help to document the impact of SNC on CAT.

Reinette

2023-08-11 19:06:02

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 5/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

Hi Tony,

On 7/22/2023 12:07 PM, Tony Luck wrote:
> There isn't a simple hardware enumeration to indicate to software that
> a system is running with Sub-NUMA Cluster enabled.
>
> Compare the number of NUMA nodes with the number of L3 caches to calculate
> the number of Sub-NUMA nodes per L3 cache.
>
> When Sub-NUMA cluster mode is enabled in BIOS setup the RMID counters
> are distributed equally between the SNC nodes within each socket.
>
> E.g. if there are 400 RMID counters, and the system is configured with
> two SNC nodes per socket, then RMID counter 0..199 are used on SNC node
> 0 on the socket, and RMID counter 200..399 on SNC node 1.
>
> A model specific MSR (0xca0) can change the configuration of the RMIDs
> when SNC mode is enabled.
>
> The MSR controls the interpretation of the RMID field in the
> IA32_PQR_ASSOC MSR so that the appropriate hardware counters
> within the SNC node are updated.
>
> To read the RMID counters an offset must be used to get data
> from the physical counter associated with the SNC node. As in
> the example above with 400 RMID counters Linux sees only 200
> counters. No special action is needed to read a counter from
> the first SNC node on a socket. But to read a Linux visible
> counter 50 on the second SNC node the kernel must load 250
> into the QM_EVTSEL MSR.
>
> N.B. this works well for well-behaved NUMA applications that access
> memory predominantly from the local memory node. For applications that
> access memory across multiple nodes it may be necessary for the user
> to read counters for all SNC nodes on a socket and add the values to
> get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
> all that different from applications that span across multiple sockets
> in a legacy system.
>
> The cache allocation feature still provides the same number of
> bits in a mask to control allocation into the L3 cache. But each
> of those ways has its capacity reduced because the cache is divided
> between the SNC nodes. Adjust the value reported in the resctrl
> "size" file accordingly.
>
> Mounting the file system with the "mba_MBps" option is disabled
> when SNC mode is enabled. This is because the measurement of bandwidth
> is per SNC node, while the MBA throttling controls are still at
> the L3 cache scope.
>

I'm counting four logical changes in this changelog. Can they be separate
patches?

> Signed-off-by: Tony Luck <[email protected]>
> ---
> include/linux/resctrl.h | 2 +
> arch/x86/include/asm/msr-index.h | 1 +
> arch/x86/kernel/cpu/resctrl/internal.h | 2 +
> arch/x86/kernel/cpu/resctrl/core.c | 82 +++++++++++++++++++++++++-
> arch/x86/kernel/cpu/resctrl/monitor.c | 18 +++++-
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 +-
> 6 files changed, 103 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 80a89d171eba..576dc21bd990 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -200,6 +200,8 @@ struct rdt_resource {
> bool cdp_capable;
> };
>
> +#define MON_SCOPE_NODE 100
> +

Could you please add a comment to explain what this constant represents
and how it is used?
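
Something along these lines, for example (the wording is only a
suggestion):

	/*
	 * Special value for rdt_resource::mon_scope indicating that
	 * monitoring domains are NUMA-node scoped rather than matching
	 * a cache level. Any value above the valid cache levels (2
	 * and 3) would do.
	 */
	#define MON_SCOPE_NODE	100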

> /**
> * struct resctrl_schema - configuration abilities of a resource presented to
> * user-space
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 3aedae61af4f..4b624a37d64a 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1087,6 +1087,7 @@
> #define MSR_IA32_QM_CTR 0xc8e
> #define MSR_IA32_PQR_ASSOC 0xc8f
> #define MSR_IA32_L3_CBM_BASE 0xc90
> +#define MSR_RMID_SNC_CONFIG 0xca0
> #define MSR_IA32_L2_CBM_BASE 0xd10
> #define MSR_IA32_MBA_THRTL_BASE 0xd50
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 016ef0373c5a..00a330bc5ced 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>
> extern struct dentry *debugfs_resctrl;
>
> +extern int snc_nodes_per_l3_cache;
> +
> enum resctrl_res_level {
> RDT_RESOURCE_L3,
> RDT_RESOURCE_L2,
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 0161362b0c3e..1331add347fc 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -16,11 +16,14 @@
>
> #define pr_fmt(fmt) "resctrl: " fmt
>
> +#include <linux/cpu.h>
> #include <linux/slab.h>
> #include <linux/err.h>
> #include <linux/cacheinfo.h>
> #include <linux/cpuhotplug.h>
> +#include <linux/mod_devicetable.h>
>

What does this include provide?

> +#include <asm/cpu_device_id.h>
> #include <asm/intel-family.h>
> #include <asm/resctrl.h>
> #include "internal.h"
> @@ -48,6 +51,13 @@ int max_name_width, max_data_width;
> */
> bool rdt_alloc_capable;
>
> +/*
> + * Number of SNC nodes that share each L3 cache.
> + * Default is 1 for systems that do not support
> + * SNC, or have SNC disabled.
> + */
> +int snc_nodes_per_l3_cache = 1;
> +
> static void
> mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
> struct rdt_resource *r);
> @@ -543,9 +553,16 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> }
> }
>
> +static int get_mon_scope_id(int cpu, int scope)
> +{
> + if (scope == MON_SCOPE_NODE)
> + return cpu_to_node(cpu);
> + return get_cpu_cacheinfo_id(cpu, scope);
> +}
> +
> static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> {
> - int id = get_cpu_cacheinfo_id(cpu, r->mon_scope);
> + int id = get_mon_scope_id(cpu, r->mon_scope);
> struct list_head *add_pos = NULL;
> struct rdt_hw_mondomain *hw_mondom;
> struct rdt_mondomain *d;
> @@ -692,11 +709,28 @@ static void clear_closid_rmid(int cpu)
> wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
> }
>
> +static void snc_remap_rmids(int cpu)
> +{
> + u64 val;

No need for tab here.

> +
> + /* Only need to enable once per package */
> + if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
> + return;
> +
> + rdmsrl(MSR_RMID_SNC_CONFIG, val);
> + val &= ~BIT_ULL(0);
> + wrmsrl(MSR_RMID_SNC_CONFIG, val);
> +}

Could you please document snc_remap_rmids()
with information on what the bit in the register means
and what the above function does?
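
For example (a sketch based on the changelog's description of the
MSR; the exact wording is a suggestion):

	/*
	 * Bit 0 of MSR_RMID_SNC_CONFIG (0xca0) controls how the RMID
	 * field of IA32_PQR_ASSOC is interpreted when SNC is enabled.
	 * Clearing it selects the mode in which the physical RMID
	 * counters are divided evenly between the SNC nodes on a
	 * socket, with __rmid_read() adding a per-node offset to
	 * address the right physical counter. The MSR only needs to
	 * be written once per package, hence the check against the
	 * first CPU in the package mask.
	 */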

> +
> static int resctrl_online_cpu(unsigned int cpu)
> {
> struct rdt_resource *r;
>
> mutex_lock(&rdtgroup_mutex);
> +
> + if (snc_nodes_per_l3_cache > 1)
> + snc_remap_rmids(cpu);
> +
> for_each_capable_rdt_resource(r)
> domain_add_cpu(cpu, r);
> /* The cpu is set in default rdtgroup after online. */
> @@ -951,11 +985,57 @@ static __init bool get_rdt_resources(void)
> return (rdt_mon_capable || rdt_alloc_capable);
> }
>

I think it will help to add a comment like:
"CPUs that support the model specific MSR_RMID_SNC_CONFIG register."

> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> + X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
> + X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
> + X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
> + {}
> +};
> +
> +/*
> + * There isn't a simple enumeration bit to show whether SNC mode
> + * is enabled. Look at the ratio of number of NUMA nodes to the
> + * number of distinct L3 caches. Take care to skip memory-only nodes.
> + */
> +static __init int get_snc_config(void)
> +{
> + unsigned long *node_caches;
> + int mem_only_nodes = 0;
> + int cpu, node, ret;
> +
> + if (!x86_match_cpu(snc_cpu_ids))
> + return 1;
> +
> + node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
> + if (!node_caches)
> + return 1;
> +
> + cpus_read_lock();
> + for_each_node(node) {
> + cpu = cpumask_first(cpumask_of_node(node));
> + if (cpu < nr_cpu_ids)
> + set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> + else
> + mem_only_nodes++;
> + }
> + cpus_read_unlock();

I am not familiar with the numa code at all so please correct me
where I am wrong. I do see that nr_node_ids is initialized with __init code
so it should be accurate at this point. It looks to me like this initialization
assumes that at least one CPU per node will be online at the time it is run.
It is not clear to me that this assumption would always be true.

> +
> + ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);
> + kfree(node_caches);
> +
> + if (ret > 1)
> + rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = MON_SCOPE_NODE;
> +
> + return ret;
> +}
> +
> static __init void rdt_init_res_defs_intel(void)
> {
> struct rdt_hw_resource *hw_res;
> struct rdt_resource *r;
>
> + snc_nodes_per_l3_cache = get_snc_config();
> +
> for_each_rdt_resource(r) {
> hw_res = resctrl_to_arch_res(r);
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 0d9605fccb34..4ca064e62911 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
>
> static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> {
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + int cpu = get_cpu();

I do not think it is necessary to disable preemption here. Also please note that
James is working on changes that will remove this code block. Would plain
smp_processor_id() do?
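
That is, something like this (a sketch; the error bit checks at the
end of the function are unchanged and elided):

	static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
	{
		struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
		int cpu = smp_processor_id();
		int rmid_offset = 0;
		u64 msr_val;

		/*
		 * When SNC mode is on, compute the offset to read the
		 * physical RMID counter for the node this CPU belongs to.
		 */
		if (snc_nodes_per_l3_cache > 1)
			rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;

		wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
		rdmsrl(MSR_IA32_QM_CTR, msr_val);
		...
	}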

> + int rmid_offset = 0;
> u64 msr_val;
>
> + /*
> + * When SNC mode is on, need to compute the offset to read the
> + * physical RMID counter for the node to which this CPU belongs
> + */
> + if (snc_nodes_per_l3_cache > 1)
> + rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> +
> /*
> * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
> * with a valid event code for supported resource type and the bits
> @@ -158,9 +168,11 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
> * are error bits.
> */
> - wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> + wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
> rdmsrl(MSR_IA32_QM_CTR, msr_val);
>
> + put_cpu();
> +
> if (msr_val & RMID_VAL_ERROR)
> return -EIO;
> if (msr_val & RMID_VAL_UNAVAIL)

Reinette

2023-08-11 20:10:41

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 3/7] x86/resctrl: Change monitor code to use rdt_mondomain

Hi Tony,

On 7/22/2023 12:07 PM, Tony Luck wrote:
> A few functions need to be duplicated to provide versions to
> operate on control and monitor domains respectively. But most
> of the changes are just fixing argument and return value types.

Could you please add some context in support of this change?

I do not think "duplicated" is appropriate though. Functions
are not duplicated but instead made to be dedicated to
either control or monitoring domains, no?

...

> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 274605aaa026..0161362b0c3e 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -393,9 +393,12 @@ void rdt_ctrl_update(void *arg)
> * id is found in a domain, return the domain. Otherwise, if requested by
> * caller, return the first domain whose id is bigger than the input id.
> * The domain list is sorted by id in ascending order.
> + *
> + * N.B. Returned value may be either a pointer to "struct rdt_domain" or
> + * to "struct rdt_mondomain" depending on which domain list is scanned.
> */
> -struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
> - struct list_head **pos)
> +void *rdt_find_domain(struct list_head *h, int id,
> + struct list_head **pos)
> {
> struct rdt_domain *d;
> struct list_head *l;

I do not think that void pointers should be passed around. How about two
new functions dedicated to the different domain types with the void pointer
handling contained in a static function? For example,

static void *__rdt_find_domain(struct list_head *h, int id, struct list_head **pos)

struct rdt_mondomain *rdt_find_mondomain(struct rdt_resource *r, int id, struct list_head **pos)
struct rdt_domain *rdt_find_ctrldomain(struct rdt_resource *r, int id, struct list_head **pos)

rdt_find_mondomain() and rdt_find_ctrldomain() would be what callers use
while they can be wrappers of __rdt_find_domain().
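
For example (a sketch of the above; the existing sorted-list search
moves unchanged into the static helper):

	static void *__rdt_find_domain(struct list_head *h, int id,
				       struct list_head **pos)
	{
		/* existing search over the sorted domain list */
	}

	struct rdt_domain *rdt_find_ctrldomain(struct rdt_resource *r,
					       int id, struct list_head **pos)
	{
		return __rdt_find_domain(&r->domains, id, pos);
	}

	struct rdt_mondomain *rdt_find_mondomain(struct rdt_resource *r,
						 int id, struct list_head **pos)
	{
		return __rdt_find_domain(&r->mondomains, id, pos);
	}

That keeps the void pointer confined to a single static function
while every caller gets a properly typed result.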


> @@ -434,10 +437,15 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
> }
>
> static void domain_free(struct rdt_hw_domain *hw_dom)
> +{
> + kfree(hw_dom->ctrl_val);
> + kfree(hw_dom);
> +}
> +
> +static void mondomain_free(struct rdt_hw_mondomain *hw_dom)
> {
> kfree(hw_dom->arch_mbm_total);
> kfree(hw_dom->arch_mbm_local);
> - kfree(hw_dom->ctrl_val);
> kfree(hw_dom);
> }
>
> @@ -467,7 +475,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
> * @num_rmid: The size of the MBM counter array
> * @hw_dom: The domain that owns the allocated arrays
> */
> -static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> +static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mondomain *hw_dom)
> {
> size_t tsize;
>
> @@ -539,8 +547,8 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> {
> int id = get_cpu_cacheinfo_id(cpu, r->mon_scope);
> struct list_head *add_pos = NULL;
> - struct rdt_hw_domain *hw_dom;
> - struct rdt_domain *d;
> + struct rdt_hw_mondomain *hw_mondom;
> + struct rdt_mondomain *d;
> int err;
>

Please ensure that reverse fir tree order is maintained in all these changes.
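
That is, local variable declarations ordered from longest line to
shortest, e.g. (ignoring the initializer of "id"):

	struct rdt_hw_mondomain *hw_mondom;
	struct list_head *add_pos = NULL;
	struct rdt_mondomain *d;
	int id, err;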

Reinette

2023-08-11 20:17:18

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 6/7] x86/resctrl: Update documentation with Sub-NUMA cluster changes

Hi Tony,

On 7/22/2023 12:07 PM, Tony Luck wrote:
> With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
> per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
> their name refer to Sub-NUMA nodes instead of L3 cache ids.
>
> Signed-off-by: Tony Luck <[email protected]>
> Reviewed-by: Peter Newman <[email protected]>
> ---
> Documentation/arch/x86/resctrl.rst | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index cb05d90111b4..4d9ddb91751d 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -345,9 +345,13 @@ When control is enabled all CTRL_MON groups will also contain:
> When monitoring is enabled all MON groups will also contain:
>
> "mon_data":
> - This contains a set of files organized by L3 domain and by
> - RDT event. E.g. on a system with two L3 domains there will
> - be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
> + This contains a set of files organized by L3 domain or by NUMA
> + node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
> + or enabled respectively) and by RDT event. E.g. on a system with
> + SNC mode disabled with two L3 domains there will be subdirectories
> + "mon_L3_00" and "mon_L3_01". The numerical suffix refers to the
> + L3 cache id. With SNC enabled the directory names are the same,
> + but the numerical suffix refers to the node id. Each of these
> directories have one file per event (e.g. "llc_occupancy",
> "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
> files provide a read out of the current value of the event for

I think it would be helpful to add a modified version of the snippet
(from previous patch changelog) regarding well-behaved NUMA apps.
With the above it may be confusing that a single cache allocation has
multiple cache occupancy counters.

This also changes the meaning of the numbers in the directory names.
The documentation already provides guidance on how to find the cache
ID of a logical CPU (see section "Cache IDs"). I think it will be
helpful to add a snippet that makes it clear to users how to map
a CPU to its node ID.
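
For example, a snippet along these lines could be added (the wording
is only a suggestion, and it assumes the per-CPU node symlinks in
/sys are available):

	The node ID of a CPU can be found in /sys, e.g. the presence
	of /sys/devices/system/cpu/cpu0/node0 shows that CPU 0 belongs
	to node 0.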

Reinette

2023-08-11 20:21:13

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 7/7] selftests/resctrl: Adjust effective L3 cache size when SNC enabled

Hi Tony,

On 7/22/2023 12:07 PM, Tony Luck wrote:
> Sub-NUMA Cluster divides CPUs sharing an L3 cache into separate NUMA
> nodes. Systems may support splitting into either two or four nodes.
>
> When SNC mode is enabled the effective amount of L3 cache available
> for allocation is divided by the number of nodes per L3.
>
> Detect which SNC mode is active by comparing the number of CPUs
> that share a cache with CPU0, with the number of CPUs on node0.
>
> Reported-by: "Shaopeng Tan (Fujitsu)" <[email protected]>
> Closes: https://lore.kernel.org/r/TYAPR01MB6330B9B17686EF426D2C3F308B25A@TYAPR01MB6330.jpnprd01.prod.outlook.com

This does not seem to be the case when looking at
https://lore.kernel.org/all/TYAPR01MB6330A4EB3633B791939EA45E8B39A@TYAPR01MB6330.jpnprd01.prod.outlook.com/

> Signed-off-by: Tony Luck <[email protected]>
> ---
> tools/testing/selftests/resctrl/resctrl.h | 1 +
> tools/testing/selftests/resctrl/resctrlfs.c | 57 +++++++++++++++++++++
> 2 files changed, 58 insertions(+)
>
> diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
> index 87e39456dee0..a8b43210b573 100644
> --- a/tools/testing/selftests/resctrl/resctrl.h
> +++ b/tools/testing/selftests/resctrl/resctrl.h
> @@ -13,6 +13,7 @@
> #include <signal.h>
> #include <dirent.h>
> #include <stdbool.h>
> +#include <ctype.h>
> #include <sys/stat.h>
> #include <sys/ioctl.h>
> #include <sys/mount.h>
> diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
> index fb00245dee92..79eecbf9f863 100644
> --- a/tools/testing/selftests/resctrl/resctrlfs.c
> +++ b/tools/testing/selftests/resctrl/resctrlfs.c
> @@ -130,6 +130,61 @@ int get_resource_id(int cpu_no, int *resource_id)
> return 0;
> }
>
> +/*
> + * Count number of CPUs in a /sys bit map
> + */
> +static int count_sys_bitmap_bits(char *name)
> +{
> + FILE *fp = fopen(name, "r");
> + int count = 0, c;
> +
> + if (!fp)
> + return 0;
> +
> + while ((c = fgetc(fp)) != EOF) {
> + if (!isxdigit(c))
> + continue;
> + switch (c) {
> + case 'f':
> + count++;
> + case '7': case 'b': case 'd': case 'e':
> + count++;
> + case '3': case '5': case '6': case '9': case 'a': case 'c':
> + count++;
> + case '1': case '2': case '4': case '8':
> + count++;
> + }
> + }
> + fclose(fp);
> +
> + return count;
> +}
> +
> +/*
> + * Detect SNC by comparing #CPUs in node0 with #CPUs sharing LLC with CPU0
> + * Try to get this right, even if a few CPUs are offline so that the number
> + * of CPUs in node0 is not exactly half or a quarter of the CPUs sharing the
> + * LLC of CPU0.
> + */
> +static int snc_ways(void)
> +{
> + int node_cpus, cache_cpus;
> +
> + node_cpus = count_sys_bitmap_bits("/sys/devices/system/node/node0/cpumap");
> + cache_cpus = count_sys_bitmap_bits("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map");
> +
> + if (!node_cpus || !cache_cpus) {
> + fprintf(stderr, "Warning could not determine Sub-NUMA Cluster mode\n");
> + return 1;
> + }
> +
> + if (4 * node_cpus <= cache_cpus)
> + return 4;
> + else if (2 * node_cpus <= cache_cpus)
> + return 2;
> + return 1;
> +}
> +
> /*
> * get_cache_size - Get cache size for a specified CPU
> * @cpu_no: CPU number
> @@ -190,6 +245,8 @@ int get_cache_size(int cpu_no, char *cache_type, unsigned long *cache_size)
> break;
> }
>
> + if (cache_num == 3)
> + *cache_size /= snc_ways();
> return 0;
> }
>

I am surprised that this small change is sufficient. The resctrl
selftests are definitely not NUMA aware and the CAT and CMT tests
are not taking that into account when picking CPUs to run on. From
what I understand the LLC occupancy counters from all SNC nodes need
to be added together in this scenario, but I do not see that done
either.
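
For reference, summing the per-node counters in the selftests could
look something like this (a sketch only; the helper name is made up
and the paths follow the mon_data layout described earlier in the
thread):

	#include <dirent.h>
	#include <stdio.h>
	#include <string.h>

	/*
	 * Sum llc_occupancy over every mon_L3_* domain directory under
	 * a group's mon_data directory. With SNC enabled each SNC node
	 * reports its own counter, so the total for the group is the
	 * sum across all nodes sharing the L3 cache.
	 */
	static unsigned long long sum_llc_occupancy(const char *mon_data)
	{
		unsigned long long total = 0, val;
		struct dirent *de;
		char path[512];
		FILE *fp;
		DIR *dir;

		dir = opendir(mon_data);
		if (!dir)
			return 0;

		while ((de = readdir(dir)) != NULL) {
			if (strncmp(de->d_name, "mon_L3_", 7))
				continue;
			snprintf(path, sizeof(path), "%s/%s/llc_occupancy",
				 mon_data, de->d_name);
			fp = fopen(path, "r");
			if (!fp)
				continue;
			if (fscanf(fp, "%llu", &val) == 1)
				total += val;
			fclose(fp);
		}
		closedir(dir);

		return total;
	}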

Reinette