2023-06-21 17:41:45

by Tony Luck

Subject: [PATCH v2 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems

There isn't a simple hardware enumeration to indicate to software that
a system is running with Sub-NUMA Clustering enabled.

Compare the number of NUMA nodes with the number of L3 caches to calculate
the number of Sub-NUMA nodes per L3 cache.

When Sub-NUMA clustering mode is enabled in BIOS setup, the RMID counters
are distributed equally between the SNC nodes within each socket.

E.g. if there are 400 RMID counters, and the system is configured with
two SNC nodes per socket, then RMID counters 0..199 are used on SNC node
0 of the socket, and RMID counters 200..399 on SNC node 1.

A model specific MSR (0xca0) can change the configuration of the RMIDs
when SNC mode is enabled.

The MSR controls the interpretation of the RMID field in the
IA32_PQR_ASSOC MSR so that the appropriate hardware counters
within the SNC node are updated. When reconfigured from the default,
the RMIDs are divided evenly across the SNC nodes.

Also initialize a per-cpu RMID offset value. Use this
to calculate the value to write to the IA32_QM_EVTSEL MSR when
reading RMID event values.
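
To illustrate the arithmetic, here is a minimal user-space sketch. All
names and values are illustrative; the kernel code uses per-cpu storage
and the MSR accessors shown in the patches themselves.

#include <stdio.h>

static int snc_ways = 2;	/* SNC nodes sharing an L3 (example) */
static int num_rmid = 200;	/* RMIDs per SNC node after scaling */

/* Offset for a CPU that belongs to NUMA node "node" */
static int rmid_offset(int node)
{
	return (node % snc_ways) * num_rmid;
}

int main(void)
{
	int rmid = 5;	/* RMID as seen by the resctrl filesystem */

	/* Value that would go into the RMID field of IA32_QM_EVTSEL */
	printf("node 0: read hardware RMID %d\n", rmid + rmid_offset(0)); /* 5 */
	printf("node 1: read hardware RMID %d\n", rmid + rmid_offset(1)); /* 205 */
	return 0;
}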

N.B. This works well for well-behaved NUMA applications that access
memory predominantly from the local memory node. For applications that
access memory across multiple nodes it may be necessary for the user
to read counters for all SNC nodes on a socket and add the values to
get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
all that different from applications that span multiple sockets in a
legacy system.
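
For reference, here is a minimal user-space sketch of that aggregation,
assuming resctrl is mounted at /sys/fs/resctrl and using the mon_data
layout described in patch 6/7 (reading from the default group; the
paths are assumptions for illustration):

#include <glob.h>
#include <stdio.h>

int main(void)
{
	/* One mon_L3_* directory per SNC node when SNC is enabled */
	const char *pat = "/sys/fs/resctrl/mon_data/mon_L3_*/llc_occupancy";
	unsigned long long total = 0, val;
	glob_t g;
	size_t i;

	if (glob(pat, 0, NULL, &g))
		return 1;
	for (i = 0; i < g.gl_pathc; i++) {
		FILE *f = fopen(g.gl_pathv[i], "r");

		if (f && fscanf(f, "%llu", &val) == 1)
			total += val;	/* sum across SNC nodes */
		if (f)
			fclose(f);
	}
	globfree(&g);
	printf("total llc_occupancy: %llu\n", total);
	return 0;
}

(For a single-socket total, restrict the pattern to that socket's node
ids.)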

Signed-off-by: Tony Luck <[email protected]>

---

Changes since v1:

Re-based to tip/master (on June 21, 2023)

Fenghua:
+ Better comment for l3_mon_evt_init()
+ Don't need .fflags = RFTYPE_RES_MB for node resource. Use .fflags = 0

James:
+ Add helper function to choose resource based on snc_ways
+ Drop the info/snc_ways file. No current justification for it.
+ Typos s/Suffices/Suffixes/, s/Sun-NUMA/Sub-NUMA/
+ Expand SNC acronym on first use in Documentation/x86/resctrl.rst

Peter:
+ Add checks for cpu-less nodes.

Tony Luck (7):
x86/resctrl: Refactor in preparation for node-scoped resources
x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c
x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
x86/resctrl: Add code to setup monitoring at L3 or NODE scope.
x86/resctrl: Add package scoped resource
x86/resctrl: Update documentation with Sub-NUMA cluster changes
x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

Documentation/arch/x86/resctrl.rst | 10 +-
include/linux/resctrl.h | 5 +-
arch/x86/include/asm/resctrl.h | 2 +
arch/x86/kernel/cpu/resctrl/internal.h | 20 +++
arch/x86/kernel/cpu/resctrl/core.c | 154 ++++++++++++++++++++--
arch/x86/kernel/cpu/resctrl/monitor.c | 24 ++--
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 +-
8 files changed, 191 insertions(+), 30 deletions(-)


base-commit: 746d03317c1175666aad909ecc45384da42218aa
--
2.40.1



2023-06-21 17:42:02

by Tony Luck

Subject: [PATCH v2 5/7] x86/resctrl: Add package scoped resource

Some Intel features require setting a package-scoped model-specific
register (MSR).

Add a new resource that builds domains for each package.

Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 6 ++++++
arch/x86/kernel/cpu/resctrl/core.c | 23 +++++++++++++++++++----
3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 25051daa6655..f504f6263fec 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -167,6 +167,7 @@ struct rdt_resource {
int rid;
bool alloc_capable;
bool mon_capable;
+ bool pkg_actions;
int num_rmid;
int scope;
struct resctrl_cache cache;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 38bac0062c82..e51a5004be77 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -438,6 +438,7 @@ enum resctrl_res_level {
RDT_RESOURCE_MBA,
RDT_RESOURCE_SMBA,
RDT_RESOURCE_NODE,
+ RDT_RESOURCE_PKG,

/* Must be the last */
RDT_NUM_RESOURCES,
@@ -447,6 +448,7 @@ enum resctrl_scope {
SCOPE_L2_CACHE = 2,
SCOPE_L3_CACHE = 3,
SCOPE_NODE,
+ SCOPE_PKG,
};

static inline int get_mbm_res_level(void)
@@ -482,6 +484,10 @@ int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
for_each_rdt_resource(r) \
if (r->alloc_capable || r->mon_capable)

+#define for_each_domain_needed_rdt_resource(r) \
+ for_each_rdt_resource(r) \
+ if (r->alloc_capable || r->mon_capable || r->pkg_actions)
+
#define for_each_alloc_capable_rdt_resource(r) \
for_each_rdt_resource(r) \
if (r->alloc_capable)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 6fe9f87d4403..af3be3c2db96 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -127,6 +127,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
.fflags = 0,
},
},
+ [RDT_RESOURCE_PKG] =
+ {
+ .r_resctrl = {
+ .rid = RDT_RESOURCE_PKG,
+ .name = "PKG",
+ .scope = SCOPE_PKG,
+ .domains = domain_init(RDT_RESOURCE_PKG),
+ .fflags = 0,
+ },
+ },
};

/*
@@ -504,9 +514,14 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)

static int get_domain_id(int cpu, enum resctrl_scope scope)
{
- if (scope == SCOPE_NODE)
+ switch (scope) {
+ case SCOPE_NODE:
return cpu_to_node(cpu);
- return get_cpu_cacheinfo_id(cpu, scope);
+ case SCOPE_PKG:
+ return topology_physical_package_id(cpu);
+ default:
+ return get_cpu_cacheinfo_id(cpu, scope);
+ }
}

/*
@@ -630,7 +645,7 @@ static int resctrl_online_cpu(unsigned int cpu)
struct rdt_resource *r;

mutex_lock(&rdtgroup_mutex);
- for_each_capable_rdt_resource(r)
+ for_each_domain_needed_rdt_resource(r)
domain_add_cpu(cpu, r);
/* The cpu is set in default rdtgroup after online. */
cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
@@ -657,7 +672,7 @@ static int resctrl_offline_cpu(unsigned int cpu)
struct rdt_resource *r;

mutex_lock(&rdtgroup_mutex);
- for_each_capable_rdt_resource(r)
+ for_each_domain_needed_rdt_resource(r)
domain_remove_cpu(cpu, r);
list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
--
2.40.1


2023-06-21 17:46:10

by Tony Luck

Subject: [PATCH v2 2/7] x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c

Monitoring may be done at L3 cache granularity (legacy) or at the node
level (systems with Sub-NUMA Cluster enabled).

Save the struct rdt_resource pointer that was used to initialize the
monitoring code and use that pointer instead of the hard-coded
RDT_RESOURCE_L3.

No functional change.

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/monitor.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ded1fc7cb7cb..9be6ffdd01ae 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -30,6 +30,8 @@ struct rmid_entry {
struct list_head list;
};

+static struct rdt_resource *mon_resource;
+
/**
* @rmid_free_lru A least recently used list of free RMIDs
* These RMIDs are guaranteed to have an occupancy less than the
@@ -268,7 +270,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
*/
void __check_limbo(struct rdt_domain *d, bool force_free)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_resource *r = mon_resource;
struct rmid_entry *entry;
u32 crmid = 1, nrmid;
bool rmid_dirty;
@@ -333,7 +335,7 @@ int alloc_rmid(void)

static void add_rmid_to_limbo(struct rmid_entry *entry)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_resource *r = mon_resource;
struct rdt_domain *d;
int cpu, err;
u64 val = 0;
@@ -645,7 +647,7 @@ void cqm_handle_limbo(struct work_struct *work)

mutex_lock(&rdtgroup_mutex);

- r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ r = mon_resource;
d = container_of(work, struct rdt_domain, cqm_limbo.work);

__check_limbo(d, false);
@@ -681,7 +683,7 @@ void mbm_handle_overflow(struct work_struct *work)
if (!static_branch_likely(&rdt_mon_enable_key))
goto out_unlock;

- r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ r = mon_resource;
d = container_of(work, struct rdt_domain, mbm_over.work);

list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
@@ -759,9 +761,9 @@ static struct mon_evt mbm_local_event = {
/*
* Initialize the event list for the resource.
*
- * Note that MBM events are also part of RDT_RESOURCE_L3 resource
- * because as per the SDM the total and local memory bandwidth
- * are enumerated as part of L3 monitoring.
+ * Monitor events can either be part of the RDT_RESOURCE_L3 resource,
+ * or they may be per NUMA node on systems with Sub-NUMA Cluster
+ * enabled and are then in the RDT_RESOURCE_NODE resource.
*/
static void l3_mon_evt_init(struct rdt_resource *r)
{
@@ -773,6 +775,8 @@ static void l3_mon_evt_init(struct rdt_resource *r)
list_add_tail(&mbm_total_event.list, &r->evt_list);
if (is_mbm_local_enabled())
list_add_tail(&mbm_local_event.list, &r->evt_list);
+
+ mon_resource = r;
}

int __init rdt_get_mon_l3_config(struct rdt_resource *r)
--
2.40.1


2023-06-21 17:51:00

by Tony Luck

Subject: [PATCH v2 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

There isn't a simple hardware enumeration to indicate to software that
a system is running with Sub-NUMA Cluster enabled.

Compare the number of NUMA nodes with the number of L3 caches to calculate
the number of Sub-NUMA nodes per L3 cache.

When Sub-NUMA Cluster mode is enabled in BIOS setup, the RMID counters
are distributed equally between the SNC nodes within each socket.

E.g. if there are 400 RMID counters, and the system is configured with
two SNC nodes per socket, then RMID counters 0..199 are used on SNC node
0 of the socket, and RMID counters 200..399 on SNC node 1.

A model specific MSR (0xca0) can change the configuration of the RMIDs
when SNC mode is enabled.

The MSR controls the interpretation of the RMID field in the
IA32_PQR_ASSOC MSR so that the appropriate hardware counters
within the SNC node are updated.

Also initialize a per-cpu RMID offset value. Use this
to calculate the value to write to the IA32_QM_EVTSEL MSR when
reading RMID event values.

N.B. This works well for well-behaved NUMA applications that access
memory predominantly from the local memory node. For applications that
access memory across multiple nodes it may be necessary for the user
to read counters for all SNC nodes on a socket and add the values to
get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
all that different from applications that span multiple sockets in a
legacy system.

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/include/asm/resctrl.h | 2 +
arch/x86/kernel/cpu/resctrl/core.c | 99 ++++++++++++++++++++++++++-
arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
3 files changed, 99 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 255a78d9d906..f95e69bacc65 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -35,6 +35,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);

+DECLARE_PER_CPU(int, rmid_offset);
+
/*
* __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
*
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index af3be3c2db96..869cfb46e8e4 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@

#define pr_fmt(fmt) "resctrl: " fmt

+#include <linux/cpu.h>
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/cacheinfo.h>
#include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>

+#include <asm/cpu_device_id.h>
#include <asm/intel-family.h>
#include <asm/resctrl.h>
#include "internal.h"
@@ -524,6 +527,39 @@ static int get_domain_id(int cpu, enum resctrl_scope scope)
}
}

+DEFINE_PER_CPU(int, rmid_offset);
+
+static void set_per_cpu_rmid_offset(int cpu, struct rdt_resource *r)
+{
+ this_cpu_write(rmid_offset, (cpu_to_node(cpu) % snc_ways) * r->num_rmid);
+}
+
+/*
+ * This MSR provides for configuration of RMIDs on Sub-NUMA Cluster
+ * systems.
+ * Bit0 = 1 (default) For legacy configuration
+ * Bit0 = 0 RMIDs are divided evenly between SNC nodes.
+ */
+#define MSR_RMID_SNC_CONFIG 0xCA0
+
+static void snc_add_pkg(void)
+{
+ u64 msrval;
+
+ rdmsrl(MSR_RMID_SNC_CONFIG, msrval);
+ msrval |= BIT_ULL(0);
+ wrmsrl(MSR_RMID_SNC_CONFIG, msrval);
+}
+
+static void snc_remove_pkg(void)
+{
+ u64 msrval;
+
+ rdmsrl(MSR_RMID_SNC_CONFIG, msrval);
+ msrval &= ~BIT_ULL(0);
+ wrmsrl(MSR_RMID_SNC_CONFIG, msrval);
+}
+
/*
* domain_add_cpu - Add a cpu to a resource's domain list.
*
@@ -555,6 +591,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
cpumask_set_cpu(cpu, &d->cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
rdt_domain_reconfigure_cdp(r);
+ if (r->mon_capable)
+ set_per_cpu_rmid_offset(cpu, r);
return;
}

@@ -573,11 +611,17 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;
}

- if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
- domain_free(hw_dom);
- return;
+ if (r->mon_capable) {
+ if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+ domain_free(hw_dom);
+ return;
+ }
+ set_per_cpu_rmid_offset(cpu, r);
}

+ if (r->pkg_actions)
+ snc_add_pkg();
+
list_add_tail(&d->list, add_pos);

err = resctrl_online_domain(r, d);
@@ -613,6 +657,9 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
d->plr->d = NULL;
domain_free(hw_dom);

+ if (r->pkg_actions)
+ snc_remove_pkg();
+
return;
}

@@ -899,11 +946,57 @@ static __init bool get_rdt_resources(void)
return (rdt_mon_capable || rdt_alloc_capable);
}

+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+ {}
+};
+
+/*
+ * There isn't a simple enumeration bit to show whether SNC mode
+ * is enabled. Look at the ratio of number of NUMA nodes to the
+ * number of distinct L3 caches. Take care to skip memory-only nodes.
+ */
+static __init int find_snc_ways(void)
+{
+ unsigned long *node_caches;
+ int mem_only_nodes = 0;
+ int cpu, node, ret;
+
+ if (!x86_match_cpu(snc_cpu_ids))
+ return 1;
+
+ node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
+ if (!node_caches)
+ return 1;
+
+ cpus_read_lock();
+ for_each_node(node) {
+ cpu = cpumask_first(cpumask_of_node(node));
+ if (cpu < nr_cpu_ids)
+ set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
+ else
+ mem_only_nodes++;
+ }
+ cpus_read_unlock();
+
+ ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);
+ kfree(node_caches);
+
+ if (ret > 1)
+ rdt_resources_all[RDT_RESOURCE_PKG].r_resctrl.pkg_actions = true;
+
+ return ret;
+}
+
static __init void rdt_init_res_defs_intel(void)
{
struct rdt_hw_resource *hw_res;
struct rdt_resource *r;

+ snc_ways = find_snc_ways();
+
for_each_rdt_resource(r) {
hw_res = resctrl_to_arch_res(r);

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index da3f36212898..74db99d299e1 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -160,7 +160,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
* IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
* are error bits.
*/
- wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+ wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + this_cpu_read(rmid_offset));
rdmsrl(MSR_IA32_QM_CTR, msr_val);

if (msr_val & RMID_VAL_ERROR)
--
2.40.1


2023-06-21 17:51:20

by Tony Luck

Subject: [PATCH v2 4/7] x86/resctrl: Add code to setup monitoring at L3 or NODE scope.

When Sub-NUMA Cluster is enabled (snc_ways > 1) use the RDT_RESOURCE_NODE
resource instead of RDT_RESOURCE_L3 for all monitoring operations.

The mon_scale and num_rmid values from CPUID(0xf,0x1),(EBX,ECX) must be
scaled down by the number of Sub-NUMA Clusters.

A subsequent change will detect Sub-NUMA Cluster mode and set
"snc_ways". For now it is set to one (meaning each L3 cache spans one
node).
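
A minimal sketch of that scaling with example values (400 RMIDs and an
occupancy scale of 64 from CPUID, two SNC nodes per socket; the numbers
are assumptions for illustration):

#include <stdio.h>

int main(void)
{
	int snc_ways = 2;		/* set by a later patch in this series */
	unsigned int max_rmid = 399;	/* CPUID(0xf,0x1).ECX (example) */
	unsigned int occ_scale = 64;	/* CPUID(0xf,0x1).EBX (example) */

	/* Each SNC node gets an equal share of the socket's RMIDs */
	printf("num_rmid  = %u\n", (max_rmid + 1) / snc_ways);	/* 200 */
	printf("mon_scale = %u\n", occ_scale / snc_ways);	/* 32 */
	return 0;
}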

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 7 +++++++
arch/x86/kernel/cpu/resctrl/core.c | 7 ++++++-
arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++--
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
4 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 243017096ddf..38bac0062c82 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -430,6 +430,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);

extern struct dentry *debugfs_resctrl;

+extern int snc_ways;
+
enum resctrl_res_level {
RDT_RESOURCE_L3,
RDT_RESOURCE_L2,
@@ -447,6 +449,11 @@ enum resctrl_scope {
SCOPE_NODE,
};

+static inline int get_mbm_res_level(void)
+{
+ return snc_ways > 1 ? RDT_RESOURCE_NODE : RDT_RESOURCE_L3;
+}
+
static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(res);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index e4bd3072927c..6fe9f87d4403 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,11 @@ int max_name_width, max_data_width;
*/
bool rdt_alloc_capable;

+/*
+ * How many Sub-NUMA Cluster nodes share a single L3 cache
+ */
+int snc_ways = 1;
+
static void
mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r);
@@ -831,7 +836,7 @@ static __init bool get_rdt_alloc_resources(void)

static __init bool get_rdt_mon_resources(void)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_resource *r = &rdt_resources_all[get_mbm_res_level()].r_resctrl;

if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 9be6ffdd01ae..da3f36212898 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -787,8 +787,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
int ret;

resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
- hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
- r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+ hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_ways;
+ r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_ways;
hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;

if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 418658f0a9ad..d037f3da9e55 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2524,7 +2524,7 @@ static int rdt_get_tree(struct fs_context *fc)
static_branch_enable_cpuslocked(&rdt_enable_key);

if (is_mbm_enabled()) {
- r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ r = &rdt_resources_all[get_mbm_res_level()].r_resctrl;
list_for_each_entry(dom, &r->domains, list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
}
--
2.40.1


2023-06-21 17:51:39

by Tony Luck

Subject: [PATCH v2 1/7] x86/resctrl: Refactor in preparation for node-scoped resources

Sub-NUMA cluster systems provide monitoring resources at the NUMA
node scope instead of the L3 cache scope.

Rename the cache_level field in struct rdt_resource to the more
generic "scope" and add symbolic names and a helper function.

No functional change.

Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 4 ++--
arch/x86/kernel/cpu/resctrl/internal.h | 5 +++++
arch/x86/kernel/cpu/resctrl/core.c | 17 +++++++++++------
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
5 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8334eeacfec5..25051daa6655 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -150,7 +150,7 @@ struct resctrl_schema;
* @alloc_capable: Is allocation available on this machine
* @mon_capable: Is monitor feature available on this machine
* @num_rmid: Number of RMIDs available
- * @cache_level: Which cache level defines scope of this resource
+ * @scope: Scope of this resource (cache level or NUMA node)
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
* @domains: All domains for this resource
@@ -168,7 +168,7 @@ struct rdt_resource {
bool alloc_capable;
bool mon_capable;
int num_rmid;
- int cache_level;
+ int scope;
struct resctrl_cache cache;
struct resctrl_membw membw;
struct list_head domains;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 85ceaf9a31ac..8275b8a74f7e 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -440,6 +440,11 @@ enum resctrl_res_level {
RDT_NUM_RESOURCES,
};

+enum resctrl_scope {
+ SCOPE_L2_CACHE = 2,
+ SCOPE_L3_CACHE = 3
+};
+
static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(res);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 030d3b409768..6571514752f3 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L3,
.name = "L3",
- .cache_level = 3,
+ .scope = SCOPE_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_L3),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
@@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L2,
.name = "L2",
- .cache_level = 2,
+ .scope = SCOPE_L2_CACHE,
.domains = domain_init(RDT_RESOURCE_L2),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
@@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_MBA,
.name = "MB",
- .cache_level = 3,
+ .scope = SCOPE_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_MBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
@@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_SMBA,
.name = "SMBA",
- .cache_level = 3,
+ .scope = SCOPE_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_SMBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
@@ -487,6 +487,11 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
return 0;
}

+static int get_domain_id(int cpu, enum resctrl_scope scope)
+{
+ return get_cpu_cacheinfo_id(cpu, scope);
+}
+
/*
* domain_add_cpu - Add a cpu to a resource's domain list.
*
@@ -502,7 +507,7 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
*/
static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ int id = get_domain_id(cpu, r->scope);
struct list_head *add_pos = NULL;
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;
@@ -552,7 +557,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)

static void domain_remove_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ int id = get_domain_id(cpu, r->scope);
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;

diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 458cb7419502..42f124ffb968 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -297,7 +297,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);

for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == plr->s->res->cache_level) {
+ if (ci->info_list[i].level == plr->s->res->scope) {
plr->line_size = ci->info_list[i].coherency_line_size;
return 0;
}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 725344048f85..418658f0a9ad 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1348,7 +1348,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
num_b = bitmap_weight(&cbm, r->cache.cbm_len);
ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == r->cache_level) {
+ if (ci->info_list[i].level == r->scope) {
size = ci->info_list[i].size / r->cache.cbm_len * num_b;
break;
}
--
2.40.1


2023-06-21 17:51:44

by Tony Luck

Subject: [PATCH v2 3/7] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

Add a placeholder in the array of struct rdt_hw_resource to be used
for event monitoring of systems with Sub-NUMA Cluster enabled.

Update get_domain_id() to handle SCOPE_NODE.

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 4 +++-
arch/x86/kernel/cpu/resctrl/core.c | 12 ++++++++++++
2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 8275b8a74f7e..243017096ddf 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -435,6 +435,7 @@ enum resctrl_res_level {
RDT_RESOURCE_L2,
RDT_RESOURCE_MBA,
RDT_RESOURCE_SMBA,
+ RDT_RESOURCE_NODE,

/* Must be the last */
RDT_NUM_RESOURCES,
@@ -442,7 +443,8 @@ enum resctrl_res_level {

enum resctrl_scope {
SCOPE_L2_CACHE = 2,
- SCOPE_L3_CACHE = 3
+ SCOPE_L3_CACHE = 3,
+ SCOPE_NODE,
};

static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 6571514752f3..e4bd3072927c 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -112,6 +112,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
.fflags = RFTYPE_RES_MB,
},
},
+ [RDT_RESOURCE_NODE] =
+ {
+ .r_resctrl = {
+ .rid = RDT_RESOURCE_NODE,
+ .name = "L3",
+ .scope = SCOPE_NODE,
+ .domains = domain_init(RDT_RESOURCE_NODE),
+ .fflags = 0,
+ },
+ },
};

/*
@@ -489,6 +499,8 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)

static int get_domain_id(int cpu, enum resctrl_scope scope)
{
+ if (scope == SCOPE_NODE)
+ return cpu_to_node(cpu);
return get_cpu_cacheinfo_id(cpu, scope);
}

--
2.40.1


2023-06-21 17:51:52

by Tony Luck

Subject: [PATCH v2 6/7] x86/resctrl: Update documentation with Sub-NUMA cluster changes

With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
their name refer to Sub-NUMA nodes instead of L3 cache ids.

Signed-off-by: Tony Luck <[email protected]>
---
Documentation/arch/x86/resctrl.rst | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index cb05d90111b4..13fc9fa664fc 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -345,9 +345,13 @@ When control is enabled all CTRL_MON groups will also contain:
When monitoring is enabled all MON groups will also contain:

"mon_data":
- This contains a set of files organized by L3 domain and by
- RDT event. E.g. on a system with two L3 domains there will
- be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
+ This contains a set of files organized by L3 domain or by NUMA
+ node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+ or enabled respectively) and by RDT event. E.g. on a system with
+ SNC mode disabled with two L3 domains there will be subdirectories
+ "mon_L3_00" and "mon_L3_01" the numerical suffix refers to the
+ L3 cache id. With SNC enabled the directory names are the same,
+ but the numerical suffix refers to the node id. Each of these
directories have one file per event (e.g. "llc_occupancy",
"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
files provide a read out of the current value of the event for
--
2.40.1


2023-06-21 18:04:48

by Randy Dunlap

Subject: Re: [PATCH v2 6/7] x86/resctrl: Update documentation with Sub-NUMA cluster changes

Hi Tony,

On 6/21/23 10:40, Tony Luck wrote:
> With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
> per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
> their name refer to Sub-NUMA nodes instead of L3 cache ids.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> Documentation/arch/x86/resctrl.rst | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index cb05d90111b4..13fc9fa664fc 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -345,9 +345,13 @@ When control is enabled all CTRL_MON groups will also contain:
> When monitoring is enabled all MON groups will also contain:
>
> "mon_data":
> - This contains a set of files organized by L3 domain and by
> - RDT event. E.g. on a system with two L3 domains there will
> - be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
> + This contains a set of files organized by L3 domain or by NUMA
> + node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
> + or enabled respectively) and by RDT event. E.g. on a system with
> + SNC mode disabled with two L3 domains there will be subdirectories
> + "mon_L3_00" and "mon_L3_01" the numerical suffix refers to the

"mon_L3_01". The
or just insert a semi-colon. Anything to break up the run-on sentence.

> + L3 cache id. With SNC enabled the directory names are the same,
> + but the numerical suffix refers to the node id. Each of these
> directories have one file per event (e.g. "llc_occupancy",

has

> "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
> files provide a read out of the current value of the event for

--
~Randy

2023-06-22 14:32:05

by Peter Newman

Subject: Re: [PATCH v2 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

Hi Tony,

On Wed, Jun 21, 2023 at 7:40 PM Tony Luck <[email protected]> wrote:
>
> There isn't a simple hardware enumeration to indicate to software that
> a system is running with Sub-NUMA Cluster enabled.
>
> Compare the number of NUMA nodes with the number of L3 caches to calculate
> the number of Sub-NUMA nodes per L3 cache.
>
> When Sub-NUMA cluster mode is enabled in BIOS setup the RMID counters
> are distributed equally between the SNC nodes within each socket.
>
> E.g. if there are 400 RMID counters, and the system is configured with
> two SNC nodes per socket, then RMID counter 0..199 are used on SNC node
> 0 on the socket, and RMID counter 200..399 on SNC node 1.
>
> A model specific MSR (0xca0) can change the configuration of the RMIDs
> when SNC mode is enabled.
>
> The MSR controls the interpretation of the RMID field in the
> IA32_PQR_ASSOC MSR so that the appropriate hardware counters
> within the SNC node are updated.
>
> Also initialize a per-cpu RMID offset value. Use this
> to calculate the value to write to the IA32_QM_EVTSEL MSR when
> reading RMID event values.
>
> N.B. this works well for well-behaved NUMA applications that access
> memory predominantly from the local memory node. For applications that
> access memory across multiple nodes it may be necessary for the user
> to read counters for all SNC nodes on a socket and add the values to
> get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
> all that different from applications that span across multiple sockets
> in a legacy system.

Unfortunately I'm not getting as good of results with the new series.
The main difference seems to be updating the 0xca0 MSR instead of
applying the offset to PQR_ASSOC.

In my test case of generating bandwidth on 20 random CPUs in 20 random
RMIDs, I'm only getting correct counts from CPUs in node 0. Node 1
CPUs are showing counts which are too small, and nodes 2 and 3 are
seeing no bandwidth at all:

(expected bandwidth is around 30 GB; the value in the first parentheses is the L3 cache id)

cpu: 134 (0), rmid: 30 (g29): 0 -> 30640791552 (30640791552)
cpu: 138 (0), rmid: 103 (g101): 0 -> 28196962304 (28196962304)

cpu: 35 (0), rmid: 211 (g209): 0 -> 3039232 (3039232)
cpu: 55 (0), rmid: 113 (g111): 0 -> 4874240 (4874240)
cpu: 41 (0), rmid: 83 (g81): 0 -> 2637824 (2637824)
cpu: 42 (0), rmid: 218 (g216): 0 -> 2408448 (2408448)
cpu: 161 (0), rmid: 8 (g7): 0 -> 7856128 (7856128)

cpu: 86 (1), rmid: 171 (g169): 0 -> 0 (0)
cpu: 86 (1), rmid: 121 (g119): 0 -> 0 (0)
cpu: 212 (1), rmid: 163 (g161): 0 -> 0 (0)
cpu: 180 (1), rmid: 129 (g127): 0 -> 0 (0)
cpu: 205 (1), rmid: 186 (g184): 0 -> 0 (0)
cpu: 194 (1), rmid: 160 (g158): 0 -> 0 (0)
cpu: 186 (1), rmid: 196 (g194): 0 -> 0 (0)
cpu: 106 (1), rmid: 93 (g91): 0 -> 0 (0)
cpu: 84 (1), rmid: 168 (g166): 0 -> 0 (0)
cpu: 197 (1), rmid: 104 (g102): 0 -> 0 (0)
cpu: 64 (1), rmid: 103 (g101): 0 -> 0 (0)
cpu: 71 (1), rmid: 81 (g79): 0 -> 0 (0)
cpu: 60 (1), rmid: 221 (g219): 0 -> 0 (0)

Here's the output of `cat /sys/devices/system/node/node*/cpulist` on
this machine for reference:

0-27,112-139
28-55,140-167
56-83,168-195
84-111,196-223

-Peter

2023-06-22 16:29:36

by Tony Luck

Subject: RE: [PATCH v2 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

> Unfortunately I'm not getting as good of results with the new series.
> The main difference seems to be updating the 0xca0 MSR instead of
> applying the offset to PQR_ASSOC.

I think I may have reversed the actions to update the MSR during one
of my refactor/rebase steps. The comment here is correct, but that's
not what the code is doing :-(

Can you swap the bodies of these two functions and retest?

+/*
+ * This MSR provides for configuration of RMIDs on Sub-NUMA Cluster
+ * systems.
+ * Bit0 = 1 (default) For legacy configuration
+ * Bit0 = 0 RMIDs are divided evenly between SNC nodes.
+ */
+#define MSR_RMID_SNC_CONFIG 0xCA0
+
+static void snc_add_pkg(void)
+{
+ u64 msrval;
+
+ rdmsrl(MSR_RMID_SNC_CONFIG, msrval);
+ msrval |= BIT_ULL(0);
+ wrmsrl(MSR_RMID_SNC_CONFIG, msrval);
+}
+
+static void snc_remove_pkg(void)
+{
+ u64 msrval;
+
+ rdmsrl(MSR_RMID_SNC_CONFIG, msrval);
+ msrval &= ~BIT_ULL(0);
+ wrmsrl(MSR_RMID_SNC_CONFIG, msrval);
+}

-Tony

2023-06-23 15:35:57

by Peter Newman

Subject: Re: [PATCH v2 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

Hi Tony,

On Thu, Jun 22, 2023 at 6:02 PM Luck, Tony <[email protected]> wrote:
>
> > Unfortunately I'm not getting as good of results with the new series.
> > The main difference seems to be updating the 0xca0 MSR instead of
> > applying the offset to PQR_ASSOC.
>
> I think I may have reversed the actions to update the MSR in one of
> my refactor/rebase. The comment here is correct, but that's not
> what the code is doing :-(
>
> Can you swap the bodies of these two functions and retest?

It's a small improvement, but still not great. Still only node 0
giving believable results, but at least no more empty results from the
second package.

I poked around in /proc/kcore and noticed that my snc_ways is still 1, though.

-Peter

2023-06-23 20:32:27

by Tony Luck

Subject: Re: [PATCH v2 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

On Fri, Jun 23, 2023 at 05:19:37PM +0200, Peter Newman wrote:
> Hi Tony,
>
> On Thu, Jun 22, 2023 at 6:02 PM Luck, Tony <[email protected]> wrote:
> >
> > > Unfortunately I'm not getting as good of results with the new series.
> > > The main difference seems to be updating the 0xca0 MSR instead of
> > > applying the offset to PQR_ASSOC.
> >
> > I think I may have reversed the actions to update the MSR in one of
> > my refactor/rebase. The comment here is correct, but that's not
> > what the code is doing :-(
> >
> > Can you swap the bodies of these two functions and retest?
>
> It's a small improvement, but still not great. Still only node 0
> giving believable results, but at least no more empty results from the
> second package.
>
> I poked around in /proc/kcore and noticed that my snc_ways is still 1, though.

Below is the patch I applied to reverse the add/remove package actions,
together with some debug to make doubly sure SNC is being detected as
I expect and that the right actions are taken.

When booting the debug messages say:

[ 9.458624] resctrl: SNC_ways = 2
[ 9.458801] resctrl: CPU0: set MSR_RMID_SNC_CONFIG to 0x0
[ 9.461986] resctrl: CPU56: set MSR_RMID_SNC_CONFIG to 0x0

which is all good and correct.

For my tests I have a memory hog process that loops on a memcpy()
of 100 MBytes to generate enough traffic to be totally obvious when
looking at the MBM counters.
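
Something like this minimal sketch (my actual program differs in
details):

#include <stdlib.h>
#include <string.h>

#define SZ (100UL << 20)	/* 100 MBytes */

int main(void)
{
	char *src = malloc(SZ);
	char *dst = malloc(SZ);

	if (!src || !dst)
		return 1;
	memset(src, 1, SZ);	/* fault all pages in up front */
	for (;;)		/* endless copy generates steady traffic */
		memcpy(dst, src, SZ);
}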

Test 1: I used taskset(1) to start a copy on the first CPU of node0
and checked the MBM counters. Both local and remote showed around
25 GB/s on node 0. Killed this process.

Tests 2, 3, 4: Same as 1, but started the process on first CPU of node
1, 2, 3. Same result. Around 25 GB/s appeared in the MBM counts for
the right node in each cycle.

Test 5: Started on node0, then periodically used taskset to bind the
running process onto a CPU on another node. It looks like Linux
migrates the memory for the job shortly after the affinity of the
process is changed. A few seconds after each process migration, the
MBM counters reflect traffic on the node that is the new home of the
process.

-Tony

From 414341db02cd51daaf4a9ea8bd68b22a23cf4b59 Mon Sep 17 00:00:00 2001
From: Tony Luck <[email protected]>
Date: Fri, 23 Jun 2023 08:57:57 -0700
Subject: [PATCH] Fix inverted SNC enable/disable MSR writes. Add some debug too.

---
arch/x86/kernel/cpu/resctrl/core.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 869cfb46e8e4..e66b2b84fe6f 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -546,18 +546,20 @@ static void snc_add_pkg(void)
{
u64 msrval;

- rdmsrl(MSR_RMID_SNC_CONFIG, msrval);
- msrval |= BIT_ULL(0);
- wrmsrl(MSR_RMID_SNC_CONFIG, msrval);
-}
-
-static void snc_remove_pkg(void)
-{
- u64 msrval;
-
rdmsrl(MSR_RMID_SNC_CONFIG, msrval);
msrval &= ~BIT_ULL(0);
wrmsrl(MSR_RMID_SNC_CONFIG, msrval);
+pr_info("CPU%d: set MSR_RMID_SNC_CONFIG to 0x%llx\n", raw_smp_processor_id(), msrval);
+}
+
+static void snc_remove_pkg(void)
+{
+ u64 msrval;
+
+ rdmsrl(MSR_RMID_SNC_CONFIG, msrval);
+ msrval |= BIT_ULL(0);
+ wrmsrl(MSR_RMID_SNC_CONFIG, msrval);
+pr_info("CPU%d: set MSR_RMID_SNC_CONFIG to 0x%llx\n", raw_smp_processor_id(), msrval);
}

/*
@@ -987,6 +989,7 @@ static __init int find_snc_ways(void)
if (ret > 1)
rdt_resources_all[RDT_RESOURCE_PKG].r_resctrl.pkg_actions = true;

+pr_info("SNC_ways = %d\n", ret);
return ret;
}

--
2.40.1


2023-06-26 12:44:27

by Peter Newman

Subject: Re: [PATCH v2 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

Hi Tony,

On Fri, Jun 23, 2023 at 10:20 PM Tony Luck <[email protected]> wrote:
> On Fri, Jun 23, 2023 at 05:19:37PM +0200, Peter Newman wrote:
> > On Thu, Jun 22, 2023 at 6:02 PM Luck, Tony <[email protected]> wrote:
> > >
> > > > Unfortunately I'm not getting as good of results with the new series.
> > > > The main difference seems to be updating the 0xca0 MSR instead of
> > > > applying the offset to PQR_ASSOC.
> > >
> > > I think I may have reversed the actions to update the MSR in one of
> > > my refactor/rebase. The comment here is correct, but that's not
> > > what the code is doing :-(
> > >
> > > Can you swap the bodies of these two functions and retest?
> >
> > It's a small improvement, but still not great. Still only node 0
> > giving believable results, but at least no more empty results from the
> > second package.
> >
> > I poked around in /proc/kcore and noticed that my snc_ways is still 1, though.

It turns out I just forgot that I had KASLR on. snc_ways was in fact 2.

The real problem was my test program was assuming that the absence of
the snc_ways file meant no SNC, so it was using cache IDs instead of
node IDs when choosing a mon_data subdirectory to read results from.

With that fixed, the results look good:

cpu: 95 (3), rmid: 17 (g16): 0 -> 32313974784 (32313974784)
cpu: 198 (3), rmid: 103 (g102): 0 -> 26078961664 (26078961664)
cpu: 117 (0), rmid: 59 (g58): 0 -> 26692599808 (26692599808)
cpu: 152 (1), rmid: 113 (g112): 0 -> 33368244224 (33368244224)
cpu: 33 (1), rmid: 77 (g76): 0 -> 26902077440 (26902077440)
cpu: 63 (2), rmid: 76 (g75): 0 -> 32478494720 (32478494720)
cpu: 90 (3), rmid: 94 (g93): 0 -> 31206088704 (31206088704)
cpu: 136 (0), rmid: 13 (g12): 0 -> 28095463424 (28095463424)
cpu: 37 (1), rmid: 177 (g175): 0 -> 31255060480 (31255060480)
cpu: 124 (0), rmid: 6 (g5): 0 -> 31128502272 (31128502272)
cpu: 102 (3), rmid: 68 (g67): 0 -> 28848963584 (28848963584)
cpu: 103 (3), rmid: 62 (g61): 0 -> 26091233280 (26091233280)
cpu: 71 (2), rmid: 123 (g122): 0 -> 29157933056 (29157933056)
cpu: 94 (3), rmid: 69 (g68): 0 -> 27776458752 (27776458752)
cpu: 102 (3), rmid: 174 (g172): 0 -> 26349281280 (26349281280)
cpu: 155 (1), rmid: 3 (g2): 0 -> 33489125376 (33489125376)
cpu: 40 (1), rmid: 16 (g15): 0 -> 27142750208 (27142750208)
cpu: 69 (2), rmid: 156 (g154): 0 -> 29294411776 (29294411776)
cpu: 121 (0), rmid: 63 (g62): 0 -> 30636777472 (30636777472)
cpu: 171 (2), rmid: 93 (g92): 0 -> 26103046144 (26103046144)

(and it turns out I never needed to manually look up the node IDs,
because the test output would have already told me, had the test been
working correctly)

Sorry for the extra trouble. The series works well for me.

Tested-by: Peter Newman <[email protected]>

Thanks!
-Peter

2023-06-26 12:56:53

by Peter Newman

Subject: Re: [PATCH v2 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

Hi Tony,

On Wed, Jun 21, 2023 at 7:40 PM Tony Luck <[email protected]> wrote:

> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> index 255a78d9d906..f95e69bacc65 100644
> --- a/arch/x86/include/asm/resctrl.h
> +++ b/arch/x86/include/asm/resctrl.h
> @@ -35,6 +35,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
> DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
> DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
>
> +DECLARE_PER_CPU(int, rmid_offset);
> +

I assumed that declarations in this file were those needed by
__resctrl_sched_in(). Now that rmid_offset is no longer used when
setting PQR_ASSOC, would this go somewhere else?

Other than this and fixing the MSR update, the series looks fine to me.

Reviewed-by: Peter Newman <[email protected]>

Thanks!
-Peter

2023-06-26 16:15:35

by Tony Luck

Subject: RE: [PATCH v2 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

>> +++ b/arch/x86/include/asm/resctrl.h
>> @@ -35,6 +35,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
>> DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>> DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
>>
>> +DECLARE_PER_CPU(int, rmid_offset);
>> +
>
> I assumed that declarations in this file were those needed by
> __resctrl_sched_in(). Now that rmid_offset is used when setting
> PQR_ASSOC, would this go somewhere else?

PQR_ASSOC no longer needs rmid_offset. But QM_EVTSEL still does.

I'll take a look to see if all the SNC detection code can move into
monitor.c. Then rmid_offset could be static in that file. But if that gets
complicated I may leave it alone (with rmid_offset set in core.c and used
in monitor.c).

> Other than this and fixing the MSR update, the series looks fine to me.
>
> Reviewed-By: Peter Newman <[email protected]>

Peter,

Thanks for testing and review. Did you also test on one of your systems
with a memory-only node? I recall that was an issue with my very first
patch series.

-Tony

2023-06-26 16:43:12

by Tony Luck

Subject: RE: [PATCH v2 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

> I'll take a look to see if all the SNC detection code can move into
> monitor.c. Then rmid_offset could be static in that file. But if that gets
> complicated I may leave it alone (with rmid_offset set in core.c and used
> in monitor.c).

Hmm. Setting rmid_offset is deeply tangled into the cpu/domain
hotplug code path. So I'm going to leave that in core.c

-Tony

2023-06-28 13:57:01

by Peter Newman

Subject: Re: [PATCH v2 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

Hi Tony,

On Mon, Jun 26, 2023 at 5:52 PM Luck, Tony <[email protected]> wrote:

> Thanks for testing and review. Did you also test on one of your systems
> with a memory-only node? I recall that was an issue with my very first
> patch series.

It turns out the people who reported the crash to me were running an
experiment, but they were able to find one machine they hadn't yanked
the AEP DIMMs out of yet.

It was a Cascade Lake, so I had to yank the CPU family match check to
ensure that your fix was actually exercised. I can confirm now that it
booted successfully.

-Peter

2023-06-29 07:44:51

by Shaopeng Tan (Fujitsu)

Subject: RE: [PATCH v2 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

Hi Tony,

> There isn't a simple hardware enumeration to indicate to software that a
> system is running with Sub-NUMA Cluster enabled.
>
> Compare the number of NUMA nodes with the number of L3 caches to calculate
> the number of Sub-NUMA nodes per L3 cache.
>
> When Sub-NUMA cluster mode is enabled in BIOS setup the RMID counters are
> distributed equally between the SNC nodes within each socket.
>
> E.g. if there are 400 RMID counters, and the system is configured with two SNC
> nodes per socket, then RMID counter 0..199 are used on SNC node
> 0 on the socket, and RMID counter 200..399 on SNC node 1.
>
> A model specific MSR (0xca0) can change the configuration of the RMIDs when
> SNC mode is enabled.
>
> The MSR controls the interpretation of the RMID field in the IA32_PQR_ASSOC
> MSR so that the appropriate hardware counters within the SNC node are
> updated.
>
> Also initialize a per-cpu RMID offset value. Use this to calculate the value to
> write to the IA32_QM_EVTSEL MSR when reading RMID event values.
>
> N.B. this works well for well-behaved NUMA applications that access memory
> predominantly from the local memory node. For applications that access
> memory across multiple nodes it may be necessary for the user to read counters
> for all SNC nodes on a socket and add the values to get the actual LLC
> occupancy or memory bandwidth. Perhaps this isn't all that different from
> applications that span across multiple sockets in a legacy system.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/include/asm/resctrl.h | 2 +
> arch/x86/kernel/cpu/resctrl/core.c | 99 ++++++++++++++++++++++++++-
> arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
> 3 files changed, 99 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> index 255a78d9d906..f95e69bacc65 100644
> --- a/arch/x86/include/asm/resctrl.h
> +++ b/arch/x86/include/asm/resctrl.h
> @@ -35,6 +35,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
> DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
> DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
>
> +DECLARE_PER_CPU(int, rmid_offset);
> +
> /*
> * __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
> *
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index af3be3c2db96..869cfb46e8e4 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -16,11 +16,14 @@
>
> #define pr_fmt(fmt) "resctrl: " fmt
>
> +#include <linux/cpu.h>
> #include <linux/slab.h>
> #include <linux/err.h>
> #include <linux/cacheinfo.h>
> #include <linux/cpuhotplug.h>
> +#include <linux/mod_devicetable.h>
>
> +#include <asm/cpu_device_id.h>
> #include <asm/intel-family.h>
> #include <asm/resctrl.h>
> #include "internal.h"
> @@ -524,6 +527,39 @@ static int get_domain_id(int cpu, enum resctrl_scope scope)
> }
> }
>
> +DEFINE_PER_CPU(int, rmid_offset);
> +
> +static void set_per_cpu_rmid_offset(int cpu, struct rdt_resource *r)
> +{
> + this_cpu_write(rmid_offset, (cpu_to_node(cpu) % snc_ways) * r->num_rmid);
> +}
> +
> +/*
> + * This MSR provides for configuration of RMIDs on Sub-NUMA Cluster
> + * systems.
> + * Bit0 = 1 (default) For legacy configuration
> + * Bit0 = 0 RMIDs are divided evenly between SNC nodes.
> + */
> +#define MSR_RMID_SNC_CONFIG 0xCA0
> +
> +static void snc_add_pkg(void)
> +{
> + u64 msrval;
> +
> + rdmsrl(MSR_RMID_SNC_CONFIG, msrval);
> + msrval |= BIT_ULL(0);
> + wrmsrl(MSR_RMID_SNC_CONFIG, msrval);
> +}
> +
> +static void snc_remove_pkg(void)
> +{
> + u64 msrval;
> +
> + rdmsrl(MSR_RMID_SNC_CONFIG, msrval);
> + msrval &= ~BIT_ULL(0);
> + wrmsrl(MSR_RMID_SNC_CONFIG, msrval);
> +}
> +
> /*
> * domain_add_cpu - Add a cpu to a resource's domain list.
> *
> @@ -555,6 +591,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
> cpumask_set_cpu(cpu, &d->cpu_mask);
> if (r->cache.arch_has_per_cpu_cfg)
> rdt_domain_reconfigure_cdp(r);
> + if (r->mon_capable)
> + set_per_cpu_rmid_offset(cpu, r);
> return;
> }
>
> @@ -573,11 +611,17 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
> return;
> }
>
> - if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> - domain_free(hw_dom);
> - return;
> + if (r->mon_capable) {
> + if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> + domain_free(hw_dom);
> + return;
> + }
> + set_per_cpu_rmid_offset(cpu, r);
> }
>
> + if (r->pkg_actions)
> + snc_add_pkg();
> +
> list_add_tail(&d->list, add_pos);
>
> err = resctrl_online_domain(r, d);
> @@ -613,6 +657,9 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> d->plr->d = NULL;
> domain_free(hw_dom);
>
> + if (r->pkg_actions)
> + snc_remove_pkg();
> +
> return;
> }
>
> @@ -899,11 +946,57 @@ static __init bool get_rdt_resources(void)
> return (rdt_mon_capable || rdt_alloc_capable);
> }
>
> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> + X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
> + X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
> + X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
> + {}
> +};

Cascade Lake and Skylake also seem to support Sub-NUMA Cluster.
At least in my environment (Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz),
Sub-NUMA Cluster is supported.

Best regards,
Shaopeng TAN

2023-06-29 07:49:57

by Shaopeng Tan (Fujitsu)

Subject: RE: [PATCH v2 5/7] x86/resctrl: Add package scoped resource

Hi Tony,

> Some Intel features require setting a package scoped model specific register.
>
> Add a new resource that builds domains for each package.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> include/linux/resctrl.h | 1 +
> arch/x86/kernel/cpu/resctrl/internal.h | 6 ++++++
> arch/x86/kernel/cpu/resctrl/core.c | 23 +++++++++++++++++++----
> 3 files changed, 26 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 25051daa6655..f504f6263fec 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -167,6 +167,7 @@ struct rdt_resource {
> int rid;
> bool alloc_capable;
> bool mon_capable;
> + bool pkg_actions;
> int num_rmid;
> int scope;
> struct resctrl_cache cache;
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 38bac0062c82..e51a5004be77 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -438,6 +438,7 @@ enum resctrl_res_level {
> RDT_RESOURCE_MBA,
> RDT_RESOURCE_SMBA,
> RDT_RESOURCE_NODE,
> + RDT_RESOURCE_PKG,
>
> /* Must be the last */
> RDT_NUM_RESOURCES,
> @@ -447,6 +448,7 @@ enum resctrl_scope {
> SCOPE_L2_CACHE = 2,
> SCOPE_L3_CACHE = 3,
> SCOPE_NODE,
> + SCOPE_PKG,
> };
>
> static inline int get_mbm_res_level(void)
> @@ -482,6 +484,10 @@ int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
> for_each_rdt_resource(r) \
> if (r->alloc_capable || r->mon_capable)
>
> +#define for_each_domain_needed_rdt_resource(r) \
> + for_each_rdt_resource(r) \
> + if (r->alloc_capable || r->mon_capable || r->pkg_actions)
> +
> #define for_each_alloc_capable_rdt_resource(r) \
> for_each_rdt_resource(r) \
> if (r->alloc_capable)
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 6fe9f87d4403..af3be3c2db96 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -127,6 +127,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .fflags = 0,
> },
> },
> + [RDT_RESOURCE_PKG] =
> + {
> + .r_resctrl = {
> + .rid = RDT_RESOURCE_PKG,
> + .name = "PKG",
> + .scope = SCOPE_PKG,
> + .domains = domain_init(RDT_RESOURCE_PKG),
> + .fflags = 0,
> + },
> + },
> };
>
> /*
> @@ -504,9 +514,14 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>
> static int get_domain_id(int cpu, enum resctrl_scope scope)
> {
> - if (scope == SCOPE_NODE)
> + switch (scope) {
> + case SCOPE_NODE:
> return cpu_to_node(cpu);
> - return get_cpu_cacheinfo_id(cpu, scope);
> + case SCOPE_PKG:
> + return topology_physical_package_id(cpu);
> + default:
> + return get_cpu_cacheinfo_id(cpu, scope);
> + }
> }
>
> /*
> @@ -630,7 +645,7 @@ static int resctrl_online_cpu(unsigned int cpu)
> struct rdt_resource *r;
>
> mutex_lock(&rdtgroup_mutex);
> - for_each_capable_rdt_resource(r)
> + for_each_domain_needed_rdt_resource(r)
> domain_add_cpu(cpu, r);
> /* The cpu is set in default rdtgroup after online. */
> cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
> @@ -657,7 +672,7 @@ static int resctrl_offline_cpu(unsigned int cpu)
> struct rdt_resource *r;
>
> mutex_lock(&rdtgroup_mutex);
> - for_each_capable_rdt_resource(r)
> + for_each_domain_needed_rdt_resource(r)

The function for_each_capable_rdt_resource(r) is no longer used
anywhere, so I think it is better to remove the
"#define for_each_capable_rdt_resource(r)" definition as well.

Best regards,
Shaopeng TAN


2023-06-29 09:09:31

by Shaopeng Tan (Fujitsu)

Subject: RE: [PATCH v2 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems

Hi tony,

I ran selftests/resctrl in my environment, and the CMT test failed
when I enabled Sub-NUMA Cluster.

I don't know why it failed yet; I paste the test results below.

Processer in my environment:
Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz

$ sudo make -C tools/testing/selftests/resctrl run_tests
# # Starting CMT test ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Mounting resctrl to "/sys/fs/resctrl"
# # Cache size :25952256
# # Benchmark PID: 8638
# # Writing benchmark parameters to resctrl FS
# # Checking for pass/fail
# # Fail: Check cache miss rate within 15%
# # Percent diff=21
# # Number of bits: 5
# # Average LLC val: 9216000
# # Cache span (bytes): 11796480
# not ok 3 CMT: test

Best regards,
Shaopeng TAN

2023-06-29 16:06:17

by Tony Luck

Subject: RE: [PATCH v2 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

>> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
>> + X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
>> + X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
>> + X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
>> + {}
>> +};
>
> Cascade Lake and Skylake also seem to support Sub-NUMA cluster.
> At least in my environment(Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz),
> Sub-NUMA cluster is supported.

Shaopeng TAN,

This is true. But MSR_RMID_SNC_CONFIG (0xCA0) is not implemented
for CPUs before ICELAKE.

-Tony

2023-06-29 16:16:37

by Tony Luck

Subject: RE: [PATCH v2 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems

> I ran selftests/resctrl in my environment, and the CMT test failed
> when I enabled Sub-NUMA Clustering.
>
> I don't know why it failed yet; I have pasted the test results below.
>
> Processor in my environment:
> Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz
>
> $ sudo make -C tools/testing/selftests/resctrl run_tests
> # # Starting CMT test ...
> # # Mounting resctrl to "/sys/fs/resctrl"
> # # Mounting resctrl to "/sys/fs/resctrl"
> # # Cache size :25952256
> # # Benchmark PID: 8638
> # # Writing benchmark parameters to resctrl FS
> # # Checking for pass/fail
> # # Fail: Check cache miss rate within 15%
> # # Percent diff=21
> # # Number of bits: 5
> # # Average LLC val: 9216000
> # # Cache span (bytes): 11796480
> # not ok 3 CMT: test

This is expected. When SNC is enabled, CAT still supports the same number of
bits in the allocation cache mask. But each bit represents half as much cache.

Think of the cache as a 2-D matrix with the cache-ways (bits in the
CAT mask) as the columns and the hashed index of the physical address
as the rows. When SNC is turned on, the hash function for physical
addresses from one of the SNC nodes will only pick half of those rows
(and the other SNC node gets the other half).
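
To put numbers on that: assuming the selftest computes its expected
value as span = cache_size * bits / ways (my reading of the test, and
the 11 ways below is inferred from the output rather than read from
the machine), the reported failure falls out directly:

#include <stdio.h>

int main(void)
{
	long cache_size = 25952256;	/* "Cache size" from the test output */
	long bits = 5, ways = 11;	/* 25952256 * 5 / 11 = 11796480 */
	long span = cache_size * bits / ways;
	long avg_llc = 9216000;		/* "Average LLC val" from the output */

	/* (11796480 - 9216000) * 100 / 11796480 = 21, above the 15% limit */
	printf("percent diff = %ld\n", (span - avg_llc) * 100 / span);
	return 0;
}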

-Tony

2023-07-11 21:25:08

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems

Hi Tony,

On 6/29/2023 9:05 AM, Luck, Tony wrote:
>> I ran selftests/resctrl in my environment, and the CMT test failed
>> when I enabled Sub-NUMA Clustering.
>>
>> I don't know why it failed yet; I have pasted the test results below.
>>
>> Processor in my environment:
>> Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz
>>
>> $ sudo make -C tools/testing/selftests/resctrl run_tests
>> # # Starting CMT test ...
>> # # Mounting resctrl to "/sys/fs/resctrl"
>> # # Mounting resctrl to "/sys/fs/resctrl"
>> # # Cache size :25952256
>> # # Benchmark PID: 8638
>> # # Writing benchmark parameters to resctrl FS
>> # # Checking for pass/fail
>> # # Fail: Check cache miss rate within 15%
>> # # Percent diff=21
>> # # Number of bits: 5
>> # # Average LLC val: 9216000
>> # # Cache span (bytes): 11796480
>> # not ok 3 CMT: test
>
> This is expected. When SNC is enabled, CAT still supports the same number of
> bits in the allocation cache mask. But each bit represents half as much cache.
>
> Think of the cache as a 2-D matrix with the cache-ways (bits in the
> CAT mask) as the columns and the hashed index of the physical address
> as the rows. When SNC is turned on, the hash function for physical
> addresses from one of the SNC nodes will only pick half of those rows
> (and the other SNC node gets the other half).

If a test is expected to fail in a particular scenario then I think
the test failure should be communicated as a "pass". If not, this will
reduce confidence in the accuracy of the tests. Even so, from the
description it sounds as though this test could be made more accurate
so that it indeed passes when SNC is enabled?

Reinette

2023-07-11 21:39:17

by Tony Luck

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems

On Tue, Jul 11, 2023 at 01:50:02PM -0700, Reinette Chatre wrote:
> Hi Tony,
> > This is expected. When SNC is enabled, CAT still supports the same number of
> > bits in the allocation cache mask. But each bit represents half as much cache.
> >
> > Think of the cache as a 2-D matrix with the cache-ways (bits in the
> > CAT mask) as the columns and the hashed index of the physical address
> > as the rows. When SNC is turned on, the hash function for physical
> > addresses from one of the SNC nodes will only pick half of those rows
> > (and the other SNC node gets the other half).
>
> If a test is expected to fail in a particular scenario then I think
> the test failure should be communicated as a "pass". If not, this will
> reduce confidence in the accuracy of the tests. Even so, from the
> description it sounds as though this test could be made more accurate
> so that it indeed passes when SNC is enabled?

Hi Reinette,

Yes. This could be done. The resctrl tests would need to determine
if SNC mode is enabled. But I think that is possible by comparing the
output of sysfs files. E.g. with SNC disabled, the CPU list for a node
and the L3 shared_cpu_list for a CPU on that node will match like this:

$ cat /sys/devices/system/node/node0/cpulist
0-35,72-107
$ cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
0-35,72-107

but with SNC enabled, the CPUs sharing a cache will be divided across
two or four nodes.
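
A sketch of what such a check could look like in the selftests
(untested; the helper name and the fixed node0/cpu0 paths are my
assumptions):

#include <stdio.h>
#include <string.h>

static int read_line(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return 0;
}

/* Returns 1 if node 0's CPU list differs from the list of CPUs
 * sharing CPU 0's L3 cache - the SNC signature described above. */
static int snc_enabled(void)
{
	char node[512], cache[512];

	if (read_line("/sys/devices/system/node/node0/cpulist",
		      node, sizeof(node)) ||
	    read_line("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list",
		      cache, sizeof(cache)))
		return 0;	/* assume disabled if sysfs is unreadable */

	return strcmp(node, cache) != 0;
}

A helper like this could also gate the existing ksft_print_msg()
warning quoted below so that it only fires when SNC really is the
likely culprit.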

It looks like the existing tests may print a warning. I see
this code in:

tools/testing/selftests/resctrl/resctrl_tests.c

	res = cmt_resctrl_val(cpu_no, 5, benchmark_cmd);
	ksft_test_result(!res, "CMT: test\n");
	if ((get_vendor() == ARCH_INTEL) && res)
		ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");

but at first glance that warning doesn't appear to try to
check whether SNC was the actual problem.

-Tony

2023-07-11 22:22:25

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems

Hi Tony,

On 7/11/2023 2:23 PM, Tony Luck wrote:
> On Tue, Jul 11, 2023 at 01:50:02PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>> This is expected. When SNC is enabled, CAT still supports the same number of
>>> bits in the allocation cache mask. But each bit represents half as much cache.
>>>
>>> Think of the cache as a 2-D matrix with the cache-ways (bits in the
>>> CAT mask) as the columns and the hashed index of the physical address
>>> as the rows. When SNC is turned on, the hash function for physical
>>> addresses from one of the SNC nodes will only pick half of those rows
>>> (and the other SNC node gets the other half).
>>
>> If a test is expected to fail in a particular scenario then I think
>> the test failure should be communicated as a "pass". If not, this will
>> reduce confidence in the accuracy of the tests. Even so, from the
>> description it sounds as though this test could be made more accurate
>> so that it indeed passes when SNC is enabled?
>
> Hi Reinette,
>
> Yes. This could be done. The resctrl tests would need to determine
> if SNC mode is enabled. But I think that is possible by comparing the
> output of sysfs files. E.g. with SNC disabled, the CPU list for a node
> and the L3 shared_cpu_list for a CPU on that node will match like this:
>
> $ cat /sys/devices/system/node/node0/cpulist
> 0-35,72-107
> $ cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
> 0-35,72-107
>
> but with SNC enabled, the CPUs sharing a cache will be divided across
> two or four nodes.
>
> It looks like the existing tests may print a warning. I see
> this code in:
>
> tools/testing/selftests/resctrl/resctrl_tests.c
>
> 	res = cmt_resctrl_val(cpu_no, 5, benchmark_cmd);
> 	ksft_test_result(!res, "CMT: test\n");
> 	if ((get_vendor() == ARCH_INTEL) && res)
> 		ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
>
> but at first glance that warning doesn't appear to try to
> check whether SNC was the actual problem.

Your first glance is accurate. This message was added after tests were
found to fail on SNC systems, but without a correct way to enumerate
whether SNC is enabled. At that time it was still recommended that SNC
not be enabled, and thus the test failures continued to be accurate.
This work changes that.

Reinette