Intel server systems starting with Skylake support a mode that logically
partitions each socket. E.g. when partitioned two ways, half the cores,
L3 cache, and memory controllers are allocated to each of the partitions.
This may reduce average latency to access L3 cache and memory, with the
tradeoff that only half the L3 cache is available for subnode-local memory
access.
The existing Linux resctrl system mishandles RDT monitoring on systems
with Sub-NUMA Cluster (SNC) mode enabled.
But, with some simple changes, this can be fixed. When SNC mode is
enabled, the RDT RMID counters are also partitioned, with the low numbered
counters going to the first partition and the high numbered counters
to the second partition[1]. The key is to adjust the RMID value written
to the IA32_PQR_ASSOC MSR on context switch, and the value written to
the IA32_QM_EVTSEL MSR when reading out counters, and to change the
scaling factor that was read from CPUID(0xf,1).EBX.
E.g. in a 2-way Sub-NUMA Cluster configuration with 200 RMID counters,
only 100 counters are available to the resctrl code. When running on the
first SNC node, RMID values 0..99 are used as before. But when running on
the second node, a task that is assigned resctrl rmid=10 must load 10+100
into IA32_PQR_ASSOC to use RMID counter 110.
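Conceptually the per-CPU adjustment looks something like the sketch below
(illustration only; the actual code in the last patch of this series caches
the offset in a per-cpu "rmid_offset" variable):

static u32 snc_hw_rmid(u32 rmid, int cpu)
{
	/*
	 * Sketch: num_rmid is the per-SNC-node count (100 in the
	 * example above), snc_ways as described in this series.
	 */
	return rmid + (cpu_to_node(cpu) % snc_ways) * num_rmid;
}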
There should be no changes to functionality on other architectures,
or on Intel systems with SNC disabled, where snc_ways == 1.
-Tony
[1] Some systems also support a 4-way split. All the above still
applies, just need to account for cores, cache, memory controllers
and RMID counters being divided four ways instead of two.
Tony Luck (7):
x86/resctrl: Refactor in preparation for node-scoped resources
x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c
x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
x86/resctrl: Add code to setup monitoring at L3 or NODE scope.
x86/resctrl: Add a new "snc_ways" file to the monitoring info
directory.
x86/resctrl: Update documentation with Sub-NUMA cluster changes
x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
Documentation/x86/resctrl.rst | 15 +++-
include/linux/resctrl.h | 4 +-
arch/x86/include/asm/resctrl.h | 4 +-
arch/x86/kernel/cpu/resctrl/internal.h | 9 +++
arch/x86/kernel/cpu/resctrl/core.c | 83 ++++++++++++++++++++---
arch/x86/kernel/cpu/resctrl/monitor.c | 24 ++++---
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 22 +++++-
8 files changed, 136 insertions(+), 27 deletions(-)
--
2.39.1
Add a placeholder in the array of struct rdt_hw_resource to be used
for event monitoring of systems with Sub-NUMA Cluster enabled.
Update get_domain_id() to handle SCOPE_NODE.
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 12 ++++++++++++
2 files changed, 14 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 15cea517efaa..39a62babd60b 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -409,12 +409,14 @@ enum resctrl_res_level {
RDT_RESOURCE_L3,
RDT_RESOURCE_L2,
RDT_RESOURCE_MBA,
+ RDT_RESOURCE_NODE,
/* Must be the last */
RDT_NUM_RESOURCES,
};
enum resctrl_scope {
+ SCOPE_NODE,
SCOPE_L2_CACHE = 2,
SCOPE_L3_CACHE = 3
};
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 6914232acf84..19be6fe42ef3 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -100,6 +100,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
.fflags = RFTYPE_RES_MB,
},
},
+ [RDT_RESOURCE_NODE] =
+ {
+ .r_resctrl = {
+ .rid = RDT_RESOURCE_NODE,
+ .name = "L3",
+ .scope = SCOPE_NODE,
+ .domains = domain_init(RDT_RESOURCE_NODE),
+ .fflags = RFTYPE_RES_MB,
+ },
+ },
};
/*
@@ -464,6 +474,8 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
static int get_domain_id(int cpu, enum resctrl_scope scope)
{
+ if (scope == SCOPE_NODE)
+ return cpu_to_node(cpu);
return get_cpu_cacheinfo_id(cpu, scope);
}
--
2.39.1
When Sub-NUMA Cluster is enabled (snc_ways > 1) use RDT_RESOURCE_NODE
instead of RDT_RESOURCE_L3 for all monitoring operations.
The mon_scale and num_rmid values from CPUID(0xf,0x1),(EBX,ECX) must be
scaled down by the number of Sub-NUMA Clusters.
A subsequent change will detect Sub-NUMA Cluster mode and set
"snc_ways". For now it is set to one (meaning each L3 cache spans one
node).
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 13 ++++++++++++-
arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++--
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 5 ++++-
4 files changed, 20 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 39a62babd60b..ad26d008dafa 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -405,6 +405,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
extern struct dentry *debugfs_resctrl;
+extern int snc_ways;
+
enum resctrl_res_level {
RDT_RESOURCE_L3,
RDT_RESOURCE_L2,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 19be6fe42ef3..53b2ab37af2f 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,11 @@ int max_name_width, max_data_width;
*/
bool rdt_alloc_capable;
+/*
+ * How many Sub-NUMA Cluster nodes share a single L3 cache
+ */
+int snc_ways = 1;
+
static void
mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r);
@@ -786,7 +791,13 @@ static __init bool get_rdt_alloc_resources(void)
static __init bool get_rdt_mon_resources(void)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_resource *r;
+
+ /* When SNC enabled, monitor functions at node instead of L3 cache scope */
+ if (snc_ways > 1)
+ r = &rdt_resources_all[RDT_RESOURCE_NODE].r_resctrl;
+ else
+ r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index d05bbd4f6b2d..3fc63aa68130 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -777,8 +777,8 @@ int rdt_get_mon_l3_config(struct rdt_resource *r)
int ret;
resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
- hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
- r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+ hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_ways;
+ r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_ways;
hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index a6ba3080e5db..a0dc64a70d01 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2238,7 +2238,10 @@ static int rdt_get_tree(struct fs_context *fc)
static_branch_enable_cpuslocked(&rdt_enable_key);
if (is_mbm_enabled()) {
- r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ if (snc_ways > 1)
+ r = &rdt_resources_all[RDT_RESOURCE_NODE].r_resctrl;
+ else
+ r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
list_for_each_entry(dom, &r->domains, list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
}
--
2.39.1
Make it easy for the user to tell if Sub-NUMA Cluster is enabled by
providing an info/ file.
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index a0dc64a70d01..392e7a08d083 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -997,6 +997,14 @@ static int rdt_num_rmids_show(struct kernfs_open_file *of,
return 0;
}
+static int rdt_snc_ways_show(struct kernfs_open_file *of,
+ struct seq_file *seq, void *v)
+{
+ seq_printf(seq, "%d\n", snc_ways);
+
+ return 0;
+}
+
static int rdt_mon_features_show(struct kernfs_open_file *of,
struct seq_file *seq, void *v)
{
@@ -1451,6 +1459,13 @@ static struct rftype res_common_files[] = {
.seq_show = rdt_num_rmids_show,
.fflags = RF_MON_INFO,
},
+ {
+ .name = "snc_ways",
+ .mode = 0444,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .seq_show = rdt_snc_ways_show,
+ .fflags = RF_MON_INFO,
+ },
{
.name = "cbm_mask",
.mode = 0444,
--
2.39.1
With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffices of directories with "L3" in
their name refer to Sun-NUMA nodes instead of L3 cache ids.
Signed-off-by: Tony Luck <[email protected]>
---
Documentation/x86/resctrl.rst | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/Documentation/x86/resctrl.rst b/Documentation/x86/resctrl.rst
index 71a531061e4e..9043a2d2f2d3 100644
--- a/Documentation/x86/resctrl.rst
+++ b/Documentation/x86/resctrl.rst
@@ -167,6 +167,11 @@ with the following files:
bytes) at which a previously used LLC_occupancy
counter can be considered for re-use.
+"snc_ways":
+ A value of "1" marks that SNC mode is disabled.
+ Values of "2" or "4" indicate how many NUMA
+ nodes share an L3 cache.
+
Finally, in the top level of the "info" directory there is a file
named "last_cmd_status". This is reset with every "command" issued
via the file system (making new directories or writing to any of the
@@ -254,9 +259,13 @@ When control is enabled all CTRL_MON groups will also contain:
When monitoring is enabled all MON groups will also contain:
"mon_data":
- This contains a set of files organized by L3 domain and by
- RDT event. E.g. on a system with two L3 domains there will
- be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
+ This contains a set of files organized by L3 domain or by NUMA
+ node (depending on whether SNC mode is disabled or enabled
+ respectively) and by RDT event. E.g. on a system with SNC
+ mode disabled with two L3 domains there will be subdirectories
+ "mon_L3_00" and "mon_L3_01" the numerical suffix refers to the
+ L3 cache id. With SNC enabled the directory names are the same,
+ but the numerical suffix refers to the node id. Each of these
directories have one file per event (e.g. "llc_occupancy",
"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
files provide a read out of the current value of the event for
--
2.39.1
There isn't a simple hardware enumeration to indicate to software that
a system is running with Sub-NUMA Cluster enabled.
Compare the number of NUMA nodes with the number of L3 caches to calculate
the number of Sub-NUMA nodes per L3 cache.
When Sub-NUMA Cluster mode is enabled in BIOS setup, the RMID counters
are distributed equally between the SNC nodes within each socket.
E.g. if there are 400 RMID counters, and the system is configured with
two SNC nodes per socket, then RMID counters 0..199 are used on SNC node
0 of the socket, and RMID counters 200..399 on SNC node 1.
Handle this by initializing a per-cpu RMID offset value. Use this
to calculate the value to write to the RMID field of the IA32_PQR_ASSOC
MSR during context switch, and also to the IA32_QM_EVTSEL MSR when
reading RMID event values.
N.B. this works well for well-behaved NUMA applications that access
memory predominantly from the local memory node. For applications that
access memory across multiple nodes it may be necessary for the user
to read counters for all SNC nodes on a socket and add the values to
get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
all that different from applications that span across multiple sockets
in a legacy system.
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/include/asm/resctrl.h | 4 ++-
arch/x86/kernel/cpu/resctrl/core.c | 43 +++++++++++++++++++++++++--
arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
3 files changed, 44 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 52788f79786f..59b8afd8c53c 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -35,6 +35,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
+DECLARE_PER_CPU(int, rmid_offset);
+
/*
* __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
*
@@ -69,7 +71,7 @@ static void __resctrl_sched_in(void)
if (static_branch_likely(&rdt_mon_enable_key)) {
tmp = READ_ONCE(current->rmid);
if (tmp)
- rmid = tmp;
+ rmid = tmp + this_cpu_read(rmid_offset);
}
if (closid != state->cur_closid || rmid != state->cur_rmid) {
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 53b2ab37af2f..0ff739375e3b 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,6 +16,7 @@
#define pr_fmt(fmt) "resctrl: " fmt
+#include <linux/cpu.h>
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/cacheinfo.h>
@@ -484,6 +485,13 @@ static int get_domain_id(int cpu, enum resctrl_scope scope)
return get_cpu_cacheinfo_id(cpu, scope);
}
+DEFINE_PER_CPU(int, rmid_offset);
+
+static void set_per_cpu_rmid_offset(int cpu, struct rdt_resource *r)
+{
+ this_cpu_write(rmid_offset, (cpu_to_node(cpu) % snc_ways) * r->num_rmid);
+}
+
/*
* domain_add_cpu - Add a cpu to a resource's domain list.
*
@@ -515,6 +523,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
cpumask_set_cpu(cpu, &d->cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
rdt_domain_reconfigure_cdp(r);
+ if (r->mon_capable)
+ set_per_cpu_rmid_offset(cpu, r);
return;
}
@@ -533,9 +543,12 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;
}
- if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
- domain_free(hw_dom);
- return;
+ if (r->mon_capable) {
+ if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+ domain_free(hw_dom);
+ return;
+ }
+ set_per_cpu_rmid_offset(cpu, r);
}
list_add_tail(&d->list, add_pos);
@@ -845,11 +858,35 @@ static __init bool get_rdt_resources(void)
return (rdt_mon_capable || rdt_alloc_capable);
}
+static __init int find_snc_ways(void)
+{
+ unsigned long *node_caches;
+ int cpu, node, ret;
+
+ node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
+ if (!node_caches)
+ return 1;
+
+ cpus_read_lock();
+ for_each_node(node) {
+ cpu = cpumask_first(cpumask_of_node(node));
+ set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
+ }
+ cpus_read_unlock();
+
+ ret = nr_node_ids / bitmap_weight(node_caches, nr_node_ids);
+ kfree(node_caches);
+
+ return ret;
+}
+
static __init void rdt_init_res_defs_intel(void)
{
struct rdt_hw_resource *hw_res;
struct rdt_resource *r;
+ snc_ways = find_snc_ways();
+
for_each_rdt_resource(r) {
hw_res = resctrl_to_arch_res(r);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 3fc63aa68130..bd5ec348d925 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -160,7 +160,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
* IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
* are error bits.
*/
- wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+ wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + this_cpu_read(rmid_offset));
rdmsrl(MSR_IA32_QM_CTR, msr_val);
if (msr_val & RMID_VAL_ERROR)
--
2.39.1
Hi, Tony,
On 1/26/23 10:41, Tony Luck wrote:
> Add a placeholder in the array of struct rdt_hw_resource to be used
> for event monitoring of systems with Sub-NUMA Cluster enabled.
>
> Update get_domain_id() to handle SCOPE_NODE.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
> arch/x86/kernel/cpu/resctrl/core.c | 12 ++++++++++++
> 2 files changed, 14 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 15cea517efaa..39a62babd60b 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -409,12 +409,14 @@ enum resctrl_res_level {
> RDT_RESOURCE_L3,
> RDT_RESOURCE_L2,
> RDT_RESOURCE_MBA,
> + RDT_RESOURCE_NODE,
>
> /* Must be the last */
> RDT_NUM_RESOURCES,
> };
>
> enum resctrl_scope {
> + SCOPE_NODE,
> SCOPE_L2_CACHE = 2,
> SCOPE_L3_CACHE = 3
> };
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 6914232acf84..19be6fe42ef3 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -100,6 +100,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .fflags = RFTYPE_RES_MB,
> },
> },
> + [RDT_RESOURCE_NODE] =
> + {
> + .r_resctrl = {
> + .rid = RDT_RESOURCE_NODE,
> + .name = "L3",
"L3" was named as RDT_RESOURCE_L3 already. The duplicate name here may
cause duplicate file names in info dir. Maybe rename it as "L3_NODE"?
> + .scope = SCOPE_NODE,
> + .domains = domain_init(RDT_RESOURCE_NODE),
> + .fflags = RFTYPE_RES_MB,
I'm not sure about fflags here: should it be RFTYPE_RES_MB | RFTYPE_RES_L3
to cover both cache and MB?
> + },
> + },
> };
>
> /*
> @@ -464,6 +474,8 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>
> static int get_domain_id(int cpu, enum resctrl_scope scope)
> {
> + if (scope == SCOPE_NODE)
> + return cpu_to_node(cpu);
> return get_cpu_cacheinfo_id(cpu, scope);
> }
>
Thanks.
-Fenghua
Hi Tony,
On Fri, Jan 27, 2023 at 6:25 AM Yu, Fenghua <[email protected]> wrote:
>
> On 1/26/23 10:41, Tony Luck wrote:
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index 6914232acf84..19be6fe42ef3 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -100,6 +100,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
> > .fflags = RFTYPE_RES_MB,
> > },
> > },
> > + [RDT_RESOURCE_NODE] =
> > + {
> > + .r_resctrl = {
> > + .rid = RDT_RESOURCE_NODE,
> > + .name = "L3",
> "L3" was named as RDT_RESOURCE_L3 already. The duplicate name here may
> cause duplicate file names in info dir. Maybe rename it as "L3_NODE"?
I'm trying to get some feedback from our own users on whether changing
the directory names would bother them. At least from my own testing, I
did learn to appreciate the interface change a bit more: I needed an
SNC and non-SNC case to correctly predict which mon_data subdirectory
the data would appear in.
I was able to confirm that this change allows bandwidth to be counted
on RMID/CPU combos where it didn't work before on an SNC2
configuration.
If I'm understanding this correctly, it might be helpful to highlight
that the extra resource is needed to allow a different number of L3
domains in L3 monitoring vs allocation.
Thanks!
-Peter
Tested-By: Peter Newman <[email protected]>
Reviewed-By: Peter Newman <[email protected]>
>> + [RDT_RESOURCE_NODE] =
>> + {
>> + .r_resctrl = {
>> + .rid = RDT_RESOURCE_NODE,
>> + .name = "L3",
> "L3" was named as RDT_RESOURCE_L3 already. The duplicate name here may
> cause duplicate file names in info dir. Maybe rename it as "L3_NODE"?
I thought the same, and my first implementation used a different string here (I picked
"NODE" rather than "L3_NODE").
But my testers complained that this broke all their existing infrastructure that reads
cache occupancy and memory bandwidth. This string is not just used in the info/
directory, it is also the basis for the directory names in mon_data/
$ tree /sys/fs/resctrl/mon_data
/sys/fs/resctrl/mon_data
├── mon_L3_00
│ ├── llc_occupancy
│ ├── mbm_local_bytes
│ └── mbm_total_bytes
├── mon_L3_01
│ ├── llc_occupancy
│ ├── mbm_local_bytes
│ └── mbm_total_bytes
├── mon_L3_02
│ ├── llc_occupancy
│ ├── mbm_local_bytes
│ └── mbm_total_bytes
└── mon_L3_03
├── llc_occupancy
├── mbm_local_bytes
└── mbm_total_bytes
The name using "L3" is still appropriate and accurate.
There isn't a "duplicate file names" problem in the info/ directory because a system
either has SNC disabled, and uses the L3-scoped resource, or has SNC enabled and
uses the node-scoped resource.
-Tony
Hi, Tony,
On 1/27/23 10:23, Luck, Tony wrote:
>>> + [RDT_RESOURCE_NODE] =
>>> + {
>>> + .r_resctrl = {
>>> + .rid = RDT_RESOURCE_NODE,
>>> + .name = "L3",
>
>> "L3" was named as RDT_RESOURCE_L3 already. The duplicate name here may
>> cause duplicate file names in info dir. Maybe rename it as "L3_NODE"?
>
> I thought the same, and my first implementation used a different string here (I picked
> "NODE" rather than "L3_NODE").
>
> But my testers complained that this broke all their existing infrastructure that reads
> cache occupancy and memory bandwidth. This string is not just used in the info/
> directory, it is also the basis for the directory names in mon_data/
>
> $ tree /sys/fs/resctrl/mon_data
> /sys/fs/resctrl/mon_data
> ├── mon_L3_00
> │ ├── llc_occupancy
> │ ├── mbm_local_bytes
> │ └── mbm_total_bytes
> ├── mon_L3_01
> │ ├── llc_occupancy
> │ ├── mbm_local_bytes
> │ └── mbm_total_bytes
> ├── mon_L3_02
> │ ├── llc_occupancy
> │ ├── mbm_local_bytes
> │ └── mbm_total_bytes
> └── mon_L3_03
> ├── llc_occupancy
> ├── mbm_local_bytes
> └── mbm_total_bytes
>
> The name using "L3" is still appropriate and accurate.
>
> There isn't a "duplicate file names" problem in the info/ directory because a system
> either has SNC disabled, and uses the L3-scoped resource, or has SNC enabled and
> uses the node-scoped resource.
That's right.
Thank you for your info!
-Fenghua
Hi, Tony,
On 1/26/23 10:41, Tony Luck wrote:
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 6914232acf84..19be6fe42ef3 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -100,6 +100,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .fflags = RFTYPE_RES_MB,
> },
> },
> + [RDT_RESOURCE_NODE] =
> + {
> + .r_resctrl = {
> + .rid = RDT_RESOURCE_NODE,
> + .name = "L3",
> + .scope = SCOPE_NODE,
> + .domains = domain_init(RDT_RESOURCE_NODE),
> + .fflags = RFTYPE_RES_MB,
RFTYPE_RES_MB is for the resource to add some files in info/MB.
But the NODE resource doesn't have any files to show in info/MB.
Only shown file for the NODE resource is info/L3_MON/snc_ways. But this
file doesn't need to set fflags.
Seems no need to set fflags or fflags=0 to eliminate confusion?
Thanks.
-Fenghua
>> + .domains = domain_init(RDT_RESOURCE_NODE),
>> + .fflags = RFTYPE_RES_MB,
> RFTYPE_RES_MB is for the resource to add some files in info/MB.
> But the NODE resource doesn't have any files to show in info/MB.
>
> Only shown file for the NODE resource is info/L3_MON/snc_ways. But this
> file doesn't need to set fflags.
>
> Seems no need to set fflags or fflags=0 to eliminate confusion?
I just cut & pasted from the L3 ... oops. I think you may be right that
an explicit ".fflags = 0" would be best here.
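I.e. the entry would end up looking something like this (untested):

	[RDT_RESOURCE_NODE] =
	{
		.r_resctrl = {
			.rid		= RDT_RESOURCE_NODE,
			.name		= "L3",
			.scope		= SCOPE_NODE,
			.domains	= domain_init(RDT_RESOURCE_NODE),
			.fflags		= 0,	/* no info/ files of its own */
		},
	},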
-Tony
Hi Tony,
On Thu, Jan 26, 2023 at 7:42 PM Tony Luck <[email protected]> wrote:
> +static __init int find_snc_ways(void)
> +{
> + unsigned long *node_caches;
> + int cpu, node, ret;
> +
> + node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
> + if (!node_caches)
> + return 1;
> +
> + cpus_read_lock();
> + for_each_node(node) {
Someone tried this patch on a machine with a CPU-less node...
We need to check for this:
+ if (cpumask_empty(cpumask_of_node(node)))
+ continue;
> + cpu = cpumask_first(cpumask_of_node(node));
> + set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> + }
> + cpus_read_unlock();
Thanks!
-Peter
Hi Tony,
Sorry for the late response. I was looking at your patches.
Do you really need a new resource [RDT_RESOURCE_NODE] to handle this
feature?
As far as I can see, all that matters is writing/reading the MSRs
IA32_PQR_ASSOC and IA32_QM_EVTSEL based on cpu index. I think that can
be done without having the new resource. Let me know if I have
misunderstood something.
Thanks
Babu
Babu,
Thanks for looking at my patches.
> Do you really need a new resource [RDT_RESOURCE_NODE] to handle this
> feature?
Yes. When sub-numa cluster mode is enabled, there are separate counts for
each "node" on the socket. This new resource is the key to creating extra
directories in the /sys/fs/resctrl/mon_data/ area so that the memory bandwidth
and cache occupancy can be read for each node, instead of just for each socket.
But there are some other issues with this patch series. New version will be posted
once they are fixed up.
-Tony
Hi Tony,
On 26/01/2023 18:41, Tony Luck wrote:
> Intel server systems starting with Skylake support a mode that logically
> partitions each socket. E.g. when partitioned two ways, half the cores,
> L3 cache, and memory controllers are allocated to each of the partitions.
> This may reduce average latency to access L3 cache and memory, with the
> tradeoff that only half the L3 cache is available for subnode-local memory
> access.
I couldn't find a description of what happens to the CAT bitmaps or counters.
Presumably the CAT bitmaps are duplicated, so each cluster has its own set, and
the counters aren't - so software has to co-ordinate the use of RMID across the CPUs?
How come cacheinfo isn't modified to report the L3 partitions as separate caches?
Otherwise user-space would assume the full size of the cache is available on any of those
CPUs.
This would avoid an ABI change in resctrl (domain is now the numa node), leaving only the
RMID range code.
Thanks,
James
Hi Tony,
On 26/01/2023 18:41, Tony Luck wrote:
> Make it easy for the user to tell if Sub-NUMA Cluster is enabled by
> providing an info/ file.
I think what this is conveying to user-space is 'domain_id_is_numa_node'.
Does user-space need to know the number of ways?
Will this always be a single number, or will it ever be possible to have an SNC=2 and
SNC=1 package in the same system?
Thanks,
James
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index a0dc64a70d01..392e7a08d083 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -997,6 +997,14 @@ static int rdt_num_rmids_show(struct kernfs_open_file *of,
> return 0;
> }
>
> +static int rdt_snc_ways_show(struct kernfs_open_file *of,
> + struct seq_file *seq, void *v)
> +{
> + seq_printf(seq, "%d\n", snc_ways);
> +
> + return 0;
> +}
> +
> static int rdt_mon_features_show(struct kernfs_open_file *of,
> struct seq_file *seq, void *v)
> {
> @@ -1451,6 +1459,13 @@ static struct rftype res_common_files[] = {
> .seq_show = rdt_num_rmids_show,
> .fflags = RF_MON_INFO,
> },
> + {
> + .name = "snc_ways",
> + .mode = 0444,
> + .kf_ops = &rdtgroup_kf_single_ops,
> + .seq_show = rdt_snc_ways_show,
> + .fflags = RF_MON_INFO,
> + },
> {
> .name = "cbm_mask",
> .mode = 0444,
Hi Tony,
On 26/01/2023 18:41, Tony Luck wrote:
> When Sub-NUMA cluster is enabled (snc_ways > 1) use the RDT_RESOURCE_NODE
> instead of RDT_RESOURCE_L3 for all monitoring operations.
>
> The mon_scale and num_rmid values from CPUID(0xf,0x1),(EBX,ECX) must be
> scaled down by the number of Sub-NUMA Clusters.
>
> A subsequent change will detect sub-NUMA cluster mode and set
> "snc_ways". For now set to one (meaning each L3 cache spans one
> node).
(I'm looking at decoupling "monitoring is always on RDT_RESOURCE_L3" as a separate thing
to enabling SNC ... just in case my comments seem strange!)
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 19be6fe42ef3..53b2ab37af2f 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -786,7 +791,13 @@ static __init bool get_rdt_alloc_resources(void)
>
> static __init bool get_rdt_mon_resources(void)
> {
> - struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + struct rdt_resource *r;
> +
> + /* When SNC enabled, monitor functions at node instead of L3 cache scope */
> + if (snc_ways > 1)
> + r = &rdt_resources_all[RDT_RESOURCE_NODE].r_resctrl;
> + else
> + r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
Could this be hidden in a helper with some name like resctrl_arch_get_mbm_resource()?
You have the same pattern again in rdt_get_tree(). If this gets more complex in the
future, it means its outside the filesystem code, and all in one place.
Thanks,
James
> if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
> rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index a6ba3080e5db..a0dc64a70d01 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2238,7 +2238,10 @@ static int rdt_get_tree(struct fs_context *fc)
> static_branch_enable_cpuslocked(&rdt_enable_key);
>
> if (is_mbm_enabled()) {
> - r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + if (snc_ways > 1)
> + r = &rdt_resources_all[RDT_RESOURCE_NODE].r_resctrl;
> + else
> + r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> list_for_each_entry(dom, &r->domains, list)
> mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
> }
Hi Tony,
On 26/01/2023 18:41, Tony Luck wrote:
> With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
> per-NODE instead of per-L3 cache. Suffices of directories with "L3" in
(Typo: Suffixes)
> their name refer to Sun-NUMA nodes instead of L3 cache ids.
(Typo: Sub-NUMA)
> diff --git a/Documentation/x86/resctrl.rst b/Documentation/x86/resctrl.rst
> index 71a531061e4e..9043a2d2f2d3 100644
> --- a/Documentation/x86/resctrl.rst
> +++ b/Documentation/x86/resctrl.rst
> @@ -167,6 +167,11 @@ with the following files:
> bytes) at which a previously used LLC_occupancy
> counter can be considered for re-use.
>
> +"snc_ways":
> + A value of "1" marks that SNC mode is disabled.
It would be good to expand the acronym the first time its used in the documentation.
This gives folk a few more characters to google!
> + Values of "2" or "4" indicate how many NUMA
> + nodes share an L3 cache.
> +
> Finally, in the top level of the "info" directory there is a file
> named "last_cmd_status". This is reset with every "command" issued
> via the file system (making new directories or writing to any of the
Thanks,
James
>> + /* When SNC enabled, monitor functions at node instead of L3 cache scope */
>> + if (snc_ways > 1)
>> + r = &rdt_resources_all[RDT_RESOURCE_NODE].r_resctrl;
>> + else
>> + r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>
> Could this be hidden in a helper with some name like resctrl_arch_get_mbm_resource()?
> You have the same pattern again in rdt_get_tree(). If this gets more complex in the
> future, it means its outside the filesystem code, and all in one place.
Sounds like a good idea. Thanks.
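Something like this, presumably (sketch only, reusing the snc_ways and
rdt_resources_all[] names from this series):

static struct rdt_resource *resctrl_arch_get_mbm_resource(void)
{
	/* With SNC enabled, monitoring is at node scope rather than L3 scope */
	if (snc_ways > 1)
		return &rdt_resources_all[RDT_RESOURCE_NODE].r_resctrl;

	return &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
}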
-Tony
> > Make it easy for the user to tell if Sub-NUMA Cluster is enabled by
> > providing an info/ file.
>
> I think what this is conveying to user-space is 'domain_id_is_numa_node'.
That seems more architecturally neutral. I like it.
> Does user-space need to know the number of ways?
I don't know. Maybe some might. Perhaps there is some better name that
is architecturally neutral, but still has a numerical rather than boolean value?
> Will this always be a single number, or will it ever be possible to have an SNC=2 and
> SNC=1 package in the same system?
I sincerely hope that it is the same value across the system. Currently the
BIOS setup option to enable SNC doesn't have per-socket choices, it is
just an all-or-nothing choice. "2" isn't the only choice for number of SNC
nodes on a socket. "4" is (or will be) a choice.
-Tony
> > Intel server systems starting with Skylake support a mode that logically
> > partitions each socket. E.g. when partitioned two ways, half the cores,
> > L3 cache, and memory controllers are allocated to each of the partitions.
> > This may reduce average latency to access L3 cache and memory, with the
> > tradeoff that only half the L3 cache is available for subnode-local memory
> > access.
>
> I couldn't find a description of what happens to the CAT bitmaps or counters.
No changes to CAT. The cache is partitioned between sub-numa nodes based
on the index, not by dividing the ways. E.g. an 8-way associative 32MB cache is
still 8-way associative in each sub-node, but with 16MB available to each node.
This means users who want a specific amount of cache will need to allocate
more bits in the cache way mask (because each way is half as big).
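To put rough numbers on that, using the same hypothetical 8-way, 32MB cache:
with SNC disabled each way covers 4MB, so a group that wants ~8MB of L3 sets
two bits in its cache bit mask. With SNC enabled two ways, each way covers
2MB per SNC node, so the same ~8MB needs four bits.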
> Presumably the CAT bitmaps are duplicated, so each cluster has its own set, and
> the counters aren't - so software has to co-ordinate the use of RMID across the CPUs?
Nope. Still one set of CAT bit maps per socket.
With "N" RMIDs available on a system with SNC disabled, there will be N/2 available
when there are 2 SNC nodes per socket. Processes use values [0 .. N/2).
> How come cacheinfo isn't modified to report the L3 partitions as separate caches?
> Otherwise user-space would assume the full size of the cache is available on any of those
> CPUs.
The size of the cache is perhaps poorly defined in the SNC-enabled case. A
well-behaved NUMA application that is only accessing memory from its local node
will see an effective cache half the size. But if a process accesses memory from
the other SNC node on the same socket, then it will get allocations in that SNC
node's half-share of the cache. Accessing memory across inter-socket links will
end up allocating across the whole cache.
Moral: SNC mode is intended for applications that have very well-behaved NUMA
characteristics.
-Tony
Hi Tony,
On 1/26/23 12:41, Tony Luck wrote:
> There isn't a simple hardware enumeration to indicate to software that
> a system is running with Sub-NUMA Cluster enabled.
>
> Compare the number of NUMA nodes with the number of L3 caches to calculate
> the number of Sub-NUMA nodes per L3 cache.
>
> When Sub-NUMA cluster mode is enabled in BIOS setup the RMID counters
> are distributed equally between the SNC nodes within each socket.
>
> E.g. if there are 400 RMID counters, and the system is configured with
> two SNC nodes per socket, then RMID counter 0..199 are used on SNC node
> 0 on the socket, and RMID counter 200..399 on SNC node 1.
>
> Handle this by initializing a per-cpu RMID offset value. Use this
> to calculate the value to write to the RMID field of the IA32_PQR_ASSOC
> MSR during context switch, and also to the IA32_QM_EVTSEL MSR when
> reading RMID event values.
>
> N.B. this works well for well-behaved NUMA applications that access
> memory predominantly from the local memory node. For applications that
> access memory across multiple nodes it may be necessary for the user
> to read counters for all SNC nodes on a socket and add the values to
> get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
> all that different from applications that span across multiple sockets
> in a legacy system.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/include/asm/resctrl.h | 4 ++-
> arch/x86/kernel/cpu/resctrl/core.c | 43 +++++++++++++++++++++++++--
> arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
> 3 files changed, 44 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> index 52788f79786f..59b8afd8c53c 100644
> --- a/arch/x86/include/asm/resctrl.h
> +++ b/arch/x86/include/asm/resctrl.h
> @@ -35,6 +35,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
> DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
> DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
>
> +DECLARE_PER_CPU(int, rmid_offset);
> +
> /*
> * __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
> *
> @@ -69,7 +71,7 @@ static void __resctrl_sched_in(void)
> if (static_branch_likely(&rdt_mon_enable_key)) {
> tmp = READ_ONCE(current->rmid);
> if (tmp)
> - rmid = tmp;
> + rmid = tmp + this_cpu_read(rmid_offset);
> }
>
> if (closid != state->cur_closid || rmid != state->cur_rmid) {
> [...]
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 3fc63aa68130..bd5ec348d925 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -160,7 +160,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
> * are error bits.
> */
> - wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> + wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + this_cpu_read(rmid_offset));
I am thinking out loud here.
When a new monitor group is created, a new RMID is assigned. This is done by
alloc_rmid(), which does not know about the rmid_offset details. It will
allocate one of the free RMIDs.
When CPUs are assigned to the group, the per-cpu pqr_state is updated.
At that point, this RMID becomes the default_rmid for that cpu.
But CPUs can be assigned from two different Sub-NUMA nodes.
Consider the same example you mentioned:
E.g. in 2-way Sub-NUMA cluster with 200 RMID counters there are only
100 available counters to the resctrl code. When running on the first
SNC node RMID values 0..99 are used as before. But when running on the
second node, a task that is assigned resctrl rmid=10 must load 10+100
into IA32_PQR_ASSOC to use RMID counter 110.
#mount -t resctrl resctrl /sys/fs/resctrl/
#cd /sys/fs/resctrl/
#mkdir test (Lets say RMID 1 is allocated)
#cd test
#echo 1 > cpus_list
#echo 101 > cpus_list
In this case, the following code may use two different hardware RMIDs even
though it was intended to use the same RMID.
wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + this_cpu_read(rmid_offset));
Have you thought of this problem?
Thanks
Babu
Babu wrote:
> I am thinking loud here. Have you thought of addressing this problem?
> When a new monitor group is created, new RMID is assigned. This is done by alloc_rmid. It does not know about the rmid_offset details. This will allocate the one of the free RMIDs.
> When CPUs are assigned to the group, then per cpu pqr_state is updated. At that point, this RMID becomes default_rmid for that cpu.
Good point. This is a gap. I haven't handled assigning CPUs to resctrl groups when SNC is enabled.
I'm not sure this has a solution :-(
-Tony
On 2/28/2023 2:39 PM, Luck, Tony wrote:
> Babu wrote:
>> I am thinking loud here. Have you thought of addressing this problem?
>> When a new monitor group is created, new RMID is assigned. This is done by alloc_rmid. It does not know about the rmid_offset details. This will allocate the one of the free RMIDs.
>> When CPUs are assigned to the group, then per cpu pqr_state is updated. At that point, this RMID becomes default_rmid for that cpu.
> Good point. This is a gap. I haven't handled assigning CPUs to resctrl groups when SNC is enabled.
You may need to document it.
Thanks
Babu
Hi Tony,
On 28/02/2023 17:44, Luck, Tony wrote:
>>> Make it easy for the user to tell if Sub-NUMA Cluster is enabled by
>>> providing an info/ file.
>>
>> I think what this is conveying to user-space is 'domain_id_is_numa_node'.
>
> That seems more architecturally neutral. I like it.
>
>> Does user-space need to know the number of ways?
>
> I don't know. Maybe some might. Perhaps there is some better name that
> is architecturally neutral, but still has a numerical rather than boolean value?
If we don't know what user-space needs this for, I'd prefer we don't expose it. It'll be a
problem in the future if there is no single number we can put there.
I don't see what the difference between 2 and 4 would be used for when setting up resctrl,
unless you are choosing which numa-nodes to spread tasks over ... but the numa distance
would be a much better number to base that decision on.
User-space is able to perform the same calculation to find the snc_ways using the
cache/index stuff and node entries in sysfs.
>> Will this always be a single number, or will it ever be possible to have an SNC=2 and
>> SNC=1 package in the same system?
>
> I sincerely hope that it is the same value across the system. Currently the
> BIOS setup option to enable SNC doesn't have per-socket choices, it is
> just an all-or-nothing choice. "2" isn't the only choice for number of SNC
> nodes on a socket. "4" is (or will be) a choice.
Yeah, in the arm world, partners get to make the decision on what is sane. big.LITTLE
means someone could do something that looks like SNC in one cluster, but not another.
If we don't know what user-space needs it for, I'd prefer we don't expose it, just to
avoid giving out rope to shoot ourselves in the foot with.
Thanks,
James
On Mon, Feb 27, 2023 at 02:30:38PM +0100, Peter Newman wrote:
> Hi Tony,
>
> On Thu, Jan 26, 2023 at 7:42 PM Tony Luck <[email protected]> wrote:
> > +static __init int find_snc_ways(void)
> > +{
> > + unsigned long *node_caches;
> > + int cpu, node, ret;
> > +
> > + node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
> > + if (!node_caches)
> > + return 1;
> > +
> > + cpus_read_lock();
> > + for_each_node(node) {
>
> Someone tried this patch on a machine with a CPU-less node...
>
> We need to check for this:
>
> + if (cpumask_empty(cpumask_of_node(node)))
> + continue;
>
> > + cpu = cpumask_first(cpumask_of_node(node));
> > + set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> > + }
> > + cpus_read_unlock();
Peter,
Tell me more about your CPU-less nodes. Your fix avoids a bad
pointer reference (because cpumask_first() returns cpu >= nr_cpu_ids
for an empty bitmask).
But now I'm worried about whether I have the right values in the
formula:
nr_node_ids / bitmap_weight(node_caches, nr_node_ids);
This fix avoids counting the L3 from a non-existent CPU, but still
counts the node in the numerator.
Is your CPU-less node a full (non-SNC) node? Like this:
Socket 0 Socket 1
+--------------------+ +--------------------+
| . | | . |
| SNC 0.0 . SNC 0.1 | | zero . zero |
| . | | CPUs . CPUs |
| . | | . |
| . | | . |
+--------------------+ +--------------------+
| L3 Cache | | L3 Cache |
+--------------------+ +--------------------+
I could fix this case by counting how many CPU-less
nodes I find, and reducing the numerator (the denominator
didn't count the L3 cache from socket 1 because there
are no CPUs there)
(nr_node_ids - n_empty_nodes) / bitmap_weight(node_caches, nr_node_ids);
=> 2 / 1
But that won't work if your CPU-less node is an SNC node
and the other SNC node in the same socket does have some
CPUs:
Socket 0 Socket 1
+--------------------+ +--------------------+
| . | | . |
| SNC 0.0 . SNC 0.1 | | zero . SNC 1.1 |
| . | | CPUs . |
| . | | . |
| . | | . |
+--------------------+ +--------------------+
| L3 Cache | | L3 Cache |
+--------------------+ +--------------------+
This would get 3 / 2 ... i.e. I should still count the
empty node because its cache was counted by its SNC
buddy.
-Tony
Hi Tony,
On Fri, Mar 10, 2023 at 6:30 PM Tony Luck <[email protected]> wrote:
> Tell me more about your CPU-less nodes. Your fix avoids a bad
> pointer reference (because cpumask_first() returns cpu >= nr_cpu_ids
> for an empty bitmask).
>
> But now I'm worried about whether I have the right values in the
> formula:
>
> nr_node_ids / bitmap_weight(node_caches, nr_node_ids);
>
> This fix avoids counting the L3 from a non-existent CPU, but still
> counts the node in the numerator.
>
> Is your CPU-less node a full (non-SNC) node? Like this:
>
> Socket 0 Socket 1
> +--------------------+ +--------------------+
> | . | | . |
> | SNC 0.0 . SNC 0.1 | | zero . zero |
> | . | | CPUs . CPUs |
> | . | | . |
> | . | | . |
> +--------------------+ +--------------------+
> | L3 Cache | | L3 Cache |
> +--------------------+ +--------------------+
In the case I saw, the nodes were AEP DIMMs, so all-memory nodes.
Browsing sysfs, they are listed in has_memory, but not
has_normal_memory or has_cpu.
I imagine CXL.mem would have similar issues.
-Peter
> In the case I saw, the nodes were AEP DIMMs, so all-memory nodes.
Peter,
Thanks. This helps a lot.
Ok. I will add code to count the number of memory-only nodes and subtract
that from the numerator of "nodes / L3-caches".
I'll ignore the weird case of a memory-only SNC node when other SNC
nodes on the same socket do have CPUs until such time as someone
convinces me that there is a real-world reason to enable SNC and then
disable the CPUs in one node. It would seem much better to keep SNC
turned off so that the remaining CPUs on the socket get access to all
of the L3.
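Roughly what I have in mind (untested sketch, on top of Peter's
cpumask_empty() check):

static __init int find_snc_ways(void)
{
	unsigned long *node_caches;
	int mem_only_nodes = 0;
	int cpu, node, ret;

	node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
	if (!node_caches)
		return 1;

	cpus_read_lock();
	for_each_node(node) {
		/* Skip memory-only nodes, but remember how many there were */
		if (cpumask_empty(cpumask_of_node(node))) {
			mem_only_nodes++;
			continue;
		}
		cpu = cpumask_first(cpumask_of_node(node));
		set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
	}
	cpus_read_unlock();

	ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);
	kfree(node_caches);

	return ret;
}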
-Tony
On Tue, Feb 28, 2023 at 01:51:32PM -0600, Moger, Babu wrote:
> I am thinking loud here.
> When a new monitor group is created, new RMID is assigned. This is done by
> alloc_rmid. It does not know about the rmid_offset details. This will
> allocate the one of the free RMIDs.
>
> When CPUs are assigned to the group, then per cpu pqr_state is updated.
> At that point, this RMID becomes default_rmid for that cpu.
>
> But CPUs can be assigned from two different Sub-NUMA nodes.
>
> Considering same example you mentioned.
>
> E.g. in 2-way Sub-NUMA cluster with 200 RMID counters there are only
> 100 available counters to the resctrl code. When running on the first
> SNC node RMID values 0..99 are used as before. But when running on the
> second node, a task that is assigned resctrl rmid=10 must load 10+100
> into IA32_PQR_ASSOC to use RMID counter 110.
>
> #mount -t resctrl resctrl /sys/fs/resctrl/
> #cd /sys/fs/resctrl/
> #mkdir test (Lets say RMID 1 is allocated)
> #cd test
> #echo 1 > cpus_list
> #echo 101 > cpus_list
>
> In this case, the following code may run on two different RMIDs even
> though it was intended to run on same RMID.
>
> wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + this_cpu_read(rmid_offset));
>
> Have you thought of this problem?
Now I've thought about this. I don't think it is a problem.
With SNC enabled for two nodes per socket the available RMIDs are
divided between the SNC nodes. For some purposes they are numbered
[0 .. N/2), but in other cases they must be viewed as two separate
sets: [0 .. N/2) on the first node and [N/2 .. N) on the second.
In your example RMID 1 is assigned to the group and you have
one CPU from each node in the group. Processes on CPU1 will
load IA32_PQR_ASSOC.RMID = 1, while processes on CPU101 will
set IA32_PQR_ASSOC.RMID = 101. So counts of memory bandwidth
and cache occupancy will be in two different physical RMID
counters.
To read these back the user needs to look up which $node each CPU
belongs to and then read from the appropriate
mon_data/mon_L3_$node/{llc_occupancy,mbm_local_bytes,mbm_total_bytes}
file.
$ cat mon_data/mon_L3_00/llc_occupancy # reads RMID=1
$ cat mon_data/mon_L3_01/llc_occupancy # reads RMID=101
-Tony