This series is based on top of Linus' upstream commit 33e02dc69afb ("Merge
tag 'sound-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound").
The Sub-NUMA Cluster (SNC) feature on some Intel processors partitions the
CPUs that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features. Prior to this
series, Intel advised that SNC and RDT are incompatible.
Some of these CPUs support an MSR that can partition the RMID counters
in the same way. This allows monitoring features to be used. Legacy
monitoring files provide the sum of counters from each SNC node for
backwards compatibility. Additional files per SNC node provide details
per node.
Cache and memory bandwidth allocation features continue to operate at
the scope of the L3 cache.
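As an illustration of the legacy-file behaviour described above, here is a
minimal userspace sketch, not the kernel implementation. It assumes one
counter reading per SNC node for a single L3 cache instance. The names
l3_counters, sum_l3_counter() and node_counter() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: one counter value per SNC node, where all the
 * nodes belong to the same L3 cache instance.
 */
struct l3_counters {
	size_t nr_nodes;		/* SNC nodes sharing this L3 cache */
	const uint64_t *per_node;	/* one counter reading per node */
};

/* Value a legacy (per-L3) monitoring file would report: the sum. */
static uint64_t sum_l3_counter(const struct l3_counters *c)
{
	uint64_t sum = 0;
	size_t i;

	assert(c->nr_nodes > 0);
	for (i = 0; i < c->nr_nodes; i++)
		sum += c->per_node[i];

	return sum;
}

/* Value a per-node file would report: that node's counter alone. */
static uint64_t node_counter(const struct l3_counters *c, size_t node)
{
	assert(node < c->nr_nodes);

	return c->per_node[node];
}
```

With two nodes reporting 100 and 250, the legacy view yields 350 while the
per-node view keeps the two values separate.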
Signed-off-by: Tony Luck <[email protected]>
---
Changes since v18: https://lore.kernel.org/all/[email protected]/
Global: Consistent use of "Sub-NUMA Cluster (SNC)"
1-4: No change
5: Rename RESCTRL_NODE as RESCTRL_L3_NODE to make it clear that
these "nodes" are each subsets of L3 cache instances.
6: Changes for snc_nodes_per_l3_cache are localized to monitor.c
Don't use it in the decision to block use of the mba_MBps option.
Moved the old get_node_rmid() function here, but renamed it to
logical_rmid_to_physical_rmid() with a block comment explaining
how RMIDs are distributed when SNC is enabled. Function now
checks if snc_nodes_per_l3_cache == 1 for fast return.
7: New patch. Only allow mba_MBps option if scope of MBM matches MBA
8: Replaces old patch 8. "display_id" field is no more. Add and
initialize the @ci (struct cacheinfo *) to rdt_mon_domain.
Note that the new get_cpu_cacheinfo_level() helper function is
added to internal.h as it will also be needed by patch 19.
9: Instead of display_id, add pointer to cacheinfo structure to
struct rmid_read. Add kerneldoc description of existing and
new fields.
10: Added text to the commit message describing why mkdir_mondata_subdir()
needs to be refactored.
11: Dropped Intel specific description of fields in the mon_evt
structure. Say that choice of bit to steal was arbitrary, but
can be changed in the future.
12: Fixed typo s/and file/and files/ in commit message. Now using
the cacheinfo structure (specifically "id" field) instead of
display_id.
13: Wordsmith commit into imperative.
I looked at using kobject_has_children() to check for empty
directory, but it needs a "struct kobject *" and all I have
is "struct kernfs_node *". I'm now checking how many CPUs
remain in ci->shared_cpu_map to detect whether this is the
last SNC node.
s/kernfs_find_and_get_ns/kernfs_find_and_get/ in all places.
Fix copy/paste error which used "pgrp" instead of "cgrp".
Dropped the fir tree fix for a function I hadn't touched.
Old patch 14 split into 14, 15, 16
The "wedging things until it works" path is gone. Instead
of passing in a random SNC domain that has the right display_id,
the code now makes use of cacheinfo both to get the L3 id and to
pick the CPU mask for the smp_call*(). rr.d is now NULL in the
sum case as suggested.
14: New patch. Does the "top half" work of filling out the rmid_read
structure prior to the smp_call*().
15: Need to pass the cacheinfo struct to resctrl_arch_rmid_read()
16: When "sum", resctrl_arch_rmid_read() loops across domains sharing
L3 cache.
17: (was 15) sanity
Removed "Fix" from the shortlog description. Use ci->shared_cpu_map
in the sanity check for sum case.
18: (new - split out from old 16) Try to do one thing at a time. Split
the MSR 0xCA0 update code from the SNC detection code.
19: (was 16) Fix typo s/couning/counting/
Use upper case for first letter of messages.
Use cpumask_weight() instead of bitmap_weight.
20: (was 17) Dropped the "This patch needs updating" part of commit
Tony Luck (20):
x86/resctrl: Prepare for new domain scope
x86/resctrl: Prepare to split rdt_domain structure
x86/resctrl: Prepare for different scope for control/monitor
operations
x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
x86/resctrl: Add node-scope to the options for feature scope
x86/resctrl: Introduce snc_nodes_per_l3_cache
x86/resctrl: Block use of mba_MBps mount option on Sub-NUMA Cluster
(SNC) systems
x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor files
x86/resctrl: Add new fields to struct rmid_read for summation of
domains
x86/resctrl: Refactor mkdir_mondata_subdir() with a helper function
x86/resctrl: Allocate a new bit in union mon_data_bits
x86/resctrl: Create Sub-NUMA Cluster (SNC) monitor files
x86/resctrl: Handle removing directories in Sub-NUMA Cluster (SNC)
mode
x86/resctrl: Fill out rmid_read structure for smp_call*() to read a
counter
x86/resctrl: Pass two extra arguments to resctrl_arch_rmid_read()
x86/resctrl: Make resctrl_arch_rmid_read() handle sum over domains
x86/resctrl: Update CPU sanity checks when reading RMID counters
x86/resctrl: Enable RMID shared RMID mode on Sub-NUMA Cluster (SNC)
systems
x86/resctrl: Sub-NUMA Cluster (SNC) detection and enabling
x86/resctrl: Update documentation with Sub-NUMA cluster changes
Documentation/arch/x86/resctrl.rst | 17 ++
include/linux/resctrl.h | 91 +++++--
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 114 ++++++--
arch/x86/kernel/cpu/resctrl/core.c | 312 ++++++++++++++++------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 74 +++--
arch/x86/kernel/cpu/resctrl/monitor.c | 220 ++++++++++++---
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 27 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 265 +++++++++++-------
9 files changed, 813 insertions(+), 308 deletions(-)
base-commit: 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0
--
2.45.0
Resctrl resources operate on subsets of CPUs in the system with the
defining attribute of each subset being an instance of a particular
level of cache. E.g. all CPUs sharing an L3 cache would be part of the
same domain.
In preparation for features that are scoped at the NUMA node level,
change the code from explicit references to "cache_level" to a more
generic scope. At this point the only options for this scope are groups
of CPUs that share an L2 cache or L3 cache.
Clean up the error handling when looking up domains. Report invalid ids
before calling rdt_find_domain() in preparation for better messages when
scope can be other than cache scope. This means that rdt_find_domain()
will never return an error, so remove the error checks from the call sites.
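The lookup flow described above can be sketched in userspace C as follows.
This is a simplified model, not the kernel code: cache_id() is a
hypothetical stand-in for get_cpu_cacheinfo_id() with a made-up topology
(two CPUs per L2, four per L3), and a singly linked list replaces the
kernel list. The point shown is that the id is validated first, so the
find helper only ever returns a match or NULL.

```c
#include <assert.h>
#include <stddef.h>

enum scope { SCOPE_L2_CACHE = 2, SCOPE_L3_CACHE = 3 };

/* Hypothetical topology: two CPUs per L2 cache, four CPUs per L3 cache. */
static int cache_id(int cpu, int level)
{
	if (cpu < 0)
		return -1;
	return level == 2 ? cpu / 2 : cpu / 4;
}

/* Map a CPU to a domain id for the given scope; negative on failure. */
static int domain_id_from_scope(int cpu, enum scope scope)
{
	switch (scope) {
	case SCOPE_L2_CACHE:
	case SCOPE_L3_CACHE:
		return cache_id(cpu, scope);
	default:
		break;
	}
	return -1;
}

struct domain {
	int id;
	struct domain *next;
};

/*
 * Search a list sorted by id. Returns the match or NULL, never an error.
 * When there is no match and @pos is non-NULL, *pos is set to the link
 * where a new domain with this id belongs.
 */
static struct domain *find_domain(struct domain **head, int id,
				  struct domain ***pos)
{
	struct domain **link;

	for (link = head; *link; link = &(*link)->next) {
		if ((*link)->id == id)
			return *link;
		if (id < (*link)->id)
			break;
	}
	if (pos)
		*pos = link;
	return NULL;
}

/* Returns 1 if @fresh was inserted, 0 if the domain existed, -1 on bad id. */
static int add_cpu(struct domain **head, struct domain *fresh, int cpu,
		   enum scope scope)
{
	int id = domain_id_from_scope(cpu, scope);
	struct domain **pos = head;

	if (id < 0)
		return -1;	/* invalid id reported before the lookup */
	if (find_domain(head, id, &pos))
		return 0;
	fresh->id = id;
	fresh->next = *pos;
	*pos = fresh;
	return 1;
}
```

Because add_cpu() rejects a negative id up front, callers of find_domain()
need only a NULL check.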
Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 9 ++++-
arch/x86/kernel/cpu/resctrl/core.c | 46 ++++++++++++++++-------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 2 +-
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 6 ++-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 5 ++-
5 files changed, 49 insertions(+), 19 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index a365f67131ec..ed693bfe474d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -150,13 +150,18 @@ struct resctrl_membw {
struct rdt_parse_data;
struct resctrl_schema;
+enum resctrl_scope {
+ RESCTRL_L2_CACHE = 2,
+ RESCTRL_L3_CACHE = 3,
+};
+
/**
* struct rdt_resource - attributes of a resctrl resource
* @rid: The index of the resource
* @alloc_capable: Is allocation available on this machine
* @mon_capable: Is monitor feature available on this machine
* @num_rmid: Number of RMIDs available
- * @cache_level: Which cache level defines scope of this resource
+ * @scope: Scope of this resource
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
* @domains: RCU list of all domains for this resource
@@ -174,7 +179,7 @@ struct rdt_resource {
bool alloc_capable;
bool mon_capable;
int num_rmid;
- int cache_level;
+ enum resctrl_scope scope;
struct resctrl_cache cache;
struct resctrl_membw membw;
struct list_head domains;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index a113d9aba553..f85b2ff40eef 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -68,7 +68,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L3,
.name = "L3",
- .cache_level = 3,
+ .scope = RESCTRL_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_L3),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
@@ -82,7 +82,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L2,
.name = "L2",
- .cache_level = 2,
+ .scope = RESCTRL_L2_CACHE,
.domains = domain_init(RDT_RESOURCE_L2),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
@@ -96,7 +96,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_MBA,
.name = "MB",
- .cache_level = 3,
+ .scope = RESCTRL_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_MBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
@@ -108,7 +108,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_SMBA,
.name = "SMBA",
- .cache_level = 3,
+ .scope = RESCTRL_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_SMBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
@@ -392,9 +392,6 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
struct rdt_domain *d;
struct list_head *l;
- if (id < 0)
- return ERR_PTR(-ENODEV);
-
list_for_each(l, &r->domains) {
d = list_entry(l, struct rdt_domain, list);
/* When id is found, return its domain. */
@@ -484,6 +481,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
return 0;
}
+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
+{
+ switch (scope) {
+ case RESCTRL_L2_CACHE:
+ case RESCTRL_L3_CACHE:
+ return get_cpu_cacheinfo_id(cpu, scope);
+ default:
+ break;
+ }
+
+ return -EINVAL;
+}
+
/*
* domain_add_cpu - Add a cpu to a resource's domain list.
*
@@ -499,7 +509,7 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
*/
static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ int id = get_domain_id_from_scope(cpu, r->scope);
struct list_head *add_pos = NULL;
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;
@@ -507,12 +517,14 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
lockdep_assert_held(&domain_list_lock);
- d = rdt_find_domain(r, id, &add_pos);
- if (IS_ERR(d)) {
- pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+ if (id < 0) {
+ pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->scope, r->name);
return;
}
+ d = rdt_find_domain(r, id, &add_pos);
+
if (d) {
cpumask_set_cpu(cpu, &d->cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
@@ -552,15 +564,21 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
static void domain_remove_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ int id = get_domain_id_from_scope(cpu, r->scope);
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;
lockdep_assert_held(&domain_list_lock);
+ if (id < 0) {
+ pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->scope, r->name);
+ return;
+ }
+
d = rdt_find_domain(r, id, NULL);
- if (IS_ERR_OR_NULL(d)) {
- pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+ if (!d) {
+ pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
return;
}
hw_dom = resctrl_to_arch_dom(d);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b7291f60399c..2bf021d42500 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -577,7 +577,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
r = &rdt_resources_all[resid].r_resctrl;
d = rdt_find_domain(r, domid, NULL);
- if (IS_ERR_OR_NULL(d)) {
+ if (!d) {
ret = -ENOENT;
goto out;
}
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index aacf236dfe3b..7c4bf0a006ce 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
*/
static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
{
+ enum resctrl_scope scope = plr->s->res->scope;
struct cpu_cacheinfo *ci;
int ret;
int i;
+ if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
+ return -ENODEV;
+
/* Pick the first cpu we find that is associated with the cache. */
plr->cpu = cpumask_first(&plr->d->cpu_mask);
@@ -311,7 +315,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == plr->s->res->cache_level) {
+ if (ci->info_list[i].level == scope) {
plr->line_size = ci->info_list[i].coherency_line_size;
return 0;
}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 02f213f1c51c..b8588ce88eef 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1454,10 +1454,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
unsigned int size = 0;
int num_b, i;
+ if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+ return size;
+
num_b = bitmap_weight(&cbm, r->cache.cbm_len);
ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == r->cache_level) {
+ if (ci->info_list[i].level == r->scope) {
size = ci->info_list[i].size / r->cache.cbm_len * num_b;
break;
}
--
2.45.0
All currently supported resctrl features have a domain scope that matches
the scope of either the L2 or the L3 cache.
Add RESCTRL_L3_NODE as a new option for features that are scoped at the
same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
Cluster (SNC) feature where monitoring features are divided between
nodes that share an L3 cache.
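To illustrate the difference in granularity, here is a userspace sketch
under assumed topology: eight CPUs per L3 cache and two SNC nodes per L3.
The helpers model_l3_id() and model_node_id() are hypothetical stand-ins
for get_cpu_cacheinfo_id() and cpu_to_node(); L2 scope is not modelled
and falls through to the error return.

```c
#include <assert.h>

enum scope {
	SCOPE_L2_CACHE = 2,
	SCOPE_L3_CACHE = 3,
	SCOPE_L3_NODE,		/* node-sized subset of an L3 cache */
};

#define CPUS_PER_L3	8	/* assumed topology */
#define NODES_PER_L3	2	/* assumed SNC-2 configuration */

/* Hypothetical stand-in for get_cpu_cacheinfo_id(cpu, 3). */
static int model_l3_id(int cpu)
{
	return cpu / CPUS_PER_L3;
}

/* Hypothetical stand-in for cpu_to_node(cpu). */
static int model_node_id(int cpu)
{
	return cpu / (CPUS_PER_L3 / NODES_PER_L3);
}

/* Domain id for a CPU at the requested scope; -1 if not modelled. */
static int domain_id(int cpu, enum scope scope)
{
	switch (scope) {
	case SCOPE_L3_CACHE:
		return model_l3_id(cpu);
	case SCOPE_L3_NODE:
		return model_node_id(cpu);
	default:
		break;
	}
	return -1;
}
```

In this model CPUs 3 and 5 share L3 domain 0, yet they fall into node
domains 0 and 1 respectively.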
Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 2 ++
2 files changed, 3 insertions(+)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index aa2c22a8e37b..64b6ad1b22a1 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -176,6 +176,7 @@ struct resctrl_schema;
enum resctrl_scope {
RESCTRL_L2_CACHE = 2,
RESCTRL_L3_CACHE = 3,
+ RESCTRL_L3_NODE,
};
/**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index b4f2be776408..b86c525d0620 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -510,6 +510,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
case RESCTRL_L2_CACHE:
case RESCTRL_L3_CACHE:
return get_cpu_cacheinfo_id(cpu, scope);
+ case RESCTRL_L3_NODE:
+ return cpu_to_node(cpu);
default:
break;
}
--
2.45.0
The rdt_domain structure is used for both control and monitor features.
It is about to be split into separate structures for these two usages
because the scope for control and monitoring features for a resource
will be different for future resources.
To allow for common code that scans a list of domains looking for a
specific domain id, move all the common fields ("list", "id", "cpu_mask")
into their own structure within the rdt_domain structure.
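A minimal userspace sketch of why a shared header helps is shown below.
The ctrl_domain and mon_domain types, the dom_container_of() macro and
the singly linked list are hypothetical simplifications; the kernel uses
struct list_head and container_of(). One search routine works on the
header alone, and each caller converts the result back to its own type.

```c
#include <assert.h>
#include <stddef.h>

/* Fields common to every domain type (simplified). */
struct dom_hdr {
	struct dom_hdr *next;
	int id;
	unsigned long cpu_mask;
};

/* Two hypothetical domain types that embed the common header. */
struct ctrl_domain {
	struct dom_hdr hdr;
	unsigned int ctrl_val;
};

struct mon_domain {
	struct dom_hdr hdr;
	unsigned long long mbm_total;
};

/* Recover the enclosing structure from a pointer to its member. */
#define dom_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Type-agnostic search: only the header fields are inspected. */
static struct dom_hdr *find_hdr(struct dom_hdr *head, int id)
{
	for (; head; head = head->next) {
		if (head->id == id)
			return head;
	}
	return NULL;
}
```

A caller holding a list of monitor domains would use
dom_container_of(find_hdr(head, id), struct mon_domain, hdr) after
checking the result for NULL.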
Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 16 ++++--
arch/x86/kernel/cpu/resctrl/core.c | 26 +++++-----
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 24 ++++-----
arch/x86/kernel/cpu/resctrl/monitor.c | 14 +++---
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 14 +++---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 60 +++++++++++------------
6 files changed, 81 insertions(+), 73 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index ed693bfe474d..f63fcf17a3bc 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -59,10 +59,20 @@ struct resctrl_staged_config {
};
/**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain_hdr - common header for different domain types
* @list: all instances of this resource
* @id: unique id for this instance
* @cpu_mask: which CPUs share this resource
+ */
+struct rdt_domain_hdr {
+ struct list_head list;
+ int id;
+ struct cpumask cpu_mask;
+};
+
+/**
+ * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * @hdr: common header for different domain types
* @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
* @mbm_total: saved state for MBM total bandwidth
* @mbm_local: saved state for MBM local bandwidth
@@ -77,9 +87,7 @@ struct resctrl_staged_config {
* by closid
*/
struct rdt_domain {
- struct list_head list;
- int id;
- struct cpumask cpu_mask;
+ struct rdt_domain_hdr hdr;
unsigned long *rmid_busy_llc;
struct mbm_state *mbm_total;
struct mbm_state *mbm_local;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index f85b2ff40eef..96fff44f9d03 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -355,9 +355,9 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
lockdep_assert_cpus_held();
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
/* Find the domain that contains this CPU */
- if (cpumask_test_cpu(cpu, &d->cpu_mask))
+ if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
return d;
}
@@ -393,12 +393,12 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
struct list_head *l;
list_for_each(l, &r->domains) {
- d = list_entry(l, struct rdt_domain, list);
+ d = list_entry(l, struct rdt_domain, hdr.list);
/* When id is found, return its domain. */
- if (id == d->id)
+ if (id == d->hdr.id)
return d;
/* Stop searching when finding id's position in sorted list. */
- if (id < d->id)
+ if (id < d->hdr.id)
break;
}
@@ -526,7 +526,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
d = rdt_find_domain(r, id, &add_pos);
if (d) {
- cpumask_set_cpu(cpu, &d->cpu_mask);
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
rdt_domain_reconfigure_cdp(r);
return;
@@ -537,8 +537,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;
d = &hw_dom->d_resctrl;
- d->id = id;
- cpumask_set_cpu(cpu, &d->cpu_mask);
+ d->hdr.id = id;
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
rdt_domain_reconfigure_cdp(r);
@@ -552,11 +552,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;
}
- list_add_tail_rcu(&d->list, add_pos);
+ list_add_tail_rcu(&d->hdr.list, add_pos);
err = resctrl_online_domain(r, d);
if (err) {
- list_del_rcu(&d->list);
+ list_del_rcu(&d->hdr.list);
synchronize_rcu();
domain_free(hw_dom);
}
@@ -583,10 +583,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
}
hw_dom = resctrl_to_arch_dom(d);
- cpumask_clear_cpu(cpu, &d->cpu_mask);
- if (cpumask_empty(&d->cpu_mask)) {
+ cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+ if (cpumask_empty(&d->hdr.cpu_mask)) {
resctrl_offline_domain(r, d);
- list_del_rcu(&d->list);
+ list_del_rcu(&d->hdr.list);
synchronize_rcu();
/*
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 2bf021d42500..6246f48b0449 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -69,7 +69,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
cfg = &d->staged_config[s->conf_type];
if (cfg->have_new_ctrl) {
- rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+ rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
return -EINVAL;
}
@@ -148,7 +148,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
cfg = &d->staged_config[s->conf_type];
if (cfg->have_new_ctrl) {
- rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+ rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
return -EINVAL;
}
@@ -231,8 +231,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
return -EINVAL;
}
dom = strim(dom);
- list_for_each_entry(d, &r->domains, list) {
- if (d->id == dom_id) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
+ if (d->hdr.id == dom_id) {
data.buf = dom;
data.rdtgrp = rdtgrp;
if (r->parse_ctrlval(&data, s, d))
@@ -280,7 +280,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
u32 idx = get_config_index(closid, t);
struct msr_param msr_param;
- if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+ if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
return -EINVAL;
hw_dom->ctrl_val[idx] = cfg_val;
@@ -306,7 +306,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
/* Walking r->domains, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
hw_dom = resctrl_to_arch_dom(d);
msr_param.res = NULL;
for (t = 0; t < CDP_NUM_TYPES; t++) {
@@ -330,7 +330,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
}
}
if (msr_param.res)
- smp_call_function_any(&d->cpu_mask, rdt_ctrl_update, &msr_param, 1);
+ smp_call_function_any(&d->hdr.cpu_mask, rdt_ctrl_update, &msr_param, 1);
}
return 0;
@@ -450,7 +450,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
lockdep_assert_cpus_held();
seq_printf(s, "%*s:", max_name_width, schema->name);
- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->domains, hdr.list) {
if (sep)
seq_puts(s, ";");
@@ -460,7 +460,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
ctrl_val = resctrl_arch_get_config(r, dom, closid,
schema->conf_type);
- seq_printf(s, r->format_str, dom->id, max_data_width,
+ seq_printf(s, r->format_str, dom->hdr.id, max_data_width,
ctrl_val);
sep = true;
}
@@ -489,7 +489,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
} else {
seq_printf(s, "%s:%d=%x\n",
rdtgrp->plr->s->res->name,
- rdtgrp->plr->d->id,
+ rdtgrp->plr->d->hdr.id,
rdtgrp->plr->cbm);
}
} else {
@@ -537,7 +537,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
return;
}
- cpu = cpumask_any_housekeeping(&d->cpu_mask, RESCTRL_PICK_ANY_CPU);
+ cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask, RESCTRL_PICK_ANY_CPU);
/*
* cpumask_any_housekeeping() prefers housekeeping CPUs, but
@@ -546,7 +546,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
* counters on some platforms if its called in IRQ context.
*/
if (tick_nohz_full_cpu(cpu))
- smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+ smp_call_function_any(&d->hdr.cpu_mask, mon_event_count, rr, 1);
else
smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 2345e6836593..ab8a198d88b3 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -281,7 +281,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
resctrl_arch_rmid_read_context_check();
- if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+ if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
return -EINVAL;
ret = __rmid_read(rmid, eventid, &msr_val);
@@ -364,7 +364,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
* CLOSID and RMID because there may be dependencies between them
* on some architectures.
*/
- trace_mon_llc_occupancy_limbo(entry->closid, entry->rmid, d->id, val);
+ trace_mon_llc_occupancy_limbo(entry->closid, entry->rmid, d->hdr.id, val);
}
if (force_free || !rmid_dirty) {
@@ -490,7 +490,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
entry->busy = 0;
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
/*
* For the first limbo RMID in the domain,
* setup up the limbo worker.
@@ -801,7 +801,7 @@ void cqm_handle_limbo(struct work_struct *work)
__check_limbo(d, false);
if (has_busy_rmid(d)) {
- d->cqm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask,
+ d->cqm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask,
RESCTRL_PICK_ANY_CPU);
schedule_delayed_work_on(d->cqm_work_cpu, &d->cqm_limbo,
delay);
@@ -825,7 +825,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;
- cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu);
+ cpu = cpumask_any_housekeeping(&dom->hdr.cpu_mask, exclude_cpu);
dom->cqm_work_cpu = cpu;
if (cpu < nr_cpu_ids)
@@ -868,7 +868,7 @@ void mbm_handle_overflow(struct work_struct *work)
* Re-check for housekeeping CPUs. This allows the overflow handler to
* move off a nohz_full CPU quickly.
*/
- d->mbm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask,
+ d->mbm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask,
RESCTRL_PICK_ANY_CPU);
schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay);
@@ -897,7 +897,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms,
*/
if (!resctrl_mounted || !resctrl_arch_mon_capable())
return;
- cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu);
+ cpu = cpumask_any_housekeeping(&dom->hdr.cpu_mask, exclude_cpu);
dom->mbm_work_cpu = cpu;
if (cpu < nr_cpu_ids)
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 7c4bf0a006ce..36d943cb847a 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -221,7 +221,7 @@ static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr)
int cpu;
int ret;
- for_each_cpu(cpu, &plr->d->cpu_mask) {
+ for_each_cpu(cpu, &plr->d->hdr.cpu_mask) {
pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL);
if (!pm_req) {
rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n");
@@ -301,7 +301,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
return -ENODEV;
/* Pick the first cpu we find that is associated with the cache. */
- plr->cpu = cpumask_first(&plr->d->cpu_mask);
+ plr->cpu = cpumask_first(&plr->d->hdr.cpu_mask);
if (!cpu_online(plr->cpu)) {
rdt_last_cmd_printf("CPU %u associated with cache not online\n",
@@ -859,10 +859,10 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
* associated with them.
*/
for_each_alloc_capable_rdt_resource(r) {
- list_for_each_entry(d_i, &r->domains, list) {
+ list_for_each_entry(d_i, &r->domains, hdr.list) {
if (d_i->plr)
cpumask_or(cpu_with_psl, cpu_with_psl,
- &d_i->cpu_mask);
+ &d_i->hdr.cpu_mask);
}
}
@@ -870,7 +870,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
* Next test if new pseudo-locked region would intersect with
* existing region.
*/
- if (cpumask_intersects(&d->cpu_mask, cpu_with_psl))
+ if (cpumask_intersects(&d->hdr.cpu_mask, cpu_with_psl))
ret = true;
free_cpumask_var(cpu_with_psl);
@@ -1202,7 +1202,7 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
}
plr->thread_done = 0;
- cpu = cpumask_first(&plr->d->cpu_mask);
+ cpu = cpumask_first(&plr->d->hdr.cpu_mask);
if (!cpu_online(cpu)) {
ret = -ENODEV;
goto out;
@@ -1532,7 +1532,7 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
* may be scheduled elsewhere and invalidate entries in the
* pseudo-locked region.
*/
- if (!cpumask_subset(current->cpus_ptr, &plr->d->cpu_mask)) {
+ if (!cpumask_subset(current->cpus_ptr, &plr->d->hdr.cpu_mask)) {
mutex_unlock(&rdtgroup_mutex);
return -EINVAL;
}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b8588ce88eef..e6e2753738c9 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -98,7 +98,7 @@ void rdt_staged_configs_clear(void)
lockdep_assert_held(&rdtgroup_mutex);
for_each_alloc_capable_rdt_resource(r) {
- list_for_each_entry(dom, &r->domains, list)
+ list_for_each_entry(dom, &r->domains, hdr.list)
memset(dom->staged_config, 0, sizeof(dom->staged_config));
}
}
@@ -317,7 +317,7 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
rdt_last_cmd_puts("Cache domain offline\n");
ret = -ENODEV;
} else {
- mask = &rdtgrp->plr->d->cpu_mask;
+ mask = &rdtgrp->plr->d->hdr.cpu_mask;
seq_printf(s, is_cpu_list(of) ?
"%*pbl\n" : "%*pb\n",
cpumask_pr_args(mask));
@@ -1021,12 +1021,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
hw_shareable = r->cache.shareable_bits;
- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->domains, hdr.list) {
if (sep)
seq_putc(seq, ';');
sw_shareable = 0;
exclusive = 0;
- seq_printf(seq, "%d=", dom->id);
+ seq_printf(seq, "%d=", dom->hdr.id);
for (i = 0; i < closids_supported(); i++) {
if (!closid_allocated(i))
continue;
@@ -1343,7 +1343,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
continue;
has_cache = true;
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
ctrl = resctrl_arch_get_config(r, d, closid,
s->conf_type);
if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1458,7 +1458,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
return size;
num_b = bitmap_weight(&cbm, r->cache.cbm_len);
- ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
+ ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
if (ci->info_list[i].level == r->scope) {
size = ci->info_list[i].size / r->cache.cbm_len * num_b;
@@ -1506,7 +1506,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res,
rdtgrp->plr->d,
rdtgrp->plr->cbm);
- seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size);
+ seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size);
}
goto out;
}
@@ -1518,7 +1518,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
type = schema->conf_type;
sep = false;
seq_printf(s, "%*s:", max_name_width, schema->name);
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
if (sep)
seq_putc(s, ';');
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1536,7 +1536,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
else
size = rdtgroup_cbm_to_size(r, d, ctrl);
}
- seq_printf(s, "%d=%u", d->id, size);
+ seq_printf(s, "%d=%u", d->hdr.id, size);
sep = true;
}
seq_putc(s, '\n');
@@ -1596,7 +1596,7 @@ static void mon_event_config_read(void *info)
static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
{
- smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
+ smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
}
static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
@@ -1608,7 +1608,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->domains, hdr.list) {
if (sep)
seq_puts(s, ";");
@@ -1616,7 +1616,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
mon_info.evtid = evtid;
mondata_config_read(dom, &mon_info);
- seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
+ seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
sep = true;
}
seq_puts(s, "\n");
@@ -1682,7 +1682,7 @@ static void mbm_config_write_domain(struct rdt_resource *r,
* are scoped at the domain level. Writing any of these MSRs
* on one CPU is observed by all the CPUs in the domain.
*/
- smp_call_function_any(&d->cpu_mask, mon_event_config_write,
+ smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_write,
&mon_info, 1);
/*
@@ -1732,8 +1732,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
return -EINVAL;
}
- list_for_each_entry(d, &r->domains, list) {
- if (d->id == dom_id) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
+ if (d->hdr.id == dom_id) {
mbm_config_write_domain(r, d, evtid, val);
goto next;
}
@@ -2280,14 +2280,14 @@ static int set_cache_qos_cfg(int level, bool enable)
return -ENOMEM;
r_l = &rdt_resources_all[level].r_resctrl;
- list_for_each_entry(d, &r_l->domains, list) {
+ list_for_each_entry(d, &r_l->domains, hdr.list) {
if (r_l->cache.arch_has_per_cpu_cfg)
/* Pick all the CPUs in the domain instance */
- for_each_cpu(cpu, &d->cpu_mask)
+ for_each_cpu(cpu, &d->hdr.cpu_mask)
cpumask_set_cpu(cpu, cpu_mask);
else
/* Pick one CPU from each domain instance to update MSR */
- cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+ cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
}
/* Update QOS_CFG MSR on all the CPUs in cpu_mask */
@@ -2316,7 +2316,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
{
u32 num_closid = resctrl_arch_get_num_closid(r);
- int cpu = cpumask_any(&d->cpu_mask);
+ int cpu = cpumask_any(&d->hdr.cpu_mask);
int i;
d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val),
@@ -2365,7 +2365,7 @@ static int set_mba_sc(bool mba_sc)
r->membw.mba_sc = mba_sc;
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
for (i = 0; i < num_closid; i++)
d->mbps_val[i] = MBA_MAX_MBPS;
}
@@ -2704,7 +2704,7 @@ static int rdt_get_tree(struct fs_context *fc)
if (is_mbm_enabled()) {
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- list_for_each_entry(dom, &r->domains, list)
+ list_for_each_entry(dom, &r->domains, hdr.list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL,
RESCTRL_PICK_ANY_CPU);
}
@@ -2831,13 +2831,13 @@ static int reset_all_ctrls(struct rdt_resource *r)
* CBMs in all domains to the maximum mask value. Pick one CPU
* from each domain to update the MSRs below.
*/
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
hw_dom = resctrl_to_arch_dom(d);
for (i = 0; i < hw_res->num_closid; i++)
hw_dom->ctrl_val[i] = r->default_ctrl;
msr_param.dom = d;
- smp_call_function_any(&d->cpu_mask, rdt_ctrl_update, &msr_param, 1);
+ smp_call_function_any(&d->hdr.cpu_mask, rdt_ctrl_update, &msr_param, 1);
}
return 0;
@@ -3035,7 +3035,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
char name[32];
int ret;
- sprintf(name, "mon_%s_%02d", r->name, d->id);
+ sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
/* create the directory */
kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
if (IS_ERR(kn))
@@ -3051,7 +3051,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
}
priv.u.rid = r->rid;
- priv.u.domid = d->id;
+ priv.u.domid = d->hdr.id;
list_for_each_entry(mevt, &r->evt_list, list) {
priv.u.evtid = mevt->evtid;
ret = mon_addfile(kn, mevt->name, priv.priv);
@@ -3102,7 +3102,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
/* Walking r->domains, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();
- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->domains, hdr.list) {
ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
if (ret)
return ret;
@@ -3261,7 +3261,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
*/
tmp_cbm = cfg->new_ctrl;
if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) {
- rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id);
+ rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id);
return -ENOSPC;
}
cfg->have_new_ctrl = true;
@@ -3284,7 +3284,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
struct rdt_domain *d;
int ret;
- list_for_each_entry(d, &s->res->domains, list) {
+ list_for_each_entry(d, &s->res->domains, hdr.list) {
ret = __init_one_rdt_domain(d, s, closid);
if (ret < 0)
return ret;
@@ -3299,7 +3299,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
struct resctrl_staged_config *cfg;
struct rdt_domain *d;
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
if (is_mba_sc(r)) {
d->mbps_val[closid] = MBA_MAX_MBPS;
continue;
@@ -3945,7 +3945,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
* per domain monitor data directories.
*/
if (resctrl_mounted && resctrl_arch_mon_capable())
- rmdir_mondata_subdir_allrdtgrp(r, d->id);
+ rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
if (is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
--
2.45.0
Resctrl assumes that control and monitor operations on a resource are
performed at the same scope.
Prepare for systems that use different scopes (specifically, Intel needs
to split the RDT_RESOURCE_L3 resource to use L3 scope for cache control
and NODE scope for cache occupancy and memory bandwidth monitoring).
Create separate domain lists for control and monitor operations.
Note that errors during initialization of either control or monitor
functions on a domain would previously result in that domain being
excluded from both control and monitor operations. Now that the domains
are allocated independently, it is no longer necessary to disable both
control and monitor operations if either fails.
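As a rough illustration of the idea (not the kernel code), the sketch below
models a resource with two independent, id-sorted domain lists that share a
common header carrying a type tag. The names domain_hdr, resource,
find_domain() and insert_domain() are hypothetical user-space stand-ins,
and a plain singly linked list is assumed in place of the kernel's RCU
list_head.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; names only loosely mirror the patch. */
enum domain_type { CTRL_DOMAIN, MON_DOMAIN };

/* Common header shared by control and monitor domains. */
struct domain_hdr {
	struct domain_hdr *next;	/* list kept sorted by ascending id */
	int id;
	enum domain_type type;
};

/* One resource, two independent domain lists. */
struct resource {
	struct domain_hdr *ctrl_domains;
	struct domain_hdr *mon_domains;
};

/*
 * Search one list for @id. Return the matching header, or NULL.
 * When no match is found and @pos is non-NULL, *pos is set to the
 * link where a new domain with this id would be inserted.
 */
static struct domain_hdr *find_domain(struct domain_hdr **head, int id,
				      struct domain_hdr ***pos)
{
	struct domain_hdr **link = head;

	while (*link && (*link)->id < id)
		link = &(*link)->next;

	if (*link && (*link)->id == id)
		return *link;

	if (pos)
		*pos = link;
	return NULL;
}

/* Insert @d into the list at @head, keeping ascending id order. */
static void insert_domain(struct domain_hdr **head, struct domain_hdr *d)
{
	struct domain_hdr **pos = NULL;

	assert(d != NULL);
	if (find_domain(head, d->id, &pos))
		return;		/* a domain with this id already exists */

	d->next = *pos;
	*pos = d;
}
```

Because each list is searched on its own, a monitor domain id that has no
counterpart on the control list (or vice versa) is simply not found there,
which is the property that lets the two scopes differ.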
Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 25 ++-
arch/x86/kernel/cpu/resctrl/internal.h | 7 +-
arch/x86/kernel/cpu/resctrl/core.c | 224 +++++++++++++++++-----
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 12 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 4 +-
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 60 +++---
7 files changed, 240 insertions(+), 96 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index f63fcf17a3bc..96ddf9ff3183 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -58,15 +58,22 @@ struct resctrl_staged_config {
bool have_new_ctrl;
};
+enum resctrl_domain_type {
+ RESCTRL_CTRL_DOMAIN,
+ RESCTRL_MON_DOMAIN,
+};
+
/**
* struct rdt_domain_hdr - common header for different domain types
* @list: all instances of this resource
* @id: unique id for this instance
+ * @type: type of this instance
* @cpu_mask: which CPUs share this resource
*/
struct rdt_domain_hdr {
struct list_head list;
int id;
+ enum resctrl_domain_type type;
struct cpumask cpu_mask;
};
@@ -169,10 +176,12 @@ enum resctrl_scope {
* @alloc_capable: Is allocation available on this machine
* @mon_capable: Is monitor feature available on this machine
* @num_rmid: Number of RMIDs available
- * @scope: Scope of this resource
+ * @ctrl_scope: Scope of this resource for control functions
+ * @mon_scope: Scope of this resource for monitor functions
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
- * @domains: RCU list of all domains for this resource
+ * @ctrl_domains: RCU list of all control domains for this resource
+ * @mon_domains: RCU list of all monitor domains for this resource
* @name: Name to use in "schemata" file.
* @data_width: Character width of data when displaying
* @default_ctrl: Specifies default cache cbm or memory B/W percent.
@@ -187,10 +196,12 @@ struct rdt_resource {
bool alloc_capable;
bool mon_capable;
int num_rmid;
- enum resctrl_scope scope;
+ enum resctrl_scope ctrl_scope;
+ enum resctrl_scope mon_scope;
struct resctrl_cache cache;
struct resctrl_membw membw;
- struct list_head domains;
+ struct list_head ctrl_domains;
+ struct list_head mon_domains;
char *name;
int data_width;
u32 default_ctrl;
@@ -236,8 +247,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
void resctrl_online_cpu(unsigned int cpu);
void resctrl_offline_cpu(unsigned int cpu);
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index f1d926832ec8..377679b79919 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -558,8 +558,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
- struct list_head **pos);
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+ struct list_head **pos);
ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
int rdtgroup_schemata_show(struct kernfs_open_file *of,
@@ -578,7 +578,8 @@ int rdt_pseudo_lock_init(void);
void rdt_pseudo_lock_release(void);
int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r);
int closids_supported(void);
void closid_free(int closid);
int alloc_rmid(u32 closid);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 96fff44f9d03..edd9b2bfb53d 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -60,7 +60,8 @@ static void mba_wrmsr_intel(struct msr_param *m);
static void cat_wrmsr(struct msr_param *m);
static void mba_wrmsr_amd(struct msr_param *m);
-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
+#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)
struct rdt_hw_resource rdt_resources_all[] = {
[RDT_RESOURCE_L3] =
@@ -68,8 +69,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L3,
.name = "L3",
- .scope = RESCTRL_L3_CACHE,
- .domains = domain_init(RDT_RESOURCE_L3),
+ .ctrl_scope = RESCTRL_L3_CACHE,
+ .mon_scope = RESCTRL_L3_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_L3),
+ .mon_domains = mon_domain_init(RDT_RESOURCE_L3),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
@@ -82,8 +85,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L2,
.name = "L2",
- .scope = RESCTRL_L2_CACHE,
- .domains = domain_init(RDT_RESOURCE_L2),
+ .ctrl_scope = RESCTRL_L2_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_L2),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
@@ -96,8 +99,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_MBA,
.name = "MB",
- .scope = RESCTRL_L3_CACHE,
- .domains = domain_init(RDT_RESOURCE_MBA),
+ .ctrl_scope = RESCTRL_L3_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_MBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
.fflags = RFTYPE_RES_MB,
@@ -108,8 +111,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_SMBA,
.name = "SMBA",
- .scope = RESCTRL_L3_CACHE,
- .domains = domain_init(RDT_RESOURCE_SMBA),
+ .ctrl_scope = RESCTRL_L3_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_SMBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
.fflags = RFTYPE_RES_MB,
@@ -349,13 +352,28 @@ static void cat_wrmsr(struct msr_param *m)
wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
}
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
{
struct rdt_domain *d;
lockdep_assert_cpus_held();
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+ /* Find the domain that contains this CPU */
+ if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
+ return d;
+ }
+
+ return NULL;
+}
+
+struct rdt_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r)
+{
+ struct rdt_domain *d;
+
+ lockdep_assert_cpus_held();
+
+ list_for_each_entry(d, &r->mon_domains, hdr.list) {
/* Find the domain that contains this CPU */
if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
return d;
@@ -379,26 +397,26 @@ void rdt_ctrl_update(void *arg)
}
/*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Search for a domain id in a resource domain list.
*
- * Search resource r's domain list to find the resource id. If the resource
- * id is found in a domain, return the domain. Otherwise, if requested by
- * caller, return the first domain whose id is bigger than the input id.
- * The domain list is sorted by id in ascending order.
+ * Search the domain list to find the domain id. If the domain id is
+ * found, return the domain. NULL otherwise. If the domain id is not
+ * found (and NULL returned) then the first domain with id bigger than
+ * the input id can be returned to the caller via @pos.
*/
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
- struct list_head **pos)
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+ struct list_head **pos)
{
- struct rdt_domain *d;
+ struct rdt_domain_hdr *d;
struct list_head *l;
- list_for_each(l, &r->domains) {
- d = list_entry(l, struct rdt_domain, hdr.list);
+ list_for_each(l, h) {
+ d = list_entry(l, struct rdt_domain_hdr, list);
/* When id is found, return its domain. */
- if (id == d->hdr.id)
+ if (id == d->id)
return d;
/* Stop searching when finding id's position in sorted list. */
- if (id < d->hdr.id)
+ if (id < d->id)
break;
}
@@ -494,38 +512,29 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
return -EINVAL;
}
-/*
- * domain_add_cpu - Add a cpu to a resource's domain list.
- *
- * If an existing domain in the resource r's domain list matches the cpu's
- * resource id, add the cpu in the domain.
- *
- * Otherwise, a new domain is allocated and inserted into the right position
- * in the domain list sorted by id in ascending order.
- *
- * The order in the domain list is visible to users when we print entries
- * in the schemata file and schemata input is validated to have the same order
- * as this list.
- */
-static void domain_add_cpu(int cpu, struct rdt_resource *r)
+static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
{
- int id = get_domain_id_from_scope(cpu, r->scope);
+ int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
struct list_head *add_pos = NULL;
struct rdt_hw_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
struct rdt_domain *d;
int err;
lockdep_assert_held(&domain_list_lock);
if (id < 0) {
- pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
- cpu, r->scope, r->name);
+ pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->ctrl_scope, r->name);
return;
}
- d = rdt_find_domain(r, id, &add_pos);
+ hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
+ if (hdr) {
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ return;
+ d = container_of(hdr, struct rdt_domain, hdr);
- if (d) {
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
rdt_domain_reconfigure_cdp(r);
@@ -538,23 +547,70 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
d = &hw_dom->d_resctrl;
d->hdr.id = id;
+ d->hdr.type = RESCTRL_CTRL_DOMAIN;
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
rdt_domain_reconfigure_cdp(r);
- if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
+ if (domain_setup_ctrlval(r, d)) {
domain_free(hw_dom);
return;
}
- if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+ list_add_tail_rcu(&d->hdr.list, add_pos);
+
+ err = resctrl_online_ctrl_domain(r, d);
+ if (err) {
+ list_del_rcu(&d->hdr.list);
+ synchronize_rcu();
+ domain_free(hw_dom);
+ }
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->mon_scope);
+ struct list_head *add_pos = NULL;
+ struct rdt_hw_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
+ struct rdt_domain *d;
+ int err;
+
+ lockdep_assert_held(&domain_list_lock);
+
+ if (id < 0) {
+ pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->mon_scope, r->name);
+ return;
+ }
+
+ hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
+ if (hdr) {
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ return;
+ d = container_of(hdr, struct rdt_domain, hdr);
+
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+ return;
+ }
+
+ hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+ if (!hw_dom)
+ return;
+
+ d = &hw_dom->d_resctrl;
+ d->hdr.id = id;
+ d->hdr.type = RESCTRL_MON_DOMAIN;
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+
+ if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
domain_free(hw_dom);
return;
}
list_add_tail_rcu(&d->hdr.list, add_pos);
- err = resctrl_online_domain(r, d);
+ err = resctrl_online_mon_domain(r, d);
if (err) {
list_del_rcu(&d->hdr.list);
synchronize_rcu();
@@ -562,30 +618,45 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
}
}
-static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+static void domain_add_cpu(int cpu, struct rdt_resource *r)
+{
+ if (r->alloc_capable)
+ domain_add_cpu_ctrl(cpu, r);
+ if (r->mon_capable)
+ domain_add_cpu_mon(cpu, r);
+}
+
+static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
{
- int id = get_domain_id_from_scope(cpu, r->scope);
+ int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
struct rdt_hw_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
struct rdt_domain *d;
lockdep_assert_held(&domain_list_lock);
if (id < 0) {
- pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
- cpu, r->scope, r->name);
+ pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->ctrl_scope, r->name);
return;
}
- d = rdt_find_domain(r, id, NULL);
- if (!d) {
- pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
+ hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
+ if (!hdr) {
+ pr_warn("Can't find control domain for id=%d for CPU %d for resource %s\n",
+ id, cpu, r->name);
return;
}
+
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ return;
+
+ d = container_of(hdr, struct rdt_domain, hdr);
hw_dom = resctrl_to_arch_dom(d);
cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
if (cpumask_empty(&d->hdr.cpu_mask)) {
- resctrl_offline_domain(r, d);
+ resctrl_offline_ctrl_domain(r, d);
list_del_rcu(&d->hdr.list);
synchronize_rcu();
@@ -601,6 +672,53 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
}
}
+static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->mon_scope);
+ struct rdt_hw_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
+ struct rdt_domain *d;
+
+ lockdep_assert_held(&domain_list_lock);
+
+ if (id < 0) {
+ pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->mon_scope, r->name);
+ return;
+ }
+
+ hdr = rdt_find_domain(&r->mon_domains, id, NULL);
+ if (!hdr) {
+ pr_warn("Can't find monitor domain for id=%d for CPU %d for resource %s\n",
+ id, cpu, r->name);
+ return;
+ }
+
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ return;
+
+ d = container_of(hdr, struct rdt_domain, hdr);
+ hw_dom = resctrl_to_arch_dom(d);
+
+ cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+ if (cpumask_empty(&d->hdr.cpu_mask)) {
+ resctrl_offline_mon_domain(r, d);
+ list_del_rcu(&d->hdr.list);
+ synchronize_rcu();
+ domain_free(hw_dom);
+
+ return;
+ }
+}
+
+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+{
+ if (r->alloc_capable)
+ domain_remove_cpu_ctrl(cpu, r);
+ if (r->mon_capable)
+ domain_remove_cpu_mon(cpu, r);
+}
+
static void clear_closid_rmid(int cpu)
{
struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 6246f48b0449..8cc36723f077 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -231,7 +231,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
return -EINVAL;
}
dom = strim(dom);
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
if (d->hdr.id == dom_id) {
data.buf = dom;
data.rdtgrp = rdtgrp;
@@ -306,7 +306,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
/* Walking r->domains, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
hw_dom = resctrl_to_arch_dom(d);
msr_param.res = NULL;
for (t = 0; t < CDP_NUM_TYPES; t++) {
@@ -450,7 +450,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
lockdep_assert_cpus_held();
seq_printf(s, "%*s:", max_name_width, schema->name);
- list_for_each_entry(dom, &r->domains, hdr.list) {
+ list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
if (sep)
seq_puts(s, ";");
@@ -556,6 +556,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
+ struct rdt_domain_hdr *hdr;
u32 resid, evtid, domid;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
@@ -576,11 +577,12 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
evtid = md.u.evtid;
r = &rdt_resources_all[resid].r_resctrl;
- d = rdt_find_domain(r, domid, NULL);
- if (!d) {
+ hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
+ if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
ret = -ENOENT;
goto out;
}
+ d = container_of(hdr, struct rdt_domain, hdr);
mon_event_read(&rr, r, d, rdtgrp, evtid, false);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ab8a198d88b3..82a44de8136f 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -490,7 +490,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
entry->busy = 0;
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->mon_domains, hdr.list) {
/*
* For the first limbo RMID in the domain,
* setup up the limbo worker.
@@ -687,7 +687,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
idx = resctrl_arch_rmid_idx_encode(closid, rmid);
pmbm_data = &dom_mbm->mbm_local[idx];
- dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
+ dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
if (!dom_mba) {
pr_warn_once("Failure to get domain for MBA update\n");
return;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 36d943cb847a..58985ffcf74e 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,7 +292,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
*/
static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
{
- enum resctrl_scope scope = plr->s->res->scope;
+ enum resctrl_scope scope = plr->s->res->ctrl_scope;
struct cpu_cacheinfo *ci;
int ret;
int i;
@@ -859,7 +859,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
* associated with them.
*/
for_each_alloc_capable_rdt_resource(r) {
- list_for_each_entry(d_i, &r->domains, hdr.list) {
+ list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) {
if (d_i->plr)
cpumask_or(cpu_with_psl, cpu_with_psl,
&d_i->hdr.cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e6e2753738c9..7c1475f393ff 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -98,7 +98,7 @@ void rdt_staged_configs_clear(void)
lockdep_assert_held(&rdtgroup_mutex);
for_each_alloc_capable_rdt_resource(r) {
- list_for_each_entry(dom, &r->domains, hdr.list)
+ list_for_each_entry(dom, &r->ctrl_domains, hdr.list)
memset(dom->staged_config, 0, sizeof(dom->staged_config));
}
}
@@ -1021,7 +1021,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
hw_shareable = r->cache.shareable_bits;
- list_for_each_entry(dom, &r->domains, hdr.list) {
+ list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
if (sep)
seq_putc(seq, ';');
sw_shareable = 0;
@@ -1343,7 +1343,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
continue;
has_cache = true;
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
ctrl = resctrl_arch_get_config(r, d, closid,
s->conf_type);
if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1454,13 +1454,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
unsigned int size = 0;
int num_b, i;
- if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+ if (WARN_ON_ONCE(r->ctrl_scope != RESCTRL_L2_CACHE && r->ctrl_scope != RESCTRL_L3_CACHE))
return size;
num_b = bitmap_weight(&cbm, r->cache.cbm_len);
ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == r->scope) {
+ if (ci->info_list[i].level == r->ctrl_scope) {
size = ci->info_list[i].size / r->cache.cbm_len * num_b;
break;
}
@@ -1518,7 +1518,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
type = schema->conf_type;
sep = false;
seq_printf(s, "%*s:", max_name_width, schema->name);
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
if (sep)
seq_putc(s, ';');
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1608,7 +1608,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
- list_for_each_entry(dom, &r->domains, hdr.list) {
+ list_for_each_entry(dom, &r->mon_domains, hdr.list) {
if (sep)
seq_puts(s, ";");
@@ -1732,7 +1732,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
return -EINVAL;
}
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->mon_domains, hdr.list) {
if (d->hdr.id == dom_id) {
mbm_config_write_domain(r, d, evtid, val);
goto next;
@@ -2280,7 +2280,7 @@ static int set_cache_qos_cfg(int level, bool enable)
return -ENOMEM;
r_l = &rdt_resources_all[level].r_resctrl;
- list_for_each_entry(d, &r_l->domains, hdr.list) {
+ list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) {
if (r_l->cache.arch_has_per_cpu_cfg)
/* Pick all the CPUs in the domain instance */
for_each_cpu(cpu, &d->hdr.cpu_mask)
@@ -2365,7 +2365,7 @@ static int set_mba_sc(bool mba_sc)
r->membw.mba_sc = mba_sc;
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
for (i = 0; i < num_closid; i++)
d->mbps_val[i] = MBA_MAX_MBPS;
}
@@ -2704,7 +2704,7 @@ static int rdt_get_tree(struct fs_context *fc)
if (is_mbm_enabled()) {
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- list_for_each_entry(dom, &r->domains, hdr.list)
+ list_for_each_entry(dom, &r->mon_domains, hdr.list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL,
RESCTRL_PICK_ANY_CPU);
}
@@ -2828,10 +2828,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
/*
* Disable resource control for this resource by setting all
- * CBMs in all domains to the maximum mask value. Pick one CPU
+ * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU
* from each domain to update the MSRs below.
*/
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
hw_dom = resctrl_to_arch_dom(d);
for (i = 0; i < hw_res->num_closid; i++)
@@ -3102,7 +3102,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
/* Walking r->domains, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();
- list_for_each_entry(dom, &r->domains, hdr.list) {
+ list_for_each_entry(dom, &r->mon_domains, hdr.list) {
ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
if (ret)
return ret;
@@ -3284,7 +3284,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
struct rdt_domain *d;
int ret;
- list_for_each_entry(d, &s->res->domains, hdr.list) {
+ list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
ret = __init_one_rdt_domain(d, s, closid);
if (ret < 0)
return ret;
@@ -3299,7 +3299,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
struct resctrl_staged_config *cfg;
struct rdt_domain *d;
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
if (is_mba_sc(r)) {
d->mbps_val[closid] = MBA_MAX_MBPS;
continue;
@@ -3930,15 +3930,19 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
kfree(d->mbm_local);
}
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
{
mutex_lock(&rdtgroup_mutex);
if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
mba_sc_domain_destroy(r, d);
- if (!r->mon_capable)
- goto out_unlock;
+ mutex_unlock(&rdtgroup_mutex);
+}
+
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+ mutex_lock(&rdtgroup_mutex);
/*
* If resctrl is mounted, remove all the
@@ -3964,7 +3968,6 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
domain_destroy_mon_state(d);
-out_unlock:
mutex_unlock(&rdtgroup_mutex);
}
@@ -3999,7 +4002,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
{
int err = 0;
@@ -4008,11 +4011,18 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA) {
/* RDT_RESOURCE_MBA is never mon_capable */
err = mba_sc_domain_allocate(r, d);
- goto out_unlock;
}
- if (!r->mon_capable)
- goto out_unlock;
+ mutex_unlock(&rdtgroup_mutex);
+
+ return err;
+}
+
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+ int err;
+
+ mutex_lock(&rdtgroup_mutex);
err = domain_setup_mon_state(r, d);
if (err)
@@ -4077,7 +4087,7 @@ void resctrl_offline_cpu(unsigned int cpu)
if (!l3->mon_capable)
goto out_unlock;
- d = get_domain_from_cpu(cpu, l3);
+ d = get_mon_domain_from_cpu(cpu, l3);
if (d) {
if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
cancel_delayed_work(&d->mbm_over);
--
2.45.0
Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.
This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This benefit may be offset by lower bandwidth
since the memory accesses for each core can only be interleaved between
the memory controllers on the same SNC node.
Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.
The default mode divides them numerically. For example, when there are
two SNC nodes on a socket, the lower-numbered half of the RMIDs is given
to the first node and the remainder to the second node. This would be
difficult to use with the Linux resctrl interface as the specific RMID
values assigned to resctrl groups are not visible to users.
The other mode divides the RMIDs and renumbers the ones on the second
SNC node to start from zero.
Even with this renumbering, SNC mode requires several changes in resctrl
behavior for correct operation.
Add a static global to arch/x86/kernel/cpu/resctrl/monitor.c to indicate
how many SNC nodes share an L3 cache instance. Initialize this to
"1". Runtime detection of SNC mode will adjust this value.
Update all places to take appropriate action when SNC mode is enabled:
1) The number of logical RMIDs per L3 cache available for use is the
number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be divided by the number of SNC
nodes.
3) Add a function to convert from logical RMID values (assigned to
tasks and loaded into the IA32_PQR_ASSOC MSR on context switch)
to physical RMID values to load into IA32_QM_EVTSEL MSR when
reading counters on each SNC node.
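
As an illustration only (not part of this patch), the arithmetic in
points 1 and 3 can be modelled in user space. The struct snc_model type
and both helper names are hypothetical, and the NUMA node number is
passed in explicitly rather than derived from the current CPU:

```c
#include <stdio.h>

/*
 * Illustrative user-space model, not the kernel code. All names here
 * (struct snc_model, logical_rmids, logical_to_physical) are hypothetical.
 */
struct snc_model {
	int snc_nodes_per_l3_cache;	/* 1 when SNC is disabled */
	int physical_rmids;		/* RMIDs implemented per L3 cache */
};

/* Logical RMIDs usable per L3 cache: physical RMIDs / SNC nodes. */
static int logical_rmids(const struct snc_model *m)
{
	return m->physical_rmids / m->snc_nodes_per_l3_cache;
}

/* Map a logical RMID to the physical RMID for the given NUMA node. */
static int logical_to_physical(const struct snc_model *m, int lrmid, int node)
{
	/* Without SNC the logical and physical RMIDs are identical. */
	if (m->snc_nodes_per_l3_cache == 1)
		return lrmid;

	/* Each SNC node owns one contiguous block of physical RMIDs. */
	return lrmid + (node % m->snc_nodes_per_l3_cache) * logical_rmids(m);
}
```

With two SNC nodes and 200 physical RMIDs this model gives 100 logical
RMIDs, and logical RMID 5 maps to physical RMID 5 on the first node and
105 on the second.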
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/monitor.c | 37 ++++++++++++++++++++++++---
1 file changed, 33 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 89d7e6fcbaa1..b9b4d2b5ca82 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -97,6 +97,8 @@ unsigned int resctrl_rmid_realloc_limit;
#define CF(cf) ((unsigned long)(1048576 * (cf) + 0.5))
+static int snc_nodes_per_l3_cache = 1;
+
/*
* The correction factor table is documented in Documentation/arch/x86/resctrl.rst.
* If rmid > rmid threshold, MBM total and local values should be multiplied
@@ -185,10 +187,37 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
return entry;
}
-static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
+/*
+ * When Sub-NUMA Cluster (SNC) mode is not enabled, the physical RMID
+ * is the same as the logical RMID.
+ *
+ * When SNC mode is enabled the physical RMIDs are distributed across
+ * the SNC nodes. E.g. with two SNC nodes per L3 cache and 200 physical
+ * RMIDs, they are divided with 0..99 on the first node and 100..199 on
+ * the second node. Compute the value of the physical RMID to pass to
+ * resctrl_arch_rmid_read().
+ *
+ * Caller is responsible for making sure execution is running on a CPU in
+ * the domain to be read.
+ */
+static int logical_rmid_to_physical_rmid(int lrmid)
+{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ int cpu = smp_processor_id();
+
+ if (snc_nodes_per_l3_cache == 1)
+ return lrmid;
+
+ return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+}
+
+static int __rmid_read(u32 lrmid,
+ enum resctrl_event_id eventid, u64 *val)
{
u64 msr_val;
+ int prmid;
+ prmid = logical_rmid_to_physical_rmid(lrmid);
/*
* As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
* with a valid event code for supported resource type and the bits
@@ -197,7 +226,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
* IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
* are error bits.
*/
- wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+ wrmsr(MSR_IA32_QM_EVTSEL, eventid, prmid);
rdmsrl(MSR_IA32_QM_CTR, msr_val);
if (msr_val & RMID_VAL_ERROR)
@@ -1022,8 +1051,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
int ret;
resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
- hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
- r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+ hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+ r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
--
2.45.0
When SNC is enabled, monitoring data is collected at the SNC node
granularity, but must be reported at L3-cache granularity for
backwards compatibility in addition to reporting at the node
level.
Add a "ci" field to the rdt_mon_domain structure to save the
cache information about the enclosing L3 cache for the domain.
This provides:
1) The cache id which is needed to compose the name of the legacy
monitoring directory, and to determine which domains should be
summed to provide L3-scoped data.
2) The shared_cpu_map which is needed to determine which CPUs can
be used to read the RMID counters with the MSR interface.
This is the first step to an eventual goal of monitor reporting files
like this (for a system with two SNC nodes per L3):
$ cd /sys/fs/resctrl/mon_data
$ tree mon_L3_00
mon_L3_00 <- 00 here is L3 cache id
├── llc_occupancy \ These files provide legacy support
├── mbm_local_bytes > for non-SNC aware monitor apps
├── mbm_total_bytes / that expect data at L3 cache level
├── mon_sub_L3_00 <- 00 here is SNC node id
│  ├── llc_occupancy \ These files are finer grained
│  ├── mbm_local_bytes > data from each SNC node
│  └── mbm_total_bytes /
└── mon_sub_L3_01
├── llc_occupancy \
├── mbm_local_bytes > As above, but for node 1.
└── mbm_total_bytes /
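
A minimal sketch of how the directory names in the tree above could be
composed. The helper names are hypothetical and the format strings are
inferred from the example listing, not taken from the implementation in
later patches:

```c
#include <stdio.h>

/*
 * Hypothetical helpers. The "%02d" formats are an assumption based on
 * the two-digit ids shown in the example tree.
 */

/* Name of the legacy per-L3 directory, e.g. "mon_L3_00". */
static int l3_dir_name(char *buf, size_t len, int cache_id)
{
	return snprintf(buf, len, "mon_L3_%02d", cache_id);
}

/* Name of the per-SNC-node subdirectory, e.g. "mon_sub_L3_01". */
static int snc_dir_name(char *buf, size_t len, int node_id)
{
	return snprintf(buf, len, "mon_sub_L3_%02d", node_id);
}
```

The cache id for the outer directory is what the new "ci" field makes
available from each monitor domain.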
Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 2 ++
arch/x86/kernel/cpu/resctrl/internal.h | 21 +++++++++++++++++++++
arch/x86/kernel/cpu/resctrl/core.c | 7 ++++++-
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 1 -
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 1 -
5 files changed, 29 insertions(+), 3 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 64b6ad1b22a1..d733e1f6485d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -96,6 +96,7 @@ struct rdt_ctrl_domain {
/**
* struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
* @hdr: common header for different domain types
+ * @ci: cache info for this domain
* @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
* @mbm_total: saved state for MBM total bandwidth
* @mbm_local: saved state for MBM local bandwidth
@@ -106,6 +107,7 @@ struct rdt_ctrl_domain {
*/
struct rdt_mon_domain {
struct rdt_domain_hdr hdr;
+ struct cacheinfo *ci;
unsigned long *rmid_busy_llc;
struct mbm_state *mbm_total;
struct mbm_state *mbm_local;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 135190e0711c..eb70d3136ced 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -2,6 +2,7 @@
#ifndef _ASM_X86_RESCTRL_INTERNAL_H
#define _ASM_X86_RESCTRL_INTERNAL_H
+#include <linux/cacheinfo.h>
#include <linux/resctrl.h>
#include <linux/sched.h>
#include <linux/kernfs.h>
@@ -509,6 +510,26 @@ static inline bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)
int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
+/*
+ * Get the cacheinfo structure of the cache associated with @cpu at level @level.
+ * cpuhp lock must be held.
+ */
+static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
+{
+ struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
+ int i;
+
+ for (i = 0; i < ci->num_leaves; i++) {
+ if (ci->info_list[i].level == level) {
+ if (ci->info_list[i].attributes & CACHE_ID)
+ return &ci->info_list[i];
+ break;
+ }
+ }
+
+ return NULL;
+}
+
/*
* To return the common struct rdt_resource, which is contained in struct
* rdt_hw_resource, walk the resctrl member of struct rdt_hw_resource.
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index b86c525d0620..95ef8fe3cb50 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -19,7 +19,6 @@
#include <linux/cpu.h>
#include <linux/slab.h>
#include <linux/err.h>
-#include <linux/cacheinfo.h>
#include <linux/cpuhotplug.h>
#include <asm/cpu_device_id.h>
@@ -608,6 +607,12 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
d = &hw_dom->d_resctrl;
d->hdr.id = id;
d->hdr.type = RESCTRL_MON_DOMAIN;
+ d->ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE);
+ if (!d->ci) {
+ pr_warn_once("Can't find L3 cache for CPU:%d resource %s\n", cpu, r->name);
+ mon_domain_free(hw_dom);
+ return;
+ }
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index abec0d6d9476..20dd9076f89f 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -11,7 +11,6 @@
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-#include <linux/cacheinfo.h>
#include <linux/cpu.h>
#include <linux/cpumask.h>
#include <linux/debugfs.h>
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 1cc4794d5a2e..13f93f2a55b3 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -12,7 +12,6 @@
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-#include <linux/cacheinfo.h>
#include <linux/cpu.h>
#include <linux/debugfs.h>
#include <linux/fs.h>
--
2.45.0
rdtgroup_mondata_show() calls mon_event_read() which calls
mon_event_count() which packages up all the required details into an
rmid_read structure passed across the smp_call*() infrastructure.
Legacy files reporting for a single domain pass that domain in the
rmid_read structure. Files that need to sum multiple domains have
metadata that provides the L3 cache ID for domains that must be
summed.
Add the sumdomains and ci (struct cacheinfo *) fields to the rmid_read structure.
Add kerneldoc comments for the rmid_read structure.
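
The intended use of the two new fields can be sketched in user space as
below. This patch only adds the fields; the *_model types and
read_event() are hypothetical simplifications, and a stored value stands
in for the hardware counter read:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct cacheinfo_model {
	unsigned int id;		/* L3 cache id */
};

struct mon_domain_model {
	struct cacheinfo_model *ci;	/* enclosing L3 cache */
	uint64_t count;			/* pretend counter for one event */
};

struct rmid_read_model {
	struct mon_domain_model *d;	/* used when sumdomains is false */
	bool sumdomains;
	struct cacheinfo_model *ci;	/* used when sumdomains is true */
	uint64_t val;			/* result */
};

/* Read one domain, or sum every domain that shares the L3 cache rr->ci. */
static void read_event(struct rmid_read_model *rr,
		       struct mon_domain_model *doms, size_t ndoms)
{
	size_t i;

	rr->val = 0;
	if (!rr->sumdomains) {
		rr->val = rr->d->count;
		return;
	}

	for (i = 0; i < ndoms; i++) {
		if (doms[i].ci->id == rr->ci->id)
			rr->val += doms[i].count;
	}
}
```

In this model a request with sumdomains set ignores @d and selects
domains purely by the cache id in @ci.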
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index eb70d3136ced..d8156d22cbdc 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -145,12 +145,28 @@ union mon_data_bits {
} u;
};
+/**
+ * struct rmid_read - Data passed across smp_call*() to read event count
+ * @rgrp: Resctrl group (provides RMID value)
+ * @r: Resource
+ * @d: Domain
+ * @evtid: Which monitor event to read
+ * @first: When true this just requests initialization of an MBM counter
+ * @sumdomains: When false just return monitor count from domain @d. When true,
+ * sum all domains in @r sharing L3 @ci.id
+ * @ci: See @sumdomains
+ * @err: Used to return error indication
+ * @val: Used to return value of event counter
+ * @arch_mon_ctx: hardware monitor allocated for this read request (MPAM only)
+ */
struct rmid_read {
struct rdtgroup *rgrp;
struct rdt_resource *r;
struct rdt_mon_domain *d;
enum resctrl_event_id evtid;
bool first;
+ bool sumdomains;
+ struct cacheinfo *ci;
int err;
u64 val;
void *arch_mon_ctx;
--
2.45.0
The same rdt_domain structure is used for both control and monitor
functions. But this results in wasted memory as some of the fields are
only used by control functions, while most are only used for monitor
functions.
Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
just the fields required for control and monitoring respectively.
Similarly, split the rdt_hw_domain structure into rdt_hw_ctrl_domain
and rdt_hw_mon_domain.
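
A user-space sketch of the resulting pattern, in which both domain types
embed a common header and the header type is checked before converting
back to the enclosing structure. The type and function names below are
hypothetical, heavily trimmed stand-ins, and container_of() is
re-implemented locally:

```c
#include <stddef.h>

/* Local user-space equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

enum domain_type { CTRL_DOMAIN, MON_DOMAIN };

/* Common header shared by both domain types. */
struct domain_hdr {
	int id;
	enum domain_type type;
};

/* Only control-related state lives here. */
struct ctrl_domain {
	struct domain_hdr hdr;
	unsigned int *mbps_val;
};

/* Only monitor-related state lives here. */
struct mon_domain {
	struct domain_hdr hdr;
	int mbm_work_cpu;
};

/* Return the control domain for @hdr, or NULL if it is the wrong type. */
static struct ctrl_domain *hdr_to_ctrl(struct domain_hdr *hdr)
{
	if (hdr->type != CTRL_DOMAIN)
		return NULL;
	return container_of(hdr, struct ctrl_domain, hdr);
}

/* Return the monitor domain for @hdr, or NULL if it is the wrong type. */
static struct mon_domain *hdr_to_mon(struct domain_hdr *hdr)
{
	if (hdr->type != MON_DOMAIN)
		return NULL;
	return container_of(hdr, struct mon_domain, hdr);
}
```

Checking the type first keeps a header from one list from being
reinterpreted as the other structure, whose layout now differs.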
Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 48 ++++++++-------
arch/x86/kernel/cpu/resctrl/internal.h | 62 ++++++++++++--------
arch/x86/kernel/cpu/resctrl/core.c | 71 ++++++++++++-----------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 28 ++++-----
arch/x86/kernel/cpu/resctrl/monitor.c | 40 ++++++-------
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 6 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 64 ++++++++++----------
7 files changed, 174 insertions(+), 145 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 96ddf9ff3183..aa2c22a8e37b 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -78,7 +78,23 @@ struct rdt_domain_hdr {
};
/**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
+ * @hdr: common header for different domain types
+ * @plr: pseudo-locked region (if any) associated with domain
+ * @staged_config: parsed configuration to be applied
+ * @mbps_val: When mba_sc is enabled, this holds the array of user
+ * specified control values for mba_sc in MBps, indexed
+ * by closid
+ */
+struct rdt_ctrl_domain {
+ struct rdt_domain_hdr hdr;
+ struct pseudo_lock_region *plr;
+ struct resctrl_staged_config staged_config[CDP_NUM_TYPES];
+ u32 *mbps_val;
+};
+
+/**
+ * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
* @hdr: common header for different domain types
* @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
* @mbm_total: saved state for MBM total bandwidth
@@ -87,13 +103,8 @@ struct rdt_domain_hdr {
* @cqm_limbo: worker to periodically read CQM h/w counters
* @mbm_work_cpu: worker CPU for MBM h/w counters
* @cqm_work_cpu: worker CPU for CQM h/w counters
- * @plr: pseudo-locked region (if any) associated with domain
- * @staged_config: parsed configuration to be applied
- * @mbps_val: When mba_sc is enabled, this holds the array of user
- * specified control values for mba_sc in MBps, indexed
- * by closid
*/
-struct rdt_domain {
+struct rdt_mon_domain {
struct rdt_domain_hdr hdr;
unsigned long *rmid_busy_llc;
struct mbm_state *mbm_total;
@@ -102,9 +113,6 @@ struct rdt_domain {
struct delayed_work cqm_limbo;
int mbm_work_cpu;
int cqm_work_cpu;
- struct pseudo_lock_region *plr;
- struct resctrl_staged_config staged_config[CDP_NUM_TYPES];
- u32 *mbps_val;
};
/**
@@ -208,7 +216,7 @@ struct rdt_resource {
const char *format_str;
int (*parse_ctrlval)(struct rdt_parse_data *data,
struct resctrl_schema *s,
- struct rdt_domain *d);
+ struct rdt_ctrl_domain *d);
struct list_head evt_list;
unsigned long fflags;
bool cdp_capable;
@@ -242,15 +250,15 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
* Update the ctrl_val and apply this config right now.
* Must be called on one of the domain's CPUs.
*/
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type t, u32 cfg_val);
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type type);
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
void resctrl_online_cpu(unsigned int cpu);
void resctrl_offline_cpu(unsigned int cpu);
@@ -279,7 +287,7 @@ void resctrl_offline_cpu(unsigned int cpu);
* Return:
* 0 on success, or -EIO, -EINVAL etc on error.
*/
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 closid, u32 rmid, enum resctrl_event_id eventid,
u64 *val, void *arch_mon_ctx);
@@ -312,7 +320,7 @@ static inline void resctrl_arch_rmid_read_context_check(void)
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 closid, u32 rmid,
enum resctrl_event_id eventid);
@@ -325,7 +333,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
extern unsigned int resctrl_rmid_realloc_threshold;
extern unsigned int resctrl_rmid_realloc_limit;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 377679b79919..135190e0711c 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -147,7 +147,7 @@ union mon_data_bits {
struct rmid_read {
struct rdtgroup *rgrp;
struct rdt_resource *r;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;
enum resctrl_event_id evtid;
bool first;
int err;
@@ -232,7 +232,7 @@ struct mongroup {
*/
struct pseudo_lock_region {
struct resctrl_schema *s;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
u32 cbm;
wait_queue_head_t lock_thread_wq;
int thread_done;
@@ -355,25 +355,41 @@ struct arch_mbm_state {
};
/**
- * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
- * a resource
+ * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
+ * a resource for a control function
* @d_resctrl: Properties exposed to the resctrl file system
* @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID)
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_ctrl_domain {
+ struct rdt_ctrl_domain d_resctrl;
+ u32 *ctrl_val;
+};
+
+/**
+ * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
+ * a resource for a monitor function
+ * @d_resctrl: Properties exposed to the resctrl file system
* @arch_mbm_total: arch private state for MBM total bandwidth
* @arch_mbm_local: arch private state for MBM local bandwidth
*
* Members of this structure are accessed via helpers that provide abstraction.
*/
-struct rdt_hw_domain {
- struct rdt_domain d_resctrl;
- u32 *ctrl_val;
+struct rdt_hw_mon_domain {
+ struct rdt_mon_domain d_resctrl;
struct arch_mbm_state *arch_mbm_total;
struct arch_mbm_state *arch_mbm_local;
};
-static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
+static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
+{
+ return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
+}
+
+static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
{
- return container_of(r, struct rdt_hw_domain, d_resctrl);
+ return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
}
/**
@@ -385,7 +401,7 @@ static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
*/
struct msr_param {
struct rdt_resource *res;
- struct rdt_domain *dom;
+ struct rdt_ctrl_domain *dom;
u32 low;
u32 high;
};
@@ -458,9 +474,9 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
}
int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
- struct rdt_domain *d);
+ struct rdt_ctrl_domain *d);
int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
- struct rdt_domain *d);
+ struct rdt_ctrl_domain *d);
extern struct mutex rdtgroup_mutex;
@@ -564,22 +580,22 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
int rdtgroup_schemata_show(struct kernfs_open_file *of,
struct seq_file *s, void *v);
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
unsigned long cbm, int closid, bool exclusive);
-unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d,
+unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d,
unsigned long cbm);
enum rdtgrp_mode rdtgroup_mode_by_closid(int closid);
int rdtgroup_tasks_assigned(struct rdtgroup *r);
int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp);
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm);
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d);
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm);
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d);
int rdt_pseudo_lock_init(void);
void rdt_pseudo_lock_release(void);
int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
-struct rdt_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_mon_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r);
int closids_supported(void);
void closid_free(int closid);
int alloc_rmid(u32 closid);
@@ -590,19 +606,19 @@ bool __init rdt_cpu_has(int flag);
void mon_event_count(void *info);
int rdtgroup_mondata_show(struct seq_file *m, void *arg);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
int evtid, int first);
-void mbm_setup_overflow_handler(struct rdt_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
unsigned long delay_ms,
int exclude_cpu);
void mbm_handle_overflow(struct work_struct *work);
void __init intel_rdt_mbm_apply_quirk(void);
bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
int exclude_cpu);
void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_domain *d);
-void __check_limbo(struct rdt_domain *d, bool force_free);
+bool has_busy_rmid(struct rdt_mon_domain *d);
+void __check_limbo(struct rdt_mon_domain *d, bool force_free);
void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
void __init thread_throttle_mode_init(void);
void __init mbm_config_rftype_init(const char *config);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index edd9b2bfb53d..b4f2be776408 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -309,8 +309,8 @@ static void rdt_get_cdp_l2_config(void)
static void mba_wrmsr_amd(struct msr_param *m)
{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(m->dom);
unsigned int i;
for (i = m->low; i < m->high; i++)
@@ -333,8 +333,8 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
static void mba_wrmsr_intel(struct msr_param *m)
{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(m->dom);
unsigned int i;
/* Write the delay values for mba. */
@@ -344,17 +344,17 @@ static void mba_wrmsr_intel(struct msr_param *m)
static void cat_wrmsr(struct msr_param *m)
{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(m->dom);
unsigned int i;
for (i = m->low; i < m->high; i++)
wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
}
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
{
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
lockdep_assert_cpus_held();
@@ -367,9 +367,9 @@ struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
return NULL;
}
-struct rdt_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_mon_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r)
{
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;
lockdep_assert_cpus_held();
@@ -440,18 +440,23 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
*dc = r->default_ctrl;
}
-static void domain_free(struct rdt_hw_domain *hw_dom)
+static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
+{
+ kfree(hw_dom->ctrl_val);
+ kfree(hw_dom);
+}
+
+static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
{
kfree(hw_dom->arch_mbm_total);
kfree(hw_dom->arch_mbm_local);
- kfree(hw_dom->ctrl_val);
kfree(hw_dom);
}
-static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
struct msr_param m;
u32 *dc;
@@ -476,7 +481,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
* @num_rmid: The size of the MBM counter array
* @hw_dom: The domain that owns the allocated arrays
*/
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
{
size_t tsize;
@@ -515,10 +520,10 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+ struct rdt_hw_ctrl_domain *hw_dom;
struct list_head *add_pos = NULL;
- struct rdt_hw_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
int err;
lockdep_assert_held(&domain_list_lock);
@@ -533,7 +538,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
if (hdr) {
if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
return;
- d = container_of(hdr, struct rdt_domain, hdr);
+ d = container_of(hdr, struct rdt_ctrl_domain, hdr);
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
@@ -553,7 +558,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
rdt_domain_reconfigure_cdp(r);
if (domain_setup_ctrlval(r, d)) {
- domain_free(hw_dom);
+ ctrl_domain_free(hw_dom);
return;
}
@@ -563,7 +568,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
if (err) {
list_del_rcu(&d->hdr.list);
synchronize_rcu();
- domain_free(hw_dom);
+ ctrl_domain_free(hw_dom);
}
}
@@ -571,9 +576,9 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->mon_scope);
struct list_head *add_pos = NULL;
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_mon_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;
int err;
lockdep_assert_held(&domain_list_lock);
@@ -588,7 +593,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
if (hdr) {
if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
return;
- d = container_of(hdr, struct rdt_domain, hdr);
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
return;
@@ -604,7 +609,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
- domain_free(hw_dom);
+ mon_domain_free(hw_dom);
return;
}
@@ -614,7 +619,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
if (err) {
list_del_rcu(&d->hdr.list);
synchronize_rcu();
- domain_free(hw_dom);
+ mon_domain_free(hw_dom);
}
}
@@ -629,9 +634,9 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_ctrl_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
lockdep_assert_held(&domain_list_lock);
@@ -651,8 +656,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
return;
- d = container_of(hdr, struct rdt_domain, hdr);
- hw_dom = resctrl_to_arch_dom(d);
+ d = container_of(hdr, struct rdt_ctrl_domain, hdr);
+ hw_dom = resctrl_to_arch_ctrl_dom(d);
cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
if (cpumask_empty(&d->hdr.cpu_mask)) {
@@ -661,12 +666,12 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
synchronize_rcu();
/*
- * rdt_domain "d" is going to be freed below, so clear
+ * rdt_ctrl_domain "d" is going to be freed below, so clear
* its pointer from pseudo_lock_region struct.
*/
if (d->plr)
d->plr->d = NULL;
- domain_free(hw_dom);
+ ctrl_domain_free(hw_dom);
return;
}
@@ -675,9 +680,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->mon_scope);
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_mon_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;
lockdep_assert_held(&domain_list_lock);
@@ -697,15 +702,15 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
return;
- d = container_of(hdr, struct rdt_domain, hdr);
- hw_dom = resctrl_to_arch_dom(d);
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ hw_dom = resctrl_to_arch_mon_dom(d);
cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
if (cpumask_empty(&d->hdr.cpu_mask)) {
resctrl_offline_mon_domain(r, d);
list_del_rcu(&d->hdr.list);
synchronize_rcu();
- domain_free(hw_dom);
+ mon_domain_free(hw_dom);
return;
}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 8cc36723f077..3b9383612c35 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -60,7 +60,7 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
}
int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
- struct rdt_domain *d)
+ struct rdt_ctrl_domain *d)
{
struct resctrl_staged_config *cfg;
u32 closid = data->rdtgrp->closid;
@@ -139,7 +139,7 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
* resource type.
*/
int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
- struct rdt_domain *d)
+ struct rdt_ctrl_domain *d)
{
struct rdtgroup *rdtgrp = data->rdtgrp;
struct resctrl_staged_config *cfg;
@@ -208,8 +208,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
struct resctrl_staged_config *cfg;
struct rdt_resource *r = s->res;
struct rdt_parse_data data;
+ struct rdt_ctrl_domain *d;
char *dom = NULL, *id;
- struct rdt_domain *d;
unsigned long dom_id;
/* Walking r->domains, ensure it can't race with cpuhp */
@@ -272,11 +272,11 @@ static u32 get_config_index(u32 closid, enum resctrl_conf_type type)
}
}
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type t, u32 cfg_val)
{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
u32 idx = get_config_index(closid, t);
struct msr_param msr_param;
@@ -297,17 +297,17 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
{
struct resctrl_staged_config *cfg;
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_ctrl_domain *hw_dom;
struct msr_param msr_param;
+ struct rdt_ctrl_domain *d;
enum resctrl_conf_type t;
- struct rdt_domain *d;
u32 idx;
/* Walking r->domains, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();
list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
- hw_dom = resctrl_to_arch_dom(d);
+ hw_dom = resctrl_to_arch_ctrl_dom(d);
msr_param.res = NULL;
for (t = 0; t < CDP_NUM_TYPES; t++) {
cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -430,10 +430,10 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
return ret ?: nbytes;
}
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type type)
{
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
u32 idx = get_config_index(closid, type);
return hw_dom->ctrl_val[idx];
@@ -442,7 +442,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid)
{
struct rdt_resource *r = schema->res;
- struct rdt_domain *dom;
+ struct rdt_ctrl_domain *dom;
bool sep = false;
u32 ctrl_val;
@@ -514,7 +514,7 @@ static int smp_mon_event_count(void *arg)
}
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
int evtid, int first)
{
int cpu;
@@ -557,11 +557,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
struct rdt_domain_hdr *hdr;
+ struct rdt_mon_domain *d;
u32 resid, evtid, domid;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
union mon_data_bits md;
- struct rdt_domain *d;
struct rmid_read rr;
int ret = 0;
@@ -582,7 +582,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
ret = -ENOENT;
goto out;
}
- d = container_of(hdr, struct rdt_domain, hdr);
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
mon_event_read(&rr, r, d, rdtgrp, evtid, false);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 82a44de8136f..89d7e6fcbaa1 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -209,7 +209,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
return 0;
}
-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
u32 rmid,
enum resctrl_event_id eventid)
{
@@ -228,11 +228,11 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
return NULL;
}
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 unused, u32 rmid,
enum resctrl_event_id eventid)
{
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
struct arch_mbm_state *am;
am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -248,9 +248,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
* Assumes that hardware counters are also reset and thus that there is
* no need to record initial non-zero counts.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
{
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
if (is_mbm_total_enabled())
memset(hw_dom->arch_mbm_total, 0,
@@ -269,12 +269,12 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
return chunks >> shift;
}
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 unused, u32 rmid, enum resctrl_event_id eventid,
u64 *val, void *ignored)
{
+ struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
struct arch_mbm_state *am;
u64 msr_val, chunks;
int ret;
@@ -320,7 +320,7 @@ static void limbo_release_entry(struct rmid_entry *entry)
* decrement the count. If the busy count gets to zero on an RMID, we
* free the RMID
*/
-void __check_limbo(struct rdt_domain *d, bool force_free)
+void __check_limbo(struct rdt_mon_domain *d, bool force_free)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
@@ -378,7 +378,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx);
}
-bool has_busy_rmid(struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_mon_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
@@ -479,7 +479,7 @@ int alloc_rmid(u32 closid)
static void add_rmid_to_limbo(struct rmid_entry *entry)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;
u32 idx;
lockdep_assert_held(&rdtgroup_mutex);
@@ -531,7 +531,7 @@ void free_rmid(u32 closid, u32 rmid)
list_add_tail(&entry->list, &rmid_free_lru);
}
-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 closid,
+static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
u32 rmid, enum resctrl_event_id evtid)
{
u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
@@ -667,12 +667,12 @@ void mon_event_count(void *info)
* throttle MSRs already have low percentage values. To avoid
* unnecessarily restricting such rdtgroups, we also increase the bandwidth.
*/
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
{
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
+ struct rdt_ctrl_domain *dom_mba;
struct rdt_resource *r_mba;
- struct rdt_domain *dom_mba;
u32 cur_bw, user_bw, idx;
struct list_head *head;
struct rdtgroup *entry;
@@ -733,7 +733,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
}
-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
+static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 closid, u32 rmid)
{
struct rmid_read rr;
@@ -791,12 +791,12 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
void cqm_handle_limbo(struct work_struct *work)
{
unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
- d = container_of(work, struct rdt_domain, cqm_limbo.work);
+ d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);
__check_limbo(d, false);
@@ -819,7 +819,7 @@ void cqm_handle_limbo(struct work_struct *work)
* @exclude_cpu: Which CPU the handler should not run on,
* RESCTRL_PICK_ANY_CPU to pick any CPU.
*/
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
int exclude_cpu)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
@@ -836,9 +836,9 @@ void mbm_handle_overflow(struct work_struct *work)
{
unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
struct rdtgroup *prgrp, *crgrp;
+ struct rdt_mon_domain *d;
struct list_head *head;
struct rdt_resource *r;
- struct rdt_domain *d;
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
@@ -851,7 +851,7 @@ void mbm_handle_overflow(struct work_struct *work)
goto out_unlock;
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- d = container_of(work, struct rdt_domain, mbm_over.work);
+ d = container_of(work, struct rdt_mon_domain, mbm_over.work);
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
mbm_update(r, d, prgrp->closid, prgrp->mon.rmid);
@@ -885,7 +885,7 @@ void mbm_handle_overflow(struct work_struct *work)
* @exclude_cpu: Which CPU the handler should not run on,
* RESCTRL_PICK_ANY_CPU to pick any CPU.
*/
-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms,
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
int exclude_cpu)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 58985ffcf74e..abec0d6d9476 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -814,7 +814,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
* Return: true if @cbm overlaps with pseudo-locked region on @d, false
* otherwise.
*/
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm)
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm)
{
unsigned int cbm_len;
unsigned long cbm_b;
@@ -841,11 +841,11 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm
* if it is not possible to test due to memory allocation issue,
* false otherwise.
*/
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d)
{
+ struct rdt_ctrl_domain *d_i;
cpumask_var_t cpu_with_psl;
struct rdt_resource *r;
- struct rdt_domain *d_i;
bool ret = false;
/* Walking r->domains, ensure it can't race with cpuhp */
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 7c1475f393ff..cc31ede1a1e7 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -92,8 +92,8 @@ void rdt_last_cmd_printf(const char *fmt, ...)
void rdt_staged_configs_clear(void)
{
+ struct rdt_ctrl_domain *dom;
struct rdt_resource *r;
- struct rdt_domain *dom;
lockdep_assert_held(&rdtgroup_mutex);
@@ -1012,7 +1012,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
unsigned long sw_shareable = 0, hw_shareable = 0;
unsigned long exclusive = 0, pseudo_locked = 0;
struct rdt_resource *r = s->res;
- struct rdt_domain *dom;
+ struct rdt_ctrl_domain *dom;
int i, hwb, swb, excl, psl;
enum rdtgrp_mode mode;
bool sep = false;
@@ -1243,7 +1243,7 @@ static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of,
*
* Return: false if CBM does not overlap, true if it does.
*/
-static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
+static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_domain *d,
unsigned long cbm, int closid,
enum resctrl_conf_type type, bool exclusive)
{
@@ -1298,7 +1298,7 @@ static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d
*
* Return: true if CBM overlap detected, false if there is no overlap
*/
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
unsigned long cbm, int closid, bool exclusive)
{
enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -1329,10 +1329,10 @@ bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
{
int closid = rdtgrp->closid;
+ struct rdt_ctrl_domain *d;
struct resctrl_schema *s;
struct rdt_resource *r;
bool has_cache = false;
- struct rdt_domain *d;
u32 ctrl;
/* Walking r->domains, ensure it can't race with cpuhp */
@@ -1448,7 +1448,7 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of,
* bitmap functions work correctly.
*/
unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
- struct rdt_domain *d, unsigned long cbm)
+ struct rdt_ctrl_domain *d, unsigned long cbm)
{
struct cpu_cacheinfo *ci;
unsigned int size = 0;
@@ -1480,9 +1480,9 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
{
struct resctrl_schema *schema;
enum resctrl_conf_type type;
+ struct rdt_ctrl_domain *d;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
- struct rdt_domain *d;
unsigned int size;
int ret = 0;
u32 closid;
@@ -1594,7 +1594,7 @@ static void mon_event_config_read(void *info)
mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
}
-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
+static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info)
{
smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
}
@@ -1602,7 +1602,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
{
struct mon_config_info mon_info = {0};
- struct rdt_domain *dom;
+ struct rdt_mon_domain *dom;
bool sep = false;
cpus_read_lock();
@@ -1661,7 +1661,7 @@ static void mon_event_config_write(void *info)
}
static void mbm_config_write_domain(struct rdt_resource *r,
- struct rdt_domain *d, u32 evtid, u32 val)
+ struct rdt_mon_domain *d, u32 evtid, u32 val)
{
struct mon_config_info mon_info = {0};
@@ -1702,7 +1702,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
char *dom_str = NULL, *id_str;
unsigned long dom_id, val;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;
/* Walking r->domains, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();
@@ -2261,9 +2261,9 @@ static inline bool is_mba_linear(void)
static int set_cache_qos_cfg(int level, bool enable)
{
void (*update)(void *arg);
+ struct rdt_ctrl_domain *d;
struct rdt_resource *r_l;
cpumask_var_t cpu_mask;
- struct rdt_domain *d;
int cpu;
/* Walking r->domains, ensure it can't race with cpuhp */
@@ -2313,7 +2313,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
l3_qos_cfg_update(&hw_res->cdp_enabled);
}
-static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
+static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
u32 num_closid = resctrl_arch_get_num_closid(r);
int cpu = cpumask_any(&d->hdr.cpu_mask);
@@ -2331,7 +2331,7 @@ static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
}
static void mba_sc_domain_destroy(struct rdt_resource *r,
- struct rdt_domain *d)
+ struct rdt_ctrl_domain *d)
{
kfree(d->mbps_val);
d->mbps_val = NULL;
@@ -2357,7 +2357,7 @@ static int set_mba_sc(bool mba_sc)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
u32 num_closid = resctrl_arch_get_num_closid(r);
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
int i;
if (!supports_mba_mbps() || mba_sc == is_mba_sc(r))
@@ -2629,7 +2629,7 @@ static int rdt_get_tree(struct fs_context *fc)
{
struct rdt_fs_context *ctx = rdt_fc2context(fc);
unsigned long flags = RFTYPE_CTRL_BASE;
- struct rdt_domain *dom;
+ struct rdt_mon_domain *dom;
struct rdt_resource *r;
int ret;
@@ -2814,9 +2814,9 @@ static int rdt_init_fs_context(struct fs_context *fc)
static int reset_all_ctrls(struct rdt_resource *r)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_ctrl_domain *hw_dom;
struct msr_param msr_param;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
int i;
/* Walking r->domains, ensure it can't race with cpuhp */
@@ -2832,7 +2832,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
* from each domain to update the MSRs below.
*/
list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
- hw_dom = resctrl_to_arch_dom(d);
+ hw_dom = resctrl_to_arch_ctrl_dom(d);
for (i = 0; i < hw_res->num_closid; i++)
hw_dom->ctrl_val[i] = r->default_ctrl;
@@ -3025,7 +3025,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
}
static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
- struct rdt_domain *d,
+ struct rdt_mon_domain *d,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
union mon_data_bits priv;
@@ -3074,7 +3074,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
* and "monitor" groups with given domain id.
*/
static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- struct rdt_domain *d)
+ struct rdt_mon_domain *d)
{
struct kernfs_node *parent_kn;
struct rdtgroup *prgrp, *crgrp;
@@ -3096,7 +3096,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_resource *r,
struct rdtgroup *prgrp)
{
- struct rdt_domain *dom;
+ struct rdt_mon_domain *dom;
int ret;
/* Walking r->domains, ensure it can't race with cpuhp */
@@ -3201,7 +3201,7 @@ static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r)
* Set the RDT domain up to start off with all usable allocations. That is,
* all shareable and unused bits. All-zero CBM is invalid.
*/
-static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
+static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schema *s,
u32 closid)
{
enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -3281,7 +3281,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
*/
static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
{
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
int ret;
list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
@@ -3297,7 +3297,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
{
struct resctrl_staged_config *cfg;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
if (is_mba_sc(r)) {
@@ -3923,14 +3923,14 @@ static void __init rdtgroup_setup_default(void)
mutex_unlock(&rdtgroup_mutex);
}
-static void domain_destroy_mon_state(struct rdt_domain *d)
+static void domain_destroy_mon_state(struct rdt_mon_domain *d)
{
bitmap_free(d->rmid_busy_llc);
kfree(d->mbm_total);
kfree(d->mbm_local);
}
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
mutex_lock(&rdtgroup_mutex);
@@ -3940,7 +3940,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
mutex_unlock(&rdtgroup_mutex);
}
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
{
mutex_lock(&rdtgroup_mutex);
@@ -3971,7 +3971,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
mutex_unlock(&rdtgroup_mutex);
}
-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
size_t tsize;
@@ -4002,7 +4002,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
int err = 0;
@@ -4018,7 +4018,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
return err;
}
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
{
int err;
@@ -4073,8 +4073,8 @@ static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
void resctrl_offline_cpu(unsigned int cpu)
{
struct rdt_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_mon_domain *d;
struct rdtgroup *rdtgrp;
- struct rdt_domain *d;
mutex_lock(&rdtgroup_mutex);
list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
--
2.45.0
In Sub-NUMA Cluster (SNC) mode, Linux must create the monitor
files in the original "mon_L3_XX" directories and also in each
of the "mon_sub_L3_YY" directories.
Refactor mkdir_mondata_subdir() to move the creation of monitoring files
into a helper function to avoid the need to duplicate code later.
No functional change.
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 45 ++++++++++++++++----------
1 file changed, 28 insertions(+), 17 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 13f93f2a55b3..dd386ad9458a 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3025,14 +3025,37 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
}
}
+static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
+ struct rdt_resource *r, struct rdtgroup *prgrp)
+{
+ union mon_data_bits priv;
+ struct mon_evt *mevt;
+ struct rmid_read rr;
+ int ret;
+
+ if (WARN_ON(list_empty(&r->evt_list)))
+ return -EPERM;
+
+ priv.u.rid = r->rid;
+ priv.u.domid = d->hdr.id;
+ list_for_each_entry(mevt, &r->evt_list, list) {
+ priv.u.evtid = mevt->evtid;
+ ret = mon_addfile(kn, mevt->name, priv.priv);
+ if (ret)
+ return ret;
+
+ if (is_mbm_event(mevt->evtid))
+ mon_event_read(&rr, r, d, prgrp, mevt->evtid, true);
+ }
+
+ return 0;
+}
+
static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
struct rdt_mon_domain *d,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
- union mon_data_bits priv;
struct kernfs_node *kn;
- struct mon_evt *mevt;
- struct rmid_read rr;
char name[32];
int ret;
@@ -3046,22 +3069,10 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
if (ret)
goto out_destroy;
- if (WARN_ON(list_empty(&r->evt_list))) {
- ret = -EPERM;
+ ret = mon_add_all_files(kn, d, r, prgrp);
+ if (ret)
goto out_destroy;
- }
- priv.u.rid = r->rid;
- priv.u.domid = d->hdr.id;
- list_for_each_entry(mevt, &r->evt_list, list) {
- priv.u.evtid = mevt->evtid;
- ret = mon_addfile(kn, mevt->name, priv.priv);
- if (ret)
- goto out_destroy;
-
- if (is_mbm_event(mevt->evtid))
- mon_event_read(&rr, r, d, prgrp, mevt->evtid, true);
- }
kernfs_activate(kn);
return 0;
--
2.45.0
In SNC mode there are multiple subdirectories in each L3 level monitor
directory (one for each SNC node). If all the CPUs in an SNC node are
taken offline, just remove the SNC directory for that node. In
non-SNC mode, or when the last SNC node directory is removed, also
remove the L3 monitor directory.
Signed-off-by: Tony Luck <[email protected]>
---
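The removal rule in the commit message above can be sketched in
user-space C. This is an illustrative model only, not kernel code:
demo_remove_whole_l3_dir() and demo_dir_name() are hypothetical names,
snc_mode stands in for "r->mon_scope != RESCTRL_L3_CACHE", and
l3_cpus_online stands in for cpumask_weight(&d->ci->shared_cpu_map).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical model of the decision in rmdir_mondata_subdir_allrdtgrp().
 * @snc_mode:       models r->mon_scope != RESCTRL_L3_CACHE
 * @l3_cpus_online: models cpumask_weight(&d->ci->shared_cpu_map)
 *
 * Returns true when the whole "mon_L3_XX" directory should be removed,
 * false when only the "mon_sub_L3_YY" directory of this node should go.
 */
static bool demo_remove_whole_l3_dir(bool snc_mode, unsigned int l3_cpus_online)
{
	if (!snc_mode)
		return true;

	/* More than one CPU still shares the L3: another SNC node is alive. */
	return l3_cpus_online <= 1;
}

/* Format a directory name in the "<prefix>_L3_%02d" style used by the patch. */
static int demo_dir_name(char *buf, size_t len, const char *prefix, int id)
{
	assert(buf && len > 0);
	return snprintf(buf, len, "%s_L3_%02d", prefix, id);
}
```

With SNC enabled and eight CPUs still sharing the L3, the model keeps
the L3 directory and removes only the per-node subdirectory. Once the
count drops to one (the CPU going offline), the L3 directory goes too.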
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 39 ++++++++++++++++++++++----
1 file changed, 33 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 6a5c35a176d5..cdcae13d6c6d 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3011,17 +3011,44 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
* and monitor groups with given domain id.
*/
static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- unsigned int dom_id)
+ struct rdt_mon_domain *d)
{
struct rdtgroup *prgrp, *crgrp;
+ bool remove_all = true;
+ struct kernfs_node *kn;
+ char subname[32];
char name[32];
+ sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
+ if (r->mon_scope != RESCTRL_L3_CACHE) {
+ /*
+ * SNC mode: If the last domain is being removed, the count of
+ * CPUs sharing the L3 cache should be 1 (current CPU).
+ */
+ if (cpumask_weight(&d->ci->shared_cpu_map) > 1) {
+ remove_all = false;
+ sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
+ }
+ }
+
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
- sprintf(name, "mon_%s_%02d", r->name, dom_id);
- kernfs_remove_by_name(prgrp->mon.mon_data_kn, name);
+ if (remove_all) {
+ kernfs_remove_by_name(prgrp->mon.mon_data_kn, name);
+ } else {
+ kn = kernfs_find_and_get(prgrp->mon.mon_data_kn, name);
+ if (kn)
+ kernfs_remove_by_name(kn, subname);
+ }
- list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list)
- kernfs_remove_by_name(crgrp->mon.mon_data_kn, name);
+ list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) {
+ if (remove_all) {
+ kernfs_remove_by_name(crgrp->mon.mon_data_kn, name);
+ } else {
+ kn = kernfs_find_and_get(crgrp->mon.mon_data_kn, name);
+ if (kn)
+ kernfs_remove_by_name(kn, subname);
+ }
+ }
}
}
@@ -3984,7 +4011,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
* per domain monitor data directories.
*/
if (resctrl_mounted && resctrl_arch_mon_capable())
- rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
+ rmdir_mondata_subdir_allrdtgrp(r, d);
if (is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
--
2.45.0
When SNC mode is enabled, create subdirectories and files to monitor
at the SNC node granularity. Monitor files at the L3 granularity are
tagged with a "sum" attribute to indicate that all SNC nodes sharing
an L3 cache should be read and summed to provide the result to the
user.
Note that the "domid" field for files that must sum across SNC domains
holds the L3 cache instance id, while non-summing files use the domain id.
Also, the "sum" files do not need to call mon_event_read() to initialize
the MBM counters. That is handled when the individual SNC nodes that
share the L3 are initialized.
Signed-off-by: Tony Luck <[email protected]>
---
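The two rules in the commit message above (which id goes into "domid",
and which files trigger the initial MBM read) can be sketched as below.
This is a user-space model under stated assumptions: the demo_* names
and the demo_event enum are hypothetical and only mirror the shape of
the logic in mon_add_all_files().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event list; only the MBM/non-MBM distinction matters here. */
enum demo_event { DEMO_OCCUPANCY, DEMO_MBM_TOTAL, DEMO_MBM_LOCAL };

/*
 * Models "priv.u.domid = do_sum ? d->ci->id : d->hdr.id".
 * Summing files are keyed by L3 cache instance, per-node files by domain.
 */
static int demo_file_domid(bool do_sum, int l3_cache_id, int snc_domain_id)
{
	assert(l3_cache_id >= 0 && snc_domain_id >= 0);
	return do_sum ? l3_cache_id : snc_domain_id;
}

/*
 * Models "if (!do_sum && is_mbm_event(evtid))": only per-node MBM files
 * take the initial counter snapshot, "sum" files never do.
 */
static bool demo_needs_initial_read(bool do_sum, enum demo_event evt)
{
	bool is_mbm = (evt == DEMO_MBM_TOTAL || evt == DEMO_MBM_LOCAL);

	return !do_sum && is_mbm;
}
```

For an L3 cache with id 1 that contains SNC domain 3, the files in
"mon_L3_01" would carry domid 1 with the sum bit set, and the files in
"mon_L3_01/mon_sub_L3_03" would carry domid 3 with the sum bit clear.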
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 53 ++++++++++++++++++--------
1 file changed, 38 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index dd386ad9458a..6a5c35a176d5 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3026,7 +3026,8 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
}
static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
- struct rdt_resource *r, struct rdtgroup *prgrp)
+ struct rdt_resource *r, struct rdtgroup *prgrp,
+ bool do_sum)
{
union mon_data_bits priv;
struct mon_evt *mevt;
@@ -3037,15 +3038,18 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
return -EPERM;
priv.u.rid = r->rid;
- priv.u.domid = d->hdr.id;
+ priv.u.domid = do_sum ? d->ci->id : d->hdr.id;
+ priv.u.sum = do_sum;
list_for_each_entry(mevt, &r->evt_list, list) {
priv.u.evtid = mevt->evtid;
ret = mon_addfile(kn, mevt->name, priv.priv);
if (ret)
return ret;
- if (is_mbm_event(mevt->evtid))
+ if (!do_sum && is_mbm_event(mevt->evtid)) {
+ rr.sumdomains = 0;
mon_event_read(&rr, r, d, prgrp, mevt->evtid, true);
+ }
}
return 0;
@@ -3055,23 +3059,42 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
struct rdt_mon_domain *d,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
- struct kernfs_node *kn;
+ struct kernfs_node *kn, *ckn;
char name[32];
+ bool snc_mode;
int ret;
- sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
- /* create the directory */
- kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
- if (IS_ERR(kn))
- return PTR_ERR(kn);
+ snc_mode = r->mon_scope != RESCTRL_L3_CACHE;
+ sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
+ kn = kernfs_find_and_get_ns(parent_kn, name, NULL);
+ if (!kn) {
+ /* create the directory */
+ kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
+ if (IS_ERR(kn))
+ return PTR_ERR(kn);
- ret = rdtgroup_kn_set_ugid(kn);
- if (ret)
- goto out_destroy;
+ ret = rdtgroup_kn_set_ugid(kn);
+ if (ret)
+ goto out_destroy;
+ ret = mon_add_all_files(kn, d, r, prgrp, snc_mode);
+ if (ret)
+ goto out_destroy;
+ }
- ret = mon_add_all_files(kn, d, r, prgrp);
- if (ret)
- goto out_destroy;
+ if (snc_mode) {
+ sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
+ ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp);
+ ret = PTR_ERR_OR_ZERO(ckn);
+ if (ret)
+ goto out_destroy;
+ ret = rdtgroup_kn_set_ugid(ckn);
+ if (ret)
+ goto out_destroy;
+
+ ret = mon_add_all_files(ckn, d, r, prgrp, false);
+ if (ret)
+ goto out_destroy;
+ }
kernfs_activate(kn);
return 0;
--
2.45.0
mon_event_read() fills out most fields of the struct rmid_read that is
passed via an smp_call*() function to a CPU that is part of the correct
domain to read the monitor counters.
The one exception is the sumdomains field that is set by the caller
(either rdtgroup_mondata_show() or mon_add_all_files()).
When rmid_read.sumdomains is false:
	The domain field "d" specifies the only domain to read.
	The CPU to execute on is chosen from d->hdr.cpu_mask.
When rmid_read.sumdomains is true:
	The domain field is NULL.
	The "ci" (struct cacheinfo) field indicates that all domains
	that are part of that cache instance should be summed.
	The CPU to execute on is chosen from ci->shared_cpu_map.
Signed-off-by: Tony Luck <[email protected]>
---
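The CPU selection described in the commit message above can be modelled
in user-space C as follows. This is a sketch, not the kernel
implementation: CPU masks are represented as 64-bit bitmaps, the demo_*
names are hypothetical, and picking the lowest-numbered CPU is a
simplification of what cpumask_any_housekeeping() does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Models the mask choice in mon_event_read():
 *   sumdomains ? &rr->ci->shared_cpu_map : &d->hdr.cpu_mask
 * An SNC domain's CPUs are expected to be a subset of the CPUs that
 * share the L3 cache instance.
 */
static uint64_t demo_pick_mask(bool sumdomains, uint64_t l3_shared_cpus,
			       uint64_t domain_cpus)
{
	assert((domain_cpus & ~l3_shared_cpus) == 0);
	return sumdomains ? l3_shared_cpus : domain_cpus;
}

/* Return the lowest-numbered CPU set in @mask, or -1 if the mask is empty. */
static int demo_first_cpu(uint64_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++) {
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	}
	return -1;
}
```

When summing, any CPU that shares the L3 is acceptable because every
SNC domain behind that cache has to be visited anyway. When reading a
single domain, the CPU has to come from that domain's own mask.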
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 28 ++++++++++++++++++-----
1 file changed, 22 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 3b9383612c35..4e394400e575 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -517,6 +517,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
int evtid, int first)
{
+ cpumask_t *cpumask;
int cpu;
/* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */
@@ -537,7 +538,8 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
return;
}
- cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask, RESCTRL_PICK_ANY_CPU);
+ cpumask = rr->sumdomains ? &rr->ci->shared_cpu_map : &d->hdr.cpu_mask;
+ cpu = cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU);
/*
* cpumask_any_housekeeping() prefers housekeeping CPUs, but
@@ -546,7 +548,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
* counters on some platforms if its called in IRQ context.
*/
if (tick_nohz_full_cpu(cpu))
- smp_call_function_any(&d->hdr.cpu_mask, mon_event_count, rr, 1);
+ smp_call_function_any(cpumask, mon_event_count, rr, 1);
else
smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
@@ -575,15 +577,29 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
resid = md.u.rid;
domid = md.u.domid;
evtid = md.u.evtid;
-
+ rr.sumdomains = md.u.sum;
r = &rdt_resources_all[resid].r_resctrl;
- hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
- if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
+
+ if (rr.sumdomains) {
+ list_for_each_entry(d, &r->mon_domains, hdr.list) {
+ if (d->ci->id == domid) {
+ rr.ci = d->ci;
+ d = NULL;
+ goto got_cacheinfo;
+ }
+ }
ret = -ENOENT;
goto out;
+ } else {
+ hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
+ if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
+ ret = -ENOENT;
+ goto out;
+ }
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
}
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+got_cacheinfo:
mon_event_read(&rr, r, d, rdtgrp, evtid, false);
if (rr.err == -EIO)
--
2.45.0
Legacy resctrl monitor files must provide the sum of event values across
all Sub-NUMA Cluster (SNC) domains that share an L3 cache instance.
Rename the existing resctrl_arch_rmid_read() function to
resctrl_arch_rmid_read_one() (with some small refactoring to drop
arguments that are not needed).
Create a new resctrl_arch_rmid_read() that iterates across
domains when necessary. Pass a CPU number from the right domain to
resctrl_arch_rmid_read_one().
Signed-off-by: Tony Luck <[email protected]>
---
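Why a CPU number from the right domain has to be passed down can be
seen from a small user-space model. This is a sketch under stated
assumptions: demo_physical_rmid() copies the arithmetic of
logical_rmid_to_physical_rmid() from the diff below, with the node
number passed in directly instead of being derived via cpu_to_node().
demo_sum_l3() is a hypothetical stand-in for iterating across the
domains that share an L3 cache. It is not the new
resctrl_arch_rmid_read() itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Map a logical RMID to the physical RMID used by one SNC node:
 *   lrmid + (node % snc_nodes_per_l3) * num_rmid
 * With SNC disabled (one node per L3) the RMID is unchanged.
 */
static int demo_physical_rmid(int node, int snc_nodes_per_l3, int num_rmid,
			      int lrmid)
{
	assert(snc_nodes_per_l3 >= 1);
	if (snc_nodes_per_l3 == 1)
		return lrmid;

	return lrmid + (node % snc_nodes_per_l3) * num_rmid;
}

/*
 * Sum per-domain counts for every domain whose L3 cache id matches
 * @l3_id. @counts[i] is the value read for domain i and @l3_ids[i] is
 * the id of the L3 cache instance that domain belongs to.
 */
static uint64_t demo_sum_l3(const uint64_t *counts, const int *l3_ids,
			    size_t n, int l3_id)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++) {
		if (l3_ids[i] == l3_id)
			sum += counts[i];
	}
	return sum;
}
```

With two SNC nodes per L3 and 100 RMIDs per node, logical RMID 5 on
node 3 maps to physical RMID 105. The same logical RMID resolves
differently on each node, so each read must name a CPU in the domain
being read.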
arch/x86/kernel/cpu/resctrl/monitor.c | 41 ++++++++++++++++++++-------
1 file changed, 31 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 5f89ed4823ee..c9dd6ec68fcd 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -200,10 +200,9 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
* Caller is responsible to make sure execution running on a CPU in
* the domain to be read.
*/
-static int logical_rmid_to_physical_rmid(int lrmid)
+static int logical_rmid_to_physical_rmid(int cpu, int lrmid)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- int cpu = smp_processor_id();
if (snc_nodes_per_l3_cache == 1)
return lrmid;
@@ -211,13 +210,13 @@ static int logical_rmid_to_physical_rmid(int lrmid)
return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
}
-static int __rmid_read(u32 lrmid,
+static int __rmid_read(int cpu, u32 lrmid,
enum resctrl_event_id eventid, u64 *val)
{
u64 msr_val;
int prmid;
- prmid = logical_rmid_to_physical_rmid(lrmid);
+ prmid = logical_rmid_to_physical_rmid(cpu, lrmid);
/*
* As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
* with a valid event code for supported resource type and the bits
@@ -269,7 +268,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
memset(am, 0, sizeof(*am));
/* Record any initial, non-zero count value. */
- __rmid_read(rmid, eventid, &am->prev_msr);
+ __rmid_read(smp_processor_id(), rmid, eventid, &am->prev_msr);
}
}
@@ -298,9 +297,8 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
return chunks >> shift;
}
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
- u32 unused, u32 rmid, enum resctrl_event_id eventid,
- u64 *val, bool sum, struct cacheinfo *ci, void *ignored)
+static int resctrl_arch_rmid_read_one(struct rdt_resource *r, struct rdt_mon_domain *d,
+ int cpu, u32 rmid, enum resctrl_event_id eventid, u64 *val)
{
struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
@@ -313,7 +311,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
return -EINVAL;
- ret = __rmid_read(rmid, eventid, &msr_val);
+ ret = __rmid_read(cpu, rmid, eventid, &msr_val);
if (ret)
return ret;
@@ -327,7 +325,30 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
chunks = msr_val;
}
- *val = chunks * hw_res->mon_scale;
+ *val += chunks * hw_res->mon_scale;
+
+ return 0;
+}
+
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
+ u32 unused, u32 rmid, enum resctrl_event_id eventid,
+ u64 *val, bool sum, struct cacheinfo *ci, void *ignored)
+{
+ int cpu = smp_processor_id();
+ int ret;
+
+ *val = 0;
+ if (!sum)
+ return resctrl_arch_rmid_read_one(r, d, cpu, rmid, eventid, val);
+
+ list_for_each_entry(d, &r->mon_domains, hdr.list) {
+ if (d->ci->id != ci->id)
+ continue;
+ cpu = cpumask_any(&d->hdr.cpu_mask);
+ ret = resctrl_arch_rmid_read_one(r, d, cpu, rmid, eventid, val);
+ if (ret)
+ return ret;
+ }
return 0;
}
--
2.45.0
There is an MSR which configures how RMIDs are distributed across SNC
nodes. When SNC is enabled bit 0 of this MSR must be cleared.
Add an architecture specific hook into domain_add_cpu_mon() to call
a function to set the MSR.
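
For reference, the RMID distribution that results once bit 0 is cleared
can be sketched like this. It mirrors the arithmetic in
logical_rmid_to_physical_rmid() from earlier in the series, but
demo_logical_to_physical_rmid() and its parameters are hypothetical names
for a user-space illustration; the node number is passed in directly
rather than looked up with cpu_to_node().

```c
#include <assert.h>

/*
 * Map a logical RMID (numbered from zero within each SNC node) to the
 * physical RMID counter number. num_rmid is the number of RMIDs
 * available per SNC node.
 */
static int demo_logical_to_physical_rmid(int node, int snc_nodes_per_l3,
					 int num_rmid, int lrmid)
{
	assert(snc_nodes_per_l3 >= 1 && num_rmid > 0);
	assert(lrmid >= 0 && lrmid < num_rmid);

	/* SNC disabled: logical and physical RMIDs are identical. */
	if (snc_nodes_per_l3 == 1)
		return lrmid;

	/* Each SNC node owns a contiguous block of num_rmid counters. */
	return lrmid + (node % snc_nodes_per_l3) * num_rmid;
}
```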
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 2 ++
arch/x86/kernel/cpu/resctrl/monitor.c | 26 ++++++++++++++++++++++++++
4 files changed, 31 insertions(+)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index e022e6eb766c..3cb8dd6311c3 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1164,6 +1164,7 @@
#define MSR_IA32_QM_CTR 0xc8e
#define MSR_IA32_PQR_ASSOC 0xc8f
#define MSR_IA32_L3_CBM_BASE 0xc90
+#define MSR_RMID_SNC_CONFIG 0xca0
#define MSR_IA32_L2_CBM_BASE 0xd10
#define MSR_IA32_MBA_THRTL_BASE 0xd50
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 7957fc38b71c..08520321f5d0 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -532,6 +532,8 @@ static inline bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)
int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
+void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d);
+
/*
* Get the cacheinfo structure of the cache associated with @cpu at level @level.
* cpuhp lock must be held.
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 95ef8fe3cb50..1930fce9dfe9 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -615,6 +615,8 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
}
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+ arch_mon_domain_online(r, d);
+
if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
mon_domain_free(hw_dom);
return;
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index e7a8e96821e5..c7559735e33a 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1069,6 +1069,32 @@ static void l3_mon_evt_init(struct rdt_resource *r)
list_add_tail(&mbm_local_event.list, &r->evt_list);
}
+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * This mode is incompatible with Linux resctrl semantics
+ * as RMIDs are partitioned between SNC nodes, which requires
+ * a user to know which RMID is allocated to a task.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in Sub-NUMA Cluster mode. This mode is better for Linux.
+ * The RMID space is divided between all SNC nodes with the
+ * RMIDs renumbered to start from zero in each node when
+ * counting operations from tasks. Code to read the counters
+ * must adjust RMID counter numbers based on SNC node. See
+ * logical_rmid_to_physical_rmid() for code that does this.
+ */
+void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d)
+{
+ u64 val;
+
+ if (snc_nodes_per_l3_cache == 1)
+ return;
+
+ rdmsrl(MSR_RMID_SNC_CONFIG, val);
+ val &= ~BIT_ULL(0);
+ wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
int __init rdt_get_mon_l3_config(struct rdt_resource *r)
{
unsigned int mbm_offset = boot_cpu_data.x86_cache_mbm_width_offset;
--
2.45.0
When SNC is enabled there is a mismatch between the MBA control function
which operates at L3 cache scope and the MBM monitor functions which
measure memory bandwidth on each SNC node.
Block use of the mba_MBps mount option when the scopes of MBA and MBM
do not match.
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index cc31ede1a1e7..1cc4794d5a2e 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2343,10 +2343,12 @@ static void mba_sc_domain_destroy(struct rdt_resource *r,
*/
static bool supports_mba_mbps(void)
{
+ struct rdt_resource *rmbm = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
return (is_mbm_local_enabled() &&
- r->alloc_capable && is_mba_linear());
+ r->alloc_capable && is_mba_linear() &&
+ r->ctrl_scope == rmbm->mon_scope);
}
/*
--
2.45.0
When Sub-NUMA Cluster (SNC) mode is enabled the legacy monitor reporting
files must report the sum of the data from all of the SNC nodes that
share the L3 cache that is referenced by the monitor file.
Resctrl squeezes all the attributes of these files into 32-bits so they
can be stored in the "priv" field of struct kernfs_node.
Arbitrarily choose the "evtid" field to sacrifice one bit to make
space for a new bit. This structure is purely internal to resctrl,
no ABI issues with modifying it. Subsequent changes may rearrange the
allocation of bits between each of the fields as needed.
The stolen bit is given to a new "sum" field that indicates that reading
this file must sum across SNC nodes. This bit also indicates that the
domid field is the l3_cache_id (instead of a domain id) to find which
domains must be summed.
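
A minimal user-space sketch of the resulting layout is shown below. It
assumes a compiler that packs "unsigned int" bit-fields into a single
32-bit unit (true for gcc and clang on x86); struct demo_mon_bits and
demo_pack() are hypothetical names, and evtid is declared as a plain
unsigned int here instead of enum resctrl_event_id.

```c
#include <assert.h>
#include <stdint.h>

/* Same field widths as the "u" struct in union mon_data_bits. */
struct demo_mon_bits {
	unsigned int rid   : 10;
	unsigned int evtid : 7;		/* gave up one bit ... */
	unsigned int sum   : 1;		/* ... to this new flag */
	unsigned int domid : 14;
};

/* 10 + 7 + 1 + 14 = 32, so the fields still fit in 32 bits. */
static_assert(sizeof(struct demo_mon_bits) == sizeof(uint32_t),
	      "mon data bits must fit in 32 bits");

static struct demo_mon_bits demo_pack(unsigned int rid, unsigned int evtid,
				      int sum, unsigned int domid)
{
	struct demo_mon_bits b = {
		.rid = rid,
		.evtid = evtid,
		.sum = sum ? 1u : 0u,
		.domid = domid,	/* L3 cache id when sum is set */
	};

	return b;
}
```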
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index d8156d22cbdc..7957fc38b71c 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -133,14 +133,20 @@ struct mon_evt {
* as kernfs private data
* @rid: Resource id associated with the event file
* @evtid: Event id associated with the event file
- * @domid: The domain to which the event file belongs
+ * @sum: Set when event must be summed across multiple
+ * domains.
+ * @domid: When @sum is zero this is the domain to which
+ * the event file belongs. When @sum is one this
+ * is the id of the L3 cache that all domains to be
+ * summed share.
* @u: Name of the bit fields struct
*/
union mon_data_bits {
void *priv;
struct {
unsigned int rid : 10;
- enum resctrl_event_id evtid : 8;
+ enum resctrl_event_id evtid : 7;
+ unsigned int sum : 1;
unsigned int domid : 14;
} u;
};
--
2.45.0
For backwards compatibility on Sub-NUMA Cluster (SNC) systems the legacy
files in the mon_L3_XX directories must report the sum of data from each
SNC node sharing that L3 cache instance.
To make this possible, pass the "sumdomains" and "ci" fields from
rmid_read structure as extra arguments to resctrl_arch_rmid_read().
Note that the call from check_limbo() never operates on a "sum" basis,
so pass sumdomains=false, ci=NULL.
Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 4 +++-
arch/x86/kernel/cpu/resctrl/monitor.c | 5 +++--
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index d733e1f6485d..d0281c93229d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -276,6 +276,8 @@ void resctrl_offline_cpu(unsigned int cpu);
* @rmid: rmid of the counter to read.
* @eventid: eventid to read, e.g. L3 occupancy.
* @val: result of the counter read in bytes.
+ * @sum: sum across all domains sharing an L3 cache instance
+ * @ci: cacheinfo structure for the cache when @sum is true
* @arch_mon_ctx: An architecture specific value from
* resctrl_arch_mon_ctx_alloc(), for MPAM this identifies
* the hardware monitor allocated for this read request.
@@ -292,7 +294,7 @@ void resctrl_offline_cpu(unsigned int cpu);
*/
int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 closid, u32 rmid, enum resctrl_event_id eventid,
- u64 *val, void *arch_mon_ctx);
+ u64 *val, bool sum, struct cacheinfo *ci, void *arch_mon_ctx);
/**
* resctrl_arch_rmid_read_context_check() - warn about invalid contexts
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index b9b4d2b5ca82..5f89ed4823ee 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -300,7 +300,7 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 unused, u32 rmid, enum resctrl_event_id eventid,
- u64 *val, void *ignored)
+ u64 *val, bool sum, struct cacheinfo *ci, void *ignored)
{
struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
@@ -380,6 +380,7 @@ void __check_limbo(struct rdt_mon_domain *d, bool force_free)
entry = __rmid_entry(idx);
if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
QOS_L3_OCCUP_EVENT_ID, &val,
+ false, NULL,
arch_mon_ctx)) {
rmid_dirty = true;
} else {
@@ -589,7 +590,7 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
}
rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid, rr->evtid,
- &tval, rr->arch_mon_ctx);
+ &tval, rr->sumdomains, rr->ci, rr->arch_mon_ctx);
if (rr->err)
return rr->err;
--
2.45.0
When reading from a single domain the existing check that the current
CPU is in the domain is accurate.
But when summing across multiple domains that share an L3 cache instance
it is sufficient to run on any CPU in the shared_map for that cache.
Split the check into the two separate cases.
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/monitor.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index c9dd6ec68fcd..e7a8e96821e5 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -308,9 +308,6 @@ static int resctrl_arch_rmid_read_one(struct rdt_resource *r, struct rdt_mon_dom
resctrl_arch_rmid_read_context_check();
- if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
- return -EINVAL;
-
ret = __rmid_read(cpu, rmid, eventid, &msr_val);
if (ret)
return ret;
@@ -338,8 +335,15 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
int ret;
*val = 0;
- if (!sum)
+ if (!sum) {
+ if (!cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
+ return -EINVAL;
+
return resctrl_arch_rmid_read_one(r, d, cpu, rmid, eventid, val);
+ }
+
+ if (!cpumask_test_cpu(cpu, &ci->shared_cpu_map))
+ return -EINVAL;
list_for_each_entry(d, &r->mon_domains, hdr.list) {
if (d->ci->id != ci->id)
--
2.45.0
With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
their name refer to Sub-NUMA nodes instead of L3 cache ids.
Users should be aware that SNC mode also affects the amount of L3 cache
available for allocation within each SNC node.
Signed-off-by: Tony Luck <[email protected]>
---
Documentation/arch/x86/resctrl.rst | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 627e23869bca..401f6bfb4a3c 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -375,6 +375,10 @@ When monitoring is enabled all MON groups will also contain:
all tasks in the group. In CTRL_MON groups these files provide
the sum for all tasks in the CTRL_MON group and all tasks in
MON groups. Please see example section for more details on usage.
+ On systems with Sub-NUMA Cluster (SNC) enabled there are extra
+ directories for each node (located within the "mon_L3_XX" directory
+ for the L3 cache they occupy). These are named "mon_sub_L3_YY"
+ where "YY" is the node number.
"mon_hw_id":
Available only with debug option. The identifier used by hardware
@@ -484,6 +488,19 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
each bit represents 5% of the capacity of the cache. You could partition
the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes.
+The top-level monitoring files in each "mon_L3_XX" directory provide
+the sum of data across all SNC nodes sharing an L3 cache instance.
+Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
+the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the
+"mon_sub_L3_YY" directories to get node local data.
+
Memory bandwidth Allocation and monitoring
==========================================
--
2.45.0
There isn't a simple hardware bit that indicates whether a CPU is
running in Sub-NUMA Cluster (SNC) mode. Infer the state by comparing the
number of CPUs sharing the L3 cache with CPU0 to the number of CPUs in
the same NUMA node as CPU0.
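
The inference can be sketched in plain C as below. This is only an
illustration: demo_snc_nodes_per_l3() is a hypothetical name, the two
CPU counts are passed in instead of being taken from cpumasks, and the
guard against a zero or negative count is an addition of the sketch.

```c
/*
 * Return the inferred number of SNC nodes per L3 cache.
 * cpus_sharing_l3: CPUs that share the L3 cache with CPU0.
 * cpus_in_node:    CPUs in the same NUMA node as CPU0.
 * Anything outside the range 1..4 is treated as "SNC not enabled".
 */
static int demo_snc_nodes_per_l3(int cpus_sharing_l3, int cpus_in_node)
{
	int ratio;

	/* Avoid dividing by zero on nonsensical input (sketch-only guard). */
	if (cpus_sharing_l3 <= 0 || cpus_in_node <= 0)
		return 1;

	ratio = cpus_sharing_l3 / cpus_in_node;

	return (ratio >= 1 && ratio <= 4) ? ratio : 1;
}
```

For example, 64 CPUs sharing an L3 cache with 32 CPUs in CPU0's NUMA
node gives a ratio of 2, i.e. two SNC nodes per L3 cache.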
If SNC mode is detected, print a single informational message to the
console.
Add the missing definition of pr_fmt() to monitor.c. This wasn't
noticed before as there are only "can't happen" console messages
from this file.
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/monitor.c | 59 +++++++++++++++++++++++++++
1 file changed, 59 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index c7559735e33a..1c5162a68461 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -15,6 +15,8 @@
* Software Developer Manual June 2016, volume 3, section 17.17.
*/
+#define pr_fmt(fmt) "resctrl: " fmt
+
#include <linux/cpu.h>
#include <linux/module.h>
#include <linux/sizes.h>
@@ -1095,6 +1097,61 @@ void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d)
wrmsrl(MSR_RMID_SNC_CONFIG, val);
}
+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+ X86_MATCH_VFM(INTEL_ICELAKE_X, 0),
+ X86_MATCH_VFM(INTEL_SAPPHIRERAPIDS_X, 0),
+ X86_MATCH_VFM(INTEL_EMERALDRAPIDS_X, 0),
+ X86_MATCH_VFM(INTEL_GRANITERAPIDS_X, 0),
+ X86_MATCH_VFM(INTEL_ATOM_CRESTMONT_X, 0),
+ {}
+};
+
+/*
+ * There isn't a simple hardware bit that indicates whether a CPU is running
+ * in Sub-NUMA Cluster (SNC) mode. Infer the state by comparing the
+ * number of CPUs sharing the L3 cache with CPU0 to the number of CPUs in
+ * the same NUMA node as CPU0.
+ * It is not possible to accurately determine SNC state if the system is
+ * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
+ * to L3 caches. It will be OK if system is booted with hyperthreading
+ * disabled (since this doesn't affect the ratio).
+ */
+static __init int snc_get_config(void)
+{
+ struct cacheinfo *ci = get_cpu_cacheinfo_level(0, RESCTRL_L3_CACHE);
+ const cpumask_t *node0_cpumask;
+ int ret;
+
+ if (!x86_match_cpu(snc_cpu_ids) || !ci)
+ return 1;
+
+ cpus_read_lock();
+ if (num_online_cpus() != num_present_cpus())
+ pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
+ cpus_read_unlock();
+
+ node0_cpumask = cpumask_of_node(cpu_to_node(0));
+
+ ret = cpumask_weight(&ci->shared_cpu_map) / cpumask_weight(node0_cpumask);
+
+ /* sanity check: Only valid results are 1, 2, 3, 4 */
+ switch (ret) {
+ case 1:
+ break;
+ case 2 ... 4:
+ pr_info("Sub-NUMA Cluster mode detected with %d nodes per L3 cache\n", ret);
+ rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_L3_NODE;
+ break;
+ default:
+ pr_warn("Ignore improbable SNC node count %d\n", ret);
+ ret = 1;
+ break;
+ }
+
+ return ret;
+}
+
int __init rdt_get_mon_l3_config(struct rdt_resource *r)
{
unsigned int mbm_offset = boot_cpu_data.x86_cache_mbm_width_offset;
@@ -1102,6 +1159,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
unsigned int threshold;
int ret;
+ snc_nodes_per_l3_cache = snc_get_config();
+
resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
--
2.45.0
Hi Tony,
On 5/28/24 3:19 PM, Tony Luck wrote:
> This series based on top of Linus upstream commit 33e02dc69afb ("Merge
> tag 'sound-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound")
>
> The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
> that share an L3 cache into two or more sets. This plays havoc with the
> Resource Director Technology (RDT) monitoring features. Prior to this
> patch Intel has advised that SNC and RDT are incompatible.
>
> Some of these CPUs support an MSR that can partition the RMID counters
> in the same way. This allows monitoring features to be used. Legacy
> monitoring files provide the sum of counters from each SNC node for
> backwards compatibility. Additional files per SNC node provide details
> per node.
>
> Cache and memory bandwidth allocation features continue to operate at
> the scope of the L3 cache.
>
> Signed-off-by: Tony Luck <[email protected]>
>
> ---
> Changes since v18: https://lore.kernel.org/all/[email protected]/
>
> Global: Consistent use of "Sub-NUMA Cluster (SNC)"
(briefly scanning changes)
>
> 1-4: No change
>
> 5: Rename RESCTRL_NODE as RESCTRL_L3_NODE to make it clear that
> these "nodes" are each subsets of L3 cache instances.
>
> 6: Changes for snc_nodes_per_l3_cache are localized to monitor.c
> Don't use it in decision block use of mba_MBps option.
> Moved the old get_node_rmid() function here, but renamed it to
> logical_rmid_to_physical_rmid() with a block comment explaining
> how RMIDs are distributed when SNC is enabled. Function now
> checks if snc_nodes_per_l3_cache == 1 for fast return.
>
> 7: New patch. Only allow mba_MBps option if scope of MBM matches MBA
>
> 8: Replaces old patch 8. "display_id" field is no more. Add and
> initialize the @ci (struct cachinfo *) to rdt_mon_domain.
> Note that the new get_cpu_cacheinfo_level() helper function is
> added to internal.h as it will also be needed by patch 19.
>
> 9: Instead of display_id, add pointer to cacheinfo structure to
> struct rmid_read. Add kerneldoc description of existing and
> new fields.
>
> 10: Added to commit comment describing why mkdir_mondata_subdir()
> needs to be refactored.
>
> 11: Dropped Intel specific description of fields in the mon_evt
> structure. Say that choice of bit to steal was arbitrary, but
> can be changed in the future.
>
> 12: Fixed typo s/and file/and files/ in commit message. Now using
> the cacheinfo structure (specifically "id" field) instead of
> display_id.
>
> 13: Wordsmith commit into imperative.
> I looked at using kobject_has_children() to check for empty
> directory, but it needs a "struct kobject *" and all I have
> is "struct kernfs_node *". I'm now checking how many CPUs
Consider how kobject_has_children() uses that struct kobject *.
Specifically:
return kobj->sd && kobj->sd->dir.subdirs
It operates on kobj->sd, which is exactly what you have: struct kernfs_node.
> remain in ci->shared_cpu_map to detect whether this is the
> last SNC node.
hmmm, ok, will take a look ... but please finalize discussion of a patch series
before submitting a new series that rejects feedback without discussion and
does something completely different in new version.
Reinette
On Tue, May 28, 2024 at 03:55:29PM -0700, Reinette Chatre wrote:
> Hi Tony,
> > 13: Wordsmith commit into imperative.
> > I looked at using kobject_has_children() to check for empty
> > directory, but it needs a "struct kobject *" and all I have
> > is "struct kernfs_node *". I'm now checking how many CPUs
>
> Consider how kobject_has_children() uses that struct kobject *.
> Specifically:
> return kobj->sd && kobj->sd->dir.subdirs
>
> It operates on kobj->sd, which is exactly what you have: struct kernfs_node.
So right. My turn to grumble about other people's choice of names. If
that field was named "kn" instead of "sd" I would have spotted this
too.
> > remain in ci->shared_cpu_map to detect whether this is the
> > last SNC node.
>
> hmmm, ok, will take a look ... but please finalize discussion of a patch series
> before submitting a new series that rejects feedback without discussion and
> does something completely different in new version.
Reinette,
So here's what rmdir_mondata_subdir_allrdtgrp() looks like using the
subdirs check. It might need an update/better header comment.
-Tony
---
/*
* Remove all subdirectories of mon_data of ctrl_mon groups
* and monitor groups with given domain id.
*/
static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
struct rdt_mon_domain *d)
{
struct rdtgroup *prgrp, *crgrp;
struct kernfs_node *kn;
char subname[32];
char name[32];
sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
if (r->mon_scope != RESCTRL_L3_CACHE) {
/*
* SNC mode: Unless the last domain is being removed must
* just remove the SNC subdomain.
*/
sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
}
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
kn = kernfs_find_and_get(prgrp->mon.mon_data_kn, name);
if (!kn)
continue;
if (kn->dir.subdirs <= 1)
kernfs_remove(kn);
else
kernfs_remove_by_name(kn, subname);
list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) {
kn = kernfs_find_and_get(crgrp->mon.mon_data_kn, name);
if (!kn)
continue;
if (kn->dir.subdirs <= 1)
kernfs_remove(kn);
else
kernfs_remove_by_name(kn, subname);
}
}
}
Hi Tony,
On 5/29/24 1:20 PM, Tony Luck wrote:
> On Tue, May 28, 2024 at 03:55:29PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>> 13: Wordsmith commit into imperative.
>>> I looked at using kobject_has_children() to check for empty
>>> directory, but it needs a "struct kobject *" and all I have
>>> is "struct kernfs_node *". I'm now checking how many CPUs
>>
>> Consider how kobject_has_children() uses that struct kobject *.
>> Specifically:
>> return kobj->sd && kobj->sd->dir.subdirs
>>
>> It operates on kobj->sd, which is exactly what you have: struct kernfs_node.
>
> So right. My turn to grumble about other peoples choice of names. If
> that field was named "kn" instead of "sd" I would have spotted this
> too.
>
>>> remain in ci->shared_cpu_map to detect whether this is the
>>> last SNC node.
>>
>> hmmm, ok, will take a look ... but please finalize discussion of a patch series
>> before submitting a new series that rejects feedback without discussion and
>> does something completely different in new version.
>
> Reinette,
>
> So here's what rmdir_mondata_subdir_allrdtgrp() looks like using the
> subdirs check. It might need an update/better header comment.
>
> -Tony
>
> ---
>
> /*
> * Remove all subdirectories of mon_data of ctrl_mon groups
> * and monitor groups with given domain id.
(note comment still considers that domain id is parameter)
> */
> static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> struct rdt_mon_domain *d)
> {
> struct rdtgroup *prgrp, *crgrp;
> struct kernfs_node *kn;
> char subname[32];
I wonder if static checkers will know that this cannot be used
uninitialized?
> char name[32];
>
> sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
> if (r->mon_scope != RESCTRL_L3_CACHE) {
> /*
> * SNC mode: Unless the last domain is being removed must
> * just remove the SNC subdomain.
> */
> sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
> }
>
> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> kn = kernfs_find_and_get(prgrp->mon.mon_data_kn, name);
> if (!kn)
> continue;
>
> if (kn->dir.subdirs <= 1)
> kernfs_remove(kn);
> else
> kernfs_remove_by_name(kn, subname);
>
> list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) {
> kn = kernfs_find_and_get(crgrp->mon.mon_data_kn, name);
> if (!kn)
> continue;
>
> if (kn->dir.subdirs <= 1)
> kernfs_remove(kn);
> else
> kernfs_remove_by_name(kn, subname);
> }
> }
> }
This solution looks more intuitive to me. I do think that it may be
missing some kernfs_put()'s?
Reinette
ps. Please do give me a couple of days more with this series before you
submit a new version.
On Wed, May 29, 2024 at 07:46:27PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 5/29/24 1:20 PM, Tony Luck wrote:
> > On Tue, May 28, 2024 at 03:55:29PM -0700, Reinette Chatre wrote:
> > > Hi Tony,
> > > > 13: Wordsmith commit into imperative.
> > > > I looked at using kobject_has_children() to check for empty
> > > > directory, but it needs a "struct kobject *" and all I have
> > > > is "struct kernfs_node *". I'm now checking how many CPUs
> > >
> > > Consider how kobject_has_children() uses that struct kobject *.
> > > Specifically:
> > > return kobj->sd && kobj->sd->dir.subdirs
> > >
> > > It operates on kobj->sd, which is exactly what you have: struct kernfs_node.
> >
> > So right. My turn to grumble about other peoples choice of names. If
> > that field was named "kn" instead of "sd" I would have spotted this
> > too.
> >
> > > > remain in ci->shared_cpu_map to detect whether this is the
> > > > last SNC node.
> > >
> > > hmmm, ok, will take a look ... but please finalize discussion of a patch series
> > > before submitting a new series that rejects feedback without discussion and
> > > does something completely different in new version.
> >
> > Reinette,
> >
> > So here's what rmdir_mondata_subdir_allrdtgrp() looks like using the
> > subdirs check. It might need an update/better header comment.
> >
> > -Tony
> >
> > ---
> >
> > /*
> > * Remove all subdirectories of mon_data of ctrl_mon groups
> > * and monitor groups with given domain id.
>
> (note comment still considers that domain id is parameter)
Will fix.
> > */
> > static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> > struct rdt_mon_domain *d)
> > {
> > struct rdtgroup *prgrp, *crgrp;
> > struct kernfs_node *kn;
> > char subname[32];
>
> I wonder if static checkers will know that this cannot be used
> uninitialized?
I wondered that too. There are no complaints from gcc. How do people
deal with false positives from static checkers? Simplest would be to
provide an initializer:
char subname[32] = "";
While that might shut up the static check, it would be more confusing
for human readers.
> > char name[32];
> >
> > sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
> > if (r->mon_scope != RESCTRL_L3_CACHE) {
> > /*
> > * SNC mode: Unless the last domain is being removed must
> > * just remove the SNC subdomain.
> > */
> > sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
> > }
> >
> > list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> > kn = kernfs_find_and_get(prgrp->mon.mon_data_kn, name);
> > if (!kn)
> > continue;
> >
> > if (kn->dir.subdirs <= 1)
> > kernfs_remove(kn);
> > else
> > kernfs_remove_by_name(kn, subname);
> >
> > list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) {
> > kn = kernfs_find_and_get(crgrp->mon.mon_data_kn, name);
> > if (!kn)
> > continue;
> >
> > if (kn->dir.subdirs <= 1)
> > kernfs_remove(kn);
> > else
> > kernfs_remove_by_name(kn, subname);
> > }
> > }
> > }
>
> This solution looks more intuitive to me. I do think that it may be
> missing some kernfs_put()'s?
There aren't any kernfs_put()'s in the existing code. Resctrl takes
an extra hold on the CTRL_MON and MON directories and jumps through some
hoops to drop that after the directory has been removed. But the monitor
directories have nothing like that.
> Reinette
>
> ps. Please do give me a couple of days more with this series before you
> submit a new version.
Sure. Will do.
-Tony
Hi Tony,
On 5/30/24 9:36 AM, Tony Luck wrote:
> On Wed, May 29, 2024 at 07:46:27PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 5/29/24 1:20 PM, Tony Luck wrote:
>>> On Tue, May 28, 2024 at 03:55:29PM -0700, Reinette Chatre wrote:
>>>> Hi Tony,
>>>>> 13: Wordsmith commit into imperative.
>>>>> I looked at using kobject_has_children() to check for empty
>>>>> directory, but it needs a "struct kobject *" and all I have
>>>>> is "struct kernfs_node *". I'm now checking how many CPUs
>>>>
>>>> Consider how kobject_has_children() uses that struct kobject *.
>>>> Specifically:
>>>> return kobj->sd && kobj->sd->dir.subdirs
>>>>
>>>> It operates on kobj->sd, which is exactly what you have: struct kernfs_node.
>>>
>>> So right. My turn to grumble about other peoples choice of names. If
>>> that field was named "kn" instead of "sd" I would have spotted this
>>> too.
>>>
>>>>> remain in ci->shared_cpu_map to detect whether this is the
>>>>> last SNC node.
>>>>
>>>> hmmm, ok, will take a look ... but please finalize discussion of a patch series
>>>> before submitting a new series that rejects feedback without discussion and
>>>> does something completely different in new version.
>>>
>>> Reinette,
>>>
>>> So here's what rmdir_mondata_subdir_allrdtgrp() looks like using the
>>> subdirs check. It might need an update/better header comment.
>>>
>>> -Tony
>>>
>>> ---
>>>
>>> /*
>>> * Remove all subdirectories of mon_data of ctrl_mon groups
>>> * and monitor groups with given domain id.
>>
>> (note comment still considers that domain id is parameter)
>
> Will fix.
>
>>> */
>>> static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
>>> struct rdt_mon_domain *d)
>>> {
>>> struct rdtgroup *prgrp, *crgrp;
>>> struct kernfs_node *kn;
>>> char subname[32];
>>
>> I wonder if static checkers will know that this cannot be used
>> uninitialized?
>
> I wondered that too. There are no complaints from gcc. How do people
> deal with false positives from static checkers? Simplest would be to
> provide an initializer:
>
> char subname[32] = "";
>
> While that might shut up the static check, it would be more confusing
> for human readers.
or char subname[32] = {};
Please elaborate how this will be confusing to human readers? A comment
may help to address that.
I took the time to run a static checker on this series and it did
not complain about this issue. I did not run it with this fixup though, only
with the original submission, which seems to have a similar pattern. I do still
think it would be good to initialize the arrays.
btw ... the static checker I ran did have four other complaints, three about
uninitialized data and one about divide by zero. Most problems appear to be
in mbm_update() that does not initialize rr.sumdomains nor rr.ci before
calling __mon_event_count().
Please use available tools to check code before posting.
>
>>> char name[32];
>>>
>>> sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
>>> if (r->mon_scope != RESCTRL_L3_CACHE) {
>>> /*
>>> * SNC mode: Unless the last domain is being removed must
>>> * just remove the SNC subdomain.
>>> */
>>> sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
>>> }
>>>
>>> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
>>> kn = kernfs_find_and_get(prgrp->mon.mon_data_kn, name);
>>> if (!kn)
>>> continue;
>>>
>>> if (kn->dir.subdirs <= 1)
>>> kernfs_remove(kn);
>>> else
>>> kernfs_remove_by_name(kn, subname);
>>>
>>> list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) {
>>> kn = kernfs_find_and_get(crgrp->mon.mon_data_kn, name);
>>> if (!kn)
>>> continue;
>>>
>>> if (kn->dir.subdirs <= 1)
>>> kernfs_remove(kn);
>>> else
>>> kernfs_remove_by_name(kn, subname);
>>> }
>>> }
>>> }
>>
>> This solution looks more intuitive to me. I do think that it may be
>> missing some kernfs_put()'s?
>
> There aren't any kernfs_put()'s in the existing code.
Why should it? Existing code does not have the kernfs_put()'s because
the extra reference is only obtained in this new code.
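As a rough user-space illustration of the pairing I mean (this is a toy model, not kernfs: toy_node, toy_find_and_get(), toy_put() and toy_visit() are made-up names), every lookup that takes a reference needs a matching put once the caller is done with the node:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a reference-counted node; not the kernfs structure. */
struct toy_node {
	const char *name;
	int refcount;
};

/* Find a node by name and take a reference on it; NULL if absent. */
struct toy_node *toy_find_and_get(struct toy_node *nodes, size_t n,
				  const char *name)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (!strcmp(nodes[i].name, name)) {
			nodes[i].refcount++;
			return &nodes[i];
		}
	}

	return NULL;
}

/* Drop a reference taken by toy_find_and_get(). */
void toy_put(struct toy_node *node)
{
	if (!node)
		return;
	assert(node->refcount > 0);
	node->refcount--;
}

/*
 * Look a node up, use it, then release the lookup reference.
 * Returns 0 when the node was found, -1 otherwise.
 */
int toy_visit(struct toy_node *nodes, size_t n, const char *name)
{
	struct toy_node *node = toy_find_and_get(nodes, n, name);

	if (!node)
		return -1;

	/* ... work on the node while the reference is held ... */

	toy_put(node);
	return 0;
}
```

After toy_visit() returns, the count is back where it started. A caller that uses toy_find_and_get() directly and never calls toy_put() leaves the count elevated, which is the pattern I am asking about in the new code.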
Reinette
Hi Tony,
On 5/28/24 3:19 PM, Tony Luck wrote:
> When SNC is enabled monitoring data is collected at the SNC node
> granularity, but must be reported at L3-cache granularity for
> backwards compatibility in addition to reporting at the node
> level.
>
> Add a "ci" field to the rdt_mon_domain structure to save the
> cache information about the enclosing L3 cache for the domain.
> This provides:
>
> 1) The cache id which is needed to compose the name of the legacy
> monitoring directory, and to determine which domains should be
> summed to provide L3-scoped data.
>
> 2) The shared_cpu_map which is needed to determine which CPUs can
> be used to read the RMID counters with the MSR interface.
>
> This is the first step to an eventual goal of monitor reporting files
> like this (for a system with two SNC nodes per L3):
>
> $ cd /sys/fs/resctrl/mon_data
> $ tree mon_L3_00
> mon_L3_00 <- 00 here is L3 cache id
> ├── llc_occupancy \ These files provide legacy support
> ├── mbm_local_bytes > for non-SNC aware monitor apps
> ├── mbm_total_bytes / that expect data at L3 cache level
> ├── mon_sub_L3_00 <- 00 here is SNC node id
> │  ├── llc_occupancy \ These files are finer grained
> │  ├── mbm_local_bytes > data from each SNC node
> │  └── mbm_total_bytes /
> └── mon_sub_L3_01
> ├── llc_occupancy \
> ├── mbm_local_bytes > As above, but for node 1.
> └── mbm_total_bytes /
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> include/linux/resctrl.h | 2 ++
> arch/x86/kernel/cpu/resctrl/internal.h | 21 +++++++++++++++++++++
> arch/x86/kernel/cpu/resctrl/core.c | 7 ++++++-
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 1 -
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 1 -
> 5 files changed, 29 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 64b6ad1b22a1..d733e1f6485d 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -96,6 +96,7 @@ struct rdt_ctrl_domain {
> /**
> * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
> * @hdr: common header for different domain types
> + * @ci: cache info for this domain
> * @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
> * @mbm_total: saved state for MBM total bandwidth
> * @mbm_local: saved state for MBM local bandwidth
> @@ -106,6 +107,7 @@ struct rdt_ctrl_domain {
> */
> struct rdt_mon_domain {
> struct rdt_domain_hdr hdr;
> + struct cacheinfo *ci;
> unsigned long *rmid_busy_llc;
> struct mbm_state *mbm_total;
> struct mbm_state *mbm_local;
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 135190e0711c..eb70d3136ced 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -2,6 +2,7 @@
> #ifndef _ASM_X86_RESCTRL_INTERNAL_H
> #define _ASM_X86_RESCTRL_INTERNAL_H
>
> +#include <linux/cacheinfo.h>
> #include <linux/resctrl.h>
> #include <linux/sched.h>
> #include <linux/kernfs.h>
> @@ -509,6 +510,26 @@ static inline bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)
>
> int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
>
> +/*
> + * Get the cacheinfo structure of the cache associated with @cpu at level @level.
> + * cpuhp lock must be held.
> + */
> +static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
> +{
> + struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
> + int i;
> +
> + for (i = 0; i < ci->num_leaves; i++) {
> + if (ci->info_list[i].level == level) {
> + if (ci->info_list[i].attributes & CACHE_ID)
> + return &ci->info_list[i];
> + break;
> + }
> + }
> +
> + return NULL;
> +}
> +
This does not belong in resctrl. It really looks to partner well with existing
cache helpers in include/linux/cacheinfo.h that already contains get_cpu_cacheinfo_id().
Considering the existing naming get_cpu_cacheinfo() may be more appropriate.
Reinette
Hi Tony,
On 5/28/24 3:19 PM, Tony Luck wrote:
> Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
> and memory controllers on a socket into two or more groups. These are
> presented to the operating system as NUMA nodes.
>
> This may enable some workloads to have slightly lower latency to memory
> as the memory controller(s) in an SNC node are electrically closer to the
> CPU cores on that SNC node. This cost may be offset by lower bandwidth
> since the memory accesses for each core can only be interleaved between
> the memory controllers on the same SNC node.
>
> Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
> to track L3 cache occupancy and memory bandwidth. There is an MSR that
> controls how the RMIDs are shared between SNC nodes.
>
> The default mode divides them numerically. E.g. when there are two SNC
> nodes on a socket the lower number half of the RMIDs are given to the
> first node, the remainder to the second node. This would be difficult
> to use with the Linux resctrl interface as specific RMID values assigned
> to resctrl groups are not visible to users.
>
> The other mode divides the RMIDs and renumbers the ones on the second
> SNC node to start from zero.
>
> Even with this renumbering SNC mode requires several changes in resctrl
> behavior for correct operation.
>
> Add a static global to arch/x86/kernel/cpu/resctrl/monitor.c to indicate
> how many SNC domains share an L3 cache instance. Initialize this to
> "1". Runtime detection of SNC mode will adjust this value.
>
> Update all places to take appropriate action when SNC mode is enabled:
> 1) The number of logical RMIDs per L3 cache available for use is the
> number of physical RMIDs divided by the number of SNC nodes.
> 2) Likewise the "mon_scale" value must be divided by the number of SNC
> nodes.
> 3) Add a function to convert from logical RMID values (assigned to
> tasks and loaded into the IA32_PQR_ASSOC MSR on context switch)
> to physical RMID values to load into IA32_QM_EVTSEL MSR when
> reading counters on each SNC node.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/monitor.c | 37 ++++++++++++++++++++++++---
> 1 file changed, 33 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 89d7e6fcbaa1..b9b4d2b5ca82 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -97,6 +97,8 @@ unsigned int resctrl_rmid_realloc_limit;
>
> #define CF(cf) ((unsigned long)(1048576 * (cf) + 0.5))
>
> +static int snc_nodes_per_l3_cache = 1;
> +
> /*
> * The correction factor table is documented in Documentation/arch/x86/resctrl.rst.
> * If rmid > rmid threshold, MBM total and local values should be multiplied
> @@ -185,10 +187,37 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
> return entry;
> }
>
> -static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> +/*
> + * When Sub-NUMA Cluster (SNC) mode is not enabled, the physical RMID
> + * is the same as the logical RMID.
> + *
> + * When SNC mode is enabled the physical RMIDs are distributed across
> + * the SNC nodes. E.g. with two SNC nodes per L3 cache and 200 physical
> + * RMIDs are divided with 0..99 on the first node and 100..199 on
> + * the second node. Compute the value of the physical RMID to pass to
> + * resctrl_arch_rmid_read().
Please stop rushing version after version. I do not think you read the
above after you wrote it. The sentences run into each other.
Could this be specific about what is meant by "physical" and "logical" RMID?
To me "physical RMID" implies the RMID used by hardware and "logical RMID"
is the RMID used by software ... but when it comes to SNC it is actually:
"physical RMID" - RMID used by MSR_IA32_QM_EVTSEL
"logical RMID" - RMID used by software and the MSR_IA32_PQR_ASSOC register
> + *
> + * Caller is responsible to make sure execution running on a CPU in
"is responsible" and "make sure" means the same, no?
"make sure execution running"?
(Looking ahead in this series and coming back to this, this looks like
rushed work that you in turn expect folks to spend quality time reviewing.)
> + * the domain to be read.
> + */
> +static int logical_rmid_to_physical_rmid(int lrmid)
> +{
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + int cpu = smp_processor_id();
> +
> + if (snc_nodes_per_l3_cache == 1)
> + return lrmid;
> +
> + return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> +}
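To make the logical/physical distinction concrete, here is a minimal user-space sketch of the arithmetic in the function above. snc_physical_rmid() and its explicit node, snc_nodes and rmids_per_node parameters are stand-ins for illustration only (the patch takes these from cpu_to_node(), snc_nodes_per_l3_cache and r->num_rmid). With 200 physical RMIDs and two SNC nodes, logical RMID 5 would map to physical RMID 5 on the first node and 105 on the second:

```c
#include <assert.h>

/*
 * Hypothetical stand-in for logical_rmid_to_physical_rmid().
 * lrmid:          RMID used by software and MSR_IA32_PQR_ASSOC
 * node:           NUMA node id of the CPU doing the read
 * snc_nodes:      SNC nodes per L3 cache (1 when SNC is not enabled)
 * rmids_per_node: logical RMIDs available on each SNC node
 * Returns the RMID to program into MSR_IA32_QM_EVTSEL.
 */
int snc_physical_rmid(int lrmid, int node, int snc_nodes, int rmids_per_node)
{
	assert(snc_nodes >= 1);
	assert(lrmid >= 0 && lrmid < rmids_per_node);

	/* Without SNC the two numbering schemes are identical. */
	if (snc_nodes == 1)
		return lrmid;

	/* Each SNC node owns a contiguous block of physical RMIDs. */
	return lrmid + (node % snc_nodes) * rmids_per_node;
}
```

A comment along these lines, with "physical" and "logical" defined in terms of the two MSRs, would address my question above.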
> +
> +static int __rmid_read(u32 lrmid,
> + enum resctrl_event_id eventid, u64 *val)
This line does not need to be split.
> {
> u64 msr_val;
> + int prmid;
>
> + prmid = logical_rmid_to_physical_rmid(lrmid);
> /*
> * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
> * with a valid event code for supported resource type and the bits
> @@ -197,7 +226,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
> * are error bits.
> */
> - wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> + wrmsr(MSR_IA32_QM_EVTSEL, eventid, prmid);
> rdmsrl(MSR_IA32_QM_CTR, msr_val);
>
> if (msr_val & RMID_VAL_ERROR)
> @@ -1022,8 +1051,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
> int ret;
>
> resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
> - hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
> - r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
> + hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
> + r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
> hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
>
> if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
Reinette
Hi Tony,
On 5/28/24 3:19 PM, Tony Luck wrote:
> rdtgroup_mondata_show() calls mon_event_read() which calls
> mon_event_count() which packages up all the required details into an
No, mon_event_read() does the "packaging".
> rmid_read structure passed across the smp_call*() infrastructure.
>
> Legacy files reporting for a single domain pass that domain in the
> rmid_read structure. Files that need to sum multiple domains have
> meta data that provides the L3 cache ID for domains that must be
> summed.
>
> Add the sumdomains and cacheinfo fields to the rmid_read structure.
This just describes the code and that can be seen from patch. Please
check all changelogs in series for this.
>
> Add kerneldoc comments for the rmid_read structure.
Same.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index eb70d3136ced..d8156d22cbdc 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -145,12 +145,28 @@ union mon_data_bits {
> } u;
> };
>
> +/**
> + * struct rmid_read - Data passed across smp_call*() to read event count
> + * @rgrp: Resctrl group (provides RMID value)
Provides much more than RMID so either make that accurate or drop the annotation.
> + * @r: Resource
> + * @d: Domain
> + * @evtid: Which monitor event to read
> + * @first: When true this just requests initialization of an MBM counter
Seems strange. Perhaps just "Initializes MBM counter when true."
> + * @sumdomains: When false just return monitor count from domain @d. When true,
> + * sum all domains in @r sharing L3 @ci.id
> + * @ci: See @sumdomains
> + * @err: Used to return error indication
> + * @val: Used to return value of event counter
> + * @arch_mon_ctx: hardware monitor allocated for this read request (MPAM only)
Stay consistent with descriptions starting with upper case.
> + */
> struct rmid_read {
> struct rdtgroup *rgrp;
> struct rdt_resource *r;
> struct rdt_mon_domain *d;
> enum resctrl_event_id evtid;
> bool first;
> + bool sumdomains;
> + struct cacheinfo *ci;
> int err;
> u64 val;
> void *arch_mon_ctx;
Reinette
Hi Tony,
Regarding shortlog: isn't the purpose of this change to _avoid_
allocating a new bit?
On 5/28/24 3:19 PM, Tony Luck wrote:
> When Sub-NUMA Cluster (SNC) mode is enabled the legacy monitor reporting
> files must report the sum of the data from all of the SNC nodes that
> share the L3 cache that is referenced by the monitor file.
>
> Resctrl squeezes all the attributes of these files into 32-bits so they
> can be stored in the "priv" field of struct kernfs_node.
>
> Arbitrarily choose the "evtid" field to sacrifice one bit to make
> space for a new bit. This structure is purely internal to resctrl,
Missing explanation why this is ok for this field to sacrifice a bit.
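For reference while discussing this, a user-space sketch of the kind of 32-bit packing involved. The field widths below (10/7/1/14) are my assumption for illustration, and toy_mon_bits, toy_mon_pack() and toy_mon_unpack() are made-up names, not the resctrl union. As far as I can tell the L3 event ids in use today are 0x01 to 0x03, so a seven-bit evtid still has ample room, and that is the sort of justification the changelog could state:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed widths, chosen so that the four fields total 32 bits. */
enum {
	RID_BITS   = 10,
	EVTID_BITS = 7,
	SUM_BITS   = 1,
	DOMID_BITS = 14,
};

struct toy_mon_bits {
	uint32_t rid;
	uint32_t evtid;
	uint32_t sum;	/* 1: sum across SNC nodes, domid holds an L3 cache id */
	uint32_t domid;
};

/* Pack the four fields into one 32-bit value, lowest field first. */
uint32_t toy_mon_pack(struct toy_mon_bits b)
{
	assert(b.rid < (1u << RID_BITS));
	assert(b.evtid < (1u << EVTID_BITS));
	assert(b.sum < (1u << SUM_BITS));
	assert(b.domid < (1u << DOMID_BITS));

	return b.rid |
	       (b.evtid << RID_BITS) |
	       (b.sum << (RID_BITS + EVTID_BITS)) |
	       (b.domid << (RID_BITS + EVTID_BITS + SUM_BITS));
}

/* Reverse of toy_mon_pack(). */
struct toy_mon_bits toy_mon_unpack(uint32_t v)
{
	struct toy_mon_bits b;

	b.rid = v & ((1u << RID_BITS) - 1);
	v >>= RID_BITS;
	b.evtid = v & ((1u << EVTID_BITS) - 1);
	v >>= EVTID_BITS;
	b.sum = v & ((1u << SUM_BITS) - 1);
	v >>= SUM_BITS;
	b.domid = v & ((1u << DOMID_BITS) - 1);

	return b;
}
```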
> no ABI issues with modifying it. Subsequent changes may rearrange the
> allocation of bits between each of the fields as needed.
>
> The stolen bit is given to a new "sum" field that indicates that reading
stolen? Is that necessary? It can just be "Give the bit ..." (also
note imperative tone)
> this file must sum across SNC nodes. This bit also indicates that the
> domid field is the l3_cache_id (instead of a domain id) to find which
> domains must be summed.
l3_cache_id looks like a variable that does not exist anywhere in this series
or in existing resctrl.
Reinette
Hi Tony,
On 5/28/24 3:20 PM, Tony Luck wrote:
> Legacy resctrl monitor files must provide the sum of event values across
> all Sub-NUMA Cluster (SNC) domains that share an L3 cache instance.
>
> Rename the existing resctrl_arch_rmid_read() function as
> resctrl_arch_rmid_read_one() (with some small refactoring to drop
> arguments that are not needed.
Missing closing ")".
>
> Create a new resctrl_arch_rmid_read() that iterates across
> domains when necessary. Pass a CPU number from the right domain to
> resctrl_arch_rmid_read_one().
"when necessary"? Can you elaborate?
"Pass a CPU ..." that just describes code that can be learned from patch.
Please describe the changes not the code.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/monitor.c | 41 ++++++++++++++++++++-------
> 1 file changed, 31 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 5f89ed4823ee..c9dd6ec68fcd 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -200,10 +200,9 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
> * Caller is responsible to make sure execution running on a CPU in
> * the domain to be read.
> */
> -static int logical_rmid_to_physical_rmid(int lrmid)
> +static int logical_rmid_to_physical_rmid(int cpu, int lrmid)
> {
> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> - int cpu = smp_processor_id();
>
> if (snc_nodes_per_l3_cache == 1)
> return lrmid;
> @@ -211,13 +210,13 @@ static int logical_rmid_to_physical_rmid(int lrmid)
> return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> }
>
> -static int __rmid_read(u32 lrmid,
> +static int __rmid_read(int cpu, u32 lrmid,
> enum resctrl_event_id eventid, u64 *val)
> {
> u64 msr_val;
> int prmid;
>
> - prmid = logical_rmid_to_physical_rmid(lrmid);
> + prmid = logical_rmid_to_physical_rmid(cpu, lrmid);
> /*
Passing CPU as parameter to __rmid_read(), which is run via IPI, really
obfuscates the code. How about renaming it to "__rmid_read_phys()"
and provide it the "physical RMID" as parameter to make it clear what it is
doing?
That would mean an extra call to determine "physical RMID" before calling
__rmid_read_phys() but making it clear that it needs a physical RMID should
help to explain what is going on.
> * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
> * with a valid event code for supported resource type and the bits
> @@ -269,7 +268,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
> memset(am, 0, sizeof(*am));
>
> /* Record any initial, non-zero count value. */
> - __rmid_read(rmid, eventid, &am->prev_msr);
> + __rmid_read(smp_processor_id(), rmid, eventid, &am->prev_msr);
> }
> }
>
> @@ -298,9 +297,8 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
> return chunks >> shift;
> }
>
> -int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
> - u32 unused, u32 rmid, enum resctrl_event_id eventid,
> - u64 *val, bool sum, struct cacheinfo *ci, void *ignored)
> +static int resctrl_arch_rmid_read_one(struct rdt_resource *r, struct rdt_mon_domain *d,
> + int cpu, u32 rmid, enum resctrl_event_id eventid, u64 *val)
> {
> struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> @@ -313,7 +311,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
> if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
> return -EINVAL;
>
> - ret = __rmid_read(rmid, eventid, &msr_val);
> + ret = __rmid_read(cpu, rmid, eventid, &msr_val);
> if (ret)
> return ret;
>
> @@ -327,7 +325,30 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
> chunks = msr_val;
> }
>
> - *val = chunks * hw_res->mon_scale;
> + *val += chunks * hw_res->mon_scale;
The various new layers of indirection, with SNC logic scattered between them,
make this change hard to understand.
> +
> + return 0;
> +}
> +
> +int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
> + u32 unused, u32 rmid, enum resctrl_event_id eventid,
> + u64 *val, bool sum, struct cacheinfo *ci, void *ignored)
This is not architecture specific code.
> +{
> + int cpu = smp_processor_id();
> + int ret;
> +
> + *val = 0;
> + if (!sum)
> + return resctrl_arch_rmid_read_one(r, d, cpu, rmid, eventid, val);
> +
/* SNC quirk that needs to be documented */
> + list_for_each_entry(d, &r->mon_domains, hdr.list) {
> + if (d->ci->id != ci->id)
> + continue;
> + cpu = cpumask_any(&d->hdr.cpu_mask);
> + ret = resctrl_arch_rmid_read_one(r, d, cpu, rmid, eventid, val);
The cpu parameter can be dropped, no? With the domain provided to resctrl_arch_rmid_read_one()
it is not necessary to again split the SNC logic (in this case the "reading from any CPU
in the cache domain is ok but still need accurate arch state") across multiple layers,
just contain this in (documented portion of) resctrl_arch_rmid_read_one().
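A stripped-down user-space sketch of the summing step, to show how little the loop itself needs to know. toy_mon_domain and toy_sum_l3() are made-up names, and the per-domain count is assumed to have been read already, so no CPU parameter appears at this level:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a monitor domain; field names are illustrative. */
struct toy_mon_domain {
	int id;		/* SNC node (domain) id */
	int l3_id;	/* id of the enclosing L3 cache */
	uint64_t count;	/* event count already read for this domain */
};

/*
 * Sum the counts of every domain that shares L3 cache @l3_id.
 * Domains belonging to other L3 instances are skipped.
 */
uint64_t toy_sum_l3(const struct toy_mon_domain *doms, size_t n, int l3_id)
{
	uint64_t sum = 0;
	size_t i;

	assert(doms || !n);

	for (i = 0; i < n; i++) {
		if (doms[i].l3_id != l3_id)
			continue;
		sum += doms[i].count;
	}

	return sum;
}
```

Anything about which CPU may perform the read for a given domain could then live, documented, in the per-domain helper.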
> + if (ret)
> + return ret;
> + }
>
> return 0;
> }
Reinette
Hi Tony,
On 5/28/24 3:20 PM, Tony Luck wrote:
> There is an MSR which configures how RMIDs are distributed across SNC
> nodes. When SNC is enabled bit 0 of this MSR must be cleared.
Missing explanation why bit 0 of this MSR must be cleared.
>
> Add an architecture specific hook into domain_add_cpu_mon() to call
> a function to set the MSR.
Can this explain more than what the code already tells us?
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/include/asm/msr-index.h | 1 +
> arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
> arch/x86/kernel/cpu/resctrl/core.c | 2 ++
> arch/x86/kernel/cpu/resctrl/monitor.c | 26 ++++++++++++++++++++++++++
> 4 files changed, 31 insertions(+)
>
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index e022e6eb766c..3cb8dd6311c3 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1164,6 +1164,7 @@
> #define MSR_IA32_QM_CTR 0xc8e
> #define MSR_IA32_PQR_ASSOC 0xc8f
> #define MSR_IA32_L3_CBM_BASE 0xc90
> +#define MSR_RMID_SNC_CONFIG 0xca0
> #define MSR_IA32_L2_CBM_BASE 0xd10
> #define MSR_IA32_MBA_THRTL_BASE 0xd50
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 7957fc38b71c..08520321f5d0 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -532,6 +532,8 @@ static inline bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)
>
> int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
>
> +void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d);
I expect that this function will be called from the already
architecture specific code so needing an architecture specific hook as the changelog
mentions does not seem appropriate.
Reinette
Hi Tony,
On 5/28/24 3:19 PM, Tony Luck wrote:
> When SNC mode is enabled, create subdirectories and files to monitor
> at the SNC node granularity. Monitor files at the L3 granularity are
> tagged with a "sum" attribute to indicate that all SNC nodes sharing
> an L3 cache should be read and summed to provide the result to the
> user.
What does "all SNC nodes sharing an L3 cache should be read and summed" mean?
>
> Note that the "domid" field for files that must sum across SNC domains
> has the L3 cache instance id, while non-summing files use the domain id.
>
> Also the "sum" files do not need to make a call to mon_event_read() to
Drop "Also" - it is a red flag that a new patch is needed.
> initialize the MBM counters. This will be handled by initializing the
> individual SNC nodes that share the L3.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 53 ++++++++++++++++++--------
> 1 file changed, 38 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index dd386ad9458a..6a5c35a176d5 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -3026,7 +3026,8 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> }
>
> static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> - struct rdt_resource *r, struct rdtgroup *prgrp)
> + struct rdt_resource *r, struct rdtgroup *prgrp,
> + bool do_sum)
> {
> union mon_data_bits priv;
> struct mon_evt *mevt;
> @@ -3037,15 +3038,18 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> return -EPERM;
>
> priv.u.rid = r->rid;
> - priv.u.domid = d->hdr.id;
> + priv.u.domid = do_sum ? d->ci->id : d->hdr.id;
> + priv.u.sum = do_sum;
> list_for_each_entry(mevt, &r->evt_list, list) {
> priv.u.evtid = mevt->evtid;
> ret = mon_addfile(kn, mevt->name, priv.priv);
> if (ret)
> return ret;
>
> - if (is_mbm_event(mevt->evtid))
> + if (!do_sum && is_mbm_event(mevt->evtid)) {
> + rr.sumdomains = 0;
> mon_event_read(&rr, r, d, prgrp, mevt->evtid, true);
> + }
> }
>
> return 0;
> @@ -3055,23 +3059,42 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> struct rdt_mon_domain *d,
> struct rdt_resource *r, struct rdtgroup *prgrp)
> {
> - struct kernfs_node *kn;
> + struct kernfs_node *kn, *ckn;
> char name[32];
> + bool snc_mode;
> int ret;
>
> - sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
> - /* create the directory */
> - kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
> - if (IS_ERR(kn))
> - return PTR_ERR(kn);
> + snc_mode = r->mon_scope != RESCTRL_L3_CACHE;
> + sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
> + kn = kernfs_find_and_get_ns(parent_kn, name, NULL);
This new kernfs_find_and_get_ns() call takes a reference on the
kernfs_node before it is returned. Where is that new reference
released or why is it ok not to release the reference?
> + if (!kn) {
> + /* create the directory */
At this point this comment is not useful.
> + kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
> + if (IS_ERR(kn))
> + return PTR_ERR(kn);
>
> - ret = rdtgroup_kn_set_ugid(kn);
> - if (ret)
> - goto out_destroy;
> + ret = rdtgroup_kn_set_ugid(kn);
> + if (ret)
> + goto out_destroy;
> + ret = mon_add_all_files(kn, d, r, prgrp, snc_mode);
> + if (ret)
> + goto out_destroy;
> + }
>
> - ret = mon_add_all_files(kn, d, r, prgrp);
> - if (ret)
> - goto out_destroy;
> + if (snc_mode) {
> + sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
> + ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp);
> + if (IS_ERR(ckn))
> + goto out_destroy;
> +
> + ret = rdtgroup_kn_set_ugid(ckn);
> + if (ret)
> + goto out_destroy;
> +
> + ret = mon_add_all_files(ckn, d, r, prgrp, false);
> + if (ret)
> + goto out_destroy;
> + }
>
> kernfs_activate(kn);
> return 0;
Reinette
Hi Tony,
should "and enabling" be dropped from shortlog?
On 5/28/24 3:20 PM, Tony Luck wrote:
> There isn't a simple hardware bit that indicates whether a CPU is
> running in Sub-NUMA Cluster (SNC) mode. Infer the state by comparing
> the number of CPUs sharing the L3 cache with CPU0 to the number of CPUs in
> the same NUMA node as CPU0.
>
> If SNC mode is detected, print a single informational message to the
> console.
>
> Add the missing definition of pr_fmt() to monitor.c. This wasn't
> noticed before as there are only "can't happen" console messages
> from this file.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/monitor.c | 59 +++++++++++++++++++++++++++
> 1 file changed, 59 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index c7559735e33a..1c5162a68461 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -15,6 +15,8 @@
> * Software Developer Manual June 2016, volume 3, section 17.17.
> */
>
> +#define pr_fmt(fmt) "resctrl: " fmt
> +
> #include <linux/cpu.h>
> #include <linux/module.h>
> #include <linux/sizes.h>
> @@ -1095,6 +1097,61 @@ void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d)
> wrmsrl(MSR_RMID_SNC_CONFIG, val);
> }
>
> +/* CPU models that support MSR_RMID_SNC_CONFIG */
> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> + X86_MATCH_VFM(INTEL_ICELAKE_X, 0),
> + X86_MATCH_VFM(INTEL_SAPPHIRERAPIDS_X, 0),
> + X86_MATCH_VFM(INTEL_EMERALDRAPIDS_X, 0),
> + X86_MATCH_VFM(INTEL_GRANITERAPIDS_X, 0),
> + X86_MATCH_VFM(INTEL_ATOM_CRESTMONT_X, 0),
> + {}
> +};
> +
> +/*
> + * There isn't a simple hardware bit that indicates whether a CPU is running
> + * in Sub-NUMA Cluster (SNC) mode. Infer the state by comparing the
> + * number of CPUs sharing the L3 cache with CPU0 to the number of CPUs in
> + * the same NUMA node as CPU0.
> + * It is not possible to accurately determine SNC state if the system is
> + * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
> + * to L3 caches. It will be OK if system is booted with hyperthreading
> + * disabled (since this doesn't affect the ratio).
> + */
> +static __init int snc_get_config(void)
> +{
> + struct cacheinfo *ci = get_cpu_cacheinfo_level(0, RESCTRL_L3_CACHE);
> + const cpumask_t *node0_cpumask;
Stray tab
> + int ret;
> +
> + if (!x86_match_cpu(snc_cpu_ids) || !ci)
> + return 1;
> +
> + cpus_read_lock();
> + if (num_online_cpus() != num_present_cpus())
> + pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
> + cpus_read_unlock();
> +
> + node0_cpumask = cpumask_of_node(cpu_to_node(0));
> +
> + ret = cpumask_weight(&ci->shared_cpu_map) / cpumask_weight(node0_cpumask);
It is not obvious to the static checker I tried that cpumask_weight(node0_cpumask)
cannot be zero. Can you please insert a check to make static checkers happy?
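Something like the following user-space sketch is what I have in mind. snc_nodes_from_weights() is a hypothetical helper that takes the two CPU counts as plain integers instead of calling cpumask_weight(), and it folds in the 1..4 sanity check from the switch below:

```c
#include <assert.h>

/*
 * Hypothetical helper: derive the number of SNC nodes per L3 cache.
 * l3_cpus:   CPUs sharing the L3 cache with CPU0
 * node_cpus: CPUs in the same NUMA node as CPU0
 * Returns 1 (SNC not detected) for a zero divisor or for a ratio
 * outside the range 1..4.
 */
int snc_nodes_from_weights(unsigned int l3_cpus, unsigned int node_cpus)
{
	unsigned int ratio;

	/* Explicit guard so that the division below is provably safe. */
	if (!node_cpus)
		return 1;

	ratio = l3_cpus / node_cpus;
	if (ratio < 1 || ratio > 4)
		return 1;

	return (int)ratio;
}
```

The early return costs nothing at init time and removes the divide-by-zero report.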
> +
> + /* sanity check: Only valid results are 1, 2, 3, 4 */
> + switch (ret) {
> + case 1:
> + break;
> + case 2 ... 4:
> + pr_info("Sub-NUMA Cluster mode detected with %d nodes per L3 cache\n", ret);
> + rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_L3_NODE;
> + break;
> + default:
> + pr_warn("Ignore improbable SNC node count %d\n", ret);
> + ret = 1;
> + break;
> + }
> +
> + return ret;
> +}
> +
> int __init rdt_get_mon_l3_config(struct rdt_resource *r)
> {
> unsigned int mbm_offset = boot_cpu_data.x86_cache_mbm_width_offset;
> @@ -1102,6 +1159,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
> unsigned int threshold;
> int ret;
>
> + snc_nodes_per_l3_cache = snc_get_config();
> +
> resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
> hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
> r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
Reinette
Hi Tony,
On 5/28/24 3:20 PM, Tony Luck wrote:
> For backwards compatibility on Sub-NUMA Cluster (SNC) systems the legacy
> files in the mon_L3_XX directories must report the sum of data from each
> SNC node sharing that L3 cache instance.
>
> To make this possible, pass the "sumdomains" and "ci" fields from
> rmid_read structure as extra arguments to resctrl_arch_rmid_read().
>
> Note that the call from check_limbo() never operates on a "sum" basis,
> so pass sumdomains=false, ci=NULL.
Why is passing "sumdomains" necessary? Can it not be inferred from
domain being NULL?
Reinette
Hi Tony,
On 5/28/24 3:19 PM, Tony Luck wrote:
> mon_event_read() fills out most fields of the struct rmid_read that is
> passed via an smp_call*() function to a CPU that is part of the correct
> domain to read the monitor counters.
>
> The one exception is the sumdomains field that is set by the caller
> (either rdtgroup_mondata_show() or mon_add_all_files()).
>
> When rmid_read.sumdomains is false:
> The domain field "d" specifies the only domain to read
> CPU to execute is chosen from d->hdr.cpu_mask
>
> When rmid_read.sumdomains is true:
> The domain field is NULL.
> The cache_info field indicates that all domains
> that are part of that cache instance should be
> summed.
> CPU to execute is chosen from ci->shared_cpu_mask
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 28 ++++++++++++++++++-----
> 1 file changed, 22 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 3b9383612c35..4e394400e575 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -517,6 +517,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
> int evtid, int first)
> {
> + cpumask_t *cpumask;
> int cpu;
>
> /* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */
> @@ -537,7 +538,8 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> return;
> }
>
> - cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask, RESCTRL_PICK_ANY_CPU);
> + cpumask = rr->sumdomains ? &rr->ci->shared_cpu_map : &d->hdr.cpu_mask;
> + cpu = cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU);
>
> /*
> * cpumask_any_housekeeping() prefers housekeeping CPUs, but
> @@ -546,7 +548,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> * counters on some platforms if its called in IRQ context.
> */
> if (tick_nohz_full_cpu(cpu))
> - smp_call_function_any(&d->hdr.cpu_mask, mon_event_count, rr, 1);
> + smp_call_function_any(cpumask, mon_event_count, rr, 1);
> else
> smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
>
Why not provide the cpumask as parameter to mon_event_read() as I asked
about in previous version (again feedback that was totally ignored)? With
this implementation there is a portion of SNC logic in rdtgroup_mondata_show()
and another portion of logic in mon_event_read(). Scattering SNC quirk logic like
this makes this very hard to understand.
To help with understanding this code I asked _twice_ ([1] and [2]) to
_at_least_ provide comments for these SNC branches but even a request for comments
is totally ignored. I even provided some comments based on my understanding
that could have just been copied&pasted (if it was correct), but that was ignored
also. I understand this work has been taking a while and to support this I am
trying to spend more time reviewing and providing more detailed feedback, but to
find it ignored over and over is extremely frustrating and wastes
so much of my time. I do not expect you to do everything as I propose, but if
I propose something silly then please point it out so that I can learn. At this
point I am convinced that you find my feedback not worth responding
to, leaving us stuck with me trying to review your work and getting
ignored over and over in every new version.
What should I do Tony? Why should I keep reviewing this work? Asking to address
review feedback should not be necessary yet I seem to keep doing it. An attempt
at an ultimatum was futile since it was just dodged [3] with burden placed right
back on me.
> @@ -575,15 +577,29 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> resid = md.u.rid;
> domid = md.u.domid;
> evtid = md.u.evtid;
> -
> + rr.sumdomains = md.u.sum;
> r = &rdt_resources_all[resid].r_resctrl;
> - hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
> - if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
> +
> + if (rr.sumdomains) {
/* Explain what this does and why */
> + list_for_each_entry(d, &r->mon_domains, hdr.list) {
> + if (d->ci->id == domid) {
> + rr.ci = d->ci;
> + d = NULL;
> + goto got_cacheinfo;
> + }
> + }
> ret = -ENOENT;
> goto out;
> + } else {
> + hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
> + if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
> + ret = -ENOENT;
> + goto out;
> + }
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> }
> - d = container_of(hdr, struct rdt_mon_domain, hdr);
>
> +got_cacheinfo:
Something like "read_event" would be more appropriate since "got_cacheinfo"
has no relevance to non-SNC.
> mon_event_read(&rr, r, d, rdtgrp, evtid, false);
>
> if (rr.err == -EIO)
Reinette
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lore.kernel.org/lkml/[email protected]/
On 5/28/24 3:20 PM, Tony Luck wrote:
> With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
> per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
> their name refer to Sub-NUMA nodes instead of L3 cache ids.
(insert feedback from v18)
>
> Users should be aware that SNC mode also affects the amount of L3 cache
> available for allocation within each SNC node.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> Documentation/arch/x86/resctrl.rst | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index 627e23869bca..401f6bfb4a3c 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -375,6 +375,10 @@ When monitoring is enabled all MON groups will also contain:
> all tasks in the group. In CTRL_MON groups these files provide
> the sum for all tasks in the CTRL_MON group and all tasks in
> MON groups. Please see example section for more details on usage.
> + On systems with Sub-NUMA (SNC) cluster enabled there are extra
(insert feedback from v18)
> + directories for each node (located within the "mon_L3_XX" directory
> + for the L3 cache they occupy). These are named "mon_sub_L3_YY"
> + where "YY" is the node number.
>
> "mon_hw_id":
> Available only with debug option. The identifier used by hardware
> @@ -484,6 +488,19 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
> each bit represents 5% of the capacity of the cache. You could partition
> the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
>
> +Notes on Sub-NUMA Cluster mode
> +==============================
> +When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
> +nodes much more readily than between regular NUMA nodes since the CPUs
> +on Sub-NUMA nodes share the same L3 cache and the system may report
> +the NUMA distance between Sub-NUMA nodes with a lower value than used
> +for regular NUMA nodes.
> +The top-level monitoring files in each "mon_L3_XX" directory provide
> +the sum of data across all SNC nodes sharing an L3 cache instance.
> +Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
> +the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the
> +"mon_sub_L3_YY" directories to get node local data.
> +
> Memory bandwidth Allocation and monitoring
> ==========================================
>
> btw ... the static checker I ran did have four other complaints, three about
> uninitialized data and one about divide by zero. Most problems appear to be
> in mbm_update() that does not initialize rr.sumdomains nor rr.ci before
> calling __mon_event_count().
>
> Please use available tools to check code before posting.
Which static checker? I tried smatch and it only finds one:
arch/x86/kernel/cpu/resctrl/rdtgroup.c:3129 mkdir_mondata_subdir() error: uninitialized symbol 'ret'.
Which is a false positive (but I don't fault smatch for not understanding the relationship
between the two "if" blocks in this function).
-Tony
Hi Tony,
On 5/30/24 3:49 PM, Luck, Tony wrote:
>> btw ... the static checker I ran did have four other complaints, three about
>> uninitialized data and one about divide by zero. Most problems appear to be
>> in mbm_update() that does not initialize rr.sumdomains nor rr.ci before
>> calling __mon_event_count().
>>
>> Please use available tools to check code before posting.
>
> Which static checker? I tried smatch and it only finds one:
I used coverity this time.
Reinette
On Thu, May 30, 2024 at 01:21:01PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 5/28/24 3:19 PM, Tony Luck wrote:
> > When SNC is enabled monitoring data is collected at the SNC node
> > granularity, but must be reported at L3-cache granularity for
> > backwards compatibility in addition to reporting at the node
> > level.
> >
> > Add a "ci" field to the rdt_mon_domain structure to save the
> > cache information about the enclosing L3 cache for the domain.
> > This provides:
> >
> > 1) The cache id which is needed to compose the name of the legacy
> > monitoring directory, and to determine which domains should be
> > summed to provide L3-scoped data.
> >
> > 2) The shared_cpu_map which is needed to determine which CPUs can
> > be used to read the RMID counters with the MSR interface.
> >
> > This is the first step to an eventual goal of monitor reporting files
> > like this (for a system with two SNC nodes per L3):
> >
> > $ cd /sys/fs/resctrl/mon_data
> > $ tree mon_L3_00
> > mon_L3_00 <- 00 here is L3 cache id
> > ├── llc_occupancy \ These files provide legacy support
> > ├── mbm_local_bytes > for non-SNC aware monitor apps
> > ├── mbm_total_bytes / that expect data at L3 cache level
> > ├── mon_sub_L3_00 <- 00 here is SNC node id
> > │  ├── llc_occupancy \ These files are finer grained
> > │  ├── mbm_local_bytes > data from each SNC node
> > │  └── mbm_total_bytes /
> > └── mon_sub_L3_01
> > ├── llc_occupancy \
> > ├── mbm_local_bytes > As above, but for node 1.
> > └── mbm_total_bytes /
> >
> > Signed-off-by: Tony Luck <[email protected]>
> > ---
> > include/linux/resctrl.h | 2 ++
> > arch/x86/kernel/cpu/resctrl/internal.h | 21 +++++++++++++++++++++
> > arch/x86/kernel/cpu/resctrl/core.c | 7 ++++++-
> > arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 1 -
> > arch/x86/kernel/cpu/resctrl/rdtgroup.c | 1 -
> > 5 files changed, 29 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> > index 64b6ad1b22a1..d733e1f6485d 100644
> > --- a/include/linux/resctrl.h
> > +++ b/include/linux/resctrl.h
> > @@ -96,6 +96,7 @@ struct rdt_ctrl_domain {
> > /**
> > * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
> > * @hdr: common header for different domain types
> > + * @ci: cache info for this domain
> > * @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
> > * @mbm_total: saved state for MBM total bandwidth
> > * @mbm_local: saved state for MBM local bandwidth
> > @@ -106,6 +107,7 @@ struct rdt_ctrl_domain {
> > */
> > struct rdt_mon_domain {
> > struct rdt_domain_hdr hdr;
> > + struct cacheinfo *ci;
> > unsigned long *rmid_busy_llc;
> > struct mbm_state *mbm_total;
> > struct mbm_state *mbm_local;
> > diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> > index 135190e0711c..eb70d3136ced 100644
> > --- a/arch/x86/kernel/cpu/resctrl/internal.h
> > +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> > @@ -2,6 +2,7 @@
> > #ifndef _ASM_X86_RESCTRL_INTERNAL_H
> > #define _ASM_X86_RESCTRL_INTERNAL_H
> > +#include <linux/cacheinfo.h>
> > #include <linux/resctrl.h>
> > #include <linux/sched.h>
> > #include <linux/kernfs.h>
> > @@ -509,6 +510,26 @@ static inline bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)
> > int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
> > +/*
> > + * Get the cacheinfo structure of the cache associated with @cpu at level @level.
> > + * cpuhp lock must be held.
> > + */
> > +static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
> > +{
> > + struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
> > + int i;
> > +
> > + for (i = 0; i < ci->num_leaves; i++) {
> > + if (ci->info_list[i].level == level) {
> > + if (ci->info_list[i].attributes & CACHE_ID)
> > + return &ci->info_list[i];
> > + break;
> > + }
> > + }
> > +
> > + return NULL;
> > +}
> > +
>
> This does not belong in resctrl. It really looks to partner well with existing
> cache helpers in include/linux/cacheinfo.h that already contains get_cpu_cacheinfo_id().
> Considering the existing naming get_cpu_cacheinfo() may be more appropriate.
Reinette,
The name get_cpu_cacheinfo() already exists and does something different
(returns a "struct cpu_cacheinfo *" rather than a "struct cacheinfo *").
How does this look for the change to <linux/cacheinfo.h> ... add a new
function get_cpu_cacheinfo_level() and then use it as a helper for the
existing get_cpu_cacheinfo_id()
-Tony
---
diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index 2cb15fe4fe12..301b0b24f446 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -113,10 +113,10 @@ int acpi_get_cache_info(unsigned int cpu,
const struct attribute_group *cache_get_priv_group(struct cacheinfo *this_leaf);
/*
- * Get the id of the cache associated with @cpu at level @level.
+ * Get the cacheinfo structure for cache associated with @cpu at level @level.
* cpuhp lock must be held.
*/
-static inline int get_cpu_cacheinfo_id(int cpu, int level)
+static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
{
struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
int i;
@@ -124,12 +124,23 @@ static inline int get_cpu_cacheinfo_id(int cpu, int level)
for (i = 0; i < ci->num_leaves; i++) {
if (ci->info_list[i].level == level) {
if (ci->info_list[i].attributes & CACHE_ID)
- return ci->info_list[i].id;
- return -1;
+ return &ci->info_list[i];
+ return NULL;
}
}
- return -1;
+ return NULL;
+}
+
+/*
+ * Get the id of the cache associated with @cpu at level @level.
+ * cpuhp lock must be held.
+ */
+static inline int get_cpu_cacheinfo_id(int cpu, int level)
+{
+ struct cacheinfo *ci = get_cpu_cacheinfo_level(cpu, level);
+
+ return ci ? ci->id : -1;
}
#ifdef CONFIG_ARM64
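The refactoring proposed in the diff above can be sketched in userspace as follows. The mock_* structures and MOCK_CACHE_ID are simplified, hypothetical stand-ins for struct cacheinfo, struct cpu_cacheinfo and CACHE_ID, and the lookup takes the per-CPU structure directly instead of a CPU number:

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_CACHE_ID 0x1u	/* stand-in for the CACHE_ID attribute bit */

struct mock_cacheinfo {
	unsigned int id;
	unsigned int level;
	unsigned int attributes;
};

struct mock_cpu_cacheinfo {
	struct mock_cacheinfo *info_list;
	unsigned int num_leaves;
};

/* Return the first leaf at @level if it carries a valid id, else NULL. */
static struct mock_cacheinfo *
mock_cacheinfo_level(struct mock_cpu_cacheinfo *cci, unsigned int level)
{
	unsigned int i;

	for (i = 0; i < cci->num_leaves; i++) {
		if (cci->info_list[i].level == level) {
			if (cci->info_list[i].attributes & MOCK_CACHE_ID)
				return &cci->info_list[i];
			return NULL;
		}
	}

	return NULL;
}

/* The id lookup becomes a thin wrapper around the pointer lookup. */
static int mock_cacheinfo_id(struct mock_cpu_cacheinfo *cci, unsigned int level)
{
	struct mock_cacheinfo *ci = mock_cacheinfo_level(cci, level);

	return ci ? (int)ci->id : -1;
}

static void cacheinfo_example(void)
{
	struct mock_cacheinfo leaves[] = {
		{ .id = 7, .level = 2, .attributes = 0 },
		{ .id = 3, .level = 3, .attributes = MOCK_CACHE_ID },
	};
	struct mock_cpu_cacheinfo cci = { .info_list = leaves, .num_leaves = 2 };

	assert(mock_cacheinfo_level(&cci, 3) == &leaves[1]);
	assert(mock_cacheinfo_id(&cci, 3) == 3);
	assert(mock_cacheinfo_id(&cci, 2) == -1);	/* level present, no valid id */
	assert(mock_cacheinfo_id(&cci, 1) == -1);	/* level absent */
}
```

Callers that previously received -1 still do, so the wrapper should preserve the behavior of the existing id helper while exposing the full leaf to callers that need more than the id.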
Hi Tony,
On 5/30/24 5:26 PM, Tony Luck wrote:
> On Thu, May 30, 2024 at 01:21:01PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 5/28/24 3:19 PM, Tony Luck wrote:
>>> When SNC is enabled monitoring data is collected at the SNC node
>>> granularity, but must be reported at L3-cache granularity for
>>> backwards compatibility in addition to reporting at the node
>>> level.
>>>
>>> Add a "ci" field to the rdt_mon_domain structure to save the
>>> cache information about the enclosing L3 cache for the domain.
>>> This provides:
>>>
>>> 1) The cache id which is needed to compose the name of the legacy
>>> monitoring directory, and to determine which domains should be
>>> summed to provide L3-scoped data.
>>>
>>> 2) The shared_cpu_map which is needed to determine which CPUs can
>>> be used to read the RMID counters with the MSR interface.
>>>
>>> This is the first step to an eventual goal of monitor reporting files
>>> like this (for a system with two SNC nodes per L3):
>>>
>>> $ cd /sys/fs/resctrl/mon_data
>>> $ tree mon_L3_00
>>> mon_L3_00 <- 00 here is L3 cache id
>>> ├── llc_occupancy \ These files provide legacy support
>>> ├── mbm_local_bytes > for non-SNC aware monitor apps
>>> ├── mbm_total_bytes / that expect data at L3 cache level
>>> ├── mon_sub_L3_00 <- 00 here is SNC node id
>>> │  ├── llc_occupancy \ These files are finer grained
>>> │  ├── mbm_local_bytes > data from each SNC node
>>> │  └── mbm_total_bytes /
>>> └── mon_sub_L3_01
>>> ├── llc_occupancy \
>>> ├── mbm_local_bytes > As above, but for node 1.
>>> └── mbm_total_bytes /
>>>
>>> Signed-off-by: Tony Luck <[email protected]>
>>> ---
>>> include/linux/resctrl.h | 2 ++
>>> arch/x86/kernel/cpu/resctrl/internal.h | 21 +++++++++++++++++++++
>>> arch/x86/kernel/cpu/resctrl/core.c | 7 ++++++-
>>> arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 1 -
>>> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 1 -
>>> 5 files changed, 29 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>> index 64b6ad1b22a1..d733e1f6485d 100644
>>> --- a/include/linux/resctrl.h
>>> +++ b/include/linux/resctrl.h
>>> @@ -96,6 +96,7 @@ struct rdt_ctrl_domain {
>>> /**
>>> * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
>>> * @hdr: common header for different domain types
>>> + * @ci: cache info for this domain
>>> * @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
>>> * @mbm_total: saved state for MBM total bandwidth
>>> * @mbm_local: saved state for MBM local bandwidth
>>> @@ -106,6 +107,7 @@ struct rdt_ctrl_domain {
>>> */
>>> struct rdt_mon_domain {
>>> struct rdt_domain_hdr hdr;
>>> + struct cacheinfo *ci;
>>> unsigned long *rmid_busy_llc;
>>> struct mbm_state *mbm_total;
>>> struct mbm_state *mbm_local;
>>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>>> index 135190e0711c..eb70d3136ced 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>>> @@ -2,6 +2,7 @@
>>> #ifndef _ASM_X86_RESCTRL_INTERNAL_H
>>> #define _ASM_X86_RESCTRL_INTERNAL_H
>>> +#include <linux/cacheinfo.h>
>>> #include <linux/resctrl.h>
>>> #include <linux/sched.h>
>>> #include <linux/kernfs.h>
>>> @@ -509,6 +510,26 @@ static inline bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)
>>> int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
>>> +/*
>>> + * Get the cacheinfo structure of the cache associated with @cpu at level @level.
>>> + * cpuhp lock must be held.
>>> + */
>>> +static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
>>> +{
>>> + struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
>>> + int i;
>>> +
>>> + for (i = 0; i < ci->num_leaves; i++) {
>>> + if (ci->info_list[i].level == level) {
>>> + if (ci->info_list[i].attributes & CACHE_ID)
>>> + return &ci->info_list[i];
>>> + break;
>>> + }
>>> + }
>>> +
>>> + return NULL;
>>> +}
>>> +
>>
>> This does not belong in resctrl. It really looks to partner well with existing
>> cache helpers in include/linux/cacheinfo.h that already contains get_cpu_cacheinfo_id().
>> Considering the existing naming get_cpu_cacheinfo() may be more appropriate.
>
> Reinette,
>
> The name get_cpu_cacheinfo() already exists and does something different
> (returns a "struct cpu_cacheinfo *" rather than a "struct cacheinfo *").
Indeed, it is even used by get_cpu_cacheinfo_id() as well as a few
other places in resctrl.
>
> How does this look for the change to <linux/cacheinfo.h> ... add a new
> function get_cpu_cacheinfo_level() and then use it as a helper for the
> existing get_cpu_cacheinfo_id()
>
> -Tony
>
> ---
>
> diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
> index 2cb15fe4fe12..301b0b24f446 100644
> --- a/include/linux/cacheinfo.h
> +++ b/include/linux/cacheinfo.h
> @@ -113,10 +113,10 @@ int acpi_get_cache_info(unsigned int cpu,
> const struct attribute_group *cache_get_priv_group(struct cacheinfo *this_leaf);
>
> /*
> - * Get the id of the cache associated with @cpu at level @level.
> + * Get the cacheinfo structure for cache associated with @cpu at level @level.
> * cpuhp lock must be held.
> */
> -static inline int get_cpu_cacheinfo_id(int cpu, int level)
> +static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
> {
> struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
> int i;
> @@ -124,12 +124,23 @@ static inline int get_cpu_cacheinfo_id(int cpu, int level)
> for (i = 0; i < ci->num_leaves; i++) {
> if (ci->info_list[i].level == level) {
> if (ci->info_list[i].attributes & CACHE_ID)
> - return ci->info_list[i].id;
> - return -1;
> + return &ci->info_list[i];
> + return NULL;
> }
> }
>
> - return -1;
> + return NULL;
> +}
> +
> +/*
> + * Get the id of the cache associated with @cpu at level @level.
> + * cpuhp lock must be held.
> + */
> +static inline int get_cpu_cacheinfo_id(int cpu, int level)
> +{
> + struct cacheinfo *ci = get_cpu_cacheinfo_level(cpu, level);
> +
> + return ci ? ci->id : -1;
> }
>
> #ifdef CONFIG_ARM64
This looks useful to me from resctrl perspective. Having it can already
benefit resctrl by replacing the open coded ones in pseudo_lock_region_init()
and rdtgroup_cbm_to_size().
We have previously [1] been able to include a cacheinfo change in a resctrl
series by clearly identifying it as such.
Reinette
[1] https://lore.kernel.org/all/[email protected]/
On Thu, May 30, 2024 at 01:20:39PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 5/28/24 3:19 PM, Tony Luck wrote:
> > Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
> > and memory controllers on a socket into two or more groups. These are
> > presented to the operating system as NUMA nodes.
> >
> > This may enable some workloads to have slightly lower latency to memory
> > as the memory controller(s) in an SNC node are electrically closer to the
> > CPU cores on that SNC node. This cost may be offset by lower bandwidth
> > since the memory accesses for each core can only be interleaved between
> > the memory controllers on the same SNC node.
> >
> > Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
> > to track L3 cache occupancy and memory bandwidth. There is an MSR that
> > controls how the RMIDs are shared between SNC nodes.
> >
> > The default mode divides them numerically. E.g. when there are two SNC
> > nodes on a socket the lower number half of the RMIDs are given to the
> > first node, the remainder to the second node. This would be difficult
> > to use with the Linux resctrl interface as specific RMID values assigned
> > to resctrl groups are not visible to users.
> >
> > The other mode divides the RMIDs and renumbers the ones on the second
> > SNC node to start from zero.
> >
> > Even with this renumbering SNC mode requires several changes in resctrl
> > behavior for correct operation.
> >
> > Add a static global to arch/x86/kernel/cpu/resctrl/monitor.c to indicate
> > how many SNC domains share an L3 cache instance. Initialize this to
> > "1". Runtime detection of SNC mode will adjust this value.
> >
> > Update all places to take appropriate action when SNC mode is enabled:
> > 1) The number of logical RMIDs per L3 cache available for use is the
> > number of physical RMIDs divided by the number of SNC nodes.
> > 2) Likewise the "mon_scale" value must be divided by the number of SNC
> > nodes.
> > 3) Add a function to convert from logical RMID values (assigned to
> > tasks and loaded into the IA32_PQR_ASSOC MSR on context switch)
> > to physical RMID values to load into IA32_QM_EVTSEL MSR when
> > reading counters on each SNC node.
> >
> > Signed-off-by: Tony Luck <[email protected]>
> > ---
> > arch/x86/kernel/cpu/resctrl/monitor.c | 37 ++++++++++++++++++++++++---
> > 1 file changed, 33 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> > index 89d7e6fcbaa1..b9b4d2b5ca82 100644
> > --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> > +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> > @@ -97,6 +97,8 @@ unsigned int resctrl_rmid_realloc_limit;
> > #define CF(cf) ((unsigned long)(1048576 * (cf) + 0.5))
> > +static int snc_nodes_per_l3_cache = 1;
> > +
> > /*
> > * The correction factor table is documented in Documentation/arch/x86/resctrl.rst.
> > * If rmid > rmid threshold, MBM total and local values should be multiplied
> > @@ -185,10 +187,37 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
> > return entry;
> > }
> > -static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> > +/*
> > + * When Sub-NUMA Cluster (SNC) mode is not enabled, the physical RMID
> > + * is the same as the logical RMID.
> > + *
> > + * When SNC mode is enabled the physical RMIDs are distributed across
> > + * the SNC nodes. E.g. with two SNC nodes per L3 cache and 200 physical
> > + * RMIDs are divided with 0..99 on the first node and 100..199 on
> > + * the second node. Compute the value of the physical RMID to pass to
> > + * resctrl_arch_rmid_read().
>
> Please stop rushing version after version. I do not think you read the
> above after you wrote it. The sentences run into each other.
Re-written. Would you like to try reviewing these patches one at a time
as I fix them? That will:
a) Slow me down.
b) Avoid me building subsequent patches on earlier mistakes.
c) Give you bite-sized chunks to review in each sitting (I think
the overall direction of the series is well enough understood
at this point).
I've attached the updated version of patch 6 at the end of this e-mail.
> Could this be specific about what is meant by "physical" and "logical" RMID?
> To me "physical RMID" implies the RMID used by hardware and "logical RMID"
> is the RMID used by software ... but when it comes to SNC it is actually:
> "physical RMID" - RMID used by MSR_IA32_QM_EVTSEL
> "logical RMID" - RMID used by software and the MSR_IA32_PQR_ASSOC register
>
> > + *
> > + * Caller is responsible to make sure execution running on a CPU in
>
> "is responsible" and "make sure" means the same, no?
>
> "make sure execution running"?
Also re-written.
> (Looking ahead in this series and coming back to this, this looks like
> rushed work that you in turn expect folks spend quality time reviewing.)
>
> > + * the domain to be read.
> > + */
> > +static int logical_rmid_to_physical_rmid(int lrmid)
> > +{
> > + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> > + int cpu = smp_processor_id();
> > +
> > + if (snc_nodes_per_l3_cache == 1)
> > + return lrmid;
> > +
> > + return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> > +}
> > +
> > +static int __rmid_read(u32 lrmid,
> > + enum resctrl_event_id eventid, u64 *val)
>
> This line does not need to be split.
Joined now. I also pulled in your suggestion from a later patch to
rename this __rmid_read_phys() and do the logical to physical RMID
translation at the two callsites.
> > {
> > u64 msr_val;
> > + int prmid;
> > + prmid = logical_rmid_to_physical_rmid(lrmid);
> > /*
> > * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
> > * with a valid event code for supported resource type and the bits
> > @@ -197,7 +226,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> > * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
> > * are error bits.
> > */
> > - wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> > + wrmsr(MSR_IA32_QM_EVTSEL, eventid, prmid);
> > rdmsrl(MSR_IA32_QM_CTR, msr_val);
> > if (msr_val & RMID_VAL_ERROR)
> > @@ -1022,8 +1051,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
> > int ret;
> > resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
> > - hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
> > - r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
> > + hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
> > + r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
> > hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
> > if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
>
> Reinette
-Tony
Proposed patch 6 with fixes applied. I didn't include a URL
to the RDT architecture spec I reference in the comment for
logical_rmid_to_physical_rmid() because Intel URLs are notoriously
unstable. But I did check that a web search finds the document based on
the title. With Google it was second hit for me. Bing lists it as first
result.
From ab33bacb9bf4dcf7b04310c1296b9dacddc4cd80 Mon Sep 17 00:00:00 2001
From: Tony Luck <[email protected]>
Date: Thu, 30 May 2024 09:45:35 -0700
Subject: [PATCH] x86/resctrl: Introduce snc_nodes_per_l3_cache
Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.
This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This benefit may be offset by lower bandwidth
since the memory accesses for each core can only be interleaved between
the memory controllers on the same SNC node.
Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.
The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket the lower number half of the RMIDs are given to the
first node, the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.
RMID sharing mode divides the RMIDs and renumbers the ones on the second
SNC node to start from zero.
Even with this renumbering SNC mode requires several changes in resctrl
behavior for correct operation.
Add a static global to arch/x86/kernel/cpu/resctrl/monitor.c to indicate
how many SNC domains share an L3 cache instance. Initialize this to
"1". Runtime detection of SNC mode will adjust this value.
Update all places to take appropriate action when SNC mode is enabled:
1) The number of logical RMIDs per L3 cache available for use is the
number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be divided by the number of SNC
nodes.
3) Add a function to convert from logical RMID values (assigned to
tasks and loaded into the IA32_PQR_ASSOC MSR on context switch)
to physical RMID values to load into IA32_QM_EVTSEL MSR when
reading counters on each SNC node.
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/monitor.c | 52 +++++++++++++++++++++++----
1 file changed, 46 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 89d7e6fcbaa1..0b05dfb5ab67 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -97,6 +97,8 @@ unsigned int resctrl_rmid_realloc_limit;
#define CF(cf) ((unsigned long)(1048576 * (cf) + 0.5))
+static int snc_nodes_per_l3_cache = 1;
+
/*
* The correction factor table is documented in Documentation/arch/x86/resctrl.rst.
* If rmid > rmid threshold, MBM total and local values should be multiplied
@@ -185,7 +187,39 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
return entry;
}
-static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
+/*
+ * When Sub-NUMA Cluster (SNC) mode is not enabled the RMID value
+ * loaded into IA32_PQR_ASSOC for the CPU to accumulate data is
+ * the same as the RMID value loaded into IA32_QM_EVTSEL to
+ * retrieve the current value of counters from IA32_QM_CTR.
+ *
+ * When SNC mode is enabled in RMID sharing mode there are fewer
+ * RMID values available to accumulate data (RMIDs are divided
+ * evenly between SNC nodes that share an L3 cache). Here we refer
+ * to the value loaded into IA32_PQR_ASSOC as the "logical RMID".
+ *
+ * Data is collected independently on each SNC node and can be retrieved
+ * using the "physical RMID" value computed by this function. The
+ * cpu argument can be any CPU in the SNC domain for the node.
+ *
+ * Note that the scope of the IA32_QM_EVTSEL and IA32_QM_CTR MSRs is
+ * still at the L3 cache scope. So a physical RMID may be read from any
+ * CPU that shares the L3 cache with the desired SNC node domain.
+ *
+ * For more details and examples see the "RMID Sharing Mode" section
+ * in the "Intel Resource Director Technology Architecture Specification".
+ */
+static int logical_rmid_to_physical_rmid(int cpu, int lrmid)
+{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+
+ if (snc_nodes_per_l3_cache == 1)
+ return lrmid;
+
+ return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+}
+
+static int __rmid_read_phys(u32 prmid, enum resctrl_event_id eventid, u64 *val)
{
u64 msr_val;
@@ -197,7 +231,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
* IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
* are error bits.
*/
- wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+ wrmsr(MSR_IA32_QM_EVTSEL, eventid, prmid);
rdmsrl(MSR_IA32_QM_CTR, msr_val);
if (msr_val & RMID_VAL_ERROR)
@@ -233,14 +267,17 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
enum resctrl_event_id eventid)
{
struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ int cpu = cpumask_any(&d->hdr.cpu_mask);
struct arch_mbm_state *am;
+ u32 prmid;
am = get_arch_mbm_state(hw_dom, rmid, eventid);
if (am) {
memset(am, 0, sizeof(*am));
+ prmid = logical_rmid_to_physical_rmid(cpu, rmid);
/* Record any initial, non-zero count value. */
- __rmid_read(rmid, eventid, &am->prev_msr);
+ __rmid_read_phys(prmid, eventid, &am->prev_msr);
}
}
@@ -275,8 +312,10 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
{
struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ int cpu = cpumask_any(&d->hdr.cpu_mask);
struct arch_mbm_state *am;
u64 msr_val, chunks;
+ u32 prmid;
int ret;
resctrl_arch_rmid_read_context_check();
@@ -284,7 +323,8 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
return -EINVAL;
- ret = __rmid_read(rmid, eventid, &msr_val);
+ prmid = logical_rmid_to_physical_rmid(cpu, rmid);
+ ret = __rmid_read_phys(prmid, eventid, &msr_val);
if (ret)
return ret;
@@ -1022,8 +1062,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
int ret;
resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
- hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
- r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+ hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+ r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
--
2.45.0
On Thu, May 30, 2024 at 01:24:57PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 5/28/24 3:20 PM, Tony Luck wrote:
> > Legacy resctrl monitor files must provide the sum of event values across
> > all Sub-NUMA Cluster (SNC) domains that share an L3 cache instance.
> >
> > Rename the existing resctrl_arch_rmid_read() function as
> > resctrl_arch_rmid_read_one() (with some small refactoring to drop
> > arguments that are not needed.
>
> Missing closing ")".
Will add.
> >
> > Create a new resctrl_arch_rmid_read() that iterates across
> > domains when necessary. Pass a CPU number from the right domain to
> > resctrl_arch_rmid_read_one().
>
> "when necessary"? Can you elaborate?
> "Pass a CPU ..." that just describes code that can be learned from patch.
> Please describe the changes not the code.
Will re-write.
> >
> > Signed-off-by: Tony Luck <[email protected]>
> > ---
> > arch/x86/kernel/cpu/resctrl/monitor.c | 41 ++++++++++++++++++++-------
> > 1 file changed, 31 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> > index 5f89ed4823ee..c9dd6ec68fcd 100644
> > --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> > +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> > @@ -200,10 +200,9 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
> > * Caller is responsible to make sure execution running on a CPU in
> > * the domain to be read.
> > */
> > -static int logical_rmid_to_physical_rmid(int lrmid)
> > +static int logical_rmid_to_physical_rmid(int cpu, int lrmid)
> > {
> > struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> > - int cpu = smp_processor_id();
> > if (snc_nodes_per_l3_cache == 1)
> > return lrmid;
> > @@ -211,13 +210,13 @@ static int logical_rmid_to_physical_rmid(int lrmid)
> > return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> > }
> > -static int __rmid_read(u32 lrmid,
> > +static int __rmid_read(int cpu, u32 lrmid,
> > enum resctrl_event_id eventid, u64 *val)
> > {
> > u64 msr_val;
> > int prmid;
> > - prmid = logical_rmid_to_physical_rmid(lrmid);
> > + prmid = logical_rmid_to_physical_rmid(cpu, lrmid);
> > /*
>
> Passing CPU as parameter to __rmid_read(), which is run via IPI, really
> obfuscates the code. How about renaming it to "__rmid_read_phys()"
> and provide it the "physical RMID" as parameter to make it clear what it is
> doing?
> That would mean an extra call to determine "physical RMID" before calling
> __rmid_read_phys() but making it clear that it needs a physical RMID should
> help to explain what is going on.
>
>
> > * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
> > * with a valid event code for supported resource type and the bits
> > @@ -269,7 +268,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
> > memset(am, 0, sizeof(*am));
> > /* Record any initial, non-zero count value. */
> > - __rmid_read(rmid, eventid, &am->prev_msr);
> > + __rmid_read(smp_processor_id(), rmid, eventid, &am->prev_msr);
> > }
> > }
> > @@ -298,9 +297,8 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
> > return chunks >> shift;
> > }
> > -int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
> > - u32 unused, u32 rmid, enum resctrl_event_id eventid,
> > - u64 *val, bool sum, struct cacheinfo *ci, void *ignored)
> > +static int resctrl_arch_rmid_read_one(struct rdt_resource *r, struct rdt_mon_domain *d,
> > + int cpu, u32 rmid, enum resctrl_event_id eventid, u64 *val)
> > {
> > struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
> > struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> > @@ -313,7 +311,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
> > if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
> > return -EINVAL;
> > - ret = __rmid_read(rmid, eventid, &msr_val);
> > + ret = __rmid_read(cpu, rmid, eventid, &msr_val);
> > if (ret)
> > return ret;
> > @@ -327,7 +325,30 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
> > chunks = msr_val;
> > }
> > - *val = chunks * hw_res->mon_scale;
> > + *val += chunks * hw_res->mon_scale;
>
> The various new layers of indirection with SNC logic scattered between them
> makes this change hard to understand.
>
> > +
> > + return 0;
> > +}
> > +
> > +int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
> > + u32 unused, u32 rmid, enum resctrl_event_id eventid,
> > + u64 *val, bool sum, struct cacheinfo *ci, void *ignored)
>
> This is not architecture specific code.
Can you explain further? I've dropped the "sum" argument. As you pointed
out elsewhere this can be inferred from "d == NULL". But I do need the
cacheinfo information in resctrl_arch_rmid_read() to:
1) determine which domains to sum (those that match ci->id).
2) sanity check code is executing on a CPU in ci->shared_cpu_map.
> > +{
> > + int cpu = smp_processor_id();
> > + int ret;
> > +
> > + *val = 0;
> > + if (!sum)
> > + return resctrl_arch_rmid_read_one(r, d, cpu, rmid, eventid, val);
> > +
>
> /* SNC quirk that needs to be documented */
>
> > + list_for_each_entry(d, &r->mon_domains, hdr.list) {
> > + if (d->ci->id != ci->id)
> > + continue;
> > + cpu = cpumask_any(&d->hdr.cpu_mask);
> > + ret = resctrl_arch_rmid_read_one(r, d, cpu, rmid, eventid, val);
>
> The cpu parameter can be dropped, no? With the domain provided to resctrl_arch_rmid_read_one()
> it is not necessary to again split the SNC logic (in this case the "reading from any CPU
> in the cache domain is ok but still need accurate arch state") across multiple layers,
> just contain this in (documented portion of) resctrl_arch_rmid_read_one().
You are correct. I will drop the cpu argument.
> > + if (ret)
> > + return ret;
> > + }
> > return 0;
> > }
>
> Reinette
-Tony
Hi Tony,
On 6/3/24 4:15 PM, Tony Luck wrote:
> On Thu, May 30, 2024 at 01:24:57PM -0700, Reinette Chatre wrote:
>> On 5/28/24 3:20 PM, Tony Luck wrote:
...
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>>> + u32 unused, u32 rmid, enum resctrl_event_id eventid,
>>> + u64 *val, bool sum, struct cacheinfo *ci, void *ignored)
>>
>> This is not architecture specific code.
>
> Can you explain further? I've dropped the "sum" argument. As you pointed
> out elsewhere this can be inferred from "d == NULL". But I do need the
> cacheinfo information in resctrl_arch_rmid_read() to:
> 1) determine which domains to sum (those that match ci->id).
> 2) sanity check code is executing on a CPU in ci->shared_cpu_map.
>
"resctrl_arch_*" is the prefix of functions needed to be implemented
by every architecture. As I understand there is nothing architecture
specific about what this function does and every architecture's function
would thus end up looking identical. I expected the cacheinfo
information to be available from all architectures. If this is not
the case then it does not belong in struct rdt_mon_domain but should
instead be moved to struct rdt_hw_mon_domain ... but since cacheinfo
has already made its way into the filesystem code it is not clear
to me how you envision the arch/fs split.
Reinette
Hi Tony,
On 5/31/24 11:17 AM, Tony Luck wrote:
> On Thu, May 30, 2024 at 01:20:39PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 5/28/24 3:19 PM, Tony Luck wrote:
>>> Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
>>> and memory controllers on a socket into two or more groups. These are
>>> presented to the operating system as NUMA nodes.
>>>
>>> This may enable some workloads to have slightly lower latency to memory
>>> as the memory controller(s) in an SNC node are electrically closer to the
>>> CPU cores on that SNC node. This cost may be offset by lower bandwidth
>>> since the memory accesses for each core can only be interleaved between
>>> the memory controllers on the same SNC node.
>>>
>>> Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
>>> to track L3 cache occupancy and memory bandwidth. There is an MSR that
>>> controls how the RMIDs are shared between SNC nodes.
>>>
>>> The default mode divides them numerically. E.g. when there are two SNC
>>> nodes on a socket the lower number half of the RMIDs are given to the
>>> first node, the remainder to the second node. This would be difficult
>>> to use with the Linux resctrl interface as specific RMID values assigned
>>> to resctrl groups are not visible to users.
>>>
>>> The other mode divides the RMIDs and renumbers the ones on the second
>>> SNC node to start from zero.
>>>
>>> Even with this renumbering SNC mode requires several changes in resctrl
>>> behavior for correct operation.
>>>
>>> Add a static global to arch/x86/kernel/cpu/resctrl/monitor.c to indicate
>>> how many SNC domains share an L3 cache instance. Initialize this to
>>> "1". Runtime detection of SNC mode will adjust this value.
>>>
>>> Update all places to take appropriate action when SNC mode is enabled:
>>> 1) The number of logical RMIDs per L3 cache available for use is the
>>> number of physical RMIDs divided by the number of SNC nodes.
>>> 2) Likewise the "mon_scale" value must be divided by the number of SNC
>>> nodes.
>>> 3) Add a function to convert from logical RMID values (assigned to
>>> tasks and loaded into the IA32_PQR_ASSOC MSR on context switch)
>>> to physical RMID values to load into IA32_QM_EVTSEL MSR when
>>> reading counters on each SNC node.
>>>
>>> Signed-off-by: Tony Luck <[email protected]>
>>> ---
>>> arch/x86/kernel/cpu/resctrl/monitor.c | 37 ++++++++++++++++++++++++---
>>> 1 file changed, 33 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>>> index 89d7e6fcbaa1..b9b4d2b5ca82 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>>> @@ -97,6 +97,8 @@ unsigned int resctrl_rmid_realloc_limit;
>>> #define CF(cf) ((unsigned long)(1048576 * (cf) + 0.5))
>>> +static int snc_nodes_per_l3_cache = 1;
>>> +
>>> /*
>>> * The correction factor table is documented in Documentation/arch/x86/resctrl.rst.
>>> * If rmid > rmid threshold, MBM total and local values should be multiplied
>>> @@ -185,10 +187,37 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
>>> return entry;
>>> }
>>> -static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>>> +/*
>>> + * When Sub-NUMA Cluster (SNC) mode is not enabled, the physical RMID
>>> + * is the same as the logical RMID.
>>> + *
>>> + * When SNC mode is enabled the physical RMIDs are distributed across
>>> + * the SNC nodes. E.g. with two SNC nodes per L3 cache and 200 physical
>>> + * RMIDs are divided with 0..99 on the first node and 100..199 on
>>> + * the second node. Compute the value of the physical RMID to pass to
>>> + * resctrl_arch_rmid_read().
>>
>> Please stop rushing version after version. I do not think you read the
>> above after you wrote it. The sentences run into each other.
>
> Re-written. Would you like to try reviewing these patches one at a time
> as I fix them? That will:
>
> a) Slow me down.
> b) Avoid me building subsequent patches on earlier mistakes.
> c) Give you bite-sized chunks to review in each sitting (I think
> the overall direction of the series is well enough understood
> at this point).
I've been thinking about this a lot ... but I am not able to understand
how you could draw your conclusions based on what I wrote.
I do not believe we need a new process. I find the basic patch
review process sufficient and would appreciate if that can be followed.
What I mean by this is: after a patch series is submitted it is reviewed,
any disagreement or clarification of review feedback is discussed during
review of that series, next series is submitted with all review feedback
addressed. Exceptions may be when there are such big changes that
a new version may be needed just to create a new baseline, but that does
not apply here.
> I've attached the updated version of patch 6 at the end of this e-mail.
>
>> Could this be specific about what is meant by "physical" and "logical" RMID?
>> To me "physical RMID" implies the RMID used by hardware and "logical RMID"
>> is the RMID used by software ... but when it comes to SNC it is actually:
>> "physical RMID" - RMID used by MSR_IA32_QM_EVTSEL
>> "logical RMID" - RMID used by software and the MSR_IA32_PQR_ASSOC register
>>
>>> + *
>>> + * Caller is responsible to make sure execution running on a CPU in
>>
>> "is responsible" and "make sure" means the same, no?
>>
>> "make sure execution running"?
>
> Also re-written.
>
>> (Looking ahead in this series and coming back to this, this looks like
>> rushed work that you in turn expect folks spend quality time reviewing.)
>>
>>> + * the domain to be read.
>>> + */
>>> +static int logical_rmid_to_physical_rmid(int lrmid)
>>> +{
>>> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>>> + int cpu = smp_processor_id();
>>> +
>>> + if (snc_nodes_per_l3_cache == 1)
>>> + return lrmid;
>>> +
>>> + return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
>>> +}
>>> +
>>> +static int __rmid_read(u32 lrmid,
>>> + enum resctrl_event_id eventid, u64 *val)
>>
>> This line does not need to be split.
>
> Joined now. I also pulled in your suggestion from a later patch to
> rename this __rmid_read_phys() and do the logical to physical RMID
> translation at the two callsites.
>
>>> {
>>> u64 msr_val;
>>> + int prmid;
>>> + prmid = logical_rmid_to_physical_rmid(lrmid);
>>> /*
>>> * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
>>> * with a valid event code for supported resource type and the bits
>>> @@ -197,7 +226,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>>> * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
>>> * are error bits.
>>> */
>>> - wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
>>> + wrmsr(MSR_IA32_QM_EVTSEL, eventid, prmid);
>>> rdmsrl(MSR_IA32_QM_CTR, msr_val);
>>> if (msr_val & RMID_VAL_ERROR)
>>> @@ -1022,8 +1051,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>>> int ret;
>>> resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
>>> - hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
>>> - r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
>>> + hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
>>> + r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
>>> hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
>>> if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
>>
>> Reinette
>
> -Tony
>
> Proposed v6 patch with fixes applied. I didn't include a URL
> to the RDT architecture spec I reference in the comment for
> logical_rmid_to_physical_rmid() because Intel URLs are notoriously
> unstable. But I did check that a web search finds the document based on
> the title. With Google it was second hit for me. Bing lists it as first
> result.
>
> From ab33bacb9bf4dcf7b04310c1296b9dacddc4cd80 Mon Sep 17 00:00:00 2001
> From: Tony Luck <[email protected]>
> Date: Thu, 30 May 2024 09:45:35 -0700
> Subject: [PATCH] x86/resctrl: Introduce snc_nodes_per_l3_cache
>
> Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
> and memory controllers on a socket into two or more groups. These are
> presented to the operating system as NUMA nodes.
>
> This may enable some workloads to have slightly lower latency to memory
> as the memory controller(s) in an SNC node are electrically closer to the
> CPU cores on that SNC node. This cost may be offset by lower bandwidth
> since the memory accesses for each core can only be interleaved between
> the memory controllers on the same SNC node.
>
> Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
> to track L3 cache occupancy and memory bandwidth. There is an MSR that
> controls how the RMIDs are shared between SNC nodes.
>
> The default mode divides them numerically. E.g. when there are two SNC
> nodes on a socket the lower number half of the RMIDs are given to the
> first node, the remainder to the second node. This would be difficult
> to use with the Linux resctrl interface as specific RMID values assigned
> to resctrl groups are not visible to users.
>
> RMID sahring mode divides the RMIDs and renumbers the ones on the second
sahring -> sharing
> SNC node to start from zero.
>
> Even with this renumbering SNC mode requires several changes in resctrl
> behavior for correct operation.
>
> Add a static global to arch/x86/kernel/cpu/resctrl/monitor.c to indicate
> how many SNC domains share an L3 cache instance. Initialize this to
> "1". Runtime detection of SNC mode will adjust this value.
>
> Update all places to take appropriate action when SNC mode is enabled:
> 1) The number of logical RMIDs per L3 cache available for use is the
> number of physical RMIDs divided by the number of SNC nodes.
> 2) Likewise the "mon_scale" value must be divided by the number of SNC
> nodes.
> 3) Add a function to convert from logical RMID values (assigned to
> tasks and loaded into the IA32_PQR_ASSOC MSR on context switch)
> to physical RMID values to load into IA32_QM_EVTSEL MSR when
> reading counters on each SNC node.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/monitor.c | 52 +++++++++++++++++++++++----
> 1 file changed, 46 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 89d7e6fcbaa1..0b05dfb5ab67 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -97,6 +97,8 @@ unsigned int resctrl_rmid_realloc_limit;
>
> #define CF(cf) ((unsigned long)(1048576 * (cf) + 0.5))
>
> +static int snc_nodes_per_l3_cache = 1;
> +
> /*
> * The correction factor table is documented in Documentation/arch/x86/resctrl.rst.
> * If rmid > rmid threshold, MBM total and local values should be multiplied
> @@ -185,7 +187,39 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
> return entry;
> }
>
> -static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> +/*
> + * When Sub-NUMA Cluster (SNC) mode is not enabled the RMID value
> + * loaded into IA32_PQR_ASSOC for the CPU to accumulate data is
> + * the same as the RMID value loaded into IA32_QM_EVTSEL to
> + * retrieve the current value of counters from IA32_QM_CTR.
The function is logical_rmid_to_physical_rmid() and it is called
whether SNC is enabled or not. The above summarizes expectations
when SNC is not enabled, but it never mentions "logical" or
"physical" RMID, so it is unclear what those terms mean without SNC.
> + *
> + * When SNC mode is enabled in RMID sharing mode there are fewer
What is the difference between "SNC mode" and "RMID sharing mode"?
Please provide explanations about what terms mean and then be
consistent in their use.
This new version seems to be the first time "RMID sharing mode"
is used but there is no explanation about what that means.
Only previous mention was in subject of patch #18 but the comment
of arch_mon_domain_online() has no mention of the term
"RMID sharing mode" in the place where that mode is actually
enabled.
> + * RMID values available to accumulate data (RMIDs are divided
> + * evenly between SNC nodes that share an L3 cache). Here we refer
> + * to the value loaded into IA32_PQR_ASSOC as the "logical RMID".
Please do not use "we".
> + *
> + * Data is collected independently on each SNC node and can be retrieved
> + * using the "physical RMID" value computed by this function. The
> + * cpu argument can be any CPU in the SNC domain for the node.
"The cpu argument" -> "@cpu"
"SNC domain for the node" - what does "the node" refer to in this context?
Later it becomes "SNC node domain"? So far this comment has "SNC node",
"SNC domain for the node", "SNC node domain" that all seem to refer to
the same thing. Inconsistent terminology adds unnecessary complication.
> + *
> + * Note that the scope of the IA32_QM_EVTSEL and IA32_QM_CTR MSRs is
> + * still at the L3 cache scope. So a physical RMID may be read from any
"The scope of the IA32_QM_EVTSEL and IA32_QM_CTR MSRs is at the L3 cache."
Pick if you want "physical RMID" in quotes or not and then stick to it.
> + * CPU that shares the L3 cache with the desired SNC node domain.
> + *
> + * For more details and examples see the "RMID Sharing Mode" section
> + * in the "Intel Resource Director Technology Architecture Specification".
> + */
Reinette
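
For readers following the logical/physical RMID discussion above, the
translation can be checked with a small standalone C sketch. This is not
kernel code: snc_nodes_per_l3_cache and num_rmid are plain variables set to
the "two SNC nodes, 200 physical RMIDs" example from the earlier version of
the comment, and cpu_to_node() is a hypothetical stand-in that assumes eight
CPUs per SNC node with node numbers contiguous within a socket.

```c
#include <assert.h>

/* Illustrative values only: two SNC nodes per L3, 200 physical RMIDs. */
static int snc_nodes_per_l3_cache = 2;
static int num_rmid = 200 / 2;	/* logical RMIDs usable per SNC node */

/* Hypothetical stand-in for the kernel's cpu_to_node(): 8 CPUs per node. */
static int cpu_to_node(int cpu)
{
	return cpu / 8;
}

/* Same arithmetic as the function in the patch, minus kernel types. */
static int logical_rmid_to_physical_rmid(int cpu, int lrmid)
{
	if (snc_nodes_per_l3_cache == 1)
		return lrmid;

	return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * num_rmid;
}

/* Small self-check covering both SNC nodes of two sockets. */
static void rmid_translation_selfcheck(void)
{
	assert(logical_rmid_to_physical_rmid(0, 5) == 5);	/* node 0 */
	assert(logical_rmid_to_physical_rmid(8, 5) == 105);	/* node 1 */
	assert(logical_rmid_to_physical_rmid(16, 5) == 5);	/* node 2 */
}
```

Under these assumptions logical RMID 5 is read as physical RMID 5 on the
first SNC node of each L3 cache and as physical RMID 105 on the second. The
modulo keeps the offset relative to the L3 cache, so the first node of a
second socket (node 2 here) maps back to offset zero.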
> Hi Tony,
>
> On 6/3/24 4:15 PM, Tony Luck wrote:
> > On Thu, May 30, 2024 at 01:24:57PM -0700, Reinette Chatre wrote:
> >> On 5/28/24 3:20 PM, Tony Luck wrote:
> ...
>
> >>> +
> >>> + return 0;
> >>> +}
> >>> +
> >>> +int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
> >>> + u32 unused, u32 rmid, enum resctrl_event_id eventid,
> >>> + u64 *val, bool sum, struct cacheinfo *ci, void *ignored)
> >>
> >> This is not architecture specific code.
> >
> > Can you explain further? I've dropped the "sum" argument. As you pointed
> > out elsewhere this can be inferred from "d == NULL". But I do need the
> > cacheinfo information in resctrl_arch_rmid_read() to:
> > 1) determine which domains to sum (those that match ci->id).
> > 2) sanity check code is executing on a CPU in ci->shared_cpu_map.
> >
>
> "resctrl_arch_*" is the prefix of functions needed to be implemented
> by every architecture. As I understand there is nothing architecture
> specific about what this function does and every architecture's function
> would thus end up looking identical. I expected the cacheinfo
> information to be available from all architectures. If this is not
> the case then it does not belong in struct rdt_mon_domain but should
> instead be moved to struct rdt_hw_mon_domain ... but since cacheinfo
> has already made its way into the filesystem code it is not clear
> to me how you envision the arch/fs split.
Hi Reinette,
Files in resctrl that sum over resources are going to be a necessary feature
for backwards compatibility. I'm doing it for the first time here for SNC, but
I know of another platform topology change on the horizon that could also
benefit from this.
Looking at the end-point of the James Morse/Dave Martin patch series to
split out the arch independent layer to fs/resctrl/ I see that fs/monitor.c
makes calls to resctrl_arch_rmid_read(). The x86 version of this remains
in arch/x86/kernel/cpu/resctrl/monitor.c (I don't see an MPAM version).
James already added two arguments that MPAM needs and x86 doesn't
(hence "u32 unused" and "void *ignored" in the argument list). I confess
that my thought had been "If he can pad out the argument list for MPAM,
then I can do it too for x86". But that leads to madness, so time to reconsider.
I can see a couple of paths.
1) MPAM/others will also want to have files that sum things, so maybe they want
an extra argument that shows what to add up. Though even if they do, their
requirement may not be met by a "cacheinfo" pointer.
2) Only x86 (Intel) will use this. Maybe in this case the answer is to do some
renaming so the "void *ignored" argument can be used to pass architecture
specific information.
Sketch for option #2. Currently the code does:
---------------------
That void *argument is currently supplied by a call. E.g.
void *arch_mon_ctx;
arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID);
resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
QOS_L3_OCCUP_EVENT_ID, &val,
arch_mon_ctx);
resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx);
The x86 version of resctrl_arch_mon_ctx_alloc() just does "might_sleep(); return NULL;"
and resctrl_arch_mon_ctx_free() does nothing.
New version makes these changes:
---------------------
Add rdt_mon_domain * as a new argument to resctrl_arch_mon_ctx_alloc() (which MPAM can
ignore).
x86 alloc function becomes:
void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, struct rdt_mon_domain *d, int evtid)
{
might_sleep();
return d->ci;
}
resctrl_arch_mon_ctx_free() remains an empty stub.
Is this a reasonable way to split the independent fs layer code from the architecture specific?
-Tony
Hi Tony,
On 6/7/24 12:51 PM, Luck, Tony wrote:
>> Hi Tony,
>>
>> On 6/3/24 4:15 PM, Tony Luck wrote:
>>> On Thu, May 30, 2024 at 01:24:57PM -0700, Reinette Chatre wrote:
>>>> On 5/28/24 3:20 PM, Tony Luck wrote:
>> ...
>>
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> +int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>>>>> + u32 unused, u32 rmid, enum resctrl_event_id eventid,
>>>>> + u64 *val, bool sum, struct cacheinfo *ci, void *ignored)
>>>>
>>>> This is not architecture specific code.
>>>
>>> Can you explain further? I've dropped the "sum" argument. As you pointed
>>> out elsewhere this can be inferred from "d == NULL". But I do need the
>>> cacheinfo information in resctrl_arch_rmid_read() to:
>>> 1) determine which domains to sum (those that match ci->id).
>>> 2) sanity check code is executing on a CPU in ci->shared_cpu_map.
>>>
>>
>> "resctrl_arch_*" is the prefix of functions needed to be implemented
>> by every architecture. As I understand there is nothing architecture
>> specific about what this function does and every architecture's function
>> would thus end up looking identical. I expected the cacheinfo
>> information to be available from all architectures. If this is not
>> the case then it does not belong in struct rdt_mon_domain but should
>> instead be moved to struct rdt_hw_mon_domain ... but since cacheinfo
>> has already made its way into the filesystem code it is not clear
>> to me how you envision the arch/fs split.
>
> Hi Reinette,
>
> Files in resctrl that sum over resources are going to be a necessary feature
Sum over resources? That is something entirely different from SNC, no?
> for backwards compatibility. I'm doing it for the first time here for SNC, but
> I know of another platform topology change on the horizon that could also
> benefit from this.
>
> Looking at the end-point of the James Morse/Dave Martin patch series to
> split out the arch independent layer to fs/resctrl/ I see that fs/monitor.c
> makes calls to resctrl_arch_rmid_read(). The x86 version of this remains
> in arch/x86/kernel/cpu/resctrl/monitor.c (I don't see an MPAM version).
>
> James already added two arguments that MPAM needs and x86 doesn't
> (hence "u32 unused" and "void *ignored" in the argument list). I confess
> that my thought had been "If he can pad out the argument list for MPAM,
> then I can do it too for x86". But that leads to madness, so time to reconsider.
>
> I can see a couple of paths.
>
> 1) MPAM/others will also want to have files that sum things, so maybe they want
> an extra argument that shows what to add up. Though even if they do, their
> requirement may not be met by a "cacheinfo" pointer.
>
> 2) Only x86 (Intel) will use this. Maybe in this case the answer is to do some
> renaming so the "void *ignored" argument can be used to pass architecture
> specific information.
By creating the new "sub" monitor files the sum of data has become a feature
of resctrl fs.
By including a pointer to struct cacheinfo in struct rdt_mon_domain as well as
struct rmid_read it surely is not just for Intel. If you want to make this just for
Intel then the whole solution needs to be changed.
>
> Sketch for option #2. Currently the code does:
> ---------------------
>
> That void *argument is currently supplied by a call. E.g.
>
> void *arch_mon_ctx;
>
> arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID);
>
>
> resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
> QOS_L3_OCCUP_EVENT_ID, &val,
> arch_mon_ctx);
>
> resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx);
>
> The x86 version of resctrl_arch_mon_ctx_alloc() just does "might_sleep(); return NULL;"
> and resctrl_arch_mon_ctx_free() does nothing.
>
> New version makes these changes:
> ---------------------
>
> Add rdt_mon_domain * as a new argument to resctrl_arch_mon_ctx_alloc() (which MPAM can
> ignore).
>
> x86 alloc function becomes:
>
> void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, struct rdt_mon_domain *d, int evtid)
> {
> might_sleep();
>
> return d->ci;
> }
>
> resctrl_arch_mon_ctx_free() remains an empty stub.
>
>
> Is this a reasonable way to split the independent fs layer code from the architecture specific?
Why is this necessary at all? The new variant of resctrl_arch_rmid_read() introduced in this
patch does not contain anything that is architecture specific and thus it is filesystem code.
It is this code that uses the information in struct cacheinfo to set the correct domain, if
needed. In this patch you rewrite resctrl_arch_rmid_read() as something architecture specific
but I cannot see what is architecture specific about it at all. Why not just call this new
function resctrl_rmid_read() and keep it in filesystem code? What you have in this
patch as resctrl_arch_rmid_read_one() would then be what is already known as the architecture
specific resctrl_arch_rmid_read(). It is the architecture specific RMID read function that
does not need struct cacheinfo.
Reinette
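
The split suggested above might look roughly like the following standalone C
sketch. It is only an illustration under stated assumptions, not the code
from the series: struct mon_domain, arch_rmid_read(), rmid_read(), the
demo_domains table and demo_sum() are all hypothetical, simplified stand-ins
(the real functions take a resource, RMID, event id and more). The point is
only the layering: the generic function decides which domains to visit by
matching the L3 cache id, and the per-domain read knows nothing about
cacheinfo.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct rdt_mon_domain. */
struct mon_domain {
	int cache_id;		/* L3 instance id (d->ci->id in the patch) */
	uint64_t counter;	/* pretend per-domain event count */
};

/* Stand-in for the architecture specific read of one domain. */
static int arch_rmid_read(const struct mon_domain *d, uint64_t *val)
{
	*val = d->counter;
	return 0;
}

/*
 * Stand-in for a filesystem-level read. With a domain it reads just that
 * domain; with d == NULL it sums every domain sharing cache_id.
 */
static int rmid_read(const struct mon_domain *doms, size_t n,
		     const struct mon_domain *d, int cache_id, uint64_t *val)
{
	uint64_t tmp;
	size_t i;
	int ret;

	*val = 0;
	if (d)
		return arch_rmid_read(d, val);

	for (i = 0; i < n; i++) {
		if (doms[i].cache_id != cache_id)
			continue;
		ret = arch_rmid_read(&doms[i], &tmp);
		if (ret)
			return ret;
		*val += tmp;
	}

	return 0;
}

/* Made-up data: two L3 caches, each split into two SNC domains. */
static const struct mon_domain demo_domains[] = {
	{ 0, 10 }, { 0, 20 }, { 1, 7 }, { 1, 8 },
};

/* Convenience wrapper for the summing case. */
static uint64_t demo_sum(int cache_id)
{
	uint64_t val;

	rmid_read(demo_domains, 4, NULL, cache_id, &val);
	return val;
}
```

With the made-up table, demo_sum(0) adds the two domains behind L3 cache 0
and demo_sum(1) adds the two behind L3 cache 1, while passing a specific
domain returns only that domain's count.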