2024-05-15 22:23:41

by Luck, Tony

Subject: [PATCH v18 00/17] Add support for Sub-NUMA cluster (SNC) systems

This series is based on top of Linus' upstream commit 33e02dc69afb ("Merge
tag 'sound-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound")

The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features. Prior to this
patch series, Intel advised that SNC and RDT were incompatible.

Some of these CPUs support an MSR that can partition the RMID counters
in the same way. This allows monitoring features to be used. Legacy
monitoring files provide the sum of counters from each SNC node for
backwards compatibility. Additional files per SNC node provide details
per node.
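
To illustrate, here is a minimal sketch of that summation (illustrative
only, not code from this series: snc_nodes_per_l3_cache is the count
introduced in patch 6, and read_one_snc_node() is a hypothetical helper
standing in for the real per-domain counter plumbing):

	/*
	 * Sum the event counts from every SNC node that shares one
	 * L3 cache, as the legacy per-L3 monitor files report.
	 */
	static int sum_snc_nodes(u32 rmid, enum resctrl_event_id evtid,
				 int first_node, u64 *val)
	{
		u64 node_val;
		int i, ret;

		*val = 0;
		for (i = 0; i < snc_nodes_per_l3_cache; i++) {
			/* read_one_snc_node() is hypothetical */
			ret = read_one_snc_node(rmid, evtid,
						first_node + i, &node_val);
			if (ret)
				return ret;
			*val += node_val;
		}

		return 0;
	}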

Cache and memory bandwidth allocation features continue to operate at
the scope of the L3 cache.

Signed-off-by: Tony Luck <[email protected]>

---
Changes since v17: https://lore.kernel.org/all/[email protected]/

Reinette: This is still using the per-domain display_id field as
discussed. Would a better name make the intent clearer?

Patch 7 in the previous version included virtually all of the new changes.
But that meant it was doing a lot of things in a single patch
(including reverting a dozen lines from patch 6!)

So this series breaks patch 7 into nine pieces (0007..0015) to better
document the changes in the commit messages and, hopefully, make
review easier.

Patches 0001 ... 0005: Unchanged
Patch 0006: Dropped change that was reverted in v17.0007

The next nine are the split of the original patch v17.0007:
Patch 0007: Added a bigger commit comment describing where
this part of the series is heading and why.
Patch 0008: Added justification for the new display_id field in struct
rdt_mon_domain
Patch 0009: Split out a helper from mkdir_mondata_subdir()
so the real changes in patch 0011 are easier to see.
Patch 0010: Added a comment about stealing a bit from union
mon_data_bits.evtid
Patch 0011: Save the display_id instead of a random d->id in the
metadata for monitor files that must sum SNC nodes.
Don't call mon_event_read() to initialize "sum" files.
Patch 0012: Set domid for "sum" files to the display_id, not
to whatever SNC domain ID is in use here. Don't
call mon_event_read() for "sum" files.
Patch 0013: No change (apart from being split out from old patch 7)
Patch 0014: Because of the change in patch 0011 to save the
display_id, a domain can no longer be looked up using
rdt_find_domain(). Instead search r->mon_domains
for a match with d->display_id or d->hdr.id (see the
sketch after this list).
Drop the extra arg to ___mon_event_count() and also
the "tmp" variable in __mon_event_count().
Patch 0015: Put #include <linux/cacheinfo.h> in alphabetical order.
When SNC is disabled, keep the old check that
the current CPU is in the domain being read.
For the SNC case, add a comment about reading
monitor values from any CPU in the same L3 domain.

Patch 0016: Took the alternate SNC detection algorithm from:
https://lore.kernel.org/all/[email protected]/
as it is simpler, but merged in the sanity
checks that make sense.
Converted the X86_MATCH*() usage to the new model
that supports Intel families other than "6".
Patch 0017: No change
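
The domain lookup described under patch 0014 is roughly this shape (a
sketch only, built from the display_id field and r->mon_domains list
named above, not the actual patch code):

	/*
	 * "Sum" files record the display_id, so search r->mon_domains
	 * for either a matching display_id (sum case) or a matching
	 * SNC domain id.
	 */
	static struct rdt_mon_domain *mon_domain_lookup(struct rdt_resource *r,
							int domid, bool sum)
	{
		struct rdt_mon_domain *d;

		list_for_each_entry(d, &r->mon_domains, hdr.list) {
			if (sum ? d->display_id == domid : d->hdr.id == domid)
				return d;
		}

		return NULL;
	}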


Tony Luck (17):
x86/resctrl: Prepare for new domain scope
x86/resctrl: Prepare to split rdt_domain structure
x86/resctrl: Prepare for different scope for control/monitor
operations
x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
x86/resctrl: Add node-scope to the options for feature scope
x86/resctrl: Introduce snc_nodes_per_l3_cache
x86/resctrl: Prepare for new Sub-NUMA (SNC) cluster monitor files
x86/resctrl: Add and initialize display_id field to struct
rdt_mon_domain
x86/resctrl: Add new fields to struct rmid_read for summation of
domains
x86/resctrl: Refactor mkdir_mondata_subdir() with a helper function
x86/resctrl: Allocate a new bit in union mon_data_bits
x86/resctrl: Create Sub-NUMA (SNC) monitor files
x86/resctrl: Handle removing directories in Sub-NUMA (SNC) mode
x86/resctrl: Sum monitor data across Sub-NUMA (SNC) nodes when needed
x86/resctrl: Fix RMID reading sanity check for Sub-NUMA (SNC) mode
x86/resctrl: Sub NUMA Cluster detection and enable
x86/resctrl: Update documentation with Sub-NUMA cluster changes

Documentation/arch/x86/resctrl.rst | 17 +
include/linux/resctrl.h | 89 +++--
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 78 ++--
arch/x86/kernel/cpu/resctrl/core.c | 422 ++++++++++++++++++----
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 70 ++--
arch/x86/kernel/cpu/resctrl/monitor.c | 106 ++++--
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 26 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 267 +++++++++-----
9 files changed, 779 insertions(+), 297 deletions(-)


base-commit: 33e02dc69afbd8f1b85a51d74d72f139ba4ca623
--
2.44.0



2024-05-15 22:23:50

by Luck, Tony

Subject: [PATCH v18 01/17] x86/resctrl: Prepare for new domain scope

Resctrl resources operate on subsets of CPUs in the system with the
defining attribute of each subset being an instance of a particular
level of cache. E.g. all CPUs sharing an L3 cache would be part of the
same domain.

In preparation for features that are scoped at the NUMA node level,
change the code from explicit references to "cache_level" to a more
generic scope. At this point the only options for this scope are groups
of CPUs that share an L2 cache or L3 cache.

Clean up the error handling when looking up domains. Report invalid ids
before calling rdt_find_domain() in preparation for better messages when
the scope can be something other than a cache scope. This means that
rdt_find_domain() will never return an error, so remove the error checks
from its call sites.

Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 9 ++++-
arch/x86/kernel/cpu/resctrl/core.c | 46 ++++++++++++++++-------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 2 +-
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 6 ++-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 5 ++-
5 files changed, 49 insertions(+), 19 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index a365f67131ec..ed693bfe474d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -150,13 +150,18 @@ struct resctrl_membw {
struct rdt_parse_data;
struct resctrl_schema;

+enum resctrl_scope {
+ RESCTRL_L2_CACHE = 2,
+ RESCTRL_L3_CACHE = 3,
+};
+
/**
* struct rdt_resource - attributes of a resctrl resource
* @rid: The index of the resource
* @alloc_capable: Is allocation available on this machine
* @mon_capable: Is monitor feature available on this machine
* @num_rmid: Number of RMIDs available
- * @cache_level: Which cache level defines scope of this resource
+ * @scope: Scope of this resource
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
* @domains: RCU list of all domains for this resource
@@ -174,7 +179,7 @@ struct rdt_resource {
bool alloc_capable;
bool mon_capable;
int num_rmid;
- int cache_level;
+ enum resctrl_scope scope;
struct resctrl_cache cache;
struct resctrl_membw membw;
struct list_head domains;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index a113d9aba553..f85b2ff40eef 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -68,7 +68,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L3,
.name = "L3",
- .cache_level = 3,
+ .scope = RESCTRL_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_L3),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
@@ -82,7 +82,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L2,
.name = "L2",
- .cache_level = 2,
+ .scope = RESCTRL_L2_CACHE,
.domains = domain_init(RDT_RESOURCE_L2),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
@@ -96,7 +96,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_MBA,
.name = "MB",
- .cache_level = 3,
+ .scope = RESCTRL_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_MBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
@@ -108,7 +108,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_SMBA,
.name = "SMBA",
- .cache_level = 3,
+ .scope = RESCTRL_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_SMBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
@@ -392,9 +392,6 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
struct rdt_domain *d;
struct list_head *l;

- if (id < 0)
- return ERR_PTR(-ENODEV);
-
list_for_each(l, &r->domains) {
d = list_entry(l, struct rdt_domain, list);
/* When id is found, return its domain. */
@@ -484,6 +481,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
return 0;
}

+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
+{
+ switch (scope) {
+ case RESCTRL_L2_CACHE:
+ case RESCTRL_L3_CACHE:
+ return get_cpu_cacheinfo_id(cpu, scope);
+ default:
+ break;
+ }
+
+ return -EINVAL;
+}
+
/*
* domain_add_cpu - Add a cpu to a resource's domain list.
*
@@ -499,7 +509,7 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
*/
static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ int id = get_domain_id_from_scope(cpu, r->scope);
struct list_head *add_pos = NULL;
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;
@@ -507,12 +517,14 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)

lockdep_assert_held(&domain_list_lock);

- d = rdt_find_domain(r, id, &add_pos);
- if (IS_ERR(d)) {
- pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+ if (id < 0) {
+ pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->scope, r->name);
return;
}

+ d = rdt_find_domain(r, id, &add_pos);
+
if (d) {
cpumask_set_cpu(cpu, &d->cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
@@ -552,15 +564,21 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)

static void domain_remove_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ int id = get_domain_id_from_scope(cpu, r->scope);
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;

lockdep_assert_held(&domain_list_lock);

+ if (id < 0) {
+ pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->scope, r->name);
+ return;
+ }
+
d = rdt_find_domain(r, id, NULL);
- if (IS_ERR_OR_NULL(d)) {
- pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+ if (!d) {
+ pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
return;
}
hw_dom = resctrl_to_arch_dom(d);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b7291f60399c..2bf021d42500 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -577,7 +577,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)

r = &rdt_resources_all[resid].r_resctrl;
d = rdt_find_domain(r, domid, NULL);
- if (IS_ERR_OR_NULL(d)) {
+ if (!d) {
ret = -ENOENT;
goto out;
}
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index aacf236dfe3b..7c4bf0a006ce 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
*/
static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
{
+ enum resctrl_scope scope = plr->s->res->scope;
struct cpu_cacheinfo *ci;
int ret;
int i;

+ if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
+ return -ENODEV;
+
/* Pick the first cpu we find that is associated with the cache. */
plr->cpu = cpumask_first(&plr->d->cpu_mask);

@@ -311,7 +315,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);

for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == plr->s->res->cache_level) {
+ if (ci->info_list[i].level == scope) {
plr->line_size = ci->info_list[i].coherency_line_size;
return 0;
}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 02f213f1c51c..b8588ce88eef 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1454,10 +1454,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
unsigned int size = 0;
int num_b, i;

+ if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+ return size;
+
num_b = bitmap_weight(&cbm, r->cache.cbm_len);
ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == r->cache_level) {
+ if (ci->info_list[i].level == r->scope) {
size = ci->info_list[i].size / r->cache.cbm_len * num_b;
break;
}
--
2.44.0


2024-05-15 22:24:12

by Luck, Tony

Subject: [PATCH v18 05/17] x86/resctrl: Add node-scope to the options for feature scope

All currently supported resctrl features are domain scoped, with the
same scope as the L2 or L3 caches.

Add RESCTRL_NODE as a new option for features that are scoped at the
same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
Cluster (SNC) feature where monitoring features are node scoped.

Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 2 ++
2 files changed, 3 insertions(+)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index aa2c22a8e37b..5c7775343c3e 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -176,6 +176,7 @@ struct resctrl_schema;
enum resctrl_scope {
RESCTRL_L2_CACHE = 2,
RESCTRL_L3_CACHE = 3,
+ RESCTRL_NODE,
};

/**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index b4f2be776408..395bac851f6e 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -510,6 +510,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
case RESCTRL_L2_CACHE:
case RESCTRL_L3_CACHE:
return get_cpu_cacheinfo_id(cpu, scope);
+ case RESCTRL_NODE:
+ return cpu_to_node(cpu);
default:
break;
}
--
2.44.0


2024-05-15 22:24:13

by Luck, Tony

Subject: [PATCH v18 02/17] x86/resctrl: Prepare to split rdt_domain structure

The rdt_domain structure is used for both control and monitor features.
It is about to be split into separate structures for these two usages
because future resources will use different scopes for control and
monitoring features.

To allow for common code that scans a list of domains looking for a
specific domain id, move all the common fields ("list", "id", "cpu_mask")
into their own structure within the rdt_domain structure.

Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 16 ++++--
arch/x86/kernel/cpu/resctrl/core.c | 26 +++++-----
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 24 ++++-----
arch/x86/kernel/cpu/resctrl/monitor.c | 14 +++---
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 14 +++---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 60 +++++++++++------------
6 files changed, 81 insertions(+), 73 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index ed693bfe474d..f63fcf17a3bc 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -59,10 +59,20 @@ struct resctrl_staged_config {
};

/**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain_hdr - common header for different domain types
* @list: all instances of this resource
* @id: unique id for this instance
* @cpu_mask: which CPUs share this resource
+ */
+struct rdt_domain_hdr {
+ struct list_head list;
+ int id;
+ struct cpumask cpu_mask;
+};
+
+/**
+ * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * @hdr: common header for different domain types
* @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
* @mbm_total: saved state for MBM total bandwidth
* @mbm_local: saved state for MBM local bandwidth
@@ -77,9 +87,7 @@ struct resctrl_staged_config {
* by closid
*/
struct rdt_domain {
- struct list_head list;
- int id;
- struct cpumask cpu_mask;
+ struct rdt_domain_hdr hdr;
unsigned long *rmid_busy_llc;
struct mbm_state *mbm_total;
struct mbm_state *mbm_local;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index f85b2ff40eef..96fff44f9d03 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -355,9 +355,9 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)

lockdep_assert_cpus_held();

- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
/* Find the domain that contains this CPU */
- if (cpumask_test_cpu(cpu, &d->cpu_mask))
+ if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
return d;
}

@@ -393,12 +393,12 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
struct list_head *l;

list_for_each(l, &r->domains) {
- d = list_entry(l, struct rdt_domain, list);
+ d = list_entry(l, struct rdt_domain, hdr.list);
/* When id is found, return its domain. */
- if (id == d->id)
+ if (id == d->hdr.id)
return d;
/* Stop searching when finding id's position in sorted list. */
- if (id < d->id)
+ if (id < d->hdr.id)
break;
}

@@ -526,7 +526,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
d = rdt_find_domain(r, id, &add_pos);

if (d) {
- cpumask_set_cpu(cpu, &d->cpu_mask);
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
rdt_domain_reconfigure_cdp(r);
return;
@@ -537,8 +537,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;

d = &hw_dom->d_resctrl;
- d->id = id;
- cpumask_set_cpu(cpu, &d->cpu_mask);
+ d->hdr.id = id;
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);

rdt_domain_reconfigure_cdp(r);

@@ -552,11 +552,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;
}

- list_add_tail_rcu(&d->list, add_pos);
+ list_add_tail_rcu(&d->hdr.list, add_pos);

err = resctrl_online_domain(r, d);
if (err) {
- list_del_rcu(&d->list);
+ list_del_rcu(&d->hdr.list);
synchronize_rcu();
domain_free(hw_dom);
}
@@ -583,10 +583,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
}
hw_dom = resctrl_to_arch_dom(d);

- cpumask_clear_cpu(cpu, &d->cpu_mask);
- if (cpumask_empty(&d->cpu_mask)) {
+ cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+ if (cpumask_empty(&d->hdr.cpu_mask)) {
resctrl_offline_domain(r, d);
- list_del_rcu(&d->list);
+ list_del_rcu(&d->hdr.list);
synchronize_rcu();

/*
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 2bf021d42500..6246f48b0449 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -69,7 +69,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,

cfg = &d->staged_config[s->conf_type];
if (cfg->have_new_ctrl) {
- rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+ rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
return -EINVAL;
}

@@ -148,7 +148,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,

cfg = &d->staged_config[s->conf_type];
if (cfg->have_new_ctrl) {
- rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+ rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
return -EINVAL;
}

@@ -231,8 +231,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
return -EINVAL;
}
dom = strim(dom);
- list_for_each_entry(d, &r->domains, list) {
- if (d->id == dom_id) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
+ if (d->hdr.id == dom_id) {
data.buf = dom;
data.rdtgrp = rdtgrp;
if (r->parse_ctrlval(&data, s, d))
@@ -280,7 +280,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
u32 idx = get_config_index(closid, t);
struct msr_param msr_param;

- if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+ if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
return -EINVAL;

hw_dom->ctrl_val[idx] = cfg_val;
@@ -306,7 +306,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
/* Walking r->domains, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();

- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
hw_dom = resctrl_to_arch_dom(d);
msr_param.res = NULL;
for (t = 0; t < CDP_NUM_TYPES; t++) {
@@ -330,7 +330,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
}
}
if (msr_param.res)
- smp_call_function_any(&d->cpu_mask, rdt_ctrl_update, &msr_param, 1);
+ smp_call_function_any(&d->hdr.cpu_mask, rdt_ctrl_update, &msr_param, 1);
}

return 0;
@@ -450,7 +450,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
lockdep_assert_cpus_held();

seq_printf(s, "%*s:", max_name_width, schema->name);
- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->domains, hdr.list) {
if (sep)
seq_puts(s, ";");

@@ -460,7 +460,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
ctrl_val = resctrl_arch_get_config(r, dom, closid,
schema->conf_type);

- seq_printf(s, r->format_str, dom->id, max_data_width,
+ seq_printf(s, r->format_str, dom->hdr.id, max_data_width,
ctrl_val);
sep = true;
}
@@ -489,7 +489,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
} else {
seq_printf(s, "%s:%d=%x\n",
rdtgrp->plr->s->res->name,
- rdtgrp->plr->d->id,
+ rdtgrp->plr->d->hdr.id,
rdtgrp->plr->cbm);
}
} else {
@@ -537,7 +537,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
return;
}

- cpu = cpumask_any_housekeeping(&d->cpu_mask, RESCTRL_PICK_ANY_CPU);
+ cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask, RESCTRL_PICK_ANY_CPU);

/*
* cpumask_any_housekeeping() prefers housekeeping CPUs, but
@@ -546,7 +546,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
* counters on some platforms if its called in IRQ context.
*/
if (tick_nohz_full_cpu(cpu))
- smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+ smp_call_function_any(&d->hdr.cpu_mask, mon_event_count, rr, 1);
else
smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 2345e6836593..ab8a198d88b3 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -281,7 +281,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,

resctrl_arch_rmid_read_context_check();

- if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+ if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
return -EINVAL;

ret = __rmid_read(rmid, eventid, &msr_val);
@@ -364,7 +364,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
* CLOSID and RMID because there may be dependencies between them
* on some architectures.
*/
- trace_mon_llc_occupancy_limbo(entry->closid, entry->rmid, d->id, val);
+ trace_mon_llc_occupancy_limbo(entry->closid, entry->rmid, d->hdr.id, val);
}

if (force_free || !rmid_dirty) {
@@ -490,7 +490,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);

entry->busy = 0;
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
/*
* For the first limbo RMID in the domain,
* setup up the limbo worker.
@@ -801,7 +801,7 @@ void cqm_handle_limbo(struct work_struct *work)
__check_limbo(d, false);

if (has_busy_rmid(d)) {
- d->cqm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask,
+ d->cqm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask,
RESCTRL_PICK_ANY_CPU);
schedule_delayed_work_on(d->cqm_work_cpu, &d->cqm_limbo,
delay);
@@ -825,7 +825,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;

- cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu);
+ cpu = cpumask_any_housekeeping(&dom->hdr.cpu_mask, exclude_cpu);
dom->cqm_work_cpu = cpu;

if (cpu < nr_cpu_ids)
@@ -868,7 +868,7 @@ void mbm_handle_overflow(struct work_struct *work)
* Re-check for housekeeping CPUs. This allows the overflow handler to
* move off a nohz_full CPU quickly.
*/
- d->mbm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask,
+ d->mbm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask,
RESCTRL_PICK_ANY_CPU);
schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay);

@@ -897,7 +897,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms,
*/
if (!resctrl_mounted || !resctrl_arch_mon_capable())
return;
- cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu);
+ cpu = cpumask_any_housekeeping(&dom->hdr.cpu_mask, exclude_cpu);
dom->mbm_work_cpu = cpu;

if (cpu < nr_cpu_ids)
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 7c4bf0a006ce..36d943cb847a 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -221,7 +221,7 @@ static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr)
int cpu;
int ret;

- for_each_cpu(cpu, &plr->d->cpu_mask) {
+ for_each_cpu(cpu, &plr->d->hdr.cpu_mask) {
pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL);
if (!pm_req) {
rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n");
@@ -301,7 +301,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
return -ENODEV;

/* Pick the first cpu we find that is associated with the cache. */
- plr->cpu = cpumask_first(&plr->d->cpu_mask);
+ plr->cpu = cpumask_first(&plr->d->hdr.cpu_mask);

if (!cpu_online(plr->cpu)) {
rdt_last_cmd_printf("CPU %u associated with cache not online\n",
@@ -859,10 +859,10 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
* associated with them.
*/
for_each_alloc_capable_rdt_resource(r) {
- list_for_each_entry(d_i, &r->domains, list) {
+ list_for_each_entry(d_i, &r->domains, hdr.list) {
if (d_i->plr)
cpumask_or(cpu_with_psl, cpu_with_psl,
- &d_i->cpu_mask);
+ &d_i->hdr.cpu_mask);
}
}

@@ -870,7 +870,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
* Next test if new pseudo-locked region would intersect with
* existing region.
*/
- if (cpumask_intersects(&d->cpu_mask, cpu_with_psl))
+ if (cpumask_intersects(&d->hdr.cpu_mask, cpu_with_psl))
ret = true;

free_cpumask_var(cpu_with_psl);
@@ -1202,7 +1202,7 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
}

plr->thread_done = 0;
- cpu = cpumask_first(&plr->d->cpu_mask);
+ cpu = cpumask_first(&plr->d->hdr.cpu_mask);
if (!cpu_online(cpu)) {
ret = -ENODEV;
goto out;
@@ -1532,7 +1532,7 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
* may be scheduled elsewhere and invalidate entries in the
* pseudo-locked region.
*/
- if (!cpumask_subset(current->cpus_ptr, &plr->d->cpu_mask)) {
+ if (!cpumask_subset(current->cpus_ptr, &plr->d->hdr.cpu_mask)) {
mutex_unlock(&rdtgroup_mutex);
return -EINVAL;
}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b8588ce88eef..e6e2753738c9 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -98,7 +98,7 @@ void rdt_staged_configs_clear(void)
lockdep_assert_held(&rdtgroup_mutex);

for_each_alloc_capable_rdt_resource(r) {
- list_for_each_entry(dom, &r->domains, list)
+ list_for_each_entry(dom, &r->domains, hdr.list)
memset(dom->staged_config, 0, sizeof(dom->staged_config));
}
}
@@ -317,7 +317,7 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
rdt_last_cmd_puts("Cache domain offline\n");
ret = -ENODEV;
} else {
- mask = &rdtgrp->plr->d->cpu_mask;
+ mask = &rdtgrp->plr->d->hdr.cpu_mask;
seq_printf(s, is_cpu_list(of) ?
"%*pbl\n" : "%*pb\n",
cpumask_pr_args(mask));
@@ -1021,12 +1021,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
hw_shareable = r->cache.shareable_bits;
- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->domains, hdr.list) {
if (sep)
seq_putc(seq, ';');
sw_shareable = 0;
exclusive = 0;
- seq_printf(seq, "%d=", dom->id);
+ seq_printf(seq, "%d=", dom->hdr.id);
for (i = 0; i < closids_supported(); i++) {
if (!closid_allocated(i))
continue;
@@ -1343,7 +1343,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
continue;
has_cache = true;
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
ctrl = resctrl_arch_get_config(r, d, closid,
s->conf_type);
if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1458,7 +1458,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
return size;

num_b = bitmap_weight(&cbm, r->cache.cbm_len);
- ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
+ ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
if (ci->info_list[i].level == r->scope) {
size = ci->info_list[i].size / r->cache.cbm_len * num_b;
@@ -1506,7 +1506,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res,
rdtgrp->plr->d,
rdtgrp->plr->cbm);
- seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size);
+ seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size);
}
goto out;
}
@@ -1518,7 +1518,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
type = schema->conf_type;
sep = false;
seq_printf(s, "%*s:", max_name_width, schema->name);
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
if (sep)
seq_putc(s, ';');
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1536,7 +1536,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
else
size = rdtgroup_cbm_to_size(r, d, ctrl);
}
- seq_printf(s, "%d=%u", d->id, size);
+ seq_printf(s, "%d=%u", d->hdr.id, size);
sep = true;
}
seq_putc(s, '\n');
@@ -1596,7 +1596,7 @@ static void mon_event_config_read(void *info)

static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
{
- smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
+ smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
}

static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
@@ -1608,7 +1608,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);

- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->domains, hdr.list) {
if (sep)
seq_puts(s, ";");

@@ -1616,7 +1616,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
mon_info.evtid = evtid;
mondata_config_read(dom, &mon_info);

- seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
+ seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
sep = true;
}
seq_puts(s, "\n");
@@ -1682,7 +1682,7 @@ static void mbm_config_write_domain(struct rdt_resource *r,
* are scoped at the domain level. Writing any of these MSRs
* on one CPU is observed by all the CPUs in the domain.
*/
- smp_call_function_any(&d->cpu_mask, mon_event_config_write,
+ smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_write,
&mon_info, 1);

/*
@@ -1732,8 +1732,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
return -EINVAL;
}

- list_for_each_entry(d, &r->domains, list) {
- if (d->id == dom_id) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
+ if (d->hdr.id == dom_id) {
mbm_config_write_domain(r, d, evtid, val);
goto next;
}
@@ -2280,14 +2280,14 @@ static int set_cache_qos_cfg(int level, bool enable)
return -ENOMEM;

r_l = &rdt_resources_all[level].r_resctrl;
- list_for_each_entry(d, &r_l->domains, list) {
+ list_for_each_entry(d, &r_l->domains, hdr.list) {
if (r_l->cache.arch_has_per_cpu_cfg)
/* Pick all the CPUs in the domain instance */
- for_each_cpu(cpu, &d->cpu_mask)
+ for_each_cpu(cpu, &d->hdr.cpu_mask)
cpumask_set_cpu(cpu, cpu_mask);
else
/* Pick one CPU from each domain instance to update MSR */
- cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+ cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
}

/* Update QOS_CFG MSR on all the CPUs in cpu_mask */
@@ -2316,7 +2316,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
{
u32 num_closid = resctrl_arch_get_num_closid(r);
- int cpu = cpumask_any(&d->cpu_mask);
+ int cpu = cpumask_any(&d->hdr.cpu_mask);
int i;

d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val),
@@ -2365,7 +2365,7 @@ static int set_mba_sc(bool mba_sc)

r->membw.mba_sc = mba_sc;

- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
for (i = 0; i < num_closid; i++)
d->mbps_val[i] = MBA_MAX_MBPS;
}
@@ -2704,7 +2704,7 @@ static int rdt_get_tree(struct fs_context *fc)

if (is_mbm_enabled()) {
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- list_for_each_entry(dom, &r->domains, list)
+ list_for_each_entry(dom, &r->domains, hdr.list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL,
RESCTRL_PICK_ANY_CPU);
}
@@ -2831,13 +2831,13 @@ static int reset_all_ctrls(struct rdt_resource *r)
* CBMs in all domains to the maximum mask value. Pick one CPU
* from each domain to update the MSRs below.
*/
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
hw_dom = resctrl_to_arch_dom(d);

for (i = 0; i < hw_res->num_closid; i++)
hw_dom->ctrl_val[i] = r->default_ctrl;
msr_param.dom = d;
- smp_call_function_any(&d->cpu_mask, rdt_ctrl_update, &msr_param, 1);
+ smp_call_function_any(&d->hdr.cpu_mask, rdt_ctrl_update, &msr_param, 1);
}

return 0;
@@ -3035,7 +3035,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
char name[32];
int ret;

- sprintf(name, "mon_%s_%02d", r->name, d->id);
+ sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
/* create the directory */
kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
if (IS_ERR(kn))
@@ -3051,7 +3051,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
}

priv.u.rid = r->rid;
- priv.u.domid = d->id;
+ priv.u.domid = d->hdr.id;
list_for_each_entry(mevt, &r->evt_list, list) {
priv.u.evtid = mevt->evtid;
ret = mon_addfile(kn, mevt->name, priv.priv);
@@ -3102,7 +3102,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
/* Walking r->domains, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();

- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->domains, hdr.list) {
ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
if (ret)
return ret;
@@ -3261,7 +3261,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
*/
tmp_cbm = cfg->new_ctrl;
if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) {
- rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id);
+ rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id);
return -ENOSPC;
}
cfg->have_new_ctrl = true;
@@ -3284,7 +3284,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
struct rdt_domain *d;
int ret;

- list_for_each_entry(d, &s->res->domains, list) {
+ list_for_each_entry(d, &s->res->domains, hdr.list) {
ret = __init_one_rdt_domain(d, s, closid);
if (ret < 0)
return ret;
@@ -3299,7 +3299,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
struct resctrl_staged_config *cfg;
struct rdt_domain *d;

- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
if (is_mba_sc(r)) {
d->mbps_val[closid] = MBA_MAX_MBPS;
continue;
@@ -3945,7 +3945,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
* per domain monitor data directories.
*/
if (resctrl_mounted && resctrl_arch_mon_capable())
- rmdir_mondata_subdir_allrdtgrp(r, d->id);
+ rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);

if (is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
--
2.44.0


2024-05-15 22:24:12

by Luck, Tony

Subject: [PATCH v18 03/17] x86/resctrl: Prepare for different scope for control/monitor operations

Resctrl assumes that control and monitor operations on a resource are
performed at the same scope.

Prepare for systems that use different scopes (specifically, Intel needs
to split the RDT_RESOURCE_L3 resource to use L3 scope for cache control
and NODE scope for cache occupancy and memory bandwidth monitoring).

Create separate domain lists for control and monitor operations.

Note that errors during initialization of either control or monitor
functions on a domain would previously result in that domain being
excluded from both control and monitor operations. Now that the domains
are allocated independently, it is no longer required to disable both
control and monitor operations if either fails.

Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 25 ++-
arch/x86/kernel/cpu/resctrl/internal.h | 7 +-
arch/x86/kernel/cpu/resctrl/core.c | 224 +++++++++++++++++-----
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 12 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 4 +-
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 60 +++---
7 files changed, 240 insertions(+), 96 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index f63fcf17a3bc..96ddf9ff3183 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -58,15 +58,22 @@ struct resctrl_staged_config {
bool have_new_ctrl;
};

+enum resctrl_domain_type {
+ RESCTRL_CTRL_DOMAIN,
+ RESCTRL_MON_DOMAIN,
+};
+
/**
* struct rdt_domain_hdr - common header for different domain types
* @list: all instances of this resource
* @id: unique id for this instance
+ * @type: type of this instance
* @cpu_mask: which CPUs share this resource
*/
struct rdt_domain_hdr {
struct list_head list;
int id;
+ enum resctrl_domain_type type;
struct cpumask cpu_mask;
};

@@ -169,10 +176,12 @@ enum resctrl_scope {
* @alloc_capable: Is allocation available on this machine
* @mon_capable: Is monitor feature available on this machine
* @num_rmid: Number of RMIDs available
- * @scope: Scope of this resource
+ * @ctrl_scope: Scope of this resource for control functions
+ * @mon_scope: Scope of this resource for monitor functions
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
- * @domains: RCU list of all domains for this resource
+ * @ctrl_domains: RCU list of all control domains for this resource
+ * @mon_domains: RCU list of all monitor domains for this resource
* @name: Name to use in "schemata" file.
* @data_width: Character width of data when displaying
* @default_ctrl: Specifies default cache cbm or memory B/W percent.
@@ -187,10 +196,12 @@ struct rdt_resource {
bool alloc_capable;
bool mon_capable;
int num_rmid;
- enum resctrl_scope scope;
+ enum resctrl_scope ctrl_scope;
+ enum resctrl_scope mon_scope;
struct resctrl_cache cache;
struct resctrl_membw membw;
- struct list_head domains;
+ struct list_head ctrl_domains;
+ struct list_head mon_domains;
char *name;
int data_width;
u32 default_ctrl;
@@ -236,8 +247,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,

u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
void resctrl_online_cpu(unsigned int cpu);
void resctrl_offline_cpu(unsigned int cpu);

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index f1d926832ec8..377679b79919 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -558,8 +558,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
- struct list_head **pos);
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+ struct list_head **pos);
ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
int rdtgroup_schemata_show(struct kernfs_open_file *of,
@@ -578,7 +578,8 @@ int rdt_pseudo_lock_init(void);
void rdt_pseudo_lock_release(void);
int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r);
int closids_supported(void);
void closid_free(int closid);
int alloc_rmid(u32 closid);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 96fff44f9d03..edd9b2bfb53d 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -60,7 +60,8 @@ static void mba_wrmsr_intel(struct msr_param *m);
static void cat_wrmsr(struct msr_param *m);
static void mba_wrmsr_amd(struct msr_param *m);

-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
+#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)

struct rdt_hw_resource rdt_resources_all[] = {
[RDT_RESOURCE_L3] =
@@ -68,8 +69,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L3,
.name = "L3",
- .scope = RESCTRL_L3_CACHE,
- .domains = domain_init(RDT_RESOURCE_L3),
+ .ctrl_scope = RESCTRL_L3_CACHE,
+ .mon_scope = RESCTRL_L3_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_L3),
+ .mon_domains = mon_domain_init(RDT_RESOURCE_L3),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
@@ -82,8 +85,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L2,
.name = "L2",
- .scope = RESCTRL_L2_CACHE,
- .domains = domain_init(RDT_RESOURCE_L2),
+ .ctrl_scope = RESCTRL_L2_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_L2),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
@@ -96,8 +99,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_MBA,
.name = "MB",
- .scope = RESCTRL_L3_CACHE,
- .domains = domain_init(RDT_RESOURCE_MBA),
+ .ctrl_scope = RESCTRL_L3_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_MBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
.fflags = RFTYPE_RES_MB,
@@ -108,8 +111,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_SMBA,
.name = "SMBA",
- .scope = RESCTRL_L3_CACHE,
- .domains = domain_init(RDT_RESOURCE_SMBA),
+ .ctrl_scope = RESCTRL_L3_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_SMBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
.fflags = RFTYPE_RES_MB,
@@ -349,13 +352,28 @@ static void cat_wrmsr(struct msr_param *m)
wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
}

-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
{
struct rdt_domain *d;

lockdep_assert_cpus_held();

- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+ /* Find the domain that contains this CPU */
+ if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
+ return d;
+ }
+
+ return NULL;
+}
+
+struct rdt_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r)
+{
+ struct rdt_domain *d;
+
+ lockdep_assert_cpus_held();
+
+ list_for_each_entry(d, &r->mon_domains, hdr.list) {
/* Find the domain that contains this CPU */
if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
return d;
@@ -379,26 +397,26 @@ void rdt_ctrl_update(void *arg)
}

/*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Search for a domain id in a resource domain list.
*
- * Search resource r's domain list to find the resource id. If the resource
- * id is found in a domain, return the domain. Otherwise, if requested by
- * caller, return the first domain whose id is bigger than the input id.
- * The domain list is sorted by id in ascending order.
+ * Search the domain list to find the domain id. If the domain id is
+ * found, return the domain. NULL otherwise. If the domain id is not
+ * found (and NULL returned) then the first domain with id bigger than
+ * the input id can be returned to the caller via @pos.
*/
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
- struct list_head **pos)
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+ struct list_head **pos)
{
- struct rdt_domain *d;
+ struct rdt_domain_hdr *d;
struct list_head *l;

- list_for_each(l, &r->domains) {
- d = list_entry(l, struct rdt_domain, hdr.list);
+ list_for_each(l, h) {
+ d = list_entry(l, struct rdt_domain_hdr, list);
/* When id is found, return its domain. */
- if (id == d->hdr.id)
+ if (id == d->id)
return d;
/* Stop searching when finding id's position in sorted list. */
- if (id < d->hdr.id)
+ if (id < d->id)
break;
}

@@ -494,38 +512,29 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
return -EINVAL;
}

-/*
- * domain_add_cpu - Add a cpu to a resource's domain list.
- *
- * If an existing domain in the resource r's domain list matches the cpu's
- * resource id, add the cpu in the domain.
- *
- * Otherwise, a new domain is allocated and inserted into the right position
- * in the domain list sorted by id in ascending order.
- *
- * The order in the domain list is visible to users when we print entries
- * in the schemata file and schemata input is validated to have the same order
- * as this list.
- */
-static void domain_add_cpu(int cpu, struct rdt_resource *r)
+static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
{
- int id = get_domain_id_from_scope(cpu, r->scope);
+ int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
struct list_head *add_pos = NULL;
struct rdt_hw_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
struct rdt_domain *d;
int err;

lockdep_assert_held(&domain_list_lock);

if (id < 0) {
- pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
- cpu, r->scope, r->name);
+ pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->ctrl_scope, r->name);
return;
}

- d = rdt_find_domain(r, id, &add_pos);
+ hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
+ if (hdr) {
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ return;
+ d = container_of(hdr, struct rdt_domain, hdr);

- if (d) {
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
rdt_domain_reconfigure_cdp(r);
@@ -538,23 +547,70 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)

d = &hw_dom->d_resctrl;
d->hdr.id = id;
+ d->hdr.type = RESCTRL_CTRL_DOMAIN;
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);

rdt_domain_reconfigure_cdp(r);

- if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
+ if (domain_setup_ctrlval(r, d)) {
domain_free(hw_dom);
return;
}

- if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+ list_add_tail_rcu(&d->hdr.list, add_pos);
+
+ err = resctrl_online_ctrl_domain(r, d);
+ if (err) {
+ list_del_rcu(&d->hdr.list);
+ synchronize_rcu();
+ domain_free(hw_dom);
+ }
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->mon_scope);
+ struct list_head *add_pos = NULL;
+ struct rdt_hw_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
+ struct rdt_domain *d;
+ int err;
+
+ lockdep_assert_held(&domain_list_lock);
+
+ if (id < 0) {
+ pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->mon_scope, r->name);
+ return;
+ }
+
+ hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
+ if (hdr) {
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ return;
+ d = container_of(hdr, struct rdt_domain, hdr);
+
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+ return;
+ }
+
+ hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+ if (!hw_dom)
+ return;
+
+ d = &hw_dom->d_resctrl;
+ d->hdr.id = id;
+ d->hdr.type = RESCTRL_MON_DOMAIN;
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+
+ if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
domain_free(hw_dom);
return;
}

list_add_tail_rcu(&d->hdr.list, add_pos);

- err = resctrl_online_domain(r, d);
+ err = resctrl_online_mon_domain(r, d);
if (err) {
list_del_rcu(&d->hdr.list);
synchronize_rcu();
@@ -562,30 +618,45 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
}
}

-static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+static void domain_add_cpu(int cpu, struct rdt_resource *r)
+{
+ if (r->alloc_capable)
+ domain_add_cpu_ctrl(cpu, r);
+ if (r->mon_capable)
+ domain_add_cpu_mon(cpu, r);
+}
+
+static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
{
- int id = get_domain_id_from_scope(cpu, r->scope);
+ int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
struct rdt_hw_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
struct rdt_domain *d;

lockdep_assert_held(&domain_list_lock);

if (id < 0) {
- pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
- cpu, r->scope, r->name);
+ pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->ctrl_scope, r->name);
return;
}

- d = rdt_find_domain(r, id, NULL);
- if (!d) {
- pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
+ hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
+ if (!hdr) {
+ pr_warn("Can't find control domain for id=%d for CPU %d for resource %s\n",
+ id, cpu, r->name);
return;
}
+
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ return;
+
+ d = container_of(hdr, struct rdt_domain, hdr);
hw_dom = resctrl_to_arch_dom(d);

cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
if (cpumask_empty(&d->hdr.cpu_mask)) {
- resctrl_offline_domain(r, d);
+ resctrl_offline_ctrl_domain(r, d);
list_del_rcu(&d->hdr.list);
synchronize_rcu();

@@ -601,6 +672,53 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
}
}

+static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->mon_scope);
+ struct rdt_hw_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
+ struct rdt_domain *d;
+
+ lockdep_assert_held(&domain_list_lock);
+
+ if (id < 0) {
+ pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->mon_scope, r->name);
+ return;
+ }
+
+ hdr = rdt_find_domain(&r->mon_domains, id, NULL);
+ if (!hdr) {
+ pr_warn("Can't find monitor domain for id=%d for CPU %d for resource %s\n",
+ id, cpu, r->name);
+ return;
+ }
+
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ return;
+
+ d = container_of(hdr, struct rdt_domain, hdr);
+ hw_dom = resctrl_to_arch_dom(d);
+
+ cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+ if (cpumask_empty(&d->hdr.cpu_mask)) {
+ resctrl_offline_mon_domain(r, d);
+ list_del_rcu(&d->hdr.list);
+ synchronize_rcu();
+ domain_free(hw_dom);
+
+ return;
+ }
+}
+
+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+{
+ if (r->alloc_capable)
+ domain_remove_cpu_ctrl(cpu, r);
+ if (r->mon_capable)
+ domain_remove_cpu_mon(cpu, r);
+}
+
static void clear_closid_rmid(int cpu)
{
struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 6246f48b0449..8cc36723f077 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -231,7 +231,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
return -EINVAL;
}
dom = strim(dom);
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
if (d->hdr.id == dom_id) {
data.buf = dom;
data.rdtgrp = rdtgrp;
@@ -306,7 +306,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
/* Walking r->domains, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();

- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
hw_dom = resctrl_to_arch_dom(d);
msr_param.res = NULL;
for (t = 0; t < CDP_NUM_TYPES; t++) {
@@ -450,7 +450,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
lockdep_assert_cpus_held();

seq_printf(s, "%*s:", max_name_width, schema->name);
- list_for_each_entry(dom, &r->domains, hdr.list) {
+ list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
if (sep)
seq_puts(s, ";");

@@ -556,6 +556,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
+ struct rdt_domain_hdr *hdr;
u32 resid, evtid, domid;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
@@ -576,11 +577,12 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
evtid = md.u.evtid;

r = &rdt_resources_all[resid].r_resctrl;
- d = rdt_find_domain(r, domid, NULL);
- if (!d) {
+ hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
+ if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
ret = -ENOENT;
goto out;
}
+ d = container_of(hdr, struct rdt_domain, hdr);

mon_event_read(&rr, r, d, rdtgrp, evtid, false);

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ab8a198d88b3..82a44de8136f 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -490,7 +490,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);

entry->busy = 0;
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->mon_domains, hdr.list) {
/*
* For the first limbo RMID in the domain,
* setup up the limbo worker.
@@ -687,7 +687,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
idx = resctrl_arch_rmid_idx_encode(closid, rmid);
pmbm_data = &dom_mbm->mbm_local[idx];

- dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
+ dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
if (!dom_mba) {
pr_warn_once("Failure to get domain for MBA update\n");
return;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 36d943cb847a..58985ffcf74e 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,7 +292,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
*/
static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
{
- enum resctrl_scope scope = plr->s->res->scope;
+ enum resctrl_scope scope = plr->s->res->ctrl_scope;
struct cpu_cacheinfo *ci;
int ret;
int i;
@@ -859,7 +859,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
* associated with them.
*/
for_each_alloc_capable_rdt_resource(r) {
- list_for_each_entry(d_i, &r->domains, hdr.list) {
+ list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) {
if (d_i->plr)
cpumask_or(cpu_with_psl, cpu_with_psl,
&d_i->hdr.cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e6e2753738c9..7c1475f393ff 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -98,7 +98,7 @@ void rdt_staged_configs_clear(void)
lockdep_assert_held(&rdtgroup_mutex);

for_each_alloc_capable_rdt_resource(r) {
- list_for_each_entry(dom, &r->domains, hdr.list)
+ list_for_each_entry(dom, &r->ctrl_domains, hdr.list)
memset(dom->staged_config, 0, sizeof(dom->staged_config));
}
}
@@ -1021,7 +1021,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
hw_shareable = r->cache.shareable_bits;
- list_for_each_entry(dom, &r->domains, hdr.list) {
+ list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
if (sep)
seq_putc(seq, ';');
sw_shareable = 0;
@@ -1343,7 +1343,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
continue;
has_cache = true;
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
ctrl = resctrl_arch_get_config(r, d, closid,
s->conf_type);
if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1454,13 +1454,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
unsigned int size = 0;
int num_b, i;

- if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+ if (WARN_ON_ONCE(r->ctrl_scope != RESCTRL_L2_CACHE && r->ctrl_scope != RESCTRL_L3_CACHE))
return size;

num_b = bitmap_weight(&cbm, r->cache.cbm_len);
ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == r->scope) {
+ if (ci->info_list[i].level == r->ctrl_scope) {
size = ci->info_list[i].size / r->cache.cbm_len * num_b;
break;
}
@@ -1518,7 +1518,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
type = schema->conf_type;
sep = false;
seq_printf(s, "%*s:", max_name_width, schema->name);
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
if (sep)
seq_putc(s, ';');
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1608,7 +1608,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);

- list_for_each_entry(dom, &r->domains, hdr.list) {
+ list_for_each_entry(dom, &r->mon_domains, hdr.list) {
if (sep)
seq_puts(s, ";");

@@ -1732,7 +1732,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
return -EINVAL;
}

- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->mon_domains, hdr.list) {
if (d->hdr.id == dom_id) {
mbm_config_write_domain(r, d, evtid, val);
goto next;
@@ -2280,7 +2280,7 @@ static int set_cache_qos_cfg(int level, bool enable)
return -ENOMEM;

r_l = &rdt_resources_all[level].r_resctrl;
- list_for_each_entry(d, &r_l->domains, hdr.list) {
+ list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) {
if (r_l->cache.arch_has_per_cpu_cfg)
/* Pick all the CPUs in the domain instance */
for_each_cpu(cpu, &d->hdr.cpu_mask)
@@ -2365,7 +2365,7 @@ static int set_mba_sc(bool mba_sc)

r->membw.mba_sc = mba_sc;

- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
for (i = 0; i < num_closid; i++)
d->mbps_val[i] = MBA_MAX_MBPS;
}
@@ -2704,7 +2704,7 @@ static int rdt_get_tree(struct fs_context *fc)

if (is_mbm_enabled()) {
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- list_for_each_entry(dom, &r->domains, hdr.list)
+ list_for_each_entry(dom, &r->mon_domains, hdr.list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL,
RESCTRL_PICK_ANY_CPU);
}
@@ -2828,10 +2828,10 @@ static int reset_all_ctrls(struct rdt_resource *r)

/*
* Disable resource control for this resource by setting all
- * CBMs in all domains to the maximum mask value. Pick one CPU
+ * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU
* from each domain to update the MSRs below.
*/
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
hw_dom = resctrl_to_arch_dom(d);

for (i = 0; i < hw_res->num_closid; i++)
@@ -3102,7 +3102,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
/* Walking r->domains, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();

- list_for_each_entry(dom, &r->domains, hdr.list) {
+ list_for_each_entry(dom, &r->mon_domains, hdr.list) {
ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
if (ret)
return ret;
@@ -3284,7 +3284,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
struct rdt_domain *d;
int ret;

- list_for_each_entry(d, &s->res->domains, hdr.list) {
+ list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
ret = __init_one_rdt_domain(d, s, closid);
if (ret < 0)
return ret;
@@ -3299,7 +3299,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
struct resctrl_staged_config *cfg;
struct rdt_domain *d;

- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
if (is_mba_sc(r)) {
d->mbps_val[closid] = MBA_MAX_MBPS;
continue;
@@ -3930,15 +3930,19 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
kfree(d->mbm_local);
}

-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
{
mutex_lock(&rdtgroup_mutex);

if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
mba_sc_domain_destroy(r, d);

- if (!r->mon_capable)
- goto out_unlock;
+ mutex_unlock(&rdtgroup_mutex);
+}
+
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+ mutex_lock(&rdtgroup_mutex);

/*
* If resctrl is mounted, remove all the
@@ -3964,7 +3968,6 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)

domain_destroy_mon_state(d);

-out_unlock:
mutex_unlock(&rdtgroup_mutex);
}

@@ -3999,7 +4002,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
{
int err = 0;

@@ -4008,11 +4011,18 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA) {
/* RDT_RESOURCE_MBA is never mon_capable */
err = mba_sc_domain_allocate(r, d);
- goto out_unlock;
}

- if (!r->mon_capable)
- goto out_unlock;
+ mutex_unlock(&rdtgroup_mutex);
+
+ return err;
+}
+
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+ int err;
+
+ mutex_lock(&rdtgroup_mutex);

err = domain_setup_mon_state(r, d);
if (err)
@@ -4077,7 +4087,7 @@ void resctrl_offline_cpu(unsigned int cpu)
if (!l3->mon_capable)
goto out_unlock;

- d = get_domain_from_cpu(cpu, l3);
+ d = get_mon_domain_from_cpu(cpu, l3);
if (d) {
if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
cancel_delayed_work(&d->mbm_over);
--
2.44.0


2024-05-15 22:24:28

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v18 06/17] x86/resctrl: Introduce snc_nodes_per_l3_cache

Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.

This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This benefit may be offset by lower bandwidth
since the memory accesses for each core can only be interleaved between
the memory controllers on the same SNC node.

Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.

The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket, the lower-numbered half of the RMIDs are given to the
first node and the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.

The other mode divides the RMIDs and renumbers the ones on the second
SNC node to start from zero. E.g. with 512 RMIDs and two SNC nodes on a
socket, each node then sees RMIDs 0..255 that index disjoint hardware
counters.

Even with this renumbering, SNC mode requires several changes in resctrl
behavior for correct operation.

Add a global integer "snc_nodes_per_l3_cache" that records how many
SNC nodes share each L3 cache. When "snc_nodes_per_l3_cache" is "1",
SNC mode is either not implemented or not enabled.

Update all places to take appropriate action when SNC mode is enabled:
1) The number of logical RMIDs per L3 cache available for use is the
number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be divided by the number of SNC
nodes.
3) Disable the "-o mba_MBps" mount option in SNC mode
because the monitoring is being done per SNC node, while the
bandwidth allocation is still done at the L3 cache scope.
Trying to use this feedback loop might result in contradictory
changes to the throttling level coming from each of the SNC
node bandwidth measurements.
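
To illustrate points 1) and 2) above with made-up numbers (no particular
processor model is implied), the divisions applied by this patch work
out as:

/*
 * Hypothetical example: two SNC nodes per L3 cache, 512 physical
 * RMIDs, and an occupancy scale of 64 bytes per counter unit.
 */
unsigned int snc_nodes_per_l3_cache = 2;
unsigned int max_rmid = 511;    /* boot_cpu_data.x86_cache_max_rmid */
unsigned int occ_scale = 64;    /* boot_cpu_data.x86_cache_occ_scale */

unsigned int num_rmid = (max_rmid + 1) / snc_nodes_per_l3_cache;
unsigned int mon_scale = occ_scale / snc_nodes_per_l3_cache;
/* num_rmid == 256 usable per SNC node, mon_scale == 32 */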

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 6 ++++++
arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++--
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 3 ++-
4 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 135190e0711c..49440f194253 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -484,6 +484,8 @@ extern struct rdt_hw_resource rdt_resources_all[];
extern struct rdtgroup rdtgroup_default;
extern struct dentry *debugfs_resctrl;

+extern unsigned int snc_nodes_per_l3_cache;
+
enum resctrl_res_level {
RDT_RESOURCE_L3,
RDT_RESOURCE_L2,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 395bac851f6e..bfa9d3a429fd 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -331,6 +331,12 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
return r->default_ctrl;
}

+/*
+ * Number of SNC nodes that share each L3 cache. Default is 1 for
+ * systems that do not support SNC, or have SNC disabled.
+ */
+unsigned int snc_nodes_per_l3_cache = 1;
+
static void mba_wrmsr_intel(struct msr_param *m)
{
struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 89d7e6fcbaa1..0f66825a1ac9 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1022,8 +1022,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
int ret;

resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
- hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
- r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+ hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+ r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;

if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index cc31ede1a1e7..0923492a8bd0 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2346,7 +2346,8 @@ static bool supports_mba_mbps(void)
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;

return (is_mbm_local_enabled() &&
- r->alloc_capable && is_mba_linear());
+ r->alloc_capable && is_mba_linear() &&
+ snc_nodes_per_l3_cache == 1);
}

/*
--
2.44.0


2024-05-15 22:24:35

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v18 07/17] x86/resctrl: Prepare for new Sub-NUMA (SNC) cluster monitor files

When SNC is enabled, monitoring data is collected at SNC node
granularity, but must also be reported at L3 cache granularity for
backwards compatibility, in addition to the new per-node reporting.

Add a mon_display_scope field to the rdt_resource structure to track
the reporting scope. The default covers non-SNC systems, where the
monitoring and reporting scopes are the same.
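
A minimal sketch of how a later SNC-detection step could use the new
field (snc_adjust_scopes() is a hypothetical helper, and RESCTRL_NODE
stands in for the node scope enumerator added earlier in this series):

/*
 * Sketch only: when SNC is active, monitoring moves to node scope
 * while the user-visible summary files stay at L3 cache scope.
 */
static void snc_adjust_scopes(struct rdt_resource *r)
{
        if (snc_nodes_per_l3_cache > 1)
                r->mon_scope = RESCTRL_NODE;
        /* r->mon_display_scope remains RESCTRL_L3_CACHE */
}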

This is the first step to an eventual goal of monitor reporting files
like this (for a system with two SNC nodes per L3):

$ cd /sys/fs/resctrl/mon_data
$ tree mon_L3_00
mon_L3_00 <- 00 here is L3 cache id
├── llc_occupancy \ These files provide legacy support
├── mbm_local_bytes > for non-SNC aware monitor apps
├── mbm_total_bytes / that expect data at L3 cache level
├── mon_sub_L3_00 <- 00 here is SNC node id
│   ├── llc_occupancy \ These files are finer grained
│   ├── mbm_local_bytes > data from each SNC node
│   └── mbm_total_bytes /
└── mon_sub_L3_01
    ├── llc_occupancy \
    ├── mbm_local_bytes > As above, but for node 1.
    └── mbm_total_bytes /

Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 1 +
2 files changed, 3 insertions(+)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 5c7775343c3e..98c0ff8ba005 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -187,6 +187,7 @@ enum resctrl_scope {
* @num_rmid: Number of RMIDs available
* @ctrl_scope: Scope of this resource for control functions
* @mon_scope: Scope of this resource for monitor functions
+ * @mon_display_scope: Scope for user reporting monitor functions
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
* @ctrl_domains: RCU list of all control domains for this resource
@@ -207,6 +208,7 @@ struct rdt_resource {
int num_rmid;
enum resctrl_scope ctrl_scope;
enum resctrl_scope mon_scope;
+ enum resctrl_scope mon_display_scope;
struct resctrl_cache cache;
struct resctrl_membw membw;
struct list_head ctrl_domains;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index bfa9d3a429fd..15856254fea7 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -71,6 +71,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.name = "L3",
.ctrl_scope = RESCTRL_L3_CACHE,
.mon_scope = RESCTRL_L3_CACHE,
+ .mon_display_scope = RESCTRL_L3_CACHE,
.ctrl_domains = ctrl_domain_init(RDT_RESOURCE_L3),
.mon_domains = mon_domain_init(RDT_RESOURCE_L3),
.parse_ctrlval = parse_cbm,
--
2.44.0


2024-05-15 22:24:36

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v18 08/17] x86/resctrl: Add and initialize display_id field to struct rdt_mon_domain

When Sub-NUMA Cluster (SNC) mode is enabled, monitoring domains are
created at SNC node scope. Add a field that holds the identity of the
L3 cache for each domain to make it easy to find all domains that share
the same L3 cache instance. There are three places where this is
needed. In all cases code is operating on a domain where "d->hdr.id"
refers to the SNC node id.

1) When making monitor directories.
Need the L3 cache instance ID to make the mon_L3_XX directory
that will contain the legacy monitor reporting files and the
mon_sub_L3_YY directory for this domain.
2) When removing monitor directories.
Similar to making directories.
3) When reporting data from one of the L3-scoped legacy files.
This requires summing data from each SNC node that shares the
same L3 cache instance id, as shown in the sketch below.
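
For case 3), a minimal sketch of the walk this field enables
(read_one_domain() is a placeholder for the real counter read):

/*
 * Sketch only: sum a counter across all SNC monitor domains that
 * share one L3 cache instance, keyed by display_id.
 */
static u64 snc_sum_sketch(struct rdt_resource *r, int display_id)
{
        struct rdt_mon_domain *d;
        u64 sum = 0;

        list_for_each_entry(d, &r->mon_domains, hdr.list) {
                if (d->display_id == display_id)
                        sum += read_one_domain(d);
        }
        return sum;
}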

Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 8 ++++++++
2 files changed, 10 insertions(+)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 98c0ff8ba005..2f8ac925bc18 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -96,6 +96,7 @@ struct rdt_ctrl_domain {
/**
* struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
* @hdr: common header for different domain types
+ * @display_id: shared id used to identify domains to be summed for display
* @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
* @mbm_total: saved state for MBM total bandwidth
* @mbm_local: saved state for MBM local bandwidth
@@ -106,6 +107,7 @@ struct rdt_ctrl_domain {
*/
struct rdt_mon_domain {
struct rdt_domain_hdr hdr;
+ int display_id;
unsigned long *rmid_busy_llc;
struct mbm_state *mbm_total;
struct mbm_state *mbm_local;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 15856254fea7..dd40c998df72 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -614,6 +614,14 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)

d = &hw_dom->d_resctrl;
d->hdr.id = id;
+ d->display_id = get_domain_id_from_scope(cpu, r->mon_display_scope);
+ if (d->display_id < 0) {
+ pr_warn_once("Can't find monitor domain display id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->mon_display_scope, r->name);
+ mon_domain_free(hw_dom);
+ return;
+ }
+
d->hdr.type = RESCTRL_MON_DOMAIN;
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);

--
2.44.0


2024-05-15 22:24:47

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v18 04/17] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures

The same rdt_domain structure is used for both control and monitor
functions. But this results in wasted memory as some of the fields are
only used by control functions, while most are only used for monitor
functions.

Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
just the fields required for control and monitoring respectively.

Similarly, split the rdt_hw_domain structure into rdt_hw_ctrl_domain
and rdt_hw_mon_domain.

Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 48 ++++++++-------
arch/x86/kernel/cpu/resctrl/internal.h | 62 ++++++++++++--------
arch/x86/kernel/cpu/resctrl/core.c | 71 ++++++++++++-----------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 28 ++++-----
arch/x86/kernel/cpu/resctrl/monitor.c | 40 ++++++-------
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 6 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 64 ++++++++++----------
7 files changed, 174 insertions(+), 145 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 96ddf9ff3183..aa2c22a8e37b 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -78,7 +78,23 @@ struct rdt_domain_hdr {
};

/**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
+ * @hdr: common header for different domain types
+ * @plr: pseudo-locked region (if any) associated with domain
+ * @staged_config: parsed configuration to be applied
+ * @mbps_val: When mba_sc is enabled, this holds the array of user
+ * specified control values for mba_sc in MBps, indexed
+ * by closid
+ */
+struct rdt_ctrl_domain {
+ struct rdt_domain_hdr hdr;
+ struct pseudo_lock_region *plr;
+ struct resctrl_staged_config staged_config[CDP_NUM_TYPES];
+ u32 *mbps_val;
+};
+
+/**
+ * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
* @hdr: common header for different domain types
* @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
* @mbm_total: saved state for MBM total bandwidth
@@ -87,13 +103,8 @@ struct rdt_domain_hdr {
* @cqm_limbo: worker to periodically read CQM h/w counters
* @mbm_work_cpu: worker CPU for MBM h/w counters
* @cqm_work_cpu: worker CPU for CQM h/w counters
- * @plr: pseudo-locked region (if any) associated with domain
- * @staged_config: parsed configuration to be applied
- * @mbps_val: When mba_sc is enabled, this holds the array of user
- * specified control values for mba_sc in MBps, indexed
- * by closid
*/
-struct rdt_domain {
+struct rdt_mon_domain {
struct rdt_domain_hdr hdr;
unsigned long *rmid_busy_llc;
struct mbm_state *mbm_total;
@@ -102,9 +113,6 @@ struct rdt_domain {
struct delayed_work cqm_limbo;
int mbm_work_cpu;
int cqm_work_cpu;
- struct pseudo_lock_region *plr;
- struct resctrl_staged_config staged_config[CDP_NUM_TYPES];
- u32 *mbps_val;
};

/**
@@ -208,7 +216,7 @@ struct rdt_resource {
const char *format_str;
int (*parse_ctrlval)(struct rdt_parse_data *data,
struct resctrl_schema *s,
- struct rdt_domain *d);
+ struct rdt_ctrl_domain *d);
struct list_head evt_list;
unsigned long fflags;
bool cdp_capable;
@@ -242,15 +250,15 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
* Update the ctrl_val and apply this config right now.
* Must be called on one of the domain's CPUs.
*/
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type t, u32 cfg_val);

-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type type);
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
void resctrl_online_cpu(unsigned int cpu);
void resctrl_offline_cpu(unsigned int cpu);

@@ -279,7 +287,7 @@ void resctrl_offline_cpu(unsigned int cpu);
* Return:
* 0 on success, or -EIO, -EINVAL etc on error.
*/
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 closid, u32 rmid, enum resctrl_event_id eventid,
u64 *val, void *arch_mon_ctx);

@@ -312,7 +320,7 @@ static inline void resctrl_arch_rmid_read_context_check(void)
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 closid, u32 rmid,
enum resctrl_event_id eventid);

@@ -325,7 +333,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);

extern unsigned int resctrl_rmid_realloc_threshold;
extern unsigned int resctrl_rmid_realloc_limit;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 377679b79919..135190e0711c 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -147,7 +147,7 @@ union mon_data_bits {
struct rmid_read {
struct rdtgroup *rgrp;
struct rdt_resource *r;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;
enum resctrl_event_id evtid;
bool first;
int err;
@@ -232,7 +232,7 @@ struct mongroup {
*/
struct pseudo_lock_region {
struct resctrl_schema *s;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
u32 cbm;
wait_queue_head_t lock_thread_wq;
int thread_done;
@@ -355,25 +355,41 @@ struct arch_mbm_state {
};

/**
- * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
- * a resource
+ * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
+ * a resource for a control function
* @d_resctrl: Properties exposed to the resctrl file system
* @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID)
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_ctrl_domain {
+ struct rdt_ctrl_domain d_resctrl;
+ u32 *ctrl_val;
+};
+
+/**
+ * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
+ * a resource for a monitor function
+ * @d_resctrl: Properties exposed to the resctrl file system
* @arch_mbm_total: arch private state for MBM total bandwidth
* @arch_mbm_local: arch private state for MBM local bandwidth
*
* Members of this structure are accessed via helpers that provide abstraction.
*/
-struct rdt_hw_domain {
- struct rdt_domain d_resctrl;
- u32 *ctrl_val;
+struct rdt_hw_mon_domain {
+ struct rdt_mon_domain d_resctrl;
struct arch_mbm_state *arch_mbm_total;
struct arch_mbm_state *arch_mbm_local;
};

-static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
+static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
+{
+ return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
+}
+
+static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
{
- return container_of(r, struct rdt_hw_domain, d_resctrl);
+ return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
}

/**
@@ -385,7 +401,7 @@ static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
*/
struct msr_param {
struct rdt_resource *res;
- struct rdt_domain *dom;
+ struct rdt_ctrl_domain *dom;
u32 low;
u32 high;
};
@@ -458,9 +474,9 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
}

int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
- struct rdt_domain *d);
+ struct rdt_ctrl_domain *d);
int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
- struct rdt_domain *d);
+ struct rdt_ctrl_domain *d);

extern struct mutex rdtgroup_mutex;

@@ -564,22 +580,22 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
int rdtgroup_schemata_show(struct kernfs_open_file *of,
struct seq_file *s, void *v);
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
unsigned long cbm, int closid, bool exclusive);
-unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d,
+unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d,
unsigned long cbm);
enum rdtgrp_mode rdtgroup_mode_by_closid(int closid);
int rdtgroup_tasks_assigned(struct rdtgroup *r);
int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp);
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm);
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d);
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm);
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d);
int rdt_pseudo_lock_init(void);
void rdt_pseudo_lock_release(void);
int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
-struct rdt_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_mon_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r);
int closids_supported(void);
void closid_free(int closid);
int alloc_rmid(u32 closid);
@@ -590,19 +606,19 @@ bool __init rdt_cpu_has(int flag);
void mon_event_count(void *info);
int rdtgroup_mondata_show(struct seq_file *m, void *arg);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
int evtid, int first);
-void mbm_setup_overflow_handler(struct rdt_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
unsigned long delay_ms,
int exclude_cpu);
void mbm_handle_overflow(struct work_struct *work);
void __init intel_rdt_mbm_apply_quirk(void);
bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
int exclude_cpu);
void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_domain *d);
-void __check_limbo(struct rdt_domain *d, bool force_free);
+bool has_busy_rmid(struct rdt_mon_domain *d);
+void __check_limbo(struct rdt_mon_domain *d, bool force_free);
void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
void __init thread_throttle_mode_init(void);
void __init mbm_config_rftype_init(const char *config);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index edd9b2bfb53d..b4f2be776408 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -309,8 +309,8 @@ static void rdt_get_cdp_l2_config(void)

static void mba_wrmsr_amd(struct msr_param *m)
{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(m->dom);
unsigned int i;

for (i = m->low; i < m->high; i++)
@@ -333,8 +333,8 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)

static void mba_wrmsr_intel(struct msr_param *m)
{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(m->dom);
unsigned int i;

/* Write the delay values for mba. */
@@ -344,17 +344,17 @@ static void mba_wrmsr_intel(struct msr_param *m)

static void cat_wrmsr(struct msr_param *m)
{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(m->dom);
unsigned int i;

for (i = m->low; i < m->high; i++)
wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
}

-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
{
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;

lockdep_assert_cpus_held();

@@ -367,9 +367,9 @@ struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
return NULL;
}

-struct rdt_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_mon_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r)
{
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;

lockdep_assert_cpus_held();

@@ -440,18 +440,23 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
*dc = r->default_ctrl;
}

-static void domain_free(struct rdt_hw_domain *hw_dom)
+static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
+{
+ kfree(hw_dom->ctrl_val);
+ kfree(hw_dom);
+}
+
+static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
{
kfree(hw_dom->arch_mbm_total);
kfree(hw_dom->arch_mbm_local);
- kfree(hw_dom->ctrl_val);
kfree(hw_dom);
}

-static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
struct msr_param m;
u32 *dc;

@@ -476,7 +481,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
* @num_rmid: The size of the MBM counter array
* @hw_dom: The domain that owns the allocated arrays
*/
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
{
size_t tsize;

@@ -515,10 +520,10 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+ struct rdt_hw_ctrl_domain *hw_dom;
struct list_head *add_pos = NULL;
- struct rdt_hw_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
int err;

lockdep_assert_held(&domain_list_lock);
@@ -533,7 +538,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
if (hdr) {
if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
return;
- d = container_of(hdr, struct rdt_domain, hdr);
+ d = container_of(hdr, struct rdt_ctrl_domain, hdr);

cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
@@ -553,7 +558,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
rdt_domain_reconfigure_cdp(r);

if (domain_setup_ctrlval(r, d)) {
- domain_free(hw_dom);
+ ctrl_domain_free(hw_dom);
return;
}

@@ -563,7 +568,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
if (err) {
list_del_rcu(&d->hdr.list);
synchronize_rcu();
- domain_free(hw_dom);
+ ctrl_domain_free(hw_dom);
}
}

@@ -571,9 +576,9 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->mon_scope);
struct list_head *add_pos = NULL;
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_mon_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;
int err;

lockdep_assert_held(&domain_list_lock);
@@ -588,7 +593,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
if (hdr) {
if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
return;
- d = container_of(hdr, struct rdt_domain, hdr);
+ d = container_of(hdr, struct rdt_mon_domain, hdr);

cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
return;
@@ -604,7 +609,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);

if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
- domain_free(hw_dom);
+ mon_domain_free(hw_dom);
return;
}

@@ -614,7 +619,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
if (err) {
list_del_rcu(&d->hdr.list);
synchronize_rcu();
- domain_free(hw_dom);
+ mon_domain_free(hw_dom);
}
}

@@ -629,9 +634,9 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_ctrl_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;

lockdep_assert_held(&domain_list_lock);

@@ -651,8 +656,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
return;

- d = container_of(hdr, struct rdt_domain, hdr);
- hw_dom = resctrl_to_arch_dom(d);
+ d = container_of(hdr, struct rdt_ctrl_domain, hdr);
+ hw_dom = resctrl_to_arch_ctrl_dom(d);

cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
if (cpumask_empty(&d->hdr.cpu_mask)) {
@@ -661,12 +666,12 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
synchronize_rcu();

/*
- * rdt_domain "d" is going to be freed below, so clear
+ * rdt_ctrl_domain "d" is going to be freed below, so clear
* its pointer from pseudo_lock_region struct.
*/
if (d->plr)
d->plr->d = NULL;
- domain_free(hw_dom);
+ ctrl_domain_free(hw_dom);

return;
}
@@ -675,9 +680,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->mon_scope);
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_mon_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;

lockdep_assert_held(&domain_list_lock);

@@ -697,15 +702,15 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
return;

- d = container_of(hdr, struct rdt_domain, hdr);
- hw_dom = resctrl_to_arch_dom(d);
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ hw_dom = resctrl_to_arch_mon_dom(d);

cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
if (cpumask_empty(&d->hdr.cpu_mask)) {
resctrl_offline_mon_domain(r, d);
list_del_rcu(&d->hdr.list);
synchronize_rcu();
- domain_free(hw_dom);
+ mon_domain_free(hw_dom);

return;
}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 8cc36723f077..3b9383612c35 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -60,7 +60,7 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
}

int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
- struct rdt_domain *d)
+ struct rdt_ctrl_domain *d)
{
struct resctrl_staged_config *cfg;
u32 closid = data->rdtgrp->closid;
@@ -139,7 +139,7 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
* resource type.
*/
int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
- struct rdt_domain *d)
+ struct rdt_ctrl_domain *d)
{
struct rdtgroup *rdtgrp = data->rdtgrp;
struct resctrl_staged_config *cfg;
@@ -208,8 +208,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
struct resctrl_staged_config *cfg;
struct rdt_resource *r = s->res;
struct rdt_parse_data data;
+ struct rdt_ctrl_domain *d;
char *dom = NULL, *id;
- struct rdt_domain *d;
unsigned long dom_id;

/* Walking r->domains, ensure it can't race with cpuhp */
@@ -272,11 +272,11 @@ static u32 get_config_index(u32 closid, enum resctrl_conf_type type)
}
}

-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type t, u32 cfg_val)
{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
u32 idx = get_config_index(closid, t);
struct msr_param msr_param;

@@ -297,17 +297,17 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
{
struct resctrl_staged_config *cfg;
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_ctrl_domain *hw_dom;
struct msr_param msr_param;
+ struct rdt_ctrl_domain *d;
enum resctrl_conf_type t;
- struct rdt_domain *d;
u32 idx;

/* Walking r->domains, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();

list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
- hw_dom = resctrl_to_arch_dom(d);
+ hw_dom = resctrl_to_arch_ctrl_dom(d);
msr_param.res = NULL;
for (t = 0; t < CDP_NUM_TYPES; t++) {
cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -430,10 +430,10 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
return ret ?: nbytes;
}

-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type type)
{
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
u32 idx = get_config_index(closid, type);

return hw_dom->ctrl_val[idx];
@@ -442,7 +442,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid)
{
struct rdt_resource *r = schema->res;
- struct rdt_domain *dom;
+ struct rdt_ctrl_domain *dom;
bool sep = false;
u32 ctrl_val;

@@ -514,7 +514,7 @@ static int smp_mon_event_count(void *arg)
}

void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
int evtid, int first)
{
int cpu;
@@ -557,11 +557,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
struct rdt_domain_hdr *hdr;
+ struct rdt_mon_domain *d;
u32 resid, evtid, domid;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
union mon_data_bits md;
- struct rdt_domain *d;
struct rmid_read rr;
int ret = 0;

@@ -582,7 +582,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
ret = -ENOENT;
goto out;
}
- d = container_of(hdr, struct rdt_domain, hdr);
+ d = container_of(hdr, struct rdt_mon_domain, hdr);

mon_event_read(&rr, r, d, rdtgrp, evtid, false);

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 82a44de8136f..89d7e6fcbaa1 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -209,7 +209,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
return 0;
}

-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
u32 rmid,
enum resctrl_event_id eventid)
{
@@ -228,11 +228,11 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
return NULL;
}

-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 unused, u32 rmid,
enum resctrl_event_id eventid)
{
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
struct arch_mbm_state *am;

am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -248,9 +248,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
* Assumes that hardware counters are also reset and thus that there is
* no need to record initial non-zero counts.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
{
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);

if (is_mbm_total_enabled())
memset(hw_dom->arch_mbm_total, 0,
@@ -269,12 +269,12 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
return chunks >> shift;
}

-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 unused, u32 rmid, enum resctrl_event_id eventid,
u64 *val, void *ignored)
{
+ struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
struct arch_mbm_state *am;
u64 msr_val, chunks;
int ret;
@@ -320,7 +320,7 @@ static void limbo_release_entry(struct rmid_entry *entry)
* decrement the count. If the busy count gets to zero on an RMID, we
* free the RMID
*/
-void __check_limbo(struct rdt_domain *d, bool force_free)
+void __check_limbo(struct rdt_mon_domain *d, bool force_free)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
@@ -378,7 +378,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx);
}

-bool has_busy_rmid(struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_mon_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();

@@ -479,7 +479,7 @@ int alloc_rmid(u32 closid)
static void add_rmid_to_limbo(struct rmid_entry *entry)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;
u32 idx;

lockdep_assert_held(&rdtgroup_mutex);
@@ -531,7 +531,7 @@ void free_rmid(u32 closid, u32 rmid)
list_add_tail(&entry->list, &rmid_free_lru);
}

-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 closid,
+static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
u32 rmid, enum resctrl_event_id evtid)
{
u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
@@ -667,12 +667,12 @@ void mon_event_count(void *info)
* throttle MSRs already have low percentage values. To avoid
* unnecessarily restricting such rdtgroups, we also increase the bandwidth.
*/
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
{
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
+ struct rdt_ctrl_domain *dom_mba;
struct rdt_resource *r_mba;
- struct rdt_domain *dom_mba;
u32 cur_bw, user_bw, idx;
struct list_head *head;
struct rdtgroup *entry;
@@ -733,7 +733,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
}

-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
+static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 closid, u32 rmid)
{
struct rmid_read rr;
@@ -791,12 +791,12 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
void cqm_handle_limbo(struct work_struct *work)
{
unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;

cpus_read_lock();
mutex_lock(&rdtgroup_mutex);

- d = container_of(work, struct rdt_domain, cqm_limbo.work);
+ d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);

__check_limbo(d, false);

@@ -819,7 +819,7 @@ void cqm_handle_limbo(struct work_struct *work)
* @exclude_cpu: Which CPU the handler should not run on,
* RESCTRL_PICK_ANY_CPU to pick any CPU.
*/
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
int exclude_cpu)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
@@ -836,9 +836,9 @@ void mbm_handle_overflow(struct work_struct *work)
{
unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
struct rdtgroup *prgrp, *crgrp;
+ struct rdt_mon_domain *d;
struct list_head *head;
struct rdt_resource *r;
- struct rdt_domain *d;

cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
@@ -851,7 +851,7 @@ void mbm_handle_overflow(struct work_struct *work)
goto out_unlock;

r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- d = container_of(work, struct rdt_domain, mbm_over.work);
+ d = container_of(work, struct rdt_mon_domain, mbm_over.work);

list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
mbm_update(r, d, prgrp->closid, prgrp->mon.rmid);
@@ -885,7 +885,7 @@ void mbm_handle_overflow(struct work_struct *work)
* @exclude_cpu: Which CPU the handler should not run on,
* RESCTRL_PICK_ANY_CPU to pick any CPU.
*/
-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms,
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
int exclude_cpu)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 58985ffcf74e..abec0d6d9476 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -814,7 +814,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
* Return: true if @cbm overlaps with pseudo-locked region on @d, false
* otherwise.
*/
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm)
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm)
{
unsigned int cbm_len;
unsigned long cbm_b;
@@ -841,11 +841,11 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm
* if it is not possible to test due to memory allocation issue,
* false otherwise.
*/
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d)
{
+ struct rdt_ctrl_domain *d_i;
cpumask_var_t cpu_with_psl;
struct rdt_resource *r;
- struct rdt_domain *d_i;
bool ret = false;

/* Walking r->domains, ensure it can't race with cpuhp */
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 7c1475f393ff..cc31ede1a1e7 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -92,8 +92,8 @@ void rdt_last_cmd_printf(const char *fmt, ...)

void rdt_staged_configs_clear(void)
{
+ struct rdt_ctrl_domain *dom;
struct rdt_resource *r;
- struct rdt_domain *dom;

lockdep_assert_held(&rdtgroup_mutex);

@@ -1012,7 +1012,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
unsigned long sw_shareable = 0, hw_shareable = 0;
unsigned long exclusive = 0, pseudo_locked = 0;
struct rdt_resource *r = s->res;
- struct rdt_domain *dom;
+ struct rdt_ctrl_domain *dom;
int i, hwb, swb, excl, psl;
enum rdtgrp_mode mode;
bool sep = false;
@@ -1243,7 +1243,7 @@ static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of,
*
* Return: false if CBM does not overlap, true if it does.
*/
-static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
+static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_domain *d,
unsigned long cbm, int closid,
enum resctrl_conf_type type, bool exclusive)
{
@@ -1298,7 +1298,7 @@ static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d
*
* Return: true if CBM overlap detected, false if there is no overlap
*/
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
unsigned long cbm, int closid, bool exclusive)
{
enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -1329,10 +1329,10 @@ bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
{
int closid = rdtgrp->closid;
+ struct rdt_ctrl_domain *d;
struct resctrl_schema *s;
struct rdt_resource *r;
bool has_cache = false;
- struct rdt_domain *d;
u32 ctrl;

/* Walking r->domains, ensure it can't race with cpuhp */
@@ -1448,7 +1448,7 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of,
* bitmap functions work correctly.
*/
unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
- struct rdt_domain *d, unsigned long cbm)
+ struct rdt_ctrl_domain *d, unsigned long cbm)
{
struct cpu_cacheinfo *ci;
unsigned int size = 0;
@@ -1480,9 +1480,9 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
{
struct resctrl_schema *schema;
enum resctrl_conf_type type;
+ struct rdt_ctrl_domain *d;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
- struct rdt_domain *d;
unsigned int size;
int ret = 0;
u32 closid;
@@ -1594,7 +1594,7 @@ static void mon_event_config_read(void *info)
mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
}

-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
+static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info)
{
smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
}
@@ -1602,7 +1602,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
{
struct mon_config_info mon_info = {0};
- struct rdt_domain *dom;
+ struct rdt_mon_domain *dom;
bool sep = false;

cpus_read_lock();
@@ -1661,7 +1661,7 @@ static void mon_event_config_write(void *info)
}

static void mbm_config_write_domain(struct rdt_resource *r,
- struct rdt_domain *d, u32 evtid, u32 val)
+ struct rdt_mon_domain *d, u32 evtid, u32 val)
{
struct mon_config_info mon_info = {0};

@@ -1702,7 +1702,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
char *dom_str = NULL, *id_str;
unsigned long dom_id, val;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;

/* Walking r->domains, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();
@@ -2261,9 +2261,9 @@ static inline bool is_mba_linear(void)
static int set_cache_qos_cfg(int level, bool enable)
{
void (*update)(void *arg);
+ struct rdt_ctrl_domain *d;
struct rdt_resource *r_l;
cpumask_var_t cpu_mask;
- struct rdt_domain *d;
int cpu;

/* Walking r->domains, ensure it can't race with cpuhp */
@@ -2313,7 +2313,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
l3_qos_cfg_update(&hw_res->cdp_enabled);
}

-static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
+static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
u32 num_closid = resctrl_arch_get_num_closid(r);
int cpu = cpumask_any(&d->hdr.cpu_mask);
@@ -2331,7 +2331,7 @@ static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
}

static void mba_sc_domain_destroy(struct rdt_resource *r,
- struct rdt_domain *d)
+ struct rdt_ctrl_domain *d)
{
kfree(d->mbps_val);
d->mbps_val = NULL;
@@ -2357,7 +2357,7 @@ static int set_mba_sc(bool mba_sc)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
u32 num_closid = resctrl_arch_get_num_closid(r);
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
int i;

if (!supports_mba_mbps() || mba_sc == is_mba_sc(r))
@@ -2629,7 +2629,7 @@ static int rdt_get_tree(struct fs_context *fc)
{
struct rdt_fs_context *ctx = rdt_fc2context(fc);
unsigned long flags = RFTYPE_CTRL_BASE;
- struct rdt_domain *dom;
+ struct rdt_mon_domain *dom;
struct rdt_resource *r;
int ret;

@@ -2814,9 +2814,9 @@ static int rdt_init_fs_context(struct fs_context *fc)
static int reset_all_ctrls(struct rdt_resource *r)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_ctrl_domain *hw_dom;
struct msr_param msr_param;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
int i;

/* Walking r->domains, ensure it can't race with cpuhp */
@@ -2832,7 +2832,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
* from each domain to update the MSRs below.
*/
list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
- hw_dom = resctrl_to_arch_dom(d);
+ hw_dom = resctrl_to_arch_ctrl_dom(d);

for (i = 0; i < hw_res->num_closid; i++)
hw_dom->ctrl_val[i] = r->default_ctrl;
@@ -3025,7 +3025,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
}

static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
- struct rdt_domain *d,
+ struct rdt_mon_domain *d,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
union mon_data_bits priv;
@@ -3074,7 +3074,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
* and "monitor" groups with given domain id.
*/
static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- struct rdt_domain *d)
+ struct rdt_mon_domain *d)
{
struct kernfs_node *parent_kn;
struct rdtgroup *prgrp, *crgrp;
@@ -3096,7 +3096,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_resource *r,
struct rdtgroup *prgrp)
{
- struct rdt_domain *dom;
+ struct rdt_mon_domain *dom;
int ret;

/* Walking r->domains, ensure it can't race with cpuhp */
@@ -3201,7 +3201,7 @@ static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r)
* Set the RDT domain up to start off with all usable allocations. That is,
* all shareable and unused bits. All-zero CBM is invalid.
*/
-static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
+static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schema *s,
u32 closid)
{
enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -3281,7 +3281,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
*/
static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
{
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
int ret;

list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
@@ -3297,7 +3297,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
{
struct resctrl_staged_config *cfg;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;

list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
if (is_mba_sc(r)) {
@@ -3923,14 +3923,14 @@ static void __init rdtgroup_setup_default(void)
mutex_unlock(&rdtgroup_mutex);
}

-static void domain_destroy_mon_state(struct rdt_domain *d)
+static void domain_destroy_mon_state(struct rdt_mon_domain *d)
{
bitmap_free(d->rmid_busy_llc);
kfree(d->mbm_total);
kfree(d->mbm_local);
}

-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
mutex_lock(&rdtgroup_mutex);

@@ -3940,7 +3940,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
mutex_unlock(&rdtgroup_mutex);
}

-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
{
mutex_lock(&rdtgroup_mutex);

@@ -3971,7 +3971,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
mutex_unlock(&rdtgroup_mutex);
}

-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
size_t tsize;
@@ -4002,7 +4002,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
int err = 0;

@@ -4018,7 +4018,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
return err;
}

-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
{
int err;

@@ -4073,8 +4073,8 @@ static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
void resctrl_offline_cpu(unsigned int cpu)
{
struct rdt_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_mon_domain *d;
struct rdtgroup *rdtgrp;
- struct rdt_domain *d;

mutex_lock(&rdtgroup_mutex);
list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
--
2.44.0


2024-05-15 22:25:03

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v18 09/17] x86/resctrl: Add new fields to struct rmid_read for summation of domains

rdtgroup_mondata_show() calls mon_event_count() which packages up all
the required details into an rmid_read structure passed across the
smp_call*() infrastructure.

Legacy files reporting for a single domain pass that domain in the
rmid_read structure. Files that need to sum multiple domains have
metadata that provides the display_id for domains that must be
summed.

Add the sumdomains and display_id fields to the rmid_read structure.
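
For reference, the intended semantics of the new members (an
illustrative sketch, not part of the diff below):

	bool sumdomains;	/* Report the sum of event counts from all
				 * SNC domains that share an L3 cache
				 * instance. */
	int display_id;		/* Only valid when @sumdomains is true:
				 * the id shared by all domains whose
				 * counts must be summed. */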

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 49440f194253..498c5d240c68 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -150,6 +150,8 @@ struct rmid_read {
struct rdt_mon_domain *d;
enum resctrl_event_id evtid;
bool first;
+ bool sumdomains;
+ int display_id;
int err;
u64 val;
void *arch_mon_ctx;
--
2.44.0


2024-05-15 22:25:06

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v18 11/17] x86/resctrl: Allocate a new bit in union mon_data_bits

When Sub-NUMA (SNC) mode is enabled the legacy monitor reporting files
must report the sum of the data from all of the SNC nodes that share the
L3 cache that is referenced by the monitor file.

Resctrl squeezes all the attributes of these files into 32-bits so they
can be stored in the "priv" field of struct kernfs_node.

Steal one bit from the "evtid" field (currently 8 bits, but only three
events supported by Intel) to create a new "sum" field that indicates
this file must sum across SNC nodes. This bit also indicates that the
domid field is the display_id to match to find which domains must be
summed.
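
The 32-bit budget is unchanged: 10 (rid) + 7 (evtid) + 1 (sum) +
14 (domid) = 32 bits. An illustrative view of the packing (the exact
bit ordering of bitfields is compiler-defined):

	bits  0-9	rid	resource id
	bits 10-16	evtid	event id (value range 0-127)
	bit  17		sum	sum across SNC nodes when set
	bits 18-31	domid	domain id, or display_id when sum == 1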

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 498c5d240c68..c54ad12ff2b8 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -132,14 +132,19 @@ struct mon_evt {
* as kernfs private data
* @rid: Resource id associated with the event file
* @evtid: Event id associated with the event file
- * @domid: The domain to which the event file belongs
+ * @sum: Set when event must be summed across multiple
+ * domains.
+ * @domid: When @sum is zero this is the domain to which
+ * the event file belongs. When sum is one this
+ * is the display_id of all domains to be summed
* @u: Name of the bit fields struct
*/
union mon_data_bits {
void *priv;
struct {
unsigned int rid : 10;
- enum resctrl_event_id evtid : 8;
+ enum resctrl_event_id evtid : 7;
+ unsigned int sum : 1;
unsigned int domid : 14;
} u;
};
--
2.44.0


2024-05-15 22:25:35

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v18 12/17] x86/resctrl: Create Sub-NUMA (SNC) monitor files

When SNC mode is enabled, create subdirectories and file to monitor
at the SNC node granularity. Monitor files at the L3 granularity are
tagged with a "sum" attribute to indicate that all SNC nodes sharing
an L3 cache should be read and summed to provide the result to the
user.

Note that the "domid" field for files that must sum across SNC domains
has the L3 cache instance id, while non-summing files use the domain id.

Also the "sum" files do not need to make a call to mon_event_read() to
initialize the MBM counters. This will be handled by initializing the
individual SNC nodes that share the L3.
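
For example, on a system with two SNC nodes per L3 cache the resulting
hierarchy for cache instance 0 looks like this (illustrative, node
numbering assumed):

	mon_data/
	    mon_L3_00/			<- "sum" files, domid = display_id
	        llc_occupancy
	        mbm_total_bytes
	        mbm_local_bytes
	        mon_sub_L3_00/		<- SNC node 0 only, domid = d->hdr.id
	            llc_occupancy
	            ...
	        mon_sub_L3_01/		<- SNC node 1 only
	            ...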

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 53 ++++++++++++++++++--------
1 file changed, 38 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 7a6c40aefdcc..f0f468babdea 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3026,7 +3026,8 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
}

static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
- struct rdt_resource *r, struct rdtgroup *prgrp)
+ struct rdt_resource *r, struct rdtgroup *prgrp,
+ bool do_sum)
{
union mon_data_bits priv;
struct mon_evt *mevt;
@@ -3037,15 +3038,18 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
return -EPERM;

priv.u.rid = r->rid;
- priv.u.domid = d->hdr.id;
+ priv.u.domid = do_sum ? d->display_id : d->hdr.id;
+ priv.u.sum = do_sum;
list_for_each_entry(mevt, &r->evt_list, list) {
priv.u.evtid = mevt->evtid;
ret = mon_addfile(kn, mevt->name, priv.priv);
if (ret)
return ret;

- if (is_mbm_event(mevt->evtid))
+ if (!do_sum && is_mbm_event(mevt->evtid)) {
+ rr.sumdomains = 0;
mon_event_read(&rr, r, d, prgrp, mevt->evtid, true);
+ }
}

return 0;
@@ -3055,23 +3059,42 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
struct rdt_mon_domain *d,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
- struct kernfs_node *kn;
+ struct kernfs_node *kn, *ckn;
char name[32];
+ bool do_sum;
int ret;

- sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
- /* create the directory */
- kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
- if (IS_ERR(kn))
- return PTR_ERR(kn);
+ do_sum = r->mon_scope != r->mon_display_scope;
+ sprintf(name, "mon_%s_%02d", r->name, d->display_id);
+ kn = kernfs_find_and_get_ns(parent_kn, name, NULL);
+ if (!kn) {
+ /* create the directory */
+ kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
+ if (IS_ERR(kn))
+ return PTR_ERR(kn);

- ret = rdtgroup_kn_set_ugid(kn);
- if (ret)
- goto out_destroy;
+ ret = rdtgroup_kn_set_ugid(kn);
+ if (ret)
+ goto out_destroy;
+ ret = mon_add_all_files(kn, d, r, prgrp, do_sum);
+ if (ret)
+ goto out_destroy;
+ }

- ret = mon_add_all_files(kn, d, r, prgrp);
- if (ret)
- goto out_destroy;
+ if (do_sum) {
+ sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
+ ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp);
+ if (IS_ERR(ckn))
+ goto out_destroy;
+
+ ret = rdtgroup_kn_set_ugid(ckn);
+ if (ret)
+ goto out_destroy;
+
+ ret = mon_add_all_files(ckn, d, r, prgrp, false);
+ if (ret)
+ goto out_destroy;
+ }

kernfs_activate(kn);
return 0;
--
2.44.0


2024-05-15 22:25:50

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v18 10/17] x86/resctrl: Refactor mkdir_mondata_subdir() with a helper function

Move the creation of monitoring files into a helper function.

No functional change.

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 45 ++++++++++++++++----------
1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 0923492a8bd0..7a6c40aefdcc 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3025,14 +3025,37 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
}
}

+static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
+ struct rdt_resource *r, struct rdtgroup *prgrp)
+{
+ union mon_data_bits priv;
+ struct mon_evt *mevt;
+ struct rmid_read rr;
+ int ret;
+
+ if (WARN_ON(list_empty(&r->evt_list)))
+ return -EPERM;
+
+ priv.u.rid = r->rid;
+ priv.u.domid = d->hdr.id;
+ list_for_each_entry(mevt, &r->evt_list, list) {
+ priv.u.evtid = mevt->evtid;
+ ret = mon_addfile(kn, mevt->name, priv.priv);
+ if (ret)
+ return ret;
+
+ if (is_mbm_event(mevt->evtid))
+ mon_event_read(&rr, r, d, prgrp, mevt->evtid, true);
+ }
+
+ return 0;
+}
+
static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
struct rdt_mon_domain *d,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
- union mon_data_bits priv;
struct kernfs_node *kn;
- struct mon_evt *mevt;
- struct rmid_read rr;
char name[32];
int ret;

@@ -3046,22 +3069,10 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
if (ret)
goto out_destroy;

- if (WARN_ON(list_empty(&r->evt_list))) {
- ret = -EPERM;
+ ret = mon_add_all_files(kn, d, r, prgrp);
+ if (ret)
goto out_destroy;
- }

- priv.u.rid = r->rid;
- priv.u.domid = d->hdr.id;
- list_for_each_entry(mevt, &r->evt_list, list) {
- priv.u.evtid = mevt->evtid;
- ret = mon_addfile(kn, mevt->name, priv.priv);
- if (ret)
- goto out_destroy;
-
- if (is_mbm_event(mevt->evtid))
- mon_event_read(&rr, r, d, prgrp, mevt->evtid, true);
- }
kernfs_activate(kn);
return 0;

--
2.44.0


2024-05-15 22:26:00

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v18 14/17] x86/resctrl: Sum monitor data acrss Sub-NUMA (SNC) nodes when needed

When the sumdomains field is set in the rmid_read structure, walk
the list of domains in this resource to find all that share an L3
cache id (rr->display_id).

Adjust the RMID value based on which SNC domain is being accessed.
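
As a worked example (numbers assumed for illustration): on a system
with 256 physical RMIDs and two SNC nodes per L3 cache, resctrl
exposes r->num_rmid = 256 / 2 = 128 logical RMIDs. Reading logical
RMID 5 on a domain whose CPUs are on NUMA node 1 accesses physical
RMID 5 + (1 % 2) * 128 = 133, which is the adjustment the counter
read path needs because the hardware renumbers RMIDs per SNC node.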

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 20 +++++++++++---
arch/x86/kernel/cpu/resctrl/monitor.c | 33 ++++++++++++++++++++++-
2 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 3b9383612c35..7ab788d47ad3 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -575,15 +575,27 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
resid = md.u.rid;
domid = md.u.domid;
evtid = md.u.evtid;
-
+ rr.sumdomains = md.u.sum;
r = &rdt_resources_all[resid].r_resctrl;
- hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
- if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
+
+ if (rr.sumdomains) {
+ rr.display_id = domid;
+ list_for_each_entry(d, &r->mon_domains, hdr.list) {
+ if (d->display_id == domid)
+ goto got_domain;
+ }
ret = -ENOENT;
goto out;
+ } else {
+ hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
+ if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
+ ret = -ENOENT;
+ goto out;
+ }
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
}
- d = container_of(hdr, struct rdt_mon_domain, hdr);

+got_domain:
mon_event_read(&rr, r, d, rdtgrp, evtid, false);

if (rr.err == -EIO)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 0f66825a1ac9..668d2fdf58cd 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -546,7 +546,7 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
}
}

-static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
+static int ___mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
{
struct mbm_state *m;
u64 tval = 0;
@@ -569,6 +569,37 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
return 0;
}

+static u32 get_node_rmid(struct rdt_resource *r, struct rdt_mon_domain *d, u32 rmid)
+{
+ int cpu = cpumask_any(&d->hdr.cpu_mask);
+
+ return rmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+}
+
+static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
+{
+ struct rdt_mon_domain *d;
+ u32 node_rmid;
+ int ret = 0;
+
+ if (!rr->sumdomains) {
+ node_rmid = get_node_rmid(rr->r, rr->d, rmid);
+ return ___mon_event_count(closid, node_rmid, rr);
+ }
+
+ list_for_each_entry(d, &rr->r->mon_domains, hdr.list) {
+ if (d->display_id != rr->display_id)
+ continue;
+ rr->d = d;
+ node_rmid = get_node_rmid(rr->r, d, rmid);
+ ret = ___mon_event_count(closid, node_rmid, rr);
+ if (ret)
+ break;
+ }
+
+ return ret;
+}
+
/*
* mbm_bw_count() - Update bw count from values previously read by
* __mon_event_count().
--
2.44.0


2024-05-15 22:26:05

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v18 15/17] x86/resctrl: Fix RMID reading sanity check for Sub-NUMA (SNC) mode

The sanity check that RMIDs are being read from a CPU listed in the
the cpu_mask for the domain is incorrect when summing across multiple
SNC domains. It is safe to read the RMID from any CPU that shares the
same L3 cache instance.

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/monitor.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 668d2fdf58cd..e4b92c7af71d 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -15,6 +15,7 @@
* Software Developer Manual June 2016, volume 3, section 17.17.
*/

+#include <linux/cacheinfo.h>
#include <linux/cpu.h>
#include <linux/module.h>
#include <linux/sizes.h>
@@ -281,8 +282,18 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,

resctrl_arch_rmid_read_context_check();

- if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
- return -EINVAL;
+ if (r->mon_scope == r->mon_display_scope) {
+ if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
+ return -EINVAL;
+ } else {
+ /*
+ * SNC: OK to read events on any CPU sharing same L3
+ * cache instance.
+ */
+ if (d->display_id != get_cpu_cacheinfo_id(smp_processor_id(),
+ r->mon_display_scope))
+ return -EINVAL;
+ }

ret = __rmid_read(rmid, eventid, &msr_val);
if (ret)
--
2.44.0


2024-05-15 22:26:17

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v18 13/17] x86/resctrl: Handle removing directories in Sub-NUMA (SNC) mode

In SNC mode there are multiple subdirectories in each L3 level monitor
directory (one for each SNC node). If all the CPUs in an SNC node are
taken offline, then just that SNC node directory must be removed. In
non-SNC mode, or when the last SNC node directory is removed, also
remove the L3 monitor directory.
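
For example, with two SNC nodes per L3 cache, "mon_L3_00" contains
"mon_sub_L3_00" and "mon_sub_L3_01". Taking all CPUs of SNC node 1
offline removes only "mon_sub_L3_01"; when the CPUs of node 0 go
offline as well, the count of remaining domains sharing display_id 0
drops to one and the whole "mon_L3_00" directory is removed.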

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 43 +++++++++++++++++++++-----
1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index f0f468babdea..cac32ddd3afd 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3011,17 +3011,46 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
* and monitor groups with given domain id.
*/
static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- unsigned int dom_id)
+ struct rdt_mon_domain *d)
{
struct rdtgroup *prgrp, *crgrp;
+ struct rdt_mon_domain *dom;
+ bool remove_all = true;
+ struct kernfs_node *kn;
+ char subname[32];
char name[32];

+ sprintf(name, "mon_%s_%02d", r->name, d->display_id);
+ if (r->mon_scope != r->mon_display_scope) {
+ int count = 0;
+
+ list_for_each_entry(dom, &r->mon_domains, hdr.list)
+ if (d->display_id == dom->display_id)
+ count++;
+ if (count > 1) {
+ remove_all = false;
+ sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
+ }
+ }
+
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
- sprintf(name, "mon_%s_%02d", r->name, dom_id);
- kernfs_remove_by_name(prgrp->mon.mon_data_kn, name);
+ if (remove_all) {
+ kernfs_remove_by_name(prgrp->mon.mon_data_kn, name);
+ } else {
+ kn = kernfs_find_and_get_ns(prgrp->mon.mon_data_kn, name, NULL);
+ if (kn)
+ kernfs_remove_by_name(kn, subname);
+ }

- list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list)
- kernfs_remove_by_name(crgrp->mon.mon_data_kn, name);
+ list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) {
+ if (remove_all) {
+ kernfs_remove_by_name(crgrp->mon.mon_data_kn, name);
+ } else {
+ kn = kernfs_find_and_get_ns(prgrp->mon.mon_data_kn, name, NULL);
+ if (kn)
+ kernfs_remove_by_name(kn, subname);
+ }
+ }
}
}

@@ -3111,8 +3140,8 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
struct rdt_mon_domain *d)
{
- struct kernfs_node *parent_kn;
struct rdtgroup *prgrp, *crgrp;
+ struct kernfs_node *parent_kn;
struct list_head *head;

list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
@@ -3984,7 +4013,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
* per domain monitor data directories.
*/
if (resctrl_mounted && resctrl_arch_mon_capable())
- rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
+ rmdir_mondata_subdir_allrdtgrp(r, d);

if (is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
--
2.44.0


2024-05-15 22:26:41

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v18 16/17] x86/resctrl: Sub NUMA Cluster detection and enable

There isn't a simple hardware bit that indicates whether a CPU is
running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
the number of CPUs sharing the L3 cache with CPU0 to the number of CPUs in
the same NUMA node as CPU0.

When SNC mode is detected, reconfigure the RMID counters by updating
the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.

Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
on the second SNC node to start from zero.
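
For example (CPU counts assumed for illustration): on a socket where
112 logical CPUs share one L3 cache and NUMA node 0 contains 56 of
them, the ratio is 112 / 56 = 2, so SNC is enabled with two nodes per
L3 cache. On a non-SNC system both counts are equal and the ratio is 1.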

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 104 +++++++++++++++++++++++++++++
2 files changed, 105 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index e022e6eb766c..3cb8dd6311c3 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1164,6 +1164,7 @@
#define MSR_IA32_QM_CTR 0xc8e
#define MSR_IA32_PQR_ASSOC 0xc8f
#define MSR_IA32_L3_CBM_BASE 0xc90
+#define MSR_RMID_SNC_CONFIG 0xca0
#define MSR_IA32_L2_CBM_BASE 0xd10
#define MSR_IA32_MBA_THRTL_BASE 0xd50

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index dd40c998df72..195f9e29c553 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -21,6 +21,7 @@
#include <linux/err.h>
#include <linux/cacheinfo.h>
#include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>

#include <asm/cpu_device_id.h>
#include <asm/resctrl.h>
@@ -753,11 +754,42 @@ static void clear_closid_rmid(int cpu)
RESCTRL_RESERVED_CLOSID);
}

+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * This mode is incompatible with Linux resctrl semantics
+ * as RMIDs are partitioned between SNC nodes, which requires
+ * a user to know which RMID is allocated to a task.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in Sub NUMA Cluster mode. This mode is better for Linux.
+ * The RMID space is divided between all SNC nodes with the
+ * RMIDs renumbered to start from zero in each node when
+ * counting operations from tasks. Code to read the counters
+ * must adjust RMID counter numbers based on SNC node. See
+ * __rmid_read() for code that does this.
+ */
+static void snc_remap_rmids(int cpu)
+{
+ u64 val;
+
+ /* Only need to enable once per package. */
+ if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
+ return;
+
+ rdmsrl(MSR_RMID_SNC_CONFIG, val);
+ val &= ~BIT_ULL(0);
+ wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
static int resctrl_arch_online_cpu(unsigned int cpu)
{
struct rdt_resource *r;

mutex_lock(&domain_list_lock);
+
+ if (snc_nodes_per_l3_cache > 1)
+ snc_remap_rmids(cpu);
+
for_each_capable_rdt_resource(r)
domain_add_cpu(cpu, r);
mutex_unlock(&domain_list_lock);
@@ -997,11 +1029,83 @@ static __init bool get_rdt_resources(void)
return (rdt_mon_capable || rdt_alloc_capable);
}

+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+ X86_MATCH_VFM(INTEL_ICELAKE_X, 0),
+ X86_MATCH_VFM(INTEL_SAPPHIRERAPIDS_X, 0),
+ X86_MATCH_VFM(INTEL_EMERALDRAPIDS_X, 0),
+ X86_MATCH_VFM(INTEL_GRANITERAPIDS_X, 0),
+ X86_MATCH_VFM(INTEL_ATOM_CRESTMONT_X, 0),
+ {}
+};
+
+/*
+ * There isn't a simple hardware bit that indicates whether a CPU is running
+ * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
+ * number of CPUs sharing the L3 cache with CPU0 to the number of CPUs in
+ * the same NUMA node as CPU0.
+ * It is not possible to accurately determine SNC state if the system is
+ * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
+ * to L3 caches. It will be OK if system is booted with hyperthreading
+ * disabled (since this doesn't affect the ratio).
+ */
+static __init int snc_get_config(void)
+{
+ struct cpu_cacheinfo *ci = get_cpu_cacheinfo(0);
+ const cpumask_t *node0_cpumask;
+ cpumask_t *l3_cpumask = NULL;
+ int i, ret;
+
+ if (!x86_match_cpu(snc_cpu_ids))
+ return 1;
+
+ cpus_read_lock();
+ if (num_online_cpus() != num_present_cpus())
+ pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
+ cpus_read_unlock();
+
+ for (i = 0; i < ci->num_leaves; i++) {
+ if (ci->info_list[i].level == 3) {
+ if (ci->info_list[i].attributes & CACHE_ID) {
+ l3_cpumask = &ci->info_list[i].shared_cpu_map;
+ break;
+ }
+ }
+ }
+ if (!l3_cpumask) {
+ pr_info("can't get CPU0 L3 mask\n");
+ return 1;
+ }
+
+ node0_cpumask = cpumask_of_node(cpu_to_node(0));
+
+ ret = bitmap_weight(cpumask_bits(l3_cpumask), nr_cpu_ids) /
+ bitmap_weight(cpumask_bits(node0_cpumask), nr_cpu_ids);
+
+ /* sanity check: Only valid results are 1, 2, 3, 4 */
+ switch (ret) {
+ case 1:
+ break;
+ case 2 ... 4:
+ pr_info("Sub-NUMA cluster detected with %d nodes per L3 cache\n", ret);
+ rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
+ break;
+ default:
+ pr_warn("Ignore improbable SNC node count %d\n", ret);
+ ret = 1;
+ break;
+ }
+
+ return ret;
+}
+
static __init void rdt_init_res_defs_intel(void)
{
struct rdt_hw_resource *hw_res;
struct rdt_resource *r;

+ snc_nodes_per_l3_cache = snc_get_config();
+
for_each_rdt_resource(r) {
hw_res = resctrl_to_arch_res(r);

--
2.44.0


2024-05-15 22:26:54

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v18 17/17] x86/resctrl: Update documentation with Sub-NUMA cluster changes

*** This patch needs updating for new files for monitoring ***

With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
their name refer to Sub-NUMA nodes instead of L3 cache ids.

Users should be aware that SNC mode also affects the amount of L3 cache
available for allocation within each SNC node.
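
For example (illustrative paths, assuming resctrl is mounted at
/sys/fs/resctrl and two SNC nodes share L3 cache instance 0):

	# sum across both SNC nodes sharing the L3 cache
	cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy

	# SNC node 1 only
	cat /sys/fs/resctrl/mon_data/mon_L3_00/mon_sub_L3_01/llc_occupancy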

Signed-off-by: Tony Luck <[email protected]>
---
Documentation/arch/x86/resctrl.rst | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 627e23869bca..401f6bfb4a3c 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -375,6 +375,10 @@ When monitoring is enabled all MON groups will also contain:
all tasks in the group. In CTRL_MON groups these files provide
the sum for all tasks in the CTRL_MON group and all tasks in
MON groups. Please see example section for more details on usage.
+ On systems with Sub-NUMA (SNC) cluster enabled there are extra
+ directories for each node (located within the "mon_L3_XX" directory
+ for the L3 cache they occupy). These are named "mon_sub_L3_YY"
+ where "YY" is the node number.

"mon_hw_id":
Available only with debug option. The identifier used by hardware
@@ -484,6 +488,19 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
each bit represents 5% of the capacity of the cache. You could partition
the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.

+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes.
+The top-level monitoring files in each "mon_L3_XX" directory provide
+the sum of data across all SNC nodes sharing an L3 cache instance.
+Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
+the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the
+"mon_sub_L3_YY" directories to get node local data.
+
Memory bandwidth Allocation and monitoring
==========================================

--
2.44.0


2024-05-16 08:58:29

by Maciej Wieczor-Retman

[permalink] [raw]
Subject: Re: [PATCH v18 17/17] x86/resctrl: Update documentation with Sub-NUMA cluster changes

On 2024-05-15 at 15:23:25 -0700, Tony Luck wrote:
>*** This patch needs updating for new files for monitoring ***

Is this note here by accident? New files seem to be mentioned in the patch.

>
>With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
>per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
>their name refer to Sub-NUMA nodes instead of L3 cache ids.
>
>Users should be aware that SNC mode also affects the amount of L3 cache
>available for allocation within each SNC node.
>
>Signed-off-by: Tony Luck <[email protected]>
>---
> Documentation/arch/x86/resctrl.rst | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
>diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
>index 627e23869bca..401f6bfb4a3c 100644
>--- a/Documentation/arch/x86/resctrl.rst
>+++ b/Documentation/arch/x86/resctrl.rst
>@@ -375,6 +375,10 @@ When monitoring is enabled all MON groups will also contain:
> all tasks in the group. In CTRL_MON groups these files provide
> the sum for all tasks in the CTRL_MON group and all tasks in
> MON groups. Please see example section for more details on usage.
>+ On systems with Sub-NUMA (SNC) cluster enabled there are extra
>+ directories for each node (located within the "mon_L3_XX" directory
>+ for the L3 cache they occupy). These are named "mon_sub_L3_YY"
>+ where "YY" is the node number.
>
> "mon_hw_id":
> Available only with debug option. The identifier used by hardware
>@@ -484,6 +488,19 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
> each bit represents 5% of the capacity of the cache. You could partition
> the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
>
>+Notes on Sub-NUMA Cluster mode
>+==============================
>+When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
>+nodes much more readily than between regular NUMA nodes since the CPUs
>+on Sub-NUMA nodes share the same L3 cache and the system may report
>+the NUMA distance between Sub-NUMA nodes with a lower value than used
>+for regular NUMA nodes.
>+The top-level monitoring files in each "mon_L3_XX" directory provide
>+the sum of data across all SNC nodes sharing an L3 cache instance.
>+Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
>+the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the
>+"mon_sub_L3_YY" directories to get node local data.
>+
> Memory bandwidth Allocation and monitoring
> ==========================================
>
>--
>2.44.0
>

--
Kind regards
Maciej Wieczór-Retman

2024-05-16 08:59:59

by Maciej Wieczor-Retman

[permalink] [raw]
Subject: Re: [PATCH v18 15/17] x86/resctrl: Fix RMID reading sanity check for Sub-NUMA (SNC) mode

On 2024-05-15 at 15:23:23 -0700, Tony Luck wrote:
>The sanity check that RMIDs are being read from a CPU listed in the
>the cpu_mask for the domain is incorrect when summing across multiple

the the cpu_mask -> the cpu_mask?

>SNC domains. It is safe to read the RMID from any CPU that shares the
>same L3 cache instance.
>
>Signed-off-by: Tony Luck <[email protected]>
>---
> arch/x86/kernel/cpu/resctrl/monitor.c | 15 +++++++++++++--
> 1 file changed, 13 insertions(+), 2 deletions(-)
>
>diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>index 668d2fdf58cd..e4b92c7af71d 100644
>--- a/arch/x86/kernel/cpu/resctrl/monitor.c
>+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>@@ -15,6 +15,7 @@
> * Software Developer Manual June 2016, volume 3, section 17.17.
> */
>
>+#include <linux/cacheinfo.h>
> #include <linux/cpu.h>
> #include <linux/module.h>
> #include <linux/sizes.h>
>@@ -281,8 +282,18 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>
> resctrl_arch_rmid_read_context_check();
>
>- if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
>- return -EINVAL;
>+ if (r->mon_scope == r->mon_display_scope) {
>+ if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
>+ return -EINVAL;
>+ } else {
>+ /*
>+ * SNC: OK to read events on any CPU sharing same L3
>+ * cache instance.
>+ */
>+ if (d->display_id != get_cpu_cacheinfo_id(smp_processor_id(),
>+ r->mon_display_scope))
>+ return -EINVAL;
>+ }
>
> ret = __rmid_read(rmid, eventid, &msr_val);
> if (ret)
>--
>2.44.0
>

--
Kind regards
Maciej Wieczór-Retman

2024-05-16 09:29:04

by Maciej Wieczor-Retman

[permalink] [raw]
Subject: Re: [PATCH v18 00/17] Add support for Sub-NUMA cluster (SNC) systems

On 2024-05-15 at 15:23:08 -0700, Tony Luck wrote:
>This series based on top of Linus upstream commit 33e02dc69afb ("Merge
>tag 'sound-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound")
>
>The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
>that share an L3 cache into two or more sets. This plays havoc with the
>Resource Director Technology (RDT) monitoring features. Prior to this
>patch Intel has advised that SNC and RDT are incompatible.
>
>Some of these CPUs support an MSR that can partition the RMID counters
>in the same way. This allows monitoring features to be used. Legacy
>monitoring files provide the sum of counters from each SNC node for
>backwards compatibility. Additional files per SNC node provide details
>per node.
>
>Cache and memory bandwidth allocation features continue to operate at
>the scope of the L3 cache.
>
>Signed-off-by: Tony Luck <[email protected]>
>
>---
>Changes since v17: https://lore.kernel.org/all/[email protected]/
>
>Reinette: This is still using the per-domain display_id field as
>discussed. Would a better name make the intent clearer?
>
>Patch 7 in previous version included virtually all of the new changes.
>But that meant it was doing a lot of thinngs in a single patch
>(including reverting a dozen lines from patch 6!)
>
>So this series breaks patch 7 into nine pieces (0007..0015) for
>better documentation in commit comments of the changes, and hopefully
>easier review.
>
>Patches 0001 ... 0005: Unchanged
>Patch 0006: Dropped change that was reverted in v17.0007
>
>Next nine are the split of the original patch v17.0007
>Patch 0007: Added bigger commit comment describing where
> this part of the series is heading and why.
>Patch 0008: Added justification for new display_id field in struct rdt_mon_domain
>Patch 0009: Split out a helper from mkdir_mondata_subdir()
> so real changes in patch 0011 are easier to see.
>Patch 0010: Comment stealing a bit from union mon_data_bits.evtid
>Patch 0011: Save display_id instead of a random d->id in
> meta data for monitor files that must sum SNC nodes
> Don't call mon_event_read() to initialize "sum" files
>Patch 0012: Set domid for "sum" files to the display id, not
> to whatever SNC domain ID is in use here. Don't
> call mon_event_read() for "sum" files.
>Patch 0013: No change (apart from being split out from old patch 7)
>Patch 0014: Because of change in patch 0011 to save the
> display_id can no longer look up a domain using
> rdt_find_domain(). Instead search r->mon_domains
> for a match with d->display_id or d->hdr.id
> Drop extra arg to ___mon_event_count() also
> the "tmp" variable in __mon_event_count()
>Patch 0015: Put #include <linux/cacheinfo.h> in alphabetical order
> When SNC is disabled, keep the old check that
> the current CPU is in the domain being read.
> For the SNC case add comment about reading
> monitor values from any CPU in the same L3 domain.
>
>Patch 0016: Took alternate SNC detection algorithm from:
> https://lore.kernel.org/all/[email protected]/
> as it is simpler. But merged in the sanity
> checks that make sense.
> Converted the X86_MATCH*() usage to new model
> that supports Intel families other than "6".
>Patch 0017: No change
>
>
>Tony Luck (17):
> x86/resctrl: Prepare for new domain scope
> x86/resctrl: Prepare to split rdt_domain structure
> x86/resctrl: Prepare for different scope for control/monitor
> operations
> x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
> x86/resctrl: Add node-scope to the options for feature scope
> x86/resctrl: Introduce snc_nodes_per_l3_cache
> x86/resctrl: Prepare for new Sub-NUMA (SNC) cluster monitor files
> x86/resctrl: Add and initialize display_id field to struct
> rdt_mon_domain
> x86/resctrl: Add new fields to struct rmid_read for summation of
> domains
> x86/resctrl: Refactor mkdir_mondata_subdir() with a helper function
> x86/resctrl: Allocate a new bit in union mon_data_bits
> x86/resctrl: Create Sub-NUMA (SNC) monitor files
> x86/resctrl: Handle removing directories in Sub-NUMA (SNC) mode
> x86/resctrl: Sum monitor data acrss Sub-NUMA (SNC) nodes when needed
> x86/resctrl: Fix RMID reading sanity check for Sub-NUMA (SNC) mode
> x86/resctrl: Sub NUMA Cluster detection and enable
> x86/resctrl: Update documentation with Sub-NUMA cluster changes
>
> Documentation/arch/x86/resctrl.rst | 17 +
> include/linux/resctrl.h | 89 +++--
> arch/x86/include/asm/msr-index.h | 1 +
> arch/x86/kernel/cpu/resctrl/internal.h | 78 ++--
> arch/x86/kernel/cpu/resctrl/core.c | 422 ++++++++++++++++++----
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 70 ++--
> arch/x86/kernel/cpu/resctrl/monitor.c | 106 ++++--
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 26 +-
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 267 +++++++++-----
> 9 files changed, 779 insertions(+), 297 deletions(-)
>
>
>base-commit: 33e02dc69afbd8f1b85a51d74d72f139ba4ca623
>--
>2.44.0
>

Tested-by: Maciej Wieczor-Retman <[email protected]>

--
Kind regards
Maciej Wieczór-Retman

2024-05-16 15:53:28

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v18 00/17] Add support for Sub-NUMA cluster (SNC) systems

> Tested-by: Maciej Wieczor-Retman <[email protected]>

Thanks for testing. Also, for the fixups to commit messages in patches 15 & 17.

-Tony

2024-05-22 21:08:18

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v18 06/17] x86/resctrl: Introduce snc_nodes_per_l3_cache

Hi Tony,

On 5/15/2024 3:23 PM, Tony Luck wrote:
> Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
> and memory controllers on a socket into two or more groups. These are
> presented to the operating system as NUMA nodes.
>
> This may enable some workloads to have slightly lower latency to memory
> as the memory controller(s) in an SNC node are electrically closer to the
> CPU cores on that SNC node. This cost may be offset by lower bandwidth
> since the memory accesses for each core can only be interleaved between
> the memory controllers on the same SNC node.
>
> Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
> to track L3 cache occupancy and memory bandwidth. There is an MSR that
> controls how the RMIDs are shared between SNC nodes.
>
> The default mode divides them numerically. E.g. when there are two SNC
> nodes on a socket the lower number half of the RMIDs are given to the
> first node, the remainder to the second node. This would be difficult
> to use with the Linux resctrl interface as specific RMID values assigned
> to resctrl groups are not visible to users.
>
> The other mode divides the RMIDs and renumbers the ones on the second
> SNC node to start from zero.
>
> Even with this renumbering SNC mode requires several changes in resctrl
> behavior for correct operation.
>
> Add a global integer "snc_nodes_per_l3_cache" that shows how many
> SNC nodes share each L3 cache. When "snc_nodes_per_l3_cache" is "1",
> SNC mode is either not implemented, or not enabled.
>
> Update all places to take appropriate action when SNC mode is enabled:
> 1) The number of logical RMIDs per L3 cache available for use is the
> number of physical RMIDs divided by the number of SNC nodes.

Should it then perhaps be "number of logical RMIDs per SNC node"?
The way this feature is introduced makes it hard to understand how
RMIDs are used when SNC is enabled since the implementation appears
to distinguish between (a) RMIDs assigned to monitor group and written
to PQR register and (b) RMIDs for which the event counters are read.
The former ((a)) is used directly after the adjustment described in (1) but
the latter needs an adjustment that I did not notice being mentioned.

I think it will help to move the get_node_rmid() helper that is
introduced later here and explain why it is needed.

> 2) Likewise the "mon_scale" value must be divided by the number of SNC
> nodes.
> 3) Disable the "-o mba_MBps" mount option in SNC mode
> because the monitoring is being done per SNC node, while the
> bandwidth allocation is still done at the L3 cache scope.
> Trying to use this feedback loop might result in contradictory
> changes to the throttling level coming from each of the SNC
> node bandwidth measurements.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
> arch/x86/kernel/cpu/resctrl/core.c | 6 ++++++
> arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++--
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 3 ++-
> 4 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 135190e0711c..49440f194253 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -484,6 +484,8 @@ extern struct rdt_hw_resource rdt_resources_all[];
> extern struct rdtgroup rdtgroup_default;
> extern struct dentry *debugfs_resctrl;
>
> +extern unsigned int snc_nodes_per_l3_cache;
> +
> enum resctrl_res_level {
> RDT_RESOURCE_L3,
> RDT_RESOURCE_L2,
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 395bac851f6e..bfa9d3a429fd 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -331,6 +331,12 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
> return r->default_ctrl;
> }
>
> +/*
> + * Number of SNC nodes that share each L3 cache. Default is 1 for
> + * systems that do not support SNC, or have SNC disabled.
> + */
> +unsigned int snc_nodes_per_l3_cache = 1;
> +
> static void mba_wrmsr_intel(struct msr_param *m)
> {
> struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 89d7e6fcbaa1..0f66825a1ac9 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -1022,8 +1022,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
> int ret;
>
> resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
> - hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
> - r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
> + hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
> + r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
> hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
>
> if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index cc31ede1a1e7..0923492a8bd0 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2346,7 +2346,8 @@ static bool supports_mba_mbps(void)
> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
>
> return (is_mbm_local_enabled() &&
> - r->alloc_capable && is_mba_linear());
> + r->alloc_capable && is_mba_linear() &&
> + snc_nodes_per_l3_cache == 1);
> }
>
> /*

Since the software controller is a filesystem feature the above
now requires that snc_nodes_per_l3_cache becomes part of the resctrl
filesystem code and every architecture will need to set snc_nodes_per_l3_cache.
Every architecture will thus need to interpret what "SNC" means for them
using the term introduced here. That may be ok ... but the term "SNC"
will then surely not identify an Intel feature and Intel needs to be ok
that any architecture calls their "similar to SNC but not quite identical"
"SNC".

I assume now that as part of the fs/arch split there needs to be
a new helper that allows different architectures to set this
filesystem variable?

Reinette

2024-05-22 21:15:07

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v18 09/17] x86/resctrl: Add new fields to struct rmid_read for summation of domains

Hi Tony,

On 5/15/2024 3:23 PM, Tony Luck wrote:
> rdtgroup_mondata_show() calls mon_event_count() which packages up all

mon_event_count() -> mon_event_read()?


> the required details into an rmid_read structure passed across the
> smp_call*() infrastructure.
>
> Legacy files reporting for a single domain pass that domain in the
> rmid_read structure. Files that need to sum multiple domains have
> meta data that provides the display_id for domains that must be
> summed.

( ... and also for convenience pass a domain in the rmid_read structure
... more later)

>
> Add the sumdomains and display_id fields to the rmid_read structure.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 49440f194253..498c5d240c68 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -150,6 +150,8 @@ struct rmid_read {
> struct rdt_mon_domain *d;
> enum resctrl_event_id evtid;
> bool first;
> + bool sumdomains;
> + int display_id;
> int err;
> u64 val;
> void *arch_mon_ctx;

These new members have enough obscurity to make this a good
point to document the members of struct rmid_read

Reinette

2024-05-22 21:15:50

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v18 10/17] x86/resctrl: Refactor mkdir_mondata_subdir() with a helper function

Hi Tony,

On 5/15/2024 3:23 PM, Tony Luck wrote:
> Move the creation of monitoring files into a helper function.

Could you please elaborate why this is needed using a changelog
following x86 customs?

Reinette

2024-05-22 21:17:10

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v18 11/17] x86/resctrl: Allocate a new bit in union mon_data_bits

Hi Tony,

On 5/15/2024 3:23 PM, Tony Luck wrote:
> When Sub-NUMA (SNC) mode is enabled the legacy monitor reporting files

Sub-NUMA Cluster (SNC)? (I'll stop pointing these out.)

> must report the sum of the data from all of the SNC nodes that share the
> L3 cache that is referenced by the monitor file.
>
> Resctrl squeezes all the attributes of these files into 32-bits so they
> can be stored in the "priv" field of struct kernfs_node.
>
> Steal one bit from the "evtid" field (currently 8 bits, but only three
> events supported by Intel) to create a new "sum" field that indicates

This is filesystem code so should surely not be just about what is supported
by Intel.

> this file must sum across SNC nodes. This bit also indicates that the
> domid field is the display_id to match to find which domains must be
> summed.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 498c5d240c68..c54ad12ff2b8 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -132,14 +132,19 @@ struct mon_evt {
> * as kernfs private data
> * @rid: Resource id associated with the event file
> * @evtid: Event id associated with the event file
> - * @domid: The domain to which the event file belongs
> + * @sum: Set when event must be summed across multiple
> + * domains.
> + * @domid: When @sum is zero this is the domain to which
> + * the event file belongs. When sum is one this

sum -> @sum to be consistent with previous sentence?

> + * is the display_id of all domains to be summed

"is the monitoring display scope id shared with other monitoring
domains to be summed"?

> * @u: Name of the bit fields struct
> */
> union mon_data_bits {
> void *priv;
> struct {
> unsigned int rid : 10;
> - enum resctrl_event_id evtid : 8;
> + enum resctrl_event_id evtid : 7;
> + unsigned int sum : 1;
> unsigned int domid : 14;
> } u;
> };

Reinette

2024-05-22 21:20:03

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v18 12/17] x86/resctrl: Create Sub-NUMA (SNC) monitor files

Hi Tony,

On 5/15/2024 3:23 PM, Tony Luck wrote:
> When SNC mode is enabled, create subdirectories and file to monitor

"and file" -> "and files"?

> at the SNC node granularity. Monitor files at the L3 granularity are
> tagged with a "sum" attribute to indicate that all SNC nodes sharing
> an L3 cache should be read and summed to provide the result to the
> user.

Why go through effort to create a generic "monitor display scope" and
then just always refer to it as L3 cache scope? One consequence is that
the code and changelog seem to have a disconnect.

>
> Note that the "domid" field for files that must sum across SNC domains
> has the L3 cache instance id, while non-summing files use the domain id.
>
> Also the "sum" files do not need to make a call to mon_event_read() to
> initialize the MBM counters. This will be handled by initializing the
> individual SNC nodes that share the L3.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 53 ++++++++++++++++++--------
> 1 file changed, 38 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 7a6c40aefdcc..f0f468babdea 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -3026,7 +3026,8 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> }
>
> static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> - struct rdt_resource *r, struct rdtgroup *prgrp)
> + struct rdt_resource *r, struct rdtgroup *prgrp,
> + bool do_sum)
> {
> union mon_data_bits priv;
> struct mon_evt *mevt;
> @@ -3037,15 +3038,18 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> return -EPERM;
>
> priv.u.rid = r->rid;
> - priv.u.domid = d->hdr.id;
> + priv.u.domid = do_sum ? d->display_id : d->hdr.id;
> + priv.u.sum = do_sum;
> list_for_each_entry(mevt, &r->evt_list, list) {
> priv.u.evtid = mevt->evtid;
> ret = mon_addfile(kn, mevt->name, priv.priv);
> if (ret)
> return ret;
>
> - if (is_mbm_event(mevt->evtid))
> + if (!do_sum && is_mbm_event(mevt->evtid)) {
> + rr.sumdomains = 0;
> mon_event_read(&rr, r, d, prgrp, mevt->evtid, true);
> + }
> }
>
> return 0;
> @@ -3055,23 +3059,42 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> struct rdt_mon_domain *d,
> struct rdt_resource *r, struct rdtgroup *prgrp)
> {
> - struct kernfs_node *kn;
> + struct kernfs_node *kn, *ckn;
> char name[32];
> + bool do_sum;
> int ret;
>
> - sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
> - /* create the directory */
> - kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
> - if (IS_ERR(kn))
> - return PTR_ERR(kn);
> + do_sum = r->mon_scope != r->mon_display_scope;
> + sprintf(name, "mon_%s_%02d", r->name, d->display_id);

Why not just determine "display_id" dynamically here and pass it as parameter
to mon_add_all_files()? Previously you mentioned that error handling is a problem
but this flow can surely handle errors, no?

> + kn = kernfs_find_and_get_ns(parent_kn, name, NULL);
> + if (!kn) {
> + /* create the directory */
> + kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
> + if (IS_ERR(kn))
> + return PTR_ERR(kn);
>
> - ret = rdtgroup_kn_set_ugid(kn);
> - if (ret)
> - goto out_destroy;
> + ret = rdtgroup_kn_set_ugid(kn);
> + if (ret)
> + goto out_destroy;
> + ret = mon_add_all_files(kn, d, r, prgrp, do_sum);
> + if (ret)
> + goto out_destroy;
> + }
>
> - ret = mon_add_all_files(kn, d, r, prgrp);
> - if (ret)
> - goto out_destroy;
> + if (do_sum) {
> + sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
> + ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp);
> + if (IS_ERR(ckn))
> + goto out_destroy;
> +
> + ret = rdtgroup_kn_set_ugid(ckn);
> + if (ret)
> + goto out_destroy;
> +
> + ret = mon_add_all_files(ckn, d, r, prgrp, false);
> + if (ret)
> + goto out_destroy;
> + }
>
> kernfs_activate(kn);
> return 0;

Reinette

2024-05-22 21:20:48

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v18 08/17] x86/resctrl: Add and initialize display_id field to struct rdt_mon_domain

Hi Tony,

On 5/15/2024 3:23 PM, Tony Luck wrote:
> When Sub-NUMA (SNC) mode is enabled monitoring domains are created at

Sub-NUMA Cluster (SNC) ?

> SNC node scope. Add a field that holds the identity of the L3 cache for

This is not necessarily the L3 cache, but instead intended to be the
monitoring display scope, no?

> each domain to make it easy to find all domains that share the same
> L3 cache instance. There are three places where this is needed. In
> all cases code is operating on a domain where "d->id" refers to the
> SNC node id.
>
> 1) When making monitor directories.
> Need the L3 cache instance ID to make the mon_L3_XX directory
> that will contain the legacy monitor reporting files and the
> mon_sub_L3_YY directory for this domain.
> 2) When removing monitor directories.
> Similar to making directories.
> 3) When reporting data from one of the L3-scoped legacy files.
> This requires summing data from each SNC node that shares the
> same L3 cache instance id.

<insert motivation about why this cannot be determined dynamically
at the places identified>

>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> include/linux/resctrl.h | 2 ++
> arch/x86/kernel/cpu/resctrl/core.c | 8 ++++++++
> 2 files changed, 10 insertions(+)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 98c0ff8ba005..2f8ac925bc18 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -96,6 +96,7 @@ struct rdt_ctrl_domain {
> /**
> * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
> * @hdr: common header for different domain types
> + * @display_id: shared id used to identify domains to be summed for display

This description seems to indicate this is a member only used when needing to
sum domains, thus only for SNC at this time. Looking ahead the description does not
seem to capture that this value has been integrated into non-SNC support and will
always be used when creating files for all domains, whether SNC is enabled or not.
This member thus seems to be used for more than it claims to.

> * @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
> * @mbm_total: saved state for MBM total bandwidth
> * @mbm_local: saved state for MBM local bandwidth
> @@ -106,6 +107,7 @@ struct rdt_ctrl_domain {
> */
> struct rdt_mon_domain {
> struct rdt_domain_hdr hdr;
> + int display_id;
> unsigned long *rmid_busy_llc;
> struct mbm_state *mbm_total;
> struct mbm_state *mbm_local;
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 15856254fea7..dd40c998df72 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -614,6 +614,14 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
>
> d = &hw_dom->d_resctrl;
> d->hdr.id = id;
> + d->display_id = get_domain_id_from_scope(cpu, r->mon_display_scope);
> + if (d->display_id < 0) {
> + pr_warn_once("Can't find monitor domain display id for CPU:%d scope:%d for resource %s\n",
> + cpu, r->mon_display_scope, r->name);
> + mon_domain_free(hw_dom);
> + return;
> + }
> +
> d->hdr.type = RESCTRL_MON_DOMAIN;
> cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
>

Reinette

2024-05-22 21:20:55

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v18 07/17] x86/resctrl: Prepare for new Sub-NUMA (SNC) cluster monitor files

Hi Tony,

Please note in the subject (and many places in this series) the
text "Sub-NUMA (SNC)" is used. Is that intentional? Other places
it reads "Sub-NUMA Cluster (SNC)".

On 5/15/2024 3:23 PM, Tony Luck wrote:
> When SNC is enabled monitoring data is collected at the SNC node
> granularity, but must be reported at L3-cache granularity for
> backwards compatibility in addition to reporting at the node
> level.
>
> Add a mon_display_scope field to the rdt_resource structure to track
> reporting scope. Default is for non-SNC systems where both scopes
> are the same.

Just to confirm ... the reporting scope needs to be superset of
the "collecting"(?) scope. Is this something that is implicitly enforced?

Reinette

2024-05-22 21:21:02

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v18 13/17] x86/resctrl: Handle removing directories in Sub-NUMA (SNC) mode

Hi Tony,

On 5/15/2024 3:23 PM, Tony Luck wrote:
> In SNC mode there are multiple subdirectories in each L3 level monitor
> directory (one for each SNC node). If all the CPUs in an SNC node are
> taken offline, then just that SNC node directory must be removed. In

imperative tone needed

> non-SNC mode, or when the last SNC node directory is removed, also
> remove the L3 monitor directory.

There is a disconnect between changelog and code. The code tries to be
generic while the changelog is as specific to SNC as possible. This makes
it hard to go from changelog (ignoring that changelog does not follow x86
customs to begin with) to patch.

>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 43 +++++++++++++++++++++-----
> 1 file changed, 36 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index f0f468babdea..cac32ddd3afd 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -3011,17 +3011,46 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
> * and monitor groups with given domain id.
> */
> static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> - unsigned int dom_id)
> + struct rdt_mon_domain *d)
> {
> struct rdtgroup *prgrp, *crgrp;
> + struct rdt_mon_domain *dom;
> + bool remove_all = true;
> + struct kernfs_node *kn;
> + char subname[32];
> char name[32];
>
> + sprintf(name, "mon_%s_%02d", r->name, d->display_id);
> + if (r->mon_scope != r->mon_display_scope) {
> + int count = 0;
> +
> + list_for_each_entry(dom, &r->mon_domains, hdr.list)
> + if (d->display_id == dom->display_id)
> + count++;
> + if (count > 1) {
> + remove_all = false;
> + sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
> + }
> + }

This continues to look suspect to me. When I took a closer look during previous
version I thought this information can only be accessed via inode. Seeing this
code again made me look more closely and it seems there is no problem to just
query how many subdirectories a directory has. See for example, kobject_has_children().
Doing something like that seems more intuitive than this quirky way to set and
use a flag.

> +
> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> - sprintf(name, "mon_%s_%02d", r->name, dom_id);
> - kernfs_remove_by_name(prgrp->mon.mon_data_kn, name);
> + if (remove_all) {
> + kernfs_remove_by_name(prgrp->mon.mon_data_kn, name);
> + } else {
> + kn = kernfs_find_and_get_ns(prgrp->mon.mon_data_kn, name, NULL);

kernfs_find_and_get()?

> + if (kn)
> + kernfs_remove_by_name(kn, subname);
> + }
>
> - list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list)
> - kernfs_remove_by_name(crgrp->mon.mon_data_kn, name);
> + list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) {
> + if (remove_all) {
> + kernfs_remove_by_name(crgrp->mon.mon_data_kn, name);
> + } else {
> + kn = kernfs_find_and_get_ns(prgrp->mon.mon_data_kn, name, NULL);

copy&paste ?

> + if (kn)
> + kernfs_remove_by_name(kn, subname);
> + }
> + }
> }
> }
>
> @@ -3111,8 +3140,8 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> struct rdt_mon_domain *d)
> {
> - struct kernfs_node *parent_kn;
> struct rdtgroup *prgrp, *crgrp;
> + struct kernfs_node *parent_kn;
> struct list_head *head;
>

Stray snippet?

> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> @@ -3984,7 +4013,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
> * per domain monitor data directories.
> */
> if (resctrl_mounted && resctrl_arch_mon_capable())
> - rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
> + rmdir_mondata_subdir_allrdtgrp(r, d);
>
> if (is_mbm_enabled())
> cancel_delayed_work(&d->mbm_over);

Reinette

2024-05-22 21:24:29

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v18 14/17] x86/resctrl: Sum monitor data acrss Sub-NUMA (SNC) nodes when needed

Hi Tony,

shortlog: acrss -> across?

On 5/15/2024 3:23 PM, Tony Luck wrote:
> When the sumdomains field is set in the rmid_read structure, walk
> the list of domains in this resource to find all that share an L3
> cache id (rr->display_id).

"L3 cache id" vs "monitor display scope"?

>
> Adjust the RMID value based on which SNC domain is being accessed.

Thinking generously, this changelog contains two brief descriptions of code
snippets. There is no context or explanation of what the goal of this change
is or why it chooses to implement things the way it does ... implementation
choices that definitely need some explanation.

>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 20 +++++++++++---
> arch/x86/kernel/cpu/resctrl/monitor.c | 33 ++++++++++++++++++++++-
> 2 files changed, 48 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 3b9383612c35..7ab788d47ad3 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -575,15 +575,27 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> resid = md.u.rid;
> domid = md.u.domid;
> evtid = md.u.evtid;
> -
> + rr.sumdomains = md.u.sum;
> r = &rdt_resources_all[resid].r_resctrl;
> - hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
> - if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
> +
> + if (rr.sumdomains) {
> + rr.display_id = domid;
> + list_for_each_entry(d, &r->mon_domains, hdr.list) {
> + if (d->display_id == domid)
> + goto got_domain;
> + }
> ret = -ENOENT;
> goto out;
> + } else {
> + hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
> + if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
> + ret = -ENOENT;
> + goto out;
> + }
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> }
> - d = container_of(hdr, struct rdt_mon_domain, hdr);
>
> +got_domain:
> mon_event_read(&rr, r, d, rdtgrp, evtid, false);

This looks more like "wedging things until it works". In the "sumdomains" case
it just picks the first matching domain without any explanation of why that
would be ok. The only reason mon_event_read() needs the domain is so that it
can determine which CPUs it should read the counters from, and it looks like
this code is written to take advantage of just that. It is a hack based on
knowledge of mon_event_read() internals: it just gets the reader to run on a
CPU that works for SNC, where another quirk awaits to override the domain
when the counter is _actually_ read so that it can get the correct architectural
state.

I was expecting to at least see some documentation/comments here
to explain why the code behaves the way it does. Optimistically it can
be documented as an "optimization" for the sumdomains case that is only
used by SNC, where it is ok to read counters from any domain in the
"monitor display scope".

Apart from the documentation I do not think this code should wedge itself
in like this.

To make it obvious what this code does I think the non-SNC case should
set rr.d and mon_event_read() should take a CPU mask as a parameter
instead of a struct rdt_domain. This means that in the SNC case rr.d will
be NULL, making the code flow more obvious.
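
That is, something like this (sketch only, untested; mon_event_count()
naming follows the existing code):

	/*
	 * Caller picks the CPUs: &d->hdr.cpu_mask for the non-SNC case,
	 * or the mask of all CPUs sharing the L3 instance for SNC (with
	 * rr->d left NULL so the summing code walks the domain list
	 * itself).
	 */
	void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
			    const struct cpumask *cpumask,
			    struct rdtgroup *rdtgrp,
			    enum resctrl_event_id evtid, bool first)
	{
		...
		smp_call_function_any(cpumask, mon_event_count, rr, 1);
	}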

>
> if (rr.err == -EIO)
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 0f66825a1ac9..668d2fdf58cd 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -546,7 +546,7 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
> }
> }
>
> -static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> +static int ___mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> {
> struct mbm_state *m;
> u64 tval = 0;
> @@ -569,6 +569,37 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> return 0;
> }
>
> +static u32 get_node_rmid(struct rdt_resource *r, struct rdt_mon_domain *d, u32 rmid)
> +{
> + int cpu = cpumask_any(&d->hdr.cpu_mask);
> +
> + return rmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;

Is this missing an "if (snc_nodes_per_l3_cache > 1)" check?

I do not think this belongs in resctrl fs code though. Should this algorithm be forced
on all architectures? It seems more appropriate for the arch specific code.
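
For example (sketch, helper name made up), keeping the conversion on the
arch side with the SNC-disabled case handled explicitly:

	/* arch side: logical RMID to the physical RMID of @d's SNC node */
	static u32 arch_get_node_rmid(struct rdt_resource *r,
				      struct rdt_mon_domain *d, u32 rmid)
	{
		int cpu;

		if (snc_nodes_per_l3_cache <= 1)
			return rmid;

		cpu = cpumask_any(&d->hdr.cpu_mask);
		return rmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) *
			      r->num_rmid;
	}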

> +}
> +
> +static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> +{
> + struct rdt_mon_domain *d;
> + u32 node_rmid;
> + int ret = 0;
> +
> + if (!rr->sumdomains) {
> + node_rmid = get_node_rmid(rr->r, rr->d, rmid);
> + return ___mon_event_count(closid, node_rmid, rr);
> + }
> +

/*
* rr->sumdomains is only set by SNC mode where the event
* counters of a monitoring domain can be read from a CPU belonging
* to a different monitoring domain that shares the same monitoring
* display domain. Optimize counter reads when needing to sum the
* values by reading the counter for several domains from
* the same CPU instead of sending IPI to one CPU of each monitoring
* domain.
*/

Please feel free to improve it, but please do help folks trying to understand
what this code does.

> + list_for_each_entry(d, &rr->r->mon_domains, hdr.list) {
> + if (d->display_id != rr->display_id)
> + continue;
> + rr->d = d;
> + node_rmid = get_node_rmid(rr->r, d, rmid);
> + ret = ___mon_event_count(closid, node_rmid, rr);
> + if (ret)
> + break;
> + }
> +
> + return ret;
> +}
> +
> /*
> * mbm_bw_count() - Update bw count from values previously read by
> * __mon_event_count().

Reinette

2024-05-22 21:25:37

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v18 15/17] x86/resctrl: Fix RMID reading sanity check for Sub-NUMA (SNC) mode

Hi Tony,

A "Fix" in a shortlog of a kernel commit has quite a specific meaning
that I do not believe applies here. It fixes something introduced by this
patch series, so "Fix" is surely suspect.

On 5/15/2024 3:23 PM, Tony Luck wrote:
> The sanity check that RMIDs are being read from a CPU listed in the
> cpu_mask for the domain is incorrect when summing across multiple
> SNC domains. It is safe to read the RMID from any CPU that shares the
> same L3 cache instance.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/monitor.c | 15 +++++++++++++--
> 1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 668d2fdf58cd..e4b92c7af71d 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -15,6 +15,7 @@
> * Software Developer Manual June 2016, volume 3, section 17.17.
> */
>
> +#include <linux/cacheinfo.h>
> #include <linux/cpu.h>
> #include <linux/module.h>
> #include <linux/sizes.h>
> @@ -281,8 +282,18 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>
> resctrl_arch_rmid_read_context_check();
>
> - if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
> - return -EINVAL;
> + if (r->mon_scope == r->mon_display_scope) {
> + if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
> + return -EINVAL;
> + } else {
> + /*
> + * SNC: OK to read events on any CPU sharing same L3
> + * cache instance.
> + */
> + if (d->display_id != get_cpu_cacheinfo_id(smp_processor_id(),
> + r->mon_display_scope))

By hardcoding that mon_display_scope is a cache instead of using get_domain_id_from_scope()
it seems that all pretending about being generic has just been abandoned at this point.

> + return -EINVAL;
> + }
>
> ret = __rmid_read(rmid, eventid, &msr_val);
> if (ret)

Reinette

2024-05-22 21:27:03

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v18 16/17] x86/resctrl: Sub NUMA Cluster detection and enable

Hi Tony,

On 5/15/2024 3:23 PM, Tony Luck wrote:
> There isn't a simple hardware bit that indicates whether a CPU is
> running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing

Sometimes it is "Sub-NUMA Cluster" and sometimes it is "Sub NUMA Cluster".

> number of CPUs sharing the L3 cache with CPU0 to the number of CPUs in
> the same NUMA node as CPU0.
>
> When SNC mode is detected, reconfigure the RMID counters by updating
> the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.
>
> Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
> on the second SNC node to start from zero.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/include/asm/msr-index.h | 1 +
> arch/x86/kernel/cpu/resctrl/core.c | 104 +++++++++++++++++++++++++++++
> 2 files changed, 105 insertions(+)
>
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index e022e6eb766c..3cb8dd6311c3 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1164,6 +1164,7 @@
> #define MSR_IA32_QM_CTR 0xc8e
> #define MSR_IA32_PQR_ASSOC 0xc8f
> #define MSR_IA32_L3_CBM_BASE 0xc90
> +#define MSR_RMID_SNC_CONFIG 0xca0
> #define MSR_IA32_L2_CBM_BASE 0xd10
> #define MSR_IA32_MBA_THRTL_BASE 0xd50
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index dd40c998df72..195f9e29c553 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -21,6 +21,7 @@
> #include <linux/err.h>
> #include <linux/cacheinfo.h>
> #include <linux/cpuhotplug.h>
> +#include <linux/mod_devicetable.h>
>
> #include <asm/cpu_device_id.h>
> #include <asm/resctrl.h>
> @@ -753,11 +754,42 @@ static void clear_closid_rmid(int cpu)
> RESCTRL_RESERVED_CLOSID);
> }
>
> +/*
> + * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
> + * which indicates that RMIDs are configured in legacy mode.
> + * This mode is incompatible with Linux resctrl semantics
> + * as RMIDs are partitioned between SNC nodes, which requires
> + * a user to know which RMID is allocated to a task.
> + * Clearing bit 0 reconfigures the RMID counters for use
> + * in Sub NUMA Cluster mode. This mode is better for Linux.
> + * The RMID space is divided between all SNC nodes with the
> + * RMIDs renumbered to start from zero in each node when
> + * couning operations from tasks. Code to read the counters

couning -> counting

> + * must adjust RMID counter numbers based on SNC node. See
> + * __rmid_read() for code that does this.
> + */
> +static void snc_remap_rmids(int cpu)
> +{
> + u64 val;
> +
> + /* Only need to enable once per package. */
> + if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
> + return;
> +
> + rdmsrl(MSR_RMID_SNC_CONFIG, val);
> + val &= ~BIT_ULL(0);
> + wrmsrl(MSR_RMID_SNC_CONFIG, val);
> +}
> +
> static int resctrl_arch_online_cpu(unsigned int cpu)
> {
> struct rdt_resource *r;
>
> mutex_lock(&domain_list_lock);
> +
> + if (snc_nodes_per_l3_cache > 1)
> + snc_remap_rmids(cpu);
> +
> for_each_capable_rdt_resource(r)
> domain_add_cpu(cpu, r);
> mutex_unlock(&domain_list_lock);
> @@ -997,11 +1029,83 @@ static __init bool get_rdt_resources(void)
> return (rdt_mon_capable || rdt_alloc_capable);
> }
>
> +/* CPU models that support MSR_RMID_SNC_CONFIG */
> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> + X86_MATCH_VFM(INTEL_ICELAKE_X, 0),
> + X86_MATCH_VFM(INTEL_SAPPHIRERAPIDS_X, 0),
> + X86_MATCH_VFM(INTEL_EMERALDRAPIDS_X, 0),
> + X86_MATCH_VFM(INTEL_GRANITERAPIDS_X, 0),
> + X86_MATCH_VFM(INTEL_ATOM_CRESTMONT_X, 0),
> + {}
> +};
> +
> +/*
> + * There isn't a simple hardware bit that indicates whether a CPU is running
> + * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
> * number of CPUs sharing the L3 cache with CPU0 to the number of CPUs in
> + * the same NUMA node as CPU0.
> + * It is not possible to accurately determine SNC state if the system is
> + * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
> + * to L3 caches. It will be OK if system is booted with hyperthreading
> + * disabled (since this doesn't affect the ratio).
> + */
> +static __init int snc_get_config(void)
> +{
> + struct cpu_cacheinfo *ci = get_cpu_cacheinfo(0);
> + const cpumask_t *node0_cpumask;
> + cpumask_t *l3_cpumask = NULL;
> + int i, ret;
> +
> + if (!x86_match_cpu(snc_cpu_ids))
> + return 1;
> +
> + cpus_read_lock();
> + if (num_online_cpus() != num_present_cpus())
> + pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
> + cpus_read_unlock();
> +
> + for (i = 0; i < ci->num_leaves; i++) {
> + if (ci->info_list[i].level == 3) {
> + if (ci->info_list[i].attributes & CACHE_ID) {
> + l3_cpumask = &ci->info_list[i].shared_cpu_map;
> + break;
> + }
> + }
> + }
> + if (!l3_cpumask) {
> + pr_info("can't get CPU0 L3 mask\n");

Sentence can start with upper case

> + return 1;
> + }
> +
> + node0_cpumask = cpumask_of_node(cpu_to_node(0));
> +
> + ret = bitmap_weight(cpumask_bits(l3_cpumask), nr_cpu_ids) /
> + bitmap_weight(cpumask_bits(node0_cpumask), nr_cpu_ids);
> +

Can cpumask_weight() be used?
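
i.e. (sketch):

	ret = cpumask_weight(l3_cpumask) / cpumask_weight(node0_cpumask);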

> + /* sanity check: Only valid results are 1, 2, 3, 4 */
> + switch (ret) {
> + case 1:
> + break;
> + case 2 ... 4:
> + pr_info("Sub-NUMA cluster detected with %d nodes per L3 cache\n", ret);
> + rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
> + break;
> + default:
> + pr_warn("Ignore improbable SNC node count %d\n", ret);
> + ret = 1;
> + break;
> + }
> +
> + return ret;
> +}
> +
> static __init void rdt_init_res_defs_intel(void)
> {
> struct rdt_hw_resource *hw_res;
> struct rdt_resource *r;
>
> + snc_nodes_per_l3_cache = snc_get_config();
> +
> for_each_rdt_resource(r) {
> hw_res = resctrl_to_arch_res(r);
>

Reinette

2024-05-22 21:28:00

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v18 17/17] x86/resctrl: Update documentation with Sub-NUMA cluster changes

Hi Tony,

On 5/15/2024 3:23 PM, Tony Luck wrote:
> *** This patch needs updating for new files for monitoring ***
>
> With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
> per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
> their name refer to Sub-NUMA nodes instead of L3 cache ids.

Out of date?

>
> Users should be aware that SNC mode also affects the amount of L3 cache
> available for allocation within each SNC node.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> Documentation/arch/x86/resctrl.rst | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index 627e23869bca..401f6bfb4a3c 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -375,6 +375,10 @@ When monitoring is enabled all MON groups will also contain:
> all tasks in the group. In CTRL_MON groups these files provide
> the sum for all tasks in the CTRL_MON group and all tasks in
> MON groups. Please see example section for more details on usage.
> + On systems with Sub-NUMA (SNC) cluster enabled there are extra

"Sub-NUMA (SNC) cluster" -> "Sub-NUMA Cluster (SNC)"?

> + directories for each node (located within the "mon_L3_XX" directory
> + for the L3 cache they occupy). These are named "mon_sub_L3_YY"
> + where "YY" is the node number.
>
> "mon_hw_id":
> Available only with debug option. The identifier used by hardware
> @@ -484,6 +488,19 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
> each bit represents 5% of the capacity of the cache. You could partition
> the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
>
> +Notes on Sub-NUMA Cluster mode
> +==============================
> +When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
> +nodes much more readily than between regular NUMA nodes since the CPUs
> +on Sub-NUMA nodes share the same L3 cache and the system may report
> +the NUMA distance between Sub-NUMA nodes with a lower value than used
> +for regular NUMA nodes.
> +The top-level monitoring files in each "mon_L3_XX" directory provide
> +the sum of data across all SNC nodes sharing an L3 cache instance.
> +Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
> +the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the
> +"mon_sub_L3_YY" directories to get node local data.
> +
> Memory bandwidth Allocation and monitoring
> ==========================================
>

Reinette

2024-05-22 23:47:35

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v18 15/17] x86/resctrl: Fix RMID reading sanity check for Sub-NUMA (SNC) mode

On Wed, May 22, 2024 at 02:25:23PM -0700, Reinette Chatre wrote:
> > + /*
> > + * SNC: OK to read events on any CPU sharing same L3
> > + * cache instance.
> > + */
> > + if (d->display_id != get_cpu_cacheinfo_id(smp_processor_id(),
> > + r->mon_display_scope))
>
> By hardcoding that mon_display_scope is a cache instead of using get_domain_id_from_scope()
> it seems that all pretending about being generic has just been abandoned at this point.

Yes. It now seems like a futile quest to make this look
like something generic. All this code is operating on the
rdt_resources_all[RDT_RESOURCE_L3] resource (which by its very name is
"L3" scoped). In the SNC case the L3 has been divided (in some senses,
but not all) into nodes.

Given that pretending isn't working ... just be explicit?

Some "thinking aloud" follows ...

struct rdt_resource:
In order to track monitor events, resctrl must build a domain list based
on the smallest measurement scope. So with SNC enabled, that is the
node. With it disabled it is L3 cache scope (which on existing systems
is the same as node scope).

Maybe keep .mon_scope with the existing name, but define it to be the
minimum measurement scope and use it to build domains. So it
defaults to RESCTRL_L3_CACHE but SNC detection will rewrite it to
RESCTRL_L3_NODE.

Drop the .mon_display_scope field. By definition it must always have
the value RESCTRL_L3_CACHE. So replace checks that compare the
.mon_scope & .mon_display_scope values of rdt_resources_all[RDT_RESOURCE_L3]
with:

if (r->mon_scope != RESCTRL_L3_CACHE)
	// SNC stuff
else
	// regular stuff

struct rdt_mon_domain:
In the rdt_mon_domain rename the display_id field to the more
honest name "l3_cache_id". In addition, save a pointer to the
.shared_cpu_map of the L3 cache. When SNC is off, this will be the
same as the d->hdr.cpu_mask for the domain. For SNC on it will be
a superset (encompassing all the bits from cpu_masks in all domains
that share an L3 instance).

Where SNC specific code is required, the check becomes:

if (d->hdr.id != d->l3_cache_id)
	// SNC stuff
else
	// regular stuff

The l3_cache_id can be used in the mkdir code to make the mon_L3_XX
directories, and the L3 .shared_cpu_map when picking a CPU to read
the counters for the "sum" files. l3_cache_id also indicates
which domains should be summed.
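
In struct form, roughly (sketch; the l3_shared_cpu_map name is made up
here, the idea is just saving a pointer to the L3 .shared_cpu_map):

	struct rdt_mon_domain {
		struct rdt_domain_hdr	hdr;
		/* id of the L3 cache instance this domain belongs to */
		int			l3_cache_id;
		/* ->shared_cpu_map of that L3 instance */
		const struct cpumask	*l3_shared_cpu_map;
		/* ... existing monitoring fields ... */
	};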


Does this look like a useful direction to pursue?

-Tony

2024-05-23 17:03:55

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v18 15/17] x86/resctrl: Fix RMID reading sanity check for Sub-NUMA (SNC) mode

Hi Tony,

On 5/22/24 4:47 PM, Tony Luck wrote:
> On Wed, May 22, 2024 at 02:25:23PM -0700, Reinette Chatre wrote:
>>> + /*
>>> + * SNC: OK to read events on any CPU sharing same L3
>>> + * cache instance.
>>> + */
>>> + if (d->display_id != get_cpu_cacheinfo_id(smp_processor_id(),
>>> + r->mon_display_scope))
>>
>> By hardcoding that mon_display_scope is a cache instead of using get_domain_id_from_scope()
>> it seems that all pretending about being generic has just been abandoned at this point.
>
> Yes. It now seems like a futile quest to make this look
> like something generic. All this code is operating on the

I did not see the generic solution as impossible. The implementation
seemed generally ok to me when thinking of it as a generic solution
whose implementation has the optimization of not sending IPIs
unnecessarily, since the only user is SNC.

> rdt_resources_all[RDT_RESOURCE_L3] resource (which by its very name is

Yes, good point.

> "L3" scoped). In the SNC case the L3 has been divided (in some senses,
> but not all) into nodes.
>
> Given that pretending isn't working ... just be explicit?
>
> Some "thinking aloud" follows ...

Sure, will consider with you ...

>
> struct rdt_resource:
> In order to track monitor events, resctrl must build a domain list based
> on the smallest measurement scope. So with SNC enabled, that is the
> node. With it disabled it is L3 cache scope (which on existing systems
> is the same as node scope).
>
> Maybe keep .mon_scope with the existing name, but define it to be the
> minimum measurement scope and use it to build domains. So it
> defaults to RESCTRL_L3_CACHE but SNC detection will rewrite it to
> RESCTRL_L3_NODE.

Above has been agreed on for a while now, no? The only change is that
the name of the new scope will change from RESCTRL_NODE to RESCTRL_L3_NODE?

>
> Drop the .mon_display_scope field. By definition it must always have
> the value RESCTRL_L3_CACHE. So replace checks that compare the
> .mon_scope & .mon_display_scope values of rdt_resources_all[RDT_RESOURCE_L3]
> with:
>
> if (r->mon_scope != RESCTRL_L3_CACHE)
> 	// SNC stuff
> else
> 	// regular stuff

This seems reasonable considering what you reminded me of earlier, that
all things monitoring are hardcoded to RDT_RESOURCE_L3. Perhaps that test
can be a macro with an elaborate comment describing the SNC view of the
world? I also think that a specific test may be easier to understand
("if (r->mon_scope == RESCTRL_L3_NODE) /* SNC */") since that makes it
easier to follow code to figure out where RESCTRL_L3_NODE is assigned as
opposed to trying to find flows where mon_scope is _not_ RESCTRL_L3_CACHE.


> struct rdt_mon_domain:
> In the rdt_mon_domain rename the display_id field to the more
> honest name "l3_cache_id". In addition, save a pointer to the
> .shared_cpu_map of the L3 cache. When SNC is off, this will be the

Sounds good. If already saving a pointer, could that be simplified,
while also making code easier to understand, with a pointer to the
cache's struct cacheinfo instead? That will give access to cache ID as
well as shared_cpu_map.

> same as the d->hdr.cpu_mask for the domain. For SNC on it will be
> a superset (encompassing all the bits from cpu_masks in all domains
> that share an L3 instance).

Care may be needed when considering scenarios where CPUs can be
offlined. For example, when SNC is enabled and all CPUs associated with
all but one NUMA domain are disabled, the final remaining monitoring
domain may have the same CPU mask as the L3 cache even though SNC is
enabled?

> Where SNC specific code is required, the check becomes:
>
> if (d->hdr.id != d->l3_cache_id)
> 	// SNC stuff
> else
> 	// regular stuff

I am not sure about these tests and will need more context on where they
will be used. For example, when SNC is enabled then NUMA node #0 belongs
to cache ID #0 then the test would not capture that SNC is enabled for
monitoring domain #0?

> The l3_cache_id can be used in the mkdir code to make the mon_L3_XX
> directories, and the L3 .shared_cpu_map when picking a CPU to read
> the counters for the "sum" files. l3_cache_id also indicates
> which domains should be summed.

Using the L3 .shared_cpu_map to pick CPU sounds good. It really makes
it obvious what is going on.

> Does this look like a useful direction to pursue?

As I understand it, this will make the code obviously specific to SNC but not
change the flow of implementation in this series. I do continue to
believe that many of the flows to support SNC are not intuitive (to me)
so I would like to keep my request that the SNC portions have clear
comments to explain why it does the things it does and not just leave
the reader with impression of "if (SNC specific check) /* quirks */ ".
This will help future changes to these areas.

Reinette

2024-05-23 19:05:19

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v18 06/17] x86/resctrl: Introduce snc_nodes_per_l3_cache

> > return (is_mbm_local_enabled() &&
> > - r->alloc_capable && is_mba_linear());
> > + r->alloc_capable && is_mba_linear() &&
> > + snc_nodes_per_l3_cache == 1);
> > }
> >
> > /*
>
> Since the software controller is a filesystem feature the above
> now requires that snc_nodes_per_l3_cache becomes part of the resctrl
> filesystem code and every architecture will need to set snc_nodes_per_l3_cache.
> Every architecture will thus need to interpret what "SNC" means for them
> using the term introduced here. That may be ok ... but the term "SNC"
> will then surely not identify an Intel feature and Intel needs to be ok
> that any architecture calls their "similar to SNC but not quite identical"
> "SNC".
>
> I assume now that as part of the fs/arch split there needs to be
> a new helper that allows different architectures to set this
> filesystem variable?

I can change this check to better reflect the underlying reason to
disable the software controller. Which is that the MBM monitor scope
does not match the MBA control scope. This seems like an architecture
neutral expression.

So code would look like this:

struct rdt_resource *rmbm = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
struct rdt_resource *rmba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;

...

return (is_mbm_local_enabled() &&
r->alloc_capable && is_mba_linear() &&
rmbm->mon_scope == rmba->ctrl_scope);

I'm also contemplating dropping snc_nodes_per_l3_cache from being a
global variable and making it a field in "struct rdt_resource" (only needed
for the RDT_RESOURCE_L3 resource). N.B. Babu had suggested it
shouldn't be global many patch versions ago.

Perhaps name it .domains_per_l3_cache or .subdomains_per_l3_cache?

Bad idea? Good idea (but you have a better name for the field)?

-Tony


2024-05-23 20:56:45

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v18 06/17] x86/resctrl: Introduce snc_nodes_per_l3_cache

Hi Tony,

On 5/23/24 12:04 PM, Luck, Tony wrote:
>>> return (is_mbm_local_enabled() &&
>>> - r->alloc_capable && is_mba_linear());
>>> + r->alloc_capable && is_mba_linear() &&
>>> + snc_nodes_per_l3_cache == 1);
>>> }
>>>
>>> /*
>>
>> Since the software controller is a filesystem feature the above
>> now requires that snc_nodes_per_l3_cache becomes part of the resctrl
>> filesystem code and every architecture will need to set snc_nodes_per_l3_cache.
>> Every architecture will thus need to interpret what "SNC" means for them
>> using the term introduced here. That may be ok ... but the term "SNC"
>> will then surely not identify an Intel feature and Intel needs to be ok
>> that any architecture calls their "similar to SNC but not quite identical"
>> "SNC".
>>
>> I assume now that as part of the fs/arch split there needs to be
>> a new helper that allows different architectures to set this
>> filesystem variable?
>
> I can change this check to better reflect the underlying reason to
> disable the software controller. Which is that the MBM monitor scope
> does not match the MBA control scope. This seems like an architecture
> neutral expression.
>
> So code would look like this:
>
> struct rdt_resource *rmbm = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> struct rdt_resource *rmba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
>
> ...
>
> return (is_mbm_local_enabled() &&
> r->alloc_capable && is_mba_linear() &&
> rmbm->mon_scope == rmba->ctrl_scope);
>
> I'm also contemplating dropping snc_nodes_per_l3_cache from being a
> global variable and making it a field in "struct rdt_resource" (only needed
> for the RDT_RESOURCE_L3 resource). N.B. Babu had suggested it
> shouldn't be global many patch versions ago.
>
> Perhaps name it .domains_per_l3_cache or .subdomains_per_l3_cache?
>
> Bad idea? Good idea (but you have a better name for the field)?

With the check in supports_mba_mbps() changed I do not see
snc_nodes_per_l3_cache needed by anything but arch specific code.
I thus do not see any reason for the resctrl filesystem (or, for
example, Arm) to know that this value even exists.

While struct rdt_hw_resource is a place to put architecture
specific information it does not seem appropriate to force every
resource to carry what is essentially a system wide (not specific to
resctrl) L3 specific property. To me this really seems like an
architecture specific global setting but I'd also like to hear the
motivations why it should not be.

Reinette




2024-05-23 21:27:40

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v18 06/17] x86/resctrl: Introduce snc_nodes_per_l3_cache

> > I'm also contemplating dropping snc_nodes_per_l3_cache from being a
> > global variable and making it a field in "struct rdt_resource" (only needed
> > for the RDT_RESOURCE_L3 resource). N.B. Babu had suggested it
> > shouldn't be global many patch versions ago.
> >
> > Perhaps name it .domains_per_l3_cache or .subdomains_per_l3_cache?
> >
> > Bad idea? Good idea (but you have a better name for the field)?
>
> With the check in supports_mba_mbps() changed I do not see
> snc_nodes_per_l3_cache needed by anything but arch specific code.
> I thus do not see any reason for the resctrl filesystem (or, for
> example, Arm) to know that this value even exists.
>
> While struct rdt_hw_resource is a place to put architecture
> specific information it does not seem appropriate to force every
> resource to carry what is essentially a system wide (not specific to
> resctrl) L3 specific property. To me this really seems like an
> architecture specific global setting but I'd also like to hear the
> motivations why it should not be.

So (in arch/x86/kernel/cpu/resctrl/monitor.c)

static int snc_nodes_per_l3_cache = 1;

Set and use only in this (arch specific) file.

-Tony

2024-05-23 22:32:06

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v18 06/17] x86/resctrl: Introduce snc_nodes_per_l3_cache

Hi Tony,

On 5/23/24 2:25 PM, Luck, Tony wrote:
>>> I'm also contemplating dropping snc_nodes_per_l3_cache from being a
>>> global variable and making it a field in "struct rdt_resource" (only needed
>>> for the RDT_RESOURCE_L3 resource). N.B. Babu had suggested it
>>> shouldn't be global many patch versions ago.
>>>
>>> Perhaps name it .domains_per_l3_cache or .subdomains_per_l3_cache?
>>>
>>> Bad idea? Good idea (but you have a better name for the field)?
>>
>> With the check in supports_mba_mbps() changed I do not see
>> snc_nodes_per_l3_cache needed by anything but arch specific code.
>> I thus do not see any reason for the resctrl filesystem (or, for
>> example, Arm) to know that this value even exists.
>>
>> While struct rdt_hw_resource is a place to put architecture
>> specific information it does not seem appropriate to force every
>> resource to carry what is essentially a system wide (not specific to
>> resctrl) L3 specific property. To me this really seems like an
>> architecture specific global setting but I'd also like to hear the
>> motivations why it should not be.
>
> So (in arch/x86/kernel/cpu/resctrl/monitor.c)
>
> static int snc_nodes_per_l3_cache = 1;
>
> Set and use only in this (arch specific) file.

Since this series initializes this value in
arch/x86/kernel/cpu/resctrl/core.c it is not clear to
me from just this one line how you envision the changes.

Just to be clear ... when I refer to arch specific and
filesystem code I am considering how this series integrates with [1]
since that is the direction resctrl is headed in.
Being "arch specific" thus does not require that it be moved into
monitor.c - it could be added to arch/x86/kernel/cpu/resctrl/internal.h
where it can remain after the fs/arch split.

It will be very helpful if you view your series while taking
[1] into account. For example, when looking at [1] you will see that
mon_event_count() and __mon_event_count() are resctrl filesystem
functions. When you consider that, it should be clear that adding
an arch specific get_node_rmid() between these functions will make
the arch/fs split more difficult.

Reinette

[1] https://lore.kernel.org/lkml/[email protected]/


2024-05-23 23:19:05

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v18 06/17] x86/resctrl: Introduce snc_nodes_per_l3_cache

> > So (in arch/x86/kernel/cpu/resctrl/monitor.c)
> >
> > static int snc_nodes_per_l3_cache = 1;
> >
> > Set and use only in this (arch specific) file.
>
> Since this series initializes this value in
> arch/x86/kernel/cpu/resctrl/core.c it is not clear to
> me from just this one line how you envision the changes.

v18 did the initialization in core.c. But since SNC is all about monitor
features it looks more logical to do this here:

resctrl_late_init()
  get_rdt_resources()
    get_rdt_mon_resources()
      rdt_get_mon_l3_config()	<<<< Do SNC enumeration here


> Just to be clear ... when I refer to arch specific and
> filesystem code I am considering how this series integrates with [1]
> since that is the direction resctrl is headed in.
> Being "arch specific" thus does not require that it be moved into
> monitor.c - it could be added to arch/x86/kernel/cpu/resctrl/internal.h
> where it can remain after the fs/arch split.

The logical place to convert from logical RMID to physical RMID looks
to be in __rmid_read(). Just need to pass in the domain pointer (from
both resctrl_arch_reset_rmid() and resctrl_arch_rmid_read()).
>
> It will be very helpful if you view your series while taking
> [1] into account. For example, when looking at [1] you will see that
> mon_event_count() and __mon_event_count() are resctrl filesystem
> functions. When you consider that, it should be clear that adding
> an arch specific get_node_rmid() between these functions will make
> the arch/fs split more difficult.

I'll try to keep that in mind as I rework my series. In v18 my "sum" code
went into __mon_event_count(). I'll see if I can push that down another
level.

-Tony

2024-05-23 23:48:58

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v18 06/17] x86/resctrl: Introduce snc_nodes_per_l3_cache

Hi Tony,

On 5/23/24 4:18 PM, Luck, Tony wrote:
>>> So (in arch/x86/kernel/cpu/resctrl/monitor.c)
>>>
>>> static int snc_nodes_per_l3_cache = 1;
>>>
>>> Set and use only in this (arch specific) file.
>>
>> Since this series initializes this value in
>> arch/x86/kernel/cpu/resctrl/core.c it is not clear to
>> me from just this one line how you envision the changes.
>
> v18 did the initialization in core.c. But since SNC is all about monitor
> features it looks more logical to do this here:
>
> resctrl_late_init()
>   get_rdt_resources()
>     get_rdt_mon_resources()
>       rdt_get_mon_l3_config()	<<<< Do SNC enumeration here
>

ok.

>
>> Just to be clear ... when I refer to arch specific and
>> filesystem code I am considering how this series integrates with [1]
>> since that is the direction resctrl is headed in.
>> Being "arch specific" thus does not require that it be moved into
>> monitor.c - it could be added to arch/x86/kernel/cpu/resctrl/internal.h
>> where it can remain after the fs/arch split.
>
> The logical place to convert from logical RMID to physical RMID looks
> to be in __rmid_read(). Just need to pass in the domain pointer (from
> both resctrl_arch_reset_rmid() and resctrl_arch_rmid_read()).

This is not obvious to me. Do you need the domain pointer to figure out
which node the domain belongs to? It looks to me as though these
calls are already running on a CPU belonging to the domain so perhaps
smp_processor_id() is sufficient?
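
Something like this (sketch; the remap guard and error handling are
modeled on the current __rmid_read()):

	static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
	{
		struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
		u64 msr_val;

		/* Remap logical RMID to the physical RMID of this SNC node */
		if (snc_nodes_per_l3_cache > 1)
			rmid += (cpu_to_node(smp_processor_id()) %
				 snc_nodes_per_l3_cache) * r->num_rmid;

		wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
		rdmsrl(MSR_IA32_QM_CTR, msr_val);

		if (msr_val & RMID_VAL_ERROR)
			return -EIO;
		if (msr_val & RMID_VAL_UNAVAIL)
			return -EINVAL;

		*val = msr_val;
		return 0;
	}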

>> It will be very helpful if you view your series while taking
>> [1] into account. For example, when looking at [1] you will see that
>> mon_event_count() and __mon_event_count() are resctrl filesystem
>> functions. When you consider that, it should be clear that adding
>> an arch specific get_node_rmid() between these functions will make
>> the arch/fs split more difficult.
>
> I'll try to keep that in mind as I rework my series. In v18 my "sum" code
> went into __mon_event_count(). I'll see if I can push that down another
> level.

Thank you

Reinette