2023-11-30 00:34:43

by Luck, Tony

Subject: [PATCH v12 0/8] Add support for Sub-NUMA cluster (SNC) systems

The Sub-NUMA Cluster feature on some Intel processors partitions the CPUs
that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features. Prior to this
patch series, Intel advised that SNC and RDT were incompatible.

Some of these CPUs support an MSR that can partition the RMID counters
in the same way, which allows the monitoring features to be used. The
caveat is that Linux may migrate tasks more frequently between SNC
nodes than between "regular" NUMA nodes, so reading counters from all
SNC nodes may be needed to get a complete picture of activity for a
task.
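
As an illustration of that caveat, here is a minimal user-space sketch
(hypothetical group and event names; the actual directory names depend
on the system topology) that sums one monitoring event across all of a
CTRL_MON group's SNC node directories:

#include <glob.h>
#include <stdio.h>

/* Sum one event file across every mon_L3_* (SNC node) directory. */
static unsigned long long sum_event(const char *grp, const char *evt)
{
        unsigned long long total = 0, val;
        char pattern[256];
        glob_t g;
        size_t i;

        snprintf(pattern, sizeof(pattern),
                 "/sys/fs/resctrl/%s/mon_data/mon_L3_*/%s", grp, evt);
        if (glob(pattern, 0, NULL, &g))
                return 0;
        for (i = 0; i < g.gl_pathc; i++) {
                FILE *f = fopen(g.gl_pathv[i], "r");

                if (f) {
                        if (fscanf(f, "%llu", &val) == 1)
                                total += val;
                        fclose(f);
                }
        }
        globfree(&g);
        return total;
}

int main(void)
{
        printf("%llu\n", sum_event("mygroup", "mbm_total_bytes"));
        return 0;
}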

Cache and memory bandwidth allocation features continue to operate at
the scope of the L3 cache.

Signed-off-by: Tony Luck <[email protected]>

Changes since v11:

Global: (comment from Reinette)
Reorder tags with Signed-off-by: first, then Reviewed/Tested

Patch1: (comment from Reinette)
Add error message to domain_remove_cpu() [matching the one in
domain_add_cpu()] for the case where get_cpu_cacheinfo_id()
failed to find a cache ID for the current CPU.

Patch3: (comment from Reinette)
When splitting the domain_add_cpu() and domain_remove_cpu()
functions add "control" and "monitor" to the warning messages.
Fix the:
pr_warn("Couldn't find control scope id=%d for CPU %d\n", id, cpu);
message:
s/Couldn't/Can't/
s/control scope/control domain with/
Add resource name.
Ditto for similar monitor message.

Patch6: (comment from Reinette)
Used Reinette's rewrite into imperative mood for the latter part
of the commit message.

Patch8: (comment from Randy)
s/have/has/ s/cache. But/cache, but/

Added Reinette's "Reviewed-by:" to all patches except patch 3.

Added Shaopeng Tan's Reviewed-by and Tested-by tags to all patches.

Rebased to v6.7-rc3

Tony Luck (8):
x86/resctrl: Prepare for new domain scope
x86/resctrl: Prepare to split rdt_domain structure
x86/resctrl: Prepare for different scope for control/monitor
operations
x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
x86/resctrl: Add node-scope to the options for feature scope
x86/resctrl: Introduce snc_nodes_per_l3_cache
x86/resctrl: Sub NUMA Cluster detection and enable
x86/resctrl: Update documentation with Sub-NUMA cluster changes

Documentation/arch/x86/resctrl.rst | 25 +-
include/linux/resctrl.h | 85 +++--
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 66 ++--
arch/x86/kernel/cpu/resctrl/core.c | 411 +++++++++++++++++-----
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 58 +--
arch/x86/kernel/cpu/resctrl/monitor.c | 68 ++--
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 26 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 149 ++++----
9 files changed, 607 insertions(+), 282 deletions(-)


base-commit: 2cc14f52aeb78ce3f29677c2de1f06c0e91471ab
--
2.41.0


2023-11-30 00:34:44

by Luck, Tony

Subject: [PATCH v12 2/8] x86/resctrl: Prepare to split rdt_domain structure

The rdt_domain structure is used for both control and monitor features.
It is about to be split into separate structures for these two usages
because the scope of control features and the scope of monitoring
features will differ for future resources.

To allow for common code that scans a list of domains looking for a
specific domain id, move all the common fields ("list", "id", "cpu_mask")
into their own structure within the rdt_domain structure.
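
Moving these fields into a common header means that list walks can be
written once against struct rdt_domain_hdr, with callers recovering the
full structure via container_of(). A sketch of the pattern this enables
(illustrative only; later patches in this series adopt it for the
generic domain search):

/* Search any domain list by id, knowing only the header layout. */
static struct rdt_domain_hdr *find_domain_hdr(struct list_head *h, int id)
{
        struct rdt_domain_hdr *hdr;

        list_for_each_entry(hdr, h, list)
                if (hdr->id == id)
                        return hdr;

        return NULL;
}

/* A caller then maps the header back to its containing domain: */
hdr = find_domain_hdr(&r->domains, id);
if (hdr)
        d = container_of(hdr, struct rdt_domain, hdr);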

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
---
include/linux/resctrl.h | 16 ++++--
arch/x86/kernel/cpu/resctrl/core.c | 26 +++++-----
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 22 ++++-----
arch/x86/kernel/cpu/resctrl/monitor.c | 10 ++--
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 14 +++---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 60 +++++++++++------------
6 files changed, 78 insertions(+), 70 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 7d4eb7df611d..c4067150a6b7 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -53,10 +53,20 @@ struct resctrl_staged_config {
};

/**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain_hdr - common header for different domain types
* @list: all instances of this resource
* @id: unique id for this instance
* @cpu_mask: which CPUs share this resource
+ */
+struct rdt_domain_hdr {
+ struct list_head list;
+ int id;
+ struct cpumask cpu_mask;
+};
+
+/**
+ * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * @hdr: common header for different domain types
* @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
* @mbm_total: saved state for MBM total bandwidth
* @mbm_local: saved state for MBM local bandwidth
@@ -71,9 +81,7 @@ struct resctrl_staged_config {
* by closid
*/
struct rdt_domain {
- struct list_head list;
- int id;
- struct cpumask cpu_mask;
+ struct rdt_domain_hdr hdr;
unsigned long *rmid_busy_llc;
struct mbm_state *mbm_total;
struct mbm_state *mbm_local;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index fd113bc29d4e..62a989fd950d 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -356,9 +356,9 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
{
struct rdt_domain *d;

- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
/* Find the domain that contains this CPU */
- if (cpumask_test_cpu(cpu, &d->cpu_mask))
+ if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
return d;
}

@@ -402,12 +402,12 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
struct list_head *l;

list_for_each(l, &r->domains) {
- d = list_entry(l, struct rdt_domain, list);
+ d = list_entry(l, struct rdt_domain, hdr.list);
/* When id is found, return its domain. */
- if (id == d->id)
+ if (id == d->hdr.id)
return d;
/* Stop searching when finding id's position in sorted list. */
- if (id < d->id)
+ if (id < d->hdr.id)
break;
}

@@ -530,7 +530,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)

d = rdt_find_domain(r, id, &add_pos);
if (d) {
- cpumask_set_cpu(cpu, &d->cpu_mask);
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
rdt_domain_reconfigure_cdp(r);
return;
@@ -541,8 +541,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;

d = &hw_dom->d_resctrl;
- d->id = id;
- cpumask_set_cpu(cpu, &d->cpu_mask);
+ d->hdr.id = id;
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);

rdt_domain_reconfigure_cdp(r);

@@ -556,11 +556,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;
}

- list_add_tail(&d->list, add_pos);
+ list_add_tail(&d->hdr.list, add_pos);

err = resctrl_online_domain(r, d);
if (err) {
- list_del(&d->list);
+ list_del(&d->hdr.list);
domain_free(hw_dom);
}
}
@@ -584,10 +584,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
}
hw_dom = resctrl_to_arch_dom(d);

- cpumask_clear_cpu(cpu, &d->cpu_mask);
- if (cpumask_empty(&d->cpu_mask)) {
+ cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+ if (cpumask_empty(&d->hdr.cpu_mask)) {
resctrl_offline_domain(r, d);
- list_del(&d->list);
+ list_del(&d->hdr.list);

/*
* rdt_domain "d" is going to be freed below, so clear
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 3f8891d57fac..23f8258d36a8 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -67,7 +67,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,

cfg = &d->staged_config[s->conf_type];
if (cfg->have_new_ctrl) {
- rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+ rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
return -EINVAL;
}

@@ -146,7 +146,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,

cfg = &d->staged_config[s->conf_type];
if (cfg->have_new_ctrl) {
- rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+ rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
return -EINVAL;
}

@@ -226,8 +226,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
return -EINVAL;
}
dom = strim(dom);
- list_for_each_entry(d, &r->domains, list) {
- if (d->id == dom_id) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
+ if (d->hdr.id == dom_id) {
data.buf = dom;
data.rdtgrp = rdtgrp;
if (r->parse_ctrlval(&data, s, d))
@@ -274,7 +274,7 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
struct rdt_domain *dom = &hw_dom->d_resctrl;

if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
- cpumask_set_cpu(cpumask_any(&dom->cpu_mask), cpu_mask);
+ cpumask_set_cpu(cpumask_any(&dom->hdr.cpu_mask), cpu_mask);
hw_dom->ctrl_val[idx] = cfg->new_ctrl;

return true;
@@ -291,7 +291,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
u32 idx = get_config_index(closid, t);
struct msr_param msr_param;

- if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+ if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
return -EINVAL;

hw_dom->ctrl_val[idx] = cfg_val;
@@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
return -ENOMEM;

msr_param.res = NULL;
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
hw_dom = resctrl_to_arch_dom(d);
for (t = 0; t < CDP_NUM_TYPES; t++) {
cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
u32 ctrl_val;

seq_printf(s, "%*s:", max_name_width, schema->name);
- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->domains, hdr.list) {
if (sep)
seq_puts(s, ";");

@@ -476,7 +476,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
ctrl_val = resctrl_arch_get_config(r, dom, closid,
schema->conf_type);

- seq_printf(s, r->format_str, dom->id, max_data_width,
+ seq_printf(s, r->format_str, dom->hdr.id, max_data_width,
ctrl_val);
sep = true;
}
@@ -505,7 +505,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
} else {
seq_printf(s, "%s:%d=%x\n",
rdtgrp->plr->s->res->name,
- rdtgrp->plr->d->id,
+ rdtgrp->plr->d->hdr.id,
rdtgrp->plr->cbm);
}
} else {
@@ -536,7 +536,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
rr->val = 0;
rr->first = first;

- smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+ smp_call_function_any(&d->hdr.cpu_mask, mon_event_count, rr, 1);
}

int rdtgroup_mondata_show(struct seq_file *m, void *arg)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..dd0ea1bc0092 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -238,7 +238,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
u64 msr_val, chunks;
int ret;

- if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+ if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
return -EINVAL;

ret = __rmid_read(rmid, eventid, &msr_val);
@@ -340,8 +340,8 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)

entry->busy = 0;
cpu = get_cpu();
- list_for_each_entry(d, &r->domains, list) {
- if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
+ if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) {
err = resctrl_arch_rmid_read(r, d, entry->rmid,
QOS_L3_OCCUP_EVENT_ID,
&val);
@@ -661,7 +661,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;

- cpu = cpumask_any(&dom->cpu_mask);
+ cpu = cpumask_any(&dom->hdr.cpu_mask);
dom->cqm_work_cpu = cpu;

schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
@@ -708,7 +708,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)

if (!static_branch_likely(&rdt_mon_enable_key))
return;
- cpu = cpumask_any(&dom->cpu_mask);
+ cpu = cpumask_any(&dom->hdr.cpu_mask);
dom->mbm_work_cpu = cpu;
schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
}
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 2a682da9f43a..fcbd99e2eb66 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -221,7 +221,7 @@ static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr)
int cpu;
int ret;

- for_each_cpu(cpu, &plr->d->cpu_mask) {
+ for_each_cpu(cpu, &plr->d->hdr.cpu_mask) {
pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL);
if (!pm_req) {
rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n");
@@ -301,7 +301,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
return -ENODEV;

/* Pick the first cpu we find that is associated with the cache. */
- plr->cpu = cpumask_first(&plr->d->cpu_mask);
+ plr->cpu = cpumask_first(&plr->d->hdr.cpu_mask);

if (!cpu_online(plr->cpu)) {
rdt_last_cmd_printf("CPU %u associated with cache not online\n",
@@ -856,10 +856,10 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
* associated with them.
*/
for_each_alloc_capable_rdt_resource(r) {
- list_for_each_entry(d_i, &r->domains, list) {
+ list_for_each_entry(d_i, &r->domains, hdr.list) {
if (d_i->plr)
cpumask_or(cpu_with_psl, cpu_with_psl,
- &d_i->cpu_mask);
+ &d_i->hdr.cpu_mask);
}
}

@@ -867,7 +867,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
* Next test if new pseudo-locked region would intersect with
* existing region.
*/
- if (cpumask_intersects(&d->cpu_mask, cpu_with_psl))
+ if (cpumask_intersects(&d->hdr.cpu_mask, cpu_with_psl))
ret = true;

free_cpumask_var(cpu_with_psl);
@@ -1199,7 +1199,7 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
}

plr->thread_done = 0;
- cpu = cpumask_first(&plr->d->cpu_mask);
+ cpu = cpumask_first(&plr->d->hdr.cpu_mask);
if (!cpu_online(cpu)) {
ret = -ENODEV;
goto out;
@@ -1529,7 +1529,7 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
* may be scheduled elsewhere and invalidate entries in the
* pseudo-locked region.
*/
- if (!cpumask_subset(current->cpus_ptr, &plr->d->cpu_mask)) {
+ if (!cpumask_subset(current->cpus_ptr, &plr->d->hdr.cpu_mask)) {
mutex_unlock(&rdtgroup_mutex);
return -EINVAL;
}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index c44be64d65ec..04d32602ac33 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
lockdep_assert_held(&rdtgroup_mutex);

for_each_alloc_capable_rdt_resource(r) {
- list_for_each_entry(dom, &r->domains, list)
+ list_for_each_entry(dom, &r->domains, hdr.list)
memset(dom->staged_config, 0, sizeof(dom->staged_config));
}
}
@@ -295,7 +295,7 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
rdt_last_cmd_puts("Cache domain offline\n");
ret = -ENODEV;
} else {
- mask = &rdtgrp->plr->d->cpu_mask;
+ mask = &rdtgrp->plr->d->hdr.cpu_mask;
seq_printf(s, is_cpu_list(of) ?
"%*pbl\n" : "%*pb\n",
cpumask_pr_args(mask));
@@ -984,12 +984,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,

mutex_lock(&rdtgroup_mutex);
hw_shareable = r->cache.shareable_bits;
- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->domains, hdr.list) {
if (sep)
seq_putc(seq, ';');
sw_shareable = 0;
exclusive = 0;
- seq_printf(seq, "%d=", dom->id);
+ seq_printf(seq, "%d=", dom->hdr.id);
for (i = 0; i < closids_supported(); i++) {
if (!closid_allocated(i))
continue;
@@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
continue;
has_cache = true;
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
ctrl = resctrl_arch_get_config(r, d, closid,
s->conf_type);
if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1417,7 +1417,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
return size;

num_b = bitmap_weight(&cbm, r->cache.cbm_len);
- ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
+ ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
if (ci->info_list[i].level == r->scope) {
size = ci->info_list[i].size / r->cache.cbm_len * num_b;
@@ -1465,7 +1465,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res,
rdtgrp->plr->d,
rdtgrp->plr->cbm);
- seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size);
+ seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size);
}
goto out;
}
@@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
type = schema->conf_type;
sep = false;
seq_printf(s, "%*s:", max_name_width, schema->name);
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
if (sep)
seq_putc(s, ';');
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1495,7 +1495,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
else
size = rdtgroup_cbm_to_size(r, d, ctrl);
}
- seq_printf(s, "%d=%u", d->id, size);
+ seq_printf(s, "%d=%u", d->hdr.id, size);
sep = true;
}
seq_putc(s, '\n');
@@ -1555,7 +1555,7 @@ static void mon_event_config_read(void *info)

static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
{
- smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
+ smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
}

static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
@@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid

mutex_lock(&rdtgroup_mutex);

- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->domains, hdr.list) {
if (sep)
seq_puts(s, ";");

@@ -1574,7 +1574,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
mon_info.evtid = evtid;
mondata_config_read(dom, &mon_info);

- seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
+ seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
sep = true;
}
seq_puts(s, "\n");
@@ -1646,7 +1646,7 @@ static int mbm_config_write_domain(struct rdt_resource *r,
* are scoped at the domain level. Writing any of these MSRs
* on one CPU is observed by all the CPUs in the domain.
*/
- smp_call_function_any(&d->cpu_mask, mon_event_config_write,
+ smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_write,
&mon_info, 1);

/*
@@ -1689,8 +1689,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
return -EINVAL;
}

- list_for_each_entry(d, &r->domains, list) {
- if (d->id == dom_id) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
+ if (d->hdr.id == dom_id) {
ret = mbm_config_write_domain(r, d, evtid, val);
if (ret)
return -EINVAL;
@@ -2232,14 +2232,14 @@ static int set_cache_qos_cfg(int level, bool enable)
return -ENOMEM;

r_l = &rdt_resources_all[level].r_resctrl;
- list_for_each_entry(d, &r_l->domains, list) {
+ list_for_each_entry(d, &r_l->domains, hdr.list) {
if (r_l->cache.arch_has_per_cpu_cfg)
/* Pick all the CPUs in the domain instance */
- for_each_cpu(cpu, &d->cpu_mask)
+ for_each_cpu(cpu, &d->hdr.cpu_mask)
cpumask_set_cpu(cpu, cpu_mask);
else
/* Pick one CPU from each domain instance to update MSR */
- cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+ cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
}

/* Update QOS_CFG MSR on all the CPUs in cpu_mask */
@@ -2268,7 +2268,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
{
u32 num_closid = resctrl_arch_get_num_closid(r);
- int cpu = cpumask_any(&d->cpu_mask);
+ int cpu = cpumask_any(&d->hdr.cpu_mask);
int i;

d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val),
@@ -2317,7 +2317,7 @@ static int set_mba_sc(bool mba_sc)

r->membw.mba_sc = mba_sc;

- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
for (i = 0; i < num_closid; i++)
d->mbps_val[i] = MBA_MAX_MBPS;
}
@@ -2653,7 +2653,7 @@ static int rdt_get_tree(struct fs_context *fc)

if (is_mbm_enabled()) {
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- list_for_each_entry(dom, &r->domains, list)
+ list_for_each_entry(dom, &r->domains, hdr.list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
}

@@ -2780,9 +2780,9 @@ static int reset_all_ctrls(struct rdt_resource *r)
* CBMs in all domains to the maximum mask value. Pick one CPU
* from each domain to update the MSRs below.
*/
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
hw_dom = resctrl_to_arch_dom(d);
- cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+ cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);

for (i = 0; i < hw_res->num_closid; i++)
hw_dom->ctrl_val[i] = r->default_ctrl;
@@ -2986,7 +2986,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
char name[32];
int ret;

- sprintf(name, "mon_%s_%02d", r->name, d->id);
+ sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
/* create the directory */
kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
if (IS_ERR(kn))
@@ -3002,7 +3002,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
}

priv.u.rid = r->rid;
- priv.u.domid = d->id;
+ priv.u.domid = d->hdr.id;
list_for_each_entry(mevt, &r->evt_list, list) {
priv.u.evtid = mevt->evtid;
ret = mon_addfile(kn, mevt->name, priv.priv);
@@ -3050,7 +3050,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_domain *dom;
int ret;

- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->domains, hdr.list) {
ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
if (ret)
return ret;
@@ -3209,7 +3209,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
*/
tmp_cbm = cfg->new_ctrl;
if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) {
- rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id);
+ rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id);
return -ENOSPC;
}
cfg->have_new_ctrl = true;
@@ -3232,7 +3232,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
struct rdt_domain *d;
int ret;

- list_for_each_entry(d, &s->res->domains, list) {
+ list_for_each_entry(d, &s->res->domains, hdr.list) {
ret = __init_one_rdt_domain(d, s, closid);
if (ret < 0)
return ret;
@@ -3247,7 +3247,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
struct resctrl_staged_config *cfg;
struct rdt_domain *d;

- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
if (is_mba_sc(r)) {
d->mbps_val[closid] = MBA_MAX_MBPS;
continue;
@@ -3864,7 +3864,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
* per domain monitor data directories.
*/
if (static_branch_unlikely(&rdt_mon_enable_key))
- rmdir_mondata_subdir_allrdtgrp(r, d->id);
+ rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);

if (is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
--
2.41.0

2023-11-30 00:34:50

by Luck, Tony

Subject: [PATCH v12 5/8] x86/resctrl: Add node-scope to the options for feature scope

The resctrl features supported so far are all domain scoped, with the
scope matching that of the L2 or L3 caches.

Add RESCTRL_NODE as a new option for features that are scoped at the
same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
Cluster (SNC) feature where monitoring features are node scoped.
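
A later patch in this series selects this scope for L3 monitoring when
SNC is detected, e.g.:

rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;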

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
---
include/linux/resctrl.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 2 ++
2 files changed, 3 insertions(+)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 058a940c3239..b8a3a11b970d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -170,6 +170,7 @@ struct resctrl_schema;
enum resctrl_scope {
RESCTRL_L2_CACHE = 2,
RESCTRL_L3_CACHE = 3,
+ RESCTRL_NODE,
};

/**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 797cb3bf417a..c9315ce8f7bd 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -502,6 +502,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
case RESCTRL_L2_CACHE:
case RESCTRL_L3_CACHE:
return get_cpu_cacheinfo_id(cpu, scope);
+ case RESCTRL_NODE:
+ return cpu_to_node(cpu);
default:
break;
}
--
2.41.0

2023-11-30 00:34:54

by Luck, Tony

Subject: [PATCH v12 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache

Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.

This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This benefit may be offset by lower bandwidth
since the memory accesses for each core can only be interleaved between
the memory controllers on the same SNC node.

Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.

The default mode divides them numerically, e.g. when there are two SNC
nodes on a socket the lower-numbered half of the RMIDs are given to the
first node and the remainder to the second node. This would be difficult
to use with the Linux resctrl interface because the specific RMID values
assigned to resctrl groups are not visible to users.

The other mode divides the RMIDs and renumbers the ones on the second
SNC node to start from zero.

Even with this renumbering, SNC mode requires several changes in resctrl
behavior for correct operation.

Add a global integer "snc_nodes_per_l3_cache" that indicates how many
SNC nodes share each L3 cache. When "snc_nodes_per_l3_cache" is "1",
SNC mode is either not implemented or not enabled.

Update all places to take appropriate action when SNC mode is enabled:
1) The number of logical RMIDs per L3 cache available for use is the
number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be divided by the number of SNC
nodes.
3) The RMID renumbering applies when the value from the IA32_PQR_ASSOC
MSR is used to count accesses by a task. When reading an RMID counter,
adjust the logical RMID to the physical RMID value for the SNC node
being read, and load the adjusted value into the IA32_QM_EVTSEL MSR
(see the sketch after this list).
4) Divide the L3 cache between the SNC nodes. Divide the value
reported in the resctrl "size" file by the number of SNC
nodes because the effective amount of cache that can be allocated
is reduced by that factor.
5) Disable the "-o mba_MBps" mount option in SNC mode
because the monitoring is being done per SNC node, while the
bandwidth allocation is still done at the L3 cache scope.
Trying to use this feedback loop might result in contradictory
changes to the throttling level coming from each of the SNC
node bandwidth measurements.
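
A condensed sketch of the logical-to-physical RMID adjustment described
in item 3, mirroring what __rmid_read() does in this patch:

/*
 * Each SNC node owns r->num_rmid logical RMIDs. The physical counter
 * index is offset by the node's position within its L3 cache domain.
 */
physical_rmid = rmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
wrmsr(MSR_IA32_QM_EVTSEL, eventid, physical_rmid);
rdmsrl(MSR_IA32_QM_CTR, msr_val);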

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 6 ++++++
arch/x86/kernel/cpu/resctrl/monitor.c | 16 +++++++++++++---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 5 +++--
4 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index ce3a70657842..e7a75a439c16 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);

extern struct dentry *debugfs_resctrl;

+extern unsigned int snc_nodes_per_l3_cache;
+
enum resctrl_res_level {
RDT_RESOURCE_L3,
RDT_RESOURCE_L2,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index c9315ce8f7bd..cf5aba8a74bf 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,12 @@ int max_name_width, max_data_width;
*/
bool rdt_alloc_capable;

+/*
+ * Number of SNC nodes that share each L3 cache. Default is 1 for
+ * systems that do not support SNC, or have SNC disabled.
+ */
+unsigned int snc_nodes_per_l3_cache = 1;
+
static void
mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
struct rdt_resource *r);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 4e145f5620b0..30b7c3b9b517 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)

static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ int cpu = smp_processor_id();
+ int rmid_offset = 0;
u64 msr_val;

+ /*
+ * When SNC mode is on, need to compute the offset to read the
+ * physical RMID counter for the node to which this CPU belongs.
+ */
+ if (snc_nodes_per_l3_cache > 1)
+ rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+
/*
* As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
* with a valid event code for supported resource type and the bits
@@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
* IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
* are error bits.
*/
- wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+ wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
rdmsrl(MSR_IA32_QM_CTR, msr_val);

if (msr_val & RMID_VAL_ERROR)
@@ -783,8 +793,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
int ret;

resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
- hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
- r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+ hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+ r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;

if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 21bbd832f3f2..79d57dade568 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1425,7 +1425,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
}
}

- return size;
+ return size / snc_nodes_per_l3_cache;
}

/*
@@ -2298,7 +2298,8 @@ static bool supports_mba_mbps(void)
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;

return (is_mbm_local_enabled() &&
- r->alloc_capable && is_mba_linear());
+ r->alloc_capable && is_mba_linear() &&
+ snc_nodes_per_l3_cache == 1);
}

/*
--
2.41.0

2023-11-30 00:35:12

by Luck, Tony

Subject: [PATCH v12 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes

With Sub-NUMA Cluster (SNC) mode enabled, the scope of monitoring
resources is per SNC node instead of per L3 cache. Suffixes of
directories with "L3" in their name refer to Sub-NUMA nodes instead of
L3 cache ids.

Users should be aware that SNC mode also affects the amount of L3 cache
available for allocation within each SNC node.
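
A worked example with hypothetical numbers: on an SNC2 system with a
32 MB L3 cache and a 16-bit capacity bit mask, each mask bit still
controls 1/16 of the cache, but the capacity behind each bit is split
between the two SNC nodes, so the resctrl "size" file reports 1 MB per
bit instead of 2 MB.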

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
---
Documentation/arch/x86/resctrl.rst | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df64a9d..49ff789db1d8 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -366,10 +366,10 @@ When control is enabled all CTRL_MON groups will also contain:
When monitoring is enabled all MON groups will also contain:

"mon_data":
- This contains a set of files organized by L3 domain and by
- RDT event. E.g. on a system with two L3 domains there will
- be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
- directories have one file per event (e.g. "llc_occupancy",
+ This contains a set of files organized by L3 domain or by NUMA
+ node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+ or enabled respectively) and by RDT event. Each of these
+ directories has one file per event (e.g. "llc_occupancy",
"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
files provide a read out of the current value of the event for
all tasks in the group. In CTRL_MON groups these files provide
@@ -478,6 +478,23 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
each bit represents 5% of the capacity of the cache. You could partition
the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.

+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes. Users who do not bind tasks to the CPUs of a
+specific Sub-NUMA node must read the "llc_occupancy", "mbm_total_bytes",
+and "mbm_local_bytes" files for all Sub-NUMA nodes where the tasks may
+execute to get the full view of traffic for which the tasks were the source.
+
+The cache allocation feature still provides the same number of
+bits in a mask to control allocation into the L3 cache, but each
+of those ways has its capacity reduced because the cache is divided
+between the SNC nodes. The values reported in the resctrl
+"size" files are adjusted accordingly.
+
Memory bandwidth Allocation and monitoring
==========================================

--
2.41.0

2023-11-30 00:35:12

by Luck, Tony

Subject: [PATCH v12 1/8] x86/resctrl: Prepare for new domain scope

Resctrl resources operate on subsets of CPUs in the system with the
defining attribute of each subset being an instance of a particular
level of cache. E.g. all CPUs sharing an L3 cache would be part of the
same domain.

In preparation for features that are scoped at the NUMA node level,
change the code from explicit references to "cache_level" to a more
generic scope. At this point the only options for this scope are groups
of CPUs that share an L2 cache or L3 cache. The enum values for the
cache scopes match the hardware cache levels, so they can be passed
directly to get_cpu_cacheinfo_id().

Clean up the error handling when looking up domains. Report invalid ids
before calling rdt_find_domain() in preparation for better messages when
the scope can be other than cache scope. This means that
rdt_find_domain() will never return an error, so remove the error checks
from the call sites.

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
---
include/linux/resctrl.h | 9 ++++-
arch/x86/kernel/cpu/resctrl/core.c | 45 ++++++++++++++++-------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 2 +-
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 6 ++-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 5 ++-
5 files changed, 48 insertions(+), 19 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 66942d7fba7f..7d4eb7df611d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -144,13 +144,18 @@ struct resctrl_membw {
struct rdt_parse_data;
struct resctrl_schema;

+enum resctrl_scope {
+ RESCTRL_L2_CACHE = 2,
+ RESCTRL_L3_CACHE = 3,
+};
+
/**
* struct rdt_resource - attributes of a resctrl resource
* @rid: The index of the resource
* @alloc_capable: Is allocation available on this machine
* @mon_capable: Is monitor feature available on this machine
* @num_rmid: Number of RMIDs available
- * @cache_level: Which cache level defines scope of this resource
+ * @scope: Scope of this resource
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
* @domains: All domains for this resource
@@ -168,7 +173,7 @@ struct rdt_resource {
bool alloc_capable;
bool mon_capable;
int num_rmid;
- int cache_level;
+ enum resctrl_scope scope;
struct resctrl_cache cache;
struct resctrl_membw membw;
struct list_head domains;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 19e0681f0435..fd113bc29d4e 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L3,
.name = "L3",
- .cache_level = 3,
+ .scope = RESCTRL_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_L3),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
@@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L2,
.name = "L2",
- .cache_level = 2,
+ .scope = RESCTRL_L2_CACHE,
.domains = domain_init(RDT_RESOURCE_L2),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
@@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_MBA,
.name = "MB",
- .cache_level = 3,
+ .scope = RESCTRL_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_MBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
@@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_SMBA,
.name = "SMBA",
- .cache_level = 3,
+ .scope = RESCTRL_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_SMBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
@@ -401,9 +401,6 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
struct rdt_domain *d;
struct list_head *l;

- if (id < 0)
- return ERR_PTR(-ENODEV);
-
list_for_each(l, &r->domains) {
d = list_entry(l, struct rdt_domain, list);
/* When id is found, return its domain. */
@@ -491,6 +488,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
return 0;
}

+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
+{
+ switch (scope) {
+ case RESCTRL_L2_CACHE:
+ case RESCTRL_L3_CACHE:
+ return get_cpu_cacheinfo_id(cpu, scope);
+ default:
+ break;
+ }
+
+ return -EINVAL;
+}
+
/*
* domain_add_cpu - Add a cpu to a resource's domain list.
*
@@ -506,18 +516,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
*/
static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ int id = get_domain_id_from_scope(cpu, r->scope);
struct list_head *add_pos = NULL;
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;
int err;

- d = rdt_find_domain(r, id, &add_pos);
- if (IS_ERR(d)) {
- pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+ if (id < 0) {
+ pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->scope, r->name);
return;
}

+ d = rdt_find_domain(r, id, &add_pos);
if (d) {
cpumask_set_cpu(cpu, &d->cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
@@ -556,13 +567,19 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)

static void domain_remove_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ int id = get_domain_id_from_scope(cpu, r->scope);
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;

+ if (id < 0) {
+ pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->scope, r->name);
+ return;
+ }
+
d = rdt_find_domain(r, id, NULL);
- if (IS_ERR_OR_NULL(d)) {
- pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+ if (!d) {
+ pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
return;
}
hw_dom = resctrl_to_arch_dom(d);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index beccb0e87ba7..3f8891d57fac 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -563,7 +563,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)

r = &rdt_resources_all[resid].r_resctrl;
d = rdt_find_domain(r, domid, NULL);
- if (IS_ERR_OR_NULL(d)) {
+ if (!d) {
ret = -ENOENT;
goto out;
}
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 8f559eeae08e..2a682da9f43a 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
*/
static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
{
+ enum resctrl_scope scope = plr->s->res->scope;
struct cpu_cacheinfo *ci;
int ret;
int i;

+ if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
+ return -ENODEV;
+
/* Pick the first cpu we find that is associated with the cache. */
plr->cpu = cpumask_first(&plr->d->cpu_mask);

@@ -311,7 +315,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);

for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == plr->s->res->cache_level) {
+ if (ci->info_list[i].level == scope) {
plr->line_size = ci->info_list[i].coherency_line_size;
return 0;
}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 69a1de92384a..c44be64d65ec 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1413,10 +1413,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
unsigned int size = 0;
int num_b, i;

+ if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+ return size;
+
num_b = bitmap_weight(&cbm, r->cache.cbm_len);
ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == r->cache_level) {
+ if (ci->info_list[i].level == r->scope) {
size = ci->info_list[i].size / r->cache.cbm_len * num_b;
break;
}
--
2.41.0

2023-11-30 00:35:23

by Luck, Tony

Subject: [PATCH v12 7/8] x86/resctrl: Sub NUMA Cluster detection and enable

There isn't a simple hardware bit that indicates whether a CPU is
running in Sub-NUMA Cluster (SNC) mode. Infer the state from the ratio
of NUMA nodes to L3 cache instances.

When SNC mode is detected, reconfigure the RMID counters by updating
the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.

Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
on the second SNC node to start from zero.
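
For example (hypothetical topology): a two-socket system with SNC2
enabled presents four CPU-bearing NUMA nodes and two L3 cache instances,
so the detection ratio is 4 / 2 = 2 SNC nodes per L3 cache. With SNC
disabled the same system gives 2 / 2 = 1 and resctrl behaves as before.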

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 96 ++++++++++++++++++++++++++++++
2 files changed, 97 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 1d51e1850ed0..94d29d81e6db 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1111,6 +1111,7 @@
#define MSR_IA32_QM_CTR 0xc8e
#define MSR_IA32_PQR_ASSOC 0xc8f
#define MSR_IA32_L3_CBM_BASE 0xc90
+#define MSR_RMID_SNC_CONFIG 0xca0
#define MSR_IA32_L2_CBM_BASE 0xd10
#define MSR_IA32_MBA_THRTL_BASE 0xd50

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index cf5aba8a74bf..3293ab4c58b0 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@

#define pr_fmt(fmt) "resctrl: " fmt

+#include <linux/cpu.h>
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/cacheinfo.h>
#include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>

+#include <asm/cpu_device_id.h>
#include <asm/intel-family.h>
#include <asm/resctrl.h>
#include "internal.h"
@@ -740,11 +743,42 @@ static void clear_closid_rmid(int cpu)
wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
}

+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * This mode is incompatible with Linux resctrl semantics
+ * as RMIDs are partitioned between SNC nodes, which requires
+ * a user to know which RMID is allocated to a task.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in Sub NUMA Cluster mode. This mode is better for Linux.
+ * The RMID space is divided between all SNC nodes with the
+ * RMIDs renumbered to start from zero in each node when
+ * counting operations from tasks. Code to read the counters
+ * must adjust RMID counter numbers based on SNC node. See
+ * __rmid_read() for code that does this.
+ */
+static void snc_remap_rmids(int cpu)
+{
+ u64 val;
+
+ /* Only need to enable once per package. */
+ if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
+ return;
+
+ rdmsrl(MSR_RMID_SNC_CONFIG, val);
+ val &= ~BIT_ULL(0);
+ wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
static int resctrl_online_cpu(unsigned int cpu)
{
struct rdt_resource *r;

mutex_lock(&rdtgroup_mutex);
+
+ if (snc_nodes_per_l3_cache > 1)
+ snc_remap_rmids(cpu);
+
for_each_capable_rdt_resource(r)
domain_add_cpu(cpu, r);
/* The cpu is set in default rdtgroup after online. */
@@ -999,11 +1033,73 @@ static __init bool get_rdt_resources(void)
return (rdt_mon_capable || rdt_alloc_capable);
}

+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
+ {}
+};
+
+/*
+ * There isn't a simple hardware bit that indicates whether a CPU is running
+ * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
+ * ratio of NUMA nodes to L3 cache instances.
+ * It is not possible to accurately determine SNC state if the system is
+ * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
+ * to L3 caches. It will be OK if the system is booted with hyperthreading
+ * disabled (since this doesn't affect the ratio).
+ */
+static __init int snc_get_config(void)
+{
+ unsigned long *node_caches;
+ int mem_only_nodes = 0;
+ int cpu, node, ret;
+ int num_l3_caches;
+
+ if (!x86_match_cpu(snc_cpu_ids))
+ return 1;
+
+ node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
+ if (!node_caches)
+ return 1;
+
+ cpus_read_lock();
+
+ if (num_online_cpus() != num_present_cpus())
+ pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
+
+ for_each_node(node) {
+ cpu = cpumask_first(cpumask_of_node(node));
+ if (cpu < nr_cpu_ids)
+ set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
+ else
+ mem_only_nodes++;
+ }
+ cpus_read_unlock();
+
+ num_l3_caches = bitmap_weight(node_caches, nr_node_ids);
+ kfree(node_caches);
+
+ if (!num_l3_caches)
+ return 1;
+
+ ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
+
+ if (ret > 1)
+ rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
+
+ return ret;
+}
+
static __init void rdt_init_res_defs_intel(void)
{
struct rdt_hw_resource *hw_res;
struct rdt_resource *r;

+ snc_nodes_per_l3_cache = snc_get_config();
+
for_each_rdt_resource(r) {
hw_res = resctrl_to_arch_res(r);

--
2.41.0

2023-11-30 00:35:24

by Luck, Tony

Subject: [PATCH v12 3/8] x86/resctrl: Prepare for different scope for control/monitor operations

Resctrl assumes that control and monitor operations on a resource are
performed at the same scope.

Prepare for systems that use different scope (specifically Intel needs
to split the RDT_RESOURCE_L3 resource to use L3 scope for cache control
and NODE scope for cache occupancy and memory bandwidth monitoring).

Create separate domain lists for control and monitor operations.

Note that errors during initialization of either control or monitor
functions on a domain would previously result in that domain being
excluded from both control and monitor operations. Now that the domains
are allocated independently, it is no longer necessary to disable both
control and monitor operations if either fails.
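
A sketch of how the CPU online path can then drive the two lists
independently (assuming caller-side capability checks, matching the
guards this patch removes from the per-function setup code):

static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
        if (r->alloc_capable)
                domain_add_cpu_ctrl(cpu, r);
        if (r->mon_capable)
                domain_add_cpu_mon(cpu, r);
}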

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
---
include/linux/resctrl.h | 25 ++-
arch/x86/kernel/cpu/resctrl/internal.h | 6 +-
arch/x86/kernel/cpu/resctrl/core.c | 211 ++++++++++++++++------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 12 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 4 +-
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 55 +++---
7 files changed, 220 insertions(+), 97 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index c4067150a6b7..35e700edc6e6 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -52,15 +52,22 @@ struct resctrl_staged_config {
bool have_new_ctrl;
};

+enum resctrl_domain_type {
+ RESCTRL_CTRL_DOMAIN,
+ RESCTRL_MON_DOMAIN,
+};
+
/**
* struct rdt_domain_hdr - common header for different domain types
* @list: all instances of this resource
* @id: unique id for this instance
+ * @type: type of this instance
* @cpu_mask: which CPUs share this resource
*/
struct rdt_domain_hdr {
struct list_head list;
int id;
+ enum resctrl_domain_type type;
struct cpumask cpu_mask;
};

@@ -163,10 +170,12 @@ enum resctrl_scope {
* @alloc_capable: Is allocation available on this machine
* @mon_capable: Is monitor feature available on this machine
* @num_rmid: Number of RMIDs available
- * @scope: Scope of this resource
+ * @ctrl_scope: Scope of this resource for control functions
+ * @mon_scope: Scope of this resource for monitor functions
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
- * @domains: All domains for this resource
+ * @ctrl_domains: Control domains for this resource
+ * @mon_domains: Monitor domains for this resource
* @name: Name to use in "schemata" file.
* @data_width: Character width of data when displaying
* @default_ctrl: Specifies default cache cbm or memory B/W percent.
@@ -181,10 +190,12 @@ struct rdt_resource {
bool alloc_capable;
bool mon_capable;
int num_rmid;
- enum resctrl_scope scope;
+ enum resctrl_scope ctrl_scope;
+ enum resctrl_scope mon_scope;
struct resctrl_cache cache;
struct resctrl_membw membw;
- struct list_head domains;
+ struct list_head ctrl_domains;
+ struct list_head mon_domains;
char *name;
int data_width;
u32 default_ctrl;
@@ -230,8 +241,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,

u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);

/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa15f0a2..24bf9d7989a9 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -520,8 +520,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
- struct list_head **pos);
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+ struct list_head **pos);
ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
int rdtgroup_schemata_show(struct kernfs_open_file *of,
@@ -540,7 +540,7 @@ int rdt_pseudo_lock_init(void);
void rdt_pseudo_lock_release(void);
int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
int closids_supported(void);
void closid_free(int closid);
int alloc_rmid(void);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 62a989fd950d..1fd85533b4ca 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -57,7 +57,8 @@ static void
mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r);

-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
+#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)

struct rdt_hw_resource rdt_resources_all[] = {
[RDT_RESOURCE_L3] =
@@ -65,8 +66,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L3,
.name = "L3",
- .scope = RESCTRL_L3_CACHE,
- .domains = domain_init(RDT_RESOURCE_L3),
+ .ctrl_scope = RESCTRL_L3_CACHE,
+ .mon_scope = RESCTRL_L3_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_L3),
+ .mon_domains = mon_domain_init(RDT_RESOURCE_L3),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
@@ -79,8 +82,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L2,
.name = "L2",
- .scope = RESCTRL_L2_CACHE,
- .domains = domain_init(RDT_RESOURCE_L2),
+ .ctrl_scope = RESCTRL_L2_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_L2),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
@@ -93,8 +96,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_MBA,
.name = "MB",
- .scope = RESCTRL_L3_CACHE,
- .domains = domain_init(RDT_RESOURCE_MBA),
+ .ctrl_scope = RESCTRL_L3_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_MBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
.fflags = RFTYPE_RES_MB,
@@ -105,8 +108,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_SMBA,
.name = "SMBA",
- .scope = RESCTRL_L3_CACHE,
- .domains = domain_init(RDT_RESOURCE_SMBA),
+ .ctrl_scope = RESCTRL_L3_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_SMBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
.fflags = RFTYPE_RES_MB,
@@ -352,11 +355,11 @@ cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
}

-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
{
struct rdt_domain *d;

- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
/* Find the domain that contains this CPU */
if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
return d;
@@ -378,7 +381,7 @@ void rdt_ctrl_update(void *arg)
int cpu = smp_processor_id();
struct rdt_domain *d;

- d = get_domain_from_cpu(cpu, r);
+ d = get_ctrl_domain_from_cpu(cpu, r);
if (d) {
hw_res->msr_update(d, m, r);
return;
@@ -388,26 +391,26 @@ void rdt_ctrl_update(void *arg)
}

/*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Search for a domain id in a resource domain list.
*
- * Search resource r's domain list to find the resource id. If the resource
- * id is found in a domain, return the domain. Otherwise, if requested by
- * caller, return the first domain whose id is bigger than the input id.
- * The domain list is sorted by id in ascending order.
+ * Search the domain list for the domain id. If the domain id is
+ * found, return the domain; otherwise return NULL. When the id is not
+ * found, the position of the first domain with a bigger id can be
+ * returned to the caller via @pos.
*/
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
- struct list_head **pos)
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+ struct list_head **pos)
{
- struct rdt_domain *d;
+ struct rdt_domain_hdr *d;
struct list_head *l;

- list_for_each(l, &r->domains) {
- d = list_entry(l, struct rdt_domain, hdr.list);
+ list_for_each(l, h) {
+ d = list_entry(l, struct rdt_domain_hdr, list);
/* When id is found, return its domain. */
- if (id == d->hdr.id)
+ if (id == d->id)
return d;
/* Stop searching when finding id's position in sorted list. */
- if (id < d->hdr.id)
+ if (id < d->id)
break;
}

@@ -501,35 +504,28 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
return -EINVAL;
}

-/*
- * domain_add_cpu - Add a cpu to a resource's domain list.
- *
- * If an existing domain in the resource r's domain list matches the cpu's
- * resource id, add the cpu in the domain.
- *
- * Otherwise, a new domain is allocated and inserted into the right position
- * in the domain list sorted by id in ascending order.
- *
- * The order in the domain list is visible to users when we print entries
- * in the schemata file and schemata input is validated to have the same order
- * as this list.
- */
-static void domain_add_cpu(int cpu, struct rdt_resource *r)
+static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
{
- int id = get_domain_id_from_scope(cpu, r->scope);
+ int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
struct list_head *add_pos = NULL;
struct rdt_hw_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
struct rdt_domain *d;
int err;

if (id < 0) {
- pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
- cpu, r->scope, r->name);
+ pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->ctrl_scope, r->name);
return;
}

- d = rdt_find_domain(r, id, &add_pos);
- if (d) {
+ hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
+ if (hdr) {
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ return;
+
+ d = container_of(hdr, struct rdt_domain, hdr);
+
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
rdt_domain_reconfigure_cdp(r);
@@ -542,51 +538,114 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)

d = &hw_dom->d_resctrl;
d->hdr.id = id;
+ d->hdr.type = RESCTRL_CTRL_DOMAIN;
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);

rdt_domain_reconfigure_cdp(r);

- if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
+ if (domain_setup_ctrlval(r, d)) {
domain_free(hw_dom);
return;
}

- if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+ list_add_tail(&d->hdr.list, add_pos);
+
+ err = resctrl_online_ctrl_domain(r, d);
+ if (err) {
+ list_del(&d->hdr.list);
+ domain_free(hw_dom);
+ }
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->mon_scope);
+ struct list_head *add_pos = NULL;
+ struct rdt_hw_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
+ struct rdt_domain *d;
+ int err;
+
+ if (id < 0) {
+ pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->mon_scope, r->name);
+ return;
+ }
+
+ hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
+ if (hdr) {
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ return;
+
+ d = container_of(hdr, struct rdt_domain, hdr);
+
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+ return;
+ }
+
+ hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+ if (!hw_dom)
+ return;
+
+ d = &hw_dom->d_resctrl;
+ d->hdr.id = id;
+ d->hdr.type = RESCTRL_MON_DOMAIN;
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+
+ if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
domain_free(hw_dom);
return;
}

list_add_tail(&d->hdr.list, add_pos);

- err = resctrl_online_domain(r, d);
+ err = resctrl_online_mon_domain(r, d);
if (err) {
list_del(&d->hdr.list);
domain_free(hw_dom);
}
}

-static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+/*
+ * domain_add_cpu - Add a CPU to a resource's control and/or monitor domain lists.
+ */
+static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_domain_id_from_scope(cpu, r->scope);
+ if (r->alloc_capable)
+ domain_add_cpu_ctrl(cpu, r);
+ if (r->mon_capable)
+ domain_add_cpu_mon(cpu, r);
+}
+
+static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
struct rdt_hw_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
struct rdt_domain *d;

if (id < 0) {
- pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
- cpu, r->scope, r->name);
+ pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->ctrl_scope, r->name);
return;
}

- d = rdt_find_domain(r, id, NULL);
- if (!d) {
- pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
+ hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
+ if (!hdr) {
+ pr_warn("Can't find control domain for id=%d for CPU %d for resource %s\n",
+ id, cpu, r->name);
return;
}
+
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ return;
+
+ d = container_of(hdr, struct rdt_domain, hdr);
hw_dom = resctrl_to_arch_dom(d);

cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
if (cpumask_empty(&d->hdr.cpu_mask)) {
- resctrl_offline_domain(r, d);
+ resctrl_offline_ctrl_domain(r, d);
list_del(&d->hdr.list);

/*
@@ -599,6 +658,42 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)

return;
}
+}
+
+static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->mon_scope);
+ struct rdt_hw_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
+ struct rdt_domain *d;
+
+ if (id < 0) {
+ pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->mon_scope, r->name);
+ return;
+ }
+
+ hdr = rdt_find_domain(&r->mon_domains, id, NULL);
+ if (!hdr) {
+ pr_warn("Can't find monitor domain for id=%d for CPU %d for resource %s\n",
+ id, cpu, r->name);
+ return;
+ }
+
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ return;
+
+ d = container_of(hdr, struct rdt_domain, hdr);
+ hw_dom = resctrl_to_arch_dom(d);
+
+ cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+ if (cpumask_empty(&d->hdr.cpu_mask)) {
+ resctrl_offline_mon_domain(r, d);
+ list_del(&d->hdr.list);
+ domain_free(hw_dom);
+
+ return;
+ }

if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
@@ -613,6 +708,14 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
}
}

+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+{
+ if (r->alloc_capable)
+ domain_remove_cpu_ctrl(cpu, r);
+ if (r->mon_capable)
+ domain_remove_cpu_mon(cpu, r);
+}
+
static void clear_closid_rmid(int cpu)
{
struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 23f8258d36a8..0b4136c42762 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -226,7 +226,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
return -EINVAL;
}
dom = strim(dom);
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
if (d->hdr.id == dom_id) {
data.buf = dom;
data.rdtgrp = rdtgrp;
@@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
return -ENOMEM;

msr_param.res = NULL;
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
hw_dom = resctrl_to_arch_dom(d);
for (t = 0; t < CDP_NUM_TYPES; t++) {
cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
u32 ctrl_val;

seq_printf(s, "%*s:", max_name_width, schema->name);
- list_for_each_entry(dom, &r->domains, hdr.list) {
+ list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
if (sep)
seq_puts(s, ";");

@@ -542,6 +542,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
+ struct rdt_domain_hdr *hdr;
u32 resid, evtid, domid;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
@@ -562,11 +563,12 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
evtid = md.u.evtid;

r = &rdt_resources_all[resid].r_resctrl;
- d = rdt_find_domain(r, domid, NULL);
- if (!d) {
+ hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
+ if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
ret = -ENOENT;
goto out;
}
+ d = container_of(hdr, struct rdt_domain, hdr);

mon_event_read(&rr, r, d, rdtgrp, evtid, false);

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index dd0ea1bc0092..ec5ad926c5dc 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)

entry->busy = 0;
cpu = get_cpu();
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->mon_domains, hdr.list) {
if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) {
err = resctrl_arch_rmid_read(r, d, entry->rmid,
QOS_L3_OCCUP_EVENT_ID,
@@ -535,7 +535,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
rmid = rgrp->mon.rmid;
pmbm_data = &dom_mbm->mbm_local[rmid];

- dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
+ dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
if (!dom_mba) {
pr_warn_once("Failure to get domain for MBA update\n");
return;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index fcbd99e2eb66..ed6d59af1cef 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,7 +292,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
*/
static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
{
- enum resctrl_scope scope = plr->s->res->scope;
+ enum resctrl_scope scope = plr->s->res->ctrl_scope;
struct cpu_cacheinfo *ci;
int ret;
int i;
@@ -856,7 +856,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
* associated with them.
*/
for_each_alloc_capable_rdt_resource(r) {
- list_for_each_entry(d_i, &r->domains, hdr.list) {
+ list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) {
if (d_i->plr)
cpumask_or(cpu_with_psl, cpu_with_psl,
&d_i->hdr.cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 04d32602ac33..760013ed1bff 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
lockdep_assert_held(&rdtgroup_mutex);

for_each_alloc_capable_rdt_resource(r) {
- list_for_each_entry(dom, &r->domains, hdr.list)
+ list_for_each_entry(dom, &r->ctrl_domains, hdr.list)
memset(dom->staged_config, 0, sizeof(dom->staged_config));
}
}
@@ -984,7 +984,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,

mutex_lock(&rdtgroup_mutex);
hw_shareable = r->cache.shareable_bits;
- list_for_each_entry(dom, &r->domains, hdr.list) {
+ list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
if (sep)
seq_putc(seq, ';');
sw_shareable = 0;
@@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
continue;
has_cache = true;
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
ctrl = resctrl_arch_get_config(r, d, closid,
s->conf_type);
if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1413,13 +1413,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
unsigned int size = 0;
int num_b, i;

- if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+ if (WARN_ON_ONCE(r->ctrl_scope != RESCTRL_L2_CACHE && r->ctrl_scope != RESCTRL_L3_CACHE))
return size;

num_b = bitmap_weight(&cbm, r->cache.cbm_len);
ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == r->scope) {
+ if (ci->info_list[i].level == r->ctrl_scope) {
size = ci->info_list[i].size / r->cache.cbm_len * num_b;
break;
}
@@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
type = schema->conf_type;
sep = false;
seq_printf(s, "%*s:", max_name_width, schema->name);
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
if (sep)
seq_putc(s, ';');
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid

mutex_lock(&rdtgroup_mutex);

- list_for_each_entry(dom, &r->domains, hdr.list) {
+ list_for_each_entry(dom, &r->mon_domains, hdr.list) {
if (sep)
seq_puts(s, ";");

@@ -1689,7 +1689,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
return -EINVAL;
}

- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->mon_domains, hdr.list) {
if (d->hdr.id == dom_id) {
ret = mbm_config_write_domain(r, d, evtid, val);
if (ret)
@@ -2232,7 +2232,7 @@ static int set_cache_qos_cfg(int level, bool enable)
return -ENOMEM;

r_l = &rdt_resources_all[level].r_resctrl;
- list_for_each_entry(d, &r_l->domains, hdr.list) {
+ list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) {
if (r_l->cache.arch_has_per_cpu_cfg)
/* Pick all the CPUs in the domain instance */
for_each_cpu(cpu, &d->hdr.cpu_mask)
@@ -2317,7 +2317,7 @@ static int set_mba_sc(bool mba_sc)

r->membw.mba_sc = mba_sc;

- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
for (i = 0; i < num_closid; i++)
d->mbps_val[i] = MBA_MAX_MBPS;
}
@@ -2653,7 +2653,7 @@ static int rdt_get_tree(struct fs_context *fc)

if (is_mbm_enabled()) {
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- list_for_each_entry(dom, &r->domains, hdr.list)
+ list_for_each_entry(dom, &r->mon_domains, hdr.list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
}

@@ -2777,10 +2777,10 @@ static int reset_all_ctrls(struct rdt_resource *r)

/*
* Disable resource control for this resource by setting all
- * CBMs in all domains to the maximum mask value. Pick one CPU
+ * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU
* from each domain to update the MSRs below.
*/
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
hw_dom = resctrl_to_arch_dom(d);
cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);

@@ -3050,7 +3050,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_domain *dom;
int ret;

- list_for_each_entry(dom, &r->domains, hdr.list) {
+ list_for_each_entry(dom, &r->mon_domains, hdr.list) {
ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
if (ret)
return ret;
@@ -3232,7 +3232,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
struct rdt_domain *d;
int ret;

- list_for_each_entry(d, &s->res->domains, hdr.list) {
+ list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
ret = __init_one_rdt_domain(d, s, closid);
if (ret < 0)
return ret;
@@ -3247,7 +3247,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
struct resctrl_staged_config *cfg;
struct rdt_domain *d;

- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
if (is_mba_sc(r)) {
d->mbps_val[closid] = MBA_MAX_MBPS;
continue;
@@ -3849,15 +3849,17 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
kfree(d->mbm_local);
}

-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
{
lockdep_assert_held(&rdtgroup_mutex);

if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
mba_sc_domain_destroy(r, d);
+}

- if (!r->mon_capable)
- return;
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+ lockdep_assert_held(&rdtgroup_mutex);

/*
* If resctrl is mounted, remove all the
@@ -3914,18 +3916,21 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
{
- int err;
-
lockdep_assert_held(&rdtgroup_mutex);

if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
- /* RDT_RESOURCE_MBA is never mon_capable */
return mba_sc_domain_allocate(r, d);

- if (!r->mon_capable)
- return 0;
+ return 0;
+}
+
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+ int err;
+
+ lockdep_assert_held(&rdtgroup_mutex);

err = domain_setup_mon_state(r, d);
if (err)
--
2.41.0
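
The key pattern this patch introduces is worth spelling out: rdt_find_domain()
now returns the embedded struct rdt_domain_hdr rather than a full domain, and
each caller checks the header type before recovering the enclosing structure
with container_of(). Below is a minimal, self-contained userspace sketch of
that idiom; the struct my_ctrl_domain name and its fields are illustrative
stand-ins, not kernel code.

#include <stddef.h>
#include <stdio.h>

enum domain_type { RESCTRL_CTRL_DOMAIN, RESCTRL_MON_DOMAIN };

struct rdt_domain_hdr {
	int id;
	enum domain_type type;
};

/* Illustrative stand-in for a full domain: header plus private fields. */
struct my_ctrl_domain {
	struct rdt_domain_hdr hdr;
	unsigned int ctrl_val;
};

/* Classic definition of the kernel macro, minus type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

int main(void)
{
	struct my_ctrl_domain dom = {
		.hdr = { .id = 3, .type = RESCTRL_CTRL_DOMAIN },
		.ctrl_val = 42,
	};
	/* What rdt_find_domain() hands back to the caller. */
	struct rdt_domain_hdr *hdr = &dom.hdr;

	/* Callers verify the type first; the kernel uses WARN_ON_ONCE(). */
	if (hdr->type != RESCTRL_CTRL_DOMAIN)
		return 1;

	struct my_ctrl_domain *d = container_of(hdr, struct my_ctrl_domain, hdr);
	printf("id=%d ctrl_val=%u\n", d->hdr.id, d->ctrl_val);
	return 0;
}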

2023-11-30 00:36:56

by Luck, Tony

Subject: [PATCH v12 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures

The same rdt_domain structure is used for both control and monitor
functions, but this wastes memory: some of its fields are used only by
control functions, while most are used only by monitor functions, yet
every instance carries all of them.

Split it into separate rdt_ctrl_domain and rdt_mon_domain structures,
each with just the fields required for control and monitoring
respectively.

Similarly, split the rdt_hw_domain structure into rdt_hw_ctrl_domain
and rdt_hw_mon_domain.
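
To see why carrying both sets of fields in one structure wastes memory,
here is a quick userspace sketch with stand-in field types (the sizes it
prints are illustrative only, not the kernel's real layout):

#include <stdio.h>

struct combined {               /* old rdt_domain: control + monitor fields */
	long hdr[4];            /* stand-in for struct rdt_domain_hdr */
	void *ctrl_only[3];     /* plr, staged_config, mbps_val */
	void *mon_only[7];      /* rmid_busy_llc, mbm_*, cqm_*, work CPUs */
};

struct ctrl_only {              /* new rdt_ctrl_domain */
	long hdr[4];
	void *ctrl_only[3];
};

struct mon_only {               /* new rdt_mon_domain */
	long hdr[4];
	void *mon_only[7];
};

int main(void)
{
	printf("combined: %zu bytes\n", sizeof(struct combined));
	printf("ctrl:     %zu bytes\n", sizeof(struct ctrl_only));
	printf("mon:      %zu bytes\n", sizeof(struct mon_only));
	return 0;
}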

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
---
include/linux/resctrl.h | 48 +++++++------
arch/x86/kernel/cpu/resctrl/internal.h | 60 ++++++++++------
arch/x86/kernel/cpu/resctrl/core.c | 87 ++++++++++++-----------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 32 ++++-----
arch/x86/kernel/cpu/resctrl/monitor.c | 40 +++++------
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 6 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 62 ++++++++--------
7 files changed, 182 insertions(+), 153 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 35e700edc6e6..058a940c3239 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -72,7 +72,23 @@ struct rdt_domain_hdr {
};

/**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
+ * @hdr: common header for different domain types
+ * @plr: pseudo-locked region (if any) associated with domain
+ * @staged_config: parsed configuration to be applied
+ * @mbps_val: When mba_sc is enabled, this holds the array of user
+ * specified control values for mba_sc in MBps, indexed
+ * by closid
+ */
+struct rdt_ctrl_domain {
+ struct rdt_domain_hdr hdr;
+ struct pseudo_lock_region *plr;
+ struct resctrl_staged_config staged_config[CDP_NUM_TYPES];
+ u32 *mbps_val;
+};
+
+/**
+ * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
* @hdr: common header for different domain types
* @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
* @mbm_total: saved state for MBM total bandwidth
@@ -81,13 +97,8 @@ struct rdt_domain_hdr {
* @cqm_limbo: worker to periodically read CQM h/w counters
* @mbm_work_cpu: worker CPU for MBM h/w counters
* @cqm_work_cpu: worker CPU for CQM h/w counters
- * @plr: pseudo-locked region (if any) associated with domain
- * @staged_config: parsed configuration to be applied
- * @mbps_val: When mba_sc is enabled, this holds the array of user
- * specified control values for mba_sc in MBps, indexed
- * by closid
*/
-struct rdt_domain {
+struct rdt_mon_domain {
struct rdt_domain_hdr hdr;
unsigned long *rmid_busy_llc;
struct mbm_state *mbm_total;
@@ -96,9 +107,6 @@ struct rdt_domain {
struct delayed_work cqm_limbo;
int mbm_work_cpu;
int cqm_work_cpu;
- struct pseudo_lock_region *plr;
- struct resctrl_staged_config staged_config[CDP_NUM_TYPES];
- u32 *mbps_val;
};

/**
@@ -202,7 +210,7 @@ struct rdt_resource {
const char *format_str;
int (*parse_ctrlval)(struct rdt_parse_data *data,
struct resctrl_schema *s,
- struct rdt_domain *d);
+ struct rdt_ctrl_domain *d);
struct list_head evt_list;
unsigned long fflags;
bool cdp_capable;
@@ -236,15 +244,15 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
* Update the ctrl_val and apply this config right now.
* Must be called on one of the domain's CPUs.
*/
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type t, u32 cfg_val);

-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type type);
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);

/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
@@ -260,7 +268,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
* Return:
* 0 on success, or -EIO, -EINVAL etc on error.
*/
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 rmid, enum resctrl_event_id eventid, u64 *val);

/**
@@ -273,7 +281,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 rmid, enum resctrl_event_id eventid);

/**
@@ -285,7 +293,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);

extern unsigned int resctrl_rmid_realloc_threshold;
extern unsigned int resctrl_rmid_realloc_limit;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 24bf9d7989a9..ce3a70657842 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -107,7 +107,7 @@ union mon_data_bits {
struct rmid_read {
struct rdtgroup *rgrp;
struct rdt_resource *r;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;
enum resctrl_event_id evtid;
bool first;
int err;
@@ -192,7 +192,7 @@ struct mongroup {
*/
struct pseudo_lock_region {
struct resctrl_schema *s;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
u32 cbm;
wait_queue_head_t lock_thread_wq;
int thread_done;
@@ -319,25 +319,41 @@ struct arch_mbm_state {
};

/**
- * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
- * a resource
+ * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
+ * a resource for a control function
* @d_resctrl: Properties exposed to the resctrl file system
* @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID)
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_ctrl_domain {
+ struct rdt_ctrl_domain d_resctrl;
+ u32 *ctrl_val;
+};
+
+/**
+ * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
+ * a resource for a monitor function
+ * @d_resctrl: Properties exposed to the resctrl file system
* @arch_mbm_total: arch private state for MBM total bandwidth
* @arch_mbm_local: arch private state for MBM local bandwidth
*
* Members of this structure are accessed via helpers that provide abstraction.
*/
-struct rdt_hw_domain {
- struct rdt_domain d_resctrl;
- u32 *ctrl_val;
+struct rdt_hw_mon_domain {
+ struct rdt_mon_domain d_resctrl;
struct arch_mbm_state *arch_mbm_total;
struct arch_mbm_state *arch_mbm_local;
};

-static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
+static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
+{
+ return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
+}
+
+static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
{
- return container_of(r, struct rdt_hw_domain, d_resctrl);
+ return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
}

/**
@@ -405,7 +421,7 @@ struct rdt_hw_resource {
struct rdt_resource r_resctrl;
u32 num_closid;
unsigned int msr_base;
- void (*msr_update) (struct rdt_domain *d, struct msr_param *m,
+ void (*msr_update) (struct rdt_ctrl_domain *d, struct msr_param *m,
struct rdt_resource *r);
unsigned int mon_scale;
unsigned int mbm_width;
@@ -418,9 +434,9 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
}

int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
- struct rdt_domain *d);
+ struct rdt_ctrl_domain *d);
int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
- struct rdt_domain *d);
+ struct rdt_ctrl_domain *d);

extern struct mutex rdtgroup_mutex;

@@ -526,21 +542,21 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
int rdtgroup_schemata_show(struct kernfs_open_file *of,
struct seq_file *s, void *v);
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
unsigned long cbm, int closid, bool exclusive);
-unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d,
+unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d,
unsigned long cbm);
enum rdtgrp_mode rdtgroup_mode_by_closid(int closid);
int rdtgroup_tasks_assigned(struct rdtgroup *r);
int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp);
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm);
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d);
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm);
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d);
int rdt_pseudo_lock_init(void);
void rdt_pseudo_lock_release(void);
int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
int closids_supported(void);
void closid_free(int closid);
int alloc_rmid(void);
@@ -550,17 +566,17 @@ bool __init rdt_cpu_has(int flag);
void mon_event_count(void *info);
int rdtgroup_mondata_show(struct seq_file *m, void *arg);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
int evtid, int first);
-void mbm_setup_overflow_handler(struct rdt_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
unsigned long delay_ms);
void mbm_handle_overflow(struct work_struct *work);
void __init intel_rdt_mbm_apply_quirk(void);
bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms);
void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
-void __check_limbo(struct rdt_domain *d, bool force_free);
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d);
+void __check_limbo(struct rdt_mon_domain *d, bool force_free);
void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
void __init thread_throttle_mode_init(void);
void __init mbm_config_rftype_init(const char *config);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 1fd85533b4ca..797cb3bf417a 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -49,12 +49,12 @@ int max_name_width, max_data_width;
bool rdt_alloc_capable;

static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
struct rdt_resource *r);
static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r);
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r);
static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m,
struct rdt_resource *r);

#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
@@ -307,11 +307,11 @@ static void rdt_get_cdp_l2_config(void)
}

static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
{
- unsigned int i;
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ unsigned int i;

for (i = m->low; i < m->high; i++)
wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
@@ -332,12 +332,12 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
}

static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
struct rdt_resource *r)
{
- unsigned int i;
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ unsigned int i;

/* Write the delay values for mba. */
for (i = m->low; i < m->high; i++)
@@ -345,19 +345,19 @@ mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
}

static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
{
- unsigned int i;
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ unsigned int i;

for (i = m->low; i < m->high; i++)
wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
}

-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
{
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;

list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
/* Find the domain that contains this CPU */
@@ -379,7 +379,7 @@ void rdt_ctrl_update(void *arg)
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
struct rdt_resource *r = m->res;
int cpu = smp_processor_id();
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;

d = get_ctrl_domain_from_cpu(cpu, r);
if (d) {
@@ -434,18 +434,23 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
*dc = r->default_ctrl;
}

-static void domain_free(struct rdt_hw_domain *hw_dom)
+static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
+{
+ kfree(hw_dom->ctrl_val);
+ kfree(hw_dom);
+}
+
+static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
{
kfree(hw_dom->arch_mbm_total);
kfree(hw_dom->arch_mbm_local);
- kfree(hw_dom->ctrl_val);
kfree(hw_dom);
}

-static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
struct msr_param m;
u32 *dc;

@@ -468,7 +473,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
* @num_rmid: The size of the MBM counter array
* @hw_dom: The domain that owns the allocated arrays
*/
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
{
size_t tsize;

@@ -507,10 +512,10 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+ struct rdt_hw_ctrl_domain *hw_dom;
struct list_head *add_pos = NULL;
- struct rdt_hw_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
int err;

if (id < 0) {
@@ -524,7 +529,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
return;

- d = container_of(hdr, struct rdt_domain, hdr);
+ d = container_of(hdr, struct rdt_ctrl_domain, hdr);

cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
@@ -544,7 +549,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
rdt_domain_reconfigure_cdp(r);

if (domain_setup_ctrlval(r, d)) {
- domain_free(hw_dom);
+ ctrl_domain_free(hw_dom);
return;
}

@@ -553,17 +558,17 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
err = resctrl_online_ctrl_domain(r, d);
if (err) {
list_del(&d->hdr.list);
- domain_free(hw_dom);
+ ctrl_domain_free(hw_dom);
}
}

static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->mon_scope);
+ struct rdt_hw_mon_domain *hw_dom;
struct list_head *add_pos = NULL;
- struct rdt_hw_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;
int err;

if (id < 0) {
@@ -577,7 +582,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
return;

- d = container_of(hdr, struct rdt_domain, hdr);
+ d = container_of(hdr, struct rdt_mon_domain, hdr);

cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
return;
@@ -593,7 +598,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);

if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
- domain_free(hw_dom);
+ mon_domain_free(hw_dom);
return;
}

@@ -602,7 +607,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
err = resctrl_online_mon_domain(r, d);
if (err) {
list_del(&d->hdr.list);
- domain_free(hw_dom);
+ mon_domain_free(hw_dom);
}
}

@@ -620,9 +625,9 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_ctrl_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;

if (id < 0) {
pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
@@ -640,8 +645,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
return;

- d = container_of(hdr, struct rdt_domain, hdr);
- hw_dom = resctrl_to_arch_dom(d);
+ d = container_of(hdr, struct rdt_ctrl_domain, hdr);
+ hw_dom = resctrl_to_arch_ctrl_dom(d);

cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
if (cpumask_empty(&d->hdr.cpu_mask)) {
@@ -649,12 +654,12 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
list_del(&d->hdr.list);

/*
- * rdt_domain "d" is going to be freed below, so clear
+ * rdt_ctrl_domain "d" is going to be freed below, so clear
* its pointer from pseudo_lock_region struct.
*/
if (d->plr)
d->plr->d = NULL;
- domain_free(hw_dom);
+ ctrl_domain_free(hw_dom);

return;
}
@@ -663,9 +668,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->mon_scope);
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_mon_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;

if (id < 0) {
pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
@@ -683,14 +688,14 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
return;

- d = container_of(hdr, struct rdt_domain, hdr);
- hw_dom = resctrl_to_arch_dom(d);
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ hw_dom = resctrl_to_arch_mon_dom(d);

cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
if (cpumask_empty(&d->hdr.cpu_mask)) {
resctrl_offline_mon_domain(r, d);
list_del(&d->hdr.list);
- domain_free(hw_dom);
+ mon_domain_free(hw_dom);

return;
}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 0b4136c42762..08fc97ce4135 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -58,7 +58,7 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
}

int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
- struct rdt_domain *d)
+ struct rdt_ctrl_domain *d)
{
struct resctrl_staged_config *cfg;
u32 closid = data->rdtgrp->closid;
@@ -137,7 +137,7 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
* resource type.
*/
int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
- struct rdt_domain *d)
+ struct rdt_ctrl_domain *d)
{
struct rdtgroup *rdtgrp = data->rdtgrp;
struct resctrl_staged_config *cfg;
@@ -206,8 +206,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
struct resctrl_staged_config *cfg;
struct rdt_resource *r = s->res;
struct rdt_parse_data data;
+ struct rdt_ctrl_domain *d;
char *dom = NULL, *id;
- struct rdt_domain *d;
unsigned long dom_id;

if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
@@ -267,11 +267,11 @@ static u32 get_config_index(u32 closid, enum resctrl_conf_type type)
}
}

-static bool apply_config(struct rdt_hw_domain *hw_dom,
+static bool apply_config(struct rdt_hw_ctrl_domain *hw_dom,
struct resctrl_staged_config *cfg, u32 idx,
cpumask_var_t cpu_mask)
{
- struct rdt_domain *dom = &hw_dom->d_resctrl;
+ struct rdt_ctrl_domain *dom = &hw_dom->d_resctrl;

if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
cpumask_set_cpu(cpumask_any(&dom->hdr.cpu_mask), cpu_mask);
@@ -283,11 +283,11 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
return false;
}

-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type t, u32 cfg_val)
{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
u32 idx = get_config_index(closid, t);
struct msr_param msr_param;

@@ -307,11 +307,11 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
{
struct resctrl_staged_config *cfg;
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_ctrl_domain *hw_dom;
struct msr_param msr_param;
+ struct rdt_ctrl_domain *d;
enum resctrl_conf_type t;
cpumask_var_t cpu_mask;
- struct rdt_domain *d;
u32 idx;

if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -319,7 +319,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)

msr_param.res = NULL;
list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
- hw_dom = resctrl_to_arch_dom(d);
+ hw_dom = resctrl_to_arch_ctrl_dom(d);
for (t = 0; t < CDP_NUM_TYPES; t++) {
cfg = &hw_dom->d_resctrl.staged_config[t];
if (!cfg->have_new_ctrl)
@@ -449,10 +449,10 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
return ret ?: nbytes;
}

-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type type)
{
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
u32 idx = get_config_index(closid, type);

return hw_dom->ctrl_val[idx];
@@ -461,7 +461,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid)
{
struct rdt_resource *r = schema->res;
- struct rdt_domain *dom;
+ struct rdt_ctrl_domain *dom;
bool sep = false;
u32 ctrl_val;

@@ -523,7 +523,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
}

void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
int evtid, int first)
{
/*
@@ -543,11 +543,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
struct rdt_domain_hdr *hdr;
+ struct rdt_mon_domain *d;
u32 resid, evtid, domid;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
union mon_data_bits md;
- struct rdt_domain *d;
struct rmid_read rr;
int ret = 0;

@@ -568,7 +568,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
ret = -ENOENT;
goto out;
}
- d = container_of(hdr, struct rdt_domain, hdr);
+ d = container_of(hdr, struct rdt_mon_domain, hdr);

mon_event_read(&rr, r, d, rdtgrp, evtid, false);

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ec5ad926c5dc..4e145f5620b0 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -170,7 +170,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
return 0;
}

-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
u32 rmid,
enum resctrl_event_id eventid)
{
@@ -189,10 +189,10 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
return NULL;
}

-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 rmid, enum resctrl_event_id eventid)
{
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
struct arch_mbm_state *am;

am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -208,9 +208,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
* Assumes that hardware counters are also reset and thus that there is
* no need to record initial non-zero counts.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
{
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);

if (is_mbm_total_enabled())
memset(hw_dom->arch_mbm_total, 0,
@@ -229,11 +229,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
return chunks >> shift;
}

-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 rmid, enum resctrl_event_id eventid, u64 *val)
{
+ struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
struct arch_mbm_state *am;
u64 msr_val, chunks;
int ret;
@@ -266,7 +266,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
* decrement the count. If the busy count gets to zero on an RMID, we
* free the RMID
*/
-void __check_limbo(struct rdt_domain *d, bool force_free)
+void __check_limbo(struct rdt_mon_domain *d, bool force_free)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
struct rmid_entry *entry;
@@ -305,7 +305,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
}
}

-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d)
{
return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
}
@@ -334,7 +334,7 @@ int alloc_rmid(void)
static void add_rmid_to_limbo(struct rmid_entry *entry)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;
int cpu, err;
u64 val = 0;

@@ -383,7 +383,7 @@ void free_rmid(u32 rmid)
list_add_tail(&entry->list, &rmid_free_lru);
}

-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
+static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 rmid,
enum resctrl_event_id evtid)
{
switch (evtid) {
@@ -516,13 +516,13 @@ void mon_event_count(void *info)
* throttle MSRs already have low percentage values. To avoid
* unnecessarily restricting such rdtgroups, we also increase the bandwidth.
*/
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
{
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
+ struct rdt_ctrl_domain *dom_mba;
u32 cur_bw, delta_bw, user_bw;
struct rdt_resource *r_mba;
- struct rdt_domain *dom_mba;
struct list_head *head;
struct rdtgroup *entry;

@@ -600,7 +600,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
}
}

-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
+static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d, int rmid)
{
struct rmid_read rr;

@@ -640,13 +640,13 @@ void cqm_handle_limbo(struct work_struct *work)
{
unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
int cpu = smp_processor_id();
+ struct rdt_mon_domain *d;
struct rdt_resource *r;
- struct rdt_domain *d;

mutex_lock(&rdtgroup_mutex);

r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- d = container_of(work, struct rdt_domain, cqm_limbo.work);
+ d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);

__check_limbo(d, false);

@@ -656,7 +656,7 @@ void cqm_handle_limbo(struct work_struct *work)
mutex_unlock(&rdtgroup_mutex);
}

-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;
@@ -672,9 +672,9 @@ void mbm_handle_overflow(struct work_struct *work)
unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
struct rdtgroup *prgrp, *crgrp;
int cpu = smp_processor_id();
+ struct rdt_mon_domain *d;
struct list_head *head;
struct rdt_resource *r;
- struct rdt_domain *d;

mutex_lock(&rdtgroup_mutex);

@@ -682,7 +682,7 @@ void mbm_handle_overflow(struct work_struct *work)
goto out_unlock;

r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- d = container_of(work, struct rdt_domain, mbm_over.work);
+ d = container_of(work, struct rdt_mon_domain, mbm_over.work);

list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
mbm_update(r, d, prgrp->mon.rmid);
@@ -701,7 +701,7 @@ void mbm_handle_overflow(struct work_struct *work)
mutex_unlock(&rdtgroup_mutex);
}

-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index ed6d59af1cef..08d35f828bc3 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -814,7 +814,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
* Return: true if @cbm overlaps with pseudo-locked region on @d, false
* otherwise.
*/
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm)
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm)
{
unsigned int cbm_len;
unsigned long cbm_b;
@@ -841,11 +841,11 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm
* if it is not possible to test due to memory allocation issue,
* false otherwise.
*/
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d)
{
+ struct rdt_ctrl_domain *d_i;
cpumask_var_t cpu_with_psl;
struct rdt_resource *r;
- struct rdt_domain *d_i;
bool ret = false;

if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 760013ed1bff..21bbd832f3f2 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -85,8 +85,8 @@ void rdt_last_cmd_printf(const char *fmt, ...)

void rdt_staged_configs_clear(void)
{
+ struct rdt_ctrl_domain *dom;
struct rdt_resource *r;
- struct rdt_domain *dom;

lockdep_assert_held(&rdtgroup_mutex);

@@ -976,7 +976,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
unsigned long sw_shareable = 0, hw_shareable = 0;
unsigned long exclusive = 0, pseudo_locked = 0;
struct rdt_resource *r = s->res;
- struct rdt_domain *dom;
+ struct rdt_ctrl_domain *dom;
int i, hwb, swb, excl, psl;
enum rdtgrp_mode mode;
bool sep = false;
@@ -1205,7 +1205,7 @@ static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of,
*
* Return: false if CBM does not overlap, true if it does.
*/
-static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
+static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_domain *d,
unsigned long cbm, int closid,
enum resctrl_conf_type type, bool exclusive)
{
@@ -1260,7 +1260,7 @@ static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d
*
* Return: true if CBM overlap detected, false if there is no overlap
*/
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
unsigned long cbm, int closid, bool exclusive)
{
enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -1291,10 +1291,10 @@ bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
{
int closid = rdtgrp->closid;
+ struct rdt_ctrl_domain *d;
struct resctrl_schema *s;
struct rdt_resource *r;
bool has_cache = false;
- struct rdt_domain *d;
u32 ctrl;

list_for_each_entry(s, &resctrl_schema_all, list) {
@@ -1407,7 +1407,7 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of,
* bitmap functions work correctly.
*/
unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
- struct rdt_domain *d, unsigned long cbm)
+ struct rdt_ctrl_domain *d, unsigned long cbm)
{
struct cpu_cacheinfo *ci;
unsigned int size = 0;
@@ -1439,9 +1439,9 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
{
struct resctrl_schema *schema;
enum resctrl_conf_type type;
+ struct rdt_ctrl_domain *d;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
- struct rdt_domain *d;
unsigned int size;
int ret = 0;
u32 closid;
@@ -1553,7 +1553,7 @@ static void mon_event_config_read(void *info)
mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
}

-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
+static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info)
{
smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
}
@@ -1561,7 +1561,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
{
struct mon_config_info mon_info = {0};
- struct rdt_domain *dom;
+ struct rdt_mon_domain *dom;
bool sep = false;

mutex_lock(&rdtgroup_mutex);
@@ -1618,7 +1618,7 @@ static void mon_event_config_write(void *info)
}

static int mbm_config_write_domain(struct rdt_resource *r,
- struct rdt_domain *d, u32 evtid, u32 val)
+ struct rdt_mon_domain *d, u32 evtid, u32 val)
{
struct mon_config_info mon_info = {0};
int ret = 0;
@@ -1668,7 +1668,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
{
char *dom_str = NULL, *id_str;
unsigned long dom_id, val;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;
int ret = 0;

next:
@@ -2216,9 +2216,9 @@ static inline bool is_mba_linear(void)
static int set_cache_qos_cfg(int level, bool enable)
{
void (*update)(void *arg);
+ struct rdt_ctrl_domain *d;
struct rdt_resource *r_l;
cpumask_var_t cpu_mask;
- struct rdt_domain *d;
int cpu;

if (level == RDT_RESOURCE_L3)
@@ -2265,7 +2265,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
l3_qos_cfg_update(&hw_res->cdp_enabled);
}

-static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
+static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
u32 num_closid = resctrl_arch_get_num_closid(r);
int cpu = cpumask_any(&d->hdr.cpu_mask);
@@ -2283,7 +2283,7 @@ static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
}

static void mba_sc_domain_destroy(struct rdt_resource *r,
- struct rdt_domain *d)
+ struct rdt_ctrl_domain *d)
{
kfree(d->mbps_val);
d->mbps_val = NULL;
@@ -2309,7 +2309,7 @@ static int set_mba_sc(bool mba_sc)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
u32 num_closid = resctrl_arch_get_num_closid(r);
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
int i;

if (!supports_mba_mbps() || mba_sc == is_mba_sc(r))
@@ -2578,7 +2578,7 @@ static int rdt_get_tree(struct fs_context *fc)
{
struct rdt_fs_context *ctx = rdt_fc2context(fc);
unsigned long flags = RFTYPE_CTRL_BASE;
- struct rdt_domain *dom;
+ struct rdt_mon_domain *dom;
struct rdt_resource *r;
int ret;

@@ -2762,10 +2762,10 @@ static int rdt_init_fs_context(struct fs_context *fc)
static int reset_all_ctrls(struct rdt_resource *r)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_ctrl_domain *hw_dom;
struct msr_param msr_param;
+ struct rdt_ctrl_domain *d;
cpumask_var_t cpu_mask;
- struct rdt_domain *d;
int i;

if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -2781,7 +2781,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
* from each domain to update the MSRs below.
*/
list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
- hw_dom = resctrl_to_arch_dom(d);
+ hw_dom = resctrl_to_arch_ctrl_dom(d);
cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);

for (i = 0; i < hw_res->num_closid; i++)
@@ -2976,7 +2976,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
}

static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
- struct rdt_domain *d,
+ struct rdt_mon_domain *d,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
union mon_data_bits priv;
@@ -3025,7 +3025,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
* and "monitor" groups with given domain id.
*/
static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- struct rdt_domain *d)
+ struct rdt_mon_domain *d)
{
struct kernfs_node *parent_kn;
struct rdtgroup *prgrp, *crgrp;
@@ -3047,7 +3047,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_resource *r,
struct rdtgroup *prgrp)
{
- struct rdt_domain *dom;
+ struct rdt_mon_domain *dom;
int ret;

list_for_each_entry(dom, &r->mon_domains, hdr.list) {
@@ -3149,7 +3149,7 @@ static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r)
* Set the RDT domain up to start off with all usable allocations. That is,
* all shareable and unused bits. All-zero CBM is invalid.
*/
-static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
+static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schema *s,
u32 closid)
{
enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -3229,7 +3229,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
*/
static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
{
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
int ret;

list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
@@ -3245,7 +3245,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
{
struct resctrl_staged_config *cfg;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;

list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
if (is_mba_sc(r)) {
@@ -3842,14 +3842,14 @@ static void __init rdtgroup_setup_default(void)
mutex_unlock(&rdtgroup_mutex);
}

-static void domain_destroy_mon_state(struct rdt_domain *d)
+static void domain_destroy_mon_state(struct rdt_mon_domain *d)
{
bitmap_free(d->rmid_busy_llc);
kfree(d->mbm_total);
kfree(d->mbm_local);
}

-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
lockdep_assert_held(&rdtgroup_mutex);

@@ -3857,7 +3857,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
mba_sc_domain_destroy(r, d);
}

-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
{
lockdep_assert_held(&rdtgroup_mutex);

@@ -3886,7 +3886,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
domain_destroy_mon_state(d);
}

-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
{
size_t tsize;

@@ -3916,7 +3916,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
lockdep_assert_held(&rdtgroup_mutex);

@@ -3926,7 +3926,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
{
int err;

--
2.41.0

2023-11-30 18:37:52

by Fam Zheng

[permalink] [raw]
Subject: Re: [PATCH v12 7/8] x86/resctrl: Sub NUMA Cluster detection and enable

Hi Tony,

On Wed, Nov 29, 2023 at 04:34:17PM -0800, Tony Luck wrote:
> There isn't a simple hardware bit that indicates whether a CPU is
> running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
> the ratio of NUMA nodes to L3 cache instances.
>
> When SNC mode is detected, reconfigure the RMID counters by updating
> the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.
>
> Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
> on the second SNC node to start from zero.
>
> Signed-off-by: Tony Luck <[email protected]>
> Reviewed-by: Peter Newman <[email protected]>
> Reviewed-by: Reinette Chatre <[email protected]>
> Reviewed-by: Shaopeng Tan <[email protected]>
> Tested-by: Shaopeng Tan <[email protected]>
> ---
> arch/x86/include/asm/msr-index.h | 1 +
> arch/x86/kernel/cpu/resctrl/core.c | 96 ++++++++++++++++++++++++++++++
> 2 files changed, 97 insertions(+)
>
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 1d51e1850ed0..94d29d81e6db 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1111,6 +1111,7 @@
> #define MSR_IA32_QM_CTR 0xc8e
> #define MSR_IA32_PQR_ASSOC 0xc8f
> #define MSR_IA32_L3_CBM_BASE 0xc90
> +#define MSR_RMID_SNC_CONFIG 0xca0
> #define MSR_IA32_L2_CBM_BASE 0xd10
> #define MSR_IA32_MBA_THRTL_BASE 0xd50
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index cf5aba8a74bf..3293ab4c58b0 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -16,11 +16,14 @@
>
> #define pr_fmt(fmt) "resctrl: " fmt
>
> +#include <linux/cpu.h>
> #include <linux/slab.h>
> #include <linux/err.h>
> #include <linux/cacheinfo.h>
> #include <linux/cpuhotplug.h>
> +#include <linux/mod_devicetable.h>
>
> +#include <asm/cpu_device_id.h>
> #include <asm/intel-family.h>
> #include <asm/resctrl.h>
> #include "internal.h"
> @@ -740,11 +743,42 @@ static void clear_closid_rmid(int cpu)
> wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
> }
>
> +/*
> + * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
> + * which indicates that RMIDs are configured in legacy mode.
> + * This mode is incompatible with Linux resctrl semantics
> + * as RMIDs are partitioned between SNC nodes, which requires
> + * a user to know which RMID is allocated to a task.
> + * Clearing bit 0 reconfigures the RMID counters for use
> + * in Sub NUMA Cluster mode. This mode is better for Linux.
> + * The RMID space is divided between all SNC nodes with the
> + * RMIDs renumbered to start from zero in each node when
> + * counting operations from tasks. Code to read the counters
> + * must adjust RMID counter numbers based on SNC node. See
> + * __rmid_read() for code that does this.
> + */
> +static void snc_remap_rmids(int cpu)
> +{
> + u64 val;
> +
> + /* Only need to enable once per package. */
> + if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
> + return;
> +
> + rdmsrl(MSR_RMID_SNC_CONFIG, val);
> + val &= ~BIT_ULL(0);
> + wrmsrl(MSR_RMID_SNC_CONFIG, val);
> +}
> +
> static int resctrl_online_cpu(unsigned int cpu)
> {
> struct rdt_resource *r;
>
> mutex_lock(&rdtgroup_mutex);
> +
> + if (snc_nodes_per_l3_cache > 1)
> + snc_remap_rmids(cpu);
> +
> for_each_capable_rdt_resource(r)
> domain_add_cpu(cpu, r);
> /* The cpu is set in default rdtgroup after online. */
> @@ -999,11 +1033,73 @@ static __init bool get_rdt_resources(void)
> return (rdt_mon_capable || rdt_alloc_capable);
> }
>
> +/* CPU models that support MSR_RMID_SNC_CONFIG */
> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> + X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
> + X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
> + X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
> + X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
> + {}
> +};
> +
> +/*
> + * There isn't a simple hardware bit that indicates whether a CPU is running
> + * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
> + * ratio of NUMA nodes to L3 cache instances.
> + * It is not possible to accurately determine SNC state if the system is
> + * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
> + * to L3 caches. It will be OK if the system is booted with hyperthreading
> + * disabled (since this doesn't affect the ratio).
> + */
> +static __init int snc_get_config(void)
> +{
> + unsigned long *node_caches;
> + int mem_only_nodes = 0;
> + int cpu, node, ret;
> + int num_l3_caches;
> +
> + if (!x86_match_cpu(snc_cpu_ids))
> + return 1;
> +
> + node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
> + if (!node_caches)
> + return 1;
> +
> + cpus_read_lock();
> +
> + if (num_online_cpus() != num_present_cpus())
> + pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
> +
> + for_each_node(node) {
> + cpu = cpumask_first(cpumask_of_node(node));
> + if (cpu < nr_cpu_ids)
> + set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);

Are we sure get_cpu_cacheinfo_id() is a valid index here? Looking at
the function it could be -1 or larger than nr_node_ids.

Fam

> + else
> + mem_only_nodes++;
> + }
> + cpus_read_unlock();
> +
> + num_l3_caches = bitmap_weight(node_caches, nr_node_ids);
> + kfree(node_caches);
> +
> + if (!num_l3_caches)
> + return 1;
> +
> + ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
> +
> + if (ret > 1)
> + rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
> +
> + return ret;
> +}
> +
> static __init void rdt_init_res_defs_intel(void)
> {
> struct rdt_hw_resource *hw_res;
> struct rdt_resource *r;
>
> + snc_nodes_per_l3_cache = snc_get_config();
> +
> for_each_rdt_resource(r) {
> hw_res = resctrl_to_arch_res(r);
>
> --
> 2.41.0

2023-11-30 20:57:41

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v12 7/8] x86/resctrl: Sub NUMA Cluster detection and enable

On Thu, Nov 30, 2023 at 06:02:42PM +0000, Fam Zheng wrote:
> > +static __init int snc_get_config(void)
> > +{
> > + unsigned long *node_caches;
> > + int mem_only_nodes = 0;
> > + int cpu, node, ret;
> > + int num_l3_caches;
> > +
> > + if (!x86_match_cpu(snc_cpu_ids))
> > + return 1;
> > +
> > + node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
> > + if (!node_caches)
> > + return 1;
> > +
> > + cpus_read_lock();
> > +
> > + if (num_online_cpus() != num_present_cpus())
> > + pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
> > +
> > + for_each_node(node) {
> > + cpu = cpumask_first(cpumask_of_node(node));
> > + if (cpu < nr_cpu_ids)
> > + set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
>
> Are we sure get_cpu_cacheinfo_id() is a valid index here? Looking at
> the function it could be -1 or larger than nr_node_ids.

Fam,

Return -1 is possible (in the case where the first CPU on a node doesn't
have an L3 cache). Larger than nr_node_ids seems a bit more speculative.
It would mean a system with multiple L3 cache instances per node. I
suppose that's theoretically possible. In the limit case every CPU may
have its own personal L3 cache, but still have multiple CPUs grouped
together on a node.
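
As a toy illustration of that limit case (made-up counts, nothing
measured on real hardware):

#include <stdio.h>

/*
 * Hypothetical: 2 NUMA nodes, 8 CPUs per node, and a private L3 per
 * CPU. Cache ids then run 0..15 while nr_node_ids is only 2, so a
 * bitmap sized by the node count would overflow in set_bit().
 */
int main(void)
{
        int nr_node_ids = 2, cpus_per_node = 8;
        int max_cache_id = nr_node_ids * cpus_per_node - 1;

        printf("max cache id %d vs bitmap size %d\n",
               max_cache_id, nr_node_ids);
        return 0;
}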

Patch below (to be folded into part7 of next version). Increases the
size of the bitmap. Checks for get_cpu_cacheinfo_id() returning -1.
Patch just ignores the node in this case. I'm never quite sure how much
code to add for "Can't happen" scenarios.

-Tony


diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 3293ab4c58b0..85f8a1b3feaf 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -1056,12 +1056,13 @@ static __init int snc_get_config(void)
unsigned long *node_caches;
int mem_only_nodes = 0;
int cpu, node, ret;
+ int cache_id;
int num_l3_caches;

if (!x86_match_cpu(snc_cpu_ids))
return 1;

- node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
+ node_caches = bitmap_zalloc(num_online_cpus(), GFP_KERNEL);
if (!node_caches)
return 1;

@@ -1072,10 +1073,13 @@ static __init int snc_get_config(void)

for_each_node(node) {
cpu = cpumask_first(cpumask_of_node(node));
- if (cpu < nr_cpu_ids)
- set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
- else
+ if (cpu < nr_cpu_ids) {
+ cache_id = get_cpu_cacheinfo_id(cpu, 3);
+ if (cache_id != -1)
+ set_bit(cache_id, node_caches);
+ } else {
mem_only_nodes++;
+ }
}
cpus_read_unlock();

--
2.41.0

2023-11-30 21:47:32

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v12 7/8] x86/resctrl: Sub NUMA Cluster detection and enable

Hi Tony,

On 11/30/2023 12:57 PM, Tony Luck wrote:
> On Thu, Nov 30, 2023 at 06:02:42PM +0000, Fam Zheng wrote:
>>> +static __init int snc_get_config(void)
>>> +{
>>> + unsigned long *node_caches;
>>> + int mem_only_nodes = 0;
>>> + int cpu, node, ret;
>>> + int num_l3_caches;
>>> +
>>> + if (!x86_match_cpu(snc_cpu_ids))
>>> + return 1;
>>> +
>>> + node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
>>> + if (!node_caches)
>>> + return 1;
>>> +
>>> + cpus_read_lock();
>>> +
>>> + if (num_online_cpus() != num_present_cpus())
>>> + pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
>>> +
>>> + for_each_node(node) {
>>> + cpu = cpumask_first(cpumask_of_node(node));
>>> + if (cpu < nr_cpu_ids)
>>> + set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
>>
>> Are we sure get_cpu_cacheinfo_id() is a valid index here? Looking at
>> the function it could be -1 or larger than nr_node_ids.
>
> Fam,
>
> Return -1 is possible (in the case where first CPU on a node doesn't
> have an L3 cache). Larger than nr_node_ids seems a bit more speculative.
> It would mean a system with multiple L3 cache instances per node. I
> suppose that's theoretically possible. In the limit case every CPU may
> have its own personal L3 cache, but still have multiple CPUs grouped
> together on a node.
>
> Patch below (to be folded into part7 of next version). Increases the
> size of the bitmap. Checks for get_cpu_cacheinfo_id() returning -1.
> Patch just ignores the node in this case. I'm never quite sure how much
> code to add for "Can't happen" scenarios.
>

Thank you.

> -Tony
>
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 3293ab4c58b0..85f8a1b3feaf 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -1056,12 +1056,13 @@ static __init int snc_get_config(void)
> unsigned long *node_caches;
> int mem_only_nodes = 0;
> int cpu, node, ret;
> + int cache_id;
> int num_l3_caches;

Please do maintain reverse fir order.

>
> if (!x86_match_cpu(snc_cpu_ids))
> return 1;

I understand and welcome this change as motivated by robustness. Apart
from that, with this being a model specific feature for this particular
group of systems, it is not clear to me in which scenarios this could
run on a system where a present CPU does not have access to L3 cache.

>
> - node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
> + node_caches = bitmap_zalloc(num_online_cpus(), GFP_KERNEL);

Please do take care to take new bitmap size into account in all
places. From what I can tell there is a later bitmap_weight() call that
still uses nr_node_ids as size.

> if (!node_caches)
> return 1;
>
> @@ -1072,10 +1073,13 @@ static __init int snc_get_config(void)
>
> for_each_node(node) {
> cpu = cpumask_first(cpumask_of_node(node));
> - if (cpu < nr_cpu_ids)
> - set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> - else
> + if (cpu < nr_cpu_ids) {
> + cache_id = get_cpu_cacheinfo_id(cpu, 3);
> + if (cache_id != -1)
> + set_bit(cache_id, node_caches);
> + } else {
> mem_only_nodes++;
> + }
> }
> cpus_read_unlock();
>

Could this code be made even more robust by checking the computed
snc_nodes_per_l3_cache against the limited actually possible values?
Forcing it to 1 if something went wrong?

Reinette

2023-11-30 21:52:30

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v12 3/8] x86/resctrl: Prepare for different scope for control/monitor operations

Hi Tony,

On 11/29/2023 4:34 PM, Tony Luck wrote:
> Resctrl assumes that control and monitor operations on a resource are
> performed at the same scope.
>
> Prepare for systems that use different scope (specifically Intel needs
> to split the RDT_RESOURCE_L3 resource to use L3 scope for cache control
> and NODE scope for cache occupancy and memory bandwidth monitoring).
>
> Create separate domain lists for control and monitor operations.
>
> Note that errors during initialization of either control or monitor
> functions on a domain would previously result in that domain being
> excluded from both control and monitor operations. Now the domains are
> allocated independently it is no longer required to disable both control
> and monitor operations if either fail.
>
> Signed-off-by: Tony Luck <[email protected]>
> Reviewed-by: Shaopeng Tan <[email protected]>
> Tested-by: Shaopeng Tan <[email protected]>
> ---

According to Documentation/process/maintainer-tip.rst the
"Tested-by" tag should go before the "Reviewed-by" tag. Please
check the whole series.

Reviewed-by: Reinette Chatre <[email protected]>

Reinette


2023-11-30 22:43:34

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v12 7/8] x86/resctrl: Sub NUMA Cluster detection and enable

On Thu, Nov 30, 2023 at 01:47:10PM -0800, Reinette Chatre wrote:
> Hi Tony,
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index 3293ab4c58b0..85f8a1b3feaf 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -1056,12 +1056,13 @@ static __init int snc_get_config(void)
> > unsigned long *node_caches;
> > int mem_only_nodes = 0;
> > int cpu, node, ret;
> > + int cache_id;
> > int num_l3_caches;
>
> Please do maintain reverse fir order.

Fixed.

>
> >
> > if (!x86_match_cpu(snc_cpu_ids))
> > return 1;
>
> I understand and welcome this change as motivated by robustness. Apart
> from that, with this being a model specific feature for this particular
> group of systems, it is not clear to me in which scenarios this could
> run on a system where a present CPU does not have access to L3 cache.

Agreed that on these systems there should always be an L3 cache. Should
I drop the check for "-1"?

> >
> > - node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
> > + node_caches = bitmap_zalloc(num_online_cpus(), GFP_KERNEL);
>
> Please do take care to take new bitmap size into account in all
> places. From what I can tell there is a later bitmap_weight() call that
> still uses nr_node_ids as size.

Oops. I was also using num_online_cpus() before cpus_read_lock(), so
things could theoretically change before the bitmap_weight() call.
I switched to using num_present_cpus() in both places.

> > if (!node_caches)
> > return 1;
> >
> > @@ -1072,10 +1073,13 @@ static __init int snc_get_config(void)
> >
> > for_each_node(node) {
> > cpu = cpumask_first(cpumask_of_node(node));
> > - if (cpu < nr_cpu_ids)
> > - set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> > - else
> > + if (cpu < nr_cpu_ids) {
> > + cache_id = get_cpu_cacheinfo_id(cpu, 3);
> > + if (cache_id != -1)
> > + set_bit(cache_id, node_caches);
> > + } else {
> > mem_only_nodes++;
> > + }
> > }
> > cpus_read_unlock();
> >
>
> Could this code be made even more robust by checking the computed
> snc_nodes_per_l3_cache against the limited actually possible values?
> Forcing it to 1 if something went wrong?

Added a couple of extra sanity checks. See updated incremental patch
below.
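
For reference, a toy condensation of the ratio computation plus the
new sanity checks (made-up counts; the real code is in the patch
below):

#include <stdio.h>

/* Hypothetical box: 2 sockets in SNC2 mode -> 4 CPU nodes, 2 L3 caches. */
static int snc_ratio_toy(int cpu_nodes, int l3_caches)
{
        int ret;

        if (!l3_caches || cpu_nodes % l3_caches)
                return 1;               /* inconsistent: assume no SNC */

        ret = cpu_nodes / l3_caches;
        return (ret == 1 || ret == 2 || ret == 4) ? ret : 1;
}

int main(void)
{
        printf("%d\n", snc_ratio_toy(4, 2));    /* prints 2 */
        return 0;
}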

-Tony


diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 3293ab4c58b0..3684c6bf8224 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -1057,11 +1057,12 @@ static __init int snc_get_config(void)
int mem_only_nodes = 0;
int cpu, node, ret;
int num_l3_caches;
+ int cache_id;

if (!x86_match_cpu(snc_cpu_ids))
return 1;

- node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
+ node_caches = bitmap_zalloc(num_present_cpus(), GFP_KERNEL);
if (!node_caches)
return 1;

@@ -1072,23 +1073,39 @@ static __init int snc_get_config(void)

for_each_node(node) {
cpu = cpumask_first(cpumask_of_node(node));
- if (cpu < nr_cpu_ids)
- set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
- else
+ if (cpu < nr_cpu_ids) {
+ cache_id = get_cpu_cacheinfo_id(cpu, 3);
+ if (cache_id != -1)
+ set_bit(cache_id, node_caches);
+ } else {
mem_only_nodes++;
+ }
}
cpus_read_unlock();

- num_l3_caches = bitmap_weight(node_caches, nr_node_ids);
+ num_l3_caches = bitmap_weight(node_caches, num_present_cpus());
kfree(node_caches);

if (!num_l3_caches)
return 1;

+ /* sanity check #1: Number of CPU nodes must be multiple of num_l3_caches */
+ if ((nr_node_ids - mem_only_nodes) % num_l3_caches)
+ return 1;
+
ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;

- if (ret > 1)
+ /* sanity check #2: Only valid results are 1, 2, 4 */
+ switch (ret) {
+ case 1:
+ break;
+ case 2:
+ case 4:
rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
+ break;
+ default:
+ return 1;
+ }

return ret;
}
--
2.41.0

2023-11-30 23:41:55

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v12 7/8] x86/resctrl: Sub NUMA Cluster detection and enable

Hi Tony,

On 11/30/2023 2:43 PM, Tony Luck wrote:
> On Thu, Nov 30, 2023 at 01:47:10PM -0800, Reinette Chatre wrote:

...

>>> if (!x86_match_cpu(snc_cpu_ids))
>>> return 1;
>>
>> I understand and welcome this change as motivated by robustness. Apart
>> from that, with this being a model specific feature for this particular
>> group of systems, it is not clear to me in which scenarios this could
>> run on a system where a present CPU does not have access to L3 cache.
>
> Agreed that on these systems there should always be an L3 cache. Should
> I drop the check for "-1"?

Please do keep it. I welcome the additional robustness. The static checker I
tried did not complain about this but I expect that it is something that
could trigger checks.

>
>>>
>>> - node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
>>> + node_caches = bitmap_zalloc(num_online_cpus(), GFP_KERNEL);
>>
>> Please do take care to take new bitmap size into account in all
>> places. From what I can tell there is a later bitmap_weight() call that
>> still uses nr_node_ids as size.
>
> Oops. I was also using num_online_cpus() before cpus_read_lock(), so
> things could theoretically change before the bitmap_weight() call.
> I switched to using num_present_cpus() in both places.

Thanks for catching this. I am not sure if num_present_cpus() is the right
choice. I found its comment to say "If HOTPLUG is enabled, then cpu_present_mask
varies dynamically ...". num_possible_cpus() seems more appropriate when looking
for something that does not change while not holding the hotplug lock. Reading its
description more closely also makes me wonder if the later
num_online_cpus() != num_present_cpus()
should also maybe be
num_online_cpus() != num_possible_cpus() ?
It seems to more closely match the intention.

>>> if (!node_caches)
>>> return 1;
>>>
>>> @@ -1072,10 +1073,13 @@ static __init int snc_get_config(void)
>>>
>>> for_each_node(node) {
>>> cpu = cpumask_first(cpumask_of_node(node));
>>> - if (cpu < nr_cpu_ids)
>>> - set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
>>> - else
>>> + if (cpu < nr_cpu_ids) {
>>> + cache_id = get_cpu_cacheinfo_id(cpu, 3);
>>> + if (cache_id != -1)
>>> + set_bit(cache_id, node_caches);
>>> + } else {
>>> mem_only_nodes++;
>>> + }
>>> }
>>> cpus_read_unlock();
>>>
>>
>> Could this code be made even more robust by checking the computed
>> snc_nodes_per_l3_cache against the limited actually possible values?
>> Forcing it to 1 if something went wrong?
>
> Added a couple of extra sanity checks. See updated incremental patch
> below.

Thank you very much. The additional checks look good to me.

Reinette

2023-12-01 00:37:32

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v12 7/8] x86/resctrl: Sub NUMA Cluster detection and enable

On Thu, Nov 30, 2023 at 03:40:52PM -0800, Reinette Chatre wrote:
> Hi Tony,
>
> On 11/30/2023 2:43 PM, Tony Luck wrote:
> > On Thu, Nov 30, 2023 at 01:47:10PM -0800, Reinette Chatre wrote:
>
> ...
>
> >>> if (!x86_match_cpu(snc_cpu_ids))
> >>> return 1;
> >>
> >> I understand and welcome this change as motivated by robustness. Apart
> >> from that, with this being a model specific feature for this particular
> >> group of systems, it is not clear to me in which scenarios this could
> >> run on a system where a present CPU does not have access to L3 cache.
> >
> > Agreed that on these systems there should always be an L3 cache. Should
> > I drop the check for "-1"?
>
> Please do keep it. I welcome the additional robustness. The static checker I
> tried did not complain about this but I expect that it is something that
> could trigger checks.
>
> >
> >>>
> >>> - node_caches = bitmap_zalloc(nr_node_ids, GFP_KERNEL);
> >>> + node_caches = bitmap_zalloc(num_online_cpus(), GFP_KERNEL);
> >>
> >> Please do take care to take new bitmap size into account in all
> >> places. From what I can tell there is a later bitmap_weight() call that
> >> still uses nr_node_ids as size.
> >
> > Oops. I was also using num_online_cpus() before cpus_read_lock(), so
> > things could theoretically change before the bitmap_weight() call.
> > I switched to using num_present_cpus() in both places.
>
> Thanks for catching this. I am not sure if num_present_cpus() is the right
> choice. I found its comment to say "If HOTPLUG is enabled, then cpu_present_mask
> varies dynamically ...". num_possible_cpus() seems more appropriate when looking

I can size the bitmask based on num_possible_cpus().

> for something that does not change while not holding the hotplug lock. Reading its
> description more closely also makes me wonder if the later
> num_online_cpus() != num_present_cpus()
> should also maybe be
> num_online_cpus() != num_possible_cpus() ?
> It seems to more closely match the intention.

This seems problematic. On a system that does support physical CPU
hotplug, num_possible_cpus() may be some very large number, reserving
space for CPUs that can be added later. None of those CPUs can be online
(obviously!), so this test would fail on such a system.
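
A toy illustration with made-up counts (not from any real system):

#include <stdio.h>

/*
 * Hypothetical hot-pluggable server: slots are reserved at boot for
 * sockets that may be added later, so the possible count exceeds the
 * present count even when every installed CPU is online.
 */
int main(void)
{
        int possible = 512, present = 128, online = 128;

        printf("online != present:  %d\n", online != present);  /* 0 */
        printf("online != possible: %d\n", online != possible); /* 1 */
        return 0;
}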

> >>> if (!node_caches)
> >>> return 1;
> >>>
> >>> @@ -1072,10 +1073,13 @@ static __init int snc_get_config(void)
> >>>
> >>> for_each_node(node) {
> >>> cpu = cpumask_first(cpumask_of_node(node));
> >>> - if (cpu < nr_cpu_ids)
> >>> - set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> >>> - else
> >>> + if (cpu < nr_cpu_ids) {
> >>> + cache_id = get_cpu_cacheinfo_id(cpu, 3);
> >>> + if (cache_id != -1)
> >>> + set_bit(cache_id, node_caches);
> >>> + } else {
> >>> mem_only_nodes++;
> >>> + }
> >>> }
> >>> cpus_read_unlock();
> >>>
> >>
> >> Could this code be made even more robust by checking the computed
> >> snc_nodes_per_l3_cache against the limited actually possible values?
> >> Forcing it to 1 if something went wrong?
> >
> > Added a couple of extra sanity checks. See updated incremental patch
> > below.
>
> Thank you very much. The additional checks look good to me.
>
> Reinette

Thanks for looking at this. I'm applying changes to my local tree. I'll
give folks a little more time to find additional issues in v12 and post
v13 next week.

-Tony

2023-12-01 02:09:02

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v12 7/8] x86/resctrl: Sub NUMA Cluster detection and enable

Hi Tony,

On 11/30/2023 4:37 PM, Tony Luck wrote:
> On Thu, Nov 30, 2023 at 03:40:52PM -0800, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 11/30/2023 2:43 PM, Tony Luck wrote:
>>> On Thu, Nov 30, 2023 at 01:47:10PM -0800, Reinette Chatre wrote:

>> for something that does not change while not holding the hotplug lock. Reading its
>> description more closely also makes me wonder if the later
>> num_online_cpus() != num_present_cpus()
>> should also maybe be
>> num_online_cpus() != num_possible_cpus() ?
>> It seems to more closely match the intention.
>
> This seems problematic. On a system that does support physical CPU
> hotplug num_possible_cpus() may be some very large number. Reserving
> space for CPUs that can be added later. None of those CPUs can be online
> (obviously!). So this test would fail on such a system.

Thank you for clarifying. It was not obvious to me how these bitmaps are
different with physical hotplug.

Reinette

2023-12-04 16:56:21

by Babu Moger

[permalink] [raw]
Subject: Re: [PATCH v12 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes



On 11/29/23 18:34, Tony Luck wrote:
> With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
> per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
> their name refer to Sub-NUMA nodes instead of L3 cache ids.
>
> Users should be aware that SNC mode also affects the amount of L3 cache
> available for allocation within each SNC node.
>
> Signed-off-by: Tony Luck <[email protected]>
> Reviewed-by: Peter Newman <[email protected]>
> Reviewed-by: Reinette Chatre <[email protected]>
> Reviewed-by: Shaopeng Tan <[email protected]>
> Tested-by: Shaopeng Tan <[email protected]>
> ---
> Documentation/arch/x86/resctrl.rst | 25 +++++++++++++++++++++----
> 1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index a6279df64a9d..49ff789db1d8 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -366,10 +366,10 @@ When control is enabled all CTRL_MON groups will also contain:
> When monitoring is enabled all MON groups will also contain:
>
> "mon_data":
> - This contains a set of files organized by L3 domain and by
> - RDT event. E.g. on a system with two L3 domains there will
> - be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
> - directories have one file per event (e.g. "llc_occupancy",
> + This contains a set of files organized by L3 domain or by NUMA
> + node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
> + or enabled respectively) and by RDT event. Each of these
> + directories has one file per event (e.g. "llc_occupancy",
> "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
> files provide a read out of the current value of the event for
> all tasks in the group. In CTRL_MON groups these files provide
> @@ -478,6 +478,23 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
> each bit represents 5% of the capacity of the cache. You could partition
> the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
>
> +Notes on Sub-NUMA Cluster mode
> +==============================
> +When SNC mode is enabled Linux may load balance tasks between Sub-NUMA

enabled, Linux

Thanks
Babu

> +nodes much more readily than between regular NUMA nodes since the CPUs
> +on Sub-NUMA nodes share the same L3 cache and the system may report
> +the NUMA distance between Sub-NUMA nodes with a lower value than used
> +for regular NUMA nodes. Users who do not bind tasks to the CPUs of a
> +specific Sub-NUMA node must read the "llc_occupancy", "mbm_total_bytes",
> +and "mbm_local_bytes" for all Sub-NUMA nodes where the tasks may execute
> +to get the full view of traffic for which the tasks were the source.
> +
> +The cache allocation feature still provides the same number of
> +bits in a mask to control allocation into the L3 cache, but each
> +of those ways has its capacity reduced because the cache is divided
> +between the SNC nodes. The values reported in the resctrl
> +"size" files are adjusted accordingly.
> +
> Memory bandwidth Allocation and monitoring
> ==========================================
>

--
Thanks
Babu Moger

2023-12-04 17:51:32

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v12 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes

>> +Notes on Sub-NUMA Cluster mode
>> +==============================
>> +When SNC mode is enabled Linux may load balance tasks between Sub-NUMA
>
> enabled, Linux

Babu,

Thanks. I've added that "," ready for next version.

-Tony

2023-12-04 18:54:19

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems

The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features. Prior to this
patch series, Intel advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID counters in
the same way. This allows monitoring features to be used, with the caveat
that users must be aware that Linux may migrate tasks more frequently
between SNC nodes than between "regular" NUMA nodes, so reading counters
from all SNC nodes may be needed to get a complete picture of activity
for tasks.

Cache and memory bandwidth allocation features continue to operate at
the scope of the L3 cache.

Signed-off-by: Tony Luck <[email protected]>

Changes since v12:

All:
Reinette - put commit tags in right order for TIP (Tested-by before
Reviewed-by)

Patch 7:
Fam Zheng - Check for -1 return from get_cpu_cacheinfo_id() and
increase size of bitmap tracking # of L3 instances.
Reinette - Add extra sanity checks. Note that this patch has
some additional tweaks beyond the e-mail discussion.
1) "3" is a valid return in addition to 1, 2, 4
2) Added a warning if the sanity checks fail that
prints number of CPU nodes and number of L3 cache
instances that were found.

Patch 8:
Babu - Fix grammar with an additional comma.


Tony Luck (8):
x86/resctrl: Prepare for new domain scope
x86/resctrl: Prepare to split rdt_domain structure
x86/resctrl: Prepare for different scope for control/monitor
operations
x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
x86/resctrl: Add node-scope to the options for feature scope
x86/resctrl: Introduce snc_nodes_per_l3_cache
x86/resctrl: Sub NUMA Cluster detection and enable
x86/resctrl: Update documentation with Sub-NUMA cluster changes

Documentation/arch/x86/resctrl.rst | 25 +-
include/linux/resctrl.h | 85 +++--
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 66 ++--
arch/x86/kernel/cpu/resctrl/core.c | 433 +++++++++++++++++-----
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 58 +--
arch/x86/kernel/cpu/resctrl/monitor.c | 68 ++--
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 26 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 149 ++++----
9 files changed, 629 insertions(+), 282 deletions(-)


base-commit: 2cc14f52aeb78ce3f29677c2de1f06c0e91471ab
--
2.41.0

2023-12-04 18:54:24

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v13 5/8] x86/resctrl: Add node-scope to the options for feature scope

Currently supported resctrl features are all domain scoped, with the
domain matching the scope of the L2 or L3 caches.

Add RESCTRL_NODE as a new option for features that are scoped at the
same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
Cluster (SNC) feature where monitoring features are node scoped.
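
A toy model of the scope-to-domain-id mapping (hypothetical topology:
2 sockets, SNC2, 32 CPUs per socket; not the kernel implementation):

#include <stdio.h>

enum toy_scope { TOY_SCOPE_L3 = 3, TOY_SCOPE_NODE };

/* One L3 per socket, two SNC nodes per socket. */
static int toy_domain_id(int cpu, enum toy_scope s)
{
        return s == TOY_SCOPE_L3 ? cpu / 32 : cpu / 16;
}

int main(void)
{
        /* CPU 17 sits on socket 0 but on SNC node 1. */
        printf("L3 domain %d, node domain %d\n",
               toy_domain_id(17, TOY_SCOPE_L3),
               toy_domain_id(17, TOY_SCOPE_NODE));
        return 0;
}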

Signed-off-by: Tony Luck <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
---
include/linux/resctrl.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 2 ++
2 files changed, 3 insertions(+)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 058a940c3239..b8a3a11b970d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -170,6 +170,7 @@ struct resctrl_schema;
enum resctrl_scope {
RESCTRL_L2_CACHE = 2,
RESCTRL_L3_CACHE = 3,
+ RESCTRL_NODE,
};

/**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 797cb3bf417a..c9315ce8f7bd 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -502,6 +502,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
case RESCTRL_L2_CACHE:
case RESCTRL_L3_CACHE:
return get_cpu_cacheinfo_id(cpu, scope);
+ case RESCTRL_NODE:
+ return cpu_to_node(cpu);
default:
break;
}
--
2.41.0

2023-12-04 18:54:31

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v13 2/8] x86/resctrl: Prepare to split rdt_domain structure

The rdt_domain structure is used for both control and monitor features.
It is about to be split into separate structures for these two usages
because the scope for control and monitoring features for a resource
will be different for future resources.

To allow for common code that scans a list of domains looking for a
specific domain id, move all the common fields ("list", "id", "cpu_mask")
into their own structure within the rdt_domain structure.
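
A toy sketch of the embedded-header pattern (simplified types, not the
kernel definitions):

#include <stdio.h>

struct toy_domain_hdr { int id; };

struct toy_ctrl_domain { struct toy_domain_hdr hdr; int num_closid; };
struct toy_mon_domain  { struct toy_domain_hdr hdr; int num_rmid; };

/* Generic scan code needs only the header, whatever the domain type. */
static int toy_domain_id(struct toy_domain_hdr *hdr)
{
        return hdr->id;
}

int main(void)
{
        struct toy_ctrl_domain c = { .hdr = { .id = 0 }, .num_closid = 16 };
        struct toy_mon_domain  m = { .hdr = { .id = 1 }, .num_rmid = 256 };

        printf("ctrl %d mon %d\n",
               toy_domain_id(&c.hdr), toy_domain_id(&m.hdr));
        return 0;
}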

Signed-off-by: Tony Luck <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
---
include/linux/resctrl.h | 16 ++++--
arch/x86/kernel/cpu/resctrl/core.c | 26 +++++-----
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 22 ++++-----
arch/x86/kernel/cpu/resctrl/monitor.c | 10 ++--
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 14 +++---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 60 +++++++++++------------
6 files changed, 78 insertions(+), 70 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 7d4eb7df611d..c4067150a6b7 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -53,10 +53,20 @@ struct resctrl_staged_config {
};

/**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain_hdr - common header for different domain types
* @list: all instances of this resource
* @id: unique id for this instance
* @cpu_mask: which CPUs share this resource
+ */
+struct rdt_domain_hdr {
+ struct list_head list;
+ int id;
+ struct cpumask cpu_mask;
+};
+
+/**
+ * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * @hdr: common header for different domain types
* @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
* @mbm_total: saved state for MBM total bandwidth
* @mbm_local: saved state for MBM local bandwidth
@@ -71,9 +81,7 @@ struct resctrl_staged_config {
* by closid
*/
struct rdt_domain {
- struct list_head list;
- int id;
- struct cpumask cpu_mask;
+ struct rdt_domain_hdr hdr;
unsigned long *rmid_busy_llc;
struct mbm_state *mbm_total;
struct mbm_state *mbm_local;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index fd113bc29d4e..62a989fd950d 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -356,9 +356,9 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
{
struct rdt_domain *d;

- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
/* Find the domain that contains this CPU */
- if (cpumask_test_cpu(cpu, &d->cpu_mask))
+ if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
return d;
}

@@ -402,12 +402,12 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
struct list_head *l;

list_for_each(l, &r->domains) {
- d = list_entry(l, struct rdt_domain, list);
+ d = list_entry(l, struct rdt_domain, hdr.list);
/* When id is found, return its domain. */
- if (id == d->id)
+ if (id == d->hdr.id)
return d;
/* Stop searching when finding id's position in sorted list. */
- if (id < d->id)
+ if (id < d->hdr.id)
break;
}

@@ -530,7 +530,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)

d = rdt_find_domain(r, id, &add_pos);
if (d) {
- cpumask_set_cpu(cpu, &d->cpu_mask);
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
rdt_domain_reconfigure_cdp(r);
return;
@@ -541,8 +541,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;

d = &hw_dom->d_resctrl;
- d->id = id;
- cpumask_set_cpu(cpu, &d->cpu_mask);
+ d->hdr.id = id;
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);

rdt_domain_reconfigure_cdp(r);

@@ -556,11 +556,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;
}

- list_add_tail(&d->list, add_pos);
+ list_add_tail(&d->hdr.list, add_pos);

err = resctrl_online_domain(r, d);
if (err) {
- list_del(&d->list);
+ list_del(&d->hdr.list);
domain_free(hw_dom);
}
}
@@ -584,10 +584,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
}
hw_dom = resctrl_to_arch_dom(d);

- cpumask_clear_cpu(cpu, &d->cpu_mask);
- if (cpumask_empty(&d->cpu_mask)) {
+ cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+ if (cpumask_empty(&d->hdr.cpu_mask)) {
resctrl_offline_domain(r, d);
- list_del(&d->list);
+ list_del(&d->hdr.list);

/*
* rdt_domain "d" is going to be freed below, so clear
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 3f8891d57fac..23f8258d36a8 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -67,7 +67,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,

cfg = &d->staged_config[s->conf_type];
if (cfg->have_new_ctrl) {
- rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+ rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
return -EINVAL;
}

@@ -146,7 +146,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,

cfg = &d->staged_config[s->conf_type];
if (cfg->have_new_ctrl) {
- rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+ rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
return -EINVAL;
}

@@ -226,8 +226,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
return -EINVAL;
}
dom = strim(dom);
- list_for_each_entry(d, &r->domains, list) {
- if (d->id == dom_id) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
+ if (d->hdr.id == dom_id) {
data.buf = dom;
data.rdtgrp = rdtgrp;
if (r->parse_ctrlval(&data, s, d))
@@ -274,7 +274,7 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
struct rdt_domain *dom = &hw_dom->d_resctrl;

if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
- cpumask_set_cpu(cpumask_any(&dom->cpu_mask), cpu_mask);
+ cpumask_set_cpu(cpumask_any(&dom->hdr.cpu_mask), cpu_mask);
hw_dom->ctrl_val[idx] = cfg->new_ctrl;

return true;
@@ -291,7 +291,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
u32 idx = get_config_index(closid, t);
struct msr_param msr_param;

- if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+ if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
return -EINVAL;

hw_dom->ctrl_val[idx] = cfg_val;
@@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
return -ENOMEM;

msr_param.res = NULL;
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
hw_dom = resctrl_to_arch_dom(d);
for (t = 0; t < CDP_NUM_TYPES; t++) {
cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
u32 ctrl_val;

seq_printf(s, "%*s:", max_name_width, schema->name);
- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->domains, hdr.list) {
if (sep)
seq_puts(s, ";");

@@ -476,7 +476,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
ctrl_val = resctrl_arch_get_config(r, dom, closid,
schema->conf_type);

- seq_printf(s, r->format_str, dom->id, max_data_width,
+ seq_printf(s, r->format_str, dom->hdr.id, max_data_width,
ctrl_val);
sep = true;
}
@@ -505,7 +505,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
} else {
seq_printf(s, "%s:%d=%x\n",
rdtgrp->plr->s->res->name,
- rdtgrp->plr->d->id,
+ rdtgrp->plr->d->hdr.id,
rdtgrp->plr->cbm);
}
} else {
@@ -536,7 +536,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
rr->val = 0;
rr->first = first;

- smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+ smp_call_function_any(&d->hdr.cpu_mask, mon_event_count, rr, 1);
}

int rdtgroup_mondata_show(struct seq_file *m, void *arg)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..dd0ea1bc0092 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -238,7 +238,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
u64 msr_val, chunks;
int ret;

- if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+ if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
return -EINVAL;

ret = __rmid_read(rmid, eventid, &msr_val);
@@ -340,8 +340,8 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)

entry->busy = 0;
cpu = get_cpu();
- list_for_each_entry(d, &r->domains, list) {
- if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
+ if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) {
err = resctrl_arch_rmid_read(r, d, entry->rmid,
QOS_L3_OCCUP_EVENT_ID,
&val);
@@ -661,7 +661,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;

- cpu = cpumask_any(&dom->cpu_mask);
+ cpu = cpumask_any(&dom->hdr.cpu_mask);
dom->cqm_work_cpu = cpu;

schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
@@ -708,7 +708,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)

if (!static_branch_likely(&rdt_mon_enable_key))
return;
- cpu = cpumask_any(&dom->cpu_mask);
+ cpu = cpumask_any(&dom->hdr.cpu_mask);
dom->mbm_work_cpu = cpu;
schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
}
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 2a682da9f43a..fcbd99e2eb66 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -221,7 +221,7 @@ static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr)
int cpu;
int ret;

- for_each_cpu(cpu, &plr->d->cpu_mask) {
+ for_each_cpu(cpu, &plr->d->hdr.cpu_mask) {
pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL);
if (!pm_req) {
rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n");
@@ -301,7 +301,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
return -ENODEV;

/* Pick the first cpu we find that is associated with the cache. */
- plr->cpu = cpumask_first(&plr->d->cpu_mask);
+ plr->cpu = cpumask_first(&plr->d->hdr.cpu_mask);

if (!cpu_online(plr->cpu)) {
rdt_last_cmd_printf("CPU %u associated with cache not online\n",
@@ -856,10 +856,10 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
* associated with them.
*/
for_each_alloc_capable_rdt_resource(r) {
- list_for_each_entry(d_i, &r->domains, list) {
+ list_for_each_entry(d_i, &r->domains, hdr.list) {
if (d_i->plr)
cpumask_or(cpu_with_psl, cpu_with_psl,
- &d_i->cpu_mask);
+ &d_i->hdr.cpu_mask);
}
}

@@ -867,7 +867,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
* Next test if new pseudo-locked region would intersect with
* existing region.
*/
- if (cpumask_intersects(&d->cpu_mask, cpu_with_psl))
+ if (cpumask_intersects(&d->hdr.cpu_mask, cpu_with_psl))
ret = true;

free_cpumask_var(cpu_with_psl);
@@ -1199,7 +1199,7 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
}

plr->thread_done = 0;
- cpu = cpumask_first(&plr->d->cpu_mask);
+ cpu = cpumask_first(&plr->d->hdr.cpu_mask);
if (!cpu_online(cpu)) {
ret = -ENODEV;
goto out;
@@ -1529,7 +1529,7 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
* may be scheduled elsewhere and invalidate entries in the
* pseudo-locked region.
*/
- if (!cpumask_subset(current->cpus_ptr, &plr->d->cpu_mask)) {
+ if (!cpumask_subset(current->cpus_ptr, &plr->d->hdr.cpu_mask)) {
mutex_unlock(&rdtgroup_mutex);
return -EINVAL;
}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index c44be64d65ec..04d32602ac33 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
lockdep_assert_held(&rdtgroup_mutex);

for_each_alloc_capable_rdt_resource(r) {
- list_for_each_entry(dom, &r->domains, list)
+ list_for_each_entry(dom, &r->domains, hdr.list)
memset(dom->staged_config, 0, sizeof(dom->staged_config));
}
}
@@ -295,7 +295,7 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
rdt_last_cmd_puts("Cache domain offline\n");
ret = -ENODEV;
} else {
- mask = &rdtgrp->plr->d->cpu_mask;
+ mask = &rdtgrp->plr->d->hdr.cpu_mask;
seq_printf(s, is_cpu_list(of) ?
"%*pbl\n" : "%*pb\n",
cpumask_pr_args(mask));
@@ -984,12 +984,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,

mutex_lock(&rdtgroup_mutex);
hw_shareable = r->cache.shareable_bits;
- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->domains, hdr.list) {
if (sep)
seq_putc(seq, ';');
sw_shareable = 0;
exclusive = 0;
- seq_printf(seq, "%d=", dom->id);
+ seq_printf(seq, "%d=", dom->hdr.id);
for (i = 0; i < closids_supported(); i++) {
if (!closid_allocated(i))
continue;
@@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
continue;
has_cache = true;
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
ctrl = resctrl_arch_get_config(r, d, closid,
s->conf_type);
if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1417,7 +1417,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
return size;

num_b = bitmap_weight(&cbm, r->cache.cbm_len);
- ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
+ ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
if (ci->info_list[i].level == r->scope) {
size = ci->info_list[i].size / r->cache.cbm_len * num_b;
@@ -1465,7 +1465,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res,
rdtgrp->plr->d,
rdtgrp->plr->cbm);
- seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size);
+ seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size);
}
goto out;
}
@@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
type = schema->conf_type;
sep = false;
seq_printf(s, "%*s:", max_name_width, schema->name);
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
if (sep)
seq_putc(s, ';');
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1495,7 +1495,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
else
size = rdtgroup_cbm_to_size(r, d, ctrl);
}
- seq_printf(s, "%d=%u", d->id, size);
+ seq_printf(s, "%d=%u", d->hdr.id, size);
sep = true;
}
seq_putc(s, '\n');
@@ -1555,7 +1555,7 @@ static void mon_event_config_read(void *info)

static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
{
- smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
+ smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
}

static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
@@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid

mutex_lock(&rdtgroup_mutex);

- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->domains, hdr.list) {
if (sep)
seq_puts(s, ";");

@@ -1574,7 +1574,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
mon_info.evtid = evtid;
mondata_config_read(dom, &mon_info);

- seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
+ seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
sep = true;
}
seq_puts(s, "\n");
@@ -1646,7 +1646,7 @@ static int mbm_config_write_domain(struct rdt_resource *r,
* are scoped at the domain level. Writing any of these MSRs
* on one CPU is observed by all the CPUs in the domain.
*/
- smp_call_function_any(&d->cpu_mask, mon_event_config_write,
+ smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_write,
&mon_info, 1);

/*
@@ -1689,8 +1689,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
return -EINVAL;
}

- list_for_each_entry(d, &r->domains, list) {
- if (d->id == dom_id) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
+ if (d->hdr.id == dom_id) {
ret = mbm_config_write_domain(r, d, evtid, val);
if (ret)
return -EINVAL;
@@ -2232,14 +2232,14 @@ static int set_cache_qos_cfg(int level, bool enable)
return -ENOMEM;

r_l = &rdt_resources_all[level].r_resctrl;
- list_for_each_entry(d, &r_l->domains, list) {
+ list_for_each_entry(d, &r_l->domains, hdr.list) {
if (r_l->cache.arch_has_per_cpu_cfg)
/* Pick all the CPUs in the domain instance */
- for_each_cpu(cpu, &d->cpu_mask)
+ for_each_cpu(cpu, &d->hdr.cpu_mask)
cpumask_set_cpu(cpu, cpu_mask);
else
/* Pick one CPU from each domain instance to update MSR */
- cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+ cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
}

/* Update QOS_CFG MSR on all the CPUs in cpu_mask */
@@ -2268,7 +2268,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
{
u32 num_closid = resctrl_arch_get_num_closid(r);
- int cpu = cpumask_any(&d->cpu_mask);
+ int cpu = cpumask_any(&d->hdr.cpu_mask);
int i;

d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val),
@@ -2317,7 +2317,7 @@ static int set_mba_sc(bool mba_sc)

r->membw.mba_sc = mba_sc;

- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
for (i = 0; i < num_closid; i++)
d->mbps_val[i] = MBA_MAX_MBPS;
}
@@ -2653,7 +2653,7 @@ static int rdt_get_tree(struct fs_context *fc)

if (is_mbm_enabled()) {
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- list_for_each_entry(dom, &r->domains, list)
+ list_for_each_entry(dom, &r->domains, hdr.list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
}

@@ -2780,9 +2780,9 @@ static int reset_all_ctrls(struct rdt_resource *r)
* CBMs in all domains to the maximum mask value. Pick one CPU
* from each domain to update the MSRs below.
*/
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
hw_dom = resctrl_to_arch_dom(d);
- cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+ cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);

for (i = 0; i < hw_res->num_closid; i++)
hw_dom->ctrl_val[i] = r->default_ctrl;
@@ -2986,7 +2986,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
char name[32];
int ret;

- sprintf(name, "mon_%s_%02d", r->name, d->id);
+ sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
/* create the directory */
kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
if (IS_ERR(kn))
@@ -3002,7 +3002,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
}

priv.u.rid = r->rid;
- priv.u.domid = d->id;
+ priv.u.domid = d->hdr.id;
list_for_each_entry(mevt, &r->evt_list, list) {
priv.u.evtid = mevt->evtid;
ret = mon_addfile(kn, mevt->name, priv.priv);
@@ -3050,7 +3050,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_domain *dom;
int ret;

- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->domains, hdr.list) {
ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
if (ret)
return ret;
@@ -3209,7 +3209,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
*/
tmp_cbm = cfg->new_ctrl;
if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) {
- rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id);
+ rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id);
return -ENOSPC;
}
cfg->have_new_ctrl = true;
@@ -3232,7 +3232,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
struct rdt_domain *d;
int ret;

- list_for_each_entry(d, &s->res->domains, list) {
+ list_for_each_entry(d, &s->res->domains, hdr.list) {
ret = __init_one_rdt_domain(d, s, closid);
if (ret < 0)
return ret;
@@ -3247,7 +3247,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
struct resctrl_staged_config *cfg;
struct rdt_domain *d;

- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
if (is_mba_sc(r)) {
d->mbps_val[closid] = MBA_MAX_MBPS;
continue;
@@ -3864,7 +3864,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
* per domain monitor data directories.
*/
if (static_branch_unlikely(&rdt_mon_enable_key))
- rmdir_mondata_subdir_allrdtgrp(r, d->id);
+ rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);

if (is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
--
2.41.0

2023-12-04 18:54:43

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v13 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache

Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.

This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This benefit may be offset by lower bandwidth
since the memory accesses for each core can only be interleaved between
the memory controllers on the same SNC node.

Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.

The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket, the lower-numbered half of the RMIDs are given to the
first node, the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.

The other mode divides the RMIDs and renumbers the ones on the second
SNC node to start from zero.

Even with this renumbering, SNC mode requires several changes in resctrl
behavior for correct operation.

Add a global integer "snc_nodes_per_l3_cache" that shows how many
SNC nodes share each L3 cache. When "snc_nodes_per_l3_cache" is "1",
SNC mode is either not implemented or not enabled.

Update all places to take appropriate action when SNC mode is enabled:
1) The number of logical RMIDs per L3 cache available for use is the
number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be divided by the number of SNC
nodes.
3) The RMID renumbering applies when the value from the IA32_PQR_ASSOC
MSR is used to count accesses by a task. When reading an RMID
counter, adjust the logical RMID to the physical RMID value for
the SNC node being read and load the adjusted value into the
IA32_QM_EVTSEL MSR (see the sketch after this list).
4) Divide the L3 cache between the SNC nodes. Divide the value
reported in the resctrl "size" file by the number of SNC
nodes because the effective amount of cache that can be allocated
is reduced by that factor.
5) Disable the "-o mba_MBps" mount option in SNC mode
because the monitoring is being done per SNC node, while the
bandwidth allocation is still done at the L3 cache scope.
Trying to use this feedback loop might result in contradictory
changes to the throttling level coming from each of the SNC
node bandwidth measurements.
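
As an illustration of (1) and (3), a minimal userspace sketch of the
logical to physical RMID mapping. The RMID total and SNC node count
below are made-up values, not anything read from hardware:

#include <stdio.h>

int main(void)
{
	unsigned int phys_rmids = 320;	/* hypothetical RMIDs per L3 */
	unsigned int snc_nodes = 2;	/* hypothetical SNC nodes per L3 */
	/* (1): RMIDs visible to resctrl are the physical count / nodes */
	unsigned int num_rmid = phys_rmids / snc_nodes;
	unsigned int logical_rmid = 5;	/* RMID as seen by resctrl */

	/* (3): offset the logical RMID by the reading node's index */
	for (unsigned int node = 0; node < snc_nodes; node++)
		printf("node %u: logical %u -> physical %u\n",
		       node, logical_rmid,
		       logical_rmid + node * num_rmid);
	return 0;
}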

Signed-off-by: Tony Luck <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 6 ++++++
arch/x86/kernel/cpu/resctrl/monitor.c | 16 +++++++++++++---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 5 +++--
4 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index ce3a70657842..e7a75a439c16 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);

extern struct dentry *debugfs_resctrl;

+extern unsigned int snc_nodes_per_l3_cache;
+
enum resctrl_res_level {
RDT_RESOURCE_L3,
RDT_RESOURCE_L2,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index c9315ce8f7bd..cf5aba8a74bf 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,12 @@ int max_name_width, max_data_width;
*/
bool rdt_alloc_capable;

+/*
+ * Number of SNC nodes that share each L3 cache. Default is 1 for
+ * systems that do not support SNC, or have SNC disabled.
+ */
+unsigned int snc_nodes_per_l3_cache = 1;
+
static void
mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
struct rdt_resource *r);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 4e145f5620b0..30b7c3b9b517 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)

static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ int cpu = smp_processor_id();
+ int rmid_offset = 0;
u64 msr_val;

+ /*
+ * When SNC mode is on, need to compute the offset to read the
+ * physical RMID counter for the node to which this CPU belongs.
+ */
+ if (snc_nodes_per_l3_cache > 1)
+ rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+
/*
* As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
* with a valid event code for supported resource type and the bits
@@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
* IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
* are error bits.
*/
- wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+ wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
rdmsrl(MSR_IA32_QM_CTR, msr_val);

if (msr_val & RMID_VAL_ERROR)
@@ -783,8 +793,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
int ret;

resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
- hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
- r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+ hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+ r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;

if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 21bbd832f3f2..79d57dade568 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1425,7 +1425,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
}
}

- return size;
+ return size / snc_nodes_per_l3_cache;
}

/*
@@ -2298,7 +2298,8 @@ static bool supports_mba_mbps(void)
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;

return (is_mbm_local_enabled() &&
- r->alloc_capable && is_mba_linear());
+ r->alloc_capable && is_mba_linear() &&
+ snc_nodes_per_l3_cache == 1);
}

/*
--
2.41.0

2023-12-04 18:54:46

by Luck, Tony

Subject: [PATCH v13 3/8] x86/resctrl: Prepare for different scope for control/monitor operations

Resctrl assumes that control and monitor operations on a resource are
performed at the same scope.

Prepare for systems that use different scope (specifically Intel needs
to split the RDT_RESOURCE_L3 resource to use L3 scope for cache control
and NODE scope for cache occupancy and memory bandwidth monitoring).

Create separate domain lists for control and monitor operations.

Note that errors during initialization of either control or monitor
functions on a domain would previously result in that domain being
excluded from both control and monitor operations. Now that the domains
are allocated independently, it is no longer necessary to disable both
control and monitor operations if either fails.
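
A minimal userspace sketch of the shared-header lookup pattern this
enables: generic code walks a list of common headers, and each caller
recovers its enclosing domain type with container_of(). The structure
and field names here are illustrative stand-ins, not the kernel
definitions:

#include <stddef.h>
#include <stdio.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct dom_hdr {
	int id;
	int type;		/* 0 == control, 1 == monitor */
};

struct ctrl_dom {
	struct dom_hdr hdr;	/* embedded common header */
	unsigned int ctrl_val;	/* control-only state */
};

int main(void)
{
	struct ctrl_dom d = { .hdr = { .id = 3, .type = 0 }, .ctrl_val = 0xf };
	struct dom_hdr *hdr = &d.hdr;	/* what a generic search returns */

	/* Caller checks the header type, then recovers the full struct. */
	if (hdr->type == 0) {
		struct ctrl_dom *cd = container_of(hdr, struct ctrl_dom, hdr);

		printf("domain id=%d ctrl_val=%#x\n", hdr->id, cd->ctrl_val);
	}
	return 0;
}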

Signed-off-by: Tony Luck <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
---
include/linux/resctrl.h | 25 ++-
arch/x86/kernel/cpu/resctrl/internal.h | 6 +-
arch/x86/kernel/cpu/resctrl/core.c | 211 ++++++++++++++++------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 12 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 4 +-
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 55 +++---
7 files changed, 220 insertions(+), 97 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index c4067150a6b7..35e700edc6e6 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -52,15 +52,22 @@ struct resctrl_staged_config {
bool have_new_ctrl;
};

+enum resctrl_domain_type {
+ RESCTRL_CTRL_DOMAIN,
+ RESCTRL_MON_DOMAIN,
+};
+
/**
* struct rdt_domain_hdr - common header for different domain types
* @list: all instances of this resource
* @id: unique id for this instance
+ * @type: type of this instance
* @cpu_mask: which CPUs share this resource
*/
struct rdt_domain_hdr {
struct list_head list;
int id;
+ enum resctrl_domain_type type;
struct cpumask cpu_mask;
};

@@ -163,10 +170,12 @@ enum resctrl_scope {
* @alloc_capable: Is allocation available on this machine
* @mon_capable: Is monitor feature available on this machine
* @num_rmid: Number of RMIDs available
- * @scope: Scope of this resource
+ * @ctrl_scope: Scope of this resource for control functions
+ * @mon_scope: Scope of this resource for monitor functions
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
- * @domains: All domains for this resource
+ * @ctrl_domains: Control domains for this resource
+ * @mon_domains: Monitor domains for this resource
* @name: Name to use in "schemata" file.
* @data_width: Character width of data when displaying
* @default_ctrl: Specifies default cache cbm or memory B/W percent.
@@ -181,10 +190,12 @@ struct rdt_resource {
bool alloc_capable;
bool mon_capable;
int num_rmid;
- enum resctrl_scope scope;
+ enum resctrl_scope ctrl_scope;
+ enum resctrl_scope mon_scope;
struct resctrl_cache cache;
struct resctrl_membw membw;
- struct list_head domains;
+ struct list_head ctrl_domains;
+ struct list_head mon_domains;
char *name;
int data_width;
u32 default_ctrl;
@@ -230,8 +241,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,

u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);

/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa15f0a2..24bf9d7989a9 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -520,8 +520,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
- struct list_head **pos);
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+ struct list_head **pos);
ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
int rdtgroup_schemata_show(struct kernfs_open_file *of,
@@ -540,7 +540,7 @@ int rdt_pseudo_lock_init(void);
void rdt_pseudo_lock_release(void);
int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
int closids_supported(void);
void closid_free(int closid);
int alloc_rmid(void);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 62a989fd950d..1fd85533b4ca 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -57,7 +57,8 @@ static void
mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r);

-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
+#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)

struct rdt_hw_resource rdt_resources_all[] = {
[RDT_RESOURCE_L3] =
@@ -65,8 +66,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L3,
.name = "L3",
- .scope = RESCTRL_L3_CACHE,
- .domains = domain_init(RDT_RESOURCE_L3),
+ .ctrl_scope = RESCTRL_L3_CACHE,
+ .mon_scope = RESCTRL_L3_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_L3),
+ .mon_domains = mon_domain_init(RDT_RESOURCE_L3),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
@@ -79,8 +82,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L2,
.name = "L2",
- .scope = RESCTRL_L2_CACHE,
- .domains = domain_init(RDT_RESOURCE_L2),
+ .ctrl_scope = RESCTRL_L2_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_L2),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
@@ -93,8 +96,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_MBA,
.name = "MB",
- .scope = RESCTRL_L3_CACHE,
- .domains = domain_init(RDT_RESOURCE_MBA),
+ .ctrl_scope = RESCTRL_L3_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_MBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
.fflags = RFTYPE_RES_MB,
@@ -105,8 +108,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_SMBA,
.name = "SMBA",
- .scope = RESCTRL_L3_CACHE,
- .domains = domain_init(RDT_RESOURCE_SMBA),
+ .ctrl_scope = RESCTRL_L3_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_SMBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
.fflags = RFTYPE_RES_MB,
@@ -352,11 +355,11 @@ cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
}

-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
{
struct rdt_domain *d;

- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
/* Find the domain that contains this CPU */
if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
return d;
@@ -378,7 +381,7 @@ void rdt_ctrl_update(void *arg)
int cpu = smp_processor_id();
struct rdt_domain *d;

- d = get_domain_from_cpu(cpu, r);
+ d = get_ctrl_domain_from_cpu(cpu, r);
if (d) {
hw_res->msr_update(d, m, r);
return;
@@ -388,26 +391,26 @@ void rdt_ctrl_update(void *arg)
}

/*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Search for a domain id in a resource domain list.
*
- * Search resource r's domain list to find the resource id. If the resource
- * id is found in a domain, return the domain. Otherwise, if requested by
- * caller, return the first domain whose id is bigger than the input id.
- * The domain list is sorted by id in ascending order.
+ * Search the domain list to find the domain id. If the domain id is
+ * found, return the domain. NULL otherwise. If the domain id is not
+ * found (and NULL returned) then the first domain with id bigger than
+ * the input id can be returned to the caller via @pos.
*/
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
- struct list_head **pos)
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+ struct list_head **pos)
{
- struct rdt_domain *d;
+ struct rdt_domain_hdr *d;
struct list_head *l;

- list_for_each(l, &r->domains) {
- d = list_entry(l, struct rdt_domain, hdr.list);
+ list_for_each(l, h) {
+ d = list_entry(l, struct rdt_domain_hdr, list);
/* When id is found, return its domain. */
- if (id == d->hdr.id)
+ if (id == d->id)
return d;
/* Stop searching when finding id's position in sorted list. */
- if (id < d->hdr.id)
+ if (id < d->id)
break;
}

@@ -501,35 +504,28 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
return -EINVAL;
}

-/*
- * domain_add_cpu - Add a cpu to a resource's domain list.
- *
- * If an existing domain in the resource r's domain list matches the cpu's
- * resource id, add the cpu in the domain.
- *
- * Otherwise, a new domain is allocated and inserted into the right position
- * in the domain list sorted by id in ascending order.
- *
- * The order in the domain list is visible to users when we print entries
- * in the schemata file and schemata input is validated to have the same order
- * as this list.
- */
-static void domain_add_cpu(int cpu, struct rdt_resource *r)
+static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
{
- int id = get_domain_id_from_scope(cpu, r->scope);
+ int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
struct list_head *add_pos = NULL;
struct rdt_hw_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
struct rdt_domain *d;
int err;

if (id < 0) {
- pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
- cpu, r->scope, r->name);
+ pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->ctrl_scope, r->name);
return;
}

- d = rdt_find_domain(r, id, &add_pos);
- if (d) {
+ hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
+ if (hdr) {
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ return;
+
+ d = container_of(hdr, struct rdt_domain, hdr);
+
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
rdt_domain_reconfigure_cdp(r);
@@ -542,51 +538,114 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)

d = &hw_dom->d_resctrl;
d->hdr.id = id;
+ d->hdr.type = RESCTRL_CTRL_DOMAIN;
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);

rdt_domain_reconfigure_cdp(r);

- if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
+ if (domain_setup_ctrlval(r, d)) {
domain_free(hw_dom);
return;
}

- if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+ list_add_tail(&d->hdr.list, add_pos);
+
+ err = resctrl_online_ctrl_domain(r, d);
+ if (err) {
+ list_del(&d->hdr.list);
+ domain_free(hw_dom);
+ }
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->mon_scope);
+ struct list_head *add_pos = NULL;
+ struct rdt_hw_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
+ struct rdt_domain *d;
+ int err;
+
+ if (id < 0) {
+ pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->mon_scope, r->name);
+ return;
+ }
+
+ hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
+ if (hdr) {
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ return;
+
+ d = container_of(hdr, struct rdt_domain, hdr);
+
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+ return;
+ }
+
+ hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+ if (!hw_dom)
+ return;
+
+ d = &hw_dom->d_resctrl;
+ d->hdr.id = id;
+ d->hdr.type = RESCTRL_MON_DOMAIN;
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+
+ if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
domain_free(hw_dom);
return;
}

list_add_tail(&d->hdr.list, add_pos);

- err = resctrl_online_domain(r, d);
+ err = resctrl_online_mon_domain(r, d);
if (err) {
list_del(&d->hdr.list);
domain_free(hw_dom);
}
}

-static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+/*
+ * domain_add_cpu - Add a CPU to either/both resource's domain lists.
+ */
+static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_domain_id_from_scope(cpu, r->scope);
+ if (r->alloc_capable)
+ domain_add_cpu_ctrl(cpu, r);
+ if (r->mon_capable)
+ domain_add_cpu_mon(cpu, r);
+}
+
+static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
struct rdt_hw_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
struct rdt_domain *d;

if (id < 0) {
- pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
- cpu, r->scope, r->name);
+ pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->ctrl_scope, r->name);
return;
}

- d = rdt_find_domain(r, id, NULL);
- if (!d) {
- pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
+ hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
+ if (!hdr) {
+ pr_warn("Can't find control domain for id=%d for CPU %d for resource %s\n",
+ id, cpu, r->name);
return;
}
+
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ return;
+
+ d = container_of(hdr, struct rdt_domain, hdr);
hw_dom = resctrl_to_arch_dom(d);

cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
if (cpumask_empty(&d->hdr.cpu_mask)) {
- resctrl_offline_domain(r, d);
+ resctrl_offline_ctrl_domain(r, d);
list_del(&d->hdr.list);

/*
@@ -599,6 +658,42 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)

return;
}
+}
+
+static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->mon_scope);
+ struct rdt_hw_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
+ struct rdt_domain *d;
+
+ if (id < 0) {
+ pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->mon_scope, r->name);
+ return;
+ }
+
+ hdr = rdt_find_domain(&r->mon_domains, id, NULL);
+ if (!hdr) {
+ pr_warn("Can't find monitor domain for id=%d for CPU %d for resource %s\n",
+ id, cpu, r->name);
+ return;
+ }
+
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ return;
+
+ d = container_of(hdr, struct rdt_domain, hdr);
+ hw_dom = resctrl_to_arch_dom(d);
+
+ cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+ if (cpumask_empty(&d->hdr.cpu_mask)) {
+ resctrl_offline_mon_domain(r, d);
+ list_del(&d->hdr.list);
+ domain_free(hw_dom);
+
+ return;
+ }

if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
@@ -613,6 +708,14 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
}
}

+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+{
+ if (r->alloc_capable)
+ domain_remove_cpu_ctrl(cpu, r);
+ if (r->mon_capable)
+ domain_remove_cpu_mon(cpu, r);
+}
+
static void clear_closid_rmid(int cpu)
{
struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 23f8258d36a8..0b4136c42762 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -226,7 +226,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
return -EINVAL;
}
dom = strim(dom);
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
if (d->hdr.id == dom_id) {
data.buf = dom;
data.rdtgrp = rdtgrp;
@@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
return -ENOMEM;

msr_param.res = NULL;
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
hw_dom = resctrl_to_arch_dom(d);
for (t = 0; t < CDP_NUM_TYPES; t++) {
cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
u32 ctrl_val;

seq_printf(s, "%*s:", max_name_width, schema->name);
- list_for_each_entry(dom, &r->domains, hdr.list) {
+ list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
if (sep)
seq_puts(s, ";");

@@ -542,6 +542,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
+ struct rdt_domain_hdr *hdr;
u32 resid, evtid, domid;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
@@ -562,11 +563,12 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
evtid = md.u.evtid;

r = &rdt_resources_all[resid].r_resctrl;
- d = rdt_find_domain(r, domid, NULL);
- if (!d) {
+ hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
+ if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
ret = -ENOENT;
goto out;
}
+ d = container_of(hdr, struct rdt_domain, hdr);

mon_event_read(&rr, r, d, rdtgrp, evtid, false);

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index dd0ea1bc0092..ec5ad926c5dc 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)

entry->busy = 0;
cpu = get_cpu();
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->mon_domains, hdr.list) {
if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) {
err = resctrl_arch_rmid_read(r, d, entry->rmid,
QOS_L3_OCCUP_EVENT_ID,
@@ -535,7 +535,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
rmid = rgrp->mon.rmid;
pmbm_data = &dom_mbm->mbm_local[rmid];

- dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
+ dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
if (!dom_mba) {
pr_warn_once("Failure to get domain for MBA update\n");
return;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index fcbd99e2eb66..ed6d59af1cef 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,7 +292,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
*/
static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
{
- enum resctrl_scope scope = plr->s->res->scope;
+ enum resctrl_scope scope = plr->s->res->ctrl_scope;
struct cpu_cacheinfo *ci;
int ret;
int i;
@@ -856,7 +856,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
* associated with them.
*/
for_each_alloc_capable_rdt_resource(r) {
- list_for_each_entry(d_i, &r->domains, hdr.list) {
+ list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) {
if (d_i->plr)
cpumask_or(cpu_with_psl, cpu_with_psl,
&d_i->hdr.cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 04d32602ac33..760013ed1bff 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
lockdep_assert_held(&rdtgroup_mutex);

for_each_alloc_capable_rdt_resource(r) {
- list_for_each_entry(dom, &r->domains, hdr.list)
+ list_for_each_entry(dom, &r->ctrl_domains, hdr.list)
memset(dom->staged_config, 0, sizeof(dom->staged_config));
}
}
@@ -984,7 +984,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,

mutex_lock(&rdtgroup_mutex);
hw_shareable = r->cache.shareable_bits;
- list_for_each_entry(dom, &r->domains, hdr.list) {
+ list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
if (sep)
seq_putc(seq, ';');
sw_shareable = 0;
@@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
continue;
has_cache = true;
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
ctrl = resctrl_arch_get_config(r, d, closid,
s->conf_type);
if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1413,13 +1413,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
unsigned int size = 0;
int num_b, i;

- if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+ if (WARN_ON_ONCE(r->ctrl_scope != RESCTRL_L2_CACHE && r->ctrl_scope != RESCTRL_L3_CACHE))
return size;

num_b = bitmap_weight(&cbm, r->cache.cbm_len);
ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == r->scope) {
+ if (ci->info_list[i].level == r->ctrl_scope) {
size = ci->info_list[i].size / r->cache.cbm_len * num_b;
break;
}
@@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
type = schema->conf_type;
sep = false;
seq_printf(s, "%*s:", max_name_width, schema->name);
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
if (sep)
seq_putc(s, ';');
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid

mutex_lock(&rdtgroup_mutex);

- list_for_each_entry(dom, &r->domains, hdr.list) {
+ list_for_each_entry(dom, &r->mon_domains, hdr.list) {
if (sep)
seq_puts(s, ";");

@@ -1689,7 +1689,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
return -EINVAL;
}

- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->mon_domains, hdr.list) {
if (d->hdr.id == dom_id) {
ret = mbm_config_write_domain(r, d, evtid, val);
if (ret)
@@ -2232,7 +2232,7 @@ static int set_cache_qos_cfg(int level, bool enable)
return -ENOMEM;

r_l = &rdt_resources_all[level].r_resctrl;
- list_for_each_entry(d, &r_l->domains, hdr.list) {
+ list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) {
if (r_l->cache.arch_has_per_cpu_cfg)
/* Pick all the CPUs in the domain instance */
for_each_cpu(cpu, &d->hdr.cpu_mask)
@@ -2317,7 +2317,7 @@ static int set_mba_sc(bool mba_sc)

r->membw.mba_sc = mba_sc;

- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
for (i = 0; i < num_closid; i++)
d->mbps_val[i] = MBA_MAX_MBPS;
}
@@ -2653,7 +2653,7 @@ static int rdt_get_tree(struct fs_context *fc)

if (is_mbm_enabled()) {
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- list_for_each_entry(dom, &r->domains, hdr.list)
+ list_for_each_entry(dom, &r->mon_domains, hdr.list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
}

@@ -2777,10 +2777,10 @@ static int reset_all_ctrls(struct rdt_resource *r)

/*
* Disable resource control for this resource by setting all
- * CBMs in all domains to the maximum mask value. Pick one CPU
+ * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU
* from each domain to update the MSRs below.
*/
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
hw_dom = resctrl_to_arch_dom(d);
cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);

@@ -3050,7 +3050,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_domain *dom;
int ret;

- list_for_each_entry(dom, &r->domains, hdr.list) {
+ list_for_each_entry(dom, &r->mon_domains, hdr.list) {
ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
if (ret)
return ret;
@@ -3232,7 +3232,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
struct rdt_domain *d;
int ret;

- list_for_each_entry(d, &s->res->domains, hdr.list) {
+ list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
ret = __init_one_rdt_domain(d, s, closid);
if (ret < 0)
return ret;
@@ -3247,7 +3247,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
struct resctrl_staged_config *cfg;
struct rdt_domain *d;

- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
if (is_mba_sc(r)) {
d->mbps_val[closid] = MBA_MAX_MBPS;
continue;
@@ -3849,15 +3849,17 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
kfree(d->mbm_local);
}

-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
{
lockdep_assert_held(&rdtgroup_mutex);

if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
mba_sc_domain_destroy(r, d);
+}

- if (!r->mon_capable)
- return;
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+ lockdep_assert_held(&rdtgroup_mutex);

/*
* If resctrl is mounted, remove all the
@@ -3914,18 +3916,21 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
{
- int err;
-
lockdep_assert_held(&rdtgroup_mutex);

if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
- /* RDT_RESOURCE_MBA is never mon_capable */
return mba_sc_domain_allocate(r, d);

- if (!r->mon_capable)
- return 0;
+ return 0;
+}
+
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+ int err;
+
+ lockdep_assert_held(&rdtgroup_mutex);

err = domain_setup_mon_state(r, d);
if (err)
--
2.41.0

2023-12-04 18:54:48

by Luck, Tony

Subject: [PATCH v13 7/8] x86/resctrl: Sub NUMA Cluster detection and enable

There isn't a simple hardware bit that indicates whether a CPU is
running in Sub NUMA Cluster (SNC) mode. Infer the state from the ratio
of NUMA nodes to L3 cache instances.

When SNC mode is detected, reconfigure the RMID counters by updating
the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.

Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
on the second SNC node to start from zero.
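
A minimal userspace sketch of the inference, with hypothetical topology
counts standing in for the kernel's NUMA and cacheinfo queries:

#include <stdio.h>

/* ratio of CPU NUMA nodes (memory-only nodes excluded) to L3 caches */
static int snc_nodes_per_l3(int cpu_nodes, int l3_caches)
{
	int ratio;

	/* inconsistent topology: assume SNC is off */
	if (l3_caches <= 0 || cpu_nodes % l3_caches)
		return 1;

	ratio = cpu_nodes / l3_caches;

	/* only 1..4 nodes per L3 are plausible SNC configurations */
	return (ratio >= 1 && ratio <= 4) ? ratio : 1;
}

int main(void)
{
	/* e.g. two sockets in SNC2 mode: 4 CPU nodes, 2 L3 caches */
	printf("%d\n", snc_nodes_per_l3(4, 2));	/* prints 2 */
	return 0;
}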

Signed-off-by: Tony Luck <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 118 +++++++++++++++++++++++++++++
2 files changed, 119 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 1d51e1850ed0..94d29d81e6db 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1111,6 +1111,7 @@
#define MSR_IA32_QM_CTR 0xc8e
#define MSR_IA32_PQR_ASSOC 0xc8f
#define MSR_IA32_L3_CBM_BASE 0xc90
+#define MSR_RMID_SNC_CONFIG 0xca0
#define MSR_IA32_L2_CBM_BASE 0xd10
#define MSR_IA32_MBA_THRTL_BASE 0xd50

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index cf5aba8a74bf..1de1c4499b7d 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@

#define pr_fmt(fmt) "resctrl: " fmt

+#include <linux/cpu.h>
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/cacheinfo.h>
#include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>

+#include <asm/cpu_device_id.h>
#include <asm/intel-family.h>
#include <asm/resctrl.h>
#include "internal.h"
@@ -740,11 +743,42 @@ static void clear_closid_rmid(int cpu)
wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
}

+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * This mode is incompatible with Linux resctrl semantics
+ * as RMIDs are partitioned between SNC nodes, which requires
+ * a user to know which RMID is allocated to a task.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in Sub NUMA Cluster mode. This mode is better for Linux.
+ * The RMID space is divided between all SNC nodes with the
+ * RMIDs renumbered to start from zero in each node when
+ * counting operations from tasks. Code to read the counters
+ * must adjust RMID counter numbers based on SNC node. See
+ * __rmid_read() for code that does this.
+ */
+static void snc_remap_rmids(int cpu)
+{
+ u64 val;
+
+ /* Only need to enable once per package. */
+ if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
+ return;
+
+ rdmsrl(MSR_RMID_SNC_CONFIG, val);
+ val &= ~BIT_ULL(0);
+ wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
static int resctrl_online_cpu(unsigned int cpu)
{
struct rdt_resource *r;

mutex_lock(&rdtgroup_mutex);
+
+ if (snc_nodes_per_l3_cache > 1)
+ snc_remap_rmids(cpu);
+
for_each_capable_rdt_resource(r)
domain_add_cpu(cpu, r);
/* The cpu is set in default rdtgroup after online. */
@@ -999,11 +1033,95 @@ static __init bool get_rdt_resources(void)
return (rdt_mon_capable || rdt_alloc_capable);
}

+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
+ {}
+};
+
+/*
+ * There isn't a simple hardware bit that indicates whether a CPU is running
+ * in Sub NUMA Cluster (SNC) mode. Infer the state from the ratio
+ * of NUMA nodes to L3 cache instances.
+ * It is not possible to accurately determine SNC state if the system is
+ * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
+ * to L3 caches. It will be OK if the system is booted with hyperthreading
+ * disabled (since this doesn't affect the ratio).
+ */
+static __init int snc_get_config(void)
+{
+ unsigned long *node_caches;
+ int mem_only_nodes = 0;
+ int cpu, node, ret;
+ int num_l3_caches;
+ int cache_id;
+
+ if (!x86_match_cpu(snc_cpu_ids))
+ return 1;
+
+ node_caches = bitmap_zalloc(num_possible_cpus(), GFP_KERNEL);
+ if (!node_caches)
+ return 1;
+
+ cpus_read_lock();
+
+ if (num_online_cpus() != num_present_cpus())
+ pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
+
+ for_each_node(node) {
+ cpu = cpumask_first(cpumask_of_node(node));
+ if (cpu < nr_cpu_ids) {
+ cache_id = get_cpu_cacheinfo_id(cpu, 3);
+ if (cache_id != -1)
+ set_bit(cache_id, node_caches);
+ } else {
+ mem_only_nodes++;
+ }
+ }
+ cpus_read_unlock();
+
+ num_l3_caches = bitmap_weight(node_caches, num_possible_cpus());
+ kfree(node_caches);
+
+ if (!num_l3_caches)
+ goto insane;
+
+ /* sanity check #1: Number of CPU nodes must be multiple of num_l3_caches */
+ if ((nr_node_ids - mem_only_nodes) % num_l3_caches)
+ goto insane;
+
+ ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
+
+ /* sanity check #2: Only valid results are 1, 2, 3, 4 */
+ switch (ret) {
+ case 1:
+ break;
+ case 2:
+ case 3:
+ case 4:
+ rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
+ break;
+ default:
+ goto insane;
+ }
+
+ return ret;
+insane:
+ pr_warn("SNC insanity: CPU nodes = %d num_l3_caches = %d\n",
+ (nr_node_ids - mem_only_nodes), num_l3_caches);
+ return 1;
+}
+
static __init void rdt_init_res_defs_intel(void)
{
struct rdt_hw_resource *hw_res;
struct rdt_resource *r;

+ snc_nodes_per_l3_cache = snc_get_config();
+
for_each_rdt_resource(r) {
hw_res = resctrl_to_arch_res(r);

--
2.41.0

2023-12-04 18:54:53

by Luck, Tony

Subject: [PATCH v13 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes

With Sub-NUMA Cluster mode enabled, the scope of monitoring resources
is per-NODE instead of per-L3 cache. Suffixes of directories with "L3"
in their name refer to Sub-NUMA nodes instead of L3 cache ids.

Users should be aware that SNC mode also affects the amount of L3 cache
available for allocation within each SNC node.
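
A small sketch of the arithmetic behind the adjusted "size" value, with
made-up cache geometry (the real values come from cacheinfo and the
resource's CBM width):

#include <stdio.h>

int main(void)
{
	unsigned int cache_bytes = 32u << 20;	/* hypothetical 32MB L3 */
	unsigned int cbm_len = 16;		/* hypothetical CBM width */
	unsigned int bits_set = 4;		/* bits in a group's mask */
	unsigned int snc_nodes = 2;		/* SNC nodes per L3 */

	/* bytes per CBM bit, scaled by the mask, split across SNC nodes */
	unsigned int size = cache_bytes / cbm_len * bits_set / snc_nodes;

	printf("%u\n", size);	/* 4194304: 4MB usable per SNC node */
	return 0;
}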

Signed-off-by: Tony Luck <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
---
Documentation/arch/x86/resctrl.rst | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df64a9d..15f1cff6ee76 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -366,10 +366,10 @@ When control is enabled all CTRL_MON groups will also contain:
When monitoring is enabled all MON groups will also contain:

"mon_data":
- This contains a set of files organized by L3 domain and by
- RDT event. E.g. on a system with two L3 domains there will
- be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
- directories have one file per event (e.g. "llc_occupancy",
+ This contains a set of files organized by L3 domain or by NUMA
+ node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+ or enabled respectively) and by RDT event. Each of these
+ directories has one file per event (e.g. "llc_occupancy",
"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
files provide a read out of the current value of the event for
all tasks in the group. In CTRL_MON groups these files provide
@@ -478,6 +478,23 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
each bit represents 5% of the capacity of the cache. You could partition
the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.

+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes. Users who do not bind tasks to the CPUs of a
+specific Sub-NUMA node must read the "llc_occupancy", "mbm_total_bytes",
+and "mbm_local_bytes" for all Sub-NUMA nodes where the tasks may execute
+to get the full view of traffic for which the tasks were the source.
+
+The cache allocation feature still provides the same number of
+bits in a mask to control allocation into the L3 cache, but each
+of those ways has its capacity reduced because the cache is divided
+between the SNC nodes. The values reported in the resctrl
+"size" files are adjusted accordingly.
+
Memory bandwidth Allocation and monitoring
==========================================

--
2.41.0

2023-12-04 18:55:16

by Luck, Tony

Subject: [PATCH v13 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures

The same rdt_domain structure is used for both control and monitor
functions. But this results in wasted memory as some of the fields are
only used by control functions, while most are only used for monitor
functions.

Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
just the fields required for control and monitoring respectively.

Similar split of the rdt_hw_domain structure into rdt_hw_ctrl_domain
and rdt_hw_mon_domain.
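
A toy illustration of the memory argument, with stand-in field types
rather than the real resctrl members:

#include <stdio.h>

struct hdr { int id; };			/* shared header fields */

struct combined_dom {			/* the old single structure */
	struct hdr hdr;
	void *ctrl_only[4];		/* e.g. staged config, mbps_val */
	void *mon_only[7];		/* e.g. MBM state, limbo workers */
};

struct ctrl_dom { struct hdr hdr; void *ctrl_only[4]; };
struct mon_dom  { struct hdr hdr; void *mon_only[7]; };

int main(void)
{
	/* each domain instance now allocates only what it uses */
	printf("combined=%zu ctrl=%zu mon=%zu\n",
	       sizeof(struct combined_dom),
	       sizeof(struct ctrl_dom), sizeof(struct mon_dom));
	return 0;
}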

Signed-off-by: Tony Luck <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
---
include/linux/resctrl.h | 48 +++++++------
arch/x86/kernel/cpu/resctrl/internal.h | 60 ++++++++++------
arch/x86/kernel/cpu/resctrl/core.c | 87 ++++++++++++-----------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 32 ++++-----
arch/x86/kernel/cpu/resctrl/monitor.c | 40 +++++------
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 6 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 62 ++++++++--------
7 files changed, 182 insertions(+), 153 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 35e700edc6e6..058a940c3239 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -72,7 +72,23 @@ struct rdt_domain_hdr {
};

/**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
+ * @hdr: common header for different domain types
+ * @plr: pseudo-locked region (if any) associated with domain
+ * @staged_config: parsed configuration to be applied
+ * @mbps_val: When mba_sc is enabled, this holds the array of user
+ * specified control values for mba_sc in MBps, indexed
+ * by closid
+ */
+struct rdt_ctrl_domain {
+ struct rdt_domain_hdr hdr;
+ struct pseudo_lock_region *plr;
+ struct resctrl_staged_config staged_config[CDP_NUM_TYPES];
+ u32 *mbps_val;
+};
+
+/**
+ * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
* @hdr: common header for different domain types
* @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
* @mbm_total: saved state for MBM total bandwidth
@@ -81,13 +97,8 @@ struct rdt_domain_hdr {
* @cqm_limbo: worker to periodically read CQM h/w counters
* @mbm_work_cpu: worker CPU for MBM h/w counters
* @cqm_work_cpu: worker CPU for CQM h/w counters
- * @plr: pseudo-locked region (if any) associated with domain
- * @staged_config: parsed configuration to be applied
- * @mbps_val: When mba_sc is enabled, this holds the array of user
- * specified control values for mba_sc in MBps, indexed
- * by closid
*/
-struct rdt_domain {
+struct rdt_mon_domain {
struct rdt_domain_hdr hdr;
unsigned long *rmid_busy_llc;
struct mbm_state *mbm_total;
@@ -96,9 +107,6 @@ struct rdt_domain {
struct delayed_work cqm_limbo;
int mbm_work_cpu;
int cqm_work_cpu;
- struct pseudo_lock_region *plr;
- struct resctrl_staged_config staged_config[CDP_NUM_TYPES];
- u32 *mbps_val;
};

/**
@@ -202,7 +210,7 @@ struct rdt_resource {
const char *format_str;
int (*parse_ctrlval)(struct rdt_parse_data *data,
struct resctrl_schema *s,
- struct rdt_domain *d);
+ struct rdt_ctrl_domain *d);
struct list_head evt_list;
unsigned long fflags;
bool cdp_capable;
@@ -236,15 +244,15 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
* Update the ctrl_val and apply this config right now.
* Must be called on one of the domain's CPUs.
*/
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type t, u32 cfg_val);

-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type type);
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);

/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
@@ -260,7 +268,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
* Return:
* 0 on success, or -EIO, -EINVAL etc on error.
*/
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 rmid, enum resctrl_event_id eventid, u64 *val);

/**
@@ -273,7 +281,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 rmid, enum resctrl_event_id eventid);

/**
@@ -285,7 +293,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);

extern unsigned int resctrl_rmid_realloc_threshold;
extern unsigned int resctrl_rmid_realloc_limit;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 24bf9d7989a9..ce3a70657842 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -107,7 +107,7 @@ union mon_data_bits {
struct rmid_read {
struct rdtgroup *rgrp;
struct rdt_resource *r;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;
enum resctrl_event_id evtid;
bool first;
int err;
@@ -192,7 +192,7 @@ struct mongroup {
*/
struct pseudo_lock_region {
struct resctrl_schema *s;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
u32 cbm;
wait_queue_head_t lock_thread_wq;
int thread_done;
@@ -319,25 +319,41 @@ struct arch_mbm_state {
};

/**
- * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
- * a resource
+ * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
+ * a resource for a control function
* @d_resctrl: Properties exposed to the resctrl file system
* @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID)
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_ctrl_domain {
+ struct rdt_ctrl_domain d_resctrl;
+ u32 *ctrl_val;
+};
+
+/**
+ * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
+ * a resource for a monitor function
+ * @d_resctrl: Properties exposed to the resctrl file system
* @arch_mbm_total: arch private state for MBM total bandwidth
* @arch_mbm_local: arch private state for MBM local bandwidth
*
* Members of this structure are accessed via helpers that provide abstraction.
*/
-struct rdt_hw_domain {
- struct rdt_domain d_resctrl;
- u32 *ctrl_val;
+struct rdt_hw_mon_domain {
+ struct rdt_mon_domain d_resctrl;
struct arch_mbm_state *arch_mbm_total;
struct arch_mbm_state *arch_mbm_local;
};

-static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
+static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
+{
+ return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
+}
+
+static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
{
- return container_of(r, struct rdt_hw_domain, d_resctrl);
+ return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
}

/**
@@ -405,7 +421,7 @@ struct rdt_hw_resource {
struct rdt_resource r_resctrl;
u32 num_closid;
unsigned int msr_base;
- void (*msr_update) (struct rdt_domain *d, struct msr_param *m,
+ void (*msr_update) (struct rdt_ctrl_domain *d, struct msr_param *m,
struct rdt_resource *r);
unsigned int mon_scale;
unsigned int mbm_width;
@@ -418,9 +434,9 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
}

int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
- struct rdt_domain *d);
+ struct rdt_ctrl_domain *d);
int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
- struct rdt_domain *d);
+ struct rdt_ctrl_domain *d);

extern struct mutex rdtgroup_mutex;

@@ -526,21 +542,21 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
int rdtgroup_schemata_show(struct kernfs_open_file *of,
struct seq_file *s, void *v);
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
unsigned long cbm, int closid, bool exclusive);
-unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d,
+unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d,
unsigned long cbm);
enum rdtgrp_mode rdtgroup_mode_by_closid(int closid);
int rdtgroup_tasks_assigned(struct rdtgroup *r);
int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp);
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm);
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d);
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm);
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d);
int rdt_pseudo_lock_init(void);
void rdt_pseudo_lock_release(void);
int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
int closids_supported(void);
void closid_free(int closid);
int alloc_rmid(void);
@@ -550,17 +566,17 @@ bool __init rdt_cpu_has(int flag);
void mon_event_count(void *info);
int rdtgroup_mondata_show(struct seq_file *m, void *arg);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
int evtid, int first);
-void mbm_setup_overflow_handler(struct rdt_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
unsigned long delay_ms);
void mbm_handle_overflow(struct work_struct *work);
void __init intel_rdt_mbm_apply_quirk(void);
bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms);
void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
-void __check_limbo(struct rdt_domain *d, bool force_free);
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d);
+void __check_limbo(struct rdt_mon_domain *d, bool force_free);
void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
void __init thread_throttle_mode_init(void);
void __init mbm_config_rftype_init(const char *config);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 1fd85533b4ca..797cb3bf417a 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -49,12 +49,12 @@ int max_name_width, max_data_width;
bool rdt_alloc_capable;

static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
struct rdt_resource *r);
static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r);
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r);
static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m,
struct rdt_resource *r);

#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
@@ -307,11 +307,11 @@ static void rdt_get_cdp_l2_config(void)
}

static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
{
- unsigned int i;
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ unsigned int i;

for (i = m->low; i < m->high; i++)
wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
@@ -332,12 +332,12 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
}

static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
struct rdt_resource *r)
{
- unsigned int i;
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ unsigned int i;

/* Write the delay values for mba. */
for (i = m->low; i < m->high; i++)
@@ -345,19 +345,19 @@ mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
}

static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
{
- unsigned int i;
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ unsigned int i;

for (i = m->low; i < m->high; i++)
wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
}

-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
{
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;

list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
/* Find the domain that contains this CPU */
@@ -379,7 +379,7 @@ void rdt_ctrl_update(void *arg)
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
struct rdt_resource *r = m->res;
int cpu = smp_processor_id();
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;

d = get_ctrl_domain_from_cpu(cpu, r);
if (d) {
@@ -434,18 +434,23 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
*dc = r->default_ctrl;
}

-static void domain_free(struct rdt_hw_domain *hw_dom)
+static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
+{
+ kfree(hw_dom->ctrl_val);
+ kfree(hw_dom);
+}
+
+static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
{
kfree(hw_dom->arch_mbm_total);
kfree(hw_dom->arch_mbm_local);
- kfree(hw_dom->ctrl_val);
kfree(hw_dom);
}

-static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
struct msr_param m;
u32 *dc;

@@ -468,7 +473,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
* @num_rmid: The size of the MBM counter array
* @hw_dom: The domain that owns the allocated arrays
*/
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
{
size_t tsize;

@@ -507,10 +512,10 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+ struct rdt_hw_ctrl_domain *hw_dom;
struct list_head *add_pos = NULL;
- struct rdt_hw_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
int err;

if (id < 0) {
@@ -524,7 +529,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
return;

- d = container_of(hdr, struct rdt_domain, hdr);
+ d = container_of(hdr, struct rdt_ctrl_domain, hdr);

cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
@@ -544,7 +549,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
rdt_domain_reconfigure_cdp(r);

if (domain_setup_ctrlval(r, d)) {
- domain_free(hw_dom);
+ ctrl_domain_free(hw_dom);
return;
}

@@ -553,17 +558,17 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
err = resctrl_online_ctrl_domain(r, d);
if (err) {
list_del(&d->hdr.list);
- domain_free(hw_dom);
+ ctrl_domain_free(hw_dom);
}
}

static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->mon_scope);
+ struct rdt_hw_mon_domain *hw_dom;
struct list_head *add_pos = NULL;
- struct rdt_hw_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;
int err;

if (id < 0) {
@@ -577,7 +582,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
return;

- d = container_of(hdr, struct rdt_domain, hdr);
+ d = container_of(hdr, struct rdt_mon_domain, hdr);

cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
return;
@@ -593,7 +598,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);

if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
- domain_free(hw_dom);
+ mon_domain_free(hw_dom);
return;
}

@@ -602,7 +607,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
err = resctrl_online_mon_domain(r, d);
if (err) {
list_del(&d->hdr.list);
- domain_free(hw_dom);
+ mon_domain_free(hw_dom);
}
}

@@ -620,9 +625,9 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_ctrl_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;

if (id < 0) {
pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
@@ -640,8 +645,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
return;

- d = container_of(hdr, struct rdt_domain, hdr);
- hw_dom = resctrl_to_arch_dom(d);
+ d = container_of(hdr, struct rdt_ctrl_domain, hdr);
+ hw_dom = resctrl_to_arch_ctrl_dom(d);

cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
if (cpumask_empty(&d->hdr.cpu_mask)) {
@@ -649,12 +654,12 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
list_del(&d->hdr.list);

/*
- * rdt_domain "d" is going to be freed below, so clear
+ * rdt_ctrl_domain "d" is going to be freed below, so clear
* its pointer from pseudo_lock_region struct.
*/
if (d->plr)
d->plr->d = NULL;
- domain_free(hw_dom);
+ ctrl_domain_free(hw_dom);

return;
}
@@ -663,9 +668,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->mon_scope);
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_mon_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;

if (id < 0) {
pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
@@ -683,14 +688,14 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
return;

- d = container_of(hdr, struct rdt_domain, hdr);
- hw_dom = resctrl_to_arch_dom(d);
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ hw_dom = resctrl_to_arch_mon_dom(d);

cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
if (cpumask_empty(&d->hdr.cpu_mask)) {
resctrl_offline_mon_domain(r, d);
list_del(&d->hdr.list);
- domain_free(hw_dom);
+ mon_domain_free(hw_dom);

return;
}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 0b4136c42762..08fc97ce4135 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -58,7 +58,7 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
}

int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
- struct rdt_domain *d)
+ struct rdt_ctrl_domain *d)
{
struct resctrl_staged_config *cfg;
u32 closid = data->rdtgrp->closid;
@@ -137,7 +137,7 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
* resource type.
*/
int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
- struct rdt_domain *d)
+ struct rdt_ctrl_domain *d)
{
struct rdtgroup *rdtgrp = data->rdtgrp;
struct resctrl_staged_config *cfg;
@@ -206,8 +206,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
struct resctrl_staged_config *cfg;
struct rdt_resource *r = s->res;
struct rdt_parse_data data;
+ struct rdt_ctrl_domain *d;
char *dom = NULL, *id;
- struct rdt_domain *d;
unsigned long dom_id;

if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
@@ -267,11 +267,11 @@ static u32 get_config_index(u32 closid, enum resctrl_conf_type type)
}
}

-static bool apply_config(struct rdt_hw_domain *hw_dom,
+static bool apply_config(struct rdt_hw_ctrl_domain *hw_dom,
struct resctrl_staged_config *cfg, u32 idx,
cpumask_var_t cpu_mask)
{
- struct rdt_domain *dom = &hw_dom->d_resctrl;
+ struct rdt_ctrl_domain *dom = &hw_dom->d_resctrl;

if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
cpumask_set_cpu(cpumask_any(&dom->hdr.cpu_mask), cpu_mask);
@@ -283,11 +283,11 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
return false;
}

-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type t, u32 cfg_val)
{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
u32 idx = get_config_index(closid, t);
struct msr_param msr_param;

@@ -307,11 +307,11 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
{
struct resctrl_staged_config *cfg;
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_ctrl_domain *hw_dom;
struct msr_param msr_param;
+ struct rdt_ctrl_domain *d;
enum resctrl_conf_type t;
cpumask_var_t cpu_mask;
- struct rdt_domain *d;
u32 idx;

if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -319,7 +319,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)

msr_param.res = NULL;
list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
- hw_dom = resctrl_to_arch_dom(d);
+ hw_dom = resctrl_to_arch_ctrl_dom(d);
for (t = 0; t < CDP_NUM_TYPES; t++) {
cfg = &hw_dom->d_resctrl.staged_config[t];
if (!cfg->have_new_ctrl)
@@ -449,10 +449,10 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
return ret ?: nbytes;
}

-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type type)
{
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
u32 idx = get_config_index(closid, type);

return hw_dom->ctrl_val[idx];
@@ -461,7 +461,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid)
{
struct rdt_resource *r = schema->res;
- struct rdt_domain *dom;
+ struct rdt_ctrl_domain *dom;
bool sep = false;
u32 ctrl_val;

@@ -523,7 +523,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
}

void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
int evtid, int first)
{
/*
@@ -543,11 +543,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
struct rdt_domain_hdr *hdr;
+ struct rdt_mon_domain *d;
u32 resid, evtid, domid;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
union mon_data_bits md;
- struct rdt_domain *d;
struct rmid_read rr;
int ret = 0;

@@ -568,7 +568,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
ret = -ENOENT;
goto out;
}
- d = container_of(hdr, struct rdt_domain, hdr);
+ d = container_of(hdr, struct rdt_mon_domain, hdr);

mon_event_read(&rr, r, d, rdtgrp, evtid, false);

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ec5ad926c5dc..4e145f5620b0 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -170,7 +170,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
return 0;
}

-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
u32 rmid,
enum resctrl_event_id eventid)
{
@@ -189,10 +189,10 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
return NULL;
}

-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 rmid, enum resctrl_event_id eventid)
{
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
struct arch_mbm_state *am;

am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -208,9 +208,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
* Assumes that hardware counters are also reset and thus that there is
* no need to record initial non-zero counts.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
{
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);

if (is_mbm_total_enabled())
memset(hw_dom->arch_mbm_total, 0,
@@ -229,11 +229,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
return chunks >> shift;
}

-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 rmid, enum resctrl_event_id eventid, u64 *val)
{
+ struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
struct arch_mbm_state *am;
u64 msr_val, chunks;
int ret;
@@ -266,7 +266,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
* decrement the count. If the busy count gets to zero on an RMID, we
* free the RMID
*/
-void __check_limbo(struct rdt_domain *d, bool force_free)
+void __check_limbo(struct rdt_mon_domain *d, bool force_free)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
struct rmid_entry *entry;
@@ -305,7 +305,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
}
}

-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d)
{
return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
}
@@ -334,7 +334,7 @@ int alloc_rmid(void)
static void add_rmid_to_limbo(struct rmid_entry *entry)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;
int cpu, err;
u64 val = 0;

@@ -383,7 +383,7 @@ void free_rmid(u32 rmid)
list_add_tail(&entry->list, &rmid_free_lru);
}

-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
+static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 rmid,
enum resctrl_event_id evtid)
{
switch (evtid) {
@@ -516,13 +516,13 @@ void mon_event_count(void *info)
* throttle MSRs already have low percentage values. To avoid
* unnecessarily restricting such rdtgroups, we also increase the bandwidth.
*/
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
{
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
+ struct rdt_ctrl_domain *dom_mba;
u32 cur_bw, delta_bw, user_bw;
struct rdt_resource *r_mba;
- struct rdt_domain *dom_mba;
struct list_head *head;
struct rdtgroup *entry;

@@ -600,7 +600,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
}
}

-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
+static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d, int rmid)
{
struct rmid_read rr;

@@ -640,13 +640,13 @@ void cqm_handle_limbo(struct work_struct *work)
{
unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
int cpu = smp_processor_id();
+ struct rdt_mon_domain *d;
struct rdt_resource *r;
- struct rdt_domain *d;

mutex_lock(&rdtgroup_mutex);

r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- d = container_of(work, struct rdt_domain, cqm_limbo.work);
+ d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);

__check_limbo(d, false);

@@ -656,7 +656,7 @@ void cqm_handle_limbo(struct work_struct *work)
mutex_unlock(&rdtgroup_mutex);
}

-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;
@@ -672,9 +672,9 @@ void mbm_handle_overflow(struct work_struct *work)
unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
struct rdtgroup *prgrp, *crgrp;
int cpu = smp_processor_id();
+ struct rdt_mon_domain *d;
struct list_head *head;
struct rdt_resource *r;
- struct rdt_domain *d;

mutex_lock(&rdtgroup_mutex);

@@ -682,7 +682,7 @@ void mbm_handle_overflow(struct work_struct *work)
goto out_unlock;

r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- d = container_of(work, struct rdt_domain, mbm_over.work);
+ d = container_of(work, struct rdt_mon_domain, mbm_over.work);

list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
mbm_update(r, d, prgrp->mon.rmid);
@@ -701,7 +701,7 @@ void mbm_handle_overflow(struct work_struct *work)
mutex_unlock(&rdtgroup_mutex);
}

-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index ed6d59af1cef..08d35f828bc3 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -814,7 +814,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
* Return: true if @cbm overlaps with pseudo-locked region on @d, false
* otherwise.
*/
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm)
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm)
{
unsigned int cbm_len;
unsigned long cbm_b;
@@ -841,11 +841,11 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm
* if it is not possible to test due to memory allocation issue,
* false otherwise.
*/
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d)
{
+ struct rdt_ctrl_domain *d_i;
cpumask_var_t cpu_with_psl;
struct rdt_resource *r;
- struct rdt_domain *d_i;
bool ret = false;

if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 760013ed1bff..21bbd832f3f2 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -85,8 +85,8 @@ void rdt_last_cmd_printf(const char *fmt, ...)

void rdt_staged_configs_clear(void)
{
+ struct rdt_ctrl_domain *dom;
struct rdt_resource *r;
- struct rdt_domain *dom;

lockdep_assert_held(&rdtgroup_mutex);

@@ -976,7 +976,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
unsigned long sw_shareable = 0, hw_shareable = 0;
unsigned long exclusive = 0, pseudo_locked = 0;
struct rdt_resource *r = s->res;
- struct rdt_domain *dom;
+ struct rdt_ctrl_domain *dom;
int i, hwb, swb, excl, psl;
enum rdtgrp_mode mode;
bool sep = false;
@@ -1205,7 +1205,7 @@ static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of,
*
* Return: false if CBM does not overlap, true if it does.
*/
-static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
+static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_domain *d,
unsigned long cbm, int closid,
enum resctrl_conf_type type, bool exclusive)
{
@@ -1260,7 +1260,7 @@ static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d
*
* Return: true if CBM overlap detected, false if there is no overlap
*/
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
unsigned long cbm, int closid, bool exclusive)
{
enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -1291,10 +1291,10 @@ bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
{
int closid = rdtgrp->closid;
+ struct rdt_ctrl_domain *d;
struct resctrl_schema *s;
struct rdt_resource *r;
bool has_cache = false;
- struct rdt_domain *d;
u32 ctrl;

list_for_each_entry(s, &resctrl_schema_all, list) {
@@ -1407,7 +1407,7 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of,
* bitmap functions work correctly.
*/
unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
- struct rdt_domain *d, unsigned long cbm)
+ struct rdt_ctrl_domain *d, unsigned long cbm)
{
struct cpu_cacheinfo *ci;
unsigned int size = 0;
@@ -1439,9 +1439,9 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
{
struct resctrl_schema *schema;
enum resctrl_conf_type type;
+ struct rdt_ctrl_domain *d;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
- struct rdt_domain *d;
unsigned int size;
int ret = 0;
u32 closid;
@@ -1553,7 +1553,7 @@ static void mon_event_config_read(void *info)
mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
}

-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
+static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info)
{
smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
}
@@ -1561,7 +1561,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
{
struct mon_config_info mon_info = {0};
- struct rdt_domain *dom;
+ struct rdt_mon_domain *dom;
bool sep = false;

mutex_lock(&rdtgroup_mutex);
@@ -1618,7 +1618,7 @@ static void mon_event_config_write(void *info)
}

static int mbm_config_write_domain(struct rdt_resource *r,
- struct rdt_domain *d, u32 evtid, u32 val)
+ struct rdt_mon_domain *d, u32 evtid, u32 val)
{
struct mon_config_info mon_info = {0};
int ret = 0;
@@ -1668,7 +1668,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
{
char *dom_str = NULL, *id_str;
unsigned long dom_id, val;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;
int ret = 0;

next:
@@ -2216,9 +2216,9 @@ static inline bool is_mba_linear(void)
static int set_cache_qos_cfg(int level, bool enable)
{
void (*update)(void *arg);
+ struct rdt_ctrl_domain *d;
struct rdt_resource *r_l;
cpumask_var_t cpu_mask;
- struct rdt_domain *d;
int cpu;

if (level == RDT_RESOURCE_L3)
@@ -2265,7 +2265,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
l3_qos_cfg_update(&hw_res->cdp_enabled);
}

-static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
+static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
u32 num_closid = resctrl_arch_get_num_closid(r);
int cpu = cpumask_any(&d->hdr.cpu_mask);
@@ -2283,7 +2283,7 @@ static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
}

static void mba_sc_domain_destroy(struct rdt_resource *r,
- struct rdt_domain *d)
+ struct rdt_ctrl_domain *d)
{
kfree(d->mbps_val);
d->mbps_val = NULL;
@@ -2309,7 +2309,7 @@ static int set_mba_sc(bool mba_sc)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
u32 num_closid = resctrl_arch_get_num_closid(r);
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
int i;

if (!supports_mba_mbps() || mba_sc == is_mba_sc(r))
@@ -2578,7 +2578,7 @@ static int rdt_get_tree(struct fs_context *fc)
{
struct rdt_fs_context *ctx = rdt_fc2context(fc);
unsigned long flags = RFTYPE_CTRL_BASE;
- struct rdt_domain *dom;
+ struct rdt_mon_domain *dom;
struct rdt_resource *r;
int ret;

@@ -2762,10 +2762,10 @@ static int rdt_init_fs_context(struct fs_context *fc)
static int reset_all_ctrls(struct rdt_resource *r)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_ctrl_domain *hw_dom;
struct msr_param msr_param;
+ struct rdt_ctrl_domain *d;
cpumask_var_t cpu_mask;
- struct rdt_domain *d;
int i;

if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -2781,7 +2781,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
* from each domain to update the MSRs below.
*/
list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
- hw_dom = resctrl_to_arch_dom(d);
+ hw_dom = resctrl_to_arch_ctrl_dom(d);
cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);

for (i = 0; i < hw_res->num_closid; i++)
@@ -2976,7 +2976,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
}

static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
- struct rdt_domain *d,
+ struct rdt_mon_domain *d,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
union mon_data_bits priv;
@@ -3025,7 +3025,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
* and "monitor" groups with given domain id.
*/
static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- struct rdt_domain *d)
+ struct rdt_mon_domain *d)
{
struct kernfs_node *parent_kn;
struct rdtgroup *prgrp, *crgrp;
@@ -3047,7 +3047,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_resource *r,
struct rdtgroup *prgrp)
{
- struct rdt_domain *dom;
+ struct rdt_mon_domain *dom;
int ret;

list_for_each_entry(dom, &r->mon_domains, hdr.list) {
@@ -3149,7 +3149,7 @@ static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r)
* Set the RDT domain up to start off with all usable allocations. That is,
* all shareable and unused bits. All-zero CBM is invalid.
*/
-static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
+static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schema *s,
u32 closid)
{
enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -3229,7 +3229,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
*/
static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
{
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
int ret;

list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
@@ -3245,7 +3245,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
{
struct resctrl_staged_config *cfg;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;

list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
if (is_mba_sc(r)) {
@@ -3842,14 +3842,14 @@ static void __init rdtgroup_setup_default(void)
mutex_unlock(&rdtgroup_mutex);
}

-static void domain_destroy_mon_state(struct rdt_domain *d)
+static void domain_destroy_mon_state(struct rdt_mon_domain *d)
{
bitmap_free(d->rmid_busy_llc);
kfree(d->mbm_total);
kfree(d->mbm_local);
}

-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
lockdep_assert_held(&rdtgroup_mutex);

@@ -3857,7 +3857,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
mba_sc_domain_destroy(r, d);
}

-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
{
lockdep_assert_held(&rdtgroup_mutex);

@@ -3886,7 +3886,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
domain_destroy_mon_state(d);
}

-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
{
size_t tsize;

@@ -3916,7 +3916,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
lockdep_assert_held(&rdtgroup_mutex);

@@ -3926,7 +3926,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
{
int err;

--
2.41.0
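
The rdt_ctrl_domain/rdt_mon_domain split above leans on a common kernel
pattern: both domain flavors embed a shared header, generic code locates
that header on a list, and container_of() recovers the full type after
checking hdr->type (the WARN_ON_ONCE() checks in domain_add_cpu_ctrl()
and friends). A minimal userspace sketch of that pattern; the names and
fields here are illustrative, not the kernel's exact definitions:

#include <stddef.h>
#include <stdio.h>

/* Userspace stand-in for the kernel's container_of() helper. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

enum domain_type { CTRL_DOMAIN, MON_DOMAIN };

/* Common header shared by both domain flavors (cf. rdt_domain_hdr). */
struct domain_hdr {
	enum domain_type type;
	int id;
};

struct ctrl_domain {
	struct domain_hdr hdr;
	unsigned int *ctrl_val;		/* control-only state */
};

struct mon_domain {
	struct domain_hdr hdr;
	unsigned long *rmid_busy;	/* monitor-only state */
};

/* Generic code sees only the header; callers recover the full type. */
static void print_ctrl_domain(struct domain_hdr *hdr)
{
	struct ctrl_domain *d;

	if (hdr->type != CTRL_DOMAIN)	/* cf. the WARN_ON_ONCE() checks */
		return;
	d = container_of(hdr, struct ctrl_domain, hdr);
	printf("ctrl domain %d\n", d->hdr.id);
}

int main(void)
{
	struct ctrl_domain cd = { .hdr = { CTRL_DOMAIN, 0 } };

	print_ctrl_domain(&cd.hdr);
	return 0;
}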

2023-12-04 21:21:07

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems

Hi Tony,

Tested the series on an AMD system. Ran a few basic tests and everything
is looking good.

Thanks

Babu

On 12/4/2023 12:53 PM, Tony Luck wrote:
> The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
> that share an L3 cache into two or more sets. This plays havoc with the
> Resource Director Technology (RDT) monitoring features. Prior to this
> patch, Intel had advised that SNC and RDT are incompatible.
>
> Some of these CPUs support an MSR that can partition the RMID counters in
> the same way. This allows monitoring features to be used, with the caveat
> that users must be aware that Linux may migrate tasks more frequently
> between SNC nodes than between "regular" NUMA nodes, so reading counters
> from all SNC nodes may be needed to get a complete picture of activity
> for tasks.
>
> Cache and memory bandwidth allocation features continue to operate at
> the scope of the L3 cache.
>
> Signed-off-by: Tony Luck <[email protected]>
>
> Changes since v12:
>
> All:
> Reinette - put commit tags in right order for TIP (Tested-by before
> Reviewed-by)
>
> Patch 7:
> Fam Zheng - Check for -1 return from get_cpu_cacheinfo_id() and
> increase size of bitmap tracking # of L3 instances.
> Reinette - Add extra sanity checks. Note that this patch has
> some additional tweaks beyond the e-mail discussion.
> 1) "3" is a valid return in addition to 1, 2, 4
> 2) Added a warning if the sanity checks fail that
> prints number of CPU nodes and number of L3 cache
> instances that were found.
>
> Patch 8:
> Babu - Fix grammar with an additional comma.
>
>
> Tony Luck (8):
> x86/resctrl: Prepare for new domain scope
> x86/resctrl: Prepare to split rdt_domain structure
> x86/resctrl: Prepare for different scope for control/monitor
> operations
> x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
> x86/resctrl: Add node-scope to the options for feature scope
> x86/resctrl: Introduce snc_nodes_per_l3_cache
> x86/resctrl: Sub NUMA Cluster detection and enable
> x86/resctrl: Update documentation with Sub-NUMA cluster changes
>
> Documentation/arch/x86/resctrl.rst | 25 +-
> include/linux/resctrl.h | 85 +++--
> arch/x86/include/asm/msr-index.h | 1 +
> arch/x86/kernel/cpu/resctrl/internal.h | 66 ++--
> arch/x86/kernel/cpu/resctrl/core.c | 433 +++++++++++++++++-----
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 58 +--
> arch/x86/kernel/cpu/resctrl/monitor.c | 68 ++--
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 26 +-
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 149 ++++----
> 9 files changed, 629 insertions(+), 282 deletions(-)
>
>
> base-commit: 2cc14f52aeb78ce3f29677c2de1f06c0e91471ab
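
As the quoted cover letter notes, with SNC enabled each L3 cache is
represented by one monitoring domain per SNC node, so a complete
bandwidth picture for a group means summing an event across every
mon_L3_* directory. A rough userspace sketch of that aggregation,
assuming the conventional /sys/fs/resctrl mount and the default group;
treat the exact paths as illustrative:

#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *base = "/sys/fs/resctrl/mon_data";
	unsigned long long total = 0, val;
	struct dirent *de;
	char path[256];
	DIR *dir;
	FILE *f;

	dir = opendir(base);
	if (!dir) {
		perror(base);
		return 1;
	}
	while ((de = readdir(dir))) {
		/* One mon_L3_* directory per SNC node when SNC is on. */
		if (strncmp(de->d_name, "mon_L3_", 7))
			continue;
		snprintf(path, sizeof(path), "%s/%s/mbm_total_bytes",
			 base, de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (fscanf(f, "%llu", &val) == 1)
			total += val;	/* sum across all SNC nodes */
		fclose(f);
	}
	closedir(dir);
	printf("MBM total across all SNC nodes: %llu bytes\n", total);
	return 0;
}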

2023-12-04 21:34:50

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems

> Tested the series on an AMD system. Ran a few basic tests and everything
> is looking good.

Babu,

Thanks for testing. I'll add your Tested-by tag if[1] I make a v14.

-Tony

[1] realistically not if, but when :-(

2023-12-04 23:04:27

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH v13 2/8] x86/resctrl: Prepare to split rdt_domain structure


On 12/4/2023 12:53 PM, Tony Luck wrote:
> The rdt_domain structure is used for both control and monitor features.
> It is about to be split into separate structures for these two usages
> because the scope for control and monitoring features for a resource
> will be different for future resources.
>
> To allow for common code that scans a list of domains looking for a
> specific domain id, move all the common fields ("list", "id", "cpu_mask")
> into their own structure within the rdt_domain structure.
>
> Signed-off-by: Tony Luck<[email protected]>
> Tested-by: Shaopeng Tan<[email protected]>
> Reviewed-by: Reinette Chatre<[email protected]>
> Reviewed-by: Shaopeng Tan<[email protected]>
Reviewed-by: Babu Moger <[email protected]>
> ---
> include/linux/resctrl.h | 16 ++++--
> arch/x86/kernel/cpu/resctrl/core.c | 26 +++++-----
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 22 ++++-----
> arch/x86/kernel/cpu/resctrl/monitor.c | 10 ++--
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 14 +++---
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 60 +++++++++++------------
> 6 files changed, 78 insertions(+), 70 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 7d4eb7df611d..c4067150a6b7 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -53,10 +53,20 @@ struct resctrl_staged_config {
> };
>
> /**
> - * struct rdt_domain - group of CPUs sharing a resctrl resource
> + * struct rdt_domain_hdr - common header for different domain types
> * @list: all instances of this resource
> * @id: unique id for this instance
> * @cpu_mask: which CPUs share this resource
> + */
> +struct rdt_domain_hdr {
> + struct list_head list;
> + int id;
> + struct cpumask cpu_mask;
> +};
> +
> +/**
> + * struct rdt_domain - group of CPUs sharing a resctrl resource
> + * @hdr: common header for different domain types
> * @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
> * @mbm_total: saved state for MBM total bandwidth
> * @mbm_local: saved state for MBM local bandwidth
> @@ -71,9 +81,7 @@ struct resctrl_staged_config {
> * by closid
> */
> struct rdt_domain {
> - struct list_head list;
> - int id;
> - struct cpumask cpu_mask;
> + struct rdt_domain_hdr hdr;
> unsigned long *rmid_busy_llc;
> struct mbm_state *mbm_total;
> struct mbm_state *mbm_local;
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index fd113bc29d4e..62a989fd950d 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -356,9 +356,9 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
> {
> struct rdt_domain *d;
>
> - list_for_each_entry(d, &r->domains, list) {
> + list_for_each_entry(d, &r->domains, hdr.list) {
> /* Find the domain that contains this CPU */
> - if (cpumask_test_cpu(cpu, &d->cpu_mask))
> + if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
> return d;
> }
>
> @@ -402,12 +402,12 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> struct list_head *l;
>
> list_for_each(l, &r->domains) {
> - d = list_entry(l, struct rdt_domain, list);
> + d = list_entry(l, struct rdt_domain, hdr.list);
> /* When id is found, return its domain. */
> - if (id == d->id)
> + if (id == d->hdr.id)
> return d;
> /* Stop searching when finding id's position in sorted list. */
> - if (id < d->id)
> + if (id < d->hdr.id)
> break;
> }
>
> @@ -530,7 +530,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>
> d = rdt_find_domain(r, id, &add_pos);
> if (d) {
> - cpumask_set_cpu(cpu, &d->cpu_mask);
> + cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
> if (r->cache.arch_has_per_cpu_cfg)
> rdt_domain_reconfigure_cdp(r);
> return;
> @@ -541,8 +541,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
> return;
>
> d = &hw_dom->d_resctrl;
> - d->id = id;
> - cpumask_set_cpu(cpu, &d->cpu_mask);
> + d->hdr.id = id;
> + cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
>
> rdt_domain_reconfigure_cdp(r);
>
> @@ -556,11 +556,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
> return;
> }
>
> - list_add_tail(&d->list, add_pos);
> + list_add_tail(&d->hdr.list, add_pos);
>
> err = resctrl_online_domain(r, d);
> if (err) {
> - list_del(&d->list);
> + list_del(&d->hdr.list);
> domain_free(hw_dom);
> }
> }
> @@ -584,10 +584,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> }
> hw_dom = resctrl_to_arch_dom(d);
>
> - cpumask_clear_cpu(cpu, &d->cpu_mask);
> - if (cpumask_empty(&d->cpu_mask)) {
> + cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
> + if (cpumask_empty(&d->hdr.cpu_mask)) {
> resctrl_offline_domain(r, d);
> - list_del(&d->list);
> + list_del(&d->hdr.list);
>
> /*
> * rdt_domain "d" is going to be freed below, so clear
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 3f8891d57fac..23f8258d36a8 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -67,7 +67,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
>
> cfg = &d->staged_config[s->conf_type];
> if (cfg->have_new_ctrl) {
> - rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
> + rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
> return -EINVAL;
> }
>
> @@ -146,7 +146,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
>
> cfg = &d->staged_config[s->conf_type];
> if (cfg->have_new_ctrl) {
> - rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
> + rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
> return -EINVAL;
> }
>
> @@ -226,8 +226,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
> return -EINVAL;
> }
> dom = strim(dom);
> - list_for_each_entry(d, &r->domains, list) {
> - if (d->id == dom_id) {
> + list_for_each_entry(d, &r->domains, hdr.list) {
> + if (d->hdr.id == dom_id) {
> data.buf = dom;
> data.rdtgrp = rdtgrp;
> if (r->parse_ctrlval(&data, s, d))
> @@ -274,7 +274,7 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
> struct rdt_domain *dom = &hw_dom->d_resctrl;
>
> if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
> - cpumask_set_cpu(cpumask_any(&dom->cpu_mask), cpu_mask);
> + cpumask_set_cpu(cpumask_any(&dom->hdr.cpu_mask), cpu_mask);
> hw_dom->ctrl_val[idx] = cfg->new_ctrl;
>
> return true;
> @@ -291,7 +291,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
> u32 idx = get_config_index(closid, t);
> struct msr_param msr_param;
>
> - if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
> + if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
> return -EINVAL;
>
> hw_dom->ctrl_val[idx] = cfg_val;
> @@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
> return -ENOMEM;
>
> msr_param.res = NULL;
> - list_for_each_entry(d, &r->domains, list) {
> + list_for_each_entry(d, &r->domains, hdr.list) {
> hw_dom = resctrl_to_arch_dom(d);
> for (t = 0; t < CDP_NUM_TYPES; t++) {
> cfg = &hw_dom->d_resctrl.staged_config[t];
> @@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
> u32 ctrl_val;
>
> seq_printf(s, "%*s:", max_name_width, schema->name);
> - list_for_each_entry(dom, &r->domains, list) {
> + list_for_each_entry(dom, &r->domains, hdr.list) {
> if (sep)
> seq_puts(s, ";");
>
> @@ -476,7 +476,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
> ctrl_val = resctrl_arch_get_config(r, dom, closid,
> schema->conf_type);
>
> - seq_printf(s, r->format_str, dom->id, max_data_width,
> + seq_printf(s, r->format_str, dom->hdr.id, max_data_width,
> ctrl_val);
> sep = true;
> }
> @@ -505,7 +505,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
> } else {
> seq_printf(s, "%s:%d=%x\n",
> rdtgrp->plr->s->res->name,
> - rdtgrp->plr->d->id,
> + rdtgrp->plr->d->hdr.id,
> rdtgrp->plr->cbm);
> }
> } else {
> @@ -536,7 +536,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> rr->val = 0;
> rr->first = first;
>
> - smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
> + smp_call_function_any(&d->hdr.cpu_mask, mon_event_count, rr, 1);
> }
>
> int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index f136ac046851..dd0ea1bc0092 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -238,7 +238,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> u64 msr_val, chunks;
> int ret;
>
> - if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
> + if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
> return -EINVAL;
>
> ret = __rmid_read(rmid, eventid, &msr_val);
> @@ -340,8 +340,8 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
>
> entry->busy = 0;
> cpu = get_cpu();
> - list_for_each_entry(d, &r->domains, list) {
> - if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
> + list_for_each_entry(d, &r->domains, hdr.list) {
> + if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) {
> err = resctrl_arch_rmid_read(r, d, entry->rmid,
> QOS_L3_OCCUP_EVENT_ID,
> &val);
> @@ -661,7 +661,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
> unsigned long delay = msecs_to_jiffies(delay_ms);
> int cpu;
>
> - cpu = cpumask_any(&dom->cpu_mask);
> + cpu = cpumask_any(&dom->hdr.cpu_mask);
> dom->cqm_work_cpu = cpu;
>
> schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
> @@ -708,7 +708,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
>
> if (!static_branch_likely(&rdt_mon_enable_key))
> return;
> - cpu = cpumask_any(&dom->cpu_mask);
> + cpu = cpumask_any(&dom->hdr.cpu_mask);
> dom->mbm_work_cpu = cpu;
> schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
> }
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 2a682da9f43a..fcbd99e2eb66 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -221,7 +221,7 @@ static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr)
> int cpu;
> int ret;
>
> - for_each_cpu(cpu, &plr->d->cpu_mask) {
> + for_each_cpu(cpu, &plr->d->hdr.cpu_mask) {
> pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL);
> if (!pm_req) {
> rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n");
> @@ -301,7 +301,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
> return -ENODEV;
>
> /* Pick the first cpu we find that is associated with the cache. */
> - plr->cpu = cpumask_first(&plr->d->cpu_mask);
> + plr->cpu = cpumask_first(&plr->d->hdr.cpu_mask);
>
> if (!cpu_online(plr->cpu)) {
> rdt_last_cmd_printf("CPU %u associated with cache not online\n",
> @@ -856,10 +856,10 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
> * associated with them.
> */
> for_each_alloc_capable_rdt_resource(r) {
> - list_for_each_entry(d_i, &r->domains, list) {
> + list_for_each_entry(d_i, &r->domains, hdr.list) {
> if (d_i->plr)
> cpumask_or(cpu_with_psl, cpu_with_psl,
> - &d_i->cpu_mask);
> + &d_i->hdr.cpu_mask);
> }
> }
>
> @@ -867,7 +867,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
> * Next test if new pseudo-locked region would intersect with
> * existing region.
> */
> - if (cpumask_intersects(&d->cpu_mask, cpu_with_psl))
> + if (cpumask_intersects(&d->hdr.cpu_mask, cpu_with_psl))
> ret = true;
>
> free_cpumask_var(cpu_with_psl);
> @@ -1199,7 +1199,7 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
> }
>
> plr->thread_done = 0;
> - cpu = cpumask_first(&plr->d->cpu_mask);
> + cpu = cpumask_first(&plr->d->hdr.cpu_mask);
> if (!cpu_online(cpu)) {
> ret = -ENODEV;
> goto out;
> @@ -1529,7 +1529,7 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
> * may be scheduled elsewhere and invalidate entries in the
> * pseudo-locked region.
> */
> - if (!cpumask_subset(current->cpus_ptr, &plr->d->cpu_mask)) {
> + if (!cpumask_subset(current->cpus_ptr, &plr->d->hdr.cpu_mask)) {
> mutex_unlock(&rdtgroup_mutex);
> return -EINVAL;
> }
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index c44be64d65ec..04d32602ac33 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
> lockdep_assert_held(&rdtgroup_mutex);
>
> for_each_alloc_capable_rdt_resource(r) {
> - list_for_each_entry(dom, &r->domains, list)
> + list_for_each_entry(dom, &r->domains, hdr.list)
> memset(dom->staged_config, 0, sizeof(dom->staged_config));
> }
> }
> @@ -295,7 +295,7 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
> rdt_last_cmd_puts("Cache domain offline\n");
> ret = -ENODEV;
> } else {
> - mask = &rdtgrp->plr->d->cpu_mask;
> + mask = &rdtgrp->plr->d->hdr.cpu_mask;
> seq_printf(s, is_cpu_list(of) ?
> "%*pbl\n" : "%*pb\n",
> cpumask_pr_args(mask));
> @@ -984,12 +984,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
>
> mutex_lock(&rdtgroup_mutex);
> hw_shareable = r->cache.shareable_bits;
> - list_for_each_entry(dom, &r->domains, list) {
> + list_for_each_entry(dom, &r->domains, hdr.list) {
> if (sep)
> seq_putc(seq, ';');
> sw_shareable = 0;
> exclusive = 0;
> - seq_printf(seq, "%d=", dom->id);
> + seq_printf(seq, "%d=", dom->hdr.id);
> for (i = 0; i < closids_supported(); i++) {
> if (!closid_allocated(i))
> continue;
> @@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
> if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
> continue;
> has_cache = true;
> - list_for_each_entry(d, &r->domains, list) {
> + list_for_each_entry(d, &r->domains, hdr.list) {
> ctrl = resctrl_arch_get_config(r, d, closid,
> s->conf_type);
> if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
> @@ -1417,7 +1417,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
> return size;
>
> num_b = bitmap_weight(&cbm, r->cache.cbm_len);
> - ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
> + ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
> for (i = 0; i < ci->num_leaves; i++) {
> if (ci->info_list[i].level == r->scope) {
> size = ci->info_list[i].size / r->cache.cbm_len * num_b;
> @@ -1465,7 +1465,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
> size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res,
> rdtgrp->plr->d,
> rdtgrp->plr->cbm);
> - seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size);
> + seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size);
> }
> goto out;
> }
> @@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
> type = schema->conf_type;
> sep = false;
> seq_printf(s, "%*s:", max_name_width, schema->name);
> - list_for_each_entry(d, &r->domains, list) {
> + list_for_each_entry(d, &r->domains, hdr.list) {
> if (sep)
> seq_putc(s, ';');
> if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
> @@ -1495,7 +1495,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
> else
> size = rdtgroup_cbm_to_size(r, d, ctrl);
> }
> - seq_printf(s, "%d=%u", d->id, size);
> + seq_printf(s, "%d=%u", d->hdr.id, size);
> sep = true;
> }
> seq_putc(s, '\n');
> @@ -1555,7 +1555,7 @@ static void mon_event_config_read(void *info)
>
> static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
> {
> - smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
> + smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
> }
>
> static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
> @@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
>
> mutex_lock(&rdtgroup_mutex);
>
> - list_for_each_entry(dom, &r->domains, list) {
> + list_for_each_entry(dom, &r->domains, hdr.list) {
> if (sep)
> seq_puts(s, ";");
>
> @@ -1574,7 +1574,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
> mon_info.evtid = evtid;
> mondata_config_read(dom, &mon_info);
>
> - seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
> + seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
> sep = true;
> }
> seq_puts(s, "\n");
> @@ -1646,7 +1646,7 @@ static int mbm_config_write_domain(struct rdt_resource *r,
> * are scoped at the domain level. Writing any of these MSRs
> * on one CPU is observed by all the CPUs in the domain.
> */
> - smp_call_function_any(&d->cpu_mask, mon_event_config_write,
> + smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_write,
> &mon_info, 1);
>
> /*
> @@ -1689,8 +1689,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
> return -EINVAL;
> }
>
> - list_for_each_entry(d, &r->domains, list) {
> - if (d->id == dom_id) {
> + list_for_each_entry(d, &r->domains, hdr.list) {
> + if (d->hdr.id == dom_id) {
> ret = mbm_config_write_domain(r, d, evtid, val);
> if (ret)
> return -EINVAL;
> @@ -2232,14 +2232,14 @@ static int set_cache_qos_cfg(int level, bool enable)
> return -ENOMEM;
>
> r_l = &rdt_resources_all[level].r_resctrl;
> - list_for_each_entry(d, &r_l->domains, list) {
> + list_for_each_entry(d, &r_l->domains, hdr.list) {
> if (r_l->cache.arch_has_per_cpu_cfg)
> /* Pick all the CPUs in the domain instance */
> - for_each_cpu(cpu, &d->cpu_mask)
> + for_each_cpu(cpu, &d->hdr.cpu_mask)
> cpumask_set_cpu(cpu, cpu_mask);
> else
> /* Pick one CPU from each domain instance to update MSR */
> - cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
> + cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
> }
>
> /* Update QOS_CFG MSR on all the CPUs in cpu_mask */
> @@ -2268,7 +2268,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
> static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
> {
> u32 num_closid = resctrl_arch_get_num_closid(r);
> - int cpu = cpumask_any(&d->cpu_mask);
> + int cpu = cpumask_any(&d->hdr.cpu_mask);
> int i;
>
> d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val),
> @@ -2317,7 +2317,7 @@ static int set_mba_sc(bool mba_sc)
>
> r->membw.mba_sc = mba_sc;
>
> - list_for_each_entry(d, &r->domains, list) {
> + list_for_each_entry(d, &r->domains, hdr.list) {
> for (i = 0; i < num_closid; i++)
> d->mbps_val[i] = MBA_MAX_MBPS;
> }
> @@ -2653,7 +2653,7 @@ static int rdt_get_tree(struct fs_context *fc)
>
> if (is_mbm_enabled()) {
> r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> - list_for_each_entry(dom, &r->domains, list)
> + list_for_each_entry(dom, &r->domains, hdr.list)
> mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
> }
>
> @@ -2780,9 +2780,9 @@ static int reset_all_ctrls(struct rdt_resource *r)
> * CBMs in all domains to the maximum mask value. Pick one CPU
> * from each domain to update the MSRs below.
> */
> - list_for_each_entry(d, &r->domains, list) {
> + list_for_each_entry(d, &r->domains, hdr.list) {
> hw_dom = resctrl_to_arch_dom(d);
> - cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
> + cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
>
> for (i = 0; i < hw_res->num_closid; i++)
> hw_dom->ctrl_val[i] = r->default_ctrl;
> @@ -2986,7 +2986,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> char name[32];
> int ret;
>
> - sprintf(name, "mon_%s_%02d", r->name, d->id);
> + sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
> /* create the directory */
> kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
> if (IS_ERR(kn))
> @@ -3002,7 +3002,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> }
>
> priv.u.rid = r->rid;
> - priv.u.domid = d->id;
> + priv.u.domid = d->hdr.id;
> list_for_each_entry(mevt, &r->evt_list, list) {
> priv.u.evtid = mevt->evtid;
> ret = mon_addfile(kn, mevt->name, priv.priv);
> @@ -3050,7 +3050,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
> struct rdt_domain *dom;
> int ret;
>
> - list_for_each_entry(dom, &r->domains, list) {
> + list_for_each_entry(dom, &r->domains, hdr.list) {
> ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
> if (ret)
> return ret;
> @@ -3209,7 +3209,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
> */
> tmp_cbm = cfg->new_ctrl;
> if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) {
> - rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id);
> + rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id);
> return -ENOSPC;
> }
> cfg->have_new_ctrl = true;
> @@ -3232,7 +3232,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
> struct rdt_domain *d;
> int ret;
>
> - list_for_each_entry(d, &s->res->domains, list) {
> + list_for_each_entry(d, &s->res->domains, hdr.list) {
> ret = __init_one_rdt_domain(d, s, closid);
> if (ret < 0)
> return ret;
> @@ -3247,7 +3247,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
> struct resctrl_staged_config *cfg;
> struct rdt_domain *d;
>
> - list_for_each_entry(d, &r->domains, list) {
> + list_for_each_entry(d, &r->domains, hdr.list) {
> if (is_mba_sc(r)) {
> d->mbps_val[closid] = MBA_MAX_MBPS;
> continue;
> @@ -3864,7 +3864,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
> * per domain monitor data directories.
> */
> if (static_branch_unlikely(&rdt_mon_enable_key))
> - rmdir_mondata_subdir_allrdtgrp(r, d->id);
> + rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
>
> if (is_mbm_enabled())
> cancel_delayed_work(&d->mbm_over);
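
The hdr.list and hdr.id conversions in the patch quoted above all feed
rdt_find_domain(), which walks a list kept sorted by domain id and
either returns the matching domain or reports where a new one must be
spliced in. A self-contained sketch of that search logic, using a plain
singly linked list in place of the kernel's struct list_head:

#include <stdio.h>
#include <stdlib.h>

struct domain {
	int id;
	struct domain *next;
};

/* Return the domain with @id, or NULL plus the sorted insertion point. */
static struct domain *find_domain(struct domain **head, int id,
				  struct domain ***insert_pos)
{
	struct domain **pp;

	for (pp = head; *pp; pp = &(*pp)->next) {
		if ((*pp)->id == id)
			return *pp;		/* found existing domain */
		if ((*pp)->id > id)
			break;			/* keep the list sorted */
	}
	if (insert_pos)
		*insert_pos = pp;
	return NULL;
}

int main(void)
{
	struct domain d1 = { .id = 1 }, d3 = { .id = 3 }, **pos;
	struct domain *head = &d1;

	d1.next = &d3;
	if (!find_domain(&head, 2, &pos)) {
		struct domain *d2 = malloc(sizeof(*d2));

		if (!d2)
			return 1;
		d2->id = 2;
		d2->next = *pos;	/* splice in before id 3 */
		*pos = d2;
	}
	for (struct domain *d = head; d; d = d->next)
		printf("domain %d\n", d->id);
	return 0;
}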

2023-12-04 23:05:21

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH v13 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures


On 12/4/2023 12:53 PM, Tony Luck wrote:
> The same rdt_domain structure is used for both control and monitor
> functions. But this results in wasted memory as some of the fields are
> only used by control functions, while most are only used for monitor
> functions.
>
> Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
> just the fields required for control and monitoring respectively.
>
> Similar split of the rdt_hw_domain structure into rdt_hw_ctrl_domain
> and rdt_hw_mon_domain.
>
> Signed-off-by: Tony Luck<[email protected]>
> Tested-by: Shaopeng Tan<[email protected]>
> Reviewed-by: Peter Newman<[email protected]>
> Reviewed-by: Reinette Chatre<[email protected]>
> Reviewed-by: Shaopeng Tan<[email protected]>
Reviewed-by: Babu Moger <[email protected]>
> ---
> include/linux/resctrl.h | 48 +++++++------
> arch/x86/kernel/cpu/resctrl/internal.h | 60 ++++++++++------
> arch/x86/kernel/cpu/resctrl/core.c | 87 ++++++++++++-----------
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 32 ++++-----
> arch/x86/kernel/cpu/resctrl/monitor.c | 40 +++++------
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 6 +-
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 62 ++++++++--------
> 7 files changed, 182 insertions(+), 153 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 35e700edc6e6..058a940c3239 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -72,7 +72,23 @@ struct rdt_domain_hdr {
> };
>
> /**
> - * struct rdt_domain - group of CPUs sharing a resctrl resource
> + * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
> + * @hdr: common header for different domain types
> + * @plr: pseudo-locked region (if any) associated with domain
> + * @staged_config: parsed configuration to be applied
> + * @mbps_val: When mba_sc is enabled, this holds the array of user
> + * specified control values for mba_sc in MBps, indexed
> + * by closid
> + */
> +struct rdt_ctrl_domain {
> + struct rdt_domain_hdr hdr;
> + struct pseudo_lock_region *plr;
> + struct resctrl_staged_config staged_config[CDP_NUM_TYPES];
> + u32 *mbps_val;
> +};
> +
> +/**
> + * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
> * @hdr: common header for different domain types
> * @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
> * @mbm_total: saved state for MBM total bandwidth
> @@ -81,13 +97,8 @@ struct rdt_domain_hdr {
> * @cqm_limbo: worker to periodically read CQM h/w counters
> * @mbm_work_cpu: worker CPU for MBM h/w counters
> * @cqm_work_cpu: worker CPU for CQM h/w counters
> - * @plr: pseudo-locked region (if any) associated with domain
> - * @staged_config: parsed configuration to be applied
> - * @mbps_val: When mba_sc is enabled, this holds the array of user
> - * specified control values for mba_sc in MBps, indexed
> - * by closid
> */
> -struct rdt_domain {
> +struct rdt_mon_domain {
> struct rdt_domain_hdr hdr;
> unsigned long *rmid_busy_llc;
> struct mbm_state *mbm_total;
> @@ -96,9 +107,6 @@ struct rdt_domain {
> struct delayed_work cqm_limbo;
> int mbm_work_cpu;
> int cqm_work_cpu;
> - struct pseudo_lock_region *plr;
> - struct resctrl_staged_config staged_config[CDP_NUM_TYPES];
> - u32 *mbps_val;
> };
>
> /**
> @@ -202,7 +210,7 @@ struct rdt_resource {
> const char *format_str;
> int (*parse_ctrlval)(struct rdt_parse_data *data,
> struct resctrl_schema *s,
> - struct rdt_domain *d);
> + struct rdt_ctrl_domain *d);
> struct list_head evt_list;
> unsigned long fflags;
> bool cdp_capable;
> @@ -236,15 +244,15 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
> * Update the ctrl_val and apply this config right now.
> * Must be called on one of the domain's CPUs.
> */
> -int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
> +int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
> u32 closid, enum resctrl_conf_type t, u32 cfg_val);
>
> -u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
> +u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
> u32 closid, enum resctrl_conf_type type);
> -int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
> -int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
> -void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
> -void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
> +int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
> +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
> +void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
> +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
>
> /**
> * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
> @@ -260,7 +268,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
> * Return:
> * 0 on success, or -EIO, -EINVAL etc on error.
> */
> -int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> +int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
> u32 rmid, enum resctrl_event_id eventid, u64 *val);
>
> /**
> @@ -273,7 +281,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> *
> * This can be called from any CPU.
> */
> -void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
> +void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
> u32 rmid, enum resctrl_event_id eventid);
>
> /**
> @@ -285,7 +293,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
> *
> * This can be called from any CPU.
> */
> -void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
> +void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
>
> extern unsigned int resctrl_rmid_realloc_threshold;
> extern unsigned int resctrl_rmid_realloc_limit;
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 24bf9d7989a9..ce3a70657842 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -107,7 +107,7 @@ union mon_data_bits {
> struct rmid_read {
> struct rdtgroup *rgrp;
> struct rdt_resource *r;
> - struct rdt_domain *d;
> + struct rdt_mon_domain *d;
> enum resctrl_event_id evtid;
> bool first;
> int err;
> @@ -192,7 +192,7 @@ struct mongroup {
> */
> struct pseudo_lock_region {
> struct resctrl_schema *s;
> - struct rdt_domain *d;
> + struct rdt_ctrl_domain *d;
> u32 cbm;
> wait_queue_head_t lock_thread_wq;
> int thread_done;
> @@ -319,25 +319,41 @@ struct arch_mbm_state {
> };
>
> /**
> - * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
> - * a resource
> + * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
> + * a resource for a control function
> * @d_resctrl: Properties exposed to the resctrl file system
> * @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID)
> + *
> + * Members of this structure are accessed via helpers that provide abstraction.
> + */
> +struct rdt_hw_ctrl_domain {
> + struct rdt_ctrl_domain d_resctrl;
> + u32 *ctrl_val;
> +};
> +
> +/**
> + * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
> + * a resource for a monitor function
> + * @d_resctrl: Properties exposed to the resctrl file system
> * @arch_mbm_total: arch private state for MBM total bandwidth
> * @arch_mbm_local: arch private state for MBM local bandwidth
> *
> * Members of this structure are accessed via helpers that provide abstraction.
> */
> -struct rdt_hw_domain {
> - struct rdt_domain d_resctrl;
> - u32 *ctrl_val;
> +struct rdt_hw_mon_domain {
> + struct rdt_mon_domain d_resctrl;
> struct arch_mbm_state *arch_mbm_total;
> struct arch_mbm_state *arch_mbm_local;
> };
>
> -static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
> +static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
> +{
> + return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
> +}
> +
> +static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
> {
> - return container_of(r, struct rdt_hw_domain, d_resctrl);
> + return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
> }
>
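The two helpers above are the usual container_of() downcast: the
arch-private struct embeds the fs-visible struct, and the helper
recovers the outer struct from a pointer to the embedded member. A
minimal standalone sketch of the pattern, using toy names rather than
the real resctrl types and a simplified container_of() without the
kernel's type checking:

    #include <stddef.h>

    #define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

    struct generic_dom {            /* what generic code sees */
        int id;
    };

    struct arch_dom {               /* arch-private wrapper */
        struct generic_dom d;       /* embedded, not a pointer */
        int arch_private;
    };

    static inline struct arch_dom *to_arch_dom(struct generic_dom *g)
    {
        return container_of(g, struct arch_dom, d);
    }

The cast is valid because the embedded member sits at a compile-time
constant offset inside the wrapper struct.
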
> /**
> @@ -405,7 +421,7 @@ struct rdt_hw_resource {
> struct rdt_resource r_resctrl;
> u32 num_closid;
> unsigned int msr_base;
> - void (*msr_update) (struct rdt_domain *d, struct msr_param *m,
> + void (*msr_update) (struct rdt_ctrl_domain *d, struct msr_param *m,
> struct rdt_resource *r);
> unsigned int mon_scale;
> unsigned int mbm_width;
> @@ -418,9 +434,9 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
> }
>
> int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
> - struct rdt_domain *d);
> + struct rdt_ctrl_domain *d);
> int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
> - struct rdt_domain *d);
> + struct rdt_ctrl_domain *d);
>
> extern struct mutex rdtgroup_mutex;
>
> @@ -526,21 +542,21 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
> char *buf, size_t nbytes, loff_t off);
> int rdtgroup_schemata_show(struct kernfs_open_file *of,
> struct seq_file *s, void *v);
> -bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
> +bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
> unsigned long cbm, int closid, bool exclusive);
> -unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d,
> +unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d,
> unsigned long cbm);
> enum rdtgrp_mode rdtgroup_mode_by_closid(int closid);
> int rdtgroup_tasks_assigned(struct rdtgroup *r);
> int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
> int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp);
> -bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm);
> -bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d);
> +bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm);
> +bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d);
> int rdt_pseudo_lock_init(void);
> void rdt_pseudo_lock_release(void);
> int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
> void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
> -struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
> +struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
> int closids_supported(void);
> void closid_free(int closid);
> int alloc_rmid(void);
> @@ -550,17 +566,17 @@ bool __init rdt_cpu_has(int flag);
> void mon_event_count(void *info);
> int rdtgroup_mondata_show(struct seq_file *m, void *arg);
> void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> - struct rdt_domain *d, struct rdtgroup *rdtgrp,
> + struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
> int evtid, int first);
> -void mbm_setup_overflow_handler(struct rdt_domain *dom,
> +void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
> unsigned long delay_ms);
> void mbm_handle_overflow(struct work_struct *work);
> void __init intel_rdt_mbm_apply_quirk(void);
> bool is_mba_sc(struct rdt_resource *r);
> -void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
> +void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms);
> void cqm_handle_limbo(struct work_struct *work);
> -bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
> -void __check_limbo(struct rdt_domain *d, bool force_free);
> +bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d);
> +void __check_limbo(struct rdt_mon_domain *d, bool force_free);
> void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
> void __init thread_throttle_mode_init(void);
> void __init mbm_config_rftype_init(const char *config);
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 1fd85533b4ca..797cb3bf417a 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -49,12 +49,12 @@ int max_name_width, max_data_width;
> bool rdt_alloc_capable;
>
> static void
> -mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
> +mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
> struct rdt_resource *r);
> static void
> -cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r);
> +cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r);
> static void
> -mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
> +mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m,
> struct rdt_resource *r);
>
> #define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
> @@ -307,11 +307,11 @@ static void rdt_get_cdp_l2_config(void)
> }
>
> static void
> -mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
> +mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
> {
> - unsigned int i;
> - struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> + struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> + unsigned int i;
>
> for (i = m->low; i < m->high; i++)
> wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
> @@ -332,12 +332,12 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
> }
>
> static void
> -mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
> +mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
> struct rdt_resource *r)
> {
> - unsigned int i;
> - struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> + struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> + unsigned int i;
>
> /* Write the delay values for mba. */
> for (i = m->low; i < m->high; i++)
> @@ -345,19 +345,19 @@ mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
> }
>
> static void
> -cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
> +cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
> {
> - unsigned int i;
> - struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> + struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> + unsigned int i;
>
> for (i = m->low; i < m->high; i++)
> wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
> }
>
> -struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
> +struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
> {
> - struct rdt_domain *d;
> + struct rdt_ctrl_domain *d;
>
> list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
> /* Find the domain that contains this CPU */
> @@ -379,7 +379,7 @@ void rdt_ctrl_update(void *arg)
> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
> struct rdt_resource *r = m->res;
> int cpu = smp_processor_id();
> - struct rdt_domain *d;
> + struct rdt_ctrl_domain *d;
>
> d = get_ctrl_domain_from_cpu(cpu, r);
> if (d) {
> @@ -434,18 +434,23 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
> *dc = r->default_ctrl;
> }
>
> -static void domain_free(struct rdt_hw_domain *hw_dom)
> +static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
> +{
> + kfree(hw_dom->ctrl_val);
> + kfree(hw_dom);
> +}
> +
> +static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
> {
> kfree(hw_dom->arch_mbm_total);
> kfree(hw_dom->arch_mbm_local);
> - kfree(hw_dom->ctrl_val);
> kfree(hw_dom);
> }
>
> -static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
> +static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
> {
> + struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> - struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> struct msr_param m;
> u32 *dc;
>
> @@ -468,7 +473,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
> * @num_rmid: The size of the MBM counter array
> * @hw_dom: The domain that owns the allocated arrays
> */
> -static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> +static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
> {
> size_t tsize;
>
> @@ -507,10 +512,10 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> {
> int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
> + struct rdt_hw_ctrl_domain *hw_dom;
> struct list_head *add_pos = NULL;
> - struct rdt_hw_domain *hw_dom;
> struct rdt_domain_hdr *hdr;
> - struct rdt_domain *d;
> + struct rdt_ctrl_domain *d;
> int err;
>
> if (id < 0) {
> @@ -524,7 +529,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
> return;
>
> - d = container_of(hdr, struct rdt_domain, hdr);
> + d = container_of(hdr, struct rdt_ctrl_domain, hdr);
>
> cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
> if (r->cache.arch_has_per_cpu_cfg)
> @@ -544,7 +549,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> rdt_domain_reconfigure_cdp(r);
>
> if (domain_setup_ctrlval(r, d)) {
> - domain_free(hw_dom);
> + ctrl_domain_free(hw_dom);
> return;
> }
>
> @@ -553,17 +558,17 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> err = resctrl_online_ctrl_domain(r, d);
> if (err) {
> list_del(&d->hdr.list);
> - domain_free(hw_dom);
> + ctrl_domain_free(hw_dom);
> }
> }
>
> static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> {
> int id = get_domain_id_from_scope(cpu, r->mon_scope);
> + struct rdt_hw_mon_domain *hw_dom;
> struct list_head *add_pos = NULL;
> - struct rdt_hw_domain *hw_dom;
> struct rdt_domain_hdr *hdr;
> - struct rdt_domain *d;
> + struct rdt_mon_domain *d;
> int err;
>
> if (id < 0) {
> @@ -577,7 +582,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
> return;
>
> - d = container_of(hdr, struct rdt_domain, hdr);
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
>
> cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
> return;
> @@ -593,7 +598,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
>
> if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> - domain_free(hw_dom);
> + mon_domain_free(hw_dom);
> return;
> }
>
> @@ -602,7 +607,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> err = resctrl_online_mon_domain(r, d);
> if (err) {
> list_del(&d->hdr.list);
> - domain_free(hw_dom);
> + mon_domain_free(hw_dom);
> }
> }
>
> @@ -620,9 +625,9 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
> static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
> {
> int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
> - struct rdt_hw_domain *hw_dom;
> + struct rdt_hw_ctrl_domain *hw_dom;
> struct rdt_domain_hdr *hdr;
> - struct rdt_domain *d;
> + struct rdt_ctrl_domain *d;
>
> if (id < 0) {
> pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
> @@ -640,8 +645,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
> if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
> return;
>
> - d = container_of(hdr, struct rdt_domain, hdr);
> - hw_dom = resctrl_to_arch_dom(d);
> + d = container_of(hdr, struct rdt_ctrl_domain, hdr);
> + hw_dom = resctrl_to_arch_ctrl_dom(d);
>
> cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
> if (cpumask_empty(&d->hdr.cpu_mask)) {
> @@ -649,12 +654,12 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
> list_del(&d->hdr.list);
>
> /*
> - * rdt_domain "d" is going to be freed below, so clear
> + * rdt_ctrl_domain "d" is going to be freed below, so clear
> * its pointer from pseudo_lock_region struct.
> */
> if (d->plr)
> d->plr->d = NULL;
> - domain_free(hw_dom);
> + ctrl_domain_free(hw_dom);
>
> return;
> }
> @@ -663,9 +668,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
> static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> {
> int id = get_domain_id_from_scope(cpu, r->mon_scope);
> - struct rdt_hw_domain *hw_dom;
> + struct rdt_hw_mon_domain *hw_dom;
> struct rdt_domain_hdr *hdr;
> - struct rdt_domain *d;
> + struct rdt_mon_domain *d;
>
> if (id < 0) {
> pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
> @@ -683,14 +688,14 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
> return;
>
> - d = container_of(hdr, struct rdt_domain, hdr);
> - hw_dom = resctrl_to_arch_dom(d);
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> + hw_dom = resctrl_to_arch_mon_dom(d);
>
> cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
> if (cpumask_empty(&d->hdr.cpu_mask)) {
> resctrl_offline_mon_domain(r, d);
> list_del(&d->hdr.list);
> - domain_free(hw_dom);
> + mon_domain_free(hw_dom);
>
> return;
> }
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 0b4136c42762..08fc97ce4135 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -58,7 +58,7 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
> }
>
> int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
> - struct rdt_domain *d)
> + struct rdt_ctrl_domain *d)
> {
> struct resctrl_staged_config *cfg;
> u32 closid = data->rdtgrp->closid;
> @@ -137,7 +137,7 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
> * resource type.
> */
> int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
> - struct rdt_domain *d)
> + struct rdt_ctrl_domain *d)
> {
> struct rdtgroup *rdtgrp = data->rdtgrp;
> struct resctrl_staged_config *cfg;
> @@ -206,8 +206,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
> struct resctrl_staged_config *cfg;
> struct rdt_resource *r = s->res;
> struct rdt_parse_data data;
> + struct rdt_ctrl_domain *d;
> char *dom = NULL, *id;
> - struct rdt_domain *d;
> unsigned long dom_id;
>
> if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
> @@ -267,11 +267,11 @@ static u32 get_config_index(u32 closid, enum resctrl_conf_type type)
> }
> }
>
> -static bool apply_config(struct rdt_hw_domain *hw_dom,
> +static bool apply_config(struct rdt_hw_ctrl_domain *hw_dom,
> struct resctrl_staged_config *cfg, u32 idx,
> cpumask_var_t cpu_mask)
> {
> - struct rdt_domain *dom = &hw_dom->d_resctrl;
> + struct rdt_ctrl_domain *dom = &hw_dom->d_resctrl;
>
> if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
> cpumask_set_cpu(cpumask_any(&dom->hdr.cpu_mask), cpu_mask);
> @@ -283,11 +283,11 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
> return false;
> }
>
> -int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
> +int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
> u32 closid, enum resctrl_conf_type t, u32 cfg_val)
> {
> + struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> - struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> u32 idx = get_config_index(closid, t);
> struct msr_param msr_param;
>
> @@ -307,11 +307,11 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
> int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
> {
> struct resctrl_staged_config *cfg;
> - struct rdt_hw_domain *hw_dom;
> + struct rdt_hw_ctrl_domain *hw_dom;
> struct msr_param msr_param;
> + struct rdt_ctrl_domain *d;
> enum resctrl_conf_type t;
> cpumask_var_t cpu_mask;
> - struct rdt_domain *d;
> u32 idx;
>
> if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
> @@ -319,7 +319,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
>
> msr_param.res = NULL;
> list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
> - hw_dom = resctrl_to_arch_dom(d);
> + hw_dom = resctrl_to_arch_ctrl_dom(d);
> for (t = 0; t < CDP_NUM_TYPES; t++) {
> cfg = &hw_dom->d_resctrl.staged_config[t];
> if (!cfg->have_new_ctrl)
> @@ -449,10 +449,10 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
> return ret ?: nbytes;
> }
>
> -u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
> +u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
> u32 closid, enum resctrl_conf_type type)
> {
> - struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> + struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
> u32 idx = get_config_index(closid, type);
>
> return hw_dom->ctrl_val[idx];
> @@ -461,7 +461,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
> static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid)
> {
> struct rdt_resource *r = schema->res;
> - struct rdt_domain *dom;
> + struct rdt_ctrl_domain *dom;
> bool sep = false;
> u32 ctrl_val;
>
> @@ -523,7 +523,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
> }
>
> void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> - struct rdt_domain *d, struct rdtgroup *rdtgrp,
> + struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
> int evtid, int first)
> {
> /*
> @@ -543,11 +543,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> {
> struct kernfs_open_file *of = m->private;
> struct rdt_domain_hdr *hdr;
> + struct rdt_mon_domain *d;
> u32 resid, evtid, domid;
> struct rdtgroup *rdtgrp;
> struct rdt_resource *r;
> union mon_data_bits md;
> - struct rdt_domain *d;
> struct rmid_read rr;
> int ret = 0;
>
> @@ -568,7 +568,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> ret = -ENOENT;
> goto out;
> }
> - d = container_of(hdr, struct rdt_domain, hdr);
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
>
> mon_event_read(&rr, r, d, rdtgrp, evtid, false);
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index ec5ad926c5dc..4e145f5620b0 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -170,7 +170,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> return 0;
> }
>
> -static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
> +static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
> u32 rmid,
> enum resctrl_event_id eventid)
> {
> @@ -189,10 +189,10 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
> return NULL;
> }
>
> -void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
> +void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
> u32 rmid, enum resctrl_event_id eventid)
> {
> - struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> + struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
> struct arch_mbm_state *am;
>
> am = get_arch_mbm_state(hw_dom, rmid, eventid);
> @@ -208,9 +208,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
> * Assumes that hardware counters are also reset and thus that there is
> * no need to record initial non-zero counts.
> */
> -void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
> +void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
> {
> - struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> + struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
>
> if (is_mbm_total_enabled())
> memset(hw_dom->arch_mbm_total, 0,
> @@ -229,11 +229,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
> return chunks >> shift;
> }
>
> -int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> +int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
> u32 rmid, enum resctrl_event_id eventid, u64 *val)
> {
> + struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> - struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> struct arch_mbm_state *am;
> u64 msr_val, chunks;
> int ret;
> @@ -266,7 +266,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> * decrement the count. If the busy count gets to zero on an RMID, we
> * free the RMID
> */
> -void __check_limbo(struct rdt_domain *d, bool force_free)
> +void __check_limbo(struct rdt_mon_domain *d, bool force_free)
> {
> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> struct rmid_entry *entry;
> @@ -305,7 +305,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
> }
> }
>
> -bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
> +bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d)
> {
> return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
> }
> @@ -334,7 +334,7 @@ int alloc_rmid(void)
> static void add_rmid_to_limbo(struct rmid_entry *entry)
> {
> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> - struct rdt_domain *d;
> + struct rdt_mon_domain *d;
> int cpu, err;
> u64 val = 0;
>
> @@ -383,7 +383,7 @@ void free_rmid(u32 rmid)
> list_add_tail(&entry->list, &rmid_free_lru);
> }
>
> -static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
> +static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 rmid,
> enum resctrl_event_id evtid)
> {
> switch (evtid) {
> @@ -516,13 +516,13 @@ void mon_event_count(void *info)
> * throttle MSRs already have low percentage values. To avoid
> * unnecessarily restricting such rdtgroups, we also increase the bandwidth.
> */
> -static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> +static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
> {
> u32 closid, rmid, cur_msr_val, new_msr_val;
> struct mbm_state *pmbm_data, *cmbm_data;
> + struct rdt_ctrl_domain *dom_mba;
> u32 cur_bw, delta_bw, user_bw;
> struct rdt_resource *r_mba;
> - struct rdt_domain *dom_mba;
> struct list_head *head;
> struct rdtgroup *entry;
>
> @@ -600,7 +600,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> }
> }
>
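update_mba_bw() is the one place the two domain types meet: MBM counts
are read from a monitor domain, while the MBA throttle value is applied
via the control domain covering the same CPU (condensed; see also the
get_ctrl_domain_from_cpu() change in patch 3/8):

    static void update_mba_bw(struct rdtgroup *rgrp,
                              struct rdt_mon_domain *dom_mbm)
    {
        struct rdt_ctrl_domain *dom_mba;
        ...
        dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
        if (!dom_mba)
            return;
        ...
    }
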
> -static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
> +static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d, int rmid)
> {
> struct rmid_read rr;
>
> @@ -640,13 +640,13 @@ void cqm_handle_limbo(struct work_struct *work)
> {
> unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
> int cpu = smp_processor_id();
> + struct rdt_mon_domain *d;
> struct rdt_resource *r;
> - struct rdt_domain *d;
>
> mutex_lock(&rdtgroup_mutex);
>
> r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> - d = container_of(work, struct rdt_domain, cqm_limbo.work);
> + d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);
>
> __check_limbo(d, false);
>
> @@ -656,7 +656,7 @@ void cqm_handle_limbo(struct work_struct *work)
> mutex_unlock(&rdtgroup_mutex);
> }
>
> -void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
> +void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
> {
> unsigned long delay = msecs_to_jiffies(delay_ms);
> int cpu;
> @@ -672,9 +672,9 @@ void mbm_handle_overflow(struct work_struct *work)
> unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
> struct rdtgroup *prgrp, *crgrp;
> int cpu = smp_processor_id();
> + struct rdt_mon_domain *d;
> struct list_head *head;
> struct rdt_resource *r;
> - struct rdt_domain *d;
>
> mutex_lock(&rdtgroup_mutex);
>
> @@ -682,7 +682,7 @@ void mbm_handle_overflow(struct work_struct *work)
> goto out_unlock;
>
> r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> - d = container_of(work, struct rdt_domain, mbm_over.work);
> + d = container_of(work, struct rdt_mon_domain, mbm_over.work);
>
> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> mbm_update(r, d, prgrp->mon.rmid);
> @@ -701,7 +701,7 @@ void mbm_handle_overflow(struct work_struct *work)
> mutex_unlock(&rdtgroup_mutex);
> }
>
> -void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
> +void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
> {
> unsigned long delay = msecs_to_jiffies(delay_ms);
> int cpu;
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index ed6d59af1cef..08d35f828bc3 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -814,7 +814,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
> * Return: true if @cbm overlaps with pseudo-locked region on @d, false
> * otherwise.
> */
> -bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm)
> +bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm)
> {
> unsigned int cbm_len;
> unsigned long cbm_b;
> @@ -841,11 +841,11 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm
> * if it is not possible to test due to memory allocation issue,
> * false otherwise.
> */
> -bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
> +bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d)
> {
> + struct rdt_ctrl_domain *d_i;
> cpumask_var_t cpu_with_psl;
> struct rdt_resource *r;
> - struct rdt_domain *d_i;
> bool ret = false;
>
> if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 760013ed1bff..21bbd832f3f2 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -85,8 +85,8 @@ void rdt_last_cmd_printf(const char *fmt, ...)
>
> void rdt_staged_configs_clear(void)
> {
> + struct rdt_ctrl_domain *dom;
> struct rdt_resource *r;
> - struct rdt_domain *dom;
>
> lockdep_assert_held(&rdtgroup_mutex);
>
> @@ -976,7 +976,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
> unsigned long sw_shareable = 0, hw_shareable = 0;
> unsigned long exclusive = 0, pseudo_locked = 0;
> struct rdt_resource *r = s->res;
> - struct rdt_domain *dom;
> + struct rdt_ctrl_domain *dom;
> int i, hwb, swb, excl, psl;
> enum rdtgrp_mode mode;
> bool sep = false;
> @@ -1205,7 +1205,7 @@ static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of,
> *
> * Return: false if CBM does not overlap, true if it does.
> */
> -static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
> +static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_domain *d,
> unsigned long cbm, int closid,
> enum resctrl_conf_type type, bool exclusive)
> {
> @@ -1260,7 +1260,7 @@ static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d
> *
> * Return: true if CBM overlap detected, false if there is no overlap
> */
> -bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
> +bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
> unsigned long cbm, int closid, bool exclusive)
> {
> enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
> @@ -1291,10 +1291,10 @@ bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
> static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
> {
> int closid = rdtgrp->closid;
> + struct rdt_ctrl_domain *d;
> struct resctrl_schema *s;
> struct rdt_resource *r;
> bool has_cache = false;
> - struct rdt_domain *d;
> u32 ctrl;
>
> list_for_each_entry(s, &resctrl_schema_all, list) {
> @@ -1407,7 +1407,7 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of,
> * bitmap functions work correctly.
> */
> unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
> - struct rdt_domain *d, unsigned long cbm)
> + struct rdt_ctrl_domain *d, unsigned long cbm)
> {
> struct cpu_cacheinfo *ci;
> unsigned int size = 0;
> @@ -1439,9 +1439,9 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
> {
> struct resctrl_schema *schema;
> enum resctrl_conf_type type;
> + struct rdt_ctrl_domain *d;
> struct rdtgroup *rdtgrp;
> struct rdt_resource *r;
> - struct rdt_domain *d;
> unsigned int size;
> int ret = 0;
> u32 closid;
> @@ -1553,7 +1553,7 @@ static void mon_event_config_read(void *info)
> mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
> }
>
> -static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
> +static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info)
> {
> smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
> }
> @@ -1561,7 +1561,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
> static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
> {
> struct mon_config_info mon_info = {0};
> - struct rdt_domain *dom;
> + struct rdt_mon_domain *dom;
> bool sep = false;
>
> mutex_lock(&rdtgroup_mutex);
> @@ -1618,7 +1618,7 @@ static void mon_event_config_write(void *info)
> }
>
> static int mbm_config_write_domain(struct rdt_resource *r,
> - struct rdt_domain *d, u32 evtid, u32 val)
> + struct rdt_mon_domain *d, u32 evtid, u32 val)
> {
> struct mon_config_info mon_info = {0};
> int ret = 0;
> @@ -1668,7 +1668,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
> {
> char *dom_str = NULL, *id_str;
> unsigned long dom_id, val;
> - struct rdt_domain *d;
> + struct rdt_mon_domain *d;
> int ret = 0;
>
> next:
> @@ -2216,9 +2216,9 @@ static inline bool is_mba_linear(void)
> static int set_cache_qos_cfg(int level, bool enable)
> {
> void (*update)(void *arg);
> + struct rdt_ctrl_domain *d;
> struct rdt_resource *r_l;
> cpumask_var_t cpu_mask;
> - struct rdt_domain *d;
> int cpu;
>
> if (level == RDT_RESOURCE_L3)
> @@ -2265,7 +2265,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
> l3_qos_cfg_update(&hw_res->cdp_enabled);
> }
>
> -static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
> +static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain *d)
> {
> u32 num_closid = resctrl_arch_get_num_closid(r);
> int cpu = cpumask_any(&d->hdr.cpu_mask);
> @@ -2283,7 +2283,7 @@ static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
> }
>
> static void mba_sc_domain_destroy(struct rdt_resource *r,
> - struct rdt_domain *d)
> + struct rdt_ctrl_domain *d)
> {
> kfree(d->mbps_val);
> d->mbps_val = NULL;
> @@ -2309,7 +2309,7 @@ static int set_mba_sc(bool mba_sc)
> {
> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
> u32 num_closid = resctrl_arch_get_num_closid(r);
> - struct rdt_domain *d;
> + struct rdt_ctrl_domain *d;
> int i;
>
> if (!supports_mba_mbps() || mba_sc == is_mba_sc(r))
> @@ -2578,7 +2578,7 @@ static int rdt_get_tree(struct fs_context *fc)
> {
> struct rdt_fs_context *ctx = rdt_fc2context(fc);
> unsigned long flags = RFTYPE_CTRL_BASE;
> - struct rdt_domain *dom;
> + struct rdt_mon_domain *dom;
> struct rdt_resource *r;
> int ret;
>
> @@ -2762,10 +2762,10 @@ static int rdt_init_fs_context(struct fs_context *fc)
> static int reset_all_ctrls(struct rdt_resource *r)
> {
> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> - struct rdt_hw_domain *hw_dom;
> + struct rdt_hw_ctrl_domain *hw_dom;
> struct msr_param msr_param;
> + struct rdt_ctrl_domain *d;
> cpumask_var_t cpu_mask;
> - struct rdt_domain *d;
> int i;
>
> if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
> @@ -2781,7 +2781,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
> * from each domain to update the MSRs below.
> */
> list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
> - hw_dom = resctrl_to_arch_dom(d);
> + hw_dom = resctrl_to_arch_ctrl_dom(d);
> cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
>
> for (i = 0; i < hw_res->num_closid; i++)
> @@ -2976,7 +2976,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> }
>
> static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> - struct rdt_domain *d,
> + struct rdt_mon_domain *d,
> struct rdt_resource *r, struct rdtgroup *prgrp)
> {
> union mon_data_bits priv;
> @@ -3025,7 +3025,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> * and "monitor" groups with given domain id.
> */
> static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> - struct rdt_domain *d)
> + struct rdt_mon_domain *d)
> {
> struct kernfs_node *parent_kn;
> struct rdtgroup *prgrp, *crgrp;
> @@ -3047,7 +3047,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
> struct rdt_resource *r,
> struct rdtgroup *prgrp)
> {
> - struct rdt_domain *dom;
> + struct rdt_mon_domain *dom;
> int ret;
>
> list_for_each_entry(dom, &r->mon_domains, hdr.list) {
> @@ -3149,7 +3149,7 @@ static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r)
> * Set the RDT domain up to start off with all usable allocations. That is,
> * all shareable and unused bits. All-zero CBM is invalid.
> */
> -static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
> +static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schema *s,
> u32 closid)
> {
> enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
> @@ -3229,7 +3229,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
> */
> static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
> {
> - struct rdt_domain *d;
> + struct rdt_ctrl_domain *d;
> int ret;
>
> list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
> @@ -3245,7 +3245,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
> static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
> {
> struct resctrl_staged_config *cfg;
> - struct rdt_domain *d;
> + struct rdt_ctrl_domain *d;
>
> list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
> if (is_mba_sc(r)) {
> @@ -3842,14 +3842,14 @@ static void __init rdtgroup_setup_default(void)
> mutex_unlock(&rdtgroup_mutex);
> }
>
> -static void domain_destroy_mon_state(struct rdt_domain *d)
> +static void domain_destroy_mon_state(struct rdt_mon_domain *d)
> {
> bitmap_free(d->rmid_busy_llc);
> kfree(d->mbm_total);
> kfree(d->mbm_local);
> }
>
> -void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
> +void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
> {
> lockdep_assert_held(&rdtgroup_mutex);
>
> @@ -3857,7 +3857,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
> mba_sc_domain_destroy(r, d);
> }
>
> -void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
> +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
> {
> lockdep_assert_held(&rdtgroup_mutex);
>
> @@ -3886,7 +3886,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
> domain_destroy_mon_state(d);
> }
>
> -static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
> +static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
> {
> size_t tsize;
>
> @@ -3916,7 +3916,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
> return 0;
> }
>
> -int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
> +int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
> {
> lockdep_assert_held(&rdtgroup_mutex);
>
> @@ -3926,7 +3926,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
> return 0;
> }
>
> -int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
> +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
> {
> int err;
>

2023-12-04 23:05:58

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH v13 3/8] x86/resctrl: Prepare for different scope for control/monitor operations


On 12/4/2023 12:53 PM, Tony Luck wrote:
> Resctrl assumes that control and monitor operations on a resource are
> performed at the same scope.
>
> Prepare for systems that use different scopes (specifically, Intel needs
> to split the RDT_RESOURCE_L3 resource to use L3 scope for cache control
> and NODE scope for cache occupancy and memory bandwidth monitoring).
>
> Create separate domain lists for control and monitor operations.
>
> Note that errors during initialization of either control or monitor
> functions on a domain would previously result in that domain being
> excluded from both control and monitor operations. Now that the domains
> are allocated independently, it is no longer necessary to disable both
> control and monitor operations if either fails.
>
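In outline, the patch replaces the resource's single domain list with
two lists and splits the CPU hotplug path so that each side can fail
independently (condensed from the diff below):

    struct rdt_resource {
        ...
        struct list_head ctrl_domains;  /* walked for control ops */
        struct list_head mon_domains;   /* walked for monitor ops */
        ...
    };

    static void domain_add_cpu(int cpu, struct rdt_resource *r)
    {
        if (r->alloc_capable)
            domain_add_cpu_ctrl(cpu, r);
        if (r->mon_capable)
            domain_add_cpu_mon(cpu, r);
    }
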
> Signed-off-by: Tony Luck <[email protected]>
> Tested-by: Shaopeng Tan <[email protected]>
> Reviewed-by: Shaopeng Tan <[email protected]>
> Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Babu Moger <[email protected]>
> ---
> include/linux/resctrl.h | 25 ++-
> arch/x86/kernel/cpu/resctrl/internal.h | 6 +-
> arch/x86/kernel/cpu/resctrl/core.c | 211 ++++++++++++++++------
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 12 +-
> arch/x86/kernel/cpu/resctrl/monitor.c | 4 +-
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 +-
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 55 +++---
> 7 files changed, 220 insertions(+), 97 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index c4067150a6b7..35e700edc6e6 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -52,15 +52,22 @@ struct resctrl_staged_config {
> bool have_new_ctrl;
> };
>
> +enum resctrl_domain_type {
> + RESCTRL_CTRL_DOMAIN,
> + RESCTRL_MON_DOMAIN,
> +};
> +
> /**
> * struct rdt_domain_hdr - common header for different domain types
> * @list: all instances of this resource
> * @id: unique id for this instance
> + * @type: type of this instance
> * @cpu_mask: which CPUs share this resource
> */
> struct rdt_domain_hdr {
> struct list_head list;
> int id;
> + enum resctrl_domain_type type;
> struct cpumask cpu_mask;
> };
>
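The new @type field lets callers validate a header before downcasting
it with container_of(), as the hotplug paths in this patch do
(condensed from the diff below):

    hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
    if (hdr) {
        if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
            return;
        d = container_of(hdr, struct rdt_domain, hdr);
        ...
    }
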
> @@ -163,10 +170,12 @@ enum resctrl_scope {
> * @alloc_capable: Is allocation available on this machine
> * @mon_capable: Is monitor feature available on this machine
> * @num_rmid: Number of RMIDs available
> - * @scope: Scope of this resource
> + * @ctrl_scope: Scope of this resource for control functions
> + * @mon_scope: Scope of this resource for monitor functions
> * @cache: Cache allocation related data
> * @membw: If the component has bandwidth controls, their properties.
> - * @domains: All domains for this resource
> + * @ctrl_domains: Control domains for this resource
> + * @mon_domains: Monitor domains for this resource
> * @name: Name to use in "schemata" file.
> * @data_width: Character width of data when displaying
> * @default_ctrl: Specifies default cache cbm or memory B/W percent.
> @@ -181,10 +190,12 @@ struct rdt_resource {
> bool alloc_capable;
> bool mon_capable;
> int num_rmid;
> - enum resctrl_scope scope;
> + enum resctrl_scope ctrl_scope;
> + enum resctrl_scope mon_scope;
> struct resctrl_cache cache;
> struct resctrl_membw membw;
> - struct list_head domains;
> + struct list_head ctrl_domains;
> + struct list_head mon_domains;
> char *name;
> int data_width;
> u32 default_ctrl;
> @@ -230,8 +241,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
>
> u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
> u32 closid, enum resctrl_conf_type type);
> -int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
> -void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
> +int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
> +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
> +void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
> +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
>
> /**
> * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index a4f1aa15f0a2..24bf9d7989a9 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -520,8 +520,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
> int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
> int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
> umode_t mask);
> -struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> - struct list_head **pos);
> +struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
> + struct list_head **pos);
> ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
> char *buf, size_t nbytes, loff_t off);
> int rdtgroup_schemata_show(struct kernfs_open_file *of,
> @@ -540,7 +540,7 @@ int rdt_pseudo_lock_init(void);
> void rdt_pseudo_lock_release(void);
> int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
> void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
> -struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
> +struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
> int closids_supported(void);
> void closid_free(int closid);
> int alloc_rmid(void);
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 62a989fd950d..1fd85533b4ca 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -57,7 +57,8 @@ static void
> mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
> struct rdt_resource *r);
>
> -#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
> +#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
> +#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)
>
> struct rdt_hw_resource rdt_resources_all[] = {
> [RDT_RESOURCE_L3] =
> @@ -65,8 +66,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_L3,
> .name = "L3",
> - .scope = RESCTRL_L3_CACHE,
> - .domains = domain_init(RDT_RESOURCE_L3),
> + .ctrl_scope = RESCTRL_L3_CACHE,
> + .mon_scope = RESCTRL_L3_CACHE,
> + .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_L3),
> + .mon_domains = mon_domain_init(RDT_RESOURCE_L3),
> .parse_ctrlval = parse_cbm,
> .format_str = "%d=%0*x",
> .fflags = RFTYPE_RES_CACHE,
> @@ -79,8 +82,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_L2,
> .name = "L2",
> - .scope = RESCTRL_L2_CACHE,
> - .domains = domain_init(RDT_RESOURCE_L2),
> + .ctrl_scope = RESCTRL_L2_CACHE,
> + .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_L2),
> .parse_ctrlval = parse_cbm,
> .format_str = "%d=%0*x",
> .fflags = RFTYPE_RES_CACHE,
> @@ -93,8 +96,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_MBA,
> .name = "MB",
> - .scope = RESCTRL_L3_CACHE,
> - .domains = domain_init(RDT_RESOURCE_MBA),
> + .ctrl_scope = RESCTRL_L3_CACHE,
> + .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_MBA),
> .parse_ctrlval = parse_bw,
> .format_str = "%d=%*u",
> .fflags = RFTYPE_RES_MB,
> @@ -105,8 +108,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_SMBA,
> .name = "SMBA",
> - .scope = RESCTRL_L3_CACHE,
> - .domains = domain_init(RDT_RESOURCE_SMBA),
> + .ctrl_scope = RESCTRL_L3_CACHE,
> + .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_SMBA),
> .parse_ctrlval = parse_bw,
> .format_str = "%d=%*u",
> .fflags = RFTYPE_RES_MB,
> @@ -352,11 +355,11 @@ cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
> wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
> }
>
> -struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
> +struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
> {
> struct rdt_domain *d;
>
> - list_for_each_entry(d, &r->domains, hdr.list) {
> + list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
> /* Find the domain that contains this CPU */
> if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
> return d;
> @@ -378,7 +381,7 @@ void rdt_ctrl_update(void *arg)
> int cpu = smp_processor_id();
> struct rdt_domain *d;
>
> - d = get_domain_from_cpu(cpu, r);
> + d = get_ctrl_domain_from_cpu(cpu, r);
> if (d) {
> hw_res->msr_update(d, m, r);
> return;
> @@ -388,26 +391,26 @@ void rdt_ctrl_update(void *arg)
> }
>
> /*
> - * rdt_find_domain - Find a domain in a resource that matches input resource id
> + * rdt_find_domain - Search for a domain id in a resource domain list.
> *
> - * Search resource r's domain list to find the resource id. If the resource
> - * id is found in a domain, return the domain. Otherwise, if requested by
> - * caller, return the first domain whose id is bigger than the input id.
> - * The domain list is sorted by id in ascending order.
> + * Search the domain list to find the domain id. If the domain id is
> + * found, return the domain. NULL otherwise. If the domain id is not
> + * found (and NULL returned) then the first domain with id bigger than
> + * the input id can be returned to the caller via @pos.
> */
> -struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> - struct list_head **pos)
> +struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
> + struct list_head **pos)
> {
> - struct rdt_domain *d;
> + struct rdt_domain_hdr *d;
> struct list_head *l;
>
> - list_for_each(l, &r->domains) {
> - d = list_entry(l, struct rdt_domain, hdr.list);
> + list_for_each(l, h) {
> + d = list_entry(l, struct rdt_domain_hdr, list);
> /* When id is found, return its domain. */
> - if (id == d->hdr.id)
> + if (id == d->id)
> return d;
> /* Stop searching when finding id's position in sorted list. */
> - if (id < d->hdr.id)
> + if (id < d->id)
> break;
> }
>
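Callers use @pos for a sorted find-or-insert: when the id is not found,
a new domain is linked in just before @pos, which preserves the
ascending-id order because list_add_tail() inserts immediately before
the entry it is given (condensed from domain_add_cpu_ctrl() below):

    hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
    if (hdr) {
        /* existing domain: just account this CPU */
    } else {
        /* allocate hw_dom, set d->hdr.id and d->hdr.type, then: */
        list_add_tail(&d->hdr.list, add_pos);
    }
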
> @@ -501,35 +504,28 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> return -EINVAL;
> }
>
> -/*
> - * domain_add_cpu - Add a cpu to a resource's domain list.
> - *
> - * If an existing domain in the resource r's domain list matches the cpu's
> - * resource id, add the cpu in the domain.
> - *
> - * Otherwise, a new domain is allocated and inserted into the right position
> - * in the domain list sorted by id in ascending order.
> - *
> - * The order in the domain list is visible to users when we print entries
> - * in the schemata file and schemata input is validated to have the same order
> - * as this list.
> - */
> -static void domain_add_cpu(int cpu, struct rdt_resource *r)
> +static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> {
> - int id = get_domain_id_from_scope(cpu, r->scope);
> + int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
> struct list_head *add_pos = NULL;
> struct rdt_hw_domain *hw_dom;
> + struct rdt_domain_hdr *hdr;
> struct rdt_domain *d;
> int err;
>
> if (id < 0) {
> - pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> - cpu, r->scope, r->name);
> + pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
> + cpu, r->ctrl_scope, r->name);
> return;
> }
>
> - d = rdt_find_domain(r, id, &add_pos);
> - if (d) {
> + hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
> + if (hdr) {
> + if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
> + return;
> +
> + d = container_of(hdr, struct rdt_domain, hdr);
> +
> cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
> if (r->cache.arch_has_per_cpu_cfg)
> rdt_domain_reconfigure_cdp(r);
> @@ -542,51 +538,114 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>
> d = &hw_dom->d_resctrl;
> d->hdr.id = id;
> + d->hdr.type = RESCTRL_CTRL_DOMAIN;
> cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
>
> rdt_domain_reconfigure_cdp(r);
>
> - if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
> + if (domain_setup_ctrlval(r, d)) {
> domain_free(hw_dom);
> return;
> }
>
> - if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> + list_add_tail(&d->hdr.list, add_pos);
> +
> + err = resctrl_online_ctrl_domain(r, d);
> + if (err) {
> + list_del(&d->hdr.list);
> + domain_free(hw_dom);
> + }
> +}
> +
> +static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> +{
> + int id = get_domain_id_from_scope(cpu, r->mon_scope);
> + struct list_head *add_pos = NULL;
> + struct rdt_hw_domain *hw_dom;
> + struct rdt_domain_hdr *hdr;
> + struct rdt_domain *d;
> + int err;
> +
> + if (id < 0) {
> + pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
> + cpu, r->mon_scope, r->name);
> + return;
> + }
> +
> + hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
> + if (hdr) {
> + if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
> + return;
> +
> + d = container_of(hdr, struct rdt_domain, hdr);
> +
> + cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
> + return;
> + }
> +
> + hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> + if (!hw_dom)
> + return;
> +
> + d = &hw_dom->d_resctrl;
> + d->hdr.id = id;
> + d->hdr.type = RESCTRL_MON_DOMAIN;
> + cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
> +
> + if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> domain_free(hw_dom);
> return;
> }
>
> list_add_tail(&d->hdr.list, add_pos);
>
> - err = resctrl_online_domain(r, d);
> + err = resctrl_online_mon_domain(r, d);
> if (err) {
> list_del(&d->hdr.list);
> domain_free(hw_dom);
> }
> }
>
> -static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> +/*
> + * domain_add_cpu - Add a CPU to either/both resource's domain lists.
> + */
> +static void domain_add_cpu(int cpu, struct rdt_resource *r)
> {
> - int id = get_domain_id_from_scope(cpu, r->scope);
> + if (r->alloc_capable)
> + domain_add_cpu_ctrl(cpu, r);
> + if (r->mon_capable)
> + domain_add_cpu_mon(cpu, r);
> +}
> +
> +static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
> +{
> + int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
> struct rdt_hw_domain *hw_dom;
> + struct rdt_domain_hdr *hdr;
> struct rdt_domain *d;
>
> if (id < 0) {
> - pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> - cpu, r->scope, r->name);
> + pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
> + cpu, r->ctrl_scope, r->name);
> return;
> }
>
> - d = rdt_find_domain(r, id, NULL);
> - if (!d) {
> - pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
> + hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
> + if (!hdr) {
> + pr_warn("Can't find control domain for id=%d for CPU %d for resource %s\n",
> + id, cpu, r->name);
> return;
> }
> +
> + if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
> + return;
> +
> + d = container_of(hdr, struct rdt_domain, hdr);
> hw_dom = resctrl_to_arch_dom(d);
>
> cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
> if (cpumask_empty(&d->hdr.cpu_mask)) {
> - resctrl_offline_domain(r, d);
> + resctrl_offline_ctrl_domain(r, d);
> list_del(&d->hdr.list);
>
> /*
> @@ -599,6 +658,42 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>
> return;
> }
> +}
> +
> +static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> +{
> + int id = get_domain_id_from_scope(cpu, r->mon_scope);
> + struct rdt_hw_domain *hw_dom;
> + struct rdt_domain_hdr *hdr;
> + struct rdt_domain *d;
> +
> + if (id < 0) {
> + pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
> + cpu, r->mon_scope, r->name);
> + return;
> + }
> +
> + hdr = rdt_find_domain(&r->mon_domains, id, NULL);
> + if (!hdr) {
> + pr_warn("Can't find monitor domain for id=%d for CPU %d for resource %s\n",
> + id, cpu, r->name);
> + return;
> + }
> +
> + if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
> + return;
> +
> + d = container_of(hdr, struct rdt_domain, hdr);
> + hw_dom = resctrl_to_arch_dom(d);
> +
> + cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
> + if (cpumask_empty(&d->hdr.cpu_mask)) {
> + resctrl_offline_mon_domain(r, d);
> + list_del(&d->hdr.list);
> + domain_free(hw_dom);
> +
> + return;
> + }
>
> if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
> if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
> @@ -613,6 +708,14 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> }
> }
>
> +static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> +{
> + if (r->alloc_capable)
> + domain_remove_cpu_ctrl(cpu, r);
> + if (r->mon_capable)
> + domain_remove_cpu_mon(cpu, r);
> +}
> +
> static void clear_closid_rmid(int cpu)
> {
> struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 23f8258d36a8..0b4136c42762 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -226,7 +226,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
> return -EINVAL;
> }
> dom = strim(dom);
> - list_for_each_entry(d, &r->domains, hdr.list) {
> + list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
> if (d->hdr.id == dom_id) {
> data.buf = dom;
> data.rdtgrp = rdtgrp;
> @@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
> return -ENOMEM;
>
> msr_param.res = NULL;
> - list_for_each_entry(d, &r->domains, hdr.list) {
> + list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
> hw_dom = resctrl_to_arch_dom(d);
> for (t = 0; t < CDP_NUM_TYPES; t++) {
> cfg = &hw_dom->d_resctrl.staged_config[t];
> @@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
> u32 ctrl_val;
>
> seq_printf(s, "%*s:", max_name_width, schema->name);
> - list_for_each_entry(dom, &r->domains, hdr.list) {
> + list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
> if (sep)
> seq_puts(s, ";");
>
> @@ -542,6 +542,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> {
> struct kernfs_open_file *of = m->private;
> + struct rdt_domain_hdr *hdr;
> u32 resid, evtid, domid;
> struct rdtgroup *rdtgrp;
> struct rdt_resource *r;
> @@ -562,11 +563,12 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> evtid = md.u.evtid;
>
> r = &rdt_resources_all[resid].r_resctrl;
> - d = rdt_find_domain(r, domid, NULL);
> - if (!d) {
> + hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
> + if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
> ret = -ENOENT;
> goto out;
> }
> + d = container_of(hdr, struct rdt_domain, hdr);
>
> mon_event_read(&rr, r, d, rdtgrp, evtid, false);
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index dd0ea1bc0092..ec5ad926c5dc 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
>
> entry->busy = 0;
> cpu = get_cpu();
> - list_for_each_entry(d, &r->domains, hdr.list) {
> + list_for_each_entry(d, &r->mon_domains, hdr.list) {
> if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) {
> err = resctrl_arch_rmid_read(r, d, entry->rmid,
> QOS_L3_OCCUP_EVENT_ID,
> @@ -535,7 +535,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> rmid = rgrp->mon.rmid;
> pmbm_data = &dom_mbm->mbm_local[rmid];
>
> - dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
> + dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
> if (!dom_mba) {
> pr_warn_once("Failure to get domain for MBA update\n");
> return;
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index fcbd99e2eb66..ed6d59af1cef 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -292,7 +292,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
> */
> static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
> {
> - enum resctrl_scope scope = plr->s->res->scope;
> + enum resctrl_scope scope = plr->s->res->ctrl_scope;
> struct cpu_cacheinfo *ci;
> int ret;
> int i;
> @@ -856,7 +856,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
> * associated with them.
> */
> for_each_alloc_capable_rdt_resource(r) {
> - list_for_each_entry(d_i, &r->domains, hdr.list) {
> + list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) {
> if (d_i->plr)
> cpumask_or(cpu_with_psl, cpu_with_psl,
> &d_i->hdr.cpu_mask);
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 04d32602ac33..760013ed1bff 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
> lockdep_assert_held(&rdtgroup_mutex);
>
> for_each_alloc_capable_rdt_resource(r) {
> - list_for_each_entry(dom, &r->domains, hdr.list)
> + list_for_each_entry(dom, &r->ctrl_domains, hdr.list)
> memset(dom->staged_config, 0, sizeof(dom->staged_config));
> }
> }
> @@ -984,7 +984,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
>
> mutex_lock(&rdtgroup_mutex);
> hw_shareable = r->cache.shareable_bits;
> - list_for_each_entry(dom, &r->domains, hdr.list) {
> + list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
> if (sep)
> seq_putc(seq, ';');
> sw_shareable = 0;
> @@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
> if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
> continue;
> has_cache = true;
> - list_for_each_entry(d, &r->domains, hdr.list) {
> + list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
> ctrl = resctrl_arch_get_config(r, d, closid,
> s->conf_type);
> if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
> @@ -1413,13 +1413,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
> unsigned int size = 0;
> int num_b, i;
>
> - if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
> + if (WARN_ON_ONCE(r->ctrl_scope != RESCTRL_L2_CACHE && r->ctrl_scope != RESCTRL_L3_CACHE))
> return size;
>
> num_b = bitmap_weight(&cbm, r->cache.cbm_len);
> ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
> for (i = 0; i < ci->num_leaves; i++) {
> - if (ci->info_list[i].level == r->scope) {
> + if (ci->info_list[i].level == r->ctrl_scope) {
> size = ci->info_list[i].size / r->cache.cbm_len * num_b;
> break;
> }
> @@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
> type = schema->conf_type;
> sep = false;
> seq_printf(s, "%*s:", max_name_width, schema->name);
> - list_for_each_entry(d, &r->domains, hdr.list) {
> + list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
> if (sep)
> seq_putc(s, ';');
> if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
> @@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
>
> mutex_lock(&rdtgroup_mutex);
>
> - list_for_each_entry(dom, &r->domains, hdr.list) {
> + list_for_each_entry(dom, &r->mon_domains, hdr.list) {
> if (sep)
> seq_puts(s, ";");
>
> @@ -1689,7 +1689,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
> return -EINVAL;
> }
>
> - list_for_each_entry(d, &r->domains, hdr.list) {
> + list_for_each_entry(d, &r->mon_domains, hdr.list) {
> if (d->hdr.id == dom_id) {
> ret = mbm_config_write_domain(r, d, evtid, val);
> if (ret)
> @@ -2232,7 +2232,7 @@ static int set_cache_qos_cfg(int level, bool enable)
> return -ENOMEM;
>
> r_l = &rdt_resources_all[level].r_resctrl;
> - list_for_each_entry(d, &r_l->domains, hdr.list) {
> + list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) {
> if (r_l->cache.arch_has_per_cpu_cfg)
> /* Pick all the CPUs in the domain instance */
> for_each_cpu(cpu, &d->hdr.cpu_mask)
> @@ -2317,7 +2317,7 @@ static int set_mba_sc(bool mba_sc)
>
> r->membw.mba_sc = mba_sc;
>
> - list_for_each_entry(d, &r->domains, hdr.list) {
> + list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
> for (i = 0; i < num_closid; i++)
> d->mbps_val[i] = MBA_MAX_MBPS;
> }
> @@ -2653,7 +2653,7 @@ static int rdt_get_tree(struct fs_context *fc)
>
> if (is_mbm_enabled()) {
> r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> - list_for_each_entry(dom, &r->domains, hdr.list)
> + list_for_each_entry(dom, &r->mon_domains, hdr.list)
> mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
> }
>
> @@ -2777,10 +2777,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
>
> /*
> * Disable resource control for this resource by setting all
> - * CBMs in all domains to the maximum mask value. Pick one CPU
> + * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU
> * from each domain to update the MSRs below.
> */
> - list_for_each_entry(d, &r->domains, hdr.list) {
> + list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
> hw_dom = resctrl_to_arch_dom(d);
> cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
>
> @@ -3050,7 +3050,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
> struct rdt_domain *dom;
> int ret;
>
> - list_for_each_entry(dom, &r->domains, hdr.list) {
> + list_for_each_entry(dom, &r->mon_domains, hdr.list) {
> ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
> if (ret)
> return ret;
> @@ -3232,7 +3232,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
> struct rdt_domain *d;
> int ret;
>
> - list_for_each_entry(d, &s->res->domains, hdr.list) {
> + list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
> ret = __init_one_rdt_domain(d, s, closid);
> if (ret < 0)
> return ret;
> @@ -3247,7 +3247,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
> struct resctrl_staged_config *cfg;
> struct rdt_domain *d;
>
> - list_for_each_entry(d, &r->domains, hdr.list) {
> + list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
> if (is_mba_sc(r)) {
> d->mbps_val[closid] = MBA_MAX_MBPS;
> continue;
> @@ -3849,15 +3849,17 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
> kfree(d->mbm_local);
> }
>
> -void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
> +void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
> {
> lockdep_assert_held(&rdtgroup_mutex);
>
> if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
> mba_sc_domain_destroy(r, d);
> +}
>
> - if (!r->mon_capable)
> - return;
> +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
> +{
> + lockdep_assert_held(&rdtgroup_mutex);
>
> /*
> * If resctrl is mounted, remove all the
> @@ -3914,18 +3916,21 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
> return 0;
> }
>
> -int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
> +int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
> {
> - int err;
> -
> lockdep_assert_held(&rdtgroup_mutex);
>
> if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
> - /* RDT_RESOURCE_MBA is never mon_capable */
> return mba_sc_domain_allocate(r, d);
>
> - if (!r->mon_capable)
> - return 0;
> + return 0;
> +}
> +
> +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
> +{
> + int err;
> +
> + lockdep_assert_held(&rdtgroup_mutex);
>
> err = domain_setup_mon_state(r, d);
> if (err)

2023-12-04 23:06:08

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH v13 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache


On 12/4/2023 12:53 PM, Tony Luck wrote:
> Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
> and memory controllers on a socket into two or more groups. These are
> presented to the operating system as NUMA nodes.
>
> This may enable some workloads to have slightly lower latency to memory
> as the memory controller(s) in an SNC node are electrically closer to the
> CPU cores on that SNC node. This latency benefit may be offset by lower
> bandwidth since the memory accesses for each core can only be interleaved
> between the memory controllers on the same SNC node.
>
> Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
> to track L3 cache occupancy and memory bandwidth. There is an MSR that
> controls how the RMIDs are shared between SNC nodes.
>
> The default mode divides them numerically. E.g. when there are two SNC
> nodes on a socket, the lower-numbered half of the RMIDs is given to the
> first node and the remainder to the second node. This would be difficult
> to use with the Linux resctrl interface as specific RMID values assigned
> to resctrl groups are not visible to users.
>
> The other mode divides the RMIDs and renumbers the ones on the second
> SNC node to start from zero.
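
As a worked illustration (the counts here are hypothetical): with 224
physical RMIDs and two SNC nodes per socket, the default mode gives RMIDs
0-111 to the first node and 112-223 to the second. The renumbered mode
instead lets each node count with logical RMIDs 0-111; software then adds
a per-node offset when reading a counter, as the __rmid_read() change in
the diff below does.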
>
> Even with this renumbering, SNC mode requires several changes in resctrl
> behavior for correct operation.
>
> Add a global integer "snc_nodes_per_l3_cache" that shows how many
> SNC nodes share each L3 cache. When "snc_nodes_per_l3_cache" is "1",
> SNC mode is either not implemented, or not enabled.
>
> Update all places to take appropriate action when SNC mode is enabled:
> 1) The number of logical RMIDs per L3 cache available for use is the
> number of physical RMIDs divided by the number of SNC nodes.
> 2) Likewise the "mon_scale" value must be divided by the number of SNC
> nodes.
> 3) The RMID renumbering operates when using the value from the
> IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
> counter, adjust from the logical RMID to the physical RMID value for
> the SNC node whose counter is to be read, and load the adjusted value
> into the IA32_QM_EVTSEL MSR (see the sketch after this list).
> 4) Divide the L3 cache between the SNC nodes. Divide the value
> reported in the resctrl "size" file by the number of SNC
> nodes because the effective amount of cache that can be allocated
> is reduced by that factor.
> 5) Disable the "-o mba_MBps" mount option in SNC mode
> because the monitoring is being done per SNC node, while the
> bandwidth allocation is still done at the L3 cache scope.
> Trying to use this feedback loop might result in contradictory
> changes to the throttling level coming from each of the SNC
> node bandwidth measurements.
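
A minimal sketch of the adjustment described in point 3, assuming the
snc_nodes_per_l3_cache global introduced by this patch (the helper name
snc_physical_rmid is made up for illustration; the real logic lives in
__rmid_read() in the diff below):

	/*
	 * Map a logical (per SNC node) RMID to the physical RMID that
	 * IA32_QM_EVTSEL expects. num_rmid has already been divided by
	 * snc_nodes_per_l3_cache, per point 1 above.
	 */
	static u32 snc_physical_rmid(u32 rmid, int cpu, u32 num_rmid)
	{
		int node = cpu_to_node(cpu) % snc_nodes_per_l3_cache;

		return rmid + node * num_rmid;
	}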
>
> Signed-off-by: Tony Luck<[email protected]>
> Tested-by: Shaopeng Tan<[email protected]>
> Reviewed-by: Peter Newman<[email protected]>
> Reviewed-by: Reinette Chatre<[email protected]>
> Reviewed-by: Shaopeng Tan<[email protected]>
Reviewed-by: Babu Moger <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
> arch/x86/kernel/cpu/resctrl/core.c | 6 ++++++
> arch/x86/kernel/cpu/resctrl/monitor.c | 16 +++++++++++++---
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 5 +++--
> 4 files changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index ce3a70657842..e7a75a439c16 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>
> extern struct dentry *debugfs_resctrl;
>
> +extern unsigned int snc_nodes_per_l3_cache;
> +
> enum resctrl_res_level {
> RDT_RESOURCE_L3,
> RDT_RESOURCE_L2,
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index c9315ce8f7bd..cf5aba8a74bf 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -48,6 +48,12 @@ int max_name_width, max_data_width;
> */
> bool rdt_alloc_capable;
>
> +/*
> + * Number of SNC nodes that share each L3 cache. Default is 1 for
> + * systems that do not support SNC, or have SNC disabled.
> + */
> +unsigned int snc_nodes_per_l3_cache = 1;
> +
> static void
> mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
> struct rdt_resource *r);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 4e145f5620b0..30b7c3b9b517 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
>
> static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> {
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + int cpu = smp_processor_id();
> + int rmid_offset = 0;
> u64 msr_val;
>
> + /*
> + * When SNC mode is on, need to compute the offset to read the
> + * physical RMID counter for the node to which this CPU belongs.
> + */
> + if (snc_nodes_per_l3_cache > 1)
> + rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> +
> /*
> * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
> * with a valid event code for supported resource type and the bits
> @@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
> * are error bits.
> */
> - wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> + wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
> rdmsrl(MSR_IA32_QM_CTR, msr_val);
>
> if (msr_val & RMID_VAL_ERROR)
> @@ -783,8 +793,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
> int ret;
>
> resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
> - hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
> - r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
> + hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
> + r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
> hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
>
> if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 21bbd832f3f2..79d57dade568 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1425,7 +1425,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
> }
> }
>
> - return size;
> + return size / snc_nodes_per_l3_cache;
> }
>
> /*
> @@ -2298,7 +2298,8 @@ static bool supports_mba_mbps(void)
> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
>
> return (is_mbm_local_enabled() &&
> - r->alloc_capable && is_mba_linear());
> + r->alloc_capable && is_mba_linear() &&
> + snc_nodes_per_l3_cache == 1);
> }
>
> /*
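
The supports_mba_mbps() change just above is what implements point 5 of
the commit message: the function now also requires
snc_nodes_per_l3_cache == 1, so the "-o mba_MBps" mount option is refused
while SNC is active.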

2023-12-04 23:06:10

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH v13 7/8] x86/resctrl: Sub NUMA Cluster detection and enable


On 12/4/2023 12:53 PM, Tony Luck wrote:
> There isn't a simple hardware bit that indicates whether a CPU is
> running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
> the ratio of NUMA nodes to L3 cache instances.
>
> When SNC mode is detected, reconfigure the RMID counters by updating
> the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.
>
> Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
> on the second SNC node to start from zero.
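
A quick worked example of the inference (hypothetical topology): a
two-socket system with SNC2 enabled reports four NUMA nodes but only two
L3 cache instances, so nodes / caches = 4 / 2 = 2 and the kernel concludes
that two SNC nodes share each L3. With SNC disabled the same system
reports two nodes and two caches, giving a ratio of 1.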
>
> Signed-off-by: Tony Luck<[email protected]>
> Tested-by: Shaopeng Tan<[email protected]>
> Reviewed-by: Peter Newman<[email protected]>
> Reviewed-by: Reinette Chatre<[email protected]>
> Reviewed-by: Shaopeng Tan<[email protected]>

Reviewed-by: Babu Moger <[email protected]>


> ---
> arch/x86/include/asm/msr-index.h | 1 +
> arch/x86/kernel/cpu/resctrl/core.c | 118 +++++++++++++++++++++++++++++
> 2 files changed, 119 insertions(+)
>
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 1d51e1850ed0..94d29d81e6db 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1111,6 +1111,7 @@
> #define MSR_IA32_QM_CTR 0xc8e
> #define MSR_IA32_PQR_ASSOC 0xc8f
> #define MSR_IA32_L3_CBM_BASE 0xc90
> +#define MSR_RMID_SNC_CONFIG 0xca0
> #define MSR_IA32_L2_CBM_BASE 0xd10
> #define MSR_IA32_MBA_THRTL_BASE 0xd50
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index cf5aba8a74bf..1de1c4499b7d 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -16,11 +16,14 @@
>
> #define pr_fmt(fmt) "resctrl: " fmt
>
> +#include <linux/cpu.h>
> #include <linux/slab.h>
> #include <linux/err.h>
> #include <linux/cacheinfo.h>
> #include <linux/cpuhotplug.h>
> +#include <linux/mod_devicetable.h>
>
> +#include <asm/cpu_device_id.h>
> #include <asm/intel-family.h>
> #include <asm/resctrl.h>
> #include "internal.h"
> @@ -740,11 +743,42 @@ static void clear_closid_rmid(int cpu)
> wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
> }
>
> +/*
> + * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
> + * which indicates that RMIDs are configured in legacy mode.
> + * This mode is incompatible with Linux resctrl semantics
> + * as RMIDs are partitioned between SNC nodes, which requires
> + * a user to know which RMID is allocated to a task.
> + * Clearing bit 0 reconfigures the RMID counters for use
> + * in Sub NUMA Cluster mode. This mode is better for Linux.
> + * The RMID space is divided between all SNC nodes with the
> + * RMIDs renumbered to start from zero in each node when
> + * counting operations from tasks. Code to read the counters
> + * must adjust RMID counter numbers based on SNC node. See
> + * __rmid_read() for code that does this.
> + */
> +static void snc_remap_rmids(int cpu)
> +{
> + u64 val;
> +
> + /* Only need to enable once per package. */
> + if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
> + return;
> +
> + rdmsrl(MSR_RMID_SNC_CONFIG, val);
> + val &= ~BIT_ULL(0);
> + wrmsrl(MSR_RMID_SNC_CONFIG, val);
> +}
> +
> static int resctrl_online_cpu(unsigned int cpu)
> {
> struct rdt_resource *r;
>
> mutex_lock(&rdtgroup_mutex);
> +
> + if (snc_nodes_per_l3_cache > 1)
> + snc_remap_rmids(cpu);
> +
> for_each_capable_rdt_resource(r)
> domain_add_cpu(cpu, r);
> /* The cpu is set in default rdtgroup after online. */
> @@ -999,11 +1033,95 @@ static __init bool get_rdt_resources(void)
> return (rdt_mon_capable || rdt_alloc_capable);
> }
>
> +/* CPU models that support MSR_RMID_SNC_CONFIG */
> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> + X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
> + X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
> + X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
> + X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
> + {}
> +};
> +
> +/*
> + * There isn't a simple hardware bit that indicates whether a CPU is running
> + * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
> + * ratio of NUMA nodes to L3 cache instances.
> + * It is not possible to accurately determine SNC state if the system is
> + * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
> + * to L3 caches. It will be OK if the system is booted with hyperthreading
> + * disabled (since this doesn't affect the ratio).
> + */
> +static __init int snc_get_config(void)
> +{
> + unsigned long *node_caches;
> + int mem_only_nodes = 0;
> + int cpu, node, ret;
> + int num_l3_caches;
> + int cache_id;
> +
> + if (!x86_match_cpu(snc_cpu_ids))
> + return 1;
> +
> + node_caches = bitmap_zalloc(num_possible_cpus(), GFP_KERNEL);
> + if (!node_caches)
> + return 1;
> +
> + cpus_read_lock();
> +
> + if (num_online_cpus() != num_present_cpus())
> + pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
> +
> + for_each_node(node) {
> + cpu = cpumask_first(cpumask_of_node(node));
> + if (cpu < nr_cpu_ids) {
> + cache_id = get_cpu_cacheinfo_id(cpu, 3);
> + if (cache_id != -1)
> + set_bit(cache_id, node_caches);
> + } else {
> + mem_only_nodes++;
> + }
> + }
> + cpus_read_unlock();
> +
> + num_l3_caches = bitmap_weight(node_caches, num_possible_cpus());
> + kfree(node_caches);
> +
> + if (!num_l3_caches)
> + goto insane;
> +
> + /* sanity check #1: Number of CPU nodes must be multiple of num_l3_caches */
> + if ((nr_node_ids - mem_only_nodes) % num_l3_caches)
> + goto insane;
> +
> + ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
> +
> + /* sanity check #2: Only valid results are 1, 2, 3, 4 */
> + switch (ret) {
> + case 1:
> + break;
> + case 2:
> + case 3:
> + case 4:
> + rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
> + break;
> + default:
> + goto insane;
> + }
> +
> + return ret;
> +insane:
> + pr_warn("SNC insanity: CPU nodes = %d num_l3_caches = %d\n",
> + (nr_node_ids - mem_only_nodes), num_l3_caches);
> + return 1;
> +}
> +
> static __init void rdt_init_res_defs_intel(void)
> {
> struct rdt_hw_resource *hw_res;
> struct rdt_resource *r;
>
> + snc_nodes_per_l3_cache = snc_get_config();
> +
> for_each_rdt_resource(r) {
> hw_res = resctrl_to_arch_res(r);
>

2023-12-04 23:06:27

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH v13 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes


On 12/4/2023 12:53 PM, Tony Luck wrote:
> With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
> per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
> their name refer to Sub-NUMA nodes instead of L3 cache ids.
>
> Users should be aware that SNC mode also affects the amount of L3 cache
> available for allocation within each SNC node.
>
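
For example (hypothetical topology): a two-socket system with SNC2
enabled has four NUMA nodes, so "mon_data" contains "mon_L3_00" through
"mon_L3_03", one directory per SNC node rather than one per L3 cache.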
> Signed-off-by: Tony Luck<[email protected]>
> Tested-by: Shaopeng Tan<[email protected]>
> Reviewed-by: Peter Newman<[email protected]>
> Reviewed-by: Reinette Chatre<[email protected]>
> Reviewed-by: Shaopeng Tan<[email protected]>

Reviewed-by: Babu Moger <[email protected]>


> ---
> Documentation/arch/x86/resctrl.rst | 25 +++++++++++++++++++++----
> 1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index a6279df64a9d..15f1cff6ee76 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -366,10 +366,10 @@ When control is enabled all CTRL_MON groups will also contain:
> When monitoring is enabled all MON groups will also contain:
>
> "mon_data":
> - This contains a set of files organized by L3 domain and by
> - RDT event. E.g. on a system with two L3 domains there will
> - be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
> - directories have one file per event (e.g. "llc_occupancy",
> + This contains a set of files organized by L3 domain or by NUMA
> + node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
> + or enabled respectively) and by RDT event. Each of these
> + directories has one file per event (e.g. "llc_occupancy",
> "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
> files provide a read out of the current value of the event for
> all tasks in the group. In CTRL_MON groups these files provide
> @@ -478,6 +478,23 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
> each bit represents 5% of the capacity of the cache. You could partition
> the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
>
> +Notes on Sub-NUMA Cluster mode
> +==============================
> +When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
> +nodes much more readily than between regular NUMA nodes since the CPUs
> +on Sub-NUMA nodes share the same L3 cache and the system may report
> +the NUMA distance between Sub-NUMA nodes with a lower value than used
> +for regular NUMA nodes. Users who do not bind tasks to the CPUs of a
> +specific Sub-NUMA node must read the "llc_occupancy", "mbm_total_bytes",
> +and "mbm_local_bytes" for all Sub-NUMA nodes where the tasks may execute
> +to get the full view of traffic for which the tasks were the source.
> +
> +The cache allocation feature still provides the same number of
> +bits in a mask to control allocation into the L3 cache, but each
> +of those ways has its capacity reduced because the cache is divided
> +between the SNC nodes. The values reported in the resctrl
> +"size" files are adjusted accordingly.
> +
> Memory bandwidth Allocation and monitoring
> ==========================================
>
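
To illustrate the "size" adjustment described above (numbers are
hypothetical): with a 20-bit CBM and a 30 MB L3, each mask bit normally
represents 30 MB / 20 = 1.5 MB. Under SNC2 the same bit only selects
capacity within one SNC node, so the resctrl "size" file reports
1.5 MB / 2 = 0.75 MB per bit.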

2023-12-04 23:06:48

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH v13 5/8] x86/resctrl: Add node-scope to the options for feature scope


On 12/4/2023 12:53 PM, Tony Luck wrote:
> All currently supported resctrl features are domain scoped at the same
> granularity as the L2 or L3 caches.
>
> Add RESCTRL_NODE as a new option for features that are scoped at the
> same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
> Cluster (SNC) feature where monitoring features are node scoped.
>
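
For the cache scopes the domain id comes from get_cpu_cacheinfo_id(); for
the new RESCTRL_NODE scope it is simply the NUMA node number from
cpu_to_node(), as the core.c hunk below shows.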
> Signed-off-by: Tony Luck<[email protected]>
> Tested-by: Shaopeng Tan<[email protected]>
> Reviewed-by: Peter Newman<[email protected]>
> Reviewed-by: Reinette Chatre<[email protected]>
> Reviewed-by: Shaopeng Tan<[email protected]>
Reviewed-by: Babu Moger <[email protected]>
> ---
> include/linux/resctrl.h | 1 +
> arch/x86/kernel/cpu/resctrl/core.c | 2 ++
> 2 files changed, 3 insertions(+)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 058a940c3239..b8a3a11b970d 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -170,6 +170,7 @@ struct resctrl_schema;
> enum resctrl_scope {
> RESCTRL_L2_CACHE = 2,
> RESCTRL_L3_CACHE = 3,
> + RESCTRL_NODE,
> };
>
> /**
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 797cb3bf417a..c9315ce8f7bd 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -502,6 +502,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> case RESCTRL_L2_CACHE:
> case RESCTRL_L3_CACHE:
> return get_cpu_cacheinfo_id(cpu, scope);
> + case RESCTRL_NODE:
> + return cpu_to_node(cpu);
> default:
> break;
> }

2023-12-04 23:26:08

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v13 0/8] Add support for Sub-NUMA cluster (SNC) systems

On Mon, Dec 04, 2023 at 10:53:49AM -0800, Tony Luck wrote:

Boris: I've collected "Reviewed-by:" from Reinette for all patches. Babu
sent a Tested-by for the series, and Reviewed-by for each patch just
now.

So it's ready to go into your to-be-reviewed queue.

Thanks

-Tony

> The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
> that share an L3 cache into two or more sets. This plays havoc with the
> Resource Director Technology (RDT) monitoring features. Prior to this
> patch Intel has advised that SNC and RDT are incompatible.
>
> Some of these CPUs support an MSR that can partition the RMID counters in
> the same way. This allows monitoring features to be used, with the caveat
> that users must be aware that Linux may migrate tasks more frequently
> between SNC nodes than between "regular" NUMA nodes, so reading counters
> from all SNC nodes may be needed to get a complete picture of activity
> for tasks.
>
> Cache and memory bandwidth allocation features continue to operate at
> the scope of the L3 cache.
>
> Signed-off-by: Tony Luck <[email protected]>
>
> Changes since v12:
>
> All:
> Reinette - put commit tags in right order for TIP (Tested-by before
> Reviewed-by)
>
> Patch 7:
> Fam Zheng - Check for -1 return from get_cpu_cacheinfo_id() and
> increase size of bitmap tracking # of L3 instances.
> Reinette - Add extra sanity checks. Note that this patch has
> some additional tweaks beyond the e-mail discussion.
> 1) "3" is a valid return in addition to 1, 2, 4
> 2) Added a warning if the sanity checks fail that
> prints number of CPU nodes and number of L3 cache
> instances that were found.
>
> Patch 8:
> Babu - Fix grammar with an additional comma.
>
>
> Tony Luck (8):
> x86/resctrl: Prepare for new domain scope
> x86/resctrl: Prepare to split rdt_domain structure
> x86/resctrl: Prepare for different scope for control/monitor
> operations
> x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
> x86/resctrl: Add node-scope to the options for feature scope
> x86/resctrl: Introduce snc_nodes_per_l3_cache
> x86/resctrl: Sub NUMA Cluster detection and enable
> x86/resctrl: Update documentation with Sub-NUMA cluster changes
>
> Documentation/arch/x86/resctrl.rst | 25 +-
> include/linux/resctrl.h | 85 +++--
> arch/x86/include/asm/msr-index.h | 1 +
> arch/x86/kernel/cpu/resctrl/internal.h | 66 ++--
> arch/x86/kernel/cpu/resctrl/core.c | 433 +++++++++++++++++-----
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 58 +--
> arch/x86/kernel/cpu/resctrl/monitor.c | 68 ++--
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 26 +-
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 149 ++++----
> 9 files changed, 629 insertions(+), 282 deletions(-)
>
>
> base-commit: 2cc14f52aeb78ce3f29677c2de1f06c0e91471ab
> --
> 2.41.0
>

2024-01-26 22:39:21

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v14 0/8] Add support for Sub-NUMA cluster (SNC) systems

The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features. Prior to this
patch Intel has advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID counters in
the same way. This allows monitoring features to be used, with the caveat
that users must be aware that Linux may migrate tasks more frequently
between SNC nodes than between "regular" NUMA nodes, so reading counters
from all SNC nodes may be needed to get a complete picture of activity
for tasks.

Cache and memory bandwidth allocation features continue to operate at
the scope of the L3 cache.

Signed-off-by: Tony Luck <[email protected]>

---
Changes since v13:

Rebase to tip x86/cache
fc747eebef73 ("x86/resctrl: Remove redundant variable in mbm_config_write_domain()")

Applied all Reviewed and Tested tags

Tony Luck (8):
x86/resctrl: Prepare for new domain scope
x86/resctrl: Prepare to split rdt_domain structure
x86/resctrl: Prepare for different scope for control/monitor
operations
x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
x86/resctrl: Add node-scope to the options for feature scope
x86/resctrl: Introduce snc_nodes_per_l3_cache
x86/resctrl: Sub NUMA Cluster detection and enable
x86/resctrl: Update documentation with Sub-NUMA cluster changes

Documentation/arch/x86/resctrl.rst | 25 +-
include/linux/resctrl.h | 85 +++--
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 66 ++--
arch/x86/kernel/cpu/resctrl/core.c | 433 +++++++++++++++++-----
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 58 +--
arch/x86/kernel/cpu/resctrl/monitor.c | 68 ++--
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 26 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 149 ++++----
9 files changed, 629 insertions(+), 282 deletions(-)


base-commit: fc747eebef734563cf68a512f57937c8f231834a
--
2.43.0


2024-01-26 22:39:28

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v14 1/8] x86/resctrl: Prepare for new domain scope

Resctrl resources operate on subsets of CPUs in the system with the
defining attribute of each subset being an instance of a particular
level of cache. E.g. all CPUs sharing an L3 cache would be part of the
same domain.

In preparation for features that are scoped at the NUMA node level,
change the code from explicit references to "cache_level" to a more
generic scope. At this point the only options for this scope are groups
of CPUs that share an L2 cache or L3 cache.

Clean up the error handling when looking up domains. Report invalid ids
before calling rdt_find_domain() in preparation for better messages when
scope can be other than cache scope. This means that rdt_find_domain()
will never return an error. So remove checks for error from the callsites.
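
One detail worth calling out (inferred from the diff below): the enum
values RESCTRL_L2_CACHE = 2 and RESCTRL_L3_CACHE = 3 are chosen to match
the hardware cache levels, which lets get_domain_id_from_scope() hand the
scope value straight to get_cpu_cacheinfo_id() without any translation.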

Tested-by: Shaopeng Tan <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Babu Moger <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 9 ++++-
arch/x86/kernel/cpu/resctrl/core.c | 45 ++++++++++++++++-------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 2 +-
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 6 ++-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 5 ++-
5 files changed, 48 insertions(+), 19 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 66942d7fba7f..7d4eb7df611d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -144,13 +144,18 @@ struct resctrl_membw {
struct rdt_parse_data;
struct resctrl_schema;

+enum resctrl_scope {
+ RESCTRL_L2_CACHE = 2,
+ RESCTRL_L3_CACHE = 3,
+};
+
/**
* struct rdt_resource - attributes of a resctrl resource
* @rid: The index of the resource
* @alloc_capable: Is allocation available on this machine
* @mon_capable: Is monitor feature available on this machine
* @num_rmid: Number of RMIDs available
- * @cache_level: Which cache level defines scope of this resource
+ * @scope: Scope of this resource
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
* @domains: All domains for this resource
@@ -168,7 +173,7 @@ struct rdt_resource {
bool alloc_capable;
bool mon_capable;
int num_rmid;
- int cache_level;
+ enum resctrl_scope scope;
struct resctrl_cache cache;
struct resctrl_membw membw;
struct list_head domains;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index aa9810a64258..268965c7c7ef 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L3,
.name = "L3",
- .cache_level = 3,
+ .scope = RESCTRL_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_L3),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
@@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L2,
.name = "L2",
- .cache_level = 2,
+ .scope = RESCTRL_L2_CACHE,
.domains = domain_init(RDT_RESOURCE_L2),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
@@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_MBA,
.name = "MB",
- .cache_level = 3,
+ .scope = RESCTRL_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_MBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
@@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_SMBA,
.name = "SMBA",
- .cache_level = 3,
+ .scope = RESCTRL_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_SMBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
@@ -399,9 +399,6 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
struct rdt_domain *d;
struct list_head *l;

- if (id < 0)
- return ERR_PTR(-ENODEV);
-
list_for_each(l, &r->domains) {
d = list_entry(l, struct rdt_domain, list);
/* When id is found, return its domain. */
@@ -489,6 +486,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
return 0;
}

+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
+{
+ switch (scope) {
+ case RESCTRL_L2_CACHE:
+ case RESCTRL_L3_CACHE:
+ return get_cpu_cacheinfo_id(cpu, scope);
+ default:
+ break;
+ }
+
+ return -EINVAL;
+}
+
/*
* domain_add_cpu - Add a cpu to a resource's domain list.
*
@@ -504,18 +514,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
*/
static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ int id = get_domain_id_from_scope(cpu, r->scope);
struct list_head *add_pos = NULL;
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;
int err;

- d = rdt_find_domain(r, id, &add_pos);
- if (IS_ERR(d)) {
- pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+ if (id < 0) {
+ pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->scope, r->name);
return;
}

+ d = rdt_find_domain(r, id, &add_pos);
if (d) {
cpumask_set_cpu(cpu, &d->cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
@@ -554,13 +565,19 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)

static void domain_remove_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ int id = get_domain_id_from_scope(cpu, r->scope);
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;

+ if (id < 0) {
+ pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->scope, r->name);
+ return;
+ }
+
d = rdt_find_domain(r, id, NULL);
- if (IS_ERR_OR_NULL(d)) {
- pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+ if (!d) {
+ pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
return;
}
hw_dom = resctrl_to_arch_dom(d);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index beccb0e87ba7..3f8891d57fac 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -563,7 +563,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)

r = &rdt_resources_all[resid].r_resctrl;
d = rdt_find_domain(r, domid, NULL);
- if (IS_ERR_OR_NULL(d)) {
+ if (!d) {
ret = -ENOENT;
goto out;
}
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 8f559eeae08e..2a682da9f43a 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
*/
static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
{
+ enum resctrl_scope scope = plr->s->res->scope;
struct cpu_cacheinfo *ci;
int ret;
int i;

+ if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
+ return -ENODEV;
+
/* Pick the first cpu we find that is associated with the cache. */
plr->cpu = cpumask_first(&plr->d->cpu_mask);

@@ -311,7 +315,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);

for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == plr->s->res->cache_level) {
+ if (ci->info_list[i].level == scope) {
plr->line_size = ci->info_list[i].coherency_line_size;
return 0;
}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index aa24343f1d23..1a9b2cd99b5f 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1413,10 +1413,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
unsigned int size = 0;
int num_b, i;

+ if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+ return size;
+
num_b = bitmap_weight(&cbm, r->cache.cbm_len);
ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == r->cache_level) {
+ if (ci->info_list[i].level == r->scope) {
size = ci->info_list[i].size / r->cache.cbm_len * num_b;
break;
}
--
2.43.0


2024-01-26 22:39:51

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v14 2/8] x86/resctrl: Prepare to split rdt_domain structure

The rdt_domain structure is used for both control and monitor features.
It is about to be split into separate structures for these two usages
because the scope for control and monitoring features for a resource
will be different for future resources.

To allow for common code that scans a list of domains looking for a
specific domain id, move all the common fields ("list", "id", "cpu_mask")
into their own structure within the rdt_domain structure.
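
A minimal sketch of the lookup pattern this enables (illustrative only;
the helper names are made up here, and assume <linux/list.h> plus the
structures from this patch):

	/* Walk a domain list by the common header, matching on id. */
	static struct rdt_domain_hdr *find_domain_hdr(struct list_head *head,
						      int id)
	{
		struct rdt_domain_hdr *hdr;

		list_for_each_entry(hdr, head, list) {
			if (hdr->id == id)
				return hdr;
		}
		return NULL;
	}

	/* Callers recover the full structure from the embedded header. */
	static struct rdt_domain *hdr_to_domain(struct rdt_domain_hdr *hdr)
	{
		return container_of(hdr, struct rdt_domain, hdr);
	}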

Tested-by: Shaopeng Tan <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Babu Moger <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 16 ++++--
arch/x86/kernel/cpu/resctrl/core.c | 26 +++++-----
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 22 ++++-----
arch/x86/kernel/cpu/resctrl/monitor.c | 10 ++--
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 14 +++---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 60 +++++++++++------------
6 files changed, 78 insertions(+), 70 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 7d4eb7df611d..c4067150a6b7 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -53,10 +53,20 @@ struct resctrl_staged_config {
};

/**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain_hdr - common header for different domain types
* @list: all instances of this resource
* @id: unique id for this instance
* @cpu_mask: which CPUs share this resource
+ */
+struct rdt_domain_hdr {
+ struct list_head list;
+ int id;
+ struct cpumask cpu_mask;
+};
+
+/**
+ * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * @hdr: common header for different domain types
* @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
* @mbm_total: saved state for MBM total bandwidth
* @mbm_local: saved state for MBM local bandwidth
@@ -71,9 +81,7 @@ struct resctrl_staged_config {
* by closid
*/
struct rdt_domain {
- struct list_head list;
- int id;
- struct cpumask cpu_mask;
+ struct rdt_domain_hdr hdr;
unsigned long *rmid_busy_llc;
struct mbm_state *mbm_total;
struct mbm_state *mbm_local;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 268965c7c7ef..625f75899d7b 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -354,9 +354,9 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
{
struct rdt_domain *d;

- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
/* Find the domain that contains this CPU */
- if (cpumask_test_cpu(cpu, &d->cpu_mask))
+ if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
return d;
}

@@ -400,12 +400,12 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
struct list_head *l;

list_for_each(l, &r->domains) {
- d = list_entry(l, struct rdt_domain, list);
+ d = list_entry(l, struct rdt_domain, hdr.list);
/* When id is found, return its domain. */
- if (id == d->id)
+ if (id == d->hdr.id)
return d;
/* Stop searching when finding id's position in sorted list. */
- if (id < d->id)
+ if (id < d->hdr.id)
break;
}

@@ -528,7 +528,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)

d = rdt_find_domain(r, id, &add_pos);
if (d) {
- cpumask_set_cpu(cpu, &d->cpu_mask);
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
rdt_domain_reconfigure_cdp(r);
return;
@@ -539,8 +539,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;

d = &hw_dom->d_resctrl;
- d->id = id;
- cpumask_set_cpu(cpu, &d->cpu_mask);
+ d->hdr.id = id;
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);

rdt_domain_reconfigure_cdp(r);

@@ -554,11 +554,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;
}

- list_add_tail(&d->list, add_pos);
+ list_add_tail(&d->hdr.list, add_pos);

err = resctrl_online_domain(r, d);
if (err) {
- list_del(&d->list);
+ list_del(&d->hdr.list);
domain_free(hw_dom);
}
}
@@ -582,10 +582,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
}
hw_dom = resctrl_to_arch_dom(d);

- cpumask_clear_cpu(cpu, &d->cpu_mask);
- if (cpumask_empty(&d->cpu_mask)) {
+ cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+ if (cpumask_empty(&d->hdr.cpu_mask)) {
resctrl_offline_domain(r, d);
- list_del(&d->list);
+ list_del(&d->hdr.list);

/*
* rdt_domain "d" is going to be freed below, so clear
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 3f8891d57fac..23f8258d36a8 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -67,7 +67,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,

cfg = &d->staged_config[s->conf_type];
if (cfg->have_new_ctrl) {
- rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+ rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
return -EINVAL;
}

@@ -146,7 +146,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,

cfg = &d->staged_config[s->conf_type];
if (cfg->have_new_ctrl) {
- rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+ rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
return -EINVAL;
}

@@ -226,8 +226,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
return -EINVAL;
}
dom = strim(dom);
- list_for_each_entry(d, &r->domains, list) {
- if (d->id == dom_id) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
+ if (d->hdr.id == dom_id) {
data.buf = dom;
data.rdtgrp = rdtgrp;
if (r->parse_ctrlval(&data, s, d))
@@ -274,7 +274,7 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
struct rdt_domain *dom = &hw_dom->d_resctrl;

if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
- cpumask_set_cpu(cpumask_any(&dom->cpu_mask), cpu_mask);
+ cpumask_set_cpu(cpumask_any(&dom->hdr.cpu_mask), cpu_mask);
hw_dom->ctrl_val[idx] = cfg->new_ctrl;

return true;
@@ -291,7 +291,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
u32 idx = get_config_index(closid, t);
struct msr_param msr_param;

- if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+ if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
return -EINVAL;

hw_dom->ctrl_val[idx] = cfg_val;
@@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
return -ENOMEM;

msr_param.res = NULL;
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
hw_dom = resctrl_to_arch_dom(d);
for (t = 0; t < CDP_NUM_TYPES; t++) {
cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
u32 ctrl_val;

seq_printf(s, "%*s:", max_name_width, schema->name);
- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->domains, hdr.list) {
if (sep)
seq_puts(s, ";");

@@ -476,7 +476,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
ctrl_val = resctrl_arch_get_config(r, dom, closid,
schema->conf_type);

- seq_printf(s, r->format_str, dom->id, max_data_width,
+ seq_printf(s, r->format_str, dom->hdr.id, max_data_width,
ctrl_val);
sep = true;
}
@@ -505,7 +505,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
} else {
seq_printf(s, "%s:%d=%x\n",
rdtgrp->plr->s->res->name,
- rdtgrp->plr->d->id,
+ rdtgrp->plr->d->hdr.id,
rdtgrp->plr->cbm);
}
} else {
@@ -536,7 +536,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
rr->val = 0;
rr->first = first;

- smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+ smp_call_function_any(&d->hdr.cpu_mask, mon_event_count, rr, 1);
}

int rdtgroup_mondata_show(struct seq_file *m, void *arg)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 3a6c069614eb..310ab05f0cb7 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -238,7 +238,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
u64 msr_val, chunks;
int ret;

- if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+ if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
return -EINVAL;

ret = __rmid_read(rmid, eventid, &msr_val);
@@ -340,8 +340,8 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)

entry->busy = 0;
cpu = get_cpu();
- list_for_each_entry(d, &r->domains, list) {
- if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
+ if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) {
err = resctrl_arch_rmid_read(r, d, entry->rmid,
QOS_L3_OCCUP_EVENT_ID,
&val);
@@ -639,7 +639,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;

- cpu = cpumask_any(&dom->cpu_mask);
+ cpu = cpumask_any(&dom->hdr.cpu_mask);
dom->cqm_work_cpu = cpu;

schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
@@ -686,7 +686,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)

if (!static_branch_likely(&rdt_mon_enable_key))
return;
- cpu = cpumask_any(&dom->cpu_mask);
+ cpu = cpumask_any(&dom->hdr.cpu_mask);
dom->mbm_work_cpu = cpu;
schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
}
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 2a682da9f43a..fcbd99e2eb66 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -221,7 +221,7 @@ static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr)
int cpu;
int ret;

- for_each_cpu(cpu, &plr->d->cpu_mask) {
+ for_each_cpu(cpu, &plr->d->hdr.cpu_mask) {
pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL);
if (!pm_req) {
rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n");
@@ -301,7 +301,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
return -ENODEV;

/* Pick the first cpu we find that is associated with the cache. */
- plr->cpu = cpumask_first(&plr->d->cpu_mask);
+ plr->cpu = cpumask_first(&plr->d->hdr.cpu_mask);

if (!cpu_online(plr->cpu)) {
rdt_last_cmd_printf("CPU %u associated with cache not online\n",
@@ -856,10 +856,10 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
* associated with them.
*/
for_each_alloc_capable_rdt_resource(r) {
- list_for_each_entry(d_i, &r->domains, list) {
+ list_for_each_entry(d_i, &r->domains, hdr.list) {
if (d_i->plr)
cpumask_or(cpu_with_psl, cpu_with_psl,
- &d_i->cpu_mask);
+ &d_i->hdr.cpu_mask);
}
}

@@ -867,7 +867,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
* Next test if new pseudo-locked region would intersect with
* existing region.
*/
- if (cpumask_intersects(&d->cpu_mask, cpu_with_psl))
+ if (cpumask_intersects(&d->hdr.cpu_mask, cpu_with_psl))
ret = true;

free_cpumask_var(cpu_with_psl);
@@ -1199,7 +1199,7 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
}

plr->thread_done = 0;
- cpu = cpumask_first(&plr->d->cpu_mask);
+ cpu = cpumask_first(&plr->d->hdr.cpu_mask);
if (!cpu_online(cpu)) {
ret = -ENODEV;
goto out;
@@ -1529,7 +1529,7 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
* may be scheduled elsewhere and invalidate entries in the
* pseudo-locked region.
*/
- if (!cpumask_subset(current->cpus_ptr, &plr->d->cpu_mask)) {
+ if (!cpumask_subset(current->cpus_ptr, &plr->d->hdr.cpu_mask)) {
mutex_unlock(&rdtgroup_mutex);
return -EINVAL;
}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 1a9b2cd99b5f..a459c16735f7 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
lockdep_assert_held(&rdtgroup_mutex);

for_each_alloc_capable_rdt_resource(r) {
- list_for_each_entry(dom, &r->domains, list)
+ list_for_each_entry(dom, &r->domains, hdr.list)
memset(dom->staged_config, 0, sizeof(dom->staged_config));
}
}
@@ -295,7 +295,7 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
rdt_last_cmd_puts("Cache domain offline\n");
ret = -ENODEV;
} else {
- mask = &rdtgrp->plr->d->cpu_mask;
+ mask = &rdtgrp->plr->d->hdr.cpu_mask;
seq_printf(s, is_cpu_list(of) ?
"%*pbl\n" : "%*pb\n",
cpumask_pr_args(mask));
@@ -984,12 +984,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,

mutex_lock(&rdtgroup_mutex);
hw_shareable = r->cache.shareable_bits;
- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->domains, hdr.list) {
if (sep)
seq_putc(seq, ';');
sw_shareable = 0;
exclusive = 0;
- seq_printf(seq, "%d=", dom->id);
+ seq_printf(seq, "%d=", dom->hdr.id);
for (i = 0; i < closids_supported(); i++) {
if (!closid_allocated(i))
continue;
@@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
continue;
has_cache = true;
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
ctrl = resctrl_arch_get_config(r, d, closid,
s->conf_type);
if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1417,7 +1417,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
return size;

num_b = bitmap_weight(&cbm, r->cache.cbm_len);
- ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
+ ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
if (ci->info_list[i].level == r->scope) {
size = ci->info_list[i].size / r->cache.cbm_len * num_b;
@@ -1465,7 +1465,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res,
rdtgrp->plr->d,
rdtgrp->plr->cbm);
- seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size);
+ seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size);
}
goto out;
}
@@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
type = schema->conf_type;
sep = false;
seq_printf(s, "%*s:", max_name_width, schema->name);
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
if (sep)
seq_putc(s, ';');
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1495,7 +1495,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
else
size = rdtgroup_cbm_to_size(r, d, ctrl);
}
- seq_printf(s, "%d=%u", d->id, size);
+ seq_printf(s, "%d=%u", d->hdr.id, size);
sep = true;
}
seq_putc(s, '\n');
@@ -1555,7 +1555,7 @@ static void mon_event_config_read(void *info)

static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
{
- smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
+ smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
}

static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
@@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid

mutex_lock(&rdtgroup_mutex);

- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->domains, hdr.list) {
if (sep)
seq_puts(s, ";");

@@ -1574,7 +1574,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
mon_info.evtid = evtid;
mondata_config_read(dom, &mon_info);

- seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
+ seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
sep = true;
}
seq_puts(s, "\n");
@@ -1639,7 +1639,7 @@ static void mbm_config_write_domain(struct rdt_resource *r,
* are scoped at the domain level. Writing any of these MSRs
* on one CPU is observed by all the CPUs in the domain.
*/
- smp_call_function_any(&d->cpu_mask, mon_event_config_write,
+ smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_write,
&mon_info, 1);

/*
@@ -1686,8 +1686,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
return -EINVAL;
}

- list_for_each_entry(d, &r->domains, list) {
- if (d->id == dom_id) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
+ if (d->hdr.id == dom_id) {
mbm_config_write_domain(r, d, evtid, val);
goto next;
}
@@ -2227,14 +2227,14 @@ static int set_cache_qos_cfg(int level, bool enable)
return -ENOMEM;

r_l = &rdt_resources_all[level].r_resctrl;
- list_for_each_entry(d, &r_l->domains, list) {
+ list_for_each_entry(d, &r_l->domains, hdr.list) {
if (r_l->cache.arch_has_per_cpu_cfg)
/* Pick all the CPUs in the domain instance */
- for_each_cpu(cpu, &d->cpu_mask)
+ for_each_cpu(cpu, &d->hdr.cpu_mask)
cpumask_set_cpu(cpu, cpu_mask);
else
/* Pick one CPU from each domain instance to update MSR */
- cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+ cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
}

/* Update QOS_CFG MSR on all the CPUs in cpu_mask */
@@ -2263,7 +2263,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
{
u32 num_closid = resctrl_arch_get_num_closid(r);
- int cpu = cpumask_any(&d->cpu_mask);
+ int cpu = cpumask_any(&d->hdr.cpu_mask);
int i;

d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val),
@@ -2312,7 +2312,7 @@ static int set_mba_sc(bool mba_sc)

r->membw.mba_sc = mba_sc;

- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
for (i = 0; i < num_closid; i++)
d->mbps_val[i] = MBA_MAX_MBPS;
}
@@ -2648,7 +2648,7 @@ static int rdt_get_tree(struct fs_context *fc)

if (is_mbm_enabled()) {
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- list_for_each_entry(dom, &r->domains, list)
+ list_for_each_entry(dom, &r->domains, hdr.list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
}

@@ -2775,9 +2775,9 @@ static int reset_all_ctrls(struct rdt_resource *r)
* CBMs in all domains to the maximum mask value. Pick one CPU
* from each domain to update the MSRs below.
*/
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
hw_dom = resctrl_to_arch_dom(d);
- cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+ cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);

for (i = 0; i < hw_res->num_closid; i++)
hw_dom->ctrl_val[i] = r->default_ctrl;
@@ -2981,7 +2981,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
char name[32];
int ret;

- sprintf(name, "mon_%s_%02d", r->name, d->id);
+ sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
/* create the directory */
kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
if (IS_ERR(kn))
@@ -2997,7 +2997,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
}

priv.u.rid = r->rid;
- priv.u.domid = d->id;
+ priv.u.domid = d->hdr.id;
list_for_each_entry(mevt, &r->evt_list, list) {
priv.u.evtid = mevt->evtid;
ret = mon_addfile(kn, mevt->name, priv.priv);
@@ -3045,7 +3045,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_domain *dom;
int ret;

- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->domains, hdr.list) {
ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
if (ret)
return ret;
@@ -3204,7 +3204,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
*/
tmp_cbm = cfg->new_ctrl;
if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) {
- rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id);
+ rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id);
return -ENOSPC;
}
cfg->have_new_ctrl = true;
@@ -3227,7 +3227,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
struct rdt_domain *d;
int ret;

- list_for_each_entry(d, &s->res->domains, list) {
+ list_for_each_entry(d, &s->res->domains, hdr.list) {
ret = __init_one_rdt_domain(d, s, closid);
if (ret < 0)
return ret;
@@ -3242,7 +3242,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
struct resctrl_staged_config *cfg;
struct rdt_domain *d;

- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->domains, hdr.list) {
if (is_mba_sc(r)) {
d->mbps_val[closid] = MBA_MAX_MBPS;
continue;
@@ -3859,7 +3859,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
* per domain monitor data directories.
*/
if (static_branch_unlikely(&rdt_mon_enable_key))
- rmdir_mondata_subdir_allrdtgrp(r, d->id);
+ rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);

if (is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
--
2.43.0


2024-01-26 22:40:01

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v14 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache

Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.

This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This latency benefit may be offset by lower
bandwidth since the memory accesses for each core can only be interleaved
between the memory controllers on the same SNC node.

Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.

The default mode divides them numerically, e.g. when there are two SNC
nodes on a socket the lower-numbered half of the RMIDs are given to the
first node and the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as the specific RMID values
assigned to resctrl groups are not visible to users.

The other mode divides the RMIDs and renumbers the ones on the second
SNC node to start from zero.
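
For illustration (numbers assumed, not from the patch): with 256
physical RMIDs and two SNC nodes per socket, the default mode gives
physical RMIDs 0-127 to the first node and 128-255 to the second. The
sharing mode instead lets each node use logical RMIDs 0-127, with the
hardware mapping the second node's logical RMIDs onto the upper
physical range.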

Even with this renumbering, SNC mode requires several changes in resctrl
behavior for correct operation.

Add a global integer "snc_nodes_per_l3_cache" that shows how many
SNC nodes share each L3 cache. When "snc_nodes_per_l3_cache" is "1",
SNC mode is either not implemented or not enabled.

Update all places to take appropriate action when SNC mode is enabled:
1) The number of logical RMIDs per L3 cache available for use is the
number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be divided by the number of SNC
nodes.
3) The RMID renumbering applies when the value from the IA32_PQR_ASSOC
MSR is used to count accesses by a task. When reading an RMID counter,
adjust the logical RMID to the physical RMID value for the SNC node
whose counter is to be read and load the adjusted value into the
IA32_QM_EVTSEL MSR (see the sketch after this list).
4) The L3 cache is divided between the SNC nodes, so divide the value
reported in the resctrl "size" file by the number of SNC nodes
because the effective amount of cache that can be allocated is
reduced by that factor.
5) Disable the "-o mba_MBps" mount option in SNC mode
because the monitoring is being done per SNC node, while the
bandwidth allocation is still done at the L3 cache scope.
Trying to use this feedback loop might result in contradictory
changes to the throttling level coming from each of the SNC
node bandwidth measurements.
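
The arithmetic in items 1) to 4) can be checked with a small userspace
sketch. This is illustrative only: the SNC node count, RMID count and
cache size are assumed values, and the node index computation assumes
(as the kernel code does) taking the NUMA node id modulo the number of
SNC nodes per cache:

    #include <stdio.h>

    int main(void)
    {
        unsigned int snc_nodes = 2;        /* assumed: SNC2 enabled */
        unsigned int phys_rmids = 256;     /* assumed h/w RMID count */
        unsigned int l3_bytes = 32u << 20; /* assumed 32MB L3 cache */

        /* 1) logical RMIDs usable within each SNC node */
        unsigned int num_rmid = phys_rmids / snc_nodes;

        /* 3) physical RMID for logical RMID 5 read on SNC node 1 */
        unsigned int logical = 5, node = 1;
        unsigned int physical = logical + (node % snc_nodes) * num_rmid;

        /* 4) effective L3 size reported in the "size" file */
        unsigned int size = l3_bytes / snc_nodes;

        printf("num_rmid=%u physical_rmid=%u size=%u\n",
               num_rmid, physical, size);
        return 0;
    }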

Tested-by: Shaopeng Tan <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Babu Moger <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 6 ++++++
arch/x86/kernel/cpu/resctrl/monitor.c | 16 +++++++++++++---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 5 +++--
4 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 3bfd1cf25b49..a1fb73e0681d 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -444,6 +444,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);

extern struct dentry *debugfs_resctrl;

+extern unsigned int snc_nodes_per_l3_cache;
+
enum resctrl_res_level {
RDT_RESOURCE_L3,
RDT_RESOURCE_L2,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 27523841f9fd..fcfc0b117ff7 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,12 @@ int max_name_width, max_data_width;
*/
bool rdt_alloc_capable;

+/*
+ * Number of SNC nodes that share each L3 cache. Default is 1 for
+ * systems that do not support SNC, or have SNC disabled.
+ */
+unsigned int snc_nodes_per_l3_cache = 1;
+
static void
mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
struct rdt_resource *r);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index e51bafc1945b..d93f50f1e1b4 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)

static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ int cpu = smp_processor_id();
+ int rmid_offset = 0;
u64 msr_val;

+ /*
+ * When SNC mode is on, need to compute the offset to read the
+ * physical RMID counter for the node to which this CPU belongs.
+ */
+ if (snc_nodes_per_l3_cache > 1)
+ rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+
/*
* As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
* with a valid event code for supported resource type and the bits
@@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
* IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
* are error bits.
*/
- wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+ wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
rdmsrl(MSR_IA32_QM_CTR, msr_val);

if (msr_val & RMID_VAL_ERROR)
@@ -761,8 +771,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
int ret;

resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
- hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
- r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+ hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+ r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;

if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 441d2f744ccc..52f8e0971ff1 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1425,7 +1425,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
}
}

- return size;
+ return size / snc_nodes_per_l3_cache;
}

/*
@@ -2293,7 +2293,8 @@ static bool supports_mba_mbps(void)
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;

return (is_mbm_local_enabled() &&
- r->alloc_capable && is_mba_linear());
+ r->alloc_capable && is_mba_linear() &&
+ snc_nodes_per_l3_cache == 1);
}

/*
--
2.43.0


2024-01-26 22:40:02

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v14 5/8] x86/resctrl: Add node-scope to the options for feature scope

All currently supported resctrl features are domain scoped, with the
domain scope matching the scope of the L2 or L3 caches.

Add RESCTRL_NODE as a new option for features that are scoped at the
same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
Cluster (SNC) feature where monitoring features are node scoped.
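
For context, a minimal sketch of how the new scope value is expected
to be consumed. The assignment below is an assumption about the later
SNC enable patch in this series, not something done by this patch:

    /* Hypothetical: switch L3 monitoring to node scope under SNC */
    if (snc_nodes_per_l3_cache > 1)
        rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope =
                                                       RESCTRL_NODE;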

Tested-by: Shaopeng Tan <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Babu Moger <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 2 ++
2 files changed, 3 insertions(+)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 058a940c3239..b8a3a11b970d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -170,6 +170,7 @@ struct resctrl_schema;
enum resctrl_scope {
RESCTRL_L2_CACHE = 2,
RESCTRL_L3_CACHE = 3,
+ RESCTRL_NODE,
};

/**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index d7c0651d337d..27523841f9fd 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -500,6 +500,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
case RESCTRL_L2_CACHE:
case RESCTRL_L3_CACHE:
return get_cpu_cacheinfo_id(cpu, scope);
+ case RESCTRL_NODE:
+ return cpu_to_node(cpu);
default:
break;
}
--
2.43.0


2024-01-26 22:40:02

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v14 3/8] x86/resctrl: Prepare for different scope for control/monitor operations

Resctrl assumes that control and monitor operations on a resource are
performed at the same scope.

Prepare for systems that use different scope (specifically Intel needs
to split the RDT_RESOURCE_L3 resource to use L3 scope for cache control
and NODE scope for cache occupancy and memory bandwidth monitoring).

Create separate domain lists for control and monitor operations.

Note that errors during initialization of either control or monitor
functions on a domain would previously result in that domain being
excluded from both control and monitor operations. Now that the domains
are allocated independently, it is no longer necessary to disable both
control and monitor operations if either fails.
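
A minimal sketch of the lookup pattern the split enables (illustrative;
the real versions are in the diff below): search one of the two lists,
check the header type, then recover the full domain with container_of():

    struct rdt_domain_hdr *hdr;
    struct rdt_domain *d = NULL;

    hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
    if (hdr && !WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
        d = container_of(hdr, struct rdt_domain, hdr);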

Tested-by: Shaopeng Tan <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Babu Moger <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 25 ++-
arch/x86/kernel/cpu/resctrl/internal.h | 6 +-
arch/x86/kernel/cpu/resctrl/core.c | 211 ++++++++++++++++------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 12 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 4 +-
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 55 +++---
7 files changed, 220 insertions(+), 97 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index c4067150a6b7..35e700edc6e6 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -52,15 +52,22 @@ struct resctrl_staged_config {
bool have_new_ctrl;
};

+enum resctrl_domain_type {
+ RESCTRL_CTRL_DOMAIN,
+ RESCTRL_MON_DOMAIN,
+};
+
/**
* struct rdt_domain_hdr - common header for different domain types
* @list: all instances of this resource
* @id: unique id for this instance
+ * @type: type of this instance
* @cpu_mask: which CPUs share this resource
*/
struct rdt_domain_hdr {
struct list_head list;
int id;
+ enum resctrl_domain_type type;
struct cpumask cpu_mask;
};

@@ -163,10 +170,12 @@ enum resctrl_scope {
* @alloc_capable: Is allocation available on this machine
* @mon_capable: Is monitor feature available on this machine
* @num_rmid: Number of RMIDs available
- * @scope: Scope of this resource
+ * @ctrl_scope: Scope of this resource for control functions
+ * @mon_scope: Scope of this resource for monitor functions
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
- * @domains: All domains for this resource
+ * @ctrl_domains: Control domains for this resource
+ * @mon_domains: Monitor domains for this resource
* @name: Name to use in "schemata" file.
* @data_width: Character width of data when displaying
* @default_ctrl: Specifies default cache cbm or memory B/W percent.
@@ -181,10 +190,12 @@ struct rdt_resource {
bool alloc_capable;
bool mon_capable;
int num_rmid;
- enum resctrl_scope scope;
+ enum resctrl_scope ctrl_scope;
+ enum resctrl_scope mon_scope;
struct resctrl_cache cache;
struct resctrl_membw membw;
- struct list_head domains;
+ struct list_head ctrl_domains;
+ struct list_head mon_domains;
char *name;
int data_width;
u32 default_ctrl;
@@ -230,8 +241,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,

u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);

/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 52e7e7deee10..c2b0f563ddf9 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -518,8 +518,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
- struct list_head **pos);
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+ struct list_head **pos);
ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
int rdtgroup_schemata_show(struct kernfs_open_file *of,
@@ -538,7 +538,7 @@ int rdt_pseudo_lock_init(void);
void rdt_pseudo_lock_release(void);
int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
int closids_supported(void);
void closid_free(int closid);
int alloc_rmid(void);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 625f75899d7b..5783ca195a99 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -57,7 +57,8 @@ static void
mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r);

-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
+#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)

struct rdt_hw_resource rdt_resources_all[] = {
[RDT_RESOURCE_L3] =
@@ -65,8 +66,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L3,
.name = "L3",
- .scope = RESCTRL_L3_CACHE,
- .domains = domain_init(RDT_RESOURCE_L3),
+ .ctrl_scope = RESCTRL_L3_CACHE,
+ .mon_scope = RESCTRL_L3_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_L3),
+ .mon_domains = mon_domain_init(RDT_RESOURCE_L3),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
@@ -79,8 +82,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L2,
.name = "L2",
- .scope = RESCTRL_L2_CACHE,
- .domains = domain_init(RDT_RESOURCE_L2),
+ .ctrl_scope = RESCTRL_L2_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_L2),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
@@ -93,8 +96,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_MBA,
.name = "MB",
- .scope = RESCTRL_L3_CACHE,
- .domains = domain_init(RDT_RESOURCE_MBA),
+ .ctrl_scope = RESCTRL_L3_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_MBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
.fflags = RFTYPE_RES_MB,
@@ -105,8 +108,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_SMBA,
.name = "SMBA",
- .scope = RESCTRL_L3_CACHE,
- .domains = domain_init(RDT_RESOURCE_SMBA),
+ .ctrl_scope = RESCTRL_L3_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_SMBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
.fflags = RFTYPE_RES_MB,
@@ -350,11 +353,11 @@ cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
}

-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
{
struct rdt_domain *d;

- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
/* Find the domain that contains this CPU */
if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
return d;
@@ -376,7 +379,7 @@ void rdt_ctrl_update(void *arg)
int cpu = smp_processor_id();
struct rdt_domain *d;

- d = get_domain_from_cpu(cpu, r);
+ d = get_ctrl_domain_from_cpu(cpu, r);
if (d) {
hw_res->msr_update(d, m, r);
return;
@@ -386,26 +389,26 @@ void rdt_ctrl_update(void *arg)
}

/*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Search for a domain id in a resource domain list.
*
- * Search resource r's domain list to find the resource id. If the resource
- * id is found in a domain, return the domain. Otherwise, if requested by
- * caller, return the first domain whose id is bigger than the input id.
- * The domain list is sorted by id in ascending order.
+ * Search the domain list to find the domain id. If the domain id is
+ * found, return the domain. NULL otherwise. If the domain id is not
+ * found (and NULL returned) then the first domain with id bigger than
+ * the input id can be returned to the caller via @pos.
*/
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
- struct list_head **pos)
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+ struct list_head **pos)
{
- struct rdt_domain *d;
+ struct rdt_domain_hdr *d;
struct list_head *l;

- list_for_each(l, &r->domains) {
- d = list_entry(l, struct rdt_domain, hdr.list);
+ list_for_each(l, h) {
+ d = list_entry(l, struct rdt_domain_hdr, list);
/* When id is found, return its domain. */
- if (id == d->hdr.id)
+ if (id == d->id)
return d;
/* Stop searching when finding id's position in sorted list. */
- if (id < d->hdr.id)
+ if (id < d->id)
break;
}

@@ -499,35 +502,28 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
return -EINVAL;
}

-/*
- * domain_add_cpu - Add a cpu to a resource's domain list.
- *
- * If an existing domain in the resource r's domain list matches the cpu's
- * resource id, add the cpu in the domain.
- *
- * Otherwise, a new domain is allocated and inserted into the right position
- * in the domain list sorted by id in ascending order.
- *
- * The order in the domain list is visible to users when we print entries
- * in the schemata file and schemata input is validated to have the same order
- * as this list.
- */
-static void domain_add_cpu(int cpu, struct rdt_resource *r)
+static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
{
- int id = get_domain_id_from_scope(cpu, r->scope);
+ int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
struct list_head *add_pos = NULL;
struct rdt_hw_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
struct rdt_domain *d;
int err;

if (id < 0) {
- pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
- cpu, r->scope, r->name);
+ pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->ctrl_scope, r->name);
return;
}

- d = rdt_find_domain(r, id, &add_pos);
- if (d) {
+ hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
+ if (hdr) {
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ return;
+
+ d = container_of(hdr, struct rdt_domain, hdr);
+
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
rdt_domain_reconfigure_cdp(r);
@@ -540,51 +536,114 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)

d = &hw_dom->d_resctrl;
d->hdr.id = id;
+ d->hdr.type = RESCTRL_CTRL_DOMAIN;
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);

rdt_domain_reconfigure_cdp(r);

- if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
+ if (domain_setup_ctrlval(r, d)) {
domain_free(hw_dom);
return;
}

- if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+ list_add_tail(&d->hdr.list, add_pos);
+
+ err = resctrl_online_ctrl_domain(r, d);
+ if (err) {
+ list_del(&d->hdr.list);
+ domain_free(hw_dom);
+ }
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->mon_scope);
+ struct list_head *add_pos = NULL;
+ struct rdt_hw_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
+ struct rdt_domain *d;
+ int err;
+
+ if (id < 0) {
+ pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->mon_scope, r->name);
+ return;
+ }
+
+ hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
+ if (hdr) {
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ return;
+
+ d = container_of(hdr, struct rdt_domain, hdr);
+
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+ return;
+ }
+
+ hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+ if (!hw_dom)
+ return;
+
+ d = &hw_dom->d_resctrl;
+ d->hdr.id = id;
+ d->hdr.type = RESCTRL_MON_DOMAIN;
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+
+ if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
domain_free(hw_dom);
return;
}

list_add_tail(&d->hdr.list, add_pos);

- err = resctrl_online_domain(r, d);
+ err = resctrl_online_mon_domain(r, d);
if (err) {
list_del(&d->hdr.list);
domain_free(hw_dom);
}
}

-static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+/*
+ * domain_add_cpu - Add a CPU to either/both resource's domain lists.
+ */
+static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_domain_id_from_scope(cpu, r->scope);
+ if (r->alloc_capable)
+ domain_add_cpu_ctrl(cpu, r);
+ if (r->mon_capable)
+ domain_add_cpu_mon(cpu, r);
+}
+
+static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
struct rdt_hw_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
struct rdt_domain *d;

if (id < 0) {
- pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
- cpu, r->scope, r->name);
+ pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->ctrl_scope, r->name);
return;
}

- d = rdt_find_domain(r, id, NULL);
- if (!d) {
- pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
+ hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
+ if (!hdr) {
+ pr_warn("Can't find control domain for id=%d for CPU %d for resource %s\n",
+ id, cpu, r->name);
return;
}
+
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ return;
+
+ d = container_of(hdr, struct rdt_domain, hdr);
hw_dom = resctrl_to_arch_dom(d);

cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
if (cpumask_empty(&d->hdr.cpu_mask)) {
- resctrl_offline_domain(r, d);
+ resctrl_offline_ctrl_domain(r, d);
list_del(&d->hdr.list);

/*
@@ -597,6 +656,42 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)

return;
}
+}
+
+static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->mon_scope);
+ struct rdt_hw_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
+ struct rdt_domain *d;
+
+ if (id < 0) {
+ pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->mon_scope, r->name);
+ return;
+ }
+
+ hdr = rdt_find_domain(&r->mon_domains, id, NULL);
+ if (!hdr) {
+ pr_warn("Can't find monitor domain for id=%d for CPU %d for resource %s\n",
+ id, cpu, r->name);
+ return;
+ }
+
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ return;
+
+ d = container_of(hdr, struct rdt_domain, hdr);
+ hw_dom = resctrl_to_arch_dom(d);
+
+ cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+ if (cpumask_empty(&d->hdr.cpu_mask)) {
+ resctrl_offline_mon_domain(r, d);
+ list_del(&d->hdr.list);
+ domain_free(hw_dom);
+
+ return;
+ }

if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
@@ -611,6 +706,14 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
}
}

+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+{
+ if (r->alloc_capable)
+ domain_remove_cpu_ctrl(cpu, r);
+ if (r->mon_capable)
+ domain_remove_cpu_mon(cpu, r);
+}
+
static void clear_closid_rmid(int cpu)
{
struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 23f8258d36a8..0b4136c42762 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -226,7 +226,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
return -EINVAL;
}
dom = strim(dom);
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
if (d->hdr.id == dom_id) {
data.buf = dom;
data.rdtgrp = rdtgrp;
@@ -318,7 +318,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
return -ENOMEM;

msr_param.res = NULL;
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
hw_dom = resctrl_to_arch_dom(d);
for (t = 0; t < CDP_NUM_TYPES; t++) {
cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -466,7 +466,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
u32 ctrl_val;

seq_printf(s, "%*s:", max_name_width, schema->name);
- list_for_each_entry(dom, &r->domains, hdr.list) {
+ list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
if (sep)
seq_puts(s, ";");

@@ -542,6 +542,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
+ struct rdt_domain_hdr *hdr;
u32 resid, evtid, domid;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
@@ -562,11 +563,12 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
evtid = md.u.evtid;

r = &rdt_resources_all[resid].r_resctrl;
- d = rdt_find_domain(r, domid, NULL);
- if (!d) {
+ hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
+ if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
ret = -ENOENT;
goto out;
}
+ d = container_of(hdr, struct rdt_domain, hdr);

mon_event_read(&rr, r, d, rdtgrp, evtid, false);

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 310ab05f0cb7..749f6662e7ca 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)

entry->busy = 0;
cpu = get_cpu();
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->mon_domains, hdr.list) {
if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) {
err = resctrl_arch_rmid_read(r, d, entry->rmid,
QOS_L3_OCCUP_EVENT_ID,
@@ -532,7 +532,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
rmid = rgrp->mon.rmid;
pmbm_data = &dom_mbm->mbm_local[rmid];

- dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
+ dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
if (!dom_mba) {
pr_warn_once("Failure to get domain for MBA update\n");
return;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index fcbd99e2eb66..ed6d59af1cef 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,7 +292,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
*/
static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
{
- enum resctrl_scope scope = plr->s->res->scope;
+ enum resctrl_scope scope = plr->s->res->ctrl_scope;
struct cpu_cacheinfo *ci;
int ret;
int i;
@@ -856,7 +856,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
* associated with them.
*/
for_each_alloc_capable_rdt_resource(r) {
- list_for_each_entry(d_i, &r->domains, hdr.list) {
+ list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) {
if (d_i->plr)
cpumask_or(cpu_with_psl, cpu_with_psl,
&d_i->hdr.cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index a459c16735f7..5cc6e9b11a16 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -91,7 +91,7 @@ void rdt_staged_configs_clear(void)
lockdep_assert_held(&rdtgroup_mutex);

for_each_alloc_capable_rdt_resource(r) {
- list_for_each_entry(dom, &r->domains, hdr.list)
+ list_for_each_entry(dom, &r->ctrl_domains, hdr.list)
memset(dom->staged_config, 0, sizeof(dom->staged_config));
}
}
@@ -984,7 +984,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,

mutex_lock(&rdtgroup_mutex);
hw_shareable = r->cache.shareable_bits;
- list_for_each_entry(dom, &r->domains, hdr.list) {
+ list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
if (sep)
seq_putc(seq, ';');
sw_shareable = 0;
@@ -1302,7 +1302,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
continue;
has_cache = true;
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
ctrl = resctrl_arch_get_config(r, d, closid,
s->conf_type);
if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1413,13 +1413,13 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
unsigned int size = 0;
int num_b, i;

- if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+ if (WARN_ON_ONCE(r->ctrl_scope != RESCTRL_L2_CACHE && r->ctrl_scope != RESCTRL_L3_CACHE))
return size;

num_b = bitmap_weight(&cbm, r->cache.cbm_len);
ci = get_cpu_cacheinfo(cpumask_any(&d->hdr.cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == r->scope) {
+ if (ci->info_list[i].level == r->ctrl_scope) {
size = ci->info_list[i].size / r->cache.cbm_len * num_b;
break;
}
@@ -1477,7 +1477,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
type = schema->conf_type;
sep = false;
seq_printf(s, "%*s:", max_name_width, schema->name);
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
if (sep)
seq_putc(s, ';');
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1566,7 +1566,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid

mutex_lock(&rdtgroup_mutex);

- list_for_each_entry(dom, &r->domains, hdr.list) {
+ list_for_each_entry(dom, &r->mon_domains, hdr.list) {
if (sep)
seq_puts(s, ";");

@@ -1686,7 +1686,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
return -EINVAL;
}

- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->mon_domains, hdr.list) {
if (d->hdr.id == dom_id) {
mbm_config_write_domain(r, d, evtid, val);
goto next;
@@ -2227,7 +2227,7 @@ static int set_cache_qos_cfg(int level, bool enable)
return -ENOMEM;

r_l = &rdt_resources_all[level].r_resctrl;
- list_for_each_entry(d, &r_l->domains, hdr.list) {
+ list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) {
if (r_l->cache.arch_has_per_cpu_cfg)
/* Pick all the CPUs in the domain instance */
for_each_cpu(cpu, &d->hdr.cpu_mask)
@@ -2312,7 +2312,7 @@ static int set_mba_sc(bool mba_sc)

r->membw.mba_sc = mba_sc;

- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
for (i = 0; i < num_closid; i++)
d->mbps_val[i] = MBA_MAX_MBPS;
}
@@ -2648,7 +2648,7 @@ static int rdt_get_tree(struct fs_context *fc)

if (is_mbm_enabled()) {
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- list_for_each_entry(dom, &r->domains, hdr.list)
+ list_for_each_entry(dom, &r->mon_domains, hdr.list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
}

@@ -2772,10 +2772,10 @@ static int reset_all_ctrls(struct rdt_resource *r)

/*
* Disable resource control for this resource by setting all
- * CBMs in all domains to the maximum mask value. Pick one CPU
+ * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU
* from each domain to update the MSRs below.
*/
- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
hw_dom = resctrl_to_arch_dom(d);
cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);

@@ -3045,7 +3045,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_domain *dom;
int ret;

- list_for_each_entry(dom, &r->domains, hdr.list) {
+ list_for_each_entry(dom, &r->mon_domains, hdr.list) {
ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
if (ret)
return ret;
@@ -3227,7 +3227,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
struct rdt_domain *d;
int ret;

- list_for_each_entry(d, &s->res->domains, hdr.list) {
+ list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
ret = __init_one_rdt_domain(d, s, closid);
if (ret < 0)
return ret;
@@ -3242,7 +3242,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
struct resctrl_staged_config *cfg;
struct rdt_domain *d;

- list_for_each_entry(d, &r->domains, hdr.list) {
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
if (is_mba_sc(r)) {
d->mbps_val[closid] = MBA_MAX_MBPS;
continue;
@@ -3844,15 +3844,17 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
kfree(d->mbm_local);
}

-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
{
lockdep_assert_held(&rdtgroup_mutex);

if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
mba_sc_domain_destroy(r, d);
+}

- if (!r->mon_capable)
- return;
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+ lockdep_assert_held(&rdtgroup_mutex);

/*
* If resctrl is mounted, remove all the
@@ -3909,18 +3911,21 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
{
- int err;
-
lockdep_assert_held(&rdtgroup_mutex);

if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
- /* RDT_RESOURCE_MBA is never mon_capable */
return mba_sc_domain_allocate(r, d);

- if (!r->mon_capable)
- return 0;
+ return 0;
+}
+
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+ int err;
+
+ lockdep_assert_held(&rdtgroup_mutex);

err = domain_setup_mon_state(r, d);
if (err)
--
2.43.0


2024-01-26 22:40:29

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v14 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes

With Sub-NUMA Cluster mode enabled, the scope of monitoring resources
is per NUMA node instead of per L3 cache. The suffixes of directories
with "L3" in their name refer to Sub-NUMA nodes instead of L3 cache ids.

Users should be aware that SNC mode also affects the amount of L3 cache
available for allocation within each SNC node.
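
For illustration (numbers assumed): on a socket with a 30MB L3 cache
and SNC2 enabled, the resctrl "size" file reports at most 15MB per SNC
node for a fully-set mask, and a directory such as "mon_L3_01" refers
to Sub-NUMA node 1 rather than L3 cache instance 1.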

Tested-by: Shaopeng Tan <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Babu Moger <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
Documentation/arch/x86/resctrl.rst | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df64a9d..15f1cff6ee76 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -366,10 +366,10 @@ When control is enabled all CTRL_MON groups will also contain:
When monitoring is enabled all MON groups will also contain:

"mon_data":
- This contains a set of files organized by L3 domain and by
- RDT event. E.g. on a system with two L3 domains there will
- be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
- directories have one file per event (e.g. "llc_occupancy",
+ This contains a set of files organized by L3 domain or by NUMA
+ node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+ or enabled respectively) and by RDT event. Each of these
+ directories has one file per event (e.g. "llc_occupancy",
"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
files provide a read out of the current value of the event for
all tasks in the group. In CTRL_MON groups these files provide
@@ -478,6 +478,23 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
each bit represents 5% of the capacity of the cache. You could partition
the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.

+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes. Users who do not bind tasks to the CPUs of a
+specific Sub-NUMA node must read the "llc_occupancy", "mbm_total_bytes",
+and "mbm_local_bytes" for all Sub-NUMA nodes where the tasks may execute
+to get the full view of traffic for which the tasks were the source.
+
+The cache allocation feature still provides the same number of
+bits in a mask to control allocation into the L3 cache, but each
+of those ways has its capacity reduced because the cache is divided
+between the SNC nodes. The values reported in the resctrl
+"size" files are adjusted accordingly.
+
Memory bandwidth Allocation and monitoring
==========================================

--
2.43.0


2024-01-26 22:41:05

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v14 4/8] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures

The same rdt_domain structure is used for both control and monitor
functions, but this results in wasted memory since some of the fields
are used only by control functions while most are used only by monitor
functions.

Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
just the fields required for control and monitoring respectively.

Similarly, split the rdt_hw_domain structure into rdt_hw_ctrl_domain
and rdt_hw_mon_domain.
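
A minimal illustrative fragment of how arch code converts to the
matching arch-private wrapper after the split ("cd" and "md" are
assumed locals of the new domain types):

    /* Control path: only the ctrl_val[] array is carried along */
    struct rdt_hw_ctrl_domain *hw_cdom = resctrl_to_arch_ctrl_dom(cd);

    /* Monitor path: only the MBM state arrays are carried along */
    struct rdt_hw_mon_domain *hw_mdom = resctrl_to_arch_mon_dom(md);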

Tested-by: Shaopeng Tan <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Babu Moger <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 48 +++++++------
arch/x86/kernel/cpu/resctrl/internal.h | 60 ++++++++++------
arch/x86/kernel/cpu/resctrl/core.c | 87 ++++++++++++-----------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 32 ++++-----
arch/x86/kernel/cpu/resctrl/monitor.c | 40 +++++------
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 6 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 62 ++++++++--------
7 files changed, 182 insertions(+), 153 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 35e700edc6e6..058a940c3239 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -72,7 +72,23 @@ struct rdt_domain_hdr {
};

/**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
+ * @hdr: common header for different domain types
+ * @plr: pseudo-locked region (if any) associated with domain
+ * @staged_config: parsed configuration to be applied
+ * @mbps_val: When mba_sc is enabled, this holds the array of user
+ * specified control values for mba_sc in MBps, indexed
+ * by closid
+ */
+struct rdt_ctrl_domain {
+ struct rdt_domain_hdr hdr;
+ struct pseudo_lock_region *plr;
+ struct resctrl_staged_config staged_config[CDP_NUM_TYPES];
+ u32 *mbps_val;
+};
+
+/**
+ * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
* @hdr: common header for different domain types
* @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
* @mbm_total: saved state for MBM total bandwidth
@@ -81,13 +97,8 @@ struct rdt_domain_hdr {
* @cqm_limbo: worker to periodically read CQM h/w counters
* @mbm_work_cpu: worker CPU for MBM h/w counters
* @cqm_work_cpu: worker CPU for CQM h/w counters
- * @plr: pseudo-locked region (if any) associated with domain
- * @staged_config: parsed configuration to be applied
- * @mbps_val: When mba_sc is enabled, this holds the array of user
- * specified control values for mba_sc in MBps, indexed
- * by closid
*/
-struct rdt_domain {
+struct rdt_mon_domain {
struct rdt_domain_hdr hdr;
unsigned long *rmid_busy_llc;
struct mbm_state *mbm_total;
@@ -96,9 +107,6 @@ struct rdt_domain {
struct delayed_work cqm_limbo;
int mbm_work_cpu;
int cqm_work_cpu;
- struct pseudo_lock_region *plr;
- struct resctrl_staged_config staged_config[CDP_NUM_TYPES];
- u32 *mbps_val;
};

/**
@@ -202,7 +210,7 @@ struct rdt_resource {
const char *format_str;
int (*parse_ctrlval)(struct rdt_parse_data *data,
struct resctrl_schema *s,
- struct rdt_domain *d);
+ struct rdt_ctrl_domain *d);
struct list_head evt_list;
unsigned long fflags;
bool cdp_capable;
@@ -236,15 +244,15 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
* Update the ctrl_val and apply this config right now.
* Must be called on one of the domain's CPUs.
*/
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type t, u32 cfg_val);

-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type type);
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);

/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
@@ -260,7 +268,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
* Return:
* 0 on success, or -EIO, -EINVAL etc on error.
*/
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 rmid, enum resctrl_event_id eventid, u64 *val);

/**
@@ -273,7 +281,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 rmid, enum resctrl_event_id eventid);

/**
@@ -285,7 +293,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);

extern unsigned int resctrl_rmid_realloc_threshold;
extern unsigned int resctrl_rmid_realloc_limit;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index c2b0f563ddf9..3bfd1cf25b49 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -106,7 +106,7 @@ union mon_data_bits {
struct rmid_read {
struct rdtgroup *rgrp;
struct rdt_resource *r;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;
enum resctrl_event_id evtid;
bool first;
int err;
@@ -191,7 +191,7 @@ struct mongroup {
*/
struct pseudo_lock_region {
struct resctrl_schema *s;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
u32 cbm;
wait_queue_head_t lock_thread_wq;
int thread_done;
@@ -314,25 +314,41 @@ struct arch_mbm_state {
};

/**
- * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
- * a resource
+ * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
+ * a resource for a control function
* @d_resctrl: Properties exposed to the resctrl file system
* @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID)
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_ctrl_domain {
+ struct rdt_ctrl_domain d_resctrl;
+ u32 *ctrl_val;
+};
+
+/**
+ * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
+ * a resource for a monitor function
+ * @d_resctrl: Properties exposed to the resctrl file system
* @arch_mbm_total: arch private state for MBM total bandwidth
* @arch_mbm_local: arch private state for MBM local bandwidth
*
* Members of this structure are accessed via helpers that provide abstraction.
*/
-struct rdt_hw_domain {
- struct rdt_domain d_resctrl;
- u32 *ctrl_val;
+struct rdt_hw_mon_domain {
+ struct rdt_mon_domain d_resctrl;
struct arch_mbm_state *arch_mbm_total;
struct arch_mbm_state *arch_mbm_local;
};

-static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
+static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
+{
+ return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
+}
+
+static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
{
- return container_of(r, struct rdt_hw_domain, d_resctrl);
+ return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
}

/**
@@ -402,7 +418,7 @@ struct rdt_hw_resource {
struct rdt_resource r_resctrl;
u32 num_closid;
unsigned int msr_base;
- void (*msr_update) (struct rdt_domain *d, struct msr_param *m,
+ void (*msr_update) (struct rdt_ctrl_domain *d, struct msr_param *m,
struct rdt_resource *r);
unsigned int mon_scale;
unsigned int mbm_width;
@@ -416,9 +432,9 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
}

int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
- struct rdt_domain *d);
+ struct rdt_ctrl_domain *d);
int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
- struct rdt_domain *d);
+ struct rdt_ctrl_domain *d);

extern struct mutex rdtgroup_mutex;

@@ -524,21 +540,21 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
int rdtgroup_schemata_show(struct kernfs_open_file *of,
struct seq_file *s, void *v);
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
unsigned long cbm, int closid, bool exclusive);
-unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d,
+unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d,
unsigned long cbm);
enum rdtgrp_mode rdtgroup_mode_by_closid(int closid);
int rdtgroup_tasks_assigned(struct rdtgroup *r);
int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp);
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm);
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d);
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm);
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d);
int rdt_pseudo_lock_init(void);
void rdt_pseudo_lock_release(void);
int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
int closids_supported(void);
void closid_free(int closid);
int alloc_rmid(void);
@@ -548,17 +564,17 @@ bool __init rdt_cpu_has(int flag);
void mon_event_count(void *info);
int rdtgroup_mondata_show(struct seq_file *m, void *arg);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
int evtid, int first);
-void mbm_setup_overflow_handler(struct rdt_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
unsigned long delay_ms);
void mbm_handle_overflow(struct work_struct *work);
void __init intel_rdt_mbm_apply_quirk(void);
bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms);
void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
-void __check_limbo(struct rdt_domain *d, bool force_free);
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d);
+void __check_limbo(struct rdt_mon_domain *d, bool force_free);
void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
void __init thread_throttle_mode_init(void);
void __init mbm_config_rftype_init(const char *config);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 5783ca195a99..d7c0651d337d 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -49,12 +49,12 @@ int max_name_width, max_data_width;
bool rdt_alloc_capable;

static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
struct rdt_resource *r);
static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r);
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r);
static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m,
struct rdt_resource *r);

#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
@@ -305,11 +305,11 @@ static void rdt_get_cdp_l2_config(void)
}

static void
-mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+mba_wrmsr_amd(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
{
- unsigned int i;
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ unsigned int i;

for (i = m->low; i < m->high; i++)
wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
@@ -330,12 +330,12 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
}

static void
-mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
+mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
struct rdt_resource *r)
{
- unsigned int i;
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ unsigned int i;

/* Write the delay values for mba. */
for (i = m->low; i < m->high; i++)
@@ -343,19 +343,19 @@ mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
}

static void
-cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
+cat_wrmsr(struct rdt_ctrl_domain *d, struct msr_param *m, struct rdt_resource *r)
{
- unsigned int i;
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ unsigned int i;

for (i = m->low; i < m->high; i++)
wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
}

-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
{
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;

list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
/* Find the domain that contains this CPU */
@@ -377,7 +377,7 @@ void rdt_ctrl_update(void *arg)
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
struct rdt_resource *r = m->res;
int cpu = smp_processor_id();
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;

d = get_ctrl_domain_from_cpu(cpu, r);
if (d) {
@@ -432,18 +432,23 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
*dc = r->default_ctrl;
}

-static void domain_free(struct rdt_hw_domain *hw_dom)
+static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
+{
+ kfree(hw_dom->ctrl_val);
+ kfree(hw_dom);
+}
+
+static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
{
kfree(hw_dom->arch_mbm_total);
kfree(hw_dom->arch_mbm_local);
- kfree(hw_dom->ctrl_val);
kfree(hw_dom);
}

-static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
struct msr_param m;
u32 *dc;

@@ -466,7 +471,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
* @num_rmid: The size of the MBM counter array
* @hw_dom: The domain that owns the allocated arrays
*/
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
{
size_t tsize;

@@ -505,10 +510,10 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+ struct rdt_hw_ctrl_domain *hw_dom;
struct list_head *add_pos = NULL;
- struct rdt_hw_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
int err;

if (id < 0) {
@@ -522,7 +527,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
return;

- d = container_of(hdr, struct rdt_domain, hdr);
+ d = container_of(hdr, struct rdt_ctrl_domain, hdr);

cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
@@ -542,7 +547,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
rdt_domain_reconfigure_cdp(r);

if (domain_setup_ctrlval(r, d)) {
- domain_free(hw_dom);
+ ctrl_domain_free(hw_dom);
return;
}

@@ -551,17 +556,17 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
err = resctrl_online_ctrl_domain(r, d);
if (err) {
list_del(&d->hdr.list);
- domain_free(hw_dom);
+ ctrl_domain_free(hw_dom);
}
}

static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->mon_scope);
+ struct rdt_hw_mon_domain *hw_dom;
struct list_head *add_pos = NULL;
- struct rdt_hw_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;
int err;

if (id < 0) {
@@ -575,7 +580,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
return;

- d = container_of(hdr, struct rdt_domain, hdr);
+ d = container_of(hdr, struct rdt_mon_domain, hdr);

cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
return;
@@ -591,7 +596,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);

if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
- domain_free(hw_dom);
+ mon_domain_free(hw_dom);
return;
}

@@ -600,7 +605,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
err = resctrl_online_mon_domain(r, d);
if (err) {
list_del(&d->hdr.list);
- domain_free(hw_dom);
+ mon_domain_free(hw_dom);
}
}

@@ -618,9 +623,9 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_ctrl_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;

if (id < 0) {
pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
@@ -638,8 +643,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
return;

- d = container_of(hdr, struct rdt_domain, hdr);
- hw_dom = resctrl_to_arch_dom(d);
+ d = container_of(hdr, struct rdt_ctrl_domain, hdr);
+ hw_dom = resctrl_to_arch_ctrl_dom(d);

cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
if (cpumask_empty(&d->hdr.cpu_mask)) {
@@ -647,12 +652,12 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
list_del(&d->hdr.list);

/*
- * rdt_domain "d" is going to be freed below, so clear
+ * rdt_ctrl_domain "d" is going to be freed below, so clear
* its pointer from pseudo_lock_region struct.
*/
if (d->plr)
d->plr->d = NULL;
- domain_free(hw_dom);
+ ctrl_domain_free(hw_dom);

return;
}
@@ -661,9 +666,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->mon_scope);
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_mon_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;

if (id < 0) {
pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
@@ -681,14 +686,14 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
return;

- d = container_of(hdr, struct rdt_domain, hdr);
- hw_dom = resctrl_to_arch_dom(d);
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ hw_dom = resctrl_to_arch_mon_dom(d);

cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
if (cpumask_empty(&d->hdr.cpu_mask)) {
resctrl_offline_mon_domain(r, d);
list_del(&d->hdr.list);
- domain_free(hw_dom);
+ mon_domain_free(hw_dom);

return;
}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 0b4136c42762..08fc97ce4135 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -58,7 +58,7 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
}

int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
- struct rdt_domain *d)
+ struct rdt_ctrl_domain *d)
{
struct resctrl_staged_config *cfg;
u32 closid = data->rdtgrp->closid;
@@ -137,7 +137,7 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
* resource type.
*/
int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
- struct rdt_domain *d)
+ struct rdt_ctrl_domain *d)
{
struct rdtgroup *rdtgrp = data->rdtgrp;
struct resctrl_staged_config *cfg;
@@ -206,8 +206,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
struct resctrl_staged_config *cfg;
struct rdt_resource *r = s->res;
struct rdt_parse_data data;
+ struct rdt_ctrl_domain *d;
char *dom = NULL, *id;
- struct rdt_domain *d;
unsigned long dom_id;

if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
@@ -267,11 +267,11 @@ static u32 get_config_index(u32 closid, enum resctrl_conf_type type)
}
}

-static bool apply_config(struct rdt_hw_domain *hw_dom,
+static bool apply_config(struct rdt_hw_ctrl_domain *hw_dom,
struct resctrl_staged_config *cfg, u32 idx,
cpumask_var_t cpu_mask)
{
- struct rdt_domain *dom = &hw_dom->d_resctrl;
+ struct rdt_ctrl_domain *dom = &hw_dom->d_resctrl;

if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) {
cpumask_set_cpu(cpumask_any(&dom->hdr.cpu_mask), cpu_mask);
@@ -283,11 +283,11 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
return false;
}

-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type t, u32 cfg_val)
{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
u32 idx = get_config_index(closid, t);
struct msr_param msr_param;

@@ -307,11 +307,11 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
{
struct resctrl_staged_config *cfg;
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_ctrl_domain *hw_dom;
struct msr_param msr_param;
+ struct rdt_ctrl_domain *d;
enum resctrl_conf_type t;
cpumask_var_t cpu_mask;
- struct rdt_domain *d;
u32 idx;

if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -319,7 +319,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)

msr_param.res = NULL;
list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
- hw_dom = resctrl_to_arch_dom(d);
+ hw_dom = resctrl_to_arch_ctrl_dom(d);
for (t = 0; t < CDP_NUM_TYPES; t++) {
cfg = &hw_dom->d_resctrl.staged_config[t];
if (!cfg->have_new_ctrl)
@@ -449,10 +449,10 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
return ret ?: nbytes;
}

-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type type)
{
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
u32 idx = get_config_index(closid, type);

return hw_dom->ctrl_val[idx];
@@ -461,7 +461,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid)
{
struct rdt_resource *r = schema->res;
- struct rdt_domain *dom;
+ struct rdt_ctrl_domain *dom;
bool sep = false;
u32 ctrl_val;

@@ -523,7 +523,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
}

void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
int evtid, int first)
{
/*
@@ -543,11 +543,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
struct rdt_domain_hdr *hdr;
+ struct rdt_mon_domain *d;
u32 resid, evtid, domid;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
union mon_data_bits md;
- struct rdt_domain *d;
struct rmid_read rr;
int ret = 0;

@@ -568,7 +568,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
ret = -ENOENT;
goto out;
}
- d = container_of(hdr, struct rdt_domain, hdr);
+ d = container_of(hdr, struct rdt_mon_domain, hdr);

mon_event_read(&rr, r, d, rdtgrp, evtid, false);

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 749f6662e7ca..e51bafc1945b 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -170,7 +170,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
return 0;
}

-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
u32 rmid,
enum resctrl_event_id eventid)
{
@@ -189,10 +189,10 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
return NULL;
}

-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 rmid, enum resctrl_event_id eventid)
{
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
struct arch_mbm_state *am;

am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -208,9 +208,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
* Assumes that hardware counters are also reset and thus that there is
* no need to record initial non-zero counts.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
{
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);

if (is_mbm_total_enabled())
memset(hw_dom->arch_mbm_total, 0,
@@ -229,11 +229,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
return chunks >> shift;
}

-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 rmid, enum resctrl_event_id eventid, u64 *val)
{
+ struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
struct arch_mbm_state *am;
u64 msr_val, chunks;
int ret;
@@ -266,7 +266,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
* decrement the count. If the busy count gets to zero on an RMID, we
* free the RMID
*/
-void __check_limbo(struct rdt_domain *d, bool force_free)
+void __check_limbo(struct rdt_mon_domain *d, bool force_free)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
struct rmid_entry *entry;
@@ -305,7 +305,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
}
}

-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mon_domain *d)
{
return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
}
@@ -334,7 +334,7 @@ int alloc_rmid(void)
static void add_rmid_to_limbo(struct rmid_entry *entry)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;
int cpu, err;
u64 val = 0;

@@ -383,7 +383,7 @@ void free_rmid(u32 rmid)
list_add_tail(&entry->list, &rmid_free_lru);
}

-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
+static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 rmid,
enum resctrl_event_id evtid)
{
switch (evtid) {
@@ -513,12 +513,12 @@ void mon_event_count(void *info)
* throttle MSRs already have low percentage values. To avoid
* unnecessarily restricting such rdtgroups, we also increase the bandwidth.
*/
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
{
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
+ struct rdt_ctrl_domain *dom_mba;
struct rdt_resource *r_mba;
- struct rdt_domain *dom_mba;
struct list_head *head;
struct rdtgroup *entry;
u32 cur_bw, user_bw;
@@ -578,7 +578,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
}

-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
+static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d, int rmid)
{
struct rmid_read rr;

@@ -618,13 +618,13 @@ void cqm_handle_limbo(struct work_struct *work)
{
unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
int cpu = smp_processor_id();
+ struct rdt_mon_domain *d;
struct rdt_resource *r;
- struct rdt_domain *d;

mutex_lock(&rdtgroup_mutex);

r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- d = container_of(work, struct rdt_domain, cqm_limbo.work);
+ d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);

__check_limbo(d, false);

@@ -634,7 +634,7 @@ void cqm_handle_limbo(struct work_struct *work)
mutex_unlock(&rdtgroup_mutex);
}

-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;
@@ -650,9 +650,9 @@ void mbm_handle_overflow(struct work_struct *work)
unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
struct rdtgroup *prgrp, *crgrp;
int cpu = smp_processor_id();
+ struct rdt_mon_domain *d;
struct list_head *head;
struct rdt_resource *r;
- struct rdt_domain *d;

mutex_lock(&rdtgroup_mutex);

@@ -660,7 +660,7 @@ void mbm_handle_overflow(struct work_struct *work)
goto out_unlock;

r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- d = container_of(work, struct rdt_domain, mbm_over.work);
+ d = container_of(work, struct rdt_mon_domain, mbm_over.work);

list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
mbm_update(r, d, prgrp->mon.rmid);
@@ -679,7 +679,7 @@ void mbm_handle_overflow(struct work_struct *work)
mutex_unlock(&rdtgroup_mutex);
}

-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index ed6d59af1cef..08d35f828bc3 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -814,7 +814,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
* Return: true if @cbm overlaps with pseudo-locked region on @d, false
* otherwise.
*/
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm)
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm)
{
unsigned int cbm_len;
unsigned long cbm_b;
@@ -841,11 +841,11 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm
* if it is not possible to test due to memory allocation issue,
* false otherwise.
*/
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d)
{
+ struct rdt_ctrl_domain *d_i;
cpumask_var_t cpu_with_psl;
struct rdt_resource *r;
- struct rdt_domain *d_i;
bool ret = false;

if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 5cc6e9b11a16..441d2f744ccc 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -85,8 +85,8 @@ void rdt_last_cmd_printf(const char *fmt, ...)

void rdt_staged_configs_clear(void)
{
+ struct rdt_ctrl_domain *dom;
struct rdt_resource *r;
- struct rdt_domain *dom;

lockdep_assert_held(&rdtgroup_mutex);

@@ -976,7 +976,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
unsigned long sw_shareable = 0, hw_shareable = 0;
unsigned long exclusive = 0, pseudo_locked = 0;
struct rdt_resource *r = s->res;
- struct rdt_domain *dom;
+ struct rdt_ctrl_domain *dom;
int i, hwb, swb, excl, psl;
enum rdtgrp_mode mode;
bool sep = false;
@@ -1205,7 +1205,7 @@ static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of,
*
* Return: false if CBM does not overlap, true if it does.
*/
-static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
+static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_domain *d,
unsigned long cbm, int closid,
enum resctrl_conf_type type, bool exclusive)
{
@@ -1260,7 +1260,7 @@ static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d
*
* Return: true if CBM overlap detected, false if there is no overlap
*/
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
unsigned long cbm, int closid, bool exclusive)
{
enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -1291,10 +1291,10 @@ bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
{
int closid = rdtgrp->closid;
+ struct rdt_ctrl_domain *d;
struct resctrl_schema *s;
struct rdt_resource *r;
bool has_cache = false;
- struct rdt_domain *d;
u32 ctrl;

list_for_each_entry(s, &resctrl_schema_all, list) {
@@ -1407,7 +1407,7 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of,
* bitmap functions work correctly.
*/
unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
- struct rdt_domain *d, unsigned long cbm)
+ struct rdt_ctrl_domain *d, unsigned long cbm)
{
struct cpu_cacheinfo *ci;
unsigned int size = 0;
@@ -1439,9 +1439,9 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
{
struct resctrl_schema *schema;
enum resctrl_conf_type type;
+ struct rdt_ctrl_domain *d;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
- struct rdt_domain *d;
unsigned int size;
int ret = 0;
u32 closid;
@@ -1553,7 +1553,7 @@ static void mon_event_config_read(void *info)
mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
}

-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
+static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info)
{
smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
}
@@ -1561,7 +1561,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
{
struct mon_config_info mon_info = {0};
- struct rdt_domain *dom;
+ struct rdt_mon_domain *dom;
bool sep = false;

mutex_lock(&rdtgroup_mutex);
@@ -1618,7 +1618,7 @@ static void mon_event_config_write(void *info)
}

static void mbm_config_write_domain(struct rdt_resource *r,
- struct rdt_domain *d, u32 evtid, u32 val)
+ struct rdt_mon_domain *d, u32 evtid, u32 val)
{
struct mon_config_info mon_info = {0};

@@ -1659,7 +1659,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
char *dom_str = NULL, *id_str;
unsigned long dom_id, val;
- struct rdt_domain *d;
+ struct rdt_mon_domain *d;

next:
if (!tok || tok[0] == '\0')
@@ -2211,9 +2211,9 @@ static inline bool is_mba_linear(void)
static int set_cache_qos_cfg(int level, bool enable)
{
void (*update)(void *arg);
+ struct rdt_ctrl_domain *d;
struct rdt_resource *r_l;
cpumask_var_t cpu_mask;
- struct rdt_domain *d;
int cpu;

if (level == RDT_RESOURCE_L3)
@@ -2260,7 +2260,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
l3_qos_cfg_update(&hw_res->cdp_enabled);
}

-static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
+static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
u32 num_closid = resctrl_arch_get_num_closid(r);
int cpu = cpumask_any(&d->hdr.cpu_mask);
@@ -2278,7 +2278,7 @@ static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
}

static void mba_sc_domain_destroy(struct rdt_resource *r,
- struct rdt_domain *d)
+ struct rdt_ctrl_domain *d)
{
kfree(d->mbps_val);
d->mbps_val = NULL;
@@ -2304,7 +2304,7 @@ static int set_mba_sc(bool mba_sc)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
u32 num_closid = resctrl_arch_get_num_closid(r);
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
int i;

if (!supports_mba_mbps() || mba_sc == is_mba_sc(r))
@@ -2573,7 +2573,7 @@ static int rdt_get_tree(struct fs_context *fc)
{
struct rdt_fs_context *ctx = rdt_fc2context(fc);
unsigned long flags = RFTYPE_CTRL_BASE;
- struct rdt_domain *dom;
+ struct rdt_mon_domain *dom;
struct rdt_resource *r;
int ret;

@@ -2757,10 +2757,10 @@ static int rdt_init_fs_context(struct fs_context *fc)
static int reset_all_ctrls(struct rdt_resource *r)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom;
+ struct rdt_hw_ctrl_domain *hw_dom;
struct msr_param msr_param;
+ struct rdt_ctrl_domain *d;
cpumask_var_t cpu_mask;
- struct rdt_domain *d;
int i;

if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -2776,7 +2776,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
* from each domain to update the MSRs below.
*/
list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
- hw_dom = resctrl_to_arch_dom(d);
+ hw_dom = resctrl_to_arch_ctrl_dom(d);
cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);

for (i = 0; i < hw_res->num_closid; i++)
@@ -2971,7 +2971,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
}

static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
- struct rdt_domain *d,
+ struct rdt_mon_domain *d,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
union mon_data_bits priv;
@@ -3020,7 +3020,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
* and "monitor" groups with given domain id.
*/
static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- struct rdt_domain *d)
+ struct rdt_mon_domain *d)
{
struct kernfs_node *parent_kn;
struct rdtgroup *prgrp, *crgrp;
@@ -3042,7 +3042,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_resource *r,
struct rdtgroup *prgrp)
{
- struct rdt_domain *dom;
+ struct rdt_mon_domain *dom;
int ret;

list_for_each_entry(dom, &r->mon_domains, hdr.list) {
@@ -3144,7 +3144,7 @@ static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r)
* Set the RDT domain up to start off with all usable allocations. That is,
* all shareable and unused bits. All-zero CBM is invalid.
*/
-static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
+static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schema *s,
u32 closid)
{
enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -3224,7 +3224,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
*/
static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
{
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;
int ret;

list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
@@ -3240,7 +3240,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
{
struct resctrl_staged_config *cfg;
- struct rdt_domain *d;
+ struct rdt_ctrl_domain *d;

list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
if (is_mba_sc(r)) {
@@ -3837,14 +3837,14 @@ static void __init rdtgroup_setup_default(void)
mutex_unlock(&rdtgroup_mutex);
}

-static void domain_destroy_mon_state(struct rdt_domain *d)
+static void domain_destroy_mon_state(struct rdt_mon_domain *d)
{
bitmap_free(d->rmid_busy_llc);
kfree(d->mbm_total);
kfree(d->mbm_local);
}

-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
lockdep_assert_held(&rdtgroup_mutex);

@@ -3852,7 +3852,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
mba_sc_domain_destroy(r, d);
}

-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
{
lockdep_assert_held(&rdtgroup_mutex);

@@ -3881,7 +3881,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
domain_destroy_mon_state(d);
}

-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
{
size_t tsize;

@@ -3911,7 +3911,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
lockdep_assert_held(&rdtgroup_mutex);

@@ -3921,7 +3921,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
{
int err;

--
2.43.0


2024-01-26 22:46:29

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v14 7/8] x86/resctrl: Sub NUMA Cluster detection and enable

There isn't a simple hardware bit that indicates whether a CPU is
running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
the ratio of NUMA nodes to L3 cache instances.
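For example (illustrative numbers), a two-socket system with SNC2
enabled reports four CPU-bearing NUMA nodes but only two L3 cache
instances, so the 4 / 2 ratio infers snc_nodes_per_l3_cache = 2. The
check itself is implemented in snc_get_config() below.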

When SNC mode is detected, reconfigure the RMID counters by updating
the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen.

Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
on the second SNC node to start from zero.

Tested-by: Shaopeng Tan <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Babu Moger <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 118 +++++++++++++++++++++++++++++
2 files changed, 119 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index f1bd7b91b3c6..f6ba7d0397b8 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1119,6 +1119,7 @@
#define MSR_IA32_QM_CTR 0xc8e
#define MSR_IA32_PQR_ASSOC 0xc8f
#define MSR_IA32_L3_CBM_BASE 0xc90
+#define MSR_RMID_SNC_CONFIG 0xca0
#define MSR_IA32_L2_CBM_BASE 0xd10
#define MSR_IA32_MBA_THRTL_BASE 0xd50

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index fcfc0b117ff7..ad8e68ed5ec9 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@

#define pr_fmt(fmt) "resctrl: " fmt

+#include <linux/cpu.h>
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/cacheinfo.h>
#include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>

+#include <asm/cpu_device_id.h>
#include <asm/intel-family.h>
#include <asm/resctrl.h>
#include "internal.h"
@@ -738,11 +741,42 @@ static void clear_closid_rmid(int cpu)
wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
}

+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * This mode is incompatible with Linux resctrl semantics
+ * as RMIDs are partitioned between SNC nodes, which requires
+ * a user to know which RMID is allocated to a task.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in Sub NUMA Cluster mode. This mode is better for Linux.
+ * The RMID space is divided between all SNC nodes with the
+ * RMIDs renumbered to start from zero in each node when
+ * counting operations from tasks. Code to read the counters
+ * must adjust RMID counter numbers based on SNC node. See
+ * __rmid_read() for code that does this.
+ */
+static void snc_remap_rmids(int cpu)
+{
+ u64 val;
+
+ /* Only need to enable once per package. */
+ if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
+ return;
+
+ rdmsrl(MSR_RMID_SNC_CONFIG, val);
+ val &= ~BIT_ULL(0);
+ wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
static int resctrl_online_cpu(unsigned int cpu)
{
struct rdt_resource *r;

mutex_lock(&rdtgroup_mutex);
+
+ if (snc_nodes_per_l3_cache > 1)
+ snc_remap_rmids(cpu);
+
for_each_capable_rdt_resource(r)
domain_add_cpu(cpu, r);
/* The cpu is set in default rdtgroup after online. */
@@ -997,11 +1031,95 @@ static __init bool get_rdt_resources(void)
return (rdt_mon_capable || rdt_alloc_capable);
}

+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
+ {}
+};
+
+/*
+ * There isn't a simple hardware bit that indicates whether a CPU is running
+ * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
+ * ratio of NUMA nodes to L3 cache instances.
+ * It is not possible to accurately determine SNC state if the system is
+ * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
+ * to L3 caches. It will be OK if the system is booted with hyperthreading
+ * disabled (since this doesn't affect the ratio).
+ */
+static __init int snc_get_config(void)
+{
+ unsigned long *node_caches;
+ int mem_only_nodes = 0;
+ int cpu, node, ret;
+ int num_l3_caches;
+ int cache_id;
+
+ if (!x86_match_cpu(snc_cpu_ids))
+ return 1;
+
+ node_caches = bitmap_zalloc(num_possible_cpus(), GFP_KERNEL);
+ if (!node_caches)
+ return 1;
+
+ cpus_read_lock();
+
+ if (num_online_cpus() != num_present_cpus())
+ pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
+
+ for_each_node(node) {
+ cpu = cpumask_first(cpumask_of_node(node));
+ if (cpu < nr_cpu_ids) {
+ cache_id = get_cpu_cacheinfo_id(cpu, 3);
+ if (cache_id != -1)
+ set_bit(cache_id, node_caches);
+ } else {
+ mem_only_nodes++;
+ }
+ }
+ cpus_read_unlock();
+
+ num_l3_caches = bitmap_weight(node_caches, num_possible_cpus());
+ kfree(node_caches);
+
+ if (!num_l3_caches)
+ goto insane;
+
+ /* sanity check #1: Number of CPU nodes must be multiple of num_l3_caches */
+ if ((nr_node_ids - mem_only_nodes) % num_l3_caches)
+ goto insane;
+
+ ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
+
+ /* sanity check #2: Only valid results are 1, 2, 3, 4 */
+ switch (ret) {
+ case 1:
+ break;
+ case 2:
+ case 3:
+ case 4:
+ rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_NODE;
+ break;
+ default:
+ goto insane;
+ }
+
+ return ret;
+insane:
+ pr_warn("SNC insanity: CPU nodes = %d num_l3_caches = %d\n",
+ (nr_node_ids - mem_only_nodes), num_l3_caches);
+ return 1;
+}
+
static __init void rdt_init_res_defs_intel(void)
{
struct rdt_hw_resource *hw_res;
struct rdt_resource *r;

+ snc_nodes_per_l3_cache = snc_get_config();
+
for_each_rdt_resource(r) {
hw_res = resctrl_to_arch_res(r);

--
2.43.0


2024-01-30 01:05:19

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v14 0/8] Add support for Sub-NUMA cluster (SNC) systems

I've been wondering whether the SNC patches were creating way too much
infrastructure that isn't generally useful. Specifically, the capability
for a resctrl resource to have different scope for monitoring and
control functions.

That seems like a very strange thing.

History here is that Intel CMT, MBM and L3 CAT features arrived in
quick succession, closely followed by MBA and then L2 CAT. At the time
it made sense for the "L3" resource to cover all of CMT, MBM and L3 CAT.

That became slightly odd when MBA (another L3-scoped resource) was
added. It was given its own "struct rdt_resource". AMD later added
SMBA as yet another L3-scoped resource with its own "struct rdt_resource".

I wondered how the SNC series would look if I went back to an approach
from one of the earlier versions. In that one I created a new resource
for SNC monitoring that was NODE scoped, then made all the CMT and
MBM code switch over to using that one when SNC was enabled. That was
a bit hacky, and I moved from there to the dual domain lists per
resource.

I just tried out an approach that split the RDT_RESOURCE_L3 resource
into separate resources, one for control, one for monitoring.

It makes the code much simpler:

8 files changed, 235 insertions(+), 30 deletions(-) [1]

vs.

9 files changed, 629 insertions(+), 282 deletions(-)


Tradeoffs:

The complex series (posted as v14) cleanly split the "rdt_domain"
structure into two. So it no longer carried all the fields needed
for both control and monitor, even though only one set was ever
used. But the cost was a lot of code churn that may never be useful
for anything other than SNC.

On non-SNC systems the new series produces separate linked lists
of domains for L3 control & monitor, even though the lists are the
same, and the domain structures still carry all fields for both
control and monitor functions.
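
To make the comparison concrete, here is a rough sketch of the two
shapes (struct names here are made up and fields abbreviated, with
kernel types assumed; neither is the actual definition from either
series):

	/* v14 shape: one resource, split domain lists and scopes. */
	struct rdt_resource_split {
		struct list_head ctrl_domains;	/* struct rdt_ctrl_domain */
		struct list_head mon_domains;	/* struct rdt_mon_domain */
		enum resctrl_scope ctrl_scope;	/* L3 cache */
		enum resctrl_scope mon_scope;	/* NUMA node when SNC on */
	};

	/* v15-RFC shape: two resources (RDT_RESOURCE_L3 for control,
	 * RDT_RESOURCE_L3_MON for monitoring), each with a single
	 * domain list and a single scope.
	 */
	struct rdt_resource_single {
		struct list_head domains;
		enum resctrl_scope scope;	/* NODE for L3_MON under SNC */
	};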

Question: Does anyone think that single domain with different scope for
control and monitor is inherently more useful than two separate domains?


While I get this sorted out, maybe Boris should take James' next set
of MPAM patches as they are, instead of on top of the complex SNC
series.

-Tony

[1] I'll post this set just as soon as I can get time on an SNC machine
to make sure the patches actually work.

2024-01-30 22:20:53

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems

This is the re-worked version of this series that I promised to post
yesterday. Check that e-mail for the arguments for this alternate
approach.

https://lore.kernel.org/all/ZbhLRDvZrxBZDv2j@agluck-desk3/

Apologies to Drew Fustini who I'd somehow dropped from later versions
of this series. Drew: you had made a comment at one point that having
different scopes within a single resource may be useful on RISC-V.
Version 14 included that, but it's gone here. Maybe multiple resctrl
"struct resource" for a single h/w entity like L3 as I'm doing in this
version could work for you too?

Patches 1-5 are almost completely rewritten based around the new
idea to give CMT and MBM their own "resource" instead of sharing
one with L3 CAT. This removes the need for separate domain lists,
and thus most of the churn of the previous version of this series.

Patches 6-8 are largely unchanged. But I removed all the Reviewed
and Tested tags since they are now built on a completely different
base.

Patches are against tip x86/cache:

fc747eebef73 ("x86/resctrl: Remove redundant variable in mbm_config_write_domain()")

Re-work doesn't affect the v14 cover letter, so pasting it here:

The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features. Prior to this
patch Intel has advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID counters in
the same way. This allows monitoring features to be used, with the caveat
that users must be aware that Linux may migrate tasks more frequently
between SNC nodes than between "regular" NUMA nodes, so reading counters
from all SNC nodes may be needed to get a complete picture of activity
for tasks.

Cache and memory bandwidth allocation features continue to operate at
the scope of the L3 cache.

Signed-off-by: Tony Luck <[email protected]>

Tony Luck (8):
x86/resctrl: Split the RDT_RESOURCE_L3 resource
x86/resctrl: Move all monitoring functions to RDT_RESOURCE_L3_MON
x86/resctrl: Prepare for non-cache-scoped resources
x86/resctrl: Add helper function to look up domain_id from scope
x86/resctrl: Add "NODE" as an option for resource scope
x86/resctrl: Introduce snc_nodes_per_l3_cache
x86/resctrl: Sub NUMA Cluster detection and enable
x86/resctrl: Update documentation with Sub-NUMA cluster changes

Documentation/arch/x86/resctrl.rst | 25 ++-
include/linux/resctrl.h | 10 +-
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 3 +
arch/x86/kernel/cpu/resctrl/core.c | 181 +++++++++++++++++++++-
arch/x86/kernel/cpu/resctrl/monitor.c | 28 ++--
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 6 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 12 +-
8 files changed, 236 insertions(+), 30 deletions(-)


base-commit: fc747eebef734563cf68a512f57937c8f231834a
--
2.43.0


2024-01-30 22:20:59

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v15-RFC 1/8] x86/resctrl: Split the RDT_RESOURCE_L3 resource

The RDT_RESOURCE_L3 resource is unique in that it is used for both monitoring
and control functions. This made sense while both uses had the same
scope. But systems with Sub-NUMA clustering enabled do not follow this
pattern.

Create a new resource, RDT_RESOURCE_L3_MON, ready to take over the
monitoring functions.

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 10 ++++++++++
2 files changed, 11 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 52e7e7deee10..c6051bc70e96 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -429,6 +429,7 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
extern struct dentry *debugfs_resctrl;

enum resctrl_res_level {
+ RDT_RESOURCE_L3_MON,
RDT_RESOURCE_L3,
RDT_RESOURCE_L2,
RDT_RESOURCE_MBA,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index aa9810a64258..c50f55d7790e 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -60,6 +60,16 @@ mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)

struct rdt_hw_resource rdt_resources_all[] = {
+ [RDT_RESOURCE_L3_MON] =
+ {
+ .r_resctrl = {
+ .rid = RDT_RESOURCE_L3_MON,
+ .name = "L3",
+ .cache_level = 3,
+ .domains = domain_init(RDT_RESOURCE_L3_MON),
+ .fflags = RFTYPE_RES_CACHE,
+ },
+ },
[RDT_RESOURCE_L3] =
{
.r_resctrl = {
--
2.43.0


2024-01-30 22:21:11

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v15-RFC 2/8] x86/resctrl: Move all monitoring functions to RDT_RESOURCE_L3_MON

Switch over all places that set up and use monitoring functions to
use the new resource structure.

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/core.c | 6 ++++--
arch/x86/kernel/cpu/resctrl/monitor.c | 12 ++++--------
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
3 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index c50f55d7790e..0828575c3e13 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -591,11 +591,13 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
return;
}

- if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
+ if (r == &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl) {
if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
cancel_delayed_work(&d->mbm_over);
mbm_setup_overflow_handler(d, 0);
}
+ }
+ if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
has_busy_rmid(r, d)) {
cancel_delayed_work(&d->cqm_limbo);
@@ -826,7 +828,7 @@ static __init bool get_rdt_alloc_resources(void)

static __init bool get_rdt_mon_resources(void)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;

if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 3a6c069614eb..080cad0d7288 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -268,7 +268,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
*/
void __check_limbo(struct rdt_domain *d, bool force_free)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
struct rmid_entry *entry;
u32 crmid = 1, nrmid;
bool rmid_dirty;
@@ -333,7 +333,7 @@ int alloc_rmid(void)

static void add_rmid_to_limbo(struct rmid_entry *entry)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
struct rdt_domain *d;
int cpu, err;
u64 val = 0;
@@ -623,7 +623,7 @@ void cqm_handle_limbo(struct work_struct *work)

mutex_lock(&rdtgroup_mutex);

- r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
d = container_of(work, struct rdt_domain, cqm_limbo.work);

__check_limbo(d, false);
@@ -659,7 +659,7 @@ void mbm_handle_overflow(struct work_struct *work)
if (!static_branch_likely(&rdt_mon_enable_key))
goto out_unlock;

- r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
d = container_of(work, struct rdt_domain, mbm_over.work);

list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
@@ -736,10 +736,6 @@ static struct mon_evt mbm_local_event = {

/*
* Initialize the event list for the resource.
- *
- * Note that MBM events are also part of RDT_RESOURCE_L3 resource
- * because as per the SDM the total and local memory bandwidth
- * are enumerated as part of L3 monitoring.
*/
static void l3_mon_evt_init(struct rdt_resource *r)
{
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index aa24343f1d23..9ee3a9906781 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2644,7 +2644,7 @@ static int rdt_get_tree(struct fs_context *fc)
static_branch_enable_cpuslocked(&rdt_enable_key);

if (is_mbm_enabled()) {
- r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
list_for_each_entry(dom, &r->domains, list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
}
--
2.43.0


2024-01-30 22:21:25

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v15-RFC 5/8] x86/resctrl: Add "NODE" as an option for resource scope

Add RESCTRL_NODE to the enum, and to the helper function that looks
up a domain id from a scope.

There are a couple of places where the scope must be a cache scope.
Add some defensive WARN_ON checks to those.

Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 3 +++
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 ++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 3 +++
4 files changed, 11 insertions(+)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 2155dc15e636..e3cddf3f07f8 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -147,6 +147,7 @@ struct resctrl_schema;
enum resctrl_scope {
RESCTRL_L2_CACHE = 2,
RESCTRL_L3_CACHE = 3,
+ RESCTRL_NODE,
};

/**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 59e6aa7abef5..b741cbf61843 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -505,6 +505,9 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
case RESCTRL_L2_CACHE:
case RESCTRL_L3_CACHE:
return get_cpu_cacheinfo_id(cpu, scope);
+ case RESCTRL_NODE:
+ return cpu_to_node(cpu);
+
default:
break;
}
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 6a72fb627aa5..2bafc73b51e2 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
*/
static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
{
+ enum resctrl_scope scope = plr->s->res->scope;
struct cpu_cacheinfo *ci;
int ret;
int i;

+ if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
+ return -ENODEV;
+
/* Pick the first cpu we find that is associated with the cache. */
plr->cpu = cpumask_first(&plr->d->cpu_mask);

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index eff9d87547c9..770f2bf98462 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1413,6 +1413,9 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
unsigned int size = 0;
int num_b, i;

+ if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+ return size;
+
num_b = bitmap_weight(&cbm, r->cache.cbm_len);
ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
--
2.43.0


2024-01-30 22:21:31

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v15-RFC 4/8] x86/resctrl: Add helper function to look up domain_id from scope

Prepare for more options for scope of resources. Add some diagnostic
messages if lookup fails.

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/core.c | 29 +++++++++++++++++++++++++++--
1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index d89dce63397b..59e6aa7abef5 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -499,6 +499,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
return 0;
}

+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
+{
+ switch (scope) {
+ case RESCTRL_L2_CACHE:
+ case RESCTRL_L3_CACHE:
+ return get_cpu_cacheinfo_id(cpu, scope);
+ default:
+ break;
+ }
+
+ return -EINVAL;
+}
+
/*
* domain_add_cpu - Add a cpu to a resource's domain list.
*
@@ -514,12 +527,18 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
*/
static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->scope);
+ int id = get_domain_id_from_scope(cpu, r->scope);
struct list_head *add_pos = NULL;
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;
int err;

+ if (id < 0) {
+ pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->scope, r->name);
+ return;
+ }
+
d = rdt_find_domain(r, id, &add_pos);
if (IS_ERR(d)) {
pr_warn("Couldn't find cache id for CPU %d\n", cpu);
@@ -564,10 +583,16 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)

static void domain_remove_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->scope);
+ int id = get_domain_id_from_scope(cpu, r->scope);
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;

+ if (id < 0) {
+ pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->scope, r->name);
+ return;
+ }
+
d = rdt_find_domain(r, id, NULL);
if (IS_ERR_OR_NULL(d)) {
pr_warn("Couldn't find cache id for CPU %d\n", cpu);
--
2.43.0


2024-01-30 22:21:49

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v15-RFC 3/8] x86/resctrl: Prepare for non-cache-scoped resources

Not all resources are scoped in line with some level of hardware cache.

Prepare by renaming the "cache_level" field to "scope" and changing
the type to an enum to ease adding new scopes.

Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 9 +++++++--
arch/x86/kernel/cpu/resctrl/core.c | 14 +++++++-------
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
4 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 66942d7fba7f..2155dc15e636 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -144,13 +144,18 @@ struct resctrl_membw {
struct rdt_parse_data;
struct resctrl_schema;

+enum resctrl_scope {
+ RESCTRL_L2_CACHE = 2,
+ RESCTRL_L3_CACHE = 3,
+};
+
/**
* struct rdt_resource - attributes of a resctrl resource
* @rid: The index of the resource
* @alloc_capable: Is allocation available on this machine
* @mon_capable: Is monitor feature available on this machine
* @num_rmid: Number of RMIDs available
- * @cache_level: Which cache level defines scope of this resource
+ * @scope: Hardware scope for this resource
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
* @domains: All domains for this resource
@@ -168,7 +173,7 @@ struct rdt_resource {
bool alloc_capable;
bool mon_capable;
int num_rmid;
- int cache_level;
+ enum resctrl_scope scope;
struct resctrl_cache cache;
struct resctrl_membw membw;
struct list_head domains;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 0828575c3e13..d89dce63397b 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L3_MON,
.name = "L3",
- .cache_level = 3,
+ .scope = RESCTRL_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_L3_MON),
.fflags = RFTYPE_RES_CACHE,
},
@@ -75,7 +75,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L3,
.name = "L3",
- .cache_level = 3,
+ .scope = RESCTRL_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_L3),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
@@ -89,7 +89,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L2,
.name = "L2",
- .cache_level = 2,
+ .scope = RESCTRL_L2_CACHE,
.domains = domain_init(RDT_RESOURCE_L2),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
@@ -103,7 +103,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_MBA,
.name = "MB",
- .cache_level = 3,
+ .scope = RESCTRL_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_MBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
@@ -115,7 +115,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_SMBA,
.name = "SMBA",
- .cache_level = 3,
+ .scope = RESCTRL_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_SMBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
@@ -514,7 +514,7 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
*/
static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ int id = get_cpu_cacheinfo_id(cpu, r->scope);
struct list_head *add_pos = NULL;
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;
@@ -564,7 +564,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)

static void domain_remove_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ int id = get_cpu_cacheinfo_id(cpu, r->scope);
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;

diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 8f559eeae08e..6a72fb627aa5 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -311,7 +311,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);

for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == plr->s->res->cache_level) {
+ if (ci->info_list[i].level == plr->s->res->scope) {
plr->line_size = ci->info_list[i].coherency_line_size;
return 0;
}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 9ee3a9906781..eff9d87547c9 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1416,7 +1416,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
num_b = bitmap_weight(&cbm, r->cache.cbm_len);
ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == r->cache_level) {
+ if (ci->info_list[i].level == r->scope) {
size = ci->info_list[i].size / r->cache.cbm_len * num_b;
break;
}
--
2.43.0


2024-01-30 22:21:59

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v15-RFC 7/8] x86/resctrl: Sub NUMA Cluster detection and enable

There isn't a simple hardware bit that indicates whether a CPU is
running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
the ratio of NUMA nodes to L3 cache instances.

When SNC mode is detected, reconfigure the RMID counters by updating
the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen. Update
the scope of the RDT_RESOURCE_L3_MON resource to NODE.

Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
on the second SNC node to start from zero.

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 119 +++++++++++++++++++++++++++++
2 files changed, 120 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index f1bd7b91b3c6..f6ba7d0397b8 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1119,6 +1119,7 @@
#define MSR_IA32_QM_CTR 0xc8e
#define MSR_IA32_PQR_ASSOC 0xc8f
#define MSR_IA32_L3_CBM_BASE 0xc90
+#define MSR_RMID_SNC_CONFIG 0xca0
#define MSR_IA32_L2_CBM_BASE 0xd10
#define MSR_IA32_MBA_THRTL_BASE 0xd50

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index dc886d2c9a33..84c36e10241f 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@

#define pr_fmt(fmt) "resctrl: " fmt

+#include <linux/cpu.h>
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/cacheinfo.h>
#include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>

+#include <asm/cpu_device_id.h>
#include <asm/intel-family.h>
#include <asm/resctrl.h>
#include "internal.h"
@@ -651,11 +654,42 @@ static void clear_closid_rmid(int cpu)
wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
}

+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * This mode is incompatible with Linux resctrl semantics
+ * as RMIDs are partitioned between SNC nodes, which requires
+ * a user to know which RMID is allocated to a task.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in Sub NUMA Cluster mode. This mode is better for Linux.
+ * The RMID space is divided between all SNC nodes with the
+ * RMIDs renumbered to start from zero in each node when
+ * counting operations from tasks. Code to read the counters
+ * must adjust RMID counter numbers based on SNC node. See
+ * __rmid_read() for code that does this.
+ */
+static void snc_remap_rmids(int cpu)
+{
+ u64 val;
+
+ /* Only need to enable once per package. */
+ if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
+ return;
+
+ rdmsrl(MSR_RMID_SNC_CONFIG, val);
+ val &= ~BIT_ULL(0);
+ wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
static int resctrl_online_cpu(unsigned int cpu)
{
struct rdt_resource *r;

mutex_lock(&rdtgroup_mutex);
+
+ if (snc_nodes_per_l3_cache > 1)
+ snc_remap_rmids(cpu);
+
for_each_capable_rdt_resource(r)
domain_add_cpu(cpu, r);
/* The cpu is set in default rdtgroup after online. */
@@ -910,11 +944,96 @@ static __init bool get_rdt_resources(void)
return (rdt_mon_capable || rdt_alloc_capable);
}

+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
+ {}
+};
+
+/*
+ * There isn't a simple hardware bit that indicates whether a CPU is running
+ * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
+ * ratio of NUMA nodes to L3 cache instances.
+ * It is not possible to accurately determine SNC state if the system is
+ * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
+ * to L3 caches. It will be OK if the system is booted with hyperthreading
+ * disabled (since this doesn't affect the ratio).
+ */
+static __init int snc_get_config(void)
+{
+ unsigned long *node_caches;
+ int mem_only_nodes = 0;
+ int cpu, node, ret;
+ int num_l3_caches;
+ int cache_id;
+
+ if (!x86_match_cpu(snc_cpu_ids))
+ return 1;
+
+ node_caches = bitmap_zalloc(num_possible_cpus(), GFP_KERNEL);
+ if (!node_caches)
+ return 1;
+
+ cpus_read_lock();
+
+ if (num_online_cpus() != num_present_cpus())
+ pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
+
+ for_each_node(node) {
+ cpu = cpumask_first(cpumask_of_node(node));
+ if (cpu < nr_cpu_ids) {
+ cache_id = get_cpu_cacheinfo_id(cpu, 3);
+ if (cache_id != -1)
+ set_bit(cache_id, node_caches);
+ } else {
+ mem_only_nodes++;
+ }
+ }
+ cpus_read_unlock();
+
+ num_l3_caches = bitmap_weight(node_caches, num_possible_cpus());
+ kfree(node_caches);
+
+ if (!num_l3_caches)
+ goto insane;
+
+ /* sanity check #1: Number of CPU nodes must be a multiple of num_l3_caches */
+ if ((nr_node_ids - mem_only_nodes) % num_l3_caches)
+ goto insane;
+
+ ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
+
+ /* sanity check #2: Only valid results are 1, 2, 3, 4 */
+ switch (ret) {
+ case 1:
+ break;
+ case 2:
+ case 3:
+ case 4:
+ rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl.scope = RESCTRL_NODE;
+ pr_info("Sub-NUMA Cluster: %d nodes per L3 cache\n", ret);
+ break;
+ default:
+ goto insane;
+ }
+
+ return ret;
+insane:
+ pr_warn("SNC insanity: CPU nodes = %d num_l3_caches = %d\n",
+ (nr_node_ids - mem_only_nodes), num_l3_caches);
+ return 1;
+}
+
static __init void rdt_init_res_defs_intel(void)
{
struct rdt_hw_resource *hw_res;
struct rdt_resource *r;

+ snc_nodes_per_l3_cache = snc_get_config();
+
for_each_rdt_resource(r) {
hw_res = resctrl_to_arch_res(r);

--
2.43.0


2024-01-30 22:22:03

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v15-RFC 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache

Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.

This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This cost may be offset by lower bandwidth
since the memory accesses for each core can only be interleaved between
the memory controllers on the same SNC node.

Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.

The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket the lower-numbered half of the RMIDs are given to the
first node, the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.

The other mode divides the RMIDs and renumbers the ones on the second
SNC node to start from zero.

Even with this renumbering SNC mode requires several changes in resctrl
behavior for correct operation.

Add a global integer "snc_nodes_per_l3_cache" that shows how many
SNC nodes share each L3 cache. When "snc_nodes_per_l3_cache" is "1",
SNC mode is either not implemented, or not enabled.

Update all places to take appropriate action when SNC mode is enabled:
1) The number of logical RMIDs per L3 cache available for use is the
number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be divided by the number of SNC
nodes (see the numeric sketch after this list).
3) The RMID renumbering operates when using the value from the
IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
counter, adjust from the logical RMID to the physical
RMID value for the SNC node that it wishes to read and load the
adjusted value into the IA32_QM_EVTSEL MSR.
4) Divide the L3 cache between the SNC nodes. Divide the value
reported in the resctrl "size" file by the number of SNC
nodes because the effective amount of cache that can be allocated
is reduced by that factor.
5) Disable the "-o mba_MBps" mount option in SNC mode
because the monitoring is being done per SNC node, while the
bandwidth allocation is still done at the L3 cache scope.
Trying to use this feedback loop might result in contradictory
changes to the throttling level coming from each of the SNC
node bandwidth measurements.
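
As a rough numeric sketch of points 1) and 2), assume a hypothetical
part reporting 224 physical RMIDs and an occupancy scale of 64 bytes,
booted with two SNC nodes per L3 cache:

	/* Illustrative values only, not taken from real CPUID data. */
	unsigned int snc = 2;		/* snc_nodes_per_l3_cache */
	unsigned int phys_rmids = 224;	/* x86_cache_max_rmid + 1 */
	unsigned int occ_scale = 64;	/* x86_cache_occ_scale */

	unsigned int num_rmid = phys_rmids / snc;	/* 112 per node */
	unsigned int mon_scale = occ_scale / snc;	/* 32 */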

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 6 ++++++
arch/x86/kernel/cpu/resctrl/monitor.c | 16 +++++++++++++---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 5 +++--
4 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index c6051bc70e96..d9c6dcf30922 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -428,6 +428,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);

extern struct dentry *debugfs_resctrl;

+extern unsigned int snc_nodes_per_l3_cache;
+
enum resctrl_res_level {
RDT_RESOURCE_L3_MON,
RDT_RESOURCE_L3,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index b741cbf61843..dc886d2c9a33 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,12 @@ int max_name_width, max_data_width;
*/
bool rdt_alloc_capable;

+/*
+ * Number of SNC nodes that share each L3 cache. Default is 1 for
+ * systems that do not support SNC, or have SNC disabled.
+ */
+unsigned int snc_nodes_per_l3_cache = 1;
+
static void
mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 080cad0d7288..357919bbadbe 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)

static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ int cpu = smp_processor_id();
+ int rmid_offset = 0;
u64 msr_val;

+ /*
+ * When SNC mode is on, need to compute the offset to read the
+ * physical RMID counter for the node to which this CPU belongs.
+ */
+ if (snc_nodes_per_l3_cache > 1)
+ rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+
/*
* As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
* with a valid event code for supported resource type and the bits
@@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
* IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
* are error bits.
*/
- wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+ wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
rdmsrl(MSR_IA32_QM_CTR, msr_val);

if (msr_val & RMID_VAL_ERROR)
@@ -757,8 +767,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
int ret;

resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
- hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
- r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+ hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+ r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;

if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 770f2bf98462..e639069f871a 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1425,7 +1425,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
}
}

- return size;
+ return size / snc_nodes_per_l3_cache;
}

/*
@@ -2293,7 +2293,8 @@ static bool supports_mba_mbps(void)
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;

return (is_mbm_local_enabled() &&
- r->alloc_capable && is_mba_linear());
+ r->alloc_capable && is_mba_linear() &&
+ snc_nodes_per_l3_cache == 1);
}

/*
--
2.43.0


2024-01-30 22:22:16

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v15-RFC 8/8] x86/resctrl: Update documentation with Sub-NUMA cluster changes

With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
their name refer to Sub-NUMA nodes instead of L3 cache ids.

Users should be aware that SNC mode also affects the amount of L3 cache
available for allocation within each SNC node.

Tested-by: Shaopeng Tan <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Babu Moger <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
Documentation/arch/x86/resctrl.rst | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df64a9d..15f1cff6ee76 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -366,10 +366,10 @@ When control is enabled all CTRL_MON groups will also contain:
When monitoring is enabled all MON groups will also contain:

"mon_data":
- This contains a set of files organized by L3 domain and by
- RDT event. E.g. on a system with two L3 domains there will
- be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
- directories have one file per event (e.g. "llc_occupancy",
+ This contains a set of files organized by L3 domain or by NUMA
+ node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+ or enabled respectively) and by RDT event. Each of these
+ directories has one file per event (e.g. "llc_occupancy",
"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
files provide a read out of the current value of the event for
all tasks in the group. In CTRL_MON groups these files provide
@@ -478,6 +478,23 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
each bit represents 5% of the capacity of the cache. You could partition
the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.

+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes. Users who do not bind tasks to the CPUs of a
+specific Sub-NUMA node must read the "llc_occupancy", "mbm_total_bytes",
+and "mbm_local_bytes" for all Sub-NUMA nodes where the tasks may execute
+to get the full view of traffic for which the tasks were the source.
+
+The cache allocation feature still provides the same number of
+bits in a mask to control allocation into the L3 cache, but each
+of those ways has its capacity reduced because the cache is divided
+between the SNC nodes. The values reported in the resctrl
+"size" files are adjusted accordingly.
+
Memory bandwidth Allocation and monitoring
==========================================

--
2.43.0


2024-02-01 19:19:24

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems

Hi Tony,

On 1/30/2024 2:20 PM, Tony Luck wrote:
> This is the re-worked version of this series that I promised to post
> yesterday. Check that e-mail for the arguments for this alternate
> approach.
>
> https://lore.kernel.org/all/ZbhLRDvZrxBZDv2j@agluck-desk3/
>
> Apologies to Drew Fustini who I'd somehow dropped from later versions
> of this series. Drew: you had made a comment at one point that having
> different scopes within a single resource may be useful on RISC-V.
> Version 14 included that, but it's gone here. Maybe multiple resctrl
> "struct resource" for a single h/w entity like L3 as I'm doing in this
> version could work for you too?
>
> Patches 1-5 are almost completely rewritten based around the new
> idea to give CMT and MBM their own "resource" instead of sharing
> one with L3 CAT. This removes the need for separate domain lists,
> and thus most of the churn of the previous version of this series.

I do not see it as removing the need for separate domain lists but
instead keeping the idea of separate domain lists but in this case
duplicating the resource in order to host the second domain
list. This solution also keeps the same structures for control and
monitoring that previous version cleaned up [1].
To me this thus seems like a similar solution as v14 but with
additional duplication due to an extra resource and without the cleanup.

Reinette

[1] https://lore.kernel.org/lkml/[email protected]/


2024-02-08 04:17:39

by Drew Fustini

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems

On Tue, Jan 30, 2024 at 02:20:26PM -0800, Tony Luck wrote:
> This is the re-worked version of this series that I promised to post
> yesterday. Check that e-mail for the arguments for this alternate
> approach.
>
> https://lore.kernel.org/all/ZbhLRDvZrxBZDv2j@agluck-desk3/
>
> Apologies to Drew Fustini who I'd somehow dropped from later versions
> of this series. Drew: you had made a comment at one point that having
> different scopes within a single resource may be useful on RISC-V.
> Version 14 included that, but it's gone here. Maybe multiple resctrl
> "struct resource" for a single h/w entity like L3 as I'm doing in this
> version could work for you too?

Sorry for the latency.

The RISC-V CBQRI specification [1] describes a bandwidth controller
register interface [2]. It allows a controller to implement both
bandwidth allocation and bandwidth usage monitoring.

The proof-of-concept resctrl implementation [3] that I worked on created
two domains for each memory controller in the example SoC. One domain
would contain the MBA resource and the other would contain the L3
resource to represent MBM files like local_bytes:

# cat /sys/fs/resctrl/schemata
MB:4= 80;6= 80;8= 80
L2:0=0fff;1=0fff
L3:2=ffff;3=0000;5=0000;7=0000

Where:

Domain 0 is L2 cache controller 0 capacity allocation
Domain 1 is L2 cache controller 1 capacity allocation
Domain 2 is L3 cache controller capacity allocation

Domain 4 is Memory controller 0 bandwidth allocation
Domain 6 is Memory controller 1 bandwidth allocation
Domain 8 is Memory controller 2 bandwidth allocation

Domain 3 is Memory controller 0 bandwidth monitoring
Domain 5 is Memory controller 1 bandwidth monitoring
Domain 7 is Memory controller 2 bandwidth monitoring

I think this scheme is confusing but I wasn't able to find a better
way to do it at the time.

> Patches 1-5 are almost completely rewritten based around the new
> idea to give CMT and MBM their own "resource" instead of sharing
> one with L3 CAT. This removes the need for separate domain lists,
> and thus most of the churn of the previous version of this series.

Very interesting. Do you think I would be able to create MBM files for
each memory controller without creating pointless L3 domains that show
up in schemata?

Thanks,
Drew

[1] https://github.com/riscv-non-isa/riscv-cbqri/releases/tag/v1.0-rc1
[2] https://github.com/riscv-non-isa/riscv-cbqri/blob/main/qos_bandwidth.adoc
[3] https://lore.kernel.org/linux-riscv/[email protected]/

2024-02-08 19:29:24

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems

> > Patches 1-5 are almost completely rewritten based around the new
> > idea to give CMT and MBM their own "resource" instead of sharing
> > one with L3 CAT. This removes the need for separate domain lists,
> > and thus most of the churn of the previous version of this series.
>
> Very interesting. Do you think I would be able to create MBM files for
> each memory controller without creating pointless L3 domains that show
> up in schemata?

Entries only show up in the schemata file for resources that are "alloc_capable".
So you should be able to make as many rdt_hw_resource structures as you
need that are "mon_capable", but not "alloc_capable" ... though making more
than one such resource may explore untested areas of the code since there
has historically only been one mon_capable resource. It looks like the resource
id from the "rid" field is passed through to the "show" functions for MBM and CQM.

This patch series splits the one resource that is marked as both mon_capable
and alloc_capable into two. Maybe that's a useful cleanup, but maybe not a
requirement for what you need.
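
A minimal sketch of what such a mon-only entry might look like (the
RDT_RESOURCE_MC0_MON name, the scope, and the field values here are all
hypothetical and untested, just to show the shape):

	[RDT_RESOURCE_MC0_MON] =
	{
		.r_resctrl = {
			.rid = RDT_RESOURCE_MC0_MON,
			.name = "MC0_MON",
			.scope = RESCTRL_NODE,
			.domains = domain_init(RDT_RESOURCE_MC0_MON),
			.fflags = RFTYPE_RES_MB,
		},
	},

with r->mon_capable set during feature detection and r->alloc_capable
left false, so nothing for it shows up in the schemata file.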

-Tony

2024-02-09 15:28:31

by Babu Moger

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 1/8] x86/resctrl: Split the RDT_RESOURCE_L3 resource

Hi Tony,

On 1/30/24 16:20, Tony Luck wrote:
> The RDT_RESOURCE_L3 is unique in that it is used for both monitoring
> an control functions. This made sense while both uses had the same

an/and

> scope. But systems with Sub-NUMA clustering enabled do not follow this
> pattern.
>
> Create a new resource: RDT_RESOURCE_L3_MON ready to take over the
> monitoring functions.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 1 +
> arch/x86/kernel/cpu/resctrl/core.c | 10 ++++++++++
> 2 files changed, 11 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 52e7e7deee10..c6051bc70e96 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -429,6 +429,7 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
> extern struct dentry *debugfs_resctrl;
>
> enum resctrl_res_level {
> + RDT_RESOURCE_L3_MON,
> RDT_RESOURCE_L3,

How about?
RDT_RESOURCE_L3,
RDT_RESOURCE_L3_MON,

> RDT_RESOURCE_L2,
> RDT_RESOURCE_MBA,
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index aa9810a64258..c50f55d7790e 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -60,6 +60,16 @@ mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
> #define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
>
> struct rdt_hw_resource rdt_resources_all[] = {
> + [RDT_RESOURCE_L3_MON] =
> + {
> + .r_resctrl = {
> + .rid = RDT_RESOURCE_L3_MON,
> + .name = "L3",

L3_MON ?

> + .cache_level = 3,
> + .domains = domain_init(RDT_RESOURCE_L3_MON),
> + .fflags = RFTYPE_RES_CACHE,
> + },
> + },
> [RDT_RESOURCE_L3] =
> {
> .r_resctrl = {

--
Thanks
Babu Moger

2024-02-09 15:28:48

by Babu Moger

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 2/8] x86/resctrl: Move all monitoring functions to RDT_RESOURCE_L3_MON

Tony,

On 1/30/24 16:20, Tony Luck wrote:
> Switch over all places that setup and use monitoring funtions to

functions?

> use the new resource structure.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/core.c | 6 ++++--
> arch/x86/kernel/cpu/resctrl/monitor.c | 12 ++++--------
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
> 3 files changed, 9 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index c50f55d7790e..0828575c3e13 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -591,11 +591,13 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> return;
> }
>
> - if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
> + if (r == &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl) {
> if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
> cancel_delayed_work(&d->mbm_over);
> mbm_setup_overflow_handler(d, 0);
> }
> + }
> + if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {

RDT_RESOURCE_L3_MON?


> if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
> has_busy_rmid(r, d)) {
> cancel_delayed_work(&d->cqm_limbo);
> @@ -826,7 +828,7 @@ static __init bool get_rdt_alloc_resources(void)
>
> static __init bool get_rdt_mon_resources(void)
> {
> - struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
>
> if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
> rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 3a6c069614eb..080cad0d7288 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -268,7 +268,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> */
> void __check_limbo(struct rdt_domain *d, bool force_free)
> {
> - struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
> struct rmid_entry *entry;
> u32 crmid = 1, nrmid;
> bool rmid_dirty;
> @@ -333,7 +333,7 @@ int alloc_rmid(void)
>
> static void add_rmid_to_limbo(struct rmid_entry *entry)
> {
> - struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
> struct rdt_domain *d;
> int cpu, err;
> u64 val = 0;
> @@ -623,7 +623,7 @@ void cqm_handle_limbo(struct work_struct *work)
>
> mutex_lock(&rdtgroup_mutex);
>
> - r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
> d = container_of(work, struct rdt_domain, cqm_limbo.work);
>
> __check_limbo(d, false);
> @@ -659,7 +659,7 @@ void mbm_handle_overflow(struct work_struct *work)
> if (!static_branch_likely(&rdt_mon_enable_key))
> goto out_unlock;
>
> - r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
> d = container_of(work, struct rdt_domain, mbm_over.work);
>
> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> @@ -736,10 +736,6 @@ static struct mon_evt mbm_local_event = {
>
> /*
> * Initialize the event list for the resource.
> - *
> - * Note that MBM events are also part of RDT_RESOURCE_L3 resource
> - * because as per the SDM the total and local memory bandwidth
> - * are enumerated as part of L3 monitoring.
> */
> static void l3_mon_evt_init(struct rdt_resource *r)
> {
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index aa24343f1d23..9ee3a9906781 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2644,7 +2644,7 @@ static int rdt_get_tree(struct fs_context *fc)
> static_branch_enable_cpuslocked(&rdt_enable_key);
>
> if (is_mbm_enabled()) {
> - r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
> list_for_each_entry(dom, &r->domains, list)
> mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
> }

--
Thanks
Babu Moger

2024-02-09 15:29:32

by Babu Moger

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 4/8] x86/resctrl: Add helper function to look up domain_id from scope

Hi Tony,

On 1/30/24 16:20, Tony Luck wrote:
> Prepare for more options for scope of resources. Add some diagnostic
> messages if lookup fails.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/core.c | 29 +++++++++++++++++++++++++++--
> 1 file changed, 27 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index d89dce63397b..59e6aa7abef5 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -499,6 +499,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> return 0;
> }
>
> +static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> +{
> + switch (scope) {
> + case RESCTRL_L2_CACHE:
> + case RESCTRL_L3_CACHE:
> + return get_cpu_cacheinfo_id(cpu, scope);
> + default:
> + break;
> + }
> +
> + return -EINVAL;
> +}
> +
> /*
> * domain_add_cpu - Add a cpu to a resource's domain list.
> *
> @@ -514,12 +527,18 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> */
> static void domain_add_cpu(int cpu, struct rdt_resource *r)
> {
> - int id = get_cpu_cacheinfo_id(cpu, r->scope);
> + int id = get_domain_id_from_scope(cpu, r->scope);
> struct list_head *add_pos = NULL;
> struct rdt_hw_domain *hw_dom;
> struct rdt_domain *d;
> int err;
>
> + if (id < 0) {
> + pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> + cpu, r->scope, r->name);

Will it be good to move pr_warn_once inside get_domain_id_from_scope
instead of repeating during every call?

> + return;
> + }
> +
> d = rdt_find_domain(r, id, &add_pos);
> if (IS_ERR(d)) {
> pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> @@ -564,10 +583,16 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>
> static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> {
> - int id = get_cpu_cacheinfo_id(cpu, r->scope);
> + int id = get_domain_id_from_scope(cpu, r->scope);
> struct rdt_hw_domain *hw_dom;
> struct rdt_domain *d;
>
> + if (id < 0) {
> + pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> + cpu, r->scope, r->name);

Same comment as above. Will it be good to move pr_warn_once inside
get_domain_id_from_scope ?

> + return;
> + }
> +
> d = rdt_find_domain(r, id, NULL);
> if (IS_ERR_OR_NULL(d)) {
> pr_warn("Couldn't find cache id for CPU %d\n", cpu);

--
Thanks
Babu Moger

2024-02-09 15:30:07

by Babu Moger

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache

Hi Tony,

On 1/30/24 16:20, Tony Luck wrote:
> Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
> and memory controllers on a socket into two or more groups. These are
> presented to the operating system as NUMA nodes.
>
> This may enable some workloads to have slightly lower latency to memory
> as the memory controller(s) in an SNC node are electrically closer to the
> CPU cores on that SNC node. This cost may be offset by lower bandwidth
> since the memory accesses for each core can only be interleaved between
> the memory controllers on the same SNC node.
>
> Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
> to track L3 cache occupancy and memory bandwidth. There is an MSR that
> controls how the RMIDs are shared between SNC nodes.
>
> The default mode divides them numerically. E.g. when there are two SNC
> nodes on a socket the lower-numbered half of the RMIDs are given to the
> first node, the remainder to the second node. This would be difficult
> to use with the Linux resctrl interface as specific RMID values assigned
> to resctrl groups are not visible to users.
>
> The other mode divides the RMIDs and renumbers the ones on the second
> SNC node to start from zero.
>
> Even with this renumbering SNC mode requires several changes in resctrl
> behavior for correct operation.
>
> Add a global integer "snc_nodes_per_l3_cache" that shows how many
> SNC nodes share each L3 cache. When "snc_nodes_per_l3_cache" is "1",
> SNC mode is either not implemented, or not enabled.
>
> Update all places to take appropriate action when SNC mode is enabled:
> 1) The number of logical RMIDs per L3 cache available for use is the
> number of physical RMIDs divided by the number of SNC nodes.
> 2) Likewise the "mon_scale" value must be divided by the number of SNC
> nodes.
> 3) The RMID renumbering operates when using the value from the
> IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
> counter, adjust from the logical RMID to the physical
> RMID value for the SNC node that it wishes to read and load the
> adjusted value into the IA32_QM_EVTSEL MSR.
> 4) Divide the L3 cache between the SNC nodes. Divide the value
> reported in the resctrl "size" file by the number of SNC
> nodes because the effective amount of cache that can be allocated
> is reduced by that factor.
> 5) Disable the "-o mba_MBps" mount option in SNC mode
> because the monitoring is being done per SNC node, while the
> bandwidth allocation is still done at the L3 cache scope.
> Trying to use this feedback loop might result in contradictory
> changes to the throttling level coming from each of the SNC
> node bandwidth measurements.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
> arch/x86/kernel/cpu/resctrl/core.c | 6 ++++++
> arch/x86/kernel/cpu/resctrl/monitor.c | 16 +++++++++++++---
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 5 +++--
> 4 files changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index c6051bc70e96..d9c6dcf30922 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -428,6 +428,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>
> extern struct dentry *debugfs_resctrl;
>
> +extern unsigned int snc_nodes_per_l3_cache;

I feel this can be part of rdt_resource instead of global.


> +
> enum resctrl_res_level {
> RDT_RESOURCE_L3_MON,
> RDT_RESOURCE_L3,
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index b741cbf61843..dc886d2c9a33 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -48,6 +48,12 @@ int max_name_width, max_data_width;
> */
> bool rdt_alloc_capable;
>
> +/*
> + * Number of SNC nodes that share each L3 cache. Default is 1 for
> + * systems that do not support SNC, or have SNC disabled.
> + */
> +unsigned int snc_nodes_per_l3_cache = 1;
> +
> static void
> mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
> struct rdt_resource *r);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 080cad0d7288..357919bbadbe 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
>
> static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> {
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;

RDT_RESOURCE_L3_MON?


> + int cpu = smp_processor_id();
> + int rmid_offset = 0;
> u64 msr_val;
>
> + /*
> + * When SNC mode is on, need to compute the offset to read the
> + * physical RMID counter for the node to which this CPU belongs.
> + */
> + if (snc_nodes_per_l3_cache > 1)
> + rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;

Not sure if you have tested or not. r->num_rmid is initialized for the
resource RDT_RESOURCE_L3_MON. For other resources it is always 0.


> +
> /*
> * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
> * with a valid event code for supported resource type and the bits
> @@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
> * are error bits.
> */
> - wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> + wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
> rdmsrl(MSR_IA32_QM_CTR, msr_val);
>
> if (msr_val & RMID_VAL_ERROR)
> @@ -757,8 +767,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
> int ret;
>
> resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
> - hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
> - r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
> + hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
> + r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
> hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
>
> if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 770f2bf98462..e639069f871a 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1425,7 +1425,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
> }
> }
>
> - return size;
> + return size / snc_nodes_per_l3_cache;
> }
>
> /*
> @@ -2293,7 +2293,8 @@ static bool supports_mba_mbps(void)
> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
>
> return (is_mbm_local_enabled() &&
> - r->alloc_capable && is_mba_linear());
> + r->alloc_capable && is_mba_linear() &&
> + snc_nodes_per_l3_cache == 1);
> }
>
> /*

--
Thanks
Babu Moger

2024-02-09 15:30:21

by Babu Moger

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 5/8] x86/resctrl: Add "NODE" as an option for resource scope

Hi Tony,

This patch probably needs to be merged with with patch 7.

Thanks

On 1/30/24 16:20, Tony Luck wrote:
> Add RESCTRL_NODE to the enum, and to the helper function that looks
> up a domain id from a scope.
>
> There are a couple of places where the scope must be a cache scope.
> Add some defensive WARN_ON checks to those.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> include/linux/resctrl.h | 1 +
> arch/x86/kernel/cpu/resctrl/core.c | 3 +++
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 ++++
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 3 +++
> 4 files changed, 11 insertions(+)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 2155dc15e636..e3cddf3f07f8 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -147,6 +147,7 @@ struct resctrl_schema;
> enum resctrl_scope {
> RESCTRL_L2_CACHE = 2,
> RESCTRL_L3_CACHE = 3,
> + RESCTRL_NODE,
> };
>
> /**
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 59e6aa7abef5..b741cbf61843 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -505,6 +505,9 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> case RESCTRL_L2_CACHE:
> case RESCTRL_L3_CACHE:
> return get_cpu_cacheinfo_id(cpu, scope);
> + case RESCTRL_NODE:
> + return cpu_to_node(cpu);
> +
> default:
> break;
> }
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 6a72fb627aa5..2bafc73b51e2 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
> */
> static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
> {
> + enum resctrl_scope scope = plr->s->res->scope;
> struct cpu_cacheinfo *ci;
> int ret;
> int i;
>
> + if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
> + return -ENODEV;
> +
> /* Pick the first cpu we find that is associated with the cache. */
> plr->cpu = cpumask_first(&plr->d->cpu_mask);
>
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index eff9d87547c9..770f2bf98462 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1413,6 +1413,9 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
> unsigned int size = 0;
> int num_b, i;
>
> + if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
> + return size;
> +
> num_b = bitmap_weight(&cbm, r->cache.cbm_len);
> ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
> for (i = 0; i < ci->num_leaves; i++) {

--
Thanks
Babu Moger

2024-02-09 15:36:04

by Babu Moger

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems

Hi Tony,

On 1/30/24 16:20, Tony Luck wrote:
> This is the re-worked version of this series that I promised to post
> yesterday. Check that e-mail for the arguments for this alternate
> approach.

To be honest, I like this series more than the previous series. I always
thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.

You need to separate the domain lists for RDT_RESOURCE_L3 and
RDT_RESOURCE_L3_MON if you are going this route. I didn't see that in this
series. Also I have a few other comments as well.

Thanks
Babu


2024-02-09 15:36:49

by Babu Moger

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 3/8] x86/resctrl: Prepare for non-cache-scoped resources

Hi Tony,

On 1/30/24 16:20, Tony Luck wrote:
> Not all resources are scoped in line with some level of hardware cache.

same level?

Thanks
Babu
>
> Prepare by renaming the "cache_level" field to "scope" and change
> the type to an enum to ease adding new scopes.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> include/linux/resctrl.h | 9 +++++++--
> arch/x86/kernel/cpu/resctrl/core.c | 14 +++++++-------
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
> 4 files changed, 16 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 66942d7fba7f..2155dc15e636 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -144,13 +144,18 @@ struct resctrl_membw {
> struct rdt_parse_data;
> struct resctrl_schema;
>
> +enum resctrl_scope {
> + RESCTRL_L2_CACHE = 2,
> + RESCTRL_L3_CACHE = 3,
> +};
> +
> /**
> * struct rdt_resource - attributes of a resctrl resource
> * @rid: The index of the resource
> * @alloc_capable: Is allocation available on this machine
> * @mon_capable: Is monitor feature available on this machine
> * @num_rmid: Number of RMIDs available
> - * @cache_level: Which cache level defines scope of this resource
> + * @scope: Hardware scope for this resource
> * @cache: Cache allocation related data
> * @membw: If the component has bandwidth controls, their properties.
> * @domains: All domains for this resource
> @@ -168,7 +173,7 @@ struct rdt_resource {
> bool alloc_capable;
> bool mon_capable;
> int num_rmid;
> - int cache_level;
> + enum resctrl_scope scope;
> struct resctrl_cache cache;
> struct resctrl_membw membw;
> struct list_head domains;
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 0828575c3e13..d89dce63397b 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_L3_MON,
> .name = "L3",
> - .cache_level = 3,
> + .scope = RESCTRL_L3_CACHE,
> .domains = domain_init(RDT_RESOURCE_L3_MON),
> .fflags = RFTYPE_RES_CACHE,
> },
> @@ -75,7 +75,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_L3,
> .name = "L3",
> - .cache_level = 3,
> + .scope = RESCTRL_L3_CACHE,
> .domains = domain_init(RDT_RESOURCE_L3),
> .parse_ctrlval = parse_cbm,
> .format_str = "%d=%0*x",
> @@ -89,7 +89,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_L2,
> .name = "L2",
> - .cache_level = 2,
> + .scope = RESCTRL_L2_CACHE,
> .domains = domain_init(RDT_RESOURCE_L2),
> .parse_ctrlval = parse_cbm,
> .format_str = "%d=%0*x",
> @@ -103,7 +103,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_MBA,
> .name = "MB",
> - .cache_level = 3,
> + .scope = RESCTRL_L3_CACHE,
> .domains = domain_init(RDT_RESOURCE_MBA),
> .parse_ctrlval = parse_bw,
> .format_str = "%d=%*u",
> @@ -115,7 +115,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_SMBA,
> .name = "SMBA",
> - .cache_level = 3,
> + .scope = RESCTRL_L3_CACHE,
> .domains = domain_init(RDT_RESOURCE_SMBA),
> .parse_ctrlval = parse_bw,
> .format_str = "%d=%*u",
> @@ -514,7 +514,7 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> */
> static void domain_add_cpu(int cpu, struct rdt_resource *r)
> {
> - int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> + int id = get_cpu_cacheinfo_id(cpu, r->scope);
> struct list_head *add_pos = NULL;
> struct rdt_hw_domain *hw_dom;
> struct rdt_domain *d;
> @@ -564,7 +564,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>
> static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> {
> - int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> + int id = get_cpu_cacheinfo_id(cpu, r->scope);
> struct rdt_hw_domain *hw_dom;
> struct rdt_domain *d;
>
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 8f559eeae08e..6a72fb627aa5 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -311,7 +311,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
> plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
>
> for (i = 0; i < ci->num_leaves; i++) {
> - if (ci->info_list[i].level == plr->s->res->cache_level) {
> + if (ci->info_list[i].level == plr->s->res->scope) {
> plr->line_size = ci->info_list[i].coherency_line_size;
> return 0;
> }
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 9ee3a9906781..eff9d87547c9 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1416,7 +1416,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
> num_b = bitmap_weight(&cbm, r->cache.cbm_len);
> ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
> for (i = 0; i < ci->num_leaves; i++) {
> - if (ci->info_list[i].level == r->cache_level) {
> + if (ci->info_list[i].level == r->scope) {
> size = ci->info_list[i].size / r->cache.cbm_len * num_b;
> break;
> }

--
Thanks
Babu Moger

2024-02-09 18:32:37

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems

On Fri, Feb 09, 2024 at 09:27:56AM -0600, Moger, Babu wrote:
> Hi Tony,
>
> On 1/30/24 16:20, Tony Luck wrote:
> > This is the re-worked version of this series that I promised to post
> > yesterday. Check that e-mail for the arguments for this alternate
> > approach.
>
> To be honest, I like this series more than the previous series. I always
> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
>
> You need to separate the domain lists for RDT_RESOURCE_L3 and
> RDT_RESOURCE_L3_MON if you are going this route. I didn't see that in this
> series. Also I have a few other comments as well.

They are separated. Each "struct rdt_resource" has its own domain list.

Or do you mean break up the struct rdt_domain into the control and
monitor versions as was done in the previous series?
>
> Thanks
> Babu
>

2024-02-09 18:46:58

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 1/8] x86/resctrl: Split the RDT_RESOURCE_L3 resource

On Fri, Feb 09, 2024 at 09:28:16AM -0600, Moger, Babu wrote:
> > enum resctrl_res_level {
> > + RDT_RESOURCE_L3_MON,
> > RDT_RESOURCE_L3,
>
> How about?
> RDT_RESOURCE_L3,
> RDT_RESOURCE_L3_MON,

Does the order matter? I put the L3_MON one first because historically
cache occupancy was the first resctrl tool. But if you have a better
argument for the order, then I can change it.

> > RDT_RESOURCE_L2,
> > RDT_RESOURCE_MBA,
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index aa9810a64258..c50f55d7790e 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -60,6 +60,16 @@ mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
> > #define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
> >
> > struct rdt_hw_resource rdt_resources_all[] = {
> > + [RDT_RESOURCE_L3_MON] =
> > + {
> > + .r_resctrl = {
> > + .rid = RDT_RESOURCE_L3_MON,
> > + .name = "L3",
>
> L3_MON ?

That was my first choice too. But I found:

$ ls /sys/fs/resctrl/info
L3 L3_MON_MON last_cmd_status MB

This would be easy to fix ... just change this code to not append
an extra "_MON" to the directory name:

for_each_mon_capable_rdt_resource(r) {
fflags = r->fflags | RFTYPE_MON_INFO;
sprintf(name, "%s_MON", r->name);
ret = rdtgroup_mkdir_info_resdir(r, name, fflags);
if (ret)
goto out_destroy;
}

But I also saw this:
$ ls /sys/fs/resctrl/mon_data/
mon_L3_MON_00 mon_L3_MON_01

which didn't seem to have an easy fix. So I took the easy route and left
the ".name" field as "L3_MON".

-Tony

2024-02-09 18:51:36

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 2/8] x86/resctrl: Move all monitoring functions to RDT_RESOURCE_L3_MON

On Fri, Feb 09, 2024 at 09:28:25AM -0600, Moger, Babu wrote:
> Tony,
>
> On 1/30/24 16:20, Tony Luck wrote:
> > Switch over all places that setup and use monitoring funtions to
>
> functions?

Yes. Will fix.

> > use the new resource structure.
> >
> > Signed-off-by: Tony Luck <[email protected]>
> > ---
> > arch/x86/kernel/cpu/resctrl/core.c | 6 ++++--
> > arch/x86/kernel/cpu/resctrl/monitor.c | 12 ++++--------
> > arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
> > 3 files changed, 9 insertions(+), 11 deletions(-)
> >
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index c50f55d7790e..0828575c3e13 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -591,11 +591,13 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> > return;
> > }
> >
> > - if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
> > + if (r == &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl) {
> > if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
> > cancel_delayed_work(&d->mbm_over);
> > mbm_setup_overflow_handler(d, 0);
> > }
> > + }
> > + if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
>
> RDT_RESOURCE_L3_MON?

Good catch.

-Tony

2024-02-09 18:58:00

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 3/8] x86/resctrl: Prepare for non-cache-scoped resources

On Fri, Feb 09, 2024 at 09:28:35AM -0600, Moger, Babu wrote:
> Hi Tony,
>
> On 1/30/24 16:20, Tony Luck wrote:
> > Not all resources are scoped in line with some level of hardware cache.
>
> same level?

No. "same" isn't what I meant here. If I shuffle this around:

Not all resources are scoped to match the scope of a hardware
cache level.

Is that more clear?


-Tony

2024-02-09 19:16:23

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 4/8] x86/resctrl: Add helper function to look up domain_id from scope

On Fri, Feb 09, 2024 at 09:28:52AM -0600, Moger, Babu wrote:
> > + if (id < 0) {
> > + pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> > + cpu, r->scope, r->name);
>
> Will it be good to move pr_warn_once inside get_domain_id_from_scope
> instead of repeating during every call?

Yes. Will move from here to get_domain_id_from_scope().

> > + if (id < 0) {
> > + pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> > + cpu, r->scope, r->name);
>
> Same comment as above. Will it be good to move pr_warn_once inside
> get_domain_id_from_scope ?

Moved this one too.
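
For reference, a sketch of how the helper might look with the warning
folded in (against this RFC, untested; the resource name isn't available
inside the helper, so it is dropped from the message here):

	static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
	{
		switch (scope) {
		case RESCTRL_L2_CACHE:
		case RESCTRL_L3_CACHE:
			return get_cpu_cacheinfo_id(cpu, scope);
		default:
			break;
		}

		pr_warn_once("Can't find domain id for CPU:%d scope:%d\n",
			     cpu, scope);
		return -EINVAL;
	}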

Thanks

-Tony

2024-02-09 19:36:09

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache

On Fri, Feb 09, 2024 at 09:29:16AM -0600, Moger, Babu wrote:
> >
> > +extern unsigned int snc_nodes_per_l3_cache;
>
> I feel this can be part of rdt_resource instead of global.

Mixed emotions about that. It would be another field that appears
in every instance of rdt_resource, but only used by the RDT_RESOURCE_L3_MON
copy.

>
> > +
> > enum resctrl_res_level {
> > RDT_RESOURCE_L3_MON,
> > RDT_RESOURCE_L3,
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index b741cbf61843..dc886d2c9a33 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -48,6 +48,12 @@ int max_name_width, max_data_width;
> > */
> > bool rdt_alloc_capable;
> >
> > +/*
> > + * Number of SNC nodes that share each L3 cache. Default is 1 for
> > + * systems that do not support SNC, or have SNC disabled.
> > + */
> > +unsigned int snc_nodes_per_l3_cache = 1;
> > +
> > static void
> > mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
> > struct rdt_resource *r);
> > diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> > index 080cad0d7288..357919bbadbe 100644
> > --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> > +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> > @@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
> >
> > static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> > {
> > + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>
> RDT_RESOURCE_L3_MON?

Second good catch.

>
> > + int cpu = smp_processor_id();
> > + int rmid_offset = 0;
> > u64 msr_val;
> >
> > + /*
> > + * When SNC mode is on, need to compute the offset to read the
> > + * physical RMID counter for the node to which this CPU belongs.
> > + */
> > + if (snc_nodes_per_l3_cache > 1)
> > + rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
>
> Not sure if you have tested or not. r->num_rmid is initialized for the
> > resource RDT_RESOURCE_L3_MON. For other resources it is always 0.

I hadn't got time on the SNC machine to try this out. Thanks
for catching this one, I'd have been scratching my head for a
while to track the symptoms of this problem back to this mistake.
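
I.e. the fix is the one-liner (sketch):

	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;

so that r->num_rmid is the per-SNC-node count set up in
rdt_get_mon_l3_config().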

Thanks

-Tony

2024-02-09 19:38:52

by Babu Moger

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems


On 2/9/24 12:31, Tony Luck wrote:
> On Fri, Feb 09, 2024 at 09:27:56AM -0600, Moger, Babu wrote:
>> Hi Tony,
>>
>> On 1/30/24 16:20, Tony Luck wrote:
>>> This is the re-worked version of this series that I promised to post
>>> yesterday. Check that e-mail for the arguments for this alternate
>>> approach.
>>
>> To be honest, I like this series more than the previous series. I always
>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
>>
>> You need to separate the domain lists for RDT_RESOURCE_L3 and
>> RDT_RESOURCE_L3_MON if you are going this route. I didn't see that in this
>> series. Also I have a few other comments as well.
>
> They are separated. Each "struct rdt_resource" has its own domain list.

Yea. You are right.
>
> Or do you mean break up the struct rdt_domain into the control and
> monitor versions as was done in the previous series?

No. Not required. Each resource has its own domain list. So, it is
separated already as far as I can see.

Reinette seems to have some concerns about this series. But I am fine with
both these approaches. I feel this is a cleaner approach.
--
Thanks
Babu Moger

2024-02-09 19:42:29

by Babu Moger

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 1/8] x86/resctrl: Split the RDT_RESOURCE_L3 resource



On 2/9/24 12:44, Tony Luck wrote:
> On Fri, Feb 09, 2024 at 09:28:16AM -0600, Moger, Babu wrote:
>>> enum resctrl_res_level {
>>> + RDT_RESOURCE_L3_MON,
>>> RDT_RESOURCE_L3,
>>
>> How about?
>> RDT_RESOURCE_L3,
>> RDT_RESOURCE_L3_MON,
>
> Does the order matter? I put the L3_MON one first because historically
> cache occupancy was the first resctrl tool. But if you have a better
> argument for the order, then I can change it.

That is fine. No need to change.

>
>>> RDT_RESOURCE_L2,
>>> RDT_RESOURCE_MBA,
>>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>>> index aa9810a64258..c50f55d7790e 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>>> @@ -60,6 +60,16 @@ mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
>>> #define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
>>>
>>> struct rdt_hw_resource rdt_resources_all[] = {
>>> + [RDT_RESOURCE_L3_MON] =
>>> + {
>>> + .r_resctrl = {
>>> + .rid = RDT_RESOURCE_L3_MON,
>>> + .name = "L3",
>>
>> L3_MON ?
>
> That was my first choice too. But I found:
>
> $ ls /sys/fs/resctrl/info
> L3 L3_MON_MON last_cmd_status MB
>
> This would be easy to fix ... just change this code to not append
> an extra "_MON" to the directory name:
>
> for_each_mon_capable_rdt_resource(r) {
> fflags = r->fflags | RFTYPE_MON_INFO;
> sprintf(name, "%s_MON", r->name);
> ret = rdtgroup_mkdir_info_resdir(r, name, fflags);
> if (ret)
> goto out_destroy;
> }
>
> But I also saw this:
> $ ls /sys/fs/resctrl/mon_data/
> mon_L3_MON_00 mon_L3_MON_01
>
> which didn't seem to have an easy fix. So I took the easy route and left
> the ".name" field as "L3_MON".
>

Ok. Sounds good.

--
Thanks
Babu Moger

2024-02-09 19:44:26

by Babu Moger

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 3/8] x86/resctrl: Prepare for non-cache-scoped resources



On 2/9/24 12:57, Tony Luck wrote:
> On Fri, Feb 09, 2024 at 09:28:35AM -0600, Moger, Babu wrote:
>> Hi Tony,
>>
>> On 1/30/24 16:20, Tony Luck wrote:
>>> Not all resources are scoped in line with some level of hardware cache.
>>
>> same level?
>
> No. "same" isn't what I meant here. If I shuffle this around:
>
> Not all resources are scoped to match the scope of a hardware
> cache level.
>
> Is that more clear?

Sure. Looks good.

--
Thanks
Babu Moger

2024-02-09 19:51:20

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 5/8] x86/resctrl: Add "NODE" as an option for resource scope

On Fri, Feb 09, 2024 at 09:29:03AM -0600, Moger, Babu wrote:
> Hi Tony,
>
> This patch probably needs to be merged with with patch 7.

If it just added RESCTRL_NODE to the enum and the switch() I'd
agree (as patch 7 is where RESCTRL_NODE first used). But this
patch also adds the sanity checks on scope where it has to be
a cache level.

Patch 7 is already on the big side (119 lines added to core.c).

If you really think this series needs to cut back the
number of patches, I could move the sanity check pieces
from here to patch 3 (where the enum is introduced) and
just the RESCTRL_NODE bits to patch 7.

-Tony

2024-02-09 21:27:21

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems



On 2/9/2024 11:38 AM, Moger, Babu wrote:
> On 2/9/24 12:31, Tony Luck wrote:
>> On Fri, Feb 09, 2024 at 09:27:56AM -0600, Moger, Babu wrote:
>>> On 1/30/24 16:20, Tony Luck wrote:
>
> Reinette seems to have some concerns about this series. But I am fine with
> both these approaches. I feel this is a cleaner approach.

I questioned the motivation but never received a response.

Reinette

2024-02-09 21:37:02

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems

>> Reinette seems to have some concerns about this series. But I am fine with
>> both these approaches. I feel this is a cleaner approach.
>
> I questioned the motivation but never received a response.

Reinette,

Sorry. My motivation was to reduce the amount of code churn that
was done by the previous incarnation.

9 files changed, 629 insertions(+), 282 deletions(-)

Vast amounts of that just added "_mon" or "_ctrl" to structure
or variable names.

-Tony

2024-02-09 22:19:00

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems

Hi Tony,

On 2/9/2024 1:36 PM, Luck, Tony wrote:
>>> Reinette seems to have some concerns about this series. But I am fine with
>>> both these approaches. I feel this is a cleaner approach.
>>
>> I questioned the motivation but never received a response.
>
> Reinette,
>
> Sorry. My motivation was to reduce the amount of code churn that
> was done by the previous incarnation.
>
> 9 files changed, 629 insertions(+), 282 deletions(-)
>
> Vast amounts of that just added "_mon" or "_ctrl" to structure
> or variable names.

I actually had specific points that this response also ignores.
Let me repeat and highlight the same points:

1) You claim that this series "removes the need for separate domain
lists" ... but then this series does just that (create a separate
domain list), but in an obfuscated way (duplicate the resource to
have the monitoring domain list in there).
2) You claim this series "reduces amount of code churn", but this is
because this series keeps using the same original data structures
for separate monitoring and control usages. The previous series made
an effort to separate the structures for the different usages
but this series does not. What makes it ok in this series to
use the same data structures for different usages?

Additionally:

Regarding "Vast amounts of that just added "_mon" or "_ctrl" to structure
or variable names." ... that is because the structures are actually split,
no? It is not just renaming for unnecessary churn.

What is the benefit of keeping the data structures to be shared
between monitor and control usages?
If there is a benefit to keeping these data structures, why not just
address this aspect in previous solution?

Reinette




2024-02-09 23:54:39

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems

> I actually had specific points that this response also ignores.
> Let me repeat and highlight the same points:
>
> 1) You claim that this series "removes the need for separate domain
> lists" ... but then this series does just that (create a separate
> domain list), but in an obfuscated way (duplicate the resource to
> have the monitoring domain list in there).

That was poorly worded on my part. I should have said "removes the
need for separate domain lists within a single rdt_resource".

Adding an extra domain list to a resource may be the start of a slippery
slope. What if there is some additional "L3"-like resctrl operation that
acts at the socket level (Intel has made products with multiple L3
instances per socket before)? Would you be OK adding a third domain
list to every struct rdt_resource to handle this? Or would it be simpler
to just add a new rdt_resource structure with socket scoped domains?
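
To make the two options concrete (a sketch with hypothetical names,
not code from either series):

	/* Option A: grow every resource with yet another domain list */
	struct rdt_resource {
		struct list_head	ctrl_domains;	/* e.g. L3 scope */
		struct list_head	mon_domains;	/* e.g. NODE scope */
		struct list_head	sock_domains;	/* hypothetical third list */
		/* ... */
	};

	/* Option B: a separate resource whose domains are socket scoped */
	struct rdt_hw_resource rdt_resource_sock = {
		.r_resctrl = {
			.name	= "SOCK",		/* hypothetical */
			.scope	= RESCTRL_SOCKET,	/* hypothetical */
		},
	};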

> 2) You claim this series "reduces amount of code churn", but this is
> because this series keeps using the same original data structures
> for separate monitoring and control usages. The previous series made
> an effort to separate the structures for the different usages
> but this series does not. What makes it ok in this series to
> use the same data structures for different usages?

Legacy resctrl has been using the same rdt_domain structure for both
usages since the dawn of time. So it has been OK up until now.

> Additionally:
>
> Regarding "Vast amounts of that just added "_mon" or "_ctrl" to structure
> or variable names." ... that is because the structures are actually split,
> no? It is not just renaming for unnecessary churn.

Perhaps not "unnecessary" churn. But certainly a lot of code change for
what I perceive as very little real gain.

> What is the benefit of keeping the data structures to be shared
> between monitor and control usages?

Benefit is no code changes. Cost is continuing to waste memory with
structures that are slightly bigger than they need to be.

> If there is a benefit to keeping these data structures, why not just
> address this aspect in previous solution?

The previous solution evolved to splitting these structures. But this
happened incrementally (remember that at an early stage the monitor
structures all got the "_mon" addition to their names, but the control
structures kept the original names). Only when I got to the end of this
process did I look at the magnitude of the change.

-Tony

2024-02-10 00:29:39

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems

Hi Tony,

On 2/9/2024 3:44 PM, Luck, Tony wrote:
>> I actually had specific points that this response also ignores.
>> Let me repeat and highlight the same points:
>>
>> 1) You claim that this series "removes the need for separate domain
>> lists" ... but then this series does just that (create a separate
>> domain list), but in an obfuscated way (duplicate the resource to
>> have the monitoring domain list in there).
>
> That was poorly worded on my part. I should have said "removes the
> need for separate domain lists within a single rdt_resource".
>
> Adding an extra domain list to a resource may be the start of a slippery
> slope. What if there is some additional "L3"-like resctrl operation that
> acts at the socket level (Intel has made products with multiple L3
> instances per socket before)? Would you be OK adding a third domain
> list to every struct rdt_resource to handle this? Or would it be simpler
> to just add a new rdt_resource structure with socket scoped domains?

This should not be about what is simplest to patch into current resctrl.

There is no need to support a new domain list for a new scope. The domain
lists support the functionality: control or monitoring. If control has
socket scope the existing implementation supports that.
If there is another operation supported by a resource apart from
control or monitoring then we can consider how to support it when
we know what it is. That would also be a great point to decide if
the same data structure should just grow to support an operation that
not all resources may support. That may depend on the amount of data
needed to support this hypothetical operation.

>
>> 2) You claim this series "reduces amount of code churn", but this is
>> because this series keeps using the same original data structures
>> for separate monitoring and control usages. The previous series made
>> an effort to separate the structures for the different usages
>> but this series does not. What makes it ok in this series to
>> use the same data structures for different usages?
>
> Legacy resctrl has been using the same rdt_domain structure for both
> usages since the dawn of time. So it has been OK up until now.

This is not the same.

Legacy resctrl uses the same data structure in the same list for both control
and monitoring usages so it is fine to have both monitoring and control data
in the data structure.

What you are doing in both solutions is to place the same data structure
in separate lists for control and monitoring usages. In the one list only the
control data is used; in the other only the monitoring data is used.
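
That is, roughly (a sketch with simplified names):

	/* One structure carries both kinds of per-domain state ... */
	struct rdt_domain {
		struct list_head	list;
		int			id;
		u32			*ctrl_val;	/* used only on a control list */
		struct mbm_state	*mbm_total;	/* used only on a monitor list */
	};

	/* ... while separate instances sit on separate per-usage lists.
	 * dc and dm are two distinct instances of the same structure:
	 */
	list_add_tail(&dc->list, &l3_ctrl->domains);	/* mon fields dead weight */
	list_add_tail(&dm->list, &l3_mon->domains);	/* ctrl fields dead weight */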

>> Additionally:
>>
>> Regarding "Vast amounts of that just added "_mon" or "_ctrl" to structure
>> or variable names." ... that is because the structures are actually split,
>> no? It is not just renaming for unnecessary churn.
>
> Perhaps not "unnecessary" churn. But certainly a lot of code change for
> what I perceive as very little real gain.

ok. There may be little gain wrt saving space. One complication with
this single data structure is that which of its contents are valid can
only be decided based on which list it is part of. It should be obvious
to developers which members are valid in each case. Perhaps this can be addressed with clear
documentation of the data structures.
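
e.g. kernel-doc along these lines (a sketch only):

	/**
	 * struct rdt_domain - set of CPUs sharing one resource instance
	 * @list:	all instances of this resource
	 * @id:		unique id for this instance
	 * @cpu_mask:	which CPUs share this resource
	 * @ctrl_val:	control values - valid only when the domain is on
	 *		a resource's control domain list
	 * @mbm_total:	MBM total-bandwidth state - valid only when the
	 *		domain is on a resource's monitor domain list
	 */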

>
>> What is the benefit of keeping the data structures to be shared
>> between monitor and control usages?
>
> Benefit is no code changes. Cost is continuing to waste memory with
> structures that are slightly bigger than they need to be.
>
>> If there is a benefit to keeping these data structures, why not just
>> address this aspect in previous solution?
>
> The previous solution evolved to splitting these structures. But this
> happened incrementally (remember that at an early stage the monitor
> structures all got the "_mon" addition to their names, but the control
> structures kept the original names). Only when I got to the end of this
> process did I look at the magnitude of the change.

Not answering my question.

Reinette

2024-02-12 16:53:24

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems

> >> I actually had specific points that this response also ignores.
> >> Let me repeat and highlight the same points:
> >>
> >> 1) You claim that this series "removes the need for separate domain
> >> lists" ... but then this series does just that (create a separate
> >> domain list), but in an obfuscated way (duplicate the resource to
> >> have the monitoring domain list in there).
> >
> > That was poorly worded on my part. I should have said "removes the
> > need for separate domain lists within a single rdt_resource".
> >
> > Adding an extra domain list to a resource may be the start of a slippery
> > slope. What if there is some additional "L3"-like resctrl operation that
> > acts at the socket level (Intel has made products with multiple L3
> > instances per socket before)? Would you be OK adding a third domain
> > list to every struct rdt_resource to handle this? Or would it be simpler
> > to just add a new rdt_resource structure with socket scoped domains?
>
> This should not be about what is simplest to patch into current resctrl.

I wanted to offer this in case Boris also thought that the previous version
was too much churn to support an obscure Intel-only (so far) feature.

But if you are going to Nack this new version on the grounds that it muddies
the water about usage of the rdt_domain structure, then I will abandon it.

> There is no need to support a new domain list for a new scope. The domain
> lists support the functionality: control or monitoring. If control has
> socket scope the existing implementation supports that.
> If there is another operation supported by a resource apart from
> control or monitoring then we can consider how to support it when
> we know what it is. That would also be a great point to decide if
> the same data structure should just grow to support an operation that
> not all resources may support. That may depend on the amount of data
> needed to support this hypothetical operation.
>
> >
> >> 2) You claim this series "reduces amount of code churn", but this is
> >> because this series keeps using the same original data structures
> >> for separate monitoring and control usages. The previous series made
> >> an effort to separate the structures for the different usages
> >> but this series does not. What makes it ok in this series to
> >> use the same data structures for different usages?
> >
> > Legacy resctrl has been using the same rdt_domain structure for both
> > usages since the dawn of time. So it has been OK up until now.
>
> This is not the same.
>
> Legacy resctrl uses the same data structure in the same list for both control
> and monitoring usages so it is fine to have both monitoring and control data
> in the data structure.
>
> What you are doing in both solutions is to place the same data structure
> in separate lists for control and monitoring usages. In the one list only the
> control data is used; in the other only the monitoring data is used.
>
> >> Additionally:
> >>
> >> Regarding "Vast amounts of that just added "_mon" or "_ctrl" to structure
> >> or variable names." ... that is because the structures are actually split,
> >> no? It is not just renaming for unnecessary churn.
> >
> > Perhaps not "unnecessary" churn. But certainly a lot of code change for
> > what I perceive as very little real gain.
>
> ok. There may be little gain wrt saving space. One complication with
> this single data structure is that which of its contents are valid can
> only be decided based on which list it is part of. It should be obvious
> to developers which members are valid in each case. Perhaps this can be addressed with clear
> documentation of the data structures.
>
> >
> >> What is the benefit of keeping the data structures to be shared
> >> between monitor and control usages?
> >
> > Benefit is no code changes. Cost is continuing to waste memory with
> > structures that are slightly bigger than they need to be.
> >
> >> If there is a benefit to keeping these data structures, why not just
> >> address this aspect in previous solution?
> >
> > The previous solution evolved to splitting these structures. But this
> > happened incrementally (remember that at an early stage the monitor
> > structures all got the "_mon" addition to their names, but the control
> > structures kept the original names). Only when I got to the end of this
> > process did I look at the magnitude of the change.
>
> Not answering my question.

I'm not exactly sure what "aspect" you thought could be addressed in the
previous series. But the point is moot now. This diversion from the
series has come to a dead end, and I hope that Boris will look at v14
(either before the next group of ARM patches, or after).

-Tony

2024-02-12 17:23:17

by Babu Moger

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache

Hi Tony,

On 2/9/24 13:35, Tony Luck wrote:
> On Fri, Feb 09, 2024 at 09:29:16AM -0600, Moger, Babu wrote:
>>>
>>> +extern unsigned int snc_nodes_per_l3_cache;
>>
>> I feel this can be part of rdt_resource instead of global.
>
> Mixed emotions about that. It would be another field that appears
> in every instance of rdt_resource, but only used by the RDT_RESOURCE_L3_MON
> copy.
>
That was the comment I got earlier for my patches (search for global).

https://lore.kernel.org/lkml/[email protected]/
--
Thanks
Babu Moger

2024-02-12 17:39:37

by Babu Moger

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 5/8] x86/resctrl: Add "NODE" as an option for resource scope

Hi Tony,

On 2/9/24 13:23, Tony Luck wrote:
> On Fri, Feb 09, 2024 at 09:29:03AM -0600, Moger, Babu wrote:
>> Hi Tony,
>>
>> This patch probably needs to be merged with patch 7.
>
> If it just added RESCTRL_NODE to the enum and the switch() I'd
> agree (as patch 7 is where RESCTRL_NODE first used). But this
> patch also adds the sanity checks on scope where it has to be
> a cache level.
>
> Patch 7 is already on the big side (119 lines added to core.c).
>
> If you really think this series needs to cut back the
> number of patches, I could move the sanity check pieces
> from here to patch 3 (where the enum is introduced) and
> just the RESCTRL_NODE bits to patch 7.

No. You don't have to cut back on the number of patches. I think it is easier to
review if these changes are merged with patch 7.


--
Thanks
Babu Moger

2024-02-12 19:45:14

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems

Hi Babu,

On 2/9/2024 7:27 AM, Moger, Babu wrote:

> To be honest, I like this series more than the previous series. I always
> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.

Would you prefer that your "Reviewed-by" tag be removed from the
previous series?

Reinette

2024-02-12 19:58:02

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems

>> To be honest, I like this series more than the previous series. I always
>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
>
> Would you prefer that your "Reviewed-by" tag be removed from the
> previous series?

I'm thinking that I could continue splitting things and break "struct rdt_resource" into
separate "ctrl" and "mon" structures. Then we'd have a clean split from top to bottom.

Doing that would get rid of the rdt_resources_all[] array. Replacing with individual
rdt_hw_ctrl_resource and rdt_hw_mon_resource declarations for each feature.

Features found on a system would be added to a list of ctrl or list of mon resources.
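
Roughly (a sketch of the direction, not a worked patch):

	struct rdt_ctrl_resource {
		struct list_head	list;	/* on rdt_ctrl_resources when found */
		char			*name;
		enum resctrl_scope	scope;
		struct list_head	domains;
		/* schemata / CBM / MBA properties ... */
	};

	struct rdt_mon_resource {
		struct list_head	list;	/* on rdt_mon_resources when found */
		char			*name;
		enum resctrl_scope	scope;
		struct list_head	domains;
		/* RMID / event properties ... */
	};

	static LIST_HEAD(rdt_ctrl_resources);
	static LIST_HEAD(rdt_mon_resources);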

-Tony

2024-02-12 20:55:24

by Babu Moger

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems



On 2/12/24 13:44, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/9/2024 7:27 AM, Moger, Babu wrote:
>
>> To be honest, I like this series more than the previous series. I always
>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
>
> Would you prefer that your "Reviewed-by" tag be removed from the
> previous series?
>
Sure. I will plan to review the new series again when Tony submits v16.
--
Thanks
Babu Moger

2024-02-12 21:54:05

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems

Hi Tony,

On 2/12/2024 11:57 AM, Luck, Tony wrote:
>>> To be honest, I like this series more than the previous series. I always
>>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
>>
>> Would you prefer that your "Reviewed-by" tag be removed from the
>> previous series?
>
> I'm thinking that I could continue splitting things and break "struct rdt_resource" into
> separate "ctrl" and "mon" structures. Then we'd have a clean split from top to bottom.

It is not obvious what you mean with "continue splitting things". Are you
speaking about "continue splitting from v14" or "continue splitting from v15-RFC"?

I think that any solution needs to consider what makes sense for resctrl
as a whole instead of how to support SNC with smallest patch possible.

There should not be any changes that make resctrl harder to understand
and maintain, as exemplified by the confusion introduced by something as
simple as resource name choice [1].

>
> Doing that would get rid of the rdt_resources_all[] array. Replacing with individual
> rdt_hw_ctrl_resource and rdt_hw_mon_resource declarations for each feature.
>
> Features found on a system would be added to a list of ctrl or list of mon resources.

Could you please elaborate what is architecturally wrong with v14 and how this
new proposal addresses that?

Reinette

[1] https://lore.kernel.org/lkml/ZcZyqs5hnQqZ5ZV0@agluck-desk3/

2024-02-12 22:05:25

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems

On Mon, Feb 12, 2024 at 01:43:56PM -0800, Reinette Chatre wrote:
> Hi Tony,
>
> On 2/12/2024 11:57 AM, Luck, Tony wrote:
> >>> To be honest, I like this series more than the previous series. I always
> >>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
> >>
> >> Would you prefer that your "Reviewed-by" tag be removed from the
> >> previous series?
> >
> > I'm thinking that I could continue splitting things and break "struct rdt_resource" into
> > separate "ctrl" and "mon" structures. Then we'd have a clean split from top to bottom.
>
> It is not obvious what you mean with "continue splitting things". Are you
> speaking about "continue splitting from v14" or "continue splitting from v15-RFC"?

I'm speaking of some future potential changes. Not proposing to
do this now.

> I think that any solution needs to consider what makes sense for resctrl
> as a whole instead of how to support SNC with smallest patch possible.

I am officially abandoning my v15-RFC patches. I wasn't clear enough in
my e-mail earlier today.

https://lore.kernel.org/all/SJ1PR11MB608378D1304224D9E8A9016FFC482@SJ1PR11MB6083.namprd11.prod.outlook.com/
>
> There should not be any changes that make resctrl harder to understand
> and maintain, as exemplified by the confusion introduced by something as
> simple as resource name choice [1].
>
> >
> > Doing that would get rid of the rdt_resources_all[] array. Replacing with individual
> > rdt_hw_ctrl_resource and rdt_hw_mon_resource declarations for each feature.
> >
> > Features found on a system would be added to a list of ctrl or list of mon resources.
>
> Could you please elaborate what is architecturally wrong with v14 and how this
> new proposal addresses that?

There is nothing architecturally wrong with v14. I thought it was more
complex than it needed to be. You have convinced me that my v15-RFC
series, while simpler, is not a reasonable path for long-term resctrl
maintainability.
>
> Reinette
>
> [1] https://lore.kernel.org/lkml/ZcZyqs5hnQqZ5ZV0@agluck-desk3/

-Tony

2024-02-13 18:11:50

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems

Hello,

On 12/02/2024 22:05, Tony Luck wrote:
> On Mon, Feb 12, 2024 at 01:43:56PM -0800, Reinette Chatre wrote:
>> On 2/12/2024 11:57 AM, Luck, Tony wrote:
>>>>> To be honest, I like this series more than the previous series. I always
>>>>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
>>>>
>>>> Would you prefer that your "Reviewed-by" tag be removed from the
>>>> previous series?
>>>
>>> I'm thinking that I could continue splitting things and break "struct rdt_resource" into
>>> separate "ctrl" and "mon" structures. Then we'd have a clean split from top to bottom.
>>
>> It is not obvious what you mean with "continue splitting things". Are you
>> speaking about "continue splitting from v14" or "continue splitting from v15-RFC"?
>
> I'm speaking of some future potential changes. Not proposing to
> do this now.
>
>> I think that any solution needs to consider what makes sense for resctrl
>> as a whole instead of how to support SNC with smallest patch possible.

>> There should not be any changes that make resctrl harder to understand
>> and maintain, as exemplified by the confusion introduced by something as
>> simple as resource name choice [1].
>>
>>>
>>> Doing that would get rid of the rdt_resources_all[] array. Replacing with individual
>>> rdt_hw_ctrl_resource and rdt_hw_mon_resource declarations for each feature.
>>>
>>> Features found on a system would be added to a list of ctrl or list of mon resources.
>>
>> Could you please elaborate what is architecturally wrong with v14 and how this
>> new proposal addresses that?
>
> There is nothing architecturally wrong with v14. I thought it was more
> complex than it needed to be. You have convinced me that my v15-RFC
> series, while simpler, is not a reasonable path for long-term resctrl
> maintainability.

I'm not sure if it's helpful to describe a third approach at this point - but on the off
chance it's useful:
With SNC enabled, the L3 monitors are unaffected, but the controls behave as if they were
part of some other component in the system.

ACPI describes something called "memory side caches" [0] in the HMAT table, which are
outside the CPU cache hierarchy, and are associated with a Proximity-Domain. I've heard
that one of Arm's partners has built a system with MPAM controls on something like this.
How would we support this - and would this be a better fit for the way SNC behaves?

I think this would be a new resource and schema, 'MSC'(?) with domain-ids using the NUMA
nid. As these aren't CPU caches, they wouldn't appear in the same part of the sysfs
hierarchy, and wouldn't necessarily have a cache-id.

For SNC systems, I think this would look like CMT on the L3, and CAT on the 'MSC'.
Existing software wouldn't know to use the new schema, but equally wouldn't be surprised
by the domain-ids being something other than the cache-id, and the controls and monitors
not lining up.
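
For example, on a two-node SNC socket the interface might then read
(entirely hypothetical names and values):

	# cat /sys/fs/resctrl/schemata
	MSC:0=0fff;1=0fff

	# ls /sys/fs/resctrl/mon_data
	mon_L3_00

i.e. the 'MSC' control domains are the NUMA nids 0 and 1, while the
monitor directory keeps the single L3 cache-id.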
Where it's not quite right for SNC is that sysfs may not describe a memory side cache, but one
would be present in resctrl. I don't think that's a problem - unless these systems do also
have a memory-side-cache that behaves differently. (where are the controls being applied - at
the 'near' side of the link? I don't think the difference matters)


I'm a little nervous that the SNC support looks strange if we ever add support for
something like the above. Given it's described in ACPI, I assume there are plenty of
machines out there that look like this.

(Why aren't memory-side-caches a CPU cache? They live near the memory controller and cache
based on the PA, not the CPU that issued the transaction)


Thanks,

James

[0]
https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#memory-side-cache-overview

2024-02-13 19:02:39

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v15-RFC 0/8] Add support for Sub-NUMA cluster (SNC) systems

> With SNC enabled, the L3 monitors are unaffected, but the controls behave as if they were
> part of some other component in the system.

I don't think of it like that. See attached picture of a single socket divided in two by SNC.
[If the attachment is stripped off for those reading this via mailing lists and you want the
picture, just send me an e-mail.]

Everything in blue is node 0. Yellow for node 1.

The rectangles in the middle represent the L3 cache (12-way associative). When cores
in node 0 access memory in node 0, it will be cached using the "top" half of the cache
indices. Similarly for node 1 using the "bottom" half.

Here’s how each of the Intel L3 resctrl functions operates with SNC enabled:

CQM: Reports how much of your half of the L3 cache is occupied.

MBM: Reports on memory traffic from your half of the cache to your memory controllers.

CAT: Still controls which ways of the cache are available for allocation (but each way
has half the capacity).

MBA: The same throttling levels are applied to "blue" and "yellow" traffic (because there
are only socket-level controls).
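
A worked example with made-up numbers: take a 36 MB, 12-way L3 with
SNC=2. A CAT bitmask of 0x00f still allocates 4 of the 12 ways:

	without SNC:  (4/12) * 36 MB = 12 MB for the group
	with SNC=2:   (4/12) * 18 MB =  6 MB of the local node's half

so the mask semantics are unchanged, but each way caches half as much
for a given node's accesses.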

> I'm a little nervous that the SNC support looks strange if we ever add support for
> something like the above. Given it's described in ACPI, I assume there are plenty of
> machines out there that look like this.

I'm also nervous as h/w designers find various ways to diverge from the old paradigm of

socket scope == L3 cache scope == NUMA node scope

-Tony


Attachments:
SNC topology.png (17.88 kB)