2020-02-14 18:30:19

by James Morse

Subject: [RFC PATCH v2 0/2] x86/resctrl: Start abstraction for a second arch

Hi folks,

These two patches are the tip of the MPAM iceberg.

Arm have some CPU support for dividing caches into portions, and
applying bandwidth limits at various points in the SoC. The collective term
for these features is MPAM: Memory System Resource Partitioning and Monitoring.

MPAM is similar enough to Intel RDT that it should use the de facto Linux
interface: resctrl. This filesystem currently lives under arch/x86, and is
tightly coupled to the architecture.
Ultimately, my plan is to split the existing resctrl code up to have an
arch<->fs abstraction, then move all the bits out to fs/resctrl. From there
MPAM can be wired up.


These two patches are step one: Split the two structs that resctrl uses
to have an arch<->fs split. These sit on top of the cleanup posted here:
lore.kernel.org/r/[email protected]

I'm after strong opinions like "No! struct mbm_state is obviously arch
specific.". Making the hardware configuration belong to the arch code
instead of resctrl is so that it can be scaled on arm64, where MPAM
allows terrifyingly large portion bitmaps for the caches.
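
For the impatient, the shape the two patches below end up with is the usual
embedding trick: the fs-visible struct sits inside the arch struct, and the
arch code recovers its view with container_of(). Fields trimmed for brevity:

struct rdt_hw_resource {
	struct rdt_resource	resctrl;	/* what fs/resctrl would see */
	unsigned int		msr_base;	/* x86-only: base MSR for CBMs */
	void			(*msr_update)(struct rdt_domain *d,
					      struct msr_param *m,
					      struct rdt_resource *r);
	unsigned int		mon_scale;	/* x86-only: counter scaling */
};

static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r)
{
	return container_of(r, struct rdt_hw_resource, resctrl);
}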



Last time these were posted, the request was for the spec, and to see
the whole fully assembled iceberg.

The spec is here:
https://static.docs.arm.com/ddi0598/ab/DDI0598A_b_MPAM_supp_armv8a.pdf

For a slightly dated view of the whole tree:
1. Don peril sensitive sunglasses
2. https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/snapshot/feb

The tree is generally RFC-quality. It gets more ragged once you get out of
the x86 code. I anticipate all the arm64 code being rewritten before it's
considered for merging.

(I haven't reposted the CDP origami as before, as I think that series
will be clearer if I re-order the patches ... it may even be shorter)


Does it all work? Almost. Monitor groups are proving to be a problem; I
can't see a way of getting these working without a user-visible change of
behaviour.
MPAM's counters aren't 1:1 with RMIDs, so supporting MBM_* on
any likely hardware will have to be via something other than resctrl.


Thanks,

James Morse (2):
x86/resctrl: Split struct rdt_resource
x86/resctrl: Split struct rdt_domain

arch/x86/kernel/cpu/resctrl/core.c | 257 +++++++++++++---------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 16 +-
arch/x86/kernel/cpu/resctrl/internal.h | 157 +++----------
arch/x86/kernel/cpu/resctrl/monitor.c | 29 ++-
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 77 ++++---
include/linux/resctrl.h | 133 +++++++++++
7 files changed, 389 insertions(+), 284 deletions(-)

--
2.24.1


2020-02-14 18:31:57

by James Morse

Subject: [RFC PATCH v2 1/2] x86/resctrl: Split struct rdt_resource

resctrl is the de facto Linux ABI for SoC resource partitioning features.
To support it on another architecture, we need to abstract it from
Intel RDT, and move it to fs/.

Let's start by splitting struct rdt_resource (the name is kept for now
to keep the noise down) and adding some type-trickery to keep the foreach
helpers working.
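
(The trick, pulled out of the diff below for readability: the foreach helpers
still walk one array, but it is now an array of the bigger arch struct, so
the increment has to go via the arch type before handing back the member:)

static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
{
	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(res);

	hw_res++;	/* step by sizeof(struct rdt_hw_resource) */
	return &hw_res->resctrl;
}

#define for_each_rdt_resource(r)				\
	for (r = &rdt_resources_all[0].resctrl;			\
	     r < &rdt_resources_all[RDT_NUM_RESOURCES].resctrl;	\
	     r = resctrl_inc(r))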

Move everything that is particular to resctrl into a new header
file, keeping the x86 MSR-specific stuff where it is. resctrl code
paths touching a 'hw' struct indicate where an abstraction is needed.

We split rdt_domain up in a similar way in the next patch.
No change in behaviour; this patch just moves types around.

Signed-off-by: James Morse <[email protected]>
---
arch/x86/kernel/cpu/resctrl/core.c | 231 ++++++++++++----------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 6 +-
arch/x86/kernel/cpu/resctrl/internal.h | 117 +++--------
arch/x86/kernel/cpu/resctrl/monitor.c | 21 +-
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 48 ++---
include/linux/resctrl.h | 100 ++++++++++
7 files changed, 299 insertions(+), 228 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index f2968fb6fb9a..ce02f3f35b44 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -57,119 +57,133 @@ static void
mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r);

-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].domains)
+#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].resctrl.domains)

-struct rdt_resource rdt_resources_all[] = {
+struct rdt_hw_resource rdt_resources_all[] = {
[RDT_RESOURCE_L3] =
{
- .rid = RDT_RESOURCE_L3,
- .name = "L3",
- .domains = domain_init(RDT_RESOURCE_L3),
+ .resctrl = {
+ .rid = RDT_RESOURCE_L3,
+ .name = "L3",
+ .cache_level = 3,
+ .cache = {
+ .min_cbm_bits = 1,
+ .cbm_idx_mult = 1,
+ .cbm_idx_offset = 0,
+ },
+ .domains = domain_init(RDT_RESOURCE_L3),
+ .parse_ctrlval = parse_cbm,
+ .format_str = "%d=%0*x",
+ .fflags = RFTYPE_RES_CACHE,
+ },
.msr_base = MSR_IA32_L3_CBM_BASE,
.msr_update = cat_wrmsr,
- .cache_level = 3,
- .cache = {
- .min_cbm_bits = 1,
- .cbm_idx_mult = 1,
- .cbm_idx_offset = 0,
- },
- .parse_ctrlval = parse_cbm,
- .format_str = "%d=%0*x",
- .fflags = RFTYPE_RES_CACHE,
},
[RDT_RESOURCE_L3DATA] =
{
- .rid = RDT_RESOURCE_L3DATA,
- .name = "L3DATA",
- .domains = domain_init(RDT_RESOURCE_L3DATA),
+ .resctrl = {
+ .rid = RDT_RESOURCE_L3DATA,
+ .name = "L3DATA",
+ .cache_level = 3,
+ .cache = {
+ .min_cbm_bits = 1,
+ .cbm_idx_mult = 2,
+ .cbm_idx_offset = 0,
+ },
+ .domains = domain_init(RDT_RESOURCE_L3DATA),
+ .parse_ctrlval = parse_cbm,
+ .format_str = "%d=%0*x",
+ .fflags = RFTYPE_RES_CACHE,
+ },
.msr_base = MSR_IA32_L3_CBM_BASE,
.msr_update = cat_wrmsr,
- .cache_level = 3,
- .cache = {
- .min_cbm_bits = 1,
- .cbm_idx_mult = 2,
- .cbm_idx_offset = 0,
- },
- .parse_ctrlval = parse_cbm,
- .format_str = "%d=%0*x",
- .fflags = RFTYPE_RES_CACHE,
},
[RDT_RESOURCE_L3CODE] =
{
- .rid = RDT_RESOURCE_L3CODE,
- .name = "L3CODE",
- .domains = domain_init(RDT_RESOURCE_L3CODE),
+ .resctrl = {
+ .rid = RDT_RESOURCE_L3CODE,
+ .name = "L3CODE",
+ .cache_level = 3,
+ .cache = {
+ .min_cbm_bits = 1,
+ .cbm_idx_mult = 2,
+ .cbm_idx_offset = 1,
+ },
+ .domains = domain_init(RDT_RESOURCE_L3CODE),
+ .parse_ctrlval = parse_cbm,
+ .format_str = "%d=%0*x",
+ .fflags = RFTYPE_RES_CACHE,
+ },
.msr_base = MSR_IA32_L3_CBM_BASE,
.msr_update = cat_wrmsr,
- .cache_level = 3,
- .cache = {
- .min_cbm_bits = 1,
- .cbm_idx_mult = 2,
- .cbm_idx_offset = 1,
- },
- .parse_ctrlval = parse_cbm,
- .format_str = "%d=%0*x",
- .fflags = RFTYPE_RES_CACHE,
},
[RDT_RESOURCE_L2] =
{
- .rid = RDT_RESOURCE_L2,
- .name = "L2",
- .domains = domain_init(RDT_RESOURCE_L2),
+ .resctrl = {
+ .rid = RDT_RESOURCE_L2,
+ .name = "L2",
+ .cache_level = 2,
+ .cache = {
+ .min_cbm_bits = 1,
+ .cbm_idx_mult = 1,
+ .cbm_idx_offset = 0,
+ },
+ .domains = domain_init(RDT_RESOURCE_L2),
+ .parse_ctrlval = parse_cbm,
+ .format_str = "%d=%0*x",
+ .fflags = RFTYPE_RES_CACHE,
+ },
.msr_base = MSR_IA32_L2_CBM_BASE,
.msr_update = cat_wrmsr,
- .cache_level = 2,
- .cache = {
- .min_cbm_bits = 1,
- .cbm_idx_mult = 1,
- .cbm_idx_offset = 0,
- },
- .parse_ctrlval = parse_cbm,
- .format_str = "%d=%0*x",
- .fflags = RFTYPE_RES_CACHE,
},
[RDT_RESOURCE_L2DATA] =
{
- .rid = RDT_RESOURCE_L2DATA,
- .name = "L2DATA",
- .domains = domain_init(RDT_RESOURCE_L2DATA),
+ .resctrl = {
+ .rid = RDT_RESOURCE_L2DATA,
+ .name = "L2DATA",
+ .cache_level = 2,
+ .cache = {
+ .min_cbm_bits = 1,
+ .cbm_idx_mult = 2,
+ .cbm_idx_offset = 0,
+ },
+ .domains = domain_init(RDT_RESOURCE_L2DATA),
+ .parse_ctrlval = parse_cbm,
+ .format_str = "%d=%0*x",
+ .fflags = RFTYPE_RES_CACHE,
+ },
.msr_base = MSR_IA32_L2_CBM_BASE,
.msr_update = cat_wrmsr,
- .cache_level = 2,
- .cache = {
- .min_cbm_bits = 1,
- .cbm_idx_mult = 2,
- .cbm_idx_offset = 0,
- },
- .parse_ctrlval = parse_cbm,
- .format_str = "%d=%0*x",
- .fflags = RFTYPE_RES_CACHE,
},
[RDT_RESOURCE_L2CODE] =
{
- .rid = RDT_RESOURCE_L2CODE,
- .name = "L2CODE",
- .domains = domain_init(RDT_RESOURCE_L2CODE),
+ .resctrl = {
+ .rid = RDT_RESOURCE_L2CODE,
+ .name = "L2CODE",
+ .cache_level = 2,
+ .cache = {
+ .min_cbm_bits = 1,
+ .cbm_idx_mult = 2,
+ .cbm_idx_offset = 1,
+ },
+ .domains = domain_init(RDT_RESOURCE_L2CODE),
+ .parse_ctrlval = parse_cbm,
+ .format_str = "%d=%0*x",
+ .fflags = RFTYPE_RES_CACHE,
+ },
.msr_base = MSR_IA32_L2_CBM_BASE,
.msr_update = cat_wrmsr,
- .cache_level = 2,
- .cache = {
- .min_cbm_bits = 1,
- .cbm_idx_mult = 2,
- .cbm_idx_offset = 1,
- },
- .parse_ctrlval = parse_cbm,
- .format_str = "%d=%0*x",
- .fflags = RFTYPE_RES_CACHE,
},
[RDT_RESOURCE_MBA] =
{
- .rid = RDT_RESOURCE_MBA,
- .name = "MB",
- .domains = domain_init(RDT_RESOURCE_MBA),
- .cache_level = 3,
- .format_str = "%d=%*u",
- .fflags = RFTYPE_RES_MB,
+ .resctrl = {
+ .rid = RDT_RESOURCE_MBA,
+ .name = "MB",
+ .cache_level = 3,
+ .domains = domain_init(RDT_RESOURCE_MBA),
+ .format_str = "%d=%*u",
+ .fflags = RFTYPE_RES_MB,
+ },
},
};

@@ -198,7 +212,7 @@ static unsigned int cbm_idx(struct rdt_resource *r, unsigned int closid)
*/
static inline void cache_alloc_hsw_probe(void)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3];
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].resctrl;
u32 l, h, max_cbm = BIT_MASK(20) - 1;

if (wrmsr_safe(MSR_IA32_L3_CBM_BASE, max_cbm, 0))
@@ -224,7 +238,7 @@ static inline void cache_alloc_hsw_probe(void)
bool is_mba_sc(struct rdt_resource *r)
{
if (!r)
- return rdt_resources_all[RDT_RESOURCE_MBA].membw.mba_sc;
+ return rdt_resources_all[RDT_RESOURCE_MBA].resctrl.membw.mba_sc;

return r->membw.mba_sc;
}
@@ -321,8 +335,8 @@ static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)

static void rdt_get_cdp_config(int level, int type)
{
- struct rdt_resource *r_l = &rdt_resources_all[level];
- struct rdt_resource *r = &rdt_resources_all[type];
+ struct rdt_resource *r_l = &rdt_resources_all[level].resctrl;
+ struct rdt_resource *r = &rdt_resources_all[type].resctrl;

r->num_closid = r_l->num_closid / 2;
r->cache.cbm_len = r_l->cache.cbm_len;
@@ -353,9 +367,10 @@ static void
mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
{
unsigned int i;
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);

for (i = m->low; i < m->high; i++)
- wrmsrl(r->msr_base + i, d->ctrl_val[i]);
+ wrmsrl(hw_res->msr_base + i, d->ctrl_val[i]);
}

/*
@@ -377,19 +392,21 @@ mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r)
{
unsigned int i;
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);

/* Write the delay values for mba. */
for (i = m->low; i < m->high; i++)
- wrmsrl(r->msr_base + i, delay_bw_map(d->ctrl_val[i], r));
+ wrmsrl(hw_res->msr_base + i, delay_bw_map(d->ctrl_val[i], r));
}

static void
cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
{
unsigned int i;
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);

for (i = m->low; i < m->high; i++)
- wrmsrl(r->msr_base + cbm_idx(r, i), d->ctrl_val[i]);
+ wrmsrl(hw_res->msr_base + cbm_idx(r, i), d->ctrl_val[i]);
}

struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
@@ -408,13 +425,14 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
void rdt_ctrl_update(void *arg)
{
struct msr_param *m = arg;
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
struct rdt_resource *r = m->res;
int cpu = smp_processor_id();
struct rdt_domain *d;

d = get_domain_from_cpu(cpu, r);
if (d) {
- r->msr_update(d, m, r);
+ hw_res->msr_update(d, m, r);
return;
}
pr_warn_once("cpu %d not found in any domain for resource %s\n",
@@ -472,6 +490,7 @@ void setup_default_ctrlval(struct rdt_resource *r, u32 *dc, u32 *dm)

static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
struct msr_param m;
u32 *dc, *dm;

@@ -491,7 +510,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)

m.low = 0;
m.high = r->num_closid;
- r->msr_update(d, &m, r);
+ hw_res->msr_update(d, &m, r);
return 0;
}

@@ -639,7 +658,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
return;
}

- if (r == &rdt_resources_all[RDT_RESOURCE_L3]) {
+ if (r == &rdt_resources_all[RDT_RESOURCE_L3].resctrl) {
if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
cancel_delayed_work(&d->mbm_over);
mbm_setup_overflow_handler(d, 0);
@@ -815,9 +834,9 @@ static __init bool get_mem_config(void)
return false;

if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
- return __get_mem_config_intel(&rdt_resources_all[RDT_RESOURCE_MBA]);
+ return __get_mem_config_intel(&rdt_resources_all[RDT_RESOURCE_MBA].resctrl);
else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
- return __rdt_get_mem_config_amd(&rdt_resources_all[RDT_RESOURCE_MBA]);
+ return __rdt_get_mem_config_amd(&rdt_resources_all[RDT_RESOURCE_MBA].resctrl);

return false;
}
@@ -833,14 +852,14 @@ static __init bool get_rdt_alloc_resources(void)
return false;

if (rdt_cpu_has(X86_FEATURE_CAT_L3)) {
- rdt_get_cache_alloc_cfg(1, &rdt_resources_all[RDT_RESOURCE_L3]);
+ rdt_get_cache_alloc_cfg(1, &rdt_resources_all[RDT_RESOURCE_L3].resctrl);
if (rdt_cpu_has(X86_FEATURE_CDP_L3))
rdt_get_cdp_l3_config();
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_CAT_L2)) {
/* CPUID 0x10.2 fields are same format at 0x10.1 */
- rdt_get_cache_alloc_cfg(2, &rdt_resources_all[RDT_RESOURCE_L2]);
+ rdt_get_cache_alloc_cfg(2, &rdt_resources_all[RDT_RESOURCE_L2].resctrl);
if (rdt_cpu_has(X86_FEATURE_CDP_L2))
rdt_get_cdp_l2_config();
ret = true;
@@ -864,7 +883,7 @@ static __init bool get_rdt_mon_resources(void)
if (!rdt_mon_features)
return false;

- return !rdt_get_mon_l3_config(&rdt_resources_all[RDT_RESOURCE_L3]);
+ return !rdt_get_mon_l3_config(&rdt_resources_all[RDT_RESOURCE_L3].resctrl);
}

static __init void __check_quirks_intel(void)
@@ -898,9 +917,14 @@ static __init bool get_rdt_resources(void)

static __init void rdt_init_res_defs_intel(void)
{
+ struct rdt_hw_resource *hw_res;
struct rdt_resource *r;
+ int i;
+
+ for (i = 0; i < RDT_NUM_RESOURCES; i++) {
+ hw_res = &rdt_resources_all[i];
+ r = &rdt_resources_all[i].resctrl;

- for_each_rdt_resource(r) {
if (r->rid == RDT_RESOURCE_L3 ||
r->rid == RDT_RESOURCE_L3DATA ||
r->rid == RDT_RESOURCE_L3CODE ||
@@ -909,8 +933,8 @@ static __init void rdt_init_res_defs_intel(void)
r->rid == RDT_RESOURCE_L2CODE)
r->cache.arch_has_sparse_bitmaps = false;
else if (r->rid == RDT_RESOURCE_MBA) {
- r->msr_base = MSR_IA32_MBA_THRTL_BASE;
- r->msr_update = mba_wrmsr_intel;
+ hw_res->msr_base = MSR_IA32_MBA_THRTL_BASE;
+ hw_res->msr_update = mba_wrmsr_intel;
r->parse_ctrlval = parse_bw;
}
}
@@ -918,9 +942,14 @@ static __init void rdt_init_res_defs_intel(void)

static __init void rdt_init_res_defs_amd(void)
{
+ struct rdt_hw_resource *hw_res;
struct rdt_resource *r;
+ int i;
+
+ for (i = 0; i < RDT_NUM_RESOURCES; i++) {
+ hw_res = &rdt_resources_all[i];
+ r = &rdt_resources_all[i].resctrl;

- for_each_rdt_resource(r) {
if (r->rid == RDT_RESOURCE_L3 ||
r->rid == RDT_RESOURCE_L3DATA ||
r->rid == RDT_RESOURCE_L3CODE ||
@@ -929,8 +958,8 @@ static __init void rdt_init_res_defs_amd(void)
r->rid == RDT_RESOURCE_L2CODE)
r->cache.arch_has_sparse_bitmaps = true;
else if (r->rid == RDT_RESOURCE_MBA) {
- r->msr_base = MSR_IA32_MBA_BW_BASE;
- r->msr_update = mba_wrmsr_amd;
+ hw_res->msr_base = MSR_IA32_MBA_BW_BASE;
+ hw_res->msr_update = mba_wrmsr_amd;
r->parse_ctrlval = parse_bw;
}
}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 38df876feb54..c90aa79d90b9 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -446,6 +446,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_domain *d,
int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
+ struct rdt_hw_resource *hw_res;
u32 resid, evtid, domid;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
@@ -465,7 +466,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
domid = md.u.domid;
evtid = md.u.evtid;

- r = &rdt_resources_all[resid];
+ hw_res = &rdt_resources_all[resid];
+ r = &hw_res->resctrl;
d = rdt_find_domain(r, domid, NULL);
if (IS_ERR_OR_NULL(d)) {
ret = -ENOENT;
@@ -479,7 +481,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
else if (rr.val & RMID_VAL_UNAVAIL)
seq_puts(m, "Unavailable\n");
else
- seq_printf(m, "%llu\n", rr.val * r->mon_scale);
+ seq_printf(m, "%llu\n", rr.val * hw_res->mon_scale);

out:
rdtgroup_kn_unlock(of->kn);
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 0172a87de814..5e69f709b729 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -2,6 +2,7 @@
#ifndef _ASM_X86_RESCTRL_INTERNAL_H
#define _ASM_X86_RESCTRL_INTERNAL_H

+#include <linux/resctrl.h>
#include <linux/sched.h>
#include <linux/kernfs.h>
#include <linux/fs_context.h>
@@ -340,45 +341,6 @@ struct msr_param {
int high;
};

-/**
- * struct rdt_cache - Cache allocation related data
- * @cbm_len: Length of the cache bit mask
- * @min_cbm_bits: Minimum number of consecutive bits to be set
- * @cbm_idx_mult: Multiplier of CBM index
- * @cbm_idx_offset: Offset of CBM index. CBM index is computed by:
- * closid * cbm_idx_multi + cbm_idx_offset
- * in a cache bit mask
- * @shareable_bits: Bitmask of shareable resource with other
- * executing entities
- * @arch_has_sparse_bitmaps: True if a bitmap like f00f is valid.
- */
-struct rdt_cache {
- unsigned int cbm_len;
- unsigned int min_cbm_bits;
- unsigned int cbm_idx_mult;
- unsigned int cbm_idx_offset;
- unsigned int shareable_bits;
- bool arch_has_sparse_bitmaps;
-};
-
-/**
- * struct rdt_membw - Memory bandwidth allocation related data
- * @min_bw: Minimum memory bandwidth percentage user can request
- * @bw_gran: Granularity at which the memory bandwidth is allocated
- * @arch_needs_linear: True if we can't configure non-linear resources
- * @delay_linear: True if memory B/W delay is in linear scale
- * @mba_sc: True if MBA software controller(mba_sc) is enabled
- * @mb_map: Mapping of memory B/W percentage to memory B/W delay
- */
-struct rdt_membw {
- u32 min_bw;
- u32 bw_gran;
- u32 delay_linear;
- bool arch_needs_linear;
- bool mba_sc;
- u32 *mb_map;
-};
-
static inline bool is_llc_occupancy_enabled(void)
{
return (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID));
@@ -411,55 +373,24 @@ struct rdt_parse_data {
};

/**
- * struct rdt_resource - attributes of an RDT resource
- * @rid: The index of the resource
- * @alloc_enabled: Is allocation enabled on this machine
- * @mon_enabled: Is monitoring enabled for this feature
- * @alloc_capable: Is allocation available on this machine
- * @mon_capable: Is monitor feature available on this machine
- * @name: Name to use in "schemata" file
- * @num_closid: Number of CLOSIDs available
- * @cache_level: Which cache level defines scope of this resource
- * @default_ctrl: Specifies default cache cbm or memory B/W percent.
+ * struct rdt_hw_resource - hw attributes of an RDT resource
* @msr_base: Base MSR address for CBMs
* @msr_update: Function pointer to update QOS MSRs
- * @data_width: Character width of data when displaying
- * @domains: All domains for this resource
- * @cache: Cache allocation related data
- * @format_str: Per resource format string to show domain value
- * @parse_ctrlval: Per resource function pointer to parse control values
- * @evt_list: List of monitoring events
- * @num_rmid: Number of RMIDs available
* @mon_scale: cqm counter * mon_scale = occupancy in bytes
- * @fflags: flags to choose base and info files
*/
-struct rdt_resource {
- int rid;
- bool alloc_enabled;
- bool mon_enabled;
- bool alloc_capable;
- bool mon_capable;
- char *name;
- int num_closid;
- int cache_level;
- u32 default_ctrl;
+struct rdt_hw_resource {
+ struct rdt_resource resctrl;
unsigned int msr_base;
void (*msr_update) (struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r);
- int data_width;
- struct list_head domains;
- struct rdt_cache cache;
- struct rdt_membw membw;
- const char *format_str;
- int (*parse_ctrlval)(struct rdt_parse_data *data,
- struct rdt_resource *r,
- struct rdt_domain *d);
- struct list_head evt_list;
- int num_rmid;
unsigned int mon_scale;
- unsigned long fflags;
};

+static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r)
+{
+ return container_of(r, struct rdt_hw_resource, resctrl);
+}
+
int parse_cbm(struct rdt_parse_data *data, struct rdt_resource *r,
struct rdt_domain *d);
int parse_bw(struct rdt_parse_data *data, struct rdt_resource *r,
@@ -467,7 +398,7 @@ int parse_bw(struct rdt_parse_data *data, struct rdt_resource *r,

extern struct mutex rdtgroup_mutex;

-extern struct rdt_resource rdt_resources_all[];
+extern struct rdt_hw_resource rdt_resources_all[];
extern struct rdtgroup rdtgroup_default;
DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);

@@ -486,33 +417,37 @@ enum {
RDT_NUM_RESOURCES,
};

+static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
+{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(res);
+
+ hw_res++;
+ return &hw_res->resctrl;
+}
+
#define for_each_rdt_resource(r) \
- for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
- r++)
+ for (r = &rdt_resources_all[0].resctrl; \
+ r < &rdt_resources_all[RDT_NUM_RESOURCES].resctrl; \
+ r = resctrl_inc(r))

#define for_each_capable_rdt_resource(r) \
- for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
- r++) \
+ for_each_rdt_resource(r) \
if (r->alloc_capable || r->mon_capable)

#define for_each_alloc_capable_rdt_resource(r) \
- for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
- r++) \
+ for_each_rdt_resource(r) \
if (r->alloc_capable)

#define for_each_mon_capable_rdt_resource(r) \
- for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
- r++) \
+ for_each_rdt_resource(r) \
if (r->mon_capable)

#define for_each_alloc_enabled_rdt_resource(r) \
- for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
- r++) \
+ for_each_rdt_resource(r) \
if (r->alloc_enabled)

#define for_each_mon_enabled_rdt_resource(r) \
- for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
- r++) \
+ for_each_rdt_resource(r) \
if (r->mon_enabled)

/* CPUID.(EAX=10H, ECX=ResID=1).EAX */
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index a02a7f886a0a..cd34a06cec68 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -111,7 +111,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
struct rdt_resource *r;
u32 crmid = 1, nrmid;

- r = &rdt_resources_all[RDT_RESOURCE_L3];
+ r = &rdt_resources_all[RDT_RESOURCE_L3].resctrl;

/*
* Skip RMID 0 and start from RMID 1 and check all the RMIDs that
@@ -169,7 +169,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
int cpu;
u64 val;

- r = &rdt_resources_all[RDT_RESOURCE_L3];
+ r = &rdt_resources_all[RDT_RESOURCE_L3].resctrl;

entry->busy = 0;
cpu = get_cpu();
@@ -270,7 +270,7 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
*/
static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3];
+ struct rdt_hw_resource *hw_res = &rdt_resources_all[RDT_RESOURCE_L3];
struct mbm_state *m = &rr->d->mbm_local[rmid];
u64 tval, cur_bw, chunks;

@@ -280,7 +280,7 @@ static void mbm_bw_count(u32 rmid, struct rmid_read *rr)

chunks = mbm_overflow_count(m->prev_bw_msr, tval);
m->chunks += chunks;
- cur_bw = (chunks * r->mon_scale) >> 20;
+ cur_bw = (chunks * hw_res->mon_scale) >> 20;

if (m->delta_comp)
m->delta_bw = abs(cur_bw - m->prev_bw);
@@ -353,6 +353,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
{
u32 closid, rmid, cur_msr, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
+ struct rdt_hw_resource *hw_r_mba;
u32 cur_bw, delta_bw, user_bw;
struct rdt_resource *r_mba;
struct rdt_domain *dom_mba;
@@ -362,7 +363,8 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
if (!is_mbm_local_enabled())
return;

- r_mba = &rdt_resources_all[RDT_RESOURCE_MBA];
+ hw_r_mba = &rdt_resources_all[RDT_RESOURCE_MBA];
+ r_mba = &hw_r_mba->resctrl;
closid = rgrp->closid;
rmid = rgrp->mon.rmid;
pmbm_data = &dom_mbm->mbm_local[rmid];
@@ -411,7 +413,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
return;
}

- cur_msr = r_mba->msr_base + closid;
+ cur_msr = hw_r_mba->msr_base + closid;
wrmsrl(cur_msr, delay_bw_map(new_msr_val, r_mba));
dom_mba->ctrl_val[closid] = new_msr_val;

@@ -475,7 +477,7 @@ void cqm_handle_limbo(struct work_struct *work)

mutex_lock(&rdtgroup_mutex);

- r = &rdt_resources_all[RDT_RESOURCE_L3];
+ r = &rdt_resources_all[RDT_RESOURCE_L3].resctrl;
d = container_of(work, struct rdt_domain, cqm_limbo.work);

__check_limbo(d, false);
@@ -605,10 +607,11 @@ static void l3_mon_evt_init(struct rdt_resource *r)

int rdt_get_mon_l3_config(struct rdt_resource *r)
{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
unsigned int cl_size = boot_cpu_data.x86_cache_size;
int ret;

- r->mon_scale = boot_cpu_data.x86_cache_occ_scale;
+ hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;

/*
@@ -621,7 +624,7 @@ int rdt_get_mon_l3_config(struct rdt_resource *r)
resctrl_cqm_threshold = cl_size * 1024 / r->num_rmid;

/* h/w works in units of "boot_cpu_data.x86_cache_occ_scale" */
- resctrl_cqm_threshold /= r->mon_scale;
+ resctrl_cqm_threshold /= hw_res->mon_scale;

ret = dom_data_init(r);
if (ret)
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index d7623e1b927d..29ace6b60cda 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -684,8 +684,8 @@ int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp)
* resource, the portion of cache used by it should be made
* unavailable to all future allocations from both resources.
*/
- if (rdt_resources_all[RDT_RESOURCE_L3DATA].alloc_enabled ||
- rdt_resources_all[RDT_RESOURCE_L2DATA].alloc_enabled) {
+ if (rdt_resources_all[RDT_RESOURCE_L3DATA].resctrl.alloc_enabled ||
+ rdt_resources_all[RDT_RESOURCE_L2DATA].resctrl.alloc_enabled) {
rdt_last_cmd_puts("CDP enabled\n");
return -EINVAL;
}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index c84b1f355a9a..f3106dfc4da6 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1021,8 +1021,9 @@ static int max_threshold_occ_show(struct kernfs_open_file *of,
struct seq_file *seq, void *v)
{
struct rdt_resource *r = of->kn->parent->priv;
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);

- seq_printf(seq, "%u\n", resctrl_cqm_threshold * r->mon_scale);
+ seq_printf(seq, "%u\n", resctrl_cqm_threshold * hw_res->mon_scale);

return 0;
}
@@ -1030,7 +1031,7 @@ static int max_threshold_occ_show(struct kernfs_open_file *of,
static ssize_t max_threshold_occ_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
- struct rdt_resource *r = of->kn->parent->priv;
+ struct rdt_hw_resource *hw_res;
unsigned int bytes;
int ret;

@@ -1041,7 +1042,8 @@ static ssize_t max_threshold_occ_write(struct kernfs_open_file *of,
if (bytes > (boot_cpu_data.x86_cache_size * 1024))
return -EINVAL;

- resctrl_cqm_threshold = bytes / r->mon_scale;
+ hw_res = resctrl_to_arch_res(of->kn->parent->priv);
+ resctrl_cqm_threshold = bytes / hw_res->mon_scale;

return nbytes;
}
@@ -1099,16 +1101,16 @@ static int rdt_cdp_peer_get(struct rdt_resource *r, struct rdt_domain *d,

switch (r->rid) {
case RDT_RESOURCE_L3DATA:
- _r_cdp = &rdt_resources_all[RDT_RESOURCE_L3CODE];
+ _r_cdp = &rdt_resources_all[RDT_RESOURCE_L3CODE].resctrl;
break;
case RDT_RESOURCE_L3CODE:
- _r_cdp = &rdt_resources_all[RDT_RESOURCE_L3DATA];
+ _r_cdp = &rdt_resources_all[RDT_RESOURCE_L3DATA].resctrl;
break;
case RDT_RESOURCE_L2DATA:
- _r_cdp = &rdt_resources_all[RDT_RESOURCE_L2CODE];
+ _r_cdp = &rdt_resources_all[RDT_RESOURCE_L2CODE].resctrl;
break;
case RDT_RESOURCE_L2CODE:
- _r_cdp = &rdt_resources_all[RDT_RESOURCE_L2DATA];
+ _r_cdp = &rdt_resources_all[RDT_RESOURCE_L2DATA].resctrl;
break;
default:
ret = -ENOENT;
@@ -1830,7 +1832,7 @@ static void l2_qos_cfg_update(void *arg)

static inline bool is_mba_linear(void)
{
- return rdt_resources_all[RDT_RESOURCE_MBA].membw.delay_linear;
+ return rdt_resources_all[RDT_RESOURCE_MBA].resctrl.membw.delay_linear;
}

static int set_cache_qos_cfg(int level, bool enable)
@@ -1851,7 +1853,7 @@ static int set_cache_qos_cfg(int level, bool enable)
if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
return -ENOMEM;

- r_l = &rdt_resources_all[level];
+ r_l = &rdt_resources_all[level].resctrl;
list_for_each_entry(d, &r_l->domains, list) {
/* Pick one CPU from each domain instance to update MSR */
cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
@@ -1877,7 +1879,7 @@ static int set_cache_qos_cfg(int level, bool enable)
*/
static int set_mba_sc(bool mba_sc)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA];
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].resctrl;
struct rdt_domain *d;

if (!is_mbm_enabled() || !is_mba_linear() ||
@@ -1893,9 +1895,9 @@ static int set_mba_sc(bool mba_sc)

static int cdp_enable(int level, int data_type, int code_type)
{
- struct rdt_resource *r_ldata = &rdt_resources_all[data_type];
- struct rdt_resource *r_lcode = &rdt_resources_all[code_type];
- struct rdt_resource *r_l = &rdt_resources_all[level];
+ struct rdt_resource *r_ldata = &rdt_resources_all[data_type].resctrl;
+ struct rdt_resource *r_lcode = &rdt_resources_all[code_type].resctrl;
+ struct rdt_resource *r_l = &rdt_resources_all[level].resctrl;
int ret;

if (!r_l->alloc_capable || !r_ldata->alloc_capable ||
@@ -1925,13 +1927,13 @@ static int cdpl2_enable(void)

static void cdp_disable(int level, int data_type, int code_type)
{
- struct rdt_resource *r = &rdt_resources_all[level];
+ struct rdt_resource *r = &rdt_resources_all[level].resctrl;

r->alloc_enabled = r->alloc_capable;

- if (rdt_resources_all[data_type].alloc_enabled) {
- rdt_resources_all[data_type].alloc_enabled = false;
- rdt_resources_all[code_type].alloc_enabled = false;
+ if (rdt_resources_all[data_type].resctrl.alloc_enabled) {
+ rdt_resources_all[data_type].resctrl.alloc_enabled = false;
+ rdt_resources_all[code_type].resctrl.alloc_enabled = false;
set_cache_qos_cfg(level, false);
}
}
@@ -1948,9 +1950,9 @@ static void cdpl2_disable(void)

static void cdp_disable_all(void)
{
- if (rdt_resources_all[RDT_RESOURCE_L3DATA].alloc_enabled)
+ if (rdt_resources_all[RDT_RESOURCE_L3DATA].resctrl.alloc_enabled)
cdpl3_disable();
- if (rdt_resources_all[RDT_RESOURCE_L2DATA].alloc_enabled)
+ if (rdt_resources_all[RDT_RESOURCE_L2DATA].resctrl.alloc_enabled)
cdpl2_disable();
}

@@ -2101,7 +2103,7 @@ static int rdt_get_tree(struct fs_context *fc)
static_branch_enable_cpuslocked(&rdt_enable_key);

if (is_mbm_enabled()) {
- r = &rdt_resources_all[RDT_RESOURCE_L3];
+ r = &rdt_resources_all[RDT_RESOURCE_L3].resctrl;
list_for_each_entry(dom, &r->domains, list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
}
@@ -3092,13 +3094,13 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)

static int rdtgroup_show_options(struct seq_file *seq, struct kernfs_root *kf)
{
- if (rdt_resources_all[RDT_RESOURCE_L3DATA].alloc_enabled)
+ if (rdt_resources_all[RDT_RESOURCE_L3DATA].resctrl.alloc_enabled)
seq_puts(seq, ",cdp");

- if (rdt_resources_all[RDT_RESOURCE_L2DATA].alloc_enabled)
+ if (rdt_resources_all[RDT_RESOURCE_L2DATA].resctrl.alloc_enabled)
seq_puts(seq, ",cdpl2");

- if (is_mba_sc(&rdt_resources_all[RDT_RESOURCE_MBA]))
+ if (is_mba_sc(&rdt_resources_all[RDT_RESOURCE_MBA].resctrl))
seq_puts(seq, ",mba_MBps");

return 0;
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 9b05af9b3e28..a8a499c6644b 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -2,6 +2,8 @@
#ifndef _RESCTRL_H
#define _RESCTRL_H

+#include <linux/kernel.h>
+#include <linux/list.h>
#include <linux/pid.h>

#ifdef CONFIG_PROC_CPU_RESCTRL
@@ -13,4 +15,102 @@ int proc_resctrl_show(struct seq_file *m,

#endif

+struct rdt_domain;
+
+/**
+ * struct resctrl_cache - Cache allocation related data
+ * @cbm_len: Length of the cache bit mask
+ * @min_cbm_bits: Minimum number of consecutive bits to be set
+ * @cbm_idx_mult: Multiplier of CBM index
+ * @cbm_idx_offset: Offset of CBM index. CBM index is computed by:
+ * closid * cbm_idx_multi + cbm_idx_offset
+ * in a cache bit mask
+ * @shareable_bits: Bitmask of shareable resource with other
+ * executing entities
+ * @arch_has_sparse_bitmaps: True if a bitmap like f00f is valid.
+ */
+struct resctrl_cache {
+ u32 cbm_len;
+ u32 min_cbm_bits;
+ unsigned int cbm_idx_mult;
+ unsigned int cbm_idx_offset;
+ u32 shareable_bits;
+ bool arch_has_sparse_bitmaps;
+};
+
+/**
+ * struct resctrl_membw - Memory bandwidth allocation related data
+ * @min_bw: Minimum memory bandwidth percentage user can request
+ * @bw_gran: Granularity at which the memory bandwidth is allocated
+ * @delay_linear: True if memory B/W delay is in linear scale
+ * @arch_needs_linear: True if we can't configure non-linear resources
+ * @mba_sc: True if MBA software controller(mba_sc) is enabled
+ * @mb_map: Mapping of memory B/W percentage to memory B/W delay
+ */
+struct resctrl_membw {
+ u32 min_bw;
+ u32 bw_gran;
+ u32 delay_linear;
+ bool arch_needs_linear;
+ bool mba_sc;
+ u32 *mb_map;
+};
+
+struct rdt_parse_data;
+
+/**
+ * @rid: The index of the resource
+ * @alloc_enabled: Is allocation enabled on this machine
+ * @mon_enabled: Is monitoring enabled for this feature
+ * @alloc_capable: Is allocation available on this machine
+ * @mon_capable: Is monitor feature available on this machine
+ *
+ * @cache_level: Which cache level defines scope of this resource
+ *
+ * @cache: If the component has cache controls, their properties.
+ * @membw: If the component has bandwidth controls, their properties.
+ *
+ * @num_closid: Number of CLOSIDs available.
+ * @num_rmid: Number of RMIDs available.
+ *
+ * @domains: All domains for this resource
+ *
+ * @name: Name to use in "schemata" file.
+ * @data_width: Character width of data when displaying.
+ * @default_ctrl: Specifies default cache cbm or memory B/W percent.
+ * @format_str: Per resource format string to show domain value
+ * @parse_ctrlval: Per resource function pointer to parse control values
+ *
+ * @evt_list: List of monitoring events
+ * @fflags: flags to choose base and info files
+ */
+struct rdt_resource {
+ int rid;
+ bool alloc_enabled;
+ bool mon_enabled;
+ bool alloc_capable;
+ bool mon_capable;
+
+ int cache_level;
+
+ struct resctrl_cache cache;
+ struct resctrl_membw membw;
+
+ int num_closid;
+ int num_rmid;
+
+ struct list_head domains;
+
+ char *name;
+ int data_width;
+ u32 default_ctrl;
+ const char *format_str;
+ int (*parse_ctrlval)(struct rdt_parse_data *data,
+ struct rdt_resource *r,
+ struct rdt_domain *d);
+ struct list_head evt_list;
+ unsigned long fflags;
+
+};
+
#endif /* _RESCTRL_H */
--
2.24.1

2020-02-14 18:32:03

by James Morse

Subject: [RFC PATCH v2 2/2] x86/resctrl: Split struct rdt_domain

resctrl is the de facto Linux ABI for SoC resource partitioning features.
To support it on another architecture, we need to abstract it from
Intel RDT, and move it to fs/.

Split struct rdt_domain up too. Move everything that is particular
to resctrl into a new header file. resctrl code paths touching a 'hw'
struct indicate where an abstraction is needed.

No change in behaviour; this patch just moves types around.
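
For reference, the shape of the split as it appears in the diff below: the
hardware configuration arrays move behind the arch struct.

struct rdt_hw_domain {
	struct rdt_domain	resctrl;	/* what fs/resctrl would see */
	u32			*ctrl_val;	/* MSR values, indexed by CLOSID */
	u32			*mbps_val;	/* bandwidth in MBps, when mba_sc is on */
};

static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
{
	return container_of(r, struct rdt_hw_domain, resctrl);
}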

Signed-off-by: James Morse <[email protected]>
---
arch/x86/kernel/cpu/resctrl/core.c | 32 +++++++++++-------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 10 ++++--
arch/x86/kernel/cpu/resctrl/internal.h | 40 +++++------------------
arch/x86/kernel/cpu/resctrl/monitor.c | 8 +++--
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 29 ++++++++++------
include/linux/resctrl.h | 35 +++++++++++++++++++-
6 files changed, 94 insertions(+), 60 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index ce02f3f35b44..e48d54dfd657 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -367,10 +367,11 @@ static void
mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
{
unsigned int i;
+ struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);

for (i = m->low; i < m->high; i++)
- wrmsrl(hw_res->msr_base + i, d->ctrl_val[i]);
+ wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
}

/*
@@ -392,21 +393,23 @@ mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r)
{
unsigned int i;
+ struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);

/* Write the delay values for mba. */
for (i = m->low; i < m->high; i++)
- wrmsrl(hw_res->msr_base + i, delay_bw_map(d->ctrl_val[i], r));
+ wrmsrl(hw_res->msr_base + i, delay_bw_map(hw_dom->ctrl_val[i], r));
}

static void
cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
{
unsigned int i;
+ struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);

for (i = m->low; i < m->high; i++)
- wrmsrl(hw_res->msr_base + cbm_idx(r, i), d->ctrl_val[i]);
+ wrmsrl(hw_res->msr_base + cbm_idx(r, i), hw_dom->ctrl_val[i]);
}

struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
@@ -491,21 +494,22 @@ void setup_default_ctrlval(struct rdt_resource *r, u32 *dc, u32 *dm)
static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
struct msr_param m;
u32 *dc, *dm;

- dc = kmalloc_array(r->num_closid, sizeof(*d->ctrl_val), GFP_KERNEL);
+ dc = kmalloc_array(r->num_closid, sizeof(*hw_dom->ctrl_val), GFP_KERNEL);
if (!dc)
return -ENOMEM;

- dm = kmalloc_array(r->num_closid, sizeof(*d->mbps_val), GFP_KERNEL);
+ dm = kmalloc_array(r->num_closid, sizeof(*hw_dom->mbps_val), GFP_KERNEL);
if (!dm) {
kfree(dc);
return -ENOMEM;
}

- d->ctrl_val = dc;
- d->mbps_val = dm;
+ hw_dom->ctrl_val = dc;
+ hw_dom->mbps_val = dm;
setup_default_ctrlval(r, dc, dm);

m.low = 0;
@@ -567,6 +571,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
struct list_head *add_pos = NULL;
+ struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;

d = rdt_find_domain(r, id, &add_pos);
@@ -580,10 +585,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;
}

- d = kzalloc_node(sizeof(*d), GFP_KERNEL, cpu_to_node(cpu));
- if (!d)
+ hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+ if (!hw_dom)
return;

+ d = &hw_dom->resctrl;
d->id = id;
cpumask_set_cpu(cpu, &d->cpu_mask);

@@ -610,6 +616,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
static void domain_remove_cpu(int cpu, struct rdt_resource *r)
{
int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;

d = rdt_find_domain(r, id, NULL);
@@ -617,6 +624,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
pr_warn("Could't find cache id for cpu %d\n", cpu);
return;
}
+ hw_dom = resctrl_to_arch_dom(d);

cpumask_clear_cpu(cpu, &d->cpu_mask);
if (cpumask_empty(&d->cpu_mask)) {
@@ -649,12 +657,12 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
if (d->plr)
d->plr->d = NULL;

- kfree(d->ctrl_val);
- kfree(d->mbps_val);
+ kfree(hw_dom->ctrl_val);
+ kfree(hw_dom->mbps_val);
bitmap_free(d->rmid_busy_llc);
kfree(d->mbm_total);
kfree(d->mbm_local);
- kfree(d);
+ kfree(hw_dom);
return;
}

diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index c90aa79d90b9..cd79fc3715d3 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -237,6 +237,7 @@ static int parse_line(char *line, struct rdt_resource *r,

int update_domains(struct rdt_resource *r, int closid)
{
+ struct rdt_hw_domain *hw_dom;
struct msr_param msr_param;
cpumask_var_t cpu_mask;
struct rdt_domain *d;
@@ -253,7 +254,8 @@ int update_domains(struct rdt_resource *r, int closid)

mba_sc = is_mba_sc(r);
list_for_each_entry(d, &r->domains, list) {
- dc = !mba_sc ? d->ctrl_val : d->mbps_val;
+ hw_dom = resctrl_to_arch_dom(d);
+ dc = !mba_sc ? hw_dom->ctrl_val : hw_dom->mbps_val;
if (d->have_new_ctrl && d->new_ctrl != dc[closid]) {
cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
dc[closid] = d->new_ctrl;
@@ -372,17 +374,19 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,

static void show_doms(struct seq_file *s, struct rdt_resource *r, int closid)
{
+ struct rdt_hw_domain *hw_dom;
struct rdt_domain *dom;
bool sep = false;
u32 ctrl_val;

seq_printf(s, "%*s:", max_name_width, r->name);
list_for_each_entry(dom, &r->domains, list) {
+ hw_dom = resctrl_to_arch_dom(dom);
if (sep)
seq_puts(s, ";");

- ctrl_val = (!is_mba_sc(r) ? dom->ctrl_val[closid] :
- dom->mbps_val[closid]);
+ ctrl_val = (!is_mba_sc(r) ? hw_dom->ctrl_val[closid] :
+ hw_dom->mbps_val[closid]);
seq_printf(s, r->format_str, dom->id, max_data_width,
ctrl_val);
sep = true;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 5e69f709b729..bc4089a1e775 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -291,44 +291,22 @@ struct mbm_state {
};

/**
- * struct rdt_domain - group of cpus sharing an RDT resource
- * @list: all instances of this resource
- * @id: unique id for this instance
- * @cpu_mask: which cpus share this resource
- * @rmid_busy_llc:
- * bitmap of which limbo RMIDs are above threshold
- * @mbm_total: saved state for MBM total bandwidth
- * @mbm_local: saved state for MBM local bandwidth
- * @mbm_over: worker to periodically read MBM h/w counters
- * @cqm_limbo: worker to periodically read CQM h/w counters
- * @mbm_work_cpu:
- * worker cpu for MBM h/w counters
- * @cqm_work_cpu:
- * worker cpu for CQM h/w counters
+ * struct rdt_hw_domain - group of cpus sharing an RDT resource
+ * @resctrl: Properties exposed to the resctrl file system
* @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID)
* @mbps_val: When mba_sc is enabled, this holds the bandwidth in MBps
- * @new_ctrl: new ctrl value to be loaded
- * @have_new_ctrl: did user provide new_ctrl for this domain
- * @plr: pseudo-locked region (if any) associated with domain
*/
-struct rdt_domain {
- struct list_head list;
- int id;
- struct cpumask cpu_mask;
- unsigned long *rmid_busy_llc;
- struct mbm_state *mbm_total;
- struct mbm_state *mbm_local;
- struct delayed_work mbm_over;
- struct delayed_work cqm_limbo;
- int mbm_work_cpu;
- int cqm_work_cpu;
+struct rdt_hw_domain {
+ struct rdt_domain resctrl;
u32 *ctrl_val;
u32 *mbps_val;
- u32 new_ctrl;
- bool have_new_ctrl;
- struct pseudo_lock_region *plr;
};

+static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
+{
+ return container_of(r, struct rdt_hw_domain, resctrl);
+}
+
/**
* struct msr_param - set a range of MSRs from a domain
* @res: The resource to use
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index cd34a06cec68..7b3b78c560d8 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -354,6 +354,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
u32 closid, rmid, cur_msr, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
struct rdt_hw_resource *hw_r_mba;
+ struct rdt_hw_domain *hw_dom_mba;
u32 cur_bw, delta_bw, user_bw;
struct rdt_resource *r_mba;
struct rdt_domain *dom_mba;
@@ -374,11 +375,12 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
pr_warn_once("Failure to get domain for MBA update\n");
return;
}
+ hw_dom_mba = resctrl_to_arch_dom(dom_mba);

cur_bw = pmbm_data->prev_bw;
- user_bw = dom_mba->mbps_val[closid];
+ user_bw = hw_dom_mba->mbps_val[closid];
delta_bw = pmbm_data->delta_bw;
- cur_msr_val = dom_mba->ctrl_val[closid];
+ cur_msr_val = hw_dom_mba->ctrl_val[closid];

/*
* For Ctrl groups read data from child monitor groups.
@@ -415,7 +417,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)

cur_msr = hw_r_mba->msr_base + closid;
wrmsrl(cur_msr, delay_bw_map(new_msr_val, r_mba));
- dom_mba->ctrl_val[closid] = new_msr_val;
+ hw_dom_mba->ctrl_val[closid] = new_msr_val;

/*
* Delta values are updated dynamically package wise for each
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index f3106dfc4da6..50e1fd13f04a 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -911,7 +911,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
list_for_each_entry(dom, &r->domains, list) {
if (sep)
seq_putc(seq, ';');
- ctrl = dom->ctrl_val;
+ ctrl = resctrl_to_arch_dom(dom)->ctrl_val;
sw_shareable = 0;
exclusive = 0;
seq_printf(seq, "%d=", dom->id);
@@ -1175,7 +1175,7 @@ static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d
}

/* Check for overlap with other resource groups */
- ctrl = d->ctrl_val;
+ ctrl = resctrl_to_arch_dom(d)->ctrl_val;
for (i = 0; i < closids_supported(); i++, ctrl++) {
ctrl_b = *ctrl;
mode = rdtgroup_mode_by_closid(i);
@@ -1244,6 +1244,7 @@ bool rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
*/
static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
{
+ struct rdt_hw_domain *hw_dom;
int closid = rdtgrp->closid;
struct rdt_resource *r;
bool has_cache = false;
@@ -1254,7 +1255,8 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
continue;
has_cache = true;
list_for_each_entry(d, &r->domains, list) {
- if (rdtgroup_cbm_overlaps(r, d, d->ctrl_val[closid],
+ hw_dom = resctrl_to_arch_dom(d);
+ if (rdtgroup_cbm_overlaps(r, d, hw_dom->ctrl_val[closid],
rdtgrp->closid, false)) {
rdt_last_cmd_puts("Schemata overlaps\n");
return false;
@@ -1386,6 +1388,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
static int rdtgroup_size_show(struct kernfs_open_file *of,
struct seq_file *s, void *v)
{
+ struct rdt_hw_domain *hw_dom;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
struct rdt_domain *d;
@@ -1420,14 +1423,15 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
sep = false;
seq_printf(s, "%*s:", max_name_width, r->name);
list_for_each_entry(d, &r->domains, list) {
+ hw_dom = resctrl_to_arch_dom(d);
if (sep)
seq_putc(s, ';');
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
size = 0;
} else {
ctrl = (!is_mba_sc(r) ?
- d->ctrl_val[rdtgrp->closid] :
- d->mbps_val[rdtgrp->closid]);
+ hw_dom->ctrl_val[rdtgrp->closid] :
+ hw_dom->mbps_val[rdtgrp->closid]);
if (r->rid == RDT_RESOURCE_MBA)
size = ctrl;
else
@@ -1880,6 +1884,7 @@ static int set_cache_qos_cfg(int level, bool enable)
static int set_mba_sc(bool mba_sc)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].resctrl;
+ struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;

if (!is_mbm_enabled() || !is_mba_linear() ||
@@ -1887,8 +1892,10 @@ static int set_mba_sc(bool mba_sc)
return -EINVAL;

r->membw.mba_sc = mba_sc;
- list_for_each_entry(d, &r->domains, list)
- setup_default_ctrlval(r, d->ctrl_val, d->mbps_val);
+ list_for_each_entry(d, &r->domains, list) {
+ hw_dom = resctrl_to_arch_dom(d);
+ setup_default_ctrlval(r, hw_dom->ctrl_val, hw_dom->mbps_val);
+ }

return 0;
}
@@ -2207,6 +2214,7 @@ static int rdt_init_fs_context(struct fs_context *fc)

static int reset_all_ctrls(struct rdt_resource *r)
{
+ struct rdt_hw_domain *hw_dom;
struct msr_param msr_param;
cpumask_var_t cpu_mask;
struct rdt_domain *d;
@@ -2225,10 +2233,11 @@ static int reset_all_ctrls(struct rdt_resource *r)
* from each domain to update the MSRs below.
*/
list_for_each_entry(d, &r->domains, list) {
+ hw_dom = resctrl_to_arch_dom(d);
cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);

for (i = 0; i < r->num_closid; i++)
- d->ctrl_val[i] = r->default_ctrl;
+ hw_dom->ctrl_val[i] = r->default_ctrl;
}
cpu = get_cpu();
/* Update CBM on this cpu if it's in cpu_mask. */
@@ -2616,7 +2625,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct rdt_resource *r,
d->have_new_ctrl = false;
d->new_ctrl = r->cache.shareable_bits;
used_b = r->cache.shareable_bits;
- ctrl = d->ctrl_val;
+ ctrl = resctrl_to_arch_dom(d)->ctrl_val;
for (i = 0; i < closids_supported(); i++, ctrl++) {
if (closid_allocated(i) && i != closid) {
mode = rdtgroup_mode_by_closid(i);
@@ -2633,7 +2642,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct rdt_resource *r,
* with an exclusive group.
*/
if (d_cdp)
- peer_ctl = d_cdp->ctrl_val[i];
+ peer_ctl = resctrl_to_arch_dom(d_cdp)->ctrl_val[i];
else
peer_ctl = 0;
used_b |= *ctrl | peer_ctl;
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index a8a499c6644b..dbb1a31814a8 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -15,7 +15,40 @@ int proc_resctrl_show(struct seq_file *m,

#endif

-struct rdt_domain;
+/**
+ * struct rdt_domain - group of cpus sharing an RDT resource
+ * @list: all instances of this resource
+ * @id: unique id for this instance
+ * @cpu_mask: which cpus share this resource
+ * @new_ctrl: new ctrl value to be loaded
+ * @have_new_ctrl: did user provide new_ctrl for this domain
+ * @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
+ * @mbm_total: saved state for MBM total bandwidth
+ * @mbm_local: saved state for MBM local bandwidth
+ * @mbm_over: worker to periodically read MBM h/w counters
+ * @cqm_limbo: worker to periodically read CQM h/w counters
+ * @mbm_work_cpu: worker cpu for MBM h/w counters
+ * @cqm_work_cpu: worker cpu for CQM h/w counters
+ * @plr: pseudo-locked region (if any) associated with domain
+ */
+struct rdt_domain {
+ struct list_head list;
+ int id;
+ struct cpumask cpu_mask;
+
+ u32 new_ctrl;
+ bool have_new_ctrl;
+
+ unsigned long *rmid_busy_llc;
+ struct mbm_state *mbm_total;
+ struct mbm_state *mbm_local;
+ struct delayed_work mbm_over;
+ struct delayed_work cqm_limbo;
+ int mbm_work_cpu;
+ int cqm_work_cpu;
+
+ struct pseudo_lock_region *plr;
+};

/**
* struct resctrl_cache - Cache allocation related data
--
2.24.1

2020-04-15 21:50:28

by Reinette Chatre

Subject: Re: [RFC PATCH v2 0/2] x86/resctrl: Start abstraction for a second arch

Hi James,

On 2/14/2020, James Morse wrote:
> Hi folks,
>
> These two patches are the tip of the MPAM iceberg.
>
> Arm have some CPU support for dividing caches into portions, and
> applying bandwidth limits at various points in the SoC. The collective term
> for these features is MPAM: Memory System Resource Partitioning and Monitoring.
>
> MPAM is similar enough to Intel RDT that it should use the de facto Linux
> interface: resctrl. This filesystem currently lives under arch/x86, and is
> tightly coupled to the architecture.
> Ultimately, my plan is to split the existing resctrl code up to have an
> arch<->fs abstraction, then move all the bits out to fs/resctrl. From there
> MPAM can be wired up.
>
> These two patches are step one: Split the two structs that resctrl uses
> to have an arch<->fs split. These sit on top of the cleanup posted here:
> lore.kernel.org/r/[email protected]
>
> I'm after strong opinions like "No! struct mbm_state is obviously arch
> specific.". Making the hardware configuration belong to the arch code
> instead of resctrl is so that it can be scaled on arm64, where MPAM
> allows terrifyingly large portion bitmaps for the caches.
>
>
>
> Last time these were posted, the request was for the spec, and to see
> the whole fully assembled iceberg.
>
> The spec is here:
> https://static.docs.arm.com/ddi0598/ab/DDI0598A_b_MPAM_supp_armv8a.pdf
>
> For a slightly dated view of the whole tree:
> 1. Don peril sensitive sunglasses
> 2. https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/snapshot/feb
>
> The tree is generally RFC-quality. It gets more ragged once you get out of
> the x86 code. I anticipate all the arm64 code being rewritten before it's
> considered for merging.
>
> (I haven't reposted the CDP origami as before, as I think that series
> will be clearer if I re-order the patches ... it may even be shorter)
>
>
> Does it all work? Almost. Monitor groups are proving to be a problem; I
> can't see a way of getting these working without a user-visible change of
> behaviour.
> MPAM's counters aren't 1:1 with RMIDs, so supporting MBM_* on
> any likely hardware will have to be via something other than resctrl.
>

Before jumping to the details within the patches of this work ...

Could you please summarize the salient points from the spec, pointing
readers to the spec for more information? A start would be how resctrl
is expected to support MPAM. Could you highlight where the resctrl
interface or its assumptions currently fall short? How is the planned MPAM
integration addressing these shortcomings?

Some higher level questions I have after scanning the spec and patches are:

* The spec contains many details of how MPAM supports virtualization. Is
it expected that all of this would be supported with resctrl? For
example, while some registers may be abstracted, it seems some interface
may be needed to configure the virt to phy PARTID mappings. Information
about how resctrl is envisioned to support MPAM's virtualization would
help a lot.

* Looking at the commits (1385052cce87a8aed5dc0e96967cedd9e74a17e0 -
"x86/resctrl: Group staged configuration into a separate struct") I
found mention of a change in the schemata. Highlighting any planned
resctrl interface changes would be very helpful.

* Apart from actual interface changes, highlighting planned behavior
changes and motivation for them would also be helpful … for example
force enabling of CDP on all cache levels is a red flag to me.

* I am curious about how the configurability of MPAM will be handled. It
seems as though MPAM is highly configurable, how is this expected to be
handled in resctrl? For example, this message and KNOWN_ISSUES among the
patches mentions an ABI issue that RMID is independent from CLOSID in
RDT but PMG (like RMID) is dependent on PARTID (like CLOSID) in MPAM.
There is a MATCH_PARTID configuration option in MPAM that makes PMG not
depend on PARTID and thus seem to bring closer to RDT. I am surely not
indicating that MPAM should be made to behave like RDT but it does seem
that MPAM is very configurable. Is it the intention to support all ways
in which MPAM can be used and if so is the plan for resctrl so support
making these configuration changes and then support them? For example,
would you want resctrl to support all variations where MATCH_PARTID ==
[0|1] and MATCH_PMG == [0|1]? My intention here is not to delve into
these details in particular, instead I hope to use it as an example of
what I mean when curious about how (if at all) resctrl is envisioned to
support the configurability of MPAM.

It seems to me that MPAM may need more than what is currently available
from resctrl, but it is hard for me to digest a 276-page spec and a
150-patch series to fully understand what needs to be supported and how to do
so. I look forward to learning more about the goals of what needs to be
supported and your vision for resctrl to do so.

Thank you

Reinette

2020-04-15 23:58:59

by James Morse

Subject: Re: [RFC PATCH v2 0/2] x86/resctrl: Start abstraction for a second arch

Hi Reinette,

Thanks for taking a look, all great questions!

On 14/04/2020 19:56, Reinette Chatre wrote:
> On 12/31/1969 4:00 PM, James Morse wrote:
>> These two patches are the tip of the MPAM iceberg.
>>
>> Arm have some CPU support for dividing caches into portions, and
>> applying bandwidth limits at various points in the SoC. The collective term
>> for these features is MPAM: Memory Partitioning and Monitoring.
>>
>> MPAM is similar enough to Intel RDT, that it should use the defacto linux
>> interface: resctrl. This filesystem currently lives under arch/x86, and is
>> tightly coupled to the architecture.
>> Ultimately, my plan is to split the existing resctrl code up to have an
>> arch<->fs abstraction, then move all the bits out to fs/resctrl. From there
>> MPAM can be wired up.
>>
>> These two patches are step one: Split the two structs that resctrl uses
>> to have an arch<->fs split. These sit on top of the cleanup posted here:
>> lore.kernel.org/r/[email protected]
>>
>> I'm after strong opinions like "No! struct mbm_state is obviously arch
>> specific.". Making the hardware configuration belong to the arch code
>> instead of resctrl is so that it can be scaled on arm64, where MPAM
>> allows terrifyingly large portion bitmaps for the caches.
>>
>>
>>
>> Last time these were posted, the request was for the spec, and to see
>> the whole fully assembled iceberg.
>>
>> The spec is here:
>> https://static.docs.arm.com/ddi0598/ab/DDI0598A_b_MPAM_supp_armv8a.pdf
>>
>> For a slightly dated view of the whole tree:
>> 1. Don peril sensitive sunglasses
>> 2. https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/snapshot/feb
>>
>> The tree is generally RFC-quality. It gets more ragged once you get out of
>> the x86 code. I anticipate all the arm64 code being rewritten before its
>> considered for merging.
>>
>> (I haven't reposted the CDP origami as before, as I think that series
>> will be clearer if I re-order the patches ... it may even be shorter)
>>
>>
>> Does it all work? Almost. Monitor groups are proving to be a problem, I
>> can't see a way of getting these working without a user-visible change of
>> behaviour.
>> MPAMs counters aren't 1:1 with RMIDs, so supporting MBM_* on
>> any likely hardware will have to be via something other than resctrl.


> Before jumping to the details within the patches of this work ...
>
> Could you please summarize the salient points from the spec, pointing
> readers to the spec for more information?

It's probably the differences with RDT that are important, so I'll focus on those:

MPAM has a CPU interface that controls how the CPU tags traffic, and controls that are
scattered through the system and accessible via MMIO.

Arm doesn't specify a cache topology, (hence we have funny terms like PoU), so likewise
the controls, and where they are in the address space, are totally up to the implementer.
Linux would have to discover all this from an ACPI-table/DT-node.

MPAM has ~six different types of control, all of which are optional. Cache Portion bitmaps
correspond closely to CAT's behaviour. MBM's behaviour can be approximated with either the
bandwidth bitmap or the bandwidth max/min controls.

MPAM has partids, which correspond perfectly with closid. It has performance monitors,
but these behave much more like a PMU than RDT's monitors. This is where it gets messy:

RDT has RMIDs as an independent value used for monitoring. There is one monitor per RMID.
MPAM has 'PMG', which subdivide the partid space. There are a number of PMU-like monitors
that can be configured to count for a partid, or a partid-and-pmg. It is very likely that
the number of these counters is much smaller than the number of partid, or partid+pmg.

PMG does not correspond to RMID, even though they look compatible from a distance.
(I can't see a way of fixing this in a compatible way. Supporting it via a different
mechanism is my best bet).
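
To make that concrete, here is a toy sketch; every name in it is invented
for illustration. RDT can index a free-running counter directly by RMID,
whereas MPAM has to claim one of a small pool of monitors and program a
filter into it before anything gets counted:

/* RDT: one free-running counter per RMID, read whenever you like. */
u64 rdt_read_counter(u32 rmid);

/* MPAM: a monitor must first be claimed and given a filter. */
struct mpam_mon_filter {
	u16	partid;
	u8	pmg;
	bool	match_pmg;	/* count this partid+pmg, or the partid alone */
};

/* May fail: there are typically far fewer monitors than partid+pmg values. */
int mpam_claim_monitor(struct mpam_mon_filter *filter);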

MPAM has something like the CPU interface for tagging traffic in Arm's irqchip and IOMMU.
MPAM has virtualisation support in the CPU, for mapping virtual partids to physical partids.


> A start would be how resctrl is expected to support MPAM.

Exactly as it works on an Intel Xeon today!
User-space should not be able to tell the difference.

I intend to get as much of MPAM going with this constraint as possible. We can then
discuss what needs changing/extending to allow other features to be used. (and what those
features are good for).

Practically this means that systems with MPAM can only use resctrl if they look a bit like
a Xeon. (i.e. bitmap controls on L2 or L3, bandwidth controls on-or-behind L3)
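
i.e. the user-visible files keep today's format. The schemata file on such
a system would look just like it does on a Xeon (values illustrative):

  # cat /sys/fs/resctrl/schemata
  L3:0=7fff;1=7fff
  MB:0=100;1=100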


> Could you highlight where does resctrl interface or assumptions currently fall short?
> How is the planned MPAM integration addressing these shortcomings?

(shortcomings -> Design decisions that make sense for resctrl to support RDT.)

The resctrl monitor code implicitly assumes 'this' CPU all over the place. For MPAM any
'this' CPU may be a set of CPUs, which makes this code tricky to work with.

tangent: (what! why? ... More than one of Arm's partners has reported that part of
their system is made up of slices, which act as one component when integrated together. I
assume this helps scale the design, or improve the yield. There is one set of MPAM
controls per slice, which means each of the controls needs to be configured the same to
give one behaviour for the component. The MPAM controls may only be accessible from the
local CPUs if this slice thing contains CPUs... Arm's software model does exactly this
with its L3).


resctrl doesn't have the concept of having to allocate a counter for an RMID, because for
RDT these things are 1:1. Because the bandwidth counters are left running, and accessible
via the filesystem, MPAM cannot let resctrl use the bandwidth counters unless there are as
many PMU-like monitors as there are partid+pmg, which isn't likely.

My plan here is to add a 'resctrl_pmu' to perf. (I know there is some history with perf
here). This would allow perf to read the values that are already exposed via resctrl.
For MPAM this should give us the schedule-in/out hooks we need to allocate the PMU-like
counter when it's actually being used.

I'd expect this to be done in the core /fs/ code, so that it uses a common interface to
resctrl, and works in exactly the same way with RDT. (no additional arch code means it
works on the next architecture to support resctrl too)
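
A minimal sketch of what registering such a PMU could look like, using the
perf API as it exists today. Everything named resctrl_pmu_* is invented for
illustration; the point is that perf's add/del callbacks run when an event
is scheduled in/out, which is exactly where MPAM could claim and release
one of its scarce monitors:

#include <linux/init.h>
#include <linux/perf_event.h>

static int resctrl_pmu_event_init(struct perf_event *event)
{
	if (event->attr.type != event->pmu->type)
		return -ENOENT;
	return 0;
}

/* Event scheduled in: the arch could allocate a PMU-like monitor here. */
static int resctrl_pmu_add(struct perf_event *event, int flags)
{
	return 0;
}

/* Event scheduled out: hand the monitor back for someone else to use. */
static void resctrl_pmu_del(struct perf_event *event, int flags)
{
}

static void resctrl_pmu_start(struct perf_event *event, int flags)
{
}

static void resctrl_pmu_stop(struct perf_event *event, int flags)
{
}

/* Read the same value resctrl already exposes via the filesystem. */
static void resctrl_pmu_read(struct perf_event *event)
{
}

static struct pmu resctrl_pmu = {
	.task_ctx_nr	= perf_sw_context,
	.event_init	= resctrl_pmu_event_init,
	.add		= resctrl_pmu_add,
	.del		= resctrl_pmu_del,
	.start		= resctrl_pmu_start,
	.stop		= resctrl_pmu_stop,
	.read		= resctrl_pmu_read,
};

static int __init resctrl_pmu_init(void)
{
	return perf_pmu_register(&resctrl_pmu, "resctrl", -1);
}
device_initcall(resctrl_pmu_init);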


The rdt_resources_all array, and these alloc_enabled, alloc_capable flags are a really
neat trick to implement CDP on RDT. But this looks very much like an array of every SoC
that has been built, which is tricky if the topology is discovered at boot. It also leaves
properties of the user visible schemata file in the hands of the arch code.
I want it to be very difficult for Arm to invent new schemata without discussing whether
the interface is abstract enough to support on other architectures.
A good chunk of the MPAM tree is moving all the user-visible CDP behaviour into the code
that moves to /fs. This lets the arch code only deal with L2 or L3, and CDP becomes a
property of the configuration.

(this is probably one of the stranger looking changes. Secondary motivations are to avoid
emulating CDP under arch/arm64, and avoiding the nightmare SoC topology where there are
more MPAM controls to be configured, because of this slicing, in L2 than there are in
L2CODE....)
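
For reference, this is the shape of the arch<->fs split these two patches
start, with fields abridged and member names approximate:

/* include/linux/resctrl.h: the properties the fs code is allowed to see. */
struct rdt_resource {
	int	rid;
	bool	alloc_capable;
	char	*name;
	/* ... */
};

/* arch/x86 internal.h: the MSR details stay private to the arch. */
struct rdt_hw_resource {
	struct rdt_resource	resctrl;
	unsigned int		msr_base;
	/* ... */
};

static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r)
{
	return container_of(r, struct rdt_hw_resource, resctrl);
}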


> Some higher level questions I have after scanning the spec and patches are:
>
> * The spec contains many details of how MPAM supports virtualization. Is
> it expected that all of this would be supported with resctrl?

For now, that is totally out of scope.


> For
> example, while some registers may be abstracted it seems some interface
> may be needed to configure the virt to phy PARTID mappings. Information
> about how resctrl is envisioned to support MPAM's virtualization would
> help a lot.

As you asked!
The principle here is not to change the user-visible bits of resctrl at all.

We only need to support KVM, and the definition of KVM's virtual machine (memory layout,
number of CPUs etc) comes from its user-space virtual-machine-manager, which is typically
Qemu.
KVM only needs to support the CPU-interface parts of MPAM. The controls, configured via
MMIO, could be emulated by the VMM. (this lets it give an accurate model of the machine it
is emulating)
The VMM would need to allocate a number of control groups via resctrl, then pass their
names (or preferably closid) to KVM to set up the mapping. The MPAM hardware performs the
mapping when the guest uses a partid.
When the guest tries to configure the controls, this would trap back to the VMM, as it
does for any emulated device. The VMM can then make the corresponding change in resctrl.

This lets the VMM emulate MPAM controls that the host doesn't have, or control MPAM for
the guest using a different user-space interface.

This has only been discussed vaguely with the Qemu folk; it's all subject to change.

I would like to be able to allocate closid in the kernel for KVM guests, as this is a step
towards supporting other in-kernel users. (in particular, the IOMMU).
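
Purely to illustrate the direction (and with the caveat above that this is
all subject to change: no such interface exists today), the mapping the VMM
hands to KVM would be something of this shape:

/* Hypothetical: one entry per partid the guest is told it can use. */
struct guest_partid_map {
	u16	virt_partid;	/* the partid the guest thinks it is using */
	u32	host_closid;	/* closid of the host resctrl group backing it */
};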


> * Looking at the commits (1385052cce87a8aed5dc0e96967cedd9e74a17e0 -
> "x86/resctrl: Group staged configuration into a separate struct") I
> found mention of a change in the schemata. Highlighting any planned
> resctrl interface changes would be very helpful.

No changes to the schemata! That would cause variation between RDT and MPAM, and fragment
user-space software.

The commit message is certainly terrible. That patch is part of the chunk that folds the
L3CODE, L3DATA and L3 resources together so that the arch code is only dealing with L3,
(or L2 or MBM). This is so that the CODE/DATA schema behaviour lives in the core /fs/
code, so that it's much harder for it to behave differently on systems with MPAM.


> * Apart from actual interface changes, highlighting planned behavior
> changes and motivation for them would also be helpful … for example
> force enabling of CDP on all cache levels is a red flag to me.

Interesting. This is the change that makes the CDP on/off global, instead of per cache.
It's still controlled by user-space. (so nothing is forced).
Do you have systems that support CAT at L3 and L2, but only CDP at L3, not L2?
(I was under the impression the L2 stuff was all Atom, and the L3+MBM was all Xeon).

MPAM's equivalent to CDP is just part of the CPU interface. It's always on.
To support 'CDP on L2 but not L3', (neither of which exist), we'd need to have extra code:
"was I asked to pretend CDP is enabled on this cache".

As CDP affects the way you allocate closid, (that odd/even thing), which is global, it
makes sense that this is either on or off. (doing this let me support CDP without the arch
code doing anything special!)

Existence of hardware that does this would obviously change this.


> * I am curious about how the configurability of MPAM will be handled. It
> seems as though MPAM is highly configurable, how is this expected to be
> handled in resctrl?

ACPI-tables/DT to describe the topology, and then code to 'pick' which MPAM features map
best onto resctrl.
It's likely that there will be systems that have MPAM, but can't use resctrl without
user-visible changes. Discussing user-visible changes can happen at that point.


> For example, this message and KNOWN_ISSUES among the
> patches mentions an ABI issue that RMID is independent from CLOSID in
> RDT but PMG (like RMID) is dependent on PARTID (like CLOSID) in MPAM.

Indeed. Because there is nothing quite like RMID, there is nothing I can expose as
num_rmid. Whatever value I expose there will cause user-space to detect some breakage.
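
(num_rmid being the info file user-space already reads; the path is from
the resctrl documentation, the value illustrative:

  # cat /sys/fs/resctrl/info/L3_MON/num_rmid
  256

Whatever number appears there promises monitors that MPAM can't guarantee.)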


> There is a MATCH_PARTID configuration option in MPAM that makes PMG not
> depend on PARTID and thus seem to bring closer to RDT.

(crumbs, you've dug into this in some detail!)

... Welcome to the wonderful world of Arm specifications:

On page 234 of [0], it describes the CSU controls:
| If MATCH_PMG == 1 and MATCH_PARTID == 0, it is CONSTRAINED-UNPREDICTABLE whether the
| monitor instance:
| * Measures the storage used with matching PMG and with any PARTID.
| * Measures no storage usage, that is, MSMON_CSU.VALUE is zero.
| * Measures the storage used with matching PMG and PARTID, that is, treats
| MATCH_PARTID as == 1.

Whenever you see something all-caps (CONSTRAINED-UNPREDICTABLE in this case), it means
this is pretty useless to general purpose software. We would need per-platform quirks for
which of these behaviours the implementer built.
"Measures the storage used with matching PMG and with any PARTID." is the one we wanted,
but the spec people made it optional, and it's not discoverable.

Curiously, the text describing the bandwidth controls doesn't describe this as
unpredictable... I'll check that isn't an oversight. Being able to do this with the
bandwidth controls doesn't help, as we can't support those unless we have enough monitors
for the free-running files in resctrl.

Even if we could rely on this, we'd still have the problem that MPAM expects PMG to extend
the partid space. There may be very few PMG because they are expected to be used
per-partid. If there are fewer PMG than partid, we are straight back to square-one: 'no
monitor support'.


> I am surely not
> indicating that MPAM should be made to behave like RDT but it does seem
> that MPAM is very configurable.

> Is it the intention to support all ways in which MPAM can be used and

Today, no. Only those that map in a usable way to what resctrl already exposes to user-space.


> if so is the plan for resctrl so support
> making these configuration changes and then support them?

If someone comes up with a use-case that benefits from one of the extra controls, we
can discuss how that could be abstracted to work on multiple architectures.


> For example,
> would you want resctrl to support all variations where MATCH_PARTID ==
> [0|1] and MATCH_PMG == [0|1]? My intention here is not to delve into
> these details in particular, instead I hope to use it as an example of
> what I mean when curious about how (if at all) resctrl is envisioned to
> support the configurability of MPAM.

I think not-at-all is the answer here. Resctrl's ABI is a ship that has sailed. Where at
all possible I intend to map what MPAM has to what resctrl exposes.

The num_rmid issue has me painted into a corner. No value there reflects the behaviour.
The only option may be not to expose any of the counters via the resctrl filesystem,
instead making them accessible via perf.
I think that would only fly if I can make it work on x86 too.


> It seems to me that MPAM may need more than what is currently available
> from resctrl

Ultimately yes, but the aim here isn't to support all of MPAM.
It's just to support what maps nicely. We can then discuss what to do next.


> but it is hard for me to digest a 276 page spec and 150
> patch series to fully understand what needs to be support and how to do
> so.

Yeah! Me too. But this is what Fenghua asked to see:
https://lore.kernel.org/lkml/[email protected]/

You'll be glad to know I have no intention of posting all that in one go (obvious, but
worth saying).


> I look forward to learning more about the goals of what needs to be
> supported and your vision for resctrl to do so.

Thanks for going through this with such a level of detail. I'm sure we agree on the 'no
user-visible changes' aspect, so the next piece is where the split between core code and
arch code should be.


Thanks,

James

[0] https://static.docs.arm.com/ddi0598/ba/DDI0598B_a_MPAM_supp_armv8ba.pdf

2020-04-16 01:00:03

by Reinette Chatre

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/2] x86/resctrl: Start abstraction for a second arch

Hi James,

Thank you very much for your thorough response. I do have a lot to
digest from it but would like to at least respond promptly to a question
you included ...

On 4/15/2020 5:59 AM, James Morse wrote:
> On 14/04/2020 19:56, Reinette Chatre wrote:
>> On 2/14/2020, James Morse wrote:

...

>> * Apart from actual interface changes, highlighting planned behavior
>> changes and motivation for them would also be helpful … for example
>> force enabling of CDP on all cache levels is a red flag to me.
>
> Interesting. This is the change that makes the CDP on/off global, instead of per cache.

This is the one I referred to, and it is a significant change.

> Its still controlled by user-space. (so nothing is forced).

Right, controlled with the mount option but the behavior is being
changed to apply to both L2 and L3, even if the user requests just one of
the two.

Please note that in the documentation it is currently explicitly stated
that: "L2 and L3 CDP are controlled separately"

> Do you have systems that support CAT at L3 and L2, but only CDP at L3, not L2?
> (I was under the impression the L2 stuff was all Atom, and the L3+MBM was all Xeon).

Things are not as clear cut unfortunately. There is a new Atom system
that has a server uncore, thus inheriting some RDT features that have
previously only been seen on servers. L2 CAT/CDP is also moving to
servers in future server products.

You can find more details about RDT features in upcoming systems in
Chapter 9 of
https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf

>
> MPAM's equivalent to CDP is just part of the CPU interface. Its always on.
> To support 'CDP on L2 but not L3', (neither of which exist), we'd need to have extra code:
> "was I asked to pretend CDP is enabled on this cache".
>
> As CDP affects the way you allocate closid, (that odd/even thing), which is global, it

The odd/even is just for the CDP enabled resource, not global. It is
thus possible, for example, for the L3, L2CODE, and L2DATA resources to
be enabled. The odd/even is configured by the multiplier cbm_idx_mult
set in the resource configuration and used in cbm_idx(). Perhaps you
mean the CLOSID is global? By enabling these together it would reduce
the number of CLOSIDs that could be used by L3 in this example.
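
For reference, this is the helper I mean; L3DATA uses cbm_idx_mult = 2 with
cbm_idx_offset = 0, and L3CODE uses cbm_idx_mult = 2 with cbm_idx_offset = 1:

static unsigned int cbm_idx(struct rdt_resource *r, unsigned int closid)
{
	return closid * r->cache.cbm_idx_mult + r->cache.cbm_idx_offset;
}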

> makes sense that this is either on or off. (doing this let me support CDP without the arch
> code doing anything special!)
>
> Existence of hardware that does this would obviously change this.
>

Yes, there are systems that support L2 CAT/CDP and L3 CAT/CDP. CDP is
controlled separately on the different cache levels.

>> It seems to me that MPAM may need more than what is currently available
>> from resctrl
>
> Ultimately yes, but the aim here isn't to support all of MPAM.
> Its just to support what maps nicely. We can then discuss what to do next.

Thank you for stating this. This is significant and was not clear to me
initially.

Reinette

2020-04-17 23:11:53

by Reinette Chatre

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/2] x86/resctrl: Start abstraction for a second arch

Hi James,

On 4/15/2020 5:59 AM, James Morse wrote:
> On 14/04/2020 19:56, Reinette Chatre wrote:
>> On 2/14/2020, James Morse wrote:
>>> These two patches are the tip of the MPAM iceberg.
>>>
>>> Arm have some CPU support for dividing caches into portions, and
>>> applying bandwidth limits at various points in the SoC. The collective term
>>> for these features is MPAM: Memory Partitioning and Monitoring.
>>>
>>> MPAM is similar enough to Intel RDT, that it should use the defacto linux
>>> interface: resctrl. This filesystem currently lives under arch/x86, and is
>>> tightly coupled to the architecture.
>>> Ultimately, my plan is to split the existing resctrl code up to have an
>>> arch<->fs abstraction, then move all the bits out to fs/resctrl. From there
>>> MPAM can be wired up.
>>>
>>> These two patches are step one: Split the two structs that resctrl uses
>>> to have an arch<->fs split. These sit on top of the cleanup posted here:
>>> lore.kernel.org/r/[email protected]
>>>
>>> I'm after strong opinions like "No! struct mbm_state is obviously arch
>>> specific.". Making the hardware configuration belong to the arch code
>>> instead of resctrl is so that it can be scaled on arm64, where MPAM
>>> allows terrifyingly large portion bitmaps for the caches.
>>>
>>>
>>>
>>> Last time these were posted, the request was for the spec, and to see
>>> the whole fully assembled iceberg.
>>>
>>> The spec is here:
>>> https://static.docs.arm.com/ddi0598/ab/DDI0598A_b_MPAM_supp_armv8a.pdf
>>>
>>> For a slightly dated view of the whole tree:
>>> 1. Don peril sensitive sunglasses
>>> 2. https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/snapshot/feb
>>>
>>> The tree is generally RFC-quality. It gets more ragged once you get out of
>>> the x86 code. I anticipate all the arm64 code being rewritten before its
>>> considered for merging.
>>>
>>> (I haven't reposted the CDP origami as before, as I think that series
>>> will be clearer if I re-order the patches ... it may even be shorter)
>>>
>>>
>>> Does it all work? Almost. Monitor groups are proving to be a problem, I
>>> can't see a way of getting these working without a user-visible change of
>>> behaviour.
>>> MPAMs counters aren't 1:1 with RMIDs, so supporting MBM_* on
>>> any likely hardware will have to be via something other than resctrl.
>
>
>> Before jumping to the details within the patches of this work ...
>>
>> Could you please summarize the salient points from the spec, pointing
>> readers to the spec for more information?
>
> Its probably the differences with RDT that are important, so I'll focus on those:
>
> MPAM has has a CPU interface that controls how the CPU tags traffic, and controls that are
> scattered through the system, and are accessible via MMIO.

(... so many more controls than in RDT or PQoS.)

>
> Arm doesn't specify a cache topology, (hence we have funny terms like PoU), so likewise
> the controls, and where they are in the address space, are totally up to the implementer.
> Linux would have to discover all this from an ACPI-table/DT-node.

ok, I see. I am just starting now to familiarize myself with how you were
able to align all these concepts between MPAM and RDT using the
mpam_device and mpam_component.

>
> MPAM has ~six different types of control, all of which are optional. Cache Portion bitmaps
> correspond closely to CAT's behaviour. MBM's behaviour can be approximated with either the
> bandwidth bitmap or the bandwidth max/min controls.

ok. Not all the partitioning types are quite clear to me
(proportional-stride partitioning specifically), but I can see that
there are similarities. (I am assuming you meant MBA)

>
> MPAM has partid's, which correspond perfectly with closid.

I am not sure about "perfectly".

If I understand correctly at least there is the caveat that MPAM is like
RDT with CDP always enabled, without reducing the number of CLOSIDs?

But ...

The assumption of resctrl fs is that a task would belong to a single
resource group, which in turn represents a single CLOSID/PARTID(?). On
the other hand MPAM allows a task to have different PARTIDs for code and
data. It seems that perhaps MPAM will be supported in Linux by having
its PARTIDs behave as RDT with CDP enabled (the odd/even pairing) and
thus perhaps not as flexible as the architecture supports/intends?
(Although I see you later mention that the goal is _not_ for MPAM to
emulate CDP so it is not clear to me how this would be supported).

Would user looking at schemata on resctrl on Arm thus always see the
CODE/DATA prefix on the cache resources? I am trying to create a high
level idea of what a user's interaction with resctrl fs would look like
on an Arm system and how that would be translated on a high level to
what it means for the hardware (allocating new PARTIDs, setting of
PARTIDs, etc.). (small steps)

> It has performance monitors,
> but these behave much more like a PMU than RDT's monitors. This is where it gets messy:
>
> RDT has RMIDs as an independent value used for monitoring. There is one monitor per RMID.

Currently up to three monitors per RMID if considering a monitor to be
the same as a counter?

> MPAM has 'PMG', that subdivide the partid space. There are a number of PMU-like monitors
> that can be configured to count for a partid, or a partid-and-pmg. It is very likely that
> the number of these counters is much smaller than the number of partid, or partid+pmg.

If I understand correctly, the PARTID and PMG are global because that is
the MPAM information that accompanies every memory request as it
traverses the system.

In comparison the resource monitors seem to be entirely contained in the
MSC to which they belong ... and an MSC is allowed to have up to 2^16 of
each type.

It seems that it may at least theoretically be possible to have many
more counters than PARTID, or PARTID+PMG.

I see your point about them being more similar to PMU.

>
> PMG does not correspond to RMID, even though they look compatible from a distance.
> (I can't see a way of fixing this is a compatible way. Supporting it via a different
> mechanism is my best bet).

I've been trying different ideas also but cannot find a good fit in
resctrl either.

(I seem to have trouble mapping both the partitioning and monitoring MPAM
features to resctrl ... apologies as it is taking me some time to catch
up with where you are at this time)

>
> MPAM has something like the CPU interface for tagging traffic in Arm's irqchip and IOMMU.

I cannot find mention of this specifically in the new spec. Do you
perhaps have other documentation about this?

> MPAM has virtualisation support in the CPU, for mapping virtual partids to physical partids.
>
>
>> A start would be how resctrl is expected to support MPAM.
>
> Exactly as it works on an Intel Xeon today!
> User-space should not be able to tell the difference.
>
> I intend to get as much of MPAM going with this constraint as possible. We can then
> discuss what needs changing/extending to allow other features to be used. (and what those
> features are good for).
>
> Practically this means that systems with MPAM can only use resctrl if they look a bit like
> a Xeon. (i.e. bitmap controls on L2 or L3, bandwidth controls on-or-behind L3)
>

This sounds manageable.

>
>> Could you highlight where does resctrl interface or assumptions currently fall short?
>> How is the planned MPAM integration addressing these shortcomings?
>
> (shortcomings -> Design decisions that make sense for resctrl to support RDT.)
>
> The resctrl monitor code implicitly assumes 'this' CPU all over the place. For MPAM any
> 'this' CPU may be a set of CPUs, which make this code tricky to work with.
>
> tangent: (what! why? ... Reported by more than one of Arm's partners, is that part of
> their system is made up of slices, which act as one component when integrated together. I
> assume this helps scale the design, or improve the yield. There is one set of MPAM
> controls per slice, which means each of the controls needs to be configured the same to
> give one behaviour for the component. The MPAM controls may only be accessible from the
> local CPUs if this slice thing contains CPUs... Arm's software model does exactly this
> with its L3).

Apologies, the issue is not clear to me. I am familiar with the slicing
of L3; Intel systems do that also. Your concern that preceded this was
that the monitor code implicitly assumes "this" CPU. Could you please
elaborate more on this? The RMIDs have global scope within the package
(L3 cache domain instance), and resctrl keeps track of all CPUs associated
with it (rdt_domain->cpu_mask). The user queries each domain separately and
the counter value is read from any CPU associated with the particular
domain. As I see it, the monitor code thus tracks which CPUs are
associated with a particular cache instance and would interact with an
appropriate CPU depending on which data/counter the user is requesting.
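
Abridged, what the fs code does today when a user reads a domain's counter
(from mon_event_read()):

	smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);

... so the read can be serviced by any CPU in the domain's cpu_mask.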

> resctrl doesn't have the concept of having to allocate a counter for an RMID, because for
> RDT these things are 1:1. Because the bandwidth counters are left running, and accessible
> via the filesystem, MPAM cannot let resctrl use the bandwidth counters unless there are as
> many PMU-like monitors as there are partid+pmg, which isn't likely.
>
> My plan here is to add a 'resctrl_pmu' to perf. (I know there is some history with perf
> here). This would allow perf to read the values that are already exposed via resctrl.
> For MPAM this should give us the schedule-in/out hooks we need to allocate the PMU-like
> counter when its actually being used.
>
> I'd expect this to be done in the core /fs/ code, so that it uses a common interface to
> resctrl, and works in exactly the same way with RDT. (no additional arch code means it
> works on the next architecture to support resctrl too)

Are you saying that resctrl fs would provide an interface to the
"resctrl_pmu" addition to perf?

> The rdt_resources_all array, and these alloc_enabled, alloc_capable flags are a really
> neat trick to implement CDP on RDT. But this looks very much like an array of every SoC
> that has been built, which is tricky if the topology is discovered at boot. It also leaves
> properties of the user visible schemata file in the hands of the arch code.
> I want it to be, very difficult, for Arm to invent new schemata without discussing whether
> the interface is abstract enough to support on other architectures.
> A good chunk of the MPAM tree is moving all the user-visible CDP behaviour into the code
> that moves to /fs. This lets the arch code only deal with L2 or L3, and CDP becomes a
> property of the configuration.

I did see the introduction of the new schema list that points to the
resources; it appears to separate the user interface and the
architecture differences very well.

>
> (this is probably one of the stranger looking changes. Secondary motivations are to avoid
> emulating CDP under arch/arm64, and avoiding the nightmare SoC topology where there are
> more MPAM controls to be configured, because of this slicing, in L2 than there are in
> L2CODE....)

... oh ... so the goal is not to emulate CDP. (I need to understand how the two PARTIDs
of a task are configured ... I will keep digging through the spec and
the patches).

>
>
>> Some higher level questions I have after scanning the spec and patches are:
>>
>> * The spec contains many details of how MPAM supports virtualization. Is
>> it expected that all of this would be supported with resctrl?
>
> For now, that is totally out of scope.
>
>
>> For
>> example, while some registers may be abstracted it seems some interface
>> may be needed to configure the virt to phy PARTID mappings. Information
>> about how resctrl is envisioned to support MPAM's virtualization would
>> help a lot.
>
> As you asked!
> The principle here is not to change the user-visible bits of resctrl at all.
>
> We only need to support KVM, and the definition of KVM's virtual machine (memory layout,
> number of CPUs etc) comes from its user-space virtual-machine-manager, which is typically
> Qemu.
> KVM only needs to support the CPU-interface parts of MPAM. The controls, configured via
> MMIO could be emulated by the VMM. (this lets it give an accurate model of the machine it
> is emulating)
> The VMM would need to allocate a number of control groups via resctrl, then pass their
> names (or preferably closid) to KVM to set up the mapping. The MPAM hardware performs the
> mapping when the guest uses a partid.
> When the guest tries to configure the controls, this would trap back to the VMM, as it
> does for any emulated device. The VMM can then make the corresponding change in resctrl.
>
> This lets the VMM emulate MPAM controls that the host doesn't have, or control MPAM for
> the guest using a different user-space interface.
>
> This has only been discussed vaguely with the Qemu folk, its all subject to change.
>
> I would like to be able to allocate closid in the kernel for KVM guests, as this is a step
> towards supporting other in-kernel users. (in particular, the IOMMU).


Thank you very much for this insight into what is being considered.

>> I look forward to learning more about the goals of what needs to be
>> supported and your vision for resctrl to do so.
>
> Thanks for going through this with such a level of detail. I'm sure we agree on the 'no
> user-visible changes' aspect, so the next pieces is where the split between core code and
> arch code should be.

Agreeing on "no user-visible changes" really helps to guide this initial
work. The split between core and arch may be more flexible and may be
done on an as-needed basis?

Reinette