2023-03-20 17:32:23

by James Morse

Subject: [PATCH v3 00/19] x86/resctrl: monitored closid+rmid together, separate arch/fs locking

Hello!

The largest change since v2 is to avoid running work on nohz_full CPUs.
Otherwise changes since v2 are noted in each patch.

This series does two things: it changes resctrl to call resctrl_arch_rmid_read()
in a way that works for MPAM, and it separates the locking so that the arch code
and filesystem code don't have to share a mutex. I tried to split this as two
series, but these touch similar call sites, so it would create more work.

(What's MPAM? See the cover letter of the first series. [1])

On x86 the RMID is an independent number. MPAM's equivalent is PMG, but this
isn't an independent number - it extends the PARTID (same as CLOSID) space
with bits that aren't used to select the configuration. The monitors can
then be told to match specific PMG values, allowing monitor-groups to be
created.

But, MPAM expects the monitors to always monitor by PARTID. The
Cache-storage-utilisation counters can only work this way.
(In the MPAM spec not setting the MATCH_PARTID bit is made CONSTRAINED
UNPREDICTABLE - which is Arm's term to mean portable software can't rely on
this)

It gets worse, as some SoCs may have very few PMG bits. I've seen the
datasheet for one that has a single bit of PMG space.

To be usable, MPAM's counters always need the PARTID and the PMG.
For resctrl, this means always making the CLOSID available when the RMID
is used.

To ensure RMID are always unique, this series combines the CLOSID and RMID
into an index, and manages RMID based on that. For x86, the index and RMID
would always be the same.
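
As a rough illustration of the index idea (a userspace sketch only: the function names and the PMG_BITS constant are hypothetical, and the helpers the series actually adds may look different), an MPAM-style encoding could place the PMG bits below the PARTID, while an x86-style encoding would ignore the CLOSID:

```c
#include <stdint.h>

/* Assumed platform constant for this sketch: one bit of PMG space. */
#define PMG_BITS 1u
#define PMG_MASK ((1u << PMG_BITS) - 1)

/* MPAM-style: the PMG extends the PARTID, so the pair forms the index. */
static inline uint32_t mpam_idx_encode(uint32_t closid, uint32_t rmid)
{
	return (closid << PMG_BITS) | (rmid & PMG_MASK);
}

/* Inverse of mpam_idx_encode(): split an index back into the pair. */
static inline void mpam_idx_decode(uint32_t idx, uint32_t *closid,
				   uint32_t *rmid)
{
	*closid = idx >> PMG_BITS;
	*rmid = idx & PMG_MASK;
}

/* x86-style: the RMID is already unique, so the CLOSID is ignored. */
static inline uint32_t x86_idx_encode(uint32_t closid, uint32_t rmid)
{
	(void)closid;
	return rmid;
}
```

With one PMG bit, each CLOSID owns two consecutive index values, which matches the "two monitor groups per CLOSID" case described for such SoCs.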


Currently the architecture specific code in the cpuhp callbacks takes the
rdtgroup_mutex. This means the filesystem code would have to export this
lock, resulting in an ill-defined interface between the two, and the possibility
of cross-architecture lock-ordering headaches.

The second part of this series adds a domain_list_lock to protect writes to the
domain list, and protects the domain list with RCU - or cpus_read_lock().

RCU is used to allow lockless readers of the domain list. Today resctrl only has
one, rdt_bit_usage_show(). But to get MPAM's monitors working, it's very likely
they'll need to be plumbed up to perf. The uncore PMU driver would be a second
lockless reader of the domain list.

This series is based on v6.3-rc1, and can be retrieved from:
https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git mpam/monitors_and_locking/v3

Bugs welcome,


Thanks,

James


[1] https://lore.kernel.org/lkml/[email protected]/
[v1] https://lore.kernel.org/all/[email protected]/
[v2] https://lore.kernel.org/lkml/[email protected]/

James Morse (19):
x86/resctrl: Track the closid with the rmid
x86/resctrl: Access per-rmid structures by index
x86/resctrl: Create helper for RMID allocation and mondata dir
creation
x86/resctrl: Move rmid allocation out of mkdir_rdt_prepare()
x86/resctrl: Allow RMID allocation to be scoped by CLOSID
x86/resctrl: Allow the allocator to check if a CLOSID can allocate
clean RMID
x86/resctrl: Move CLOSID/RMID matching and setting to use helpers
x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow
x86/resctrl: Queue mon_event_read() instead of sending an IPI
x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
x86/resctrl: Allow arch to allocate memory needed in
resctrl_arch_rmid_read()
x86/resctrl: Make resctrl_mounted checks explicit
x86/resctrl: Move alloc/mon static keys into helpers
x86/resctrl: Make rdt_enable_key the arch's decision to switch
x86/resctrl: Add helpers for system wide mon/alloc capable
x86/resctrl: Add cpu online callback for resctrl work
x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but
cpu
x86/resctrl: Add cpu offline callback for resctrl work
x86/resctrl: Separate arch and fs resctrl locks

arch/x86/include/asm/resctrl.h | 90 ++++++
arch/x86/kernel/cpu/resctrl/core.c | 78 ++---
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 23 +-
arch/x86/kernel/cpu/resctrl/internal.h | 77 ++++-
arch/x86/kernel/cpu/resctrl/monitor.c | 372 ++++++++++++++++------
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 15 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 319 ++++++++++++++-----
include/linux/resctrl.h | 21 +-
include/linux/tick.h | 3 +-
9 files changed, 750 insertions(+), 248 deletions(-)

--
2.39.2



2023-03-20 17:33:01

by James Morse

Subject: [PATCH v3 01/19] x86/resctrl: Track the closid with the rmid

x86's RMID are independent of the CLOSID. An RMID can be allocated,
used and freed without considering the CLOSID.

MPAM's equivalent feature is PMG, which is not an independent number,
it extends the CLOSID/PARTID space. For MPAM, only PMG-bits worth of
'RMID' can be allocated for a single CLOSID.
i.e. if there is 1 bit of PMG space, then each CLOSID can have two
monitor groups.

To allow resctrl to disambiguate RMID values for different CLOSID,
everything in resctrl that keeps an RMID value needs to know the CLOSID
too. This will always be ignored on x86.

Tested-by: Shaopeng Tan <[email protected]>
Reviewed-by: Xin Hao <[email protected]>
Signed-off-by: James Morse <[email protected]>

---
Is there a better term for 'the unique identifier for a monitor group'?
Using RMID for that here may be confusing...

Changes since v1:
* Added comment in struct rmid_entry

Changes since v2:
* Moved X86_RESCTRL_BAD_CLOSID from a subsequent patch
---
arch/x86/include/asm/resctrl.h | 7 +++
arch/x86/kernel/cpu/resctrl/internal.h | 2 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 59 ++++++++++++++---------
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 12 ++---
include/linux/resctrl.h | 11 ++++-
6 files changed, 61 insertions(+), 34 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 52788f79786f..cbe986d23df6 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -7,6 +7,13 @@
#include <linux/sched.h>
#include <linux/jump_label.h>

+/*
+ * This value can never be a valid CLOSID, and is used when mapping a
+ * (closid, rmid) pair to an index and back. On x86 only the RMID is
+ * needed.
+ */
+#define X86_RESCTRL_BAD_CLOSID ((u32)~0)
+
/**
* struct resctrl_pqr_state - State cache for the PQR MSR
* @cur_rmid: The cached Resource Monitoring ID
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 8edecc5763d8..c64097947994 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -535,7 +535,7 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
int closids_supported(void);
void closid_free(int closid);
int alloc_rmid(void);
-void free_rmid(u32 rmid);
+void free_rmid(u32 closid, u32 rmid);
int rdt_get_mon_l3_config(struct rdt_resource *r);
bool __init rdt_cpu_has(int flag);
void mon_event_count(void *info);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 7fe51488e136..18c37d364030 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -25,6 +25,12 @@
#include "internal.h"

struct rmid_entry {
+ /*
+ * Some architectures's resctrl_arch_rmid_read() needs the CLOSID value
+ * in order to access the correct monitor. This field provides the
+ * value to list walkers like __check_limbo(). On x86 this is ignored.
+ */
+ u32 closid;
u32 rmid;
int busy;
struct list_head list;
@@ -136,7 +142,7 @@ static inline u64 get_corrected_mbm_count(u32 rmid, unsigned long val)
return val;
}

-static inline struct rmid_entry *__rmid_entry(u32 rmid)
+static inline struct rmid_entry *__rmid_entry(u32 closid, u32 rmid)
{
struct rmid_entry *entry;

@@ -190,7 +196,8 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
}

void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
- u32 rmid, enum resctrl_event_id eventid)
+ u32 closid, u32 rmid,
+ enum resctrl_event_id eventid)
{
struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
struct arch_mbm_state *am;
@@ -230,7 +237,8 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
}

int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
- u32 rmid, enum resctrl_event_id eventid, u64 *val)
+ u32 closid, u32 rmid, enum resctrl_event_id eventid,
+ u64 *val)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
@@ -285,9 +293,9 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
if (nrmid >= r->num_rmid)
break;

- entry = __rmid_entry(nrmid);
+ entry = __rmid_entry(X86_RESCTRL_BAD_CLOSID, nrmid);// temporary

- if (resctrl_arch_rmid_read(r, d, entry->rmid,
+ if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
QOS_L3_OCCUP_EVENT_ID, &val)) {
rmid_dirty = true;
} else {
@@ -342,7 +350,8 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
cpu = get_cpu();
list_for_each_entry(d, &r->domains, list) {
if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
- err = resctrl_arch_rmid_read(r, d, entry->rmid,
+ err = resctrl_arch_rmid_read(r, d, entry->closid,
+ entry->rmid,
QOS_L3_OCCUP_EVENT_ID,
&val);
if (err || val <= resctrl_rmid_realloc_threshold)
@@ -366,7 +375,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
list_add_tail(&entry->list, &rmid_free_lru);
}

-void free_rmid(u32 rmid)
+void free_rmid(u32 closid, u32 rmid)
{
struct rmid_entry *entry;

@@ -375,7 +384,7 @@ void free_rmid(u32 rmid)

lockdep_assert_held(&rdtgroup_mutex);

- entry = __rmid_entry(rmid);
+ entry = __rmid_entry(closid, rmid);

if (is_llc_occupancy_enabled())
add_rmid_to_limbo(entry);
@@ -383,15 +392,16 @@ void free_rmid(u32 rmid)
list_add_tail(&entry->list, &rmid_free_lru);
}

-static int __mon_event_count(u32 rmid, struct rmid_read *rr)
+static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
{
struct mbm_state *m;
u64 tval = 0;

if (rr->first)
- resctrl_arch_reset_rmid(rr->r, rr->d, rmid, rr->evtid);
+ resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evtid);

- rr->err = resctrl_arch_rmid_read(rr->r, rr->d, rmid, rr->evtid, &tval);
+ rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid, rr->evtid,
+ &tval);
if (rr->err)
return rr->err;

@@ -434,7 +444,7 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
* __mon_event_count() is compared with the chunks value from the previous
* invocation. This must be called once per second to maintain values in MBps.
*/
-static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
+static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr)
{
struct mbm_state *m = &rr->d->mbm_local[rmid];
u64 cur_bw, bytes, cur_bytes;
@@ -464,7 +474,7 @@ void mon_event_count(void *info)

rdtgrp = rr->rgrp;

- ret = __mon_event_count(rdtgrp->mon.rmid, rr);
+ ret = __mon_event_count(rdtgrp->closid, rdtgrp->mon.rmid, rr);

/*
* For Ctrl groups read data from child monitor groups and
@@ -475,7 +485,8 @@ void mon_event_count(void *info)

if (rdtgrp->type == RDTCTRL_GROUP) {
list_for_each_entry(entry, head, mon.crdtgrp_list) {
- if (__mon_event_count(entry->mon.rmid, rr) == 0)
+ if (__mon_event_count(rdtgrp->closid, entry->mon.rmid,
+ rr) == 0)
ret = 0;
}
}
@@ -605,7 +616,8 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
}
}

-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
+static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
+ u32 closid, u32 rmid)
{
struct rmid_read rr;

@@ -620,12 +632,12 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
if (is_mbm_total_enabled()) {
rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
rr.val = 0;
- __mon_event_count(rmid, &rr);
+ __mon_event_count(closid, rmid, &rr);
}
if (is_mbm_local_enabled()) {
rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
rr.val = 0;
- __mon_event_count(rmid, &rr);
+ __mon_event_count(closid, rmid, &rr);

/*
* Call the MBA software controller only for the
@@ -633,7 +645,7 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
* the software controller explicitly.
*/
if (is_mba_sc(NULL))
- mbm_bw_count(rmid, &rr);
+ mbm_bw_count(closid, rmid, &rr);
}
}

@@ -690,11 +702,11 @@ void mbm_handle_overflow(struct work_struct *work)
d = container_of(work, struct rdt_domain, mbm_over.work);

list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
- mbm_update(r, d, prgrp->mon.rmid);
+ mbm_update(r, d, prgrp->closid, prgrp->mon.rmid);

head = &prgrp->mon.crdtgrp_list;
list_for_each_entry(crgrp, head, mon.crdtgrp_list)
- mbm_update(r, d, crgrp->mon.rmid);
+ mbm_update(r, d, crgrp->closid, crgrp->mon.rmid);

if (is_mba_sc(NULL))
update_mba_bw(prgrp, d);
@@ -737,10 +749,11 @@ static int dom_data_init(struct rdt_resource *r)
}

/*
- * RMID 0 is special and is always allocated. It's used for all
- * tasks that are not monitored.
+ * RMID 0 is special and is always allocated. It's used for the
+ * default_rdtgroup control group, which will be setup later. See
+ * rdtgroup_setup_root().
*/
- entry = __rmid_entry(0);
+ entry = __rmid_entry(0, 0);
list_del(&entry->list);

return 0;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 524f8ff3e69c..c51932516965 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -738,7 +738,7 @@ int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp)
* anymore when this group would be used for pseudo-locking. This
* is safe to call on platforms not capable of monitoring.
*/
- free_rmid(rdtgrp->mon.rmid);
+ free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);

ret = 0;
goto out;
@@ -773,7 +773,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)

ret = rdtgroup_locksetup_user_restore(rdtgrp);
if (ret) {
- free_rmid(rdtgrp->mon.rmid);
+ free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
return ret;
}

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e2c1599d1b37..23e6b3a373b0 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2688,7 +2688,7 @@ static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)

head = &rdtgrp->mon.crdtgrp_list;
list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) {
- free_rmid(sentry->mon.rmid);
+ free_rmid(sentry->closid, sentry->mon.rmid);
list_del(&sentry->mon.crdtgrp_list);

if (atomic_read(&sentry->waitcount) != 0)
@@ -2728,7 +2728,7 @@ static void rmdir_all_sub(void)
cpumask_or(&rdtgroup_default.cpu_mask,
&rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);

- free_rmid(rdtgrp->mon.rmid);
+ free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);

kernfs_remove(rdtgrp->kn);
list_del(&rdtgrp->rdtgroup_list);
@@ -3222,7 +3222,7 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
return 0;

out_idfree:
- free_rmid(rdtgrp->mon.rmid);
+ free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
out_destroy:
kernfs_put(rdtgrp->kn);
kernfs_remove(rdtgrp->kn);
@@ -3236,7 +3236,7 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
static void mkdir_rdt_prepare_clean(struct rdtgroup *rgrp)
{
kernfs_remove(rgrp->kn);
- free_rmid(rgrp->mon.rmid);
+ free_rmid(rgrp->closid, rgrp->mon.rmid);
rdtgroup_remove(rgrp);
}

@@ -3385,7 +3385,7 @@ static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
update_closid_rmid(tmpmask, NULL);

rdtgrp->flags = RDT_DELETED;
- free_rmid(rdtgrp->mon.rmid);
+ free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);

/*
* Remove the rdtgrp from the parent ctrl_mon group's list
@@ -3431,8 +3431,8 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
update_closid_rmid(tmpmask, NULL);

+ free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
closid_free(rdtgrp->closid);
- free_rmid(rdtgrp->mon.rmid);

rdtgroup_ctrl_remove(rdtgrp);

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8334eeacfec5..7d80bae05f59 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -225,6 +225,8 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
* for this resource and domain.
* @r: resource that the counter should be read from.
* @d: domain that the counter should be read from.
+ * @closid: closid that matches the rmid. The counter may
+ * match traffic of both closid and rmid, or rmid only.
* @rmid: rmid of the counter to read.
* @eventid: eventid to read, e.g. L3 occupancy.
* @val: result of the counter read in bytes.
@@ -235,20 +237,25 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
* 0 on success, or -EIO, -EINVAL etc on error.
*/
int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
- u32 rmid, enum resctrl_event_id eventid, u64 *val);
+ u32 closid, u32 rmid, enum resctrl_event_id eventid,
+ u64 *val);
+

/**
* resctrl_arch_reset_rmid() - Reset any private state associated with rmid
* and eventid.
* @r: The domain's resource.
* @d: The rmid's domain.
+ * @closid: The closid that matches the rmid. Counters may match both
+ * closid and rmid, or rmid only.
* @rmid: The rmid whose counter values should be reset.
* @eventid: The eventid whose counter values should be reset.
*
* This can be called from any CPU.
*/
void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
- u32 rmid, enum resctrl_event_id eventid);
+ u32 closid, u32 rmid,
+ enum resctrl_event_id eventid);

/**
* resctrl_arch_reset_rmid_all() - Reset all private state associated with
--
2.39.2


2023-03-20 17:46:33

by James Morse

Subject: [PATCH v3 16/19] x86/resctrl: Add cpu online callback for resctrl work

The resctrl architecture specific code may need to create a domain when
a CPU comes online. It also needs to reset the CPU's PQR_ASSOC register.
The resctrl filesystem code needs to update the rdtgroup_default cpu
mask when cpus are brought online.

Currently this is all done in one function, resctrl_online_cpu().
This will need to be split into architecture and filesystem parts
before resctrl can be moved to /fs/.

Pull the rdtgroup_default update work out as a filesystem specific
cpu_online helper. resctrl_online_cpu() is the obvious name for this,
which means the version in core.c needs renaming.

resctrl_online_cpu() is called by the arch code once it has done the
work to add the new cpu to any domains.

In future patches, resctrl_online_cpu() will take the rdtgroup_mutex
itself.

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
arch/x86/kernel/cpu/resctrl/core.c | 11 ++++++-----
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 10 ++++++++++
include/linux/resctrl.h | 1 +
3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 351319403f84..8e25ea49372e 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -603,19 +603,20 @@ static void clear_closid_rmid(int cpu)
wrmsr(MSR_IA32_PQR_ASSOC, RESCTRL_RESERVED_CLOSID, 0);
}

-static int resctrl_online_cpu(unsigned int cpu)
+static int resctrl_arch_online_cpu(unsigned int cpu)
{
struct rdt_resource *r;
+ int err;

mutex_lock(&rdtgroup_mutex);
for_each_capable_rdt_resource(r)
domain_add_cpu(cpu, r);
- /* The cpu is set in default rdtgroup after online. */
- cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
clear_closid_rmid(cpu);
+
+ err = resctrl_online_cpu(cpu);
mutex_unlock(&rdtgroup_mutex);

- return 0;
+ return err;
}

static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
@@ -965,7 +966,7 @@ static int __init resctrl_late_init(void)

state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
"x86/resctrl/cat:online:",
- resctrl_online_cpu, resctrl_offline_cpu);
+ resctrl_arch_online_cpu, resctrl_offline_cpu);
if (state < 0)
return state;

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 8f319e03b449..410b2b451c30 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3698,6 +3698,16 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

+int resctrl_online_cpu(unsigned int cpu)
+{
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ /* The cpu is set in default rdtgroup after online. */
+ cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
+
+ return 0;
+}
+
/*
* rdtgroup_init - rdtgroup initialization
*
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 03e4f41cd336..5a66d034aa61 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -222,6 +222,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, enum resctrl_conf_type type);
int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_cpu(unsigned int cpu);

/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
--
2.39.2


2023-03-20 17:46:37

by James Morse

Subject: [PATCH v3 07/19] x86/resctrl: Move CLOSID/RMID matching and setting to use helpers

When switching tasks, the CLOSID and RMID that the new task should
use are stored in struct task_struct. For x86 the CLOSID known by resctrl,
the value in task_struct, and the value written to the CPU register are
all the same thing.

MPAM's CPU interface has two different PARTIDs, one for data accesses and
the other for instruction fetch. Storing resctrl's CLOSID value in
struct task_struct implies the arch code knows whether resctrl is using
CDP.

Move the matching and setting of the struct task_struct properties
to use helpers. This allows arm64 to store the hardware format of
the register, instead of having to convert it each time.

__rdtgroup_move_task()'s use of READ_ONCE()/WRITE_ONCE() ensures torn
values aren't seen as another CPU may schedule the task being moved
while the value is being changed. MPAM has an additional corner-case
here as the PMG bits extend the PARTID space. If the scheduler sees a
new-CLOSID but old-RMID, the task will dirty an RMID that the limbo code
is not watching, causing an inaccurate count. x86's RMID are independent
values, so the limbo code will still be watching the old-RMID in this
circumstance.
To avoid this, arm64 needs both the CLOSID/RMID WRITE_ONCE()d together.
Both values must be provided together.

Because MPAM's RMID values are not unique, the CLOSID must be provided
when matching the RMID.
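
To illustrate why the helpers take both values, here is a userspace sketch of how an architecture might store the pair in one word so that it is written with a single store. This is an assumption for illustration only: struct task_ids, the function names and the 32/32 bit layout are all hypothetical, and this is not the arm64 implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-task state; not struct task_struct. */
struct task_ids {
	/* CLOSID in the high 32 bits, RMID in the low 32 bits. */
	_Atomic uint64_t closid_rmid;
};

static inline uint64_t pack_ids(uint32_t closid, uint32_t rmid)
{
	return ((uint64_t)closid << 32) | rmid;
}

static inline void set_closid_rmid(struct task_ids *t, uint32_t closid,
				   uint32_t rmid)
{
	/* One store: a reader sees the old pair or the new pair, never a mix. */
	atomic_store_explicit(&t->closid_rmid, pack_ids(closid, rmid),
			      memory_order_relaxed);
}

static inline bool match_closid(struct task_ids *t, uint32_t closid)
{
	uint64_t v = atomic_load_explicit(&t->closid_rmid,
					  memory_order_relaxed);

	return (uint32_t)(v >> 32) == closid;
}

/* The RMID alone is not unique here, so the CLOSID is part of the match. */
static inline bool match_rmid(struct task_ids *t, uint32_t closid,
			      uint32_t rmid)
{
	return atomic_load_explicit(&t->closid_rmid,
				    memory_order_relaxed) ==
	       pack_ids(closid, rmid);
}
```

On x86 the two fields can stay separate, as in the helpers added by this patch, because a stale RMID is still one the limbo code is watching.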

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
Changes since v2:
* __rdtgroup_move_task() changed to take the CLOSID from a different place
depending on the group type
---
arch/x86/include/asm/resctrl.h | 18 ++++++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 62 ++++++++++++++++----------
2 files changed, 56 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 3ca40be41a0a..752123b0ce40 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -95,6 +95,24 @@ static inline unsigned int resctrl_arch_round_mon_val(unsigned int val)
return val * scale;
}

+static inline void resctrl_arch_set_closid_rmid(struct task_struct *tsk,
+ u32 closid, u32 rmid)
+{
+ WRITE_ONCE(tsk->closid, closid);
+ WRITE_ONCE(tsk->rmid, rmid);
+}
+
+static inline bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid)
+{
+ return READ_ONCE(tsk->closid) == closid;
+}
+
+static inline bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 ignored,
+ u32 rmid)
+{
+ return READ_ONCE(tsk->rmid) == rmid;
+}
+
static inline void resctrl_sched_in(void)
{
if (static_branch_likely(&rdt_enable_key))
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e741bc47bae9..2306fbc9a9bb 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -84,7 +84,7 @@ void rdt_last_cmd_printf(const char *fmt, ...)
*
* Using a global CLOSID across all resources has some advantages and
* some drawbacks:
- * + We can simply set "current->closid" to assign a task to a resource
+ * + We can simply set current's closid to assign a task to a resource
* group.
* + Context switch code can avoid extra memory references deciding which
* CLOSID to load into the PQR_ASSOC MSR
@@ -544,14 +544,26 @@ static void update_task_closid_rmid(struct task_struct *t)
_update_task_closid_rmid(t);
}

+static bool task_in_rdtgroup(struct task_struct *tsk, struct rdtgroup *rdtgrp)
+{
+ u32 closid, rmid = rdtgrp->mon.rmid;
+
+ if (rdtgrp->type == RDTCTRL_GROUP)
+ closid = rdtgrp->closid;
+ else if (rdtgrp->type == RDTMON_GROUP)
+ closid = rdtgrp->mon.parent->closid;
+ else
+ return false;
+
+ return resctrl_arch_match_closid(tsk, closid) &&
+ resctrl_arch_match_rmid(tsk, closid, rmid);
+}
+
static int __rdtgroup_move_task(struct task_struct *tsk,
struct rdtgroup *rdtgrp)
{
/* If the task is already in rdtgrp, no need to move the task. */
- if ((rdtgrp->type == RDTCTRL_GROUP && tsk->closid == rdtgrp->closid &&
- tsk->rmid == rdtgrp->mon.rmid) ||
- (rdtgrp->type == RDTMON_GROUP && tsk->rmid == rdtgrp->mon.rmid &&
- tsk->closid == rdtgrp->mon.parent->closid))
+ if (task_in_rdtgroup(tsk, rdtgrp))
return 0;

/*
@@ -562,19 +574,19 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
* For monitor groups, can move the tasks only from
* their parent CTRL group.
*/
-
- if (rdtgrp->type == RDTCTRL_GROUP) {
- WRITE_ONCE(tsk->closid, rdtgrp->closid);
- WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
- } else if (rdtgrp->type == RDTMON_GROUP) {
- if (rdtgrp->mon.parent->closid == tsk->closid) {
- WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
- } else {
- rdt_last_cmd_puts("Can't move task to different control group\n");
- return -EINVAL;
- }
+ if (rdtgrp->type == RDTMON_GROUP &&
+ !resctrl_arch_match_closid(tsk, rdtgrp->mon.parent->closid)) {
+ rdt_last_cmd_puts("Can't move task to different control group\n");
+ return -EINVAL;
}

+ if (rdtgrp->type == RDTMON_GROUP)
+ resctrl_arch_set_closid_rmid(tsk, rdtgrp->mon.parent->closid,
+ rdtgrp->mon.rmid);
+ else
+ resctrl_arch_set_closid_rmid(tsk, rdtgrp->closid,
+ rdtgrp->mon.rmid);
+
/*
* Ensure the task's closid and rmid are written before determining if
* the task is current that will decide if it will be interrupted.
@@ -596,14 +608,15 @@ static int __rdtgroup_move_task(struct task_struct *tsk,

static bool is_closid_match(struct task_struct *t, struct rdtgroup *r)
{
- return (rdt_alloc_capable &&
- (r->type == RDTCTRL_GROUP) && (t->closid == r->closid));
+ return (rdt_alloc_capable && (r->type == RDTCTRL_GROUP) &&
+ resctrl_arch_match_closid(t, r->closid));
}

static bool is_rmid_match(struct task_struct *t, struct rdtgroup *r)
{
- return (rdt_mon_capable &&
- (r->type == RDTMON_GROUP) && (t->rmid == r->mon.rmid));
+ return (rdt_mon_capable && (r->type == RDTMON_GROUP) &&
+ resctrl_arch_match_rmid(t, r->mon.parent->closid,
+ r->mon.rmid));
}

/**
@@ -799,7 +812,7 @@ int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns,
rdtg->mode != RDT_MODE_EXCLUSIVE)
continue;

- if (rdtg->closid != tsk->closid)
+ if (!resctrl_arch_match_closid(tsk, rdtg->closid))
continue;

seq_printf(s, "res:%s%s\n", (rdtg == &rdtgroup_default) ? "/" : "",
@@ -807,7 +820,8 @@ int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns,
seq_puts(s, "mon:");
list_for_each_entry(crg, &rdtg->mon.crdtgrp_list,
mon.crdtgrp_list) {
- if (tsk->rmid != crg->mon.rmid)
+ if (!resctrl_arch_match_rmid(tsk, crg->mon.parent->closid,
+ crg->mon.rmid))
continue;
seq_printf(s, "%s", crg->kn->name);
break;
@@ -2659,8 +2673,8 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
for_each_process_thread(p, t) {
if (!from || is_closid_match(t, from) ||
is_rmid_match(t, from)) {
- WRITE_ONCE(t->closid, to->closid);
- WRITE_ONCE(t->rmid, to->mon.rmid);
+ resctrl_arch_set_closid_rmid(t, to->closid,
+ to->mon.rmid);

/*
* Order the closid/rmid stores above before the loads
--
2.39.2


2023-03-20 17:46:39

by James Morse

Subject: [PATCH v3 19/19] x86/resctrl: Separate arch and fs resctrl locks

resctrl has one mutex that is taken by the architecture specific code,
and the filesystem parts. The two interact via cpuhp, where the
architecture code updates the domain list. Filesystem handlers that
walk the domains list should not run concurrently with the cpuhp
callback modifying the list.

Exposing a lock from the filesystem code means the interface is not
cleanly defined, and creates the possibility of cross-architecture
lock ordering headaches. The interaction only exists so that certain
filesystem paths are serialised against cpu hotplug. The cpu hotplug
code already has a mechanism to do this using cpus_read_lock().

MPAM's monitors have an overflow interrupt, so it needs to be possible
to walk the domains list in irq context. RCU is ideal for this,
but some paths need to be able to sleep to allocate memory.

Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
of a cpuhp callback, cpus_read_lock() must always be taken first.
rdtgroup_schemata_write() already does this.

Most of the filesystem code's domain list walkers are currently
protected by the rdtgroup_mutex taken in rdtgroup_kn_lock_live().
The exceptions are rdt_bit_usage_show() and the mon_config helpers
which take the lock directly.

Make the domain list protected by RCU. An architecture-specific
lock prevents concurrent writers. rdt_bit_usage_show() can
walk the domain list under rcu_read_lock(). The mon_config helpers
send multiple IPIs; take the cpus_read_lock() in these cases.
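
The publish side of this scheme can be modelled in userspace with C11 atomics. This is only a sketch under stated assumptions: the struct and function names are hypothetical, nodes are added at the head rather than the tail, the writer is assumed to hold a mutex playing the role of domain_list_lock, and removal is not modelled at all (in the kernel that is list_del_rcu() followed by synchronize_rcu() before the domain is freed).

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rdt_domain. */
struct domain {
	int id;
	_Atomic(struct domain *) next;
};

struct domain_list {
	_Atomic(struct domain *) head;
};

/* Writer side: the caller is assumed to hold the writer mutex. */
static void domain_publish(struct domain_list *l, struct domain *d)
{
	/* Link the new node to the current list before it is reachable. */
	atomic_store_explicit(&d->next,
			      atomic_load_explicit(&l->head,
						   memory_order_relaxed),
			      memory_order_relaxed);
	/* The release store pairs with the readers' acquire loads. */
	atomic_store_explicit(&l->head, d, memory_order_release);
}

/* Reader side: takes no lock and only follows published pointers. */
static int domain_count(struct domain_list *l)
{
	struct domain *d;
	int n = 0;

	for (d = atomic_load_explicit(&l->head, memory_order_acquire); d;
	     d = atomic_load_explicit(&d->next, memory_order_acquire))
		n++;

	return n;
}
```

A reader that races with domain_publish() sees either the old head or the fully initialised new node, which is the property the lockless walkers rely on.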

The other filesystem list walkers need to be able to sleep.
Add cpus_read_lock() to rdtgroup_kn_lock_live() so that the
cpuhp callbacks can't be invoked when file system operations are
occurring.

Add lockdep_assert_cpus_held() in the cases where the
rdtgroup_kn_lock_live() call isn't obvious.

Resctrl's domain online/offline calls now need to take the
rdtgroup_mutex themselves.

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
Changes since v2:
* Reworded a comment,
* Added a lockdep assertion
* Moved clear_closid_rmid() outside the locked region of cpu
online/offline
---
arch/x86/kernel/cpu/resctrl/core.c | 38 +++++++++-----
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 16 ++++--
arch/x86/kernel/cpu/resctrl/monitor.c | 3 ++
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 3 ++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 63 ++++++++++++++++++++---
include/linux/resctrl.h | 2 +-
6 files changed, 99 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 4e5fc89dab6d..85216091228a 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -25,8 +25,15 @@
#include <asm/resctrl.h>
#include "internal.h"

-/* Mutex to protect rdtgroup access. */
-DEFINE_MUTEX(rdtgroup_mutex);
+/*
+ * rdt_domain structures are kfree()d when their last CPU goes offline,
+ * and allocated when the first CPU in a new domain comes online.
+ * The rdt_resource's domain list is updated when this happens. Readers of
+ * the domain list must either take cpus_read_lock(), or rely on an RCU
+ * read-side critical section, to avoid observing concurrent modification.
+ * All writers take this mutex:
+ */
+static DEFINE_MUTEX(domain_list_lock);

/*
* The cached resctrl_pqr_state is strictly per CPU and can never be
@@ -508,6 +515,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
struct rdt_domain *d;
int err;

+ lockdep_assert_held(&domain_list_lock);
+
d = rdt_find_domain(r, id, &add_pos);
if (IS_ERR(d)) {
pr_warn("Couldn't find cache id for CPU %d\n", cpu);
@@ -541,11 +550,12 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;
}

- list_add_tail(&d->list, add_pos);
+ list_add_tail_rcu(&d->list, add_pos);

err = resctrl_online_domain(r, d);
if (err) {
- list_del(&d->list);
+ list_del_rcu(&d->list);
+ synchronize_rcu();
domain_free(hw_dom);
}
}
@@ -556,6 +566,8 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;

+ lockdep_assert_held(&domain_list_lock);
+
d = rdt_find_domain(r, id, NULL);
if (IS_ERR_OR_NULL(d)) {
pr_warn("Couldn't find cache id for CPU %d\n", cpu);
@@ -566,7 +578,8 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
cpumask_clear_cpu(cpu, &d->cpu_mask);
if (cpumask_empty(&d->cpu_mask)) {
resctrl_offline_domain(r, d);
- list_del(&d->list);
+ list_del_rcu(&d->list);
+ synchronize_rcu();

/*
* rdt_domain "d" is going to be freed below, so clear
@@ -594,30 +607,29 @@ static void clear_closid_rmid(int cpu)
static int resctrl_arch_online_cpu(unsigned int cpu)
{
struct rdt_resource *r;
- int err;

- mutex_lock(&rdtgroup_mutex);
+ mutex_lock(&domain_list_lock);
for_each_capable_rdt_resource(r)
domain_add_cpu(cpu, r);
+ mutex_unlock(&domain_list_lock);
+
clear_closid_rmid(cpu);

- err = resctrl_online_cpu(cpu);
- mutex_unlock(&rdtgroup_mutex);
-
- return err;
+ return resctrl_online_cpu(cpu);
}

static int resctrl_arch_offline_cpu(unsigned int cpu)
{
struct rdt_resource *r;

- mutex_lock(&rdtgroup_mutex);
resctrl_offline_cpu(cpu);

+ mutex_lock(&domain_list_lock);
for_each_capable_rdt_resource(r)
domain_remove_cpu(cpu, r);
+ mutex_unlock(&domain_list_lock);
+
clear_closid_rmid(cpu);
- mutex_unlock(&rdtgroup_mutex);

return 0;
}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 9161bc95eea7..7c582fafa526 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -209,6 +209,9 @@ static int parse_line(char *line, struct resctrl_schema *s,
struct rdt_domain *d;
unsigned long dom_id;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
(r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)) {
rdt_last_cmd_puts("Cannot pseudo-lock MBA resource\n");
@@ -313,6 +316,9 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
struct rdt_domain *d;
u32 idx;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
return -ENOMEM;

@@ -379,11 +385,9 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
return -EINVAL;
buf[nbytes - 1] = '\0';

- cpus_read_lock();
rdtgrp = rdtgroup_kn_lock_live(of->kn);
if (!rdtgrp) {
rdtgroup_kn_unlock(of->kn);
- cpus_read_unlock();
return -ENOENT;
}
rdt_last_cmd_clear();
@@ -447,7 +451,6 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,

out:
rdtgroup_kn_unlock(of->kn);
- cpus_read_unlock();
return ret ?: nbytes;
}

@@ -467,6 +470,9 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
bool sep = false;
u32 ctrl_val;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
seq_printf(s, "%*s:", max_name_width, schema->name);
list_for_each_entry(dom, &r->domains, list) {
if (sep)
@@ -530,8 +536,8 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
{
int cpu;

- /* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */
- lockdep_assert_held(&rdtgroup_mutex);
+ /* When picking a cpu from cpu_mask, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();

/*
* setup the parameters to pass to mon_event_count() to read the data.
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 11fa5d79c81d..58f665ce7a0a 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -458,6 +458,9 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
u32 idx;
int err;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);

arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID);
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 0b4fdb118643..f8864626d593 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -830,6 +830,9 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
struct rdt_domain *d_i;
bool ret = false;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
return true;

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index c27ec56c6c60..52c610426181 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -35,6 +35,10 @@
DEFINE_STATIC_KEY_FALSE(rdt_enable_key);
DEFINE_STATIC_KEY_FALSE(rdt_mon_enable_key);
DEFINE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
+
+/* Mutex to protect rdtgroup access. */
+DEFINE_MUTEX(rdtgroup_mutex);
+
static struct kernfs_root *rdt_root;
struct rdtgroup rdtgroup_default;
LIST_HEAD(rdt_all_groups);
@@ -931,7 +935,8 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,

mutex_lock(&rdtgroup_mutex);
hw_shareable = r->cache.shareable_bits;
- list_for_each_entry(dom, &r->domains, list) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(dom, &r->domains, list) {
if (sep)
seq_putc(seq, ';');
sw_shareable = 0;
@@ -987,8 +992,10 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
}
sep = true;
}
+ rcu_read_unlock();
seq_putc(seq, '\n');
mutex_unlock(&rdtgroup_mutex);
+
return 0;
}

@@ -1231,6 +1238,9 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
struct rdt_domain *d;
u32 ctrl;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
list_for_each_entry(s, &resctrl_schema_all, list) {
r = s->res;
if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
@@ -1497,6 +1507,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
struct rdt_domain *dom;
bool sep = false;

+ cpus_read_lock();
mutex_lock(&rdtgroup_mutex);

list_for_each_entry(dom, &r->domains, list) {
@@ -1513,6 +1524,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
seq_puts(s, "\n");

mutex_unlock(&rdtgroup_mutex);
+ cpus_read_unlock();

return 0;
}
@@ -1604,6 +1616,9 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
struct rdt_domain *d;
int ret = 0;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
next:
if (!tok || tok[0] == '\0')
return 0;
@@ -1645,6 +1660,7 @@ static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of,
if (nbytes == 0 || buf[nbytes - 1] != '\n')
return -EINVAL;

+ cpus_read_lock();
mutex_lock(&rdtgroup_mutex);

rdt_last_cmd_clear();
@@ -1654,6 +1670,7 @@ static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of,
ret = mon_config_write(r, buf, QOS_L3_MBM_TOTAL_EVENT_ID);

mutex_unlock(&rdtgroup_mutex);
+ cpus_read_unlock();

return ret ?: nbytes;
}
@@ -1669,6 +1686,7 @@ static ssize_t mbm_local_bytes_config_write(struct kernfs_open_file *of,
if (nbytes == 0 || buf[nbytes - 1] != '\n')
return -EINVAL;

+ cpus_read_lock();
mutex_lock(&rdtgroup_mutex);

rdt_last_cmd_clear();
@@ -1678,6 +1696,7 @@ static ssize_t mbm_local_bytes_config_write(struct kernfs_open_file *of,
ret = mon_config_write(r, buf, QOS_L3_MBM_LOCAL_EVENT_ID);

mutex_unlock(&rdtgroup_mutex);
+ cpus_read_unlock();

return ret ?: nbytes;
}
@@ -2130,6 +2149,9 @@ static int set_cache_qos_cfg(int level, bool enable)
struct rdt_domain *d;
int cpu;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
if (level == RDT_RESOURCE_L3)
update = l3_qos_cfg_update;
else if (level == RDT_RESOURCE_L2)
@@ -2318,6 +2340,7 @@ struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn)
atomic_inc(&rdtgrp->waitcount);
kernfs_break_active_protection(kn);

+ cpus_read_lock();
mutex_lock(&rdtgroup_mutex);

/* Was this group deleted while we waited? */
@@ -2335,6 +2358,7 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn)
return;

mutex_unlock(&rdtgroup_mutex);
+ cpus_read_unlock();

if (atomic_dec_and_test(&rdtgrp->waitcount) &&
(rdtgrp->flags & RDT_DELETED)) {
@@ -2632,6 +2656,9 @@ static int reset_all_ctrls(struct rdt_resource *r)
struct rdt_domain *d;
int i;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
return -ENOMEM;

@@ -2916,6 +2943,9 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_domain *dom;
int ret;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
list_for_each_entry(dom, &r->domains, list) {
ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
if (ret)
@@ -3602,7 +3632,8 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
kfree(d->mbm_local);
}

-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+static void _resctrl_offline_domain(struct rdt_resource *r,
+ struct rdt_domain *d)
{
lockdep_assert_held(&rdtgroup_mutex);

@@ -3637,6 +3668,13 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
domain_destroy_mon_state(d);
}

+void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+ mutex_lock(&rdtgroup_mutex);
+ _resctrl_offline_domain(r, d);
+ mutex_unlock(&rdtgroup_mutex);
+}
+
static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
@@ -3668,7 +3706,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+static int _resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
{
int err;

@@ -3700,12 +3738,23 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

+int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+ int err;
+
+ mutex_lock(&rdtgroup_mutex);
+ err = _resctrl_online_domain(r, d);
+ mutex_unlock(&rdtgroup_mutex);
+
+ return err;
+}
+
int resctrl_online_cpu(unsigned int cpu)
{
- lockdep_assert_held(&rdtgroup_mutex);
-
+ mutex_lock(&rdtgroup_mutex);
/* The cpu is set in default rdtgroup after online. */
cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
+ mutex_unlock(&rdtgroup_mutex);

return 0;
}
@@ -3726,8 +3775,7 @@ void resctrl_offline_cpu(unsigned int cpu)
struct rdtgroup *rdtgrp;
struct rdt_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;

- lockdep_assert_held(&rdtgroup_mutex);
-
+ mutex_lock(&rdtgroup_mutex);
list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
clear_childcpus(rdtgrp, cpu);
@@ -3747,6 +3795,7 @@ void resctrl_offline_cpu(unsigned int cpu)
cqm_setup_limbo_handler(d, 0, cpu);
}
}
+ mutex_unlock(&rdtgroup_mutex);
}

/*
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index f053527aaa5b..4d35798effef 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -159,7 +159,7 @@ struct resctrl_schema;
* @cache_level: Which cache level defines scope of this resource
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
- * @domains: All domains for this resource
+ * @domains: RCU list of all domains for this resource
* @name: Name to use in "schemata" file.
* @data_width: Character width of data when displaying
* @default_ctrl: Specifies default cache cbm or memory B/W percent.
--
2.39.2
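
The lock nesting this patch ends up with can be modelled in a few lines of userspace C. This is only a sketch: the order used below (CPU hotplug lock, then domain_list_lock, then rdtgroup_mutex) is a reading of the diff above, and the model_*() helpers and the enum are hypothetical names that exist nowhere in the kernel. The checker refuses to take a lock while the same or a later lock in that order is already held, which is roughly what lockdep would complain about.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed order, outermost first. Derived from the diff, not from kernel docs. */
enum model_lock_id { CPU_HOTPLUG, DOMAIN_LIST, RDTGROUP, NR_MODEL_LOCKS };

static bool held[NR_MODEL_LOCKS];

/* Returns false if taking @l would violate the assumed order. */
static bool model_lock(enum model_lock_id l)
{
	for (int i = l; i < NR_MODEL_LOCKS; i++)
		if (held[i])
			return false;	/* this lock or a later one is already held */
	held[l] = true;
	return true;
}

static void model_unlock(enum model_lock_id l)
{
	assert(held[l]);
	held[l] = false;
}

/* Shape of resctrl_arch_online_cpu() after this patch. */
static bool model_cpu_online(void)
{
	bool ok = true;

	ok &= model_lock(CPU_HOTPLUG);	/* held by the cpuhp core around the callback */
	ok &= model_lock(DOMAIN_LIST);	/* domain_add_cpu() runs under this */
	ok &= model_lock(RDTGROUP);	/* taken inside resctrl_online_domain() */
	model_unlock(RDTGROUP);
	model_unlock(DOMAIN_LIST);
	ok &= model_lock(RDTGROUP);	/* taken inside resctrl_online_cpu() */
	model_unlock(RDTGROUP);
	model_unlock(CPU_HOTPLUG);
	return ok;
}

/* Shape of a filesystem path that goes through rdtgroup_kn_lock_live(). */
static bool model_fs_write(void)
{
	bool ok = true;

	ok &= model_lock(CPU_HOTPLUG);	/* cpus_read_lock() */
	ok &= model_lock(RDTGROUP);	/* mutex_lock(&rdtgroup_mutex) */
	model_unlock(RDTGROUP);
	model_unlock(CPU_HOTPLUG);
	return ok;
}

/* The inversion to avoid: rdtgroup_mutex first, hotplug lock second. */
static bool model_inverted(void)
{
	bool ok;

	if (!model_lock(RDTGROUP))
		return false;
	ok = model_lock(CPU_HOTPLUG);
	if (ok)
		model_unlock(CPU_HOTPLUG);
	model_unlock(RDTGROUP);
	return ok;
}
```

Both model_cpu_online() and model_fs_write() pass the check, while model_inverted() is rejected. That matches the diff above, which moves cpus_read_lock() in front of mutex_lock(&rdtgroup_mutex) in rdtgroup_kn_lock_live().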


2023-03-20 17:46:44

by James Morse

Subject: [PATCH v3 13/19] x86/resctrl: Move alloc/mon static keys into helpers

resctrl enables three static keys depending on the features it has enabled.
Another architecture's context switch code may look different, so any
static keys that control it should be buried behind helpers.

Move the alloc/mon logic into arch-specific helpers as a preparatory step
for making the rdt_enable_key's status something the arch code decides.

This means other architectures don't have to mirror the static keys.

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
arch/x86/include/asm/resctrl.h | 20 ++++++++++++++++++++
arch/x86/kernel/cpu/resctrl/internal.h | 5 -----
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 8 ++++----
3 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 1c87f1626456..5fdfcd5f943e 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -42,6 +42,26 @@ DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);

+static inline void resctrl_arch_enable_alloc(void)
+{
+ static_branch_enable_cpuslocked(&rdt_alloc_enable_key);
+}
+
+static inline void resctrl_arch_disable_alloc(void)
+{
+ static_branch_disable_cpuslocked(&rdt_alloc_enable_key);
+}
+
+static inline void resctrl_arch_enable_mon(void)
+{
+ static_branch_enable_cpuslocked(&rdt_mon_enable_key);
+}
+
+static inline void resctrl_arch_disable_mon(void)
+{
+ static_branch_disable_cpuslocked(&rdt_mon_enable_key);
+}
+
/*
* __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
*
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 7d5188e8bec3..c83bd581c1d5 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -92,9 +92,6 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
return container_of(kfc, struct rdt_fs_context, kfc);
}

-DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
-DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
-
/**
* struct mon_evt - Entry in the event list of a resource
* @evtid: event id
@@ -452,8 +449,6 @@ extern struct mutex rdtgroup_mutex;

extern struct rdt_hw_resource rdt_resources_all[];
extern struct rdtgroup rdtgroup_default;
-DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
-
extern struct dentry *debugfs_resctrl;

enum resctrl_res_level {
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 5176a85f281c..c6c31efb85ac 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2510,9 +2510,9 @@ static int rdt_get_tree(struct fs_context *fc)
goto out_psl;

if (rdt_alloc_capable)
- static_branch_enable_cpuslocked(&rdt_alloc_enable_key);
+ resctrl_arch_enable_alloc();
if (rdt_mon_capable)
- static_branch_enable_cpuslocked(&rdt_mon_enable_key);
+ resctrl_arch_enable_mon();

if (rdt_alloc_capable || rdt_mon_capable) {
static_branch_enable_cpuslocked(&rdt_enable_key);
@@ -2785,8 +2785,8 @@ static void rdt_kill_sb(struct super_block *sb)
rdt_pseudo_lock_release();
rdtgroup_default.mode = RDT_MODE_SHAREABLE;
schemata_list_destroy();
- static_branch_disable_cpuslocked(&rdt_alloc_enable_key);
- static_branch_disable_cpuslocked(&rdt_mon_enable_key);
+ resctrl_arch_disable_alloc();
+ resctrl_arch_disable_mon();
static_branch_disable_cpuslocked(&rdt_enable_key);
resctrl_mounted = false;
kernfs_kill_sb(sb);
--
2.39.2
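
To illustrate why the helpers matter, here is a sketch of what an architecture without these static keys could provide instead. The helper names are the ones added by this patch. The bodies, the example_* flags and the model_mount()/model_unmount() callers are assumptions made up for the illustration, and everything is written as plain userspace C.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical arch state: ordinary flags rather than static keys. */
static bool example_alloc_enabled;
static bool example_mon_enabled;

static inline void resctrl_arch_enable_alloc(void)
{
	example_alloc_enabled = true;
}

static inline void resctrl_arch_disable_alloc(void)
{
	example_alloc_enabled = false;
}

static inline void resctrl_arch_enable_mon(void)
{
	example_mon_enabled = true;
}

static inline void resctrl_arch_disable_mon(void)
{
	example_mon_enabled = false;
}

/* Mirrors the calls this patch places in rdt_get_tree(). */
static void model_mount(bool alloc_capable, bool mon_capable)
{
	if (alloc_capable)
		resctrl_arch_enable_alloc();
	if (mon_capable)
		resctrl_arch_enable_mon();
}

/* Mirrors the calls this patch places in rdt_kill_sb(). */
static void model_unmount(void)
{
	resctrl_arch_disable_alloc();
	resctrl_arch_disable_mon();
}
```

The callers only ever see the four helper names, so the choice between a static key, a flag or nothing at all stays private to the architecture.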


2023-03-20 17:47:19

by James Morse

Subject: [PATCH v3 10/19] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep

MPAM's cache occupancy counters can take a little while to settle once
the monitor has been configured. The maximum settling time is described
to the driver via a firmware table. The value could be large enough
that it makes sense to sleep.

To avoid exposing this to resctrl, it should be hidden behind MPAM's
resctrl_arch_rmid_read(). But add_rmid_to_limbo() calls
resctrl_arch_rmid_read() from a non-preemptible context.

add_rmid_to_limbo() is opportunistically reading the L3 occupancy counter
on this domain to avoid adding the RMID to limbo if this domain's value
has drifted below resctrl_rmid_realloc_threshold since the limbo handler
last ran. Determining 'this domain' involves disabling preeption to
prevent the thread being migrated to CPUs in a different domain between
the check and resctrl_arch_rmid_read() call. The check is skipped
for all remote domains.

Instead, call resctrl_arch_rmid_read() for each domain, and get it to
read the arch specific counter via IPI if it's called on a CPU outside
the target domain. By covering remote domains, this change stops the
limbo handler from being started unnecessarily if a remote domain is
below the threshold.

This also allows resctrl_arch_rmid_read() to sleep.

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
The alternative is to remove the counter read from this path altogether,
and assume user-space would never try to re-allocate the last RMID before
the limbo handler runs next.
---
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 12 +-----
arch/x86/kernel/cpu/resctrl/monitor.c | 48 +++++++++++++++--------
2 files changed, 33 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b06e86839d00..9161bc95eea7 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -543,16 +543,8 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
rr->val = 0;
rr->first = first;

- cpu = get_cpu();
- if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
- mon_event_count(rr);
- put_cpu();
- } else {
- put_cpu();
-
- cpu = cpumask_any_housekeeping(&d->cpu_mask);
- smp_call_on_cpu(cpu, mon_event_count, rr, false);
- }
+ cpu = cpumask_any_housekeeping(&d->cpu_mask);
+ smp_call_on_cpu(cpu, mon_event_count, rr, false);
}

int rdtgroup_mondata_show(struct seq_file *m, void *arg)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 5e9e876c3409..de72df06b37b 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -253,22 +253,42 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
return chunks >> shift;
}

+struct __rmid_read_arg
+{
+ u32 rmid;
+ enum resctrl_event_id eventid;
+
+ u64 msr_val;
+ int err;
+};
+
+static void smp_call_rmid_read(void *_arg)
+{
+ struct __rmid_read_arg *arg = _arg;
+
+ arg->err = __rmid_read(arg->rmid, arg->eventid, &arg->msr_val);
+}
+
int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, u32 rmid, enum resctrl_event_id eventid,
u64 *val)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct __rmid_read_arg arg;
struct arch_mbm_state *am;
u64 msr_val, chunks;
- int ret;
+ int err;

- if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
- return -EINVAL;
+ arg.rmid = rmid;
+ arg.eventid = eventid;

- ret = __rmid_read(rmid, eventid, &msr_val);
- if (ret)
- return ret;
+ err = smp_call_function_any(&d->cpu_mask, smp_call_rmid_read, &arg, true);
+ if (err)
+ return err;
+ if (arg.err)
+ return arg.err;
+ msr_val = arg.msr_val;

am = get_arch_mbm_state(hw_dom, rmid, eventid);
if (am) {
@@ -424,23 +444,18 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
struct rdt_domain *d;
- int cpu, err;
u64 val = 0;
u32 idx;
+ int err;

idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);

entry->busy = 0;
- cpu = get_cpu();
list_for_each_entry(d, &r->domains, list) {
- if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
- err = resctrl_arch_rmid_read(r, d, entry->closid,
- entry->rmid,
- QOS_L3_OCCUP_EVENT_ID,
- &val);
- if (err || val <= resctrl_rmid_realloc_threshold)
- continue;
- }
+ err = resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
+ QOS_L3_OCCUP_EVENT_ID, &val);
+ if (err || val <= resctrl_rmid_realloc_threshold)
+ continue;

/*
* For the first limbo RMID in the domain,
@@ -451,7 +466,6 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
set_bit(idx, d->rmid_busy_llc);
entry->busy++;
}
- put_cpu();

if (entry->busy)
rmid_limbo_count++;
--
2.39.2
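
The argument-bundling pattern used by smp_call_rmid_read() above can be tried outside the kernel. In this sketch every example_* name is hypothetical: example_call_any() merely calls the function directly where the patch uses smp_call_function_any(), and example_counter_read() stands in for __rmid_read() with made-up values. The point is the two separate error channels, one for the cross-call itself and one carried back inside the argument structure.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Inputs and outputs travel together through a single void pointer. */
struct example_read_arg {
	uint32_t rmid;		/* input */
	uint64_t val;		/* output */
	int err;		/* output: result of the read itself */
};

/* Hypothetical counter; stands in for __rmid_read(). */
static int example_counter_read(uint32_t rmid, uint64_t *val)
{
	if (rmid >= 4)
		return -EINVAL;
	*val = 100u * rmid;
	return 0;
}

/* Callback with the void (*)(void *) shape a cross-call expects. */
static void example_read_cb(void *_arg)
{
	struct example_read_arg *arg = _arg;

	arg->err = example_counter_read(arg->rmid, &arg->val);
}

/* Stand-in for smp_call_function_any(): runs @fn right here. */
static int example_call_any(void (*fn)(void *), void *info)
{
	fn(info);
	return 0;
}

static int example_rmid_read(uint32_t rmid, uint64_t *val)
{
	struct example_read_arg arg = { .rmid = rmid };
	int err;

	err = example_call_any(example_read_cb, &arg);
	if (err)
		return err;	/* the cross-call could not be made */
	if (arg.err)
		return arg.err;	/* the call ran, but the read failed */

	*val = arg.val;
	return 0;
}
```

As in the patch, the caller's output is only written once both checks have passed, so a failed read leaves *val untouched.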


2023-03-20 17:47:22

by James Morse

Subject: [PATCH v3 17/19] x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but cpu

When a CPU is taken offline resctrl may need to move the overflow or
limbo handlers to run on a different CPU.

Once the offline callbacks have been split, cqm_setup_limbo_handler()
will be called while the CPU that is going offline is still present
in the cpu_mask.

Pass the CPU to exclude to cqm_setup_limbo_handler() and
mbm_setup_overflow_handler(). These functions can use a variant of
cpumask_any_but() when selecting the CPU. -1 is used to indicate no CPUs
need excluding.

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
Changes since v2:
* Rephrased a comment to avoid a two letter bad-word. (we)
* Avoid assigning mbm_work_cpu if the domain is going to be free()d
* Added cpumask_any_housekeeping_but(), I dislike the name
---
arch/x86/kernel/cpu/resctrl/core.c | 8 +++--
arch/x86/kernel/cpu/resctrl/internal.h | 37 ++++++++++++++++++++--
arch/x86/kernel/cpu/resctrl/monitor.c | 43 +++++++++++++++++++++-----
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 6 ++--
include/linux/resctrl.h | 3 ++
5 files changed, 83 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 8e25ea49372e..aafe4b74587c 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -582,12 +582,16 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
cancel_delayed_work(&d->mbm_over);
- mbm_setup_overflow_handler(d, 0);
+ /*
+ * exclude_cpu=-1 as this CPU has already been removed
+ * by cpumask_clear_cpu()
+ */
+ mbm_setup_overflow_handler(d, 0, RESCTRL_PICK_ANY_CPU);
}
if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
has_busy_rmid(r, d)) {
cancel_delayed_work(&d->cqm_limbo);
- cqm_setup_limbo_handler(d, 0);
+ cqm_setup_limbo_handler(d, 0, RESCTRL_PICK_ANY_CPU);
}
}
}
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 3eb5b307b809..47838ba6876e 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -78,6 +78,37 @@ static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
return cpu;
}

+/**
+ * cpumask_any_housekeeping_but() - Choose any cpu in @mask, preferring those
+ * that aren't marked nohz_full, excluding
+ * the provided CPU
+ * @mask: The mask to pick a CPU from.
+ * @exclude_cpu:The CPU to avoid picking.
+ *
+ * Returns a CPU from @mask, but not @exclude_cpu. If there are housekeeping CPUs that
+ * don't use nohz_full, these are preferred.
+ * Returns >= nr_cpu_ids if no CPUs are available.
+ */
+static inline unsigned int
+cpumask_any_housekeeping_but(const struct cpumask *mask, int exclude_cpu)
+{
+ int cpu, hk_cpu;
+
+ cpu = cpumask_any_but(mask, exclude_cpu);
+ if (tick_nohz_full_cpu(cpu)) {
+ hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
+ if (hk_cpu == exclude_cpu) {
+ hk_cpu = cpumask_nth_andnot(1, mask,
+ tick_nohz_full_mask);
+ }
+
+ if (hk_cpu < nr_cpu_ids)
+ cpu = hk_cpu;
+ }
+
+ return cpu;
+}
+
struct rdt_fs_context {
struct kernfs_fs_context kfc;
bool enable_cdpl2;
@@ -564,11 +595,13 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_domain *d, struct rdtgroup *rdtgrp,
int evtid, int first);
void mbm_setup_overflow_handler(struct rdt_domain *dom,
- unsigned long delay_ms);
+ unsigned long delay_ms,
+ int exclude_cpu);
void mbm_handle_overflow(struct work_struct *work);
void __init intel_rdt_mbm_apply_quirk(void);
bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
+void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
+ int exclude_cpu);
void cqm_handle_limbo(struct work_struct *work);
bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
void __check_limbo(struct rdt_domain *d, bool force_free);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f0f2e61b15d5..11fa5d79c81d 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -477,7 +477,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
* setup up the limbo worker.
*/
if (!has_busy_rmid(r, d))
- cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL);
+ cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL, -1);
set_bit(idx, d->rmid_busy_llc);
entry->busy++;
}
@@ -812,15 +812,28 @@ void cqm_handle_limbo(struct work_struct *work)
mutex_unlock(&rdtgroup_mutex);
}

-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
+/**
+ * cqm_setup_limbo_handler() - Schedule the limbo handler to run for this
+ * domain.
+ * @delay_ms: How far in the future the handler should run.
+ * @exclude_cpu: Which CPU the handler should not run on, -1 to pick any CPU.
+ */
+void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
+ int exclude_cpu)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;

- cpu = cpumask_any_housekeeping(&dom->cpu_mask);
- dom->cqm_work_cpu = cpu;
+ if (exclude_cpu == RESCTRL_PICK_ANY_CPU)
+ cpu = cpumask_any_housekeeping(&dom->cpu_mask);
+ else
+ cpu = cpumask_any_housekeeping_but(&dom->cpu_mask,
+ exclude_cpu);

- schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
+ if (cpu < nr_cpu_ids) {
+ dom->cqm_work_cpu = cpu;
+ schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
+ }
}

void mbm_handle_overflow(struct work_struct *work)
@@ -862,7 +875,14 @@ void mbm_handle_overflow(struct work_struct *work)
mutex_unlock(&rdtgroup_mutex);
}

-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
+/**
+ * mbm_setup_overflow_handler() - Schedule the overflow handler to run for this
+ * domain.
+ * @delay_ms: How far in the future the handler should run.
+ * @exclude_cpu: Which CPU the handler should not run on, -1 to pick any CPU.
+ */
+void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms,
+ int exclude_cpu)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;
@@ -870,9 +890,16 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
if (!resctrl_mounted || !resctrl_arch_mon_capable())
return;

- cpu = cpumask_any_housekeeping(&dom->cpu_mask);
+ if (exclude_cpu == -1)
+ cpu = cpumask_any_housekeeping(&dom->cpu_mask);
+ else
+ cpu = cpumask_any_housekeeping_but(&dom->cpu_mask,
+ exclude_cpu);
+
dom->mbm_work_cpu = cpu;
- schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
+
+ if (cpu < nr_cpu_ids)
+ schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
}

static int dom_data_init(struct rdt_resource *r)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 410b2b451c30..bf206bdb21ee 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2520,7 +2520,8 @@ static int rdt_get_tree(struct fs_context *fc)
if (is_mbm_enabled()) {
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
list_for_each_entry(dom, &r->domains, list)
- mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
+ mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL,
+ RESCTRL_PICK_ANY_CPU);
}

goto out;
@@ -3686,7 +3687,8 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)

if (is_mbm_enabled()) {
INIT_DELAYED_WORK(&d->mbm_over, mbm_handle_overflow);
- mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL);
+ mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL,
+ RESCTRL_PICK_ANY_CPU);
}

if (is_llc_occupancy_enabled())
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 5a66d034aa61..3ea7d618f33f 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -9,6 +9,9 @@
/* CLOSID value used by the default control group */
#define RESCTRL_RESERVED_CLOSID 0

+/* Indicates no CPU needs to be excluded */
+#define RESCTRL_PICK_ANY_CPU -1
+
#ifdef CONFIG_PROC_CPU_RESCTRL

int proc_resctrl_show(struct seq_file *m,
--
2.39.2
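
The selection policy described for cpumask_any_housekeeping_but() can be sketched with 64-bit masks in userspace. This is a simplified variant and not the code in the patch: it clears the excluded CPU from the candidate set up front, where the patch asks for the 0th and then the 1st housekeeping CPU. All example_* names and the 64-CPU limit are assumptions for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define EXAMPLE_NR_CPUS 64

/* n-th (0-based) set bit of @mask, or EXAMPLE_NR_CPUS if there are fewer. */
static unsigned int example_nth_cpu(uint64_t mask, unsigned int n)
{
	for (unsigned int cpu = 0; cpu < EXAMPLE_NR_CPUS; cpu++) {
		if (!(mask & (UINT64_C(1) << cpu)))
			continue;
		if (n-- == 0)
			return cpu;
	}
	return EXAMPLE_NR_CPUS;
}

/*
 * Pick a CPU from @mask other than @exclude_cpu, preferring CPUs that are
 * not in @nohz_full. A negative @exclude_cpu excludes nothing.
 * Returns EXAMPLE_NR_CPUS if no CPU is available.
 */
static unsigned int example_any_housekeeping_but(uint64_t mask,
						 uint64_t nohz_full,
						 int exclude_cpu)
{
	uint64_t candidates = mask;
	unsigned int cpu, hk_cpu;

	if (exclude_cpu >= 0 && exclude_cpu < EXAMPLE_NR_CPUS)
		candidates &= ~(UINT64_C(1) << exclude_cpu);

	cpu = example_nth_cpu(candidates, 0);
	if (cpu < EXAMPLE_NR_CPUS && (nohz_full & (UINT64_C(1) << cpu))) {
		/* First choice is nohz_full: look for a housekeeping CPU. */
		hk_cpu = example_nth_cpu(candidates & ~nohz_full, 0);
		if (hk_cpu < EXAMPLE_NR_CPUS)
			cpu = hk_cpu;
	}

	return cpu;
}
```

When every remaining CPU is nohz_full the function still returns one of them, which mirrors the "preferring" wording in the kernel-doc comment above. Callers have to check the result against the CPU limit before scheduling work, as cqm_setup_limbo_handler() does with nr_cpu_ids.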


2023-03-20 17:47:36

by James Morse

Subject: [PATCH v3 04/19] x86/resctrl: Move rmid allocation out of mkdir_rdt_prepare()

RMID are allocated for each monitor or control group directory, because
each of these needs its own RMID. For control groups,
rdtgroup_mkdir_ctrl_mon() later goes on to allocate the CLOSID.

MPAM's equivalent of RMID is not an independent number, so can't be
allocated until the CLOSID is known. An RMID allocation for one CLOSID
may fail, whereas another may succeed depending on how many monitor
groups a control group has.

The RMID allocation needs to move to be after the CLOSID has been
allocated.

Move the RMID allocation out of mkdir_rdt_prepare() to occur in its caller,
after the mkdir_rdt_prepare() call. This allows the RMID allocator to
know the CLOSID.

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
Changes since v2:
* Moved kernfs_activate() later to preserve atomicity of files being visible
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 35 +++++++++++++++++++-------
1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b785beb0db26..16c8ca135b37 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3159,6 +3159,12 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
return 0;
}

+static void mkdir_rdt_prepare_rmid_free(struct rdtgroup *rgrp)
+{
+ if (rdt_mon_capable)
+ free_rmid(rgrp->closid, rgrp->mon.rmid);
+}
+
static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
const char *name, umode_t mode,
enum rdt_group_type rtype, struct rdtgroup **r)
@@ -3224,12 +3230,6 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
goto out_destroy;
}

- ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp);
- if (ret)
- goto out_destroy;
-
- kernfs_activate(kn);
-
/*
* The caller unlocks the parent_kn upon success.
*/
@@ -3248,7 +3248,6 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
static void mkdir_rdt_prepare_clean(struct rdtgroup *rgrp)
{
kernfs_remove(rgrp->kn);
- free_rmid(rgrp->closid, rgrp->mon.rmid);
rdtgroup_remove(rgrp);
}

@@ -3270,12 +3269,21 @@ static int rdtgroup_mkdir_mon(struct kernfs_node *parent_kn,
prgrp = rdtgrp->mon.parent;
rdtgrp->closid = prgrp->closid;

+ ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp);
+ if (ret) {
+ mkdir_rdt_prepare_clean(rdtgrp);
+ goto out_unlock;
+ }
+
+ kernfs_activate(rdtgrp->kn);
+
/*
* Add the rdtgrp to the list of rdtgrps the parent
* ctrl_mon group has to track.
*/
list_add_tail(&rdtgrp->mon.crdtgrp_list, &prgrp->mon.crdtgrp_list);

+out_unlock:
rdtgroup_kn_unlock(parent_kn);
return ret;
}
@@ -3306,10 +3314,17 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
ret = 0;

rdtgrp->closid = closid;
- ret = rdtgroup_init_alloc(rdtgrp);
- if (ret < 0)
+
+ ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp);
+ if (ret)
goto out_id_free;

+ kernfs_activate(rdtgrp->kn);
+
+ ret = rdtgroup_init_alloc(rdtgrp);
+ if (ret < 0)
+ goto out_rmid_free;
+
list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups);

if (rdt_mon_capable) {
@@ -3328,6 +3343,8 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,

out_del_list:
list_del(&rdtgrp->rdtgroup_list);
+out_rmid_free:
+ mkdir_rdt_prepare_rmid_free(rdtgrp);
out_id_free:
closid_free(closid);
out_common_fail:
--
2.39.2


2023-03-20 17:47:49

by James Morse

Subject: [PATCH v3 06/19] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID

MPAM's PMG bits extend its PARTID space, meaning the same PMG value can be
used for different control groups.

This means once a CLOSID is allocated, all its monitoring ids may still be
dirty, and held in limbo.

Add a helper to allow the CLOSID allocator to check if a CLOSID has dirty
RMID values. This behaviour is enabled by a kconfig option selected by
the architecture, which avoids a pointless search for x86.
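
As a rough user-space illustration of the allocator behaviour described
above (not the kernel code): the sketch below models the free map as a
bitmask and the limbo state as a small table. All sketch_* names, the
table sizes, and the runtime flag standing in for the kconfig option are
assumptions made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NUM_CLOSID 8
#define SKETCH_NUM_PMG    2	/* monitoring ids per CLOSID (assumed) */

/* Bit n set means CLOSID n is free. */
static unsigned long sketch_closid_free_map = (1UL << SKETCH_NUM_CLOSID) - 1;

/* true while a (CLOSID, PMG) pair is still dirty and held in limbo. */
static bool sketch_limbo[SKETCH_NUM_CLOSID][SKETCH_NUM_PMG];

/* Runtime stand-in for the kconfig option selected by the architecture. */
static bool sketch_rmid_depends_on_closid = true;

static bool sketch_closid_is_dirty(uint32_t closid)
{
	assert(closid < SKETCH_NUM_CLOSID);

	/* x86-like case: RMID are independent, so skip the search. */
	if (!sketch_rmid_depends_on_closid)
		return false;

	for (int pmg = 0; pmg < SKETCH_NUM_PMG; pmg++) {
		if (sketch_limbo[closid][pmg])
			return true;
	}

	return false;
}

static int sketch_closid_alloc(void)
{
	for (uint32_t closid = 0; closid < SKETCH_NUM_CLOSID; closid++) {
		if (!(sketch_closid_free_map & (1UL << closid)))
			continue;	/* already allocated */
		if (sketch_closid_is_dirty(closid))
			continue;	/* free, but its monitoring ids are dirty */

		sketch_closid_free_map &= ~(1UL << closid);
		return (int)closid;
	}

	return -ENOSPC;
}
```

With the flag cleared, the dirty check returns early and the allocator
degenerates to "first free bit", which is the x86 behaviour the commit
message refers to.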

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>

---
Changes since v1:
 * Removed superfluous IS_ENABLED().

Changes since v2:
* Reworded comment over resctrl_closid_is_dirty() to reflect this is all RMID.
---
arch/x86/kernel/cpu/resctrl/internal.h | 1 +
arch/x86/kernel/cpu/resctrl/monitor.c | 36 ++++++++++++++++++++++++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 17 +++++++-----
3 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index e11d9ce943d3..87545e4beb70 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -534,6 +534,7 @@ int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
int closids_supported(void);
+bool resctrl_closid_is_dirty(u32 closid);
void closid_free(int closid);
int alloc_rmid(u32 closid);
void free_rmid(u32 closid, u32 rmid);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ca58a433c668..a2ae4be4b2ba 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -363,6 +363,42 @@ static struct rmid_entry *resctrl_find_free_rmid(u32 closid)
return ERR_PTR(-ENOSPC);
}

+/**
+ * resctrl_closid_is_dirty - Determine if all RMID associated with this CLOSID
+ * are available.
+ * @closid: The CLOSID that is being queried.
+ *
+ * MPAM's equivalent of RMID are per-CLOSID, meaning a freshly allocated CLOSID
+ * may not be able to allocate clean RMID. To avoid this the allocator will
+ * only return clean CLOSID. This is enough for now as it allows MPAM systems
+ * to use resctrl. This suffers from the problem that there may be no CLOSID
+ * where all the RMID are clean, causing the CLOSID allocation to fail.
+ * This can be improved (once MPAM support is upstream) to return the cleanest
+ * CLOSID where PMG=0 is clean. This would allow the CLOSID allocation to
+ * succeed, but subsequent monitor-group allocations may fail.
+ */
+bool resctrl_closid_is_dirty(u32 closid)
+{
+ struct rmid_entry *entry;
+ int i;
+
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
+ return false;
+
+ for (i = 0; i < resctrl_arch_system_num_rmid_idx(); i++) {
+ entry = &rmid_ptrs[i];
+ if (entry->closid != closid)
+ continue;
+
+ if (entry->busy)
+ return true;
+ }
+
+ return false;
+}
+
/*
* For MPAM the RMID value is not unique, and has to be considered with
* the CLOSID. The (CLOSID, RMID) pair is allocated on all domains, which
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index bcd27610bb77..e741bc47bae9 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -93,7 +93,7 @@ void rdt_last_cmd_printf(const char *fmt, ...)
* - Our choices on how to configure each resource become progressively more
* limited as the number of resources grows.
*/
-static int closid_free_map;
+static unsigned long closid_free_map;
static int closid_free_map_len;

int closids_supported(void)
@@ -119,14 +119,17 @@ static void closid_init(void)

static int closid_alloc(void)
{
- u32 closid = ffs(closid_free_map);
+ u32 closid;

- if (closid == 0)
- return -ENOSPC;
- closid--;
- closid_free_map &= ~(1 << closid);
+ for_each_set_bit(closid, &closid_free_map, closid_free_map_len) {
+ if (resctrl_closid_is_dirty(closid))
+ continue;

- return closid;
+ clear_bit(closid, &closid_free_map);
+ return closid;
+ }
+
+ return -ENOSPC;
}

void closid_free(int closid)
--
2.39.2


2023-03-20 17:47:53

by James Morse

Subject: [PATCH v3 18/19] x86/resctrl: Add cpu offline callback for resctrl work

The resctrl architecture specific code may need to free a domain when
a CPU goes offline. It also needs to reset the CPU's PQR_ASSOC register.
The resctrl filesystem code needs to move the overflow and limbo work
to run on a different CPU, and clear this CPU from the cpu_mask of
control and monitor groups.

Currently this is all done in core.c and called from
resctrl_offline_cpu(), making the split between architecture and
filesystem code unclear.

Move the filesystem work into a filesystem helper called
resctrl_offline_cpu(), and rename the one in core.c
resctrl_arch_offline_cpu().

The rdtgroup_mutex is unlocked and locked again in the call in
preparation for changing the locking rules for the architecture
code.

resctrl_offline_cpu() is called before any of the resource/domains
are updated, and makes use of the exclude_cpu feature that was
previously added.

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
arch/x86/kernel/cpu/resctrl/core.c | 41 ++++----------------------
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 39 ++++++++++++++++++++++++
include/linux/resctrl.h | 1 +
3 files changed, 45 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index aafe4b74587c..4e5fc89dab6d 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -578,22 +578,6 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)

return;
}
-
- if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
- if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
- cancel_delayed_work(&d->mbm_over);
- /*
- * exclude_cpu=-1 as this CPU has already been removed
- * by cpumask_clear_cpu()d
- */
- mbm_setup_overflow_handler(d, 0, RESCTRL_PICK_ANY_CPU);
- }
- if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
- has_busy_rmid(r, d)) {
- cancel_delayed_work(&d->cqm_limbo);
- cqm_setup_limbo_handler(d, 0, RESCTRL_PICK_ANY_CPU);
- }
- }
}

static void clear_closid_rmid(int cpu)
@@ -623,31 +607,15 @@ static int resctrl_arch_online_cpu(unsigned int cpu)
return err;
}

-static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
+static int resctrl_arch_offline_cpu(unsigned int cpu)
{
- struct rdtgroup *cr;
-
- list_for_each_entry(cr, &r->mon.crdtgrp_list, mon.crdtgrp_list) {
- if (cpumask_test_and_clear_cpu(cpu, &cr->cpu_mask)) {
- break;
- }
- }
-}
-
-static int resctrl_offline_cpu(unsigned int cpu)
-{
- struct rdtgroup *rdtgrp;
struct rdt_resource *r;

mutex_lock(&rdtgroup_mutex);
+ resctrl_offline_cpu(cpu);
+
for_each_capable_rdt_resource(r)
domain_remove_cpu(cpu, r);
- list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
- if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
- clear_childcpus(rdtgrp, cpu);
- break;
- }
- }
clear_closid_rmid(cpu);
mutex_unlock(&rdtgroup_mutex);

@@ -970,7 +938,8 @@ static int __init resctrl_late_init(void)

state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
"x86/resctrl/cat:online:",
- resctrl_arch_online_cpu, resctrl_offline_cpu);
+ resctrl_arch_online_cpu,
+ resctrl_arch_offline_cpu);
if (state < 0)
return state;

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index bf206bdb21ee..c27ec56c6c60 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3710,6 +3710,45 @@ int resctrl_online_cpu(unsigned int cpu)
return 0;
}

+static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
+{
+ struct rdtgroup *cr;
+
+ list_for_each_entry(cr, &r->mon.crdtgrp_list, mon.crdtgrp_list) {
+ if (cpumask_test_and_clear_cpu(cpu, &cr->cpu_mask))
+ break;
+ }
+}
+
+void resctrl_offline_cpu(unsigned int cpu)
+{
+ struct rdt_domain *d;
+ struct rdtgroup *rdtgrp;
+ struct rdt_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
+ if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
+ clear_childcpus(rdtgrp, cpu);
+ break;
+ }
+ }
+
+ d = get_domain_from_cpu(cpu, l3);
+ if (d) {
+ if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
+ cancel_delayed_work(&d->mbm_over);
+ mbm_setup_overflow_handler(d, 0, cpu);
+ }
+ if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
+ has_busy_rmid(l3, d)) {
+ cancel_delayed_work(&d->cqm_limbo);
+ cqm_setup_limbo_handler(d, 0, cpu);
+ }
+ }
+}
+
/*
* rdtgroup_init - rdtgroup initialization
*
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 3ea7d618f33f..f053527aaa5b 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -226,6 +226,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
int resctrl_online_cpu(unsigned int cpu);
+void resctrl_offline_cpu(unsigned int cpu);

/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
--
2.39.2


2023-03-20 17:47:57

by James Morse

Subject: [PATCH v3 12/19] x86/resctrl: Make resctrl_mounted checks explicit

The rdt_enable_key is switched when resctrl is mounted, and used to
prevent a second mount of the filesystem. It also enables the
architecture's context switch code.

This requires another architecture to have the same set of static-keys,
as resctrl depends on them too.

Make the resctrl_mounted checks explicit: resctrl can keep track of
whether it has been mounted once. This doesn't need to be combined with
whether the arch code is context switching the CLOSID.
Tests against the rdt_mon_enable_key become a test that resctrl is
mounted and that monitoring is enabled.

This will allow the static-key changing to be moved behind resctrl_arch_
calls.
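
A minimal user-space sketch of the idea, assuming a plain bool is enough
to track the mount state. The sketch_* names are invented for this
example and the monitoring flag only stands in for whatever the arch code
reports:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Tracks whether the filesystem is mounted, independent of any arch key. */
static bool sketch_mounted;

/* Stand-in for "monitoring is enabled". */
static bool sketch_mon_enabled;

static int sketch_mount(bool mon_capable)
{
	/* The filesystem can only be mounted once. */
	if (sketch_mounted)
		return -EBUSY;

	sketch_mon_enabled = mon_capable;
	sketch_mounted = true;
	return 0;
}

static void sketch_unmount(void)
{
	assert(sketch_mounted);

	sketch_mon_enabled = false;
	sketch_mounted = false;
}

/* Former tests of the monitoring key become "mounted and monitoring". */
static bool sketch_overflow_should_run(void)
{
	return sketch_mounted && sketch_mon_enabled;
}
```

The point of the split is that the second-mount check no longer needs to
know how, or whether, the architecture patches its context switch path.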

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 1 +
arch/x86/kernel/cpu/resctrl/monitor.c | 5 +++--
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 17 +++++++++++------
3 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 7262b355e128..7d5188e8bec3 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -142,6 +142,7 @@ extern bool rdt_alloc_capable;
extern bool rdt_mon_capable;
extern unsigned int rdt_mon_features;
extern struct list_head resctrl_schema_all;
+extern bool resctrl_mounted;

enum rdt_group_type {
RDTCTRL_GROUP = 0,
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f38cd2f12285..6279f5c98b39 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -834,7 +834,7 @@ void mbm_handle_overflow(struct work_struct *work)

mutex_lock(&rdtgroup_mutex);

- if (!static_branch_likely(&rdt_mon_enable_key))
+ if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key))
goto out_unlock;

r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
@@ -867,8 +867,9 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;

- if (!static_branch_likely(&rdt_mon_enable_key))
+ if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key))
return;
+
cpu = cpumask_any_housekeeping(&dom->cpu_mask);
dom->mbm_work_cpu = cpu;
schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 2306fbc9a9bb..5176a85f281c 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -42,6 +42,9 @@ LIST_HEAD(rdt_all_groups);
/* list of entries for the schemata file */
LIST_HEAD(resctrl_schema_all);

+/* the filesystem can only be mounted once */
+bool resctrl_mounted;
+
/* Kernel fs node for "info" directory under root */
static struct kernfs_node *kn_info;

@@ -796,7 +799,7 @@ int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns,
mutex_lock(&rdtgroup_mutex);

/* Return empty if resctrl has not been mounted. */
- if (!static_branch_unlikely(&rdt_enable_key)) {
+ if (!resctrl_mounted) {
seq_puts(s, "res:\nmon:\n");
goto unlock;
}
@@ -2463,7 +2466,7 @@ static int rdt_get_tree(struct fs_context *fc)
/*
* resctrl file system can only be mounted once.
*/
- if (static_branch_unlikely(&rdt_enable_key)) {
+ if (resctrl_mounted) {
ret = -EBUSY;
goto out;
}
@@ -2511,8 +2514,10 @@ static int rdt_get_tree(struct fs_context *fc)
if (rdt_mon_capable)
static_branch_enable_cpuslocked(&rdt_mon_enable_key);

- if (rdt_alloc_capable || rdt_mon_capable)
+ if (rdt_alloc_capable || rdt_mon_capable) {
static_branch_enable_cpuslocked(&rdt_enable_key);
+ resctrl_mounted = true;
+ }

if (is_mbm_enabled()) {
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
@@ -2783,6 +2788,7 @@ static void rdt_kill_sb(struct super_block *sb)
static_branch_disable_cpuslocked(&rdt_alloc_enable_key);
static_branch_disable_cpuslocked(&rdt_mon_enable_key);
static_branch_disable_cpuslocked(&rdt_enable_key);
+ resctrl_mounted = false;
kernfs_kill_sb(sb);
mutex_unlock(&rdtgroup_mutex);
cpus_read_unlock();
@@ -3610,7 +3616,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
* If resctrl is mounted, remove all the
* per domain monitor data directories.
*/
- if (static_branch_unlikely(&rdt_mon_enable_key))
+ if (resctrl_mounted && static_branch_unlikely(&rdt_mon_enable_key))
rmdir_mondata_subdir_allrdtgrp(r, d->id);

if (is_mbm_enabled())
@@ -3687,8 +3693,7 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
if (is_llc_occupancy_enabled())
INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);

- /* If resctrl is mounted, add per domain monitor data directories. */
- if (static_branch_unlikely(&rdt_mon_enable_key))
+ if (resctrl_mounted && static_branch_unlikely(&rdt_mon_enable_key))
mkdir_mondata_subdir_allrdtgrp(r, d);

return 0;
--
2.39.2


2023-03-20 17:48:06

by James Morse

Subject: [PATCH v3 09/19] x86/resctrl: Queue mon_event_read() instead of sending an IPI

x86 is blessed with an abundance of monitors, one per RMID, that can be
read from any CPU in the domain. MPAM's monitors reside in the MMIO MSC,
and the number implemented is up to the manufacturer. This means when
there are fewer monitors than needed, they need to be allocated and freed.

Worse, the domain may be broken up into slices, and the MMIO accesses
for each slice may need performing from different CPUs.

These two details mean MPAM's monitor code needs to be able to sleep, and
IPI another CPU in the domain to read from a resource that has been sliced.

mon_event_read() already invokes mon_event_count() via IPI, which means
mon_event_count() cannot sleep. On systems using nohz_full, some CPUs need
to be interrupted to run kernel work as they otherwise stay in user-space
running realtime workloads. Interrupting these CPUs should be avoided,
and work scheduled on them may never complete.

Change mon_event_read() to pick a housekeeping CPU (one that is not using
nohz_full), schedule mon_event_count() on it, and wait. If all the CPUs
in a domain are using nohz_full, then an IPI is used as the fallback.
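
The selection policy can be modelled in user-space roughly as below. This
is only a sketch under assumptions: CPU masks are plain unsigned long
bitmasks, and sketch_pick_cpu() and struct sketch_pick are hypothetical
names, not the helper added by this series.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_pick {
	int cpu;	/* CPU chosen to run the read */
	bool needs_ipi;	/* true when only nohz_full CPUs were available */
};

/*
 * Prefer a CPU in the domain that is not nohz_full. If every CPU in the
 * domain is nohz_full, return one of them and flag that the caller has
 * to fall back to an IPI instead of scheduling work and waiting.
 */
static struct sketch_pick sketch_pick_cpu(unsigned long domain_mask,
					  unsigned long nohz_full_mask)
{
	struct sketch_pick pick = { .cpu = -1, .needs_ipi = false };
	unsigned long housekeeping = domain_mask & ~nohz_full_mask;
	unsigned long from = housekeeping ? housekeeping : domain_mask;

	assert(domain_mask != 0);

	for (int cpu = 0; cpu < (int)(8 * sizeof(from)); cpu++) {
		if (from & (1UL << cpu)) {
			pick.cpu = cpu;
			break;
		}
	}

	pick.needs_ipi = (housekeeping == 0);
	return pick;
}
```

For a domain of CPUs 0-3 where CPUs 0 and 1 are nohz_full, this picks
CPU 2 and no IPI is needed. If the domain is only CPUs 0 and 1, CPU 0 is
returned with needs_ipi set.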

This function is only used in response to a user-space filesystem request
(not the timing sensitive overflow code).

This allows MPAM to hide the slice behaviour from resctrl, and to keep
the monitor-allocation in monitor.c. When the IPI fallback is used on
machines where MPAM needs to make an access on multiple CPUs, the counter
read will always fail.

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
Changes since v2:
* Use cpumask_any_housekeeping() and fallback to an IPI if needed
---
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 19 +++++++++++++++++--
arch/x86/kernel/cpu/resctrl/internal.h | 2 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 6 ++++--
3 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index eb07d4435391..b06e86839d00 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -19,6 +19,7 @@
#include <linux/kernfs.h>
#include <linux/seq_file.h>
#include <linux/slab.h>
+#include <linux/tick.h>
#include "internal.h"

/*
@@ -527,8 +528,13 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_domain *d, struct rdtgroup *rdtgrp,
int evtid, int first)
{
+ int cpu;
+
+ /* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */
+ lockdep_assert_held(&rdtgroup_mutex);
+
/*
- * setup the parameters to send to the IPI to read the data.
+ * setup the parameters to pass to mon_event_count() to read the data.
*/
rr->rgrp = rdtgrp;
rr->evtid = evtid;
@@ -537,7 +543,16 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
rr->val = 0;
rr->first = first;

- smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+ cpu = get_cpu();
+ if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
+ mon_event_count(rr);
+ put_cpu();
+ } else {
+ put_cpu();
+
+ cpu = cpumask_any_housekeeping(&d->cpu_mask);
+ smp_call_on_cpu(cpu, mon_event_count, rr, false);
+ }
}

int rdtgroup_mondata_show(struct seq_file *m, void *arg)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 0b5fd5a0cda2..a07557390895 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -563,7 +563,7 @@ int alloc_rmid(u32 closid);
void free_rmid(u32 closid, u32 rmid);
int rdt_get_mon_l3_config(struct rdt_resource *r);
bool __init rdt_cpu_has(int flag);
-void mon_event_count(void *info);
+int mon_event_count(void *info);
int rdtgroup_mondata_show(struct seq_file *m, void *arg);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_domain *d, struct rdtgroup *rdtgrp,
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 3bec5c59ca0e..5e9e876c3409 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -550,10 +550,10 @@ static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr)
}

/*
- * This is called via IPI to read the CQM/MBM counters
+ * This is scheduled by mon_event_read() to read the CQM/MBM counters
* on a domain.
*/
-void mon_event_count(void *info)
+int mon_event_count(void *info)
{
struct rdtgroup *rdtgrp, *entry;
struct rmid_read *rr = info;
@@ -586,6 +586,8 @@ void mon_event_count(void *info)
*/
if (ret == 0)
rr->err = 0;
+
+ return 0;
}

/*
--
2.39.2


2023-03-20 17:48:10

by James Morse

Subject: [PATCH v3 15/19] x86/resctrl: Add helpers for system wide mon/alloc capable

resctrl reads rdt_alloc_capable or rdt_mon_capable to determine
whether any of the resources support the corresponding features.
resctrl also uses the static-keys that affect the architecture's
context-switch code to determine the same thing.

This forces another architecture to have the same static-keys.

As the static-key is enabled based on the capable flag, and none of
the filesystem uses of these are in the scheduler path, move the
capable flags behind helpers, and use these in the filesystem
code instead of the static-key.

After this change, only the architecture code manages and uses
the static-keys to ensure __resctrl_sched_in() does not need
runtime checks.

This avoids multiple architectures having to define the same
static-keys.

Tested-by: Shaopeng Tan <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>

---
Changes since v1:
* Added missing conversion in mkdir_rdt_prepare_rmid_free()
---
arch/x86/include/asm/resctrl.h | 13 +++++++++
arch/x86/kernel/cpu/resctrl/internal.h | 2 --
arch/x86/kernel/cpu/resctrl/monitor.c | 4 +--
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 6 ++--
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 34 +++++++++++------------
5 files changed, 35 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 147af2b43385..4355245652c9 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -38,10 +38,18 @@ struct resctrl_pqr_state {

DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state);

+extern bool rdt_alloc_capable;
+extern bool rdt_mon_capable;
+
DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);

+static inline bool resctrl_arch_alloc_capable(void)
+{
+ return rdt_alloc_capable;
+}
+
static inline void resctrl_arch_enable_alloc(void)
{
static_branch_enable_cpuslocked(&rdt_alloc_enable_key);
@@ -54,6 +62,11 @@ static inline void resctrl_arch_disable_alloc(void)
static_branch_dec_cpuslocked(&rdt_enable_key);
}

+static inline bool resctrl_arch_mon_capable(void)
+{
+ return rdt_mon_capable;
+}
+
static inline void resctrl_arch_enable_mon(void)
{
static_branch_enable_cpuslocked(&rdt_mon_enable_key);
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index c83bd581c1d5..3eb5b307b809 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -135,8 +135,6 @@ struct rmid_read {
int arch_mon_ctx;
};

-extern bool rdt_alloc_capable;
-extern bool rdt_mon_capable;
extern unsigned int rdt_mon_features;
extern struct list_head resctrl_schema_all;
extern bool resctrl_mounted;
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 6279f5c98b39..f0f2e61b15d5 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -834,7 +834,7 @@ void mbm_handle_overflow(struct work_struct *work)

mutex_lock(&rdtgroup_mutex);

- if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key))
+ if (!resctrl_mounted || !resctrl_arch_mon_capable())
goto out_unlock;

r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
@@ -867,7 +867,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;

- if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key))
+ if (!resctrl_mounted || !resctrl_arch_mon_capable())
return;

cpu = cpumask_any_housekeeping(&dom->cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 3b724a40d3a2..0b4fdb118643 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -567,7 +567,7 @@ static int rdtgroup_locksetup_user_restrict(struct rdtgroup *rdtgrp)
if (ret)
goto err_cpus;

- if (rdt_mon_capable) {
+ if (resctrl_arch_mon_capable()) {
ret = rdtgroup_kn_mode_restrict(rdtgrp, "mon_groups");
if (ret)
goto err_cpus_list;
@@ -614,7 +614,7 @@ static int rdtgroup_locksetup_user_restore(struct rdtgroup *rdtgrp)
if (ret)
goto err_cpus;

- if (rdt_mon_capable) {
+ if (resctrl_arch_mon_capable()) {
ret = rdtgroup_kn_mode_restore(rdtgrp, "mon_groups", 0777);
if (ret)
goto err_cpus_list;
@@ -762,7 +762,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
{
int ret;

- if (rdt_mon_capable) {
+ if (resctrl_arch_mon_capable()) {
ret = alloc_rmid(rdtgrp->closid);
if (ret < 0) {
rdt_last_cmd_puts("Out of RMIDs\n");
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 2ca8981c7d0d..8f319e03b449 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -611,13 +611,13 @@ static int __rdtgroup_move_task(struct task_struct *tsk,

static bool is_closid_match(struct task_struct *t, struct rdtgroup *r)
{
- return (rdt_alloc_capable && (r->type == RDTCTRL_GROUP) &&
+ return (resctrl_arch_alloc_capable() && (r->type == RDTCTRL_GROUP) &&
resctrl_arch_match_closid(t, r->closid));
}

static bool is_rmid_match(struct task_struct *t, struct rdtgroup *r)
{
- return (rdt_mon_capable && (r->type == RDTMON_GROUP) &&
+ return (resctrl_arch_mon_capable() && (r->type == RDTMON_GROUP) &&
resctrl_arch_match_rmid(t, r->mon.parent->closid,
r->mon.rmid));
}
@@ -2487,7 +2487,7 @@ static int rdt_get_tree(struct fs_context *fc)
if (ret < 0)
goto out_schemata_free;

- if (rdt_mon_capable) {
+ if (resctrl_arch_mon_capable()) {
ret = mongroup_create_dir(rdtgroup_default.kn,
&rdtgroup_default, "mon_groups",
&kn_mongrp);
@@ -2509,12 +2509,12 @@ static int rdt_get_tree(struct fs_context *fc)
if (ret < 0)
goto out_psl;

- if (rdt_alloc_capable)
+ if (resctrl_arch_alloc_capable())
resctrl_arch_enable_alloc();
- if (rdt_mon_capable)
+ if (resctrl_arch_mon_capable())
resctrl_arch_enable_mon();

- if (rdt_alloc_capable || rdt_mon_capable)
+ if (resctrl_arch_alloc_capable() || resctrl_arch_mon_capable())
resctrl_mounted = true;

if (is_mbm_enabled()) {
@@ -2528,10 +2528,10 @@ static int rdt_get_tree(struct fs_context *fc)
out_psl:
rdt_pseudo_lock_release();
out_mondata:
- if (rdt_mon_capable)
+ if (resctrl_arch_mon_capable())
kernfs_remove(kn_mondata);
out_mongrp:
- if (rdt_mon_capable)
+ if (resctrl_arch_mon_capable())
kernfs_remove(kn_mongrp);
out_info:
kernfs_remove(kn_info);
@@ -2783,9 +2783,9 @@ static void rdt_kill_sb(struct super_block *sb)
rdt_pseudo_lock_release();
rdtgroup_default.mode = RDT_MODE_SHAREABLE;
schemata_list_destroy();
- if (rdt_alloc_capable)
+ if (resctrl_arch_alloc_capable())
resctrl_arch_disable_alloc();
- if (rdt_mon_capable)
+ if (resctrl_arch_mon_capable())
resctrl_arch_disable_mon();
resctrl_mounted = false;
kernfs_kill_sb(sb);
@@ -3161,7 +3161,7 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
{
int ret;

- if (!rdt_mon_capable)
+ if (!resctrl_arch_mon_capable())
return 0;

ret = alloc_rmid(rdtgrp->closid);
@@ -3183,7 +3183,7 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)

static void mkdir_rdt_prepare_rmid_free(struct rdtgroup *rgrp)
{
- if (rdt_mon_capable)
+ if (resctrl_arch_mon_capable())
free_rmid(rgrp->closid, rgrp->mon.rmid);
}

@@ -3349,7 +3349,7 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,

list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups);

- if (rdt_mon_capable) {
+ if (resctrl_arch_mon_capable()) {
/*
* Create an empty mon_groups directory to hold the subset
* of tasks and cpus to monitor.
@@ -3404,14 +3404,14 @@ static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
* allocation is supported, add a control and monitoring
* subdirectory
*/
- if (rdt_alloc_capable && parent_kn == rdtgroup_default.kn)
+ if (resctrl_arch_alloc_capable() && parent_kn == rdtgroup_default.kn)
return rdtgroup_mkdir_ctrl_mon(parent_kn, name, mode);

/*
* If RDT monitoring is supported and the parent directory is a valid
* "mon_groups" directory, add a monitoring subdirectory.
*/
- if (rdt_mon_capable && is_mon_groups(parent_kn, name))
+ if (resctrl_arch_mon_capable() && is_mon_groups(parent_kn, name))
return rdtgroup_mkdir_mon(parent_kn, name, mode);

return -EPERM;
@@ -3615,7 +3615,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
* If resctrl is mounted, remove all the
* per domain monitor data directories.
*/
- if (resctrl_mounted && static_branch_unlikely(&rdt_mon_enable_key))
+ if (resctrl_mounted && resctrl_arch_mon_capable())
rmdir_mondata_subdir_allrdtgrp(r, d->id);

if (is_mbm_enabled())
@@ -3692,7 +3692,7 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
if (is_llc_occupancy_enabled())
INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);

- if (resctrl_mounted && static_branch_unlikely(&rdt_mon_enable_key))
+ if (resctrl_mounted && resctrl_arch_mon_capable())
mkdir_mondata_subdir_allrdtgrp(r, d);

return 0;
--
2.39.2


2023-03-20 17:48:15

by James Morse

Subject: [PATCH v3 14/19] x86/resctrl: Make rdt_enable_key the arch's decision to switch

rdt_enable_key is switched when resctrl is mounted. It was also previously
used to prevent a second mount of the filesystem.

Any other architecture that wants to support resctrl has to provide
identical static keys.

Now that there are helpers for enabling and disabling the alloc/mon keys,
resctrl doesn't need to switch this extra key; it can be done by the arch
code. Use the static-key increment and decrement helpers, and change
resctrl to ensure the calls are balanced.
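
To show why the calls have to be balanced, here is a small user-space
model that replaces the static keys with a counter and two flags. The
sketch_* names are made up for this example; only the counting behaviour
is meant to mirror the inc/dec helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Counted stand-in for the shared key: enabled while the count is > 0. */
static int sketch_enable_count;
static bool sketch_alloc_key;
static bool sketch_mon_key;

static void sketch_arch_enable_alloc(void)
{
	sketch_alloc_key = true;
	sketch_enable_count++;
}

static void sketch_arch_disable_alloc(void)
{
	assert(sketch_enable_count > 0);	/* unbalanced disable */
	sketch_alloc_key = false;
	sketch_enable_count--;
}

static void sketch_arch_enable_mon(void)
{
	sketch_mon_key = true;
	sketch_enable_count++;
}

static void sketch_arch_disable_mon(void)
{
	assert(sketch_enable_count > 0);	/* unbalanced disable */
	sketch_mon_key = false;
	sketch_enable_count--;
}

static bool sketch_enabled(void)
{
	return sketch_enable_count > 0;
}

/* Mount and unmount must test the same capability flags. */
static void sketch_keys_mount(bool alloc_capable, bool mon_capable)
{
	if (alloc_capable)
		sketch_arch_enable_alloc();
	if (mon_capable)
		sketch_arch_enable_mon();
}

static void sketch_keys_unmount(bool alloc_capable, bool mon_capable)
{
	if (alloc_capable)
		sketch_arch_disable_alloc();
	if (mon_capable)
		sketch_arch_disable_mon();
}
```

An unconditional disable on unmount would decrement the count for a
feature that was never enabled, which is what the capability checks in
the unmount path guard against.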

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
arch/x86/include/asm/resctrl.h | 4 ++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 11 +++++------
2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 5fdfcd5f943e..147af2b43385 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -45,21 +45,25 @@ DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
static inline void resctrl_arch_enable_alloc(void)
{
static_branch_enable_cpuslocked(&rdt_alloc_enable_key);
+ static_branch_inc_cpuslocked(&rdt_enable_key);
}

static inline void resctrl_arch_disable_alloc(void)
{
static_branch_disable_cpuslocked(&rdt_alloc_enable_key);
+ static_branch_dec_cpuslocked(&rdt_enable_key);
}

static inline void resctrl_arch_enable_mon(void)
{
static_branch_enable_cpuslocked(&rdt_mon_enable_key);
+ static_branch_inc_cpuslocked(&rdt_enable_key);
}

static inline void resctrl_arch_disable_mon(void)
{
static_branch_disable_cpuslocked(&rdt_mon_enable_key);
+ static_branch_dec_cpuslocked(&rdt_enable_key);
}

/*
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index c6c31efb85ac..2ca8981c7d0d 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2514,10 +2514,8 @@ static int rdt_get_tree(struct fs_context *fc)
if (rdt_mon_capable)
resctrl_arch_enable_mon();

- if (rdt_alloc_capable || rdt_mon_capable) {
- static_branch_enable_cpuslocked(&rdt_enable_key);
+ if (rdt_alloc_capable || rdt_mon_capable)
resctrl_mounted = true;
- }

if (is_mbm_enabled()) {
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
@@ -2785,9 +2783,10 @@ static void rdt_kill_sb(struct super_block *sb)
rdt_pseudo_lock_release();
rdtgroup_default.mode = RDT_MODE_SHAREABLE;
schemata_list_destroy();
- resctrl_arch_disable_alloc();
- resctrl_arch_disable_mon();
- static_branch_disable_cpuslocked(&rdt_enable_key);
+ if (rdt_alloc_capable)
+ resctrl_arch_disable_alloc();
+ if (rdt_mon_capable)
+ resctrl_arch_disable_mon();
resctrl_mounted = false;
kernfs_kill_sb(sb);
mutex_unlock(&rdtgroup_mutex);
--
2.39.2


2023-03-20 17:48:20

by James Morse

Subject: [PATCH v3 05/19] x86/resctrl: Allow RMID allocation to be scoped by CLOSID

MPAM's RMID values are not unique unless the CLOSID is considered as well.

alloc_rmid() expects the RMID to be an independent number.

Pass the CLOSID in to alloc_rmid(). Use this to compare indexes when
allocating. If the CLOSID is not relevant to the index, this ends up
comparing the free RMID with itself, and the first free entry will be
used. With MPAM the CLOSID is included in the index, so this becomes a
walk of the free RMID entries, until one that matches the supplied
CLOSID is found.
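
The index comparison can be illustrated with the user-space sketch below.
It assumes a flat array in place of the free list, a runtime flag in
place of the per-architecture index encoding, and a fixed number of
monitoring ids per CLOSID. All sketch_* names are hypothetical, and the
-EBUSY case for entries in limbo is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PMG_PER_CLOSID 2	/* assumed, MPAM-like case only */

struct sketch_rmid_entry {
	uint32_t closid;
	uint32_t rmid;
	bool free;
};

/* false: x86-like, index == RMID. true: MPAM-like, CLOSID is included. */
static bool sketch_closid_in_index;

static uint32_t sketch_rmid_idx(uint32_t closid, uint32_t rmid)
{
	if (!sketch_closid_in_index)
		return rmid;

	return closid * SKETCH_PMG_PER_CLOSID + rmid;
}

static int sketch_alloc_rmid(struct sketch_rmid_entry *entries, size_t n,
			     uint32_t closid)
{
	assert(entries != NULL);

	for (size_t i = 0; i < n; i++) {
		struct sketch_rmid_entry *itr = &entries[i];
		uint32_t itr_idx, cmp_idx;

		if (!itr->free)
			continue;

		/*
		 * Index of this free entry, and the index it would have if
		 * it belonged to the requested CLOSID. When the CLOSID is
		 * not part of the index these are always equal, so the
		 * first free entry is taken.
		 */
		itr_idx = sketch_rmid_idx(itr->closid, itr->rmid);
		cmp_idx = sketch_rmid_idx(closid, itr->rmid);

		if (itr_idx == cmp_idx) {
			itr->free = false;
			return (int)itr->rmid;
		}
	}

	return -ENOSPC;
}
```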

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
Changes since v2:
* Rephrased comment in resctrl_find_free_rmid() to describe this in terms of
list_entry_first()
* Rephrased comment above alloc_rmid()
---
arch/x86/kernel/cpu/resctrl/internal.h | 2 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 54 +++++++++++++++++------
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
4 files changed, 43 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 47506e2afd59..e11d9ce943d3 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -535,7 +535,7 @@ void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
int closids_supported(void);
void closid_free(int closid);
-int alloc_rmid(void);
+int alloc_rmid(u32 closid);
void free_rmid(u32 closid, u32 rmid);
int rdt_get_mon_l3_config(struct rdt_resource *r);
bool __init rdt_cpu_has(int flag);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 03a7d13dd653..ca58a433c668 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -337,25 +337,51 @@ bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
return find_first_bit(d->rmid_busy_llc, idx_limit) != idx_limit;
}

-/*
- * As of now the RMIDs allocation is global.
- * However we keep track of which packages the RMIDs
- * are used to optimize the limbo list management.
- */
-int alloc_rmid(void)
+static struct rmid_entry *resctrl_find_free_rmid(u32 closid)
{
- struct rmid_entry *entry;
-
- lockdep_assert_held(&rdtgroup_mutex);
+ struct rmid_entry *itr;
+ u32 itr_idx, cmp_idx;

if (list_empty(&rmid_free_lru))
- return rmid_limbo_count ? -EBUSY : -ENOSPC;
+ return rmid_limbo_count ? ERR_PTR(-EBUSY) : ERR_PTR(-ENOSPC);

- entry = list_first_entry(&rmid_free_lru,
- struct rmid_entry, list);
- list_del(&entry->list);
+ list_for_each_entry(itr, &rmid_free_lru, list) {
+ /*
+ * Get the index of this free RMID, and the index it would need
+ * to be if it were used with this CLOSID.
+ * If the CLOSID is irrelevant on this architecture, these will
+ * always be the same, meaning the compiler can reduce this loop
+ * to a single list_first_entry() call.
+ */
+ itr_idx = resctrl_arch_rmid_idx_encode(itr->closid, itr->rmid);
+ cmp_idx = resctrl_arch_rmid_idx_encode(closid, itr->rmid);

- return entry->rmid;
+ if (itr_idx == cmp_idx)
+ return itr;
+ }
+
+ return ERR_PTR(-ENOSPC);
+}
+
+/*
+ * For MPAM the RMID value is not unique, and has to be considered with
+ * the CLOSID. The (CLOSID, RMID) pair is allocated on all domains, which
+ * allows all domains to be managed by a single limbo list.
+ * Each domain also has a rmid_busy_llc to reduce the work of the limbo handler.
+ */
+int alloc_rmid(u32 closid)
+{
+ struct rmid_entry *entry;
+
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ entry = resctrl_find_free_rmid(closid);
+ if (!IS_ERR(entry)) {
+ list_del(&entry->list);
+ return entry->rmid;
+ }
+
+ return PTR_ERR(entry);
}

static void add_rmid_to_limbo(struct rmid_entry *entry)
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index c51932516965..3b724a40d3a2 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -763,7 +763,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
int ret;

if (rdt_mon_capable) {
- ret = alloc_rmid();
+ ret = alloc_rmid(rdtgrp->closid);
if (ret < 0) {
rdt_last_cmd_puts("Out of RMIDs\n");
return ret;
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 16c8ca135b37..bcd27610bb77 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3142,7 +3142,7 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
if (!rdt_mon_capable)
return 0;

- ret = alloc_rmid();
+ ret = alloc_rmid(rdtgrp->closid);
if (ret < 0) {
rdt_last_cmd_puts("Out of RMIDs\n");
return ret;
--
2.39.2


2023-03-20 17:48:25

by James Morse

Subject: [PATCH v3 11/19] x86/resctrl: Allow arch to allocate memory needed in resctrl_arch_rmid_read()

Depending on the number of monitors available, Arm's MPAM may need to
allocate a monitor prior to reading the counter value. Allocating a
contended resource may involve sleeping.

All callers of resctrl_arch_rmid_read() read the counter on more than
one domain. If the monitor is allocated globally, there is no need to
allocate and free it for each call to resctrl_arch_rmid_read().

Add arch hooks for this allocation, which need calling before
resctrl_arch_rmid_read(). The allocated monitor is passed to
resctrl_arch_rmid_read(), then freed again afterwards. The helper
can be called on any CPU, and can sleep.
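
The intended calling pattern (allocate once, read every domain, free once)
can be sketched in userspace as below. This is only a model, not the arch
code: the monitor pool, the model_* names and the fixed counts are all
hypothetical, and the real MPAM allocation may sleep where this one simply
fails with -EBUSY.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_NUM_MONITORS 2
#define MODEL_NUM_DOMAINS  3

/* Hypothetical pool of hardware monitors shared by all domains. */
struct model_mon_pool {
	int in_use[MODEL_NUM_MONITORS];
};

/* Returns a monitor context (>= 0), or -EBUSY if none is available. */
static int model_mon_ctx_alloc(struct model_mon_pool *pool)
{
	for (int i = 0; i < MODEL_NUM_MONITORS; i++) {
		if (!pool->in_use[i]) {
			pool->in_use[i] = 1;
			return i;
		}
	}

	return -EBUSY;
}

static void model_mon_ctx_free(struct model_mon_pool *pool, int ctx)
{
	assert(ctx >= 0 && ctx < MODEL_NUM_MONITORS);
	pool->in_use[ctx] = 0;
}

/* Stand-in for a per-domain counter read that needs a valid context. */
static int model_rmid_read(const uint64_t *domain_counters, int domain,
			   int ctx, uint64_t *val)
{
	if (ctx < 0)
		return -EINVAL;

	*val = domain_counters[domain];
	return 0;
}

/* Allocate the context once, read every domain with it, then free it. */
static int model_read_all_domains(struct model_mon_pool *pool,
				  const uint64_t *domain_counters,
				  uint64_t *sum)
{
	uint64_t val;
	int ctx, err = 0;

	ctx = model_mon_ctx_alloc(pool);
	if (ctx < 0)
		return ctx;

	*sum = 0;
	for (int d = 0; d < MODEL_NUM_DOMAINS; d++) {
		err = model_rmid_read(domain_counters, d, ctx, &val);
		if (err)
			break;
		*sum += val;
	}

	model_mon_ctx_free(pool, ctx);
	return err;
}
```

Hoisting the allocation out of the per-domain loop is the point of the
model: the contended resource is claimed once per operation rather than
once per read.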

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
arch/x86/include/asm/resctrl.h | 11 +++++++
arch/x86/kernel/cpu/resctrl/internal.h | 1 +
arch/x86/kernel/cpu/resctrl/monitor.c | 40 +++++++++++++++++++++++---
include/linux/resctrl.h | 4 +--
4 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 752123b0ce40..1c87f1626456 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -136,6 +136,17 @@ static inline u32 resctrl_arch_rmid_idx_encode(u32 ignored, u32 rmid)
return rmid;
}

+/* x86 can always read an rmid, nothing needs allocating */
+struct rdt_resource;
+static inline int resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, int evtid)
+{
+ might_sleep();
+ return 0;
+}
+
+static inline void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid,
+ int ctx) { }
+
void resctrl_cpu_detect(struct cpuinfo_x86 *c);

#else
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a07557390895..7262b355e128 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -135,6 +135,7 @@ struct rmid_read {
bool first;
int err;
u64 val;
+ int arch_mon_ctx;
};

extern bool rdt_alloc_capable;
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index de72df06b37b..f38cd2f12285 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -15,6 +15,7 @@
* Software Developer Manual June 2016, volume 3, section 17.17.
*/

+#include <linux/cpu.h>
#include <linux/module.h>
#include <linux/sizes.h>
#include <linux/slab.h>
@@ -271,7 +272,7 @@ static void smp_call_rmid_read(void *_arg)

int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, u32 rmid, enum resctrl_event_id eventid,
- u64 *val)
+ u64 *val, int ignored)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
@@ -317,9 +318,14 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
struct rmid_entry *entry;
u32 idx, cur_idx = 1;
+ int arch_mon_ctx;
bool rmid_dirty;
u64 val = 0;

+ arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID);
+ if (arch_mon_ctx < 0)
+ return;
+
/*
* Skip RMID 0 and start from RMID 1 and check all the RMIDs that
* are marked as busy for occupancy < threshold. If the occupancy
@@ -333,7 +339,8 @@ void __check_limbo(struct rdt_domain *d, bool force_free)

entry = __rmid_entry(idx);
if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
- QOS_L3_OCCUP_EVENT_ID, &val)) {
+ QOS_L3_OCCUP_EVENT_ID, &val,
+ arch_mon_ctx)) {
rmid_dirty = true;
} else {
rmid_dirty = (val >= resctrl_rmid_realloc_threshold);
@@ -348,6 +355,8 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
}
cur_idx = idx + 1;
}
+
+ resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx);
}

bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
@@ -444,16 +453,22 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
struct rdt_domain *d;
+ int arch_mon_ctx;
u64 val = 0;
u32 idx;
int err;

idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);

+ arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID);
+ if (arch_mon_ctx < 0)
+ return;
+
entry->busy = 0;
list_for_each_entry(d, &r->domains, list) {
err = resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
- QOS_L3_OCCUP_EVENT_ID, &val);
+ QOS_L3_OCCUP_EVENT_ID, &val,
+ arch_mon_ctx);
if (err || val <= resctrl_rmid_realloc_threshold)
continue;

@@ -466,6 +481,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
set_bit(idx, d->rmid_busy_llc);
entry->busy++;
}
+ resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx);

if (entry->busy)
rmid_limbo_count++;
@@ -502,7 +518,7 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evtid);

rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid, rr->evtid,
- &tval);
+ &tval, rr->arch_mon_ctx);
if (rr->err)
return rr->err;

@@ -575,6 +591,9 @@ int mon_event_count(void *info)
int ret;

rdtgrp = rr->rgrp;
+ rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr->r, rr->evtid);
+ if (rr->arch_mon_ctx < 0)
+ return rr->arch_mon_ctx;

ret = __mon_event_count(rdtgrp->closid, rdtgrp->mon.rmid, rr);

@@ -601,6 +620,8 @@ int mon_event_count(void *info)
if (ret == 0)
rr->err = 0;

+ resctrl_arch_mon_ctx_free(rr->r, rr->evtid, rr->arch_mon_ctx);
+
return 0;
}

@@ -737,11 +758,21 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
if (is_mbm_total_enabled()) {
rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
rr.val = 0;
+ rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
+ if (rr.arch_mon_ctx < 0)
+ return;
+
__mon_event_count(closid, rmid, &rr);
+
+ resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
}
if (is_mbm_local_enabled()) {
rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
rr.val = 0;
+ rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
+ if (rr.arch_mon_ctx < 0)
+ return;
+
__mon_event_count(closid, rmid, &rr);

/*
@@ -751,6 +782,7 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
*/
if (is_mba_sc(NULL))
mbm_bw_count(closid, rmid, &rr);
+ resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
}
}

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index ff7452f644e4..03e4f41cd336 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -233,6 +233,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
* @rmid: rmid of the counter to read.
* @eventid: eventid to read, e.g. L3 occupancy.
* @val: result of the counter read in bytes.
+ * @arch_mon_ctx: An allocated context from resctrl_arch_mon_ctx_alloc().
*
* Call from process context on a CPU that belongs to domain @d.
*
@@ -241,8 +242,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
*/
int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, u32 rmid, enum resctrl_event_id eventid,
- u64 *val);
-
+ u64 *val, int arch_mon_ctx);

/**
* resctrl_arch_reset_rmid() - Reset any private state associated with rmid
--
2.39.2


2023-03-20 17:50:30

by James Morse

Subject: [PATCH v3 08/19] x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow

The limbo and overflow code picks a CPU to use from the domain's list
of online CPUs. Work is then scheduled on these CPUs to maintain
the limbo list and any counters that may overflow.

cpumask_any() may pick a CPU that is marked nohz_full, which will
either penalise the work that CPU was dedicated to, or delay the
processing of the limbo list or of counters that may overflow, perhaps
indefinitely. Delaying the overflow handling will skew the bandwidth
values calculated by mba_sc, which expects to be called once a second.

Add cpumask_any_housekeeping() as a replacement for cpumask_any()
that prefers housekeeping CPUs. This helper will still return
a nohz_full CPU if that is the only option. The CPU to use is
re-evaluated each time the limbo/overflow work runs. This ensures
the work will move off a nohz_full CPU once a housekeeping CPU is
available.
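
The selection policy can be modelled in userspace with plain 64-bit masks,
as in the sketch below. It is an illustration only: model_first_cpu() and
model_any_housekeeping() are hypothetical names, and picking the lowest set
bit stands in for cpumask_any(), which makes no promise about which CPU it
returns.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 64u

/* Lowest set bit of the mask, or MODEL_NR_CPUS if the mask is empty. */
static unsigned int model_first_cpu(uint64_t mask)
{
	for (unsigned int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	}

	return MODEL_NR_CPUS;
}

/*
 * Pick a CPU from @mask. If the first choice is nohz_full, prefer a CPU
 * in @mask that is not in @nohz_full_mask, falling back to the first
 * choice when no such CPU exists.
 */
static unsigned int model_any_housekeeping(uint64_t mask,
					   uint64_t nohz_full_mask)
{
	unsigned int cpu = model_first_cpu(mask);
	unsigned int hk_cpu;

	if (cpu < MODEL_NR_CPUS && (nohz_full_mask & (UINT64_C(1) << cpu))) {
		hk_cpu = model_first_cpu(mask & ~nohz_full_mask);
		if (hk_cpu < MODEL_NR_CPUS)
			cpu = hk_cpu;
	}

	return cpu;
}
```

For a domain covering CPUs 0-3 where CPUs 0 and 1 are nohz_full, the model
picks CPU 2. If every CPU in the domain is nohz_full, it still returns one
of them, matching the fallback described above.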

Signed-off-by: James Morse <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 23 +++++++++++++++++++++++
arch/x86/kernel/cpu/resctrl/monitor.c | 17 ++++++++++++-----
include/linux/tick.h | 3 ++-
3 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 87545e4beb70..0b5fd5a0cda2 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -7,6 +7,7 @@
#include <linux/kernfs.h>
#include <linux/fs_context.h>
#include <linux/jump_label.h>
+#include <linux/tick.h>
#include <asm/resctrl.h>

#define L3_QOS_CDP_ENABLE 0x01ULL
@@ -55,6 +56,28 @@
/* Max event bits supported */
#define MAX_EVT_CONFIG_BITS GENMASK(6, 0)

+/**
+ * cpumask_any_housekeeping() - Choose any CPU in @mask, preferring those that
+ * aren't marked nohz_full
+ * @mask: The mask to pick a CPU from.
+ *
+ * Returns a CPU in @mask. If there are housekeeping CPUs that don't use
+ * nohz_full, these are preferred.
+ */
+static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
+{
+ int cpu, hk_cpu;
+
+ cpu = cpumask_any(mask);
+ if (tick_nohz_full_cpu(cpu)) {
+ hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
+ if (hk_cpu < nr_cpu_ids)
+ cpu = hk_cpu;
+ }
+
+ return cpu;
+}
+
struct rdt_fs_context {
struct kernfs_fs_context kfc;
bool enable_cdpl2;
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index a2ae4be4b2ba..3bec5c59ca0e 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -745,9 +745,9 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
void cqm_handle_limbo(struct work_struct *work)
{
unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
- int cpu = smp_processor_id();
struct rdt_resource *r;
struct rdt_domain *d;
+ int cpu;

mutex_lock(&rdtgroup_mutex);

@@ -756,8 +756,10 @@ void cqm_handle_limbo(struct work_struct *work)

__check_limbo(d, false);

- if (has_busy_rmid(r, d))
+ if (has_busy_rmid(r, d)) {
+ cpu = cpumask_any_housekeeping(&d->cpu_mask);
schedule_delayed_work_on(cpu, &d->cqm_limbo, delay);
+ }

mutex_unlock(&rdtgroup_mutex);
}
@@ -767,7 +769,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;

- cpu = cpumask_any(&dom->cpu_mask);
+ cpu = cpumask_any_housekeeping(&dom->cpu_mask);
dom->cqm_work_cpu = cpu;

schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
@@ -777,10 +779,10 @@ void mbm_handle_overflow(struct work_struct *work)
{
unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
struct rdtgroup *prgrp, *crgrp;
- int cpu = smp_processor_id();
struct list_head *head;
struct rdt_resource *r;
struct rdt_domain *d;
+ int cpu;

mutex_lock(&rdtgroup_mutex);

@@ -801,6 +803,11 @@ void mbm_handle_overflow(struct work_struct *work)
update_mba_bw(prgrp, d);
}

+ /*
+ * Re-check for housekeeping CPUs. This allows the overflow handler to
+ * move off a nohz_full CPU quickly.
+ */
+ cpu = cpumask_any_housekeeping(&d->cpu_mask);
schedule_delayed_work_on(cpu, &d->mbm_over, delay);

out_unlock:
@@ -814,7 +821,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)

if (!static_branch_likely(&rdt_mon_enable_key))
return;
- cpu = cpumask_any(&dom->cpu_mask);
+ cpu = cpumask_any_housekeeping(&dom->cpu_mask);
dom->mbm_work_cpu = cpu;
schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
}
diff --git a/include/linux/tick.h b/include/linux/tick.h
index bfd571f18cfd..ae2e9019fc18 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -174,9 +174,10 @@ static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
static inline void tick_nohz_idle_stop_tick_protected(void) { }
#endif /* !CONFIG_NO_HZ_COMMON */

+extern cpumask_var_t tick_nohz_full_mask;
+
#ifdef CONFIG_NO_HZ_FULL
extern bool tick_nohz_full_running;
-extern cpumask_var_t tick_nohz_full_mask;

static inline bool tick_nohz_full_enabled(void)
{
--
2.39.2


2023-03-20 17:50:35

by James Morse

Subject: [PATCH v3 03/19] x86/resctrl: Create helper for RMID allocation and mondata dir creation

RMID are allocated for each monitor or control group directory, because
each of these needs its own RMID. For control groups,
rdtgroup_mkdir_ctrl_mon() later goes on to allocate the CLOSID.

MPAM's equivalent of RMID is not an independent number, so can't be
allocated until the CLOSID is known. An RMID allocation for one CLOSID
may fail, whereas another may succeed depending on how many monitor
groups a control group has.

The RMID allocation needs to move to be after the CLOSID has been
allocated.

To make a subsequent change that does this easier to read, move the RMID
allocation and mondata dir creation to a helper.

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 42 +++++++++++++++++---------
1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 6ecaf34a4e32..b785beb0db26 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3135,6 +3135,30 @@ static int rdtgroup_init_alloc(struct rdtgroup *rdtgrp)
return 0;
}

+static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
+{
+ int ret;
+
+ if (!rdt_mon_capable)
+ return 0;
+
+ ret = alloc_rmid();
+ if (ret < 0) {
+ rdt_last_cmd_puts("Out of RMIDs\n");
+ return ret;
+ }
+ rdtgrp->mon.rmid = ret;
+
+ ret = mkdir_mondata_all(rdtgrp->kn, rdtgrp, &rdtgrp->mon.mon_data_kn);
+ if (ret) {
+ rdt_last_cmd_puts("kernfs subdir error\n");
+ free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
+ return ret;
+ }
+
+ return 0;
+}
+
static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
const char *name, umode_t mode,
enum rdt_group_type rtype, struct rdtgroup **r)
@@ -3200,20 +3224,10 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
goto out_destroy;
}

- if (rdt_mon_capable) {
- ret = alloc_rmid();
- if (ret < 0) {
- rdt_last_cmd_puts("Out of RMIDs\n");
- goto out_destroy;
- }
- rdtgrp->mon.rmid = ret;
+ ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp);
+ if (ret)
+ goto out_destroy;

- ret = mkdir_mondata_all(kn, rdtgrp, &rdtgrp->mon.mon_data_kn);
- if (ret) {
- rdt_last_cmd_puts("kernfs subdir error\n");
- goto out_idfree;
- }
- }
kernfs_activate(kn);

/*
@@ -3221,8 +3235,6 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
*/
return 0;

-out_idfree:
- free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
out_destroy:
kernfs_put(rdtgrp->kn);
kernfs_remove(rdtgrp->kn);
--
2.39.2


2023-03-20 17:54:29

by James Morse

Subject: [PATCH v3 02/19] x86/resctrl: Access per-rmid structures by index

Because of the differences between Intel RDT/AMD QoS and Arm's MPAM
monitors, RMID values on arm64 are not unique unless the CLOSID is
also included. Bitmaps like rmid_busy_llc need to be sized by the
number of unique entries for this resource.

Add helpers to encode/decode the CLOSID and RMID to an index. The
domain's rmid_busy_llc and the rmid_ptrs[] array are then sized by
index, as are the domain mbm_local and mbm_total arrays.
On x86, the index is always just the RMID, so all these structures
remain the same size.

The index gives resctrl a unique value it can use to store monitor
values, and allows MPAM to decode the closid when reading the hardware
counters.
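
To make the encode/decode idea concrete, here is a minimal userspace sketch
of what an architecture that folds the CLOSID into the index might do. The
bit layout, the two-bit PMG width and the model_* names are assumptions for
illustration; this patch only adds the x86 helpers, where the index is the
RMID.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical PMG width, chosen only for this example. */
#define MODEL_PMG_BITS 2u
#define MODEL_PMG_MASK ((1u << MODEL_PMG_BITS) - 1u)

/* Combine CLOSID and RMID into a single unique index. */
static uint32_t model_idx_encode(uint32_t closid, uint32_t rmid)
{
	assert(rmid <= MODEL_PMG_MASK);
	return (closid << MODEL_PMG_BITS) | rmid;
}

/* Recover the CLOSID and RMID from an index. */
static void model_idx_decode(uint32_t idx, uint32_t *closid, uint32_t *rmid)
{
	*closid = idx >> MODEL_PMG_BITS;
	*rmid = idx & MODEL_PMG_MASK;
}

/* Number of indexes needed to size bitmaps and arrays. */
static uint32_t model_num_idx(uint32_t num_closid)
{
	return num_closid << MODEL_PMG_BITS;
}
```

Under these assumptions, structures such as rmid_busy_llc would be sized by
model_num_idx() rather than by the number of RMID values, and every
(CLOSID, RMID) pair maps to exactly one slot.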

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
Changes since v1:
* Added X86_BAD_CLOSID macro to make it clear what this value means
* Added second WARN_ON() for closid checking, and made both _ONCE()

Changes since v2:
* Added RESCTRL_RESERVED_CLOSID
* Removed a newline
* Rephrased some comments
* Renamed a variable to 'ignored'
* Moved X86_RESCTRL_BAD_CLOSID to a previous patch
---
arch/x86/include/asm/resctrl.h | 17 ++++++
arch/x86/kernel/cpu/resctrl/core.c | 2 +-
arch/x86/kernel/cpu/resctrl/internal.h | 1 +
arch/x86/kernel/cpu/resctrl/monitor.c | 83 +++++++++++++++++---------
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 7 ++-
include/linux/resctrl.h | 3 +
6 files changed, 82 insertions(+), 31 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index cbe986d23df6..3ca40be41a0a 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -101,6 +101,23 @@ static inline void resctrl_sched_in(void)
__resctrl_sched_in();
}

+static inline u32 resctrl_arch_system_num_rmid_idx(void)
+{
+ /* RMID are independent numbers for x86. num_rmid_idx==num_rmid */
+ return boot_cpu_data.x86_cache_max_rmid + 1;
+}
+
+static inline void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid)
+{
+ *rmid = idx;
+ *closid = X86_RESCTRL_BAD_CLOSID;
+}
+
+static inline u32 resctrl_arch_rmid_idx_encode(u32 ignored, u32 rmid)
+{
+ return rmid;
+}
+
void resctrl_cpu_detect(struct cpuinfo_x86 *c);

#else
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 030d3b409768..351319403f84 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -600,7 +600,7 @@ static void clear_closid_rmid(int cpu)
state->default_rmid = 0;
state->cur_closid = 0;
state->cur_rmid = 0;
- wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
+ wrmsr(MSR_IA32_PQR_ASSOC, RESCTRL_RESERVED_CLOSID, 0);
}

static int resctrl_online_cpu(unsigned int cpu)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index c64097947994..47506e2afd59 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -7,6 +7,7 @@
#include <linux/kernfs.h>
#include <linux/fs_context.h>
#include <linux/jump_label.h>
+#include <asm/resctrl.h>

#define L3_QOS_CDP_ENABLE 0x01ULL

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 18c37d364030..03a7d13dd653 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -142,12 +142,29 @@ static inline u64 get_corrected_mbm_count(u32 rmid, unsigned long val)
return val;
}

-static inline struct rmid_entry *__rmid_entry(u32 closid, u32 rmid)
+/*
+ * x86 and arm64 differ in their handling of monitoring.
+ * x86's RMID are an independent number, there is only one source of traffic
+ * with an RMID value of '1'.
+ * arm64's PMG extend the PARTID/CLOSID space, there are multiple sources of
+ * traffic with a PMG value of '1', one for each CLOSID, meaning the RMID
+ * value is no longer unique.
+ * To account for this, resctrl uses an index. On x86 this is just the RMID,
+ * on arm64 it encodes the CLOSID and RMID. This gives a unique number.
+ *
+ * The domain's rmid_busy_llc and rmid_ptrs are sized by index. The arch code
+ * must accept an attempt to read every index.
+ */
+static inline struct rmid_entry *__rmid_entry(u32 idx)
{
struct rmid_entry *entry;
+ u32 closid, rmid;

- entry = &rmid_ptrs[rmid];
- WARN_ON(entry->rmid != rmid);
+ entry = &rmid_ptrs[idx];
+ resctrl_arch_rmid_idx_decode(idx, &closid, &rmid);
+
+ WARN_ON_ONCE(entry->closid != closid);
+ WARN_ON_ONCE(entry->rmid != rmid);

return entry;
}
@@ -277,8 +294,9 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
void __check_limbo(struct rdt_domain *d, bool force_free)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ u32 idx_limit = resctrl_arch_system_num_rmid_idx();
struct rmid_entry *entry;
- u32 crmid = 1, nrmid;
+ u32 idx, cur_idx = 1;
bool rmid_dirty;
u64 val = 0;

@@ -289,12 +307,11 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
* RMID and move it to the free list when the counter reaches 0.
*/
for (;;) {
- nrmid = find_next_bit(d->rmid_busy_llc, r->num_rmid, crmid);
- if (nrmid >= r->num_rmid)
+ idx = find_next_bit(d->rmid_busy_llc, idx_limit, cur_idx);
+ if (idx >= idx_limit)
break;

- entry = __rmid_entry(X86_RESCTRL_BAD_CLOSID, nrmid);// temporary
-
+ entry = __rmid_entry(idx);
if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
QOS_L3_OCCUP_EVENT_ID, &val)) {
rmid_dirty = true;
@@ -303,19 +320,21 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
}

if (force_free || !rmid_dirty) {
- clear_bit(entry->rmid, d->rmid_busy_llc);
+ clear_bit(idx, d->rmid_busy_llc);
if (!--entry->busy) {
rmid_limbo_count--;
list_add_tail(&entry->list, &rmid_free_lru);
}
}
- crmid = nrmid + 1;
+ cur_idx = idx + 1;
}
}

bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
{
- return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
+ u32 idx_limit = resctrl_arch_system_num_rmid_idx();
+
+ return find_first_bit(d->rmid_busy_llc, idx_limit) != idx_limit;
}

/*
@@ -345,6 +364,9 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
struct rdt_domain *d;
int cpu, err;
u64 val = 0;
+ u32 idx;
+
+ idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);

entry->busy = 0;
cpu = get_cpu();
@@ -364,7 +386,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
*/
if (!has_busy_rmid(r, d))
cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL);
- set_bit(entry->rmid, d->rmid_busy_llc);
+ set_bit(idx, d->rmid_busy_llc);
entry->busy++;
}
put_cpu();
@@ -377,14 +399,16 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)

void free_rmid(u32 closid, u32 rmid)
{
+ u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
struct rmid_entry *entry;

- if (!rmid)
- return;
-
lockdep_assert_held(&rdtgroup_mutex);

- entry = __rmid_entry(closid, rmid);
+ /* do not allow the default rmid to be free'd */
+ if (!idx)
+ return;
+
+ entry = __rmid_entry(idx);

if (is_llc_occupancy_enabled())
add_rmid_to_limbo(entry);
@@ -394,6 +418,7 @@ void free_rmid(u32 closid, u32 rmid)

static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
{
+ u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
struct mbm_state *m;
u64 tval = 0;

@@ -410,10 +435,10 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
rr->val += tval;
return 0;
case QOS_L3_MBM_TOTAL_EVENT_ID:
- m = &rr->d->mbm_total[rmid];
+ m = &rr->d->mbm_total[idx];
break;
case QOS_L3_MBM_LOCAL_EVENT_ID:
- m = &rr->d->mbm_local[rmid];
+ m = &rr->d->mbm_local[idx];
break;
default:
/*
@@ -446,7 +471,8 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
*/
static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr)
{
- struct mbm_state *m = &rr->d->mbm_local[rmid];
+ u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
+ struct mbm_state *m = &rr->d->mbm_local[idx];
u64 cur_bw, bytes, cur_bytes;

cur_bytes = rr->val;
@@ -536,7 +562,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
{
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
- u32 cur_bw, delta_bw, user_bw;
+ u32 cur_bw, delta_bw, user_bw, idx;
struct rdt_resource *r_mba;
struct rdt_domain *dom_mba;
struct list_head *head;
@@ -549,7 +575,8 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)

closid = rgrp->closid;
rmid = rgrp->mon.rmid;
- pmbm_data = &dom_mbm->mbm_local[rmid];
+ idx = resctrl_arch_rmid_idx_encode(closid, rmid);
+ pmbm_data = &dom_mbm->mbm_local[idx];

dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
if (!dom_mba) {
@@ -732,19 +759,20 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)

static int dom_data_init(struct rdt_resource *r)
{
+ u32 nr_idx = resctrl_arch_system_num_rmid_idx();
struct rmid_entry *entry = NULL;
- int i, nr_rmids;
+ u32 idx;
+ int i;

- nr_rmids = r->num_rmid;
- rmid_ptrs = kcalloc(nr_rmids, sizeof(struct rmid_entry), GFP_KERNEL);
+ rmid_ptrs = kcalloc(nr_idx, sizeof(struct rmid_entry), GFP_KERNEL);
if (!rmid_ptrs)
return -ENOMEM;

- for (i = 0; i < nr_rmids; i++) {
+ for (i = 0; i < nr_idx; i++) {
entry = &rmid_ptrs[i];
INIT_LIST_HEAD(&entry->list);

- entry->rmid = i;
+ resctrl_arch_rmid_idx_decode(i, &entry->closid, &entry->rmid);
list_add_tail(&entry->list, &rmid_free_lru);
}

@@ -753,7 +781,8 @@ static int dom_data_init(struct rdt_resource *r)
* default_rdtgroup control group, which will be setup later. See
* rdtgroup_setup_root().
*/
- entry = __rmid_entry(0, 0);
+ idx = resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID, 0);
+ entry = __rmid_entry(idx);
list_del(&entry->list);

return 0;
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 23e6b3a373b0..6ecaf34a4e32 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3587,16 +3587,17 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)

static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
{
+ u32 idx_limit = resctrl_arch_system_num_rmid_idx();
size_t tsize;

if (is_llc_occupancy_enabled()) {
- d->rmid_busy_llc = bitmap_zalloc(r->num_rmid, GFP_KERNEL);
+ d->rmid_busy_llc = bitmap_zalloc(idx_limit, GFP_KERNEL);
if (!d->rmid_busy_llc)
return -ENOMEM;
}
if (is_mbm_total_enabled()) {
tsize = sizeof(*d->mbm_total);
- d->mbm_total = kcalloc(r->num_rmid, tsize, GFP_KERNEL);
+ d->mbm_total = kcalloc(idx_limit, tsize, GFP_KERNEL);
if (!d->mbm_total) {
bitmap_free(d->rmid_busy_llc);
return -ENOMEM;
@@ -3604,7 +3605,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
}
if (is_mbm_local_enabled()) {
tsize = sizeof(*d->mbm_local);
- d->mbm_local = kcalloc(r->num_rmid, tsize, GFP_KERNEL);
+ d->mbm_local = kcalloc(idx_limit, tsize, GFP_KERNEL);
if (!d->mbm_local) {
bitmap_free(d->rmid_busy_llc);
kfree(d->mbm_total);
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 7d80bae05f59..ff7452f644e4 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -6,6 +6,9 @@
#include <linux/list.h>
#include <linux/pid.h>

+/* CLOSID value used by the default control group */
+#define RESCTRL_RESERVED_CLOSID 0
+
#ifdef CONFIG_PROC_CPU_RESCTRL

int proc_resctrl_show(struct seq_file *m,
--
2.39.2


2023-03-21 10:57:31

by Ilpo Järvinen

Subject: Re: [PATCH v3 02/19] x86/resctrl: Access per-rmid structures by index

On Mon, 20 Mar 2023, James Morse wrote:

> Because of the differences between Intel RDT/AMD QoS and Arm's MPAM
> monitors, RMID values on arm64 are not unique unless the CLOSID is
> also included. Bitmaps like rmid_busy_llc need to be sized by the
> number of unique entries for this resource.
>
> Add helpers to encode/decode the CLOSID and RMID to an index. The
> domain's rmid_busy_llc and the rmid_ptrs[] array are then sized by
> index, as are the domain mbm_local and mbm_total arrays.
> On x86, the index is always just the RMID, so all these structures
> remain the same size.
>
> The index gives resctrl a unique value it can use to store monitor
> values, and allows MPAM to decode the closid when reading the hardware
> counters.
>
> Tested-by: Shaopeng Tan <[email protected]>
> Signed-off-by: James Morse <[email protected]>
> ---
> Changes since v1:
> * Added X86_BAD_CLOSID macro to make it clear what this value means
> * Added second WARN_ON() for closid checking, and made both _ONCE()
>
> Changes since v2:
> * Added RESCTRL_RESERVED_CLOSID
> * Removed a newline
> * Rephrased some comments
> * Renamed a variable 'ignore'd
> * Moved X86_RESCTRL_BAD_CLOSID to a previous patch
> ---

> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> index cbe986d23df6..3ca40be41a0a 100644

> @@ -732,19 +759,20 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
>
> static int dom_data_init(struct rdt_resource *r)
> {
> + u32 nr_idx = resctrl_arch_system_num_rmid_idx();

You've used idx_limit elsewhere so this name should be consistent with the
others.

--
i.

> struct rmid_entry *entry = NULL;
> - int i, nr_rmids;
> + u32 idx;
> + int i;
>
> - nr_rmids = r->num_rmid;
> - rmid_ptrs = kcalloc(nr_rmids, sizeof(struct rmid_entry), GFP_KERNEL);
> + rmid_ptrs = kcalloc(nr_idx, sizeof(struct rmid_entry), GFP_KERNEL);
> if (!rmid_ptrs)
> return -ENOMEM;
>
> - for (i = 0; i < nr_rmids; i++) {
> + for (i = 0; i < nr_idx; i++) {
> entry = &rmid_ptrs[i];
> INIT_LIST_HEAD(&entry->list);
>
> - entry->rmid = i;
> + resctrl_arch_rmid_idx_decode(i, &entry->closid, &entry->rmid);
> list_add_tail(&entry->list, &rmid_free_lru);
> }
>
> @@ -753,7 +781,8 @@ static int dom_data_init(struct rdt_resource *r)
> * default_rdtgroup control group, which will be setup later. See
> * rdtgroup_setup_root().
> */
> - entry = __rmid_entry(0, 0);
> + idx = resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID, 0);
> + entry = __rmid_entry(idx);
> list_del(&entry->list);
>
> return 0;
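
For readers following the index discussion, here is a minimal userspace sketch of what an encode/decode pair could look like. The x86-style pair follows the commit message (the index is just the RMID). The `mpam_like_*` pair, the `SKETCH_PMG_BITS` width and the shift-and-mask layout are assumptions made purely for illustration; they are not taken from this series. `SKETCH_BAD_CLOSID` is a stand-in for X86_RESCTRL_BAD_CLOSID, whose value is not shown in this thread.

```c
#include <stdint.h>

/* Stand-in for X86_RESCTRL_BAD_CLOSID; the value here is an assumption. */
#define SKETCH_BAD_CLOSID ((uint32_t)~0u)

/* Assumed number of PMG (RMID-like) bits in the hypothetical MPAM-like layout. */
#define SKETCH_PMG_BITS 1u
#define SKETCH_PMG_MASK ((1u << SKETCH_PMG_BITS) - 1u)

/* x86-style: RMIDs are already unique, so the CLOSID does not affect the index. */
static inline uint32_t x86_like_rmid_idx_encode(uint32_t closid, uint32_t rmid)
{
	(void)closid;
	return rmid;
}

/* x86-style: the CLOSID cannot be recovered from the index. */
static inline void x86_like_rmid_idx_decode(uint32_t idx, uint32_t *closid,
					    uint32_t *rmid)
{
	*closid = SKETCH_BAD_CLOSID;
	*rmid = idx;
}

/* Hypothetical MPAM-like layout: CLOSID in the high bits, RMID in the low bits. */
static inline uint32_t mpam_like_rmid_idx_encode(uint32_t closid, uint32_t rmid)
{
	return (closid << SKETCH_PMG_BITS) | (rmid & SKETCH_PMG_MASK);
}

/* Hypothetical MPAM-like layout: both values can be recovered from the index. */
static inline void mpam_like_rmid_idx_decode(uint32_t idx, uint32_t *closid,
					     uint32_t *rmid)
{
	*closid = idx >> SKETCH_PMG_BITS;
	*rmid = idx & SKETCH_PMG_MASK;
}
```

With a layout like this, sizing rmid_ptrs[] by the number of indexes rather than the number of RMIDs gives every (CLOSID, RMID) pair its own slot, while the x86-style pair keeps the array the same size as before.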


2023-03-21 11:05:43

by Ilpo Järvinen

Subject: Re: [PATCH v3 03/19] x86/resctrl: Create helper for RMID allocation and mondata dir creation

On Mon, 20 Mar 2023, James Morse wrote:

> RMID are allocated for each monitor or control group directory, because
> each of these needs its own RMID. For control groups,
> rdtgroup_mkdir_ctrl_mon() later goes on to allocate the CLOSID.
>
> MPAM's equivalent of RMID are not an independent number, so can't be
> allocated until the CLOSID is known. An RMID allocation for one CLOSID
> may fail, whereas another may succeed depending on how many monitor
> groups a control group has.
>
> The RMID allocation needs to move to be after the CLOSID has been
> allocated.
>
> To make a subsequent change that does this easier to read, move the RMID
> allocation and mondata dir creation to a helper.
>
> Tested-by: Shaopeng Tan <[email protected]>
> Signed-off-by: James Morse <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 42 +++++++++++++++++---------
> 1 file changed, 27 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 6ecaf34a4e32..b785beb0db26 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -3135,6 +3135,30 @@ static int rdtgroup_init_alloc(struct rdtgroup *rdtgrp)
> return 0;
> }
>
> +static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
> +{
> + int ret;
> +
> + if (!rdt_mon_capable)
> + return 0;
> +
> + ret = alloc_rmid();
> + if (ret < 0) {
> + rdt_last_cmd_puts("Out of RMIDs\n");
> + return ret;
> + }
> + rdtgrp->mon.rmid = ret;
> +
> + ret = mkdir_mondata_all(rdtgrp->kn, rdtgrp, &rdtgrp->mon.mon_data_kn);
> + if (ret) {
> + rdt_last_cmd_puts("kernfs subdir error\n");
> + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
> const char *name, umode_t mode,
> enum rdt_group_type rtype, struct rdtgroup **r)
> @@ -3200,20 +3224,10 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
> goto out_destroy;
> }
>
> - if (rdt_mon_capable) {
> - ret = alloc_rmid();
> - if (ret < 0) {
> - rdt_last_cmd_puts("Out of RMIDs\n");
> - goto out_destroy;
> - }
> - rdtgrp->mon.rmid = ret;
> + ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp);
> + if (ret)
> + goto out_destroy;
>
> - ret = mkdir_mondata_all(kn, rdtgrp, &rdtgrp->mon.mon_data_kn);
> - if (ret) {
> - rdt_last_cmd_puts("kernfs subdir error\n");
> - goto out_idfree;
> - }
> - }
> kernfs_activate(kn);
>
> /*
> @@ -3221,8 +3235,6 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
> */
> return 0;
>
> -out_idfree:
> - free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
> out_destroy:
> kernfs_put(rdtgrp->kn);
> kernfs_remove(rdtgrp->kn);
>

Reviewed-by: Ilpo Järvinen <[email protected]>


--
i.

2023-03-21 11:30:05

by Ilpo Järvinen

Subject: Re: [PATCH v3 05/19] x86/resctrl: Allow RMID allocation to be scoped by CLOSID

On Mon, 20 Mar 2023, James Morse wrote:

> MPAMs RMID values are not unique unless the CLOSID is considered as well.
>
> alloc_rmid() expects the RMID to be an independent number.
>
> Pass the CLOSID in to alloc_rmid(). Use this to compare indexes when
> allocating. If the CLOSID is not relevant to the index, this ends up
> comparing the free RMID with itself, and the first free entry will be
> used. With MPAM the CLOSID is included in the index, so this becomes a
> walk of the free RMID entries, until one that matches the supplied
> CLOSID is found.
>
> Tested-by: Shaopeng Tan <[email protected]>
> Signed-off-by: James Morse <[email protected]>
> ---
> Changes since v2;
> * Rephrased comment in resctrl_find_free_rmid() to describe this in terms of
> list_entry_first()
> * Rephrased comment above alloc_rmid()
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 2 +-
> arch/x86/kernel/cpu/resctrl/monitor.c | 54 +++++++++++++++++------
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
> 4 files changed, 43 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 47506e2afd59..e11d9ce943d3 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -535,7 +535,7 @@ void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
> struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
> int closids_supported(void);
> void closid_free(int closid);
> -int alloc_rmid(void);
> +int alloc_rmid(u32 closid);
> void free_rmid(u32 closid, u32 rmid);
> int rdt_get_mon_l3_config(struct rdt_resource *r);
> bool __init rdt_cpu_has(int flag);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 03a7d13dd653..ca58a433c668 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -337,25 +337,51 @@ bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
> return find_first_bit(d->rmid_busy_llc, idx_limit) != idx_limit;
> }
>
> -/*
> - * As of now the RMIDs allocation is global.
> - * However we keep track of which packages the RMIDs
> - * are used to optimize the limbo list management.
> - */
> -int alloc_rmid(void)
> +static struct rmid_entry *resctrl_find_free_rmid(u32 closid)
> {
> - struct rmid_entry *entry;
> -
> - lockdep_assert_held(&rdtgroup_mutex);
> + struct rmid_entry *itr;
> + u32 itr_idx, cmp_idx;
>
> if (list_empty(&rmid_free_lru))
> - return rmid_limbo_count ? -EBUSY : -ENOSPC;
> + return rmid_limbo_count ? ERR_PTR(-EBUSY) : ERR_PTR(-ENOSPC);
>
> - entry = list_first_entry(&rmid_free_lru,
> - struct rmid_entry, list);
> - list_del(&entry->list);
> + list_for_each_entry(itr, &rmid_free_lru, list) {
> + /*
> + * get the index of this free RMID, and the index it would need
> + * to be if it were used with this CLOSID.
> + * If the CLOSID is irrelevant on this architecture, these will
> + * always be the same meaning the compiler can reduce this loop
> + * to a single list_entry_first() call.
> + */
> + itr_idx = resctrl_arch_rmid_idx_encode(itr->closid, itr->rmid);
> + cmp_idx = resctrl_arch_rmid_idx_encode(closid, itr->rmid);
>
> - return entry->rmid;
> + if (itr_idx == cmp_idx)
> + return itr;
> + }
> +
> + return ERR_PTR(-ENOSPC);
> +}
> +
> +/*
> + * For MPAM the RMID value is not unique, and has to be considered with
> + * the CLOSID. The (CLOSID, RMID) pair is allocated on all domains, which
> + * allows all domains to be managed by a single limbo list.
> + * Each domain also has a rmid_busy_llc to reduce the work of the limbo handler.
> + */
> +int alloc_rmid(u32 closid)
> +{
> + struct rmid_entry *entry;
> +
> + lockdep_assert_held(&rdtgroup_mutex);
> +
> + entry = resctrl_find_free_rmid(closid);
> + if (!IS_ERR(entry)) {
> + list_del(&entry->list);
> + return entry->rmid;
> + }
> +
> + return PTR_ERR(entry);

Reverse the if condition to make this follow the normal error handling
pattern.


--
i.
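
The "normal error handling pattern" requested above can be sketched in userspace as follows. This is only an illustration: the `sketch_*` helpers are hypothetical stand-ins for the kernel's ERR_PTR()/PTR_ERR()/IS_ERR(), the fixed table replaces the rmid_free_lru list, and the direct CLOSID comparison models the MPAM case where the CLOSID is part of the index.

```c
#include <errno.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/PTR_ERR()/IS_ERR(). */
#define SKETCH_MAX_ERRNO 4095L

static inline void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline long sketch_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static inline int sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)(-SKETCH_MAX_ERRNO);
}

/* Toy entry: 'free' models membership of the free list. */
struct sketch_rmid_entry {
	uint32_t closid;
	uint32_t rmid;
	int free;
};

/* Toy lookup: first free entry usable with this CLOSID, else an error pointer. */
static struct sketch_rmid_entry *
sketch_find_free_rmid(struct sketch_rmid_entry *tbl, int n, uint32_t closid)
{
	for (int i = 0; i < n; i++) {
		if (tbl[i].free && tbl[i].closid == closid)
			return &tbl[i];
	}

	return sketch_err_ptr(-ENOSPC);
}

/* Error path first, success path unindented at the end. */
static int sketch_alloc_rmid(struct sketch_rmid_entry *tbl, int n,
			     uint32_t closid)
{
	struct sketch_rmid_entry *entry;

	entry = sketch_find_free_rmid(tbl, n, closid);
	if (sketch_is_err(entry))
		return (int)sketch_ptr_err(entry);

	entry->free = 0;	/* models list_del(&entry->list) */
	return (int)entry->rmid;
}
```

Testing for the error first keeps the success path at the function's top indentation level, which is the shape the comment asks for.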


2023-03-21 13:21:28

by Ilpo Järvinen

Subject: Re: [PATCH v3 08/19] x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow

On Mon, 20 Mar 2023, James Morse wrote:

> The limbo and overflow code picks a CPU to use from the domain's list
> of online CPUs. Work is then scheduled on these CPUs to maintain
> the limbo list and any counters that may overflow.
>
> cpumask_any() may pick a CPU that is marked nohz_full, which will
> either penalise the work that CPU was dedicated to, or delay the
> processing of limbo list or counters that may overflow. Perhaps
> indefinitely. Delaying the overflow handling will skew the bandwidth
> values calculated by mba_sc, which expects to be called once a second.
>
> Add cpumask_any_housekeeping() as a replacement for cpumask_any()
> that prefers housekeeping CPUs. This helper will still return
> a nohz_full CPU if that is the only option. The CPU to use is
> re-evaluated each time the limbo/overflow work runs. This ensures
> the work will move off a nohz_full CPU once a houskeeping CPU is
> available.
>
> Signed-off-by: James Morse <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 23 +++++++++++++++++++++++
> arch/x86/kernel/cpu/resctrl/monitor.c | 17 ++++++++++++-----
> include/linux/tick.h | 3 ++-
> 3 files changed, 37 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 87545e4beb70..0b5fd5a0cda2 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -7,6 +7,7 @@
> #include <linux/kernfs.h>
> #include <linux/fs_context.h>
> #include <linux/jump_label.h>
> +#include <linux/tick.h>
> #include <asm/resctrl.h>
>
> #define L3_QOS_CDP_ENABLE 0x01ULL
> @@ -55,6 +56,28 @@
> /* Max event bits supported */
> #define MAX_EVT_CONFIG_BITS GENMASK(6, 0)
>
> +/**
> + * cpumask_any_housekeeping() - Chose any cpu in @mask, preferring those that

Choose

> + * aren't marked nohz_full
> + * @mask: The mask to pick a CPU from.
> + *
> + * Returns a CPU in @mask. If there are houskeeping CPUs that don't use

housekeeping

> + * nohz_full, these are preferred.
> + */
> +static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
> +{
> + int cpu, hk_cpu;
> +
> + cpu = cpumask_any(mask);
> + if (tick_nohz_full_cpu(cpu)) {
> + hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
> + if (hk_cpu < nr_cpu_ids)
> + cpu = hk_cpu;
> + }
> +
> + return cpu;
> +}
> +
> struct rdt_fs_context {
> struct kernfs_fs_context kfc;
> bool enable_cdpl2;
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index a2ae4be4b2ba..3bec5c59ca0e 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -745,9 +745,9 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
> void cqm_handle_limbo(struct work_struct *work)
> {
> unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
> - int cpu = smp_processor_id();
> struct rdt_resource *r;
> struct rdt_domain *d;
> + int cpu;
>
> mutex_lock(&rdtgroup_mutex);
>
> @@ -756,8 +756,10 @@ void cqm_handle_limbo(struct work_struct *work)
>
> __check_limbo(d, false);
>
> - if (has_busy_rmid(r, d))
> + if (has_busy_rmid(r, d)) {
> + cpu = cpumask_any_housekeeping(&d->cpu_mask);
> schedule_delayed_work_on(cpu, &d->cqm_limbo, delay);
> + }
>
> mutex_unlock(&rdtgroup_mutex);
> }
> @@ -767,7 +769,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
> unsigned long delay = msecs_to_jiffies(delay_ms);
> int cpu;
>
> - cpu = cpumask_any(&dom->cpu_mask);
> + cpu = cpumask_any_housekeeping(&dom->cpu_mask);
> dom->cqm_work_cpu = cpu;
>
> schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
> @@ -777,10 +779,10 @@ void mbm_handle_overflow(struct work_struct *work)
> {
> unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
> struct rdtgroup *prgrp, *crgrp;
> - int cpu = smp_processor_id();
> struct list_head *head;
> struct rdt_resource *r;
> struct rdt_domain *d;
> + int cpu;
>
> mutex_lock(&rdtgroup_mutex);
>
> @@ -801,6 +803,11 @@ void mbm_handle_overflow(struct work_struct *work)
> update_mba_bw(prgrp, d);
> }
>
> + /*
> + * Re-check for housekeeping CPUs. This allows the overflow handler to
> + * move off a nohz_full CPU quickly.
> + */
> + cpu = cpumask_any_housekeeping(&d->cpu_mask);
> schedule_delayed_work_on(cpu, &d->mbm_over, delay);
>
> out_unlock:
> @@ -814,7 +821,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
>
> if (!static_branch_likely(&rdt_mon_enable_key))
> return;
> - cpu = cpumask_any(&dom->cpu_mask);
> + cpu = cpumask_any_housekeeping(&dom->cpu_mask);
> dom->mbm_work_cpu = cpu;
> schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
> }
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index bfd571f18cfd..ae2e9019fc18 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -174,9 +174,10 @@ static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
> static inline void tick_nohz_idle_stop_tick_protected(void) { }
> #endif /* !CONFIG_NO_HZ_COMMON */
>
> +extern cpumask_var_t tick_nohz_full_mask;
> +
> #ifdef CONFIG_NO_HZ_FULL
> extern bool tick_nohz_full_running;
> -extern cpumask_var_t tick_nohz_full_mask;

Its definition seems to also be inside #ifdef:

kernel/time/tick-sched.c-#ifdef CONFIG_NO_HZ_FULL
kernel/time/tick-sched.c:cpumask_var_t tick_nohz_full_mask;
kernel/time/tick-sched.c:EXPORT_SYMBOL_GPL(tick_nohz_full_mask);


--
i.


2023-03-21 15:12:52

by Ilpo Järvinen

Subject: Re: [PATCH v3 17/19] x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but cpu

On Mon, 20 Mar 2023, James Morse wrote:

> When a CPU is taken offline resctrl may need to move the overflow or
> limbo handlers to run on a different CPU.
>
> Once the offline callbacks have been split, cqm_setup_limbo_handler()
> will be called while the CPU that is going offline is still present
> in the cpu_mask.
>
> Pass the CPU to exclude to cqm_setup_limbo_handler() and
> mbm_setup_overflow_handler(). These functions can use a variant of
> cpumask_any_but() when selecting the CPU. -1 is used to indicate no CPUs
> need excluding.
>
> Tested-by: Shaopeng Tan <[email protected]>
> Signed-off-by: James Morse <[email protected]>
> ---
> Changes since v2:
> * Rephrased a comment to avoid a two letter bad-word. (we)
> * Avoid assigning mbm_work_cpu if the domain is going to be free()d
> * Added cpumask_any_housekeeping_but(), I dislike the name
> ---
> arch/x86/kernel/cpu/resctrl/core.c | 8 +++--
> arch/x86/kernel/cpu/resctrl/internal.h | 37 ++++++++++++++++++++--
> arch/x86/kernel/cpu/resctrl/monitor.c | 43 +++++++++++++++++++++-----
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 6 ++--
> include/linux/resctrl.h | 3 ++
> 5 files changed, 83 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 8e25ea49372e..aafe4b74587c 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -582,12 +582,16 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
> if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
> cancel_delayed_work(&d->mbm_over);
> - mbm_setup_overflow_handler(d, 0);
> + /*
> + * exclude_cpu=-1 as this CPU has already been removed
> + * by cpumask_clear_cpu()d
> + */
> + mbm_setup_overflow_handler(d, 0, RESCTRL_PICK_ANY_CPU);
> }
> if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
> has_busy_rmid(r, d)) {
> cancel_delayed_work(&d->cqm_limbo);
> - cqm_setup_limbo_handler(d, 0);
> + cqm_setup_limbo_handler(d, 0, RESCTRL_PICK_ANY_CPU);
> }
> }
> }
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 3eb5b307b809..47838ba6876e 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -78,6 +78,37 @@ static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
> return cpu;
> }
>
> +/**
> + * cpumask_any_housekeeping_but() - Chose any cpu in @mask, preferring those
> + * that aren't marked nohz_full, excluding
> + * the provided CPU
> + * @mask: The mask to pick a CPU from.
> + * @exclude_cpu:The CPU to avoid picking.
> + *
> + * Returns a CPU from @mask, but not @but. If there are houskeeping CPUs that
> + * don't use nohz_full, these are preferred.
> + * Returns >= nr_cpu_ids if no CPUs are available.
> + */
> +static inline unsigned int
> +cpumask_any_housekeeping_but(const struct cpumask *mask, int exclude_cpu)
> +{
> + int cpu, hk_cpu;
> +
> + cpu = cpumask_any_but(mask, exclude_cpu);
> + if (tick_nohz_full_cpu(cpu)) {
> + hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
> + if (hk_cpu == exclude_cpu) {
> + hk_cpu = cpumask_nth_andnot(1, mask,
> + tick_nohz_full_mask);

I'm left to wonder if it's okay to alter tick_nohz_full_mask in resctrl
code??


--
i.


2023-03-21 15:15:08

by Ilpo Järvinen

Subject: Re: [PATCH v3 08/19] x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow

On Mon, 20 Mar 2023, James Morse wrote:

> The limbo and overflow code picks a CPU to use from the domain's list
> of online CPUs. Work is then scheduled on these CPUs to maintain
> the limbo list and any counters that may overflow.
>
> cpumask_any() may pick a CPU that is marked nohz_full, which will
> either penalise the work that CPU was dedicated to, or delay the
> processing of limbo list or counters that may overflow. Perhaps
> indefinitely. Delaying the overflow handling will skew the bandwidth
> values calculated by mba_sc, which expects to be called once a second.
>
> Add cpumask_any_housekeeping() as a replacement for cpumask_any()
> that prefers housekeeping CPUs. This helper will still return
> a nohz_full CPU if that is the only option. The CPU to use is
> re-evaluated each time the limbo/overflow work runs. This ensures
> the work will move off a nohz_full CPU once a houskeeping CPU is

housekeeping

> available.
>
> Signed-off-by: James Morse <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 23 +++++++++++++++++++++++
> arch/x86/kernel/cpu/resctrl/monitor.c | 17 ++++++++++++-----
> include/linux/tick.h | 3 ++-
> 3 files changed, 37 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 87545e4beb70..0b5fd5a0cda2 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -7,6 +7,7 @@
> #include <linux/kernfs.h>
> #include <linux/fs_context.h>
> #include <linux/jump_label.h>
> +#include <linux/tick.h>
> #include <asm/resctrl.h>
>
> #define L3_QOS_CDP_ENABLE 0x01ULL
> @@ -55,6 +56,28 @@
> /* Max event bits supported */
> #define MAX_EVT_CONFIG_BITS GENMASK(6, 0)
>
> +/**
> + * cpumask_any_housekeeping() - Chose any cpu in @mask, preferring those that
> + * aren't marked nohz_full
> + * @mask: The mask to pick a CPU from.
> + *
> + * Returns a CPU in @mask. If there are houskeeping CPUs that don't use
> + * nohz_full, these are preferred.
> + */
> +static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
> +{
> + int cpu, hk_cpu;
> +
> + cpu = cpumask_any(mask);
> + if (tick_nohz_full_cpu(cpu)) {
> + hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);

Why is cpumask_nth_and() not enough here? ..._andnot() seems to alter
tick_nohz_full_mask, which doesn't seem desirable?


--
i.


2023-03-21 15:25:51

by Ilpo Järvinen

Subject: Re: [PATCH v3 17/19] x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but cpu

On Tue, 21 Mar 2023, Ilpo Järvinen wrote:

> On Mon, 20 Mar 2023, James Morse wrote:
>
> > When a CPU is taken offline resctrl may need to move the overflow or
> > limbo handlers to run on a different CPU.
> >
> > Once the offline callbacks have been split, cqm_setup_limbo_handler()
> > will be called while the CPU that is going offline is still present
> > in the cpu_mask.
> >
> > Pass the CPU to exclude to cqm_setup_limbo_handler() and
> > mbm_setup_overflow_handler(). These functions can use a variant of
> > cpumask_any_but() when selecting the CPU. -1 is used to indicate no CPUs
> > need excluding.
> >
> > Tested-by: Shaopeng Tan <[email protected]>
> > Signed-off-by: James Morse <[email protected]>
> > ---
> > Changes since v2:
> > * Rephrased a comment to avoid a two letter bad-word. (we)
> > * Avoid assigning mbm_work_cpu if the domain is going to be free()d
> > * Added cpumask_any_housekeeping_but(), I dislike the name
> > ---
> > arch/x86/kernel/cpu/resctrl/core.c | 8 +++--
> > arch/x86/kernel/cpu/resctrl/internal.h | 37 ++++++++++++++++++++--
> > arch/x86/kernel/cpu/resctrl/monitor.c | 43 +++++++++++++++++++++-----
> > arch/x86/kernel/cpu/resctrl/rdtgroup.c | 6 ++--
> > include/linux/resctrl.h | 3 ++
> > 5 files changed, 83 insertions(+), 14 deletions(-)
> >
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index 8e25ea49372e..aafe4b74587c 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -582,12 +582,16 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> > if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
> > if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
> > cancel_delayed_work(&d->mbm_over);
> > - mbm_setup_overflow_handler(d, 0);
> > + /*
> > + * exclude_cpu=-1 as this CPU has already been removed
> > + * by cpumask_clear_cpu()d
> > + */
> > + mbm_setup_overflow_handler(d, 0, RESCTRL_PICK_ANY_CPU);
> > }
> > if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
> > has_busy_rmid(r, d)) {
> > cancel_delayed_work(&d->cqm_limbo);
> > - cqm_setup_limbo_handler(d, 0);
> > + cqm_setup_limbo_handler(d, 0, RESCTRL_PICK_ANY_CPU);
> > }
> > }
> > }
> > diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> > index 3eb5b307b809..47838ba6876e 100644
> > --- a/arch/x86/kernel/cpu/resctrl/internal.h
> > +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> > @@ -78,6 +78,37 @@ static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
> > return cpu;
> > }
> >
> > +/**
> > + * cpumask_any_housekeeping_but() - Chose any cpu in @mask, preferring those
> > + * that aren't marked nohz_full, excluding
> > + * the provided CPU
> > + * @mask: The mask to pick a CPU from.
> > + * @exclude_cpu:The CPU to avoid picking.
> > + *
> > + * Returns a CPU from @mask, but not @but. If there are houskeeping CPUs that
> > + * don't use nohz_full, these are preferred.
> > + * Returns >= nr_cpu_ids if no CPUs are available.
> > + */
> > +static inline unsigned int
> > +cpumask_any_housekeeping_but(const struct cpumask *mask, int exclude_cpu)
> > +{
> > + int cpu, hk_cpu;
> > +
> > + cpu = cpumask_any_but(mask, exclude_cpu);
> > + if (tick_nohz_full_cpu(cpu)) {
> > + hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
> > + if (hk_cpu == exclude_cpu) {
> > + hk_cpu = cpumask_nth_andnot(1, mask,
> > + tick_nohz_full_mask);
>
> I'm left to wonder if it's okay to alter tick_nohz_full_mask in resctrl
> code??

I suppose it should do this instead:

	hk_cpu = cpumask_nth_and(0, mask, tick_nohz_full_mask);
	if (hk_cpu == exclude_cpu)
		hk_cpu = cpumask_next_and(hk_cpu, mask, tick_nohz_full_mask);

--
i.
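
One other way to express "prefer a housekeeping CPU, but never the excluded one" is to drop the excluded CPU from a local copy of the mask before searching. The sketch below models that over a 64-bit mask. It is an alternative shape for illustration, not the patch's implementation, and every `sketch_*` name is hypothetical. A negative exclude_cpu means nothing is excluded, mirroring the -1 convention in the commit message.

```c
#include <stdint.h>

#define SKETCH_NR_CPUS 64u

/* Lowest set bit in m, or SKETCH_NR_CPUS if m is empty. */
static unsigned int sketch_first_cpu(uint64_t m)
{
	for (unsigned int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		if (m & (UINT64_C(1) << cpu))
			return cpu;
	}

	return SKETCH_NR_CPUS;
}

/*
 * Toy any-housekeeping-but picker.
 * Returns SKETCH_NR_CPUS if no CPU other than exclude_cpu is in the mask.
 */
static unsigned int
sketch_any_housekeeping_but(uint64_t mask, uint64_t nohz_full, int exclude_cpu)
{
	uint64_t allowed = mask;	/* local copy, the inputs are untouched */
	unsigned int hk_cpu;

	if (exclude_cpu >= 0 && (unsigned int)exclude_cpu < SKETCH_NR_CPUS)
		allowed &= ~(UINT64_C(1) << exclude_cpu);

	/* Prefer a CPU that is neither excluded nor nohz_full. */
	hk_cpu = sketch_first_cpu(allowed & ~nohz_full);
	if (hk_cpu < SKETCH_NR_CPUS)
		return hk_cpu;

	/* Otherwise fall back to any remaining CPU, which may be nohz_full. */
	return sketch_first_cpu(allowed);
}
```

Because the exclusion is applied before the search, there is no need to ask for the 0th candidate and then retry with the 1st when the 0th happens to be the excluded CPU.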

2023-03-21 15:33:00

by Ilpo Järvinen

Subject: Re: [PATCH v3 18/19] x86/resctrl: Add cpu offline callback for resctrl work

On Mon, 20 Mar 2023, James Morse wrote:

> The resctrl architecture specific code may need to free a domain when
> a CPU goes offline, it also needs to reset the CPUs PQR_ASSOC register.
> The resctrl filesystem code needs to move the overflow and limbo work
> to run on a different CPU, and clear this CPU from the cpu_mask of
> control and monitor groups.
>
> Currently this is all done in core.c and called from
> resctrl_offline_cpu(), making the split between architecture and
> filesystem code unclear.
>
> Move the filesystem work into a filesystem helper called
> resctrl_offline_cpu(), and rename the one in core.c
> resctrl_arch_offline_cpu().
>
> The rdtgroup_mutex is unlocked and locked again in the call in
> preparation for changing the locking rules for the architecture
> code.
>
> resctrl_offline_cpu() is called before any of the resource/domains
> are updated, and makes use of the exclude_cpu feature that was
> previously added.
>
> Tested-by: Shaopeng Tan <[email protected]>
> Signed-off-by: James Morse <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/core.c | 41 ++++----------------------
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 39 ++++++++++++++++++++++++
> include/linux/resctrl.h | 1 +
> 3 files changed, 45 insertions(+), 36 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index aafe4b74587c..4e5fc89dab6d 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -578,22 +578,6 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>
> return;
> }
> -
> - if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
> - if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
> - cancel_delayed_work(&d->mbm_over);
> - /*
> - * exclude_cpu=-1 as this CPU has already been removed
> - * by cpumask_clear_cpu()d
> - */

This was added in 17/19 and now removed (not moved) in 18/19. Please avoid
such back-and-forth churn.

--
i.


> - mbm_setup_overflow_handler(d, 0, RESCTRL_PICK_ANY_CPU);
> - }
> - if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
> - has_busy_rmid(r, d)) {
> - cancel_delayed_work(&d->cqm_limbo);
> - cqm_setup_limbo_handler(d, 0, RESCTRL_PICK_ANY_CPU);
> - }
> - }
> }
>
> static void clear_closid_rmid(int cpu)
> @@ -623,31 +607,15 @@ static int resctrl_arch_online_cpu(unsigned int cpu)
> return err;
> }
>
> -static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
> +static int resctrl_arch_offline_cpu(unsigned int cpu)
> {
> - struct rdtgroup *cr;
> -
> - list_for_each_entry(cr, &r->mon.crdtgrp_list, mon.crdtgrp_list) {
> - if (cpumask_test_and_clear_cpu(cpu, &cr->cpu_mask)) {
> - break;
> - }
> - }
> -}
> -
> -static int resctrl_offline_cpu(unsigned int cpu)
> -{
> - struct rdtgroup *rdtgrp;
> struct rdt_resource *r;
>
> mutex_lock(&rdtgroup_mutex);
> + resctrl_offline_cpu(cpu);
> +
> for_each_capable_rdt_resource(r)
> domain_remove_cpu(cpu, r);
> - list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
> - if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
> - clear_childcpus(rdtgrp, cpu);
> - break;
> - }
> - }
> clear_closid_rmid(cpu);
> mutex_unlock(&rdtgroup_mutex);
>
> @@ -970,7 +938,8 @@ static int __init resctrl_late_init(void)
>
> state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> "x86/resctrl/cat:online:",
> - resctrl_arch_online_cpu, resctrl_offline_cpu);
> + resctrl_arch_online_cpu,
> + resctrl_arch_offline_cpu);
> if (state < 0)
> return state;
>
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index bf206bdb21ee..c27ec56c6c60 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -3710,6 +3710,45 @@ int resctrl_online_cpu(unsigned int cpu)
> return 0;
> }
>
> +static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
> +{
> + struct rdtgroup *cr;
> +
> + list_for_each_entry(cr, &r->mon.crdtgrp_list, mon.crdtgrp_list) {
> + if (cpumask_test_and_clear_cpu(cpu, &cr->cpu_mask))
> + break;
> + }
> +}
> +
> +void resctrl_offline_cpu(unsigned int cpu)
> +{
> + struct rdt_domain *d;
> + struct rdtgroup *rdtgrp;
> + struct rdt_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +
> + lockdep_assert_held(&rdtgroup_mutex);
> +
> + list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
> + if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
> + clear_childcpus(rdtgrp, cpu);
> + break;
> + }
> + }
> +
> + d = get_domain_from_cpu(cpu, l3);
> + if (d) {
> + if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
> + cancel_delayed_work(&d->mbm_over);
> + mbm_setup_overflow_handler(d, 0, cpu);
> + }
> + if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
> + has_busy_rmid(l3, d)) {
> + cancel_delayed_work(&d->cqm_limbo);
> + cqm_setup_limbo_handler(d, 0, cpu);
> + }
> + }
> +}
> +
> /*
> * rdtgroup_init - rdtgroup initialization
> *
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 3ea7d618f33f..f053527aaa5b 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -226,6 +226,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
> int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
> void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
> int resctrl_online_cpu(unsigned int cpu);
> +void resctrl_offline_cpu(unsigned int cpu);
>
> /**
> * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
>


2023-03-22 14:16:31

by Peter Newman

Subject: Re: [PATCH v3 09/19] x86/resctrl: Queue mon_event_read() instead of sending an IPI

Hi James,

On Mon, Mar 20, 2023 at 6:27 PM James Morse <[email protected]> wrote:
>
> x86 is blessed with an abundance of monitors, one per RMID, that can be

As I explained earlier, this is not the case on AMD.

> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
> the number implemented is up to the manufacturer. This means when there are
> fewer monitors than needed, they need to be allocated and freed.
>
> Worse, the domain may be broken up into slices, and the MMIO accesses
> for each slice may need performing from different CPUs.
>
> These two details mean MPAMs monitor code needs to be able to sleep, and
> IPI another CPU in the domain to read from a resource that has been sliced.

This doesn't sound very convincing. Could mon_event_read() IPI all the
CPUs in the domain? (after waiting to allocate and install monitors
when necessary?)


>
> mon_event_read() already invokes mon_event_count() via IPI, which means
> this isn't possible. On systems using nohz-full, some CPUs need to be
> interrupted to run kernel work as they otherwise stay in user-space
> running realtime workloads. Interrupting these CPUs should be avoided,
> and scheduling work on them may never complete.
>
> Change mon_event_read() to pick a housekeeping CPU, (one that is not using
> nohz_full) and schedule mon_event_count() and wait. If all the CPUs
> in a domain are using nohz-full, then an IPI is used as the fallback.
>
> This function is only used in response to a user-space filesystem request
> (not the timing sensitive overflow code).
>
> This allows MPAM to hide the slice behaviour from resctrl, and to keep
> the monitor-allocation in monitor.c.

This goal sounds more likely.

If it makes the initial enablement smoother, then I'm all for it.

Reviewed-by: Peter Newman <[email protected]>

These changes worked fine for me on tip/master, though there were merge
conflicts to resolve.

Tested-by: Peter Newman <[email protected]>

Thanks!

-Peter
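
The dispatch choice described in the quoted changelog (queue the read on a housekeeping CPU and wait, or fall back to an IPI when every CPU in the domain is nohz_full) can be modelled roughly as below. This is a userspace sketch with hypothetical names; it only models the decision, not the workqueue or IPI machinery, and it assumes the domain mask is not empty.

```c
#include <stdint.h>

#define SKETCH_NR_CPUS 64u

enum sketch_read_path {
	SKETCH_READ_VIA_WORK,	/* schedule on a housekeeping CPU and wait */
	SKETCH_READ_VIA_IPI,	/* every CPU is nohz_full, interrupt one */
};

struct sketch_read_plan {
	enum sketch_read_path path;
	unsigned int cpu;
};

/* Lowest set bit in m, or SKETCH_NR_CPUS if m is empty. */
static unsigned int sketch_lowest_cpu(uint64_t m)
{
	for (unsigned int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		if (m & (UINT64_C(1) << cpu))
			return cpu;
	}

	return SKETCH_NR_CPUS;
}

/* Decide where and how a monitor read for this domain would run. */
static struct sketch_read_plan
sketch_plan_mon_read(uint64_t domain_mask, uint64_t nohz_full)
{
	struct sketch_read_plan plan;
	unsigned int hk_cpu = sketch_lowest_cpu(domain_mask & ~nohz_full);

	if (hk_cpu < SKETCH_NR_CPUS) {
		plan.path = SKETCH_READ_VIA_WORK;
		plan.cpu = hk_cpu;
	} else {
		plan.path = SKETCH_READ_VIA_IPI;
		plan.cpu = sketch_lowest_cpu(domain_mask);
	}

	return plan;
}
```

The work path is the one that allows the reader to sleep, which is the property the changelog says the MPAM monitor code needs.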

2023-03-23 09:15:57

by Peter Newman

Subject: Re: [PATCH v3 09/19] x86/resctrl: Queue mon_event_read() instead of sending an IPI

On Wed, Mar 22, 2023 at 3:07 PM Peter Newman <[email protected]> wrote:
> On Mon, Mar 20, 2023 at 6:27 PM James Morse <[email protected]> wrote:
> >
> > x86 is blessed with an abundance of monitors, one per RMID, that can be
>
> As I explained earlier, this is not the case on AMD.
>
> > read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
> > the number implemented is up to the manufacturer. This means when there are
> > fewer monitors than needed, they need to be allocated and freed.
> >
> > Worse, the domain may be broken up into slices, and the MMIO accesses
> > for each slice may need performing from different CPUs.
> >
> > These two details mean MPAMs monitor code needs to be able to sleep, and
> > IPI another CPU in the domain to read from a resource that has been sliced.
>
> This doesn't sound very convincing. Could mon_event_read() IPI all the
> CPUs in the domain? (after waiting to allocate and install monitors
> when necessary?)

No wait, I know that isn't correct.

As you explained it, the remote CPU needs to sleep because it may need
to atomically acquire, install, and read a CSU monitor.

It still seems possible for the mon_event_read() thread to do all the
waiting (tell remote CPU to program CSU monitor, wait, tell same remote
CPU to read monitor), but that sounds like more work that I don't see a
lot of benefit to doing today.

Can you update the changelog to just say the remote CPU needs to block
when installing a CSU monitor?

Thanks!
-Peter

2023-03-31 23:22:47

by Reinette Chatre

Subject: Re: [PATCH v3 02/19] x86/resctrl: Access per-rmid structures by index

Hi James,

On 3/20/2023 10:26 AM, James Morse wrote:
> Because of the differences between Intel RDT/AMD QoS and Arm's MPAM
> monitors, RMID values on arm64 are not unique unless the CLOSID is
> also included. Bitmaps like rmid_busy_llc need to be sized by the
> number of unique entries for this resource.
>
> Add helpers to encode/decode the CLOSID and RMID to an index. The
> domain's rmid_busy__llc and the rmid_ptrs[] array are then sized by

rmid_busy__llc -> rmid_busy_llc

Not a big deal but since you are using [] for rmid_ptrs[] it should
not be necessary to say that it is an array.

> index, as are the domain mbm_local and mbm_total arrays.
> On x86, the index is always just the RMID, so all these structures
> remain the same size.
>
> The index gives resctrl a unique value it can use to store monitor
> values, and allows MPAM to decode the closid when reading the hardware
> counters.

When you are switching between CLOSID and closid in the same context it
is less obvious that it means the same thing.

>
> Tested-by: Shaopeng Tan <[email protected]>
> Signed-off-by: James Morse <[email protected]>
> ---
> Changes since v1:
> * Added X86_BAD_CLOSID macro to make it clear what this value means
> * Added second WARN_ON() for closid checking, and made both _ONCE()
>
> Changes since v2:
> * Added RESCTRL_RESERVED_CLOSID
> * Removed a newline
> * Repharsed some comments
> * Renamed a variable 'ignore'd
> * Moved X86_RESCTRL_BAD_CLOSID to a previous patch
> ---
> arch/x86/include/asm/resctrl.h | 17 ++++++
> arch/x86/kernel/cpu/resctrl/core.c | 2 +-
> arch/x86/kernel/cpu/resctrl/internal.h | 1 +
> arch/x86/kernel/cpu/resctrl/monitor.c | 83 +++++++++++++++++---------
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 7 ++-
> include/linux/resctrl.h | 3 +
> 6 files changed, 82 insertions(+), 31 deletions(-)
>
> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> index cbe986d23df6..3ca40be41a0a 100644
> --- a/arch/x86/include/asm/resctrl.h
> +++ b/arch/x86/include/asm/resctrl.h
> @@ -101,6 +101,23 @@ static inline void resctrl_sched_in(void)
> __resctrl_sched_in();
> }
>
> +static inline u32 resctrl_arch_system_num_rmid_idx(void)
> +{
> + /* RMID are independent numbers for x86. num_rmid_idx==num_rmid */

Could you add spaces around the "=="?

> + return boot_cpu_data.x86_cache_max_rmid + 1;
> +}
> +
> +static inline void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid)
> +{
> + *rmid = idx;
> + *closid = X86_RESCTRL_BAD_CLOSID;
> +}
> +
> +static inline u32 resctrl_arch_rmid_idx_encode(u32 ignored, u32 rmid)
> +{
> + return rmid;
> +}
> +
> void resctrl_cpu_detect(struct cpuinfo_x86 *c);
>
> #else
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 030d3b409768..351319403f84 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -600,7 +600,7 @@ static void clear_closid_rmid(int cpu)
> state->default_rmid = 0;
> state->cur_closid = 0;
> state->cur_rmid = 0;
> - wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
> + wrmsr(MSR_IA32_PQR_ASSOC, RESCTRL_RESERVED_CLOSID, 0);
> }
>
> static int resctrl_online_cpu(unsigned int cpu)
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index c64097947994..47506e2afd59 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -7,6 +7,7 @@
> #include <linux/kernfs.h>
> #include <linux/fs_context.h>
> #include <linux/jump_label.h>
> +#include <asm/resctrl.h>
>
> #define L3_QOS_CDP_ENABLE 0x01ULL
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 18c37d364030..03a7d13dd653 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -142,12 +142,29 @@ static inline u64 get_corrected_mbm_count(u32 rmid, unsigned long val)
> return val;
> }
>
> -static inline struct rmid_entry *__rmid_entry(u32 closid, u32 rmid)
> +/*
> + * x86 and arm64 differ in their handling of monitoring.
> + * x86's RMID are an independent number, there is only one source of traffic
> + * an RMID value of '1'.

"source of traffic an RMID" -> "source of traffic with an RMID" ?

> + * arm64's PMG extend the PARTID/CLOSID space, there are multiple sources of
> + * traffic with a PMG value of '1', one for each CLOSID, meaining the RMID

meaining -> meaning

> + * value is no longer unique.
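
As an illustration of the index idea in the changelog, here is a minimal
userspace sketch. The x86 helpers mirror the quoted patch. The MPAM-like
layout (PMG in the low bits below the CLOSID), SKETCH_PMG_BITS and all
sketch_* names are assumptions made for this example, not code from the
series.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: number of PMG bits on an MPAM-like system. */
#define SKETCH_PMG_BITS   1u
/* Stand-in for X86_RESCTRL_BAD_CLOSID; the real value is not shown here. */
#define SKETCH_BAD_CLOSID UINT32_MAX

/* x86-style: the RMID is already unique, so the CLOSID is ignored. */
static inline uint32_t sketch_x86_idx_encode(uint32_t ignored, uint32_t rmid)
{
	(void)ignored;
	return rmid;
}

static inline void sketch_x86_idx_decode(uint32_t idx, uint32_t *closid,
					 uint32_t *rmid)
{
	*rmid = idx;
	*closid = SKETCH_BAD_CLOSID;
}

/* MPAM-like (assumed layout): the PMG extends the CLOSID space. */
static inline uint32_t sketch_mpam_idx_encode(uint32_t closid, uint32_t pmg)
{
	assert(pmg < (1u << SKETCH_PMG_BITS));
	return (closid << SKETCH_PMG_BITS) | pmg;
}

static inline void sketch_mpam_idx_decode(uint32_t idx, uint32_t *closid,
					  uint32_t *pmg)
{
	*closid = idx >> SKETCH_PMG_BITS;
	*pmg = idx & ((1u << SKETCH_PMG_BITS) - 1);
}
```

With this layout, PMG 1 under CLOSID 2 and PMG 1 under CLOSID 3 get
different indexes, which is what lets per-index arrays stay unique.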

Reinette

2023-03-31 23:23:53

by Reinette Chatre

Subject: Re: [PATCH v3 03/19] x86/resctrl: Create helper for RMID allocation and mondata dir creation

Hi James,

On 3/20/2023 10:26 AM, James Morse wrote:
> RMID are allocated for each monitor or control group directory, because

Control groups do not always get an RMID, only if they are capable of
monitoring. How about "RMID are allocated for each monitor group and
control group that is also capable of monitoring"? Please feel free to
improve.

Reinette

2023-03-31 23:25:24

by Reinette Chatre

Subject: Re: [PATCH v3 06/19] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID

Hi James,

On 3/20/2023 10:26 AM, James Morse wrote:

...

> +/**
> + * resctrl_closid_is_dirty - Determine if all RMID associated with this CLOSID
> + * are available.
> + * @closid: The CLOSID that is being queried.
> + *
> + * MPAM's equivalent of RMID are per-CLOSID, meaning a freshly allocated CLOSID
> + * may not be able to allocate clean RMID. To avoid this the allocator will
> + * only return clean CLOSID. This is enough for now as it allows MPAM systems
> + * to use resctrl. This suffers from the problem that there may be no CLOSID
> + * where all the RMID are clean, causing the CLOSID allocation to fail.
> + * This can be improved (once MPAM support is upstream) to return the cleanest
> + * CLOSID where PMG=0 is clean. This would allow the CLOSID allocation to

Why does PMG=0 have to be the clean ID?

I am wondering about the use cases here. When a new CLOSID needs to be allocated,
would it not be useful to instead have a utility that returns the "cleanest" CLOSID?
Instead of picking an available CLOSID and then always having to check if it is
"dirty or not", why not have a utility that picks the CLOSID with the most
available PMGs?
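
A minimal userspace sketch of such a utility, to make the suggestion
concrete. The table sizes, struct sketch_state and the function name are
hypothetical, and the dirty table stands in for whatever limbo tracking
the real allocator would consult.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NUM_CLOSID 4	/* hypothetical sizes */
#define SKETCH_NUM_PMG    2

struct sketch_state {
	/* true if the CLOSID is not currently allocated */
	bool closid_free[SKETCH_NUM_CLOSID];
	/* true while (closid, pmg) still has occupancy attributed to it */
	bool dirty[SKETCH_NUM_CLOSID][SKETCH_NUM_PMG];
};

/*
 * Return the free CLOSID with the most clean PMGs, or -1 if no CLOSID
 * is free. Ties go to the lowest CLOSID. A free CLOSID is returned even
 * if all of its PMGs are dirty; the caller decides whether to accept it.
 */
static inline int sketch_find_cleanest_closid(const struct sketch_state *s)
{
	int best = -1, best_clean = -1;

	for (int c = 0; c < SKETCH_NUM_CLOSID; c++) {
		int clean = 0;

		if (!s->closid_free[c])
			continue;
		for (int p = 0; p < SKETCH_NUM_PMG; p++)
			clean += !s->dirty[c][p];
		if (clean > best_clean) {
			best_clean = clean;
			best = c;
		}
	}
	return best;
}
```

This does a full scan on every allocation, which is acceptable only
while the CLOSID and PMG counts are small.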

Reinette

2023-03-31 23:26:07

by Reinette Chatre

Subject: Re: [PATCH v3 08/19] x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow

Hi James,

On 3/20/2023 10:26 AM, James Morse wrote:
> The limbo and overflow code picks a CPU to use from the domain's list
> of online CPUs. Work is then scheduled on these CPUs to maintain
> the limbo list and any counters that may overflow.
>
> cpumask_any() may pick a CPU that is marked nohz_full, which will
> either penalise the work that CPU was dedicated to, or delay the

penalise -> penalize

> processing of limbo list or counters that may overflow. Perhaps
> indefinitely. Delaying the overflow handling will skew the bandwidth
> values calculated by mba_sc, which expects to be called once a second.
>
> Add cpumask_any_housekeeping() as a replacement for cpumask_any()
> that prefers housekeeping CPUs. This helper will still return
> a nohz_full CPU if that is the only option. The CPU to use is
> re-evaluated each time the limbo/overflow work runs. This ensures
> the work will move off a nohz_full CPU once a houskeeping CPU is
> available.
>
> Signed-off-by: James Morse <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 23 +++++++++++++++++++++++
> arch/x86/kernel/cpu/resctrl/monitor.c | 17 ++++++++++++-----
> include/linux/tick.h | 3 ++-
> 3 files changed, 37 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 87545e4beb70..0b5fd5a0cda2 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -7,6 +7,7 @@
> #include <linux/kernfs.h>
> #include <linux/fs_context.h>
> #include <linux/jump_label.h>
> +#include <linux/tick.h>
> #include <asm/resctrl.h>
>
> #define L3_QOS_CDP_ENABLE 0x01ULL
> @@ -55,6 +56,28 @@
> /* Max event bits supported */
> #define MAX_EVT_CONFIG_BITS GENMASK(6, 0)
>
> +/**
> + * cpumask_any_housekeeping() - Chose any cpu in @mask, preferring those that
> + * aren't marked nohz_full

"Chose any cpu" -> "Choose any CPU"

> + * @mask: The mask to pick a CPU from.
> + *
> + * Returns a CPU in @mask. If there are houskeeping CPUs that don't use
> + * nohz_full, these are preferred.
> + */
> +static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
> +{
> + int cpu, hk_cpu;
> +
> + cpu = cpumask_any(mask);
> + if (tick_nohz_full_cpu(cpu)) {
> + hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
> + if (hk_cpu < nr_cpu_ids)
> + cpu = hk_cpu;
> + }
> +

I think as a start this could perhaps be a #if defined(CONFIG_NO_HZ_FULL). There
appears to be a precedent for this in kernel/rcu/tree_nocb.h.

Apart from the issue that Ilpo pointed out I would prefer that any changes outside
resctrl are submitted separately to that subsystem.
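
For reference, the selection logic of the quoted helper can be modelled
in userspace as below. This is a sketch only: CPU masks are plain 64-bit
words, SKETCH_NO_CPU stands in for the "no CPU found" result, and the
sketch_* names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NO_CPU 64u	/* stand-in for "no CPU found" */

/* Lowest set bit in @mask, or SKETCH_NO_CPU if @mask is empty. */
static inline unsigned int sketch_first_cpu(uint64_t mask)
{
	for (unsigned int cpu = 0; cpu < 64; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return SKETCH_NO_CPU;
}

/*
 * Pick any CPU in @mask, preferring one that is not in @nohz_full.
 * Falls back to a nohz_full CPU when that is the only option.
 */
static inline unsigned int sketch_any_housekeeping(uint64_t mask,
						   uint64_t nohz_full)
{
	unsigned int cpu = sketch_first_cpu(mask);
	unsigned int hk_cpu;

	if (cpu != SKETCH_NO_CPU && (nohz_full & (UINT64_C(1) << cpu))) {
		hk_cpu = sketch_first_cpu(mask & ~nohz_full);
		if (hk_cpu != SKETCH_NO_CPU)
			cpu = hk_cpu;
	}
	return cpu;
}
```

Note the fallback case: with every CPU of the domain marked nohz_full,
the function still returns one of them.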

...

> @@ -801,6 +803,11 @@ void mbm_handle_overflow(struct work_struct *work)
> update_mba_bw(prgrp, d);
> }
>
> + /*
> + * Re-check for housekeeping CPUs. This allows the overflow handler to
> + * move off a nohz_full CPU quickly.
> + */
> + cpu = cpumask_any_housekeeping(&d->cpu_mask);
> schedule_delayed_work_on(cpu, &d->mbm_over, delay);
>
> out_unlock:

From what I can tell the nohz_full CPUs are set during boot and do not change.


> @@ -814,7 +821,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
>
> if (!static_branch_likely(&rdt_mon_enable_key))
> return;
> - cpu = cpumask_any(&dom->cpu_mask);
> + cpu = cpumask_any_housekeeping(&dom->cpu_mask);
> dom->mbm_work_cpu = cpu;
> schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
> }
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index bfd571f18cfd..ae2e9019fc18 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -174,9 +174,10 @@ static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
> static inline void tick_nohz_idle_stop_tick_protected(void) { }
> #endif /* !CONFIG_NO_HZ_COMMON */
>
> +extern cpumask_var_t tick_nohz_full_mask;
> +
> #ifdef CONFIG_NO_HZ_FULL
> extern bool tick_nohz_full_running;
> -extern cpumask_var_t tick_nohz_full_mask;
>
> static inline bool tick_nohz_full_enabled(void)
> {

In addition to what Ilpo pointed out, be careful here.
cpumask_var_t is a pointer (or array) and needs to be
allocated before use. Moving its declaration but not the
allocation code seems risky.

Reinette

2023-03-31 23:26:35

by Reinette Chatre

Subject: Re: [PATCH v3 09/19] x86/resctrl: Queue mon_event_read() instead of sending an IPI

Hi James,

On 3/20/2023 10:26 AM, James Morse wrote:
> x86 is blessed with an abundance of monitors, one per RMID, that can be
> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
> the number implemented is up to the manufacturer. This means when there are
> fewer monitors than needed, they need to be allocated and freed.
>
> Worse, the domain may be broken up into slices, and the MMIO accesses
> for each slice may need performing from different CPUs.
>
> These two details mean MPAMs monitor code needs to be able to sleep, and
> IPI another CPU in the domain to read from a resource that has been sliced.
>
> mon_event_read() already invokes mon_event_count() via IPI, which means
> this isn't possible. On systems using nohz-full, some CPUs need to be
> interrupted to run kernel work as they otherwise stay in user-space
> running realtime workloads. Interrupting these CPUs should be avoided,
> and scheduling work on them may never complete.
>
> Change mon_event_read() to pick a housekeeping CPU, (one that is not using
> nohz_full) and schedule mon_event_count() and wait. If all the CPUs
> in a domain are using nohz-full, then an IPI is used as the fallback.

It is not clear to me where in this solution an IPI is used as fallback ...
(see below)

> + int cpu;
> +
> + /* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */
> + lockdep_assert_held(&rdtgroup_mutex);
> +
> /*
> - * setup the parameters to send to the IPI to read the data.
> + * setup the parameters to pass to mon_event_count() to read the data.
> */
> rr->rgrp = rdtgrp;
> rr->evtid = evtid;
> @@ -537,7 +543,16 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> rr->val = 0;
> rr->first = first;
>
> - smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
> + cpu = get_cpu();
> + if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
> + mon_event_count(rr);
> + put_cpu();
> + } else {
> + put_cpu();
> +
> + cpu = cpumask_any_housekeeping(&d->cpu_mask);
> + smp_call_on_cpu(cpu, mon_event_count, rr, false);
> + }
> }
>

... from what I can tell there is no IPI fallback here. As per previous
patch I understand cpumask_any_housekeeping() could still return
a nohz_full CPU and calling smp_call_on_cpu() on it would not send
an IPI but instead queue the work to it. What did I miss?
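
The two branches of the quoted hunk can be modelled as below, as a
userspace sketch under simplifying assumptions: masks are 64-bit words,
CPU numbers are below 64, and all sketch_* names are hypothetical. It
only records which path is taken, and shows that neither branch is an
IPI.

```c
#include <assert.h>
#include <stdint.h>

enum sketch_path { SKETCH_DIRECT_CALL, SKETCH_QUEUED_WORK };

struct sketch_plan {
	enum sketch_path path;
	unsigned int cpu;
};

/* Lowest CPU in @mask; the caller guarantees @mask is non-empty. */
static inline unsigned int sketch_lowest_cpu(uint64_t mask)
{
	unsigned int cpu = 0;

	assert(mask != 0);
	while (!(mask & (UINT64_C(1) << cpu)))
		cpu++;
	return cpu;
}

/* Decide how the read would be dispatched when called on @this_cpu. */
static inline struct sketch_plan sketch_plan_read(unsigned int this_cpu,
						  uint64_t domain_mask,
						  uint64_t nohz_full_mask)
{
	struct sketch_plan plan;
	uint64_t hk = domain_mask & ~nohz_full_mask;

	assert(this_cpu < 64);
	if (domain_mask & (UINT64_C(1) << this_cpu)) {
		/* Already on a CPU in the domain: call directly. */
		plan.path = SKETCH_DIRECT_CALL;
		plan.cpu = this_cpu;
	} else {
		/* Otherwise queue work, even if only nohz_full CPUs remain. */
		plan.path = SKETCH_QUEUED_WORK;
		plan.cpu = sketch_lowest_cpu(hk ? hk : domain_mask);
	}
	return plan;
}
```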

Reinette

2023-03-31 23:28:13

by Reinette Chatre

Subject: Re: [PATCH v3 10/19] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep

Hi James,

On 3/20/2023 10:26 AM, James Morse wrote:

...

> int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> u32 closid, u32 rmid, enum resctrl_event_id eventid,
> u64 *val)
> {
> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> + struct __rmid_read_arg arg;
> struct arch_mbm_state *am;
> u64 msr_val, chunks;
> - int ret;
> + int err;
>
> - if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
> - return -EINVAL;
> + arg.rmid = rmid;
> + arg.eventid = eventid;
>
> - ret = __rmid_read(rmid, eventid, &msr_val);
> - if (ret)
> - return ret;
> + err = smp_call_function_any(&d->cpu_mask, smp_call_rmid_read, &arg, true);
> + if (err)
> + return err;

This seems to break the assumption of expected return values. __mon_event_count()
does:
rr->err = resctrl_arch_rmid_read()

and later rdtgroup_mondata_show() only expects -EIO or -EINVAL as errors, with
default of success.
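
To illustrate the concern, a userspace model of a display routine that
recognises only -EIO and -EINVAL. The returned strings and the sketch_*
names are assumptions for the example, and -ENXIO is used purely as a
sample of an unexpected error code.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Model of a consumer that knows two error codes and treats every
 * other value, including unknown negative ones, as success.
 */
static inline const char *sketch_mondata_status(int err)
{
	if (err == -EIO)
		return "Error";
	if (err == -EINVAL)
		return "Unavailable";
	return "value";	/* the counter value would be printed here */
}

/* One possible fix: fold unexpected errors into a code the consumer knows. */
static inline int sketch_normalise_err(int err)
{
	if (err && err != -EIO && err != -EINVAL)
		return -EIO;
	return err;
}
```

Without the normalising step, an unexpected error from the cross-call
would be reported as if the read had succeeded.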


> + if (arg.err)
> + return arg.err;
> + msr_val = arg.msr_val;
>


Reinette

2023-03-31 23:30:15

by Reinette Chatre

Subject: Re: [PATCH v3 12/19] x86/resctrl: Make resctrl_mounted checks explicit

Hi James,

On 3/20/2023 10:26 AM, James Morse wrote:
> The rdt_enable_key is switched when resctrl is mounted, and used to
> prevent a second mount of the filesystem. It also enables the
> architecture's context switch code.
>
> This requires another architecture to have the same set of static-keys,
> as resctrl depends on them too.
>
> Make the resctrl_mounted checks explicit: resctrl can keep track of
> whether it has been mounted once. This doesn't need to be combined with
> whether the arch code is context switching the CLOSID.
> Tests against the rdt_mon_enable_key become a test that resctrl is
> mounted and that monitoring is enabled.

The last sentence above makes the code change hard to follow ...
(see below)

>
> This will allow the static-key changing to be moved behind resctrl_arch_
> calls.
>
> Tested-by: Shaopeng Tan <[email protected]>
> Signed-off-by: James Morse <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 1 +
> arch/x86/kernel/cpu/resctrl/monitor.c | 5 +++--
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 17 +++++++++++------
> 3 files changed, 15 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 7262b355e128..7d5188e8bec3 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -142,6 +142,7 @@ extern bool rdt_alloc_capable;
> extern bool rdt_mon_capable;
> extern unsigned int rdt_mon_features;
> extern struct list_head resctrl_schema_all;
> +extern bool resctrl_mounted;
>
> enum rdt_group_type {
> RDTCTRL_GROUP = 0,
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index f38cd2f12285..6279f5c98b39 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -834,7 +834,7 @@ void mbm_handle_overflow(struct work_struct *work)
>
> mutex_lock(&rdtgroup_mutex);
>
> - if (!static_branch_likely(&rdt_mon_enable_key))
> + if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key))

... considering the text in the changelog the "resctrl_mounted" check seems
unnecessary. Looking ahead I wonder if this check would not be more
appropriate in patch 15?

> goto out_unlock;
>
> r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> @@ -867,8 +867,9 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
> unsigned long delay = msecs_to_jiffies(delay_ms);
> int cpu;
>
> - if (!static_branch_likely(&rdt_mon_enable_key))
> + if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key))
> return;

same here

> +

This seems unnecessary.

> cpu = cpumask_any_housekeeping(&dom->cpu_mask);
> dom->mbm_work_cpu = cpu;
> schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 2306fbc9a9bb..5176a85f281c 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -42,6 +42,9 @@ LIST_HEAD(rdt_all_groups);
> /* list of entries for the schemata file */
> LIST_HEAD(resctrl_schema_all);
>
> +/* the filesystem can only be mounted once */

Please start sentences with capital letters and end with period.

> +bool resctrl_mounted;
> +
> /* Kernel fs node for "info" directory under root */
> static struct kernfs_node *kn_info;
>
> @@ -796,7 +799,7 @@ int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns,
> mutex_lock(&rdtgroup_mutex);
>
> /* Return empty if resctrl has not been mounted. */
> - if (!static_branch_unlikely(&rdt_enable_key)) {
> + if (!resctrl_mounted) {
> seq_puts(s, "res:\nmon:\n");
> goto unlock;
> }
> @@ -2463,7 +2466,7 @@ static int rdt_get_tree(struct fs_context *fc)
> /*
> * resctrl file system can only be mounted once.
> */
> - if (static_branch_unlikely(&rdt_enable_key)) {
> + if (resctrl_mounted) {
> ret = -EBUSY;
> goto out;
> }
> @@ -2511,8 +2514,10 @@ static int rdt_get_tree(struct fs_context *fc)
> if (rdt_mon_capable)
> static_branch_enable_cpuslocked(&rdt_mon_enable_key);
>
> - if (rdt_alloc_capable || rdt_mon_capable)
> + if (rdt_alloc_capable || rdt_mon_capable) {
> static_branch_enable_cpuslocked(&rdt_enable_key);
> + resctrl_mounted = true;
> + }
>
> if (is_mbm_enabled()) {
> r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> @@ -2783,6 +2788,7 @@ static void rdt_kill_sb(struct super_block *sb)
> static_branch_disable_cpuslocked(&rdt_alloc_enable_key);
> static_branch_disable_cpuslocked(&rdt_mon_enable_key);
> static_branch_disable_cpuslocked(&rdt_enable_key);
> + resctrl_mounted = false;
> kernfs_kill_sb(sb);
> mutex_unlock(&rdtgroup_mutex);
> cpus_read_unlock();
> @@ -3610,7 +3616,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
> * If resctrl is mounted, remove all the
> * per domain monitor data directories.
> */
> - if (static_branch_unlikely(&rdt_mon_enable_key))
> + if (resctrl_mounted && static_branch_unlikely(&rdt_mon_enable_key))
> rmdir_mondata_subdir_allrdtgrp(r, d->id);
>
> if (is_mbm_enabled())
> @@ -3687,8 +3693,7 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
> if (is_llc_occupancy_enabled())
> INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
>
> - /* If resctrl is mounted, add per domain monitor data directories. */
> - if (static_branch_unlikely(&rdt_mon_enable_key))
> + if (resctrl_mounted && static_branch_unlikely(&rdt_mon_enable_key))
> mkdir_mondata_subdir_allrdtgrp(r, d);
>
> return 0;

Above also, the resctrl_mounted check does not seem to be needed.

Reinette

2023-03-31 23:32:27

by Reinette Chatre

Subject: Re: [PATCH v3 11/19] x86/resctrl: Allow arch to allocate memory needed in resctrl_arch_rmid_read()

Hi James,

On 3/20/2023 10:26 AM, James Morse wrote:
> Depending on the number of monitors available, Arm's MPAM may need to
> allocate a monitor prior to reading the counter value. Allocating a
> contended resource may involve sleeping.
>
> All callers of resctrl_arch_rmid_read() read the counter on more than
> one domain. If the monitor is allocated globally, there is no need to

This does not seem accurate considering that __check_limbo() is called
for a single domain.

> allocate and free it for each call to resctrl_arch_rmid_read().
>
> Add arch hooks for this allocation, which need calling before
> resctrl_arch_rmid_read(). The allocated monitor is passed to
> resctrl_arch_rmid_read(), then freed again afterwards. The helper
> can be called on any CPU, and can sleep.
>
> Tested-by: Shaopeng Tan <[email protected]>
> Signed-off-by: James Morse <[email protected]>
> ---
> arch/x86/include/asm/resctrl.h | 11 +++++++
> arch/x86/kernel/cpu/resctrl/internal.h | 1 +
> arch/x86/kernel/cpu/resctrl/monitor.c | 40 +++++++++++++++++++++++---
> include/linux/resctrl.h | 4 +--
> 4 files changed, 50 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> index 752123b0ce40..1c87f1626456 100644
> --- a/arch/x86/include/asm/resctrl.h
> +++ b/arch/x86/include/asm/resctrl.h
> @@ -136,6 +136,17 @@ static inline u32 resctrl_arch_rmid_idx_encode(u32 ignored, u32 rmid)
> return rmid;
> }
>
> +/* x86 can always read an rmid, nothing needs allocating */
> +struct rdt_resource;
> +static inline int resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, int evtid)
> +{
> + might_sleep();
> + return 0;
> +};
> +
> +static inline void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid,
> + int ctx) { };
> +
> void resctrl_cpu_detect(struct cpuinfo_x86 *c);
>
> #else
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index a07557390895..7262b355e128 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -135,6 +135,7 @@ struct rmid_read {
> bool first;
> int err;
> u64 val;
> + int arch_mon_ctx;
> };
>
> extern bool rdt_alloc_capable;
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index de72df06b37b..f38cd2f12285 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -15,6 +15,7 @@
> * Software Developer Manual June 2016, volume 3, section 17.17.
> */
>
> +#include <linux/cpu.h>

Why is this needed?

> #include <linux/module.h>
> #include <linux/sizes.h>
> #include <linux/slab.h>
> @@ -271,7 +272,7 @@ static void smp_call_rmid_read(void *_arg)
>
> int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> u32 closid, u32 rmid, enum resctrl_event_id eventid,
> - u64 *val)
> + u64 *val, int ignored)
> {
> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> @@ -317,9 +318,14 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
> u32 idx_limit = resctrl_arch_system_num_rmid_idx();
> struct rmid_entry *entry;
> u32 idx, cur_idx = 1;
> + int arch_mon_ctx;
> bool rmid_dirty;
> u64 val = 0;
>
> + arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID);
> + if (arch_mon_ctx < 0)
> + return;
> +

The vision for this is not clear to me. When I read that context needs to be allocated
I expect it to return a pointer to some new context, not an int. What would the
"context" consist of?


...

> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index ff7452f644e4..03e4f41cd336 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -233,6 +233,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
> * @rmid: rmid of the counter to read.
> * @eventid: eventid to read, e.g. L3 occupancy.
> * @val: result of the counter read in bytes.
> + * @arch_mon_ctx: An allocated context from resctrl_arch_mon_ctx_alloc().
> *

Could this description be expanded to indicate what this context is used for?

> * Call from process context on a CPU that belongs to domain @d.
> *


Reinette

2023-03-31 23:33:32

by Reinette Chatre

Subject: Re: [PATCH v3 15/19] x86/resctrl: Add helpers for system wide mon/alloc capable

Hi James,

On 3/20/2023 10:26 AM, James Morse wrote:
> resctrl reads rdt_alloc_capable or rdt_mon_capable to determine
> whether any of the resources support the corresponding features.
> resctrl also uses the static-keys that affect the architecture's
> context-switch code to determine the same thing.

Hmm ... they are not the same though, since the static keys
additionally mean that resctrl is mounted.

Reinette

2023-03-31 23:40:21

by Reinette Chatre

Subject: Re: [PATCH v3 16/19] x86/resctrl: Add cpu online callback for resctrl work

Hi James,

On 3/20/2023 10:26 AM, James Morse wrote:

...

> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 351319403f84..8e25ea49372e 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -603,19 +603,20 @@ static void clear_closid_rmid(int cpu)
> wrmsr(MSR_IA32_PQR_ASSOC, RESCTRL_RESERVED_CLOSID, 0);
> }
>
> -static int resctrl_online_cpu(unsigned int cpu)
> +static int resctrl_arch_online_cpu(unsigned int cpu)
> {
> struct rdt_resource *r;
> + int err;

Could you please rename err to ret?

Reinette

2023-04-05 23:53:37

by Reinette Chatre

Subject: Re: [PATCH v3 18/19] x86/resctrl: Add cpu offline callback for resctrl work

Hi James,

On 3/20/2023 10:26 AM, James Morse wrote:

> -static int resctrl_offline_cpu(unsigned int cpu)
> -{
> - struct rdtgroup *rdtgrp;
> struct rdt_resource *r;
>
> mutex_lock(&rdtgroup_mutex);
> + resctrl_offline_cpu(cpu);
> +
> for_each_capable_rdt_resource(r)
> domain_remove_cpu(cpu, r);
> - list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
> - if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
> - clear_childcpus(rdtgrp, cpu);
> - break;
> - }
> - }
> clear_closid_rmid(cpu);
> mutex_unlock(&rdtgroup_mutex);
>

I find this and the previous patch to be very complicated. It is not clear
to me why resctrl_offline_cpu(cpu) is required to run before the domain is
taken offline. The previous patch would not be needed if the existing order
of operations is maintained.

Reinette

2023-04-24 13:08:38

by Peter Newman

Subject: Re: [PATCH v3 02/19] x86/resctrl: Access per-rmid structures by index

Hi James,

On Mon, Mar 20, 2023 at 6:27 PM James Morse <[email protected]> wrote:
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 030d3b409768..351319403f84 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -600,7 +600,7 @@ static void clear_closid_rmid(int cpu)
> state->default_rmid = 0;
> state->cur_closid = 0;
> state->cur_rmid = 0;
> - wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
> + wrmsr(MSR_IA32_PQR_ASSOC, RESCTRL_RESERVED_CLOSID, 0);

It looks like the RMID/CLOSID params are in the wrong order in this wrmsr().
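
A userspace model of the argument order, assuming the usual
wrmsr(msr, low, high) convention and the IA32_PQR_ASSOC layout with the
RMID in the low half and the CLOSID in bits 63:32. The sketch_* names
are hypothetical and nothing here touches a real MSR.

```c
#include <assert.h>
#include <stdint.h>

/* Value that wrmsr(msr, lo, hi) would write: hi in 63:32, lo in 31:0. */
static inline uint64_t sketch_wrmsr_value(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}

/* PQR_ASSOC-style packing: RMID is the low word, CLOSID the high word. */
static inline uint64_t sketch_pqr_assoc(uint32_t rmid, uint32_t closid)
{
	return sketch_wrmsr_value(rmid, closid);
}

static inline uint32_t sketch_pqr_rmid(uint64_t val)
{
	return (uint32_t)val;
}

static inline uint32_t sketch_pqr_closid(uint64_t val)
{
	return (uint32_t)(val >> 32);
}
```

Passing a CLOSID as the second argument therefore lands it in the RMID
field, which is harmless only when both values happen to be equal.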

-Peter

2023-04-27 14:11:54

by James Morse

Subject: Re: [PATCH v3 08/19] x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow

Hi Ilpo,

On 21/03/2023 13:21, Ilpo Järvinen wrote:
> On Mon, 20 Mar 2023, James Morse wrote:
>
>> The limbo and overflow code picks a CPU to use from the domain's list
>> of online CPUs. Work is then scheduled on these CPUs to maintain
>> the limbo list and any counters that may overflow.
>>
>> cpumask_any() may pick a CPU that is marked nohz_full, which will
>> either penalise the work that CPU was dedicated to, or delay the
>> processing of limbo list or counters that may overflow. Perhaps
>> indefinitely. Delaying the overflow handling will skew the bandwidth
>> values calculated by mba_sc, which expects to be called once a second.
>>
>> Add cpumask_any_housekeeping() as a replacement for cpumask_any()
>> that prefers housekeeping CPUs. This helper will still return
>> a nohz_full CPU if that is the only option. The CPU to use is
>> re-evaluated each time the limbo/overflow work runs. This ensures
>> the work will move off a nohz_full CPU once a houskeeping CPU is
>> available.

>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>> index 87545e4beb70..0b5fd5a0cda2 100644
>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>> @@ -55,6 +56,28 @@
>> /* Max event bits supported */
>> #define MAX_EVT_CONFIG_BITS GENMASK(6, 0)
>>
>> +/**
>> + * cpumask_any_housekeeping() - Chose any cpu in @mask, preferring those that
>> + * aren't marked nohz_full
>> + * @mask: The mask to pick a CPU from.
>> + *
>> + * Returns a CPU in @mask. If there are houskeeping CPUs that don't use
>> + * nohz_full, these are preferred.
>> + */
>> +static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
>> +{
>> + int cpu, hk_cpu;
>> +
>> + cpu = cpumask_any(mask);
>> + if (tick_nohz_full_cpu(cpu)) {
>> + hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
>> + if (hk_cpu < nr_cpu_ids)
>> + cpu = hk_cpu;
>> + }
>> +
>> + return cpu;
>> +}
>> diff --git a/include/linux/tick.h b/include/linux/tick.h
>> index bfd571f18cfd..ae2e9019fc18 100644
>> --- a/include/linux/tick.h
>> +++ b/include/linux/tick.h
>> @@ -174,9 +174,10 @@ static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
>> static inline void tick_nohz_idle_stop_tick_protected(void) { }
>> #endif /* !CONFIG_NO_HZ_COMMON */
>>
>> +extern cpumask_var_t tick_nohz_full_mask;
>> +
>> #ifdef CONFIG_NO_HZ_FULL
>> extern bool tick_nohz_full_running;
>> -extern cpumask_var_t tick_nohz_full_mask;
>
> Its definition seems to also be inside #ifdef:
>
> kernel/time/tick-sched.c-#ifdef CONFIG_NO_HZ_FULL
> kernel/time/tick-sched.c:cpumask_var_t tick_nohz_full_mask;
> kernel/time/tick-sched.c:EXPORT_SYMBOL_GPL(tick_nohz_full_mask);

Indeed, but all the uses are guarded by tick_nohz_full_cpu(), which the compiler knows is
false if CONFIG_NO_HZ_FULL is not selected.

Moving the prototype is enough to let the compiler parse the code to check it's correct,
before dead-code-eliminating it. There is no need to carry around the cpumask if it's never
going to be used. This would only cause a problem if someone adds a user of
tick_nohz_full_mask which isn't guarded by IS_ENABLED(). I argue that would be a bug.

All this is being done to avoid more #ifdeffery!)


Thanks,

James

2023-04-27 14:12:04

by James Morse

Subject: Re: [PATCH v3 08/19] x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow

Hi Ilpo,

On 21/03/2023 15:14, Ilpo Järvinen wrote:
> On Mon, 20 Mar 2023, James Morse wrote:
>
>> The limbo and overflow code picks a CPU to use from the domain's list
>> of online CPUs. Work is then scheduled on these CPUs to maintain
>> the limbo list and any counters that may overflow.
>>
>> cpumask_any() may pick a CPU that is marked nohz_full, which will
>> either penalise the work that CPU was dedicated to, or delay the
>> processing of limbo list or counters that may overflow. Perhaps
>> indefinitely. Delaying the overflow handling will skew the bandwidth
>> values calculated by mba_sc, which expects to be called once a second.
>>
>> Add cpumask_any_housekeeping() as a replacement for cpumask_any()
>> that prefers housekeeping CPUs. This helper will still return
>> a nohz_full CPU if that is the only option. The CPU to use is
>> re-evaluated each time the limbo/overflow work runs. This ensures
>> the work will move off a nohz_full CPU once a houskeeping CPU is
>
> housekeeping
>
>> available.

>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>> index 87545e4beb70..0b5fd5a0cda2 100644
>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h

>> +/**
>> + * cpumask_any_housekeeping() - Chose any cpu in @mask, preferring those that
>> + * aren't marked nohz_full
>> + * @mask: The mask to pick a CPU from.
>> + *
>> + * Returns a CPU in @mask. If there are houskeeping CPUs that don't use
>> + * nohz_full, these are preferred.
>> + */
>> +static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
>> +{
>> + int cpu, hk_cpu;
>> +
>> + cpu = cpumask_any(mask);
>> + if (tick_nohz_full_cpu(cpu)) {
>> + hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
>
> Why cpumask_nth_and() is not enough here? ..._andnot() seems to alter
> tick_nohz_full_mask which doesn't seem desirable?

tick_nohz_full_mask is the list of CPUs we should avoid. This wants to find the first cpu
set in the domain mask, and clear in tick_nohz_full_mask.

Where does cpumask_nth_andnot() modify its arguments? Its arguments are const.
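
To make the semantics concrete, here is a minimal user-space sketch, not the kernel implementation. It models a cpumask as a single uint64_t; nth_andnot() and any_housekeeping() are illustrative stand-ins for cpumask_nth_andnot() and cpumask_any_housekeeping(), and NR_CPUS is an assumed limit. The and-not is computed into a local, so both inputs are only ever read.

```c
#include <stdint.h>

#define NR_CPUS 64	/* assumption: one bit per CPU in a uint64_t */

/*
 * Model of cpumask_nth_andnot(): return the n-th (0-based) CPU that is set
 * in *a and clear in *b, or NR_CPUS if there is no such CPU. Both masks are
 * const: the result of the and-not lives in a local variable.
 */
static unsigned int nth_andnot(unsigned int n, const uint64_t *a,
			       const uint64_t *b)
{
	uint64_t candidates = *a & ~*b;
	unsigned int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!(candidates & (UINT64_C(1) << cpu)))
			continue;
		if (n-- == 0)
			return cpu;
	}

	return NR_CPUS;
}

/* Model of cpumask_any_housekeeping(): prefer CPUs that are not nohz_full. */
static unsigned int any_housekeeping(const uint64_t *mask,
				     const uint64_t *nohz_full)
{
	const uint64_t none = 0;
	/* Stand-in for cpumask_any(): the first CPU set in the mask. */
	unsigned int cpu = nth_andnot(0, mask, &none);
	unsigned int hk_cpu;

	if (cpu < NR_CPUS && (*nohz_full & (UINT64_C(1) << cpu))) {
		hk_cpu = nth_andnot(0, mask, nohz_full);
		if (hk_cpu < NR_CPUS)
			cpu = hk_cpu;
	}

	return cpu;
}
```

With a domain of CPUs 0-3 where CPUs 0-1 are nohz_full, the model picks CPU 2 and leaves both masks as they were. If every CPU in the domain is nohz_full, it falls back to the first CPU.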


Thanks,

James

2023-04-27 14:12:18

by James Morse

Subject: Re: [PATCH v3 06/19] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID

Hi Reinette,

On 01/04/2023 00:21, Reinette Chatre wrote:
> On 3/20/2023 10:26 AM, James Morse wrote:
>> +/**
>> + * resctrl_closid_is_dirty - Determine if all RMID associated with this CLOSID
>> + * are available.
>> + * @closid: The CLOSID that is being queried.
>> + *
>> + * MPAM's equivalent of RMID are per-CLOSID, meaning a freshly allocated CLOSID
>> + * may not be able to allocate clean RMID. To avoid this the allocator will
>> + * only return clean CLOSID. This is enough for now as it allows MPAM systems
>> + * to use resctrl. This suffers from the problem that there may be no CLOSID
>> + * where all the RMID are clean, causing the CLOSID allocation to fail.
>> + * This can be improved (once MPAM support is upstream) to return the cleanest
>> + * CLOSID where PMG=0 is clean. This would allow the CLOSID allocation to

> Why does PMG=0 have to be the clean ID?

True, there ends up being a second search anyway.


> I am wondering about the use cases here. When a new CLOSID needs to be allocated,
> would it not be useful to instead have a utility that returns the "cleanest" CLOSID?

It would, but this is a trade-off between churn and features. I'm trying to do the minimum
to get feature parity for supporting MPAM, keeping any additional code that x86 doesn't
use small and simple. Improvements that only affect MPAM can be kicked down the road.

But as we're discussing it...


> Instead of picking an available CLOSID and then always have to check if it is
> "dirty or not", why not have a utility that picks the CLOSID with the most
> available PMGs?

I think an extra array to keep track of this is simplest as it avoids a complex walk of
the rmid_ptrs[] array looking for the global minimum across a number of entries. I think
it would be two additional patches. I'll include this in the next version.
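
A rough sketch of the extra-array idea, under stated assumptions: closid_num_dirty_rmid[], closid_in_use[] and find_cleanest_closid() are hypothetical names, NUM_CLOSID is an illustrative size, and the real patches may look quite different. The point is only that a per-CLOSID dirty count turns the search into one linear pass instead of a walk of rmid_ptrs[].

```c
#include <stdbool.h>

#define NUM_CLOSID 4	/* illustrative size, not a real hardware value */

/* Hypothetical bookkeeping: how many RMIDs of each CLOSID are still dirty. */
static unsigned int closid_num_dirty_rmid[NUM_CLOSID];
static bool closid_in_use[NUM_CLOSID];

/*
 * Return the free CLOSID with the fewest dirty RMIDs, or -1 if every CLOSID
 * is already in use. Stops early once a fully clean CLOSID is found.
 */
static int find_cleanest_closid(void)
{
	int closid, cleanest = -1;

	for (closid = 0; closid < NUM_CLOSID; closid++) {
		if (closid_in_use[closid])
			continue;

		if (cleanest < 0 ||
		    closid_num_dirty_rmid[closid] < closid_num_dirty_rmid[cleanest])
			cleanest = closid;

		/* Nothing can beat a CLOSID with no dirty RMIDs. */
		if (closid_num_dirty_rmid[cleanest] == 0)
			break;
	}

	return cleanest;
}
```

The counts would be updated wherever an RMID enters or leaves limbo; that part is not shown here.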


Thanks,

James



2023-04-27 14:12:40

by James Morse

Subject: Re: [PATCH v3 08/19] x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow

Hi Reinette,

On 01/04/2023 00:24, Reinette Chatre wrote:
> On 3/20/2023 10:26 AM, James Morse wrote:
>> The limbo and overflow code picks a CPU to use from the domain's list
>> of online CPUs. Work is then scheduled on these CPUs to maintain
>> the limbo list and any counters that may overflow.
>>
>> cpumask_any() may pick a CPU that is marked nohz_full, which will
>> either penalise the work that CPU was dedicated to, or delay the
>
> penalise -> penalize

(s->z is the difference between British English and American English)


>> processing of limbo list or counters that may overflow. Perhaps
>> indefinitely. Delaying the overflow handling will skew the bandwidth
>> values calculated by mba_sc, which expects to be called once a second.
>>
>> Add cpumask_any_housekeeping() as a replacement for cpumask_any()
>> that prefers housekeeping CPUs. This helper will still return
>> a nohz_full CPU if that is the only option. The CPU to use is
>> re-evaluated each time the limbo/overflow work runs. This ensures
>> the work will move off a nohz_full CPU once a houskeeping CPU is
>> available.

>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>> index 87545e4beb70..0b5fd5a0cda2 100644
>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>> @@ -55,6 +56,28 @@
>> /* Max event bits supported */
>> #define MAX_EVT_CONFIG_BITS GENMASK(6, 0)
>>
>> +/**
>> + * cpumask_any_housekeeping() - Chose any cpu in @mask, preferring those that
>> + * aren't marked nohz_full
>> + * @mask: The mask to pick a CPU from.
>> + *
>> + * Returns a CPU in @mask. If there are houskeeping CPUs that don't use
>> + * nohz_full, these are preferred.
>> + */
>> +static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
>> +{
>> + int cpu, hk_cpu;
>> +
>> + cpu = cpumask_any(mask);
>> + if (tick_nohz_full_cpu(cpu)) {
>> + hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
>> + if (hk_cpu < nr_cpu_ids)
>> + cpu = hk_cpu;
>> + }
>> +

> I think as a start this could perhaps be a #if defined(CONFIG_NO_HZ_FULL). There
> appears to be a precedent for this in kernel/rcu/tree_nocb.h.

This harms readability, and prevents the compiler from checking that this is valid C code
in every build of this code, whatever the configuration.

With #ifdefs here you'd be reliant on some CI system to build with the required
combination of Kconfig symbols to expose any warnings.

It's much better to use IS_ENABLED() in the helpers and rely on the compiler's
dead-code-elimination to remove paths that have been configured out.

(See the section on Conditional Compilation in coding-style for a much better summary!)
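
As a user-space illustration of the pattern (a sketch, not the kernel's headers): SKETCH_NO_HZ_FULL is a made-up 0/1 macro standing in for IS_ENABLED(CONFIG_NO_HZ_FULL), and sketch_nohz_full_cpu() stands in for tick_nohz_full_cpu(). Both branches are parsed and type-checked in every build; the compiler drops the branch that cannot be reached. Unlike the kernel case discussed here, the sketch keeps the mask defined, so it links at any optimisation level.

```c
#include <stdbool.h>

/*
 * Stand-in for a Kconfig symbol. In the kernel this would be
 * IS_ENABLED(CONFIG_NO_HZ_FULL); here it is a plain 0/1 macro.
 */
#ifndef SKETCH_NO_HZ_FULL
#define SKETCH_NO_HZ_FULL 0
#endif

/* Defined unconditionally in this sketch so the example always links. */
static unsigned long nohz_full_mask = 0x6;	/* CPUs 1 and 2 */

/* Stand-in for tick_nohz_full_cpu(): constant false when configured out. */
static inline bool sketch_nohz_full_cpu(unsigned int cpu)
{
	if (!SKETCH_NO_HZ_FULL)
		return false;
	if (cpu >= 8 * sizeof(nohz_full_mask))
		return false;

	return nohz_full_mask & (1UL << cpu);
}

/*
 * No #ifdef in the caller: the nohz_full path is always compiled, and is
 * removed as dead code when SKETCH_NO_HZ_FULL is 0.
 */
static unsigned int pick_cpu(unsigned int first_choice, unsigned int fallback)
{
	if (sketch_nohz_full_cpu(first_choice))
		return fallback;

	return first_choice;
}
```

Building with -DSKETCH_NO_HZ_FULL=1 exercises the other branch without touching the caller.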


> Apart from the issue that Ilpo pointed out I would prefer that any changes outside
> resctrl are submitted separately to that subsystem.

Sure, I'll pull those three lines out as a separate patch.


>> @@ -801,6 +803,11 @@ void mbm_handle_overflow(struct work_struct *work)
>> update_mba_bw(prgrp, d);
>> }
>>
>> + /*
>> + * Re-check for housekeeping CPUs. This allows the overflow handler to
>> + * move off a nohz_full CPU quickly.
>> + */
>> + cpu = cpumask_any_housekeeping(&d->cpu_mask);
>> schedule_delayed_work_on(cpu, &d->mbm_over, delay);
>>
>> out_unlock:
>
> From what I can tell the nohz_full CPUs are set during boot and do not change.

But the housekeeping CPUs can be taken offline, and brought back.

With this change the work moves off the nohz_full CPU and back to the housekeeping CPU the
next time this runs. Without it, you're stuck on a nohz_full CPU until you take that CPU
offline too.
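
A small simulation of that behaviour, as a sketch only: struct domain, pick_cpu() and reschedule_overflow() are simplified stand-ins, and pick_cpu() always returns the first housekeeping CPU rather than starting from cpumask_any(). The choice is re-made at the end of every run, so the work follows the housekeeping CPU as it goes offline and comes back.

```c
#include <stdint.h>

#define NO_CPU 64	/* returned when the mask is empty */

/* First CPU set in the mask, or NO_CPU. */
static unsigned int first_cpu(uint64_t mask)
{
	unsigned int cpu;

	for (cpu = 0; cpu < NO_CPU; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;

	return NO_CPU;
}

/* Prefer an online housekeeping CPU; otherwise take any online CPU. */
static unsigned int pick_cpu(uint64_t online, uint64_t nohz_full)
{
	unsigned int hk_cpu = first_cpu(online & ~nohz_full);

	return hk_cpu != NO_CPU ? hk_cpu : first_cpu(online);
}

struct domain {
	uint64_t online;	/* CPUs of this domain that are online */
	unsigned int work_cpu;	/* where the next overflow run is queued */
};

/* Called at the end of each simulated overflow run. */
static void reschedule_overflow(struct domain *d, uint64_t nohz_full)
{
	d->work_cpu = pick_cpu(d->online, nohz_full);
}
```

With CPU 0 as the housekeeping CPU and CPU 1 nohz_full, taking CPU 0 offline moves the work to CPU 1, and the first run after CPU 0 returns moves it back.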


>> diff --git a/include/linux/tick.h b/include/linux/tick.h
>> index bfd571f18cfd..ae2e9019fc18 100644
>> --- a/include/linux/tick.h
>> +++ b/include/linux/tick.h
>> @@ -174,9 +174,10 @@ static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
>> static inline void tick_nohz_idle_stop_tick_protected(void) { }
>> #endif /* !CONFIG_NO_HZ_COMMON */
>>
>> +extern cpumask_var_t tick_nohz_full_mask;
>> +
>> #ifdef CONFIG_NO_HZ_FULL
>> extern bool tick_nohz_full_running;
>> -extern cpumask_var_t tick_nohz_full_mask;
>>
>> static inline bool tick_nohz_full_enabled(void)
>> {
>
> In addition to what Ilpo pointed out, be careful here.
> cpumask_var_t is a pointer (or array) and needs to be
> allocated before use. Moving its declaration but not the
> allocation code seems risky.

Risky how? Any use of tick_nohz_full_mask that isn't guarded by something like
tick_nohz_full_cpu() will lead to a link error regardless of the type.


Thanks,

James

2023-04-27 14:12:46

by James Morse

Subject: Re: [PATCH v3 09/19] x86/resctrl: Queue mon_event_read() instead of sending an IPI

Hi Peter,

On 22/03/2023 14:07, Peter Newman wrote:
> On Mon, Mar 20, 2023 at 6:27 PM James Morse <[email protected]> wrote:
>>
>> x86 is blessed with an abundance of monitors, one per RMID, that can be
>
> As I explained earlier, this is not the case on AMD.

I'll change it to say Intel.


>> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
>> the number implemented is up to the manufacturer. This means when there are
>> fewer monitors than needed, they need to be allocated and freed.
>>
>> Worse, the domain may be broken up into slices, and the MMIO accesses
>> for each slice may need performing from different CPUs.
>>
>> These two details mean MPAMs monitor code needs to be able to sleep, and
>> IPI another CPU in the domain to read from a resource that has been sliced.
>
> This doesn't sound very convincing. Could mon_event_read() IPI all the
> CPUs in the domain? (after waiting to allocate and install monitors
> when necessary?)

On the majority of platforms this would be a waste of time as the IPI only needs sending
to one. I'd like to keep the cost of being strange limited to the strange platforms.

I don't think exposing a 'sub domain' cpumask to resctrl is helpful: this needs to be
hidden in the architecture specific code.

The IPI is because of SoC components being implemented as slices which are private to that
slice.


The sleeping is because the CSU counters are allowed to be 'not ready' immediately after
programming. The time is short, and to allow platforms that have too few CSU monitors to
support the same user-interface as x86^W Intel, the MPAM driver needs to be able to
multiplex a single CSU monitor between multiple control/monitor groups. Allowing it to
sleep for the advertised not-ready period is the simplest way of doing this.


>> mon_event_read() already invokes mon_event_count() via IPI, which means
>> this isn't possible. On systems using nohz-full, some CPUs need to be
>> interrupted to run kernel work as they otherwise stay in user-space
>> running realtime workloads. Interrupting these CPUs should be avoided,
>> and scheduling work on them may never complete.
>>
>> Change mon_event_read() to pick a housekeeping CPU, (one that is not using
>> nohz_full) and schedule mon_event_count() and wait. If all the CPUs
>> in a domain are using nohz-full, then an IPI is used as the fallback.
>>
>> This function is only used in response to a user-space filesystem request
>> (not the timing sensitive overflow code).
>>
>> This allows MPAM to hide the slice behaviour from resctrl, and to keep
>> the monitor-allocation in monitor.c.
>
> This goal sounds more likely.
>
> If it makes the initial enablement smoother, then I'm all for it.

> Reviewed-By: Peter Newman <[email protected]>
>
> These changes worked fine for me on tip/master, though there were merge
> conflicts to resolve.
>
> Tested-By: Peter Newman <[email protected]>

Thanks!


James

2023-04-27 14:13:08

by James Morse

Subject: Re: [PATCH v3 09/19] x86/resctrl: Queue mon_event_read() instead of sending an IPI

Hi Peter,

On 23/03/2023 09:09, Peter Newman wrote:
> On Wed, Mar 22, 2023 at 3:07 PM Peter Newman <[email protected]> wrote:
>> On Mon, Mar 20, 2023 at 6:27 PM James Morse <[email protected]> wrote:
>>>
>>> x86 is blessed with an abundance of monitors, one per RMID, that can be
>>
>> As I explained earlier, this is not the case on AMD.
>>
>>> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
>>> the number implemented is up to the manufacturer. This means when there are
>>> fewer monitors than needed, they need to be allocated and freed.
>>>
>>> Worse, the domain may be broken up into slices, and the MMIO accesses
>>> for each slice may need performing from different CPUs.
>>>
>>> These two details mean MPAMs monitor code needs to be able to sleep, and
>>> IPI another CPU in the domain to read from a resource that has been sliced.
>>
>> This doesn't sound very convincing. Could mon_event_read() IPI all the
>> CPUs in the domain? (after waiting to allocate and install monitors
>> when necessary?)
>
> No wait, I know that isn't correct.
>
> As you explained it, the remote CPU needs to sleep because it may need
> to atomically acquire, install, and read a CSU monitor.
>
> It still seems possible for the mon_event_read() thread to do all the
> waiting (tell remote CPU to program CSU monitor, wait, tell same remote
> CPU to read monitor), but that sounds like more work that I don't see a
> lot of benefit to doing today.
>
> Can you update the changelog to just say the remote CPU needs to block
> when installing a CSU monitor?

Sure, I've added this after the first paragraph:
-------%<-------
MPAM's CSU monitors are used to back the 'llc_occupancy' monitor file. The
CSU counter is allowed to return 'not ready' for a small number of
micro-seconds after programming. To allow one CSU hardware monitor to be
used for multiple control or monitor groups, the CPU accessing the
monitor needs to be able to block when configuring and reading the
counter.
-------%<-------
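
To illustrate why the reader has to be able to block, here is a toy model, assuming a made-up struct csu_monitor that reports 'not ready' for three polls after programming. The csu_*() names are hypothetical and csu_wait_not_ready() only counts where a real driver would sleep. Reprogramming the same monitor for a second group is what multiplexing one hardware monitor amounts to.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of one CSU monitor. */
struct csu_monitor {
	unsigned int polls_until_ready;	/* 'not ready' window after programming */
	uint64_t value;			/* occupancy the monitor will report */
};

/* Point the monitor at a new group; the counter is not valid straight away. */
static void csu_program(struct csu_monitor *mon, uint64_t occupancy)
{
	mon->polls_until_ready = 3;	/* illustrative, not a hardware value */
	mon->value = occupancy;
}

/* Non-blocking read: -EBUSY while the monitor is still settling. */
static int csu_try_read(struct csu_monitor *mon, uint64_t *val)
{
	if (mon->polls_until_ready) {
		mon->polls_until_ready--;
		return -EBUSY;
	}

	*val = mon->value;
	return 0;
}

/* Stand-in for sleeping for the not-ready period; it only counts waits. */
static unsigned int nr_waits;

static void csu_wait_not_ready(void)
{
	nr_waits++;
}

/* Blocking read: retry until ready, giving up after max_waits waits. */
static int csu_read_blocking(struct csu_monitor *mon, uint64_t *val,
			     unsigned int max_waits)
{
	int err;

	while ((err = csu_try_read(mon, val)) == -EBUSY) {
		if (max_waits-- == 0)
			return -EBUSY;
		csu_wait_not_ready();
	}

	return err;
}
```

Code running in IPI context cannot take the waiting path, which is the reason the read is moved to a context that may sleep.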


Thanks,

James

2023-04-27 14:13:33

by James Morse

Subject: Re: [PATCH v3 09/19] x86/resctrl: Queue mon_event_read() instead of sending an IPI

Hi Reinette,

On 01/04/2023 00:25, Reinette Chatre wrote:
> On 3/20/2023 10:26 AM, James Morse wrote:
>> x86 is blessed with an abundance of monitors, one per RMID, that can be
>> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
>> the number implemented is up to the manufacturer. This means when there are
>> fewer monitors than needed, they need to be allocated and freed.
>>
>> Worse, the domain may be broken up into slices, and the MMIO accesses
>> for each slice may need performing from different CPUs.
>>
>> These two details mean MPAMs monitor code needs to be able to sleep, and
>> IPI another CPU in the domain to read from a resource that has been sliced.
>>
>> mon_event_read() already invokes mon_event_count() via IPI, which means
>> this isn't possible. On systems using nohz-full, some CPUs need to be
>> interrupted to run kernel work as they otherwise stay in user-space
>> running realtime workloads. Interrupting these CPUs should be avoided,
>> and scheduling work on them may never complete.
>>
>> Change mon_event_read() to pick a housekeeping CPU, (one that is not using
>> nohz_full) and schedule mon_event_count() and wait. If all the CPUs
>> in a domain are using nohz-full, then an IPI is used as the fallback.
>
> It is not clear to me where in this solution an IPI is used as fallback ...
> (see below)

>> @@ -537,7 +543,16 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>> rr->val = 0;
>> rr->first = first;
>>
>> - smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
>> + cpu = get_cpu();
>> + if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
>> + mon_event_count(rr);
>> + put_cpu();
>> + } else {
>> + put_cpu();
>> +
>> + cpu = cpumask_any_housekeeping(&d->cpu_mask);
>> + smp_call_on_cpu(cpu, mon_event_count, rr, false);
>> + }
>> }
>>
>
> ... from what I can tell there is no IPI fallback here. As per previous
> patch I understand cpumask_any_housekeeping() could still return
> a nohz_full CPU and calling smp_call_on_cpu() on it would not send
> an IPI but instead queue the work to it. What did I miss?

Huh, looks like it's still in my git-stash. Sorry about that. The combined hunk looks like
this:
----------------------%<----------------------
@@ -537,7 +550,26 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
rr->val = 0;
rr->first = first;

- smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+ cpu = get_cpu();
+ if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
+ mon_event_count(rr);
+ put_cpu();
+ } else {
+ put_cpu();
+
+ cpu = cpumask_any_housekeeping(&d->cpu_mask);
+
+ /*
+ * cpumask_any_housekeeping() prefers housekeeping CPUs, but
+ * are all the CPUs nohz_full? If yes, pick a CPU to IPI.
+ * MPAM's resctrl_arch_rmid_read() is unable to read the
+ * counters on some platforms if it's called in irq context.
+ */
+ if (tick_nohz_full_cpu(cpu))
+ smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+ else
+ smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
+ }
}

----------------------%<----------------------

Where smp_mon_event_count() is a static wrapper to make the types work.
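
The wrapper itself is not shown in this message, so the following is only a guess at its shape, with user-space stand-ins for the two kernel helpers. smp_call_function_any() takes a void (*)(void *) callback while smp_call_on_cpu() takes an int (*)(void *), so the same function cannot be passed to both without a wrapper. struct rmid_read is trimmed down here, and call_function_any_model() and call_on_cpu_model() are hypothetical.

```c
#include <stdint.h>

/* Trimmed-down stand-in for struct rmid_read. */
struct rmid_read {
	uint64_t val;
	int err;
};

/* IPI-style callback: returns void, as smp_call_function_any() expects. */
static void mon_event_count(void *info)
{
	struct rmid_read *rr = info;

	rr->val += 1;	/* placeholder for reading the counter */
}

/*
 * smp_call_on_cpu() wants a callback that returns int, so wrap the function
 * instead of casting its pointer to an incompatible type.
 */
static int smp_mon_event_count(void *info)
{
	mon_event_count(info);

	return 0;
}

/* User-space stand-ins: both simply run the callback on the calling thread. */
static void call_function_any_model(void (*func)(void *), void *info)
{
	func(info);
}

static int call_on_cpu_model(int (*func)(void *), void *info)
{
	return func(info);
}

/* Mirrors the choice in the hunk: IPI only when every CPU is nohz_full. */
static void mon_event_read_model(struct rmid_read *rr, int all_cpus_nohz_full)
{
	if (all_cpus_nohz_full)
		call_function_any_model(mon_event_count, rr);
	else
		call_on_cpu_model(smp_mon_event_count, rr);
}
```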


Thanks,

James

2023-04-27 14:20:36

by James Morse

Subject: Re: [PATCH v3 10/19] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep

Hi Reinette,

On 01/04/2023 00:26, Reinette Chatre wrote:
> On 3/20/2023 10:26 AM, James Morse wrote:
>
> ...
>
>> int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>> u32 closid, u32 rmid, enum resctrl_event_id eventid,
>> u64 *val)
>> {
>> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>> struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
>> + struct __rmid_read_arg arg;
>> struct arch_mbm_state *am;
>> u64 msr_val, chunks;
>> - int ret;
>> + int err;
>>
>> - if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
>> - return -EINVAL;
>> + arg.rmid = rmid;
>> + arg.eventid = eventid;
>>
>> - ret = __rmid_read(rmid, eventid, &msr_val);
>> - if (ret)
>> - return ret;
>> + err = smp_call_function_any(&d->cpu_mask, smp_call_rmid_read, &arg, true);
>> + if (err)
>> + return err;
>
> This seems to break the assumption of expected return values. __mon_event_count()
> does:
> rr->err = resctrl_arch_rmid_read()
>
> and later rdtgroup_mondata_show() only expects -EIO or -EINVAL as errors, with
> default of success.

Yes, looks like I dithered on whether cpus_read_lock() should be held over this function,
or it should tolerate the error. This is protected by rdtgroup_mutex, which means the
hotplug callbacks can't run concurrently, so the error can't occur.

I'll change it to ignore the return value.


Thanks,

James


2023-04-27 14:22:47

by James Morse

Subject: Re: [PATCH v3 12/19] x86/resctrl: Make resctrl_mounted checks explicit

Hi Reinette,

On 01/04/2023 00:28, Reinette Chatre wrote:
> On 3/20/2023 10:26 AM, James Morse wrote:
>> The rdt_enable_key is switched when resctrl is mounted, and used to
>> prevent a second mount of the filesystem. It also enables the
>> architecture's context switch code.
>>
>> This requires another architecture to have the same set of static-keys,
>> as resctrl depends on them too.
>>
>> Make the resctrl_mounted checks explicit: resctrl can keep track of
>> whether it has been mounted once. This doesn't need to be combined with
>> whether the arch code is context switching the CLOSID.
>> Tests against the rdt_mon_enable_key become a test that resctrl is
>> mounted and that monitoring is enabled.
>
> The last sentence above makes the code change hard to follow ...
> (see below)
>
>>
>> This will allow the static-key changing to be moved behind resctrl_arch_
>> calls.

>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index f38cd2f12285..6279f5c98b39 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -834,7 +834,7 @@ void mbm_handle_overflow(struct work_struct *work)
>>
>> mutex_lock(&rdtgroup_mutex);
>>
>> - if (!static_branch_likely(&rdt_mon_enable_key))
>> + if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key))
>
> ... considering the text in the changelog the "resctrl_mounted" check seems
> unnecessary. Looking ahead I wonder if this check would not be more
> appropriate in patch 15?

How so?

This is secretly relying on rdt_mon_enable_key being cleared in rdt_kill_sb() when the
filesystem is unmounted, otherwise the overflow thread keeps running once the filesystem
is unmounted.

I thought it simpler to add all these checks explicitly in one go.
That makes it simpler to thin out the static keys as their 'and it's mounted' behaviour is
no longer relied on.

I'll add comments for these cases covering why the filesystem-mounted check is needed.
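
A toy model of the two pieces of state being separated, with plain bools standing in for the static key. mount_model(), unmount_model() and overflow_handler_model() are hypothetical, and unmount deliberately leaves the monitoring flag set to show what the explicit check protects against.

```c
#include <stdbool.h>

static bool resctrl_mounted;	/* filesystem state, owned by the fs code */
static bool mon_enabled;	/* stand-in for the arch's rdt_mon_enable_key */
static unsigned int overflow_runs;

static void mount_model(void)
{
	resctrl_mounted = true;
	mon_enabled = true;
}

/* The arch flag is left set on purpose: it no longer implies 'mounted'. */
static void unmount_model(void)
{
	resctrl_mounted = false;
}

/* Returns true if the handler did its work and would reschedule itself. */
static bool overflow_handler_model(void)
{
	if (!resctrl_mounted || !mon_enabled)
		return false;

	overflow_runs++;
	return true;
}
```

Without the resctrl_mounted test the handler in this model would keep running after unmount_model().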


>> goto out_unlock;
>>
>> r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>> @@ -867,8 +867,9 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
>> unsigned long delay = msecs_to_jiffies(delay_ms);
>> int cpu;
>>
>> - if (!static_branch_likely(&rdt_mon_enable_key))
>> + if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key))
>> return;
>
> same here

If a domain comes online, mbm_setup_overflow_handler() is called. If the filesystem is not
mounted, there is nothing for it to do. Today this relies on the architecture having a
static key that resctrl can toggle when it gets unmounted.


>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 2306fbc9a9bb..5176a85f281c 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c

>> @@ -3687,8 +3693,7 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
>> if (is_llc_occupancy_enabled())
>> INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
>>
>> - /* If resctrl is mounted, add per domain monitor data directories. */
>> - if (static_branch_unlikely(&rdt_mon_enable_key))
>> + if (resctrl_mounted && static_branch_unlikely(&rdt_mon_enable_key))
>> mkdir_mondata_subdir_allrdtgrp(r, d);
>>
>> return 0;
>
> Above also, the resctrl_mounted check does not seem to be needed.

Today it's implicit in the rdt_mon_enable_key.

If the filesystem isn't mounted, there is no need to create the directories as no-one can
see them. (it does look like it would be harmless as kernfs_create_root() is called once
at init time).
Instead this work gets done at mount time by rdt_get_tree() calling mkdir_mondata_all().


Thanks,

James

2023-04-27 14:23:17

by James Morse

Subject: Re: [PATCH v3 11/19] x86/resctrl: Allow arch to allocate memory needed in resctrl_arch_rmid_read()

Hi Reinette,

On 01/04/2023 00:27, Reinette Chatre wrote:
> On 3/20/2023 10:26 AM, James Morse wrote:
>> Depending on the number of monitors available, Arm's MPAM may need to
>> allocate a monitor prior to reading the counter value. Allocating a
>> contended resource may involve sleeping.
>>
>> All callers of resctrl_arch_rmid_read() read the counter on more than
>> one domain. If the monitor is allocated globally, there is no need to
>
> This does not seem accurate considering the __check_limbo() call that
> is called for a single domain.

True, it was add_rmid_to_limbo() that motivated this being global, but it's conflated with
holding the allocation for multiple invocations of resctrl_arch_rmid_read() for the
multiple groups that __check_limbo() walks over, and the calls for each monitor group that
mon_event_count() makes.


>> allocate and free it for each call to resctrl_arch_rmid_read().
>>
>> Add arch hooks for this allocation, which need calling before
>> resctrl_arch_rmid_read(). The allocated monitor is passed to
>> resctrl_arch_rmid_read(), then freed again afterwards. The helper
>> can be called on any CPU, and can sleep.


>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index de72df06b37b..f38cd2f12285 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -15,6 +15,7 @@
>> * Software Developer Manual June 2016, volume 3, section 17.17.
>> */
>>
>> +#include <linux/cpu.h>
>
> Why is this needed?

lockdep_assert_cpus_held(), but that got folded out to a later patch. I've moved it there.


>> #include <linux/module.h>
>> #include <linux/sizes.h>
>> #include <linux/slab.h>
>> @@ -271,7 +272,7 @@ static void smp_call_rmid_read(void *_arg)
>>
>> int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>> u32 closid, u32 rmid, enum resctrl_event_id eventid,
>> - u64 *val)
>> + u64 *val, int ignored)
>> {
>> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>> struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
>> @@ -317,9 +318,14 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
>> u32 idx_limit = resctrl_arch_system_num_rmid_idx();
>> struct rmid_entry *entry;
>> u32 idx, cur_idx = 1;
>> + int arch_mon_ctx;
>> bool rmid_dirty;
>> u64 val = 0;
>>
>> + arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID);
>> + if (arch_mon_ctx < 0)
>> + return;

> The vision for this is not clear to me. When I read that context needs to be allocated
> I expect it to return a pointer to some new context, not an int. What would the
> "context" consist of?

It might just need a different name.

For MPAM, this is allocating a monitor, which is the hardware that does the counting in
the cache or the memory controller. The number of monitors is an implementation choice,
and may not match the number of CLOSID/RMID that are in use. There aren't guaranteed to be
enough to allocate one for every control or monitor group up front.

The int being returned is the allocated monitor's index. It identifies which monitor needs
programming to read the provided CLOSID/RMID, and the counter register to read with the value.

I can allocate memory for an int if you think that is clearer.
(I was hoping to leave that for whoever needs something bigger than a pointer)
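
For illustration only, a user-space model of a context that is just a monitor index. mon_ctx_alloc_model(), mon_ctx_free_model() and rmid_read_model() are hypothetical names modelled on the hooks in this patch, NUM_MONITORS is an assumed pool size, and the model fails with -ENOSPC where a real implementation might sleep until a monitor is free.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_MONITORS 2	/* assumption: fewer monitors than groups */

static bool monitor_busy[NUM_MONITORS];
static uint64_t monitor_counter[NUM_MONITORS];

/* Returns the index of the allocated monitor, or -ENOSPC if all are busy. */
static int mon_ctx_alloc_model(void)
{
	int i;

	for (i = 0; i < NUM_MONITORS; i++) {
		if (!monitor_busy[i]) {
			monitor_busy[i] = true;
			return i;
		}
	}

	return -ENOSPC;
}

static void mon_ctx_free_model(int ctx)
{
	if (ctx >= 0 && ctx < NUM_MONITORS)
		monitor_busy[ctx] = false;
}

/*
 * The context says which monitor to program with closid/rmid, and which
 * counter to read the value back from.
 */
static int rmid_read_model(uint32_t closid, uint32_t rmid, uint64_t *val,
			   int ctx)
{
	if (ctx < 0 || ctx >= NUM_MONITORS || !monitor_busy[ctx])
		return -EINVAL;

	/* Placeholder for programming the monitor and letting it count. */
	monitor_counter[ctx] = ((uint64_t)closid << 32) | rmid;
	*val = monitor_counter[ctx];

	return 0;
}
```

A caller allocates once, reads as many CLOSID/RMID pairs as it needs with the same context, then frees it.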


>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>> index ff7452f644e4..03e4f41cd336 100644
>> --- a/include/linux/resctrl.h
>> +++ b/include/linux/resctrl.h
>> @@ -233,6 +233,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
>> * @rmid: rmid of the counter to read.
>> * @eventid: eventid to read, e.g. L3 occupancy.
>> * @val: result of the counter read in bytes.
>> + * @arch_mon_ctx: An allocated context from resctrl_arch_mon_ctx_alloc().
>> *

> Could this description be expanded to indicate what this context is used for?

Sure,
"An architecture specific value from resctrl_arch_mon_ctx_alloc(), for MPAM this
identifies the hardware monitor allocated for this read request".



Thanks,

James

2023-04-27 14:23:29

by James Morse

Subject: Re: [PATCH v3 15/19] x86/resctrl: Add helpers for system wide mon/alloc capable

On 01/04/2023 00:29, Reinette Chatre wrote:
> Hi James,
>
> On 3/20/2023 10:26 AM, James Morse wrote:
>> resctrl reads rdt_alloc_capable or rdt_mon_capable to determine
>> whether any of the resources support the corresponding features.
>> resctrl also uses the static-keys that affect the architecture's
>> context-switch code to determine the same thing.
>
> hmmm ... they are not the same though since the static-keys
> in addition means that resctrl is mounted.

and all the paths where this matters were updated in patch 12 to have an explicit
resctrl_mounted check.


Thanks,

James

2023-04-27 14:23:40

by James Morse

Subject: Re: [PATCH v3 17/19] x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but cpu

Hi Ilpo,

On 21/03/2023 15:25, Ilpo Järvinen wrote:
> On Tue, 21 Mar 2023, Ilpo Järvinen wrote:
>> On Mon, 20 Mar 2023, James Morse wrote:
>>
>>> When a CPU is taken offline resctrl may need to move the overflow or
>>> limbo handlers to run on a different CPU.
>>>
>>> Once the offline callbacks have been split, cqm_setup_limbo_handler()
>>> will be called while the CPU that is going offline is still present
>>> in the cpu_mask.
>>>
>>> Pass the CPU to exclude to cqm_setup_limbo_handler() and
>>> mbm_setup_overflow_handler(). These functions can use a variant of
>>> cpumask_any_but() when selecting the CPU. -1 is used to indicate no CPUs
>>> need excluding.

>>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>>> index 3eb5b307b809..47838ba6876e 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>>> @@ -78,6 +78,37 @@ static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
>>> return cpu;
>>> }
>>>
>>> +/**
>>> + * cpumask_any_housekeeping_but() - Chose any cpu in @mask, preferring those
>>> + * that aren't marked nohz_full, excluding
>>> + * the provided CPU
>>> + * @mask: The mask to pick a CPU from.
>>> + * @exclude_cpu:The CPU to avoid picking.
>>> + *
>>> + * Returns a CPU from @mask, but not @but. If there are houskeeping CPUs that
>>> + * don't use nohz_full, these are preferred.
>>> + * Returns >= nr_cpu_ids if no CPUs are available.
>>> + */
>>> +static inline unsigned int
>>> +cpumask_any_housekeeping_but(const struct cpumask *mask, int exclude_cpu)
>>> +{
>>> + int cpu, hk_cpu;
>>> +
>>> + cpu = cpumask_any_but(mask, exclude_cpu);
>>> + if (tick_nohz_full_cpu(cpu)) {
>>> + hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
>>> + if (hk_cpu == exclude_cpu) {
>>> + hk_cpu = cpumask_nth_andnot(1, mask,
>>> + tick_nohz_full_mask);

>> I'm left to wonder if it's okay to alter tick_nohz_full_mask in resctrl
>> code??

Why do you think cpumask_nth_andnot() modifies its arguments?

The cpumask arguments to cpumask_nth_andnot() are const.


> I suppose it should do instead:
> hk_cpu = cpumask_nth_and(0, mask, tick_nohz_full_mask);
> if (hk_cpu == exclude_cpu)
> hk_cpu = cpumask_next_and(hk_cpu, mask, tick_nohz_full_mask);
>

Removing the 'not' changes the behaviour. hk_cpu is now guaranteed to be a nohz_full CPU.
This needs to prefer CPUs that are not in that mask.

Passing 'hk_cpu' the second time doesn't look right, hk_cpu is a CPU-number, not a count
of the 'nth CPU to find', which is what the argument expects.
For example: If the mask only has CPU 10-12, where CPU 10 should be excluded, it's possible
the first attempt for the 0th CPU returns 10... in which case I want to pass '1' now I
know that the 0th is the excluded CPU. If I pass 10 I expect an error, as there aren't 10
bits set in the mask.
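
The CPU 10-12 example can be modelled in user space as below. This is a sketch: masks are uint64_t values, the helper names are stand-ins for the kernel ones, and the tail of any_housekeeping_but() is assumed by analogy with cpumask_any_housekeeping(), since the quoted hunk is cut short.

```c
#include <stdint.h>

#define NR_CPUS 64	/* assumption: one bit per CPU in a uint64_t */

/* Bit for a CPU number, or 0 for an out-of-range value such as -1. */
static uint64_t cpu_bit(int cpu)
{
	return (cpu >= 0 && cpu < NR_CPUS) ? UINT64_C(1) << cpu : 0;
}

/*
 * n is a count ("the n-th match"), not a CPU number. Returns the n-th CPU
 * set in a and clear in b, or NR_CPUS if there are not that many matches.
 */
static int nth_andnot(unsigned int n, uint64_t a, uint64_t b)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!(a & ~b & cpu_bit(cpu)))
			continue;
		if (n-- == 0)
			return cpu;
	}

	return NR_CPUS;
}

/* Model of cpumask_any_housekeeping_but(); exclude_cpu == -1 excludes none. */
static int any_housekeeping_but(uint64_t mask, uint64_t nohz_full,
				int exclude_cpu)
{
	int cpu, hk_cpu;

	/* Stand-in for cpumask_any_but(): first CPU that is not excluded. */
	cpu = nth_andnot(0, mask, cpu_bit(exclude_cpu));
	if (cpu < NR_CPUS && (nohz_full & cpu_bit(cpu))) {
		hk_cpu = nth_andnot(0, mask, nohz_full);
		/* The 0th match is the excluded CPU, so ask for the 1st. */
		if (hk_cpu == exclude_cpu)
			hk_cpu = nth_andnot(1, mask, nohz_full);
		if (hk_cpu < NR_CPUS)
			cpu = hk_cpu;
	}

	return cpu;
}
```

With mask 0x1C00 (CPUs 10-12), CPU 11 nohz_full and CPU 10 excluded, the model returns CPU 12. Asking nth_andnot() for the 10th match on the same masks returns NR_CPUS, because only two CPUs qualify.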


Thanks,

James

2023-04-27 14:23:44

by James Morse

Subject: Re: [PATCH v3 18/19] x86/resctrl: Add cpu offline callback for resctrl work

Hi Ilpo,

On 21/03/2023 15:32, Ilpo Järvinen wrote:
> On Mon, 20 Mar 2023, James Morse wrote:
>
>> The resctrl architecture specific code may need to free a domain when
>> a CPU goes offline, it also needs to reset the CPUs PQR_ASSOC register.
>> The resctrl filesystem code needs to move the overflow and limbo work
>> to run on a different CPU, and clear this CPU from the cpu_mask of
>> control and monitor groups.
>>
>> Currently this is all done in core.c and called from
>> resctrl_offline_cpu(), making the split between architecture and
>> filesystem code unclear.
>>
>> Move the filesystem work into a filesystem helper called
>> resctrl_offline_cpu(), and rename the one in core.c
>> resctrl_arch_offline_cpu().
>>
>> The rdtgroup_mutex is unlocked and locked again in the call in
>> preparation for changing the locking rules for the architecture
>> code.
>>
>> resctrl_offline_cpu() is called before any of the resource/domains
>> are updated, and makes use of the exclude_cpu feature that was
>> previously added.

>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>> index aafe4b74587c..4e5fc89dab6d 100644
>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>> @@ -578,22 +578,6 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>>
>> return;
>> }
>> -
>> - if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
>> - if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
>> - cancel_delayed_work(&d->mbm_over);
>> - /*
>> - * exclude_cpu=-1 as this CPU has already been removed
>> - * by cpumask_clear_cpu()d
>> - */
>
> This was added in 17/19 and now removed (not moved) in 18/19. Please avoid
> such back-and-forth churn.

This is the cost of making small incremental changes that should be easier to review.
The intermediate step was a little odd, so came with a comment. (I normally mark those as
'temporary', but didn't bother this time as they are adjacent patches)

If you'd prefer, I can merge these patches together... but from Reinette's feedback it's
likely I'll split them up even more.


Thanks,

James

2023-04-27 14:23:57

by James Morse

Subject: Re: [PATCH v3 18/19] x86/resctrl: Add cpu offline callback for resctrl work

Hi Reinette,

On 06/04/2023 00:48, Reinette Chatre wrote:
> On 3/20/2023 10:26 AM, James Morse wrote:
>
>> -static int resctrl_offline_cpu(unsigned int cpu)
>> -{
>> - struct rdtgroup *rdtgrp;
>> struct rdt_resource *r;
>>
>> mutex_lock(&rdtgroup_mutex);
>> + resctrl_offline_cpu(cpu);
>> +
>> for_each_capable_rdt_resource(r)
>> domain_remove_cpu(cpu, r);
>> - list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
>> - if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
>> - clear_childcpus(rdtgrp, cpu);
>> - break;
>> - }
>> - }
>> clear_closid_rmid(cpu);
>> mutex_unlock(&rdtgroup_mutex);
>>
>
> I find this and the previous patch to be very complicated.

It consolidates the parts of this that have nothing to do with the architecture specific code.
The extra work is because the semantics are: "this CPU is going away", so the callee needs
to not pick 'this CPU' again when updating any structures.

Ensuring the structures have not yet been modified by the architecture code is the
cleanest interface as there is nothing special about what the arch code provides to the
filesystem here.

I agree it looks like a special case, but only because the existing code is being called
halfway through the tear down, and depends on what the arch code has already done.

Having a single call, where nothing has been changed yet is the most maintainable option
as it avoids extra hooks, or an incomplete list of what has been torn down, and what
hasn't - some of which may be architecture specific.

It also avoids any interaction with how the architecture code chooses to prevent multiple
writers to the domain list - I don't want any of the filesystem code to depend on a lock
held by the architecture specific code.
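
The ordering argued for above can be sketched as a userspace model. It assumes a 64-bit integer stands in for the domain's cpu_mask; every model_* name is hypothetical and only illustrates the shape of the calls:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 64u

struct model_domain {
	uint64_t cpu_mask;	/* CPUs currently in the domain */
	unsigned int work_cpu;	/* CPU running the limbo/overflow work */
};

/* First CPU in mask other than exclude_cpu, or MODEL_NR_CPUS if none. */
static unsigned int model_any_but(uint64_t mask, unsigned int exclude_cpu)
{
	unsigned int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (cpu != exclude_cpu && (mask & (UINT64_C(1) << cpu)))
			return cpu;
	return MODEL_NR_CPUS;
}

/* Filesystem step: runs while @cpu is still set in the domain's mask. */
static void model_fs_offline_cpu(struct model_domain *d, unsigned int cpu)
{
	if (d->work_cpu == cpu)
		d->work_cpu = model_any_but(d->cpu_mask, cpu);
}

/* Arch step: call the filesystem first, then tear down its own state. */
static void model_arch_offline_cpu(struct model_domain *d, unsigned int cpu)
{
	assert(cpu < MODEL_NR_CPUS);
	model_fs_offline_cpu(d, cpu);
	d->cpu_mask &= ~(UINT64_C(1) << cpu);
}
```

Because the filesystem step sees the unmodified mask, it needs no knowledge of what the arch step has or has not torn down yet; it only has to avoid the CPU it was told about.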


> It is not clear
> to me why resctrl_offline_cpu(cpu) is required to be before offline of domain.
> Previous patch would not be needed if the existing order of operations
> is maintained.

The existing order is a bit of a soup.

You'd need a resctrl_domain_rebalance_helpers() to move the limbo and mbm workers, but
this would run after the CPU had been removed from the domain. Hopefully the name conveys
that it doesn't always run when a CPU is going offline.
resctrl_offline_cpu() would potentially run after the CPUs domains have been free()d,
depending on what gets added in the future this might be a problem, leading to a
resctrl_pre_offline_cpu() hook.

I worry this strange state leads to extra special-case'd filesystem code, and extra hooks.


I can split the consolidation of the filesystem code up in this patch, the
clear_childcpus() and limbo/mbm stuff can be done in separate patches, which might make it
easier on the eye.


Thanks,

James

2023-04-27 14:38:31

by Ilpo Järvinen

Subject: Re: [PATCH v3 08/19] x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow

On Thu, 27 Apr 2023, James Morse wrote:

> Hi Ilpo,
>
> On 21/03/2023 15:14, Ilpo Järvinen wrote:
> > On Mon, 20 Mar 2023, James Morse wrote:
> >
> >> The limbo and overflow code picks a CPU to use from the domain's list
> >> of online CPUs. Work is then scheduled on these CPUs to maintain
> >> the limbo list and any counters that may overflow.
> >>
> >> cpumask_any() may pick a CPU that is marked nohz_full, which will
> >> either penalise the work that CPU was dedicated to, or delay the
> >> processing of limbo list or counters that may overflow. Perhaps
> >> indefinitely. Delaying the overflow handling will skew the bandwidth
> >> values calculated by mba_sc, which expects to be called once a second.
> >>
> >> Add cpumask_any_housekeeping() as a replacement for cpumask_any()
> >> that prefers housekeeping CPUs. This helper will still return
> >> a nohz_full CPU if that is the only option. The CPU to use is
> >> re-evaluated each time the limbo/overflow work runs. This ensures
> >> the work will move off a nohz_full CPU once a houskeeping CPU is
> >
> > housekeeping
> >
> >> available.
>
> >> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> >> index 87545e4beb70..0b5fd5a0cda2 100644
> >> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> >> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>
> >> +/**
> >> + * cpumask_any_housekeeping() - Chose any cpu in @mask, preferring those that
> >> + * aren't marked nohz_full
> >> + * @mask: The mask to pick a CPU from.
> >> + *
> >> + * Returns a CPU in @mask. If there are houskeeping CPUs that don't use
> >> + * nohz_full, these are preferred.
> >> + */
> >> +static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
> >> +{
> >> + int cpu, hk_cpu;
> >> +
> >> + cpu = cpumask_any(mask);
> >> + if (tick_nohz_full_cpu(cpu)) {
> >> + hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
> >
> > Why cpumask_nth_and() is not enough here? ..._andnot() seems to alter
> > tick_nohz_full_mask which doesn't seem desirable?
>
> tick_nohz_full_mask is the list of CPUs we should avoid. This wants to find the first cpu
> set in the domain mask, and clear in tick_nohz_full_mask.
>
> Where does cpumask_nth_andnot() modify its arguments? Its arguments are const.

Ah, it doesn't, I'm sorry about that.

I think I was trapped by ambiguous English:
* cpumask_nth_andnot - get the first cpu set in 1st cpumask, and clear in 2nd.
...which can be understood as it clearing it in 2nd.


--
i.

2023-04-27 15:06:41

by Ilpo Järvinen

Subject: Re: [PATCH v3 18/19] x86/resctrl: Add cpu offline callback for resctrl work

On Thu, 27 Apr 2023, James Morse wrote:

> Hi Ilpo,
>
> On 21/03/2023 15:32, Ilpo Järvinen wrote:
> > On Mon, 20 Mar 2023, James Morse wrote:
> >
> >> The resctrl architecture specific code may need to free a domain when
> >> a CPU goes offline, it also needs to reset the CPUs PQR_ASSOC register.
> >> The resctrl filesystem code needs to move the overflow and limbo work
> >> to run on a different CPU, and clear this CPU from the cpu_mask of
> >> control and monitor groups.
> >>
> >> Currently this is all done in core.c and called from
> >> resctrl_offline_cpu(), making the split between architecture and
> >> filesystem code unclear.
> >>
> >> Move the filesystem work into a filesystem helper called
> >> resctrl_offline_cpu(), and rename the one in core.c
> >> resctrl_arch_offline_cpu().
> >>
> >> The rdtgroup_mutex is unlocked and locked again in the call in
> >> preparation for changing the locking rules for the architecture
> >> code.
> >>
> >> resctrl_offline_cpu() is called before any of the resource/domains
> >> are updated, and makes use of the exclude_cpu feature that was
> >> previously added.
>
> >> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> >> index aafe4b74587c..4e5fc89dab6d 100644
> >> --- a/arch/x86/kernel/cpu/resctrl/core.c
> >> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> >> @@ -578,22 +578,6 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> >>
> >> return;
> >> }
> >> -
> >> - if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
> >> - if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
> >> - cancel_delayed_work(&d->mbm_over);
> >> - /*
> >> - * exclude_cpu=-1 as this CPU has already been removed
> >> - * by cpumask_clear_cpu()d
> >> - */
> >
> > This was added in 17/19 and now removed (not moved) in 18/19. Please avoid
> > such back-and-forth churn.
>
> This is the cost of making small incremental changes that should be easier to review.
> The intermediate step was a little odd, so came with a comment. (I normally mark those as
> 'temporary', but didn't bother this time as they are adjacent patches)

Why not mention the oddity at the end of changelog then? That keeps the
diffs clean of temporary comments.

> If you'd prefer, I can merge these patches together... but from
> Reinette's feedback its likely I'll split them up even more.

I don't prefer merging.

--
i.

2023-04-27 23:38:26

by Reinette Chatre

Subject: Re: [PATCH v3 08/19] x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow

Hi James,

On 4/27/2023 7:10 AM, James Morse wrote:
> Hi Reinette,
>
> On 01/04/2023 00:24, Reinette Chatre wrote:
>> On 3/20/2023 10:26 AM, James Morse wrote:
>>> The limbo and overflow code picks a CPU to use from the domain's list
>>> of online CPUs. Work is then scheduled on these CPUs to maintain
>>> the limbo list and any counters that may overflow.
>>>
>>> cpumask_any() may pick a CPU that is marked nohz_full, which will
>>> either penalise the work that CPU was dedicated to, or delay the
>>
>> penalise -> penalize
>
> (s->z is the difference between British English and American English)

My apologies.

>>> processing of limbo list or counters that may overflow. Perhaps
>>> indefinitely. Delaying the overflow handling will skew the bandwidth
>>> values calculated by mba_sc, which expects to be called once a second.
>>>
>>> Add cpumask_any_housekeeping() as a replacement for cpumask_any()
>>> that prefers housekeeping CPUs. This helper will still return
>>> a nohz_full CPU if that is the only option. The CPU to use is
>>> re-evaluated each time the limbo/overflow work runs. This ensures
>>> the work will move off a nohz_full CPU once a houskeeping CPU is
>>> available.
>
>>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>>> index 87545e4beb70..0b5fd5a0cda2 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>>> @@ -55,6 +56,28 @@
>>> /* Max event bits supported */
>>> #define MAX_EVT_CONFIG_BITS GENMASK(6, 0)
>>>
>>> +/**
>>> + * cpumask_any_housekeeping() - Chose any cpu in @mask, preferring those that
>>> + * aren't marked nohz_full
>>> + * @mask: The mask to pick a CPU from.
>>> + *
>>> + * Returns a CPU in @mask. If there are houskeeping CPUs that don't use
>>> + * nohz_full, these are preferred.
>>> + */
>>> +static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
>>> +{
>>> + int cpu, hk_cpu;
>>> +
>>> + cpu = cpumask_any(mask);
>>> + if (tick_nohz_full_cpu(cpu)) {
>>> + hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
>>> + if (hk_cpu < nr_cpu_ids)
>>> + cpu = hk_cpu;
>>> + }
>>> +
>
>> I think as a start this could perhaps be a #if defined(CONFIG_NO_HZ_FULL). There
>> appears to be a precedent for this in kernel/rcu/tree_nocb.h.
>
> This harms readability, and prevents the compiler from testing that this is valid C code
> for any compile of this code.
>
> With if-def's here you'd be reliant on some CI system to build with the required
> combination of Kconfig symbols to expose any warnings.
>
> It's much better to use IS_ENABLED() in the helpers and rely on the compiler's
> dead-code-elimination to remove paths that have been configured out.
>
> (See the section on Conditional Compilation in coding-style for a much better summary!)

My assumption was that you intended to implement what is described first in
the document you point to. That is, providing no-op stub versions for all
and then calling everything unconditionally. Since I did not see universal stubs
for the code you are using I was looking at how other areas in the kernel handled
the same.

Reading your response to Ilpo and what you write later I now see that you are using
a combination of no-op stubs and conditional compilation. That is, you use a no-op stub,
instead of "IS_ENABLED()" or "#if" to conditionally compile some code. I am not familiar
with how compilers handle these scenarios.

>> Apart from the issue that Ilpo pointed out I would prefer that any changes outside
>> resctrl are submitted separately to that subsystem.
>
> Sure, I'll pull those three lines out as a separate patch.
>
>
>>> @@ -801,6 +803,11 @@ void mbm_handle_overflow(struct work_struct *work)
>>> update_mba_bw(prgrp, d);
>>> }
>>>
>>> + /*
>>> + * Re-check for housekeeping CPUs. This allows the overflow handler to
>>> + * move off a nohz_full CPU quickly.
>>> + */
>>> + cpu = cpumask_any_housekeeping(&d->cpu_mask);
>>> schedule_delayed_work_on(cpu, &d->mbm_over, delay);
>>>
>>> out_unlock:
>>
>> From what I can tell the nohz_full CPUs are set during boot and do not change.
>
> But the house keeping CPUs can be taken offline, and brought back.
>
> With this change the work moves off the nohz_full CPU and back to the housekeeping CPU the
> next time this runs. Without it, you're stuck on a nohz_full CPU until you take that CPU
> offline too.

Good point, thanks.
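
The effect of re-evaluating the CPU each time the work runs can be sketched with a userspace model. It assumes a 64-bit integer stands in for a cpumask and that "any CPU" means the lowest set bit; the model_* names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 64u

/* Lowest CPU set in mask, or MODEL_NR_CPUS for an empty mask. */
static unsigned int model_first(uint64_t mask)
{
	unsigned int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return MODEL_NR_CPUS;
}

/* Prefer an online CPU that is not nohz_full, fall back to any online CPU. */
static unsigned int model_any_housekeeping(uint64_t online, uint64_t nohz_full)
{
	unsigned int cpu = model_first(online);
	unsigned int hk_cpu;

	if (cpu < MODEL_NR_CPUS && (nohz_full & (UINT64_C(1) << cpu))) {
		hk_cpu = model_first(online & ~nohz_full);
		if (hk_cpu < MODEL_NR_CPUS)
			cpu = hk_cpu;
	}

	/* The result is either "none" or a CPU that is actually online. */
	assert(cpu == MODEL_NR_CPUS || (online & (UINT64_C(1) << cpu)));
	return cpu;
}
```

Calling the helper again on every run is what lets the work migrate: with CPUs 1-2 marked nohz_full and housekeeping CPU 3 offline the model returns 1, and once CPU 3 is back online the next call returns 3.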

>>> diff --git a/include/linux/tick.h b/include/linux/tick.h
>>> index bfd571f18cfd..ae2e9019fc18 100644
>>> --- a/include/linux/tick.h
>>> +++ b/include/linux/tick.h
>>> @@ -174,9 +174,10 @@ static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
>>> static inline void tick_nohz_idle_stop_tick_protected(void) { }
>>> #endif /* !CONFIG_NO_HZ_COMMON */
>>>
>>> +extern cpumask_var_t tick_nohz_full_mask;
>>> +
>>> #ifdef CONFIG_NO_HZ_FULL
>>> extern bool tick_nohz_full_running;
>>> -extern cpumask_var_t tick_nohz_full_mask;
>>>
>>> static inline bool tick_nohz_full_enabled(void)
>>> {
>>
>> In addition to what Ilpo pointed out, be careful here.
>> cpumask_var_t is a pointer (or array) and needs to be
>> allocated before use. Moving its declaration but not the
>> allocation code seems risky.
>
> Risky how? Any use of tick_nohz_full_mask that isn't guarded by something like
> tick_nohz_full_cpu() will lead to a link error regardless of the type.

I assumed that the intention was to create an actual "no-op" stub for this
mask, enabling it to be used unconditionally. That the intention is for it
to be guarded and how the compiler deals with this was not obvious to me. I think
it would be good to call out this usage when submitting this to the appropriate
maintainers. A comment near the declaration may help users to know how it is
intended to be used.
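
For readers unfamiliar with the guarded-usage pattern discussed above, here is a minimal userspace sketch. MODEL_NO_HZ_FULL is a hypothetical stand-in for the Kconfig symbol, and unlike the kernel case the mask is defined here so that the sketch links at any optimisation level:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a Kconfig symbol: 0 models the feature being off. */
#define MODEL_NO_HZ_FULL 0

/* Assumption: CPUs 1 and 2 would be nohz_full if the feature were enabled. */
static const uint64_t model_nohz_full_mask = 0x6;

/* Stub-style helper: a compile-time constant 'false' when the feature is off. */
static inline bool model_nohz_full_cpu(unsigned int cpu)
{
	assert(cpu < 64);
	if (!MODEL_NO_HZ_FULL)
		return false;
	return model_nohz_full_mask & (UINT64_C(1) << cpu);
}

static unsigned int model_pick_cpu(unsigned int cpu, unsigned int fallback)
{
	/*
	 * This branch is always parsed and type-checked. When the helper is
	 * constant false, an optimising compiler can discard the body.
	 */
	if (model_nohz_full_cpu(cpu))
		return fallback;
	return cpu;
}
```

The guarded code is compiled in every configuration, so type errors show up without a special Kconfig combination, while the generated code for the disabled case can shrink to the unguarded path.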

Reinette



2023-04-27 23:42:55

by Reinette Chatre

Subject: Re: [PATCH v3 12/19] x86/resctrl: Make resctrl_mounted checks explicit

Hi James,

On 4/27/2023 7:19 AM, James Morse wrote:
> Hi Reinette,
>
> On 01/04/2023 00:28, Reinette Chatre wrote:
>> On 3/20/2023 10:26 AM, James Morse wrote:
>>> The rdt_enable_key is switched when resctrl is mounted, and used to
>>> prevent a second mount of the filesystem. It also enables the
>>> architecture's context switch code.
>>>
>>> This requires another architecture to have the same set of static-keys,
>>> as resctrl depends on them too.
>>>
>>> Make the resctrl_mounted checks explicit: resctrl can keep track of
>>> whether it has been mounted once. This doesn't need to be combined with
>>> whether the arch code is context switching the CLOSID.
>>> Tests against the rdt_mon_enable_key become a test that resctrl is
>>> mounted and that monitoring is enabled.
>>
>> The last sentence above makes the code change hard to follow ...
>> (see below)
>>
>>> This will allow the static-key changing to be moved behind resctrl_arch_
>>> calls.
>
>>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>>> index f38cd2f12285..6279f5c98b39 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>>> @@ -834,7 +834,7 @@ void mbm_handle_overflow(struct work_struct *work)
>>>
>>> mutex_lock(&rdtgroup_mutex);
>>>
>>> - if (!static_branch_likely(&rdt_mon_enable_key))
>>> + if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key))
>>
>> ... considering the text in the changelog the "resctrl_mounted" check seems
>> unnecessary. Looking ahead I wonder if this check would not be more
>> appropriate in patch 15?
>
> How so?
>
> This is secretly relying on rdt_mon_enable_key being cleared in rdt_kill_sb() when the
> filesystem is unmounted, otherwise the overflow thread keeps running once the filesystem
> is unmounted.

hmmm ... I do not think my feedback was clear. I understand that this is done
as a prep patch but that was only clear when I read patch 15 because as the
work is presented here it seems unnecessary.

>
> I thought it simpler to add all these checks explicitly in one go.
> That makes it simpler to thin out the static keys as their 'and its mounted' behaviour is
> no longer relied on.

Understood. If you want to keep this as a prep patch, could you please update the
changelog to reflect this? The following sentence in the changelog makes this patch
hard to follow since it essentially claims that what this patch does is unnecessary:
"Tests against the rdt_mon_enable_key become a test that resctrl is mounted
and that monitoring is enabled."

I also do still wonder why these resctrl_mounted checks cannot move to patch
15 when they are needed. Adding them there makes it obvious that rdt_mon_enable_key
had a dual purpose that patch 15 splits into two separate checks.

Reinette




2023-04-27 23:49:36

by Reinette Chatre

Subject: Re: [PATCH v3 11/19] x86/resctrl: Allow arch to allocate memory needed in resctrl_arch_rmid_read()

Hi James,

On 4/27/2023 7:19 AM, James Morse wrote:
> On 01/04/2023 00:27, Reinette Chatre wrote:
>> On 3/20/2023 10:26 AM, James Morse wrote:

...

>>> #include <linux/module.h>
>>> #include <linux/sizes.h>
>>> #include <linux/slab.h>
>>> @@ -271,7 +272,7 @@ static void smp_call_rmid_read(void *_arg)
>>>
>>> int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>>> u32 closid, u32 rmid, enum resctrl_event_id eventid,
>>> - u64 *val)
>>> + u64 *val, int ignored)
>>> {
>>> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>>> struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
>>> @@ -317,9 +318,14 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
>>> u32 idx_limit = resctrl_arch_system_num_rmid_idx();
>>> struct rmid_entry *entry;
>>> u32 idx, cur_idx = 1;
>>> + int arch_mon_ctx;
>>> bool rmid_dirty;
>>> u64 val = 0;
>>>
>>> + arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID);
>>> + if (arch_mon_ctx < 0)
>>> + return;
>
>> The vision for this is not clear to me. When I read that context needs to be allocated
>> I expect it to return a pointer to some new context, not an int. What would the
>> "context" consist of?
>
> It might just need a different name.
>
> For MPAM, this is allocating a monitor, which is the hardware that does the counting in
> the cache or the memory controller. The number of monitors is an implementation choice,
> and may not match the number of CLOSID/RMID that are in use. There aren't guaranteed to be
> enough to allocate one for every control or monitor group up front.
>
> The int being returned is the allocated monitor's index. It identifies which monitor needs
> programming to read the provided CLOSID/RMID, and the counter register to read with the value.

I see.

>
> I can allocate memory for an int if you think that is clearer.
> (I was hoping to leave that for whoever needs something bigger than a pointer)

I'd rather not complicate it in this way.

>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>> index ff7452f644e4..03e4f41cd336 100644
>>> --- a/include/linux/resctrl.h
>>> +++ b/include/linux/resctrl.h
>>> @@ -233,6 +233,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
>>> * @rmid: rmid of the counter to read.
>>> * @eventid: eventid to read, e.g. L3 occupancy.
>>> * @val: result of the counter read in bytes.
>>> + * @arch_mon_ctx: An allocated context from resctrl_arch_mon_ctx_alloc().
>>> *
>
>> Could this description be expanded to indicate what this context is used for?
>
> Sure,
> "An architecture specific value from resctrl_arch_mon_ctx_alloc(), for MPAM this
> identifies the hardware monitor allocated for this read request".

This helps. Thank you.

Reinette

2023-05-23 17:22:37

by Luck, Tony

Subject: Re: [PATCH v3 00/19] x86/resctrl: monitored closid+rmid together, separate arch/fs locking

Hi all,

Looking at the changes already applied, and those planned to support
new architectures, new features, and quirks in specific implementations,
it is clear to me that the original resctrl file system implementation
did not provide enough flexibility for all the additions that are
needed.

So I've begun musing with 20-20 hindsight on how resctrl could have
provided better abstract building blocks.

The concept of a "resource" structure with a list of domains for
specific instances of that structure on a platform still seems like
a good building block.

But sharing those structures across increasingly different implementations
of the underlying resource is resulting in extra gymnastic efforts to
make all the new uses co-exist with the old. E.g. the domain structure
has elements for every type of resource even though each instance is
linked to just one resource type.

I had begun this journey with a plan to just allow new features to
hook into the existing resctrl filesystem with a "driver" registration
mechanism:

https://lore.kernel.org/all/[email protected]/

But feedback from Reinette that this would be cleaner if drivers created
new resources, rather than adding a patchwork of callback functions with
special case "if (type == DRIVER)" sprinkled around made me look into
a more radical redesign instead of joining in the trend of making the
smallest set of changes to meet my goals.


Goals:
1) User interfaces for existing resource control features should be
unchanged.

2) Admin interface should have the same capabilities, but interfaces
may change. E.g. kernel command line and mount options may be replaced
by choosing which resource modules to load.

3) Should be easy to create new modules to handle big differences
between implementations, or to handle model specific features that
may not exist in the same form across multiple CPU generations.

Initial notes:

Core resctrl filesystem functionality will just be:

1) Mount/unmount of filesystem. Architecture hook to allocate monitor
and control IDs for the default group.

2) Creation/removal/rename of control and monitor directories (with
call to architecture specific code to allocate/free the control and monitor
IDs to attach to the directory).

3) Maintaining the "tasks" file with architecture code to update the
control and monitor IDs in the task structure.

4) Maintaining the "cpus" file - similar to "tasks"

5) Context switch code to update h/w with control/monitor IDs.

6) CPU hotplug interface to build and maintain domain list for each
registered resource.

7) Framework for "schemata" file. Calls to resource specific functions
to maintain each line in the file.

8) Resource registration interface for modules to add new resources
to the list (and remove them on module unload). Modules may add files
to the info/ hierarchy, and also to each mon_data/ directory and/or
to each control/control_mon directory.

9) Note that the core code starts with an empty list of resources.
System admins must load modules to add support for each resource they
want to use.


We'd need a bunch of modules to cover existing x86 functionality. E.g.
an "L3" one for standard L3 cache allocation, an "L3CDP" one to be used
instead of the plain "L3" one for code/data priority mode by creating
a separate resource for each of code & data.

Logically separate mbm_local, mbm_total, and llc_cache_occupancy modules
(though could combine the mbm ones because they both need a periodic
counter read to avoid wraparound). "MB" for memory bandwidth allocation.

The "mba_MBps" mount option would be replaced with a module that does
both memory bandwidth allocation and monitoring, with a s/w feedback loop.

Peter's workaround for the quirks of AMD monitoring could become a
separate AMD specific module. But minor differences (e.g. contiguous
cache bitmask Intel requirements) could be handled within a module
if desired.

Pseudo-locking would be another case to load a different module to
set up pseudo-locking and enforce the cache bitmask rules between resctrl
groups instead of the basic cache allocation one.

Core resctrl code could handle overlaps between modules that want to
control the same resource with a "first to load reserves that feature"
policy.
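
The "first to load reserves that feature" policy could look something like the sketch below. This is a userspace model only; the table, the model_* functions and the error codes chosen are all assumptions for illustration, not a proposed interface:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_RESOURCES 8

struct model_resource {
	const char *feature;	/* e.g. "L3" */
	const char *owner;	/* module that reserved the feature */
};

static struct model_resource model_resources[MODEL_MAX_RESOURCES];
static size_t model_nr_resources;

/* The first caller to register a feature owns it, later callers get -EBUSY. */
static int model_register_resource(const char *feature, const char *owner)
{
	size_t i;

	assert(feature && owner);
	for (i = 0; i < model_nr_resources; i++)
		if (!strcmp(model_resources[i].feature, feature))
			return -EBUSY;
	if (model_nr_resources == MODEL_MAX_RESOURCES)
		return -ENOSPC;

	model_resources[model_nr_resources].feature = feature;
	model_resources[model_nr_resources].owner = owner;
	model_nr_resources++;
	return 0;
}

/* Only the owner can release a feature, e.g. on module unload. */
static int model_unregister_resource(const char *feature, const char *owner)
{
	size_t i;

	assert(feature && owner);
	for (i = 0; i < model_nr_resources; i++) {
		if (strcmp(model_resources[i].feature, feature) ||
		    strcmp(model_resources[i].owner, owner))
			continue;
		/* Fill the hole with the last entry to keep the table dense. */
		model_resources[i] = model_resources[--model_nr_resources];
		return 0;
	}
	return -ENOENT;
}
```

In this model an "L3CDP" style module that wants the "L3" feature can only register after the plain "L3" module has released it.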

Are there additional ARM specific architectural requirements that this
approach isn't addressing? Could the core functionality be extended to
make life easier for ARM?

-Tony

2023-05-25 17:33:08

by James Morse

Subject: Re: [PATCH v3 12/19] x86/resctrl: Make resctrl_mounted checks explicit

Hi Reinette,

On 28/04/2023 00:37, Reinette Chatre wrote:
> On 4/27/2023 7:19 AM, James Morse wrote:
>> On 01/04/2023 00:28, Reinette Chatre wrote:
>>> On 3/20/2023 10:26 AM, James Morse wrote:
>>>> The rdt_enable_key is switched when resctrl is mounted, and used to
>>>> prevent a second mount of the filesystem. It also enables the
>>>> architecture's context switch code.
>>>>
>>>> This requires another architecture to have the same set of static-keys,
>>>> as resctrl depends on them too.
>>>>
>>>> Make the resctrl_mounted checks explicit: resctrl can keep track of
>>>> whether it has been mounted once. This doesn't need to be combined with
>>>> whether the arch code is context switching the CLOSID.
>>>> Tests against the rdt_mon_enable_key become a test that resctrl is
>>>> mounted and that monitoring is enabled.
>>>
>>> The last sentence above makes the code change hard to follow ...
>>> (see below)
>>>
>>>> This will allow the static-key changing to be moved behind resctrl_arch_
>>>> calls.
>>
>>>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>>>> index f38cd2f12285..6279f5c98b39 100644
>>>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>>>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>>>> @@ -834,7 +834,7 @@ void mbm_handle_overflow(struct work_struct *work)
>>>>
>>>> mutex_lock(&rdtgroup_mutex);
>>>>
>>>> - if (!static_branch_likely(&rdt_mon_enable_key))
>>>> + if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key))
>>>
>>> ... considering the text in the changelog the "resctrl_mounted" check seems
>>> unnecessary. Looking ahead I wonder if this check would not be more
>>> appropriate in patch 15?
>>
>> How so?
>>
>> This is secretly relying on rdt_mon_enable_key being cleared in rdt_kill_sb() when the
>> filesystem is unmounted, otherwise the overflow thread keeps running once the filesystem
>> is unmounted.
>
> hmmm ... I do not think my feedback was clear. I understand that this is done
> as a prep patch but that was only clear when I read patch 15 because as the
> work is presented here it seems unnecessary.
>
>>
>> I thought it simpler to add all these checks explicitly in one go.
>> That makes it simpler to thin out the static keys as their 'and its mounted' behaviour is
>> no longer relied on.
>
> Understood. If you want to keep this as a prep patch, could you please update the
> changelog to reflect this? The following sentence in the changelog makes this patch
> hard to follow since it essentially claims that what this patch does is unnecessary:
> "Tests against the rdt_mon_enable_key become a test that resctrl is mounted
> and that monitoring is enabled."

"Because of the implicit mount test" ... the text immediately before this.

We're probably going to keep talking past each other on this - I'll rephrase that
paragraph as:
| rdt_mon_enable_key is never used just to test that resctrl is mounted,
| but does also have this implication. Add a resctrl_mounted to all uses
| of rdt_mon_enable_key. This will allow rdt_mon_enable_key to be swapped
| with a helper in a subsequent patch.


> I also do still wonder why these resctrl_mounted checks cannot move to patch
> 15 when they are needed. Adding them there makes it obvious that rdt_mon_enable_key
> had a dual purpose that patch 15 splits into two separate checks.

That is happening in this patch too, rdt_mon_enable_key becomes
(resctrl_mounted && rdt_mon_enable_key), the implicit property is now explicit, so a later
patch can modify rdt_mon_enable_key without breaking this behaviour.
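
To illustrate why the explicit check matters, here is a small userspace model. Plain bools stand in for resctrl_mounted and the static key, and model_unmount_keep_key() is a hypothetical unmount that leaves the monitoring key set, as a later patch might:

```c
#include <assert.h>
#include <stdbool.h>

static bool model_mounted;		/* stands in for resctrl_mounted */
static bool model_mon_enabled;		/* stands in for rdt_mon_enable_key */
static unsigned int model_overflow_runs;

static void model_mount(void)
{
	model_mounted = true;
	model_mon_enabled = true;
}

/* Assumption: unmount no longer clears the monitoring key. */
static void model_unmount_keep_key(void)
{
	model_mounted = false;
}

/* Returns true if the overflow work actually ran. */
static bool model_handle_overflow(void)
{
	/* Both properties are tested explicitly, neither implies the other. */
	if (!model_mounted || !model_mon_enabled)
		return false;

	assert(model_mounted);
	model_overflow_runs++;
	return true;
}
```

With only the key tested, the handler in this model would keep running after model_unmount_keep_key(); with both tests it stops as soon as the filesystem is unmounted.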

I think it's easier to review if patch 15 is making a set of 1:1 mappings instead of
splitting some static-keys but not others. Let me know what you think of the new version.


Thanks,

James

2023-05-25 17:33:21

by James Morse

Subject: Re: [PATCH v3 11/19] x86/resctrl: Allow arch to allocate memory needed in resctrl_arch_rmid_read()

Hi Reinette,

On 28/04/2023 00:40, Reinette Chatre wrote:
> On 4/27/2023 7:19 AM, James Morse wrote:
>> On 01/04/2023 00:27, Reinette Chatre wrote:
>>> On 3/20/2023 10:26 AM, James Morse wrote:

>>>> @@ -317,9 +318,14 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
>>>> u32 idx_limit = resctrl_arch_system_num_rmid_idx();
>>>> struct rmid_entry *entry;
>>>> u32 idx, cur_idx = 1;
>>>> + int arch_mon_ctx;
>>>> bool rmid_dirty;
>>>> u64 val = 0;
>>>>
>>>> + arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID);
>>>> + if (arch_mon_ctx < 0)
>>>> + return;
>>
>>> The vision for this is not clear to me. When I read that context needs to be allocated
>>> I expect it to return a pointer to some new context, not an int. What would the
>>> "context" consist of?
>>
>> It might just need a different name.
>>
>> For MPAM, this is allocating a monitor, which is the hardware that does the counting in
>> the cache or the memory controller. The number of monitors is an implementation choice,
>> and may not match the number of CLOSID/RMID that are in use. There aren't guaranteed to be
>> enough to allocate one for every control or monitor group up front.
>>
>> The int being returned is the allocated monitor's index. It identifies which monitor needs
>> programming to read the provided CLOSID/RMID, and the counter register to read with the value.
>
> I see.
>
>>
>> I can allocate memory for an int if you think that is clearer.
>> (I was hoping to leave that for whoever needs something bigger than a pointer)

> I'd rather not complicate it in this way.

It's a no-op for x86 as these calls get optimised out, but more annoying for MPAM (I've
done it now). I think the result is more intuitive, but see what you think.
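
For reference, the int-returning form from v3 that was discussed above can be modelled as below. This is a userspace sketch that assumes a pool of two hardware monitors; the model_* names are hypothetical and are not the MPAM driver's functions:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumption: the hardware implements only two monitors. */
#define MODEL_NUM_MONITORS 2

static bool model_monitor_busy[MODEL_NUM_MONITORS];

/* Returns the index of a free monitor, or -ENOSPC if all are in use. */
static int model_mon_ctx_alloc(void)
{
	int i;

	for (i = 0; i < MODEL_NUM_MONITORS; i++) {
		if (!model_monitor_busy[i]) {
			model_monitor_busy[i] = true;
			return i;
		}
	}
	return -ENOSPC;
}

/* Release a monitor previously returned by model_mon_ctx_alloc(). */
static void model_mon_ctx_free(int ctx)
{
	assert(ctx >= 0 && ctx < MODEL_NUM_MONITORS);
	assert(model_monitor_busy[ctx]);
	model_monitor_busy[ctx] = false;
}
```

The caller has to cope with allocation failing, because there may be fewer monitors than control or monitor groups; this is why the quoted __check_limbo() hunk returns early when the value is negative.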


Thanks,

James

2023-05-25 17:40:36

by James Morse

Subject: Re: [PATCH v3 08/19] x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow

Hi Ilpo,

On 27/04/2023 15:25, Ilpo Järvinen wrote:
> On Thu, 27 Apr 2023, James Morse wrote:
>> On 21/03/2023 15:14, Ilpo Järvinen wrote:
>>> On Mon, 20 Mar 2023, James Morse wrote:
>>>
>>>> The limbo and overflow code picks a CPU to use from the domain's list
>>>> of online CPUs. Work is then scheduled on these CPUs to maintain
>>>> the limbo list and any counters that may overflow.
>>>>
>>>> cpumask_any() may pick a CPU that is marked nohz_full, which will
>>>> either penalise the work that CPU was dedicated to, or delay the
>>>> processing of limbo list or counters that may overflow. Perhaps
>>>> indefinitely. Delaying the overflow handling will skew the bandwidth
>>>> values calculated by mba_sc, which expects to be called once a second.
>>>>
>>>> Add cpumask_any_housekeeping() as a replacement for cpumask_any()
>>>> that prefers housekeeping CPUs. This helper will still return
>>>> a nohz_full CPU if that is the only option. The CPU to use is
>>>> re-evaluated each time the limbo/overflow work runs. This ensures
>>>> the work will move off a nohz_full CPU once a houskeeping CPU is
>>>
>>> housekeeping
>>>
>>>> available.
>>
>>>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>>>> index 87545e4beb70..0b5fd5a0cda2 100644
>>>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>>>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>>
>>>> +/**
>>>> + * cpumask_any_housekeeping() - Chose any cpu in @mask, preferring those that
>>>> + * aren't marked nohz_full
>>>> + * @mask: The mask to pick a CPU from.
>>>> + *
>>>> + * Returns a CPU in @mask. If there are houskeeping CPUs that don't use
>>>> + * nohz_full, these are preferred.
>>>> + */
>>>> +static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
>>>> +{
>>>> + int cpu, hk_cpu;
>>>> +
>>>> + cpu = cpumask_any(mask);
>>>> + if (tick_nohz_full_cpu(cpu)) {
>>>> + hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
>>>
>>> Why cpumask_nth_and() is not enough here? ..._andnot() seems to alter
>>> tick_nohz_full_mask which doesn't seem desirable?
>>
>> tick_nohz_full_mask is the list of CPUs we should avoid. This wants to find the first cpu
>> set in the domain mask, and clear in tick_nohz_full_mask.
>>
>> Where does cpumask_nth_andnot() modify its arguments? Its arguments are const.
>
> Ah, it doesn't, I'm sorry about that.
>
> I think I was trapped by ambiguous English:
> * cpumask_nth_andnot - get the first cpu set in 1st cpumask, and clear in 2nd.
> ...which can be understood as it clearing it in 2nd.
Great, I'm not going mad!

How could the English there be clearer?
"get the first cpu that is set in 1st cpumask, and not set in 2nd." ?
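
The intended semantics can be modelled in user space with plain 64-bit masks. This is only a sketch: first_andnot() and any_housekeeping() are hypothetical stand-ins for cpumask_nth_andnot(0, ...) and cpumask_any_housekeeping(), and cpumask_any() is approximated by "lowest set bit". Both arguments are passed by value, so neither mask is modified.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 64u

/* Lowest bit that is set in @a and not set in @b, or NR_CPUS if none. */
static unsigned int first_andnot(uint64_t a, uint64_t b)
{
	uint64_t candidates = a & ~b;

	for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
		if (candidates & (UINT64_C(1) << cpu))
			return cpu;
	return NR_CPUS;
}

/* Pick a CPU from @mask, preferring one that is not in @nohz_full. */
static unsigned int any_housekeeping(uint64_t mask, uint64_t nohz_full)
{
	unsigned int cpu, hk_cpu;

	assert(mask != 0);
	cpu = first_andnot(mask, 0);	/* stand-in for cpumask_any() */
	if (nohz_full & (UINT64_C(1) << cpu)) {
		hk_cpu = first_andnot(mask, nohz_full);
		if (hk_cpu < NR_CPUS)
			cpu = hk_cpu;
	}
	return cpu;	/* may still be nohz_full if that is the only option */
}
```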


Thanks,

James

2023-05-25 17:41:20

by James Morse

Subject: Re: [PATCH v3 02/19] x86/resctrl: Access per-rmid structures by index

Hi Peter,

On 24/04/2023 14:06, Peter Newman wrote:
> On Mon, Mar 20, 2023 at 6:27 PM James Morse <[email protected]> wrote:
>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>> index 030d3b409768..351319403f84 100644
>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>> @@ -600,7 +600,7 @@ static void clear_closid_rmid(int cpu)
>> state->default_rmid = 0;
>> state->cur_closid = 0;
>> state->cur_rmid = 0;
>> - wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
>> + wrmsr(MSR_IA32_PQR_ASSOC, RESCTRL_RESERVED_CLOSID, 0);
>
> It looks like the RMID/CLOSID params are in the wrong order in this wrmsr().

Fixed, thanks!


James

2023-05-25 17:42:27

by James Morse

Subject: Re: [PATCH v3 00/19] x86/resctrl: monitored closid+rmid together, separate arch/fs locking

Hi Tony,

(CC: +Drew)

On 23/05/2023 18:14, Tony Luck wrote:
> Looking at the changes already applied, and those planned to support
> new architectures, new features, and quirks in specific implementations,
> it is clear to me that the original resctrl file system implementation
> did not provide enough flexibility for all the additions that are
> needed.
>
> So I've begun musing with 20-20 hindsight on how resctrl could have
> provided better abstract building blocks.

Heh, hindsight is 20:20!

My responses below are pretty much entirely about how this looks to user-space, that is
the bit we can't change.


> The concept of a "resource" structure with a list of domains for
> specific instances of that structure on a platform still seems like
> a good building block.

True, but having platform specific resource types does reduce the effectiveness of a
shared interface. User-space just has to have platform specific knowledge for these.

I think defining resources in terms of other things that are visible to user-space through
sysfs is the best approach. The CAT L3 schema does this.


In terms of the values used, I'd much prefer 'weights' or some other abstraction had been
used, to allow the kernel to pick the hardware configuration itself.
Similarly, properties like isolation between groups should be made explicit, instead of
asking "did user-space mean to set shared bits in those bitmaps?"

This stuff is the reason why resctrl can't support MPAM's 'CMAX', that gives a maximum
capacity limit for a cache, but doesn't implicitly isolate groups.


> But sharing those structures across increasingly different implementations
> of the underlying resource is resulting in extra gymnastic efforts to
> make all the new uses co-exist with the old. E.g. the domain structure
> has elements for every type of resource even though each instance is
> linked to just one resource type.

> I had begun this journey with a plan to just allow new features to
> hook into the existing resctrl filesystem with a "driver" registration
> mechanism:
>
> https://lore.kernel.org/all/[email protected]/
>
> But feedback from Reinette that this would be cleaner if drivers created
> new resources, rather than adding a patchwork of callback functions with
> special case "if (type == DRIVER)" sprinkled around made me look into
> a more radical redesign instead of joining in the trend of making the
> smallest set of changes to meet my goals.
>
>
> Goals:
> 1) User interfaces for existing resource control features should be
> unchanged.
>
> 2) Admin interface should have the same capabilities, but interfaces
> may change. E.g. kernel command line and mount options may be replaced
> by choosing which resource modules to load.
>
> 3) Should be easy to create new modules to handle big differences
> between implementations, or to handle model specific features that
> may not exist in the same form across multiple CPU generations.

The difficulty is knowing some behaviour is going to be platform specific, it's not until
the next generation is different that you know there was something wrong with the first.

The difficulty is user-space expecting a resource that turned out to be platform-specific,
or was 'enhanced' in a subsequent version and doesn't behave in the same way.

I suspect we need two sets of resources, those that are abstracted to work in a portable
way between platforms and architectures - and the wild west.
The next trick is moving things between the two!


> Initial notes:
>
> Core resctrl filesystem functionality will just be:
>
> 1) Mount/unmount of filesystem. Architecture hook to allocate monitor
> and control IDs for the default group.
>
> 2) Creation/removal/rename of control and monitor directories (with
> call to architecture specific code to allocate/free the control and monitor
> IDs to attach to the directory.
>
> 3) Maintaining the "tasks" file with architecture code to update the
> control and monitor IDs in the task structure.
>
> 4) Maintaining the "cpus" file - similar to "tasks"
>
> 5) Context switch code to update h/w with control/monitor IDs.
>
> 6) CPU hotplug interface to build and maintain domain list for each
> registered resource.
>
> 7) Framework for "schemata" file. Calls to resource specific functions
> to maintain each line in the file.

> 8) Resource registration interface for modules to add new resources
> to the list (and remove them on module unload). Modules may add files
> to the info/ hierarchy, and also to each mon_data/ directory and/or
> to each control/control_mon directory.

I worry that this can lead to architecture specific schema, then each architecture having
a subtly different version. I think it would be good to keep all the user-ABI in one place
so it doesn't get re-invented. I agree it's hard to know what the next platform will look like.

One difference I can't get my head round is how to handle platforms that use relative
percentages and fractions - and those that take an absolute MB/s value.


> 9) Note that the core code starts with an empty list of resources.
> System admins must load modules to add support for each resource they
> want to use.

I think this just moves the problem to modules. 'CAT' would get duplicated by all
architectures. MB is subtly different between them all, but user-space doesn't want to be
concerned with the differences.


> We'd need a bunch of modules to cover existing x86 functionality. E.g.
> an "L3" one for standard L3 cache allocation, an "L3CDP" one to be used
> instead of the plain "L3" one for code/data priority mode by creating
> a separate resource for each of code & data.

CDP may have system wide side-effects. For MPAM if you enable the emulation of that, then
resources that resctrl doesn't believe use CDP have to double-configure and double-count
everything.


> Logically separate mbm_local, mbm_total, and llc_cache_occupancy modules
> (though could combine the mbm ones because they both need a periodic
> counter read to avoid wraparound). "MB" for memory bandwidth allocation.

llc_cache_occupancy isn't a counter, but I'd prefer to bundle the others through perf.
That already has an interface for discovering and configuring events. I understand it was
tried and removed, but I think I've got a handle on making this work.


> The "mba_MBps" mount option would be replaced with a module that does
> both memory bandwidth allocation and monitoring, with a s/w feedback loop.

Keeping purely software features self contained is a great idea.


> Peter's workaround for the quirks of AMD monitoring could become a
> separate AMD specific module. But minor differences (e.g. contiguous
> cache bitmask Intel requirements) could be handled within a module
> if desired.
>
> Pseudo-locking would be another case to load a different module to
> set up pseudo-locking and enforce the cache bitmask rules between resctrl
> groups instead of the basic cache allocation one.
>
> Core resctrl code could handle overlaps between modules that want to
> control the same resource with a "first to load reserves that feature"
> policy.

> Are there additional ARM specific architectural requirements that this
> approach isn't addressing? Could the core functionality be extended to
> make life easier for ARM?

(We've got RISC-V to consider too - hence adding Drew Fustini [0])

My known issues list is:
* RMIDs.
These are an independent number space for RDT. For MPAM they are an
extension of the partid/closid space. There is no value that can be
exposed to user-space as num_rmid as it depends on how they will be
used.

* Monitors.
RDT has one counter per RMID, they run continuously. MPAM advertises
how many monitors it has, which is very likely to be fewer than we
want. This means MPAM can't expose the free-running MBM_* counters
via the filesystem. These would need exposing via perf.

* Bitmaps.
MPAM has some bitmaps, but it has other stuff too. Forcing the bitmaps
to be the user-space interface requires the underlying control to be
capable of isolation. Ideally user-space would set a priority/cost for
each rdtgroup, and indicate whether they should be isolated from other
rdtgroup at the same level.

* Memory bandwidth.
For MB resources that control bandwidth, X86 provides user-space with
the cache-id of the cache that implements those bandwidth controls. For
MPAM there is no expectation that this is being done by a cache, it could
equally be a memory controller.
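
To make the RMID item concrete, here is a small sketch of combining CLOSID and RMID into one index. It assumes the PMG bits are simply appended below the PARTID; that layout, PMG_BITS, and the names rmid_idx_encode()/rmid_idx_decode() are assumptions for illustration only. On x86 the index would just be the RMID.

```c
#include <assert.h>
#include <stdint.h>

#define PMG_BITS 1u	/* hypothetical: a platform with a single PMG bit */

/* Build a unique index from the control ID and the monitoring ID. */
static uint32_t rmid_idx_encode(uint32_t closid, uint32_t rmid)
{
	assert(rmid < (1u << PMG_BITS));
	return (closid << PMG_BITS) | rmid;
}

/* Split an index back into its control and monitoring parts. */
static void rmid_idx_decode(uint32_t idx, uint32_t *closid, uint32_t *rmid)
{
	*closid = idx >> PMG_BITS;
	*rmid = idx & ((1u << PMG_BITS) - 1);
}
```

With a single PMG bit, each control group only has two monitoring values of its own in this model, so how many monitor groups can be created depends on how the CLOSIDs are used.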


I'd really like to have these solved as part of a cross-architecture user-space ABI. I'm
not sure platform-specific modules solve the user-space problem.


Otherwise MPAM has additional types of control, which could be applied to any kind of
resource. The oddest is 'PRI' which is just a priority. I've not yet heard of a system
using it, but this could appear at any choke point in the SoC, it may not be on a cache or
memory controller.

The 'types of control' and 'resource' distinction may help in places where Intel/AMD take
wildly different values to configure the same resource. (*cough* MB)


Thanks,

James


[0] lore.kernel.org/r/[email protected]

2023-05-25 17:43:21

by James Morse

Subject: Re: [PATCH v3 08/19] x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow

Hi Reinette,

On 28/04/2023 00:36, Reinette Chatre wrote:
> On 4/27/2023 7:10 AM, James Morse wrote:
>> On 01/04/2023 00:24, Reinette Chatre wrote:
>>> On 3/20/2023 10:26 AM, James Morse wrote:

>>>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>>>> index 87545e4beb70..0b5fd5a0cda2 100644
>>>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>>>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>>>> @@ -55,6 +56,28 @@
>>>> /* Max event bits supported */
>>>> #define MAX_EVT_CONFIG_BITS GENMASK(6, 0)
>>>>
>>>> +/**
>>>> + * cpumask_any_housekeeping() - Chose any cpu in @mask, preferring those that
>>>> + * aren't marked nohz_full
>>>> + * @mask: The mask to pick a CPU from.
>>>> + *
>>>> + * Returns a CPU in @mask. If there are houskeeping CPUs that don't use
>>>> + * nohz_full, these are preferred.
>>>> + */
>>>> +static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
>>>> +{
>>>> + int cpu, hk_cpu;
>>>> +
>>>> + cpu = cpumask_any(mask);
>>>> + if (tick_nohz_full_cpu(cpu)) {
>>>> + hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
>>>> + if (hk_cpu < nr_cpu_ids)
>>>> + cpu = hk_cpu;
>>>> + }
>>>> +
>>
>>> I think as a start this could perhaps be a #if defined(CONFIG_NO_HZ_FULL). There
>>> appears to be a precedent for this in kernel/rcu/tree_nocb.h.
>>
>> This harms readability, and prevents the compiler from testing that this is valid C code
>> for any compile of this code.
>>
>> With if-def's here you'd be reliant on some CI system to build with the required
>> combination of Kconfig symbols to expose any warnings.
>>
>> It's much better to use IS_ENABLED() in the helpers and rely on the compiler's
>> dead-code-elimination to remove paths that have been configured out.
>>
>> (See the section on Conditional Compilation in coding-style for a much better summary!)
>
> My assumption was that you intended to implement what is described first in
> the document you point to. That is, providing no-stub versions for all
> and then calling everything unconditionally. Since I did not see universal stubs
> for the code you are using I was looking at how other areas in the kernel handled
> the same.
>
> Reading your response to Ilpo and what you write later I now see that you are using
> a combination of no-op stubs and conditional compilation. That is, you use a no-op stub,
> instead of "IS_ENABLED()" or "#if" to conditionally compile some code. I am not familiar
> with how compilers handle these scenarios.
>

>>>> diff --git a/include/linux/tick.h b/include/linux/tick.h
>>>> index bfd571f18cfd..ae2e9019fc18 100644
>>>> --- a/include/linux/tick.h
>>>> +++ b/include/linux/tick.h
>>>> @@ -174,9 +174,10 @@ static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
>>>> static inline void tick_nohz_idle_stop_tick_protected(void) { }
>>>> #endif /* !CONFIG_NO_HZ_COMMON */
>>>>
>>>> +extern cpumask_var_t tick_nohz_full_mask;
>>>> +
>>>> #ifdef CONFIG_NO_HZ_FULL
>>>> extern bool tick_nohz_full_running;
>>>> -extern cpumask_var_t tick_nohz_full_mask;
>>>>
>>>> static inline bool tick_nohz_full_enabled(void)
>>>> {
>>>
>>> In addition to what Ilpo pointed out, be careful here.
>>> cpumask_var_t is a pointer (or array) and needs to be
>>> allocated before use. Moving its declaration but not the
>>> allocation code seems risky.
>>
>> Risky how? Any use of tick_nohz_full_mask that isn't guarded by something like
>> tick_nohz_full_cpu() will lead to a link error regardless of the type.
>
> I assumed that the intention was to create an actual "no-op" stub for this
> mask, enabling it to be used unconditionally. That the intention is for it
> to be guarded and how the compiler deals with this was not obvious to me. I think
> it would be good to call out this usage when submitting this to the appropriate
> maintainers. A comment near the declaration may help users to know how it is
> intended to be used.

Right, I'll add a comment:
/*
* Mask of CPUs that are nohz_full.
*
* Users should be guarded by CONFIG_NO_HZ_FULL or a tick_nohz_full_cpu()
* check.
*/
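
The guarded-use pattern that comment describes can be sketched in user space as follows. NO_HZ_FULL stands in for the Kconfig symbol, and nohz_full_cpu()/nohz_full_mask for tick_nohz_full_cpu()/tick_nohz_full_mask; all of these names are assumptions for the sketch. Unlike the kernel, this version always defines the mask, so it links even when the compiler does not discard the dead branch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#ifndef NO_HZ_FULL
#define NO_HZ_FULL 0	/* stand-in for the Kconfig symbol being disabled */
#endif

/* Defined unconditionally here; in the kernel only the declaration is. */
static const uint64_t nohz_full_mask = 0x3;

/* Always compiled and type-checked, whatever NO_HZ_FULL is set to. */
static inline bool nohz_full_cpu(unsigned int cpu)
{
	assert(cpu < 64);
	if (!NO_HZ_FULL)
		return false;	/* constant result: the mask access below is dead code */
	return (nohz_full_mask & (UINT64_C(1) << cpu)) != 0;
}

/* Callers use the helper unconditionally, with no #ifdef at the call site. */
static unsigned int count_housekeeping(unsigned int nr_cpus)
{
	unsigned int n = 0;

	assert(nr_cpus <= 64);
	for (unsigned int cpu = 0; cpu < nr_cpus; cpu++)
		if (!nohz_full_cpu(cpu))
			n++;
	return n;
}
```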




Thanks,

James

2023-05-25 21:04:30

by Luck, Tony

Subject: Re: [PATCH v3 00/19] x86/resctrl: monitored closid+rmid together, separate arch/fs locking

On Thu, May 25, 2023 at 06:31:54PM +0100, James Morse wrote:
> Hi Tony,
>
> (CC: +Drew)

James: Thanks for all the comments below, and pulling in Drew for eyes
from another architecture.

>
> On 23/05/2023 18:14, Tony Luck wrote:
> > Looking at the changes already applied, and those planned to support
> > new architectures, new features, and quirks in specific implementations,
> > it is clear to me that the original resctrl file system implementation
> > did not provide enough flexibility for all the additions that are
> > needed.
> >
> > So I've begun musing with 20-20 hindsight on how resctrl could have
> > provided better abstract building blocks.
>
> Heh, hindsight is 20:20!
>
> My responses below are pretty much entirely about how this looks to user-space, that is
> the bit we can't change.
>
>
> > The concept of a "resource" structure with a list of domains for
> > specific instances of that structure on a platform still seems like
> > a good building block.
>
> True, but having platform specific resource types does reduce the effectiveness of a
> shared interface. User-space just has to have platform specific knowledge for these.
>
> I think defining resources in terms of other things that are visible to user-space through
> sysfs is the best approach. The CAT L3 schema does this.
>
>
> In terms of the values used, I'd much prefer 'weights' or some other abstraction had been
> used, to allow the kernel to pick the hardware configuration itself.
> Similarly, properties like isolation between groups should be made explicit, instead of
> asking "did user-space mean to set shared bits in those bitmaps?"
>
> This stuff is the reason why resctrl can't support MPAM's 'CMAX', that gives a maximum
> capacity limit for a cache, but doesn't implicitly isolate groups.

User interface is the trickiest problem when looking to make changes.
But doubly so here because of the underlying different capabilities
for resource monitoring and control in different architectures.

For CAT L3 I picked a direct pass through of the x86 bitmasks, partly
because I was going first and didn't have to worry about compatibility
with existing implementations, but mostly because there really isn't any
way to hide the sharp edges of this h/w implementation. It would be much
easier, and nicer, for users if they could specify a whole list of
tuneables:

* minimum viable amount of cache
* maximum permitted under contention with higher priority processes
* maximum when system lightly loaded

and especially not have these limits bound directly to specific cache
ways. But those aren't the cards that I was dealt.

So now we appear to be stuck with something that is quite specific to
the initial x86 implementation, and prevents taking advantage of more
flexible features on other architectures (or future x86 if Intel does
something better in the future). Which is why I'd like to make it easy
to provide options (in ways other than new boot command line or mount
options together with cramming more code into an already somewhat
cluttered codebase).

> > But sharing those structures across increasingly different implementations
> > of the underlying resource is resulting in extra gymnastic efforts to
> > make all the new uses co-exist with the old. E.g. the domain structure
> > has elements for every type of resource even though each instance is
> > linked to just one resource type.
>
> > I had begun this journey with a plan to just allow new features to
> > hook into the existing resctrl filesystem with a "driver" registration
> > mechanism:
> >
> > https://lore.kernel.org/all/[email protected]/
> >
> > But feedback from Reinette that this would be cleaner if drivers created
> > new resources, rather than adding a patchwork of callback functions with
> > special case "if (type == DRIVER)" sprinkled around made me look into
> > a more radical redesign instead of joining in the trend of making the
> > smallest set of changes to meet my goals.
> >
> >
> > Goals:
> > 1) User interfaces for existing resource control features should be
> > unchanged.
> >
> > 2) Admin interface should have the same capabilities, but interfaces
> > may change. E.g. kernel command line and mount options may be replaced
> > by choosing which resource modules to load.
> >
> > 3) Should be easy to create new modules to handle big differences
> > between implementations, or to handle model specific features that
> > may not exist in the same form across multiple CPU generations.
>
> The difficulty is knowing some behaviour is going to be platform specific, it's not until
> the next generation is different that you know there was something wrong with the first.
>
> The difficulty is user-space expecting a resource that turned out to be platform-specific,
> or was 'enhanced' in a subsequent version and doesn't behave in the same way.
>
> I suspect we need two sets of resources, those that are abstracted to work in a portable
> way between platforms and architectures - and the wild west.
> The next trick is moving things between the two!

This seems to be akin to the perfmon model. There are a small number of
architectural events and counters. All the rest is a very wild west with
no guarantees from one model to the next, or even from one core to the
next inside a hybrid CPU model.

Side note: Hybrid is already an issue for resctrl. On hybrid cpu models
that support CAT L2, there are different numbers of ways in the P-core
L2 cache from the E-core L2 cache. Other asymmetries are already in
the pipeline.

> > Initial notes:
> >
> > Core resctrl filesystem functionality will just be:
> >
> > 1) Mount/unmount of filesystem. Architecture hook to allocate monitor
> > and control IDs for the default group.
> >
> > 2) Creation/removal/rename of control and monitor directories (with
> > call to architecture specific code to allocate/free the control and monitor
> > IDs to attach to the directory.
> >
> > 3) Maintaining the "tasks" file with architecture code to update the
> > control and monitor IDs in the task structure.
> >
> > 4) Maintaining the "cpus" file - similar to "tasks"
> >
> > 5) Context switch code to update h/w with control/monitor IDs.
> >
> > 6) CPU hotplug interface to build and maintain domain list for each
> > registered resource.
> >
> > 7) Framework for "schemata" file. Calls to resource specific functions
> > to maintain each line in the file.
>
> > 8) Resource registration interface for modules to add new resources
> > to the list (and remove them on module unload). Modules may add files
> > to the info/ hierarchy, and also to each mon_data/ directory and/or
> > to each control/control_mon directory.
>
> I worry that this can lead to architecture specific schema, then each architecture having
> a subtly different version. I think it would be good to keep all the user-ABI in one place
> so it doesn't get re-invented. I agree it's hard to know what the next platform will look like.

I'd also like the keep the user visible schema unchanged. But we already
have different architecture back-end code (x86 has arrays of MSRs on
each L3/L2 domain for s/w to program the desired bitmasks ... I'm going
to guess that ARM doesn't do that with a WRMSR instruction :-)
Even between Intel and AMD there are small differences (AMD doesn't
require each bitmask to consist of a single block of consecutive "1" bits).
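
The Intel/AMD difference comes down to a check like the one below. This is a stand-alone sketch with hypothetical helper names, cbm_is_contiguous() and cbm_is_valid(), not the kernel's validation code; the need_contiguous flag models whether the platform insists on a single block of 1s.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if @cbm is one unbroken run of 1 bits. An empty mask has no run. */
static bool cbm_is_contiguous(uint32_t cbm)
{
	uint32_t lowest;

	if (cbm == 0)
		return false;
	lowest = cbm & (~cbm + 1u);	/* isolate the lowest set bit */
	/* Adding it carries through the lowest run; leftovers mean a gap. */
	return ((cbm + lowest) & cbm) == 0;
}

/* Check @cbm fits in @width bits and, if required, is contiguous. */
static bool cbm_is_valid(uint32_t cbm, unsigned int width, bool need_contiguous)
{
	uint32_t limit;

	assert(width >= 1 && width <= 32);
	limit = (width == 32) ? UINT32_MAX : ((UINT32_C(1) << width) - 1);
	if (cbm & ~limit)
		return false;
	return !need_contiguous || cbm_is_contiguous(cbm);
}
```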

Some future x86 implementation might move away from MSRs (which are a
pain because of the need for cross processor interrupts to execute the
WRMSR from a CPU in each domain).

So the basic CAT L3 modules should continue to provide this legacy
schemata interface (as best they can).

But this opens up the possibility for you to provide an alternate module
that makes MPAM's 'CMAX' visible to knowledgeable users that have a use
case that would benefit from it.

> One difference I can't get my head round is how to handle platforms that use relative
> percentages and fractions - and those that take an absolute MB/s value.

Percentages were a terrible idea. I'm so, so, sorry. I just did a pass
through, and that was a mistake. The "mba_MBps" mount option was a
belated attempt to repair the damage by giving the user a parameter
that makes more sense ... though the s/w feedback loop can't really
ever run fast enough. So applications that go through phases of high
and low memory bandwidth are often out of compliance with the bandwidth
limits set by the user.
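
For reference, the kind of s/w feedback loop being described can be sketched as below. This is a simplified stand-alone model, not the kernel's mba_sc code: the step size, the limits, and the names struct mba_fb and mba_fb_update() are all assumptions. Each call compares the bandwidth measured over the last interval with the user's MB/s target and nudges the throttle value by one step.

```c
#include <assert.h>
#include <stdint.h>

#define THROTTLE_MIN	10u	/* hypothetical lowest setting, in percent */
#define THROTTLE_MAX	100u	/* unthrottled */
#define THROTTLE_STEP	10u	/* hypothetical granularity */

struct mba_fb {
	uint32_t target_mbps;	/* limit requested by the user */
	uint32_t throttle_pct;	/* value that would be programmed into h/w */
};

/* Run once per sampling interval with the bandwidth seen in that interval. */
static void mba_fb_update(struct mba_fb *fb, uint32_t measured_mbps)
{
	assert(fb->throttle_pct >= THROTTLE_MIN &&
	       fb->throttle_pct <= THROTTLE_MAX);

	if (measured_mbps > fb->target_mbps && fb->throttle_pct > THROTTLE_MIN)
		fb->throttle_pct -= THROTTLE_STEP;	/* over budget: throttle harder */
	else if (measured_mbps < fb->target_mbps && fb->throttle_pct < THROTTLE_MAX)
		fb->throttle_pct += THROTTLE_STEP;	/* under budget: relax */
}
```

Because the adjustment only happens once per interval, a workload whose bandwidth changes faster than the interval spends part of its time above or below the target.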
>
>
> > 9) Note that the core code starts with an empty list of resources.
> > System admins must load modules to add support for each resource they
> > want to use.
>
> I think this just moves the problem to modules. 'CAT' would get duplicated by all
> architectures. MB is subtly different between them all, but user-space doesn't want to be
> concerned with the differences.

The duplicated part is just the piece that parses and validates the user
input for the CAT line in schemata. I haven't written the code yet, but
that feels like it will be a small part of the module. If I'm wrong,
then maybe we'd just make a "lib" directory for the shared functions
needed by multiple modules.

> > We'd need a bunch of modules to cover existing x86 functionality. E.g.
> > an "L3" one for standard L3 cache allocation, an "L3CDP" one to be used
> > instead of the plain "L3" one for code/data priority mode by creating
> > a separate resource for each of code & data.
>
> CDP may have system wide side-effects. For MPAM if you enable the emulation of that, then
> resources that resctrl doesn't believe use CDP have to double-configure and double-count
> everything.

I don't know enough about MPAM to fully understand that. On x86 CDP is a
weird hack that keeps one resource and makes it double up lines in
schemata, also doubling the staging space in the domain structures
for the potential updated bitmasks before committing them. It would seem
cleaner to have separate resources, and a driver that knows that one
gets the even MSRs in the array while the other gets the odd MSRs.
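
The even/odd split could be expressed as a small index helper. A sketch, assuming data takes the even slot and code the odd one; the enum and cdp_msr_index() are names made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

enum cdp_type { CDP_NONE, CDP_DATA, CDP_CODE };

/* Map a control ID to its slot in the per-domain array of bitmask MSRs. */
static uint32_t cdp_msr_index(uint32_t closid, enum cdp_type type)
{
	assert(closid < (UINT32_C(1) << 31));	/* keep closid * 2 + 1 in range */

	switch (type) {
	case CDP_DATA:
		return closid * 2;	/* even slots */
	case CDP_CODE:
		return closid * 2 + 1;	/* odd slots */
	default:
		return closid;		/* CDP disabled: one slot per closid */
	}
}
```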

> > Logically separate mbm_local, mbm_total, and llc_cache_occupancy modules
> > (though could combine the mbm ones because they both need a periodic
> > counter read to avoid wraparound). "MB" for memory bandwidth allocation.
>
> llc_cache_occupancy isn't a counter, but I'd prefer to bundle the others through perf.
> That already has an interface for discovering and configuring events. I understand it was
> tried and removed, but I think I've got a handle on making this work.

perf and llc_occupancy really were a bad match. But there are glitches
for memory bandwidth monitoring too. You'd really want the memory
traffic for cache evictions to be added to the counter for the tasks
that read that data into the cache. But if you do perf style monitoring
only while the task is running, then all those evictions are going to be
billed to the "wrong" process. Though perhaps you can handwave and say
that because a process is causing the evictions by bringing new data
into the cache, it really is responsible for those evictions.

> > The "mba_MBps" mount option would be replaced with a module that does
> > both memory bandwidth allocation and monitoring, with a s/w feedback loop.
>
> Keeping purely software features self contained is a great idea.
>
>
> > Peter's workaround for the quirks of AMD monitoring could become a
> > separate AMD specific module. But minor differences (e.g. contiguous
> > cache bitmask Intel requirements) could be handled within a module
> > if desired.
> >
> > Pseudo-locking would be another case to load a different module to
> > set up pseudo-locking and enforce the cache bitmask rules between resctrl
> > groups instead of the basic cache allocation one.
> >
> > Core resctrl code could handle overlaps between modules that want to
> > control the same resource with a "first to load reserves that feature"
> > policy.
>
> > Are there additional ARM specific architectural requirements that this
> > approach isn't addressing? Could the core functionality be extended to
> > make life easier for ARM?
>
> (We've got RISC-V to consider too - hence adding Drew Fustini [0])
>
> My known issues list is:
> * RMIDs.
> These are an independent number space for RDT. For MPAM they are an
> extension of the partid/closid space. There is no value that can be
> exposed to user-space as num_rmid as it depends on how they will be
> used.

Is the answer to avoid exposing this in the info/ directory and just
letting the user know they hit the limit when a "mkdir" fails?

> * Monitors.
> RDT has one counter per RMID, they run continuously. MPAM advertises
> how many monitors it has, which is very likely to be fewer than we
> want. This means MPAM can't expose the free-running MBM_* counters
> via the filesystem. These would need exposing via perf.

So you don't create the "mon_data" directory. Do you still have
"mon_groups" to allow user to get subtotal counts for subsets of tasks
with the same allocation (CLOS) id?

> * Bitmaps.
> MPAM has some bitmaps, but it has other stuff too. Forcing the bitmaps
> to be the user-space interface requires the underlying control to be
> capable of isolation. Ideally user-space would set a priority/cost for
> each rdtgroup, and indicate whether they should be isolated from other
> rdtgroup at the same level.

Reinette tackled this with the limited set of tools available to
implement pseudo-locking ... but that's really for a very specific set
of use cases. Intel doesn't have a "priority" h/w knob, but I agree it
would be good to have and expose.

> * Memory bandwidth.
> For MB resources that control bandwidth, X86 provides user-space with
> the cache-id of the cache that implements those bandwidth controls. For
> MPAM there is no expectation that this is being done by a cache, it could
> equally be a memory controller.

Intel will have something besides the L3-tied control at some point. In
fact two very different somethings. Another reason that I want the
option to pull all these competing controls out into modules and leave
it to the user to pick which ones to load to meet their application
requirements.

>
> I'd really like to have these solved as part of a cross-architecture user-space ABI. I'm
> not sure platform-specific modules solve the user-space problem.

Where architectures can converge on a user interface, we should indeed
try to do so. But this might be difficult if the companies that produce
new resource control features want to keep the details under wraps for
as long as possible.

So we might see a re-run of "Here's some resctrl code to implement a new
control feature on Intel" ... and have the user interface set in stone
before other architectures have time to comment about whether the API is
flexible enough to handle different h/w implementations.

> Otherwise MPAM has additional types of control, which could be applied to any kind of
> resource. The oddest is 'PRI' which is just a priority. I've not yet heard of a system
> using it, but this could appear at any choke point in the SoC, it may not be on a cache or
> memory controller.

Could you handle that in schemata? A line that specifies priority from
0..N that translates to adding some field to the tasks in that group
that gets loaded into h/w during context switch.
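
As a rough sketch only: the "PRI" line name, the "id=value" syntax borrowed
from the existing cache lines, and the limits below are all invented for
illustration and are not an existing resctrl interface. Parsing such a
schemata line might look like this:

```c
#include <stdlib.h>
#include <string.h>

#define MAX_DOMAINS 8	/* hypothetical limit for this sketch */
#define PRI_MAX     7	/* hypothetical "N" in the 0..N range */

/*
 * Parse a hypothetical schemata line such as "PRI:0=3;1=5".
 * On success prio[id] is set for every listed domain id and the number
 * of parsed entries is returned. Returns -1 on malformed input.
 */
static int parse_pri_line(const char *line, int prio[MAX_DOMAINS])
{
	const char *p;
	int count = 0;

	if (strncmp(line, "PRI:", 4) != 0)
		return -1;
	p = line + 4;

	while (*p) {
		char *end;
		long id, val;

		id = strtol(p, &end, 10);
		if (end == p || *end != '=')
			return -1;
		p = end + 1;

		val = strtol(p, &end, 10);
		if (end == p || (*end != ';' && *end != '\0'))
			return -1;
		if (id < 0 || id >= MAX_DOMAINS || val < 0 || val > PRI_MAX)
			return -1;

		prio[id] = (int)val;
		count++;
		p = (*end == ';') ? end + 1 : end;
	}

	/* An empty "PRI:" line carries no settings, so treat it as an error. */
	return count ? count : -1;
}
```

The parsed value would then have to be stored per group and copied to the
tasks in that group, which is the part that depends on the h/w.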

> The 'types of control' and 'resource' distinction may help in places where Intel/AMD take
> wildly different values to configure the same resource. (*cough* MB)

MB is the poster child for how NOT to do a resource control.

> [0] lore.kernel.org/r/[email protected]

A quick glance looks like some of this is s/CLOSID/RCID/ s/RMID/RCID/
and then load into a CSR instead of an MSR. But that's just the "how
does s/w tell h/w which controls/counters to use". I'm sure that RISC-V
will diverge in ways to keep OS implementation interesting.

-Tony

2023-05-28 21:05:20

by Drew Fustini

Subject: Re: [PATCH v3 00/19] x86/resctrl: monitored closid+rmid together, separate arch/fs locking

On Thu, May 25, 2023 at 02:00:21PM -0700, Tony Luck wrote:
> On Thu, May 25, 2023 at 06:31:54PM +0100, James Morse wrote:
> > Hi Tony,
> >
> > (CC: +Drew)

Hi Tony and James, thank you for bringing RISC-V QoS into the
conversation.

>
> James: Thanks for all the comments below, and pulling in Drew for eyes
> from another architecture.
>
> >
> > On 23/05/2023 18:14, Tony Luck wrote:
> > > Looking at the changes already applied, and those planned to support
> > > new architectures, new features, and quirks in specific implementations,
> > > it is clear to me that the original resctrl file system implementation
> > > did not provide enough flexibility for all the additions that are
> > > needed.
> > >
> > > So I've begun musing with 20-20 hindsight on how resctrl could have
> > > provided better abstract building blocks.
> >
> > Heh, hindsight is 20:20!
> >
> > My responses below are pretty much entirely about how this looks to user-space, that is
> > the bit we can't change.
> >
> >
> > > The concept of a "resource" structure with a list of domains for
> > > specific instances of that structure on a platform still seems like
> > > a good building block.
> >
> > True, but having platform specific resource types does reduce the effectiveness of a
> > shared interface. User-space just has to have platform specific knowledge for these.
> >
> > I think defining resources in terms of other things that are visible to user-space through
> > sysfs is the best approach. The CAT L3 schema does this.
> >
> >
> > In terms of the values used, I'd much prefer 'weights' or some other abstraction had been
> > used, to allow the kernel to pick the hardware configuration itself.
> > Similarly, properties like isolation between groups should be made explicit, instead of
> > asking "did user-space mean to set shared bits in those bitmaps?"
> >
> > This stuff is the reason why resctrl can't support MPAM's 'CMAX', that gives a maximum
> > capacity limit for a cache, but doesn't implicitly isolate groups.
>
> User interface is the trickiest problem when looking to make changes.
> But doubly so here because of the underlying different capabilities
> for resource monitoring and control in different architectures.
>
> For CAT L3 I picked a direct pass through of the x86 bitmasks, partly
> because I was going first and didn't have to worry about compatibility
> with existing implementations, but mostly because there really isn't any
> way to hide the sharp edges of this h/w implementation. It would be much
> easier, and nicer, for users if they could specify a whole list of
> tuneables:
>
> * minimum viable amount of cache
> * maximum permitted under contention with higher priority processes
> * maximum when system lightly loaded
>
> and especially not have these limits bound directly to specific cache
> ways. But those aren't the cards that I was dealt.
>
> So now we appear to be stuck with something that is quite specific to
> the initial x86 implementation, and prevents taking advantage of more
> flexible features on other architectures (or future x86 if Intel does
> something better in the future). Which is why I'd like to make it easy
> to provide options (in ways other than new boot command line or mount
> options together with cramming more code into an already somewhat
> cluttered codebase).
>
> > > But sharing those structures across increasingly different implementations
> > > of the underlying resource is resulting in extra gymnastic efforts to
> > > make all the new uses co-exist with the old. E.g. the domain structure
> > > has elements for every type of resource even though each instance is
> > > linked to just one resource type.
> >
> > > I had begun this journey with a plan to just allow new features to
> > > hook into the existing resctrl filesystem with a "driver" registration
> > > mechanism:
> > >
> > > https://lore.kernel.org/all/[email protected]/
> > >
> > > But feedback from Reinette that this would be cleaner if drivers created
> > > new resources, rather than adding a patchwork of callback functions with
> > > special case "if (type == DRIVER)" sprinkled around made me look into
> > > a more radical redesign instead of joining in the trend of making the
> > > smallest set of changes to meet my goals.
> > >
> > >
> > > Goals:
> > > 1) User interfaces for existing resource control features should be
> > > unchanged.
> > >
> > > 2) Admin interface should have the same capabilities, but interfaces
> > > may change. E.g. kernel command line and mount options may be replaced
> > > by choosing which resource modules to load.
> > >
> > > 3) Should be easy to create new modules to handle big differences
> > > between implementations, or to handle model specific features that
> > > may not exist in the same form across multiple CPU generations.
> >
> > The difficulty is knowing some behaviour is going to be platform specific, it's not until
> > the next generation is different that you know there was something wrong with the first.
> >
> > The difficulty is user-space expecting a resource that turned out to be platform-specific,
> > or was 'enhanced' in a subsequent version and doesn't behave in the same way.
> >
> > I suspect we need two sets of resources, those that are abstracted to work in a portable
> > way between platforms and architectures - and the wild west.
> > The next trick is moving things between the two!
>
> This seems to be akin to the perfmon model. There are a small number of
> architectural events and counters. All the rest is a very wild west with
> no guarantees from one model to the next, or even from one core to the
> next inside a hybrid CPU model.
>
> Side note: Hybrid is already an issue for resctrl. On hybrid cpu models
> that support CAT L2, there are different numbers of ways in the P-core
> L2 cache from the E-core L2 cache. Other asymmetries are already in
> the pipeline.
>
> > > Initial notes:
> > >
> > > Core resctrl filesystem functionality will just be:
> > >
> > > 1) Mount/unmount of filesystem. Architecture hook to allocate monitor
> > > and control IDs for the default group.
> > >
> > > 2) Creation/removal/rename of control and monitor directories (with
> > > call to architecture specific code to allocate/free the control and monitor
> > > IDs to attach to the directory).
> > >
> > > 3) Maintaining the "tasks" file with architecture code to update the
> > > control and monitor IDs in the task structure.
> > >
> > > 4) Maintaining the "cpus" file - similar to "tasks"
> > >
> > > 5) Context switch code to update h/w with control/monitor IDs.
> > >
> > > 6) CPU hotplug interface to build and maintain domain list for each
> > > registered resource.
> > >
> > > 7) Framework for "schemata" file. Calls to resource specific functions
> > > to maintain each line in the file.
> >
> > > 8) Resource registration interface for modules to add new resources
> > > to the list (and remove them on module unload). Modules may add files
> > > to the info/ hierarchy, and also to each mon_data/ directory and/or
> > > to each control/control_mon directory.
> >
> > I worry that this can lead to architecture specific schema, then each architecture having
> > a subtly different version. I think it would be good to keep all the user-ABI in one place
> > so it doesn't get re-invented. I agree it's hard to know what the next platform will look like.
>
> I'd also like to keep the user visible schema unchanged. But we already
> have different architecture back-end code (x86 has arrays of MSRs on
> each L3/L2 domain for s/w to program the desired bitmasks ... I'm going
> to guess that ARM doesn't do that with a WRMSR instruction :-)
> Even between Intel and AMD there are small differences (AMD doesn't
> require each bitmask to consist of a single block of consecutive "1" bits).
>
> Some future x86 implementation might move away from MSRs (which are a
> pain because of the need for cross processor interrupts to execute the
> WRMSR from a CPU in each domain).
>
> So the basic CAT L3 modules should continue to provide this legacy
> schemata interface (as best they can).
>
> But this opens up the possibility for you to provide an alternate module
> that makes MPAM's 'CMAX' visible to knowledgeable users that have a use
> case that would benefit from it.
>
> > One difference I can't get my head round is how to handle platforms that use relative
> > percentages and fractions - and those that take an absolute MB/s value.
>
> Percentages were a terrible idea. I'm so, so, sorry. I just did a pass
> through, and that was a mistake. The "mba_MBps" mount option was a
> belated attempt to repair the damage by giving the user a parameter
> that makes more sense ... though the s/w feedback loop can't really
> ever run fast enough. So applications that go through phases of high
> and low memory bandwidth are often out of compliance with the bandwidth
> limits set by the user.
> >
> >
> > > 9) Note that the core code starts with an empty list of resources.
> > > System admins must load modules to add support for each resource they
> > > want to use.
> >
> > I think this just moves the problem to modules. 'CAT' would get duplicated by all
> > architectures. MB is subtly different between them all, but user-space doesn't want to be
> > concerned with the differences.
>
> The duplicated part is just the piece that parses and validates the user
> input for the CAT line in schemata. I haven't written the code yet, but
> that feels like it will be a small part of the module. If I'm wrong,
> then maybe we'd just make a "lib" directory for the shared functions
> needed by multiple modules.
>
> > > We'd need a bunch of modules to cover existing x86 functionality. E.g.
> > > an "L3" one for standard L3 cache allocation, an "L3CDP" one to be used
> > > instead of the plain "L3" one for code/data priority mode by creating
> > > a separate resource for each of code & data.
> >
> > CDP may have system wide side-effects. For MPAM if you enable the emulation of that, then
> > resources that resctrl doesn't believe use CDP have to double-configure and double-count
> > everything.
>
> I don't know enough about MPAM to fully understand that. On x86 CDP is a
> weird hack that keeps one resource and makes it double up lines in
> schemata, also doubling the staging space in the domain structures
> for the potential updated bitmasks before committing them. It would seem
> cleaner to have separate resources, and a driver that knows that one
> gets the even MSRs in the array while the other gets the odd MSRs.

In the RISC-V CBQRI spec [1], there is the concept of Access Type (AT
field) for capacity or bandwidth resources. The current encoding for the
field has code and data types (similar to CDP), but additional types are
possible in the future. I don't believe resctrl has the ability to apply
the CDP concept to bandwidth so that is a gap for CBQRI.

>
> > > Logically separate mbm_local, mbm_total, and llc_cache_occupancy modules
> > > (though could combine the mbm ones because they both need a periodic
> > > counter read to avoid wraparound). "MB" for memory bandwidth allocation.
> >
> > llc_cache_occupancy isn't a counter, but I'd prefer to bundle the others through perf.
> > That already has an interface for discovering and configuring events. I understand it was
> > tried and removed, but I think I've got a handle on making this work.
>
> perf and llc_occupancy really were a bad match. But there are glitches
> for memory bandwidth monitoring too. You'd really want the memory
> traffic for cache evictions to be added to the counter for the tasks
> that read that data into the cache. But if you do perf style monitoring
> only while the task is running, then all those evictions are going to be
> billed to the "wrong" process. Though perhaps you can handwave and say
> that because a process is causing the evictions by bringing new data
> into the cache, it really is responsible for those evictions.
>
> > > The "mba_MBps" mount option would be replaced with a module that does
> > > both memory bandwidth allocation and monitoring, with a s/w feedback loop.
> >
> > Keeping purely software features self contained is a great idea.
> >
> >
> > > Peter's workaround for the quirks of AMD monitoring could become a
> > > separate AMD specific module. But minor differences (e.g. contiguous
> > > cache bitmask Intel requirements) could be handled within a module
> > > if desired.
> > >
> > > Pseudo-locking would be another case to load a different module to
> > > set up pseudo-locking and enforce the cache bitmask rules between resctrl
> > > groups instead of the basic cache allocation one.
> > >
> > > Core resctrl code could handle overlaps between modules that want to
> > > control the same resource with a "first to load reserves that feature"
> > > policy.
> >
> > > Are there additional ARM specific architectural requirements that this
> > > approach isn't addressing? Could the core functionality be extended to
> > > make life easier for ARM?
> >
> > (We've got RISC-V to consider too - hence adding Drew Fustini [0])
> >
> > My known issues list is:
> > * RMIDs.
> > These are an independent number space for RDT. For MPAM they are an
> > extension of the partid/closid space. There is no value that can be
> > exposed to user-space as num_rmid as it depends on how they will be
> > used.
>
> Is the answer to avoid exposing this in the info/ directory and just
> letting the user know they hit the limit when a "mkdir" fails?
>
> > * Monitors.
> > RDT has one counter per RMID, they run continuously. MPAM advertises
> > how many monitors it has, which is very likely to be fewer than we
> > want. This means MPAM can't expose the free-running MBM_* counters
> > via the filesystem. These would need exposing via perf.
>
> So you don't create the "mon_data" directory. Do you still have
> "mon_groups" to allow user to get subtotal counts for subsets of tasks
> with the same allocation (CLOS) id?
>
> > * Bitmaps.
> > MPAM has some bitmaps, but it has other stuff too. Forcing the bitmaps
> > to be the user-space interface requires the underlying control to be
> > capable of isolation. Ideally user-space would set a priority/cost for
> > each rdtgroup, and indicate whether they should be isolated from other
> > rdtgroup at the same level.
>
> Reinette tackled this with the limited set of tools available to
> implement pseudo-locking ... but that's really for a very specific set
> of use cases. Intel doesn't have a "priority" h/w knob, but I agree it
> would be good to have and expose.
>
> > * Memory bandwidth.
> > For MB resources that control bandwidth, X86 provides user-space with
> > the cache-id of the cache that implements that bandwidth controls. For
> > MPAM there is no expectation that this is being done by a cache, it could
> > equally be a memory controller.
>
> Intel will have something besides the L3-tied control at some point. In
> fact two very different somethings. Another reason that I want the
> option to pull all these competing controls out into modules and leave
> it to the user to pick which ones to load to meet their application
> requirements.

It is good to hear that Intel is moving beyond just L3 bandwidth. Similar
to MPAM, the RISC-V CBQRI spec defines a bandwidth controller interface
that can monitor and allocate, but there is no concrete cache level
defined. It could be a DDR memory controller or an SoC interconnect.

>
> >
> > I'd really like to have these solved as part of a cross-architecture user-space ABI. I'm
> > not sure platform-specific modules solve the user-space problem.
>
> Where architectures can converge on a user interface, we should indeed
> try to do so. But this might be difficult if the companies that produce
> new resource control features want to keep the details under wraps for
> as long as possible.
>
> So we might see a re-run of "Here's some resctrl code to implement a new
> control feature on Intel" ... and have the user interface set in stone
> before other architectures have time to comment about whether the API is
> flexible enough to handle different h/w implementations.
>
> > Otherwise MPAM has additional types of control, which could be applied to any kind of
> > resource. The oddest is 'PRI' which is just a priority. I've not yet heard of a system
> > using it, but this could appear at any choke point in the SoC, it may not be on a cache or
> > memory controller.
>
> Could you handle that in schemata? A line that specifies priority from
> 0..N that translates to adding some field to the tasks in that group
> that gets loaded into h/w during context switch.
>
> > The 'types of control' and 'resource' distinction may help in places where Intel/AMD take
> > wildly different values to configure the same resource. (*cough* MB)
>
> MB is the poster child for how NOT to do a resource control.
>
> > [0] lore.kernel.org/r/[email protected]
>
> A quick glance looks like some of this is s/CLOSID/RCID/ s/RMID/RCID/
> and then load into a CSR instead of an MSR. But that's just the "how
> does s/w tell h/w which controls/counters to use". I'm sure that RISC-V
> will diverge in ways to keep OS implementation interesting.

That two-patch series linked above is just to add support for the Ssqosid
extension (supervisor mode QoS ID). The extension adds the sqoscfg CSR
(QoS configuration register) which contains RCID (Resource Control ID)
and MCID (Monitoring Counter ID). RCID and MCID are independent of each
other. RCID can be used like CLOSID and MCID can be used like RMID.
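
To make the RCID/MCID split concrete, here is a minimal sketch of packing
both IDs into a single sqoscfg value. The field positions (RCID in bits
11:0, MCID in bits 27:16) and the helper names are assumptions made for
this illustration and should be checked against the Ssqosid spec:

```c
#include <stdint.h>

/* Assumed layout for this sketch: RCID in [11:0], MCID in [27:16]. */
#define SQOSCFG_RCID_MASK	0x00000fffu
#define SQOSCFG_MCID_SHIFT	16
#define SQOSCFG_MCID_MASK	0x0fff0000u

/* Combine a control ID and a monitoring ID into one CSR value. */
static uint32_t sqoscfg_pack(uint32_t rcid, uint32_t mcid)
{
	return (rcid & SQOSCFG_RCID_MASK) |
	       ((mcid << SQOSCFG_MCID_SHIFT) & SQOSCFG_MCID_MASK);
}

/* Extract the control ID (used like a CLOSID). */
static uint32_t sqoscfg_rcid(uint32_t val)
{
	return val & SQOSCFG_RCID_MASK;
}

/* Extract the monitoring ID (used like an RMID). */
static uint32_t sqoscfg_mcid(uint32_t val)
{
	return (val & SQOSCFG_MCID_MASK) >> SQOSCFG_MCID_SHIFT;
}
```

On context switch the kernel would write the packed value for the incoming
task to the CSR, much as x86 writes CLOSID and RMID together to one MSR.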

The sqoscfg CSR is just the ISA part of the RISC-V QoS solution. The
majority of the functionality is contained in a non-ISA specification
called CBQRI (Capacity and Bandwidth Qos Register Interface). I sent an
RFC [2] back in April to allow an CBQRI capable controller to interface
with resctrl. That RFC series depended on the MPAM snapshot by James [3]
which allows other architectures to interface with the resctrl fs code.

The CBQRI spec tries very hard to be generic enough to allow for a
variety of implementations. The register interface describes two types
of shared resource controllers: a capacity controller and a bandwidth
controller. Both controller interfaces provide the ability to monitor
usage and control allocation, although implementations are free to just
implement monitoring or allocation.

The capacity controller register interface will often be used by cache
controllers but it is also possible for the interface to be used for
other shared resources like a TLB. Resource usage is monitored and
allocated in the form of capacity blocks which have no unit of measure
in the spec. resctrl currently can only support a CBQRI capacity
controller if it is described in the device tree as L2 or L3 cache.

Similarly, the bandwidth controller register interface supports
monitoring and allocation in the form of bandwidth blocks which have no
unit. A memory controller or interconnect may implement this register
interface. Unfortunately, resctrl assumes that a bandwidth resource is
always L3 cache. For the CBQRI RFC, the example SoC had 3 memory
controllers which implement the CBQRI bandwidth interface. The only way
that I could find to represent this in resctrl was to create a "fake" MB
domain for each memory controller.
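
To illustrate what that looks like from user-space, here is a small sketch
that builds the MB schemata line for such a system, with one "fake" domain
per memory controller. The struct, the domain numbering and the values are
hypothetical and are not taken from the RFC:

```c
#include <stddef.h>
#include <stdio.h>

/* One "fake" MB domain standing in for a memory controller. */
struct fake_mb_domain {
	int id;			/* domain id shown in schemata (hypothetical) */
	unsigned int val;	/* allocation value for that controller */
};

/*
 * Format "MB:<id>=<val>;<id>=<val>..." into buf.
 * Returns the string length, or -1 if buf is too small.
 */
static int format_mb_line(char *buf, size_t len,
			  const struct fake_mb_domain *d, size_t n)
{
	size_t i;
	int off = snprintf(buf, len, "MB:");

	for (i = 0; i < n; i++) {
		if (off < 0 || (size_t)off >= len)
			return -1;
		off += snprintf(buf + off, len - (size_t)off, "%s%d=%u",
				i ? ";" : "", d[i].id, d[i].val);
	}

	return (off < 0 || (size_t)off >= len) ? -1 : off;
}
```

With three controllers at ids 0, 1 and 2 this yields a line in the same
shape resctrl already uses for MB, even though no cache is involved.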

Thanks,
Drew

[1] https://github.com/riscv-non-isa/riscv-cbqri/blob/main/riscv-cbqri.pdf
[2] https://lore.kernel.org/linux-riscv/[email protected]/
[3] https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/snapshot/v6.3