2023-05-25 18:16:06

by James Morse

Subject: [PATCH v4 00/24] x86/resctrl: monitored closid+rmid together, separate arch/fs locking

Hello!

Changes since v3 are to split the patches at the end of the series a little
further, and to make resctrl_arch_rmid_read() re-entrant, which is needed
before a nohz_full CPU can call it in process context and then be IPI'd from
another CPU. Otherwise, changes since v3 are noted in each patch.

This series does two things: it changes resctrl to call resctrl_arch_rmid_read()
in a way that works for MPAM, and it separates the locking so that the arch code
and filesystem code don't have to share a mutex. I tried to split this into two
series, but they touch similar call sites, so that would create more work.

(What's MPAM? See the cover letter of the first series. [1])

On x86 the RMID is an independent number. MPAM's equivalent is PMG, but this
isn't an independent number - it extends the PARTID (same as CLOSID) space
with bits that aren't used to select the configuration. The monitors can
then be told to match specific PMG values, allowing monitor-groups to be
created.

But MPAM expects the monitors to always monitor by PARTID. The
Cache-storage-utilisation counters can only work this way.
(In the MPAM spec, not setting the MATCH_PARTID bit is CONSTRAINED
UNPREDICTABLE - Arm's term meaning portable software can't rely on
the behaviour.)

It gets worse, as some SoCs may have very few PMG bits. I've seen the
datasheet for one that has a single bit of PMG space.

To be usable, MPAM's counters always need the PARTID and the PMG.
For resctrl, this means always making the CLOSID available when the RMID
is used.

To ensure RMIDs are always unique, this series combines the CLOSID and RMID
into an index, and manages RMIDs based on that. For x86, the index and RMID
are always the same.


Currently the architecture-specific code in the cpuhp callbacks takes the
rdtgroup_mutex. This means the filesystem code would have to export this
lock, resulting in an ill-defined interface between the two, and the
possibility of cross-architecture lock-ordering headaches.

The second part of this series adds a domain_list_lock to protect writes to
the domain list, and protects the domain list with RCU - or cpus_read_lock().

RCU is used to allow lockless readers of the domain list; today resctrl only
has one, rdt_bit_usage_show(). But to get MPAM's monitors working, it's very
likely they'll need to be plumbed up to perf. The uncore PMU driver would then
be a second lockless reader of the domain list.
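
A lockless reader would follow the usual RCU pattern; patch 24 converts
rdt_bit_usage_show() to do exactly this:

  rcu_read_lock();
  list_for_each_entry_rcu(d, &r->domains, list) {
          /* read-only walk of the domain; must not sleep or block here */
  }
  rcu_read_unlock();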

This series is based on v6.4-rc1, and can be retrieved from:
https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git mpam/monitors_and_locking/v4

Bugs welcome,


Thanks,

James


[1] https://lore.kernel.org/lkml/[email protected]/
[v1] https://lore.kernel.org/all/[email protected]/
[v2] https://lore.kernel.org/lkml/[email protected]/
[v3] https://lore.kernel.org/r/[email protected]

James Morse (24):
x86/resctrl: Track the closid with the rmid
x86/resctrl: Access per-rmid structures by index
x86/resctrl: Create helper for RMID allocation and mondata dir
creation
x86/resctrl: Move rmid allocation out of mkdir_rdt_prepare()
x86/resctrl: Allow RMID allocation to be scoped by CLOSID
x86/resctrl: Track the number of dirty RMID a CLOSID has
x86/resctrl: Use set_bit()/clear_bit() instead of open coding
x86/resctrl: Allocate the cleanest CLOSID by searching
closid_num_dirty_rmid
x86/resctrl: Move CLOSID/RMID matching and setting to use helpers
tick/nohz: Move tick_nohz_full_mask declaration outside the #ifdef
x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow
x86/resctrl: Make resctrl_arch_rmid_read() retry when it is
interrupted
x86/resctrl: Queue mon_event_read() instead of sending an IPI
x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
x86/resctrl: Allow arch to allocate memory needed in
resctrl_arch_rmid_read()
x86/resctrl: Make resctrl_mounted checks explicit
x86/resctrl: Move alloc/mon static keys into helpers
x86/resctrl: Make rdt_enable_key the arch's decision to switch
x86/resctrl: Add helpers for system wide mon/alloc capable
x86/resctrl: Add cpu online callback for resctrl work
x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but
cpu
x86/resctrl: Add cpu offline callback for resctrl work
x86/resctrl: Move domain helper migration into resctrl_offline_cpu()
x86/resctrl: Separate arch and fs resctrl locks

arch/x86/include/asm/resctrl.h | 90 +++++
arch/x86/kernel/cpu/resctrl/core.c | 78 ++--
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 45 ++-
arch/x86/kernel/cpu/resctrl/internal.h | 81 ++++-
arch/x86/kernel/cpu/resctrl/monitor.c | 414 ++++++++++++++++------
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 15 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 335 ++++++++++++-----
include/linux/resctrl.h | 42 ++-
include/linux/tick.h | 9 +-
9 files changed, 846 insertions(+), 263 deletions(-)

--
2.39.2



2023-05-25 18:16:21

by James Morse

Subject: [PATCH v4 07/24] x86/resctrl: Use set_bit()/clear_bit() instead of open coding

The resctrl CLOSID allocator uses a single 32-bit word to track which
CLOSIDs are free. The setting and clearing of bits is open-coded.

A subsequent patch adds resctrl_closid_is_free(), which adds more open-coded
bitmap operations. These will eventually need changing to use the bitops
helpers so that a CLOSID bitmap of the correct size can be allocated
dynamically.

Convert the existing open-coded bit manipulations of closid_free_map
to use set_bit() and friends.
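
For a single unsigned long the two forms have the same effect (the bitops
helpers are additionally atomic), e.g.:

  closid_free_map &= ~(1 << closid);     /* open coded */
  clear_bit(closid, &closid_free_map);   /* bitops helper */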

Signed-off-by: James Morse <[email protected]>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index ba0595508b2f..6bf5623f82b4 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -106,7 +106,7 @@ void rdt_staged_configs_clear(void)
* - Our choices on how to configure each resource become progressively more
* limited as the number of resources grows.
*/
-static int closid_free_map;
+static unsigned long closid_free_map;
static int closid_free_map_len;

int closids_supported(void)
@@ -126,7 +126,7 @@ static void closid_init(void)
closid_free_map = BIT_MASK(rdt_min_closid) - 1;

/* CLOSID 0 is always reserved for the default group */
- closid_free_map &= ~1;
+ clear_bit(0, &closid_free_map);
closid_free_map_len = rdt_min_closid;
}

@@ -137,14 +137,14 @@ static int closid_alloc(void)
if (closid == 0)
return -ENOSPC;
closid--;
- closid_free_map &= ~(1 << closid);
+ clear_bit(closid, &closid_free_map);

return closid;
}

void closid_free(int closid)
{
- closid_free_map |= 1 << closid;
+ set_bit(closid, &closid_free_map);
}

/**
@@ -156,7 +156,7 @@ void closid_free(int closid)
*/
static bool closid_allocated(unsigned int closid)
{
- return (closid_free_map & (1 << closid)) == 0;
+ return !test_bit(closid, &closid_free_map);
}

/**
--
2.39.2


2023-05-25 18:16:27

by James Morse

Subject: [PATCH v4 22/24] x86/resctrl: Add cpu offline callback for resctrl work

The resctrl architecture-specific code may need to free a domain when
a CPU goes offline; it also needs to reset the CPU's PQR_ASSOC register.
Amongst other things, the resctrl filesystem code needs to clear this
CPU from the cpu_mask of any control and monitor groups.

Currently this is all done in core.c and called from
resctrl_offline_cpu(), making the split between architecture and
filesystem code unclear.

Move the filesystem work to remove the CPU from the control and monitor
groups into a filesystem helper called resctrl_offline_cpu(), and rename
the one in core.c to resctrl_arch_offline_cpu().

The rdtgroup_mutex is unlocked and locked again in the call in
preparation for changing the locking rules for the architecture
code.

Signed-off-by: James Morse <[email protected]>
---
arch/x86/kernel/cpu/resctrl/core.c | 25 +++++--------------------
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 24 ++++++++++++++++++++++++
include/linux/resctrl.h | 1 +
3 files changed, 30 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 187ed127a446..9128a9710537 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -623,31 +623,15 @@ static int resctrl_arch_online_cpu(unsigned int cpu)
return ret;
}

-static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
+static int resctrl_arch_offline_cpu(unsigned int cpu)
{
- struct rdtgroup *cr;
-
- list_for_each_entry(cr, &r->mon.crdtgrp_list, mon.crdtgrp_list) {
- if (cpumask_test_and_clear_cpu(cpu, &cr->cpu_mask)) {
- break;
- }
- }
-}
-
-static int resctrl_offline_cpu(unsigned int cpu)
-{
- struct rdtgroup *rdtgrp;
struct rdt_resource *r;

mutex_lock(&rdtgroup_mutex);
+ resctrl_offline_cpu(cpu);
+
for_each_capable_rdt_resource(r)
domain_remove_cpu(cpu, r);
- list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
- if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
- clear_childcpus(rdtgrp, cpu);
- break;
- }
- }
clear_closid_rmid(cpu);
mutex_unlock(&rdtgroup_mutex);

@@ -970,7 +954,8 @@ static int __init resctrl_late_init(void)

state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
"x86/resctrl/cat:online:",
- resctrl_arch_online_cpu, resctrl_offline_cpu);
+ resctrl_arch_online_cpu,
+ resctrl_arch_offline_cpu);
if (state < 0)
return state;

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 3373b11afe01..08b426f52f6d 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3737,6 +3737,30 @@ int resctrl_online_cpu(unsigned int cpu)
return 0;
}

+static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
+{
+ struct rdtgroup *cr;
+
+ list_for_each_entry(cr, &r->mon.crdtgrp_list, mon.crdtgrp_list) {
+ if (cpumask_test_and_clear_cpu(cpu, &cr->cpu_mask))
+ break;
+ }
+}
+
+void resctrl_offline_cpu(unsigned int cpu)
+{
+ struct rdtgroup *rdtgrp;
+
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
+ if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
+ clear_childcpus(rdtgrp, cpu);
+ break;
+ }
+ }
+}
+
/*
* rdtgroup_init - rdtgroup initialization
*
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 089b91133e5e..c4be3453b3ff 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -226,6 +226,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
int resctrl_online_cpu(unsigned int cpu);
+void resctrl_offline_cpu(unsigned int cpu);

/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
--
2.39.2


2023-05-25 18:16:45

by James Morse

Subject: [PATCH v4 02/24] x86/resctrl: Access per-rmid structures by index

Because of the differences between Intel RDT/AMD QoS and Arm's MPAM
monitors, RMID values on arm64 are not unique unless the CLOSID is
also included. Bitmaps like rmid_busy_llc need to be sized by the
number of unique entries for this resource.

Add helpers to encode/decode the CLOSID and RMID to an index. The
domain's rmid_busy_llc and rmid_ptrs[] are then sized by index,
as are the domain mbm_local and mbm_total arrays.
On x86, the index is always just the RMID, so all these structures
remain the same size.

The index gives resctrl a unique value it can use to store monitor
values, and allows MPAM to decode the CLOSID when reading the hardware
counters.
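
As a worked example of the sizing (illustrative numbers, not from any
particular SoC): x86 with 256 RMIDs has idx_limit == 256, as before. An
MPAM system with 8 PARTIDs and 1 bit of PMG space would have 8 * 2 == 16
unique indexes, even though num_rmid would only be 2.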

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
Changes since v1:
* Added X86_BAD_CLOSID macro to make it clear what this value means
* Added second WARN_ON() for closid checking, and made both _ONCE()

Changes since v2:
* Added RESCTRL_RESERVED_CLOSID
* Removed a newline
* Rephrased some comments
* Renamed a variable 'ignore'd
* Moved X86_RESCTRL_BAD_CLOSID to a previous patch

Changes since v3:
* Changed a variable name
* Fixed various typos
---
arch/x86/include/asm/resctrl.h | 17 ++++++
arch/x86/kernel/cpu/resctrl/core.c | 2 +-
arch/x86/kernel/cpu/resctrl/internal.h | 1 +
arch/x86/kernel/cpu/resctrl/monitor.c | 84 +++++++++++++++++---------
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 7 ++-
include/linux/resctrl.h | 3 +
6 files changed, 83 insertions(+), 31 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index e906070285fb..dd9b638d43c8 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -101,6 +101,23 @@ static inline void resctrl_sched_in(struct task_struct *tsk)
__resctrl_sched_in(tsk);
}

+static inline u32 resctrl_arch_system_num_rmid_idx(void)
+{
+ /* RMID are independent numbers for x86. num_rmid_idx == num_rmid */
+ return boot_cpu_data.x86_cache_max_rmid + 1;
+}
+
+static inline void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid)
+{
+ *rmid = idx;
+ *closid = X86_RESCTRL_BAD_CLOSID;
+}
+
+static inline u32 resctrl_arch_rmid_idx_encode(u32 ignored, u32 rmid)
+{
+ return rmid;
+}
+
void resctrl_cpu_detect(struct cpuinfo_x86 *c);

#else
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 030d3b409768..4bea032d072e 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -600,7 +600,7 @@ static void clear_closid_rmid(int cpu)
state->default_rmid = 0;
state->cur_closid = 0;
state->cur_rmid = 0;
- wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
+ wrmsr(MSR_IA32_PQR_ASSOC, 0, RESCTRL_RESERVED_CLOSID);
}

static int resctrl_online_cpu(unsigned int cpu)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index f2da908bb079..d571da4848a4 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -7,6 +7,7 @@
#include <linux/kernfs.h>
#include <linux/fs_context.h>
#include <linux/jump_label.h>
+#include <asm/resctrl.h>

#define L3_QOS_CDP_ENABLE 0x01ULL

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 86574adedd64..bcc25f5339c0 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -142,12 +142,29 @@ static inline u64 get_corrected_mbm_count(u32 rmid, unsigned long val)
return val;
}

-static inline struct rmid_entry *__rmid_entry(u32 closid, u32 rmid)
+/*
+ * x86 and arm64 differ in their handling of monitoring.
+ * x86's RMID are an independent number, there is only one source of traffic
+ * with an RMID value of '1'.
+ * arm64's PMG extend the PARTID/CLOSID space, there are multiple sources of
+ * traffic with a PMG value of '1', one for each CLOSID, meaning the RMID
+ * value is no longer unique.
+ * To account for this, resctrl uses an index. On x86 this is just the RMID,
+ * on arm64 it encodes the CLOSID and RMID. This gives a unique number.
+ *
+ * The domain's rmid_busy_llc and rmid_ptrs are sized by index. The arch code
+ * must accept an attempt to read every index.
+ */
+static inline struct rmid_entry *__rmid_entry(u32 idx)
{
struct rmid_entry *entry;
+ u32 closid, rmid;

- entry = &rmid_ptrs[rmid];
- WARN_ON(entry->rmid != rmid);
+ entry = &rmid_ptrs[idx];
+ resctrl_arch_rmid_idx_decode(idx, &closid, &rmid);
+
+ WARN_ON_ONCE(entry->closid != closid);
+ WARN_ON_ONCE(entry->rmid != rmid);

return entry;
}
@@ -277,8 +294,9 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
void __check_limbo(struct rdt_domain *d, bool force_free)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ u32 idx_limit = resctrl_arch_system_num_rmid_idx();
struct rmid_entry *entry;
- u32 crmid = 1, nrmid;
+ u32 idx, cur_idx = 1;
bool rmid_dirty;
u64 val = 0;

@@ -289,12 +307,11 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
* RMID and move it to the free list when the counter reaches 0.
*/
for (;;) {
- nrmid = find_next_bit(d->rmid_busy_llc, r->num_rmid, crmid);
- if (nrmid >= r->num_rmid)
+ idx = find_next_bit(d->rmid_busy_llc, idx_limit, cur_idx);
+ if (idx >= idx_limit)
break;

- entry = __rmid_entry(X86_RESCTRL_BAD_CLOSID, nrmid);// temporary
-
+ entry = __rmid_entry(idx);
if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
QOS_L3_OCCUP_EVENT_ID, &val)) {
rmid_dirty = true;
@@ -303,19 +320,21 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
}

if (force_free || !rmid_dirty) {
- clear_bit(entry->rmid, d->rmid_busy_llc);
+ clear_bit(idx, d->rmid_busy_llc);
if (!--entry->busy) {
rmid_limbo_count--;
list_add_tail(&entry->list, &rmid_free_lru);
}
}
- crmid = nrmid + 1;
+ cur_idx = idx + 1;
}
}

bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
{
- return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
+ u32 idx_limit = resctrl_arch_system_num_rmid_idx();
+
+ return find_first_bit(d->rmid_busy_llc, idx_limit) != idx_limit;
}

/*
@@ -345,6 +364,9 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
struct rdt_domain *d;
int cpu, err;
u64 val = 0;
+ u32 idx;
+
+ idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);

entry->busy = 0;
cpu = get_cpu();
@@ -364,7 +386,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
*/
if (!has_busy_rmid(r, d))
cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL);
- set_bit(entry->rmid, d->rmid_busy_llc);
+ set_bit(idx, d->rmid_busy_llc);
entry->busy++;
}
put_cpu();
@@ -377,14 +399,16 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)

void free_rmid(u32 closid, u32 rmid)
{
+ u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
struct rmid_entry *entry;

- if (!rmid)
- return;
-
lockdep_assert_held(&rdtgroup_mutex);

- entry = __rmid_entry(closid, rmid);
+ /* do not allow the default rmid to be free'd */
+ if (!idx)
+ return;
+
+ entry = __rmid_entry(idx);

if (is_llc_occupancy_enabled())
add_rmid_to_limbo(entry);
@@ -395,11 +419,13 @@ void free_rmid(u32 closid, u32 rmid)
static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 closid,
u32 rmid, enum resctrl_event_id evtid)
{
+ u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
+
switch (evtid) {
case QOS_L3_MBM_TOTAL_EVENT_ID:
- return &d->mbm_total[rmid];
+ return &d->mbm_total[idx];
case QOS_L3_MBM_LOCAL_EVENT_ID:
- return &d->mbm_local[rmid];
+ return &d->mbm_local[idx];
default:
return NULL;
}
@@ -441,7 +467,8 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
*/
static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr)
{
- struct mbm_state *m = &rr->d->mbm_local[rmid];
+ u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
+ struct mbm_state *m = &rr->d->mbm_local[idx];
u64 cur_bw, bytes, cur_bytes;

cur_bytes = rr->val;
@@ -531,7 +558,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
{
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
- u32 cur_bw, delta_bw, user_bw;
+ u32 cur_bw, delta_bw, user_bw, idx;
struct rdt_resource *r_mba;
struct rdt_domain *dom_mba;
struct list_head *head;
@@ -544,7 +571,8 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)

closid = rgrp->closid;
rmid = rgrp->mon.rmid;
- pmbm_data = &dom_mbm->mbm_local[rmid];
+ idx = resctrl_arch_rmid_idx_encode(closid, rmid);
+ pmbm_data = &dom_mbm->mbm_local[idx];

dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
if (!dom_mba) {
@@ -727,19 +755,20 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)

static int dom_data_init(struct rdt_resource *r)
{
+ u32 idx_limit = resctrl_arch_system_num_rmid_idx();
struct rmid_entry *entry = NULL;
- int i, nr_rmids;
+ u32 idx;
+ int i;

- nr_rmids = r->num_rmid;
- rmid_ptrs = kcalloc(nr_rmids, sizeof(struct rmid_entry), GFP_KERNEL);
+ rmid_ptrs = kcalloc(idx_limit, sizeof(struct rmid_entry), GFP_KERNEL);
if (!rmid_ptrs)
return -ENOMEM;

- for (i = 0; i < nr_rmids; i++) {
+ for (i = 0; i < idx_limit; i++) {
entry = &rmid_ptrs[i];
INIT_LIST_HEAD(&entry->list);

- entry->rmid = i;
+ resctrl_arch_rmid_idx_decode(i, &entry->closid, &entry->rmid);
list_add_tail(&entry->list, &rmid_free_lru);
}

@@ -748,7 +777,8 @@ static int dom_data_init(struct rdt_resource *r)
* default_rdtgroup control group, which will be setup later. See
* rdtgroup_setup_root().
*/
- entry = __rmid_entry(0, 0);
+ idx = resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID, 0);
+ entry = __rmid_entry(idx);
list_del(&entry->list);

return 0;
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index ff9ccfcd18bd..023eae69f29e 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3604,16 +3604,17 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)

static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
{
+ u32 idx_limit = resctrl_arch_system_num_rmid_idx();
size_t tsize;

if (is_llc_occupancy_enabled()) {
- d->rmid_busy_llc = bitmap_zalloc(r->num_rmid, GFP_KERNEL);
+ d->rmid_busy_llc = bitmap_zalloc(idx_limit, GFP_KERNEL);
if (!d->rmid_busy_llc)
return -ENOMEM;
}
if (is_mbm_total_enabled()) {
tsize = sizeof(*d->mbm_total);
- d->mbm_total = kcalloc(r->num_rmid, tsize, GFP_KERNEL);
+ d->mbm_total = kcalloc(idx_limit, tsize, GFP_KERNEL);
if (!d->mbm_total) {
bitmap_free(d->rmid_busy_llc);
return -ENOMEM;
@@ -3621,7 +3622,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
}
if (is_mbm_local_enabled()) {
tsize = sizeof(*d->mbm_local);
- d->mbm_local = kcalloc(r->num_rmid, tsize, GFP_KERNEL);
+ d->mbm_local = kcalloc(idx_limit, tsize, GFP_KERNEL);
if (!d->mbm_local) {
bitmap_free(d->rmid_busy_llc);
kfree(d->mbm_total);
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 7d80bae05f59..ff7452f644e4 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -6,6 +6,9 @@
#include <linux/list.h>
#include <linux/pid.h>

+/* CLOSID value used by the default control group */
+#define RESCTRL_RESERVED_CLOSID 0
+
#ifdef CONFIG_PROC_CPU_RESCTRL

int proc_resctrl_show(struct seq_file *m,
--
2.39.2


2023-05-25 18:16:47

by James Morse

Subject: [PATCH v4 16/24] x86/resctrl: Make resctrl_mounted checks explicit

The rdt_enable_key is switched when resctrl is mounted, and used to
prevent a second mount of the filesystem. It also enables the
architecture's context switch code.

This requires another architecture to have the same set of static-keys,
as resctrl depends on them too. The existing users of these static-keys
are implicitly also checking if the filesystem is mounted.

Make the resctrl_mounted checks explicit: resctrl can keep track of
whether it has been mounted once. This doesn't need to be combined with
whether the arch code is context switching the CLOSID.

rdt_mon_enable_key is never used just to test that resctrl is mounted,
but does also have this implication. Add a resctrl_mounted check to all
uses of rdt_mon_enable_key. This will allow rdt_mon_enable_key to be
swapped with a helper in a subsequent patch.

This will allow the static-key changing to be moved behind resctrl_arch_
calls.
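
A sketch of where this leads (the helper below is an assumption based on
patch 17's subject, "Move alloc/mon static keys into helpers"):

  static inline bool resctrl_arch_mon_capable(void)
  {
          return static_branch_unlikely(&rdt_mon_enable_key);
  }

resctrl would then test resctrl_mounted to ask "is the filesystem
mounted", and the arch helper to ask "is monitoring enabled", instead of
overloading one static key for both questions.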

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>

---
Changes since v3:
* Removed a newline.
* Rephrased commit message
---
arch/x86/kernel/cpu/resctrl/internal.h | 1 +
arch/x86/kernel/cpu/resctrl/monitor.c | 12 ++++++++++--
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 21 +++++++++++++++------
3 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a7e025cffdbc..67cabda6fd4d 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -143,6 +143,7 @@ extern bool rdt_alloc_capable;
extern bool rdt_mon_capable;
extern unsigned int rdt_mon_features;
extern struct list_head resctrl_schema_all;
+extern bool resctrl_mounted;

enum rdt_group_type {
RDTCTRL_GROUP = 0,
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 6d140018358a..da5a86c95142 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -832,7 +832,11 @@ void mbm_handle_overflow(struct work_struct *work)

mutex_lock(&rdtgroup_mutex);

- if (!static_branch_likely(&rdt_mon_enable_key))
+ /*
+ * If the filesystem has been unmounted this work no longer needs to
+ * run.
+ */
+ if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key))
goto out_unlock;

r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
@@ -865,7 +869,11 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;

- if (!static_branch_likely(&rdt_mon_enable_key))
+ /*
+ * When a domain comes online there is no guarantee the filesystem is
+ * mounted. If not, there is no need to catch counter overflow.
+ */
+ if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key))
return;
cpu = cpumask_any_housekeeping(&dom->cpu_mask);
dom->mbm_work_cpu = cpu;
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index f2bb3e09ed13..47bb3ab775fc 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -42,6 +42,9 @@ LIST_HEAD(rdt_all_groups);
/* list of entries for the schemata file */
LIST_HEAD(resctrl_schema_all);

+/* The filesystem can only be mounted once. */
+bool resctrl_mounted;
+
/* Kernel fs node for "info" directory under root */
static struct kernfs_node *kn_info;

@@ -815,7 +818,7 @@ int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns,
mutex_lock(&rdtgroup_mutex);

/* Return empty if resctrl has not been mounted. */
- if (!static_branch_unlikely(&rdt_enable_key)) {
+ if (!resctrl_mounted) {
seq_puts(s, "res:\nmon:\n");
goto unlock;
}
@@ -2482,7 +2485,7 @@ static int rdt_get_tree(struct fs_context *fc)
/*
* resctrl file system can only be mounted once.
*/
- if (static_branch_unlikely(&rdt_enable_key)) {
+ if (resctrl_mounted) {
ret = -EBUSY;
goto out;
}
@@ -2530,8 +2533,10 @@ static int rdt_get_tree(struct fs_context *fc)
if (rdt_mon_capable)
static_branch_enable_cpuslocked(&rdt_mon_enable_key);

- if (rdt_alloc_capable || rdt_mon_capable)
+ if (rdt_alloc_capable || rdt_mon_capable) {
static_branch_enable_cpuslocked(&rdt_enable_key);
+ resctrl_mounted = true;
+ }

if (is_mbm_enabled()) {
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
@@ -2802,6 +2807,7 @@ static void rdt_kill_sb(struct super_block *sb)
static_branch_disable_cpuslocked(&rdt_alloc_enable_key);
static_branch_disable_cpuslocked(&rdt_mon_enable_key);
static_branch_disable_cpuslocked(&rdt_enable_key);
+ resctrl_mounted = false;
kernfs_kill_sb(sb);
mutex_unlock(&rdtgroup_mutex);
cpus_read_unlock();
@@ -3633,7 +3639,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
* If resctrl is mounted, remove all the
* per domain monitor data directories.
*/
- if (static_branch_unlikely(&rdt_mon_enable_key))
+ if (resctrl_mounted && static_branch_unlikely(&rdt_mon_enable_key))
rmdir_mondata_subdir_allrdtgrp(r, d->id);

if (is_mbm_enabled())
@@ -3710,8 +3716,11 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
if (is_llc_occupancy_enabled())
INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);

- /* If resctrl is mounted, add per domain monitor data directories. */
- if (static_branch_unlikely(&rdt_mon_enable_key))
+ /*
+ * If the filesystem is not mounted, creating directories is deferred
+ * until mount time by rdt_get_tree() calling mkdir_mondata_all().
+ */
+ if (resctrl_mounted && static_branch_unlikely(&rdt_mon_enable_key))
mkdir_mondata_subdir_allrdtgrp(r, d);

return 0;
--
2.39.2


2023-05-25 18:17:45

by James Morse

Subject: [PATCH v4 11/24] x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow

The limbo and overflow code picks a CPU to use from the domain's list
of online CPUs. Work is then scheduled on these CPUs to maintain
the limbo list and any counters that may overflow.

cpumask_any() may pick a CPU that is marked nohz_full, which will
either penalise the work that CPU was dedicated to, or delay the
processing of the limbo list or the counters that may overflow, perhaps
indefinitely. Delaying the overflow handling will skew the bandwidth
values calculated by mba_sc, which expects to be called once a second.

Add cpumask_any_housekeeping() as a replacement for cpumask_any()
that prefers housekeeping CPUs. This helper will still return
a nohz_full CPU if that is the only option. The CPU to use is
re-evaluated each time the limbo/overflow work runs. This ensures
the work will move off a nohz_full CPU once a housekeeping CPU is
available.
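
For example (hypothetical CPU numbers): with a domain cpu_mask of {2, 3}
and CPU 2 marked nohz_full, cpumask_any() may return 2, while
cpumask_any_housekeeping() returns 3. If both CPUs were nohz_full, it
would still return one of them rather than failing.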

Signed-off-by: James Morse <[email protected]>
---
Changes since v3:
* typos fixed
---
arch/x86/kernel/cpu/resctrl/internal.h | 23 +++++++++++++++++++++++
arch/x86/kernel/cpu/resctrl/monitor.c | 17 ++++++++++++-----
2 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 96fb7658ff74..6f18cf26988c 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -7,6 +7,7 @@
#include <linux/kernfs.h>
#include <linux/fs_context.h>
#include <linux/jump_label.h>
+#include <linux/tick.h>
#include <asm/resctrl.h>

#define L3_QOS_CDP_ENABLE 0x01ULL
@@ -55,6 +56,28 @@
/* Max event bits supported */
#define MAX_EVT_CONFIG_BITS GENMASK(6, 0)

+/**
+ * cpumask_any_housekeeping() - Choose any CPU in @mask, preferring those that
+ * aren't marked nohz_full
+ * @mask: The mask to pick a CPU from.
+ *
+ * Returns a CPU in @mask. If there are housekeeping CPUs that don't use
+ * nohz_full, these are preferred.
+ */
+static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
+{
+ int cpu, hk_cpu;
+
+ cpu = cpumask_any(mask);
+ if (tick_nohz_full_cpu(cpu)) {
+ hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
+ if (hk_cpu < nr_cpu_ids)
+ cpu = hk_cpu;
+ }
+
+ return cpu;
+}
+
struct rdt_fs_context {
struct kernfs_fs_context kfc;
bool enable_cdpl2;
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 128d4c7206e4..e267869d60d5 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -770,9 +770,9 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
void cqm_handle_limbo(struct work_struct *work)
{
unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
- int cpu = smp_processor_id();
struct rdt_resource *r;
struct rdt_domain *d;
+ int cpu;

mutex_lock(&rdtgroup_mutex);

@@ -781,8 +781,10 @@ void cqm_handle_limbo(struct work_struct *work)

__check_limbo(d, false);

- if (has_busy_rmid(r, d))
+ if (has_busy_rmid(r, d)) {
+ cpu = cpumask_any_housekeeping(&d->cpu_mask);
schedule_delayed_work_on(cpu, &d->cqm_limbo, delay);
+ }

mutex_unlock(&rdtgroup_mutex);
}
@@ -792,7 +794,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;

- cpu = cpumask_any(&dom->cpu_mask);
+ cpu = cpumask_any_housekeeping(&dom->cpu_mask);
dom->cqm_work_cpu = cpu;

schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
@@ -802,10 +804,10 @@ void mbm_handle_overflow(struct work_struct *work)
{
unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
struct rdtgroup *prgrp, *crgrp;
- int cpu = smp_processor_id();
struct list_head *head;
struct rdt_resource *r;
struct rdt_domain *d;
+ int cpu;

mutex_lock(&rdtgroup_mutex);

@@ -826,6 +828,11 @@ void mbm_handle_overflow(struct work_struct *work)
update_mba_bw(prgrp, d);
}

+ /*
+ * Re-check for housekeeping CPUs. This allows the overflow handler to
+ * move off a nohz_full CPU quickly.
+ */
+ cpu = cpumask_any_housekeeping(&d->cpu_mask);
schedule_delayed_work_on(cpu, &d->mbm_over, delay);

out_unlock:
@@ -839,7 +846,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)

if (!static_branch_likely(&rdt_mon_enable_key))
return;
- cpu = cpumask_any(&dom->cpu_mask);
+ cpu = cpumask_any_housekeeping(&dom->cpu_mask);
dom->mbm_work_cpu = cpu;
schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
}
--
2.39.2


2023-05-25 18:17:44

by James Morse

Subject: [PATCH v4 24/24] x86/resctrl: Separate arch and fs resctrl locks

resctrl has one mutex that is taken by the architecture-specific code
and by the filesystem parts. The two interact via cpuhp, where the
architecture code updates the domain list. Filesystem handlers that
walk the domains list should not run concurrently with the cpuhp
callback modifying the list.

Exposing a lock from the filesystem code means the interface is not
cleanly defined, and creates the possibility of cross-architecture
lock ordering headaches. The interaction only exists so that certain
filesystem paths are serialised against cpu hotplug. The cpu hotplug
code already has a mechanism to do this using cpus_read_lock().

MPAM's monitors have an overflow interrupt, so it needs to be possible
to walk the domains list in irq context. RCU is ideal for this,
but some paths need to be able to sleep to allocate memory.

Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
of a cpuhp callback, cpus_read_lock() must always be taken first.
rdtgroup_schemata_write() already does this.

Most of the filesystem code's domain list walkers are currently
protected by the rdtgroup_mutex taken in rdtgroup_kn_lock_live().
The exceptions are rdt_bit_usage_show() and the mon_config helpers
which take the lock directly.

Make the domain list protected by RCU. An architecture-specific
lock prevents concurrent writers. rdt_bit_usage_show() can
walk the domain list under rcu_read_lock(). The mon_config helpers
send multiple IPIs; take the cpus_read_lock() in these cases.

The other filesystem list walkers need to be able to sleep.
Add cpus_read_lock() to rdtgroup_kn_lock_live() so that the
cpuhp callbacks can't be invoked when file system operations are
occurring.

Add lockdep_assert_cpus_held() in the cases where the
rdtgroup_kn_lock_live() call isn't obvious.

Resctrl's domain online/offline calls now need to take the
rdtgroup_mutex themselves.
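
The resulting locking rules can be summarised as (sketch):

  /*
   * cpus_read_lock()  - outermost; taken by rdtgroup_kn_lock_live()
   *                     and the mon_config helpers.
   * rdtgroup_mutex    - filesystem state; nests inside cpus_read_lock().
   * domain_list_lock  - arch-private; serialises domain list writers.
   * rcu_read_lock()   - for domain list readers that can't sleep, e.g.
   *                     rdt_bit_usage_show() or a future MPAM
   *                     overflow-interrupt handler.
   */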

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
Changes since v2:
* Reworded a comment,
* Added a lockdep assertion
* Moved clear_closid_rmid() outside the locked region of cpu
online/offline

Changes since v3:
* Added a header include
---
arch/x86/kernel/cpu/resctrl/core.c | 38 +++++++++-----
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 16 ++++--
arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 3 ++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 63 ++++++++++++++++++++---
include/linux/resctrl.h | 2 +-
6 files changed, 100 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index edc0dd123317..f106c68a9be8 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -25,8 +25,15 @@
#include <asm/resctrl.h>
#include "internal.h"

-/* Mutex to protect rdtgroup access. */
-DEFINE_MUTEX(rdtgroup_mutex);
+/*
+ * rdt_domain structures are kfree()d when their last CPU goes offline,
+ * and allocated when the first CPU in a new domain comes online.
+ * The rdt_resource's domain list is updated when this happens. Readers of
+ * the domain list must either take cpus_read_lock(), or rely on an RCU
+ * read-side critical section, to avoid observing concurrent modification.
+ * All writers take this mutex:
+ */
+static DEFINE_MUTEX(domain_list_lock);

/*
* The cached resctrl_pqr_state is strictly per CPU and can never be
@@ -508,6 +515,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
struct rdt_domain *d;
int err;

+ lockdep_assert_held(&domain_list_lock);
+
d = rdt_find_domain(r, id, &add_pos);
if (IS_ERR(d)) {
pr_warn("Couldn't find cache id for CPU %d\n", cpu);
@@ -541,11 +550,12 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;
}

- list_add_tail(&d->list, add_pos);
+ list_add_tail_rcu(&d->list, add_pos);

err = resctrl_online_domain(r, d);
if (err) {
- list_del(&d->list);
+ list_del_rcu(&d->list);
+ synchronize_rcu();
domain_free(hw_dom);
}
}
@@ -556,6 +566,8 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;

+ lockdep_assert_held(&domain_list_lock);
+
d = rdt_find_domain(r, id, NULL);
if (IS_ERR_OR_NULL(d)) {
pr_warn("Couldn't find cache id for CPU %d\n", cpu);
@@ -566,7 +578,8 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
cpumask_clear_cpu(cpu, &d->cpu_mask);
if (cpumask_empty(&d->cpu_mask)) {
resctrl_offline_domain(r, d);
- list_del(&d->list);
+ list_del_rcu(&d->list);
+ synchronize_rcu();

/*
* rdt_domain "d" is going to be freed below, so clear
@@ -594,30 +607,29 @@ static void clear_closid_rmid(int cpu)
static int resctrl_arch_online_cpu(unsigned int cpu)
{
struct rdt_resource *r;
- int ret;

- mutex_lock(&rdtgroup_mutex);
+ mutex_lock(&domain_list_lock);
for_each_capable_rdt_resource(r)
domain_add_cpu(cpu, r);
+ mutex_unlock(&domain_list_lock);
+
clear_closid_rmid(cpu);

- ret = resctrl_online_cpu(cpu);
- mutex_unlock(&rdtgroup_mutex);
-
- return ret;
+ return resctrl_online_cpu(cpu);
}

static int resctrl_arch_offline_cpu(unsigned int cpu)
{
struct rdt_resource *r;

- mutex_lock(&rdtgroup_mutex);
resctrl_offline_cpu(cpu);

+ mutex_lock(&domain_list_lock);
for_each_capable_rdt_resource(r)
domain_remove_cpu(cpu, r);
+ mutex_unlock(&domain_list_lock);
+
clear_closid_rmid(cpu);
- mutex_unlock(&rdtgroup_mutex);

return 0;
}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 280d66fae21c..d8d7c127403b 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -209,6 +209,9 @@ static int parse_line(char *line, struct resctrl_schema *s,
struct rdt_domain *d;
unsigned long dom_id;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
(r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)) {
rdt_last_cmd_puts("Cannot pseudo-lock MBA resource\n");
@@ -313,6 +316,9 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
struct rdt_domain *d;
u32 idx;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
return -ENOMEM;

@@ -378,11 +384,9 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
return -EINVAL;
buf[nbytes - 1] = '\0';

- cpus_read_lock();
rdtgrp = rdtgroup_kn_lock_live(of->kn);
if (!rdtgrp) {
rdtgroup_kn_unlock(of->kn);
- cpus_read_unlock();
return -ENOENT;
}
rdt_last_cmd_clear();
@@ -444,7 +448,6 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
out:
rdt_staged_configs_clear();
rdtgroup_kn_unlock(of->kn);
- cpus_read_unlock();
return ret ?: nbytes;
}

@@ -464,6 +467,9 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
bool sep = false;
u32 ctrl_val;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
seq_printf(s, "%*s:", max_name_width, schema->name);
list_for_each_entry(dom, &r->domains, list) {
if (sep)
@@ -534,8 +540,8 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
{
int cpu;

- /* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */
- lockdep_assert_held(&rdtgroup_mutex);
+ /* When picking a cpu from cpu_mask, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();

/*
* setup the parameters to pass to mon_event_count() to read the data.
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ae02185f3354..41b4cd2c7d64 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -15,6 +15,7 @@
* Software Developer Manual June 2016, volume 3, section 17.17.
*/

+#include <linux/cpu.h>
#include <linux/module.h>
#include <linux/sizes.h>
#include <linux/slab.h>
@@ -476,6 +477,9 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)

lockdep_assert_held(&rdtgroup_mutex);

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);

entry->busy = 0;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 460421051abf..fc3ed917d173 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -830,6 +830,9 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
struct rdt_domain *d_i;
bool ret = false;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
return true;

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 3a8e2c98b611..9002ac728001 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -35,6 +35,10 @@
DEFINE_STATIC_KEY_FALSE(rdt_enable_key);
DEFINE_STATIC_KEY_FALSE(rdt_mon_enable_key);
DEFINE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
+
+/* Mutex to protect rdtgroup access. */
+DEFINE_MUTEX(rdtgroup_mutex);
+
static struct kernfs_root *rdt_root;
struct rdtgroup rdtgroup_default;
LIST_HEAD(rdt_all_groups);
@@ -950,7 +954,8 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,

mutex_lock(&rdtgroup_mutex);
hw_shareable = r->cache.shareable_bits;
- list_for_each_entry(dom, &r->domains, list) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(dom, &r->domains, list) {
if (sep)
seq_putc(seq, ';');
sw_shareable = 0;
@@ -1006,8 +1011,10 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
}
sep = true;
}
+ rcu_read_unlock();
seq_putc(seq, '\n');
mutex_unlock(&rdtgroup_mutex);
+
return 0;
}

@@ -1250,6 +1257,9 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
struct rdt_domain *d;
u32 ctrl;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
list_for_each_entry(s, &resctrl_schema_all, list) {
r = s->res;
if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
@@ -1516,6 +1526,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
struct rdt_domain *dom;
bool sep = false;

+ cpus_read_lock();
mutex_lock(&rdtgroup_mutex);

list_for_each_entry(dom, &r->domains, list) {
@@ -1532,6 +1543,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
seq_puts(s, "\n");

mutex_unlock(&rdtgroup_mutex);
+ cpus_read_unlock();

return 0;
}
@@ -1623,6 +1635,9 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
struct rdt_domain *d;
int ret = 0;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
next:
if (!tok || tok[0] == '\0')
return 0;
@@ -1664,6 +1679,7 @@ static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of,
if (nbytes == 0 || buf[nbytes - 1] != '\n')
return -EINVAL;

+ cpus_read_lock();
mutex_lock(&rdtgroup_mutex);

rdt_last_cmd_clear();
@@ -1673,6 +1689,7 @@ static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of,
ret = mon_config_write(r, buf, QOS_L3_MBM_TOTAL_EVENT_ID);

mutex_unlock(&rdtgroup_mutex);
+ cpus_read_unlock();

return ret ?: nbytes;
}
@@ -1688,6 +1705,7 @@ static ssize_t mbm_local_bytes_config_write(struct kernfs_open_file *of,
if (nbytes == 0 || buf[nbytes - 1] != '\n')
return -EINVAL;

+ cpus_read_lock();
mutex_lock(&rdtgroup_mutex);

rdt_last_cmd_clear();
@@ -1697,6 +1715,7 @@ static ssize_t mbm_local_bytes_config_write(struct kernfs_open_file *of,
ret = mon_config_write(r, buf, QOS_L3_MBM_LOCAL_EVENT_ID);

mutex_unlock(&rdtgroup_mutex);
+ cpus_read_unlock();

return ret ?: nbytes;
}
@@ -2149,6 +2168,9 @@ static int set_cache_qos_cfg(int level, bool enable)
struct rdt_domain *d;
int cpu;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
if (level == RDT_RESOURCE_L3)
update = l3_qos_cfg_update;
else if (level == RDT_RESOURCE_L2)
@@ -2337,6 +2359,7 @@ struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn)
atomic_inc(&rdtgrp->waitcount);
kernfs_break_active_protection(kn);

+ cpus_read_lock();
mutex_lock(&rdtgroup_mutex);

/* Was this group deleted while we waited? */
@@ -2354,6 +2377,7 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn)
return;

mutex_unlock(&rdtgroup_mutex);
+ cpus_read_unlock();

if (atomic_dec_and_test(&rdtgrp->waitcount) &&
(rdtgrp->flags & RDT_DELETED)) {
@@ -2651,6 +2675,9 @@ static int reset_all_ctrls(struct rdt_resource *r)
struct rdt_domain *d;
int i;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
return -ENOMEM;

@@ -2935,6 +2962,9 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_domain *dom;
int ret;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
list_for_each_entry(dom, &r->domains, list) {
ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
if (ret)
@@ -3625,7 +3655,8 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
kfree(d->mbm_local);
}

-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+static void _resctrl_offline_domain(struct rdt_resource *r,
+ struct rdt_domain *d)
{
lockdep_assert_held(&rdtgroup_mutex);

@@ -3660,6 +3691,13 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
domain_destroy_mon_state(d);
}

+void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+ mutex_lock(&rdtgroup_mutex);
+ _resctrl_offline_domain(r, d);
+ mutex_unlock(&rdtgroup_mutex);
+}
+
static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
@@ -3691,7 +3729,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+static int _resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
{
int err;

@@ -3727,12 +3765,23 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

+int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+ int err;
+
+ mutex_lock(&rdtgroup_mutex);
+ err = _resctrl_online_domain(r, d);
+ mutex_unlock(&rdtgroup_mutex);
+
+ return err;
+}
+
int resctrl_online_cpu(unsigned int cpu)
{
- lockdep_assert_held(&rdtgroup_mutex);
-
+ mutex_lock(&rdtgroup_mutex);
/* The cpu is set in default rdtgroup after online. */
cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
+ mutex_unlock(&rdtgroup_mutex);

return 0;
}
@@ -3753,8 +3802,7 @@ void resctrl_offline_cpu(unsigned int cpu)
struct rdtgroup *rdtgrp;
struct rdt_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;

- lockdep_assert_held(&rdtgroup_mutex);
-
+ mutex_lock(&rdtgroup_mutex);
list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
clear_childcpus(rdtgrp, cpu);
@@ -3774,6 +3822,7 @@ void resctrl_offline_cpu(unsigned int cpu)
cqm_setup_limbo_handler(d, 0, cpu);
}
}
+ mutex_unlock(&rdtgroup_mutex);
}

/*
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index c4be3453b3ff..fe94ef3369fa 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -159,7 +159,7 @@ struct resctrl_schema;
* @cache_level: Which cache level defines scope of this resource
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
- * @domains: All domains for this resource
+ * @domains: RCU list of all domains for this resource
* @name: Name to use in "schemata" file.
* @data_width: Character width of data when displaying
* @default_ctrl: Specifies default cache cbm or memory B/W percent.
--
2.39.2


2023-05-25 18:17:45

by James Morse

Subject: [PATCH v4 09/24] x86/resctrl: Move CLOSID/RMID matching and setting to use helpers

When switching tasks, the CLOSID and RMID that the new task should
use are stored in struct task_struct. For x86 the CLOSID known by resctrl,
the value in task_struct, and the value written to the CPU register are
all the same thing.

MPAM's CPU interface has two different PARTIDs - one for data accesses,
the other for instruction fetch. Storing resctrl's CLOSID value in
struct task_struct implies the arch code knows whether resctrl is using
CDP.

Move the matching and setting of the struct task_struct properties
to use helpers. This allows arm64 to store the hardware format of
the register, instead of having to convert it each time.

__rdtgroup_move_task()'s use of READ_ONCE()/WRITE_ONCE() ensures torn
values aren't seen, as another CPU may schedule the task being moved
while the value is being changed. MPAM has an additional corner-case
here as the PMG bits extend the PARTID space. If the scheduler sees a
new-CLOSID but old-RMID, the task will dirty an RMID that the limbo code
is not watching, causing an inaccurate count. x86's RMIDs are independent
values, so the limbo code will still be watching the old-RMID in this
circumstance.
To avoid this, arm64 needs both the CLOSID and RMID WRITE_ONCE()d
together; both values must be provided together.

Because MPAM's RMID values are not unique, the CLOSID must be provided
when matching the RMID.
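
A sketch of why the extra argument matters on arm64 (a hypothetical MPAM
implementation; the x86 version in this patch ignores the CLOSID):

  static inline bool resctrl_arch_match_rmid(struct task_struct *tsk,
                                             u32 closid, u32 rmid)
  {
          /* A PMG value is only meaningful within its PARTID */
          return READ_ONCE(tsk->closid) == closid &&
                 READ_ONCE(tsk->rmid) == rmid;
  }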

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
Changes since v2:
* __rdtgroup_move_task() changed to set CLOSID from different CLOSID place
depending on group type
---
arch/x86/include/asm/resctrl.h | 18 ++++++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 62 ++++++++++++++++----------
2 files changed, 56 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index dd9b638d43c8..78376c19ee6f 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -95,6 +95,24 @@ static inline unsigned int resctrl_arch_round_mon_val(unsigned int val)
return val * scale;
}

+static inline void resctrl_arch_set_closid_rmid(struct task_struct *tsk,
+ u32 closid, u32 rmid)
+{
+ WRITE_ONCE(tsk->closid, closid);
+ WRITE_ONCE(tsk->rmid, rmid);
+}
+
+static inline bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid)
+{
+ return READ_ONCE(tsk->closid) == closid;
+}
+
+static inline bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 ignored,
+ u32 rmid)
+{
+ return READ_ONCE(tsk->rmid) == rmid;
+}
+
static inline void resctrl_sched_in(struct task_struct *tsk)
{
if (static_branch_likely(&rdt_enable_key))
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 1120c7700126..f2bb3e09ed13 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -97,7 +97,7 @@ void rdt_staged_configs_clear(void)
*
* Using a global CLOSID across all resources has some advantages and
* some drawbacks:
- * + We can simply set "current->closid" to assign a task to a resource
+ * + We can simply set current's closid to assign a task to a resource
* group.
* + Context switch code can avoid extra memory references deciding which
* CLOSID to load into the PQR_ASSOC MSR
@@ -563,14 +563,26 @@ static void update_task_closid_rmid(struct task_struct *t)
_update_task_closid_rmid(t);
}

+static bool task_in_rdtgroup(struct task_struct *tsk, struct rdtgroup *rdtgrp)
+{
+ u32 closid, rmid = rdtgrp->mon.rmid;
+
+ if (rdtgrp->type == RDTCTRL_GROUP)
+ closid = rdtgrp->closid;
+ else if (rdtgrp->type == RDTMON_GROUP)
+ closid = rdtgrp->mon.parent->closid;
+ else
+ return false;
+
+ return resctrl_arch_match_closid(tsk, closid) &&
+ resctrl_arch_match_rmid(tsk, closid, rmid);
+}
+
static int __rdtgroup_move_task(struct task_struct *tsk,
struct rdtgroup *rdtgrp)
{
/* If the task is already in rdtgrp, no need to move the task. */
- if ((rdtgrp->type == RDTCTRL_GROUP && tsk->closid == rdtgrp->closid &&
- tsk->rmid == rdtgrp->mon.rmid) ||
- (rdtgrp->type == RDTMON_GROUP && tsk->rmid == rdtgrp->mon.rmid &&
- tsk->closid == rdtgrp->mon.parent->closid))
+ if (task_in_rdtgroup(tsk, rdtgrp))
return 0;

/*
@@ -581,19 +593,19 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
* For monitor groups, can move the tasks only from
* their parent CTRL group.
*/
-
- if (rdtgrp->type == RDTCTRL_GROUP) {
- WRITE_ONCE(tsk->closid, rdtgrp->closid);
- WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
- } else if (rdtgrp->type == RDTMON_GROUP) {
- if (rdtgrp->mon.parent->closid == tsk->closid) {
- WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
- } else {
- rdt_last_cmd_puts("Can't move task to different control group\n");
- return -EINVAL;
- }
+ if (rdtgrp->type == RDTMON_GROUP &&
+ !resctrl_arch_match_closid(tsk, rdtgrp->mon.parent->closid)) {
+ rdt_last_cmd_puts("Can't move task to different control group\n");
+ return -EINVAL;
}

+ if (rdtgrp->type == RDTMON_GROUP)
+ resctrl_arch_set_closid_rmid(tsk, rdtgrp->mon.parent->closid,
+ rdtgrp->mon.rmid);
+ else
+ resctrl_arch_set_closid_rmid(tsk, rdtgrp->closid,
+ rdtgrp->mon.rmid);
+
/*
* Ensure the task's closid and rmid are written before determining if
* the task is current that will decide if it will be interrupted.
@@ -615,14 +627,15 @@ static int __rdtgroup_move_task(struct task_struct *tsk,

static bool is_closid_match(struct task_struct *t, struct rdtgroup *r)
{
- return (rdt_alloc_capable &&
- (r->type == RDTCTRL_GROUP) && (t->closid == r->closid));
+ return (rdt_alloc_capable && (r->type == RDTCTRL_GROUP) &&
+ resctrl_arch_match_closid(t, r->closid));
}

static bool is_rmid_match(struct task_struct *t, struct rdtgroup *r)
{
- return (rdt_mon_capable &&
- (r->type == RDTMON_GROUP) && (t->rmid == r->mon.rmid));
+ return (rdt_mon_capable && (r->type == RDTMON_GROUP) &&
+ resctrl_arch_match_rmid(t, r->mon.parent->closid,
+ r->mon.rmid));
}

/**
@@ -818,7 +831,7 @@ int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns,
rdtg->mode != RDT_MODE_EXCLUSIVE)
continue;

- if (rdtg->closid != tsk->closid)
+ if (!resctrl_arch_match_closid(tsk, rdtg->closid))
continue;

seq_printf(s, "res:%s%s\n", (rdtg == &rdtgroup_default) ? "/" : "",
@@ -826,7 +839,8 @@ int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns,
seq_puts(s, "mon:");
list_for_each_entry(crg, &rdtg->mon.crdtgrp_list,
mon.crdtgrp_list) {
- if (tsk->rmid != crg->mon.rmid)
+ if (!resctrl_arch_match_rmid(tsk, crg->mon.parent->closid,
+ crg->mon.rmid))
continue;
seq_printf(s, "%s", crg->kn->name);
break;
@@ -2678,8 +2692,8 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
for_each_process_thread(p, t) {
if (!from || is_closid_match(t, from) ||
is_rmid_match(t, from)) {
- WRITE_ONCE(t->closid, to->closid);
- WRITE_ONCE(t->rmid, to->mon.rmid);
+ resctrl_arch_set_closid_rmid(t, to->closid,
+ to->mon.rmid);

/*
* Order the closid/rmid stores above before the loads
--
2.39.2


2023-05-25 18:18:06

by James Morse

Subject: [PATCH v4 21/24] x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but cpu

When a CPU is taken offline, resctrl may need to move the overflow or
limbo handlers to run on a different CPU.

Once the offline callbacks have been split, cqm_setup_limbo_handler()
will be called while the CPU that is going offline is still present
in the cpu_mask.

Pass the CPU to exclude to cqm_setup_limbo_handler() and
mbm_setup_overflow_handler(). These functions can use a variant of
cpumask_any_but() when selecting the CPU. -1 is used to indicate no CPUs
need excluding.

A subsequent patch moves these calls to be before CPUs have been removed,
so this exclude_cpu behaviour is temporary.
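
For example, once the calls have moved, the offline path can pass the
outgoing CPU directly - patch 24 ends up with calls of the form:

  cqm_setup_limbo_handler(d, 0, cpu);   /* 'cpu' is going offline */

while callers with no CPU to exclude pass RESCTRL_PICK_ANY_CPU.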

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
Changes since v2:
* Rephrased a comment to avoid a two letter bad-word. (we)
* Avoid assigning mbm_work_cpu if the domain is going to be free()d
* Added cpumask_any_housekeeping_but(), I dislike the name

Changes since v3:
* Marked an explanatory comment as temporary as the subsequent patch is
no longer adjacent.
---
arch/x86/kernel/cpu/resctrl/core.c | 8 +++--
arch/x86/kernel/cpu/resctrl/internal.h | 37 +++++++++++++++++++++--
arch/x86/kernel/cpu/resctrl/monitor.c | 42 +++++++++++++++++++++-----
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 6 ++--
include/linux/resctrl.h | 3 ++
5 files changed, 82 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index e00f3542e60e..187ed127a446 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -582,12 +582,16 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
cancel_delayed_work(&d->mbm_over);
- mbm_setup_overflow_handler(d, 0);
+ /*
+ * temporary: exclude_cpu=-1 as this CPU has already
+			 * been removed by cpumask_clear_cpu().
+ */
+ mbm_setup_overflow_handler(d, 0, RESCTRL_PICK_ANY_CPU);
}
if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
has_busy_rmid(r, d)) {
cancel_delayed_work(&d->cqm_limbo);
- cqm_setup_limbo_handler(d, 0);
+ cqm_setup_limbo_handler(d, 0, RESCTRL_PICK_ANY_CPU);
}
}
}
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 021a8956518c..9cba8fc405b9 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -79,6 +79,37 @@ static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
return cpu;
}

+/**
+ * cpumask_any_housekeeping_but() - Choose any CPU in @mask, preferring those
+ * that aren't marked nohz_full, excluding
+ * the provided CPU
+ * @mask: The mask to pick a CPU from.
+ * @exclude_cpu: The CPU to avoid picking.
+ *
+ * Returns a CPU from @mask, but not @but. If there are housekeeping CPUs that
+ * don't use nohz_full, these are preferred.
+ * Returns >= nr_cpu_ids if no CPUs are available.
+ */
+static inline unsigned int
+cpumask_any_housekeeping_but(const struct cpumask *mask, int exclude_cpu)
+{
+ int cpu, hk_cpu;
+
+ cpu = cpumask_any_but(mask, exclude_cpu);
+ if (tick_nohz_full_cpu(cpu)) {
+ hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
+ if (hk_cpu == exclude_cpu) {
+ hk_cpu = cpumask_nth_andnot(1, mask,
+ tick_nohz_full_mask);
+ }
+
+ if (hk_cpu < nr_cpu_ids)
+ cpu = hk_cpu;
+ }
+
+ return cpu;
+}
+
struct rdt_fs_context {
struct kernfs_fs_context kfc;
bool enable_cdpl2;
@@ -564,11 +595,13 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_domain *d, struct rdtgroup *rdtgrp,
int evtid, int first);
void mbm_setup_overflow_handler(struct rdt_domain *dom,
- unsigned long delay_ms);
+ unsigned long delay_ms,
+ int exclude_cpu);
void mbm_handle_overflow(struct work_struct *work);
void __init intel_rdt_mbm_apply_quirk(void);
bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
+void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
+ int exclude_cpu);
void cqm_handle_limbo(struct work_struct *work);
bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
void __check_limbo(struct rdt_domain *d, bool force_free);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ced933694f60..ae02185f3354 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -485,7 +485,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
* setup up the limbo worker.
*/
if (!has_busy_rmid(r, d))
- cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL);
+ cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL, -1);
set_bit(idx, d->rmid_busy_llc);
entry->busy++;
}
@@ -810,15 +810,28 @@ void cqm_handle_limbo(struct work_struct *work)
mutex_unlock(&rdtgroup_mutex);
}

-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
+/**
+ * cqm_setup_limbo_handler() - Schedule the limbo handler to run for this
+ * domain.
+ * @delay_ms: How far in the future the handler should run.
+ * @exclude_cpu: Which CPU the handler should not run on, -1 to pick any CPU.
+ */
+void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
+ int exclude_cpu)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;

- cpu = cpumask_any_housekeeping(&dom->cpu_mask);
- dom->cqm_work_cpu = cpu;
+ if (exclude_cpu == RESCTRL_PICK_ANY_CPU)
+ cpu = cpumask_any_housekeeping(&dom->cpu_mask);
+ else
+ cpu = cpumask_any_housekeeping_but(&dom->cpu_mask,
+ exclude_cpu);

- schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
+ if (cpu < nr_cpu_ids) {
+ dom->cqm_work_cpu = cpu;
+ schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
+ }
}

void mbm_handle_overflow(struct work_struct *work)
@@ -864,7 +877,14 @@ void mbm_handle_overflow(struct work_struct *work)
mutex_unlock(&rdtgroup_mutex);
}

-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
+/**
+ * mbm_setup_overflow_handler() - Schedule the overflow handler to run for this
+ * domain.
+ * @delay_ms: How far in the future the handler should run.
+ * @exclude_cpu: Which CPU the handler should not run on, -1 to pick any CPU.
+ */
+void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms,
+ int exclude_cpu)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;
@@ -875,9 +895,15 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
*/
if (!resctrl_mounted || !resctrl_arch_mon_capable())
return;
- cpu = cpumask_any_housekeeping(&dom->cpu_mask);
+ if (exclude_cpu == -1)
+ cpu = cpumask_any_housekeeping(&dom->cpu_mask);
+ else
+ cpu = cpumask_any_housekeeping_but(&dom->cpu_mask,
+ exclude_cpu);
dom->mbm_work_cpu = cpu;
- schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
+
+ if (cpu < nr_cpu_ids)
+ schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
}

static int dom_data_init(struct rdt_resource *r)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 7c3de5ea0482..3373b11afe01 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2539,7 +2539,8 @@ static int rdt_get_tree(struct fs_context *fc)
if (is_mbm_enabled()) {
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
list_for_each_entry(dom, &r->domains, list)
- mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
+ mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL,
+ RESCTRL_PICK_ANY_CPU);
}

goto out;
@@ -3709,7 +3710,8 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)

if (is_mbm_enabled()) {
INIT_DELAYED_WORK(&d->mbm_over, mbm_handle_overflow);
- mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL);
+ mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL,
+ RESCTRL_PICK_ANY_CPU);
}

if (is_llc_occupancy_enabled())
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index ecd41762d61a..089b91133e5e 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -9,6 +9,9 @@
/* CLOSID value used by the default control group */
#define RESCTRL_RESERVED_CLOSID 0

+/* Indicates no CPU needs to be excluded */
+#define RESCTRL_PICK_ANY_CPU -1
+
#ifdef CONFIG_PROC_CPU_RESCTRL

int proc_resctrl_show(struct seq_file *m,
--
2.39.2


2023-06-08 08:12:18

by Shaopeng Tan (Fujitsu)

[permalink] [raw]
Subject: RE: [PATCH v4 02/24] x86/resctrl: Access per-rmid structures by index

Hello James,

> Because of the differences between Intel RDT/AMD QoS and Arm's MPAM
> monitors, RMID values on arm64 are not unique unless the CLOSID is also
> included. Bitmaps like rmid_busy_llc need to be sized by the number of unique
> entries for this resource.
>
> Add helpers to encode/decode the CLOSID and RMID to an index. The
> domain's rmid_busy_llc and rmid_ptrs[] are then sized by index, as are the
> domain mbm_local and mbm_total arrays.
> On x86, the index is always just the RMID, so all these structures remain the
> same size.
>
> The index gives resctrl a unique value it can use to store monitor values, and
> allows MPAM to decode the CLOSID when reading the hardware counters.
>
> Tested-by: Shaopeng Tan <[email protected]>
> Signed-off-by: James Morse <[email protected]>
> ---
> Changes since v1:
> * Added X86_BAD_CLOSID macro to make it clear what this value means
> * Added second WARN_ON() for closid checking, and made both _ONCE()
>
> Changes since v2:
> * Added RESCTRL_RESERVED_CLOSID
> * Removed a newline
> * Repharsed some comments
> * Renamed a variable 'ignore'd
> * Moved X86_RESCTRL_BAD_CLOSID to a previous patch
>
> Changes since v3:
> * Changed a variable name
> * Fixed various typos
> ---
> arch/x86/include/asm/resctrl.h | 17 ++++++
> arch/x86/kernel/cpu/resctrl/core.c | 2 +-
> arch/x86/kernel/cpu/resctrl/internal.h | 1 +
> arch/x86/kernel/cpu/resctrl/monitor.c | 84
> +++++++++++++++++---------
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 7 ++-
> include/linux/resctrl.h | 3 +
> 6 files changed, 83 insertions(+), 31 deletions(-)
>
> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> index e906070285fb..dd9b638d43c8 100644
> --- a/arch/x86/include/asm/resctrl.h
> +++ b/arch/x86/include/asm/resctrl.h
> @@ -101,6 +101,23 @@ static inline void resctrl_sched_in(struct task_struct *tsk)
> __resctrl_sched_in(tsk);
> }
>
> +static inline u32 resctrl_arch_system_num_rmid_idx(void)
> +{
> +	/* RMID are independent numbers for x86. num_rmid_idx == num_rmid */
> +	return boot_cpu_data.x86_cache_max_rmid + 1;
> +}
> +
> +static inline void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid,
> +						 u32 *rmid)
> +{
> +	*rmid = idx;
> +	*closid = X86_RESCTRL_BAD_CLOSID;
> +}
> +
> +static inline u32 resctrl_arch_rmid_idx_encode(u32 ignored, u32 rmid)
> +{
> + return rmid;
> +}
> +
> void resctrl_cpu_detect(struct cpuinfo_x86 *c);
>
> #else
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 030d3b409768..4bea032d072e 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -600,7 +600,7 @@ static void clear_closid_rmid(int cpu)
> state->default_rmid = 0;
> state->cur_closid = 0;
> state->cur_rmid = 0;
> - wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
> + wrmsr(MSR_IA32_PQR_ASSOC, 0, RESCTRL_RESERVED_CLOSID);
> }
>
>  static int resctrl_online_cpu(unsigned int cpu)
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index f2da908bb079..d571da4848a4 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -7,6 +7,7 @@
> #include <linux/kernfs.h>
> #include <linux/fs_context.h>
> #include <linux/jump_label.h>
> +#include <asm/resctrl.h>
>
> #define L3_QOS_CDP_ENABLE 0x01ULL
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 86574adedd64..bcc25f5339c0 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -142,12 +142,29 @@ static inline u64 get_corrected_mbm_count(u32 rmid, unsigned long val)
> return val;
> }
>
> -static inline struct rmid_entry *__rmid_entry(u32 closid, u32 rmid)
> +/*
> + * x86 and arm64 differ in their handling of monitoring.
> + * x86's RMID are an independent number, there is only one source of traffic
> + * with an RMID value of '1'.
> + * arm64's PMG extend the PARTID/CLOSID space, there are multiple sources of
> + * traffic with a PMG value of '1', one for each CLOSID, meaning the RMID
> + * value is no longer unique.
> + * To account for this, resctrl uses an index. On x86 this is just the RMID,
> + * on arm64 it encodes the CLOSID and RMID. This gives a unique number.
> + *
> + * The domain's rmid_busy_llc and rmid_ptrs are sized by index. The arch code
> + * must accept an attempt to read every index.
> + */
> +static inline struct rmid_entry *__rmid_entry(u32 idx)
> {
> struct rmid_entry *entry;
> + u32 closid, rmid;
>
> - entry = &rmid_ptrs[rmid];
> - WARN_ON(entry->rmid != rmid);
> + entry = &rmid_ptrs[idx];
> + resctrl_arch_rmid_idx_decode(idx, &closid, &rmid);
> +
> + WARN_ON_ONCE(entry->closid != closid);
> + WARN_ON_ONCE(entry->rmid != rmid);
>
> return entry;
> }
> @@ -277,8 +294,9 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>  void __check_limbo(struct rdt_domain *d, bool force_free)
>  {
>  	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + u32 idx_limit = resctrl_arch_system_num_rmid_idx();
> struct rmid_entry *entry;
> - u32 crmid = 1, nrmid;
> + u32 idx, cur_idx = 1;
> bool rmid_dirty;
> u64 val = 0;
>
> @@ -289,12 +307,11 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
>  	 * RMID and move it to the free list when the counter reaches 0.
>  	 */
>  	for (;;) {
> -		nrmid = find_next_bit(d->rmid_busy_llc, r->num_rmid, crmid);
> -		if (nrmid >= r->num_rmid)
> +		idx = find_next_bit(d->rmid_busy_llc, idx_limit, cur_idx);
> +		if (idx >= idx_limit)
>  			break;
> 
> -		entry = __rmid_entry(X86_RESCTRL_BAD_CLOSID, nrmid);// temporary
> -
> + entry = __rmid_entry(idx);
>  		if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
>  					   QOS_L3_OCCUP_EVENT_ID, &val)) {
> rmid_dirty = true;
> @@ -303,19 +320,21 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
> }
>
> if (force_free || !rmid_dirty) {
> - clear_bit(entry->rmid, d->rmid_busy_llc);
> + clear_bit(idx, d->rmid_busy_llc);
> if (!--entry->busy) {
> rmid_limbo_count--;
> list_add_tail(&entry->list, &rmid_free_lru);
> }
> }
> - crmid = nrmid + 1;
> + cur_idx = idx + 1;
> }
> }
>
>  bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
>  {
> -	return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
> + u32 idx_limit = resctrl_arch_system_num_rmid_idx();
> +
> + return find_first_bit(d->rmid_busy_llc, idx_limit) != idx_limit;
> }

"struct rdt_resource *r" is not used in this function.


Best regards,
Shaopeng TAN


2023-06-09 11:20:48

by Shaopeng Tan (Fujitsu)

[permalink] [raw]
Subject: RE: [PATCH v4 21/24] x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but cpu

Hi James,

> When a CPU is taken offline resctrl may need to move the overflow or limbo
> handlers to run on a different CPU.
>
> Once the offline callbacks have been split, cqm_setup_limbo_handler() will be
> called while the CPU that is going offline is still present in the cpu_mask.
>
> Pass the CPU to exclude to cqm_setup_limbo_handler() and
> mbm_setup_overflow_handler(). These functions can use a variant of
> cpumask_any_but() when selecting the CPU. -1 is used to indicate no CPUs
> need excluding.
>
> A subsequent patch moves these calls to be before CPUs have been removed,
> so this exclude_cpus behaviour is temporary.
>
> Tested-by: Shaopeng Tan <[email protected]>
> Signed-off-by: James Morse <[email protected]>
> ---
> Changes since v2:
> * Rephrased a comment to avoid a two letter bad-word. (we)
> * Avoid assigning mbm_work_cpu if the domain is going to be free()d
> * Added cpumask_any_housekeeping_but(), I dislike the name
>
> Changes since v3:
> * Marked an explanatory comment as temporary as the subsequent patch is
> no longer adjacent.
> ---
> arch/x86/kernel/cpu/resctrl/core.c | 8 +++--
> arch/x86/kernel/cpu/resctrl/internal.h | 37
> +++++++++++++++++++++-- arch/x86/kernel/cpu/resctrl/monitor.c
> | 42 +++++++++++++++++++++-----
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 6 ++--
> include/linux/resctrl.h | 3 ++
> 5 files changed, 82 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index e00f3542e60e..187ed127a446 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -582,12 +582,16 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
>  		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
>  			cancel_delayed_work(&d->mbm_over);
> -			mbm_setup_overflow_handler(d, 0);
> +			/*
> +			 * temporary: exclude_cpu=-1 as this CPU has already
> +			 * been removed by cpumask_clear_cpu()d
> +			 */
> +			mbm_setup_overflow_handler(d, 0, RESCTRL_PICK_ANY_CPU);
>  		}
>  		if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
>  		    has_busy_rmid(r, d)) {
>  			cancel_delayed_work(&d->cqm_limbo);
> -			cqm_setup_limbo_handler(d, 0);
> +			cqm_setup_limbo_handler(d, 0, RESCTRL_PICK_ANY_CPU);
> }
> }
> }
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 021a8956518c..9cba8fc405b9 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -79,6 +79,37 @@ static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
> return cpu;
> }
>
> +/**
> + * cpumask_any_housekeeping_but() - Chose any cpu in @mask, preferring those
> + *                                  that aren't marked nohz_full, excluding
> + *                                  the provided CPU
> + * @mask:	The mask to pick a CPU from.
> + * @exclude_cpu:The CPU to avoid picking.
> + *
> + * Returns a CPU from @mask, but not @but. If there are housekeeping CPUs that
> + * don't use nohz_full, these are preferred.
> + * Returns >= nr_cpu_ids if no CPUs are available.
> + */
> +static inline unsigned int
> +cpumask_any_housekeeping_but(const struct cpumask *mask, int exclude_cpu)
> +{
> +	int cpu, hk_cpu;
> +
> +	cpu = cpumask_any_but(mask, exclude_cpu);
> +	if (tick_nohz_full_cpu(cpu)) {
> +		hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
> +		if (hk_cpu == exclude_cpu) {
> +			hk_cpu = cpumask_nth_andnot(1, mask,
> +						    tick_nohz_full_mask);
> +		}
> +
> +		if (hk_cpu < nr_cpu_ids)
> +			cpu = hk_cpu;
> +	}
> +
> +	return cpu;
> +}
> +
> struct rdt_fs_context {
> struct kernfs_fs_context kfc;
> bool enable_cdpl2;
> @@ -564,11 +595,13 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> struct rdt_domain *d, struct rdtgroup *rdtgrp,
> int evtid, int first);
> void mbm_setup_overflow_handler(struct rdt_domain *dom,
> - unsigned long delay_ms);
> + unsigned long delay_ms,
> + int exclude_cpu);
>  void mbm_handle_overflow(struct work_struct *work);
>  void __init intel_rdt_mbm_apply_quirk(void);
>  bool is_mba_sc(struct rdt_resource *r);
> -void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
> +void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
> +			     int exclude_cpu);
>  void cqm_handle_limbo(struct work_struct *work);
>  bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
>  void __check_limbo(struct rdt_domain *d, bool force_free);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index ced933694f60..ae02185f3354 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -485,7 +485,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
> * setup up the limbo worker.
> */
> if (!has_busy_rmid(r, d))
> -		cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL);
> +		cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL, -1);
> set_bit(idx, d->rmid_busy_llc);
> entry->busy++;
> }
> @@ -810,15 +810,28 @@ void cqm_handle_limbo(struct work_struct *work)
> mutex_unlock(&rdtgroup_mutex);
> }
>
> -void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
> +/**
> + * cqm_setup_limbo_handler() - Schedule the limbo handler to run for this
> + *                             domain.
> + * @delay_ms:  How far in the future the handler should run.
> + * @exclude_cpu: Which CPU the handler should not run on, -1 to pick any CPU.
> + */
> +void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
> +			     int exclude_cpu)
> {
> unsigned long delay = msecs_to_jiffies(delay_ms);
> int cpu;
>
> - cpu = cpumask_any_housekeeping(&dom->cpu_mask);
> - dom->cqm_work_cpu = cpu;
> + if (exclude_cpu == RESCTRL_PICK_ANY_CPU)
> + cpu = cpumask_any_housekeeping(&dom->cpu_mask);
> + else
> + cpu = cpumask_any_housekeeping_but(&dom->cpu_mask,
> + exclude_cpu);
>
> - schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
> + if (cpu < nr_cpu_ids) {
> + dom->cqm_work_cpu = cpu;
> + schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
> + }
> }
>
>  void mbm_handle_overflow(struct work_struct *work)
> @@ -864,7 +877,14 @@ void mbm_handle_overflow(struct work_struct *work)
> mutex_unlock(&rdtgroup_mutex);
> }
>
> -void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
> +/**
> + * mbm_setup_overflow_handler() - Schedule the overflow handler to run for this
> + *                                domain.
> + * @delay_ms:  How far in the future the handler should run.
> + * @exclude_cpu: Which CPU the handler should not run on, -1 to pick any CPU.
> + */
> +void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms,
> +				int exclude_cpu)
> {
> unsigned long delay = msecs_to_jiffies(delay_ms);
> int cpu;
> @@ -875,9 +895,15 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
> */
> if (!resctrl_mounted || !resctrl_arch_mon_capable())
> return;
> - cpu = cpumask_any_housekeeping(&dom->cpu_mask);
> + if (exclude_cpu == -1)
> + cpu = cpumask_any_housekeeping(&dom->cpu_mask);

Should RESCTRL_PICK_ANY_CPU be used instead of -1?

Best regards,
Shaopeng TAN

2023-06-13 06:48:09

by Shaopeng Tan (Fujitsu)

[permalink] [raw]
Subject: RE: [PATCH v4 00/24] x86/resctrl: monitored closid+rmid together, separate arch/fs locking

Hello,

I ran the resctrl selftests on an Intel(R) Xeon(R) Gold 6254 CPU with
nohz_full enabled and disabled, and it seems there is no problem.

Tested-by: Shaopeng Tan <[email protected]>

2023-06-13 09:31:43

by Peter Newman

[permalink] [raw]
Subject: Re: [PATCH v4 24/24] x86/resctrl: Separate arch and fs resctrl locks

Hi James,

On Thu, May 25, 2023 at 8:03 PM James Morse <[email protected]> wrote:
>
> resctrl has one mutex that is taken by the architecture specific code,
> and the filesystem parts. The two interact via cpuhp, where the
> architecture code updates the domain list. Filesystem handlers that
> walk the domains list should not run concurrently with the cpuhp
> callback modifying the list.
>
> Exposing a lock from the filesystem code means the interface is not
> cleanly defined, and creates the possibility of cross-architecture
> lock ordering headaches. The interaction only exists so that certain
> filesystem paths are serialised against cpu hotplug. The cpu hotplug
> code already has a mechanism to do this using cpus_read_lock().
>
> MPAM's monitors have an overflow interrupt, so it needs to be possible
> to walk the domains list in irq context. RCU is ideal for this,
> but some paths need to be able to sleep to allocate memory.

I noticed there's also a call to get_domain_from_cpu() in
rdt_ctrl_update(), which is invoked by IPI when updating a schemata
file, but at least it's a blocking IPI and the caller holds the
rdtgroup_mutex, so the handler is serialized with CPU hotplug.

Taking a step back, I'm concerned about the scalability of searching
these linked-lists of domains in atomic contexts. We already have
machines where the list is 32 entries deep in L3, and much larger in
L2 CAT.

Will the overflow interrupt target a CPU in the correct domain? The
existing domain list search in rdt_ctrl_update() is for the current
CPU's domain, so an alternative there would be to store each CPU's
interrupt-relevant domain pointers in percpu data for quick lookup.
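
Something like this sketch of the idea (the variable and helper names are
invented for illustration):

	/* Updated by the arch code's cpuhp callbacks: */
	static DEFINE_PER_CPU(struct rdt_domain *, l3_domain_ptr);

	static struct rdt_domain *resctrl_this_cpu_l3_domain(void)
	{
		/* No list walk, cheap enough for atomic/IRQ context */
		return this_cpu_read(l3_domain_ptr);
	}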

Also, how quickly does the overflow condition need to be serviced? On
Intel, overflow handling deadlines aren't even tight enough to warrant
an interrupt handler.

Thanks!
-Peter





>
> Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
> of a cpuhp callback, cpus_read_lock() must always be taken first.
> rdtgroup_schemata_write() already does this.
>
> Most of the filesystem code's domain list walkers are currently
> protected by the rdtgroup_mutex taken in rdtgroup_kn_lock_live().
> The exceptions are rdt_bit_usage_show() and the mon_config helpers
> which take the lock directly.
>
> Make the domain list protected by RCU. An architecture-specific
> lock prevents concurrent writers. rdt_bit_usage_show() can
> walk the domain list under rcu_read_lock(). The mon_config helpers
> send multiple IPIs, take the cpus_read_lock() in these cases.
>
> The other filesystem list walkers need to be able to sleep.
> Add cpus_read_lock() to rdtgroup_kn_lock_live() so that the
> cpuhp callbacks can't be invoked when file system operations are
> occurring.
>
> Add lockdep_assert_cpus_held() in the cases where the
> rdtgroup_kn_lock_live() call isn't obvious.
>
> Resctrl's domain online/offline calls now need to take the
> rdtgroup_mutex themselves.
>
> Tested-by: Shaopeng Tan <[email protected]>
> Signed-off-by: James Morse <[email protected]>
> ---
> Changes since v2:
> * Reworded a comment,
> * Added a lockdep assertion
> * Moved clear_closid_rmid() outside the locked region of cpu
> online/offline
>
> Changes since v3:
> * Added a header include
> ---
> arch/x86/kernel/cpu/resctrl/core.c | 38 +++++++++-----
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 16 ++++--
> arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 3 ++
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 63 ++++++++++++++++++++---
> include/linux/resctrl.h | 2 +-
> 6 files changed, 100 insertions(+), 26 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index edc0dd123317..f106c68a9be8 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -25,8 +25,15 @@
> #include <asm/resctrl.h>
> #include "internal.h"
>
> -/* Mutex to protect rdtgroup access. */
> -DEFINE_MUTEX(rdtgroup_mutex);
> +/*
> + * rdt_domain structures are kfree()d when their last CPU goes offline,
> + * and allocated when the first CPU in a new domain comes online.
> + * The rdt_resource's domain list is updated when this happens. Readers of
> + * the domain list must either take cpus_read_lock(), or rely on an RCU
> + * read-side critical section, to avoid observing concurrent modification.
> + * All writers take this mutex:
> + */
> +static DEFINE_MUTEX(domain_list_lock);
>
> /*
> * The cached resctrl_pqr_state is strictly per CPU and can never be
> @@ -508,6 +515,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
> struct rdt_domain *d;
> int err;
>
> + lockdep_assert_held(&domain_list_lock);
> +
> d = rdt_find_domain(r, id, &add_pos);
> if (IS_ERR(d)) {
> pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> @@ -541,11 +550,12 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
> return;
> }
>
> - list_add_tail(&d->list, add_pos);
> + list_add_tail_rcu(&d->list, add_pos);
>
> err = resctrl_online_domain(r, d);
> if (err) {
> - list_del(&d->list);
> + list_del_rcu(&d->list);
> + synchronize_rcu();
> domain_free(hw_dom);
> }
> }
> @@ -556,6 +566,8 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> struct rdt_hw_domain *hw_dom;
> struct rdt_domain *d;
>
> + lockdep_assert_held(&domain_list_lock);
> +
> d = rdt_find_domain(r, id, NULL);
> if (IS_ERR_OR_NULL(d)) {
> pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> @@ -566,7 +578,8 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> cpumask_clear_cpu(cpu, &d->cpu_mask);
> if (cpumask_empty(&d->cpu_mask)) {
> resctrl_offline_domain(r, d);
> - list_del(&d->list);
> + list_del_rcu(&d->list);
> + synchronize_rcu();
>
> /*
> * rdt_domain "d" is going to be freed below, so clear
> @@ -594,30 +607,29 @@ static void clear_closid_rmid(int cpu)
> static int resctrl_arch_online_cpu(unsigned int cpu)
> {
> struct rdt_resource *r;
> - int ret;
>
> - mutex_lock(&rdtgroup_mutex);
> + mutex_lock(&domain_list_lock);
> for_each_capable_rdt_resource(r)
> domain_add_cpu(cpu, r);
> + mutex_unlock(&domain_list_lock);
> +
> clear_closid_rmid(cpu);
>
> - ret = resctrl_online_cpu(cpu);
> - mutex_unlock(&rdtgroup_mutex);
> -
> - return ret;
> + return resctrl_online_cpu(cpu);
> }
>
> static int resctrl_arch_offline_cpu(unsigned int cpu)
> {
> struct rdt_resource *r;
>
> - mutex_lock(&rdtgroup_mutex);
> resctrl_offline_cpu(cpu);
>
> + mutex_lock(&domain_list_lock);
> for_each_capable_rdt_resource(r)
> domain_remove_cpu(cpu, r);
> + mutex_unlock(&domain_list_lock);
> +
> clear_closid_rmid(cpu);
> - mutex_unlock(&rdtgroup_mutex);
>
> return 0;
> }
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 280d66fae21c..d8d7c127403b 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -209,6 +209,9 @@ static int parse_line(char *line, struct resctrl_schema *s,
> struct rdt_domain *d;
> unsigned long dom_id;
>
> + /* Walking r->domains, ensure it can't race with cpuhp */
> + lockdep_assert_cpus_held();
> +
> if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
> (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)) {
> rdt_last_cmd_puts("Cannot pseudo-lock MBA resource\n");
> @@ -313,6 +316,9 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
> struct rdt_domain *d;
> u32 idx;
>
> + /* Walking r->domains, ensure it can't race with cpuhp */
> + lockdep_assert_cpus_held();
> +
> if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
> return -ENOMEM;
>
> @@ -378,11 +384,9 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
> return -EINVAL;
> buf[nbytes - 1] = '\0';
>
> - cpus_read_lock();
> rdtgrp = rdtgroup_kn_lock_live(of->kn);
> if (!rdtgrp) {
> rdtgroup_kn_unlock(of->kn);
> - cpus_read_unlock();
> return -ENOENT;
> }
> rdt_last_cmd_clear();
> @@ -444,7 +448,6 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
> out:
> rdt_staged_configs_clear();
> rdtgroup_kn_unlock(of->kn);
> - cpus_read_unlock();
> return ret ?: nbytes;
> }
>
> @@ -464,6 +467,9 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
> bool sep = false;
> u32 ctrl_val;
>
> + /* Walking r->domains, ensure it can't race with cpuhp */
> + lockdep_assert_cpus_held();
> +
> seq_printf(s, "%*s:", max_name_width, schema->name);
> list_for_each_entry(dom, &r->domains, list) {
> if (sep)
> @@ -534,8 +540,8 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> {
> int cpu;
>
> - /* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */
> - lockdep_assert_held(&rdtgroup_mutex);
> + /* When picking a cpu from cpu_mask, ensure it can't race with cpuhp */
> + lockdep_assert_cpus_held();
>
> /*
> * setup the parameters to pass to mon_event_count() to read the data.
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index ae02185f3354..41b4cd2c7d64 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -15,6 +15,7 @@
> * Software Developer Manual June 2016, volume 3, section 17.17.
> */
>
> +#include <linux/cpu.h>
> #include <linux/module.h>
> #include <linux/sizes.h>
> #include <linux/slab.h>
> @@ -476,6 +477,9 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
>
> lockdep_assert_held(&rdtgroup_mutex);
>
> + /* Walking r->domains, ensure it can't race with cpuhp */
> + lockdep_assert_cpus_held();
> +
> idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
>
> entry->busy = 0;
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 460421051abf..fc3ed917d173 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -830,6 +830,9 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
> struct rdt_domain *d_i;
> bool ret = false;
>
> + /* Walking r->domains, ensure it can't race with cpuhp */
> + lockdep_assert_cpus_held();
> +
> if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
> return true;
>
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 3a8e2c98b611..9002ac728001 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -35,6 +35,10 @@
> DEFINE_STATIC_KEY_FALSE(rdt_enable_key);
> DEFINE_STATIC_KEY_FALSE(rdt_mon_enable_key);
> DEFINE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
> +
> +/* Mutex to protect rdtgroup access. */
> +DEFINE_MUTEX(rdtgroup_mutex);
> +
> static struct kernfs_root *rdt_root;
> struct rdtgroup rdtgroup_default;
> LIST_HEAD(rdt_all_groups);
> @@ -950,7 +954,8 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
>
> mutex_lock(&rdtgroup_mutex);
> hw_shareable = r->cache.shareable_bits;
> - list_for_each_entry(dom, &r->domains, list) {
> + rcu_read_lock();
> + list_for_each_entry_rcu(dom, &r->domains, list) {
> if (sep)
> seq_putc(seq, ';');
> sw_shareable = 0;
> @@ -1006,8 +1011,10 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
> }
> sep = true;
> }
> + rcu_read_unlock();
> seq_putc(seq, '\n');
> mutex_unlock(&rdtgroup_mutex);
> +
> return 0;
> }
>
> @@ -1250,6 +1257,9 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
> struct rdt_domain *d;
> u32 ctrl;
>
> + /* Walking r->domains, ensure it can't race with cpuhp */
> + lockdep_assert_cpus_held();
> +
> list_for_each_entry(s, &resctrl_schema_all, list) {
> r = s->res;
> if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
> @@ -1516,6 +1526,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
> struct rdt_domain *dom;
> bool sep = false;
>
> + cpus_read_lock();
> mutex_lock(&rdtgroup_mutex);
>
> list_for_each_entry(dom, &r->domains, list) {
> @@ -1532,6 +1543,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
> seq_puts(s, "\n");
>
> mutex_unlock(&rdtgroup_mutex);
> + cpus_read_unlock();
>
> return 0;
> }
> @@ -1623,6 +1635,9 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
> struct rdt_domain *d;
> int ret = 0;
>
> + /* Walking r->domains, ensure it can't race with cpuhp */
> + lockdep_assert_cpus_held();
> +
> next:
> if (!tok || tok[0] == '\0')
> return 0;
> @@ -1664,6 +1679,7 @@ static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of,
> if (nbytes == 0 || buf[nbytes - 1] != '\n')
> return -EINVAL;
>
> + cpus_read_lock();
> mutex_lock(&rdtgroup_mutex);
>
> rdt_last_cmd_clear();
> @@ -1673,6 +1689,7 @@ static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of,
> ret = mon_config_write(r, buf, QOS_L3_MBM_TOTAL_EVENT_ID);
>
> mutex_unlock(&rdtgroup_mutex);
> + cpus_read_unlock();
>
> return ret ?: nbytes;
> }
> @@ -1688,6 +1705,7 @@ static ssize_t mbm_local_bytes_config_write(struct kernfs_open_file *of,
> if (nbytes == 0 || buf[nbytes - 1] != '\n')
> return -EINVAL;
>
> + cpus_read_lock();
> mutex_lock(&rdtgroup_mutex);
>
> rdt_last_cmd_clear();
> @@ -1697,6 +1715,7 @@ static ssize_t mbm_local_bytes_config_write(struct kernfs_open_file *of,
> ret = mon_config_write(r, buf, QOS_L3_MBM_LOCAL_EVENT_ID);
>
> mutex_unlock(&rdtgroup_mutex);
> + cpus_read_unlock();
>
> return ret ?: nbytes;
> }
> @@ -2149,6 +2168,9 @@ static int set_cache_qos_cfg(int level, bool enable)
> struct rdt_domain *d;
> int cpu;
>
> + /* Walking r->domains, ensure it can't race with cpuhp */
> + lockdep_assert_cpus_held();
> +
> if (level == RDT_RESOURCE_L3)
> update = l3_qos_cfg_update;
> else if (level == RDT_RESOURCE_L2)
> @@ -2337,6 +2359,7 @@ struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn)
> atomic_inc(&rdtgrp->waitcount);
> kernfs_break_active_protection(kn);
>
> + cpus_read_lock();
> mutex_lock(&rdtgroup_mutex);
>
> /* Was this group deleted while we waited? */
> @@ -2354,6 +2377,7 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn)
> return;
>
> mutex_unlock(&rdtgroup_mutex);
> + cpus_read_unlock();
>
> if (atomic_dec_and_test(&rdtgrp->waitcount) &&
> (rdtgrp->flags & RDT_DELETED)) {
> @@ -2651,6 +2675,9 @@ static int reset_all_ctrls(struct rdt_resource *r)
> struct rdt_domain *d;
> int i;
>
> + /* Walking r->domains, ensure it can't race with cpuhp */
> + lockdep_assert_cpus_held();
> +
> if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
> return -ENOMEM;
>
> @@ -2935,6 +2962,9 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
> struct rdt_domain *dom;
> int ret;
>
> + /* Walking r->domains, ensure it can't race with cpuhp */
> + lockdep_assert_cpus_held();
> +
> list_for_each_entry(dom, &r->domains, list) {
> ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
> if (ret)
> @@ -3625,7 +3655,8 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
> kfree(d->mbm_local);
> }
>
> -void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
> +static void _resctrl_offline_domain(struct rdt_resource *r,
> + struct rdt_domain *d)
> {
> lockdep_assert_held(&rdtgroup_mutex);
>
> @@ -3660,6 +3691,13 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
> domain_destroy_mon_state(d);
> }
>
> +void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
> +{
> + mutex_lock(&rdtgroup_mutex);
> + _resctrl_offline_domain(r, d);
> + mutex_unlock(&rdtgroup_mutex);
> +}
> +
> static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
> {
> u32 idx_limit = resctrl_arch_system_num_rmid_idx();
> @@ -3691,7 +3729,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
> return 0;
> }
>
> -int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
> +static int _resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
> {
> int err;
>
> @@ -3727,12 +3765,23 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
> return 0;
> }
>
> +int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
> +{
> + int err;
> +
> + mutex_lock(&rdtgroup_mutex);
> + err = _resctrl_online_domain(r, d);
> + mutex_unlock(&rdtgroup_mutex);
> +
> + return err;
> +}
> +
> int resctrl_online_cpu(unsigned int cpu)
> {
> - lockdep_assert_held(&rdtgroup_mutex);
> -
> + mutex_lock(&rdtgroup_mutex);
> /* The cpu is set in default rdtgroup after online. */
> cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
> + mutex_unlock(&rdtgroup_mutex);
>
> return 0;
> }
> @@ -3753,8 +3802,7 @@ void resctrl_offline_cpu(unsigned int cpu)
> struct rdtgroup *rdtgrp;
> struct rdt_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>
> - lockdep_assert_held(&rdtgroup_mutex);
> -
> + mutex_lock(&rdtgroup_mutex);
> list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
> if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
> clear_childcpus(rdtgrp, cpu);
> @@ -3774,6 +3822,7 @@ void resctrl_offline_cpu(unsigned int cpu)
> cqm_setup_limbo_handler(d, 0, cpu);
> }
> }
> + mutex_unlock(&rdtgroup_mutex);
> }
>
> /*
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index c4be3453b3ff..fe94ef3369fa 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -159,7 +159,7 @@ struct resctrl_schema;
> * @cache_level: Which cache level defines scope of this resource
> * @cache: Cache allocation related data
> * @membw: If the component has bandwidth controls, their properties.
> - * @domains: All domains for this resource
> + * @domains: RCU list of all domains for this resource
> * @name: Name to use in "schemata" file.
> * @data_width: Character width of data when displaying
> * @default_ctrl: Specifies default cache cbm or memory B/W percent.
> --
> 2.39.2
>

2023-06-15 22:10:40

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 02/24] x86/resctrl: Access per-rmid structures by index

Hi James,

On 5/25/2023 11:01 AM, James Morse wrote:
> Because of the differences between Intel RDT/AMD QoS and Arm's MPAM
> monitors, RMID values on arm64 are not unique unless the CLOSID is

I find the above a bit confusing ... the theme seems to be "RMID values
on arm64 are not unique because they are different from Intel".
Compare to: "One of the differences between Intel RDT/AMD QoS and
Arm's MPAM monitors is that RMID values on arm64 are not unique unless
the CLOSID is also included."

> also included. Bitmaps like rmid_busy_llc need to be sized by the
> number of unique entries for this resource.
>
> Add helpers to encode/decode the CLOSID and RMID to an index. The
> domain's rmid_busy_llc and rmid_ptrs[] are then sized by index,
> as are the domain mbm_local and mbm_total arrays.

You can use "[]" to indicate an array.

> On x86, the index is always just the RMID, so all these structures
> remain the same size.

I do not consider this accurate considering that the previous
patch increased the size of each element to support this change.

> The index gives resctrl a unique value it can use to store monitor
> values, and allows MPAM to decode the CLOSID when reading the hardware
> counters.
>
> Tested-by: Shaopeng Tan <[email protected]>
> Signed-off-by: James Morse <[email protected]>
> ---
> Changes since v1:
> * Added X86_BAD_CLOSID macro to make it clear what this value means
> * Added second WARN_ON() for closid checking, and made both _ONCE()
>
> Changes since v2:
> * Added RESCTRL_RESERVED_CLOSID
> * Removed a newline
> * Repharsed some comments
> * Renamed a variable 'ignore'd
> * Moved X86_RESCTRL_BAD_CLOSID to a previous patch
>
> Changes since v3:
> * Changed a variable name
> * Fixed various typos
> ---
> arch/x86/include/asm/resctrl.h | 17 ++++++
> arch/x86/kernel/cpu/resctrl/core.c | 2 +-
> arch/x86/kernel/cpu/resctrl/internal.h | 1 +
> arch/x86/kernel/cpu/resctrl/monitor.c | 84 +++++++++++++++++---------
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 7 ++-
> include/linux/resctrl.h | 3 +
> 6 files changed, 83 insertions(+), 31 deletions(-)
>

...

> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 86574adedd64..bcc25f5339c0 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -142,12 +142,29 @@ static inline u64 get_corrected_mbm_count(u32 rmid, unsigned long val)
> return val;
> }
>
> -static inline struct rmid_entry *__rmid_entry(u32 closid, u32 rmid)
> +/*
> + * x86 and arm64 differ in their handling of monitoring.
> + * x86's RMID are an independent number, there is only one source of traffic
> + * with an RMID value of '1'.
> + * arm64's PMG extend the PARTID/CLOSID space, there are multiple sources of
> + * traffic with a PMG value of '1', one for each CLOSID, meaning the RMID
> + * value is no longer unique.
> + * To account for this, resctrl uses an index. On x86 this is just the RMID,
> + * on arm64 it encodes the CLOSID and RMID. This gives a unique number.
> + *
> + * The domain's rmid_busy_llc and rmid_ptrs are sized by index. The arch code

rmid_ptrs[]

> + * must accept an attempt to read every index.
> + */
> +static inline struct rmid_entry *__rmid_entry(u32 idx)
> {
> struct rmid_entry *entry;
> + u32 closid, rmid;
>
> - entry = &rmid_ptrs[rmid];
> - WARN_ON(entry->rmid != rmid);
> + entry = &rmid_ptrs[idx];
> + resctrl_arch_rmid_idx_decode(idx, &closid, &rmid);
> +
> + WARN_ON_ONCE(entry->closid != closid);
> + WARN_ON_ONCE(entry->rmid != rmid);
>
> return entry;
> }

...

> @@ -377,14 +399,16 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
>
> void free_rmid(u32 closid, u32 rmid)
> {
> + u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
> struct rmid_entry *entry;
>
> - if (!rmid)
> - return;
> -
> lockdep_assert_held(&rdtgroup_mutex);
>
> - entry = __rmid_entry(closid, rmid);
> + /* do not allow the default rmid to be free'd */
> + if (!idx)
> + return;
> +

The interface seem to become blurry here. There are new
architecture specific encode/decode callbacks while at the same
time there are a few requirements:
- closid 0 and rmid 0 are reserved
- closid 0 and rmid 0 must map to index 0 (the architecture
callbacks thus do not have much freedom here ... they must
return index 0 for closid 0 and rmid 0, no?).

It does seem a bit strange that the one layer provides values (0,0)
to the other layer while requiring a specific answer (0).

What if RESCTRL_RESERVED_RMID is also introduced, and before encoding
the CLOSID and RMID the core first checks whether it is a reserved entry
being freed, exiting early if this is the case?
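
That is, something along these lines (a sketch only, RESCTRL_RESERVED_RMID
being the new define):

	void free_rmid(u32 closid, u32 rmid)
	{
		u32 idx;

		/* Reserved (default group) entries are never freed */
		if (closid == RESCTRL_RESERVED_CLOSID &&
		    rmid == RESCTRL_RESERVED_RMID)
			return;

		idx = resctrl_arch_rmid_idx_encode(closid, rmid);
		/* ... continue with __rmid_entry(idx) as before ... */
	}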


> + entry = __rmid_entry(idx);
>
> if (is_llc_occupancy_enabled())
> add_rmid_to_limbo(entry);

...

> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 7d80bae05f59..ff7452f644e4 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -6,6 +6,9 @@
> #include <linux/list.h>
> #include <linux/pid.h>
>
> +/* CLOSID value used by the default control group */
> +#define RESCTRL_RESERVED_CLOSID 0
> +

#define RESCTRL_RESERVED_RMID 0 ?

> #ifdef CONFIG_PROC_CPU_RESCTRL
>
> int proc_resctrl_show(struct seq_file *m,


Reinette

2023-06-15 22:42:15

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 24/24] x86/resctrl: Separate arch and fs resctrl locks

Hi James,

On 5/25/2023 11:02 AM, James Morse wrote:
> resctrl has one mutex that is taken by the architecture specific code,
> and the filesystem parts. The two interact via cpuhp, where the
> architecture code updates the domain list. Filesystem handlers that
> walk the domains list should not run concurrently with the cpuhp
> callback modifying the list.
>
> Exposing a lock from the filesystem code means the interface is not
> cleanly defined, and creates the possibility of cross-architecture
> lock ordering headaches. The interaction only exists so that certain
> filesystem paths are serialised against cpu hotplug. The cpu hotplug
> code already has a mechanism to do this using cpus_read_lock().
>
> MPAM's monitors have an overflow interrupt, so it needs to be possible
> to walk the domains list in irq context. RCU is ideal for this,
> but some paths need to be able to sleep to allocate memory.
>
> Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
> of a cpuhp callback, cpus_read_lock() must always be taken first.
> rdtgroup_schemata_write() already does this.
>
> Most of the filesystem code's domain list walkers are currently
> protected by the rdtgroup_mutex taken in rdtgroup_kn_lock_live().
> The exceptions are rdt_bit_usage_show() and the mon_config helpers
> which take the lock directly.
>
> Make the domain list protected by RCU. An architecture-specific
> lock prevents concurrent writers. rdt_bit_usage_show() can

This does not sound right to me. The domain list belongs to resctrl,
the filesystem, having an architecture specific lock protect it does
not seem like the right thing to do. Could this instead be a resctrl
owned lock that is hidden behind helpers for the architecture
code to add domains?
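
For example (a sketch, names invented), resctrl could own the mutex and
expose helpers for the arch code to bracket its domain-list updates:

	/* fs code owns the lock ... */
	static DEFINE_MUTEX(domain_list_lock);

	/* ... arch code takes it via helpers */
	void resctrl_domain_list_lock(void)
	{
		mutex_lock(&domain_list_lock);
	}

	void resctrl_domain_list_unlock(void)
	{
		mutex_unlock(&domain_list_lock);
	}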

Reinette

2023-06-15 22:57:06

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 21/24] x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but cpu

Hi James,

On 5/25/2023 11:02 AM, James Morse wrote:

...

> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 021a8956518c..9cba8fc405b9 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -79,6 +79,37 @@ static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
> return cpu;
> }
>
> +/**
> + * cpumask_any_housekeeping_but() - Chose any cpu in @mask, preferring those
> + * that aren't marked nohz_full, excluding
> + * the provided CPU
> + * @mask: The mask to pick a CPU from.
> + * @exclude_cpu:The CPU to avoid picking.
> + *
> + * Returns a CPU from @mask, but not @but. If there are housekeeping CPUs that

"but not @exclude_cpu"

> + * don't use nohz_full, these are preferred.
> + * Returns >= nr_cpu_ids if no CPUs are available.
> + */
> +static inline unsigned int
> +cpumask_any_housekeeping_but(const struct cpumask *mask, int exclude_cpu)
> +{
> + int cpu, hk_cpu;

Should these be unsigned int?

> +
> + cpu = cpumask_any_but(mask, exclude_cpu);
> + if (tick_nohz_full_cpu(cpu)) {
> + hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
> + if (hk_cpu == exclude_cpu) {
> + hk_cpu = cpumask_nth_andnot(1, mask,
> + tick_nohz_full_mask);
> + }
> +

These braces are not necessary. If they are added to help readability then
perhaps the indentation can be reduced by using an earlier:

if (!tick_nohz_full_cpu(cpu))
return cpu;
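
With that, the whole helper could read (a sketch, behaviour unchanged):

	cpu = cpumask_any_but(mask, exclude_cpu);
	if (!tick_nohz_full_cpu(cpu))
		return cpu;

	hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
	if (hk_cpu == exclude_cpu)
		hk_cpu = cpumask_nth_andnot(1, mask, tick_nohz_full_mask);

	if (hk_cpu < nr_cpu_ids)
		cpu = hk_cpu;

	return cpu;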


> + if (hk_cpu < nr_cpu_ids)
> + cpu = hk_cpu;
> + }
> +
> + return cpu;
> +}
> +
> struct rdt_fs_context {
> struct kernfs_fs_context kfc;
> bool enable_cdpl2;
> @@ -564,11 +595,13 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> struct rdt_domain *d, struct rdtgroup *rdtgrp,
> int evtid, int first);
> void mbm_setup_overflow_handler(struct rdt_domain *dom,
> - unsigned long delay_ms);
> + unsigned long delay_ms,
> + int exclude_cpu);
> void mbm_handle_overflow(struct work_struct *work);
> void __init intel_rdt_mbm_apply_quirk(void);
> bool is_mba_sc(struct rdt_resource *r);
> -void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
> +void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
> + int exclude_cpu);
> void cqm_handle_limbo(struct work_struct *work);
> bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
> void __check_limbo(struct rdt_domain *d, bool force_free);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index ced933694f60..ae02185f3354 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -485,7 +485,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
> * setup up the limbo worker.
> */
> if (!has_busy_rmid(r, d))
> - cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL);
> + cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL, -1);

Should this -1 be RESCTRL_PICK_ANY_CPU?

> set_bit(idx, d->rmid_busy_llc);
> entry->busy++;
> }
> @@ -810,15 +810,28 @@ void cqm_handle_limbo(struct work_struct *work)
> mutex_unlock(&rdtgroup_mutex);
> }
>
> -void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
> +/**
> + * cqm_setup_limbo_handler() - Schedule the limbo handler to run for this
> + * domain.
> + * @delay_ms: How far in the future the handler should run.
> + * @exclude_cpu: Which CPU the handler should not run on, -1 to pick any CPU.

Should -1 be RESCTRL_PICK_ANY_CPU?

> + */
> +void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
> + int exclude_cpu)
> {
> unsigned long delay = msecs_to_jiffies(delay_ms);
> int cpu;
>
> - cpu = cpumask_any_housekeeping(&dom->cpu_mask);
> - dom->cqm_work_cpu = cpu;
> + if (exclude_cpu == RESCTRL_PICK_ANY_CPU)
> + cpu = cpumask_any_housekeeping(&dom->cpu_mask);
> + else
> + cpu = cpumask_any_housekeeping_but(&dom->cpu_mask,
> + exclude_cpu);
>
> - schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
> + if (cpu < nr_cpu_ids) {
> + dom->cqm_work_cpu = cpu;

Should cqm_work_cpu not perhaps be set to nr_cpu_ids on failure? If it keeps
pointing to the CPU that ran the worker previously, there may be unexpected behavior.

Note the different behavior between cqm_setup_limbo_handler() and
mbm_setup_overflow_handler() in this regard.
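
That is, something like (sketch):

	dom->cqm_work_cpu = cpu;	/* may be >= nr_cpu_ids: no CPU found */
	if (cpu < nr_cpu_ids)
		schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);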

> + schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
> + }
> }
>
> void mbm_handle_overflow(struct work_struct *work)
> @@ -864,7 +877,14 @@ void mbm_handle_overflow(struct work_struct *work)
> mutex_unlock(&rdtgroup_mutex);
> }
>
> -void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
> +/**
> + * mbm_setup_overflow_handler() - Schedule the overflow handler to run for this
> + * domain.
> + * @delay_ms: How far in the future the handler should run.
> + * @exclude_cpu: Which CPU the handler should not run on, -1 to pick any CPU.

RESCTRL_PICK_ANY_CPU?

> + */
> +void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms,
> + int exclude_cpu)
> {
> unsigned long delay = msecs_to_jiffies(delay_ms);
> int cpu;
> @@ -875,9 +895,15 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
> */
> if (!resctrl_mounted || !resctrl_arch_mon_capable())
> return;
> - cpu = cpumask_any_housekeeping(&dom->cpu_mask);
> + if (exclude_cpu == -1)

same

> + cpu = cpumask_any_housekeeping(&dom->cpu_mask);
> + else
> + cpu = cpumask_any_housekeeping_but(&dom->cpu_mask,
> + exclude_cpu);
> dom->mbm_work_cpu = cpu;
> - schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
> +
> + if (cpu < nr_cpu_ids)
> + schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
> }
>

...

> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index ecd41762d61a..089b91133e5e 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -9,6 +9,9 @@
> /* CLOSID value used by the default control group */
> #define RESCTRL_RESERVED_CLOSID 0
>
> +/* Indicates no CPU needs to be excluded */

This comment seems to just be a rewrite of the macro name.

> +#define RESCTRL_PICK_ANY_CPU -1
> +
> #ifdef CONFIG_PROC_CPU_RESCTRL
>
> int proc_resctrl_show(struct seq_file *m,

Reinette

2023-06-15 23:03:23

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 16/24] x86/resctrl: Make resctrl_mounted checks explicit

Hi James,

On 5/25/2023 11:02 AM, James Morse wrote:

...

> @@ -3710,8 +3716,11 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
> if (is_llc_occupancy_enabled())
> INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
>
> - /* If resctrl is mounted, add per domain monitor data directories. */
> - if (static_branch_unlikely(&rdt_mon_enable_key))
> + /*
> + * If the filesystem is not mounted, creating directories is deferred
> + * until mount time by rdt_get_tree() calling mkdir_mondata_all().
> + */

I do not think that this new comment captures this work. This code runs when a new
domain comes online and directories need to be created in all the existing
resource groups. This includes the default resource group as well as those created
after resctrl was mounted. The new comment states that this is "deferred until mount
time by rdt_get_tree() calling mkdir_mondata_all()" - but I do not think that is
accurate since the reader is pointed to the directories created just for the default
resource group.
Perhaps something like:

/*
* If the filesystem is not mounted then only the default resource group
* exists. Creation of its directories is deferred until mount time
* by rdt_get_tree() calling mkdir_mondata_all().
*/

I do think that the comment explaining what the code does is helpful though.
Can you please also keep the comment about what is done in the case when resctrl
is indeed mounted?


> + if (resctrl_mounted && static_branch_unlikely(&rdt_mon_enable_key))
> mkdir_mondata_subdir_allrdtgrp(r, d);
>
> return 0;


Reinette

2023-06-15 23:04:37

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 11/24] x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow

Hi James,

On 5/25/2023 11:01 AM, James Morse wrote:

...

> @@ -55,6 +56,28 @@
> /* Max event bits supported */
> #define MAX_EVT_CONFIG_BITS GENMASK(6, 0)
>
> +/**
> + * cpumask_any_housekeeping() - Choose any CPU in @mask, preferring those that
> + * aren't marked nohz_full
> + * @mask: The mask to pick a CPU from.
> + *
> + * Returns a CPU in @mask. If there are housekeeping CPUs that don't use
> + * nohz_full, these are preferred.
> + */
> +static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
> +{
> + int cpu, hk_cpu;

Should both of these be unsigned int?

> +
> + cpu = cpumask_any(mask);
> + if (tick_nohz_full_cpu(cpu)) {
> + hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
> + if (hk_cpu < nr_cpu_ids)
> + cpu = hk_cpu;
> + }
> +
> + return cpu;
> +}
> +
> struct rdt_fs_context {
> struct kernfs_fs_context kfc;
> bool enable_cdpl2;
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 128d4c7206e4..e267869d60d5 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -770,9 +770,9 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
> void cqm_handle_limbo(struct work_struct *work)
> {
> unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
> - int cpu = smp_processor_id();
> struct rdt_resource *r;
> struct rdt_domain *d;
> + int cpu;
>
> mutex_lock(&rdtgroup_mutex);
>
> @@ -781,8 +781,10 @@ void cqm_handle_limbo(struct work_struct *work)
>
> __check_limbo(d, false);
>
> - if (has_busy_rmid(r, d))
> + if (has_busy_rmid(r, d)) {
> + cpu = cpumask_any_housekeeping(&d->cpu_mask);
> schedule_delayed_work_on(cpu, &d->cqm_limbo, delay);

I expected cqm_work_cpu to also change when the worker moves to a different CPU.
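
i.e. something like (sketch):

	cpu = cpumask_any_housekeeping(&d->cpu_mask);
	d->cqm_work_cpu = cpu;	/* keep the bookkeeping in sync */
	schedule_delayed_work_on(cpu, &d->cqm_limbo, delay);

(the same applies to mbm_work_cpu in mbm_handle_overflow() further down)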

> + }
>
> mutex_unlock(&rdtgroup_mutex);
> }
> @@ -792,7 +794,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
> unsigned long delay = msecs_to_jiffies(delay_ms);
> int cpu;
>
> - cpu = cpumask_any(&dom->cpu_mask);
> + cpu = cpumask_any_housekeeping(&dom->cpu_mask);
> dom->cqm_work_cpu = cpu;
>
> schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
> @@ -802,10 +804,10 @@ void mbm_handle_overflow(struct work_struct *work)
> {
> unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
> struct rdtgroup *prgrp, *crgrp;
> - int cpu = smp_processor_id();
> struct list_head *head;
> struct rdt_resource *r;
> struct rdt_domain *d;
> + int cpu;
>
> mutex_lock(&rdtgroup_mutex);
>
> @@ -826,6 +828,11 @@ void mbm_handle_overflow(struct work_struct *work)
> update_mba_bw(prgrp, d);
> }
>
> + /*
> + * Re-check for housekeeping CPUs. This allows the overflow handler to
> + * move off a nohz_full CPU quickly.
> + */
> + cpu = cpumask_any_housekeeping(&d->cpu_mask);
> schedule_delayed_work_on(cpu, &d->mbm_over, delay);

Here I expected mbm_work_cpu to change if the worker moves to a different CPU.
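As above, something like (sketch):

	cpu = cpumask_any_housekeeping(&d->cpu_mask);
	d->mbm_work_cpu = cpu;
	schedule_delayed_work_on(cpu, &d->mbm_over, delay);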

>
> out_unlock:
> @@ -839,7 +846,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
>
> if (!static_branch_likely(&rdt_mon_enable_key))
> return;
> - cpu = cpumask_any(&dom->cpu_mask);
> + cpu = cpumask_any_housekeeping(&dom->cpu_mask);
> dom->mbm_work_cpu = cpu;
> schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
> }

Reinette

2023-06-15 23:49:17

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 00/24] x86/resctrl: monitored closid+rmid together, separate arch/fs locking

Hi James,

On 5/25/2023 11:01 AM, James Morse wrote:
>
> The second part of this series adds a domain_list_lock to protect writes to the
> domain list, and protects the domain list with RCU - or read_cpus_lock().

read_cpus_lock() -> cpus_read_lock()

>
> Use of RCU is to allow lockless readers of the domain list, today resctrl only has
> one, rdt_bit_usage_show(). But to get MPAMs monitors working, its very likely

This may need an update to reflect changes since the initial version (similar to the
relevant patch).

Reinette

2023-07-28 16:45:04

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v4 21/24] x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but cpu

Hi Reinette,

On 15/06/2023 23:25, Reinette Chatre wrote:
> On 5/25/2023 11:02 AM, James Morse wrote:
>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>> index 021a8956518c..9cba8fc405b9 100644
>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>> @@ -79,6 +79,37 @@ static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
>> return cpu;
>> }
>>
>> +/**
>> + * cpumask_any_housekeeping_but() - Chose any cpu in @mask, preferring those
>> + * that aren't marked nohz_full, excluding
>> + * the provided CPU
>> + * @mask: The mask to pick a CPU from.
>> + * @exclude_cpu:The CPU to avoid picking.
>> + *
>> + * Returns a CPU from @mask, but not @but. If there are housekeeping CPUs that
>
> "but not @exclude_cpu"
>
>> + * don't use nohz_full, these are preferred.
>> + * Returns >= nr_cpu_ids if no CPUs are available.
>> + */
>> +static inline unsigned int
>> +cpumask_any_housekeeping_but(const struct cpumask *mask, int exclude_cpu)
>> +{
>> + int cpu, hk_cpu;
>
> Should these be unsigned int?

Yup, fixed.


>> +
>> + cpu = cpumask_any_but(mask, exclude_cpu);
>> + if (tick_nohz_full_cpu(cpu)) {
>> + hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
>> + if (hk_cpu == exclude_cpu) {
>> + hk_cpu = cpumask_nth_andnot(1, mask,
>> + tick_nohz_full_mask);
>> + }
>> +
>
> These braces are not necessary.

My C parser is pretty dumb, and is easily confused by things like that....


> If they are added to help readability then
> perhaps the indentation can be reduced by using an earlier:
>
> if (!tick_nohz_full_cpu(cpu))
> return cpu;

Even better!
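Something like this (a sketch, with the unsigned types from above):

static inline unsigned int
cpumask_any_housekeeping_but(const struct cpumask *mask, int exclude_cpu)
{
	unsigned int cpu, hk_cpu;

	cpu = cpumask_any_but(mask, exclude_cpu);
	if (!tick_nohz_full_cpu(cpu))
		return cpu;

	/* Prefer a housekeeping CPU, skipping over exclude_cpu if needed */
	hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
	if (hk_cpu == exclude_cpu)
		hk_cpu = cpumask_nth_andnot(1, mask, tick_nohz_full_mask);

	if (hk_cpu < nr_cpu_ids)
		cpu = hk_cpu;

	return cpu;
}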


>> + if (hk_cpu < nr_cpu_ids)
>> + cpu = hk_cpu;
>> + }
>> +
>> + return cpu;
>> +}
>> +
>> struct rdt_fs_context {
>> struct kernfs_fs_context kfc;
>> bool enable_cdpl2;

>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index ced933694f60..ae02185f3354 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -485,7 +485,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
>> * setup up the limbo worker.
>> */
>> if (!has_busy_rmid(r, d))
>> - cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL);
>> + cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL, -1);
>
> Should this -1 be RESCTRL_PICK_ANY_CPU?
>
>> set_bit(idx, d->rmid_busy_llc);
>> entry->busy++;
>> }
>> @@ -810,15 +810,28 @@ void cqm_handle_limbo(struct work_struct *work)
>> mutex_unlock(&rdtgroup_mutex);
>> }
>>
>> -void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
>> +/**
>> + * cqm_setup_limbo_handler() - Schedule the limbo handler to run for this
>> + * domain.
>> + * @delay_ms: How far in the future the handler should run.
>> + * @exclude_cpu: Which CPU the handler should not run on, -1 to pick any CPU.
>
> Should -1 be RESCTRL_PICK_ANY_CPU?
>
>> + */
>> +void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
>> + int exclude_cpu)
>> {
>> unsigned long delay = msecs_to_jiffies(delay_ms);
>> int cpu;
>>
>> - cpu = cpumask_any_housekeeping(&dom->cpu_mask);
>> - dom->cqm_work_cpu = cpu;
>> + if (exclude_cpu == RESCTRL_PICK_ANY_CPU)
>> + cpu = cpumask_any_housekeeping(&dom->cpu_mask);
>> + else
>> + cpu = cpumask_any_housekeeping_but(&dom->cpu_mask,
>> + exclude_cpu);
>>
>> - schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
>> + if (cpu < nr_cpu_ids) {
>> + dom->cqm_work_cpu = cpu;
>
> Should cqm_work_cpu not perhaps be set to nr_cpu_ids on failure? If it keeps
> pointing to CPU that ran worker previously there may be unexpected behavior.
>
> Note the different behavior between cqm_setup_limbo_handler() and
> mbm_setup_overflow_handler() in this regard.

Sure,
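I'll set cqm_work_cpu unconditionally, so a failed pick leaves it >= nr_cpu_ids.
Something like (sketch):

	if (exclude_cpu == RESCTRL_PICK_ANY_CPU)
		cpu = cpumask_any_housekeeping(&dom->cpu_mask);
	else
		cpu = cpumask_any_housekeeping_but(&dom->cpu_mask,
						   exclude_cpu);

	dom->cqm_work_cpu = cpu;
	if (cpu < nr_cpu_ids)
		schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);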


>> + schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
>> + }
>> }
>>
>> void mbm_handle_overflow(struct work_struct *work)

>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>> index ecd41762d61a..089b91133e5e 100644
>> --- a/include/linux/resctrl.h
>> +++ b/include/linux/resctrl.h
>> @@ -9,6 +9,9 @@
>> /* CLOSID value used by the default control group */
>> #define RESCTRL_RESERVED_CLOSID 0
>>
>> +/* Indicates no CPU needs to be excluded */
>
> This comment seems to just be a rewrite of the macro name.

I'm more than happy to remove it!



Thanks,

James

2023-07-28 16:45:15

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v4 00/24] x86/resctrl: monitored closid+rmid together, separate arch/fs locking

Hi Shaopeng Tan,

On 13/06/2023 07:18, Shaopeng Tan (Fujitsu) wrote:
> I ran resctrl selftest on Intel(R) Xeon(R) Gold 6254 CPU with nohz_full enabled/disabled,
> it seems there is no problem.

Thanks for your help with the testing!


James

2023-07-28 16:45:20

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v4 24/24] x86/resctrl: Separate arch and fs resctrl locks

Hi Reinette,

On 15/06/2023 23:26, Reinette Chatre wrote:
> On 5/25/2023 11:02 AM, James Morse wrote:
>> resctrl has one mutex that is taken by the architecture specific code,
>> and the filesystem parts. The two interact via cpuhp, where the
>> architecture code updates the domain list. Filesystem handlers that
>> walk the domains list should not run concurrently with the cpuhp
>> callback modifying the list.
>>
>> Exposing a lock from the filesystem code means the interface is not
>> cleanly defined, and creates the possibility of cross-architecture
>> lock ordering headaches. The interaction only exists so that certain
>> filesystem paths are serialised against cpu hotplug. The cpu hotplug
>> code already has a mechanism to do this using cpus_read_lock().
>>
>> MPAM's monitors have an overflow interrupt, so it needs to be possible
>> to walk the domains list in irq context. RCU is ideal for this,
>> but some paths need to be able to sleep to allocate memory.
>>
>> Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
>> of a cpuhp callback, cpus_read_lock() must always be taken first.
>> rdtgroup_schemata_write() already does this.
>>
>> Most of the filesystem code's domain list walkers are currently
>> protected by the rdtgroup_mutex taken in rdtgroup_kn_lock_live().
>> The exceptions are rdt_bit_usage_show() and the mon_config helpers
>> which take the lock directly.
>>
>> Make the domain list protected by RCU. An architecture-specific
>> lock prevents concurrent writers. rdt_bit_usage_show() can

> This does not sound right to me. The domain list belongs to resctrl,
> the filesystem,

I'd argue the list belongs to the arch code, as only the arch code writes to it.


> having an architecture specific lock protect it does
> not seem like the right thing to do. Could this instead be a resctrl
> owned lock that is hidden behind helpers for the architecture
> code to add domains?

resctrl should never be writing to the domain list, so it never needs to know how
concurrent writers are avoided.

The memory is allocated and added to the list by the architecture code.
This new domain-list-lock API would only ever be called by the arch code, and there is no
resctrl call that needs to be made with that lock held. I don't see any reason for this to
be visible to the filesystem code.

I'm trying to keep the number of functions exposed by resctrl to the arch code as small as
possible. After all this they are:
resctrl_online_domain()
resctrl_offline_domain()
resctrl_online_cpu()
resctrl_offline_cpu()
resctrl_init()
resctrl_exit()

The weird one is resctrl_get_domain_from_cpu(), which is used by both architecture and
filesystem code, because I wanted to avoid duplicating it. It ends up inlined from a
header file.

Part of this is to explore making resctrl a loadable module in the future, which I think
fits with Tony Luck's aims.


Thanks,

James

2023-07-28 16:45:25

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v4 02/24] x86/resctrl: Access per-rmid structures by index

Hi Reinette,

On 15/06/2023 23:03, Reinette Chatre wrote:
> On 5/25/2023 11:01 AM, James Morse wrote:
>> Because of the differences between Intel RDT/AMD QoS and Arm's MPAM
>> monitors, RMID values on arm64 are not unique unless the CLOSID is
>
> I find the above a bit confusing ... the theme seems to be "RMID values
> on arm64 are not unique because they are different from Intel".
> Compare to: "One of the differences between Intel RDT/AMD QoS and
> Arm's MPAM monitors is that RMID values on arm64 are not unique unless
> the CLOSID is also included."

I'm not too sure what is confusing. I'll pad it out with an example.
Is this clearer?:
--------------%<--------------
x86 systems identify traffic using the CLOSID and RMID. The CLOSID is
used to look up the control policy, and the RMID is used for monitoring. For
x86 these are independent numbers.
Arm's MPAM has equivalent features PARTID and PMG, where the PARTID is
used to look up the control policy. The PMG in contrast is a small number
of bits that are used to subdivide PARTID when monitoring. The
cache-occupancy monitors require the PARTID to be specified when monitoring.

This means MPAM's PMG field is not unique. There are multiple PMG-0, one
per allocated CLOSID/PARTID. If PMG is treated as equivalent to RMID, it
cannot be allocated as an independent number. Bitmaps like rmid_busy_llc
need to be sized by the number of unique entries for this resource.

Treat the combined CLOSID and RMID as an index, and provide architecture
helpers to pack and unpack an index. This makes the MPAM values unique.
The domain's rmid_busy_llc and rmid_ptrs[] are then sized by index, as
are domain mbm_local[] and mbm_total[].

x86 can ignore the CLOSID field when packing and unpacking an index, and
report as many indexes as RMID.
--------------%<--------------
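
For x86 the helpers end up trivial, something like this (a sketch; the encode
helper name is from the patch, the decode name and 'no CLOSID' value are
illustrative):

static inline u32 resctrl_arch_rmid_idx_encode(u32 ignored_closid, u32 rmid)
{
	/* x86 RMIDs are already unique, no CLOSID bits are needed */
	return rmid;
}

static inline void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid)
{
	*closid = ~0;	/* x86 has no CLOSID encoded in the index */
	*rmid = idx;
}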

>> also included. Bitmaps like rmid_busy_llc need to be sized by the
>> number of unique entries for this resource.
>>
>> Add helpers to encode/decode the CLOSID and RMID to an index. The
>> domain's rmid_busy_llc and rmid_ptrs[] are then sized by index,
>> as are the domain mbm_local and mbm_total arrays.
>
> You can use "[]" to indicate an array.

>> On x86, the index is always just the RMID, so all these structures
>> remain the same size.
>
> I do not consider this accurate considering that the previous
> patch increased the size of each element to support this change.

That is a different patch... My point here is that x86's array sizes don't get multiplied
by num_closid, as only arm64 needs that.

I've brushed that wording under the carpet in the text above.


>> The index gives resctrl a unique value it can use to store monitor
>> values, and allows MPAM to decode the CLOSID when reading the hardware
>> counters.

>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index 86574adedd64..bcc25f5339c0 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c

>> @@ -377,14 +399,16 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
>>
>> void free_rmid(u32 closid, u32 rmid)
>> {
>> + u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
>> struct rmid_entry *entry;
>>
>> - if (!rmid)
>> - return;
>> -
>> lockdep_assert_held(&rdtgroup_mutex);
>>
>> - entry = __rmid_entry(closid, rmid);
>> + /* do not allow the default rmid to be free'd */
>> + if (!idx)
>> + return;
>> +
>
> The interface seem to become blurry here. There are new
> architecture specific encode/decode callbacks while at the same
> time there are a few requirements:
> - closid 0 and rmid 0 are reserved

> - closid 0 and rmid 0 must map to index 0 (the architecture
> callbacks thus do not have must freedom here ... they must
> return index 0 for closid 0 and rmid 0, no?).

It's not supposed to matter, but I agree this check in free_rmid() doesn't decode the
index to check.


> It does seem a bit strange that the one layer provides values (0,0)
> to other layer while requiring a specific answer (0).
>
> What if RESCTRL_RESERVED_RMID is also introduced and before encoding
> the CLOSID and RMID the core first checks if it is a reserved entry
> being freed and exit early if this is the case?

Sure, that makes this cleaner.
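i.e., check for the reserved pair before encoding, something like (sketch, with
the new define):

	/* Never free the RMID reserved for the default group */
	if (closid == RESCTRL_RESERVED_CLOSID && rmid == RESCTRL_RESERVED_RMID)
		return;

	idx = resctrl_arch_rmid_idx_encode(closid, rmid);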


Thanks,

James

2023-07-28 16:57:22

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v4 24/24] x86/resctrl: Separate arch and fs resctrl locks

Hi Peter,

On 13/06/2023 10:15, Peter Newman wrote:
> On Thu, May 25, 2023 at 8:03 PM James Morse <[email protected]> wrote:
>>
>> resctrl has one mutex that is taken by the architecture specific code,
>> and the filesystem parts. The two interact via cpuhp, where the
>> architecture code updates the domain list. Filesystem handlers that
>> walk the domains list should not run concurrently with the cpuhp
>> callback modifying the list.
>>
>> Exposing a lock from the filesystem code means the interface is not
>> cleanly defined, and creates the possibility of cross-architecture
>> lock ordering headaches. The interaction only exists so that certain
>> filesystem paths are serialised against cpu hotplug. The cpu hotplug
>> code already has a mechanism to do this using cpus_read_lock().
>>
>> MPAM's monitors have an overflow interrupt, so it needs to be possible
>> to walk the domains list in irq context. RCU is ideal for this,
>> but some paths need to be able to sleep to allocate memory.
>
> I noticed there's also a call to get_domain_from_cpu() in
> rdt_ctrl_update(), which is invoked by IPI when updating a schemata
> file, but at least it's a blocking IPI and the caller holds the
> rdtgroup_mutex, so the handler is serialized with CPU hotplug.

resctrl_arch_update_domains()? This also has lockdep_assert_cpus_held(), which is
satisfied by the cpus_read_lock() taken in rdtgroup_kn_lock_live() - this pattern covers
all the user-space filesystem operations.

I'd really like to get all the resctrl_arch_* helpers away from depending on
rdtgroup_mutex. Part of this is to allow perf/kvm/etc to use this interface without having
to expose more things from resctrl.


> Taking a step back, I'm concerned about the scalability of searching
> these linked-lists of domains in atomic contexts. We already have
> machines where the list is 32 entries deep in L3, and much larger in
> L2 CAT.

(Isn't this an existing problem? All I'm doing here is making it explicitly
supported).


> Will the overflow interrupt target a CPU in the correct domain? The
> existing domain list search in rdt_ctrl_update() is for the current
> CPU's domain, so an alternative there would be to store each CPU's
> interrupt-relevant domain pointers in percpu data for quick lookup.

For the overflow interrupt, the irqchip code passes the device to the irq handler; the
device corresponds to the MSC and, via some intermediate structure, to the resctrl
domain. So this is cheap to look up.

The need to 'walk the list in IRQ context' really comes from perf, regardless of whether
the overflow interrupt is supported. The domain_id is cached, but the domain may go
offline before the PMU driver is next called, hence the need to search the list and check
the CPU each time. If this proves to be a bottleneck, we can look for a better data structure.
The domain list is sorted, so a bisect search would improve matters.
A hash table may work for x86: as far as I can see the domain-id always comes from the
cache-id, and those values are picked by the kernel to be contiguous. (On arm64 they come
from a firmware table; all we know is they should be unique.)

resctrl_arch_find_domain() is a per-arch thing, so we can do whatever is best for each
architecture.


For your resctrl_arch_update_domains() example, x86 could dispatch an IPI to each CPU,
passing the necessary domain in msr_param as part of the main loop. This would be slower
for the caller as you'd have to wait for each one to complete, but it would avoid a CPU in
each domain having to search the domain list with interrupts masked (possibly all at the
same time).
I figure schema updates are rare, and it doesn't matter how long they take, provided it
doesn't impact other tasks.


> Also, how quickly does the overflow condition need to be serviced? On
> Intel, overflow handling deadlines aren't even tight enough to warrant
> an interrupt handler.

Ultimately, I don't know. It's something the architecture supports; I think it's most
useful with perf-record. I've not had anyone ask for this support yet.

MPAM can have bandwidth counters as small as 32 bits that count in bytes. If someone
builds this configuration, they may find the overflow interrupt goes off more frequently
than the overflow thread runs to scrub the counters.

MPAM also has options to scale the counters by some implementation-specific size, as well
as 44- and 63-bit counters - all of which should overflow less frequently.

As with everything in the arm world, it's very much "choose your own adventure".


Thanks,

James

2023-07-28 17:13:37

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v4 21/24] x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but cpu

Hi Shaopeng Tan,

On 09/06/2023 12:10, Shaopeng Tan (Fujitsu) wrote:
>> When a CPU is taken offline resctrl may need to move the overflow or limbo
>> handlers to run on a different CPU.
>>
>> Once the offline callbacks have been split, cqm_setup_limbo_handler() will be
>> called while the CPU that is going offline is still present in the cpu_mask.
>>
>> Pass the CPU to exclude to cqm_setup_limbo_handler() and
>> mbm_setup_overflow_handler(). These functions can use a variant of
>> cpumask_any_but() when selecting the CPU. -1 is used to indicate no CPUs
>> need excluding.
>>
>> A subsequent patch moves these calls to be before CPUs have been removed,
>> so this exclude_cpus behaviour is temporary.

>> diff --git
>> a/arch/x86/kernel/cpu/resctrl/monitor.c
>> b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index ced933694f60..ae02185f3354 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -875,9 +895,15 @@ void mbm_setup_overflow_handler(struct rdt_domain
>> *dom, unsigned long delay_ms)
>> */
>> if (!resctrl_mounted || !resctrl_arch_mon_capable())
>> return;
>> - cpu = cpumask_any_housekeeping(&dom->cpu_mask);
>> + if (exclude_cpu == -1)
>> + cpu = cpumask_any_housekeeping(&dom->cpu_mask);
>
> Should RESCTRL_PICK_ANY_CPU be used instead of -1?

Yup, that would be more readable. I did this for cqm_setup_limbo_handler(), but for some
reason missed this one.


Thanks,

James