2023-01-13 18:46:38

by James Morse

Subject: [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking

Hello!

(Changes since v1 are noted in each patch)

This series does two things: it changes resctrl to call resctrl_arch_rmid_read()
in a way that works for MPAM, and it separates the locking so that the arch code
and filesystem code don't have to share a mutex. I tried to split this into two
series, but they touch similar call sites, so that would create more work.

(What's MPAM? See the cover letter of the first series. [1])

On x86 the RMID is an independent number. MPAM's equivalent is the PMG, but this
isn't an independent number - it extends the PARTID (same as CLOSID) space
with bits that aren't used to select the configuration. The monitors can
then be told to match specific PMG values, allowing monitor groups to be
created.

But MPAM expects the monitors to always monitor by PARTID. The
Cache-storage-utilisation counters can only work this way.
(In the MPAM spec, not setting the MATCH_PARTID bit is CONSTRAINED
UNPREDICTABLE - Arm's term for behaviour that portable software can't
rely on.)

It gets worse, as some SoCs may have very few PMG bits. I've seen the
datasheet for one that has a single bit of PMG space.

To be usable, MPAM's counters always need the PARTID and the PMG.
For resctrl, this means always making the CLOSID available when the RMID
is used.

To ensure RMIDs are always unique, this series combines the CLOSID and RMID
into an index, and manages RMIDs based on that. For x86, the index and RMID
are always the same.
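
As an illustration, the index could be built up roughly as below. This is
only a sketch: the PMG width and helper names are invented for the example,
and on x86 the encode helper simply returns the RMID.

/* Sketch only: combining CLOSID and RMID into a single index. */
#define EXAMPLE_NUM_PMG_BITS	2	/* invented PMG width for the example */

static inline u32 example_rmid_idx_encode(u32 closid, u32 rmid)
{
	return (closid << EXAMPLE_NUM_PMG_BITS) | rmid;
}

static inline void example_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid)
{
	*closid = idx >> EXAMPLE_NUM_PMG_BITS;
	*rmid = idx & ((1U << EXAMPLE_NUM_PMG_BITS) - 1);
}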


Currently the architecture-specific code in the cpuhp callbacks takes the
rdtgroup_mutex. This means the filesystem code would have to export this
lock, resulting in an ill-defined interface between the two, and the possibility
of cross-architecture lock-ordering headaches.

The second part of this series adds a domain_list_lock to protect writes to the
domain list, and protects readers with RCU - or cpus_read_lock().

RCU is used to allow lockless readers of the domain list; today resctrl only has
one, rdt_bit_usage_show(). But to get MPAM's monitors working, it's very likely
they'll need to be plumbed up to perf. The uncore PMU driver would then be a
second lockless reader of the domain list.
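
Such a reader would look roughly like the sketch below (illustrative only;
writers take the new domain_list_lock, and readers that need to sleep take
cpus_read_lock() instead):

/* Sketch: a lockless read-side walk of a resource's domain list. */
static void example_walk_domains(struct rdt_resource *r)
{
	struct rdt_domain *d;

	rcu_read_lock();
	list_for_each_entry_rcu(d, &r->domains, list) {
		/* Read-only use of d; must not sleep here. */
	}
	rcu_read_unlock();
}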

This series is based on v6.2-rc3, and can be retrieved from:
https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git mpam/monitors_and_locking/v2

Bugs welcome,


Thanks,

James


[1] https://lore.kernel.org/lkml/[email protected]/
[v1] https://lore.kernel.org/all/[email protected]/

James Morse (18):
x86/resctrl: Track the closid with the rmid
x86/resctrl: Access per-rmid structures by index
x86/resctrl: Create helper for RMID allocation and mondata dir
creation
x86/resctrl: Move rmid allocation out of mkdir_rdt_prepare()
x86/resctrl: Allow RMID allocation to be scoped by CLOSID
x86/resctrl: Allow the allocator to check if a CLOSID can allocate
clean RMID
x86/resctrl: Move CLOSID/RMID matching and setting to use helpers
x86/resctrl: Queue mon_event_read() instead of sending an IPI
x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
x86/resctrl: Allow arch to allocate memory needed in
resctrl_arch_rmid_read()
x86/resctrl: Make resctrl_mounted checks explicit
x86/resctrl: Move alloc/mon static keys into helpers
x86/resctrl: Make rdt_enable_key the arch's decision to switch
x86/resctrl: Add helpers for system wide mon/alloc capable
x86/resctrl: Add cpu online callback for resctrl work
x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but
cpu
x86/resctrl: Add cpu offline callback for resctrl work
x86/resctrl: Separate arch and fs resctrl locks

arch/x86/include/asm/resctrl.h | 90 ++++++
arch/x86/kernel/cpu/resctrl/core.c | 71 ++---
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 19 +-
arch/x86/kernel/cpu/resctrl/internal.h | 24 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 342 ++++++++++++++++------
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 15 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 297 +++++++++++++------
include/linux/resctrl.h | 15 +-
8 files changed, 637 insertions(+), 236 deletions(-)

--
2.30.2


2023-01-13 19:00:08

by James Morse

Subject: [PATCH v2 18/18] x86/resctrl: Separate arch and fs resctrl locks

resctrl has one mutex that is taken by the architecture specific code,
and the filesystem parts. The two interact via cpuhp, where the
architecture code updates the domain list. Filesystem handlers that
walk the domains list should not run concurrently with the cpuhp
callback modifying the list.

Exposing a lock from the filesystem code means the interface is not
cleanly defined, and creates the possibility of cross-architecture
lock ordering headaches. The interaction only exists so that certain
filesystem paths are serialised against cpu hotplug. The cpu hotplug
code already has a mechanism to do this using cpus_read_lock().

MPAM's monitors have an overflow interrupt, so it needs to be possible
to walk the domains list in irq context. RCU is ideal for this,
but some paths need to be able to sleep to allocate memory.

Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
of a cpuhp callback, cpus_read_lock() must always be taken first.
rdtgroup_schemata_write() already does this.

All but one of the filesystem code's domain list walkers are
currently protected by the rdtgroup_mutex taken in
rdtgroup_kn_lock_live(). The exception is rdt_bit_usage_show()
which takes the lock directly.

Make the domain list protected by RCU. An architecture-specific
lock prevents concurrent writers. rdt_bit_usage_show() can
walk the domain list under rcu_read_lock().
The other filesystem list walkers need to be able to sleep.
Add cpus_read_lock() to rdtgroup_kn_lock_live() so that the
cpuhp callbacks can't be invoked when file system operations are
occurring.

Add lockdep_assert_cpus_held() in the cases where the
rdtgroup_kn_lock_live() call isn't obvious.

Resctrl's domain online/offline calls now need to take the
rdtgroup_mutex themselves.

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
arch/x86/kernel/cpu/resctrl/core.c | 33 ++++++++------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 14 ++++--
arch/x86/kernel/cpu/resctrl/monitor.c | 3 ++
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 3 ++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 54 ++++++++++++++++++++---
include/linux/resctrl.h | 2 +-
6 files changed, 84 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 7896fcf11df6..dc1ba580c4db 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -25,8 +25,14 @@
#include <asm/resctrl.h>
#include "internal.h"

-/* Mutex to protect rdtgroup access. */
-DEFINE_MUTEX(rdtgroup_mutex);
+/*
+ * rdt_domain structures are kfree()d when their last cpu goes offline,
+ * and allocated when the first cpu in a new domain comes online.
+ * The rdt_resource's domain list is updated when this happens. The domain
+ * list is protected by RCU, but callers can also take the cpus_read_lock()
+ * to prevent modification if they need to sleep. All writers take this mutex:
+ */
+static DEFINE_MUTEX(domain_list_lock);

/*
* The cached resctrl_pqr_state is strictly per CPU and can never be
@@ -483,6 +489,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
struct rdt_domain *d;
int err;

+ lockdep_assert_held(&domain_list_lock);
+
d = rdt_find_domain(r, id, &add_pos);
if (IS_ERR(d)) {
pr_warn("Couldn't find cache id for CPU %d\n", cpu);
@@ -516,11 +524,12 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;
}

- list_add_tail(&d->list, add_pos);
+ list_add_tail_rcu(&d->list, add_pos);

err = resctrl_online_domain(r, d);
if (err) {
- list_del(&d->list);
+ list_del_rcu(&d->list);
+ synchronize_rcu();
domain_free(hw_dom);
}
}
@@ -541,7 +550,8 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
cpumask_clear_cpu(cpu, &d->cpu_mask);
if (cpumask_empty(&d->cpu_mask)) {
resctrl_offline_domain(r, d);
- list_del(&d->list);
+ list_del_rcu(&d->list);
+ synchronize_rcu();

/*
* rdt_domain "d" is going to be freed below, so clear
@@ -569,30 +579,27 @@ static void clear_closid_rmid(int cpu)
static int resctrl_arch_online_cpu(unsigned int cpu)
{
struct rdt_resource *r;
- int err;

- mutex_lock(&rdtgroup_mutex);
+ mutex_lock(&domain_list_lock);
for_each_capable_rdt_resource(r)
domain_add_cpu(cpu, r);
clear_closid_rmid(cpu);
+ mutex_unlock(&domain_list_lock);

- err = resctrl_online_cpu(cpu);
- mutex_unlock(&rdtgroup_mutex);
-
- return err;
+ return resctrl_online_cpu(cpu);
}

static int resctrl_arch_offline_cpu(unsigned int cpu)
{
struct rdt_resource *r;

- mutex_lock(&rdtgroup_mutex);
resctrl_offline_cpu(cpu);

+ mutex_lock(&domain_list_lock);
for_each_capable_rdt_resource(r)
domain_remove_cpu(cpu, r);
clear_closid_rmid(cpu);
- mutex_unlock(&rdtgroup_mutex);
+ mutex_unlock(&domain_list_lock);

return 0;
}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 4ee3da6dced7..6eb74304860f 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -208,6 +208,9 @@ static int parse_line(char *line, struct resctrl_schema *s,
struct rdt_domain *d;
unsigned long dom_id;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
r->rid == RDT_RESOURCE_MBA) {
rdt_last_cmd_puts("Cannot pseudo-lock MBA resource\n");
@@ -313,6 +316,9 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
int cpu;
u32 idx;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
return -ENOMEM;

@@ -383,11 +389,9 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
return -EINVAL;
buf[nbytes - 1] = '\0';

- cpus_read_lock();
rdtgrp = rdtgroup_kn_lock_live(of->kn);
if (!rdtgrp) {
rdtgroup_kn_unlock(of->kn);
- cpus_read_unlock();
return -ENOENT;
}
rdt_last_cmd_clear();
@@ -451,7 +455,6 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,

out:
rdtgroup_kn_unlock(of->kn);
- cpus_read_unlock();
return ret ?: nbytes;
}

@@ -471,6 +474,9 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
bool sep = false;
u32 ctrl_val;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
seq_printf(s, "%*s:", max_name_width, schema->name);
list_for_each_entry(dom, &r->domains, list) {
if (sep)
@@ -533,7 +539,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
int evtid, int first)
{
/* When picking a cpu from cpu_mask, ensure it can't race with cpuhp */
- lockdep_assert_held(&rdtgroup_mutex);
+ lockdep_assert_cpus_held();

/*
* setup the parameters to pass to mon_event_count() to read the data.
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 334fb3f1c6e2..a2d5cf9052e4 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -421,6 +421,9 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
u32 idx;
int err;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);

arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID);
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 0b4fdb118643..f8864626d593 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -830,6 +830,9 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
struct rdt_domain *d_i;
bool ret = false;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
return true;

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index aa75c42e0faa..391ac3c56680 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -35,6 +35,10 @@
DEFINE_STATIC_KEY_FALSE(rdt_enable_key);
DEFINE_STATIC_KEY_FALSE(rdt_mon_enable_key);
DEFINE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
+
+/* Mutex to protect rdtgroup access. */
+DEFINE_MUTEX(rdtgroup_mutex);
+
static struct kernfs_root *rdt_root;
struct rdtgroup rdtgroup_default;
LIST_HEAD(rdt_all_groups);
@@ -929,7 +933,8 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,

mutex_lock(&rdtgroup_mutex);
hw_shareable = r->cache.shareable_bits;
- list_for_each_entry(dom, &r->domains, list) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(dom, &r->domains, list) {
if (sep)
seq_putc(seq, ';');
sw_shareable = 0;
@@ -985,8 +990,10 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
}
sep = true;
}
+ rcu_read_unlock();
seq_putc(seq, '\n');
mutex_unlock(&rdtgroup_mutex);
+
return 0;
}

@@ -1226,6 +1233,9 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
struct rdt_domain *d;
u32 ctrl;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
list_for_each_entry(s, &resctrl_schema_all, list) {
r = s->res;
if (r->rid == RDT_RESOURCE_MBA)
@@ -1859,6 +1869,9 @@ static int set_cache_qos_cfg(int level, bool enable)
struct rdt_domain *d;
int cpu;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
if (level == RDT_RESOURCE_L3)
update = l3_qos_cfg_update;
else if (level == RDT_RESOURCE_L2)
@@ -2051,6 +2064,7 @@ struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn)
atomic_inc(&rdtgrp->waitcount);
kernfs_break_active_protection(kn);

+ cpus_read_lock();
mutex_lock(&rdtgroup_mutex);

/* Was this group deleted while we waited? */
@@ -2068,6 +2082,7 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn)
return;

mutex_unlock(&rdtgroup_mutex);
+ cpus_read_unlock();

if (atomic_dec_and_test(&rdtgrp->waitcount) &&
(rdtgrp->flags & RDT_DELETED)) {
@@ -2364,6 +2379,9 @@ static int reset_all_ctrls(struct rdt_resource *r)
struct rdt_domain *d;
int i, cpu;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
return -ENOMEM;

@@ -2644,6 +2662,9 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_domain *dom;
int ret;

+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
list_for_each_entry(dom, &r->domains, list) {
ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
if (ret)
@@ -3327,7 +3348,8 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
kfree(d->mbm_local);
}

-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+static void _resctrl_offline_domain(struct rdt_resource *r,
+ struct rdt_domain *d)
{
lockdep_assert_held(&rdtgroup_mutex);

@@ -3362,6 +3384,13 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
domain_destroy_mon_state(d);
}

+void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+ mutex_lock(&rdtgroup_mutex);
+ _resctrl_offline_domain(r, d);
+ mutex_unlock(&rdtgroup_mutex);
+}
+
static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
@@ -3393,7 +3422,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+static int _resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
{
int err;

@@ -3424,12 +3453,23 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

+int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+ int err;
+
+ mutex_lock(&rdtgroup_mutex);
+ err = _resctrl_online_domain(r, d);
+ mutex_unlock(&rdtgroup_mutex);
+
+ return err;
+}
+
int resctrl_online_cpu(unsigned int cpu)
{
- lockdep_assert_held(&rdtgroup_mutex);
-
+ mutex_lock(&rdtgroup_mutex);
/* The cpu is set in default rdtgroup after online. */
cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
+ mutex_unlock(&rdtgroup_mutex);

return 0;
}
@@ -3450,8 +3490,7 @@ void resctrl_offline_cpu(unsigned int cpu)
struct rdtgroup *rdtgrp;
struct rdt_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;

- lockdep_assert_held(&rdtgroup_mutex);
-
+ mutex_lock(&rdtgroup_mutex);
list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
clear_childcpus(rdtgrp, cpu);
@@ -3471,6 +3510,7 @@ void resctrl_offline_cpu(unsigned int cpu)
cqm_setup_limbo_handler(d, 0, cpu);
}
}
+ mutex_unlock(&rdtgroup_mutex);
}

/*
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 0d21e7ba7c1d..4fedd45738a5 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -153,7 +153,7 @@ struct resctrl_schema;
* @cache_level: Which cache level defines scope of this resource
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
- * @domains: All domains for this resource
+ * @domains: RCU list of all domains for this resource
* @name: Name to use in "schemata" file.
* @data_width: Character width of data when displaying
* @default_ctrl: Specifies default cache cbm or memory B/W percent.
--
2.30.2

2023-01-13 19:00:58

by James Morse

Subject: [PATCH v2 13/18] x86/resctrl: Make rdt_enable_key the arch's decision to switch

rdt_enable_key is switched when resctrl is mounted. It was also previously
used to prevent a second mount of the filesystem.

Any other architecture that wants to support resctrl has to provide
identical static keys.

Now that we have helpers for enablign and disabling the alloc/mon keys,
resctrl doesn't need to switch this extra key, it can be done by the arch
code. Use the static-key increment and decrement helpers, and change
resctrl to ensure the calls are balanced.

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
arch/x86/include/asm/resctrl.h | 4 ++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 11 +++++------
2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index d6889aea5496..5b5ae6d8a343 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -45,21 +45,25 @@ DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
static inline void resctrl_arch_enable_alloc(void)
{
static_branch_enable_cpuslocked(&rdt_alloc_enable_key);
+ static_branch_inc_cpuslocked(&rdt_enable_key);
}

static inline void resctrl_arch_disable_alloc(void)
{
static_branch_disable_cpuslocked(&rdt_alloc_enable_key);
+ static_branch_dec_cpuslocked(&rdt_enable_key);
}

static inline void resctrl_arch_enable_mon(void)
{
static_branch_enable_cpuslocked(&rdt_mon_enable_key);
+ static_branch_inc_cpuslocked(&rdt_enable_key);
}

static inline void resctrl_arch_disable_mon(void)
{
static_branch_disable_cpuslocked(&rdt_mon_enable_key);
+ static_branch_dec_cpuslocked(&rdt_enable_key);
}

/*
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index bbad906f573a..0e22f8361392 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2247,10 +2247,8 @@ static int rdt_get_tree(struct fs_context *fc)
if (rdt_mon_capable)
resctrl_arch_enable_mon();

- if (rdt_alloc_capable || rdt_mon_capable) {
- static_branch_enable_cpuslocked(&rdt_enable_key);
+ if (rdt_alloc_capable || rdt_mon_capable)
resctrl_mounted = true;
- }

if (is_mbm_enabled()) {
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
@@ -2514,9 +2512,10 @@ static void rdt_kill_sb(struct super_block *sb)
rdt_pseudo_lock_release();
rdtgroup_default.mode = RDT_MODE_SHAREABLE;
schemata_list_destroy();
- resctrl_arch_disable_alloc();
- resctrl_arch_disable_mon();
- static_branch_disable_cpuslocked(&rdt_enable_key);
+ if (rdt_alloc_capable)
+ resctrl_arch_disable_alloc();
+ if (rdt_mon_capable)
+ resctrl_arch_disable_mon();
resctrl_mounted = false;
kernfs_kill_sb(sb);
mutex_unlock(&rdtgroup_mutex);
--
2.30.2

2023-01-13 19:02:44

by James Morse

Subject: [PATCH v2 06/18] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID

MPAM's PMG bits extend its PARTID space, meaning the same PMG value can be
used for different control groups.

This means once a CLOSID is allocated, all its monitoring ids may still be
dirty, and held in limbo.

Add a helper to allow the CLOSID allocator to check if a CLOSID has dirty
RMID values. This behaviour is enabled by a kconfig option selected by
the architecture, which avoids a pointless search for x86.

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>

---
Changes since v1:
* Removed superflous IS_ENABLED().
---
arch/x86/kernel/cpu/resctrl/internal.h | 1 +
arch/x86/kernel/cpu/resctrl/monitor.c | 31 ++++++++++++++++++++++++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 17 ++++++++------
3 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 013c8fc9fd28..adae6231324f 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -509,6 +509,7 @@ int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
int closids_supported(void);
+bool resctrl_closid_is_dirty(u32 closid);
void closid_free(int closid);
int alloc_rmid(u32 closid);
void free_rmid(u32 closid, u32 rmid);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 347be3767241..190ac183139e 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -327,6 +327,37 @@ static struct rmid_entry *resctrl_find_free_rmid(u32 closid)
return ERR_PTR(-ENOSPC);
}

+/**
+ * resctrl_closid_is_dirty - Determine if clean RMID can be allocate for this
+ * CLOSID.
+ * @closid: The CLOSID that is being queried.
+ *
+ * MPAM's equivalent of RMID are per-CLOSID, meaning a freshly allocate CLOSID
+ * may not be able to allocate clean RMID. To avoid this the allocator will
+ * only return clean CLOSID.
+ */
+bool resctrl_closid_is_dirty(u32 closid)
+{
+ struct rmid_entry *entry;
+ int i;
+
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
+ return false;
+
+ for (i = 0; i < resctrl_arch_system_num_rmid_idx(); i++) {
+ entry = &rmid_ptrs[i];
+ if (entry->closid != closid)
+ continue;
+
+ if (entry->busy)
+ return true;
+ }
+
+ return false;
+}
+
/*
* As of now the RMIDs allocation is the same in each domain.
* However we keep track of which packages the RMIDs
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index ac88610a6946..e1f879e13823 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -93,7 +93,7 @@ void rdt_last_cmd_printf(const char *fmt, ...)
* - Our choices on how to configure each resource become progressively more
* limited as the number of resources grows.
*/
-static int closid_free_map;
+static unsigned long closid_free_map;
static int closid_free_map_len;

int closids_supported(void)
@@ -119,14 +119,17 @@ static void closid_init(void)

static int closid_alloc(void)
{
- u32 closid = ffs(closid_free_map);
+ u32 closid;

- if (closid == 0)
- return -ENOSPC;
- closid--;
- closid_free_map &= ~(1 << closid);
+ for_each_set_bit(closid, &closid_free_map, closid_free_map_len) {
+ if (resctrl_closid_is_dirty(closid))
+ continue;

- return closid;
+ clear_bit(closid, &closid_free_map);
+ return closid;
+ }
+
+ return -ENOSPC;
}

void closid_free(int closid)
--
2.30.2

2023-01-13 19:05:53

by James Morse

Subject: [PATCH v2 15/18] x86/resctrl: Add cpu online callback for resctrl work

The resctrl architecture-specific code may need to create a domain when
a CPU comes online; it also needs to reset the CPU's PQR_ASSOC register.
The resctrl filesystem code needs to update the rdtgroup_default cpu
mask when cpus are brought online.

Currently this is all done in one function, resctrl_online_cpu().
This will need to be split into architecture and filesystem parts
before resctrl can be moved to /fs/.

Pull the rdtgroup_default update work out as a filesystem-specific
cpu_online helper. resctrl_online_cpu() is the obvious name for this,
which means the version in core.c needs renaming.

resctrl_online_cpu() is called by the arch code once it has done the
work to add the new cpu to any domains.

In future patches, resctrl_online_cpu() will take the rdtgroup_mutex
itself.

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
arch/x86/kernel/cpu/resctrl/core.c | 11 ++++++-----
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 10 ++++++++++
include/linux/resctrl.h | 1 +
3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index c98e52ff5f20..749d9a450749 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -578,19 +578,20 @@ static void clear_closid_rmid(int cpu)
wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
}

-static int resctrl_online_cpu(unsigned int cpu)
+static int resctrl_arch_online_cpu(unsigned int cpu)
{
struct rdt_resource *r;
+ int err;

mutex_lock(&rdtgroup_mutex);
for_each_capable_rdt_resource(r)
domain_add_cpu(cpu, r);
- /* The cpu is set in default rdtgroup after online. */
- cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
clear_closid_rmid(cpu);
+
+ err = resctrl_online_cpu(cpu);
mutex_unlock(&rdtgroup_mutex);

- return 0;
+ return err;
}

static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
@@ -917,7 +918,7 @@ static int __init resctrl_late_init(void)

state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
"x86/resctrl/cat:online:",
- resctrl_online_cpu, resctrl_offline_cpu);
+ resctrl_arch_online_cpu, resctrl_offline_cpu);
if (state < 0)
return state;

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 44e6d6fbab25..25b027d19dbe 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3424,6 +3424,16 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

+int resctrl_online_cpu(unsigned int cpu)
+{
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ /* The cpu is set in default rdtgroup after online. */
+ cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
+
+ return 0;
+}
+
/*
* rdtgroup_init - rdtgroup initialization
*
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index d90d3dca48e9..7b71c3558d83 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -219,6 +219,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, enum resctrl_conf_type type);
int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_cpu(unsigned int cpu);

/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
--
2.30.2

2023-01-13 19:06:38

by James Morse

Subject: [PATCH v2 03/18] x86/resctrl: Create helper for RMID allocation and mondata dir creation

RMIDs are allocated for each monitor or control group directory, because
each of these needs its own RMID. For control groups,
rdtgroup_mkdir_ctrl_mon() later goes on to allocate the CLOSID.

MPAM's equivalent of the RMID is not an independent number, so it can't be
allocated until the CLOSID is known. An RMID allocation for one CLOSID
may fail, whereas another may succeed, depending on how many monitor
groups a control group already has.

The RMID allocation needs to move to after the CLOSID has been
allocated.

To make a subsequent change that does this easier to read, move the RMID
allocation and mondata dir creation to a helper.

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 42 +++++++++++++++++---------
1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 9ce4746778f4..841294ad6263 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2868,6 +2868,30 @@ static int rdtgroup_init_alloc(struct rdtgroup *rdtgrp)
return 0;
}

+static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
+{
+ int ret;
+
+ if (!rdt_mon_capable)
+ return 0;
+
+ ret = alloc_rmid();
+ if (ret < 0) {
+ rdt_last_cmd_puts("Out of RMIDs\n");
+ return ret;
+ }
+ rdtgrp->mon.rmid = ret;
+
+ ret = mkdir_mondata_all(rdtgrp->kn, rdtgrp, &rdtgrp->mon.mon_data_kn);
+ if (ret) {
+ rdt_last_cmd_puts("kernfs subdir error\n");
+ free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
+ return ret;
+ }
+
+ return 0;
+}
+
static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
const char *name, umode_t mode,
enum rdt_group_type rtype, struct rdtgroup **r)
@@ -2933,20 +2957,10 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
goto out_destroy;
}

- if (rdt_mon_capable) {
- ret = alloc_rmid();
- if (ret < 0) {
- rdt_last_cmd_puts("Out of RMIDs\n");
- goto out_destroy;
- }
- rdtgrp->mon.rmid = ret;
+ ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp);
+ if (ret)
+ goto out_destroy;

- ret = mkdir_mondata_all(kn, rdtgrp, &rdtgrp->mon.mon_data_kn);
- if (ret) {
- rdt_last_cmd_puts("kernfs subdir error\n");
- goto out_idfree;
- }
- }
kernfs_activate(kn);

/*
@@ -2954,8 +2968,6 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
*/
return 0;

-out_idfree:
- free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
out_destroy:
kernfs_put(rdtgrp->kn);
kernfs_remove(rdtgrp->kn);
--
2.30.2

2023-01-13 19:06:54

by James Morse

Subject: [PATCH v2 10/18] x86/resctrl: Allow arch to allocate memory needed in resctrl_arch_rmid_read()

Depending on the number of monitors available, Arm's MPAM may need to
allocate a monitor prior to reading the counter value. Allocating a
contended resource may involve sleeping.

All callers of resctrl_arch_rmid_read() read the counter on more than
one domain. If the monitor is allocated globally, there is no need to
allocate and free it for each call to resctrl_arch_rmid_read().

Add arch hooks for this allocation, which need calling before
resctrl_arch_rmid_read(). The allocated monitor is passed to
resctrl_arch_rmid_read(), then freed again afterwards. The helper
can be called on any CPU, and can sleep.
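
For comparison, an arch that does need to allocate something might implement
the hooks roughly as in the sketch below. The helper names here are made-up
assumptions, not the MPAM driver; the x86 hooks added by this patch are
effectively no-ops.

/* Sketch only: an arch that must reserve a hardware monitor first. */
int resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, int evtid)
{
	int mon;

	might_sleep();

	/* Hypothetical helper: waits for a free hardware monitor. */
	mon = example_reserve_hw_monitor(r, evtid);
	if (mon < 0)
		return -ENOSPC;

	/* The returned value is passed back as arch_mon_ctx. */
	return mon;
}

void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid, int ctx)
{
	example_release_hw_monitor(r, evtid, ctx);
}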

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
arch/x86/include/asm/resctrl.h | 11 +++++++
arch/x86/kernel/cpu/resctrl/internal.h | 1 +
arch/x86/kernel/cpu/resctrl/monitor.c | 40 +++++++++++++++++++++++---
include/linux/resctrl.h | 4 +--
4 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index d589a82995ac..194a1570af7b 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -136,6 +136,17 @@ static inline u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid)
return rmid;
}

+/* x86 can always read an rmid, nothing needs allocating */
+struct rdt_resource;
+static inline int resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, int evtid)
+{
+ might_sleep();
+ return 0;
+};
+
+static inline void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid,
+ int ctx) { };
+
void resctrl_cpu_detect(struct cpuinfo_x86 *c);

#else
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 1f90a10b75a1..e85e454bec72 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -88,6 +88,7 @@ struct rmid_read {
bool first;
int err;
u64 val;
+ int arch_mon_ctx;
};

extern bool rdt_alloc_capable;
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index d6ae4b713801..4e248f4a5f59 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -15,6 +15,7 @@
* Software Developer Manual June 2016, volume 3, section 17.17.
*/

+#include <linux/cpu.h>
#include <linux/module.h>
#include <linux/sizes.h>
#include <linux/slab.h>
@@ -236,7 +237,7 @@ static void __rmid_read(void *arg)

int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, u32 rmid, enum resctrl_event_id eventid,
- u64 *val)
+ u64 *val, int ignored)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
@@ -285,9 +286,14 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
struct rmid_entry *entry;
u32 idx, cur_idx = 1;
+ int arch_mon_ctx;
bool rmid_dirty;
u64 val = 0;

+ arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID);
+ if (arch_mon_ctx < 0)
+ return;
+
/*
* Skip RMID 0 and start from RMID 1 and check all the RMIDs that
* are marked as busy for occupancy < threshold. If the occupancy
@@ -301,7 +307,8 @@ void __check_limbo(struct rdt_domain *d, bool force_free)

entry = __rmid_entry(idx);
if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
- QOS_L3_OCCUP_EVENT_ID, &val)) {
+ QOS_L3_OCCUP_EVENT_ID, &val,
+ arch_mon_ctx)) {
rmid_dirty = true;
} else {
rmid_dirty = (val >= resctrl_rmid_realloc_threshold);
@@ -316,6 +323,8 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
}
cur_idx = idx + 1;
}
+
+ resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx);
}

bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
@@ -407,16 +416,22 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
struct rdt_domain *d;
+ int arch_mon_ctx;
u64 val = 0;
u32 idx;
int err;

idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);

+ arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID);
+ if (arch_mon_ctx < 0)
+ return;
+
entry->busy = 0;
list_for_each_entry(d, &r->domains, list) {
err = resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
- QOS_L3_OCCUP_EVENT_ID, &val);
+ QOS_L3_OCCUP_EVENT_ID, &val,
+ arch_mon_ctx);
if (err || val <= resctrl_rmid_realloc_threshold)
continue;

@@ -429,6 +444,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
set_bit(idx, d->rmid_busy_llc);
entry->busy++;
}
+ resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx);

if (entry->busy)
rmid_limbo_count++;
@@ -465,7 +481,7 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evtid);

rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid, rr->evtid,
- &tval);
+ &tval, rr->arch_mon_ctx);
if (rr->err)
return rr->err;

@@ -538,6 +554,9 @@ int mon_event_count(void *info)
int ret;

rdtgrp = rr->rgrp;
+ rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr->r, rr->evtid);
+ if (rr->arch_mon_ctx < 0)
+ return rr->arch_mon_ctx;

ret = __mon_event_count(rdtgrp->closid, rdtgrp->mon.rmid, rr);

@@ -564,6 +583,8 @@ int mon_event_count(void *info)
if (ret == 0)
rr->err = 0;

+ resctrl_arch_mon_ctx_free(rr->r, rr->evtid, rr->arch_mon_ctx);
+
return 0;
}

@@ -700,11 +721,21 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
if (is_mbm_total_enabled()) {
rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
rr.val = 0;
+ rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
+ if (rr.arch_mon_ctx < 0)
+ return;
+
__mon_event_count(closid, rmid, &rr);
+
+ resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
}
if (is_mbm_local_enabled()) {
rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
rr.val = 0;
+ rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
+ if (rr.arch_mon_ctx < 0)
+ return;
+
__mon_event_count(closid, rmid, &rr);

/*
@@ -714,6 +745,7 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
*/
if (is_mba_sc(NULL))
mbm_bw_count(closid, rmid, &rr);
+ resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
}
}

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 57d32c3ce06f..d90d3dca48e9 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -230,6 +230,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
* @rmid: rmid of the counter to read.
* @eventid: eventid to read, e.g. L3 occupancy.
* @val: result of the counter read in bytes.
+ * @arch_mon_ctx: An allocated context from resctrl_arch_mon_ctx_alloc().
*
* Call from process context on a CPU that belongs to domain @d.
*
@@ -238,8 +239,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
*/
int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, u32 rmid, enum resctrl_event_id eventid,
- u64 *val);
-
+ u64 *val, int arch_mon_ctx);

/**
* resctrl_arch_reset_rmid() - Reset any private state associated with rmid
--
2.30.2

2023-01-17 21:02:46

by Fenghua Yu

Subject: RE: [PATCH v2 06/18] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID

Hi, James,

> MPAM's PMG bits extend its PARTID space, meaning the same PMG value can be
> used for different control groups.
>
> This means once a CLOSID is allocated, all its monitoring ids may still be dirty,
> and held in limbo.
>
> Add a helper to allow the CLOSID allocator to check if a CLOSID has dirty RMID
> values. This behaviour is enabled by a kconfig option selected by the
> architecture, which avoids a pointless search for x86.
>
> Tested-by: Shaopeng Tan <[email protected]>
> Signed-off-by: James Morse <[email protected]>
>
> ---
> Changes since v1:
> * Removed superflous IS_ENABLED().
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 1 +
> arch/x86/kernel/cpu/resctrl/monitor.c | 31 ++++++++++++++++++++++++++
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 17 ++++++++------
> 3 files changed, 42 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h
> b/arch/x86/kernel/cpu/resctrl/internal.h
> index 013c8fc9fd28..adae6231324f 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -509,6 +509,7 @@ int rdtgroup_pseudo_lock_create(struct rdtgroup
> *rdtgrp); void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp); struct
> rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r); int
> closids_supported(void);
> +bool resctrl_closid_is_dirty(u32 closid);
> void closid_free(int closid);
> int alloc_rmid(u32 closid);
> void free_rmid(u32 closid, u32 rmid);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c
> b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 347be3767241..190ac183139e 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -327,6 +327,37 @@ static struct rmid_entry *resctrl_find_free_rmid(u32
> closid)
> return ERR_PTR(-ENOSPC);
> }
>
> +/**
> + * resctrl_closid_is_dirty - Determine if clean RMID can be allocate for this

s/allocate/allocated/

> + * CLOSID.
> + * @closid: The CLOSID that is being queried.
> + *
> + * MPAM's equivalent of RMID are per-CLOSID, meaning a freshly allocate

s/allocate/allocated/

> +CLOSID
> + * may not be able to allocate clean RMID. To avoid this the allocator
> +will
> + * only return clean CLOSID.
> + */
> +bool resctrl_closid_is_dirty(u32 closid) {
> + struct rmid_entry *entry;
> + int i;
> +
> + lockdep_assert_held(&rdtgroup_mutex);

It's better to move lockdep_assert_held() after if (!IS_ENABLED()).
Then the compiler might optimize this function to empty on x86.

> +
> + if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
> + return false;
> +
> + for (i = 0; i < resctrl_arch_system_num_rmid_idx(); i++) {
> + entry = &rmid_ptrs[i];
> + if (entry->closid != closid)
> + continue;
> +
> + if (entry->busy)
> + return true;
> + }
> +
> + return false;
> +}
> +
> /*
> * As of now the RMIDs allocation is the same in each domain.
> * However we keep track of which packages the RMIDs diff --git
> a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index ac88610a6946..e1f879e13823 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -93,7 +93,7 @@ void rdt_last_cmd_printf(const char *fmt, ...)
> * - Our choices on how to configure each resource become progressively more
> * limited as the number of resources grows.
> */
> -static int closid_free_map;
> +static unsigned long closid_free_map;
> static int closid_free_map_len;
>
> int closids_supported(void)
> @@ -119,14 +119,17 @@ static void closid_init(void)
>
> static int closid_alloc(void)
> {
> - u32 closid = ffs(closid_free_map);
> + u32 closid;
>
> - if (closid == 0)
> - return -ENOSPC;
> - closid--;
> - closid_free_map &= ~(1 << closid);
> + for_each_set_bit(closid, &closid_free_map, closid_free_map_len) {
> + if (resctrl_closid_is_dirty(closid))
> + continue;
>
> - return closid;
> + clear_bit(closid, &closid_free_map);
> + return closid;
> + }
> +
> + return -ENOSPC;
> }
>
> void closid_free(int closid)
> --
> 2.30.2

Thanks.

-Fenghua

2023-01-25 07:30:59

by Shaopeng Tan (Fujitsu)

Subject: RE: [PATCH v2 00/18] x86/resctrl: monitored closid+rmid together, separate arch/fs locking

Hi James,

Tested this patch series (v2) on an Intel(R) Xeon(R) Gold 6254 CPU with the resctrl kselftest.
No problem.

Thanks
Shaopeng TAN


2023-02-02 23:47:26

by Reinette Chatre

Subject: Re: [PATCH v2 06/18] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID

Hi James,

On 1/13/2023 9:54 AM, James Morse wrote:
> MPAM's PMG bits extend its PARTID space, meaning the same PMG value can be
> used for different control groups.
>
> This means once a CLOSID is allocated, all its monitoring ids may still be
> dirty, and held in limbo.
>
> Add a helper to allow the CLOSID allocator to check if a CLOSID has dirty
> RMID values. This behaviour is enabled by a kconfig option selected by
> the architecture, which avoids a pointless search for x86.
>
> Tested-by: Shaopeng Tan <[email protected]>
> Signed-off-by: James Morse <[email protected]>
>
> ---
> Changes since v1:
> * Removed superflous IS_ENABLED().
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 1 +
> arch/x86/kernel/cpu/resctrl/monitor.c | 31 ++++++++++++++++++++++++++
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 17 ++++++++------
> 3 files changed, 42 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 013c8fc9fd28..adae6231324f 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -509,6 +509,7 @@ int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
> void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
> struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
> int closids_supported(void);
> +bool resctrl_closid_is_dirty(u32 closid);
> void closid_free(int closid);
> int alloc_rmid(u32 closid);
> void free_rmid(u32 closid, u32 rmid);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 347be3767241..190ac183139e 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -327,6 +327,37 @@ static struct rmid_entry *resctrl_find_free_rmid(u32 closid)
> return ERR_PTR(-ENOSPC);
> }
>
> +/**
> + * resctrl_closid_is_dirty - Determine if clean RMID can be allocate for this
> + * CLOSID.
> + * @closid: The CLOSID that is being queried.
> + *
> + * MPAM's equivalent of RMID are per-CLOSID, meaning a freshly allocate CLOSID
> + * may not be able to allocate clean RMID. To avoid this the allocator will
> + * only return clean CLOSID.
> + */
> +bool resctrl_closid_is_dirty(u32 closid)
> +{
> + struct rmid_entry *entry;
> + int i;
> +
> + lockdep_assert_held(&rdtgroup_mutex);
> +
> + if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
> + return false;

Why is a config option chosen? Is this not something that can be set in the
architecture specific code using a global in the form matching existing related
items like "arch_has..." or "arch_needs..."?

> +
> + for (i = 0; i < resctrl_arch_system_num_rmid_idx(); i++) {
> + entry = &rmid_ptrs[i];
> + if (entry->closid != closid)
> + continue;
> +
> + if (entry->busy)
> + return true;
> + }
> +
> + return false;
> +}

If I understand this correctly resctrl_closid_is_dirty() will return true if
_any_ RMID/PMG associated with a CLOSID is in use. That is, a CLOSID may be
able to support 100s of PMG but if only one of them is busy then the CLOSID
will be considered unusable ("dirty"). It sounds to me that there could be scenarios
when CLOSID could be considered unavailable while there are indeed sufficient
resources?

The function comment states "Determine if clean RMID can be allocate for this
CLOSID." - if I understand correctly it behaves more like "Determine if all
RMID associated with CLOSID are available".

Reinette

2023-02-02 23:48:34

by Reinette Chatre

Subject: Re: [PATCH v2 13/18] x86/resctrl: Make rdt_enable_key the arch's decision to switch

Hi James,

On 1/13/2023 9:54 AM, James Morse wrote:
> rdt_enable_key is switched when resctrl is mounted. It was also previously
> used to prevent a second mount of the filesystem.
>
> Any other architecture that wants to support resctrl has to provide
> identical static keys.
>
> Now that we have helpers for enablign and disabling the alloc/mon keys,

Please remove all usage of "we" in changelog and code comments.

enablign -> enabling

Reinette



2023-02-02 23:51:00

by Reinette Chatre

Subject: Re: [PATCH v2 18/18] x86/resctrl: Separate arch and fs resctrl locks

Hi James,

On 1/13/2023 9:54 AM, James Morse wrote:
> resctrl has one mutex that is taken by the architecture specific code,
> and the filesystem parts. The two interact via cpuhp, where the
> architecture code updates the domain list. Filesystem handlers that
> walk the domains list should not run concurrently with the cpuhp
> callback modifying the list.
>
> Exposing a lock from the filesystem code means the interface is not
> cleanly defined, and creates the possibility of cross-architecture
> lock ordering headaches. The interaction only exists so that certain
> filesystem paths are serialised against cpu hotplug. The cpu hotplug
> code already has a mechanism to do this using cpus_read_lock().
>
> MPAM's monitors have an overflow interrupt, so it needs to be possible
> to walk the domains list in irq context. RCU is ideal for this,
> but some paths need to be able to sleep to allocate memory.
>
> Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
> of a cpuhp callback, cpus_read_lock() must always be taken first.
> rdtgroup_schemata_write() already does this.
>
> All but one of the filesystem code's domain list walkers are
> currently protected by the rdtgroup_mutex taken in
> rdtgroup_kn_lock_live(). The exception is rdt_bit_usage_show()
> which takes the lock directly.

The new BMEC code also. You can find it on tip's x86/cache branch,
see mbm_total_bytes_config_write() and mbm_local_bytes_config_write().

>
> Make the domain list protected by RCU. An architecture-specific
> lock prevents concurrent writers. rdt_bit_usage_show() can
> walk the domain list under rcu_read_lock().
> The other filesystem list walkers need to be able to sleep.
> Add cpus_read_lock() to rdtgroup_kn_lock_live() so that the
> cpuhp callbacks can't be invoked when file system operations are
> occurring.
>
> Add lockdep_assert_cpus_held() in the cases where the
> rdtgroup_kn_lock_live() call isn't obvious.
>
> Resctrl's domain online/offline calls now need to take the
> rdtgroup_mutex themselves.
>
> Tested-by: Shaopeng Tan <[email protected]>
> Signed-off-by: James Morse <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/core.c | 33 ++++++++------
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 14 ++++--
> arch/x86/kernel/cpu/resctrl/monitor.c | 3 ++
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 3 ++
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 54 ++++++++++++++++++++---
> include/linux/resctrl.h | 2 +-
> 6 files changed, 84 insertions(+), 25 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 7896fcf11df6..dc1ba580c4db 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -25,8 +25,14 @@
> #include <asm/resctrl.h>
> #include "internal.h"
>
> -/* Mutex to protect rdtgroup access. */
> -DEFINE_MUTEX(rdtgroup_mutex);
> +/*
> + * rdt_domain structures are kfree()d when their last cpu goes offline,
> + * and allocated when the first cpu in a new domain comes online.
> + * The rdt_resource's domain list is updated when this happens. The domain
> + * list is protected by RCU, but callers can also take the cpus_read_lock()
> + * to prevent modification if they need to sleep. All writers take this mutex:

Using "callers can" is not specific (compare to "callers should"). Please provide
clear guidance on how the locks should be used. Reader may wonder "why take cpus_read_lock()
to prevent modification, why not just take the mutex to prevent modification?"

> + */
> +static DEFINE_MUTEX(domain_list_lock);
>
> /*
> * The cached resctrl_pqr_state is strictly per CPU and can never be
> @@ -483,6 +489,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
> struct rdt_domain *d;
> int err;
>
> + lockdep_assert_held(&domain_list_lock);
> +
> d = rdt_find_domain(r, id, &add_pos);
> if (IS_ERR(d)) {
> pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> @@ -516,11 +524,12 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
> return;
> }
>
> - list_add_tail(&d->list, add_pos);
> + list_add_tail_rcu(&d->list, add_pos);
>
> err = resctrl_online_domain(r, d);
> if (err) {
> - list_del(&d->list);
> + list_del_rcu(&d->list);
> + synchronize_rcu();
> domain_free(hw_dom);
> }
> }
> @@ -541,7 +550,8 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> cpumask_clear_cpu(cpu, &d->cpu_mask);
> if (cpumask_empty(&d->cpu_mask)) {
> resctrl_offline_domain(r, d);
> - list_del(&d->list);
> + list_del_rcu(&d->list);
> + synchronize_rcu();
>
> /*
> * rdt_domain "d" is going to be freed below, so clear

Should domain_remove_cpu() also get a "lockdep_assert_held(&domain_list_lock)"?

> @@ -569,30 +579,27 @@ static void clear_closid_rmid(int cpu)
> static int resctrl_arch_online_cpu(unsigned int cpu)
> {
> struct rdt_resource *r;
> - int err;
>
> - mutex_lock(&rdtgroup_mutex);
> + mutex_lock(&domain_list_lock);
> for_each_capable_rdt_resource(r)
> domain_add_cpu(cpu, r);
> clear_closid_rmid(cpu);
> + mutex_unlock(&domain_list_lock);

Why is clear_closid_rmid(cpu) protected by mutex?

>
> - err = resctrl_online_cpu(cpu);
> - mutex_unlock(&rdtgroup_mutex);
> -
> - return err;
> + return resctrl_online_cpu(cpu);
> }
>
> static int resctrl_arch_offline_cpu(unsigned int cpu)
> {
> struct rdt_resource *r;
>
> - mutex_lock(&rdtgroup_mutex);
> resctrl_offline_cpu(cpu);
>
> + mutex_lock(&domain_list_lock);
> for_each_capable_rdt_resource(r)
> domain_remove_cpu(cpu, r);
> clear_closid_rmid(cpu);
> - mutex_unlock(&rdtgroup_mutex);
> + mutex_unlock(&domain_list_lock);

Same

>
> return 0;
> }

Reinette

2023-03-03 18:35:44

by James Morse

Subject: Re: [PATCH v2 06/18] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID

Hi Fenghua,

On 17/01/2023 18:29, Yu, Fenghua wrote:
>> MPAM's PMG bits extend its PARTID space, meaning the same PMG value can be
>> used for different control groups.
>>
>> This means once a CLOSID is allocated, all its monitoring ids may still be dirty,
>> and held in limbo.
>>
>> Add a helper to allow the CLOSID allocator to check if a CLOSID has dirty RMID
>> values. This behaviour is enabled by a kconfig option selected by the
>> architecture, which avoids a pointless search for x86.

>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c
>> b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index 347be3767241..190ac183139e 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -327,6 +327,37 @@ static struct rmid_entry *resctrl_find_free_rmid(u32
>> closid)
>> return ERR_PTR(-ENOSPC);
>> }
>>
>> +/**
>> + * resctrl_closid_is_dirty - Determine if clean RMID can be allocate for this
>
> s/allocate/allocated/
>
>> + * CLOSID.
>> + * @closid: The CLOSID that is being queried.
>> + *
>> + * MPAM's equivalent of RMID are per-CLOSID, meaning a freshly allocate
>
> s/allocate/allocated/

(Both fixed, thanks)


>> +CLOSID
>> + * may not be able to allocate clean RMID. To avoid this the allocator
>> +will
>> + * only return clean CLOSID.
>> + */
>> +bool resctrl_closid_is_dirty(u32 closid) {
>> + struct rmid_entry *entry;
>> + int i;
>> +
>> + lockdep_assert_held(&rdtgroup_mutex);
>
> It's better to move lockdep_assert_held() after if (!IS_ENABLED()).
> Then the compiler might optimize this function to empty on x86.

If you compile without lockdep it will be empty!
Is anyone worried about performance with lockdep enabled?

The reason for it being here is documentation, and the runtime check if you run with
lockdep. Having it here means new code that only runs on x86 (with lockdep) also
checks this, even though x86 doesn't have CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID.

I'd prefer to keep it so we can catch bugs early. Lockdep isn't on by default.


>> +
>> + if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
>> + return false;
>> +
>> + for (i = 0; i < resctrl_arch_system_num_rmid_idx(); i++) {
>> + entry = &rmid_ptrs[i];
>> + if (entry->closid != closid)
>> + continue;
>> +
>> + if (entry->busy)
>> + return true;
>> + }
>> +
>> + return false;
>> +}


Thanks,

James

2023-03-03 18:36:58

by James Morse

Subject: Re: [PATCH v2 06/18] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID

Hi Reinette,

On 02/02/2023 23:46, Reinette Chatre wrote:
> On 1/13/2023 9:54 AM, James Morse wrote:
>> MPAM's PMG bits extend its PARTID space, meaning the same PMG value can be
>> used for different control groups.
>>
>> This means once a CLOSID is allocated, all its monitoring ids may still be
>> dirty, and held in limbo.
>>
>> Add a helper to allow the CLOSID allocator to check if a CLOSID has dirty
>> RMID values. This behaviour is enabled by a kconfig option selected by
>> the architecture, which avoids a pointless search for x86.

>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index 347be3767241..190ac183139e 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -327,6 +327,37 @@ static struct rmid_entry *resctrl_find_free_rmid(u32 closid)
>> return ERR_PTR(-ENOSPC);
>> }
>>
>> +/**
>> + * resctrl_closid_is_dirty - Determine if clean RMID can be allocate for this
>> + * CLOSID.
>> + * @closid: The CLOSID that is being queried.
>> + *
>> + * MPAM's equivalent of RMID are per-CLOSID, meaning a freshly allocate CLOSID
>> + * may not be able to allocate clean RMID. To avoid this the allocator will
>> + * only return clean CLOSID.
>> + */
>> +bool resctrl_closid_is_dirty(u32 closid)
>> +{
>> + struct rmid_entry *entry;
>> + int i;
>> +
>> + lockdep_assert_held(&rdtgroup_mutex);
>> +
>> + if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
>> + return false;

> Why is a config option chosen? Is this not something that can be set in the
> architecture specific code using a global in the form matching existing related
> items like "arch_has..." or "arch_needs..."?

It doesn't vary by platform, so making it a runtime variable would mean x86 has to carry
this extra code around, even though it will never use it. Done like this, the compiler can
dead-code eliminate the below checks and embed the constant return value in all the callers.


>
>> +
>> + for (i = 0; i < resctrl_arch_system_num_rmid_idx(); i++) {
>> + entry = &rmid_ptrs[i];
>> + if (entry->closid != closid)
>> + continue;
>> +
>> + if (entry->busy)
>> + return true;
>> + }
>> +
>> + return false;
>> +}
>
> If I understand this correctly resctrl_closid_is_dirty() will return true if
> _any_ RMID/PMG associated with a CLOSID is in use. That is, a CLOSID may be
> able to support 100s of PMG but if only one of them is busy then the CLOSID
> will be considered unusable ("dirty"). It sounds to me that there could be scenarios
> when CLOSID could be considered unavailable while there are indeed sufficient
> resources?

You are right this could happen. I guess the better approach would be to prefer the
cleanest CLOSID that has a clean PMG=0. User-space may not be able to allocate all the
monitor groups immediately, but that would be preferable to failing the control group
creation.

But as this code doesn't get built until the MPAM driver is merged, I'd like to keep it to
an absolute minimum. This would be more than is needed for MPAM to have close to resctrl
feature-parity, so I'd prefer to do this as an improvement once the MPAM driver is upstream.

(also in this category is better use of MPAM's monitors and labelling traffic from the iommu)
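
If it helps, the sort of thing I have in mind for that later improvement is roughly
the below. It's only a sketch: the helper name is made up, closid_allocated() and
closids_supported() live in rdtgroup.c today rather than monitor.c, and a real
version should probably also insist the PMG=0 entry itself is free:

static u32 find_cleanest_closid(void)
{
        u32 cleanest_closid = ~0;
        int best = INT_MAX;
        int closid, busy, i;

        lockdep_assert_held(&rdtgroup_mutex);

        for (closid = 0; closid < closids_supported(); closid++) {
                if (closid_allocated(closid))
                        continue;

                /* Count the limbo RMIDs that share this CLOSID */
                busy = 0;
                for (i = 0; i < resctrl_arch_system_num_rmid_idx(); i++) {
                        if (rmid_ptrs[i].closid == closid && rmid_ptrs[i].busy)
                                busy++;
                }

                if (busy < best) {
                        best = busy;
                        cleanest_closid = closid;
                }
        }

        return cleanest_closid;     /* ~0 means no free CLOSID at all */
}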


> The function comment states "Determine if clean RMID can be allocate for this
> CLOSID." - if I understand correctly it behaves more like "Determine if all
> RMID associated with CLOSID are available".

Yes, I'll fix that.


Thanks!

James

2023-03-06 11:34:58

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v2 18/18] x86/resctrl: Separate arch and fs resctrl locks

Hi Reinette,

On 02/02/2023 23:50, Reinette Chatre wrote:
> On 1/13/2023 9:54 AM, James Morse wrote:
>> resctrl has one mutex that is taken by the architecture specific code,
>> and the filesystem parts. The two interact via cpuhp, where the
>> architecture code updates the domain list. Filesystem handlers that
>> walk the domains list should not run concurrently with the cpuhp
>> callback modifying the list.
>>
>> Exposing a lock from the filesystem code means the interface is not
>> cleanly defined, and creates the possibility of cross-architecture
>> lock ordering headaches. The interaction only exists so that certain
>> filesystem paths are serialised against cpu hotplug. The cpu hotplug
>> code already has a mechanism to do this using cpus_read_lock().
>>
>> MPAM's monitors have an overflow interrupt, so it needs to be possible
>> to walk the domains list in irq context. RCU is ideal for this,
>> but some paths need to be able to sleep to allocate memory.
>>
>> Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
>> of a cpuhp callback, cpus_read_lock() must always be taken first.
>> rdtgroup_schemata_write() already does this.
>>
>> All but one of the filesystem code's domain list walkers are
>> currently protected by the rdtgroup_mutex taken in
>> rdtgroup_kn_lock_live(). The exception is rdt_bit_usage_show()
>> which takes the lock directly.
>
> The new BMEC code also. You can find it on tip's x86/cache branch,
> see mbm_total_bytes_config_write() and mbm_local_bytes_config_write().
>
>>
>> Make the domain list protected by RCU. An architecture-specific
>> lock prevents concurrent writers. rdt_bit_usage_show() can
>> walk the domain list under rcu_read_lock().
>> The other filesystem list walkers need to be able to sleep.
>> Add cpus_read_lock() to rdtgroup_kn_lock_live() so that the
>> cpuhp callbacks can't be invoked when file system operations are
>> occurring.
>>
>> Add lockdep_assert_cpus_held() in the cases where the
>> rdtgroup_kn_lock_live() call isn't obvious.
>>
>> Resctrl's domain online/offline calls now need to take the
>> rdtgroup_mutex themselves.


>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>> index 7896fcf11df6..dc1ba580c4db 100644
>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>> @@ -25,8 +25,14 @@
>> #include <asm/resctrl.h>
>> #include "internal.h"
>>
>> -/* Mutex to protect rdtgroup access. */
>> -DEFINE_MUTEX(rdtgroup_mutex);
>> +/*
>> + * rdt_domain structures are kfree()d when their last cpu goes offline,
>> + * and allocated when the first cpu in a new domain comes online.
>> + * The rdt_resource's domain list is updated when this happens. The domain
>> + * list is protected by RCU, but callers can also take the cpus_read_lock()
>> + * to prevent modification if they need to sleep. All writers take this mutex:
>
> Using "callers can" is not specific (compare to "callers should"). Please provide
> clear guidance on how the locks should be used. Reader may wonder "why take cpus_read_lock()
> to prevent modification, why not just take the mutex to prevent modification?"

'if they need to sleep' is the answer to this. I think a certain amount of background
knowledge can be assumed. My aim here wasn't to write an essay, but to indicate that not
all readers do the same thing. This is already the case in resctrl, and the MPAM PMU work
makes that worse.

Is this more robust:
| * rdt_domain structures are kfree()d when their last cpu goes offline,
| * and allocated when the first cpu in a new domain comes online.
| * The rdt_resource's domain list is updated when this happens. Readers of
| * the domain list must either take cpus_read_lock(), or rely on an RCU
| * read-side critical section, to avoid observing concurrent modification.
| * For information about RCU, see Docuemtation/RCU/rcu.rst.
| * All writers take this mutex:

?
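
For reference, the two reader patterns that comment is describing look roughly like
this. It's only a sketch; inspect_domain() and domain_work_that_may_sleep() are
placeholders for whatever the reader actually does:

static void walk_domains_example(struct rdt_resource *r)
{
        struct rdt_domain *d;

        /* Lockless reader, e.g. rdt_bit_usage_show() or a future irq-context user */
        rcu_read_lock();
        list_for_each_entry_rcu(d, &r->domains, list)
                inspect_domain(d);              /* must not sleep */
        rcu_read_unlock();

        /* Reader that needs to sleep: hold off cpuhp instead of relying on RCU */
        cpus_read_lock();
        list_for_each_entry(d, &r->domains, list)
                domain_work_that_may_sleep(d);
        cpus_read_unlock();
}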


>> @@ -541,7 +550,8 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>> cpumask_clear_cpu(cpu, &d->cpu_mask);
>> if (cpumask_empty(&d->cpu_mask)) {
>> resctrl_offline_domain(r, d);
>> - list_del(&d->list);
>> + list_del_rcu(&d->list);
>> + synchronize_rcu();
>>
>> /*
>> * rdt_domain "d" is going to be freed below, so clear

> Should domain_remove_cpu() also get a "lockdep_assert_held(&domain_list_lock)"?

Yes, not sure why I didn't do that!
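
Something like this on the writer side, so the assert sits with the
list_del_rcu()/synchronize_rcu() pair. A sketch only; the real code frees the arch's
container structure rather than the rdt_domain directly:

static void example_free_domain(struct rdt_domain *d)
{
        /* Writers of the domain list must hold the arch-side mutex */
        lockdep_assert_held(&domain_list_lock);

        list_del_rcu(&d->list);
        /* Wait for any rcu_read_lock() walkers before freeing */
        synchronize_rcu();
        kfree(d);
}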


>> @@ -569,30 +579,27 @@ static void clear_closid_rmid(int cpu)
>> static int resctrl_arch_online_cpu(unsigned int cpu)
>> {
>> struct rdt_resource *r;
>> - int err;
>>
>> - mutex_lock(&rdtgroup_mutex);
>> + mutex_lock(&domain_list_lock);
>> for_each_capable_rdt_resource(r)
>> domain_add_cpu(cpu, r);
>> clear_closid_rmid(cpu);
>> + mutex_unlock(&domain_list_lock);

> Why is clear_closid_rmid(cpu) protected by mutex?

It doesn't need to be, it's just an artefact of changing the lock, then moving the
filesystem calls out. (It doesn't need to be protected by rdtgroup_mutex today.)

If you don't think it's churn, I'll move it to make it clearer.
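
i.e. something like this. A sketch only; whatever currently follows the hunk above
(e.g. the filesystem's online work and the return value handling) is unchanged and
omitted here:

static int resctrl_arch_online_cpu(unsigned int cpu)
{
        struct rdt_resource *r;

        mutex_lock(&domain_list_lock);
        for_each_capable_rdt_resource(r)
                domain_add_cpu(cpu, r);
        mutex_unlock(&domain_list_lock);

        /* Resetting this CPU's CLOSID/RMID doesn't touch the domain list */
        clear_closid_rmid(cpu);

        return 0;
}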


Thanks,

James

2023-03-10 19:59:47

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v2 06/18] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID

Hi James,

On 3/3/2023 10:36 AM, James Morse wrote:
> Hi Reinette,
>
> On 02/02/2023 23:46, Reinette Chatre wrote:
>> On 1/13/2023 9:54 AM, James Morse wrote:

...

>>> +bool resctrl_closid_is_dirty(u32 closid)
>>> +{
>>> + struct rmid_entry *entry;
>>> + int i;
>>> +
>>> + lockdep_assert_held(&rdtgroup_mutex);
>>> +
>>> + if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
>>> + return false;
>
>> Why is a config option chosen? Is this not something that can be set in the
>> architecture specific code using a global in the form matching existing related
>> items like "arch_has..." or "arch_needs..."?
>
> It doesn't vary by platform, so making it a runtime variable would mean x86 has to carry
> this extra code around, even though it will never use it. Done like this, the compiler can
> dead-code eliminate the below checks and embed the constant return value in all the callers.

This is fair. I missed any other mention of this option in this series so I
assume this will be a config that will be "select"ed automatically without
users needing to think about whether it is needed?

>>> +
>>> + for (i = 0; i < resctrl_arch_system_num_rmid_idx(); i++) {
>>> + entry = &rmid_ptrs[i];
>>> + if (entry->closid != closid)
>>> + continue;
>>> +
>>> + if (entry->busy)
>>> + return true;
>>> + }
>>> +
>>> + return false;
>>> +}
>>
>> If I understand this correctly resctrl_closid_is_dirty() will return true if
>> _any_ RMID/PMG associated with a CLOSID is in use. That is, a CLOSID may be
>> able to support 100s of PMG but if only one of them is busy then the CLOSID
>> will be considered unusable ("dirty"). It sounds to me that there could be scenarios
>> when CLOSID could be considered unavailable while there are indeed sufficient
>> resources?
>
> You are right this could happen. I guess the better approach would be to prefer the
> cleanest CLOSID that has a clean PMG=0. User-space may not be able to allocate all the
> monitor groups immediately, but that would be preferable to failing the control group
> creation.
>
> But as this code doesn't get built until the MPAM driver is merged, I'd like to keep it to
> an absolute minimum. This would be more than is needed for MPAM to have close to resctrl
> feature-parity, so I'd prefer to do this as an improvement once the MPAM driver is upstream.
>
> (also in this category is better use of MPAM's monitors and labelling traffic from the iommu)
>
>
>> The function comment states "Determine if clean RMID can be allocate for this
>> CLOSID." - if I understand correctly it behaves more like "Determine if all
>> RMID associated with CLOSID are available".
>
> Yes, I'll fix that.

I understand your reasoning for the solution chosen. Would you be ok to expand on
the function comment more to document the intentions that you summarize above (eg. "This
is the absolute minimum solution that will be/should be/could be improved ...")?

Reinette


2023-03-11 00:22:31

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v2 18/18] x86/resctrl: Separate arch and fs resctrl locks

Hi James,

On 3/6/2023 3:34 AM, James Morse wrote:
> Hi Reinette,
>
> On 02/02/2023 23:50, Reinette Chatre wrote:
>> On 1/13/2023 9:54 AM, James Morse wrote:
>>> resctrl has one mutex that is taken by the architecture specific code,
>>> and the filesystem parts. The two interact via cpuhp, where the
>>> architecture code updates the domain list. Filesystem handlers that
>>> walk the domains list should not run concurrently with the cpuhp
>>> callback modifying the list.
>>>
>>> Exposing a lock from the filesystem code means the interface is not
>>> cleanly defined, and creates the possibility of cross-architecture
>>> lock ordering headaches. The interaction only exists so that certain
>>> filesystem paths are serialised against cpu hotplug. The cpu hotplug
>>> code already has a mechanism to do this using cpus_read_lock().
>>>
>>> MPAM's monitors have an overflow interrupt, so it needs to be possible
>>> to walk the domains list in irq context. RCU is ideal for this,
>>> but some paths need to be able to sleep to allocate memory.
>>>
>>> Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
>>> of a cpuhp callback, cpus_read_lock() must always be taken first.
>>> rdtgroup_schemata_write() already does this.
>>>
>>> All but one of the filesystem code's domain list walkers are
>>> currently protected by the rdtgroup_mutex taken in
>>> rdtgroup_kn_lock_live(). The exception is rdt_bit_usage_show()
>>> which takes the lock directly.
>>
>> The new BMEC code also. You can find it on tip's x86/cache branch,
>> see mbm_total_bytes_config_write() and mbm_local_bytes_config_write().
>>
>>>
>>> Make the domain list protected by RCU. An architecture-specific
>>> lock prevents concurrent writers. rdt_bit_usage_show() can
>>> walk the domain list under rcu_read_lock().
>>> The other filesystem list walkers need to be able to sleep.
>>> Add cpus_read_lock() to rdtgroup_kn_lock_live() so that the
>>> cpuhp callbacks can't be invoked when file system operations are
>>> occurring.
>>>
>>> Add lockdep_assert_cpus_held() in the cases where the
>>> rdtgroup_kn_lock_live() call isn't obvious.
>>>
>>> Resctrl's domain online/offline calls now need to take the
>>> rdtgroup_mutex themselves.
>
>
>>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>>> index 7896fcf11df6..dc1ba580c4db 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>>> @@ -25,8 +25,14 @@
>>> #include <asm/resctrl.h>
>>> #include "internal.h"
>>>
>>> -/* Mutex to protect rdtgroup access. */
>>> -DEFINE_MUTEX(rdtgroup_mutex);
>>> +/*
>>> + * rdt_domain structures are kfree()d when their last cpu goes offline,
>>> + * and allocated when the first cpu in a new domain comes online.
>>> + * The rdt_resource's domain list is updated when this happens. The domain
>>> + * list is protected by RCU, but callers can also take the cpus_read_lock()
>>> + * to prevent modification if they need to sleep. All writers take this mutex:
>>
>> Using "callers can" is not specific (compare to "callers should"). Please provide
>> clear guidance on how the locks should be used. Reader may wonder "why take cpus_read_lock()
>> to prevent modification, why not just take the mutex to prevent modification?"
>
> 'if they need to sleep' is the answer to this. I think a certain amount of background
> knowledge can be assumed. My aim here wasn't to write an essay, but to indicate that not
> all readers do the same thing. This is already the case in resctrl, and the MPAM PMU work
> makes that worse.
>
> Is this more robust:
> | * rdt_domain structures are kfree()d when their last cpu goes offline,
> | * and allocated when the first cpu in a new domain comes online.
> | * The rdt_resource's domain list is updated when this happens. Readers of
> | * the domain list must either take cpus_read_lock(), or rely on an RCU
> | * read-side critical section, to avoid observing concurrent modification.
> | * For information about RCU, see Docuemtation/RCU/rcu.rst.
> | * All writers take this mutex:
>
> ?

Yes, I do think this is more robust. Since you do mention, "'if they need to sleep'
is the answer to this", how about "... must take cpus_read_lock() if they need to
sleep, or otherwise rely on an RCU read-side critical section, ..."? I do not
think it is necessary to provide a link to the documentation. If you do prefer
to keep it, please note the typo.

Also, please cpu -> CPU.

>>> @@ -569,30 +579,27 @@ static void clear_closid_rmid(int cpu)
>>> static int resctrl_arch_online_cpu(unsigned int cpu)
>>> {
>>> struct rdt_resource *r;
>>> - int err;
>>>
>>> - mutex_lock(&rdtgroup_mutex);
>>> + mutex_lock(&domain_list_lock);
>>> for_each_capable_rdt_resource(r)
>>> domain_add_cpu(cpu, r);
>>> clear_closid_rmid(cpu);
>>> + mutex_unlock(&domain_list_lock);
>
>> Why is clear_closid_rmid(cpu) protected by mutex?
>
> It doesn't need to be, it's just an artefact of changing the lock, then moving the
> filesystem calls out. (It doesn't need to be protected by rdtgroup_mutex today.)
>
> If you don't think it's churn, I'll move it to make it clearer.
>

I do not see a problem with keeping the lock/unlock as before but
if you do find that you can make the locking clearer then
please do.

Reinette




2023-03-20 17:17:50

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v2 06/18] x86/resctrl: Allow the allocator to check if a CLOSID can allocate clean RMID

Hi Reinette,

On 10/03/2023 19:59, Reinette Chatre wrote:
> On 3/3/2023 10:36 AM, James Morse wrote:
>> On 02/02/2023 23:46, Reinette Chatre wrote:
>>> On 1/13/2023 9:54 AM, James Morse wrote:
>
> ...
>
>>>> +bool resctrl_closid_is_dirty(u32 closid)
>>>> +{
>>>> + struct rmid_entry *entry;
>>>> + int i;
>>>> +
>>>> + lockdep_assert_held(&rdtgroup_mutex);
>>>> +
>>>> + if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
>>>> + return false;
>>
>>> Why is a config option chosen? Is this not something that can be set in the
>>> architecture specific code using a global in the form matching existing related
>>> items like "arch_has..." or "arch_needs..."?
>>
>> It doesn't vary by platform, so making it a runtime variable would mean x86 has to carry
>> this extra code around, even though it will never use it. Done like this, the compiler can
>> dead-code eliminate the below checks and embed the constant return value in all the callers.

> This is fair. I missed any other mention of this option in this series so I
> assume this will be a config that will be "select"ed automatically without
> users needing to think about whether it is needed?

Yes, MPAM platforms would unconditionally enable it, as it reflects how MPAM works. [0]
Users would never need to know it exists.


>>>> + for (i = 0; i < resctrl_arch_system_num_rmid_idx(); i++) {
>>>> + entry = &rmid_ptrs[i];
>>>> + if (entry->closid != closid)
>>>> + continue;
>>>> +
>>>> + if (entry->busy)
>>>> + return true;
>>>> + }
>>>> +
>>>> + return false;
>>>> +}
>>>
>>> If I understand this correctly resctrl_closid_is_dirty() will return true if
>>> _any_ RMID/PMG associated with a CLOSID is in use. That is, a CLOSID may be
>>> able to support 100s of PMG but if only one of them is busy then the CLOSID
>>> will be considered unusable ("dirty"). It sounds to me that there could be scenarios
>>> when CLOSID could be considered unavailable while there are indeed sufficient
>>> resources?
>>
>> You are right this could happen. I guess the better approach would be to prefer the
>> cleanest CLOSID that has a clean PMG=0. User-space may not be able to allocate all the
>> monitor groups immediately, but that would be preferable to failing the control group
>> creation.
>>
>> But as this code doesn't get built until the MPAM driver is merged, I'd like to keep it to
>> an absolute minimum. This would be more than is needed for MPAM to have close to resctrl
>> feature-parity, so I'd prefer to do this as an improvement once the MPAM driver is upstream.
>>
>> (also in this category is better use of MPAM's monitors and labelling traffic from the iommu)
>>
>>
>>> The function comment states "Determine if clean RMID can be allocate for this
>>> CLOSID." - if I understand correctly it behaves more like "Determine if all
>>> RMID associated with CLOSID are available".
>>
>> Yes, I'll fix that.
>
> I understand your reasoning for the solution chosen. Would you be ok to expand on
> the function comment more to document the intentions that you summarize above (eg. "This
> is the absolute minimum solution that will be/should be/could be improved ...")?

Sure thing,


Thanks,

James

[0]
https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/tree/drivers/platform/mpam/Kconfig?h=mpam/snapshot/v6.2&id=ef6b11256ba2cceaff846c96183e8eb6019642d7

2023-03-20 17:27:02

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v2 18/18] x86/resctrl: Separate arch and fs resctrl locks

Hi Reinette

On 11/03/2023 00:22, Reinette Chatre wrote:
> On 3/6/2023 3:34 AM, James Morse wrote:
>> On 02/02/2023 23:50, Reinette Chatre wrote:
>>> On 1/13/2023 9:54 AM, James Morse wrote:
>>>> resctrl has one mutex that is taken by the architecture specific code,
>>>> and the filesystem parts. The two interact via cpuhp, where the
>>>> architecture code updates the domain list. Filesystem handlers that
>>>> walk the domains list should not run concurrently with the cpuhp
>>>> callback modifying the list.
>>>>
>>>> Exposing a lock from the filesystem code means the interface is not
>>>> cleanly defined, and creates the possibility of cross-architecture
>>>> lock ordering headaches. The interaction only exists so that certain
>>>> filesystem paths are serialised against cpu hotplug. The cpu hotplug
>>>> code already has a mechanism to do this using cpus_read_lock().
>>>>
>>>> MPAM's monitors have an overflow interrupt, so it needs to be possible
>>>> to walk the domains list in irq context. RCU is ideal for this,
>>>> but some paths need to be able to sleep to allocate memory.
>>>>
>>>> Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
>>>> of a cpuhp callback, cpus_read_lock() must always be taken first.
>>>> rdtgroup_schemata_write() already does this.
>>>>
>>>> All but one of the filesystem code's domain list walkers are
>>>> currently protected by the rdtgroup_mutex taken in
>>>> rdtgroup_kn_lock_live(). The exception is rdt_bit_usage_show()
>>>> which takes the lock directly.
>>>
>>> The new BMEC code also. You can find it on tip's x86/cache branch,
>>> see mbm_total_bytes_config_write() and mbm_local_bytes_config_write().
>>>
>>>>
>>>> Make the domain list protected by RCU. An architecture-specific
>>>> lock prevents concurrent writers. rdt_bit_usage_show() can
>>>> walk the domain list under rcu_read_lock().
>>>> The other filesystem list walkers need to be able to sleep.
>>>> Add cpus_read_lock() to rdtgroup_kn_lock_live() so that the
>>>> cpuhp callbacks can't be invoked when file system operations are
>>>> occurring.
>>>>
>>>> Add lockdep_assert_cpus_held() in the cases where the
>>>> rdtgroup_kn_lock_live() call isn't obvious.
>>>>
>>>> Resctrl's domain online/offline calls now need to take the
>>>> rdtgroup_mutex themselves.

>>>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>>>> index 7896fcf11df6..dc1ba580c4db 100644
>>>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>>>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>>>> @@ -25,8 +25,14 @@
>>>> #include <asm/resctrl.h>
>>>> #include "internal.h"
>>>>
>>>> -/* Mutex to protect rdtgroup access. */
>>>> -DEFINE_MUTEX(rdtgroup_mutex);
>>>> +/*
>>>> + * rdt_domain structures are kfree()d when their last cpu goes offline,
>>>> + * and allocated when the first cpu in a new domain comes online.
>>>> + * The rdt_resource's domain list is updated when this happens. The domain
>>>> + * list is protected by RCU, but callers can also take the cpus_read_lock()
>>>> + * to prevent modification if they need to sleep. All writers take this mutex:
>>>
>>> Using "callers can" is not specific (compare to "callers should"). Please provide
>>> clear guidance on how the locks should be used. Reader may wonder "why take cpus_read_lock()
>>> to prevent modification, why not just take the mutex to prevent modification?"
>>
>> 'if they need to sleep' is the answer to this. I think a certain amount of background
>> knowledge can be assumed. My aim here wasn't to write an essay, but to indicate that not
>> all readers do the same thing. This is already the case in resctrl, and the MPAM PMU work
>> makes that worse.
>>
>> Is this more robust:
>> | * rdt_domain structures are kfree()d when their last cpu goes offline,
>> | * and allocated when the first cpu in a new domain comes online.
>> | * The rdt_resource's domain list is updated when this happens. Readers of
>> | * the domain list must either take cpus_read_lock(), or rely on an RCU
>> | * read-side critical section, to avoid observing concurrent modification.
>> | * For information about RCU, see Docuemtation/RCU/rcu.rst.
>> | * All writers take this mutex:
>>
>> ?
>
> Yes, I do think this is more robust. Since you do mention, "'if they need to sleep'
> is the answer to this", how about "... must take cpus_read_lock() if they need to
> sleep, or otherwise rely on an RCU read-side critical section, ..."?

Yes, I've changed this to
| * The rdt_resource's domain list is updated when this happens. Readers of
| * the domain list must either take cpus_read_lock() if they need to sleep,
| * or rely on an RCU read-side critical section, to avoid observing concurrent
| * modification.


> I do not
> think it is necessary to provide a link to the documentation. If you do prefer
> to keep it, please note the typo.

I'll drop that then.

> Also, please cpu -> CPU.

Fixed.


>>>> @@ -569,30 +579,27 @@ static void clear_closid_rmid(int cpu)
>>>> static int resctrl_arch_online_cpu(unsigned int cpu)
>>>> {
>>>> struct rdt_resource *r;
>>>> - int err;
>>>>
>>>> - mutex_lock(&rdtgroup_mutex);
>>>> + mutex_lock(&domain_list_lock);
>>>> for_each_capable_rdt_resource(r)
>>>> domain_add_cpu(cpu, r);
>>>> clear_closid_rmid(cpu);
>>>> + mutex_unlock(&domain_list_lock);
>>
>>> Why is clear_closid_rmid(cpu) protected by mutex?
>>
>> It doesn't need to be, it's just an artefact of changing the lock, then moving the
>> filesystem calls out. (It doesn't need to be protected by rdtgroup_mutex today.)
>>
>> If you don't think it's churn, I'll move it to make it clearer.

> I do not see a problem with keeping the lock/unlock as before but
> if you do find that you can make the locking clearer then
> please do.

Done,


Thanks,

James