Hello!
The most interesting change since v4 is the use of a sequence counter when accessing the
QM_EVTSEL MSR. It detects when the access becomes re-entrant without the cost of
re-reading the system register. Otherwise changes since v3 are noted in each patch.
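In short, __rmid_read() now does this (the full hunk is in the "Make
resctrl_arch_rmid_read() retry when it is interrupted" patch below):

	do {
		seq = this_cpu_inc_return(qm_evtsel_seq);
		wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
		rdmsrl(MSR_IA32_QM_CTR, msr_val);
	} while (seq != this_cpu_read(qm_evtsel_seq));

If the counter changed, the read was interrupted by another call on this CPU and QM_CTR
may belong to a different event, so the whole thing is retried.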
This series does two things, it changes resctrl to call resctrl_arch_rmid_read()
in a way that works for MPAM, and it separates the locking so that the arch code
and filesystem code don't have to share a mutex. I tried to split this as two
series, but these touch similar call sites, so it would create more work.
(What's MPAM? See the cover letter of the first series. [1])
On x86 the RMID is an independent number. MPAM's equivalent is PMG, but this
isn't an independent number - it extends the PARTID (same as CLOSID) space
with bits that aren't used to select the configuration. The monitors can
then be told to match specific PMG values, allowing monitor-groups to be
created.
But, MPAM expects the monitors to always monitor by PARTID. The
Cache-storage-utilisation counters can only work this way.
(In the MPAM spec not setting the MATCH_PARTID bit is made CONSTRAINED
UNPREDICTABLE - which is Arm's term to mean portable software can't rely on
this)
It gets worse, as some SoCs may have very few PMG bits. I've seen the
datasheet for one that has a single bit of PMG space.
To be usable, MPAM's counters always need the PARTID and the PMG.
For resctrl, this means always making the CLOSID available when the RMID
is used.
To ensure RMID are always unique, this series combines the CLOSID and RMID
into an index, and manages RMID based on that. For x86, the index and RMID
would always be the same.
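Roughly, the arch helpers for this look like the below. The x86 versions in this series
just pass the RMID through; the MPAM-style packing here is only a sketch, with
'mpam_pmg_width' standing in for however many PMG bits the platform implements:

	/* Sketch only - not code from this series */
	static inline u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid)
	{
		/* PMG extends the PARTID space, pack the two together */
		return (closid << mpam_pmg_width) | rmid;
	}

	static inline void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid)
	{
		*closid = idx >> mpam_pmg_width;
		*rmid = idx & ((1u << mpam_pmg_width) - 1);
	}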
Currently the architecture specific code in the cpuhp callbacks takes the
rdtgroup_mutex. This means the filesystem code would have to export this
lock, resulting in an ill-defined interface between the two, and the possibility
of cross-architecture lock-ordering headaches.
The second part of this series adds a domain_list_lock to protect writes to the
domain list, and protects the domain list with RCU - or cpus_read_lock().
Use of RCU allows lockless readers of the domain list. To get MPAM's monitors
working, it's very likely they'll need to be plumbed up to perf. An uncore PMU
driver would need to be a lockless reader of the domain list.
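As a sketch of the sort of reader that enables (not code in this series), such a driver,
given a struct rdt_resource *r and a cpu, could walk the domain list like this:

	struct rdt_domain *d;

	rcu_read_lock();
	list_for_each_entry_rcu(d, &r->domains, list) {
		if (!cpumask_test_cpu(cpu, &d->cpu_mask))
			continue;

		/* ... read the monitors for @d while still under rcu_read_lock() ... */
	}
	rcu_read_unlock();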
This series is based on v6.5-rc1, and can be retrieved from:
https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git mpam/monitors_and_locking/v5
Bugs welcome,
Thanks,
James
[1] https://lore.kernel.org/lkml/[email protected]/
[v1] https://lore.kernel.org/all/[email protected]/
[v2] https://lore.kernel.org/lkml/[email protected]/
[v3] https://lore.kernel.org/r/[email protected]
[v4] https://lore.kernel.org/r/[email protected]
James Morse (24):
x86/resctrl: Track the closid with the rmid
x86/resctrl: Access per-rmid structures by index
x86/resctrl: Create helper for RMID allocation and mondata dir
creation
x86/resctrl: Move rmid allocation out of mkdir_rdt_prepare()
x86/resctrl: Allow RMID allocation to be scoped by CLOSID
x86/resctrl: Track the number of dirty RMID a CLOSID has
x86/resctrl: Use set_bit()/clear_bit() instead of open coding
x86/resctrl: Allocate the cleanest CLOSID by searching
closid_num_dirty_rmid
x86/resctrl: Move CLOSID/RMID matching and setting to use helpers
tick/nohz: Move tick_nohz_full_mask declaration outside the #ifdef
x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow
x86/resctrl: Make resctrl_arch_rmid_read() retry when it is
interrupted
x86/resctrl: Queue mon_event_read() instead of sending an IPI
x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
x86/resctrl: Allow arch to allocate memory needed in
resctrl_arch_rmid_read()
x86/resctrl: Make resctrl_mounted checks explicit
x86/resctrl: Move alloc/mon static keys into helpers
x86/resctrl: Make rdt_enable_key the arch's decision to switch
x86/resctrl: Add helpers for system wide mon/alloc capable
x86/resctrl: Add cpu online callback for resctrl work
x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but
cpu
x86/resctrl: Add cpu offline callback for resctrl work
x86/resctrl: Move domain helper migration into resctrl_offline_cpu()
x86/resctrl: Separate arch and fs resctrl locks
arch/x86/include/asm/resctrl.h | 90 +++++
arch/x86/kernel/cpu/resctrl/core.c | 78 ++--
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 45 ++-
arch/x86/kernel/cpu/resctrl/internal.h | 82 +++-
arch/x86/kernel/cpu/resctrl/monitor.c | 440 ++++++++++++++++------
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 15 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 340 ++++++++++++-----
include/linux/resctrl.h | 43 ++-
include/linux/tick.h | 9 +-
9 files changed, 870 insertions(+), 272 deletions(-)
--
2.39.2
MPAM's PMG bits extend its PARTID space, meaning the same PMG value can be
used for different control groups.
This means once a CLOSID is allocated, all its monitoring ids may still be
dirty, and held in limbo.
Keep track of the number of RMID held in limbo each CLOSID has. This will
allow a future helper to find the 'cleanest' CLOSID when allocating.
The array is only needed when CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID is
defined. This will never be the case on x86.
Signed-off-by: James Morse <[email protected]>
---
Changes since v4:
* Moved closid_num_dirty_rmid[] update under entry->busy check
* Take the mutex in dom_data_init() as the caller doesn't.
---
arch/x86/kernel/cpu/resctrl/monitor.c | 49 +++++++++++++++++++++++----
1 file changed, 42 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index de91ca781d9f..44addc0126fc 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -43,6 +43,13 @@ struct rmid_entry {
*/
static LIST_HEAD(rmid_free_lru);
+/**
+ * @closid_num_dirty_rmid The number of dirty RMID each CLOSID has.
+ * Only allocated when CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID is defined.
+ * Indexed by CLOSID. Protected by rdtgroup_mutex.
+ */
+static int *closid_num_dirty_rmid;
+
/**
* @rmid_limbo_count count of currently unused but (potentially)
* dirty RMIDs.
@@ -285,6 +292,17 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
return 0;
}
+static void limbo_release_entry(struct rmid_entry *entry)
+{
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ rmid_limbo_count--;
+ list_add_tail(&entry->list, &rmid_free_lru);
+
+ if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
+ closid_num_dirty_rmid[entry->closid]--;
+}
+
/*
* Check the RMIDs that are marked as busy for this domain. If the
* reported LLC occupancy is below the threshold clear the busy bit and
@@ -321,10 +339,8 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
if (force_free || !rmid_dirty) {
clear_bit(idx, d->rmid_busy_llc);
- if (!--entry->busy) {
- rmid_limbo_count--;
- list_add_tail(&entry->list, &rmid_free_lru);
- }
+ if (!--entry->busy)
+ limbo_release_entry(entry);
}
cur_idx = idx + 1;
}
@@ -391,6 +407,8 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
u64 val = 0;
u32 idx;
+ lockdep_assert_held(&rdtgroup_mutex);
+
idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
entry->busy = 0;
@@ -416,9 +434,11 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
}
put_cpu();
- if (entry->busy)
+ if (entry->busy) {
rmid_limbo_count++;
- else
+ if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
+ closid_num_dirty_rmid[entry->closid]++;
+ } else
list_add_tail(&entry->list, &rmid_free_lru);
}
@@ -782,13 +802,28 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
static int dom_data_init(struct rdt_resource *r)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
+ u32 num_closid = resctrl_arch_get_num_closid(r);
struct rmid_entry *entry = NULL;
u32 idx;
int i;
+ if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) {
+ int *tmp;
+
+ tmp = kcalloc(num_closid, sizeof(int), GFP_KERNEL);
+ if (!tmp)
+ return -ENOMEM;
+
+ mutex_lock(&rdtgroup_mutex);
+ closid_num_dirty_rmid = tmp;
+ mutex_unlock(&rdtgroup_mutex);
+ }
+
rmid_ptrs = kcalloc(idx_limit, sizeof(struct rmid_entry), GFP_KERNEL);
- if (!rmid_ptrs)
+ if (!rmid_ptrs) {
+ kfree(closid_num_dirty_rmid);
return -ENOMEM;
+ }
for (i = 0; i < idx_limit; i++) {
entry = &rmid_ptrs[i];
--
2.39.2
The resctrl CLOSID allocator uses a single 32bit word to track which
CLOSID are free. The setting and clearing of bits is open coded.
A subsequent patch adds resctrl_closid_is_free(), which adds more open
coded bitmap operations. These will eventually need changing to use
the bitops helpers so that a CLOSID bitmap of the correct size can be
allocated dynamically.
Convert the existing open coded bit manipulations of closid_free_map
to use set_bit() and friends.
Signed-off-by: James Morse <[email protected]>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b97e119dbe46..4ab9bb018c17 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -106,7 +106,7 @@ void rdt_staged_configs_clear(void)
* - Our choices on how to configure each resource become progressively more
* limited as the number of resources grows.
*/
-static int closid_free_map;
+static unsigned long closid_free_map;
static int closid_free_map_len;
int closids_supported(void)
@@ -126,7 +126,7 @@ static void closid_init(void)
closid_free_map = BIT_MASK(rdt_min_closid) - 1;
/* CLOSID 0 is always reserved for the default group */
- closid_free_map &= ~1;
+ clear_bit(0, &closid_free_map);
closid_free_map_len = rdt_min_closid;
}
@@ -137,14 +137,14 @@ static int closid_alloc(void)
if (closid == 0)
return -ENOSPC;
closid--;
- closid_free_map &= ~(1 << closid);
+ clear_bit(closid, &closid_free_map);
return closid;
}
void closid_free(int closid)
{
- closid_free_map |= 1 << closid;
+ set_bit(closid, &closid_free_map);
}
/**
@@ -156,7 +156,7 @@ void closid_free(int closid)
*/
static bool closid_allocated(unsigned int closid)
{
- return (closid_free_map & (1 << closid)) == 0;
+ return !test_bit(closid, &closid_free_map);
}
/**
--
2.39.2
x86 systems identify traffic using the CLOSID and RMID. The CLOSID is
used to look up the control policy, and the RMID is used for monitoring. For
x86 these are independent numbers.
Arm's MPAM has equivalent features PARTID and PMG, where the PARTID is
used to look up the control policy. The PMG in contrast is a small number
of bits that are used to subdivide PARTID when monitoring. The
cache-occupancy monitors require the PARTID to be specified when monitoring.
This means MPAM's PMG field is not unique. There are multiple PMG-0, one
per allocated CLOSID/PARTID. If PMG is treated as equivalent to RMID, it
cannot be allocated as an independent number. Bitmaps like rmid_busy_llc
need to be sized by the number of unique entries for this resource.
Treat the combined CLOSID and RMID as an index, and provide architecture
helpers to pack and unpack an index. This makes the MPAM values unique.
The domain's rmid_busy_llc and rmid_ptrs[] are then sized by index, as
are domain mbm_local[] and mbm_total[].
x86 can ignore the CLOSID field when packing and unpacking an index, and
report as many indexes as RMID.
Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
Changes since v1:
* Added X86_BAD_CLOSID macro to make it clear what this value means
* Added second WARN_ON() for closid checking, and made both _ONCE()
Changes since v2:
* Added RESCTRL_RESERVED_CLOSID
* Removed a newline
* Rephrased some comments
* Renamed a variable 'ignore'd
* Moved X86_RESCTRL_BAD_CLOSID to a previous patch
Changes since v3:
* Changed a variable name
* Fixed various typos
Changes since v4:
* Removed resource parameter from has_busy_rmid()
* Rewrote commit message
---
arch/x86/include/asm/resctrl.h | 17 +++++
arch/x86/kernel/cpu/resctrl/core.c | 4 +-
arch/x86/kernel/cpu/resctrl/internal.h | 3 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 92 +++++++++++++++++---------
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 9 +--
include/linux/resctrl.h | 4 ++
6 files changed, 92 insertions(+), 37 deletions(-)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 29999f52b461..9510c23db62d 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -101,6 +101,23 @@ static inline void resctrl_sched_in(struct task_struct *tsk)
__resctrl_sched_in(tsk);
}
+static inline u32 resctrl_arch_system_num_rmid_idx(void)
+{
+ /* RMID are independent numbers for x86. num_rmid_idx == num_rmid */
+ return boot_cpu_data.x86_cache_max_rmid + 1;
+}
+
+static inline void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid)
+{
+ *rmid = idx;
+ *closid = X86_RESCTRL_EMPTY_CLOSID;
+}
+
+static inline u32 resctrl_arch_rmid_idx_encode(u32 ignored, u32 rmid)
+{
+ return rmid;
+}
+
void resctrl_cpu_detect(struct cpuinfo_x86 *c);
#else
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 030d3b409768..8dfede01b0c9 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -585,7 +585,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
mbm_setup_overflow_handler(d, 0);
}
if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
- has_busy_rmid(r, d)) {
+ has_busy_rmid(d)) {
cancel_delayed_work(&d->cqm_limbo);
cqm_setup_limbo_handler(d, 0);
}
@@ -600,7 +600,7 @@ static void clear_closid_rmid(int cpu)
state->default_rmid = 0;
state->cur_closid = 0;
state->cur_rmid = 0;
- wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
+ wrmsr(MSR_IA32_PQR_ASSOC, 0, RESCTRL_RESERVED_CLOSID);
}
static int resctrl_online_cpu(unsigned int cpu)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index f2da908bb079..b48715bb8762 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -7,6 +7,7 @@
#include <linux/kernfs.h>
#include <linux/fs_context.h>
#include <linux/jump_label.h>
+#include <asm/resctrl.h>
#define L3_QOS_CDP_ENABLE 0x01ULL
@@ -550,7 +551,7 @@ void __init intel_rdt_mbm_apply_quirk(void);
bool is_mba_sc(struct rdt_resource *r);
void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
+bool has_busy_rmid(struct rdt_domain *d);
void __check_limbo(struct rdt_domain *d, bool force_free);
void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
void __init thread_throttle_mode_init(void);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index fa66029de41c..bd234b66dddf 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -142,12 +142,29 @@ static inline u64 get_corrected_mbm_count(u32 rmid, unsigned long val)
return val;
}
-static inline struct rmid_entry *__rmid_entry(u32 closid, u32 rmid)
+/*
+ * x86 and arm64 differ in their handling of monitoring.
+ * x86's RMID are an independent number, there is only one source of traffic
+ * with an RMID value of '1'.
+ * arm64's PMG extend the PARTID/CLOSID space, there are multiple sources of
+ * traffic with a PMG value of '1', one for each CLOSID, meaning the RMID
+ * value is no longer unique.
+ * To account for this, resctrl uses an index. On x86 this is just the RMID,
+ * on arm64 it encodes the CLOSID and RMID. This gives a unique number.
+ *
+ * The domain's rmid_busy_llc and rmid_ptrs[] are sized by index. The arch code
+ * must accept an attempt to read every index.
+ */
+static inline struct rmid_entry *__rmid_entry(u32 idx)
{
struct rmid_entry *entry;
+ u32 closid, rmid;
- entry = &rmid_ptrs[rmid];
- WARN_ON(entry->rmid != rmid);
+ entry = &rmid_ptrs[idx];
+ resctrl_arch_rmid_idx_decode(idx, &closid, &rmid);
+
+ WARN_ON_ONCE(entry->closid != closid);
+ WARN_ON_ONCE(entry->rmid != rmid);
return entry;
}
@@ -277,8 +294,9 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
void __check_limbo(struct rdt_domain *d, bool force_free)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ u32 idx_limit = resctrl_arch_system_num_rmid_idx();
struct rmid_entry *entry;
- u32 crmid = 1, nrmid;
+ u32 idx, cur_idx = 1;
bool rmid_dirty;
u64 val = 0;
@@ -289,12 +307,11 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
* RMID and move it to the free list when the counter reaches 0.
*/
for (;;) {
- nrmid = find_next_bit(d->rmid_busy_llc, r->num_rmid, crmid);
- if (nrmid >= r->num_rmid)
+ idx = find_next_bit(d->rmid_busy_llc, idx_limit, cur_idx);
+ if (idx >= idx_limit)
break;
- entry = __rmid_entry(X86_RESCTRL_EMPTY_CLOSID, nrmid);// temporary
-
+ entry = __rmid_entry(idx);
if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
QOS_L3_OCCUP_EVENT_ID, &val)) {
rmid_dirty = true;
@@ -303,19 +320,21 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
}
if (force_free || !rmid_dirty) {
- clear_bit(entry->rmid, d->rmid_busy_llc);
+ clear_bit(idx, d->rmid_busy_llc);
if (!--entry->busy) {
rmid_limbo_count--;
list_add_tail(&entry->list, &rmid_free_lru);
}
}
- crmid = nrmid + 1;
+ cur_idx = idx + 1;
}
}
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_domain *d)
{
- return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
+ u32 idx_limit = resctrl_arch_system_num_rmid_idx();
+
+ return find_first_bit(d->rmid_busy_llc, idx_limit) != idx_limit;
}
/*
@@ -345,6 +364,9 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
struct rdt_domain *d;
int cpu, err;
u64 val = 0;
+ u32 idx;
+
+ idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
entry->busy = 0;
cpu = get_cpu();
@@ -362,9 +384,9 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
* For the first limbo RMID in the domain,
* setup up the limbo worker.
*/
- if (!has_busy_rmid(r, d))
+ if (!has_busy_rmid(d))
cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL);
- set_bit(entry->rmid, d->rmid_busy_llc);
+ set_bit(idx, d->rmid_busy_llc);
entry->busy++;
}
put_cpu();
@@ -377,14 +399,17 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
void free_rmid(u32 closid, u32 rmid)
{
+ u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
struct rmid_entry *entry;
- if (!rmid)
- return;
-
lockdep_assert_held(&rdtgroup_mutex);
- entry = __rmid_entry(closid, rmid);
+ /* do not allow the default rmid to be free'd */
+ if (idx == resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID,
+ RESCTRL_RESERVED_RMID))
+ return;
+
+ entry = __rmid_entry(idx);
if (is_llc_occupancy_enabled())
add_rmid_to_limbo(entry);
@@ -395,11 +420,13 @@ void free_rmid(u32 closid, u32 rmid)
static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 closid,
u32 rmid, enum resctrl_event_id evtid)
{
+ u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
+
switch (evtid) {
case QOS_L3_MBM_TOTAL_EVENT_ID:
- return &d->mbm_total[rmid];
+ return &d->mbm_total[idx];
case QOS_L3_MBM_LOCAL_EVENT_ID:
- return &d->mbm_local[rmid];
+ return &d->mbm_local[idx];
default:
return NULL;
}
@@ -441,7 +468,8 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
*/
static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr)
{
- struct mbm_state *m = &rr->d->mbm_local[rmid];
+ u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
+ struct mbm_state *m = &rr->d->mbm_local[idx];
u64 cur_bw, bytes, cur_bytes;
cur_bytes = rr->val;
@@ -531,7 +559,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
{
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
- u32 cur_bw, delta_bw, user_bw;
+ u32 cur_bw, delta_bw, user_bw, idx;
struct rdt_resource *r_mba;
struct rdt_domain *dom_mba;
struct list_head *head;
@@ -544,7 +572,8 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
closid = rgrp->closid;
rmid = rgrp->mon.rmid;
- pmbm_data = &dom_mbm->mbm_local[rmid];
+ idx = resctrl_arch_rmid_idx_encode(closid, rmid);
+ pmbm_data = &dom_mbm->mbm_local[idx];
dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
if (!dom_mba) {
@@ -662,7 +691,7 @@ void cqm_handle_limbo(struct work_struct *work)
__check_limbo(d, false);
- if (has_busy_rmid(r, d))
+ if (has_busy_rmid(d))
schedule_delayed_work_on(cpu, &d->cqm_limbo, delay);
mutex_unlock(&rdtgroup_mutex);
@@ -727,19 +756,20 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
static int dom_data_init(struct rdt_resource *r)
{
+ u32 idx_limit = resctrl_arch_system_num_rmid_idx();
struct rmid_entry *entry = NULL;
- int i, nr_rmids;
+ u32 idx;
+ int i;
- nr_rmids = r->num_rmid;
- rmid_ptrs = kcalloc(nr_rmids, sizeof(struct rmid_entry), GFP_KERNEL);
+ rmid_ptrs = kcalloc(idx_limit, sizeof(struct rmid_entry), GFP_KERNEL);
if (!rmid_ptrs)
return -ENOMEM;
- for (i = 0; i < nr_rmids; i++) {
+ for (i = 0; i < idx_limit; i++) {
entry = &rmid_ptrs[i];
INIT_LIST_HEAD(&entry->list);
- entry->rmid = i;
+ resctrl_arch_rmid_idx_decode(i, &entry->closid, &entry->rmid);
list_add_tail(&entry->list, &rmid_free_lru);
}
@@ -748,7 +778,9 @@ static int dom_data_init(struct rdt_resource *r)
* used for rdtgroup_default control group, which will be setup later.
* See rdtgroup_setup_root().
*/
- entry = __rmid_entry(0, 0);
+ idx = resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID,
+ RESCTRL_RESERVED_RMID);
+ entry = __rmid_entry(idx);
list_del(&entry->list);
return 0;
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index f7fda4fc2c9e..6b7190f9cff6 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3727,7 +3727,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
if (is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
- if (is_llc_occupancy_enabled() && has_busy_rmid(r, d)) {
+ if (is_llc_occupancy_enabled() && has_busy_rmid(d)) {
/*
* When a package is going down, forcefully
* decrement rmid->ebusy. There is no way to know
@@ -3745,16 +3745,17 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
{
+ u32 idx_limit = resctrl_arch_system_num_rmid_idx();
size_t tsize;
if (is_llc_occupancy_enabled()) {
- d->rmid_busy_llc = bitmap_zalloc(r->num_rmid, GFP_KERNEL);
+ d->rmid_busy_llc = bitmap_zalloc(idx_limit, GFP_KERNEL);
if (!d->rmid_busy_llc)
return -ENOMEM;
}
if (is_mbm_total_enabled()) {
tsize = sizeof(*d->mbm_total);
- d->mbm_total = kcalloc(r->num_rmid, tsize, GFP_KERNEL);
+ d->mbm_total = kcalloc(idx_limit, tsize, GFP_KERNEL);
if (!d->mbm_total) {
bitmap_free(d->rmid_busy_llc);
return -ENOMEM;
@@ -3762,7 +3763,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
}
if (is_mbm_local_enabled()) {
tsize = sizeof(*d->mbm_local);
- d->mbm_local = kcalloc(r->num_rmid, tsize, GFP_KERNEL);
+ d->mbm_local = kcalloc(idx_limit, tsize, GFP_KERNEL);
if (!d->mbm_local) {
bitmap_free(d->rmid_busy_llc);
kfree(d->mbm_total);
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index c413bb11d336..660752406174 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -6,6 +6,10 @@
#include <linux/list.h>
#include <linux/pid.h>
+/* CLOSID, RMID value used by the default control group */
+#define RESCTRL_RESERVED_CLOSID 0
+#define RESCTRL_RESERVED_RMID 0
+
#ifdef CONFIG_PROC_CPU_RESCTRL
int proc_resctrl_show(struct seq_file *m,
--
2.39.2
resctrl_arch_rmid_read() could be called by resctrl in process context,
and then called by the PMU driver from irq context on the same CPU.
This could cause struct arch_mbm_state's prev_msr value to go backwards,
leading to the chunks value being incremented multiple times.
The struct arch_mbm_state holds both the previous msr value, and a count
of the number of chunks. These two fields need to be updated atomically.
Similarly __rmid_read() must write to one MSR and read from another,
this must be proteted from re-entrance.
Read the prev_msr before accessing the hardware, and cmpxchg() the value
back. If the value has changed, the whole thing is re-attempted. To protect
the MSR, __rmid_read() will retry reads for QM_CTR if QM_EVTSEL has changed
from the selected value.
Signed-off-by: James Morse <[email protected]>
---
Changes since v4:
* Added retry loop in __rmid_read() to protect the CPU MSRs.
---
arch/x86/kernel/cpu/resctrl/internal.h | 5 +--
arch/x86/kernel/cpu/resctrl/monitor.c | 45 ++++++++++++++++++++------
2 files changed, 38 insertions(+), 12 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a32d307292a1..7012f42a82ee 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -2,6 +2,7 @@
#ifndef _ASM_X86_RESCTRL_INTERNAL_H
#define _ASM_X86_RESCTRL_INTERNAL_H
+#include <linux/atomic.h>
#include <linux/resctrl.h>
#include <linux/sched.h>
#include <linux/kernfs.h>
@@ -338,8 +339,8 @@ struct mbm_state {
* find this struct.
*/
struct arch_mbm_state {
- u64 chunks;
- u64 prev_msr;
+ atomic64_t chunks;
+ atomic64_t prev_msr;
};
/**
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f0670795b446..62350bbd23e0 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -16,6 +16,7 @@
*/
#include <linux/module.h>
+#include <linux/percpu.h>
#include <linux/sizes.h>
#include <linux/slab.h>
@@ -24,6 +25,9 @@
#include "internal.h"
+/* Sequence number for writes to IA32 QM_EVTSEL */
+static DEFINE_PER_CPU(u64, qm_evtsel_seq);
+
struct rmid_entry {
/*
* Some architectures's resctrl_arch_rmid_read() needs the CLOSID value
@@ -178,7 +182,7 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
{
- u64 msr_val;
+ u64 msr_val, seq;
/*
* As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
@@ -187,9 +191,16 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
* IA32_QM_CTR.data (bits 61:0) reports the monitored data.
* IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
* are error bits.
+ * A per-cpu sequence counter is incremented each time QM_EVTSEL is
+ * written. This is used to detect if this function was interrupted by
+ * another call without re-reading the MSRs. Retry the MSR read when
+ * this happens as the QM_CTR value may belong to a different event.
*/
- wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
- rdmsrl(MSR_IA32_QM_CTR, msr_val);
+ do {
+ seq = this_cpu_inc_return(qm_evtsel_seq);
+ wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+ rdmsrl(MSR_IA32_QM_CTR, msr_val);
+ } while (seq != this_cpu_read(qm_evtsel_seq));
if (msr_val & RMID_VAL_ERROR)
return -EIO;
@@ -225,13 +236,15 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
{
struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
struct arch_mbm_state *am;
+ u64 msr_val;
am = get_arch_mbm_state(hw_dom, rmid, eventid);
if (am) {
memset(am, 0, sizeof(*am));
/* Record any initial, non-zero count value. */
- __rmid_read(rmid, eventid, &am->prev_msr);
+ __rmid_read(rmid, eventid, &msr_val);
+ atomic64_set(&am->prev_msr, msr_val);
}
}
@@ -266,23 +279,35 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ u64 start_msr_val, old_msr_val, msr_val, chunks;
struct arch_mbm_state *am;
- u64 msr_val, chunks;
- int ret;
+ int ret = 0;
if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
return -EINVAL;
+interrupted:
+ am = get_arch_mbm_state(hw_dom, rmid, eventid);
+ if (am)
+ start_msr_val = atomic64_read(&am->prev_msr);
+
ret = __rmid_read(rmid, eventid, &msr_val);
if (ret)
return ret;
am = get_arch_mbm_state(hw_dom, rmid, eventid);
if (am) {
- am->chunks += mbm_overflow_count(am->prev_msr, msr_val,
- hw_res->mbm_width);
- chunks = get_corrected_mbm_count(rmid, am->chunks);
- am->prev_msr = msr_val;
+ old_msr_val = atomic64_cmpxchg(&am->prev_msr, start_msr_val,
+ msr_val);
+ if (old_msr_val != start_msr_val)
+ goto interrupted;
+
+ chunks = mbm_overflow_count(start_msr_val, msr_val,
+ hw_res->mbm_width);
+ atomic64_add(chunks, &am->chunks);
+
+ chunks = get_corrected_mbm_count(rmid,
+ atomic64_read(&am->chunks));
} else {
chunks = msr_val;
}
--
2.39.2
MPAM's PMG bits extend its PARTID space, meaning the same PMG value can be
used for different control groups.
This means once a CLOSID is allocated, all its monitoring ids may still be
dirty, and held in limbo.
Instead of allocating the first free CLOSID, on architectures where
CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID is enabled, search
closid_num_dirty_rmid[] to find the cleanest CLOSID.
The CLOSID found is returned to closid_alloc() for the free list
to be updated.
Signed-off-by: James Morse <[email protected]>
---
Changes since v4:
* Dropped stale section from comment
---
arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/monitor.c | 42 ++++++++++++++++++++++++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 19 +++++++++---
3 files changed, 58 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 94749ee950dd..7c2a1c235480 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -557,5 +557,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
void __init thread_throttle_mode_init(void);
void __init mbm_config_rftype_init(const char *config);
void rdt_staged_configs_clear(void);
+bool closid_allocated(unsigned int closid);
+int resctrl_find_cleanest_closid(void);
#endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 44addc0126fc..c268aa5925c7 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -379,6 +379,48 @@ static struct rmid_entry *resctrl_find_free_rmid(u32 closid)
return ERR_PTR(-ENOSPC);
}
+/**
+ * resctrl_find_cleanest_closid() - Find a CLOSID where all the associated
+ * RMID are clean, or the CLOSID that has
+ * the most clean RMID.
+ *
+ * MPAM's equivalent of RMID are per-CLOSID, meaning a freshly allocated CLOSID
+ * may not be able to allocate clean RMID. To avoid this the allocator will
+ * choose the CLOSID with the most clean RMID.
+ *
+ * When the CLOSID and RMID are independent numbers, the first free CLOSID will
+ * be returned.
+ */
+int resctrl_find_cleanest_closid(void)
+{
+ u32 cleanest_closid = ~0, iter_num_dirty;
+ int i = 0;
+
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
+ return -EIO;
+
+ for (i = 0; i < closids_supported(); i++) {
+ if (closid_allocated(i))
+ continue;
+
+ iter_num_dirty = closid_num_dirty_rmid[i];
+ if (iter_num_dirty == 0)
+ return i;
+
+ if (cleanest_closid == ~0)
+ cleanest_closid = i;
+
+ if (iter_num_dirty < closid_num_dirty_rmid[cleanest_closid])
+ cleanest_closid = i;
+ }
+
+ if (cleanest_closid == ~0)
+ return -ENOSPC;
+ return cleanest_closid;
+}
+
/*
* For MPAM the RMID value is not unique, and has to be considered with
* the CLOSID. The (CLOSID, RMID) pair is allocated on all domains, which
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 4ab9bb018c17..df28b81d2c9c 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -132,11 +132,20 @@ static void closid_init(void)
static int closid_alloc(void)
{
- u32 closid = ffs(closid_free_map);
+ u32 closid;
+ int err;
- if (closid == 0)
- return -ENOSPC;
- closid--;
+ if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) {
+ err = resctrl_find_cleanest_closid();
+ if (err < 0)
+ return err;
+ closid = err;
+ } else {
+ closid = ffs(closid_free_map);
+ if (closid == 0)
+ return -ENOSPC;
+ closid--;
+ }
clear_bit(closid, &closid_free_map);
return closid;
@@ -154,7 +163,7 @@ void closid_free(int closid)
* Return: true if @closid is currently associated with a resource group,
* false if @closid is free
*/
-static bool closid_allocated(unsigned int closid)
+bool closid_allocated(unsigned int closid)
{
return !test_bit(closid, &closid_free_map);
}
--
2.39.2
resctrl enables three static keys depending on the features it has enabled.
Another architecture's context switch code may look different; any
static keys that control it should be buried behind helpers.
Move the alloc/mon logic into arch-specific helpers as a preparatory step
for making the rdt_enable_key's status something the arch code decides.
This means other architectures don't have to mirror the static keys.
Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
arch/x86/include/asm/resctrl.h | 20 ++++++++++++++++++++
arch/x86/kernel/cpu/resctrl/internal.h | 5 -----
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 8 ++++----
3 files changed, 24 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 0986b5208d76..23010fce5f8f 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -42,6 +42,26 @@ DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
+static inline void resctrl_arch_enable_alloc(void)
+{
+ static_branch_enable_cpuslocked(&rdt_alloc_enable_key);
+}
+
+static inline void resctrl_arch_disable_alloc(void)
+{
+ static_branch_disable_cpuslocked(&rdt_alloc_enable_key);
+}
+
+static inline void resctrl_arch_enable_mon(void)
+{
+ static_branch_enable_cpuslocked(&rdt_mon_enable_key);
+}
+
+static inline void resctrl_arch_disable_mon(void)
+{
+ static_branch_disable_cpuslocked(&rdt_mon_enable_key);
+}
+
/*
* __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
*
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 28751579abe6..ac39fecba4ca 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -93,9 +93,6 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
return container_of(kfc, struct rdt_fs_context, kfc);
}
-DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
-DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
-
/**
* struct mon_evt - Entry in the event list of a resource
* @evtid: event id
@@ -453,8 +450,6 @@ extern struct mutex rdtgroup_mutex;
extern struct rdt_hw_resource rdt_resources_all[];
extern struct rdtgroup rdtgroup_default;
-DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
-
extern struct dentry *debugfs_resctrl;
enum resctrl_res_level {
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 68fe2dde8887..91a740e10865 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2542,9 +2542,9 @@ static int rdt_get_tree(struct fs_context *fc)
goto out_psl;
if (rdt_alloc_capable)
- static_branch_enable_cpuslocked(&rdt_alloc_enable_key);
+ resctrl_arch_enable_alloc();
if (rdt_mon_capable)
- static_branch_enable_cpuslocked(&rdt_mon_enable_key);
+ resctrl_arch_enable_mon();
if (rdt_alloc_capable || rdt_mon_capable) {
static_branch_enable_cpuslocked(&rdt_enable_key);
@@ -2817,8 +2817,8 @@ static void rdt_kill_sb(struct super_block *sb)
rdt_pseudo_lock_release();
rdtgroup_default.mode = RDT_MODE_SHAREABLE;
schemata_list_destroy();
- static_branch_disable_cpuslocked(&rdt_alloc_enable_key);
- static_branch_disable_cpuslocked(&rdt_mon_enable_key);
+ resctrl_arch_disable_alloc();
+ resctrl_arch_disable_mon();
static_branch_disable_cpuslocked(&rdt_enable_key);
resctrl_mounted = false;
kernfs_kill_sb(sb);
--
2.39.2
Hello James,
I reviewed this patch series (v5) and it looks fine.
I ran the resctrl selftest on an Intel(R) Xeon(R) Gold 6254 CPU with nohz_full enabled/disabled, and there were no problems.
<reviewed-by:[email protected]>
<tested-by:[email protected]>
Hi James,
On 7/28/2023 9:42 AM, James Morse wrote:
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index de91ca781d9f..44addc0126fc 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -43,6 +43,13 @@ struct rmid_entry {
> */
> static LIST_HEAD(rmid_free_lru);
>
> +/**
> + * @closid_num_dirty_rmid The number of dirty RMID each CLOSID has.
> + * Only allocated when CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID is defined.
> + * Indexed by CLOSID. Protected by rdtgroup_mutex.
> + */
> +static int *closid_num_dirty_rmid;
> +
Will the values ever be negative?
> /**
> * @rmid_limbo_count count of currently unused but (potentially)
> * dirty RMIDs.
> @@ -285,6 +292,17 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> return 0;
> }
>
> +static void limbo_release_entry(struct rmid_entry *entry)
> +{
> + lockdep_assert_held(&rdtgroup_mutex);
> +
> + rmid_limbo_count--;
> + list_add_tail(&entry->list, &rmid_free_lru);
> +
> + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
> + closid_num_dirty_rmid[entry->closid]--;
> +}
> +
> /*
> * Check the RMIDs that are marked as busy for this domain. If the
> * reported LLC occupancy is below the threshold clear the busy bit and
> @@ -321,10 +339,8 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
>
> if (force_free || !rmid_dirty) {
> clear_bit(idx, d->rmid_busy_llc);
> - if (!--entry->busy) {
> - rmid_limbo_count--;
> - list_add_tail(&entry->list, &rmid_free_lru);
> - }
> + if (!--entry->busy)
> + limbo_release_entry(entry);
> }
> cur_idx = idx + 1;
> }
> @@ -391,6 +407,8 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
> u64 val = 0;
> u32 idx;
>
> + lockdep_assert_held(&rdtgroup_mutex);
> +
> idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
>
> entry->busy = 0;
> @@ -416,9 +434,11 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
> }
> put_cpu();
>
> - if (entry->busy)
> + if (entry->busy) {
> rmid_limbo_count++;
> - else
> + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
> + closid_num_dirty_rmid[entry->closid]++;
> + } else
> list_add_tail(&entry->list, &rmid_free_lru);
> }
This new addition breaks the coding style with the last statement
now also needing a brace.
>
> @@ -782,13 +802,28 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
> static int dom_data_init(struct rdt_resource *r)
> {
> u32 idx_limit = resctrl_arch_system_num_rmid_idx();
> + u32 num_closid = resctrl_arch_get_num_closid(r);
> struct rmid_entry *entry = NULL;
> u32 idx;
> int i;
>
> + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) {
> + int *tmp;
> +
> + tmp = kcalloc(num_closid, sizeof(int), GFP_KERNEL);
> + if (!tmp)
> + return -ENOMEM;
> +
> + mutex_lock(&rdtgroup_mutex);
> + closid_num_dirty_rmid = tmp;
> + mutex_unlock(&rdtgroup_mutex);
> + }
> +
It does no harm but I cannot see why the mutex is needed here.
> rmid_ptrs = kcalloc(idx_limit, sizeof(struct rmid_entry), GFP_KERNEL);
> - if (!rmid_ptrs)
> + if (!rmid_ptrs) {
> + kfree(closid_num_dirty_rmid);
> return -ENOMEM;
> + }
>
> for (i = 0; i < idx_limit; i++) {
> entry = &rmid_ptrs[i];
How will this new memory be freed? Actually I cannot find where
rmid_ptrs is freed either .... is a "dom_data_free()" needed?
Reinette
Hi James,
On 7/28/2023 9:42 AM, James Morse wrote:
> @@ -600,7 +600,7 @@ static void clear_closid_rmid(int cpu)
> state->default_rmid = 0;
> state->cur_closid = 0;
> state->cur_rmid = 0;
> - wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
> + wrmsr(MSR_IA32_PQR_ASSOC, 0, RESCTRL_RESERVED_CLOSID);
> }
Can the remaining "0" be replaced with RESCTRL_RESERVED_RMID?
...
> @@ -377,14 +399,17 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
>
> void free_rmid(u32 closid, u32 rmid)
> {
> + u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
> struct rmid_entry *entry;
>
> - if (!rmid)
> - return;
> -
> lockdep_assert_held(&rdtgroup_mutex);
>
> - entry = __rmid_entry(closid, rmid);
> + /* do not allow the default rmid to be free'd */
> + if (idx == resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID,
> + RESCTRL_RESERVED_RMID))
> + return;
> +
Why is this encoding necessary? Can the provided function parameters
not be tested directly against RESCTRL_RESERVED_CLOSID and
RESCTRL_RESERVED_RMID ?
> + entry = __rmid_entry(idx);
>
> if (is_llc_occupancy_enabled())
> add_rmid_to_limbo(entry);
Reinette
Hi James,
On 7/28/2023 9:42 AM, James Morse wrote:
> resctrl_arch_rmid_read() could be called by resctrl in process context,
> and then called by the PMU driver from irq context on the same CPU.
The changelog is written as a bug report of current behavior.
This does not seem to describe current but instead planned future behavior.
> This could cause struct arch_mbm_state's prev_msr value to go backwards,
> leading to the chunks value being incremented multiple times.
>
> The struct arch_mbm_state holds both the previous msr value, and a count
> of the number of chunks. These two fields need to be updated atomically.
> Similarly __rmid_read() must write to one MSR and read from another,
> this must be proteted from re-entrance.
proteted -> protected
>
> Read the prev_msr before accessing the hardware, and cmpxchg() the value
> back. If the value has changed, the whole thing is re-attempted. To protect
> the MSR, __rmid_read() will retry reads for QM_CTR if QM_EVTSEL has changed
> from the selected value.
The latter part of the sentence does not seem to match with what the
patch does.
>
> Signed-off-by: James Morse <[email protected]>
>
> ---
> Changes since v4:
> * Added retry loop in __rmid_read() to protect the CPU MSRs.
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 5 +--
> arch/x86/kernel/cpu/resctrl/monitor.c | 45 ++++++++++++++++++++------
> 2 files changed, 38 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index a32d307292a1..7012f42a82ee 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -2,6 +2,7 @@
> #ifndef _ASM_X86_RESCTRL_INTERNAL_H
> #define _ASM_X86_RESCTRL_INTERNAL_H
>
> +#include <linux/atomic.h>
> #include <linux/resctrl.h>
> #include <linux/sched.h>
> #include <linux/kernfs.h>
> @@ -338,8 +339,8 @@ struct mbm_state {
> * find this struct.
> */
> struct arch_mbm_state {
> - u64 chunks;
> - u64 prev_msr;
> + atomic64_t chunks;
> + atomic64_t prev_msr;
> };
>
> /**
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index f0670795b446..62350bbd23e0 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -16,6 +16,7 @@
> */
>
> #include <linux/module.h>
> +#include <linux/percpu.h>
> #include <linux/sizes.h>
> #include <linux/slab.h>
>
> @@ -24,6 +25,9 @@
>
> #include "internal.h"
>
> +/* Sequence number for writes to IA32 QM_EVTSEL */
> +static DEFINE_PER_CPU(u64, qm_evtsel_seq);
> +
> struct rmid_entry {
> /*
> * Some architectures's resctrl_arch_rmid_read() needs the CLOSID value
> @@ -178,7 +182,7 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
>
> static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> {
> - u64 msr_val;
> + u64 msr_val, seq;
>
> /*
> * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
> @@ -187,9 +191,16 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> * IA32_QM_CTR.data (bits 61:0) reports the monitored data.
> * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
> * are error bits.
> + * A per-cpu sequence counter is incremented each time QM_EVTSEL is
> + * written. This is used to detect if this function was interrupted by
> + * another call without re-reading the MSRs. Retry the MSR read when
> + * this happens as the QM_CTR value may belong to a different event.
> */
> - wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> - rdmsrl(MSR_IA32_QM_CTR, msr_val);
> + do {
> + seq = this_cpu_inc_return(qm_evtsel_seq);
> + wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> + rdmsrl(MSR_IA32_QM_CTR, msr_val);
> + } while (seq != this_cpu_read(qm_evtsel_seq));
>
> if (msr_val & RMID_VAL_ERROR)
> return -EIO;
> @@ -225,13 +236,15 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
> {
> struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> struct arch_mbm_state *am;
> + u64 msr_val;
>
> am = get_arch_mbm_state(hw_dom, rmid, eventid);
> if (am) {
> memset(am, 0, sizeof(*am));
>
> /* Record any initial, non-zero count value. */
> - __rmid_read(rmid, eventid, &am->prev_msr);
> + __rmid_read(rmid, eventid, &msr_val);
> + atomic64_set(&am->prev_msr, msr_val);
> }
> }
>
> @@ -266,23 +279,35 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> {
> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
> + u64 start_msr_val, old_msr_val, msr_val, chunks;
> struct arch_mbm_state *am;
> - u64 msr_val, chunks;
> - int ret;
> + int ret = 0;
>
> if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
> return -EINVAL;
>
> +interrupted:
> + am = get_arch_mbm_state(hw_dom, rmid, eventid);
> + if (am)
> + start_msr_val = atomic64_read(&am->prev_msr);
> +
> ret = __rmid_read(rmid, eventid, &msr_val);
> if (ret)
> return ret;
>
> am = get_arch_mbm_state(hw_dom, rmid, eventid);
> if (am) {
> - am->chunks += mbm_overflow_count(am->prev_msr, msr_val,
> - hw_res->mbm_width);
> - chunks = get_corrected_mbm_count(rmid, am->chunks);
> - am->prev_msr = msr_val;
> + old_msr_val = atomic64_cmpxchg(&am->prev_msr, start_msr_val,
> + msr_val);
> + if (old_msr_val != start_msr_val)
> + goto interrupted;
> +
hmmm ... what if interruption occurs here?
> + chunks = mbm_overflow_count(start_msr_val, msr_val,
> + hw_res->mbm_width);
> + atomic64_add(chunks, &am->chunks);
> +
> + chunks = get_corrected_mbm_count(rmid,
> + atomic64_read(&am->chunks));
> } else {
> chunks = msr_val;
> }
Reinette
Hi, James,
On 7/28/23 09:42, James Morse wrote:
> MPAM's PMG bits extend its PARTID space, meaning the same PMG value can be
> used for different control groups.
>
> This means once a CLOSID is allocated, all its monitoring ids may still be
> dirty, and held in limbo.
>
> Keep track of the number of RMID held in limbo each CLOSID has. This will
> allow a future helper to find the 'cleanest' CLOSID when allocating.
>
> The array is only needed when CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID is
> defined. This will never be the case on x86.
>
> Signed-off-by: James Morse <[email protected]>
> ---
> Changes since v4:
> * Moved closid_num_dirty_rmid[] update under entry->busy check
> * Take the mutex in dom_data_init() as the caller doesn't.
> ---
> arch/x86/kernel/cpu/resctrl/monitor.c | 49 +++++++++++++++++++++++----
> 1 file changed, 42 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index de91ca781d9f..44addc0126fc 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -43,6 +43,13 @@ struct rmid_entry {
> */
> static LIST_HEAD(rmid_free_lru);
>
Better to add:
#if CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID
> +/**
> + * @closid_num_dirty_rmid The number of dirty RMID each CLOSID has.
> + * Only allocated when CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID is defined.
> + * Indexed by CLOSID. Protected by rdtgroup_mutex.
> + */
> +static int *closid_num_dirty_rmid;
#endif
Then the global variable won't exist on x86 to avoid confusion and space.
Some code related to the CONFIG also needs to be changed accordingly.
> +
> /**
> * @rmid_limbo_count count of currently unused but (potentially)
> * dirty RMIDs.
> @@ -285,6 +292,17 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> return 0;
> }
>
> +static void limbo_release_entry(struct rmid_entry *entry)
> +{
> + lockdep_assert_held(&rdtgroup_mutex);
> +
> + rmid_limbo_count--;
> + list_add_tail(&entry->list, &rmid_free_lru);
> +
> + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
> + closid_num_dirty_rmid[entry->closid]--;
Maybe define some helpers (along with other similar ones) in resctrl.h
like this:
#ifdef CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID
static inline void closid_num_dirty_rmid_dec(struct rmid_entry *entry)
{
closid_num_dirty_rmid[entry->closid]--;
}
...
#else
static inline void closid_num_dirty_rmid_dec(struct rmid_entry *unused)
{
}
...
#endif
Then directly call the helper here:
+ closid_num_dirty_rmid_dec(entry);
On x86 this is noop without occupy any space and cleaner code.
> +}
> +
> /*
> * Check the RMIDs that are marked as busy for this domain. If the
> * reported LLC occupancy is below the threshold clear the busy bit and
> @@ -321,10 +339,8 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
>
> if (force_free || !rmid_dirty) {
> clear_bit(idx, d->rmid_busy_llc);
> - if (!--entry->busy) {
> - rmid_limbo_count--;
> - list_add_tail(&entry->list, &rmid_free_lru);
> - }
> + if (!--entry->busy)
> + limbo_release_entry(entry);
> }
> cur_idx = idx + 1;
> }
> @@ -391,6 +407,8 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
> u64 val = 0;
> u32 idx;
>
> + lockdep_assert_held(&rdtgroup_mutex);
> +
> idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
>
> entry->busy = 0;
> @@ -416,9 +434,11 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
> }
> put_cpu();
>
> - if (entry->busy)
> + if (entry->busy) {
> rmid_limbo_count++;
> - else
> + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
> + closid_num_dirty_rmid[entry->closid]++;
Ditto.
> + } else
> list_add_tail(&entry->list, &rmid_free_lru);
> }
>
> @@ -782,13 +802,28 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
> static int dom_data_init(struct rdt_resource *r)
> {
> u32 idx_limit = resctrl_arch_system_num_rmid_idx();
> + u32 num_closid = resctrl_arch_get_num_closid(r);
> struct rmid_entry *entry = NULL;
> u32 idx;
> int i;
>
> + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) {
> + int *tmp;
> +
> + tmp = kcalloc(num_closid, sizeof(int), GFP_KERNEL);
> + if (!tmp)
> + return -ENOMEM;
> +
> + mutex_lock(&rdtgroup_mutex);
data_init() is called in __init. No need to lock here, right?
> + closid_num_dirty_rmid = tmp;
> + mutex_unlock(&rdtgroup_mutex);
> + }
> +
This code can also be defined as a helper in resctrl.h.
> rmid_ptrs = kcalloc(idx_limit, sizeof(struct rmid_entry), GFP_KERNEL);
> - if (!rmid_ptrs)
> + if (!rmid_ptrs) {
> + kfree(closid_num_dirty_rmid);
> return -ENOMEM;
> + }
>
> for (i = 0; i < idx_limit; i++) {
> entry = &rmid_ptrs[i];
Thanks.
-Fenghua
Hi Fenghua,
On 15/08/2023 03:37, Fenghua Yu wrote:
> On 7/28/23 09:42, James Morse wrote:
>> MPAM's PMG bits extend its PARTID space, meaning the same PMG value can be
>> used for different control groups.
>>
>> This means once a CLOSID is allocated, all its monitoring ids may still be
>> dirty, and held in limbo.
>>
>> Keep track of the number of RMID held in limbo each CLOSID has. This will
>> allow a future helper to find the 'cleanest' CLOSID when allocating.
>>
>> The array is only needed when CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID is
>> defined. This will never be the case on x86.
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index de91ca781d9f..44addc0126fc 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -43,6 +43,13 @@ struct rmid_entry {
>> */
>> static LIST_HEAD(rmid_free_lru);
>>
>
> Better to add:
>
> #if CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID
>> +/**
>> + * @closid_num_dirty_rmid The number of dirty RMID each CLOSID has.
>> + * Only allocated when CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID is defined.
>> + * Indexed by CLOSID. Protected by rdtgroup_mutex.
>> + */
>> +static int *closid_num_dirty_rmid;
> #endif
>
> Then the global variable won't exist on x86 to avoid confusion and space.
>
> Some code related to the CONFIG also needs to be changed accordingly.
Uh-huh, that would force me to put #ifdef warts all over the code that accesses that variable.
Modern compilers are really smart. Because this is static, the compiler is free to remove
it if there are no users. All the users are behind if (IS_ENABLED()), meaning the compiler's
dead-code elimination will cull the lot, and this variable too:
morse@eglon:~/kernel/mpam/build_x86_64/fs/resctrl$ nm -s monitor.o | grep closid_num_dirty
morse@eglon:~/kernel/mpam/build_arm64/fs/resctrl$ nm -s monitor.o | grep closid_num_dirty
0000000000000000 b closid_num_dirty_rmid
morse@eglon:~/kernel/mpam/build_arm64/fs/resctrl$
Using #ifdef is not only ugly - it prevents the compiler from seeing all the code, so the
CI build systems get worse coverage.
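To illustrate the pattern outside of resctrl (CONFIG_FOO here is made up):

	static int counter[16];

	void update(int i)
	{
		if (IS_ENABLED(CONFIG_FOO))
			counter[i]++;
	}

With CONFIG_FOO=n the branch becomes 'if (0)', the compiler eliminates it, and because
counter[] is static with no remaining users it vanishes from the object file - which is
what the nm output above shows for closid_num_dirty_rmid - while the code is still
compile-tested on every build.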
>> +
>> /**
>> * @rmid_limbo_count count of currently unused but (potentially)
>> * dirty RMIDs.
>> @@ -285,6 +292,17 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct
>> rdt_domain *d,
>> return 0;
>> }
>> +static void limbo_release_entry(struct rmid_entry *entry)
>> +{
>> + lockdep_assert_held(&rdtgroup_mutex);
>> +
>> + rmid_limbo_count--;
>> + list_add_tail(&entry->list, &rmid_free_lru);
>> +
>> + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
>> + closid_num_dirty_rmid[entry->closid]--;
>
>
> Maybe define some helpers (along with other similar ones) in resctrl.h like this:
>
> #ifdef CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID
> static inline void closid_num_dirty_rmid_dec(struct rmid_entry *entry)
> {
> closid_num_dirty_rmid[entry->closid]--;
> }
> ...
> #else
> static inline void closid_num_dirty_rmid_dec(struct rmid_entry *unused)
> {
> }
> ...
> #endif
>
> Then directly call the helper here:
>
> + closid_num_dirty_rmid_dec(entry);
>
> On x86 this is noop without
and the compiler knows this.
> occupy any space
Literally more lines of code.
> and cleaner code.
Maybe, this would hide the IS_ENABLED() check - but moving that out as a single-use helper
would require closid_num_dirty_rmid[] to be exported from this file - which would prevent
it being optimised out. You'd get the result you were trying to avoid.
>> +}
>> +
>> /*
>> * Check the RMIDs that are marked as busy for this domain. If the
>> * reported LLC occupancy is below the threshold clear the busy bit and
>> @@ -321,10 +339,8 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
>> if (force_free || !rmid_dirty) {
>> clear_bit(idx, d->rmid_busy_llc);
>> - if (!--entry->busy) {
>> - rmid_limbo_count--;
>> - list_add_tail(&entry->list, &rmid_free_lru);
>> - }
>> + if (!--entry->busy)
>> + limbo_release_entry(entry);
>> }
>> cur_idx = idx + 1;
>> }
>> @@ -391,6 +407,8 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
>> u64 val = 0;
>> u32 idx;
>> + lockdep_assert_held(&rdtgroup_mutex);
>> +
>> idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
>> entry->busy = 0;
>> @@ -416,9 +434,11 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
>> }
>> put_cpu();
>> - if (entry->busy)
>> + if (entry->busy) {
>> rmid_limbo_count++;
>> - else
>> + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
>> + closid_num_dirty_rmid[entry->closid]++;
>
> Ditto.
>
>> + } else
>> list_add_tail(&entry->list, &rmid_free_lru);
>> }
>> @@ -782,13 +802,28 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned
>> long delay_ms)
>> static int dom_data_init(struct rdt_resource *r)
>> {
>> u32 idx_limit = resctrl_arch_system_num_rmid_idx();
>> + u32 num_closid = resctrl_arch_get_num_closid(r);
>> struct rmid_entry *entry = NULL;
>> u32 idx;
>> int i;
>> + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) {
>> + int *tmp;
>> +
>> + tmp = kcalloc(num_closid, sizeof(int), GFP_KERNEL);
>> + if (!tmp)
>> + return -ENOMEM;
>> +
>> + mutex_lock(&rdtgroup_mutex);
> data_init() is called in __init. No need to lock here, right?
__init code can still race with other callers - especially as there are
CPUHP_AP_ONLINE_DYN cpuhp callbacks that are expected to sleep.
This is about ensuring all accesses to those global variables are protected by the lock.
This saves me a memory ordering headache.
Thanks,
James
Hi Peter,
On 22/08/2023 09:42, Peter Newman wrote:
> I ran this series successfully against our internal test suites on the
> following processors:
>
> - Intel(R) Xeon(R) Platinum 8173M CPU @ 2.00GHz
> - AMD EPYC 7B12 64-Core Processor
>
> Tested-By: Peter Newman <[email protected]>
Great, Thanks!
James
Hi Reinette,
On 8/25/23 00:01, Reinette Chatre wrote:
> On 8/24/2023 9:55 AM, James Morse wrote:
>> On 09/08/2023 23:35, Reinette Chatre wrote:
>>> On 7/28/2023 9:42 AM, James Morse wrote:
>>>> resctrl_arch_rmid_read() could be called by resctrl in process context,
>>>> and then called by the PMU driver from irq context on the same CPU.
>>>
>>> The changelog is written as a bug report of current behavior.
>>> This does not seem to describe current but instead planned future behavior.
>>
>> I pulled this patch from much later in the tree as it's about to be a problem in this
>> series. I haven't yet decided if its an existing bug in resctrl....
>>
>> ... it doesn't look like this can affect the path through mon_event_read(), as
>> generic_exec_single() masks interrupts.
>> But an incoming IPI from mon_event_read can corrupt the values for the limbo worker, which
>> at the worst would result in early re-use. And the MBM overflow worker ... which would
>> corrupt the value seen by user-space.
>> free_rmid() is equally affected, the outcome for limbo is the same spurious delay or early
>> re-use.
>
> Apologies but these races are not obvious to me. Let me take the first, where the
> race could be between mon_event_read() and the limbo worker. From what I can tell
> mon_event_read() can be called from user space when creating a new monitoring
> group or when viewing data associated with a monitoring group. In both cases
> rdtgroup_mutex is held from the time user space triggers the request until
> all IPIs are completed. Compare that with the limbo worker, cqm_handle_limbo(),
> that also obtains rdtgroup_mutex before it attempts to do its work.
> Considering this example I am not able to see how an incoming IPI from
> mon_event_read() can interfere with the limbo worker since both
> holding rdtgroup_mutex prevents them from running concurrently.
>
> Similarly, the MBM overflow worker takes rdtgroup_mutex, and free_rmid()
> is run with rdtgroup_mutex held.
Yes, sorry - I'd forgotten about that! I'll need to dig into this again.
Part of the problem is I'm looking at this from a different angle - something I haven't described properly: the resctrl_arch_ calls shouldn't depend on a lock that is private to resctrl.
This allows for multiple callers (e.g. resctrl_pmu, which I haven't posted yet), and allows MPAM's
overflow interrupt to eventually be something resctrl could support.
It also allows the resctrl_arch_ calls to have lockdep asserts for their dependencies.
Yes, the rdtgroup_mutex is what prevents this from being a problem today.
(I haven't yet looked at how Peter's series solves the same problem.)
... it may be possible to move this patch back of the 'fold' to live with the PMU code ...
>> I'll change the commit messages to describe that, and float this earlier in the series.
>> The backport will be a problem. This applies cleanly to v6.1.46, but for v5.15.127 there
>> are at least 13 dependencies ... its probably not worth trying to fix as chances are
>> no-one is seeing this happen in reality.
I'll remove that wording around this and mention the mutex.
[...]
>>>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>>>> index f0670795b446..62350bbd23e0 100644
>>>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>>>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>>>> int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>>>> {
>>>> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>>>> struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
>>>> + u64 start_msr_val, old_msr_val, msr_val, chunks;
>>>> struct arch_mbm_state *am;
>>>> - u64 msr_val, chunks;
>>>> - int ret;
>>>> + int ret = 0;
>>>>
>>>> if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
>>>> return -EINVAL;
>>>>
>>>> +interrupted:
>>>> + am = get_arch_mbm_state(hw_dom, rmid, eventid);
>>>> + if (am)
>>>> + start_msr_val = atomic64_read(&am->prev_msr);
>>>> +
>>>> ret = __rmid_read(rmid, eventid, &msr_val);
>>>> if (ret)
>>>> return ret;
>>>>
>>>> am = get_arch_mbm_state(hw_dom, rmid, eventid);
>>>> if (am) {
>>>> - am->chunks += mbm_overflow_count(am->prev_msr, msr_val,
>>>> - hw_res->mbm_width);
>>>> - chunks = get_corrected_mbm_count(rmid, am->chunks);
>>>> - am->prev_msr = msr_val;
>>>> + old_msr_val = atomic64_cmpxchg(&am->prev_msr, start_msr_val,
>>>> + msr_val);
>>>> + if (old_msr_val != start_msr_val)
>>>> + goto interrupted;
>>>> +
>>
>>> hmmm ... what if interruption occurs here?
>>
>> This is after the MSR write/read, so this function can't get a torn value from the
>> hardware. (e.g. reads the wrong RMID). The operations on struct arch_mbm_state are atomic,
>> so are still safe if the function becomes re-entrant.
>>
>> If the re-entrant call accessed the same RMID and the same counter, its atomic64_add()
>> would be based on the prev_msr value this call read - because the above cmpxchg succeeded.
>>
>> (put another way:)
>> The interrupting call returns a lower value, consistent with the first call not having
>> finished yet. The interrupted call returns the correct value, which is larger than it
>> read, because it completed after the interrupting call.
> I see, thank you. If this does end up being needed for a future
> concurrency issue, could you please add a comment describing
> this behavior where a later call can return a lower value and
> why that is ok? It looks to me, as accomplished with the use of
> atomic64_add(), as though this scenario would
> end with the correct arch_mbm_state even though the members
> are not updated atomically as a unit.
Sure my stab at that is:
/*
* At this point the hardware values have been read without
* being interrupted. Interrupts that occur later will read
* the updated am->prev_msr and safely increment am->chunks
* with the new data using atomic64_add().
*/
Thanks,
James