2022-02-17 23:02:29

by James Morse

[permalink] [raw]
Subject: [PATCH v3 00/21] x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes

Changes in this version? Expanded the commit message of patch 12, and
made a few things static, as reported by the kbuild robot.

---
The aim of this series is to insert a split between the parts of the monitor
code that the architecture must implement, and those that are part of the
resctrl filesystem. The eventual aim is to move all filesystem parts out
to live in /fs/resctrl, so that resctrl can be wired up for MPAM.

What's MPAM? See the cover letter of a previous series. [1]

The series adds domain online/offline callbacks to allow the filesystem to
manage some of its structures itself, then moves all the 'mba_sc' behaviour
to be part of the filesystem.
This means another architecture doesn't need to provide an mbps_val array.
As its all software, the resctrl filesystem should be able to do this without
any help from the architecture code.

Finally __rmid_read() is refactored to be the API call that the architecture
provides to read a counter value. All the hardware specific overflow detection,
scaling and value correction should occur behind this helper.


This series is based on v5.17-rc1, and can be retrieved from:
git://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git mpam/resctrl_monitors_in_bytes/v3

[0] git://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git mpam/resctrl_merge_cdp/v7
[1] https://lore.kernel.org/lkml/[email protected]/

[v1] https://lore.kernel.org/lkml/[email protected]/
[v2] https://lore.kernel.org/lkml/[email protected]/

Thanks,

James Morse (21):
x86/resctrl: Kill off alloc_enabled
x86/resctrl: Merge mon_capable and mon_enabled
x86/resctrl: Add domain online callback for resctrl work
x86/resctrl: Group struct rdt_hw_domain cleanup
x86/resctrl: Add domain offline callback for resctrl work
x86/resctrl: Remove set_mba_sc()s control array re-initialisation
x86/resctrl: Create mba_sc configuration in the rdt_domain
x86/resctrl: Switch over to the resctrl mbps_val list
x86/resctrl: Remove architecture copy of mbps_val
x86/resctrl: Abstract and use supports_mba_mbps()
x86/resctrl: Allow update_mba_bw() to update controls directly
x86/resctrl: Calculate bandwidth from the previous __mon_event_count()
chunks
x86/recstrl: Add per-rmid arch private storage for overflow and chunks
x86/recstrl: Allow per-rmid arch private storage to be reset
x86/resctrl: Abstract __rmid_read()
x86/resctrl: Pass the required parameters into
resctrl_arch_rmid_read()
x86/resctrl: Move mbm_overflow_count() into resctrl_arch_rmid_read()
x86/resctrl: Move get_corrected_mbm_count() into
resctrl_arch_rmid_read()
x86/resctrl: Rename and change the units of resctrl_cqm_threshold
x86/resctrl: Add resctrl_rmid_realloc_limit to abstract x86's
boot_cpu_data
x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes

arch/x86/kernel/cpu/resctrl/core.c | 117 ++++--------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 75 +++++---
arch/x86/kernel/cpu/resctrl/internal.h | 61 +++----
arch/x86/kernel/cpu/resctrl/monitor.c | 211 ++++++++++++++--------
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 205 +++++++++++++++++----
include/linux/resctrl.h | 62 ++++++-
7 files changed, 478 insertions(+), 255 deletions(-)

--
2.30.2


2022-02-17 23:10:59

by James Morse

[permalink] [raw]
Subject: [PATCH v3 11/21] x86/resctrl: Allow update_mba_bw() to update controls directly

update_mba_bw() calculates a new control value for the MBA resource
based on the user provided mbps_val and the current measured
bandwidth. Some control values need remapping by delay_bw_map().

It does this by calling wrmsrl() directly. This needs splitting
up to be done by an architecture specific helper, so that the
remainder can eventually be moved to /fs/.

Add resctrl_arch_update_one() to apply one configuration value
to the provided resource and domain. This avoids the staging
and cross-calling that is only needed with changes made by
user-space. delay_bw_map() moves to be part of the arch code,
to maintain the 'percentage control' view of MBA resources
in resctrl.

Signed-off-by: James Morse <[email protected]>
---
Chagnes since v2:
* Call msr_update() directly from resctrl_arch_update_one().

Changes since v1:
* Capitalisation
---
arch/x86/kernel/cpu/resctrl/core.c | 2 +-
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 21 +++++++++++++++++++++
arch/x86/kernel/cpu/resctrl/internal.h | 1 -
arch/x86/kernel/cpu/resctrl/monitor.c | 13 ++++---------
include/linux/resctrl.h | 8 ++++++++
5 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index f0e2820af475..90ebb7d71af2 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -296,7 +296,7 @@ mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
* that can be written to QOS_MSRs.
* There are currently no SKUs which support non linear delay values.
*/
-u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
+static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
{
if (r->membw.delay_linear)
return MAX_MBA_BW - bw;
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 9f45207a6c74..ece3a1e0e6f2 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -282,6 +282,27 @@ static bool apply_config(struct rdt_hw_domain *hw_dom,
return false;
}

+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+ u32 closid, enum resctrl_conf_type t, u32 cfg_val)
+{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ u32 idx = get_config_index(closid, t);
+ struct msr_param msr_param;
+
+ if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+ return -EINVAL;
+
+ hw_dom->ctrl_val[idx] = cfg_val;
+
+ msr_param.res = r;
+ msr_param.low = idx;
+ msr_param.high = idx + 1;
+ hw_res->msr_update(d, &msr_param, r);
+
+ return 0;
+}
+
int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
{
struct resctrl_staged_config *cfg;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 373aaba53ecd..3b9e43ba7590 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -527,7 +527,6 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom,
void mbm_handle_overflow(struct work_struct *work);
void __init intel_rdt_mbm_apply_quirk(void);
bool is_mba_sc(struct rdt_resource *r);
-u32 delay_bw_map(unsigned long bw, struct rdt_resource *r);
void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
void cqm_handle_limbo(struct work_struct *work);
bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 5cc1e6b229d4..ac1a2e8998bb 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -420,10 +420,8 @@ void mon_event_count(void *info)
*/
static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
{
- u32 closid, rmid, cur_msr, cur_msr_val, new_msr_val;
+ u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
- struct rdt_hw_resource *hw_r_mba;
- struct rdt_hw_domain *hw_dom_mba;
u32 cur_bw, delta_bw, user_bw;
struct rdt_resource *r_mba;
struct rdt_domain *dom_mba;
@@ -433,8 +431,8 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
if (!is_mbm_local_enabled())
return;

- hw_r_mba = &rdt_resources_all[RDT_RESOURCE_MBA];
- r_mba = &hw_r_mba->r_resctrl;
+ r_mba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
+
closid = rgrp->closid;
rmid = rgrp->mon.rmid;
pmbm_data = &dom_mbm->mbm_local[rmid];
@@ -444,7 +442,6 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
pr_warn_once("Failure to get domain for MBA update\n");
return;
}
- hw_dom_mba = resctrl_to_arch_dom(dom_mba);

cur_bw = pmbm_data->prev_bw;
user_bw = dom_mba->mbps_val[closid];
@@ -486,9 +483,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
return;
}

- cur_msr = hw_r_mba->msr_base + closid;
- wrmsrl(cur_msr, delay_bw_map(new_msr_val, r_mba));
- hw_dom_mba->ctrl_val[closid] = new_msr_val;
+ resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);

/*
* Delta values are updated dynamically package wise for each
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 46ab9fb5562e..84e815cb3be6 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -197,6 +197,14 @@ struct resctrl_schema {
/* The number of closid supported by this resource regardless of CDP */
u32 resctrl_arch_get_num_closid(struct rdt_resource *r);
int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
+
+/*
+ * Update the ctrl_val and apply this config right now.
+ * Must be called on one of the domain's CPUs.
+ */
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+ u32 closid, enum resctrl_conf_type t, u32 cfg_val);
+
u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, enum resctrl_conf_type type);
int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
--
2.30.2

2022-02-17 23:10:59

by James Morse

[permalink] [raw]
Subject: [PATCH v3 01/21] x86/resctrl: Kill off alloc_enabled

rdt_resources_all[] used to have extra entries for L2CODE/L2DATA.
These were hidden from resctrl by the alloc_enabled value.

Now that the L2/L2CODE/L2DATA resources have been merged together,
alloc_enabled doesn't mean anything, it always has the same value as
alloc_capable which indicates which indicates allocation is supported
by this resource.

Remove alloc_enabled and its helpers.

Signed-off-by: James Morse <[email protected]>
---
Changes since v1:
* Fixed comment in rdtgroup_create_info_dir()
---
arch/x86/kernel/cpu/resctrl/core.c | 4 ----
arch/x86/kernel/cpu/resctrl/internal.h | 4 ----
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 6 +++---
include/linux/resctrl.h | 2 --
5 files changed, 4 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index bb1c3f5f60c8..2f87177f1f69 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -147,7 +147,6 @@ static inline void cache_alloc_hsw_probe(void)
r->cache.shareable_bits = 0xc0000;
r->cache.min_cbm_bits = 2;
r->alloc_capable = true;
- r->alloc_enabled = true;

rdt_alloc_capable = true;
}
@@ -211,7 +210,6 @@ static bool __get_mem_config_intel(struct rdt_resource *r)
thread_throttle_mode_init();

r->alloc_capable = true;
- r->alloc_enabled = true;

return true;
}
@@ -242,7 +240,6 @@ static bool __rdt_get_mem_config_amd(struct rdt_resource *r)
r->data_width = 4;

r->alloc_capable = true;
- r->alloc_enabled = true;

return true;
}
@@ -261,7 +258,6 @@ static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
r->cache.shareable_bits = ebx & r->default_ctrl;
r->data_width = (r->cache.cbm_len + 3) / 4;
r->alloc_capable = true;
- r->alloc_enabled = true;
}

static void rdt_get_cdp_config(int level)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 1d647188a43b..53f3d275a98f 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -459,10 +459,6 @@ int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
for_each_rdt_resource(r) \
if (r->mon_capable)

-#define for_each_alloc_enabled_rdt_resource(r) \
- for_each_rdt_resource(r) \
- if (r->alloc_enabled)
-
#define for_each_mon_enabled_rdt_resource(r) \
for_each_rdt_resource(r) \
if (r->mon_enabled)
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index db813f819ad6..f810969ced4b 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -835,7 +835,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
* First determine which cpus have pseudo-locked regions
* associated with them.
*/
- for_each_alloc_enabled_rdt_resource(r) {
+ for_each_alloc_capable_rdt_resource(r) {
list_for_each_entry(d_i, &r->domains, list) {
if (d_i->plr)
cpumask_or(cpu_with_psl, cpu_with_psl,
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b57b3db9a6a7..e327f8d1c8a3 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1756,7 +1756,7 @@ static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn)
if (ret)
goto out_destroy;

- /* loop over enabled controls, these are all alloc_enabled */
+ /* loop over enabled controls, these are all alloc_capable */
list_for_each_entry(s, &resctrl_schema_all, list) {
r = s->res;
fflags = r->fflags | RF_CTRL_INFO;
@@ -2106,7 +2106,7 @@ static int schemata_list_create(void)
struct rdt_resource *r;
int ret = 0;

- for_each_alloc_enabled_rdt_resource(r) {
+ for_each_alloc_capable_rdt_resource(r) {
if (resctrl_arch_get_cdp_enabled(r->rid)) {
ret = schemata_list_add(r, CDP_CODE);
if (ret)
@@ -2452,7 +2452,7 @@ static void rdt_kill_sb(struct super_block *sb)
set_mba_sc(false);

/*Put everything back to default values. */
- for_each_alloc_enabled_rdt_resource(r)
+ for_each_alloc_capable_rdt_resource(r)
reset_all_ctrls(r);
cdp_disable_all();
rmdir_all_sub();
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 21deb5212bbd..386ab3a41500 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -130,7 +130,6 @@ struct resctrl_schema;
/**
* struct rdt_resource - attributes of a resctrl resource
* @rid: The index of the resource
- * @alloc_enabled: Is allocation enabled on this machine
* @mon_enabled: Is monitoring enabled for this feature
* @alloc_capable: Is allocation available on this machine
* @mon_capable: Is monitor feature available on this machine
@@ -150,7 +149,6 @@ struct resctrl_schema;
*/
struct rdt_resource {
int rid;
- bool alloc_enabled;
bool mon_enabled;
bool alloc_capable;
bool mon_capable;
--
2.30.2

2022-02-17 23:22:46

by James Morse

[permalink] [raw]
Subject: [PATCH v3 10/21] x86/resctrl: Abstract and use supports_mba_mbps()

To determine whether the mba_MBps option to resctrl should be supported,
resctrl tests the boot CPUs' x86_vendor.

This isn't portable, and needs abstracting behind a helper so this check
can be part of the filesystem code that moves to /fs/.

Re-use the tests set_mba_sc() does to determine if the mba_sc is supported
on this system. An 'alloc_capable' test is added so that support for the
controls isn't implied by the 'delay_linear' property, which is always
true for MPAM.

Reviewed-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
Changes since v1:
* Capitalisation
* Added MPAM example in commit message
* Fixed supports_mba_mbps() logic error in rdt_parse_param()
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 22ebfdb4472d..7ec089d72ab7 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1922,11 +1922,21 @@ static void mba_sc_domain_destroy(struct rdt_resource *r,
}

/*
- * Enable or disable the MBA software controller
- * which helps user specify bandwidth in MBps.
* MBA software controller is supported only if
* MBM is supported and MBA is in linear scale.
*/
+static bool supports_mba_mbps(void)
+{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
+
+ return (is_mbm_enabled() &&
+ r->alloc_capable && is_mba_linear());
+}
+
+/*
+ * Enable or disable the MBA software controller
+ * which helps user specify bandwidth in MBps.
+ */
static int set_mba_sc(bool mba_sc)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
@@ -1934,8 +1944,7 @@ static int set_mba_sc(bool mba_sc)
struct rdt_domain *d;
int i;

- if (!is_mbm_enabled() || !is_mba_linear() ||
- mba_sc == is_mba_sc(r))
+ if (!supports_mba_mbps() || mba_sc == is_mba_sc(r))
return -EINVAL;

r->membw.mba_sc = mba_sc;
@@ -2295,7 +2304,7 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
ctx->enable_cdpl2 = true;
return 0;
case Opt_mba_mbps:
- if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+ if (!supports_mba_mbps())
return -EINVAL;
ctx->enable_mba_mbps = true;
return 0;
--
2.30.2

2022-02-17 23:31:44

by James Morse

[permalink] [raw]
Subject: [PATCH v3 15/21] x86/resctrl: Abstract __rmid_read()

__rmid_read() selects the specified eventid and returns the counter
value from the MSR. The error handling is architecture specific, and
handled by the callers, rdtgroup_mondata_show() and __mon_event_count().

Error handling should be handled by architecture specific code, as
a different architecture may have different requirements. MPAM's
counters can report that they are 'not ready', requiring a second
read after a short delay. This should be hidden from resctrl.

Make __rmid_read() the architecture specific function for reading
a counter. Rename it resctrl_arch_rmid_read() and move the error
handling into it.

Signed-off-by: James Morse <[email protected]>
---
Changes since v2:
* Capitalisation
* Stray newline restored
* Removed rr->val set to the error value, and replaced it with clearing the
the error to hide Unavailable from monitor group reads. (and added a block
comment).

Changes since v1:
* Return EINVAL from the impossible case in __mon_event_count() instead
of an x86 hardware specific value.
---
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 4 +-
arch/x86/kernel/cpu/resctrl/internal.h | 1 +
arch/x86/kernel/cpu/resctrl/monitor.c | 53 +++++++++++++++--------
include/linux/resctrl.h | 1 +
4 files changed, 39 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index ece3a1e0e6f2..d3f7eb2ac14b 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -579,9 +579,9 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)

mon_event_read(&rr, r, d, rdtgrp, evtid, false);

- if (rr.val & RMID_VAL_ERROR)
+ if (rr.err == -EIO)
seq_puts(m, "Error\n");
- else if (rr.val & RMID_VAL_UNAVAIL)
+ else if (rr.err == -EINVAL)
seq_puts(m, "Unavailable\n");
else
seq_printf(m, "%llu\n", rr.val * hw_res->mon_scale);
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 726065bc02e4..e55d1774ce9c 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -94,6 +94,7 @@ struct rmid_read {
struct rdt_domain *d;
enum resctrl_event_id evtid;
bool first;
+ int err;
u64 val;
};

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 431acc3d3eb8..b6ad290fda8d 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -167,9 +167,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
memset(am, 0, sizeof(*am));
}

-static u64 __rmid_read(u32 rmid, enum resctrl_event_id eventid)
+int resctrl_arch_rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
{
- u64 val;
+ u64 msr_val;

/*
* As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
@@ -180,14 +180,24 @@ static u64 __rmid_read(u32 rmid, enum resctrl_event_id eventid)
* are error bits.
*/
wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
- rdmsrl(MSR_IA32_QM_CTR, val);
+ rdmsrl(MSR_IA32_QM_CTR, msr_val);

- return val;
+ if (msr_val & RMID_VAL_ERROR)
+ return -EIO;
+ if (msr_val & RMID_VAL_UNAVAIL)
+ return -EINVAL;
+
+ *val = msr_val;
+
+ return 0;
}

static bool rmid_dirty(struct rmid_entry *entry)
{
- u64 val = __rmid_read(entry->rmid, QOS_L3_OCCUP_EVENT_ID);
+ u64 val = 0;
+
+ if (resctrl_arch_rmid_read(entry->rmid, QOS_L3_OCCUP_EVENT_ID, &val))
+ return true;

return val >= resctrl_cqm_threshold;
}
@@ -259,8 +269,8 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
{
struct rdt_resource *r;
struct rdt_domain *d;
- int cpu;
- u64 val;
+ int cpu, err;
+ u64 val = 0;

r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;

@@ -268,8 +278,10 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
cpu = get_cpu();
list_for_each_entry(d, &r->domains, list) {
if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
- val = __rmid_read(entry->rmid, QOS_L3_OCCUP_EVENT_ID);
- if (val <= resctrl_cqm_threshold)
+ err = resctrl_arch_rmid_read(entry->rmid,
+ QOS_L3_OCCUP_EVENT_ID,
+ &val);
+ if (err || val <= resctrl_cqm_threshold)
continue;
}

@@ -319,15 +331,15 @@ static u64 __mon_event_count(u32 rmid, struct rmid_read *rr)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(rr->r);
struct mbm_state *m;
- u64 chunks, tval;
+ u64 chunks, tval = 0;

if (rr->first)
resctrl_arch_reset_rmid(rr->r, rr->d, rmid, rr->evtid);

- tval = __rmid_read(rmid, rr->evtid);
- if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) {
- return tval;
- }
+ rr->err = resctrl_arch_rmid_read(rmid, rr->evtid, &tval);
+ if (rr->err)
+ return rr->err;
+
switch (rr->evtid) {
case QOS_L3_OCCUP_EVENT_ID:
rr->val += tval;
@@ -343,7 +355,7 @@ static u64 __mon_event_count(u32 rmid, struct rmid_read *rr)
* Code would never reach here because an invalid
* event id would fail the __rmid_read.
*/
- return RMID_VAL_ERROR;
+ return -EINVAL;
}

if (rr->first) {
@@ -419,9 +431,14 @@ void mon_event_count(void *info)
}
}

- /* Report error if none of rmid_reads are successful */
- if (ret_val)
- rr->val = ret_val;
+ /*
+ * __mon_event_count() calls for newly created monitor groups may
+ * report -EINVAL/Unavailable if the monitor hasn't seen any traffic.
+ * If the first call for the control group succeed, discard any error
+ * set by reads of monitor groups.
+ */
+ if (ret_val == 0)
+ rr->err = 0;
}

/*
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 0b48239f5920..70112dbfa128 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -219,6 +219,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, enum resctrl_conf_type type);
int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_arch_rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *res);

/**
* resctrl_arch_reset_rmid() - Reset any private state associated with rmid
--
2.30.2

2022-02-17 23:34:10

by James Morse

[permalink] [raw]
Subject: [PATCH v3 18/21] x86/resctrl: Move get_corrected_mbm_count() into resctrl_arch_rmid_read()

resctrl_arch_rmid_read() is intended as the function that an
architecture agnostic resctrl filesystem driver can use to
read a value in bytes from a counter. Currently the function returns
the MBM values in chunks directly from hardware. When reading a bandwidth
counter, get_corrected_mbm_count() must be used to correct the
value read.

get_corrected_mbm_count() is architecture specific, this work should be
done in resctrl_arch_rmid_read().

Move the function calls. This allows the resctrl filesystems's chunks
value to be removed in favour of the architecture private version.

Signed-off-by: James Morse <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 4 ++--
arch/x86/kernel/cpu/resctrl/monitor.c | 8 ++++----
2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 5ab8b448fb7e..c8d52fbee8cd 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -280,14 +280,12 @@ struct rftype {

/**
* struct mbm_state - status for each MBM counter in each domain
- * @chunks: Total data moved (multiply by rdt_group.mon_scale to get bytes)
* @prev_bw_chunks: Previous chunks value read when for bandwidth calculation
* @prev_bw: The most recent bandwidth in MBps
* @delta_bw: Difference between the current and previous bandwidth
* @delta_comp: Indicates whether to compute the delta_bw
*/
struct mbm_state {
- u64 chunks;
u64 prev_bw_chunks;
u32 prev_bw;
u32 delta_bw;
@@ -297,10 +295,12 @@ struct mbm_state {
/**
* struct arch_mbm_state - values used to compute resctrl_arch_rmid_read()s
* return value.
+ * @chunks: Total data moved (multiply by rdt_group.mon_scale to get bytes)
* @prev_msr: Value of IA32_QM_CTR last time it was read for the RMID used to
* find this struct.
*/
struct arch_mbm_state {
+ u64 chunks;
u64 prev_msr;
};

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index e04bc97c710d..e89ae648046b 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -204,7 +204,9 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,

am = get_arch_mbm_state(hw_dom, rmid, eventid);
if (am) {
- *val = mbm_overflow_count(am->prev_msr, msr_val, hw_res->mbm_width);
+ am->chunks += mbm_overflow_count(am->prev_msr, msr_val,
+ hw_res->mbm_width);
+ *val = get_corrected_mbm_count(rmid, am->chunks);
am->prev_msr = msr_val;
} else {
*val = msr_val;
@@ -374,9 +376,7 @@ static u64 __mon_event_count(u32 rmid, struct rmid_read *rr)
return 0;
}

- m->chunks += tval;
-
- rr->val += get_corrected_mbm_count(rmid, m->chunks);
+ rr->val += tval;

return 0;
}
--
2.30.2

2022-02-17 23:35:46

by James Morse

[permalink] [raw]
Subject: [PATCH v3 17/21] x86/resctrl: Move mbm_overflow_count() into resctrl_arch_rmid_read()

resctrl_arch_rmid_read() is intended as the function that an
architecture agnostic resctrl filesystem driver can use to
read a value in bytes from a counter. Currently the function returns
the MBM values in chunks directly from hardware. When reading a bandwidth
counter, mbm_overflow_count() must be used to correct for any possible
overflow.

mbm_overflow_count() is architecture specific, its behaviour should
be part of resctrl_arch_rmid_read().

Move the mbm_overflow_count() calls into resctrl_arch_rmid_read().
This allows the resctrl filesystems's prev_msr to be removed in
favour of the architecture private version.

Signed-off-by: James Morse <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 2 --
arch/x86/kernel/cpu/resctrl/monitor.c | 35 +++++++++++++++-----------
2 files changed, 20 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index e55d1774ce9c..5ab8b448fb7e 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -281,7 +281,6 @@ struct rftype {
/**
* struct mbm_state - status for each MBM counter in each domain
* @chunks: Total data moved (multiply by rdt_group.mon_scale to get bytes)
- * @prev_msr: Value of IA32_QM_CTR for this RMID last time we read it
* @prev_bw_chunks: Previous chunks value read when for bandwidth calculation
* @prev_bw: The most recent bandwidth in MBps
* @delta_bw: Difference between the current and previous bandwidth
@@ -289,7 +288,6 @@ struct rftype {
*/
struct mbm_state {
u64 chunks;
- u64 prev_msr;
u64 prev_bw_chunks;
u32 prev_bw;
u32 delta_bw;
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 277c22f8c976..e04bc97c710d 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -167,9 +167,20 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
memset(am, 0, sizeof(*am));
}

+static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
+{
+ u64 shift = 64 - width, chunks;
+
+ chunks = (cur_msr << shift) - (prev_msr << shift);
+ return chunks >> shift;
+}
+
int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
u32 rmid, enum resctrl_event_id eventid, u64 *val)
{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct arch_mbm_state *am;
u64 msr_val;

if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
@@ -191,7 +202,13 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
if (msr_val & RMID_VAL_UNAVAIL)
return -EINVAL;

- *val = msr_val;
+ am = get_arch_mbm_state(hw_dom, rmid, eventid);
+ if (am) {
+ *val = mbm_overflow_count(am->prev_msr, msr_val, hw_res->mbm_width);
+ am->prev_msr = msr_val;
+ } else {
+ *val = msr_val;
+ }

return 0;
}
@@ -322,19 +339,10 @@ void free_rmid(u32 rmid)
list_add_tail(&entry->list, &rmid_free_lru);
}

-static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
-{
- u64 shift = 64 - width, chunks;
-
- chunks = (cur_msr << shift) - (prev_msr << shift);
- return chunks >> shift;
-}
-
static u64 __mon_event_count(u32 rmid, struct rmid_read *rr)
{
- struct rdt_hw_resource *hw_res = resctrl_to_arch_res(rr->r);
struct mbm_state *m;
- u64 chunks, tval = 0;
+ u64 tval = 0;

if (rr->first)
resctrl_arch_reset_rmid(rr->r, rr->d, rmid, rr->evtid);
@@ -363,13 +371,10 @@ static u64 __mon_event_count(u32 rmid, struct rmid_read *rr)

if (rr->first) {
memset(m, 0, sizeof(struct mbm_state));
- m->prev_msr = tval;
return 0;
}

- chunks = mbm_overflow_count(m->prev_msr, tval, hw_res->mbm_width);
- m->chunks += chunks;
- m->prev_msr = tval;
+ m->chunks += tval;

rr->val += get_corrected_mbm_count(rmid, m->chunks);

--
2.30.2

2022-02-17 23:44:00

by James Morse

[permalink] [raw]
Subject: [PATCH v3 12/21] x86/resctrl: Calculate bandwidth from the previous __mon_event_count() chunks

mbm_bw_count() is only called by the mbm_handle_overflow() worker once a
second. It reads the hardware register, calculates the bandwidth and
updates m->prev_bw_msr which is used to hold the previous hardware register
value.

Operating directly on hardware register values makes it difficult to make
this code architecture independent, so that it can be moved to /fs/,
making the mba_sc feature something resctrl supports with no additional
support from the architecture.
Prior to calling mbm_bw_count(), mbm_update() reads from the same hardware
register using __mon_event_count().

Change mbm_bw_count() to use the current chunks value most recently saved by
__mon_event_count(). This removes an extra call to __rmid_read().
Instead of using m->prev_msr to calculate the number of chunks seen,
use the rr->val that was updated by __mon_event_count(). This removes a extra
calls to mbm_overflow_count() and get_corrected_mbm_count().
Calculating bandwidth like this means mbm_bw_count() no longer operates
on hardware register values directly.

Signed-off-by: James Morse <[email protected]>
---
Changes since v2:
* Expanded commit message

Changes since v1:
* This patch was rewritten
---
arch/x86/kernel/cpu/resctrl/internal.h | 4 ++--
arch/x86/kernel/cpu/resctrl/monitor.c | 24 +++++++++++++++---------
2 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 3b9e43ba7590..c50c8911ef59 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -289,7 +289,7 @@ struct rftype {
* struct mbm_state - status for each MBM counter in each domain
* @chunks: Total data moved (multiply by rdt_group.mon_scale to get bytes)
* @prev_msr: Value of IA32_QM_CTR for this RMID last time we read it
- * @prev_bw_msr:Value of previous IA32_QM_CTR for bandwidth counting
+ * @prev_bw_chunks: Previous chunks value read when for bandwidth calculation
* @prev_bw: The most recent bandwidth in MBps
* @delta_bw: Difference between the current and previous bandwidth
* @delta_comp: Indicates whether to compute the delta_bw
@@ -297,7 +297,7 @@ struct rftype {
struct mbm_state {
u64 chunks;
u64 prev_msr;
- u64 prev_bw_msr;
+ u64 prev_bw_chunks;
u32 prev_bw;
u32 delta_bw;
bool delta_comp;
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ac1a2e8998bb..8ae375e29256 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -315,7 +315,7 @@ static u64 __mon_event_count(u32 rmid, struct rmid_read *rr)

if (rr->first) {
memset(m, 0, sizeof(struct mbm_state));
- m->prev_bw_msr = m->prev_msr = tval;
+ m->prev_msr = tval;
return 0;
}

@@ -329,27 +329,32 @@ static u64 __mon_event_count(u32 rmid, struct rmid_read *rr)
}

/*
+ * mbm_bw_count() - Update bw count from values previously read by
+ * __mon_event_count().
+ * @rmid: The rmid used to identify the cached mbm_state.
+ * @rr: The struct rmid_read populated by __mon_event_count().
+ *
* Supporting function to calculate the memory bandwidth
- * and delta bandwidth in MBps.
+ * and delta bandwidth in MBps. The chunks value previously read by
+ * __mon_event_count() is compared with the chunks value from the previous
+ * invocation. This must be called oncer per second to maintain values in MBps.
*/
static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(rr->r);
struct mbm_state *m = &rr->d->mbm_local[rmid];
- u64 tval, cur_bw, chunks;
+ u64 cur_bw, chunks, cur_chunks;

- tval = __rmid_read(rmid, rr->evtid);
- if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
- return;
+ cur_chunks = rr->val;
+ chunks = cur_chunks - m->prev_bw_chunks;
+ m->prev_bw_chunks = cur_chunks;

- chunks = mbm_overflow_count(m->prev_bw_msr, tval, hw_res->mbm_width);
- cur_bw = (get_corrected_mbm_count(rmid, chunks) * hw_res->mon_scale) >> 20;
+ cur_bw = (chunks * hw_res->mon_scale) >> 20;

if (m->delta_comp)
m->delta_bw = abs(cur_bw - m->prev_bw);
m->delta_comp = false;
m->prev_bw = cur_bw;
- m->prev_bw_msr = tval;
}

/*
@@ -509,6 +514,7 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
rr.first = false;
rr.r = r;
rr.d = d;
+ rr.val = 0;

/*
* This is protected from concurrent reads from user
--
2.30.2

2022-02-17 23:56:44

by James Morse

[permalink] [raw]
Subject: [PATCH v3 02/21] x86/resctrl: Merge mon_capable and mon_enabled

mon_enabled and mon_capable are always set as a pair by
rdt_get_mon_l3_config().

There is no point having two values.

Merge them together.

Signed-off-by: James Morse <[email protected]>
---
Changes since v1:
* Removed stray cdp_capable changes.
---
arch/x86/kernel/cpu/resctrl/internal.h | 4 ----
arch/x86/kernel/cpu/resctrl/monitor.c | 1 -
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 8 ++++----
include/linux/resctrl.h | 2 --
4 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 53f3d275a98f..8828b5c1b6d2 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -459,10 +459,6 @@ int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
for_each_rdt_resource(r) \
if (r->mon_capable)

-#define for_each_mon_enabled_rdt_resource(r) \
- for_each_rdt_resource(r) \
- if (r->mon_enabled)
-
/* CPUID.(EAX=10H, ECX=ResID=1).EAX */
union cpuid_0x10_1_eax {
struct {
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index eaf25a234ff5..497cadf3285d 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -717,7 +717,6 @@ int rdt_get_mon_l3_config(struct rdt_resource *r)
l3_mon_evt_init(r);

r->mon_capable = true;
- r->mon_enabled = true;

return 0;
}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e327f8d1c8a3..e243c7d15b81 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1765,7 +1765,7 @@ static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn)
goto out_destroy;
}

- for_each_mon_enabled_rdt_resource(r) {
+ for_each_mon_capable_rdt_resource(r) {
fflags = r->fflags | RF_MON_INFO;
sprintf(name, "%s_MON", r->name);
ret = rdtgroup_mkdir_info_resdir(r, name, fflags);
@@ -2504,7 +2504,7 @@ void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, unsigned int dom_id)
struct rdtgroup *prgrp, *crgrp;
char name[32];

- if (!r->mon_enabled)
+ if (!r->mon_capable)
return;

list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
@@ -2572,7 +2572,7 @@ void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
struct rdtgroup *prgrp, *crgrp;
struct list_head *head;

- if (!r->mon_enabled)
+ if (!r->mon_capable)
return;

list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
@@ -2642,7 +2642,7 @@ static int mkdir_mondata_all(struct kernfs_node *parent_kn,
* Create the subdirectories for each domain. Note that all events
* in a domain like L3 are grouped into a resource whose domain is L3
*/
- for_each_mon_enabled_rdt_resource(r) {
+ for_each_mon_capable_rdt_resource(r) {
ret = mkdir_mondata_subdir_alldom(kn, r, prgrp);
if (ret)
goto out_destroy;
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 386ab3a41500..8180c539800d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -130,7 +130,6 @@ struct resctrl_schema;
/**
* struct rdt_resource - attributes of a resctrl resource
* @rid: The index of the resource
- * @mon_enabled: Is monitoring enabled for this feature
* @alloc_capable: Is allocation available on this machine
* @mon_capable: Is monitor feature available on this machine
* @num_rmid: Number of RMIDs available
@@ -149,7 +148,6 @@ struct resctrl_schema;
*/
struct rdt_resource {
int rid;
- bool mon_enabled;
bool alloc_capable;
bool mon_capable;
int num_rmid;
--
2.30.2

2022-02-18 00:13:23

by James Morse

[permalink] [raw]
Subject: [PATCH v3 05/21] x86/resctrl: Add domain offline callback for resctrl work

Because domains are exposed to user-space via resctrl, the filesystem
must update its state when CPU hotplug callbacks are triggered.

Some of this work is common to any architecture that would support
resctrl, but the work is tied up with the architecture code to
free the memory.

Move the monitor subdir removal and the cancelling of the mbm/limbo
works into a new resctrl_offline_domain() call. These bits are not
specific to the architecture. Grouping them in one function allows
that code to be moved to /fs/ and re-used by another architecture.

Signed-off-by: James Morse <[email protected]>
---
Changes since v2:
* Moved kfree()ing to domain_destroy_mon_state() for later re-use.

Changes since v1:
* Removed a redundant mon_capable check
* Capitalisation
* Removed inline comment
* Added to the commit message
---
arch/x86/kernel/cpu/resctrl/core.c | 26 ++-------------
arch/x86/kernel/cpu/resctrl/internal.h | 2 --
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 45 +++++++++++++++++++++++---
include/linux/resctrl.h | 1 +
4 files changed, 44 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index e37889f7a1a5..f69182973175 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -523,27 +523,8 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)

cpumask_clear_cpu(cpu, &d->cpu_mask);
if (cpumask_empty(&d->cpu_mask)) {
- /*
- * If resctrl is mounted, remove all the
- * per domain monitor data directories.
- */
- if (static_branch_unlikely(&rdt_mon_enable_key))
- rmdir_mondata_subdir_allrdtgrp(r, d->id);
+ resctrl_offline_domain(r, d);
list_del(&d->list);
- if (r->mon_capable && is_mbm_enabled())
- cancel_delayed_work(&d->mbm_over);
- if (is_llc_occupancy_enabled() && has_busy_rmid(r, d)) {
- /*
- * When a package is going down, forcefully
- * decrement rmid->ebusy. There is no way to know
- * that the L3 was flushed and hence may lead to
- * incorrect counts in rare scenarios, but leaving
- * the RMID as busy creates RMID leaks if the
- * package never comes back.
- */
- __check_limbo(d, true);
- cancel_delayed_work(&d->cqm_limbo);
- }

/*
* rdt_domain "d" is going to be freed below, so clear
@@ -551,11 +532,8 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
*/
if (d->plr)
d->plr->d = NULL;
-
- bitmap_free(d->rmid_busy_llc);
- kfree(d->mbm_total);
- kfree(d->mbm_local);
domain_free(hw_dom);
+
return;
}

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index be48a682dbdb..e12b55f815bf 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -522,8 +522,6 @@ void free_rmid(u32 rmid);
int rdt_get_mon_l3_config(struct rdt_resource *r);
void mon_event_count(void *info);
int rdtgroup_mondata_show(struct seq_file *m, void *arg);
-void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- unsigned int dom_id);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_domain *d, struct rdtgroup *rdtgrp,
int evtid, int first);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 19691f9ab061..281b38b363c2 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2499,14 +2499,12 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
* Remove all subdirectories of mon_data of ctrl_mon groups
* and monitor groups with given domain id.
*/
-void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, unsigned int dom_id)
+static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
+ unsigned int dom_id)
{
struct rdtgroup *prgrp, *crgrp;
char name[32];

- if (!r->mon_capable)
- return;
-
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
sprintf(name, "mon_%s_%02d", r->name, dom_id);
kernfs_remove_by_name(prgrp->mon.mon_data_kn, name);
@@ -3233,6 +3231,45 @@ static int __init rdtgroup_setup_root(void)
return ret;
}

+static void domain_destroy_mon_state(struct rdt_domain *d)
+{
+ bitmap_free(d->rmid_busy_llc);
+ kfree(d->mbm_total);
+ kfree(d->mbm_local);
+}
+
+void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ if (!r->mon_capable)
+ return;
+
+ /*
+ * If resctrl is mounted, remove all the
+ * per domain monitor data directories.
+ */
+ if (static_branch_unlikely(&rdt_mon_enable_key))
+ rmdir_mondata_subdir_allrdtgrp(r, d->id);
+
+ if (is_mbm_enabled())
+ cancel_delayed_work(&d->mbm_over);
+ if (is_llc_occupancy_enabled() && has_busy_rmid(r, d)) {
+ /*
+ * When a package is going down, forcefully
+ * decrement rmid->ebusy. There is no way to know
+ * that the L3 was flushed and hence may lead to
+ * incorrect counts in rare scenarios, but leaving
+ * the RMID as busy creates RMID leaks if the
+ * package never comes back.
+ */
+ __check_limbo(d, true);
+ cancel_delayed_work(&d->cqm_limbo);
+ }
+
+ domain_destroy_mon_state(d);
+}
+
static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
{
size_t tsize;
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index d512455b4c3a..5d283bdd6162 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -193,5 +193,6 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, enum resctrl_conf_type type);
int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);

#endif /* _RESCTRL_H */
--
2.30.2

2022-03-05 01:07:42

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v3 12/21] x86/resctrl: Calculate bandwidth from the previous __mon_event_count() chunks

Hi James,

On 2/17/2022 10:21 AM, James Morse wrote:
> mbm_bw_count() is only called by the mbm_handle_overflow() worker once a
> second. It reads the hardware register, calculates the bandwidth and
> updates m->prev_bw_msr which is used to hold the previous hardware register
> value.
>
> Operating directly on hardware register values makes it difficult to make
> this code architecture independent, so that it can be moved to /fs/,
> making the mba_sc feature something resctrl supports with no additional
> support from the architecture.
> Prior to calling mbm_bw_count(), mbm_update() reads from the same hardware
> register using __mon_event_count().
>
> Change mbm_bw_count() to use the current chunks value most recently saved by
> __mon_event_count(). This removes an extra call to __rmid_read().

> Instead of using m->prev_msr to calculate the number of chunks seen,
> use the rr->val that was updated by __mon_event_count(). This removes a extra
> calls to mbm_overflow_count() and get_corrected_mbm_count().

"removes a extra calls" -> "removes an extra call" ?

__mon_event_count() ends with "rr->val += get_corrected_mbm_count()" and
it is called twice by mbm_update(). The intention in this change is for
mbm_bw_count() to benefit from the rmid read done just before ...
but would using rr->val within mbm_bw_count() not result in it getting
data from both rmid reads due to the increment?

mbm_update() {

rr.val = 0;

if (is_mbm_total_enabled()) {
...
__mon_event_count() /* rr->val += ... */
}
if (is_mbm_local_enabled()) {
...
__mon_event_count() /* rr->val += ... */

if (is_mba_sc(NULL))
mbm_bw_count(rmid, &rr); /* Use rr-> val */
}


Should rr.val perhaps be reset before each __mon_event_count() call instead of
just at the beginning of mbm_update()?

> Calculating bandwidth like this means mbm_bw_count() no longer operates
> on hardware register values directly.
>
> Signed-off-by: James Morse <[email protected]>
> ---
> Changes since v2:
> * Expanded commit message
>
> Changes since v1:
> * This patch was rewritten
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 4 ++--
> arch/x86/kernel/cpu/resctrl/monitor.c | 24 +++++++++++++++---------
> 2 files changed, 17 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 3b9e43ba7590..c50c8911ef59 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -289,7 +289,7 @@ struct rftype {
> * struct mbm_state - status for each MBM counter in each domain
> * @chunks: Total data moved (multiply by rdt_group.mon_scale to get bytes)
> * @prev_msr: Value of IA32_QM_CTR for this RMID last time we read it
> - * @prev_bw_msr:Value of previous IA32_QM_CTR for bandwidth counting
> + * @prev_bw_chunks: Previous chunks value read when for bandwidth calculation

"value read when for" -> "value read for"?

> * @prev_bw: The most recent bandwidth in MBps
> * @delta_bw: Difference between the current and previous bandwidth
> * @delta_comp: Indicates whether to compute the delta_bw
> @@ -297,7 +297,7 @@ struct rftype {
> struct mbm_state {
> u64 chunks;
> u64 prev_msr;
> - u64 prev_bw_msr;
> + u64 prev_bw_chunks;
> u32 prev_bw;
> u32 delta_bw;
> bool delta_comp;
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index ac1a2e8998bb..8ae375e29256 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -315,7 +315,7 @@ static u64 __mon_event_count(u32 rmid, struct rmid_read *rr)
>
> if (rr->first) {
> memset(m, 0, sizeof(struct mbm_state));
> - m->prev_bw_msr = m->prev_msr = tval;
> + m->prev_msr = tval;
> return 0;
> }
>
> @@ -329,27 +329,32 @@ static u64 __mon_event_count(u32 rmid, struct rmid_read *rr)
> }
>
> /*
> + * mbm_bw_count() - Update bw count from values previously read by
> + * __mon_event_count().
> + * @rmid: The rmid used to identify the cached mbm_state.
> + * @rr: The struct rmid_read populated by __mon_event_count().
> + *
> * Supporting function to calculate the memory bandwidth
> - * and delta bandwidth in MBps.
> + * and delta bandwidth in MBps. The chunks value previously read by
> + * __mon_event_count() is compared with the chunks value from the previous
> + * invocation. This must be called oncer per second to maintain values in MBps.

"must be called oncer" -> "must be called once"

> */
> static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
> {
> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(rr->r);
> struct mbm_state *m = &rr->d->mbm_local[rmid];
> - u64 tval, cur_bw, chunks;
> + u64 cur_bw, chunks, cur_chunks;
>
> - tval = __rmid_read(rmid, rr->evtid);
> - if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
> - return;
> + cur_chunks = rr->val;
> + chunks = cur_chunks - m->prev_bw_chunks;
> + m->prev_bw_chunks = cur_chunks;
>
> - chunks = mbm_overflow_count(m->prev_bw_msr, tval, hw_res->mbm_width);
> - cur_bw = (get_corrected_mbm_count(rmid, chunks) * hw_res->mon_scale) >> 20;
> + cur_bw = (chunks * hw_res->mon_scale) >> 20;
>
> if (m->delta_comp)
> m->delta_bw = abs(cur_bw - m->prev_bw);
> m->delta_comp = false;
> m->prev_bw = cur_bw;
> - m->prev_bw_msr = tval;
> }
>
> /*
> @@ -509,6 +514,7 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
> rr.first = false;
> rr.r = r;
> rr.d = d;
> + rr.val = 0;
>
> /*
> * This is protected from concurrent reads from user

Reinette

2022-03-07 17:13:48

by Jamie Iles

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes

On Thu, Feb 17, 2022 at 06:20:49PM +0000, James Morse wrote:
> Changes in this version? Expanded the commit message of patch 12, and
> made a few things static, as reported by the kbuild robot.
>
> ---
> The aim of this series is to insert a split between the parts of the monitor
> code that the architecture must implement, and those that are part of the
> resctrl filesystem. The eventual aim is to move all filesystem parts out
> to live in /fs/resctrl, so that resctrl can be wired up for MPAM.
>
> What's MPAM? See the cover letter of a previous series. [1]
>
> The series adds domain online/offline callbacks to allow the filesystem to
> manage some of its structures itself, then moves all the 'mba_sc' behaviour
> to be part of the filesystem.
> This means another architecture doesn't need to provide an mbps_val array.
> As its all software, the resctrl filesystem should be able to do this without
> any help from the architecture code.
>
> Finally __rmid_read() is refactored to be the API call that the architecture
> provides to read a counter value. All the hardware specific overflow detection,
> scaling and value correction should occur behind this helper.
>
>
> This series is based on v5.17-rc1, and can be retrieved from:
> git://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git mpam/resctrl_monitors_in_bytes/v3
>
> [0] git://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git mpam/resctrl_merge_cdp/v7
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> [v1] https://lore.kernel.org/lkml/[email protected]/
> [v2] https://lore.kernel.org/lkml/[email protected]/
>
> Thanks,
>
> James Morse (21):
> x86/resctrl: Kill off alloc_enabled
> x86/resctrl: Merge mon_capable and mon_enabled
> x86/resctrl: Add domain online callback for resctrl work
> x86/resctrl: Group struct rdt_hw_domain cleanup
> x86/resctrl: Add domain offline callback for resctrl work
> x86/resctrl: Remove set_mba_sc()s control array re-initialisation
> x86/resctrl: Create mba_sc configuration in the rdt_domain
> x86/resctrl: Switch over to the resctrl mbps_val list
> x86/resctrl: Remove architecture copy of mbps_val
> x86/resctrl: Abstract and use supports_mba_mbps()
> x86/resctrl: Allow update_mba_bw() to update controls directly
> x86/resctrl: Calculate bandwidth from the previous __mon_event_count()
> chunks
> x86/recstrl: Add per-rmid arch private storage for overflow and chunks
> x86/recstrl: Allow per-rmid arch private storage to be reset
> x86/resctrl: Abstract __rmid_read()
> x86/resctrl: Pass the required parameters into
> resctrl_arch_rmid_read()
> x86/resctrl: Move mbm_overflow_count() into resctrl_arch_rmid_read()
> x86/resctrl: Move get_corrected_mbm_count() into
> resctrl_arch_rmid_read()
> x86/resctrl: Rename and change the units of resctrl_cqm_threshold
> x86/resctrl: Add resctrl_rmid_realloc_limit to abstract x86's
> boot_cpu_data
> x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes
>
> arch/x86/kernel/cpu/resctrl/core.c | 117 ++++--------
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 75 +++++---
> arch/x86/kernel/cpu/resctrl/internal.h | 61 +++----
> arch/x86/kernel/cpu/resctrl/monitor.c | 211 ++++++++++++++--------
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 205 +++++++++++++++++----
> include/linux/resctrl.h | 62 ++++++-
> 7 files changed, 478 insertions(+), 255 deletions(-)

Reviewed-by: Jamie Iles <[email protected]>

2022-03-15 15:46:28

by Cristian Marussi

[permalink] [raw]
Subject: RE: [PATCH v3 00/21] x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes

Hi James,

I tested this on an Intel(R) Xeon(R) Gold 5120T trying to compare gathered
resctrl monitor data with and without your series and see if results
were consistent.

I started from this paper [0] from Intel itself for my basic setup with
some minor variations: basically, using the attached test_monitors.sh
my test setup is as follows:

- a cpuset shield is created upfront isolating all the cpus belonging
to node1 (14-27,42-55)
- 2 resctrl CoS are created:
+ 1 process (tar on a big file) act as a LC LatencyCritical actor
and is run on one of the shielded CPUs with taskset (48)
+ other 3 processes instead runs stress-ng, supposedly acting as
BE BestEffort noisy neighbours and are pinned to other 3 distinct
cpus (49,50,51)

The script then triggers 4 different runs of the above crowd with different
cache allocation masks setup in lc/be CoS schemata for node1: ranging
basically from no dedicated allocation (7ff 7ff) to a cache allocation
highly unbalanced in favour of the LC task (7fe 001).

While doing that I collect in background (and out of node1 processors) all
the mon_data from the lc_cos group every 100ms and dump those in a file one
for each cache allocation mask. (mondata_LC_7f0_00f.txt etc)

I tested first against a v5.17-rc1 mainline without your series (named as
5.17.0-rc1-mainline in the results) and then again with your series on top
(named as 5.17.0-rc1-00021-g21c69a5706a5). Got your series from [1].

Then I used gnuplot to see what was the 'profile' of this data with and
without your series by plotting the LC process llc_occupancy data against
time for each one of the runs with the differerent cache allocated.
(each colored graphs represent a different run with a different
cache allocation as reported)

Note that during each run:

- at first the LC process is run without any noisy BEs
- then BEs neighbours are spawned and let to settle for 5s
- finally LC is run again while BEs are making a mess in bg

As a consequence in the plotted graphs, you can see a clear break between
the first part of the run and the last one with BEs.

Looking at the graphs it seems to me that the resctrl counters with and
without you series report a highly similar data profile, as expected
(and hoped :D).

I attach as references:

- a tarball of the raw data (test_mon_data.gz)
- the test_monitors.sh script (not nice but working)
- draw_resctrl.gp gnuplot script
- two PNG of the LC llc_occupancy graphs (all cachemasks runs)
- with your series: LC_llc_occupancy_5.17.0-rc1-00021-g21c69a5706a5.png
- without your series: LC_llc_occupancy_5.17.0-rc1-mainline.png

Gnuplot is run as:

gnuplot -e "filedir='results/5.17.0-rc1-00021-g21c69a5706a5" draw_resctrl.gp

Hope this helps...

Thanks,
Cristian


[0] https://www.intel.com/content/www/us/en/developer/articles/technical/use-intel-resource-director-technology-to-allocate-last-level-cache-llc.html
[1] git://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git mpam/resctrl_monitors_in_bytes/v3


Attachments:
(No filename) (3.02 kB)
draw_resctrl.gp (458.00 B)
test_mon_data.gz (37.77 kB)
test_monitors.sh (2.98 kB)
LC_llc_occupancy_5.17.0-rc1-mainline.png (128.43 kB)
LC_llc_occupancy_5.17.0-rc1-00021-g21c69a5706a5.png (131.21 kB)
Download all attachments

2022-03-16 07:05:29

by Shaopeng Tan (Fujitsu)

[permalink] [raw]
Subject: RE: [PATCH v3 00/21] x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes

Hi James,

> Changes in this version? Expanded the commit message of patch 12, and made
> a few things static, as reported by the kbuild robot.
>
> ---
> The aim of this series is to insert a split between the parts of the monitor code
> that the architecture must implement, and those that are part of the resctrl
> filesystem. The eventual aim is to move all filesystem parts out to live in
> /fs/resctrl, so that resctrl can be wired up for MPAM.
>
> What's MPAM? See the cover letter of a previous series. [1]
>
> The series adds domain online/offline callbacks to allow the filesystem to
> manage some of its structures itself, then moves all the 'mba_sc' behaviour to
> be part of the filesystem.
> This means another architecture doesn't need to provide an mbps_val array.
> As its all software, the resctrl filesystem should be able to do this without any
> help from the architecture code.
>
> Finally __rmid_read() is refactored to be the API call that the architecture
> provides to read a counter value. All the hardware specific overflow detection,
> scaling and value correction should occur behind this helper.
>
>
> This series is based on v5.17-rc1, and can be retrieved from:
> git://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git
> mpam/resctrl_monitors_in_bytes/v3
>
> [0] git://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git
> mpam/resctrl_merge_cdp/v7 [1]
> https://lore.kernel.org/lkml/[email protected]
> /
>
> [v1]
> https://lore.kernel.org/lkml/[email protected]
> /
> [v2]
> https://lore.kernel.org/lkml/[email protected]
> /
>
> Thanks,
>
> James Morse (21):
> x86/resctrl: Kill off alloc_enabled
> x86/resctrl: Merge mon_capable and mon_enabled
> x86/resctrl: Add domain online callback for resctrl work
> x86/resctrl: Group struct rdt_hw_domain cleanup
> x86/resctrl: Add domain offline callback for resctrl work
> x86/resctrl: Remove set_mba_sc()s control array re-initialisation
> x86/resctrl: Create mba_sc configuration in the rdt_domain
> x86/resctrl: Switch over to the resctrl mbps_val list
> x86/resctrl: Remove architecture copy of mbps_val
> x86/resctrl: Abstract and use supports_mba_mbps()
> x86/resctrl: Allow update_mba_bw() to update controls directly
> x86/resctrl: Calculate bandwidth from the previous __mon_event_count()
> chunks
> x86/recstrl: Add per-rmid arch private storage for overflow and chunks
> x86/recstrl: Allow per-rmid arch private storage to be reset
> x86/resctrl: Abstract __rmid_read()
> x86/resctrl: Pass the required parameters into
> resctrl_arch_rmid_read()
> x86/resctrl: Move mbm_overflow_count() into resctrl_arch_rmid_read()
> x86/resctrl: Move get_corrected_mbm_count() into
> resctrl_arch_rmid_read()
> x86/resctrl: Rename and change the units of resctrl_cqm_threshold
> x86/resctrl: Add resctrl_rmid_realloc_limit to abstract x86's
> boot_cpu_data
> x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes
>
> arch/x86/kernel/cpu/resctrl/core.c | 117 ++++--------
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 75 +++++---
> arch/x86/kernel/cpu/resctrl/internal.h | 61 +++----
> arch/x86/kernel/cpu/resctrl/monitor.c | 211
> ++++++++++++++--------
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 205
> +++++++++++++++++----
> include/linux/resctrl.h | 62 ++++++-
> 7 files changed, 478 insertions(+), 255 deletions(-)
>
> --
> 2.30.2

I tested this patch series on Intel(R) Xeon(R) Gold 6254 CPU with resctrl selftest.
It is no problem.

Reviewed-by: Shaopeng Tan <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>

2022-03-17 03:54:13

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v3 15/21] x86/resctrl: Abstract __rmid_read()

Hi James,

On 2/17/2022 10:21 AM, James Morse wrote:
> __rmid_read() selects the specified eventid and returns the counter
> value from the MSR. The error handling is architecture specific, and
> handled by the callers, rdtgroup_mondata_show() and __mon_event_count().
>
> Error handling should be handled by architecture specific code, as
> a different architecture may have different requirements. MPAM's
> counters can report that they are 'not ready', requiring a second
> read after a short delay. This should be hidden from resctrl.
>
> Make __rmid_read() the architecture specific function for reading
> a counter. Rename it resctrl_arch_rmid_read() and move the error
> handling into it.
>
> Signed-off-by: James Morse <[email protected]>

...

> @@ -167,9 +167,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
> memset(am, 0, sizeof(*am));
> }
>
> -static u64 __rmid_read(u32 rmid, enum resctrl_event_id eventid)
> +int resctrl_arch_rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> {
> - u64 val;
> + u64 msr_val;
>
> /*
> * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
> @@ -180,14 +180,24 @@ static u64 __rmid_read(u32 rmid, enum resctrl_event_id eventid)
> * are error bits.
> */
> wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> - rdmsrl(MSR_IA32_QM_CTR, val);
> + rdmsrl(MSR_IA32_QM_CTR, msr_val);
>
> - return val;
> + if (msr_val & RMID_VAL_ERROR)
> + return -EIO;
> + if (msr_val & RMID_VAL_UNAVAIL)
> + return -EINVAL;
> +
> + *val = msr_val;
> +
> + return 0;
> }

From above we see that resctrl_arch_rmid_read() returns an int that could be
-EIO or -EINVAL ...

...

> @@ -319,15 +331,15 @@ static u64 __mon_event_count(u32 rmid, struct rmid_read *rr)
> {
> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(rr->r);
> struct mbm_state *m;
> - u64 chunks, tval;
> + u64 chunks, tval = 0;
>
> if (rr->first)
> resctrl_arch_reset_rmid(rr->r, rr->d, rmid, rr->evtid);
>
> - tval = __rmid_read(rmid, rr->evtid);
> - if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) {
> - return tval;
> - }
> + rr->err = resctrl_arch_rmid_read(rmid, rr->evtid, &tval);
> + if (rr->err)
> + return rr->err;
> +

Setting rr->err, an int, to the return of resctrl_arch_rmid_read() is ok and
can handle the negative error codes, but returning it here means that
__mon_event_count()'s return type should be changed,
it is currently u64.

> switch (rr->evtid) {
> case QOS_L3_OCCUP_EVENT_ID:
> rr->val += tval;
> @@ -343,7 +355,7 @@ static u64 __mon_event_count(u32 rmid, struct rmid_read *rr)
> * Code would never reach here because an invalid
> * event id would fail the __rmid_read.
> */
> - return RMID_VAL_ERROR;
> + return -EINVAL;

Same here ... __mon_event_count() is u64 still ...

> }
>
> if (rr->first) {
> @@ -419,9 +431,14 @@ void mon_event_count(void *info)
> }
> }
>

Also take care here ... ret_val in mon_event_count() is still u64 while
__mon_event_count() attempts to return negative errors.

> - /* Report error if none of rmid_reads are successful */
> - if (ret_val)
> - rr->val = ret_val;
> + /*
> + * __mon_event_count() calls for newly created monitor groups may
> + * report -EINVAL/Unavailable if the monitor hasn't seen any traffic.
> + * If the first call for the control group succeed, discard any error
> + * set by reads of monitor groups.
> + */

Additionally, if the first call fails, but a following read of monitor group
succeeds then the first call's error is discarded.

How about if the last sentence is replaced with:
"Discard error if any of the monitor event reads succeeded."

> + if (ret_val == 0)
> + rr->err = 0;
> }
>
> /*
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 0b48239f5920..70112dbfa128 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -219,6 +219,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
> u32 closid, enum resctrl_conf_type type);
> int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
> void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
> +int resctrl_arch_rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *res);
>
> /**
> * resctrl_arch_reset_rmid() - Reset any private state associated with rmid


Reinette

2022-03-17 04:21:36

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v3 01/21] x86/resctrl: Kill off alloc_enabled

Hi James,

On 2/17/2022 10:20 AM, James Morse wrote:
> rdt_resources_all[] used to have extra entries for L2CODE/L2DATA.
> These were hidden from resctrl by the alloc_enabled value.
>
> Now that the L2/L2CODE/L2DATA resources have been merged together,
> alloc_enabled doesn't mean anything, it always has the same value as
> alloc_capable which indicates which indicates allocation is supported
> by this resource.

"which indicates which indicates" -> "which indicates"

Reinette

2022-03-17 05:10:44

by haoxin

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes

Hi james,

I have done some verification on x86 machine(Intel(R) Xeon(R) Platinum
8269CY), and "CAT, CMT, MBA, MBM"  have been tested,

They all work well.

So please add Tested-by: Xin Hao <[email protected]>

--
Best Regards!
Xin Hao

2022-03-31 03:54:33

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v3 12/21] x86/resctrl: Calculate bandwidth from the previous __mon_event_count() chunks

Hi Reinette,

On 05/03/2022 00:27, Reinette Chatre wrote:
> On 2/17/2022 10:21 AM, James Morse wrote:
>> mbm_bw_count() is only called by the mbm_handle_overflow() worker once a
>> second. It reads the hardware register, calculates the bandwidth and
>> updates m->prev_bw_msr which is used to hold the previous hardware register
>> value.
>>
>> Operating directly on hardware register values makes it difficult to make
>> this code architecture independent, so that it can be moved to /fs/,
>> making the mba_sc feature something resctrl supports with no additional
>> support from the architecture.
>> Prior to calling mbm_bw_count(), mbm_update() reads from the same hardware
>> register using __mon_event_count().
>>
>> Change mbm_bw_count() to use the current chunks value most recently saved by
>> __mon_event_count(). This removes an extra call to __rmid_read().
>
>> Instead of using m->prev_msr to calculate the number of chunks seen,
>> use the rr->val that was updated by __mon_event_count(). This removes a extra
>> calls to mbm_overflow_count() and get_corrected_mbm_count().
>
> "removes a extra calls" -> "removes an extra call" ?
>
> __mon_event_count() ends with "rr->val += get_corrected_mbm_count()" and
> it is called twice by mbm_update(). The intention in this change is for
> mbm_bw_count() to benefit from the rmid read done just before ...
> but would using rr->val within mbm_bw_count() not result in it getting
> data from both rmid reads due to the increment?

Yes, bother. I thought those were mutually exclusive, but its __mon_event_count() that
uses a different struct mbm_state for each set of raw values, not mbm_update().

[...]

> Should rr.val perhaps be reset before each __mon_event_count() call instead of
> just at the beginning of mbm_update()?

Yes. This is because the struct rmid_read is just to allow __mon_event_count() to do its
thing, nothing used to read those values.


>> Calculating bandwidth like this means mbm_bw_count() no longer operates
>> on hardware register values directly.


Thanks!

James

2022-03-31 04:21:04

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v3 15/21] x86/resctrl: Abstract __rmid_read()

Hi Reinette,

On 16/03/2022 21:52, Reinette Chatre wrote:
> On 2/17/2022 10:21 AM, James Morse wrote:
>> __rmid_read() selects the specified eventid and returns the counter
>> value from the MSR. The error handling is architecture specific, and
>> handled by the callers, rdtgroup_mondata_show() and __mon_event_count().
>>
>> Error handling should be handled by architecture specific code, as
>> a different architecture may have different requirements. MPAM's
>> counters can report that they are 'not ready', requiring a second
>> read after a short delay. This should be hidden from resctrl.
>>
>> Make __rmid_read() the architecture specific function for reading
>> a counter. Rename it resctrl_arch_rmid_read() and move the error
>> handling into it.

>> @@ -180,14 +180,24 @@ static u64 __rmid_read(u32 rmid, enum resctrl_event_id eventid)
>> * are error bits.
>> */
>> wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
>> - rdmsrl(MSR_IA32_QM_CTR, val);
>> + rdmsrl(MSR_IA32_QM_CTR, msr_val);
>>
>> - return val;
>> + if (msr_val & RMID_VAL_ERROR)
>> + return -EIO;
>> + if (msr_val & RMID_VAL_UNAVAIL)
>> + return -EINVAL;
>> +
>> + *val = msr_val;
>> +
>> + return 0;
>> }
>
> From above we see that resctrl_arch_rmid_read() returns an int that could be
> -EIO or -EINVAL ...
>
> ...
>
>> @@ -319,15 +331,15 @@ static u64 __mon_event_count(u32 rmid, struct rmid_read *rr)
>> {
>> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(rr->r);
>> struct mbm_state *m;
>> - u64 chunks, tval;
>> + u64 chunks, tval = 0;
>>
>> if (rr->first)
>> resctrl_arch_reset_rmid(rr->r, rr->d, rmid, rr->evtid);
>>
>> - tval = __rmid_read(rmid, rr->evtid);
>> - if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) {
>> - return tval;
>> - }
>> + rr->err = resctrl_arch_rmid_read(rmid, rr->evtid, &tval);
>> + if (rr->err)
>> + return rr->err;
>> +
>
> Setting rr->err, an int, to the return of resctrl_arch_rmid_read() is ok and
> can handle the negative error codes, but returning it here means that
> __mon_event_count()'s return type should be changed,
> it is currently u64.

Good point. Fixed.


>> @@ -419,9 +431,14 @@ void mon_event_count(void *info)
>> }
>> }
>>
>
> Also take care here ... ret_val in mon_event_count() is still u64 while
> __mon_event_count() attempts to return negative errors.

(yup, fixed)


>> - /* Report error if none of rmid_reads are successful */
>> - if (ret_val)
>> - rr->val = ret_val;
>> + /*
>> + * __mon_event_count() calls for newly created monitor groups may
>> + * report -EINVAL/Unavailable if the monitor hasn't seen any traffic.
>> + * If the first call for the control group succeed, discard any error
>> + * set by reads of monitor groups.
>> + */
>
> Additionally, if the first call fails, but a following read of monitor group
> succeeds then the first call's error is discarded.
>
> How about if the last sentence is replaced with:
> "Discard error if any of the monitor event reads succeeded."

Sure,


Thanks,

James

2022-04-04 23:12:02

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes

Hello!

On 3/15/22 08:16, [email protected] wrote:
>> Changes in this version? Expanded the commit message of patch 12, and made
>> a few things static, as reported by the kbuild robot.

[...]

> I tested this patch series on Intel(R) Xeon(R) Gold 6254 CPU with resctrl selftest.
> It is no problem.
>
> Reviewed-by: Shaopeng Tan <[email protected]>
> Tested-by: Shaopeng Tan <[email protected]>

Thanks!

James

2022-04-05 02:01:52

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes

Hi Jamie,

On 3/7/22 12:56, Jamie Iles wrote:
> On Thu, Feb 17, 2022 at 06:20:49PM +0000, James Morse wrote:
>> Changes in this version? Expanded the commit message of patch 12, and
>> made a few things static, as reported by the kbuild robot.

[...]

> Reviewed-by: Jamie Iles <[email protected]>

Thanks!

James

2022-04-05 02:53:53

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes

Hello!

On 3/15/22 06:41, Xin Hao wrote:
> I have done some verification on x86 machine(Intel(R) Xeon(R) Platinum 8269CY), and "CAT, CMT, MBA, MBM"  have been tested,
>
> They all work well.
>
> So please add Tested-by: Xin Hao <[email protected]>

Great, thanks!

James


2022-04-10 10:28:44

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes

Hi Cristian,

On 15/03/2022 12:10, Cristian Marussi wrote:
> I tested this on an Intel(R) Xeon(R) Gold 5120T trying to compare gathered
> resctrl monitor data with and without your series and see if results
> were consistent.
>
> I started from this paper [0] from Intel itself for my basic setup with
> some minor variations:

[..]

> While doing that I collect in background (and out of node1 processors) all
> the mon_data from the lc_cos group every 100ms and dump those in a file one
> for each cache allocation mask. (mondata_LC_7f0_00f.txt etc)
>
> I tested first against a v5.17-rc1 mainline without your series (named as
> 5.17.0-rc1-mainline in the results) and then again with your series on top
> (named as 5.17.0-rc1-00021-g21c69a5706a5). Got your series from [1].
>
> Then I used gnuplot to see what was the 'profile' of this data with and
> without your series by plotting the LC process llc_occupancy data against
> time for each one of the runs with the differerent cache allocated.
> (each colored graphs represent a different run with a different
> cache allocation as reported)

> Hope this helps...

This is great, thanks!

Can I take this as a Tested-by?



Thanks,

James

2022-04-12 05:09:06

by Cristian Marussi

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes

On Fri, Apr 08, 2022 at 06:30:40PM +0100, James Morse wrote:
> Hi Cristian,
>
> On 15/03/2022 12:10, Cristian Marussi wrote:
> > I tested this on an Intel(R) Xeon(R) Gold 5120T trying to compare gathered
> > resctrl monitor data with and without your series and see if results
> > were consistent.
> >
> > I started from this paper [0] from Intel itself for my basic setup with
> > some minor variations:
>
> [..]
>
> > While doing that I collect in background (and out of node1 processors) all
> > the mon_data from the lc_cos group every 100ms and dump those in a file one
> > for each cache allocation mask. (mondata_LC_7f0_00f.txt etc)
> >
> > I tested first against a v5.17-rc1 mainline without your series (named as
> > 5.17.0-rc1-mainline in the results) and then again with your series on top
> > (named as 5.17.0-rc1-00021-g21c69a5706a5). Got your series from [1].
> >
> > Then I used gnuplot to see what was the 'profile' of this data with and
> > without your series by plotting the LC process llc_occupancy data against
> > time for each one of the runs with the differerent cache allocated.
> > (each colored graphs represent a different run with a different
> > cache allocation as reported)
>
> > Hope this helps...
>

Hi James,

> This is great, thanks!
>
> Can I take this as a Tested-by?
>

Sure, sorry to have forgot that :D

Tested-by: Cristian Marussi <[email protected]>

Thanks,
Cristian