2017-06-26 18:54:53

by Shivappa Vikas

Subject: [PATCH V1 00/21] x86/cqm3: Resctrl based cqm

Sending a version of resctrl based cqm and mbm patches as per the
requirements discussed here:
https://marc.info/?l=linux-kernel&m=148891934720489

Several attempts were made to fix the existing upstream perf based cqm,
but they were NACKed by the community, which led to the discussions above.

Patches are based on 4.12-rc4.

Acknowledgements:
- Thanks to Thomas for all the feedback on the requirements and design.
- Thanks to Stephane Eranian <[email protected]> and David
Carrillo-Cisneros <[email protected]> for going through all the
churn during the requirements and design phase and for the code reviews.
- Thanks to "Chatre, Reinette" <[email protected]> for raising
issues with the code organization and with globals that were not
declared static.

Summary of the changes -

01/21 - Remove the existing perf based cqm
02/21 - A fix to existing RDT memory leak issue
03/21 - Documentation for resctrl based cqm
04 - 06/21 - Cleanup/Preparatory patches for resctrl based cqm
07 - 18/21 - Add CQM support
19 - 21/21 - Add MBM support

Tony Luck (2):
x86/intel_rdt: Simplify info and base file lists
x86/intel_rdt/mbm: Basic counting of MBM events (total and local)

Vikas Shivappa (19):
x86/perf/cqm: Wipe out perf based cqm
x86/intel_rdt: Fix memory leak during mount
x86/intel_rdt/cqm: Documentation for resctrl based RDT Monitoring
x86/intel_rdt: Introduce a common compile option for RDT
x86/intel_rdt: Change file names to accommodate RDT monitor code
x86/intel_rdt: Cleanup namespace to support RDT monitoring
x86/intel_rdt/cqm: Add RDT monitoring initialization
x86/intel_rdt/cqm: Add RMID(Resource monitoring ID) management
x86/intel_rdt/cqm: Add info files for RDT monitoring
x86/intel_rdt/cqm: Add mkdir support for RDT monitoring
x86/intel_rdt/cqm: Add tasks file support
x86/intel_rdt/cqm: Add cpus file support
x86/intel_rdt/cqm: Add mon_data
x86/intel_rdt/cqm: Add rmdir support
x86/intel_rdt/cqm: Add mount,umount support
x86/intel_rdt/cqm: Add sched_in support
x86/intel_rdt/cqm: Add hotcpu support
x86/intel_rdt/mbm: Add mbm counter initialization
x86/intel_rdt/mbm: Handle counter overflow

Documentation/x86/intel_rdt_ui.txt | 316 ++++-
MAINTAINERS | 2 +-
arch/x86/Kconfig | 12 +-
arch/x86/events/intel/Makefile | 2 +-
arch/x86/events/intel/cqm.c | 1766 ---------------------------
arch/x86/include/asm/intel_rdt.h | 286 -----
arch/x86/include/asm/intel_rdt_common.h | 27 -
arch/x86/include/asm/intel_rdt_sched.h | 93 ++
arch/x86/kernel/cpu/Makefile | 2 +-
arch/x86/kernel/cpu/intel_rdt.c | 186 ++-
arch/x86/kernel/cpu/intel_rdt.h | 413 +++++++
arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 340 ++++++
arch/x86/kernel/cpu/intel_rdt_monitor.c | 421 +++++++
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 1099 ++++++++++++++---
arch/x86/kernel/cpu/intel_rdt_schemata.c | 286 -----
arch/x86/kernel/process_32.c | 2 +-
arch/x86/kernel/process_64.c | 2 +-
include/linux/perf_event.h | 18 -
include/linux/sched.h | 3 +-
kernel/events/core.c | 11 +-
kernel/trace/bpf_trace.c | 2 +-
21 files changed, 2617 insertions(+), 2672 deletions(-)
delete mode 100644 arch/x86/events/intel/cqm.c
delete mode 100644 arch/x86/include/asm/intel_rdt.h
delete mode 100644 arch/x86/include/asm/intel_rdt_common.h
create mode 100644 arch/x86/include/asm/intel_rdt_sched.h
create mode 100644 arch/x86/kernel/cpu/intel_rdt.h
create mode 100644 arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
create mode 100644 arch/x86/kernel/cpu/intel_rdt_monitor.c
delete mode 100644 arch/x86/kernel/cpu/intel_rdt_schemata.c

--
1.9.1


2017-06-26 18:54:59

by Shivappa Vikas

Subject: [PATCH 15/21] x86/intel_rdt/cqm: Add rmdir support

Resource groups (ctrl_mon and monitor groups) are represented by
directories in resctrl fs. Add support to remove the directories.

When a ctrl_mon directory is removed, all the cpus and tasks are
assigned back to the root rdtgroup. When a monitor group is removed,
the cpus and tasks are returned to the parent ctrl_mon group.
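
To make the hand-back concrete, here is a small userspace C sketch of
what happens to a deleted group's tasks. The types and the matching rule
are simplified stand-ins, not the kernel's code: the real
rdt_move_group_tasks() matches on closid or rmid depending on whether a
ctrl_mon or a monitor group is being removed.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for task_struct and rdtgroup fields. */
struct task { int closid; int rmid; };
struct group { int closid; int rmid; };

/*
 * On rmdir, every task that belonged to the deleted group inherits the
 * destination group's closid and rmid (the root group for a ctrl_mon
 * group, the parent ctrl_mon group for a monitor group).
 */
static void move_group_tasks(struct task *tasks, size_t n,
			     const struct group *from, const struct group *to)
{
	for (size_t i = 0; i < n; i++) {
		if (tasks[i].closid == from->closid &&
		    tasks[i].rmid == from->rmid) {
			tasks[i].closid = to->closid;
			tasks[i].rmid = to->rmid;
		}
	}
}
```

The kernel additionally IPIs the CPUs running moved tasks so their
PQR state is rewritten, which the sketch leaves out.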

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 110 +++++++++++++++++++++++++++----
1 file changed, 99 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 9377bcd..6131508 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -1164,6 +1164,18 @@ static int reset_all_ctrls(struct rdt_resource *r)
return 0;
}

+static bool is_closid_match(struct task_struct *t, struct rdtgroup *r)
+{
+ return (rdt_alloc_enabled &&
+ (r->type == RDTCTRL_GROUP) && (t->closid == r->closid));
+}
+
+static bool is_rmid_match(struct task_struct *t, struct rdtgroup *r)
+{
+ return (rdt_mon_features &&
+ (r->type == RDTMON_GROUP) && (t->rmid == r->rmid));
+}
+
/*
* Move tasks from one to the other group. If @from is NULL, then all tasks
* in the systems are moved unconditionally (used for teardown).
@@ -1179,8 +1191,11 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,

read_lock(&tasklist_lock);
for_each_process_thread(p, t) {
- if (!from || t->closid == from->closid) {
+ if (!from || is_closid_match(t, from) ||
+ is_rmid_match(t, from)) {
t->closid = to->closid;
+ t->rmid = to->rmid;
+
#ifdef CONFIG_SMP
/*
* This is safe on x86 w/o barriers as the ordering
@@ -1402,6 +1417,7 @@ static int mkdir_mondata_all(struct kernfs_node *parent_kn, struct rdtgroup *pr,
kernfs_remove(kn);
return ret;
}
+
/*
* Common code for ctrl_mon and monitor group mkdir.
* The caller needs to unlock the global mutex upon success.
@@ -1588,20 +1604,57 @@ static int rdtgroup_mkdir(struct kernfs_node *pkn, const char *name,
return -EPERM;
}

-static int rdtgroup_rmdir(struct kernfs_node *kn)
+static int rdtgroup_rmdir_mon(struct kernfs_node *kn, struct rdtgroup *rdtgrp)
{
- int ret, cpu, closid = rdtgroup_default.closid;
- struct rdtgroup *rdtgrp;
+ struct rdtgroup *prdtgrp = rdtgrp->parent;
cpumask_var_t tmpmask;
+ int cpu;

if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
return -ENOMEM;

- rdtgrp = rdtgroup_kn_lock_live(kn);
- if (!rdtgrp) {
- ret = -EPERM;
- goto out;
- }
+ /* Give any tasks back to the parent group */
+ rdt_move_group_tasks(rdtgrp, prdtgrp, tmpmask);
+
+ /* Update per cpu rmid of the moved CPUs first */
+ for_each_cpu(cpu, &rdtgrp->cpu_mask)
+ per_cpu(cpu_rmid, cpu) = prdtgrp->rmid;
+ /*
+ * Update the MSR on moved CPUs and CPUs which have moved
+ * task running on them.
+ */
+ cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
+ update_closid_rmid(tmpmask, NULL);
+
+ rdtgrp->flags = RDT_DELETED;
+ free_rmid(rdtgrp->rmid);
+
+ /*
+ * Remove your rmid from the parent ctrl groups list
+ */
+ WARN_ON(list_empty(&prdtgrp->crdtgrp_list));
+ list_del(&rdtgrp->crdtgrp_list);
+
+ /*
+ * one extra hold on this, will drop when we kfree(rdtgrp)
+ * in rdtgroup_kn_unlock()
+ */
+ kernfs_get(kn);
+ kernfs_remove(rdtgrp->kn);
+ free_cpumask_var(tmpmask);
+
+ return 0;
+}
+
+static int rdtgroup_rmdir_ctrl(struct kernfs_node *kn, struct rdtgroup *rdtgrp)
+{
+ int cpu, closid = rdtgroup_default.closid;
+ struct rdtgroup *entry, *tmp;
+ struct list_head *llist;
+ cpumask_var_t tmpmask;
+
+ if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
+ return -ENOMEM;

/* Give any tasks back to the default group */
rdt_move_group_tasks(rdtgrp, &rdtgroup_default, tmpmask);
@@ -1622,6 +1675,18 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)

rdtgrp->flags = RDT_DELETED;
closid_free(rdtgrp->closid);
+ free_rmid(rdtgrp->rmid);
+
+ /*
+ * Free all the child monitor group rmids.
+ */
+ llist = &rdtgrp->crdtgrp_list;
+ list_for_each_entry_safe(entry, tmp, llist, crdtgrp_list) {
+ free_rmid(entry->rmid);
+ list_del(&entry->crdtgrp_list);
+ kfree(entry);
+ }
+
list_del(&rdtgrp->rdtgroup_list);

/*
@@ -1630,10 +1695,33 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
*/
kernfs_get(kn);
kernfs_remove(rdtgrp->kn);
- ret = 0;
+ free_cpumask_var(tmpmask);
+
+ return 0;
+}
+
+static int rdtgroup_rmdir(struct kernfs_node *kn)
+{
+ struct kernfs_node *parent_kn = kn->parent;
+ struct rdtgroup *rdtgrp;
+ int ret = 0;
+
+ rdtgrp = rdtgroup_kn_lock_live(kn);
+ if (!rdtgrp) {
+ ret = -EPERM;
+ goto out;
+ }
+
+ if (rdtgrp->type == RDTCTRL_GROUP && parent_kn == rdtgroup_default.kn)
+ ret = rdtgroup_rmdir_ctrl(kn, rdtgrp);
+ else if (rdtgrp->type == RDTMON_GROUP &&
+ !strcmp(parent_kn->name, "mon_groups"))
+ ret = rdtgroup_rmdir_mon(kn, rdtgrp);
+ else
+ ret = -EPERM;
+
out:
rdtgroup_kn_unlock(kn);
- free_cpumask_var(tmpmask);
return ret;
}

--
1.9.1

2017-06-26 18:55:13

by Shivappa Vikas

Subject: [PATCH 19/21] x86/intel_rdt/mbm: Basic counting of MBM events (total and local)

From: Tony Luck <[email protected]>

Check the CPUID bits for whether each of the MBM events is supported.
In each domain, allocate space per RMID and per event to save the
previous MSR counter value and the running total of data.
Create the event files in each of the monitor directories.
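
The per-domain allocation can be sketched in userspace C (calloc
standing in for kcalloc; the struct and function names are illustrative
simplifications of the kernel's domain_setup_mbm_state()):

```c
#include <assert.h>
#include <stdlib.h>

/* One saved-state slot per RMID, per enabled MBM event. */
struct mbm_state {
	unsigned long long chunks;	/* running total of data */
	unsigned long long prev_msr;	/* last raw counter reading */
};

struct domain {
	struct mbm_state *mbm_total;
	struct mbm_state *mbm_local;
};

static int domain_setup_mbm_state(struct domain *d, int num_rmid,
				  int total_on, int local_on)
{
	if (total_on) {
		d->mbm_total = calloc(num_rmid, sizeof(*d->mbm_total));
		if (!d->mbm_total)
			return -1;
	}
	if (local_on) {
		d->mbm_local = calloc(num_rmid, sizeof(*d->mbm_local));
		if (!d->mbm_local) {
			/* unwind the first allocation on failure */
			free(d->mbm_total);
			d->mbm_total = NULL;
			return -1;
		}
	}
	return 0;
}
```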

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt.c | 33 ++++++++++++++++++++++++++++-
arch/x86/kernel/cpu/intel_rdt.h | 32 ++++++++++++++++++++++++++++
arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 1 +
arch/x86/kernel/cpu/intel_rdt_monitor.c | 31 ++++++++++++++++++++++++++-
4 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 8c7643d..a17f323 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -395,6 +395,27 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

+static bool domain_setup_mbm_state(struct rdt_resource *r, struct rdt_domain *d)
+{
+ size_t tsize;
+
+ if (is_mbm_total_enabled()) {
+ tsize = sizeof(*d->mbm_total);
+ d->mbm_total = kcalloc(r->num_rmid, tsize, GFP_KERNEL);
+ if (!d->mbm_total)
+ return false;
+ }
+ if (is_mbm_local_enabled()) {
+ tsize = sizeof(*d->mbm_local);
+ d->mbm_local = kcalloc(r->num_rmid, tsize, GFP_KERNEL);
+ if (!d->mbm_local) {
+ kfree(d->mbm_total);
+ return false;
+ }
+ }
+ return true;
+}
+
/*
* domain_add_cpu - Add a cpu to a resource's domain list.
*
@@ -430,13 +451,17 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;

d->id = id;
+ cpumask_set_cpu(cpu, &d->cpu_mask);

if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
kfree(d);
return;
}

- cpumask_set_cpu(cpu, &d->cpu_mask);
+ if (r->mon_capable && !domain_setup_mbm_state(r, d)) {
+ kfree(d);
+ return;
+ }
list_add_tail(&d->list, add_pos);

/*
@@ -467,6 +492,8 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
if (static_branch_unlikely(&rdt_mon_enable_key))
rmdir_mondata_subdir_allrdtgrp(r, d->id);
kfree(d->ctrl_val);
+ kfree(d->mbm_total);
+ kfree(d->mbm_local);
list_del(&d->list);
kfree(d);
}
@@ -560,6 +587,10 @@ static __init bool get_rdt_resources(void)

if (boot_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
+ if (boot_cpu_has(X86_FEATURE_CQM_MBM_TOTAL))
+ rdt_mon_features |= (1 << QOS_L3_MBM_TOTAL_EVENT_ID);
+ if (boot_cpu_has(X86_FEATURE_CQM_MBM_LOCAL))
+ rdt_mon_features |= (1 << QOS_L3_MBM_LOCAL_EVENT_ID);

if (rdt_mon_features)
rdt_get_mon_l3_config(&rdt_resources_all[RDT_RESOURCE_L3]);
diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index b771dae..ab010fb 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -20,6 +20,8 @@
#define QOS_L3_MBM_TOTAL_EVENT_ID 0x02
#define QOS_L3_MBM_LOCAL_EVENT_ID 0x03

+#define MBM_CNTR_WIDTH 24
+
#define RMID_VAL_ERROR (1ULL << 63)
#define RMID_VAL_UNAVAIL (1ULL << 62)

@@ -53,6 +55,7 @@ struct mon_evt {

struct rmid_read {
struct rdtgroup *rgrp;
+ struct rdt_domain *d;
int evtid;
u64 val;
};
@@ -159,10 +162,22 @@ struct rftype {
};

/**
+ * struct mbm_state - status for each MBM counter in each domain
+ * @chunks: Total data moved (multiply by rdt_group.mon_scale to get bytes)
+ * @prev_msr: Value of IA32_QM_CTR for this RMID last time we read it
+ */
+struct mbm_state {
+ u64 chunks;
+ u64 prev_msr;
+};
+
+/**
* struct rdt_domain - group of cpus sharing an RDT resource
* @list: all instances of this resource
* @id: unique id for this instance
* @cpu_mask: which cpus share this resource
+ * @mbm_total: saved state for MBM total bandwidth
+ * @mbm_local: saved state for MBM local bandwidth
* @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID)
* @new_ctrl: new ctrl value to be loaded
* @have_new_ctrl: did user provide new_ctrl for this domain
@@ -171,6 +186,8 @@ struct rdt_domain {
struct list_head list;
int id;
struct cpumask cpu_mask;
+ struct mbm_state *mbm_total;
+ struct mbm_state *mbm_local;
u32 *ctrl_val;
u32 new_ctrl;
bool have_new_ctrl;
@@ -221,6 +238,21 @@ struct rdt_membw {
u32 *mb_map;
};

+static inline bool is_mbm_total_enabled(void)
+{
+ return (rdt_mon_features & (1 << QOS_L3_MBM_TOTAL_EVENT_ID));
+}
+
+static inline bool is_mbm_local_enabled(void)
+{
+ return (rdt_mon_features & (1 << QOS_L3_MBM_LOCAL_EVENT_ID));
+}
+
+static inline bool is_mbm_enabled(void)
+{
+ return (is_mbm_total_enabled() || is_mbm_local_enabled());
+}
+
/**
* struct rdt_resource - attributes of an RDT resource
* @alloc_enabled: Is allocation enabled on this machine
diff --git a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
index 0c8bca0..926d889 100644
--- a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
+++ b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
@@ -315,6 +315,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
*/
rr.rgrp = rdtgrp;
rr.evtid = evtid;
+ rr.d = d;
rr.val = 0;

smp_call_function_any(&d->cpu_mask, mon_event_count, &rr, 1);
diff --git a/arch/x86/kernel/cpu/intel_rdt_monitor.c b/arch/x86/kernel/cpu/intel_rdt_monitor.c
index a7247ca..2c5768c 100644
--- a/arch/x86/kernel/cpu/intel_rdt_monitor.c
+++ b/arch/x86/kernel/cpu/intel_rdt_monitor.c
@@ -209,7 +209,8 @@ void free_rmid(u32 rmid)

static bool __mon_event_count(u32 rmid, struct rmid_read *rr)
{
- u64 tval;
+ u64 chunks, shift, tval;
+ struct mbm_state *m;

tval = __rmid_read(rmid, rr->evtid);
if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) {
@@ -220,9 +221,23 @@ static bool __mon_event_count(u32 rmid, struct rmid_read *rr)
case QOS_L3_OCCUP_EVENT_ID:
rr->val += tval;
return true;
+ case QOS_L3_MBM_TOTAL_EVENT_ID:
+ m = &rr->d->mbm_total[rmid];
+ break;
+ case QOS_L3_MBM_LOCAL_EVENT_ID:
+ m = &rr->d->mbm_local[rmid];
+ break;
default:
return false;
}
+ shift = 64 - MBM_CNTR_WIDTH;
+ chunks = (tval << shift) - (m->prev_msr << shift);
+ chunks >>= shift;
+ m->chunks += chunks;
+ m->prev_msr = tval;
+
+ rr->val += m->chunks;
+ return true;
}

void mon_event_count(void *info)
@@ -285,12 +300,26 @@ static int dom_data_init(struct rdt_resource *r)
.evtid = QOS_L3_OCCUP_EVENT_ID,
};

+static struct mon_evt mbm_total_event = {
+ .name = "mbm_total_bytes",
+ .evtid = QOS_L3_MBM_TOTAL_EVENT_ID,
+};
+
+static struct mon_evt mbm_local_event = {
+ .name = "mbm_local_bytes",
+ .evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
+};
+
static void l3_mon_evt_init(struct rdt_resource *r)
{
INIT_LIST_HEAD(&r->evt_list);

if (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID))
list_add_tail(&llc_occupancy_event.list, &r->evt_list);
+ if (is_mbm_total_enabled())
+ list_add_tail(&mbm_total_event.list, &r->evt_list);
+ if (is_mbm_local_enabled())
+ list_add_tail(&mbm_local_event.list, &r->evt_list);
}

void rdt_get_mon_l3_config(struct rdt_resource *r)
--
1.9.1

2017-06-26 18:55:18

by Shivappa Vikas

Subject: [PATCH 18/21] x86/intel_rdt/cqm: Add hotcpu support

Resource groups have a per domain directory under "mon_data". Add or
remove these directories as domains come online and go offline. Also
update the per cpu RMIDs and the cached PQR state when cpus are onlined
and offlined.
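
The teardown side can be sketched in userspace C. This is not the
kernel code: the "alive" flag stands in for the real work of removing
the mon_data subdirectories, freeing the counter arrays, and unlinking
the domain once its last CPU goes offline.

```c
#include <assert.h>

#define MAX_CPUS 8

struct domain {
	int cpu_mask[MAX_CPUS];	/* stand-in for struct cpumask */
	int ncpus;
	int alive;
};

static void domain_remove_cpu(struct domain *d, int cpu)
{
	if (d->cpu_mask[cpu]) {
		d->cpu_mask[cpu] = 0;
		d->ncpus--;
	}
	/* A domain is torn down only when its last CPU leaves. */
	if (d->ncpus == 0)
		d->alive = 0;
}
```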

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt.c | 28 +++++++++++++++-----
arch/x86/kernel/cpu/intel_rdt.h | 9 +++++++
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 44 ++++++++++++++++++++++++++++++++
3 files changed, 75 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 63bfb47c..8c7643d 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -438,6 +438,13 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)

cpumask_set_cpu(cpu, &d->cpu_mask);
list_add_tail(&d->list, add_pos);
+
+ /*
+ * If resctrl is mounted, add
+ * per domain monitor data directories.
+ */
+ if (static_branch_unlikely(&rdt_mon_enable_key))
+ mkdir_mondata_subdir_allrdtgrp(r, d->id);
}

static void domain_remove_cpu(int cpu, struct rdt_resource *r)
@@ -453,19 +460,28 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)

cpumask_clear_cpu(cpu, &d->cpu_mask);
if (cpumask_empty(&d->cpu_mask)) {
+ /*
+ * If resctrl is mounted, remove all the
+ * per domain monitor data directories.
+ */
+ if (static_branch_unlikely(&rdt_mon_enable_key))
+ rmdir_mondata_subdir_allrdtgrp(r, d->id);
kfree(d->ctrl_val);
list_del(&d->list);
kfree(d);
}
}

-static void clear_closid(int cpu)
+static void clear_closid_rmid(int cpu)
{
struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);

per_cpu(cpu_closid, cpu) = 0;
+ per_cpu(cpu_rmid, cpu) = 0;
+
state->closid = 0;
- wrmsr(IA32_PQR_ASSOC, state->rmid, 0);
+ state->rmid = 0;
+ wrmsr(IA32_PQR_ASSOC, 0, 0);
}

static int intel_rdt_online_cpu(unsigned int cpu)
@@ -473,11 +489,11 @@ static int intel_rdt_online_cpu(unsigned int cpu)
struct rdt_resource *r;

mutex_lock(&rdtgroup_mutex);
- for_each_alloc_capable_rdt_resource(r)
+ for_each_capable_rdt_resource(r)
domain_add_cpu(cpu, r);
/* The cpu is set in default rdtgroup after online. */
cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
- clear_closid(cpu);
+ clear_closid_rmid(cpu);
mutex_unlock(&rdtgroup_mutex);

return 0;
@@ -500,7 +516,7 @@ static int intel_rdt_offline_cpu(unsigned int cpu)
struct rdt_resource *r;

mutex_lock(&rdtgroup_mutex);
- for_each_alloc_capable_rdt_resource(r)
+ for_each_capable_rdt_resource(r)
domain_remove_cpu(cpu, r);
list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
@@ -508,7 +524,7 @@ static int intel_rdt_offline_cpu(unsigned int cpu)
break;
}
}
- clear_closid(cpu);
+ clear_closid_rmid(cpu);
mutex_unlock(&rdtgroup_mutex);

return 0;
diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index ea7a86f..b771dae 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -290,6 +290,11 @@ enum {
RDT_NUM_RESOURCES,
};

+#define for_each_capable_rdt_resource(r) \
+ for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
+ r++) \
+ if (r->alloc_capable || r->mon_capable)
+
#define for_each_alloc_capable_rdt_resource(r) \
for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
r++) \
@@ -350,5 +355,9 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
void rdt_get_mon_l3_config(struct rdt_resource *r);
void mon_event_count(void *info);
int rdtgroup_mondata_show(struct seq_file *m, void *arg);
+void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
+ unsigned int dom_id);
+void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
+ unsigned int dom_id);

#endif /* _ASM_X86_INTEL_RDT_H */
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 2384c07..8e1581a 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -1348,6 +1348,27 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
return ret;
}

+/*
+ * Remove all subdirectories of mon_data of ctrl_mon groups
+ * and monitor groups with given domain id.
+ */
+void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, unsigned int dom_id)
+{
+ struct rdtgroup *pr, *cr;
+ char name[32];
+
+ if (!r->mon_enabled)
+ return;
+
+ list_for_each_entry(pr, &rdt_all_groups, rdtgroup_list) {
+ sprintf(name, "mon_%s_%02d", r->name, dom_id);
+ kernfs_remove_by_name(pr->mon_data_kn, name);
+
+ list_for_each_entry(cr, &pr->crdtgrp_list, crdtgrp_list)
+ kernfs_remove_by_name(cr->mon_data_kn, name);
+ }
+}
+
static int get_rdt_resourceid(struct rdt_resource *r)
{
if (r > (rdt_resources_all + RDT_NUM_RESOURCES - 1) ||
@@ -1407,6 +1428,29 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn, int domid,
return ret;
}

+/*
+ * Add all subdirectories of mon_data for "ctrl_mon" groups
+ * and "monitor" groups with given domain id.
+ */
+void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, unsigned int domid)
+{
+ struct kernfs_node *parent_kn;
+ struct rdtgroup *pr, *cr;
+
+ if (!r->mon_enabled)
+ return;
+
+ list_for_each_entry(pr, &rdt_all_groups, rdtgroup_list) {
+ parent_kn = pr->mon_data_kn;
+ mkdir_mondata_subdir(parent_kn, domid, r, pr);
+
+ list_for_each_entry(cr, &pr->crdtgrp_list, crdtgrp_list) {
+ parent_kn = cr->mon_data_kn;
+ mkdir_mondata_subdir(parent_kn, domid, r, cr);
+ }
+ }
+}
+
static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_resource *r,
struct rdtgroup *pr)
--
1.9.1

2017-06-26 18:55:27

by Shivappa Vikas

Subject: [PATCH 17/21] x86/intel_rdt/cqm: Add sched_in support

The OS associates an RMID/CLOSid with a task by writing the per CPU
IA32_PQR_ASSOC MSR when the task is scheduled in.

The sched_in code stays a no-op unless we are running on an Intel SKU
which supports either resource control or monitoring and we enable
them by mounting the resctrl fs. The per cpu CLOSid/RMID values are
cached and the MSR write is performed only when a task with a different
CLOSid/RMID is scheduled in.
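
The caching described above can be shown as a userspace C sketch, where
pqr_writes stands in for the wrmsr of IA32_PQR_ASSOC (the names mirror
the kernel's, but this is a simplified model, not the kernel code):

```c
#include <assert.h>
#include <stdint.h>

struct pqr_state {
	uint32_t rmid;
	uint32_t closid;
};

/* Counts what would be IA32_PQR_ASSOC MSR writes. */
static int pqr_writes;

static void sched_in(struct pqr_state *state, uint32_t closid, uint32_t rmid)
{
	/* Skip the (expensive) MSR write when nothing changed. */
	if (closid != state->closid || rmid != state->rmid) {
		state->closid = closid;
		state->rmid = rmid;
		pqr_writes++;	/* stands in for wrmsr(IA32_PQR_ASSOC, ...) */
	}
}
```

Scheduling the same closid/rmid pair in repeatedly costs nothing after
the first write, which is the whole point of caching on this hot path.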

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/include/asm/intel_rdt_sched.h | 47 ++++++++++++++++++++++++----------
1 file changed, 34 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/intel_rdt_sched.h b/arch/x86/include/asm/intel_rdt_sched.h
index 4dee77b..7c823c4 100644
--- a/arch/x86/include/asm/intel_rdt_sched.h
+++ b/arch/x86/include/asm/intel_rdt_sched.h
@@ -27,27 +27,31 @@ struct intel_pqr_state {

DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
DECLARE_PER_CPU_READ_MOSTLY(int, cpu_closid);
+DECLARE_PER_CPU_READ_MOSTLY(int, cpu_rmid);
DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
+DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
+DECLARE_STATIC_KEY_FALSE(rdt_enable_key);

/*
- * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
+ * __intel_rdt_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
*
* Following considerations are made so that this has minimal impact
* on scheduler hot path:
* - This will stay as no-op unless we are running on an Intel SKU
- * which supports resource control and we enable by mounting the
- * resctrl file system.
- * - Caches the per cpu CLOSid values and does the MSR write only
- * when a task with a different CLOSid is scheduled in.
+ * which supports resource control or monitoring and we enable by
+ * mounting the resctrl file system.
+ * - Caches the per cpu CLOSid/RMID values and does the MSR write only
+ * when a task with a different CLOSid/RMID is scheduled in.
*
* Must be called with preemption disabled.
*/
-static inline void intel_rdt_sched_in(void)
+static void __intel_rdt_sched_in(void)
{
- if (static_branch_likely(&rdt_alloc_enable_key)) {
- struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
- int closid;
+ struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+ u32 closid = 0;
+ u32 rmid = 0;

+ if (static_branch_likely(&rdt_alloc_enable_key)) {
/*
* If this task has a closid assigned, use it.
* Else use the closid assigned to this cpu.
@@ -55,14 +59,31 @@ static inline void intel_rdt_sched_in(void)
closid = current->closid;
if (closid == 0)
closid = this_cpu_read(cpu_closid);
+ }
+
+ if (static_branch_likely(&rdt_mon_enable_key)) {
+ /*
+ * If this task has a rmid assigned, use it.
+ * Else use the rmid assigned to this cpu.
+ */
+ rmid = current->rmid;
+ if (rmid == 0)
+ rmid = this_cpu_read(cpu_rmid);
+ }

- if (closid != state->closid) {
- state->closid = closid;
- wrmsr(IA32_PQR_ASSOC, state->rmid, closid);
- }
+ if (closid != state->closid || rmid != state->rmid) {
+ state->closid = closid;
+ state->rmid = rmid;
+ wrmsr(IA32_PQR_ASSOC, rmid, closid);
}
}

+static inline void intel_rdt_sched_in(void)
+{
+ if (static_branch_likely(&rdt_enable_key))
+ __intel_rdt_sched_in();
+}
+
#else

static inline void intel_rdt_sched_in(void) {}
--
1.9.1

2017-06-26 18:55:45

by Shivappa Vikas

Subject: [PATCH 02/21] x86/intel_rdt: Fix memory leak during mount

If mount fails, the kn_info directory is not freed, causing a memory
leak. Fix the leak by freeing kn_info on the mount failure path.
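
The bug class being fixed can be sketched in userspace C. Allocation
counting stands in for the kernfs node lifetime, and the names are
illustrative; before the fix, the error path jumped to a label past the
cleanup for the already-created node.

```c
#include <assert.h>
#include <stdlib.h>

static int live_allocs;

static void *xalloc(void) { live_allocs++; return malloc(16); }
static void xfree(void *p) { live_allocs--; free(p); }

static int mount_sim(int fail_mount)
{
	void *kn_info = xalloc();	/* stands in for the info directory */

	if (fail_mount)
		goto out_destroy;	/* pre-fix code jumped past the free */

	xfree(kn_info);			/* success path, simplified */
	return 0;

out_destroy:
	xfree(kn_info);			/* the cleanup the patch adds */
	return -1;
}
```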

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index f5af0cc..9257bd9 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -856,11 +856,13 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
dentry = kernfs_mount(fs_type, flags, rdt_root,
RDTGROUP_SUPER_MAGIC, NULL);
if (IS_ERR(dentry))
- goto out_cdp;
+ goto out_destroy;

static_branch_enable(&rdt_enable_key);
goto out;

+out_destroy:
+ kernfs_remove(kn_info);
out_cdp:
cdp_disable();
out:
--
1.9.1

2017-06-26 18:55:54

by Shivappa Vikas

Subject: [PATCH 04/21] x86/intel_rdt: Introduce a common compile option for RDT

We currently have CONFIG_INTEL_RDT_A, which covers the RDT (Resource
Director Technology) allocation based resctrl filesystem interface. As
preparation for adding RDT monitoring support to the same resctrl
filesystem, change the config option to CONFIG_INTEL_RDT, which covers
both the RDT allocation and monitoring code.

No functional change.

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/Kconfig | 12 ++++++------
arch/x86/include/asm/intel_rdt.h | 4 ++--
arch/x86/kernel/cpu/Makefile | 2 +-
include/linux/sched.h | 2 +-
4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4ccfacc..52348a3 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -416,16 +416,16 @@ config GOLDFISH
def_bool y
depends on X86_GOLDFISH

-config INTEL_RDT_A
- bool "Intel Resource Director Technology Allocation support"
+config INTEL_RDT
+ bool "Intel Resource Director Technology support"
default n
depends on X86 && CPU_SUP_INTEL
select KERNFS
help
- Select to enable resource allocation which is a sub-feature of
- Intel Resource Director Technology(RDT). More information about
- RDT can be found in the Intel x86 Architecture Software
- Developer Manual.
+ Select to enable resource allocation and monitoring which are
+ sub-features of Intel Resource Director Technology(RDT). More
+ information about RDT can be found in the Intel x86
+ Architecture Software Developer Manual.

Say N if unsure.

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 597dc49..ae1efc3 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -1,7 +1,7 @@
#ifndef _ASM_X86_INTEL_RDT_H
#define _ASM_X86_INTEL_RDT_H

-#ifdef CONFIG_INTEL_RDT_A
+#ifdef CONFIG_INTEL_RDT

#include <linux/sched.h>
#include <linux/kernfs.h>
@@ -282,5 +282,5 @@ static inline void intel_rdt_sched_in(void)

static inline void intel_rdt_sched_in(void) {}

-#endif /* CONFIG_INTEL_RDT_A */
+#endif /* CONFIG_INTEL_RDT */
#endif /* _ASM_X86_INTEL_RDT_H */
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 5200001..a576121 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -32,7 +32,7 @@ obj-$(CONFIG_CPU_SUP_CENTAUR) += centaur.o
obj-$(CONFIG_CPU_SUP_TRANSMETA_32) += transmeta.o
obj-$(CONFIG_CPU_SUP_UMC_32) += umc.o

-obj-$(CONFIG_INTEL_RDT_A) += intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_schemata.o
+obj-$(CONFIG_INTEL_RDT) += intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_schemata.o

obj-$(CONFIG_X86_MCE) += mcheck/
obj-$(CONFIG_MTRR) += mtrr/
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2b69fc6..9e31b3d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -864,7 +864,7 @@ struct task_struct {
/* cg_list protected by css_set_lock and tsk->alloc_lock: */
struct list_head cg_list;
#endif
-#ifdef CONFIG_INTEL_RDT_A
+#ifdef CONFIG_INTEL_RDT
int closid;
#endif
#ifdef CONFIG_FUTEX
--
1.9.1

2017-06-26 18:56:03

by Shivappa Vikas

Subject: [PATCH 20/21] x86/intel_rdt/mbm: Add mbm counter initialization

MBM counters are monotonically increasing counts representing the total
memory bytes transferred up to a particular time. In order to calculate
total_bytes for an rdtgroup, we store the value of the counter when the
rdtgroup is created or when a new domain comes online.

When the mbm_total_bytes (all memory controller bytes) or
mbm_local_bytes (local memory controller bytes) file in "mon_data" is
read, it shows the total bytes for that rdtgroup since its creation.
Users can snapshot this at different time intervals to obtain
bytes/second.
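
The wrap-safe delta that the counting code accumulates into these
totals (introduced in patch 19) can be shown standalone. This is a
userspace C sketch assuming the 24-bit counter width the series uses;
shifting both readings up to bit 63 and back makes the subtraction
correct in modular arithmetic even when the hardware counter has
wrapped since the last read.

```c
#include <assert.h>
#include <stdint.h>

#define MBM_CNTR_WIDTH 24

/* Delta of two raw 24-bit counter readings, tolerating one wrap. */
static uint64_t mbm_delta(uint64_t cur_msr, uint64_t prev_msr)
{
	uint64_t shift = 64 - MBM_CNTR_WIDTH;
	uint64_t chunks = (cur_msr << shift) - (prev_msr << shift);

	return chunks >> shift;
}
```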

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt.c | 2 +-
arch/x86/kernel/cpu/intel_rdt.h | 11 +++++++++-
arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 25 +++++++++++++--------
arch/x86/kernel/cpu/intel_rdt_monitor.c | 7 ++++++
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 34 ++++++++++++++++++-----------
5 files changed, 55 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index a17f323..7762e32 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -469,7 +469,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
* per domain monitor data directories.
*/
if (static_branch_unlikely(&rdt_mon_enable_key))
- mkdir_mondata_subdir_allrdtgrp(r, d->id);
+ mkdir_mondata_subdir_allrdtgrp(r, d);
}

static void domain_remove_cpu(int cpu, struct rdt_resource *r)
diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index ab010fb..f0896ac 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -57,6 +57,7 @@ struct rmid_read {
struct rdtgroup *rgrp;
struct rdt_domain *d;
int evtid;
+ bool first;
u64 val;
};

@@ -253,6 +254,12 @@ static inline bool is_mbm_enabled(void)
return (is_mbm_total_enabled() || is_mbm_local_enabled());
}

+static inline bool is_mbm_event(int e)
+{
+ return (e >= QOS_L3_MBM_TOTAL_EVENT_ID &&
+ e <= QOS_L3_MBM_LOCAL_EVENT_ID);
+}
+
/**
* struct rdt_resource - attributes of an RDT resource
* @alloc_enabled: Is allocation enabled on this machine
@@ -390,6 +397,8 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
unsigned int dom_id);
void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- unsigned int dom_id);
+ struct rdt_domain *d);
+void mon_event_read(struct rmid_read *rr, struct rdt_domain *d,
+ struct rdtgroup *rdtgrp, int evtid, int first);

#endif /* _ASM_X86_INTEL_RDT_H */
diff --git a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
index 926d889..e9391e3 100644
--- a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
+++ b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
@@ -285,6 +285,21 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
return ret;
}

+void mon_event_read(struct rmid_read *rr, struct rdt_domain *d,
+ struct rdtgroup *rdtgrp, int evtid, int first)
+{
+ /*
+ * setup the parameters to send to the IPI to read the data.
+ */
+ rr->rgrp = rdtgrp;
+ rr->evtid = evtid;
+ rr->d = d;
+ rr->val = 0;
+ rr->first = first;
+
+ smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+}
+
int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
@@ -310,15 +325,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
goto out;
}

- /*
- * setup the parameters to send to the IPI to read the data.
- */
- rr.rgrp = rdtgrp;
- rr.evtid = evtid;
- rr.d = d;
- rr.val = 0;
-
- smp_call_function_any(&d->cpu_mask, mon_event_count, &rr, 1);
+ mon_event_read(&rr, d, rdtgrp, evtid, false);

if (rr.val & RMID_VAL_ERROR)
seq_puts(m, "Error\n");
diff --git a/arch/x86/kernel/cpu/intel_rdt_monitor.c b/arch/x86/kernel/cpu/intel_rdt_monitor.c
index 2c5768c..a196f4d 100644
--- a/arch/x86/kernel/cpu/intel_rdt_monitor.c
+++ b/arch/x86/kernel/cpu/intel_rdt_monitor.c
@@ -230,6 +230,13 @@ static bool __mon_event_count(u32 rmid, struct rmid_read *rr)
default:
return false;
}
+
+ if (rr->first) {
+ m->prev_msr = tval;
+ m->chunks = 0;
+ return true;
+ }
+
shift = 64 - MBM_CNTR_WIDTH;
chunks = (tval << shift) - (m->prev_msr << shift);
chunks >>= shift;
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 8e1581a..1b84485 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -1379,12 +1379,14 @@ static int get_rdt_resourceid(struct rdt_resource *r)
return ((r - rdt_resources_all) / sizeof(struct rdt_resource));
}

-static int mkdir_mondata_subdir(struct kernfs_node *parent_kn, int domid,
+static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
+ struct rdt_domain *d,
struct rdt_resource *r, struct rdtgroup *pr)
{
union mon_data_bits priv;
struct kernfs_node *kn;
struct mon_evt *mevt;
+ struct rmid_read rr;
char name[32];
int ret, rid;

@@ -1392,7 +1394,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn, int domid,
if (rid < 0)
return -EINVAL;

- sprintf(name, "mon_%s_%02d", r->name, domid);
+ sprintf(name, "mon_%s_%02d", r->name, d->id);
/* create the directory */
kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, pr);
if (IS_ERR(kn))
@@ -1413,12 +1415,15 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn, int domid,
}

priv.u.rid = rid;
- priv.u.domid = domid;
+ priv.u.domid = d->id;
list_for_each_entry(mevt, &r->evt_list, list) {
priv.u.evtid = mevt->evtid;
ret = mon_addfile(kn, mevt->name, priv.priv);
if (ret)
goto out_destroy;
+
+ if (is_mbm_event(mevt->evtid))
+ mon_event_read(&rr, d, pr, mevt->evtid, true);
}
kernfs_activate(kn);
return 0;
@@ -1432,7 +1437,8 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn, int domid,
* Add all subdirectories of mon_data for "ctrl_mon" groups
* and "monitor" groups with given domain id.
*/
-void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, unsigned int domid)
+void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
+ struct rdt_domain *d)
{
struct kernfs_node *parent_kn;
struct rdtgroup *pr, *cr;
@@ -1442,11 +1448,11 @@ void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, unsigned int domid)

list_for_each_entry(pr, &rdt_all_groups, rdtgroup_list) {
parent_kn = pr->mon_data_kn;
- mkdir_mondata_subdir(parent_kn, domid, r, pr);
+ mkdir_mondata_subdir(parent_kn, d, r, pr);

list_for_each_entry(cr, &pr->crdtgrp_list, crdtgrp_list) {
parent_kn = cr->mon_data_kn;
- mkdir_mondata_subdir(parent_kn, domid, r, cr);
+ mkdir_mondata_subdir(parent_kn, d, r, cr);
}
}
}
@@ -1459,7 +1465,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
int ret;

list_for_each_entry(dom, &r->domains, list) {
- ret = mkdir_mondata_subdir(parent_kn, dom->id, r, pr);
+ ret = mkdir_mondata_subdir(parent_kn, dom, r, pr);
if (ret)
return ret;
}
@@ -1575,20 +1581,22 @@ static int mkdir_rdt_common(struct kernfs_node *pkn, struct kernfs_node *prkn,
goto out_destroy;

if (rdt_mon_features) {
- ret = mkdir_mondata_all(kn, rdtgrp, &rdtgrp->mon_data_kn);
- if (ret)
- goto out_destroy;
-
ret = alloc_rmid();
if (ret < 0)
- return ret;
-
+ goto out_destroy;
rdtgrp->rmid = ret;
+
+ ret = mkdir_mondata_all(kn, rdtgrp, &rdtgrp->mon_data_kn);
+ if (ret)
+ goto out_idfree;
}
kernfs_activate(kn);

return 0;

+out_idfree:
+ if (rdtgrp->rmid)
+ free_rmid(rdtgrp->rmid);
out_destroy:
kernfs_remove(rdtgrp->kn);
out_cancel_ref:
--
1.9.1

2017-06-26 18:56:11

by Shivappa Vikas

Subject: [PATCH 05/21] x86/intel_rdt: Change file names to accommodate RDT monitor code

Because the "perf cqm" and resctrl code were added separately and are
individually configurable, there is duplicated context switch code, and
the global header contains declarations that are not really needed there.

Move only the scheduling specific code and definitions to
<asm/intel_rdt_sched.h> and put all the other declarations in a
local intel_rdt.h.

h/t to Reinette Chatre for pointing out that we should separate the
public interfaces used by other parts of the kernel from private
objects shared between the various files comprising RDT.

No functional change.

Signed-off-by: Vikas Shivappa <[email protected]>
---
MAINTAINERS | 2 +-
arch/x86/include/asm/intel_rdt.h | 286 -------------------------------
arch/x86/include/asm/intel_rdt_common.h | 25 ---
arch/x86/include/asm/intel_rdt_sched.h | 72 ++++++++
arch/x86/kernel/cpu/intel_rdt.c | 5 +-
arch/x86/kernel/cpu/intel_rdt.h | 243 ++++++++++++++++++++++++++
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 4 +-
arch/x86/kernel/cpu/intel_rdt_schemata.c | 2 +-
arch/x86/kernel/process_32.c | 2 +-
arch/x86/kernel/process_64.c | 2 +-
10 files changed, 324 insertions(+), 319 deletions(-)
delete mode 100644 arch/x86/include/asm/intel_rdt.h
delete mode 100644 arch/x86/include/asm/intel_rdt_common.h
create mode 100644 arch/x86/include/asm/intel_rdt_sched.h
create mode 100644 arch/x86/kernel/cpu/intel_rdt.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 7a28acd..39d7a7f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10734,7 +10734,7 @@ M: Fenghua Yu <[email protected]>
L: [email protected]
S: Supported
F: arch/x86/kernel/cpu/intel_rdt*
-F: arch/x86/include/asm/intel_rdt*
+F: arch/x86/include/asm/intel_rdt_sched.h
F: Documentation/x86/intel_rdt*

READ-COPY UPDATE (RCU)
diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
deleted file mode 100644
index ae1efc3..0000000
--- a/arch/x86/include/asm/intel_rdt.h
+++ /dev/null
@@ -1,286 +0,0 @@
-#ifndef _ASM_X86_INTEL_RDT_H
-#define _ASM_X86_INTEL_RDT_H
-
-#ifdef CONFIG_INTEL_RDT
-
-#include <linux/sched.h>
-#include <linux/kernfs.h>
-#include <linux/jump_label.h>
-
-#include <asm/intel_rdt_common.h>
-
-#define IA32_L3_QOS_CFG 0xc81
-#define IA32_L3_CBM_BASE 0xc90
-#define IA32_L2_CBM_BASE 0xd10
-#define IA32_MBA_THRTL_BASE 0xd50
-
-#define L3_QOS_CDP_ENABLE 0x01ULL
-
-/**
- * struct rdtgroup - store rdtgroup's data in resctrl file system.
- * @kn: kernfs node
- * @rdtgroup_list: linked list for all rdtgroups
- * @closid: closid for this rdtgroup
- * @cpu_mask: CPUs assigned to this rdtgroup
- * @flags: status bits
- * @waitcount: how many cpus expect to find this
- * group when they acquire rdtgroup_mutex
- */
-struct rdtgroup {
- struct kernfs_node *kn;
- struct list_head rdtgroup_list;
- int closid;
- struct cpumask cpu_mask;
- int flags;
- atomic_t waitcount;
-};
-
-/* rdtgroup.flags */
-#define RDT_DELETED 1
-
-/* rftype.flags */
-#define RFTYPE_FLAGS_CPUS_LIST 1
-
-/* List of all resource groups */
-extern struct list_head rdt_all_groups;
-
-extern int max_name_width, max_data_width;
-
-int __init rdtgroup_init(void);
-
-/**
- * struct rftype - describe each file in the resctrl file system
- * @name: File name
- * @mode: Access mode
- * @kf_ops: File operations
- * @flags: File specific RFTYPE_FLAGS_* flags
- * @seq_show: Show content of the file
- * @write: Write to the file
- */
-struct rftype {
- char *name;
- umode_t mode;
- struct kernfs_ops *kf_ops;
- unsigned long flags;
-
- int (*seq_show)(struct kernfs_open_file *of,
- struct seq_file *sf, void *v);
- /*
- * write() is the generic write callback which maps directly to
- * kernfs write operation and overrides all other operations.
- * Maximum write size is determined by ->max_write_len.
- */
- ssize_t (*write)(struct kernfs_open_file *of,
- char *buf, size_t nbytes, loff_t off);
-};
-
-/**
- * struct rdt_domain - group of cpus sharing an RDT resource
- * @list: all instances of this resource
- * @id: unique id for this instance
- * @cpu_mask: which cpus share this resource
- * @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID)
- * @new_ctrl: new ctrl value to be loaded
- * @have_new_ctrl: did user provide new_ctrl for this domain
- */
-struct rdt_domain {
- struct list_head list;
- int id;
- struct cpumask cpu_mask;
- u32 *ctrl_val;
- u32 new_ctrl;
- bool have_new_ctrl;
-};
-
-/**
- * struct msr_param - set a range of MSRs from a domain
- * @res: The resource to use
- * @low: Beginning index from base MSR
- * @high: End index
- */
-struct msr_param {
- struct rdt_resource *res;
- int low;
- int high;
-};
-
-/**
- * struct rdt_cache - Cache allocation related data
- * @cbm_len: Length of the cache bit mask
- * @min_cbm_bits: Minimum number of consecutive bits to be set
- * @cbm_idx_mult: Multiplier of CBM index
- * @cbm_idx_offset: Offset of CBM index. CBM index is computed by:
- * closid * cbm_idx_multi + cbm_idx_offset
- * in a cache bit mask
- */
-struct rdt_cache {
- unsigned int cbm_len;
- unsigned int min_cbm_bits;
- unsigned int cbm_idx_mult;
- unsigned int cbm_idx_offset;
-};
-
-/**
- * struct rdt_membw - Memory bandwidth allocation related data
- * @max_delay: Max throttle delay. Delay is the hardware
- * representation for memory bandwidth.
- * @min_bw: Minimum memory bandwidth percentage user can request
- * @bw_gran: Granularity at which the memory bandwidth is allocated
- * @delay_linear: True if memory B/W delay is in linear scale
- * @mb_map: Mapping of memory B/W percentage to memory B/W delay
- */
-struct rdt_membw {
- u32 max_delay;
- u32 min_bw;
- u32 bw_gran;
- u32 delay_linear;
- u32 *mb_map;
-};
-
-/**
- * struct rdt_resource - attributes of an RDT resource
- * @enabled: Is this feature enabled on this machine
- * @capable: Is this feature available on this machine
- * @name: Name to use in "schemata" file
- * @num_closid: Number of CLOSIDs available
- * @cache_level: Which cache level defines scope of this resource
- * @default_ctrl: Specifies default cache cbm or memory B/W percent.
- * @msr_base: Base MSR address for CBMs
- * @msr_update: Function pointer to update QOS MSRs
- * @data_width: Character width of data when displaying
- * @domains: All domains for this resource
- * @cache: Cache allocation related data
- * @info_files: resctrl info files for the resource
- * @nr_info_files: Number of info files
- * @format_str: Per resource format string to show domain value
- * @parse_ctrlval: Per resource function pointer to parse control values
- */
-struct rdt_resource {
- bool enabled;
- bool capable;
- char *name;
- int num_closid;
- int cache_level;
- u32 default_ctrl;
- unsigned int msr_base;
- void (*msr_update) (struct rdt_domain *d, struct msr_param *m,
- struct rdt_resource *r);
- int data_width;
- struct list_head domains;
- struct rdt_cache cache;
- struct rdt_membw membw;
- struct rftype *info_files;
- int nr_info_files;
- const char *format_str;
- int (*parse_ctrlval) (char *buf, struct rdt_resource *r,
- struct rdt_domain *d);
-};
-
-void rdt_get_cache_infofile(struct rdt_resource *r);
-void rdt_get_mba_infofile(struct rdt_resource *r);
-int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d);
-int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d);
-
-extern struct mutex rdtgroup_mutex;
-
-extern struct rdt_resource rdt_resources_all[];
-extern struct rdtgroup rdtgroup_default;
-DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
-
-int __init rdtgroup_init(void);
-
-enum {
- RDT_RESOURCE_L3,
- RDT_RESOURCE_L3DATA,
- RDT_RESOURCE_L3CODE,
- RDT_RESOURCE_L2,
- RDT_RESOURCE_MBA,
-
- /* Must be the last */
- RDT_NUM_RESOURCES,
-};
-
-#define for_each_capable_rdt_resource(r) \
- for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
- r++) \
- if (r->capable)
-
-#define for_each_enabled_rdt_resource(r) \
- for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
- r++) \
- if (r->enabled)
-
-/* CPUID.(EAX=10H, ECX=ResID=1).EAX */
-union cpuid_0x10_1_eax {
- struct {
- unsigned int cbm_len:5;
- } split;
- unsigned int full;
-};
-
-/* CPUID.(EAX=10H, ECX=ResID=3).EAX */
-union cpuid_0x10_3_eax {
- struct {
- unsigned int max_delay:12;
- } split;
- unsigned int full;
-};
-
-/* CPUID.(EAX=10H, ECX=ResID).EDX */
-union cpuid_0x10_x_edx {
- struct {
- unsigned int cos_max:16;
- } split;
- unsigned int full;
-};
-
-DECLARE_PER_CPU_READ_MOSTLY(int, cpu_closid);
-
-void rdt_ctrl_update(void *arg);
-struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn);
-void rdtgroup_kn_unlock(struct kernfs_node *kn);
-ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
- char *buf, size_t nbytes, loff_t off);
-int rdtgroup_schemata_show(struct kernfs_open_file *of,
- struct seq_file *s, void *v);
-
-/*
- * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
- *
- * Following considerations are made so that this has minimal impact
- * on scheduler hot path:
- * - This will stay as no-op unless we are running on an Intel SKU
- * which supports resource control and we enable by mounting the
- * resctrl file system.
- * - Caches the per cpu CLOSid values and does the MSR write only
- * when a task with a different CLOSid is scheduled in.
- *
- * Must be called with preemption disabled.
- */
-static inline void intel_rdt_sched_in(void)
-{
- if (static_branch_likely(&rdt_enable_key)) {
- struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
- int closid;
-
- /*
- * If this task has a closid assigned, use it.
- * Else use the closid assigned to this cpu.
- */
- closid = current->closid;
- if (closid == 0)
- closid = this_cpu_read(cpu_closid);
-
- if (closid != state->closid) {
- state->closid = closid;
- wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, closid);
- }
- }
-}
-
-#else
-
-static inline void intel_rdt_sched_in(void) {}
-
-#endif /* CONFIG_INTEL_RDT */
-#endif /* _ASM_X86_INTEL_RDT_H */
diff --git a/arch/x86/include/asm/intel_rdt_common.h b/arch/x86/include/asm/intel_rdt_common.h
deleted file mode 100644
index c953218..0000000
--- a/arch/x86/include/asm/intel_rdt_common.h
+++ /dev/null
@@ -1,25 +0,0 @@
-#ifndef _ASM_X86_INTEL_RDT_COMMON_H
-#define _ASM_X86_INTEL_RDT_COMMON_H
-
-#define MSR_IA32_PQR_ASSOC 0x0c8f
-
-/**
- * struct intel_pqr_state - State cache for the PQR MSR
- * @rmid: The cached Resource Monitoring ID
- * @closid: The cached Class Of Service ID
- *
- * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
- * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
- * contains both parts, so we need to cache them.
- *
- * The cache also helps to avoid pointless updates if the value does
- * not change.
- */
-struct intel_pqr_state {
- u32 rmid;
- u32 closid;
-};
-
-DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
-
-#endif /* _ASM_X86_INTEL_RDT_COMMON_H */
diff --git a/arch/x86/include/asm/intel_rdt_sched.h b/arch/x86/include/asm/intel_rdt_sched.h
new file mode 100644
index 0000000..62a70bc
--- /dev/null
+++ b/arch/x86/include/asm/intel_rdt_sched.h
@@ -0,0 +1,72 @@
+#ifndef _ASM_X86_INTEL_RDT_SCHED_H
+#define _ASM_X86_INTEL_RDT_SCHED_H
+
+#ifdef CONFIG_INTEL_RDT
+
+#include <linux/sched.h>
+#include <linux/jump_label.h>
+
+#define IA32_PQR_ASSOC 0x0c8f
+
+/**
+ * struct intel_pqr_state - State cache for the PQR MSR
+ * @rmid: The cached Resource Monitoring ID
+ * @closid: The cached Class Of Service ID
+ *
+ * The upper 32 bits of IA32_PQR_ASSOC contain closid and the
+ * lower 10 bits rmid. The update to IA32_PQR_ASSOC always
+ * contains both parts, so we need to cache them.
+ *
+ * The cache also helps to avoid pointless updates if the value does
+ * not change.
+ */
+struct intel_pqr_state {
+ u32 rmid;
+ u32 closid;
+};
+
+DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
+DECLARE_PER_CPU_READ_MOSTLY(int, cpu_closid);
+DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
+
+/*
+ * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
+ *
+ * Following considerations are made so that this has minimal impact
+ * on scheduler hot path:
+ * - This will stay as no-op unless we are running on an Intel SKU
+ * which supports resource control and we enable by mounting the
+ * resctrl file system.
+ * - Caches the per cpu CLOSid values and does the MSR write only
+ * when a task with a different CLOSid is scheduled in.
+ *
+ * Must be called with preemption disabled.
+ */
+static inline void intel_rdt_sched_in(void)
+{
+ if (static_branch_likely(&rdt_enable_key)) {
+ struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+ int closid;
+
+ /*
+ * If this task has a closid assigned, use it.
+ * Else use the closid assigned to this cpu.
+ */
+ closid = current->closid;
+ if (closid == 0)
+ closid = this_cpu_read(cpu_closid);
+
+ if (closid != state->closid) {
+ state->closid = closid;
+ wrmsr(IA32_PQR_ASSOC, state->rmid, closid);
+ }
+ }
+}
+
+#else
+
+static inline void intel_rdt_sched_in(void) {}
+
+#endif /* CONFIG_INTEL_RDT */
+
+#endif /* _ASM_X86_INTEL_RDT_SCHED_H */
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 989a997..08872e9 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -30,7 +30,8 @@
#include <linux/cpuhotplug.h>

#include <asm/intel-family.h>
-#include <asm/intel_rdt.h>
+#include <asm/intel_rdt_sched.h>
+#include "intel_rdt.h"

#define MAX_MBA_BW 100u
#define MBA_IS_LINEAR 0x4
@@ -455,7 +456,7 @@ static void clear_closid(int cpu)

per_cpu(cpu_closid, cpu) = 0;
state->closid = 0;
- wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, 0);
+ wrmsr(IA32_PQR_ASSOC, state->rmid, 0);
}

static int intel_rdt_online_cpu(unsigned int cpu)
diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
new file mode 100644
index 0000000..0e4852d
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -0,0 +1,243 @@
+#ifndef _ASM_X86_INTEL_RDT_H
+#define _ASM_X86_INTEL_RDT_H
+
+#include <linux/sched.h>
+#include <linux/kernfs.h>
+#include <linux/jump_label.h>
+
+#define IA32_L3_QOS_CFG 0xc81
+#define IA32_L3_CBM_BASE 0xc90
+#define IA32_L2_CBM_BASE 0xd10
+#define IA32_MBA_THRTL_BASE 0xd50
+
+#define L3_QOS_CDP_ENABLE 0x01ULL
+
+/**
+ * struct rdtgroup - store rdtgroup's data in resctrl file system.
+ * @kn: kernfs node
+ * @rdtgroup_list: linked list for all rdtgroups
+ * @closid: closid for this rdtgroup
+ * @cpu_mask: CPUs assigned to this rdtgroup
+ * @flags: status bits
+ * @waitcount: how many cpus expect to find this
+ * group when they acquire rdtgroup_mutex
+ */
+struct rdtgroup {
+ struct kernfs_node *kn;
+ struct list_head rdtgroup_list;
+ int closid;
+ struct cpumask cpu_mask;
+ int flags;
+ atomic_t waitcount;
+};
+
+/* rdtgroup.flags */
+#define RDT_DELETED 1
+
+/* rftype.flags */
+#define RFTYPE_FLAGS_CPUS_LIST 1
+
+/* List of all resource groups */
+extern struct list_head rdt_all_groups;
+
+extern int max_name_width, max_data_width;
+
+int __init rdtgroup_init(void);
+
+/**
+ * struct rftype - describe each file in the resctrl file system
+ * @name: File name
+ * @mode: Access mode
+ * @kf_ops: File operations
+ * @flags: File specific RFTYPE_FLAGS_* flags
+ * @seq_show: Show content of the file
+ * @write: Write to the file
+ */
+struct rftype {
+ char *name;
+ umode_t mode;
+ struct kernfs_ops *kf_ops;
+ unsigned long flags;
+
+ int (*seq_show)(struct kernfs_open_file *of,
+ struct seq_file *sf, void *v);
+ /*
+ * write() is the generic write callback which maps directly to
+ * kernfs write operation and overrides all other operations.
+ * Maximum write size is determined by ->max_write_len.
+ */
+ ssize_t (*write)(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off);
+};
+
+/**
+ * struct rdt_domain - group of cpus sharing an RDT resource
+ * @list: all instances of this resource
+ * @id: unique id for this instance
+ * @cpu_mask: which cpus share this resource
+ * @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID)
+ * @new_ctrl: new ctrl value to be loaded
+ * @have_new_ctrl: did user provide new_ctrl for this domain
+ */
+struct rdt_domain {
+ struct list_head list;
+ int id;
+ struct cpumask cpu_mask;
+ u32 *ctrl_val;
+ u32 new_ctrl;
+ bool have_new_ctrl;
+};
+
+/**
+ * struct msr_param - set a range of MSRs from a domain
+ * @res: The resource to use
+ * @low: Beginning index from base MSR
+ * @high: End index
+ */
+struct msr_param {
+ struct rdt_resource *res;
+ int low;
+ int high;
+};
+
+/**
+ * struct rdt_cache - Cache allocation related data
+ * @cbm_len: Length of the cache bit mask
+ * @min_cbm_bits: Minimum number of consecutive bits to be set
+ * @cbm_idx_mult: Multiplier of CBM index
+ * @cbm_idx_offset: Offset of CBM index. CBM index is computed by:
+ * closid * cbm_idx_multi + cbm_idx_offset
+ * in a cache bit mask
+ */
+struct rdt_cache {
+ unsigned int cbm_len;
+ unsigned int min_cbm_bits;
+ unsigned int cbm_idx_mult;
+ unsigned int cbm_idx_offset;
+};
+
+/**
+ * struct rdt_membw - Memory bandwidth allocation related data
+ * @max_delay: Max throttle delay. Delay is the hardware
+ * representation for memory bandwidth.
+ * @min_bw: Minimum memory bandwidth percentage user can request
+ * @bw_gran: Granularity at which the memory bandwidth is allocated
+ * @delay_linear: True if memory B/W delay is in linear scale
+ * @mb_map: Mapping of memory B/W percentage to memory B/W delay
+ */
+struct rdt_membw {
+ u32 max_delay;
+ u32 min_bw;
+ u32 bw_gran;
+ u32 delay_linear;
+ u32 *mb_map;
+};
+
+/**
+ * struct rdt_resource - attributes of an RDT resource
+ * @enabled: Is this feature enabled on this machine
+ * @capable: Is this feature available on this machine
+ * @name: Name to use in "schemata" file
+ * @num_closid: Number of CLOSIDs available
+ * @cache_level: Which cache level defines scope of this resource
+ * @default_ctrl: Specifies default cache cbm or memory B/W percent.
+ * @msr_base: Base MSR address for CBMs
+ * @msr_update: Function pointer to update QOS MSRs
+ * @data_width: Character width of data when displaying
+ * @domains: All domains for this resource
+ * @cache: Cache allocation related data
+ * @info_files: resctrl info files for the resource
+ * @nr_info_files: Number of info files
+ * @format_str: Per resource format string to show domain value
+ * @parse_ctrlval: Per resource function pointer to parse control values
+ */
+struct rdt_resource {
+ bool enabled;
+ bool capable;
+ char *name;
+ int num_closid;
+ int cache_level;
+ u32 default_ctrl;
+ unsigned int msr_base;
+ void (*msr_update) (struct rdt_domain *d, struct msr_param *m,
+ struct rdt_resource *r);
+ int data_width;
+ struct list_head domains;
+ struct rdt_cache cache;
+ struct rdt_membw membw;
+ struct rftype *info_files;
+ int nr_info_files;
+ const char *format_str;
+ int (*parse_ctrlval) (char *buf, struct rdt_resource *r,
+ struct rdt_domain *d);
+};
+
+void rdt_get_cache_infofile(struct rdt_resource *r);
+void rdt_get_mba_infofile(struct rdt_resource *r);
+int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d);
+int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d);
+
+extern struct mutex rdtgroup_mutex;
+
+extern struct rdt_resource rdt_resources_all[];
+extern struct rdtgroup rdtgroup_default;
+DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
+
+int __init rdtgroup_init(void);
+
+enum {
+ RDT_RESOURCE_L3,
+ RDT_RESOURCE_L3DATA,
+ RDT_RESOURCE_L3CODE,
+ RDT_RESOURCE_L2,
+ RDT_RESOURCE_MBA,
+
+ /* Must be the last */
+ RDT_NUM_RESOURCES,
+};
+
+#define for_each_capable_rdt_resource(r) \
+ for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
+ r++) \
+ if (r->capable)
+
+#define for_each_enabled_rdt_resource(r) \
+ for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
+ r++) \
+ if (r->enabled)
+
+/* CPUID.(EAX=10H, ECX=ResID=1).EAX */
+union cpuid_0x10_1_eax {
+ struct {
+ unsigned int cbm_len:5;
+ } split;
+ unsigned int full;
+};
+
+/* CPUID.(EAX=10H, ECX=ResID=3).EAX */
+union cpuid_0x10_3_eax {
+ struct {
+ unsigned int max_delay:12;
+ } split;
+ unsigned int full;
+};
+
+/* CPUID.(EAX=10H, ECX=ResID).EDX */
+union cpuid_0x10_x_edx {
+ struct {
+ unsigned int cos_max:16;
+ } split;
+ unsigned int full;
+};
+
+DECLARE_PER_CPU_READ_MOSTLY(int, cpu_closid);
+
+void rdt_ctrl_update(void *arg);
+struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn);
+void rdtgroup_kn_unlock(struct kernfs_node *kn);
+ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off);
+int rdtgroup_schemata_show(struct kernfs_open_file *of,
+ struct seq_file *s, void *v);
+
+#endif /* _ASM_X86_INTEL_RDT_H */
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 9257bd9..fab8811 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -32,8 +32,8 @@

#include <uapi/linux/magic.h>

-#include <asm/intel_rdt.h>
-#include <asm/intel_rdt_common.h>
+#include <asm/intel_rdt_sched.h>
+#include "intel_rdt.h"

DEFINE_STATIC_KEY_FALSE(rdt_enable_key);
struct kernfs_root *rdt_root;
diff --git a/arch/x86/kernel/cpu/intel_rdt_schemata.c b/arch/x86/kernel/cpu/intel_rdt_schemata.c
index 406d7a6..8cef1c8 100644
--- a/arch/x86/kernel/cpu/intel_rdt_schemata.c
+++ b/arch/x86/kernel/cpu/intel_rdt_schemata.c
@@ -26,7 +26,7 @@
#include <linux/kernfs.h>
#include <linux/seq_file.h>
#include <linux/slab.h>
-#include <asm/intel_rdt.h>
+#include "intel_rdt.h"

/*
* Check whether MBA bandwidth percentage value is correct. The value is
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index ffeae81..930aebc 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -56,7 +56,7 @@
#include <asm/debugreg.h>
#include <asm/switch_to.h>
#include <asm/vm86.h>
-#include <asm/intel_rdt.h>
+#include <asm/intel_rdt_sched.h>
#include <asm/proto.h>

void __show_regs(struct pt_regs *regs, int all)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b6840bf..063360e 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -52,7 +52,7 @@
#include <asm/switch_to.h>
#include <asm/xen/hypervisor.h>
#include <asm/vdso.h>
-#include <asm/intel_rdt.h>
+#include <asm/intel_rdt_sched.h>
#include <asm/unistd.h>
#ifdef CONFIG_IA32_EMULATION
/* Not included via unistd.h */
--
1.9.1

2017-06-26 18:55:38

by Shivappa Vikas

Subject: [PATCH 21/21] x86/intel_rdt/mbm: Handle counter overflow

Set up a delayed work queue for each domain that will read all
the MBM counters of active RMIDs once per second to make sure
that they don't wrap around between reads from users.

[Tony: Added the initializations for the work structure and completed
the patch]

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt.c | 34 +++++++++++++++----
arch/x86/kernel/cpu/intel_rdt.h | 9 +++++
arch/x86/kernel/cpu/intel_rdt_monitor.c | 57 ++++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 9 +++++
4 files changed, 103 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 7762e32..fe5c22a 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -318,6 +318,19 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
wrmsrl(r->msr_base + cbm_idx(r, i), d->ctrl_val[i]);
}

+struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
+{
+ struct rdt_domain *d;
+
+ list_for_each_entry(d, &r->domains, list) {
+ /* Find the domain that contains this CPU */
+ if (cpumask_test_cpu(cpu, &d->cpu_mask))
+ return d;
+ }
+
+ return NULL;
+}
+
void rdt_ctrl_update(void *arg)
{
struct msr_param *m = arg;
@@ -325,12 +338,10 @@ void rdt_ctrl_update(void *arg)
int cpu = smp_processor_id();
struct rdt_domain *d;

- list_for_each_entry(d, &r->domains, list) {
- /* Find the domain that contains this CPU */
- if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
- r->msr_update(d, m, r);
- return;
- }
+ d = get_domain_from_cpu(cpu, r);
+ if (d) {
+ r->msr_update(d, m, r);
+ return;
}
pr_warn_once("cpu %d not found in any domain for resource %s\n",
cpu, r->name);
@@ -413,6 +424,12 @@ static bool domain_setup_mbm_state(struct rdt_resource *r, struct rdt_domain *d)
return false;
}
}
+
+ if (is_mbm_enabled()) {
+ INIT_DELAYED_WORK(&d->mbm_over, mbm_handle_overflow);
+ mbm_setup_overflow_handler(d);
+ }
+
return true;
}

@@ -495,7 +512,12 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
kfree(d->mbm_total);
kfree(d->mbm_local);
list_del(&d->list);
+ cancel_delayed_work(&d->mbm_over);
kfree(d);
+ } else if (r == &rdt_resources_all[RDT_RESOURCE_L3] &&
+ cpu == d->mbm_work_cpu) {
+ cancel_delayed_work(&d->mbm_over);
+ mbm_setup_overflow_handler(d);
}
}

diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index f0896ac..652efdbe 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -21,6 +21,7 @@
#define QOS_L3_MBM_LOCAL_EVENT_ID 0x03

#define MBM_CNTR_WIDTH 24
+#define MBM_OVERFLOW_INTERVAL 1000

#define RMID_VAL_ERROR (1ULL << 63)
#define RMID_VAL_UNAVAIL (1ULL << 62)
@@ -179,6 +180,9 @@ struct mbm_state {
* @cpu_mask: which cpus share this resource
* @mbm_total: saved state for MBM total bandwidth
* @mbm_local: saved state for MBM local bandwidth
+ * @mbm_over: worker to periodically read MBM h/w counters
+ * @mbm_work_cpu:
+ * worker cpu for MBM h/w counters
* @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID)
* @new_ctrl: new ctrl value to be loaded
* @have_new_ctrl: did user provide new_ctrl for this domain
@@ -189,6 +193,8 @@ struct rdt_domain {
struct cpumask cpu_mask;
struct mbm_state *mbm_total;
struct mbm_state *mbm_local;
+ struct delayed_work mbm_over;
+ int mbm_work_cpu;
u32 *ctrl_val;
u32 new_ctrl;
bool have_new_ctrl;
@@ -389,6 +395,7 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
int rdtgroup_schemata_show(struct kernfs_open_file *of,
struct seq_file *s, void *v);
+struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
int alloc_rmid(void);
void free_rmid(u32 rmid);
void rdt_get_mon_l3_config(struct rdt_resource *r);
@@ -400,5 +407,7 @@ void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
struct rdt_domain *d);
void mon_event_read(struct rmid_read *rr, struct rdt_domain *d,
struct rdtgroup *rdtgrp, int evtid, int first);
+void mbm_setup_overflow_handler(struct rdt_domain *dom);
+void mbm_handle_overflow(struct work_struct *work);

#endif /* _ASM_X86_INTEL_RDT_H */
diff --git a/arch/x86/kernel/cpu/intel_rdt_monitor.c b/arch/x86/kernel/cpu/intel_rdt_monitor.c
index a196f4d..d78a635 100644
--- a/arch/x86/kernel/cpu/intel_rdt_monitor.c
+++ b/arch/x86/kernel/cpu/intel_rdt_monitor.c
@@ -271,6 +271,63 @@ void mon_event_count(void *info)
}
}

+static void mbm_update(struct rdt_domain *d, int rmid)
+{
+ struct rmid_read rr;
+
+ rr.first = false;
+ rr.d = d;
+
+ if (is_mbm_total_enabled()) {
+ rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
+ __mon_event_count(rmid, &rr);
+ }
+ if (is_mbm_local_enabled()) {
+ rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
+ __mon_event_count(rmid, &rr);
+ }
+}
+
+void mbm_handle_overflow(struct work_struct *work)
+{
+ unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
+ int cpu = smp_processor_id();
+ struct rdtgroup *pr, *cr;
+ struct rdt_domain *d;
+
+ mutex_lock(&rdtgroup_mutex);
+
+ if (!static_branch_likely(&rdt_enable_key))
+ goto out_unlock;
+
+ d = get_domain_from_cpu(cpu, &rdt_resources_all[RDT_RESOURCE_L3]);
+ if (!d)
+ goto out_unlock;
+
+ list_for_each_entry(pr, &rdt_all_groups, rdtgroup_list) {
+ mbm_update(d, pr->rmid);
+
+ list_for_each_entry(cr, &pr->crdtgrp_list, crdtgrp_list)
+ mbm_update(d, cr->rmid);
+ }
+
+ schedule_delayed_work_on(cpu, &d->mbm_over, delay);
+out_unlock:
+ mutex_unlock(&rdtgroup_mutex);
+}
+
+void mbm_setup_overflow_handler(struct rdt_domain *dom)
+{
+ unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
+ int cpu;
+
+ if (!static_branch_likely(&rdt_enable_key))
+ return;
+ cpu = cpumask_any(&dom->cpu_mask);
+ dom->mbm_work_cpu = cpu;
+ schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
+}
+
static int dom_data_init(struct rdt_resource *r)
{
struct rmid_entry *entry = NULL;
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 1b84485..a78f7f7 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -1092,6 +1092,8 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
int flags, const char *unused_dev_name,
void *data)
{
+ struct rdt_domain *dom;
+ struct rdt_resource *r;
struct dentry *dentry;
int ret;

@@ -1150,6 +1152,13 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,

if (rdt_alloc_enabled || rdt_mon_features)
static_branch_enable(&rdt_enable_key);
+
+ if (is_mbm_enabled()) {
+ r = &rdt_resources_all[RDT_RESOURCE_L3];
+ list_for_each_entry(dom, &r->domains, list)
+ mbm_setup_overflow_handler(dom);
+ }
+
goto out;

out_mondata:
--
1.9.1

2017-06-26 18:56:36

by Shivappa Vikas

Subject: [PATCH 14/21] x86/intel_rdt/cqm: Add mon_data

Add a mon_data directory for the root rdtgroup and all other rdtgroups.
The directory holds all of the monitored data for all domains and events
of all resources being monitored.

The mon_data directory itself holds a list of subdirectories named
mon_<domain_name>_<domain_id>. Each of these subdirectories contains one
file per event, created with mode "0444". Reading a file displays a
snapshot of the monitored data for the event it represents.

For example, on a two-socket Broadwell with llc_occupancy being
monitored, the mon_data contents look as below:

$ ls /sys/fs/resctrl/p1/mon_data/
mon_L3_00
mon_L3_01

Each domain directory has one file per event:
$ ls /sys/fs/resctrl/p1/mon_data/mon_L3_00/
llc_occupancy

To read the current llc_occupancy of ctrl_mon group p1:
$ cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
33789096

[This patch idea is based on Tony's sample patches to organise data in a
per domain directory and have one file per event (and use the fp->priv to
store mon data bits)]

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/Makefile | 2 +-
arch/x86/kernel/cpu/intel_rdt.c | 4 +-
arch/x86/kernel/cpu/intel_rdt.h | 27 +++
arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 332 ++++++++++++++++++++++++++++
arch/x86/kernel/cpu/intel_rdt_monitor.c | 42 ++++
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 155 +++++++++++++
arch/x86/kernel/cpu/intel_rdt_schemata.c | 286 ------------------------
7 files changed, 559 insertions(+), 289 deletions(-)
create mode 100644 arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
delete mode 100644 arch/x86/kernel/cpu/intel_rdt_schemata.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 81b0060..1245f98 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -32,7 +32,7 @@ obj-$(CONFIG_CPU_SUP_CENTAUR) += centaur.o
obj-$(CONFIG_CPU_SUP_TRANSMETA_32) += transmeta.o
obj-$(CONFIG_CPU_SUP_UMC_32) += umc.o

-obj-$(CONFIG_INTEL_RDT) += intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_schemata.o intel_rdt_monitor.o
+obj-$(CONFIG_INTEL_RDT) += intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_monitor.o intel_rdt_ctrlmondata.o

obj-$(CONFIG_X86_MCE) += mcheck/
obj-$(CONFIG_MTRR) += mtrr/
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index b0f8c35..63bfb47c 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -344,8 +344,8 @@ void rdt_ctrl_update(void *arg)
* caller, return the first domain whose id is bigger than the input id.
* The domain list is sorted by id in ascending order.
*/
-static struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
- struct list_head **pos)
+struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
+ struct list_head **pos)
{
struct rdt_domain *d;
struct list_head *l;
diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index fec8ba9..631d58e 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -33,6 +33,27 @@ struct mon_evt {
struct list_head list;
};

+/**
+ * union mon_data_bits - Monitoring details for each event file
+ * @rid: Resource id associated with the event file.
+ * @evtid: Event id associated with the event file
+ * @domid: The domain to which the event file belongs
+ */
+union mon_data_bits {
+ void *priv;
+ struct {
+ unsigned int rid : 10;
+ unsigned int evtid : 8;
+ unsigned int domid : 14;
+ } u;
+};
+
+struct rmid_read {
+ struct rdtgroup *rgrp;
+ int evtid;
+ u64 val;
+};
+
extern unsigned int intel_cqm_threshold;
extern bool rdt_alloc_enabled;
extern int rdt_mon_features;
@@ -48,6 +69,7 @@ enum rdt_group_type {
/**
* struct rdtgroup - store rdtgroup's data in resctrl file system.
* @kn: kernfs node
+ * @mon_data_kn: kernfs node for the mon_data directory
* @rdtgroup_list: linked list for all rdtgroups
* @parent: parent rdtgrp
* @crdtgrp_list: child rdtgroup node list
@@ -62,6 +84,7 @@ enum rdt_group_type {
*/
struct rdtgroup {
struct kernfs_node *kn;
+ struct kernfs_node *mon_data_kn;
struct list_head rdtgroup_list;
struct rdtgroup *parent;
struct list_head crdtgrp_list;
@@ -311,6 +334,8 @@ enum {
void rdt_ctrl_update(void *arg);
struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn);
void rdtgroup_kn_unlock(struct kernfs_node *kn);
+struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
+ struct list_head **pos);
ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
int rdtgroup_schemata_show(struct kernfs_open_file *of,
@@ -318,5 +343,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
int alloc_rmid(void);
void free_rmid(u32 rmid);
void rdt_get_mon_l3_config(struct rdt_resource *r);
+void mon_event_count(void *info);
+int rdtgroup_mondata_show(struct seq_file *m, void *arg);

#endif /* _ASM_X86_INTEL_RDT_H */
diff --git a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
new file mode 100644
index 0000000..0c8bca0
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
@@ -0,0 +1,332 @@
+/*
+ * Resource Director Technology(RDT)
+ * - Cache Allocation code.
+ *
+ * Copyright (C) 2016 Intel Corporation
+ *
+ * Authors:
+ * Fenghua Yu <[email protected]>
+ * Tony Luck <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * More information about RDT can be found in the Intel(R) x86 Architecture
+ * Software Developer Manual June 2016, volume 3, section 17.17.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kernfs.h>
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include "intel_rdt.h"
+
+/*
+ * Check whether MBA bandwidth percentage value is correct. The value is
+ * checked against the minimum and max bandwidth values specified by the
+ * hardware. The allocated bandwidth percentage is rounded to the next
+ * control step available on the hardware.
+ */
+static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
+{
+ unsigned long bw;
+ int ret;
+
+ /*
+ * Only linear delay values are supported for current Intel SKUs.
+ */
+ if (!r->membw.delay_linear)
+ return false;
+
+ ret = kstrtoul(buf, 10, &bw);
+ if (ret)
+ return false;
+
+ if (bw < r->membw.min_bw || bw > r->default_ctrl)
+ return false;
+
+ *data = roundup(bw, (unsigned long)r->membw.bw_gran);
+ return true;
+}
+
+int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d)
+{
+ unsigned long data;
+
+ if (d->have_new_ctrl)
+ return -EINVAL;
+
+ if (!bw_validate(buf, &data, r))
+ return -EINVAL;
+ d->new_ctrl = data;
+ d->have_new_ctrl = true;
+
+ return 0;
+}
+
+/*
+ * Check whether a cache bit mask is valid. The SDM says:
+ * Please note that all (and only) contiguous '1' combinations
+ * are allowed (e.g. FFFFH, 0FF0H, 003CH, etc.).
+ * Additionally Haswell requires at least two bits set.
+ */
+static bool cbm_validate(char *buf, unsigned long *data, struct rdt_resource *r)
+{
+ unsigned long first_bit, zero_bit, val;
+ unsigned int cbm_len = r->cache.cbm_len;
+ int ret;
+
+ ret = kstrtoul(buf, 16, &val);
+ if (ret)
+ return false;
+
+ if (val == 0 || val > r->default_ctrl)
+ return false;
+
+ first_bit = find_first_bit(&val, cbm_len);
+ zero_bit = find_next_zero_bit(&val, cbm_len, first_bit);
+
+ if (find_next_bit(&val, cbm_len, zero_bit) < cbm_len)
+ return false;
+
+ if ((zero_bit - first_bit) < r->cache.min_cbm_bits)
+ return false;
+
+ *data = val;
+ return true;
+}
+
+/*
+ * Read one cache bit mask (hex). Check that it is valid for the current
+ * resource type.
+ */
+int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d)
+{
+ unsigned long data;
+
+ if (d->have_new_ctrl)
+ return -EINVAL;
+
+ if (!cbm_validate(buf, &data, r))
+ return -EINVAL;
+ d->new_ctrl = data;
+ d->have_new_ctrl = true;
+
+ return 0;
+}
+
+/*
+ * For each domain in this resource we expect to find a series of:
+ * id=mask
+ * separated by ";". The "id" is in decimal, and must match one of
+ * the "id"s for this resource.
+ */
+static int parse_line(char *line, struct rdt_resource *r)
+{
+ char *dom = NULL, *id;
+ struct rdt_domain *d;
+ unsigned long dom_id;
+
+next:
+ if (!line || line[0] == '\0')
+ return 0;
+ dom = strsep(&line, ";");
+ id = strsep(&dom, "=");
+ if (!dom || kstrtoul(id, 10, &dom_id))
+ return -EINVAL;
+ dom = strim(dom);
+ list_for_each_entry(d, &r->domains, list) {
+ if (d->id == dom_id) {
+ if (r->parse_ctrlval(dom, r, d))
+ return -EINVAL;
+ goto next;
+ }
+ }
+ return -EINVAL;
+}
+
+static int update_domains(struct rdt_resource *r, int closid)
+{
+ struct msr_param msr_param;
+ cpumask_var_t cpu_mask;
+ struct rdt_domain *d;
+ int cpu;
+
+ if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
+ return -ENOMEM;
+
+ msr_param.low = closid;
+ msr_param.high = msr_param.low + 1;
+ msr_param.res = r;
+
+ list_for_each_entry(d, &r->domains, list) {
+ if (d->have_new_ctrl && d->new_ctrl != d->ctrl_val[closid]) {
+ cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+ d->ctrl_val[closid] = d->new_ctrl;
+ }
+ }
+ if (cpumask_empty(cpu_mask))
+ goto done;
+ cpu = get_cpu();
+ /* Update CBM on this cpu if it's in cpu_mask. */
+ if (cpumask_test_cpu(cpu, cpu_mask))
+ rdt_ctrl_update(&msr_param);
+ /* Update CBM on other cpus. */
+ smp_call_function_many(cpu_mask, rdt_ctrl_update, &msr_param, 1);
+ put_cpu();
+
+done:
+ free_cpumask_var(cpu_mask);
+
+ return 0;
+}
+
+static int rdtgroup_parse_resource(char *resname, char *tok, int closid)
+{
+ struct rdt_resource *r;
+
+ for_each_alloc_enabled_rdt_resource(r) {
+ if (!strcmp(resname, r->name) && closid < r->num_closid)
+ return parse_line(tok, r);
+ }
+ return -EINVAL;
+}
+
+ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct rdtgroup *rdtgrp;
+ struct rdt_domain *dom;
+ struct rdt_resource *r;
+ char *tok, *resname;
+ int closid, ret = 0;
+
+ /* Valid input requires a trailing newline */
+ if (nbytes == 0 || buf[nbytes - 1] != '\n')
+ return -EINVAL;
+ buf[nbytes - 1] = '\0';
+
+ rdtgrp = rdtgroup_kn_lock_live(of->kn);
+ if (!rdtgrp) {
+ rdtgroup_kn_unlock(of->kn);
+ return -ENOENT;
+ }
+
+ closid = rdtgrp->closid;
+
+ for_each_alloc_enabled_rdt_resource(r) {
+ list_for_each_entry(dom, &r->domains, list)
+ dom->have_new_ctrl = false;
+ }
+
+ while ((tok = strsep(&buf, "\n")) != NULL) {
+ resname = strim(strsep(&tok, ":"));
+ if (!tok) {
+ ret = -EINVAL;
+ goto out;
+ }
+ ret = rdtgroup_parse_resource(resname, tok, closid);
+ if (ret)
+ goto out;
+ }
+
+ for_each_alloc_enabled_rdt_resource(r) {
+ ret = update_domains(r, closid);
+ if (ret)
+ goto out;
+ }
+
+out:
+ rdtgroup_kn_unlock(of->kn);
+ return ret ?: nbytes;
+}
+
+static void show_doms(struct seq_file *s, struct rdt_resource *r, int closid)
+{
+ struct rdt_domain *dom;
+ bool sep = false;
+
+ seq_printf(s, "%*s:", max_name_width, r->name);
+ list_for_each_entry(dom, &r->domains, list) {
+ if (sep)
+ seq_puts(s, ";");
+ seq_printf(s, r->format_str, dom->id, max_data_width,
+ dom->ctrl_val[closid]);
+ sep = true;
+ }
+ seq_puts(s, "\n");
+}
+
+int rdtgroup_schemata_show(struct kernfs_open_file *of,
+ struct seq_file *s, void *v)
+{
+ struct rdtgroup *rdtgrp;
+ struct rdt_resource *r;
+ int closid, ret = 0;
+
+ rdtgrp = rdtgroup_kn_lock_live(of->kn);
+ if (rdtgrp) {
+ closid = rdtgrp->closid;
+ for_each_alloc_enabled_rdt_resource(r) {
+ if (closid < r->num_closid)
+ show_doms(s, r, closid);
+ }
+ } else {
+ ret = -ENOENT;
+ }
+ rdtgroup_kn_unlock(of->kn);
+ return ret;
+}
+
+int rdtgroup_mondata_show(struct seq_file *m, void *arg)
+{
+ struct kernfs_open_file *of = m->private;
+ u32 resid, evtid, domid;
+ struct rdtgroup *rdtgrp;
+ struct rdt_resource *r;
+ union mon_data_bits md;
+ struct rdt_domain *d;
+ struct rmid_read rr;
+ int ret = 0;
+
+ rdtgrp = rdtgroup_kn_lock_live(of->kn);
+
+ md.priv = of->kn->priv;
+ resid = md.u.rid;
+ domid = md.u.domid;
+ evtid = md.u.evtid;
+
+ r = &rdt_resources_all[resid];
+ d = rdt_find_domain(r, domid, NULL);
+ if (!d) {
+ ret = -ENOENT;
+ goto out;
+ }
+
+ /* Set up the parameters sent with the IPI used to read the data. */
+ rr.rgrp = rdtgrp;
+ rr.evtid = evtid;
+ rr.val = 0;
+
+ smp_call_function_any(&d->cpu_mask, mon_event_count, &rr, 1);
+
+ if (rr.val & RMID_VAL_ERROR)
+ seq_puts(m, "Error\n");
+ else if (rr.val & RMID_VAL_UNAVAIL)
+ seq_puts(m, "Unavailable\n");
+ else
+ seq_printf(m, "%llu\n", rr.val * r->mon_scale);
+
+out:
+ rdtgroup_kn_unlock(of->kn);
+ return ret;
+}
diff --git a/arch/x86/kernel/cpu/intel_rdt_monitor.c b/arch/x86/kernel/cpu/intel_rdt_monitor.c
index 624a0aa..cc252eb 100644
--- a/arch/x86/kernel/cpu/intel_rdt_monitor.c
+++ b/arch/x86/kernel/cpu/intel_rdt_monitor.c
@@ -204,6 +204,48 @@ void free_rmid(u32 rmid)
list_add_tail(&entry->list, &rmid_free_lru);
}

+static bool __mon_event_count(u32 rmid, struct rmid_read *rr)
+{
+ u64 tval;
+
+ tval = __rmid_read(rmid, rr->evtid);
+ if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) {
+ rr->val = tval;
+ return false;
+ }
+ switch (rr->evtid) {
+ case QOS_L3_OCCUP_EVENT_ID:
+ rr->val += tval;
+ return true;
+ default:
+ return false;
+ }
+}
+
+void mon_event_count(void *info)
+{
+ struct rdtgroup *rdtgrp, *entry;
+ struct rmid_read *rr = info;
+ struct list_head *llist;
+
+ rdtgrp = rr->rgrp;
+
+ if (!__mon_event_count(rdtgrp->rmid, rr))
+ return;
+
+ /*
+ * For Ctrl groups read data from child monitor groups.
+ */
+ llist = &rdtgrp->crdtgrp_list;
+
+ if (rdtgrp->type == RDTCTRL_GROUP) {
+ list_for_each_entry(entry, llist, crdtgrp_list) {
+ if (!__mon_event_count(entry->rmid, rr))
+ return;
+ }
+ }
+}
+
static int dom_data_init(struct rdt_resource *r)
{
struct rmid_entry *entry = NULL;
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index d32b781..9377bcd 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -152,6 +152,11 @@ static ssize_t rdtgroup_file_write(struct kernfs_open_file *of, char *buf,
.seq_show = rdtgroup_seqfile_show,
};

+static struct kernfs_ops kf_mondata_ops = {
+ .atomic_write_len = PAGE_SIZE,
+ .seq_show = rdtgroup_mondata_show,
+};
+
static bool is_cpu_list(struct kernfs_open_file *of)
{
struct rftype *rft = of->kn->priv;
@@ -1251,6 +1256,152 @@ static void rdt_kill_sb(struct super_block *sb)
.kill_sb = rdt_kill_sb,
};

+static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
+ void *priv)
+{
+ struct kernfs_node *kn;
+ int ret = 0;
+
+ kn = __kernfs_create_file(parent_kn, name, 0444, 0,
+ &kf_mondata_ops, priv, NULL, NULL);
+ if (IS_ERR(kn))
+ return PTR_ERR(kn);
+
+ ret = rdtgroup_kn_set_ugid(kn);
+ if (ret) {
+ kernfs_remove(kn);
+ return ret;
+ }
+
+ return ret;
+}
+
+static int get_rdt_resourceid(struct rdt_resource *r)
+{
+ if (r < rdt_resources_all ||
+ r > (rdt_resources_all + RDT_NUM_RESOURCES - 1))
+ return -EINVAL;
+
+ /* Pointer subtraction already yields an array index. */
+ return r - rdt_resources_all;
+}
+
+static int mkdir_mondata_subdir(struct kernfs_node *parent_kn, int domid,
+ struct rdt_resource *r, struct rdtgroup *pr)
+{
+ union mon_data_bits priv;
+ struct kernfs_node *kn;
+ struct mon_evt *mevt;
+ char name[32];
+ int ret, rid;
+
+ rid = get_rdt_resourceid(r);
+ if (rid < 0)
+ return -EINVAL;
+
+ sprintf(name, "mon_%s_%02d", r->name, domid);
+ /* create the directory */
+ kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, pr);
+ if (IS_ERR(kn))
+ return PTR_ERR(kn);
+
+ /*
+ * This extra ref will be put in kernfs_remove() and guarantees
+ * that @rdtgrp->kn is always accessible.
+ */
+ kernfs_get(kn);
+ ret = rdtgroup_kn_set_ugid(kn);
+ if (ret)
+ goto out_destroy;
+
+ if (WARN_ON(list_empty(&r->evt_list))) {
+ ret = -EPERM;
+ goto out_destroy;
+ }
+
+ priv.u.rid = rid;
+ priv.u.domid = domid;
+ list_for_each_entry(mevt, &r->evt_list, list) {
+ priv.u.evtid = mevt->evtid;
+ ret = mon_addfile(kn, mevt->name, priv.priv);
+ if (ret)
+ goto out_destroy;
+ }
+ kernfs_activate(kn);
+ return 0;
+
+out_destroy:
+ kernfs_remove(kn);
+ return ret;
+}
+
+static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
+ struct rdt_resource *r,
+ struct rdtgroup *pr)
+{
+ struct rdt_domain *dom;
+ int ret;
+
+ list_for_each_entry(dom, &r->domains, list) {
+ ret = mkdir_mondata_subdir(parent_kn, dom->id, r, pr);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+/*
+ * This creates a directory mon_data which holds one subdirectory
+ * per domain which contains the monitored data.
+ *
+ * mon_data has one directory for each domain, named
+ * mon_<domain_name>_<domain_id>. For example, a mon_data directory
+ * for an L3 domain looks as below:
+ * ./mon_data:
+ * mon_L3_00
+ * mon_L3_01
+ * mon_L3_02
+ * ...
+ *
+ * Each domain directory has one file per event:
+ * ./mon_L3_00/:
+ * llc_occupancy
+ *
+ */
+static int mkdir_mondata_all(struct kernfs_node *parent_kn, struct rdtgroup *pr,
+ struct kernfs_node **dest_kn)
+{
+ struct rdt_resource *r;
+ struct kernfs_node *kn;
+ int ret;
+
+ /*
+ * Create the mon_data directory first.
+ */
+ ret = mongroup_create_dir(parent_kn, NULL, "mon_data", &kn);
+ if (ret)
+ return ret;
+
+ if (dest_kn)
+ *dest_kn = kn;
+
+ /*
+ * Create the subdirectories for each domain. Note that all events
+ * in a domain (such as L3) are grouped under that domain's resource.
+ */
+ for_each_mon_enabled_rdt_resource(r) {
+ ret = mkdir_mondata_subdir_alldom(kn, r, pr);
+ if (ret)
+ goto out_destroy;
+ }
+
+ return 0;
+
+out_destroy:
+ kernfs_remove(kn);
+ return ret;
+}
/*
* Common code for ctrl_mon and monitor group mkdir.
* The caller needs to unlock the global mutex upon success.
@@ -1307,6 +1458,10 @@ static int mkdir_rdt_common(struct kernfs_node *pkn, struct kernfs_node *prkn,
goto out_destroy;

if (rdt_mon_features) {
+ ret = mkdir_mondata_all(kn, rdtgrp, &rdtgrp->mon_data_kn);
+ if (ret)
+ goto out_destroy;
+
ret = alloc_rmid();
if (ret < 0)
return ret;
diff --git a/arch/x86/kernel/cpu/intel_rdt_schemata.c b/arch/x86/kernel/cpu/intel_rdt_schemata.c
deleted file mode 100644
index 952156c..0000000
--- a/arch/x86/kernel/cpu/intel_rdt_schemata.c
+++ /dev/null
@@ -1,286 +0,0 @@
-/*
- * Resource Director Technology(RDT)
- * - Cache Allocation code.
- *
- * Copyright (C) 2016 Intel Corporation
- *
- * Authors:
- * Fenghua Yu <[email protected]>
- * Tony Luck <[email protected]>
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms and conditions of the GNU General Public License,
- * version 2, as published by the Free Software Foundation.
- *
- * This program is distributed in the hope it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
- * more details.
- *
- * More information about RDT be found in the Intel (R) x86 Architecture
- * Software Developer Manual June 2016, volume 3, section 17.17.
- */
-
-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-
-#include <linux/kernfs.h>
-#include <linux/seq_file.h>
-#include <linux/slab.h>
-#include "intel_rdt.h"
-
-/*
- * Check whether MBA bandwidth percentage value is correct. The value is
- * checked against the minimum and max bandwidth values specified by the
- * hardware. The allocated bandwidth percentage is rounded to the next
- * control step available on the hardware.
- */
-static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
-{
- unsigned long bw;
- int ret;
-
- /*
- * Only linear delay values is supported for current Intel SKUs.
- */
- if (!r->membw.delay_linear)
- return false;
-
- ret = kstrtoul(buf, 10, &bw);
- if (ret)
- return false;
-
- if (bw < r->membw.min_bw || bw > r->default_ctrl)
- return false;
-
- *data = roundup(bw, (unsigned long)r->membw.bw_gran);
- return true;
-}
-
-int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d)
-{
- unsigned long data;
-
- if (d->have_new_ctrl)
- return -EINVAL;
-
- if (!bw_validate(buf, &data, r))
- return -EINVAL;
- d->new_ctrl = data;
- d->have_new_ctrl = true;
-
- return 0;
-}
-
-/*
- * Check whether a cache bit mask is valid. The SDM says:
- * Please note that all (and only) contiguous '1' combinations
- * are allowed (e.g. FFFFH, 0FF0H, 003CH, etc.).
- * Additionally Haswell requires at least two bits set.
- */
-static bool cbm_validate(char *buf, unsigned long *data, struct rdt_resource *r)
-{
- unsigned long first_bit, zero_bit, val;
- unsigned int cbm_len = r->cache.cbm_len;
- int ret;
-
- ret = kstrtoul(buf, 16, &val);
- if (ret)
- return false;
-
- if (val == 0 || val > r->default_ctrl)
- return false;
-
- first_bit = find_first_bit(&val, cbm_len);
- zero_bit = find_next_zero_bit(&val, cbm_len, first_bit);
-
- if (find_next_bit(&val, cbm_len, zero_bit) < cbm_len)
- return false;
-
- if ((zero_bit - first_bit) < r->cache.min_cbm_bits)
- return false;
-
- *data = val;
- return true;
-}
-
-/*
- * Read one cache bit mask (hex). Check that it is valid for the current
- * resource type.
- */
-int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d)
-{
- unsigned long data;
-
- if (d->have_new_ctrl)
- return -EINVAL;
-
- if(!cbm_validate(buf, &data, r))
- return -EINVAL;
- d->new_ctrl = data;
- d->have_new_ctrl = true;
-
- return 0;
-}
-
-/*
- * For each domain in this resource we expect to find a series of:
- * id=mask
- * separated by ";". The "id" is in decimal, and must match one of
- * the "id"s for this resource.
- */
-static int parse_line(char *line, struct rdt_resource *r)
-{
- char *dom = NULL, *id;
- struct rdt_domain *d;
- unsigned long dom_id;
-
-next:
- if (!line || line[0] == '\0')
- return 0;
- dom = strsep(&line, ";");
- id = strsep(&dom, "=");
- if (!dom || kstrtoul(id, 10, &dom_id))
- return -EINVAL;
- dom = strim(dom);
- list_for_each_entry(d, &r->domains, list) {
- if (d->id == dom_id) {
- if (r->parse_ctrlval(dom, r, d))
- return -EINVAL;
- goto next;
- }
- }
- return -EINVAL;
-}
-
-static int update_domains(struct rdt_resource *r, int closid)
-{
- struct msr_param msr_param;
- cpumask_var_t cpu_mask;
- struct rdt_domain *d;
- int cpu;
-
- if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
- return -ENOMEM;
-
- msr_param.low = closid;
- msr_param.high = msr_param.low + 1;
- msr_param.res = r;
-
- list_for_each_entry(d, &r->domains, list) {
- if (d->have_new_ctrl && d->new_ctrl != d->ctrl_val[closid]) {
- cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
- d->ctrl_val[closid] = d->new_ctrl;
- }
- }
- if (cpumask_empty(cpu_mask))
- goto done;
- cpu = get_cpu();
- /* Update CBM on this cpu if it's in cpu_mask. */
- if (cpumask_test_cpu(cpu, cpu_mask))
- rdt_ctrl_update(&msr_param);
- /* Update CBM on other cpus. */
- smp_call_function_many(cpu_mask, rdt_ctrl_update, &msr_param, 1);
- put_cpu();
-
-done:
- free_cpumask_var(cpu_mask);
-
- return 0;
-}
-
-static int rdtgroup_parse_resource(char *resname, char *tok, int closid)
-{
- struct rdt_resource *r;
-
- for_each_alloc_enabled_rdt_resource(r) {
- if (!strcmp(resname, r->name) && closid < r->num_closid)
- return parse_line(tok, r);
- }
- return -EINVAL;
-}
-
-ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
- char *buf, size_t nbytes, loff_t off)
-{
- struct rdtgroup *rdtgrp;
- struct rdt_domain *dom;
- struct rdt_resource *r;
- char *tok, *resname;
- int closid, ret = 0;
-
- /* Valid input requires a trailing newline */
- if (nbytes == 0 || buf[nbytes - 1] != '\n')
- return -EINVAL;
- buf[nbytes - 1] = '\0';
-
- rdtgrp = rdtgroup_kn_lock_live(of->kn);
- if (!rdtgrp) {
- rdtgroup_kn_unlock(of->kn);
- return -ENOENT;
- }
-
- closid = rdtgrp->closid;
-
- for_each_alloc_enabled_rdt_resource(r) {
- list_for_each_entry(dom, &r->domains, list)
- dom->have_new_ctrl = false;
- }
-
- while ((tok = strsep(&buf, "\n")) != NULL) {
- resname = strim(strsep(&tok, ":"));
- if (!tok) {
- ret = -EINVAL;
- goto out;
- }
- ret = rdtgroup_parse_resource(resname, tok, closid);
- if (ret)
- goto out;
- }
-
- for_each_alloc_enabled_rdt_resource(r) {
- ret = update_domains(r, closid);
- if (ret)
- goto out;
- }
-
-out:
- rdtgroup_kn_unlock(of->kn);
- return ret ?: nbytes;
-}
-
-static void show_doms(struct seq_file *s, struct rdt_resource *r, int closid)
-{
- struct rdt_domain *dom;
- bool sep = false;
-
- seq_printf(s, "%*s:", max_name_width, r->name);
- list_for_each_entry(dom, &r->domains, list) {
- if (sep)
- seq_puts(s, ";");
- seq_printf(s, r->format_str, dom->id, max_data_width,
- dom->ctrl_val[closid]);
- sep = true;
- }
- seq_puts(s, "\n");
-}
-
-int rdtgroup_schemata_show(struct kernfs_open_file *of,
- struct seq_file *s, void *v)
-{
- struct rdtgroup *rdtgrp;
- struct rdt_resource *r;
- int closid, ret = 0;
-
- rdtgrp = rdtgroup_kn_lock_live(of->kn);
- if (rdtgrp) {
- closid = rdtgrp->closid;
- for_each_alloc_enabled_rdt_resource(r) {
- if (closid < r->num_closid)
- show_doms(s, r, closid);
- }
- } else {
- ret = -ENOENT;
- }
- rdtgroup_kn_unlock(of->kn);
- return ret;
-}
--
1.9.1

2017-06-26 18:56:46

by Shivappa Vikas

Subject: [PATCH 13/21] x86/intel_rdt/cqm: Add cpus file support

The cpus file is extended to support resource monitoring. It is used
to override the RMID of the default group when running on specific
CPUs, and works similarly to resource control. The "cpus" and
"cpus_list" files are present in the default group, in ctrl_mon groups
and in monitor groups.

Reading a "cpus" or "cpus_list" file shows a cpumask or CPU list of the
CPUs that belong to the resource group. By default all online CPUs
belong to the default root group. A CPU can be present in one
"ctrl_mon" group and one "monitor" group simultaneously. CPUs are added
to a resource group by writing them to the file. When a CPU is added to
a ctrl_mon group it is automatically removed from its previous ctrl_mon
group. A CPU can be added to a monitor group only if it is present in
the parent ctrl_mon group, and when a CPU is added to a monitor group
it is automatically removed from its previous monitor group. When CPUs
go offline, they are automatically removed from their ctrl_mon and
monitor groups.

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt.c | 15 ++-
arch/x86/kernel/cpu/intel_rdt.h | 2 +
arch/x86/kernel/cpu/intel_rdt_monitor.c | 1 +
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 183 +++++++++++++++++++++++++------
4 files changed, 169 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index e96b3f0..b0f8c35 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -483,6 +483,17 @@ static int intel_rdt_online_cpu(unsigned int cpu)
return 0;
}

+static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
+{
+ struct rdtgroup *cr;
+
+ list_for_each_entry(cr, &r->crdtgrp_list, crdtgrp_list) {
+ if (cpumask_test_and_clear_cpu(cpu, &cr->cpu_mask))
+ break;
+ }
+}
+
static int intel_rdt_offline_cpu(unsigned int cpu)
{
struct rdtgroup *rdtgrp;
@@ -492,8 +503,10 @@ static int intel_rdt_offline_cpu(unsigned int cpu)
for_each_alloc_capable_rdt_resource(r)
domain_remove_cpu(cpu, r);
list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
- if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask))
+ if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
+ clear_childcpus(rdtgrp, cpu);
break;
+ }
}
clear_closid(cpu);
mutex_unlock(&rdtgroup_mutex);
diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index fdf3654..fec8ba9 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -37,6 +37,8 @@ struct mon_evt {
extern bool rdt_alloc_enabled;
extern int rdt_mon_features;

+DECLARE_PER_CPU_READ_MOSTLY(int, cpu_rmid);
+
enum rdt_group_type {
RDTCTRL_GROUP = 0,
RDTMON_GROUP,
diff --git a/arch/x86/kernel/cpu/intel_rdt_monitor.c b/arch/x86/kernel/cpu/intel_rdt_monitor.c
index 4f4221a..624a0aa 100644
--- a/arch/x86/kernel/cpu/intel_rdt_monitor.c
+++ b/arch/x86/kernel/cpu/intel_rdt_monitor.c
@@ -75,6 +75,7 @@ struct rmid_entry {
*/
unsigned int intel_cqm_threshold;

+DEFINE_PER_CPU_READ_MOSTLY(int, cpu_rmid);
static inline struct rmid_entry *__rmid_entry(u32 rmid)
{
struct rmid_entry *entry;
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 8fd0757..d32b781 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -181,13 +181,18 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
/*
* This is safe against intel_rdt_sched_in() called from __switch_to()
* because __switch_to() is executed with interrupts disabled. A local call
- * from rdt_update_closid() is proteced against __switch_to() because
+ * from update_closid_rmid() is protected against __switch_to() because
* preemption is disabled.
*/
-static void rdt_update_cpu_closid(void *closid)
+static void update_cpu_closid_rmid(void *info)
{
- if (closid)
- this_cpu_write(cpu_closid, *(int *)closid);
+ struct rdtgroup *r = info;
+
+ if (r) {
+ this_cpu_write(cpu_closid, r->closid);
+ this_cpu_write(cpu_rmid, r->rmid);
+ }
+
/*
* We cannot unconditionally write the MSR because the current
* executing task might have its own closid selected. Just reuse
@@ -199,33 +204,30 @@ static void rdt_update_cpu_closid(void *closid)
/*
* Update the PGR_ASSOC MSR on all cpus in @cpu_mask,
*
- * Per task closids must have been set up before calling this function.
+ * Per task closids/rmids must have been set up before calling this function.
*
- * The per cpu closids are updated with the smp function call, when @closid
- * is not NULL. If @closid is NULL then all affected percpu closids must
- * have been set up before calling this function.
+ * The per cpu closids and rmids are updated with the smp function call.
*/
static void
-rdt_update_closid(const struct cpumask *cpu_mask, int *closid)
+update_closid_rmid(const struct cpumask *cpu_mask, struct rdtgroup *r)
{
int cpu = get_cpu();

if (cpumask_test_cpu(cpu, cpu_mask))
- rdt_update_cpu_closid(closid);
- smp_call_function_many(cpu_mask, rdt_update_cpu_closid, closid, 1);
+ update_cpu_closid_rmid(r);
+ smp_call_function_many(cpu_mask, update_cpu_closid_rmid, r, 1);
put_cpu();
}

-static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
- char *buf, size_t nbytes, loff_t off)
+static ssize_t cpus_mon_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ struct rdtgroup *rdtgrp)
{
+ struct rdtgroup *pr = rdtgrp->parent, *cr;
cpumask_var_t tmpmask, newmask;
- struct rdtgroup *rdtgrp, *r;
+ struct list_head *llist;
int ret;

- if (!buf)
- return -EINVAL;
-
if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
return -ENOMEM;
if (!zalloc_cpumask_var(&newmask, GFP_KERNEL)) {
@@ -233,10 +235,89 @@ static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
return -ENOMEM;
}

- rdtgrp = rdtgroup_kn_lock_live(of->kn);
- if (!rdtgrp) {
- ret = -ENOENT;
- goto unlock;
+ if (is_cpu_list(of))
+ ret = cpulist_parse(buf, newmask);
+ else
+ ret = cpumask_parse(buf, newmask);
+
+ if (ret)
+ goto out;
+
+ /* check that user didn't specify any offline cpus */
+ cpumask_andnot(tmpmask, newmask, cpu_online_mask);
+ if (cpumask_weight(tmpmask)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* Check whether cpus belong to parent ctrl group */
+ cpumask_andnot(tmpmask, newmask, &pr->cpu_mask);
+ if (cpumask_weight(tmpmask)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* Check whether cpus are dropped from this group */
+ cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask);
+ if (cpumask_weight(tmpmask)) {
+ /* Give any dropped cpus to parent rdtgroup */
+ cpumask_or(&pr->cpu_mask, &pr->cpu_mask, tmpmask);
+ update_closid_rmid(tmpmask, pr);
+ }
+
+ /*
+ * If we added cpus, remove them from previous group that owned them
+ * and update per-cpu rmid
+ */
+ cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask);
+ if (cpumask_weight(tmpmask)) {
+ llist = &pr->crdtgrp_list;
+ list_for_each_entry(cr, llist, crdtgrp_list) {
+ if (cr == rdtgrp)
+ continue;
+ cpumask_andnot(&cr->cpu_mask, &cr->cpu_mask, tmpmask);
+ }
+ update_closid_rmid(tmpmask, rdtgrp);
+ }
+
+ /* Done pushing/pulling - update this group with new mask */
+ cpumask_copy(&rdtgrp->cpu_mask, newmask);
+
+out:
+ free_cpumask_var(tmpmask);
+ free_cpumask_var(newmask);
+
+ return ret ?: nbytes;
+}
+
+static void cpumask_rdtgrp_clear(struct rdtgroup *r, struct cpumask *m)
+{
+ struct rdtgroup *cr;
+
+ cpumask_andnot(&r->cpu_mask, &r->cpu_mask, m);
+ /* update the child mon group masks as well */
+ list_for_each_entry(cr, &r->crdtgrp_list, crdtgrp_list)
+ cpumask_and(&cr->cpu_mask, &r->cpu_mask, &cr->cpu_mask);
+}
+
+static ssize_t cpus_ctrl_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ struct rdtgroup *rdtgrp)
+{
+ cpumask_var_t tmpmask, newmask, tmpmask1;
+ struct rdtgroup *r, *cr;
+ int ret;
+
+ if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
+ return -ENOMEM;
+ if (!zalloc_cpumask_var(&newmask, GFP_KERNEL)) {
+ free_cpumask_var(tmpmask);
+ return -ENOMEM;
+ }
+ if (!zalloc_cpumask_var(&tmpmask1, GFP_KERNEL)) {
+ free_cpumask_var(tmpmask);
+ free_cpumask_var(newmask);
+ return -ENOMEM;
}

if (is_cpu_list(of))
@@ -245,13 +326,13 @@ static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
ret = cpumask_parse(buf, newmask);

if (ret)
- goto unlock;
+ goto out;

/* check that user didn't specify any offline cpus */
cpumask_andnot(tmpmask, newmask, cpu_online_mask);
if (cpumask_weight(tmpmask)) {
ret = -EINVAL;
- goto unlock;
+ goto out;
}

/* Check whether cpus are dropped from this group */
@@ -260,12 +341,13 @@ static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
/* Can't drop from default group */
if (rdtgrp == &rdtgroup_default) {
ret = -EINVAL;
- goto unlock;
+ goto out;
}
+
/* Give any dropped cpus to rdtgroup_default */
cpumask_or(&rdtgroup_default.cpu_mask,
&rdtgroup_default.cpu_mask, tmpmask);
- rdt_update_closid(tmpmask, &rdtgroup_default.closid);
+ update_closid_rmid(tmpmask, &rdtgroup_default);
}

/*
@@ -277,22 +359,61 @@ static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
list_for_each_entry(r, &rdt_all_groups, rdtgroup_list) {
if (r == rdtgrp)
continue;
- cpumask_andnot(&r->cpu_mask, &r->cpu_mask, tmpmask);
+ cpumask_and(tmpmask1, &r->cpu_mask, tmpmask);
+ if (cpumask_weight(tmpmask1))
+ cpumask_rdtgrp_clear(r, tmpmask1);
}
- rdt_update_closid(tmpmask, &rdtgrp->closid);
+ update_closid_rmid(tmpmask, rdtgrp);
}

/* Done pushing/pulling - update this group with new mask */
cpumask_copy(&rdtgrp->cpu_mask, newmask);

-unlock:
- rdtgroup_kn_unlock(of->kn);
+ /*
+ * Update the child mon group masks as well. The child groups
+ * always have a subset of the parent's mask, but any new cpus
+ * added to the parent need to be removed from the children.
+ */
+ list_for_each_entry(cr, &rdtgrp->crdtgrp_list, crdtgrp_list) {
+ cpumask_and(tmpmask, &rdtgrp->cpu_mask, &cr->cpu_mask);
+ cpumask_andnot(&cr->cpu_mask, tmpmask, newmask);
+ }
+out:
+ free_cpumask_var(tmpmask1);
free_cpumask_var(tmpmask);
free_cpumask_var(newmask);

return ret ?: nbytes;
}

+static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct rdtgroup *rdtgrp;
+ int ret;
+
+ if (!buf)
+ return -EINVAL;
+
+ rdtgrp = rdtgroup_kn_lock_live(of->kn);
+ if (!rdtgrp) {
+ ret = -ENOENT;
+ goto unlock;
+ }
+
+ if (rdtgrp->type == RDTCTRL_GROUP)
+ ret = cpus_ctrl_write(of, buf, nbytes, rdtgrp);
+ else if (rdtgrp->type == RDTMON_GROUP)
+ ret = cpus_mon_write(of, buf, nbytes, rdtgrp);
+ else
+ ret = -EINVAL;
+
+unlock:
+ rdtgroup_kn_unlock(of->kn);
+
+ return ret ?: nbytes;
+}
+
struct task_move_callback {
struct callback_head work;
struct rdtgroup *rdtgrp;
@@ -1102,7 +1223,7 @@ static void rmdir_all_sub(void)
}
/* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */
get_online_cpus();
- rdt_update_closid(cpu_online_mask, &rdtgroup_default.closid);
+ update_closid_rmid(cpu_online_mask, &rdtgroup_default);
put_online_cpus();

kernfs_remove(kn_info);
@@ -1342,7 +1463,7 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
* task running on them.
*/
cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
- rdt_update_closid(tmpmask, NULL);
+ update_closid_rmid(tmpmask, NULL);

rdtgrp->flags = RDT_DELETED;
closid_free(rdtgrp->closid);
--
1.9.1

2017-06-26 18:57:21

by Shivappa Vikas

Subject: [PATCH 12/21] x86/intel_rdt/cqm: Add tasks file support

The root directory and the ctrl_mon and monitor groups are each
populated with a read/write file named "tasks". Reading it shows all
the task IDs assigned to the resource group.

Tasks can be added to a group by writing the PID to its "tasks" file. A
task can be present in one "ctrl_mon" group "and" one "monitor" group.
IOW, a PID_x can be seen in a ctrl_mon group and a monitor group at the
same time. When a task is added to a ctrl_mon group, it is automatically
removed from the ctrl_mon group it previously belonged to. Similarly, if
a task is moved to a monitor group it is removed from its previous
monitor group. Also, since monitor groups can only hold a subset of the
tasks of their parent ctrl_mon group, a task can be moved to a monitor
group only if it is already present in the parent ctrl_mon group.

Task membership is indicated by a new field in task_struct, "u32 rmid",
which holds the RMID for the task. RMID 0 is reserved for the default
root group, which all tasks belong to at mount time.
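
The move rule above can be sketched in plain C. This is a toy model:
the struct names and the move_task() helper are illustrative stand-ins
for the kernel's rdtgroup/task_struct fields, not the actual patch code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative stand-ins for the fields this patch touches. */
struct grp {
	int is_mon;			/* RDTMON_GROUP vs RDTCTRL_GROUP */
	int closid;
	unsigned int rmid;
	const struct grp *parent;	/* parent ctrl_mon group, if monitor */
};

struct tsk {
	int closid;
	unsigned int rmid;
};

/*
 * Mirrors the rule in __rdtgroup_move_task(): a ctrl_mon group moves
 * both the closid and the rmid; a monitor group only adopts a task
 * that is already in its parent ctrl_mon group.
 */
static int move_task(struct tsk *t, const struct grp *g)
{
	if (!g->is_mon) {
		t->closid = g->closid;
		t->rmid = g->rmid;
		return 0;
	}
	if (g->parent->closid != t->closid)
		return -EINVAL;
	t->rmid = g->rmid;
	return 0;
}
```

With a ctrl_mon group {closid=1, rmid=2} and a child monitor group
{rmid=5}, moving a fresh task straight into the monitor group fails
with -EINVAL; moving it into the ctrl_mon group first makes the
subsequent monitor-group move succeed.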

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 18 ++++++++++++++++--
include/linux/sched.h | 1 +
2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 36078c7..8fd0757 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -352,7 +352,20 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
atomic_dec(&rdtgrp->waitcount);
kfree(callback);
} else {
- tsk->closid = rdtgrp->closid;
+ /*
+ * For ctrl_mon groups move both closid and rmid.
+ * For monitor groups, can move the tasks only from
+ * their parent CTRL group.
+ */
+ if (rdtgrp->type == RDTCTRL_GROUP) {
+ tsk->closid = rdtgrp->closid;
+ tsk->rmid = rdtgrp->rmid;
+ } else if (rdtgrp->type == RDTMON_GROUP) {
+ if (rdtgrp->parent->closid == tsk->closid)
+ tsk->rmid = rdtgrp->rmid;
+ else
+ ret = -EINVAL;
+ }
}
return ret;
}
@@ -432,7 +445,8 @@ static void show_rdt_tasks(struct rdtgroup *r, struct seq_file *s)

rcu_read_lock();
for_each_process_thread(p, t) {
- if (t->closid == r->closid)
+ if ((r->type == RDTCTRL_GROUP && t->closid == r->closid) ||
+ (r->type == RDTMON_GROUP && t->rmid == r->rmid))
seq_printf(s, "%d\n", t->pid);
}
rcu_read_unlock();
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9e31b3d..6643692 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -866,6 +866,7 @@ struct task_struct {
#endif
#ifdef CONFIG_INTEL_RDT
int closid;
+ u32 rmid;
#endif
#ifdef CONFIG_FUTEX
struct robust_list_head __user *robust_list;
--
1.9.1

2017-06-26 18:57:29

by Shivappa Vikas

Subject: [PATCH 11/21] x86/intel_rdt/cqm: Add mkdir support for RDT monitoring

Resource control groups can be created using mkdir in the resctrl
fs (rdtgroup). In order to extend the resctrl interface to support
monitoring of the control groups, extend the current mkdir to support
resource monitoring as well.

This allows an rdtgroup created under the root directory to both
control and monitor resources (a ctrl_mon group). The ctrl_mon groups
are associated with one CLOSID, like the legacy rdtgroups, as well as
one RMID (Resource Monitoring ID). Hardware uses the RMID to track
resource usage. Once either CLOSIDs or RMIDs are exhausted, mkdir fails
with -ENOSPC. If RMIDs remain on the limbo list but none are free,
-EBUSY is returned. The user can also monitor a subset of the ctrl_mon
rdtgroup's tasks/cpus using monitor groups, which are created with
mkdir under the "mon_groups" directory in every ctrl_mon group.
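
The -ENOSPC vs -EBUSY distinction can be sketched as follows. This is a
toy model of the allocation outcome described above; pick_rmid() and
its counters are hypothetical helpers, not kernel API.

```c
#include <assert.h>
#include <errno.h>

/*
 * Toy model of the mkdir failure modes described above:
 *  - a free RMID exists: allocation succeeds,
 *  - none free but some still in limbo (occupancy not drained): -EBUSY,
 *  - all RMIDs genuinely exhausted: -ENOSPC.
 */
static int pick_rmid(int nr_free, int nr_limbo)
{
	if (nr_free > 0)
		return 0;
	return nr_limbo > 0 ? -EBUSY : -ENOSPC;
}
```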

[Merged Tony's code:
Removed a lot of common mkdir code, fixed the handling of the list of
child rdtgroups and cleaned up list traversal. Also changed the code to
use similar alloc and free paths for CLOSID/RMID and to return -EBUSY
when RMIDs are in limbo and none are free]

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt.h | 17 +++
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 201 ++++++++++++++++++++++++++-----
2 files changed, 189 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index c0acfc3..fdf3654 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -37,23 +37,38 @@ struct mon_evt {
extern bool rdt_alloc_enabled;
extern int rdt_mon_features;

+enum rdt_group_type {
+ RDTCTRL_GROUP = 0,
+ RDTMON_GROUP,
+ RDT_NUM_GROUP,
+};
+
/**
* struct rdtgroup - store rdtgroup's data in resctrl file system.
* @kn: kernfs node
* @rdtgroup_list: linked list for all rdtgroups
+ * @parent: parent rdtgrp
+ * @crdtgrp_list: child rdtgroup node list
* @closid: closid for this rdtgroup
+ * @rmid: rmid for this rdtgroup
* @cpu_mask: CPUs assigned to this rdtgroup
* @flags: status bits
* @waitcount: how many cpus expect to find this
* group when they acquire rdtgroup_mutex
+ * @type: indicates type of this rdtgroup - either
+ * monitor only or ctrl_mon group
*/
struct rdtgroup {
struct kernfs_node *kn;
struct list_head rdtgroup_list;
+ struct rdtgroup *parent;
+ struct list_head crdtgrp_list;
int closid;
+ u32 rmid;
struct cpumask cpu_mask;
int flags;
atomic_t waitcount;
+ enum rdt_group_type type;
};

/* rdtgroup.flags */
@@ -298,6 +313,8 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
int rdtgroup_schemata_show(struct kernfs_open_file *of,
struct seq_file *s, void *v);
+int alloc_rmid(void);
+void free_rmid(u32 rmid);
void rdt_get_mon_l3_config(struct rdt_resource *r);

#endif /* _ASM_X86_INTEL_RDT_H */
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index e997330..36078c7 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -759,6 +759,39 @@ static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn)
return ret;
}

+static int
+mongroup_create_dir(struct kernfs_node *parent_kn, struct rdtgroup *pr,
+ char *name, struct kernfs_node **dest_kn)
+{
+ struct kernfs_node *kn;
+ int ret;
+
+ /* create the directory */
+ kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, pr);
+ if (IS_ERR(kn))
+ return PTR_ERR(kn);
+
+ if (dest_kn)
+ *dest_kn = kn;
+
+ /*
+ * This extra ref will be put in kernfs_remove() and guarantees
+ * that @rdtgrp->kn is always accessible.
+ */
+ kernfs_get(kn);
+
+ ret = rdtgroup_kn_set_ugid(kn);
+ if (ret)
+ goto out_destroy;
+
+ kernfs_activate(kn);
+
+ return 0;
+
+out_destroy:
+ kernfs_remove(kn);
+ return ret;
+}
static void l3_qos_cfg_update(void *arg)
{
bool *enable = arg;
@@ -1083,43 +1116,38 @@ static void rdt_kill_sb(struct super_block *sb)
.kill_sb = rdt_kill_sb,
};

-static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
- umode_t mode)
+/*
+ * Common code for ctrl_mon and monitor group mkdir.
+ * The caller needs to unlock the global mutex upon success.
+ */
+static int mkdir_rdt_common(struct kernfs_node *pkn, struct kernfs_node *prkn,
+ const char *name, umode_t mode,
+ enum rdt_group_type rtype, struct rdtgroup **r)
{
- struct rdtgroup *parent, *rdtgrp;
+ struct rdtgroup *prgrp, *rdtgrp;
struct kernfs_node *kn;
- int ret, closid;
-
- /* Only allow mkdir in the root directory */
- if (parent_kn != rdtgroup_default.kn)
- return -EPERM;
-
- /* Do not accept '\n' to avoid unparsable situation. */
- if (strchr(name, '\n'))
- return -EINVAL;
+ uint fshift = 0;
+ int ret;

- parent = rdtgroup_kn_lock_live(parent_kn);
- if (!parent) {
+ prgrp = rdtgroup_kn_lock_live(prkn);
+ if (!prgrp) {
ret = -ENODEV;
goto out_unlock;
}

- ret = closid_alloc();
- if (ret < 0)
- goto out_unlock;
- closid = ret;
-
/* allocate the rdtgroup. */
rdtgrp = kzalloc(sizeof(*rdtgrp), GFP_KERNEL);
if (!rdtgrp) {
ret = -ENOSPC;
- goto out_closid_free;
+ goto out_unlock;
}
- rdtgrp->closid = closid;
- list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups);
+ *r = rdtgrp;
+ rdtgrp->parent = prgrp;
+ rdtgrp->type = rtype;
+ INIT_LIST_HEAD(&rdtgrp->crdtgrp_list);

/* kernfs creates the directory for rdtgrp */
- kn = kernfs_create_dir(parent->kn, name, mode, rdtgrp);
+ kn = kernfs_create_dir(pkn, name, mode, rdtgrp);
if (IS_ERR(kn)) {
ret = PTR_ERR(kn);
goto out_cancel_ref;
@@ -1138,27 +1166,138 @@ static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
if (ret)
goto out_destroy;

- ret = rdtgroup_add_files(kn, RF_CTRL_BASE);
+ fshift = 1 << (RF_CTRLSHIFT + rtype);
+ ret = rdtgroup_add_files(kn, RFTYPE_BASE | fshift);
if (ret)
goto out_destroy;

+ if (rdt_mon_features) {
+ ret = alloc_rmid();
+ if (ret < 0)
+ return ret;
+
+ rdtgrp->rmid = ret;
+ }
kernfs_activate(kn);

- ret = 0;
- goto out_unlock;
+ return 0;

out_destroy:
kernfs_remove(rdtgrp->kn);
out_cancel_ref:
- list_del(&rdtgrp->rdtgroup_list);
kfree(rdtgrp);
-out_closid_free:
+out_unlock:
+ rdtgroup_kn_unlock(prkn);
+ return ret;
+}
+
+static void mkdir_rdt_common_clean(struct rdtgroup *rgrp)
+{
+ kernfs_remove(rgrp->kn);
+ if (rgrp->rmid)
+ free_rmid(rgrp->rmid);
+ kfree(rgrp);
+}
+
+/*
+ * Create a monitor group under "mon_groups" directory of a control
+ * and monitor group(ctrl_mon). This is a resource group
+ * to monitor a subset of tasks and cpus in its parent ctrl_mon group.
+ */
+static int rdtgroup_mkdir_mon(struct kernfs_node *pkn, struct kernfs_node *prkn,
+ const char *name,
+ umode_t mode)
+{
+ struct rdtgroup *rdtgrp, *prdtgrp;
+ int ret;
+
+ ret = mkdir_rdt_common(pkn, prkn, name, mode, RDTMON_GROUP, &rdtgrp);
+ if (ret)
+ return ret;
+
+ prdtgrp = rdtgrp->parent;
+ rdtgrp->closid = prdtgrp->closid;
+
+ /*
+ * Add the rdtgrp to the list of rdgrp the parent
+ * ctrl_mon group has to track.
+ */
+ list_add_tail(&rdtgrp->crdtgrp_list, &prdtgrp->crdtgrp_list);
+
+ rdtgroup_kn_unlock(prkn);
+ return ret;
+}
+
+/*
+ * These are rdtgroups created under the root directory. Can be used
+ * to allocate and monitor resources.
+ */
+static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *pkn,
+ struct kernfs_node *prkn,
+ const char *name, umode_t mode)
+{
+ struct rdtgroup *rdtgrp;
+ struct kernfs_node *kn;
+ int ret, closid;
+
+ ret = mkdir_rdt_common(pkn, prkn, name, mode, RDTCTRL_GROUP, &rdtgrp);
+ if (ret)
+ return ret;
+
+ kn = rdtgrp->kn;
+ ret = closid_alloc();
+ if (ret < 0)
+ goto out_common_fail;
+ closid = ret;
+
+ rdtgrp->closid = closid;
+ list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups);
+
+ if (rdt_mon_features) {
+ /*
+ * Create an empty mon_groups directory to hold the subset
+ * of tasks and cpus to monitor.
+ */
+ ret = mongroup_create_dir(kn, NULL, "mon_groups", NULL);
+ if (ret)
+ goto out_id_free;
+ }
+
+ ret = 0;
+ goto out_unlock;
+
+out_id_free:
closid_free(closid);
+ list_del(&rdtgrp->rdtgroup_list);
+out_common_fail:
+ mkdir_rdt_common_clean(rdtgrp);
out_unlock:
- rdtgroup_kn_unlock(parent_kn);
+ rdtgroup_kn_unlock(prkn);
return ret;
}

+static int rdtgroup_mkdir(struct kernfs_node *pkn, const char *name,
+ umode_t mode)
+{
+ /* Do not accept '\n' to avoid unparsable situation. */
+ if (strchr(name, '\n'))
+ return -EINVAL;
+
+ /*
+ * We don't allow rdtgroup ctrl_mon directories to be created anywhere
+ * except the root directory and don't allow rdtgroup monitor
+ * directories to be created anywhere except inside mon_groups
+ * directory.
+ */
+ if (rdt_alloc_enabled && pkn == rdtgroup_default.kn)
+ return rdtgroup_mkdir_ctrl_mon(pkn, pkn, name, mode);
+ else if (rdt_mon_features &&
+ !strcmp(pkn->name, "mon_groups"))
+ return rdtgroup_mkdir_mon(pkn, pkn->parent, name, mode);
+ else
+ return -EPERM;
+}
+
static int rdtgroup_rmdir(struct kernfs_node *kn)
{
int ret, cpu, closid = rdtgroup_default.closid;
@@ -1234,6 +1373,10 @@ static int __init rdtgroup_setup_root(void)
mutex_lock(&rdtgroup_mutex);

rdtgroup_default.closid = 0;
+ rdtgroup_default.rmid = 0;
+ rdtgroup_default.type = RDTCTRL_GROUP;
+ INIT_LIST_HEAD(&rdtgroup_default.crdtgrp_list);
+
list_add(&rdtgroup_default.rdtgroup_list, &rdt_all_groups);

ret = rdtgroup_add_files(rdt_root->kn, RF_CTRL_BASE);
--
1.9.1

2017-06-26 18:57:38

by Shivappa Vikas

Subject: [PATCH 10/21] x86/intel_rdt/cqm: Add info files for RDT monitoring

Add info directory files specific to RDT monitoring.

num_rmids:
The number of RMIDs which are valid for the resource.

mon_features:
Lists the monitoring events if monitoring is enabled for the
resource.

max_threshold_occupancy:
This is specific to llc_occupancy monitoring and is used to
determine if an RMID can be reused. It provides an upper bound on
the occupancy threshold and is shown to the user in bytes, though
the internal value is rounded down to the scaling factor supported
by the hardware.
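
The bytes-to-internal-units rounding works roughly like this. This is a
sketch; the mon_scale values in the test are only examples, real values
come from CPUID enumeration.

```c
#include <assert.h>

/*
 * max_threshold_occ_write() stores bytes / mon_scale, and the _show()
 * side multiplies back, so the user-visible value is rounded down to a
 * multiple of the hardware scaling factor.
 */
static unsigned int round_threshold(unsigned int bytes, unsigned int mon_scale)
{
	unsigned int internal = bytes / mon_scale;	/* integer division */

	return internal * mon_scale;			/* what _show() reports */
}
```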

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt.h | 8 ++
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 121 +++++++++++++++++++++++++++----
2 files changed, 113 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index eb41b21..c0acfc3 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -68,11 +68,14 @@ struct rdtgroup {
#define RFTYPE_INFO 1U
#define RFTYPE_BASE (1U << 1)
#define RF_CTRLSHIFT 4
+#define RF_MONSHIFT 5
#define RFTYPE_CTRL (1U << RF_CTRLSHIFT)
+#define RFTYPE_MON (1U << RF_MONSHIFT)
#define RFTYPE_RES_CACHE (1U << 8)
#define RFTYPE_RES_MB (1U << 9)

#define RF_CTRL_INFO (RFTYPE_INFO | RFTYPE_CTRL)
+#define RF_MON_INFO (RFTYPE_INFO | RFTYPE_MON)
#define RF_CTRL_BASE (RFTYPE_BASE | RFTYPE_CTRL)

/* List of all resource groups */
@@ -257,6 +260,11 @@ enum {
r++) \
if (r->alloc_enabled)

+#define for_each_mon_enabled_rdt_resource(r) \
+ for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
+ r++) \
+ if (r->mon_enabled)
+
/* CPUID.(EAX=10H, ECX=ResID=1).EAX */
union cpuid_0x10_1_eax {
struct {
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 1c53802..e997330 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -490,6 +490,28 @@ static int rdt_min_bw_show(struct kernfs_open_file *of,
return 0;
}

+static int rdt_num_rmids_show(struct kernfs_open_file *of,
+ struct seq_file *seq, void *v)
+{
+ struct rdt_resource *r = of->kn->parent->priv;
+
+ seq_printf(seq, "%d\n", r->num_rmid);
+
+ return 0;
+}
+
+static int rdt_mon_features_show(struct kernfs_open_file *of,
+ struct seq_file *seq, void *v)
+{
+ struct rdt_resource *r = of->kn->parent->priv;
+ struct mon_evt *mevt;
+
+ list_for_each_entry(mevt, &r->evt_list, list)
+ seq_printf(seq, "%s\n", mevt->name);
+
+ return 0;
+}
+
static int rdt_bw_gran_show(struct kernfs_open_file *of,
struct seq_file *seq, void *v)
{
@@ -508,6 +530,35 @@ static int rdt_delay_linear_show(struct kernfs_open_file *of,
return 0;
}

+static int max_threshold_occ_show(struct kernfs_open_file *of,
+ struct seq_file *seq, void *v)
+{
+ struct rdt_resource *r = of->kn->parent->priv;
+
+ seq_printf(seq, "%u\n", intel_cqm_threshold * r->mon_scale);
+
+ return 0;
+}
+
+static ssize_t max_threshold_occ_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct rdt_resource *r = of->kn->parent->priv;
+ unsigned int bytes;
+ int ret;
+
+ ret = kstrtouint(buf, 0, &bytes);
+ if (ret)
+ return ret;
+
+ if (bytes > (boot_cpu_data.x86_cache_size * 1024))
+ return -EINVAL;
+
+ intel_cqm_threshold = bytes / r->mon_scale;
+
+ return ret ?: nbytes;
+}
+
/* rdtgroup information files for one cache resource. */
static struct rftype res_common_files[] = {
{
@@ -518,6 +569,20 @@ static int rdt_delay_linear_show(struct kernfs_open_file *of,
.fflags = RF_CTRL_INFO,
},
{
+ .name = "mon_features",
+ .mode = 0444,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .seq_show = rdt_mon_features_show,
+ .fflags = RF_MON_INFO,
+ },
+ {
+ .name = "num_rmids",
+ .mode = 0444,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .seq_show = rdt_num_rmids_show,
+ .fflags = RF_MON_INFO,
+ },
+ {
.name = "cbm_mask",
.mode = 0444,
.kf_ops = &rdtgroup_kf_single_ops,
@@ -553,6 +618,14 @@ static int rdt_delay_linear_show(struct kernfs_open_file *of,
.fflags = RF_CTRL_INFO | RFTYPE_RES_MB,
},
{
+ .name = "max_threshold_occupancy",
+ .mode = 0644,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .write = max_threshold_occ_write,
+ .seq_show = max_threshold_occ_show,
+ .fflags = RF_MON_INFO | RFTYPE_RES_CACHE,
+ },
+ {
.name = "cpus",
.mode = 0644,
.kf_ops = &rdtgroup_kf_single_ops,
@@ -615,15 +688,37 @@ static int rdtgroup_add_files(struct kernfs_node *kn, unsigned long fflags)
return ret;
}

-static u32 getres_fflags(struct rdt_resource *r)
+static int rdtgroup_mkdir_info_resdir(struct rdt_resource *r, char *name,
+ unsigned long fflags)
{
- return (r->fflags | (r->alloc_enabled << RF_CTRLSHIFT));
+ struct kernfs_node *kn_subdir;
+ int ret;
+
+ kn_subdir = kernfs_create_dir(kn_info, name,
+ kn_info->mode, r);
+ if (IS_ERR(kn_subdir)) {
+ ret = PTR_ERR(kn_subdir);
+ goto out_fail;
+ }
+ kernfs_get(kn_subdir);
+ ret = rdtgroup_kn_set_ugid(kn_subdir);
+ if (ret)
+ goto out_fail;
+
+ ret = rdtgroup_add_files(kn_subdir, fflags);
+ if (ret)
+ goto out_fail;
+ kernfs_activate(kn_subdir);
+
+out_fail:
+ return ret;
}

static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn)
{
- struct kernfs_node *kn_subdir;
struct rdt_resource *r;
+ unsigned long fflags;
+ char name[32];
int ret;

/* create the directory */
@@ -633,22 +728,16 @@ static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn)
kernfs_get(kn_info);

for_each_alloc_enabled_rdt_resource(r) {
- kn_subdir = kernfs_create_dir(kn_info, r->name,
- kn_info->mode, r);
- if (IS_ERR(kn_subdir)) {
- ret = PTR_ERR(kn_subdir);
- goto out_destroy;
- }
- kernfs_get(kn_subdir);
- ret = rdtgroup_kn_set_ugid(kn_subdir);
- if (ret)
+ fflags = r->fflags | RF_CTRL_INFO;
+ if (rdtgroup_mkdir_info_resdir(r, r->name, fflags))
goto out_destroy;
+ }

- ret = rdtgroup_add_files(kn_subdir, getres_fflags(r) |
- RFTYPE_INFO);
- if (ret)
+ for_each_mon_enabled_rdt_resource(r) {
+ fflags = r->fflags | RF_MON_INFO;
+ sprintf(name, "%s_MON", r->name);
+ if (rdtgroup_mkdir_info_resdir(r, name, fflags))
goto out_destroy;
- kernfs_activate(kn_subdir);
}

/*
--
1.9.1

2017-06-26 18:58:03

by Shivappa Vikas

Subject: [PATCH 08/21] x86/intel_rdt/cqm: Add RMID(Resource monitoring ID) management

Hardware uses an RMID (Resource Monitoring ID) to keep track of each of
the RDT events associated with tasks. The number of RMIDs is dependent
on the SKU and is enumerated via CPUID. Add support to manage the
RMIDs, which includes managing RMID allocation and reading LLC
occupancy for an RMID.

RMID allocation is managed by keeping a free list which is initialized
to all available RMIDs except for RMID 0, which is always reserved for
the root group. RMIDs go onto a limbo list once they are freed, since
they are still tagged to cache lines of the tasks which were using
them - thereby still having some occupancy. They remain on the limbo
list until their occupancy no longer exceeds intel_cqm_threshold, a
user configurable value. The OS uses the IA32_QM_CTR MSR to read the
occupancy associated with an RMID after programming the IA32_QM_EVTSEL
MSR with the RMID.
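
The limbo rule can be summarized as a single predicate. This is a
sketch of the state decision made by intel_cqm_stable() and
try_freeing_limbo_rmid() in the diff; the helper name is illustrative.

```c
#include <assert.h>

/*
 * An RMID on the limbo list is still "dirty" while its measured LLC
 * occupancy exceeds intel_cqm_threshold; only once the occupancy drops
 * to or below the threshold may it move back to the free list.
 */
static int rmid_can_be_freed(unsigned long long occupancy,
			     unsigned long long threshold)
{
	return occupancy <= threshold;
}
```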

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt.h | 2 +
arch/x86/kernel/cpu/intel_rdt_monitor.c | 121 ++++++++++++++++++++++++++++++++
2 files changed, 123 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index 285f106..cf25b6c 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -20,6 +20,8 @@
#define QOS_L3_MBM_TOTAL_EVENT_ID 0x02
#define QOS_L3_MBM_LOCAL_EVENT_ID 0x03

+#define RMID_VAL_ERROR (1ULL << 63)
+#define RMID_VAL_UNAVAIL (1ULL << 62)
/**
* struct mon_evt - Entry in the event list of a resource
* @evtid: event id
diff --git a/arch/x86/kernel/cpu/intel_rdt_monitor.c b/arch/x86/kernel/cpu/intel_rdt_monitor.c
index a418854..4f4221a 100644
--- a/arch/x86/kernel/cpu/intel_rdt_monitor.c
+++ b/arch/x86/kernel/cpu/intel_rdt_monitor.c
@@ -28,6 +28,9 @@
#include <asm/cpu_device_id.h>
#include "intel_rdt.h"

+#define MSR_IA32_QM_CTR 0x0c8e
+#define MSR_IA32_QM_EVTSEL 0x0c8d
+
enum rmid_recycle_state {
RMID_CHECK = 0,
RMID_DIRTY,
@@ -82,6 +85,124 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
return entry;
}

+static u64 __rmid_read(u32 rmid, u32 eventid)
+{
+ u64 val;
+
+ wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+ rdmsrl(MSR_IA32_QM_CTR, val);
+
+ /*
+ * Aside from the ERROR and UNAVAIL bits, the return value is the
+ * count for this @eventid tagged with @rmid.
+ */
+ return val;
+}
+
+/*
+ * Test whether an RMID is dirty (occupancy > intel_cqm_threshold)
+ */
+static void intel_cqm_stable(void *arg)
+{
+ struct rmid_entry *entry;
+ u64 val;
+
+ /*
+ * Since we are in the IPI already lets mark all the RMIDs
+ * that are dirty
+ */
+ list_for_each_entry(entry, &rmid_limbo_lru, list) {
+ val = __rmid_read(entry->rmid, QOS_L3_OCCUP_EVENT_ID);
+ if (val > intel_cqm_threshold)
+ entry->state = RMID_DIRTY;
+ }
+}
+
+/*
+ * Scan the limbo list and move all entries that are below the
+ * intel_cqm_threshold to the free list.
+ * Return "true" if the limbo list is empty, "false" if there are
+ * still some RMIDs there.
+ */
+static bool try_freeing_limbo_rmid(void)
+{
+ struct rmid_entry *entry, *tmp;
+ struct rdt_resource *r;
+ cpumask_var_t cpu_mask;
+ struct rdt_domain *d;
+ bool ret = true;
+
+ if (list_empty(&rmid_limbo_lru))
+ return ret;
+
+ if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
+ return false;
+
+ r = &rdt_resources_all[RDT_RESOURCE_L3];
+
+ list_for_each_entry(d, &r->domains, list)
+ cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+
+ /*
+ * Test whether an RMID is free for each package.
+ */
+ on_each_cpu_mask(cpu_mask, intel_cqm_stable, NULL, true);
+
+ list_for_each_entry_safe(entry, tmp, &rmid_limbo_lru, list) {
+ /*
+ * Ignore the RMIDs that are marked dirty and reset the
+ * state to check for being dirty again later.
+ */
+ if (entry->state == RMID_DIRTY) {
+ entry->state = RMID_CHECK;
+ ret = false;
+ continue;
+ }
+ list_del(&entry->list);
+ list_add_tail(&entry->list, &rmid_free_lru);
+ }
+
+ free_cpumask_var(cpu_mask);
+ return ret;
+}
+
+int alloc_rmid(void)
+{
+ struct rmid_entry *entry;
+ bool ret;
+
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ if (list_empty(&rmid_free_lru)) {
+ ret = try_freeing_limbo_rmid();
+ if (list_empty(&rmid_free_lru))
+ return ret ? -ENOSPC : -EBUSY;
+ }
+
+ entry = list_first_entry(&rmid_free_lru,
+ struct rmid_entry, list);
+ list_del(&entry->list);
+
+ return entry->rmid;
+}
+
+void free_rmid(u32 rmid)
+{
+ struct rmid_entry *entry;
+
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ WARN_ON(!rmid);
+ entry = __rmid_entry(rmid);
+
+ entry->state = RMID_CHECK;
+
+ if (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID))
+ list_add_tail(&entry->list, &rmid_limbo_lru);
+ else
+ list_add_tail(&entry->list, &rmid_free_lru);
+}
+
static int dom_data_init(struct rdt_resource *r)
{
struct rmid_entry *entry = NULL;
--
1.9.1

2017-06-26 18:58:07

by Shivappa Vikas

Subject: [PATCH 06/21] x86/intel_rdt: Cleanup namespace to support RDT monitoring

A few of the data structures have generic names although they are RDT
allocation specific. Rename them to be allocation specific to
accommodate RDT monitoring. E.g. s/enabled/alloc_enabled/

No functional change.

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/include/asm/intel_rdt_sched.h | 4 ++--
arch/x86/kernel/cpu/intel_rdt.c | 24 +++++++++++------------
arch/x86/kernel/cpu/intel_rdt.h | 18 ++++++++---------
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 33 ++++++++++++++++----------------
arch/x86/kernel/cpu/intel_rdt_schemata.c | 8 ++++----
5 files changed, 44 insertions(+), 43 deletions(-)

diff --git a/arch/x86/include/asm/intel_rdt_sched.h b/arch/x86/include/asm/intel_rdt_sched.h
index 62a70bc..4dee77b 100644
--- a/arch/x86/include/asm/intel_rdt_sched.h
+++ b/arch/x86/include/asm/intel_rdt_sched.h
@@ -27,7 +27,7 @@ struct intel_pqr_state {

DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
DECLARE_PER_CPU_READ_MOSTLY(int, cpu_closid);
-DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
+DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);

/*
* intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
@@ -44,7 +44,7 @@ struct intel_pqr_state {
*/
static inline void intel_rdt_sched_in(void)
{
- if (static_branch_likely(&rdt_enable_key)) {
+ if (static_branch_likely(&rdt_alloc_enable_key)) {
struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
int closid;

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 08872e9..59500f9 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -173,8 +173,8 @@ static inline bool cache_alloc_hsw_probe(void)
r->default_ctrl = max_cbm;
r->cache.cbm_len = 20;
r->cache.min_cbm_bits = 2;
- r->capable = true;
- r->enabled = true;
+ r->alloc_capable = true;
+ r->alloc_enabled = true;

return true;
}
@@ -224,8 +224,8 @@ static bool rdt_get_mem_config(struct rdt_resource *r)
r->data_width = 3;
rdt_get_mba_infofile(r);

- r->capable = true;
- r->enabled = true;
+ r->alloc_capable = true;
+ r->alloc_enabled = true;

return true;
}
@@ -242,8 +242,8 @@ static void rdt_get_cache_config(int idx, struct rdt_resource *r)
r->default_ctrl = BIT_MASK(eax.split.cbm_len + 1) - 1;
r->data_width = (r->cache.cbm_len + 3) / 4;
rdt_get_cache_infofile(r);
- r->capable = true;
- r->enabled = true;
+ r->alloc_capable = true;
+ r->alloc_enabled = true;
}

static void rdt_get_cdp_l3_config(int type)
@@ -255,12 +255,12 @@ static void rdt_get_cdp_l3_config(int type)
r->cache.cbm_len = r_l3->cache.cbm_len;
r->default_ctrl = r_l3->default_ctrl;
r->data_width = (r->cache.cbm_len + 3) / 4;
- r->capable = true;
+ r->alloc_capable = true;
/*
* By default, CDP is disabled. CDP can be enabled by mount parameter
* "cdp" during resctrl file system mount time.
*/
- r->enabled = false;
+ r->alloc_enabled = false;
}

static int get_cache_id(int cpu, int level)
@@ -464,7 +464,7 @@ static int intel_rdt_online_cpu(unsigned int cpu)
struct rdt_resource *r;

mutex_lock(&rdtgroup_mutex);
- for_each_capable_rdt_resource(r)
+ for_each_alloc_capable_rdt_resource(r)
domain_add_cpu(cpu, r);
/* The cpu is set in default rdtgroup after online. */
cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
@@ -480,7 +480,7 @@ static int intel_rdt_offline_cpu(unsigned int cpu)
struct rdt_resource *r;

mutex_lock(&rdtgroup_mutex);
- for_each_capable_rdt_resource(r)
+ for_each_alloc_capable_rdt_resource(r)
domain_remove_cpu(cpu, r);
list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask))
@@ -501,7 +501,7 @@ static __init void rdt_init_padding(void)
struct rdt_resource *r;
int cl;

- for_each_capable_rdt_resource(r) {
+ for_each_alloc_capable_rdt_resource(r) {
cl = strlen(r->name);
if (cl > max_name_width)
max_name_width = cl;
@@ -565,7 +565,7 @@ static int __init intel_rdt_late_init(void)
return ret;
}

- for_each_capable_rdt_resource(r)
+ for_each_alloc_capable_rdt_resource(r)
pr_info("Intel RDT %s allocation detected\n", r->name);

return 0;
diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index 0e4852d..29630af 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -135,8 +135,8 @@ struct rdt_membw {

/**
* struct rdt_resource - attributes of an RDT resource
- * @enabled: Is this feature enabled on this machine
- * @capable: Is this feature available on this machine
+ * @alloc_enabled: Is allocation enabled on this machine
+ * @alloc_capable: Is allocation available on this machine
* @name: Name to use in "schemata" file
* @num_closid: Number of CLOSIDs available
* @cache_level: Which cache level defines scope of this resource
@@ -152,8 +152,8 @@ struct rdt_membw {
* @parse_ctrlval: Per resource function pointer to parse control values
*/
struct rdt_resource {
- bool enabled;
- bool capable;
+ bool alloc_enabled;
+ bool alloc_capable;
char *name;
int num_closid;
int cache_level;
@@ -181,7 +181,7 @@ struct rdt_resource {

extern struct rdt_resource rdt_resources_all[];
extern struct rdtgroup rdtgroup_default;
-DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
+DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);

int __init rdtgroup_init(void);

@@ -196,15 +196,15 @@ enum {
RDT_NUM_RESOURCES,
};

-#define for_each_capable_rdt_resource(r) \
+#define for_each_alloc_capable_rdt_resource(r) \
for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
r++) \
- if (r->capable)
+ if (r->alloc_capable)

-#define for_each_enabled_rdt_resource(r) \
+#define for_each_alloc_enabled_rdt_resource(r) \
for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
r++) \
- if (r->enabled)
+ if (r->alloc_enabled)

/* CPUID.(EAX=10H, ECX=ResID=1).EAX */
union cpuid_0x10_1_eax {
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index fab8811..8ef9390 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -35,7 +35,7 @@
#include <asm/intel_rdt_sched.h>
#include "intel_rdt.h"

-DEFINE_STATIC_KEY_FALSE(rdt_enable_key);
+DEFINE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
struct kernfs_root *rdt_root;
struct rdtgroup rdtgroup_default;
LIST_HEAD(rdt_all_groups);
@@ -66,7 +66,7 @@ static void closid_init(void)
int rdt_min_closid = 32;

/* Compute rdt_min_closid across all resources */
- for_each_enabled_rdt_resource(r)
+ for_each_alloc_enabled_rdt_resource(r)
rdt_min_closid = min(rdt_min_closid, r->num_closid);

closid_free_map = BIT_MASK(rdt_min_closid) - 1;
@@ -638,7 +638,7 @@ static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn)
return PTR_ERR(kn_info);
kernfs_get(kn_info);

- for_each_enabled_rdt_resource(r) {
+ for_each_alloc_enabled_rdt_resource(r) {
kn_subdir = kernfs_create_dir(kn_info, r->name,
kn_info->mode, r);
if (IS_ERR(kn_subdir)) {
@@ -718,14 +718,15 @@ static int cdp_enable(void)
struct rdt_resource *r_l3 = &rdt_resources_all[RDT_RESOURCE_L3];
int ret;

- if (!r_l3->capable || !r_l3data->capable || !r_l3code->capable)
+ if (!r_l3->alloc_capable || !r_l3data->alloc_capable ||
+ !r_l3code->alloc_capable)
return -EINVAL;

ret = set_l3_qos_cfg(r_l3, true);
if (!ret) {
- r_l3->enabled = false;
- r_l3data->enabled = true;
- r_l3code->enabled = true;
+ r_l3->alloc_enabled = false;
+ r_l3data->alloc_enabled = true;
+ r_l3code->alloc_enabled = true;
}
return ret;
}
@@ -734,11 +735,11 @@ static void cdp_disable(void)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3];

- r->enabled = r->capable;
+ r->alloc_enabled = r->alloc_capable;

- if (rdt_resources_all[RDT_RESOURCE_L3DATA].enabled) {
- rdt_resources_all[RDT_RESOURCE_L3DATA].enabled = false;
- rdt_resources_all[RDT_RESOURCE_L3CODE].enabled = false;
+ if (rdt_resources_all[RDT_RESOURCE_L3DATA].alloc_enabled) {
+ rdt_resources_all[RDT_RESOURCE_L3DATA].alloc_enabled = false;
+ rdt_resources_all[RDT_RESOURCE_L3CODE].alloc_enabled = false;
set_l3_qos_cfg(r, false);
}
}
@@ -834,7 +835,7 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
/*
* resctrl file system can only be mounted once.
*/
- if (static_branch_unlikely(&rdt_enable_key)) {
+ if (static_branch_unlikely(&rdt_alloc_enable_key)) {
dentry = ERR_PTR(-EBUSY);
goto out;
}
@@ -858,7 +859,7 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
if (IS_ERR(dentry))
goto out_destroy;

- static_branch_enable(&rdt_enable_key);
+ static_branch_enable(&rdt_alloc_enable_key);
goto out;

out_destroy:
@@ -986,11 +987,11 @@ static void rdt_kill_sb(struct super_block *sb)
mutex_lock(&rdtgroup_mutex);

/*Put everything back to default values. */
- for_each_enabled_rdt_resource(r)
+ for_each_alloc_enabled_rdt_resource(r)
reset_all_ctrls(r);
cdp_disable();
rmdir_all_sub();
- static_branch_disable(&rdt_enable_key);
+ static_branch_disable(&rdt_alloc_enable_key);
kernfs_kill_sb(sb);
mutex_unlock(&rdtgroup_mutex);
}
@@ -1129,7 +1130,7 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)

static int rdtgroup_show_options(struct seq_file *seq, struct kernfs_root *kf)
{
- if (rdt_resources_all[RDT_RESOURCE_L3DATA].enabled)
+ if (rdt_resources_all[RDT_RESOURCE_L3DATA].alloc_enabled)
seq_puts(seq, ",cdp");
return 0;
}
diff --git a/arch/x86/kernel/cpu/intel_rdt_schemata.c b/arch/x86/kernel/cpu/intel_rdt_schemata.c
index 8cef1c8..952156c 100644
--- a/arch/x86/kernel/cpu/intel_rdt_schemata.c
+++ b/arch/x86/kernel/cpu/intel_rdt_schemata.c
@@ -192,7 +192,7 @@ static int rdtgroup_parse_resource(char *resname, char *tok, int closid)
{
struct rdt_resource *r;

- for_each_enabled_rdt_resource(r) {
+ for_each_alloc_enabled_rdt_resource(r) {
if (!strcmp(resname, r->name) && closid < r->num_closid)
return parse_line(tok, r);
}
@@ -221,7 +221,7 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,

closid = rdtgrp->closid;

- for_each_enabled_rdt_resource(r) {
+ for_each_alloc_enabled_rdt_resource(r) {
list_for_each_entry(dom, &r->domains, list)
dom->have_new_ctrl = false;
}
@@ -237,7 +237,7 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
goto out;
}

- for_each_enabled_rdt_resource(r) {
+ for_each_alloc_enabled_rdt_resource(r) {
ret = update_domains(r, closid);
if (ret)
goto out;
@@ -274,7 +274,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
rdtgrp = rdtgroup_kn_lock_live(of->kn);
if (rdtgrp) {
closid = rdtgrp->closid;
- for_each_enabled_rdt_resource(r) {
+ for_each_alloc_enabled_rdt_resource(r) {
if (closid < r->num_closid)
show_doms(s, r, closid);
}
--
1.9.1

2017-06-26 18:58:00

by Shivappa Vikas

Subject: [PATCH 09/21] x86/intel_rdt: Simplify info and base file lists

From: Tony Luck <[email protected]>

The info directory files and base files need to differ for each
resource, such as cache and memory bandwidth. Within each resource, the
files further differ between monitoring and control. This leads to a lot
of separate static array declarations now that resctrl monitoring is
being added.

Simplify this to one common list of files, with a set of flags that
select files based on the resource, on whether the directory is info or
base, and on whether the file is a control-type file. This is a
preparation to include monitoring-based info and base files.

No functional change.

[Vikas: Extended the flags to have a few bits per category, such as
resource and info/base]

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt.c | 7 +-
arch/x86/kernel/cpu/intel_rdt.h | 23 +++--
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 154 +++++++++++++++----------------
3 files changed, 94 insertions(+), 90 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 121eb14..e96b3f0 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -82,6 +82,7 @@ struct rdt_resource rdt_resources_all[] = {
},
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
+ .fflags = RFTYPE_RES_CACHE,
},
{
.name = "L3DATA",
@@ -96,6 +97,7 @@ struct rdt_resource rdt_resources_all[] = {
},
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
+ .fflags = RFTYPE_RES_CACHE,
},
{
.name = "L3CODE",
@@ -110,6 +112,7 @@ struct rdt_resource rdt_resources_all[] = {
},
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
+ .fflags = RFTYPE_RES_CACHE,
},
{
.name = "L2",
@@ -124,6 +127,7 @@ struct rdt_resource rdt_resources_all[] = {
},
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
+ .fflags = RFTYPE_RES_CACHE,
},
{
.name = "MB",
@@ -133,6 +137,7 @@ struct rdt_resource rdt_resources_all[] = {
.cache_level = 3,
.parse_ctrlval = parse_bw,
.format_str = "%d=%*d",
+ .fflags = RFTYPE_RES_MB,
},
};

@@ -228,7 +233,6 @@ static bool rdt_get_mem_config(struct rdt_resource *r)
return false;
}
r->data_width = 3;
- rdt_get_mba_infofile(r);

r->alloc_capable = true;
r->alloc_enabled = true;
@@ -247,7 +251,6 @@ static void rdt_get_cache_alloc_config(int idx, struct rdt_resource *r)
r->cache.cbm_len = eax.split.cbm_len + 1;
r->default_ctrl = BIT_MASK(eax.split.cbm_len + 1) - 1;
r->data_width = (r->cache.cbm_len + 3) / 4;
- rdt_get_cache_infofile(r);
r->alloc_capable = true;
r->alloc_enabled = true;
}
diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index cf25b6c..eb41b21 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -62,6 +62,19 @@ struct rdtgroup {
/* rftype.flags */
#define RFTYPE_FLAGS_CPUS_LIST 1

+/*
+ * Define the file type flags for base and info directories.
+ */
+#define RFTYPE_INFO 1U
+#define RFTYPE_BASE (1U << 1)
+#define RF_CTRLSHIFT 4
+#define RFTYPE_CTRL (1U << RF_CTRLSHIFT)
+#define RFTYPE_RES_CACHE (1U << 8)
+#define RFTYPE_RES_MB (1U << 9)
+
+#define RF_CTRL_INFO (RFTYPE_INFO | RFTYPE_CTRL)
+#define RF_CTRL_BASE (RFTYPE_BASE | RFTYPE_CTRL)
+
/* List of all resource groups */
extern struct list_head rdt_all_groups;

@@ -75,6 +88,7 @@ struct rdtgroup {
* @mode: Access mode
* @kf_ops: File operations
* @flags: File specific RFTYPE_FLAGS_* flags
+ * @fflags: File specific RF_* or RFTYPE_* flags
* @seq_show: Show content of the file
* @write: Write to the file
*/
@@ -83,6 +97,7 @@ struct rftype {
umode_t mode;
struct kernfs_ops *kf_ops;
unsigned long flags;
+ unsigned long fflags;

int (*seq_show)(struct kernfs_open_file *of,
struct seq_file *sf, void *v);
@@ -173,13 +188,12 @@ struct rdt_membw {
* @data_width: Character width of data when displaying
* @domains: All domains for this resource
* @cache: Cache allocation related data
- * @info_files: resctrl info files for the resource
- * @nr_info_files: Number of info files
* @format_str: Per resource format string to show domain value
* @parse_ctrlval: Per resource function pointer to parse control values
* @evt_list: List of monitoring events
* @num_rmid: Number of RMIDs available
* @mon_scale: cqm counter * mon_scale = occupancy in bytes
+ * @fflags: flags to choose base and info files
*/
struct rdt_resource {
bool alloc_enabled;
@@ -197,18 +211,15 @@ struct rdt_resource {
struct list_head domains;
struct rdt_cache cache;
struct rdt_membw membw;
- struct rftype *info_files;
- int nr_info_files;
const char *format_str;
int (*parse_ctrlval) (char *buf, struct rdt_resource *r,
struct rdt_domain *d);
struct list_head evt_list;
int num_rmid;
unsigned int mon_scale;
+ unsigned long fflags;
};

-void rdt_get_cache_infofile(struct rdt_resource *r);
-void rdt_get_mba_infofile(struct rdt_resource *r);
int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d);
int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d);

diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 8ef9390..1c53802 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -125,28 +125,6 @@ static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
return 0;
}

-static int rdtgroup_add_files(struct kernfs_node *kn, struct rftype *rfts,
- int len)
-{
- struct rftype *rft;
- int ret;
-
- lockdep_assert_held(&rdtgroup_mutex);
-
- for (rft = rfts; rft < rfts + len; rft++) {
- ret = rdtgroup_add_file(kn, rft);
- if (ret)
- goto error;
- }
-
- return 0;
-error:
- pr_warn("Failed to add %s, err=%d\n", rft->name, ret);
- while (--rft >= rfts)
- kernfs_remove_by_name(kn, rft->name);
- return ret;
-}
-
static int rdtgroup_seqfile_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
@@ -476,39 +454,6 @@ static int rdtgroup_tasks_show(struct kernfs_open_file *of,
return ret;
}

-/* Files in each rdtgroup */
-static struct rftype rdtgroup_base_files[] = {
- {
- .name = "cpus",
- .mode = 0644,
- .kf_ops = &rdtgroup_kf_single_ops,
- .write = rdtgroup_cpus_write,
- .seq_show = rdtgroup_cpus_show,
- },
- {
- .name = "cpus_list",
- .mode = 0644,
- .kf_ops = &rdtgroup_kf_single_ops,
- .write = rdtgroup_cpus_write,
- .seq_show = rdtgroup_cpus_show,
- .flags = RFTYPE_FLAGS_CPUS_LIST,
- },
- {
- .name = "tasks",
- .mode = 0644,
- .kf_ops = &rdtgroup_kf_single_ops,
- .write = rdtgroup_tasks_write,
- .seq_show = rdtgroup_tasks_show,
- },
- {
- .name = "schemata",
- .mode = 0644,
- .kf_ops = &rdtgroup_kf_single_ops,
- .write = rdtgroup_schemata_write,
- .seq_show = rdtgroup_schemata_show,
- },
-};
-
static int rdt_num_closids_show(struct kernfs_open_file *of,
struct seq_file *seq, void *v)
{
@@ -564,73 +509,122 @@ static int rdt_delay_linear_show(struct kernfs_open_file *of,
}

/* rdtgroup information files for one cache resource. */
-static struct rftype res_cache_info_files[] = {
+static struct rftype res_common_files[] = {
{
.name = "num_closids",
.mode = 0444,
.kf_ops = &rdtgroup_kf_single_ops,
.seq_show = rdt_num_closids_show,
+ .fflags = RF_CTRL_INFO,
},
{
.name = "cbm_mask",
.mode = 0444,
.kf_ops = &rdtgroup_kf_single_ops,
.seq_show = rdt_default_ctrl_show,
+ .fflags = RF_CTRL_INFO | RFTYPE_RES_CACHE,
},
{
.name = "min_cbm_bits",
.mode = 0444,
.kf_ops = &rdtgroup_kf_single_ops,
.seq_show = rdt_min_cbm_bits_show,
- },
-};
-
-/* rdtgroup information files for memory bandwidth. */
-static struct rftype res_mba_info_files[] = {
- {
- .name = "num_closids",
- .mode = 0444,
- .kf_ops = &rdtgroup_kf_single_ops,
- .seq_show = rdt_num_closids_show,
+ .fflags = RF_CTRL_INFO | RFTYPE_RES_CACHE,
},
{
.name = "min_bandwidth",
.mode = 0444,
.kf_ops = &rdtgroup_kf_single_ops,
.seq_show = rdt_min_bw_show,
+ .fflags = RF_CTRL_INFO | RFTYPE_RES_MB,
},
{
.name = "bandwidth_gran",
.mode = 0444,
.kf_ops = &rdtgroup_kf_single_ops,
.seq_show = rdt_bw_gran_show,
+ .fflags = RF_CTRL_INFO | RFTYPE_RES_MB,
},
{
.name = "delay_linear",
.mode = 0444,
.kf_ops = &rdtgroup_kf_single_ops,
.seq_show = rdt_delay_linear_show,
+ .fflags = RF_CTRL_INFO | RFTYPE_RES_MB,
+ },
+ {
+ .name = "cpus",
+ .mode = 0644,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .write = rdtgroup_cpus_write,
+ .seq_show = rdtgroup_cpus_show,
+ .fflags = RFTYPE_BASE,
+ },
+ {
+ .name = "cpus_list",
+ .mode = 0644,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .write = rdtgroup_cpus_write,
+ .seq_show = rdtgroup_cpus_show,
+ .flags = RFTYPE_FLAGS_CPUS_LIST,
+ .fflags = RFTYPE_BASE,
+ },
+ {
+ .name = "tasks",
+ .mode = 0644,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .write = rdtgroup_tasks_write,
+ .seq_show = rdtgroup_tasks_show,
+ .fflags = RFTYPE_BASE,
+ },
+ {
+ .name = "schemata",
+ .mode = 0644,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .write = rdtgroup_schemata_write,
+ .seq_show = rdtgroup_schemata_show,
+ .fflags = RF_CTRL_BASE,
},
};

-void rdt_get_mba_infofile(struct rdt_resource *r)
+static int rdtgroup_add_files(struct kernfs_node *kn, unsigned long fflags)
{
- r->info_files = res_mba_info_files;
- r->nr_info_files = ARRAY_SIZE(res_mba_info_files);
+ struct rftype *rfts, *rft;
+ int ret, len;
+
+ rfts = res_common_files;
+ len = ARRAY_SIZE(res_common_files);
+
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ for (rft = rfts; rft < rfts + len; rft++) {
+ if ((fflags & rft->fflags) == rft->fflags) {
+ ret = rdtgroup_add_file(kn, rft);
+ if (ret)
+ goto error;
+ }
+ }
+
+ return 0;
+error:
+ pr_warn("Failed to add %s, err=%d\n", rft->name, ret);
+ while (--rft >= rfts) {
+ if ((fflags & rft->fflags) == rft->fflags)
+ kernfs_remove_by_name(kn, rft->name);
+ }
+ return ret;
}

-void rdt_get_cache_infofile(struct rdt_resource *r)
+static u32 getres_fflags(struct rdt_resource *r)
{
- r->info_files = res_cache_info_files;
- r->nr_info_files = ARRAY_SIZE(res_cache_info_files);
+ return (r->fflags | (r->alloc_enabled << RF_CTRLSHIFT));
}

static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn)
{
struct kernfs_node *kn_subdir;
- struct rftype *res_info_files;
struct rdt_resource *r;
- int ret, len;
+ int ret;

/* create the directory */
kn_info = kernfs_create_dir(parent_kn, "info", parent_kn->mode, NULL);
@@ -650,10 +644,8 @@ static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn)
if (ret)
goto out_destroy;

- res_info_files = r->info_files;
- len = r->nr_info_files;
-
- ret = rdtgroup_add_files(kn_subdir, res_info_files, len);
+ ret = rdtgroup_add_files(kn_subdir, getres_fflags(r) |
+ RFTYPE_INFO);
if (ret)
goto out_destroy;
kernfs_activate(kn_subdir);
@@ -1057,8 +1049,7 @@ static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
if (ret)
goto out_destroy;

- ret = rdtgroup_add_files(kn, rdtgroup_base_files,
- ARRAY_SIZE(rdtgroup_base_files));
+ ret = rdtgroup_add_files(kn, RF_CTRL_BASE);
if (ret)
goto out_destroy;

@@ -1156,8 +1147,7 @@ static int __init rdtgroup_setup_root(void)
rdtgroup_default.closid = 0;
list_add(&rdtgroup_default.rdtgroup_list, &rdt_all_groups);

- ret = rdtgroup_add_files(rdt_root->kn, rdtgroup_base_files,
- ARRAY_SIZE(rdtgroup_base_files));
+ ret = rdtgroup_add_files(rdt_root->kn, RF_CTRL_BASE);
if (ret) {
kernfs_destroy_root(rdt_root);
goto out;
--
1.9.1

2017-06-26 18:59:04

by Shivappa Vikas

Subject: [PATCH 16/21] x86/intel_rdt/cqm: Add mount,umount support

Add monitoring support during mount and unmount. Since the root
directory is a "ctrl_mon" directory which can both control and monitor
resources, create a "mon_groups" directory to hold monitor groups and a
"mon_data" directory to hold all monitoring data, as in the rest of the
resource groups.

The mount succeeds if either monitoring or control/allocation is
enabled. If only monitoring is enabled, the user can still create
monitor groups under "/sys/fs/resctrl/mon_groups/", but any mkdir under
the root would fail. If only control/allocation is enabled, none of the
monitoring related directories/files exist and resctrl works in legacy
mode.
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt.h | 5 +++
arch/x86/kernel/cpu/intel_rdt_monitor.c | 3 ++
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 67 +++++++++++++++++++++++++++++---
3 files changed, 70 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index 631d58e..ea7a86f 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -22,6 +22,9 @@

#define RMID_VAL_ERROR (1ULL << 63)
#define RMID_VAL_UNAVAIL (1ULL << 62)
+
+DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
+
/**
* struct mon_evt - Entry in the event list of a resource
* @evtid: event id
@@ -59,6 +62,8 @@ struct rmid_read {
extern int rdt_mon_features;

DECLARE_PER_CPU_READ_MOSTLY(int, cpu_rmid);
+DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
+

enum rdt_group_type {
RDTCTRL_GROUP = 0,
diff --git a/arch/x86/kernel/cpu/intel_rdt_monitor.c b/arch/x86/kernel/cpu/intel_rdt_monitor.c
index cc252eb..a7247ca 100644
--- a/arch/x86/kernel/cpu/intel_rdt_monitor.c
+++ b/arch/x86/kernel/cpu/intel_rdt_monitor.c
@@ -76,6 +76,9 @@ struct rmid_entry {
unsigned int intel_cqm_threshold;

DEFINE_PER_CPU_READ_MOSTLY(int, cpu_rmid);
+
+DEFINE_STATIC_KEY_FALSE(rdt_mon_enable_key);
+
static inline struct rmid_entry *__rmid_entry(u32 rmid)
{
struct rmid_entry *entry;
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 6131508..2384c07 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -35,6 +35,7 @@
#include <asm/intel_rdt_sched.h>
#include "intel_rdt.h"

+DEFINE_STATIC_KEY_FALSE(rdt_enable_key);
DEFINE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
struct kernfs_root *rdt_root;
struct rdtgroup rdtgroup_default;
@@ -43,6 +44,12 @@
/* Kernel fs node for "info" directory under root */
static struct kernfs_node *kn_info;

+/* Kernel fs node for "mon_groups" directory under root */
+static struct kernfs_node *kn_mongrp;
+
+/* Kernel fs node for "mon_data" directory under root */
+static struct kernfs_node *kn_mondata;
+
/*
* Trivial allocator for CLOSIDs. Since h/w only supports a small number,
* we can keep a bitmap of free CLOSIDs in a single integer.
@@ -1078,6 +1085,9 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn)
}
}

+static int mkdir_mondata_all(struct kernfs_node *parent_kn, struct rdtgroup *pr,
+ struct kernfs_node **mon_data_kn);
+
static struct dentry *rdt_mount(struct file_system_type *fs_type,
int flags, const char *unused_dev_name,
void *data)
@@ -1089,7 +1099,7 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
/*
* resctrl file system can only be mounted once.
*/
- if (static_branch_unlikely(&rdt_alloc_enable_key)) {
+ if (static_branch_unlikely(&rdt_enable_key)) {
dentry = ERR_PTR(-EBUSY);
goto out;
}
@@ -1108,15 +1118,47 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
goto out_cdp;
}

+ if (rdt_mon_features) {
+ ret = mongroup_create_dir(rdtgroup_default.kn,
+ NULL, "mon_groups",
+ &kn_mongrp);
+ if (ret) {
+ dentry = ERR_PTR(ret);
+ goto out_info;
+ }
+ kernfs_get(kn_mongrp);
+
+ ret = mkdir_mondata_all(rdtgroup_default.kn,
+ &rdtgroup_default, &kn_mondata);
+ if (ret) {
+ dentry = ERR_PTR(ret);
+ goto out_mongrp;
+ }
+ kernfs_get(kn_mondata);
+ rdtgroup_default.mon_data_kn = kn_mondata;
+ }
+
dentry = kernfs_mount(fs_type, flags, rdt_root,
RDTGROUP_SUPER_MAGIC, NULL);
if (IS_ERR(dentry))
- goto out_destroy;
+ goto out_mondata;

- static_branch_enable(&rdt_alloc_enable_key);
+ if (rdt_alloc_enabled)
+ static_branch_enable(&rdt_alloc_enable_key);
+ if (rdt_mon_features)
+ static_branch_enable(&rdt_mon_enable_key);
+
+ if (rdt_alloc_enabled || rdt_mon_features)
+ static_branch_enable(&rdt_enable_key);
goto out;

-out_destroy:
+out_mondata:
+ if (rdt_mon_features)
+ kernfs_remove(kn_mondata);
+out_mongrp:
+ if (rdt_mon_features)
+ kernfs_remove(kn_mongrp);
+out_info:
kernfs_remove(kn_info);
out_cdp:
cdp_disable();
@@ -1219,12 +1261,21 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
*/
static void rmdir_all_sub(void)
{
- struct rdtgroup *rdtgrp, *tmp;
+ struct rdtgroup *rdtgrp, *tmp, *sentry, *stmp;
+ struct list_head *llist;

/* Move all tasks to the default resource group */
rdt_move_group_tasks(NULL, &rdtgroup_default, NULL);

list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) {
+ /* Free any child rmids */
+ llist = &rdtgrp->crdtgrp_list;
+ list_for_each_entry_safe(sentry, stmp, llist, crdtgrp_list) {
+ free_rmid(sentry->rmid);
+ list_del(&sentry->crdtgrp_list);
+ kfree(sentry);
+ }
+
/* Remove each rdtgroup other than root */
if (rdtgrp == &rdtgroup_default)
continue;
@@ -1237,6 +1288,8 @@ static void rmdir_all_sub(void)
cpumask_or(&rdtgroup_default.cpu_mask,
&rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);

+ free_rmid(rdtgrp->rmid);
+
kernfs_remove(rdtgrp->kn);
list_del(&rdtgrp->rdtgroup_list);
kfree(rdtgrp);
@@ -1247,6 +1300,8 @@ static void rmdir_all_sub(void)
put_online_cpus();

kernfs_remove(kn_info);
+ kernfs_remove(kn_mongrp);
+ kernfs_remove(kn_mondata);
}

static void rdt_kill_sb(struct super_block *sb)
@@ -1261,6 +1316,8 @@ static void rdt_kill_sb(struct super_block *sb)
cdp_disable();
rmdir_all_sub();
static_branch_disable(&rdt_alloc_enable_key);
+ static_branch_disable(&rdt_mon_enable_key);
+ static_branch_disable(&rdt_enable_key);
kernfs_kill_sb(sb);
mutex_unlock(&rdtgroup_mutex);
}
--
1.9.1

2017-06-26 18:59:15

by Shivappa Vikas

Subject: [PATCH 01/21] x86/perf/cqm: Wipe out perf based cqm

'perf cqm' never worked due to the incompatibility between the perf
infrastructure and the cqm hardware support. The hardware uses RMIDs to
track the llc occupancy of tasks, and these RMIDs are per package. This
makes it difficult to monitor a hierarchy like cgroup alongside
monitoring individual tasks, and several patches sent to lkml to fix
this were NACKed. Furthermore, the following issues in the current perf
cqm make it almost unusable:

1. No support to monitor the same group of tasks for which we do
allocation using resctrl.

2. It gives random and inaccurate data (mostly 0s) once we run out
of RMIDs due to issues in recycling.

3. Recycling results in inaccuracy of data because we cannot
guarantee that the RMID was stolen from a task when it was not
pulling data into cache, or even when it pulled the least data. Also,
for monitoring llc_occupancy, if we stop using an RMID_x and then
start using an RMID_y after we reclaim an RMID from another event,
we miss accounting all the occupancy that was tagged to RMID_x at a
later perf_count.

4. The recycling code makes the monitoring code complex, including
scheduling, because the event can lose its RMID at any time. Since MBM
counters count bandwidth over a period of time by taking snapshots of
total bytes at two different times, recycling complicates the way we
count MBM in a hierarchy. We also need a spin lock while we do the
processing to account for MBM counter overflow, and we currently
use a spin lock in scheduling to prevent the RMID from being taken
away.

5. Lack of support when we run different kinds of events like task,
system-wide and cgroup events together. Data mostly prints 0s. This
is also because we can have only one RMID tied to a cpu as defined
by the cqm hardware, but perf can tie multiple events to it during
one sched_in.

6. No support for monitoring a group of tasks. There is partial support
for cgroup, but it does not work once there is a hierarchy of cgroups
or if we want to monitor a task in a cgroup and the cgroup itself.

7. No support for monitoring tasks for their lifetime without perf
overhead.

8. It reported the aggregate cache occupancy or memory bandwidth over
all sockets. But most cloud and VMM based use cases want to know the
individual per-socket usage.

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/events/intel/Makefile | 2 +-
arch/x86/events/intel/cqm.c | 1766 -------------------------------
arch/x86/include/asm/intel_rdt_common.h | 2 -
arch/x86/kernel/cpu/intel_rdt.c | 8 +
include/linux/perf_event.h | 18 -
kernel/events/core.c | 11 +-
kernel/trace/bpf_trace.c | 2 +-
7 files changed, 11 insertions(+), 1798 deletions(-)
delete mode 100644 arch/x86/events/intel/cqm.c

diff --git a/arch/x86/events/intel/Makefile b/arch/x86/events/intel/Makefile
index 06c2baa..e9d8520 100644
--- a/arch/x86/events/intel/Makefile
+++ b/arch/x86/events/intel/Makefile
@@ -1,4 +1,4 @@
-obj-$(CONFIG_CPU_SUP_INTEL) += core.o bts.o cqm.o
+obj-$(CONFIG_CPU_SUP_INTEL) += core.o bts.o
obj-$(CONFIG_CPU_SUP_INTEL) += ds.o knc.o
obj-$(CONFIG_CPU_SUP_INTEL) += lbr.o p4.o p6.o pt.o
obj-$(CONFIG_PERF_EVENTS_INTEL_RAPL) += intel-rapl-perf.o
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
deleted file mode 100644
index 8c00dc0..0000000
--- a/arch/x86/events/intel/cqm.c
+++ /dev/null
@@ -1,1766 +0,0 @@
-/*
- * Intel Cache Quality-of-Service Monitoring (CQM) support.
- *
- * Based very, very heavily on work by Peter Zijlstra.
- */
-
-#include <linux/perf_event.h>
-#include <linux/slab.h>
-#include <asm/cpu_device_id.h>
-#include <asm/intel_rdt_common.h>
-#include "../perf_event.h"
-
-#define MSR_IA32_QM_CTR 0x0c8e
-#define MSR_IA32_QM_EVTSEL 0x0c8d
-
-#define MBM_CNTR_WIDTH 24
-/*
- * Guaranteed time in ms as per SDM where MBM counters will not overflow.
- */
-#define MBM_CTR_OVERFLOW_TIME 1000
-
-static u32 cqm_max_rmid = -1;
-static unsigned int cqm_l3_scale; /* supposedly cacheline size */
-static bool cqm_enabled, mbm_enabled;
-unsigned int mbm_socket_max;
-
-/*
- * The cached intel_pqr_state is strictly per CPU and can never be
- * updated from a remote CPU. Both functions which modify the state
- * (intel_cqm_event_start and intel_cqm_event_stop) are called with
- * interrupts disabled, which is sufficient for the protection.
- */
-DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
-static struct hrtimer *mbm_timers;
-/**
- * struct sample - mbm event's (local or total) data
- * @total_bytes #bytes since we began monitoring
- * @prev_msr previous value of MSR
- */
-struct sample {
- u64 total_bytes;
- u64 prev_msr;
-};
-
-/*
- * samples profiled for total memory bandwidth type events
- */
-static struct sample *mbm_total;
-/*
- * samples profiled for local memory bandwidth type events
- */
-static struct sample *mbm_local;
-
-#define pkg_id topology_physical_package_id(smp_processor_id())
-/*
- * rmid_2_index returns the index for the rmid in mbm_local/mbm_total array.
- * mbm_total[] and mbm_local[] are linearly indexed by socket# * max number of
- * rmids per socket, an example is given below
- * RMID1 of Socket0: vrmid = 1
- * RMID1 of Socket1: vrmid = 1 * (cqm_max_rmid + 1) + 1
- * RMID1 of Socket2: vrmid = 2 * (cqm_max_rmid + 1) + 1
- */
-#define rmid_2_index(rmid) ((pkg_id * (cqm_max_rmid + 1)) + rmid)
-/*
- * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
- * Also protects event->hw.cqm_rmid
- *
- * Hold either for stability, both for modification of ->hw.cqm_rmid.
- */
-static DEFINE_MUTEX(cache_mutex);
-static DEFINE_RAW_SPINLOCK(cache_lock);
-
-/*
- * Groups of events that have the same target(s), one RMID per group.
- */
-static LIST_HEAD(cache_groups);
-
-/*
- * Mask of CPUs for reading CQM values. We only need one per-socket.
- */
-static cpumask_t cqm_cpumask;
-
-#define RMID_VAL_ERROR (1ULL << 63)
-#define RMID_VAL_UNAVAIL (1ULL << 62)
-
-/*
- * Event IDs are used to program IA32_QM_EVTSEL before reading event
- * counter from IA32_QM_CTR
- */
-#define QOS_L3_OCCUP_EVENT_ID 0x01
-#define QOS_MBM_TOTAL_EVENT_ID 0x02
-#define QOS_MBM_LOCAL_EVENT_ID 0x03
-
-/*
- * This is central to the rotation algorithm in __intel_cqm_rmid_rotate().
- *
- * This rmid is always free and is guaranteed to have an associated
- * near-zero occupancy value, i.e. no cachelines are tagged with this
- * RMID, once __intel_cqm_rmid_rotate() returns.
- */
-static u32 intel_cqm_rotation_rmid;
-
-#define INVALID_RMID (-1)
-
-/*
- * Is @rmid valid for programming the hardware?
- *
- * rmid 0 is reserved by the hardware for all non-monitored tasks, which
- * means that we should never come across an rmid with that value.
- * Likewise, an rmid value of -1 is used to indicate "no rmid currently
- * assigned" and is used as part of the rotation code.
- */
-static inline bool __rmid_valid(u32 rmid)
-{
- if (!rmid || rmid == INVALID_RMID)
- return false;
-
- return true;
-}
-
-static u64 __rmid_read(u32 rmid)
-{
- u64 val;
-
- /*
- * Ignore the SDM, this thing is _NOTHING_ like a regular perfcnt,
- * it just says that to increase confusion.
- */
- wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_OCCUP_EVENT_ID, rmid);
- rdmsrl(MSR_IA32_QM_CTR, val);
-
- /*
- * Aside from the ERROR and UNAVAIL bits, assume this thing returns
- * the number of cachelines tagged with @rmid.
- */
- return val;
-}
-
-enum rmid_recycle_state {
- RMID_YOUNG = 0,
- RMID_AVAILABLE,
- RMID_DIRTY,
-};
-
-struct cqm_rmid_entry {
- u32 rmid;
- enum rmid_recycle_state state;
- struct list_head list;
- unsigned long queue_time;
-};
-
-/*
- * cqm_rmid_free_lru - A least recently used list of RMIDs.
- *
- * Oldest entry at the head, newest (most recently used) entry at the
- * tail. This list is never traversed, it's only used to keep track of
- * the lru order. That is, we only pick entries of the head or insert
- * them on the tail.
- *
- * All entries on the list are 'free', and their RMIDs are not currently
- * in use. To mark an RMID as in use, remove its entry from the lru
- * list.
- *
- *
- * cqm_rmid_limbo_lru - list of currently unused but (potentially) dirty RMIDs.
- *
- * This list contains RMIDs that no one is currently using but that
- * may have a non-zero occupancy value associated with them. The
- * rotation worker moves RMIDs from the limbo list to the free list once
- * the occupancy value drops below __intel_cqm_threshold.
- *
- * Both lists are protected by cache_mutex.
- */
-static LIST_HEAD(cqm_rmid_free_lru);
-static LIST_HEAD(cqm_rmid_limbo_lru);
-
-/*
- * We use a simple array of pointers so that we can lookup a struct
- * cqm_rmid_entry in O(1). This alleviates the callers of __get_rmid()
- * and __put_rmid() from having to worry about dealing with struct
- * cqm_rmid_entry - they just deal with rmids, i.e. integers.
- *
- * Once this array is initialized it is read-only. No locks are required
- * to access it.
- *
- * All entries for all RMIDs can be looked up in this array at all
- * times.
- */
-static struct cqm_rmid_entry **cqm_rmid_ptrs;
-
-static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid)
-{
- struct cqm_rmid_entry *entry;
-
- entry = cqm_rmid_ptrs[rmid];
- WARN_ON(entry->rmid != rmid);
-
- return entry;
-}
-
-/*
- * Returns < 0 on fail.
- *
- * We expect to be called with cache_mutex held.
- */
-static u32 __get_rmid(void)
-{
- struct cqm_rmid_entry *entry;
-
- lockdep_assert_held(&cache_mutex);
-
- if (list_empty(&cqm_rmid_free_lru))
- return INVALID_RMID;
-
- entry = list_first_entry(&cqm_rmid_free_lru, struct cqm_rmid_entry, list);
- list_del(&entry->list);
-
- return entry->rmid;
-}
-
-static void __put_rmid(u32 rmid)
-{
- struct cqm_rmid_entry *entry;
-
- lockdep_assert_held(&cache_mutex);
-
- WARN_ON(!__rmid_valid(rmid));
- entry = __rmid_entry(rmid);
-
- entry->queue_time = jiffies;
- entry->state = RMID_YOUNG;
-
- list_add_tail(&entry->list, &cqm_rmid_limbo_lru);
-}
-
-static void cqm_cleanup(void)
-{
- int i;
-
- if (!cqm_rmid_ptrs)
- return;
-
- for (i = 0; i <= cqm_max_rmid; i++)
- kfree(cqm_rmid_ptrs[i]);
-
- kfree(cqm_rmid_ptrs);
- cqm_rmid_ptrs = NULL;
- cqm_enabled = false;
-}
-
-static int intel_cqm_setup_rmid_cache(void)
-{
- struct cqm_rmid_entry *entry;
- unsigned int nr_rmids;
- int r = 0;
-
- nr_rmids = cqm_max_rmid + 1;
- cqm_rmid_ptrs = kzalloc(sizeof(struct cqm_rmid_entry *) *
- nr_rmids, GFP_KERNEL);
- if (!cqm_rmid_ptrs)
- return -ENOMEM;
-
- for (; r <= cqm_max_rmid; r++) {
- struct cqm_rmid_entry *entry;
-
- entry = kmalloc(sizeof(*entry), GFP_KERNEL);
- if (!entry)
- goto fail;
-
- INIT_LIST_HEAD(&entry->list);
- entry->rmid = r;
- cqm_rmid_ptrs[r] = entry;
-
- list_add_tail(&entry->list, &cqm_rmid_free_lru);
- }
-
- /*
- * RMID 0 is special and is always allocated. It's used for all
- * tasks that are not monitored.
- */
- entry = __rmid_entry(0);
- list_del(&entry->list);
-
- mutex_lock(&cache_mutex);
- intel_cqm_rotation_rmid = __get_rmid();
- mutex_unlock(&cache_mutex);
-
- return 0;
-
-fail:
- cqm_cleanup();
- return -ENOMEM;
-}
-
-/*
- * Determine if @a and @b measure the same set of tasks.
- *
- * If @a and @b measure the same set of tasks then we want to share a
- * single RMID.
- */
-static bool __match_event(struct perf_event *a, struct perf_event *b)
-{
- /* Per-cpu and task events don't mix */
- if ((a->attach_state & PERF_ATTACH_TASK) !=
- (b->attach_state & PERF_ATTACH_TASK))
- return false;
-
-#ifdef CONFIG_CGROUP_PERF
- if (a->cgrp != b->cgrp)
- return false;
-#endif
-
- /* If not task event, we're machine wide */
- if (!(b->attach_state & PERF_ATTACH_TASK))
- return true;
-
- /*
- * Events that target same task are placed into the same cache group.
- * Mark it as a multi event group, so that we update ->count
- * for every event rather than just the group leader later.
- */
- if (a->hw.target == b->hw.target) {
- b->hw.is_group_event = true;
- return true;
- }
-
- /*
- * Are we an inherited event?
- */
- if (b->parent == a)
- return true;
-
- return false;
-}
-
-#ifdef CONFIG_CGROUP_PERF
-static inline struct perf_cgroup *event_to_cgroup(struct perf_event *event)
-{
- if (event->attach_state & PERF_ATTACH_TASK)
- return perf_cgroup_from_task(event->hw.target, event->ctx);
-
- return event->cgrp;
-}
-#endif
-
-/*
- * Determine if @a's tasks intersect with @b's tasks
- *
- * There are combinations of events that we explicitly prohibit,
- *
- * PROHIBITS
- * system-wide -> cgroup and task
- * cgroup -> system-wide
- * -> task in cgroup
- * task -> system-wide
- * -> task in cgroup
- *
- * Call this function before allocating an RMID.
- */
-static bool __conflict_event(struct perf_event *a, struct perf_event *b)
-{
-#ifdef CONFIG_CGROUP_PERF
- /*
- * We can have any number of cgroups but only one system-wide
- * event at a time.
- */
- if (a->cgrp && b->cgrp) {
- struct perf_cgroup *ac = a->cgrp;
- struct perf_cgroup *bc = b->cgrp;
-
- /*
- * This condition should have been caught in
- * __match_event() and we should be sharing an RMID.
- */
- WARN_ON_ONCE(ac == bc);
-
- if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) ||
- cgroup_is_descendant(bc->css.cgroup, ac->css.cgroup))
- return true;
-
- return false;
- }
-
- if (a->cgrp || b->cgrp) {
- struct perf_cgroup *ac, *bc;
-
- /*
- * cgroup and system-wide events are mutually exclusive
- */
- if ((a->cgrp && !(b->attach_state & PERF_ATTACH_TASK)) ||
- (b->cgrp && !(a->attach_state & PERF_ATTACH_TASK)))
- return true;
-
- /*
- * Ensure neither event is part of the other's cgroup
- */
- ac = event_to_cgroup(a);
- bc = event_to_cgroup(b);
- if (ac == bc)
- return true;
-
- /*
- * Must have cgroup and non-intersecting task events.
- */
- if (!ac || !bc)
- return false;
-
- /*
- * We have cgroup and task events, and the task belongs
- * to a cgroup. Check for overlap.
- */
- if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) ||
- cgroup_is_descendant(bc->css.cgroup, ac->css.cgroup))
- return true;
-
- return false;
- }
-#endif
- /*
- * If one of them is not a task, same story as above with cgroups.
- */
- if (!(a->attach_state & PERF_ATTACH_TASK) ||
- !(b->attach_state & PERF_ATTACH_TASK))
- return true;
-
- /*
- * Must be non-overlapping.
- */
- return false;
-}
-
-struct rmid_read {
- u32 rmid;
- u32 evt_type;
- atomic64_t value;
-};
-
-static void __intel_cqm_event_count(void *info);
-static void init_mbm_sample(u32 rmid, u32 evt_type);
-static void __intel_mbm_event_count(void *info);
-
-static bool is_cqm_event(int e)
-{
- return (e == QOS_L3_OCCUP_EVENT_ID);
-}
-
-static bool is_mbm_event(int e)
-{
- return (e >= QOS_MBM_TOTAL_EVENT_ID && e <= QOS_MBM_LOCAL_EVENT_ID);
-}
-
-static void cqm_mask_call(struct rmid_read *rr)
-{
- if (is_mbm_event(rr->evt_type))
- on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_count, rr, 1);
- else
- on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count, rr, 1);
-}
-
-/*
- * Exchange the RMID of a group of events.
- */
-static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)
-{
- struct perf_event *event;
- struct list_head *head = &group->hw.cqm_group_entry;
- u32 old_rmid = group->hw.cqm_rmid;
-
- lockdep_assert_held(&cache_mutex);
-
- /*
- * If our RMID is being deallocated, perform a read now.
- */
- if (__rmid_valid(old_rmid) && !__rmid_valid(rmid)) {
- struct rmid_read rr = {
- .rmid = old_rmid,
- .evt_type = group->attr.config,
- .value = ATOMIC64_INIT(0),
- };
-
- cqm_mask_call(&rr);
- local64_set(&group->count, atomic64_read(&rr.value));
- }
-
- raw_spin_lock_irq(&cache_lock);
-
- group->hw.cqm_rmid = rmid;
- list_for_each_entry(event, head, hw.cqm_group_entry)
- event->hw.cqm_rmid = rmid;
-
- raw_spin_unlock_irq(&cache_lock);
-
- /*
- * If the allocation is for mbm, init the mbm stats.
- * Need to check if each event in the group is an mbm event
- * because there could be multiple types of events in the same group.
- */
- if (__rmid_valid(rmid)) {
- event = group;
- if (is_mbm_event(event->attr.config))
- init_mbm_sample(rmid, event->attr.config);
-
- list_for_each_entry(event, head, hw.cqm_group_entry) {
- if (is_mbm_event(event->attr.config))
- init_mbm_sample(rmid, event->attr.config);
- }
- }
-
- return old_rmid;
-}
-
-/*
- * If we fail to assign a new RMID for intel_cqm_rotation_rmid because
- * cachelines are still tagged with RMIDs in limbo, we progressively
- * increment the threshold until we find an RMID in limbo with <=
- * __intel_cqm_threshold lines tagged. This is designed to mitigate the
- * problem where cachelines tagged with an RMID are not steadily being
- * evicted.
- *
- * On successful rotations we decrease the threshold back towards zero.
- *
- * __intel_cqm_max_threshold provides an upper bound on the threshold,
- * and is measured in bytes because it's exposed to userland.
- */
-static unsigned int __intel_cqm_threshold;
-static unsigned int __intel_cqm_max_threshold;
-
-/*
- * Test whether an RMID has a zero occupancy value on this cpu.
- */
-static void intel_cqm_stable(void *arg)
-{
- struct cqm_rmid_entry *entry;
-
- list_for_each_entry(entry, &cqm_rmid_limbo_lru, list) {
- if (entry->state != RMID_AVAILABLE)
- break;
-
- if (__rmid_read(entry->rmid) > __intel_cqm_threshold)
- entry->state = RMID_DIRTY;
- }
-}
-
-/*
- * If we have group events waiting for an RMID that don't conflict with
- * events already running, assign @rmid.
- */
-static bool intel_cqm_sched_in_event(u32 rmid)
-{
- struct perf_event *leader, *event;
-
- lockdep_assert_held(&cache_mutex);
-
- leader = list_first_entry(&cache_groups, struct perf_event,
- hw.cqm_groups_entry);
- event = leader;
-
- list_for_each_entry_continue(event, &cache_groups,
- hw.cqm_groups_entry) {
- if (__rmid_valid(event->hw.cqm_rmid))
- continue;
-
- if (__conflict_event(event, leader))
- continue;
-
- intel_cqm_xchg_rmid(event, rmid);
- return true;
- }
-
- return false;
-}
-
-/*
- * Initially use this constant for both the limbo queue time and the
- * rotation timer interval, pmu::hrtimer_interval_ms.
- *
- * They don't need to be the same, but the two are related since if you
- * rotate faster than you recycle RMIDs, you may run out of available
- * RMIDs.
- */
-#define RMID_DEFAULT_QUEUE_TIME 250 /* ms */
-
-static unsigned int __rmid_queue_time_ms = RMID_DEFAULT_QUEUE_TIME;
-
-/*
- * intel_cqm_rmid_stabilize - move RMIDs from limbo to free list
- * @available: number of freeable RMIDs on the limbo list
- *
- * Quiescent state; wait for all 'freed' RMIDs to become unused, i.e. no
- * cachelines are tagged with those RMIDs. After this we can reuse them
- * and know that the current set of active RMIDs is stable.
- *
- * Return %true or %false depending on whether stabilization needs to be
- * reattempted.
- *
- * If we return %true then @available is updated to indicate the
- * number of RMIDs on the limbo list that have been queued for the
- * minimum queue time (RMID_AVAILABLE), but whose data occupancy values
- * are above __intel_cqm_threshold.
- */
-static bool intel_cqm_rmid_stabilize(unsigned int *available)
-{
- struct cqm_rmid_entry *entry, *tmp;
-
- lockdep_assert_held(&cache_mutex);
-
- *available = 0;
- list_for_each_entry(entry, &cqm_rmid_limbo_lru, list) {
- unsigned long min_queue_time;
- unsigned long now = jiffies;
-
- /*
- * We hold RMIDs placed into limbo for a minimum queue
- * time. Before the minimum queue time has elapsed we do
- * not recycle RMIDs.
- *
- * The reasoning is that until a sufficient time has
- * passed since we stopped using an RMID, any RMID
- * placed onto the limbo list will likely still have
- * data tagged in the cache, which means we'll probably
- * fail to recycle it anyway.
- *
- * We can save ourselves an expensive IPI by skipping
- * any RMIDs that have not been queued for the minimum
- * time.
- */
- min_queue_time = entry->queue_time +
- msecs_to_jiffies(__rmid_queue_time_ms);
-
- if (time_after(min_queue_time, now))
- break;
-
- entry->state = RMID_AVAILABLE;
- (*available)++;
- }
-
- /*
- * Fast return if none of the RMIDs on the limbo list have been
- * sitting on the queue for the minimum queue time.
- */
- if (!*available)
- return false;
-
- /*
- * Test whether an RMID is free for each package.
- */
- on_each_cpu_mask(&cqm_cpumask, intel_cqm_stable, NULL, true);
-
- list_for_each_entry_safe(entry, tmp, &cqm_rmid_limbo_lru, list) {
- /*
- * Exhausted all RMIDs that have waited min queue time.
- */
- if (entry->state == RMID_YOUNG)
- break;
-
- if (entry->state == RMID_DIRTY)
- continue;
-
- list_del(&entry->list); /* remove from limbo */
-
- /*
- * The rotation RMID gets priority if it's
- * currently invalid, in which case we skip adding
- * the RMID to the free lru.
- */
- if (!__rmid_valid(intel_cqm_rotation_rmid)) {
- intel_cqm_rotation_rmid = entry->rmid;
- continue;
- }
-
- /*
- * If we have groups waiting for RMIDs, hand
- * them one now provided they don't conflict.
- */
- if (intel_cqm_sched_in_event(entry->rmid))
- continue;
-
- /*
- * Otherwise place it onto the free list.
- */
- list_add_tail(&entry->list, &cqm_rmid_free_lru);
- }
-
- return __rmid_valid(intel_cqm_rotation_rmid);
-}
-
-/*
- * Pick a victim group and move it to the tail of the group list.
- * @next: The first group without an RMID
- */
-static void __intel_cqm_pick_and_rotate(struct perf_event *next)
-{
- struct perf_event *rotor;
- u32 rmid;
-
- lockdep_assert_held(&cache_mutex);
-
- rotor = list_first_entry(&cache_groups, struct perf_event,
- hw.cqm_groups_entry);
-
- /*
- * The group at the front of the list should always have a valid
- * RMID. If it doesn't then no groups have RMIDs assigned and we
- * don't need to rotate the list.
- */
- if (next == rotor)
- return;
-
- rmid = intel_cqm_xchg_rmid(rotor, INVALID_RMID);
- __put_rmid(rmid);
-
- list_rotate_left(&cache_groups);
-}
-
-/*
- * Deallocate the RMIDs from any events that conflict with @event, and
- * place them on the back of the group list.
- */
-static void intel_cqm_sched_out_conflicting_events(struct perf_event *event)
-{
- struct perf_event *group, *g;
- u32 rmid;
-
- lockdep_assert_held(&cache_mutex);
-
- list_for_each_entry_safe(group, g, &cache_groups, hw.cqm_groups_entry) {
- if (group == event)
- continue;
-
- rmid = group->hw.cqm_rmid;
-
- /*
- * Skip events that don't have a valid RMID.
- */
- if (!__rmid_valid(rmid))
- continue;
-
- /*
- * No conflict? No problem! Leave the event alone.
- */
- if (!__conflict_event(group, event))
- continue;
-
- intel_cqm_xchg_rmid(group, INVALID_RMID);
- __put_rmid(rmid);
- }
-}
-
-/*
- * Attempt to rotate the groups and assign new RMIDs.
- *
- * We rotate for two reasons,
- * 1. To handle the scheduling of conflicting events
- * 2. To recycle RMIDs
- *
- * Rotating RMIDs is complicated because the hardware doesn't give us
- * any clues.
- *
- * There are problems with the hardware interface; when you change the
- * task:RMID map, cachelines retain their 'old' tags, giving a skewed
- * picture. In order to work around this, we must always keep one free
- * RMID - intel_cqm_rotation_rmid.
- *
- * Rotation works by taking away an RMID from a group (the old RMID),
- * and assigning the free RMID to another group (the new RMID). We must
- * then wait for the old RMID to not be used (no cachelines tagged).
- * This ensures that all cachelines are tagged with 'active' RMIDs. At
- * this point we can start reading values for the new RMID and treat the
- * old RMID as the free RMID for the next rotation.
- *
- * Return %true or %false depending on whether we did any rotating.
- */
-static bool __intel_cqm_rmid_rotate(void)
-{
- struct perf_event *group, *start = NULL;
- unsigned int threshold_limit;
- unsigned int nr_needed = 0;
- unsigned int nr_available;
- bool rotated = false;
-
- mutex_lock(&cache_mutex);
-
-again:
- /*
- * Fast path through this function if there are no groups and no
- * RMIDs that need cleaning.
- */
- if (list_empty(&cache_groups) && list_empty(&cqm_rmid_limbo_lru))
- goto out;
-
- list_for_each_entry(group, &cache_groups, hw.cqm_groups_entry) {
- if (!__rmid_valid(group->hw.cqm_rmid)) {
- if (!start)
- start = group;
- nr_needed++;
- }
- }
-
- /*
- * We have some event groups, but they all have RMIDs assigned
- * and no RMIDs need cleaning.
- */
- if (!nr_needed && list_empty(&cqm_rmid_limbo_lru))
- goto out;
-
- if (!nr_needed)
- goto stabilize;
-
- /*
- * We have more event groups without RMIDs than available RMIDs,
- * or we have event groups that conflict with the ones currently
- * scheduled.
- *
- * We force deallocate the rmid of the group at the head of
- * cache_groups. The first event group without an RMID then gets
- * assigned intel_cqm_rotation_rmid. This ensures we always make
- * forward progress.
- *
- * Rotate the cache_groups list so the previous head is now the
- * tail.
- */
- __intel_cqm_pick_and_rotate(start);
-
- /*
- * If the rotation is going to succeed, reduce the threshold so
- * that we don't needlessly reuse dirty RMIDs.
- */
- if (__rmid_valid(intel_cqm_rotation_rmid)) {
- intel_cqm_xchg_rmid(start, intel_cqm_rotation_rmid);
- intel_cqm_rotation_rmid = __get_rmid();
-
- intel_cqm_sched_out_conflicting_events(start);
-
- if (__intel_cqm_threshold)
- __intel_cqm_threshold--;
- }
-
- rotated = true;
-
-stabilize:
- /*
- * We now need to stabilize the RMID we freed above (if any) to
- * ensure that the next time we rotate we have an RMID with zero
- * occupancy value.
- *
- * Alternatively, if we didn't need to perform any rotation,
- * we'll have a bunch of RMIDs in limbo that need stabilizing.
- */
- threshold_limit = __intel_cqm_max_threshold / cqm_l3_scale;
-
- while (intel_cqm_rmid_stabilize(&nr_available) &&
- __intel_cqm_threshold < threshold_limit) {
- unsigned int steal_limit;
-
- /*
- * Don't spin if nobody is actively waiting for an RMID,
- * the rotation worker will be kicked as soon as an
- * event needs an RMID anyway.
- */
- if (!nr_needed)
- break;
-
- /* Allow max 25% of RMIDs to be in limbo. */
- steal_limit = (cqm_max_rmid + 1) / 4;
-
- /*
- * We failed to stabilize any RMIDs so our rotation
- * logic is now stuck. In order to make forward progress
- * we have a few options:
- *
- * 1. rotate ("steal") another RMID
- * 2. increase the threshold
- * 3. do nothing
- *
- * We do both of 1. and 2. until we hit the steal limit.
- *
- * The steal limit prevents all RMIDs ending up on the
- * limbo list. This can happen if every RMID has a
- * non-zero occupancy above threshold_limit, and the
- * occupancy values aren't dropping fast enough.
- *
- * Note that there is prioritisation at work here - we'd
- * rather increase the number of RMIDs on the limbo list
- * than increase the threshold, because increasing the
- * threshold skews the event data (because we reuse
- * dirty RMIDs) - threshold bumps are a last resort.
- */
- if (nr_available < steal_limit)
- goto again;
-
- __intel_cqm_threshold++;
- }
-
-out:
- mutex_unlock(&cache_mutex);
- return rotated;
-}
-
-static void intel_cqm_rmid_rotate(struct work_struct *work);
-
-static DECLARE_DELAYED_WORK(intel_cqm_rmid_work, intel_cqm_rmid_rotate);
-
-static struct pmu intel_cqm_pmu;
-
-static void intel_cqm_rmid_rotate(struct work_struct *work)
-{
- unsigned long delay;
-
- __intel_cqm_rmid_rotate();
-
- delay = msecs_to_jiffies(intel_cqm_pmu.hrtimer_interval_ms);
- schedule_delayed_work(&intel_cqm_rmid_work, delay);
-}
-
-static u64 update_sample(unsigned int rmid, u32 evt_type, int first)
-{
- struct sample *mbm_current;
- u32 vrmid = rmid_2_index(rmid);
- u64 val, bytes, shift;
- u32 eventid;
-
- if (evt_type == QOS_MBM_LOCAL_EVENT_ID) {
- mbm_current = &mbm_local[vrmid];
- eventid = QOS_MBM_LOCAL_EVENT_ID;
- } else {
- mbm_current = &mbm_total[vrmid];
- eventid = QOS_MBM_TOTAL_EVENT_ID;
- }
-
- wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
- rdmsrl(MSR_IA32_QM_CTR, val);
- if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
- return mbm_current->total_bytes;
-
- if (first) {
- mbm_current->prev_msr = val;
- mbm_current->total_bytes = 0;
- return mbm_current->total_bytes;
- }
-
- /*
- * The h/w guarantees that counters will not overflow
- * so long as we poll them at least once per second.
- */
- shift = 64 - MBM_CNTR_WIDTH;
- bytes = (val << shift) - (mbm_current->prev_msr << shift);
- bytes >>= shift;
-
- bytes *= cqm_l3_scale;
-
- mbm_current->total_bytes += bytes;
- mbm_current->prev_msr = val;
-
- return mbm_current->total_bytes;
-}
-
-static u64 rmid_read_mbm(unsigned int rmid, u32 evt_type)
-{
- return update_sample(rmid, evt_type, 0);
-}
-
-static void __intel_mbm_event_init(void *info)
-{
- struct rmid_read *rr = info;
-
- update_sample(rr->rmid, rr->evt_type, 1);
-}
-
-static void init_mbm_sample(u32 rmid, u32 evt_type)
-{
- struct rmid_read rr = {
- .rmid = rmid,
- .evt_type = evt_type,
- .value = ATOMIC64_INIT(0),
- };
-
- /* on each socket, init sample */
- on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
-}
-
-/*
- * Find a group and setup RMID.
- *
- * If we're part of a group, we use the group's RMID.
- */
-static void intel_cqm_setup_event(struct perf_event *event,
- struct perf_event **group)
-{
- struct perf_event *iter;
- bool conflict = false;
- u32 rmid;
-
- event->hw.is_group_event = false;
- list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
- rmid = iter->hw.cqm_rmid;
-
- if (__match_event(iter, event)) {
- /* All tasks in a group share an RMID */
- event->hw.cqm_rmid = rmid;
- *group = iter;
- if (is_mbm_event(event->attr.config) && __rmid_valid(rmid))
- init_mbm_sample(rmid, event->attr.config);
- return;
- }
-
- /*
- * We only care about conflicts for events that are
- * actually scheduled in (and hence have a valid RMID).
- */
- if (__conflict_event(iter, event) && __rmid_valid(rmid))
- conflict = true;
- }
-
- if (conflict)
- rmid = INVALID_RMID;
- else
- rmid = __get_rmid();
-
- if (is_mbm_event(event->attr.config) && __rmid_valid(rmid))
- init_mbm_sample(rmid, event->attr.config);
-
- event->hw.cqm_rmid = rmid;
-}
-
-static void intel_cqm_event_read(struct perf_event *event)
-{
- unsigned long flags;
- u32 rmid;
- u64 val;
-
- /*
- * Task events are handled by intel_cqm_event_count().
- */
- if (event->cpu == -1)
- return;
-
- raw_spin_lock_irqsave(&cache_lock, flags);
- rmid = event->hw.cqm_rmid;
-
- if (!__rmid_valid(rmid))
- goto out;
-
- if (is_mbm_event(event->attr.config))
- val = rmid_read_mbm(rmid, event->attr.config);
- else
- val = __rmid_read(rmid);
-
- /*
- * Ignore this reading on error states and do not update the value.
- */
- if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
- goto out;
-
- local64_set(&event->count, val);
-out:
- raw_spin_unlock_irqrestore(&cache_lock, flags);
-}
-
-static void __intel_cqm_event_count(void *info)
-{
- struct rmid_read *rr = info;
- u64 val;
-
- val = __rmid_read(rr->rmid);
-
- if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
- return;
-
- atomic64_add(val, &rr->value);
-}
-
-static inline bool cqm_group_leader(struct perf_event *event)
-{
- return !list_empty(&event->hw.cqm_groups_entry);
-}
-
-static void __intel_mbm_event_count(void *info)
-{
- struct rmid_read *rr = info;
- u64 val;
-
- val = rmid_read_mbm(rr->rmid, rr->evt_type);
- if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
- return;
- atomic64_add(val, &rr->value);
-}
-
-static enum hrtimer_restart mbm_hrtimer_handle(struct hrtimer *hrtimer)
-{
- struct perf_event *iter, *iter1;
- int ret = HRTIMER_RESTART;
- struct list_head *head;
- unsigned long flags;
- u32 grp_rmid;
-
- /*
- * Need to hold the cache_lock as the timer Event Select MSR reads
- * can race with the mbm/cqm count() and mbm_init() reads.
- */
- raw_spin_lock_irqsave(&cache_lock, flags);
-
- if (list_empty(&cache_groups)) {
- ret = HRTIMER_NORESTART;
- goto out;
- }
-
- list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
- grp_rmid = iter->hw.cqm_rmid;
- if (!__rmid_valid(grp_rmid))
- continue;
- if (is_mbm_event(iter->attr.config))
- update_sample(grp_rmid, iter->attr.config, 0);
-
- head = &iter->hw.cqm_group_entry;
- if (list_empty(head))
- continue;
- list_for_each_entry(iter1, head, hw.cqm_group_entry) {
- if (!iter1->hw.is_group_event)
- break;
- if (is_mbm_event(iter1->attr.config))
- update_sample(iter1->hw.cqm_rmid,
- iter1->attr.config, 0);
- }
- }
-
- hrtimer_forward_now(hrtimer, ms_to_ktime(MBM_CTR_OVERFLOW_TIME));
-out:
- raw_spin_unlock_irqrestore(&cache_lock, flags);
-
- return ret;
-}
-
-static void __mbm_start_timer(void *info)
-{
- hrtimer_start(&mbm_timers[pkg_id], ms_to_ktime(MBM_CTR_OVERFLOW_TIME),
- HRTIMER_MODE_REL_PINNED);
-}
-
-static void __mbm_stop_timer(void *info)
-{
- hrtimer_cancel(&mbm_timers[pkg_id]);
-}
-
-static void mbm_start_timers(void)
-{
- on_each_cpu_mask(&cqm_cpumask, __mbm_start_timer, NULL, 1);
-}
-
-static void mbm_stop_timers(void)
-{
- on_each_cpu_mask(&cqm_cpumask, __mbm_stop_timer, NULL, 1);
-}
-
-static void mbm_hrtimer_init(void)
-{
- struct hrtimer *hr;
- int i;
-
- for (i = 0; i < mbm_socket_max; i++) {
- hr = &mbm_timers[i];
- hrtimer_init(hr, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
- hr->function = mbm_hrtimer_handle;
- }
-}
-
-static u64 intel_cqm_event_count(struct perf_event *event)
-{
- unsigned long flags;
- struct rmid_read rr = {
- .evt_type = event->attr.config,
- .value = ATOMIC64_INIT(0),
- };
-
- /*
- * We only need to worry about task events. System-wide events
- * are handled like usual, i.e. entirely with
- * intel_cqm_event_read().
- */
- if (event->cpu != -1)
- return __perf_event_count(event);
-
- /*
- * Only the group leader gets to report values, except when there
- * are multiple events in the same group, in which case we still
- * need to read the other events. This stops us reporting duplicate
- * values to userspace, and gives us a clear rule for which task
- * gets to report the values.
- *
- * Note that it is impossible to attribute these values to
- * specific packages - we forfeit that ability when we create
- * task events.
- */
- if (!cqm_group_leader(event) && !event->hw.is_group_event)
- return 0;
-
- /*
- * Getting up-to-date values requires an SMP IPI which is not
- * possible if we're being called in interrupt context. Return
- * the cached values instead.
- */
- if (unlikely(in_interrupt()))
- goto out;
-
- /*
- * Notice that we don't perform the reading of an RMID
- * atomically, because we can't hold a spin lock across the
- * IPIs.
- *
- * Speculatively perform the read, since @event might be
- * assigned a different (possibly invalid) RMID while we're
- * busying performing the IPI calls. It's therefore necessary to
- * check @event's RMID afterwards, and if it has changed,
- * discard the result of the read.
- */
- rr.rmid = ACCESS_ONCE(event->hw.cqm_rmid);
-
- if (!__rmid_valid(rr.rmid))
- goto out;
-
- cqm_mask_call(&rr);
-
- raw_spin_lock_irqsave(&cache_lock, flags);
- if (event->hw.cqm_rmid == rr.rmid)
- local64_set(&event->count, atomic64_read(&rr.value));
- raw_spin_unlock_irqrestore(&cache_lock, flags);
-out:
- return __perf_event_count(event);
-}
-
-static void intel_cqm_event_start(struct perf_event *event, int mode)
-{
- struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
- u32 rmid = event->hw.cqm_rmid;
-
- if (!(event->hw.cqm_state & PERF_HES_STOPPED))
- return;
-
- event->hw.cqm_state &= ~PERF_HES_STOPPED;
-
- if (state->rmid_usecnt++) {
- if (!WARN_ON_ONCE(state->rmid != rmid))
- return;
- } else {
- WARN_ON_ONCE(state->rmid);
- }
-
- state->rmid = rmid;
- wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
-}
-
-static void intel_cqm_event_stop(struct perf_event *event, int mode)
-{
- struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
-
- if (event->hw.cqm_state & PERF_HES_STOPPED)
- return;
-
- event->hw.cqm_state |= PERF_HES_STOPPED;
-
- intel_cqm_event_read(event);
-
- if (!--state->rmid_usecnt) {
- state->rmid = 0;
- wrmsr(MSR_IA32_PQR_ASSOC, 0, state->closid);
- } else {
- WARN_ON_ONCE(!state->rmid);
- }
-}
-
-static int intel_cqm_event_add(struct perf_event *event, int mode)
-{
- unsigned long flags;
- u32 rmid;
-
- raw_spin_lock_irqsave(&cache_lock, flags);
-
- event->hw.cqm_state = PERF_HES_STOPPED;
- rmid = event->hw.cqm_rmid;
-
- if (__rmid_valid(rmid) && (mode & PERF_EF_START))
- intel_cqm_event_start(event, mode);
-
- raw_spin_unlock_irqrestore(&cache_lock, flags);
-
- return 0;
-}
-
-static void intel_cqm_event_destroy(struct perf_event *event)
-{
- struct perf_event *group_other = NULL;
- unsigned long flags;
-
- mutex_lock(&cache_mutex);
- /*
- * Hold the cache_lock as mbm timer handlers could be
- * scanning the list of events.
- */
- raw_spin_lock_irqsave(&cache_lock, flags);
-
- /*
- * If there's another event in this group...
- */
- if (!list_empty(&event->hw.cqm_group_entry)) {
- group_other = list_first_entry(&event->hw.cqm_group_entry,
- struct perf_event,
- hw.cqm_group_entry);
- list_del(&event->hw.cqm_group_entry);
- }
-
- /*
- * And we're the group leader..
- */
- if (cqm_group_leader(event)) {
- /*
- * If there was a group_other, make that leader, otherwise
- * destroy the group and return the RMID.
- */
- if (group_other) {
- list_replace(&event->hw.cqm_groups_entry,
- &group_other->hw.cqm_groups_entry);
- } else {
- u32 rmid = event->hw.cqm_rmid;
-
- if (__rmid_valid(rmid))
- __put_rmid(rmid);
- list_del(&event->hw.cqm_groups_entry);
- }
- }
-
- raw_spin_unlock_irqrestore(&cache_lock, flags);
-
- /*
- * Stop the mbm overflow timers when the last event is destroyed.
- */
- if (mbm_enabled && list_empty(&cache_groups))
- mbm_stop_timers();
-
- mutex_unlock(&cache_mutex);
-}
-
-static int intel_cqm_event_init(struct perf_event *event)
-{
- struct perf_event *group = NULL;
- bool rotate = false;
- unsigned long flags;
-
- if (event->attr.type != intel_cqm_pmu.type)
- return -ENOENT;
-
- if ((event->attr.config < QOS_L3_OCCUP_EVENT_ID) ||
- (event->attr.config > QOS_MBM_LOCAL_EVENT_ID))
- return -EINVAL;
-
- if ((is_cqm_event(event->attr.config) && !cqm_enabled) ||
- (is_mbm_event(event->attr.config) && !mbm_enabled))
- return -EINVAL;
-
- /* unsupported modes and filters */
- if (event->attr.exclude_user ||
- event->attr.exclude_kernel ||
- event->attr.exclude_hv ||
- event->attr.exclude_idle ||
- event->attr.exclude_host ||
- event->attr.exclude_guest ||
- event->attr.sample_period) /* no sampling */
- return -EINVAL;
-
- INIT_LIST_HEAD(&event->hw.cqm_group_entry);
- INIT_LIST_HEAD(&event->hw.cqm_groups_entry);
-
- event->destroy = intel_cqm_event_destroy;
-
- mutex_lock(&cache_mutex);
-
- /*
- * Start the mbm overflow timers when the first event is created.
- */
- if (mbm_enabled && list_empty(&cache_groups))
- mbm_start_timers();
-
- /* Will also set rmid */
- intel_cqm_setup_event(event, &group);
-
- /*
- * Hold the cache_lock as mbm timer handlers be
- * scanning the list of events.
- */
- raw_spin_lock_irqsave(&cache_lock, flags);
-
- if (group) {
- list_add_tail(&event->hw.cqm_group_entry,
- &group->hw.cqm_group_entry);
- } else {
- list_add_tail(&event->hw.cqm_groups_entry,
- &cache_groups);
-
- /*
- * All RMIDs are either in use or have recently been
- * used. Kick the rotation worker to clean/free some.
- *
- * We only do this for the group leader, rather than for
- * every event in a group to save on needless work.
- */
- if (!__rmid_valid(event->hw.cqm_rmid))
- rotate = true;
- }
-
- raw_spin_unlock_irqrestore(&cache_lock, flags);
- mutex_unlock(&cache_mutex);
-
- if (rotate)
- schedule_delayed_work(&intel_cqm_rmid_work, 0);
-
- return 0;
-}
-
-EVENT_ATTR_STR(llc_occupancy, intel_cqm_llc, "event=0x01");
-EVENT_ATTR_STR(llc_occupancy.per-pkg, intel_cqm_llc_pkg, "1");
-EVENT_ATTR_STR(llc_occupancy.unit, intel_cqm_llc_unit, "Bytes");
-EVENT_ATTR_STR(llc_occupancy.scale, intel_cqm_llc_scale, NULL);
-EVENT_ATTR_STR(llc_occupancy.snapshot, intel_cqm_llc_snapshot, "1");
-
-EVENT_ATTR_STR(total_bytes, intel_cqm_total_bytes, "event=0x02");
-EVENT_ATTR_STR(total_bytes.per-pkg, intel_cqm_total_bytes_pkg, "1");
-EVENT_ATTR_STR(total_bytes.unit, intel_cqm_total_bytes_unit, "MB");
-EVENT_ATTR_STR(total_bytes.scale, intel_cqm_total_bytes_scale, "1e-6");
-
-EVENT_ATTR_STR(local_bytes, intel_cqm_local_bytes, "event=0x03");
-EVENT_ATTR_STR(local_bytes.per-pkg, intel_cqm_local_bytes_pkg, "1");
-EVENT_ATTR_STR(local_bytes.unit, intel_cqm_local_bytes_unit, "MB");
-EVENT_ATTR_STR(local_bytes.scale, intel_cqm_local_bytes_scale, "1e-6");
-
-static struct attribute *intel_cqm_events_attr[] = {
- EVENT_PTR(intel_cqm_llc),
- EVENT_PTR(intel_cqm_llc_pkg),
- EVENT_PTR(intel_cqm_llc_unit),
- EVENT_PTR(intel_cqm_llc_scale),
- EVENT_PTR(intel_cqm_llc_snapshot),
- NULL,
-};
-
-static struct attribute *intel_mbm_events_attr[] = {
- EVENT_PTR(intel_cqm_total_bytes),
- EVENT_PTR(intel_cqm_local_bytes),
- EVENT_PTR(intel_cqm_total_bytes_pkg),
- EVENT_PTR(intel_cqm_local_bytes_pkg),
- EVENT_PTR(intel_cqm_total_bytes_unit),
- EVENT_PTR(intel_cqm_local_bytes_unit),
- EVENT_PTR(intel_cqm_total_bytes_scale),
- EVENT_PTR(intel_cqm_local_bytes_scale),
- NULL,
-};
-
-static struct attribute *intel_cmt_mbm_events_attr[] = {
- EVENT_PTR(intel_cqm_llc),
- EVENT_PTR(intel_cqm_total_bytes),
- EVENT_PTR(intel_cqm_local_bytes),
- EVENT_PTR(intel_cqm_llc_pkg),
- EVENT_PTR(intel_cqm_total_bytes_pkg),
- EVENT_PTR(intel_cqm_local_bytes_pkg),
- EVENT_PTR(intel_cqm_llc_unit),
- EVENT_PTR(intel_cqm_total_bytes_unit),
- EVENT_PTR(intel_cqm_local_bytes_unit),
- EVENT_PTR(intel_cqm_llc_scale),
- EVENT_PTR(intel_cqm_total_bytes_scale),
- EVENT_PTR(intel_cqm_local_bytes_scale),
- EVENT_PTR(intel_cqm_llc_snapshot),
- NULL,
-};
-
-static struct attribute_group intel_cqm_events_group = {
- .name = "events",
- .attrs = NULL,
-};
-
-PMU_FORMAT_ATTR(event, "config:0-7");
-static struct attribute *intel_cqm_formats_attr[] = {
- &format_attr_event.attr,
- NULL,
-};
-
-static struct attribute_group intel_cqm_format_group = {
- .name = "format",
- .attrs = intel_cqm_formats_attr,
-};
-
-static ssize_t
-max_recycle_threshold_show(struct device *dev, struct device_attribute *attr,
- char *page)
-{
- ssize_t rv;
-
- mutex_lock(&cache_mutex);
- rv = snprintf(page, PAGE_SIZE-1, "%u\n", __intel_cqm_max_threshold);
- mutex_unlock(&cache_mutex);
-
- return rv;
-}
-
-static ssize_t
-max_recycle_threshold_store(struct device *dev,
- struct device_attribute *attr,
- const char *buf, size_t count)
-{
- unsigned int bytes, cachelines;
- int ret;
-
- ret = kstrtouint(buf, 0, &bytes);
- if (ret)
- return ret;
-
- mutex_lock(&cache_mutex);
-
- __intel_cqm_max_threshold = bytes;
- cachelines = bytes / cqm_l3_scale;
-
- /*
- * The new maximum takes effect immediately.
- */
- if (__intel_cqm_threshold > cachelines)
- __intel_cqm_threshold = cachelines;
-
- mutex_unlock(&cache_mutex);
-
- return count;
-}
-
-static DEVICE_ATTR_RW(max_recycle_threshold);
-
-static struct attribute *intel_cqm_attrs[] = {
- &dev_attr_max_recycle_threshold.attr,
- NULL,
-};
-
-static const struct attribute_group intel_cqm_group = {
- .attrs = intel_cqm_attrs,
-};
-
-static const struct attribute_group *intel_cqm_attr_groups[] = {
- &intel_cqm_events_group,
- &intel_cqm_format_group,
- &intel_cqm_group,
- NULL,
-};
-
-static struct pmu intel_cqm_pmu = {
- .hrtimer_interval_ms = RMID_DEFAULT_QUEUE_TIME,
- .attr_groups = intel_cqm_attr_groups,
- .task_ctx_nr = perf_sw_context,
- .event_init = intel_cqm_event_init,
- .add = intel_cqm_event_add,
- .del = intel_cqm_event_stop,
- .start = intel_cqm_event_start,
- .stop = intel_cqm_event_stop,
- .read = intel_cqm_event_read,
- .count = intel_cqm_event_count,
-};
-
-static inline void cqm_pick_event_reader(int cpu)
-{
- int reader;
-
- /* First online cpu in package becomes the reader */
- reader = cpumask_any_and(&cqm_cpumask, topology_core_cpumask(cpu));
- if (reader >= nr_cpu_ids)
- cpumask_set_cpu(cpu, &cqm_cpumask);
-}
-
-static int intel_cqm_cpu_starting(unsigned int cpu)
-{
- struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
- struct cpuinfo_x86 *c = &cpu_data(cpu);
-
- state->rmid = 0;
- state->closid = 0;
- state->rmid_usecnt = 0;
-
- WARN_ON(c->x86_cache_max_rmid != cqm_max_rmid);
- WARN_ON(c->x86_cache_occ_scale != cqm_l3_scale);
-
- cqm_pick_event_reader(cpu);
- return 0;
-}
-
-static int intel_cqm_cpu_exit(unsigned int cpu)
-{
- int target;
-
- /* Is @cpu the current cqm reader for this package ? */
- if (!cpumask_test_and_clear_cpu(cpu, &cqm_cpumask))
- return 0;
-
- /* Find another online reader in this package */
- target = cpumask_any_but(topology_core_cpumask(cpu), cpu);
-
- if (target < nr_cpu_ids)
- cpumask_set_cpu(target, &cqm_cpumask);
-
- return 0;
-}
-
-static const struct x86_cpu_id intel_cqm_match[] = {
- { .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_OCCUP_LLC },
- {}
-};
-
-static void mbm_cleanup(void)
-{
- if (!mbm_enabled)
- return;
-
- kfree(mbm_local);
- kfree(mbm_total);
- mbm_enabled = false;
-}
-
-static const struct x86_cpu_id intel_mbm_local_match[] = {
- { .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_LOCAL },
- {}
-};
-
-static const struct x86_cpu_id intel_mbm_total_match[] = {
- { .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_TOTAL },
- {}
-};
-
-static int intel_mbm_init(void)
-{
- int ret = 0, array_size, maxid = cqm_max_rmid + 1;
-
- mbm_socket_max = topology_max_packages();
- array_size = sizeof(struct sample) * maxid * mbm_socket_max;
- mbm_local = kmalloc(array_size, GFP_KERNEL);
- if (!mbm_local)
- return -ENOMEM;
-
- mbm_total = kmalloc(array_size, GFP_KERNEL);
- if (!mbm_total) {
- ret = -ENOMEM;
- goto out;
- }
-
- array_size = sizeof(struct hrtimer) * mbm_socket_max;
- mbm_timers = kmalloc(array_size, GFP_KERNEL);
- if (!mbm_timers) {
- ret = -ENOMEM;
- goto out;
- }
- mbm_hrtimer_init();
-
-out:
- if (ret)
- mbm_cleanup();
-
- return ret;
-}
-
-static int __init intel_cqm_init(void)
-{
- char *str = NULL, scale[20];
- int cpu, ret;
-
- if (x86_match_cpu(intel_cqm_match))
- cqm_enabled = true;
-
- if (x86_match_cpu(intel_mbm_local_match) &&
- x86_match_cpu(intel_mbm_total_match))
- mbm_enabled = true;
-
- if (!cqm_enabled && !mbm_enabled)
- return -ENODEV;
-
- cqm_l3_scale = boot_cpu_data.x86_cache_occ_scale;
-
- /*
- * It's possible that not all resources support the same number
- * of RMIDs. Instead of making scheduling much more complicated
- * (where we have to match a task's RMID to a cpu that supports
- * that many RMIDs) just find the minimum RMIDs supported across
- * all cpus.
- *
- * Also, check that the scales match on all cpus.
- */
- get_online_cpus();
- for_each_online_cpu(cpu) {
- struct cpuinfo_x86 *c = &cpu_data(cpu);
-
- if (c->x86_cache_max_rmid < cqm_max_rmid)
- cqm_max_rmid = c->x86_cache_max_rmid;
-
- if (c->x86_cache_occ_scale != cqm_l3_scale) {
- pr_err("Multiple LLC scale values, disabling\n");
- ret = -EINVAL;
- goto out;
- }
- }
-
- /*
- * A reasonable upper limit on the max threshold is the number
- * of lines tagged per RMID if all RMIDs have the same number of
- * lines tagged in the LLC.
- *
- * For a 35MB LLC and 56 RMIDs, this is ~1.8% of the LLC.
- */
- __intel_cqm_max_threshold =
- boot_cpu_data.x86_cache_size * 1024 / (cqm_max_rmid + 1);
-
- snprintf(scale, sizeof(scale), "%u", cqm_l3_scale);
- str = kstrdup(scale, GFP_KERNEL);
- if (!str) {
- ret = -ENOMEM;
- goto out;
- }
-
- event_attr_intel_cqm_llc_scale.event_str = str;
-
- ret = intel_cqm_setup_rmid_cache();
- if (ret)
- goto out;
-
- if (mbm_enabled)
- ret = intel_mbm_init();
- if (ret && !cqm_enabled)
- goto out;
-
- if (cqm_enabled && mbm_enabled)
- intel_cqm_events_group.attrs = intel_cmt_mbm_events_attr;
- else if (!cqm_enabled && mbm_enabled)
- intel_cqm_events_group.attrs = intel_mbm_events_attr;
- else if (cqm_enabled && !mbm_enabled)
- intel_cqm_events_group.attrs = intel_cqm_events_attr;
-
- ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
- if (ret) {
- pr_err("Intel CQM perf registration failed: %d\n", ret);
- goto out;
- }
-
- if (cqm_enabled)
- pr_info("Intel CQM monitoring enabled\n");
- if (mbm_enabled)
- pr_info("Intel MBM enabled\n");
-
- /*
- * Setup the hot cpu notifier once we are sure cqm
- * is enabled to avoid notifier leak.
- */
- cpuhp_setup_state(CPUHP_AP_PERF_X86_CQM_STARTING,
- "perf/x86/cqm:starting",
- intel_cqm_cpu_starting, NULL);
- cpuhp_setup_state(CPUHP_AP_PERF_X86_CQM_ONLINE, "perf/x86/cqm:online",
- NULL, intel_cqm_cpu_exit);
-
-out:
- put_online_cpus();
-
- if (ret) {
- kfree(str);
- cqm_cleanup();
- mbm_cleanup();
- }
-
- return ret;
-}
-device_initcall(intel_cqm_init);
diff --git a/arch/x86/include/asm/intel_rdt_common.h b/arch/x86/include/asm/intel_rdt_common.h
index b31081b..c953218 100644
--- a/arch/x86/include/asm/intel_rdt_common.h
+++ b/arch/x86/include/asm/intel_rdt_common.h
@@ -7,7 +7,6 @@
* struct intel_pqr_state - State cache for the PQR MSR
* @rmid: The cached Resource Monitoring ID
* @closid: The cached Class Of Service ID
- * @rmid_usecnt: The usage counter for rmid
*
* The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
* lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
@@ -19,7 +18,6 @@
struct intel_pqr_state {
u32 rmid;
u32 closid;
- int rmid_usecnt;
};

DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 5b36646..989a997 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -41,6 +41,14 @@
DEFINE_PER_CPU_READ_MOSTLY(int, cpu_closid);

/*
+ * The cached intel_pqr_state is strictly per CPU and can never be
+ * updated from a remote CPU. Functions which modify the state
+ * are called with interrupts disabled and no preemption, which
+ * is sufficient for the protection.
+ */
+DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+
+/*
* Used to store the max resource name width and max resource data width
* to display the schemata in a tabular format
*/
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 24a6358..7043d65 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -139,14 +139,6 @@ struct hw_perf_event {
/* for tp_event->class */
struct list_head tp_list;
};
- struct { /* intel_cqm */
- int cqm_state;
- u32 cqm_rmid;
- int is_group_event;
- struct list_head cqm_events_entry;
- struct list_head cqm_groups_entry;
- struct list_head cqm_group_entry;
- };
struct { /* itrace */
int itrace_started;
};
@@ -417,11 +409,6 @@ struct pmu {


/*
- * Return the count value for a counter.
- */
- u64 (*count) (struct perf_event *event); /*optional*/
-
- /*
* Set up pmu-private data structures for an AUX area
*/
void *(*setup_aux) (int cpu, void **pages,
@@ -1109,11 +1096,6 @@ static inline void perf_event_task_sched_out(struct task_struct *prev,
__perf_event_task_sched_out(prev, next);
}

-static inline u64 __perf_event_count(struct perf_event *event)
-{
- return local64_read(&event->count) + atomic64_read(&event->child_count);
-}
-
extern void perf_event_mmap(struct vm_area_struct *vma);
extern struct perf_guest_info_callbacks *perf_guest_cbs;
extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6e75a5c..8492fb1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3622,10 +3622,7 @@ static void __perf_event_read(void *info)

static inline u64 perf_event_count(struct perf_event *event)
{
- if (event->pmu->count)
- return event->pmu->count(event);
-
- return __perf_event_count(event);
+ return local64_read(&event->count) + atomic64_read(&event->child_count);
}

/*
@@ -3662,12 +3659,6 @@ u64 perf_event_read_local(struct perf_event *event)
WARN_ON_ONCE(event->attr.inherit);

/*
- * It must not have a pmu::count method, those are not
- * NMI safe.
- */
- WARN_ON_ONCE(event->pmu->count);
-
- /*
* If the event is currently on this CPU, its either a per-task event,
* or local to this CPU. Furthermore it means its ACTIVE (otherwise
* oncpu == -1).
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 460a031..662ebcd 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -253,7 +253,7 @@ const struct bpf_func_proto *bpf_get_trace_printk_proto(void)
return -EINVAL;

/* make sure event is local and doesn't have pmu::count */
- if (unlikely(event->oncpu != cpu || event->pmu->count))
+ if (unlikely(event->oncpu != cpu))
return -EINVAL;

/*
--
1.9.1

2017-06-26 18:59:22

by Shivappa Vikas

Subject: [PATCH 07/21] x86/intel_rdt/cqm: Add RDT monitoring initialization

Add common data structures for RDT resource monitoring and perform the
related data structure initializations, which include setting up the
RMID (Resource Monitoring ID) lists and the list of monitoring events
that each resource supports.

[tony: some cleanup to make adding MBM easier later, remove "cqm"
from some names, make some data structures local to intel_rdt_monitor.c
static. Add copyright header]

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/Makefile | 2 +-
arch/x86/kernel/cpu/intel_rdt.c | 38 +++++---
arch/x86/kernel/cpu/intel_rdt.h | 39 ++++++++
arch/x86/kernel/cpu/intel_rdt_monitor.c | 161 ++++++++++++++++++++++++++++++++
4 files changed, 227 insertions(+), 13 deletions(-)
create mode 100644 arch/x86/kernel/cpu/intel_rdt_monitor.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index a576121..81b0060 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -32,7 +32,7 @@ obj-$(CONFIG_CPU_SUP_CENTAUR) += centaur.o
obj-$(CONFIG_CPU_SUP_TRANSMETA_32) += transmeta.o
obj-$(CONFIG_CPU_SUP_UMC_32) += umc.o

-obj-$(CONFIG_INTEL_RDT) += intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_schemata.o
+obj-$(CONFIG_INTEL_RDT) += intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_schemata.o intel_rdt_monitor.o

obj-$(CONFIG_X86_MCE) += mcheck/
obj-$(CONFIG_MTRR) += mtrr/
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 59500f9..121eb14 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -55,6 +55,12 @@
*/
int max_name_width, max_data_width;

+/*
+ * Global boolean for rdt_alloc which is true if any
+ * resource allocation is enabled.
+ */
+bool rdt_alloc_enabled;
+
static void
mba_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r);
static void
@@ -230,7 +236,7 @@ static bool rdt_get_mem_config(struct rdt_resource *r)
return true;
}

-static void rdt_get_cache_config(int idx, struct rdt_resource *r)
+static void rdt_get_cache_alloc_config(int idx, struct rdt_resource *r)
{
union cpuid_0x10_1_eax eax;
union cpuid_0x10_x_edx edx;
@@ -422,7 +428,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)

d->id = id;

- if (domain_setup_ctrlval(r, d)) {
+ if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
kfree(d);
return;
}
@@ -513,34 +519,39 @@ static __init void rdt_init_padding(void)

static __init bool get_rdt_resources(void)
{
- bool ret = false;
-
if (cache_alloc_hsw_probe())
- return true;
+ rdt_alloc_enabled = true;

- if (!boot_cpu_has(X86_FEATURE_RDT_A))
+ if ((!rdt_alloc_enabled && !boot_cpu_has(X86_FEATURE_RDT_A)) &&
+ !boot_cpu_has(X86_FEATURE_CQM))
return false;

+ if (boot_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
+ rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
+
+ if (rdt_mon_features)
+ rdt_get_mon_l3_config(&rdt_resources_all[RDT_RESOURCE_L3]);
+
if (boot_cpu_has(X86_FEATURE_CAT_L3)) {
- rdt_get_cache_config(1, &rdt_resources_all[RDT_RESOURCE_L3]);
+ rdt_get_cache_alloc_config(1, &rdt_resources_all[RDT_RESOURCE_L3]);
if (boot_cpu_has(X86_FEATURE_CDP_L3)) {
rdt_get_cdp_l3_config(RDT_RESOURCE_L3DATA);
rdt_get_cdp_l3_config(RDT_RESOURCE_L3CODE);
}
- ret = true;
+ rdt_alloc_enabled = true;
}
if (boot_cpu_has(X86_FEATURE_CAT_L2)) {
/* CPUID 0x10.2 fields are same format at 0x10.1 */
- rdt_get_cache_config(2, &rdt_resources_all[RDT_RESOURCE_L2]);
- ret = true;
+ rdt_get_cache_alloc_config(2, &rdt_resources_all[RDT_RESOURCE_L2]);
+ rdt_alloc_enabled = true;
}

if (boot_cpu_has(X86_FEATURE_MBA)) {
if (rdt_get_mem_config(&rdt_resources_all[RDT_RESOURCE_MBA]))
- ret = true;
+ rdt_alloc_enabled = true;
}

- return ret;
+ return (rdt_mon_features || rdt_alloc_enabled);
}

static int __init intel_rdt_late_init(void)
@@ -568,6 +579,9 @@ static int __init intel_rdt_late_init(void)
for_each_alloc_capable_rdt_resource(r)
pr_info("Intel RDT %s allocation detected\n", r->name);

+ for_each_mon_capable_rdt_resource(r)
+ pr_info("Intel RDT %s monitoring detected\n", r->name);
+
return 0;
}

diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index 29630af..285f106 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -12,6 +12,29 @@

#define L3_QOS_CDP_ENABLE 0x01ULL

+/*
+ * Event IDs are used to program IA32_QM_EVTSEL before reading event
+ * counter from IA32_QM_CTR
+ */
+#define QOS_L3_OCCUP_EVENT_ID 0x01
+#define QOS_L3_MBM_TOTAL_EVENT_ID 0x02
+#define QOS_L3_MBM_LOCAL_EVENT_ID 0x03
+
+/**
+ * struct mon_evt - Entry in the event list of a resource
+ * @evtid: event id
+ * @name: name of the event
+ */
+struct mon_evt {
+ u32 evtid;
+ char *name;
+ struct list_head list;
+};
+
+extern unsigned int intel_cqm_threshold;
+extern bool rdt_alloc_enabled;
+extern int rdt_mon_features;
+
/**
* struct rdtgroup - store rdtgroup's data in resctrl file system.
* @kn: kernfs node
@@ -136,7 +159,9 @@ struct rdt_membw {
/**
* struct rdt_resource - attributes of an RDT resource
* @alloc_enabled: Is allocation enabled on this machine
+ * @mon_enabled: Is monitoring enabled for this feature
* @alloc_capable: Is allocation available on this machine
+ * @mon_capable: Is monitor feature available on this machine
* @name: Name to use in "schemata" file
* @num_closid: Number of CLOSIDs available
* @cache_level: Which cache level defines scope of this resource
@@ -150,10 +175,15 @@ struct rdt_membw {
* @nr_info_files: Number of info files
* @format_str: Per resource format string to show domain value
* @parse_ctrlval: Per resource function pointer to parse control values
+ * @evt_list: List of monitoring events
+ * @num_rmid: Number of RMIDs available
+ * @mon_scale: cqm counter * mon_scale = occupancy in bytes
*/
struct rdt_resource {
bool alloc_enabled;
+ bool mon_enabled;
bool alloc_capable;
+ bool mon_capable;
char *name;
int num_closid;
int cache_level;
@@ -170,6 +200,9 @@ struct rdt_resource {
const char *format_str;
int (*parse_ctrlval) (char *buf, struct rdt_resource *r,
struct rdt_domain *d);
+ struct list_head evt_list;
+ int num_rmid;
+ unsigned int mon_scale;
};

void rdt_get_cache_infofile(struct rdt_resource *r);
@@ -201,6 +234,11 @@ enum {
r++) \
if (r->alloc_capable)

+#define for_each_mon_capable_rdt_resource(r) \
+ for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
+ r++) \
+ if (r->mon_capable)
+
#define for_each_alloc_enabled_rdt_resource(r) \
for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
r++) \
@@ -239,5 +277,6 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
int rdtgroup_schemata_show(struct kernfs_open_file *of,
struct seq_file *s, void *v);
+void rdt_get_mon_l3_config(struct rdt_resource *r);

#endif /* _ASM_X86_INTEL_RDT_H */
diff --git a/arch/x86/kernel/cpu/intel_rdt_monitor.c b/arch/x86/kernel/cpu/intel_rdt_monitor.c
new file mode 100644
index 0000000..a418854
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt_monitor.c
@@ -0,0 +1,161 @@
+/*
+ * Resource Director Technology(RDT)
+ * - Monitoring code
+ *
+ * Copyright (C) 2017 Intel Corporation
+ *
+ * Author:
+ * Vikas Shivappa <[email protected]>
+ *
+ * This replaces the perf based cqm.c, but we reuse a lot of the
+ * code and data structures originally from Peter Zijlstra and Matt Fleming.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * More information about RDT can be found in the Intel (R) x86 Architecture
+ * Software Developer Manual June 2016, volume 3, section 17.17.
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <asm/cpu_device_id.h>
+#include "intel_rdt.h"
+
+enum rmid_recycle_state {
+ RMID_CHECK = 0,
+ RMID_DIRTY,
+};
+
+struct rmid_entry {
+ u32 rmid;
+ enum rmid_recycle_state state;
+ struct list_head list;
+};
+
+/**
+ * @rmid_free_lru A least recently used list of free RMIDs
+ * These RMIDs are guaranteed to have an occupancy less than the
+ * threshold occupancy
+ */
+static struct list_head rmid_free_lru;
+
+/**
+ * @rmid_limbo_lru list of currently unused but (potentially)
+ * dirty RMIDs.
+ * This list contains RMIDs that no one is currently using but that
+ * may have an occupancy value > intel_cqm_threshold. The user can change
+ * the threshold occupancy value.
+ */
+static struct list_head rmid_limbo_lru;
+
+/**
+ * @rmid_entry - The entry in the limbo and free lists.
+ */
+static struct rmid_entry *rmid_ptrs;
+
+/*
+ * Global bitmask of supported RDT monitoring events; non-zero if
+ * any resource monitoring is enabled.
+ */
+int rdt_mon_features;
+
+/*
+ * This is the threshold cache occupancy at which we will consider an
+ * RMID available for re-allocation.
+ */
+unsigned int intel_cqm_threshold;
+
+static inline struct rmid_entry *__rmid_entry(u32 rmid)
+{
+ struct rmid_entry *entry;
+
+ entry = &rmid_ptrs[rmid];
+ WARN_ON(entry->rmid != rmid);
+
+ return entry;
+}
+
+static int dom_data_init(struct rdt_resource *r)
+{
+ struct rmid_entry *entry = NULL;
+ int i = 0, nr_rmids;
+
+ INIT_LIST_HEAD(&rmid_free_lru);
+ INIT_LIST_HEAD(&rmid_limbo_lru);
+
+ nr_rmids = r->num_rmid;
+ rmid_ptrs = kcalloc(nr_rmids, sizeof(struct rmid_entry), GFP_KERNEL);
+ if (!rmid_ptrs)
+ return -ENOMEM;
+
+ for (; i < nr_rmids; i++) {
+ entry = &rmid_ptrs[i];
+ INIT_LIST_HEAD(&entry->list);
+
+ entry->rmid = i;
+ list_add_tail(&entry->list, &rmid_free_lru);
+ }
+
+ /*
+ * RMID 0 is special and is always allocated. It's used for all
+ * tasks that are not monitored.
+ */
+ entry = __rmid_entry(0);
+ list_del(&entry->list);
+
+ return 0;
+}
+
+static struct mon_evt llc_occupancy_event = {
+ .name = "llc_occupancy",
+ .evtid = QOS_L3_OCCUP_EVENT_ID,
+};
+
+static void l3_mon_evt_init(struct rdt_resource *r)
+{
+ INIT_LIST_HEAD(&r->evt_list);
+
+ if (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID))
+ list_add_tail(&llc_occupancy_event.list, &r->evt_list);
+}
+
+void rdt_get_mon_l3_config(struct rdt_resource *r)
+{
+ int ret;
+
+ r->mon_scale = boot_cpu_data.x86_cache_occ_scale;
+ r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+
+ /*
+ * A reasonable upper limit on the max threshold is the number
+ * of lines tagged per RMID if all RMIDs have the same number of
+ * lines tagged in the LLC.
+ *
+ * For a 35MB LLC and 56 RMIDs, this is ~1.8% of the LLC.
+ */
+ intel_cqm_threshold = boot_cpu_data.x86_cache_size * 1024 / r->num_rmid;
+
+ /* h/w works in units of "boot_cpu_data.x86_cache_occ_scale" */
+ intel_cqm_threshold /= r->mon_scale;
+
+ ret = dom_data_init(r);
+ if (ret)
+ goto out;
+
+ l3_mon_evt_init(r);
+
+ r->mon_capable = true;
+ r->mon_enabled = true;
+
+ return;
+out:
+ kfree(rmid_ptrs);
+ rdt_mon_features = 0;
+}
--
1.9.1

2017-06-26 18:59:33

by Shivappa Vikas

Subject: [PATCH 03/21] x86/intel_rdt/cqm: Documentation for resctrl based RDT Monitoring

Add a description of the resctrl based RDT (Resource Director Technology)
monitoring extension and its usage.

[Tony: Added descriptions for how monitoring and allocation are measured
and some cleanups]
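The usage this document describes can be sketched as a shell session (a hypothetical transcript: it assumes a CQM-capable CPU, root privileges, the resctrl filesystem from this series, and an example task id of 1234):

```shell
# Mount resctrl; monitoring files appear only if the CPU supports monitoring.
mount -t resctrl resctrl /sys/fs/resctrl
cd /sys/fs/resctrl

# Upper bound on how many CTRL_MON + MON groups can be created.
cat info/L3_MON/num_rmids

# Create a MON group under the root CTRL_MON group and add a task to it.
mkdir mon_groups/g1
echo 1234 > mon_groups/g1/tasks

# Read the current LLC occupancy (in bytes) for the group on L3 domain 0.
cat mon_groups/g1/mon_data/mon_L3_00/llc_occupancy
```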

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
Documentation/x86/intel_rdt_ui.txt | 316 ++++++++++++++++++++++++++++++++-----
1 file changed, 278 insertions(+), 38 deletions(-)

diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
index c491a1b..76f21e2 100644
--- a/Documentation/x86/intel_rdt_ui.txt
+++ b/Documentation/x86/intel_rdt_ui.txt
@@ -6,8 +6,8 @@ Fenghua Yu <[email protected]>
Tony Luck <[email protected]>
Vikas Shivappa <[email protected]>

-This feature is enabled by the CONFIG_INTEL_RDT_A Kconfig and the
-X86 /proc/cpuinfo flag bits "rdt", "cat_l3" and "cdp_l3".
+This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the
+X86 /proc/cpuinfo flag bits "rdt", "cqm", "cat_l3" and "cdp_l3".

To use the feature mount the file system:

@@ -17,6 +17,13 @@ mount options are:

"cdp": Enable code/data prioritization in L3 cache allocations.

+RDT features are orthogonal. A particular system may support only
+monitoring, only control, or both monitoring and control.
+
+The mount succeeds if either of allocation or monitoring is present, but
+only those files and directories supported by the system will be created.
+For more details on the behavior of the interface during monitoring
+and allocation, see the "Resource alloc and monitor groups" section.

Info directory
--------------
@@ -24,7 +31,12 @@ Info directory
The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.
-Cache resource(L3/L2) subdirectory contains the following files:
+
+Each subdirectory contains the following files with respect to
+allocation:
+
+Cache resource(L3/L2) subdirectory contains the following files
+related to allocation:

"num_closids": The number of CLOSIDs which are valid for this
resource. The kernel uses the smallest number of
@@ -36,7 +48,8 @@ Cache resource(L3/L2) subdirectory contains the following files:
"min_cbm_bits": The minimum number of consecutive bits which
must be set when writing a mask.

-Memory bandwitdh(MB) subdirectory contains the following files:
+Memory bandwidth(MB) subdirectory contains the following files
+with respect to allocation:

"min_bandwidth": The minimum memory bandwidth percentage which
user can request.
@@ -52,48 +65,152 @@ Memory bandwitdh(MB) subdirectory contains the following files:
non-linear. This field is purely informational
only.

-Resource groups
----------------
+If RDT monitoring is available there will be an "L3_MON" directory
+with the following files:
+
+"num_rmids": The number of RMIDs available. This is the
+ upper bound for how many "CTRL_MON" + "MON"
+ groups can be created.
+
+"mon_features": Lists the monitoring events if
+ monitoring is enabled for the resource.
+
+"max_threshold_occupancy":
+ Read/write file provides the largest value (in
+ bytes) at which a previously used LLC_occupancy
+ counter can be considered for re-use.
+
+
+Resource alloc and monitor groups
+---------------------------------
+
Resource groups are represented as directories in the resctrl file
-system. The default group is the root directory. Other groups may be
-created as desired by the system administrator using the "mkdir(1)"
-command, and removed using "rmdir(1)".
+system. The default group is the root directory which, immediately
+after mounting, owns all the tasks and cpus in the system and can make
+full use of all resources.
+
+On a system with RDT control features additional directories can be
+created in the root directory that specify different amounts of each
+resource (see "schemata" below). The root and these additional top level
+directories are referred to as "CTRL_MON" groups below.
+
+On a system with RDT monitoring the root directory and other top level
+directories contain a directory named "mon_groups" in which additional
+directories can be created to monitor subsets of tasks in the CTRL_MON
+group that is their ancestor. These are called "MON" groups in the rest
+of this document.
+
+Removing a directory will move all tasks and cpus owned by the group it
+represents to the parent. Removing one of the created CTRL_MON groups
+will automatically remove all MON groups below it.
+
+All groups contain the following files:
+
+"tasks":
+ Reading this file shows the list of all tasks that belong to
+ this group. Writing a task id to the file will add a task to the
+ group. If the group is a CTRL_MON group the task is removed from
+ whichever previous CTRL_MON group owned the task and also from
+ any MON group that owned the task. If the group is a MON group,
+ then the task must already belong to the CTRL_MON parent of this
+ group. The task is removed from any previous MON group.
+
+
+"cpus":
+ Reading this file shows a bitmask of the logical CPUs owned by
+ this group. Writing a mask to this file will add and remove
+ CPUs to/from this group. As with the tasks file a hierarchy is
+ maintained where MON groups may only include CPUs owned by the
+ parent CTRL_MON group.
+
+
+"cpus_list":
+ Just like "cpus", only using ranges of CPUs instead of bitmasks.

-There are three files associated with each group:

-"tasks": A list of tasks that belongs to this group. Tasks can be
- added to a group by writing the task ID to the "tasks" file
- (which will automatically remove them from the previous
- group to which they belonged). New tasks created by fork(2)
- and clone(2) are added to the same group as their parent.
- If a pid is not in any sub partition, it is in root partition
- (i.e. default partition).
+When control is enabled all CTRL_MON groups will also contain:

-"cpus": A bitmask of logical CPUs assigned to this group. Writing
- a new mask can add/remove CPUs from this group. Added CPUs
- are removed from their previous group. Removed ones are
- given to the default (root) group. You cannot remove CPUs
- from the default group.
+"schemata":
+ A list of all the resources available to this group.
+ Each resource has its own line and format - see below for details.

-"cpus_list": One or more CPU ranges of logical CPUs assigned to this
- group. Same rules apply like for the "cpus" file.
+When monitoring is enabled all MON groups will also contain:

-"schemata": A list of all the resources available to this group.
- Each resource has its own line and format - see below for
- details.
+"mon_data":
+ This contains a set of files organized by L3 domain and by
+ RDT event. E.g. on a system with two L3 domains there will
+ be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
+ directories have one file per event (e.g. "llc_occupancy",
+ "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
+ files provide a read out of the current value of the event for
+ all tasks in the group. In CTRL_MON groups these files provide
+ the sum for all tasks in the CTRL_MON group and all tasks in
+ MON groups. Please see the example section for more details on usage.

-When a task is running the following rules define which resources
-are available to it:
+Resource allocation rules
+-------------------------
+When a task is running the following rules define which resources are
+available to it:

1) If the task is a member of a non-default group, then the schemata
-for that group is used.
+ for that group is used.

2) Else if the task belongs to the default group, but is running on a
-CPU that is assigned to some specific group, then the schemata for
-the CPU's group is used.
+ CPU that is assigned to some specific group, then the schemata for the
+ CPU's group is used.

3) Otherwise the schemata for the default group is used.

+Resource monitoring rules
+-------------------------
+1) If a task is a member of a MON group, or non-default CTRL_MON group
+ then RDT events for the task will be reported in that group.
+
+2) If a task is a member of the default CTRL_MON group, but is running
+ on a CPU that is assigned to some specific group, then the RDT events
+ for the task will be reported in that group.
+
+3) Otherwise RDT events for the task will be reported in the root level
+ "mon_data" group.
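
These three rules can be sketched as a tiny decision helper; the
function and the group names are hypothetical, purely to illustrate the
precedence:

```shell
# Hypothetical sketch of the reporting rules above. task_group is the
# group the task was assigned to ("default" if none), cpu_group the
# group owning the CPU it runs on ("default" if none).
report_group() {
    local task_group=$1 cpu_group=$2
    if [ "$task_group" != "default" ]; then
        echo "$task_group"       # rule 1: the task's own group
    elif [ "$cpu_group" != "default" ]; then
        echo "$cpu_group"        # rule 2: the CPU's group
    else
        echo "root mon_data"     # rule 3: root level "mon_data"
    fi
}
```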
+
+
+Notes on cache occupancy monitoring and control
+-----------------------------------------------
+When moving a task from one group to another you should remember that
+this only affects *new* cache allocations by the task. E.g. you may have
+a task in a monitor group showing 3 MB of cache occupancy. If you move
+to a new group and immediately check the occupancy of the old and new
+groups you will likely see that the old group is still showing 3 MB and
+the new group zero. When the task accesses locations still in cache from
+before the move, the h/w does not update any counters. On a busy system
+you will likely see the occupancy in the old group go down as cache lines
+are evicted and re-used while the occupancy in the new group rises as
+the task accesses memory and loads into the cache are counted based on
+membership in the new group.
+
+The same applies to cache allocation control. Moving a task to a group
+with a smaller cache partition will not evict any cache lines. The
+process may continue to use them from the old partition.
+
+Hardware uses a CLOSID (Class of service ID) and an RMID (Resource
+monitoring ID) to identify a control group and a monitoring group
+respectively. Each resource group is mapped to these IDs based on the
+kind of group. The number of CLOSIDs and RMIDs is limited by the
+hardware, hence creation of a "CTRL_MON" directory may fail if we run
+out of either CLOSIDs or RMIDs, and creation of a "MON" group may fail
+if we run out of RMIDs.
+
+max_threshold_occupancy - generic concepts
+------------------------------------------
+
+Note that an RMID, once freed, may not be immediately available for use
+as it is still tagged to the cache lines of its previous user. Hence
+such RMIDs are placed on a limbo list and only moved back to the free
+list once their cache occupancy has gone down. If at some point the
+system has many limbo RMIDs, none of which are yet ready to be used,
+the user may see an -EBUSY during mkdir.
+
+max_threshold_occupancy is a user configurable value to determine the
+occupancy at which an RMID can be freed.
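
As a sketch of how the default threshold can be derived (assuming the
cache_size / num_rmid heuristic used in the patches; the LLC size, RMID
count and scale factor below are hypothetical example values):

```shell
# Hypothetical example: the default threshold for a 35 MB LLC with
# 56 RMIDs, where the hardware counts occupancy in 64-byte units.
cache_size_kb=$((35 * 1024))   # 35 MB LLC, in KB
num_rmid=56
mon_scale=64                   # bytes per hardware counter unit

# Upper limit: LLC size divided evenly among all RMIDs
threshold_bytes=$((cache_size_kb * 1024 / num_rmid))
# Hardware works in units of mon_scale bytes
threshold_units=$((threshold_bytes / mon_scale))

echo "$threshold_bytes bytes, $threshold_units counter units"
```

For these numbers the threshold comes out to 655360 bytes, roughly 1.8%
of the LLC.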

Schemata files - general concepts
---------------------------------
@@ -143,22 +260,22 @@ SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core will result in both threads being throttled to use the
low bandwidth.

-L3 details (code and data prioritization disabled)
---------------------------------------------------
+L3 schemata file details (code and data prioritization disabled)
+----------------------------------------------------------------
With CDP disabled the L3 schemata format is:

L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

-L3 details (CDP enabled via mount option to resctrl)
-----------------------------------------------------
+L3 schemata file details (CDP enabled via mount option to resctrl)
+------------------------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this:

L3data:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
L3code:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

-L2 details
-----------
+L2 schemata file details
+------------------------
L2 cache does not support code and data prioritization, so the
schemata format is always:

@@ -185,6 +302,8 @@ L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff

+Examples for RDT allocation usage:
+
Example 1
---------
On a two socket machine (one L3 cache per socket) with just four bits
@@ -410,3 +529,124 @@ void main(void)
/* code to read and write directory contents */
resctrl_release_lock(fd);
}
+
+Examples for RDT Monitoring along with allocation usage:
+
+Reading monitored data
+----------------------
+Reading an event file (for example: mon_data/mon_L3_00/llc_occupancy)
+shows the current snapshot of LLC occupancy of the corresponding MON
+group or CTRL_MON group.
+
+
+Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
+---------
+On a two socket machine (one L3 cache per socket) with just four bits
+for cache bit masks
+
+# mount -t resctrl resctrl /sys/fs/resctrl
+# cd /sys/fs/resctrl
+# mkdir p0 p1
+# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
+# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
+# echo 5678 > p1/tasks
+# echo 5679 > p1/tasks
+
+The default resource group is unmodified, so we have access to all parts
+of all caches (its schemata file reads "L3:0=f;1=f").
+
+Tasks that are under the control of group "p0" may only allocate from the
+"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
+Tasks in group "p1" use the "lower" 50% of cache on both sockets.
+
+Create monitor groups and assign a subset of tasks to each monitor group.
+
+# cd /sys/fs/resctrl/p1/mon_groups
+# mkdir m11 m12
+# echo 5678 > m11/tasks
+# echo 5679 > m12/tasks
+
+Fetch the data (shown in bytes)
+
+# cat m11/mon_data/mon_L3_00/llc_occupancy
+16234000
+# cat m11/mon_data/mon_L3_01/llc_occupancy
+14789000
+# cat m12/mon_data/mon_L3_00/llc_occupancy
+16789000
+
+The parent CTRL_MON group shows the aggregated data.
+
+# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
+31234000
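
A small hypothetical helper can total the per-domain values for one
group, mirroring the aggregation described above:

```shell
# Hypothetical helper: sum llc_occupancy across all L3 domains of a
# resource group directory (assumes the mon_data layout shown above).
total_occupancy() {
    local dir=$1 total=0 f
    for f in "$dir"/mon_data/mon_L3_*/llc_occupancy; do
        total=$((total + $(cat "$f")))
    done
    echo "$total"
}
```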
+
+Example 2 (Monitor a task from its creation)
+---------
+On a two socket machine (one L3 cache per socket)
+
+# mount -t resctrl resctrl /sys/fs/resctrl
+# cd /sys/fs/resctrl
+# mkdir p0 p1
+
+An RMID is allocated to the group once it is created, hence the <cmd>
+below is monitored from its creation.
+
+# echo $$ > /sys/fs/resctrl/p1/tasks
+# <cmd>
+
+Fetch the data
+
+# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
+31789000
+
+Example 3 (Monitor without CAT support or before creating CAT groups)
+---------
+
+Assume a system like HSW that has only CQM and no CAT support. In this
+case resctrl will still mount but cannot create CTRL_MON directories.
+However, the user can create different MON groups within the root group
+and thereby monitor all tasks including kernel threads.
+
+This can also be used to profile a job's cache size footprint before
+allocating it to one of the allocation groups.
+
+# mount -t resctrl resctrl /sys/fs/resctrl
+# cd /sys/fs/resctrl
+# mkdir mon_groups/m01
+# mkdir mon_groups/m02
+
+# echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
+# echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks
+
+Monitor the groups separately and also get per domain data. From the
+data below it is apparent that the tasks are mostly doing work on
+domain (socket) 0.
+
+# cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy
+31234000
+# cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy
+34555
+# cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy
+31234000
+# cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy
+32789
+
+
+Example 4 (Monitor real time tasks)
+-----------------------------------
+
+Consider a single socket system with real time tasks running on cores
+4-7 and non real time tasks on the other CPUs. We want to monitor the
+cache occupancy of the real time threads on these cores.
+
+# mount -t resctrl resctrl /sys/fs/resctrl
+# cd /sys/fs/resctrl
+# mkdir p1
+
+Move the cpus 4-7 over to p1
+# echo f0 > p1/cpus
+
+View the llc occupancy snapshot
+
+# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
+11234000
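
The "f0" mask above selects CPUs 4-7; a hypothetical helper shows how
such a mask can be built from a CPU range:

```shell
# Hypothetical helper: build the hex bitmask for a contiguous CPU
# range, e.g. CPUs 4-7 -> f0, as written to the "cpus" file above.
range_to_mask() {
    local first=$1 last=$2 mask=0 cpu
    for cpu in $(seq "$first" "$last"); do
        mask=$((mask | (1 << cpu)))
    done
    printf '%x\n' "$mask"
}
```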
--
1.9.1

Subject: [tip:x86/urgent] x86/intel_rdt: Fix memory leak on mount failure

Commit-ID: 79298acc4ba097e9ab78644e3e38902d73547c92
Gitweb: http://git.kernel.org/tip/79298acc4ba097e9ab78644e3e38902d73547c92
Author: Vikas Shivappa <[email protected]>
AuthorDate: Mon, 26 Jun 2017 11:55:49 -0700
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 30 Jun 2017 21:20:00 +0200

x86/intel_rdt: Fix memory leak on mount failure

If mount fails, the kn_info directory is not freed causing memory leak.

Add the missing error handling path.

Fixes: 4e978d06dedb ("x86/intel_rdt: Add "info" files to resctrl file system")
Signed-off-by: Vikas Shivappa <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]

---
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index f5af0cc..9257bd9 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -856,11 +856,13 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
dentry = kernfs_mount(fs_type, flags, rdt_root,
RDTGROUP_SUPER_MAGIC, NULL);
if (IS_ERR(dentry))
- goto out_cdp;
+ goto out_destroy;

static_branch_enable(&rdt_enable_key);
goto out;

+out_destroy:
+ kernfs_remove(kn_info);
out_cdp:
cdp_disable();
out:

2017-07-02 09:14:18

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 07/21] x86/intel_rdt/cqm: Add RDT monitoring initialization

On Mon, 26 Jun 2017, Vikas Shivappa wrote:
> +/*
> + * Global boolean for rdt_alloc which is true if any
> + * resource allocation is enabled.
> + */
> +bool rdt_alloc_enabled;

That should be rdt_alloc_capable. It's not enabled at probe time. Probing
merely detects the capability. That mirrors the capable/enabled bits in the
rdt resource struct.

> static void
> mba_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r);
> static void
> @@ -230,7 +236,7 @@ static bool rdt_get_mem_config(struct rdt_resource *r)
> return true;
> }
>
> -static void rdt_get_cache_config(int idx, struct rdt_resource *r)
> +static void rdt_get_cache_alloc_config(int idx, struct rdt_resource *r)
> {
> union cpuid_0x10_1_eax eax;
> union cpuid_0x10_x_edx edx;
> @@ -422,7 +428,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>
> d->id = id;
>
> - if (domain_setup_ctrlval(r, d)) {
> + if (r->alloc_capable && domain_setup_ctrlval(r, d)) {

This should be done in the name space cleanup patch or in a separate one.

> kfree(d);
> return;
> }
> @@ -513,34 +519,39 @@ static __init void rdt_init_padding(void)
>
> static __init bool get_rdt_resources(void)
> {
> - bool ret = false;
> -
> if (cache_alloc_hsw_probe())
> - return true;
> + rdt_alloc_enabled = true;
>
> - if (!boot_cpu_has(X86_FEATURE_RDT_A))
> + if ((!rdt_alloc_enabled && !boot_cpu_has(X86_FEATURE_RDT_A)) &&
> + !boot_cpu_has(X86_FEATURE_CQM))
> return false;
>
> + if (boot_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
> + rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);

Instead of artificially cramming the CQM bits into this function, it
would be cleaner to leave that function alone, rename it to

get_rdt_alloc_resources()

and have a new function

get_rdt_mon_resources()

and handle the aggregation at the call site.

rdt_alloc_capable = get_rdt_alloc_resources();
rdt_mon_capable = get_rdt_mon_resources();

if (!rdt_alloc_capable && !rdt_mon_capable)
return -ENODEV;

I'd make both variables boolean and have rdt_mon_features as a separate
one, which carries the actual available feature bits. This is neither
hotpath nor are we in a situation where we need to spare the last 4byte of
memory. Clean separation of code and functionality is more important.

> +/*
> + * Event IDs are used to program IA32_QM_EVTSEL before reading event
> + * counter from IA32_QM_CTR
> + */
> +#define QOS_L3_OCCUP_EVENT_ID 0x01
> +#define QOS_L3_MBM_TOTAL_EVENT_ID 0x02
> +#define QOS_L3_MBM_LOCAL_EVENT_ID 0x03
> +
> +/**
> + * struct mon_evt - Entry in the event list of a resource
> + * @evtid: event id
> + * @name: name of the event
> + */
> +struct mon_evt {
> + u32 evtid;
> + char *name;
> + struct list_head list;
> +};
> +
> +extern unsigned int intel_cqm_threshold;
> +extern bool rdt_alloc_enabled;
> +extern int rdt_mon_features;

Please do not use 'int' for variables which contain bit flags. unsigned int
is the proper choice here.

> +struct rmid_entry {
> + u32 rmid;
> + enum rmid_recycle_state state;
> + struct list_head list;

Please make it tabular as you did with mon_evt and other structs.

> +};
> +
> +/**
> + * @rmid_free_lru A least recently used list of free RMIDs
> + * These RMIDs are guaranteed to have an occupancy less than the
> + * threshold occupancy
> + */
> +static struct list_head rmid_free_lru;
> +
> +/**
> + * @rmid_limbo_lru list of currently unused but (potentially)
> + * dirty RMIDs.
> + * This list contains RMIDs that no one is currently using but that
> + * may have a occupancy value > intel_cqm_threshold. User can change
> + * the threshold occupancy value.
> + */
> +static struct list_head rmid_limbo_lru;
> +
> +/**
> + * @rmid_entry - The entry in the limbo and free lists.
> + */
> +static struct rmid_entry *rmid_ptrs;
> +
> +/*
> + * Global boolean for rdt_monitor which is true if any

Boolean !?!?!

> + * resource monitoring is enabled.
> + */
> +int rdt_mon_features;
> +
> +/*
> + * This is the threshold cache occupancy at which we will consider an
> + * RMID available for re-allocation.
> + */
> +unsigned int intel_cqm_threshold;
> +
> +static inline struct rmid_entry *__rmid_entry(u32 rmid)
> +{
> + struct rmid_entry *entry;
> +
> + entry = &rmid_ptrs[rmid];
> + WARN_ON(entry->rmid != rmid);
> +
> + return entry;
> +}
> +
> +static int dom_data_init(struct rdt_resource *r)
> +{
> + struct rmid_entry *entry = NULL;
> + int i = 0, nr_rmids;
> +
> + INIT_LIST_HEAD(&rmid_free_lru);
> + INIT_LIST_HEAD(&rmid_limbo_lru);

You can spare that by declaring the list head with

static LIST_HEAD(rmid_xxx_lru);

> +
> + nr_rmids = r->num_rmid;
> + rmid_ptrs = kcalloc(nr_rmids, sizeof(struct rmid_entry), GFP_KERNEL);
> + if (!rmid_ptrs)
> + return -ENOMEM;
> +
> + for (; i < nr_rmids; i++) {

Please initialize i in the for() construct. It's really bad to read,
because the missing initialization statement makes one look for a special
initialization magic just to figure out that it's simply i = 0.

> + entry = &rmid_ptrs[i];
> + INIT_LIST_HEAD(&entry->list);
> +
> + entry->rmid = i;
> + list_add_tail(&entry->list, &rmid_free_lru);
> + }
> +
> + /*
> + * RMID 0 is special and is always allocated. It's used for all
> + * tasks that are not monitored.
> + */
> + entry = __rmid_entry(0);
> + list_del(&entry->list);
> +
> + return 0;
> +}
> +
> +static struct mon_evt llc_occupancy_event = {
> + .name = "llc_occupancy",
> + .evtid = QOS_L3_OCCUP_EVENT_ID,

Tabluar...

> +};
> +
> +static void l3_mon_evt_init(struct rdt_resource *r)
> +{
> + INIT_LIST_HEAD(&r->evt_list);
> +
> + if (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID))
> + list_add_tail(&llc_occupancy_event.list, &r->evt_list);

What's that list for? Why don't you have that event as a member of the L3
rdt resource and control it via r->mon_capable/enabled?

> +}
> +
> +void rdt_get_mon_l3_config(struct rdt_resource *r)
> +{
> + int ret;
> +
> + r->mon_scale = boot_cpu_data.x86_cache_occ_scale;
> + r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
> +
> + /*
> + * A reasonable upper limit on the max threshold is the number
> + * of lines tagged per RMID if all RMIDs have the same number of
> + * lines tagged in the LLC.
> + *
> + * For a 35MB LLC and 56 RMIDs, this is ~1.8% of the LLC.
> + */
> + intel_cqm_threshold = boot_cpu_data.x86_cache_size * 1024 / r->num_rmid;
> +
> + /* h/w works in units of "boot_cpu_data.x86_cache_occ_scale" */
> + intel_cqm_threshold /= r->mon_scale;
> +
> + ret = dom_data_init(r);
> + if (ret)
> + goto out;
> +
> + l3_mon_evt_init(r);
> +
> + r->mon_capable = true;
> + r->mon_enabled = true;
> +
> + return;
> +out:
> + kfree(rmid_ptrs);
> + rdt_mon_features = 0;

This is silly. if dom_data_init() fails, then it failed because it was
unable to allocate rmid_ptrs. .....

Also clearing rdt_mon_features here is conceptually wrong. Make that
function return int, i.e. the failure value, and clear rdt_mon_capable at
the call site in case of error.

Thanks,

tglx




2017-07-02 10:05:57

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 08/21] x86/intel_rdt/cqm: Add RMID(Resource monitoring ID) management

On Mon, 26 Jun 2017, Vikas Shivappa wrote:
> +static u64 __rmid_read(u32 rmid, u32 eventid)
> +{
> + u64 val;
> +
> + wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> + rdmsrl(MSR_IA32_QM_CTR, val);

The calling convention of this function needs to be documented. It's
obvious that it needs to be serialized ....

> +
> + /*
> + * Aside from the ERROR and UNAVAIL bits, the return value is the
> + * count for this @eventid tagged with @rmid.
> + */
> + return val;
> +}
> +
> +/*
> + * Test whether an RMID is dirty(occupancy > threshold_occupancy)
> + */
> +static void intel_cqm_stable(void *arg)
> +{
> + struct rmid_entry *entry;
> + u64 val;
> +
> + /*
> + * Since we are in the IPI already lets mark all the RMIDs
> + * that are dirty

This comment is crap. It suggests: Let's do it while we are here anyway.

But that's not true. The IPI is issued solely to figure out which RMIDs are
dirty.

> + */
> + list_for_each_entry(entry, &rmid_limbo_lru, list) {

Since this is executed on multiple CPUs, that needs an explanation why that
list is safe to iterate w/o explicit protection here.

> + val = __rmid_read(entry->rmid, QOS_L3_OCCUP_EVENT_ID);
> + if (val > intel_cqm_threshold)
> + entry->state = RMID_DIRTY;
> + }
> +}
> +
> +/*
> + * Scan the limbo list and move all entries that are below the
> + * intel_cqm_threshold to the free list.
> + * Return "true" if the limbo list is empty, "false" if there are
> + * still some RMIDs there.
> + */
> +static bool try_freeing_limbo_rmid(void)
> +{
> + struct rmid_entry *entry, *tmp;
> + struct rdt_resource *r;
> + cpumask_var_t cpu_mask;
> + struct rdt_domain *d;
> + bool ret = true;
> +
> + if (list_empty(&rmid_limbo_lru))
> + return ret;
> +
> + if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
> + return false;
> +
> + r = &rdt_resources_all[RDT_RESOURCE_L3];
> +
> + list_for_each_entry(d, &r->domains, list)
> + cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
> +
> + /*
> + * Test whether an RMID is free for each package.

That wants a bit of explanation at some place why RMIDs have global
scope. That's a pure implementation decision because from a hardware POV
RMIDs have package scope. We could use the same RMID on different packages
for different purposes.

> + */
> + on_each_cpu_mask(cpu_mask, intel_cqm_stable, NULL, true);
> +
> + list_for_each_entry_safe(entry, tmp, &rmid_limbo_lru, list) {
> + /*
> + * Ignore the RMIDs that are marked dirty and reset the
> + * state to check for being dirty again later.

Ignore? -EMAKESNOSENSE

> + */
> + if (entry->state == RMID_DIRTY) {
> + entry->state = RMID_CHECK;
> + ret = false;
> + continue;
> + }
> + list_del(&entry->list);
> + list_add_tail(&entry->list, &rmid_free_lru);
> + }
> +
> + free_cpumask_var(cpu_mask);

...

> +void free_rmid(u32 rmid)
> +{
> + struct rmid_entry *entry;
> +
> + lockdep_assert_held(&rdtgroup_mutex);
> +
> + WARN_ON(!rmid);
> + entry = __rmid_entry(rmid);
> +
> + entry->state = RMID_CHECK;
> +
> + if (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID))
> + list_add_tail(&entry->list, &rmid_limbo_lru);
> + else
> + list_add_tail(&entry->list, &rmid_free_lru);

Thinking a bit more about that limbo mechanics.

In case that a RMID was never used on a particular package, the state check
forces an IPI on all packages unconditionally. That's suboptimal at least.

We know on which package a given RMID was used, so we could restrict the
checks to exactly these packages, but I'm not sure it's worth the
trouble. We might at least document that and explain why this is
implemented in that way.

Thanks,

tglx




2017-07-02 10:09:55

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 09/21] x86/intel_rdt: Simplify info and base file lists

On Mon, 26 Jun 2017, Vikas Shivappa wrote:
> @@ -82,6 +82,7 @@ struct rdt_resource rdt_resources_all[] = {
> },
> .parse_ctrlval = parse_cbm,
> .format_str = "%d=%0*x",
> + .fflags = RFTYPE_RES_CACHE,
> },

Can you please convert that array to use explicit array member
initializers? I've noticed this back when I reviewed the intial RDT
implementation, but it somehow escaped. i.e.:

[RESOURCE_ID] =
{
.....
}

Thanks,

tglx

2017-07-02 10:59:02

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 11/21] x86/intel_rdt/cqm: Add mkdir support for RDT monitoring

On Mon, 26 Jun 2017, Vikas Shivappa wrote:
> +/*
> + * Common code for ctrl_mon and monitor group mkdir.
> + * The caller needs to unlock the global mutex upon success.
> + */
> +static int mkdir_rdt_common(struct kernfs_node *pkn, struct kernfs_node *prkn,

pkn and prkn are horrible to distinguish. What's wrong with keeping
*parent_kn and have *kn as the new thing?

> + const char *name, umode_t mode,
> + enum rdt_group_type rtype, struct rdtgroup **r)
> {

Can you please split out that mkdir_rdt_common() change into a separate
patch? It can be done as a preparatory stand alone change just for the
existing rdt group code. Then the monitoring add ons come on top of it.

> - struct rdtgroup *parent, *rdtgrp;
> + struct rdtgroup *prgrp, *rdtgrp;
> struct kernfs_node *kn;
> - int ret, closid;
> -
> - /* Only allow mkdir in the root directory */
> - if (parent_kn != rdtgroup_default.kn)
> - return -EPERM;
> -
> - /* Do not accept '\n' to avoid unparsable situation. */
> - if (strchr(name, '\n'))
> - return -EINVAL;
> + uint fshift = 0;
> + int ret;
>
> - parent = rdtgroup_kn_lock_live(parent_kn);
> - if (!parent) {
> + prgrp = rdtgroup_kn_lock_live(prkn);
> + if (!prgrp) {
> ret = -ENODEV;
> goto out_unlock;
> }
>
> - ret = closid_alloc();
> - if (ret < 0)
> - goto out_unlock;
> - closid = ret;
> -
> /* allocate the rdtgroup. */
> rdtgrp = kzalloc(sizeof(*rdtgrp), GFP_KERNEL);
> if (!rdtgrp) {
> ret = -ENOSPC;
> - goto out_closid_free;
> + goto out_unlock;
> }
> - rdtgrp->closid = closid;
> - list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups);
> + *r = rdtgrp;
> + rdtgrp->parent = prgrp;
> + rdtgrp->type = rtype;
> + INIT_LIST_HEAD(&rdtgrp->crdtgrp_list);
>
> /* kernfs creates the directory for rdtgrp */
> - kn = kernfs_create_dir(parent->kn, name, mode, rdtgrp);
> + kn = kernfs_create_dir(pkn, name, mode, rdtgrp);
> if (IS_ERR(kn)) {
> ret = PTR_ERR(kn);
> goto out_cancel_ref;
> @@ -1138,27 +1166,138 @@ static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
> if (ret)
> goto out_destroy;
>
> - ret = rdtgroup_add_files(kn, RF_CTRL_BASE);
> + fshift = 1 << (RF_CTRLSHIFT + rtype);
> + ret = rdtgroup_add_files(kn, RFTYPE_BASE | fshift);


I'd rather make this:

files = RFTYPE_BASE | (1U << (RF_CTRLSHIFT + rtype));
ret = rdtgroup_add_files(kn, files);

> if (ret)
> goto out_destroy;
>
> + if (rdt_mon_features) {
> + ret = alloc_rmid();
> + if (ret < 0)
> + return ret;
> +
> + rdtgrp->rmid = ret;
> + }
> kernfs_activate(kn);
>
> - ret = 0;
> - goto out_unlock;

What unlocks prkn now? The caller, right? Please add a comment ...

> + return 0;
>
> out_destroy:
> kernfs_remove(rdtgrp->kn);
> out_cancel_ref:
> - list_del(&rdtgrp->rdtgroup_list);
> kfree(rdtgrp);
> -out_closid_free:
> +out_unlock:
> + rdtgroup_kn_unlock(prkn);
> + return ret;
> +}
> +
> +static void mkdir_rdt_common_clean(struct rdtgroup *rgrp)
> +{
> + kernfs_remove(rgrp->kn);
> + if (rgrp->rmid)
> + free_rmid(rgrp->rmid);

Please put that conditonal into free_rmid().

> + kfree(rgrp);
> +}

> +static int rdtgroup_mkdir(struct kernfs_node *pkn, const char *name,
> + umode_t mode)
> +{
> + /* Do not accept '\n' to avoid unparsable situation. */
> + if (strchr(name, '\n'))
> + return -EINVAL;
> +
> + /*
> + * We don't allow rdtgroup ctrl_mon directories to be created anywhere
> + * except the root directory and dont allow rdtgroup monitor
> + * directories to be created anywhere execept inside mon_groups
> + * directory.
> + */
> + if (rdt_alloc_enabled && pkn == rdtgroup_default.kn)
> + return rdtgroup_mkdir_ctrl_mon(pkn, pkn, name, mode);
> + else if (rdt_mon_features &&
> + !strcmp(pkn->name, "mon_groups"))
> + return rdtgroup_mkdir_mon(pkn, pkn->parent, name, mode);
> + else
> + return -EPERM;

TBH, this is really convoluted (including the comment).

/*
* If the parent directory is the root directory and RDT
* allocation is supported, add a control and monitoring
* subdirectory.
*/
if (rdt_alloc_capable && parent_kn == rdtgroup_default.kn)
return rdtgroup_mkdir_ctrl_mon(...);

/*
* If the parent directory is a monitoring group and RDT
* monitoring is supported, add a monitoring subdirectory.
*/
if (rdt_mon_capable && is_mon_group(parent_kn))
return rdtgroup_mkdir_mon(...);

return -EPERM;

Note, that I did not use strcmp(parent_kn->name) because that's simply
not sufficient. What prevents a user from doing:

# mkdir /sys/fs/resctrl/mon_group/mon_group
# mkdir /sys/fs/resctrl/mon_group/mon_group/foo

You need a better way to distinguish that than strcmp(). You probably want
to prevent creating subdirectories named "mon_group" as well.

Thanks,

tglx


2017-07-02 11:01:37

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 12/21] x86/intel_rdt/cqm: Add tasks file support

On Mon, 26 Jun 2017, Vikas Shivappa wrote:
> @@ -866,6 +866,7 @@ struct task_struct {
> #endif
> #ifdef CONFIG_INTEL_RDT
> int closid;
> + u32 rmid;

Can you please make a preparatory change which makes closid an u32 as well?
We should have done that in the first place, but in hindsight we are always
smarter...

Thanks,

tglx

2017-07-02 11:12:04

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 13/21] x86/intel_rdt/cqm: Add cpus file support

On Mon, 26 Jun 2017, Vikas Shivappa wrote:
> diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
> index fdf3654..fec8ba9 100644
> --- a/arch/x86/kernel/cpu/intel_rdt.h
> +++ b/arch/x86/kernel/cpu/intel_rdt.h
> @@ -37,6 +37,8 @@ struct mon_evt {
> extern bool rdt_alloc_enabled;
> extern int rdt_mon_features;
>
> +DECLARE_PER_CPU_READ_MOSTLY(int, cpu_rmid);

u32

>
> +DEFINE_PER_CPU_READ_MOSTLY(int, cpu_rmid);
> static inline struct rmid_entry *__rmid_entry(u32 rmid)

Bah. Please add a new line between the DEFINE... and the function.

But that whole thing is wrong. The per cpu default closid and rmid want to
be in a single place, not in two distinct per cpu variables.

struct rdt_cpu_default {
u32 rmid;
u32 closid;
};

DEFINE_PER_CPU_READ_MOSTLY(struct rdt_cpu_default, rdt_cpu_default);

or something like this. That way it's guaranteed that the context switch
code touches a single cache line for the per cpu defaults.

Thanks,

tglx

2017-07-02 12:29:48

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 13/21] x86/intel_rdt/cqm: Add cpus file support

On Mon, 26 Jun 2017, Vikas Shivappa wrote:
> -static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
> - char *buf, size_t nbytes, loff_t off)
> +static ssize_t cpus_mon_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes,
> + struct rdtgroup *rdtgrp)

Again. Please make the split of rdtgroup_cpus_write() as a seperate
preparatory change first and just move the guts of the existing write
function out into cpus_ctrl_write() and then add the mon_write stuff as an
extra patch.

> {
> + struct rdtgroup *pr = rdtgrp->parent, *cr;

*pr and *cr really suck.

> cpumask_var_t tmpmask, newmask;
> - struct rdtgroup *rdtgrp, *r;
> + struct list_head *llist;
> int ret;
>
> - if (!buf)
> - return -EINVAL;
> -
> if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
> return -ENOMEM;
> if (!zalloc_cpumask_var(&newmask, GFP_KERNEL)) {
> @@ -233,10 +235,89 @@ static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
> return -ENOMEM;
> }
>
> - rdtgrp = rdtgroup_kn_lock_live(of->kn);
> - if (!rdtgrp) {
> - ret = -ENOENT;
> - goto unlock;
> + if (is_cpu_list(of))
> + ret = cpulist_parse(buf, newmask);
> + else
> + ret = cpumask_parse(buf, newmask);

The cpuask allocation and parsing of the user buffer can be done in the
common code. No point in duplicating that.

> +
> + if (ret)
> + goto out;
> +
> + /* check that user didn't specify any offline cpus */
> + cpumask_andnot(tmpmask, newmask, cpu_online_mask);
> + if (cpumask_weight(tmpmask)) {
> + ret = -EINVAL;
> + goto out;
> + }

Common code.

> + /* Check whether cpus belong to parent ctrl group */
> + cpumask_andnot(tmpmask, newmask, &pr->cpu_mask);
> + if (cpumask_weight(tmpmask)) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + /* Check whether cpus are dropped from this group */
> + cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask);
> + if (cpumask_weight(tmpmask)) {
> + /* Give any dropped cpus to parent rdtgroup */
> + cpumask_or(&pr->cpu_mask, &pr->cpu_mask, tmpmask);

This does not make any sense. The check above verifies that all cpus in
newmask belong to the parent->cpu_mask. If they don't then you return
-EINVAL, but here you give them back to parent->cpu_mask. How is that
supposed to work? You never get into this code path!

So you need a seperate mask in the parent rdtgroup to store the CPUs which
are valid in any monitoring group which belongs to it. So the logic
becomes:

/*
* Check whether the CPU mask is a subset of the CPUs
* which belong to the parent group.
*/
cpumask_andnot(tmpmask, newmask, parent->cpus_valid_mask);
if (cpumask_weight(tmpmask))
return -EINVAL;

When CAT is not available, then parent->cpus_valid_mask is a pointer to
cpu_online_mask. When CAT is enabled, then parent->cpus_valid_mask is a
pointer to the CAT group cpu mask.

> + update_closid_rmid(tmpmask, pr);
> + }
> +
> + /*
> + * If we added cpus, remove them from previous group that owned them
> + * and update per-cpu rmid
> + */
> + cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask);
> + if (cpumask_weight(tmpmask)) {
> + llist = &pr->crdtgrp_list;

llist is a bad name. We have a facility llist, i.e. lockless list. head ?

> + list_for_each_entry(cr, llist, crdtgrp_list) {
> + if (cr == rdtgrp)
> + continue;
> + cpumask_andnot(&cr->cpu_mask, &cr->cpu_mask, tmpmask);
> + }
> + update_closid_rmid(tmpmask, rdtgrp);
> + }

> +static void cpumask_rdtgrp_clear(struct rdtgroup *r, struct cpumask *m)
> +{
> + struct rdtgroup *cr;
> +
> + cpumask_andnot(&r->cpu_mask, &r->cpu_mask, m);
> + /* update the child mon group masks as well*/
> + list_for_each_entry(cr, &r->crdtgrp_list, crdtgrp_list)
> + cpumask_and(&cr->cpu_mask, &r->cpu_mask, &cr->cpu_mask);

That's equally wrong. See above.

Thanks,

tglx

2017-07-02 12:44:01

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 14/21] x86/intel_rdt/cqm: Add mon_data

On Mon, 26 Jun 2017, Vikas Shivappa wrote:

> Add a mon_data directory for the root rdtgroup and all other rdtgroups.
> The directory holds all of the monitored data for all domains and events
> of all resources being monitored.

Again. This does two things at once. Move the existing code to a new file
and add the monitoring stuff. Please split it apart.

> +static bool __mon_event_count(u32 rmid, struct rmid_read *rr)
> +{
> + u64 tval;
> +
> + tval = __rmid_read(rmid, rr->evtid);
> + if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) {
> + rr->val = tval;
> + return false;
> + }
> + switch (rr->evtid) {
> + case QOS_L3_OCCUP_EVENT_ID:
> + rr->val += tval;
> + return true;
> + default:
> + return false;

I have no idea what that return code means.

> + }
> +}
> +
> +void mon_event_count(void *info)

Some explanation why this is a void pointer and how that function is called
(I assume it's via IPI) would be appreciated.

> +{
> + struct rdtgroup *rdtgrp, *entry;
> + struct rmid_read *rr = info;
> + struct list_head *llist;

*head;

> +
> + rdtgrp = rr->rgrp;
> +
> + if (!__mon_event_count(rdtgrp->rmid, rr))
> + return;
> +
> + /*
> + * For Ctrl groups read data from child monitor groups.
> + */
> + llist = &rdtgrp->crdtgrp_list;
> +
> + if (rdtgrp->type == RDTCTRL_GROUP) {
> + list_for_each_entry(entry, llist, crdtgrp_list) {
> + if (!__mon_event_count(entry->rmid, rr))
> + return;
> + }
> + }
> +}

> +static int get_rdt_resourceid(struct rdt_resource *r)
> +{
> + if (r > (rdt_resources_all + RDT_NUM_RESOURCES - 1) ||
> + r < rdt_resources_all ||
> + ((r - rdt_resources_all) % sizeof(struct rdt_resource)))
> + return -EINVAL;

If that ever happens, then you have other problems than a wrong pointer.

> +
> + return ((r - rdt_resources_all) / sizeof(struct rdt_resource));

Moo. Can't you simply put an index field into struct rdt_resource,
initialize it with the resource ID and use that?
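Incidentally, a pointer difference in C is already in units of array elements, so the sizeof() division and modulo in the quoted code are off as well. The suggested index-field approach, as a hedged userspace sketch (field and array contents hypothetical):

```c
#include <assert.h>

/* Sketch: store the resource id in the struct at init time instead of
 * deriving it from pointer arithmetic. */
struct rdt_resource {
	int rid;		/* resource id, initialized once */
	const char *name;
};

static struct rdt_resource rdt_resources_all[] = {
	{ .rid = 0, .name = "L3" },
	{ .rid = 1, .name = "L2" },
};

static int get_rdt_resourceid(struct rdt_resource *r)
{
	return r->rid;		/* no range check or division needed */
}
```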

Thanks,

tglx

2017-07-02 13:16:33

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 15/21] x86/intel_rdt/cqm: Add rmdir support

On Mon, 26 Jun 2017, Vikas Shivappa wrote:

> Resource groups (ctrl_mon and monitor groups) are represented by
> directories in resctrl fs. Add support to remove the directories.

Again. Please split that patch into two parts; separate the ctrl stuff from rmdir and
then add monitoring support.

> + rdtgrp->flags = RDT_DELETED;
> + free_rmid(rdtgrp->rmid);
> +
> + /*
> + * Remove your rmid from the parent ctrl groups list

You are not removing an rmid. You remove the group from the parent's group
list. Please be more accurate with your comments. Wrong comments are worse
than no comments.

> + WARN_ON(list_empty(&prdtgrp->crdtgrp_list));
> + list_del(&rdtgrp->crdtgrp_list);

> +static int rdtgroup_rmdir_ctrl(struct kernfs_node *kn, struct rdtgroup *rdtgrp)
> +{
> + int cpu, closid = rdtgroup_default.closid;
> + struct rdtgroup *entry, *tmp;
> + struct list_head *llist;

*head please.

> + cpumask_var_t tmpmask;
> +
> + if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
> + return -ENOMEM;

Allocation/free can be done at the call site for both functions.

> +static int rdtgroup_rmdir(struct kernfs_node *kn)
> +{
> + struct kernfs_node *parent_kn = kn->parent;
> + struct rdtgroup *rdtgrp;
> + int ret = 0;
> +
> + rdtgrp = rdtgroup_kn_lock_live(kn);
> + if (!rdtgrp) {
> + ret = -EPERM;
> + goto out;
> + }
> +
> + if (rdtgrp->type == RDTCTRL_GROUP && parent_kn == rdtgroup_default.kn)
> + ret = rdtgroup_rmdir_ctrl(kn, rdtgrp);
> + else if (rdtgrp->type == RDTMON_GROUP &&
> + !strcmp(parent_kn->name, "mon_groups"))
> + ret = rdtgroup_rmdir_mon(kn, rdtgrp);
> + else
> + ret = -EPERM;

Like in the other patch, please make this parseable.

Thanks,

tglx

2017-07-02 13:22:34

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 16/21] x86/intel_rdt/cqm: Add mount,umount support

On Mon, 26 Jun 2017, Vikas Shivappa wrote:
>
> list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) {
> + /* Free any child rmids */
> + llist = &rdtgrp->crdtgrp_list;
> + list_for_each_entry_safe(sentry, stmp, llist, crdtgrp_list) {
> + free_rmid(sentry->rmid);
> + list_del(&sentry->crdtgrp_list);
> + kfree(sentry);
> + }

I'm pretty sure that I've seen exactly this code sequence already. Please
create a helper instead of copying stuff over and over.

Thanks,

tglx

2017-07-02 13:37:42

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 17/21] x86/intel_rdt/cqm: Add sched_in support

On Mon, 26 Jun 2017, Vikas Shivappa wrote:
> DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
> DECLARE_PER_CPU_READ_MOSTLY(int, cpu_closid);
> +DECLARE_PER_CPU_READ_MOSTLY(int, cpu_rmid);
> DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
> +DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
> +DECLARE_STATIC_KEY_FALSE(rdt_enable_key);

Please make this a two stage change. Add rdt_enable_key first and then the
monitoring stuff. Ideally you introduce rdt_enable_key here and in the
control code in one go.

> +static void __intel_rdt_sched_in(void)
> {
> - if (static_branch_likely(&rdt_alloc_enable_key)) {
> - struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> - int closid;
> + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> + u32 closid = 0;
> + u32 rmid = 0;
>
> + if (static_branch_likely(&rdt_alloc_enable_key)) {
> /*
> * If this task has a closid assigned, use it.
> * Else use the closid assigned to this cpu.
> @@ -55,14 +59,31 @@ static inline void intel_rdt_sched_in(void)
> closid = current->closid;
> if (closid == 0)
> closid = this_cpu_read(cpu_closid);
> + }
> +
> + if (static_branch_likely(&rdt_mon_enable_key)) {
> + /*
> + * If this task has a rmid assigned, use it.
> + * Else use the rmid assigned to this cpu.
> + */
> + rmid = current->rmid;
> + if (rmid == 0)
> + rmid = this_cpu_read(cpu_rmid);
> + }
>
> - if (closid != state->closid) {
> - state->closid = closid;
> - wrmsr(IA32_PQR_ASSOC, state->rmid, closid);
> - }
> + if (closid != state->closid || rmid != state->rmid) {
> + state->closid = closid;
> + state->rmid = rmid;
> + wrmsr(IA32_PQR_ASSOC, rmid, closid);

This can be written smarter.

struct intel_pqr_state newstate = this_cpu_read(rdt_cpu_default);
struct intel_pqr_state *curstate = this_cpu_ptr(&pqr_state);

if (static_branch_likely(&rdt_alloc_enable_key)) {
if (current->closid)
newstate.closid = current->closid;
}

if (static_branch_likely(&rdt_mon_enable_key)) {
if (current->rmid)
newstate.rmid = current->rmid;
}

if (newstate != *curstate) {
*curstate = newstate;
wrmsr(IA32_PQR_ASSOC, newstate.rmid, newstate.closid);
}

The unconditional read of rdt_cpu_default is the right thing to do because
the default behaviour is exactly this.
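Note that plain C cannot compare aggregates with !=, so the sketch above is pseudocode. A userspace approximation of the same logic using memcmp() (all names hypothetical; memcmp() is safe here only because two u32 members leave no padding):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct pqr_state {
	uint32_t rmid;
	uint32_t closid;
};

/*
 * Resolve the new state from the per-cpu default plus optional per-task
 * overrides. Returns 1 when the MSR would be written, i.e. when the
 * resolved state differs from what is currently programmed.
 */
static int resolve_state(struct pqr_state *cur, struct pqr_state def,
			 uint32_t task_closid, uint32_t task_rmid)
{
	struct pqr_state new = def;	/* unconditional default read */

	if (task_closid)
		new.closid = task_closid;	/* task override wins */
	if (task_rmid)
		new.rmid = task_rmid;

	if (memcmp(&new, cur, sizeof(new))) {
		*cur = new;		/* wrmsr(IA32_PQR_ASSOC, ...) here */
		return 1;
	}
	return 0;
}
```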

Thanks,

tglx



2017-07-02 13:46:11

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 19/21] x86/intel_rdt/mbm: Basic counting of MBM events (total and local)

On Mon, 26 Jun 2017, Vikas Shivappa wrote:
> +static struct mon_evt mbm_total_event = {
> + .name = "mbm_total_bytes",
> + .evtid = QOS_L3_MBM_TOTAL_EVENT_ID,
> +};
> +
> +static struct mon_evt mbm_local_event = {
> + .name = "mbm_local_bytes",
> + .evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
> +};
> +
> static void l3_mon_evt_init(struct rdt_resource *r)
> {
> INIT_LIST_HEAD(&r->evt_list);
>
> if (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID))
> list_add_tail(&llc_occupancy_event.list, &r->evt_list);
> + if (is_mbm_total_enabled())
> + list_add_tail(&mbm_total_event.list, &r->evt_list);
> + if (is_mbm_local_enabled())
> + list_add_tail(&mbm_local_event.list, &r->evt_list);

Confused. This hooks all monitoring features to RDT_RESOURCE_L3. Why?

Thanks,

tglx


2017-07-02 13:58:02

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 21/21] x86/intel_rdt/mbm: Handle counter overflow

On Mon, 26 Jun 2017, Vikas Shivappa wrote:
> +static void mbm_update(struct rdt_domain *d, int rmid)
> +{
> + struct rmid_read rr;
> +
> + rr.first = false;
> + rr.d = d;
> +
> + if (is_mbm_total_enabled()) {
> + rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
> + __mon_event_count(rmid, &rr);

This is broken as it is not protected against a concurrent read from user
space which comes in via a smp function call.

This means both the internal state and __rmid_read() are unprotected.

I'm not sure whether it's enough to disable interrupts around
__mon_event_count(), but that's the minimal protection required. It's
definitely good enough for __rmid_read(), but it might not be sufficient
for protecting domain->mbm_[local|total]. I leave the exercise of figuring
that out to you.

Thanks,

tglx

2017-07-03 09:55:51

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 08/21] x86/intel_rdt/cqm: Add RMID(Resource monitoring ID) management

On Sun, 2 Jul 2017, Thomas Gleixner wrote:
> Thinking a bit more about the limbo mechanics.
>
> In case that a RMID was never used on a particular package, the state check
> forces an IPI on all packages unconditionally. That's suboptimal at least.
>
> We know on which package a given RMID was used, so we could restrict the
> checks to exactly these packages, but I'm not sure it's worth the
> trouble. We might at least document that and explain why this is
> implemented in that way.

Second thoughts on that. The allocation logic is:

> + if (list_empty(&rmid_free_lru)) {
> + ret = try_freeing_limbo_rmid();
> + if (list_empty(&rmid_free_lru))
> + return ret ? -ENOSPC : -EBUSY;
> + }
> +
> + entry = list_first_entry(&rmid_free_lru,
> + struct rmid_entry, list);
> + list_del(&entry->list);
> +
> + return entry->rmid;

That means, the free list is used as the primary source. One of my boxes
has 143 RMIDs. So it only takes 142 mkdir/rmdir invocations to move all
RMIDs to the limbo list. On the next mkdir invocation the allocation goes
into the limbo path and the SMP function call has to walk the list with 142
entries on ALL online domains whether they used the RMID or not!

That's bad enough already and the number of RMIDs will not become smaller;
it doubled from HSW to BDW ...

The HPC and RT folks will love you for that - NOT!

So this needs to be solved differently.

Let's have a look at the context switch path first. That's the most
sensitive part of it.

if (static_branch_likely(&rdt_mon_enable_key)) {
if (current->rmid)
newstate.rmid = current->rmid;
}

That's optimized for the !monitoring case. So we can really penalize the
per task monitoring case.

if (static_branch_likely(&rdt_mon_enable_key)) {
if (unlikely(current->rmid)) {
newstate.rmid = current->rmid;
__set_bit(newstate.rmid, this_cpu_ptr(rmid_bitmap));
}
}

Now in rmid_free() we can collect that information:

cpumask_clear(&tmpmask);
cpumask_clear(rmid_entry->mask);

cpus_read_lock();
for_each_online_cpu(cpu) {
if (test_and_clear_bit(rmid, per_cpu_ptr(cpu, rmid_bitmap)))
cpumask_set(cpu, tmpmask);
}

for_each_domain(d, resource) {
cpu = cpumask_any_and(d->cpu_mask, tmpmask);
if (cpu < nr_cpu_ids)
cpumask_set(cpu, rmid_entry->mask);
}

list_add(&rmid_entry->list, &limbo_list);

for_each_cpu(cpu, rmid_entry->mask)
schedule_delayed_work_on(cpu, rmid_work);
cpus_read_unlock();

The work function:

bool resched = false;

list_for_each_entry(rme, limbo_list,...) {
if (!cpumask_test_cpu(cpu, rme->mask))
continue;

if (!rmid_is_reusable(rme)) {
resched = true;
continue;
}

cpumask_clear_cpu(cpu, rme->mask);
if (!cpumask_empty(rme->mask))
continue;

/* Ready for reuse */
list_del(rme->list);
list_add(&rme->list, &free_list);
}

The alloc function then becomes:

if (list_empty(&free_list))
return list_empty(&limbo_list) ? -ENOSPC : -EBUSY;

The switch_to() covers the task rmids. The per cpu default rmids can be
marked at the point where they are installed on a CPU in the per cpu
rmid_bitmap. The free path is the same for per task and per cpu.

Another thing which needs some thought is the CPU hotplug code. We need to
make sure that pending work which is scheduled on an outgoing CPU is moved
in the offline callback to a still online CPU of the same domain and not
moved to some random CPU by the workqueue hotplug code.

There is another subtle issue. Assume a RMID is freed. The limbo stuff is
scheduled on all domains which have online CPUs.

Now the last CPU of a domain goes offline before the threshold for clearing
the domain CPU bit in the rme->mask is reached.

So we have two options here:

1) Clear the bit unconditionally when the last CPU of a domain goes
offline.

2) Arm a timer which clears the bit after a grace period

#1 The RMID might become available for reuse right away because all other
domains have not used it or have cleared their bits already.

If one of the CPUs of that domain comes online again and is associated
to that reused RMID again, then the counter content might still contain
leftovers from the previous usage.

#2 Prevents #1 but has its own issues vs. serialization and coordination
with CPU hotplug.

I'd say we go for #1 as the simplest solution, document it and if really
the need arises revisit it later.
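The limbo/free transitions sketched above can be modeled in userspace; a minimal sketch with the per-package occupancy-threshold check stubbed out (all names hypothetical, packages reduced to a bitmask):

```c
#include <assert.h>
#include <stdint.h>

struct rme {
	uint32_t rmid;
	uint8_t  pkg_mask;	/* packages which must still confirm clean */
	int      free;		/* 1 once the entry moved to the free list */
};

/* Stand-in for the llc_occupancy < threshold check on this package. */
static int rmid_is_reusable(struct rme *e)
{
	(void)e;
	return 1;
}

/*
 * Per-package work function from the sketch: clear this package's bit,
 * and mark the entry free once no package bit remains. Returns 1 when
 * the work needs rescheduling (occupancy still above threshold).
 */
static int limbo_work(struct rme *e, int pkg)
{
	if (!(e->pkg_mask & (1 << pkg)))
		return 0;
	if (!rmid_is_reusable(e))
		return 1;		/* resched = true */
	e->pkg_mask &= ~(1 << pkg);
	if (!e->pkg_mask)
		e->free = 1;		/* ready for reuse */
	return 0;
}
```

With this scheme the allocation path never walks the limbo list; it only looks at the free list and reports -EBUSY while entries are still in limbo.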

Thanks,

tglx

2017-07-05 15:34:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 08/21] x86/intel_rdt/cqm: Add RMID(Resource monitoring ID) management

On Mon, Jul 03, 2017 at 11:55:37AM +0200, Thomas Gleixner wrote:

>
> if (static_branch_likely(&rdt_mon_enable_key)) {
> if (unlikely(current->rmid)) {
> newstate.rmid = current->rmid;
> __set_bit(newstate.rmid, this_cpu_ptr(rmid_bitmap));

Non atomic op

> }
> }
>
> Now in rmid_free() we can collect that information:
>
> cpumask_clear(&tmpmask);
> cpumask_clear(rmid_entry->mask);
>
> cpus_read_lock();
> for_each_online_cpu(cpu) {
> if (test_and_clear_bit(rmid, per_cpu_ptr(cpu, rmid_bitmap)))

atomic op

> cpumask_set(cpu, tmpmask);
> }
>
> for_each_domain(d, resource) {
> cpu = cpumask_any_and(d->cpu_mask, tmpmask);
> if (cpu < nr_cpu_ids)
> cpumask_set(cpu, rmid_entry->mask);
> }
>
> list_add(&rmid_entry->list, &limbo_list);
>
> for_each_cpu(cpu, rmid_entry->mask)
> schedule_delayed_work_on(cpu, rmid_work);
> cpus_read_unlock();
>
> The work function:
>
> bool resched = false;
>
> list_for_each_entry(rme, limbo_list,...) {
> if (!cpumask_test_cpu(cpu, rme->mask))
> continue;
>
> if (!rmid_is_reusable(rme)) {
> resched = true;
> continue;
> }
>
> cpumask_clear_cpu(cpu, rme->mask);
> if (!cpumask_empty(rme->mask))
> continue;
>
> /* Ready for reuse */
> list_del(rme->list);
> list_add(&rme->list, &free_list);
> }
>
> The alloc function then becomes:
>
> if (list_empty(&free_list))
> return list_empty(&limbo_list) ? -ENOSPC : -EBUSY;
>
> The switch_to() covers the task rmids. The per cpu default rmids can be
> marked at the point where they are installed on a CPU in the per cpu
> rmid_bitmap. The free path is the same for per task and per cpu.
>
> Another thing which needs some thought is the CPU hotplug code. We need to
> make sure that pending work which is scheduled on an outgoing CPU is moved
> in the offline callback to a still online CPU of the same domain and not
> moved to some random CPU by the workqueue hotplug code.

just flush the workqueue for that CPU? That's what the workqueue core
_should_ do in any case. And that also covers the case where @cpu is the
last in the set of CPUs we could run on.

> There is another subtle issue. Assume a RMID is freed. The limbo stuff is
> scheduled on all domains which have online CPUs.
>
> Now the last CPU of a domain goes offline before the threshold for clearing
> the domain CPU bit in the rme->mask is reached.
>
> So we have two options here:
>
> 1) Clear the bit unconditionally when the last CPU of a domain goes
> offline.

Arguably this. This is cache level stuff, that means this is the last
CPU of a cache, so just explicitly kill the _entire_ cache and insta
mark everything good again; WBINVD ftw.

> 2) Arm a timer which clears the bit after a grace period
>
> #1 The RMID might become available for reuse right away because all other
> domains have not used it or have cleared their bits already.
>
> If one of the CPUs of that domain comes online again and is associated
> to that reused RMID again, then the counter content might still contain
> leftovers from the previous usage.

Not if we kill the cache on offline -- also, if all CPUs have been
offline, its not too weird to expect something like a package idle state
to have happened and shot down the caches anyway.

> #2 Prevents #1 but has its own issues vs. serialization and coordination
> with CPU hotplug.
>
> I'd say we go for #1 as the simplest solution, document it and if really
> the need arises revisit it later.
>
> Thanks,
>
> tglx

2017-07-05 17:25:18

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 08/21] x86/intel_rdt/cqm: Add RMID(Resource monitoring ID) management

On Wed, 5 Jul 2017, Peter Zijlstra wrote:

> On Mon, Jul 03, 2017 at 11:55:37AM +0200, Thomas Gleixner wrote:
>
> >
> > if (static_branch_likely(&rdt_mon_enable_key)) {
> > if (unlikely(current->rmid)) {
> > newstate.rmid = current->rmid;
> > __set_bit(newstate.rmid, this_cpu_ptr(rmid_bitmap));
>
> Non atomic op
>
> > }
> > }
> >
> > Now in rmid_free() we can collect that information:
> >
> > cpumask_clear(&tmpmask);
> > cpumask_clear(rmid_entry->mask);
> >
> > cpus_read_lock();
> > for_each_online_cpu(cpu) {
> > if (test_and_clear_bit(rmid, per_cpu_ptr(cpu, rmid_bitmap)))
>
> atomic op

Indeed. We need atomic on both sides unfortunately.

> > cpumask_set(cpu, tmpmask);
> > }
> > Another thing which needs some thought is the CPU hotplug code. We need to
> > make sure that pending work which is scheduled on an outgoing CPU is moved
> > in the offline callback to a still online CPU of the same domain and not
> > moved to some random CPU by the workqueue hotplug code.
>
> just flush the workqueue for that CPU? That's what the workqueue core
> _should_ do in any case. And that also covers the case where @cpu is the
> last in the set of CPUs we could run on.

Indeed.

> > There is another subtle issue. Assume a RMID is freed. The limbo stuff is
> > scheduled on all domains which have online CPUs.
> >
> > Now the last CPU of a domain goes offline before the threshold for clearing
> > the domain CPU bit in the rme->mask is reached.
> >
> > So we have two options here:
> >
> > 1) Clear the bit unconditionally when the last CPU of a domain goes
> > offline.
>
> Arguably this. This is cache level stuff, that means this is the last
> CPU of a cache, so just explicitly kill the _entire_ cache and insta
> mark everything good again; WBINVD ftw.

Right.

> > 2) Arm a timer which clears the bit after a grace period
> >
> > #1 The RMID might become available for reuse right away because all other
> > domains have not used it or have cleared their bits already.
> >
> > If one of the CPUs of that domain comes online again and is associated
> > to that reused RMID again, then the counter content might still contain
> > leftovers from the previous usage.
>
> Not if we kill the cache on offline -- also, if all CPUs have been
> offline, its not too weird to expect something like a package idle state
> to have happened and shot down the caches anyway.

Yes, didn't think about that.

Thanks,

tglx

2017-07-05 17:59:53

by Tony Luck

[permalink] [raw]
Subject: Re: [PATCH 08/21] x86/intel_rdt/cqm: Add RMID(Resource monitoring ID) management

> In case that a RMID was never used on a particular package, the state check
> forces an IPI on all packages unconditionally. That's suboptimal at least.
>
> We know on which package a given RMID was used, so we could restrict the
> checks to exactly these packages, but I'm not sure it's worth the
> trouble. We might at least document that and explain why this is
> implemented in that way.

We only allocate RMIDs when a user makes a directory. I don't think
we should consider options that slow down context switch in order to
keep track of which packages were used just to make mkdir(2) a bit faster
in the case where we need to check the limbo list.

We could make the check of the limbo list less costly by using a bitmask
to keep track of which packages have already found that the llc_occupancy
is below the threshold. But I'd question whether the extra complexity in the
code was really worth it.
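A minimal userspace sketch of that bitmask idea, assuming at most 64 packages (all names hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical limbo entry: track per-RMID which packages have already
 * seen llc_occupancy drop below the threshold. */
struct limbo_rmid {
	uint32_t rmid;
	uint64_t clean_pkgs;	/* bit N set: package N already checked clean */
};

/*
 * Called from the per-package occupancy check; returns 1 once every
 * online package has confirmed the RMID clean, so the caller can skip
 * further MSR reads for packages already known clean.
 */
static int mark_pkg_clean(struct limbo_rmid *e, int pkg, uint64_t all_pkgs)
{
	e->clean_pkgs |= 1ULL << pkg;
	return (e->clean_pkgs & all_pkgs) == all_pkgs;
}
```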

-Tony [on vacation - responses will be slow]

2017-07-06 06:52:13

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 08/21] x86/intel_rdt/cqm: Add RMID(Resource monitoring ID) management

On Wed, 5 Jul 2017, Tony Luck wrote:

> > In case that a RMID was never used on a particular package, the state check
> > forces an IPI on all packages unconditionally. That's suboptimal at least.
> >
> > We know on which package a given RMID was used, so we could restrict the
> > checks to exactly these packages, but I'm not sure it's worth the
> > trouble. We might at least document that and explain why this is
> > implemented in that way.
>
> We only allocate RMIDs when a user makes a directory. I don't think
> we should consider options that slow down context switch in order to
> keep track of which packages were used just to make mkdir(2) a bit faster
> in the case where we need to check the limbo list.

It's not about speeding up mkdir. It's about preventing IPIs which walk a
list of hundreds of rmid entries in the limbo list. I tested the current
pile on a BDW which has 143 RMIDs and the list walk plus the WRMSR/RDMSR
takes > 100us in IPI context. That's just crap, seriously.

> We could make the check of the limbo list less costly by using a bitmask
> to keep track of which packages have already found that the llc_occupancy
> is below the threshold. But I'd question whether the extra complexity in the
> code was really worth it.

Delegating the check to an IPI which only gets invoked when we ran out of
free RMIDs is the problem and that needs to be fixed.

Whether we optimize it for avoiding the work on packages which did not use
the RMID can be discussed, but replacing that current approach of
delegating a full list walk to an IPI is not really debatable.

OTOH, the set_bit() operation in the context switch path on a per-cpu local
variable is, aside from dirtying a cacheline, negligible vs. the MSR write
itself. And it burdens only the tasks which use monitoring, not the
normal and interesting non-monitoring case.

Thanks,

tglx

2017-07-06 21:06:14

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 07/21] x86/intel_rdt/cqm: Add RDT monitoring initialization



On Sun, 2 Jul 2017, Thomas Gleixner wrote:

> On Mon, 26 Jun 2017, Vikas Shivappa wrote:
>> +/*
>> + * Global boolean for rdt_alloc which is true if any
>> + * resource allocation is enabled.
>> + */
>> +bool rdt_alloc_enabled;
>
> That should be rdt_alloc_capable. It's not enabled at probe time. Probing
> merely detects the capability. That mirrors the capable/enabled bits in the
> rdt resource struct.
>
>> static void
>> mba_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r);
>> static void
>> @@ -230,7 +236,7 @@ static bool rdt_get_mem_config(struct rdt_resource *r)
>> return true;
>> }
>>
>> -static void rdt_get_cache_config(int idx, struct rdt_resource *r)
>> +static void rdt_get_cache_alloc_config(int idx, struct rdt_resource *r)
>> {
>> union cpuid_0x10_1_eax eax;
>> union cpuid_0x10_x_edx edx;
>> @@ -422,7 +428,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>>
>> d->id = id;
>>
>> - if (domain_setup_ctrlval(r, d)) {
>> + if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
>
> This should be done in the name space cleanup patch or in a separate one.
>
>> kfree(d);
>> return;
>> }
>> @@ -513,34 +519,39 @@ static __init void rdt_init_padding(void)
>>
>> static __init bool get_rdt_resources(void)
>> {
>> - bool ret = false;
>> -
>> if (cache_alloc_hsw_probe())
>> - return true;
>> + rdt_alloc_enabled = true;
>>
>> - if (!boot_cpu_has(X86_FEATURE_RDT_A))
>> + if ((!rdt_alloc_enabled && !boot_cpu_has(X86_FEATURE_RDT_A)) &&
>> + !boot_cpu_has(X86_FEATURE_CQM))
>> return false;
>>
>> + if (boot_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
>> + rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
>
> Instead of artificially cramming the CQM bits into this function, it
> would be cleaner to leave that function alone, rename it to
>
> get_rdt_alloc_resources()
>
> and have a new function
>
> get_rdt_mon_resources()
>
> and handle the aggregation at the call site.
>
> rdt_alloc_capable = get_rdt_alloc_resources();
> rdt_mon_capable = get_rdt_mon_resources();
>
> if (!rdt_alloc_capable && !rdt_mon_capable)
> return -ENODEV;
>
> I'd make both variables boolean and have rdt_mon_features as a separate
> one, which carries the actual available feature bits. This is neither
> hotpath nor are we in a situation where we need to spare the last 4 bytes of
> memory. Clean separation of code and functionality is more important.
>
>> +/*
>> + * Event IDs are used to program IA32_QM_EVTSEL before reading event
>> + * counter from IA32_QM_CTR
>> + */
>> +#define QOS_L3_OCCUP_EVENT_ID 0x01
>> +#define QOS_L3_MBM_TOTAL_EVENT_ID 0x02
>> +#define QOS_L3_MBM_LOCAL_EVENT_ID 0x03
>> +
>> +/**
>> + * struct mon_evt - Entry in the event list of a resource
>> + * @evtid: event id
>> + * @name: name of the event
>> + */
>> +struct mon_evt {
>> + u32 evtid;
>> + char *name;
>> + struct list_head list;
>> +};
>> +
>> +extern unsigned int intel_cqm_threshold;
>> +extern bool rdt_alloc_enabled;
>> +extern int rdt_mon_features;
>
> Please do not use 'int' for variables which contain bit flags. unsigned int
> is the proper choice here.
>
>> +struct rmid_entry {
>> + u32 rmid;
>> + enum rmid_recycle_state state;
>> + struct list_head list;
>
> Please make it tabular as you did with mon_evt and other structs.
>
>> +};
>> +
>> +/**
>> + * @rmid_free_lru A least recently used list of free RMIDs
>> + * These RMIDs are guaranteed to have an occupancy less than the
>> + * threshold occupancy
>> + */
>> +static struct list_head rmid_free_lru;
>> +
>> +/**
>> + * @rmid_limbo_lru list of currently unused but (potentially)
>> + * dirty RMIDs.
>> + * This list contains RMIDs that no one is currently using but that
>> + * may have a occupancy value > intel_cqm_threshold. User can change
>> + * the threshold occupancy value.
>> + */
>> +static struct list_head rmid_limbo_lru;
>> +
>> +/**
>> + * @rmid_entry - The entry in the limbo and free lists.
>> + */
>> +static struct rmid_entry *rmid_ptrs;
>> +
>> +/*
>> + * Global boolean for rdt_monitor which is true if any
>
> Boolean !?!?!
>
>> + * resource monitoring is enabled.
>> + */
>> +int rdt_mon_features;
>> +
>> +/*
>> + * This is the threshold cache occupancy at which we will consider an
>> + * RMID available for re-allocation.
>> + */
>> +unsigned int intel_cqm_threshold;
>> +
>> +static inline struct rmid_entry *__rmid_entry(u32 rmid)
>> +{
>> + struct rmid_entry *entry;
>> +
>> + entry = &rmid_ptrs[rmid];
>> + WARN_ON(entry->rmid != rmid);
>> +
>> + return entry;
>> +}
>> +
>> +static int dom_data_init(struct rdt_resource *r)
>> +{
>> + struct rmid_entry *entry = NULL;
>> + int i = 0, nr_rmids;
>> +
>> + INIT_LIST_HEAD(&rmid_free_lru);
>> + INIT_LIST_HEAD(&rmid_limbo_lru);
>
> You can spare that by declaring the list head with
>
> static LIST_HEAD(rmid_xxx_lru);
>
>> +
>> + nr_rmids = r->num_rmid;
>> + rmid_ptrs = kcalloc(nr_rmids, sizeof(struct rmid_entry), GFP_KERNEL);
>> + if (!rmid_ptrs)
>> + return -ENOMEM;
>> +
>> + for (; i < nr_rmids; i++) {
>
> Please initialize i in the for() construct. It's really bad to read,
> because the missing initialization statement makes one look for a special
> initialization magic just to figure out that it's simply i = 0.
>
>> + entry = &rmid_ptrs[i];
>> + INIT_LIST_HEAD(&entry->list);
>> +
>> + entry->rmid = i;
>> + list_add_tail(&entry->list, &rmid_free_lru);
>> + }
>> +
>> + /*
>> + * RMID 0 is special and is always allocated. It's used for all
>> + * tasks that are not monitored.
>> + */
>> + entry = __rmid_entry(0);
>> + list_del(&entry->list);
>> +
>> + return 0;
>> +}
>> +
>> +static struct mon_evt llc_occupancy_event = {
>> + .name = "llc_occupancy",
>> + .evtid = QOS_L3_OCCUP_EVENT_ID,
>
> Tabular...
>
>> +};
>> +
>> +static void l3_mon_evt_init(struct rdt_resource *r)
>> +{
>> + INIT_LIST_HEAD(&r->evt_list);
>> +
>> + if (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID))
>> + list_add_tail(&llc_occupancy_event.list, &r->evt_list);
>
> What's that list for? Why don't you have that event as a member of the L3
> rdt resource and control it via r->mon_capable/enabled?

Will fix all the comments above. The memory bandwidth events (total and local)
are enumerated for the L3 resource itself, like the llc_occupancy event, with
resource id 1: CPUID.(EAX=0FH, ECX=0):EDX.L3[bit 1] = 1. So all three events
are added to the L3 resource struct itself.

>
>> +}
>> +
>> +void rdt_get_mon_l3_config(struct rdt_resource *r)
>> +{
>> + int ret;
>> +
>> + r->mon_scale = boot_cpu_data.x86_cache_occ_scale;
>> + r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
>> +
>> + /*
>> + * A reasonable upper limit on the max threshold is the number
>> + * of lines tagged per RMID if all RMIDs have the same number of
>> + * lines tagged in the LLC.
>> + *
>> + * For a 35MB LLC and 56 RMIDs, this is ~1.8% of the LLC.
>> + */
>> + intel_cqm_threshold = boot_cpu_data.x86_cache_size * 1024 / r->num_rmid;
>> +
>> + /* h/w works in units of "boot_cpu_data.x86_cache_occ_scale" */
>> + intel_cqm_threshold /= r->mon_scale;
>> +
>> + ret = dom_data_init(r);
>> + if (ret)
>> + goto out;
>> +
>> + l3_mon_evt_init(r);
>> +
>> + r->mon_capable = true;
>> + r->mon_enabled = true;
>> +
>> + return;
>> +out:
>> + kfree(rmid_ptrs);
>> + rdt_mon_features = 0;
>
> This is silly. If dom_data_init() fails, then it failed because it was
> unable to allocate rmid_ptrs. .....
>
> Also clearing rdt_mon_features here is conceptually wrong. Make that
> function return int, i.e. the failure value, and clear rdt_mon_capable at
> the call site in case of error.

OK, will fix and keep a separate rdt_mon_capable.

Thanks,
Vikas

>
> Thanks,
>
> tglx
>
>
>
>
>

2017-07-06 21:08:09

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 09/21] x86/intel_rdt: Simplify info and base file lists



On Sun, 2 Jul 2017, Thomas Gleixner wrote:

> On Mon, 26 Jun 2017, Vikas Shivappa wrote:
>> @@ -82,6 +82,7 @@ struct rdt_resource rdt_resources_all[] = {
>> },
>> .parse_ctrlval = parse_cbm,
>> .format_str = "%d=%0*x",
>> + .fflags = RFTYPE_RES_CACHE,
>> },
>
> Can you please convert that array to use explicit array member
> initializers? I've noticed this back when I reviewed the intial RDT
> implementation, but it somehow escaped. i.e.:
>
> [RESOURCE_ID] =
> {
> .....
> }

Will fix.
Thanks,
Vikas

>
> Thanks,
>
> tglx
>
>

2017-07-06 21:21:42

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 11/21] x86/intel_rdt/cqm: Add mkdir support for RDT monitoring



On Sun, 2 Jul 2017, Thomas Gleixner wrote:

> On Mon, 26 Jun 2017, Vikas Shivappa wrote:
>> +/*
>> + * Common code for ctrl_mon and monitor group mkdir.
>> + * The caller needs to unlock the global mutex upon success.
>> + */
>> +static int mkdir_rdt_common(struct kernfs_node *pkn, struct kernfs_node *prkn,
>
> pkn and prkn are horrible to distinguish. What's wrong with keeping
> *parent_kn and have *kn as the new thing?

prkn is always the kn of the parent rdtgroup, whereas pkn is the parent kn.
Maybe parent_kn and parent_kn_rdtgrp? Wanted to make them shorter.

>
>> + const char *name, umode_t mode,
>> + enum rdt_group_type rtype, struct rdtgroup **r)
>> {
>
> Can you please split out that mkdir_rdt_common() change into a separate
> patch? It can be done as a preparatory stand alone change just for the
> existing rdt group code. Then the monitoring add ons come on top of it.
>
>> - struct rdtgroup *parent, *rdtgrp;
>> + struct rdtgroup *prgrp, *rdtgrp;
>> struct kernfs_node *kn;
>> - int ret, closid;
>> -
>> - /* Only allow mkdir in the root directory */
>> - if (parent_kn != rdtgroup_default.kn)
>> - return -EPERM;
>> -
>> - /* Do not accept '\n' to avoid unparsable situation. */
>> - if (strchr(name, '\n'))
>> - return -EINVAL;
>> + uint fshift = 0;
>> + int ret;
>>
>> - parent = rdtgroup_kn_lock_live(parent_kn);
>> - if (!parent) {
>> + prgrp = rdtgroup_kn_lock_live(prkn);
>> + if (!prgrp) {
>> ret = -ENODEV;
>> goto out_unlock;
>> }
>>
>> - ret = closid_alloc();
>> - if (ret < 0)
>> - goto out_unlock;
>> - closid = ret;
>> -
>> /* allocate the rdtgroup. */
>> rdtgrp = kzalloc(sizeof(*rdtgrp), GFP_KERNEL);
>> if (!rdtgrp) {
>> ret = -ENOSPC;
>> - goto out_closid_free;
>> + goto out_unlock;
>> }
>> - rdtgrp->closid = closid;
>> - list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups);
>> + *r = rdtgrp;
>> + rdtgrp->parent = prgrp;
>> + rdtgrp->type = rtype;
>> + INIT_LIST_HEAD(&rdtgrp->crdtgrp_list);
>>
>> /* kernfs creates the directory for rdtgrp */
>> - kn = kernfs_create_dir(parent->kn, name, mode, rdtgrp);
>> + kn = kernfs_create_dir(pkn, name, mode, rdtgrp);
>> if (IS_ERR(kn)) {
>> ret = PTR_ERR(kn);
>> goto out_cancel_ref;
>> @@ -1138,27 +1166,138 @@ static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
>> if (ret)
>> goto out_destroy;
>>
>> - ret = rdtgroup_add_files(kn, RF_CTRL_BASE);
>> + fshift = 1 << (RF_CTRLSHIFT + rtype);
>> + ret = rdtgroup_add_files(kn, RFTYPE_BASE | fshift);
>
>
> I'd rather make this:
>
> files = RFTYPE_BASE | (1U << (RF_CTRLSHIFT + rtype));
> ret = rdtgroup_add_files(kn, files);
>
>> if (ret)
>> goto out_destroy;
>>
>> + if (rdt_mon_features) {
>> + ret = alloc_rmid();
>> + if (ret < 0)
>> + return ret;
>> +
>> + rdtgrp->rmid = ret;
>> + }
>> kernfs_activate(kn);
>>
>> - ret = 0;
>> - goto out_unlock;
>
> What unlocks prkn now? The caller, right? Please add a comment ...
>
>> + return 0;
>>
>> out_destroy:
>> kernfs_remove(rdtgrp->kn);
>> out_cancel_ref:
>> - list_del(&rdtgrp->rdtgroup_list);
>> kfree(rdtgrp);
>> -out_closid_free:
>> +out_unlock:
>> + rdtgroup_kn_unlock(prkn);
>> + return ret;
>> +}
>> +
>> +static void mkdir_rdt_common_clean(struct rdtgroup *rgrp)
>> +{
>> + kernfs_remove(rgrp->kn);
>> + if (rgrp->rmid)
>> + free_rmid(rgrp->rmid);
>
> Please put that conditonal into free_rmid().

Will fix all above.

>
>> + kfree(rgrp);
>> +}
>
>> +static int rdtgroup_mkdir(struct kernfs_node *pkn, const char *name,
>> + umode_t mode)
>> +{
>> + /* Do not accept '\n' to avoid unparsable situation. */
>> + if (strchr(name, '\n'))
>> + return -EINVAL;
>> +
>> + /*
>> + * We don't allow rdtgroup ctrl_mon directories to be created anywhere
>> + * except the root directory and dont allow rdtgroup monitor
>> + * directories to be created anywhere execept inside mon_groups
>> + * directory.
>> + */
>> + if (rdt_alloc_enabled && pkn == rdtgroup_default.kn)
>> + return rdtgroup_mkdir_ctrl_mon(pkn, pkn, name, mode);
>> + else if (rdt_mon_features &&
>> + !strcmp(pkn->name, "mon_groups"))
>> + return rdtgroup_mkdir_mon(pkn, pkn->parent, name, mode);
>> + else
>> + return -EPERM;
>
> TBH, this is really convoluted (including the comment).
>
> /*
> * If the parent directory is the root directory and RDT
> * allocation is supported, add a control and monitoring
> * subdirectory.
> */
> if (rdt_alloc_capable && parent_kn == rdtgroup_default.kn)
> return rdtgroup_mkdir_ctrl_mon(...);
>
> /*
> * If the parent directory is a monitoring group and RDT
> * monitoring is supported, add a monitoring subdirectory.
> */
> if (rdt_mon_capable && is_mon_group(parent_kn))
> return rdtgroup_mkdir_mon(...);
>
> return -EPERM;

Will fix.

>
> Note, that I did not use strcmp(parent_kn->name) because that's simply
> not sufficient. What prevents a user from doing:
>
> # mkdir /sys/fs/resctrl/mon_group/mon_group
> # mkdir /sys/fs/resctrl/mon_group/mon_group/foo
>

This would fail because the parent rdtgrp when creating foo is NULL. This is
because the parent rdtgrp is taken from the "resctrl/mon_group/mon_group"
directory's parent, which is resctrl/mon_groups->priv. We always keep this
NULL. So a user can create a mon group under resctrl/mon_groups but can't
create a directory under that.

> You need a better way to distignuish that than strcmp(). You probably want
> to prevent creating subdirectories named "mon_group" as well.
>

If creating a monitor group named mon_group is confusing, then it can be
checked for and rejected.

Thanks,
Vikas

> Thanks,
>
> tglx
>
>

2017-07-06 21:23:31

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 12/21] x86/intel_rdt/cqm: Add tasks file support



On Sun, 2 Jul 2017, Thomas Gleixner wrote:

> On Mon, 26 Jun 2017, Vikas Shivappa wrote:
>> @@ -866,6 +866,7 @@ struct task_struct {
>> #endif
>> #ifdef CONFIG_INTEL_RDT
>> int closid;
>> + u32 rmid;
>
> Can you please make a preparatory change which makes closid an u32 as well?
> We should have done that in the first place, but in hindsight we are always
> smarter...

Ok, makes sense. Will fix. I think Fenghua or David had suggested this but I
missed it.

Thanks,
Vikas

>
> Thanks,
>
> tglx
>

2017-07-06 21:24:37

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 13/21] x86/intel_rdt/cqm: Add cpus file support



On Sun, 2 Jul 2017, Thomas Gleixner wrote:

> On Mon, 26 Jun 2017, Vikas Shivappa wrote:
>> diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
>> index fdf3654..fec8ba9 100644
>> --- a/arch/x86/kernel/cpu/intel_rdt.h
>> +++ b/arch/x86/kernel/cpu/intel_rdt.h
>> @@ -37,6 +37,8 @@ struct mon_evt {
>> extern bool rdt_alloc_enabled;
>> extern int rdt_mon_features;
>>
>> +DECLARE_PER_CPU_READ_MOSTLY(int, cpu_rmid);
>
> u32
>
>>
>> +DEFINE_PER_CPU_READ_MOSTLY(int, cpu_rmid);
>> static inline struct rmid_entry *__rmid_entry(u32 rmid)
>
> Bah. Please add a new line between the DEFINE... and the function.
>
> But that whole thing is wrong. The per cpu default closid and rmid want to
> be in a single place, not in two distinct per cpu variables.
>
> struct rdt_cpu_default {
> u32 rmid;
> u32 closid;
> };
>
> DEFINE_PER_CPU_READ_MOSTLY(struct rdt_cpu_default, rdt_cpu_default);
>
> or something like this. That way it's guaranteed that the context switch
> code touches a single cache line for the per cpu defaults.

Will fix and add both rmid and closid into common struct.

Thanks,
Vikas

>
> Thanks,
>
> tglx
>
>

2017-07-06 21:40:44

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 13/21] x86/intel_rdt/cqm: Add cpus file support



On Sun, 2 Jul 2017, Thomas Gleixner wrote:

> On Mon, 26 Jun 2017, Vikas Shivappa wrote:
>> -static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
>> - char *buf, size_t nbytes, loff_t off)
>> +static ssize_t cpus_mon_write(struct kernfs_open_file *of,
>> + char *buf, size_t nbytes,
>> + struct rdtgroup *rdtgrp)
>
> Again. Please make the split of rdtgroup_cpus_write() as a seperate
> preparatory change first and just move the guts of the existing write
> function out into cpus_ctrl_write() and then add the mon_write stuff as an
> extra patch.
>
>> {
>> + struct rdtgroup *pr = rdtgrp->parent, *cr;
>
> *pr and *cr really suck.
>
>> cpumask_var_t tmpmask, newmask;
>> - struct rdtgroup *rdtgrp, *r;
>> + struct list_head *llist;
>> int ret;
>>
>> - if (!buf)
>> - return -EINVAL;
>> -
>> if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
>> return -ENOMEM;
>> if (!zalloc_cpumask_var(&newmask, GFP_KERNEL)) {
>> @@ -233,10 +235,89 @@ static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
>> return -ENOMEM;
>> }
>>
>> - rdtgrp = rdtgroup_kn_lock_live(of->kn);
>> - if (!rdtgrp) {
>> - ret = -ENOENT;
>> - goto unlock;
>> + if (is_cpu_list(of))
>> + ret = cpulist_parse(buf, newmask);
>> + else
>> + ret = cpumask_parse(buf, newmask);
>
> The cpuask allocation and parsing of the user buffer can be done in the
> common code. No point in duplicating that.
>
>> +
>> + if (ret)
>> + goto out;
>> +
>> + /* check that user didn't specify any offline cpus */
>> + cpumask_andnot(tmpmask, newmask, cpu_online_mask);
>> + if (cpumask_weight(tmpmask)) {
>> + ret = -EINVAL;
>> + goto out;
>> + }
>
> Common code.

Will fix all above

>
>> + /* Check whether cpus belong to parent ctrl group */
>> + cpumask_andnot(tmpmask, newmask, &pr->cpu_mask);
>> + if (cpumask_weight(tmpmask)) {
>> + ret = -EINVAL;
>> + goto out;
>> + }
>> +
>> + /* Check whether cpus are dropped from this group */
>> + cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask);
>> + if (cpumask_weight(tmpmask)) {
>> + /* Give any dropped cpus to parent rdtgroup */
>> + cpumask_or(&pr->cpu_mask, &pr->cpu_mask, tmpmask);
>
> This does not make any sense. The check above verifies that all cpus in
> newmask belong to the parent->cpu_mask. If they don't then you return
> -EINVAL, but here you give them back to parent->cpu_mask. How is that
> supposed to work? You never get into this code path!

The parent->cpu_mask is always the parent->cpus_valid_mask, if I understand
right. With monitor groups, each cpu is always present in "one" ctrl_mon
group and one mon group. And the mon group can have only cpus from its
parent. Maybe it needs a comment? (It is explained in the documentation patch.)

# mkdir /sys/fs/resctrl/p1
# mkdir /sys/fs/resctrl/p1/mon_groups/m1
# echo 5-10 > /sys/fs/resctr/p1/cpus_list
Say p1 has RMID 2
cpus 5-10 have RMID 2

# echo 5-6 > /sys/fs/resctrl/p1/mon_groups/m1/cpus_list
cpus 5-6 have RMID 3
cpus 7-10 have RMID 2

# cat /sys/fs/resctrl/p1/cpus_list
5-10

This is because when we query the data for p1 it adds its own data (RMID 2) and
all the data for its child mon groups (hence all cpus from 5-10).

But the
>> + cpumask_or(&pr->cpu_mask, &pr->cpu_mask, tmpmask);
can be removed because, as you suggest, it does nothing: the parent already
has these cpus. We just need the update_closid_rmid(tmpmask, pr).

>
> So you need a seperate mask in the parent rdtgroup to store the CPUs which
> are valid in any monitoring group which belongs to it. So the logic
> becomes:
>
> /*
> * Check whether the CPU mask is a subset of the CPUs
> * which belong to the parent group.
> */
> cpumask_andnot(tmpmask, newmask, parent->cpus_valid_mask);
> if (cpumask_weight(tmpmask))
> return -EINVAL;
>
> When CAT is not available, then parent->cpus_valid_mask is a pointer to
> cpu_online_mask. When CAT is enabled, then parent->cpus_valid_mask is a
> pointer to the CAT group cpu mask.

When CAT is unavailable we cannot create any ctrl_mon groups.

>
>> + update_closid_rmid(tmpmask, pr);
>> + }
>> +
>> + /*
>> + * If we added cpus, remove them from previous group that owned them
>> + * and update per-cpu rmid
>> + */
>> + cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask);
>> + if (cpumask_weight(tmpmask)) {
>> + llist = &pr->crdtgrp_list;
>
> llist is a bad name. We have a facility llist, i.e. lockless list. head ?

Will fix.

>
>> + list_for_each_entry(cr, llist, crdtgrp_list) {
>> + if (cr == rdtgrp)
>> + continue;
>> + cpumask_andnot(&cr->cpu_mask, &cr->cpu_mask, tmpmask);
>> + }
>> + update_closid_rmid(tmpmask, rdtgrp);
>> + }
>
>> +static void cpumask_rdtgrp_clear(struct rdtgroup *r, struct cpumask *m)
>> +{
>> + struct rdtgroup *cr;
>> +
>> + cpumask_andnot(&r->cpu_mask, &r->cpu_mask, m);
>> + /* update the child mon group masks as well*/
>> + list_for_each_entry(cr, &r->crdtgrp_list, crdtgrp_list)
>> + cpumask_and(&cr->cpu_mask, &r->cpu_mask, &cr->cpu_mask);
>
> That's equally wrong. See above.

For the same reason as above, each cpu is present in "one" ctrl_mon group and
may be present in "one" mon group, so we need to clear both.

Thanks,
Vikas

>
> Thanks,
>
> tglx
>

2017-07-06 21:46:44

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 14/21] x86/intel_rdt/cqm: Add mon_data



On Sun, 2 Jul 2017, Thomas Gleixner wrote:

> On Mon, 26 Jun 2017, Vikas Shivappa wrote:
>
>> Add a mon_data directory for the root rdtgroup and all other rdtgroups.
>> The directory holds all of the monitored data for all domains and events
>> of all resources being monitored.
>
> Again. This does two things at once. Move the existing code to a new file
> and add the monitoring stuff. Please split it apart.

Will fix.

>
>> +static bool __mon_event_count(u32 rmid, struct rmid_read *rr)
>> +{
>> + u64 tval;
>> +
>> + tval = __rmid_read(rmid, rr->evtid);
>> + if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) {
>> + rr->val = tval;
>> + return false;
>> + }
>> + switch (rr->evtid) {
>> + case QOS_L3_OCCUP_EVENT_ID:
>> + rr->val += tval;
>> + return true;
>> + default:
>> + return false;
>
> I have no idea what that return code means.

false for an invalid event id and for all errors from __rmid_read() (IOW, all
errors for __mon_event_count()).

>
>> + }
>> +}
>> +
>> +void mon_event_count(void *info)
>
> Some explanation why this is a void pointer and how that function is called
> (I assume it's via IPI) would be appreciated.
>
>> +{
>> + struct rdtgroup *rdtgrp, *entry;
>> + struct rmid_read *rr = info;
>> + struct list_head *llist;
>
> *head;
>
>> +
>> + rdtgrp = rr->rgrp;
>> +
>> + if (!__mon_event_count(rdtgrp->rmid, rr))
>> + return;
>> +
>> + /*
>> + * For Ctrl groups read data from child monitor groups.
>> + */
>> + llist = &rdtgrp->crdtgrp_list;
>> +
>> + if (rdtgrp->type == RDTCTRL_GROUP) {
>> + list_for_each_entry(entry, llist, crdtgrp_list) {
>> + if (!__mon_event_count(entry->rmid, rr))
>> + return;
>> + }
>> + }
>> +}
>
>> +static int get_rdt_resourceid(struct rdt_resource *r)
>> +{
>> + if (r > (rdt_resources_all + RDT_NUM_RESOURCES - 1) ||
>> + r < rdt_resources_all ||
>> + ((r - rdt_resources_all) % sizeof(struct rdt_resource)))
>> + return -EINVAL;
>
> If that ever happens, then you have other problems than a wrong pointer.
>
>> +
>> + return ((r - rdt_resources_all) / sizeof(struct rdt_resource));
>
> Moo. Can't you simply put an index field into struct rdt_resource,
> intialize it with the resource ID and use that?

Ok will fix all above,

thanks,
Vikas

>
> Thanks,
>
> tglx
>

2017-07-06 21:47:36

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 15/21] x86/intel_rdt/cqm: Add rmdir support



On Sun, 2 Jul 2017, Thomas Gleixner wrote:

> On Mon, 26 Jun 2017, Vikas Shivappa wrote:
>
>> Resource groups (ctrl_mon and monitor groups) are represented by
>> directories in resctrl fs. Add support to remove the directories.
>
> Again. Please split that patch into two parts; seperate ctrl stuff from rmdir and
> then add monitoring support.
>
>> + rdtgrp->flags = RDT_DELETED;
>> + free_rmid(rdtgrp->rmid);
>> +
>> + /*
>> + * Remove your rmid from the parent ctrl groups list
>
> You are not removing a rmid. You remove the group from the parents group
> list. Please be more accurate with your comments. Wrong comments are worse
> than no comments.
>
>> + WARN_ON(list_empty(&prdtgrp->crdtgrp_list));
>> + list_del(&rdtgrp->crdtgrp_list);
>
>> +static int rdtgroup_rmdir_ctrl(struct kernfs_node *kn, struct rdtgroup *rdtgrp)
>> +{
>> + int cpu, closid = rdtgroup_default.closid;
>> + struct rdtgroup *entry, *tmp;
>> + struct list_head *llist;
>
> *head please.
>
>> + cpumask_var_t tmpmask;
>> +
>> + if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
>> + return -ENOMEM;
>
> Allocation/free can be done at the call site for both functions.
>
>> +static int rdtgroup_rmdir(struct kernfs_node *kn)
>> +{
>> + struct kernfs_node *parent_kn = kn->parent;
>> + struct rdtgroup *rdtgrp;
>> + int ret = 0;
>> +
>> + rdtgrp = rdtgroup_kn_lock_live(kn);
>> + if (!rdtgrp) {
>> + ret = -EPERM;
>> + goto out;
>> + }
>> +
>> + if (rdtgrp->type == RDTCTRL_GROUP && parent_kn == rdtgroup_default.kn)
>> + ret = rdtgroup_rmdir_ctrl(kn, rdtgrp);
>> + else if (rdtgrp->type == RDTMON_GROUP &&
>> + !strcmp(parent_kn->name, "mon_groups"))
>> + ret = rdtgroup_rmdir_mon(kn, rdtgrp);
>> + else
>> + ret = -EPERM;
>
> Like in the other patch, please makes this parseable.

Will fix all..

Thanks,
Vikas

>
> Thanks,
>
> tglx
>

2017-07-06 21:56:57

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 16/21] x86/intel_rdt/cqm: Add mount,umount support



On Sun, 2 Jul 2017, Thomas Gleixner wrote:

> On Mon, 26 Jun 2017, Vikas Shivappa wrote:
>>
>> list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) {
>> + /* Free any child rmids */
>> + llist = &rdtgrp->crdtgrp_list;
>> + list_for_each_entry_safe(sentry, stmp, llist, crdtgrp_list) {
>> + free_rmid(sentry->rmid);
>> + list_del(&sentry->crdtgrp_list);
>> + kfree(sentry);
>> + }
>
> I'm pretty sure, that I've seen exactly this code sequence already. Please
> create a helper instead of copying stuff over and over.

That's right, the same sequence runs during rmdir of a ctrl_mon group, which
deletes all its child mon groups. Will fix.

Thanks,
Vikas
>
> Thanks,
>
> tglx
>

2017-07-06 23:33:45

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 17/21] x86/intel_rdt/cqm: Add sched_in support



On Sun, 2 Jul 2017, Thomas Gleixner wrote:

> On Mon, 26 Jun 2017, Vikas Shivappa wrote:
>> DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
>> DECLARE_PER_CPU_READ_MOSTLY(int, cpu_closid);
>> +DECLARE_PER_CPU_READ_MOSTLY(int, cpu_rmid);
>> DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>> +DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
>> +DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
>
> Please make this a two stage change. Add rdt_enable_key first and then the
> monitoring stuff. Ideally you introduce rdt_enable_key here and in the
> control code in one go.
>
>> +static void __intel_rdt_sched_in(void)
>> {
>> - if (static_branch_likely(&rdt_alloc_enable_key)) {
>> - struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
>> - int closid;
>> + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
>> + u32 closid = 0;
>> + u32 rmid = 0;
>>
>> + if (static_branch_likely(&rdt_alloc_enable_key)) {
>> /*
>> * If this task has a closid assigned, use it.
>> * Else use the closid assigned to this cpu.
>> @@ -55,14 +59,31 @@ static inline void intel_rdt_sched_in(void)
>> closid = current->closid;
>> if (closid == 0)
>> closid = this_cpu_read(cpu_closid);
>> + }
>> +
>> + if (static_branch_likely(&rdt_mon_enable_key)) {
>> + /*
>> + * If this task has a rmid assigned, use it.
>> + * Else use the rmid assigned to this cpu.
>> + */
>> + rmid = current->rmid;
>> + if (rmid == 0)
>> + rmid = this_cpu_read(cpu_rmid);
>> + }
>>
>> - if (closid != state->closid) {
>> - state->closid = closid;
>> - wrmsr(IA32_PQR_ASSOC, state->rmid, closid);
>> - }
>> + if (closid != state->closid || rmid != state->rmid) {
>> + state->closid = closid;
>> + state->rmid = rmid;
>> + wrmsr(IA32_PQR_ASSOC, rmid, closid);
>
> This can be written smarter.
>
> struct intel_pqr_state newstate = this_cpu_read(rdt_cpu_default);
> struct intel_pqr_state *curstate = this_cpu_ptr(&pqr_state);
>
> if (static_branch_likely(&rdt_alloc_enable_key)) {
> if (current->closid)
> newstate.closid = current->closid;
> }
>
> if (static_branch_likely(&rdt_mon_enable_key)) {
> if (current->rmid)
> newstate.rmid = current->rmid;
> }
>
> if (newstate != *curstate) {
> *curstate = newstate;
> wrmsr(IA32_PQR_ASSOC, newstate.rmid, newstate.closid);
> }
>
> The unconditional read of rdt_cpu_default is the right thing to do because
> the default behaviour is exactly this.

Ok makes sense. Will fix.

Thanks,
Vikas
>
> Thanks,
>
> tglx
>
>
>
>

2017-07-06 23:38:12

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 19/21] x86/intel_rdt/mbm: Basic counting of MBM events (total and local)



On Sun, 2 Jul 2017, Thomas Gleixner wrote:
>> INIT_LIST_HEAD(&r->evt_list);
>>
>> if (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID))
>> list_add_tail(&llc_occupancy_event.list, &r->evt_list);
>> + if (is_mbm_total_enabled())
>> + list_add_tail(&mbm_total_event.list, &r->evt_list);
>> + if (is_mbm_local_enabled())
>> + list_add_tail(&mbm_local_event.list, &r->evt_list);
>
> Confused. This hooks all monitoring features to RDT_RESOURCE_L3. Why?

They are really L3 resource events as per the spec.
CPUID.(EAX=0FH, ECX=0):EDX.L3[bit 1] = 1 if L3 monitoring is supported, and we
query all of llc_occupancy, L3 total and local b/w with the same resource id 1.

>
> Thanks,
>
> tglx
>
>
>

2017-07-06 23:52:08

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 21/21] x86/intel_rdt/mbm: Handle counter overflow



On Sun, 2 Jul 2017, Thomas Gleixner wrote:

> On Mon, 26 Jun 2017, Vikas Shivappa wrote:
>> +static void mbm_update(struct rdt_domain *d, int rmid)
>> +{
>> + struct rmid_read rr;
>> +
>> + rr.first = false;
>> + rr.d = d;
>> +
>> + if (is_mbm_total_enabled()) {
>> + rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
>> + __mon_event_count(rmid, &rr);
>
> This is broken as it is not protected against a concurrent read from user
> space which comes in via a smp function call.

The read from user space also holds the rdtgroup_mutex.

Thanks,
Vikas

>
> This means both the internal state and __rmid_read() are unprotected.
>
> I'm not sure whether it's enough to disable interrupts around
> __mon_event_count(), but that's the minimal protection required. It's
> definitely good enough for __rmid_read(), but it might not be sufficient
> for protecting domain->mbm_[local|total]. I leave the exercise of figuring
> that out to you.
>
> Thanks,
>
> tglx
>

2017-07-07 06:22:41

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 14/21] x86/intel_rdt/cqm: Add mon_data

On Thu, 6 Jul 2017, Shivappa Vikas wrote:
> On Sun, 2 Jul 2017, Thomas Gleixner wrote:
> > > +static bool __mon_event_count(u32 rmid, struct rmid_read *rr)
> > > +{
> > > + u64 tval;
> > > +
> > > + tval = __rmid_read(rmid, rr->evtid);
> > > + if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) {
> > > + rr->val = tval;
> > > + return false;
> > > + }
> > > + switch (rr->evtid) {
> > > + case QOS_L3_OCCUP_EVENT_ID:
> > > + rr->val += tval;
> > > + return true;
> > > + default:
> > > + return false;
> >
> > I have no idea what that return code means.
>
> false for the invalid event id and all errors for __rmid_read. (IOW all errors
> for __mon_event-read)

Sure, but why bool? What's wrong with proper error return codes, so issues
can be distinguished and potentially propagated in the callchain?

Thanks,

tglx




2017-07-07 06:45:04

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 13/21] x86/intel_rdt/cqm: Add cpus file support

On Thu, 6 Jul 2017, Shivappa Vikas wrote:
> On Sun, 2 Jul 2017, Thomas Gleixner wrote:
> > > + /* Check whether cpus belong to parent ctrl group */
> > > + cpumask_andnot(tmpmask, newmask, &pr->cpu_mask);
> > > + if (cpumask_weight(tmpmask)) {
> > > + ret = -EINVAL;
> > > + goto out;
> > > + }
> > > +
> > > + /* Check whether cpus are dropped from this group */
> > > + cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask);
> > > + if (cpumask_weight(tmpmask)) {
> > > + /* Give any dropped cpus to parent rdtgroup */
> > > + cpumask_or(&pr->cpu_mask, &pr->cpu_mask, tmpmask);
> >
> > This does not make any sense. The check above verifies that all cpus in
> > newmask belong to the parent->cpu_mask. If they don't then you return
> > -EINVAL, but here you give them back to parent->cpu_mask. How is that
> > supposed to work? You never get into this code path!
>
> The parent->cpu_mask always is the parent->cpus_valid_mask if i understand
> right. With monitor group, the cpu is present is always present in "one"
> ctrl_mon group and one mon_group. And the mon group can have only cpus in its
> parent. May be it needs a comment? (its explaind in the documentation patch).

Sigh, the code needs to be written in a way that it is halfway obvious
what's going on.

> # mkdir /sys/fs/resctrl/p1
> # mkdir /sys/fs/resctrl/p1/mon_groups/m1
> # echo 5-10 > /sys/fs/resctr/p1/cpus_list
> Say p1 has RMID 2
> cpus 5-10 have RMID 2

So what you say, is that parent is always the resource control group
itself.

Can we please have a proper distinction in the code? I tripped over these
ambiguities several times.

The normal meaning of parent->child relations is that both have the same
type. While this is the case at the implementation detail level (both are
type struct rdtgroup), from a conceptual level they are different:

parent is a resource group and child is a monitoring group

That should be expressed in the code, at the very least by variable naming,
so it becomes immediately clear that this operates on two different
entities.

The proper solution is to have different data types or at least embed the
monitoring bits in a separate entity inside of struct rdtgroup.

struct mongroup {
monitoring stuff;
};

struct rdtgroup {
common stuff;
struct mongroup mon;
};

So the code can operate on r->mon.foo or mon->foo which makes it entirely
clear what kind of operation this is.

Sigh, cramming everything into a single struct without distinction is the
same as operating on a pile of global variables, which is the most common
pattern used by people learning C. You certainly belong not to that group,
so dammit, get your act together and structure the code so it's obvious and
maintainable.

Thanks,

tglx







2017-07-07 06:47:40

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 19/21] x86/intel_rdt/mbm: Basic counting of MBM events (total and local)

On Thu, 6 Jul 2017, Shivappa Vikas wrote:
> On Sun, 2 Jul 2017, Thomas Gleixner wrote:
> > > INIT_LIST_HEAD(&r->evt_list);
> > >
> > > if (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID))
> > > list_add_tail(&llc_occupancy_event.list, &r->evt_list);
> > > + if (is_mbm_total_enabled())
> > > + list_add_tail(&mbm_total_event.list, &r->evt_list);
> > > + if (is_mbm_local_enabled())
> > > + list_add_tail(&mbm_local_event.list, &r->evt_list);
> >
> > Confused. This hooks all monitoring features to RDT_RESOURCE_L3. Why?
>
> They are really L3 resource events as per the spec.
> CPUID.(EAX=0FH, ECX=0):EDX.L3[bit 1] = 1 if L3 monitoring and we query for all
> the llc_occupancy, l3 total and local b/w with the same resource id 1.

Then this should be documented somewhere in the code ....

2017-07-07 06:50:45

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 21/21] x86/intel_rdt/mbm: Handle counter overflow

On Thu, 6 Jul 2017, Shivappa Vikas wrote:
> On Sun, 2 Jul 2017, Thomas Gleixner wrote:
> > On Mon, 26 Jun 2017, Vikas Shivappa wrote:
> > > +static void mbm_update(struct rdt_domain *d, int rmid)
> > > +{
> > > + struct rmid_read rr;
> > > +
> > > + rr.first = false;
> > > + rr.d = d;
> > > +
> > > + if (is_mbm_total_enabled()) {
> > > + rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
> > > + __mon_event_count(rmid, &rr);
> >
> > This is broken as it is not protected against a concurrent read from user
> > space which comes in via a smp function call.
>
> The read from user also has the rdtgroup_mutex.

Which is again, completely non obvious and undocumented in the code.

Aside of that, are you really serious about serializing the world and
everything on a single global mutex?

Thanks,

tglx

2017-07-10 17:54:20

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH 21/21] x86/intel_rdt/mbm: Handle counter overflow

On Fri, Jul 07, 2017 at 08:50:40AM +0200, Thomas Gleixner wrote:
> Aside of that, are you really serious about serializing the world and
> everything on a single global mutex?

It would be nice to not do that, but there are challenges. At
any instant someone else might run:

# rmdir /sys/fs/resctrl/{some_control_group}

and blow away the control group and all the monitor groups under
it.

Someone else might do:

# echo 0 > /sys/devices/system/cpu/cpu{N}/online

where "N" is the last online cpu in a domain, which will
blow away an rdt_domain structure and ask kernfs to remove
some monitor files from every monitor directory.


If we change how we handle rdt_domains to

1) Not delete them when last CPU goes away (and re-use them
if they come back)
2) Have a safe way to search rdt_resource.domains for a domain
that we know is there even though another may be in the middle
of being added

Then we could probably make:

$ cat /sys/fs/resctrl/ ... /llc_occupancy

etc. not need to grab the mutex. We'd still need something
to protect against a cross processor interrupt getting in the
middle of the access to IA32_QM_EVTSEL/IA32_QM_CTR and for
MBM counters to serialize access to mbm_state ... but it would
be a lot finer granularity.

-Tony

2017-07-11 15:22:30

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 21/21] x86/intel_rdt/mbm: Handle counter overflow

On Mon, 10 Jul 2017, Luck, Tony wrote:
> On Fri, Jul 07, 2017 at 08:50:40AM +0200, Thomas Gleixner wrote:
> > Aside of that, are you really serious about serializing the world and
> > everything on a single global mutex?
>
> It would be nice to not do that, but there are challenges. At
> any instant someone else might run:
>
> # rmdir /sys/fs/resctrl/{some_control_group}
>
> and blow away the control group and all the monitor groups under
> it.
>
> Someone else might do:
>
> # echo 0 > /sys/devices/system/cpu/cpu{N}/online
>
> where "N" is the last online cpu in a domain, which will
> blow away an rdt_domain structure and ask kernfs to remove
> some monitor files from every monitor directory.
>
>
> If we change how we handle rdt_domains to
>
> 1) Not delete them when last CPU goes away (and re-use them
> if they come back)
> 2) Have a safe way to search rdt_resource.domains for a domain
> that we know is there even though another may be in the middle
> of being added
>
> Then we could probably make:
>
> $ cat /sys/fs/restrl/ ... /llc_occupancy
>
> etc. not need to grab the mutex. We'd still need something
> to protect against a cross processor interrupt geting in the
> middle of the access to IA32_QM_EVTSEL/IA32_QM_CTR and for
> MBM counters to serialize access to mbm_state ... but it would
> be a lot finer granularity.

Thanks for the explanation. Yes, that would be nice, but we can start off
with the global mutex and think about the scalability issue after we got
the functionality itself under control.

Thanks,

tglx

2017-07-11 21:16:13

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 14/21] x86/intel_rdt/cqm: Add mon_data



On Thu, 6 Jul 2017, Thomas Gleixner wrote:

> On Thu, 6 Jul 2017, Shivappa Vikas wrote:
>> On Sun, 2 Jul 2017, Thomas Gleixner wrote:
>>>> +static bool __mon_event_count(u32 rmid, struct rmid_read *rr)
>>>> +{
>>>> + u64 tval;
>>>> +
>>>> + tval = __rmid_read(rmid, rr->evtid);
>>>> + if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) {
>>>> + rr->val = tval;
>>>> + return false;
>>>> + }
>>>> + switch (rr->evtid) {
>>>> + case QOS_L3_OCCUP_EVENT_ID:
>>>> + rr->val += tval;
>>>> + return true;
>>>> + default:
>>>> + return false;
>>>
>>> I have no idea what that return code means.
>>
>> false for the invalid event id and all errors for __rmid_read. (IOW all errors
>> for __mon_event-read)
>
> Sure, but why bool? What's wrong with proper error return codes, so issues
> can be distinguished and potentially propagated in the callchain?

Ok, the error is actually propagated via rr->val. Is this better?

Hardware throws the RMID_VAL_ERROR (bit 63) when an invalid RMID or
event is written to event select - this case seems similar.

default:
rr->val = RMID_VAL_ERROR;
return -EINVAL;
}

Thanks,
Vikas

>
> Thanks,
>
> tglx
>
>
>
>
>

2017-07-11 21:37:58

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH 14/21] x86/intel_rdt/cqm: Add mon_data

On Tue, Jul 11, 2017 at 02:17:47PM -0700, Shivappa Vikas wrote:
>
>
> On Thu, 6 Jul 2017, Thomas Gleixner wrote:
>
> > On Thu, 6 Jul 2017, Shivappa Vikas wrote:
> > > On Sun, 2 Jul 2017, Thomas Gleixner wrote:
> > > > > +static bool __mon_event_count(u32 rmid, struct rmid_read *rr)
> > > > > +{
> > > > > + u64 tval;
> > > > > +
> > > > > + tval = __rmid_read(rmid, rr->evtid);
> > > > > + if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) {
> > > > > + rr->val = tval;
> > > > > + return false;
> > > > > + }
> > > > > + switch (rr->evtid) {
> > > > > + case QOS_L3_OCCUP_EVENT_ID:
> > > > > + rr->val += tval;
> > > > > + return true;
> > > > > + default:
> > > > > + return false;
> > > >
> > > > I have no idea what that return code means.
> > >
> > > false for the invalid event id and all errors for __rmid_read. (IOW all errors
> > > for __mon_event-read)
> >
> > Sure, but why bool? What's wrong with proper error return codes, so issues
> > can be distinguished and potentially propagated in the callchain?
>
> Ok, The error is propagated wih the rr->val actually. is this better?
>
> Hardware throws the RMID_VAL_ERROR (bit 63) when an invalid RMID or
> event is written to event select - this case seems similar.
>
> default:
> rr->val = RMID_VAL_ERROR;
> return -EINVAL;
> }

I'll take the blame for not documenting this better. What's going
on here is that we are calculating the sum of some list of RMIDs
(for the case where we read a mon_data/*/* file for a CTRL_MON group
that has some MON subgroups ... when reading from a MON group there
is only one RMID).

Now we might get an error reading one of those (either or both of the
RMID_VAL_ERROR and RMID_VAL_UNAVAIL bits set), in which case we can't
compute the sum, and there is no point in reading any more RMIDs.

So the return of this function is:

true: I read this RMID OK and added it to rr->val

false: I got an error. Give up. The error type is in the high bits of rr->val
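Tony's description can be sketched as a small user-space model (purely
illustrative: __rmid_read() is stubbed, the constants are simplified, and
only the return/rr->val convention from the discussion is preserved):

```c
#include <stdbool.h>
#include <stdint.h>

#define RMID_VAL_ERROR        (1ULL << 63)
#define RMID_VAL_UNAVAIL      (1ULL << 62)
#define QOS_L3_OCCUP_EVENT_ID 0x1

struct rmid_read {
	int evtid;
	uint64_t val;
};

/* Stub: pretend RMID 7 reports a hardware error, others return fixed counts. */
static uint64_t __rmid_read(uint32_t rmid, int evtid)
{
	(void)evtid;
	if (rmid == 7)
		return RMID_VAL_ERROR;
	return 100 * (uint64_t)rmid;
}

static bool __mon_event_count(uint32_t rmid, struct rmid_read *rr)
{
	uint64_t tval = __rmid_read(rmid, rr->evtid);

	if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) {
		rr->val = tval;		/* error type kept in the high bits */
		return false;		/* caller stops summing RMIDs */
	}
	switch (rr->evtid) {
	case QOS_L3_OCCUP_EVENT_ID:
		rr->val += tval;	/* accumulate into the running sum */
		return true;
	default:
		return false;		/* unknown event id */
	}
}
```

A caller iterating the RMIDs of a CTRL_MON group and its MON subgroups would
keep summing on true and bail out on the first false, reporting rr->val's
high bits as the error.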


-Tony

2017-07-11 23:53:16

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 08/21] x86/intel_rdt/cqm: Add RMID(Resource monitoring ID) management



On Mon, 3 Jul 2017, Thomas Gleixner wrote:

> On Sun, 2 Jul 2017, Thomas Gleixner wrote:
>> Thinking a bit more about that limbo mechanics.
>>
>> In case that a RMID was never used on a particular package, the state check
>> forces an IPI on all packages unconditionally. That's suboptimal at least.
>>
>> We know on which package a given RMID was used, so we could restrict the
>> checks to exactly these packages, but I'm not sure it's worth the
>> trouble. We might at least document that and explain why this is
>> implemented in that way.
>
> Second thoughts on that. The allocation logic is:
>
>> + if (list_empty(&rmid_free_lru)) {
>> + ret = try_freeing_limbo_rmid();
>> + if (list_empty(&rmid_free_lru))
>> + return ret ? -ENOSPC : -EBUSY;
>> + }
>> +
>> + entry = list_first_entry(&rmid_free_lru,
>> + struct rmid_entry, list);
>> + list_del(&entry->list);
>> +
>> + return entry->rmid;
>
> That means, the free list is used as the primary source. One of my boxes
> has 143 RMIDs. So it only takes 142 mkdir/rmdir invocations to move all
> RMIDs to the limbo list. On the next mkdir invocation the allocation goes
> into the limbo path and the SMP function call has to walk the list with 142
> entries on ALL online domains whether they used the RMID or not!

Would it be better if we do this in the MBM 1s overflow timer delayed_work? That
is not in interrupt context. So we would do a periodic flush of the limbo list and
then mkdir fails with -EBUSY if list_empty(&free_list) &&
!list_empty(&limbo_list).

To improve that, we may also include the optimization Tony suggested: skip the
checks for RMIDs which have already been seen to be < threshold (however that
needs a domain mask like I mention below, but maybe we can just check the
list here).
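The allocation rule proposed here can be sketched in miniature (a hypothetical
user-space model: counters stand in for the kernel lists, and only the
-EBUSY/-ENOSPC split from the discussion is modeled):

```c
#include <errno.h>

static int nfree;	/* RMIDs on the free list */
static int nlimbo;	/* RMIDs whose occupancy has not yet dropped */

/*
 * Never scan the limbo list inline: fail with -EBUSY when RMIDs exist
 * but are still in limbo, -ENOSPC when there are none at all.
 */
static int rmid_alloc(void)
{
	if (nfree == 0)
		return nlimbo ? -EBUSY : -ENOSPC;
	nfree--;
	return 0;	/* would return the RMID taken off the free list */
}

/*
 * Periodic work (e.g. piggybacked on the 1s overflow timer): move RMIDs
 * whose occupancy fell below the threshold back to the free list.
 */
static void limbo_flush(int now_reusable)
{
	if (now_reusable > nlimbo)
		now_reusable = nlimbo;
	nlimbo -= now_reusable;
	nfree += now_reusable;
}
```

The point of the split return codes: userspace can retry on -EBUSY (RMIDs will
reappear once their occupancy drains) but not on -ENOSPC.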

>
> That's bad enough already and the number of RMIDs will not become smaller;
> it doubled from HSW to BDW ...
>
> The HPC and RT folks will love you for that - NOT!
>
> So this needs to be solved differently.
>
> Let's have a look at the context switch path first. That's the most
> sensitive part of it.
>
> if (static_branch_likely(&rdt_mon_enable_key)) {
> if (current->rmid)
> newstate.rmid = current->rmid;
> }
>
> That's optimized for the !monitoring case. So we can really penalize the
> per task monitoring case.
>
> if (static_branch_likely(&rdt_mon_enable_key)) {
> if (unlikely(current->rmid)) {
> newstate.rmid = current->rmid;
> __set_bit(newstate.rmid, this_cpu_ptr(rmid_bitmap));
> }
> }
>
> Now in rmid_free() we can collect that information:
>
> cpumask_clear(&tmpmask);
> cpumask_clear(rmid_entry->mask);
>
> cpus_read_lock();
> for_each_online_cpu(cpu) {
> if (test_and_clear_bit(rmid, per_cpu_ptr(cpu, rmid_bitmap)))
> cpumask_set(cpu, tmpmask);
> }
>
> for_each_domain(d, resource) {
> cpu = cpumask_any_and(d->cpu_mask, tmpmask);
> if (cpu < nr_cpu_ids)
> cpumask_set(cpu, rmid_entry->mask);

When this cpu goes offline, the rmid_entry->mask needs an update. Otherwise,
the check in the work function
 if (!cpumask_test_cpu(cpu, rme->mask))
would be true,
since the work may have been moved to a different cpu.

So do we really need a package mask? Or rather a per-domain mask - and for that we
don't know the max domain number (which is why we use a list..)

> }
>
> list_add(&rmid_entry->list, &limbo_list);
>
> for_each_cpu(cpu, rmid_entry->mask)
> schedule_delayed_work_on(cpu, rmid_work);
> cpus_read_unlock();
>
> The work function:
>
> bool resched = false;
>
> list_for_each_entry(rme, limbo_list,...) {
> if (!cpumask_test_cpu(cpu, rme->mask))
> continue;
>
> if (!rmid_is_reusable(rme)) {
> resched = true;
> continue;
> }
>
> cpumask_clear_cpu(cpu, rme->mask);
> if (!cpumask_empty(rme->mask))
> continue;
>
> /* Ready for reuse */
> list_del(rme->list);
> list_add(&rme->list, &free_list);
> }
>
> The alloc function then becomes:
>
> if (list_empty(&free_list))
> return list_empty(&limbo_list) ? -ENOSPC : -EBUSY;
>
> The switch_to() covers the task rmids. The per cpu default rmids can be
> marked at the point where they are installed on a CPU in the per cpu
> rmid_bitmap. The free path is the same for per task and per cpu.
>
> Another thing which needs some thought it the CPU hotplug code. We need to
> make sure that pending work which is scheduled on an outgoing CPU is moved
> in the offline callback to a still online CPU of the same domain and not
> moved to some random CPU by the workqueue hotplug code.
>
> There is another subtle issue. Assume a RMID is freed. The limbo stuff is
> scheduled on all domains which have online CPUs.
>
> Now the last CPU of a domain goes offline before the threshold for clearing
> the domain CPU bit in the rme->mask is reached.
>
> So we have two options here:
>
> 1) Clear the bit unconditionally when the last CPU of a domain goes
> offline.
>
> 2) Arm a timer which clears the bit after a grace period
>
> #1 The RMID might become available for reuse right away because all other
> domains have not used it or have cleared their bits already.
>
> If one of the CPUs of that domain comes online again and is associated
> to that reused RMID again, then the counter content might still contain
> leftovers from the previous usage.
>
> #2 Prevents #1 but has it's own issues vs. serialization and coordination
> with CPU hotplug.
>
> I'd say we go for #1 as the simplest solution, document it and if really
> the need arises revisit it later.
>
> Thanks,
>
> tglx
>

2017-07-12 20:14:15

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 08/21] x86/intel_rdt/cqm: Add RMID(Resource monitoring ID) management

On Tue, 11 Jul 2017, Shivappa Vikas wrote:
> On Mon, 3 Jul 2017, Thomas Gleixner wrote:
> > That means, the free list is used as the primary source. One of my boxes
> > has 143 RMIDs. So it only takes 142 mkdir/rmdir invocations to move all
> > RMIDs to the limbo list. On the next mkdir invocation the allocation goes
> > into the limbo path and the SMP function call has to walk the list with 142
> > entries on ALL online domains whether they used the RMID or not!
>
> Would it be better if we do this in the MBM 1s overflow timer delayed_work?
> That is not in the interrupt context. So we do a periodic flush of the limbo
> list and then mkdir fails with -EBUSY if list_empty(&free_list) &&
> !list_empty(&limbo_list).

Well, the overflow timer is just running when MBM monitoring is active. I'd
rather avoid tying thing together which do not belong technically together.

> To improve that -
> We may also include the optimization Tony suggested to skip the checks for
> RMIDs which are already checked to be < threshold (however that needs a domain
> mask like I mention below but may be we can just check the list here).

Yes.

> >
> > for_each_domain(d, resource) {
> > cpu = cpumask_any_and(d->cpu_mask, tmpmask);
> > if (cpu < nr_cpu_ids)
> > cpumask_set(cpu, rmid_entry->mask);
>
> When this cpu goes offline, the rmid_entry->mask needs an update. Otherwise,
> the check in the work function
> if (!cpumask_test_cpu(cpu, rme->mask))
> would be true,

Sure. You need to flush the work from the cpu offline callback and then
reschedule it on another online CPU of the domain or clear the domain from the
mask when the last CPU goes offline.

> since the work may have been moved to a different cpu.
>
> So do we really need a package mask? Or rather a per-domain mask - and for that we
> don't know the max domain number (which is why we use a list..)

Well, you can assume a maximum number of domains per package and we have an
upper limit of possible packages. So sizing the mask should be trivial.
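Thomas's suggestion can be illustrated with a fixed-size mask sketch. The
bounds below (MAX_PACKAGES, MAX_DOMAINS_PER_PKG) are made-up illustrative
values, not real kernel constants, and the helpers stand in for the kernel's
bitmap API:

```c
#include <limits.h>

#define MAX_PACKAGES		8	/* assumed upper limit */
#define MAX_DOMAINS_PER_PKG	2	/* assumed upper limit */
#define MAX_DOMAINS		(MAX_PACKAGES * MAX_DOMAINS_PER_PKG)

#define BITS_PER_LONG		(sizeof(unsigned long) * CHAR_BIT)
#define DOMAIN_MASK_LONGS	((MAX_DOMAINS + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Flat bitmap carried per rmid_entry, indexed by a global domain id. */
struct rmid_entry_mask {
	unsigned long busy[DOMAIN_MASK_LONGS];
};

static int domain_id(int pkg, int dom)
{
	return pkg * MAX_DOMAINS_PER_PKG + dom;
}

static void mask_set(struct rmid_entry_mask *m, int id)
{
	m->busy[id / BITS_PER_LONG] |= 1UL << (id % BITS_PER_LONG);
}

static void mask_clear(struct rmid_entry_mask *m, int id)
{
	m->busy[id / BITS_PER_LONG] &= ~(1UL << (id % BITS_PER_LONG));
}

static int mask_empty(const struct rmid_entry_mask *m)
{
	for (unsigned int i = 0; i < DOMAIN_MASK_LONGS; i++)
		if (m->busy[i])
			return 0;
	return 1;
}
```

An RMID whose mask goes empty (every domain has cleared its bit) would be
ready to move from limbo back to the free list.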

Thanks,

tglx

2017-07-13 18:35:46

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 13/21] x86/intel_rdt/cqm: Add cpus file support



On Thu, 6 Jul 2017, Thomas Gleixner wrote:

> On Thu, 6 Jul 2017, Shivappa Vikas wrote:
>> On Sun, 2 Jul 2017, Thomas Gleixner wrote:
>>>> + /* Check whether cpus belong to parent ctrl group */
>>>> + cpumask_andnot(tmpmask, newmask, &pr->cpu_mask);
>>>> + if (cpumask_weight(tmpmask)) {
>>>> + ret = -EINVAL;
>>>> + goto out;
>>>> + }
>>>> +
>>>> + /* Check whether cpus are dropped from this group */
>>>> + cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask);
>>>> + if (cpumask_weight(tmpmask)) {
>>>> + /* Give any dropped cpus to parent rdtgroup */
>>>> + cpumask_or(&pr->cpu_mask, &pr->cpu_mask, tmpmask);
>>>
>>> This does not make any sense. The check above verifies that all cpus in
>>> newmask belong to the parent->cpu_mask. If they don't then you return
>>> -EINVAL, but here you give them back to parent->cpu_mask. How is that
>>> supposed to work? You never get into this code path!
>>
>> The parent->cpu_mask is always the parent->cpus_valid_mask if I understand
>> right. With monitor groups, a cpu is always present in "one"
>> ctrl_mon group and one mon_group. And the mon group can have only cpus of its
>> parent. Maybe it needs a comment? (it's explained in the documentation patch).
>
> Sigh, the code needs to be written in a way that it is halfway obvious
> what's going on.
>
>> # mkdir /sys/fs/resctrl/p1
>> # mkdir /sys/fs/resctrl/p1/mon_groups/m1
>> # echo 5-10 > /sys/fs/resctr/p1/cpus_list
>> Say p1 has RMID 2
>> cpus 5-10 have RMID 2
>
> So what you say, is that parent is always the resource control group
> itself.
>
> Can we please have a proper distinction in the code? I tripped over those
> ambiguities several times.
>
> The normal meaning of parent->child relations is that both have the same
> type. While this is the case at the implementation detail level (both are
> type struct rdtgroup), from a conceptual level they are different:
>
> parent is a resource group and child is a monitoring group
>
> That should be expressed in the code, at the very least by variable naming,
> so it becomes immediately clear that this operates on two different
> entities.
>
> The proper solution is to have different data types or at least embed the
> monitoring bits in a separate entity inside of struct rdtgroup.

Yes, they are conceptually different. There is data which is specific to
monitoring only, but the two kinds of groups share a lot of data, so I was still
thinking about what is best and kept a type which separates them both. But the
monitoring-only data seems to be just the 'parent', so we can embed the monitoring
bits in a separate struct (the parent is initialized for a ctrl_mon group but
never really used).
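One possible shape for that split, as a sketch only (the struct and field
names here are illustrative, not taken from the actual patches):

```c
#include <stdint.h>

struct rdtgroup;

/* Monitoring-only bits, split out into their own entity. */
struct mongroup {
	struct rdtgroup *parent;	/* owning CTRL_MON group; NULL for ctrl groups */
	uint32_t rmid;			/* resource monitoring id */
};

/* Both CTRL_MON groups and MON groups remain struct rdtgroup,
 * but the monitoring state is clearly delimited. */
struct rdtgroup {
	uint32_t closid;		/* control-side allocation id */
	struct mongroup mon;		/* embedded monitoring state */
};
```

With this layout, code paths that only touch monitoring operate on
rdtgrp->mon, which makes the ctrl-group/mon-group distinction visible at
every use site.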

Thanks,
Vikas

2017-07-13 22:07:58

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 13/21] x86/intel_rdt/cqm: Add cpus file support



On Sun, 2 Jul 2017, Thomas Gleixner wrote:

>> {
>> + struct rdtgroup *pr = rdtgrp->parent, *cr;
>
> *pr and *cr really suck.

We used 'r' before for rdtgroup, so 'pr' would be the parent rdtgrp. Wanted to
keep them short as there are more in this function.

prgrp can be used if that's not ok?