What is Cache Allocation Technology (CAT)
-------------------------------------------
Cache Allocation Technology provides a way for software (OS/VMM) to
restrict cache allocation to a defined 'subset' of the cache, which may
overlap with other 'subsets'. The restriction applies when allocating a
line in the cache, i.e. when pulling new data into the cache. The
hardware is programmed via MSRs.
The different cache subsets are identified by a CLOS (class of service)
identifier and each CLOS has a CBM (cache bit mask). The CBM is a
contiguous set of bits which defines the amount of cache resource that
is available to each 'subset'.
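For illustration only, here is a minimal sketch of the MSR programming
this patch relies on. The MSR addresses match the definitions in
cacheqe.h below; the helper names are made up for the example and are
not part of the patch. Each CLOS has its own cache bit mask MSR
starting at IA32_L3_CBM_BASE, and the CLOS used by a logical CPU is
selected through bits 32-63 of IA32_PQR_ASSOC.

  #include <asm/msr.h>

  /* MSR addresses as defined in cacheqe.h below. */
  #define IA32_PQR_ASSOC      0x0c8f  /* bits 0-9: RMID, bits 32-63: CLOS */
  #define IA32_L3_CBM_BASE    0x0c90  /* one cache bit mask MSR per CLOS */

  /* Illustrative helper: program the bit mask for one class of service. */
  static void example_set_cbm(unsigned int closid, u64 cbm)
  {
          wrmsrl(IA32_L3_CBM_BASE + closid, cbm);
  }

  /*
   * Illustrative helper: select the class of service on the current CPU.
   * The RMID in the low bits is written as zero here for simplicity;
   * real code has to preserve it.
   */
  static void example_set_clos(unsigned int closid)
  {
          wrmsrl(IA32_PQR_ASSOC, (u64)closid << 32);
  }
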
Why is CAT (cache allocation technology) needed
------------------------------------------------
CAT enables more cache resources to be made available to higher
priority applications, based on guidance from the execution
environment.
The architecture also allows these subsets to be changed dynamically at
runtime, to further optimize the performance of the higher priority
application with minimal degradation to the lower priority ones.
Additionally, resources can be rebalanced for system throughput benefit.
This technique may be useful in managing large computer systems with a
large LLC, for example large servers running instances of web servers
or database servers. In such complex systems, these subsets can be used
for more careful placement of the available cache resources.
The CAT kernel patch provides a basic kernel framework for users to
implement such cache subsets.
Kernel Implementation
---------------------
This patch implements a cgroup subsystem to support cache allocation.
Each cgroup has a CLOSid <-> CBM (cache bit mask) mapping. A CLOS
(class of service) is represented by a CLOSid. The CLOSid is internal
to the kernel and not exposed to the user. Each cgroup has one CBM and
represents one cache 'subset'.
The cgroup follows the cgroup hierarchy; mkdir and adding tasks to a
cgroup never fail. When a child cgroup is created it inherits the
CLOSid and the CBM from its parent. When a user changes the default
CBM for a cgroup, a new CLOSid may be allocated if the CBM was not in
use before. Changing the 'cbm' may fail with -ENOSPC once the kernel
runs out of CLOSids.
Users can create as many cgroups as they want, but the number of
different CBMs in use at the same time is limited by the maximum number
of CLOSids (multiple cgroups can have the same CBM).
The kernel maintains the CLOSid <-> CBM mapping, which keeps a
reference count of the cgroups using each CLOSid.
The tasks in a cgroup get to fill the part of the LLC represented by
the cgroup's 'cbm' file.
The root directory has all available bits set in its 'cbm' file by
default.
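As a usage illustration only (the mount point, cgroup name and exact
file name below are assumptions that depend on how the cacheqe
hierarchy is mounted), a group's mask could be set from user space by
writing to its 'cbm' file, for example:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          /* Assumed mount point and group name, for illustration only. */
          int fd = open("/sys/fs/cgroup/cacheqe/hipri/cacheqe.cbm", O_WRONLY);

          if (fd < 0) {
                  perror("open");
                  return 1;
          }

          /* Restrict the group to the low four bits of the cache bit mask. */
          if (write(fd, "0xf", 3) != 3)
                  perror("write");

          close(fd);
          return 0;
  }
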
Assignment of CBM, CLOS
---------------------------------
The 'cbm' needs to be a subset of the parent node's 'cbm'. Any
contiguous subset of these bits (with a minimum of 2 bits) may be set
to indicate the cache mapping desired. The 'cbm' of two directories can
overlap. The 'cbm' represents the cache 'subset' of the CAT cgroup.
For example, on a system with a maximum of 16 cbm bits, if the
directory has the least significant 4 bits set in its 'cbm' file
(meaning the 'cbm' is just 0xf), it is allocated the right quarter of
the last level cache, which means the tasks belonging to this CAT
cgroup can fill only that quarter of the cache. If it has the most
significant 8 bits set, it is allocated the left half of the cache
(8 bits out of 16 represent 50%).
The cache portion defined in the 'cbm' file is available to all tasks
within the cgroup to fill, and these tasks are not allowed to allocate
space in other parts of the cache.
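A simplified sketch of the 'cbm' validity rules above, using generic
bitmap helpers (the patch below open-codes the minimum-bits and
contiguity tests; the helper name and signature here are illustrative):

  #include <linux/types.h>
  #include <linux/bitops.h>
  #include <linux/bitmap.h>

  /* 'bits' is the CBM length reported by CPUID. */
  static bool example_cbm_valid(const unsigned long *cbm,
                                const unsigned long *parent_cbm,
                                unsigned int bits)
  {
          unsigned int first, last, weight;

          weight = bitmap_weight(cbm, bits);
          if (weight < 2)                 /* at least 2 bits must be set */
                  return false;

          first = find_first_bit(cbm, bits);
          last = find_last_bit(cbm, bits);
          if (last - first + 1 != weight) /* bits must be contiguous */
                  return false;

          /* Must be a subset of the parent's mask. */
          return bitmap_subset(cbm, parent_cbm, bits);
  }
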
Scheduling and Context Switch
------------------------------
During a context switch, the kernel writes the CLOSid (maintained
internally by the kernel) of the cgroup to which the incoming task
belongs into the CPU's IA32_PQR_ASSOC MSR.
Reviewed-by: Matt Flemming <[email protected]>
Tested-by: Priya Autee <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/include/asm/cacheqe.h | 144 +++++++++++
arch/x86/include/asm/cpufeature.h | 4 +
arch/x86/include/asm/processor.h | 5 +-
arch/x86/kernel/cpu/Makefile | 5 +
arch/x86/kernel/cpu/cacheqe.c | 487 ++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/common.c | 21 ++
include/linux/cgroup_subsys.h | 5 +
init/Kconfig | 22 ++
kernel/sched/core.c | 4 +-
kernel/sched/sched.h | 24 ++
10 files changed, 718 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/include/asm/cacheqe.h
create mode 100644 arch/x86/kernel/cpu/cacheqe.c
diff --git a/arch/x86/include/asm/cacheqe.h b/arch/x86/include/asm/cacheqe.h
new file mode 100644
index 0000000..91d175e
--- /dev/null
+++ b/arch/x86/include/asm/cacheqe.h
@@ -0,0 +1,144 @@
+#ifndef _CACHEQE_H_
+#define _CACHEQE_H_
+
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/percpu.h>
+#include <linux/spinlock.h>
+#include <linux/cpumask.h>
+#include <linux/seq_file.h>
+#include <linux/rcupdate.h>
+#include <linux/kernel_stat.h>
+#include <linux/err.h>
+
+#ifdef CONFIG_CGROUP_CACHEQE
+
+#define IA32_PQR_ASSOC 0xc8f
+#define IA32_PQR_MASK(x) (x << 32)
+
+/* maximum possible cbm length */
+#define MAX_CBM_LENGTH 32
+
+#define IA32_CBMMAX_MASK(x) (0xffffffff & (~((u64)(1 << x) - 1)))
+
+#define IA32_CBM_MASK 0xffffffff
+#define IA32_L3_CBM_BASE 0xc90
+#define CQECBMMSR(x) (IA32_L3_CBM_BASE + x)
+
+#ifdef CONFIG_CACHEQE_DEBUG
+#define CQE_DEBUG(X) do { pr_info X; } while (0)
+#else
+#define CQE_DEBUG(X)
+#endif
+
+extern bool cqe_genable;
+
+struct cacheqe_subsys_info {
+ unsigned long *closmap;
+};
+
+struct cacheqe {
+ struct cgroup_subsys_state css;
+
+ /* class of service for the group*/
+ unsigned int clos;
+ /* corresponding cache bit mask*/
+ unsigned long *cbm;
+
+};
+
+struct closcbm_map {
+ unsigned long cbm;
+ unsigned int ref;
+};
+
+extern struct cacheqe root_cqe_group;
+
+/*
+ * Return cacheqos group corresponding to this container.
+ */
+static inline struct cacheqe *css_cacheqe(struct cgroup_subsys_state *css)
+{
+ return css ? container_of(css, struct cacheqe, css) : NULL;
+}
+
+static inline struct cacheqe *parent_cqe(struct cacheqe *cq)
+{
+ return css_cacheqe(cq->css.parent);
+}
+
+/*
+ * Return cacheqe group to which this task belongs.
+ */
+static inline struct cacheqe *task_cacheqe(struct task_struct *task)
+{
+ return css_cacheqe(task_css(task, cacheqe_cgrp_id));
+}
+
+static inline void cacheqe_sched_in(struct task_struct *task)
+{
+ struct cacheqe *cq;
+ unsigned int clos;
+ unsigned int l, h;
+
+ if (!cqe_genable)
+ return;
+
+ rdmsr(IA32_PQR_ASSOC, l, h);
+
+ rcu_read_lock();
+ cq = task_cacheqe(task);
+
+ if (cq == NULL || cq->clos == h) {
+ rcu_read_unlock();
+ return;
+ }
+
+ clos = cq->clos;
+
+ /*
+ * After finding the cacheqe of the task , write the PQR for the proc.
+ * We are assuming the current core is the one its scheduled to.
+ * In unified scheduling , write the PQR each time.
+ */
+ wrmsr(IA32_PQR_ASSOC, l, clos);
+ rcu_read_unlock();
+
+ CQE_DEBUG(("schedule in clos :0x%x,task cpu:%u, currcpu: %u,pid:%u\n",
+ clos, task_cpu(task), smp_processor_id(), task->pid));
+
+}
+
+static inline void cacheqe_sched_out(struct task_struct *task)
+{
+ unsigned int l, h;
+
+ if (!cqe_genable)
+ return;
+
+ rdmsr(IA32_PQR_ASSOC, l, h);
+
+ if (h == 0)
+ return;
+
+ /*
+ *After finding the cacheqe of the task , write the PQR for the proc.
+ * We are assuming the current core is the one its scheduled to.
+ * Write zero when scheduling out so that we get a more accurate
+ * cache allocation.
+ */
+
+ wrmsr(IA32_PQR_ASSOC, l, 0);
+
+ CQE_DEBUG(("schedule out done cpu :%u,curr cpu:%u, pid:%u\n",
+ task_cpu(task), smp_processor_id(), task->pid));
+
+}
+
+#else
+static inline void cacheqe_sched_in(struct task_struct *task) {}
+
+static inline void cacheqe_sched_out(struct task_struct *task) {}
+
+#endif
+#endif
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 0bb1335..21290ac 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -221,6 +221,7 @@
#define X86_FEATURE_INVPCID ( 9*32+10) /* Invalidate Processor Context ID */
#define X86_FEATURE_RTM ( 9*32+11) /* Restricted Transactional Memory */
#define X86_FEATURE_MPX ( 9*32+14) /* Memory Protection Extension */
+#define X86_FEATURE_CQE (9*32+15) /* Cache QOS Enforcement */
#define X86_FEATURE_AVX512F ( 9*32+16) /* AVX-512 Foundation */
#define X86_FEATURE_RDSEED ( 9*32+18) /* The RDSEED instruction */
#define X86_FEATURE_ADX ( 9*32+19) /* The ADCX and ADOX instructions */
@@ -236,6 +237,9 @@
#define X86_FEATURE_XGETBV1 (10*32+ 2) /* XGETBV with ECX = 1 */
#define X86_FEATURE_XSAVES (10*32+ 3) /* XSAVES/XRSTORS */
+/* Intel-defined CPU features, CPUID level 0x0000000A:0 (ebx), word 10 */
+#define X86_FEATURE_CQE_L3 (10*32 + 1)
+
/*
* BUG word(s)
*/
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index eb71ec7..6be953f 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -111,8 +111,11 @@ struct cpuinfo_x86 {
int x86_cache_alignment; /* In bytes */
int x86_power;
unsigned long loops_per_jiffy;
+ /* Cache QOS Enforement values */
+ int x86_cqe_cbmlength;
+ int x86_cqe_closs;
/* cpuid returned max cores value: */
- u16 x86_max_cores;
+ u16 x86_max_cores;
u16 apicid;
u16 initial_apicid;
u16 x86_clflush_size;
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index e27b49d..c2b0a6b 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -8,6 +8,10 @@ CFLAGS_REMOVE_common.o = -pg
CFLAGS_REMOVE_perf_event.o = -pg
endif
+ifdef CONFIG_CACHEQE_DEBUG
+CFLAGS_cacheqe.o := -DDEBUG
+endif
+
# Make sure load_percpu_segment has no stackprotector
nostackp := $(call cc-option, -fno-stack-protector)
CFLAGS_common.o := $(nostackp)
@@ -47,6 +51,7 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += perf_event_intel_uncore.o \
perf_event_intel_uncore_nhmex.o
endif
+obj-$(CONFIG_CGROUP_CACHEQE) += cacheqe.o
obj-$(CONFIG_X86_MCE) += mcheck/
obj-$(CONFIG_MTRR) += mtrr/
diff --git a/arch/x86/kernel/cpu/cacheqe.c b/arch/x86/kernel/cpu/cacheqe.c
new file mode 100644
index 0000000..2ac3d4e
--- /dev/null
+++ b/arch/x86/kernel/cpu/cacheqe.c
@@ -0,0 +1,487 @@
+
+/*
+ * kernel/cacheqe.c
+ *
+ * Processor Cache Allocation code
+ * (Also called cache quality enforcement - cqe)
+ *
+ * Copyright (c) 2014, Intel Corporation.
+ *
+ * 2014-10-15 Written by Vikas Shivappa
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <asm/cacheqe.h>
+
+struct cacheqe root_cqe_group;
+static DEFINE_MUTEX(cqe_group_mutex);
+
+bool cqe_genable;
+
+/* ccmap maintains 1:1 mapping between CLOSid and cbm.*/
+
+static struct closcbm_map *ccmap;
+static struct cacheqe_subsys_info *cqess_info;
+
+char hsw_brandstrs[5][64] = {
+ "Intel(R) Xeon(R) CPU E5-2658 v3 @ 2.20GHz",
+ "Intel(R) Xeon(R) CPU E5-2648L v3 @ 1.80GHz",
+ "Intel(R) Xeon(R) CPU E5-2628L v3 @ 2.00GHz",
+ "Intel(R) Xeon(R) CPU E5-2618L v3 @ 2.30GHz",
+ "Intel(R) Xeon(R) CPU E5-2608L v3 @ 2.00GHz"
+};
+
+#define cacheqe_for_each_child(child_cq, pos_css, parent_cq) \
+ css_for_each_child((pos_css), \
+ &(parent_cq)->css)
+
+#if CONFIG_CACHEQE_DEBUG
+
+/*DUMP the closid-cbm map.*/
+
+static inline void cbmmap_dump(void)
+{
+
+ int i;
+
+ pr_debug("CBMMAP\n");
+ for (i = 0; i < boot_cpu_data.x86_cqe_closs; i++)
+ pr_debug("cbm: 0x%x,ref: %u\n",
+ (unsigned int)ccmap[i].cbm, ccmap[i].ref);
+
+}
+
+#else
+
+static inline void cbmmap_dump(void) {}
+
+#endif
+
+static inline bool cqe_enabled(struct cpuinfo_x86 *c)
+{
+
+ int i;
+
+ if (cpu_has(c, X86_FEATURE_CQE_L3))
+ return true;
+
+ /*
+ * Hard code the checks and values for HSW SKUs.
+ * Unfortunately! have to check against only these brand name strings.
+ */
+
+ for (i = 0; i < 5; i++)
+ if (!strcmp(hsw_brandstrs[i], c->x86_model_id)) {
+ c->x86_cqe_closs = 4;
+ c->x86_cqe_cbmlength = 20;
+ return true;
+ }
+
+ return false;
+
+}
+
+
+static int __init cqe_late_init(void)
+{
+
+ struct cpuinfo_x86 *c = &boot_cpu_data;
+ size_t sizeb;
+ int maxid = boot_cpu_data.x86_cqe_closs;
+
+ cqe_genable = false;
+
+ /*
+ * Need the cqe_genable hint helps decide if the
+ * kernel has enabled cache allocation.
+ */
+
+ if (!cqe_enabled(c)) {
+
+ root_cqe_group.css.ss->disabled = 1;
+ return -ENODEV;
+
+ } else {
+
+ cqess_info =
+ kzalloc(sizeof(struct cacheqe_subsys_info),
+ GFP_KERNEL);
+
+ if (!cqess_info)
+ return -ENOMEM;
+
+ sizeb = BITS_TO_LONGS(c->x86_cqe_closs) * sizeof(long);
+ cqess_info->closmap =
+ kzalloc(sizeb, GFP_KERNEL);
+
+ if (!cqess_info->closmap) {
+ kfree(cqess_info);
+ return -ENOMEM;
+ }
+
+ sizeb = maxid * sizeof(struct closcbm_map);
+ ccmap = kzalloc(sizeb, GFP_KERNEL);
+
+ if (!ccmap)
+ return -ENOMEM;
+
+ /* Allocate the CLOS for root.*/
+ set_bit(0, cqess_info->closmap);
+ root_cqe_group.clos = 0;
+
+ /*
+ * The cbmlength expected be atleast 1.
+ * All bits are set for the root cbm.
+ */
+
+ ccmap[root_cqe_group.clos].cbm =
+ (u32)((u64)(1 << c->x86_cqe_cbmlength) - 1);
+ root_cqe_group.cbm = &ccmap[root_cqe_group.clos].cbm;
+ ccmap[root_cqe_group.clos].ref++;
+
+ barrier();
+ cqe_genable = true;
+
+ pr_info("CQE enabled cbmlength is %u\ncqe Closs : %u ",
+ c->x86_cqe_cbmlength, c->x86_cqe_closs);
+
+ }
+
+ return 0;
+
+}
+
+late_initcall(cqe_late_init);
+
+/*
+ * Allocates a new closid from unused list of closids.
+ * Called with the cqe_group_mutex held.
+ */
+
+static int cqe_alloc_closid(struct cacheqe *cq)
+{
+ unsigned int tempid;
+ unsigned int maxid;
+ int err;
+
+ maxid = boot_cpu_data.x86_cqe_closs;
+
+ tempid = find_next_zero_bit(cqess_info->closmap, maxid, 0);
+
+ if (tempid == maxid) {
+ err = -ENOSPC;
+ goto closidallocfail;
+ }
+
+ set_bit(tempid, cqess_info->closmap);
+ ccmap[tempid].ref++;
+ cq->clos = tempid;
+
+ pr_debug("cqe : Allocated a directory.closid:%u\n", cq->clos);
+
+ return 0;
+
+closidallocfail:
+
+ return err;
+
+}
+
+/*
+* Called with the cqe_group_mutex held.
+*/
+
+static void cqe_free_closid(struct cacheqe *cq)
+{
+
+ pr_debug("cqe :Freeing closid:%u\n", cq->clos);
+
+ ccmap[cq->clos].ref--;
+
+ if (!ccmap[cq->clos].ref)
+ clear_bit(cq->clos, cqess_info->closmap);
+
+ return;
+
+}
+
+/* Create a new cacheqe cgroup.*/
+static struct cgroup_subsys_state *
+cqe_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+ struct cacheqe *parent = css_cacheqe(parent_css);
+ struct cacheqe *cq;
+
+ /* This is the call before the feature is detected */
+ if (!parent) {
+ root_cqe_group.clos = 0;
+ return &root_cqe_group.css;
+ }
+
+ /* To check if cqe is enabled.*/
+ if (!cqe_genable)
+ return ERR_PTR(-ENODEV);
+
+ cq = kzalloc(sizeof(struct cacheqe), GFP_KERNEL);
+ if (!cq)
+ return ERR_PTR(-ENOMEM);
+
+ /*
+ * Child inherits the ClosId and cbm from parent.
+ */
+
+ cq->clos = parent->clos;
+ mutex_lock(&cqe_group_mutex);
+ ccmap[parent->clos].ref++;
+ mutex_unlock(&cqe_group_mutex);
+
+ cq->cbm = parent->cbm;
+
+ pr_debug("cqe : Allocated cgroup closid:%u,ref:%u\n",
+ cq->clos, ccmap[parent->clos].ref);
+
+ return &cq->css;
+
+}
+
+/* Destroy an existing CAT cgroup.*/
+static void cqe_css_free(struct cgroup_subsys_state *css)
+{
+ struct cacheqe *cq = css_cacheqe(css);
+ int len = boot_cpu_data.x86_cqe_cbmlength;
+
+ pr_debug("cqe : In cacheqe_css_free\n");
+
+ mutex_lock(&cqe_group_mutex);
+
+ /* Reset the CBM for the cgroup.Should be all 1s by default !*/
+
+ wrmsrl(CQECBMMSR(cq->clos), ((1 << len) - 1));
+ cqe_free_closid(cq);
+ kfree(cq);
+
+ mutex_unlock(&cqe_group_mutex);
+
+}
+
+/*
+ * Called during do_exit() syscall during a task exit.
+ * This assumes that the thread is running on the current
+ * cpu.
+ */
+
+static void cqe_exit(struct cgroup_subsys_state *css,
+ struct cgroup_subsys_state *old_css,
+ struct task_struct *task)
+{
+
+ cacheqe_sched_out(task);
+
+}
+
+static inline bool cbm_minbits(unsigned long var)
+{
+
+ unsigned long i;
+
+ /*Minimum of 2 bits must be set.*/
+
+ i = var & (var - 1);
+ if (!i || !var)
+ return false;
+
+ return true;
+
+}
+
+/*
+ * Tests if only contiguous bits are set.
+ */
+
+static inline bool cbm_iscontiguous(unsigned long var)
+{
+
+ unsigned long i;
+
+ /* Reset the least significant bit.*/
+ i = var & (var - 1);
+
+ /*
+ * We would have a set of non-contiguous bits when
+ * there is at least one zero
+ * between the most significant 1 and least significant 1.
+ * In the below '&' operation,(var <<1) would have zero in
+ * at least 1 bit position in var apart from least
+ * significant bit if it does not have contiguous bits.
+ * Multiple sets of contiguous bits wont succeed in the below
+ * case as well.
+ */
+
+ if (i != (var & (var << 1)))
+ return false;
+
+ return true;
+
+}
+
+static int cqe_cbm_read(struct seq_file *m, void *v)
+{
+ struct cacheqe *cq = css_cacheqe(seq_css(m));
+
+ pr_debug("cqe : In cqe_cqemode_read\n");
+ seq_printf(m, "0x%x\n", (unsigned int)*(cq->cbm));
+
+ return 0;
+
+}
+
+static int validate_cbm(struct cacheqe *cq, unsigned long cbmvalue)
+{
+ struct cacheqe *par, *c;
+ struct cgroup_subsys_state *css;
+
+ if (!cbm_minbits(cbmvalue) || !cbm_iscontiguous(cbmvalue)) {
+ pr_info("CQE error: minimum bits not set or non contiguous mask\n");
+ return -EINVAL;
+ }
+
+ /*
+ * Needs to be a subset of its parent.
+ */
+ par = parent_cqe(cq);
+
+ if (!bitmap_subset(&cbmvalue, par->cbm, MAX_CBM_LENGTH))
+ return -EINVAL;
+
+ rcu_read_lock();
+
+ /*
+ * Each of children should be a subset of the mask.
+ */
+
+ cacheqe_for_each_child(c, css, cq) {
+ c = css_cacheqe(css);
+ if (!bitmap_subset(c->cbm, &cbmvalue, MAX_CBM_LENGTH)) {
+ pr_debug("cqe : Children's cbm not a subset\n");
+ return -EINVAL;
+ }
+ }
+
+ rcu_read_unlock();
+
+ return 0;
+
+}
+
+static bool cbm_search(unsigned long cbm, int *closid)
+{
+
+ int maxid = boot_cpu_data.x86_cqe_closs;
+ unsigned int i;
+
+ for (i = 0; i < maxid; i++)
+ if (bitmap_equal(&cbm, &ccmap[i].cbm, MAX_CBM_LENGTH)) {
+ *closid = i;
+ return true;
+ }
+
+ return false;
+
+}
+
+static int cqe_cbm_write(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 cbmvalue)
+{
+ struct cacheqe *cq = css_cacheqe(css);
+ ssize_t err = 0;
+ unsigned long cbm;
+ unsigned int closid;
+
+ pr_debug("cqe : In cqe_cbm_write\n");
+
+ if (!cqe_genable)
+ return -ENODEV;
+
+ if (cq == &root_cqe_group || !cq)
+ return -EPERM;
+
+ /*
+ * Need global mutex as cbm write may allocate the closid.
+ */
+
+ mutex_lock(&cqe_group_mutex);
+ cbm = (cbmvalue & IA32_CBM_MASK);
+
+ if (bitmap_equal(&cbm, cq->cbm, MAX_CBM_LENGTH))
+ goto cbmwriteend;
+
+ err = validate_cbm(cq, cbm);
+ if (err)
+ goto cbmwriteend;
+
+ /*
+ * Need to assign a CLOSid to the cgroup
+ * if it has a new cbm , or reuse.
+ * This takes care to allocate only
+ * the number of CLOSs available.
+ */
+
+ cqe_free_closid(cq);
+
+ if (cbm_search(cbm, &closid)) {
+ cq->clos = closid;
+ ccmap[cq->clos].ref++;
+
+ } else {
+
+ err = cqe_alloc_closid(cq);
+
+ if (err)
+ goto cbmwriteend;
+
+ wrmsrl(CQECBMMSR(cq->clos), cbm);
+
+ }
+
+ /*
+ * Finally store the cbm in cbm map
+ * and store a reference in the cq.
+ */
+
+ ccmap[cq->clos].cbm = cbm;
+ cq->cbm = &ccmap[cq->clos].cbm;
+
+ cbmmap_dump();
+
+cbmwriteend:
+
+ mutex_unlock(&cqe_group_mutex);
+ return err;
+
+}
+
+static struct cftype cqe_files[] = {
+ {
+ .name = "cbm",
+ .seq_show = cqe_cbm_read,
+ .write_u64 = cqe_cbm_write,
+ .mode = 0666,
+ },
+ { } /* terminate */
+};
+
+struct cgroup_subsys cacheqe_cgrp_subsys = {
+ .name = "cacheqe",
+ .css_alloc = cqe_css_alloc,
+ .css_free = cqe_css_free,
+ .exit = cqe_exit,
+ .base_cftypes = cqe_files,
+};
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 4b4f78c..a9b277a 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -633,6 +633,27 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
c->x86_capability[9] = ebx;
}
+/* Additional Intel-defined flags: level 0x00000010 */
+ if (c->cpuid_level >= 0x00000010) {
+ u32 eax, ebx, ecx, edx;
+
+ cpuid_count(0x00000010, 0, &eax, &ebx, &ecx, &edx);
+
+ c->x86_capability[10] = ebx;
+
+ if (cpu_has(c, X86_FEATURE_CQE_L3)) {
+
+ u32 eax, ebx, ecx, edx;
+
+ cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx);
+
+ c->x86_cqe_closs = (edx & 0xffff) + 1;
+ c->x86_cqe_cbmlength = (eax & 0xf) + 1;
+
+ }
+
+ }
+
/* Extended state features: level 0x0000000d */
if (c->cpuid_level >= 0x0000000d) {
u32 eax, ebx, ecx, edx;
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 98c4f9b..a131c1e 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -53,6 +53,11 @@ SUBSYS(hugetlb)
#if IS_ENABLED(CONFIG_CGROUP_DEBUG)
SUBSYS(debug)
#endif
+
+#if IS_ENABLED(CONFIG_CGROUP_CACHEQE)
+SUBSYS(cacheqe)
+#endif
+
/*
* DO NOT ADD ANY SUBSYSTEM WITHOUT EXPLICIT ACKS FROM CGROUP MAINTAINERS.
*/
diff --git a/init/Kconfig b/init/Kconfig
index 2081a4d..bec92a4 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -968,6 +968,28 @@ config CPUSETS
Say N if unsure.
+config CGROUP_CACHEQE
+ bool "Cache QoS Enforcement cgroup subsystem"
+ depends on X86 || X86_64
+ help
+ This option provides framework to allocate Cache cache lines when
+ applications fill cache.
+ This can be used by users to configure how much cache that can be
+ allocated to different PIDs.
+
+ Say N if unsure.
+
+config CACHEQE_DEBUG
+ bool "Cache QoS Enforcement cgroup subsystem debug"
+ depends on X86 || X86_64
+ help
+ This option provides framework to allocate Cache cache lines when
+ applications fill cache.
+ This can be used by users to configure how much cache that can be
+ allocated to different PIDs.Enables debug
+
+ Say N if unsure.
+
config PROC_PID_CPUSET
bool "Include legacy /proc/<pid>/cpuset file"
depends on CPUSETS
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 240157c..afa2897 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2215,7 +2215,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
perf_event_task_sched_out(prev, next);
fire_sched_out_preempt_notifiers(prev, next);
prepare_lock_switch(rq, next);
- prepare_arch_switch(next);
+ prepare_arch_switch(prev);
}
/**
@@ -2254,7 +2254,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
*/
prev_state = prev->state;
vtime_task_switch(prev);
- finish_arch_switch(prev);
+ finish_arch_switch(current);
perf_event_task_sched_in(prev, current);
finish_lock_switch(rq, prev);
finish_arch_post_lock_switch();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24156c84..79b9ff6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -965,12 +965,36 @@ static inline int task_on_rq_migrating(struct task_struct *p)
return p->on_rq == TASK_ON_RQ_MIGRATING;
}
+#ifdef CONFIG_X86_64
+#ifdef CONFIG_CGROUP_CACHEQE
+
+#include <asm/cacheqe.h>
+
+# define prepare_arch_switch(prev) cacheqe_sched_out(prev)
+# define finish_arch_switch(current) cacheqe_sched_in(current)
+
+#else
+
#ifndef prepare_arch_switch
# define prepare_arch_switch(next) do { } while (0)
#endif
#ifndef finish_arch_switch
# define finish_arch_switch(prev) do { } while (0)
#endif
+
+#endif
+#else
+
+#ifndef prepare_arch_switch
+# define prepare_arch_switch(prev) do { } while (0)
+#endif
+
+#ifndef finish_arch_switch
+# define finish_arch_switch(current) do { } while (0)
+#endif
+
+#endif
+
#ifndef finish_arch_post_lock_switch
# define finish_arch_post_lock_switch() do { } while (0)
#endif
--
1.9.1
On Wed, 19 Nov 2014, Vikas Shivappa wrote:
> +/* maximum possible cbm length */
> +#define MAX_CBM_LENGTH 32
> +
> +#define IA32_CBMMAX_MASK(x) (0xffffffff & (~((u64)(1 << x) - 1)))
Unused define.
> +
> +#define IA32_CBM_MASK 0xffffffff
(~0U) ?
> @@ -0,0 +1,144 @@
> +static inline void cacheqe_sched_in(struct task_struct *task)
> +{
> + struct cacheqe *cq;
> + unsigned int clos;
> + unsigned int l, h;
These variable names suck. You say in the comment below:
write the PQR for the proc.
So what is it now? clos or pqr or what?
IA32_PQR_ASSOC is the MSR name and bit 0-9 is RMID and bit 32-63 is
COS
But you create random names just because it makes it harder to map the
code to the SDM or is there some other sensible reason I'm missing
here?
> +
> + if (!cqe_genable)
> + return;
This wants to be a static key.
> + rdmsr(IA32_PQR_ASSOC, l, h);
Why on earth do we want to read an MSR on every context switch? What's
wrong with having
DEFINE_PER_CPU(u64, cacheqe_pqr);
u64 next_pqr, cur_pqr = this_cpu_read(cacheqe_pqr);
cq = task_cacheqe(task);
next_pqr = cq ? cq->pqr : 0;
if (next_pqr != cur_pqr) {
wrmsrl(IA32_PQR_ASSOC, next_pqr);
this_cpu_write(cacheqe_pqr, next_pqr);
}
That saves the whole idiocity of reading MSRs on schedule in and
schedule out. In fact it makes the schedule_out part completely
superfluous.
> + rcu_read_lock();
> + cq = task_cacheqe(task);
> +
> + if (cq == NULL || cq->clos == h) {
> + rcu_read_unlock();
> + return;
> + }
> +
> + clos = cq->clos;
> +
> + /*
> + * After finding the cacheqe of the task , write the PQR for the proc.
> + * We are assuming the current core is the one its scheduled to.
You are in the middle of the scheduler. So what needs to be assumed
here?
> + * In unified scheduling , write the PQR each time.
What the heck is unified scheduling?
> + */
> + wrmsr(IA32_PQR_ASSOC, l, clos);
> + rcu_read_unlock();
> +
> + CQE_DEBUG(("schedule in clos :0x%x,task cpu:%u, currcpu: %u,pid:%u\n",
> + clos, task_cpu(task), smp_processor_id(), task->pid));
A printk in the guts of the sched_switch function? Ever heard about
trace points?
> +
> +}
> +
> +static inline void cacheqe_sched_out(struct task_struct *task)
> +{
> + unsigned int l, h;
> +
> + if (!cqe_genable)
> + return;
> +
> + rdmsr(IA32_PQR_ASSOC, l, h);
> +
> + if (h == 0)
> + return;
> +
> + /*
> + *After finding the cacheqe of the task , write the PQR for the proc.
> + * We are assuming the current core is the one its scheduled to.
> + * Write zero when scheduling out so that we get a more accurate
> + * cache allocation.
This comment makes no sense at all. But well, this does not make sense
anyway.
> +ifdef CONFIG_CACHEQE_DEBUG
> +CFLAGS_cacheqe.o := -DDEBUG
> +endif
Certainly NOT!
> +++ b/arch/x86/kernel/cpu/cacheqe.c
> @@ -0,0 +1,487 @@
> +
> +/*
> + * kernel/cacheqe.c
What's the point of the file name here? Especially if the file is not
there at all.
> +#include <asm/cacheqe.h>
> +
> +struct cacheqe root_cqe_group;
> +static DEFINE_MUTEX(cqe_group_mutex);
> +
> +bool cqe_genable;
This wants to be readmostly, if it's a global, but we really want a
static key for this.
> +
> +/* ccmap maintains 1:1 mapping between CLOSid and cbm.*/
> +
> +static struct closcbm_map *ccmap;
> +static struct cacheqe_subsys_info *cqess_info;
> +
> +char hsw_brandstrs[5][64] = {
> + "Intel(R) Xeon(R) CPU E5-2658 v3 @ 2.20GHz",
> + "Intel(R) Xeon(R) CPU E5-2648L v3 @ 1.80GHz",
> + "Intel(R) Xeon(R) CPU E5-2628L v3 @ 2.00GHz",
> + "Intel(R) Xeon(R) CPU E5-2618L v3 @ 2.30GHz",
> + "Intel(R) Xeon(R) CPU E5-2608L v3 @ 2.00GHz"
> +};
> +
> +#define cacheqe_for_each_child(child_cq, pos_css, parent_cq) \
> + css_for_each_child((pos_css), \
> + &(parent_cq)->css)
> +
> +#if CONFIG_CACHEQE_DEBUG
We really do NOT need another config option for this. See above.
> +/*DUMP the closid-cbm map.*/
Wow that comment is really informative.
> +static inline bool cqe_enabled(struct cpuinfo_x86 *c)
> +{
> +
> + int i;
> +
> + if (cpu_has(c, X86_FEATURE_CQE_L3))
> + return true;
> +
> + /*
> + * Hard code the checks and values for HSW SKUs.
> + * Unfortunately! have to check against only these brand name strings.
> + */
You must be kidding.
> +
> + for (i = 0; i < 5; i++)
> + if (!strcmp(hsw_brandstrs[i], c->x86_model_id)) {
> + c->x86_cqe_closs = 4;
> + c->x86_cqe_cbmlength = 20;
> + return true;
> + }
> +
> + return false;
> +
> +}
> +
> +
> +static int __init cqe_late_init(void)
> +{
> +
> + struct cpuinfo_x86 *c = &boot_cpu_data;
> + size_t sizeb;
> + int maxid = boot_cpu_data.x86_cqe_closs;
> +
> + cqe_genable = false;
Right, we need to initialize bss storage.
> +
> + /*
> + * Need the cqe_genable hint helps decide if the
> + * kernel has enabled cache allocation.
> + */
What hint?
> + if (!cqe_enabled(c)) {
> +
> + root_cqe_group.css.ss->disabled = 1;
> + return -ENODEV;
> +
> + } else {
> +
> + cqess_info =
> + kzalloc(sizeof(struct cacheqe_subsys_info),
> + GFP_KERNEL);
Can you add a few more newlines so the code gets better readable?
> +
> + if (!cqess_info)
> + return -ENOMEM;
> +
> + sizeb = BITS_TO_LONGS(c->x86_cqe_closs) * sizeof(long);
> + cqess_info->closmap =
> + kzalloc(sizeb, GFP_KERNEL);
Sigh.
> + if (!cqess_info->closmap) {
> + kfree(cqess_info);
> + return -ENOMEM;
> + }
> +
> + sizeb = maxid * sizeof(struct closcbm_map);
> + ccmap = kzalloc(sizeb, GFP_KERNEL);
> +
> + if (!ccmap)
> + return -ENOMEM;
Leaks closmap and cqess_info.
> + barrier();
> + cqe_genable = true;
What's the exact point of that barrier?
> +}
> +
Stray newline. checkpatch.pl is optional, right?
> +late_initcall(cqe_late_init);
> +
> +/*
> + * Allocates a new closid from unused list of closids.
There is no unused list.
> + * Called with the cqe_group_mutex held.
> + */
> +
> +static int cqe_alloc_closid(struct cacheqe *cq)
> +{
> + unsigned int tempid;
> + unsigned int maxid;
> + int err;
> +
> + maxid = boot_cpu_data.x86_cqe_closs;
> +
> + tempid = find_next_zero_bit(cqess_info->closmap, maxid, 0);
> +
> + if (tempid == maxid) {
> + err = -ENOSPC;
> + goto closidallocfail;
What's the point of the goto? What's wrong with 'return -ENOSPC;' ?
> + }
> +
> + set_bit(tempid, cqess_info->closmap);
> + ccmap[tempid].ref++;
> + cq->clos = tempid;
> +
> + pr_debug("cqe : Allocated a directory.closid:%u\n", cq->clos);
And of course we have no idea whatfor this closid got allocated. Your
debug info is just useless.
> +static void cqe_free_closid(struct cacheqe *cq)
> +{
> +
> + pr_debug("cqe :Freeing closid:%u\n", cq->clos);
> +
> + ccmap[cq->clos].ref--;
This wants a sanity check to catch ref == 0.
> +/* Create a new cacheqe cgroup.*/
> +static struct cgroup_subsys_state *
> +cqe_css_alloc(struct cgroup_subsys_state *parent_css)
> +{
> + struct cacheqe *parent = css_cacheqe(parent_css);
> + struct cacheqe *cq;
> +
> + /* This is the call before the feature is detected */
Huch? We know at early boot that this is available. So why do you need
this? The first call to this should never happen before the whole
thing is initialized. If it happens, it's a design failure.
> + if (!parent) {
> + root_cqe_group.clos = 0;
> + return &root_cqe_group.css;
> + }
> +
> + /* To check if cqe is enabled.*/
> + if (!cqe_genable)
> + return ERR_PTR(-ENODEV);
So we first check for !parent and return &root_cqe_group.css and later
we return an error pointer? Really consistent behaviour.
> +/* Destroy an existing CAT cgroup.*/
If you would spend some time on providing real comments instead of
documenting the obvious ...
> +static void cqe_css_free(struct cgroup_subsys_state *css)
> +{
> + struct cacheqe *cq = css_cacheqe(css);
> + int len = boot_cpu_data.x86_cqe_cbmlength;
> +
> + pr_debug("cqe : In cacheqe_css_free\n");
> +
> + mutex_lock(&cqe_group_mutex);
> +
> + /* Reset the CBM for the cgroup.Should be all 1s by default !*/
> +
> + wrmsrl(CQECBMMSR(cq->clos), ((1 << len) - 1));
So that's writing to IA32_L3_MASK_n, right? Pretty obvious name:
CQECBMMSR.
And the write should set ALL bits to 1 not just the enabled ones if I
understand the SDM correctly and what you said in your own comment
above.
Aside of that what guarantees that all references to this L3_MASK are
gone? You allow the sharing of COS ids. Your whole refcounting scheme
is a trainwreck. And when the COS id is not longer in use, there is no
point in writing the MSR at all. It does not matter whats in there if
nothing ever uses it.
> + cqe_free_closid(cq);
> + kfree(cq);
> +
> + mutex_unlock(&cqe_group_mutex);
> +
> +}
> +
> +/*
> + * Called during do_exit() syscall during a task exit.
> + * This assumes that the thread is running on the current
> + * cpu.
Your assumptions about running on current cpu are interesting. On
which CPU is a task supposed to run, when it is on a CPU? Surely on
some other cpu, right?
> + */
> +
> +static void cqe_exit(struct cgroup_subsys_state *css,
> + struct cgroup_subsys_state *old_css,
> + struct task_struct *task)
> +{
> +
> + cacheqe_sched_out(task);
And whats the point here? Nothing at all. You need to clean the
associated data and then the exit schedule will clean up the PQR
automatically.
> +
> +}
> +
> +static inline bool cbm_minbits(unsigned long var)
> +{
> +
Your supply of useless newlines is endless.
> + unsigned long i;
> +
> + /*Minimum of 2 bits must be set.*/
> +
> + i = var & (var - 1);
> + if (!i || !var)
> + return false;
> +
> + return true;
> +
> +}
What's wrong with using bitmap functions consistently? So this test
would simply become:
bitmap_weight(map, nrbits) < 2
> +
> +/*
> + * Tests if only contiguous bits are set.
> + */
> +
> +static inline bool cbm_iscontiguous(unsigned long var)
> +{
This one here can be implemented with existing bitmap functions as
well.
Would be too simple and too obvious to understand, right?
> +static int cqe_cbm_read(struct seq_file *m, void *v)
> +{
> + struct cacheqe *cq = css_cacheqe(seq_css(m));
> +
> + pr_debug("cqe : In cqe_cqemode_read\n");
We really need a debug printk for a seqfile read function, right?
> + seq_printf(m, "0x%x\n", (unsigned int)*(cq->cbm));
seq_bitmap()
> +static int validate_cbm(struct cacheqe *cq, unsigned long cbmvalue)
> +{
> + struct cacheqe *par, *c;
> + struct cgroup_subsys_state *css;
> +
> + if (!cbm_minbits(cbmvalue) || !cbm_iscontiguous(cbmvalue)) {
> + pr_info("CQE error: minimum bits not set or non contiguous mask\n");
> + return -EINVAL;
> + }
> +
> + /*
> + * Needs to be a subset of its parent.
> + */
> + par = parent_cqe(cq);
> +
> + if (!bitmap_subset(&cbmvalue, par->cbm, MAX_CBM_LENGTH))
> + return -EINVAL;
> +
> + rcu_read_lock();
> +
> + /*
> + * Each of children should be a subset of the mask.
> + */
> +
> + cacheqe_for_each_child(c, css, cq) {
> + c = css_cacheqe(css);
> + if (!bitmap_subset(c->cbm, &cbmvalue, MAX_CBM_LENGTH)) {
> + pr_debug("cqe : Children's cbm not a subset\n");
> + return -EINVAL;
> + }
> + }
So now you have 3 error exits here. One with pr_info, one with out and
one with pr_debug. And of course the most obvious to figure out has a
pr_info. Useful.
> +static bool cbm_search(unsigned long cbm, int *closid)
> +{
> +
> + int maxid = boot_cpu_data.x86_cqe_closs;
> + unsigned int i;
> +
> + for (i = 0; i < maxid; i++)
> + if (bitmap_equal(&cbm, &ccmap[i].cbm, MAX_CBM_LENGTH)) {
> + *closid = i;
> + return true;
> + }
> +
> + return false;
> +
> +}
> +
> +static int cqe_cbm_write(struct cgroup_subsys_state *css,
> + struct cftype *cft, u64 cbmvalue)
> +{
> + struct cacheqe *cq = css_cacheqe(css);
> + ssize_t err = 0;
> + unsigned long cbm;
> + unsigned int closid;
> +
> + pr_debug("cqe : In cqe_cbm_write\n");
Groan.
> + if (!cqe_genable)
> + return -ENODEV;
> +
> + if (cq == &root_cqe_group || !cq)
> + return -EPERM;
> +
> + /*
> + * Need global mutex as cbm write may allocate the closid.
You need to protect against changes on all ends, not only allocating a
closid.
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 4b4f78c..a9b277a 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -633,6 +633,27 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
> c->x86_capability[9] = ebx;
> }
>
> +/* Additional Intel-defined flags: level 0x00000010 */
> + if (c->cpuid_level >= 0x00000010) {
> + u32 eax, ebx, ecx, edx;
> +
> + cpuid_count(0x00000010, 0, &eax, &ebx, &ecx, &edx);
> +
> + c->x86_capability[10] = ebx;
> +
> + if (cpu_has(c, X86_FEATURE_CQE_L3)) {
> +
> + u32 eax, ebx, ecx, edx;
If you add another indentation level, you can define the same
variables once more.
> +
> + cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx);
> +
> + c->x86_cqe_closs = (edx & 0xffff) + 1;
> + c->x86_cqe_cbmlength = (eax & 0xf) + 1;
> +config CACHEQE_DEBUG
> + bool "Cache QoS Enforcement cgroup subsystem debug"
This needs to go away. The debugging you provide is beyond useless. We
want proper tracepoints for that not random debug printks with no value.
> + depends on X86 || X86_64
> + help
> + This option provides framework to allocate Cache cache lines when
> + applications fill cache.
> + This can be used by users to configure how much cache that can be
> + allocated to different PIDs.Enables debug
Copy and paste is wonderful, right?
> + Say N if unsure.
> +
> config PROC_PID_CPUSET
> bool "Include legacy /proc/<pid>/cpuset file"
> depends on CPUSETS
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 240157c..afa2897 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2215,7 +2215,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
> perf_event_task_sched_out(prev, next);
> fire_sched_out_preempt_notifiers(prev, next);
> prepare_lock_switch(rq, next);
> - prepare_arch_switch(next);
> + prepare_arch_switch(prev);
Great. You just broke all architectures which implement
prepare_arch_switch(). Congratulations!
> }
>
> /**
> @@ -2254,7 +2254,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> */
> prev_state = prev->state;
> vtime_task_switch(prev);
> - finish_arch_switch(prev);
> + finish_arch_switch(current);
Ditto.
> perf_event_task_sched_in(prev, current);
> finish_lock_switch(rq, prev);
> finish_arch_post_lock_switch();
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 24156c84..79b9ff6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -965,12 +965,36 @@ static inline int task_on_rq_migrating(struct task_struct *p)
> return p->on_rq == TASK_ON_RQ_MIGRATING;
> }
>
> +#ifdef CONFIG_X86_64
> +#ifdef CONFIG_CGROUP_CACHEQE
> +
> +#include <asm/cacheqe.h>
We surely not put x86 specific nonsense in the core scheduler
code. Did you ever come up with the idea to look how this is used by
other architectures? Obviously not.
I have no idea how Matts reviewed tag got onto this masterpiece of trainwreck engineering and I better dont want to know.
Thanks,
tglx
On Fri, 21 Nov 2014, Thomas Gleixner wrote:
> On Wed, 19 Nov 2014, Vikas Shivappa wrote:
>
>> +/* maximum possible cbm length */
>> +#define MAX_CBM_LENGTH 32
>> +
>> +#define IA32_CBMMAX_MASK(x) (0xffffffff & (~((u64)(1 << x) - 1)))
>
> Unused define.
Will remove. Is there any automated option during compile, or in
checkpatch.pl, which could find this?
>
>> +
>> +#define IA32_CBM_MASK 0xffffffff
>
> (~0U) ?
>
>> @@ -0,0 +1,144 @@
>> +static inline void cacheqe_sched_in(struct task_struct *task)
>> +{
>> + struct cacheqe *cq;
>> + unsigned int clos;
>> + unsigned int l, h;
>
> These variable names suck. You say in the comment below:
>
> write the PQR for the proc.
>
> So what is it now? clos or pqr or what?
>
> IA32_PQR_ASSOC is the MSR name and bit 0-9 is RMID and bit 32-63 is
> COS
>
> But you create random names just because it makes it harder to map the
> code to the SDM or is there some other sensible reason I'm missing
> here?
>
>> +
>> + if (!cqe_genable)
>> + return;
>
> This wants to be a static key.
Ok, will fix.
>
>> + rdmsr(IA32_PQR_ASSOC, l, h);
>
> Why on earth do we want to read an MSR on every context switch? What's
> wrong with having
>
> DEFINE_PER_CPU(u64, cacheqe_pqr);
>
> u64 next_pqr, cur_pqr = this_cpu_read(cacheqe_pqr);
>
> cq = task_cacheqe(task);
> next_pqr = cq ? cq->pqr : 0;
>
> if (next_pqr != cur_pqr) {
> wrmsrl(IA32_PQR_ASSOC, next_pqr);
When CQM (cache monitoring) and CQE are both running together, we can't
corrupt the low 32 bits (the RMID); the rdmsr was a fix for that.
If there are better alternatives at low cost, they are certainly acceptable.
> this_cpu_write(cacheqe_pqr, next_pqr);
> }
>
> That saves the whole idiocity of reading MSRs on schedule in and
> schedule out. In fact it makes the schedule_out part completely
> superfluous.
>
>> + rcu_read_lock();
>> + cq = task_cacheqe(task);
>> +
>> + if (cq == NULL || cq->clos == h) {
>> + rcu_read_unlock();
>> + return;
>> + }
>> +
>> + clos = cq->clos;
>> +
>> + /*
>> + * After finding the cacheqe of the task , write the PQR for the proc.
>> + * We are assuming the current core is the one its scheduled to.
>
> You are in the middle of the scheduler. So what needs to be assumed
> here?
>
>> + * In unified scheduling , write the PQR each time.
>
> What the heck is unified scheduling?
>
>> + */
>> + wrmsr(IA32_PQR_ASSOC, l, clos);
>> + rcu_read_unlock();
>> +
>> + CQE_DEBUG(("schedule in clos :0x%x,task cpu:%u, currcpu: %u,pid:%u\n",
>> + clos, task_cpu(task), smp_processor_id(), task->pid));
>
> A printk in the guts of the sched_switch function? Ever heard about
> trace points?
>
>> +
>> +}
>> +
>> +static inline void cacheqe_sched_out(struct task_struct *task)
>> +{
>> + unsigned int l, h;
>> +
>> + if (!cqe_genable)
>> + return;
>> +
>> + rdmsr(IA32_PQR_ASSOC, l, h);
>> +
>> + if (h == 0)
>> + return;
>> +
>> + /*
>> + *After finding the cacheqe of the task , write the PQR for the proc.
>> + * We are assuming the current core is the one its scheduled to.
>> + * Write zero when scheduling out so that we get a more accurate
>> + * cache allocation.
>
> This comment makes no sense at all. But well, this does not make sense
> anyway.
>
>> +ifdef CONFIG_CACHEQE_DEBUG
>> +CFLAGS_cacheqe.o := -DDEBUG
>> +endif
>
> Certainly NOT!
Since it was a new feature, the debug patch was posted along with it.
Will plan to remove the debug prints in the next version.
>
>> +++ b/arch/x86/kernel/cpu/cacheqe.c
>> @@ -0,0 +1,487 @@
>> +
>> +/*
>> + * kernel/cacheqe.c
>
> What's the point of the file name here? Especially if the file is not
> there at all.
>
>> +#include <asm/cacheqe.h>
>> +
>> +struct cacheqe root_cqe_group;
>> +static DEFINE_MUTEX(cqe_group_mutex);
>> +
>> +bool cqe_genable;
>
> This wants to be readmostly, if it's a global, but we really want a
> static key for this.
>
>> +
>> +/* ccmap maintains 1:1 mapping between CLOSid and cbm.*/
>> +
>> +static struct closcbm_map *ccmap;
>> +static struct cacheqe_subsys_info *cqess_info;
>> +
>> +char hsw_brandstrs[5][64] = {
>> + "Intel(R) Xeon(R) CPU E5-2658 v3 @ 2.20GHz",
>> + "Intel(R) Xeon(R) CPU E5-2648L v3 @ 1.80GHz",
>> + "Intel(R) Xeon(R) CPU E5-2628L v3 @ 2.00GHz",
>> + "Intel(R) Xeon(R) CPU E5-2618L v3 @ 2.30GHz",
>> + "Intel(R) Xeon(R) CPU E5-2608L v3 @ 2.00GHz"
>> +};
>> +
>> +#define cacheqe_for_each_child(child_cq, pos_css, parent_cq) \
>> + css_for_each_child((pos_css), \
>> + &(parent_cq)->css)
>> +
>> +#if CONFIG_CACHEQE_DEBUG
>
> We really do NOT need another config option for this. See above.
>
>> +/*DUMP the closid-cbm map.*/
>
> Wow that comment is really informative.
>
>> +static inline bool cqe_enabled(struct cpuinfo_x86 *c)
>> +{
>> +
>> + int i;
>> +
>> + if (cpu_has(c, X86_FEATURE_CQE_L3))
>> + return true;
>> +
>> + /*
>> + * Hard code the checks and values for HSW SKUs.
>> + * Unfortunately! have to check against only these brand name strings.
>> + */
>
> You must be kidding.
No. Will have a microcode version check as well in the next patch, after
that's confirmed by the h/w team.
>
>> +
>> + for (i = 0; i < 5; i++)
>> + if (!strcmp(hsw_brandstrs[i], c->x86_model_id)) {
>> + c->x86_cqe_closs = 4;
>> + c->x86_cqe_cbmlength = 20;
>> + return true;
>> + }
>> +
>> + return false;
>> +
>> +}
>> +
>> +
>> +static int __init cqe_late_init(void)
>> +{
>> +
>> + struct cpuinfo_x86 *c = &boot_cpu_data;
>> + size_t sizeb;
>> + int maxid = boot_cpu_data.x86_cqe_closs;
>> +
>> + cqe_genable = false;
>
> Right, we need to initialize bss storage.
will remove
>
>> +
>> + /*
>> + * Need the cqe_genable hint helps decide if the
>> + * kernel has enabled cache allocation.
>> + */
>
> What hint?
>
>> + if (!cqe_enabled(c)) {
>> +
>> + root_cqe_group.css.ss->disabled = 1;
>> + return -ENODEV;
>> +
>> + } else {
>> +
>> + cqess_info =
>> + kzalloc(sizeof(struct cacheqe_subsys_info),
>> + GFP_KERNEL);
>
> Can you add a few more newlines so the code gets better readable?
>
>> +
>> + if (!cqess_info)
>> + return -ENOMEM;
>> +
>> + sizeb = BITS_TO_LONGS(c->x86_cqe_closs) * sizeof(long);
>> + cqess_info->closmap =
>> + kzalloc(sizeb, GFP_KERNEL);
>
> Sigh.
>
>> + if (!cqess_info->closmap) {
>> + kfree(cqess_info);
>> + return -ENOMEM;
>> + }
>> +
>> + sizeb = maxid * sizeof(struct closcbm_map);
>> + ccmap = kzalloc(sizeb, GFP_KERNEL);
>> +
>> + if (!ccmap)
>> + return -ENOMEM;
>
> Leaks closmap and cqess_info.
will fix
>
>> + barrier();
>> + cqe_genable = true;
>
> What's the exact point of that barrier?
>
>> +}
>> +
>
> Stray newline. checkpatch.pl is optional, right?
I did run checkpatch.pl and it did not report the newlines as errors/warnings.
Is there a better way than checking for stray newlines manually?
>
>> +late_initcall(cqe_late_init);
>> +
>> +/*
>> + * Allocates a new closid from unused list of closids.
>
> There is no unused list.
>
>> + * Called with the cqe_group_mutex held.
>> + */
>> +
>> +static int cqe_alloc_closid(struct cacheqe *cq)
>> +{
>> + unsigned int tempid;
>> + unsigned int maxid;
>> + int err;
>> +
>> + maxid = boot_cpu_data.x86_cqe_closs;
>> +
>> + tempid = find_next_zero_bit(cqess_info->closmap, maxid, 0);
>> +
>> + if (tempid == maxid) {
>> + err = -ENOSPC;
>> + goto closidallocfail;
>
> What's the point of the goto? What's wrong with 'return -ENOSPC;' ?
>
will change to return -ENOSPC.
>> + }
>> +
>> + set_bit(tempid, cqess_info->closmap);
>> + ccmap[tempid].ref++;
>> + cq->clos = tempid;
>> +
>> + pr_debug("cqe : Allocated a directory.closid:%u\n", cq->clos);
>
> And of course we have no idea whatfor this closid got allocated. Your
> debug info is just useless.
The debug prints are there mostly for the debug option and, since the feature
was new, to help testing. Will remove them and add better comments.
>
>> +static void cqe_free_closid(struct cacheqe *cq)
>> +{
>> +
>> + pr_debug("cqe :Freeing closid:%u\n", cq->clos);
>> +
>> + ccmap[cq->clos].ref--;
>
> This wants a sanity check to catch ref == 0.
will fix
>
>> +/* Create a new cacheqe cgroup.*/
>> +static struct cgroup_subsys_state *
>> +cqe_css_alloc(struct cgroup_subsys_state *parent_css)
>> +{
>> + struct cacheqe *parent = css_cacheqe(parent_css);
>> + struct cacheqe *cq;
>> +
>> + /* This is the call before the feature is detected */
>
> Huch? We know at early boot that this is available. So why do you need
> this? The first call to this should never happen before the whole
> thing is initialized. If it happens, it's a design failure.
>
>> + if (!parent) {
>> + root_cqe_group.clos = 0;
>> + return &root_cqe_group.css;
>> + }
>> +
>> + /* To check if cqe is enabled.*/
>> + if (!cqe_genable)
>> + return ERR_PTR(-ENODEV);
>
> So we first check for !parent and return &root_cqe_group.css and later
> we return an error pointer? Really consistent behaviour.
This is due to how the cgroup init works.
>
>> +/* Destroy an existing CAT cgroup.*/
>
> If you would spend some time on providing real comments instead of
> documenting the obvious ...
>
>> +static void cqe_css_free(struct cgroup_subsys_state *css)
>> +{
>> + struct cacheqe *cq = css_cacheqe(css);
>> + int len = boot_cpu_data.x86_cqe_cbmlength;
>> +
>> + pr_debug("cqe : In cacheqe_css_free\n");
>> +
>> + mutex_lock(&cqe_group_mutex);
>> +
>> + /* Reset the CBM for the cgroup.Should be all 1s by default !*/
>> +
>> + wrmsrl(CQECBMMSR(cq->clos), ((1 << len) - 1));
>
> So that's writing to IA32_L3_MASK_n, right? Pretty obvious name:
> CQECBMMSR.
>
> And the write should set ALL bits to 1 not just the enabled ones if I
> understand the SDM correctly and what you said in your own comment
> above.
>
> Aside of that what guarantees that all references to this L3_MASK are
> gone? You allow the sharing of COS ids. Your whole refcounting scheme
> is a trainwreck. And when the COS id is not longer in use, there is no
> point in writing the MSR at all. It does not matter whats in there if
> nothing ever uses it.
Will fix this bug.
This is a bug which got introduced after making the change to reuse the
CLOSids; it breaks the functionality! It should never have been there.
>
>> + cqe_free_closid(cq);
>> + kfree(cq);
>> +
>> + mutex_unlock(&cqe_group_mutex);
>> +
>> +}
>> +
>> +/*
>> + * Called during do_exit() syscall during a task exit.
>> + * This assumes that the thread is running on the current
>> + * cpu.
>
> Your assumptions about running on current cpu are interesting. On
> which CPU is a task supposed to run, when it is on a CPU? Surely on
> some other cpu, right?
>
>> + */
>> +
>> +static void cqe_exit(struct cgroup_subsys_state *css,
>> + struct cgroup_subsys_state *old_css,
>> + struct task_struct *task)
>> +{
>> +
>> + cacheqe_sched_out(task);
>
> And whats the point here? Nothing at all. You need to clean the
> associated data and then the exit schedule will clean up the PQR
> automatically.
>
Ok, can be removed.
>> +
>> +}
>> +
>> +static inline bool cbm_minbits(unsigned long var)
>> +{
>> +
>
> Your supply of useless newlines is endless.
>
>> + unsigned long i;
>> +
>> + /*Minimum of 2 bits must be set.*/
>> +
>> + i = var & (var - 1);
>> + if (!i || !var)
>> + return false;
>> +
>> + return true;
>> +
>> +}
>
> What's wrong with using bitmap functions consistently? So this test
> would simply become:
>
> bitmap_weight(map, nrbits) < 2
>
>> +
>> +/*
>> + * Tests if only contiguous bits are set.
>> + */
>> +
>> +static inline bool cbm_iscontiguous(unsigned long var)
>> +{
>
> This one here can be implemented with existing bitmap functions as
> well.
>
> Would be too simple and too obvious to understand, right?
>
>> +static int cqe_cbm_read(struct seq_file *m, void *v)
>> +{
>> + struct cacheqe *cq = css_cacheqe(seq_css(m));
>> +
>> + pr_debug("cqe : In cqe_cqemode_read\n");
>
> We really need a debug printk for a seqfile read function, right?
>
>> + seq_printf(m, "0x%x\n", (unsigned int)*(cq->cbm));
>
> seq_bitmap()
>
>> +static int validate_cbm(struct cacheqe *cq, unsigned long cbmvalue)
>> +{
>> + struct cacheqe *par, *c;
>> + struct cgroup_subsys_state *css;
>> +
>> + if (!cbm_minbits(cbmvalue) || !cbm_iscontiguous(cbmvalue)) {
>> + pr_info("CQE error: minimum bits not set or non contiguous mask\n");
>> + return -EINVAL;
>> + }
>> +
>> + /*
>> + * Needs to be a subset of its parent.
>> + */
>> + par = parent_cqe(cq);
>> +
>> + if (!bitmap_subset(&cbmvalue, par->cbm, MAX_CBM_LENGTH))
>> + return -EINVAL;
>> +
>> + rcu_read_lock();
>> +
>> + /*
>> + * Each of children should be a subset of the mask.
>> + */
>> +
>> + cacheqe_for_each_child(c, css, cq) {
>> + c = css_cacheqe(css);
>> + if (!bitmap_subset(c->cbm, &cbmvalue, MAX_CBM_LENGTH)) {
>> + pr_debug("cqe : Children's cbm not a subset\n");
>> + return -EINVAL;
>> + }
>> + }
>
> So now you have 3 error exits here. One with pr_info, one with out and
> one with pr_debug. And of course the most obvious to figure out has a
> pr_info. Useful.
>
>> +static bool cbm_search(unsigned long cbm, int *closid)
>> +{
>> +
>> + int maxid = boot_cpu_data.x86_cqe_closs;
>> + unsigned int i;
>> +
>> + for (i = 0; i < maxid; i++)
>> + if (bitmap_equal(&cbm, &ccmap[i].cbm, MAX_CBM_LENGTH)) {
>> + *closid = i;
>> + return true;
>> + }
>> +
>> + return false;
>> +
>> +}
>> +
>> +static int cqe_cbm_write(struct cgroup_subsys_state *css,
>> + struct cftype *cft, u64 cbmvalue)
>> +{
>> + struct cacheqe *cq = css_cacheqe(css);
>> + ssize_t err = 0;
>> + unsigned long cbm;
>> + unsigned int closid;
>> +
>> + pr_debug("cqe : In cqe_cbm_write\n");
>
> Groan.
>
>> + if (!cqe_genable)
>> + return -ENODEV;
>> +
>> + if (cq == &root_cqe_group || !cq)
>> + return -EPERM;
>> +
>> + /*
>> + * Need global mutex as cbm write may allocate the closid.
>
> You need to protect against changes on all ends, not only allocating a
> closid.
>
>> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>> index 4b4f78c..a9b277a 100644
>> --- a/arch/x86/kernel/cpu/common.c
>> +++ b/arch/x86/kernel/cpu/common.c
>> @@ -633,6 +633,27 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
>> c->x86_capability[9] = ebx;
>> }
>>
>> +/* Additional Intel-defined flags: level 0x00000010 */
>> + if (c->cpuid_level >= 0x00000010) {
>> + u32 eax, ebx, ecx, edx;
>> +
>> + cpuid_count(0x00000010, 0, &eax, &ebx, &ecx, &edx);
>> +
>> + c->x86_capability[10] = ebx;
>> +
>> + if (cpu_has(c, X86_FEATURE_CQE_L3)) {
>> +
>> + u32 eax, ebx, ecx, edx;
>
> If you add another indentation level, you can define the same
> variables once more.
>
>> +
>> + cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx);
>> +
>> + c->x86_cqe_closs = (edx & 0xffff) + 1;
>> + c->x86_cqe_cbmlength = (eax & 0xf) + 1;
>> +config CACHEQE_DEBUG
>> + bool "Cache QoS Enforcement cgroup subsystem debug"
>
> This needs to go away. The debugging you provide is beyond useless. We
> want proper tracepoints for that not random debug printks with no value.
>
>> + depends on X86 || X86_64
>> + help
>> + This option provides framework to allocate Cache cache lines when
>> + applications fill cache.
>> + This can be used by users to configure how much cache that can be
>> + allocated to different PIDs.Enables debug
>
> Copy and paste is wonderful, right?
>
>> + Say N if unsure.
>> +
>> config PROC_PID_CPUSET
>> bool "Include legacy /proc/<pid>/cpuset file"
>> depends on CPUSETS
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 240157c..afa2897 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2215,7 +2215,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
>> perf_event_task_sched_out(prev, next);
>> fire_sched_out_preempt_notifiers(prev, next);
>> prepare_lock_switch(rq, next);
>> - prepare_arch_switch(next);
>> + prepare_arch_switch(prev);
>
> Great. You just broke all architectures which implement
> prepare_arch_switch(). Congratulations!
>
Will fix this bug.
>> }
>>
>> /**
>> @@ -2254,7 +2254,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
>> */
>> prev_state = prev->state;
>> vtime_task_switch(prev);
>> - finish_arch_switch(prev);
>> + finish_arch_switch(current);
>
> Ditto.
>
>> perf_event_task_sched_in(prev, current);
>> finish_lock_switch(rq, prev);
>> finish_arch_post_lock_switch();
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 24156c84..79b9ff6 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -965,12 +965,36 @@ static inline int task_on_rq_migrating(struct task_struct *p)
>> return p->on_rq == TASK_ON_RQ_MIGRATING;
>> }
>>
>> +#ifdef CONFIG_X86_64
>> +#ifdef CONFIG_CGROUP_CACHEQE
>> +
>> +#include <asm/cacheqe.h>
>
> We surely not put x86 specific nonsense in the core scheduler
> code. Did you ever come up with the idea to look how this is used by
> other architectures? Obviously not.
>
> I have no idea how Matts reviewed tag got onto this masterpiece of trainwreck engineering and I better dont want to know.
>
> Thanks,
>
> tglx
>
>
>
>
>
>
On Fri, Nov 21, 2014 at 12:00:27PM -0800, Vikas Shivappa wrote:
> >>+char hsw_brandstrs[5][64] = {
> >>+ "Intel(R) Xeon(R) CPU E5-2658 v3 @ 2.20GHz",
> >>+ "Intel(R) Xeon(R) CPU E5-2648L v3 @ 1.80GHz",
> >>+ "Intel(R) Xeon(R) CPU E5-2628L v3 @ 2.00GHz",
> >>+ "Intel(R) Xeon(R) CPU E5-2618L v3 @ 2.30GHz",
> >>+ "Intel(R) Xeon(R) CPU E5-2608L v3 @ 2.00GHz"
> >>+};
> >>+
> >>+#define cacheqe_for_each_child(child_cq, pos_css, parent_cq) \
> >>+ css_for_each_child((pos_css), \
> >>+ &(parent_cq)->css)
> >>+
> >>+#if CONFIG_CACHEQE_DEBUG
> >
> >We really do NOT need another config option for this. See above.
> >
> >>+/*DUMP the closid-cbm map.*/
> >
> >Wow that comment is really informative.
> >
> >>+static inline bool cqe_enabled(struct cpuinfo_x86 *c)
> >>+{
> >>+
> >>+ int i;
> >>+
> >>+ if (cpu_has(c, X86_FEATURE_CQE_L3))
> >>+ return true;
> >>+
> >>+ /*
> >>+ * Hard code the checks and values for HSW SKUs.
> >>+ * Unfortunately! have to check against only these brand name strings.
> >>+ */
> >
> >You must be kidding.
>
> No. Will have a microcode version check as well in the next patch, after
> that's confirmed by the h/w team.
Checking random brand strings? Please don't tell me those are not really
immutable either...
And what happens with newer models appearing? Add more brand strings?
Lovely stuff, that.
Well, since you're talking to the h/w team: can they give you some
immutable bit somewhere which you can check instead of looking at brand
strings? This'll be a sane solution, actually.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On 11/19/2014 05:05 PM, Vikas Shivappa wrote:
> + /*
> + * Hard code the checks and values for HSW SKUs.
> + * Unfortunately! have to check against only these brand name strings.
> + */
> +
> + for (i = 0; i < 5; i++)
> + if (!strcmp(hsw_brandstrs[i], c->x86_model_id)) {
> + c->x86_cqe_closs = 4;
> + c->x86_cqe_cbmlength = 20;
> + return true;
> + }
Please use ARRAY_SIZE() here. Otherwise, I guarantee the next string
you add to hsw_brandstrs[] gets silently ignored.
Are there really only 5 CPUs? This:
> http://ark.intel.com/products/family/78583/Intel-Xeon-Processor-E5-v3-Family#@Server
lists 32 skus.
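For reference, a minimal sketch of that loop with ARRAY_SIZE() supplying the bound
(everything else as in the patch):

	for (i = 0; i < ARRAY_SIZE(hsw_brandstrs); i++)
		if (!strcmp(hsw_brandstrs[i], c->x86_model_id)) {
			c->x86_cqe_closs = 4;
			c->x86_cqe_cbmlength = 20;
			return true;
		}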
On Fri, 21 Nov 2014, Dave Hansen wrote:
> On 11/19/2014 05:05 PM, Vikas Shivappa wrote:
> > + /*
> > + * Hard code the checks and values for HSW SKUs.
> > + * Unfortunately! have to check against only these brand name strings.
> > + */
> > +
> > + for (i = 0; i < 5; i++)
> > + if (!strcmp(hsw_brandstrs[i], c->x86_model_id)) {
> > + c->x86_cqe_closs = 4;
> > + c->x86_cqe_cbmlength = 20;
> > + return true;
> > + }
>
> Please use ARRAY_SIZE() here. Otherwise, I guarantee the next string
> you add to hsw_brandstrs[] gets silently ignored.
>
> Are there really only 5 CPUs? This:
>
> > http://ark.intel.com/products/family/78583/Intel-Xeon-Processor-E5-v3-Family#@Server
>
> lists 32 skus.
We really should find a proper software probing solution for this
instead of having a gazillion of brand strings plus micro code version
checks around for this.
Why cant the HW folks release a micro code version which fixes the
obviously wreckaged CPUID enumeration of this feature instead of
burdening us with that horror?
Can you please find a proper sized clue bat and whack your HW folks
over the head for this insanity?
Thanks,
tglx
On Fri, 21 Nov 2014, Borislav Petkov wrote:
> On Fri, Nov 21, 2014 at 12:00:27PM -0800, Vikas Shivappa wrote:
>>>> +char hsw_brandstrs[5][64] = {
>>>> + "Intel(R) Xeon(R) CPU E5-2658 v3 @ 2.20GHz",
>>>> + "Intel(R) Xeon(R) CPU E5-2648L v3 @ 1.80GHz",
>>>> + "Intel(R) Xeon(R) CPU E5-2628L v3 @ 2.00GHz",
>>>> + "Intel(R) Xeon(R) CPU E5-2618L v3 @ 2.30GHz",
>>>> + "Intel(R) Xeon(R) CPU E5-2608L v3 @ 2.00GHz"
>>>> +};
>>>> +
>>>> +#define cacheqe_for_each_child(child_cq, pos_css, parent_cq) \
>>>> + css_for_each_child((pos_css), \
>>>> + &(parent_cq)->css)
>>>> +
>>>> +#if CONFIG_CACHEQE_DEBUG
>>>
>>> We really do NOT need another config option for this. See above.
>>>
>>>> +/*DUMP the closid-cbm map.*/
>>>
>>> Wow that comment is really informative.
>>>
>>>> +static inline bool cqe_enabled(struct cpuinfo_x86 *c)
>>>> +{
>>>> +
>>>> + int i;
>>>> +
>>>> + if (cpu_has(c, X86_FEATURE_CQE_L3))
>>>> + return true;
>>>> +
>>>> + /*
>>>> + * Hard code the checks and values for HSW SKUs.
>>>> + * Unfortunately! have to check against only these brand name strings.
>>>> + */
>>>
>>> You must be kidding.
>>
>> No. Will have a microcode version check as well in next patch after thats
>> confirmed from h/w team
>
> Checking random brand strings? Please don't tell me those are not really
> immutable either...
>
> And what happens with newer models appearing? Add more brand strings?
> Lovely stuff, that.
>
> Well, since you're talking to the h/w team: can they give you some
> immutable bit somewhere which you can check instead of looking at brand
> strings? This'll be a sane solution, actually.
>
Yes, I did check for something like a model/stepping bit, but have not received anything yet.
Will update in my next version.
Thanks,
Vikas
> --
> Regards/Gruss,
> Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --
>
Correcting email address for Matt.
On Wed, 19 Nov 2014, Vikas Shivappa wrote:
> What is Cache Allocation Technology ( CAT )
> -------------------------------------------
>
> Cache Allocation Technology provides a way for the Software (OS/VMM) to
> restrict cache allocation to a defined 'subset' of cache which may be
> overlapping with other 'subsets'. This feature is used when allocating
> a line in cache ie when pulling new data into the cache. The
> programming of the h/w is done via programming MSRs.
>
> The different cache subsets are identified by CLOS identifier (class of
> service) and each CLOS has a CBM (cache bit mask). The CBM is a
> contiguous set of bits which defines the amount of cache resource that
> is available for each 'subset'.
>
> Why is CAT (cache allocation technology) needed
> ------------------------------------------------
>
> The CAT enables more cache resources to be made available for higher
> priority applications based on guidance from the execution
> environment.
>
> The architecture also allows dynamically changing these subsets during
> runtime to further optimize the performance of the higher priority
> application with minimal degradation to the low priority app.
> Additionally, resources can be rebalanced for system throughput benefit.
>
> This technique may be useful in managing large computer systems which
> large LLC. Examples may be large servers running instances of
> webservers or database servers. In such complex systems, these subsets
> can be used for more careful placing of the available cache resources.
>
> The CAT kernel patch would provide a basic kernel framework for users to
> be able to implement such cache subsets.
>
> Kernel Implementation
> ---------------------
>
> This patch implements a cgroup subsystem to support cache allocation.
> Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping. A
> CLOS(Class of service) is represented by a CLOSid.CLOSid is internal
> to the kernel and not exposed to user. Each cgroup would have one CBM
> and would just represent one cache 'subset'.
>
> The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the
> cgroup never fails. When a child cgroup is created it inherits the
> CLOSid and the CBM from its parent. When a user changes the default
> CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
> used before. The changing of 'cbm' may fail with -ERRNOSPC once the
> kernel runs out of maximum CLOSids it can support.
> User can create as many cgroups as he wants but having different CBMs
> at the same time is restricted by the maximum number of CLOSids
> (multiple cgroups can have the same CBM).
> Kernel maintains a CLOSid<->cbm mapping which keeps reference counter
> for each cgroup using a CLOSid.
>
> The tasks in the cgroup would get to fill the LLC cache represented by
> the cgroup's 'cbm' file.
>
> Root directory would have all available bits set in 'cbm' file by
> default.
>
> Assignment of CBM,CLOS
> ---------------------------------
>
> The 'cbm' needs to be a subset of the parent node's 'cbm'. Any
> contiguous subset of these bits(with a minimum of 2 bits) maybe set to
> indicate the cache mapping desired. The 'cbm' between 2 directories can
> overlap. The 'cbm' would represent the cache 'subset' of the CAT cgroup.
> For ex: on a system with 16 bits of max cbm bits, if the directory has
> the least significant 4 bits set in its 'cbm' file(meaning the 'cbm' is
> just 0xf), it would be allocated the right quarter of the Last level
> cache which means the tasks belonging to this CAT cgroup can use the
> right quarter of the cache to fill. If it has the most significant 8
> bits set ,it would be allocated the left half of the cache(8 bits out
> of 16 represents 50%).
>
> The cache portion defined in the CBM file is available to all tasks
> within the cgroup to fill and these task are not allowed to allocate
> space in other parts of the cache.
>
> Scheduling and Context Switch
> ------------------------------
>
> During context switch kernel implements this by writing the CLOSid
> (internally maintained by kernel) of the cgroup to which the task
> belongs to the CPU's IA32_PQR_ASSOC MSR.
>
> Reviewed-by: Matt Flemming <[email protected]>
> Tested-by: Priya Autee <[email protected]>
> Signed-off-by: Vikas Shivappa <[email protected]>
> ---
> arch/x86/include/asm/cacheqe.h | 144 +++++++++++
> arch/x86/include/asm/cpufeature.h | 4 +
> arch/x86/include/asm/processor.h | 5 +-
> arch/x86/kernel/cpu/Makefile | 5 +
> arch/x86/kernel/cpu/cacheqe.c | 487 ++++++++++++++++++++++++++++++++++++++
> arch/x86/kernel/cpu/common.c | 21 ++
> include/linux/cgroup_subsys.h | 5 +
> init/Kconfig | 22 ++
> kernel/sched/core.c | 4 +-
> kernel/sched/sched.h | 24 ++
> 10 files changed, 718 insertions(+), 3 deletions(-)
> create mode 100644 arch/x86/include/asm/cacheqe.h
> create mode 100644 arch/x86/kernel/cpu/cacheqe.c
>
> diff --git a/arch/x86/include/asm/cacheqe.h b/arch/x86/include/asm/cacheqe.h
> new file mode 100644
> index 0000000..91d175e
> --- /dev/null
> +++ b/arch/x86/include/asm/cacheqe.h
> @@ -0,0 +1,144 @@
> +#ifndef _CACHEQE_H_
> +#define _CACHEQE_H_
> +
> +#include <linux/cgroup.h>
> +#include <linux/slab.h>
> +#include <linux/percpu.h>
> +#include <linux/spinlock.h>
> +#include <linux/cpumask.h>
> +#include <linux/seq_file.h>
> +#include <linux/rcupdate.h>
> +#include <linux/kernel_stat.h>
> +#include <linux/err.h>
> +
> +#ifdef CONFIG_CGROUP_CACHEQE
> +
> +#define IA32_PQR_ASSOC 0xc8f
> +#define IA32_PQR_MASK(x) (x << 32)
> +
> +/* maximum possible cbm length */
> +#define MAX_CBM_LENGTH 32
> +
> +#define IA32_CBMMAX_MASK(x) (0xffffffff & (~((u64)(1 << x) - 1)))
> +
> +#define IA32_CBM_MASK 0xffffffff
> +#define IA32_L3_CBM_BASE 0xc90
> +#define CQECBMMSR(x) (IA32_L3_CBM_BASE + x)
> +
> +#ifdef CONFIG_CACHEQE_DEBUG
> +#define CQE_DEBUG(X) do { pr_info X; } while (0)
> +#else
> +#define CQE_DEBUG(X)
> +#endif
> +
> +extern bool cqe_genable;
> +
> +struct cacheqe_subsys_info {
> + unsigned long *closmap;
> +};
> +
> +struct cacheqe {
> + struct cgroup_subsys_state css;
> +
> + /* class of service for the group*/
> + unsigned int clos;
> + /* corresponding cache bit mask*/
> + unsigned long *cbm;
> +
> +};
> +
> +struct closcbm_map {
> + unsigned long cbm;
> + unsigned int ref;
> +};
> +
> +extern struct cacheqe root_cqe_group;
> +
> +/*
> + * Return cacheqos group corresponding to this container.
> + */
> +static inline struct cacheqe *css_cacheqe(struct cgroup_subsys_state *css)
> +{
> + return css ? container_of(css, struct cacheqe, css) : NULL;
> +}
> +
> +static inline struct cacheqe *parent_cqe(struct cacheqe *cq)
> +{
> + return css_cacheqe(cq->css.parent);
> +}
> +
> +/*
> + * Return cacheqe group to which this task belongs.
> + */
> +static inline struct cacheqe *task_cacheqe(struct task_struct *task)
> +{
> + return css_cacheqe(task_css(task, cacheqe_cgrp_id));
> +}
> +
> +static inline void cacheqe_sched_in(struct task_struct *task)
> +{
> + struct cacheqe *cq;
> + unsigned int clos;
> + unsigned int l, h;
> +
> + if (!cqe_genable)
> + return;
> +
> + rdmsr(IA32_PQR_ASSOC, l, h);
> +
> + rcu_read_lock();
> + cq = task_cacheqe(task);
> +
> + if (cq == NULL || cq->clos == h) {
> + rcu_read_unlock();
> + return;
> + }
> +
> + clos = cq->clos;
> +
> + /*
> + * After finding the cacheqe of the task , write the PQR for the proc.
> + * We are assuming the current core is the one its scheduled to.
> + * In unified scheduling , write the PQR each time.
> + */
> + wrmsr(IA32_PQR_ASSOC, l, clos);
> + rcu_read_unlock();
> +
> + CQE_DEBUG(("schedule in clos :0x%x,task cpu:%u, currcpu: %u,pid:%u\n",
> + clos, task_cpu(task), smp_processor_id(), task->pid));
> +
> +}
> +
> +static inline void cacheqe_sched_out(struct task_struct *task)
> +{
> + unsigned int l, h;
> +
> + if (!cqe_genable)
> + return;
> +
> + rdmsr(IA32_PQR_ASSOC, l, h);
> +
> + if (h == 0)
> + return;
> +
> + /*
> + *After finding the cacheqe of the task , write the PQR for the proc.
> + * We are assuming the current core is the one its scheduled to.
> + * Write zero when scheduling out so that we get a more accurate
> + * cache allocation.
> + */
> +
> + wrmsr(IA32_PQR_ASSOC, l, 0);
> +
> + CQE_DEBUG(("schedule out done cpu :%u,curr cpu:%u, pid:%u\n",
> + task_cpu(task), smp_processor_id(), task->pid));
> +
> +}
> +
> +#else
> +static inline void cacheqe_sched_in(struct task_struct *task) {}
> +
> +static inline void cacheqe_sched_out(struct task_struct *task) {}
> +
> +#endif
> +#endif
> diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
> index 0bb1335..21290ac 100644
> --- a/arch/x86/include/asm/cpufeature.h
> +++ b/arch/x86/include/asm/cpufeature.h
> @@ -221,6 +221,7 @@
> #define X86_FEATURE_INVPCID ( 9*32+10) /* Invalidate Processor Context ID */
> #define X86_FEATURE_RTM ( 9*32+11) /* Restricted Transactional Memory */
> #define X86_FEATURE_MPX ( 9*32+14) /* Memory Protection Extension */
> +#define X86_FEATURE_CQE (9*32+15) /* Cache QOS Enforcement */
> #define X86_FEATURE_AVX512F ( 9*32+16) /* AVX-512 Foundation */
> #define X86_FEATURE_RDSEED ( 9*32+18) /* The RDSEED instruction */
> #define X86_FEATURE_ADX ( 9*32+19) /* The ADCX and ADOX instructions */
> @@ -236,6 +237,9 @@
> #define X86_FEATURE_XGETBV1 (10*32+ 2) /* XGETBV with ECX = 1 */
> #define X86_FEATURE_XSAVES (10*32+ 3) /* XSAVES/XRSTORS */
>
> +/* Intel-defined CPU features, CPUID level 0x0000000A:0 (ebx), word 10 */
> +#define X86_FEATURE_CQE_L3 (10*32 + 1)
> +
> /*
> * BUG word(s)
> */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index eb71ec7..6be953f 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -111,8 +111,11 @@ struct cpuinfo_x86 {
> int x86_cache_alignment; /* In bytes */
> int x86_power;
> unsigned long loops_per_jiffy;
> + /* Cache QOS Enforement values */
> + int x86_cqe_cbmlength;
> + int x86_cqe_closs;
> /* cpuid returned max cores value: */
> - u16 x86_max_cores;
> + u16 x86_max_cores;
> u16 apicid;
> u16 initial_apicid;
> u16 x86_clflush_size;
> diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
> index e27b49d..c2b0a6b 100644
> --- a/arch/x86/kernel/cpu/Makefile
> +++ b/arch/x86/kernel/cpu/Makefile
> @@ -8,6 +8,10 @@ CFLAGS_REMOVE_common.o = -pg
> CFLAGS_REMOVE_perf_event.o = -pg
> endif
>
> +ifdef CONFIG_CACHEQE_DEBUG
> +CFLAGS_cacheqe.o := -DDEBUG
> +endif
> +
> # Make sure load_percpu_segment has no stackprotector
> nostackp := $(call cc-option, -fno-stack-protector)
> CFLAGS_common.o := $(nostackp)
> @@ -47,6 +51,7 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += perf_event_intel_uncore.o \
> perf_event_intel_uncore_nhmex.o
> endif
>
> +obj-$(CONFIG_CGROUP_CACHEQE) += cacheqe.o
>
> obj-$(CONFIG_X86_MCE) += mcheck/
> obj-$(CONFIG_MTRR) += mtrr/
> diff --git a/arch/x86/kernel/cpu/cacheqe.c b/arch/x86/kernel/cpu/cacheqe.c
> new file mode 100644
> index 0000000..2ac3d4e
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/cacheqe.c
> @@ -0,0 +1,487 @@
> +
> +/*
> + * kernel/cacheqe.c
> + *
> + * Processor Cache Allocation code
> + * (Also called cache quality enforcement - cqe)
> + *
> + * Copyright (c) 2014, Intel Corporation.
> + *
> + * 2014-10-15 Written by Vikas Shivappa
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> + * more details.
> + */
> +
> +#include <asm/cacheqe.h>
> +
> +struct cacheqe root_cqe_group;
> +static DEFINE_MUTEX(cqe_group_mutex);
> +
> +bool cqe_genable;
> +
> +/* ccmap maintains 1:1 mapping between CLOSid and cbm.*/
> +
> +static struct closcbm_map *ccmap;
> +static struct cacheqe_subsys_info *cqess_info;
> +
> +char hsw_brandstrs[5][64] = {
> + "Intel(R) Xeon(R) CPU E5-2658 v3 @ 2.20GHz",
> + "Intel(R) Xeon(R) CPU E5-2648L v3 @ 1.80GHz",
> + "Intel(R) Xeon(R) CPU E5-2628L v3 @ 2.00GHz",
> + "Intel(R) Xeon(R) CPU E5-2618L v3 @ 2.30GHz",
> + "Intel(R) Xeon(R) CPU E5-2608L v3 @ 2.00GHz"
> +};
> +
> +#define cacheqe_for_each_child(child_cq, pos_css, parent_cq) \
> + css_for_each_child((pos_css), \
> + &(parent_cq)->css)
> +
> +#if CONFIG_CACHEQE_DEBUG
> +
> +/*DUMP the closid-cbm map.*/
> +
> +static inline void cbmmap_dump(void)
> +{
> +
> + int i;
> +
> + pr_debug("CBMMAP\n");
> + for (i = 0; i < boot_cpu_data.x86_cqe_closs; i++)
> + pr_debug("cbm: 0x%x,ref: %u\n",
> + (unsigned int)ccmap[i].cbm, ccmap[i].ref);
> +
> +}
> +
> +#else
> +
> +static inline void cbmmap_dump(void) {}
> +
> +#endif
> +
> +static inline bool cqe_enabled(struct cpuinfo_x86 *c)
> +{
> +
> + int i;
> +
> + if (cpu_has(c, X86_FEATURE_CQE_L3))
> + return true;
> +
> + /*
> + * Hard code the checks and values for HSW SKUs.
> + * Unfortunately! have to check against only these brand name strings.
> + */
> +
> + for (i = 0; i < 5; i++)
> + if (!strcmp(hsw_brandstrs[i], c->x86_model_id)) {
> + c->x86_cqe_closs = 4;
> + c->x86_cqe_cbmlength = 20;
> + return true;
> + }
> +
> + return false;
> +
> +}
> +
> +
> +static int __init cqe_late_init(void)
> +{
> +
> + struct cpuinfo_x86 *c = &boot_cpu_data;
> + size_t sizeb;
> + int maxid = boot_cpu_data.x86_cqe_closs;
> +
> + cqe_genable = false;
> +
> + /*
> + * Need the cqe_genable hint helps decide if the
> + * kernel has enabled cache allocation.
> + */
> +
> + if (!cqe_enabled(c)) {
> +
> + root_cqe_group.css.ss->disabled = 1;
> + return -ENODEV;
> +
> + } else {
> +
> + cqess_info =
> + kzalloc(sizeof(struct cacheqe_subsys_info),
> + GFP_KERNEL);
> +
> + if (!cqess_info)
> + return -ENOMEM;
> +
> + sizeb = BITS_TO_LONGS(c->x86_cqe_closs) * sizeof(long);
> + cqess_info->closmap =
> + kzalloc(sizeb, GFP_KERNEL);
> +
> + if (!cqess_info->closmap) {
> + kfree(cqess_info);
> + return -ENOMEM;
> + }
> +
> + sizeb = maxid * sizeof(struct closcbm_map);
> + ccmap = kzalloc(sizeb, GFP_KERNEL);
> +
> + if (!ccmap)
> + return -ENOMEM;
> +
> + /* Allocate the CLOS for root.*/
> + set_bit(0, cqess_info->closmap);
> + root_cqe_group.clos = 0;
> +
> + /*
> + * The cbmlength expected be atleast 1.
> + * All bits are set for the root cbm.
> + */
> +
> + ccmap[root_cqe_group.clos].cbm =
> + (u32)((u64)(1 << c->x86_cqe_cbmlength) - 1);
> + root_cqe_group.cbm = &ccmap[root_cqe_group.clos].cbm;
> + ccmap[root_cqe_group.clos].ref++;
> +
> + barrier();
> + cqe_genable = true;
> +
> + pr_info("CQE enabled cbmlength is %u\ncqe Closs : %u ",
> + c->x86_cqe_cbmlength, c->x86_cqe_closs);
> +
> + }
> +
> + return 0;
> +
> +}
> +
> +late_initcall(cqe_late_init);
> +
> +/*
> + * Allocates a new closid from unused list of closids.
> + * Called with the cqe_group_mutex held.
> + */
> +
> +static int cqe_alloc_closid(struct cacheqe *cq)
> +{
> + unsigned int tempid;
> + unsigned int maxid;
> + int err;
> +
> + maxid = boot_cpu_data.x86_cqe_closs;
> +
> + tempid = find_next_zero_bit(cqess_info->closmap, maxid, 0);
> +
> + if (tempid == maxid) {
> + err = -ENOSPC;
> + goto closidallocfail;
> + }
> +
> + set_bit(tempid, cqess_info->closmap);
> + ccmap[tempid].ref++;
> + cq->clos = tempid;
> +
> + pr_debug("cqe : Allocated a directory.closid:%u\n", cq->clos);
> +
> + return 0;
> +
> +closidallocfail:
> +
> + return err;
> +
> +}
> +
> +/*
> +* Called with the cqe_group_mutex held.
> +*/
> +
> +static void cqe_free_closid(struct cacheqe *cq)
> +{
> +
> + pr_debug("cqe :Freeing closid:%u\n", cq->clos);
> +
> + ccmap[cq->clos].ref--;
> +
> + if (!ccmap[cq->clos].ref)
> + clear_bit(cq->clos, cqess_info->closmap);
> +
> + return;
> +
> +}
> +
> +/* Create a new cacheqe cgroup.*/
> +static struct cgroup_subsys_state *
> +cqe_css_alloc(struct cgroup_subsys_state *parent_css)
> +{
> + struct cacheqe *parent = css_cacheqe(parent_css);
> + struct cacheqe *cq;
> +
> + /* This is the call before the feature is detected */
> + if (!parent) {
> + root_cqe_group.clos = 0;
> + return &root_cqe_group.css;
> + }
> +
> + /* To check if cqe is enabled.*/
> + if (!cqe_genable)
> + return ERR_PTR(-ENODEV);
> +
> + cq = kzalloc(sizeof(struct cacheqe), GFP_KERNEL);
> + if (!cq)
> + return ERR_PTR(-ENOMEM);
> +
> + /*
> + * Child inherits the ClosId and cbm from parent.
> + */
> +
> + cq->clos = parent->clos;
> + mutex_lock(&cqe_group_mutex);
> + ccmap[parent->clos].ref++;
> + mutex_unlock(&cqe_group_mutex);
> +
> + cq->cbm = parent->cbm;
> +
> + pr_debug("cqe : Allocated cgroup closid:%u,ref:%u\n",
> + cq->clos, ccmap[parent->clos].ref);
> +
> + return &cq->css;
> +
> +}
> +
> +/* Destroy an existing CAT cgroup.*/
> +static void cqe_css_free(struct cgroup_subsys_state *css)
> +{
> + struct cacheqe *cq = css_cacheqe(css);
> + int len = boot_cpu_data.x86_cqe_cbmlength;
> +
> + pr_debug("cqe : In cacheqe_css_free\n");
> +
> + mutex_lock(&cqe_group_mutex);
> +
> + /* Reset the CBM for the cgroup.Should be all 1s by default !*/
> +
> + wrmsrl(CQECBMMSR(cq->clos), ((1 << len) - 1));
> + cqe_free_closid(cq);
> + kfree(cq);
> +
> + mutex_unlock(&cqe_group_mutex);
> +
> +}
> +
> +/*
> + * Called during do_exit() syscall during a task exit.
> + * This assumes that the thread is running on the current
> + * cpu.
> + */
> +
> +static void cqe_exit(struct cgroup_subsys_state *css,
> + struct cgroup_subsys_state *old_css,
> + struct task_struct *task)
> +{
> +
> + cacheqe_sched_out(task);
> +
> +}
> +
> +static inline bool cbm_minbits(unsigned long var)
> +{
> +
> + unsigned long i;
> +
> + /*Minimum of 2 bits must be set.*/
> +
> + i = var & (var - 1);
> + if (!i || !var)
> + return false;
> +
> + return true;
> +
> +}
> +
> +/*
> + * Tests if only contiguous bits are set.
> + */
> +
> +static inline bool cbm_iscontiguous(unsigned long var)
> +{
> +
> + unsigned long i;
> +
> + /* Reset the least significant bit.*/
> + i = var & (var - 1);
> +
> + /*
> + * We would have a set of non-contiguous bits when
> + * there is at least one zero
> + * between the most significant 1 and least significant 1.
> + * In the below '&' operation,(var <<1) would have zero in
> + * at least 1 bit position in var apart from least
> + * significant bit if it does not have contiguous bits.
> + * Multiple sets of contiguous bits wont succeed in the below
> + * case as well.
> + */
> +
> + if (i != (var & (var << 1)))
> + return false;
> +
> + return true;
> +
> +}
> +
> +static int cqe_cbm_read(struct seq_file *m, void *v)
> +{
> + struct cacheqe *cq = css_cacheqe(seq_css(m));
> +
> + pr_debug("cqe : In cqe_cqemode_read\n");
> + seq_printf(m, "0x%x\n", (unsigned int)*(cq->cbm));
> +
> + return 0;
> +
> +}
> +
> +static int validate_cbm(struct cacheqe *cq, unsigned long cbmvalue)
> +{
> + struct cacheqe *par, *c;
> + struct cgroup_subsys_state *css;
> +
> + if (!cbm_minbits(cbmvalue) || !cbm_iscontiguous(cbmvalue)) {
> + pr_info("CQE error: minimum bits not set or non contiguous mask\n");
> + return -EINVAL;
> + }
> +
> + /*
> + * Needs to be a subset of its parent.
> + */
> + par = parent_cqe(cq);
> +
> + if (!bitmap_subset(&cbmvalue, par->cbm, MAX_CBM_LENGTH))
> + return -EINVAL;
> +
> + rcu_read_lock();
> +
> + /*
> + * Each of children should be a subset of the mask.
> + */
> +
> + cacheqe_for_each_child(c, css, cq) {
> + c = css_cacheqe(css);
> + if (!bitmap_subset(c->cbm, &cbmvalue, MAX_CBM_LENGTH)) {
> + pr_debug("cqe : Children's cbm not a subset\n");
> + return -EINVAL;
> + }
> + }
> +
> + rcu_read_unlock();
> +
> + return 0;
> +
> +}
> +
> +static bool cbm_search(unsigned long cbm, int *closid)
> +{
> +
> + int maxid = boot_cpu_data.x86_cqe_closs;
> + unsigned int i;
> +
> + for (i = 0; i < maxid; i++)
> + if (bitmap_equal(&cbm, &ccmap[i].cbm, MAX_CBM_LENGTH)) {
> + *closid = i;
> + return true;
> + }
> +
> + return false;
> +
> +}
> +
> +static int cqe_cbm_write(struct cgroup_subsys_state *css,
> + struct cftype *cft, u64 cbmvalue)
> +{
> + struct cacheqe *cq = css_cacheqe(css);
> + ssize_t err = 0;
> + unsigned long cbm;
> + unsigned int closid;
> +
> + pr_debug("cqe : In cqe_cbm_write\n");
> +
> + if (!cqe_genable)
> + return -ENODEV;
> +
> + if (cq == &root_cqe_group || !cq)
> + return -EPERM;
> +
> + /*
> + * Need global mutex as cbm write may allocate the closid.
> + */
> +
> + mutex_lock(&cqe_group_mutex);
> + cbm = (cbmvalue & IA32_CBM_MASK);
> +
> + if (bitmap_equal(&cbm, cq->cbm, MAX_CBM_LENGTH))
> + goto cbmwriteend;
> +
> + err = validate_cbm(cq, cbm);
> + if (err)
> + goto cbmwriteend;
> +
> + /*
> + * Need to assign a CLOSid to the cgroup
> + * if it has a new cbm , or reuse.
> + * This takes care to allocate only
> + * the number of CLOSs available.
> + */
> +
> + cqe_free_closid(cq);
> +
> + if (cbm_search(cbm, &closid)) {
> + cq->clos = closid;
> + ccmap[cq->clos].ref++;
> +
> + } else {
> +
> + err = cqe_alloc_closid(cq);
> +
> + if (err)
> + goto cbmwriteend;
> +
> + wrmsrl(CQECBMMSR(cq->clos), cbm);
> +
> + }
> +
> + /*
> + * Finally store the cbm in cbm map
> + * and store a reference in the cq.
> + */
> +
> + ccmap[cq->clos].cbm = cbm;
> + cq->cbm = &ccmap[cq->clos].cbm;
> +
> + cbmmap_dump();
> +
> +cbmwriteend:
> +
> + mutex_unlock(&cqe_group_mutex);
> + return err;
> +
> +}
> +
> +static struct cftype cqe_files[] = {
> + {
> + .name = "cbm",
> + .seq_show = cqe_cbm_read,
> + .write_u64 = cqe_cbm_write,
> + .mode = 0666,
> + },
> + { } /* terminate */
> +};
> +
> +struct cgroup_subsys cacheqe_cgrp_subsys = {
> + .name = "cacheqe",
> + .css_alloc = cqe_css_alloc,
> + .css_free = cqe_css_free,
> + .exit = cqe_exit,
> + .base_cftypes = cqe_files,
> +};
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 4b4f78c..a9b277a 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -633,6 +633,27 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
> c->x86_capability[9] = ebx;
> }
>
> +/* Additional Intel-defined flags: level 0x00000010 */
> + if (c->cpuid_level >= 0x00000010) {
> + u32 eax, ebx, ecx, edx;
> +
> + cpuid_count(0x00000010, 0, &eax, &ebx, &ecx, &edx);
> +
> + c->x86_capability[10] = ebx;
> +
> + if (cpu_has(c, X86_FEATURE_CQE_L3)) {
> +
> + u32 eax, ebx, ecx, edx;
> +
> + cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx);
> +
> + c->x86_cqe_closs = (edx & 0xffff) + 1;
> + c->x86_cqe_cbmlength = (eax & 0xf) + 1;
> +
> + }
> +
> + }
> +
> /* Extended state features: level 0x0000000d */
> if (c->cpuid_level >= 0x0000000d) {
> u32 eax, ebx, ecx, edx;
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index 98c4f9b..a131c1e 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -53,6 +53,11 @@ SUBSYS(hugetlb)
> #if IS_ENABLED(CONFIG_CGROUP_DEBUG)
> SUBSYS(debug)
> #endif
> +
> +#if IS_ENABLED(CONFIG_CGROUP_CACHEQE)
> +SUBSYS(cacheqe)
> +#endif
> +
> /*
> * DO NOT ADD ANY SUBSYSTEM WITHOUT EXPLICIT ACKS FROM CGROUP MAINTAINERS.
> */
> diff --git a/init/Kconfig b/init/Kconfig
> index 2081a4d..bec92a4 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -968,6 +968,28 @@ config CPUSETS
>
> Say N if unsure.
>
> +config CGROUP_CACHEQE
> + bool "Cache QoS Enforcement cgroup subsystem"
> + depends on X86 || X86_64
> + help
> + This option provides framework to allocate Cache cache lines when
> + applications fill cache.
> + This can be used by users to configure how much cache that can be
> + allocated to different PIDs.
> +
> + Say N if unsure.
> +
> +config CACHEQE_DEBUG
> + bool "Cache QoS Enforcement cgroup subsystem debug"
> + depends on X86 || X86_64
> + help
> + This option provides framework to allocate Cache cache lines when
> + applications fill cache.
> + This can be used by users to configure how much cache that can be
> + allocated to different PIDs.Enables debug
> +
> + Say N if unsure.
> +
> config PROC_PID_CPUSET
> bool "Include legacy /proc/<pid>/cpuset file"
> depends on CPUSETS
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 240157c..afa2897 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2215,7 +2215,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
> perf_event_task_sched_out(prev, next);
> fire_sched_out_preempt_notifiers(prev, next);
> prepare_lock_switch(rq, next);
> - prepare_arch_switch(next);
> + prepare_arch_switch(prev);
> }
>
> /**
> @@ -2254,7 +2254,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> */
> prev_state = prev->state;
> vtime_task_switch(prev);
> - finish_arch_switch(prev);
> + finish_arch_switch(current);
> perf_event_task_sched_in(prev, current);
> finish_lock_switch(rq, prev);
> finish_arch_post_lock_switch();
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 24156c84..79b9ff6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -965,12 +965,36 @@ static inline int task_on_rq_migrating(struct task_struct *p)
> return p->on_rq == TASK_ON_RQ_MIGRATING;
> }
>
> +#ifdef CONFIG_X86_64
> +#ifdef CONFIG_CGROUP_CACHEQE
> +
> +#include <asm/cacheqe.h>
> +
> +# define prepare_arch_switch(prev) cacheqe_sched_out(prev)
> +# define finish_arch_switch(current) cacheqe_sched_in(current)
> +
> +#else
> +
> #ifndef prepare_arch_switch
> # define prepare_arch_switch(next) do { } while (0)
> #endif
> #ifndef finish_arch_switch
> # define finish_arch_switch(prev) do { } while (0)
> #endif
> +
> +#endif
> +#else
> +
> +#ifndef prepare_arch_switch
> +# define prepare_arch_switch(prev) do { } while (0)
> +#endif
> +
> +#ifndef finish_arch_switch
> +# define finish_arch_switch(current) do { } while (0)
> +#endif
> +
> +#endif
> +
> #ifndef finish_arch_post_lock_switch
> # define finish_arch_post_lock_switch() do { } while (0)
> #endif
> --
> 1.9.1
>
>
On Fri, 21 Nov 2014, Thomas Gleixner wrote:
> On Fri, 21 Nov 2014, Dave Hansen wrote:
>> On 11/19/2014 05:05 PM, Vikas Shivappa wrote:
>>> + /*
>>> + * Hard code the checks and values for HSW SKUs.
>>> + * Unfortunately! have to check against only these brand name strings.
>>> + */
>>> +
>>> + for (i = 0; i < 5; i++)
>>> + if (!strcmp(hsw_brandstrs[i], c->x86_model_id)) {
>>> + c->x86_cqe_closs = 4;
>>> + c->x86_cqe_cbmlength = 20;
>>> + return true;
>>> + }
>>
>> Please use ARRAY_SIZE() here. Otherwise, I guarantee the next string
>> you add to hsw_brandstrs[] gets silently ignored.
>>
>> Are there really only 5 CPUs? This:
>>
>>> http://ark.intel.com/products/family/78583/Intel-Xeon-Processor-E5-v3-Family#@Server
>>
>> lists 32 skus.
>
> We really should find a proper software probing solution for this
> instead of having a gazillion of brand strings plus micro code version
> checks around for this.
>
> Why cant the HW folks release a micro code version which fixes the
> obviously wreckaged CPUID enumeration of this feature instead of
> burdening us with that horror?
>
> Can you please find a proper sized clue bat and whack your HW folks
> over the head for this insanity?
Ok, we were in the process of checking on this, as it's not usual to have to hard-code five
expensive brand-string compares. Will update the status in the next version.
Thanks,
Vikas
>
> Thanks,
>
> tglx
>
On Fri, 21 Nov, at 03:19:52PM, Thomas Gleixner wrote:
> > + barrier();
> > + cqe_genable = true;
>
> What's the exact point of that barrier?
Yes, this definitely needs documenting. Vikas?
> > +
> > +/*
> > + * Tests if only contiguous bits are set.
> > + */
> > +
> > +static inline bool cbm_iscontiguous(unsigned long var)
> > +{
>
> This one here can be implemented with existing bitmap functions as
> well.
Something like this?
first_bit = find_next_bit(map, nr_bits, -1);
zero_bit = find_next_zero_bit(map, nr_bits, first_bit);
if (find_next_bit(map, nr_bits, zero_bit) < nr_bits)
return -EINVAL; /* non-contiguous bits */
--
Matt Fleming, Intel Open Source Technology Center
On Sun, 23 Nov 2014, Matt Fleming wrote:
> Something like this?
>
> first_bit = find_next_bit(map, nr_bits, -1);
> zero_bit = find_next_zero_bit(map, nr_bits, first_bit);
>
> if (find_next_bit(map, nr_bits, zero_bit) < nr_bits)
> return -EINVAL; /* non-contiguous bits */
Looks pretty self explaining, right?
Thanks,
tglx
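Spelled out in the shape of the helper from the patch, that bitmap-based check might
look roughly like this (an untested sketch; nr_bits would be the cbm length reported
by CPUID):

static bool cbm_is_contiguous(unsigned long *map, unsigned int nr_bits)
{
	unsigned int first_bit, zero_bit;

	/* An empty mask is not a contiguous mask. */
	first_bit = find_first_bit(map, nr_bits);
	if (first_bit == nr_bits)
		return false;

	/* Find the first hole after the initial run of set bits ... */
	zero_bit = find_next_zero_bit(map, nr_bits, first_bit);

	/* ... and reject any further set bit after that hole. */
	if (find_next_bit(map, nr_bits, zero_bit) < nr_bits)
		return false;

	return true;
}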
On Fri, 21 Nov 2014, Vikas Shivappa wrote:
> On Fri, 21 Nov 2014, Thomas Gleixner wrote:
> > On Wed, 19 Nov 2014, Vikas Shivappa wrote:
> > > + rdmsr(IA32_PQR_ASSOC, l, h);
> >
> > Why on earth do we want to read an MSR on every context switch? What's
> > wrong with having
> >
> > DEFINE_PER_CPU(u64, cacheqe_pqr);
> >
> > u64 next_pqr, cur_pqr = this_cpu_read(cacheqe_pqr);
> >
> > cq = task_cacheqe(task);
> > next_pqr = cq ? cq->pqr : 0;
> >
> > if (next_pqr != cur_pqr) {
> > wrmsrl(IA32_PQR_ASSOC, next_pqr);
>
> when cqm(cache monitoring) and cqe are both running together , cant corrupt
> the low 32 bits, the rdmsr was a fix for that.
> If there are better alternatives at low cost , certainly acceptable
Sure there are. If cache monitoring and this runs together then they
should better synchronize the cached PQR bits for the current cpu
instead of enforcing useless MSR reads into the hot path.
What you provided is certainly unacceptable.
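For illustration, a rough sketch of such a shared per-cpu PQR cache, where the
monitoring side owns the RMID (bits 0-9) and the enforcement side owns the CLOS
(upper bits), and the MSR is only written when one half actually changes. The names
and layout here are made up for the sketch, not taken from the patch:

struct intel_pqr_state {
	u32 rmid;	/* cached IA32_PQR_ASSOC bits 0-9 (monitoring) */
	u32 closid;	/* cached IA32_PQR_ASSOC bits 32-63 (enforcement) */
};

static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);

static void pqr_update_closid(u32 closid)
{
	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);

	if (state->closid == closid)
		return;

	state->closid = closid;
	/* Reuse the cached RMID, no rdmsr needed in the hot path. */
	wrmsr(IA32_PQR_ASSOC, state->rmid, closid);
}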
> > > +/* Create a new cacheqe cgroup.*/
> > > +static struct cgroup_subsys_state *
> > > +cqe_css_alloc(struct cgroup_subsys_state *parent_css)
> > > +{
> > > + struct cacheqe *parent = css_cacheqe(parent_css);
> > > + struct cacheqe *cq;
> > > +
> > > + /* This is the call before the feature is detected */
> >
> > Huch? We know at early boot that this is available. So why do you need
> > this? The first call to this should never happen before the whole
> > thing is initialized. If it happens, it's a design failure.
> >
> > > + if (!parent) {
> > > + root_cqe_group.clos = 0;
> > > + return &root_cqe_group.css;
> > > + }
> > > +
> > > + /* To check if cqe is enabled.*/
> > > + if (!cqe_genable)
> > > + return ERR_PTR(-ENODEV);
> >
> > So we first check for !parent and return &root_cqe_group.css and later
> > we return an error pointer? Really consistent behaviour.
>
> this is due to how the cgroup init works.
Sorry, I can't follow that argument. Why does the cgroup init enforce this
inconsistency? Why is cgroup init BEFORE we detect the feature?
Thanks,
tglx
-----Original Message-----
From: Thomas Gleixner [mailto:[email protected]]
Sent: Sunday, November 23, 2014 12:05 PM
To: Shivappa, Vikas
Cc: Vikas Shivappa; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; Auld, Will; [email protected]
Subject: Re: [PATCH] x86: Intel Cache Allocation Technology support
On Fri, 21 Nov 2014, Vikas Shivappa wrote:
> On Fri, 21 Nov 2014, Thomas Gleixner wrote:
> > On Wed, 19 Nov 2014, Vikas Shivappa wrote:
> > > + rdmsr(IA32_PQR_ASSOC, l, h);
> >
> > Why on earth do we want to read an MSR on every context switch?
> > What's wrong with having
> >
> > DEFINE_PER_CPU(u64, cacheqe_pqr);
> >
> > u64 next_pqr, cur_pqr = this_cpu_read(cacheqe_pqr);
> >
> > cq = task_cacheqe(task);
> > next_pqr = cq ? cq->pqr : 0;
> >
> > if (next_pqr != cur_pqr) {
> > wrmsrl(IA32_PQR_ASSOC, next_pqr);
>
> when cqm(cache monitoring) and cqe are both running together , cant
> corrupt the low 32 bits, the rdmsr was a fix for that.
> If there are better alternatives at low cost , certainly acceptable
Sure there are. If cache monitoring and this runs together then they should better synchronize the cached PQR bits for the current cpu instead of enforcing useless MSR reads into the hot path.
What you provided is certainly unacceptable.
> > > +/* Create a new cacheqe cgroup.*/ static struct
> > > +cgroup_subsys_state * cqe_css_alloc(struct cgroup_subsys_state
> > > +*parent_css) {
> > > + struct cacheqe *parent = css_cacheqe(parent_css);
> > > + struct cacheqe *cq;
> > > +
> > > + /* This is the call before the feature is detected */
> >
> > Huch? We know at early boot that this is available. So why do you
> > need this? The first call to this should never happen before the
> > whole thing is initialized. If it happens, it's a design failure.
> >
> > > + if (!parent) {
> > > + root_cqe_group.clos = 0;
> > > + return &root_cqe_group.css;
> > > + }
> > > +
> > > + /* To check if cqe is enabled.*/
> > > + if (!cqe_genable)
> > > + return ERR_PTR(-ENODEV);
> >
> > So we first check for !parent and return &root_cqe_group.css and
> > later we return an error pointer? Really consistent behaviour.
>
> this is due to how the cgroup init works.
Sorry, I can't follow that argument. Why does the cgroup init enforce this inconsistency? Why is cgroup init BEFORE we detect the feature?
Actually this can be fixed by disabling the early init for the subsystem, so the above order will be reversed, meaning we first return the error pointer.
Thanks,
Vikas
Thanks,
tglx
-----Original Message-----
From: Thomas Gleixner [mailto:[email protected]]
Sent: Friday, November 21, 2014 6:20 AM
To: Vikas Shivappa
Cc: [email protected]; Shivappa, Vikas; [email protected]; [email protected]; [email protected]; [email protected]; Auld, Will; [email protected]
Subject: Re: [PATCH] x86: Intel Cache Allocation Technology support
On Wed, 19 Nov 2014, Vikas Shivappa wrote:
> +/* maximum possible cbm length */
> +#define MAX_CBM_LENGTH 32
> +
> +#define IA32_CBMMAX_MASK(x) (0xffffffff & (~((u64)(1 << x) - 1)))
Unused define.
> +
> +#define IA32_CBM_MASK 0xffffffff
(~0U) ?
> @@ -0,0 +1,144 @@
> +static inline void cacheqe_sched_in(struct task_struct *task) {
> + struct cacheqe *cq;
> + unsigned int clos;
> + unsigned int l, h;
These variable names suck. You say in the comment below:
write the PQR for the proc.
So what is it now? clos or pqr or what?
IA32_PQR_ASSOC is the MSR name and bit 0-9 is RMID and bit 32-63 is COS
But you create random names just because it makes it harder to map the code to the SDM or is there some other sensible reason I'm missing here?
> +
> + if (!cqe_genable)
> + return;
This wants to be a static key.
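A minimal sketch of the static key variant, assuming the struct static_key / jump
label API of that era (the enable call would replace the plain "cqe_genable = true"
assignment in cqe_late_init()):

#include <linux/jump_label.h>

static struct static_key cqe_enable_key = STATIC_KEY_INIT_FALSE;

static inline void cacheqe_sched_in(struct task_struct *task)
{
	/* Patched to a NOP until the key is enabled at init time. */
	if (!static_key_false(&cqe_enable_key))
		return;
	/* ... write IA32_PQR_ASSOC as needed ... */
}

static void cqe_enable(void)
{
	/* Called once from cqe_late_init() after successful setup. */
	static_key_slow_inc(&cqe_enable_key);
}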
> + rdmsr(IA32_PQR_ASSOC, l, h);
Why on earth do we want to read an MSR on every context switch? What's wrong with having
DEFINE_PER_CPU(u64, cacheqe_pqr);
u64 next_pqr, cur_pqr = this_cpu_read(cacheqe_pqr);
cq = task_cacheqe(task);
next_pqr = cq ? cq->pqr : 0;
if (next_pqr != cur_pqr) {
wrmsrl(IA32_PQR_ASSOC, next_pqr);
this_cpu_write(cacheqe_pqr, next_pqr);
}
That saves the whole idiocy of reading MSRs on schedule in and schedule out. In fact it makes the schedule_out part completely superfluous.
> + rcu_read_lock();
> + cq = task_cacheqe(task);
> +
> + if (cq == NULL || cq->clos == h) {
> + rcu_read_unlock();
> + return;
> + }
> +
> + clos = cq->clos;
> +
> + /*
> + * After finding the cacheqe of the task , write the PQR for the proc.
> + * We are assuming the current core is the one its scheduled to.
You are in the middle of the scheduler. So what needs to be assumed here?
> + * In unified scheduling , write the PQR each time.
What the heck is unified scheduling?
> + */
> + wrmsr(IA32_PQR_ASSOC, l, clos);
> + rcu_read_unlock();
> +
> + CQE_DEBUG(("schedule in clos :0x%x,task cpu:%u, currcpu: %u,pid:%u\n",
> + clos, task_cpu(task), smp_processor_id(), task->pid));
A printk in the guts of the sched_switch function? Ever heard about trace points?
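For comparison, a tracepoint for this spot could look roughly like the sketch below;
it would live in a trace header with the usual CREATE_TRACE_POINTS boilerplate, which
is omitted here:

TRACE_EVENT(cacheqe_sched_in,
	TP_PROTO(unsigned int closid, pid_t pid),
	TP_ARGS(closid, pid),
	TP_STRUCT__entry(
		__field(unsigned int, closid)
		__field(pid_t, pid)
	),
	TP_fast_assign(
		__entry->closid = closid;
		__entry->pid = pid;
	),
	TP_printk("closid=%u pid=%d", __entry->closid, __entry->pid)
);

The call site then becomes trace_cacheqe_sched_in(clos, task->pid), which is a
patched-out NOP unless the event is enabled.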
> +
> +}
> +
> +static inline void cacheqe_sched_out(struct task_struct *task) {
> + unsigned int l, h;
> +
> + if (!cqe_genable)
> + return;
> +
> + rdmsr(IA32_PQR_ASSOC, l, h);
> +
> + if (h == 0)
> + return;
> +
> + /*
> + *After finding the cacheqe of the task , write the PQR for the proc.
> + * We are assuming the current core is the one its scheduled to.
> + * Write zero when scheduling out so that we get a more accurate
> + * cache allocation.
This comment makes no sense at all. But well, this does not make sense anyway.
> +ifdef CONFIG_CACHEQE_DEBUG
> +CFLAGS_cacheqe.o := -DDEBUG
> +endif
Certainly NOT!
> +++ b/arch/x86/kernel/cpu/cacheqe.c
> @@ -0,0 +1,487 @@
> +
> +/*
> + * kernel/cacheqe.c
What's the point of the file name here? Especially if the file is not there at all.
> +#include <asm/cacheqe.h>
> +
> +struct cacheqe root_cqe_group;
> +static DEFINE_MUTEX(cqe_group_mutex);
> +
> +bool cqe_genable;
This wants to be readmostly, if it's a global, but we really want a static key for this.
> +
> +/* ccmap maintains 1:1 mapping between CLOSid and cbm.*/
> +
> +static struct closcbm_map *ccmap;
> +static struct cacheqe_subsys_info *cqess_info;
> +
> +char hsw_brandstrs[5][64] = {
> + "Intel(R) Xeon(R) CPU E5-2658 v3 @ 2.20GHz",
> + "Intel(R) Xeon(R) CPU E5-2648L v3 @ 1.80GHz",
> + "Intel(R) Xeon(R) CPU E5-2628L v3 @ 2.00GHz",
> + "Intel(R) Xeon(R) CPU E5-2618L v3 @ 2.30GHz",
> + "Intel(R) Xeon(R) CPU E5-2608L v3 @ 2.00GHz"
> +};
> +
> +#define cacheqe_for_each_child(child_cq, pos_css, parent_cq) \
> + css_for_each_child((pos_css), \
> + &(parent_cq)->css)
> +
> +#if CONFIG_CACHEQE_DEBUG
We really do NOT need another config option for this. See above.
> +/*DUMP the closid-cbm map.*/
Wow that comment is really informative.
> +static inline bool cqe_enabled(struct cpuinfo_x86 *c) {
> +
> + int i;
> +
> + if (cpu_has(c, X86_FEATURE_CQE_L3))
> + return true;
> +
> + /*
> + * Hard code the checks and values for HSW SKUs.
> + * Unfortunately! have to check against only these brand name strings.
> + */
You must be kidding.
> +
> + for (i = 0; i < 5; i++)
> + if (!strcmp(hsw_brandstrs[i], c->x86_model_id)) {
> + c->x86_cqe_closs = 4;
> + c->x86_cqe_cbmlength = 20;
> + return true;
> + }
> +
> + return false;
> +
> +}
> +
> +
> +static int __init cqe_late_init(void) {
> +
> + struct cpuinfo_x86 *c = &boot_cpu_data;
> + size_t sizeb;
> + int maxid = boot_cpu_data.x86_cqe_closs;
> +
> + cqe_genable = false;
Right, we need to initialize bss storage.
> +
> + /*
> + * Need the cqe_genable hint helps decide if the
> + * kernel has enabled cache allocation.
> + */
What hint?
> + if (!cqe_enabled(c)) {
> +
> + root_cqe_group.css.ss->disabled = 1;
> + return -ENODEV;
> +
> + } else {
> +
> + cqess_info =
> + kzalloc(sizeof(struct cacheqe_subsys_info),
> + GFP_KERNEL);
Can you add a few more newlines so the code gets better readable?
> +
> + if (!cqess_info)
> + return -ENOMEM;
> +
> + sizeb = BITS_TO_LONGS(c->x86_cqe_closs) * sizeof(long);
> + cqess_info->closmap =
> + kzalloc(sizeb, GFP_KERNEL);
Sigh.
> + if (!cqess_info->closmap) {
> + kfree(cqess_info);
> + return -ENOMEM;
> + }
> +
> + sizeb = maxid * sizeof(struct closcbm_map);
> + ccmap = kzalloc(sizeb, GFP_KERNEL);
> +
> + if (!ccmap)
> + return -ENOMEM;
Leaks closmap and cqess_info.
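The usual way to avoid such leaks is goto-based unwinding; roughly, against the
allocations quoted above:

	cqess_info = kzalloc(sizeof(*cqess_info), GFP_KERNEL);
	if (!cqess_info)
		return -ENOMEM;

	sizeb = BITS_TO_LONGS(c->x86_cqe_closs) * sizeof(long);
	cqess_info->closmap = kzalloc(sizeb, GFP_KERNEL);
	if (!cqess_info->closmap)
		goto out_free_info;

	ccmap = kzalloc(maxid * sizeof(struct closcbm_map), GFP_KERNEL);
	if (!ccmap)
		goto out_free_closmap;

	/* ... rest of the setup ... */
	return 0;

out_free_closmap:
	kfree(cqess_info->closmap);
out_free_info:
	kfree(cqess_info);
	return -ENOMEM;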
> + barrier();
> + cqe_genable = true;
What's the exact point of that barrier?
> +}
> +
Stray newline. checkpatch.pl is optional, right?
> +late_initcall(cqe_late_init);
> +
> +/*
> + * Allocates a new closid from unused list of closids.
There is no unused list.
> + * Called with the cqe_group_mutex held.
> + */
> +
> +static int cqe_alloc_closid(struct cacheqe *cq) {
> + unsigned int tempid;
> + unsigned int maxid;
> + int err;
> +
> + maxid = boot_cpu_data.x86_cqe_closs;
> +
> + tempid = find_next_zero_bit(cqess_info->closmap, maxid, 0);
> +
> + if (tempid == maxid) {
> + err = -ENOSPC;
> + goto closidallocfail;
What's the point of the goto? What's wrong with 'return -ENOSPC;' ?
> + }
> +
> + set_bit(tempid, cqess_info->closmap);
> + ccmap[tempid].ref++;
> + cq->clos = tempid;
> +
> + pr_debug("cqe : Allocated a directory.closid:%u\n", cq->clos);
And of course we have no idea what this closid got allocated for. Your debug info is just useless.
> +static void cqe_free_closid(struct cacheqe *cq) {
> +
> + pr_debug("cqe :Freeing closid:%u\n", cq->clos);
> +
> + ccmap[cq->clos].ref--;
This wants a sanity check to catch ref == 0.
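E.g. something along these lines (sketch):

static void cqe_free_closid(struct cacheqe *cq)
{
	if (WARN_ON(!ccmap[cq->clos].ref))
		return;

	if (!--ccmap[cq->clos].ref)
		clear_bit(cq->clos, cqess_info->closmap);
}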
> +/* Create a new cacheqe cgroup.*/
> +static struct cgroup_subsys_state *
> +cqe_css_alloc(struct cgroup_subsys_state *parent_css) {
> + struct cacheqe *parent = css_cacheqe(parent_css);
> + struct cacheqe *cq;
> +
> + /* This is the call before the feature is detected */
Huch? We know at early boot that this is available. So why do you need this? The first call to this should never happen before the whole thing is initialized. If it happens, it's a design failure.
> + if (!parent) {
> + root_cqe_group.clos = 0;
> + return &root_cqe_group.css;
> + }
> +
> + /* To check if cqe is enabled.*/
> + if (!cqe_genable)
> + return ERR_PTR(-ENODEV);
So we first check for !parent and return &root_cqe_group.css and later we return an error pointer? Really consistent behaviour.
> +/* Destroy an existing CAT cgroup.*/
If you would spend some time on providing real comments instead of documenting the obvious ...
> +static void cqe_css_free(struct cgroup_subsys_state *css) {
> + struct cacheqe *cq = css_cacheqe(css);
> + int len = boot_cpu_data.x86_cqe_cbmlength;
> +
> + pr_debug("cqe : In cacheqe_css_free\n");
> +
> + mutex_lock(&cqe_group_mutex);
> +
> + /* Reset the CBM for the cgroup.Should be all 1s by default !*/
> +
> + wrmsrl(CQECBMMSR(cq->clos), ((1 << len) - 1));
So that's writing to IA32_L3_MASK_n, right? Pretty obvious name:
CQECBMMSR.
And the write should set ALL bits to 1 not just the enabled ones if I understand the SDM correctly and what you said in your own comment above.
Aside of that what guarantees that all references to this L3_MASK are gone? You allow the sharing of COS ids. Your whole refcounting scheme is a trainwreck. And when the COS id is not longer in use, there is no point in writing the MSR at all. It does not matter whats in there if nothing ever uses it.
> + cqe_free_closid(cq);
> + kfree(cq);
> +
> + mutex_unlock(&cqe_group_mutex);
> +
> +}
> +
> +/*
> + * Called during do_exit() syscall during a task exit.
> + * This assumes that the thread is running on the current
> + * cpu.
Your assumptions about running on current cpu are interesting. On which CPU is a task supposed to run, when it is on a CPU? Surely on some other cpu, right?
> + */
> +
> +static void cqe_exit(struct cgroup_subsys_state *css,
> + struct cgroup_subsys_state *old_css,
> + struct task_struct *task)
> +{
> +
> + cacheqe_sched_out(task);
And whats the point here? Nothing at all. You need to clean the associated data and then the exit schedule will clean up the PQR automatically.
> +
> +}
> +
> +static inline bool cbm_minbits(unsigned long var) {
> +
Your supply of useless newlines is endless.
> + unsigned long i;
> +
> + /*Minimum of 2 bits must be set.*/
> +
> + i = var & (var - 1);
> + if (!i || !var)
> + return false;
> +
> + return true;
> +
> +}
What's wrong with using bitmap functions consistently? So this test would simply become:
bitmap_weight(map, nrbits) < 2
bitmap_weight() checks each bit, with a shift and '&' each time. Since we always have only 2 bits to check, it was done this way.
> +
> +/*
> + * Tests if only contiguous bits are set.
> + */
> +
> +static inline bool cbm_iscontiguous(unsigned long var) {
This one here can be implemented with existing bitmap functions as well.
Would be too simple and too obvious to understand, right?
> +static int cqe_cbm_read(struct seq_file *m, void *v) {
> + struct cacheqe *cq = css_cacheqe(seq_css(m));
> +
> + pr_debug("cqe : In cqe_cqemode_read\n");
We really need a debug printk for a seqfile read function, right?
> + seq_printf(m, "0x%x\n", (unsigned int)*(cq->cbm));
seq_bitmap()
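I.e. roughly (sketch, reusing MAX_CBM_LENGTH from the patch):

static int cqe_cbm_read(struct seq_file *m, void *v)
{
	struct cacheqe *cq = css_cacheqe(seq_css(m));

	seq_bitmap(m, cq->cbm, MAX_CBM_LENGTH);
	seq_putc(m, '\n');

	return 0;
}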
> +static int validate_cbm(struct cacheqe *cq, unsigned long cbmvalue) {
> + struct cacheqe *par, *c;
> + struct cgroup_subsys_state *css;
> +
> + if (!cbm_minbits(cbmvalue) || !cbm_iscontiguous(cbmvalue)) {
> + pr_info("CQE error: minimum bits not set or non contiguous mask\n");
> + return -EINVAL;
> + }
> +
> + /*
> + * Needs to be a subset of its parent.
> + */
> + par = parent_cqe(cq);
> +
> + if (!bitmap_subset(&cbmvalue, par->cbm, MAX_CBM_LENGTH))
> + return -EINVAL;
> +
> + rcu_read_lock();
> +
> + /*
> + * Each of children should be a subset of the mask.
> + */
> +
> + cacheqe_for_each_child(c, css, cq) {
> + c = css_cacheqe(css);
> + if (!bitmap_subset(c->cbm, &cbmvalue, MAX_CBM_LENGTH)) {
> + pr_debug("cqe : Children's cbm not a subset\n");
> + return -EINVAL;
> + }
> + }
So now you have 3 error exits here. One with pr_info, one without and one with pr_debug. And of course the most obvious to figure out has a pr_info. Useful.
> +static bool cbm_search(unsigned long cbm, int *closid) {
> +
> + int maxid = boot_cpu_data.x86_cqe_closs;
> + unsigned int i;
> +
> + for (i = 0; i < maxid; i++)
> + if (bitmap_equal(&cbm, &ccmap[i].cbm, MAX_CBM_LENGTH)) {
> + *closid = i;
> + return true;
> + }
> +
> + return false;
> +
> +}
> +
> +static int cqe_cbm_write(struct cgroup_subsys_state *css,
> + struct cftype *cft, u64 cbmvalue) {
> + struct cacheqe *cq = css_cacheqe(css);
> + ssize_t err = 0;
> + unsigned long cbm;
> + unsigned int closid;
> +
> + pr_debug("cqe : In cqe_cbm_write\n");
Groan.
> + if (!cqe_genable)
> + return -ENODEV;
> +
> + if (cq == &root_cqe_group || !cq)
> + return -EPERM;
> +
> + /*
> + * Need global mutex as cbm write may allocate the closid.
You need to protect against changes on all ends, not only allocating a closid.
> diff --git a/arch/x86/kernel/cpu/common.c
> b/arch/x86/kernel/cpu/common.c index 4b4f78c..a9b277a 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -633,6 +633,27 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
> c->x86_capability[9] = ebx;
> }
>
> +/* Additional Intel-defined flags: level 0x00000010 */
> + if (c->cpuid_level >= 0x00000010) {
> + u32 eax, ebx, ecx, edx;
> +
> + cpuid_count(0x00000010, 0, &eax, &ebx, &ecx, &edx);
> +
> + c->x86_capability[10] = ebx;
> +
> + if (cpu_has(c, X86_FEATURE_CQE_L3)) {
> +
> + u32 eax, ebx, ecx, edx;
If you add another indentation level, you can define the same variables once more.
> +
> + cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx);
> +
> + c->x86_cqe_closs = (edx & 0xffff) + 1;
> + c->x86_cqe_cbmlength = (eax & 0xf) + 1;
> +config CACHEQE_DEBUG
> + bool "Cache QoS Enforcement cgroup subsystem debug"
This needs to go away. The debugging you provide is beyond useless. We want proper tracepoints for that not random debug printks with no value.
> + depends on X86 || X86_64
> + help
> + This option provides framework to allocate Cache cache lines when
> + applications fill cache.
> + This can be used by users to configure how much cache that can be
> + allocated to different PIDs.Enables debug
Copy and paste is wonderful, right?
> + Say N if unsure.
> +
> config PROC_PID_CPUSET
> bool "Include legacy /proc/<pid>/cpuset file"
> depends on CPUSETS
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c index
> 240157c..afa2897 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2215,7 +2215,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
> perf_event_task_sched_out(prev, next);
> fire_sched_out_preempt_notifiers(prev, next);
> prepare_lock_switch(rq, next);
> - prepare_arch_switch(next);
> + prepare_arch_switch(prev);
Great. You just broke all architectures which implement prepare_arch_switch(). Congratulations!
> }
>
> /**
> @@ -2254,7 +2254,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> */
> prev_state = prev->state;
> vtime_task_switch(prev);
> - finish_arch_switch(prev);
> + finish_arch_switch(current);
Ditto.
> perf_event_task_sched_in(prev, current);
> finish_lock_switch(rq, prev);
> finish_arch_post_lock_switch();
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index
> 24156c84..79b9ff6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -965,12 +965,36 @@ static inline int task_on_rq_migrating(struct task_struct *p)
> return p->on_rq == TASK_ON_RQ_MIGRATING; }
>
> +#ifdef CONFIG_X86_64
> +#ifdef CONFIG_CGROUP_CACHEQE
> +
> +#include <asm/cacheqe.h>
We surely not put x86 specific nonsense in the core scheduler code. Did you ever come up with the idea to look how this is used by other architectures? Obviously not.
I have no idea how Matts reviewed tag got onto this masterpiece of trainwreck engineering and I better dont want to know.
Thanks,
tglx
-----Original Message-----
From: Thomas Gleixner [mailto:[email protected]]
Sent: Friday, November 21, 2014 6:20 AM
To: Vikas Shivappa
Cc: [email protected]; Shivappa, Vikas; [email protected]; [email protected]; [email protected]; [email protected]; Auld, Will; [email protected]
Subject: Re: [PATCH] x86: Intel Cache Allocation Technology support
On Wed, 19 Nov 2014, Vikas Shivappa wrote:
> +/* maximum possible cbm length */
> +#define MAX_CBM_LENGTH 32
> +
> +#define IA32_CBMMAX_MASK(x) (0xffffffff & (~((u64)(1 << x) - 1)))
Unused define.
> +
> +#define IA32_CBM_MASK 0xffffffff
(~0U) ?
> @@ -0,0 +1,144 @@
> +static inline void cacheqe_sched_in(struct task_struct *task) {
> + struct cacheqe *cq;
> + unsigned int clos;
> + unsigned int l, h;
These variable names suck. You say in the comment below:
write the PQR for the proc.
So what is it now? clos or pqr or what?
IA32_PQR_ASSOC is the MSR name and bit 0-9 is RMID and bit 32-63 is COS
But you create random names just because it makes it harder to map the code to the SDM or is there some other sensible reason I'm missing here?
> +
> + if (!cqe_genable)
> + return;
This wants to be a static key.
> + rdmsr(IA32_PQR_ASSOC, l, h);
Why on earth do we want to read an MSR on every context switch? What's wrong with having
DEFINE_PER_CPU(u64, cacheqe_pqr);
u64 next_pqr, cur_pqr = this_cpu_read(cacheqe_pqr);
cq = task_cacheqe(task);
next_pqr = cq ? cq->pqr : 0;
if (next_pqr != cur_pqr) {
wrmsrl(IA32_PQR_ASSOC, next_pqr);
this_cpu_write(cacheqe_pqr, next_pqr);
}
That saves the whole idiocy of reading MSRs on schedule in and schedule out. In fact it makes the schedule_out part completely superfluous.
> + rcu_read_lock();
> + cq = task_cacheqe(task);
> +
> + if (cq == NULL || cq->clos == h) {
> + rcu_read_unlock();
> + return;
> + }
> +
> + clos = cq->clos;
> +
> + /*
> + * After finding the cacheqe of the task , write the PQR for the proc.
> + * We are assuming the current core is the one its scheduled to.
You are in the middle of the scheduler. So what needs to be assumed here?
> + * In unified scheduling , write the PQR each time.
What the heck is unified scheduling?
> + */
> + wrmsr(IA32_PQR_ASSOC, l, clos);
> + rcu_read_unlock();
> +
> + CQE_DEBUG(("schedule in clos :0x%x,task cpu:%u, currcpu: %u,pid:%u\n",
> + clos, task_cpu(task), smp_processor_id(), task->pid));
A printk in the guts of the sched_switch function? Ever heard about trace points?
> +
> +}
> +
> +static inline void cacheqe_sched_out(struct task_struct *task) {
> + unsigned int l, h;
> +
> + if (!cqe_genable)
> + return;
> +
> + rdmsr(IA32_PQR_ASSOC, l, h);
> +
> + if (h == 0)
> + return;
> +
> + /*
> + *After finding the cacheqe of the task , write the PQR for the proc.
> + * We are assuming the current core is the one its scheduled to.
> + * Write zero when scheduling out so that we get a more accurate
> + * cache allocation.
This comment makes no sense at all. But well, this does not make sense anyway.
> +ifdef CONFIG_CACHEQE_DEBUG
> +CFLAGS_cacheqe.o := -DDEBUG
> +endif
Certainly NOT!
> +++ b/arch/x86/kernel/cpu/cacheqe.c
> @@ -0,0 +1,487 @@
> +
> +/*
> + * kernel/cacheqe.c
What's the point of the file name here? Especially if the file is not there at all.
> +#include <asm/cacheqe.h>
> +
> +struct cacheqe root_cqe_group;
> +static DEFINE_MUTEX(cqe_group_mutex);
> +
> +bool cqe_genable;
This wants to be readmostly, if it's a global, but we really want a static key for this.
> +
> +/* ccmap maintains 1:1 mapping between CLOSid and cbm.*/
> +
> +static struct closcbm_map *ccmap;
> +static struct cacheqe_subsys_info *cqess_info;
> +
> +char hsw_brandstrs[5][64] = {
> + "Intel(R) Xeon(R) CPU E5-2658 v3 @ 2.20GHz",
> + "Intel(R) Xeon(R) CPU E5-2648L v3 @ 1.80GHz",
> + "Intel(R) Xeon(R) CPU E5-2628L v3 @ 2.00GHz",
> + "Intel(R) Xeon(R) CPU E5-2618L v3 @ 2.30GHz",
> + "Intel(R) Xeon(R) CPU E5-2608L v3 @ 2.00GHz"
> +};
> +
> +#define cacheqe_for_each_child(child_cq, pos_css, parent_cq) \
> + css_for_each_child((pos_css), \
> + &(parent_cq)->css)
> +
> +#if CONFIG_CACHEQE_DEBUG
We really do NOT need another config option for this. See above.
> +/*DUMP the closid-cbm map.*/
Wow that comment is really informative.
> +static inline bool cqe_enabled(struct cpuinfo_x86 *c) {
> +
> + int i;
> +
> + if (cpu_has(c, X86_FEATURE_CQE_L3))
> + return true;
> +
> + /*
> + * Hard code the checks and values for HSW SKUs.
> + * Unfortunately! have to check against only these brand name strings.
> + */
You must be kidding.
> +
> + for (i = 0; i < 5; i++)
> + if (!strcmp(hsw_brandstrs[i], c->x86_model_id)) {
> + c->x86_cqe_closs = 4;
> + c->x86_cqe_cbmlength = 20;
> + return true;
> + }
> +
> + return false;
> +
> +}
> +
> +
> +static int __init cqe_late_init(void) {
> +
> + struct cpuinfo_x86 *c = &boot_cpu_data;
> + size_t sizeb;
> + int maxid = boot_cpu_data.x86_cqe_closs;
> +
> + cqe_genable = false;
Right, we need to initialize bss storage.
> +
> + /*
> + * Need the cqe_genable hint helps decide if the
> + * kernel has enabled cache allocation.
> + */
What hint?
> + if (!cqe_enabled(c)) {
> +
> + root_cqe_group.css.ss->disabled = 1;
> + return -ENODEV;
> +
> + } else {
> +
> + cqess_info =
> + kzalloc(sizeof(struct cacheqe_subsys_info),
> + GFP_KERNEL);
Can you add a few more newlines so the code gets better readable?
> +
> + if (!cqess_info)
> + return -ENOMEM;
> +
> + sizeb = BITS_TO_LONGS(c->x86_cqe_closs) * sizeof(long);
> + cqess_info->closmap =
> + kzalloc(sizeb, GFP_KERNEL);
Sigh.
> + if (!cqess_info->closmap) {
> + kfree(cqess_info);
> + return -ENOMEM;
> + }
> +
> + sizeb = maxid * sizeof(struct closcbm_map);
> + ccmap = kzalloc(sizeb, GFP_KERNEL);
> +
> + if (!ccmap)
> + return -ENOMEM;
Leaks closmap and cqess_info.
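The unwinding wants to be done properly. A sketch, using the structures from your patch:

	cqess_info = kzalloc(sizeof(*cqess_info), GFP_KERNEL);
	if (!cqess_info)
		return -ENOMEM;

	cqess_info->closmap = kzalloc(BITS_TO_LONGS(maxid) * sizeof(long),
				      GFP_KERNEL);
	if (!cqess_info->closmap)
		goto out_info;

	ccmap = kzalloc(maxid * sizeof(*ccmap), GFP_KERNEL);
	if (!ccmap)
		goto out_map;

	/* rest of the init */
	return 0;

 out_map:
	kfree(cqess_info->closmap);
 out_info:
	kfree(cqess_info);
	return -ENOMEM;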
> + barrier();
> + cqe_genable = true;
What's the exact point of that barrier?
> +}
> +
Stray newline. checkpatch.pl is optional, right?
> +late_initcall(cqe_late_init);
> +
> +/*
> + * Allocates a new closid from unused list of closids.
There is no unused list.
> + * Called with the cqe_group_mutex held.
> + */
> +
> +static int cqe_alloc_closid(struct cacheqe *cq) {
> + unsigned int tempid;
> + unsigned int maxid;
> + int err;
> +
> + maxid = boot_cpu_data.x86_cqe_closs;
> +
> + tempid = find_next_zero_bit(cqess_info->closmap, maxid, 0);
> +
> + if (tempid == maxid) {
> + err = -ENOSPC;
> + goto closidallocfail;
What's the point of the goto? What's wrong with 'return -ENOSPC;' ?
> + }
> +
> + set_bit(tempid, cqess_info->closmap);
> + ccmap[tempid].ref++;
> + cq->clos = tempid;
> +
> + pr_debug("cqe : Allocated a directory.closid:%u\n", cq->clos);
And of course we have no idea what this closid got allocated for. Your debug info is just useless.
> +static void cqe_free_closid(struct cacheqe *cq) {
> +
> + pr_debug("cqe :Freeing closid:%u\n", cq->clos);
> +
> + ccmap[cq->clos].ref--;
This wants a sanity check to catch ref == 0.
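Something along these lines, assuming the free is meant to be symmetric to the alloc above:

	static void cqe_free_closid(struct cacheqe *cq)
	{
		lockdep_assert_held(&cqe_group_mutex);

		if (WARN_ON(!ccmap[cq->clos].ref))
			return;

		if (!--ccmap[cq->clos].ref)
			clear_bit(cq->clos, cqess_info->closmap);
	}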
> +/* Create a new cacheqe cgroup.*/
> +static struct cgroup_subsys_state *
> +cqe_css_alloc(struct cgroup_subsys_state *parent_css) {
> + struct cacheqe *parent = css_cacheqe(parent_css);
> + struct cacheqe *cq;
> +
> + /* This is the call before the feature is detected */
Huh? We know at early boot whether this is available. So why do you need this? The first call to this should never happen before the whole thing is initialized; if it does, that's a design failure.
> + if (!parent) {
> + root_cqe_group.clos = 0;
> + return &root_cqe_group.css;
> + }
> +
> + /* To check if cqe is enabled.*/
> + if (!cqe_genable)
> + return ERR_PTR(-ENODEV);
So we first check for !parent and return &root_cqe_group.css and later we return an error pointer? Really consistent behaviour.
> +/* Destroy an existing CAT cgroup.*/
If you would spend some time on providing real comments instead of documenting the obvious ...
> +static void cqe_css_free(struct cgroup_subsys_state *css) {
> + struct cacheqe *cq = css_cacheqe(css);
> + int len = boot_cpu_data.x86_cqe_cbmlength;
> +
> + pr_debug("cqe : In cacheqe_css_free\n");
> +
> + mutex_lock(&cqe_group_mutex);
> +
> + /* Reset the CBM for the cgroup.Should be all 1s by default !*/
> +
> + wrmsrl(CQECBMMSR(cq->clos), ((1 << len) - 1));
So that's writing to IA32_L3_MASK_n, right? Pretty obvious name: CQECBMMSR.
And the write should set ALL bits to 1, not just the enabled ones, if I understand the SDM correctly and what you said in your own comment above.
Aside from that, what guarantees that all references to this L3_MASK are gone? You allow the sharing of COS ids. Your whole refcounting scheme is a trainwreck. And when the COS id is no longer in use, there is no point in writing the MSR at all. It does not matter what's in there if nothing ever uses it.
> + cqe_free_closid(cq);
> + kfree(cq);
> +
> + mutex_unlock(&cqe_group_mutex);
> +
> +}
> +
> +/*
> + * Called during do_exit() syscall during a task exit.
> + * This assumes that the thread is running on the current
> + * cpu.
Your assumptions about running on the current CPU are interesting. On which CPU is a task supposed to be running when it is on a CPU? Surely on some other CPU, right?
> + */
> +
> +static void cqe_exit(struct cgroup_subsys_state *css,
> + struct cgroup_subsys_state *old_css,
> + struct task_struct *task)
> +{
> +
> + cacheqe_sched_out(task);
And what's the point here? Nothing at all. You need to clean up the associated data, and the final schedule out during exit will clean up the PQR automatically.
> +
> +}
> +
> +static inline bool cbm_minbits(unsigned long var) {
> +
Your supply of useless newlines is endless.
> + unsigned long i;
> +
> + /*Minimum of 2 bits must be set.*/
> +
> + i = var & (var - 1);
> + if (!i || !var)
> + return false;
> +
> + return true;
> +
> +}
What's wrong with using bitmap functions consistently? So this test would simply become:
bitmap_weight(map, nrbits) < 2
> +
> +/*
> + * Tests if only contiguous bits are set.
> + */
> +
> +static inline bool cbm_iscontiguous(unsigned long var) {
This one here can be implemented with existing bitmap functions as well.
Would be too simple and too obvious to understand, right?
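Both checks together are a handful of lines with the existing helpers (cbm and MAX_CBM_LENGTH as in your patch, the function name is made up):

	static bool cbm_validate(const unsigned long *cbm)
	{
		unsigned int first, zero;

		/* Minimum of 2 bits must be set. */
		if (bitmap_weight(cbm, MAX_CBM_LENGTH) < 2)
			return false;

		/*
		 * Contiguous means: no set bit after the first clear bit
		 * which follows the first set bit.
		 */
		first = find_first_bit(cbm, MAX_CBM_LENGTH);
		zero = find_next_zero_bit(cbm, MAX_CBM_LENGTH, first);

		return find_next_bit(cbm, MAX_CBM_LENGTH, zero) == MAX_CBM_LENGTH;
	}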
> +static int cqe_cbm_read(struct seq_file *m, void *v) {
> + struct cacheqe *cq = css_cacheqe(seq_css(m));
> +
> + pr_debug("cqe : In cqe_cqemode_read\n");
We really need a debug printk for a seqfile read function, right?
> + seq_printf(m, "0x%x\n", (unsigned int)*(cq->cbm));
seq_bitmap()
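Assuming cq->cbm really is a bitmap of MAX_CBM_LENGTH bits, the whole read function shrinks to something like:

	static int cqe_cbm_read(struct seq_file *m, void *v)
	{
		struct cacheqe *cq = css_cacheqe(seq_css(m));

		seq_bitmap(m, cq->cbm, MAX_CBM_LENGTH);
		seq_putc(m, '\n');
		return 0;
	}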
> +static int validate_cbm(struct cacheqe *cq, unsigned long cbmvalue) {
> + struct cacheqe *par, *c;
> + struct cgroup_subsys_state *css;
> +
> + if (!cbm_minbits(cbmvalue) || !cbm_iscontiguous(cbmvalue)) {
> + pr_info("CQE error: minimum bits not set or non contiguous mask\n");
> + return -EINVAL;
> + }
> +
> + /*
> + * Needs to be a subset of its parent.
> + */
> + par = parent_cqe(cq);
> +
> + if (!bitmap_subset(&cbmvalue, par->cbm, MAX_CBM_LENGTH))
> + return -EINVAL;
> +
> + rcu_read_lock();
> +
> + /*
> + * Each of children should be a subset of the mask.
> + */
> +
> + cacheqe_for_each_child(c, css, cq) {
> + c = css_cacheqe(css);
> + if (!bitmap_subset(c->cbm, &cbmvalue, MAX_CBM_LENGTH)) {
> + pr_debug("cqe : Children's cbm not a subset\n");
> + return -EINVAL;
> + }
> + }
So now you have three error exits here: one with a pr_info, one with nothing at all, and one with a pr_debug. And of course the one which is most obvious to figure out gets the pr_info. Useful.
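If you keep that loop at all, one reporting style is enough, and the rcu read side has to be dropped on every exit path. A minimal sketch with the names from your patch:

	rcu_read_lock();
	cacheqe_for_each_child(c, css, cq) {
		c = css_cacheqe(css);
		if (!bitmap_subset(c->cbm, &cbmvalue, MAX_CBM_LENGTH)) {
			rcu_read_unlock();
			return -EINVAL;
		}
	}
	rcu_read_unlock();

	return 0;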
> +static bool cbm_search(unsigned long cbm, int *closid) {
> +
> + int maxid = boot_cpu_data.x86_cqe_closs;
> + unsigned int i;
> +
> + for (i = 0; i < maxid; i++)
> + if (bitmap_equal(&cbm, &ccmap[i].cbm, MAX_CBM_LENGTH)) {
> + *closid = i;
> + return true;
> + }
> +
> + return false;
> +
> +}
> +
> +static int cqe_cbm_write(struct cgroup_subsys_state *css,
> + struct cftype *cft, u64 cbmvalue) {
> + struct cacheqe *cq = css_cacheqe(css);
> + ssize_t err = 0;
> + unsigned long cbm;
> + unsigned int closid;
> +
> + pr_debug("cqe : In cqe_cbm_write\n");
Groan.
> + if (!cqe_genable)
> + return -ENODEV;
> +
> + if (cq == &root_cqe_group || !cq)
> + return -EPERM;
> +
> + /*
> + * Need global mutex as cbm write may allocate the closid.
You need to protect against changes on all ends, not only allocating a closid.
> diff --git a/arch/x86/kernel/cpu/common.c
> b/arch/x86/kernel/cpu/common.c index 4b4f78c..a9b277a 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -633,6 +633,27 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
> c->x86_capability[9] = ebx;
> }
>
> +/* Additional Intel-defined flags: level 0x00000010 */
> + if (c->cpuid_level >= 0x00000010) {
> + u32 eax, ebx, ecx, edx;
> +
> + cpuid_count(0x00000010, 0, &eax, &ebx, &ecx, &edx);
> +
> + c->x86_capability[10] = ebx;
> +
> + if (cpu_has(c, X86_FEATURE_CQE_L3)) {
> +
> + u32 eax, ebx, ecx, edx;
If you add another indentation level, you can define the same variables once more.
> +
> + cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx);
> +
> + c->x86_cqe_closs = (edx & 0xffff) + 1;
> + c->x86_cqe_cbmlength = (eax & 0xf) + 1;

> +config CACHEQE_DEBUG
> + bool "Cache QoS Enforcement cgroup subsystem debug"
This needs to go away. The debugging you provide is beyond useless. We want proper tracepoints for that, not random debug printks with no value.
> + depends on X86 || X86_64
> + help
> + This option provides framework to allocate Cache cache lines when
> + applications fill cache.
> + This can be used by users to configure how much cache that can be
> + allocated to different PIDs.Enables debug
Copy and paste is wonderful, right?
> + Say N if unsure.
> +
> config PROC_PID_CPUSET
> bool "Include legacy /proc/<pid>/cpuset file"
> depends on CPUSETS
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c index
> 240157c..afa2897 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2215,7 +2215,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
> perf_event_task_sched_out(prev, next);
> fire_sched_out_preempt_notifiers(prev, next);
> prepare_lock_switch(rq, next);
> - prepare_arch_switch(next);
> + prepare_arch_switch(prev);
Great. You just broke all architectures which implement prepare_arch_switch(). Congratulations!
> }
>
> /**
> @@ -2254,7 +2254,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> */
> prev_state = prev->state;
> vtime_task_switch(prev);
> - finish_arch_switch(prev);
> + finish_arch_switch(current);
Ditto.
> perf_event_task_sched_in(prev, current);
> finish_lock_switch(rq, prev);
> finish_arch_post_lock_switch();
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index
> 24156c84..79b9ff6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -965,12 +965,36 @@ static inline int task_on_rq_migrating(struct task_struct *p)
> return p->on_rq == TASK_ON_RQ_MIGRATING; }
>
> +#ifdef CONFIG_X86_64
> +#ifdef CONFIG_CGROUP_CACHEQE
> +
> +#include <asm/cacheqe.h>
We surely do not put x86-specific nonsense into the core scheduler code. Did it ever occur to you to look at how this is used by other architectures? Obviously not.
I have no idea how Matt's Reviewed-by tag got onto this masterpiece of trainwreck engineering, and I'd rather not know.
Thanks,
tglx
Thanks to my email client, the last email was sent by mistake. Sorry for the spam!