2015-05-02 01:38:23

by Shivappa Vikas

[permalink] [raw]
Subject: [PATCH V6 0/7] x86/intel_rdt: Intel Cache Allocation Technology

This patch adds a cgroup subsystem to support the new Cache Allocation
Technology (CAT) feature found in future Intel Xeon Intel processors. CAT is
part of Resource Director Technology(RDT) or
Platform Shared resource control which provides support to control
sharing of platform resources like L3 cache.

Cache Allocation Technology(CAT) provides a way for the Software
(OS/VMM) to restrict cache allocation to a defined 'subset' of cache
which may be overlapping with other 'subsets'. This feature is used
when allocating a line in cache ie when pulling new data into the cache.
The programming of the h/w is done via programming MSRs.
The patch series support to perform L3 cache allocation.

In todays new processors the number of cores is continuously increasing
which in turn increase the number of threads or workloads that can
simultaneously be run. When multi-threaded
applications run concurrently, they compete for shared
resources including L3 cache. At times, this L3 cache resource contention may
result in inefficient space utilization. For example a higher priority thread
may end up with lesser L3 cache resource or a cache sensitive app may not get
optimal cache occupancy thereby degrading the performance.
CAT kernel patch helps provides a framework for sharing L3 cache so that users
can allocate the resource according to set requirements.

More information about the feature can be found in the Intel SDM, Volume 3
section 17.15. SDM does not yet use the 'RDT' term yet and it is planned to be
changed at a later time.

*All the patches will apply on 4.1-rc0*.

Changes in V6:
- rebased to 4.1-rc1 which has the CMT(cache monitoring) support included.
- (Thanks to Marcelo's feedback).Fixed support for hot cpu handling for
IA32_L3_QOS MSRs. Although during deep C states the MSR need not be restored
this is needed when physically a new package is added.
-coding convention changes including renaming to cache_mask using a refcnt to
track the number of cgroups using a closid in clos_cbm map.
-1b cbm support for non-hsw SKUs. HSW is an exception which needs the cache bit
masks to be at least 2 bits.

Changes in v5:
- Added support to propagate the cache bit mask update for each
package.
- Removed the cache bit mask reference in the intel_rdt structure as
there was no need for that and we already maintain a separate
closid<->cbm mapping.
- Made a few coding convention changes which include adding the
assertion while freeing the CLOSID.

Changes in V4:
- Integrated with the latest V5 CMT patches.
- Changed naming of cgroup to rdt(resource director technology) from
cat(cache allocation technology). This was done as the RDT is the
umbrella term for platform shared resources allocation. Hence in
future it would be easier to add resource allocation to the same
cgroup
- Naming changes also applied to a lot of other data structures/APIs.
- Added documentation on cgroup usage for cache allocation to address
a lot of questions from various academic and industry regarding
cache allocation usage.

Changes in V3:
- Implements a common software cache for IA32_PQR_MSR
- Implements support for hsw CAT enumeration. This does not use the brand
strings like earlier version but does a probe test. The probe test is done only
on hsw family of processors
- Made a few coding convention, name changes
- Check for lock being held when ClosID manipulation happens

Changes in V2:
- Removed HSW specific enumeration changes. Plan to include it later as a
separate patch.
- Fixed the code in prep_arch_switch to be specific for x86 and removed
x86 defines.
- Fixed cbm_write to not write all 1s when a cgroup is freed.
- Fixed one possible memory leak in init.
- Changed some of manual bitmap
manipulation to use the predefined bitmap APIs to make code more readable
- Changed name in sources from cqe to cat
- Global cat enable flag changed to static_key and disabled cgroup early_init

[PATCH 1/7] x86/intel_rdt: Intel Cache Allocation Technology detection
[PATCH 2/7] x86/intel_rdt: Adds support for Class of service
[PATCH 3/7] x86/intel_rdt: Support cache bit mask for Intel CAT
[PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT
[PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR
[PATCH 6/7] x86/intel_rdt: Intel haswell CAT enumeration
[PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide


2015-05-02 01:38:31

by Shivappa Vikas

[permalink] [raw]
Subject: [PATCH 1/7] x86/intel_rdt: Intel Cache Allocation Technology detection

This patch adds support for the new Cache Allocation Technology (CAT)
feature found in future Intel Xeon processors. CAT is part of Intel
Resource Director Technology(RDT) which enables sharing of processor
resources. This patch includes CPUID enumeration routines for CAT and
new values to track CAT resources to the cpuinfo_x86 structure.

Cache Allocation Technology(CAT) provides a way for the Software
(OS/VMM) to restrict cache allocation to a defined 'subset' of cache
which may be overlapping with other 'subsets'. This feature is used
when allocating a line in cache ie when pulling new data into the cache.
The programming of the h/w is done via programming MSRs.

More information about CAT be found in the Intel (R) x86 Architecture
Software Developer Manual,Volume 3, section 17.15.

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/include/asm/cpufeature.h | 6 +++++-
arch/x86/include/asm/processor.h | 3 +++
arch/x86/kernel/cpu/Makefile | 1 +
arch/x86/kernel/cpu/common.c | 15 +++++++++++++
arch/x86/kernel/cpu/intel_rdt.c | 44 +++++++++++++++++++++++++++++++++++++++
init/Kconfig | 11 ++++++++++
6 files changed, 79 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kernel/cpu/intel_rdt.c

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 3d6606f..30cb56c 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -12,7 +12,7 @@
#include <asm/disabled-features.h>
#endif

-#define NCAPINTS 13 /* N 32-bit words worth of info */
+#define NCAPINTS 14 /* N 32-bit words worth of info */
#define NBUGINTS 1 /* N 32-bit bug flags */

/*
@@ -229,6 +229,7 @@
#define X86_FEATURE_RTM ( 9*32+11) /* Restricted Transactional Memory */
#define X86_FEATURE_CQM ( 9*32+12) /* Cache QoS Monitoring */
#define X86_FEATURE_MPX ( 9*32+14) /* Memory Protection Extension */
+#define X86_FEATURE_RDT ( 9*32+15) /* Resource Allocation */
#define X86_FEATURE_AVX512F ( 9*32+16) /* AVX-512 Foundation */
#define X86_FEATURE_RDSEED ( 9*32+18) /* The RDSEED instruction */
#define X86_FEATURE_ADX ( 9*32+19) /* The ADCX and ADOX instructions */
@@ -252,6 +253,9 @@
/* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
#define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */

+/* Intel-defined CPU features, CPUID level 0x00000010:0 (ebx), word 13 */
+#define X86_FEATURE_CAT_L3 (13*32 + 1) /* Cache QOS Enforcement L3 */
+
/*
* BUG word(s)
*/
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 23ba676..7d9aee2 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -114,6 +114,9 @@ struct cpuinfo_x86 {
int x86_cache_occ_scale; /* scale to bytes */
int x86_power;
unsigned long loops_per_jiffy;
+ /* Cache Allocation Technology values */
+ u16 x86_cat_cbmlength;
+ u16 x86_cat_closs;
/* cpuid returned max cores value: */
u16 x86_max_cores;
u16 apicid;
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 9bff687..4ff7a1f 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -48,6 +48,7 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += perf_event_intel_uncore.o \
perf_event_intel_uncore_nhmex.o
endif

+obj-$(CONFIG_CGROUP_RDT) += intel_rdt.o

obj-$(CONFIG_X86_MCE) += mcheck/
obj-$(CONFIG_MTRR) += mtrr/
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index a62cf04..f39d948 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -670,6 +670,21 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
}
}

+ /* Additional Intel-defined flags: level 0x00000010 */
+ if (c->cpuid_level >= 0x00000010) {
+ u32 eax, ebx, ecx, edx;
+
+ cpuid_count(0x00000010, 0, &eax, &ebx, &ecx, &edx);
+ c->x86_capability[13] = ebx;
+
+ if (cpu_has(c, X86_FEATURE_CAT_L3)) {
+
+ cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx);
+ c->x86_cat_closs = edx + 1;
+ c->x86_cat_cbmlength = eax + 1;
+ }
+ }
+
/* AMD-defined flags: level 0x80000001 */
xlvl = cpuid_eax(0x80000000);
c->extended_cpuid_level = xlvl;
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
new file mode 100644
index 0000000..901b6fa
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -0,0 +1,44 @@
+/*
+ * Resource Director Technology(RDT)/Cache quality of service code.
+ *
+ * Copyright (C) 2014 Intel Corporation
+ *
+ * 2014-09-10 Written by
+ * Vikas Shivappa <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * More information about RDT be found in the Intel (R) x86 Architecture
+ * Software Developer Manual, volume 3, section 17.15.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/spinlock.h>
+
+static int __init rdt_late_init(void)
+{
+ struct cpuinfo_x86 *c = &boot_cpu_data;
+ int maxid, cbm_len;
+
+ if (!cpu_has(c, X86_FEATURE_CAT_L3))
+ return -ENODEV;
+
+ maxid = c->x86_cat_closs;
+ cbm_len = c->x86_cat_cbmlength;
+
+ pr_info("Max bitmask length:%u,Max ClosIds: %u\n", cbm_len, maxid);
+
+ return 0;
+}
+
+late_initcall(rdt_late_init);
diff --git a/init/Kconfig b/init/Kconfig
index dc24dec..d97ff5e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -983,6 +983,17 @@ config CPUSETS

Say N if unsure.

+config CGROUP_RDT
+ bool "Resource Director Technology cgroup subsystem"
+ depends on X86_64 && CPU_SUP_INTEL
+ help
+ This option provides a cgroup to allocate Platform shared
+ resources. Among the shared resources, current implementation
+ focuses on L3 Cache. Using the interface user can specify the
+ amount of L3 cache space into which an application can fill.
+
+ Say N if unsure.
+
config PROC_PID_CPUSET
bool "Include legacy /proc/<pid>/cpuset file"
depends on CPUSETS
--
1.9.1

2015-05-02 01:38:33

by Shivappa Vikas

[permalink] [raw]
Subject: [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management

This patch adds a cgroup subsystem to support Intel Resource Director
Technology(RDT) or Platform Shared resources Control. The resources that
are currently supported for sharing is L3 cache
(Cache Allocation Technology or CAT).
When a RDT cgroup is created it has a CLOSid and CBM associated with it
which are inherited from its parent. A Class of service(CLOS) in Cache
Allocation is represented by a CLOSid. CLOSid is internal to the kernel
and not exposed to user. Cache bitmask(CBM) represents one global cache
'subset'. Tasks belonging to a cgroup would get to fill the L3 cache
represented by the CBM. Root cgroup would have all available bits set
for its CBM and would be assigned the CLOSid 0.

CLOSid allocation is tracked using a separate bitmap. The maximum number
of CLOSids is specified by the h/w during CPUID enumeration and the
kernel simply throws an -ENOSPC when it runs out of CLOSids.

Each CBM has an associated CLOSid. If multiple cgroups have the same CBM
they would also have the same CLOSid. The reference count parameter in
CLOSid-CBM map keeps track of how many cgroups are using each
CLOSid<->CBM mapping.

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/include/asm/intel_rdt.h | 38 +++++++++++++++
arch/x86/kernel/cpu/intel_rdt.c | 100 ++++++++++++++++++++++++++++++++++++++-
include/linux/cgroup_subsys.h | 4 ++
3 files changed, 140 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/include/asm/intel_rdt.h

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
new file mode 100644
index 0000000..87af1a5
--- /dev/null
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -0,0 +1,38 @@
+#ifndef _RDT_H_
+#define _RDT_H_
+
+#ifdef CONFIG_CGROUP_RDT
+
+#include <linux/cgroup.h>
+
+struct rdt_subsys_info {
+ /* Clos Bitmap to keep track of available CLOSids.*/
+ unsigned long *closmap;
+};
+
+struct intel_rdt {
+ struct cgroup_subsys_state css;
+ /* Class of service for the cgroup.*/
+ unsigned int clos;
+};
+
+struct clos_cbm_map {
+ unsigned long cbm;
+ unsigned int cgrp_count;
+};
+
+/*
+ * Return rdt group corresponding to this container.
+ */
+static inline struct intel_rdt *css_rdt(struct cgroup_subsys_state *css)
+{
+ return css ? container_of(css, struct intel_rdt, css) : NULL;
+}
+
+static inline struct intel_rdt *parent_rdt(struct intel_rdt *ir)
+{
+ return css_rdt(ir->css.parent);
+}
+
+#endif
+#endif
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 901b6fa..eec57fe 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -24,17 +24,97 @@
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/spinlock.h>
+#include <asm/intel_rdt.h>
+
+/*
+ * ccmap maintains 1:1 mapping between CLOSid and cbm.
+ */
+static struct clos_cbm_map *ccmap;
+static struct rdt_subsys_info rdtss_info;
+static DEFINE_MUTEX(rdt_group_mutex);
+struct intel_rdt rdt_root_group;
+
+static inline bool cat_supported(struct cpuinfo_x86 *c)
+{
+ if (cpu_has(c, X86_FEATURE_CAT_L3))
+ return true;
+
+ return false;
+}
+
+/*
+* Called with the rdt_group_mutex held.
+*/
+static int rdt_free_closid(struct intel_rdt *ir)
+{
+
+ lockdep_assert_held(&rdt_group_mutex);
+
+ WARN_ON(!ccmap[ir->clos].cgrp_count);
+ ccmap[ir->clos].cgrp_count--;
+ if (!ccmap[ir->clos].cgrp_count)
+ clear_bit(ir->clos, rdtss_info.closmap);
+
+ return 0;
+}
+
+static struct cgroup_subsys_state *
+rdt_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+ struct intel_rdt *parent = css_rdt(parent_css);
+ struct intel_rdt *ir;
+
+ /*
+ * Cannot return failure on systems with no Cache Allocation
+ * as the cgroup_init does not handle failures gracefully.
+ */
+ if (!parent)
+ return &rdt_root_group.css;
+
+ ir = kzalloc(sizeof(struct intel_rdt), GFP_KERNEL);
+ if (!ir)
+ return ERR_PTR(-ENOMEM);
+
+ mutex_lock(&rdt_group_mutex);
+ ir->clos = parent->clos;
+ ccmap[parent->clos].cgrp_count++;
+ mutex_unlock(&rdt_group_mutex);
+
+ return &ir->css;
+}

static int __init rdt_late_init(void)
{
struct cpuinfo_x86 *c = &boot_cpu_data;
+ static struct clos_cbm_map *ccm;
+ size_t sizeb;
int maxid, cbm_len;

- if (!cpu_has(c, X86_FEATURE_CAT_L3))
+ if (!cat_supported(c)) {
+ rdt_root_group.css.ss->disabled = 1;
return -ENODEV;
-
+ }
maxid = c->x86_cat_closs;
cbm_len = c->x86_cat_cbmlength;
+ sizeb = BITS_TO_LONGS(maxid) * sizeof(long);
+
+ rdtss_info.closmap = kzalloc(sizeb, GFP_KERNEL);
+ if (!rdtss_info.closmap)
+ return -ENOMEM;
+
+ sizeb = maxid * sizeof(struct clos_cbm_map);
+ ccmap = kzalloc(sizeb, GFP_KERNEL);
+ if (!ccmap) {
+ kfree(rdtss_info.closmap);
+ return -ENOMEM;
+ }
+
+ set_bit(0, rdtss_info.closmap);
+ rdt_root_group.clos = 0;
+
+ ccm = &ccmap[0];
+ ccm->cbm = (u32)((u64)(1 << cbm_len) - 1);
+ ccm->cgrp_count++;

pr_info("Max bitmask length:%u,Max ClosIds: %u\n", cbm_len, maxid);

@@ -42,3 +122,19 @@ static int __init rdt_late_init(void)
}

late_initcall(rdt_late_init);
+
+static void rdt_css_free(struct cgroup_subsys_state *css)
+{
+ struct intel_rdt *ir = css_rdt(css);
+
+ mutex_lock(&rdt_group_mutex);
+ rdt_free_closid(ir);
+ kfree(ir);
+ mutex_unlock(&rdt_group_mutex);
+}
+
+struct cgroup_subsys rdt_cgrp_subsys = {
+ .css_alloc = rdt_css_alloc,
+ .css_free = rdt_css_free,
+ .early_init = 0,
+};
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index e4a96fb..81c803d 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -47,6 +47,10 @@ SUBSYS(net_prio)
SUBSYS(hugetlb)
#endif

+#if IS_ENABLED(CONFIG_CGROUP_RDT)
+SUBSYS(rdt)
+#endif
+
/*
* The following subsystems are not supported on the default hierarchy.
*/
--
1.9.1

2015-05-02 01:38:36

by Shivappa Vikas

[permalink] [raw]
Subject: [PATCH 3/7] x86/intel_rdt: Support cache bit mask for Intel CAT

Add support for cache bit mask manipulation. The change adds a file
cache_mask to the RDT cgroup which represents the CBM(cache bit mask)
for the cgroup.

Update to the CBM is done by writing to the IA32_L3_MASK_n.
The RDT cgroup follows cgroup hierarchy ,mkdir and adding tasks to the
cgroup never fails. When a child cgroup is created it inherits the
CLOSid and the cache_mask from its parent. When a user changes the
default CBM for a cgroup, a new CLOSid may be allocated if the
cache_mask was not used before. If the new CBM is the one that is
already used, the count for that CLOSid<->CBM is incremented. The
changing of 'cbm' may fail with -ENOSPC once the kernel runs out of
maximum CLOSids it can support.
User can create as many cgroups as he wants but having different CBMs
at the same time is restricted by the maximum number of CLOSids
(multiple cgroups can have the same CBM).
Kernel maintains a CLOSid<->cbm mapping which keeps count
of cgroups using a CLOSid.

The tasks in the CAT cgroup would get to fill the L3 cache represented
by the cgroup's cache_mask file.

Reuse of CLOSids for cgroups with same bitmask also has following
advantages:
- This helps to use the scant CLOSids optimally.
- This also implies that during context switch, write to PQR-MSR is done
only when a task with a different bitmask is scheduled in.

During cpu bringup due to a hotplug event, IA32_L3_MASK_n MSR is
synchronized from the clos cbm map if it is used by any cgroup for the
package.

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/include/asm/intel_rdt.h | 7 +-
arch/x86/kernel/cpu/intel_rdt.c | 364 ++++++++++++++++++++++++++++++++++++---
2 files changed, 346 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 87af1a5..9e9dbbe 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -4,6 +4,9 @@
#ifdef CONFIG_CGROUP_RDT

#include <linux/cgroup.h>
+#define MAX_CBM_LENGTH 32
+#define IA32_L3_CBM_BASE 0xc90
+#define CBM_FROM_INDEX(x) (IA32_L3_CBM_BASE + x)

struct rdt_subsys_info {
/* Clos Bitmap to keep track of available CLOSids.*/
@@ -17,8 +20,8 @@ struct intel_rdt {
};

struct clos_cbm_map {
- unsigned long cbm;
- unsigned int cgrp_count;
+ unsigned long cache_mask;
+ unsigned int clos_refcnt;
};

/*
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index eec57fe..58b39d6 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -24,16 +24,25 @@
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/spinlock.h>
+#include <linux/cpu.h>
#include <asm/intel_rdt.h>

/*
- * ccmap maintains 1:1 mapping between CLOSid and cbm.
+ * ccmap maintains 1:1 mapping between CLOSid and cache_mask.
*/
static struct clos_cbm_map *ccmap;
static struct rdt_subsys_info rdtss_info;
static DEFINE_MUTEX(rdt_group_mutex);
struct intel_rdt rdt_root_group;

+/*
+ * Mask of CPUs for writing CBM values. We only need one per-socket.
+ */
+static cpumask_t rdt_cpumask;
+
+#define rdt_for_each_child(pos_css, parent_ir) \
+ css_for_each_child((pos_css), &(parent_ir)->css)
+
static inline bool cat_supported(struct cpuinfo_x86 *c)
{
if (cpu_has(c, X86_FEATURE_CAT_L3))
@@ -42,22 +51,66 @@ static inline bool cat_supported(struct cpuinfo_x86 *c)
return false;
}

+static void __clos_init(unsigned int closid)
+{
+ struct clos_cbm_map *ccm = &ccmap[closid];
+
+ lockdep_assert_held(&rdt_group_mutex);
+
+ ccm->clos_refcnt = 1;
+}
+
/*
-* Called with the rdt_group_mutex held.
-*/
-static int rdt_free_closid(struct intel_rdt *ir)
+ * Allocates a new closid from unused closids.
+ */
+static int rdt_alloc_closid(struct intel_rdt *ir)
{
+ unsigned int id;
+ unsigned int maxid;

lockdep_assert_held(&rdt_group_mutex);

- WARN_ON(!ccmap[ir->clos].cgrp_count);
- ccmap[ir->clos].cgrp_count--;
- if (!ccmap[ir->clos].cgrp_count)
- clear_bit(ir->clos, rdtss_info.closmap);
+ maxid = boot_cpu_data.x86_cat_closs;
+ id = find_next_zero_bit(rdtss_info.closmap, maxid, 0);
+ if (id == maxid)
+ return -ENOSPC;
+
+ set_bit(id, rdtss_info.closmap);
+ __clos_init(id);
+ ir->clos = id;

return 0;
}

+static void rdt_free_closid(unsigned int clos)
+{
+
+ lockdep_assert_held(&rdt_group_mutex);
+
+ clear_bit(clos, rdtss_info.closmap);
+}
+
+static void __clos_get(unsigned int closid)
+{
+ struct clos_cbm_map *ccm = &ccmap[closid];
+
+ lockdep_assert_held(&rdt_group_mutex);
+
+ ccm->clos_refcnt += 1;
+}
+
+static void __clos_put(unsigned int closid)
+{
+ struct clos_cbm_map *ccm = &ccmap[closid];
+
+ lockdep_assert_held(&rdt_group_mutex);
+ WARN_ON(!ccm->clos_refcnt);
+
+ ccm->clos_refcnt -= 1;
+ if (!ccm->clos_refcnt)
+ rdt_free_closid(closid);
+}
+
static struct cgroup_subsys_state *
rdt_css_alloc(struct cgroup_subsys_state *parent_css)
{
@@ -77,27 +130,285 @@ rdt_css_alloc(struct cgroup_subsys_state *parent_css)

mutex_lock(&rdt_group_mutex);
ir->clos = parent->clos;
- ccmap[parent->clos].cgrp_count++;
+ __clos_get(ir->clos);
mutex_unlock(&rdt_group_mutex);

return &ir->css;
}

+static void rdt_css_free(struct cgroup_subsys_state *css)
+{
+ struct intel_rdt *ir = css_rdt(css);
+
+ mutex_lock(&rdt_group_mutex);
+ __clos_put(ir->clos);
+ kfree(ir);
+ mutex_unlock(&rdt_group_mutex);
+}
+
+static inline bool cbm_is_contiguous(unsigned long var)
+{
+ unsigned long first_bit, zero_bit;
+ unsigned long maxcbm = MAX_CBM_LENGTH;
+
+ if (!var)
+ return false;
+
+ first_bit = find_next_bit(&var, maxcbm, 0);
+ zero_bit = find_next_zero_bit(&var, maxcbm, first_bit);
+
+ if (find_next_bit(&var, maxcbm, zero_bit) < maxcbm)
+ return false;
+
+ return true;
+}
+
+static int cat_cbm_read(struct seq_file *m, void *v)
+{
+ struct intel_rdt *ir = css_rdt(seq_css(m));
+
+ seq_printf(m, "%08lx\n", ccmap[ir->clos].cache_mask);
+ return 0;
+}
+
+static int validate_cbm(struct intel_rdt *ir, unsigned long cbmvalue)
+{
+ struct intel_rdt *par, *c;
+ struct cgroup_subsys_state *css;
+ unsigned long *cbm_tmp;
+
+ if (!cbm_is_contiguous(cbmvalue)) {
+ pr_err("bitmask should have >= 1 bits and be contiguous\n");
+ return -EINVAL;
+ }
+
+ par = parent_rdt(ir);
+ cbm_tmp = &ccmap[par->clos].cache_mask;
+ if (!bitmap_subset(&cbmvalue, cbm_tmp, MAX_CBM_LENGTH))
+ return -EINVAL;
+
+ rcu_read_lock();
+ rdt_for_each_child(css, ir) {
+ c = css_rdt(css);
+ cbm_tmp = &ccmap[c->clos].cache_mask;
+ if (!bitmap_subset(cbm_tmp, &cbmvalue, MAX_CBM_LENGTH)) {
+ rcu_read_unlock();
+ pr_err("Children's mask not a subset\n");
+ return -EINVAL;
+ }
+ }
+
+ rcu_read_unlock();
+ return 0;
+}
+
+static bool cbm_search(unsigned long cbm, int *closid)
+{
+ int maxid = boot_cpu_data.x86_cat_closs;
+ unsigned int i;
+
+ for (i = 0; i < maxid; i++) {
+ if (bitmap_equal(&cbm, &ccmap[i].cache_mask, MAX_CBM_LENGTH)) {
+ *closid = i;
+ return true;
+ }
+ }
+ return false;
+}
+
+static void cbmmap_dump(void)
+{
+ int i;
+
+ pr_debug("CBMMAP\n");
+ for (i = 0; i < boot_cpu_data.x86_cat_closs; i++)
+ pr_debug("cache_mask: 0x%x,clos_refcnt: %u\n",
+ (unsigned int)ccmap[i].cache_mask, ccmap[i].clos_refcnt);
+}
+
+static void __cpu_cbm_update(void *info)
+{
+ unsigned int closid = *((unsigned int *)info);
+
+ wrmsrl(CBM_FROM_INDEX(closid), ccmap[closid].cache_mask);
+}
+
+/*
+ * cbm_update_all() - Update the cache bit mask for all packages.
+ */
+static inline void cbm_update_all(unsigned int closid)
+{
+ on_each_cpu_mask(&rdt_cpumask, __cpu_cbm_update, &closid, 1);
+}
+
+/*
+ * cbm_update_msrs() - Updates all the existing IA32_L3_MASK_n MSRs
+ * which are one per CLOSid, on the current package.
+ * @cpu : the cpu on which the mask is updated.
+ */
+static inline void cbm_update_msrs(int cpu)
+{
+ int maxid = boot_cpu_data.x86_cat_closs;
+ unsigned int i;
+
+ if (WARN_ON(cpu != smp_processor_id()))
+ return;
+
+ for (i = 1; i < maxid; i++) {
+ if (ccmap[i].clos_refcnt)
+ __cpu_cbm_update(&i);
+ }
+}
+
+/*
+ * rdt_cbm_write() - Validates and writes the cache bit mask(cbm)
+ * to the IA32_L3_MASK_n and also store the same in the ccmap.
+ *
+ * CLOSids are reused for cgroups which have same bitmask.
+ * - This helps to use the scant CLOSids optimally.
+ * - This also implies that at context switch write
+ * to PQR-MSR is done only when a task with a
+ * different bitmask is scheduled in.
+ */
+static int cat_cbm_write(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 cbmvalue)
+{
+ struct intel_rdt *ir = css_rdt(css);
+ ssize_t err = 0;
+ unsigned long cache_mask, max_mask;
+ unsigned long *cbm_tmp;
+ unsigned int closid;
+ u32 max_cbm = boot_cpu_data.x86_cat_cbmlength;
+
+ if (ir == &rdt_root_group)
+ return -EPERM;
+ bitmap_set(&max_mask, 0, max_cbm);
+
+ /*
+ * Need global mutex as cbm write may allocate a closid.
+ */
+ mutex_lock(&rdt_group_mutex);
+ bitmap_and(&cache_mask, (unsigned long *)&cbmvalue, &max_mask, max_cbm);
+ cbm_tmp = &ccmap[ir->clos].cache_mask;
+
+ if (bitmap_equal(&cache_mask, cbm_tmp, MAX_CBM_LENGTH))
+ goto out;
+
+ err = validate_cbm(ir, cache_mask);
+ if (err)
+ goto out;
+
+ /*
+ * At this point we are sure to change the cache_mask.Hence release the
+ * reference to the current CLOSid and try to get a reference for
+ * a different CLOSid.
+ */
+ __clos_put(ir->clos);
+
+ if (cbm_search(cache_mask, &closid)) {
+ ir->clos = closid;
+ __clos_get(closid);
+ } else {
+ err = rdt_alloc_closid(ir);
+ if (err)
+ goto out;
+
+ ccmap[ir->clos].cache_mask = cache_mask;
+ cbm_update_all(ir->clos);
+ }
+
+ cbmmap_dump();
+out:
+
+ mutex_unlock(&rdt_group_mutex);
+ return err;
+}
+
+static inline bool rdt_update_cpumask(int cpu)
+{
+ int phys_id = topology_physical_package_id(cpu);
+ struct cpumask *mask = &rdt_cpumask;
+ int i;
+
+ for_each_cpu(i, mask) {
+ if (phys_id == topology_physical_package_id(i))
+ return false;
+ }
+
+ cpumask_set_cpu(cpu, mask);
+ return true;
+}
+
+/*
+ * rdt_cpu_start() - If a new package has come up, update all
+ * the Cache bitmasks on the package.
+ */
+static inline void rdt_cpu_start(int cpu)
+{
+ mutex_lock(&rdt_group_mutex);
+ if (rdt_update_cpumask(cpu))
+ cbm_update_msrs(cpu);
+ mutex_unlock(&rdt_group_mutex);
+}
+
+static void rdt_cpu_exit(unsigned int cpu)
+{
+ int phys_id = topology_physical_package_id(cpu);
+ int i;
+
+ mutex_lock(&rdt_group_mutex);
+ if (!cpumask_test_and_clear_cpu(cpu, &rdt_cpumask)) {
+ mutex_unlock(&rdt_group_mutex);
+ return;
+ }
+
+ for_each_online_cpu(i) {
+ if (i == cpu)
+ continue;
+
+ if (phys_id == topology_physical_package_id(i)) {
+ cpumask_set_cpu(i, &rdt_cpumask);
+ break;
+ }
+ }
+ mutex_unlock(&rdt_group_mutex);
+}
+
+static int rdt_cpu_notifier(struct notifier_block *nb,
+ unsigned long action, void *hcpu)
+{
+ unsigned int cpu = (unsigned long)hcpu;
+
+ switch (action) {
+ case CPU_STARTING:
+ rdt_cpu_start(cpu);
+ break;
+ case CPU_DOWN_PREPARE:
+ rdt_cpu_exit(cpu);
+ break;
+ default:
+ break;
+ }
+
+ return NOTIFY_OK;
+}
+
static int __init rdt_late_init(void)
{
struct cpuinfo_x86 *c = &boot_cpu_data;
static struct clos_cbm_map *ccm;
size_t sizeb;
- int maxid, cbm_len;
+ int maxid, cbm_len, i;

if (!cat_supported(c)) {
rdt_root_group.css.ss->disabled = 1;
return -ENODEV;
}
+
maxid = c->x86_cat_closs;
cbm_len = c->x86_cat_cbmlength;
- sizeb = BITS_TO_LONGS(maxid) * sizeof(long);

+ sizeb = BITS_TO_LONGS(maxid) * sizeof(long);
rdtss_info.closmap = kzalloc(sizeb, GFP_KERNEL);
if (!rdtss_info.closmap)
return -ENOMEM;
@@ -111,11 +422,17 @@ static int __init rdt_late_init(void)

set_bit(0, rdtss_info.closmap);
rdt_root_group.clos = 0;
-
ccm = &ccmap[0];
- ccm->cbm = (u32)((u64)(1 << cbm_len) - 1);
- ccm->cgrp_count++;
+ bitmap_set(&ccm->cache_mask, 0, cbm_len);
+ ccm->clos_refcnt = 1;
+
+ cpu_notifier_register_begin();
+ for_each_online_cpu(i)
+ rdt_update_cpumask(i);
+
+ __hotcpu_notifier(rdt_cpu_notifier, 0);

+ cpu_notifier_register_done();
pr_info("Max bitmask length:%u,Max ClosIds: %u\n", cbm_len, maxid);

return 0;
@@ -123,18 +440,19 @@ static int __init rdt_late_init(void)

late_initcall(rdt_late_init);

-static void rdt_css_free(struct cgroup_subsys_state *css)
-{
- struct intel_rdt *ir = css_rdt(css);
-
- mutex_lock(&rdt_group_mutex);
- rdt_free_closid(ir);
- kfree(ir);
- mutex_unlock(&rdt_group_mutex);
-}
+static struct cftype rdt_files[] = {
+ {
+ .name = "cache_mask",
+ .seq_show = cat_cbm_read,
+ .write_u64 = cat_cbm_write,
+ .mode = 0666,
+ },
+ { } /* terminate */
+};

struct cgroup_subsys rdt_cgrp_subsys = {
.css_alloc = rdt_css_alloc,
.css_free = rdt_css_free,
+ .legacy_cftypes = rdt_files,
.early_init = 0,
};
--
1.9.1

2015-05-02 01:38:38

by Shivappa Vikas

[permalink] [raw]
Subject: [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT

Adds support for IA32_PQR_ASSOC MSR writes during task scheduling.

The high 32 bits in the per processor MSR IA32_PQR_ASSOC represents the
CLOSid. During context switch kernel implements this by writing the
CLOSid of the cgroup to which the task belongs to the CPU's
IA32_PQR_ASSOC MSR.

For Cache Allocation, this would let the task fill in the cache 'subset'
represented by the cgroup's Cache bit mask(CBM).

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/include/asm/intel_rdt.h | 54 ++++++++++++++++++++++++++++++++++++++++
arch/x86/include/asm/switch_to.h | 3 +++
arch/x86/kernel/cpu/intel_rdt.c | 3 +++
kernel/sched/core.c | 1 +
kernel/sched/sched.h | 3 +++
5 files changed, 64 insertions(+)

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 9e9dbbe..2fc496f 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -4,9 +4,13 @@
#ifdef CONFIG_CGROUP_RDT

#include <linux/cgroup.h>
+
+#define MSR_IA32_PQR_ASSOC 0xc8f
#define MAX_CBM_LENGTH 32
#define IA32_L3_CBM_BASE 0xc90
#define CBM_FROM_INDEX(x) (IA32_L3_CBM_BASE + x)
+DECLARE_PER_CPU(unsigned int, x86_cpu_clos);
+extern struct static_key rdt_enable_key;

struct rdt_subsys_info {
/* Clos Bitmap to keep track of available CLOSids.*/
@@ -24,6 +28,11 @@ struct clos_cbm_map {
unsigned int clos_refcnt;
};

+static inline bool rdt_enabled(void)
+{
+ return static_key_false(&rdt_enable_key);
+}
+
/*
* Return rdt group corresponding to this container.
*/
@@ -37,5 +46,50 @@ static inline struct intel_rdt *parent_rdt(struct intel_rdt *ir)
return css_rdt(ir->css.parent);
}

+/*
+ * Return rdt group to which this task belongs.
+ */
+static inline struct intel_rdt *task_rdt(struct task_struct *task)
+{
+ return css_rdt(task_css(task, rdt_cgrp_id));
+}
+
+/*
+ * rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
+ * if the current Closid is different than the new one.
+ */
+static inline void rdt_sched_in(struct task_struct *task)
+{
+ struct intel_rdt *ir;
+ unsigned int clos;
+
+ if (!rdt_enabled())
+ return;
+
+ /*
+ * This needs to be fixed
+ * to cache the whole PQR instead of just CLOSid.
+ * PQR has closid in high 32 bits and CQM-RMID in low 10 bits.
+ * Should not write a 0 to the low 10 bits of PQR
+ * and corrupt RMID.
+ */
+ clos = this_cpu_read(x86_cpu_clos);
+
+ rcu_read_lock();
+ ir = task_rdt(task);
+ if (ir->clos == clos) {
+ rcu_read_unlock();
+ return;
+ }
+
+ wrmsr(MSR_IA32_PQR_ASSOC, 0, ir->clos);
+ this_cpu_write(x86_cpu_clos, ir->clos);
+ rcu_read_unlock();
+}
+
+#else
+
+static inline void rdt_sched_in(struct task_struct *task) {}
+
#endif
#endif
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 751bf4b..82ef4b3 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -8,6 +8,9 @@ struct tss_struct;
void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
struct tss_struct *tss);

+#include <asm/intel_rdt.h>
+#define post_arch_switch(current) rdt_sched_in(current)
+
#ifdef CONFIG_X86_32

#ifdef CONFIG_CC_STACKPROTECTOR
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 58b39d6..74b1e28 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -34,6 +34,8 @@ static struct clos_cbm_map *ccmap;
static struct rdt_subsys_info rdtss_info;
static DEFINE_MUTEX(rdt_group_mutex);
struct intel_rdt rdt_root_group;
+struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE;
+DEFINE_PER_CPU(unsigned int, x86_cpu_clos);

/*
* Mask of CPUs for writing CBM values. We only need one per-socket.
@@ -433,6 +435,7 @@ static int __init rdt_late_init(void)
__hotcpu_notifier(rdt_cpu_notifier, 0);

cpu_notifier_register_done();
+ static_key_slow_inc(&rdt_enable_key);
pr_info("Max bitmask length:%u,Max ClosIds: %u\n", cbm_len, maxid);

return 0;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9123a8..cacb490 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2241,6 +2241,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
prev_state = prev->state;
vtime_task_switch(prev);
finish_arch_switch(prev);
+ post_arch_switch(current);
perf_event_task_sched_in(prev, current);
finish_lock_switch(rq, prev);
finish_arch_post_lock_switch();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e0e1299..9153747 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1045,6 +1045,9 @@ static inline int task_on_rq_migrating(struct task_struct *p)
#ifndef finish_arch_switch
# define finish_arch_switch(prev) do { } while (0)
#endif
+#ifndef post_arch_switch
+# define post_arch_switch(current) do { } while (0)
+#endif
#ifndef finish_arch_post_lock_switch
# define finish_arch_post_lock_switch() do { } while (0)
#endif
--
1.9.1

2015-05-02 01:38:46

by Shivappa Vikas

[permalink] [raw]
Subject: [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR

This patch implements a common software cache for IA32_PQR_MSR(RMID 0:9,
CLOSId 32:63) to be used by both CMT and CAT. CMT updates the RMID
where as CAT updates the CLOSid in the software cache. When the new
RMID/CLOSid value is different from the cached values, IA32_PQR_MSR is
updated. Since the measured rdmsr latency for IA32_PQR_MSR is very
high(~250 cycles) this software cache is necessary to avoid reading the
MSR to compare the current CLOSid value.
During CPU hotplug the pqr cache is updated to zero.

Signed-off-by: Vikas Shivappa <[email protected]>

Conflicts:
arch/x86/kernel/cpu/perf_event_intel_cqm.c
---
arch/x86/include/asm/intel_rdt.h | 31 +++++++++++++++---------------
arch/x86/include/asm/rdt_common.h | 13 +++++++++++++
arch/x86/kernel/cpu/intel_rdt.c | 3 +++
arch/x86/kernel/cpu/perf_event_intel_cqm.c | 20 +++++++------------
4 files changed, 39 insertions(+), 28 deletions(-)
create mode 100644 arch/x86/include/asm/rdt_common.h

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 2fc496f..6aae109 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -4,12 +4,13 @@
#ifdef CONFIG_CGROUP_RDT

#include <linux/cgroup.h>
+#include <asm/rdt_common.h>

-#define MSR_IA32_PQR_ASSOC 0xc8f
#define MAX_CBM_LENGTH 32
#define IA32_L3_CBM_BASE 0xc90
#define CBM_FROM_INDEX(x) (IA32_L3_CBM_BASE + x)
-DECLARE_PER_CPU(unsigned int, x86_cpu_clos);
+
+DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
extern struct static_key rdt_enable_key;

struct rdt_subsys_info {
@@ -61,30 +62,30 @@ static inline struct intel_rdt *task_rdt(struct task_struct *task)
static inline void rdt_sched_in(struct task_struct *task)
{
struct intel_rdt *ir;
- unsigned int clos;
+ struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+ unsigned long flags;

if (!rdt_enabled())
return;

- /*
- * This needs to be fixed
- * to cache the whole PQR instead of just CLOSid.
- * PQR has closid in high 32 bits and CQM-RMID in low 10 bits.
- * Should not write a 0 to the low 10 bits of PQR
- * and corrupt RMID.
- */
- clos = this_cpu_read(x86_cpu_clos);
-
+ raw_spin_lock_irqsave(&state->lock, flags);
rcu_read_lock();
ir = task_rdt(task);
- if (ir->clos == clos) {
+ if (ir->clos == state->clos) {
rcu_read_unlock();
+ raw_spin_unlock_irqrestore(&state->lock, flags);
return;
}

- wrmsr(MSR_IA32_PQR_ASSOC, 0, ir->clos);
- this_cpu_write(x86_cpu_clos, ir->clos);
+ /*
+ * PQR has closid in high 32 bits and CQM-RMID
+ * in low 10 bits. Rewrite the exsting rmid from
+ * software cache.
+ */
+ wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, ir->clos);
+ state->clos = ir->clos;
rcu_read_unlock();
+ raw_spin_unlock_irqrestore(&state->lock, flags);
}

#else
diff --git a/arch/x86/include/asm/rdt_common.h b/arch/x86/include/asm/rdt_common.h
new file mode 100644
index 0000000..33fd8ea
--- /dev/null
+++ b/arch/x86/include/asm/rdt_common.h
@@ -0,0 +1,13 @@
+#ifndef _X86_RDT_H_
+#define _X86_RDT_H_
+
+#define MSR_IA32_PQR_ASSOC 0x0c8f
+
+struct intel_pqr_state {
+ raw_spinlock_t lock;
+ int rmid;
+ int clos;
+ int cnt;
+};
+
+#endif
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 74b1e28..9da61b2 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -347,6 +347,9 @@ static inline bool rdt_update_cpumask(int cpu)
*/
static inline void rdt_cpu_start(int cpu)
{
+ struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
+
+ state->clos = 0;
mutex_lock(&rdt_group_mutex);
if (rdt_update_cpumask(cpu))
cbm_update_msrs(cpu);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index e4d1b8b..fd039899 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -7,22 +7,16 @@
#include <linux/perf_event.h>
#include <linux/slab.h>
#include <asm/cpu_device_id.h>
+#include <asm/rdt_common.h>
#include "perf_event.h"

-#define MSR_IA32_PQR_ASSOC 0x0c8f
#define MSR_IA32_QM_CTR 0x0c8e
#define MSR_IA32_QM_EVTSEL 0x0c8d

static unsigned int cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */

-struct intel_cqm_state {
- raw_spinlock_t lock;
- int rmid;
- int cnt;
-};
-
-static DEFINE_PER_CPU(struct intel_cqm_state, cqm_state);
+DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);

/*
* Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
@@ -961,7 +955,7 @@ out:

static void intel_cqm_event_start(struct perf_event *event, int mode)
{
- struct intel_cqm_state *state = this_cpu_ptr(&cqm_state);
+ struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
unsigned int rmid = event->hw.cqm_rmid;
unsigned long flags;

@@ -978,14 +972,14 @@ static void intel_cqm_event_start(struct perf_event *event, int mode)
WARN_ON_ONCE(state->rmid);

state->rmid = rmid;
- wrmsrl(MSR_IA32_PQR_ASSOC, state->rmid);
+ wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->clos);

raw_spin_unlock_irqrestore(&state->lock, flags);
}

static void intel_cqm_event_stop(struct perf_event *event, int mode)
{
- struct intel_cqm_state *state = this_cpu_ptr(&cqm_state);
+ struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
unsigned long flags;

if (event->hw.cqm_state & PERF_HES_STOPPED)
@@ -998,7 +992,7 @@ static void intel_cqm_event_stop(struct perf_event *event, int mode)

if (!--state->cnt) {
state->rmid = 0;
- wrmsrl(MSR_IA32_PQR_ASSOC, 0);
+ wrmsr(MSR_IA32_PQR_ASSOC, 0, state->clos);
} else {
WARN_ON_ONCE(!state->rmid);
}
@@ -1243,7 +1237,7 @@ static inline void cqm_pick_event_reader(int cpu)

static void intel_cqm_cpu_prepare(unsigned int cpu)
{
- struct intel_cqm_state *state = &per_cpu(cqm_state, cpu);
+ struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
struct cpuinfo_x86 *c = &cpu_data(cpu);

raw_spin_lock_init(&state->lock);
--
1.9.1

2015-05-02 01:38:42

by Shivappa Vikas

[permalink] [raw]
Subject: [PATCH 6/7] x86/intel_rdt: Intel haswell CAT enumeration

CAT(Cache Allocation Technology) on hsw needs to be enumerated
separately. CAT is only supported on certain HSW SKUs. This patch does
a probe test for hsw CPUs by writing a CLOSid into high 32 bits of
IA32_PQR_MSR and see if the bits stick. The probe test is only done
after confirming that the CPU is HSW.
HSW also requires the L3 cache bitmask to be at least two bits.

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt.c | 56 ++++++++++++++++++++++++++++++++++++++---
1 file changed, 53 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 9da61b2..4c12e5b 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -38,6 +38,11 @@ struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE;
DEFINE_PER_CPU(unsigned int, x86_cpu_clos);

/*
+ * Minimum bits required in Cache bitmasks.
+ */
+static unsigned int min_bitmask_len = 1;
+
+/*
* Mask of CPUs for writing CBM values. We only need one per-socket.
*/
static cpumask_t rdt_cpumask;
@@ -45,11 +50,54 @@ static cpumask_t rdt_cpumask;
#define rdt_for_each_child(pos_css, parent_ir) \
css_for_each_child((pos_css), &(parent_ir)->css)

+/*
+ * hsw_probetest() - Have to do probe
+ * test for Intel haswell CPUs as it does not have
+ * CPUID enumeration support for CAT.
+ *
+ * Probes by writing to the high 32 bits(CLOSid)
+ * of the IA32_PQR_MSR and testing if the bits stick.
+ * Then hardcode the max CLOS and max bitmask length on hsw.
+ * The minimum cache bitmask length allowed for HSW is 2 bits.
+ */
+static inline bool hsw_probetest(void)
+{
+ u32 l, h_old, h_new, h_tmp;
+
+ if (rdmsr_safe(MSR_IA32_PQR_ASSOC, &l, &h_old))
+ return false;
+
+ /*
+ * Default value is always 0 if feature is present.
+ */
+ h_tmp = h_old ^ 0x1U;
+ if (wrmsr_safe(MSR_IA32_PQR_ASSOC, l, h_tmp) ||
+ rdmsr_safe(MSR_IA32_PQR_ASSOC, &l, &h_new))
+ return false;
+
+ if (h_tmp != h_new)
+ return false;
+
+ wrmsr_safe(MSR_IA32_PQR_ASSOC, l, h_old);
+
+ boot_cpu_data.x86_cat_closs = 4;
+ boot_cpu_data.x86_cat_cbmlength = 20;
+ min_bitmask_len = 2;
+
+ return true;
+}
+
static inline bool cat_supported(struct cpuinfo_x86 *c)
{
if (cpu_has(c, X86_FEATURE_CAT_L3))
return true;

+ /*
+ * Probe test for Haswell CPUs.
+ */
+ if (c->x86 == 0x6 && c->x86_model == 0x3f)
+ return hsw_probetest();
+
return false;
}

@@ -153,7 +201,7 @@ static inline bool cbm_is_contiguous(unsigned long var)
unsigned long first_bit, zero_bit;
unsigned long maxcbm = MAX_CBM_LENGTH;

- if (!var)
+ if (bitmap_weight(&var, maxcbm) < min_bitmask_len)
return false;

first_bit = find_next_bit(&var, maxcbm, 0);
@@ -180,7 +228,8 @@ static int validate_cbm(struct intel_rdt *ir, unsigned long cbmvalue)
unsigned long *cbm_tmp;

if (!cbm_is_contiguous(cbmvalue)) {
- pr_err("bitmask should have >= 1 bits and be contiguous\n");
+ pr_err("bitmask should have >=%d bits and be contiguous\n",
+ min_bitmask_len);
return -EINVAL;
}

@@ -236,7 +285,8 @@ static void __cpu_cbm_update(void *info)
}

/*
- * cbm_update_all() - Update the cache bit mask for all packages.
+ * cbm_update_all() - Update the cache bit mask for
+ * all packages.
*/
static inline void cbm_update_all(unsigned int closid)
{
--
1.9.1

2015-05-02 01:38:49

by Shivappa Vikas

[permalink] [raw]
Subject: [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide

Adds a description of Cache allocation technology, overview
of kernel implementation and usage of CAT cgroup interface.

Signed-off-by: Vikas Shivappa <[email protected]>
---
Documentation/cgroups/rdt.txt | 180 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 180 insertions(+)
create mode 100644 Documentation/cgroups/rdt.txt

diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
new file mode 100644
index 0000000..6051b73
--- /dev/null
+++ b/Documentation/cgroups/rdt.txt
@@ -0,0 +1,180 @@
+ RDT
+ ---
+
+Copyright (C) 2014 Intel Corporation
+Written by [email protected]
+(based on contents and format from cpusets.txt)
+
+CONTENTS:
+=========
+
+1. Cache Allocation Technology
+ 1.1 What is RDT and CAT ?
+ 1.2 Why is CAT needed ?
+ 1.3 CAT implementation overview
+ 1.4 Assignment of CBM and CLOS
+ 1.5 Scheduling and Context Switch
+2. Usage Examples and Syntax
+
+1. Cache Allocation Technology(CAT)
+===================================
+
+1.1 What is RDT and CAT
+-----------------------
+
+CAT is a part of Resource Director Technology(RDT) or Platform Shared
+resource control which provides support to control Platform shared
+resources like cache. Currently Cache is the only resource that is
+supported in RDT.
+More information can be found in the Intel SDM section 17.15.
+
+Cache Allocation Technology provides a way for the Software (OS/VMM)
+to restrict cache allocation to a defined 'subset' of cache which may
+be overlapping with other 'subsets'. This feature is used when
+allocating a line in cache ie when pulling new data into the cache.
+The programming of the h/w is done via programming MSRs.
+
+The different cache subsets are identified by CLOS identifier (class
+of service) and each CLOS has a CBM (cache bit mask). The CBM is a
+contiguous set of bits which defines the amount of cache resource that
+is available for each 'subset'.
+
+1.2 Why is CAT needed
+---------------------
+
+The CAT enables more cache resources to be made available for higher
+priority applications based on guidance from the execution
+environment.
+
+The architecture also allows dynamically changing these subsets during
+runtime to further optimize the performance of the higher priority
+application with minimal degradation to the low priority app.
+Additionally, resources can be rebalanced for system throughput
+benefit. (Refer to Section 17.15 in the Intel SDM)
+
+This technique may be useful in managing large computer systems which
+large LLC. Examples may be large servers running instances of
+webservers or database servers. In such complex systems, these subsets
+can be used for more careful placing of the available cache
+resources.
+
+1.3 CAT implementation Overview
+-------------------------------
+
+Kernel implements a cgroup subsystem to support cache allocation.
+
+Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping.
+A CLOS(Class of service) is represented by a CLOSid.CLOSid is internal
+to the kernel and not exposed to user. Each cgroup would have one CBM
+and would just represent one cache 'subset'.
+
+The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the
+cgroup never fails. When a child cgroup is created it inherits the
+CLOSid and the CBM from its parent. When a user changes the default
+CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
+used before. The changing of 'cache_mask' may fail with -ERRNOSPC once the
+kernel runs out of maximum CLOSids it can support.
+User can create as many cgroups as he wants but having different CBMs
+at the same time is restricted by the maximum number of CLOSids
+(multiple cgroups can have the same CBM).
+Kernel maintains a CLOSid<->cbm mapping which keeps reference counter
+for each cgroup using a CLOSid.
+
+The tasks in the cgroup would get to fill the LLC cache represented by
+the cgroup's 'cache_mask' file.
+
+Root directory would have all available bits set in 'cache_mask' file by
+default.
+
+1.4 Assignment of CBM,CLOS
+--------------------------
+
+The 'cache_mask' needs to be a subset of the parent node's 'cache_mask'.
+Any contiguous subset of these bits(with a minimum of 2 bits) maybe
+set to indicate the cache mapping desired. The 'cache_mask' between 2
+directories can overlap. The 'cache_mask' would represent the cache 'subset'
+of the CAT cgroup. For ex: on a system with 16 bits of max cbm bits,
+if the directory has the least significant 4 bits set in its 'cache_mask'
+file(meaning the 'cache_mask' is just 0xf), it would be allocated the right
+quarter of the Last level cache which means the tasks belonging to
+this CAT cgroup can use the right quarter of the cache to fill. If it
+has the most significant 8 bits set ,it would be allocated the left
+half of the cache(8 bits out of 16 represents 50%).
+
+The cache portion defined in the CBM file is available to all tasks
+within the cgroup to fill and these task are not allowed to allocate
+space in other parts of the cache.
+
+1.5 Scheduling and Context Switch
+---------------------------------
+
+During context switch kernel implements this by writing the
+CLOSid (internally maintained by kernel) of the cgroup to which the
+task belongs to the CPU's IA32_PQR_ASSOC MSR. The MSR is only written
+when there is a change in the CLOSid for the CPU in order to minimize
+the latency incurred during context switch.
+
+2. Usage examples and syntax
+============================
+
+To check if CAT was enabled on your system
+
+dmesg | grep -i intel_rdt
+should output : intel_rdt: Max bitmask length: xx,Max ClosIds: xx
+the length of cache_mask and CLOS should depend on the system you use.
+
+
+Following would mount the cache allocation cgroup subsystem and create
+2 directories. Please refer to Documentation/cgroups/cgroups.txt on
+details about how to use cgroups.
+
+ cd /sys/fs/cgroup
+ mkdir rdt
+ mount -t cgroup -ordt rdt /sys/fs/cgroup/rdt
+ cd rdt
+
+Create 2 rdt cgroups
+
+ mkdir group1
+ mkdir group2
+
+Following are some of the Files in the directory
+
+ ls
+ rdt.cache_mask
+ tasks
+
+Say if the cache is 2MB and cbm supports 16 bits, then setting the
+below allocates the 'right 1/4th(512KB)' of the cache to group2
+
+Edit the CBM for group2 to set the least significant 4 bits. This
+allocates 'right quarter' of the cache.
+
+ cd group2
+ /bin/echo 0xf > cat.cache_mask
+
+
+Edit the CBM for group2 to set the least significant 8 bits.This
+allocates the right half of the cache to 'group2'.
+
+ cd group2
+ /bin/echo 0xff > rdt.cache_mask
+
+Assign tasks to the group2
+
+ /bin/echo PID1 > tasks
+ /bin/echo PID2 > tasks
+
+ Meaning now threads
+ PID1 and PID2 get to fill the 'right half' of
+ the cache as the belong to cgroup group2.
+
+Create a group under group2
+
+ cd group2
+ mkdir group21
+ cat rdt.cache_mask
+ 0xff - inherits parents mask.
+
+ /bin/echo 0xfff > rdt.cache_mask - throws error as mask has to parent's mask's subset
+
--
1.9.1

2015-05-02 18:35:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/7] x86/intel_rdt: Intel Cache Allocation Technology detection

On Fri, May 01, 2015 at 06:36:35PM -0700, Vikas Shivappa wrote:
> +#define X86_FEATURE_RDT ( 9*32+15) /* Resource Allocation */
> +#define X86_FEATURE_CAT_L3 (13*32 + 1) /* Cache QOS Enforcement L3 */
> + /* Cache Allocation Technology values */
> + u16 x86_cat_cbmlength;
> + u16 x86_cat_closs;

At the very least be consistent with the silly TLA nonsense.

2015-05-02 18:38:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management

On Fri, May 01, 2015 at 06:36:36PM -0700, Vikas Shivappa wrote:
> +static inline bool cat_supported(struct cpuinfo_x86 *c)

Is search and replace really that hard?

> +/*
> +* Called with the rdt_group_mutex held.
> +*/

Whitespace damaged and pointless comment.

> +static int rdt_free_closid(struct intel_rdt *ir)
> +{
> +

superfluous whitespace

> + lockdep_assert_held(&rdt_group_mutex);
> +
> + WARN_ON(!ccmap[ir->clos].cgrp_count);
> + ccmap[ir->clos].cgrp_count--;
> + if (!ccmap[ir->clos].cgrp_count)
> + clear_bit(ir->clos, rdtss_info.closmap);
> +
> + return 0;
> +}


These patches really are very sloppy..

2015-05-02 18:46:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/7] x86/intel_rdt: Support cache bit mask for Intel CAT


There's CAT in your subject, make up your minds already on what you want
to call this stuff.

On Fri, May 01, 2015 at 06:36:37PM -0700, Vikas Shivappa wrote:
> +static void rdt_free_closid(unsigned int clos)
> +{
> +

superfluous whitespace

> + lockdep_assert_held(&rdt_group_mutex);
> +
> + clear_bit(clos, rdtss_info.closmap);
> +}

> +static inline bool cbm_is_contiguous(unsigned long var)
> +{
> + unsigned long first_bit, zero_bit;
> + unsigned long maxcbm = MAX_CBM_LENGTH;

flip these two lines

> +
> + if (!var)
> + return false;
> +
> + first_bit = find_next_bit(&var, maxcbm, 0);

What was wrong with find_first_bit() ?

> + zero_bit = find_next_zero_bit(&var, maxcbm, first_bit);
> +
> + if (find_next_bit(&var, maxcbm, zero_bit) < maxcbm)
> + return false;
> +
> + return true;
> +}
> +
> +static int cat_cbm_read(struct seq_file *m, void *v)
> +{
> + struct intel_rdt *ir = css_rdt(seq_css(m));
> +
> + seq_printf(m, "%08lx\n", ccmap[ir->clos].cache_mask);

inconsistent spacing, you mostly have a whilespace before the return
statement, but here you have not.

> + return 0;
> +}
> +
> +static int validate_cbm(struct intel_rdt *ir, unsigned long cbmvalue)
> +{
> + struct intel_rdt *par, *c;
> + struct cgroup_subsys_state *css;
> + unsigned long *cbm_tmp;

No reason no to order these on line length is there?

> +
> + if (!cbm_is_contiguous(cbmvalue)) {
> + pr_err("bitmask should have >= 1 bits and be contiguous\n");
> + return -EINVAL;
> + }
> +
> + par = parent_rdt(ir);
> + cbm_tmp = &ccmap[par->clos].cache_mask;
> + if (!bitmap_subset(&cbmvalue, cbm_tmp, MAX_CBM_LENGTH))
> + return -EINVAL;
> +
> + rcu_read_lock();
> + rdt_for_each_child(css, ir) {
> + c = css_rdt(css);
> + cbm_tmp = &ccmap[c->clos].cache_mask;
> + if (!bitmap_subset(cbm_tmp, &cbmvalue, MAX_CBM_LENGTH)) {
> + rcu_read_unlock();
> + pr_err("Children's mask not a subset\n");
> + return -EINVAL;
> + }
> + }
> +
> + rcu_read_unlock();

Daft whitespace again.

> + return 0;
> +}
> +
> +static bool cbm_search(unsigned long cbm, int *closid)
> +{
> + int maxid = boot_cpu_data.x86_cat_closs;
> + unsigned int i;
> +
> + for (i = 0; i < maxid; i++) {
> + if (bitmap_equal(&cbm, &ccmap[i].cache_mask, MAX_CBM_LENGTH)) {
> + *closid = i;
> + return true;
> + }
> + }

and again

> + return false;
> +}
> +
> +static void cbmmap_dump(void)
> +{
> + int i;
> +
> + pr_debug("CBMMAP\n");
> + for (i = 0; i < boot_cpu_data.x86_cat_closs; i++)
> + pr_debug("cache_mask: 0x%x,clos_refcnt: %u\n",
> + (unsigned int)ccmap[i].cache_mask, ccmap[i].clos_refcnt);

This is missing {}

> +}
> +
> +static void __cpu_cbm_update(void *info)
> +{
> + unsigned int closid = *((unsigned int *)info);
> +
> + wrmsrl(CBM_FROM_INDEX(closid), ccmap[closid].cache_mask);
> +}

> +static int cat_cbm_write(struct cgroup_subsys_state *css,
> + struct cftype *cft, u64 cbmvalue)
> +{
> + struct intel_rdt *ir = css_rdt(css);
> + ssize_t err = 0;
> + unsigned long cache_mask, max_mask;
> + unsigned long *cbm_tmp;
> + unsigned int closid;
> + u32 max_cbm = boot_cpu_data.x86_cat_cbmlength;

That's just a right mess isn't it?

> +
> + if (ir == &rdt_root_group)
> + return -EPERM;
> + bitmap_set(&max_mask, 0, max_cbm);
> +
> + /*
> + * Need global mutex as cbm write may allocate a closid.
> + */
> + mutex_lock(&rdt_group_mutex);
> + bitmap_and(&cache_mask, (unsigned long *)&cbmvalue, &max_mask, max_cbm);
> + cbm_tmp = &ccmap[ir->clos].cache_mask;
> +
> + if (bitmap_equal(&cache_mask, cbm_tmp, MAX_CBM_LENGTH))
> + goto out;
> +
> + err = validate_cbm(ir, cache_mask);
> + if (err)
> + goto out;
> +
> + /*
> + * At this point we are sure to change the cache_mask.Hence release the
> + * reference to the current CLOSid and try to get a reference for
> + * a different CLOSid.
> + */
> + __clos_put(ir->clos);
> +
> + if (cbm_search(cache_mask, &closid)) {
> + ir->clos = closid;
> + __clos_get(closid);
> + } else {
> + err = rdt_alloc_closid(ir);
> + if (err)
> + goto out;
> +
> + ccmap[ir->clos].cache_mask = cache_mask;
> + cbm_update_all(ir->clos);
> + }
> +
> + cbmmap_dump();
> +out:
> +

Daft whitespace again.. Also inconsistent return paradigm, here you use
an out label, where in validate_cbm() you did rcu_read_unlock() and
return from the middle.

> + mutex_unlock(&rdt_group_mutex);
> + return err;
> +}
> +
> +static inline bool rdt_update_cpumask(int cpu)
> +{
> + int phys_id = topology_physical_package_id(cpu);
> + struct cpumask *mask = &rdt_cpumask;
> + int i;
> +
> + for_each_cpu(i, mask) {
> + if (phys_id == topology_physical_package_id(i))
> + return false;
> + }
> +
> + cpumask_set_cpu(cpu, mask);

More daft whitespace

> + return true;
> +}
> +
> +/*
> + * rdt_cpu_start() - If a new package has come up, update all
> + * the Cache bitmasks on the package.
> + */
> +static inline void rdt_cpu_start(int cpu)
> +{
> + mutex_lock(&rdt_group_mutex);
> + if (rdt_update_cpumask(cpu))
> + cbm_update_msrs(cpu);
> + mutex_unlock(&rdt_group_mutex);
> +}
> +
> +static void rdt_cpu_exit(unsigned int cpu)
> +{
> + int phys_id = topology_physical_package_id(cpu);
> + int i;
> +
> + mutex_lock(&rdt_group_mutex);
> + if (!cpumask_test_and_clear_cpu(cpu, &rdt_cpumask)) {
> + mutex_unlock(&rdt_group_mutex);
> + return;

And here we return from the middle again..

> + }
> +
> + for_each_online_cpu(i) {
> + if (i == cpu)
> + continue;
> +
> + if (phys_id == topology_physical_package_id(i)) {
> + cpumask_set_cpu(i, &rdt_cpumask);
> + break;
> + }
> + }
> + mutex_unlock(&rdt_group_mutex);
> +}

/me tired and gives up..

2015-05-02 18:51:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT

On Fri, May 01, 2015 at 06:36:38PM -0700, Vikas Shivappa wrote:
> Adds support for IA32_PQR_ASSOC MSR writes during task scheduling.
>
> The high 32 bits in the per processor MSR IA32_PQR_ASSOC represents the
> CLOSid. During context switch kernel implements this by writing the
> CLOSid of the cgroup to which the task belongs to the CPU's
> IA32_PQR_ASSOC MSR.
>
> For Cache Allocation, this would let the task fill in the cache 'subset'
> represented by the cgroup's Cache bit mask(CBM).
>

Are you guys for real? Have you even looked at the trainwreck this
makes?

> +static inline bool rdt_enabled(void)
> +{
> + return static_key_false(&rdt_enable_key);
> +}

> +/*
> + * rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
> + * if the current Closid is different than the new one.
> + */
> +static inline void rdt_sched_in(struct task_struct *task)
> +{
> + struct intel_rdt *ir;
> + unsigned int clos;
> +
> + if (!rdt_enabled())
> + return;
> +
> + /*
> + * This needs to be fixed
> + * to cache the whole PQR instead of just CLOSid.
> + * PQR has closid in high 32 bits and CQM-RMID in low 10 bits.
> + * Should not write a 0 to the low 10 bits of PQR
> + * and corrupt RMID.
> + */
> + clos = this_cpu_read(x86_cpu_clos);
> +
> + rcu_read_lock();
> + ir = task_rdt(task);
> + if (ir->clos == clos) {
> + rcu_read_unlock();
> + return;
> + }
> +
> + wrmsr(MSR_IA32_PQR_ASSOC, 0, ir->clos);
> + this_cpu_write(x86_cpu_clos, ir->clos);
> + rcu_read_unlock();
> +}

You inject _ALL_ that into the scheduler hot path. Insane much?

> +
> +#else
> +
> +static inline void rdt_sched_in(struct task_struct *task) {}
> +
> #endif
> #endif
> diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
> index 751bf4b..82ef4b3 100644
> --- a/arch/x86/include/asm/switch_to.h
> +++ b/arch/x86/include/asm/switch_to.h
> @@ -8,6 +8,9 @@ struct tss_struct;
> void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
> struct tss_struct *tss);
>
> +#include <asm/intel_rdt.h>
> +#define post_arch_switch(current) rdt_sched_in(current)
> +
> #ifdef CONFIG_X86_32
>
> #ifdef CONFIG_CC_STACKPROTECTOR

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f9123a8..cacb490 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2241,6 +2241,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> prev_state = prev->state;
> vtime_task_switch(prev);
> finish_arch_switch(prev);
> + post_arch_switch(current);
> perf_event_task_sched_in(prev, current);
> finish_lock_switch(rq, prev);
> finish_arch_post_lock_switch();

Not a word in the Changelog on this hook; that's double fail.

Please _THINK_ before writing code.

2015-05-04 17:32:00

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 3/7] x86/intel_rdt: Support cache bit mask for Intel CAT



On Sat, 2 May 2015, Peter Zijlstra wrote:

>
> There's CAT in your subject, make up your minds already on what you want
> to call this stuff.

We dont have control over the names.It is clear from the patch 0/7 where
its explained that RDT is
the umbrella term and CAT is a part of it and this patch series is only for CAT
... It also mentions what exact section of the Intel manual this refers to. Is
there still some lack of clarification here ?
If its just your disliking the term thats already known.

would have received suggestions for some fancy names
like venilla / or icecream or burger and spent time deciding on the names but
we cant do that to keep consistent with the rest of naming.

>
> On Fri, May 01, 2015 at 06:36:37PM -0700, Vikas Shivappa wrote:
>> +static void rdt_free_closid(unsigned int clos)
>> +{
>> +
>
> superfluous whitespace

Will fix the whitespace issues (including before return) or other possible
coding convention issues.

It could be more of a common sense to have this in checkpatch rather that manually having to
pointing out. If you want to have fun with that go for it though.
An automated approach would make the time taken much
smaller in terms of reviewer's time and for people submiting the code
as well.
They are all run against checkpatch.

>
>> + lockdep_assert_held(&rdt_group_mutex);
>> +
>> + clear_bit(clos, rdtss_info.closmap);
>> +}
>
>> +static inline bool cbm_is_contiguous(unsigned long var)
>> +{
>> + unsigned long first_bit, zero_bit;
>> + unsigned long maxcbm = MAX_CBM_LENGTH;
>
> flip these two lines
>
>> +
>> + if (!var)
>> + return false;
>> +
>> + first_bit = find_next_bit(&var, maxcbm, 0);
>
> What was wrong with find_first_bit() ?
>
>> + zero_bit = find_next_zero_bit(&var, maxcbm, first_bit);
>> +
>> + if (find_next_bit(&var, maxcbm, zero_bit) < maxcbm)
>> + return false;
>> +
>> + return true;
>> +}
>> +
>> +static int cat_cbm_read(struct seq_file *m, void *v)
>> +{
>> + struct intel_rdt *ir = css_rdt(seq_css(m));
>> +
>> + seq_printf(m, "%08lx\n", ccmap[ir->clos].cache_mask);
>
> inconsistent spacing, you mostly have a whilespace before the return
> statement, but here you have not.
>
>> + return 0;
>> +}
>> +
>> +static int validate_cbm(struct intel_rdt *ir, unsigned long cbmvalue)
>> +{
>> + struct intel_rdt *par, *c;
>> + struct cgroup_subsys_state *css;
>> + unsigned long *cbm_tmp;
>
> No reason no to order these on line length is there?
>
>> +
>> + if (!cbm_is_contiguous(cbmvalue)) {
>> + pr_err("bitmask should have >= 1 bits and be contiguous\n");
>> + return -EINVAL;
>> + }
>> +
>> + par = parent_rdt(ir);
>> + cbm_tmp = &ccmap[par->clos].cache_mask;
>> + if (!bitmap_subset(&cbmvalue, cbm_tmp, MAX_CBM_LENGTH))
>> + return -EINVAL;
>> +
>> + rcu_read_lock();
>> + rdt_for_each_child(css, ir) {
>> + c = css_rdt(css);
>> + cbm_tmp = &ccmap[c->clos].cache_mask;
>> + if (!bitmap_subset(cbm_tmp, &cbmvalue, MAX_CBM_LENGTH)) {
>> + rcu_read_unlock();
>> + pr_err("Children's mask not a subset\n");
>> + return -EINVAL;
>> + }
>> + }
>> +
>> + rcu_read_unlock();
>
> Daft whitespace again.
>
>> + return 0;
>> +}
>> +
>> +static bool cbm_search(unsigned long cbm, int *closid)
>> +{
>> + int maxid = boot_cpu_data.x86_cat_closs;
>> + unsigned int i;
>> +
>> + for (i = 0; i < maxid; i++) {
>> + if (bitmap_equal(&cbm, &ccmap[i].cache_mask, MAX_CBM_LENGTH)) {
>> + *closid = i;
>> + return true;
>> + }
>> + }
>
> and again
>
>> + return false;
>> +}
>> +
>> +static void cbmmap_dump(void)
>> +{
>> + int i;
>> +
>> + pr_debug("CBMMAP\n");
>> + for (i = 0; i < boot_cpu_data.x86_cat_closs; i++)
>> + pr_debug("cache_mask: 0x%x,clos_refcnt: %u\n",
>> + (unsigned int)ccmap[i].cache_mask, ccmap[i].clos_refcnt);
>
> This is missing {}
>
>> +}
>> +
>> +static void __cpu_cbm_update(void *info)
>> +{
>> + unsigned int closid = *((unsigned int *)info);
>> +
>> + wrmsrl(CBM_FROM_INDEX(closid), ccmap[closid].cache_mask);
>> +}
>
>> +static int cat_cbm_write(struct cgroup_subsys_state *css,
>> + struct cftype *cft, u64 cbmvalue)
>> +{
>> + struct intel_rdt *ir = css_rdt(css);
>> + ssize_t err = 0;
>> + unsigned long cache_mask, max_mask;
>> + unsigned long *cbm_tmp;
>> + unsigned int closid;
>> + u32 max_cbm = boot_cpu_data.x86_cat_cbmlength;
>
> That's just a right mess isn't it?
>
>> +
>> + if (ir == &rdt_root_group)
>> + return -EPERM;
>> + bitmap_set(&max_mask, 0, max_cbm);
>> +
>> + /*
>> + * Need global mutex as cbm write may allocate a closid.
>> + */
>> + mutex_lock(&rdt_group_mutex);
>> + bitmap_and(&cache_mask, (unsigned long *)&cbmvalue, &max_mask, max_cbm);
>> + cbm_tmp = &ccmap[ir->clos].cache_mask;
>> +
>> + if (bitmap_equal(&cache_mask, cbm_tmp, MAX_CBM_LENGTH))
>> + goto out;
>> +
>> + err = validate_cbm(ir, cache_mask);
>> + if (err)
>> + goto out;
>> +
>> + /*
>> + * At this point we are sure to change the cache_mask.Hence release the
>> + * reference to the current CLOSid and try to get a reference for
>> + * a different CLOSid.
>> + */
>> + __clos_put(ir->clos);
>> +
>> + if (cbm_search(cache_mask, &closid)) {
>> + ir->clos = closid;
>> + __clos_get(closid);
>> + } else {
>> + err = rdt_alloc_closid(ir);
>> + if (err)
>> + goto out;
>> +
>> + ccmap[ir->clos].cache_mask = cache_mask;
>> + cbm_update_all(ir->clos);
>> + }
>> +
>> + cbmmap_dump();
>> +out:
>> +
>
> Daft whitespace again.. Also inconsistent return paradigm, here you use
> an out label, where in validate_cbm() you did rcu_read_unlock() and
> return from the middle.
>
>> + mutex_unlock(&rdt_group_mutex);
>> + return err;
>> +}
>> +
>> +static inline bool rdt_update_cpumask(int cpu)
>> +{
>> + int phys_id = topology_physical_package_id(cpu);
>> + struct cpumask *mask = &rdt_cpumask;
>> + int i;
>> +
>> + for_each_cpu(i, mask) {
>> + if (phys_id == topology_physical_package_id(i))
>> + return false;
>> + }
>> +
>> + cpumask_set_cpu(cpu, mask);
>
> More daft whitespace
>
>> + return true;
>> +}
>> +
>> +/*
>> + * rdt_cpu_start() - If a new package has come up, update all
>> + * the Cache bitmasks on the package.
>> + */
>> +static inline void rdt_cpu_start(int cpu)
>> +{
>> + mutex_lock(&rdt_group_mutex);
>> + if (rdt_update_cpumask(cpu))
>> + cbm_update_msrs(cpu);
>> + mutex_unlock(&rdt_group_mutex);
>> +}
>> +
>> +static void rdt_cpu_exit(unsigned int cpu)
>> +{
>> + int phys_id = topology_physical_package_id(cpu);
>> + int i;
>> +
>> + mutex_lock(&rdt_group_mutex);
>> + if (!cpumask_test_and_clear_cpu(cpu, &rdt_cpumask)) {
>> + mutex_unlock(&rdt_group_mutex);
>> + return;
>
> And here we return from the middle again..
>
>> + }
>> +
>> + for_each_online_cpu(i) {
>> + if (i == cpu)
>> + continue;
>> +
>> + if (phys_id == topology_physical_package_id(i)) {
>> + cpumask_set_cpu(i, &rdt_cpumask);
>> + break;
>> + }
>> + }
>> + mutex_unlock(&rdt_group_mutex);
>> +}
>
> /me tired and gives up..
>

2015-05-04 17:33:05

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management



On Sat, 2 May 2015, Peter Zijlstra wrote:

> On Fri, May 01, 2015 at 06:36:36PM -0700, Vikas Shivappa wrote:
>> +static inline bool cat_supported(struct cpuinfo_x86 *c)
>
> Is search and replace really that hard?
>
>> +/*
>> +* Called with the rdt_group_mutex held.
>> +*/
>
> Whitespace damaged and pointless comment.
>
>> +static int rdt_free_closid(struct intel_rdt *ir)
>> +{
>> +
>
> superfluous whitespace
>
>> + lockdep_assert_held(&rdt_group_mutex);
>> +
>> + WARN_ON(!ccmap[ir->clos].cgrp_count);
>> + ccmap[ir->clos].cgrp_count--;
>> + if (!ccmap[ir->clos].cgrp_count)
>> + clear_bit(ir->clos, rdtss_info.closmap);
>> +
>> + return 0;
>> +}
>
>
> These patches really are very sloppy..

Will resend the updated patch - most of the changes here are removed in next
patch.

Thanks,
Vikas

>

2015-05-04 18:41:07

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT



On Sat, 2 May 2015, Peter Zijlstra wrote:

> On Fri, May 01, 2015 at 06:36:38PM -0700, Vikas Shivappa wrote:
>> Adds support for IA32_PQR_ASSOC MSR writes during task scheduling.
>>
>> The high 32 bits in the per processor MSR IA32_PQR_ASSOC represents the
>> CLOSid. During context switch kernel implements this by writing the
>> CLOSid of the cgroup to which the task belongs to the CPU's
>> IA32_PQR_ASSOC MSR.
>>
>> For Cache Allocation, this would let the task fill in the cache 'subset'
>> represented by the cgroup's Cache bit mask(CBM).
>>
>
> Are you guys for real? Have you even looked at the trainwreck this
> makes?
>
>> +static inline bool rdt_enabled(void)
>> +{
>> + return static_key_false(&rdt_enable_key);
>> +}
>
>> +/*
>> + * rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
>> + * if the current Closid is different than the new one.
>> + */
>> +static inline void rdt_sched_in(struct task_struct *task)
>> +{
>> + struct intel_rdt *ir;
>> + unsigned int clos;
>> +
>> + if (!rdt_enabled())
>> + return;
>> +
>> + /*
>> + * This needs to be fixed
>> + * to cache the whole PQR instead of just CLOSid.
>> + * PQR has closid in high 32 bits and CQM-RMID in low 10 bits.
>> + * Should not write a 0 to the low 10 bits of PQR
>> + * and corrupt RMID.
>> + */
>> + clos = this_cpu_read(x86_cpu_clos);
>> +
>> + rcu_read_lock();
>> + ir = task_rdt(task);
>> + if (ir->clos == clos) {
>> + rcu_read_unlock();
>> + return;
>> + }
>> +
>> + wrmsr(MSR_IA32_PQR_ASSOC, 0, ir->clos);
>> + this_cpu_write(x86_cpu_clos, ir->clos);
>> + rcu_read_unlock();
>> +}
>
> You inject _ALL_ that into the scheduler hot path. Insane much?

At some point I had a #ifdef for the rdt_sched_in , will fix this.

>
>> +
>> +#else
>> +
>> +static inline void rdt_sched_in(struct task_struct *task) {}
>> +
>> #endif
>> #endif
>> diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
>> index 751bf4b..82ef4b3 100644
>> --- a/arch/x86/include/asm/switch_to.h
>> +++ b/arch/x86/include/asm/switch_to.h
>> @@ -8,6 +8,9 @@ struct tss_struct;
>> void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
>> struct tss_struct *tss);
>>
>> +#include <asm/intel_rdt.h>
>> +#define post_arch_switch(current) rdt_sched_in(current)
>> +
>> #ifdef CONFIG_X86_32
>>
>> #ifdef CONFIG_CC_STACKPROTECTOR
>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index f9123a8..cacb490 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2241,6 +2241,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>> prev_state = prev->state;
>> vtime_task_switch(prev);
>> finish_arch_switch(prev);
>> + post_arch_switch(current);
>> perf_event_task_sched_in(prev, current);
>> finish_lock_switch(rq, prev);
>> finish_arch_post_lock_switch();
>
> Not a word in the Changelog on this hook; that's double fail.

will add the changelog. we want the current task which no other existing hook
provides.

Thanks,
Vikas

>
> Please _THINK_ before writing code.
>

2015-05-06 00:21:17

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT



On Sat, 2 May 2015, Peter Zijlstra wrote:

> On Fri, May 01, 2015 at 06:36:38PM -0700, Vikas Shivappa wrote:
>> Adds support for IA32_PQR_ASSOC MSR writes during task scheduling.
>>
>> The high 32 bits in the per processor MSR IA32_PQR_ASSOC represents the
>> CLOSid. During context switch kernel implements this by writing the
>> CLOSid of the cgroup to which the task belongs to the CPU's
>> IA32_PQR_ASSOC MSR.
>>
>> For Cache Allocation, this would let the task fill in the cache 'subset'
>> represented by the cgroup's Cache bit mask(CBM).
>>
>
> Are you guys for real? Have you even looked at the trainwreck this
> makes?

The h/w tags the cache lines with the closid and hence we associate the id with
the tasks scheduled.

The following considerations are done for the PQR MSR write to have minimal
effect on scheduling hot path:
- This would not exist on any non-intel platforms.
- On Intel platforms, this would not exist by default unless CGROUP_RDT
is enabled.
- remains a no-op when CGROUP_RDT is enabled and intel hardware does not
support the feature.
- When feature is available and RDT is enabled, does not do msr write till
the user manually creates a cgroup *and* assigns a new cache mask.
Since the
child node inherits the parents cache mask , by cgroup creation there is
no scheduling hot path impact from the new cgroup.
- per cpu PQR values are cached and the MSR write is only done when
there is a task with different PQR is scheduled on the CPU. Typically if
the task groups are bound to be scheduled on a set of CPUs , the number
of MSR writes is greatly reduced.

However Matt pointed out I could improve this to
if (static_key_false)
{ rdt_sched_in(); }
instead of a static inline which i will update. Will update the commit
message to include these details.

And can improve to not enable the feature till a user creates a new cgroup
and creates a new the bitmask.

Thanks,
Vikas

>
>> +static inline bool rdt_enabled(void)
>> +{
>> + return static_key_false(&rdt_enable_key);
>> +}
>
>> +/*
>> + * rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
>> + * if the current Closid is different than the new one.
>> + */
>> +static inline void rdt_sched_in(struct task_struct *task)
>> +{
>> + struct intel_rdt *ir;
>> + unsigned int clos;
>> +
>> + if (!rdt_enabled())
>> + return;
>> +
>> + /*
>> + * This needs to be fixed
>> + * to cache the whole PQR instead of just CLOSid.
>> + * PQR has closid in high 32 bits and CQM-RMID in low 10 bits.
>> + * Should not write a 0 to the low 10 bits of PQR
>> + * and corrupt RMID.
>> + */
>> + clos = this_cpu_read(x86_cpu_clos);
>> +
>> + rcu_read_lock();
>> + ir = task_rdt(task);
>> + if (ir->clos == clos) {
>> + rcu_read_unlock();
>> + return;
>> + }
>> +
>> + wrmsr(MSR_IA32_PQR_ASSOC, 0, ir->clos);
>> + this_cpu_write(x86_cpu_clos, ir->clos);
>> + rcu_read_unlock();
>> +}
>
> You inject _ALL_ that into the scheduler hot path. Insane much?
>
>> +
>> +#else
>> +
>> +static inline void rdt_sched_in(struct task_struct *task) {}
>> +
>> #endif
>> #endif
>> diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
>> index 751bf4b..82ef4b3 100644
>> --- a/arch/x86/include/asm/switch_to.h
>> +++ b/arch/x86/include/asm/switch_to.h
>> @@ -8,6 +8,9 @@ struct tss_struct;
>> void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
>> struct tss_struct *tss);
>>
>> +#include <asm/intel_rdt.h>
>> +#define post_arch_switch(current) rdt_sched_in(current)
>> +
>> #ifdef CONFIG_X86_32
>>
>> #ifdef CONFIG_CC_STACKPROTECTOR
>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index f9123a8..cacb490 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2241,6 +2241,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>> prev_state = prev->state;
>> vtime_task_switch(prev);
>> finish_arch_switch(prev);
>> + post_arch_switch(current);
>> perf_event_task_sched_in(prev, current);
>> finish_lock_switch(rq, prev);
>> finish_arch_post_lock_switch();
>
> Not a word in the Changelog on this hook; that's double fail.
>
> Please _THINK_ before writing code.
>

2015-05-06 07:48:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT

On Mon, May 04, 2015 at 11:39:21AM -0700, Vikas Shivappa wrote:

> >>--- a/arch/x86/include/asm/switch_to.h
> >>+++ b/arch/x86/include/asm/switch_to.h
> >>@@ -8,6 +8,9 @@ struct tss_struct;
> >> void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
> >> struct tss_struct *tss);
> >>
> >>+#include <asm/intel_rdt.h>
> >>+#define post_arch_switch(current) rdt_sched_in(current)
> >>+
> >> #ifdef CONFIG_X86_32
> >>
> >> #ifdef CONFIG_CC_STACKPROTECTOR
> >
> >>diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >>index f9123a8..cacb490 100644
> >>--- a/kernel/sched/core.c
> >>+++ b/kernel/sched/core.c
> >>@@ -2241,6 +2241,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> >> prev_state = prev->state;
> >> vtime_task_switch(prev);
> >> finish_arch_switch(prev);
> >>+ post_arch_switch(current);
> >> perf_event_task_sched_in(prev, current);
> >> finish_lock_switch(rq, prev);
> >> finish_arch_post_lock_switch();
> >
> >Not a word in the Changelog on this hook; that's double fail.
>
> will add the changelog. we want the current task which no other existing
> hook provides.

No.

1) two arch hooks right after one another is FAIL
1a) just 'fix' the existing hook
2) current is cheap and easily obtainable without passing it as
an argument
3) why do you need the hook in the first place?
3a) why can't you put this in __switch_to()? This is very much x86 only
code.

2015-05-06 07:50:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT

On Tue, May 05, 2015 at 05:19:40PM -0700, Vikas Shivappa wrote:
> However Matt pointed out I could improve this to
> if (static_key_false)
> { rdt_sched_in(); }
> instead of a static inline which i will update. Will update the commit
> message to include these details.

Indeed, that causes minimal I$ bloat / impact for people not using this
feature.

2015-05-06 08:09:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/7] x86/intel_rdt: Support cache bit mask for Intel CAT

On Mon, May 04, 2015 at 10:30:15AM -0700, Vikas Shivappa wrote:
> On Sat, 2 May 2015, Peter Zijlstra wrote:
>
> >
> >There's CAT in your subject, make up your minds already on what you want
> >to call this stuff.
>
> We dont have control over the names.It is clear from the patch 0/7 where its

If I read 0/n its _after_ I've read all the other patches. The thing is,
0/n should not contain anything persistent. Patches should stand on
their own.

> explained that RDT is the umbrella term and CAT is a part of it and this
> patch series is only for CAT ... It also mentions what exact section of the
> Intel manual this refers to. Is there still some lack of clarification here
> ?

But we're not implementing an umbrella right? We're implementing Cache
QoS Enforcement (CQE aka. CAT).

Why confuse things with calling it random other names?

>From what I understand the whole RDT thing is the umbrella term for
Cache QoS Monitoring and Enforcement together. CQM is implemented
elsewhere, this part is only implementing CQE.

So just call it that, calling it RDT is actively misleading, because it
explicitly does _NOT_ do the monitoring half of it.

> If its just your disliking the term thats already known.

I think its crazy to go CQE no CAT no RDT, but I could get over that in
time. But now it turns out you need _both_, and that's even more silly.

2015-05-06 08:12:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/7] x86/intel_rdt: Support cache bit mask for Intel CAT

On Mon, May 04, 2015 at 10:30:15AM -0700, Vikas Shivappa wrote:
> Will fix the whitespace issues (including before return) or other possible
> coding convention issues.
>
> It could be more of a common sense to have this in checkpatch rather that
> manually having to pointing out. If you want to have fun with that go for it
> though.

My main objection was that your coding style is entirely inconsistent
with itself.

Sometimes you have a whitespace before return, sometimes you do not.

Sometimes you have exit labels with locks, sometimes you do not.

etc..

Pick one stick to it; although we'd all much prefer if you pick the one
that's common to the kernel.

2015-05-06 08:31:13

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH 3/7] x86/intel_rdt: Support cache bit mask for Intel CAT

On Wed, 2015-05-06 at 10:09 +0200, Peter Zijlstra wrote:
>
> But we're not implementing an umbrella right? We're implementing Cache
> QoS Enforcement (CQE aka. CAT).
>
> Why confuse things with calling it random other names?
>
> From what I understand the whole RDT thing is the umbrella term for
> Cache QoS Monitoring and Enforcement together. CQM is implemented
> elsewhere, this part is only implementing CQE.
>
> So just call it that, calling it RDT is actively misleading, because it
> explicitly does _NOT_ do the monitoring half of it.

Right, and we're already running into this problem where some of the
function names contain "rdt" and some contain "cat".

How about we go with "intel cache alloc"? We avoid the dreaded TLA-fest,
it clearly matches up with what's in the Software Developer's Manual
(Cache Allocation Technology) and it's pretty simple for people who
haven't read the SDM to understand.

2015-05-06 16:50:47

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 3/7] x86/intel_rdt: Support cache bit mask for Intel CAT



On Wed, 6 May 2015, Peter Zijlstra wrote:

> On Mon, May 04, 2015 at 10:30:15AM -0700, Vikas Shivappa wrote:
>> On Sat, 2 May 2015, Peter Zijlstra wrote:
>>
>>>
>>> There's CAT in your subject, make up your minds already on what you want
>>> to call this stuff.
>>
>> We dont have control over the names.It is clear from the patch 0/7 where its
>
> If I read 0/n its _after_ I've read all the other patches. The thing is,
> 0/n should not contain anything persistent. Patches should stand on
> their own.
>
>> explained that RDT is the umbrella term and CAT is a part of it and this
>> patch series is only for CAT ... It also mentions what exact section of the
>> Intel manual this refers to. Is there still some lack of clarification here
>> ?
>
> But we're not implementing an umbrella right? We're implementing Cache
> QoS Enforcement (CQE aka. CAT).

In some sense we are - The idea was that the same rdt cgroup would include other
features in the rdt and
cache allocation is one of them. Hence the cgroup name is RDT. Like Matt just
commented we just found some naming issues with respect to the APIs whether to
use cat or rdt. I can plan to remove the 'cat' altogether and use cache alloc
as I just learnt it may be not liked because its cat(mean an animal.. in english
:) )

Only reason to do it now is that we cant change the cgroup name later.

>
> Why confuse things with calling it random other names?
>
> From what I understand the whole RDT thing is the umbrella term for
> Cache QoS Monitoring and Enforcement together. CQM is implemented
> elsewhere, this part is only implementing CQE.
>
> So just call it that, calling it RDT is actively misleading, because it
> explicitly does _NOT_ do the monitoring half of it.
>
>> If its just your disliking the term thats already known.
>
> I think its crazy to go CQE no CAT no RDT, but I could get over that in
> time. But now it turns out you need _both_, and that's even more silly.

I agree the changing of names have led to enough confusion and we can try to see
that if we can make this better in coming features.
I will send fixes where I will try to be more clear on names.
Basically have rdt for things which would be common to cache alloc and all other
features and keep cache alloc which is specific to cache alloc (like the
cache bit mask is only for cache alloc)..

>
>

2015-05-06 18:10:55

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 3/7] x86/intel_rdt: Support cache bit mask for Intel CAT



On Wed, 6 May 2015, Peter Zijlstra wrote:

> On Mon, May 04, 2015 at 10:30:15AM -0700, Vikas Shivappa wrote:
>> Will fix the whitespace issues (including before return) or other possible
>> coding convention issues.
>>
>> It could be more of a common sense to have this in checkpatch rather that
>> manually having to pointing out. If you want to have fun with that go for it
>> though.
>
> My main objection was that your coding style is entirely inconsistent
> with itself.
>
> Sometimes you have a whitespace before return, sometimes you do not.
>
> Sometimes you have exit labels with locks, sometimes you do not.
>
> etc..
>
> Pick one stick to it; although we'd all much prefer if you pick the one
> that's common to the kernel.

Will fix the convention issues.

Thanks,
Vikas

>

2015-05-07 23:17:17

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT



On Wed, 6 May 2015, Peter Zijlstra wrote:

> On Mon, May 04, 2015 at 11:39:21AM -0700, Vikas Shivappa wrote:
>
>>>> --- a/arch/x86/include/asm/switch_to.h
>>>> +++ b/arch/x86/include/asm/switch_to.h
>>>> @@ -8,6 +8,9 @@ struct tss_struct;
>>>> void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
>>>> struct tss_struct *tss);
>>>>
>>>> +#include <asm/intel_rdt.h>
>>>> +#define post_arch_switch(current) rdt_sched_in(current)
>>>> +
>>>> #ifdef CONFIG_X86_32
>>>>
>>>> #ifdef CONFIG_CC_STACKPROTECTOR
>>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index f9123a8..cacb490 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -2241,6 +2241,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>>>> prev_state = prev->state;
>>>> vtime_task_switch(prev);
>>>> finish_arch_switch(prev);
>>>> + post_arch_switch(current);
>>>> perf_event_task_sched_in(prev, current);
>>>> finish_lock_switch(rq, prev);
>>>> finish_arch_post_lock_switch();
>>>
>>> Not a word in the Changelog on this hook; that's double fail.
>>
>> will add the changelog. we want the current task which no other existing
>> hook provides.
>
> No.
>
> 1) two arch hooks right after one another is FAIL
> 1a) just 'fix' the existing hook
> 2) current is cheap and easily obtainable without passing it as
> an argument

will fix to just use an existing hook in finish_task_switch and
current(get_current) since the stack would already be changed ..

Thanks,
Vikas

> 3) why do you need the hook in the first place?
> 3a) why can't you put this in __switch_to()? This is very much x86 only
> code.
>

2015-05-08 08:59:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT

On Thu, May 07, 2015 at 04:15:41PM -0700, Vikas Shivappa wrote:
> >No.
> >
> >1) two arch hooks right after one another is FAIL
> >1a) just 'fix' the existing hook
> >2) current is cheap and easily obtainable without passing it as
> > an argument
>
> will fix to just use an existing hook in finish_task_switch and
> current(get_current) since the stack would already be changed ..
>
> Thanks,
> Vikas
>
> >3) why do you need the hook in the first place?
> >3a) why can't you put this in __switch_to()? This is very much x86 only
> > code.

^ please also answer 3, why can't this go in __swtich_to()?

2015-05-08 20:57:35

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT



On Fri, 8 May 2015, Peter Zijlstra wrote:

> On Thu, May 07, 2015 at 04:15:41PM -0700, Vikas Shivappa wrote:
>>> No.
>>>
>>> 1) two arch hooks right after one another is FAIL
>>> 1a) just 'fix' the existing hook
>>> 2) current is cheap and easily obtainable without passing it as
>>> an argument
>>
>> will fix to just use an existing hook in finish_task_switch and
>> current(get_current) since the stack would already be changed ..
>>
>> Thanks,
>> Vikas
>>
>>> 3) why do you need the hook in the first place?
>>> 3a) why can't you put this in __switch_to()? This is very much x86 only
>>> code.
>
> ^ please also answer 3, why can't this go in __swtich_to()?

perf uses similar(#1a) hook to update its MSRs (including for cache
monitoring ).
Also since switch_to is for registers state and stack , may be a
safer option to use it in the finish_arch_switch? Thats kind of why we had it
there at first but had a seperate hook.

#define finish_arch_switch(prev) \
do { \
intel_rdt_sched_in(); \
} while (0)


Hpa , any comments ?

>

2015-07-28 23:55:42

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide

On Tue, Mar 31, 2015 at 10:27:32AM -0700, Vikas Shivappa wrote:
>
>
> On Thu, 26 Mar 2015, Marcelo Tosatti wrote:
>
> >
> >I can't find any discussion relating to exposing the CBM interface
> >directly to userspace in that thread ?
> >
> >Cpu.shares is written in ratio form, which is much more natural.
> >Do you see any advantage in maintaining the
> >
> >(ratio -> cbm bitmasks)
> >
> >translation in userspace rather than in the kernel ?
> >
> >What about something like:
> >
> >
> > root cgroup
> > / \
> > / \
> > / \
> > cgroupA-80 cgroupB-30
> >
> >
> >So that whatever exceeds 100% is the ratio of cache
> >shared at that level (cgroup A and B share 10% of cache
> >at that level).
>
> But this also means the 2 groups share all of the cache ?
>
> Specifying the amount of bits to be shared lets you specify the
> exact cache area where you want to share and also when your total
> occupancy does not cover all of the cache. For ex: it gets more
> complex when you want to share say only the left quarter of the
> cache. cgroupA gets left half and cgroup gets left quarter. The
> bitmask aligns with how the h/w is designed to share the cache which
> gives you flexibility to define any specific overlapping areas of
> the cache.
>
> >
> >https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html
> >
> >cpu — the cpu.shares parameter determines the share of CPU resources
> >available to each process in all cgroups. Setting the parameter to 250,
> >250, and 500 in the finance, sales, and engineering cgroups respectively
> >means that processes started in these groups will split the resources
> >with a 1:1:2 ratio. Note that when a single process is running, it
> >consumes as much CPU as necessary no matter which cgroup it is placed
> >in. The CPU limitation only comes into effect when two or more processes
> >compete for CPU resources.
> >
> >
>
> These are more defined in terms of how many cache lines (or how many
> cache ways) they can use and would be difficult to define them in
> terms of percentage. In contrast the cpu share is a time shared
> thing and is much more granular where as here its not , its
> occupancy in terms of cache lines/ways.. (however this is not really
> defined as a restriction but thats the way it is now).
> Also note that the granularity of the bitmasks define the
> granularity of the percentages and in some SKUs the granularity is
> 2b and not 1b.. So technically you wont be able to even allocate
> percentage of cache even in 10% granularity for most of the cases
> (if there are 30MB and 25 ways like in one of hsw SKU) and this will
> vary for different SKUs which makes it more complicated for users.
> However the user library is free to define own interface based on
> the underlying cgroup interface say for example you never care about
> the overlapping and using it for a specific SKU etc.. The underlying
> cgroup framework is meant to be generic for all SKus and used for
> most of the use cases.
>
> Also at this point I see a lot of enterprise and and other users
> already using the cgroup interface or shown interest in the same.
> However I see your point where you indicate the ease with which user
> can specify in size/percentage which he might be used to doing for
> other resources rather than bits where he needs to get an idea size
> by calculating it seperately - But again note that you may not be
> able to define percentages in many scenarios like the one above. And
> another question would be we would need to convince the users to
> adapt to the modified percentage user model (ex: like the one you
> say above where percentage - 100 is the one thats shared)
> I can review this requirements and others I have received and get
> back to see the closest that can be done if possible.
>
> Thanks,
> Vikas

Vikas,

Three questions:

First, usage model. The usage model for CAT is the following
(please correct me if i'm wrong):

1) measure application performance without L3 cache reservation.
2) measure application perf with L3 cache reservation and
X number of ways until desired perf is attained.

On migration to a new hardware platform, to achieve similar benefit
achieved when going from 1) to 2) is to reserve _at least_ the number of
bytes that "X ways" provided when the measurement was performed. Is that
correct?

If that is correct, then the user does want to record "number of bytes"
that X ways on measurement CPU provided.

Second question:
Do you envision any use case which the placement of cache
and not the quantity of cache is a criteria for decision?
That is, two cases with the same amount of cache for each CLOSid,
but with different locations inside the cache?
(except sharing of ways by two CLOSid's, of course).

Third question:
How about support for the (new) I/D cache division?






2015-07-29 21:20:25

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide



On Tue, 28 Jul 2015, Marcelo Tosatti wrote:

> On Tue, Mar 31, 2015 at 10:27:32AM -0700, Vikas Shivappa wrote:
>>
>>
>> On Thu, 26 Mar 2015, Marcelo Tosatti wrote:
>>
>>>
>>> I can't find any discussion relating to exposing the CBM interface
>>> directly to userspace in that thread ?
>>>
>>> Cpu.shares is written in ratio form, which is much more natural.
>>> Do you see any advantage in maintaining the
>>>
>>> (ratio -> cbm bitmasks)
>>>
>>> translation in userspace rather than in the kernel ?
>>>
>>> What about something like:
>>>
>>>
>>> root cgroup
>>> / \
>>> / \
>>> / \
>>> cgroupA-80 cgroupB-30
>>>
>>>
>>> So that whatever exceeds 100% is the ratio of cache
>>> shared at that level (cgroup A and B share 10% of cache
>>> at that level).
>>
>> But this also means the 2 groups share all of the cache ?
>>
>> Specifying the amount of bits to be shared lets you specify the
>> exact cache area where you want to share and also when your total
>> occupancy does not cover all of the cache. For ex: it gets more
>> complex when you want to share say only the left quarter of the
>> cache. cgroupA gets left half and cgroup gets left quarter. The
>> bitmask aligns with how the h/w is designed to share the cache which
>> gives you flexibility to define any specific overlapping areas of
>> the cache.
>>
>>>
>>> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html
>>>
>>> cpu — the cpu.shares parameter determines the share of CPU resources
>>> available to each process in all cgroups. Setting the parameter to 250,
>>> 250, and 500 in the finance, sales, and engineering cgroups respectively
>>> means that processes started in these groups will split the resources
>>> with a 1:1:2 ratio. Note that when a single process is running, it
>>> consumes as much CPU as necessary no matter which cgroup it is placed
>>> in. The CPU limitation only comes into effect when two or more processes
>>> compete for CPU resources.
>>>
>>>
>>
>> These are more defined in terms of how many cache lines (or how many
>> cache ways) they can use and would be difficult to define them in
>> terms of percentage. In contrast the cpu share is a time shared
>> thing and is much more granular where as here its not , its
>> occupancy in terms of cache lines/ways.. (however this is not really
>> defined as a restriction but thats the way it is now).
>> Also note that the granularity of the bitmasks define the
>> granularity of the percentages and in some SKUs the granularity is
>> 2b and not 1b.. So technically you wont be able to even allocate
>> percentage of cache even in 10% granularity for most of the cases
>> (if there are 30MB and 25 ways like in one of hsw SKU) and this will
>> vary for different SKUs which makes it more complicated for users.
>> However the user library is free to define own interface based on
>> the underlying cgroup interface say for example you never care about
>> the overlapping and using it for a specific SKU etc.. The underlying
>> cgroup framework is meant to be generic for all SKus and used for
>> most of the use cases.
>>
>> Also at this point I see a lot of enterprise and and other users
>> already using the cgroup interface or shown interest in the same.
>> However I see your point where you indicate the ease with which user
>> can specify in size/percentage which he might be used to doing for
>> other resources rather than bits where he needs to get an idea size
>> by calculating it seperately - But again note that you may not be
>> able to define percentages in many scenarios like the one above. And
>> another question would be we would need to convince the users to
>> adapt to the modified percentage user model (ex: like the one you
>> say above where percentage - 100 is the one thats shared)
>> I can review this requirements and others I have received and get
>> back to see the closest that can be done if possible.
>>
>> Thanks,
>> Vikas
>
> Vikas,
>
> Three questions:
>
> First, usage model. The usage model for CAT is the following
> (please correct me if i'm wrong):
>
> 1) measure application performance without L3 cache reservation.
> 2) measure application perf with L3 cache reservation and
> X number of ways until desired perf is attained.
>
> On migration to a new hardware platform, to achieve similar benefit
> achieved when going from 1) to 2) is to reserve _at least_ the number of
> bytes that "X ways" provided when the measurement was performed. Is that
> correct?
>
> If that is correct, then the user does want to record "number of bytes"
> that X ways on measurement CPU provided.
>

The number of ways mapping to bits is implementation dependent. So we really
cannot refer one way as a bit..

to map the size to bits. could check the cache capacity in /proc and then the
number of bits in the cbm (max bits are shown in the root intel_rdt cgroup) .
ex: cache is 2MB. we have 16 bits cbm - a mask of 0xff would represent 1MB.

> Second question:
> Do you envision any use case which the placement of cache
> and not the quantity of cache is a criteria for decision?
> That is, two cases with the same amount of cache for each CLOSid,
> but with different locations inside the cache?
> (except sharing of ways by two CLOSid's, of course).
>

cbm max - 16 bits. 000f - allocate right quarter. f000 - allocate left
quarter.. ? extend the case to any number of valid contiguous bits.


> Third question:
> How about support for the (new) I/D cache division?
>

Planning to be sending a patch end of this week or early next week.

Thanks,
Vikas


>
>
>
>
>
>
>