2015-06-12 18:20:17

by Shivappa Vikas

Subject: [PATCH V9 00/10] New cpumask API and Intel Cache Allocation support

This series starts with preparatory patches that add a new API,
cpumask_any_online_but(), and change the hot cpu handling code in the
existing cache monitoring and RAPL kernel code. This improves hot cpu
notification handling by not looping through all online cpus, which
could be expensive on large systems.
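
For readers who want to experiment with the semantics outside the
kernel, here is a minimal userspace sketch of what the new helper is
meant to return. It is not the kernel implementation: it assumes at
most 64 cpus held in a uint64_t, and the names any_online_but() and
MODEL_NR_CPUS are made up for illustration. Like the kernel helper, it
returns a value >= the cpu count when no candidate exists.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for nr_cpu_ids: this model supports 64 cpus. */
#define MODEL_NR_CPUS 64u

/*
 * Userspace model of cpumask_any_online_but(): return some cpu that is
 * set in both 'mask' and 'online', excluding 'cpu'. Returns
 * MODEL_NR_CPUS when no such cpu exists.
 */
static unsigned int any_online_but(uint64_t mask, uint64_t online,
				   unsigned int cpu)
{
	uint64_t candidates;
	unsigned int i;

	assert(cpu < MODEL_NR_CPUS);

	/* Combine the masks first so only eligible cpus are examined. */
	candidates = mask & online & ~(UINT64_C(1) << cpu);

	for (i = 0; i < MODEL_NR_CPUS; i++)
		if (candidates & (UINT64_C(1) << i))
			return i;

	return MODEL_NR_CPUS;
}
```

A caller would pass the package's core mask as 'mask' and treat any
return value >= MODEL_NR_CPUS as "no online sibling left".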

The Cache Allocation patches (dependent on the prep patches) add a
cgroup subsystem to support the new Cache Allocation feature found in
future Intel Xeon processors. Cache Allocation is a sub-feature of
Resource Director Technology (RDT), which provides support to control
the sharing of platform resources like L3 cache.

Cache Allocation Technology provides a way for the Software (OS/VMM) to
restrict cache allocation to a defined 'subset' of cache which may
overlap with other 'subsets'. This feature is used when allocating
a line in cache, i.e. when pulling new data into the cache. The
hardware is programmed via MSRs. This patch series adds support for
L3 cache allocation.

In today's processors the number of cores is continuously increasing,
which in turn increases the number of threads or workloads that can
run simultaneously. When multi-threaded applications run
concurrently, they compete for shared resources including L3 cache. At
times, this L3 cache contention may result in inefficient space
utilization. For example, a higher priority thread may end up with less
L3 cache, or a cache sensitive app may not get optimal cache
occupancy, thereby degrading performance. The Cache Allocation kernel
patches provide a framework for sharing L3 cache so that users can
allocate the resource according to set requirements.

More information about the feature can be found in the Intel SDM, Volume
3, section 17.15. The SDM does not use the 'RDT' term yet; this is
planned to change at a later time.

*All the patches will apply on tip/perf/core*.

Changes in V9:
Changes made as per Thomas's feedback:
- added a comment where we call schedule in code only when RDT is
enabled.
- Reordered the local declarations to follow convention in
intel_cqm_xchg_rmid

Changes in V8: Thanks to Thomas for the feedback; the following changes
are based on it:

Generic changes/Preparatory patches:
-added a new cpumask_any_online_but which returns the next
core sibling that is online.
-Made changes in Intel Cache monitoring and Intel RAPL(Running average
power limit) code to use the new function above to find the next cpu
that can be a designated reader for the package. Also changed the way
the package masks are computed which can be simplified using
topology_core_cpumask.

Cache allocation specific changes:
-Moved the documentation to the beginning of the patch series.
-Added more documentation for the rdt cgroup files in the documentation.
-Changed the dmesg output when cache alloc is enabled to be more helpful
and updated a few other comments to be more readable.
-removed __ prefix to functions like clos_get which were not following
convention.
-added code to take action on a WARN_ON in clos_put. Made a few other
changes to reduce code text.
-updated the comments for the call to rdt_css_alloc and for the data
structures to a more readable kernel-doc format.
-removed cgroup_init
-changed the names of functions to only have intel_ prefix for external
APIs.
-replaced (void *)&closid with (void *)closid when calling
on_each_cpu_mask
-fixed the reference release of closid during cache bitmask write.
-changed the code to not ignore a cache mask which has bits set outside
of the max bits allowed. It returns an error instead.
-replaced bitmap_set(&max_mask, 0, max_cbm_len) with max_mask =
(1ULL << max_cbm) - 1.
- update the rdt_cpu_mask which has one cpu for each package, using
topology_core_cpumask instead of looping through existing rdt_cpu_mask.
Realized topology_core_cpumask name is misleading and it actually
returns the cores in a cpu package!
-arranged the code better to have the code relating to similar task
together.
-Improved searching for the next online cpu sibling and maintaining the
rdt_cpu_mask which has one cpu per package.
-removed the unnecessary wrapper rdt_enabled.
-removed unnecessary spin lock and rculock in the scheduling code.
-merged all scheduling code into one patch, not separating the RDT common
software cache code.

Changes in V7: Based on feedback from PeterZ and Matt and the following
discussions:
- changed a lot of naming to reflect which data structures are common
to RDT and which are specific to Cache allocation.
- removed all usage of 'cat' and replaced it with the more friendly
'cache allocation'.
- fixed a lot of convention issues (whitespace, return paradigm etc).
- changed the scheduling hook for RDT to not use an inline.
- removed the new scheduling hook and reused the existing one,
similar to the perf hook.

Changes in V6:
- rebased to 4.1-rc1 which has the CMT(cache monitoring) support included.
- (Thanks to Marcelo's feedback).Fixed support for hot cpu handling for
IA32_L3_QOS MSRs. Although during deep C states the MSR need not be restored
this is needed when physically a new package is added.
-some other coding convention changes including renaming to cache_mask and
using a refcnt to track the number of cgroups using a closid in clos_cbm map.
-1b cbm support for non-hsw SKUs. HSW is an exception which needs the cache
bit masks to be at least 2 bits.

Changes in v5:
- Added support to propagate the cache bit mask update for each
package.
- Removed the cache bit mask reference in the intel_rdt structure as
there was no need for that and we already maintain a separate
closid<->cbm mapping.
- Made a few coding convention changes which include adding the
assertion while freeing the CLOSID.

Changes in V4:
- Integrated with the latest V5 CMT patches.
- Changed naming of cgroup to rdt(resource director technology) from
cat(cache allocation technology). This was done as the RDT is the
umbrella term for platform shared resources allocation. Hence in
future it would be easier to add resource allocation to the same
cgroup
- Naming changes also applied to a lot of other data structures/APIs.
- Added documentation on cgroup usage for cache allocation to address
a lot of questions from academia and industry regarding
cache allocation usage.

Changes in V3:
- Implements a common software cache for IA32_PQR_MSR
- Implements support for hsw Cache Allocation enumeration. This does not use the brand
strings like earlier version but does a probe test. The probe test is done only
on hsw family of processors
- Made a few coding convention, name changes
- Check for lock being held when ClosID manipulation happens

Changes in V2:
- Removed HSW specific enumeration changes. Plan to include it later as a
separate patch.
- Fixed the code in prep_arch_switch to be specific for x86 and removed
x86 defines.
- Fixed cbm_write to not write all 1s when a cgroup is freed.
- Fixed one possible memory leak in init.
- Changed some of the manual bitmap manipulation to use the predefined
bitmap APIs to make the code more readable
- Changed name in sources from cqe to cat
- Global cat enable flag changed to static_key and disabled cgroup early_init

[PATCH 01/10] cpumask: Introduce cpumask_any_online_but
[PATCH 02/10] x86/intel_cqm: Modify hot cpu notification handling
[PATCH 03/10] x86/intel_rapl: Modify hot cpu notification handling
[PATCH 04/10] x86/intel_rdt: Cache Allocation documentation and
[PATCH 05/10] x86/intel_rdt: Add support for Cache Allocation
[PATCH 06/10] x86/intel_rdt: Add new cgroup and Class of service
[PATCH 07/10] x86/intel_rdt: Add support for cache bit mask
[PATCH 08/10] x86/intel_rdt: Implement scheduling support for Intel
[PATCH 09/10] x86/intel_rdt: Hot cpu support for Cache Allocation
[PATCH 10/10] x86/intel_rdt: Intel haswell Cache Allocation


2015-06-12 18:20:19

by Shivappa Vikas

Subject: [PATCH 01/10] cpumask: Introduce cpumask_any_online_but

There is currently no cpumask helper function to pick a "random" cpu
from a mask that is also online.

cpumask_any_online_but() does that: it is similar to cpumask_any_but(),
but only returns a cpu that is also online.

Signed-off-by: Vikas Shivappa <[email protected]>
---
include/linux/cpumask.h | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 27e285b..f2d7e8a 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -548,6 +548,24 @@ static inline void cpumask_copy(struct cpumask *dstp,
#define cpumask_of(cpu) (get_cpu_mask(cpu))

/**
+ * cpumask_any_online_but - return a "random" and online cpu in a cpumask,
+ * but not this one
+ * @mask: the input mask to search
+ * @cpu: the cpu to ignore
+ *
+ * Returns >= nr_cpu_ids if no cpus set.
+*/
+static inline unsigned int cpumask_any_online_but(const struct cpumask *mask,
+ unsigned int cpu)
+{
+ cpumask_t tmp;
+
+ cpumask_and(&tmp, cpu_online_mask, mask);
+ cpumask_clear_cpu(cpu, &tmp);
+ return cpumask_any(&tmp);
+}
+
+/**
* cpumask_parse_user - extract a cpumask from a user string
* @buf: the buffer to extract from
* @len: the length of the buffer
--
1.9.1

2015-06-12 18:20:26

by Shivappa Vikas

Subject: [PATCH 02/10] x86/intel_cqm: Modify hot cpu notification handling

This patch modifies hot cpu notification handling in Intel cache
monitoring:

- to add a new cpu to the cqm_cpumask(which has one cpu per package)
during cpu start, it uses the existing package<->core map instead of
looping through all cpus in cqm_cpumask.
- to search for the next online sibling during cpu exit, it uses
cpumask_any_online_but() instead of looping through all online cpus.
On large systems with many cpus the loop may be expensive, and its
cost grows linearly with the number of cpus.
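
The two paths described above can be sketched in userspace as follows.
This is only a model under stated assumptions: cpus are bits in a
uint64_t, 'package' stands in for topology_core_cpumask(cpu), and the
function names pick_reader() and reader_exit() are hypothetical, not
the names used in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cpu limit for this model. */
#define MODEL_NR_CPUS 64u

/* cpu start: make 'cpu' the reader only if its package has none yet. */
static uint64_t pick_reader(uint64_t readers, uint64_t package,
			    unsigned int cpu)
{
	assert(cpu < MODEL_NR_CPUS);

	if ((readers & package) == 0)
		readers |= UINT64_C(1) << cpu;

	return readers;
}

/* cpu exit: if 'cpu' was the reader, hand over to an online sibling. */
static uint64_t reader_exit(uint64_t readers, uint64_t package,
			    uint64_t online, unsigned int cpu)
{
	uint64_t bit, siblings;

	assert(cpu < MODEL_NR_CPUS);
	bit = UINT64_C(1) << cpu;

	if (!(readers & bit))
		return readers;		/* cpu was not the reader */
	readers &= ~bit;

	/* Online cpus on the same package, excluding the exiting cpu. */
	siblings = package & online & ~bit;
	if (siblings)
		readers |= siblings & (~siblings + 1);	/* lowest set bit */

	return readers;
}
```

Both paths only touch the bits of one package, so the work no longer
depends on the total number of online cpus.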

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_cqm.c | 27 ++++++++++-----------------
1 file changed, 10 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 1880761..b224142 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -1236,15 +1236,15 @@ static struct pmu intel_cqm_pmu = {

static inline void cqm_pick_event_reader(int cpu)
{
- int phys_id = topology_physical_package_id(cpu);
- int i;
+ struct cpumask tmp;

- for_each_cpu(i, &cqm_cpumask) {
- if (phys_id == topology_physical_package_id(i))
- return; /* already got reader for this socket */
- }
+ cpumask_and(&tmp, &cqm_cpumask, topology_core_cpumask(cpu));

- cpumask_set_cpu(cpu, &cqm_cpumask);
+ /*
+ * Pick a reader if there isn't one already.
+ */
+ if (cpumask_empty(&tmp))
+ cpumask_set_cpu(cpu, &cqm_cpumask);
}

static void intel_cqm_cpu_prepare(unsigned int cpu)
@@ -1262,7 +1262,6 @@ static void intel_cqm_cpu_prepare(unsigned int cpu)

static void intel_cqm_cpu_exit(unsigned int cpu)
{
- int phys_id = topology_physical_package_id(cpu);
int i;

/*
@@ -1271,15 +1270,9 @@ static void intel_cqm_cpu_exit(unsigned int cpu)
if (!cpumask_test_and_clear_cpu(cpu, &cqm_cpumask))
return;

- for_each_online_cpu(i) {
- if (i == cpu)
- continue;
-
- if (phys_id == topology_physical_package_id(i)) {
- cpumask_set_cpu(i, &cqm_cpumask);
- break;
- }
- }
+ i = cpumask_any_online_but(topology_core_cpumask(cpu), cpu);
+ if (i < nr_cpu_ids)
+ cpumask_set_cpu(i, &cqm_cpumask);
}

static int intel_cqm_cpu_notifier(struct notifier_block *nb,
--
1.9.1

2015-06-12 18:20:23

by Shivappa Vikas

Subject: [PATCH 03/10] x86/intel_rapl: Modify hot cpu notification handling for RAPL

This patch modifies the hot cpu notification handling in
Intel Running Average Power Limit(RAPL) driver.

- to add a cpu reader to the rapl_cpumask(which has one cpu per package
set) it uses the existing package<->core map instead of looping
through all cpus in rapl_cpumask.
- to search for the next online sibling during hot cpu exit, it uses
cpumask_any_online_but() instead of looping through all online cpus.
On large systems with many cpus the loop may be expensive, and its
cost grows linearly with the number of cpus.

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_rapl.c | 27 ++++++++++-----------------
1 file changed, 10 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
index 358c54a..987383e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_rapl.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
@@ -524,18 +524,14 @@ static struct pmu rapl_pmu_class = {
static void rapl_cpu_exit(int cpu)
{
struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
- int i, phys_id = topology_physical_package_id(cpu);
int target = -1;
+ int i;

/* find a new cpu on same package */
- for_each_online_cpu(i) {
- if (i == cpu)
- continue;
- if (phys_id == topology_physical_package_id(i)) {
- target = i;
- break;
- }
- }
+ i = cpumask_any_online_but(topology_core_cpumask(cpu), cpu);
+ if (i < nr_cpu_ids)
+ target = i;
+
/*
* clear cpu from cpumask
* if was set in cpumask and still some cpu on package,
@@ -557,15 +553,12 @@ static void rapl_cpu_exit(int cpu)

static void rapl_cpu_init(int cpu)
{
- int i, phys_id = topology_physical_package_id(cpu);
+ struct cpumask tmp;

- /* check if phys_is is already covered */
- for_each_cpu(i, &rapl_cpu_mask) {
- if (phys_id == topology_physical_package_id(i))
- return;
- }
- /* was not found, so add it */
- cpumask_set_cpu(cpu, &rapl_cpu_mask);
+ /* check if cpu's package is already covered.If not, add it.*/
+ cpumask_and(&tmp, &rapl_cpu_mask, topology_core_cpumask(cpu));
+ if (cpumask_empty(&tmp))
+ cpumask_set_cpu(cpu, &rapl_cpu_mask);
}

static __init void rapl_hsw_server_quirk(void)
--
1.9.1

2015-06-12 18:20:32

by Shivappa Vikas

Subject: [PATCH 04/10] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

Adds a description of Cache Allocation Technology, an overview of the
kernel implementation, and usage of the Cache Allocation cgroup
interface.

Cache allocation is a sub-feature of Resource Director Technology (RDT)
Allocation, or Platform Shared resource control, which provides support
to control platform shared resources like L3 cache. Currently L3 cache
is the only resource that is supported in RDT. More information can be
found in the Intel SDM, Volume 3, section 17.15.

Cache Allocation Technology provides a way for the Software (OS/VMM)
to restrict cache allocation to a defined 'subset' of cache which may
overlap with other 'subsets'. This feature is used when
allocating a line in cache, i.e. when pulling new data into the cache.
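
The documentation added by this patch describes a cache bit mask (CBM)
as a contiguous run of bits, with a minimum length of 1 or 2 bits,
that must be a subset of the parent's mask and must not exceed the
maximum CBM length. A rough userspace sketch of such a check follows.
The function names cbm_is_valid() and cbm_bytes() are hypothetical, and
the size calculation assumes each CBM bit covers an equal share of the
cache, as the documentation's 2MB/16-bit example suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical validity check for a cache bit mask (CBM).
 * parent:  the parent cgroup's mask; a child must be a subset of it
 * max_len: number of CBM bits the hardware supports (1..63 here)
 * min_len: minimum number of set bits (1, or 2 on hsw per the doc)
 */
static bool cbm_is_valid(uint64_t cbm, uint64_t parent,
			 unsigned int max_len, unsigned int min_len)
{
	uint64_t max_mask, shifted;
	unsigned int len = 0;

	assert(max_len >= 1 && max_len < 64);
	max_mask = (UINT64_C(1) << max_len) - 1;

	if (cbm == 0 || (cbm & ~max_mask))
		return false;		/* empty, or bits outside the max */
	if (cbm & ~parent)
		return false;		/* not a subset of the parent */

	/* Drop trailing zeros; a contiguous run then looks like 2^n - 1. */
	shifted = cbm;
	while (!(shifted & 1))
		shifted >>= 1;
	if (shifted & (shifted + 1))
		return false;		/* set bits are not contiguous */

	for (; shifted; shifted >>= 1)
		len++;

	return len >= min_len;
}

/* Cache bytes covered by a mask, assuming equal-sized shares per bit. */
static uint64_t cbm_bytes(uint64_t cbm, unsigned int max_len,
			  uint64_t cache_bytes)
{
	unsigned int bits = 0;

	for (; cbm; cbm >>= 1)
		bits += (unsigned int)(cbm & 1);

	return cache_bytes / max_len * bits;
}
```

With a 2MB cache and a 16-bit CBM, the mask 0xf passes the check and
maps to 512KB, matching the 'right quarter' example in the
documentation.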

Signed-off-by: Vikas Shivappa <[email protected]>
---
Documentation/cgroups/rdt.txt | 215 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 215 insertions(+)
create mode 100644 Documentation/cgroups/rdt.txt

diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
new file mode 100644
index 0000000..291b4d6
--- /dev/null
+++ b/Documentation/cgroups/rdt.txt
@@ -0,0 +1,215 @@
+ RDT
+ ---
+
+Copyright (C) 2014 Intel Corporation
+Written by [email protected]
+(based on contents and format from cpusets.txt)
+
+CONTENTS:
+=========
+
+1. Cache Allocation Technology
+ 1.1 What is RDT and Cache allocation ?
+ 1.2 Why is Cache allocation needed ?
+ 1.3 Cache allocation implementation overview
+ 1.4 Assignment of CBM and CLOS
+ 1.5 Scheduling and Context Switch
+2. Usage Examples and Syntax
+
+1. Cache Allocation Technology(Cache allocation)
+===================================
+
+1.1 What is RDT and Cache allocation
+------------------------------------
+
+Cache allocation is a sub-feature of Resource Director Technology(RDT)
+Allocation or Platform Shared resource control which provides support to
+control Platform shared resources like L3 cache. Currently L3 Cache is
+the only resource that is supported in RDT. More information can be
+found in the Intel SDM, Volume 3, section 17.15.
+
+Cache Allocation Technology provides a way for the Software (OS/VMM)
+to restrict cache allocation to a defined 'subset' of cache which may
+be overlapping with other 'subsets'. This feature is used when
+allocating a line in cache ie when pulling new data into the cache.
+The programming of the h/w is done via programming MSRs.
+
+The different cache subsets are identified by CLOS identifier (class
+of service) and each CLOS has a CBM (cache bit mask). The CBM is a
+contiguous set of bits which defines the amount of cache resource that
+is available for each 'subset'.
+
+1.2 Why is Cache allocation needed
+----------------------------------
+
+In today's processors the number of cores is continuously increasing,
+especially in large scale usage models where VMs are used, like
+webservers and datacenters. The increasing number of cores increases the
+number of threads or workloads that can be run simultaneously. When
+multi-threaded applications, VMs and workloads run concurrently, they
+compete for shared resources including L3 cache.
+
+The Cache allocation enables more cache resources to be made available
+for higher priority applications based on guidance from the execution
+environment.
+
+The architecture also allows dynamically changing these subsets during
+runtime to further optimize the performance of the higher priority
+application with minimal degradation to the low priority app.
+Additionally, resources can be rebalanced for system throughput benefit.
+
+This technique may be useful in managing large computer systems with
+large L3 caches. Examples may be large servers running instances of
+webservers or database servers. In such complex systems, these subsets
+can be used for more careful placing of the available cache
+resources.
+
+1.3 Cache allocation implementation Overview
+--------------------------------------------
+
+Kernel implements a cgroup subsystem to support cache allocation.
+
+Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping.
+A CLOS(Class of service) is represented by a CLOSid.CLOSid is internal
+to the kernel and not exposed to user. Each cgroup would have one CBM
+and would just represent one cache 'subset'.
+
+The cgroup follows the cgroup hierarchy; mkdir and adding tasks to the
+cgroup never fail. When a child cgroup is created it inherits the
+CLOSid and the CBM from its parent. When a user changes the default
+CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
+used before. The changing of 'cache_mask' may fail with -ENOSPC once
+the kernel runs out of maximum CLOSids it can support.
+User can create as many cgroups as he wants but having different CBMs
+at the same time is restricted by the maximum number of CLOSids
+(multiple cgroups can have the same CBM).
+Kernel maintains a CLOSid<->cbm mapping which keeps reference counter
+for each cgroup using a CLOSid.
+
+The tasks in the cgroup would get to fill the L3 cache represented by
+the cgroup's 'cache_mask' file.
+
+Root directory would have all available bits set in 'cache_mask' file
+by default.
+
+Each RDT cgroup directory has the following files. Some of them may be a
+part of common RDT framework or be specific to RDT sub-features like
+cache allocation.
+
+ - intel_rdt.cache_mask: The cache bitmask(CBM) is represented by this
+ file. The bitmask must be contiguous and would have a 1 or 2 bit
+ minimum length.
+
+1.4 Assignment of CBM,CLOS
+--------------------------
+
+The 'cache_mask' needs to be a subset of the parent node's
+'cache_mask'. Any contiguous subset of these bits(with a minimum of 2
+bits on hsw SKUs) maybe set to indicate the cache mapping desired. The
+'cache_mask' between 2 directories can overlap. The 'cache_mask' would
+represent the cache 'subset' of the Cache allocation cgroup. For ex: on
+a system with 16 bits of max cbm bits, if the directory has the least
+significant 4 bits set in its 'cache_mask' file(meaning the 'cache_mask'
+is just 0xf), it would be allocated the right quarter of the Last level
+cache which means the tasks belonging to this Cache allocation cgroup
+can use the right quarter of the cache to fill. If it
+has the most significant 8 bits set ,it would be allocated the left
+half of the cache(8 bits out of 16 represents 50%).
+
+The cache portion defined in the CBM file is available to all tasks
+within the cgroup to fill and these task are not allowed to allocate
+space in other parts of the cache.
+
+1.5 Scheduling and Context Switch
+---------------------------------
+
+During context switch kernel implements this by writing the
+CLOSid (internally maintained by kernel) of the cgroup to which the
+task belongs to the CPU's IA32_PQR_ASSOC MSR. The MSR is only written
+when there is a change in the CLOSid for the CPU in order to minimize
+the latency incurred during context switch.
+
+The following considerations are done for the PQR MSR write so that it
+has minimal impact on scheduling hot path:
+- This path doesn't exist on any non-Intel platforms.
+- On Intel platforms, this would not exist by default unless CGROUP_RDT
+is enabled.
+- remains a no-op when CGROUP_RDT is enabled and intel hardware does not
+support the feature.
+- When feature is available, still remains a no-op till the user
+manually creates a cgroup *and* assigns a new cache mask. Since the
+child node inherits the parents cache mask , by cgroup creation there is
+no scheduling hot path impact from the new cgroup.
+- per cpu PQR values are cached and the MSR write is only done when
+there is a task with different PQR is scheduled on the CPU. Typically if
+the task groups are bound to be scheduled on a set of CPUs , the number
+of MSR writes is greatly reduced.
+
+2. Usage examples and syntax
+============================
+
+To check if Cache allocation was enabled on your system
+
+dmesg | grep -i intel_rdt
+should output : intel_rdt: Max bitmask length: xx,Max ClosIds: xx
+the length of cache_mask and CLOS should depend on the system you use.
+
+Also /proc/cpuinfo would have rdt(if rdt is enabled) and cat_l3( if L3
+ cache allocation is enabled).
+
+Following would mount the cache allocation cgroup subsystem and create
+2 directories. Please refer to Documentation/cgroups/cgroups.txt on
+details about how to use cgroups.
+
+ cd /sys/fs/cgroup
+ mkdir rdt
+ mount -t cgroup -ointel_rdt intel_rdt /sys/fs/cgroup/rdt
+ cd rdt
+
+Create 2 rdt cgroups
+
+ mkdir group1
+ mkdir group2
+
+Following are some of the Files in the directory
+
+ ls
+ rdt.cache_mask
+ tasks
+
+Say if the cache is 2MB and cbm supports 16 bits, then setting the
+below allocates the 'right 1/4th(512KB)' of the cache to group2
+
+Edit the CBM for group2 to set the least significant 4 bits. This
+allocates 'right quarter' of the cache.
+
+ cd group2
+ /bin/echo 0xf > rdt.cache_mask
+
+
+Edit the CBM for group2 to set the least significant 8 bits.This
+allocates the right half of the cache to 'group2'.
+
+ cd group2
+ /bin/echo 0xff > rdt.cache_mask
+
+Assign tasks to the group2
+
+ /bin/echo PID1 > tasks
+ /bin/echo PID2 > tasks
+
+ Meaning now threads
+ PID1 and PID2 get to fill the 'right half' of
+ the cache as they belong to cgroup group2.
+
+Create a group under group2
+
+ cd group2
+ mkdir group21
+ cat rdt.cache_mask
+ 0xff - inherits parents mask.
+
+ /bin/echo 0xfff > rdt.cache_mask - throws error as mask has to be a subset of the parent's mask
+
+In order to restrict RDT cgroups to specific set of CPUs rdt can be
+comounted with cpusets.
--
1.9.1

2015-06-12 18:22:46

by Shivappa Vikas

Subject: [PATCH 05/10] x86/intel_rdt: Add support for Cache Allocation detection

This patch adds support for the Cache Allocation Technology feature
found in future Intel Xeon processors. Cache allocation is a sub-feature
of Intel Resource Director Technology (RDT), which enables sharing of
processor resources. This patch includes CPUID enumeration routines for
Cache allocation and adds new values to the cpuinfo_x86 structure to
track resources.

Cache allocation provides a way for the Software (OS/VMM) to restrict
cache allocation to a defined 'subset' of cache which may overlap
with other 'subsets'. This feature is used when allocating a line in
cache, i.e. when pulling new data into the cache. The hardware is
programmed via MSRs (model specific registers).

More information about Cache allocation can be found in the Intel (R)
x86 Architecture Software Developer Manual, Volume 3, section 17.15.
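
The enumeration step can be illustrated in isolation. The sketch below
decodes register values as CPUID leaf 0x10, sub-leaf 1 would return
them. It assumes the SDM field layout (CBM length minus one in EAX bits
4:0, highest class-of-service number in EDX bits 15:0); masking to
those fields is an addition of this sketch, since the patch adds one to
the full register values. The struct and function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two values tracked per cpu. */
struct cat_l3_info {
	uint32_t max_cbm_len;	/* number of bits in a cache bit mask */
	uint32_t max_closid;	/* number of classes of service */
};

static struct cat_l3_info decode_cat_l3(uint32_t eax, uint32_t edx)
{
	struct cat_l3_info info;

	/* Both fields are reported as "value minus one". */
	info.max_cbm_len = (eax & 0x1f) + 1;
	info.max_closid = (edx & 0xffff) + 1;

	/* A 5-bit field plus one can never exceed 32. */
	assert(info.max_cbm_len <= 32);

	return info;
}
```

For example, EAX = 0x13 and EDX = 0x3 would decode to a 20-bit CBM and
4 classes of service.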

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/include/asm/cpufeature.h | 6 +++++-
arch/x86/include/asm/processor.h | 3 +++
arch/x86/kernel/cpu/Makefile | 1 +
arch/x86/kernel/cpu/common.c | 15 +++++++++++++++
arch/x86/kernel/cpu/intel_rdt.c | 40 +++++++++++++++++++++++++++++++++++++++
init/Kconfig | 11 +++++++++++
6 files changed, 75 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kernel/cpu/intel_rdt.c

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 3d6606f..ae5ae9d 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -12,7 +12,7 @@
#include <asm/disabled-features.h>
#endif

-#define NCAPINTS 13 /* N 32-bit words worth of info */
+#define NCAPINTS 14 /* N 32-bit words worth of info */
#define NBUGINTS 1 /* N 32-bit bug flags */

/*
@@ -229,6 +229,7 @@
#define X86_FEATURE_RTM ( 9*32+11) /* Restricted Transactional Memory */
#define X86_FEATURE_CQM ( 9*32+12) /* Cache QoS Monitoring */
#define X86_FEATURE_MPX ( 9*32+14) /* Memory Protection Extension */
+#define X86_FEATURE_RDT ( 9*32+15) /* Resource Allocation */
#define X86_FEATURE_AVX512F ( 9*32+16) /* AVX-512 Foundation */
#define X86_FEATURE_RDSEED ( 9*32+18) /* The RDSEED instruction */
#define X86_FEATURE_ADX ( 9*32+19) /* The ADCX and ADOX instructions */
@@ -252,6 +253,9 @@
/* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
#define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */

+/* Intel-defined CPU features, CPUID level 0x00000010:0 (ebx), word 13 */
+#define X86_FEATURE_CAT_L3 (13*32 + 1) /* Cache Allocation L3 */
+
/*
* BUG word(s)
*/
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 23ba676..e84de35 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -114,6 +114,9 @@ struct cpuinfo_x86 {
int x86_cache_occ_scale; /* scale to bytes */
int x86_power;
unsigned long loops_per_jiffy;
+ /* Resource Allocation values */
+ u16 x86_rdt_max_cbm_len;
+ u16 x86_rdt_max_closid;
/* cpuid returned max cores value: */
u16 x86_max_cores;
u16 apicid;
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 9bff687..4ff7a1f 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -48,6 +48,7 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += perf_event_intel_uncore.o \
perf_event_intel_uncore_nhmex.o
endif

+obj-$(CONFIG_CGROUP_RDT) += intel_rdt.o

obj-$(CONFIG_X86_MCE) += mcheck/
obj-$(CONFIG_MTRR) += mtrr/
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index a62cf04..4133d3c 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -670,6 +670,21 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
}
}

+ /* Additional Intel-defined flags: level 0x00000010 */
+ if (c->cpuid_level >= 0x00000010) {
+ u32 eax, ebx, ecx, edx;
+
+ cpuid_count(0x00000010, 0, &eax, &ebx, &ecx, &edx);
+ c->x86_capability[13] = ebx;
+
+ if (cpu_has(c, X86_FEATURE_CAT_L3)) {
+
+ cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx);
+ c->x86_rdt_max_closid = edx + 1;
+ c->x86_rdt_max_cbm_len = eax + 1;
+ }
+ }
+
/* AMD-defined flags: level 0x80000001 */
xlvl = cpuid_eax(0x80000000);
c->extended_cpuid_level = xlvl;
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
new file mode 100644
index 0000000..3cd6db6
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -0,0 +1,40 @@
+/*
+ * Resource Director Technology(RDT)
+ * - Cache Allocation code.
+ *
+ * Copyright (C) 2014 Intel Corporation
+ *
+ * 2015-05-25 Written by
+ * Vikas Shivappa <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * More information about RDT can be found in the Intel (R) x86 Architecture
+ * Software Developer Manual, volume 3, section 17.15.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/slab.h>
+#include <linux/err.h>
+
+static int __init intel_rdt_late_init(void)
+{
+ struct cpuinfo_x86 *c = &boot_cpu_data;
+
+ if (!cpu_has(c, X86_FEATURE_CAT_L3))
+ return -ENODEV;
+
+ pr_info("Intel cache allocation enabled\n");
+
+ return 0;
+}
+
+late_initcall(intel_rdt_late_init);
diff --git a/init/Kconfig b/init/Kconfig
index 81050e4..203f116 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -983,6 +983,17 @@ config CPUSETS

Say N if unsure.

+config CGROUP_RDT
+ bool "Resource Director Technology cgroup subsystem"
+ depends on X86_64 && CPU_SUP_INTEL
+ help
+ This option provides a cgroup to allocate Platform shared
+ resources. Among the shared resources, current implementation
+ focuses on L3 Cache. Using the interface user can specify the
+ amount of L3 cache space into which an application can fill.
+
+ Say N if unsure.
+
config PROC_PID_CPUSET
bool "Include legacy /proc/<pid>/cpuset file"
depends on CPUSETS
--
1.9.1

2015-06-12 18:22:44

by Shivappa Vikas

Subject: [PATCH 06/10] x86/intel_rdt: Add new cgroup and Class of service management

This patch adds a cgroup subsystem for the Intel Resource Director
Technology (RDT) feature and Class of service (CLOSid) management, which
is part of the common RDT framework. This cgroup would eventually be
used by all sub-features of RDT and hence be associated with the common
RDT framework as well as the sub-feature specific frameworks. However,
the current patch series only adds cache allocation specific code.

When a cgroup directory is created it has a CLOSid associated with it,
which is inherited from its parent. The CLOSid is mapped to a
cache_mask which represents the L3 cache allocation to the cgroup.
Tasks belonging to the cgroup get to fill the cache represented by the
cache_mask.

The CLOSid is internal to the kernel and not exposed to the user. The
kernel uses several ways to optimize the allocation of CLOSids, so
exposing the available CLOSids could give users wrong information, as
the set may change dynamically depending on usage.

CLOSid allocation is tracked using a separate bitmap. The maximum number
of CLOSids is specified by the h/w during CPUID enumeration and the
kernel simply returns -ENOSPC when it runs out of CLOSids. Each
cache_mask (CBM) has an associated CLOSid. However, if multiple cgroups
have the same cache mask they would also have the same CLOSid. The
reference count parameter in the CLOSid-CBM map keeps track of how many
cgroups are using each CLOSid<->CBM mapping.
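
The bookkeeping described above can be modelled in userspace. The
following is a simplified sketch, not the code in this patch: the fixed
MODEL_MAX_CLOSID, the 32-bit closmap_model and the function names are
all assumptions for illustration, and locking is left out entirely. It
shows the allocation bitmap, the sharing of one CLOSid between equal
masks, and the -ENOSPC result once all CLOSids are in use.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical limit; the kernel obtains the real one from CPUID. */
#define MODEL_MAX_CLOSID 4u

struct clos_cbm_entry {
	uint64_t cache_mask;
	unsigned int refcnt;	/* number of groups using this CLOSid */
};

static struct clos_cbm_entry ccmap_model[MODEL_MAX_CLOSID];
static uint32_t closmap_model;	/* bit n set: CLOSid n is allocated */

/* Reuse the CLOSid already mapped to 'cbm', or allocate a free one. */
static int closid_get_for_cbm(uint64_t cbm)
{
	unsigned int id;

	assert(cbm != 0);

	for (id = 0; id < MODEL_MAX_CLOSID; id++) {
		if ((closmap_model & (1u << id)) &&
		    ccmap_model[id].cache_mask == cbm) {
			ccmap_model[id].refcnt++;
			return (int)id;
		}
	}

	for (id = 0; id < MODEL_MAX_CLOSID; id++) {
		if (!(closmap_model & (1u << id))) {
			closmap_model |= 1u << id;
			ccmap_model[id].cache_mask = cbm;
			ccmap_model[id].refcnt = 1;
			return (int)id;
		}
	}

	return -ENOSPC;
}

/* Drop a reference; free the CLOSid when the last user goes away. */
static void closid_put_model(unsigned int id)
{
	assert(id < MODEL_MAX_CLOSID && ccmap_model[id].refcnt > 0);

	if (--ccmap_model[id].refcnt == 0) {
		closmap_model &= ~(1u << id);
		ccmap_model[id].cache_mask = 0;
	}
}
```

Two groups with the same mask share one CLOSid in this model, so the
limit applies to the number of distinct masks in use, not to the number
of groups.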

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/include/asm/intel_rdt.h | 36 +++++++++++
arch/x86/kernel/cpu/intel_rdt.c | 132 ++++++++++++++++++++++++++++++++++++++-
include/linux/cgroup_subsys.h | 4 ++
3 files changed, 170 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/include/asm/intel_rdt.h

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
new file mode 100644
index 0000000..2ce3e2c
--- /dev/null
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -0,0 +1,36 @@
+#ifndef _RDT_H_
+#define _RDT_H_
+
+#ifdef CONFIG_CGROUP_RDT
+
+#include <linux/cgroup.h>
+
+struct rdt_subsys_info {
+ unsigned long *closmap;
+};
+
+struct intel_rdt {
+ struct cgroup_subsys_state css;
+ u32 closid;
+};
+
+struct clos_cbm_map {
+ unsigned long cache_mask;
+ unsigned int clos_refcnt;
+};
+
+/*
+ * Return rdt group corresponding to this container.
+ */
+static inline struct intel_rdt *css_rdt(struct cgroup_subsys_state *css)
+{
+ return css ? container_of(css, struct intel_rdt, css) : NULL;
+}
+
+static inline struct intel_rdt *parent_rdt(struct intel_rdt *ir)
+{
+ return css_rdt(ir->css.parent);
+}
+
+#endif
+#endif
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 3cd6db6..5ba241e 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -24,17 +24,145 @@

#include <linux/slab.h>
#include <linux/err.h>
+#include <linux/spinlock.h>
+#include <asm/intel_rdt.h>
+
+/*
+ * ccmap maintains 1:1 mapping between CLOSid and cache bitmask.
+ */
+static struct clos_cbm_map *ccmap;
+static struct rdt_subsys_info rdtss_info;
+static DEFINE_MUTEX(rdt_group_mutex);
+struct intel_rdt rdt_root_group;
+
+static inline void closid_get(u32 closid)
+{
+ struct clos_cbm_map *ccm = &ccmap[closid];
+
+ lockdep_assert_held(&rdt_group_mutex);
+
+ ccm->clos_refcnt++;
+}
+
+static int closid_alloc(struct intel_rdt *ir)
+{
+ u32 maxid;
+ u32 id;
+
+ lockdep_assert_held(&rdt_group_mutex);
+
+ maxid = boot_cpu_data.x86_rdt_max_closid;
+ id = find_next_zero_bit(rdtss_info.closmap, maxid, 0);
+ if (id == maxid)
+ return -ENOSPC;
+
+ set_bit(id, rdtss_info.closmap);
+ closid_get(id);
+ ir->closid = id;
+
+ return 0;
+}
+
+static inline void closid_free(u32 closid)
+{
+ clear_bit(closid, rdtss_info.closmap);
+ ccmap[closid].cache_mask = 0;
+}
+
+static inline void closid_put(u32 closid)
+{
+ struct clos_cbm_map *ccm = &ccmap[closid];
+
+ lockdep_assert_held(&rdt_group_mutex);
+ if (WARN_ON(!ccm->clos_refcnt))
+ return;
+
+ if (!--ccm->clos_refcnt)
+ closid_free(closid);
+}
+
+static struct cgroup_subsys_state *
+intel_rdt_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+ struct intel_rdt *parent = css_rdt(parent_css);
+ struct intel_rdt *ir;
+
+ /*
+ * cgroup_init cannot handle failures gracefully.
+ * Return rdt_root_group.css instead of failure
+ * always even when Cache allocation is not supported.
+ */
+ if (!parent)
+ return &rdt_root_group.css;
+
+ ir = kzalloc(sizeof(struct intel_rdt), GFP_KERNEL);
+ if (!ir)
+ return ERR_PTR(-ENOMEM);
+
+ mutex_lock(&rdt_group_mutex);
+ ir->closid = parent->closid;
+ closid_get(ir->closid);
+ mutex_unlock(&rdt_group_mutex);
+
+ return &ir->css;
+}
+
+static void intel_rdt_css_free(struct cgroup_subsys_state *css)
+{
+ struct intel_rdt *ir = css_rdt(css);
+
+ mutex_lock(&rdt_group_mutex);
+ closid_put(ir->closid);
+ kfree(ir);
+ mutex_unlock(&rdt_group_mutex);
+}

static int __init intel_rdt_late_init(void)
{
struct cpuinfo_x86 *c = &boot_cpu_data;
+ static struct clos_cbm_map *ccm;
+ u32 maxid, max_cbm_len;
+ size_t sizeb;
+ int err = 0;

- if (!cpu_has(c, X86_FEATURE_CAT_L3))
+ if (!cpu_has(c, X86_FEATURE_CAT_L3)) {
+ rdt_root_group.css.ss->disabled = 1;
return -ENODEV;
+ }
+ maxid = c->x86_rdt_max_closid;
+ max_cbm_len = c->x86_rdt_max_cbm_len;
+
+ sizeb = BITS_TO_LONGS(maxid) * sizeof(long);
+ rdtss_info.closmap = kzalloc(sizeb, GFP_KERNEL);
+ if (!rdtss_info.closmap) {
+ err = -ENOMEM;
+ goto out_err;
+ }
+
+ sizeb = maxid * sizeof(struct clos_cbm_map);
+ ccmap = kzalloc(sizeb, GFP_KERNEL);
+ if (!ccmap) {
+ kfree(rdtss_info.closmap);
+ err = -ENOMEM;
+ goto out_err;
+ }
+
+ set_bit(0, rdtss_info.closmap);
+ rdt_root_group.closid = 0;
+ ccm = &ccmap[0];
+ ccm->cache_mask = (1ULL << max_cbm_len) - 1;
+ ccm->clos_refcnt = 1;

pr_info("Intel cache allocation enabled\n");
+out_err:

- return 0;
+ return err;
}

late_initcall(intel_rdt_late_init);
+
+struct cgroup_subsys intel_rdt_cgrp_subsys = {
+ .css_alloc = intel_rdt_css_alloc,
+ .css_free = intel_rdt_css_free,
+ .early_init = 0,
+};
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index e4a96fb..0339312 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -47,6 +47,10 @@ SUBSYS(net_prio)
SUBSYS(hugetlb)
#endif

+#if IS_ENABLED(CONFIG_CGROUP_RDT)
+SUBSYS(intel_rdt)
+#endif
+
/*
* The following subsystems are not supported on the default hierarchy.
*/
--
1.9.1

2015-06-12 18:22:29

by Shivappa Vikas

Subject: [PATCH 07/10] x86/intel_rdt: Add support for cache bit mask management

This change adds a file, cache_mask, to the RDT cgroup. The file
represents the cache bit mask (CBM) for the cgroup. cache_mask is
specific to the Cache allocation sub-feature of RDT. The tasks in the
RDT cgroup get to fill the L3 cache represented by the cgroup's
cache_mask file.

The CBM is updated by writing to the IA32_L3_MASK_n MSRs. The RDT
cgroup follows the cgroup hierarchy; mkdir and adding tasks to the
cgroup never fail. When a child cgroup is created, it inherits the
CLOSid and the cache_mask from its parent. When a user changes the
default CBM for a cgroup, a new CLOSid may be allocated if the
cache_mask was not used before. If the new CBM is already in use, the
reference count for that CLOSid<->CBM mapping is incremented. Changing
'cache_mask' may fail with -ENOSPC once the kernel runs out of CLOSids.

A user can create as many cgroups as needed, but the number of different
CBMs in use at the same time is limited by the maximum number of
CLOSids. The kernel maintains a CLOSid<->CBM mapping which keeps count
of the cgroups using each CLOSid.

Reusing CLOSids for cgroups with the same bitmask also has the following
advantages:
- It helps to use the scarce CLOSids optimally.
- During context switch, the write to the PQR MSR is done only when a
task with a different bitmask is scheduled in.
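
For illustration only (not part of the patch): a userspace sketch of
the CLOSid reuse flow on a cache_mask change, assuming a fixed table of
four CLOSids where a reference count of zero marks a free slot. The
names table, find_mask, find_free and group_set_mask are hypothetical;
the real code additionally validates the mask and writes the MSRs on
each package.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define NR_CLOSID 4   /* assumed; enumerated from hardware in the kernel */

struct clos_entry {
	uint32_t mask;        /* cache bit mask bound to this CLOSid */
	unsigned int refcnt;  /* 0 means the CLOSid is free */
};

/* CLOSid 0 plays the role of the root group with a 20-bit mask. */
static struct clos_entry table[NR_CLOSID] = { { 0xfffff, 1 } };

struct group {
	int closid;
};

/* Find a live CLOSid already programmed with @mask, or -1. */
static int find_mask(uint32_t mask)
{
	for (int i = 0; i < NR_CLOSID; i++)
		if (table[i].refcnt && table[i].mask == mask)
			return i;
	return -1;
}

/* Find an unused CLOSid, or -1. */
static int find_free(void)
{
	for (int i = 0; i < NR_CLOSID; i++)
		if (!table[i].refcnt)
			return i;
	return -1;
}

/* Move @g to @mask, sharing a CLOSid when one already carries that mask. */
static int group_set_mask(struct group *g, uint32_t mask)
{
	int old = g->closid, id;

	assert(old >= 0 && old < NR_CLOSID && table[old].refcnt > 0);
	if (table[old].mask == mask)
		return 0;

	/* Drop our reference first: if we were the only user, the slot frees up. */
	table[old].refcnt--;

	id = find_mask(mask);
	if (id < 0) {
		id = find_free();
		if (id < 0) {
			table[old].refcnt++;   /* out of CLOSids: keep the old mapping */
			return -ENOSPC;
		}
		table[id].mask = mask;     /* the kernel would also write the MSR here */
	}
	table[id].refcnt++;
	g->closid = id;
	return 0;
}
```

Dropping the old reference before searching matters: a group that is the
sole user of its CLOSid can then change its mask even when every other
CLOSid is taken, because its own slot becomes reusable.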

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/include/asm/intel_rdt.h | 3 +
arch/x86/kernel/cpu/intel_rdt.c | 205 ++++++++++++++++++++++++++++++++++++++-
2 files changed, 207 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 2ce3e2c..3ad426c 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -4,6 +4,9 @@
#ifdef CONFIG_CGROUP_RDT

#include <linux/cgroup.h>
+#define MAX_CBM_LENGTH 32
+#define IA32_L3_CBM_BASE 0xc90
+#define CBM_FROM_INDEX(x) (IA32_L3_CBM_BASE + x)

struct rdt_subsys_info {
unsigned long *closmap;
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 5ba241e..becb487 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -34,6 +34,13 @@ static struct clos_cbm_map *ccmap;
static struct rdt_subsys_info rdtss_info;
static DEFINE_MUTEX(rdt_group_mutex);
struct intel_rdt rdt_root_group;
+/*
+ * Mask of CPUs for writing CBM values. We only need one CPU per-socket.
+ */
+static cpumask_t rdt_cpumask;
+
+#define rdt_for_each_child(pos_css, parent_ir) \
+ css_for_each_child((pos_css), &(parent_ir)->css)

static inline void closid_get(u32 closid)
{
@@ -117,13 +124,195 @@ static void intel_rdt_css_free(struct cgroup_subsys_state *css)
mutex_unlock(&rdt_group_mutex);
}

+static int intel_cache_alloc_cbm_read(struct seq_file *m, void *v)
+{
+ struct intel_rdt *ir = css_rdt(seq_css(m));
+
+ seq_printf(m, "%08lx\n", ccmap[ir->closid].cache_mask);
+
+ return 0;
+}
+
+static inline bool cbm_is_contiguous(unsigned long var)
+{
+ unsigned long maxcbm = MAX_CBM_LENGTH;
+ unsigned long first_bit, zero_bit;
+
+ if (!var)
+ return false;
+
+ first_bit = find_next_bit(&var, maxcbm, 0);
+ zero_bit = find_next_zero_bit(&var, maxcbm, first_bit);
+
+ if (find_next_bit(&var, maxcbm, zero_bit) < maxcbm)
+ return false;
+
+ return true;
+}
+
+static int cbm_validate(struct intel_rdt *ir, unsigned long cbmvalue)
+{
+ struct cgroup_subsys_state *css;
+ struct intel_rdt *par, *c;
+ unsigned long *cbm_tmp;
+ int err = 0;
+
+ if (!cbm_is_contiguous(cbmvalue)) {
+ pr_err("bitmask should have >= 1 bit and be contiguous\n");
+ err = -EINVAL;
+ goto out_err;
+ }
+
+ par = parent_rdt(ir);
+ cbm_tmp = &ccmap[par->closid].cache_mask;
+ if (!bitmap_subset(&cbmvalue, cbm_tmp, MAX_CBM_LENGTH)) {
+ err = -EINVAL;
+ goto out_err;
+ }
+
+ rcu_read_lock();
+ rdt_for_each_child(css, ir) {
+ c = css_rdt(css);
+ cbm_tmp = &ccmap[c->closid].cache_mask;
+ if (!bitmap_subset(cbm_tmp, &cbmvalue, MAX_CBM_LENGTH)) {
+ rcu_read_unlock();
+ pr_err("Children's mask not a subset\n");
+ err = -EINVAL;
+ goto out_err;
+ }
+ }
+ rcu_read_unlock();
+out_err:
+
+ return err;
+}
+
+static bool cbm_search(unsigned long cbm, u32 *closid)
+{
+ u32 maxid = boot_cpu_data.x86_rdt_max_closid;
+ u32 i;
+
+ for (i = 0; i < maxid; i++) {
+ if (bitmap_equal(&cbm, &ccmap[i].cache_mask, MAX_CBM_LENGTH)) {
+ *closid = i;
+ return true;
+ }
+ }
+
+ return false;
+}
+
+static void closcbm_map_dump(void)
+{
+ u32 i;
+
+ pr_debug("CBMMAP\n");
+ for (i = 0; i < boot_cpu_data.x86_rdt_max_closid; i++) {
+ pr_debug("cache_mask: 0x%x,clos_refcnt: %u\n",
+ (unsigned int)ccmap[i].cache_mask, ccmap[i].clos_refcnt);
+ }
+}
+
+static void cbm_cpu_update(void *info)
+{
+ u32 closid = (u32) info;
+
+ wrmsrl(CBM_FROM_INDEX(closid), ccmap[closid].cache_mask);
+}
+
+/*
+ * cbm_update_all() - Update the cache bit mask for all packages.
+ */
+static inline void cbm_update_all(u32 closid)
+{
+ on_each_cpu_mask(&rdt_cpumask, cbm_cpu_update, (void *)closid, 1);
+}
+
+/*
+ * intel_cache_alloc_cbm_write() - Validates and writes the
+ * cache bit mask(cbm) to the IA32_L3_MASK_n
+ * and also store the same in the ccmap.
+ *
+ * CLOSids are reused for cgroups which have same bitmask.
+ * This helps to use the scant CLOSids optimally. This also
+ * implies that at context switch write to PQR-MSR is done
+ * only when a task with a different bitmask is scheduled in.
+ */
+static int intel_cache_alloc_cbm_write(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 cbmvalue)
+{
+ u32 max_cbm = boot_cpu_data.x86_rdt_max_cbm_len;
+ struct intel_rdt *ir = css_rdt(css);
+ ssize_t err = 0;
+ u64 max_mask;
+ u32 closid;
+
+ if (ir == &rdt_root_group)
+ return -EPERM;
+
+ /*
+ * Need global mutex as cbm write may allocate a closid.
+ */
+ mutex_lock(&rdt_group_mutex);
+
+ max_mask = (1ULL << max_cbm) - 1;
+ if (cbmvalue & ~max_mask) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (cbmvalue == ccmap[ir->closid].cache_mask)
+ goto out;
+
+ err = cbm_validate(ir, cbmvalue);
+ if (err)
+ goto out;
+
+ /*
+ * Try to get a reference for a different CLOSid and release the
+ * reference to the current CLOSid.
+ * Need to put down the reference here and get it back in case we
+ * run out of closids. Otherwise we run into a problem when
+ * we could be using the last closid that could have been available.
+ */
+ closid_put(ir->closid);
+ if (cbm_search(cbmvalue, &closid)) {
+ ir->closid = closid;
+ closid_get(closid);
+ } else {
+ closid = ir->closid;
+ err = closid_alloc(ir);
+ if (err) {
+ closid_get(ir->closid);
+ goto out;
+ }
+
+ ccmap[ir->closid].cache_mask = cbmvalue;
+ cbm_update_all(ir->closid);
+ }
+ closcbm_map_dump();
+out:
+ mutex_unlock(&rdt_group_mutex);
+
+ return err;
+}
+
+static inline void rdt_cpumask_update(int cpu)
+{
+ cpumask_t tmp;
+
+ cpumask_and(&tmp, &rdt_cpumask, topology_core_cpumask(cpu));
+ if (cpumask_empty(&tmp))
+ cpumask_set_cpu(cpu, &rdt_cpumask);
+}
+
static int __init intel_rdt_late_init(void)
{
struct cpuinfo_x86 *c = &boot_cpu_data;
static struct clos_cbm_map *ccm;
u32 maxid, max_cbm_len;
+ int err = 0, i;
size_t sizeb;
- int err = 0;

if (!cpu_has(c, X86_FEATURE_CAT_L3)) {
rdt_root_group.css.ss->disabled = 1;
@@ -153,6 +342,9 @@ static int __init intel_rdt_late_init(void)
ccm->cache_mask = (1ULL << max_cbm_len) - 1;
ccm->clos_refcnt = 1;

+ for_each_online_cpu(i)
+ rdt_cpumask_update(i);
+
pr_info("Intel cache allocation enabled\n");
out_err:

@@ -161,8 +353,19 @@ out_err:

late_initcall(intel_rdt_late_init);

+static struct cftype rdt_files[] = {
+ {
+ .name = "cache_mask",
+ .seq_show = intel_cache_alloc_cbm_read,
+ .write_u64 = intel_cache_alloc_cbm_write,
+ .mode = 0666,
+ },
+ { } /* terminate */
+};
+
struct cgroup_subsys intel_rdt_cgrp_subsys = {
.css_alloc = intel_rdt_css_alloc,
.css_free = intel_rdt_css_free,
+ .legacy_cftypes = rdt_files,
.early_init = 0,
};
--
1.9.1

2015-06-12 18:21:32

by Shivappa Vikas

Subject: [PATCH 08/10] x86/intel_rdt: Implement scheduling support for Intel RDT

Adds support for IA32_PQR_ASSOC MSR writes during task scheduling. For
Cache Allocation, the MSR write lets the task fill the cache 'subset'
represented by the cgroup's cache_mask.

The high 32 bits of the per-processor MSR IA32_PQR_ASSOC represent the
CLOSid. During context switch, the kernel writes the CLOSid of the
cgroup to which the task belongs to the CPU's IA32_PQR_ASSOC MSR.

This patch also implements a common software cache for IA32_PQR_MSR
(RMID in bits 0:9, CLOSid in bits 32:63) to be used by both Cache
monitoring (CMT) and Cache allocation. CMT updates the RMID whereas
cache_alloc updates the CLOSid in the software cache. During scheduling,
IA32_PQR_MSR is updated when the new RMID/CLOSid value differs from the
cached value. Since the measured rdmsr latency for IA32_PQR_MSR is very
high (~250 cycles), this software cache is necessary to avoid reading
the MSR to compare the current CLOSid value.

The following considerations are made for the PQR MSR write so that it
minimally impacts the scheduler hot path:
- This path does not exist on any non-Intel platforms.
- On Intel platforms, this path does not exist by default unless
CGROUP_RDT is enabled.
- It remains a no-op when CGROUP_RDT is enabled and the Intel SKU does
not support the feature.
- When the feature is available and enabled, no MSR write is done until
the user manually creates a cgroup directory *and* assigns a cache_mask
different from that of the root cgroup directory. Since a child node
inherits its parent's cache mask, cgroup creation by itself has no
scheduling hot path impact.
- The MSR write is only done when a task with a different CLOSid is
scheduled on the CPU. Typically, if the task groups are bound to be
scheduled on a set of CPUs, the number of MSR writes is greatly
reduced.
- A per-CPU cache of CLOSids is maintained to do the check so that we
don't have to do a rdmsr, which actually costs a lot of cycles.
- For cgroup directories having the same cache_mask, the CLOSids are
reused. This minimizes the number of CLOSids used and hence reduces the
MSR write frequency.
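
For illustration only (not part of the patch): a userspace sketch of
the per-CPU software cache check described above. fake_wrmsr, sched_in,
msr_writes and msr_value are hypothetical stand-ins that only count and
record simulated writes; the layout (RMID in the low word, CLOSid in the
high word) follows the description in this message.

```c
#include <assert.h>
#include <stdint.h>

/* Per-CPU shadow of IA32_PQR_ASSOC: RMID in the low bits, CLOSid in the high word. */
struct pqr_state {
	uint32_t rmid;
	uint32_t closid;
};

static unsigned long msr_writes;   /* counts simulated wrmsr calls */
static uint64_t msr_value;         /* last value "written" to the MSR */

/* Stand-in for wrmsr(MSR_IA32_PQR_ASSOC, lo, hi). */
static void fake_wrmsr(uint32_t lo, uint32_t hi)
{
	msr_value = ((uint64_t)hi << 32) | lo;
	msr_writes++;
}

/* Called on context switch with the CLOSid of the incoming task's group. */
static void sched_in(struct pqr_state *state, uint32_t closid)
{
	assert(state);
	if (closid == state->closid)
		return;                /* same class of service: skip the MSR write */

	fake_wrmsr(state->rmid, closid);
	state->closid = closid;
}
```

Because the write always carries both halves, the cached RMID has to be
passed along even though only the CLOSid changed.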

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/include/asm/intel_rdt.h | 45 ++++++++++++++++++++++++++++++
arch/x86/include/asm/rdt_common.h | 25 +++++++++++++++++
arch/x86/include/asm/switch_to.h | 3 ++
arch/x86/kernel/cpu/intel_rdt.c | 17 +++++++++++
arch/x86/kernel/cpu/perf_event_intel_cqm.c | 26 ++---------------
5 files changed, 93 insertions(+), 23 deletions(-)
create mode 100644 arch/x86/include/asm/rdt_common.h

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 3ad426c..78df3d7 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -4,10 +4,16 @@
#ifdef CONFIG_CGROUP_RDT

#include <linux/cgroup.h>
+#include <asm/rdt_common.h>
+
#define MAX_CBM_LENGTH 32
#define IA32_L3_CBM_BASE 0xc90
#define CBM_FROM_INDEX(x) (IA32_L3_CBM_BASE + x)

+DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
+extern struct static_key rdt_enable_key;
+extern void __intel_rdt_sched_in(void);
+
struct rdt_subsys_info {
unsigned long *closmap;
};
@@ -35,5 +41,44 @@ static inline struct intel_rdt *parent_rdt(struct intel_rdt *ir)
return css_rdt(ir->css.parent);
}

+/*
+ * Return rdt group to which this task belongs.
+ */
+static inline struct intel_rdt *task_rdt(struct task_struct *task)
+{
+ return css_rdt(task_css(task, intel_rdt_cgrp_id));
+}
+
+/*
+ * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
+ *
+ * Following considerations are made so that this has minimal impact
+ * on scheduler hot path:
+ * - This will stay as no-op unless we are running on an Intel SKU
+ * which supports L3 cache allocation.
+ * - When support is present and enabled, does not do any
+ * IA32_PQR_MSR writes until the user starts really using the feature
+ * ie creates a rdt cgroup directory and assigns a cache_mask thats
+ * different from the root cgroup's cache_mask.
+ * - Caches the per cpu CLOSid values and does the MSR write only
+ * when a task with a different CLOSid is scheduled in. That
+ * means the task belongs to a different cgroup.
+ * - Closids are allocated so that different cgroup directories
+ * with same cache_mask gets the same CLOSid. This minimizes CLOSids
+ * used and reduces MSR write frequency.
+ */
+static inline void intel_rdt_sched_in(void)
+{
+ /*
+ * Call the schedule in code only when RDT is enabled.
+ */
+ if (static_key_false(&rdt_enable_key))
+ __intel_rdt_sched_in();
+}
+
+#else
+
+static inline void intel_rdt_sched_in(void) {}
+
#endif
#endif
diff --git a/arch/x86/include/asm/rdt_common.h b/arch/x86/include/asm/rdt_common.h
new file mode 100644
index 0000000..01502c5
--- /dev/null
+++ b/arch/x86/include/asm/rdt_common.h
@@ -0,0 +1,25 @@
+#ifndef _X86_RDT_H_
+#define _X86_RDT_H_
+
+#define MSR_IA32_PQR_ASSOC 0x0c8f
+
+/**
+ * struct intel_pqr_state - State cache for the PQR MSR
+ * @rmid: The cached Resource Monitoring ID
+ * @closid: The cached Class Of Service ID
+ * @rmid_usecnt: The usage counter for rmid
+ *
+ * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
+ * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
+ * contains both parts, so we need to cache them.
+ *
+ * The cache also helps to avoid pointless updates if the value does
+ * not change.
+ */
+struct intel_pqr_state {
+ u32 rmid;
+ u32 closid;
+ int rmid_usecnt;
+};
+
+#endif
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 751bf4b..9149577 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -8,6 +8,9 @@ struct tss_struct;
void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
struct tss_struct *tss);

+#include <asm/intel_rdt.h>
+#define finish_arch_switch(prev) intel_rdt_sched_in()
+
#ifdef CONFIG_X86_32

#ifdef CONFIG_CC_STACKPROTECTOR
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index becb487..f90e7ab 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -34,6 +34,8 @@ static struct clos_cbm_map *ccmap;
static struct rdt_subsys_info rdtss_info;
static DEFINE_MUTEX(rdt_group_mutex);
struct intel_rdt rdt_root_group;
+struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE;
+
/*
* Mask of CPUs for writing CBM values. We only need one CPU per-socket.
*/
@@ -88,6 +90,20 @@ static inline void closid_put(u32 closid)
closid_free(closid);
}

+void __intel_rdt_sched_in(void)
+{
+ struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+ struct task_struct *task = current;
+ struct intel_rdt *ir;
+
+ ir = task_rdt(task);
+ if (ir->closid == state->closid)
+ return;
+
+ wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, ir->closid);
+ state->closid = ir->closid;
+}
+
static struct cgroup_subsys_state *
intel_rdt_css_alloc(struct cgroup_subsys_state *parent_css)
{
@@ -345,6 +361,7 @@ static int __init intel_rdt_late_init(void)
for_each_online_cpu(i)
rdt_cpumask_update(i);

+ static_key_slow_inc(&rdt_enable_key);
pr_info("Intel cache allocation enabled\n");
out_err:

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index b224142..9b22a5f9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -7,41 +7,22 @@
#include <linux/perf_event.h>
#include <linux/slab.h>
#include <asm/cpu_device_id.h>
+#include <asm/rdt_common.h>
#include "perf_event.h"

-#define MSR_IA32_PQR_ASSOC 0x0c8f
#define MSR_IA32_QM_CTR 0x0c8e
#define MSR_IA32_QM_EVTSEL 0x0c8d

static u32 cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */

-/**
- * struct intel_pqr_state - State cache for the PQR MSR
- * @rmid: The cached Resource Monitoring ID
- * @closid: The cached Class Of Service ID
- * @rmid_usecnt: The usage counter for rmid
- *
- * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
- * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
- * contains both parts, so we need to cache them.
- *
- * The cache also helps to avoid pointless updates if the value does
- * not change.
- */
-struct intel_pqr_state {
- u32 rmid;
- u32 closid;
- int rmid_usecnt;
-};
-
/*
* The cached intel_pqr_state is strictly per CPU and can never be
* updated from a remote CPU. Both functions which modify the state
* (intel_cqm_event_start and intel_cqm_event_stop) are called with
* interrupts disabled, which is sufficient for the protection.
*/
-static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);

/*
* Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
@@ -402,9 +383,9 @@ static void __intel_cqm_event_count(void *info);
*/
static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)
{
- struct perf_event *event;
struct list_head *head = &group->hw.cqm_group_entry;
u32 old_rmid = group->hw.cqm_rmid;
+ struct perf_event *event;

lockdep_assert_held(&cache_mutex);

@@ -1253,7 +1234,6 @@ static void intel_cqm_cpu_prepare(unsigned int cpu)
struct cpuinfo_x86 *c = &cpu_data(cpu);

state->rmid = 0;
- state->closid = 0;
state->rmid_usecnt = 0;

WARN_ON(c->x86_cache_max_rmid != cqm_max_rmid);
--
1.9.1

2015-06-12 18:21:28

by Shivappa Vikas

Subject: [PATCH 09/10] x86/intel_rdt: Hot cpu support for Cache Allocation

This patch adds hot cpu support for Intel Cache allocation. Support
includes updating the cache bitmask MSRs IA32_L3_QOS_n when a new CPU
package comes online. There is one IA32_L3_QOS_n MSR per Class of
service on each CPU package. The new package's MSRs are synchronized
with the values of the existing MSRs. Also, the software cache for the
IA32_PQR_ASSOC MSR is updated during hot cpu notifications.
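
For illustration only (not part of the patch): a userspace sketch of
keeping one designated MSR-writer CPU per package across hotplug
events. The fixed topology in package_of() and the names online_mask,
writer_mask, cpu_start and cpu_exit are assumptions made for the
example; the kernel uses cpumasks and the topology helpers instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 8

/* Assumed topology: CPUs 0-3 in package 0, CPUs 4-7 in package 1. */
static int package_of(int cpu)
{
	return cpu / 4;
}

static uint32_t online_mask;  /* bit n set => CPU n is online */
static uint32_t writer_mask;  /* one designated MSR-writer CPU per package */

/* Return the designated writer in @cpu's package, or -1 if none. */
static int package_writer(int cpu)
{
	for (int i = 0; i < NR_CPUS; i++)
		if ((writer_mask & (1u << i)) && package_of(i) == package_of(cpu))
			return i;
	return -1;
}

/*
 * CPU comes online. Returns true when it is the first CPU of its package,
 * i.e. when the package's mask MSRs must be synchronized.
 */
static bool cpu_start(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	online_mask |= 1u << cpu;
	if (package_writer(cpu) >= 0)
		return false;
	writer_mask |= 1u << cpu;
	return true;
}

/* CPU goes offline. Hand the writer role to an online sibling, if any. */
static void cpu_exit(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	online_mask &= ~(1u << cpu);
	if (!(writer_mask & (1u << cpu)))
		return;
	writer_mask &= ~(1u << cpu);
	for (int i = 0; i < NR_CPUS; i++) {
		if ((online_mask & (1u << i)) && package_of(i) == package_of(cpu)) {
			writer_mask |= 1u << i;
			return;
		}
	}
}
```

When the last CPU of a package goes away, the package simply has no
writer until one of its CPUs comes back and triggers a fresh sync.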

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt.c | 84 ++++++++++++++++++++++++++++++++++++++++-
1 file changed, 82 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index f90e7ab..d0be46d 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -25,6 +25,7 @@
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/spinlock.h>
+#include <linux/cpu.h>
#include <asm/intel_rdt.h>

/*
@@ -313,13 +314,84 @@ out:
return err;
}

-static inline void rdt_cpumask_update(int cpu)
+static inline bool rdt_cpumask_update(int cpu)
{
cpumask_t tmp;

cpumask_and(&tmp, &rdt_cpumask, topology_core_cpumask(cpu));
- if (cpumask_empty(&tmp))
+ if (cpumask_empty(&tmp)) {
cpumask_set_cpu(cpu, &rdt_cpumask);
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * cbm_update_msrs() - Updates all the existing IA32_L3_MASK_n MSRs
+ * which are one per CLOSid except IA32_L3_MASK_0 on the current package.
+ */
+static inline void cbm_update_msrs(void)
+{
+ int maxid = boot_cpu_data.x86_rdt_max_closid;
+ unsigned int i;
+
+ /*
+ * At cpureset, all bits of IA32_L3_MASK_n are set.
+ * The index starts from one as there is no need
+ * to update IA32_L3_MASK_0 as it belongs to root cgroup
+ * whose cache mask is all 1s always.
+ */
+ for (i = 1; i < maxid; i++) {
+ if (ccmap[i].clos_refcnt)
+ cbm_cpu_update((void *)i);
+ }
+}
+
+static inline void intel_rdt_cpu_start(int cpu)
+{
+ struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
+
+ state->closid = 0;
+ mutex_lock(&rdt_group_mutex);
+ if (rdt_cpumask_update(cpu))
+ cbm_update_msrs();
+ mutex_unlock(&rdt_group_mutex);
+}
+
+static void intel_rdt_cpu_exit(unsigned int cpu)
+{
+ int i;
+
+ mutex_lock(&rdt_group_mutex);
+ if (!cpumask_test_and_clear_cpu(cpu, &rdt_cpumask)) {
+ mutex_unlock(&rdt_group_mutex);
+ return;
+ }
+
+ i = cpumask_any_online_but(topology_core_cpumask(cpu), cpu);
+ if (i < nr_cpu_ids)
+ cpumask_set_cpu(i, &rdt_cpumask);
+ mutex_unlock(&rdt_group_mutex);
+}
+
+static int intel_rdt_cpu_notifier(struct notifier_block *nb,
+ unsigned long action, void *hcpu)
+{
+ unsigned int cpu = (unsigned long)hcpu;
+
+ switch (action) {
+ case CPU_STARTING:
+ intel_rdt_cpu_start(cpu);
+ break;
+ case CPU_DOWN_PREPARE:
+ intel_rdt_cpu_exit(cpu);
+ break;
+ default:
+ break;
+ }
+
+ return NOTIFY_OK;
}

static int __init intel_rdt_late_init(void)
@@ -358,8 +430,16 @@ static int __init intel_rdt_late_init(void)
ccm->cache_mask = (1ULL << max_cbm_len) - 1;
ccm->clos_refcnt = 1;

+ cpu_notifier_register_begin();
+
+ mutex_lock(&rdt_group_mutex);
for_each_online_cpu(i)
rdt_cpumask_update(i);
+ mutex_unlock(&rdt_group_mutex);
+
+ __hotcpu_notifier(intel_rdt_cpu_notifier, 0);
+
+ cpu_notifier_register_done();

static_key_slow_inc(&rdt_enable_key);
pr_info("Intel cache allocation enabled\n");
--
1.9.1

2015-06-12 18:21:30

by Shivappa Vikas

Subject: [PATCH 10/10] x86/intel_rdt: Intel haswell Cache Allocation enumeration

Cache Allocation on hsw (Haswell) needs to be enumerated separately as
HSW does not support CPUID enumeration for Cache Allocation.
Cache Allocation is only supported on certain HSW SKUs. This patch does
a probe test for hsw CPUs by writing a CLOSid (Class of service id) into
the high 32 bits of IA32_PQR_MSR and checking whether the bits stick.
The probe test is only done after confirming that the CPU is HSW. Other
HSW-specific quirks are:
- HSW requires the L3 cache bit mask to be at least two bits.
- The maximum number of CLOSids supported is always 4.
- The maximum number of bits supported in the cache bit mask is always 20.
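
For illustration only (not part of the patch): a userspace sketch of a
mask check that combines the minimum-length quirk above with the
contiguity rule. It uses a bit trick rather than the
find_next_bit() based walk in the patch, and the names weight32 and
cbm_is_valid are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Count set bits without relying on compiler builtins. */
static unsigned int weight32(uint32_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)   /* clear the lowest set bit each round */
		n++;
	return n;
}

/*
 * A mask is acceptable when it fits in @cbm_len bits, has at least
 * @min_bits bits set, and its set bits form one contiguous run.
 */
static bool cbm_is_valid(uint32_t cbm, unsigned int cbm_len, unsigned int min_bits)
{
	uint32_t limit, lowest;

	assert(cbm_len >= 1 && cbm_len <= 32);
	limit = cbm_len == 32 ? UINT32_MAX : (1u << cbm_len) - 1;

	if (cbm & ~limit)
		return false;
	if (cbm == 0 || weight32(cbm) < min_bits)
		return false;

	/* Adding the lowest set bit carries through a contiguous run entirely. */
	lowest = cbm & (~cbm + 1);
	return ((cbm + lowest) & cbm) == 0;
}
```

With the Haswell values from this message, a caller would pass
cbm_len = 20 and min_bits = 2, so a single-bit mask such as 0x1 is
rejected while 0x3 is accepted.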

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt.c | 60 +++++++++++++++++++++++++++++++++++++++--
1 file changed, 58 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index d0be46d..09a0b9c 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -38,6 +38,11 @@ struct intel_rdt rdt_root_group;
struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE;

/*
+ * Minimum bits required in Cache bitmask.
+ */
+static unsigned int min_bitmask_len = 1;
+
+/*
* Mask of CPUs for writing CBM values. We only need one CPU per-socket.
*/
static cpumask_t rdt_cpumask;
@@ -45,6 +50,56 @@ static cpumask_t rdt_cpumask;
#define rdt_for_each_child(pos_css, parent_ir) \
css_for_each_child((pos_css), &(parent_ir)->css)

+/*
+ * hsw_probetest() - Have to do probe test for Intel haswell CPUs as it
+ * does not have CPUID enumeration support for Cache allocation.
+ *
+ * Probes by writing to the high 32 bits(CLOSid) of the IA32_PQR_MSR and
+ * testing if the bits stick. Then hardcode the max CLOS and max
+ * bitmask length on hsw. The minimum cache bitmask length allowed for
+ * HSW is 2 bits.
+ */
+static inline bool hsw_probetest(void)
+{
+ u32 l, h_old, h_new, h_tmp;
+
+ if (rdmsr_safe(MSR_IA32_PQR_ASSOC, &l, &h_old))
+ return false;
+
+ /*
+ * Default value is always 0 if feature is present.
+ */
+ h_tmp = h_old ^ 0x1U;
+ if (wrmsr_safe(MSR_IA32_PQR_ASSOC, l, h_tmp) ||
+ rdmsr_safe(MSR_IA32_PQR_ASSOC, &l, &h_new))
+ return false;
+
+ if (h_tmp != h_new)
+ return false;
+
+ wrmsr_safe(MSR_IA32_PQR_ASSOC, l, h_old);
+
+ boot_cpu_data.x86_rdt_max_closid = 4;
+ boot_cpu_data.x86_rdt_max_cbm_len = 20;
+ min_bitmask_len = 2;
+
+ return true;
+}
+
+static inline bool cache_alloc_supported(struct cpuinfo_x86 *c)
+{
+ if (cpu_has(c, X86_FEATURE_CAT_L3))
+ return true;
+
+ /*
+ * Probe test for Haswell CPUs.
+ */
+ if (c->x86 == 0x6 && c->x86_model == 0x3f)
+ return hsw_probetest();
+
+ return false;
+}
+
static inline void closid_get(u32 closid)
{
struct clos_cbm_map *ccm = &ccmap[closid];
@@ -155,7 +210,7 @@ static inline bool cbm_is_contiguous(unsigned long var)
unsigned long maxcbm = MAX_CBM_LENGTH;
unsigned long first_bit, zero_bit;

- if (!var)
+ if (bitmap_weight(&var, maxcbm) < min_bitmask_len)
return false;

first_bit = find_next_bit(&var, maxcbm, 0);
@@ -175,7 +230,8 @@ static int cbm_validate(struct intel_rdt *ir, unsigned long cbmvalue)
int err = 0;

if (!cbm_is_contiguous(cbmvalue)) {
- pr_err("bitmask should have >= 1 bit and be contiguous\n");
+ pr_err("bitmask should have >=%d bits and be contiguous\n",
+ min_bitmask_len);
err = -EINVAL;
goto out_err;
}
--
1.9.1

2015-06-15 12:36:28

by Peter Zijlstra

Subject: Re: [PATCH 01/10] cpumask: Introduce cpumask_any_online_but

On Fri, Jun 12, 2015 at 11:17:08AM -0700, Vikas Shivappa wrote:
> There is currently no cpumask helper function to pick a "random" cpu
> from a mask which is also online.
>
> cpumask_any_online_but() does that which is similar to cpumask_any_but()
> but also returns a cpu that is online.
>
> Signed-off-by: Vikas Shivappa <[email protected]>
> ---
> include/linux/cpumask.h | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index 27e285b..f2d7e8a 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -548,6 +548,24 @@ static inline void cpumask_copy(struct cpumask *dstp,
> #define cpumask_of(cpu) (get_cpu_mask(cpu))
>
> /**
> + * cpumask_any_online_but - return a "random" and online cpu in a cpumask,
> + * but not this one
> + * @mask: the input mask to search
> + * @cpu: the cpu to ignore
> + *
> + * Returns >= nr_cpu_ids if no cpus set.
> +*/
> +static inline unsigned int cpumask_any_online_but(const struct cpumask *mask,
> + unsigned int cpu)
> +{
> + cpumask_t tmp;

No, you cannot put a cpumask_t on stack like that. Those things can be
massive.

> +
> + cpumask_and(&tmp, cpu_online_mask, mask);
> + cpumask_clear_cpu(cpu, &tmp);
> + return cpumask_any(&tmp);
> +}

You had a good example in cpumask_any_but() copy that.
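
For illustration only (not from the thread): a userspace sketch of the
loop-based approach, which tests each candidate CPU directly instead of
building a temporary mask. With a kernel configured for 8192 CPUs a
cpumask_t takes 1024 bytes, which is why a stack copy is a problem.
The 64-bit masks, NR_CPU_IDS and any_online_but are assumptions for the
example and are not the kernel cpumask API.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPU_IDS 64u

/*
 * Return some CPU that is set in both @mask and @online and is not @cpu,
 * or NR_CPU_IDS when there is none. No temporary mask is built.
 */
static unsigned int any_online_but(uint64_t mask, uint64_t online, unsigned int cpu)
{
	assert(cpu < NR_CPU_IDS);
	for (unsigned int i = 0; i < NR_CPU_IDS; i++) {
		if (i == cpu)
			continue;
		if ((mask >> i) & (online >> i) & 1)
			return i;
	}
	return NR_CPU_IDS;
}
```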

2015-06-15 12:39:06

by Peter Zijlstra

Subject: Re: [PATCH 02/10] x86/intel_cqm: Modify hot cpu notification handling

On Fri, Jun 12, 2015 at 11:17:09AM -0700, Vikas Shivappa wrote:
> static inline void cqm_pick_event_reader(int cpu)
> {
> - int phys_id = topology_physical_package_id(cpu);
> - int i;
> + struct cpumask tmp;

No cpumasks on stacks.

2015-06-15 12:49:07

by Peter Zijlstra

Subject: Re: [PATCH 05/10] x86/intel_rdt: Add support for Cache Allocation detection

On Fri, Jun 12, 2015 at 11:17:12AM -0700, Vikas Shivappa wrote:
> + /* Additional Intel-defined flags: level 0x00000010 */
> + if (c->cpuid_level >= 0x00000010) {
> + u32 eax, ebx, ecx, edx;
> +
> + cpuid_count(0x00000010, 0, &eax, &ebx, &ecx, &edx);
> + c->x86_capability[13] = ebx;
> +
> + if (cpu_has(c, X86_FEATURE_CAT_L3)) {
> +
> + cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx);
> + c->x86_rdt_max_closid = edx + 1;
> + c->x86_rdt_max_cbm_len = eax + 1;
> + }
> + }

I'm still annoyed by the whole RDT/CAT thing, so the above reads a CAT
leaf and puts the values in an RDT variable.

That's inconsistent.

2015-06-15 14:05:17

by Peter Zijlstra

Subject: Re: [PATCH 10/10] x86/intel_rdt: Intel haswell Cache Allocation enumeration

On Fri, Jun 12, 2015 at 11:17:17AM -0700, Vikas Shivappa wrote:
> + /*
> + * Probe test for Haswell CPUs.
> + */
> + if (c->x86 == 0x6 && c->x86_model == 0x3f)
> + return hsw_probetest();

Firstly, isn't a probe already a test?

Secondly, there's more HSW models:

case 60: /* 22nm Haswell Core */
case 63: /* 22nm Haswell Server */
case 69: /* 22nm Haswell ULT */
case 70: /* 22nm Haswell + GT3e (Intel Iris Pro graphics) */

Is this really only HSW server, or should they all be listed?

2015-06-15 16:51:46

by Shivappa Vikas

Subject: Re: [PATCH 01/10] cpumask: Introduce cpumask_any_online_but



On Mon, 15 Jun 2015, Peter Zijlstra wrote:

> On Fri, Jun 12, 2015 at 11:17:08AM -0700, Vikas Shivappa wrote:
>> There is currently no cpumask helper function to pick a "random" cpu
>> from a mask which is also online.
>>
>> cpumask_any_online_but() does that which is similar to cpumask_any_but()
>> but also returns a cpu that is online.
>>
>> Signed-off-by: Vikas Shivappa <[email protected]>
>> ---
>> include/linux/cpumask.h | 18 ++++++++++++++++++
>> 1 file changed, 18 insertions(+)
>>
>> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
>> index 27e285b..f2d7e8a 100644
>> --- a/include/linux/cpumask.h
>> +++ b/include/linux/cpumask.h
>> @@ -548,6 +548,24 @@ static inline void cpumask_copy(struct cpumask *dstp,
>> #define cpumask_of(cpu) (get_cpu_mask(cpu))
>>
>> /**
>> + * cpumask_any_online_but - return a "random" and online cpu in a cpumask,
>> + * but not this one
>> + * @mask: the input mask to search
>> + * @cpu: the cpu to ignore
>> + *
>> + * Returns >= nr_cpu_ids if no cpus set.
>> +*/
>> +static inline unsigned int cpumask_any_online_but(const struct cpumask *mask,
>> + unsigned int cpu)
>> +{
>> + cpumask_t tmp;
>
> No, you cannot put a cpumask_t on stack like that. Those things can be
> massive.

OK, will fix.

>
>> +
>> + cpumask_and(&tmp, cpu_online_mask, mask);
>> + cpumask_clear_cpu(cpu, &tmp);
>> + return cpumask_any(&tmp);
>> +}
>
> You had a good example in cpumask_any_but() copy that.

I saw cpumask_any_but() but wanted to avoid its for loop; now I see from your
previous comment why it is written that way. Without the on-stack cpumask_t I
will have to follow cpumask_any_but(), so the two comments were related.

Thanks,
Vikas

>

2015-06-15 16:55:09

by Shivappa Vikas

Subject: Re: [PATCH 02/10] x86/intel_cqm: Modify hot cpu notification handling




On Mon, 15 Jun 2015, Peter Zijlstra wrote:

> On Fri, Jun 12, 2015 at 11:17:09AM -0700, Vikas Shivappa wrote:
>> static inline void cqm_pick_event_reader(int cpu)
>> {
>> - int phys_id = topology_physical_package_id(cpu);
>> - int i;
>> + struct cpumask tmp;
>
> No cpumasks on stacks.

OK, will change to a static cpumask_t. The same change applies to
rdt_cpumask_update and rapl_cpu_init.

>

2015-06-15 17:07:30

by Shivappa Vikas

Subject: Re: [PATCH 05/10] x86/intel_rdt: Add support for Cache Allocation detection



On Mon, 15 Jun 2015, Peter Zijlstra wrote:

> On Fri, Jun 12, 2015 at 11:17:12AM -0700, Vikas Shivappa wrote:
>> + /* Additional Intel-defined flags: level 0x00000010 */
>> + if (c->cpuid_level >= 0x00000010) {
>> + u32 eax, ebx, ecx, edx;
>> +
>> + cpuid_count(0x00000010, 0, &eax, &ebx, &ecx, &edx);
>> + c->x86_capability[13] = ebx;
>> +
>> + if (cpu_has(c, X86_FEATURE_CAT_L3)) {
>> +
>> + cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx);
>> + c->x86_rdt_max_closid = edx + 1;
>> + c->x86_rdt_max_cbm_len = eax + 1;
>> + }
>> + }
>
> I'm still annoyed by the whole RDT/CAT thing, so the above reads a CAT
> leaf and puts the values in an RDT variable.

Will fix. It was confusing because the closid is generic to RDT, but it is
enumerated separately for each leaf.

>
> That's inconsistent.
>

2015-06-15 21:47:26

by Shivappa Vikas

Subject: Re: [PATCH 10/10] x86/intel_rdt: Intel haswell Cache Allocation enumeration



On Mon, 15 Jun 2015, Peter Zijlstra wrote:

> On Fri, Jun 12, 2015 at 11:17:17AM -0700, Vikas Shivappa wrote:
>> + /*
>> + * Probe test for Haswell CPUs.
>> + */
>> + if (c->x86 == 0x6 && c->x86_model == 0x3f)
>> + return hsw_probetest();
>
> Firstly, isn't a probe already a test?

Will rename it to hsw_probe.

>
> Secondly, there's more HSW models:
>
> case 60: /* 22nm Haswell Core */
> case 63: /* 22nm Haswell Server */
> case 69: /* 22nm Haswell ULT */
> case 70: /* 22nm Haswell + GT3e (Intel Iris Pro graphics) */
>
> Is this really only HSW server,

Yes, this probe is only targeted at HSW servers as of now.

> or should they all be listed?
>

2015-06-16 08:20:34

by Thomas Gleixner

Subject: Re: [PATCH 01/10] cpumask: Introduce cpumask_any_online_but

On Mon, 15 Jun 2015, Vikas Shivappa wrote:
> On Mon, 15 Jun 2015, Peter Zijlstra wrote:
> > On Fri, Jun 12, 2015 at 11:17:08AM -0700, Vikas Shivappa wrote:
> > > + cpumask_and(&tmp, cpu_online_mask, mask);
> > > + cpumask_clear_cpu(cpu, &tmp);
> > > + return cpumask_any(&tmp);
> > > +}
> >
> > You had a good example in cpumask_any_but() copy that.
>
> I saw the cpumask_any_but but wanted to avoid the for loop in the
> cpumask_any_but , but now i see why from your previous comment. Without the
> cpumask_t I will have to use the cpumask_any_but .. the two were related.

It can be done w/o a loop. Hint, you need a static cpumask in your
code anyway.

Thanks

tglx

2015-06-16 08:23:48

by Thomas Gleixner

Subject: Re: [PATCH 10/10] x86/intel_rdt: Intel haswell Cache Allocation enumeration

On Mon, 15 Jun 2015, Vikas Shivappa wrote:
> On Mon, 15 Jun 2015, Peter Zijlstra wrote:
>
> > On Fri, Jun 12, 2015 at 11:17:17AM -0700, Vikas Shivappa wrote:
> > > + /*
> > > + * Probe test for Haswell CPUs.
> > > + */
> > > + if (c->x86 == 0x6 && c->x86_model == 0x3f)
> > > + return hsw_probetest();
> >
> > Firstly, isn't a probe already a test?
>
> Will fix the name to hsw_probe
>
> >
> > Secondly, there's more HSW models:
> >
> > case 60: /* 22nm Haswell Core */
> > case 63: /* 22nm Haswell Server */
> > case 69: /* 22nm Haswell ULT */
> > case 70: /* 22nm Haswell + GT3e (Intel Iris Pro graphics) */
> >
> > Is this really only HSW server,
>
> Yes , this probe is only targeted at HSW servers as of now.
>
> or should they all be listed?

You should know whether they support that feature and need that quirk.

Thanks

tglx

2015-06-16 08:53:04

by Thomas Gleixner

Subject: Re: [PATCH 09/10] x86/intel_rdt: Hot cpu support for Cache Allocation

On Fri, 12 Jun 2015, Vikas Shivappa wrote:
> +static inline void intel_rdt_cpu_start(int cpu)
> +{
> + struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
> +
> + state->closid = 0;
> + mutex_lock(&rdt_group_mutex);

This is called from CPU_STARTING, which runs on the starting cpu with
interrupts disabled. Clearly never tested with any of the mandatory
debug configs enabled.

Thanks

tglx

2015-06-16 09:19:17

by Peter Zijlstra

Subject: Re: [PATCH 10/10] x86/intel_rdt: Intel haswell Cache Allocation enumeration

On Mon, Jun 15, 2015 at 02:44:32PM -0700, Vikas Shivappa wrote:
> >Secondly, there's more HSW models:
> >
> > case 60: /* 22nm Haswell Core */
> > case 63: /* 22nm Haswell Server */
> > case 69: /* 22nm Haswell ULT */
> > case 70: /* 22nm Haswell + GT3e (Intel Iris Pro graphics) */
> >
> >Is this really only HSW server,
>
> Yes , this probe is only targeted at HSW servers as of now.

But do the others have it? What you're targeting this code for is
irrelevant, if those models have the hardware we should support them.

> or should they all be listed?

If they support it, yes. In any case, be explicit on which models have
the hardware. IIRC your current Changelog has the words 'certain SKUs'
in, which is as ambiguous as one can get.

2015-06-16 19:04:05

by Shivappa Vikas

Subject: Re: [PATCH 01/10] cpumask: Introduce cpumask_any_online_but



On Tue, 16 Jun 2015, Thomas Gleixner wrote:

> On Mon, 15 Jun 2015, Vikas Shivappa wrote:
>> On Mon, 15 Jun 2015, Peter Zijlstra wrote:
>>> On Fri, Jun 12, 2015 at 11:17:08AM -0700, Vikas Shivappa wrote:
>>>> + cpumask_and(&tmp, cpu_online_mask, mask);
>>>> + cpumask_clear_cpu(cpu, &tmp);
>>>> + return cpumask_any(&tmp);
>>>> +}
>>>
>>> You had a good example in cpumask_any_but() copy that.
>>
>> I saw the cpumask_any_but but wanted to avoid the for loop in the
>> cpumask_any_but , but now i see why from your previous comment. Without the
>> cpumask_t I will have to use the cpumask_any_but .. the two were related.
>
> It can be done w/o a loop. Hint, you need a static cpumask in your
> code anyway.

Ah, that's right, I always need the tmp mask. I confused this with avoiding the
cpumask_clear_cpu line in the code vs. using cpumask_any_but.

Something like this, then, just making it static?

static cpumask_t tmp;

cpumask_and(&tmp, cpu_online_mask, mask);
cpumask_clear_cpu(cpu, &tmp);
return cpumask_any(&tmp);

Thanks,
Vikas

>
> Thanks
>
> tglx
>

2015-06-16 19:04:31

by Shivappa Vikas

Subject: Re: [PATCH 09/10] x86/intel_rdt: Hot cpu support for Cache Allocation



On Tue, 16 Jun 2015, Thomas Gleixner wrote:

> On Fri, 12 Jun 2015, Vikas Shivappa wrote:
>> +static inline void intel_rdt_cpu_start(int cpu)
>> +{
>> + struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
>> +
>> + state->closid = 0;
>> + mutex_lock(&rdt_group_mutex);
>
> This is called from CPU_STARTING, which runs on the starting cpu with
> interrupts disabled. Clearly never tested with any of the mandatory
> debug configs enabled.

But can't this race with cbm_update_all calling on_each_cpu_mask? In other
words, the lock keeps on_each_cpu_mask from racing with the hot cpu code that
updates rdt_cpumask, since on_each_cpu_mask is always called with the lock held.

It's tested on the 0 day build, which should include the debug config. Will add a
tested tag.

Thanks,
Vikas

>
> Thanks
>
> tglx
>

2015-06-16 19:25:30

by Thomas Gleixner

Subject: Re: [PATCH 09/10] x86/intel_rdt: Hot cpu support for Cache Allocation

On Tue, 16 Jun 2015, Vikas Shivappa wrote:
> On Tue, 16 Jun 2015, Thomas Gleixner wrote:
>
> > On Fri, 12 Jun 2015, Vikas Shivappa wrote:
> > > +static inline void intel_rdt_cpu_start(int cpu)
> > > +{
> > > + struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
> > > +
> > > + state->closid = 0;
> > > + mutex_lock(&rdt_group_mutex);
> >
> > This is called from CPU_STARTING, which runs on the starting cpu with
> > interrupts disabled. Clearly never tested with any of the mandatory
> > debug configs enabled.
>
> But this can race with cbm_update_all calling on_each_cpu_mask ? or in other
> words the lock helps on_each_cpu_mask not race with hot cpu code updating the
> rdt_cpumask since the on_each_cpu_mask is also called with the lock always.
>
> Its tested on the 0 day build which should include the debug config. Will add
> a tested tag.

And that tag gives you special permission to take a mutex in irq
disabled context on a cpu which cannot schedule, right?

Thanks,

tglx

2015-06-17 16:25:58

by Shivappa Vikas

Subject: Re: [PATCH 10/10] x86/intel_rdt: Intel haswell Cache Allocation enumeration



On Tue, 16 Jun 2015, Peter Zijlstra wrote:

> On Mon, Jun 15, 2015 at 02:44:32PM -0700, Vikas Shivappa wrote:
>>> Secondly, there's more HSW models:
>>>
>>> case 60: /* 22nm Haswell Core */
>>> case 63: /* 22nm Haswell Server */
>>> case 69: /* 22nm Haswell ULT */
>>> case 70: /* 22nm Haswell + GT3e (Intel Iris Pro graphics) */
>>>
>>> Is this really only HSW server,
>>
>> Yes , this probe is only targeted at HSW servers as of now.
>
> But do the others have it? What you're targeting this code for is
> irrelevant, if those models have the hardware we should support them.
>
>> or should they all be listed?
>
> If they support it, yes. In any case, be explicit on which models have
> the hardware. IIRC your current Changelog has the words 'certain SKUs'
> in, which is as ambiguous as one can get.

It's a little confusing because the quirks we have to use on the clients, and
even the details of those, are not public. Hence the Linux kernel would only
support it on servers for now. I just checked the latest info again and the
status is the same as of now.

As far as servers are concerned, it's only supported in certain SKUs and
sub-SKUs, and some require a ucode patch. The probe should take care of them all.

Hope that answers both the questions.

Thanks,
Vikas

>

2015-06-19 20:46:00

by Shivappa Vikas

Subject: Re: [PATCH 09/10] x86/intel_rdt: Hot cpu support for Cache Allocation



On Tue, 16 Jun 2015, Thomas Gleixner wrote:

> On Tue, 16 Jun 2015, Vikas Shivappa wrote:
>> On Tue, 16 Jun 2015, Thomas Gleixner wrote:
>>
>>> On Fri, 12 Jun 2015, Vikas Shivappa wrote:
>>>> +static inline void intel_rdt_cpu_start(int cpu)
>>>> +{
>>>> + struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
>>>> +
>>>> + state->closid = 0;
>>>> + mutex_lock(&rdt_group_mutex);
>>>
>>> This is called from CPU_STARTING, which runs on the starting cpu with
>>> interrupts disabled. Clearly never tested with any of the mandatory
>>> debug configs enabled.
>>
>> But this can race with cbm_update_all calling on_each_cpu_mask ? or in other
>> words the lock helps on_each_cpu_mask not race with hot cpu code updating the
>> rdt_cpumask since the on_each_cpu_mask is also called with the lock always.
>>
>> Its tested on the 0 day build which should include the debug config. Will add
>> a tested tag.
>
> And that tag gives you special permission to take a mutex in irq
> disabled context on a cpu which cannot schedule, right?

Will fix.

Would doing
case CPU_ONLINE:
intel_rdt_cpu_start(cpu);
instead of
CPU_UP_PREPARE:
intel_rdt_cpu_start(cpu);

work?

CPU_ONLINE is enough for me: it does not run with interrupts disabled, so I can
then sync using the lock.

Thanks,
Vikas

>
> Thanks,
>
> tglx
>