This patch series makes some changes to the hot cpu handling code in the
existing cache monitoring and RAPL kernel code. This improves hot cpu
notification handling by avoiding a loop through all online cpus, which
could be expensive in large systems.
The cache allocation patches (dependent on the prep patches) add a cgroup
subsystem to support the new Cache Allocation feature found in future
Intel Xeon processors. Cache Allocation is a sub-feature within the
Resource Director Technology (RDT) feature. RDT provides support to
control sharing of platform resources like the L3 cache.
Cache Allocation Technology provides a way for the Software (OS/VMM) to
restrict cache allocation to a defined 'subset' of the cache which may be
overlapping with other 'subsets'. This feature is used when allocating a
line in the cache, ie when pulling new data into the cache. The h/w is
programmed via MSRs. This patch series adds support to perform L3 cache
allocation.
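As a rough sketch of the h/w interface (illustrative only, not code from
the series; the MSR constants are the ones used by the patches below,
while the CLOSid and mask values are made up):

	#define IA32_L3_CBM_BASE	0xc90	/* IA32_L3_MASK_n, one per CLOSid */
	#define MSR_IA32_PQR_ASSOC	0x0c8f	/* CLOSid lives in the upper 32 bits */

	/* Let CLOSid 1 fill only the cache portion covered by mask 0xf: */
	wrmsrl(IA32_L3_CBM_BASE + 1, 0xf);

	/*
	 * Associate the task running on this CPU with CLOSid 1; the low
	 * word keeps the RMID used by cache monitoring:
	 */
	wrmsr(MSR_IA32_PQR_ASSOC, rmid, 1);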
In today's new processors the number of cores is continuously increasing,
which in turn increases the number of threads or workloads that can be
run simultaneously. When multi-threaded applications run concurrently,
they compete for shared resources including the L3 cache. At times, this
L3 cache resource contention may result in inefficient space utilization:
for example, a higher priority thread may end up with less L3 cache, or a
cache sensitive app may not get optimal cache occupancy, thereby
degrading performance. The Cache Allocation kernel patches provide a
framework for sharing the L3 cache so that users can allocate the
resource according to set requirements.
More information about the feature can be found in the Intel SDM, Volume
3, section 17.15. The SDM does not use the 'RDT' term yet; that is
planned to change at a later time.
*All the patches apply on top of tip/perf/core*.
Changes in v12:
- From Matt's feedback, replaced the function-scope static cpumask_t tmp
  used at multiple locations with a single file-scope static cpumask_t
  tmp_cpumask. This is a temporary mask used during handling of hot cpu
  notifications in the cqm/rapl and rdt code (1/9, 2/9 and 8/9). Although
  all the usage was serialized by hot cpu locking, this makes it more
  readable.
Changes in V11: As per feedback from Thomas and discussions:
- removed cpumask_any_online_but. Its usage could easily be replaced by
  'and'ing with the cpu_online mask during hot cpu notifications. Thomas
  pointed out that the API had an issue where the tmp mask wasn't thread
  safe. I realized the support it intends to give does not seem to match
  the others in cpumask.h.
- the cqm patch which added a mutex to hot cpu notification had been
  merged with the cqm hot plug patch (to improve notification handling)
  without commit logs and wasn't correct. Separated them; just sending
  the cqm hot plug patch now and will send the mutex cqm patch
  separately.
- fixed issues in the hot cpu rdt handling. Since cpu_starting was
  replaced with cpu_online, the wrmsr now needs to actually be scheduled
  on the target cpu, which the previous patch wasn't doing. Replaced
  cpu_dead with cpu_down_prepare; cpu_down_failed is handled the same
  way as cpu_online. By waiting until cpu_dead to update the
  rdt_cpumask, we may miss some of the msr updates.
Changes in V10:
- changed the hot cpu notifications we handle in cqm and cache allocation
  to cpu_online and cpu_dead, and removed the others, as the
  cpu_*_prepare notifications also had corresponding cancel notifications
  which we did not handle.
- changed the file in the rdt cgroup to l3_cache_mask to represent that
  it's for the l3 cache.
Changes as per Thomas and PeterZ feedback:
- fixed the cpumask declarations in cpumask.h and the rdt, cmt and rapl
  code to be static so that they don't burden stack space when large.
- removed the mutex in cpu_starting notifications; replaced the locking
  with cpu_online.
- changed the name from hsw_probetest to cache_alloc_hsw_probe.
- changed x86_rdt_max_closid to x86_cache_max_closid and
  x86_rdt_max_cbm_len to x86_cache_max_cbm_len as they are only related
  to cache allocation and not to all of rdt.
Changes in V9:
Changes made as per Thomas feedback:
- added a comment where we call the schedule-in code only when RDT is
  enabled.
- reordered the local declarations to follow the convention in
  intel_cqm_xchg_rmid.
Changes in V8: Thanks to feedback from Thomas; the following changes were
made based on his feedback:
Generic changes/Preparatory patches:
- added a new cpumask_any_online_but which returns the next core sibling
  that is online.
- made changes in the Intel Cache monitoring and Intel RAPL (Running
  Average Power Limit) code to use the new function above to find the
  next cpu that can be a designated reader for the package. Also changed
  the way the package masks are computed, which can be simplified using
  topology_core_cpumask.
Cache allocation specific changes:
- moved the documentation to the beginning of the patch series.
- added more documentation for the rdt cgroup files.
- changed the dmesg output when cache alloc is enabled to be more helpful
  and updated a few other comments to be more readable.
- removed the __ prefix from functions like clos_get which were not
  following convention.
- added code to take action on a WARN_ON in clos_put. Made a few other
  changes to reduce code text.
- updated the comments for the call to rdt_css_alloc and the data
  structures to be more readable/kernel-doc format.
- removed cgroup_init.
- changed function names so that only external APIs have the intel_
  prefix.
- replaced (void *)&closid with (void *)closid when calling
  on_each_cpu_mask.
- fixed the reference release of the closid during the cache bitmask
  write.
- changed the code to not ignore a cache mask which has bits set outside
  of the max bits allowed. It returns an error instead.
- replaced bitmap_set(&max_mask, 0, max_cbm_len) with max_mask =
  (1ULL << max_cbm) - 1.
- update the rdt_cpu_mask, which has one cpu for each package, using
  topology_core_cpumask instead of looping through the existing
  rdt_cpu_mask. Realized the topology_core_cpumask name is misleading as
  it actually returns the cores in a cpu package!
- arranged the code better to keep code relating to similar tasks
  together.
- improved searching for the next online cpu sibling and maintaining the
  rdt_cpu_mask which has one cpu per package.
- removed the unnecessary wrapper rdt_enabled.
- removed the unnecessary spin lock and rcu lock in the scheduling code.
- merged all scheduling code into one patch, not separating the RDT
  common software cache code.
Changes in V7: Based on feedback from PeterZ and Matt and following
discussions:
- changed a lot of naming to reflect the data structures which are common
  to RDT and those specific to Cache allocation.
- removed all usage of 'cat'. Replaced it with the friendlier 'cache
  allocation'.
- fixed a lot of convention issues (whitespace, return paradigm etc.).
- changed the scheduling hook for RDT to not use an inline.
- removed adding a new scheduling hook and just reused the existing one,
  similar to the perf hook.
Changes in V6:
- rebased to 4.1-rc1 which has the CMT (cache monitoring) support
  included.
- (Thanks to Marcelo's feedback.) Fixed support for hot cpu handling of
  the IA32_L3_QOS MSRs. Although the MSRs need not be restored during
  deep C states, this is needed when a new package is physically added.
- some other coding convention changes, including renaming to cache_mask
  and using a refcnt to track the number of cgroups using a closid in
  the clos_cbm map.
- 1-bit cbm support for non-hsw SKUs. HSW is an exception which needs
  the cache bit masks to be at least 2 bits.
Changes in v5:
- Added support to propagate the cache bit mask update for each
package.
- Removed the cache bit mask reference in the intel_rdt structure as
there was no need for that and we already maintain a separate
closid<->cbm mapping.
- Made a few coding convention changes which include adding the
assertion while freeing the CLOSID.
Changes in V4:
- Integrated with the latest V5 CMT patches.
- Changed the naming of the cgroup to rdt (resource director technology)
  from cat (cache allocation technology). This was done as RDT is the
  umbrella term for platform shared resource allocation; hence in future
  it will be easier to add other resource allocation to the same cgroup.
- Naming changes also applied to a lot of other data structures/APIs.
- Added documentation on cgroup usage for cache allocation to address a
  lot of questions from academia and industry regarding cache allocation
  usage.
Changes in V3:
- Implements a common software cache for IA32_PQR_MSR.
- Implements support for hsw Cache Allocation enumeration. This does not
  use brand strings like the earlier version but does a probe test. The
  probe test is done only on the hsw family of processors.
- Made a few coding convention and naming changes.
- Check for the lock being held when CLOSid manipulation happens.
Changes in V2:
- Removed the HSW specific enumeration changes. Plan to include them
  later as a separate patch.
- Fixed the code in prep_arch_switch to be specific to x86 and removed
  the x86 defines.
- Fixed cbm_write to not write all 1s when a cgroup is freed.
- Fixed one possible memory leak in init.
- Changed some of the manual bitmap manipulation to use the predefined
  bitmap APIs to make the code more readable.
- Changed the name in the sources from cqe to cat.
- Changed the global cat enable flag to a static_key and disabled cgroup
  early_init.
[PATCH 1/9] x86/intel_cqm: Modify hot cpu notification handling
[PATCH 2/9] x86/intel_rapl: Modify hot cpu notification handling for
[PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup
[PATCH 4/9] x86/intel_rdt: Add support for Cache Allocation detection
[PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service
[PATCH 6/9] x86/intel_rdt: Add support for cache bit mask management
[PATCH 7/9] x86/intel_rdt: Implement scheduling support for Intel RDT
[PATCH 8/9] x86/intel_rdt: Hot cpu support for Cache Allocation
[PATCH 9/9] x86/intel_rdt: Intel haswell Cache Allocation enumeration
This patch modifies hot cpu notification handling in Intel cache
monitoring:
- to add a new cpu to the cqm_cpumask (which has one cpu per package)
  during cpu start, it uses the existing package<->core map instead of
  looping through all cpus in cqm_cpumask.
- to search for the next online sibling during cpu exit, it uses the
  same map instead of looping through all online cpus. In large systems
  with a large number of cpus, the time taken to loop may be expensive
  and also increases linearly.
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_cqm.c | 34 +++++++++++++++---------------
1 file changed, 17 insertions(+), 17 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 1880761..58bdefa 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -62,6 +62,12 @@ static LIST_HEAD(cache_groups);
*/
static cpumask_t cqm_cpumask;
+/*
+ * Temporary cpumask used during hot cpu notification handling. The usage
+ * is serialized by hot cpu locks.
+ */
+static cpumask_t tmp_cpumask;
+
#define RMID_VAL_ERROR (1ULL << 63)
#define RMID_VAL_UNAVAIL (1ULL << 62)
@@ -1236,15 +1242,13 @@ static struct pmu intel_cqm_pmu = {
static inline void cqm_pick_event_reader(int cpu)
{
- int phys_id = topology_physical_package_id(cpu);
- int i;
+ cpumask_and(&tmp_cpumask, &cqm_cpumask, topology_core_cpumask(cpu));
- for_each_cpu(i, &cqm_cpumask) {
- if (phys_id == topology_physical_package_id(i))
- return; /* already got reader for this socket */
- }
-
- cpumask_set_cpu(cpu, &cqm_cpumask);
+ /*
+ * Pick a reader if there isn't one already.
+ */
+ if (cpumask_empty(&tmp_cpumask))
+ cpumask_set_cpu(cpu, &cqm_cpumask);
}
static void intel_cqm_cpu_prepare(unsigned int cpu)
@@ -1262,7 +1266,6 @@ static void intel_cqm_cpu_prepare(unsigned int cpu)
static void intel_cqm_cpu_exit(unsigned int cpu)
{
- int phys_id = topology_physical_package_id(cpu);
int i;
/*
@@ -1271,15 +1274,12 @@ static void intel_cqm_cpu_exit(unsigned int cpu)
if (!cpumask_test_and_clear_cpu(cpu, &cqm_cpumask))
return;
- for_each_online_cpu(i) {
- if (i == cpu)
- continue;
+ cpumask_and(&tmp_cpumask, topology_core_cpumask(cpu), cpu_online_mask);
+ cpumask_clear_cpu(cpu, &tmp_cpumask);
+ i = cpumask_any(&tmp_cpumask);
- if (phys_id == topology_physical_package_id(i)) {
- cpumask_set_cpu(i, &cqm_cpumask);
- break;
- }
- }
+ if (i < nr_cpu_ids)
+ cpumask_set_cpu(i, &cqm_cpumask);
}
static int intel_cqm_cpu_notifier(struct notifier_block *nb,
--
1.9.1
This patch modifies the hot cpu notification handling in the
Intel Running Average Power Limit (RAPL) driver.
- to add a cpu reader to the rapl_cpu_mask (which has one cpu set per
  package), it uses the existing package<->core map instead of looping
  through all cpus in rapl_cpu_mask.
- to search for the next online sibling during hot cpu exit, it uses the
  same mapping instead of looping through all online cpus. In large
  systems with a large number of cpus, the time taken to loop may be
  expensive and also increases linearly.
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_rapl.c | 35 ++++++++++++++---------------
1 file changed, 17 insertions(+), 18 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
index 358c54a..c5ab686 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_rapl.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
@@ -132,6 +132,12 @@ static struct pmu rapl_pmu_class;
static cpumask_t rapl_cpu_mask;
static int rapl_cntr_mask;
+/*
+ * Temporary cpumask used during hot cpu notification handling. The usage
+ * is serialized by hot cpu locks.
+ */
+static cpumask_t tmp_cpumask;
+
static DEFINE_PER_CPU(struct rapl_pmu *, rapl_pmu);
static DEFINE_PER_CPU(struct rapl_pmu *, rapl_pmu_to_free);
@@ -524,18 +530,16 @@ static struct pmu rapl_pmu_class = {
static void rapl_cpu_exit(int cpu)
{
struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
- int i, phys_id = topology_physical_package_id(cpu);
int target = -1;
+ int i;
/* find a new cpu on same package */
- for_each_online_cpu(i) {
- if (i == cpu)
- continue;
- if (phys_id == topology_physical_package_id(i)) {
- target = i;
- break;
- }
- }
+ cpumask_and(&tmp_cpumask, topology_core_cpumask(cpu), cpu_online_mask);
+ cpumask_clear_cpu(cpu, &tmp_cpumask);
+ i = cpumask_any(&tmp_cpumask);
+ if (i < nr_cpu_ids)
+ target = i;
+
/*
* clear cpu from cpumask
* if was set in cpumask and still some cpu on package,
@@ -557,15 +561,10 @@ static void rapl_cpu_exit(int cpu)
static void rapl_cpu_init(int cpu)
{
- int i, phys_id = topology_physical_package_id(cpu);
-
- /* check if phys_is is already covered */
- for_each_cpu(i, &rapl_cpu_mask) {
- if (phys_id == topology_physical_package_id(i))
- return;
- }
- /* was not found, so add it */
- cpumask_set_cpu(cpu, &rapl_cpu_mask);
+	/* check if cpu's package is already covered. If not, add it. */
+ cpumask_and(&tmp_cpumask, &rapl_cpu_mask, topology_core_cpumask(cpu));
+ if (cpumask_empty(&tmp_cpumask))
+ cpumask_set_cpu(cpu, &rapl_cpu_mask);
}
static __init void rapl_hsw_server_quirk(void)
--
1.9.1
Adds a description of the Cache Allocation technology, an overview of
the kernel implementation and the usage of the Cache Allocation cgroup
interface.
Cache allocation is a sub-feature of Resource Director Technology (RDT)
Allocation or Platform Shared resource control which provides support to
control Platform shared resources like the L3 cache. Currently the L3
cache is the only resource that is supported in RDT. More information
can be found in the Intel SDM, Volume 3, section 17.15.
Cache Allocation Technology provides a way for the Software (OS/VMM)
to restrict cache allocation to a defined 'subset' of the cache which
may be overlapping with other 'subsets'. This feature is used when
allocating a line in the cache, ie when pulling new data into the cache.
Signed-off-by: Vikas Shivappa <[email protected]>
---
Documentation/cgroups/rdt.txt | 215 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 215 insertions(+)
create mode 100644 Documentation/cgroups/rdt.txt
diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
new file mode 100644
index 0000000..dfff477
--- /dev/null
+++ b/Documentation/cgroups/rdt.txt
@@ -0,0 +1,215 @@
+ RDT
+ ---
+
+Copyright (C) 2014 Intel Corporation
+Written by [email protected]
+(based on contents and format from cpusets.txt)
+
+CONTENTS:
+=========
+
+1. Cache Allocation Technology
+ 1.1 What is RDT and Cache allocation ?
+ 1.2 Why is Cache allocation needed ?
+ 1.3 Cache allocation implementation overview
+ 1.4 Assignment of CBM and CLOS
+ 1.5 Scheduling and Context Switch
+2. Usage Examples and Syntax
+
+1. Cache Allocation Technology (Cache allocation)
+=================================================
+
+1.1 What is RDT and Cache allocation
+------------------------------------
+
+Cache allocation is a sub-feature of Resource Director Technology(RDT)
+Allocation or Platform Shared resource control which provides support to
+control Platform shared resources like L3 cache. Currently L3 Cache is
+the only resource that is supported in RDT. More information can be
+found in the Intel SDM, Volume 3, section 17.15.
+
+Cache Allocation Technology provides a way for the Software (OS/VMM)
+to restrict cache allocation to a defined 'subset' of the cache which
+may be overlapping with other 'subsets'. This feature is used when
+allocating a line in the cache, ie when pulling new data into the cache.
+The h/w is programmed via MSRs.
+
+The different cache subsets are identified by CLOS identifier (class
+of service) and each CLOS has a CBM (cache bit mask). The CBM is a
+contiguous set of bits which defines the amount of cache resource that
+is available for each 'subset'.
+
+1.2 Why is Cache allocation needed
+----------------------------------
+
+In today's new processors the number of cores is continuously
+increasing, especially in large scale usage models where VMs are used,
+like webservers and datacenters. The number of cores increases the
+number of threads or workloads that can be run simultaneously. When
+multi-threaded applications, VMs and workloads run concurrently, they
+compete for shared resources including the L3 cache.
+
+Cache allocation enables more cache resources to be made available for
+higher priority applications based on guidance from the execution
+environment.
+
+The architecture also allows dynamically changing these subsets during
+runtime to further optimize the performance of the higher priority
+application with minimal degradation to the low priority app.
+Additionally, resources can be rebalanced for system throughput benefit.
+
+This technique may be useful in managing large computer systems with a
+large L3 cache. Examples may be large servers running instances of
+webservers or database servers. In such complex systems, these subsets
+can be used for more careful placing of the available cache
+resources.
+
+1.3 Cache allocation implementation overview
+--------------------------------------------
+
+The kernel implements a cgroup subsystem to support cache allocation.
+
+Each cgroup has a CLOSid <-> CBM (cache bit mask) mapping.
+A CLOS (Class of service) is represented by a CLOSid. The CLOSid is
+internal to the kernel and not exposed to the user. Each cgroup has
+one CBM and represents just one cache 'subset'.
+
+The cgroup follows the cgroup hierarchy; mkdir and adding tasks to the
+cgroup never fail. When a child cgroup is created it inherits the
+CLOSid and the CBM from its parent. When a user changes the default
+CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
+used before. Changing the 'l3_cache_mask' may fail with -ENOSPC once
+the kernel runs out of the maximum number of CLOSids it can support.
+Users can create as many cgroups as they want, but having different
+CBMs at the same time is restricted by the maximum number of CLOSids
+(multiple cgroups can have the same CBM).
+The kernel maintains a CLOSid <-> CBM mapping which keeps a reference
+counter for each cgroup using a CLOSid.
+
+The tasks in the cgroup would get to fill the L3 cache represented by
+the cgroup's 'l3_cache_mask' file.
+
+The root directory has all available bits set in its 'l3_cache_mask'
+file by default.
+
+Each RDT cgroup directory has the following files. Some of them may be a
+part of common RDT framework or be specific to RDT sub-features like
+cache allocation.
+
+ - intel_rdt.l3_cache_mask: The cache bitmask (CBM) is represented by
+ this file. The bitmask must be contiguous and has a minimum length of
+ 1 bit (2 bits on hsw SKUs).
+
+1.4 Assignment of CBM and CLOS
+------------------------------
+
+The 'l3_cache_mask' needs to be a subset of the parent node's
+'l3_cache_mask'. Any contiguous subset of these bits (with a minimum of
+2 bits on hsw SKUs) may be set to indicate the cache mapping desired.
+The 'l3_cache_mask' between 2 directories can overlap. The
+'l3_cache_mask' represents the cache 'subset' of the Cache allocation
+cgroup. For example, on a system with a maximum cbm of 16 bits, if the
+directory has the least significant 4 bits set in its 'l3_cache_mask'
+file (meaning the 'l3_cache_mask' is just 0xf), it is allocated the
+right quarter of the last level cache, which means the tasks belonging
+to this Cache allocation cgroup can use the right quarter of the cache
+to fill. If it has the most significant 8 bits set, it is allocated the
+left half of the cache (8 bits out of 16 represents 50%).
+
+The cache portion defined in the CBM file is available to all tasks
+within the cgroup to fill, and these tasks are not allowed to allocate
+space in other parts of the cache.
+
+1.5 Scheduling and Context Switch
+---------------------------------
+
+During context switch the kernel implements this by writing the
+CLOSid (internally maintained by the kernel) of the cgroup to which the
+task belongs to the CPU's IA32_PQR_ASSOC MSR. The MSR is only written
+when there is a change in the CLOSid for the CPU, in order to minimize
+the latency incurred during context switch.
+
+The following considerations are made for the PQR MSR write so that it
+has minimal impact on the scheduling hot path:
+- This path doesn't exist on any non-Intel platforms.
+- On Intel platforms, this would not exist by default unless CGROUP_RDT
+is enabled.
+- It remains a no-op when CGROUP_RDT is enabled and the Intel hardware
+does not support the feature.
+- When the feature is available, it still remains a no-op till the user
+manually creates a cgroup *and* assigns a new cache mask. Since the
+child node inherits the parent's cache mask, merely creating a cgroup
+causes no scheduling hot path impact.
+- Per-cpu PQR values are cached and the MSR write is only done when a
+task with a different PQR is scheduled on the CPU. Typically, if the
+task groups are bound to be scheduled on a set of CPUs, the number of
+MSR writes is greatly reduced.
+
+2. Usage examples and syntax
+============================
+
+To check if Cache allocation was enabled on your system:
+
+ dmesg | grep -i intel_rdt
+
+should output: intel_rdt: Max bitmask length: xx, Max ClosIds: xx
+(the length of the l3_cache_mask and the number of CLOSids depend on the
+system you use). Also /proc/cpuinfo would have rdt (if rdt is enabled)
+and cat_l3 (if L3 cache allocation is enabled).
+
+The following would mount the cache allocation cgroup subsystem and
+create 2 directories. Please refer to Documentation/cgroups/cgroups.txt
+for details about how to use cgroups.
+
+ cd /sys/fs/cgroup
+ mkdir rdt
+ mount -t cgroup -ointel_rdt intel_rdt /sys/fs/cgroup/rdt
+ cd rdt
+
+Create 2 rdt cgroups
+
+ mkdir group1
+ mkdir group2
+
+Following are some of the files in the directory:
+
+ ls
+ intel_rdt.l3_cache_mask
+ tasks
+
+Say the cache is 2MB and the cbm supports 16 bits. Then the setting
+below allocates the 'right quarter' (512KB) of the cache to group2.
+
+Edit the CBM for group2 to set the least significant 4 bits. This
+allocates the 'right quarter' of the cache.
+
+ cd group2
+ /bin/echo 0xf > intel_rdt.l3_cache_mask
+
+
+Edit the CBM for group2 to set the least significant 8 bits. This
+allocates the right half of the cache to 'group2'.
+
+ cd group2
+ /bin/echo 0xff > intel_rdt.l3_cache_mask
+
+Assign tasks to the group2
+
+ /bin/echo PID1 > tasks
+ /bin/echo PID2 > tasks
+
+Now threads PID1 and PID2 get to fill the 'right half' of the cache,
+as they belong to cgroup group2.
+
+Create a group under group2
+
+ cd group2
+ mkdir group21
+ cat intel_rdt.l3_cache_mask
+ 0xff - inherits the parent's mask.
+
+ /bin/echo 0xfff > intel_rdt.l3_cache_mask - throws an error as the
+ mask has to be a subset of the parent's mask
+
+In order to restrict RDT cgroups to a specific set of CPUs, rdt can be
+co-mounted with cpusets.
--
1.9.1
This patch adds support for the Cache Allocation Technology feature
found in future Intel Xeon processors. Cache allocation is a sub-feature
of Intel Resource Director Technology (RDT), which enables control over
the sharing of processor resources. This patch includes CPUID
enumeration routines for Cache allocation and adds new values to the
cpuinfo_x86 structure to track the resources.
Cache allocation provides a way for the Software (OS/VMM) to restrict
cache allocation to a defined 'subset' of the cache which may be
overlapping with other 'subsets'. This feature is used when allocating a
line in the cache, ie when pulling new data into the cache. The hardware
is programmed via MSRs (model specific registers).
More information about Cache allocation can be found in the Intel(R) x86
Architecture Software Developer Manual, Volume 3, section 17.15.
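As an illustration only (not part of the patch), the same enumeration
can be probed from userspace; the +1 adjustments mirror what the hunk in
common.c below does with the raw register values:

	#include <stdio.h>
	#include <cpuid.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* Leaf 0x10, sub-leaf 0: EBX bit 1 => L3 Cache Allocation */
		if (!__get_cpuid_count(0x10, 0, &eax, &ebx, &ecx, &edx) ||
		    !(ebx & (1 << 1))) {
			printf("L3 cache allocation not enumerated\n");
			return 1;
		}

		/* Sub-leaf 1 gives the L3 mask length and CLOSid count */
		__get_cpuid_count(0x10, 1, &eax, &ebx, &ecx, &edx);
		printf("max cbm len: %u, max closids: %u\n", eax + 1, edx + 1);
		return 0;
	}

(__get_cpuid_count needs a reasonably recent gcc/clang cpuid.h.)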
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/include/asm/cpufeature.h | 6 +++++-
arch/x86/include/asm/processor.h | 3 +++
arch/x86/kernel/cpu/Makefile | 1 +
arch/x86/kernel/cpu/common.c | 15 +++++++++++++++
arch/x86/kernel/cpu/intel_rdt.c | 40 +++++++++++++++++++++++++++++++++++++++
init/Kconfig | 11 +++++++++++
6 files changed, 75 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kernel/cpu/intel_rdt.c
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 3d6606f..ae5ae9d 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -12,7 +12,7 @@
#include <asm/disabled-features.h>
#endif
-#define NCAPINTS 13 /* N 32-bit words worth of info */
+#define NCAPINTS 14 /* N 32-bit words worth of info */
#define NBUGINTS 1 /* N 32-bit bug flags */
/*
@@ -229,6 +229,7 @@
#define X86_FEATURE_RTM ( 9*32+11) /* Restricted Transactional Memory */
#define X86_FEATURE_CQM ( 9*32+12) /* Cache QoS Monitoring */
#define X86_FEATURE_MPX ( 9*32+14) /* Memory Protection Extension */
+#define X86_FEATURE_RDT ( 9*32+15) /* Resource Allocation */
#define X86_FEATURE_AVX512F ( 9*32+16) /* AVX-512 Foundation */
#define X86_FEATURE_RDSEED ( 9*32+18) /* The RDSEED instruction */
#define X86_FEATURE_ADX ( 9*32+19) /* The ADCX and ADOX instructions */
@@ -252,6 +253,9 @@
/* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
#define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */
+/* Intel-defined CPU features, CPUID level 0x00000010:0 (ebx), word 13 */
+#define X86_FEATURE_CAT_L3 (13*32 + 1) /* Cache Allocation L3 */
+
/*
* BUG word(s)
*/
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 23ba676..16ca766 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -114,6 +114,9 @@ struct cpuinfo_x86 {
int x86_cache_occ_scale; /* scale to bytes */
int x86_power;
unsigned long loops_per_jiffy;
+ /* Cache Allocation values: */
+ u16 x86_cache_max_cbm_len;
+ u16 x86_cache_max_closid;
/* cpuid returned max cores value: */
u16 x86_max_cores;
u16 apicid;
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 9bff687..4ff7a1f 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -48,6 +48,7 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += perf_event_intel_uncore.o \
perf_event_intel_uncore_nhmex.o
endif
+obj-$(CONFIG_CGROUP_RDT) += intel_rdt.o
obj-$(CONFIG_X86_MCE) += mcheck/
obj-$(CONFIG_MTRR) += mtrr/
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index a62cf04..fd014f5 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -670,6 +670,21 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
}
}
+ /* Additional Intel-defined flags: level 0x00000010 */
+ if (c->cpuid_level >= 0x00000010) {
+ u32 eax, ebx, ecx, edx;
+
+ cpuid_count(0x00000010, 0, &eax, &ebx, &ecx, &edx);
+ c->x86_capability[13] = ebx;
+
+ if (cpu_has(c, X86_FEATURE_CAT_L3)) {
+
+ cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx);
+ c->x86_cache_max_closid = edx + 1;
+ c->x86_cache_max_cbm_len = eax + 1;
+ }
+ }
+
/* AMD-defined flags: level 0x80000001 */
xlvl = cpuid_eax(0x80000000);
c->extended_cpuid_level = xlvl;
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
new file mode 100644
index 0000000..3cd6db6
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -0,0 +1,40 @@
+/*
+ * Resource Director Technology(RDT)
+ * - Cache Allocation code.
+ *
+ * Copyright (C) 2014 Intel Corporation
+ *
+ * 2015-05-25 Written by
+ * Vikas Shivappa <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * More information about RDT can be found in the Intel (R) x86 Architecture
+ * Software Developer Manual, volume 3, section 17.15.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/slab.h>
+#include <linux/err.h>
+
+static int __init intel_rdt_late_init(void)
+{
+ struct cpuinfo_x86 *c = &boot_cpu_data;
+
+ if (!cpu_has(c, X86_FEATURE_CAT_L3))
+ return -ENODEV;
+
+ pr_info("Intel cache allocation enabled\n");
+
+ return 0;
+}
+
+late_initcall(intel_rdt_late_init);
diff --git a/init/Kconfig b/init/Kconfig
index 81050e4..203f116 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -983,6 +983,17 @@ config CPUSETS
Say N if unsure.
+config CGROUP_RDT
+ bool "Resource Director Technology cgroup subsystem"
+ depends on X86_64 && CPU_SUP_INTEL
+ help
+	  This option provides a cgroup to allocate Platform shared
+	  resources. Among the shared resources, the current implementation
+	  focuses on the L3 Cache. Using the interface the user can specify
+	  the amount of L3 cache space into which an application can fill.
+
+ Say N if unsure.
+
config PROC_PID_CPUSET
bool "Include legacy /proc/<pid>/cpuset file"
depends on CPUSETS
--
1.9.1
This patch adds a cgroup subsystem for the Intel Resource Director
Technology (RDT) feature and Class of service (CLOSid) management, which
is part of the common RDT framework. This cgroup would eventually be
used by all sub-features of RDT and hence is associated with the common
RDT framework as well as the sub-feature specific frameworks. However,
the current patch series only adds cache allocation sub-feature specific
code.
When a cgroup directory is created it has a CLOSid associated with it
which is inherited from its parent. The CLOSid is mapped to a cache_mask
which represents the L3 cache allocation for the cgroup. Tasks belonging
to the cgroup get to fill the cache represented by the cache_mask.
The CLOSid is internal to the kernel and not exposed to the user. The
kernel uses several ways to optimize the allocation of CLOSids, so
exposing the available CLOSids may actually provide wrong information to
users as it may change dynamically depending on usage.
CLOSid allocation is tracked using a separate bitmap. The maximum number
of CLOSids is specified by the h/w during CPUID enumeration and the
kernel simply returns -ENOSPC when it runs out of CLOSids. Each
cache_mask (CBM) has an associated CLOSid. However, if multiple cgroups
have the same cache mask they also have the same CLOSid. The reference
count in the CLOSid<->CBM map keeps track of how many cgroups are using
each CLOSid<->CBM mapping.
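As an illustration of the reuse (hypothetical mask values; the real
helpers are in the hunk below):

	/*
	 * Two cgroups writing the same CBM share a CLOSid; the reference
	 * count in the CLOSid<->CBM map tracks the sharing:
	 *
	 *   mkdir A; write 0xf  to A -> CLOSid 1 allocated, refcnt 1
	 *   mkdir B; write 0xf  to B -> CLOSid 1 reused,    refcnt 2
	 *   write 0xff to B          -> CLOSid 2 allocated, CLOSid 1 refcnt 1
	 *   rmdir A                  -> CLOSid 1 refcnt 0, freed for reuse
	 */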
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/include/asm/intel_rdt.h | 36 +++++++++++
arch/x86/kernel/cpu/intel_rdt.c | 132 ++++++++++++++++++++++++++++++++++++++-
include/linux/cgroup_subsys.h | 4 ++
3 files changed, 170 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/include/asm/intel_rdt.h
diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
new file mode 100644
index 0000000..2ce3e2c
--- /dev/null
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -0,0 +1,36 @@
+#ifndef _RDT_H_
+#define _RDT_H_
+
+#ifdef CONFIG_CGROUP_RDT
+
+#include <linux/cgroup.h>
+
+struct rdt_subsys_info {
+ unsigned long *closmap;
+};
+
+struct intel_rdt {
+ struct cgroup_subsys_state css;
+ u32 closid;
+};
+
+struct clos_cbm_map {
+ unsigned long cache_mask;
+ unsigned int clos_refcnt;
+};
+
+/*
+ * Return rdt group corresponding to this container.
+ */
+static inline struct intel_rdt *css_rdt(struct cgroup_subsys_state *css)
+{
+ return css ? container_of(css, struct intel_rdt, css) : NULL;
+}
+
+static inline struct intel_rdt *parent_rdt(struct intel_rdt *ir)
+{
+ return css_rdt(ir->css.parent);
+}
+
+#endif
+#endif
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 3cd6db6..b13dd57 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -24,17 +24,145 @@
#include <linux/slab.h>
#include <linux/err.h>
+#include <linux/spinlock.h>
+#include <asm/intel_rdt.h>
+
+/*
+ * ccmap maintains 1:1 mapping between CLOSid and cache bitmask.
+ */
+static struct clos_cbm_map *ccmap;
+static struct rdt_subsys_info rdtss_info;
+static DEFINE_MUTEX(rdt_group_mutex);
+struct intel_rdt rdt_root_group;
+
+static inline void closid_get(u32 closid)
+{
+ struct clos_cbm_map *ccm = &ccmap[closid];
+
+ lockdep_assert_held(&rdt_group_mutex);
+
+ ccm->clos_refcnt++;
+}
+
+static int closid_alloc(struct intel_rdt *ir)
+{
+ u32 maxid;
+ u32 id;
+
+ lockdep_assert_held(&rdt_group_mutex);
+
+ maxid = boot_cpu_data.x86_cache_max_closid;
+ id = find_next_zero_bit(rdtss_info.closmap, maxid, 0);
+ if (id == maxid)
+ return -ENOSPC;
+
+ set_bit(id, rdtss_info.closmap);
+ closid_get(id);
+ ir->closid = id;
+
+ return 0;
+}
+
+static inline void closid_free(u32 closid)
+{
+ clear_bit(closid, rdtss_info.closmap);
+ ccmap[closid].cache_mask = 0;
+}
+
+static inline void closid_put(u32 closid)
+{
+ struct clos_cbm_map *ccm = &ccmap[closid];
+
+ lockdep_assert_held(&rdt_group_mutex);
+ if (WARN_ON(!ccm->clos_refcnt))
+ return;
+
+ if (!--ccm->clos_refcnt)
+ closid_free(closid);
+}
+
+static struct cgroup_subsys_state *
+intel_rdt_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+ struct intel_rdt *parent = css_rdt(parent_css);
+ struct intel_rdt *ir;
+
+ /*
+ * cgroup_init cannot handle failures gracefully.
+	 * Always return rdt_root_group.css instead of failure, even
+	 * when Cache allocation is not supported.
+ */
+ if (!parent)
+ return &rdt_root_group.css;
+
+ ir = kzalloc(sizeof(struct intel_rdt), GFP_KERNEL);
+ if (!ir)
+ return ERR_PTR(-ENOMEM);
+
+ mutex_lock(&rdt_group_mutex);
+ ir->closid = parent->closid;
+ closid_get(ir->closid);
+ mutex_unlock(&rdt_group_mutex);
+
+ return &ir->css;
+}
+
+static void intel_rdt_css_free(struct cgroup_subsys_state *css)
+{
+ struct intel_rdt *ir = css_rdt(css);
+
+ mutex_lock(&rdt_group_mutex);
+ closid_put(ir->closid);
+ kfree(ir);
+ mutex_unlock(&rdt_group_mutex);
+}
static int __init intel_rdt_late_init(void)
{
struct cpuinfo_x86 *c = &boot_cpu_data;
+ static struct clos_cbm_map *ccm;
+ u32 maxid, max_cbm_len;
+ size_t sizeb;
+ int err = 0;
- if (!cpu_has(c, X86_FEATURE_CAT_L3))
+ if (!cpu_has(c, X86_FEATURE_CAT_L3)) {
+ rdt_root_group.css.ss->disabled = 1;
return -ENODEV;
+ }
+ maxid = c->x86_cache_max_closid;
+ max_cbm_len = c->x86_cache_max_cbm_len;
+
+ sizeb = BITS_TO_LONGS(maxid) * sizeof(long);
+ rdtss_info.closmap = kzalloc(sizeb, GFP_KERNEL);
+ if (!rdtss_info.closmap) {
+ err = -ENOMEM;
+ goto out_err;
+ }
+
+ sizeb = maxid * sizeof(struct clos_cbm_map);
+ ccmap = kzalloc(sizeb, GFP_KERNEL);
+ if (!ccmap) {
+ kfree(rdtss_info.closmap);
+ err = -ENOMEM;
+ goto out_err;
+ }
+
+ set_bit(0, rdtss_info.closmap);
+ rdt_root_group.closid = 0;
+ ccm = &ccmap[0];
+ ccm->cache_mask = (1ULL << max_cbm_len) - 1;
+ ccm->clos_refcnt = 1;
pr_info("Intel cache allocation enabled\n");
+out_err:
- return 0;
+ return err;
}
late_initcall(intel_rdt_late_init);
+
+struct cgroup_subsys intel_rdt_cgrp_subsys = {
+ .css_alloc = intel_rdt_css_alloc,
+ .css_free = intel_rdt_css_free,
+ .early_init = 0,
+};
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index e4a96fb..0339312 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -47,6 +47,10 @@ SUBSYS(net_prio)
SUBSYS(hugetlb)
#endif
+#if IS_ENABLED(CONFIG_CGROUP_RDT)
+SUBSYS(intel_rdt)
+#endif
+
/*
* The following subsystems are not supported on the default hierarchy.
*/
--
1.9.1
The change adds a file 'l3_cache_mask' to the RDT cgroup which
represents the cache bit mask (CBM) for the cgroup. l3_cache_mask is
specific to the Cache allocation sub-feature of RDT. The tasks in the
RDT cgroup get to fill the L3 cache represented by the cgroup's
l3_cache_mask file.
Updates to the CBM are done by writing to the IA32_L3_MASK_n MSRs. The
RDT cgroup follows the cgroup hierarchy; mkdir and adding tasks to the
cgroup never fail. When a child cgroup is created it inherits the CLOSid
and the cache mask from its parent. When a user changes the default CBM
for a cgroup, a new CLOSid may be allocated if the cache mask was not
used before. If the new CBM is one that is already in use, the reference
count for that CLOSid<->CBM mapping is incremented. Changing the
'l3_cache_mask' may fail with -ENOSPC once the kernel runs out of the
maximum number of CLOSids it can support.
Users can create as many cgroups as they want, but having different CBMs
at the same time is restricted by the maximum number of CLOSids. The
kernel maintains a CLOSid<->CBM mapping which keeps count of the cgroups
using a CLOSid.
Reuse of CLOSids for cgroups with the same bitmask also has the
following advantages:
- This helps to use the scant CLOSids optimally.
- It also implies that during context switch, the write to the PQR MSR
  is done only when a task with a different bitmask is scheduled in.
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/include/asm/intel_rdt.h | 3 +
arch/x86/kernel/cpu/intel_rdt.c | 205 ++++++++++++++++++++++++++++++++++++++-
2 files changed, 207 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 2ce3e2c..3ad426c 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -4,6 +4,9 @@
#ifdef CONFIG_CGROUP_RDT
#include <linux/cgroup.h>
+#define MAX_CBM_LENGTH 32
+#define IA32_L3_CBM_BASE 0xc90
+#define CBM_FROM_INDEX(x) (IA32_L3_CBM_BASE + x)
struct rdt_subsys_info {
unsigned long *closmap;
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index b13dd57..750b02a 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -34,6 +34,13 @@ static struct clos_cbm_map *ccmap;
static struct rdt_subsys_info rdtss_info;
static DEFINE_MUTEX(rdt_group_mutex);
struct intel_rdt rdt_root_group;
+/*
+ * Mask of CPUs for writing CBM values. We only need one CPU per-socket.
+ */
+static cpumask_t rdt_cpumask;
+
+#define rdt_for_each_child(pos_css, parent_ir) \
+ css_for_each_child((pos_css), &(parent_ir)->css)
static inline void closid_get(u32 closid)
{
@@ -117,13 +124,195 @@ static void intel_rdt_css_free(struct cgroup_subsys_state *css)
mutex_unlock(&rdt_group_mutex);
}
+static int intel_cache_alloc_cbm_read(struct seq_file *m, void *v)
+{
+ struct intel_rdt *ir = css_rdt(seq_css(m));
+
+ seq_printf(m, "%08lx\n", ccmap[ir->closid].cache_mask);
+
+ return 0;
+}
+
+static inline bool cbm_is_contiguous(unsigned long var)
+{
+ unsigned long maxcbm = MAX_CBM_LENGTH;
+ unsigned long first_bit, zero_bit;
+
+ if (!var)
+ return false;
+
+ first_bit = find_next_bit(&var, maxcbm, 0);
+ zero_bit = find_next_zero_bit(&var, maxcbm, first_bit);
+
+ if (find_next_bit(&var, maxcbm, zero_bit) < maxcbm)
+ return false;
+
+ return true;
+}
+
+static int cbm_validate(struct intel_rdt *ir, unsigned long cbmvalue)
+{
+ struct cgroup_subsys_state *css;
+ struct intel_rdt *par, *c;
+ unsigned long *cbm_tmp;
+ int err = 0;
+
+ if (!cbm_is_contiguous(cbmvalue)) {
+ pr_err("bitmask should have >= 1 bit and be contiguous\n");
+ err = -EINVAL;
+ goto out_err;
+ }
+
+ par = parent_rdt(ir);
+ cbm_tmp = &ccmap[par->closid].cache_mask;
+ if (!bitmap_subset(&cbmvalue, cbm_tmp, MAX_CBM_LENGTH)) {
+ err = -EINVAL;
+ goto out_err;
+ }
+
+ rcu_read_lock();
+ rdt_for_each_child(css, ir) {
+ c = css_rdt(css);
+ cbm_tmp = &ccmap[c->closid].cache_mask;
+ if (!bitmap_subset(cbm_tmp, &cbmvalue, MAX_CBM_LENGTH)) {
+ rcu_read_unlock();
+ pr_err("Children's mask not a subset\n");
+ err = -EINVAL;
+ goto out_err;
+ }
+ }
+ rcu_read_unlock();
+out_err:
+
+ return err;
+}
+
+static bool cbm_search(unsigned long cbm, u32 *closid)
+{
+ u32 maxid = boot_cpu_data.x86_cache_max_closid;
+ u32 i;
+
+ for (i = 0; i < maxid; i++) {
+ if (bitmap_equal(&cbm, &ccmap[i].cache_mask, MAX_CBM_LENGTH)) {
+ *closid = i;
+ return true;
+ }
+ }
+
+ return false;
+}
+
+static void closcbm_map_dump(void)
+{
+ u32 i;
+
+ pr_debug("CBMMAP\n");
+ for (i = 0; i < boot_cpu_data.x86_cache_max_closid; i++) {
+ pr_debug("cache_mask: 0x%x,clos_refcnt: %u\n",
+ (unsigned int)ccmap[i].cache_mask, ccmap[i].clos_refcnt);
+ }
+}
+
+static void cbm_cpu_update(void *info)
+{
+ u32 closid = (u32) info;
+
+ wrmsrl(CBM_FROM_INDEX(closid), ccmap[closid].cache_mask);
+}
+
+/*
+ * cbm_update_all() - Update the cache bit mask for all packages.
+ */
+static inline void cbm_update_all(u32 closid)
+{
+ on_each_cpu_mask(&rdt_cpumask, cbm_cpu_update, (void *)closid, 1);
+}
+
+/*
+ * intel_cache_alloc_cbm_write() - Validates and writes the
+ * cache bit mask(cbm) to the IA32_L3_MASK_n
+ * and also stores the same in the ccmap.
+ *
+ * CLOSids are reused for cgroups which have same bitmask.
+ * This helps to use the scant CLOSids optimally. This also
+ * implies that at context switch write to PQR-MSR is done
+ * only when a task with a different bitmask is scheduled in.
+ */
+static int intel_cache_alloc_cbm_write(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 cbmvalue)
+{
+ u32 max_cbm = boot_cpu_data.x86_cache_max_cbm_len;
+ struct intel_rdt *ir = css_rdt(css);
+ ssize_t err = 0;
+ u64 max_mask;
+ u32 closid;
+
+ if (ir == &rdt_root_group)
+ return -EPERM;
+
+ /*
+ * Need global mutex as cbm write may allocate a closid.
+ */
+ mutex_lock(&rdt_group_mutex);
+
+ max_mask = (1ULL << max_cbm) - 1;
+ if (cbmvalue & ~max_mask) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (cbmvalue == ccmap[ir->closid].cache_mask)
+ goto out;
+
+ err = cbm_validate(ir, cbmvalue);
+ if (err)
+ goto out;
+
+ /*
+ * Try to get a reference for a different CLOSid and release the
+ * reference to the current CLOSid.
+ * Need to put down the reference here and get it back in case we
+ * run out of closids. Otherwise we run into a problem when
+ * we could be using the last closid that could have been available.
+ */
+ closid_put(ir->closid);
+ if (cbm_search(cbmvalue, &closid)) {
+ ir->closid = closid;
+ closid_get(closid);
+ } else {
+ closid = ir->closid;
+ err = closid_alloc(ir);
+ if (err) {
+ closid_get(ir->closid);
+ goto out;
+ }
+
+ ccmap[ir->closid].cache_mask = cbmvalue;
+ cbm_update_all(ir->closid);
+ }
+ closcbm_map_dump();
+out:
+ mutex_unlock(&rdt_group_mutex);
+
+ return err;
+}
+
+static inline void rdt_cpumask_update(int cpu)
+{
+ static cpumask_t tmp;
+
+ cpumask_and(&tmp, &rdt_cpumask, topology_core_cpumask(cpu));
+ if (cpumask_empty(&tmp))
+ cpumask_set_cpu(cpu, &rdt_cpumask);
+}
+
static int __init intel_rdt_late_init(void)
{
struct cpuinfo_x86 *c = &boot_cpu_data;
static struct clos_cbm_map *ccm;
u32 maxid, max_cbm_len;
+ int err = 0, i;
size_t sizeb;
- int err = 0;
if (!cpu_has(c, X86_FEATURE_CAT_L3)) {
rdt_root_group.css.ss->disabled = 1;
@@ -153,6 +342,9 @@ static int __init intel_rdt_late_init(void)
ccm->cache_mask = (1ULL << max_cbm_len) - 1;
ccm->clos_refcnt = 1;
+ for_each_online_cpu(i)
+ rdt_cpumask_update(i);
+
pr_info("Intel cache allocation enabled\n");
out_err:
@@ -161,8 +353,19 @@ out_err:
late_initcall(intel_rdt_late_init);
+static struct cftype rdt_files[] = {
+ {
+ .name = "l3_cache_mask",
+ .seq_show = intel_cache_alloc_cbm_read,
+ .write_u64 = intel_cache_alloc_cbm_write,
+ .mode = 0666,
+ },
+ { } /* terminate */
+};
+
struct cgroup_subsys intel_rdt_cgrp_subsys = {
.css_alloc = intel_rdt_css_alloc,
.css_free = intel_rdt_css_free,
+ .legacy_cftypes = rdt_files,
.early_init = 0,
};
--
1.9.1
Adds support for IA32_PQR_ASSOC MSR writes during task scheduling. For
Cache Allocation, the MSR write lets the task fill the cache 'subset'
represented by the cgroup's cache_mask.
The high 32 bits in the per-processor MSR IA32_PQR_ASSOC represent the
CLOSid. During context switch the kernel implements this by writing the
CLOSid of the cgroup to which the task belongs to the CPU's
IA32_PQR_ASSOC MSR.
This patch also implements a common software cache for the IA32_PQR_MSR
(RMID bits 0:9, CLOSid bits 32:63) to be used by both Cache
monitoring (CMT) and Cache allocation. CMT updates the RMID whereas
cache_alloc updates the CLOSid in the software cache. During scheduling,
when the new RMID/CLOSid value is different from the cached values,
IA32_PQR_MSR is updated. Since the measured rdmsr latency for
IA32_PQR_MSR is very high (~250 cycles), this software cache is
necessary to avoid reading the MSR just to compare the current CLOSid
value.
The following considerations are made for the PQR MSR write so that it
minimally impacts the scheduler hot path:
- This path does not exist on any non-Intel platforms.
- On Intel platforms, this would not exist by default unless CGROUP_RDT
  is enabled.
- It remains a no-op when CGROUP_RDT is enabled and the Intel SKU does
  not support the feature.
- When the feature is available and enabled, no MSR write is done till
  the user manually creates a cgroup directory *and* assigns a
  cache_mask different from that of the root cgroup directory. Since the
  child node inherits the parent's cache mask, merely creating a cgroup
  causes no scheduling hot path impact.
- The MSR write is only done when a task with a different CLOSid is
  scheduled on the CPU. Typically, if the task groups are bound to be
  scheduled on a set of CPUs, the number of MSR writes is greatly
  reduced.
- A per-CPU cache of CLOSids is maintained to do the check so that we
  don't have to do an rdmsr, which actually costs a lot of cycles.
- For cgroup directories having the same cache_mask, the CLOSids are
  reused. This minimizes the number of CLOSids used and hence reduces
  the MSR write frequency.
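For reference, a minimal sketch of the hot path check described above
(it mirrors __intel_rdt_sched_in in the hunk below; note that
wrmsr(msr, lo, hi) writes the 64-bit value (closid << 32) | rmid):

	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
	struct intel_rdt *ir = task_rdt(current);

	if (ir->closid != state->closid) {
		/* low word: cached RMID (CMT), high word: new CLOSid */
		wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, ir->closid);
		state->closid = ir->closid;
	}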
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/include/asm/intel_rdt.h | 45 ++++++++++++++++++++++++++++++
arch/x86/include/asm/rdt_common.h | 25 +++++++++++++++++
arch/x86/include/asm/switch_to.h | 3 ++
arch/x86/kernel/cpu/intel_rdt.c | 17 +++++++++++
arch/x86/kernel/cpu/perf_event_intel_cqm.c | 26 ++---------------
5 files changed, 93 insertions(+), 23 deletions(-)
create mode 100644 arch/x86/include/asm/rdt_common.h
diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 3ad426c..78df3d7 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -4,10 +4,16 @@
#ifdef CONFIG_CGROUP_RDT
#include <linux/cgroup.h>
+#include <asm/rdt_common.h>
+
#define MAX_CBM_LENGTH 32
#define IA32_L3_CBM_BASE 0xc90
#define CBM_FROM_INDEX(x) (IA32_L3_CBM_BASE + x)
+DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
+extern struct static_key rdt_enable_key;
+extern void __intel_rdt_sched_in(void);
+
struct rdt_subsys_info {
unsigned long *closmap;
};
@@ -35,5 +41,44 @@ static inline struct intel_rdt *parent_rdt(struct intel_rdt *ir)
return css_rdt(ir->css.parent);
}
+/*
+ * Return rdt group to which this task belongs.
+ */
+static inline struct intel_rdt *task_rdt(struct task_struct *task)
+{
+ return css_rdt(task_css(task, intel_rdt_cgrp_id));
+}
+
+/*
+ * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
+ *
+ * Following considerations are made so that this has minimal impact
+ * on scheduler hot path:
+ * - This will stay as no-op unless we are running on an Intel SKU
+ * which supports L3 cache allocation.
+ * - When support is present and enabled, does not do any
+ * IA32_PQR_MSR writes until the user starts really using the feature
+ * ie creates a rdt cgroup directory and assigns a cache_mask that is
+ * different from the root cgroup's cache_mask.
+ * - Caches the per cpu CLOSid values and does the MSR write only
+ * when a task with a different CLOSid is scheduled in. That
+ * means the task belongs to a different cgroup.
+ * - Closids are allocated so that different cgroup directories
+ * with the same cache_mask get the same CLOSid. This minimizes CLOSids
+ * used and reduces MSR write frequency.
+ */
+static inline void intel_rdt_sched_in(void)
+{
+ /*
+ * Call the schedule in code only when RDT is enabled.
+ */
+ if (static_key_false(&rdt_enable_key))
+ __intel_rdt_sched_in();
+}
+
+#else
+
+static inline void intel_rdt_sched_in(void) {}
+
#endif
#endif
diff --git a/arch/x86/include/asm/rdt_common.h b/arch/x86/include/asm/rdt_common.h
new file mode 100644
index 0000000..01502c5
--- /dev/null
+++ b/arch/x86/include/asm/rdt_common.h
@@ -0,0 +1,25 @@
+#ifndef _X86_RDT_H_
+#define _X86_RDT_H_
+
+#define MSR_IA32_PQR_ASSOC 0x0c8f
+
+/**
+ * struct intel_pqr_state - State cache for the PQR MSR
+ * @rmid: The cached Resource Monitoring ID
+ * @closid: The cached Class Of Service ID
+ * @rmid_usecnt: The usage counter for rmid
+ *
+ * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
+ * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
+ * contains both parts, so we need to cache them.
+ *
+ * The cache also helps to avoid pointless updates if the value does
+ * not change.
+ */
+struct intel_pqr_state {
+ u32 rmid;
+ u32 closid;
+ int rmid_usecnt;
+};
+
+#endif
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 751bf4b..9149577 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -8,6 +8,9 @@ struct tss_struct;
void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
struct tss_struct *tss);
+#include <asm/intel_rdt.h>
+#define finish_arch_switch(prev) intel_rdt_sched_in()
+
#ifdef CONFIG_X86_32
#ifdef CONFIG_CC_STACKPROTECTOR
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 750b02a..c8bb134 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -34,6 +34,8 @@ static struct clos_cbm_map *ccmap;
static struct rdt_subsys_info rdtss_info;
static DEFINE_MUTEX(rdt_group_mutex);
struct intel_rdt rdt_root_group;
+struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE;
+
/*
* Mask of CPUs for writing CBM values. We only need one CPU per-socket.
*/
@@ -88,6 +90,20 @@ static inline void closid_put(u32 closid)
closid_free(closid);
}
+void __intel_rdt_sched_in(void)
+{
+ struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+ struct task_struct *task = current;
+ struct intel_rdt *ir;
+
+ ir = task_rdt(task);
+ if (ir->closid == state->closid)
+ return;
+
+ wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, ir->closid);
+ state->closid = ir->closid;
+}
+
static struct cgroup_subsys_state *
intel_rdt_css_alloc(struct cgroup_subsys_state *parent_css)
{
@@ -345,6 +361,7 @@ static int __init intel_rdt_late_init(void)
for_each_online_cpu(i)
rdt_cpumask_update(i);
+ static_key_slow_inc(&rdt_enable_key);
pr_info("Intel cache allocation enabled\n");
out_err:
diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 58bdefa..edffa11 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -7,41 +7,22 @@
#include <linux/perf_event.h>
#include <linux/slab.h>
#include <asm/cpu_device_id.h>
+#include <asm/rdt_common.h>
#include "perf_event.h"
-#define MSR_IA32_PQR_ASSOC 0x0c8f
#define MSR_IA32_QM_CTR 0x0c8e
#define MSR_IA32_QM_EVTSEL 0x0c8d
static u32 cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */
-/**
- * struct intel_pqr_state - State cache for the PQR MSR
- * @rmid: The cached Resource Monitoring ID
- * @closid: The cached Class Of Service ID
- * @rmid_usecnt: The usage counter for rmid
- *
- * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
- * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
- * contains both parts, so we need to cache them.
- *
- * The cache also helps to avoid pointless updates if the value does
- * not change.
- */
-struct intel_pqr_state {
- u32 rmid;
- u32 closid;
- int rmid_usecnt;
-};
-
/*
* The cached intel_pqr_state is strictly per CPU and can never be
* updated from a remote CPU. Both functions which modify the state
* (intel_cqm_event_start and intel_cqm_event_stop) are called with
* interrupts disabled, which is sufficient for the protection.
*/
-static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
/*
* Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
@@ -408,9 +389,9 @@ static void __intel_cqm_event_count(void *info);
*/
static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)
{
- struct perf_event *event;
struct list_head *head = &group->hw.cqm_group_entry;
u32 old_rmid = group->hw.cqm_rmid;
+ struct perf_event *event;
lockdep_assert_held(&cache_mutex);
@@ -1257,7 +1238,6 @@ static void intel_cqm_cpu_prepare(unsigned int cpu)
struct cpuinfo_x86 *c = &cpu_data(cpu);
state->rmid = 0;
- state->closid = 0;
state->rmid_usecnt = 0;
WARN_ON(c->x86_cache_max_rmid != cqm_max_rmid);
--
1.9.1
This patch adds hot cpu support for Intel Cache allocation. Support
includes updating the cache bitmask MSRs IA32_L3_QOS_n when a new CPU
package comes online. The IA32_L3_QOS_n MSRs are one per Class of
service on each CPU package. The new package's MSRs are synchronized
with the values of the existing MSRs. Also, the software cache for the
IA32_PQR_ASSOC MSR is updated during hot cpu notifications.
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt.c | 95 ++++++++++++++++++++++++++++++++++++++---
1 file changed, 90 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index c8bb134..1f9716c 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -25,6 +25,7 @@
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/spinlock.h>
+#include <linux/cpu.h>
#include <asm/intel_rdt.h>
/*
@@ -40,6 +41,11 @@ struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE;
* Mask of CPUs for writing CBM values. We only need one CPU per-socket.
*/
static cpumask_t rdt_cpumask;
+/*
+ * Temporary cpumask used during hot cpu notification handling. The usage
+ * is serialized by hot cpu locks.
+ */
+static cpumask_t tmp_cpumask;
#define rdt_for_each_child(pos_css, parent_ir) \
css_for_each_child((pos_css), &(parent_ir)->css)
@@ -313,13 +319,86 @@ out:
return err;
}
-static inline void rdt_cpumask_update(int cpu)
+static inline bool rdt_cpumask_update(int cpu)
{
- static cpumask_t tmp;
-
- cpumask_and(&tmp, &rdt_cpumask, topology_core_cpumask(cpu));
- if (cpumask_empty(&tmp))
+ cpumask_and(&tmp_cpumask, &rdt_cpumask, topology_core_cpumask(cpu));
+ if (cpumask_empty(&tmp_cpumask)) {
cpumask_set_cpu(cpu, &rdt_cpumask);
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * cbm_update_msrs() - Updates all the existing IA32_L3_MASK_n MSRs
+ * on the current package, which are one per CLOSid except IA32_L3_MASK_0.
+ */
+static void cbm_update_msrs(void *info)
+{
+ int maxid = boot_cpu_data.x86_cache_max_closid;
+ unsigned int i;
+
+ /*
+	 * At cpu reset, all bits of IA32_L3_MASK_n are set.
+	 * The index starts from one as there is no need
+	 * to update IA32_L3_MASK_0, as it belongs to the root cgroup
+	 * whose cache mask is always all 1s.
+ */
+ for (i = 1; i < maxid; i++) {
+ if (ccmap[i].clos_refcnt)
+ cbm_cpu_update((void *)i);
+ }
+}
+
+static inline void intel_rdt_cpu_start(int cpu)
+{
+ struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
+
+ state->closid = 0;
+ mutex_lock(&rdt_group_mutex);
+ if (rdt_cpumask_update(cpu))
+ smp_call_function_single(cpu, cbm_update_msrs, NULL, 1);
+ mutex_unlock(&rdt_group_mutex);
+}
+
+static void intel_rdt_cpu_exit(unsigned int cpu)
+{
+ int i;
+
+ mutex_lock(&rdt_group_mutex);
+ if (!cpumask_test_and_clear_cpu(cpu, &rdt_cpumask)) {
+ mutex_unlock(&rdt_group_mutex);
+ return;
+ }
+
+ cpumask_and(&tmp_cpumask, topology_core_cpumask(cpu), cpu_online_mask);
+ cpumask_clear_cpu(cpu, &tmp_cpumask);
+ i = cpumask_any(&tmp_cpumask);
+
+ if (i < nr_cpu_ids)
+ cpumask_set_cpu(i, &rdt_cpumask);
+ mutex_unlock(&rdt_group_mutex);
+}
+
+static int intel_rdt_cpu_notifier(struct notifier_block *nb,
+ unsigned long action, void *hcpu)
+{
+ unsigned int cpu = (unsigned long)hcpu;
+
+ switch (action) {
+ case CPU_DOWN_FAILED:
+ case CPU_ONLINE:
+ intel_rdt_cpu_start(cpu);
+ break;
+ case CPU_DOWN_PREPARE:
+ intel_rdt_cpu_exit(cpu);
+ break;
+ default:
+ break;
+ }
+
+ return NOTIFY_OK;
}
static int __init intel_rdt_late_init(void)
@@ -358,9 +437,15 @@ static int __init intel_rdt_late_init(void)
ccm->cache_mask = (1ULL << max_cbm_len) - 1;
ccm->clos_refcnt = 1;
+ cpu_notifier_register_begin();
+
for_each_online_cpu(i)
rdt_cpumask_update(i);
+ __hotcpu_notifier(intel_rdt_cpu_notifier, 0);
+
+ cpu_notifier_register_done();
+
static_key_slow_inc(&rdt_enable_key);
pr_info("Intel cache allocation enabled\n");
out_err:
--
1.9.1
Cache Allocation on hsw (Haswell) needs to be enumerated separately as
HSW does not have CPUID enumeration support for Cache Allocation.
Cache Allocation is only supported on certain HSW SKUs. This patch does
a probe test for hsw CPUs by writing a CLOSid (Class of Service id) into
the high 32 bits of the IA32_PQR_ASSOC MSR and checking whether the bits
stick. The probe test is only done after confirming that the CPU is HSW.
Other HSW specific quirks are:
- HSW requires the L3 cache bit mask to be at least two bits.
- Maximum CLOSids supported is always 4.
- Maximum bits supported in the cache bit mask is always 20.
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt.c | 62 +++++++++++++++++++++++++++++++++++++++--
1 file changed, 59 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 1f9716c..790cdba 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -38,6 +38,11 @@ struct intel_rdt rdt_root_group;
struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE;
/*
+ * Minimum number of bits required in the cache bitmask.
+ */
+static unsigned int min_bitmask_len = 1;
+
+/*
* Mask of CPUs for writing CBM values. We only need one CPU per-socket.
*/
static cpumask_t rdt_cpumask;
@@ -50,6 +55,56 @@ static cpumask_t tmp_cpumask;
#define rdt_for_each_child(pos_css, parent_ir) \
css_for_each_child((pos_css), &(parent_ir)->css)
+/*
+ * cache_alloc_hsw_probe() - Probe test for Intel Haswell CPUs, which
+ * lack CPUID enumeration support for Cache Allocation.
+ *
+ * Probes by writing to the high 32 bits (the CLOSid) of IA32_PQR_ASSOC
+ * and testing whether the bits stick, then hardcodes the max CLOS and
+ * max bitmask length for hsw. The minimum cache bitmask length allowed
+ * for HSW is 2 bits.
+ */
+static inline bool cache_alloc_hsw_probe(void)
+{
+ u32 l, h_old, h_new, h_tmp;
+
+ if (rdmsr_safe(MSR_IA32_PQR_ASSOC, &l, &h_old))
+ return false;
+
+ /*
+ * Default value is always 0 if feature is present.
+ */
+ h_tmp = h_old ^ 0x1U;
+ if (wrmsr_safe(MSR_IA32_PQR_ASSOC, l, h_tmp) ||
+ rdmsr_safe(MSR_IA32_PQR_ASSOC, &l, &h_new))
+ return false;
+
+ if (h_tmp != h_new)
+ return false;
+
+ wrmsr_safe(MSR_IA32_PQR_ASSOC, l, h_old);
+
+ boot_cpu_data.x86_cache_max_closid = 4;
+ boot_cpu_data.x86_cache_max_cbm_len = 20;
+ min_bitmask_len = 2;
+
+ return true;
+}
+
+static inline bool cache_alloc_supported(struct cpuinfo_x86 *c)
+{
+ if (cpu_has(c, X86_FEATURE_CAT_L3))
+ return true;
+
+ /*
+ * Probe test for Haswell CPUs.
+ */
+ if (c->x86 == 0x6 && c->x86_model == 0x3f)
+ return cache_alloc_hsw_probe();
+
+ return false;
+}
+
static inline void closid_get(u32 closid)
{
struct clos_cbm_map *ccm = &ccmap[closid];
@@ -160,7 +215,7 @@ static inline bool cbm_is_contiguous(unsigned long var)
unsigned long maxcbm = MAX_CBM_LENGTH;
unsigned long first_bit, zero_bit;
- if (!var)
+ if (bitmap_weight(&var, maxcbm) < min_bitmask_len)
return false;
first_bit = find_next_bit(&var, maxcbm, 0);
@@ -180,7 +235,8 @@ static int cbm_validate(struct intel_rdt *ir, unsigned long cbmvalue)
int err = 0;
if (!cbm_is_contiguous(cbmvalue)) {
- pr_err("bitmask should have >= 1 bit and be contiguous\n");
+ pr_err("bitmask should have >=%d bits and be contiguous\n",
+ min_bitmask_len);
err = -EINVAL;
goto out_err;
}
@@ -409,7 +465,7 @@ static int __init intel_rdt_late_init(void)
int err = 0, i;
size_t sizeb;
- if (!cpu_has(c, X86_FEATURE_CAT_L3)) {
+ if (!cache_alloc_supported(c)) {
rdt_root_group.css.ss->disabled = 1;
return -ENODEV;
}
--
1.9.1
Hello Thomas,
Just a ping in case you have any feedback. I have tried to fix the
issues you pointed out in V11 and V12.
Thanks,
Vikas
On Wed, 1 Jul 2015, Vikas Shivappa wrote:
> [...]
>
> Changes as per Thomas and PeterZ feedback:
> - fixed the cpumask declarations in cpumask.h and rdt,cmt and rapl to
>  have static storage so that they don't burden stack space when large.
> - removed mutex in cpu_starting notifications, replaced the locking with
> cpu_online.
> - changed name from hsw_probetest to cache_alloc_hsw_probe.
> - changed x86_rdt_max_closid to x86_cache_max_closid and
> x86_rdt_max_cbm_len to x86_cache_max_cbm_len as they are only related
> to cache allocation and not to all rdt.
>
> Changes in V9:
> Changes made as per Thomas feedback:
> - added a comment where we call schedule in code only when RDT is
> enabled.
> - Reordered the local declarations to follow convention in
> intel_cqm_xchg_rmid
>
> Changes in V8: Thanks to feedback from Thomas and following changes are
> made based on his feedback:
>
> Generic changes/Preparatory patches:
> -added a new cpumask_any_online_but which returns the next
> core sibling that is online.
> -Made changes in Intel Cache monitoring and Intel RAPL(Running average
> power limit) code to use the new function above to find the next cpu
> that can be a designated reader for the package. Also changed the way
> the package masks are computed which can be simplified using
> topology_core_cpumask.
>
> Cache allocation specific changes:
> -Moved the documentation to the beginning of the patch series.
> -Added more documentation for the rdt cgroup files in the documentation.
> -Changed the dmesg output when cache alloc is enabled to be more helpful
> and updated few other comments to be better readable.
> -removed __ prefix to functions like clos_get which were not following
> convention.
> -added code to take action on a WARN_ON in clos_put. Made a few other
> changes to reduce code text.
> -updated better readable/Kernel doc format comments for the
> call to rdt_css_alloc, datastructures .
> -removed cgroup_init
> -changed the names of functions to only have intel_ prefix for external
> APIs.
> -replaced (void *)&closid with (void *)closid when calling
> on_each_cpu_mask
> -fixed the reference release of closid during cache bitmask write.
> -changed the code to not ignore a cache mask which has bits set outside
> of the max bits allowed. It returns an error instead.
> -replaced bitmap_set(&max_mask, 0, max_cbm_len) with max_mask =
> (1ULL << max_cbm) - 1.
> - update the rdt_cpu_mask which has one cpu for each package, using
> topology_core_cpumask instead of looping through existing rdt_cpu_mask.
> Realized topology_core_cpumask name is misleading and it actually
> returns the cores in a cpu package!
> -arranged the code better to have the code relating to similar task
> together.
> -Improved searching for the next online cpu sibling and maintaining the
> rdt_cpu_mask which has one cpu per package.
> -removed the unnecessary wrapper rdt_enabled.
> -removed unnecessary spin lock and rculock in the scheduling code.
> -merged all scheduling code into one patch not separating the RDT common
> software cache code.
>
> Changes in V7: Based on feedback from PeterZ and Matt and following
> discussions :
> - changed lot of naming to reflect the data structures which are common
> to RDT and specific to Cache allocation.
> - removed all usage of 'cat'. replace with more friendly cache
> allocation
> - fixed lot of convention issues (whitespace, return paradigm etc)
> - changed the scheduling hook for RDT to not use a inline.
> - removed adding new scheduling hook and just reused the existing one
> similar to perf hook.
>
> Changes in V6:
> - rebased to 4.1-rc1 which has the CMT(cache monitoring) support included.
> - (Thanks to Marcelo's feedback). Fixed support for hot cpu handling for
> IA32_L3_QOS MSRs. Although during deep C states the MSR need not be restored
> this is needed when physically a new package is added.
> -some other coding convention changes including renaming to cache_mask using a
> refcnt to track the number of cgroups using a closid in clos_cbm map.
> -1b cbm support for non-hsw SKUs. HSW is an exception which needs the cache
> bit masks to be at least 2 bits.
>
> Changes in v5:
> - Added support to propagate the cache bit mask update for each
> package.
> - Removed the cache bit mask reference in the intel_rdt structure as
> there was no need for that and we already maintain a separate
> closid<->cbm mapping.
> - Made a few coding convention changes which include adding the
> assertion while freeing the CLOSID.
>
> Changes in V4:
> - Integrated with the latest V5 CMT patches.
> - Changed naming of cgroup to rdt(resource director technology) from
> cat(cache allocation technology). This was done as the RDT is the
> umbrella term for platform shared resources allocation. Hence in
> future it would be easier to add resource allocation to the same
> cgroup
> - Naming changes also applied to a lot of other data structures/APIs.
> - Added documentation on cgroup usage for cache allocation to address
> a lot of questions from various academic and industry regarding
> cache allocation usage.
>
> Changes in V3:
> - Implements a common software cache for IA32_PQR_MSR
> - Implements support for hsw Cache Allocation enumeration. This does not use the brand
> strings like earlier version but does a probe test. The probe test is done only
> on hsw family of processors
> - Made a few coding convention, name changes
> - Check for lock being held when ClosID manipulation happens
>
> Changes in V2:
> - Removed HSW specific enumeration changes. Plan to include it later as a
> separate patch.
> - Fixed the code in prep_arch_switch to be specific for x86 and removed
> x86 defines.
> - Fixed cbm_write to not write all 1s when a cgroup is freed.
> - Fixed one possible memory leak in init.
> - Changed some of manual bitmap
> manipulation to use the predefined bitmap APIs to make code more readable
> - Changed name in sources from cqe to cat
> - Global cat enable flag changed to static_key and disabled cgroup early_init
>
> [PATCH 1/9] x86/intel_cqm: Modify hot cpu notification handling
> [PATCH 2/9] x86/intel_rapl: Modify hot cpu notification handling for
> [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup
> [PATCH 4/9] x86/intel_rdt: Add support for Cache Allocation detection
> [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service
> [PATCH 6/9] x86/intel_rdt: Add support for cache bit mask management
> [PATCH 7/9] x86/intel_rdt: Implement scheduling support for Intel RDT
> [PATCH 8/9] x86/intel_rdt: Hot cpu support for Cache Allocation
> [PATCH 9/9] x86/intel_rdt: Intel haswell Cache Allocation enumeration
>
On Mon, 13 Jul 2015, Vikas Shivappa wrote:
>
> Hello Thomas,
>
> Just a ping in case you have any feedback. I have tried to fix the
> issues you pointed out in V11 and V12.
It's on my todo list.
On Wed, 1 Jul 2015, Vikas Shivappa wrote:
> Cache allocation patches(dependent on prep patches) adds a cgroup
> subsystem to support the new Cache Allocation feature found in future
> Intel Xeon Intel processors. Cache Allocation is a sub-feature with in
> Resource Director Technology(RDT) feature. RDT which provides support to
> control sharing of platform resources like L3 cache.
Just a few general observations:
1) The changelogs need some loving care.
2) I let it up to PeterZ to decide whether he wants to fold the
sched support into switch_to.
Otherwise, I'm more or less content with the outcome of this patch
marathon. So for the whole pile:
Reviewed-by: Thomas Gleixner <[email protected]>
On Fri, 24 Jul 2015, Thomas Gleixner wrote:
> On Wed, 1 Jul 2015, Vikas Shivappa wrote:
>> Cache allocation patches(dependent on prep patches) adds a cgroup
>> subsystem to support the new Cache Allocation feature found in future
>> Intel Xeon Intel processors. Cache Allocation is a sub-feature with in
>> Resource Director Technology(RDT) feature. RDT which provides support to
>> control sharing of platform resources like L3 cache.
>
> Just a few general observations:
>
> 1) The changelogs need some loving care.
Will edit the changelogs and send changes.
>
> Otherwise, I'm more or less content with the outcome of this patch
> marathon. So for the whole pile:
>
> Reviewed-by: Thomas Gleixner <[email protected]>
Appreciate your time and feedback !
Vikas
>
>
>
Hello PeterZ,
On Fri, 24 Jul 2015, Thomas Gleixner wrote:
> On Wed, 1 Jul 2015, Vikas Shivappa wrote:
>> Cache allocation patches(dependent on prep patches) adds a cgroup
>> subsystem to support the new Cache Allocation feature found in future
>> Intel Xeon Intel processors. Cache Allocation is a sub-feature with in
>> Resource Director Technology(RDT) feature. RDT which provides support to
>> control sharing of platform resources like L3 cache.
>
> Just a few general observations:
>
> 1) The changelogs need some loving care.
>
> 2) I let it up to PeterZ to decide whether he wants to fold the
> sched support into switch_to.
My prior answer to this was that since the cache allocation sched support
does not deal with register state (general purpose, FPU, extended states),
I did not want to burden the switch_to code.
Also, the cache monitoring scheduling code falls in finish_arch_switch.
Let me know if anything otherwise.
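For reference, the hook under discussion amounts to roughly the sketch
below (illustrative only, assuming the pqr_state layout from the cqm
prep patches; the actual patch 7/9 code may differ):

static inline void __intel_rdt_sched_in(unsigned int closid)
{
        struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);

        /*
         * The CLOSid is cached per cpu; IA32_PQR_ASSOC is only written
         * when the incoming task's CLOSid differs from the cached value,
         * keeping the context switch path cheap.
         */
        if (state->closid == closid)
                return;

        state->closid = closid;
        wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, closid);
}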
Thanks,
Vikas
>
> Otherwise, I'm more or less content with the outcome of this patch
> marathon. So for the whole pile:
>
> Reviewed-by: Thomas Gleixner <[email protected]>
>
>
>
On Fri, 24 Jul 2015, Vikas Shivappa wrote:
> On Fri, 24 Jul 2015, Thomas Gleixner wrote:
>
> > On Wed, 1 Jul 2015, Vikas Shivappa wrote:
> > > Cache allocation patches(dependent on prep patches) adds a cgroup
> > > subsystem to support the new Cache Allocation feature found in future
> > > Intel Xeon Intel processors. Cache Allocation is a sub-feature with in
> > > Resource Director Technology(RDT) feature. RDT which provides support to
> > > control sharing of platform resources like L3 cache.
> >
> > Just a few general observations:
> >
> > 1) The changelogs need some loving care.
>
> Will edit the changelogs and send changes.
Please wait for Peter's feedback.
> >
> > Otherwise, I'm more or less content with the outcome of this patch
> > marathon. So for the whole pile:
> >
> > Reviewed-by: Thomas Gleixner <[email protected]>
>
> Appreciate your time and feedback !
Welcome!
tglx
On Fri, 24 Jul 2015, Thomas Gleixner wrote:
> On Fri, 24 Jul 2015, Vikas Shivappa wrote:
>> On Fri, 24 Jul 2015, Thomas Gleixner wrote:
>>
>>> On Wed, 1 Jul 2015, Vikas Shivappa wrote:
>>>> Cache allocation patches(dependent on prep patches) adds a cgroup
>>>> subsystem to support the new Cache Allocation feature found in future
>>>> Intel Xeon Intel processors. Cache Allocation is a sub-feature with in
>>>> Resource Director Technology(RDT) feature. RDT which provides support to
>>>> control sharing of platform resources like L3 cache.
>>>
>>> Just a few general observations:
>>>
>>> 1) The changelogs need some loving care.
>>
>> Will edit the changelogs and send changes.
>
> Please wait for Peter's feedback.
ok
Thanks,
Vikas
>
>>>
>>> Otherwise, I'm more or less content with the outcome of this patch
>>> marathon. So for the whole pile:
>>>
>>> Reviewed-by: Thomas Gleixner <[email protected]>
>>
>> Appreciate your time and feedback !
>
> Welcome!
>
> tglx
>
On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote:
Please edit this document to have consistent spacing. It's really hard to
read this. Every time I spot a misplaced space my brain stumbles and I
need to restart.
> diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
> new file mode 100644
> index 0000000..dfff477
> --- /dev/null
> +++ b/Documentation/cgroups/rdt.txt
> @@ -0,0 +1,215 @@
> + RDT
> + ---
> +
> +Copyright (C) 2014 Intel Corporation
> +Written by [email protected]
> +(based on contents and format from cpusets.txt)
> +
> +CONTENTS:
> +=========
> +
> +1. Cache Allocation Technology
> + 1.1 What is RDT and Cache allocation ?
> + 1.2 Why is Cache allocation needed ?
> + 1.3 Cache allocation implementation overview
> + 1.4 Assignment of CBM and CLOS
> + 1.5 Scheduling and Context Switch
> +2. Usage Examples and Syntax
> +
> +1. Cache Allocation Technology(Cache allocation)
> +===================================
> +
> +1.1 What is RDT and Cache allocation
> +------------------------------------
> +
> +Cache allocation is a sub-feature of Resource Director Technology(RDT)
missing ' ' before the '('.
> +Allocation or Platform Shared resource control which provides support to
> +control Platform shared resources like L3 cache. Currently L3 Cache is
Double ' ' after '.' -- which _can_ be correct, but is inconsistent
throughout the document.
> +the only resource that is supported in RDT. More information can be
> +found in the Intel SDM, Volume 3, section 17.15.
Please also include the SDM revision, like June 2015.
In fact, in the June 2015 V3 17.15 is CQM, not CAT.
> +Cache Allocation Technology provides a way for the Software (OS/VMM)
> +to restrict cache allocation to a defined 'subset' of cache which may
> +be overlapping with other 'subsets'. This feature is used when
> +allocating a line in cache ie when pulling new data into the cache.
> +The programming of the h/w is done via programming MSRs.
Double ' ' before 'MSRs'.
> +The different cache subsets are identified by CLOS identifier (class
> +of service) and each CLOS has a CBM (cache bit mask). The CBM is a
> +contiguous set of bits which defines the amount of cache resource that
> +is available for each 'subset'.
> +
> +1.2 Why is Cache allocation needed
> +----------------------------------
> +
> +In todays new processors the number of cores is continuously increasing,
> +especially in large scale usage models where VMs are used like
> +webservers and datacenters. The number of cores increase the number
Single ' ' after .
> +of threads or workloads that can simultaneously be run. When
> +multi-threaded-applications, VMs, workloads run concurrently they
> +compete for shared resources including L3 cache.
> +
> +The Cache allocation enables more cache resources to be made available
Double ' ' for no apparent reason.
> +for higher priority applications based on guidance from the execution
> +environment.
> +
> +The architecture also allows dynamically changing these subsets during
> +runtime to further optimize the performance of the higher priority
> +application with minimal degradation to the low priority app.
> +Additionally, resources can be rebalanced for system throughput benefit.
> +
> +This technique may be useful in managing large computer systems which
> +large L3 cache. Examples may be large servers running instances of
Double ' '
> +webservers or database servers. In such complex systems, these subsets
> +can be used for more careful placing of the available cache
> +resources.
> +
> +1.3 Cache allocation implementation Overview
> +--------------------------------------------
> +
> +Kernel implements a cgroup subsystem to support cache allocation.
> +
> +Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping.
No ' ' before '('
> +A CLOS(Class of service) is represented by a CLOSid.CLOSid is internal
Idem, also, _no_ space after '.'
> +to the kernel and not exposed to user. Each cgroup would have one CBM
Double space after '.'
> +and would just represent one cache 'subset'.
> +
> +The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the
I'm thinking the convention is ' ' _after_ ',', not before.
> +cgroup never fails. When a child cgroup is created it inherits the
> +CLOSid and the CBM from its parent. When a user changes the default
> +CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
> +used before. The changing of 'l3_cache_mask' may fail with -ENOSPC once
> +the kernel runs out of maximum CLOSids it can support.
> +User can create as many cgroups as he wants but having different CBMs
> +at the same time is restricted by the maximum number of CLOSids
> +(multiple cgroups can have the same CBM).
> +Kernel maintains a CLOSid<->cbm mapping which keeps reference counter
Above you had ' ' around the arrows.
> +for each cgroup using a CLOSid.
> +
> +The tasks in the cgroup would get to fill the L3 cache represented by
> +the cgroup's 'l3_cache_mask' file.
> +
> +Root directory would have all available bits set in 'l3_cache_mask' file
Random double ' '
> +by default.
> +
> +Each RDT cgroup directory has the following files. Some of them may be a
> +part of common RDT framework or be specific to RDT sub-features like
> +cache allocation.
> +
> + - intel_rdt.l3_cache_mask: The cache bitmask(CBM) is represented by this
> + file. The bitmask must be contiguous and would have a 1 or 2 bit
> + minimum length.
> +
> +1.4 Assignment of CBM,CLOS
> +--------------------------
> +
> +The 'l3_cache_mask' needs to be a subset of the parent node's
> +'l3_cache_mask'. Any contiguous subset of these bits(with a minimum of 2
> +bits on hsw SKUs) maybe set to indicate the cache mapping desired. The
> +'l3_cache_mask' between 2 directories can overlap. The 'l3_cache_mask' would
> +represent the cache 'subset' of the Cache allocation cgroup. For ex: on
> +a system with 16 bits of max cbm bits, if the directory has the least
> +significant 4 bits set in its 'l3_cache_mask' file(meaning the 'l3_cache_mask'
> +is just 0xf), it would be allocated the right quarter of the Last level
> +cache which means the tasks belonging to this Cache allocation cgroup
> +can use the right quarter of the cache to fill. If it
> +has the most significant 8 bits set ,it would be allocated the left
> +half of the cache(8 bits out of 16 represents 50%).
Random whitespace again. Also try and limit paragraphs to 5-6 lines max.
> +
> +
> +The cache portion defined in the CBM file is available to all tasks
> +within the cgroup to fill and these task are not allowed to allocate
> +space in other parts of the cache.
> +
> +1.5 Scheduling and Context Switch
> +---------------------------------
> +
> +During context switch kernel implements this by writing the
> +CLOSid (internally maintained by kernel) of the cgroup to which the
> +task belongs to the CPU's IA32_PQR_ASSOC MSR. The MSR is only written
> +when there is a change in the CLOSid for the CPU in order to minimize
> +the latency incurred during context switch.
> +
> +The following considerations are done for the PQR MSR write so that it
> +has minimal impact on scheduling hot path:
> +- This path doesnt exist on any non-intel platforms.
!x86 I think you mean; it's entirely possible to have the code present
on AMD systems, for instance.
> +- On Intel platforms, this would not exist by default unless CGROUP_RDT
> +is enabled.
You can enable this just fine on AMD machines.
> +- remains a no-op when CGROUP_RDT is enabled and intel hardware does not
> +support the feature.
> +- When feature is available, still remains a no-op till the user
> +manually creates a cgroup *and* assigns a new cache mask. Since the
> +child node inherits the parents cache mask , by cgroup creation there is
> +no scheduling hot path impact from the new cgroup.
> +- per cpu PQR values are cached and the MSR write is only done when
> +there is a task with different PQR is scheduled on the CPU. Typically if
> +the task groups are bound to be scheduled on a set of CPUs , the number
> +of MSR writes is greatly reduced.
Aside from many instances of random whitespace, maybe also format like:
- point;
- multi
line point;
- another
multi
line
thing.
> +
> +2. Usage examples and syntax
> +============================
> +
> +To check if Cache allocation was enabled on your system
> +
> +dmesg | grep -i intel_rdt
$ dmesg | grep -i intel_rdt
That is, whitespace before _and_ after, _and_ indent, plus a prompt, to
clarify it's a command and not just weirdly formatted text.
> +should output : intel_rdt: Max bitmask length: xx,Max ClosIds: xx
intel_rdt: Max bitmask length: xx
Again, wrap in whitespace and indent to set apart.
> +the length of l3_cache_mask and CLOS should depend on the system you use.
> +
> +Also /proc/cpuinfo would have rdt(if rdt is enabled) and cat_l3( if L3
Many more instances of random whitespace.
> + cache allocation is enabled).
> +
> +Following would mount the cache allocation cgroup subsystem and create
> +2 directories. Please refer to Documentation/cgroups/cgroups.txt on
> +details about how to use cgroups.
> +
> + cd /sys/fs/cgroup
> + mkdir rdt
> + mount -t cgroup -ointel_rdt intel_rdt /sys/fs/cgroup/rdt
> + cd rdt
> +
> +Create 2 rdt cgroups
> +
> + mkdir group1
> + mkdir group2
> +
> +Following are some of the Files in the directory
> +
> + ls
> + rdt.l3_cache_mask
> + tasks
> +
See, here you do the whitespace and indent thing, but above you didn't.
That kind of inconsistency just bugs the hell out of me.
> +Say if the cache is 2MB and cbm supports 16 bits, then setting the
> +below allocates the 'right 1/4th(512KB)' of the cache to group2
Another few random whitespace fails.
> +
> +Edit the CBM for group2 to set the least significant 4 bits. This
> +allocates 'right quarter' of the cache.
> +
> + cd group2
> + /bin/echo 0xf > rdt.l3_cache_mask
> +
> +
> +Edit the CBM for group2 to set the least significant 8 bits.This
> +allocates the right half of the cache to 'group2'.
> +
> + cd group2
> + /bin/echo 0xff > rdt.l3_cache_mask
> +
> +Assign tasks to the group2
> +
> + /bin/echo PID1 > tasks
> + /bin/echo PID2 > tasks
> +
> + Meaning now threads
> + PID1 and PID2 get to fill the 'right half' of
> + the cache as the belong to cgroup group2.
This doesn't want to be indented, right?
> +
> +Create a group under group2
> +
> + cd group2
> + mkdir group21
> + cat rdt.l3_cache_mask
> + 0xff - inherits parents mask.
And this would show the use of the prompt ($), which allows one to
distinguish between commands and output.
> +
> + /bin/echo 0xfff > rdt.l3_cache_mask - throws error as mask has to parent's mask's subset
I'm betting you don't actually want us to type the "- ..." bit? Either
use a regular bash comment (#) to make it harmless, or format it
differently.
Because some poor sod is going to literally type that into his console
and wonder WTF just happened.
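Concretely, the harmless form would be something like:

  $ /bin/echo 0xfff > rdt.l3_cache_mask   # fails: mask must be a subset of the parent's mask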
> +
> +In order to restrict RDT cgroups to specific set of CPUs rdt can be
> +comounted with cpusets.
Either RDT is in capitals or it is not, but this is silly.
On Wed, Jul 01, 2015 at 03:21:05PM -0700, Vikas Shivappa wrote:
> +static int __init intel_rdt_late_init(void)
> +{
> + struct cpuinfo_x86 *c = &boot_cpu_data;
> +
> + if (!cpu_has(c, X86_FEATURE_CAT_L3))
> + return -ENODEV;
> +
> + pr_info("Intel cache allocation enabled\n");
s/enabled/detected/ ? This hardly enables anything.
> +
> + return 0;
> +}
On Wed, Jul 01, 2015 at 03:21:07PM -0700, Vikas Shivappa wrote:
> +static inline bool cbm_is_contiguous(unsigned long var)
> +{
> + unsigned long maxcbm = MAX_CBM_LENGTH;
> + unsigned long first_bit, zero_bit;
> +
> + if (!var)
> + return false;
> +
> + first_bit = find_next_bit(&var, maxcbm, 0);
We actually have: find_first_bit().
> + zero_bit = find_next_zero_bit(&var, maxcbm, first_bit);
> +
> + if (find_next_bit(&var, maxcbm, zero_bit) < maxcbm)
> + return false;
> +
> + return true;
> +}
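A sketch of that simplification (same logic, just using the dedicated
helper):

static inline bool cbm_is_contiguous(unsigned long var)
{
        unsigned long maxcbm = MAX_CBM_LENGTH;
        unsigned long first_bit, zero_bit;

        if (!var)
                return false;

        first_bit = find_first_bit(&var, maxcbm);
        zero_bit = find_next_zero_bit(&var, maxcbm, first_bit);

        /* contiguous iff no set bit remains past the first clear bit */
        return find_next_bit(&var, maxcbm, zero_bit) >= maxcbm;
}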
On Wed, Jul 01, 2015 at 03:21:07PM -0700, Vikas Shivappa wrote:
> +static int cbm_validate(struct intel_rdt *ir, unsigned long cbmvalue)
> +{
> + struct cgroup_subsys_state *css;
> + struct intel_rdt *par, *c;
> + unsigned long *cbm_tmp;
> + int err = 0;
> +
> + if (!cbm_is_contiguous(cbmvalue)) {
> + pr_err("bitmask should have >= 1 bit and be contiguous\n");
> + err = -EINVAL;
> + goto out_err;
> + }
> +static struct cftype rdt_files[] = {
> + {
> + .name = "l3_cache_mask",
> + .seq_show = intel_cache_alloc_cbm_read,
> + .write_u64 = intel_cache_alloc_cbm_write,
> + .mode = 0666,
So this file is world writable? How is the above pr_err() not a DoS ?
> + },
> + { } /* terminate */
> +};
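One way to address both points (a hypothetical fix, not something posted
in this thread) would be a root-writable mode plus a ratelimited error
path:

static struct cftype rdt_files[] = {
        {
                .name = "l3_cache_mask",
                .seq_show = intel_cache_alloc_cbm_read,
                .write_u64 = intel_cache_alloc_cbm_write,
                .mode = 0644,   /* writable by root only */
        },
        { }     /* terminate */
};

and in cbm_validate():

        if (!cbm_is_contiguous(cbmvalue)) {
                /* ratelimited so a hostile writer cannot flood the log */
                pr_err_ratelimited("bitmask should have >= 1 bit and be contiguous\n");
                err = -EINVAL;
                goto out_err;
        }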
On Wed, Jul 01, 2015 at 03:21:06PM -0700, Vikas Shivappa wrote:
> +struct clos_cbm_map {
> + unsigned long cache_mask;
> + unsigned int clos_refcnt;
> +};
This structure is not a map at all, it's the map value. Furthermore,
cache_mask seems a confusing name for the capacity bitmask (CBM).
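A rename along the lines Peter suggests might look like this (names
illustrative, not from the series):

/* one entry per CLOSid; the ccmap array indexed by CLOSid is the map */
struct clos_cbm_entry {
        unsigned long cbm;      /* capacity bitmask (CBM) for this CLOS */
        unsigned int refcnt;    /* number of cgroups using this CLOSid */
};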
On Wed, Jul 01, 2015 at 03:21:06PM -0700, Vikas Shivappa wrote:
> static int __init intel_rdt_late_init(void)
> {
> struct cpuinfo_x86 *c = &boot_cpu_data;
> + static struct clos_cbm_map *ccm;
> + u32 maxid, max_cbm_len;
> + size_t sizeb;
Why 'sizeb' ? 'size' is still available, right?
> + int err = 0;
>
> - if (!cpu_has(c, X86_FEATURE_CAT_L3))
> + if (!cpu_has(c, X86_FEATURE_CAT_L3)) {
> + rdt_root_group.css.ss->disabled = 1;
> return -ENODEV;
> + }
> + maxid = c->x86_cache_max_closid;
> + max_cbm_len = c->x86_cache_max_cbm_len;
> +
> + sizeb = BITS_TO_LONGS(maxid) * sizeof(long);
> + rdtss_info.closmap = kzalloc(sizeb, GFP_KERNEL);
> + if (!rdtss_info.closmap) {
> + err = -ENOMEM;
> + goto out_err;
> + }
> +
> + sizeb = maxid * sizeof(struct clos_cbm_map);
> + ccmap = kzalloc(sizeb, GFP_KERNEL);
> + if (!ccmap) {
> + kfree(rdtss_info.closmap);
> + err = -ENOMEM;
> + goto out_err;
> + }
What's the expected size of max_closid? iow, how big of an array are you
in fact allocating here?
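(For scale, a back-of-envelope estimate rather than anything stated in
the thread: struct clos_cbm_map is an 8-byte mask plus a 4-byte refcount,
i.e. 16 bytes with padding on x86_64, so even 16 CLOSids put ccmap at a
mere 256 bytes, and closmap at a single unsigned long.)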
On Tue, 28 Jul 2015, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 03:21:05PM -0700, Vikas Shivappa wrote:
>> +static int __init intel_rdt_late_init(void)
>> +{
>> + struct cpuinfo_x86 *c = &boot_cpu_data;
>> +
>> + if (!cpu_has(c, X86_FEATURE_CAT_L3))
>> + return -ENODEV;
>> +
>> + pr_info("Intel cache allocation enabled\n");
>
> s/enabled/detected/ ? This hardly enables anything.
Will fix; 'detected' seems appropriate here.
Thanks,
Vikas
>
>> +
>> + return 0;
>> +}
>
On Tue, 28 Jul 2015, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 03:21:07PM -0700, Vikas Shivappa wrote:
>> +static inline bool cbm_is_contiguous(unsigned long var)
>> +{
>> + unsigned long maxcbm = MAX_CBM_LENGTH;
>> + unsigned long first_bit, zero_bit;
>> +
>> + if (!var)
>> + return false;
>> +
>> + first_bit = find_next_bit(&var, maxcbm, 0);
>
> We actually have: find_first_bit().
Will fix.
Thanks,
Vikas
>
>> + zero_bit = find_next_zero_bit(&var, maxcbm, first_bit);
>> +
>> + if (find_next_bit(&var, maxcbm, zero_bit) < maxcbm)
>> + return false;
>> +
>> + return true;
>> +}
>
On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote:
> Adds a description of Cache allocation technology, overview
> of kernel implementation and usage of Cache Allocation cgroup interface.
>
> Cache allocation is a sub-feature of Resource Director Technology(RDT)
> Allocation or Platform Shared resource control which provides support to
> control Platform shared resources like L3 cache. Currently L3 Cache is
> the only resource that is supported in RDT. More information can be
> found in the Intel SDM, Volume 3, section 17.15.
>
> Cache Allocation Technology provides a way for the Software (OS/VMM)
> to restrict cache allocation to a defined 'subset' of cache which may
> be overlapping with other 'subsets'. This feature is used when
> allocating a line in cache ie when pulling new data into the cache.
>
> Signed-off-by: Vikas Shivappa <[email protected]>
> ---
> Documentation/cgroups/rdt.txt | 215 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 215 insertions(+)
> create mode 100644 Documentation/cgroups/rdt.txt
>
> [...]
> +In order to restrict RDT cgroups to specific set of CPUs rdt can be
> +comounted with cpusets.
> --
> 1.9.1
Vikas,
Can you give an example of comounting with cpusets? What do you mean by
restricting RDT cgroups to a specific set of CPUs?
Another limitation of this interface is that it assumes the
task <-> control group assignment is pertinent, that is:
| taskgroup, L3 policy|:
| taskgroupA, 50% L3 exclusive |,
| taskgroupB, 50% L3 |,
| taskgroupC, 50% L3 |.
Whenever taskgroup A is empty (that is, no runnable task in it), you waste 50% of
L3 cache.
I think this problem and the similar problem of L3 reservation with CPU
isolation can be solved in this way: whenever a task from cgroupE with exclusive way
access is migrated to a new die, impose the exclusivity (by removing
access to that way by other cgroups).
Whenever cgroupE has zero tasks, remove exclusivity (by allowing
other cgroups to use the exclusive ways of it).
I'll cook a patch.
On Tue, 28 Jul 2015, Marcelo Tosatti wrote:
> On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote:
>> [...]
>> +In order to restrict RDT cgroups to specific set of CPUs rdt can be
>> +comounted with cpusets.
>> --
>> 1.9.1
>
> Vikas,
>
> Can you give an example of comounting with cpusets? What do you mean by
> restricting RDT cgroups to a specific set of CPUs?
I was going to edit the documentation soon, as I see a lot of feedback on
it; it may have caused confusion.
I mean just pinning down tasks to a set of CPUs. This does not mean we
make the cache exclusive to those tasks.
>
> Another limitation of this interface is that it assumes the
> task <-> control group assignment is pertinent, that is:
>
> | taskgroup, L3 policy|:
>
> | taskgroupA, 50% L3 exclusive |,
> | taskgroupB, 50% L3 |,
> | taskgroupC, 50% L3 |.
>
> Whenever taskgroup A is empty (that is no runnable task in it), you waste 50% of
> L3 cache.
Cgroup masks can always overlap, and hence do not provide exclusive cache
allocation.
>
> I think this problem and the similar problem of L3 reservation with CPU
> isolation can be solved in this way: whenever a task from cgroupE with exclusive way
> access is migrated to a new die, impose the exclusivity (by removing
> access to that way by other cgroups).
>
> Whenever cgroupE has zero tasks, remove exclusivity (by allowing
> other cgroups to use the exclusive ways of it).
Same comment as above - cgroup masks can always overlap and other cgroups can
allocate the same cache, and hence won't provide exclusive cache
allocation.
So naturally the cgroup with tasks would get to use the cache if it has the
same mask (say representing 50% of the cache in your example) as the others.
For example (assuming a max cbm of 8 bits):
cgroupa - mask - 0xf
cgroupb - mask - 0xf. Now if cgroupa has no tasks, cgroupb naturally gets all
of that cache.
Thanks,
Vikas
>
> I'll cook a patch.
>
>
>
>
>
> -----Original Message-----
> From: Shivappa, Vikas
> Sent: Tuesday, July 28, 2015 5:07 PM
> To: Marcelo Tosatti
> Cc: Vikas Shivappa; [email protected]; Shivappa, Vikas;
> [email protected]; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; Fleming, Matt; Auld, Will; Williamson,
> Glenn P; Juvva, Kanaka D
> Subject: Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and
> cgroup usage guide
>
>
>
> On Tue, 28 Jul 2015, Marcelo Tosatti wrote:
>
> > On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote:
> >> Adds a description of Cache allocation technology, overview of kernel
> >> implementation and usage of Cache Allocation cgroup interface.
> >>
> >> Cache allocation is a sub-feature of Resource Director
> >> Technology(RDT) Allocation or Platform Shared resource control which
> >> provides support to control Platform shared resources like L3 cache.
> >> Currently L3 Cache is the only resource that is supported in RDT.
> >> More information can be found in the Intel SDM, Volume 3, section 17.15.
> >>
> >> Cache Allocation Technology provides a way for the Software (OS/VMM)
> >> to restrict cache allocation to a defined 'subset' of cache which may
> >> be overlapping with other 'subsets'. This feature is used when
> >> allocating a line in cache ie when pulling new data into the cache.
> >>
> >> Signed-off-by: Vikas Shivappa <[email protected]>
> >> ---
> >> Documentation/cgroups/rdt.txt | 215
> >> ++++++++++++++++++++++++++++++++++++++++++
> >> 1 file changed, 215 insertions(+)
> >> create mode 100644 Documentation/cgroups/rdt.txt
> >>
> >> diff --git a/Documentation/cgroups/rdt.txt
> >> b/Documentation/cgroups/rdt.txt new file mode 100644 index
> >> 0000000..dfff477
> >> --- /dev/null
> >> +++ b/Documentation/cgroups/rdt.txt
> >> @@ -0,0 +1,215 @@
> >> + RDT
> >> + ---
> >> +
> >> +Copyright (C) 2014 Intel Corporation
> >> +Written by [email protected]
> >> +(based on contents and format from cpusets.txt)
> >> +
> >> +CONTENTS:
> >> +=========
> >> +
> >> +1. Cache Allocation Technology
> >> + 1.1 What is RDT and Cache allocation ?
> >> + 1.2 Why is Cache allocation needed ?
> >> + 1.3 Cache allocation implementation overview
> >> + 1.4 Assignment of CBM and CLOS
> >> + 1.5 Scheduling and Context Switch
> >> +2. Usage Examples and Syntax
> >> +
> >> +1. Cache Allocation Technology (Cache allocation)
> >> +==================================================
> >> +
> >> +1.1 What is RDT and Cache allocation
> >> +------------------------------------
> >> +
> >> +Cache allocation is a sub-feature of Resource Director
> >> +Technology (RDT), or Platform Shared Resource Control, which
> >> +provides support to control platform shared resources like the L3
> >> +cache. Currently the L3 cache is the only resource supported in RDT.
> >> +More information can be found in the Intel SDM, Volume 3, section 17.15.
> >> +
> >> +Cache Allocation Technology provides a way for the Software (OS/VMM)
> >> +to restrict cache allocation to a defined 'subset' of cache which
> >> +may overlap with other 'subsets'. This feature is used when
> >> +allocating a line in cache, i.e. when pulling new data into the
> >> +cache. The h/w is programmed via MSRs.
> >> +
> >> +The different cache subsets are identified by CLOS identifier (class
> >> +of service) and each CLOS has a CBM (cache bit mask). The CBM is a
> >> +contiguous set of bits which defines the amount of cache resource
> >> +that is available for each 'subset'.
> >> +
> >> +1.2 Why is Cache allocation needed
> >> +----------------------------------
> >> +
> >> +In today's new processors the number of cores is continuously
> >> +increasing, especially in large scale usage models where VMs are
> >> +used, like webservers and datacenters. The number of cores increases
> >> +the number of threads or workloads that can simultaneously be run.
> >> +When multi-threaded applications, VMs and workloads run concurrently,
> >> +they compete for shared resources including L3 cache.
> >> +
> >> +The Cache allocation enables more cache resources to be made
> >> +available for higher priority applications based on guidance from
> >> +the execution environment.
> >> +
> >> +The architecture also allows dynamically changing these subsets
> >> +during runtime to further optimize the performance of the higher
> >> +priority application with minimal degradation to the low priority app.
> >> +Additionally, resources can be rebalanced for system throughput benefit.
> >> +
> >> +This technique may be useful in managing large computer systems
> >> +with large L3 caches. Examples may be large servers running
> >> +instances of webservers or database servers. In such complex
> >> +systems, these subsets can be used for more careful placing of the
> >> +available cache resources.
> >> +
> >> +1.3 Cache allocation implementation overview
> >> +--------------------------------------------
> >> +
> >> +Kernel implements a cgroup subsystem to support cache allocation.
> >> +
> >> +Each cgroup has a CLOSid <-> CBM (cache bit mask) mapping.
> >> +A CLOS (Class of Service) is represented by a CLOSid. The CLOSid is
> >> +internal to the kernel and not exposed to the user. Each cgroup
> >> +would have one CBM and would just represent one cache 'subset'.
> >> +
> >> +The cgroup follows the cgroup hierarchy; mkdir and adding tasks to
> >> +the cgroup never fail. When a child cgroup is created it inherits the
> >> +CLOSid and the CBM from its parent. When a user changes the default
> >> +CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
> >> +used before. Changing the 'l3_cache_mask' may fail with -ENOSPC once
> >> +the kernel runs out of the maximum number of CLOSids it can support.
> >> +Users can create as many cgroups as they want, but having different
> >> +CBMs at the same time is restricted by the maximum number of CLOSids
> >> +(multiple cgroups can have the same CBM).
> >> +The kernel maintains a CLOSid<->CBM mapping which keeps a reference
> >> +counter for each cgroup using a CLOSid.
> >> +
> >> +The tasks in the cgroup would get to fill the L3 cache represented
> >> +by the cgroup's 'l3_cache_mask' file.
> >> +
> >> +Root directory would have all available bits set in 'l3_cache_mask'
> >> +file by default.
> >> +
> >> +Each RDT cgroup directory has the following files. Some of them may
> >> +be a part of common RDT framework or be specific to RDT sub-features
> >> +like cache allocation.
> >> +
> >> + - intel_rdt.l3_cache_mask: The cache bitmask (CBM) is represented
> >> + by this file. The bitmask must be contiguous and has a minimum
> >> + length of 1 or 2 bits depending on the SKU.
> >> +
> >> +1.4 Assignment of CBM,CLOS
> >> +--------------------------
> >> +
> >> +The 'l3_cache_mask' needs to be a subset of the parent node's
> >> +'l3_cache_mask'. Any contiguous subset of these bits (with a minimum
> >> +of 2 bits on hsw SKUs) may be set to indicate the cache mapping
> >> +desired. The 'l3_cache_mask' between 2 directories can overlap. The
> >> +'l3_cache_mask' would represent the cache 'subset' of the Cache
> >> +allocation cgroup. For example, on a system with 16 max cbm bits, if
> >> +the directory has the least significant 4 bits set in its
> >> +'l3_cache_mask' (meaning the 'l3_cache_mask' is just 0xf), it would
> >> +be allocated the right quarter of the last level cache, which means
> >> +the tasks belonging to this Cache allocation cgroup can use the
> >> +right quarter of the cache to fill. If it has the most significant
> >> +8 bits set, it would be allocated the left half of the cache (8 bits
> >> +out of 16 represents 50%).
> >> +
> >> +The cache portion defined in the CBM file is available to all tasks
> >> +within the cgroup to fill, and these tasks are not allowed to
> >> +allocate space in other parts of the cache.
> >> +
> >> +1.5 Scheduling and Context Switch
> >> +---------------------------------
> >> +
> >> +During context switch, the kernel implements this by writing the
> >> +CLOSid (internally maintained by the kernel) of the cgroup to which
> >> +the task belongs to the CPU's IA32_PQR_ASSOC MSR. The MSR is only
> >> +written when there is a change in the CLOSid for the CPU, in order
> >> +to minimize the latency incurred during context switch.
> >> +
> >> +The following considerations are done for the PQR MSR write so that
> >> +it has minimal impact on scheduling hot path:
> >> +- This path does not exist on any non-Intel platforms.
> >> +- On Intel platforms, this would not exist by default unless
> >> +CGROUP_RDT is enabled.
> >> +- It remains a no-op when CGROUP_RDT is enabled but the Intel
> >> +hardware does not support the feature.
> >> +- When the feature is available, it still remains a no-op until the
> >> +user manually creates a cgroup *and* assigns a new cache mask. Since
> >> +a child node inherits its parent's cache mask, cgroup creation by
> >> +itself has no scheduling hot path impact.
> >> +- Per-CPU PQR values are cached and the MSR is only written when a
> >> +task with a different PQR value is scheduled on the CPU.
> >> +Typically, if the task groups are bound to be scheduled on a set of
> >> +CPUs, the number of MSR writes is greatly reduced.
> >> +
> >> +2. Usage examples and syntax
> >> +============================
> >> +
> >> +To check if Cache allocation was enabled on your system:
> >> +
> >> +  dmesg | grep -i intel_rdt
> >> +
> >> +should output: intel_rdt: Max bitmask length: xx, Max ClosIds: xx
> >> +The length of the l3_cache_mask and the number of CLOSids depend on
> >> +the system you use.
> >> +
> >> +Also, /proc/cpuinfo would show rdt (if RDT is enabled) and cat_l3
> >> +(if L3 cache allocation is enabled).
> >> +
> >> +The following commands mount the cache allocation cgroup subsystem
> >> +and create 2 directories. Please refer to
> >> +Documentation/cgroups/cgroups.txt for details about how to use
> >> +cgroups.
> >> +
> >> + cd /sys/fs/cgroup
> >> + mkdir rdt
> >> + mount -t cgroup -ointel_rdt intel_rdt /sys/fs/cgroup/rdt
> >> + cd rdt
> >> +
> >> +Create 2 rdt cgroups
> >> +
> >> + mkdir group1
> >> + mkdir group2
> >> +
> >> +Following are some of the files in the directory:
> >> +
> >> + ls
> >> + rdt.l3_cache_mask
> >> + tasks
> >> +
> >> +Say the cache is 2MB and the CBM supports 16 bits; then the setting
> >> +below allocates the 'right quarter' (512KB) of the cache to group2.
> >> +
> >> +Edit the CBM for group2 to set the least significant 4 bits. This
> >> +allocates the 'right quarter' of the cache.
> >> +
> >> + cd group2
> >> + /bin/echo 0xf > rdt.l3_cache_mask
> >> +
> >> +
> >> +Edit the CBM for group2 to set the least significant 8 bits. This
> >> +allocates the right half of the cache to 'group2'.
> >> +
> >> + cd group2
> >> + /bin/echo 0xff > rdt.l3_cache_mask
> >> +
> >> +Assign tasks to the group2
> >> +
> >> + /bin/echo PID1 > tasks
> >> + /bin/echo PID2 > tasks
> >> +
> >> + Meaning threads PID1 and PID2 now get to fill the 'right half'
> >> + of the cache as they belong to cgroup group2.
> >> +
> >> +Create a group under group2
> >> +
> >> + cd group2
> >> + mkdir group21
> >> + cat rdt.l3_cache_mask
> >> + 0xff - inherits the parent's mask.
> >> +
> >> + /bin/echo 0xfff > rdt.l3_cache_mask - throws an error as the
> >> + mask has to be a subset of the parent's mask
> >> +
> >> +In order to restrict RDT cgroups to a specific set of CPUs, rdt
> >> +can be co-mounted with cpusets.
> >> --
> >> 1.9.1
> >
> > Vikas,
> >
> > Can you give an example of comounting with cpusets? What do you mean
> > by restrict RDT cgroups to specific set of CPUs?
>
> I was going to edit the documentation soon as i see a lot of feedback on the
> same. It may have caused confusion.
>
> I mean just pinning down tasks to a set of cpus. This does not mean we make the
> cache exclusive to the tasks..
>
> >
> > Another limitation of this interface is that it assumes the task <->
> > control group assignment is pertinent, that is:
> >
> > | taskgroup, L3 policy|:
> >
> > | taskgroupA, 50% L3 exclusive |,
> > | taskgroupB, 50% L3 |,
> > | taskgroupC, 50% L3 |.
> >
> > Whenever taskgroup A is empty (that is no runnable task in it), you
> > waste 50% of
> > L3 cache.
>
> Cgroup masks can always overlap , and hence wont have exclusive cache
> allocation.
>
> >
> > I think this problem and the similar problem of L3 reservation with
> > CPU isolation can be solved in this way: whenever a task from cgroupE
> > with exclusive way access is migrated to a new die, impose the
> > exclusivity (by removing access to that way by other cgroups).
> >
> > Whenever cgroupE has zero tasks, remove exclusivity (by allowing other
> > cgroups to use the exclusive ways of it).
>
> Same comment as above - Cgroup masks can always overlap and other cgroups
> can allocate the same cache , and hence wont have exclusive cache allocation.
[Auld, Will] You can define all the cbm to provide one clos with an exclusive area
>
> So natuarally the cgroup with tasks would get to use the cache if it has the same
> mask (say representing 50% of cache in your example) as others .
[Auld, Will] automatic adjustment of the cbm make me nervous. There are times
when we want to limit the cache for a process independent of whether there is
lots of unused cache.
> (assume there are 8 bits max cbm)
> cgroupa - mask - 0xf
> cgroupb - mask - 0xf . Now if cgroupa has no tasks , cgroupb naturally gets all
> the cache.
>
> Thanks,
> Vikas
>
> >
> > I'll cook a patch.
> >
> >
> >
> >
> >
On Wed, Jul 01, 2015 at 03:21:08PM -0700, Vikas Shivappa wrote:
> diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
> index 3ad426c..78df3d7 100644
> --- a/arch/x86/include/asm/intel_rdt.h
> +++ b/arch/x86/include/asm/intel_rdt.h
> @@ -4,10 +4,16 @@
> #ifdef CONFIG_CGROUP_RDT
>
> #include <linux/cgroup.h>
> +#include <asm/rdt_common.h>
> +
> #define MAX_CBM_LENGTH 32
> #define IA32_L3_CBM_BASE 0xc90
> #define CBM_FROM_INDEX(x) (IA32_L3_CBM_BASE + x)
>
> +DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
You don't think this should be in rdt_common.h ?
> diff --git a/arch/x86/include/asm/rdt_common.h b/arch/x86/include/asm/rdt_common.h
> new file mode 100644
> index 0000000..01502c5
> --- /dev/null
> +++ b/arch/x86/include/asm/rdt_common.h
> @@ -0,0 +1,25 @@
> +#ifndef _X86_RDT_H_
> +#define _X86_RDT_H_
> +
> +#define MSR_IA32_PQR_ASSOC 0x0c8f
> +
> +/**
> + * struct intel_pqr_state - State cache for the PQR MSR
> + * @rmid: The cached Resource Monitoring ID
> + * @closid: The cached Class Of Service ID
> + * @rmid_usecnt: The usage counter for rmid
> + *
> + * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
> + * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
> + * contains both parts, so we need to cache them.
> + *
> + * The cache also helps to avoid pointless updates if the value does
> + * not change.
> + */
> +struct intel_pqr_state {
> + u32 rmid;
> + u32 closid;
> + int rmid_usecnt;
> +};
> +
> +#endif
So why not call this file PQR something or other? That's all there is.
> diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
> index 751bf4b..9149577 100644
> --- a/arch/x86/include/asm/switch_to.h
> +++ b/arch/x86/include/asm/switch_to.h
> @@ -8,6 +8,9 @@ struct tss_struct;
> void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
> struct tss_struct *tss);
>
> +#include <asm/intel_rdt.h>
> +#define finish_arch_switch(prev) intel_rdt_sched_in()
Right, so please stuff that in __switch_to(), I think I can kill
finish_arch_switch() entirely.
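For reference, a minimal sketch of what that hook boils down to, based on
the intel_pqr_state cache quoted above (task_closid() is a hypothetical
helper returning the CLOSid of the task's cgroup):

	static inline void intel_rdt_sched_in(void)
	{
		struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
		u32 closid = task_closid(current);	/* hypothetical helper */

		/* PQR is cached; only write the MSR when the CLOSid changes */
		if (closid != state->closid) {
			state->closid = closid;
			wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, closid);
		}
	}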
On Wed, Jul 01, 2015 at 03:21:09PM -0700, Vikas Shivappa wrote:
> +/*
> + * cbm_update_msrs() - Updates all the existing IA32_L3_MASK_n MSRs
> + * which are one per CLOSid except IA32_L3_MASK_0 on the current package.
> + */
> +static void cbm_update_msrs(void *info)
> +{
> + int maxid = boot_cpu_data.x86_cache_max_closid;
> + unsigned int i;
> +
> + /*
> + * At cpureset, all bits of IA32_L3_MASK_n are set.
> + * The index starts from one as there is no need
> + * to update IA32_L3_MASK_0 as it belongs to root cgroup
> + * whose cache mask is all 1s always.
> + */
> + for (i = 1; i < maxid; i++) {
> + if (ccmap[i].clos_refcnt)
> + cbm_cpu_update((void *)i);
> + }
> +}
> +
> +static inline void intel_rdt_cpu_start(int cpu)
> +{
> + struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
> +
> + state->closid = 0;
> + mutex_lock(&rdt_group_mutex);
> + if (rdt_cpumask_update(cpu))
> + smp_call_function_single(cpu, cbm_update_msrs, NULL, 1);
> + mutex_unlock(&rdt_group_mutex);
> +}
If you were to guard your array with both a mutex and a raw_spinlock
then you can avoid the IPI and use CPU_STARTING.
> +static int intel_rdt_cpu_notifier(struct notifier_block *nb,
> + unsigned long action, void *hcpu)
> +{
> + unsigned int cpu = (unsigned long)hcpu;
> +
> + switch (action) {
> + case CPU_DOWN_FAILED:
> + case CPU_ONLINE:
> + intel_rdt_cpu_start(cpu);
> + break;
> + case CPU_DOWN_PREPARE:
> + intel_rdt_cpu_exit(cpu);
> + break;
> + default:
> + break;
> + }
> +
> + return NOTIFY_OK;
> }
On Wed, Jul 01, 2015 at 03:21:10PM -0700, Vikas Shivappa wrote:
> + boot_cpu_data.x86_cache_max_closid = 4;
> + boot_cpu_data.x86_cache_max_cbm_len = 20;
That's just vile. And I'm surprised it even works, I would've expected
boot_cpu_data to be const.
So the CQM code has paranoid things like:

	max_rmid = MAX_INT;
	for_each_possible_cpu(cpu)
		max_rmid = min(max_rmid, cpu_data(cpu)->x86_cache_max_rmid);
And then uses max_rmid. This has the advantage that if you mix parts in
a multi-socket environment and hotplug socket 0 to a later part with a
bigger {rm,clos}id space, your allocation isn't suddenly too small.
Please do similar things and only ever look at cpu_data once, at init
time.
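A minimal sketch of that, applied to the cache allocation limits (assuming
the x86_cache_max_closid/x86_cache_max_cbm_len fields from the quoted
patches):

	static u32 rdt_max_closid = U32_MAX;
	static u32 rdt_max_cbm_len = U32_MAX;

	static void __init rdt_probe_limits(void)
	{
		int cpu;

		/* read cpu_data once, at init; take the min over all
		 * possible CPUs so a later hotplugged part with a bigger
		 * {rm,clos}id space cannot make the allocation too small */
		for_each_possible_cpu(cpu) {
			struct cpuinfo_x86 *c = &cpu_data(cpu);

			rdt_max_closid = min(rdt_max_closid,
					     (u32)c->x86_cache_max_closid);
			rdt_max_cbm_len = min(rdt_max_cbm_len,
					      (u32)c->x86_cache_max_cbm_len);
		}
	}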
On Wed, Jul 01, 2015 at 03:21:10PM -0700, Vikas Shivappa wrote:
> + /*
> + * Probe test for Haswell CPUs.
Maybe elucidate and say: Probe for Haswell Server parts.
As said before, probe and test mean roughly the same thing, and the
model test below makes the general 'Haswell CPUs' false, because there's
at least 3 other models that are also Haswell.
> + */
> + if (c->x86 == 0x6 && c->x86_model == 0x3f)
> + return cache_alloc_hsw_probe();
> +
On Wed, Jul 01, 2015 at 03:21:02PM -0700, Vikas Shivappa wrote:
> +/*
> + * Temporary cpumask used during hot cpu notification handling. The usage
> + * is serialized by hot cpu locks.
> + */
> +static cpumask_t tmp_cpumask;
So the problem with this is that it's 512 bytes on your general distro
config. And this patch set includes at least 3 of them.
So you've just shot 1k5 bytes of .data for no reason.
I know tglx whacked you over the head for this, but is this really worth
it? I mean, nobody sane should care about hotplug performance, so who
cares if we iterate a bunch of cpus on the abysmal slow path called
hotplug.
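One way to drop the static masks entirely is plain iteration on that slow
path; a sketch, using the one-MSR-owner-per-package scheme the rdt code
keeps (topology helpers are standard, the function name is made up):

	static int rdt_pick_sibling(int dead_cpu)
	{
		int cpu;

		/* first other online CPU on the same package becomes the
		 * new owner of the package's MSR updates */
		for_each_online_cpu(cpu) {
			if (cpu != dead_cpu &&
			    topology_physical_package_id(cpu) ==
			    topology_physical_package_id(dead_cpu))
				return cpu;
		}
		return -1;	/* last CPU on this package */
	}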
On Fri, Jul 24, 2015 at 11:28:22AM -0700, Vikas Shivappa wrote:
>
>
> On Fri, 24 Jul 2015, Thomas Gleixner wrote:
>
> >On Wed, 1 Jul 2015, Vikas Shivappa wrote:
> >>Cache allocation patches(dependent on prep patches) adds a cgroup
> >>subsystem to support the new Cache Allocation feature found in future
> >>Intel Xeon Intel processors. Cache Allocation is a sub-feature with in
> >>Resource Director Technology(RDT) feature. RDT which provides support to
> >>control sharing of platform resources like L3 cache.
> >
> >Just a few general observations:
> >
> > 1) The changelogs need some loving care.
>
> Will edit the changelogs and send changes.
Please apply the same feedback I have to the documentation patch to your
Changelogs, they're equally horrid.
I've not yet fully read them, but at least try and make them better
before I do.
On Wed, Jul 29, 2015 at 01:28:38AM +0000, Auld, Will wrote:
> > > Whenever cgroupE has zero tasks, remove exclusivity (by allowing other
> > > cgroups to use the exclusive ways of it).
> >
> > Same comment as above - Cgroup masks can always overlap and other cgroups
> > can allocate the same cache , and hence wont have exclusive cache allocation.
>
> [Auld, Will] You can define all the cbm to provide one clos with an exclusive area
>
> >
> > So natuarally the cgroup with tasks would get to use the cache if it has the same
> > mask (say representing 50% of cache in your example) as others .
>
> [Auld, Will] automatic adjustment of the cbm make me nervous. There are times
> when we want to limit the cache for a process independent of whether there is
> lots of unused cache.
How about this:
desiredclos (closid p1 p2 p3 p4)
1 1 0 0 0
2 0 0 0 1
3 0 1 1 0
p means part.
closid 1 is an exclusive cgroup.
closid 2 is a "cache hog" class.
closid 3 is "default closid".
Desiredclos is what user has specified.
Transition 1: desiredclos --> effectiveclos
Clean all bits of unused closid's
(that must be updated whenever a
closid1 cgroup goes from empty->nonempty
and vice-versa).
effectiveclos (closid p1 p2 p3 p4)
1 0 0 0 0
2 0 0 0 1
3 0 1 1 0
Transition 2: effectiveclos --> expandedclos
expandedclos (closid p1 p2 p3 p4)
1 0 0 0 0
2 0 0 0 1
3 1 1 1 0
Then you have different inplacecos for each
CPU (see pseudo-code below):
On the following events.
- task migration to new pCPU:
- task creation:
id = smp_processor_id();
for (part = desiredclos.p1; ...; part++)
/* if my cosid is set and any other
cosid is clear, for the part,
synchronize desiredclos --> inplacecos */
if (part[mycosid] == 1 &&
part[any_othercosid] == 0)
wrmsr(part, desiredclos);
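In C-like form, transition 1 above would amount to something like this
sketch (closid_nr_tasks() and the per-closid part-mask arrays are
hypothetical names for illustration):

	int id;

	for (id = 1; id <= max_closid; id++) {
		if (closid_nr_tasks(id) == 0)
			effectiveclos[id] = 0;	/* clean parts of unused closid */
		else
			effectiveclos[id] = desiredclos[id];
	}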
On Tue, 28 Jul 2015, Auld, Will wrote:
>
>
>> -----Original Message-----
>>
>> Same comment as above - Cgroup masks can always overlap and other cgroups
>> can allocate the same cache , and hence wont have exclusive cache allocation.
>
> [Auld, Will] You can define all the cbm to provide one clos with an exclusive area
Do you mean a CLOS that has all the bits set? We do not support an exclusive
area today. The bits in the masks can overlap, hence cgroups can always share
the same cache allocation.
>
>>
>> So natuarally the cgroup with tasks would get to use the cache if it has the same
>> mask (say representing 50% of cache in your example) as others .
>
> [Auld, Will] automatic adjustment of the cbm make me nervous. There are times
> when we want to limit the cache for a process independent of whether there is
> lots of unused cache.
>
Please see the example below - in general, I just mean the cache masks can
have overlapping bits - it does not matter whether there are tasks in the
cgroup or not.
>
>> (assume there are 8 bits max cbm)
>> cgroupa - mask - 0xf
>> cgroupb - mask - 0xf . Now if cgroupa has no tasks , cgroupb naturally gets all
>> the cache.
>>
>> Thanks,
>> Vikas
On Wed, 29 Jul 2015, Peter Zijlstra wrote:
> On Fri, Jul 24, 2015 at 11:28:22AM -0700, Vikas Shivappa wrote:
>>
>>
>> On Fri, 24 Jul 2015, Thomas Gleixner wrote:
>>
>>> On Wed, 1 Jul 2015, Vikas Shivappa wrote:
>>>> Cache allocation patches(dependent on prep patches) adds a cgroup
>>>> subsystem to support the new Cache Allocation feature found in future
>>>> Intel Xeon Intel processors. Cache Allocation is a sub-feature with in
>>>> Resource Director Technology(RDT) feature. RDT which provides support to
>>>> control sharing of platform resources like L3 cache.
>>>
>>> Just a few general observations:
>>>
>>> 1) The changelogs need some loving care.
>>
>> Will edit the changelogs and send changes.
>
> Please apply the same feedback I have to the documentation patch to your
> Changelogs, they're equally horrid.
I am working on the changelogs and will fix the errors you pointed out in
the documentation.
Thanks,
Vikas
>
> I've not yet fully read them, but at least try and make them better
> before I do.
>
Marcelo,
On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
>
> How about this:
>
> desiredclos (closid p1 p2 p3 p4)
> 1 1 0 0 0
> 2 0 0 0 1
> 3 0 1 1 0
#1 Currently in the rdt cgroup, the root cgroup always has all the bits set
and they can't be changed (because the cgroup hierarchy would by default make
this have all bits set, as all the children need to have a subset of the
root's bitmask). So if the user creates a cgroup and does not put any task in
it, the tasks in the root cgroup could still be using that part of the cache.
That's the reason I say we can't have really 'exclusive' masks.
Or in other words - there is always a desired clos (0) which has all parts
set and which acts like a default pool.
Also the parts can overlap. Please apply this to all the comments below, as
it changes the way they work.
>
> p means part.
I am assuming p = (a contiguous cache capacity bit mask)
> closid 1 is a exclusive cgroup.
> closid 2 is a "cache hog" class.
> closid 3 is "default closid".
>
> Desiredclos is what user has specified.
>
> Transition 1: desiredclos --> effectiveclos
> Clean all bits of unused closid's
> (that must be updated whenever a
> closid1 cgroup goes from empty->nonempty
> and vice-versa).
>
> effectiveclos (closid p1 p2 p3 p4)
> 1 0 0 0 0
> 2 0 0 0 1
> 3 0 1 1 0
>
> Transition 2: effectiveclos --> expandedclos
> expandedclos (closid p1 p2 p3 p4)
> 1 0 0 0 0
> 2 0 0 0 1
> 3 1 1 1 0
> Then you have different inplacecos for each
> CPU (see pseudo-code below):
>
> On the following events.
>
> - task migration to new pCPU:
> - task creation:
>
> id = smp_processor_id();
> for (part = desiredclos.p1; ...; part++)
> /* if my cosid is set and any other
> cosid is clear, for the part,
> synchronize desiredclos --> inplacecos */
> if (part[mycosid] == 1 &&
> part[any_othercosid] == 0)
> wrmsr(part, desiredclos);
>
Currently the root cgroup would have all the bits set, which will act like a
default cgroup where all the otherwise unused parts (assuming they are a
set of contiguous cache capacity bits) will be used.
Otherwise the question with the expandedclos is: who decides to expand the
closx parts to include some of the unused parts? Could that just be the
default root, always?
Thanks,
Vikas
>
On Tue, 28 Jul 2015, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 03:21:07PM -0700, Vikas Shivappa wrote:
>> +static int cbm_validate(struct intel_rdt *ir, unsigned long cbmvalue)
>> +{
>> + struct cgroup_subsys_state *css;
>> + struct intel_rdt *par, *c;
>> + unsigned long *cbm_tmp;
>> + int err = 0;
>> +
>> + if (!cbm_is_contiguous(cbmvalue)) {
>> + pr_err("bitmask should have >= 1 bit and be contiguous\n");
>> + err = -EINVAL;
>> + goto out_err;
>> + }
>
>> +static struct cftype rdt_files[] = {
>> + {
>> + .name = "l3_cache_mask",
>> + .seq_show = intel_cache_alloc_cbm_read,
>> + .write_u64 = intel_cache_alloc_cbm_write,
>> + .mode = 0666,
>
> So this file is world writable? How is the above pr_err() not a DoS ?
Will fix. The mode can be the default and the pr_err can be removed.
Thanks,
Vikas
>
>> + },
>> + { } /* terminate */
>> +};
>
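For reference, a sketch of that fix: drop .mode so the file falls back to
the default root-writable permissions, and silence the user-triggerable
printk (cbm_is_contiguous() and the other names are from the quoted patch):

	static int cbm_validate(struct intel_rdt *ir, unsigned long cbmvalue)
	{
		/* silently reject: no unprivileged-triggerable printk */
		if (!cbm_is_contiguous(cbmvalue))
			return -EINVAL;

		return 0;	/* hierarchy checks from the original elided */
	}

	static struct cftype rdt_files[] = {
		{
			.name = "l3_cache_mask",
			.seq_show = intel_cache_alloc_cbm_read,
			.write_u64 = intel_cache_alloc_cbm_write,
			/* no .mode: default permissions, root-writable only */
		},
		{ }	/* terminate */
	};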
On Tue, 28 Jul 2015, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 03:21:06PM -0700, Vikas Shivappa wrote:
>> +struct clos_cbm_map {
>> + unsigned long cache_mask;
>> + unsigned int clos_refcnt;
>> +};
>
> This structure is not a map at all, it's the map value. Furthermore,
> cache_mask seems a confusing name for the capacity bitmask (CBM).
clos_cbm_table? Since it's really a table which is indexed by the CLOSid.
Will fix the mask names.
Thanks,
Vikas
>
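Something along these lines (a sketch of the proposed rename):

	/* one entry per CLOSid; 'ccmap' becomes a table indexed by CLOSid */
	struct clos_cbm_table {
		unsigned long cbm;		/* capacity bitmask (CBM) */
		unsigned int clos_refcnt;	/* cgroups using this CLOSid */
	};

	static struct clos_cbm_table *cctable;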
On Tue, 28 Jul 2015, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 03:21:06PM -0700, Vikas Shivappa wrote:
>> static int __init intel_rdt_late_init(void)
>> {
>> struct cpuinfo_x86 *c = &boot_cpu_data;
>> + static struct clos_cbm_map *ccm;
>> + u32 maxid, max_cbm_len;
>> + size_t sizeb;
>
> Why 'sizeb' ? 'size' is still available, right?
Will fix; 'int size' should be good enough.
>
>> + int err = 0;
>>
>> - if (!cpu_has(c, X86_FEATURE_CAT_L3))
>> + if (!cpu_has(c, X86_FEATURE_CAT_L3)) {
>> + rdt_root_group.css.ss->disabled = 1;
>> return -ENODEV;
>> + }
>> + maxid = c->x86_cache_max_closid;
>> + max_cbm_len = c->x86_cache_max_cbm_len;
>> +
>> + sizeb = BITS_TO_LONGS(maxid) * sizeof(long);
>> + rdtss_info.closmap = kzalloc(sizeb, GFP_KERNEL);
>> + if (!rdtss_info.closmap) {
>> + err = -ENOMEM;
>> + goto out_err;
>> + }
>> +
>> + sizeb = maxid * sizeof(struct clos_cbm_map);
>> + ccmap = kzalloc(sizeb, GFP_KERNEL);
>> + if (!ccmap) {
>> + kfree(rdtss_info.closmap);
>> + err = -ENOMEM;
>> + goto out_err;
>> + }
>
> What's the expected size of max_closid? iow, how big of an array are you
> in fact allocating here?
The max closid value is 16 bits wide. For systems with large CPU counts this
may be more, but on the EPs I have only seen 20-30.
Thanks,
Vikas
>
On Wed, 29 Jul 2015, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 03:21:08PM -0700, Vikas Shivappa wrote:
>> diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
>> index 3ad426c..78df3d7 100644
>> --- a/arch/x86/include/asm/intel_rdt.h
>> +++ b/arch/x86/include/asm/intel_rdt.h
>> @@ -4,10 +4,16 @@
>> #ifdef CONFIG_CGROUP_RDT
>>
>> #include <linux/cgroup.h>
>> +#include <asm/rdt_common.h>
>> +
>> #define MAX_CBM_LENGTH 32
>> #define IA32_L3_CBM_BASE 0xc90
>> #define CBM_FROM_INDEX(x) (IA32_L3_CBM_BASE + x)
>>
>> +DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
>
> You don't think this should be in rdt_common.h ?
Sounds good.
>
>> diff --git a/arch/x86/include/asm/rdt_common.h b/arch/x86/include/asm/rdt_common.h
>> new file mode 100644
>> index 0000000..01502c5
>> --- /dev/null
>> +++ b/arch/x86/include/asm/rdt_common.h
>> @@ -0,0 +1,25 @@
>> +#ifndef _X86_RDT_H_
>> +#define _X86_RDT_H_
>> +
>> +#define MSR_IA32_PQR_ASSOC 0x0c8f
>> +
>> +/**
>> + * struct intel_pqr_state - State cache for the PQR MSR
>> + * @rmid: The cached Resource Monitoring ID
>> + * @closid: The cached Class Of Service ID
>> + * @rmid_usecnt: The usage counter for rmid
>> + *
>> + * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
>> + * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
>> + * contains both parts, so we need to cache them.
>> + *
>> + * The cache also helps to avoid pointless updates if the value does
>> + * not change.
>> + */
>> +struct intel_pqr_state {
>> + u32 rmid;
>> + u32 closid;
>> + int rmid_usecnt;
>> +};
>> +
>> +#endif
>
> So why not call this file PQR something or other? That's all there is.
Well, I had this to keep things common between the cqm code and the other
code. I see that right now we don't put anything else here other than PQR.
Will fix this now and change it to rdt_common later, when I really add more
things to it.
>
>> diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
>> index 751bf4b..9149577 100644
>> --- a/arch/x86/include/asm/switch_to.h
>> +++ b/arch/x86/include/asm/switch_to.h
>> @@ -8,6 +8,9 @@ struct tss_struct;
>> void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
>> struct tss_struct *tss);
>>
>> +#include <asm/intel_rdt.h>
>> +#define finish_arch_switch(prev) intel_rdt_sched_in()
>
> Right, so please stuff that in __switch_to(),
Will fix.

> I think I can kill
> finish_arch_switch() entirely.
>
What about the other architectures using it?
Thanks,
Vikas
>
On Wed, 29 Jul 2015, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 03:21:10PM -0700, Vikas Shivappa wrote:
>> + /*
>> + * Probe test for Haswell CPUs.
>
> Maybe elucidate and say: Probe for Haswell Server parts.
>
> As said before, probe and test mean roughly the same thing, and the
> model test below makes the general 'Haswell CPUs' false, because there's
> at least 3 other models that are also Haswell.
Will fix,
Thanks,
Vikas
>
>> + */
>> + if (c->x86 == 0x6 && c->x86_model == 0x3f)
>> + return cache_alloc_hsw_probe();
>> +
>
Hello, Vikas.
On Wed, Jul 01, 2015 at 03:21:06PM -0700, Vikas Shivappa wrote:
> This patch adds a cgroup subsystem for Intel Resource Director
> Technology(RDT) feature and Class of service(CLOSid) management which is
> part of common RDT framework. This cgroup would eventually be used by
> all sub-features of RDT and hence be associated with the common RDT
> framework as well as sub-feature specific framework. However current
> patch series only adds cache allocation sub-feature specific code.
>
> When a cgroup directory is created it has a CLOSid associated with it
> which is inherited from its parent. The Closid is mapped to a
> cache_mask which represents the L3 cache allocation to the cgroup.
> Tasks belonging to the cgroup get to fill the cache represented by the
> cache_mask.
First of all, I apologize for being so late. I've been thinking about
it but the thoughts didn't quite crystalize (which isn't to say that
it's very crystal now) until recently. If I understand correctly,
there are a couple suggested use cases for explicitly managing cache
usage.
1. Pinning known hot areas of memory in cache.
2. Explicitly regulating cache usage so that cacheline allocation can
be better than CPU itself doing it.
#1 isn't part of this patchset, right? Is there any plan for working
towards this too?
For #2, it is likely that the targeted use cases would involve threads
of a process or at least cooperating processes and having a simple API
which just goes "this (or the current) thread is only gonna use this
part of cache" would be a lot easier to use and actually beneficial.
I don't really think it makes sense to implement a fully hierarchical
cgroup solution when there isn't the basic affinity-adjusting
interface and it isn't clear whether fully hierarchical resource
distribution would be necessary especially given that the granularity
of the target resource is very coarse.
I can see how cpuset would seem to invite this sort of usage but
cpuset itself is more of an arbitrary outgrowth (regardless of
history) in terms of resource control, and most things controlled by
cpuset already have counterpart interfaces which are readily
accessible to the normal applications.
Given that what the feature allows is restricting usage rather than
granting anything exclusively, a programmable interface wouldn't need
to worry about complications around privileges while being able to
reap most of the benefits in a much easier way. Am I missing
something?
Thanks.
--
tejun
On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
>
>
> Marcello,
>
>
> On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
> >
> >How about this:
> >
> >desiredclos (closid p1 p2 p3 p4)
> > 1 1 0 0 0
> > 2 0 0 0 1
> > 3 0 1 1 0
>
> #1 Currently in the rdt cgroup , the root cgroup always has all the
> bits set and cant be changed (because the cgroup hierarchy would by
> default make this to have all bits as all the children need to have
> a subset of the root's bitmask). So if the user creates a cgroup and
> not put any task in it , the tasks in the root cgroup could be still
> using that part of the cache. Thats the reason i say we can have
> really 'exclusive' masks.
>
> Or in other words - there is always a desired clos (0) which has all
> parts set which acts like a default pool.
>
> Also the parts can overlap. Please apply this for all the below
> comments which will change the way they work.
>
> >
> >p means part.
>
> I am assuming p = (a contiguous cache capacity bit mask)
Yes.
> >closid 1 is a exclusive cgroup.
> >closid 2 is a "cache hog" class.
> >closid 3 is "default closid".
> >
> >Desiredclos is what user has specified.
> >
> >Transition 1: desiredclos --> effectiveclos
> >Clean all bits of unused closid's
> >(that must be updated whenever a
> >closid1 cgroup goes from empty->nonempty
> >and vice-versa).
> >
> >effectiveclos (closid p1 p2 p3 p4)
> > 1 0 0 0 0
> > 2 0 0 0 1
> > 3 0 1 1 0
>
> >
> >Transition 2: effectiveclos --> expandedclos
> >expandedclos (closid p1 p2 p3 p4)
> > 1 0 0 0 0
> > 2 0 0 0 1
> > 3 1 1 1 0
> >Then you have different inplacecos for each
> >CPU (see pseudo-code below):
> >
> >On the following events.
> >
> >- task migration to new pCPU:
> >- task creation:
> >
> > id = smp_processor_id();
> > for (part = desiredclos.p1; ...; part++)
> > /* if my cosid is set and any other
> > cosid is clear, for the part,
> > synchronize desiredclos --> inplacecos */
> > if (part[mycosid] == 1 &&
> > part[any_othercosid] == 0)
> > wrmsr(part, desiredclos);
> >
>
> Currently the root cgroup would have all the bits set which will act
> like a default cgroup where all the otherwise unused parts (assuming
> they are a set of contiguous cache capacity bits) will be used.
>
> Otherwise the question is in the expandedclos - who decides to
> expand the closx parts to include some of the unused parts.. - that
> could just be a default root always ?
Right, so the problem is that for certain closids you might never want
to expand (because doing so would cause data to be cached in a
cache way which might have a high eviction rate in the future).
See the example from Will.
But for the default cache (that is, "unclassified applications")
I suppose it is beneficial to expand in most cases, that is,
to use the maximum amount of cache irrespective of eviction rate, which
is the behaviour that exists now without CAT.
So perhaps a new flag "expand=y/n" can be added to the cgroup
directories... What do you say?
Userspace representation of CAT
-------------------------------
Usage model:
1) measure application performance without L3 cache reservation.
2) measure application perf with L3 cache reservation and
X number of cache ways until desired performance is attained.
Requirements:
1) Persistency of CLOS configuration across hardware. On migration
of operating system or application between different hardware
systems we'd like the following to be maintained:
- exclusive number of bytes (*) reserved to a certain CLOSid.
- shared number of bytes (*) reserved between a certain group
of CLOSid's.
For both code and data, rounded down or up in cache way size.
2) Reasoning:
Different CBM masks in different hardware platforms might be necessary
to specify the same CLOS configuration, in terms of exclusive number of
bytes and shared number of bytes. (cache-way rounded number of bytes).
For example, due to L3 allocation by other hardware entities in certain parts
of the cache it might be necessary to relocate CBM mask to achieve
the same CLOS configuration.
3) Proposed format:
sharedregionK.exclusive - Number of exclusive cache bytes reserved for
shared region.
sharedregionK.excl_data - Number of exclusive cache data bytes reserved for
shared region.
sharedregionK.excl_code - Number of exclusive cache code bytes reserved for
shared region.
sharedregionK.round_down - Round down to cache way bytes from respective number
specification (default is round up).
sharedregionK.expand - y/n - Expand shared region to more cache ways
when available (default N).
cgroupN.exclusive - Number of exclusive L3 cache bytes reserved
for cgroup.
cgroupN.excl_data - Number of exclusive L3 data cache bytes reserved
for cgroup.
cgroupN.excl_code - Number of exclusive L3 code cache bytes reserved
for cgroup.
cgroupN.round_down - Round down to cache way bytes from respective number
specification (default is round up).
cgroupN.expand - y/n - Expand shared region to more cache ways when
available (default N).
cgroupN.shared = { sharedregion1, sharedregion2, ... } (list of shared
regions)
Example 1:
One application with 2M exclusive cache, two applications
with 1M exclusive each, sharing an expansive shared region of 1M.
cgroup1.exclusive = 2M
sharedregion1.exclusive = 1M
sharedregion1.expand = Y
cgroup2.exclusive = 1M
cgroup2.shared = sharedregion1
cgroup3.exclusive = 1M
cgroup3.shared = sharedregion1
Example 2:
3 high performance applications running, one of which is a cache hog
with no cache locality.
cgroup1.exclusive = 8M
cgroup2.exclusive = 8M
cgroup3.exclusive = 512K
cgroup3.round_down = Y
In all cases the default cgroup (which requires no explicit
specification) is expansive and uses the remaining cache
ways, including the ways shared by other hardware entities.
On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
>
>
> Marcello,
>
>
> On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
> >
> >How about this:
> >
> >desiredclos (closid p1 p2 p3 p4)
> > 1 1 0 0 0
> > 2 0 0 0 1
> > 3 0 1 1 0
>
> #1 Currently in the rdt cgroup , the root cgroup always has all the
> bits set and cant be changed (because the cgroup hierarchy would by
> default make this to have all bits as all the children need to have
> a subset of the root's bitmask). So if the user creates a cgroup and
> not put any task in it , the tasks in the root cgroup could be still
> using that part of the cache. Thats the reason i say we can have
> really 'exclusive' masks.
>
> Or in other words - there is always a desired clos (0) which has all
> parts set which acts like a default pool.
>
> Also the parts can overlap. Please apply this for all the below
> comments which will change the way they work.
>
> >
> >p means part.
>
> I am assuming p = (a contiguous cache capacity bit mask)
>
> >closid 1 is a exclusive cgroup.
> >closid 2 is a "cache hog" class.
> >closid 3 is "default closid".
> >
> >Desiredclos is what user has specified.
> >
> >Transition 1: desiredclos --> effectiveclos
> >Clean all bits of unused closid's
> >(that must be updated whenever a
> >closid1 cgroup goes from empty->nonempty
> >and vice-versa).
> >
> >effectiveclos (closid p1 p2 p3 p4)
> > 1 0 0 0 0
> > 2 0 0 0 1
> > 3 0 1 1 0
>
> >
> >Transition 2: effectiveclos --> expandedclos
> >expandedclos (closid p1 p2 p3 p4)
> > 1 0 0 0 0
> > 2 0 0 0 1
> > 3 1 1 1 0
> >Then you have different inplacecos for each
> >CPU (see pseudo-code below):
> >
> >On the following events.
> >
> >- task migration to new pCPU:
> >- task creation:
> >
> > id = smp_processor_id();
> > for (part = desiredclos.p1; ...; part++)
> > /* if my cosid is set and any other
> > cosid is clear, for the part,
> > synchronize desiredclos --> inplacecos */
> > if (part[mycosid] == 1 &&
> > part[any_othercosid] == 0)
> > wrmsr(part, desiredclos);
> >
>
> Currently the root cgroup would have all the bits set which will act
> like a default cgroup where all the otherwise unused parts (assuming
> they are a set of contiguous cache capacity bits) will be used.
Right, but we don't want to place tasks in there in case one cgroup
wants exclusive cache access.
So whenever you want an exclusive cgroup you'd do:
create cgroup-exclusive; reserve desired part of the cache
for it.
create cgroup-default; reserve all cache minus that of cgroup-exclusive
for it.
place tasks that belong to cgroup-exclusive into it.
place all other tasks (including init) into cgroup-default.
Is that right?
On Thu, 30 Jul 2015, Marcelo Tosatti wrote:
> On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
>>
>>
>> Marcello,
>>
>>
>> On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
>>>
>>> How about this:
>>>
>>> desiredclos (closid p1 p2 p3 p4)
>>> 1 1 0 0 0
>>> 2 0 0 0 1
>>> 3 0 1 1 0
>>
>> #1 Currently in the rdt cgroup , the root cgroup always has all the
>> bits set and cant be changed (because the cgroup hierarchy would by
>> default make this to have all bits as all the children need to have
>> a subset of the root's bitmask). So if the user creates a cgroup and
>> not put any task in it , the tasks in the root cgroup could be still
>> using that part of the cache. Thats the reason i say we can have
>> really 'exclusive' masks.
>>
>> Or in other words - there is always a desired clos (0) which has all
>> parts set which acts like a default pool.
>>
>> Also the parts can overlap. Please apply this for all the below
>> comments which will change the way they work.
>
>
>>
>>>
>>> p means part.
>>
>> I am assuming p = (a contiguous cache capacity bit mask)
>>
>>> closid 1 is a exclusive cgroup.
>>> closid 2 is a "cache hog" class.
>>> closid 3 is "default closid".
>>>
>>> Desiredclos is what user has specified.
>>>
>>> Transition 1: desiredclos --> effectiveclos
>>> Clean all bits of unused closid's
>>> (that must be updated whenever a
>>> closid1 cgroup goes from empty->nonempty
>>> and vice-versa).
>>>
>>> effectiveclos (closid p1 p2 p3 p4)
>>> 1 0 0 0 0
>>> 2 0 0 0 1
>>> 3 0 1 1 0
>>
>>>
>>> Transition 2: effectiveclos --> expandedclos
>>> expandedclos (closid p1 p2 p3 p4)
>>> 1 0 0 0 0
>>> 2 0 0 0 1
>>> 3 1 1 1 0
>>> Then you have different inplacecos for each
>>> CPU (see pseudo-code below):
>>>
>>> On the following events.
>>>
>>> - task migration to new pCPU:
>>> - task creation:
>>>
>>> id = smp_processor_id();
>>> for (part = desiredclos.p1; ...; part++)
>>> /* if my cosid is set and any other
>>> cosid is clear, for the part,
>>> synchronize desiredclos --> inplacecos */
>>> if (part[mycosid] == 1 &&
>>> part[any_othercosid] == 0)
>>> wrmsr(part, desiredclos);
>>>
>>
>> Currently the root cgroup would have all the bits set which will act
>> like a default cgroup where all the otherwise unused parts (assuming
>> they are a set of contiguous cache capacity bits) will be used.
>
> Right, but we don't want to place tasks in there in case one cgroup
> wants exclusive cache access.
>
> So whenever you want an exclusive cgroup you'd do:
>
> create cgroup-exclusive; reserve desired part of the cache
> for it.
> create cgroup-default; reserved all cache minus that of cgroup-exclusive
> for it.
>
> place tasks that belong to cgroup-exclusive into it.
> place all other tasks (including init) into cgroup-default.
>
> Is that right?
Yes, you could do that.
You can create cgroups with masks which are exclusive in today's
implementation; it's just that you could also create more cgroups that
overlap the masks again.. iow we don't have an exclusive flag for the
cgroup mask.
Is that a common use case in the server environment, that you need to
prevent other cgroups from using a certain mask? (Since the root user
should control these allocations.. he should know?)
>
>
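For example, with 8 max cbm bits, that recipe would look like this (a
sketch using the interface from the documentation above; note the root
cgroup's mask still has all bits set, so its tasks must really be moved
into 'default' for the split to hold):

	mkdir exclusive default
	/bin/echo 0x0f > exclusive/rdt.l3_cache_mask
	/bin/echo 0xf0 > default/rdt.l3_cache_mask
	# move all existing tasks (including init) into 'default',
	# and only the latency/perf critical tasks into 'exclusive'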
On Thu, Jul 30, 2015 at 04:03:07PM -0700, Vikas Shivappa wrote:
>
>
> On Thu, 30 Jul 2015, Marcelo Tosatti wrote:
>
> >On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
> >>
> >>
> >>Marcello,
> >>
> >>
> >>On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
> >>>
> >>>How about this:
> >>>
> >>>desiredclos (closid p1 p2 p3 p4)
> >>> 1 1 0 0 0
> >>> 2 0 0 0 1
> >>> 3 0 1 1 0
> >>
> >>#1 Currently in the rdt cgroup , the root cgroup always has all the
> >>bits set and cant be changed (because the cgroup hierarchy would by
> >>default make this to have all bits as all the children need to have
> >>a subset of the root's bitmask). So if the user creates a cgroup and
> >>not put any task in it , the tasks in the root cgroup could be still
> >>using that part of the cache. Thats the reason i say we can have
> >>really 'exclusive' masks.
> >>
> >>Or in other words - there is always a desired clos (0) which has all
> >>parts set which acts like a default pool.
> >>
> >>Also the parts can overlap. Please apply this for all the below
> >>comments which will change the way they work.
> >
> >
> >>
> >>>
> >>>p means part.
> >>
> >>I am assuming p = (a contiguous cache capacity bit mask)
> >>
> >>>closid 1 is a exclusive cgroup.
> >>>closid 2 is a "cache hog" class.
> >>>closid 3 is "default closid".
> >>>
> >>>Desiredclos is what user has specified.
> >>>
> >>>Transition 1: desiredclos --> effectiveclos
> >>>Clean all bits of unused closid's
> >>>(that must be updated whenever a
> >>>closid1 cgroup goes from empty->nonempty
> >>>and vice-versa).
> >>>
> >>>effectiveclos (closid p1 p2 p3 p4)
> >>> 1 0 0 0 0
> >>> 2 0 0 0 1
> >>> 3 0 1 1 0
> >>
> >>>
> >>>Transition 2: effectiveclos --> expandedclos
> >>>expandedclos (closid p1 p2 p3 p4)
> >>> 1 0 0 0 0
> >>> 2 0 0 0 1
> >>> 3 1 1 1 0
> >>>Then you have different inplacecos for each
> >>>CPU (see pseudo-code below):
> >>>
> >>>On the following events.
> >>>
> >>>- task migration to new pCPU:
> >>>- task creation:
> >>>
> >>> id = smp_processor_id();
> >>> for (part = desiredclos.p1; ...; part++)
> >>> /* if my cosid is set and any other
> >>> cosid is clear, for the part,
> >>> synchronize desiredclos --> inplacecos */
> >>> if (part[mycosid] == 1 &&
> >>> part[any_othercosid] == 0)
> >>> wrmsr(part, desiredclos);
> >>>
> >>
> >>Currently the root cgroup would have all the bits set which will act
> >>like a default cgroup where all the otherwise unused parts (assuming
> >>they are a set of contiguous cache capacity bits) will be used.
> >
> >Right, but we don't want to place tasks in there in case one cgroup
> >wants exclusive cache access.
> >
> >So whenever you want an exclusive cgroup you'd do:
> >
> >create cgroup-exclusive; reserve desired part of the cache
> >for it.
> >create cgroup-default; reserved all cache minus that of cgroup-exclusive
> >for it.
> >
> >place tasks that belong to cgroup-exclusive into it.
> >place all other tasks (including init) into cgroup-default.
> >
> >Is that right?
>
> Yes you could do that.
>
> You can create cgroups to have masks which are exclusive in todays
> implementation, just that you could also created more cgroups to
> overlap the masks again.. iow we dont have an exclusive flag for the
> cgroup mask.
> Is that a common use case in the server environment that you need to
> prevent other cgroups from using a certain mask ? (since the root
> user should control these allocations .. he should know?)
Yes, there are two known use-cases that have this characteristic:
1) High performance numeric application which has been optimized
to a certain fraction of the cache.
2) Low latency application in multi-application OS.
For both cases exclusive cache access is wanted.
On Thu, Jul 30, 2015 at 03:44:58PM -0400, Tejun Heo wrote:
> Hello, Vikas.
>
> On Wed, Jul 01, 2015 at 03:21:06PM -0700, Vikas Shivappa wrote:
> > This patch adds a cgroup subsystem for Intel Resource Director
> > Technology(RDT) feature and Class of service(CLOSid) management which is
> > part of common RDT framework. This cgroup would eventually be used by
> > all sub-features of RDT and hence be associated with the common RDT
> > framework as well as sub-feature specific framework. However current
> > patch series only adds cache allocation sub-feature specific code.
> >
> > When a cgroup directory is created it has a CLOSid associated with it
> > which is inherited from its parent. The Closid is mapped to a
> > cache_mask which represents the L3 cache allocation to the cgroup.
> > Tasks belonging to the cgroup get to fill the cache represented by the
> > cache_mask.
>
> First of all, I apologize for being so late. I've been thinking about
> it but the thoughts didn't quite crystalize (which isn't to say that
> it's very crystal now) until recently. If I understand correctly,
> there are a couple suggested use cases for explicitly managing cache
> usage.
>
> 1. Pinning known hot areas of memory in cache.
>
> 2. Explicitly regulating cache usage so that cacheline allocation can
> be better than CPU itself doing it.
>
> #1 isn't part of this patchset, right? Is there any plan for working
> towards this too?
>
> For #2, it is likely that the targeted use cases would involve threads
> of a process or at least cooperating processes and having a simple API
> which just goes "this (or the current) thread is only gonna use this
> part of cache" would be a lot easier to use and actually beneficial.
>
> I don't really think it makes sense to implement a fully hierarchical
> cgroup solution when there isn't the basic affinity-adjusting
> interface
What is an "affinity adjusting interface" ? Can you give an example
please?
> and it isn't clear whether fully hierarchical resource
> distribution would be necessary especially given that the granularity
> of the target resource is very coarse.
As I see it, the benefit of the hierarchical structure for the CAT
configuration is simply to organize sharing of cache ways in subtrees
- two cgroups can share a given cache way only if they have a common
parent.
That is the only benefit. Vikas, please correct me if I'm wrong.
> I can see that how cpuset would seem to invite this sort of usage but
> cpuset itself is more of an arbitrary outgrowth (regardless of
> history) in terms of resource control and most things controlled by
> cpuset already have countepart interface which is readily accessible
> to the normal applications.
I can't parse that phrase (due to ignorance). Please educate.
> Given that what the feature allows is restricting usage rather than
> granting anything exclusively, a programmable interface wouldn't need
> to worry about complications around priviledges
What complications about privileges do you refer to?
> while being able to reap most of the benefits in an a lot easier way.
> Am I missing something?
The interface does allow for exclusive cache usage by an application.
Please read the Intel manual, section 17, it is very instructive.
The use cases we have now are the following:
Scenario 1: Consider a system with 4 high performance applications
running, one of which is a streaming application that manages a very
large address space from which it reads and writes as it does its processing.
As such the application will use all the cache it can get but does
not need much if any cache. So, it spoils the cache for everyone for no
gain on its own. In this case we'd like to constrain it to the
smallest possible amount of cache while at the same time constraining
the other 3 applications to stay out of this thrashed area of the
cache.
Scenario 2: We have a numeric application that has been highly optimized
to fit in the L2 cache (2M for example). We want to ensure that its
cached data does not get flushed from the cache hierarchy while it is
scheduled out. In this case we exclusively allocate enough L3 cache to
hold all of the L2 cache.
Scenario 3: Latency sensitive application executing in a shared
environment, where memory to handle an event must be in L3 cache
for latency requirements to be met.
On Thu, Jul 30, 2015 at 05:08:13PM -0300, Marcelo Tosatti wrote:
> On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
> >
> >
> > Marcello,
> >
> >
> > On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
> > >
> > >How about this:
> > >
> > >desiredclos (closid p1 p2 p3 p4)
> > > 1 1 0 0 0
> > > 2 0 0 0 1
> > > 3 0 1 1 0
> >
> > #1 Currently in the rdt cgroup , the root cgroup always has all the
> > bits set and cant be changed (because the cgroup hierarchy would by
> > default make this to have all bits as all the children need to have
> > a subset of the root's bitmask). So if the user creates a cgroup and
> > not put any task in it , the tasks in the root cgroup could be still
> > using that part of the cache. Thats the reason i say we can have
> > really 'exclusive' masks.
> >
> > Or in other words - there is always a desired clos (0) which has all
> > parts set which acts like a default pool.
> >
> > Also the parts can overlap. Please apply this for all the below
> > comments which will change the way they work.
> >
> > >
> > >p means part.
> >
> > I am assuming p = (a contiguous cache capacity bit mask)
>
> Yes.
>
> > >closid 1 is a exclusive cgroup.
> > >closid 2 is a "cache hog" class.
> > >closid 3 is "default closid".
> > >
> > >Desiredclos is what user has specified.
> > >
> > >Transition 1: desiredclos --> effectiveclos
> > >Clean all bits of unused closid's
> > >(that must be updated whenever a
> > >closid1 cgroup goes from empty->nonempty
> > >and vice-versa).
> > >
> > >effectiveclos (closid p1 p2 p3 p4)
> > > 1 0 0 0 0
> > > 2 0 0 0 1
> > > 3 0 1 1 0
> >
> > >
> > >Transition 2: effectiveclos --> expandedclos
> > >expandedclos (closid p1 p2 p3 p4)
> > > 1 0 0 0 0
> > > 2 0 0 0 1
> > > 3 1 1 1 0
> > >Then you have different inplacecos for each
> > >CPU (see pseudo-code below):
> > >
> > >On the following events.
> > >
> > >- task migration to new pCPU:
> > >- task creation:
> > >
> > > id = smp_processor_id();
> > > for (part = desiredclos.p1; ...; part++)
> > > /* if my cosid is set and any other
> > > cosid is clear, for the part,
> > > synchronize desiredclos --> inplacecos */
> > > if (part[mycosid] == 1 &&
> > > part[any_othercosid] == 0)
> > > wrmsr(part, desiredclos);
> > >
> >
> > Currently the root cgroup would have all the bits set which will act
> > like a default cgroup where all the otherwise unused parts (assuming
> > they are a set of contiguous cache capacity bits) will be used.
> >
> > Otherwise the question is in the expandedclos - who decides to
> > expand the closx parts to include some of the unused parts.. - that
> > could just be a default root always ?
>
> Right, so the problem is for certain closid's you might never want
> to expand (because doing so would cause data to be cached in a
> cache way which might have high eviction rate in the future).
> See the example from Will.
>
> But for the default cache (that is "unclassified applications"
> i suppose it is beneficial to expand in most cases, that is,
> use maximum amount of cache irrespective of eviction rate, which
> is the behaviour that exists now without CAT).
>
> So perhaps a new flag "expand=y/n" can be added to the cgroup
> directories... What do you say?
>
> Userspace representation of CAT
> -------------------------------
>
> Usage model:
> 1) measure application performance without L3 cache reservation.
> 2) measure application perf with L3 cache reservation and
> X number of cache ways until desired performance is attained.
>
> Requirements:
> 1) Persistency of CLOS configuration across hardware. On migration
> of operating system or application between different hardware
> systems we'd like the following to be maintained:
> - exclusive number of bytes (*) reserved to a certain CLOSid.
> - shared number of bytes (*) reserved between a certain group
> of CLOSid's.
>
> For both code and data, rounded down or up in cache way size.
>
> 2) Reasoning:
> Different CBM masks in different hardware platforms might be necessary
> to specify the same CLOS configuration, in terms of exclusive number of
> bytes and shared number of bytes. (cache-way rounded number of bytes).
> For example, due to L3 allocation by other hardware entities in certain parts
> of the cache it might be necessary to relocate CBM mask to achieve
> the same CLOS configuration.
>
> 3) Proposed format:
>
> sharedregionK.exclusive - Number of exclusive cache bytes reserved for
> shared region.
> sharedregionK.excl_data - Number of exclusive cache data bytes reserved for
> shared region.
> sharedregionK.excl_code - Number of exclusive cache code bytes reserved for
> shared region.
> sharedregionK.round_down - Round down to cache way bytes from respective number
> specification (default is round up).
> sharedregionK.expand - y/n - Expand shared region to more cache ways
> when available (default N).
>
> cgroupN.exclusive - Number of exclusive L3 cache bytes reserved
> for cgroup.
> cgroupN.excl_data - Number of exclusive L3 data cache bytes reserved
> for cgroup.
> cgroupN.excl_code - Number of exclusive L3 code cache bytes reserved
> for cgroup.
> cgroupN.round_down - Round down to cache way bytes from respective number
> specification (default is round up).
> cgroupN.expand - y/n - Expand shared region to more cache ways when
> available (default N).
> cgroupN.shared = { sharedregion1, sharedregion2, ... } (list of shared
> regions)
>
> Example 1:
> One application with 2M exclusive cache, two applications
> with 1M exclusive each, sharing an expansive shared region of 1M.
>
> cgroup1.exclusive = 2M
>
> sharedregion1.exclusive = 1M
> sharedregion1.expand = Y
>
> cgroup2.exclusive = 1M
> cgroup2.shared = sharedregion1
>
> cgroup3.exclusive = 1M
> cgroup3.shared = sharedregion1
>
> Example 2:
> 3 high performance applications running, one of which is a cache hog
> with no cache locality.
>
> cgroup1.exclusive = 8M
> cgroup2.exclusive = 8M
>
> cgroup3.exclusive = 512K
> cgroup3.round_down = Y
>
> In all cases the default cgroup (which requires no explicit
> specification) is expansive and uses the remaining cache
> ways, including the ways shared by other hardware entities.
>
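For readers following the desiredclos -> effectiveclos -> expandedclos
transitions quoted above, here is a minimal, self-contained C sketch of
the two transitions. All names, the refcount and the expand flag are
hypothetical, invented for illustration; this is not code from the patch
series:

#include <stdbool.h>
#include <stdint.h>

#define NR_CLOSIDS	4
#define NR_PARTS	4
#define PART_MASK	((1u << NR_PARTS) - 1)

/* Per-closid bitmask over cache "parts": bit p set => closid uses part p. */
static uint16_t desiredclos[NR_CLOSIDS];	/* what the user specified */
static uint16_t effectiveclos[NR_CLOSIDS];	/* desired, minus empty closids */
static uint16_t expandedclos[NR_CLOSIDS];	/* effective, plus unused parts */

static int  closid_refcnt[NR_CLOSIDS];	/* hypothetical: tasks per closid */
static bool closid_expand[NR_CLOSIDS];	/* hypothetical "expand=y" flag */

/* Transition 1: clear all bits of closids that currently have no tasks. */
static void compute_effectiveclos(void)
{
	for (int c = 0; c < NR_CLOSIDS; c++)
		effectiveclos[c] = closid_refcnt[c] ? desiredclos[c] : 0;
}

/* Transition 2: hand parts used by nobody to the expandable closids. */
static void compute_expandedclos(void)
{
	uint16_t used = 0;

	for (int c = 0; c < NR_CLOSIDS; c++)
		used |= effectiveclos[c];

	for (int c = 0; c < NR_CLOSIDS; c++) {
		expandedclos[c] = effectiveclos[c];
		if (closid_expand[c])
			expandedclos[c] |= ~used & PART_MASK;
	}
}

The per-CPU inplacecos would then be synchronized from expandedclos in
the task migration/creation hooks, as in the quoted pseudo-code.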
Moving this discussion from another thread to this one, sorry.
> >Second question:
> >Do you envision any use case which the placement of cache
> >and not the quantity of cache is a criteria for decision?
> >That is, two cases with the same amount of cache for each CLOSid,
> >but with different locations inside the cache?
> >(except sharing of ways by two CLOSid's, of course).
> >
>
> cbm max - 16 bits. 000f - allocate right quarter. f000 - allocate
> left quarter.. ? extend the case to any number of valid contiguous
> bits.
Yes, the hardware allows you to specify the same number of cache ways
for a given COSid in different cache locations. The question was whether
you envision any use case where different locations make a difference.
I can't see any (except for hardware users of cache ways, which the OS
could control automatically: all it needs to know from the user
configuration is whether a given cgroup is an "exclusive user" of a
number of cache ways, in which case those ways should not be handed out
to anyone else).
This information is crucial because if there are no foreseeable use
cases then that can simplify the interface enormously (the kernel could
handle the issues that my "userspace interface" proposal is handling).
On Thu, 30 Jul 2015, Tejun Heo wrote:
> Hello, Vikas.
>
> On Wed, Jul 01, 2015 at 03:21:06PM -0700, Vikas Shivappa wrote:
>> This patch adds a cgroup subsystem for Intel Resource Director
>> Technology(RDT) feature and Class of service(CLOSid) management which is
>> part of common RDT framework. This cgroup would eventually be used by
>> all sub-features of RDT and hence be associated with the common RDT
>> framework as well as sub-feature specific framework. However current
>> patch series only adds cache allocation sub-feature specific code.
>>
>> When a cgroup directory is created it has a CLOSid associated with it
>> which is inherited from its parent. The Closid is mapped to a
>> cache_mask which represents the L3 cache allocation to the cgroup.
>> Tasks belonging to the cgroup get to fill the cache represented by the
>> cache_mask.
>
> First of all, I apologize for being so late. I've been thinking about
> it but the thoughts didn't quite crystalize (which isn't to say that
> it's very crystal now) until recently. If I understand correctly,
> there are a couple suggested use cases for explicitly managing cache
> usage.
>
> 1. Pinning known hot areas of memory in cache.
No, cache allocation doesn't do this (nor is it expected to).
>
> 2. Explicitly regulating cache usage so that cacheline allocation can
> be better than CPU itself doing it.
Yes, this is what we want to do using cache alloc.
>
> #1 isn't part of this patchset, right? Is there any plan for working
> towards this too?
Cache allocation is not intended to do #1, so we don't have to support this.
>
> For #2, it is likely that the targeted use cases would involve threads
> of a process or at least cooperating processes and having a simple API
> which just goes "this (or the current) thread is only gonna use this
> part of cache" would be a lot easier to use and actually beneficial.
>
> I don't really think it makes sense to implement a fully hierarchical
> cgroup solution when there isn't the basic affinity-adjusting
> interface and it isn't clear whether fully hierarchical resource
> distribution would be necessary especially given that the granularity
> of the target resource is very coarse.
>
> I can see that how cpuset would seem to invite this sort of usage but
> cpuset itself is more of an arbitrary outgrowth (regardless of
> history) in terms of resource control and most things controlled by
> cpuset already have countepart interface which is readily accessible
> to the normal applications.
Yes, today we don't have an alternative interface, but we can always build
one. We simply don't have one because until now the Linux kernel just
tolerated the degradation that could occur from cache contention, and this
is the first interface we are building.
>
> Given that what the feature allows is restricting usage rather than
> granting anything exclusively, a programmable interface wouldn't need
> to worry about complications around priviledges while being able to
> reap most of the benefits in an a lot easier way. Am I missing
> something?
>
For #2, with the intel_rdt cgroup we develop a framework where the user can
regulate the cache allocation. A user space app could also eventually use
this as underlying support and build things on top of it depending on
enterprise or other requirements.
A typical use case would be an application which is continuously polluting
the cache (a low priority app from a cache usage perspective) by bringing
in data from the network (a copying/streaming app), and thereby not letting
an app which has a legitimate requirement for cache usage (a high priority
app) use the cache.
We need to map a group of tasks to a particular class of service, and a way
for the user to specify the cache capacity for that class of service. We
also need a default cgroup which could hold all the tasks and use all the
cache.
The hierarchical interface can be used by the user as required and does not
really interfere with allocating exclusive blocks of cache; all the user
needs to do is make sure the masks don't overlap. The user can configure
the masks to be exclusive of the others.
But note that overlapping masks provide a very easy way to share cache
usage, which is what you may want to do sometimes. The current
implementation can be easily extended to *enforce* exclusive capacity masks
between child nodes if required. But since the super user is expected to be
using this, the need for that may be limited, or the user can take care of
it as I said above. Some of the emails may have given the impression that
we cannot do exclusive allocations, but that's not true at all: we can
configure the masks to have exclusive cache blocks for different cgroups,
it is just left to the user (see the sketch below).
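Since exclusivity is left to the user in the current design, a
configuration tool could perform the overlap check itself. A hedged
userspace sketch (the function and the example masks are invented here;
the kernel interface in this series does not enforce this):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Return true if no two capacity bitmasks share a cache way. */
static bool cbms_are_exclusive(const uint32_t *cbm, int n)
{
	uint32_t seen = 0;

	for (int i = 0; i < n; i++) {
		if (cbm[i] & seen)
			return false;	/* overlaps an earlier cgroup's ways */
		seen |= cbm[i];
	}
	return true;
}

int main(void)
{
	uint32_t cbm[] = { 0xff00, 0x00f0, 0x000f };	/* example masks */

	printf("exclusive: %d\n", cbms_are_exclusive(cbm, 3));
	return 0;
}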
We did have a lot of discussions during the design and V3, if you remember,
and settled on using a separate controller... Below is one such thread
where we discussed the same. Don't want to loop through it again with this
already full marathon patch :)
https://lkml.org/lkml/2015/1/27/846
A quick copy from the V3 thread:
"
> proposal but was removed as we did not get agreement on lkml.
>
> the original lkml thread is here from 10/2014 for your reference -
> https://lkml.org/lkml/2014/10/16/568
Yeap, I followed that thread and this being a separate controller
definitely makes a lot more sense.
"
Thanks,
Vikas
> Thanks.
>
> --
> tejun
>
To summarize the ever-growing thread:
1. The rdt_cgroup can be used to configure exclusive cache bitmaps for the
child nodes, which can be used for the scenarios which Marcelo mentions.
Simple examples which were mentioned:
max bitmask length: 16, hence the full mask is 0xffff
groupx_realtime - 0xff
group2_systemtraffic - 0xf : put a lot of tasks from the root node to here,
or whichever one is offending and thrashing.
groupy_<mytraffic> - 0x0f
Now groupx has its own area of cache that can be used by the
realtime/(specific scenario) apps. Similarly configure any groupy.
2. Can the masks let you specify in which cache ways the cache is
allocated? - No, this is implementation specific as mentioned in the SDM.
So when we configure a mask, you really don't know which ways or which
exact lines are used on which SKUs. We also may not see any use case which
needs apps to allocate cache in specific areas, and the h/w does not
support this either.
3. Letting the user specify size in bytes instead of a bitmap: we have
already gone through this discussion in older versions. The user can simply
check the size of the total cache and understand what map corresponds to
what size. I don't see a special need for an interface that takes the cache
size in bytes and then rounds off; the user could instead apply the
roundoff values beforehand, or in other words it happens automatically when
he specifies the bitmask (see the sketch after this list).
ex: find the cache size from /proc/cpuinfo - say 20MB
bitmask max - 0xfffff
This means the roundoff (chunk) size supported is only 1MB, so when you
specify the mask, say 0x3 (2MB), that's already taken care of. The same
applies to percentages - the masks automatically round off the percentage.
Please note that this is quite different from the way we can allocate
memory in bytes, and it needs to be treated differently given that the
hardware provides the interface in a particular way.
4. Letting the kernel automatically extend the bitmap may affect a lot of
other things and would need a lot of heuristics - note that we have
overlapping masks. This interface lets the super-user control the cache
allocation, and it may be very confusing for the user if he has allocated a
cache mask and suddenly, from under the floor, the kernel changes it.
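The rounding in item 3 can be made concrete with a small userspace sketch
that converts a byte request into a contiguous bitmask, rounding up to the
chunk size. The helper is invented for illustration; the cache size and
mask width would really come from /proc/cpuinfo and CPUID:

#include <stdint.h>
#include <stdio.h>

/*
 * Convert a byte request into a contiguous CBM, rounding up to the
 * chunk size (cache_size / max_bits). Illustrative userspace code,
 * not from the patch series.
 */
static uint32_t bytes_to_cbm(uint64_t bytes, uint64_t cache_size, int max_bits)
{
	uint64_t chunk = cache_size / max_bits;
	uint64_t nbits = (bytes + chunk - 1) / chunk;	/* round up */

	if (nbits == 0 || nbits > (uint64_t)max_bits || nbits >= 32)
		return 0;	/* invalid request */
	return (1u << nbits) - 1;	/* contiguous mask from bit 0 */
}

int main(void)
{
	/* 2MB of a 20MB cache with a 20-bit mask -> 0x3, as in the text */
	printf("cbm = 0x%x\n", bytes_to_cbm(2ull << 20, 20ull << 20, 20));
	return 0;
}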
Thanks,
Vikas
On Fri, 31 Jul 2015, Marcelo Tosatti wrote:
> On Thu, Jul 30, 2015 at 04:03:07PM -0700, Vikas Shivappa wrote:
>>
>>
>> On Thu, 30 Jul 2015, Marcelo Tosatti wrote:
>>
>>> On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
>>>>
>>>>
>>>> Marcello,
>>>>
>>>>
>>>> On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
>>>>>
>>>>> How about this:
>>>>>
>>>>> desiredclos (closid p1 p2 p3 p4)
>>>>> 1 1 0 0 0
>>>>> 2 0 0 0 1
>>>>> 3 0 1 1 0
>>>>
>>>> #1 Currently in the rdt cgroup , the root cgroup always has all the
>>>> bits set and cant be changed (because the cgroup hierarchy would by
>>>> default make this to have all bits as all the children need to have
>>>> a subset of the root's bitmask). So if the user creates a cgroup and
>>>> not put any task in it , the tasks in the root cgroup could be still
>>>> using that part of the cache. That's the reason I say we can't have
>>>> really 'exclusive' masks.
>>>>
>>>> Or in other words - there is always a desired clos (0) which has all
>>>> parts set which acts like a default pool.
>>>>
>>>> Also the parts can overlap. Please apply this for all the below
>>>> comments which will change the way they work.
>>>
>>>
>>>>
>>>>>
>>>>> p means part.
>>>>
>>>> I am assuming p = (a contiguous cache capacity bit mask)
>>>>
>>>>> closid 1 is a exclusive cgroup.
>>>>> closid 2 is a "cache hog" class.
>>>>> closid 3 is "default closid".
>>>>>
>>>>> Desiredclos is what user has specified.
>>>>>
>>>>> Transition 1: desiredclos --> effectiveclos
>>>>> Clean all bits of unused closid's
>>>>> (that must be updated whenever a
>>>>> closid1 cgroup goes from empty->nonempty
>>>>> and vice-versa).
>>>>>
>>>>> effectiveclos (closid p1 p2 p3 p4)
>>>>> 1 0 0 0 0
>>>>> 2 0 0 0 1
>>>>> 3 0 1 1 0
>>>>
>>>>>
>>>>> Transition 2: effectiveclos --> expandedclos
>>>>> expandedclos (closid p1 p2 p3 p4)
>>>>> 1 0 0 0 0
>>>>> 2 0 0 0 1
>>>>> 3 1 1 1 0
>>>>> Then you have different inplacecos for each
>>>>> CPU (see pseudo-code below):
>>>>>
>>>>> On the following events.
>>>>>
>>>>> - task migration to new pCPU:
>>>>> - task creation:
>>>>>
>>>>> id = smp_processor_id();
>>>>> for (part = desiredclos.p1; ...; part++)
>>>>> /* if my cosid is set and any other
>>>>> cosid is clear, for the part,
>>>>> synchronize desiredclos --> inplacecos */
>>>>> if (part[mycosid] == 1 &&
>>>>> part[any_othercosid] == 0)
>>>>> wrmsr(part, desiredclos);
>>>>>
>>>>
>>>> Currently the root cgroup would have all the bits set which will act
>>>> like a default cgroup where all the otherwise unused parts (assuming
>>>> they are a set of contiguous cache capacity bits) will be used.
>>>
>>> Right, but we don't want to place tasks in there in case one cgroup
>>> wants exclusive cache access.
>>>
>>> So whenever you want an exclusive cgroup you'd do:
>>>
>>> create cgroup-exclusive; reserve desired part of the cache
>>> for it.
>>> create cgroup-default; reserved all cache minus that of cgroup-exclusive
>>> for it.
>>>
>>> place tasks that belong to cgroup-exclusive into it.
>>> place all other tasks (including init) into cgroup-default.
>>>
>>> Is that right?
>>
>> Yes you could do that.
>>
>> You can create cgroups to have masks which are exclusive in todays
>> implementation, just that you could also created more cgroups to
>> overlap the masks again.. iow we dont have an exclusive flag for the
>> cgroup mask.
>> Is that a common use case in the server environment that you need to
>> prevent other cgroups from using a certain mask ? (since the root
>> user should control these allocations .. he should know?)
>
> Yes, there are two known use-cases that have this characteristic:
>
> 1) High performance numeric application which has been optimized
> to a certain fraction of the cache.
>
> 2) Low latency application in multi-application OS.
>
> For both cases exclusive cache access is wanted.
>
>
On Fri, Jul 31, 2015 at 09:41:58AM -0700, Vikas Shivappa wrote:
>
> To summarize the ever growing thread :
>
> 1. the rdt_cgroup can be used to configure exclusive cache bitmaps
> for the child nodes which can be used for the scenarios which
> Marcello mentions.
>
> simle examples which were mentioned :
> max bitmask length : 16 . hence full mask is 0xffff
> groupx_realtime - 0xff .
> group2_systemtraffic - 0xf. : put a lot of tasks from root node to
> here or which ever is offending and thrashing.
> groupy_<mytraffic> - 0x0f
>
> Now the groupx has its own area of cache that can used by the
> realtime/(specific scenario) apps. Similarly configure any groupy.
>
> 2. Can the maps can let you specify which cache ways ways the cache
> is allocated ? - No , this is implementation specific as mentioned
> in the SDM. So when we configure a mask , you really dont know which
> ways or which exact lines are used on which SKUs .. We may not see
> any use case as well which is needed for apps to allocate cache in
> specific areas and the h/w does not support this as well.
OK, can you comment on whether the proposed userspace interface addresses
all your use cases?
> 3. Letting the user specify size in bytes instead of bitmap : we
> have already gone through this discussion in older versions. The
> user can simply check the size of the total cache and understand
> what map could be what size. I dont see a special need to specify an
> interface to enter the cache in bytes and then round off - user
> could instead use the roundoff values before hand or iow it
> automatically does when he specifies the bitmask.
When you move from processor A with CBM bitmask format X to hardware B
with CBM bitmask format Y, and the formats Y and X are different, you
have to manually adjust the format.
Please reply to the userspace proposal; the problem is very explicit
there.
> ex: find cache size from /proc/cpuinfo. - say 20MB
> bitmask max - 0xfffff.
>
> This means the roundoff(chunk) size supported is only 1MB , so when
> you specify the mask say 0x3(2MB) thats already taken care of.
> Same applies to percentage - the masks automatically round off the percentage.
>
> Please note that this is quite different from the way we can
> allocate memory in bytes and needs to be treated differently given
> that the hardware provides interface in a particular way.
>
> 4. Letting the kernel automatically extend the bitmap may affect a
> lot of other things
Let's talk about them. What other things?
> and will need a lot of heuristics - note that we
> have overlapping masks.
I proposed a way to avoid heuristics by exposing whether the cgroup is
"expandable" or not and asked for your input.
We really do not want to waste cache if we can avoid it.
> This interface lets the super-user control
> the cache allocation and it may be very confusing for the user if he
> has allocated a cache mask and suddenly from under the floor the
> kernel changes it.
Agree.
>
> Thanks,
> Vikas
>
>
On Wed, 29 Jul 2015, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 03:21:02PM -0700, Vikas Shivappa wrote:
>> +/*
>> + * Temporary cpumask used during hot cpu notificaiton handling. The usage
>> + * is serialized by hot cpu locks.
>> + */
>> +static cpumask_t tmp_cpumask;
>
> So the problem with this is that its 512 bytes on your general distro
> config. And this patch set includes at least 3 of them
>
> So you've just shot 1k5 bytes of .data for no reason.
>
> I know tglx whacked you over the head for this, but is this really worth
> it? I mean, nobody sane should care about hotplug performance, so who
> cares if we iterate a bunch of cpus on the abysmal slow path called
> hotplug.
We did this so that we don't keep looping over every cpu to check whether
it belongs to a particular package, especially since that cost is linear as
more and more cpus get added on large systems.
Would it not make sense to use the mask which tells you all the cores on a
particular core's package? I realized I could use topology_core_cpumask
only after seeing tglx's pseudo code, because the name is definitely
confusing; earlier I assumed such a mask didn't exist and hence we had to
just loop through.
I know you pointed out not to put the mask on the stack, but the static
usage cost should be reasonable to avoid the cost of looping through all
the available cpus.
Also, didn't tglx say that seeing crappy code doesn't mean we pile more
crap on top? So why contradict that?
>
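For reference, the approach Vikas describes - keying off the incoming
cpu's package siblings rather than scanning all online cpus - can be
sketched roughly as below. This mirrors the idea under discussion, not
the exact patch; locking and the MSR update are elided:

#include <linux/cpumask.h>
#include <linux/topology.h>

/* One "owner" cpu per package; owners receive the MSR updates. */
static cpumask_t rdt_cpumask;
/* Scratch mask; usage serialized by hotplug locking, as discussed. */
static cpumask_t tmp_cpumask;

static int rdt_cpumask_update(int cpu)
{
	/* Intersect this package's cpus with the current owners. */
	cpumask_and(&tmp_cpumask, topology_core_cpumask(cpu), &rdt_cpumask);
	if (cpumask_empty(&tmp_cpumask)) {
		/* First online cpu in the package becomes its owner. */
		cpumask_set_cpu(cpu, &rdt_cpumask);
		return 1;	/* caller must sync the MSRs on this cpu */
	}
	return 0;
}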
On Wed, 29 Jul 2015, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 03:21:09PM -0700, Vikas Shivappa wrote:
>> +/*
>> + * cbm_update_msrs() - Updates all the existing IA32_L3_MASK_n MSRs
>> + * which are one per CLOSid except IA32_L3_MASK_0 on the current package.
>> + */
>> +static void cbm_update_msrs(void *info)
>> +{
>> + int maxid = boot_cpu_data.x86_cache_max_closid;
>> + unsigned int i;
>> +
>> + /*
>> + * At cpureset, all bits of IA32_L3_MASK_n are set.
>> + * The index starts from one as there is no need
>> + * to update IA32_L3_MASK_0 as it belongs to root cgroup
>> + * whose cache mask is all 1s always.
>> + */
>> + for (i = 1; i < maxid; i++) {
>> + if (ccmap[i].clos_refcnt)
>> + cbm_cpu_update((void *)i);
>> + }
>> +}
>> +
>> +static inline void intel_rdt_cpu_start(int cpu)
>> +{
>> + struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
>> +
>> + state->closid = 0;
>> + mutex_lock(&rdt_group_mutex);
>> + if (rdt_cpumask_update(cpu))
>> + smp_call_function_single(cpu, cbm_update_msrs, NULL, 1);
>> + mutex_unlock(&rdt_group_mutex);
>> +}
>
> If you were to guard your array with both a mutex and a raw_spinlock
> then you can avoid the IPI and use CPU_STARTING.
CPU_ONLINE was just good enough, as the tasks would then be ready to be
scheduled; iow, it fires at just the right time. Could we avoid using the
interrupt-disabled window? We don't really need the *interrupt disabled*
CPU_STARTING notification - we can leave that for more important or
lock-free code. Or is this change not a big concern?
>
>> +static int intel_rdt_cpu_notifier(struct notifier_block *nb,
>> + unsigned long action, void *hcpu)
>> +{
>> + unsigned int cpu = (unsigned long)hcpu;
>> +
>> + switch (action) {
>> + case CPU_DOWN_FAILED:
>> + case CPU_ONLINE:
>> + intel_rdt_cpu_start(cpu);
>> + break;
>> + case CPU_DOWN_PREPARE:
>> + intel_rdt_cpu_exit(cpu);
>> + break;
>> + default:
>> + break;
>> + }
>> +
>> + return NOTIFY_OK;
>> }
>
On Thu, Jul 30, 2015 at 05:08:13PM -0300, Marcelo Tosatti wrote:
>3) Proposed format:
>
A few questions from a random listener; I apologise if some of them are in
the wrong place due to me missing some information from past threads.
I'm not sure whether the following format proposal is an internal structure
or what's going to be in cgroups. If this is a user-visible interface, I
think it could be a little less detailed.
>sharedregionK.exclusive - Number of exclusive cache bytes reserved for
> shared region.
>sharedregionK.excl_data - Number of exclusive cache data bytes reserved for
> shared region.
>sharedregionK.excl_code - Number of exclusive cache code bytes reserved for
> shared region.
>sharedregionK.round_down - Round down to cache way bytes from respective number
> specification (default is round up).
>sharedregionK.expand - y/n - Expand shared region to more cache ways
> when available (default N).
>
>cgroupN.exclusive - Number of exclusive L3 cache bytes reserved
> for cgroup.
>cgroupN.excl_data - Number of exclusive L3 data cache bytes reserved
> for cgroup.
>cgroupN.excl_code - Number of exclusive L3 code cache bytes reserved
> for cgroup.
By exclusive, you mean that it's exclusive to the tasks in this
cgroup?
The thing is that we must differentiate between limiting some processes
from hogging the cache (like example 2 below) and making some part of the
cache exclusive for a particular application (example 1 below).
I just hope we won't need to add something similar to 'isolcpus=' just so
we can make sure none of the tasks in the root cgroup can spoil the part of
the cache we need to have exclusive.
I'm not sure creating a new subgroup and moving all the tasks there would
work; it certainly is not possible with other cgroups, like the cpuset
cgroup mentioned beforehand.
I also don't quite fully understand how the co-mounting with the
cpuset cgroup should work, but that's not design-related.
One more question, how does this work on systems with multiple L3
caches (e.g. large NUMA node systems)? I'm guessing if the process is
running only on some CPUs, the wrmsr() will be called on that
particular CPU(s), right?
>cgroupN.round_down - Round down to cache way bytes from respective number
> specification (default is round up).
>cgroupN.expand - y/n - Expand shared region to more cache ways when
> available (default N).
>cgroupN.shared = { sharedregion1, sharedregion2, ... } (list of shared
>regions)
>
>Example 1:
>One application with 2M exclusive cache, two applications
>with 1M exclusive each, sharing an expansive shared region of 1M.
>
>cgroup1.exclusive = 2M
>
>sharedregion1.exclusive = 1M
>sharedregion1.expand = Y
>
>cgroup2.exclusive = 1M
>cgroup2.shared = sharedregion1
>
>cgroup3.exclusive = 1M
>cgroup3.shared = sharedregion1
>
>Example 2:
>3 high performance applications running, one of which is a cache hog
>with no cache locality.
>
>cgroup1.exclusive = 8M
>cgroup2.exclusive = 8M
>
>cgroup3.exclusive = 512K
>cgroup3.round_down = Y
>
>In all cases the default cgroup (which requires no explicit
>specification) is expansive and uses the remaining cache
>ways, including the ways shared by other hardware entities.
>
Hello,
On Fri, Jul 31, 2015 at 12:12:18PM -0300, Marcelo Tosatti wrote:
> > I don't really think it makes sense to implement a fully hierarchical
> > cgroup solution when there isn't the basic affinity-adjusting
> > interface
>
> What is an "affinity adjusting interface" ? Can you give an example
> please?
Something similar to sched_setaffinity(). Just a syscall / prctl or
whatever programmable interface which sets per-task attribute.
> > and it isn't clear whether fully hierarchical resource
> > distribution would be necessary especially given that the granularity
> > of the target resource is very coarse.
>
> As i see it, the benefit of the hierarchical structure to the CAT
> configuration is simply to organize sharing of cache ways in subtrees
> - two cgroups can share a given cache way only if they have a common
> parent.
>
> That is the only benefit. Vikas, please correct me if i'm wrong.
cgroups is not a superset of a programmable interface. It has distinct
disadvantages and, even with hierarchy support, is not a substitute for a
regular syscall-like interface. I don't think it makes sense to go full-on
hierarchical cgroups when we don't have the basic interface, which is
likely to cover many use cases better. A syscall-like interface combined
with a tool similar to taskset would cover a lot in a more accessible way.
> > I can see that how cpuset would seem to invite this sort of usage but
> > cpuset itself is more of an arbitrary outgrowth (regardless of
> > history) in terms of resource control and most things controlled by
> > cpuset already have countepart interface which is readily accessible
> > to the normal applications.
>
> I can't parse that phrase (due to ignorance). Please educate.
Hmmm... consider CPU affinity. cpuset definitely is useful for some
use cases as a management tool especially if the workloads are not
cooperative or delegated; however, it's no substitute for a proper
syscall interface and it'd be silly to try to replace that with
cpuset.
> > Given that what the feature allows is restricting usage rather than
> > granting anything exclusively, a programmable interface wouldn't need
> > to worry about complications around priviledges
>
> What complications about priviledges you refer to?
It's not granting exclusive access, so individual user applications can be
allowed to do whatever they want as long as the issuer has enough privilege
over the target task.
> > while being able to reap most of the benefits in an a lot easier way.
> > Am I missing something?
>
> The interface does allow for exclusive cache usage by an application.
> Please read the Intel manual, section 17, it is very instructive.
For that, it'd have to require some CAP, but I think just having a
restrictive interface in the style of CPU or NUMA affinity would go a
long way.
> The use cases we have now are the following:
>
> Scenario 1: Consider a system with 4 high performance applications
> running, one of which is a streaming application that manages a very
> large address space from which it reads and writes as it does its processing.
> As such the application will use all the cache it can get but does
> not need much if any cache. So, it spoils the cache for everyone for no
> gain on its own. In this case we'd like to constrain it to the
> smallest possible amount of cache while at the same time constraining
> the other 3 applications to stay out of this thrashed area of the
> cache.
A tool in the style of taskset should be enough for the above
scenario.
> Scenario 2: We have a numeric application that has been highly optimized
> to fit in the L2 cache (2M for example). We want to ensure that its
> cached data does not get flushed from the cache hierarchy while it is
> scheduled out. In this case we exclusively allocate enough L3 cache to
> hold all of the L2 cache.
>
> Scenario 3: Latency sensitive application executing in a shared
> environment, where memory to handle an event must be in L3 cache
> for latency requirements to be met.
Either isolate CPUs or run other stuff with affinity restricted.
cpuset-style allocation can be easier for things like this but that
should be an addition on top not the one and only interface. How is
it gonna handle if multiple threads of a process want to restrict
cache usages to avoid stepping on each other's toes? Delegate the
subdirectory and let the process itself open it and write to files to
configure when there isn't even a way to atomically access the
process's own directory or a way to synchronize against migration?
cgroups may be an okay management interface but a horrible
programmable interface.
Sure, if this turns out to be as important as cpu or numa affinity and
gets widely used creating management burden in many use cases, we sure
can add cgroups controller for it but that's a remote possibility at
this point and the current attempt is over-engineering solution for
problems which haven't been shown to exist. Let's please first
implement something simple and easy to use.
Thanks.
--
tejun
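To give the suggestion above a concrete shape, here is a hedged sketch of
what such a syscall-like per-task interface could look like. The prctl
commands below do not exist anywhere; they are invented purely to
illustrate the style of interface being argued for:

#include <stdint.h>
#include <sys/prctl.h>

/* Hypothetical commands; NOT a real kernel API, illustration only. */
#define PR_SET_CACHE_CBM	0x53430001
#define PR_GET_CACHE_CBM	0x53430002

/* Restrict the calling task's L3 fills to the ways set in 'cbm'. */
static inline int set_task_cbm(uint32_t cbm)
{
	return prctl(PR_SET_CACHE_CBM, (unsigned long)cbm, 0, 0, 0);
}

/* Retrieve the calling task's current mask into '*cbm'. */
static inline int get_task_cbm(uint32_t *cbm)
{
	return prctl(PR_GET_CACHE_CBM, (unsigned long)cbm, 0, 0, 0);
}

A taskset-style wrapper around such calls would cover scenario 1 above
without any cgroup involvement.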
Hello, Vikas.
On Fri, Jul 31, 2015 at 09:24:58AM -0700, Vikas Shivappa wrote:
> Yes today we dont have an alternative interface - but we can always build
> one. We simply dont have it because till now Linux kernel just tolerated the
> degradation that could have occured by cache contention and this is the
> first interface we are building.
But we're doing it the wrong way around. You can do most of what the
cgroup interface can do with a syscall-like interface, with some
inconvenience. The other way doesn't really work. As I wrote in the
other reply, cgroups is a horrible programmable interface and we don't
want individual applications to interact with it directly and CAT's
use cases most definitely include each application programming its own
cache mask. Let's build something which is simple and can be used
easily first. If this turns out to be widely useful and an overall
management capability over it is wanted, we can consider cgroups then.
Thanks.
--
tejun
On Sun, Aug 02, 2015 at 05:48:07PM +0200, Martin Kletzander wrote:
> On Thu, Jul 30, 2015 at 05:08:13PM -0300, Marcelo Tosatti wrote:
> Few questions from a random listener, I apologise if some of them are
> in a wrong place due to me missing some information from past threads.
>
> I'm not sure whether the following proposal to the format is the
> internal structure or what's going to be in cgroups. If this is
> user-visible interface, I think it could be a little less detailed.
User visible interface. The idea is to have userspace code that performs
[ user visible specification ] ----> [ cbm bitmasks on present hardware
platform ]
In systemd, probably (or whatever is between the user and the cgroup
interface).
> >sharedregionK.exclusive - Number of exclusive cache bytes reserved for
> > shared region.
> >sharedregionK.excl_data - Number of exclusive cache data bytes reserved for
> > shared region.
> >sharedregionK.excl_code - Number of exclusive cache code bytes reserved for
> > shared region.
> >sharedregionK.round_down - Round down to cache way bytes from respective number
> > specification (default is round up).
> >sharedregionK.expand - y/n - Expand shared region to more cache ways
> > when available (default N).
> >
> >cgroupN.exclusive - Number of exclusive L3 cache bytes reserved
> > for cgroup.
> >cgroupN.excl_data - Number of exclusive L3 data cache bytes reserved
> > for cgroup.
> >cgroupN.excl_code - Number of exclusive L3 code cache bytes reserved
> > for cgroup.
>
> By exclusive, you mean that it's exclusive to the tasks in this
> cgroup?
Correct.
> The thing is that we must differentiate between limiting some
> process's from hogging the memory (like example 2 below) and making
> some part of the cache exclusive for particular application (example 1
> below).
AFAICS there is no difference, because both require exclusive cache
access: the hog wants exclusive access because any other user of its
cachelines will be penalized; the high performance application wants
exclusive cache access because any other user of its cachelines will
penalize it.
Where do you see the need to differentiate?
> I just hope we won't need to add something similar to 'isolcpus=' just
> so we can make sure none of the tasks in the root cgroup can spoil the
> part of the cache we need to have exclusive.
>
> I'm not sure creating a new subgroup and moving all the tasks there
> would work, It certainly is not possible with other cgroups, like the
> cpuset cgroup mentioned beforehand.
Why not? You should be able to place all tasks in a given cgroup (trying
to set up systemd to do that now...).
> I also don't quite fully understand how the co-mounting with the
> cpuset cgroup should work, but that's not design-related.
Neither do I.
> One more question, how does this work on systems with multiple L3
> caches (e.g. large NUMA node systems)? I'm guessing if the process is
> running only on some CPUs, the wrmsr() will be called on that
> particular CPU(s), right?
Not in the current patchset, that has to be fixed...
> >cgroupN.round_down - Round down to cache way bytes from respective number
> > specification (default is round up).
> >cgroupN.expand - y/n - Expand shared region to more cache ways when
> > available (default N).
> >cgroupN.shared = { sharedregion1, sharedregion2, ... } (list of shared
> >regions)
> >
> >Example 1:
> >One application with 2M exclusive cache, two applications
> >with 1M exclusive each, sharing an expansive shared region of 1M.
> >
> >cgroup1.exclusive = 2M
> >
> >sharedregion1.exclusive = 1M
> >sharedregion1.expand = Y
> >
> >cgroup2.exclusive = 1M
> >cgroup2.shared = sharedregion1
> >
> >cgroup3.exclusive = 1M
> >cgroup3.shared = sharedregion1
> >
> >Example 2:
> >3 high performance applications running, one of which is a cache hog
> >with no cache locality.
> >
> >cgroup1.exclusive = 8M
> >cgroup2.exclusive = 8M
> >
> >cgroup3.exclusive = 512K
> >cgroup3.round_down = Y
> >
> >In all cases the default cgroup (which requires no explicit
> >specification) is expansive and uses the remaining cache
> >ways, including the ways shared by other hardware entities.
> >
Hello Marcelo/Martin,
Like I mentioned, let me modify the documentation to better explain the
usage. Things like updating each package's bitmask are already in the
patches. Let's discuss offline, come up with a well defined proposal for
changes if any, and then update that in the next series. We seem to be just
looping over the same items.
Thanks,
Vikas
On Sun, Aug 02, 2015 at 12:23:25PM -0400, Tejun Heo wrote:
> Hello,
>
> On Fri, Jul 31, 2015 at 12:12:18PM -0300, Marcelo Tosatti wrote:
> > > I don't really think it makes sense to implement a fully hierarchical
> > > cgroup solution when there isn't the basic affinity-adjusting
> > > interface
> >
> > What is an "affinity adjusting interface" ? Can you give an example
> > please?
>
> Something similar to sched_setaffinity(). Just a syscall / prctl or
> whatever programmable interface which sets per-task attribute.
You really want to specify the cache configuration "at once":
having process-A exclusive access to 2MB of cache at all times,
and process-B 4MB exclusive, means you can't have process-C use 4MB of
cache exclusively (consider 8MB cache machine).
But the syscall allows processes to set and retrieve
> > > and it isn't clear whether fully hierarchical resource
> > > distribution would be necessary especially given that the granularity
> > > of the target resource is very coarse.
> >
> > As I see it, the benefit of the hierarchical structure to the CAT
> > configuration is simply to organize sharing of cache ways in subtrees
> > - two cgroups can share a given cache way only if they have a common
> > parent.
> >
> > That is the only benefit. Vikas, please correct me if i'm wrong.
>
> cgroups is not a superset of a programmable interface. It has
> distinctive disadvantages and is not a substitute, with hierarchy support,
> for a regular syscall-like interface. I don't think it makes sense
> to go full-on hierarchical cgroups when we don't have a basic interface
> which is likely to cover many use cases better. A syscall-like
> interface combined with a tool similar to taskset would cover a lot in
> a more accessible way.
How are you going to specify sharing of portions of cache by two sets
of tasks with a syscall interface?
> > > I can see how cpuset would seem to invite this sort of usage but
> > > cpuset itself is more of an arbitrary outgrowth (regardless of
> > > history) in terms of resource control and most things controlled by
> > > cpuset already have a counterpart interface which is readily accessible
> > > to the normal applications.
> >
> > I can't parse that phrase (due to ignorance). Please educate.
>
> Hmmm... consider CPU affinity. cpuset definitely is useful for some
> use cases as a management tool especially if the workloads are not
> cooperative or delegated; however, it's no substitute for a proper
> syscall interface and it'd be silly to try to replace that with
> cpuset.
>
> > > Given that what the feature allows is restricting usage rather than
> > > granting anything exclusively, a programmable interface wouldn't need
> > > to worry about complications around privileges
> >
> > What complications about privileges do you refer to?
>
> It's not granting exclusive access, so individual user applications
> can be allowed to do whatever they want to do as long as the issuer has
> enough privilege over the target task.
Privilege management with the cgroup system: changing cache allocation
requires privilege over cgroups.
Privilege management with a system call interface: applications
could be allowed to reserve up to a certain percentage of the cache.
> > > while being able to reap most of the benefits in a much easier way.
> > > Am I missing something?
> >
> > The interface does allow for exclusive cache usage by an application.
> > Please read the Intel manual, section 17, it is very instructive.
>
> For that, it'd have to require some CAP but I think just having
> restrictive interface in the style of CPU or NUMA affinity would go a
> long way.
>
> > The use cases we have now are the following:
> >
> > Scenario 1: Consider a system with 4 high performance applications
> > running, one of which is a streaming application that manages a very
> > large address space from which it reads and writes as it does its processing.
> > As such the application will use all the cache it can get but does
> > not need much if any cache. So, it spoils the cache for everyone for no
> > gain on its own. In this case we'd like to constrain it to the
> > smallest possible amount of cache while at the same time constraining
> > the other 3 applications to stay out of this thrashed area of the
> > cache.
>
> A tool in the style of taskset should be enough for the above
> scenario.
>
> > Scenario 2: We have a numeric application that has been highly optimized
> > to fit in the L2 cache (2M for example). We want to ensure that its
> > cached data does not get flushed from the cache hierarchy while it is
> > scheduled out. In this case we exclusively allocate enough L3 cache to
> > hold all of the L2 cache.
> >
> > Scenario 3: Latency sensitive application executing in a shared
> > environment, where memory to handle an event must be in L3 cache
> > for latency requirements to be met.
>
> Either isolate CPUs or run other stuff with affinity restricted.
>
> cpuset-style allocation can be easier for things like this but that
> should be an addition on top not the one and only interface. How is
> it gonna handle if multiple threads of a process want to restrict
> cache usages to avoid stepping on each other's toes? Delegate the
> subdirectory and let the process itself open it and write to files to
> configure when there isn't even a way to atomically access the
> process's own directory or a way to synchronize against migration?
One would preconfigure that in advance - but you are right, a
syscall interface is more flexible in that respect.
> cgroups may be an okay management interface but a horrible
> programmable interface.
>
> Sure, if this turns out to be as important as cpu or numa affinity and
> gets widely used creating management burden in many use cases, we sure
> can add cgroups controller for it but that's a remote possibility at
> this point and the current attempt is over-engineering solution for
> problems which haven't been shown to exist. Let's please first
> implement something simple and easy to use.
>
> Thanks.
>
> --
> tejun
On Wed, 29 Jul 2015, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 03:21:10PM -0700, Vikas Shivappa wrote:
>> + boot_cpu_data.x86_cache_max_closid = 4;
>> + boot_cpu_data.x86_cache_max_cbm_len = 20;
>
> That's just vile. And I'm surprised it even works, I would've expected
> boot_cpu_data to be const.
This is updated only once, as the CPUID enumeration is not done for hsw servers. For all
the hsw servers these numbers are always the same and hence hardcoded; the
comment says it is hardcoded, and I will update the comment to include this info as well.
>
> So the CQM code has paranoid things like:
>
> max_rmid = INT_MAX;
> for_each_possible_cpu(cpu)
> max_rmid = min(max_rmid, cpu_data(cpu)->x86_cache_max_rmid);
>
> And then uses max_rmid. This has the advantage that if you mix parts in
> a multi-socket environment and hotplug socket 0 to a later part with a
> bigger {rm,clos}id, your allocation isn't suddenly too small.
>
> Please do similar things and only ever look at cpu_data once, at init
> time.
Cache alloc is under CPU_SUP_INTEL and all the cores should have the same
features. We use the BSP structure in cache alloc, which should have the minimum
features.
Thanks,
Vikas
>
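For illustration, a minimal sketch of the init-time scan Peter describes, applied to the CAT limits (the rdt_init_limits() helper is hypothetical; only the x86_cache_max_closid/x86_cache_max_cbm_len fields come from the patch above):

#include <linux/cpumask.h>
#include <linux/kernel.h>
#include <asm/processor.h>

static u32 max_closid, max_cbm_len;

/*
 * Illustrative sketch, not the patch's actual code: look at cpu_data
 * only once, at init, and keep the minimum capability across all
 * possible CPUs, so hotplugging a part with a bigger closid/cbm_len
 * later cannot make an existing allocation too small.
 */
static void __init rdt_init_limits(void)
{
	int cpu;

	max_closid  = U32_MAX;
	max_cbm_len = U32_MAX;
	for_each_possible_cpu(cpu) {
		struct cpuinfo_x86 *c = &cpu_data(cpu);

		max_closid  = min_t(u32, max_closid,  c->x86_cache_max_closid);
		max_cbm_len = min_t(u32, max_cbm_len, c->x86_cache_max_cbm_len);
	}
}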
On Mon, Aug 03, 2015 at 05:32:50PM -0300, Marcelo Tosatti wrote:
> On Sun, Aug 02, 2015 at 12:23:25PM -0400, Tejun Heo wrote:
> > Hello,
> >
> > On Fri, Jul 31, 2015 at 12:12:18PM -0300, Marcelo Tosatti wrote:
> > > > I don't really think it makes sense to implement a fully hierarchical
> > > > cgroup solution when there isn't the basic affinity-adjusting
> > > > interface
> > >
> > > What is an "affinity adjusting interface" ? Can you give an example
> > > please?
> >
> > Something similar to sched_setaffinity(). Just a syscall / prctl or
> > whatever programmable interface which sets per-task attribute.
>
> You really want to specify the cache configuration "at once":
> having process-A exclusive access to 2MB of cache at all times,
> and process-B 4MB exclusive, means you can't have process-C use 4MB of
> cache exclusively (consider 8MB cache machine).
That's not true. It's fine to set up the
task set <--> cache portion
mapping in pieces.
In fact, it's more natural because you don't necessarily know in advance
the entire cache allocation (think of "cp largefile /destination" with
sequential use-once behavior).
However, there is a use-case for sharing: in scenario 1 it might be
possible (and desired) to share code between applications.
> > > > and it isn't clear whether fully hierarchical resource
> > > > distribution would be necessary especially given that the granularity
> > > > of the target resource is very coarse.
> > >
> > > As I see it, the benefit of the hierarchical structure to the CAT
> > > configuration is simply to organize sharing of cache ways in subtrees
> > > - two cgroups can share a given cache way only if they have a common
> > > parent.
> > >
> > > That is the only benefit. Vikas, please correct me if i'm wrong.
> >
> > cgroups is not a superset of a programmable interface. It has
> > distinctive disadvantages and is not a substitute, with hierarchy support,
> > for a regular syscall-like interface. I don't think it makes sense
> > to go full-on hierarchical cgroups when we don't have a basic interface
> > which is likely to cover many use cases better. A syscall-like
> > interface combined with a tool similar to taskset would cover a lot in
> > a more accessible way.
>
> How are you going to specify sharing of portions of cache by two sets
> of tasks with a syscall interface?
>
> > > > I can see how cpuset would seem to invite this sort of usage but
> > > > cpuset itself is more of an arbitrary outgrowth (regardless of
> > > > history) in terms of resource control and most things controlled by
> > > > cpuset already have a counterpart interface which is readily accessible
> > > > to the normal applications.
> > >
> > > I can't parse that phrase (due to ignorance). Please educate.
> >
> > Hmmm... consider CPU affinity. cpuset definitely is useful for some
> > use cases as a management tool especially if the workloads are not
> > cooperative or delegated; however, it's no substitute for a proper
> > syscall interface and it'd be silly to try to replace that with
> > cpuset.
> >
> > > > Given that what the feature allows is restricting usage rather than
> > > > granting anything exclusively, a programmable interface wouldn't need
> > > > to worry about complications around privileges
> > >
> > > What complications about privileges do you refer to?
> >
> > It's not granting exclusive access, so individual user applications
> > can be allowed to do whatever they want to do as long as the issuer has
> > enough privilege over the target task.
>
> Privilege management with the cgroup system: changing cache allocation
> requires privilege over cgroups.
>
> Privilege management with a system call interface: applications
> could be allowed to reserve up to a certain percentage of the cache.
>
> > > > while being able to reap most of the benefits in a much easier way.
> > > > Am I missing something?
> > >
> > > The interface does allow for exclusive cache usage by an application.
> > > Please read the Intel manual, section 17, it is very instructive.
> >
> > For that, it'd have to require some CAP but I think just having
> > restrictive interface in the style of CPU or NUMA affinity would go a
> > long way.
> >
> > > The use cases we have now are the following:
> > >
> > > Scenario 1: Consider a system with 4 high performance applications
> > > running, one of which is a streaming application that manages a very
> > > large address space from which it reads and writes as it does its processing.
> > > As such the application will use all the cache it can get but does
> > > not need much if any cache. So, it spoils the cache for everyone for no
> > > gain on its own. In this case we'd like to constrain it to the
> > > smallest possible amount of cache while at the same time constraining
> > > the other 3 applications to stay out of this thrashed area of the
> > > cache.
> >
> > A tool in the style of taskset should be enough for the above
> > scenario.
> >
> > > Scenario 2: We have a numeric application that has been highly optimized
> > > to fit in the L2 cache (2M for example). We want to ensure that its
> > > cached data does not get flushed from the cache hierarchy while it is
> > > scheduled out. In this case we exclusively allocate enough L3 cache to
> > > hold all of the L2 cache.
> > >
> > > Scenario 3: Latency sensitive application executing in a shared
> > > environment, where memory to handle an event must be in L3 cache
> > > for latency requirements to be met.
> >
> > Either isolate CPUs or run other stuff with affinity restricted.
> >
> > cpuset-style allocation can be easier for things like this but that
> > should be an addition on top not the one and only interface. How is
> > it gonna handle if multiple threads of a process want to restrict
> > cache usages to avoid stepping on each other's toes? Delegate the
> > subdirectory and let the process itself open it and write to files to
> > configure when there isn't even a way to atomically access the
> > process's own directory or a way to synchronize against migration?
>
> One would preconfigure that in advance - but you are right, a
> syscall interface is more flexible in that respect.
So, systemd is responsible for locking.
> > cgroups may be an okay management interface but a horrible
> > programmable interface.
> >
> > Sure, if this turns out to be as important as cpu or numa affinity and
> > gets widely used creating management burden in many use cases, we sure
> > can add cgroups controller for it but that's a remote possibility at
> > this point and the current attempt is over-engineering solution for
> > problems which haven't been shown to exist. Let's please first
> > implement something simple and easy to use.
> >
> > Thanks.
> >
> > --
> > tejun
Don't see an easy way to fix the sharing use-case (it would require
exposing the "intersection" between two task sets).
Can't a "cacheset" helper (similar to taskset) talk to systemd
to achieve the flexibility you point out?
Hello,
On Mon, Aug 03, 2015 at 05:32:50PM -0300, Marcelo Tosatti wrote:
> You really want to specify the cache configuration "at once":
> having process-A exclusive access to 2MB of cache at all times,
> and process-B 4MB exclusive, means you can't have process-C use 4MB of
> cache exclusively (consider 8MB cache machine).
This is akin to arguing for implementing cpuset without
sched_setaffinity() or any other facility to adjust affinity. People
have been using affinity fine before cgroups. Sure, certain things
are cumbersome but cgroups isn't a replacement for a proper API.
> > cgroups is not a superset of a programmable interface. It has
> > distinctive disadvantages and is not a substitute, with hierarchy support,
> > for a regular syscall-like interface. I don't think it makes sense
> > to go full-on hierarchical cgroups when we don't have a basic interface
> > which is likely to cover many use cases better. A syscall-like
> > interface combined with a tool similar to taskset would cover a lot in
> > a more accessible way.
>
> How are you going to specify sharing of portions of cache by two sets
> of tasks with a syscall interface?
Again, think about how people have been using CPU affinity.
> > cpuset-style allocation can be easier for things like this but that
> > should be an addition on top not the one and only interface. How is
> > it gonna handle if multiple threads of a process want to restrict
> > cache usages to avoid stepping on each other's toes? Delegate the
> > subdirectory and let the process itself open it and write to files to
> > configure when there isn't even a way to atomically access the
> > process's own directory or a way to synchronize against migration?
>
> One would preconfigure that in advance - but you are right, a
> syscall interface is more flexible in that respect.
I'm not trying to say cgroup controller would be useless but the
current approach seems somewhat backwards and over-engineered. Can't
we just start with something simple? e.g. a platform device driver
that allows restricting cache usage of a target thread (be that self
or ptraceable target)?
Thanks.
--
tejun
Hello,
On Tue, Aug 04, 2015 at 09:55:20AM -0300, Marcelo Tosatti wrote:
...
> Can't "cacheset" helper (similar to taskset) talk to systemd
> to achieve the flexibility you point ?
I don't know. This is the case in point. You're now suggesting doing
things completely backwards - a thread of an application talking to an
external agent to tweak a system management interface so that it can
change the attribute of that thread. Let's please build a
programmable interface first. I'm sure there are use cases which
aren't gonna be covered 100% but at the same time I'm sure just simple
inheritable per-thread attribute would cover majority of use cases.
This really isn't that different from CPU affinity after all. *If* it
turns out that a lot of people yearn for fully hierarchical
enforcement, we sure can do that in the future but at this point it
really looks like an overkill in the wrong direction.
Thanks.
--
tejun
Hello Tejun,
On Sun, 2 Aug 2015, Tejun Heo wrote:
> Hello, Vikas.
>
> On Fri, Jul 31, 2015 at 09:24:58AM -0700, Vikas Shivappa wrote:
>> Yes, today we don't have an alternative interface - but we can always build
>> one. We simply don't have it because until now the Linux kernel just tolerated the
>> degradation that could have occurred from cache contention, and this is the
>> first interface we are building.
>
> But we're doing it the wrong way around. You can do most of what
> cgroup interface can do with systemcall-like interface with some
> inconvenience. The other way doesn't really work. As I wrote in the
> other reply, cgroups is a horrible programmable interface and we don't
> want individual applications to interact with it directly and CAT's
> use cases most definitely include each application programming its own
> cache mask.
I will make this clearer in the documentation - we intend this cgroup
interface to be used by root or a superuser - more like a system administrator
being able to control the allocation of the threads, one who has the
knowledge of the usage and is able to decide.
There is already a lot of such usage among different enterprise users at
Intel/Google/Cisco etc. who have been testing the patches posted to lkml, and
academically there is plenty of usage as well.
As a quick ref, below is a quick summary of usage:
Cache Allocation Technology provides a way for the Software (OS/VMM) to
restrict cache allocation to a defined 'subset' of cache which may be
overlapping with other 'subsets'.
This feature is used when allocating a
line in cache, i.e. when pulling new data into the cache.
- Tasks are grouped into a CLOS (class of service), or grouped into an
administrator-created cgroup.
- Then the OS uses MSR writes to indicate the
CLOSid of the thread when scheduling in (this is done by the kernel) and to indicate
the cache capacity associated with the CLOSid (the root user indicates the
capacity for each task).
Currently cache allocation is supported for L3 cache.
More information can be found in the Intel SDM June 2015, Volume 3,
section 17.16.
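A rough sketch of the sched-in MSR write just described (the rdt_sched_in() name and the per-task closid field are illustrative assumptions, not the actual patch code):

#include <linux/sched.h>
#include <asm/msr.h>

#define MSR_IA32_PQR_ASSOC	0x0c8f

/*
 * IA32_PQR_ASSOC takes the monitoring RMID in its low half and the
 * CLOSid in its high half, so associating the incoming task with its
 * class of service is a single MSR write on the local CPU.
 */
static inline void rdt_sched_in(struct task_struct *next)
{
	u32 rmid   = 0;			/* cqm would supply this when monitoring */
	u32 closid = next->closid;	/* hypothetical per-task field */

	wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
}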
Thanks,
Vikas
> Let's build something which is simple and can be used
> easily first. If this turns out to be widely useful and an overall
> management capability over it is wanted, we can consider cgroups then.
>
> Thanks.
>
> --
> tejun
>
Hello, Vikas.
On Tue, Aug 04, 2015 at 11:50:16AM -0700, Vikas Shivappa wrote:
> I will make this clearer in the documentation - we intend this cgroup
> interface to be used by root or a superuser - more like a system
> administrator being able to control the allocation of the threads, one
> who has the knowledge of the usage and is able to decide.
I get that this would be an easier "bolt-on" solution but isn't a good
solution by itself in the long term. As I wrote multiple times
before, this is a really bad programmable interface. Unless you're
sure that this doesn't have to be programmable for threads of an
individual application, this is a pretty bad interface by itself.
> There is already a lot of such usage among different enterprise users at
> Intel/Google/Cisco etc. who have been testing the patches posted to lkml and
> academically there is plenty of usage as well.
I mean, that's the tool you gave them. Of course they'd be using it
but I suspect most of them would do fine with a programmable interface
too. Again, please think of cpu affinity.
Thanks.
--
tejun
On Tue, 28 Jul 2015, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote:
>
> Please edit this document to have consistent spacing. It's really hard to
> read this. Every time I spot a misplaced space my brain stumbles and I
> need to restart.
Will fix all the spacing and other indentation issues mentioned.
Thanks for pointing them all out. Although the other documents I see don't have a
completely consistent format, which is what confused me, this format would be
better.
>> +
>> +The following considerations are done for the PQR MSR write so that it
>> +has minimal impact on scheduling hot path:
>> +- This path doesnt exist on any non-intel platforms.
>
> !x86 I think you mean; it's entirely possible to have the code present
> on AMD systems for instance.
>
>> +- On Intel platforms, this would not exist by default unless CGROUP_RDT
>> +is enabled.
>
> You can enable this just fine on AMD machines.
The cache alloc code is under CPU_SUP_INTEL.
Thanks,
Vikas
On Tue, 4 Aug 2015, Tejun Heo wrote:
> Hello, Vikas.
>
> On Tue, Aug 04, 2015 at 11:50:16AM -0700, Vikas Shivappa wrote:
>> I will make this clearer in the documentation - we intend this cgroup
>> interface to be used by root or a superuser - more like a system
>> administrator being able to control the allocation of the threads, one
>> who has the knowledge of the usage and is able to decide.
>
> I get that this would be an easier "bolt-on" solution but isn't a good
> solution by itself in the long term. As I wrote multiple times
> before, this is a really bad programmable interface. Unless you're
> sure that this doesn't have to be programmable for threads of an
> individual application,
Yes, this doesn't have to be a programmable interface for threads. It may not be a
good idea to let the threads decide the cache allocation by themselves using this direct
interface. We are transferring the decision-maker responsibility to the system
administrator.
- This interface, like you said, can easily bolt on: basically an easy-to-use
interface without worrying about the architectural details.
- But it still does the job. The root user can allocate exclusive or overlapping cache
lines to threads or groups of threads.
- No major roadblocks for usage, as we can make the allocations like mentioned
above and still keep the hierarchy etc. and use it when needed.
- An important factor is that it can easily co-exist with other interfaces like #2 and
#3 for the same purpose. So I do not see a reason why we should not use this.
This is not meant to be a programmable interface; however, it does not prevent
co-existence.
- If the root user has to set the affinity of threads that he is allocating cache, he
can do so using other cgroups like cpuset or set the masks separately using
taskset. This would let him configure the cache allocation on a socket.
> this is a pretty bad interface by itself.
>
>> There is already a lot of such usage among different enterprise users at
>> Intel/Google/Cisco etc. who have been testing the patches posted to lkml and
>> academically there is plenty of usage as well.
>
> I mean, that's the tool you gave them. Of course they'd be using it
> but I suspect most of them would do fine with a programmable interface
> too. Again, please think of cpu affinity.
All the methodology to support the feature may need an arbitrator/agent to
decide the allocation.
1. Let the root user or system administrator be the one who decides the
allocation based on the current usage. We assume this to be one with
administrative privileges. He could use the cgroup interface to perform the
task. One way to do the cpu affinity is by mounting cpuset and rdt cgroup
together.
2. Kernel automatically assigning the cache based on the priority of the apps
etc. This is something which could be designed to co-exist with #1 above,
much like how the cpusets cgroup co-exists with the kernel assigning cpus to
tasks. (The task could have a cache capacity mask
just like the cpu affinity mask.)
3. User programmable interface, where say a resource management program
x (and hence apps) could link a library which supports cache alloc/monitoring
etc. and then try to control and monitor the resources. The arbitrator could just
be the resource management interface itself or the kernel could decide.
If users use this programmable interface, we need to
make sure all the apps just cannot allocate resources without some interfacing
agent (in which case they could interface with #2?).
Do you think there are any issues for the user programmable interface to
co-exist with the cgroup interface?
Thanks,
Vikas
>
> Thanks.
>
> --
> tejun
>
On Sun, 02 Aug, at 12:31:57PM, Tejun Heo wrote:
>
> But we're doing it the wrong way around. You can do most of what
> cgroup interface can do with systemcall-like interface with some
> inconvenience. The other way doesn't really work. As I wrote in the
> other reply, cgroups is a horrible programmable interface and we don't
> want individual applications to interact with it directly and CAT's
> use cases most definitely include each application programming its own
> cache mask.
I wager that this assertion is wrong. Having individual applications
program their own cache mask is not going to be the most common
scenario. Only in very specific situations would you trust an
application to do that.
A much more likely use case is having the sysadmin carve up the cache
for a workload which may include multiple, uncooperating applications.
Yes, a programmable interface would be useful, but only for a limited
set of workloads. I don't think it's how most people are going to want
to use this hardware technology.
--
Matt Fleming, Intel Open Source Technology Center
Hello,
On Tue, Aug 04, 2015 at 07:21:52PM -0700, Vikas Shivappa wrote:
> >I get that this would be an easier "bolt-on" solution but isn't a good
> >solution by itself in the long term. As I wrote multiple times
> >before, this is a really bad programmable interface. Unless you're
> >sure that this doesn't have to be programmable for threads of an
> >individual application,
>
> Yes, this doesn't have to be a programmable interface for threads. It may not be
> a good idea to let the threads decide the cache allocation by themselves
> using this direct interface. We are transferring the decision-maker
> responsibility to the system administrator.
I'm having a hard time believing that. There definitely are use cases
where cachelines are thrashed among service threads. Are you
proclaiming that those cases aren't gonna be supported?
> - This interface, like you said, can easily bolt on: basically an easy-to-use
> interface without worrying about the architectural details.
But it's rife with architectural details. What I meant by bolt-on was
that this is a shortcut way of introducing this feature without
actually worrying about how this will be used by applications, and
that's not a good thing. We need to be worrying about that.
> - But it still does the job. The root user can allocate exclusive or overlapping
> cache lines to threads or groups of threads.
> - No major roadblocks for usage, as we can make the allocations like
> mentioned above and still keep the hierarchy etc. and use it when needed.
> - An important factor is that it can easily co-exist with other interfaces like #2
> and #3 for the same purpose. So I do not see a reason why we should not use
> this.
> This is not meant to be a programmable interface; however, it does not
> prevent co-existence.
I'm not saying they are mutually exclusive but that we're going
overboard in this direction when a programmable interface should be the
priority. This mostly happened naturally for other resources
because cgroups was introduced later, but I think there's a general
rule to follow there.
> - If root user has to set affinity of threads that he is allocating cache,
> he can do so using other cgroups like cpuset or set the masks separately
> using taskset. This would let him configure the cache allocation on a
> socket.
Well, root can do whatever it wants with a programmable interface too.
The way things are designed, even containment isn't an issue: assign
an ID to all processes by default and change the allocation on that.
> this is a pretty bad interface by itself.
> >
> >>There is already a lot of such usage among different enterprise users at
> >>Intel/Google/Cisco etc. who have been testing the patches posted to lkml and
> >>academically there is plenty of usage as well.
> >
> >I mean, that's the tool you gave them. Of course they'd be using it
> >but I suspect most of them would do fine with a programmable interface
> >too. Again, please think of cpu affinity.
>
> All the methodology to support the feature may need an arbitrator/agent to
> decide the allocation.
>
> 1. Let the root user or system administrator be the one who decides the
> allocation based on the current usage. We assume this to be one with
> administrative privileges. He could use the cgroup interface to perform the
> task. One way to do the cpu affinity is by mounting cpuset and rdt cgroup
> together.
If you factor in threads of a process, the above model is
fundamentally flawed. How would root or any external entity find out
which threads are to be allocated what? Each application would
constantly have to tell an external agent about what its intentions
are. This might seem to work in a limited feature-testing setup where
you know everything about who's doing what, but is in no way a widely
deployable solution. This pretty much degenerates into #3 you listed
below.
> 2. Kernel automatically assigning the cache based on the priority of the apps
> etc. This is something which could be designed to co-exist with #1 above,
> much like how the cpusets cgroup co-exists with the kernel assigning cpus to
> tasks. (The task could have a cache capacity mask just like the cpu
> affinity mask.)
I don't think CAT would be applicable in this manner. BE (best effort)
allocation is what the CPU is doing by default already. I'm highly doubtful
something like CAT would be used automatically in generic systems. It
requires fairly specific coordination after all.
> 3. User programmable interface, where say a resource management program
> x (and hence apps) could link a library which supports cache alloc/monitoring
> etc. and then try to control and monitor the resources. The arbitrator could just
> be the resource management interface itself or the kernel could decide.
>
> If users use this programmable interface, we need to make sure all the apps
> just cannot allocate resources without some interfacing agent (in which case
> they could interface with #2?).
>
> Do you think there are any issues for the user programmable interface to
> co-exist with the cgroup interface?
Isn't that a weird question to ask when there's no reason to rush to a
full-on cgroup controller? We can start with something simpler and
more specific and easier for applications to program against. If the
hardware details make it difficult to design properly abstracted
interface around, make it a char device node, for example, and let
userland worry about how to control access to it. If you stick to
something like that, exposing most of hardware details verbatim is
fine. People know they're dealing with something very specific with
those types of interfaces.
Thanks.
--
tejun
Hello,
On Wed, Aug 05, 2015 at 01:22:57PM +0100, Matt Fleming wrote:
> I wager that this assertion is wrong. Having individual applications
> program their own cache mask is not going to be the most common
> scenario. Only in very specific situations would you trust an
> application to do that.
As I wrote in the other reply, I don't buy that. The above only holds
if you exclude use cases where this feature is used by multiple
threads of an application and I can't see a single reason why such
uses would be excluded.
> A much more likely use case is having the sysadmin carve up the cache
> for a workload which may include multiple, uncooperating applications.
>
> Yes, a programmable interface would be useful, but only for a limited
> set of workloads. I don't think it's how most people are going to want
> to use this hardware technology.
It's actually the other way around. You can achieve most of what
cgroups can do with a programmable interface, albeit with some
awkwardness. The other direction is a lot heavier and more painful.
Thanks.
--
tejun
On Wed, Aug 05, 2015 at 01:22:57PM +0100, Matt Fleming wrote:
> On Sun, 02 Aug, at 12:31:57PM, Tejun Heo wrote:
> >
> > But we're doing it the wrong way around. You can do most of what
> > cgroup interface can do with systemcall-like interface with some
> > inconvenience. The other way doesn't really work. As I wrote in the
> > other reply, cgroups is a horrible programmable interface and we don't
> > want individual applications to interact with it directly and CAT's
> > use cases most definitely include each application programming its own
> > cache mask.
>
> I wager that this assertion is wrong. Having individual applications
> program their own cache mask is not going to be the most common
> scenario.
What I like about the syscall interface is that it moves the knowledge
of cache behaviour close to the application launching (or inside it),
which allows the following common scenario, say on a multi-purpose
desktop:
Event: launch high performance application: use cache reservation, finish
quickly.
Event: cache hog application: do not thrash the cache.
The two cache reservations are logically unrelated in terms of
configuration, and configured separately do not affect each other.
They should be configured separately.
Also, data/code reservation is specific to the application, so
its specification should be close to the application (it's just
cumbersome to maintain that data somewhere else).
> Only in very specific situations would you trust an
> application to do that.
Perhaps ulimit can be used to allow a certain limit on applications.
> A much more likely use case is having the sysadmin carve up the cache
> for a workload which may include multiple, uncooperating applications.
Sorry, what does cooperating mean in this context?
> Yes, a programmable interface would be useful, but only for a limited
> set of workloads. I don't think it's how most people are going to want
> to use this hardware technology.
It seems the syscall interface handles all use cases which the cgroup
interface handles.
> --
> Matt Fleming, Intel Open Source Technology Center
Tentative interface, please comment.
The "return key/use key" scheme would allow COSid sharing similarly to
shmget. Intra-application, that is functional, but I am not experienced
enough with shmget to judge whether there is a better alternative. Would have
to think about how a cross-application setup would work,
and how it would work in the simple "cacheset" configuration.
Also, the interface should work for other architectures (TODO item; PPC
at least has similar functionality).
enum cache_rsvt_flags {
CACHE_RSVT_ROUND_UP = (1 << 0), /* round "bytes" up */
CACHE_RSVT_ROUND_DOWN = (1 << 1), /* round "bytes" down */
CACHE_RSVT_EXTAGENTS = (1 << 2), /* allow usage of area common with external agents */
};
enum cache_rsvt_type {
CACHE_RSVT_TYPE_CODE = 0, /* cache reservation is for code */
CACHE_RSVT_TYPE_DATA, /* cache reservation is for data */
CACHE_RSVT_TYPE_BOTH, /* cache reservation is for code and data */
};
struct cache_reservation {
size_t kbytes;
u32 type;
u32 flags;
};
int sys_cache_reservation(struct cache_reservation *cv);
returns -ENOMEM if not enough space, -EPERM if no permission.
returns keyid > 0 if reservation has been successful, copying actual
number of kbytes reserved to "kbytes".
-----------------
int sys_use_cache_reservation_key(struct cache_reservation *cv, int
key);
returns -EPERM if no permission.
returns -EINVAL if no such key exists.
returns 0 if instantiation of reservation has been successful,
copying actual reservation to cv.
Backward compatibility for processors with no support for code/data
differentiation: by default the code and data cache allocation types
fall back to CACHE_RSVT_TYPE_BOTH on older processors (and return the
information that they have done so via "flags").
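For illustration, a sketch of how an application might call the proposed interface (nothing below exists today: sys_cache_reservation() is only the proposal above, the wrapper is declared rather than implemented, and the CACHE_RSVT_* enums are assumed to be in scope):

#include <stddef.h>
#include <stdint.h>

/* Repeated from the proposal above. */
struct cache_reservation {
	size_t   kbytes;
	uint32_t type;		/* enum cache_rsvt_type */
	uint32_t flags;		/* enum cache_rsvt_flags */
};

extern int sys_cache_reservation(struct cache_reservation *cv);

/* Reserve 2MB of unified (code+data) cache, rounded up to way size. */
static int reserve_2mb(void)
{
	struct cache_reservation cv = {
		.kbytes = 2048,
		.type   = CACHE_RSVT_TYPE_BOTH,
		.flags  = CACHE_RSVT_ROUND_UP,
	};
	int key = sys_cache_reservation(&cv);

	if (key < 0)
		return key;	/* -ENOMEM or -EPERM per the proposal */
	/*
	 * cv.kbytes now holds the way-rounded size actually reserved;
	 * key could be handed to sys_use_cache_reservation_key() by
	 * another task, shmget-style.
	 */
	return key;
}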
On Wed, 5 Aug 2015, Marcelo Tosatti wrote:
> On Wed, Aug 05, 2015 at 01:22:57PM +0100, Matt Fleming wrote:
>> On Sun, 02 Aug, at 12:31:57PM, Tejun Heo wrote:
>>>
>>> But we're doing it the wrong way around. You can do most of what
>>> cgroup interface can do with systemcall-like interface with some
>>> inconvenience. The other way doesn't really work. As I wrote in the
>>> other reply, cgroups is a horrible programmable interface and we don't
>>> want individual applications to interact with it directly and CAT's
>>> use cases most definitely include each application programming its own
>>> cache mask.
>>
>> I wager that this assertion is wrong. Having individual applications
>> program their own cache mask is not going to be the most common
>> scenario.
>
> >What I like about the syscall interface is that it moves the knowledge
> >of cache behaviour close to the application launching (or inside it),
> >which allows the following common scenario, say on a multi-purpose
> >desktop:
>
> Event: launch high performance application: use cache reservation, finish
> quickly.
> Event: cache hog application: do not thrash the cache.
>
> The two cache reservations are logically unrelated in terms of
> configuration, and configured separately do not affect each other.
There could be several issues with letting apps allocate the cache themselves. We just
cannot treat cache allocation like memory allocation; please consider the
scenarios below.
All examples assume cache size: 10MB, CBM max bits: 10.
(1) User programmable syscall:
1.1> Exclusive access: A task cannot give *itself* exclusive access to
the cache. For this it needs to have visibility of the cache allocations of
other tasks, and may need to reclaim or override others' cache allocations, which is not
feasible (isn't that the ability of a system-managing agent?).
eg:
apps 1..10 ask for 1MB of exclusive cache each.
They get it, as there was 10MB.
But now a large portion of the tasks on the system will end up without any cache -
this is not workable.
Or do they share a common pool or a default shared pool? If there is such a
default pool then that needs to be *managed*, and this reduces the amount
of exclusive cache access that can be given.
1.2> Noisy neighbour problem: how does the task itself decide it's the noisy
neighbour? This is the
key requirement the feature wants to address. We want to address the
jitter and inconsistencies in the quality of service, things like the response times
the apps get. If you read the SDM
it is mentioned clearly there as well. Can the task voluntarily declare itself a
noisy neighbour (how??) and relinquish the cache allocation (how much?). But
that's not even guaranteed.
How can we expect every application coder to know what system the app is going
to run on and how much is the optimal amount of cache the app can get? It's not
like memory allocation - see 1.3 and 1.4 below.
1.3> We cannot treat cache allocation similarly to memory allocation.
There are system-call alternatives for memory allocation apart from cgroups
like cpuset, but we cannot treat both as the same.
(This is with reference to the point that there are alternatives to memory
allocation apart from using cpuset, but the whole point is you can't treat
memory allocation and cache allocation as the same.)
1.3.1> Memory is a very large pool in terms of GBs and we are talking
about only a few MBs (~10 - 20) - orders of magnitude less. So this could
easily get into the situation mentioned
above where the first few apps get all the exclusive cache and the rest have to
starve.
1.3.2> Memory is virtualized: each process has its own space and we are
not even bound by the physical memory capacity as we can virtualize it, so an app
can indeed ask for more memory than the physical memory along with other apps
doing the same - but we can't do the same here with cache allocation. Even if we
evict the cache, that defeats the purpose of cache allocation to threads.
1.4> Specific h/w requirements: with code/data prioritization (CDP), the h/w
requires the OS to reset all the capacity bitmasks once we change mode
from CDP to legacy cache alloc. So
naturally we need to remove the tasks with all their allocations. We cannot
easily take away all the cache allocations that users will be thinking are theirs
when they had allocated them using the syscall. This would be as if tasks
malloc successfully and midway their allocation is no longer there.
Also, this adds to the argument that you need to treat cache allocation and
other resource allocation, like memory, differently.
1.5> In cloud and container environments, say we would need to allocate cache
for an entire VM which runs a specific real_time workload vs. allocate cache for VMs
which run, say, a noisy_workload - how can we achieve this by letting each app
decide how much cache needs to be allocated? This is best done by an
external system manager.
(2) cgroup interface:
(2.1) Compare the above usage:
1.1> and 1.2> above can easily be done with the cgroup interface.
The key difference is system management vs. process-self management of the cache
allocation. When there is a centralized system manager this works fine.
The administrator can
make sure that certain tasks/groups of tasks get exclusive cache blocks. And the
administrator can determine the noisy neighbour application or workload using
cache monitoring and make allocations appropriately.
A classic use case is here:
http://www.intel.com/content/www/us/en/communications/cache-allocation-technology-white-paper.html
$ cd /sys/fs/cgroup/rdt
$ cd group1
$ /bin/echo 0xf > intel_rdt.l3_cbm
$ cd group2
$ /bin/echo 0xf0 > intel_rdt.l3_cbm
If we want to prevent the system admin from accidentally allocating overlapping
masks, that could easily be extended with an always-exclusive flag.
Rounding off: we can easily write a script to calculate the chunk size and
then allocate based on byte size. This is something that can easily be
done on top of this interface - see the sketch at the end of (2.1) below.
Assign tasks to group2:
$ /bin/echo PID1 > tasks
$ /bin/echo PID2 > tasks
If a bunch of threads belonging to a process (Processidx) need to be allocated
cache:
$ /bin/echo <Processidx> > cgroup.procs
The 1.4> above can possibly be addressed in cgroups but would need some support
which we are planning to send. One way to address this is to tear down
the subsystem by deleting all the existing cgroup directories and then handling
the reset, so that CDP starts fresh with all bitmasks ready to be allocated.
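Regarding the rounding-off remark above, a small illustrative helper (hypothetical, not part of the patches) that converts a byte request into a contiguous capacity bitmask, using the 10MB/10-bit example geometry from this mail:

#include <stdint.h>

#define CACHE_BYTES	(10u * 1024 * 1024)	/* example: 10MB L3 */
#define CBM_BITS	10u			/* example: 10-bit CBM */
#define WAY_BYTES	(CACHE_BYTES / CBM_BITS)

static uint32_t bytes_to_cbm(uint64_t bytes)
{
	uint32_t nbits = (bytes + WAY_BYTES - 1) / WAY_BYTES;	/* round up */

	if (nbits == 0)
		nbits = 1;		/* a CBM must have at least one bit set */
	if (nbits > CBM_BITS)
		nbits = CBM_BITS;
	return (1u << nbits) - 1;	/* contiguous mask starting at bit 0 */
}

For example, bytes_to_cbm(4u << 20) yields 0xf, the mask written to group1 above.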
(2.2) cpu affinity:
Similarly the rdt cgroup can be used to assign affinity to the entire cgroup itself.
Also, you could always use taskset as well!
Example 2: the commands below allocate '1MB of L3 cache on socket1 to group1'
and '2MB of L3 cache on socket2 to group2'.
This mounts both cpuset and intel_rdt, and hence the ls would list the
files in both subsystems.
$ mount -t cgroup -ocpuset,intel_rdt cpuset,intel_rdt rdt/
$ ls /sys/fs/cgroup/rdt
cpuset.cpus
cpuset.mems
...
intel_rdt.l3_cbm
tasks
Assign the cache
$ /bin/echo 0xf > /sys/fs/cgroup/rdt/group1/intel_rdt.l3_cbm
$ /bin/echo 0xff > /sys/fs/cgroup/rdt/group2/intel_rdt.l3_cbm
Assign tasks for group1 and group2
$ /bin/echo PID1 > /sys/fs/cgroup/rdt/group1/tasks
$ /bin/echo PID2 > /sys/fs/cgroup/rdt/group1/tasks
$ /bin/echo PID3 > /sys/fs/cgroup/rdt/group2/tasks
$ /bin/echo PID4 > /sys/fs/cgroup/rdt/group2/tasks
Tie group1 to socket1 and group2 to socket2:
$ /bin/echo <cpumask for socket1> > /sys/fs/cgroup/rdt/group1/cpuset.cpus
$ /bin/echo <cpumask for socket2> > /sys/fs/cgroup/rdt/group2/cpuset.cpus
>
> They should be configured separately.
>
> Also, data/code reservation is specific to the application, so
> its specification should be close to the application (it's just
> cumbersome to maintain that data somewhere else).
>
>> Only in very specific situations would you trust an
>> application to do that.
>
> Perhaps ulimit can be used to allow a certain limit on applications.
The ulimit is very subjective and depends on the workloads/amount of cache space
available/total cache etc. - see, here you are moving towards a controlling
agent which could possibly configure ulimit to control what apps get.
>
>> A much more likely use case is having the sysadmin carve up the cache
>> for a workload which may include multiple, uncooperating applications.
>
> Sorry, what does cooperating mean in this context?
See example 1.2 above - a noisy neighbour can't be expected to relinquish the
cache allocation himself. That's one example of an uncooperating app.
>
>> Yes, a programmable interface would be useful, but only for a limited
>> set of workloads. I don't think it's how most people are going to want
>> to use this hardware technology.
>
> It seems the syscall interface handles all use cases which the cgroup
> interface handles.
>
>> --
>> Matt Fleming, Intel Open Source Technology Center
>
> Tentative interface, please comment.
Please discuss the interface details once we are solid on the kind of interface
itself, since we have already reviewed one interface and are talking about a new one.
Otherwise it may miss a lot of the hardware requirements,
like 1.4 above - without that we can't have a complete interface.
I understand the cgroup interface has things like hierarchy which are of not
much use to the intel_rdt cgroup - is that the key issue here, or is the whole
'system management of the cache allocation' the issue?
Thanks,
Vikas
>
> The "return key/use key" scheme would allow COSid sharing similarly to
> shmget. Intra-application, that is functional, but i am not experienced
> with shmget to judge whether there is a better alternative. Would have
> to think how cross-application setup would work,
> and in the simple "cacheset" configuration.
> Also, the interface should work for other architectures (TODO item, PPC
> at least has similar functionality).
>
> enum cache_rsvt_flags {
> CACHE_RSVT_ROUND_UP = (1 << 0), /* round "bytes" up */
> CACHE_RSVT_ROUND_DOWN = (1 << 1), /* round "bytes" down */
> CACHE_RSVT_EXTAGENTS = (1 << 2), /* allow usage of area common with external agents */
> };
>
> enum cache_rsvt_type {
> CACHE_RSVT_TYPE_CODE = 0, /* cache reservation is for code */
> CACHE_RSVT_TYPE_DATA, /* cache reservation is for data */
> CACHE_RSVT_TYPE_BOTH, /* cache reservation is for code and data */
> };
>
> struct cache_reservation {
> size_t kbytes;
> u32 type;
> u32 flags;
> };
>
> int sys_cache_reservation(struct cache_reservation *cv);
>
> returns -ENOMEM if not enough space, -EPERM if no permission.
> returns keyid > 0 if reservation has been successful, copying actual
> number of kbytes reserved to "kbytes".
>
> -----------------
>
> int sys_use_cache_reservation_key(struct cache_reservation *cv, int
> key);
>
> returns -EPERM if no permission.
> returns -EINVAL if no such key exists.
> returns 0 if instantiation of reservation has been successful,
> copying actual reservation to cv.
>
> Backward compatibility for processors with no support for code/data
> differentiation: by default the code and data cache allocation types
> fall back to CACHE_RSVT_TYPE_BOTH on older processors (and return the
> information that they have done so via "flags").
>
>
>
On Wed, 5 Aug 2015, Tejun Heo wrote:
> Hello,
>
> On Tue, Aug 04, 2015 at 07:21:52PM -0700, Vikas Shivappa wrote:
>>> I get that this would be an easier "bolt-on" solution but isn't a good
>>> solution by itself in the long term. As I wrote multiple times
>>> before, this is a really bad programmable interface. Unless you're
>>> sure that this doesn't have to be programmable for threads of an
>>> individual application,
>>
>> Yes, this doesn't have to be a programmable interface for threads. It may not be
>> a good idea to let the threads decide the cache allocation by themselves
>> using this direct interface. We are transferring the decision-maker
>> responsibility to the system administrator.
>
> I'm having a hard time believing that. There definitely are use cases
> where cachelines are thrashed among service threads. Are you
> proclaiming that those cases aren't gonna be supported?
Please refer to the noisy neighbour example I gave here to help resolve
thrashing by a
noisy neighbour:
http://marc.info/?l=linux-kernel&m=143889397419199
and the reference
http://www.intel.com/content/www/us/en/communications/cache-allocation-technology-white-paper.html
>
>> - This interface, like you said, can easily bolt on: basically an easy-to-use
>> interface without worrying about the architectural details.
>
> But it's rife with architectural details.
If specifying the bitmask is an issue, it can easily be addressed by writing a
script which calculates the bitmask from a size - like mentioned here:
http://marc.info/?l=linux-kernel&m=143889397419199
> What I meant by bolt-on was
> that this is a shortcut way of introducing this feature without
> actually worrying about how this will be used by applications and
> that's not a good thing. We need to be worrying about that.
>
>> - But it still does the job. The root user can allocate exclusive or overlapping
>> cache lines to threads or groups of threads.
>> - No major roadblocks for usage, as we can make the allocations like
>> mentioned above and still keep the hierarchy etc. and use it when needed.
>> - An important factor is that it can easily co-exist with other interfaces like #2
>> and #3 for the same purpose. So I do not see a reason why we should not use
>> this.
>> This is not meant to be a programmable interface; however, it does not
>> prevent co-existence.
>
> I'm not saying they are mutually exclusive but that we're going
> overboard in this direction when a programmable interface should be the
> priority. This mostly happened naturally for other resources
> because cgroups was introduced later, but I think there's a general
> rule to follow there.
Right, the cache allocation cannot be treated like memory, as explained here
in 1.3 and 1.4:
http://marc.info/?l=linux-kernel&m=143889397419199
>
>> - If root user has to set affinity of threads that he is allocating cache,
>> he can do so using other cgroups like cpuset or set the masks separately
>> using taskset. This would let him configure the cache allocation on a
>> socket.
>
> Well, root can do whatever it wants with a programmable interface too.
> The way things are designed, even containment isn't an issue: assign
> an ID to all processes by default and change the allocation on that.
>
>> this is a pretty bad interface by itself.
>>>
>>>> There is already a lot of such usage among different enterprise users at
>>>> Intel/Google/Cisco etc. who have been testing the patches posted to lkml and
>>>> academically there is plenty of usage as well.
>>>
>>> I mean, that's the tool you gave them. Of course they'd be using it
>>> but I suspect most of them would do fine with a programmable interface
>>> too. Again, please think of cpu affinity.
>>
>> All the methodology to support the feature may need an arbitrator/agent to
>> decide the allocation.
>>
>> 1. Let the root user or system administrator be the one who decides the
>> allocation based on the current usage. We assume this to be one with
>> administrative privileges. He could use the cgroup interface to perform the
>> task. One way to do the cpu affinity is by mounting cpuset and rdt cgroup
>> together.
>
> If you factor in threads of a process, the above model is
> fundamentally flawed. How would root or any external entity find out
> which threads are to be allocated what?
The process ID can be added to the cgroup together with all its threads, as shown
in the example of cgroup usage in (2) here.
In most cases in the cloud you will be able to decide based on what workloads
are running - see the example 1.5 here:
http://marc.info/?l=linux-kernel&m=143889397419199
> Each application would
> constantly have to tell an external agent about what its intentions
> are. This might seem to work in a limited feature-testing setup where
> you know everything about who's doing what, but is in no way a widely
> deployable solution. This pretty much degenerates into #3 you listed
> below.
The app may not be the best one to decide - see 1.1 and 1.2 here:
http://marc.info/?l=linux-kernel&m=143889397419199
>
>> 2. Kernel automatically assigning the cache based on the priority of the apps
>> etc. This is something which could be designed to co-exist with #1 above,
>> much like how the cpusets cgroup co-exists with the kernel assigning cpus to
>> tasks. (The task could have a cache capacity mask just like the cpu
>> affinity mask.)
>
> I don't think CAT would be applicable in this manner. BE (best effort)
> allocation is what the CPU is doing by default already. I'm highly doubtful
> something like CAT would be used automatically in generic systems. It
> requires fairly specific coordination after all.
The 3 items were generalized at a high level to show system management vs. the user
doing it. I am not saying it should be done this way.
Thanks,
Vikas
>
>> 3. User programmable interface, where say a resource management program
>> x (and hence apps) could link a library which supports cache alloc/monitoring
>> etc. and then try to control and monitor the resources. The arbitrator could just
>> be the resource management interface itself or the kernel could decide.
>>
>> If users use this programmable interface, we need to make sure all the apps
>> just cannot allocate resources without some interfacing agent (in which case
>> they could interface with #2?).
>>
>> Do you think there are any issues for the user programmable interface to
>> co-exist with the cgroup interface?
>
> Isn't that a weird question to ask when there's no reason to rush to a
> full-on cgroup controller?
> We can start with something simpler and
> more specific and easier for applications to program against. If the
> hardware details make it difficult to design properly abstracted
> interface around, make it a char device node, for example, and let
> userland worry about how to control access to it. If you stick to
> something like that, exposing most of hardware details verbatim is
> fine. People know they're dealing with something very specific with
> those types of interfaces.
>
> Thanks.
>
> --
> tejun
>
On Thu, Aug 06, 2015 at 01:46:06PM -0700, Vikas Shivappa wrote:
>
>
> On Wed, 5 Aug 2015, Marcelo Tosatti wrote:
>
> >On Wed, Aug 05, 2015 at 01:22:57PM +0100, Matt Fleming wrote:
> >>On Sun, 02 Aug, at 12:31:57PM, Tejun Heo wrote:
> >>>
> >>>But we're doing it the wrong way around. You can do most of what
> >>>cgroup interface can do with systemcall-like interface with some
> >>>inconvenience. The other way doesn't really work. As I wrote in the
> >>>other reply, cgroups is a horrible programmable interface and we don't
> >>>want individual applications to interact with it directly and CAT's
> >>>use cases most definitely include each application programming its own
> >>>cache mask.
> >>
> >>I wager that this assertion is wrong. Having individual applications
> >>program their own cache mask is not going to be the most common
> >>scenario.
> >
> >What i like about the syscall interface is that it moves the knowledge
> >of cache behaviour close to the application launching (or inside it),
> >which allows the following common scenario, say on a multi purpose
> >desktop:
> >
> >Event: launch high performance application: use cache reservation, finish
> >quickly.
> >Event: cache hog application: do not thrash the cache.
> >
> >The two cache reservations are logically unrelated in terms of
> >configuration, and configured separately do not affect each other.
>
> There could be several issues to let apps allocate the cache
> themselves. We just cannot treat the cache alloc just like memory
> allocation, please consider the scenarios below:
>
> All examples consider cache size: 10MB, cbm max bits: 10.
>
>
> (1)user programmable syscall:
>
> 1.1> Exclusive access: The task cannot give *itself* exclusive
> access to the cache. For this it needs to have visibility of
> the cache allocation of other tasks and may need to reclaim or
> override others' cache allocs, which is not feasible (isn't that the
> ability of a system managing agent?).
Different allocation of the resource (cache in this case) causes
different cache miss patterns and therefore different results.
> eg:
> apps 1..10 ask for 1MB of exclusive cache each.
> They get it, as there was 10MB.
>
> But now a large portion of tasks on the system will end up without any
> cache? - this is not possible,
> or do they share a common pool or a default shared pool ? - if there is such a
> default pool then that needs to be *managed* and this reduces the
> number of exclusive cache access given.
The proposal would be for the administrator to set up how much each user
can reserve via ulimit (per-user).
To change that per-user configuration, it's necessary to
stop the tasks.
However, that makes no sense; revoking crossed my mind as well.
To allow revoking, it would be necessary to have a special capability
(which only root has by default).
The point here is that it should be possible to modify cache
reservations.
Alternatively, use a priority system. So:
Revoking:
--------
Privileged system call to list and invalidate cache reservations.
Assumes that reservations returned by "sys_cache_reservation"
are persistent and that users of the "remove" system call
are aware of the consequences.
Priority:
---------
Use some priority order (based on nice value, or a new separate
value to perform comparison), and use that to decide which
reservations have priority.
*I-1* (todo notes)
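A minimal sketch of how the revoking path above might look from userspace,
matching the shape of the list/delete calls Marcelo details later in this
thread. These syscalls are only a proposal and were never merged, so the
numbers below are placeholders and the calls fail with ENOSYS on a real
kernel:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

struct cache_reservation {
	unsigned long kbytes;
	int type;
	int flags;
	int tcrid;
};

int main(void)
{
	struct cache_reservation list[64];
	int i, n;

	/* privileged listing of all reservations in the system */
	n = syscall(-1 /* __NR_get_cache_reservations, hypothetical */,
		    (size_t)64, list);
	if (n < 0) {
		perror("get_cache_reservations");
		return 1;
	}
	/* invalidate every reservation found (administrator only) */
	for (i = 0; i < n; i++)
		syscall(-1 /* __NR_delete_cache_reservation, hypothetical */,
			&list[i]);
	return 0;
}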
> 1.2> Noisy neighbour problem: how does the task itself decide it's the noisy
> neighbour? This is the key requirement the feature wants to address. We want
> to address the jitter and inconsistencies in the quality of service the apps
> get, things like response times. If you read the SDM it's mentioned clearly
> there as well. Can the task voluntarily declare itself the noisy neighbour
> (how??) and relinquish the cache allocation (how much?). But that's not even
> guaranteed.
I suppose this requires global information (how much cache each
application is using), and the goal: what is the end goal of
a particular cache resource division.
Each cache division has an outcome: certain instruction sequences
execute faster than others.
Whether a given task is a "cache hog" (that is, evicting cachelines
of other tasks does not reduce execution time of the "cache hog" task
itself, and therefore does not benefit the performance of the system
as a whole) is probably not an ideal visualization: each task has
different subparts that could be considered "cache hogs", and parts
that are not "cache hogs".
I think that for now, handling the static usecases is good enough.
> How can we expect every application coder to know what system the
> app is going to run and how much is the optimal amount of cache the
> app can get - its not like memory allocation for #3 and #4 below.
"Optimal" depends on what the desired end result is: execution time as
a whole, execution time of an individual task, etc.
In the case the applications are not aware of the cache, the OS should
divide the resource automatically by heuristics (in analogy with LRU).
For special applications, the programmer/compiler can find the optimal
tuning.
> 1.3> cannot treat cache allocation similar to memory allocation.
> There are system-call alternatives to do memory allocation apart from cgroups
> like cpuset, but we cannot treat both as the same.
> (This is with reference to the point that there are alternatives to memory
> allocation apart from using cpuset, but the whole point is you can't
> treat memory allocation and cache allocation as the same)
> 1.3.1> memory is a very large pool in terms of GBs and we are talking
> about only a few MBs (~10 - 20MB, orders and orders of magnitude smaller). So
> this could easily get into the situation mentioned
> above where the first few apps get all the exclusive cache and the rest have
> to starve.
Point taken. Applications are allowed to set their cache reservations because
it's convenient: it's easier to consider and set up the cache allocation of
a given application than to consider and set up the whole
system.
If setting reservations individually conflicts or affects the system as
a whole, then the administrator or decision logic should resolve the
situation.
> 1.3.2> memory is virtualized: each process has its own space and we are
> not even bound by the physical memory capacity, as we can virtualize
> it so an app can indeed ask for more memory than the physical memory,
> along with other apps doing the same - but we can't do the same here
> with cache allocation. Even if we evict the cache, that defeats the
> purpose of cache allocation to threads.
ulimit.
*I-2*
> 1.4> specific h/w requirements: With code data prioritization (cdp), the h/w
> requires the OS to reset all the capacity bitmasks once we change mode
> from cdp to legacy cache alloc. So
> naturally we need to remove the tasks with all their allocations. We cannot
> easily take away all the cache allocations that users will be thinking are
> theirs when they had allocated using the syscall. This is something like the
> tasks malloc successfully and midway their allocation is no longer there.
> Also this adds to the logic that you need to treat cache allocation and
> other resource allocation like memory differently.
Point.
*I-3*
CDP -> CAT transition.
CAT -> CDP transition.
>
> 1.5> In cloud and container environments, say we would need to
> allocate cache for an entire VM which runs a specific real_time
> workload vs. allocate cache for VMs which run, say, a noisy_workload -
> how can we achieve this by letting each app decide how much cache
> needs to be allocated? This is best done by an external system
> manager.
Agreed. This is what will happen in that use-case, and the systemcall
interface allows it.
>
> (2)cgroup interface:
>
> (2.1) compare above usage
>
> 1.1> and 1.2> above can easily be done with the cgroup interface.
> The key difference is system management vs. process self-management of the
> cache allocation. When there is a centralized system manager this works fine.
>
> The administrator can
> make sure that certain tasks/group of tasks get exclusive cache blocks. And the
> administrator can determine the noisy neighbour application or workload using
> cache monitoring and make allocations appropriately.
>
> A classic use case is here :
> http://www.intel.com/content/www/us/en/communications/cache-allocation-technology-white-paper.html
>
> $ cd /sys/fs/cgroup/rdt
> $ cd group1
> $ /bin/echo 0xf > intel_rdt.l3_cbm
>
> $ cd group2
> $ /bin/echo 0xf0 > intel_rdt.l3_cbm
>
> If we want to prevent the system admin from accidentally allocating
> overlapping masks, this could easily be extended by having an
> always-exclusive flag.
>
> Rounding off: We can easily write a batch file to calculate the
> chunk size and then allocate based on byte size. This is
> something that can easily be done on top of this interface.
Agree byte specification can be done in cgroups.
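For reference, a rough sketch of that conversion, assuming an example 20MB L3
with 20 cbm bits (the real values would come from /proc/cpuinfo and the root
cgroup's maximum bitmask; both are stand-in numbers here):

#include <stdio.h>

int main(void)
{
	unsigned long cache_bytes = 20UL * 1024 * 1024; /* example L3 size */
	unsigned int max_cbm_bits = 20;                 /* example max bits */
	unsigned long chunk = cache_bytes / max_cbm_bits; /* bytes per bit */
	unsigned long want = 4UL * 1024 * 1024;           /* request 4MB */
	unsigned int nbits = (want + chunk - 1) / chunk;  /* round up */
	unsigned long cbm = (1UL << nbits) - 1; /* contiguous low-order bits */

	/* print the cgroup command that would set this allocation */
	printf("/bin/echo %#lx > intel_rdt.l3_cbm\n", cbm);
	return 0;
}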
> Assign tasks to the group2
>
> $ /bin/echo PID1 > tasks
> $ /bin/echo PID2 > tasks
>
> If a bunch of threads belonging to a process(Processidx) need to be allocated
> cache -
> $ /bin/echo <Processidx> > cgroup.procs
>
>
> The 1.4> above can possibly be addressed in cgroup but would need some
> support which we are planning to send. One way to address this is to tear
> down the subsystem by deleting all the existing cgroup directories and then
> handling the reset. So cdp starts fresh with all bitmasks ready to be
> allocated.
Agree this is a very good point. The syscall interface must handle it.
*I-4*
> (2.2) cpu affinity :
>
> Similarly rdt cgroup can be used to assign affinity to the entire cgroup itself.
> Also you could always use taskset as well !
>
> example2: Below commands allocate '1MB L3 cache on socket1 to group1'
> and '2MB of L3 cache on socket2 to group2'.
> This mounts both cpuset and intel_rdt and hence the ls would list the
> files in both the subsystems.
> $ mount -t cgroup -ocpuset,intel_rdt cpuset,intel_rdt rdt/
> $ ls /sys/fs/cgroup/rdt
> cpuset.cpus
> cpuset.mems
> ...
> intel_rdt.l3_cbm
> tasks
>
> Assign the cache
> $ /bin/echo 0xf > /sys/fs/cgroup/rdt/group1/intel_rdt.l3_cbm
> $ /bin/echo 0xff > /sys/fs/cgroup/rdt/group2/intel_rdt.l3_cbm
>
> Assign tasks for group1 and group2
> $ /bin/echo PID1 > /sys/fs/cgroup/rdt/group1/tasks
> $ /bin/echo PID2 > /sys/fs/cgroup/rdt/group1/tasks
> $ /bin/echo PID3 > /sys/fs/cgroup/rdt/group2/tasks
> $ /bin/echo PID4 > /sys/fs/cgroup/rdt/group2/tasks
>
> Tie the group1 to socket1 and group2 to socket2
> $ /bin/echo <cpumask for socket1> > /sys/fs/cgroup/rdt/group1/cpuset.cpus
> $ /bin/echo <cpumask for socket2> > /sys/fs/cgroup/rdt/group2/cpuset.cpus
>
> >
> >They should be configured separately.
> >
> >Also, data/code reservation is specific to the application, so
> >its specification should be close to the application (it's just
> >cumbersome to maintain that data somewhere else).
> >
> >>Only in very specific situations would you trust an
> >>application to do that.
> >
> >Perhaps ulimit can be used to allow a certain limit on applications.
>
> The ulimit is very subjective and depends on the workloads/amount of
> cache space available/total cache etc. - see, here you are moving
> towards a controlling agent which could possibly configure ulimit to
> control what apps get.
The point of ulimit is to let unrestricted users use cache
reservations as well. So for example one configuration would be:
HW: 32MB L3 cache.
user        maximum cache reservation
root        32MB
user-A      1MB
user-B      1MB
user-C      1MB
...
But you'd probably want to say "no more than 2MB for
non-root". I don't think ulimit can handle that.
> >>A much more likely use case is having the sysadmin carve up the cache
> >>for a workload which may include multiple, uncooperating applications.
> >
> >Sorry, what cooperating means in this context?
>
> See example 1.2 above - a noisy neighbour can't be expected to
> relinquish the cache alloc himself. That's one example of an
> uncooperating app?
OK.
> >
> >>Yes, a programmable interface would be useful, but only for a limited
> >>set of workloads. I don't think it's how most people are going to want
> >>to use this hardware technology.
> >
> >It seems syscall interface handles all usecases which the cgroup
> >interface handles.
> >
> >>--
> >>Matt Fleming, Intel Open Source Technology Center
> >
> >Tentative interface, please comment.
>
> Please discuss the interface details once we are solid on the kind
> of interface itself, since we have already reviewed one interface and
> are talking about a new one. Otherwise it may miss a lot of hardware
> requirements like 1.4 above - without that we can't have a complete
> interface?
>
> I understand the cgroup interface has things like hierarchy which are
> not of much use to the intel_rdt cgroup - is that the key issue
> here, or is the whole 'system management of the cache allocation' the
> issue?
There are several issues now -- can't say what the key issue is.
Hello,
On Thu, Aug 06, 2015 at 01:58:39PM -0700, Vikas Shivappa wrote:
> >I'm having hard time believing that. There definitely are use cases
> >where cachelines are thrashed among service threads. Are you
> >proclaiming that those cases aren't gonna be supported?
>
> Please refer to the noisy neighbour example I give here to help resolve
> thrashing by a noisy neighbour -
> http://marc.info/?l=linux-kernel&m=143889397419199
I don't think that's relevant to the discussion. Implement a taskset-like
tool and the administrator can deal with it just fine. As I have
written multiple times now, people have been dealing with CPU affinity
fine w/o cgroups. Sure, cgroups do add on top, but they are a lot more
complex a facility and not a replacement for a more basic control
mechanism.
> >>- This interface, like you said, can easily bolt on - basically an easy to
> >>use interface without worrying about the architectural details.
> >
> >But it's ripe with architectural details.
>
> If specifying the bitmask is an issue, it can easily be addressed by
> writing a script which converts a size to the bitmask - like mentioned here
> http://marc.info/?l=linux-kernel&m=143889397419199
Let's say we fully virtualize cache partitioning so that each user can
express what they want and the kernel can compute and manage the
closest mapping supportable by the underlying hardware. That should
be doable but I don't think that's what we want at this point. This,
at least for now, is a niche feature which requires specific
configurations to be useful and while useful to certain narrow use
cases unlikely to be used across the board. Given that, we don't want
to overengineer the solution. Implement something simple and
specific. We don't yet even know the full usefulness or use cases of
the feature. It doesn't make sense to overcommit to complex
abstractions and mechanisms when there's a fairly good chance that our
understanding of the problem itself is very porous.
This applies the same to making it part of cgroups. It's a lot more
complex and we end up committing a lot more than implementing
something simple and specific. Let's please keep it simple.
> >I'm not saying they are mutually exclusive but that we're going
> >overboard in this direction when programmable interface should be the
> >priority. While this mostly happened naturally for other resources
> >because cgroups was introduced later but I think there's a general
> >rule to follow there.
>
> Right, cache allocation cannot be treated like memory, as explained
> in 1.3 and 1.4 here:
> http://marc.info/?l=linux-kernel&m=143889397419199
Who said that it could be? If it actually were a resource which is as
ubiquitous, flexible and dividable as memory, cgroups would be a
lot better fit.
> >If you factor in threads of a process, the above model is
> >fundamentally flawed. How would root or any external entity find out
> >what threads are to be allocated what?
>
> The process ID can be added to the cgroup together with all its threads, as
> shown in the example of cgroup usage in (2) here -
And how does an external entity find out which ID should be put where?
This is knowledge only known to the process itself. That's what I
meant by going this route requires individual applications
communicating with external agents.
> In most cases in the cloud you will be able to decide based on what
> workloads are running - see the example 1.5 here
>
> http://marc.info/?l=linux-kernel&m=143889397419199
Sure, but that's a way outer scope. The point was that this can't handle
the in-process scope.
> Each application would
> >constantly have to tell an external agent about what its intentions
> >are. This might seem to work in a limited feature testing setup where
> >you know everything about who's doing what but is no way a widely
> >deployable solution. This pretty much degenerates into #3 you listed
> >below.
>
> The app may not be the best one to decide - see 1.1 and 1.2 here
> http://marc.info/?l=linux-kernel&m=143889397419199
That paragraph just shows how little is understood. So you can't
imagine a situation where threads of a process agree upon how they'll
use the cache to improve performance? Threads of the same program do
things like this all the time with different types of resources. This
is a large portion of what server software programmers do - making the
threads and other components behave in a way that maximizes the
efficacy of the underlying system.
Thanks.
--
tejun
Vikas, Tejun,
This is an updated interface. It addresses all comments made
so far and also covers all use-cases the cgroup interface
covers.
Let me know what you think. I'll proceed to writing
the test applications.
Usage model:
------------
This document details how CAT technology is
exposed to userspace.
Each task has a list of task cache reservation entries (TCRE list).
The init process is created with empty TCRE list.
There is a system-wide unique ID space, each TCRE is assigned
an ID from this space. ID's can be reused (but no two TCREs
have the same ID at one time).
The interface accommodates transient and independent cache allocation
adjustments from applications, as well as static cache partitioning
schemes.
Allocation:
Usage of the system calls requires the CAP_SYS_CACHE_RESERVATION capability.
A configurable percentage is reserved to tasks with empty TCRE list.
On fork, the child inherits the TCR from its parent.
Semantics:
Once a TCRE is created and assigned to a task, that task has a
guaranteed reservation on any CPU where it is scheduled,
for the lifetime of the TCRE.
A task can have its TCR list modified without notification.
FIXME: Add a per-task flag to not copy the TCR list of a task but delete
all TCR's on fork.
Interface:
enum cache_rsvt_flags {
	CACHE_RSVT_ROUND_DOWN = (1 << 0),   /* round "kbytes" down */
};
enum cache_rsvt_type {
	CACHE_RSVT_TYPE_CODE = 0,   /* cache reservation is for code */
	CACHE_RSVT_TYPE_DATA,       /* cache reservation is for data */
	CACHE_RSVT_TYPE_BOTH,       /* cache reservation is for code and data */
};
struct cache_reservation {
	unsigned long kbytes;
	int type;
	int flags;
	int tcrid;
};
The following syscalls modify the TCR of a task:
* int sys_create_cache_reservation(struct cache_reservation *rsvt);
DESCRIPTION: Creates a cache reservation entry, and assigns
it to the current task.
Returns -ENOMEM if not enough space, -EPERM if no permission.
Returns 0 if the reservation has been successful, copying the actual
number of kbytes reserved to "kbytes", the type to "type", and the ID to
"tcrid".
* int sys_delete_cache_reservation(struct cache_reservation *rsvt);
DESCRIPTION: Deletes a cache reservation entry, deassigning it
from any task.
Backward compatibility for processors with no support for code/data
differentiation: by default code and data cache allocation types
fall back to CACHE_RSVT_TYPE_BOTH on older processors (and return the
information that they have done so via "flags").
* int sys_attach_cache_reservation(pid_t pid, unsigned int tcrid);
DESCRIPTION: Attaches the cache reservation identified by "tcrid" to
the task identified by pid.
Returns 0 if successful.
* int sys_detach_cache_reservation(pid_t pid, unsigned int tcrid);
DESCRIPTION: Detaches the cache reservation identified by "tcrid" from
the task identified by pid.
The following syscalls list the TCRs:
* int sys_get_cache_reservations(size_t size, struct cache_reservation list[]);
DESCRIPTION: Return all cache reservations in the system.
Size should be set to the maximum number of items that can be stored
in the buffer pointed to by list.
* int sys_get_tcrid_tasks(unsigned int tcrid, size_t size, pid_t list[]);
DESCRIPTION: Return which pids are associated with tcrid.
* sys_get_pid_cache_reservations(pid_t pid, size_t size,
struct cache_reservation list[]);
DESCRIPTION: Return all cache reservations associated with "pid".
Size should be set to the maximum number of items that can be stored
in the buffer pointed to by list.
* sys_get_cache_reservation_info()
DESCRIPTION: ioctl to retrieve hardware info: cache round size, whether
code/data separation is supported.
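As an illustration, this is roughly how a task might exercise the proposed
interface above. The syscalls were never merged and no numbers were ever
allocated, so the placeholder number -1 is used and every call fails with
ENOSYS on a real kernel; this is a sketch of intended usage, not working code
against any shipped API:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>

enum cache_rsvt_type { CACHE_RSVT_TYPE_CODE = 0, CACHE_RSVT_TYPE_DATA,
		       CACHE_RSVT_TYPE_BOTH };

struct cache_reservation {
	unsigned long kbytes;
	int type;
	int flags;
	int tcrid;
};

int main(void)
{
	struct cache_reservation r;

	memset(&r, 0, sizeof(r));
	r.kbytes = 2048;               /* ask for 2MB */
	r.type = CACHE_RSVT_TYPE_DATA;

	/* create a reservation and assign it to the current task */
	if (syscall(-1 /* __NR_create_cache_reservation, hypothetical */,
		    &r) == 0)
		printf("reserved %lukB, tcrid %d\n", r.kbytes, r.tcrid);

	/* hand the same reservation to another task (pid is an example) */
	syscall(-1 /* __NR_attach_cache_reservation, hypothetical */,
		(pid_t)1234, r.tcrid);
	return 0;
}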
On Mon, 17 Aug 2015, Marcelo Tosatti wrote:
> Vikas, Tejun,
>
> This is an updated interface. It addresses all comments made
> so far and also covers all use-cases the cgroup interface
> covers.
>
> Let me know what you think. I'll proceed to writing
> the test applications.
>
> Usage model:
> ------------
>
> This document details how CAT technology is
> exposed to userspace.
>
> Each task has a list of task cache reservation entries (TCRE list).
>
> The init process is created with empty TCRE list.
>
> There is a system-wide unique ID space, each TCRE is assigned
> an ID from this space. ID's can be reused (but no two TCREs
> have the same ID at one time).
>
> The interface accommodates transient and independent cache allocation
> adjustments from applications, as well as static cache partitioning
> schemes.
>
> Allocation:
> Usage of the system calls requires the CAP_SYS_CACHE_RESERVATION capability.
>
> A configurable percentage is reserved to tasks with empty TCRE list.
And how do you think you will do this without a system controlled mechanism?
Every time in your proposal you include these caveats which actually mean to
include a system controlled interface in the background,
and your interfaces below make no mention of this really! Why do we want to
confuse ourselves like this?
A syscall-only interface does not seem to work on its own for the cache
allocation scenario. This can only be a nice
to have interface on top of a system controlled mechanism like the cgroup
interface.
Sure you can do all the things you did with cgroups with the syscall
interface as well, but the point is what are the use cases that can't be done
with this syscall-only interface. (ex: to deal with cases you brought up
earlier, like when an app does cache intensive work for some time and later
changes - it could use the syscall interface to quickly relinquish the cache
lines or change the clos associated with it)
I have repeatedly listed the use cases that can be dealt with , with this
interface. How will you address the cases like 1.1 and 1.2 with your
syscall-only interface? So we expect all the millions of apps like SAP,
Oracle, etc. and all the millions of app developers to magically learn our
new syscall interface and also cooperate between themselves to decide a cache
allocation that is agreeable to all? (which btw the interface doesn't list
below how to do it) and then by some godly powers the noisy neighbour will
decide himself to give up the cache? (that should be the first ever app in
the world to not request more resources for itself and hurt its own
performance - they surely don't want to do social service!)
And how do we do case 1.5, where the administrator wants to assign cache to
specific VMs in a cloud etc.? With the hypothetical syscall interface we now
should expect all the apps to do the above, and now they also need to know
where they run (what VM, what socket etc.) and then decide and cooperate on
an allocation: compare this to a container environment like Rancher, where
today the admin can conveniently use docker underneath to allocate
mem/storage/compute to containers and easily extend this to include shared
L3.
http://marc.info/?l=linux-kernel&m=143889397419199
Without addressing the above, the details of the interface below are
irrelevant.
Your initial request was to extend the cgroup interface to include rounding
off the size of cache (which can easily be done with a bash script on top of
the cgroup interface!) and now you are proposing a syscall-only interface?
This is very confusing and will only unnecessarily delay the process without
adding any value.
However, like I mentioned, the syscall interface, or the user/app being able
to modify the cache alloc, could be used to address some very specific use
cases on top of an existing system managed interface. This is not really a
common case in cloud or container environments, nor a feasibly deployable
solution.
Just consider the millions of apps that have to transition to such an
interface to even use it - if that's the only way to do it, that's dead on
arrival.
Also, please do not include the kernel automatically adjusting resources in
your reply, as that's totally irrelevant and again more confusing; we have
already exchanged some >100 emails on this same patch version without it
meaning anything so far.
The debate is purely between a syscall-only
interface and a system manageable interface (like cgroup, where the admin or
a central entity controls the resources). If not, define what it is first
before going into details.
Thanks,
Vikas
On Thu, 20 Aug 2015, Vikas Shivappa wrote:
>
>
> On Mon, 17 Aug 2015, Marcelo Tosatti wrote:
>
>> [...]
>> A configurable percentage is reserved to tasks with empty TCRE list.
>
> And how do you think you will do this without a system controlled mechanism?
> Every time in your proposal you include these caveats which actually mean to
> include a system controlled interface in the background,
> and your interfaces below make no mention of this really! Why do we want to
> confuse ourselves like this?
>
> A syscall-only interface does not seem to work on its own for the cache
> allocation scenario. This can only be a nice to have interface on top of a
> system controlled mechanism like the cgroup interface. Sure you can do all
> the things you did with cgroups with the syscall interface as well, but the
> point is what are the use cases that can't be done with this syscall-only
> interface. (ex: to deal with cases you brought up earlier, like when an app
> does cache intensive work for some time and later changes - it could use the
> syscall interface to quickly relinquish the cache lines or change the clos
> associated with it)
>
> I have repeatedly listed the use cases that can be dealt with , with this
big typo - 'use cases that cannot be dealt with'
> interface. How will you address the cases like 1.1 and 1.2 with your
> syscall-only interface? So we expect all the millions of apps like SAP,
> Oracle, etc. and all the millions of app developers to magically learn our
> new syscall interface and also cooperate between themselves to decide a
> cache allocation that is agreeable to all? (which btw the interface doesn't
> list below how to do it) and then by some godly powers the noisy neighbour
> will decide himself to give up the cache? (that should be the first ever app
> in the world to not request more resources for itself and hurt its own
> performance - they surely don't want to do social service!)
On Thu, Aug 20, 2015 at 05:06:51PM -0700, Vikas Shivappa wrote:
>
>
> On Mon, 17 Aug 2015, Marcelo Tosatti wrote:
>
> >[...]
> >A configurable percentage is reserved to tasks with empty TCRE list.
Hi Vikas,
> And how do you think you will do this without a system controlled
> mechanism?
> Every time in your proposal you include these caveats
> which actually mean to include a system controlled interface in the
> background,
> and your interfaces below make no mention of this really! Why do we
> want to confuse ourselves like this?
> A syscall-only interface does not seem to work on its own for the
> cache allocation scenario. This can only be a nice to have interface
> on top of a system controlled mechanism like the cgroup interface. Sure
> you can do all the things you did with cgroups with the
> syscall interface as well, but the point is what are the use cases that
> can't be done with this syscall-only interface. (ex: to deal with cases
> you brought up earlier, like when an app does cache intensive work
> for some time and later changes - it could use the syscall interface
> to quickly relinquish the cache lines or change the clos associated
> with it)
All use cases can be covered with the syscall interface.
* How to convert from the cgroup interface to the syscall interface:
Cgroup: Partition cache in cgroups, add tasks to cgroups.
Syscall: Partition cache in TCREs, add TCREs to tasks.
You build the same structure (task <--> CBM) either via syscalls
or via cgroups.
Please be more specific, I can't really see any problem.
> I have repeatedly listed the use cases that can be dealt with , with
> this interface. How will you address the cases like 1.1 and 1.2 with
> your syscall-only interface?
Case 1.1:
--------
1.1> Exclusive access: The task cannot give *itself* exclusive
access to the cache. For this it needs to have visibility of
the cache allocation of other tasks and may need to reclaim or
override others' cache allocs, which is not feasible (isn't that the
ability of a system managing agent?).
Answer: if the application has CAP_SYS_CACHE_RESERVATION, it can
create cache allocations and remove cache allocations from
other applications. So only the administrator could do it.
Case 1.2 answer below.
> So we expect all the millions of apps
> like SAP, Oracle, etc. and all the millions of app developers
> to magically learn our new syscall interface and also cooperate
> between themselves to decide a cache allocation that is agreeable to
> all? (which btw the interface doesn't list below how to do it) and
They don't have to: the administrator can use the "cacheset" application.
If an application wants to control the cache, it can.
> then by some godly powers the noisy neighbour will decide himself
> to give up the cache?
I suppose you imagine something like this:
http://arxiv.org/pdf/1410.6513.pdf
No, the syscall interface does not need to care about that because:
* If you can set cache (CAP_SYS_CACHE_RESERVATION capability),
you can remove cache reservation from your neighbours.
So this problem does not exist (it assumes participants are
cooperative).
There is one confusion in the argument for case 1.1 and case 1.2:
that applications are supposed to include in their decision of cache
allocation size the status of the system as a whole. This is a flawed
argument. Please point specifically if this is not the case or if there
is another case still not covered.
It would be possible to partition the cache into watermarks such
as:
task group A - can reserve up to 20% of cache.
task group B - can reserve up to 25% of cache.
task group C - can reserve 50% of cache.
But i am not sure... Tejun, do you think that is necessary?
(CAP_SYS_CACHE_RESERVATION is good enough for our usecases).
> (that should be the first ever app to not request
> more resources for itself and hurt its own performance
> - they surely don't want to do social service!)
>
> And how do we do case 1.5, where the administrator wants to assign
> cache to specific VMs in a cloud etc.? With the hypothetical syscall
> interface we now should expect all the apps to do the above, and now
> they also need to know where they run (what VM, what socket etc.)
> and then decide and cooperate on an allocation: compare this to a
> container environment like Rancher, where today the admin can
> conveniently use docker underneath to allocate mem/storage/compute to
> containers and easily extend this to include shared L3.
>
> http://marc.info/?l=linux-kernel&m=143889397419199
>
> Without addressing the above, the details of the interface below are
> irrelevant -
You are missing the point: there is supposed to be a "cacheset"
program which will allow the admin to set up TCREs and assign them to
tasks.
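Such a "cacheset" never materialized; as a hedge, a sketch of what a thin
taskset-style wrapper over the proposed (unmerged, placeholder-numbered)
syscalls might look like, assuming a "cacheset <pid> <kbytes>" command line:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>

struct cache_reservation {
	unsigned long kbytes;
	int type;
	int flags;
	int tcrid;
};

int main(int argc, char **argv)
{
	struct cache_reservation r;
	pid_t pid;

	if (argc != 3) {
		fprintf(stderr, "usage: cacheset <pid> <kbytes>\n");
		return 1;
	}
	pid = (pid_t)atoi(argv[1]);
	memset(&r, 0, sizeof(r));
	r.kbytes = strtoul(argv[2], NULL, 0);
	r.type = 2; /* CACHE_RSVT_TYPE_BOTH */

	/* create the TCRE, then attach it to the target task;
	 * syscall numbers are placeholders, so this fails with ENOSYS */
	if (syscall(-1 /* __NR_create_cache_reservation */, &r) < 0 ||
	    syscall(-1 /* __NR_attach_cache_reservation */, pid, r.tcrid) < 0) {
		perror("cacheset");
		return 1;
	}
	printf("attached tcrid %d (%lukB) to pid %d\n",
	       r.tcrid, r.kbytes, (int)pid);
	return 0;
}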
> Your initial request was to extend the cgroup interface to include
> rounding off the size of cache (which can easily be done with a bash
> script on top of the cgroup interface!) and now you are proposing a
> syscall-only interface? This is very confusing and will only
> unnecessarily delay the process without adding any value.
I suppose you are assuming that it's necessary for applications to
set their own cache. This assumption is not correct.
Take a look at Tuna / sched_getaffinity:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Affinity.html
> However, like I mentioned, the syscall interface, or the user/app being
> able to modify the cache alloc, could be used to address some very
> specific use cases on top of an existing system managed interface. This
> is not really a common case in cloud or container environments, nor a
> feasibly deployable solution.
> Just consider the millions of apps that have to transition to such
> an interface to even use it - if that's the only way to do it, that's
> dead on arrival.
Applications should not rely on interfaces that are not upstream.
Is there an explicit request or comment from users about
their difficulty regarding a change in the interface?
> Also, please do not include the kernel automatically adjusting resources
> in your reply, as that's totally irrelevant and again more confusing;
> we have already exchanged some >100 emails on this same patch
> version without it meaning anything so far.
>
> The debate is purely between a syscall-only interface and a system
> manageable interface (like cgroup, where the admin or a central entity
> controls the resources). If not, define what it is first before going
> into details.
See the Tuna / taskset page.
The administrator could, for example, use "cacheset" from within
the scripts which initialize the applications.
Then having control over those scripts, he can view them as a "unified
system control interface".
Problems with cgroup interface:
1) Global IPI on CBM <---> task change does not scale.
2) Syscall interface specification is in kbytes, not
cache ways (which is what must be recorded by the OS
to allow migration of the OS between different
hardware systems).
3) Compilers are able to configure cache optimally for
given ranges of code inside applications, easily,
if desired.
4) Does not allow proper usage of shared caches between
applications. Think of the following scenario:
* AppA has threads which are created/destroyed,
but once initialized, want cache reservation.
* How is AppA going to coordinate with the cgroups
system to initialize/shutdown cgroups?
I started writing the syscall interface on top of your latest
patchset yesterday (it should be relatively easy, given
that most of the low-level code is already there).
Any news on the data/code separation ?
> Thanks,
> Vikas
>
On Fri, 21 Aug 2015, Marcelo Tosatti wrote:
> On Thu, Aug 20, 2015 at 05:06:51PM -0700, Vikas Shivappa wrote:
>>
>>
>> On Mon, 17 Aug 2015, Marcelo Tosatti wrote:
>>
>>> [...]
>>> A configurable percentage is reserved to tasks with empty TCRE list.
>
> Hi Vikas,
>
>> And how do you think you will do this without a system controlled
>> mechanism?
>> Every time in your proposal you include these caveats
>> which actually mean to include a system controlled interface in the
>> background,
>> and your interfaces below make no mention of this really! Why do we
>> want to confuse ourselves like this?
>> A syscall-only interface does not seem to work on its own for the
>> cache allocation scenario. This can only be a nice to have interface
>> on top of a system controlled mechanism like the cgroup interface. Sure
>> you can do all the things you did with cgroups with the
>> syscall interface as well, but the point is what are the use cases that
>> can't be done with this syscall-only interface. (ex: to deal with cases
>> you brought up earlier, like when an app does cache intensive work
>> for some time and later changes - it could use the syscall interface
>> to quickly relinquish the cache lines or change the clos associated
>> with it)
>
> All use cases can be covered with the syscall interface.
>
> * How to convert from the cgroup interface to the syscall interface:
> Cgroup: Partition cache in cgroups, add tasks to cgroups.
> Syscall: Partition cache in TCREs, add TCREs to tasks.
>
> You build the same structure (task <--> CBM) either via syscalls
> or via cgroups.
>
> Please be more specific, I can't really see any problem.
Well, at first you mentioned that the cgroup does not support specifying size
in bytes and percentage, and then you eventually agreed with my explanation
that you can easily write a bash script to do the same with cgroup bitmasks
(although I had to go through the pain of reading all the proposals you sent
without being given a chance to explain how they can be used). Then you were
confused by how I explained the co-mounting of the cpuset
and intel_rdt cgroups, and instead of asking a question or pointing out the
issue, you went ahead and wrote a whole proposal and in the end even said you
would cook a patch before I even tried to explain it to you.
And then you sent proposal after proposal, which varied from modifying the
cgroup interface itself, to slightly modifying cgroups and adding syscalls,
to automatically controlling the cache alloc (with all your extend-mask
capabilities), without understanding what the framework is meant to do or
just asking or specifically pointing out
any issues in the patch. You have been reviewing the cgroup patches for
many versions, unlike others who accepted they need time to think about it or
accepted that they may not understand the feature yet.
So what is it that changed in the patches that is not acceptable now? Many
things have been brought up multiple times even
after you agreed to a solution already proposed. I was only suggesting that
this can be better and less confusing if you point out the exact issue in the
patch, just like Thomas and all of the reviewers have been doing. With the
rest of the reviewers I either fix the issue or point out a flaw in the
review.
If you don't like the cgroup interface now, it would be best to
indicate or discuss the specifics of the shortcomings clearly
before sending new proposals. That way we can come up with an interface which
does better and works better in Linux if we can. Otherwise we may just end up
adding more code which just does the same thing.
However, I have been working on an alternate interface as well and have just
sent it for your reference.
>
>> I have repeatedly listed the use cases that can be dealt with , with
>> this interface. How will you address the cases like 1.1 and 1.2 with
>> your syscall-only interface?
>
> Case 1.1:
> --------
>
> 1.1> Exclusive access: The task cannot give *itself* exclusive
> access to the cache. For this it needs to have visibility of
> the cache allocation of other tasks and may need to reclaim or
> override others' cache allocs, which is not feasible (isn't that the
> ability of a system managing agent?).
>
> Answer: if the application has CAP_SYS_CACHE_RESERVATION, it can
> create cache allocations and remove cache allocations from
> other applications. So only the administrator could do it.
The 1.1 also includes another use case (let's call this 1.1.1) which
indicates that the apps would just
allocate a lot of cache and soon run out of space. Hence the first few apps
would get most of the cache (they would get *most* even if you reserve some %
of cache for others - and again that's difficult to assign to the others).
Now if you say you want to put a threshold limit on how much each app can
self-allocate, then that turns out to be an interface that can easily be
built on top of the existing
cgroup interface; iow it's just a control you are giving the app on top of an
existing admin controlled interface (like cgroup). The threshold can just be
the cbm of the cgroup which the
tasks belong to, so the apps can self-allocate or reduce the allocation to
something which is a subset of what the cgroup has (that's one way..)
Also the issue was to discuss self allocation, or the process deciding its
own allocation, vs. a system controlled mechanism. It wasn't clear which
syscalls among the ones below need to have this sys_cap and which ones would
not.
>
> Case 1.2 answer below.
>
>> So we expect all the millions of apps
>> like SAP, Oracle, etc. and all the millions of app developers
>> to magically learn our new syscall interface and also cooperate
>> between themselves to decide a cache allocation that is agreeable to
>> all? (which btw the interface doesn't list below how to do it) and
>
> They don't have to: the administrator can use the "cacheset" application.
the "cacheset" wasnt mentioned before. Now you are talking about a tool which
is also doing a centralized or system controlled allocation. This is
where I pointed out earlier that its best to keep the discussion to the point
and not randomly expand the scope to a variety of other options. If you want to
build a taskset like tool thats again just doing a system conrolled interface or
a centralized control mechamism which is what cgroup does. Then it just comes
down to whether cgroup
interface or the cacheset is more easy or intutive. And why would the already
widely used interface for resource allocation be not intutive ? - we first need
to answer that may be ? or any really required features it lacks ?
Also give that dockers use cgroups for resource allocations , it seems most fit
and thats the feedback i received repeatedly in linuxcon as well.
>
> If an application wants to control the cache, it can.
>
>> then by some godly powers the noisy neighbour will decide himself
>> to give up the cache?
>
> I suppose you imagine something like this:
> http://arxiv.org/pdf/1410.6513.pdf
>
> No, the syscall interface does not need to care about that because:
>
> * If you can set cache (CAP_SYS_CACHE_RESERVATION capability),
> you can remove cache reservation from your neighbours.
>
> So this problem does not exist (it assumes participants are
> cooperative).
>
> There is one confusion in the argument for case 1.1 and case 1.2:
> that applications are supposed to include in their decision of cache
> allocation size the status of the system as a whole. This is a flawed
> argument. Please point specifically if this is not the case or if there
> is another case still not covered.
Like I said, it wasn't clear what syscalls required this capability. Also
the 1.1.1 still breaks this; iow the apps need to have lesser control than a
system/admin controlled allocation.
>
> It would be possible to partition the cache into watermarks such
> as:
>
> task group A - can reserve up to 20% of cache.
> task group B - can reserve up to 25% of cache.
> task group C - can reserve 50% of cache.
>
> But i am not sure... Tejun, do you think that is necessary?
> (CAP_SYS_CACHE_RESERVATION is good enough for our usecases).
>
>> (that should be the first ever app to not request
>> more resources for itself and hurt its own performance
>> - they surely don't want to do social service!)
>>
>> And how do we do case 1.5, where the administrator wants to assign
>> cache to specific VMs in a cloud etc.? With the hypothetical syscall
>> interface we now should expect all the apps to do the above, and now
>> they also need to know where they run (what VM, what socket etc.)
>> and then decide and cooperate on an allocation: compare this to a
>> container environment like Rancher, where today the admin can
>> conveniently use docker underneath to allocate mem/storage/compute to
>> containers and easily extend this to include shared L3.
>>
>> http://marc.info/?l=linux-kernel&m=143889397419199
>>
>> without addressing the above, the details of the interface below are irrelevant -
>
> You are missing the point, there is supposed to be a "cacheset"
> program which will allow the admin to setup TCRE and assign them to
> tasks.
>
>> Your initial request was to extend the cgroup interface to include
>> rounding off the size of cache (which can easily be done with a bash
>> script on top of cgroup interface !) and now you are proposing a
>> syscall only interface ? this is very confusing and will only
>> unnecessarily delay the process without adding any value.
>
> I suppose you are assuming that it's necessary for applications to
> set their own cache. This assumption is not correct.
>
> Take a look at Tuna / sched_getaffinity:
>
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Affinity.html
>
>
>> however, like I mentioned, the syscall interface or the user/app being
>> able to modify the cache alloc could be used to address some very
>> specific use cases on top of an existing system-managed interface. This
>> is not really a common case in cloud or container environments, and
>> neither is it a feasible, deployable solution.
>> Just consider the millions of apps that have to transition to such
>> an interface to even use it - if that's the only way to do it, that's
>> dead on arrival.
>
> Applications should not rely on interfaces that are not upstream.
>
> Is there an explicit request or comment from users about
> their difficulty regarding a change in the interface?
However, there also needs to be reasoning on why the cgroup interface
is not good.
>
>> Also, please do not include the kernel automatically adjusting resources
>> in your reply, as that's totally irrelevant and again more confusing,
>> as we have already exchanged some >100 emails on this same patch
>> version without meaning anything so far.
>>
>> The debate is purely between a syscall-only interface and a system-
>> manageable interface (like cgroups, where the admin or a central entity
>> controls the resources). If not, define what it is first before going
>> into details.
>
> See the Tuna / taskset page.
> The administrator could, for example, use "cacheset" from within
> the scripts which initialize the applications.
> Then having control over those scripts, he can view them as a "unified
> system control interface".
>
> Problems with cgroup interface:
>
> 1) Global IPI on CBM <---> task change does not scale.
Don't understand this. How is the IPI related to cgroups? A task is
associated with one closid and it needs to carry that along wherever it
goes. It supports the use case I explain in (basically cloud/container
and server use cases mainly)
http://marc.info/?l=linux-kernel&m=144035279828805
> 2) Syscall interface specification is in kbytes, not
> cache ways (which is what must be recorded by the OS
> to allow migration of the OS between different
> hardware systems).
I thought you agreed that a simple bash script can convert the bitmask to
bytes in chunk-size units. All you need is the cache size from
/proc/cpuinfo and the max cbm bits in the root intel_rdt cgroup. And it's
incorrect to say you can do it in arbitrary bytes; it's really only in
chunk-size granularity (chunk size = cache size / max cbm bits).
Apart from that, the mask gives you the ability to define exclusive,
overlapping, or partially overlapping and partially exclusive
allocations.
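For reference, a minimal userspace sketch of that conversion
(illustrative only; cache_size and max_cbm_bits are assumed to have been
read beforehand from /proc/cpuinfo and the root intel_rdt cgroup):

/*
 * Sketch: convert a CBM to bytes using the chunk size described above.
 * Nothing here is from the posted patches.
 */
unsigned long cbm_to_bytes(unsigned long cbm, unsigned long cache_size,
			   unsigned int max_cbm_bits)
{
	unsigned long chunk_size = cache_size / max_cbm_bits;

	/* Each set bit in the CBM reserves one chunk of cache. */
	return (unsigned long)__builtin_popcountl(cbm) * chunk_size;
}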
> 3) Compilers are able to configure cache optimally for
> given ranges of code inside applications, easily,
> if desired.
This is again not possible because of 1.1.1, and it can still be done in
a restricted fashion like I explained above.
> 4) Does not allow proper usage of shared caches between
> applications. Think of the following scenario:
> * AppA has threads which are created/destroyed,
> but once initialized, want cache reservation.
> * How is AppA going to coordinate with the cgroups
> system to initialize/shutdown cgroups?
>
Yes, the interface does not support apps self-controlling their cache
allocation. That is accepted. But this is not the main use case we
target, like I explained above and in the link I provided for the new
proposal. So it's not very important as such.
Also, worst case, you can easily design a syscall for apps to
self-control their allocation, keeping the cgroup allocation for the
task as the max threshold.
So let's nail down this list of cgroup flaws before thinking about
changes? This should have been the first thing in the email, really, is
what I was mentioning.
> I started writing the syscall interface on top of your latest
> patchset yesterday (it should be relatively easy, given
> that most of the low-level code is already there).
>
> Any news on the data/code separation ?
Will send them this week, partially untested due to the h/w not yet
being with me. They have been ready, but I was waiting to see the
discussion on this patch as well.
More responses below -
>
>
>> Thanks,
>> Vikas
>>
>>>
>>> On fork, the child inherits the TCR from its parent.
>>>
>>> Semantics:
>>> Once a TCRE is created and assigned to a task, that task has
>>> guaranteed reservation on any CPU where its scheduled in,
>>> for the lifetime of the TCRE.
>>>
>>> A task can have its TCR list modified without notification.
Why does the task need a list of allocations? A task is tagged with only
one closid and needs to carry that along. Even if the list is per
socket, that would need to be an array.
>>>
>>> FIXME: Add a per-task flag to not copy the TCR list of a task but delete
>>> all TCR's on fork.
>>>
>>> Interface:
>>>
>>> enum cache_rsvt_flags {
>>> CACHE_RSVT_ROUND_DOWN = (1 << 0), /* round "kbytes" down */
>>> };
Not really optional, is it? The chunk size is decided by the h/w SKU,
and you can only allocate in multiples of that chunk size, not arbitrary
bytes.
>>>
>>> enum cache_rsvt_type {
>>> CACHE_RSVT_TYPE_CODE = 0, /* cache reservation is for code */
>>> CACHE_RSVT_TYPE_DATA, /* cache reservation is for data */
>>> CACHE_RSVT_TYPE_BOTH, /* cache reservation is for code and data */
>>> };
>>>
>>> struct cache_reservation {
>>> unsigned long kbytes;
This should really be rounded off to the chunk size. And like I
explained above, the masks let you do exclusive/partially exclusive
splits with adjustable percentages easily (say 20% shared and the rest
exclusive), or a tolerated amount of sharing...
>>> int type;
>>> int flags;
>>> int trcid;
>>> };
>>>
>>> The following syscalls modify the TCR of a task:
>>>
>>> * int sys_create_cache_reservation(struct cache_reservation *rsvt);
>>> DESCRIPTION: Creates a cache reservation entry, and assigns
>>> it to the current task.
So now I assume this is what the task can do for itself, and the ones
below which take a pid need the capability? Again, this breaks 1.1.1
like I said above, and any way to restrict allocation to a max threshold
can just as easily be done on top of the cgroup allocation, keeping the
cgroup allocation as the max threshold (a sketch follows below).
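A purely illustrative sketch of that check (these names are made up, not
from the posted patches):

/*
 * Hypothetical sketch: a task may shrink or reshape its own allocation,
 * but never exceed the CBM of the cgroup it belongs to.
 */
static int rdt_validate_self_alloc(unsigned long requested_cbm,
				   unsigned long cgroup_cbm)
{
	/* Any bit outside the cgroup's CBM exceeds the threshold. */
	if (requested_cbm & ~cgroup_cbm)
		return -EPERM;
	return 0;
}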
>>>
>>> returns -ENOMEM if not enough space, -EPERM if no permission.
>>> returns 0 if reservation has been successful, copying actual
>>> number of kbytes reserved to "kbytes", type to type, and tcrid.
>>>
>>> * int sys_delete_cache_reservation(struct cache_reservation *rsvt);
>>> DESCRIPTION: Deletes a cache reservation entry, deassigning it
>>> from any task.
>>>
>>> Backward compatibility for processors with no support for code/data
>>> differentiation: by default, code and data cache allocation types
>>> fall back to CACHE_RSVT_TYPE_BOTH on older processors (and return the
>>> information that they have done so via "flags").
We need to address the change of mode, which is dynamic, and it may be
more intuitive to do that in cgroups for the reasons I said above.
Taking an allocation back from a process may need a callback; that's why
it may be best to design an interface where the apps know their control
is very limited and within the purview of the allocations already set by
the root user.
Please check the new proposal, which tries to address most of the
comments I made -
http://marc.info/?l=linux-kernel&m=144035279828805
The framework still lets any kernel-mode or high-level user-mode library
developer build a cacheset-like tool or others on top of it, if
something more custom and more intuitive is needed.
Thanks,
Vikas
>>>
>>> * int sys_attach_cache_reservation(pid_t pid, unsigned int tcrid);
>>> DESCRIPTION: Attaches the cache reservation identified by "tcrid" to
>>> the task identified by pid.
>>> returns 0 if successful.
>>>
>>> * int sys_detach_cache_reservation(pid_t pid, unsigned int tcrid);
>>> DESCRIPTION: Detaches the cache reservation identified by "tcrid" from
>>> the task identified by pid.
>>>
>>> The following syscalls list the TCRs:
>>> * int sys_get_cache_reservations(size_t size, struct cache_reservation list[]);
>>> DESCRIPTION: Return all cache reservations in the system.
>>> Size should be set to the maximum number of items that can be stored
>>> in the buffer pointed to by list.
>>>
>>> * int sys_get_tcrid_tasks(unsigned int tcrid, size_t size, pid_t list[]);
>>> DESCRIPTION: Return which pids are associated to tcrid.
>>>
>>> * sys_get_pid_cache_reservations(pid_t pid, size_t size,
>>> struct cache_reservation list[]);
>>> DESCRIPTION: Return all cache reservations associated with "pid".
>>> Size should be set to the maximum number of items that can be stored
>>> in the buffer pointed to by list.
>>>
>>> * sys_get_cache_reservation_info()
>>> DESCRIPTION: ioctl to retrieve hardware info: cache round size, whether
>>> code/data separation is supported.
>>>
>>>
>
On Sun, Aug 23, 2015 at 11:47:49AM -0700, Vikas Shivappa wrote:
>
>
> On Fri, 21 Aug 2015, Marcelo Tosatti wrote:
>
> >On Thu, Aug 20, 2015 at 05:06:51PM -0700, Vikas Shivappa wrote:
> >>
> >>
> >>On Mon, 17 Aug 2015, Marcelo Tosatti wrote:
> >>
> >>>Vikas, Tejun,
> >>>
> >>>This is an updated interface. It addresses all comments made
> >>>so far and also covers all use-cases the cgroup interface
> >>>covers.
> >>>
> >>>Let me know what you think. I'll proceed to writing
> >>>the test applications.
> >>>
> >>>Usage model:
> >>>------------
> >>>
> >>>This document details how CAT technology is
> >>>exposed to userspace.
> >>>
> >>>Each task has a list of task cache reservation entries (TCRE list).
> >>>
> >>>The init process is created with empty TCRE list.
> >>>
> >>>There is a system-wide unique ID space, each TCRE is assigned
> >>>an ID from this space. IDs can be reused (but no two TCREs
> >>>have the same ID at one time).
> >>>
> >>>The interface accommodates transient and independent cache allocation
> >>>adjustments from applications, as well as static cache partitioning
> >>>schemes.
> >>>
> >>>Allocation:
> >>>Usage of the system calls requires the CAP_SYS_CACHE_RESERVATION capability.
> >>>
> >>>A configurable percentage is reserved to tasks with empty TCRE list.
> >
> >Hi Vikas,
> >
> >>And how do you think you will do this without a system-controlled
> >>mechanism?
> >>Every time in your proposal you include these caveats,
> >>which actually mean to include a system-controlled interface in the
> >>background,
> >>and your interfaces below make no mention of this really! Why do we
> >>want to confuse ourselves like this?
> >>A syscall-only interface does not seem to work on its own for the
> >>cache allocation scenario. This can only be a nice-to-have interface
> >>on top of a system-controlled mechanism like the cgroup interface.
> >>Sure, you can do all the things you did with cgroups with the
> >>syscall interface, but the point is what use cases can't
> >>be done with this syscall-only interface. (ex: to deal with cases
> >>you brought up earlier, like when an app does cache-intensive work
> >>for some time and later changes - it could use the syscall interface
> >>to quickly relinquish the cache lines or change the clos associated
> >>with it)
> >
> >All use cases can be covered with the syscall interface.
> >
> >* How to convert from cgroups interface to syscall interface:
> >Cgroup: Partition cache in cgroups, add tasks to cgroups.
> >Syscall: Partition cache in TCRE, add TCREs to tasks.
> >
> >You build the same structure (task <--> CBM) either via syscall
> >or via cgroups.
> >
> >Please be more specific, can't really see any problem.
>
> Well, at first you mentioned that the cgroup does not support
> specifying size in bytes or percentages, and then you eventually
> agreed to my explanation that you can easily write a bash script to
> do the same with cgroup bitmasks. (although I had to go through the
> pain of reading all the proposals you sent without getting a chance
> to explain how it can be used).
Yes, we could write the (bytes -> cache ways) conversion in
userspace. But since we are going for a different interface, we can
fix that problem in the kernel as well.
> Then you had a confusion about
> how I explained the co-mounting of cpuset and intel_rdt, and
> instead of asking a question or pointing out an issue, you go ahead and
> write a whole proposal and in the end even say you will cook a patch
> before I even try to explain it to you.
The syscall interface is more flexible.
Why not use a more flexible interface if possible?
> And then you send proposal after proposal,
> which varied from
> modifying the cgroup interface itself to slightly modifying cgroups
Yes, trying to solve the problems our customers will be facing in the field.
So, these proposals are not coming out of thin air.
> and adding syscalls and then also automatically controlling the
> cache alloc (with all your extend mask capabilities) without
> understanding what the framework is meant to do or just asking or
> specifically pointing out any issues in the patch.
There is a practical problem the "extension" of mask capabilities is
solving. Check item 6 of the attached text document.
> You had been
> reviewing the cgroup patches for many versions, unlike others who
> accepted they need time to think about it or accepted that they
> may not understand the feature yet.
> So what is it that changed in the patches that is not acceptable now?
Tejun proposed a syscall interface. He is right, a syscall interface
is much more flexible. Blame him.
> Many things have been brought up multiple times even after you agreed
> to a solution already proposed. I was only suggesting that this can
> be better and less confusing if you point out the exact issue in the
> patch, just like Thomas and all the other reviewers have been doing.
>
> With the rest of the reviewers I either fix the issue or point out a
> flaw in the review.
> If you don't like the cgroup interface now,
> it would be best to indicate or
> discuss the specifics of the shortcomings clearly before sending
> new proposals.
> That way we can come up with an interface which does
> better and works better in Linux, if we can. Otherwise we may just
> end up adding more code which just does the same thing?
>
> However I have been working on an alternate interface as well and
> have just sent it for your ref.
Problem: locking.
> >>I have repeatedly listed the use cases that can be dealt with, with
> >>this interface. How will you address cases like 1.1 and 1.2 with
> >>your syscall-only interface ?
> >
> >Case 1.1:
> >--------
> >
> > 1.1> Exclusive access: The task cannot give *itself* exclusive
> >access to the cache. For this it needs to have visibility of
> >the cache allocation of other tasks and may need to reclaim or
> >override others' cache allocs, which is not feasible (isn't that the
> >ability of a system-managing agent?).
> >
> >Answer: if the application has CAP_SYS_CACHE_RESERVATION, it can
> >create cache allocation and remove cache allocation from
> >other applications. So only the administrator could do it.
>
> The 1.1 also includes another use case (let's call this 1.1.1) which
> indicates that the apps would just allocate a lot of cache and soon
> run out of space. Hence the first few apps would get most of the cache
> (would get *most* even if you reserve some % of cache for others -
> and again that's difficult to assign to the others).
>
> Now if you say you want to put a threshold limit on what each app can
> self-allocate, then that turns out to be an interface that can easily
> be built on top of the existing cgroup interface. In other words, it's
> just a control you are giving the app on top of an existing
> admin-controlled interface (like cgroups). The threshold can just be
> the CBM of the cgroup the tasks belong to. So now the apps can
> self-allocate or reduce the allocation to something which is a subset
> of what the cgroup has (that's one way..)
Yes.
> Also, the issue was to discuss self-allocation (a process deciding
> its own allocation) vs. a system-controlled mechanism.
> It
> wasn't clear which of the syscalls need this capability
> and which ones would not.
>
> >
> >Case 1.2 answer below.
> >
> >>So we expect all the millions of apps
> >>like SAP, Oracle, etc. and all the millions of app developers
> >>to magically learn our new syscall interface and also cooperate
> >>between themselves to decide a cache allocation that is agreeable to
> >>all? (which btw the interface doesn't list below how to do it) and
> >
> >They don't have to: the administrator can use "cacheset" application.
>
> the "cacheset" wasnt mentioned before. Now you are talking about a
> tool which is also doing a centralized or system controlled
> allocation.
Not me. Tejun proposed that.
> This is where I pointed out earlier that it's best to
> keep the discussion to the point and not randomly expand the scope
> to a variety of other options. If you want to build a taskset-like
> tool, that is again just a system-controlled interface or a
> centralized control mechanism, which is what cgroups provide. Then it
> just comes down to whether the cgroup interface or cacheset is
> easier or more intuitive. And why would the already widely used
> interface for resource allocation not be intuitive? We first need to
> answer that, or name any really required features it lacks.
> Also, given that Docker uses cgroups for resource allocation, it
> seems the most fitting choice, and that's the feedback I received
> repeatedly at LinuxCon as well.
>
> >
> >If an application wants to control the cache, it can.
> >
> >>then by some godly powers the noisy neighbour will decide himself
> >>to give up the cache ?
> >
> >I suppose you imagine something like this:
> >http://arxiv.org/pdf/1410.6513.pdf
> >
> >No, the syscall interface does not need to care about that because:
> >
> >* If you can set cache (CAP_SYS_CACHE_RESERVATION capability),
> >you can remove cache reservation from your neighbours.
> >
> >So this problem does not exist (it assumes participants are
> >cooperative).
> >
> >There is one confusion in the argument for cases 1.1 and case 1.2:
> >that applications are supposed to include in their decision of cache
> >allocation size the status of the system as a whole. This is a flawed
> >argument. Please point specifically if this is not the case or if there
> >is another case still not covered.
>
> Like I said, it wasn't clear which syscalls required this capability.
> Also, 1.1.1 still breaks this; in other words, the apps need to have
> less control than a system/admin-controlled allocation.
We should separate access control from the ability of applications to
change cache allocations.
Problem-1: Dividing percentages of the total cache among particular users.
This assumes each user has credentials to allocate/reserve cache. You
don't want to give user A more than 30% of the cache allocation because
user B requires 80% of the cache to achieve his performance requirements.
Problem-2: The decision to allocate cache is tied to application
initialization/destruction, and application initialization is
essentially random from the POV of the system (the events which trigger
the execution of the application are not visible to the system).
Think of a machine running two different servers: one database whose
requests arrive with a random Poisson distribution, averaging 30
requests per hour, with every request taking 1 minute, and one httpd
server with nearly constant load.
Without cache reservations, database requests take 2 minutes.
That is not acceptable for the database clients.
But with cache reservation, database requests take 1 minute.
You want to maximize the performance of both httpd and database
requests. What do you do? You allow the database server to perform
cache reservation once a request comes in, and to undo the reservation
once the request is finished.
It is impossible to perform this with a centralized interface.
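For illustration, the database server above would do roughly the
following with the proposed syscalls (entirely hypothetical, since none
of them are upstream; process_request() is a placeholder):

/*
 * Hypothetical use of the proposed syscalls: reserve cache when a
 * request comes in, release the reservation when it is finished.
 */
void handle_db_request(void)
{
	struct cache_reservation rsvt = {
		.kbytes	= 4096,			/* desired reservation */
		.type	= CACHE_RSVT_TYPE_BOTH,
		.flags	= 0,
	};

	if (sys_create_cache_reservation(&rsvt) == 0) {
		process_request();		/* runs with the reservation */
		sys_delete_cache_reservation(&rsvt);
	} else {
		process_request();		/* best effort, no reservation */
	}
}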
---
The point of the syscall interface is to handle problem-2 by allowing
applications to modify cache allocation themselves.
It ignores problem-1 (which is similar to case 1.1.1). Yes, if an
application can allocate 80% of the cache, hurting the performance of
other applications, then it can.
There is nothing we can do to solve that. We can allow it; if the
administrator decides not to, he can remove CAP_SYS_CACHE... from users
to avoid the problem.
So problem 1.1.1 is dealt with.
> >It would be possible to partition the cache into watermarks such
> >as:
> >
> >task group A - can reserve up to 20% of cache.
> >task group B - can reserve up to 25% of cache.
> >task group C - can reserve 50% of cache.
> >
> >But i am not sure... Tejun, do you think that is necessary?
> >(CAP_SYS_CACHE_RESERVATION is good enough for our usecases).
> >
> >> (that would be the first app ever to not request
> >>more resources in the world for itself and hurt its own performance
> >>- they surely don't want to do social service!)
> >>
> >>And how do we do case 1.5, where the administrator wants to assign
> >>cache to specific VMs in a cloud, etc.? With the hypothetical syscall
> >>interface we should now expect all the apps to do the above, and now
> >>they also need to know where they run (what VM, what socket, etc.)
> >>and then decide and cooperate on an allocation: compare this to a
> >>container environment like Rancher, where today the admin can
> >>conveniently use Docker underneath to allocate mem/storage/compute to
> >>containers and easily extend this to include shared L3.
> >>
> >>http://marc.info/?l=linux-kernel&m=143889397419199
> >>
> >>without addressing the above, the details of the interface below are irrelevant -
> >
> >You are missing the point, there is supposed to be a "cacheset"
> >program which will allow the admin to setup TCRE and assign them to
> >tasks.
> >
> >>Your initial request was to extend the cgroup interface to include
> >>rounding off the size of cache (which can easily be done with a bash
> >>script on top of cgroup interface !) and now you are proposing a
> >>syscall only interface ? this is very confusing and will only
> >>unnecessarily delay the process without adding any value.
> >
> >I suppose you are assuming that it's necessary for applications to
> >set their own cache. This assumption is not correct.
> >
> >Take a look at Tuna / sched_getaffinity:
> >
> >https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Affinity.html
> >
> >
> >>however, like I mentioned, the syscall interface or the user/app
> >>being able to modify the cache alloc could be used to address some
> >>very specific use cases on top of an existing system-managed
> >>interface. This is not really a common case in cloud or container
> >>environments, and neither is it a feasible, deployable solution.
> >>Just consider the millions of apps that have to transition to such
> >>an interface to even use it - if that's the only way to do it, that's
> >>dead on arrival.
> >
> >Applications should not rely on interfaces that are not upstream.
> >
> >Is there an explicit request or comment from users about
> >their difficulty regarding a change in the interface?
>
> However, there also needs to be reasoning on why the cgroup interface
> is not good.
The main problem of the cgroup interface, to me, is problem-2 above.
> >>Also, please do not include the kernel automatically adjusting
> >>resources in your reply, as that's totally irrelevant and again more
> >>confusing, as we have already exchanged some >100 emails on this
> >>same patch version without meaning anything so far.
> >>
> >>The debate is purely between a syscall-only interface and a system-
> >>manageable interface (like cgroups, where the admin or a central
> >>entity controls the resources). If not, define what it is first
> >>before going into details.
> >
> >See the Tuna / taskset page.
> >The administrator could, for example, use "cacheset" from within
> >the scripts which initialize the applications.
> >Then having control over those scripts, he can view them as a "unified
> >system control interface".
> >
> >Problems with cgroup interface:
> >
> >1) Global IPI on CBM <---> task change does not scale.
>
> Don't understand this. How is the IPI related to cgroups? A task is
> associated with one closid and it needs to carry that along wherever
> it goes. It supports the use case I explain in (basically
> cloud/container and server use cases mainly)
Think of problem-2 above and the following:
/*
 * cbm_update_all() - Update the cache bit mask for all packages.
 */
static inline void cbm_update_all(u32 closid)
{
	on_each_cpu_mask(&rdt_cpumask, cbm_cpu_update, (void *)closid, 1);
}
This needs to go.
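One possible direction (a sketch only, not from the posted patches)
would be to narrow the IPI to the packages on which the closid is
actually in use, instead of every package in rdt_cpumask:

/*
 * Illustrative sketch: IPI only the packages where the given closid
 * is currently in use. The "in_use" mask would have to be tracked
 * elsewhere; nothing here is from the posted patches.
 */
static void cbm_update_in_use(u32 closid, const struct cpumask *in_use)
{
	cpumask_t tmp;

	cpumask_and(&tmp, &rdt_cpumask, in_use);
	on_each_cpu_mask(&tmp, cbm_cpu_update, (void *)(unsigned long)closid, 1);
}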
> http://marc.info/?l=linux-kernel&m=144035279828805
>
> >2) Syscall interface specification is in kbytes, not
> >cache ways (which is what must be recorded by the OS
> >to allow migration of the OS between different
> >hardware systems).
>
> I thought you agreed that a simple bash script can convert the
> bitmask to bytes in chunk-size units. All you need is the cache size
> from /proc/cpuinfo and the max cbm bits in the root intel_rdt cgroup.
Yes, but that requires every user of the interface who considers
the possibility of moving to different platforms to perform
that conversion.
Why force the user (or the programmer) to maintain a quantity
that is not usable in any of those environments?
So the above facts mean it is preferable to expose the size in bytes.
Yes, I had agreed to "fix" this issue in userspace, but since there are
discussions about changing the interface, why not fix that problem in
the kernel rather than in userspace?
> And
> it's incorrect to say you can do it in arbitrary bytes. It's really
> only in chunk-size granularity (chunk size = cache size / max cbm bits).
Yes, you can do it in bytes. It's written in the syscall
proposal how you can do that.
> Apart from that, the mask gives you the ability to define
> exclusive, overlapping, or partially overlapping and partially
> exclusive allocations.
>
> >3) Compilers are able to configure cache optimally for
> >given ranges of code inside applications, easily,
> >if desired.
>
> This is again not possible because of 1.1.1, and it can still be done
> in a restricted fashion like I explained above.
1.1.1 is not a blocker. If it were, a similar argument would
be valid for sched_setaffinity:
it is not possible to allow applications to set their own affinity,
because two applications might set affinity for the same pCPU, which
affects the performance of both.
But still, applications with CAP_SYS_NICE are allowed to set their
own affinity.
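For comparison, this is all it takes today for a task to set its own
affinity (a complete, runnable example):

/* A task pinning itself to CPU 2 via sched_setaffinity(). */
#define _GNU_SOURCE
#include <sched.h>

int pin_self_to_cpu2(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(2, &set);
	/* pid 0 means "the calling thread". */
	return sched_setaffinity(0, sizeof(set), &set);
}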
> >4) Does not allow proper usage of shared caches between
> >applications. Think of the following scenario:
> > * AppA has threads which are created/destroyed,
> > but once initialized, want cache reservation.
> > * How is AppA going to coordinate with the cgroups
> > system to initialize/shutdown cgroups?
> >
>
> Yes, the interface does not support apps self-controlling their cache
> allocation. That is accepted. But this is not the main use case we
> target, like I explained above and in the link I provided for the new
> proposal. So it's not very important as such.
> Also, worst case, you can easily design a syscall for apps to
> self-control their allocation, keeping the cgroup allocation for the
> task as the max threshold.
> So let's nail down this list of cgroup flaws before thinking about
> changes? This should have been the first thing in the email, really,
> is what I was mentioning.
>
> >I started writing the syscall interface on top of your latest
> >patchset yesterday (it should be relatively easy, given
> >that most of the low-level code is already there).
> >
> >Any news on the data/code separation ?
>
> Will send them this week, partially untested due to the h/w not yet
> being with me. They have been ready, but I was waiting to see the
> discussion on this patch as well.
>
> more response below -
>
> >
> >
> >>Thanks,
> >>Vikas
> >>
> >>>
> >>>On fork, the child inherits the TCR from its parent.
> >>>
> >>>Semantics:
> >>>Once a TCRE is created and assigned to a task, that task has
> >>>guaranteed reservation on any CPU where its scheduled in,
> >>>for the lifetime of the TCRE.
> >>>
> >>>A task can have its TCR list modified without notification.
>
> Why does the task need a list of allocations? A task is tagged
> with only one closid and needs to carry that along. Even if the
> list is per socket, that would need to be an array.
See item 5 of the attached text.
> >>>FIXME: Add a per-task flag to not copy the TCR list of a task but delete
> >>>all TCR's on fork.
> >>>
> >>>Interface:
> >>>
> >>>enum cache_rsvt_flags {
> >>> CACHE_RSVT_ROUND_DOWN = (1 << 0), /* round "kbytes" down */
> >>>};
>
> Not really optional, is it? The chunk size is decided by the h/w SKU,
> and you can only allocate in multiples of that chunk size, not
> arbitrary bytes.
Specify the cache reservation in bytes.
By default, the OS rounds the bytes up to whole cache ways.
This flag lets the OS round the bytes down to cache ways instead.
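i.e., something like this sketch (way_kbytes, the size of one cache way
in kbytes, is an assumed input):

/*
 * Sketch of the rounding above: convert a request in kbytes to cache
 * ways, rounding up by default and down when CACHE_RSVT_ROUND_DOWN
 * is set.
 */
unsigned int kbytes_to_ways(unsigned long kbytes, unsigned long way_kbytes,
			    int flags)
{
	if (flags & CACHE_RSVT_ROUND_DOWN)
		return kbytes / way_kbytes;		/* round down */
	return (kbytes + way_kbytes - 1) / way_kbytes;	/* round up */
}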
> >>>
> >>>enum cache_rsvt_type {
> >>> CACHE_RSVT_TYPE_CODE = 0, /* cache reservation is for code */
> >>> CACHE_RSVT_TYPE_DATA, /* cache reservation is for data */
> >>> CACHE_RSVT_TYPE_BOTH, /* cache reservation is for code and data */
> >>>};
> >>>
> >>>struct cache_reservation {
> >>> unsigned long kbytes;
>
> This should really be rounded off to the chunk size. And like I
> explained above, the masks let you do exclusive/partially exclusive
> splits with adjustable percentages easily (say 20% shared and the
> rest exclusive), or a tolerated amount of sharing...
Please read the sentence above.
> >>> int type;
> >>> int flags;
> >>> int trcid;
> >>>};
> >>>
> >>>The following syscalls modify the TCR of a task:
> >>>
> >>>* int sys_create_cache_reservation(struct cache_reservation *rsvt);
> >>>DESCRIPTION: Creates a cache reservation entry, and assigns
> >>>it to the current task.
>
> So now I assume this is what the task can do for itself, and the ones
> below which take a pid need the capability? Again, this breaks 1.1.1
> like I said above, and any way to restrict allocation to a max
> threshold can just as easily be done on top of the cgroup allocation,
> keeping the cgroup allocation as the max threshold.
Not a problem; see the sched_setaffinity argument above.
> >>>returns -ENOMEM if not enough space, -EPERM if no permission.
> >>>returns 0 if reservation has been successful, copying actual
> >>>number of kbytes reserved to "kbytes", type to type, and tcrid.
> >>>
> >>>* int sys_delete_cache_reservation(struct cache_reservation *rsvt);
> >>>DESCRIPTION: Deletes a cache reservation entry, deassigning it
> >>>from any task.
> >>>
> >>>Backward compatibility for processors with no support for code/data
> >>>differentiation: by default, code and data cache allocation types
> >>>fall back to CACHE_RSVT_TYPE_BOTH on older processors (and return the
> >>>information that they have done so via "flags").
>
> Need to address the change of mode which is dynamic
There is no change of mode in the following scheme:
I/D-capable processor: boots with I/D enabled and remains that way.
Non-I/D-capable processor: boots with I/D disabled and remains that
way.
Do you see any problem with this scheme?
> and it may be
> more intuitive to do that in cgroups for the reasons I said above, and
> taking an allocation back from a process may need a callback; that's
> why it may be best to design an interface where the apps know their
> control is very limited and within the purview of the allocations
> already set by the root user.
>
> Please check the new proposal, which tries to address most of the
> comments I made -
> http://marc.info/?l=linux-kernel&m=144035279828805
> The framework still lets any kernel-mode or high-level user-mode
> library developer build a cacheset-like tool or others on top of it
> if something more custom and more intuitive is needed.
>
> Thanks,
> Vikas
A major problem of any filesystem-based interface, pointed out by Tejun,
is that locking must be performed by the user.
With the syscall interface, the kernel can properly handle locking for
the user, and can use RCU to deal with locking in the kernel nicely.
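For example (a hypothetical sketch; struct tcr and the list are made up,
not from the patches): readers could walk a task's TCR list under
rcu_read_lock() while writers serialize on a mutex and publish with
list_add_rcu()/list_del_rcu() plus kfree_rcu().

struct tcr {
	unsigned int		tcrid;
	struct list_head	list;
	struct rcu_head		rcu;
};

/* Lockless lookup of a tcrid in a task's TCR list. */
static bool task_has_tcrid(struct list_head *tcr_list, unsigned int tcrid)
{
	struct tcr *t;
	bool found = false;

	rcu_read_lock();
	list_for_each_entry_rcu(t, tcr_list, list) {
		if (t->tcrid == tcrid) {
			found = true;
			break;
		}
	}
	rcu_read_unlock();
	return found;
}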
One issue you are trying to deal with, that I ignored, is Problem-1:
the division of cache allocation per user.
> >>>* int sys_attach_cache_reservation(pid_t pid, unsigned int tcrid);
> >>>DESCRIPTION: Attaches the cache reservation identified by "tcrid" to
> >>>the task identified by pid.
> >>>returns 0 if successful.
> >>>
> >>>* int sys_detach_cache_reservation(pid_t pid, unsigned int tcrid);
> >>>DESCRIPTION: Detaches the cache reservation identified by "tcrid" from
> >>>the task identified by pid.
> >>>
> >>>The following syscalls list the TCRs:
> >>>* int sys_get_cache_reservations(size_t size, struct cache_reservation list[]);
> >>>DESCRIPTION: Return all cache reservations in the system.
> >>>Size should be set to the maximum number of items that can be stored
> >>>in the buffer pointed to by list.
> >>>
> >>>* int sys_get_tcrid_tasks(unsigned int tcrid, size_t size, pid_t list[]);
> >>>DESCRIPTION: Return which pids are associated to tcrid.
> >>>
> >>>* sys_get_pid_cache_reservations(pid_t pid, size_t size,
> >>> struct cache_reservation list[]);
> >>>DESCRIPTION: Return all cache reservations associated with "pid".
> >>>Size should be set to the maximum number of items that can be stored
> >>>in the buffer pointed to by list.
> >>>
> >>>* sys_get_cache_reservation_info()
> >>>DESCRIPTION: ioctl to retrieve hardware info: cache round size, whether
> >>>code/data separation is supported.
> >>>
> >>>
> >