Date: Tue, 28 Jul 2015 17:06:51 -0700 (PDT)
From: Vikas Shivappa
To: Marcelo Tosatti
cc: Vikas Shivappa, linux-kernel@vger.kernel.org, vikas.shivappa@intel.com,
    x86@kernel.org, hpa@zytor.com, tglx@linutronix.de, mingo@kernel.org,
    tj@kernel.org, peterz@infradead.org, matt.fleming@intel.com,
    will.auld@intel.com, glenn.p.williamson@intel.com, kanaka.d.juvva@intel.com
Subject: Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
In-Reply-To: <20150728231516.GA16204@amt.cnet>
References: <1435789270-27010-1-git-send-email-vikas.shivappa@linux.intel.com>
    <1435789270-27010-4-git-send-email-vikas.shivappa@linux.intel.com>
    <20150728231516.GA16204@amt.cnet>

On Tue, 28 Jul 2015, Marcelo Tosatti wrote:

> On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote:
>> Adds a description of Cache Allocation Technology, an overview of the
>> kernel implementation, and usage of the Cache Allocation cgroup interface.
>>
>> Cache Allocation is a sub-feature of Resource Director Technology (RDT)
>> Allocation, or Platform Shared Resource control, which provides support
>> to control platform shared resources like the L3 cache. Currently the L3
>> cache is the only resource that is supported in RDT.
>> More information can be found in the Intel SDM, Volume 3, section 17.15.
>>
>> Cache Allocation Technology provides a way for software (OS/VMM)
>> to restrict cache allocation to a defined 'subset' of the cache, which
>> may overlap with other 'subsets'. This feature is used when allocating
>> a line in the cache, i.e. when pulling new data into the cache.
>>
>> Signed-off-by: Vikas Shivappa
>> ---
>>  Documentation/cgroups/rdt.txt | 215 ++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 215 insertions(+)
>>  create mode 100644 Documentation/cgroups/rdt.txt
>>
>> diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
>> new file mode 100644
>> index 0000000..dfff477
>> --- /dev/null
>> +++ b/Documentation/cgroups/rdt.txt
>> @@ -0,0 +1,215 @@
>> +        RDT
>> +        ---
>> +
>> +Copyright (C) 2014 Intel Corporation
>> +Written by vikas.shivappa@linux.intel.com
>> +(based on contents and format from cpusets.txt)
>> +
>> +CONTENTS:
>> +=========
>> +
>> +1. Cache Allocation Technology
>> +  1.1 What is RDT and Cache Allocation?
>> +  1.2 Why is Cache Allocation needed?
>> +  1.3 Cache Allocation implementation overview
>> +  1.4 Assignment of CBM and CLOS
>> +  1.5 Scheduling and Context Switch
>> +2. Usage Examples and Syntax
>> +
>> +1. Cache Allocation Technology (Cache Allocation)
>> +=================================================
>> +
>> +1.1 What is RDT and Cache Allocation
>> +------------------------------------
>> +
>> +Cache Allocation is a sub-feature of Resource Director Technology (RDT)
>> +Allocation, or Platform Shared Resource control, which provides support
>> +to control platform shared resources like the L3 cache. Currently the L3
>> +cache is the only resource that is supported in RDT. More information can
>> +be found in the Intel SDM, Volume 3, section 17.15.
>> +
>> +Cache Allocation Technology provides a way for software (OS/VMM)
>> +to restrict cache allocation to a defined 'subset' of the cache, which
>> +may overlap with other 'subsets'.
>> +This feature is used when allocating a line in the cache, i.e. when
>> +pulling new data into the cache. The hardware is programmed via MSRs.
>> +
>> +The different cache subsets are identified by a CLOS identifier (class
>> +of service), and each CLOS has a CBM (cache bit mask). The CBM is a
>> +contiguous set of bits which defines the amount of cache resource that
>> +is available for each 'subset'.
>> +
>> +1.2 Why is Cache Allocation needed
>> +----------------------------------
>> +
>> +In today's processors the number of cores is continuously increasing,
>> +especially in large-scale usage models where VMs are used, like
>> +web servers and datacenters. More cores increase the number of threads
>> +or workloads that can run simultaneously. When multi-threaded
>> +applications, VMs and other workloads run concurrently, they compete
>> +for shared resources, including the L3 cache.
>> +
>> +Cache Allocation enables more cache resources to be made available
>> +to higher-priority applications based on guidance from the execution
>> +environment.
>> +
>> +The architecture also allows these subsets to be changed dynamically at
>> +runtime to further optimize the performance of the higher-priority
>> +application with minimal degradation to the low-priority applications.
>> +Additionally, resources can be rebalanced for system throughput benefit.
>> +
>> +This technique may be useful in managing large computer systems with
>> +a large L3 cache, for example large servers running instances of
>> +web servers or database servers. In such complex systems, these subsets
>> +can be used for more careful placement of the available cache
>> +resources.
>> +
>> +1.3 Cache Allocation implementation overview
>> +--------------------------------------------
>> +
>> +The kernel implements a cgroup subsystem to support Cache Allocation.
>> +
>> +Each cgroup has a CLOSid <-> CBM (cache bit mask) mapping.
>> +A CLOS (class of service) is represented by a CLOSid. The CLOSid is
>> +internal to the kernel and not exposed to the user. Each cgroup has
>> +one CBM and represents exactly one cache 'subset'.
>> +
>> +The cgroup follows the cgroup hierarchy; mkdir and adding tasks to the
>> +cgroup never fail. When a child cgroup is created, it inherits the
>> +CLOSid and the CBM from its parent. When a user changes the default
>> +CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
>> +used before. Changing 'l3_cache_mask' may fail with -ENOSPC once the
>> +kernel runs out of the maximum number of CLOSids it can support.
>> +Users can create as many cgroups as they want, but the number of
>> +distinct CBMs in use at the same time is limited by the maximum number
>> +of CLOSids (multiple cgroups can share the same CBM).
>> +The kernel maintains a CLOSid <-> CBM mapping which keeps a reference
>> +count of the cgroups using each CLOSid.
>> +
>> +The tasks in the cgroup can fill the portion of the L3 cache
>> +represented by the cgroup's 'l3_cache_mask' file.
>> +
>> +The root directory has all available bits set in its 'l3_cache_mask'
>> +file by default.
>> +
>> +Each RDT cgroup directory has the following files. Some of them may be
>> +part of the common RDT framework or be specific to RDT sub-features like
>> +Cache Allocation.
>> +
>> + - intel_rdt.l3_cache_mask: The cache bitmask (CBM) is represented by
>> +   this file. The bitmask must be contiguous and has a minimum length
>> +   of 1 or 2 bits, depending on the processor.
>> +
>> +1.4 Assignment of CBM and CLOS
>> +------------------------------
>> +
>> +The 'l3_cache_mask' needs to be a subset of the parent node's
>> +'l3_cache_mask'. Any contiguous subset of these bits (with a minimum of
>> +2 bits on Haswell (hsw) SKUs) may be set to indicate the desired cache
>> +mapping. The 'l3_cache_mask' of two directories can overlap. The
>> +'l3_cache_mask' represents the cache 'subset' of the Cache Allocation
>> +cgroup.
>> +For example, on a system with a maximum CBM length of 16 bits, if the
>> +directory has the least significant 4 bits set in its 'l3_cache_mask'
>> +file (meaning the 'l3_cache_mask' is just 0xf), it is allocated the
>> +right quarter of the last level cache, which means the tasks belonging
>> +to this Cache Allocation cgroup can fill the right quarter of the
>> +cache. If it has the most significant 8 bits set, it is allocated the
>> +left half of the cache (8 bits out of 16 represents 50%).
>> +
>> +The cache portion defined in the CBM file is available to all tasks
>> +within the cgroup to fill, and these tasks are not allowed to allocate
>> +space in other parts of the cache.
>> +
>> +1.5 Scheduling and Context Switch
>> +---------------------------------
>> +
>> +During a context switch, the kernel writes the CLOSid (internally
>> +maintained by the kernel) of the cgroup to which the task belongs to
>> +the CPU's IA32_PQR_ASSOC MSR. The MSR is only written when the CLOSid
>> +for the CPU changes, in order to minimize the latency incurred during
>> +the context switch.
>> +
>> +The following measures keep the impact of the PQR MSR write on the
>> +scheduling hot path minimal:
>> +- This path doesn't exist on any non-Intel platforms.
>> +- On Intel platforms, this path does not exist unless CGROUP_RDT
>> +is enabled.
>> +- It remains a no-op when CGROUP_RDT is enabled but the Intel hardware
>> +does not support the feature.
>> +- When the feature is available, it still remains a no-op until the
>> +user manually creates a cgroup *and* assigns a new cache mask. Since
>> +the child node inherits the parent's cache mask, creating a cgroup
>> +alone has no impact on the scheduling hot path.
>> +- Per-CPU PQR values are cached, and the MSR is only written when
>> +a task with a different PQR value is scheduled on the CPU.
>> +Typically, if the task groups are bound to a set of CPUs, the number
>> +of MSR writes is greatly reduced.
>> +
>> +2. Usage examples and syntax
>> +============================
>> +
>> +To check if Cache Allocation was enabled on your system:
>> +
>> +  dmesg | grep -i intel_rdt
>> +
>> +This should output "intel_rdt: Max bitmask length: xx,Max ClosIds: xx".
>> +The maximum length of l3_cache_mask and the number of CLOSids depend on
>> +the system you use.
>> +
>> +Also, /proc/cpuinfo shows the rdt flag (if RDT is enabled) and cat_l3
>> +(if L3 Cache Allocation is enabled).
>> +
>> +The following mounts the Cache Allocation cgroup subsystem and creates
>> +2 directories. Please refer to Documentation/cgroups/cgroups.txt for
>> +details about how to use cgroups.
>> +
>> +  cd /sys/fs/cgroup
>> +  mkdir rdt
>> +  mount -t cgroup -ointel_rdt intel_rdt /sys/fs/cgroup/rdt
>> +  cd rdt
>> +
>> +Create 2 rdt cgroups:
>> +
>> +  mkdir group1
>> +  mkdir group2
>> +
>> +Following are some of the files in the directory:
>> +
>> +  ls
>> +  intel_rdt.l3_cache_mask
>> +  tasks
>> +
>> +Say the cache is 2MB and the CBM supports 16 bits. Then the setting
>> +below allocates the 'right 1/4th' (512KB) of the cache to group2.
>> +
>> +Edit the CBM for group2 to set the least significant 4 bits. This
>> +allocates the 'right quarter' of the cache.
>> +
>> +  cd group2
>> +  /bin/echo 0xf > intel_rdt.l3_cache_mask
>> +
>> +Edit the CBM for group2 to set the least significant 8 bits. This
>> +allocates the right half of the cache to 'group2' (the shell is
>> +already in the group2 directory).
>> +
>> +  /bin/echo 0xff > intel_rdt.l3_cache_mask
>> +
>> +Assign tasks to group2:
>> +
>> +  /bin/echo PID1 > tasks
>> +  /bin/echo PID2 > tasks
>> +
>> +  Now threads PID1 and PID2 can fill the 'right half' of
>> +  the cache, as they belong to cgroup group2.
>> +
>> +Create a group under group2 (from inside the group2 directory):
>> +
>> +  mkdir group21
>> +  cd group21
>> +  cat intel_rdt.l3_cache_mask
>> +   0xff  (inherits the parent's mask)
>> +
>> +  /bin/echo 0xfff > intel_rdt.l3_cache_mask
>> +  (fails, as the mask has to be a subset of the parent's mask)
>> +
>> +To restrict RDT cgroups to a specific set of CPUs, rdt can be
>> +co-mounted with cpusets.
>> --
>> 1.9.1
>
> Vikas,
>
> Can you give an example of comounting with cpusets? What do you mean by
> restrict RDT cgroups to specific set of CPUs?

I was going to edit the documentation soon, as I see a lot of feedback
on this. It may have caused confusion. I just mean pinning tasks to a
set of CPUs. This does not mean we make the cache exclusive to the
tasks.

>
> Another limitation of this interface is that it assumes the
> task <-> control group assignment is pertinent, that is:
>
> | taskgroup, L3 policy|:
>
> | taskgroupA, 50% L3 exclusive |,
> | taskgroupB, 50% L3 |,
> | taskgroupC, 50% L3 |.
>
> Whenever taskgroup A is empty (that is no runnable task in it), you waste 50% of
> L3 cache.

Cgroup masks can always overlap, so there is no exclusive cache
allocation.

>
> I think this problem and the similar problem of L3 reservation with CPU
> isolation can be solved in this way: whenever a task from cgroupE with exclusive way
> access is migrated to a new die, impose the exclusivity (by removing
> access to that way by other cgroups).
>
> Whenever cgroupE has zero tasks, remove exclusivity (by allowing
> other cgroups to use the exclusive ways of it).

Same comment as above: cgroup masks can always overlap and other
cgroups can allocate the same cache, so there is no exclusive cache
allocation. Naturally, the cgroup with tasks gets to use the cache if
it has the same mask as the others (say, representing 50% of the cache
in your example). Assuming the maximum CBM length is 8 bits:

  cgroupa - mask - 0xf
  cgroupb - mask - 0xf

Now, if cgroupa has no tasks, cgroupb naturally gets all of the cache
that mask represents.

Thanks,
Vikas

>
> I'll cook a patch.
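The mask rules in the quoted patch (a contiguous CBM, a minimum number of set bits, a child mask that must be a subset of its parent's, and a refcounted CLOSid <-> CBM mapping that fails with -ENOSPC once the CLOSids run out) can be illustrated with a small user-space C sketch. It is not the kernel implementation: every identifier below is hypothetical, MAX_CLOSIDS is an assumed value, and each CBM bit is assumed to map to an equal share of the L3 cache, as in the 16-bit examples in the patch.

```c
/*
 * Hypothetical user-space sketch of the CBM rules described above.
 * None of these names come from the kernel patch; MAX_CLOSIDS is an
 * assumed value (real hardware reports its own limit).
 */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_CLOSIDS 4 /* assumption, for illustration only */

/* Count the set bits in a mask. */
static unsigned int cbm_weight(uint32_t cbm)
{
	unsigned int n = 0;

	for (; cbm; cbm &= cbm - 1) /* clear the lowest set bit */
		n++;
	return n;
}

/* True if cbm is non-zero and its set bits form one contiguous run. */
static bool cbm_is_contiguous(uint32_t cbm)
{
	if (cbm == 0)
		return false;
	while (!(cbm & 1)) /* drop trailing zero bits */
		cbm >>= 1;
	return (cbm & (cbm + 1)) == 0; /* remaining bits are all ones */
}

/*
 * Check a mask written to a child cgroup: it must fit in max_bits, be
 * contiguous, have at least min_bits set and be a subset of the parent.
 */
static int cbm_validate(uint32_t cbm, uint32_t parent_cbm,
			unsigned int max_bits, unsigned int min_bits)
{
	uint32_t all = max_bits >= 32 ? UINT32_MAX : (1u << max_bits) - 1;

	if ((cbm & ~all) || !cbm_is_contiguous(cbm))
		return -EINVAL;
	if (cbm_weight(cbm) < min_bits || (cbm & ~parent_cbm))
		return -EINVAL;
	return 0;
}

/* Cache bytes a mask maps to, assuming every CBM bit is an equal share. */
static uint64_t cbm_cache_bytes(uint32_t cbm, unsigned int max_bits,
				uint64_t cache_bytes)
{
	return cache_bytes * cbm_weight(cbm) / max_bits;
}

/* Refcounted CLOSid <-> CBM table: cgroups with equal masks share an entry. */
struct closid_entry {
	uint32_t cbm;
	unsigned int refcount; /* number of cgroups using this CLOSid */
};

static struct closid_entry closid_table[MAX_CLOSIDS];

/* Return a CLOSid for cbm, or -ENOSPC when all CLOSids are in use. */
static int closid_get(uint32_t cbm)
{
	int free_id = -1;

	for (int i = 0; i < MAX_CLOSIDS; i++) {
		if (closid_table[i].refcount && closid_table[i].cbm == cbm) {
			closid_table[i].refcount++; /* reuse existing mask */
			return i;
		}
		if (!closid_table[i].refcount && free_id < 0)
			free_id = i;
	}
	if (free_id < 0)
		return -ENOSPC;
	closid_table[free_id].cbm = cbm;
	closid_table[free_id].refcount = 1;
	return free_id;
}

/* Drop one cgroup's reference; the entry is free again at zero. */
static void closid_put(int closid)
{
	assert(closid >= 0 && closid < MAX_CLOSIDS);
	assert(closid_table[closid].refcount > 0);
	closid_table[closid].refcount--;
}
```

For the 2MB, 16-bit example in the patch, cbm_cache_bytes(0xf, 16, 2u << 20) returns 524288 bytes (512KB), and cbm_validate(0xfff, 0xff, 16, 2) returns -EINVAL, matching the rejected write in group21. Because closid_get() reuses an entry whose mask matches exactly, any number of cgroups can exist while at most MAX_CLOSIDS distinct masks are live at once; overlapping but unequal masks still consume separate CLOSids. The linear table scan is chosen for clarity only; the kernel's actual data structures may differ.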