From: "Auld, Will"
To: "Shivappa, Vikas", Marcelo Tosatti
Cc: Vikas Shivappa, "linux-kernel@vger.kernel.org", "x86@kernel.org",
	"hpa@zytor.com", "tglx@linutronix.de", "mingo@kernel.org",
	"tj@kernel.org", "peterz@infradead.org", "Fleming, Matt",
	"Williamson, Glenn P", "Juvva, Kanaka D", "Auld, Will"
Subject: RE: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
Date: Wed, 29 Jul 2015 01:28:38 +0000
Message-ID: <96EC5A4F3149B74492D2D9B9B1602C27461EB932@ORSMSX105.amr.corp.intel.com>
References: <1435789270-27010-1-git-send-email-vikas.shivappa@linux.intel.com>
	<1435789270-27010-4-git-send-email-vikas.shivappa@linux.intel.com>
	<20150728231516.GA16204@amt.cnet>

> -----Original Message-----
> From: Shivappa, Vikas
> Sent: Tuesday, July 28, 2015 5:07 PM
> To: Marcelo Tosatti
> Cc: Vikas Shivappa; linux-kernel@vger.kernel.org; Shivappa, Vikas;
> x86@kernel.org; hpa@zytor.com; tglx@linutronix.de; mingo@kernel.org;
> tj@kernel.org; peterz@infradead.org; Fleming, Matt; Auld, Will; Williamson,
> Glenn P; Juvva, Kanaka D
> Subject: Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and
> cgroup usage guide
>
> On Tue, 28 Jul 2015, Marcelo Tosatti wrote:
>
> > On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote:
> >> Adds a description of Cache Allocation Technology, an overview of the
> >> kernel implementation, and a usage guide for the Cache Allocation
> >> cgroup interface.
> >>
> >> Cache allocation is a sub-feature of Resource Director Technology (RDT),
> >> or Platform Shared Resource Control, which provides support for
> >> controlling platform shared resources such as the L3 cache. Currently
> >> the L3 cache is the only resource supported in RDT. More information
> >> can be found in the Intel SDM, Volume 3, section 17.15.
> >>
> >> Cache Allocation Technology provides a way for software (OS/VMM) to
> >> restrict cache allocation to a defined 'subset' of the cache, which may
> >> overlap with other 'subsets'. This feature is used when allocating a
> >> line in the cache, i.e. when pulling new data into the cache.
> >>
> >> Signed-off-by: Vikas Shivappa
> >> ---
> >>  Documentation/cgroups/rdt.txt | 215 ++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 215 insertions(+)
> >>  create mode 100644 Documentation/cgroups/rdt.txt
> >>
> >> diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
> >> new file mode 100644
> >> index 0000000..dfff477
> >> --- /dev/null
> >> +++ b/Documentation/cgroups/rdt.txt
> >> @@ -0,0 +1,215 @@
> >> +				RDT
> >> +				---
> >> +
> >> +Copyright (C) 2014 Intel Corporation
> >> +Written by vikas.shivappa@linux.intel.com
> >> +(based on contents and format from cpusets.txt)
> >> +
> >> +CONTENTS:
> >> +=========
> >> +
> >> +1. Cache Allocation Technology
> >> +  1.1 What is RDT and Cache allocation ?
> >> +  1.2 Why is Cache allocation needed ?
> >> +  1.3 Cache allocation implementation overview
> >> +  1.4 Assignment of CBM and CLOS
> >> +  1.5 Scheduling and Context Switch
> >> +2. Usage Examples and Syntax
> >> +
> >> +1. Cache Allocation Technology (Cache allocation)
> >> +=================================================
> >> +
> >> +1.1 What is RDT and Cache allocation
> >> +------------------------------------
> >> +
> >> +Cache allocation is a sub-feature of Resource Director Technology (RDT),
> >> +or Platform Shared Resource Control, which provides support for
> >> +controlling platform shared resources such as the L3 cache. Currently
> >> +the L3 cache is the only resource supported in RDT. More information
> >> +can be found in the Intel SDM, Volume 3, section 17.15.
> >> +
> >> +Cache Allocation Technology provides a way for software (OS/VMM) to
> >> +restrict cache allocation to a defined 'subset' of the cache, which may
> >> +overlap with other 'subsets'. This feature is used when allocating a
> >> +line in the cache, i.e. when pulling new data into the cache. The
> >> +hardware is programmed via MSRs.
> >> +
> >> +The different cache subsets are identified by a CLOS identifier (class
> >> +of service), and each CLOS has a CBM (cache bit mask). The CBM is a
> >> +contiguous set of bits which defines the amount of cache resource
> >> +available to each 'subset'.
> >> +
> >> +1.2 Why is Cache allocation needed
> >> +----------------------------------
> >> +
> >> +In today's processors the number of cores keeps increasing, especially
> >> +in large scale usage models such as webservers and datacenters where
> >> +VMs are used. More cores mean more threads or workloads that can run
> >> +simultaneously. When multi-threaded applications, VMs and workloads run
> >> +concurrently, they compete for shared resources, including the L3 cache.
> >> +
> >> +Cache allocation enables more cache resources to be made available to
> >> +higher priority applications, based on guidance from the execution
> >> +environment.
> >> +
> >> +The architecture also allows these subsets to be changed dynamically at
> >> +runtime, to further optimize the performance of the higher priority
> >> +application with minimal degradation to the low priority applications.
> >> +Additionally, resources can be rebalanced for overall system throughput.
> >> +
> >> +This technique may be useful in managing large computer systems with
> >> +large L3 caches, for example large servers running instances of
> >> +webservers or database servers. In such complex systems, these subsets
> >> +allow more careful placement of the available cache resources.
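
As a quick, back-of-the-envelope way to see how a CBM translates into a share
of the cache (an editorial sketch, not part of the patch; it assumes a 16-bit
maximum CBM and that each set bit maps to an equal slice of L3):

  # count the set bits in a CBM and estimate the share of L3 it covers
  MASK=0xf
  MAX_CBM_BITS=16
  SET_BITS=$(echo "obase=2; $((MASK))" | bc | tr -cd '1' | wc -c)
  echo "CBM $MASK sets $SET_BITS of $MAX_CBM_BITS bits -> ~$((SET_BITS * 100 / MAX_CBM_BITS))% of L3"
  # prints: CBM 0xf sets 4 of 16 bits -> ~25% of L3

This is the same arithmetic the examples in sections 1.4 and 2 below rely on
(0xf out of 16 bits is a quarter of the cache, 0xff is half).
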
> >> +
> >> +1.3 Cache allocation implementation overview
> >> +--------------------------------------------
> >> +
> >> +The kernel implements a cgroup subsystem to support cache allocation.
> >> +
> >> +Each cgroup has a CLOSid <-> CBM (cache bit mask) mapping. A CLOS
> >> +(Class of Service) is represented by a CLOSid. The CLOSid is internal
> >> +to the kernel and not exposed to the user. Each cgroup has one CBM and
> >> +represents one cache 'subset'.
> >> +
> >> +The cgroup follows the cgroup hierarchy; mkdir and adding tasks to the
> >> +cgroup never fail. When a child cgroup is created it inherits the
> >> +CLOSid and the CBM from its parent. When a user changes the default
> >> +CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
> >> +used before. Changing 'l3_cache_mask' may fail with -ENOSPC once the
> >> +kernel runs out of CLOSids. Users can create as many cgroups as they
> >> +want, but the number of different CBMs in use at any one time is
> >> +limited by the maximum number of CLOSids (multiple cgroups can share
> >> +the same CBM). The kernel maintains a CLOSid <-> CBM mapping which
> >> +keeps a reference count of the cgroups using each CLOSid.
> >> +
> >> +The tasks in the cgroup get to fill the part of the L3 cache
> >> +represented by the cgroup's 'l3_cache_mask' file.
> >> +
> >> +The root directory has all available bits set in its 'l3_cache_mask'
> >> +file by default.
> >> +
> >> +Each RDT cgroup directory has the following files. Some of them may be
> >> +part of the common RDT framework and some may be specific to RDT
> >> +sub-features like cache allocation.
> >> +
> >> + - intel_rdt.l3_cache_mask: The cache bitmask (CBM) is represented by
> >> +   this file. The bitmask must be contiguous and has a 1 or 2 bit
> >> +   minimum length.
> >> +
> >> +1.4 Assignment of CBM and CLOS
> >> +------------------------------
> >> +
> >> +The 'l3_cache_mask' needs to be a subset of the parent node's
> >> +'l3_cache_mask'. Any contiguous subset of these bits (with a minimum
> >> +of 2 bits on HSW SKUs) may be set to indicate the desired cache
> >> +mapping. The 'l3_cache_mask' of two directories can overlap. The
> >> +'l3_cache_mask' represents the cache 'subset' of the Cache allocation
> >> +cgroup. For example, on a system with a 16-bit maximum CBM, if a
> >> +directory has the least significant 4 bits set in its 'l3_cache_mask'
> >> +file (meaning the 'l3_cache_mask' is just 0xf), it is allocated the
> >> +right quarter of the last level cache, which means the tasks belonging
> >> +to this Cache allocation cgroup can fill the right quarter of the
> >> +cache. If it has the most significant 8 bits set, it is allocated the
> >> +left half of the cache (8 bits out of 16 represent 50%).
> >> +
> >> +The cache portion defined in the CBM file is available to all tasks
> >> +within the cgroup to fill, and these tasks are not allowed to allocate
> >> +space in other parts of the cache.
> >> +
> >> +1.5 Scheduling and Context Switch
> >> +---------------------------------
> >> +
> >> +During a context switch the kernel writes the CLOSid (internally
> >> +maintained by the kernel) of the cgroup to which the task belongs into
> >> +the CPU's IA32_PQR_ASSOC MSR. The MSR is only written when the CLOSid
> >> +for the CPU changes, in order to minimize the latency incurred during
> >> +context switch.
> >> +
> >> +The following considerations are made for the PQR MSR write so that
> >> +it has minimal impact on the scheduling hot path:
> >> +- This path does not exist on non-Intel platforms.
> >> +- On Intel platforms, it does not exist by default unless CGROUP_RDT
> >> +  is enabled.
> >> +- It remains a no-op when CGROUP_RDT is enabled but the Intel hardware
> >> +  does not support the feature.
> >> +- When the feature is available, it still remains a no-op until the
> >> +  user manually creates a cgroup *and* assigns a new cache mask. Since
> >> +  the child node inherits the parent's cache mask, cgroup creation by
> >> +  itself has no scheduling hot path impact.
> >> +- Per-CPU PQR values are cached and the MSR write is only done when a
> >> +  task with a different PQR is scheduled on the CPU. Typically, if the
> >> +  task groups are bound to be scheduled on a set of CPUs, the number
> >> +  of MSR writes is greatly reduced.
> >> +
> >> +2. Usage examples and syntax
> >> +============================
> >> +
> >> +To check if Cache allocation is enabled on your system:
> >> +
> >> +  dmesg | grep -i intel_rdt
> >> +
> >> +should output: intel_rdt: Max bitmask length: xx, Max ClosIds: xx
> >> +The bitmask length and the number of CLOSids depend on the system you
> >> +use.
> >> +
> >> +Also, /proc/cpuinfo would show 'rdt' (if RDT is enabled) and 'cat_l3'
> >> +(if L3 cache allocation is enabled).
> >> +
> >> +The following mounts the cache allocation cgroup subsystem and creates
> >> +two directories. Please refer to Documentation/cgroups/cgroups.txt for
> >> +details about how to use cgroups.
> >> +
> >> +  cd /sys/fs/cgroup
> >> +  mkdir rdt
> >> +  mount -t cgroup -ointel_rdt intel_rdt /sys/fs/cgroup/rdt
> >> +  cd rdt
> >> +
> >> +Create 2 rdt cgroups:
> >> +
> >> +  mkdir group1
> >> +  mkdir group2
> >> +
> >> +Following are some of the files in the directory:
> >> +
> >> +  ls
> >> +  rdt.l3_cache_mask
> >> +  tasks
> >> +
> >> +Say the cache is 2MB and the CBM supports 16 bits; then setting the
> >> +below allocates the 'right quarter' (512KB) of the cache to group2.
> >> +
> >> +Edit the CBM for group2 to set the least significant 4 bits. This
> >> +allocates the 'right quarter' of the cache.
> >> +
> >> +  cd group2
> >> +  /bin/echo 0xf > rdt.l3_cache_mask
> >> +
> >> +Edit the CBM for group2 to set the least significant 8 bits. This
> >> +allocates the right half of the cache to 'group2'.
> >> +
> >> +  cd group2
> >> +  /bin/echo 0xff > rdt.l3_cache_mask
> >> +
> >> +Assign tasks to group2:
> >> +
> >> +  /bin/echo PID1 > tasks
> >> +  /bin/echo PID2 > tasks
> >> +
> >> +Now threads PID1 and PID2 get to fill the 'right half' of the cache as
> >> +they belong to cgroup group2.
> >> +
> >> +Create a group under group2:
> >> +
> >> +  cd group2
> >> +  mkdir group21
> >> +  cat rdt.l3_cache_mask
> >> +  0xff - inherits the parent's mask.
> >> +
> >> +  /bin/echo 0xfff > rdt.l3_cache_mask - throws an error as the mask
> >> +  has to be a subset of the parent's mask.
> >> +
> >> +In order to restrict RDT cgroups to a specific set of CPUs, rdt can be
> >> +co-mounted with cpusets.
> >> --
> >> 1.9.1
> >
> > Vikas,
> >
> > Can you give an example of comounting with cpusets? What do you mean
> > by restrict RDT cgroups to specific set of CPUs?
>
> I was going to edit the documentation soon, as I see a lot of feedback on
> the same. It may have caused confusion.
>
> I mean just pinning down the tasks to a set of CPUs. This does not mean we
> make the cache exclusive to the tasks.
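
To make the co-mount Marcelo asked about concrete, here is an editorial sketch
(not from the thread) using the standard cgroup v1 co-mount syntax and the
controller names used in this series; the mask file is written here as
intel_rdt.l3_cache_mask, following section 1.3 of the patch (section 2 of the
patch abbreviates it to rdt.l3_cache_mask):

  # co-mount cpuset and intel_rdt on one hierarchy
  mkdir /sys/fs/cgroup/rdt_cpuset
  mount -t cgroup -o cpuset,intel_rdt rdt_cpuset /sys/fs/cgroup/rdt_cpuset
  cd /sys/fs/cgroup/rdt_cpuset
  mkdir group1
  # cpuset requires cpus and mems to be populated before tasks can be added
  /bin/echo 0-3 > group1/cpuset.cpus
  /bin/echo 0   > group1/cpuset.mems
  # give group1 a quarter of the cache and pin its tasks to CPUs 0-3
  /bin/echo 0xf  > group1/intel_rdt.l3_cache_mask
  /bin/echo PID1 > group1/tasks

As Vikas notes, this only pins where the tasks run; it does not make the 0xf
portion of the cache exclusive to them, since other cgroups' masks may still
overlap it.
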
> >
> > Another limitation of this interface is that it assumes the task <->
> > control group assignment is pertinent, that is:
> >
> > | taskgroup,  L3 policy        |:
> >
> > | taskgroupA, 50% L3 exclusive |,
> > | taskgroupB, 50% L3           |,
> > | taskgroupC, 50% L3           |.
> >
> > Whenever taskgroupA is empty (that is, no runnable task in it), you
> > waste 50% of the L3 cache.
>
> Cgroup masks can always overlap, and hence won't give exclusive cache
> allocation.
>
> > I think this problem, and the similar problem of L3 reservation with
> > CPU isolation, can be solved in this way: whenever a task from cgroupE
> > with exclusive way access is migrated to a new die, impose the
> > exclusivity (by removing access to that way by other cgroups).
> >
> > Whenever cgroupE has zero tasks, remove exclusivity (by allowing other
> > cgroups to use its exclusive ways).
>
> Same comment as above - cgroup masks can always overlap and other cgroups
> can allocate the same cache, and hence won't have exclusive cache
> allocation.

[Auld, Will] You can define all the CBMs so as to provide one CLOS with an
exclusive area.

> So naturally the cgroup with tasks would get to use the cache if it has
> the same mask (say, representing 50% of the cache in your example) as the
> others.

[Auld, Will] Automatic adjustment of the CBM makes me nervous. There are
times when we want to limit the cache for a process independent of whether
there is lots of unused cache.

> (Assume there are 8 max CBM bits.)
>   cgroupa - mask - 0xf
>   cgroupb - mask - 0xf
> Now if cgroupa has no tasks, cgroupb naturally gets all of that cache.
>
> Thanks,
> Vikas
>
> > I'll cook a patch.
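
To make Will's point about exclusivity concrete, here is an editorial sketch
(not from the thread), assuming an 8-bit maximum CBM as in Vikas's example and
the same intel_rdt.l3_cache_mask file name as in the earlier sketch: carve the
cache so one group owns the low 4 bits and everything else is confined to the
high 4 bits.

  cd /sys/fs/cgroup/rdt
  mkdir exclusive rest
  /bin/echo 0x0f > exclusive/intel_rdt.l3_cache_mask
  /bin/echo 0xf0 > rest/intel_rdt.l3_cache_mask
  # move the high priority workload into 'exclusive' and every other task of
  # interest into 'rest'; tasks left in the root cgroup (default mask 0xff)
  # can still fill the whole cache, so the low 4 bits are only effectively
  # exclusive if nothing cache-hungry remains there
  /bin/echo HIGH_PRIO_PID > exclusive/tasks

Here HIGH_PRIO_PID is a placeholder for the PID of the workload being
protected. Nothing in the interface enforces the exclusivity; it holds only
as long as no other cgroup is given a mask overlapping 0x0f.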